Join us this month to learn about AWS services and solutions. New this month, we have a fireside chat with the GM of Amazon WorkSpaces and our 2nd episode of the “How to re:Invent” series. We’ll also cover best practices, deep dives, use cases and more! Join us and register today!
AWS re:Invent
June 13, 2018 | 05:00 PM – 05:30 PM PT – Episode 2: AWS re:Invent Breakout Content Secret Sauce – Hear from one of our own AWS content experts as we dive deep into the re:Invent content strategy and how we maintain a high bar.
Compute
Containers
June 25, 2018 | 09:00 AM – 09:45 AM PT – Running Kubernetes on AWS – Learn about the basics of running Kubernetes on AWS, including how to set up masters, networking, and security, and how to add auto-scaling to your cluster.
June 19, 2018 | 11:00 AM – 11:45 AM PT – Launch AWS Faster using Automated Landing Zones – Learn how the AWS Landing Zone can automate the setup of best-practice baselines when setting up new AWS environments.
June 21, 2018 | 01:00 PM – 01:45 PM PT – Enabling New Retail Customer Experiences with Big Data – Learn how AWS can help retailers realize actual value from their big data and deliver on differentiated retail customer experiences.
June 28, 2018 | 01:00 PM – 01:45 PM PT – Fireside Chat: End User Collaboration on AWS – Learn how End User Compute services can help you deliver access to desktops and applications anywhere, anytime, using any device.
IoT
June 27, 2018 | 11:00 AM – 11:45 AM PT – AWS IoT in the Connected Home – Learn how to use AWS IoT to build innovative Connected Home products.
Mobile
June 25, 2018 | 11:00 AM – 11:45 AM PT – Drive User Engagement with Amazon Pinpoint – Learn how Amazon Pinpoint simplifies and streamlines effective user engagement.
June 26, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway – Learn how you can reduce your on-premises infrastructure by using the AWS Storage Gateway to connect your applications to the scalable and reliable AWS storage services.
June 27, 2018 | 01:00 PM – 01:45 PM PT – Changing the Game: Extending Compute Capabilities to the Edge – Discover how to change the game for IIoT and edge analytics applications with AWS Snowball Edge plus enhanced Compute instances.
June 28, 2018 | 11:00 AM – 11:45 AM PT – Big Data and Analytics Workloads on Amazon EFS – Get best practices and deployment advice for running big data and analytics workloads on Amazon EFS.
Many of my colleagues are fortunate to be able to spend a good part of their day sitting down with and listening to our customers, doing their best to understand ways that we can better meet their business and technology needs. This information is treated with extreme care and is used to drive the roadmap for new services and new features.
AWS customers in the financial services industry (often abbreviated as FSI) are looking ahead to the Fundamental Review of the Trading Book (FRTB) regulations that will come into effect between 2019 and 2021. Among other things, these regulations mandate a new approach to the “value at risk” calculations that each financial institution must perform in the four-hour time window after trading ends in New York and begins in Tokyo. Today, our customers report that this mission-critical calculation consumes on the order of 200,000 vCPUs, growing to between 400K and 800K vCPUs in order to meet the FRTB regulations. While there’s still some debate about the magnitude and frequency with which they’ll need to run this expanded calculation, the overall direction is clear.
Building a Big Grid
In order to make sure that we are ready to help our FSI customers meet these new regulations, we worked with TIBCO to set up and run a proof of concept grid in the AWS Cloud. The periodic nature of the calculation, along with the amount of processing power and storage needed to run it to completion within four hours, make it a great fit for an environment where a vast amount of cost-effective compute power is available on an on-demand basis.
Our customers are already using the TIBCO GridServer on-premises and want to use it in the cloud. This product is designed to run grids at enterprise scale. It runs apps in a virtualized fashion, and accepts requests for resources, dynamically provisioning them on an as-needed basis. The cloud version supports Amazon Linux as well as the PostgreSQL-compatible edition of Amazon Aurora.
Working together with TIBCO, we set out to create a grid that was substantially larger than the current high-end prediction of 800K vCPUs, adding a 50% safety factor and then rounding up to reach 1.3 million vCPUs (5x the size of the largest on-premises grid). With that target in mind, the account limits were raised as follows:
Spot Instance Limit – 120,000
EBS Volume Limit – 120,000
EBS Capacity Limit – 2 PB
If you plan to create a grid of this size, you should also bring your friendly local AWS Solutions Architect into the loop as early as possible. They will review your plans, provide you with architecture guidance, and help you to schedule your run.
Running the Grid
We hit the Go button and launched the grid, watching as it bid for and obtained Spot Instances, each of which booted, initialized, and joined the grid within two minutes. The test workload used the Strata open source analytics & market risk library from OpenGamma and was set up with their assistance.
The grid grew to 61,299 Spot Instances (1.3 million vCPUs drawn from 34 instance types spanning 3 generations of EC2 hardware) as planned, with just 1,937 instances reclaimed and automatically replaced during the run, and cost $30,000 per hour to run, at an average hourly cost of $0.078 per vCPU. If the same instances had been used in On-Demand form, the hourly cost to run the grid would have been approximately $93,000.
Despite the scale of the grid, prices for the EC2 instances did not move during the bidding process. This is due to the overall size of the AWS Cloud and the smooth price change model that we launched late last year.
To give you a sense of the compute power, we computed that this grid would have taken the #1 position on the TOP 500 supercomputer list in November 2007 by a considerable margin, and the #2 position in June 2008. Today, it would occupy position #360 on the list.
I hope that you enjoyed this AWS success story, and that it gives you an idea of the scale that you can achieve in the cloud!
We have several upcoming tech talks in the month of April and early May. Come join us to learn about AWS services and solution offerings. We’ll have AWS experts online to help answer questions in real time. Sign up now to learn more; we look forward to seeing you.
Note – All sessions are free and in Pacific Time.
April & early May — 2018 Schedule
Compute
April 30, 2018 | 01:00 PM – 01:45 PM PT – Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR (300) – Learn about the best practices for scaling big data workloads as well as process, store, and analyze big data securely and cost effectively with Amazon EMR and Amazon EC2 Spot Instances.
May 1, 2018 | 01:00 PM – 01:45 PM PT – How to Bring Microsoft Apps to AWS (300) – Learn more about how to save significant money by bringing your Microsoft workloads to AWS.
May 2, 2018 | 01:00 PM – 01:45 PM PT – Deep Dive on Amazon EC2 Accelerated Computing (300) – Get a technical deep dive on how AWS’ GPU and FPGA-based compute services can help you to optimize and accelerate your ML/DL and HPC workloads in the cloud.
April 25, 2018 | 01:00 PM – 01:45 PM PT – Intro to Open Source Databases on AWS (200) – Learn how to tap the benefits of open source databases on AWS without the administrative hassle.
April 24, 2018 | 11:00 AM – 11:45 AM PT – Deploy your Desktops and Apps on AWS (300) – Learn how to deploy your desktops and apps on AWS with Amazon WorkSpaces and Amazon AppStream 2.0
IoT
May 2, 2018 | 11:00 AM – 11:45 AM PT – How to Easily and Securely Connect Devices to AWS IoT (200) – Learn how to easily and securely connect devices to the cloud and reliably scale to billions of devices and trillions of messages with AWS IoT.
April 30, 2018 | 11:00 AM – 11:45 AM PT – Offline GraphQL Apps with AWS AppSync (300) – Come learn how to enable real-time and offline data in your applications with GraphQL using AWS AppSync.
Networking
May 2, 2018 | 09:00 AM – 09:45 AM PT – Taking Serverless to the Edge (300) – Learn how to run your code closer to your end users in a serverless fashion. Also, David Von Lehman from Aerobatic will discuss how they used Lambda@Edge to reduce latency and cloud costs for their customer’s websites.
May 3, 2018 | 09:00 AM – 09:45 AM PT – Protect Your Game Servers from DDoS Attacks (200) – Learn how to use the new AWS Shield Advanced for EC2 to protect your internet-facing game servers against network layer DDoS attacks and application layer attacks of all kinds.
May 1, 2018 | 11:00 AM – 11:45 AM PT – Building Data Lakes That Cost Less and Deliver Results Faster (300) – Learn how Amazon S3 Select and Amazon Glacier Select increase application performance by up to 400% and reduce total cost of ownership by extending your data lake into cost-effective archive storage.
May 3, 2018 | 11:00 AM – 11:45 AM PT – Integrating On-Premises Vendors with AWS for Backup (300) – Learn how to work with AWS and technology partners to build backup & restore solutions for your on-premises, hybrid, and cloud native environments.
Amazon EC2 Spot Instances offer spare compute capacity in the AWS Cloud at steep discounts. Customers—including Yelp, NASA JPL, FINRA, and Autodesk—use Spot Instances to reduce costs and get faster results. Spot Instances provide acceleration, scale, and deep cost savings to big data workloads, containerized applications such as web services, test/dev, and many types of HPC and batch jobs.
At re:Invent 2017, we launched a new pricing model that simplified the Spot purchasing experience. The new model gives you predictable prices that adjust slowly over days and weeks, with typical savings of 70-90% over On-Demand. With the previous pricing model, some of you had to invest time and effort to analyze historical prices to determine your bidding strategy and maximum bid price. Not anymore.
How does the new pricing model work?
With the new pricing model, you don’t have to bid for Spot Instances; you simply pay the Spot price that’s in effect for the current hour for the instances that you launch. It’s that simple. Now you can request Spot capacity just like you would request On-Demand capacity, without having to spend time analyzing market prices or setting a maximum bid price.
Previously, Spot Instances were terminated in ascending order of bids, and the Spot price was set to the highest unfulfilled bid. The market prices fluctuated frequently because of this. In the new model, the Spot prices are more predictable, updated less frequently, and are determined by supply and demand for Amazon EC2 spare capacity, not bid prices. You can find the price that’s in effect for the current hour in the EC2 console.
As you can see from the above Spot Instance Pricing History graph (available in the EC2 console under Spot Requests), Spot prices were volatile before the pricing model update. However, after the pricing model update, prices are more predictable and change less frequently.
In the new model, you still have the option to further control costs by submitting a “maximum price” that you are willing to pay in the console when you request Spot Instances:
You can also set your maximum price in EC2 RunInstances or RequestSpotFleet API calls, or in command line requests:
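For example, here is a minimal boto3 sketch of a RunInstances call that requests Spot capacity with an explicit maximum price; the AMI ID, instance type, and price cap are placeholder values, not part of the original walkthrough:

import boto3

ec2 = boto3.client('ec2')

# Request one Spot Instance and cap the price we are willing to pay per hour.
response = ec2.run_instances(
    ImageId='ami-12345678',            # placeholder AMI
    InstanceType='c5.large',           # placeholder instance type
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        'MarketType': 'spot',
        'SpotOptions': {'MaxPrice': '0.05'},   # placeholder maximum price in USD per hour
    },
)
print(response['Instances'][0]['InstanceId'])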
The default maximum price is the On-Demand price and you can continue to set a maximum Spot price of up to 10x the On-Demand price. That means, if you have been running applications on Spot Instances and use the RequestSpotInstances or RequestSpotFleet operations, you can continue to do so. The new Spot pricing model is backward compatible and you do not need to make any changes to your existing applications.
Fewer interruptions
Spot Instances receive a two-minute interruption notice when these instances are about to be reclaimed by EC2, because EC2 needs the capacity back. We have significantly reduced the interruptions with the new pricing model. Now instances are not interrupted because of higher competing bids, and you can enjoy longer workload runtimes. The typical frequency of interruption for Spot Instances in the last 30 days was less than 5% on average.
To reduce the impact of interruptions and optimize Spot Instances, diversify and run your application across multiple capacity pools. Each instance family, each instance size, in each Availability Zone, in every Region is a separate Spot pool. You can use the RequestSpotFleet API operation to launch thousands of Spot Instances and diversify resources automatically. To further reduce the impact of interruptions, you can also set up Spot Instances and Spot Fleets to respond to an interruption notice by stopping or hibernating rather than terminating instances when capacity is no longer available.
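As a rough sketch of that diversification, a Spot Fleet request spanning several instance types and subnets (Availability Zones) might look like the following boto3 call; the fleet role ARN, AMI ID, instance types, and subnet IDs are placeholders:

import boto3

ec2 = boto3.client('ec2')

# Each instance type / subnet (Availability Zone) combination is a separate Spot capacity pool.
launch_specs = [
    {'ImageId': 'ami-12345678', 'InstanceType': itype, 'SubnetId': subnet}
    for itype in ('c5.large', 'm5.large')
    for subnet in ('subnet-aaaa1111', 'subnet-bbbb2222')
]

response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        'IamFleetRole': 'arn:aws:iam::123456789012:role/spot-fleet-role',  # placeholder role
        'AllocationStrategy': 'diversified',   # spread instances across the pools above
        'TargetCapacity': 4,
        'LaunchSpecifications': launch_specs,
    },
)
print(response['SpotFleetRequestId'])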
Spot Instances are now available in 18 Regions and 51 Availability Zones, and offer 100s of instance options. We have eliminated bidding, simplified the pricing model, and have made it easy to get started with Amazon EC2 Spot Instances for you to take advantage of the largest pool of cost-effective compute capacity in the world. See the Spot Instances detail page for more information and create your Spot Instance here.
Amazon EC2 Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by Amazon EC2 with two minutes of notification when EC2 needs the capacity back.
Customers have been taking advantage of Spot Instance interruption notices available via the instance metadata service since January 2015 to orchestrate their workloads seamlessly around any potential interruptions. Examples include saving the state of a job, detaching from a load balancer, or draining containers. Needless to say, the two-minute Spot Instance interruption notice is a powerful tool when using Spot Instances.
In January 2018, the Spot Instance interruption notice also became available as an event in Amazon CloudWatch Events. This allows targets such as AWS Lambda functions or Amazon SNS topics to process Spot Instance interruption notices by creating a CloudWatch Events rule to monitor for the notice.
In this post, I walk through an example use case for taking advantage of Spot Instance interruption notices in CloudWatch Events to automatically deregister Spot Instances from an Elastic Load Balancing Application Load Balancer.
When any of the Spot Instances receives an interruption notice, Spot Fleet sends the event to CloudWatch Events. The CloudWatch Events rule then notifies both targets, the Lambda function and SNS topic. The Lambda function detaches the Spot Instance from the Application Load Balancer target group, taking advantage of nearly a full two minutes of connection draining before the instance is interrupted. The SNS topic also receives a message, and is provided as an example for the reader to use as an exercise.
To complete this walkthrough, have the AWS CLI installed and configured, as well as the ability to launch CloudFormation stacks.
Launch the stack
Go ahead and launch the CloudFormation stack. You can check it out from GitHub, or grab the template directly. In this post, I use the stack name “spot-spin-cwe”, but feel free to use any name you like. Just remember to change it in the instructions.
Here are the details of the architecture being launched by the stack.
IAM permissions
Give permissions to a few components in the architecture:
The Lambda function
The CloudWatch Events rule
The Spot Fleet
The Lambda function needs basic Lambda function execution permissions so that it can write logs to CloudWatch Logs. You can use the AWS managed policy for this. It also needs to describe EC2 tags as well as deregister targets within Elastic Load Balancing. You can create a custom policy for these.
Finally, Spot Fleet needs permissions to request Spot Instances, tag, and register targets in Elastic Load Balancing. You can tap into an AWS managed policy for this.
Because you are taking advantage of the two-minute Spot Instance notice, you can tune the Elastic Load Balancing target group deregistration timeout delay to match. When a target is deregistered from the target group, it is put into connection draining mode for the length of the timeout delay: 120 seconds to equal the two-minute notice.
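A small boto3 sketch of that tuning step, assuming a placeholder target group ARN:

import boto3

elbv2 = boto3.client('elbv2')

# Match the connection draining window to the two-minute Spot interruption notice.
elbv2.modify_target_group_attributes(
    TargetGroupArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/spot-tg/abc123',  # placeholder
    Attributes=[
        {'Key': 'deregistration_delay.timeout_seconds', 'Value': '120'},
    ],
)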
To capture the Spot Instance interruption notice being published to CloudWatch Events, create a rule with two targets: the Lambda function and the SNS topic.
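The CloudFormation stack creates this rule for you; as an illustration of what it does, the equivalent boto3 calls might look like the sketch below, with the rule name, Lambda function ARN, and SNS topic ARN as placeholders:

import json
import boto3

events = boto3.client('events')

# Match EC2 Spot Instance interruption notices published to CloudWatch Events.
events.put_rule(
    Name='spot-interruption-notice',
    EventPattern=json.dumps({
        'source': ['aws.ec2'],
        'detail-type': ['EC2 Spot Instance Interruption Warning'],
    }),
    State='ENABLED',
)

# Fan the event out to both targets: the Lambda function and the SNS topic.
# (The Lambda function also needs a resource-based permission allowing events.amazonaws.com to invoke it.)
events.put_targets(
    Rule='spot-interruption-notice',
    Targets=[
        {'Id': 'deregister-lambda', 'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:spot-deregister'},
        {'Id': 'notify-sns', 'Arn': 'arn:aws:sns:us-east-1:123456789012:spot-interruptions'},
    ],
)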
The Lambda function does the heavy lifting for you. The details of the CloudWatch event are published to the Lambda function, which then uses boto3 to make a couple of AWS API calls. The first call describes the EC2 tags for the Spot Instance, filtering on a key of “loadBalancerTargetGroup”. If this tag is found, the instance is deregistered from the target group ARN stored as the value of the tag.
import boto3

def handler(event, context):
    # CloudWatch Events delivers the interrupted instance ID and the action EC2 will take.
    instanceId = event['detail']['instance-id']
    instanceAction = event['detail']['instance-action']

    try:
        # Look up the target group ARN stored as a tag on the interrupted instance.
        ec2client = boto3.client('ec2')
        describeTags = ec2client.describe_tags(Filters=[
            {'Name': 'resource-id', 'Values': [instanceId]},
            {'Name': 'key', 'Values': ['loadBalancerTargetGroup']}
        ])
    except:
        print("No action being taken. Unable to describe tags for instance id:", instanceId)
        return

    try:
        # Deregister the instance so the target group starts connection draining.
        elbv2client = boto3.client('elbv2')
        deregisterTargets = elbv2client.deregister_targets(
            TargetGroupArn=describeTags['Tags'][0]['Value'],
            Targets=[{'Id': instanceId}]
        )
    except:
        print("No action being taken. Unable to deregister targets for instance id:", instanceId)
        return

    print("Detaching instance from target:")
    print(instanceId, describeTags['Tags'][0]['Value'], deregisterTargets, sep=",")
    return
SNS topic
Finally, you’ve created an SNS topic as an example target. For example, you could subscribe an email address to the SNS topic in order to receive email notifications when a Spot Instance interruption notice is received.
To create your Spot Fleet request, use some of the resources that the CloudFormation stack created to populate the Spot Fleet request launch configuration. You can find these values in the outputs of the CloudFormation stack:
You can confirm that the Spot Fleet request was fulfilled by checking that ActivityStatus is “fulfilled”, or by checking that FulfilledCapacity is greater than or equal to TargetCapacity, while describing the request:
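One way to run that check, sketched with boto3 and a placeholder Spot Fleet request ID:

import boto3

ec2 = boto3.client('ec2')

response = ec2.describe_spot_fleet_requests(
    SpotFleetRequestIds=['sfr-12345678-1234-1234-1234-123456789012']  # placeholder
)
config = response['SpotFleetRequestConfigs'][0]

# 'fulfilled' means the requested capacity has been launched.
print(config['ActivityStatus'])
print(config['SpotFleetRequestConfig']['FulfilledCapacity'],
      config['SpotFleetRequestConfig']['TargetCapacity'])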
In order to test, you can take advantage of the fact that any interruption action that Spot Fleet takes on a Spot Instance results in a Spot Instance interruption notice being provided. Therefore, you can simply decrease the target size of your Spot Fleet from 2 to 1. The instance that is interrupted receives the interruption notice:
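A minimal sketch of shrinking the fleet with boto3 (again with a placeholder request ID):

import boto3

ec2 = boto3.client('ec2')

# Dropping the target capacity from 2 to 1 causes Spot Fleet to interrupt one instance,
# which in turn generates a Spot Instance interruption notice for it.
ec2.modify_spot_fleet_request(
    SpotFleetRequestId='sfr-12345678-1234-1234-1234-123456789012',  # placeholder
    TargetCapacity=1,
)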
As soon as the interruption notice is published to CloudWatch Events, the Lambda function triggers and detaches the instance from the target group, effectively putting the instance in a draining state.
In conclusion, Amazon EC2 Spot Instance interruption notices are an extremely powerful tool when taking advantage of Amazon EC2 Spot Instances in your workloads, for tasks such as saving state, draining connections, and much more. I’d love to hear how you are using them in your own environment!
Chad Schmutzer Solutions Architect
Chad Schmutzer is a Solutions Architect at Amazon Web Services based in Pasadena, CA. As an extension of the Amazon EC2 Spot Instances team, Chad helps customers significantly reduce the cost of running their applications, growing their compute capacity and throughput without increasing budget, and enabling new types of cloud computing applications.
With the explosion in virtual reality (VR) technologies over the past few years, we’ve had an increasing number of customers ask us for advice and best practices around deploying their VR-based products and service offerings on the AWS Cloud. It soon became apparent that while the VR ecosystem is large in both scope and depth of types of workloads (gaming, e-medicine, security analytics, live streaming events, etc.), many of the workloads followed repeatable patterns, with storage and delivery of live and on-demand immersive video at the top of the list.
Looking at consumer trends, the desire for live and on-demand immersive video is fairly self-explanatory. VR has ushered in convenient and low-cost access for consumers and businesses to a wide variety of options for consuming content, ranging from browser playback of live and on-demand 360º video, all the way up to positional tracking systems with a high degree of immersion. All of these scenarios contain one lowest common denominator: video.
Which brings us to the topic of this post. We set out to build a solution that could support both live and on-demand events, bring with it a high degree of scalability, be flexible enough to support transformation of video if required, run at a low cost, and use open-source software to every extent possible.
In this post, we describe the reference architecture we created to solve this challenge, using Amazon EC2 Spot Instances, Amazon S3, Elastic Load Balancing, Amazon CloudFront, AWS CloudFormation, and Amazon CloudWatch, with open-source software such as NGINX, FFMPEG, and JavaScript-based client-side playback technologies. We step you through deployment of the solution and how the components work, as well as the capture, processing, and playback of the underlying live and on-demand immersive media streams.
This GitHub repository includes the source code necessary to follow along. We’ve also provided a self-paced workshop, from AWS re:Invent 2017 that breaks down this architecture even further. If you experience any issues or would like to suggest an enhancement, please use the GitHub issue tracker.
Prerequisites
As a side note, you’ll also need a few additional components to take best advantage of the infrastructure:
A camera/capture device capable of encoding and streaming RTMP video
A browser to consume the content.
You’re going to generate HTML5-compatible video (Apple HLS to be exact), but there are many other native iOS and Android options for consuming the media that you create. It’s also worth noting that your playback device should support projection of your input stream. We’ll talk more about that in the next section.
How does immersive media work?
At its core, any flavor of media, be that audio or video, can be viewed with some level of immersion. The ability to interact passively or actively with the content brings with it a further level of immersion. When you look at VR devices with rotational and positional tracking, you naturally need more than an ability to interact with a flat plane of video. The challenge for any creative thus becomes a tradeoff between immersion features (degrees of freedom, monoscopic 2D or stereoscopic 3D, resolution, framerate) and overall complexity.
Where can you start from a simple and effective point of view that enables you to build out a fairly modular solution and test it? There are a few areas where we chose to be prescriptive with our solution.
Source capture from the Ricoh Theta S
First, monoscopic 360-degree video is currently one of the most commonly consumed formats on consumer devices. We explicitly chose to focus on this format, although the infrastructure is not limited to it. More on this later.
Second, if you look at most consumer-level cameras that provide live streaming ability, and even many professional rigs, there are at least two lenses or cameras at a minimum. The figure above illustrates a single capture from a Ricoh Theta S in monoscopic 2D. The left image captures 180 degrees of the field of view, and the right image captures the other 180 degrees.
For this post, we chose a typical midlevel camera (the Ricoh Theta S), and used a laptop with open-source software (Open Broadcaster Software) to encode and stream the content. Again, the solution infrastructure is not limited to this particular brand of camera. Any camera or encoder that outputs 360º video and encodes to H264+AAC with an RTMP transport will work.
Third, capturing and streaming multiple camera feeds brings additional requirements around stream synchronization and cost of infrastructure. There is also a requirement to stitch media in real time, which can be CPU and GPU-intensive. Many devices and platforms do this either on the device, or via outboard processing that sits close to the camera location. If you stitch and deliver a single stream, you can save the costs of infrastructure and bitrate/connectivity requirements. We chose to keep these aspects on the encoder side to save on cost and reduce infrastructure complexity.
Last, the most common delivery format that requires little to no processing on the infrastructure side is equirectangular projection, as per the above figure. By stitching and unwrapping the spherical coordinates into a flat plane, you can easily deliver the video exactly as you would with any other live or on-demand stream. The only caveat is that resolution and bit rate are of utmost importance. The higher you can push these (high bit rate @ 4K resolution), the more immersive the experience is for viewers. This is due to the increase in sharpness and reduction of compression artifacts.
Knowing that we would be transcoding potentially at 4K on the source camera, but in a format that could be transmuxed without an encoding penalty on the origin servers, we implemented a pass-through for the highest bit rate, and elected to only transcode lower bitrates. This requires some level of configuration on the source encoder, but saves on cost and infrastructure. Because you can conform the source stream, you may as well take advantage of that!
For this post, we chose not to focus on ways to optimize projection. However, the reference architecture does support this with additional open source components compiled into the FFMPEG toolchain. A number of options are available to this end, such as open source equirectangular to cubic transformation filters. There is a tradeoff, however, in that reprojection implies that all streams must be transcoded.
Processing and origination stack
To get started, we’ve provided a CloudFormation template that you can launch directly into your own AWS account. We quickly review how it works, the solution’s components, key features, processing steps, and examine the main configuration files. Following this, you launch the stack, and then proceed with camera and encoder setup.
Immersive streaming reference architecture
The event encoder publishes the RTMP source to multiple origin elastic IP addresses for packaging into the HLS adaptive bitrate.
The client requests the live stream through the CloudFront CDN.
The origin responds with the appropriate HLS stream.
The edge fleet caches media requests from clients and elastically scales across both Availability Zones to meet peak demand.
CloudFront caches media at local edge PoPs to improve performance for users and reduce the origin load.
When the live event is finished, the VOD asset is published to S3. An S3 event is then published to SQS.
The encoding fleet reads messages from the SQS queue, processes the VOD clips, and stores them in the S3 bucket.
How it works
A camera captures content, and with the help of a contribution encoder, publishes a live stream in equirectangular format. The stream is encoded at a high bit rate (at least 2.5 Mbps, but typically 16+ Mbps for 4K) using H264 video and AAC audio compression codecs, and delivered to a primary origin via the RTMP protocol. Streams may transit over the internet or dedicated links to the origins. Typically, for live events in the field, internet or bonded cellular are the most widely used.
The encoder is typically configured to push the live stream to a primary URI, with the ability (depending on the source encoding software/hardware) to roll over to a backup publishing point origin if the primary fails. Because you run across multiple Availability Zones, this architecture could handle an entire zone outage with minor disruption to live events. The primary and backup origins handle the ingestion of the live stream as well as transcoding to H264+AAC-based adaptive bit rate sets. After transcode, they package the streams into HLS for delivery and create a master-level manifest that references all adaptive bit rates.
The edge cache fleet pulls segments and manifests from the active origin on demand, and supports failover from primary to backup if the primary origin fails. By adding this caching tier, you effectively separate the encoding backend tier from the cache tier that responds to client or CDN requests. In addition to origin protection, this separation allows you to independently monitor, configure, and scale these components.
Viewers can use the sample HTML5 player (or compatible desktop, iOS or Android application) to view the streams. Navigation in the 360-degree view is handled either natively via device-based gyroscope, positionally via more advanced devices such as a head mount display, or via mouse drag on the desktop. Adaptive bit rate is key here, as this allows you to target multiple device types, giving the player on each device the option of selecting an optimum stream based on network conditions or device profile.
Solution components
When you deploy the CloudFormation template, all the architecture services referenced above are created and launched. This includes:
The compute tier running on Spot Instances for the corresponding components:
the primary and backup ingest origins
the edge cache fleet
the transcoding fleet
the test source
The CloudFront distribution
S3 buckets for storage of on-demand VOD assets
An Application Load Balancer for load balancing the service
An Amazon ECS cluster and container for the test source
The template also provisions the underlying dependencies:
A VPC
Security groups
IAM policies and roles
Elastic network interfaces
Elastic IP addresses
The edge cache fleet instances need some way to discover the primary and backup origin locations. You use elastic network interfaces and elastic IP addresses for this purpose.
As each component of the infrastructure is provisioned, software required to transcode and process the streams across the Spot Instances is automatically deployed. This includes NGiNX-RTMP for ingest of live streams, FFMPEG for transcoding, NGINX for serving, and helper scripts to handle various tasks (potential Spot Instance interruptions, queueing, moving content to S3). Metrics and logs are available through CloudWatch and you can manage the deployment using the CloudFormation console or AWS CLI.
Key features include:
Live and video-on-demand recording
You’re supporting both live and on-demand. On-demand content is created automatically when the encoder stops publishing to the origin.
Cost-optimization and operating at scale using Spot Instances
Spot Instances are used exclusively for infrastructure to optimize cost and scale throughput.
Midtier caching
To protect the origin servers, the midtier cache fleet pulls, caches, and delivers to downstream CDNs.
Distribution via CloudFront or multi-CDN
The Application Load Balancer endpoint allows CloudFront or any third-party CDN to source content from the edge fleet and, indirectly, the origin.
FFMPEG + NGINX + NGiNX-RTMP
These three components form the core of the stream ingest, transcode, packaging, and delivery infrastructure, as well as the VOD-processing component for creating transcoded VOD content on-demand.
Simple deployment using a CloudFormation template
All infrastructure can be easily created and modified using CloudFormation.
Prototype player page
To provide an end-to-end experience right away, we’ve included a test player page hosted as a static site on S3. This page uses A-Frame, a cross-platform, open-source framework for building VR experiences in the browser. Though A-Frame provides many features, it’s used here to render a sphere that acts as a 3D canvas for your live stream.
Spot Instance considerations
At this stage, and before we discuss processing, it is important to understand how the architecture operates with Spot Instances.
Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand prices. Spot Instances enable you to optimize your costs on the AWS Cloud and scale your application’s throughput up to 10X for the same budget. By selecting Spot Instances, you can save up to 90% compared to On-Demand prices. This allows you to greatly reduce the cost of running the solution because, outside of S3 for storage and CloudFront for delivery, this solution is almost entirely dependent on Spot Instances for infrastructure requirements.
We also know that customers running events look to deploy streaming infrastructure at the lowest price point, so it makes sense to take advantage of it wherever possible. A potential challenge when using Spot Instances for live streaming and on-demand processing is that you need to proactively deal with potential Spot Instance interruptions. How can you best deal with this?
First, the origin is deployed in a primary/backup deployment. If a Spot Instance interruption happens on the primary origin, you can fail over to the backup with a brief interruption. Should a potential interruption not be acceptable, then either Reserved Instances or On-Demand options (or a combination) can be used at this tier.
Second, the edge cache fleet runs a job (started automatically at system boot) that periodically queries the local instance metadata to detect if an interruption is scheduled to occur. Spot Instance Interruption Notices provide a two-minute warning of a pending interruption. If you poll every 5 seconds, you have almost 2 full minutes to detach from the Load Balancer and drain or stop any traffic directed to your instance.
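A minimal sketch of such a polling job in Python, using the Spot interruption entry in the instance metadata service; the detach-and-drain step is left as a placeholder:

import time
import urllib.error
import urllib.request

METADATA_URL = 'http://169.254.169.254/latest/meta-data/spot/instance-action'

while True:
    try:
        # Returns a JSON document once an interruption is scheduled; 404 otherwise.
        with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
            print('Interruption scheduled:', resp.read().decode())
            # Placeholder: deregister this instance from the load balancer and drain traffic here.
            break
    except urllib.error.HTTPError:
        pass  # No interruption scheduled yet.
    time.sleep(5)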
Lastly, use an SQS queue when transcoding. If a transcode for a Spot Instance is interrupted, the stale item falls back into the SQS queue and is eventually re-surfaced into the processing pipeline. Only remove items from the queue after the transcoded files have been successfully moved to the destination S3 bucket.
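For example, a transcoding worker on a Spot Instance might follow a pattern like this sketch, where a message only leaves the queue after the output is safely in S3; the queue URL and the transcode step itself are placeholders:

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/vod-transcode-queue'  # placeholder

while True:
    messages = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,   # long polling
    ).get('Messages', [])

    for message in messages:
        # Placeholder: download the source from the ingest bucket, run FFMPEG,
        # and upload the adaptive bit rate output to the egress (playout) bucket.
        transcode_succeeded = True

        if transcode_succeeded:
            # Delete only after the transcoded files have been moved to S3. If the Spot
            # Instance is interrupted first, the message becomes visible again and is
            # picked up by another worker.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message['ReceiptHandle'])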
Processing
As discussed in the previous sections, you pass through the video at the highest bit rate to avoid having to increase the instance size just to transcode the 4K or similar high bit rate or high resolution content.
We’ve selected a handful of bitrates for the adaptive bit rate stack. You can customize any of these to suit the requirements for your event. The default ABR stack includes:
2160p (4K)
1080p
540p
480p
These can be modified by editing the /etc/nginx/rtmp.d/rtmp.conf NGINX configuration file on the origin or the CloudFormation template.
It’s important to understand where and how streams are transcoded. When the source high bit rate stream enters the primary or backup origin at the /live RTMP application entry point, it is recorded on stop and start of publishing. On completion, it is moved to S3 by a cleanup script, and a message is placed in your SQS queue for workers to use. These workers transcode the media and push it to a playout location bucket.
This solution uses Spot Fleet with automatic scaling to drive the fleet size. You can customize it based on CloudWatch metrics, such as simple utilization metrics to drive the size of the fleet. Why use Spot Instances for the transcode option instead of Amazon Elastic Transcoder? This allows you to implement reprojection of the input stream via FFMPEG filters in the future.
The origins handle all the heavy live streaming work. Edges only store and forward the segments and manifests, and provide scaling plus reduction of burden on the origin. This lets you customize the origin to the right compute capacity without having to rely on a ‘high watermark’ for compute sizing, thus saving additional costs.
Loopback is an important concept for the live origins. The incoming stream entering /live is transcoded by FFMPEG to multiple bit rates, which are streamed back to the same host via RTMP, on a secondary publishing point /show. The secondary publishing point is transparent to the user and encoder, but handles HLS segment generation and cleanup, and keeps a sliding window of live segments and constantly updating manifests.
Configuration
Our solution provides two key points of configuration that can be used to customize the solution to accommodate ingest, recording, transcoding, and delivery, all controlled via origin and edge configuration files, which are described later. In addition, a number of job scripts run on the instances to provide hooks into Spot Instance interruption events and the VOD SQS-based processing queue.
Origin instances
The rtmp.conf excerpt below also shows additional parameters that can be customized, such as maximum recording file size in Kbytes, HLS Fragment length, and Playlist sizes. We’ve created these in accordance with general industry best practices to ensure the reliable streaming and delivery of your content.
rtmp {
    server {
        listen 1935;
        chunk_size 4000;

        application live {
            live on;
            record all;
            record_path /var/lib/nginx/rec;
            record_max_size 128000K;
            exec_record_done /usr/local/bin/record-postprocess.sh $path $basename;
            exec /usr/local/bin/ffmpeg <…parameters…>;
        }

        application show {
            live on;
            hls on;
            ...
            hls_type live;
            hls_fragment 10s;
            hls_playlist_length 60s;
            ...
        }
    }
}
This exposes a few URL endpoints for debugging and general status. In production, you would most likely turn these off:
/stat provides a statistics endpoint accessible via any standard web browser.
/control enables control of RTMP streams and publishing points.
You also control the TTLs, as previously discussed. It’s important to note here that you are setting TTLs explicitly at the origin, instead of in CloudFront’s distribution configuration. While both are valid, this approach allows you to reconfigure and restart the service on the fly without having to push changes through CloudFront. This is useful for debugging any caching or playback issues.
record-postprocess.sh – Ensures that recorded files on the origin are well-formed, and transfers them to S3 for processing.
ffmpeg.sh – Transcodes content on the encoding fleet, pulling source media from your S3 ingress bucket, based on SQS queue entries, and pushing transcoded adaptive bit rate segments and manifests to your VOD playout egress bucket.
For more details, see the Delivery and Playback section later in this post.
Camera source
With the processing and origination infrastructure running, you need to configure your camera and encoder.
As discussed, we chose to use a Ricoh Theta S camera and Open Broadcaster Software (OBS) to stitch and deliver a stream into the infrastructure. Ricoh provides a free ‘blender’ driver, which allows you to transform, stitch, encode, and deliver both transformed equirectangular video (used for this post) and spherical (two-camera) video. The Theta provides an easy way to start capturing for under $300, and OBS is a free and open-source software application for capturing and live streaming on a budget. It is quick, cheap, and enjoys wide use by the gaming community. OBS lowers the barrier to getting started with immersive streaming.
While the resolution and bit rate of the Theta may not be 4K, it still provides us with a way to test the functionality of the entire pipeline end to end, without having to invest in a more expensive camera rig. One could also use this type of model to target smaller events, which may involve mobile devices with smaller display profiles, such as phones and potentially smaller sized tablets.
Looking for a more professional solution? Nokia, GoPro, Samsung, and many others have options ranging from $500 to $50,000. This solution is based around the Theta S capabilities, but we’d encourage you to extend it to meet your specific needs.
If your device can support equirectangular RTMP, then it can deliver media through the reference architecture (dependent on instance sizing for higher bit rate sources, of course). If additional features are required such as camera stitching, mixing, or device bonding, we’d recommend exploring a commercial solution such as Teradek Sphere.
Teradek Rig (Teradek)
Ricoh Theta (CNET)
All cameras have varied PC connectivity support. We chose the Ricoh Theta S due to the real-time video connectivity that it provides through software drivers on macOS and PC. If you plan to purchase a camera to use with a PC, confirm that it supports real-time capabilities as a peripheral device.
Encoding and publishing
Now that you have a camera, encoder, and AWS stack running, you can finally publish a live stream.
To start streaming with OBS, configure the source camera and set a publishing point. Use the RTMP application name /live on port 1935 to ingest into the primary origin’s Elastic IP address provided as the CloudFormation output: primaryOriginElasticIp.
You also need to choose a stream name or stream key in OBS. You can use any stream name, but keep the naming short and lowercase, and use only alphanumeric characters. This avoids any parsing issues on client-side player frameworks. There’s no publish point protection in your deployment, so any stream key works with the default NGiNX-RTMP configuration. For more information about stream keys, publishing point security, and extending the NGiNX-RTMP module, see the NGiNX-RTMP Wiki.
You should end up with a configuration similar to the following:
OBS Stream Settings
The Output settings dialog allows us to rescale the Video canvas and encode it for delivery to our AWS infrastructure. In the dialog below, we’ve set the Theta to encode at 5 Mbps in CBR mode using a preset optimized for low CPU utilization. We chose these settings in accordance with best practices for the stream pass-through at the origin for the initial incoming bit rate. You may notice that they largely match the FFMPEG encoding settings we use on the origin – namely constant bit rate, a single audio track, and x264 encoding with the ‘veryfast’ encoding profile.
OBS Output Settings
Live to On-Demand
As you may have noticed, an on-demand component is included in the solution architecture. When talking to customers, one frequent request that we see is that they would like to record the incoming stream with as little effort as possible.
NGINX-RTMP’s recording directives provide an easy way to accomplish this. We record any newly published stream on stream start at the primary or backup origins, using the incoming source stream, which also happens to be the highest bit rate. When the encoder stops broadcasting, NGINX-RTMP executes an exec_record_done script – record-postprocess.sh (described in the Configuration section earlier), which ensures that the content is well-formed, and then moves it to an S3 ingest bucket for processing.
Transcoding of content to make it ready for VOD as adaptive bit rate is a multi-step pipeline. First, Spot Instances in the transcoding cluster periodically poll the SQS queue for new jobs. Items on the queue are pulled off on demand by processing instances, and transcoded via FFMPEG into adaptive bit rate HLS. This allows you to also extend FFMPEG using filters for cubic and other bitrate-optimizing 360-specific transforms. Finally, transcoded content is moved from the ingest bucket to an egress bucket, making them ready for playback via your CloudFront distribution.
Separate ingest and egress by bucket to provide hard security boundaries between source recordings (which are highest quality and unencrypted), and destination derivatives (which may be lower quality and potentially require encryption). Bucket separation also allows you to order and archive input and output content using different taxonomies, which is common when moving content from an asset management and archival pipeline (the ingest bucket) to a consumer-facing playback pipeline (the egress bucket, and any other attached infrastructure or services, such as CMS, Mobile applications, and so forth).
Because streams are pushed over the internet, there is always the chance that an interruption could occur in the network path, or even at the origin side of the equation (primary to backup roll-over). Both of these scenarios could result in malformed or partial recordings being created. For the best level of reliability, encoding should always be recorded locally on-site as a precaution to deal with potential stream interruptions.
Delivery and playback
With the camera turned on and OBS streaming to AWS, the final step is to play the live stream. We’ve primarily tested the prototype player on the latest Chrome and Firefox browsers on macOS, so your mileage may vary on different browsers or operating systems. For those looking to try the livestream on Google Cardboard, or similar headsets, native apps for iOS (VRPlayer) and Android exist that can play back HLS streams.
The prototype player is hosted in an S3 bucket and can be found from the CloudFormation output clientWebsiteUrl. It requires a stream URL provided as a query parameter ?url=<stream_url> to begin playback. This stream URL is determined by the RTMP stream configuration in OBS. For example, if OBS is publishing to rtmp://x.x.x.x:1935/live/foo, the resulting playback URL would be:
https://<cloudFrontDistribution>/hls/foo.m3u8
The combined player URL and playback URL results in a path like this one:
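Using the placeholder names above, the combined URL would look something like this:

<clientWebsiteUrl>?url=https://<cloudFrontDistribution>/hls/foo.m3u8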
To assist in setup/debugging, we’ve provided a test source as part of the CloudFormation template. A color bar pattern with timecode and audio is being generated by FFmpeg running as an ECS task. Much like OBS, FFmpeg is streaming the test pattern to the primary origin over the RTMP protocol. The prototype player and test HLS stream can be accessed by opening the clientTestPatternUrl CloudFormation output link.
Test Stream Playback
What’s next?
In this post, we walked you through the design and implementation of a full end-to-end immersive streaming solution architecture. As you may have noticed, there are a number of areas this could expand into, and we intend to do this in follow-up posts around the topic of virtual reality media workloads in the cloud. We’ve identified a number of topics such as load testing, content protection, client-side metrics and analytics, and CI/CD infrastructure for 24/7 live streams. If you have any requests, please drop us a line.
We would like to extend extra-special thanks to Scott Malkie and Chad Neal for their help and contributions to this post and reference architecture.
Noted below are the upcoming scheduled live, online technical sessions being held during the month of January. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts.
New Streamlined Access
Today we are introducing a new, streamlined access model for Spot Instances. You simply indicate your desire to use Spot capacity when you launch an instance via the RunInstances function, the run-instances command, or the AWS Management Console to submit a request that will be fulfilled as long as the capacity is available. With no extra effort on your part you’ll save up to 90% off of the On-Demand price for the instance type, allowing you to boost your overall application throughput by up to 10x for the same budget. The instances that you launch in this way will continue to run until you terminate them or until EC2 needs to reclaim them for On-Demand usage. At that point the instance will be given the usual 2-minute warning and then reclaimed, making this a great fit for applications that are fault-tolerant.
Unlike the old model which required an understanding of Spot markets, bidding, and calls to a standalone asynchronous API, the new model is synchronous and as easy to use as On-Demand. Your code or your script receives an Instance ID immediately and need not check back to see if the request has been processed and accepted.
We’ve made this as clean and as simple as possible, with the expectation that it will be easy to modify many current scripts and applications to request and make use of Spot capacity. If you want to exercise additional control over your Spot instance budget, you have the option to specify a maximum price when you make a request for capacity. If you use Spot capacity to power your Amazon EMR, Amazon ECS, or AWS Batch clusters, or if you launch Spot instances by way of an AWS CloudFormation template or Auto Scaling group, you will benefit from this new model without having to make any changes.
Applications that are built around RequestSpotInstances or RequestSpotFleet will continue to work just fine with no changes. However, you now have the option to make requests that do not include the SpotPrice parameter.
Smooth Price Changes
As part of today’s launch we are also changing the way that Spot prices change, moving to a model where prices adjust more gradually, based on longer-term trends in supply and demand. As I mentioned earlier, you will continue to save an average of 70-90% off the On-Demand price, and you will continue to pay the Spot price that’s in effect for the time period your instances are running. Applications built around our Spot Fleet feature will continue to automatically diversify placement of their Spot Instances across the most cost-effective pools based on the configuration you specified when you created the fleet.
Spot in Action
To launch a Spot Instance from the command line, simply specify the Spot market:
Instance Hibernation
If you run workloads that keep a lot of state in memory, you will love this new feature!
You can arrange for instances to save their in-memory state when they are reclaimed, allowing the instances and the applications on them to pick up where they left off when capacity is once again available, just like closing and then opening your laptop. This feature works on C3, C4, and certain sizes of R3, R4, and M4 instances running Amazon Linux, Ubuntu, or Windows Server, and is supported by the EC2 Hibernation Agent.
The in-memory state is written to the root EBS volume of the instance using space that is set aside when the instance launches. The private IP address and any Elastic IP Addresses are also preserved across a stop/start cycle.
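As an illustration, a request that asks EC2 to hibernate rather than terminate a Spot Instance on interruption might look like the boto3 sketch below; the AMI ID and instance type are placeholders, and the instance must meet the hibernation prerequisites described above (supported instance family, the EC2 Hibernation Agent, and room on the root EBS volume for the in-memory state):

import boto3

ec2 = boto3.client('ec2')

# Ask EC2 to hibernate (rather than terminate) this Spot Instance when it is reclaimed.
response = ec2.run_instances(
    ImageId='ami-1a2b3c4d',           # placeholder AMI with the EC2 Hibernation Agent installed
    InstanceType='m4.large',          # placeholder instance type
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        'MarketType': 'spot',
        'SpotOptions': {
            'SpotInstanceType': 'persistent',            # stop/hibernate behaviors require a persistent request
            'InstanceInterruptionBehavior': 'hibernate',
        },
    },
)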
Contributed by Tiffany Jernigan, Developer Advocate for Amazon ECS
Get ready for takeoff!
We made sure that this year’s re:Invent is chock-full of containers: there are over 40 sessions! New to containers? No problem, we have several introductory sessions for you to dip your toes. Been using containers for years and know the ins and outs? Don’t miss our technical deep-dives and interactive chalk talks led by container experts.
If you can’t make it to Las Vegas, you can catch the keynotes and session recaps from our livestream and on Twitch.
Session types
Not everyone learns the same way, so we have multiple types of breakout content:
Birds of a Feather – An interactive discussion with industry leaders about containers on AWS.
Breakout sessions – 60-minute presentations about building on AWS. Sessions are delivered by both AWS experts and customers and span all content levels.
Workshops – 2.5-hour, hands-on sessions that teach how to build on AWS. AWS credits are provided. Bring a laptop, and have an active AWS account.
Chalk Talks – 1-hour, highly interactive sessions with a smaller audience. They begin with a short lecture delivered by an AWS expert, followed by a discussion with the audience.
Session levels
Whether you’re new to containers or you’ve been using them for years, you’ll find useful information at every level.
Introductory Sessions are focused on providing an overview of AWS services and features, with the assumption that attendees are new to the topic.
Advanced Sessions dive deeper into the selected topic. Presenters assume that the audience has some familiarity with the topic, but may or may not have direct experience implementing a similar solution.
Expert Sessions are for attendees who are deeply familiar with the topic, have implemented a solution on their own already, and are comfortable with how the technology works across multiple services, architectures, and implementations.
Session locations
All container sessions are located in the Aria Resort.
MONDAY 11/27
Breakout sessions
Level 200 (Introductory)
CON202 – Getting Started with Docker and Amazon ECS By packaging software into standardized units, Docker gives code everything it needs to run, ensuring consistency from your laptop all the way into production. But once you have your code ready to ship, how do you run and scale it in the cloud? In this session, you become comfortable running containerized services in production using Amazon ECS. We cover container deployment, cluster management, service auto-scaling, service discovery, secrets management, logging, monitoring, security, and other core concepts. We also cover integrated AWS services and supplementary services that you can take advantage of to run and scale container-based services in the cloud.
Chalk talks
Level 200 (Introductory)
CON211 – Reducing your Compute Footprint with Containers and Amazon ECS Tomas Riha, platform architect for Volvo, shows how Volvo transitioned its WirelessCar platform from using Amazon EC2 virtual machines to containers running on Amazon ECS, significantly reducing cost. Tomas dives deep into the architecture that Volvo used to achieve the migration in under four months, including Amazon ECS, Amazon ECR, Elastic Load Balancing, and AWS CloudFormation.
CON212 – Anomaly Detection Using Amazon ECS, AWS Lambda, and Amazon EMR Learn about the architecture that Cisco CloudLock uses to enable automated security and compliance checks throughout the entire development lifecycle, from the first line of code through runtime. It includes integration with IAM roles, Amazon VPC, and AWS KMS.
Level 400 (Expert)
CON410 – Advanced CICD with Amazon ECS Control Plane Mohit Gupta, product and engineering lead for Clever, demonstrates how to extend the Amazon ECS control plane to optimize management of container deployments and how the control plane can be broadly applied to take advantage of new AWS services. This includes ark, an AWS CLI-based deployment tool for Amazon ECS; Dapple, a Slack-based automation system for deployments and notifications; and Kayvee, log and event routing libraries based on Amazon Kinesis.
Workshops
Level 200 (Introductory)
CON209 – Interstella 8888: Learn How to Use Docker on AWS Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. Join this workshop to get hands-on experience with Docker as you containerize Interstella 8888’s aging monolithic application and deploy it using Amazon ECS.
CON213 – Hands-on Deployment of Kubernetes on AWS In this workshop, attendees get hands-on experience using Kubernetes and Kops (Kubernetes Operations), as described in our recent blog post. Attendees learn how to provision a cluster, assign role-based permissions and security, and launch a container. If you’re interested in learning best practices for running Kubernetes on AWS, don’t miss this workshop.
TUESDAY 11/28
Breakout Sessions
Level 200 (Introductory)
CON206 – Docker on AWS In this session, Docker Technical Staff Member Patrick Chanezon discusses how Finnish Rail, the national train system for Finland, is using Docker on Amazon Web Services to modernize their customer facing applications, from ticket sales to reservations. Patrick also shares the state of Docker development and adoption on AWS, including explaining the opportunities and implications of efforts such as Project Moby, Docker EE, and how developers can use and contribute to Docker projects.
CON208 – Building Microservices on AWS Increasingly, organizations are turning to microservices to help them empower autonomous teams, letting them innovate and ship software faster than ever before. But implementing a microservices architecture comes with a number of new challenges that need to be dealt with. Chief among these is finding an appropriate platform to help manage a growing number of independently deployable services. In this session, Sam Newman, author of Building Microservices and a renowned expert in microservices strategy, discusses strategies for building scalable and robust microservices architectures. He also tells you how to choose the right platform for building microservices, and about common challenges and mistakes organizations make when they move to microservices architectures.
Level 300 (Advanced)
CON302 – Building a CICD Pipeline for Containers on AWS Containers can make it easier to scale applications in the cloud, but how do you set up your CICD workflow to automatically test and deploy code to containerized apps? In this session, we explore how developers can build effective CICD workflows to manage their containerized code deployments on AWS.
Ajit Zadgaonkar, Director of Engineering and Operations at Edmunds, walks through best practices for CICD architectures used by his team to deploy containers. We also dive deep into topics such as how to create an accessible CICD platform and architect for safe blue/green deployments.
CON307 – Building Effective Container Images Sick of getting paged at 2am and wondering “where did all my disk space go?” New Docker users often start with a stock image in order to get up and running quickly, but this can cause problems as your application matures and scales. Creating efficient container images is important for maximizing resources and delivering critical security benefits.
In this session, AWS Sr. Technical Evangelist Abby Fuller covers how to create effective images to run containers in production. This includes an in-depth discussion of how Docker image layers work, things you should think about when creating your images, working with Amazon ECR, and mise en place for installing dependencies. Prakash Janakiraman, Co-Founder and Chief Architect at Nextdoor, discusses high-level and language-specific best practices for building images and how Nextdoor uses these practices to successfully scale their containerized services with a small team.
CON309 – Containerized Machine Learning on AWS Image recognition is a field of deep learning that uses neural networks to recognize the subject and traits for a given image. In Japan, Cookpad uses Amazon ECS to run an image recognition platform on clusters of GPU-enabled EC2 instances. In this session, hear from Cookpad about the challenges they faced building and scaling this advanced, user-friendly service to ensure high-availability and low-latency for tens of millions of users.
CON320 – Monitoring, Logging, and Debugging for Containerized Services As containers become more embedded in the platform, debugging tools, traces, and logs become increasingly important. Nare Hayrapetyan, Senior Software Engineer, and Calvin French-Owen, Senior Technical Officer for Segment, discuss the principles of monitoring and debugging containers and the tools Segment has implemented and built for logging, alerting, metric collection, and debugging of containerized services running on Amazon ECS.
Chalk Talks
Level 300 (Advanced)
CON314 – Automating Zero-Downtime Production Cluster Upgrades for Amazon ECS Containers make it easy to deploy new code into production to update the functionality of a service, but what happens when you need to update the Amazon EC2 compute instances that your containers are running on? In this talk, we’ll deep dive into how to upgrade the Amazon EC2 infrastructure underlying a live production Amazon ECS cluster without affecting service availability. Matt Callanan, Engineering Manager at Expedia, walks through Expedia’s “PRISM” project that safely relocates hundreds of tasks onto new Amazon EC2 instances with zero-downtime to applications.
CON322 – Maximizing Amazon ECS for Large-Scale Workloads Head of Mobfox DevOps, David Spitzer, shows how Mobfox used Docker and Amazon ECS to scale the Mobfox services and development teams to achieve low-latency networking and automatic scaling. This session covers Mobfox’s ecosystem architecture, comparing where it stood in 2015 with where it is today, the challenges Mobfox faced in growing their platform, and how they overcame them.
CON323 – Microservices Architectures for the Enterprise Salva Jung, Principal Engineer for Samsung Mobile, shares how Samsung Connect is architected as microservices running on Amazon ECS to securely, stably, and efficiently handle requests from millions of mobile and IoT devices around the world.
CON324 – Windows Containers on Amazon ECS Docker containers are commonly regarded as powerful and portable runtime environments for Linux code, but Docker also offers API and toolchain support for running Windows Server in containers. In this talk, we discuss the various options for running Windows-based applications in containers on AWS.
CON326 – Remote Sensing and Image Processing on AWS Learn how Encirca services by DuPont Pioneer uses Amazon ECS powered by GPU-instances and Amazon EC2 Spot Instances to run proprietary image-processing algorithms against satellite imagery. Mark Lanning and Ethan Harstad, engineers at DuPont Pioneer, show how this architecture has allowed them to process satellite imagery multiple times a day for each agricultural field in the United States in order to identify crop health changes.
Workshops
Level 300 (Advanced)
CON317 – Advanced Container Management at Catsndogs.lol Catsndogs.lol is a (fictional) company that needs help deploying and scaling its container-based application. During this workshop, attendees join the new DevOps team at CatsnDogs.lol, and help the company to manage their applications using Amazon ECS, and help release new features to make our customers happier than ever. Attendees get hands-on with service and container-instance auto-scaling, spot-fleet integration, container placement strategies, service discovery, secrets management with AWS Systems Manager Parameter Store, time-based and event-based scheduling, and automated deployment pipelines. If you are a developer interested in learning more about how Amazon ECS can accelerate your application development and deployment workflows, or if you are a systems administrator or DevOps person interested in understanding how Amazon ECS can simplify the operational model associated with running containers at scale, then this workshop is for you. You should have basic familiarity with Amazon ECS, Amazon EC2, and IAM.
Additional requirements:
The AWS CLI or AWS Tools for PowerShell installed
An AWS account with administrative permissions (including the ability to create IAM roles and policies) created at least 24 hours in advance.
WEDNESDAY 11/29
Birds of a Feather (BoF)
CON01 – Birds of a Feather: Containers and Open Source at AWS Cloud native architectures take advantage of on-demand delivery, global deployment, elasticity, and higher-level services to enable developer productivity and business agility. Open source is a core part of making cloud native possible for everyone. In this session, we welcome thought leaders from the CNCF, Docker, and AWS to discuss the cloud’s direction for growth and enablement of the open source community. We also discuss how AWS is integrating open source code into its container services and its contributions to open source projects.
Breakout Sessions
Level 300 (Advanced)
CON308 – Mastering Kubernetes on AWS Much progress has been made on how to bootstrap a cluster since Kubernetes’ first commit, and it now takes only a matter of minutes to go from zero to a running cluster on Amazon Web Services. However, evolving a simple Kubernetes architecture to be ready for production in a large enterprise can quickly become overwhelming with options for configuration and customization.
In this session, Arun Gupta, Open Source Strategist for AWS, and Raffaele Di Fazio, software engineer at leading European fashion platform Zalando, show the common practices for running Kubernetes on AWS and share insights from experience in operating tens of Kubernetes clusters in production on AWS. We cover options and recommendations on how to install and manage clusters, configure high availability, perform rolling upgrades and handle disaster recovery, as well as continuous integration and deployment of applications, logging, and security.
CON310 – Moving to Containers: Building with Docker and Amazon ECS If you’ve ever considered moving part of your application stack to containers, don’t miss this session. We cover best practices for containerizing your code, implementing automated service scaling and monitoring, and setting up automated CI/CD pipelines with fail-safe deployments. Manjeeva Silva and Thilina Gunasinghe show how McDonalds implemented their home delivery platform in four months using Docker containers and Amazon ECS to serve tens of thousands of customers.
Level 400 (Expert)
CON402 – Advanced Patterns in Microservices Implementation with Amazon ECS Scaling a microservice-based infrastructure can be challenging in terms of both technical implementation and developer workflow. In this talk, AWS Solutions Architect Pierre Steckmeyer is joined by Will McCutchen, Architect at BuzzFeed, to discuss Amazon ECS as a platform for building a robust infrastructure for microservices. We look at the key attributes of microservice architectures and how Amazon ECS supports these requirements in production, from configuration to sophisticated workload scheduling to networking capabilities to resource optimization. We also examine what it takes to build an end-to-end platform on top of the wider AWS ecosystem, and what it’s like to migrate a large engineering organization from a monolithic approach to microservices.
CON404 – Deep Dive into Container Scheduling with Amazon ECS As your application’s infrastructure grows and scales, well-managed container scheduling is critical to ensuring high availability and resource optimization. In this session, we deep dive into the challenges and opportunities around container scheduling, as well as the different tools available within Amazon ECS and AWS to carry out efficient container scheduling. We discuss patterns for container scheduling available with Amazon ECS, the Blox scheduling framework, and how you can customize and integrate third-party scheduler frameworks to manage container scheduling on Amazon ECS.
Chalk Talks
Level 300 (Advanced)
CON312 – Building a Selenium Fleet on the Cheap with Amazon ECS with Spot Fleet Roberto Rivera and Matthew Wedgwood, engineers at RetailMeNot, give a practical overview of setting up a fleet of Selenium nodes running on Amazon ECS with Spot Fleet. They discuss the challenges of running Selenium with high availability at minimum cost, using Amazon ECS container introspection to connect the Selenium Hub with its nodes.
CON315 – Virtually There: Building a Render Farm with Amazon ECS Learn how 8i Corp scales its multi-tenanted, volumetric render farm up to thousands of instances using AWS, Docker, and an API-driven infrastructure. This render farm enables them to turn the video footage from an array of synchronized cameras into a photo-realistic hologram capable of playback on a range of devices, from mobile phones to high-end head mounted displays. Join Owen Evans, VP of Engineering for 8i, as they dive deep into how 8i’s rendering infrastructure is built and maintained by just a handful of people and powered by Amazon ECS.
CON325 – Developing Microservices – from Your Laptop to the Cloud Wesley Chow, Staff Engineer at Adroll, shows how his team extends Amazon ECS by enabling local development capabilities. Hologram, Adroll’s local development program, brings the capabilities of the Amazon EC2 instance metadata service to non-EC2 hosts, so that developers can run the same software on local machines with the same credentials source as in production.
CON327 – Patterns and Considerations for Service Discovery Roven Drabo, head of cloud operations at Kaplan Test Prep, illustrates Kaplan’s complete container automation solution using Amazon ECS along with how his team uses NGINX and HashiCorp Consul to provide an automated approach to service discovery and container provisioning.
CON328 – Building a Development Platform on Amazon ECS Quinton Anderson, Head of Engineering for Commonwealth Bank of Australia, walks through how they migrated their internal development and deployment platform from Mesos/Marathon to Amazon ECS. The platform uses a custom DSL to abstract a layered application architecture, in a way that makes it easy to plug or replace new implementations into each layer in the stack.
Workshops
Level 300 (Advanced)
CON318 – Interstella 8888: Monolith to Microservices with Amazon ECS Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. Join this workshop to get hands-on experience deploying Docker containers as you break Interstella 8888’s aging monolithic application into containerized microservices. Using Amazon ECS and an Application Load Balancer, you create API-based microservices and deploy them leveraging integrations with other AWS services.
CON332 – Build a Java Spring Application on Amazon ECS This workshop teaches you how to lift and shift existing Spring and Spring Cloud applications onto the AWS platform. Learn how to build a Spring application container, understand bootstrap secrets, push container images to Amazon ECR, and deploy the application to Amazon ECS. Then, learn how to configure the deployment for production.
THURSDAY 11/30
Breakout Sessions
Level 200 (Introductory)
CON201 – Containers on AWS – State of the Union Just over four years after the first public release of Docker, and three years to the day after the launch of Amazon ECS, the use of containers has surged to run a significant percentage of production workloads at startups and enterprise organizations. Join Deepak Singh, General Manager of Amazon Container Services, as he covers the state of containerized application development and deployment trends, new container capabilities on AWS that are available now, options for running containerized applications on AWS, and how AWS customers successfully run container workloads in production.
Level 300 (Advanced)
CON304 – Batch Processing with Containers on AWS Batch processing is useful for analyzing large amounts of data, but configuring and scaling a cluster of virtual machines to process complex batch jobs can be difficult. In this talk, we show how to use containers on AWS for batch processing jobs that can scale quickly and cost-effectively. We also discuss AWS Batch, our fully managed batch-processing service. You also hear from GoPro and Here about how they use AWS to run batch processing jobs at scale, including best practices for ensuring efficient scheduling, fine-grained monitoring, compute resource automatic scaling, and security for your batch jobs.
Level 400 (Expert)
CON406 – Architecting Container Infrastructure for Security and Compliance While organizations gain agility and scalability when they migrate to containers and microservices, they also benefit from compliance and security, advantages that are often overlooked. In this session, Kelvin Zhu, lead software engineer at Okta, joins Mitch Beaumont, enterprise solutions architect at AWS, to discuss security best practices for containerized infrastructure. Learn how Okta built their development workflow with an emphasis on security through testing and automation. Dive deep into how containers enable automated security and compliance checks throughout the development lifecycle. Also understand best practices for implementing AWS security and secrets management services for any containerized service architecture.
Chalk Talks
Level 300 (Advanced)
CON329 – Full Software Lifecycle Management for Containers Running on Amazon ECS Learn how The Washington Post uses Amazon ECS to run Arc Publishing, a digital journalism platform that powers The Washington Post and a growing number of major media websites. Amazon ECS enabled The Washington Post to containerize their existing microservices architecture, avoiding a complete rewrite that would have delayed the platform’s launch by several years. In this session, Jason Bartz, Technical Architect at The Washington Post, discusses the platform’s architecture. He addresses the challenges of optimizing Arc Publishing’s workload, and managing the application lifecycle to support 2,000 containers running on more than 50 Amazon ECS clusters.
CON330 – Running Containerized HIPAA Workloads on AWS Nihar Pasala, Engineer at Aetion, discusses the Aetion Evidence Platform, a system for generating the real-world evidence used by healthcare decision makers to implement value-based care. This session discusses the architecture Aetion uses to run HIPAA workloads using containers on Amazon ECS, best practices, and learnings.
Level 400 (Expert)
CON408 – Building a Machine Learning Platform Using Containers on AWS DeepLearni.ng develops and implements machine learning models for complex enterprise applications. In this session, Thomas Rogers, Engineer for DeepLearni.ng, discusses how they worked with Scotiabank to leverage Amazon ECS, Amazon ECR, Docker, GPU-accelerated Amazon EC2 instances, and TensorFlow to develop a retail risk model that helps manage payment collections for millions of Canadian credit card customers.
Workshops
Level 300 (Advanced)
CON319 – Interstella 8888: CICD for Containers on AWS Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. Join this workshop to learn how to set up a CI/CD pipeline for containerized microservices. You get hands-on experience deploying Docker container images using Amazon ECS, AWS CloudFormation, AWS CodeBuild, and AWS CodePipeline, automating everything from code check-in to production.
FRIDAY 12/1
Breakout Sessions
Level 400 (Expert)
CON405 – Moving to Amazon ECS – the Not-So-Obvious Benefits If you ask 10 teams why they migrated to containers, you will likely get answers like ‘developer productivity’, ‘cost reduction’, and ‘faster scaling’. But teams often find there are several other ‘hidden’ benefits to using containers for their services. In this talk, Franziska Schmidt, Platform Engineer at Mapbox, and Yaniv Donenfeld from AWS discuss the obvious and not-so-obvious benefits of moving to a containerized architecture. These include using Docker and Amazon ECS to achieve shared libraries for dev teams, separating private infrastructure from shareable code, and making it easier for non-ops engineers to run services.
Chalk Talks
Level 300 (Advanced)
CON331 – Deploying a Regulated Payments Application on Amazon ECS Travelex discusses how they built an FCA-compliant international payments service using a microservices architecture on AWS. This chalk talk covers the challenges of designing and operating an Amazon ECS-based PaaS in a regulated environment using a DevOps model.
Workshops
Level 400 (Expert)
CON407 – Interstella 8888: Advanced Microservice Operations Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. In this workshop, you help Interstella 8888 build a modern microservices-based logistics system to save the company from financial ruin. We give you the hands-on experience you need to run microservices in the real world. This includes implementing advanced container scheduling and scaling to deal with variable service requests, implementing a service mesh, issue tracing with AWS X-Ray, container and instance-level logging with Amazon CloudWatch, and load testing.
Know before you go
Want to brush up on your container knowledge before re:Invent? Here are some helpful resources to get started:
When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or another tool of their choice. To do this, a module needs to periodically retrieve the RDS logs using the SDK and send them to Amazon S3. From there, you can stream them to your log aggregation tool.
One option is writing an AWS Lambda function to retrieve the log files. However, depending on the volume of log files retrieved and transferred, the function’s execution time could exceed the Lambda timeout. Another approach is launching an Amazon EC2 instance that runs this job periodically, but this would require you to run an EC2 instance continuously, which is not an optimal use of time or money.
Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this allows you to reduce costs by running containers on a fleet of Spot Instances.
In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.
Architecture
The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances, and deposits the logs into an S3 bucket.
The Amazon ECS cluster’s container instances are provisioned using Spot Fleet, which is a perfect match for a workload that needs to run whenever it can. This lowers cluster costs.
The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster.
The container image has Python code functions that make AWS API calls using boto3. The code iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose to have these logs delivered to their centralized log store. CloudWatch Events defines the schedule on which the container task is launched.
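The Python code in the container is not reproduced in this post; as a rough illustration of the same flow, here is a sketch using the AWS CLI rather than boto3 (the bucket name is a placeholder):
#!/bin/bash
# Hypothetical sketch of the log retrieval flow; the actual task uses Python with boto3.
BUCKET=my-rds-logs-bucket
# Iterate over every RDS instance in the region.
for DB in $(aws rds describe-db-instances --query 'DBInstances[].DBInstanceIdentifier' --output text); do
  # Iterate over the log files available for this instance.
  for LOG in $(aws rds describe-db-log-files --db-instance-identifier "$DB" --query 'DescribeDBLogFiles[].LogFileName' --output text); do
    # Download the log and deposit it in the S3 bucket.
    aws rds download-db-log-file-portion --db-instance-identifier "$DB" --log-file-name "$LOG" --output text | aws s3 cp - "s3://$BUCKET/$DB/$(basename "$LOG")"
  done
done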
Walkthrough
To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:
Amazon ECR repository for storing the Docker image to be used in the task definition
S3 bucket that holds the transferred logs
Task definition, with image name and S3 bucket as environment variables provided via input parameter
CloudWatch Events rule
Amazon ECS cluster
Amazon ECS container instances using Spot Fleet
IAM roles required for the container instance profiles
Before you begin
Ensure that Git, Docker, and the AWS CLI are installed on your computer.
Run the CloudFormation stack and get the names for the Amazon ECR repo and S3 bucket. In the stack, choose Outputs.
Open the ECS console and choose Repositories. The rdslogs repo has been created. Choose View Push Commands and follow the instructions to connect to the repository and push the image for the code that you built in Step 2. The screenshot shows the final result:
Associate the CloudWatch scheduled task with the created Amazon ECS Task Definition, using a new CloudWatch event rule that is scheduled to run at intervals. The following rule is scheduled to run every 15 minutes:
aws --profile default --region us-west-2 events put-rule --name demo-ecs-task-rule --schedule-expression "rate(15 minutes)"
CloudWatch Events requires IAM permissions to place a task on the Amazon ECS cluster when the rule fires, in addition to an IAM role that CloudWatch Events can assume. This is done in three steps (a sketch of the two policy documents follows these steps):
Create the IAM role to be assumed by CloudWatch Events.
aws --profile default --region us-west-2 iam create-role --role-name Test-Role --assume-role-policy-document file://event-role.json
Create the IAM policy defining the ECS cluster and task definition. You need to get these values from the CloudFormation outputs and resources.
aws --profile default --region us-west-2 iam create-policy --policy-name test-policy --policy-document file://event-policy.json
Attach the IAM policy to the role.
aws --profile default --region us-west-2 iam attach-role-policy --role-name Test-Role --policy-arn arn:aws:iam::1234567890:policy/test-policy
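The contents of event-role.json and event-policy.json are not shown here; the following is a rough sketch of what they typically contain, with placeholder ARNs that you would replace with the values from your CloudFormation outputs:
# Hypothetical sketch of event-role.json: a trust policy that lets CloudWatch Events assume the role.
cat > event-role.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "events.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
# Hypothetical sketch of event-policy.json: permission to run the task definition on the cluster.
cat > event-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecs:RunTask",
      "Resource": "arn:aws:ecs:us-west-2:123456789012:task-definition/test-cwe-blog-taskdef:*"
    }
  ]
}
EOF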
Associate the CloudWatch rule created earlier to place the task on the ECS cluster. The following command shows an example. Replace the AWS account ID and region with your settings.
aws events put-targets --rule demo-ecs-task-rule --targets "Id"="1","Arn"="arn:aws:ecs:us-west-2:12345678901:cluster/test-cwe-blog-ecsCluster-15HJFWCH4SP67","EcsParameters"={"TaskDefinitionArn"="arn:aws:ecs:us-west-2:12345678901:task-definition/test-cwe-blog-taskdef:8"},"RoleArn"="arn:aws:iam::12345678901:role/Test-Role"
{
"FailedEntries": [],
"FailedEntryCount": 0
}
That’s it. The log retrieval task now runs on the defined schedule.
To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template, and the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.
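If you prefer to kick off a test run from the CLI, a rough equivalent is the following; substitute your own cluster and task definition names from the CloudFormation outputs:
# Hypothetical test run from the CLI using the example names shown earlier.
aws --region us-west-2 ecs run-task --cluster test-cwe-blog-ecsCluster-15HJFWCH4SP67 --task-definition test-cwe-blog-taskdef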
Conclusion
In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch Events or EC2 with cron. However, sometimes the job runs longer than the Lambda execution time limit, or keeping an EC2 instance running for it is not cost-effective.
In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events.
If you have questions or suggestions, please comment below.
A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing, breaking the record for creating the largest high-performance cluster by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS region. The researchers conducted nearly half a million topic modeling experiments to study how human language is processed by computers. Topic modeling helps in discovering the underlying themes that are present across a collection of documents. Topic models are important because they are used to forecast business trends and help in making policy or funding decisions. These topic models can be run with many different parameters and the goal of the experiments is to explore how these parameters affect the model outputs.
The Experiment
Professor Amy Apon, Co-Director of the Complex Systems, Analytics and Visualization Institute at Clemson University, together with Professor Alexander Herzog and graduate students Brandon Posey and Christopher Gropp, performed the experiments in collaboration with members of the AWS team and AWS Partner Omnibond. They used software infrastructure based on CloudyCluster that provisions high performance computing clusters on dynamically allocated AWS resources using Amazon EC2 Spot Fleet. Spot Fleet is a collection of biddable Spot Instances in EC2 responsible for maintaining a target capacity specified during the request. The SLURM scheduler was used as an overlay virtual workload manager for the data analytics workflows. The team developed additional provisioning and workflow automation software, as shown below, for the design and orchestration of the experiments. This setup allowed them to evaluate various topic models on different data sets with massively parallel parameter sweeps on dynamically allocated AWS resources. The framework can easily be used beyond the current study for other scientific applications that use parallel computing.
Ramping to 1.1 Million vCPUs
The figure below shows the elastic, automatic expansion of resources as a function of time, in the US East (Northern Virginia) Region. At just after 21:40 (GMT-1) on Aug. 26, 2017, the number of vCPUs utilized was 1,119,196. Clemson researchers also took advantage of the new per-second billing for the EC2 instances that they launched. The vCPU count is comparable to the core count of the largest supercomputers in the world.
Here’s the breakdown of the EC2 instance types that they used:
Campus resources at Clemson, funded by the National Science Foundation, were used to determine an effective configuration for the AWS experiments; the AWS cloud resources then complemented the campus resources for the large-scale experiments.
Meet the Team
Here’s the team that ran the experiment (Professor Alexander Herzog, graduate students Christopher Gropp and Brandon Posey, and Professor Amy Apon):
Professor Apon said about the experiment:
I am absolutely thrilled with the outcome of this experiment. The graduate students on the project are amazing. They used resources from AWS and Omnibond and developed a new software infrastructure to perform research at a scale and time-to-completion not possible with only campus resources. Per-second billing was a key enabler of these experiments.
Boyd Wilson (CEO, Omnibond, member of the AWS Partner Network) told me:
Participating in this project was exciting, seeing how the Clemson team developed a provisioning and workflow automation tool that tied into CloudyCluster to build a huge Spot Fleet supercomputer in a single region in AWS was outstanding.
About the Experiment
The experiments test parameter combinations on a range of topics and other parameters used in the topic model. The topic model outputs are stored in Amazon S3 and are currently being analyzed. The models have been applied to 17 years of computer science journal abstracts (533,560 documents and 32,551,540 words) and full text papers from the NIPS (Neural Information Processing Systems) Conference (2,484 documents and 3,280,697 words). This study allows the research team to systematically measure and analyze the impact of parameters and model selection on model convergence, topic composition, and quality.
Looking Forward
This study constitutes an interaction between computer science, artificial intelligence, and high performance computing. Papers describing the full study are being submitted for peer-reviewed publication. I hope that you enjoyed this brief insight into the ways in which AWS is helping to break the boundaries in the frontiers of natural language processing!
— Sanjay Padhi, Ph.D, AWS Research and Technical Computing
EC2 Spot Instances give you access to spare EC2 compute capacity at up to 90% off of the On-Demand rates. Starting with the ability to request a specific number of instances of a particular size, we made Spot Instances even more useful and flexible with support for Spot Fleets and Auto Scaling Spot Fleets, allowing you to maintain any desired level of compute capacity.
EC2 users have long had the ability to stop running instances while leaving EBS volumes attached, opening the door to applications that automatically pick up where they left off when the instance starts running again.
Stop and Resume Spot Workloads
Today we are blending these two important features, allowing you to set up Spot bids and Spot Fleets that respond by stopping (rather than terminating) instances when capacity is no longer available at or below your bid price. EBS volumes attached to stopped instances remain intact, as does the EBS-backed root volume. When capacity becomes available, the instances are started and can keep on going without having to spend time provisioning applications, setting up EBS volumes, downloading data, joining network domains, and so forth.
Many AWS customers have enhanced their applications to create and make use of checkpoints, adding some resilience and gaining the ability to take advantage of EC2’s start/stop feature in the process. These customers will now be able to run these applications on Spot Instances, with savings that average 70-90%.
While the instances are stopped, you can modify the EBS Optimization, User data, Ramdisk ID, and Delete on Termination attributes. Stopped Spot Instances do not incur any charges for compute time; space for attached EBS volumes is charged at the usual rates.
Here’s how you create a Spot bid or Spot Fleet and specify the use of stop/start:
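The console screenshots aren’t reproduced here; as a rough CLI equivalent (the AMI ID, instance type, key name, and bid price are placeholders), a persistent Spot request that stops rather than terminates might look like this:
# Hypothetical sketch: a persistent Spot request whose instance is stopped, not terminated,
# when capacity is no longer available at or below the bid price.
aws ec2 request-spot-instances \
  --spot-price "0.25" \
  --type persistent \
  --instance-interruption-behavior stop \
  --launch-specification '{"ImageId": "ami-12345678", "InstanceType": "c4.xlarge", "KeyName": "my-key"}'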
Things to Know
This feature is available now and you can start using it today in all AWS Regions where Spot Instances are available. It is designed to work well in conjunction with the new per-second billing for EC2 instances and EBS volumes, with the potential for another dimension of cost savings over and above that provided by Spot Instances.
EBS volumes always exist within a particular Availability Zone (AZ). As a result, Spot and Spot Fleet requests that specify a particular AZ will always restart in that AZ.
Take care when using this feature in conjunction with Spot Fleets that have the potential to span a wide variety of instance types. Because the composition of the fleet can change over time, you need to pay attention to your account’s limits for IP addresses and EBS volumes.
I’m looking forward to hearing about the new and creative uses that you’ll come up with for this feature. If you thought that your application was not a good fit for Spot Instances, or if the overhead needed to handle interruptions was too high, it is time to take another look!
Graphical rendering is a compute-intensive task that is, as they say, embarrassingly parallel. Looked at another way, this means that there’s a more or less linear relationship between the number of processors that are working on the problem and the overall wall-clock time that it takes to complete the task. In a creative endeavor such as movie-making, getting the results faster spurs creativity, improves the feedback loop, gives you time to make more iterations and trials, and leads to a better result. Even if you have a render farm in-house, you may still want to turn to the cloud in order to gain access to more compute power at peak times. Once you do this, the next challenge is to manage the combination of in-house resources, cloud resources, and the digital assets in a unified fashion.
Deadline 10
Earlier this week we launched Deadline 10, a powerful render management system. Building on technology that we brought on board with the acquisition of Thinkbox Software, Deadline 10 is designed to extend existing on-premises rendering into the AWS Cloud, giving you elasticity and flexibility while remaining simple and easy to use. You can set up and manage large-scale distributed jobs that span multiple AWS regions and benefit from elastic, usage-based AWS licensing for popular applications like Deadline for Autodesk 3ds Max, Maya, Arnold, and dozens more, all available from the Thinkbox Marketplace. You can purchase software licenses from the marketplace, use your existing licenses, or use them together.
Deadline 10 obtains cloud-based compute resources by managing bids for EC2 Spot Instances, providing you with access to enough low-cost compute capacity to let your imagination run wild! It uses your existing AWS account, tags EC2 instances for tracking, and synchronizes your local assets to the cloud before rendering begins.
A Quick Tour
Let’s take a quick tour of Deadline 10 and see how it makes use of AWS. The AWS Portal is available from the View menu:
The first step is to log in to my AWS account:
Then I configure the connection server, license server, and the S3 bucket that will be used to store rendering assets:
Next, I set up my Spot fleet, establishing a maximum price per hour for each EC2 instance, setting target capacity, and choosing the desired rendering application:
I can also choose any desired combination of EC2 instance types:
When I am ready to render I click on Start Spot Fleet:
This will initiate the process of bidding for and managing Spot Instances. The running instances are visible from the Portal:
I can monitor the progress of my rendering pipeline:
I can stop my Spot fleet when I no longer need it:
Deadline 10 is now available for usage-based license customers; a new license is needed for traditional floating license users. Pricing for Deadline licenses has been reduced to $48 per year. If you are already using an earlier version of Deadline, feel free to contact us to learn more about licensing options.
Spot Instances allow you to bid on spare Amazon EC2 compute capacity. Spot Instances typically cost 50-90% less than On-Demand Instances. Powering your ECS cluster with Spot Instances lets you reduce the cost of running your existing containerized workloads, or increase your compute capacity by two to ten times while keeping the same budget. Or you could do a combination of both!
Using Spot Instances, you specify the price you are willing to pay per instance-hour. Your Spot Instance runs whenever your bid exceeds the current Spot price. If your instance is reclaimed due to an increase in the Spot price, you are not charged for the partial hour that your instance has run.
The ECS console uses Spot Fleet to deploy your Spot Instances. Spot Fleet attempts to deploy the target capacity you request (expressed in terms of instances or a vCPU count) for your containerized application by launching Spot Instances that result in the best prices for you. If your Spot Instances are reclaimed due to a change in Spot prices or available capacity, Spot Fleet also attempts to maintain its target capacity.
Containers are a natural fit for the diverse pool of resources that Spot Fleet thrives on. Spot Fleet enables you to provision capacity across multiple Spot Instance pools (combinations of instance types and Availability Zones), which helps improve your application’s availability and reduce operating costs of the fleet over time. Combining the extensible and flexible container placement system provided by ECS with Spot Fleet allows you to efficiently deploy containerized workloads and easily manage clusters at any scale for a fraction of the cost.
Previously, deploying your ECS cluster on Spot Instances was a manual process. In this post, we show you how to achieve high availability, scalability, and cost savings for your container workloads by using the new Spot Fleet integration in the ECS console. We also show you how to build your own ECS cluster on Spot Instances using AWS CloudFormation.
Creating an ECS cluster running on Spot Instances
You can create an ECS cluster using the AWS Management Console.
In Instance configuration, for Provisioning model, choose Spot.
Choosing an allocation strategy
The two available Spot Fleet allocation strategies are Diversified and Lowest price.
The allocation strategy you choose for your Spot Fleet determines how it fulfills your Spot Fleet request from the possible Spot Instance pools. When you use the diversified strategy, the Spot Instances are distributed across all pools. When you use the lowest price strategy, the Spot Instances come from the pool with the lowest price specified in your request.
Remember that each instance type (the instance size within each instance family, e.g., c4.4xlarge), in each Availability Zone, in every region, is a separate pool of capacity, and therefore a separate Spot market. By diversifying across as many different instance types and Availability Zones as possible, you can improve the availability of your fleet. You also make your fleet less sensitive to increases in the Spot price in any one pool over time.
You can select up to six EC2 instance types to use for your Spot Fleet. In this example, we’ve selected m3, m4, c3, c4, r3, and r4 instance types of size xlarge.
You need to enter a bid for your instances. Typically, bidding at or near the On-Demand Instance price is a good starting point. Your bid is the maximum price that you are willing to pay for instance types in that Spot pool. While the Spot price is at or below your bid, you pay the Spot price. Bidding lower ensures that you have lower costs, while bidding higher reduces the probability of interruption.
Configure the number of instances to have in your cluster. Spot Fleet attempts to launch the number of Spot Instances that are required to meet the target capacity specified in your request. The Spot Fleet also attempts to maintain its target capacity if your Spot Instances are reclaimed due to a change in Spot prices or available capacity.
The latest ECS-optimized AMI is used for the instances when they are launched.
Configure your storage and network settings. To enable diversification and high availability, be sure to select subnets in multiple Availability Zones. You can’t select multiple subnets in the same Availability Zone in a single Spot Fleet.
The ECS container agent makes calls to the ECS API actions on your behalf. Container instances that run the agent require the ecsInstanceRole IAM policy and role for the service to know that the agent belongs to you. If you don’t have the ecsInstanceRole already, you can create one using the ECS console.
If you create a managed compute environment that uses Spot Fleet, you must create a role that grants the Spot Fleet permission to bid on, launch, and terminate instances on your behalf. You can also create this role using the ECS console.
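If you prefer the CLI to the console for this step, a sketch of creating such a role might look like the following; the role name is a placeholder, and AmazonEC2SpotFleetTaggingRole is an AWS managed policy:
# Hypothetical sketch: an IAM role that Spot Fleet can assume to bid on, launch,
# tag, and terminate instances on your behalf.
aws iam create-role --role-name ecsSpotFleetRole \
  --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"Service": "spotfleet.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name ecsSpotFleetRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole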
That’s it! In the ECS console, choose Create to spin up your new ECS cluster running on Spot Instances.
Using AWS CloudFormation to deploy your ECS cluster on Spot Instances
We have also published a reference architecture AWS CloudFormation template that demonstrates how to easily launch a CloudFormation stack and deploy your ECS cluster on Spot Instances.
The CloudFormation template includes the Spot Instance termination notice script covered in the Handling termination section, as well as some additional logging and other example features to get you started quickly. You can find the CloudFormation template in the Amazon EC2 Spot Instances GitHub repo.
Give it a try and customize it as needed for your environment!
Handling termination
With Spot Instances, you never pay more than the price you specified. If the Spot price exceeds your bid price for a given instance, it is terminated automatically for you.
The best way to protect against Spot Instance interruption is to architect your containerized application to be fault-tolerant. In addition, you can take advantage of a feature called Spot Instance termination notices, which provides a two-minute warning before EC2 must terminate your Spot Instance.
This warning is made available to the applications on your Spot Instance using an item in the instance metadata. When you deploy your ECS cluster on Spot Instances using the console, AWS installs a script that checks every 5 seconds for the Spot Instance termination notice. If the notice is detected, the script immediately updates the container instance state to DRAINING.
A simplified version of the Spot Instance termination notice script is as follows:
#!/bin/bash
# Poll the EC2 instance metadata service every 5 seconds for a Spot termination notice.
while sleep 5; do
  if [ -z "$(curl -Isf http://169.254.169.254/latest/meta-data/spot/termination-time)" ]; then
    /bin/false
  else
    # Termination notice found: look up this instance's cluster and container
    # instance ARN from the ECS agent introspection API, then set it to DRAINING.
    ECS_CLUSTER=$(curl -s http://localhost:51678/v1/metadata | jq .Cluster | tr -d \")
    CONTAINER_INSTANCE=$(curl -s http://localhost:51678/v1/metadata \
      | jq .ContainerInstanceArn | tr -d \")
    aws ecs update-container-instances-state --cluster "$ECS_CLUSTER" \
      --container-instances "$CONTAINER_INSTANCE" --status DRAINING
  fi
done
When you set a container instance to DRAINING, ECS prevents new tasks from being scheduled for placement on the container instance. If the resources are available, replacement service tasks are started on other container instances in the cluster. Container instance draining enables you to remove a container instance from a cluster without impacting tasks in your cluster. Service tasks on the container instance that are in the PENDING state are stopped immediately.
Service tasks on the container instance that are in the RUNNING state are stopped and replaced according to the service’s deployment configuration parameters, minimumHealthyPercent and maximumPercent.
ECS on Spot Instances in action
Want to see how customers are already powering their ECS clusters on Spot Instances? Our friends at Mapbox are doing just that.
Mapbox is a platform for designing and publishing custom maps. The company uses ECS to power their entire batch processing architecture to collect and process over 100 million miles of sensor data per day that they use for powering their maps. They also optimize their batch processing architecture on ECS using Spot Instances.
The Mapbox platform powers over 5,000 apps and reaches more than 200 million users each month. Its backend runs on ECS, allowing it to serve more than 1.3 billion requests per day. To learn more about their recent migration to ECS, read their recent blog post, We Switched to Amazon ECS, and You Won’t Believe What Happened Next. Then, in their follow-up blog post, Caches to Cash, learn how they are running their entire platform on Spot Instances, allowing them to save upwards of 50–90% on their EC2 costs.
Conclusion
We hope that you are as excited as we are about running your containerized applications at scale and cost effectively using Spot Instances. For more information, see the following pages:
Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at AWS
Angel Pizarro is a Scientific Computing Technical Business Development Manager at AWS
Deriving insights from data is foundational to nearly every organization, and many customers process high volumes of data every day. One common requirement of customers in life sciences is the need to analyze these data in a high-throughput fashion without sacrificing time-to-insight.
One such use case is genomic sequencing. Modern DNA sequencers, such as Illumina’s NovaSeq, can produce multiple terabytes of raw data each day. The data must then be processed into meaningful information that clinicians and researchers can act on in a timely fashion. This processing of genomic data is commonly referred to as "secondary analysis".
Most common secondary analysis workflows take raw reads generated from sequencers and then process them in a multi-step workflow to identify the variation in a biological sample compared to a standard genome reference. The individual steps are normally similar to the following:
DNA sequences are mapped to a standard genome reference by use of an alignment algorithm, such as Smith-Waterman or Burrows-Wheeler.
After the sequences are mapped to the reference, the differences are identified as single nucleotide variations, insertions, deletions, or complex structural variation in a process known as variant calling.
The resulting variants are often combined with other information to identify genomic variants highly correlated with disease or drug response. They might also be analyzed in the context of clinical data such as to identify disease susceptibility or state for a patient.
Along the way, quality metrics are collected or computed to ensure that the generated data is of the appropriate quality for use in requisite research or clinical settings.
In this post series, you can build a secondary analysis workflow similar to the one just described. Here is a diagram of the workflow:
Secondary analysis is a batch workflow
At its core, a genomics pipeline is similar to a series of extract, transform, and load (ETL) steps that convert raw files from a DNA sequencer to a list of variants for one or more individuals. Each step extracts a set of input files from a data source, processes them as a compute-intensive workload (transform), and then loads the output into another location for subsequent storage or analysis.
These steps are often chained together to build a flexible genomics processing workflow. The files can then be used for downstream analysis, such as population scale analytics with Amazon Athena or Spark on Amazon EMR. These ETL processes can be represented as individual batch steps in an overall workflow.
When we discuss batch processing with customers, we often focus on the following three layers:
Jobs (analytical modules): These jobs are individual units of work that take a set of inputs, run a single process, and produce a set of outputs. In this series, you use Docker containers to define these analytical modules. For genomics, these commonly include alignment, variant calling, quality control, or another module in your workflow. Amazon ECS is an AWS service that orchestrates and runs these Docker containerized modules on top of Amazon EC2 instances.
Batch engine: This is a framework for submitting individual analytical modules with the appropriate requirements for computational resources, such as memory and number of CPUs. Each step of the analysis pipeline requires a definition of how to run a job:
Computational resources (disk, memory, CPU)
The compute environment to run it in (Docker container, runtime parameters)
Information about the priority of different jobs
Any dependencies between jobs
You can leverage concepts such as container placement and bin packing to maximize performance of your genomic pipeline while concurrently optimizing for cost. We will use AWS Batch for this layer; a brief sketch of how it fits together with the orchestration layer follows this list. AWS Batch dynamically provisions the optimal quantity and type of compute resources (for example, CPU or memory optimized instances) based on the volume and specific resource requirements of the submitted batch jobs.
Workflow orchestration: This layer sits on top of the batch engine and allows you to decouple your workflow definition from the execution of the individual jobs. You can envision this layer as a state machine where you define a series of steps and pass appropriate metadata between states. We will use AWS Lambda to submit the jobs to AWS Batch and use AWS Step Functions to define and orchestrate the workflow.
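As a rough sketch of how the batch engine and orchestration layers fit together (the job names, queue, image URI, resource values, and ARNs below are placeholders, not the definitions used later in this series):
# Hypothetical sketch: register an alignment module as an AWS Batch job definition,
# declaring its container image and compute requirements.
aws batch register-job-definition \
  --job-definition-name align \
  --type container \
  --container-properties '{"image": "012345678901.dkr.ecr.us-east-1.amazonaws.com/align:latest", "vcpus": 8, "memory": 16000}'
# Submit an alignment job, then a variant calling job that runs only after it completes.
ALIGN_JOB=$(aws batch submit-job --job-name align-sample1 \
  --job-queue genomics-queue --job-definition align \
  --query jobId --output text)
aws batch submit-job --job-name call-variants-sample1 \
  --job-queue genomics-queue --job-definition call-variants \
  --depends-on jobId="$ALIGN_JOB"
# For the orchestration layer, a Step Functions state machine (definition not shown here)
# would invoke Lambda functions that make submit-job calls like these and poll job status.
aws stepfunctions create-state-machine --name secondary-analysis \
  --definition file://secondary-analysis.json \
  --role-arn arn:aws:iam::012345678901:role/StepFunctionsExecutionRole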
Architecture
In the next three posts, you build a genome analysis pipeline using the following architecture. You don’t explicitly build the grayed-out components, but we wanted to include them in the diagram as they are natural extensions to the core architecture. We discuss these and other extensions briefly in the concluding post. All code related to this blog series can be found in the associated GitHub repository here.
Part 2 covers the job layer. We demonstrate how you can package bioinformatics applications in Docker containers, and discuss best practices when developing these containers for use in a multitenant batch environment.
Part 3 dives deep into the batch, or data processing layer. We discuss common considerations for deploying Docker containers to be used in batch analysis as well as demonstrate how you can use AWS Batch for a scalable and elastic batch engine.
Part 4 dives into the workflow orchestration layer. We show how you might architect that layer with AWS services. You take the components built in parts 2 and 3 and combine them into an entire secondary analysis workflow. This workflow manages dependencies and continually checks the progress of existing jobs. We conclude by running a secondary analysis end-to-end for under $1 and discuss some extensions you can build on top of this core workflow.
What you can expect to learn
At the end of this series, you will have built a scalable and elastic solution to process genomes on AWS, as well as gained a general understanding of architectural choices available to you. The solution is generally applicable to other workloads, such as image processing. Even non-life science workloads such as trade analytics in financial services can benefit.
In your solution, you use Amazon EC2 Spot Instances to optimize for cost. Spot Instances allow you to bid on spare EC2 compute capacity, which can save up to 90% off of traditional On-Demand prices. In many cases, this translates into the ability to analyze genomes at scale for as low as $1 per analysis.
Spring has officially sprung. As you enjoy the blossoming of May flowers, it may be worth also noting some of the great tech talks blossoming online during the month of May. This month’s AWS Online Tech Talks features sessions on topics like AI, DevOps, Data, and Serverless, just to name a few.
May 2017 – Schedule
Below is the upcoming schedule for the live, online technical sessions scheduled for the month of May. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts. All schedule times for the online tech talks are shown in the Pacific Time (PDT) time zone.
The AWS Online Tech Talks series covers a broad range of topics at varying technical levels. These sessions feature live demonstrations and customer examples led by AWS engineers and Solution Architects. Check out the AWS YouTube channel for more on-demand webinars on AWS technologies.
Dougal Ballantyne, Principal Product Manager – AWS Batch
Docker enables you to create highly customized images that are used to execute your jobs. These images allow you to easily share complex applications between teams and even organizations. However, sometimes you might just need to run a script!
This post details the steps to create and run a simple “fetch & run” job in AWS Batch. AWS Batch executes jobs as Docker containers using Amazon ECS. You build a simple Docker image containing a helper application that can download your script or even a zip file from Amazon S3. AWS Batch then launches an instance of your container image to retrieve your script and run your job.
AWS Batch overview
AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.
With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 Spot Instances.
Create an IAM role to be used by jobs to access S3
Create a job definition that uses the built image
Submit and run a job that executes the job script from S3
Prerequisites
Before you get started, there are a few things to prepare. If this is the first time you have used AWS Batch, you should follow the Getting Started Guide and ensure that you have a valid job queue and compute environment.
After you are up and running with AWS Batch, the next thing is to have an environment in which to build and register the Docker image to be used. For this post, register this image in an ECR repository. This is a private repository by default and can easily be used by AWS Batch jobs.
The fetch & run Docker image is based on Amazon Linux. It includes a simple script that reads some environment variables and then uses the AWS CLI to download the job script (or zip file) to be executed.
To get started, download the source code from the aws-batch-helpers GitHub repository. The following link pulls the latest version: https://github.com/awslabs/aws-batch-helpers/archive/master.zip. Unzip the downloaded file and navigate to the “fetch-and-run” folder. Inside this folder are two files:
Dockerfile
fetch_and_run.sh
Dockerfile is used by Docker to build an image. Look at the contents; you should see something like the following:
FROM amazonlinux:latest
RUN yum -y install unzip aws-cli
ADD fetch_and_run.sh /usr/local/bin/fetch_and_run.sh
WORKDIR /tmp
USER nobody
ENTRYPOINT ["/usr/local/bin/fetch_and_run.sh"]
The FROM line instructs Docker to pull the base image from the amazonlinux repository, using the latest tag.
The RUN line executes a shell command as part of the image build process.
The ADD line copies the fetch_and_run.sh script into the /usr/local/bin directory inside the image.
The WORKDIR line sets the default directory to /tmp when the image is used to start a container.
The USER line sets the default user that the container executes as.
Finally, the ENTRYPOINT line instructs Docker to call the /usr/local/bin/fetch_and_run.sh script when it starts the container. When the image runs as an AWS Batch job, the script is passed the contents of the command parameter.
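For reference, the actual fetch_and_run.sh helper ships in the aws-batch-helpers repository; a simplified sketch of what it does is shown below (the BATCH_FILE_S3_URL environment variable name is illustrative and may differ from the real script):
#!/bin/bash
# Simplified sketch: download the job script named in an environment variable from S3,
# make it executable, and run it with the arguments passed to the container.
set -e
if [ -z "${BATCH_FILE_S3_URL}" ]; then
  echo "BATCH_FILE_S3_URL is not set; nothing to fetch" >&2
  exit 1
fi
TMPFILE=$(mktemp /tmp/batch-script.XXXXXX)
aws s3 cp "${BATCH_FILE_S3_URL}" "${TMPFILE}"
chmod +x "${TMPFILE}"
exec "${TMPFILE}" "$@"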
Now, build the Docker image! Assuming that the docker command is in your PATH and you don’t need sudo to access it, you can build the image with the following command (note the dot at the end of the command):
docker build -t awsbatch/fetch_and_run .
This command should produce an output similar to the following:
Sending build context to Docker daemon 373.8 kB
Step 1/6 : FROM amazonlinux:latest
latest: Pulling from library/amazonlinux
c9141092a50d: Pull complete
Digest: sha256:2010c88ac1e7c118d61793eec71dcfe0e276d72b38dd86bd3e49da1f8c48bf54
Status: Downloaded newer image for amazonlinux:latest
---> 8ae6f52035b5
Step 2/6 : RUN yum -y install unzip aws-cli
---> Running in e49cba995ea6
Loaded plugins: ovl, priorities
Resolving Dependencies
--> Running transaction check
---> Package aws-cli.noarch 0:1.11.29-1.45.amzn1 will be installed
<< removed for brevity >>
Complete!
---> b30dfc9b1b0e
Removing intermediate container e49cba995ea6
Step 3/6 : ADD fetch_and_run.sh /usr/local/bin/fetch_and_run.sh
---> 256343139922
Removing intermediate container 326092094ede
Step 4/6 : WORKDIR /tmp
---> 5a8660e40d85
Removing intermediate container b48a7b9c7b74
Step 5/6 : USER nobody
---> Running in 72c2be3af547
---> fb17633a64fe
Removing intermediate container 72c2be3af547
Step 6/6 : ENTRYPOINT /usr/local/bin/fetch_and_run.sh
---> Running in aa454b301d37
---> fe753d94c372
Removing intermediate container aa454b301d37
Successfully built 9aa226c28efc
In addition, you should see a new local repository called awsbatch/fetch_and_run when you run the following command:
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
awsbatch/fetch_and_run latest 9aa226c28efc 19 seconds ago 374 MB
amazonlinux latest 8ae6f52035b5 5 weeks ago 292 MB
To add more packages to the image, you could update the RUN line or add a second one, right after it.
Creating an ECR repository
The next step is to create an ECR repository to store the Docker image, so that it can be retrieved by AWS Batch when running jobs.
In the ECR console, choose Get Started or Create repository.
Enter a name for the repository, for example: awsbatch/fetch_and_run.
Choose Next step and follow the instructions.
You can keep the console open, as the tips can be helpful.
Push the built image to ECR
Now that you have a Docker image and an ECR repository, it is time to push the image to the repository. If you used the previous example names, the AWS CLI commands look like the following; replace the example AWS account number with your own.
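A representative sequence, assuming the us-east-1 region and the example account ID 012345678901 (the ECR console's View push commands dialog shows the exact commands for your repository):

# Authenticate the Docker CLI to your ECR registry. Older AWS CLI versions use
# get-login; newer versions use: aws ecr get-login-password | docker login ...
$(aws ecr get-login --no-include-email --region us-east-1)

# Tag the local image with the repository URI, then push it.
docker tag awsbatch/fetch_and_run:latest 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run:latest
docker push 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run:latest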
Next, create and upload a simple job script that is executed using the fetch_and_run image that you just built and registered in ECR. Start by creating a file called myjob.sh with the example content below:
#!/bin/bash
date
echo "Args: [email protected]"
env
echo "This is my simple test job!."
echo "jobId: $AWS_BATCH_JOB_ID"
echo "jobQueue: $AWS_BATCH_JQ_NAME"
echo "computeEnvironment: $AWS_BATCH_CE_NAME"
sleep $1
date
echo "bye bye!!"
Upload the script to an S3 bucket.
aws s3 cp myjob.sh s3://<bucket>/myjob.sh
Create an IAM role
When the fetch_and_run image runs as an AWS Batch job, it fetches the job script from Amazon S3. You need an IAM role that the AWS Batch job can use to access S3.
In the IAM console, choose Roles, Create New Role.
Enter a name for your new role, for example: batchJobRole, and choose Next Step.
For Role Type, under AWS Service Roles, choose Select next to “Amazon EC2 Container Service Task Role” and then choose Next Step.
On the Attach Policy page, type “AmazonS3ReadOnlyAccess” into the Filter field and then select the check box for that policy.
Choose Next Step, Create Role. You see the details of the new role.
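If you prefer the AWS CLI, a roughly equivalent sequence looks like the following sketch (the role name matches the example above; the trust policy allows ECS tasks, which is what AWS Batch jobs assume at run time):

# Trust policy that lets ECS tasks (and therefore AWS Batch jobs) assume the role.
cat > ecs-tasks-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ecs-tasks.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role --role-name batchJobRole \
  --assume-role-policy-document file://ecs-tasks-trust.json
aws iam attach-role-policy --role-name batchJobRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess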
Create a job definition
Now that you’ve created all of the resources you need, pull everything together and build a job definition that you can use to run one or many AWS Batch jobs. In the AWS Batch console, choose Job definitions, Create.
For the Job Definition, enter a name, for example, fetch_and_run.
For IAM Role, choose the role that you created earlier, batchJobRole.
For ECR Repository URI, enter the URI where the fetch_and_run image was pushed, for example: 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run.
Leave the Command field blank.
For vCPUs, enter 1. For Memory, enter 500.
For User, enter “nobody”.
Choose Create job definition.
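A rough CLI equivalent of these console steps (all values are the examples used above) might look like this:

aws batch register-job-definition \
  --job-definition-name fetch_and_run \
  --type container \
  --container-properties '{
    "image": "012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run",
    "vcpus": 1,
    "memory": 500,
    "user": "nobody",
    "jobRoleArn": "arn:aws:iam::012345678901:role/batchJobRole",
    "command": []
  }'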
Submit and run a job
Now, submit and run a job that uses the fetch_and_run image to download the job script and execute it. In the AWS Batch console, choose Jobs, Submit job.
Enter a name for the job, for example: script_test.
Choose the latest fetch_and_run job definition.
For Job Queue, choose a queue, for example: first-run-job-queue.
For Command, enter myjob.sh,60.
Choose Validate Command.
Enter the following environment variables and then choose Submit job.
Key=BATCH_FILE_TYPE, Value=script
Key=BATCH_FILE_S3_URL, Value=s3://<bucket>/myjob.sh. Don’t forget to use the correct URL for your file.
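The same job can also be submitted from the AWS CLI; a sketch using the example names above (the queue and bucket are placeholders) looks like this:

aws batch submit-job \
  --job-name script_test \
  --job-queue first-run-job-queue \
  --job-definition fetch_and_run \
  --container-overrides '{
    "command": ["myjob.sh", "60"],
    "environment": [
      { "name": "BATCH_FILE_TYPE", "value": "script" },
      { "name": "BATCH_FILE_S3_URL", "value": "s3://<bucket>/myjob.sh" }
    ]
  }'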
After the job is completed, check the final status in the console.
In the job details page, you can also choose View logs for this job in CloudWatch console to see your job log.
How the fetch and run image works
The fetch_and_run image works as a combination of the Docker ENTRYPOINT and COMMAND features, and a shell script that reads environment variables set as part of the AWS Batch job. When building the Docker image, it starts with a base image from Amazon Linux and installs a few packages from the yum repository. This becomes the execution environment for the job.
If the script you planned to run needed more packages, you would add them using the RUN parameter in the Dockerfile. You could even change it to a different base image such as Ubuntu, by updating the FROM parameter.
Next, the fetch_and_run.sh script is added to the image and set as the container ENTRYPOINT. The script simply reads some environment variables and then downloads and runs the script or zip file from S3. It looks for the following environment variables: BATCH_FILE_TYPE and BATCH_FILE_S3_URL. If you run fetch_and_run.sh with no environment variables, you get the following usage message:
BATCH_FILE_TYPE not set, unable to determine type (zip/script) of
This shows that it supports two values for BATCH_FILE_TYPE, either “script” or “zip”. When you set “script”, it causes fetch_and_run.sh to download a single file and then execute it, in addition to passing in any further arguments to the script. If you set it to “zip”, this causes fetch_and_run.sh to download a zip file, then unpack it and execute the script name passed and any further arguments. You can use the “zip” option to pass more complex jobs with all of the application’s dependencies in one file.
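As a rough illustration of that behavior (a condensed sketch, not the exact script from the aws-batch-helpers repository), the core logic looks something like this:

# Condensed sketch of the fetch_and_run.sh logic; the real script adds error handling.
case "${BATCH_FILE_TYPE}" in
  script)
    aws s3 cp "${BATCH_FILE_S3_URL}" /tmp/batch-file.sh
    chmod +x /tmp/batch-file.sh
    exec /tmp/batch-file.sh "$@"            # remaining args come from the job Command
    ;;
  zip)
    aws s3 cp "${BATCH_FILE_S3_URL}" /tmp/batch-file.zip
    unzip /tmp/batch-file.zip -d /tmp/batch-files
    cd /tmp/batch-files
    exec "./$1" "${@:2}"                    # first arg names the script inside the zip
    ;;
  *)
    echo "BATCH_FILE_TYPE must be 'script' or 'zip'" >&2
    exit 1
    ;;
esac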
Finally, the ENTRYPOINT parameter tells Docker to execute the /usr/local/bin/fetch_and_run.sh script when creating a container. In addition, it passes the contents of the COMMAND parameter as arguments to the script. This is what enables you to pass the script and arguments to be executed by the fetch_and_run image with the Command field in the SubmitJob API action call.
Summary
In this post, I detailed the steps to create and run a simple “fetch & run” job in AWS Batch. You can now easily use the same job definition to run as many jobs as you need by uploading a job script to Amazon S3 and calling SubmitJob with the appropriate environment variables.
If you have questions or suggestions, please comment below.
AWS OpsWorks Stacks is a configuration management service that helps you configure and operate applications of all shapes and sizes using Chef. You can define the application’s architecture and the specification of each component, including package installation, software configuration, and more.
Amazon EC2 Spot instances allow you to bid on spare Amazon EC2 computing capacity. Because Spot instances are often available at a discount compared to On-Demand instances, you can significantly reduce the cost of running your applications, grow your applications’ compute capacity and throughput for the same budget, and enable new types of cloud computing applications.
You can use Spot instances with AWS OpsWorks Stacks in the following ways:
As a part of an Auto Scaling group, as described in this blog post. You can follow the steps in the blog post and choose the Spot instance option in the launch configuration described in step 5.
To provision a Spot instance in the EC2 console and have it automatically register with an OpsWorks stack as described here.
The example used in this post will require you to create the following resources:
IAM instance profile: an IAM profile that grants your instances permission to register themselves with OpsWorks.
Lambda function: a function that deregisters your instances from an OpsWorks stack.
Spot instance: the Spot instance that will run your application.
CloudWatch Events rule: a rule that will trigger the Lambda function whenever your Spot instance is terminated.
Step 1: Create an IAM instance profile
When a Spot instance starts, it must be able to make an API call to register itself with an OpsWorks stack. By assigning an IAM instance profile to the instance, you enable it to make calls to OpsWorks.
Open the IAM console at https://console.aws.amazon.com/iam/, choose Roles, and then choose Create New Role. Type a name for the role, and then choose Next Step. Choose the Amazon EC2 role, and then select the check box next to the AWSOpsWorksInstanceRegistration policy. Finally, select Next Step, and then choose Create Role. As its name suggests, the AWSOpsWorksInstanceRegistration policy will allow the instance to make API calls only to register an instance. Copy the following policy to the new role you’ve just created.
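As an illustration only (an assumed policy, so adjust it for your own setup), an inline policy that also lets the instance assign itself to a layer and check its registration status could be attached like this:

# Illustrative inline policy (assumed, not reproduced from this post): the user data
# script in step 4 also calls assign-instance and describe-instances, which the
# AWSOpsWorksInstanceRegistration managed policy alone does not allow.
aws iam put-role-policy --role-name <your-role-name> --policy-name OpsWorksSpotRegistration \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "opsworks:AssignInstance",
        "opsworks:DescribeInstances",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }]
  }'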
Step 2: Create a Lambda function
This Lambda function is responsible for deregistering an instance from your OpsWorks stack. It will be invoked whenever the Spot instance is terminated.
Open the AWS Lambda console at https://us-west-2.console.aws.amazon.com/lambda/home, and choose the option to create a Lambda function. If you are prompted to choose a blueprint, choose Skip. Type a name for the Lambda function, and from the Runtime drop-down list, select Python 2.7.
Next, paste the following code into the Lambda Function Code text box:
import boto3

def lambda_handler(event, context):
    # The CloudWatch event carries the ID of the terminated EC2 instance.
    ec2_instance_id = event['detail']['instance-id']
    ec2 = boto3.client('ec2')

    # Look up the OpsWorks stack ID from the instance's tags.
    for tag in ec2.describe_instances(InstanceIds=[ec2_instance_id])['Reservations'][0]['Instances'][0]['Tags']:
        if tag['Key'] == 'opsworks_stack_id':
            opsworks_stack_id = tag['Value']
            opsworks = boto3.client('opsworks', 'us-east-1')
            # Find the matching OpsWorks instance and deregister it from the stack.
            for instance in opsworks.describe_instances(StackId=opsworks_stack_id)['Instances']:
                if 'Ec2InstanceId' in instance:
                    if instance['Ec2InstanceId'] == ec2_instance_id:
                        print("Deregistering OpsWorks instance " + instance['InstanceId'])
                        opsworks.deregister_instance(InstanceId=instance['InstanceId'])

    return ec2_instance_id
Step 3: Create a CloudWatch event
Whenever the Spot instance is terminated, we need to trigger the Lambda function from step 2 to deregister the instance from its associated stack.
Open the AWS CloudWatch console at https://console.aws.amazon.com/cloudwatch/home, choose Events, and then choose the Create rule button. Choose Amazon EC2 from the event selector. Select Specific state(s), and choose Terminated. Select the Lambda function you created earlier as the target. Finally, choose the Configure details button.
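If you script this instead of using the console, a roughly equivalent rule can be created with the CLI; the rule and function names below are examples:

aws events put-rule --name opsworks-spot-terminated \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": { "state": ["terminated"] }
  }'

# Point the rule at the Lambda function from step 2. When configuring from the CLI,
# also grant CloudWatch Events permission to invoke the function (aws lambda add-permission).
aws events put-targets --rule opsworks-spot-terminated \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:012345678901:function:deregister-opsworks-instance'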
Step 4: Create a Spot instance
Open the EC2 console at https://console.aws.amazon.com/ec2sp/v1/spot/home, and choose the Request Spot Instances button. Use the latest release of Amazon Linux. On the details page, under IAM instance profile, choose the instance profile you created in step 1. Paste the following script into the User data field:
This script will automatically register your Spot instance on boot with a corresponding OpsWorks stack and layer. Be sure to fill in the following fields:
STACK_ID=YOUR_STACK_ID
LAYER_ID=YOUR_LAYER_ID
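As a minimal sketch of what such a user data script does (register the instance with the stack, wait for registration, then assign it to the layer), assuming the AMI includes the AWS CLI:

#!/bin/bash
# Minimal sketch; not the exact script from this post.
STACK_ID=YOUR_STACK_ID
LAYER_ID=YOUR_LAYER_ID
REGION=us-east-1

# Register this instance with the OpsWorks stack using the instance profile credentials.
# (Capturing the OpsWorks instance ID from the command output is an assumption; adjust
# the parsing to match what your CLI version prints.)
INSTANCE_ID=$(aws opsworks register --region $REGION --use-instance-profile \
  --infrastructure-class ec2 --stack-id $STACK_ID --local 2>&1 \
  | grep -o 'Instance ID: .*' | cut -d' ' -f3)

# Wait until registration completes, then assign the instance to the layer.
aws opsworks wait instance-registered --region $REGION --instance-id $INSTANCE_ID
aws opsworks assign-instance --region $REGION --instance-id $INSTANCE_ID --layer-ids $LAYER_ID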
It will take a few minutes for the instance to be provisioned and come online. You’ll see fulfilled displayed in the Status column and active displayed in the State column. After the instance and request are both in an active state, the instance should be fully booted and registered with your OpsWorks stack/layer.
You can also view the instance and its online state in the OpsWorks console under Spot Instance.
You can manually terminate a Spot instance from the OpsWorks service console. Simply choose the stop button and the Spot instance will be terminated and removed from your stack. Unlike an On-Demand Instance in OpsWorks, when a Spot instance is stopped, it cannot be restarted.
In case your Spot instance is terminated through other means (for example, in the EC2 console), a CloudWatch event will trigger the Lambda function, which will automatically deregister the instance from your OpsWorks stack.
Conclusion
You can now use OpsWorks Stacks to define your application’s architecture and software configuration while leveraging the attractive pricing of Spot instances.
Jonathan Fritz is a Senior Product Manager for Amazon EMR
Customers running Apache Spark, Presto, and the Apache Hadoop ecosystem take advantage of Amazon EMR’s elasticity to save costs by terminating clusters after workflows are complete and resizing clusters with low-cost Amazon EC2 Spot Instances. For instance, customers can create clusters for daily ETL or machine learning jobs and shut them down when they complete, or scale out a Presto cluster serving BI analysts during business hours for ad hoc, low-latency SQL on Amazon S3.
With new support for Auto Scaling in Amazon EMR releases 4.x and 5.x, customers can now add (scale out) and remove (scale in) nodes on a cluster more easily. Scaling actions are triggered automatically by Amazon CloudWatch metrics provided by EMR at 5 minute intervals, including several YARN metrics related to memory utilization, applications pending, and HDFS utilization.
In EMR release 5.1.0, we introduced two new metrics, YARNMemoryAvailablePercentage and ContainerPendingRatio, which serve as useful cluster utilization metrics for scalable, YARN-based frameworks like Apache Spark, Apache Tez, and Apache Hadoop MapReduce. Additionally, customers can use custom CloudWatch metrics in their Auto Scaling policies.
The following is an example Auto Scaling policy on an instance group that scales one instance at a time, up to a maximum of 40 instances and down to a minimum of 10 instances. The instance group scales out when the memory available in YARN is less than 15%, and scales in when this metric is greater than 75%. The instance group also scales out when the ratio of pending YARN containers to allocated YARN containers reaches 0.75.
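An illustrative version of such a policy, expressed with the put-auto-scaling-policy CLI command (the cluster and instance group IDs are placeholders, and the thresholds mirror the description above), might look like this:

aws emr put-auto-scaling-policy --cluster-id j-EXAMPLE --instance-group-id ig-EXAMPLE \
  --auto-scaling-policy '{
    "Constraints": { "MinCapacity": 10, "MaxCapacity": 40 },
    "Rules": [
      {
        "Name": "ScaleOutOnLowYARNMemory",
        "Action": { "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": 1, "CoolDown": 300 } },
        "Trigger": { "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "LESS_THAN", "EvaluationPeriods": 1,
          "MetricName": "YARNMemoryAvailablePercentage", "Period": 300,
          "Statistic": "AVERAGE", "Threshold": 15.0, "Unit": "PERCENT" } }
      },
      {
        "Name": "ScaleInOnHighYARNMemory",
        "Action": { "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": -1, "CoolDown": 300 } },
        "Trigger": { "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "GREATER_THAN", "EvaluationPeriods": 1,
          "MetricName": "YARNMemoryAvailablePercentage", "Period": 300,
          "Statistic": "AVERAGE", "Threshold": 75.0, "Unit": "PERCENT" } }
      },
      {
        "Name": "ScaleOutOnPendingContainers",
        "Action": { "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": 1, "CoolDown": 300 } },
        "Trigger": { "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 1,
          "MetricName": "ContainerPendingRatio", "Period": 300,
          "Statistic": "AVERAGE", "Threshold": 0.75 } }
      }
    ]
  }'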
Additionally, customers can now configure the scale-down behavior when nodes are terminated from their cluster on EMR 5.1.0. By default, EMR now terminates nodes only during a scale-in event at the instance hour boundary, regardless of when the request was submitted. Because EC2 charges per full hour regardless of when the instance is terminated, this behavior enables applications running on your cluster to use instances in a dynamically scaling environment more cost effectively.
Conversely, customers can set the previous default for EMR releases earlier than 5.1.0, which blacklists and drains tasks from nodes before terminating them, regardless of proximity to the instance hour boundary. With either behavior, EMR removes the least active nodes first and blocks termination if it could lead to HDFS corruption.
You can create or modify Auto Scaling policies using the EMR console, AWS CLI, or the AWS SDKs with the EMR API. To enable Auto Scaling, EMR also requires an additional IAM role to grant permission for Auto Scaling to add and terminate capacity. For more information, see Auto Scaling with Amazon EMR. If you have any questions or would like to share an interesting use case about Auto Scaling on EMR, please leave a comment below.
This is a guest post from Linda Hedges, Principal SA, High Performance Computing.
—–
One question we often hear is: “How well will my application scale on AWS?” For HPC workloads that span multiple nodes, the cluster network is at the heart of scalability concerns. AWS uses advanced Ethernet networking technology that, like all things AWS, is designed for scale, security, high availability, and low cost. This network is exceptional and continues to benefit from Amazon’s rapid pace of development. For real-world applications, all but the most demanding customers find that their applications run very well on AWS! Many have speculated that highly coupled workloads require a name-brand network fabric to achieve good performance. For most applications, this is simply not the case. As with all clusters, the devil is in the details, and some applications benefit from cluster tuning. This post discusses the scalability of a representative, real-world application and provides a few performance tips for achieving excellent application performance, using STAR-CCM+ as an example. For more HPC-specific information, please see our website.
TLG Aerospace, a Seattle-based aerospace engineering services company, runs most of their STAR-CCM+ Computational Fluid Dynamics (CFD) cases on AWS. A detailed case study describing TLG Aerospace’s experience and the results they achieved can be found here. This post uses one of their CFD cases as an example to understand AWS scalability. By leveraging Amazon EC2 Spot Instances, which allow customers to purchase unused capacity at significantly reduced rates, TLG Aerospace consistently achieves an 80% cost savings compared to their previous cloud and on-premises HPC cluster options. TLG Aerospace experiences solid value, terrific scale-up, and effectively limitless case throughput – all with no queue wait!
HPC applications such as Computational Fluid Dynamics (CFD) depend heavily on the application’s ability to efficiently scale compute tasks in parallel across multiple compute resources. Parallel performance is often evaluated by determining an application’s scale-up. Scale-up is a function of the number of processors used and is defined as the time it takes to complete a run on one processor, divided by the time it takes to complete the same run on the number of processors used for the parallel run.
As an example, consider an application with a time to completion, or turn-around time of 32 hours when run on one processor. If the same application runs in one hour when run on 32 processors, then the scale-up is 32 hours of time on one processor / 1 hour time on 32 processors, or equal to 32, for 32 processes. Scaling is considered to be excellent when the scale-up is close to or equal to the number of processors on which the application is run.
If the same application took eight hours to complete on 32 processors, it would have a scale-up of only four: 32 (time on one processor) / 8 (time to complete on 32 processors). A scale-up of four on 32 processors is considered to be poor.
In addition to characterizing the scale-up of an application, scalability can be further characterized as “strong” or “weak” scaling. Note that the term “weak”, as used here, does not mean inadequate or bad but is a technical term facilitating the description of the type of scaling which is sought. Strong scaling offers a traditional view of application scaling, where a problem size is fixed and spread over an increasing number of processors. As more processors are added to the calculation, good strong scaling means that the time to complete the calculation decreases proportionally with increasing processor count.
In comparison, weak scaling does not fix the problem size used in the evaluation, but purposely increases the problem size as the number of processors also increases. The ratio of the problem size to the number of processors on which the case is run is held constant. For a CFD calculation, problem size most often refers to the size of the grid for a similar configuration.
An application demonstrates good weak scaling when the time to complete the calculation remains constant as the ratio of compute effort to the number of processors is held constant. Weak scaling offers insight into how an application behaves with varying case size.
Figure 1: Strong Scaling Demonstrated for a 16M Cell STAR-CCM+ CFD Calculation
Scale-up as a function of increasing processor count is shown in Figure 1 for the STAR-CCM+ case data provided by TLG Aerospace. This is a demonstration of “strong” scalability. The blue line shows what ideal or perfect scalability looks like. The purple triangles show the actual scale-up for the case as a function of increasing processor count. Excellent scaling is seen to well over 400 processors for this modest-sized 16M cell case, as evidenced by the closeness of these two curves. This example was run on Amazon EC2 c3.8xlarge instances, each with Intel E5-2680 processors, providing either 16 cores or 32 hyper-threaded processors.
AWS customers can choose to run their applications on either threads or cores. AWS provides access to the underlying hardware of our servers. For an application like STAR-CCM+, excellent linear scaling can be seen when using either threads or cores, though testing of a specific case and application is always recommended. For this example, threads were chosen as the processing basis. Running on threads offered a few percent performance improvement when compared to running the same case on cores. Note that the number of available cores is equal to half of the number of available threads.
The scalability of real-world problems is directly related to the ratio of the compute effort per core to the time required to exchange data across the network. The grid size of a CFD case provides a strong indication of how much computational effort is required for a solution. Thus, larger cases will scale to even greater processor counts than the modest-sized case discussed here.
Figure 2: Scale-up and Efficiency as a Function of Cells per Processor
STAR-CCM+ has been shown to demonstrate exceptional “weak” scaling on AWS. That is not shown here, though Figure 2 hints at it by plotting the cells per processor on the horizontal axis. The purple line in Figure 2 shows scale-up as a function of grid cells per processor. The vertical axis for scale-up is on the left-hand side of the graph, as indicated by the purple arrow. The green line in Figure 2 shows efficiency as a function of grid cells per processor. The vertical axis for efficiency is shown on the right-hand side of the graph and is indicated with a green arrow. Efficiency is defined as the scale-up divided by the number of processors used in the calculation.
Weak scaling is evidenced by considering the number of grid cells per processor as a measure of compute effort. Holding the grid cells per processor constant while increasing total case size demonstrates weak scaling. Weak scaling is not shown here because only one CFD case is used. Fewer grid cells per processor means reduced computational effort per processor. Maintaining efficiency while reducing cells per processor demonstrates the excellent strong scalability of STAR-CCM+ on AWS.
Efficiency remains at about 100% between approximately 250,000 grid cells per thread (or processor) and 100,000 grid cells per thread. Efficiency starts to fall off at about 100,000 grid cells per thread. An efficiency of at least 80% is maintained until 25,000 grid cells per thread. Decreasing grid cells per processor leads to decreased efficiency because the total computational effort per processor is reduced. Note that the perceived ability to achieve more than 100% efficiency (here, at about 150,000 cells per thread) is common in scaling studies, is case specific, and often related to smaller effects such as timing variation and memory caching.
Figure 3: Cost per Run Based on Spot Pricing ($0.35 per Hour for c3.8xlarge) as a Function of Turn-around Time
Plots of scale-up and efficiency offer an understanding of how a case or application scales. The bottom line, though, is that what really matters to most HPC users is case turn-around time and cost. A plot of turn-around time versus CPU cost for this case is shown in Figure 3. As the number of threads is increased, the total turn-around time decreases. But as the number of threads increases, the inefficiency also increases, which leads to increased cost. The cost shown is based on a typical Amazon EC2 Spot price for the c3.8xlarge and only includes the computational costs. Small costs will also be incurred for data storage.
Minimum cost and turn-around time were achieved with approximately 100,000 cells per thread. Many users will choose a cell count per thread to achieve the lowest possible cost. Others may choose a cell count per thread to achieve the fastest turn-around time. If a run is desired in one-third the time of the lowest price point, it can be achieved with approximately 25,000 cells per thread. (Note that many users run STAR-CCM+ with significantly fewer cells per thread than this.) While this increases the compute cost, other concerns, such as license costs or schedules, can be overriding factors. For this 16M cell case, the added inefficiency results in an increase in run price from $3 to $4 for computing. Many find the reduced turn-around time well worth the price of the additional instances.
As with any cluster, good performance requires attention to the details of the cluster setup. While AWS allows for the quick setup and teardown of clusters, performance is affected by many of the specifics in that setup. This post provides some examples.
On AWS, a placement group is a grouping of instances within a single Availability Zone that allows for low latency between the instances. Placement groups are recommended for all applications where low latency is a requirement, and a placement group was used to achieve the best performance from STAR-CCM+. More on placement groups can be found in our docs.
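For example, creating a cluster placement group and launching instances into it can be as simple as the following sketch (the group name, AMI, and instance count are placeholders):

aws ec2 create-placement-group --group-name starccm-pg --strategy cluster
aws ec2 run-instances --image-id ami-EXAMPLE --count 16 --instance-type c3.8xlarge \
  --placement GroupName=starccm-pg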
Amazon Linux is a version of Linux maintained by Amazon. The distribution evolved from Red Hat Enterprise Linux (RHEL) and is designed to provide a stable, secure, and highly performant environment. Amazon Linux is optimized to run on AWS and offers excellent performance for running HPC applications. For the case presented here, the operating system used was Amazon Linux. Other Linux distributions are also performant. However, it is strongly recommended that Linux HPC applications use at least the version 3.10 Linux kernel to be sure of using the latest Xen libraries. See our Amazon Linux page to learn more.
Amazon Elastic Block Store (Amazon EBS) is a persistent block-level storage service often used for cluster storage on AWS. EBS provides reliable block-level storage volumes that can be attached to (and removed from) an Amazon EC2 instance. A standard EBS General Purpose SSD (gp2) volume is all that is required to meet the needs of STAR-CCM+ and was used here. Other HPC applications may require faster I/O to prevent data writes from becoming a bottleneck to turn-around speed. For these applications, other storage options exist. A guide to Amazon storage is found here.
As mentioned previously, STAR-CCM+, like many other CFD solvers, runs well on both threads and cores. Hyper-Threading can improve the performance of some MPI applications depending on the application, the case, and the size of the workload allocated to each thread; it may also slow performance. Because static cluster compute environments take a one-size-fits-all approach, most HPC clusters disable Hyper-Threading. To first order, computationally intensive workloads tend to run best on cores, while those that are I/O bound may run best on threads. Again, a few percent increase in performance was found for this case by running with threads. If there is no time to evaluate the effect of Hyper-Threading on case performance, then it is recommended that Hyper-Threading be disabled. When Hyper-Threading is disabled, it is important to bind each process to a designated CPU core. This is called processor or CPU affinity, and it almost universally improves performance over unpinned cores for computationally intensive workloads.
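As an illustration of what that looks like on a Linux instance (exact CPU numbering varies by instance type, and the MPI flag shown is Open MPI's), a sketch might be:

# List which logical CPUs are hyper-thread siblings of each physical core.
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -u

# Take the second sibling of each pair offline to effectively disable Hyper-Threading,
# e.g. for logical CPU 17:
echo 0 | sudo tee /sys/devices/system/cpu/cpu17/online

# Pin MPI ranks to cores (Open MPI shown; other MPI libraries use different flags).
mpirun --bind-to core -np 16 ./solver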
Occasionally, an application includes frequent time measurement in its code, perhaps for performance tuning. Under these circumstances, performance can be improved by setting the clock source to the TSC (Time Stamp Counter). This tuning was not required for this application but is mentioned here for completeness.
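If you do need it, a quick sketch of checking and switching the kernel clock source looks like this:

# Show the available clock sources, then select TSC.
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource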
When evaluating an application, it is always recommended that a meaningful real-world case be used. A case that is too big or too small won’t reflect the performance and scalability achievable in every day operation. The only way to know positively how an application will perform on AWS is to try it!
AWS offers solid strong scaling and exceptional weak scaling. Good performance can be achieved on AWS for most applications. In addition to low cost and quick turn-around time, important considerations for HPC also include throughput and availability. AWS offers effectively limitless throughput, security, cost savings, and high availability, making queues a “thing of the past”. A long queue wait makes for a very long case turn-around time, regardless of the scale.