Tag Archives: Amazon EC2 Spot

Running Cost-effective queue workers with Amazon SQS and Amazon EC2 Spot Instances

Post Syndicated from peven original https://aws.amazon.com/blogs/compute/running-cost-effective-queue-workers-with-amazon-sqs-and-amazon-ec2-spot-instances/

This post is contributed by Ran Sheinberg | Sr. Solutions Architect, EC2 Spot & Chad Schmutzer | Principal Developer Advocate, EC2 Spot | Twitter: @schmutze

Introduction

Amazon Simple Queue Service (SQS) is used by customers to run decoupled workloads in the AWS Cloud as a best practice, in order to increase their applications’ resilience. You can use a worker tier to do background processing of images, audio, documents and so on, as well as offload long-running processes from the web tier. This blog post covers the benefits of pairing Amazon SQS and Spot Instances to maximize cost savings in the worker tier, and a customer success story.

Solution Overview

Amazon SQS is a fully managed message queuing service that enables customers to decouple and scale microservices, distributed systems, and serverless applications. It is a common best practice to use Amazon SQS with decoupled applications. Amazon SQS increases an application’s resilience by decoupling the direct communication between the frontend application and the worker tier that does data processing. If a worker node fails, the jobs that were running on that node return to the Amazon SQS queue for a different node to pick up.

Both the frontend and worker tier can run on Spot Instances, which offer spare compute capacity at steep discounts compared to On-Demand Instances. Spot Instances let you optimize your costs on the AWS Cloud and scale your application’s throughput up to 10 times for the same budget. Spot Instances can be interrupted with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. These can include analytics, containerized workloads, high performance computing (HPC), stateless web servers, rendering, CI/CD, and queue worker nodes, which are the focus of this post.

Worker tiers of a decoupled application are typically fault-tolerant, which makes them prime candidates for running on interruptible capacity. Pairing Amazon SQS with a worker tier that runs on Spot Instances allows for more robust, cost-optimized applications.

By using EC2 Auto Scaling groups with multiple instance types that you configured as suitable for your application (for example, m4.xlarge, m5.xlarge, c5.xlarge, and c4.xlarge, in multiple Availability Zones), you can spread the worker tier’s compute capacity across many Spot capacity pools (a combination of instance type and Availability Zone). This increases the chance of achieving the scale that’s required for the worker tier to ingest messages from the queue, and of keeping that scale when Spot Instance interruptions occur, while selecting the lowest-priced Spot Instances in each Availability Zone.
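
To make the idea of a Spot capacity pool concrete, here is a minimal Python sketch that enumerates the pools produced by the instance types named above. The Availability Zone names are hypothetical and only serve as an illustration.

```python
from itertools import product


def spot_capacity_pools(instance_types, availability_zones):
    """Return every (instance type, Availability Zone) pair, one per Spot capacity pool."""
    return list(product(instance_types, availability_zones))


# Instance types from the example above; the zone names are assumptions.
INSTANCE_TYPES = ["m4.xlarge", "m5.xlarge", "c5.xlarge", "c4.xlarge"]
AVAILABILITY_ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

pools = spot_capacity_pools(INSTANCE_TYPES, AVAILABILITY_ZONES)
print(len(pools))  # 12 pools: 4 instance types x 3 Availability Zones
```

Each additional instance type or Availability Zone multiplies the number of pools the Auto Scaling group can draw from.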

You can also choose the capacity-optimized allocation strategy for the Spot Instances in your Auto Scaling group. This strategy automatically selects instances that have a lower chance of interruption, which decreases the chances of restarting jobs due to Spot interruptions. When Spot Instances are interrupted, your Auto Scaling group automatically replenishes the capacity from a different Spot capacity pool in order to achieve your desired capacity. Read the blog post “Introducing the capacity-optimized allocation strategy for Amazon EC2 Spot Instances” for more details on how to choose the suitable allocation strategy.

We focus on three main points in this blog:

  1. Best practices for using Spot Instances with Amazon SQS
  2. A customer example that uses these components
  3. An example solution that can help you get started quickly

Application of Amazon SQS with Spot Instances

Amazon SQS eliminates the complexity of managing and operating message-oriented middleware. Using Amazon SQS, you can send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available. Amazon SQS is a fully managed service which allows you to set up a queue in seconds. It also allows you to use your preferred SDK to start writing and reading to and from the queue within minutes.

In the following example, we describe an AWS architecture that brings together the Amazon SQS queue and an EC2 Auto Scaling group running Spot Instances. The architecture is used for decoupling the worker tier from the web tier by using Amazon SQS. The example uses the instance protection feature (which we will explain later in this post) to ensure that an instance currently processing a job does not get terminated by the Auto Scaling group when it detects that a scale-in activity is required due to a Dynamic Scaling Policy.

AWS reference architecture used for decoupling the worker tier from the web tier by using Amazon SQS

Customer Example: How Trax Retail uses Auto Scaling groups with Spot Instances in their Amazon SQS application

Trax decided to run its queue worker tier exclusively on Spot Instances due to the fault-tolerant nature of its architecture and for cost-optimization purposes. The company digitizes the physical world of retail using Computer Vision. Their ‘Trax Factory’ transforms individual shelf images into data and insights about retail store conditions.

Built using asynchronous event-driven architecture, Trax Factory is a cluster of microservices in which the completion of one service triggers the activation of another service. The worker tier uses Auto Scaling groups with dynamic scaling policies to increase and decrease the number of worker nodes in the worker tier.

You can create a Dynamic Scaling Policy by doing the following:

  1. Observe an Amazon CloudWatch metric. Watch the metric for the current number of messages in the Amazon SQS queue (ApproximateNumberOfMessagesVisible).
  2. Create a CloudWatch alarm. Base this alarm on the metric from the prior step.
  3. Use your CloudWatch alarm in a Dynamic Scaling Policy. Use this policy to increase and decrease the number of EC2 instances in the Auto Scaling group.
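
As a rough illustration of step 2, the following Python sketch builds the arguments for a CloudWatch alarm on the queue-depth metric. The alarm name, threshold, queue name, and policy ARN are hypothetical placeholders, and the choice of statistic and period is an assumption rather than a recommendation.

```python
def queue_depth_alarm_params(queue_name, threshold, scaling_policy_arn):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"{queue_name}-backlog-high",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Average",
        "Period": 300,  # seconds; matches a five-minute metric interval
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [scaling_policy_arn],  # the scaling policy to trigger
    }


# Placeholder values for illustration only.
params = queue_depth_alarm_params("worker-queue", 100, "arn:aws:autoscaling:EXAMPLE")
print(params["MetricName"])
```

With boto3 installed and credentials configured, the dictionary could be passed along as `boto3.client("cloudwatch").put_metric_alarm(**params)`.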

In Trax’s case, the number of messages in the queue is highly variable, so they enhanced this approach to minimize the time it takes to scale. They built a service that calls the SQS API to find the current number of messages in the queue more frequently, instead of waiting for the five-minute metric refresh interval in CloudWatch.
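
The post does not show Trax’s service, so the following Python function is only a guess at the kind of calculation such a service might perform. It turns a queue depth into a worker count, and the messages-per-worker ratio and group bounds are hypothetical tuning values.

```python
import math


def desired_worker_count(visible_messages, messages_per_worker, min_size, max_size):
    """Translate a queue backlog into a worker count clamped to the group's bounds."""
    if messages_per_worker <= 0:
        raise ValueError("messages_per_worker must be positive")
    wanted = math.ceil(visible_messages / messages_per_worker)
    return max(min_size, min(max_size, wanted))


print(desired_worker_count(95, 10, 1, 50))  # 10
```

The queue depth could come from the SQS GetQueueAttributes API (the ApproximateNumberOfMessages attribute), and the result could be applied through the Auto Scaling SetDesiredCapacity API.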

Trax ensures that its applications are always scaled to meet the demand by leveraging the inherent elasticity of Amazon EC2 instances. This elasticity ensures that end users are never affected and/or service-level agreements (SLA) are never violated.

With a Dynamic Scaling Policy, the Auto Scaling group can detect when the number of messages in the queue has decreased, so that it can initiate a scale-in activity. The Auto Scaling group uses its configured termination policy for selecting the instances to be terminated. However, this policy poses the risk that the Auto Scaling group might select an instance for termination while that instance is currently processing an image. That instance’s work would be lost (although the image would eventually be processed by reappearing in the queue and getting picked up by another worker node).

To decrease this risk, you can use Auto Scaling group instance protection. This means that every time an instance fetches a job from the queue, it also sends an API call to EC2 Auto Scaling to protect itself from scale-in. The Auto Scaling group does not select the protected, working instance for termination until the instance finishes processing the job and calls the API to remove the protection.
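
One way to express this protect-then-release pattern is a Python context manager, sketched below. It assumes `autoscaling` is a boto3 Auto Scaling client (or any object with the same `set_instance_protection` method); the group name and instance ID are supplied by the caller.

```python
from contextlib import contextmanager


@contextmanager
def scale_in_protection(autoscaling, group_name, instance_id):
    """Protect an instance from scale-in for the duration of a job."""

    def set_protection(enabled):
        autoscaling.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=group_name,
            ProtectedFromScaleIn=enabled,
        )

    set_protection(True)
    try:
        yield
    finally:
        # Always remove protection, even if the job raised an exception.
        set_protection(False)
```

A worker would then wrap each job as `with scale_in_protection(client, group, instance_id): process(job)`, so protection is removed whether the job succeeds or fails.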

Handling Spot Instance interruptions

This instance-protection solution ensures that no work is lost during scale-in activities. However, protecting from scale-in does not work when an instance is marked for termination due to Spot Instance interruptions. These interruptions occur when there’s increased demand for On-Demand Instances in the same capacity pool (a combination of an instance type in an Availability Zone).

Applications can minimize the impact of a Spot Instance interruption. To do so, an application catches the two-minute interruption notification (available in the instance’s metadata), and instructs itself to stop fetching jobs from the queue. If an image is still being processed when the two minutes expire and the instance is terminated, the application never deletes the message from the queue. Instead, the message simply becomes visible again for another instance to pick up and process after the Amazon SQS visibility timeout expires.
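
A minimal Python sketch of this behavior follows. It polls the `spot/instance-action` path of the instance metadata service, which returns HTTP 404 until an interruption is scheduled. The sketch assumes the metadata service accepts requests without a session token; instances that enforce IMDSv2 would need to fetch a token first. The `receive_job` and `process_job` callables are hypothetical stand-ins for the application’s own queue and processing logic.

```python
import json
import urllib.error
import urllib.request

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def fetch_interruption_notice(url=SPOT_ACTION_URL, timeout=1.0):
    """Return the interruption notice as a dict, or None when none is pending."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        if error.code == 404:  # no interruption is scheduled
            return None
        raise


def run_worker(receive_job, process_job, check_interruption=fetch_interruption_notice):
    """Process jobs until the queue is empty or an interruption notice appears."""
    processed = 0
    while check_interruption() is None:
        job = receive_job()
        if job is None:  # nothing left to do in this simplified loop
            break
        process_job(job)
        processed += 1
    return processed
```

The check runs before each fetch, so once a notice appears the worker takes no new jobs and any job left unfinished returns to the queue when its visibility timeout expires.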

Alternatively, you can release any ongoing job back to the queue upon receiving a Spot Instance interruption notification by setting the visibility timeout of the specific message to 0. Doing so potentially decreases the total time it takes to process the message, because another instance can pick it up immediately.
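
Releasing a message could look like the following Python sketch. It assumes `sqs` is a boto3 SQS client (or an object with the same `change_message_visibility` method) and that the worker kept the receipt handle of the message it is processing.

```python
def release_message(sqs, queue_url, receipt_handle):
    """Make an in-flight message visible again immediately."""
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=0,  # zero seconds: another worker can receive it right away
    )
```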

Testing the solution

If you’re not currently using Spot Instances in your queue worker tier, we suggest testing the approach described in this post.

For that purpose, we built a simple solution to demonstrate the capabilities mentioned in this post, using an AWS CloudFormation template. The stack includes an Amazon Simple Storage Service (S3) bucket with a CloudWatch trigger to push notifications to an SQS queue after an image is uploaded to the Amazon S3 bucket. Once the message is in the queue, it is picked up by the application running on the EC2 instances in the Auto Scaling group. Then, the image is converted to PDF, and the instance is protected from scale-in for as long as it has an active processing job.

To see the solution in action, deploy the CloudFormation template. Then upload an image to the Amazon S3 bucket. In the Auto Scaling Groups console, check the instance protection status on the Instances tab. The protection status is shown in the following screenshot.

instance protection status in console

You can also see the application logs using CloudWatch Logs:

/usr/local/bin/convert-worker.sh: Found 1 messages in https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB

/usr/local/bin/convert-worker.sh: Found work to convert. Details: INPUT=Capture1.PNG, FNAME=capture1, FEXT=png

/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --protected-from-scale-in

/usr/local/bin/convert-worker.sh: Convert done. Copying to S3 and cleaning up

/usr/local/bin/convert-worker.sh: Running: aws s3 cp /tmp/capture1.pdf s3://qtest-s3bucket-18fdpm2j17wxx

/usr/local/bin/convert-worker.sh: Running: aws sqs --output=json delete-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB --receipt-handle

/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --no-protected-from-scale-in

Conclusion

This post helps you architect fault-tolerant worker tiers in a cost-optimized way. If your queue worker tiers are fault-tolerant and use the built-in Amazon SQS features, you can increase your application’s resilience and take advantage of Spot Instances to save up to 90% on compute costs.

In this post, we emphasized several best practices to help get you started saving money using Amazon SQS and Spot Instances. The main best practices are:

  • Diversifying your Spot Instances using Auto Scaling groups, and selecting the right Spot allocation strategy
  • Protecting instances from scale-in activities while they process jobs
  • Using the Spot interruption notification so that the application stops polling the queue for new jobs before the instance is terminated

We hope you found this post useful. If you’re not using Spot Instances in your queue worker tier, we suggest testing the approach described here. Finally, we would like to thank the Trax team for sharing its architecture and best practices. If you want to learn more, watch the “This is my architecture” video featuring Trax and their solution.

We’d love your feedback. Please comment and let us know what you think.


About the authors

Ran Sheinberg is a specialist solutions architect for EC2 Spot Instances with Amazon Web Services. He works with AWS customers on cost-optimizing their compute spend by using Spot Instances across different types of workloads: stateless web applications, queue workers, containerized workloads, analytics, HPC, and others.

As a Principal Developer Advocate for EC2 Spot at AWS, Chad’s job is to make sure our customers are saving at scale by using EC2 Spot Instances to take advantage of the most cost-effective way to purchase compute capacity. Follow him on Twitter: @schmutze

Processing batch jobs quickly, cost-efficiently, and reliably with Amazon EC2 On-Demand and Spot Instances

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/processing-batch-jobs-quickly-cost-efficiently-and-reliably-with-amazon-ec2-on-demand-and-spot-instances/

This post is contributed by Alex Kimber, Global Solutions Architect

No one asks for their High Performance Computing (HPC) jobs to take longer, cost more, or have more variability in the time to get results. Fortunately, you can combine Amazon EC2 and Amazon EC2 Auto Scaling to make the delivery of batch workloads fast, cost-efficient, and reliable. Spot Instances offer spare AWS compute power at a considerable discount. Customers such as Yelp, NASA, and FINRA use them to reduce costs and get results faster.

This post outlines an approach that combines On-Demand Instances and Spot Instances to balance a predictable delivery of HPC results with an opportunistic approach to cost optimization.

Prerequisites

This approach will be demonstrated via a simple batch-processing environment with the following components:

  • A producer Python script to generate batches of tasks to process. You can develop this script in the AWS Cloud9 development environment. This solution also uses the environment to run the script and generate tasks.
  • An Amazon SQS queue to manage the tasks.
  • A consumer Python script to take incomplete tasks from the queue, simulate work, and then remove them from the queue after they’re complete.
  • Amazon EC2 Auto Scaling groups to model scenarios.
  • Amazon CloudWatch alarms to trigger the Auto Scaling groups and detect whether the queue is empty. The EC2 instances run the consumer script in a loop on startup.

Testing On-Demand Instances

In this scenario, an HPC batch of 6,000 tasks must complete within five hours. Each task takes eight minutes to complete on a single vCPU.

A simple approach to meeting the target is to provision 160 vCPUs using 20 c5.2xlarge On-Demand Instances. Each of the instances should complete 60 tasks per hour, completing the batch in approximately five hours. This approach provides an adequate level of predictability. You can test this approach with a simple Auto Scaling group configuration, set to create 20 c5.2xlarge instances if the queue has any pending visible messages. As expected, the batch takes approximately five hours, as shown in the following screenshot.

In the Ireland Region, a c5.2xlarge On-Demand Instance costs $0.384 per hour. Running 20 of these instances for five hours brings the batch total to $38.40.
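
The sizing and cost figures for this scenario can be reproduced with a few lines of Python. All inputs come from the scenario above, plus the fact that a c5.2xlarge instance has 8 vCPUs.

```python
import math

TASKS = 6000
MINUTES_PER_TASK = 8            # on a single vCPU
DEADLINE_HOURS = 5
VCPUS_PER_INSTANCE = 8          # c5.2xlarge
PRICE_PER_INSTANCE_HOUR = 0.384  # USD, On-Demand, Ireland Region

vcpu_hours = TASKS * MINUTES_PER_TASK / 60
vcpus_needed = math.ceil(vcpu_hours / DEADLINE_HOURS)
instances = math.ceil(vcpus_needed / VCPUS_PER_INSTANCE)
batch_cost = instances * DEADLINE_HOURS * PRICE_PER_INSTANCE_HOUR

print(vcpus_needed, instances, round(batch_cost, 2))  # 160 20 38.4
```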

Testing On-Demand and Spot Instances

The alternative approach to the scenario also provisions sufficient On-Demand capacity to meet the target time, in this case 20 instances. This approach gives confidence that you can meet the batch target of five hours regardless of what other capacity you add.

You can then configure the Auto Scaling group to also add a number of Spot Instances. These instances are more numerous, with the aim of delivering the results at a lower cost and also allowing the batch to complete much earlier than it would otherwise. When the queue is empty, the Auto Scaling group automatically terminates all of the instances involved to prevent further charges. This example configures the Auto Scaling group to have 80 instances in total, with 20 On-Demand Instances and 60 Spot Instances. Selecting multiple different instance types is a good strategy to help secure Spot capacity by diversification.

Spot Instances occasionally experience interruptions when AWS must reclaim the capacity with a two-minute warning. You can handle this occurrence gracefully by configuring your batch processor code to react to the interruption, such as checkpointing progress to some data store. This example has the SQS visibility timeout on the SQS queue set to nine minutes, so SQS re-queues any task that doesn’t complete in that time.
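
The consumer script itself is not shown in this post, so the following Python function is only a sketch of how a consumer can rely on the visibility timeout. It assumes `sqs` is a boto3 SQS client (or an object with the same methods) and `handle_task` is a hypothetical function that does the work. The 540-second value mirrors the nine-minute timeout mentioned above, here passed per receive call rather than set on the queue.

```python
def consume_one(sqs, queue_url, handle_task, visibility_timeout=540, wait_seconds=20):
    """Receive one task, process it, and delete it only after it completes."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=wait_seconds,          # long polling
        VisibilityTimeout=visibility_timeout,  # task stays hidden while being processed
    )
    messages = response.get("Messages", [])
    if not messages:
        return False
    message = messages[0]
    # If this raises or the instance is interrupted, the message is never
    # deleted and reappears in the queue once the visibility timeout expires.
    handle_task(message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return True
```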

To test the impact of the new configuration, another 6,000 tasks are submitted into the SQS queue. The Auto Scaling group quickly provisions 20 On-Demand and 60 Spot Instances.

The instances then quickly set to work on the queue.

The batch completes in approximately 30 minutes, which is a significant improvement. This result is due to the additional Spot Instance capacity, which gave a total of 2,140 vCPUs.

The batch used the following instances for 30 minutes.

Instance type | Provisioning | Host count | Hourly instance cost | Total 30-minute batch cost
c5.18xlarge | Spot | 15 | $1.2367 | $9.2753
c5.2xlarge | Spot | 22 | $0.1547 | $1.7017
c5.4xlarge | Spot | 12 | $0.2772 | $1.6632
c5.9xlarge | Spot | 11 | $0.6239 | $3.4315
c5.2xlarge | On-Demand | 13 | $0.3840 | $2.4960
c5.4xlarge | On-Demand | 3 | $0.7680 | $1.1520
c5.9xlarge | On-Demand | 4 | $1.7280 | $3.4560

The total cost is $23.18, which is approximately 60 percent of the On-Demand cost and allows you to compute the batch 10 times faster. This example also shows no interruptions to the Spot Instances.
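
The totals can be checked against the table with a short Python calculation. The host counts and prices are copied from the table; the vCPU counts per instance type are the published C5 sizes.

```python
VCPUS = {"c5.2xlarge": 8, "c5.4xlarge": 16, "c5.9xlarge": 36, "c5.18xlarge": 72}

# (instance type, provisioning, host count, hourly price in USD), from the table.
FLEET = [
    ("c5.18xlarge", "Spot", 15, 1.2367),
    ("c5.2xlarge", "Spot", 22, 0.1547),
    ("c5.4xlarge", "Spot", 12, 0.2772),
    ("c5.9xlarge", "Spot", 11, 0.6239),
    ("c5.2xlarge", "On-Demand", 13, 0.3840),
    ("c5.4xlarge", "On-Demand", 3, 0.7680),
    ("c5.9xlarge", "On-Demand", 4, 1.7280),
]
BATCH_HOURS = 0.5
ON_DEMAND_ONLY_COST = 38.40  # the five-hour On-Demand batch from the first test

total_hosts = sum(count for _, _, count, _ in FLEET)
total_vcpus = sum(VCPUS[itype] * count for itype, _, count, _ in FLEET)
total_cost = sum(count * price * BATCH_HOURS for _, _, count, price in FLEET)

print(total_hosts, total_vcpus, round(total_cost, 2))  # 80 2140 23.18
print(round(total_cost / ON_DEMAND_ONLY_COST, 2))      # 0.6
```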

Summary

This post demonstrated that by combining On-Demand and Spot Instances you can improve the performance of a loosely coupled HPC batch workload without compromising on the predictability of runtime. This approach balances reliability with improved performance while reducing costs. The use of Auto Scaling groups and CloudWatch alarms makes the solution largely automated, responding to demand and provisioning and removing capacity as required.

Introducing the capacity-optimized allocation strategy for Amazon EC2 Spot Instances

Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/

AWS announces the new capacity-optimized allocation strategy for Amazon EC2 Auto Scaling and EC2 Fleet. This new strategy automatically makes the most efficient use of spare capacity while still taking advantage of the steep discounts offered by Spot Instances. It’s a new way for you to gain easy access to extra EC2 compute capacity in the AWS Cloud.

This post compares how the capacity-optimized allocation strategy deploys capacity with how the current lowest-price allocation strategy does.

Overview

Spot Instances are spare EC2 compute capacity in the AWS Cloud available to you at savings of up to 90% compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by EC2 with two minutes of notification when EC2 needs the capacity back.

When making requests for Spot Instances, customers can take advantage of allocation strategies within services such as EC2 Auto Scaling and EC2 Fleet. The allocation strategy determines how the Spot portion of your request is fulfilled from the possible Spot Instance pools you provide in the configuration.

The existing allocation strategy available in EC2 Auto Scaling and EC2 Fleet is called “lowest-price” (with an option to diversify across N pools). This strategy allocates capacity strictly based on the lowest-priced Spot Instance pool or pools. The “diversified” allocation strategy (available in EC2 Fleet but not in EC2 Auto Scaling) spreads your Spot Instances across all the Spot Instance pools you’ve specified as evenly as possible.

As the AWS global infrastructure has grown over time in terms of geographic Regions and Availability Zones as well as the raw number of EC2 Instance families and types, so has the amount of spare EC2 capacity. Therefore it is important that customers have access to tools to help them utilize spare EC2 capacity optimally. The new capacity-optimized strategy for both EC2 Auto Scaling and EC2 Fleet provisions Spot Instances from the most-available Spot Instance pools by analyzing capacity metrics.

Walkthrough

To illustrate how the capacity-optimized allocation strategy deploys capacity compared to the existing lowest-price allocation strategy, here are examples of Auto Scaling group configurations and use cases for each strategy.

Lowest-price (diversified over N pools) allocation strategy

The lowest-price allocation strategy deploys Spot Instances from the pools with the lowest price in each Availability Zone. This strategy has an optional modifier SpotInstancePools that provides the ability to diversify over the N lowest-priced pools in each Availability Zone.

Spot pricing changes slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. The lowest-price strategy does not account for pool capacity depth as it deploys Spot Instances.

As a result, the lowest-price allocation strategy is a good choice for workloads with a low cost of interruption that want the lowest possible prices, such as:

  • Time-insensitive workloads
  • Extremely transient workloads
  • Workloads that are easily check-pointed and restarted

Example

The following example configuration shows how capacity could be allocated in an Auto Scaling group using the lowest-price allocation strategy diversified over two pools:

{
  "AutoScalingGroupName": "runningAmazonEC2WorkloadsAtScale",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {
          "InstanceType": "c3.large"
        },
        {
          "InstanceType": "c4.large"
        },
        {
          "InstanceType": "c5.large"
        }
      ]
    },
    "InstancesDistribution": {
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "lowest-price",
      "SpotInstancePools": 2
    }
  },
  "MinSize": 10,
  "MaxSize": 100,
  "DesiredCapacity": 60,
  "HealthCheckType": "EC2",
  "VPCZoneIdentifier": "subnet-a1234567890123456,subnet-b1234567890123456,subnet-c1234567890123456"
}

In this configuration, you request 60 Spot Instances because DesiredCapacity is set to 60 and OnDemandPercentageAboveBaseCapacity is set to 0. The example follows Spot best practices and is flexible across c3.large, c4.large, and c5.large in us-east-1a, us-east-1b, and us-east-1c (mapped according to the subnets in VPCZoneIdentifier). The Spot allocation strategy is set to lowest-price over two SpotInstancePools.

First, EC2 Auto Scaling tries to make sure that it balances the requested capacity across all the Availability Zones provided in the request. To do so, it splits the target capacity request of 60 across the three zones. Then, the lowest-price allocation strategy allocates the Spot Instance launches to the two lowest-priced pools per zone, as set by SpotInstancePools.

Using the example Spot prices shown in the following table, the resulting allocation is:

  • 20 Spot Instances from us-east-1a (10 c3.large, 10 c4.large)
  • 20 Spot Instances from us-east-1b (10 c3.large, 10 c4.large)
  • 20 Spot Instances from us-east-1c (10 c3.large, 10 c4.large)

Availability Zone | Instance type | Spot Instances allocated | Spot price
us-east-1a | c3.large | 10 | $0.0294
us-east-1a | c4.large | 10 | $0.0308
us-east-1a | c5.large | 0 | $0.0408
us-east-1b | c3.large | 10 | $0.0294
us-east-1b | c4.large | 10 | $0.0308
us-east-1b | c5.large | 0 | $0.0387
us-east-1c | c3.large | 10 | $0.0294
us-east-1c | c4.large | 10 | $0.0331
us-east-1c | c5.large | 0 | $0.0353

The cost for this Auto Scaling group is $1.83/hour. However, the Spot Instances are allocated according to the lowest price and are not optimized for capacity. The Auto Scaling group could experience more interruptions if the lowest-priced Spot Instance pools are not as deep as others, since upon interruption the Auto Scaling group will attempt to re-provision instances into the lowest-priced Spot Instance pools.
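
The allocation and hourly cost above can be reproduced with a simplified Python model of the lowest-price strategy. This is a sketch, not how EC2 Auto Scaling is implemented: it assumes capacity divides evenly and that every pool has enough capacity to fulfil its share.

```python
# Spot prices in USD per hour, copied from the table above.
PRICES = {
    ("us-east-1a", "c3.large"): 0.0294,
    ("us-east-1a", "c4.large"): 0.0308,
    ("us-east-1a", "c5.large"): 0.0408,
    ("us-east-1b", "c3.large"): 0.0294,
    ("us-east-1b", "c4.large"): 0.0308,
    ("us-east-1b", "c5.large"): 0.0387,
    ("us-east-1c", "c3.large"): 0.0294,
    ("us-east-1c", "c4.large"): 0.0331,
    ("us-east-1c", "c5.large"): 0.0353,
}


def lowest_price_allocation(prices, desired_capacity, pools_per_zone):
    """Split capacity evenly across zones, then across the N cheapest pools per zone."""
    zones = sorted({zone for zone, _ in prices})
    per_zone = desired_capacity // len(zones)
    allocation = {}
    for zone in zones:
        ranked = sorted((price, itype) for (z, itype), price in prices.items() if z == zone)
        for _, itype in ranked[:pools_per_zone]:
            allocation[(zone, itype)] = per_zone // pools_per_zone
    return allocation


def hourly_cost(allocation, prices):
    """Sum the hourly Spot cost of every allocated instance."""
    return sum(count * prices[pool] for pool, count in allocation.items())


allocation = lowest_price_allocation(PRICES, desired_capacity=60, pools_per_zone=2)
print(round(hourly_cost(allocation, PRICES), 2))  # 1.83
```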

Capacity-optimized allocation strategy

There is a price associated with interruptions, restarting work, and checkpointing. While the overall hourly cost of capacity-optimized allocation strategy might be slightly higher, the possibility of having fewer interruptions can lower the overall cost of your workload.

The effectiveness of the capacity-optimized allocation strategy depends on following Spot best practices by being flexible and providing as many instance types and Availability Zones (Spot Instance pools) as possible in the configuration. It is also important to understand that as capacity demands change, the allocations provided by this strategy also change over time.

Remember that Spot pricing changes slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. The capacity-optimized strategy does account for pool capacity depth as it deploys Spot Instances, but it does not account for Spot prices.

As a result, the capacity-optimized allocation strategy is a good choice for workloads with a high cost of interruption, such as:

  • Big data and analytics
  • Image and media rendering
  • Machine learning
  • High performance computing

Example

The following example configuration shows how capacity could be allocated in an Auto Scaling group using the capacity-optimized allocation strategy:

{
  "AutoScalingGroupName": "runningAmazonEC2WorkloadsAtScale",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {
          "InstanceType": "c3.large"
        },
        {
          "InstanceType": "c4.large"
        },
        {
          "InstanceType": "c5.large"
        }
      ]
    },
    "InstancesDistribution": {
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "capacity-optimized"
    }
  },
  "MinSize": 10,
  "MaxSize": 100,
  "DesiredCapacity": 60,
  "HealthCheckType": "EC2",
  "VPCZoneIdentifier": "subnet-a1234567890123456,subnet-b1234567890123456,subnet-c1234567890123456"
}

In this configuration, you request 60 Spot Instances because DesiredCapacity is set to 60 and OnDemandPercentageAboveBaseCapacity is set to 0. The example follows Spot best practices (especially critical when using the capacity-optimized allocation strategy) and is flexible across c3.large, c4.large, and c5.large in us-east-1a, us-east-1b, and us-east-1c (mapped according to the subnets in VPCZoneIdentifier). The Spot allocation strategy is set to capacity-optimized.
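
The two example configurations differ only in their InstancesDistribution block. The following Python helper, a hypothetical convenience function rather than part of any AWS SDK, shows the change when switching strategies. It drops SpotInstancePools for capacity-optimized because that setting applies only to the lowest-price strategy.

```python
import copy


def with_spot_allocation_strategy(config, strategy, spot_instance_pools=None):
    """Return a copy of an Auto Scaling group config that uses the given Spot strategy."""
    updated = copy.deepcopy(config)
    distribution = updated["MixedInstancesPolicy"]["InstancesDistribution"]
    distribution["SpotAllocationStrategy"] = strategy
    if strategy != "lowest-price":
        # SpotInstancePools is only meaningful for the lowest-price strategy.
        distribution.pop("SpotInstancePools", None)
    elif spot_instance_pools is not None:
        distribution["SpotInstancePools"] = spot_instance_pools
    return updated


lowest_price_config = {
    "MixedInstancesPolicy": {
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "lowest-price",
            "SpotInstancePools": 2,
        }
    }
}
capacity_optimized_config = with_spot_allocation_strategy(lowest_price_config, "capacity-optimized")
print(capacity_optimized_config["MixedInstancesPolicy"]["InstancesDistribution"])
```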

First, EC2 Auto Scaling tries to make sure that the requested capacity is evenly balanced across all the Availability Zones provided in the request. To do so, it splits the target capacity request of 60 across the three zones. Then, the capacity-optimized allocation strategy optimizes the Spot Instance launches by analyzing capacity metrics per instance type per zone. In other words, this strategy optimizes for available capacity rather than for the lowest price (hence its name).

Using the example Spot prices shown in the following table, the resulting allocation is:

  • 20 Spot Instances from us-east-1a (20 c4.large)
  • 20 Spot Instances from us-east-1b (20 c3.large)
  • 20 Spot Instances from us-east-1c (20 c5.large)

Availability Zone | Instance type | Spot Instances allocated | Spot price
us-east-1a | c3.large | 0 | $0.0294
us-east-1a | c4.large | 20 | $0.0308
us-east-1a | c5.large | 0 | $0.0408
us-east-1b | c3.large | 20 | $0.0294
us-east-1b | c4.large | 0 | $0.0308
us-east-1b | c5.large | 0 | $0.0387
us-east-1c | c3.large | 0 | $0.0294
us-east-1c | c4.large | 0 | $0.0308
us-east-1c | c5.large | 20 | $0.0353

The cost for this Auto Scaling group is $1.91/hour, only 5% more than the lowest-priced example above. However, notice the distribution of the Spot Instances is different. This is because the capacity-optimized allocation strategy determined this was the most efficient distribution from an available capacity perspective.
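
The hourly cost and the premium over the lowest-price example can be verified with a short Python calculation. The allocations and prices come from the table above, and the $1.83 baseline is the rounded figure quoted for the lowest-price example.

```python
# (Availability Zone, instance type): (instances allocated, Spot price in USD per hour)
CAPACITY_OPTIMIZED = {
    ("us-east-1a", "c4.large"): (20, 0.0308),
    ("us-east-1b", "c3.large"): (20, 0.0294),
    ("us-east-1c", "c5.large"): (20, 0.0353),
}
LOWEST_PRICE_HOURLY_COST = 1.83  # from the lowest-price example earlier in the post

capacity_optimized_cost = sum(count * price for count, price in CAPACITY_OPTIMIZED.values())
premium = capacity_optimized_cost / LOWEST_PRICE_HOURLY_COST - 1

print(round(capacity_optimized_cost, 2), f"{premium:.1%}")  # 1.91 4.4%
```

With these example prices the premium works out to between 4 and 5 percent.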

Conclusion

Consider using the new capacity-optimized allocation strategy to make the most efficient use of spare capacity. Automatically deploy into the most available Spot Instance pools—while still taking advantage of the steep discounts provided by Spot Instances.

This allocation strategy may be especially useful for workloads with a high cost of interruption, including:

  • Big data and analytics
  • Image and media rendering
  • Machine learning
  • High performance computing

No matter which allocation strategy you choose, you still enjoy the steep discounts provided by Spot Instances. These discounts are possible thanks to the stable Spot pricing made available with the new Spot pricing model.

Chad Schmutzer is a Principal Developer Advocate for the EC2 Spot team. Follow him on twitter to get the latest updates on saving at scale with Spot Instances, to provide feedback, or just say HI.

Running your game servers at scale for up to 90% lower compute cost

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/running-your-game-servers-at-scale-for-up-to-90-lower-compute-cost/

This post is contributed by Yahav Biran, Chad Schmutzer, and Jeremy Cowan, Solutions Architects at AWS

Many successful video games such as Fortnite: Battle Royale, Warframe, and Apex Legends use a free-to-play model, which offers players access to a portion of the game without paying. Players no longer accept low quality from such games and expect premium-like quality. The business model is therefore constrained on cost, and Amazon EC2 Spot Instances offer a viable low-cost compute option. Casual multiplayer games naturally fit the Spot offering. With the orchestration of containers on Amazon EKS and the mechanisms available to minimize player impact and optimize cost when running multiplayer game-server workloads, both casual and hardcore multiplayer games fit the Spot Instance offering.

Spot Instances offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand Instances. Spot Instances enable you to optimize your costs and scale your application’s throughput up to 10 times for the same budget. Spot Instances are best suited for fault-tolerant workloads. Multiplayer game-servers are no exception: a game-server state is updated using real-time player inputs, which makes the server state transient. Game-server workloads can be disposable and take advantage of Spot Instances to save up to 90% on compute cost. In this blog, we share how to architect your game-server workloads to handle interruptions and effectively use Spot Instances.

Characteristics of game-server workloads

Simply put, multiplayer game servers spend most of their life updating current character position and state (mostly animation). The rest of the time is spent on image updates that result from combat actions, moves, and other game-related events. More specifically, game servers’ CPUs are busy doing network I/O operations by accepting client positions, calculating the new game state, and multi-casting the game state back to the clients. That makes a game server workload a good fit for general-purpose instance types for casual multiplayer games and, preferably, compute-optimized instance types for the hardcore multiplayer games.

AWS provides a wide variety of both compute-optimized (C5 and C4) and general-purpose (M5) instance types with Amazon EC2 Spot Instances. Because capacities fluctuate independently for each instance type in an Availability Zone, you can often get more compute capacity for the same price when using a wide range of instance types. For more information on Spot Instance best practices, see Getting Started with Amazon EC2 Spot Instances.

One solution that customers use for running dedicated game-servers is Amazon GameLift. This solution deploys a fleet of Amazon GameLift FleetIQ and Spot Instances in an AWS Region. FleetIQ places new sessions on game servers based on player latencies, instance prices, and Spot Instance interruption rates so that you don’t need to worry about Spot Instance interruptions. For more information, see Reduce Cost by up to 90% with Amazon GameLift FleetIQ and Spot Instances on the AWS Game Tech Blog.

In other cases, you can use game-server deployment patterns like containers-based orchestration (such as Kubernetes, Swarm, and Amazon ECS) for deploying multiplayer game servers. Those systems manage a large number of game-servers deployed as Docker containers across several Regions. The rest of this blog focuses on this containerized game-server solution. Containers fit the game-server workload because they’re lightweight, start quickly, and allow for greater utilization of the underlying instance.

Why use Amazon EC2 Spot Instances?

Spot Instances are a good choice for running a disposable game-server workload because their built-in two-minute interruption notice allows for graceful handling. The two-minute termination notification enables the game server to act upon interruption. We demonstrate two examples for notification handling through Instance Metadata and Amazon CloudWatch. For more information, see the “Interruption handling” and “What if I want my game server to be redundant?” segments later in this blog.

Spot Instances also offer a variety of EC2 instance types that fit game-server workloads, such as general-purpose (M5) and compute-optimized (C4 and C5). Finally, Spot Instances provide low termination rates. The Spot Instance Advisor can help you choose a good starting point for determining which instance types have lower historical interruption rates.

Interruption handling

Avoiding player impact is key when using Spot Instances. Here is a strategy to avoid player impact that we apply in the proposed reference architecture and code examples available at Spotable Game Server on GitHub. Specifically, for Amazon EKS, node drainage requires draining the node via the kubectl drain command. This makes the node unschedulable and evicts the pods currently running on the node with a graceful termination period (terminationGracePeriodSeconds) that might impact the player experience. As a result, pods continue to run while a signal is sent to the game to end it gracefully.
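To make the graceful-termination idea concrete, here is a minimal Python sketch of a game loop that reacts to SIGTERM during the grace period. It is not taken from the Spotable Game Server code; the `tick` and `notify_players` callbacks and the shutdown message are hypothetical placeholders for whatever the game actually does.

```python
import signal
import threading

# Set when Kubernetes sends SIGTERM at the start of the grace period.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; the game loop performs the wind-down."""
    shutdown_requested.set()


def install_handler():
    """Register the handler; call this once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def game_loop(tick, notify_players, max_ticks=None):
    """Run ticks until SIGTERM arrives, then notify players once and return.

    tick and notify_players are hypothetical callbacks supplied by the game.
    Returns the number of ticks that were executed.
    """
    ticks = 0
    while not shutdown_requested.is_set():
        if max_ticks is not None and ticks >= max_ticks:
            return ticks
        tick()
        ticks += 1
    notify_players("Server is shutting down; the session will end shortly.")
    return ticks
```

The handler only sets a flag, so the game decides how to spend the remaining grace period (for example, saving state or ending the match) rather than being cut off mid-update.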

Node drainage

Node drainage requires an agent pod that runs as a DaemonSet on every Spot Instance host to poll for a potential Spot interruption from Amazon CloudWatch or instance metadata. We’re going to use the Instance Metadata notification. The following describes how termination events are handled with node drainage:

  1. Launch the game-server pod with a default of 120 seconds (terminationGracePeriodSeconds). As an example, see this deploy YAML file on GitHub.
  2. Provision a worker node pool with a mixed instances policy of On-Demand and Spot Instances. It uses the Spot Instance allocation strategy with the lowest price. For example, see this AWS CloudFormation template on GitHub.
  3. Use the Amazon EKS bootstrap tool (/etc/eks/bootstrap.sh in the recommended AMI) to label each node with its instance lifecycle, either OnDemand or Spot. For example:
    • OnDemand: “--kubelet-extra-args --node-labels=lifecycle=ondemand,title=minecraft,region=uswest2”
    • Spot: “--kubelet-extra-args --node-labels=lifecycle=spot,title=minecraft,region=uswest2”
  4. A DaemonSet deployed on every node polls the instance metadata endpoint for the termination status. When a termination notification arrives, the `kubectl drain node` command is executed, and a SIGTERM signal is sent to the game-server pod. You can see these commands in the batch file on GitHub.
  5. The game server keeps running for the next 120 seconds to allow the game to notify the players about the incoming termination.
  6. No new game-server is scheduled on the node to be terminated because it’s marked as unschedulable.
  7. A notification to an external system such as a matchmaking system is sent to update the current inventory of available game servers.
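The polling part of the steps above can be sketched in Python as follows. This is an illustration, not the agent from the GitHub repository: it assumes the instance metadata service answers plain (IMDSv1-style) requests, and the `on_notice` hook stands in for whatever runs the drain command. The metadata path `spot/instance-action` returns HTTP 404 until an interruption is scheduled.

```python
import json
import time
import urllib.error
import urllib.request

# EC2 instance metadata path for Spot interruption notices.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def fetch_instance_action(url=SPOT_ACTION_URL, timeout=2):
    """Return the raw notice body, or None when no interruption is pending (HTTP 404)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def parse_instance_action(body):
    """Parse a notice such as {"action": "terminate", "time": "..."} into a tuple."""
    if not body:
        return None
    notice = json.loads(body)
    return notice.get("action"), notice.get("time")


def wait_for_interruption(fetch=fetch_instance_action, on_notice=print,
                          interval=5, sleep=time.sleep):
    """Poll until a notice appears, then pass it to on_notice (for example, a drain hook)."""
    while True:
        notice = parse_instance_action(fetch())
        if notice is not None:
            on_notice(notice)
            return notice
        sleep(interval)
```

The fetch and sleep functions are injectable so that the loop can be exercised without a real metadata endpoint. A five-second interval is an arbitrary choice that leaves most of the two-minute notice for draining.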

Optimization strategies for Kubernetes specifications

This section describes a few recommended strategies for Kubernetes specifications that enable optimal game server placements on the provisioned worker nodes.

  • Use single Spot Instance Auto Scaling groups as worker nodes. To accommodate the use of multiple Auto Scaling groups, we use Kubernetes nodeSelector to control the game-server scheduling on any of the nodes in any of the Spot Instance–based Auto Scaling groups.
    nodeSelector:
      lifecycle: spot
      title: your game title

  • The lifecycle label is populated upon node creation through the AWS CloudFormation template in the following section:
    BootstrapArgumentsForSpotFleet:
      Description: Sets Node Labels to set lifecycle as Ec2Spot
      Default: "--kubelet-extra-args --node-labels=lifecycle=spot,title=minecraft,region=uswest2"
      Type: String

  • You might have a case where the incoming player actions are served by UDP and masking the interruption from the player is required. Here, the game-server allocator (a Kubernetes scheduler for us) schedules more than one game server as target upstream servers behind a UDP load balancer that multicasts any packet received to the set of game servers. After the scheduler terminates the game server upon node termination, the failover occurs seamlessly. For more information, see “What if I want my game-server to be redundant?” later in this blog.

Reference architecture

The following architecture describes an instance mix of On-Demand and Spot Instances in an Amazon EKS cluster of multiplayer game servers. Within a single VPC, control plane node pools (Master Host and Storage Host) are highly available and thus run On-Demand Instances. The game-server hosts/nodes use a mix of Spot and On-Demand Instances. The control plane API server is accessible via an Elastic Load Balancing Application Load Balancer with a preconfigured allowed list.

What if I want my game server to be redundant?

A game server is a sessionful workload, but it traditionally runs as a single dedicated game server instance with no redundancy. For game servers that use TCP as the transport network layer, AWS offers Network Load Balancers as an option for distributing player traffic across multiple game-server targets. Currently, game servers that use UDP don’t have similar load balancer solutions that add redundancy to maintain a highly available game server.

This section proposes a solution for the case where game servers deployed as containerized Amazon EKS pods use UDP as the network transport layer and are required to be highly available. We’re using the UDP load balancer because of the Spot Instances, but the choice isn’t limited to when you’re using Spot Instances.

The following diagram shows a reference architecture for the implementation of a UDP load balancer based on Amazon EKS. It requires a setup of an Amazon EKS cluster as suggested above and a set of components that simulate an architecture that supports multiplayer game services. For example, this includes a game-server inventory that captures the running game servers, their status, and allocation placement. The Amazon EKS cluster is on the left, and the proposed UDP load-balancer system is on the right. A new game server is reported to an Amazon SQS queue, and the report is persisted in an Amazon DynamoDB table. When a player’s assignment is required, the matchmaking service queries an API endpoint for an optimal available game server through the game-server inventory that uses the DynamoDB tables.

The solution includes the following main components:

  • The game server (see mockup-udp-server at GitHub). This is a simple UDP socket server that accepts a delta of a game state from connected players and multicasts the updated state based on pseudo computation back to the players. It’s a single threaded server whose goal is to prove the viability of UDP-based load balancing in dedicated game servers. The model presented here isn’t limited to this implementation. It’s deployed as a single-container Kubernetes pod that uses hostNetwork: true for network optimization.
  • The load balancer (udp-lb). This is a containerized NGINX server loaded with the stream module. The load balance upstream set is configured upon initialization based on the dedicated game-server state that is stored in the DynamoDB table game-server-status-by-endpoint. Available load balancer instances are also stored in a DynamoDB table, lb-status-by-endpoint, to be used by core game services such as a matchmaking service.
  • An Amazon SQS queue that captures the initialization and termination of game servers and load balancers instances deployed in the Kubernetes cluster.
  • DynamoDB tables that persist the state of the cluster with regards to the game servers and load balancer inventory.
  • An API operation based on AWS Lambda (game-server-inventory-api-lambda) that serves the game servers and load balancers for an updated list of resources available. The operation supports /get-available-gs needed for the load balancer to set its upstream target game servers. It also supports /set-gs-busy/{endpoint} for labeling already claimed game servers from the available game servers inventory.
  • A Lambda function (game-server-status-poller-lambda) that the Amazon SQS queue triggers and that populates the DynamoDB tables.
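As a rough illustration of the last component, the following Python sketch shows how an SQS-triggered Lambda function could turn queue messages into inventory items. It is not the code of game-server-status-poller-lambda: the message fields (`endpoint`, `status`, `node`) are assumptions, and the DynamoDB write is injected as a `put_item` callable so that the sketch stays self-contained.

```python
import json


def records_to_items(event):
    """Turn an SQS-triggered Lambda event into inventory items.

    SQS events carry messages under "Records", each with a "body" string.
    The body fields used here (endpoint, status, node) are assumptions.
    """
    items = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        items.append({
            "endpoint": body["endpoint"],
            "status": body.get("status", "unknown"),
            "node": body.get("node", ""),
        })
    return items


def handler(event, context, put_item=None):
    """Lambda-style entry point.

    put_item is injected to keep the sketch testable; in a real function it
    could be the put_item method of a boto3 DynamoDB Table resource.
    """
    items = records_to_items(event)
    for item in items:
        if put_item is not None:
            put_item(Item=item)
    return {"written": len(items)}
```

Keeping the parsing separate from the write makes it easy to check the item shape locally before pointing the function at a real table.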

Scheduling mechanism

Our goal in this example is to reduce the chance that two game servers that serve the same load-balancer game endpoint are interrupted at the same time. Therefore, we need to prevent the scheduling of the same game servers (mockup-UDP-server) on the same host. This example uses advanced scheduling in Kubernetes where the pod affinity/anti-affinity policy is being applied.

We define two soft labels, mockup-grp1 and mockup-grp2, and reference them in the podAntiAffinity section as follows:

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - mockup-grp1
              topologyKey: "kubernetes.io/hostname"

The requiredDuringSchedulingIgnoredDuringExecution tells the scheduler that the subsequent rule must be met upon pod scheduling. The rule says that pods that carry the value of key: “app” mockup-grp1 will not be scheduled on the same node as pods with key: “app” mockup-grp2 due to topologyKey: “kubernetes.io/hostname”.

When a load balancer pod (udp-lb) is scheduled, it queries the game-server-inventory-api endpoint for two game-server pods that run on different nodes. If this request isn’t fulfilled, the load balancer pod enters a crash loop until two available game servers are ready.
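The selection the load balancer relies on can be sketched as a small Python function. The inventory record shape (`endpoint`, `node`, `status`) is an assumption made for illustration; the actual game-server-inventory-api may represent servers differently.

```python
def pick_redundant_pair(servers):
    """Return two available game servers that run on different nodes, or None.

    Each server is a dict with hypothetical keys: endpoint, node, status.
    """
    available = [s for s in servers if s["status"] == "available"]
    for i, first in enumerate(available):
        for second in available[i + 1:]:
            if first["node"] != second["node"]:
                return first, second
    return None
```

Returning None when no such pair exists mirrors the behavior described above, where the load balancer cannot start until two servers on separate nodes are ready.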

Try it out

We published two examples that guide you on how to build an Amazon EKS cluster that uses Spot Instances. The first example, Spotable Game Server, creates the cluster, deploys Spot Instances, Dockerizes the game server, and deploys it. The second example, Game Server Glutamate, enhances the game-server workload and enables redundancy as a mechanism for handling Spot Instance interruptions.

Conclusion

Multiplayer game servers have relatively short-lived processes that last from a few minutes to a few hours. The current average observed life span of Spot Instances in US and EU Regions ranges from a few hours to a few days, which makes Spot Instances a good fit for game servers. Amazon GameLift FleetIQ offers native and seamless support for Spot Instances, and Amazon EKS offers mechanisms to significantly minimize the probability of interrupting the player experience. This makes Spot Instances an attractive option for not only casual multiplayer game servers but also hardcore game servers. Game studios that use Spot Instances for multiplayer game servers can save up to 90% of the compute cost, thus benefiting them as well as delighting their players.

AWS Online Tech Talks – June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-june-2018/

Join us this month to learn about AWS services and solutions. New this month, we have a fireside chat with the GM of Amazon WorkSpaces and our 2nd episode of the “How to re:Invent” series. We’ll also cover best practices, deep dives, use cases and more! Join us and register today!

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

June 18, 2018 | 11:00 AM – 11:45 AM PT – Get Started with Real-Time Streaming Data in Under 5 Minutes – Learn how to use Amazon Kinesis to capture, store, and analyze streaming data in real-time including IoT device data, VPC flow logs, and clickstream data.
June 20, 2018 | 11:00 AM – 11:45 AM PT – Insights For Everyone – Deploying Data across your Organization – Learn how to deploy data at scale using AWS Analytics and QuickSight’s new reader role and usage based pricing.

AWS re:Invent

June 13, 2018 | 05:00 PM – 05:30 PM PT – Episode 2: AWS re:Invent Breakout Content Secret Sauce – Hear from one of our own AWS content experts as we dive deep into the re:Invent content strategy and how we maintain a high bar.

Compute

June 25, 2018 | 01:00 PM – 01:45 PM PT – Accelerating Containerized Workloads with Amazon EC2 Spot Instances – Learn how to efficiently deploy containerized workloads and easily manage clusters at any scale at a fraction of the cost with Spot Instances.

June 26, 2018 | 01:00 PM – 01:45 PM PT – Ensuring Your Windows Server Workloads Are Well-Architected – Get the benefits, best practices and tools on running your Microsoft Workloads on AWS leveraging a well-architected approach.

Containers

June 25, 2018 | 09:00 AM – 09:45 AM PT – Running Kubernetes on AWS – Learn about the basics of running Kubernetes on AWS including how to set up masters, networking, security, and add auto-scaling to your cluster.

Databases

June 18, 2018 | 01:00 PM – 01:45 PM PT – Oracle to Amazon Aurora Migration, Step by Step – Learn how to migrate your Oracle database to Amazon Aurora.

DevOps

June 20, 2018 | 09:00 AM – 09:45 AM PT – Set Up a CI/CD Pipeline for Deploying Containers Using the AWS Developer Tools – Learn how to set up a CI/CD pipeline for deploying containers using the AWS Developer Tools.

Enterprise & Hybrid

June 18, 2018 | 09:00 AM – 09:45 AM PT – De-risking Enterprise Migration with AWS Managed Services – Learn how enterprise customers are de-risking cloud adoption with AWS Managed Services.

June 19, 2018 | 11:00 AM – 11:45 AM PT – Launch AWS Faster using Automated Landing Zones – Learn how the AWS Landing Zone can automate the setup of best practice baselines when setting up new AWS Environments.

June 21, 2018 | 11:00 AM – 11:45 AM PT – Leading Your Team Through a Cloud Transformation – Learn how you can help lead your organization through a cloud transformation.

June 21, 2018 | 01:00 PM – 01:45 PM PT – Enabling New Retail Customer Experiences with Big Data – Learn how AWS can help retailers realize actual value from their big data and deliver on differentiated retail customer experiences.

June 28, 2018 | 01:00 PM – 01:45 PM PT – Fireside Chat: End User Collaboration on AWS – Learn how End User Compute services can help you deliver access to desktops and applications anywhere, anytime, using any device.

IoT

June 27, 2018 | 11:00 AM – 11:45 AM PT – AWS IoT in the Connected Home – Learn how to use AWS IoT to build innovative Connected Home products.

Machine Learning

June 19, 2018 | 09:00 AM – 09:45 AM PT – Integrating Amazon SageMaker into your Enterprise – Learn how to integrate Amazon SageMaker and other AWS Services within an Enterprise environment.

June 21, 2018 | 09:00 AM – 09:45 AM PT – Building Text Analytics Applications on AWS using Amazon Comprehend – Learn how you can unlock the value of your unstructured data with NLP-based text analytics.

Management Tools

June 20, 2018 | 01:00 PM – 01:45 PM PT – Optimizing Application Performance and Costs with Auto Scaling – Learn how selecting the right scaling option can help optimize application performance and costs.

Mobile

June 25, 2018 | 11:00 AM – 11:45 AM PT – Drive User Engagement with Amazon Pinpoint – Learn how Amazon Pinpoint simplifies and streamlines effective user engagement.

Security, Identity & Compliance

June 26, 2018 | 09:00 AM – 09:45 AM PT – Understanding AWS Secrets Manager – Learn how AWS Secrets Manager helps you rotate and manage access to secrets centrally.
June 28, 2018 | 09:00 AM – 09:45 AM PT – Using Amazon Inspector to Discover Potential Security Issues – See how Amazon Inspector can be used to discover security issues of your instances.

Serverless

June 19, 2018 | 01:00 PM – 01:45 PM PT – Productionize Serverless Application Building and Deployments with AWS SAM – Learn expert tips and techniques for building and deploying serverless applications at scale with AWS SAM.

Storage

June 26, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway – Learn how you can reduce your on-premises infrastructure by using the AWS Storage Gateway to connect your applications to the scalable and reliable AWS storage services.
June 27, 2018 | 01:00 PM – 01:45 PM PT – Changing the Game: Extending Compute Capabilities to the Edge – Discover how to change the game for IIoT and edge analytics applications with AWS Snowball Edge plus enhanced Compute instances.
June 28, 2018 | 11:00 AM – 11:45 AM PT – Big Data and Analytics Workloads on Amazon EFS – Get best practices and deployment advice for running big data and analytics workloads on Amazon EFS.

AWS Online Tech Talks – April & Early May 2018

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-april-early-may-2018/

We have several upcoming tech talks in the month of April and early May. Come join us to learn about AWS services and solution offerings. We’ll have AWS experts online to help answer questions in real-time. Sign up now to learn more; we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

April & early May — 2018 Schedule

Compute

April 30, 2018 | 01:00 PM – 01:45 PM PT – Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR (300) – Learn about the best practices for scaling big data workloads as well as process, store, and analyze big data securely and cost effectively with Amazon EMR and Amazon EC2 Spot Instances.

May 1, 2018 | 01:00 PM – 01:45 PM PT – How to Bring Microsoft Apps to AWS (300) – Learn more about how to save significant money by bringing your Microsoft workloads to AWS.

May 2, 2018 | 01:00 PM – 01:45 PM PT – Deep Dive on Amazon EC2 Accelerated Computing (300) – Get a technical deep dive on how AWS’ GPU and FPGA-based compute services can help you to optimize and accelerate your ML/DL and HPC workloads in the cloud.

Containers

April 23, 2018 | 11:00 AM – 11:45 AM PT – New Features for Building Powerful Containerized Microservices on AWS (300) – Learn about how this new feature works and how you can start using it to build and run modern, containerized applications on AWS.

Databases

April 23, 2018 | 01:00 PM – 01:45 PM PT – ElastiCache: Deep Dive Best Practices and Usage Patterns (200) – Learn about Redis-compatible in-memory data store and cache with Amazon ElastiCache.

April 25, 2018 | 01:00 PM – 01:45 PM PT – Intro to Open Source Databases on AWS (200) – Learn how to tap the benefits of open source databases on AWS without the administrative hassle.

DevOps

April 25, 2018 | 09:00 AM – 09:45 AM PT – Debug your Container and Serverless Applications with AWS X-Ray in 5 Minutes (300) – Learn how AWS X-Ray makes debugging your Container and Serverless applications fun.

Enterprise & Hybrid

April 23, 2018 | 09:00 AM – 09:45 AM PT – An Overview of Best Practices of Large-Scale Migrations (300) – Learn about the tools and best practices on how to migrate to AWS at scale.

April 24, 2018 | 11:00 AM – 11:45 AM PT – Deploy your Desktops and Apps on AWS (300) – Learn how to deploy your desktops and apps on AWS with Amazon WorkSpaces and Amazon AppStream 2.0.

IoT

May 2, 2018 | 11:00 AM – 11:45 AM PT – How to Easily and Securely Connect Devices to AWS IoT (200) – Learn how to easily and securely connect devices to the cloud and reliably scale to billions of devices and trillions of messages with AWS IoT.

Machine Learning

April 24, 2018 | 09:00 AM – 09:45 AM PT – Automate for Efficiency with Amazon Transcribe and Amazon Translate (200) – Learn how you can increase the efficiency and reach of your operations with Amazon Translate and Amazon Transcribe.

April 26, 2018 | 09:00 AM – 09:45 AM PT – Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sagemaker (200) – Learn more about developing machine learning applications for the IoT edge.

Mobile

April 30, 2018 | 11:00 AM – 11:45 AM PT – Offline GraphQL Apps with AWS AppSync (300) – Come learn how to enable real-time and offline data in your applications with GraphQL using AWS AppSync.

Networking

May 2, 2018 | 09:00 AM – 09:45 AM PT – Taking Serverless to the Edge (300) – Learn how to run your code closer to your end users in a serverless fashion. Also, David Von Lehman from Aerobatic will discuss how they used Lambda@Edge to reduce latency and cloud costs for their customers’ websites.

Security, Identity & Compliance

April 30, 2018 | 09:00 AM – 09:45 AM PT – Amazon GuardDuty – Let’s Attack My Account! (300) – Amazon GuardDuty Test Drive – Practical steps on generating test findings.

May 3, 2018 | 09:00 AM – 09:45 AM PT – Protect Your Game Servers from DDoS Attacks (200) – Learn how to use the new AWS Shield Advanced for EC2 to protect your internet-facing game servers against network layer DDoS attacks and application layer attacks of all kinds.

Serverless

April 24, 2018 | 01:00 PM – 01:45 PM PT – Tips and Tricks for Building and Deploying Serverless Apps In Minutes (200) – Learn how to build and deploy apps in minutes.

Storage

May 1, 2018 | 11:00 AM – 11:45 AM PT – Building Data Lakes That Cost Less and Deliver Results Faster (300) – Learn how Amazon S3 Select And Amazon Glacier Select increase application performance by up to 400% and reduce total cost of ownership by extending your data lake into cost-effective archive storage.

May 3, 2018 | 11:00 AM – 11:45 AM PT – Integrating On-Premises Vendors with AWS for Backup (300) – Learn how to work with AWS and technology partners to build backup & restore solutions for your on-premises, hybrid, and cloud native environments.

New Amazon EC2 Spot pricing model: Simplified purchasing without bidding and fewer interruptions

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/

Contributed by Deepthi Chelupati and Roshni Pary

Amazon EC2 Spot Instances offer spare compute capacity in the AWS Cloud at steep discounts. Customers—including Yelp, NASA JPL, FINRA, and Autodesk—use Spot Instances to reduce costs and get faster results. Spot Instances provide acceleration, scale, and deep cost savings to big data workloads, containerized applications such as web services, test/dev, and many types of HPC and batch jobs.

At re:Invent 2017, we launched a new pricing model that simplified the Spot purchasing experience. The new model gives you predictable prices that adjust slowly over days and weeks, with typical savings of 70-90% over On-Demand. With the previous pricing model, some of you had to invest time and effort to analyze historical prices to determine your bidding strategy and maximum bid price. Not anymore.

How does the new pricing model work?

You don’t have to bid for Spot Instances in the new pricing model, and you just pay the Spot price that’s in effect for the current hour for the instances that you launch. It’s that simple. Now you can request Spot capacity just like you would request On-Demand capacity, without having to spend time analyzing market prices or setting a maximum bid price.

Previously, Spot Instances were terminated in ascending order of bids, and the Spot price was set to the highest unfulfilled bid. The market prices fluctuated frequently because of this. In the new model, the Spot prices are more predictable, updated less frequently, and are determined by supply and demand for Amazon EC2 spare capacity, not bid prices. You can find the price that’s in effect for the current hour in the EC2 console.

As you can see from the above Spot Instance Pricing History graph (available in the EC2 console under Spot Requests), Spot prices were volatile before the pricing model update. However, after the pricing model update, prices are more predictable and change less frequently.

In the new model, you still have the option to further control costs by submitting a “maximum price” that you are willing to pay in the console when you request Spot Instances:

You can also set your maximum price in EC2 RunInstances or RequestSpotFleet API calls, or in command line requests:

$ aws ec2 run-instances --instance-market-options \
    '{"MarketType":"spot", "SpotOptions": {"MaxPrice": "0.12"}}' \
    --image-id ami-1a2b3c4d --count 1 --instance-type c4.2xlarge

The default maximum price is the On-Demand price and you can continue to set a maximum Spot price of up to 10x the On-Demand price. That means, if you have been running applications on Spot Instances and use the RequestSpotInstances or RequestSpotFleet operations, you can continue to do so. The new Spot pricing model is backward compatible and you do not need to make any changes to your existing applications.
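For readers who call the API from code, here is a minimal Python sketch that builds the equivalent RunInstances arguments. The helper name `build_spot_run_args` is hypothetical; the `InstanceMarketOptions` structure with `MarketType` and `SpotOptions.MaxPrice` follows the RunInstances request shape, and the image ID is the placeholder used in the command above.

```python
def build_spot_run_args(image_id, instance_type, max_price=None, count=1):
    """Build keyword arguments for an EC2 RunInstances call that requests Spot capacity.

    max_price is optional: when omitted, no SpotOptions are sent and the
    default maximum price (the On-Demand price) applies.
    """
    market = {"MarketType": "spot"}
    if max_price is not None:
        # The API expects the price as a string, for example "0.12".
        market["SpotOptions"] = {"MaxPrice": str(max_price)}
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "InstanceMarketOptions": market,
    }
```

With boto3 installed and credentials configured, the result could be passed along as `boto3.client("ec2").run_instances(**build_spot_run_args("ami-1a2b3c4d", "c4.2xlarge", "0.12"))`.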

Fewer interruptions

Spot Instances receive a two-minute interruption notice when these instances are about to be reclaimed by EC2, because EC2 needs the capacity back. We have significantly reduced the interruptions with the new pricing model. Now instances are not interrupted because of higher competing bids, and you can enjoy longer workload runtimes. The typical frequency of interruption for Spot Instances in the last 30 days was less than 5% on average.

To reduce the impact of interruptions and optimize Spot Instances, diversify and run your application across multiple capacity pools. Each instance family, each instance size, in each Availability Zone, in every Region is a separate Spot pool. You can use the RequestSpotFleet API operation to launch thousands of Spot Instances and diversify resources automatically. To further reduce the impact of interruptions, you can also set up Spot Instances and Spot Fleets to respond to an interruption notice by stopping or hibernating rather than terminating instances when capacity is no longer available.

Spot Instances are now available in 18 Regions and 51 Availability Zones, and offer 100s of instance options. We have eliminated bidding, simplified the pricing model, and have made it easy to get started with Amazon EC2 Spot Instances for you to take advantage of the largest pool of cost-effective compute capacity in the world. See the Spot Instances detail page for more information and create your Spot Instance here.

Creating a Cost-Efficient Amazon ECS Cluster for Scheduled Tasks

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/creating-a-cost-efficient-amazon-ecs-cluster-for-scheduled-tasks/

Madhuri Peri
Sr. DevOps Consultant

When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or another tool of their choice. A module needs to periodically retrieve the RDS logs using the SDK, and then send them to Amazon S3. From there, you can stream them to your log aggregation tool.

One option is writing an AWS Lambda function to retrieve the log files. However, because the time that this function needs to execute depends on the volume of log files retrieved and transferred, Lambda could time out in many cases. Another approach is launching an Amazon EC2 instance that runs this job periodically. However, this would require you to run an EC2 instance continuously, which is not an optimal use of time or money.

Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this would allow you to improve costs by running containers on a fleet of Spot Instances.

In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.

Architecture

The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances, and deposits the logs into an S3 bucket.

Amazon ECS cluster container instances are using Spot Fleet, which is a perfect match for the workload that needs to run when it can. This improves cluster costs.

The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster.

The container image has Python code functions to make AWS API calls using boto3. It iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose to have these logs delivered to their centralized log-store. CloudWatch Events defines the schedule for when the container task has to be launched.
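The iteration described above can be sketched roughly as follows. This is not the rdslogsshipper.py code from the repository: the function name and the S3 key layout are assumptions, and the `rds` and `s3` arguments are expected to be boto3 clients (for example, `boto3.client("rds")` and `boto3.client("s3")`). Pagination of the instance and log file listings is left out for brevity.

```python
def ship_rds_logs(rds, s3, bucket):
    """Copy every log file of every RDS instance into the bucket; return the keys written."""
    keys = []
    for db in rds.describe_db_instances()["DBInstances"]:
        db_id = db["DBInstanceIdentifier"]
        files = rds.describe_db_log_files(DBInstanceIdentifier=db_id)
        for log in files["DescribeDBLogFiles"]:
            name = log["LogFileName"]
            chunks, marker = [], "0"  # Marker "0" starts at the beginning of the file.
            while True:
                part = rds.download_db_log_file_portion(
                    DBInstanceIdentifier=db_id, LogFileName=name, Marker=marker)
                chunks.append(part.get("LogFileData") or "")
                if not part.get("AdditionalDataPending"):
                    break
                marker = part["Marker"]
            # Hypothetical key layout: <instance id>/<log file name>.
            key = f"{db_id}/{name}"
            s3.put_object(Bucket=bucket, Key=key, Body="".join(chunks).encode("utf-8"))
            keys.append(key)
    return keys
```

Passing the clients in as arguments keeps the function easy to run against stand-in objects before it is packaged into the container image.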

Walkthrough

To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:

  • Amazon ECR repository for storing the Docker image to be used in the task definition
  • S3 bucket that holds the transferred logs
  • Task definition, with image name and S3 bucket as environment variables provided via input parameter
  • CloudWatch Events rule
  • Amazon ECS cluster
  • Amazon ECS container instances using Spot Fleet
  • IAM roles required for the container instance profiles

Before you begin

Ensure that Git, Docker, and the AWS CLI are installed on your computer.

In your AWS account, instantiate one Amazon Aurora instance using the console. For more information, see Creating an Amazon Aurora DB Cluster.

Implementation Steps

  1. Clone the code from GitHub that performs RDS API calls to retrieve the log files.
    git clone https://github.com/awslabs/aws-ecs-scheduled-tasks.git
  2. Build and tag the image.
    cd aws-ecs-scheduled-tasks/container-code/src && ls

    Dockerfile		rdslogsshipper.py	requirements.txt

    docker build -t rdslogsshipper .

    Sending build context to Docker daemon 9.728 kB
    Step 1 : FROM python:3
     ---> 41397f4f2887
    Step 2 : WORKDIR /usr/src/app
     ---> Using cache
     ---> 59299c020e7e
    Step 3 : COPY requirements.txt ./
     ---> 8c017e931c3b
    Removing intermediate container df09e1bed9f2
    Step 4 : COPY rdslogsshipper.py /usr/src/app
     ---> 099a49ca4325
    Removing intermediate container 1b1da24a6699
    Step 5 : RUN pip install --no-cache-dir -r requirements.txt
     ---> Running in 3ed98b30901d
    Collecting boto3 (from -r requirements.txt (line 1))
      Downloading boto3-1.4.6-py2.py3-none-any.whl (128kB)
    Collecting botocore (from -r requirements.txt (line 2))
      Downloading botocore-1.6.7-py2.py3-none-any.whl (3.6MB)
    Collecting s3transfer<0.2.0,>=0.1.10 (from boto3->-r requirements.txt (line 1))
      Downloading s3transfer-0.1.10-py2.py3-none-any.whl (54kB)
    Collecting jmespath<1.0.0,>=0.7.1 (from boto3->-r requirements.txt (line 1))
      Downloading jmespath-0.9.3-py2.py3-none-any.whl
    Collecting python-dateutil<3.0.0,>=2.1 (from botocore->-r requirements.txt (line 2))
      Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
    Collecting docutils>=0.10 (from botocore->-r requirements.txt (line 2))
      Downloading docutils-0.14-py3-none-any.whl (543kB)
    Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore->-r requirements.txt (line 2))
      Downloading six-1.10.0-py2.py3-none-any.whl
    Installing collected packages: six, python-dateutil, docutils, jmespath, botocore, s3transfer, boto3
    Successfully installed boto3-1.4.6 botocore-1.6.7 docutils-0.14 jmespath-0.9.3 python-dateutil-2.6.1 s3transfer-0.1.10 six-1.10.0
     ---> f892d3cb7383
    Removing intermediate container 3ed98b30901d
    Step 6 : COPY . .
     ---> ea7550c04fea
    Removing intermediate container b558b3ebd406
    Successfully built ea7550c04fea
  3. Run the CloudFormation stack and get the names for the Amazon ECR repo and S3 bucket. In the stack, choose Outputs.
  4. Open the ECS console and choose Repositories. The rdslogs repo has been created. Choose View Push Commands and follow the instructions to connect to the repository and push the image for the code that you built in Step 2. The screenshot shows the final result:
  5. Associate the CloudWatch scheduled task with the created Amazon ECS Task Definition, using a new CloudWatch event rule that is scheduled to run at intervals. The following rule is scheduled to run every 15 minutes:
    aws --profile default --region us-west-2 events put-rule --name demo-ecs-task-rule  --schedule-expression "rate(15 minutes)"

    {
        "RuleArn": "arn:aws:events:us-west-2:12345678901:rule/demo-ecs-task-rule"
    }
  6. CloudWatch Events needs an IAM role that it can assume, and that role needs permissions to place a task on the Amazon ECS cluster when the rule fires. Set this up in three steps:
    1. Create the IAM role to be assumed by CloudWatch.
      aws --profile default --region us-west-2 iam create-role --role-name Test-Role --assume-role-policy-document file://event-role.json

      {
          "Role": {
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17", 
                  "Statement": [
                      {
                          "Action": "sts:AssumeRole", 
                          "Effect": "Allow", 
                          "Principal": {
                              "Service": "events.amazonaws.com"
                          }
                      }
                  ]
              }, 
              "RoleId": "AROAIRYYLDCVZCUACT7FS", 
              "CreateDate": "2017-07-14T22:44:52.627Z", 
              "RoleName": "Test-Role", 
              "Path": "/", 
              "Arn": "arn:aws:iam::12345678901:role/Test-Role"
          }
      }

      The following is an example of the event-role.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                    "Service": "events.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
    2. Create the IAM policy defining the ECS cluster and task definition. You need to get these values from the CloudFormation outputs and resources.
      aws --profile default --region us-west-2 iam create-policy --policy-name test-policy --policy-document file://event-policy.json

      {
          "Policy": {
              "PolicyName": "test-policy", 
              "CreateDate": "2017-07-14T22:51:20.293Z", 
              "AttachmentCount": 0, 
              "IsAttachable": true, 
              "PolicyId": "ANPAI7XDIQOLTBUMDWGJW", 
              "DefaultVersionId": "v1", 
              "Path": "/", 
              "Arn": "arn:aws:iam::123455678901:policy/test-policy", 
              "UpdateDate": "2017-07-14T22:51:20.293Z"
          }
      }

      The following is an example of the event-policy.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ecs:RunTask"
                ],
                "Resource": [
                    "arn:aws:ecs:*::task-definition/"
                ],
                "Condition": {
                    "ArnLike": {
                        "ecs:cluster": "arn:aws:ecs:*::cluster/"
                    }
                }
            }
          ]
      }
    3. Attach the IAM policy to the role.
      aws --profile default --region us-west-2 iam attach-role-policy --role-name Test-Role --policy-arn arn:aws:iam::12345678901:policy/test-policy
  7. Add the ECS task as a target of the CloudWatch Events rule created earlier, so that the rule places the task on the ECS cluster. The following command shows an example. Replace the AWS account ID, Region, cluster, and task definition with your own values.
    aws events put-targets --rule demo-ecs-task-rule --targets "Id"="1","Arn"="arn:aws:ecs:us-west-2:12345678901:cluster/test-cwe-blog-ecsCluster-15HJFWCH4SP67","EcsParameters"={"TaskDefinitionArn"="arn:aws:ecs:us-west-2:12345678901:task-definition/test-cwe-blog-taskdef:8"},"RoleArn"="arn:aws:iam::12345678901:role/Test-Role"

    {
        "FailedEntries": [], 
        "FailedEntryCount": 0
    }
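
For readers who prefer to script steps 5 through 7, the same calls can be sketched with boto3. This is a sketch under assumptions, not part of the original walkthrough. The helper names `run_task_policy` and `schedule_ecs_task` are hypothetical. The permissions policy here is scoped to the exact cluster and task definition ARNs you pass in, which is a choice made for this example and differs from the ARN patterns in event-policy.json.

```python
import json

# Trust policy that lets CloudWatch Events assume the role (as in event-role.json).
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}


def run_task_policy(cluster_arn, task_definition_arn):
    """Build a policy allowing ecs:RunTask for one task definition on one cluster."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["ecs:RunTask"],
            "Resource": [task_definition_arn],
            "Condition": {"ArnLike": {"ecs:cluster": cluster_arn}},
        }],
    }


def schedule_ecs_task(events, iam, cluster_arn, task_definition_arn,
                      rule_name="demo-ecs-task-rule", role_name="Test-Role",
                      policy_name="test-policy", schedule="rate(15 minutes)"):
    """Create the rule, role, and policy, then point the rule at the ECS task.

    `events` and `iam` are boto3 clients, for example boto3.client("events").
    """
    # Step 5: the scheduled rule.
    events.put_rule(Name=rule_name, ScheduleExpression=schedule)

    # Step 6: role assumed by CloudWatch Events, plus the RunTask policy.
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    )
    policy = iam.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(run_task_policy(cluster_arn, task_definition_arn)),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy["Policy"]["Arn"])

    # Step 7: register the ECS cluster and task definition as the rule target.
    return events.put_targets(Rule=rule_name, Targets=[{
        "Id": "1",
        "Arn": cluster_arn,
        "RoleArn": role["Role"]["Arn"],
        "EcsParameters": {"TaskDefinitionArn": task_definition_arn, "TaskCount": 1},
    }])
```

As with the CLI output above, a response with `FailedEntryCount` equal to 0 indicates that the target was registered.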

That’s it. The log-shipping task now runs on the defined schedule.

To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template, and the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.
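
If you would rather check the bucket from a script than from the console, a small boto3 helper along these lines could list what the task delivered. This is only a sketch. The function name `list_shipped_logs` and the optional prefix argument are assumptions for this example, and the bucket name comes from your CloudFormation outputs.

```python
def list_shipped_logs(s3, bucket, prefix=""):
    """Return the object keys under `prefix` in `bucket`, following pagination.

    `s3` is a boto3 S3 client, for example boto3.client("s3").
    """
    keys = []
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        # "Contents" is absent from a page when no objects match.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

An empty list right after the first run usually just means that the task has not finished yet.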

Conclusion

In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch Events or EC2 with cron. However, sometimes the job runs longer than the Lambda execution time limit allows, or is not cost-effective to run on a dedicated EC2 instance.

In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events.

If you have questions or suggestions, please comment below.

Natural Language Processing at Clemson University – 1.1 Million vCPUs & EC2 Spot Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/

My colleague Sanjay Padhi shared the guest post below in order to recognize an important milestone in the use of EC2 Spot Instances.

Jeff;


A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing. They broke the record for the largest high-performance cluster by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS Region. The researchers conducted nearly half a million topic modeling experiments to study how human language is processed by computers. Topic modeling helps in discovering the underlying themes that are present across a collection of documents. Topic models are important because they are used to forecast business trends and help in making policy or funding decisions. These topic models can be run with many different parameters, and the goal of the experiments is to explore how these parameters affect the model outputs.

The Experiment
Professor Amy Apon, Co-Director of the Complex Systems, Analytics and Visualization Institute at Clemson University, performed the experiments with Professor Alexander Herzog and graduate students Brandon Posey and Christopher Gropp, in collaboration with members of the AWS team and AWS Partner Omnibond. They used software infrastructure based on CloudyCluster, which provisions high performance computing clusters on dynamically allocated AWS resources using Amazon EC2 Spot Fleet. A Spot Fleet is a collection of biddable Spot Instances in EC2 that maintains the target capacity specified in the request. The SLURM scheduler was used as an overlay virtual workload manager for the data analytics workflows. The team developed additional provisioning and workflow automation software, as shown below, for the design and orchestration of the experiments. This setup allowed them to evaluate various topic models on different data sets with massively parallel parameter sweeps on dynamically allocated AWS resources. This framework can easily be used beyond the current study for other scientific applications that use parallel computing.

Ramping to 1.1 Million vCPUs
The figure below shows elastic, automatic expansion of resources as a function of time, in the US East (Northern Virginia) Region. At just after 21:40 (GMT-1) on Aug. 26, 2017, the number of vCPUs utilized was 1,119,196. Clemson researchers also took advantage of the new per-second billing for the EC2 instances that they launched. The vCPU count usage is comparable to the core count on the largest supercomputers in the world.

Here’s the breakdown of the EC2 instance types that they used:

Campus resources at Clemson, funded by the National Science Foundation, were used to determine an effective configuration for the AWS experiments relative to the campus setup. The AWS Cloud resources complement the campus resources for large-scale experiments.

Meet the Team
Here’s the team that ran the experiment (Professor Alexander Herzog, graduate students Christopher Gropp and Brandon Posey, and Professor Amy Apon):

Professor Apon said about the experiment:

I am absolutely thrilled with the outcome of this experiment. The graduate students on the project are amazing. They used resources from AWS and Omnibond and developed a new software infrastructure to perform research at a scale and time-to-completion not possible with only campus resources. Per-second billing was a key enabler of these experiments.

Boyd Wilson (CEO, Omnibond, member of the AWS Partner Network) told me:

Participating in this project was exciting, seeing how the Clemson team developed a provisioning and workflow automation tool that tied into CloudyCluster to build a huge Spot Fleet supercomputer in a single region in AWS was outstanding.

About the Experiment
The experiments test parameter combinations on a range of topics and other parameters used in the topic model. The topic model outputs are stored in Amazon S3 and are currently being analyzed. The models have been applied to 17 years of computer science journal abstracts (533,560 documents and 32,551,540 words) and full text papers from the NIPS (Neural Information Processing Systems) Conference (2,484 documents and 3,280,697 words). This study allows the research team to systematically measure and analyze the impact of parameters and model selection on model convergence, topic composition and quality.
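
To make the idea of testing parameter combinations concrete, here is a minimal sketch of enumerating a sweep in Python. It is not the team's software. The axis names (`num_topics`, `alpha`, `seed`) and their values are hypothetical, since the study's actual parameters and ranges are not listed in this post.

```python
from itertools import product


def parameter_grid(**axes):
    """Yield one dict per combination of the given parameter values."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical axes, chosen only to illustrate the sweep.
grid = list(parameter_grid(num_topics=[50, 100, 200], alpha=[0.01, 0.1], seed=[1, 2]))
print(len(grid))  # 3 * 2 * 2 = 12 combinations
```

Each combination is independent of the others, which is what allows a scheduler to run them as separate jobs in parallel.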

Looking Forward
This study constitutes an interaction between computer science, artificial intelligence, and high performance computing. Papers describing the full study are being submitted for peer-reviewed publication. I hope that you enjoyed this brief insight into the ways in which AWS is helping to break the boundaries in the frontiers of natural language processing!

Sanjay Padhi, Ph.D, AWS Research and Technical Computing