Tag Archives: Compute

Announcing winners of the AWS Graviton Challenge Contest and Hackathon

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/announcing-winners-of-the-aws-graviton-challenge-contest-and-hackathon/

At AWS, we are constantly innovating on behalf of our customers so they can run virtually any workload, with optimal price and performance. Amazon EC2 now includes more than 475 instance types that offer a choice of compute, memory, networking, and storage to suit your workload needs. While we work closely with our silicon partners to offer instances based on their latest processors and accelerators, we also drive more choice for our customers by building our own silicon.

The AWS Graviton family of processors was built as part of that silicon innovation initiative with the goal of pushing the price performance envelope for a wide variety of customer workloads in EC2. We now have 12 EC2 instance families powered by AWS Graviton2 processors – general purpose (M6g, M6gd), burstable (T4g), compute optimized (C6g, C6gd, C6gn), memory optimized (R6g, R6gd, X2gd), storage optimized (Im4gn, Is4gen), and accelerated computing (G5g) available globally across 23 AWS Regions. We also announced the preview of Amazon EC2 C7g instances powered by the latest generation AWS Graviton3 processors that will provide the best price performance for compute-intensive workloads in EC2. Thousands of customers, including Discovery, DIRECTV, Epic Games, and Formula 1, have realized significant price performance benefits with AWS Graviton-based instances for a broad range of workloads. This year, AWS Graviton-based instances also powered much of Amazon Prime Day 2021 and supported 12 core retail services during the massive 2-day online shopping event.

To make it easy for customers to adopt Graviton-based instances, we launched a program called the Graviton Challenge. Working with customers, we saw that many successful adoptions of Graviton-based instances were the result of one or two developers taking a single workload and spending a few days to benchmark the price performance gains with Graviton2-based instances, before scaling it to more workloads. The Graviton Challenge provides a step-by-step plan that developers can follow to move their first workload to Graviton-based instances. With the Graviton Challenge, we also launched a Contest (US-only), and then a Hackathon (global), where developers could compete for prizes by building new applications or moving existing applications to run on Graviton2-based instances. More than a thousand participants, including enterprises, startups, individual developers, open-source developers, and Arm developers, registered and ran a variety of applications on Graviton-based instances with significant price performance benefits. We saw some fantastic entries and usage of Graviton2-based instances across a variety of use cases and want to highlight a few.

The Graviton Challenge Contest winners:

  • Best Adoption – Enterprise and Most Impactful Adoption: VMware vRealize SRE team, who migrated 60 micro-services written in Java, Rust, and Golang to Graviton2-based general purpose and compute optimized instances and realized up to 48% latency reduction and 22% cost savings.
  • Best Adoption – Startup: Kasm Technologies, who realized up to 48% better performance and 25% potential cost savings for its container streaming platform built on C/C++ and Python.
  • Best New Workload adoption: Dustin Wilson, who built a dynamic tile server based on Golang and running on Graviton2-based memory-optimized instances that helps analysts query large geospatial datasets and benchmarked up to 1.8x performance gains over comparable x86-based instances.
  • Most Innovative Adoption: Loroa, an application that translates any given text into spoken words from one language into multiple other languages using Graviton2-based instances, Amazon Polly, and Amazon Translate.

If you are attending AWS re:Invent 2021 in person, you can hear more details on their Graviton adoption experience by attending the CMP213: Lessons learned from customers who have adopted AWS Graviton chalk talk.

Winners for the Graviton Challenge Hackathon:

  • Best New App: PickYourPlace, an open-source based data analytics platform to help users select a place to live based on property value, safety, and accessibility.
  • Best Migrated App: Genie, an image credibility checker based on deep learning that makes predictions on photographic and tampered confidence of an image.
  • Highest Potential Impact: Welly Tambunan, who’s also an AWS Community Builder, for porting big data platforms Spark, Dremio, and AirByte to Graviton2 instances so developers can leverage it to build big data capabilities into their applications.
  • Most Creative Use Case: OXY, a low-cost custom Oximeter with mobile and web apps that enables continuous and remote monitoring to prevent deaths due to Silent Hypoxia.
  • Best Technical Implementation: Apollonia Bot that plays songs, playlists, or podcasts on a Discord voice channel, so users can listen to it together.

It’s been incredibly exciting to see the enthusiasm and benefits realized by our customers. We are also thankful to our judges – Patrick Moorhead from Moor Insights, James Governor from RedMonk, and Jason Andrews from Arm, for their time and effort.

In addition to EC2, several AWS services for databases, analytics, and even serverless support options to run on Graviton-based instances. These include Amazon Aurora, Amazon RDS, Amazon MemoryDB, Amazon DocumentDB, Amazon Neptune, Amazon ElastiCache, Amazon OpenSearch, Amazon EMR, AWS Lambda, and most recently, AWS Fargate. By using these managed services on Graviton2-based instances, customers can get significant price performance gains with minimal or no code changes. We also added support for Graviton to key AWS infrastructure services such as Elastic Beanstalk, Amazon EKS, Amazon ECS, and Amazon CloudWatch to help customers build, run, and scale their applications on Graviton-based instances. Additionally, a large number of Linux and BSD-based operating systems, and partner software for security, monitoring, containers, CI/CD, and other use cases now support Graviton-based instances and we recently launched the AWS Graviton Ready program as part of the AWS Service Ready program to offer Graviton-certified and validated solutions to customers.

Congrats to all of our Contest and Hackathon winners! Full list of the Contest and Hackathon winners is available on the Graviton Challenge page.

P.S.: Even though the Contest and Hackathon have ended, developers can still access the step-by-step plan on the Graviton Challenge page to move their first workload to Graviton-based instances.

New for AWS Backup – Support for VMware and VMware Cloud on AWS

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-aws-backup-support-for-vmware-and-vmware-cloud-on-aws/

Today, I am happy to announce AWS Backup support for VMware, a new capability that enables you to centralize and automate data protection of virtual machines (VMs) running on VMware on premises and VMware Cloud™ on AWS. You can now use a single, centrally managed policy in AWS Backup to protect these VMware environments together with 12 AWS compute, storage, and database services already supported by AWS Backup. You can then use AWS Backup to restore VMware workloads to on-premises data centers and VMware Cloud on AWS.

While doing so, AWS Backup Audit Manager lets you consistently demonstrate compliance by monitoring backup, copy, and restore operations and generating auditor-ready reports to satisfy your data governance and regulatory requirements.

Let’s see how this works in practice.

Using AWS Backup Support for VMware
There are three steps to back up VMware virtual machines (VMs) with AWS Backup:

  1. Create a gateway to connect AWS Backup to your hypervisor.
  2. Connect to your hypervisor through the gateway.
  3. Assign virtual machines managed by your hypervisor to a backup plan.

AWS Backup support for VMware diagram

On the left pane of the AWS Backup console, there is a new External resources section. There, I choose Gateways and then Create gateway. This AWS Backup gateway helps with discovery of the on-premises VMware environment and acts as a cloud gateway to send and receive data.

I download the Open Virtualization Format (OVF) file of the AWS Backup gateway and follow the instructions to deploy the gateway using the VMware vSphere client. I am using an internal test and development VMware environment for this walkthrough.

VMware vCenter screenshot.

After deploying the gateway in my VMware environment, I come back to the AWS Backup console. I enter a name for the gateway (for simplicity, I use the same name as the gateway VM) and the IP address of the gateway VM. Optionally, I can add tags to help organize and track my setup. I go on and create the gateway.

Console screenshot.

Now, I choose Add hypervisor. I write a name for the hypervisor and the IP address of the VMware vCenter server host.

Console screenshot.

I enter the username and password of a service account that I created for AWS Backup on the Active Directory domain. The username should include the domain (for example, username@domain). Then, I choose the encryption key to protect the service account credentials. If I don’t choose my own AWS Key Management Service (KMS) key, AWS Backup encrypts the username and password using a key that AWS owns and manages.

Console screenshot.

I select the gateway to connect to the hypervisor and choose Test gateway connection. This test helps ensure that the gateway can communicate with the hypervisor before I complete the configuration. Optionally, I can add tags to help organize and track my setup. I go on and add the hypervisor.

Console screenshot.

After a few minutes, the hypervisor is online, and I see the VMs managed by vCenter in the AWS Backup console. I can now use these virtual machines as resources in my backup plans in the same way as the other AWS compute, storage, and database resources supported by AWS Backup.

Console screenshot.

I create a new backup plan and start with a template. The rules of the template enforce daily backups with five weeks of retention and monthly backups with one year of retention. I can customize these rules based on my requirements.

Console screenshot.

Then, I choose to assign resources to the backup plan, and I select three VMs.

Console screenshot.

If you need, you can create an on-demand backup in the Protected resources section of the console. For example, here I am starting the on-demand backup for one of the VMs.

Console screenshot.

When a backup is complete, VMs are added to the list of the protected resources, and I can initiate a restore.

Console screenshot.

I select the backup and choose Restore. Then, I enter the restore location, which can be the same VMware environment I used for the backup or another (for example, on VMware Cloud on AWS). Below, I specify name, path, compute resource name, and datastore to use for the restore. Then, I choose Restore backup.

Console screenshot.

I monitor the status of my backup and restore jobs from the AWS Backup console. To monitor backup and restore metrics over a period of time, I can use Amazon CloudWatch metrics, logs, and alarms. I can also send events to Amazon EventBridge to receive notifications once a job completes or fails.

Availability and Pricing
AWS Backup support for VMware is available in the US East (N. Virginia, Ohio), US West (N. California, Oregon), GovCloud (US-East, US-West), Canada (Central), Europe (Frankfurt, Ireland, London, Milan, Paris, Stockholm), South America (São Paulo), Asia Pacific (Hong Kong, Mumbai, Seoul, Singapore, Sydney, Tokyo, Osaka), Middle East (Bahrain), and Africa (Cape Town) Regions. Please see the AWS Regional Services List for more information.

AWS Backup supports VMware ESXi 6.7.x and 7.0.x VMs running on NFS, VMFS, and VSAN data stores on premises and in VMware Cloud on AWS. In addition, AWS Backup supports both SCSI Hot-Add and Network Block Device (NBD) transport modes for copying data from source VMs to AWS.

With AWS Backup support for VMware, you pay using the same dimensions that AWS Backup uses today: backup storage, restore, and cross-region data transfer. For more information, see the AWS Backup pricing page.

Your VM backups are stored in a backup vault. All backups stored and managed by AWS Backup are replicated to 3 Availability Zones (AZs) in the Region and designed for 99.999999999 percent (11 9s) durability and 99.99 percent (4 9s) of service availability.

AWS Backup supports first full, then incremental-forever, backups of VMs that you can create on-demand or via a schedule configured in your backup plan. AWS Backup always does full restores even though backups are stored as incremental, enabling you to benefit from storage efficiency cost savings while easily performing restores.

Centrally protect your VMware environments and your AWS compute, storage, and database resources with AWS Backup.

Danilo

New for AWS Compute Optimizer – Resource Efficiency Metrics to Estimate Savings Opportunities and Performance Risks

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-aws-compute-optimizer-resource-efficiency-metrics-to-estimate-savings-opportunities-and-performance-risks/

By applying the knowledge drawn from Amazon’s experience running diverse workloads in the cloud, AWS Compute Optimizer identifies workload patterns and recommends optimal AWS resources.

Today, I am happy to share that AWS Compute Optimizer now delivers resource efficiency metrics alongside its recommendations to help you assess how efficiently you are using AWS resources:

  • A dashboard shows you savings and performance improvement opportunities at the account level. You can dive into resource types and individual resources from the dashboard.
  • The Estimated monthly savings (On-Demand) and Savings opportunity (%) columns estimate the possible savings for over-provisioned resources. You can sort your recommendations using these two columns to quickly find the resources on which to focus your optimization efforts.
  • The Current performance risk column estimates the bottleneck risk with the current configuration for under-provisioned resources.

These efficiency metrics are available for Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, and Amazon Elastic Block Store (EBS) at the resource and AWS account levels.

For multi-account environments, Compute Optimizer continuously calculates resource efficiency metrics at the individual account level in an AWS organization to help identify teams with low cost-efficiency or possible performance risks. This lets you create goals and track progress over time. You can quickly understand just how resource-efficient teams and applications are, easily prioritize recommendation evaluation and adoption by engineering team, and establish a mechanism that drives a cost-aware culture and accountability across engineering teams.

Using Resource Efficiency Metrics in AWS Compute Optimizer
You can opt in using the AWS Management Console or the AWS Command Line Interface (CLI) to start using Compute Optimizer. You can enroll the account that you’re currently signed in to or all of the accounts within your organization. Depending on your choice, Compute Optimizer analyzes resources that are in your individual account or for each account in your organization, and then generates optimization recommendations for those resources.

To see your savings opportunity in Compute Optimizer, you should also opt in to AWS Cost Explorer and enable the rightsizing recommendations in the AWS Cost Explorer preferences page. For more details, see Getting started with rightsizing recommendations.

I already enrolled some time ago, and in the Compute Optimizer console I see the overall savings opportunity for my account.

Console screenshot.

Below that, I have a recap of the performance improvement opportunity. This includes an overview of the under-provisioned resources, as well as the performance risks that they pose by resource type.

Console screenshot.

Let’s dive into some of those savings. In the EC2 instances section, Compute Optimizer found 37 over-provisioned instances.

Console screenshot.

I follow the 37 instances link to get recommendations for those resources, and then sort the table by Estimated monthly savings (On-Demand) descending.

Console screenshot.

On the right, in the same table, I see the current instance type, the recommended instance type based on Compute Optimizer estimates, the difference in pricing, and whether there are platform differences between the current and recommended instance types.

Console screenshot.

I can select each instance to further drill down into the metrics collected, as well as the other possible instance types suggested by Compute Optimizer.

Back to the Compute Optimizer Dashboard, in the Lambda functions section, I see that eight functions have under-provisioned memory.

Console screenshot.

Again, I follow the 8 functions link to get recommendations for those resources, and then sort the table by Current performance risk. In my case, the risk is always low, but different values can help prioritize your activities.

Console screenshot.

Here, I see the current and recommended configured memory for those Lambda functions. I can select each function to get a view of the metrics collected. Choosing the memory allocated to Lambda functions is an optimization process that balances speed (duration) and cost. See Profiling functions with AWS Lambda Power Tuning in the documentation for more information.

Availability and Pricing
You can use resource efficiency metrics with AWS Compute Optimizer in any AWS Region where it is offered. For more information, see the AWS Regional Services List. There is no additional charge for this new capability. See the AWS Compute Optimizer pricing page for more information.

This new feature lets you implement a periodic workflow to optimize your costs:

  • You can start by reviewing savings opportunities for all of your accounts to identify which accounts have the highest savings opportunity.
  • Then, you can drill into those accounts with the highest savings opportunity. You can refer to the estimated monthly savings to see which recommendations can drive the largest absolute cost impact.
  • Finally, you can communicate optimization opportunities and priority order to the teams using those accounts.

Start using AWS Compute Optimizer today to find and prioritize savings opportunities in your AWS account or organization.

Danilo

Filtering event sources for AWS Lambda functions

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/filtering-event-sources-for-aws-lambda-functions/

This post is written by Heeki Park, Principal Specialist Solutions Architect – Serverless.

When an AWS Lambda function is configured with an event source, the Lambda service triggers a Lambda function for each message or record. The exact behavior depends on the choice of event source and the configuration of the event source mapping. The event source mapping defines how the Lambda service handles incoming messages or records from the event source.

Today, AWS announces the ability to filter messages before the invocation of a Lambda function. Filtering is supported for the following event sources: Amazon Kinesis Data Streams, Amazon DynamoDB Streams, and Amazon SQS. This helps reduce requests made to your Lambda functions, may simplify code, and can reduce overall cost.

Overview

Consider a logistics company with a fleet of vehicles in the field. Each vehicle is enabled with sensors and 4G/5G connectivity to emit telemetry data into Kinesis Data Streams:

  • In one scenario, they use machine learning models to infer the health of vehicles based on each payload of telemetry data, which is outlined in example 2 on the Lambda pricing page.
  • In another scenario, they want to invoke a function, but only when tire pressure is low on any of the tires.

If tire pressure is low, the company notifies the maintenance team to check the tires when the vehicle returns. The process checks if the warehouse has enough spare replacements. Optionally, it notifies the purchasing team to buy additional tires.

The application responds to the stream of incoming messages and runs business logic if tire pressure is below 32 psi. Each vehicle in the field emits telemetry as follows:

{
    "time": "2021-11-09 13:32:04",
    "fleet_id": "fleet-452",
    "vehicle_id": "a42bb15c-43eb-11ec-81d3-0242ac130003",
    "lat": 47.616226213162406,
    "lon": -122.33989110734133,
    "speed": 43,
    "odometer": 43519,
    "tire_pressure": [41, 40, 31, 41],
    "weather_temp": 76,
    "weather_pressure": 1013,
    "weather_humidity": 66,
    "weather_wind_speed": 8,
    "weather_wind_dir": "ne"
}

To process all messages from a fleet of vehicles, you configure a filter matching the fleet id in the following example. The Lambda service applies the filter pattern against the full payload that it receives.

The schema of the payload for Kinesis and DynamoDB Streams is shown under the “kinesis” attribute in the example Kinesis record event. When building filters for Kinesis or DynamoDB Streams, you filter the payload under the “data” attribute. The schema of the payload for SQS is shown in the array of records in the example SQS message event. When working with SQS, you filter the payload under the “body” attribute:

{
    "data": {
        "fleet_id": ["fleet-452"]
    }
}

To process all messages associated with a specific vehicle, configure a filter on only that vehicle id. The fleet id is kept in the example to show that it matches on both of those filter criteria:

{
    "data": {
        "fleet_id": ["fleet-452"],
        "vehicle_id": ["a42bb15c-43eb-11ec-81d3-0242ac130003"]
    }
}

To process all messages associated with that fleet but only if tire pressure is below 32 psi, you configure the following rule pattern. This pattern searches the array under tire_pressure to match values less than 32:

{
    "data": {
        "fleet_id": ["fleet-452"],
        "tire_pressure": [{"numeric": ["<", 32]}]
    }
}
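
The matching behavior described above can be sketched in a few lines of Python. This is only a simplified illustration, not the Lambda service's implementation: the helper names (`matches`, `_value_matches`) are hypothetical, and the sketch assumes that only exact-value entries and a single numeric comparison are used. It shows how a pattern value list is checked against a scalar or an array in the payload.

```python
import operator

# Comparison operators supported by this simplified sketch (an assumption;
# the real service supports a richer rule syntax).
_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
        ">=": operator.ge, "=": operator.eq}


def _value_matches(value, rule):
    """Check one payload value against one entry of a pattern list."""
    if isinstance(rule, dict) and "numeric" in rule:
        op, bound = rule["numeric"]  # single comparison only, e.g. ["<", 32]
        return isinstance(value, (int, float)) and _OPS[op](value, bound)
    return value == rule  # plain entries are exact matches


def matches(payload, pattern):
    """Return True if every key in the pattern is satisfied by the payload."""
    for key, expected in pattern.items():
        if key not in payload:
            return False
        actual = payload[key]
        if isinstance(expected, dict):  # nested object such as "data"
            if not isinstance(actual, dict) or not matches(actual, expected):
                return False
            continue
        # An array in the payload matches if any element satisfies any rule.
        candidates = actual if isinstance(actual, list) else [actual]
        if not any(_value_matches(v, r) for v in candidates for r in expected):
            return False
    return True


pattern = {"data": {"fleet_id": ["fleet-452"],
                    "tire_pressure": [{"numeric": ["<", 32]}]}}
low = {"data": {"fleet_id": "fleet-452", "tire_pressure": [41, 40, 31, 41]}}
ok = {"data": {"fleet_id": "fleet-452", "tire_pressure": [41, 40, 40, 41]}}
print(matches(low, pattern), matches(ok, pattern))
```

The first record matches because one tire reads 31 psi; the second does not, so under this model it would never reach the function.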

To create the event source mapping with a tire pressure filter using the AWS CLI, run the following command.

aws lambda create-event-source-mapping \
--function-name fleet-tire-pressure-evaluator \
--batch-size 100 \
--starting-position LATEST \
--event-source-arn arn:aws:kinesis:us-east-1:123456789012:stream/fleet-telemetry \
--filter-criteria '{"Filters": [{"Pattern": "{\"data\": {\"tire_pressure\": [{\"numeric\": [\"<\", 32]}]}}"}]}'

For the CLI, the value for Pattern in the filter criteria requires the double quotes to be escaped in order to be properly captured.
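
Hand-escaping the quotes is error-prone, so one option is to generate the argument. The following is a minimal sketch using only the Python standard library; it assumes the tire pressure pattern nested under "data" from the earlier examples, and the variable names are illustrative. Serializing the pattern first and then the wrapper produces the escaped form that the CLI expects.

```python
import json
import shlex

# The filter pattern as a normal Python structure.
pattern = {"data": {"tire_pressure": [{"numeric": ["<", 32]}]}}

# Pattern must itself be a JSON string, so serialize it first.
filter_criteria = {"Filters": [{"Pattern": json.dumps(pattern)}]}

# Serializing the wrapper escapes the inner double quotes automatically.
cli_value = json.dumps(filter_criteria)

# shlex.quote wraps the value so that a POSIX shell passes it through intact.
print("--filter-criteria", shlex.quote(cli_value))
```

The printed argument can be pasted into the create-event-source-mapping command in place of a hand-written one.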

Alternatively, to create the event source mapping with the tire pressure filter using an AWS Serverless Application Model (AWS SAM) template, use the following snippet.

Events:
  TirePressureEvent:
    Type: Kinesis
    Properties:
      BatchSize: 100
      StartingPosition: LATEST
      Stream: "arn:aws:kinesis:us-east-1:123456789012:stream/fleet-telemetry"
      FilterCriteria:
        Filters:
          - Pattern: '{"data": {"tire_pressure": [{"numeric": ["<", 32]}]}}'

For the AWS SAM template, the value for Pattern does not require escaped double quotes when the YAML string is wrapped in single quotes, as shown here.

For more information on how to create filters, refer to examples of event pattern rules in EventBridge, as Lambda filters messages in the same way.

Reducing costs with event filtering

By configuring the event source with this filter criteria, you can reduce the number of messages that are used to invoke your Lambda function.

Using the example from the Lambda pricing page, with a fleet of 10,000 vehicles in the field, each is emitting telemetry once an hour. Each month, the vehicles emit 10,000 * 24 * 31 = 7,440,000 messages, which trigger the same number of Lambda invocations. You configure the function with 256 MB of memory and the average duration of the function is 100 ms. In this example, vehicles emit low-pressure telemetry once every 31 days.

Without filtering, the cost of the application is:

  • Monthly request charges → 7.44M * $0.20/million = $1.49
  • Monthly compute duration (seconds) → 7.44M * 0.1 seconds = 0.744M seconds
  • Monthly compute (GB-s) → 256MB/1024MB * 0.744M seconds = 0.186M GB-s
  • Monthly compute charges → 0.186M GB-s * $0.0000166667 = $3.10
  • Monthly total charges = $1.49 + $3.10 = $4.59

With filtering, the cost of the application is:

  • Monthly request charges → (7.44M / 31)* $0.20/million = $0.05
  • Monthly compute duration (seconds) → (7.44M / 31) * 0.1 seconds = 0.024M seconds
  • Monthly compute (GB-s) → 256MB/1024MB * 0.024M seconds = 0.006M GB-s
  • Monthly compute charges → 0.006M GB-s * $0.0000166667 = $0.10
  • Monthly total charges = $0.05 + $0.10 = $0.15

By using filtering, the cost is reduced from $4.59 to $0.15, a 96.7% cost reduction.
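
The arithmetic above can be reproduced with a short Python sketch. The prices are the ones quoted in this post and may not match current pricing; the function name and defaults are illustrative, and the free tier is ignored as it is in the calculation above.

```python
# Prices as quoted in this post (assumptions; check the pricing page).
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.0000166667


def monthly_cost(invocations, memory_mb=256, duration_s=0.1):
    """Request charge plus compute charge for one month, in dollars."""
    request_charge = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = invocations * duration_s * memory_mb / 1024
    return request_charge + gb_seconds * PRICE_PER_GB_SECOND


messages = 10_000 * 24 * 31                       # 7,440,000 per month
unfiltered = round(monthly_cost(messages), 2)     # every message invokes
filtered = round(monthly_cost(messages / 31), 2)  # 1 in 31 passes the filter
reduction = (unfiltered - filtered) / unfiltered
print(f"${unfiltered:.2f} -> ${filtered:.2f} ({reduction:.1%} lower)")
```

The percentage here is computed from the rounded dollar totals, which is how the 96.7% figure arises. Because exactly 30 of every 31 invocations are removed, the unrounded reduction is 30/31, or about 96.8%.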

Designing and implementing event filtering

In addition to reducing cost, the functions now operate more efficiently. This is because they no longer iterate through arrays of messages to filter out messages. The Lambda service filters the messages that it receives from the source before batching and sending them as the payload for the function invocation. This is the order of operations:

Event flow with filtering

As you design filter criteria, keep in mind a few additional properties. The event source mapping allows up to five patterns. Each pattern can be up to 2048 characters. As the Lambda service receives messages and filters them with the pattern, it fills the batch per the normal event source behavior.

For example, if the maximum batch size is set to 100 records and the maximum batching window is set to 10 seconds, the Lambda service filters and accumulates records in a batch until one of those two conditions is satisfied. In the case where 100 records that meet the filter criteria come during the batching window, the Lambda service triggers a function with those filtered 100 records in the payload.

If fewer than 100 records meeting the filter criteria arrive during the batch window, Lambda triggers a function with the filtered records that came during the batch window at the end of the 10-second batch window. Be sure to configure the batch window to match your latency requirements.
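
The interaction between filtering, batch size, and batching window can be modeled with a small Python generator. This is a simplified sketch of the behavior described above, not how the Lambda service is implemented: the function name is hypothetical, records are assumed to arrive as (arrival time in seconds, payload) pairs in time order, and the window is assumed to start at the first matching record of each batch.

```python
def filtered_batches(records, matches, max_batch_size, window_seconds):
    """Yield batches of matching payloads, flushed on size or window expiry."""
    batch = []
    window_start = None
    for arrival_time, payload in records:
        # Flush a pending batch whose window expired before this arrival.
        if batch and arrival_time - window_start >= window_seconds:
            yield batch
            batch = []
        if not matches(payload):
            continue  # filtered out: never reaches the function
        if not batch:
            window_start = arrival_time
        batch.append(payload)
        if len(batch) == max_batch_size:
            yield batch  # size limit reached before the window closed
            batch = []
    if batch:
        yield batch  # remaining records once the stream of input ends


# Tire pressure readings with arrival times in seconds.
readings = [(0, 31), (1, 40), (2, 30), (3, 29), (20, 28)]
batches = list(filtered_batches(readings, lambda psi: psi < 32,
                                max_batch_size=2, window_seconds=10))
print(batches)
```

In this run the first batch is sent as soon as two matching readings accumulate, while the reading of 29 waits for the window to close because no second match arrives in time.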

The Lambda service discards messages that do not match the filter criteria and treats them as successfully processed. For Kinesis Data Streams and DynamoDB Streams, the iterator advances past the records that were sent via the event source mapping.

For SQS, the messages are deleted from the queue without any additional processing. With SQS, be sure that the messages that are filtered out are not required. For example, you have an Amazon SNS topic with multiple SQS queues subscribed. The Lambda functions consuming each of those SQS queues process different subsets of messages. You could use filters on SNS but that would require the message publisher to add attributes to the messages that it sends. You could instead use filters on the event source mapping for SQS. Now the publisher does not need to make any changes, as the filter is applied on the messages payload directly.

Conclusion

Lambda now supports the ability to filter messages based on criteria that you define. This can reduce the number of messages that your functions process, may reduce cost, and can simplify code.

You can now build applications for specific use cases that use only a subset of the messages that flow through your event-driven architectures. This can help optimize the compute efficiency of your functions.

Learn more about this capability in our AWS Lambda Developer Guide.

Using EC2 Auto Scaling predictive scaling policies with Blue/Green deployments

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/retaining-metrics-across-blue-green-deployment-for-predictive-scaling/

This post is written by Ankur Sethi, Product Manager for EC2.

Amazon EC2 Auto Scaling allows customers to realize the elasticity benefits of AWS by automatically launching and shutting down instances to match application demand. Earlier this year we introduced predictive scaling, a new EC2 Auto Scaling policy that predicts demand and proactively scales capacity, resulting in better availability of your applications (if you are new to predictive scaling, I suggest you read this blog post before proceeding). In this blog, I will walk you through how to use a new feature, predictive scaling custom metrics, to configure predictive scaling for an application that follows a Blue/Green deployment strategy.

Blue/Green Deployment using Auto Scaling groups

The fundamental idea behind Blue/Green deployment is to shift traffic between two environments that are running different versions of your application. The Blue environment represents your current application version serving production traffic. In parallel, the Green environment is staged running the newer version. After the Green environment is ready and tested, production traffic is redirected from Blue to Green either all at once or in increments, similar to canary deployments. At the end of the load transfer, you can either terminate the Blue Auto Scaling group or reuse it to stage the next version update. Irrespective of the approach, when a new Auto Scaling group is created as part of Blue/Green deployment, EC2 Auto Scaling, and in turn predictive scaling, does not know that this new Auto Scaling group is running the same application that the Blue one was. Predictive scaling needs a minimum of 24 hours of historical metric data and up to 14 days for the most accurate results, neither of which the new Auto Scaling group has when the Blue/Green deployment is initiated. This means that if you frequently conduct Blue/Green deployments, predictive scaling regularly pauses for at least 24 hours, and you may experience less optimal forecasts after each deployment.

In Blue/Green deployment you have two Auto Scaling groups - Blue Auto Scaling Group running the current version and Green Auto Scaling group staged with the updated version. Once you are ready to make the updated version live, you switch production traffic from Blue to Green through your load balancer or your DNS settings.

Figure 1. In Blue/Green deployment you have two Auto Scaling groups running different versions of an application. You switch production traffic from Blue to Green to make the updated version public.

How to retain your application load history using predictive scaling custom metrics

To make predictive scaling work for Blue/Green deployment scenarios, we need to aggregate load metrics from both Blue and Green environments before using them to forecast capacity, as depicted in the following illustration. The key benefit of using the aggregated metric is that, throughout the Blue/Green deployment, predictive scaling can continue to forecast load correctly without a pause, and it can retain the entire 14 days of data to provide the best predictions. For example, if your application observes different patterns during a weekday vs. a weekend, predictive scaling will be able to retain knowledge of that pattern after the deployment.

The aggregated metrics of Blue and Green Auto Scaling groups give you the total load traffic of an application. Prior to Blue/Green deployment, Blue Auto Scaling group served the entire traffic while after the deployment, Green Auto Scaling group handles it. There can be a period of overlap where traffic is split between the two Auto Scaling groups. By adding the traffic on two Auto Scaling groups, you get a single time series which allows predictive scaling to generate forecasts based on complete set of 14 days of history.

Figure 2. The aggregated metrics of Blue and Green Auto Scaling groups give you the total load traffic of an application. Predictive scaling gives most accurate forecasts when based on last 14 days of history.
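To make the aggregation concrete, here is a minimal Python sketch. The hourly load values and the aggregate_load helper are illustrative assumptions, not data from the sample stack. It adds the Blue and Green series point by point and, as a simplification of this sketch, treats an hour with no data for one group as zero load for that group.

```python
def aggregate_load(blue, green):
    """Add two hourly load series point by point.

    Each series maps an hour index to a load value. An hour missing from
    one series counts as zero load for that group (an assumption of this
    sketch, not a statement about CloudWatch behavior).
    """
    hours = sorted(set(blue) | set(green))
    return {h: blue.get(h, 0.0) + green.get(h, 0.0) for h in hours}


# Hypothetical hourly load: Blue serves hours 0-3, Green takes over from hour 3.
blue = {0: 100.0, 1: 120.0, 2: 110.0, 3: 40.0}
green = {3: 70.0, 4: 115.0, 5: 105.0}

total = aggregate_load(blue, green)
for hour, load in total.items():
    print(hour, load)
```

The combined series has no gap at the switchover hour, which is the property that lets a forecast continue across the deployment.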

Example

Let’s explore this solution with an example. I created a sample application and load simulation infrastructure that you can use to follow along by deploying this example AWS CloudFormation Stack in your account. This example deploys two Auto Scaling groups: ASG-myapp-v1 (Blue) and ASG-myapp-v2 (Green) to run a sample application. Only ASG-myapp-v1 is attached to a load balancer and has recurring requests generated for its application. I have applied a target tracking policy and predictive scaling policy to maintain CPU utilization at 25%. You should keep this Auto Scaling group running for at least 24 hours before proceeding with the rest of the example to have enough load generated for predictive scaling to start forecasting.

ASG-myapp-v2 does not have any requests generated of its own. In the following sections, to highlight how metric aggregation works, I will apply a predictive scaling policy to it using Custom Metric configurations aggregating CPU Utilization metrics of both Auto Scaling groups. I’ll then verify if the forecasts are generated for ASG-myapp-v2 based on the aggregated metrics.

As part of your Blue/Green deployment approach, if you alternate between exactly two Auto Scaling groups, then you can use simple math expressions such as SUM (m1, m2) where m1 and m2 are metrics for each Auto Scaling group. However, if you create new Auto Scaling groups for each deployment, then you need to refer to the metrics of all the Auto Scaling groups that were used to run the application in the last 14 days. You can simplify this task by following a naming convention for your Auto Scaling groups and leveraging the SEARCH expression to select the required metrics. The naming convention is ASG-myapp-vx, where we name the new Auto Scaling group according to the version number (ASG-myapp-v1, ASG-myapp-v2, and so on). Using the SEARCH('{Namespace, DimensionName1, DimensionName2} SearchTerm', 'Statistic', Period) expression, I can identify the metrics of all the Auto Scaling groups whose names match the SearchTerm. I can then aggregate the metrics by wrapping the search in another expression. The final expression should look like SUM(SEARCH(…)).
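As a rough illustration of how such an expression is put together, the following Python sketch assembles a SUM(SEARCH(...)) string from its parts. The build_sum_search helper is hypothetical and exists only for this example. The namespace, dimension, metric name, and search term follow the ASG-myapp naming convention described above.

```python
import json


def build_sum_search(namespace, dimension, metric_name, search_term, statistic, period):
    """Return a SUM(SEARCH(...)) metric math expression as a string."""
    schema = "{" + namespace + "," + dimension + "}"
    query = f'{schema} MetricName="{metric_name}" {search_term}'
    return f"SUM(SEARCH('{query}', '{statistic}', {period}))"


expr = build_sum_search(
    "AWS/EC2", "AutoScalingGroupName", "CPUUtilization", "ASG-myapp", "Sum", 3600
)
print(expr)

# json.dumps shows how the inner double quotes are escaped inside a JSON file.
print(json.dumps({"Expression": expr}))
```

Generating the string in code is optional. The point is that the inner double quotes around the metric name must be escaped once the expression is embedded in a JSON policy file.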

Step 1: Apply predictive scaling policy to Green Auto Scaling group ASG-myapp-v2 with custom metrics

To generate forecasts, the predictive scaling algorithm needs three metrics as input: a load metric that represents total demand on an Auto Scaling group, the number of instances that represents the capacity of the Auto Scaling groups, and a scaling metric that represents the average utilization of the instances in the Auto Scaling groups.

Here is how it would work with CPU Utilization metrics. First, create a scaling configuration file where you define the metrics, target value, and the predictive scaling mode for your policy.

cat <<EoF > predictive-scaling-policy-cpu.json
{
    "MetricSpecifications": [
        {
            "TargetValue": 25,
            "CustomizedLoadMetricSpecification": {
            },
            "CustomizedCapacityMetricSpecification": {
            },
            "CustomizedScalingMetricSpecification": {
            }
        }
    ],
    "Mode": "ForecastOnly"
}
EoF

I’ll elaborate on each of these metric specifications separately in the following sections. You can download the complete JSON file in GitHub.

Customized Load Metric Specification: You can aggregate the demand across your Auto Scaling groups by using the SUM expression. The demand forecasts are generated every hour, so this metric has to be aggregated with a time period of 3600 seconds.

"CustomizedLoadMetricSpecification": {
    "MetricDataQueries": [
        {
            "Id": "load_sum",
            "Expression": "SUM(SEARCH('{AWS/EC2,AutoScalingGroupName} MetricName=\"CPUUtilization\" ASG-myapp', 'Sum', 3600))"
        }
    ]
}

Customized Capacity Metric Specification: Your customized capacity metric represents the total number of instances across your Auto Scaling groups. Similar to the load metric, the aggregation across Auto Scaling groups is done by using the SUM expression. Note that this metric must use a period of 300 seconds.

"CustomizedCapacityMetricSpecification": {
    "MetricDataQueries": [
        {
            "Id": "capacity_sum",
            "Expression": "SUM(SEARCH('{AWS/AutoScaling,AutoScalingGroupName} MetricName=\"GroupInServiceInstances\" ASG-myapp', 'Average', 300))"
        }
    ]
}

Customized Scaling Metric Specification: Your customized scaling metric represents the average utilization of the instances across your Auto Scaling groups. We cannot simply SUM the scaling metric of each Auto Scaling group, as the utilization is an average metric that depends on the capacity and demand of the Auto Scaling group. Instead, we need to find the weighted average unit load (Load Metric/Capacity). To do so, we will use an expression: Sum(load)/Sum(capacity). Note that this metric must also use a period of 300 seconds.

"CustomizedScalingMetricSpecification": {
    "MetricDataQueries": [
        {
            "Id": "capacity_sum",
            "Expression": "SUM(SEARCH('{AWS/AutoScaling,AutoScalingGroupName} MetricName=\"GroupInServiceInstances\" ASG-myapp', 'Average', 300))",
            "ReturnData": false
        },
        {
            "Id": "load_sum",
            "Expression": "SUM(SEARCH('{AWS/EC2,AutoScalingGroupName} MetricName=\"CPUUtilization\" ASG-myapp', 'Sum', 300))",
            "ReturnData": false
        },
        {
            "Id": "weighted_average",
            "Expression": "load_sum / capacity_sum"
        }
    ]
}
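To see why the scaling metric is computed as Sum(load)/Sum(capacity) rather than by combining per-group averages, consider this small worked example in Python. The instance counts and CPU percentages are made up for illustration.

```python
def weighted_average_utilization(groups):
    """Compute Sum(load) / Sum(capacity) across Auto Scaling groups.

    groups is a list of (instance_count, average_cpu_percent) tuples.
    """
    # Total CPU across all instances, equivalent to the 'Sum' statistic.
    load_sum = sum(count * cpu for count, cpu in groups)
    capacity_sum = sum(count for count, _ in groups)
    return load_sum / capacity_sum


# Hypothetical snapshot: Blue has 4 instances at 50% CPU, Green has 1 at 10%.
groups = [(4, 50.0), (1, 10.0)]

naive = sum(cpu for _, cpu in groups) / len(groups)
weighted = weighted_average_utilization(groups)

print(naive)     # 30.0, treats both groups as equally large
print(weighted)  # 42.0, reflects that most instances are in Blue
```

The naive average understates utilization here because it gives the single Green instance the same weight as the four Blue instances.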

Once you have created the configuration file, you can run the following CLI command to add the predictive scaling policy to your Green Auto Scaling group.

aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "ASG-myapp-v2" \
    --policy-name "CPUUtilizationpolicy" \
    --policy-type "PredictiveScaling" \
    --predictive-scaling-configuration file://predictive-scaling-policy-cpu.json

The forecasts are generated immediately for the Green Auto Scaling group (ASG-myapp-v2), as if this new Auto Scaling group had been running the application all along. You can validate this using the predictive scaling forecasts API. You can also use the console to review forecasts by navigating to the Amazon EC2 Auto Scaling console, selecting the Auto Scaling group that you configured with predictive scaling, and viewing the predictive scaling policy located under the Automatic Scaling section of the Auto Scaling group details view.
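For readers who prefer to script the validation, here is a minimal Python sketch that builds the arguments for a forecast query. The forecast_request helper and the choice of a window from one day back to two days ahead are assumptions made for this example, and the sketch itself makes no AWS call.

```python
from datetime import datetime, timedelta, timezone


def forecast_request(group_name, policy_name, now=None, days_back=1, days_ahead=2):
    """Build keyword arguments for a predictive scaling forecast query."""
    now = now or datetime.now(timezone.utc)
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": policy_name,
        "StartTime": now - timedelta(days=days_back),
        "EndTime": now + timedelta(days=days_ahead),
    }


params = forecast_request("ASG-myapp-v2", "CPUUtilizationpolicy")
print(params)
```

With boto3 installed and credentials configured, these arguments could be passed along as boto3.client("autoscaling").get_predictive_scaling_forecast(**params). The response is expected to carry the load and capacity forecasts that the console charts display.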

EC2 Auto Scaling console shows you the capacity and load forecasts generated by your predictive scaling policies against the actual metric values. In this case, we are looking at the forecasts generated for Green Auto Scaling group. Since we aggregated metrics across Auto Scaling groups, the forecasts are generated as if this Auto Scaling group has been running the application from the beginning. You see the actual load and capacity values also aggregated for easier comparison of the forecasted and actual values.

Figure 3. EC2 Auto Scaling console showing capacity and load forecasts for Green Auto Scaling group. The forecasts are generated as if this Auto Scaling group has been running the application from the beginning.

Step 2: Terminate ASG-myapp-v1 and see predictive scaling forecasts continuing

Now complete the Blue/Green deployment pattern by terminating the Blue Auto Scaling group, and then go to the console to check if the forecasts are retained for the Green Auto Scaling group.

aws autoscaling delete-auto-scaling-group \
 --auto-scaling-group-name ASG-myapp-v1

You can quickly check the forecasts on the console for ASG-myapp-v2 to find that terminating the Blue Auto Scaling group has no impact on the forecasts of the Green one. The forecasts are all based on aggregated metrics. As you continue to do Blue/Green deployments in the future, the history of all the prior Auto Scaling groups will persist, ensuring that the predictions are always based on the complete set of metric history. Before we conclude, remember to delete the resources you created: to avoid unnecessary costs, delete the CloudFormation stack used in this example.

Conclusion

Custom metrics give you the flexibility to base predictive scaling on metrics that most accurately represent the load on your Auto Scaling groups. This blog focused on the use case where we aggregated metrics from different Auto Scaling groups across Blue/Green deployments to get accurate forecasts from predictive scaling. You don’t have to wait for 24 hours to get the first set of forecasts or manually set capacity when the new Auto Scaling group is created to deploy an updated version of the application. You can read about other use cases of custom metrics and metric math in the public documentation such as scaling based on queue metrics.

New – Amazon EC2 R6i Memory-Optimized Instances Powered by the Latest Generation Intel Xeon Scalable Processors

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-ec2-r6i-memory-optimized-instances-powered-by-the-latest-generation-intel-xeon-scalable-processors/

In August, we introduced the general-purpose Amazon EC2 M6i instances powered by the latest generation Intel Xeon Scalable processors (code-named Ice Lake) with an all-core turbo frequency of 3.5 GHz. Compute-optimized EC2 C6i instances were also made available last month.

Today, I am happy to share that we are expanding our sixth-generation x86-based offerings to include memory-optimized Amazon EC2 R6i instances.

Here’s a quick recap of the advantages of the new R6i instances compared to R5 instances:

  • A larger instance size (r6i.32xlarge) with 128 vCPUs and 1,024 GiB of memory that makes it easier and more cost-efficient to consolidate workloads and scale up applications
  • Up to 15 percent improvement in compute price/performance
  • Up to 20 percent higher memory bandwidth
  • Up to 40 Gbps for Amazon Elastic Block Store (EBS) and 50 Gbps for networking which is 2x more than R5 instances
  • Always-on memory encryption.

R6i instances are SAP Certified and are an ideal fit for memory-intensive workloads such as SQL and NoSQL databases, distributed web scale in-memory caches like Memcached and Redis, in-memory databases, and real-time big data analytics like Apache Hadoop and Apache Spark clusters.

Compared to M6i and C6i instances, the only difference is in the amount of memory that is included per vCPU. R6i instances are available in ten sizes:

Name | vCPUs | Memory (GiB) | Network Bandwidth (Gbps) | EBS Throughput (Gbps)
r6i.large | 2 | 16 | Up to 12.5 | Up to 10
r6i.xlarge | 4 | 32 | Up to 12.5 | Up to 10
r6i.2xlarge | 8 | 64 | Up to 12.5 | Up to 10
r6i.4xlarge | 16 | 128 | Up to 12.5 | Up to 10
r6i.8xlarge | 32 | 256 | 12.5 | 10
r6i.12xlarge | 48 | 384 | 18.75 | 15
r6i.16xlarge | 64 | 512 | 25 | 20
r6i.24xlarge | 96 | 768 | 37.5 | 30
r6i.32xlarge | 128 | 1024 | 50 | 40
r6i.metal | 128 | 1024 | 50 | 40

Like M6i and C6i instances, these new R6i instances are built on the AWS Nitro System, which is a collection of building blocks that offloads many of the traditional virtualization functions to dedicated hardware, delivering high performance, high availability, and highly secure cloud instances.

As with all sixth generation EC2 instances, you may need to upgrade your Elastic Network Adapter (ENA) for optimal networking performance. For more information, see this article about migrating an EC2 instance to a sixth-generation instance in the AWS Knowledge Center.

R6i instances support Elastic Fabric Adapter (EFA) on r6i.32xlarge and r6i.metal instances for workloads that benefit from lower network latency, such as HPC and video processing.

Availability and Pricing
EC2 R6i instances are available today in four AWS Regions: US East (N. Virginia), US West (Oregon), US East (Ohio), and Europe (Ireland). As usual with EC2, you pay for what you use. For more information, see the EC2 pricing page.

Danilo

Insulating AWS Outposts Workloads from Amazon EC2 Instance Size, Family, and Generation Dependencies

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/insulating-aws-outposts-workloads-from-amazon-ec2-instance-size-family-and-generation-dependencies/

This post is written by Garry Galinsky, Senior Solutions Architect.

AWS Outposts is a fully managed service that offers the same AWS infrastructure, AWS services, APIs, and tools to virtually any datacenter, co-location space, or on-premises facility for a truly consistent hybrid experience. AWS Outposts is ideal for workloads that require low-latency access to on-premises systems, local data processing, data residency, and application migration with local system interdependencies.

Unlike AWS Regions, which offer near-infinite scale, Outposts are limited by their provisioned capacity, EC2 family and generations, configured instance sizes, and availability of compute capacity that is not already consumed by other workloads. This post explains how Amazon EC2 Fleet can be used to insulate workloads running on Outposts from EC2 instance size, family, and generation dependencies, reducing the likelihood of encountering an error when launching new workloads or scaling existing ones.

Product Overview

Outposts is available as a 42U rack that can scale to create pools of on-premises compute and storage capacity. When you order an Outposts rack, you specify the quantity, family, and generation of Amazon EC2 instances to be provisioned. As of this writing, five EC2 families, each of a single generation, are available on Outposts (m5, c5, r5, g4dn, and i3en). However, in the future, more families and generations may be available, and a given Outposts rack may include a mix of families and generations. EC2 servers on Outposts are partitioned into instances of homogeneous or heterogeneous sizes (e.g., large, 2xlarge, 12xlarge) based on your workload requirements.

Workloads deployed through AWS CloudFormation or scaled through Amazon EC2 Auto Scaling generally assume that the required EC2 instance type will be available when the deployment or scaling event occurs. Although in the Region this is a reasonable assumption, the same is not true for Outposts. Whether as a result of competing workloads consuming the capacity, the Outpost having been configured with limited capacity for a given instance size, or an Outpost update resulting in instances being replaced with a newer generation, a deployment or scaling event tied to a specific instance size, family, and generation may encounter an InsufficientInstanceCapacity error (ICE). This may occur even though sufficient unused capacity of a different size, family, or generation is available.

EC2 Fleet

Amazon EC2 Fleet simplifies the provisioning of Amazon EC2 capacity across different Amazon EC2 instance types and Availability Zones, as well as across On-Demand, Amazon EC2 Reserved Instances (RI), and Amazon EC2 Spot purchase models. A single API call lets you provision capacity across EC2 instance types and purchase models in order to achieve the desired scale, performance, and cost.

An EC2 Fleet contains a configuration to launch a fleet, or group, of EC2 instances. The LaunchTemplateConfigs parameter lets multiple instance size, family, and generation combinations be specified in a priority order.

This feature is commonly used in AWS Regions to optimize fleet costs and allocations across multiple deployment strategies (reserved, on-demand, and spot), while on Outposts it can be used to eliminate the tight coupling of a workload to specific EC2 instances by specifying multiple instance families, generations, and sizes.

Launch Template Overrides

The EC2 Fleet LaunchTemplateConfigs definition describes the EC2 instances required for the fleet. A specific parameter of this definition, the Overrides, can include prioritized and/or weighted options of EC2 instances that can be launched to satisfy the workload. Let’s investigate how you can use Overrides to decouple the EC2 size, family, and generation dependencies.

Overriding EC2 Instance Size

Let’s assume our Outpost was provisioned with an m5 server. The server is the equivalent of an m5.24xlarge, which can be configured into multiple instances. For example, the server can be homogeneously provisioned into 12 x m5.2xlarge, or heterogeneously into 1 x m5.8xlarge, 3 x m5.2xlarge, 8 x m5.xlarge, and 4 x m5.large. Let’s assume the heterogeneous configuration has been applied.
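As a quick sanity check on these layouts, the following Python sketch adds up the vCPUs of each partitioning. The vCPU counts per m5 size are the published figures. The total_vcpus helper is just for this example.

```python
# vCPUs per m5 instance size.
VCPUS = {"large": 2, "xlarge": 4, "2xlarge": 8, "4xlarge": 16, "8xlarge": 32, "24xlarge": 96}


def total_vcpus(layout):
    """Sum the vCPUs of a layout given as {size: instance_count}."""
    return sum(count * VCPUS[size] for size, count in layout.items())


homogeneous = {"2xlarge": 12}
heterogeneous = {"8xlarge": 1, "2xlarge": 3, "xlarge": 8, "large": 4}

print(total_vcpus(homogeneous))    # 96
print(total_vcpus(heterogeneous))  # 96
print(VCPUS["24xlarge"])           # 96
```

Both layouts account for every vCPU of the m5.24xlarge-equivalent server, yet neither leaves a slot of the m5.4xlarge size.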

Our workload requires compute capacity equivalent to an m5.4xlarge (16 vCPUs, 64 GiB memory), but that instance size is not available on the Outpost. Attempting to launch this instance would result in an InsufficientInstanceCapacity error. Instead, the following LaunchTemplateConfigs override could be used:

"Overrides": [
    {
        "InstanceType": "m5.4xlarge",
        "WeightedCapacity": 1.0,
        "Priority": 1.0
    },
    {
        "InstanceType": "m5.2xlarge",
        "WeightedCapacity": 0.5,
        "Priority": 2.0
    },
    {
        "InstanceType": "m5.8xlarge",
        "WeightedCapacity": 2.0,
        "Priority": 3.0
    }
]

The Priority describes our order of preference. Ideally, we launch a single m5.4xlarge instance, but that’s not an option. Therefore, in this case, the EC2 Fleet would move to the next priority option, an m5.2xlarge. Given that an m5.2xlarge (8 vCPUs, 32 GiB memory) offers only half of the resource of the m5.4xlarge, the override includes the WeightedCapacity parameter of 0.5, resulting in two m5.2xlarge instances launching instead of one.

Our overrides include a third, over-provisioned and less preferable option, should the Outpost lack capacity for two m5.2xlarge instances: launch one m5.8xlarge. Operating within finite resources requires tradeoffs, and priority lets us optimize them. Note that had the launch required 2 x m5.4xlarge, only one m5.8xlarge instance would have been launched.
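The instance counts in this walkthrough are consistent with dividing the target capacity by the WeightedCapacity and rounding up. The Python sketch below reproduces that arithmetic. It is a simplified model of the behavior described above, not the fleet's actual allocation logic.

```python
import math


def instances_needed(target_capacity, weighted_capacity):
    """Instances required to reach the target, rounding up to whole instances."""
    return math.ceil(target_capacity / weighted_capacity)


# Target of 1 unit, where one unit corresponds to one m5.4xlarge.
print(instances_needed(1, 1.0))  # m5.4xlarge -> 1
print(instances_needed(1, 0.5))  # m5.2xlarge -> 2
print(instances_needed(1, 2.0))  # m5.8xlarge -> 1

# Target of 2 units: a single m5.8xlarge (weight 2.0) is enough.
print(instances_needed(2, 2.0))  # -> 1
```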

Overriding EC2 Instance Family

Let’s assume our Outpost was provisioned with an m5 and a c5 server, homogeneously partitioned into 12 x m5.2xlarge and 12 x c5.2xlarge instances. Our workload requires compute capacity equivalent to a c5.2xlarge instance (8 vCPUs, 16 GiB memory). As our workload scales, more instances must be launched to meet demand. If we couple our workload to c5.2xlarge, then our scaling will be blocked as soon as all 12 instances are consumed. Instead, we use the following LaunchTemplateConfigs override:

"Overrides": [
    {
        "InstanceType": "c5.2xlarge",
        "WeightedCapacity": 1.0,
        "Priority": 1.0
    },
    {
        "InstanceType": "m5.2xlarge",
        "WeightedCapacity": 1.0,
        "Priority": 2.0
    }
]

The Priority describes our order of preference. Ideally, we scale more c5.2xlarge instances, but when those are not an option EC2 Fleet would launch the next priority option, an m5.2xlarge. Here again the outcome may result in over-provisioned memory capacity (32 vs 16 GiB memory), but it’s a reasonable tradeoff in a finite resource environment.

Overriding EC2 Instance Generation

Let’s assume our Outpost was provisioned two years ago with an m5 server. Since then, m6 servers have become available, and there’s an expectation that m7 servers will be available soon. Our single-generation Outpost may unexpectedly become multi-generation if the Outpost is expanded, or if a hardware failure results in a newer generation replacement.

Coupling our workload to a specific generation could result in future scaling challenges. Instead, we use the following LaunchTemplateConfigs override:

"Overrides": [
    {
        "InstanceType": "m6.2xlarge",
        "WeightedCapacity": 1.0,
        "Priority": 1.0
    },
    {
        "InstanceType": "m5.2xlarge",
        "WeightedCapacity": 1.0,
        "Priority": 2.0
    },
    {
        "InstanceType": "m7.2xlarge",
        "WeightedCapacity": 1.0,
        "Priority": 3.0
    }

]

Note the Priority here: our preference is for the current generation m6, even though it’s not yet provisioned in our Outpost. The m5 is what would be launched now, given that it’s the only provisioned generation. However, we’ve also future-proofed our workload by including the as-yet-unreleased m7.
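The fallback behavior can be modeled in a few lines of Python. This is a simplified sketch: the pick_override helper and the free-slot counts are hypothetical, and the model assumes the whole request is satisfied by a single instance type.

```python
import math


def pick_override(overrides, free_slots, target_capacity=1):
    """Return (instance_type, count) for the first override, in priority
    order, that fits in the free slots, or None if none fits."""
    for o in sorted(overrides, key=lambda o: o["Priority"]):
        count = math.ceil(target_capacity / o["WeightedCapacity"])
        # A type that is not provisioned has zero free slots.
        if free_slots.get(o["InstanceType"], 0) >= count:
            return o["InstanceType"], count
    return None


overrides = [
    {"InstanceType": "m6.2xlarge", "WeightedCapacity": 1.0, "Priority": 1.0},
    {"InstanceType": "m5.2xlarge", "WeightedCapacity": 1.0, "Priority": 2.0},
    {"InstanceType": "m7.2xlarge", "WeightedCapacity": 1.0, "Priority": 3.0},
]

# Today only m5 is provisioned; after a hypothetical expansion, m6 is too.
today = pick_override(overrides, {"m5.2xlarge": 3})
after_expansion = pick_override(overrides, {"m5.2xlarge": 3, "m6.2xlarge": 2})

print(today)            # ('m5.2xlarge', 1)
print(after_expansion)  # ('m6.2xlarge', 1)
```

The same override list yields a different instance type once newer capacity exists, without any change to the workload definition.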

Deploying an EC2 Fleet

To deploy an EC2 Fleet, you must:

  1. Create a launch template, which streamlines and standardizes EC2 instance provisioning by simplifying permission policies and enforcing best practices across your organization.
  2. Create a fleet configuration, where you set the number of instances required and specify the prioritized instance family/generation combinations.
  3. Launch your fleet (or a single EC2 instance).

These steps can be codified through AWS CloudFormation or executed through AWS Command Line Interface (CLI) commands. However, fleet definitions cannot be implemented by using the AWS Console. This example will use CLI commands to conduct these steps.

Prerequisites

To follow along with this tutorial, you should have the following prerequisites:

Create a Launch Template

Launch templates let you store launch parameters so that you do not have to specify them every time you launch an EC2 instance. A launch template can contain the Amazon Machine Image (AMI) ID, instance type, and network settings that you typically use to launch instances. For more details about launch templates, see Launch an instance from a launch template.

For this example, we will focus on these specifications:

  • AMI image ImageId
  • Subnet (the SubnetId associated with your Outpost)
  • Availability zone (the AvailabilityZone associated with your Outpost)
  • Tags

Create a launch template configuration (launch-template.json) with the following content:

{
    "ImageId": "<YOUR-AMI>",
    "NetworkInterfaces": [
        {
            "DeviceIndex": 0,
            "SubnetId": "<YOUR-OUTPOST-SUBNET>"
        }
    ],
    "Placement": {
        "AvailabilityZone": "<YOUR-OUTPOST-AZ>"
    },
    "TagSpecifications": [
        {
            "ResourceType": "instance",
            "Tags": [
                {
                    "Key": "<YOUR-TAG-KEY>",
                    "Value": "<YOUR-TAG-VALUE>"
                }
            ]
        }
    ]
}

Create your launch template using the following CLI command:

aws ec2 create-launch-template \
  --launch-template-name <YOUR-LAUNCH-TEMPLATE-NAME> \
  --launch-template-data file://launch-template.json

You should see a response like this:

{
    "LaunchTemplate": {
        "LaunchTemplateId": "lt-010654c96462292e8",
        "LaunchTemplateName": "<YOUR-LAUNCH-TEMPLATE-NAME>",
        "CreateTime": "2021-07-12T15:55:00+00:00",
        "CreatedBy": "arn:aws:sts::<YOUR-AWS-ACCOUNT>:assumed-role/<YOUR-AWS-ROLE>",
        "DefaultVersionNumber": 1,
        "LatestVersionNumber": 1
    }
}

The value for LaunchTemplateId is the identifier for your newly created launch template. You will need this value lt-010654c96462292e8 in the subsequent step.

Create a Fleet Configuration

Refer to Generate an EC2 Fleet JSON configuration file for full documentation on the EC2 Fleet configuration.

For this example, we will use this configuration to override a mix of instance size, family, and generation. The override includes five EC2 instance types, in priority order:

  • m6.2xlarge, a forthcoming generation not yet available for Outposts.
  • c5.2xlarge, a different family that is available on Outposts.
  • m5.large, a smaller size of the m5 family, weighted at 0.25 so that four instances stand in for one 2xlarge.
  • m5.2xlarge and r5.2xlarge, further fallbacks of equal weight.

Create an EC2 fleet configuration (ec2-fleet.json) with the following content (note that the LaunchTemplateId was the value returned in the prior step):

{
    "TargetCapacitySpecification": {
        "TotalTargetCapacity": 1,
        "OnDemandTargetCapacity": 1,
        "SpotTargetCapacity": 0,
        "DefaultTargetCapacityType": "on-demand"
    },
    "OnDemandOptions": {
        "AllocationStrategy": "prioritized",
        "SingleInstanceType": true,
        "SingleAvailabilityZone": true,
        "MinTargetCapacity": 1
    },
    "LaunchTemplateConfigs": [
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-010654c96462292e8",
                "Version": "1"
            },
            "Overrides": [
                {
                    "InstanceType": "m6.2xlarge",
                    "WeightedCapacity": 1.0,
                    "Priority": 1.0
                },
                {
                    "InstanceType": "c5.2xlarge",
                    "WeightedCapacity": 1.0,
                    "Priority": 2.0
                },
                {
                    "InstanceType": "m5.large",
                    "WeightedCapacity": 0.25,
                    "Priority": 3.0
                },
                {
                    "InstanceType": "m5.2xlarge",
                    "WeightedCapacity": 1.0,
                    "Priority": 4.0
                },
                {
                    "InstanceType": "r5.2xlarge",
                    "WeightedCapacity": 1.0,
                    "Priority": 5.0
                }


            ]
        }
    ],
    "Type": "instant"
}
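The WeightedCapacity values in this configuration are consistent with vCPU ratios relative to a 2xlarge instance. The following Python sketch shows that derivation as one possible convention; weighting by vCPU count is an assumption of this example, and other bases such as memory would be equally valid. The hypothetical m6.2xlarge is left out because its specifications are not given here.

```python
# vCPUs for the instance types in the configuration that exist today.
VCPUS = {"m5.large": 2, "m5.2xlarge": 8, "c5.2xlarge": 8, "r5.2xlarge": 8}


def weight(instance_type, reference="m5.2xlarge"):
    """WeightedCapacity as a vCPU ratio against the reference type."""
    return VCPUS[instance_type] / VCPUS[reference]


for name in ("c5.2xlarge", "m5.large", "m5.2xlarge", "r5.2xlarge"):
    print(name, weight(name))
```

With a TotalTargetCapacity of 1, a weight of 0.25 for m5.large means four of them are needed to reach the target.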

Launch the Single Instance Fleet

To launch the fleet, execute the following CLI command (this will launch a single instance, but a similar process can be used to launch multiple):

aws ec2 create-fleet \
  --cli-input-json file://ec2-fleet.json

You should see a response like this:

{
    "FleetId": "fleet-dc630649-5d77-60b3-2c30-09808ef8aa90",
    "Errors": [
        {
            "LaunchTemplateAndOverrides": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": "lt-010654c96462292e8",
                    "Version": "1"
                },
                "Overrides": {
                    "InstanceType": "m6.2xlarge",
                    "WeightedCapacity": 1.0,
                    "Priority": 1.0
                }
            },
            "Lifecycle": "on-demand",
            "ErrorCode": "InvalidParameterValue",
            "ErrorMessage": "The instance type 'm6.2xlarge' is not supported in Outpost 'arn:aws:outposts:us-west-2:111111111111:outpost/op-0000ffff0000fffff'."
        },
        {
            "LaunchTemplateAndOverrides": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": "lt-010654c96462292e8",
                    "Version": "1"
                },
                "Overrides": {
                    "InstanceType": "c5.2xlarge",
                    "WeightedCapacity": 1.0,
                    "Priority": 2.0
                }
            },
            "Lifecycle": "on-demand",
            "ErrorCode": "InsufficientCapacityOnOutpost",
            "ErrorMessage": "There is not enough capacity on the Outpost to launch or start the instance."
        }
    ],
    "Instances": [
        {
            "LaunchTemplateAndOverrides": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": "lt-010654c96462292e8",
                    "Version": "1"
                },
                "Overrides": {
                    "InstanceType": "m5.large",
                    "WeightedCapacity": 0.25,
                    "Priority": 3.0
                }
            },
            "Lifecycle": "on-demand",
            "InstanceIds": [
                "i-03d6323c8a1df8008",
                "i-0f62593c8d228dba5",
                "i-0ae25baae1f621c15",
                "i-0af7e688d0460a60a"
            ],
            "InstanceType": "m5.large"
        }
    ]
}

Results

Navigate to the EC2 Console where you will find new instances running on your Outpost. An example is shown in the following screenshot:

EC2 running instances, AWS console, network view, filtered by tag

Although multiple instance size, family, and generation combinations were included in the Overrides, only the m5.large was available on the Outpost. Instead of launching one m6.2xlarge, four m5.large instances were launched in order to compensate for their lower WeightedCapacity. In the create-fleet response, the overrides were clearly evaluated in priority order, with the error messages explaining why the top two overrides were skipped.
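If you script fleet launches, the response can be summarized programmatically. The Python sketch below parses a trimmed copy of the response shown earlier and lists which overrides were skipped and what was launched. The summarize helper is hypothetical, and the trimmed JSON keeps only the fields the helper reads.

```python
import json

# Trimmed copy of the create-fleet response shown earlier.
RESPONSE = json.loads("""
{
  "Errors": [
    {"LaunchTemplateAndOverrides": {"Overrides": {"InstanceType": "m6.2xlarge"}},
     "ErrorCode": "InvalidParameterValue"},
    {"LaunchTemplateAndOverrides": {"Overrides": {"InstanceType": "c5.2xlarge"}},
     "ErrorCode": "InsufficientCapacityOnOutpost"}
  ],
  "Instances": [
    {"InstanceType": "m5.large",
     "InstanceIds": ["i-03d6323c8a1df8008", "i-0f62593c8d228dba5",
                     "i-0ae25baae1f621c15", "i-0af7e688d0460a60a"]}
  ]
}
""")


def summarize(response):
    """Return (skipped, launched) lists from a create-fleet response."""
    skipped = [
        (e["LaunchTemplateAndOverrides"]["Overrides"]["InstanceType"], e["ErrorCode"])
        for e in response.get("Errors", [])
    ]
    launched = [
        (i["InstanceType"], len(i["InstanceIds"]))
        for i in response.get("Instances", [])
    ]
    return skipped, launched


skipped, launched = summarize(RESPONSE)
print(skipped)
print(launched)
```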

Clean up

AWS CLI EC2 commands can be used to create fleets but can also be used to delete them.

To clean up the resources created in this tutorial:

  1. Note the FleetId values returned in the create-fleet command.
  2. Run the following command for each fleet created:

aws ec2 delete-fleets \
  --fleet-ids <YOUR-FLEET-ID> \
  --terminate-instances

You should see a response like this:

{
    "SuccessfulFleetDeletions": [
        {
            "CurrentFleetState": "deleted_terminating",
            "PreviousFleetState": "active",
            "FleetId": "fleet-dc630649-5d77-60b3-2c30-09808ef8aa90"
        }
    ],
    "UnsuccessfulFleetDeletions": []
}

  3. Note the launch-template-name used in the create-launch-template command.
  4. Delete each launch template created, using the aws ec2 delete-launch-template command with that name.
  5. Clean up any resources you created for the prerequisites.

Conclusion

This post discussed how EC2 Fleet can be used to decouple the availability of specific EC2 instance sizes, families, and generation from the ability to launch or scale workloads. On an Outpost provisioned with multiple families of EC2 instances (say m5 and c5) and different sizes (say m5.large and m5.2xlarge), EC2 Fleet can be used to satisfy a workload launch request even if the capacity of the preferred instance size, family, or generation is unavailable.

To learn more about AWS Outposts, check out the Outposts product page. To see a full list of pre-defined Outposts configurations, visit the Outposts pricing page.

Setting up EC2 Mac instances as shared remote development environments

Post Syndicated from Rick Armstrong original https://aws.amazon.com/blogs/compute/setting-up-ec2-mac-instances-as-shared-remote-development-environments/

This post is written by: Michael Meidlinger, Solutions Architect 

In December 2020, we announced a macOS-based Amazon Elastic Compute Cloud (Amazon EC2) instance. Amazon EC2 Mac instances let developers build, test, and package their applications for every Apple platform, including macOS, iOS, iPadOS, tvOS, and watchOS. Customers have been utilizing these instances in order to automate their build pipelines for the Apple platform and integrate their native build tools, such as Jenkins and GitLab.

Aside from build automation, more and more customers are looking to utilize EC2 Mac instances for interactive development. Several advantages exist when utilizing remote development environments over installations on local developer machines:

  • Light-weight process for rolling out consistent, up-to-date environments for every developer without having to install software locally.
  • Solve cross-platform issues by having separate environments for different target platforms, all of which are independent of the developer’s local setup.
  • Consolidate access to source code and internal build tools, as they can be integrated with the remote development environment rather than local developer machines.
  • No need for specialized or powerful developer hardware.

On top of that, this approach promotes cost efficiency, as it enables EC2 Mac instances to be shared and utilized by multiple developers concurrently. This is particularly relevant for EC2 Mac instances, as they run on dedicated Mac mini hosts with a minimum tenancy of 24 hours. Therefore, handing out full instances to individual developers is often not practical.

Interactive remote development environments are also facilitated by code editors, such as VSCode, which provide a modern GUI based experience on the developer’s local machine while having source code files and terminal sessions for testing and debugging in the remote environment context.

This post will demonstrate how EC2 Mac instances can be set up as remote development servers that can be accessed by multiple developers concurrently in order to compile and run their code interactively via command line access. The proposed setup features centralized user management based on AWS Directory Service and shared network storage utilizing Amazon Elastic File System (Amazon EFS), thereby decoupling those aspects from the development server instances. As a result, new instances can easily be added when needed, and existing instances can be updated to the newest OS and development toolchain version without affecting developer workflow.

Architecture

The following diagram shows the architecture rolled out in the context of this blog.

Architecture Diagram. A detailed description is featured in the blog text.

Compute Layer

The compute layer consists of two EC2 Mac instances running in isolated private subnets in different Availability Zones. In a production setup, these instances are provisioned with every necessary tool and software needed by developers to build and test their code for Apple platforms. Provisioning can be accomplished by creating custom Amazon Machine Images (AMIs) for the EC2 Mac instances or by bootstrapping them with setup scripts. This post utilizes Amazon provided AMIs with macOS Big Sur without custom software. Once set up, developers gain command line access to the instances via SSH and utilize them as remote development environments.

Storage Layer

The architecture promotes the decoupling of compute and storage so that EC2 Mac instances can be updated with new OS and/or software versions without affecting the developer experience or data. Home directories reside on a highly available Amazon EFS file system, and they can be consistently accessed from all EC2 Mac instances. From a user perspective, any two EC2 Mac instances are alike, in that the user experiences the same configuration and environment (e.g., shell configurations such as .zshrc, VSCode remote extensions .vscode-server, or other tools and configurations installed within the user’s home directory). The file system is exposed to the private subnets via redundant mount target ENIs and persistently mounted on the Mac instances.

Identity Layer

For centralized user and access management, all instances in the architecture are part of a common Active Directory domain based on AWS Managed Microsoft AD. This is exposed via redundant ENIs to the private subnets containing the Mac instances.

To manage and configure the Active Directory domain, a Windows Instance (MGMT01) is deployed. For this post, we will connect to this instance for setting up Active Directory users. Note: other than that, this instance is not required for operating the solution, and it can be shut down both for reasons of cost efficiency and security.

Access Layer

The access layer constitutes the entry and exit point of the setup. For this post, it consists of an internet-facing bastion host that connects authorized Active Directory users to the Mac instances, as well as redundant NAT gateways that provide outbound internet connectivity.

Depending on customer requirements, the access layer can be realized in various ways. For example, it can provide access to customer on-premises networks by using AWS Direct Connect or AWS Virtual Private Network (AWS VPN), or to services in different Virtual Private Cloud (VPC) networks by using AWS PrivateLink. This means that you can integrate your Mac development environment with pre-existing development-related services, such as source code and software repositories or build and test services.

Prerequisites

We utilize AWS CloudFormation to automatically deploy the entire setup in the preceding description. All templates and code can be obtained from the blog’s GitHub repository. To complete the setup, you need

Warning: Deploying this example will incur AWS service charges of at least $50 because EC2 Mac instances can only be released 24 hours after allocation.

Solution Deployment

In this section, we provide a step-by-step guide for deploying the solution. We will mostly rely on AWS CLI and shell scripts provided along with the CloudFormation templates and use the AWS Management Console for checking and verification only.

1. Get the Code: Obtain the CloudFormation templates and all relevant scripts and assets via git:

git clone https://github.com/aws-samples/ec2-mac-remote-dev-env.git
cd ec2-mac-remote-dev-env
git submodule init 
git submodule update

2. Create an Amazon Simple Storage Service (Amazon S3) deployment bucket and upload assets for deployment: CloudFormation templates and other assets are uploaded to this bucket in order to deploy them. To achieve this, run the upload.sh script in the repository root, accepting the default bucket configuration as suggested by the script:

./upload.sh

3. Create an SSH Keypair for admin Access: To access the instances deployed by CloudFormation, create an SSH keypair with the name mac-admin, and then import it into EC2:

ssh-keygen -f ~/.ssh/mac-admin
aws ec2 import-key-pair \
    --key-name "mac-admin" \
    --public-key-material fileb://~/.ssh/mac-admin.pub

4. Create CloudFormation Parameters file: Initialize the JSON file by copying the provided template parameters-template.json:

cp parameters-template.json parameters.json

Substitute the following placeholders:

a. <YourS3BucketName>: The unique name of the S3 bucket you created in step 2.

b. <YourSecurePassword>: Active Directory domain admin password. This must be 8-32 characters long and can contain numbers, letters and symbols.

c. <YourMacOSAmiID>: We used the latest macOS Big Sur AMI at the time of writing with AMI ID ami-0c84d9da210c1110b in the us-east-2 Region. You can obtain other AMI IDs for your desired AWS Region and macOS version from the console.

d. <MacHost1ID> and <MacHost2ID>: See step 5 for how to allocate Dedicated Hosts and obtain the host IDs.

5. Allocate Dedicated Hosts: EC2 Mac Instances run on Dedicated Hosts. Therefore, prior to being able to deploy instances, Dedicated Hosts must be allocated. We utilize us-east-2 as the target Region, and we allocate the hosts in the Availability Zones us-east-2b and us-east-2c:

aws ec2 allocate-hosts \
    --auto-placement off \
    --region us-east-2 \
    --availability-zone us-east-2b \
    --instance-type mac1.metal \
    --quantity 1 \
    --tag-specifications 'ResourceType=dedicated-host,Tags=[{Key=Name,Value=MacHost1}]'

aws ec2 allocate-hosts \
    --auto-placement off \
    --region us-east-2 \
    --availability-zone us-east-2c \
    --instance-type mac1.metal \
    --quantity 1 \
    --tag-specifications 'ResourceType=dedicated-host,Tags=[{Key=Name,Value=MacHost2}]'

Substitute the host IDs returned from those commands in the parameters.json file as instructed in step 4.
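The substitution in steps 4 and 5 can also be scripted. The following is a minimal sketch, not part of the repository: it assumes that the placeholders such as <MacHost1ID> appear literally in the template text and that you have captured the JSON printed by aws ec2 allocate-hosts, which contains a HostIds list. The function names and the example values are hypothetical.

```python
import json


def host_id_from_output(allocate_hosts_output: str) -> str:
    """Return the first host ID from the JSON printed by `aws ec2 allocate-hosts`."""
    return json.loads(allocate_hosts_output)["HostIds"][0]


def fill_placeholders(template_text: str, values: dict) -> str:
    """Replace each literal placeholder, failing loudly if one is missing."""
    filled = template_text
    for placeholder, value in values.items():
        if placeholder not in filled:
            raise KeyError(f"{placeholder} not found in template")
        filled = filled.replace(placeholder, value)
    return filled


if __name__ == "__main__":
    # Hypothetical values for illustration only; the real template layout
    # is defined by parameters-template.json in the repository.
    example_output = '{"HostIds": ["h-0123456789abcdef0"]}'
    example_template = '{"ParameterValue": "<MacHost1ID>"}'
    host_id = host_id_from_output(example_output)
    print(fill_placeholders(example_template, {"<MacHost1ID>": host_id}))
```

Raising an error for a missing placeholder is a deliberate choice here: a silently unreplaced value would only surface later as a failed stack deployment.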

6. Deploy the CloudFormation Stack: To deploy the stack with the name ec2-mac-remote-dev-env, run the provided deploy.sh script as follows:

./deploy.sh ec2-mac-remote-dev-env

Stack deployment can take up to 1.5 hours because the AWS Managed Microsoft AD directory, the Windows MGMT01 instance, and the Mac instances are created sequentially. Check the CloudFormation Console to see whether the stack finished deploying. In the console, under Stacks, select the stack name from the preceding code (ec2-mac-remote-dev-env), and then navigate to the Outputs Tab. Once deployment has finished, this tab displays the public DNS name of the bastion host, as well as the private IPs of the Mac instances. You need this information in the upcoming section in order to connect and test your setup.

Solution Test

Now you can log in and explore the setup. We will start out by creating a developer account within Active Directory and configuring an SSH key that grants it access.

Create an Active Directory User

Create an SSH Key for the Active Directory User and configure SSH Client

First, we create a new SSH key for the developer Active Directory user by using the OpenSSH CLI:

ssh-keygen -f ~/.ssh/mac-developer

Furthermore, utilizing the connection information from the CloudFormation output, set up your ~/.ssh/config to contain the following entries, where $BASTION_HOST_PUBLIC_DNS, $MAC1_PRIVATE_IP and $MAC2_PRIVATE_IP must be replaced accordingly:

Host bastion
  HostName $BASTION_HOST_PUBLIC_DNS
  User ec2-user
  IdentityFile ~/.ssh/mac-admin

Host bastion-developer
  HostName $BASTION_HOST_PUBLIC_DNS
  User developer
  IdentityFile ~/.ssh/mac-developer

Host macos1
  HostName $MAC1_PRIVATE_IP
  ProxyJump %r@bastion-developer
  User developer
  IdentityFile ~/.ssh/mac-developer

Host macos2
  HostName $MAC2_PRIVATE_IP
  ProxyJump %r@bastion-developer
  User developer
  IdentityFile ~/.ssh/mac-developer

As you can see from this configuration, we set up both SSH keys created during this blog. The mac-admin key that you created earlier provides access to the privileged local ec2-user account, while the mac-developer key that you just created grants access to the unprivileged AD developer account. We will create this account next.
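If you add more Mac instances later, the macosN entries follow the same pattern. As a rough sketch, the entries above could be generated from the CloudFormation outputs as follows. The function name is hypothetical, and the sketch assumes that every Mac instance is reached through the bastion-developer entry with the developer account, exactly as in the configuration shown above.

```python
def render_ssh_config(bastion_dns: str, mac_private_ips: list) -> str:
    """Return ~/.ssh/config entries mirroring the ones shown in this post."""
    blocks = [
        # Privileged access to the bastion host with the admin key.
        "Host bastion\n"
        f"  HostName {bastion_dns}\n"
        "  User ec2-user\n"
        "  IdentityFile ~/.ssh/mac-admin\n",
        # Unprivileged Active Directory account on the bastion host.
        "Host bastion-developer\n"
        f"  HostName {bastion_dns}\n"
        "  User developer\n"
        "  IdentityFile ~/.ssh/mac-developer\n",
    ]
    # One entry per Mac instance, numbered macos1, macos2, ...
    for index, ip in enumerate(mac_private_ips, start=1):
        blocks.append(
            f"Host macos{index}\n"
            f"  HostName {ip}\n"
            "  ProxyJump %r@bastion-developer\n"
            "  User developer\n"
            "  IdentityFile ~/.ssh/mac-developer\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Placeholder values; substitute the outputs of your own stack.
    print(render_ssh_config("bastion.example.com", ["10.0.1.10", "10.0.2.10"]))
```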

Log in to the Windows MGMT Instance and set up a developer Active Directory account

Now log in to the bastion host, forwarding port 3389 to the MGMT01 host in order to gain Remote Desktop Access to the Windows management instance:

ssh -L3389:mgmt01:3389 bastion

While this connection is open, launch your Remote Desktop Client and connect to localhost with Username admin and the password specified earlier in the CloudFormation parameters. Once connected to the instance, open Control Panel>System and Security>Administrative Tools and click Active Directory Users and Computers. Then, in the window that appears, enable View>Advanced Features. If you haven’t changed the Active Directory domain name explicitly in CloudFormation, then the default domain name is example.com with corresponding NetBIOS Name example. Therefore, to create a new user for that domain, select Active Directory Users and Computers>example.com>example>Users, and click Create a new User. In the resulting wizard, set the Full name and User logon name fields to developer, and proceed to set a password to create the user. Once created, right-click on the developer user, and select Properties>Attribute Editor. Search for the altSecurityIdentities property, and copy-paste the developer public SSH key (contained in ~/.ssh/mac-developer.pub) into the Value to add field, click Add, and then click OK. In the Properties window, save your changes by clicking Apply and OK. The following figure illustrates the process just described:

Screenshot from the Windows Management instance depicting the creation of the Active Directory user. A detailed description of this process is contained in the blog text.

Connect to the EC2 Mac instances

Now that the developer account is set up, you can connect to either of the two EC2 Mac instances from your local machine with the Active Directory account:

ssh macos1

When you connect via the preceding command, your local machine first establishes an SSH connection to the bastion host, which authorizes the request against the key we just stored in Active Directory. Upon success, the bastion host forwards the connection to the macos1 instance, which again authorizes against Active Directory and launches a terminal session upon success. The following figure illustrates the login to the macos1 instance, showcasing both the integration with AD (EXAMPLE\Domain Users group membership) as well as with the EFS share, which is mounted at /opt/nfsshare and symlinked to the developer’s home directory.

Screenshot from a terminal window after logging into the macos1 instance. Instructions for doing this are included in the blog text.

Likewise, you can create folders and files in the developer’s home directory such as the test-project folder depicted in the screenshot.

Lastly, let’s utilize VS Code’s remote plugin and connect to the other instance, macos2. Select the Remote Explorer on the left-hand pane and click to open the macos2 host as shown in the following screenshot:

Screenshot depicting how to connect to the macos2 instance using the VSCode Remote SSH extension.

A new window will be opened with the context of the remote server, as shown in the next figure. As you can see, we have access to the same files seen previously on the macos1 host.

Screenshot showing VSCode UI once connected to the macos2 instance.

Cleanup

From the repository root, run the provided destroy.sh script in order to destroy all resources created by CloudFormation, specifying the stack name as input parameter:

./destroy.sh ec2-mac-remote-dev-env

Check the CloudFormation Console to confirm that the stack and its resources are properly deleted.

Lastly, in the EC2 Console, release the dedicated Mac Hosts that you allocated in the beginning. Note that this is only possible 24 hours after allocation.

Summary

This post has shown how EC2 Mac instances can be set up as remote development environments, thereby allowing developers to create software for Apple platforms regardless of their local hardware and software setup. Aside from increased flexibility and maintainability, this setup also saves cost because multiple developers can work interactively with the same EC2 Mac instance. We have rolled out an architecture that integrates EC2 Mac instances with AWS Directory Service for centralized user and access management as well as Amazon EFS to store developer home directories in a durable and highly available manner. This has resulted in an architecture where instances can easily be added, removed, or updated without affecting developer workflow. Now, irrespective of your client machine, you are all set to start coding with your local editor while leveraging EC2 Mac instances in the AWS Cloud to provide you with a macOS environment! To get started and learn more about EC2 Mac instances, please visit the product page.

Monitoring delay of AWS Batch jobs in transit before execution

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/monitoring-delay-of-aws-batch-jobs-in-transit-before-execution/

This post is written by Nikhil Anand, Solutions Architect 

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch processing jobs on AWS. With AWS Batch you no longer have to install and manage batch computing software or server clusters used to run your jobs. This lets you focus on analyzing results and solving problems, not managing infrastructure. During its lifetime, an AWS Batch job goes through several states. When you create a compute environment and submit Batch jobs to it, a misconfigured setting can cause a job to get stuck in a transit state. The job then never proceeds to the desired RUNNING state, which is a common issue for customers.

If your compute environment contains compute resources, but your jobs don’t progress beyond the RUNNABLE state, then something is preventing the jobs from being placed on a compute resource. There are various reasons why a job could remain in the RUNNABLE state. The usual call to action is to refer to the troubleshooting documentation in order to fix the issue. Similarly, if your job depends on another job that has not yet completed, then it stays in the PENDING state.

However, if you have scheduled actions to be completed with Batch jobs, or if you do not have any mechanism monitoring the jobs, then your jobs might stay in any of the transit states if left unattended. You may end up continuing forward, unaware that your job has yet to run. Eventually, when you see the jobs not progressing beyond the RUNNABLE or PENDING state, you miss the task that the job was expected to do in the given timeframe. This can result in additional time and effort troubleshooting the stuck job.

To close this monitoring gap, this post provides a monitoring solution for jobs in transit (from the SUBMITTED to the RUNNING state) in AWS Batch.

You can configure a threshold duration for your jobs so that you get a notification if a job stays in SUBMITTED/PENDING/RUNNABLE longer than that. For example, you might have a job that you expect to reach the RUNNING state within approximately 15 minutes of submission. Sometimes a slight misconfiguration can cause the job to get stuck in RUNNABLE indefinitely. In that case, you can set a threshold of 15 minutes. Or, suppose you have a job that depends on another job that is stuck in processing. In these situations, once the specified duration is exceeded, you are notified that your job has stayed in transit beyond your defined threshold.

The solution is deployed by using AWS CloudFormation.

Overview of solution

The solution creates an Amazon CloudWatch Events rule that triggers an AWS Lambda function on a schedule. The Lambda function then checks, across all compute environments, for jobs that have been in transit for more than ‘X’ seconds since submission. Specify your own value for ‘X’ when you launch the AWS CloudFormation stack. The solution consists of the following components created via CloudFormation:

  • An Amazon CloudWatch event rule to monitor the submitted jobs in Batch using the target Lambda function
  • An AWS Lambda function with the logic to monitor the submitted jobs and trigger Amazon Simple Notification Service (Amazon SNS) notifications
  • A Lambda execution AWS Identity and Access Management (IAM) role
  • An Amazon SNS topic to be subscribed by end users in order to be notified about the submitted jobs

The solution components and workflow.
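To make the threshold check concrete, here is a minimal sketch of the comparison the Lambda function has to perform. It is not the code from the CloudFormation template. It assumes job summaries shaped like the entries that the ListJobs API returns, where createdAt is a Unix timestamp in milliseconds, and the function name and the now_ms parameter are hypothetical.

```python
import time

# States that this post treats as "in transit".
TRANSIT_STATES = ("SUBMITTED", "PENDING", "RUNNABLE")


def jobs_over_threshold(job_summaries, threshold_seconds, now_ms=None):
    """Return the IDs of in-transit jobs older than threshold_seconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    overdue = []
    for job in job_summaries:
        if job.get("status") not in TRANSIT_STATES:
            continue  # RUNNING, SUCCEEDED, and so on are not in transit
        age_seconds = (now_ms - job["createdAt"]) / 1000
        if age_seconds > threshold_seconds:
            overdue.append(job["jobId"])
    return overdue
```

With a threshold of 900 seconds, a RUNNABLE job created 1,000 seconds ago is reported, while one created 500 seconds ago is not. Passing now_ms explicitly keeps the function easy to test.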

Prerequisites

For this walkthrough, you should have the following prerequisites:

Walkthrough

To provision the necessary solution components, use this CloudFormation template. 

  1. While launching the CloudFormation stack, you will be asked to input the following information in addition to the CloudFormation stack name:
    1. The upper threshold (in seconds) for the jobs to stay in the transit state
    2. The evaluation period after which the Lambda runs periodically
    3. The email ID to get notifications after the job stays in the transit state for the defined threshold value.

Specify parameter values during CloudFormation stack launch

  2. Once the stack is created, the following resources will be provisioned: SNS topic, CloudWatch Events rule, Lambda function, Lambda invoke permissions, and Lambda execution role. View them in the ‘Resources’ tab of your CloudFormation stack.

Successful creation of the CloudFormation stack

  3. After the stack is created, the email ID you entered in step 1 will receive an email from Amazon SNS asking you to confirm the Amazon SNS subscription.

Subscription confirmation email that you receive on the specified email ID.

Click Confirm subscription in the email.

Subscription confirmed.

  4. Based on your inputs during stack launch, a Lambda function will be periodically invoked to look out for Batch jobs stuck in the RUNNABLE state for the defined threshold.
  5. An Amazon SNS notification is sent out at the evaluation periods with the job IDs of the jobs that have stayed stuck in the RUNNABLE state.

Verifying the solution

Launch your monitoring solution by using the CloudFormation template. Once the stack creation is complete, you receive an email asking you to confirm the subscription to the SNS topic. Confirm the subscription.

Click to launch Stack. 

Submit a job in AWS Batch by using the console, CLI, or SDK. To test the solution, submit a job, Job1, to a job queue associated with a compute environment that has no public subnets. Compute resources require network access in order to communicate with the Amazon ECS service endpoint. This can be provided through an interface VPC endpoint or by your compute resources having public IP addresses. Since the compute environment was configured to only have a private subnet, Job1 will not proceed from the RUNNABLE state. Similarly, submit another job, Job2, and during submission make it dependent on Job1. Therefore, Job2 will not proceed from the PENDING state. This creates a test scenario in which two jobs are stuck in transit.

AWS Batch jobs submitted and in transit.

Based on the CloudFormation template inputs, you will get notified on the subscribed Email ID when the job stays in transit for more than ‘X’ seconds (the input provided during stack launch).

Notification received for the jobs that stayed in transit longer than expected.

Modifications

The Lambda function uses the ListJobs API call. ListJobs returns its results in paginated output, with a maximum number of results per response. Therefore, if you are submitting many jobs, then you must modify the Lambda function to fetch the results beyond the initial response by using the nextToken response element. Pass this nextToken value to the next call, and keep fetching the paginated results in a loop until the response no longer contains a nextToken element.
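The pagination loop could look roughly like the following sketch. It is an illustration rather than the Lambda function shipped with the template. It assumes a boto3 Batch client (for example, boto3.client("batch")) is passed in, and it relies on the list_jobs call with its jobQueue, jobStatus, and nextToken parameters. The function name is hypothetical.

```python
def list_all_jobs(batch_client, job_queue, job_status):
    """Collect every job summary for one queue and status, following nextToken."""
    summaries = []
    kwargs = {"jobQueue": job_queue, "jobStatus": job_status}
    while True:
        response = batch_client.list_jobs(**kwargs)
        summaries.extend(response.get("jobSummaryList", []))
        next_token = response.get("nextToken")
        if not next_token:
            break  # no further pages
        # Ask for the next page on the following iteration.
        kwargs["nextToken"] = next_token
    return summaries
```

Taking the client as a parameter keeps the function independent of how credentials and Regions are configured, and it allows a stub client to be used in tests.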

Cleaning up

To avoid incurring future charges, delete the resources. Deleting the CloudFormation stack cleans up every resource that it provisioned for the monitoring solution.

Conclusion

This solution lets you detect AWS Batch jobs that remain in the transit state longer than expected. It provides you with an efficient way to monitor your Batch jobs. If the jobs stay in the RUNNABLE/PENDING/SUBMITTED state for a significant amount of time, then it is indicative of potential misconfiguration with either the compute environment setup, or with the job parameters that were passed during the job submission. An early notification around the issue can help you troubleshoot the misconfigurations early on and take subsequent actions.

If you have multiple jobs that remain in the RUNNABLE state and you realize that they will not proceed further to the RUNNING state due to a misconfiguration, then you can shut down all RUNNABLE jobs by using a simple bash script.
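The post refers to a bash script for this. As an alternative illustration only, the same idea is sketched here in Python. It assumes a boto3 Batch client and a list of job summaries that you have already retrieved (each with jobId and status), and it uses the cancel_job call with its jobId and reason parameters. The function name and the reason text are hypothetical.

```python
def cancel_runnable_jobs(batch_client, job_summaries, reason):
    """Cancel every job in job_summaries that is still RUNNABLE."""
    cancelled = []
    for job in job_summaries:
        if job.get("status") != "RUNNABLE":
            continue  # leave jobs in any other state untouched
        batch_client.cancel_job(jobId=job["jobId"], reason=reason)
        cancelled.append(job["jobId"])
    return cancelled
```

Returning the cancelled job IDs makes it easy to log what was shut down, for example cancel_runnable_jobs(client, summaries, "Stuck in RUNNABLE due to misconfiguration").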

For additional references regarding troubleshooting RUNNABLE jobs in AWS Batch, refer to the suggested Knowledge Center article and the troubleshooting documentation.

Optimizing Apache Flink on Amazon EKS using Amazon EC2 Spot Instances

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/optimizing-apache-flink-on-amazon-eks-using-amazon-ec2-spot-instances/

This post is written by Kinnar Sen, Senior EC2 Spot Specialist Solutions Architect

Apache Flink is a distributed data processing engine for stateful computations for both batch and stream data sources. Flink supports event time semantics for out-of-order events, exactly-once semantics, backpressure control, and optimized APIs. Flink has connectors for third-party data sources and AWS Services, such as Apache Kafka, Apache NiFi, Amazon Kinesis, and Amazon MSK. Flink can be used for Event Driven (Fraud Detection), Data Analytics (Ad-Hoc Analysis), and Data Pipeline (Continuous ETL) applications. Amazon Elastic Kubernetes Service (Amazon EKS) is the chosen deployment option for many AWS customers for Big Data frameworks such as Apache Spark and Apache Flink. Flink has native integration with Kubernetes allowing direct deployment and dynamic resource allocation.

In this post, I illustrate the deployment of a scalable, highly available (HA), resilient, and cost optimized Flink application using Kubernetes via Amazon EKS and Amazon EC2 Spot Instances (Spot). Learn how to save money on big data streaming workloads by implementing this solution.

Overview

Amazon EC2 Spot Instances

Amazon EC2 Spot Instances let you take advantage of spare EC2 capacity in the AWS Cloud and are available at up to a 90% discount compared to On-Demand Instances. Spot Instances receive a two-minute warning when these instances are about to be reclaimed by Amazon EC2. There are many graceful ways to handle the interruption. Recently, the EC2 Instance rebalance recommendation has been added to send proactive notifications when a Spot Instance is at elevated risk of interruption. Spot Instances are a great way to scale up and increase throughput of Big Data workloads and have been adopted by many customers.

Apache Flink and Kubernetes

Apache Flink is an adaptable framework that allows multiple deployment options, one of which is Kubernetes. The Flink framework has a few key building blocks.

  • The Job Client submits the job in the form of a JobGraph to the Job Manager.
  • The Job Manager plays the role of central work coordinator, which distributes the job to the Task Managers.
  • Task Managers are the worker components, which run the operators for sources, transformations, and sinks.
  • Optional external components, such as the Resource Provider, HA Service Provider, Application Data Source, and Sinks, vary with the deployment mode and options.

Image shows Flink application deployment architecture with Job Manager, Task Manager, Scheduler, Data Flow Graph, and client.

Flink supports different deployment (Resource Provider) modes when running on Kubernetes. In this blog we will use the Standalone Deployment mode, as we just want to showcase the functionality. However, we recommend that first-time users deploy Flink on Kubernetes using the Native Kubernetes Deployment.

Flink can be run in different modes such as Session, Application, and Per-Job. The modes differ in cluster lifecycle, resource isolation and execution of the main() method. Flink can run jobs on Kubernetes via Application and Session Modes only.

  • Application Mode: This is a lightweight and scalable way to submit an application on Flink and is the preferred way to launch an application, as it supports better resource isolation. Resource isolation is achieved by running a cluster per job. Once the application shuts down, all the Flink components are cleaned up.
  • Session Mode: This is a long running Kubernetes deployment of Flink. Multiple applications can be launched on a cluster, and the applications compete for the resources. There may be multiple jobs running on a TaskManager in parallel. Its main advantage is that it saves time on spinning up a new Flink cluster for new jobs. However, if one of the Task Managers fails, it may impact all the jobs running on it.

Amazon EKS

Amazon EKS is a fully managed Kubernetes service. EKS supports creating and managing Spot Instances using Amazon EKS managed node groups following Spot best practices. This enables you to take advantage of the steep savings and scale that Spot Instances provide for interruptible workloads. EKS-managed node groups require less operational effort compared to using self-managed nodes. You can learn more in the blog “Amazon EKS now supports provisioning and managing EC2 Spot Instances in managed node groups.”

Apache Flink and Spot

Big Data frameworks like Spark and Flink are distributed in order to manage and process high volumes of data. They are designed for failure, can run on machines with different configurations, and are inherently resilient and flexible. Spot Instances can optimize runtimes by increasing throughput, while spending the same (or less). Flink can tolerate interruptions using restart and failover strategies.

Fault Tolerance

Fault tolerance is implemented in Flink by checkpointing the state. Checkpoints allow Flink to recover state and positions in the streams. There are two prerequisites for checkpointing: a persistent data source (Apache Kafka, Amazon Kinesis) that can replay data, and persistent distributed storage (Amazon Simple Storage Service (Amazon S3), HDFS) to store state.

Cost Optimization

Job Manager and Task Manager are key building blocks of Flink. The Task Manager is the compute intensive part and the Job Manager is the orchestrator. We run the Task Manager on Spot Instances and the Job Manager on On-Demand Instances.

Scaling

Flink supports elastic scaling via Reactive Mode. Task Managers can be added or removed based on metrics monitored by an external service monitor such as Horizontal Pod Autoscaling (HPA). When scaling up, new pods are added. If the cluster has resources, the pods are scheduled. If not, they go into the pending state. The Cluster Autoscaler (CA) detects pods in the pending state, and new nodes are added by EC2 Auto Scaling. This is ideal with Spot Instances, as it implements elastic scaling with higher throughput in a cost optimized way.
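To give a feel for how HPA decides on the number of Task Manager pods, the following sketch applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The post does not say which metric or bounds are used, so the metric values and the min_replicas and max_replicas defaults below are purely illustrative, and the tolerance band that HPA applies around the target is left out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count per the HPA rule, clamped to the configured bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For example, four Task Managers running at 90% average CPU against a 60% target would be scaled to six. If the cluster cannot place the two extra pods, they stay pending until the Cluster Autoscaler adds nodes.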

Tutorial: Running Flink applications in a cost optimized way

In this tutorial, I review the steps that help you launch cost optimized and resilient Flink workloads running on EKS via Application mode. The streaming application reads dummy stock ticker prices sent to an Amazon Kinesis Data Stream by the Amazon Kinesis Data Generator, tries to determine the highest price within a predefined window, and writes the output to files in Amazon S3.

Image shows Flink application pipeline with data flowing from Amazon Kinesis Data Generator to Kinesis Data Stream, processed in Apache Flink and output being written in Amazon S3

The configuration files can be found in this GitHub location. To run the workload on Kubernetes, make sure you have the eksctl and kubectl command line utilities installed on your computer or on an AWS Cloud9 environment. You can run this by using an AWS IAM user or role that has the Administrator Access policy attached to it, or check the minimum required permissions for using eksctl. The Spot node groups in the Amazon EKS cluster can be launched in either a managed or a self-managed way. In this post I use the EKS Managed node group for Spot Instances.

Steps

When we deploy Flink in Application Mode, it runs as a single application. The cluster is exclusive to the job. We bundle the user code in the Flink image for that purpose and upload it to Amazon Elastic Container Registry (Amazon ECR). Amazon ECR is a fully managed container registry that makes it easy to store, manage, share, and deploy your container images and artifacts anywhere.

1. Build the Amazon ECR Image

  • Log in using the following command, and don’t forget to replace AWS_REGION and ACCOUNT_ID with your details.

aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

  • Create a repository

aws ecr create-repository --repository-name flink-demo --image-scanning-configuration scanOnPush=true --region ${AWS_REGION}

  • Build the Docker image:

Download the Dockerfile. I am using a multistage Docker build here. The sample code is from the Amazon Kinesis Data Analytics Java examples on GitHub. I modified the code to allow checkpointing and change the sliding window interval. Build and push the Docker image using the following instructions.

docker build --tag flink-demo .

  • Tag and Push your image to Amazon ECR

docker tag flink-demo:latest ${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/flink-demo:latest
docker push ${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/flink-demo:latest
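The login, tag, and push commands all build the same registry address from the account ID and Region. As a small sketch of that naming scheme, the helper below assembles the image URI used in the commands above. The function name is hypothetical, and the 12-digit check is only a convenience to catch a mistyped account ID early.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
```

Calling ecr_image_uri("123456789012", "us-east-2", "flink-demo") with a placeholder account ID yields the kind of value that the Flink deployment manifests need to reference as the container image.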

2. Create Amazon S3/Amazon Kinesis Access Policy

First, I must create an access policy to allow the Flink application to read/write from Amazon S3 and read Kinesis data streams. Download the Amazon S3 policy file from here and modify the <<output folder>> to an Amazon S3 bucket which you have to create.

  • Run the following to create the policy. Note the ARN.

aws iam create-policy --policy-name flink-demo-policy --policy-document file://flink-demo-policy.json

3. Cluster and node groups deployment

  • Create an EKS cluster using the following command:

eksctl create cluster --name=flink-demo --node-private-networking --without-nodegroup --asg-access --region=<<AWS Region>>

The cluster takes approximately 15 minutes to launch.

  • Create the node group using the nodeGroup config file. I am using multiple nodeGroups of different sizes to adopt the Spot best practice of diversification. Replace the <<Policy ARN>> string using the ARN string from the previous step.

eksctl create nodegroup -f managedNodeGroups.yml

  • Download the Cluster Autoscaler and edit it to add the cluster-name (flink-demo)

curl -LO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

4. Install the Cluster AutoScaler using the following command:

kubectl apply -f cluster-autoscaler-autodiscover.yaml

  • Using EKS managed node groups requires significantly less operational effort than using self-managed node groups, and enables:
    • Automatic enforcement of Spot best practices.
    • Spot Instance lifecycle management.
    • Automatic labeling of Pods.
  • eksctl has integrated amazon-ec2-instance-selector to enable automatic selection of instances based on the criteria passed. This has multiple benefits:
    • ‘Instance diversification’ is implemented by enabling the selection of multiple instance types in the node group, which works well with the Cluster Autoscaler (CA).
    • It reduces the manual effort of selecting the instances.
  • We can create node group manifests using the --dry-run flag and then create node groups from them.

eksctl create cluster --name flink-demo --instance-selector-vcpus=2 --instance-selector-memory=4 --dry-run

eksctl create nodegroup -f managedNodeGroups.yml

5. Create service accounts for Flink

$ kubectl create serviceaccount flink-service-account
$ kubectl create clusterrolebinding flink-role-binding-flink --clusterrole=edit --serviceaccount=default:flink-service-account

6. Deploy Flink

The install folder here has all the YAML files required to deploy a standalone Flink cluster. Run the install.sh file. This will deploy the cluster with a JobManager, a pool of TaskManagers, and a Service exposing the JobManager’s ports.

  • This is a high-availability (HA) deployment of Flink that uses the Kubernetes high availability service.
  • The JobManager runs on On-Demand Instances and the TaskManagers run on Spot Instances. As the cluster is launched in Application Mode, if a node is interrupted only one job will be restarted.
  • Autoscaling is enabled by the use of ‘Reactive Mode’. Horizontal Pod Autoscaler is used to monitor the CPU load and scale accordingly.
  • Checkpointing is enabled, which allows Flink to save state and be fault tolerant.
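
To make the CPU-based scaling concrete, here is a small Python sketch of the core replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired count is the current count scaled by the ratio of observed to target utilization, rounded up. It ignores the autoscaler's tolerance band and stabilization windows, and the `min_replicas` and `max_replicas` defaults are illustrative assumptions rather than values from this tutorial's manifests.

```python
import math


def desired_replicas(current_replicas: int, current_cpu: float, target_cpu: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the HPA rule: ceil(current * observed / target), then clamp."""
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    raw = math.ceil(current_replicas * current_cpu / target_cpu)
    # Clamp to the configured bounds (hypothetical defaults here).
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Four TaskManagers at 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

In Reactive Mode, Flink then rescales the job to use whatever number of TaskManagers results from this calculation.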

Image shows the Flink dashboard highlighting checkpoints for a job

7. Create Amazon Kinesis data stream and send dummy data

Log in to the AWS Management Console and create a Kinesis data stream named ‘ExampleInputStream’. Kinesis Data Generator is used to send data to the data stream. The template of the dummy data can be found here. Once data is being sent, the Flink application starts processing it.

Image shows Amazon Kinesis Data Generator console sending data to Kinesis Data Stream

Observations

Spot Interruptions

If there is an interruption, then the Flink application will be restarted using checkpointed data. The JobManager will restore the job as highlighted in the following log. The node will be replaced automatically by the Managed Node Group.

Image shows logs from a Flink job highlighting job restart using checkpoints.

One will be able to observe the graceful restart in the Flink UI.

Image shows the Flink dashboard highlighting job restart after failure.

AutoScaling

You can observe the elastic scaling using logs. The number of TaskManagers in the Flink UI will also reflect the scaling state.

Image shows kubectl output showing status of HPA during scale-out

Cleanup

If you are trying out the tutorial, run the following steps to make sure that you don’t encounter unwanted costs.

  • Run the delete.sh file.
  • Delete the EKS cluster and the node groups:
    • eksctl delete cluster --name flink-demo
  • Delete the Amazon S3 Access Policy:
    • aws iam delete-policy --policy-arn <<POLICY ARN>>
  • Delete the Amazon S3 Bucket:
    • aws s3 rb --force s3://<<S3_BUCKET>>
  • Delete the CloudFormation stack related to Kinesis Data Generator named ‘Kinesis-Data-Generator-Cognito-User’
  • Delete the Kinesis Data Stream.

Conclusion

In this blog, I demonstrated how you can run Flink workloads on a Kubernetes Cluster using Spot Instances, achieving scalability, resilience, and cost optimization. To cost optimize your Flink based big data workloads you should start thinking about using Amazon EKS and Spot Instances.

Implementing interruption tolerance in Amazon EC2 Spot with AWS Fault Injection Simulator

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/implementing-interruption-tolerance-in-amazon-ec2-spot-with-aws-fault-injection-simulator/

This post is written by Steve Cole, WW SA Leader for EC2 Spot, and David Bermeo, Senior Product Manager for EC2.

On October 20, 2021, AWS released new functionality for the AWS Fault Injection Simulator that supports triggering the interruption of Amazon EC2 Spot Instances. This functionality lets you test the fault tolerance of your software by interrupting instances on command. The triggered interruption will be preceded by a Rebalance Recommendation (RBR) and Instance Termination Notification (ITN) so that you can fully test your applications as if an actual Spot interruption had occurred.

In this post, we’ll provide two examples of how easy it has now become to simulate Spot interruptions and validate the fault-tolerance of an application or service. We will demonstrate testing an application through the console and a service via CLI.

Engineering use-case (console)

Whether you are building a Spot-capable product or service from scratch or evaluating the Spot compatibility of existing software, the first step in testing is identifying whether or not the software is tolerant of being interrupted.

In the past, one way this was accomplished was with an AWS open-source tool called the Amazon EC2 Metadata Mock. This tool let customers simulate a Spot interruption as if it had been delivered through the Instance Metadata Service (IMDS), which then let customers test how their code responded to an RBR or an ITN. However, this model wasn’t a direct plug-and-play solution with how an actual Spot interruption would occur, since the signal wasn’t coming from AWS. In particular, the method didn’t provide the centralized notifications available through Amazon EventBridge or Amazon CloudWatch Events that enabled off-instance activities like launching AWS Lambda functions or other orchestration work when an RBR or ITN was received.

Now, Fault Injection Simulator has removed the need for custom logic, since it lets RBR and ITN signals be delivered via the standard IMDS and event services simultaneously.

Let’s walk through the process in the AWS Management Console. We’ll identify an instance that’s hosting a perpetually-running queue worker that checks the IMDS before pulling messages from Amazon Simple Queue Service (SQS). It will be part of a service stack that is scaled in and out based on the queue depth. Our goal is to make sure that the IMDS is being polled properly so that no new messages are pulled once an ITN is received. The typical processing time of a message with this example is 30 seconds, so we can wait for an ITN (which provides a two minute warning) and need not act on an RBR.
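
The worker described above can be sketched in Python as follows. This is a hedged illustration, not the code used in the post: `run_worker`, `pull`, and `handle` are hypothetical names, and the queue access is injected as plain callables so the loop can be exercised without AWS. The `fetch_instance_action` helper queries the Spot `instance-action` metadata path through IMDSv2, which returns HTTP 404 until an interruption has been scheduled.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[dict]:
    """Ask IMDSv2 whether a Spot interruption is scheduled; None means no ITN yet."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def run_worker(pull: Callable[[], Optional[str]],
               handle: Callable[[str], None],
               check_interruption: Callable[[], Optional[dict]] = fetch_instance_action) -> int:
    """Check for an ITN before every pull; stop pulling once one is seen."""
    processed = 0
    while check_interruption() is None:
        message = pull()
        if message is None:
            break  # queue empty; a long-running worker would sleep and retry
        handle(message)  # finish the message already in hand
        processed += 1
    return processed
```

Because the check happens before each pull rather than during processing, a message that takes about 30 seconds still completes well inside the two-minute warning.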

First, we go to the Fault Injection Simulator in the AWS Management Console to create an experiment.

AWS Fault Injection Simulator start screen in AWS Management console

At the experiment creation screen, we start by entering an optional name (recommended for console use) and a description, and then selecting an IAM Role. If this is the first time that you’ve used Fault Injection Simulator, then you’ll need to create an IAM Role per the directions in the FIS IAM permissions documentation. We’ve named the role that we created ‘FIS.’ After that, we’ll select an action (interrupt) and identify a target (the instance).

Experiment template screen in AWS Management console with description “interrupt queue worker,” name “interrupt queue worker,” and IAM role “FIS.”

First, I name the action. The Action type I want is to interrupt the Spot Instance: aws:ec2:send-spot-instance-interruptions. In the Action parameters, we are given the option to set the duration. The minimum value here is two minutes, below which you will receive an error since Spot Instances will always receive a two minute warning. The advantage here is that, by setting the durationBeforeInterruption to a value above two minutes, you will get the RBR (an optional point for you to respond) and ITN (the actual two minute warning) at different points in time, and this lets you respond to one or both.
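
The relationship between durationBeforeInterruption and the two signals can be sketched as a small Python calculation. It assumes, in line with the behavior described in this post, that the RBR is sent when the action starts and the ITN is sent two minutes before the interruption. The parser handles only the simple `PT#H#M#S` form of ISO 8601 durations, and the function names are hypothetical.

```python
import re
from datetime import timedelta

_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")
NOTICE = timedelta(minutes=2)  # Spot's two-minute warning


def parse_duration(value: str) -> timedelta:
    """Parse a simple ISO 8601 time duration such as 'PT4M' or 'PT2M30S'."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def interruption_timeline(value: str) -> dict:
    """Return offsets from experiment start for the RBR, ITN, and interruption."""
    total = parse_duration(value)
    if total < NOTICE:
        raise ValueError("durationBeforeInterruption must be at least two minutes")
    return {"rbr": timedelta(0), "itn": total - NOTICE, "interruption": total}


if __name__ == "__main__":
    for name, offset in interruption_timeline("PT4M").items():
        print(f"{name}: +{offset}")
```

With exactly two minutes, the RBR and ITN offsets coincide, which is why a larger value is needed to respond to them separately.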

New action form in the create experiment template screen with name, description, action type and durationBeforeInterruption fields filled.

The target instance that we launched is depicted in the following screenshot. It is a Spot Instance that was launched as a persistent request with its interruption action set to ‘stop’ instead of ‘terminate.’ The option to stop a Spot Instance, introduced in 2020, will let us restart the instance, log in and retrieve logs, update code, and perform other work necessary to implement Spot interruption handling.

Now that an action has been defined, we configure the target. We have the option of naming the target, which we’ve done here to match the Name tagged on the EC2 instance ‘qWorker’. The target method we want to use here is Resource ID, and then we can either type or select the desired instance from a drop-down list. Selection mode will be ‘all’, as there is only one instance. If we were using tags, which we will in the next example, then we’d be able to select a count of instances, up to five, instead of just one.

Edit target form in the create experiment template screen with name, resource type, and resource id fields filled. Name is “qWorker,” Resource type is “aws:ec2:spot-Instance,” Target method is “Resource IDs” Resource IDs is “i-019ea405f8f81742b” and Selection mode is “All.”

Once you’ve saved the Action, the Target, and the Experiment, you’ll be able to begin the experiment by selecting ‘Start’ from the ‘Actions’ menu at the top right of the screen.

Fault Injection Simulator Experiment template summary screen in the AWS Management console highlights experiment template ID, stop conditions, description, creation time, IAM role, and last update time.

After the experiment starts, you’ll be able to observe its state by refreshing the screen. Generally, the process will take just seconds, and you should be greeted by the Completed state, as seen in the following screenshot.

Fault Injection Simulator Experiment summary screen in the AWS Management console shows status as “Completed.”

In the following screenshot, having opened an interruption log group created in CloudWatch Event Logs, we can see the JSON of the RBR.

CloudWatch Event Logs screen in the AWS Management Console with information on the EC2 Rebalance Recommendation.

Two minutes later, we see the ITN in the same log group.

CloudWatch Event Logs screen in the AWS Management console with information on the Instance Termination Notification.

Another two minutes after the ITN, we can see the EC2 instance is in the process of stopping (or terminating, if you elect).

EC2 Instance List screen in the AWS Management console with information on the instance that is stopping.

Shortly after the stop is issued by EC2, we can see the instance stopped. It would now be possible to restart the instance and view logs, make code changes, or do whatever you find necessary before testing again.

EC2 Instance List screen in the AWS Management console with information on the Instance that is stopped.

Now that our experiment succeeded in interrupting our Spot Instance, we can evaluate the performance of the code running on the instance. It should have completed the processing of any messages already retrieved at the ITN, and it should have not pulled any new messages afterward.

This experiment can be saved for later use, but it will require selecting the specific instance each time that it’s run. This shouldn’t prove troublesome for infrequent experiments, especially those run through the console. Alternatively, as we did in our example, you can set the instance interruption behavior to stop (versus terminate) and re-use the experiment as long as that particular instance continues to exist. We can also re-use the experiment template by using tags instead of an instance ID, as we’ll show in the next example. When experiments become more frequent, it might be advantageous to automate the process, possibly as part of the test phase of a CI/CD pipeline. This is possible programmatically through the AWS CLI or SDK.

Operations use-case (CLI)

Once the developers of our product validate the single-instance fault tolerance, indicating that the target workload is capable of running on Spot Instances, then the next logical step is to deploy the product as a service on multiple instances. This will allow for more comprehensive testing of the service as a whole, and it is a key process in collecting valuable information, such as performance data, response times, error rates, and other metrics to be used in the monitoring of the service. Once data has been collected on a non-interrupted deployment, it is then possible to use the Spot interruption action of the Fault Injection Simulator to observe how well the service can handle RBR and ITN while running, and to see how those events influence the metrics collected previously.

When testing a service, you can now trigger Spot interruptions on as many as five instances in a single Fault Injection Simulator experiment. This applies whether the service is launched as instances in an Amazon EC2 Auto Scaling group or EC2 Fleet, is part of one of the AWS container services, such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS), runs on Amazon EMR, or spans any instances with descriptive tagging.

To interrupt multiple Spot Instances simultaneously, we’ll use tags, as opposed to instance IDs, to identify candidates for interruption. We can further refine the candidate targets with one or more filters in our experiment, for example targeting only running instances if we perform an action repeatedly.

In the following example, we will be interrupting three instances in an Auto Scaling group that is backing a self-managed EKS node group. We already know the software will behave as desired from our previous engineering tests. Our goal here is to see how quickly EKS can launch replacement tasks and identify how the service as a whole responds during the event. In our experiment, we will identify instances that contain the tag aws:autoscaling:groupName with a value of “spotEKS”.

The key benefit here is that we don’t need a list of instance IDs in our experiment. Therefore, this is a re-usable experiment that can be incorporated into test automation without needing to make specific selections from previous steps like collecting instance IDs from the target Auto Scaling group.

We start by creating a file that describes our experiment in JSON rather than through the console:

{
    "description": "interrupt multiple random instances in ASG",
    "targets": {
        "spotEKS": {
            "resourceType": "aws:ec2:spot-instance",
            "resourceTags": {
                "aws:autoscaling:groupName": "spotEKS"
            },
            "selectionMode": "COUNT(3)"
        }
    },
    "actions": {
        "interrupt": {
            "actionId": "aws:ec2:send-spot-instance-interruptions",
            "description": "interrupt multiple instances",
            "parameters": {
                "durationBeforeInterruption": "PT4M"
            },
            "targets": {
                "SpotInstances": "spotEKS"
            }
        }
    },
    "stopConditions": [
        {
            "source": "none"
        }
    ],
    "roleArn": "arn:aws:iam::xxxxxxxxxxxx:role/FIS",
    "tags": {
        "Name": "multi-instance"
    }
}
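
If the template needs to vary between test runs, the same JSON can be generated rather than edited by hand. The following Python sketch mirrors the structure of the file above; `build_template` is a hypothetical helper, and the upper bound of five reflects the per-experiment limit mentioned earlier in this post.

```python
import json


def build_template(tag_value: str, count: int, duration: str, role_arn: str) -> dict:
    """Build an experiment template that interrupts `count` tagged Spot Instances."""
    if not 1 <= count <= 5:
        raise ValueError("count must be between 1 and 5 instances per experiment")
    return {
        "description": "interrupt multiple random instances in ASG",
        "targets": {
            tag_value: {
                "resourceType": "aws:ec2:spot-instance",
                "resourceTags": {"aws:autoscaling:groupName": tag_value},
                "selectionMode": f"COUNT({count})",
            }
        },
        "actions": {
            "interrupt": {
                "actionId": "aws:ec2:send-spot-instance-interruptions",
                "description": "interrupt multiple instances",
                "parameters": {"durationBeforeInterruption": duration},
                # The action's target refers to the target entry by its key.
                "targets": {"SpotInstances": tag_value},
            }
        },
        "stopConditions": [{"source": "none"}],
        "roleArn": role_arn,
        "tags": {"Name": "multi-instance"},
    }


if __name__ == "__main__":
    template = build_template("spotEKS", 3, "PT4M", "arn:aws:iam::xxxxxxxxxxxx:role/FIS")
    print(json.dumps(template, indent=4))
```

Redirecting the printed output to experiment.json produces a file equivalent to the one shown above.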

Then we upload the experiment template to Fault Injection Simulator from the command-line.

aws fis create-experiment-template --cli-input-json file://experiment.json

The response we receive returns our template along with an ID, which we’ll need to execute the experiment.

{
    "experimentTemplate": {
        "id": "EXT3SHtpk1N4qmsn",
        ...
    }
}

We then execute the experiment from the command-line using the ID that we were given at template creation.

aws fis start-experiment --experiment-template-id EXT3SHtpk1N4qmsn

We then receive confirmation that the experiment has started.

{
    "experiment": {
        "id": "EXPaFhEaX8GusfztyY",
        "experimentTemplateId": "EXT3SHtpk1N4qmsn",
        "state": {
            "status": "initiating",
            "reason": "Experiment is initiating."
        },
        ...
    }
}

To check the status of the experiment as it runs, which for interrupting Spot Instances is quite fast, we can query the experiment ID for success or failure messages as follows:

aws fis get-experiment --id EXPaFhEaX8GusfztyY
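
In test automation, this status check is typically wrapped in a polling loop. Below is a minimal Python sketch with the status lookup injected as a callable, so it runs without AWS access; `wait_for_experiment` is a hypothetical name, and treating `completed`, `stopped`, and `failed` as the terminal states is an assumption based on the status values FIS reports.

```python
import time
from typing import Callable

TERMINAL = {"completed", "stopped", "failed"}


def wait_for_experiment(get_status: Callable[[], str],
                        interval: float = 5.0,
                        max_polls: int = 60,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the experiment reaches a terminal status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL:
            return status
        sleep(interval)
    raise TimeoutError(f"experiment still running after {max_polls} polls")


# With boto3 installed and credentials configured, get_status could be (assumption):
#   fis = boto3.client("fis")
#   get_status = lambda: fis.get_experiment(id="EXPaFhEaX8GusfztyY")["experiment"]["state"]["status"]
```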

And finally, we can confirm the results of our experiment by listing our instances through EC2. Here we use the following command-line before and after execution to generate pre- and post-experiment output:

aws ec2 describe-instances --filters \
 Name='tag:aws:autoscaling:groupName',Values='spotEKS' \
 Name='instance-state-name',Values='running' \
 | jq '.Reservations[].Instances[].InstanceId' | sort

We can then compare this to identify which instances were interrupted and which were launched as replacements.

< "i-003c8d95c7b6e3c63"
< "i-03aa172262c16840a"
< "i-02572fa37a61dc319"
---
> "i-04a13406d11a38ca6"
> "i-02723d957dc243981"
> "i-05ced3f71736b5c95"

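The same comparison can be done with set operations, which also reports instances that were left untouched. This Python sketch uses the instance IDs from the output above; `compare_instances` is a hypothetical helper name.

```python
def compare_instances(before, after):
    """Classify instance IDs from the pre- and post-experiment listings."""
    before_set, after_set = set(before), set(after)
    return {
        "interrupted": sorted(before_set - after_set),   # running before, gone after
        "replacements": sorted(after_set - before_set),  # launched during the experiment
        "unchanged": sorted(before_set & after_set),     # running throughout
    }


if __name__ == "__main__":
    before = ["i-003c8d95c7b6e3c63", "i-03aa172262c16840a", "i-02572fa37a61dc319"]
    after = ["i-04a13406d11a38ca6", "i-02723d957dc243981", "i-05ced3f71736b5c95"]
    for group, ids in compare_instances(before, after).items():
        print(group, ids)
```
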
Summary

In the previous examples, we demonstrated through the console and the command line how you can use the Spot interruption action in Fault Injection Simulator to ascertain how your software and service will behave when encountering a Spot interruption. Simulating Spot interruptions helps you assess the fault tolerance of your software and the impact of interruptions on a running service. The addition of events can enable more tooling, and being able to simulate both ITNs and RBRs, along with the Capacity Rebalance feature of Auto Scaling groups, now matches the end-to-end experience of an actual AWS interruption. Get started on simulating Spot interruptions in the console.

Implementing Auto Scaling for EC2 Mac Instances

Post Syndicated from Rick Armstrong original https://aws.amazon.com/blogs/compute/implementing-autoscaling-for-ec2-mac-instances/

This post is written by: Josh Bonello, Senior DevOps Architect, AWS Professional Services; Wes Fabella, Senior DevOps Architect, AWS Professional Services

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. The introduction of Amazon EC2 Mac now enables macOS based workloads to run in the AWS Cloud. These EC2 instances require Dedicated Hosts usage. EC2 integrates natively with Amazon CloudWatch to provide monitoring and observability capabilities.

In order to best leverage EC2 for dynamic workloads, it is a best practice to use Auto Scaling whenever possible. This will allow your workload to scale to demand, while keeping a minimal footprint during low activity periods. With Auto Scaling, you don’t have to worry about provisioning more servers to handle peak traffic or paying for more than you need.

This post will discuss how to create an Auto Scaling Group for the mac1.metal instance type. We will produce an Auto Scaling Group, a Launch Template, a Host Resource Group, and a License Configuration. These resources will work together to produce the now expected behavior of standard instance types with Auto Scaling. At AWS Professional Services, we have implemented this architecture to allow the dynamic sizing of a compute fleet utilizing the mac1.metal instance type for a large customer. Depending on what should invoke the scaling mechanisms, this architecture can be easily adapted to integrate with other AWS services, such as Elastic Load Balancers (ELB). We will provide Terraform templates as part of the walkthrough. Please take special note of the costs associated with running three mac1.metal Dedicated Hosts for 24 hours.

How it works

First, we will begin in AWS License Manager and create a License Configuration. This License Configuration will be associated with an Amazon Machine Image (AMI), and can be associated with multiple AMIs. We will utilize this License Configuration as a parameter when we create a Host Resource Group. As part of defining the Launch Template, we will be referencing our Host Resource Group. Then, we will create an Auto Scaling Group based on the Launch Template.

Example flow of License Manager, AWS Auto Scaling, and EC2 Instances and their relationship to each other.

License Configurations are normally used to control software licensing parameters. In our case, the License Configuration is only a required element of the Host Resource Group, and it handles nothing significant in our solution.

The Host Resource Group will be responsible for allocating and deallocating the Dedicated Hosts for the Mac1 instance family. An available Dedicated Host is required to launch a mac1.metal EC2 instance.

The Launch Template will govern many aspects of our EC2 Instances, including the AWS Identity and Access Management (IAM) Instance Profile, Security Groups, and Subnets. These will be similar to typical Auto Scaling Group expectations. Note that, in our solution, we will use the Host Resource Group tenancy as our compute source.

Finally, we will create an Auto Scaling Group based on our Launch Template. The Auto Scaling Group will be the controller to signal when to create new EC2 Instances, create new Dedicated Hosts, and similarly terminate EC2 Instances. Unutilized Dedicated Hosts will be tracked and terminated by the Host Resource Group.

Limits

Some limits exist for this solution. To deploy this solution, a Service Quota Increase must be submitted for mac1.metal Dedicated Hosts, as the default quota is 0. Deploying the solution without this increase will result in failures when provisioning the Dedicated Hosts for the mac1.metal instances.

While testing scale-in operations of the auto scaling group, you might find that Dedicated Hosts are in “Pending” state. Mac1 documentation says “When you stop or terminate a Mac instance, Amazon EC2 performs a scrubbing workflow on the underlying Dedicated Host to erase the internal SSD, to clear the persistent NVRAM variables. If the bridgeOS software does not need to be updated, the scrubbing workflow takes up to 50 minutes to complete. If the bridgeOS software needs to be updated, the scrubbing workflow can take up to 3 hours to complete.” The Dedicated Host cannot be reused for a new scale-out operation until this scrubbing is complete. If you attempt a scale-in and a scale-out operation during testing, you might find more Dedicated Hosts than EC2 instances for your ASG as a result.

Auto Scaling Group features like dynamic scaling, health checking, and instance refresh can also cause similar side effects as a result of terminating the EC2 instances. These side effects will subside after 24 hours, when a mac1 Dedicated Host can be released.

Building the solution

This walkthrough will utilize a Terraform template to automate the infrastructure deployment required for this solution. The following prerequisites should be met prior to proceeding with this walkthrough:

Before proceeding, note that the AWS resources created as part of the walkthrough have costs associated with them. Delete any AWS resources created by the walkthrough that you do not intend to use. Take special note that, at the time of writing, mac1.metal Dedicated Hosts require a 24-hour minimum allocation time to align with the Apple macOS EULA, and that mac1.metal EC2 instances are not charged separately; only the underlying Dedicated Hosts are.

Step 1: Deploy Dedicated Hosts infrastructure

First, we will do one-time setup for AWS License Manager to have the required IAM Permissions through the AWS Management Console. If you have already used License Manager, this has already been done for you. Click on “create customer managed license”, check the box, and then click on “Grant Permissions.”

AWS License Manager IAM Permissions Grant

To deploy the infrastructure, we will utilize a Terraform template to automate every component setup. The code is available at https://github.com/aws-samples/amazon-autoscaling-mac1metal-ec2-with-terraform. First, initialize Terraform; for this solution, we run it from a local machine. This walkthrough assumes the us-west-2 (Oregon) AWS Region, and the links provided for checking resources point to that Region.

terraform -chdir=terraform-aws-dedicated-hosts init

Initializing Terraform host and showing an example of expected output.

Then, we will plan our Terraform deployment and verify what we will be building before deployment.

terraform -chdir=terraform-aws-dedicated-hosts plan

In our case, we will expect a CloudFormation Stack and a Host Resource Group.

Planning Terraform template and showing an example of expected output.

Then, apply our Terraform deployment and verify via the AWS Management Console.

terraform -chdir=terraform-aws-dedicated-hosts apply -auto-approve

Applying Terraform template and showing an example of expected output.

Check that the License Configuration has been made in License Manager with a name similar to MyRequiredLicense.

Example of License Manager License after Terraform Template is applied.

Check that the Host Resource Group has been made in the AWS Management Console. Ensure that the name is similar to mac1-host-resource-group-famous-anchovy.

Example of Cloudformation Stack that is created, with License Manager Host Resource Group name pictured.

Note the host resource group name in the HostResourceGroup “Physical ID” value for the next step.

Step 2: Deploy mac1.metal Auto Scaling Group

We will be taking similar steps as in Step 1 with a new component set.

Initialize your Terraform State:

terraform -chdir=terraform-aws-ec2-mac init

Then, update the following values in terraform-aws-ec2-mac/my.tfvars:

vpc_id : Check the ID of a VPC in the account where you are deploying. You will always have a “default” VPC.

subnet_ids : Check the ID of one or many subnets in your VPC.

hint: use https://us-west-2.console.aws.amazon.com/vpc/home?region=us-west-2#subnets

security_group_ids : Check the ID of a Security Group in the account where you are deploying. You will always have a “default” SG.

host_resource_group_cfn_stack_name : Use the Host Resource Group Name value from the previous step.

Then, plan your deployment using the following:

terraform -chdir=terraform-aws-ec2-mac plan -var-file="my.tfvars"

Once we’re ready to deploy, utilize Terraform to apply the following:

terraform -chdir=terraform-aws-ec2-mac apply -var-file="my.tfvars" -auto-approve

Note, this will take three to five minutes to complete.

Step 3: Verify Deployment

Check our Auto Scaling Group in the AWS Management Console for a group named something like “ec2-native-xxxx”. Verify all attributes that we care about, including the underlying EC2.

Example of Autoscaling Group listing the EC2 Instances with mac1.metal instance type showing InService after Terraform Template is applied.

Check our Elastic Load Balancer in the AWS Management Console with a Tag key “Name” and the value of your Auto Scaling Group.

Check for the existence of our Dedicated Hosts in the AWS Management Console.

Step 4: Test Scaling Features

Now we have the entire infrastructure in place for an Auto Scaling Group to conduct normal activity. We will test with a scale-out behavior, then a scale-in behavior. We will force operations by updating the desired count of the Auto Scaling Group.

For scaling out, update the my.tfvars variable number_of_instances to three from two, and then apply our terraform template. We will expect to see one more EC2 instance for a total of three instances, with three Dedicated Hosts.

terraform -chdir=terraform-aws-ec2-mac apply -var-file="my.tfvars" -auto-approve

Then, take the steps in Step 3: Verify Deployment in order to check for expected behavior.

For scaling in, update the my.tfvars variable number_of_instances to one from three, and then apply our terraform template. We expect the Auto Scaling Group to reduce to one active EC2 instance, with three Dedicated Hosts remaining until they can be released 24 hours later.

terraform -chdir=terraform-aws-ec2-mac apply -var-file="my.tfvars" -auto-approve

Then, take the steps in Step 3: Verify Deployment in order to check for expected behavior.

Cleaning up

Complete the following steps in order to clean up the resources created by this exercise:

terraform -chdir=terraform-aws-ec2-mac destroy -var-file="my.tfvars" -auto-approve

This will take 10 to 12 minutes. Then, wait 24 hours for the Dedicated Hosts to be capable of being released, and then destroy the next template. We recommend putting a reminder on your calendar to make sure that you don’t forget this step.

terraform -chdir=terraform-aws-dedicated-hosts destroy -auto-approve
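
For the calendar reminder, the earliest release time can be computed from the allocation timestamp. This Python sketch assumes the 24-hour minimum is counted from the moment the Dedicated Host was allocated; the function names are hypothetical, and the check does not account for a host that is still in the scrubbing workflow.

```python
from datetime import datetime, timedelta, timezone

MIN_ALLOCATION = timedelta(hours=24)  # mac1.metal minimum allocation period


def earliest_release(allocated_at: datetime) -> datetime:
    """Return the first moment a mac1.metal Dedicated Host may be released."""
    return allocated_at + MIN_ALLOCATION


def can_release(allocated_at: datetime, now: datetime) -> bool:
    """True once the minimum allocation period has elapsed."""
    return now >= earliest_release(allocated_at)


if __name__ == "__main__":
    # Example allocation time (illustrative value only).
    allocated = datetime(2021, 11, 1, 9, 30, tzinfo=timezone.utc)
    print("Release no earlier than:", earliest_release(allocated).isoformat())
```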

Conclusion

In this post, we created an Auto Scaling Group using mac1.metal instance types. Scaling mechanisms work as they do with standard EC2 instance types, and the management of Dedicated Hosts is automated. This enables the management of macOS based application workloads to be automated based on the Well Architected patterns. Furthermore, this automation allows for rapid reactions to surges of demand and reclamation of unused compute once the demand is cleared. Now you can augment this system to integrate with other AWS services, such as Elastic Load Balancing, Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), and more.

Review the information available regarding CloudWatch custom metrics to discover possibilities for adding new ways for scaling your system. Now we would be eager to know what AWS solution you’re going to build with the content described by this blog post! To get started with EC2 Mac instances, please visit the product page.

Deep Dive on Amazon EC2 VT1 Instances

Post Syndicated from Rick Armstrong original https://aws.amazon.com/blogs/compute/deep-dive-on-amazon-ec2-vt1-instances/

This post is written by:  Amr Ragab, Senior Solutions Architect; Bryan Samis, Principal Elemental SSA; Leif Reinert, Senior Product Manager

Introduction

We at AWS are excited to announce that new Amazon Elastic Compute Cloud (Amazon EC2) VT1 instances are now generally available in the US-East (N. Virginia), US-West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo) Regions. This instance family provides dedicated video transcoding hardware in Amazon EC2 and offers up to 30% lower cost per stream as compared to G4dn GPU based instances or 60% lower cost per stream as compared to C5 CPU based instances. These instances are powered by Xilinx Alveo U30 media accelerators with up to eight U30 media accelerators per instance in the vt1.24xlarge. Each U30 accelerator comes with two XCU30 Zynq UltraScale+ SoCs, totaling 16 addressable devices in the vt1.24xlarge instance with H.264/H.265 Video Codec Units (VCU) cores.

Each U30 accelerator card comes with two XCU30 Zynq UltraScale+ SoCs

Currently, the VT1 family consists of three sizes, as summarized in the following:

Instance Type   vCPUs   RAM (GiB)   U30 accelerator cards   Addressable XCU30 SoCs
vt1.3xlarge     12      24          1                       2
vt1.6xlarge     24      48          2                       4
vt1.24xlarge    96      192         8                       16

Each addressable XCU30 SoC device supports:

  • Codec: MPEG4 Part 10 H.264, MPEG-H Part 2 HEVC H.265
  • Resolutions: 128×128 to 3840×2160
  • Flexible rate control: Constant Bitrate (CBR), Variable Bitrate (VBR), and Constant Quantization Parameter (QP)
  • Frame Scan Types: Progressive H.264/H.265
  • Input Color Space: YCbCr 4:2:0, 8-bit per color channel.

The following table outlines the number of transcoding streams per addressable device and instance type:

Transcoding | Each XCU30 SoC | vt1.3xlarge | vt1.6xlarge | vt1.24xlarge
3840x2160p60 | 1 | 2 | 4 | 16
3840x2160p30 | 2 | 4 | 8 | 32
1920x1080p60 | 4 | 8 | 16 | 64
1920x1080p30 | 8 | 16 | 32 | 128
1280x720p30 | 16 | 32 | 64 | 256
960x540p30 | 24 | 48 | 96 | 384

Each XCU30 SoC can support the following stream densities: 1x 4kp60, 2x 4kp30, 4x 1080p60, 8x 1080p30, 16x 720p30
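The per-instance figures follow directly from the per-SoC densities: each U30 card exposes two XCU30 SoCs, so an instance's stream count is the per-SoC density multiplied by its number of addressable SoCs. The following is a minimal sketch of that arithmetic; the dictionary and function names are our own illustration, not part of any Xilinx or AWS tool.

```python
# Streams per addressable XCU30 SoC for each output profile.
STREAMS_PER_SOC = {
    "3840x2160p60": 1,
    "3840x2160p30": 2,
    "1920x1080p60": 4,
    "1920x1080p30": 8,
    "1280x720p30": 16,
    "960x540p30": 24,
}

# U30 accelerator cards per instance size; each card exposes two XCU30 SoCs.
CARDS_PER_INSTANCE = {"vt1.3xlarge": 1, "vt1.6xlarge": 2, "vt1.24xlarge": 8}
SOCS_PER_CARD = 2


def max_streams(instance_type: str, profile: str) -> int:
    """Return the number of streams one instance supports at a given profile."""
    socs = CARDS_PER_INSTANCE[instance_type] * SOCS_PER_CARD
    return socs * STREAMS_PER_SOC[profile]


if __name__ == "__main__":
    for instance in CARDS_PER_INSTANCE:
        print(instance, max_streams(instance, "1920x1080p60"))
```

The sketch assumes streams scale linearly with the number of SoCs, which is what the table implies.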

Customers with applications such as live broadcast, video conferencing, and just-in-time transcoding can now benefit from an instance family dedicated to video encoding and decoding, with rescaling optimizations, at the lowest cost per stream. This instance family lets customers run batch, real-time, and faster than real-time transcoding workloads.

Deployment and Quick Start

To get started, launch a VT1 instance with one of the prebuilt VT1 Amazon Machine Images (AMIs) available on the AWS Marketplace. If you have AMI hardening or other requirements that call for installing the Xilinx software stack yourself, refer to the Xilinx Video SDK documentation for VT1.

The software stack combines the driver stack with management and client tools. The following terminology is used for this instance family:

  • XRT – Xilinx Runtime Library
  • XRM – Xilinx Runtime Management Library
  • XCDR – Xilinx Video Transcoding SDK
  • XMA – Xilinx Media Accelerator API and Samples
  • XOCL – Xilinx driver (xocl)

To run workloads directly on Amazon EC2 instances, you must load both the XRT and XRM stack. These are conveniently provided by loading the XCDR environment. To load the devices, run the following:

source /opt/xilinx/xcdr/setup.sh

With the output:

-----Source Xilinx U30 setup files-----
XILINX_XRT        : /opt/xilinx/xrt
PATH              : /opt/xilinx/xrt/bin:/usr/local/sbin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
LD_LIBRARY_PATH   : /opt/xilinx/xrt/lib:
PYTHONPATH        : /opt/xilinx/xrt/python:
XILINX_XRM      : /opt/xilinx/xrm
PATH            : /opt/xilinx/xrm/bin:/opt/xilinx/xrt/bin:/usr/local/sbin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
LD_LIBRARY_PATH : /opt/xilinx/xrm/lib:/opt/xilinx/xrt/lib:
Number of U30 devices found : 16  

Running Containerized Workloads on Amazon ECS and Amazon EKS

To help build AMIs for Amazon Linux 2, Ubuntu 18/20, Amazon ECS, and Amazon Elastic Kubernetes Service (Amazon EKS), we have provided a GitHub project that simplifies the build process using Packer:

https://github.com/aws-samples/aws-vt-baseami-pipeline

At the time of writing, Xilinx does not have an officially supported container runtime. However, it is possible to pass the specific devices in the docker run ... stanza. To set up this environment, download this specific script. The following example is the output for vt1.24xlarge:

[ec2-user@ip-10-0-254-236 ~]$ source xilinx_aws_docker_setup.sh
XILINX_XRT : /opt/xilinx/xrt
PATH : /opt/xilinx/xrt/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin
LD_LIBRARY_PATH  : /opt/xilinx/xrt/lib:
PYTHONPATH : /opt/xilinx/xrt/python:
XILINX_AWS_DOCKER_DEVICES : --device=/dev/dri/renderD128:/dev/dri/renderD128
--device=/dev/dri/renderD129:/dev/dri/renderD129
--device=/dev/dri/renderD130:/dev/dri/renderD130
--device=/dev/dri/renderD131:/dev/dri/renderD131
--device=/dev/dri/renderD132:/dev/dri/renderD132
--device=/dev/dri/renderD133:/dev/dri/renderD133
--device=/dev/dri/renderD134:/dev/dri/renderD134
--device=/dev/dri/renderD135:/dev/dri/renderD135
--device=/dev/dri/renderD136:/dev/dri/renderD136
--device=/dev/dri/renderD137:/dev/dri/renderD137
--device=/dev/dri/renderD138:/dev/dri/renderD138
--device=/dev/dri/renderD139:/dev/dri/renderD139
--device=/dev/dri/renderD140:/dev/dri/renderD140
--device=/dev/dri/renderD141:/dev/dri/renderD141
--device=/dev/dri/renderD142:/dev/dri/renderD142
--device=/dev/dri/renderD143:/dev/dri/renderD143
--mount type=bind,source=/sys/bus/pci/devices/0000:00:1d.0,target=/sys/bus/pci/devices/0000:00:1d.0 --mount type=bind,source=/sys/bus/pci/devices/0000:00:1e.0,target=/sys/bus/pci/devices/0000:00:1e.0

Once the devices have been enumerated, start the workload by running:

docker run -it $XILINX_AWS_DOCKER_DEVICES <image:tag>
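To make the shape of XILINX_AWS_DOCKER_DEVICES concrete, here is a small sketch that rebuilds the --device portion of that variable. The helper name is hypothetical, and it assumes the render nodes are numbered consecutively starting at renderD128, as in the vt1.24xlarge output above. It does not reproduce the --mount entries, which the setup script derives from the PCI devices on the host.

```python
from typing import List


def docker_device_flags(device_count: int, first_render_node: int = 128) -> List[str]:
    """Build one --device flag per XCU30 render node (hypothetical helper)."""
    flags = []
    for index in range(device_count):
        node = f"/dev/dri/renderD{first_render_node + index}"
        # Map the host device to the same path inside the container.
        flags.append(f"--device={node}:{node}")
    return flags


if __name__ == "__main__":
    # A vt1.24xlarge exposes 16 addressable devices.
    print(" ".join(docker_device_flags(16)))
```

In practice, sourcing the provided script remains the simpler route; the sketch only shows what the resulting flags look like.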

Amazon EKS Setup

To launch an EKS cluster with VT1 instances, create the AMI from the scripts provided in the repository referenced earlier.

https://github.com/aws-samples/aws-vt-baseami-pipeline

Once the AMI is created, launch an EKS cluster:

eksctl create cluster --region us-east-1 --without-nodegroup --version 1.19 \
       --zones us-east-1c,us-east-1d

Once the cluster is created, substitute the values for the cluster name, subnets, and AMI IDs in the following template.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: <cluster-name>
  region: us-east-1

vpc:
  id: vpc-285eb355
  subnets:
    public:
      endpoint-one:
        id: subnet-5163b237
      endpoint-two:
        id: subnet-baff22e5

managedNodeGroups:
  - name: vt1-ng-1d
    instanceType: vt1.3xlarge
    volumeSize: 200
    instancePrefix: vt1-ng-1d-worker
    ami: <ami-id>
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        cloudWatch: true
    ssh:
      allow: true
      publicKeyName: amrragab-aws
    subnets:
    - endpoint-one
    minSize: 1
    desiredCapacity: 1
    maxSize: 4
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh <cluster-name>

Save this file, and then deploy the nodegroup.

eksctl create nodegroup -f vt1-managed-ng.yaml

Once deployed, apply the FPGA U30 device plugin. The daemonset container is available on the Amazon Elastic Container Registry (ECR) public gallery. You can also access the daemonset deployment file.

kubectl apply -f xilinx-device-plugin.yml

Confirm that the Xilinx U30 device(s) are seen by the K8s API server and are allocatable in your job.

Capacity:
  attachable-volumes-aws-ebs:                  39
  cpu:                                         12
  ephemeral-storage:                           209702892Ki
  hugepages-1Gi:                               0
  hugepages-2Mi:                               0
  memory:                                      23079768Ki
  pods:                                        15
  xilinx.com/fpga-xilinx_u30_gen3x4_base_1-0:  1
Allocatable:
  attachable-volumes-aws-ebs:                  39
  cpu:                                         11900m
  ephemeral-storage:                           192188443124
  hugepages-1Gi:                               0
  hugepages-2Mi:                               0
  memory:                                      22547288Ki
  pods:                                        15
  xilinx.com/fpga-xilinx_u30_gen3x4_base_1-0:  1
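If you prefer to check this programmatically, the following sketch counts allocatable U30 devices in a node object such as the one returned by kubectl get node <node-name> -o json. The function name and the resource-name prefix match are our own assumptions, based on the resource name shown in the output above.

```python
# Resource names advertised by the device plugin start with this prefix
# (assumption based on the node output shown above).
U30_RESOURCE_PREFIX = "xilinx.com/fpga-xilinx_u30"


def allocatable_u30_devices(node: dict) -> int:
    """Sum the allocatable U30 resources reported in a Kubernetes node object."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return sum(
        int(quantity)
        for name, quantity in allocatable.items()
        if name.startswith(U30_RESOURCE_PREFIX)
    )


if __name__ == "__main__":
    # Trimmed-down sample mirroring the vt1.3xlarge node shown above.
    sample_node = {
        "status": {
            "allocatable": {
                "cpu": "11900m",
                "pods": "15",
                "xilinx.com/fpga-xilinx_u30_gen3x4_base_1-0": "1",
            }
        }
    }
    print(allocatable_u30_devices(sample_node))
```

A result of zero suggests that the device plugin daemonset is not running on the node, or that the node was not built from the VT1 AMI.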

Video Quality Analysis

The video quality produced by the U30 is roughly equivalent to the “faster” profile in the x264 and x265 codecs, or the “p4” preset using the nvenc codec on G4dn. For example, in the following test we encoded the same UHD (4K) video at multiple bitrates into H.264, and then compared Video Multimethod Assessment Fusion (VMAF) scores:

Plotting VMAF and bitrate we see comparable quality across x264 faster, h264_nvenc p4 and u30

Stream Density and Encoding Performance

To illustrate the VT1 instance family stream density and encoding performance, let’s look at the smallest instance, the vt1.3xlarge, which can encode up to eight simultaneous 1080p60 streams into H.264. We chose a set of similar instances at a close price point, and then compared how many 1080p60 H.264 streams they could encode simultaneously at an equivalent quality:

Instance | Codec | us-east-1 Hourly Price* | 1080p60 Streams / Instance | Hourly Cost / Stream
c5.4xlarge | x264 | $0.680 | 2 | $0.340
c6g.4xlarge | x264 | $0.544 | 2 | $0.272
c5a.4xlarge | x264 | $0.616 | 3 | $0.205
g4dn.xlarge | nvenc | $0.526 | 4 | $0.132
vt1.3xlarge | xma | $0.650 | 8 | $0.081

* Prices accurate as of the publishing date of this article.

As you can see, the vt1.3xlarge instance can encode four times as many streams as the c5.4xlarge, and at a lower hourly cost. It can also encode twice as many streams as a g4dn.xlarge instance. In this example, that yields a cost per stream reduction of up to 76% over c5.4xlarge, and up to 39% compared to g4dn.xlarge.
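The cost-per-stream figures are simply the hourly price divided by the number of simultaneous streams. The sketch below reproduces that arithmetic from the prices and stream counts in the table; the names are illustrative only.

```python
# (hourly us-east-1 price in USD, simultaneous 1080p60 streams) per instance.
INSTANCES = {
    "c5.4xlarge": (0.680, 2),
    "c6g.4xlarge": (0.544, 2),
    "c5a.4xlarge": (0.616, 3),
    "g4dn.xlarge": (0.526, 4),
    "vt1.3xlarge": (0.650, 8),
}


def cost_per_stream(instance: str) -> float:
    """Hourly cost of one stream on the given instance."""
    price, streams = INSTANCES[instance]
    return price / streams


def reduction_percent(baseline: str, candidate: str) -> float:
    """Percentage by which the candidate lowers cost per stream versus the baseline."""
    return (1 - cost_per_stream(candidate) / cost_per_stream(baseline)) * 100


if __name__ == "__main__":
    for name in INSTANCES:
        print(f"{name}: ${cost_per_stream(name):.3f} per stream-hour")
    for baseline in ("c5.4xlarge", "g4dn.xlarge"):
        print(f"vs {baseline}: {reduction_percent(baseline, 'vt1.3xlarge'):.1f}% lower")
```

Computed from the unrounded prices, the reduction is roughly 76% against c5.4xlarge and roughly 38% against g4dn.xlarge; the 39% figure follows from the per-stream costs as rounded in the table ($0.081 and $0.132).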

Faster than Real-time Transcoding

In addition to encoding multiple live streams in parallel, VT1 instances can also be utilized to encode file-based content at faster-than-real-time performance. This can be done by over-provisioning resources on a single XCU30 device so that more resources are dedicated to transcoding than are necessary to maintain real-time.

For example, running the following command (specifying -cores 4) will utilize all resources on a single XCU30 device, and yield an encode speed of approximately 177 FPS, or 2.95 times faster than real-time for a 60 FPS source:

$ ffmpeg -c:v mpsoc_vcu_h264 -i input_1920x1080p60_H264_8Mbps_AAC_Stereo.mp4 -f mp4 -b:v 5M -c:v mpsoc_vcu_h264 -cores 4 -slices 4 -y /tmp/out.mp4
frame=43092 fps=177 q=-0.0 Lsize= 402721kB time=00:11:58.92 bitrate=4588.9kbits/s speed=2.95x

To maximize FPS further, utilize the “split and stitch” operation to break the input file into segments, and then transcode those in parallel across multiple XCU30 chips or even multiple U30 cards in an instance. Then, recombine the file at the output. For more information, see the Xilinx Video SDK documentation on Faster than Real-time transcoding.

Using the provided example script on the same 12-minute source file as the preceding example on a vt1.3xlarge, we can utilize both addressable devices on the U30 card at once in order to yield an effective encode speed of 512 fps, or 8.5 times faster than real-time.

$ python3 13_ffmpeg_transcode_only_split_stitch.py -s input_1920x1080p60_H264_8Mbps_AAC_Stereo.mp4 -d /tmp/out.mp4 -i h264 -o h264 -b 5.0

There are 1 cards, 2 chips in the system
...
Time from start to completion : 84 seconds (1 minute and 24 seconds)
This clip was processed 8.5 times faster than realtime

This clip was effectively processed at 512.34 FPS
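The speed-up figures in both examples come from the same two ratios: frames processed per second of wall-clock time, and that rate divided by the source frame rate. Here is a minimal sketch, with function names of our own choosing, using the frame count reported by ffmpeg for the same source file.

```python
def effective_fps(total_frames: int, elapsed_seconds: float) -> float:
    """Frames processed per second of wall-clock time."""
    return total_frames / elapsed_seconds


def realtime_factor(encode_fps: float, source_fps: float) -> float:
    """How many times faster than real-time the encode ran."""
    return encode_fps / source_fps


if __name__ == "__main__":
    # Single XCU30 device: ffmpeg reported 177 FPS on a 60 FPS source.
    print(f"single device: {realtime_factor(177, 60):.2f}x")
    # Split and stitch: 43092 frames in roughly 84 seconds.
    fps = effective_fps(43092, 84)
    print(f"split and stitch: {fps:.0f} FPS, {realtime_factor(fps, 60):.2f}x")
```

The script above reports 512.34 FPS rather than the 513 this yields, presumably because its elapsed time is slightly longer than the rounded 84 seconds.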

Conclusion

We are excited to launch VT1, our first EC2 instance with dedicated hardware acceleration for video transcoding, which provides up to 30% lower cost per stream as compared to G4dn GPU-based instances or 60% lower cost per stream as compared to C5 CPU-based instances. With up to eight Xilinx Alveo U30 media accelerators, you can parallelize up to 16 4K UHD streams, for batch, real-time, and faster than real-time transcoding. If you have any questions, reach out to your account team. Now, go power up your video transcoding workloads with Amazon EC2 VT1 instances.

Building ARM64 applications on AWS Graviton2 using the AWS CDK and Self-Hosted Runners for GitHub Actions

Post Syndicated from Rick Armstrong original https://aws.amazon.com/blogs/compute/building-arm64-applications-on-aws-graviton2-using-the-aws-cdk-and-self-hosted-runners-for-github-actions/

This post is written by Frank Dallezotte, Sr. Technical Account Manager, and Maxwell Moon, Sr. Solutions Architect

AWS Graviton2 processors are custom built by AWS using 64-bit Arm Neoverse cores to deliver great price performance for workloads running in Amazon Elastic Compute Cloud (Amazon EC2). Graviton2-based instances are powered by AWS Graviton2 processors with 64 physical cores, which use 64-bit Arm Neoverse N1 cores and custom silicon designed by AWS, built using advanced 7-nanometer manufacturing technology.

Customers are migrating their applications to AWS Graviton2 based instance families in order to take advantage of up to 40% better price performance over comparable current generation x86-based instances for a broad spectrum of workloads. This migration includes updating continuous integration and continuous deployment (CI/CD) pipelines in order to build applications for ARM64.

One option for running CI/CD workflows is GitHub Actions, a GitHub product that lets you automate tasks within your software development lifecycle. Customers utilizing GitHub Actions today can host their own runners and customize the environment used to run jobs in GitHub Actions workflows, allowing you to build ARM64 applications. GitHub recommends that you only use self-hosted runners with private repositories.

This post will teach you to set up an AWS Graviton2 instance as a self-hosted runner for GitHub Actions. We will verify the runner is added to the default runner group for a GitHub Organization, which can only be used by private repositories by default. Then, we’ll walk through setting up a continuous integration workflow in GitHub Actions that runs on the self-hosted Graviton2 runner and hosted x86 runners.

Overview

This post will cover the following:

  • Network configurations for deploying a self-hosted runner on EC2.
  • Generating a GitHub token for a GitHub organization, and then storing the token and organization URL in AWS Systems Manager Parameter Store.
  • Configuring a self-hosted GitHub runner on EC2.
  • Deploying the network and EC2 resources by using the AWS Cloud Development Kit (AWS CDK).
  • Adding Graviton2 self-hosted runners to a GitHub Actions workflow for an example Python application.
  • Running the workflow.

Prerequisites

  1. An AWS account with permissions to create the necessary resources.
  2. A GitHub account: This post assumes that you have the required permissions as a GitHub organization admin to configure your GitHub organization, as well as create workflows.
  3. Familiarity with the AWS Command Line Interface (AWS CLI).
  4. Access to AWS CloudShell.
  5. Access to an AWS account with administrator or PowerUser (or equivalent) AWS Identity and Access Management (IAM) role policies attached.
  6. Account capacity for two Elastic IPs for the NAT gateways.
  7. An IPv4 CIDR block for a new Virtual Private Cloud (VPC) that is created as part of the AWS CDK stack.

Security

We’ll be adding the self-hosted runner at the GitHub organization level. This makes the runner available for use by the private repositories belonging to the GitHub organization. When new runners are created for an organization, they are automatically assigned to the default self-hosted runner group, which, by default, cannot be utilized by public repositories.

You can verify that your self-hosted runner group is only available to private repositories by navigating to the Actions section of your GitHub Organization’s settings. Select the “Runner Groups” submenu, then the Default runner group, and confirm that “Allow public repositories” is not checked.

GitHub recommends only utilizing self-hosted runners with private repositories. Allowing self-hosted runners on public repositories and allowing workflows on public forks introduces a significant security risk. More information about the risks can be found in self-hosted runner security with public repositories.

In this post, we verified that for the default runner group, allowing public repositories is not enabled.

AWS CDK

To model and deploy our architecture, we use the AWS CDK. The AWS CDK lets us design components for a self-hosted runner that are customizable and shareable in several popular programming languages.

Our AWS CDK application is defined by two stacks (VPC and EC2) that we’ll use to create the networking resources and our self-hosted runner on EC2.

Network Configuration

This section will walk through the networking resources that the CDK stack will create in order to support this architecture. We are deploying our self-hosted runner in a private subnet. A NAT gateway in a public subnet lets the runner make requests to GitHub, but does not allow direct access to the instance from the internet.

  • Virtual Private Cloud – Defines a VPC across two Availability Zones with an IPv4 CIDR block that you set.
  • Public Subnet – A NAT Gateway will be created in each public subnet for outbound traffic through the VPC’s internet gateway.
  • Private Subnet – Contains the EC2 instance for the self-hosted runner that routes internet bound traffic through a NAT gateway in the public subnet.
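To get a feel for sizing the CIDR block, the following sketch divides a VPC range into four equal subnets, one public and one private per Availability Zone. This is only an illustration using Python’s standard ipaddress module: the actual subnet sizes are chosen by the CDK stack and may differ, and the function name is hypothetical.

```python
import ipaddress
from typing import List


def split_into_four(vpc_cidr: str) -> List[str]:
    """Split a VPC CIDR block into four equally sized subnets."""
    network = ipaddress.ip_network(vpc_cidr)
    # Two extra prefix bits yield 2**2 = 4 subnets.
    return [str(subnet) for subnet in network.subnets(prefixlen_diff=2)]


if __name__ == "__main__":
    labels = ["public-az1", "public-az2", "private-az1", "private-az2"]
    for label, cidr in zip(labels, split_into_four("192.168.0.0/24")):
        print(label, cidr)
```

With a /24 block, each of the four subnets is a /26, which leaves ample room for a single runner instance and the NAT gateways.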

AWS architecture diagram for self-hosted runner.

Configuring the GitHub Runner on EC2

To successfully provision the instance, we must supply the GitHub organization URL and token. To accomplish this, we’ll create two AWS Systems Manager Parameter Store values (gh-url and gh-token), which will be accessed via the EC2 instance user data script when the CDK application deploys the EC2 stack. The EC2 instance will only be accessible through AWS Systems Manager Session Manager.

Get a Token From GitHub

The following steps are based on these instructions for adding self-hosted runners – GitHub Docs.

  1. Navigate to the private GitHub organization where you’d like to configure a custom GitHub Action Runner.
  2. Under your repository name, organization, or enterprise, click Settings.
  3. In the left sidebar, click Actions, then click Runners.
  4. Under “Runners”, click Add runner.
  5. Copy the token value under the “Configure” section.

NOTE: this is an automatically generated time-limited token for authenticating the request.

Create the AWS Systems Manager Parameter Store Values

Next, launch an AWS CloudShell environment, and then create the following AWS Systems Manager Parameter Store values in the AWS account where you’ll be deploying the AWS CDK stack.

The names gh-url and gh-token, and types String and SecureString, respectively, are required for this integration:

#!/bin/bash
aws ssm put-parameter --name gh-token --type SecureString --value ABCDEFGHIJKLMNOPQRSTUVWXYZABC
aws ssm put-parameter --name gh-url --type String --value https://github.com/your-url
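If you want to confirm that both values are readable before deploying, a sketch along the following lines can help. It assumes boto3 is installed and that your credentials allow ssm:GetParameter; the function name is hypothetical. The SSM client is passed in as an argument so that the logic can be exercised without AWS access.

```python
def read_runner_config(ssm_client) -> dict:
    """Fetch the runner token and organization URL from Parameter Store."""
    # gh-token is a SecureString, so it must be decrypted on read.
    token = ssm_client.get_parameter(Name="gh-token", WithDecryption=True)
    url = ssm_client.get_parameter(Name="gh-url")
    return {
        "token": token["Parameter"]["Value"],
        "url": url["Parameter"]["Value"],
    }


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   config = read_runner_config(boto3.client("ssm"))
#   print(config["url"])
```

Avoid printing the token itself, since it grants the ability to register a runner with your organization.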

Self-Hosted Runner Configuration

The EC2 instance user data script will install all required packages, and it will register the GitHub Runner application using the gh-url and gh-token parameters from the AWS Systems Manager Parameter Store. These parameters are stored as variables (TOKEN and REPO) in order to configure the runner.

This script runs automatically when the EC2 instance is launched, and it is included in the GitHub repository. We’ll utilize Amazon Linux 2 for the operating system on the runner instance.

#!/bin/bash
yum update -y
# Download and build a recent version of International Components for Unicode.
# https://github.com/actions/runner/issues/629
# https://github.com/dotnet/core/blob/main/Documentation/linux-prereqs.md
# Install jq for parsing parameter store
yum install -y libicu60 jq
# Get the latest runner version
VERSION_FULL=$(curl -s https://api.github.com/repos/actions/runner/releases/latest | jq -r .tag_name)
RUNNER_VERSION="${VERSION_FULL:1}"


# Create a folder
mkdir /home/ec2-user/actions-runner && cd /home/ec2-user/actions-runner || exit
# Download the latest runner package
curl -o actions-runner-linux-arm64-${RUNNER_VERSION}.tar.gz -L https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-arm64-${RUNNER_VERSION}.tar.gz
# Extract the installer
tar xzf ./actions-runner-linux-arm64-${RUNNER_VERSION}.tar.gz
chown -R ec2-user /home/ec2-user/actions-runner
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
TOKEN=$(aws ssm get-parameter --region "${REGION}" --name gh-token --with-decryption | jq -r '.Parameter.Value')
REPO=$(aws ssm get-parameter --region "${REGION}" --name gh-url | jq -r '.Parameter.Value')
sudo -i -u ec2-user bash << EOF
/home/ec2-user/actions-runner/config.sh --url "${REPO}" --token "${TOKEN}" --name gh-g2-runner-"${INSTANCE_ID}" --work /home/ec2-user/actions-runner --unattended
EOF
./svc.sh install
./svc.sh start
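The version handling in the script is easy to miss: the release tag carries a leading “v”, which is stripped before it is placed in the package file name but kept in the download path. The following sketch shows the same string manipulation in isolation. The function name and the example tag are hypothetical, and unlike the shell expansion, it strips the first character only when it is a “v”.

```python
def runner_download_url(tag_name: str, arch: str = "arm64") -> str:
    """Build the runner package URL from a release tag such as 'v2.285.1'."""
    # The tag has a leading 'v'; the package file name does not.
    version = tag_name[1:] if tag_name.startswith("v") else tag_name
    package = f"actions-runner-linux-{arch}-{version}.tar.gz"
    return f"https://github.com/actions/runner/releases/download/v{version}/{package}"


if __name__ == "__main__":
    # 'v2.285.1' is only an example tag, not necessarily the latest release.
    print(runner_download_url("v2.285.1"))
```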

Deploying the Network Resources and Self-Hosted Runner

In this section, we’ll deploy the network resources and EC2 instance for the self-hosted GitHub runner using the AWS CDK.

From the same CloudShell environment, run the following commands in order to deploy the AWS CDK application:

#!/bin/bash
sudo npm install aws-cdk -g
git clone https://github.com/aws-samples/cdk-graviton2-gh-sh-runner.git
cd cdk-graviton2-gh-sh-runner
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
export VPC_CIDR="192.168.0.0/24" # Set your CIDR here.
export AWS_ACCOUNT=`aws sts get-caller-identity | jq -r '.Account'`
cdk bootstrap aws://$AWS_ACCOUNT/$AWS_REGION
cdk deploy --all
# Note: Before the EC2 stack deploys you will be prompted for approval
# The message states 'This deployment will make potentially sensitive changes according to your current security approval level (--require-approval broadening).' and prompts for y/n

These steps will deploy an EC2 instance self-hosted runner that is added to your GitHub organization (as previously specified by the gh-url parameter). Confirm the self-hosted runner has been successfully added to your organization by navigating to the Settings tab for your GitHub organization, selecting the Actions option from the left-hand panel, and then selecting Runners.

Default runner group including an ARM64 self-hosted runner.

Extending a Workflow to use the self-hosted Runner

This section will walk through setting up a GitHub Actions workflow to extend a test pipeline for an example application. We’ll define a workflow that runs a series of static code checks and unit tests on both x86 and ARM.

Our example application is an online bookstore where users can find books, add them to their cart, and create orders. The application is written in Python using the Flask framework, and it uses Amazon DynamoDB for data storage.

GitHub Actions expects workflows to be defined in the .github/workflows folder, with an extension of either .yaml or .yml. We’ll create the directory, as well as an empty file inside the directory called main.yml.

#!/bin/bash
mkdir -p .github/workflows
touch .github/workflows/main.yml

First, we must define when our workflow will run. We’ll define the workflow to run when pull requests are created, synchronized (new commits are pushed), or re-opened, and then on push to the main branch.

# main.yml
on:
  pull_request:
    types: [opened, synchronize, reopened]
  push:
    branches:
      - main

Next, define the workflow by adding jobs. Each job can have one or more steps to run. A step defines a command, set up task, or action that will be run. You can also create custom Actions with user-defined steps and repeatable modules.

Next, we’ll define a single job test to include every step of our workflow, as well as a strategy for the job to run the workflow on both x86 and the Graviton2 self-hosted runner. We’ll specify both ubuntu-latest, a hosted runner, and self-hosted for our Graviton2 runner. This lets our workflow run in parallel on two different CPUs, and it is not disruptive of existing processes.

# main.yml
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, self-hosted]
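Conceptually, a matrix strategy fans out into one job per combination of matrix values. The sketch below illustrates that expansion; it is our own simplification, not GitHub’s implementation, and it ignores features such as include and exclude.

```python
from itertools import product
from typing import Dict, List


def expand_matrix(matrix: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return one dictionary of matrix values per job that would be created."""
    keys = list(matrix)
    combinations = product(*(matrix[key] for key in keys))
    return [dict(zip(keys, combination)) for combination in combinations]


if __name__ == "__main__":
    for job in expand_matrix({"os": ["ubuntu-latest", "self-hosted"]}):
        print(f"runs-on: {job['os']}")
```

With a single os key holding two values, this yields two jobs, one for the hosted runner and one for the self-hosted Graviton2 runner. Because fail-fast is set to false, a failure in one job does not cancel the other.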

Now we can add steps that each runner will run. We’ll be using custom Actions that we create for each step, as well as the pre-built action checkout for pulling down the latest changes to each runner.

We’ll keep each custom action in .github/actions/<name of action>/action.yml, a common convention for actions stored in the same repository as the workflow. We’ll define four custom Actions: check_system_deps, check_and_install_app_deps, run_static_checks, and run_unit_tests.

#!/bin/bash
for action in check_system_deps check_and_install_app_deps run_static_checks run_unit_tests; do \
    mkdir -p .github/actions/${action} && \
    touch .github/actions/${action}/action.yml; \
    done

We define an Action with a series of steps to ensure that the runner is prepared to run our tests and checks:

  1. Check that Python3 is installed
  2. Check that pipenv is installed

Our using statement specifies “composite” to run all steps as a single action.

# .github/actions/check_system_deps/action.yml
name: "Check System Deps"
description: "Check for Python 3.x, list version, Install pipenv if it is not installed"
runs:
  using: "composite"
  steps:
    - name: Check For Python3.x
      run: |
        which python3
        VERSION=$(python3 --version | cut -d ' ' -f 2)
        # The second dot-separated field is the minor version (the x in 3.x).
        VERSION_MINOR=$(echo ${VERSION} | cut -d '.' -f 2)
        [ $VERSION_MINOR -ge 8 ]
      shell: bash
    - name: Install Pipenv
      run: python3 -m pip install pipenv --user
      shell: bash
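Note that the shell check above compares only the second version field against 8. If you would rather compare the major and minor versions together, a tuple comparison in Python is one option. The following is a minimal sketch with hypothetical names; it is not part of the example repository.

```python
import sys
from typing import Tuple

MINIMUM_VERSION = (3, 8)


def meets_minimum(version: Tuple[int, int],
                  minimum: Tuple[int, int] = MINIMUM_VERSION) -> bool:
    """Compare (major, minor) tuples so that both fields are checked."""
    return version >= minimum


if __name__ == "__main__":
    current = sys.version_info[:2]
    status = "ok" if meets_minimum(current) else "too old"
    print(f"Python {current[0]}.{current[1]}: {status}")
```

Tuples compare element by element, so (3, 10) correctly ranks above (3, 8), and (2, 9) ranks below it.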

Now that we have the correct version of Python and a package manager installed, we’ll create an action to install our application dependencies:

# .github/actions/check_and_install_app_deps/action.yml
name: "Install deps"
description: "Install application dependencies"
runs:
  using: "composite"
  steps:
    - name: Install deps
      run: python3 -m pipenv install --dev
      shell: bash

Next, we’ll create an action to run all of our static checks. For our example application, we want to perform the following checks:

  1. Check for security vulnerabilities using Bandit
  2. Check the cyclomatic complexity using McCabe
  3. Check for code that has no references using Vulture
  4. Perform a static type check using MyPy
  5. Check for open CVEs in dependencies using Safety

# .github/actions/run_static_checks/action.yml
name: "Run Static Checks"
description: "Run static checks for the python app"
runs:
  using: "composite"
  steps:
    - name: Check common sense security issues
      run: python3 -m pipenv run bandit -r graviton2_gh_runner_flask_app/
      shell: bash
    - name: Check Cyclomatic Complexity
      run: python3 -m pipenv run flake8 --max-complexity 10 graviton2_gh_runner_flask_app
      shell: bash
    - name: Check for dead code
      run: python3 -m pipenv run vulture graviton2_gh_runner_flask_app --min-confidence 100
      shell: bash
    - name: Check static types
      run: python3 -m pipenv run mypy graviton2_gh_runner_flask_app
      shell: bash
    - name: Check for CVEs
      run: python3 -m pipenv check
      shell: bash

We’ll create an action to run the unit tests using PyTest.

# .github/actions/run_unit_tests/action.yml
name: "Run Unit Tests"
description: "Run unit tests for python app"
runs:
  using: "composite"
  steps:
    - name: Run PyTest
      run: python3 -m pipenv run pytest -sv tests
      shell: bash

Finally, we’ll bring all of these actions into our steps in main.yml in order to define every step that will be run on each runner any time that our workflow is run.

# main.yml
steps:
   - name: Checkout Code
     uses: actions/checkout@v2
   - name: Check System Deps
     uses: ./.github/actions/check_system_deps
   - name: Install deps
     uses: ./.github/actions/check_and_install_app_deps
   - name: Run Static Checks
     uses: ./.github/actions/run_static_checks
   - name: Run PyTest
     uses: ./.github/actions/run_unit_tests

Save the file.

Running the Workflow

The workflow will run on the runners when you commit and push your changes. To demonstrate, we’ll create a PR to update the README of our example app in order to kick off the workflow.

After the change is pushed, see the status of your workflow by navigating to your repository in the browser. Select the Actions tab. Select your workflow run from the list of All Workflows. This opens the Summary page for the workflow run.

Successful run of jobs on hosted Ubuntu and self-hosted ARM64 runners.

As each step defined in the workflow job runs, view their details by clicking the job name on the left-hand panel or on the graph representation. The following images are screenshots of the jobs, and example outputs of each step. First, we have check_system_deps.

Successful run of a custom action checking for required system dependencies.

We’ve excluded a screenshot of check_and_install_app_deps that shows the output of pipenv install. Next, we can see that our change passes the static checks in our run_static_checks Action (first) and the unit tests in our run_unit_tests Action (second).

Successful run of a custom action running static checks.

Successful run of a custom action running unit tests with PyTest.

Finally, our workflow completes successfully!

Successful run of jobs on hosted and self-hosted runners.

Clean up

To delete the AWS CDK stacks, launch CloudShell and enter the following commands:

#!/bin/bash
cd cdk-graviton2-gh-sh-runner
source .venv/bin/activate
# Re-set the environment variables again if required
# export VPC_CIDR="192.168.0.0/24" # Set your CIDR here.
cdk destroy --all

Conclusion

This post covered configuring a self-hosted GitHub Runner on an EC2 instance with a Graviton2 processor, the required network resources, and a workflow that will run on the Runner on each repository push or pull request for the example application. The Runner is configured at the Organization level, which by default only allows access by private repositories. Lastly, we showed an example run of the workflow after creating a pull request for our example app.

Self-hosted runners on Graviton2 for GitHub Actions let you add ARM64 to your CI/CD workflows, accelerating migrations to take advantage of the price and performance of Graviton2. In this blog we’ve utilized a strategy to create a build matrix to run jobs on hosted and self-hosted runners.

We could further extend this workflow by automating deployment with AWS CodeDeploy or sending a build status notification to Slack. To reduce the cost of idle resources during periods without builds, you can set up an Amazon CloudWatch Event to stop and start the instance on a schedule, so that it runs only during business hours.

GitHub Actions also supports ephemeral self-hosted runners, which are automatically unregistered from the service after they run a job. Ephemeral runners are a good choice for self-managed environments where you need each job to run on a clean image.

For more examples of how to create development environments using AWS Graviton2 and AWS CDK, reference Building an ARM64 Rust development environment using AWS Graviton2 and AWS CDK.

New – Amazon EC2 C6i Instances Powered by the Latest Generation Intel Xeon Scalable Processors

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-ec2-c6i-instances-powered-by-the-latest-generation-intel-xeon-scalable-processors/

We recently introduced Amazon EC2 M6i instances powered by the latest generation Intel® Xeon® Scalable processors with an all-core turbo frequency of 3.5 GHz, which offer customers up to 15% improvement in price performance compared to M5 instances.

Today, I am happy to announce the availability of the new compute-optimized Amazon EC2 C6i instances, which offer up to 15% improvement in price performance for a variety of workloads, versus comparable C5 instances. These instances are ideal for running compute-intensive workloads such as batch processing, machine learning, high-end gaming, high performance computing (HPC) workloads, ad serving, and video encoding.

Compared to C5 instances using an Intel processor, this new instance type provides:

  • Up to 15% improvement in compute price performance.
  • Up to 9% higher memory bandwidth.
  • Up to 40 Gbps for Amazon Elastic Block Store (EBS) and 50 Gbps for networking.
  • Always-on memory encryption.

Like M6i, C6i instances are available in 9 sizes:

Name | vCPUs | Memory (GiB) | Network Bandwidth (Gbps) | EBS Throughput (Gbps)
c6i.large | 2 | 4 | Up to 12.5 | Up to 10
c6i.xlarge | 4 | 8 | Up to 12.5 | Up to 10
c6i.2xlarge | 8 | 16 | Up to 12.5 | Up to 10
c6i.4xlarge | 16 | 32 | Up to 12.5 | Up to 10
c6i.8xlarge | 32 | 64 | 12.5 | 10
c6i.12xlarge | 48 | 96 | 18.75 | 15
c6i.16xlarge | 64 | 128 | 25 | 20
c6i.24xlarge | 96 | 192 | 37.5 | 30
c6i.32xlarge | 128 | 256 | 50 | 40

The new instances are built on the AWS Nitro System, a collection of building blocks that offloads many of the traditional virtualization functions to dedicated hardware for high performance, high availability, and highly secure cloud instances.

As you should do with M6i instances, for optimal networking performance, upgrade your Elastic Network Adapter (ENA) drivers to version 3. For more information, see this article about migrating an EC2 instance to a sixth-generation instance in the AWS Knowledge Center.

C6i instances support Elastic Fabric Adapter (EFA) on the c6i.32xlarge size for workloads that can benefit from lower network latency, such as HPC and video processing.

Available Now
C6i instances are available today in four AWS Regions: US East (N. Virginia, Ohio), US West (Oregon), and EU (Ireland). As usual with EC2, you pay for what you use. For more information, see the EC2 pricing page.

To learn more, visit the EC2 C6i instance page. You can send feedback to the AWS forum for Amazon EC2 or through your usual AWS Support contacts.

Channy

Identifying optimal locations for flexible workloads with Spot placement score

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/identifying-optimal-locations-for-flexible-workloads-with-spot-placement-score/

This post is written by Jessie Xie, Solutions Architect for EC2 Spot, and Peter Manastyrny, Senior Product Manager for EC2 Auto Scaling and EC2 Fleet.

Amazon EC2 Spot Instances let you run flexible, fault-tolerant, or stateless applications in the AWS Cloud at up to a 90% discount from On-Demand prices. Since we introduced Spot Instances back in 2009, we have been building new features and integrations with a single goal – to make Spot easy and efficient to use for your flexible compute needs.

Spot Instances are spare EC2 compute capacity in the AWS Cloud available for steep discounts. In exchange for the discount, Spot Instances are interruptible and must be returned when EC2 needs the capacity back. The location and amount of spare capacity available at any given moment is dynamic and changes in real time. This is why Spot workloads should be flexible, meaning they can utilize a variety of different EC2 instance types and can be shifted in real time to where the spare capacity currently is. You can use Spot Instances with tools such as EC2 Fleet and Amazon EC2 Auto Scaling which make it easy to run workloads on multiple instance types.

The AWS Cloud spans 81 Availability Zones across 25 Regions with plans to launch 21 more Availability Zones and 7 more Regions. However, until now there was no way to find an optimal location (either a Region or Availability Zone) to fulfill your Spot capacity needs without trying to launch Spot Instances there first. Today, we are excited to announce Spot placement score, a new feature that helps you identify an optimal location to run your workloads on Spot Instances. Spot placement score recommends an optimal Region or Availability Zone based on the amount of Spot capacity you need and your instance type requirements.

Spot placement score is useful for workloads that could potentially run in a different Region. Additionally, because the score takes into account your instance type selection, it can help you determine if your request is sufficiently instance type flexible for your chosen Region or Availability Zone.

How Spot placement score works

To use Spot placement score you need to specify the amount of Spot capacity you need, what your instance type requirements are, and whether you would like a recommendation for a Region or a single Availability Zone. For instance type requirements, you can either provide a list of instance types, or the instance attributes, like the number of vCPUs and amount of memory. If you choose to use the instance attributes option, you can then use the same attribute configuration to request your Spot Instances in the recommended Region or Availability Zone with the new attribute-based instance type selection feature in EC2 Fleet or EC2 Auto Scaling.

Spot placement score provides a list of Regions or Availability Zones, each scored from 1 to 10, based on factors such as the requested instance types, target capacity, historical and current Spot usage trends, and time of the request. The score reflects the likelihood of success when provisioning Spot capacity, with a 10 meaning that the request is highly likely to succeed. Scores change with the current Spot capacity situation, so the same request can yield different scores when run at different times. It is important to note that the score serves as a guideline, and no score guarantees that your Spot request will be fully or partially fulfilled.

You can also filter your score by Regions or Availability Zones, which is useful for cases where you can use only a subset of AWS Regions, for example any Region in the United States.
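
As a rough sketch of the inputs described above, the following Python helper assembles the parameters for a Spot placement score request. The keys follow the parameter names that boto3 exposes for `get_spot_placement_scores`; the helper name `build_score_request` and the sample values are assumptions made for illustration, and the code only builds the request rather than calling AWS.

```python
def build_score_request(instance_types, target_capacity, unit="units",
                        single_az=False, regions=None):
    """Assemble keyword arguments for a Spot placement score request.

    The keys mirror boto3's ec2.get_spot_placement_scores parameters;
    this helper itself is a hypothetical convenience wrapper.
    """
    if unit not in ("units", "vcpu", "memory-mib"):
        raise ValueError(f"unsupported capacity unit: {unit}")
    request = {
        "InstanceTypes": list(instance_types),
        "TargetCapacity": target_capacity,
        "TargetCapacityUnitType": unit,
        # False asks for Region-level scores, True for a single zone.
        "SingleAvailabilityZone": single_az,
    }
    if regions:
        # Optional filter, for example only Regions in the United States.
        request["RegionNames"] = list(regions)
    return request


request = build_score_request(
    ["r5.8xlarge", "c5.9xlarge", "m5.8xlarge"],
    target_capacity=2000,
    regions=["us-east-1", "us-east-2", "us-west-1", "us-west-2"],
)
print(request)
# With credentials configured, this could be passed on as:
#   boto3.client("ec2").get_spot_placement_scores(**request)
```

Keeping the request as a plain dictionary makes it easy to rerun the same query at different times and compare the scores.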

Let’s see how Spot placement score works in practice through an example.

Using Spot placement score with AWS Management Console

To try Spot placement score, log into your AWS account, select EC2, Spot Requests, and click on Spot placement score to open the Spot placement score window.

Spot placement score screen in AWS Management Console.

Here, you need to provide your target capacity and instance type requirements by clicking on Enter requirements. You can enter target capacity as a number of instances, vCPUs, or memory. vCPUs and memory options are useful for vertically scalable workloads that are sized for a total amount of compute resources and can utilize a wide range of instance sizes. Target capacity is limited based on your recent Spot usage, with an allowance for potential usage growth. For accounts that do not have recent Spot usage, there is a default limit aligned with the Spot Instances limit.

For instance type requirements, there are two options. The first option is to select the Specify instance attributes that match your compute requirements tab and enter your compute requirements as a number of vCPUs, amount of memory, CPU architecture, and other optional attributes. The second option is to select the Manually select instance types tab and select instance types from the list.

Please note that you need to select at least three different instance types (that is, different families, generations, or sizes). If you specify a smaller number of instance types, Spot placement score will always yield a low score. Spot placement score is designed to help you find an optimal location to request Spot capacity tailored to your specific workload needs, but it is not intended to be used for getting high-level Spot capacity information across all Regions and instance types.

Let’s try to find an optimal location to run a workload that can utilize r5.8xlarge, c5.9xlarge, and m5.8xlarge instance types and is sized at 2000 instances.

Spot placement score screen in AWS Management Console with selected target capacity at 2000 instances and selected r5.8xlarge, c5.9xlarge, and m5.8xlarge instance types.

Once you select 2000 instances under Target capacity, select r5.8xlarge, c5.9xlarge, and m5.8xlarge instances under Select instance types, and click Load placement score button, you will get a list of Regions sorted by score in a descending order. There is also an option to filter by specific Regions if needed.

The highest rated Region for your requirements turns out to be US East (N. Virginia) with a score of 8. The second closest contender is Europe (Ireland) with a score of 5. That tells you that right now the optimal Region for your Spot requirements is US East (N. Virginia).
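
To illustrate how the result could be handled programmatically, here is a minimal Python sketch that sorts a response by score. The response shape (`SpotPlacementScores` entries with `Region` and `Score`) follows the boto3 output for `get_spot_placement_scores`; the function name `rank_locations` is hypothetical, and the sample values simply mirror the console example above rather than a live query.

```python
def rank_locations(response):
    """Sort placement scores from highest to lowest."""
    scores = response.get("SpotPlacementScores", [])
    return sorted(scores, key=lambda item: item["Score"], reverse=True)


# Illustrative response shaped like the API output; the two scores mirror
# the console example above, not a live query.
sample_response = {
    "SpotPlacementScores": [
        {"Region": "eu-west-1", "Score": 5},
        {"Region": "us-east-1", "Score": 8},
    ]
}

ranked = rank_locations(sample_response)
best = ranked[0]
print(f"Best location: {best['Region']} (score {best['Score']})")
```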

Spot placement score screen in AWS Management Console displaying Region-level scores.

Let’s now see if it is possible to get a higher score. Remember, the key best practice for Spot is to be flexible and utilize as many instance types as possible. To do that, press the Edit button on the Target capacity and instance type requirements tab. For the new request, keep the same target capacity at 2000, but expand the selection of instance types by adding similarly sized instance types from a variety of instance families and generations, i.e., r5.4xlarge, r5.12xlarge, m5zn.12xlarge, m5zn.6xlarge, m5n.8xlarge, m5dn.8xlarge, m5d.8xlarge, r5n.8xlarge, r5dn.8xlarge, r5d.8xlarge, c5.12xlarge, c5.4xlarge, c5d.12xlarge, c5n.9xlarge, c5d.9xlarge, m4.4xlarge, m4.16xlarge, m4.10xlarge, r4.8xlarge, c4.8xlarge.

After requesting the scores with updated requirements, you can see that even though the score in US East (N. Virginia) stays unchanged at 8, the scores for Europe (Ireland) and US West (Oregon) improved dramatically, both rising to 9. Now, you have a choice of three high-scored Regions to request your Spot Instances, each with a high likelihood to succeed.

To request Spot Instances based on the score, you can use EC2 Fleet or EC2 Auto Scaling. Note that the score assumes you use the capacity-optimized Spot allocation strategy when requesting the capacity. If you use another allocation strategy, such as lowest-price, the result in the recommended Region or Availability Zone will not align with the score provided.

Spot placement score screen in AWS Management Console with selected target capacity at 2000 instances and selected r5.8xlarge, c5.9xlarge, m5.8xlarge, r5.4xlarge, r5.12xlarge, m5zn.12xlarge, m5zn.6xlarge, m5n.8xlarge, m5dn.8xlarge, m5d.8xlarge, r5n.8xlarge, r5dn.8xlarge, r5d.8xlarge, c5.12xlarge, c5.4xlarge, c5d.12xlarge, c5n.9xlarge, c5d.9xlarge, m4.4xlarge, m4.16xlarge, m4.10xlarge, r4.8xlarge, c4.8xlarge instance types.

You can also request the scores at the Availability Zone level. This is useful for running workloads that need to have all instances in the same Availability Zone, potentially to minimize inter-Availability Zone data transfer costs. Workloads such as Apache Spark, which involve transferring a high volume of data between instances, would be a good use case for this. To get scores per Availability Zone you can check the box Provide placement scores per Availability Zone.

When requesting instances based on Availability Zone recommendation, you need to make sure to configure EC2 Fleet or EC2 Auto Scaling request to only use that specific Availability Zone.
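
The following Python sketch shows one way the two points above (the capacity-optimized allocation strategy and pinning a request to a single Availability Zone) might be expressed as an EC2 Fleet request. The keys follow boto3's `create_fleet` parameters; the function name, the launch template name `my-spot-template`, and the zone name are placeholders assumed for illustration. The score output identifies zones by ID, so it may need mapping to the zone name used in your account.

```python
def build_fleet_request(template_name, instance_types, target_capacity,
                        availability_zone=None):
    """Build keyword arguments for an EC2 Fleet Spot request.

    Keys mirror boto3's ec2.create_fleet; the launch template name is a
    placeholder that must already exist in the chosen Region.
    """
    overrides = []
    for instance_type in instance_types:
        override = {"InstanceType": instance_type}
        if availability_zone:
            # Pin every override to the recommended Availability Zone.
            override["AvailabilityZone"] = availability_zone
        overrides.append(override)
    return {
        "Type": "request",
        # The score assumes the capacity-optimized allocation strategy.
        "SpotOptions": {"AllocationStrategy": "capacity-optimized"},
        "LaunchTemplateConfigs": [{
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": "$Latest",
            },
            "Overrides": overrides,
        }],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": target_capacity,
            "DefaultTargetCapacityType": "spot",
        },
    }


fleet = build_fleet_request(
    "my-spot-template",
    ["r5.8xlarge", "c5.9xlarge", "m5.8xlarge"],
    target_capacity=2000,
    availability_zone="us-east-1a",
)
print(fleet["SpotOptions"], len(fleet["LaunchTemplateConfigs"][0]["Overrides"]))
```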

With Spot placement score, you can test different instance type combinations at different points in time, and find an optimal Region or Availability Zone to run your workloads on Spot Instances.

Availability and pricing

You can use Spot placement score today in all public and AWS GovCloud Regions with the exception of those based in China, where we plan to release later. You can access Spot placement score using the AWS Command Line Interface (CLI), AWS SDKs, and Management Console. There is no additional charge for using Spot placement score; you pay only standard EC2 rates if you provision instances based on the recommendation.

To learn more about using Spot placement score, visit the Spot placement score documentation page. To learn more about best practices for using Spot Instances, see Spot documentation.

Amazon EC2 Auto Scaling will no longer add support for new EC2 features to Launch Configurations

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/amazon-ec2-auto-scaling-will-no-longer-add-support-for-new-ec2-features-to-launch-configurations/

This post is written by Scott Horsfield, Principal Solutions Architect, EC2 Scalability and Surabhi Agarwal, Sr. Product Manager, EC2.

In 2010, AWS released launch configurations as a way to define the parameters of instances launched by EC2 Auto Scaling groups. In 2017, AWS released launch templates, the successor of launch configurations, as a way to streamline and simplify the launch process for Auto Scaling, Spot Fleet, Amazon EC2 Spot Instances, and On-Demand Instances. Launch templates define the steps required to create an instance, by capturing instance parameters in a resource that can be used across multiple services. Launch configurations have continued to live alongside launch templates but haven’t benefitted from all of the features we’ve added to launch templates.

Today, AWS is recommending that customers using launch configurations migrate to launch templates. We will continue to support and maintain launch configurations, but we will not be adding any new features to them. We will focus on adding new EC2 features to launch templates only. You can continue using launch configurations, and AWS is committed to supporting applications you have already built using them, but in order for you to take advantage of our most recent and upcoming releases, a migration to launch templates is recommended. Additionally, we plan to no longer support new instance types with launch configurations by the end of 2022. Our goal is to have all customers moved over to launch templates by then.

Moving to launch templates is simple to accomplish and can be done easily today. In this blog, we provide more details on how you can transition from launch configurations to launch templates. If you are unable to transition to launch templates due to lack of tooling or specific functions, or have any concerns, please contact AWS Support.

Launch templates vs. launch configurations

Launch configurations have been a part of Amazon EC2 Auto Scaling Groups since 2010. Customers use launch configurations to define Auto Scaling group configurations that include AMI and instance type definition. In 2017, AWS released launch templates, which reduce the number of steps required to create an instance by capturing all launch parameters within one resource that can be used across multiple services. Since then, AWS has released many new features such as Mixed Instance Policies with Auto Scaling groups, Targeted Capacity Reservations, and unlimited mode for burstable performance instances that only work with launch templates.

Launch templates provide several key benefits to customers, when compared to launch configurations, that can improve the availability and optimization of the workloads you host in Auto Scaling groups and allow you to access the full set of EC2 features when launching instances in an Auto Scaling group.

Some of the key benefits of launch templates when used with Auto Scaling groups include:

How to determine where you are using launch configurations

Use the Launch Configuration Inventory Script to find all of the launch configurations in your account. You can use this script to generate an inventory of launch configurations across all regions in a single account or all accounts in your AWS Organization.

The script can be run with a variety of options for different levels of account access. You can learn more about these options in this GitHub post. In its simplest form it will use the default credentials profile to inventory launch configurations across all regions in a single account.

Screenshot of Launch Configuration Inventory script

Once the script has completed, you can view the generated inventory.csv file to get a sense of how many launch configurations may need to be converted to launch templates or deleted.
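
For readers curious what such an inventory involves, here is a small Python approximation. It is not the Launch Configuration Inventory Script itself: the column layout is an assumption, and the sample record merely stands in for the `LaunchConfigurations` entries that boto3's `describe_launch_configurations` call returns for each Region.

```python
import csv
import io

FIELDS = ["Region", "Name", "ImageId", "InstanceType"]


def inventory_rows(region, launch_configurations):
    """Flatten launch configuration records for one Region into CSV rows."""
    for config in launch_configurations:
        yield {
            "Region": region,
            "Name": config["LaunchConfigurationName"],
            "ImageId": config.get("ImageId", ""),
            "InstanceType": config.get("InstanceType", ""),
        }


def write_inventory(rows, stream):
    """Write the rows with a header line to any text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up record standing in for a describe_launch_configurations result.
sample = [{"LaunchConfigurationName": "web-lc", "ImageId": "ami-12345678",
           "InstanceType": "m5.large"}]
buffer = io.StringIO()
write_inventory(inventory_rows("us-east-1", sample), buffer)
print(buffer.getvalue())
```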

Screenshot of script

How to transition to launch templates today

If you’re ready to move to launch templates now, making the transition is simple and mostly automated through the AWS Management Console. For customers who do not use the AWS Management Console, most popular Infrastructure as Code (IaC) tools, such as CloudFormation and Terraform, already support launch templates, as do the AWS CLI and SDKs.

To perform this transition, you will need to ensure that your user has the required permissions.

Here are some examples to get you started.

AWS Management Console

  1. Open the EC2 Launch Configuration console. You must sign in if you are not already authenticated.
  2. From the Launch Configuration console, click on the Copy to launch template button and select Copy all.
    1. Alternatively, you can select individual launch configurations, and use the Copy selected option to selectively copy certain launch configurations.
    Copy to launch template screenshot
  3. Review the list of templates and click on the Copy button when you’re ready to proceed.
  4. Once the copy process has completed, you can close the wizard.
  5. Navigate to the EC2 Launch Template console to view your newly created launch templates.
  6. Your launch templates are now ready to replace launch configurations in your Auto Scaling group configuration. Navigate to the Auto Scaling group console, select your Auto Scaling group, and click on the Edit button.
  7. Next, scroll down to the Launch configuration section, and click Switch to launch template.
  8. Select your newly created Launch template, review and confirm your configuration, and when ready scroll down to the bottom of the page and click the Update button.
  9. Now that you’ve migrated your launch configurations to launch templates you can prevent users from creating new launch configurations by updating their IAM permissions to deny the autoscaling:CreateLaunchConfiguration action.
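
The final step above mentions denying the autoscaling:CreateLaunchConfiguration action. A minimal sketch of such a policy document, generated here with Python, could look like the following. The statement ID and the wildcard resource are assumptions; adjust them to your own IAM conventions before attaching the policy to users or roles.

```python
import json


def deny_launch_configuration_policy():
    """Return an IAM policy document that blocks new launch configurations."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyNewLaunchConfigurations",  # illustrative name
            "Effect": "Deny",
            "Action": "autoscaling:CreateLaunchConfiguration",
            "Resource": "*",
        }],
    }


policy_json = json.dumps(deny_launch_configuration_policy(), indent=2)
print(policy_json)
```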

Instances already launched by this Auto Scaling group continue to run and are not automatically replaced by making this change. Any instance launched after making this change uses the launch template for its configuration. As your Auto Scaling group scales up and down, the older instances are replaced. If you’d like to force an update, you can use Instance Refresh to ensure that all instances are running the same launch template and version.

CloudFormation and Terraform

If you use CloudFormation to create and manage your infrastructure, you should use the AWS::EC2::LaunchTemplate resource to create launch templates. After adding a launch template resource to your CloudFormation stack template file, update your Auto Scaling group resource definition by adding a LaunchTemplate property and removing the existing LaunchConfigurationName property. We have several examples available to help you get started.
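
As an illustration of that change, the Python sketch below builds a small CloudFormation template as a dictionary and prints it as JSON. The resource types and the LaunchTemplate property come from the text above; the logical IDs (`AppLaunchTemplate`, `AppGroup`), the sizes, and the AMI and subnet values are placeholders assumed for the example.

```python
import json


def launch_template_stack(image_id, instance_type, subnet_ids):
    """Build a minimal CloudFormation template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "AppLaunchTemplate": {
                "Type": "AWS::EC2::LaunchTemplate",
                "Properties": {
                    "LaunchTemplateData": {
                        "ImageId": image_id,
                        "InstanceType": instance_type,
                    },
                },
            },
            "AppGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "MinSize": "1",
                    "MaxSize": "3",
                    "VPCZoneIdentifier": list(subnet_ids),
                    # Replaces the former LaunchConfigurationName property.
                    "LaunchTemplate": {
                        "LaunchTemplateId": {"Ref": "AppLaunchTemplate"},
                        "Version": {"Fn::GetAtt": [
                            "AppLaunchTemplate", "LatestVersionNumber"]},
                    },
                },
            },
        },
    }


template = launch_template_stack(
    "ami-12345678", "m5.large", ["subnet-aaaa1111"])  # placeholder IDs
print(json.dumps(template, indent=2))
```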

Using launch templates with Terraform is a similar process. Update your template file to include an aws_launch_template resource and then update your aws_autoscaling_group resources to reference the launch template.

In addition to making these changes, you may also want to consider adding a MixedInstancesPolicy to your Auto Scaling group. A MixedInstancesPolicy allows you to configure your Auto Scaling group with multiple instance types and purchase options. This helps improve the availability and optimization of your applications. Some examples of these benefits include using Spot Instances and On-Demand Instances within the same Auto Scaling group, combining CPU architectures such as Intel, AMD, and ARM (Graviton2), and having multiple instance types configured in case of a temporary capacity issue.
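
A MixedInstancesPolicy can be sketched as follows, using the structure that boto3 accepts for the `MixedInstancesPolicy` argument of `create_auto_scaling_group` and `update_auto_scaling_group`. The function name, template name, and the On-Demand split are illustrative assumptions. The example sticks to x86 instance types that can share one AMI; adding Graviton2 types would also need an Arm-compatible AMI, which this sketch does not cover.

```python
def mixed_instances_policy(template_name, instance_types,
                           on_demand_base=1, on_demand_percent=50):
    """Build a MixedInstancesPolicy dict for an Auto Scaling group."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": "$Latest",
            },
            # Each override adds an instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Instances that are always On-Demand before any split applies.
            "OnDemandBaseCapacity": on_demand_base,
            # Share of On-Demand above the base; the rest is Spot.
            "OnDemandPercentageAboveBaseCapacity": on_demand_percent,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


policy = mixed_instances_policy(
    "my-app-template", ["m5.large", "m5a.large", "m4.large"])
print(policy["InstancesDistribution"])
```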

You can generate and configure example templates for CloudFormation and Terraform in the AWS Management Console.

AWS CLI

If you’re using the AWS CLI to create and manage your Auto Scaling groups, these examples will show you how to accomplish common tasks when using launch templates.

SDKs

AWS SDKs already include APIs for creating launch templates. If you’re using one of our SDKs to create and configure your Auto Scaling groups, you can find more information in the SDK documentation for your language of choice.

Next steps

We’re excited to help you take advantage of the latest EC2 features by making the transition to launch templates as seamless as possible. As we make this transition together, we’re here to help and will continue to communicate our plans and timelines for this transition. If you are unable to transition to launch templates due to lack of tooling or functionalities or have any concerns, please contact AWS Support. Also, stay tuned for more information on tools to help make this transition easier for you.

AWS Lambda Functions Powered by AWS Graviton2 Processor – Run Your Functions on Arm and Get Up to 34% Better Price Performance

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-lambda-functions-powered-by-aws-graviton2-processor-run-your-functions-on-arm-and-get-up-to-34-better-price-performance/

Many of our customers (such as Formula One, Honeycomb, Intuit, SmugMug, and Snap Inc.) use the Arm-based AWS Graviton2 processor for their workloads and enjoy better price performance. Starting today, you can get the same benefits for your AWS Lambda functions. You can now configure new and existing functions to run on x86 or Arm/Graviton2 processors.

With this choice, you can save money in two ways. First, your functions run more efficiently due to the Graviton2 architecture. Second, you pay less for the time that they run. In fact, Lambda functions powered by Graviton2 are designed to deliver up to 19 percent better performance at 20 percent lower cost.

With Lambda, you are charged based on the number of requests for your functions and the duration (the time it takes for your code to execute) with millisecond granularity. For functions using the Arm/Graviton2 architecture, duration charges are 20 percent lower than the current pricing for x86. The same 20 percent reduction also applies to duration charges for functions using Provisioned Concurrency.
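
To make the pricing arithmetic concrete, here is a small Python sketch of the duration charge. The unit price is a placeholder rather than a real rate, and the invocation count, duration, and memory size are made-up inputs. Request charges are left out because the text above only describes a reduction in duration charges.

```python
def duration_cost(invocations, avg_duration_ms, memory_gb, price_per_gb_second):
    """Duration charge: billed GB-seconds times the per-GB-second price."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * memory_gb
    return gb_seconds * price_per_gb_second


X86_PRICE = 1.0              # placeholder unit price, not a real rate
ARM_PRICE = X86_PRICE * 0.8  # duration charges are 20 percent lower on Arm

# Same workload on both architectures, assuming identical durations.
x86_cost = duration_cost(1_000_000, 120, 0.5, X86_PRICE)
arm_cost = duration_cost(1_000_000, 120, 0.5, ARM_PRICE)
savings = 1 - arm_cost / x86_cost
print(f"Duration cost savings at equal duration: {savings:.0%}")
```

If a function also runs faster on Arm, `avg_duration_ms` drops as well and the savings grow beyond the price reduction alone.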

In addition to the price reduction, functions using the Arm architecture benefit from the performance and security built into the Graviton2 processor. Workloads using multithreading and multiprocessing, or performing many I/O operations, can experience lower execution time and, as a consequence, even lower costs. This is particularly useful now that you can use Lambda functions with up to 10 GB of memory and 6 vCPUs. For example, you can get better performance for web and mobile backends, microservices, and data processing systems.

If your functions don’t use architecture-specific binaries, including in their dependencies, you can switch from one architecture to the other. This is often the case for many functions using interpreted languages such as Node.js and Python or functions compiled to Java bytecode.

All Lambda runtimes built on top of Amazon Linux 2, including the custom runtime, are supported on Arm, with the exception of Node.js 10 that has reached end of support. If you have binaries in your function packages, you need to rebuild the function code for the architecture you want to use. Functions packaged as container images need to be built for the architecture (x86 or Arm) they are going to use.
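
When rebuilding binaries or container images, it can help to confirm which architecture the code is actually running on. The sketch below maps the machine string reported by Python's `platform.machine()` to the two architecture names used by Lambda (x86_64 and arm64). The mapping table and function name are assumptions for illustration and may not cover every platform.

```python
import platform

# Common machine strings mapped to the architecture names Lambda uses.
LAMBDA_ARCHITECTURES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def lambda_architecture(machine=None):
    """Translate a machine string into a Lambda architecture name."""
    machine = (machine or platform.machine()).lower()
    return LAMBDA_ARCHITECTURES.get(machine, "unknown")


print(lambda_architecture())
```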

To measure the difference between architectures, you can create two versions of a function, one for x86 and one for Arm. You can then send traffic to the function via an alias using weights to distribute traffic between the two versions. In Amazon CloudWatch, performance metrics are collected by function versions, and you can look at key indicators (such as duration) using statistics. You can then compare, for example, average and p99 duration between the two architectures.

You can also use function versions and weighted aliases to control the rollout in production. For example, you can deploy the new version to a small amount of invocations (such as 1 percent) and then increase up to 100 percent for a complete deployment. During rollout, you can lower the weight or set it to zero if your metrics show something suspicious (such as an increase in errors).
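
A gradual rollout like the one described above could be driven by alias updates along these lines. The dictionaries follow the `FunctionVersion` and `RoutingConfig` arguments of boto3's `update_alias`; the helper names and version numbers are hypothetical, and the sketch only builds the arguments without calling AWS.

```python
def alias_routing(stable_version, new_version, new_weight):
    """Build alias settings that send new_weight of traffic to new_version."""
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return {
        "FunctionVersion": stable_version,
        "RoutingConfig": {
            "AdditionalVersionWeights": {new_version: new_weight},
        },
    }


def complete_rollout(new_version):
    """Point the alias fully at the new version and clear the weights."""
    return {
        "FunctionVersion": new_version,
        "RoutingConfig": {"AdditionalVersionWeights": {}},
    }


# Version "1" stands for the x86 build and "2" for the Arm build.
steps = [alias_routing("1", "2", weight) for weight in (0.01, 0.10, 0.50)]
steps.append(complete_rollout("2"))
for step in steps:
    print(step)
```

Rolling back is the same call with the weight lowered or set to zero.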

Let’s see how this new capability works in practice with a few examples.

Changing Architecture for Functions with No Binary Dependencies
When there are no binary dependencies, changing the architecture of a Lambda function is like flipping a switch. For example, some time ago, I built a quiz app with a Lambda function. With this app, you can ask and answer questions using a web API. I use an Amazon API Gateway HTTP API to trigger the function. Here’s the Node.js code including a few sample questions at the beginning:

const questions = [
  {
    question:
      "Are there more synapses (nerve connections) in your brain or stars in our galaxy?",
    answers: [
      "More stars in our galaxy.",
      "More synapses (nerve connections) in your brain.",
      "They are about the same.",
    ],
    correctAnswer: 1,
  },
  {
    question:
      "Did Cleopatra live closer in time to the launch of the iPhone or to the building of the Giza pyramids?",
    answers: [
      "To the launch of the iPhone.",
      "To the building of the Giza pyramids.",
      "Cleopatra lived right in between those events.",
    ],
    correctAnswer: 0,
  },
  {
    question:
      "Did mammoths still roam the earth while the pyramids were being built?",
    answers: [
      "No, they were all extinct long before.",
      "Mammoth extinction is estimated right about that time.",
      "Yes, some still survived at the time.",
    ],
    correctAnswer: 2,
  },
];

exports.handler = async (event) => {
  console.log(event);

  const method = event.requestContext.http.method;
  const path = event.requestContext.http.path;
  const splitPath = path.replace(/^\/+|\/+$/g, "").split("/");

  console.log(method, path, splitPath);

  var response = {
    statusCode: 200,
    body: "",
  };

  if (splitPath[0] == "questions") {
    if (splitPath.length == 1) {
      console.log(Object.keys(questions));
      response.body = JSON.stringify(Object.keys(questions));
    } else {
      const questionId = splitPath[1];
      const question = questions[questionId];
      if (question === undefined) {
        response = {
          statusCode: 404,
          body: JSON.stringify({ message: "Question not found" }),
        };
      } else {
        if (splitPath.length == 2) {
          const publicQuestion = {
            question: question.question,
            answers: question.answers.slice(),
          };
          response.body = JSON.stringify(publicQuestion);
        } else {
          const answerId = splitPath[2];
          if (answerId == question.correctAnswer) {
            response.body = JSON.stringify({ correct: true });
          } else {
            response.body = JSON.stringify({ correct: false });
          }
        }
      }
    }
  }

  return response;
};

To start my quiz, I ask for the list of question IDs. To do so, I use curl with an HTTP GET on the /questions endpoint:

$ curl https://<api-id>.execute-api.us-east-1.amazonaws.com/questions
[
  "0",
  "1",
  "2"
]

Then, I ask for more information about a question by adding the ID to the endpoint:

$ curl https://<api-id>.execute-api.us-east-1.amazonaws.com/questions/1
{
  "question": "Did Cleopatra live closer in time to the launch of the iPhone or to the building of the Giza pyramids?",
  "answers": [
    "To the launch of the iPhone.",
    "To the building of the Giza pyramids.",
    "Cleopatra lived right in between those events."
  ]
}

I plan to use this function in production. I expect many invocations and look for options to optimize my costs. In the Lambda console, I see that this function is using the x86_64 architecture.

Console screenshot.

Because this function is not using any binaries, I switch architecture to arm64 and benefit from the lower pricing.

Console screenshot.

The change in architecture doesn’t change the way the function is invoked or communicates its response back. This means that the integration with the API Gateway, as well as integrations with other applications or tools, are not affected by this change and continue to work as before.

I continue my quiz with no hint that the architecture used to run the code has changed in the backend. I answer the previous question by adding the number of the answer (starting from zero) to the question endpoint:

$ curl https://<api-id>.execute-api.us-east-1.amazonaws.com/questions/1/0
{
  "correct": true
}

That’s correct! Cleopatra lived closer in time to the launch of the iPhone than the building of the Giza pyramids. While I am digesting this piece of information, I realize that I completed the migration of the function to Arm and optimized my costs.

Changing Architecture for Functions Packaged Using Container Images
When we introduced the capability to package and deploy Lambda functions using container images, I did a demo with a Node.js function generating a PDF file with the PDFKit module. Let’s see how to migrate this function to Arm.

Each time it is invoked, the function creates a new PDF mail containing random data generated by the faker.js module. The output of the function is using the syntax of the Amazon API Gateway to return the PDF file using Base64 encoding. For convenience, I replicate the code (app.js) of the function here:

const PDFDocument = require('pdfkit');
const faker = require('faker');
const getStream = require('get-stream');

exports.lambdaHandler = async (event) => {

    const doc = new PDFDocument();

    const randomName = faker.name.findName();

    doc.text(randomName, { align: 'right' });
    doc.text(faker.address.streetAddress(), { align: 'right' });
    doc.text(faker.address.secondaryAddress(), { align: 'right' });
    doc.text(faker.address.zipCode() + ' ' + faker.address.city(), { align: 'right' });
    doc.moveDown();
    doc.text('Dear ' + randomName + ',');
    doc.moveDown();
    for(let i = 0; i < 3; i++) {
        doc.text(faker.lorem.paragraph());
        doc.moveDown();
    }
    doc.text(faker.name.findName(), { align: 'right' });
    doc.end();

    // Declare with const so the buffers are not created as implicit globals.
    const pdfBuffer = await getStream.buffer(doc);
    const pdfBase64 = pdfBuffer.toString('base64');

    const response = {
        statusCode: 200,
        headers: {
            'Content-Length': Buffer.byteLength(pdfBase64),
            'Content-Type': 'application/pdf',
            'Content-disposition': 'attachment;filename=test.pdf'
        },
        isBase64Encoded: true,
        body: pdfBase64
    };
    return response;
};

To run this code, I need the pdfkit, faker, and get-stream npm modules. These packages and their versions are described in the package.json and package-lock.json files.

I update the FROM line in the Dockerfile to use an AWS base image for Lambda for the Arm architecture. Given the chance, I also update the image to use Node.js 14 (I was using Node.js 12 at the time). This is the only change I need to switch architecture.

FROM public.ecr.aws/lambda/nodejs:14-arm64
COPY app.js package*.json ./
RUN npm install
CMD [ "app.lambdaHandler" ]

For the next steps, I follow the post I mentioned previously. This time I use random-letter-arm for the name of the container image and for the name of the Lambda function. First, I build the image:

$ docker build -t random-letter-arm .

Then, I inspect the image to check that it is using the right architecture:

$ docker inspect random-letter-arm | grep Architecture

"Architecture": "arm64",

To be sure the function works with the new architecture, I run the container locally.

$ docker run -p 9000:8080 random-letter-arm:latest

Because the container image includes the Lambda Runtime Interface Emulator, I can test the function locally:

$ curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

It works! The response is a JSON document containing a base64-encoded response for the API Gateway:

{
    "statusCode": 200,
    "headers": {
        "Content-Length": 2580,
        "Content-Type": "application/pdf",
        "Content-disposition": "attachment;filename=test.pdf"
    },
    "isBase64Encoded": true,
    "body": "..."
}
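
To check the payload rather than just the status code, the body can be decoded back into a PDF. The Python sketch below shows the idea on a stand-in response; the function name `extract_pdf` and the tiny fake body are assumptions, since the real body is elided above.

```python
import base64
import json


def extract_pdf(raw_response):
    """Decode the Base64 body of an API Gateway style response into bytes."""
    payload = json.loads(raw_response)
    if not payload.get("isBase64Encoded"):
        raise ValueError("expected a Base64-encoded body")
    return base64.b64decode(payload["body"])


# Stand-in response; a real body would hold the full Base64-encoded PDF.
fake_body = base64.b64encode(b"%PDF-1.3 demo").decode("ascii")
raw = json.dumps(
    {"statusCode": 200, "isBase64Encoded": True, "body": fake_body})

pdf_bytes = extract_pdf(raw)
print(pdf_bytes[:8])
# The bytes could then be written to disk, for example:
#   open("test.pdf", "wb").write(pdf_bytes)
```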

Confident that my Lambda function works with the arm64 architecture, I create a new Amazon Elastic Container Registry repository using the AWS Command Line Interface (CLI):

$ aws ecr create-repository --repository-name random-letter-arm --image-scanning-configuration scanOnPush=true

I tag the image and push it to the repo:

$ docker tag random-letter-arm:latest 123412341234.dkr.ecr.us-east-1.amazonaws.com/random-letter-arm:latest
$ aws ecr get-login-password | docker login --username AWS --password-stdin 123412341234.dkr.ecr.us-east-1.amazonaws.com
$ docker push 123412341234.dkr.ecr.us-east-1.amazonaws.com/random-letter-arm:latest

In the Lambda console, I create the random-letter-arm function and select the option to create the function from a container image.

Console screenshot.

I enter the function name, browse my ECR repositories to select the random-letter-arm container image, and choose the arm64 architecture.

Console screenshot.

I complete the creation of the function. Then, I add the API Gateway as a trigger. For simplicity, I leave the authentication of the API open.

Console screenshot.

Now, I click on the API endpoint a few times and download some PDF mails generated with random data:

Screenshot of some PDF files.

The migration of this Lambda function to Arm is complete. The process will differ if you have specific dependencies that do not support the target architecture. The ability to test your container image locally helps you find and fix issues early in the process.

Comparing Different Architectures with Function Versions and Aliases
To have a function that makes some meaningful use of the CPU, I use the following Python code. It computes all prime numbers up to a limit passed as a parameter. This is not the best possible algorithm (that would be the sieve of Eratosthenes), but it’s a good compromise for an efficient use of memory. To have more visibility, I add the architecture used by the function to the response of the function.

import json
import math
import platform
import timeit

def primes_up_to(n):
    primes = []
    for i in range(2, n+1):
        is_prime = True
        sqrt_i = math.isqrt(i)
        for p in primes:
            if p > sqrt_i:
                break
            if i % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(i)
    return primes

def lambda_handler(event, context):
    start_time = timeit.default_timer()
    N = int(event['queryStringParameters']['max'])
    primes = primes_up_to(N)
    stop_time = timeit.default_timer()
    elapsed_time = stop_time - start_time

    response = {
        'machine': platform.machine(),
        'elapsed': elapsed_time,
        'message': 'There are {} prime numbers <= {}'.format(len(primes), N)
    }
    
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }
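
For comparison, the sieve of Eratosthenes mentioned above could look like the following. This is a minimal sketch and not part of the function deployed in this post; the name sieve_primes_up_to is made up for illustration. It keeps one flag per number up to n, which is why it needs more memory than the trial-division version.

```python
def sieve_primes_up_to(n):
    """Return all primes <= n using the sieve of Eratosthenes."""
    if n < 2:
        return []
    # One flag per integer 0..n; 1 means "still considered prime".
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0] = is_prime[1] = 0
    i = 2
    while i * i <= n:
        if is_prime[i]:
            # Clear every multiple of i, starting at i*i
            # (smaller multiples were cleared by smaller primes).
            multiples = len(range(i * i, n + 1, i))
            is_prime[i * i :: i] = bytearray(multiples)
        i += 1
    return [k for k in range(n + 1) if is_prime[k]]
```

Calling sieve_primes_up_to(1000000) should return the same list as primes_up_to(1000000), so either function can back the same handler.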

I create two function versions using different architectures.

Console screenshot.

I use a weighted alias with 50% weight on the x86 version and 50% weight on the Arm version to distribute invocations evenly. When invoking the function through this alias, the two versions running on the two different architectures are executed with the same probability.
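
I set up the alias in the console, but the same configuration can be expressed with the AWS SDK for Python. The sketch below is an illustration only: the alias name and the version numbers "1" and "2" are assumptions, and boto3 is imported lazily so the request builder can be inspected without AWS credentials.

```python
def build_weighted_alias_request(function_name, alias_name,
                                 primary_version, secondary_version,
                                 secondary_weight=0.5):
    """Build the arguments for Lambda's CreateAlias call.

    The primary version receives (1 - secondary_weight) of the traffic,
    the secondary version receives secondary_weight.
    """
    if not 0.0 <= secondary_weight <= 1.0:
        raise ValueError("secondary_weight must be between 0 and 1")
    return {
        "FunctionName": function_name,
        "Name": alias_name,
        "FunctionVersion": primary_version,
        "RoutingConfig": {
            "AdditionalVersionWeights": {secondary_version: secondary_weight}
        },
    }


def create_weighted_alias(**kwargs):
    """Send the request to Lambda (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above has no dependencies
    request = build_weighted_alias_request(**kwargs)
    return boto3.client("lambda").create_alias(**request)
```

For example, build_weighted_alias_request("prime-numbers", "compare", "1", "2") describes an alias that sends half of the invocations to version 2, assuming version 1 is the x86 build and version 2 the Arm build.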

Console screenshot.

I create an API Gateway trigger for the function alias and then generate some load using a few terminals on my laptop. Each invocation computes prime numbers up to one million. You can see in the output how two different architectures are used to run the function.

$ while true
  do
    curl https://<api-id>.execute-api.us-east-1.amazonaws.com/default/prime-numbers\?max\=1000000
  done

{"machine": "aarch64", "elapsed": 1.2595275060011772, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "aarch64", "elapsed": 1.2591725109996332, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "x86_64", "elapsed": 1.7200910530000328, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "x86_64", "elapsed": 1.6874686619994463, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "x86_64", "elapsed": 1.6865161940004327, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "aarch64", "elapsed": 1.2583248640003148, "message": "There are 78498 prime numbers <= 1000000"}
...
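
To put a number on the difference, the responses can be grouped by architecture. The following sketch averages the elapsed field for the six sample lines above; the helper name is hypothetical, and six samples are far too few for a real benchmark.

```python
import json
from collections import defaultdict
from statistics import mean

SAMPLE_OUTPUT = """\
{"machine": "aarch64", "elapsed": 1.2595275060011772, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "aarch64", "elapsed": 1.2591725109996332, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "x86_64", "elapsed": 1.7200910530000328, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "x86_64", "elapsed": 1.6874686619994463, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "x86_64", "elapsed": 1.6865161940004327, "message": "There are 78498 prime numbers <= 1000000"}
{"machine": "aarch64", "elapsed": 1.2583248640003148, "message": "There are 78498 prime numbers <= 1000000"}
"""


def mean_elapsed_by_machine(lines):
    """Average the 'elapsed' value of each JSON response per architecture."""
    samples = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and anything that is not a response
        record = json.loads(line)
        samples[record["machine"]].append(record["elapsed"])
    return {machine: mean(values) for machine, values in samples.items()}


if __name__ == "__main__":
    for machine, seconds in mean_elapsed_by_machine(SAMPLE_OUTPUT.splitlines()).items():
        print(f"{machine}: {seconds:.3f} s")
```

For these six lines, the Arm invocations average about 1.26 seconds and the x86 invocations about 1.70 seconds, roughly 26 percent less time on Arm.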

During these executions, Lambda sends metrics to CloudWatch and the function version (ExecutedVersion) is stored as one of the dimensions.

To better understand what is happening, I create a CloudWatch dashboard to monitor the p99 duration for the two architectures. In this way, I can compare the performance of the two environments for this function and make an informed decision on which architecture to use in production.

Console screenshot.

For this particular workload, functions are running much faster on the Graviton2 processor, providing a better user experience and much lower costs.

Comparing Different Architectures with Lambda Power Tuning
The AWS Lambda Power Tuning open-source project, created by my friend Alex Casalboni, runs your functions using different settings and suggests a configuration to minimize costs and/or maximize performance. The project has recently been updated to let you compare two results on the same chart. This comes in handy to compare two versions of the same function, one using x86 and the other Arm.

For example, this chart compares x86 and Arm/Graviton2 results for the function computing prime numbers I used earlier in the post:

Chart.

The function is using a single thread. In fact, the lowest duration for both architectures is reported when memory is configured with 1.8 GB. Above that, Lambda functions have access to more than 1 vCPU, but in this case, the function can’t use the additional power. For the same reason, costs are stable with memory up to 1.8 GB. With more memory, costs increase because there are no additional performance benefits for this workload.
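
The shape of the cost curve follows from how Lambda bills duration: time multiplied by configured memory. The sketch below is an idealized model, not the Power Tuning tool itself. It assumes CPU share grows linearly with memory up to 1.8 GB, and the duration and price values used with it are placeholders rather than measured or published numbers.

```python
import math

FULL_VCPU_MEMORY_GB = 1.8  # memory setting at which the function gets one full vCPU


def single_thread_duration(memory_gb, best_duration_s):
    """Idealized duration of a single-threaded function at a memory setting."""
    # Below 1.8 GB the function gets a fraction of a vCPU and runs slower;
    # above it the extra vCPUs cannot be used by a single thread.
    slowdown = max(1.0, FULL_VCPU_MEMORY_GB / memory_gb)
    return best_duration_s * slowdown


def invocation_cost(memory_gb, best_duration_s, price_per_gb_second):
    """Duration cost of one invocation: seconds x GB x price per GB-second."""
    duration_s = single_thread_duration(memory_gb, best_duration_s)
    return duration_s * memory_gb * price_per_gb_second


if __name__ == "__main__":
    for memory_gb in (0.9, 1.8, 3.6):
        cost = invocation_cost(memory_gb, best_duration_s=1.26,
                               price_per_gb_second=0.00001)  # placeholder price
        print(f"{memory_gb} GB -> {cost:.8f}")
```

In this model the cost at 0.9 GB equals the cost at 1.8 GB, because halving memory doubles the duration, while the cost at 3.6 GB is twice as high because the duration no longer improves.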

I look at the chart and configure the function to use 1.8 GB of memory and the Arm architecture. The Graviton2 processor is clearly providing better performance and lower costs for this compute-intensive function.

Availability and Pricing
You can use Lambda functions powered by the Graviton2 processor today in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Frankfurt), Europe (Ireland), Europe (London), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo).

The following runtimes running on top of Amazon Linux 2 are supported on Arm:

  • Node.js 12 and 14
  • Python 3.8 and 3.9
  • Java 8 (java8.al2) and 11
  • .NET Core 3.1
  • Ruby 2.7
  • Custom Runtime (provided.al2)

You can manage Lambda Functions powered by Graviton2 processor using AWS Serverless Application Model (SAM) and AWS Cloud Development Kit (AWS CDK). Support is also available through many AWS Lambda Partners such as AntStack, Check Point, Cloudwiry, Contino, Coralogix, Datadog, Lumigo, Pulumi, Slalom, Sumo Logic, Thundra, and Xerris.

Lambda functions using the Arm/Graviton2 architecture provide up to 34 percent price performance improvement. The 20 percent reduction in duration costs also applies when using Provisioned Concurrency. You can further reduce your costs by up to 17 percent with Compute Savings Plans. Lambda functions powered by Graviton2 are included in the AWS Free Tier up to the existing limits. For more information, see the AWS Lambda pricing page.

You can find help to optimize your workloads for the AWS Graviton2 processor in the Getting started with AWS Graviton repository.

Start running your Lambda functions on Arm today.

Danilo

Enabling parallel file systems in the cloud with Amazon EC2 (Part I: BeeGFS)

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/enabling-parallel-file-systems-in-the-cloud-with-amazon-ec2-part-i-beegfs/

This post was authored by AWS Solutions Architects Ray Zaman, David Desroches, and Ameer Hakme.

In this blog series, you will discover how to build and manage your own Parallel Virtual File System (PVFS) on AWS. In this post you will learn how to deploy the popular open source parallel file system, BeeGFS, using AWS D3en and I3en EC2 instances. We will also provide a CloudFormation template to automate this BeeGFS deployment.

A PVFS is a type of distributed file system that distributes file data across multiple servers and provides concurrent data access to multiple execution tasks of an application. PVFS focuses on high-performance access to large datasets. It consists of a server process and a client library, which allows the file system to be mounted and used with standard utilities. PVFS on the Linux OS originated in the 1990s, and today several projects are available, including Lustre, GlusterFS, and BeeGFS. Workloads such as shared storage for video transcoding and export, batch processing jobs, high frequency online transaction processing (OLTP) systems, and scratch storage for high performance computing (HPC) benefit from the high throughput and performance provided by PVFS.

Implementation of a PVFS can be complex and expensive. There are many variables you will want to take into account when designing a PVFS cluster, including the number of nodes, node size (CPU, memory), cluster size, storage characteristics (size, performance), and network bandwidth. Due to the difficulty in estimating the correct configuration, systems procured for on-premises data centers are typically oversized, resulting in additional costs and underutilized resources. In addition, the hardware procurement process is lengthy, and the installation and maintenance of the hardware add further overhead.

AWS makes it easy to run and fully manage your parallel file systems by allowing you to choose from a variety of Amazon Elastic Compute Cloud (EC2) instances. EC2 instances are available on-demand and allow you to scale your workload as needed. AWS storage-optimized EC2 instances offer up to 60 TB of NVMe SSD storage per instance and up to 336 TB of local HDD storage per instance. With storage-optimized instances, you can easily deploy PVFS to support workloads requiring high-performance access to large datasets. You can test and iterate on different instances to find the optimal size for your workloads.

D3en instances leverage 2nd-generation Intel Xeon Scalable Processors (Cascade Lake) and provide a sustained all core frequency up to 3.1 GHz. These instances provide up to 336 TB of local HDD storage (which is the highest local storage capacity in EC2), up to 6.2 GiBps of disk throughput, and up to 75 Gbps of network bandwidth.

I3en instances are powered by 1st or 2nd generation Intel® Xeon® Scalable (Skylake or Cascade Lake) processors with 3.1 GHz sustained all-core turbo performance. These instances provide up to 60 TB of NVMe storage, up to 16 GB/s of sequential disk throughput, and up to 100 Gbps of network bandwidth.

BeeGFS, originally released by ThinkParQ in 2014, is an open source, software defined PVFS that runs on Linux. You can scale the size and performance of the BeeGFS file-system by configuring the number of servers and disks in the clusters up to thousands of nodes.

BeeGFS architecture

D3en instances offer HDD storage while I3en instances offer NVMe SSD storage. This diversity allows you to create tiers of storage based on performance requirements. In the example presented in this post you will use four D3en.8xlarge (32 vCPU, 128 GB, 16x14TB HDD, 50 Gbit) and two I3en.12xlarge (48 vCPU, 384 GB, 4 x 7.5-TB NVMe) instances to create two storage tiers. You may choose different sizes and quantities to meet your needs. The I3en instances, with SSD, will be configured as tier 1 and the D3en instances, with HDD, will be configured as tier 2. One disk from each instance will be formatted as ext4 and used for metadata while the remaining disks will be formatted as XFS and used for storage. You may choose to separate metadata and storage on different hosts for workloads where these must scale independently. The array will be configured RAID 0, since it will provide maximum performance. Software replication or other RAID types can be employed for higher durability.

Figure 1: BeeGFS architecture

You will deploy all instances within a single VPC in the same Availability Zone and subnet to minimize latency. Security groups must be configured to allow the following ports:

  • Management service (beegfs-mgmtd): 8008
  • Metadata service (beegfs-meta): 8005
  • Storage service (beegfs-storage): 8003
  • Client service (beegfs-client): 8004
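
The port list above can be turned into security group rules programmatically. The sketch below builds a list in the shape that the EC2 AuthorizeSecurityGroupIngress API expects for IpPermissions. Opening both TCP and UDP on each port and the 10.0.0.0/24 subnet range are assumptions made for illustration; adjust them to your own VPC.

```python
# Service ports as listed above.
BEEGFS_PORTS = {
    "beegfs-mgmtd": 8008,
    "beegfs-meta": 8005,
    "beegfs-storage": 8003,
    "beegfs-client": 8004,
}


def build_ingress_permissions(cidr, ports=BEEGFS_PORTS, protocols=("tcp", "udp")):
    """Return one ingress rule per (port, protocol) pair for the given CIDR."""
    permissions = []
    for service, port in sorted(ports.items(), key=lambda item: item[1]):
        for protocol in protocols:
            permissions.append({
                "IpProtocol": protocol,
                "FromPort": port,
                "ToPort": port,
                "IpRanges": [{"CidrIp": cidr, "Description": service}],
            })
    return permissions


if __name__ == "__main__":
    # Hypothetical subnet range; replace with the CIDR of your cluster subnet.
    for rule in build_ingress_permissions("10.0.0.0/24"):
        print(rule["IpProtocol"], rule["FromPort"], rule["IpRanges"][0]["Description"])
```

The resulting list could be passed as the IpPermissions argument of an authorize_security_group_ingress call together with the ID of your security group.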

You will use the Debian Quick Start Amazon Machine Image (AMI) as it supports BeeGFS. You can enable Amazon CloudWatch to capture metrics.

How to deploy the BeeGFS architecture

Follow the steps below to create the PVFS described above. For automated deployment, use the CloudFormation template located at AWS Samples.

  1. Use the AWS Management Console or CLI to deploy one D3en.8xlarge instance into a VPC as described above.
  2. Log in to the instance and update the system:
    • sudo apt update
    • sudo apt upgrade
  3. Install the XFS utilities and load the kernel module:
    • sudo apt-get -y install xfsprogs
    • sudo modprobe -v xfs

Format the first disk as ext4, because it is used for metadata, and format the rest as XFS. The disks appear as “nvme???” devices, even though they are the HDD drives on the D3en instances.

4. View a listing of available disks:

    • sudo lsblk

5. Format hard disks:

    • sudo mkfs -t ext4 /dev/nvme0n1
    • sudo mkfs -t xfs /dev/nvme1n1
    • Repeat this command for disks nvme2n1 through nvme15n1

6. Create file system mount points:

    • sudo mkdir /disk00
    • sudo mkdir /disk01
    • Repeat this command for disks disk02 through disk15

7. Mount the filesystems:

    • sudo mount /dev/nvme0n1 /disk00
    • sudo mount /dev/nvme1n1 /disk01
    • Repeat this command for disks disk02 through disk15

Repeat steps 1 through 7 on the remaining nodes. Remember to account for fewer disks for i3en.12xlarge instances or if you decide to use different instance sizes.
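
Because steps 5 through 7 repeat the same three commands per disk, it can help to generate them. The sketch below prints the commands for a given number of disks; the helper name is hypothetical, and it assumes the devices are named nvme0n1, nvme1n1, and so on, which you should confirm with lsblk before running anything.

```python
def disk_prep_commands(disk_count):
    """Return the format, mkdir, and mount commands for each local disk.

    Disk 0 is formatted ext4 for metadata; all other disks are formatted XFS.
    """
    commands = []
    for index in range(disk_count):
        device = f"/dev/nvme{index}n1"
        mount_point = f"/disk{index:02d}"
        fs_type = "ext4" if index == 0 else "xfs"
        commands.append(f"sudo mkfs -t {fs_type} {device}")
        commands.append(f"sudo mkdir {mount_point}")
        commands.append(f"sudo mount {device} {mount_point}")
    return commands


if __name__ == "__main__":
    # 16 disks for a D3en.8xlarge node; use 4 for an I3en.12xlarge node.
    print("\n".join(disk_prep_commands(16)))
```

The commands come out grouped per disk rather than per step, which has the same effect as running steps 5, 6, and 7 in sequence.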

8. Add the BeeGFS Repo to each node:

    • sudo apt-get -y install gnupg
    • wget https://www.beegfs.io/release/beegfs_7.2.3/dists/beegfs-deb10.list
    • sudo cp beegfs-deb10.list /etc/apt/sources.list.d/
    • sudo wget -q https://www.beegfs.io/release/latest-stable/gpg/DEB-GPG-KEY-beegfs -O- | sudo apt-key add -
    • sudo apt update

9. Install BeeGFS management (node 1 only):

    • sudo apt-get -y install beegfs-mgmtd
    • sudo mkdir /beegfs-mgmt
    • sudo /opt/beegfs/sbin/beegfs-setup-mgmtd -p /beegfs-mgmt/beegfs/beegfs_mgmtd

10. Install BeeGFS metadata and storage (all nodes):

    • sudo apt-get -y install beegfs-meta beegfs-storage beegfs-client beegfs-helperd beegfs-utils
    • # -s is unique ID based on node - change this!, -m is hostname of management server
    • sudo /opt/beegfs/sbin/beegfs-setup-meta -p /disk00/beegfs/beegfs_meta -s 1 -m ip-XXX-XXX-XXX-XXX
    • # Change -s to nodeID and -i to (nodeid)0(disk), -m is hostname of management server
    • sudo /opt/beegfs/sbin/beegfs-setup-storage -p /disk01/beegfs_storage -s 1 -i 101 -m ip-XXX-XXX-XXX-XXX
    • sudo /opt/beegfs/sbin/beegfs-setup-storage -p /disk02/beegfs_storage -s 1 -i 102 -m ip-XXX-XXX-XXX-XXX
    • Repeat this last command for the remaining disks disk03 through disk15

11. Start the services:

    • #Only on node1
    • sudo systemctl start beegfs-mgmtd
    • #All servers
    • sudo systemctl start beegfs-meta
    • sudo systemctl start beegfs-storage

At this point, your BeeGFS cluster is running and ready for use by a client system. The client system requires BeeGFS client software in order to mount the cluster.
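
The storage setup commands in step 10 follow a fixed pattern: the target ID is the node ID, a zero, and the disk number (101, 102, and so on for node 1). A small generator, sketched below with hypothetical helper names, produces the full list for one node. The management hostname is left as the same placeholder used in the steps above.

```python
def storage_target_id(node_id, disk_number):
    """Target ID scheme used in this post: (node id)0(disk), e.g. 101 or 603."""
    return node_id * 100 + disk_number


def storage_setup_commands(node_id, storage_disk_count, mgmt_host):
    """Return one beegfs-setup-storage command per storage disk of a node."""
    commands = []
    for disk in range(1, storage_disk_count + 1):
        target = storage_target_id(node_id, disk)
        commands.append(
            "sudo /opt/beegfs/sbin/beegfs-setup-storage "
            f"-p /disk{disk:02d}/beegfs_storage "
            f"-s {node_id} -i {target} -m {mgmt_host}"
        )
    return commands


if __name__ == "__main__":
    # Node 1 is a D3en.8xlarge with 15 storage disks (disk01 through disk15).
    print("\n".join(storage_setup_commands(1, 15, "ip-XXX-XXX-XXX-XXX")))
```

For an I3en.12xlarge node, the same call with storage_disk_count=3 covers disk01 through disk03.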

12. Deploy an m5n.2xlarge instance into the same subnet as the PVFS cluster.

13. Log in to the instance, install, and configure the client:

    • sudo apt update
    • sudo apt upgrade
    • sudo apt-get -y install gnupg
    • #Need linux sources for client compilation
    • sudo apt-get -y install linux-source
    • sudo apt-get -y install linux-headers-4.19.0-14-all
    • wget https://www.beegfs.io/release/beegfs_7.2.3/dists/beegfs-deb10.list
    • sudo cp beegfs-deb10.list /etc/apt/sources.list.d/
    • sudo wget -q https://www.beegfs.io/release/latest-stable/gpg/DEB-GPG-KEY-beegfs -O- | sudo apt-key add -
    • sudo apt update
    • sudo apt-get -y install beegfs-client beegfs-helperd beegfs-utils
    • sudo /opt/beegfs/sbin/beegfs-setup-client -m ip-XXX-XXX-XXX-XX # use the ip address of the management node
    • sudo systemctl start beegfs-helperd
    • sudo systemctl start beegfs-client

14. Create the storage pools:

    • sudo beegfs-ctl --addstoragepool --desc="tier1" --targets=501,502,503,601,602,603
    • sudo beegfs-ctl --addstoragepool --desc="tier2" --targets=101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,201,202,203,204,205,206,207,208,209,210,
      211,212,213,214,215,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,401,402,403,404,405,406,407,
      408,409,410,411,412,413,414,415
    • sudo beegfs-ctl --liststoragepools
      Pool ID   Pool Description   Targets                        Buddy Groups
      =======   ================   ============================   ============
            1   Default
            2   tier1              501,502,503,601,602,603
            3   tier2              101,102,103,104,105,106,107,
                                   108,109,110,111,112,113,114,
                                   115,201,202,203,204,205,206,
                                   207,208,209,210,211,212,213,
                                   214,215,301,302,303,304,305,
                                   306,307,308,309,310,311,312,
                                   313,314,315,401,402,403,404,
                                   405,406,407,408,409,410,411,
                                   412,413,414,415

15. Mount the pools to the file system:

    • sudo beegfs-ctl --setpattern --storagepoolid=2 /mnt/beegfs/tier1
    • sudo beegfs-ctl --setpattern --storagepoolid=3 /mnt/beegfs/tier2

The BeeGFS PVFS is now ready to be used by the client system.
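
The long target lists passed to --addstoragepool in step 14 can be derived from the cluster layout instead of being typed by hand. This sketch assumes the layout used in this post: nodes 1 through 4 are D3en instances with 15 storage disks each, and nodes 5 and 6 are I3en instances with 3 storage disks each. The function name is made up for illustration.

```python
def pool_targets(node_ids, disks_per_node):
    """Return a comma-separated list of target IDs in (node id)0(disk) form."""
    return ",".join(
        str(node * 100 + disk)
        for node in node_ids
        for disk in range(1, disks_per_node + 1)
    )


if __name__ == "__main__":
    tier1 = pool_targets([5, 6], 3)         # I3en nodes, NVMe SSD
    tier2 = pool_targets(range(1, 5), 15)   # D3en nodes, HDD
    print(f'sudo beegfs-ctl --addstoragepool --desc="tier1" --targets={tier1}')
    print(f'sudo beegfs-ctl --addstoragepool --desc="tier2" --targets={tier2}')
```

If you change the number of nodes or disks, regenerate both lists so every target belongs to exactly one pool.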

How to test your new BeeGFS PVFS

BeeGFS provides StorageBench to evaluate the performance of BeeGFS on the storage targets. This benchmark measures the streaming throughput of the underlying file system and devices independent of the network performance. To simulate client I/O, this benchmark generates read/write locally on the servers without any client communication.

It is possible to benchmark specific targets or all targets together using the “servers” parameter. A “read” or “write” parameter sets the type of test to perform. The “threads” parameter is set to the number of storage devices.

Try the following commands to test performance:

Write test (1x d3en):

sudo beegfs-ctl --storagebench --servers=1 --write --blocksize=512K --size=20G --threads=15

Write test (4x d3en):

sudo beegfs-ctl --storagebench --alltargets --write --blocksize=512K --size=20G --threads=15

Read test (4x d3en):

sudo beegfs-ctl --storagebench --servers=1,2,3,4 --read --blocksize=512K --size=20G --threads=15

Write test (1x i3en):

sudo beegfs-ctl --storagebench --servers=5 --write --blocksize=512K --size=20G --threads=3

Read test (2x i3en):

sudo beegfs-ctl --storagebench --servers=5,6 --read --blocksize=512K --size=20G --threads=3

StorageBench is a great way to test what the potential performance of a given environment looks like by reducing variables like network throughput and latency, but you may want to test in a more real-world fashion. For this, tools like ‘fio’ can generate mixed read/write workloads against files on the client BeeGFS mountpoint.

First, we need to define which directory goes to which Storage Pool (tier) by setting a pattern:

sudo beegfs-ctl --setpattern --storagepoolid=2 /mnt/beegfs/tier1
sudo beegfs-ctl --setpattern --storagepoolid=3 /mnt/beegfs/tier2

You can see how a file gets striped across the various disks in a pool by adding a file and running the command:

sudo beegfs-ctl --getentryinfo /mnt/beegfs/tier1/myfile.bin

Install fio:

sudo apt-get install -y fio

Now you can run a fio test against one of the tiers. This example command runs eight threads with a 75/25 read/write workload against a 10-GB file:

sudo fio --numjobs=8 --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/mnt/beegfs/tier1/test --bs=512k --iodepth=64 --size=10G --readwrite=randrw --rwmixread=75

Cleaning up

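To avoid ongoing charges for resources you created, you should terminate the EC2 instances launched in this walkthrough (the BeeGFS nodes and the client instance). If you deployed with the CloudFormation template, delete the stack instead.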

Conclusion

In this blog post we demonstrated how to build and manage your own BeeGFS Parallel Virtual File System on AWS. In this example, you created two storage tiers using the I3en and D3en. The I3en was used as the first tier for SSD storage and the D3en was used as a second tier for HDD storage. By using two different tiers, you can optimize performance to meet your application requirements.

Amazon EC2 storage-optimized instances make it easy to deploy the BeeGFS Parallel Virtual File System. Using combinations of SSD and HDD storage available on the I3en and D3en instance types, you can achieve the capacity and performance needed to run the most demanding workloads. Read more about the D3en and I3en instances.

Use Amazon ECS Fargate Spot with CircleCI to deploy and manage applications in a cost-effective way

Post Syndicated from Pritam Pal original https://aws.amazon.com/blogs/devops/deploy-apps-cost-effective-way-with-ecs-fargate-spot-and-circleci/

This post is written by Pritam Pal, Sr EC2 Spot Specialist SA & Dan Kelly, Sr EC2 Spot GTM Specialist

Customers are using Amazon Web Services (AWS) to build CI/CD pipelines and follow DevOps best practices in order to deliver products rapidly and reliably. AWS services simplify infrastructure provisioning and management, application code deployment, software release processes automation, and application and infrastructure performance monitoring. Builders are taking advantage of low-cost, scalable compute with Amazon EC2 Spot Instances, as well as AWS Fargate Spot to build, deploy, and manage microservices or container-based workloads at a discounted price.

Amazon EC2 Spot Instances let you take advantage of unused Amazon Elastic Compute Cloud (Amazon EC2) capacity at steep discounts as compared to on-demand pricing. Fargate Spot is an AWS Fargate capability that can run interruption-tolerant Amazon Elastic Container Service (Amazon ECS) tasks at up to a 70% discount off the Fargate price. Since tasks can still be interrupted, only fault tolerant applications are suitable for Fargate Spot. However, for flexible workloads that can be interrupted, this feature enables significant cost savings over on-demand pricing.

CircleCI provides continuous integration and delivery for any platform, as well as your own infrastructure. CircleCI can automatically trigger low-cost, serverless tasks with AWS Fargate Spot in Amazon ECS. Moreover, CircleCI Orbs are reusable packages of CircleCI configuration that help automate repeated processes, accelerate project setup, and ease third-party tool integration. Currently, over 1,100 organizations are utilizing the CircleCI Amazon ECS Orb to power/run 250,000+ jobs per month.

Customers are utilizing Fargate Spot for a wide variety of workloads, such as Monte Carlo simulations and genomic processing. In this blog, I use Python code with the TensorFlow library, packaged as a container image, to train a simple linear model. It runs the training steps in a loop on a data batch and periodically writes checkpoints to S3. If there is a Fargate Spot interruption, it restores the checkpoint from S3 when a replacement Fargate task starts and continues training. We will deploy this on Amazon ECS with Fargate Spot for low-cost, serverless task deployment utilizing CircleCI.

Concepts

Before looking at the solution, let’s revisit some of the concepts we’ll be using.

Capacity Providers: Capacity providers let you manage computing capacity for Amazon ECS containers. This allows the application to define its requirements for how it utilizes the capacity. With capacity providers, you can define flexible rules for how containerized workloads run on different compute capacity types and manage the capacity scaling. Furthermore, capacity providers improve the availability, scalability, and cost of running tasks and services on Amazon ECS. In order to run tasks, the default capacity provider strategy will be utilized, or an alternative strategy can be specified if required.

AWS Fargate and AWS Fargate Spot capacity providers don’t need to be created. They are available to all accounts and only need to be associated with a cluster for utilization. When a new cluster is created via the Amazon ECS console, along with the Networking-only cluster template, the FARGATE and FARGATE_SPOT capacity providers are automatically associated with the new cluster.

CircleCI Orbs: Orbs are reusable CircleCI configuration packages that help automate repeated processes, accelerate project setup, and ease third-party tool integration. Orbs can be found in the developer hub on the CircleCI orb registry. Each orb listing has usage examples that can be referenced. Moreover, each orb includes a library of documented components that can be utilized within your config for more advanced purposes. Since the 2.0.0 release, the AWS ECS Orb supports the capacity provider strategy parameter for running tasks allowing you to efficiently run any ECS task against your new or existing clusters via Fargate Spot capacity providers.

Solution overview

Fargate Spot helps cost-optimize services that can handle interruptions, like containerized workloads, CI/CD, or web services behind a load balancer. When Fargate Spot needs to interrupt a running task, it sends a SIGTERM signal. It is best practice to build applications that respond to the signal and shut down gracefully.
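
As an illustration of responding to SIGTERM, the sketch below records the signal in a flag and lets a work loop finish its current step, save state, and exit. The class and function names are hypothetical and the step and checkpoint callbacks are placeholders; it is not the handler used by the sample application in this post.

```python
import signal


class GracefulShutdown:
    """Remember that SIGTERM arrived so the main loop can stop cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Register the handler; call this once from the main thread.
        signal.signal(signal.SIGTERM, self.handle)

    def handle(self, signum, frame):
        self.requested = True


def run_steps(total_steps, shutdown, do_step, save_checkpoint):
    """Run steps until done or until shutdown is requested.

    Returns the number of the last completed step.
    """
    for step in range(1, total_steps + 1):
        do_step(step)
        if shutdown.requested:
            # Persist progress before the task is stopped.
            save_checkpoint(step)
            return step
    return total_steps


if __name__ == "__main__":
    shutdown = GracefulShutdown()
    shutdown.install()
    last = run_steps(5, shutdown, do_step=lambda s: None,
                     save_checkpoint=lambda s: print(f"checkpoint at step {s}"))
    print(f"stopped after step {last}")
```

Keeping the handler to a single flag assignment avoids doing I/O inside the signal handler; the checkpoint is written from the normal control flow instead.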

This walkthrough will utilize a capacity provider strategy leveraging Fargate and Fargate Spot, which mitigates risk if multiple Fargate Spot tasks get terminated simultaneously. If you’re unfamiliar with Fargate Spot, capacity providers, or capacity provider strategies, read our previous blog about Fargate Spot best practices here.

Prerequisites

Our walkthrough will utilize the following services:

  • GitHub as a code repository
  • AWS Fargate/Fargate Spot for running your containers as ECS tasks
  • CircleCI for demonstrating a CI/CD pipeline. We will utilize CircleCI Cloud Free version, which allows 2,500 free credits/week and can run 1 job at a time.

We will run a Job with CircleCI ECS Orb in order to deploy 4 ECS Tasks on Fargate and Fargate Spot. You should have the following prerequisites:

  1. An AWS account
  2. A GitHub account

Walkthrough

Step 1: Create AWS Keys for CircleCI to utilize.

Head to the AWS IAM console, create a new user, e.g., circleci, and select only the Programmatic access checkbox. On the Set permissions page, select Attach existing policies directly. For the sake of simplicity, we added the managed policy AmazonECS_FullAccess to this user. However, for production workloads, apply a stricter least-privilege access model. Download the access key file, which will be utilized to connect to CircleCI in the next steps.

Step 2: Create an ECS Cluster, Task definition, and ECS Service

2.1 Open the Amazon ECS console

2.2 From the navigation bar, select the Region to use

2.3 In the navigation pane, choose Clusters

2.4 On the Clusters page, choose Create Cluster

2.5 Create a Networking only Cluster (Powered by AWS Fargate)

Amazon ECS Create Cluster

This option lets you launch a cluster in your existing VPC to utilize for Fargate tasks. The FARGATE and FARGATE_SPOT capacity providers are automatically associated with the cluster.

2.6 Click on Update Cluster to define a default capacity provider strategy for the cluster, then add FARGATE and FARGATE_SPOT capacity providers each with a weight of 1. This ensures Tasks are divided equally among Capacity providers. You can also define other ratios for splitting your tasks between Fargate and Fargate Spot, e.g., 1:2 or 3:1.

ECS Update Cluster Capacity Providers

2.7 Here we will create a Task Definition by using the Fargate launch type, give it a name, and specify the task Memory and CPU needed to run the task. Feel free to utilize any Fargate task definition. You can use your own code, add the code in a container, or host the container in Docker hub or Amazon ECR. Provide a name and image URI that we copied in the previous step and specify the port mappings. Click Add and then click Create.

We are also showing an example of Python code using the TensorFlow library that can run as a container image in order to train a simple linear model. It runs the training steps in a loop on a batch of data, and it periodically writes checkpoints to S3. Please find the complete code here. Utilize a Dockerfile to create a container from the code.

Sample Dockerfile to create a container image from the code mentioned above.

FROM ubuntu:18.04
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
EXPOSE 5000
CMD python tensorflow_checkpoint.py

Below is the code snippet we are using with TensorFlow to train and checkpoint a training job.


def train_and_checkpoint(net, manager):
  ckpt.restore(manager.latest_checkpoint).expect_partial()
  if manager.latest_checkpoint:
    print("Restored from {}".format(manager.latest_checkpoint))
  else:
    print("Initializing from scratch.")
  for _ in range(5000):
    example = next(iterator)
    loss = train_step(net, example, opt)
    ckpt.step.assign_add(1)
    if int(ckpt.step) % 10 == 0:
        save_path = manager.save()
        list_of_files = glob.glob('tf_ckpts/*.index')
        latest_file = max(list_of_files, key=os.path.getctime)
        upload_file(latest_file, 'pythontfckpt', object_name=None)
        list_of_files = glob.glob('tf_ckpts/*.data*')
        latest_file = max(list_of_files, key=os.path.getctime)
        upload_file(latest_file, 'pythontfckpt', object_name=None)
        upload_file('tf_ckpts/checkpoint', 'pythontfckpt', object_name=None)

2.8 Next, we will create an ECS Service, which will be used to fetch Cluster information while running the job from CircleCI. In the ECS console, navigate to your Cluster, open the Services tab, and click Create. Create an ECS service by choosing Cluster default strategy from the Capacity provider strategy dropdown. For the Task Definition field, choose webapp-fargate-task, which is the one we created earlier, enter a service name, set the number of tasks to zero at this point, and leave everything else as default. Click Next step, select an existing VPC and two or more Subnets, keep everything else default, and create the service.

Step 3: GitHub and CircleCI Configuration

Create a GitHub repository, e.g., circleci-fargate-spot, and then create a .circleci folder and a config file config.yml. If you’re unfamiliar with GitHub or adding a repository, check the user guide here.

For this project, the config.yml file contains the following lines of code that configure and run your deployments.

version: '2.1'
orbs:
  aws-ecs: circleci/[email protected]
  aws-cli: circleci/[email protected]
  orb-tools: circleci/[email protected]
  shellcheck: circleci/[email protected]
  jq: circleci/[email protected]

jobs:  

  test-fargatespot:
      docker:
        - image: cimg/base:stable
      steps:
        - aws-cli/setup
        - jq/install
        - run:
            name: Get cluster info
            command: |
              SERVICES_OBJ=$(aws ecs describe-services --cluster "${ECS_CLUSTER_NAME}" --services "${ECS_SERVICE_NAME}")
              VPC_CONF_OBJ=$(echo $SERVICES_OBJ | jq '.services[].networkConfiguration.awsvpcConfiguration')
              SUBNET_ONE=$(echo "$VPC_CONF_OBJ" |  jq '.subnets[0]')
              SUBNET_TWO=$(echo "$VPC_CONF_OBJ" |  jq '.subnets[1]')
              SECURITY_GROUP_IDS=$(echo "$VPC_CONF_OBJ" |  jq '.securityGroups[0]')
              CLUSTER_NAME=$(echo "$SERVICES_OBJ" |  jq '.services[].clusterArn')
              echo "export SUBNET_ONE=$SUBNET_ONE" >> $BASH_ENV
              echo "export SUBNET_TWO=$SUBNET_TWO" >> $BASH_ENV
              echo "export SECURITY_GROUP_IDS=$SECURITY_GROUP_IDS" >> $BASH_ENV
              echo "export CLUSTER_NAME=$CLUSTER_NAME" >> $BASH_ENV
        - run:
            name: Associate cluster
            command: |
              aws ecs put-cluster-capacity-providers \
                --cluster "${ECS_CLUSTER_NAME}" \
                --capacity-providers FARGATE FARGATE_SPOT  \
                --default-capacity-provider-strategy capacityProvider=FARGATE,weight=1 capacityProvider=FARGATE_SPOT,weight=1 \
                --region "${AWS_DEFAULT_REGION}"
        - aws-ecs/run-task:
              cluster: $CLUSTER_NAME
              capacity-provider-strategy: capacityProvider=FARGATE,weight=1 capacityProvider=FARGATE_SPOT,weight=1
              launch-type: ""
              task-definition: webapp-fargate-task
              subnet-ids: '$SUBNET_ONE, $SUBNET_TWO'
              security-group-ids: $SECURITY_GROUP_IDS
              assign-public-ip: ENABLED
              count: 4

workflows:
  run-task:
    jobs:
      - test-fargatespot

Now, create a CircleCI account and choose Log In with GitHub. Once you’re logged in, click Add Project on the CircleCI dashboard and add the circleci-fargate-spot project from the list shown.

When working with CircleCI Orbs, you will need the config.yml file and environment variables under Project Settings.

The config file uses CircleCI version 2.1 and several Orbs, including AWS-ECS, AWS-CLI, and JQ. The test-fargatespot job runs in a Docker image and sets up the environment. In config.yml, we use the jq tool to parse JSON and fetch the ECS cluster information needed to run an ECS task, such as the VPC configuration, Subnets, and Security Groups. Because we are using capacity-provider-strategy, we set the launch-type parameter to an empty string.
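As a rough illustration of what the jq filters in the "Get cluster info" step pull out, the following Python sketch reads the same fields from a describe-services style response. The sample payload, its ARN and IDs, and the function name extract_network_settings are hypothetical placeholders, not values from the walkthrough.

```python
import json

# Hypothetical, trimmed sample shaped like `aws ecs describe-services` output.
SAMPLE_RESPONSE = json.dumps({
    "services": [
        {
            "clusterArn": "arn:aws:ecs:us-west-2:111122223333:cluster/example-cluster",
            "networkConfiguration": {
                "awsvpcConfiguration": {
                    "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
                    "securityGroups": ["sg-cccc3333"],
                    "assignPublicIp": "ENABLED",
                }
            },
        }
    ]
})


def extract_network_settings(raw_json: str) -> dict:
    """Mirror the jq filters from config.yml for the first service returned."""
    service = json.loads(raw_json)["services"][0]
    vpc_conf = service["networkConfiguration"]["awsvpcConfiguration"]
    return {
        "SUBNET_ONE": vpc_conf["subnets"][0],                 # jq '.subnets[0]'
        "SUBNET_TWO": vpc_conf["subnets"][1],                 # jq '.subnets[1]'
        "SECURITY_GROUP_IDS": vpc_conf["securityGroups"][0],  # jq '.securityGroups[0]'
        "CLUSTER_NAME": service["clusterArn"],                # jq '.services[].clusterArn'
    }


# Print the values in the same "export NAME=value" form the job appends to $BASH_ENV.
for name, value in extract_network_settings(SAMPLE_RESPONSE).items():
    print(f"export {name}={value}")
```

Note that the service must have been created with at least two subnets (as in step 2.8), otherwise the second subnet lookup has nothing to return.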

To run the task, we override the default Capacity Provider strategy with Fargate and Fargate Spot, each with a weight of 1, so that tasks are divided equally between the two. In our example, we are running 4 tasks, so 2 should run on Fargate and 2 on Fargate Spot.
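The arithmetic behind the weights can be sketched in a few lines of Python. This is only an approximation for illustration: it divides a task count in proportion to the weights using a largest-remainder rule, ignores the optional base setting, and is not a description of how the ECS scheduler is implemented. The function name split_tasks is hypothetical.

```python
def split_tasks(count: int, weights: dict) -> dict:
    """Divide `count` tasks across providers in proportion to their weights.

    Leftover tasks go to the providers with the largest remainders;
    ties go to the provider listed first.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one provider needs a positive weight")
    # Whole tasks each provider receives before leftovers are handed out.
    shares = {name: count * w // total_weight for name, w in weights.items()}
    remainders = {name: count * w % total_weight for name, w in weights.items()}
    leftover = count - sum(shares.values())
    # sorted() is stable, so providers with equal remainders keep their listed order.
    ranked = sorted(weights, key=lambda name: remainders[name], reverse=True)
    for name in ranked[:leftover]:
        shares[name] += 1
    return shares


print(split_tasks(4, {"FARGATE": 1, "FARGATE_SPOT": 1}))
# {'FARGATE': 2, 'FARGATE_SPOT': 2}
```

Changing the weights to 1 and 3, for example, would shift the same 4 tasks to 1 on Fargate and 3 on Fargate Spot under this sketch.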

Parameters like ECS_SERVICE_NAME, ECS_CLUSTER_NAME and other AWS access specific details are added securely under Project Settings and can be utilized by other jobs running within the project.

Add the following environment variables under Project Settings

    • AWS_ACCESS_KEY_ID – From Step 1
    • AWS_SECRET_ACCESS_KEY – From Step 1
    • AWS_DEFAULT_REGION – e.g., us-west-2
    • ECS_CLUSTER_NAME – From Step 2
    • ECS_SERVICE_NAME – From Step 2
    • SECURITY_GROUP_IDS – Security Group that will be used to run the task
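A missing variable usually only surfaces when a job step fails, so a small pre-flight check can save a run. The sketch below is one possible way to do that in Python, assuming the variable names from the list above; the helper name missing_variables is hypothetical and not part of CircleCI or the AWS Orbs.

```python
import os

# Variable names taken from the Project Settings list above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "ECS_CLUSTER_NAME",
    "ECS_SERVICE_NAME",
    "SECURITY_GROUP_IDS",
)


def missing_variables(env) -> list:
    """Return the required names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report on the current process environment without printing any secret values.
missing = missing_variables(os.environ)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```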

CircleCI Environment Variables

Step 4: Run Job

Now, in the CircleCI console, navigate to your project, choose the branch, and click Edit Config to verify that config.yml is correctly populated. Check for the ribbon at the bottom. A green ribbon means that the config file is valid and ready to run. Click Commit & Run from the top-right menu.

Click the build status to check its progress as it runs.

CircleCI Project Dashboard

A successful build should look like the one below. Expand each section to see the output.

CircleCI Job Configuration

Return to the ECS console, go to the Tasks Tab, and check that 4 new tasks are running. Click each task for the Capacity provider details. Two tasks should have run with FARGATE_SPOT as a Capacity provider, and two should have run with FARGATE.
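The same check can be scripted instead of clicking through each task. The following Python sketch tallies running tasks by the capacityProviderName field found in a describe-tasks style response (for example, from aws ecs describe-tasks). The sample payload and the function name count_by_capacity_provider are hypothetical.

```python
from collections import Counter

# Hypothetical, trimmed sample shaped like `aws ecs describe-tasks` output.
SAMPLE_TASKS = {
    "tasks": [
        {"taskArn": "arn:aws:ecs:us-west-2:111122223333:task/example/1",
         "capacityProviderName": "FARGATE", "lastStatus": "RUNNING"},
        {"taskArn": "arn:aws:ecs:us-west-2:111122223333:task/example/2",
         "capacityProviderName": "FARGATE_SPOT", "lastStatus": "RUNNING"},
        {"taskArn": "arn:aws:ecs:us-west-2:111122223333:task/example/3",
         "capacityProviderName": "FARGATE", "lastStatus": "RUNNING"},
        {"taskArn": "arn:aws:ecs:us-west-2:111122223333:task/example/4",
         "capacityProviderName": "FARGATE_SPOT", "lastStatus": "RUNNING"},
    ]
}


def count_by_capacity_provider(response: dict) -> Counter:
    """Count RUNNING tasks per capacity provider in a describe-tasks response."""
    return Counter(
        task.get("capacityProviderName", "UNKNOWN")
        for task in response["tasks"]
        if task.get("lastStatus") == "RUNNING"
    )


print(dict(count_by_capacity_provider(SAMPLE_TASKS)))
# {'FARGATE': 2, 'FARGATE_SPOT': 2}
```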

Congratulations!

You have successfully deployed ECS tasks on AWS Fargate and Fargate Spot by using CircleCI. If you used a sample web application, then open the task's public IP address to see the page. If you used the sample code that we provided, then you should see TensorFlow training jobs running on Fargate. If a Fargate Spot interruption occurs, then the training job restores its checkpoint from S3 when a replacement Fargate task starts and continues training from there.
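The checkpoint-and-resume pattern mentioned above can be sketched without TensorFlow or S3. In the Python example below, a local JSON file stands in for the S3 checkpoint and a counter stands in for training work. Every name here (load_checkpoint, save_checkpoint, train, the stop_after parameter) is a hypothetical illustration and not the sample application's code.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the last completed step, or 0 when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as fh:
        return json.load(fh)["step"]


def save_checkpoint(path: str, step: int) -> None:
    """Write to a temp file and rename, so a partial write never replaces a good checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump({"step": step}, fh)
    os.replace(tmp_path, path)


def train(path: str, total_steps: int, stop_after=None) -> int:
    """Run steps starting from the last checkpoint; `stop_after` simulates an interruption."""
    step = load_checkpoint(path)
    while step < total_steps:
        if stop_after is not None and step >= stop_after:
            break  # simulated Spot interruption
        step += 1  # placeholder for one unit of real training work
        save_checkpoint(path, step)
    return step


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
print(train(checkpoint, total_steps=10, stop_after=4))  # first task, interrupted at step 4
print(train(checkpoint, total_steps=10))                # replacement task resumes and reaches 10
```

Saving after every step keeps the example short; a real job would more likely checkpoint at intervals to balance lost work against storage traffic.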

Cleaning up

To avoid incurring future charges, delete the resources used in the walkthrough. Go to the ECS console and open the Tasks tab.

  • Delete any running Tasks.
  • Delete the ECS cluster.
  • Delete the circleci user from the IAM console.

Cost analysis in Cost Explorer

To demonstrate the cost breakdown between tasks running on Fargate and Fargate Spot, we left the tasks running for a day. Then, we used Cost Explorer with the following filters and groups to see the savings from running on Fargate Spot.

In the filter panel on the right, filter Service to ECS, set Group by to Usage Type, and change the time period to the specific day.

Cost analysis in Cost Explorer

The cost breakdown demonstrates how Fargate Spot usage (indicated by “SpotUsage”) was significantly less expensive than non-Spot Fargate usage. Current Fargate Spot Pricing can be found here.
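The same grouping can be applied to a Cost Explorer API response (for example, from aws ce get-cost-and-usage grouped by usage type). The Python sketch below sums costs into Spot and non-Spot buckets based on whether the usage type contains "SpotUsage". The usage type strings and amounts in the sample are made-up placeholders for illustration, not measured results from the walkthrough, and the function name cost_by_purchase_option is hypothetical.

```python
from decimal import Decimal

# Hypothetical sample shaped like a get-cost-and-usage response grouped by
# USAGE_TYPE. Amounts are placeholders, not real billing data.
SAMPLE_COSTS = {
    "ResultsByTime": [
        {
            "Groups": [
                {"Keys": ["USW2-Fargate-vCPU-Hours:perCPU"],
                 "Metrics": {"UnblendedCost": {"Amount": "2.00", "Unit": "USD"}}},
                {"Keys": ["USW2-Fargate-GB-Hours"],
                 "Metrics": {"UnblendedCost": {"Amount": "1.00", "Unit": "USD"}}},
                {"Keys": ["USW2-SpotUsage-Fargate-vCPU-Hours:perCPU"],
                 "Metrics": {"UnblendedCost": {"Amount": "0.60", "Unit": "USD"}}},
                {"Keys": ["USW2-SpotUsage-Fargate-GB-Hours"],
                 "Metrics": {"UnblendedCost": {"Amount": "0.30", "Unit": "USD"}}},
            ]
        }
    ]
}


def cost_by_purchase_option(response: dict) -> dict:
    """Sum UnblendedCost into 'Spot' and 'On-Demand' buckets by usage type."""
    totals = {"Spot": Decimal("0"), "On-Demand": Decimal("0")}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            usage_type = group["Keys"][0]
            amount = Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
            bucket = "Spot" if "SpotUsage" in usage_type else "On-Demand"
            totals[bucket] += amount
    return totals


for option, total in cost_by_purchase_option(SAMPLE_COSTS).items():
    print(f"{option}: {total} USD")
```

Decimal is used rather than float so that the string amounts in the response are summed without rounding drift.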

Conclusion

In this blog post, we have demonstrated how to utilize CircleCI to deploy and manage ECS tasks and run applications in a cost-effective serverless approach by using Fargate Spot.

Author bio

Pritam is a Sr. Specialist Solutions Architect on the EC2 Spot team. For the last 15 years, he has evangelized DevOps and Cloud adoption across industries and verticals. He likes to deep dive and find solutions to everyday problems.
Dan is a Sr. Spot GTM Specialist on the EC2 Spot team. He works closely with Amazon Partners to ensure that their customers can optimize and modernize their compute with EC2 Spot.