Tag Archives: Compute

Enhance your workload resilience with new Amazon EMR instance fleet features

Post Syndicated from Deepmala Agarwal original https://aws.amazon.com/blogs/big-data/enhance-your-workload-resilience-with-new-amazon-emr-instance-fleet-features/

Big data processing and analytics have emerged as fundamental components of modern data architectures. Organizations worldwide use these capabilities to extract actionable insights and facilitate data-driven decision-making processes. Amazon EMR has long been a cornerstone for big data processing in the cloud. Now, with a suite of exciting new features for EMR instance fleets that enables you to effectively manage your compute, Amazon is taking cloud-based analytics to the next level.

Amazon EMR has introduced new features for instance fleets that address critical challenges in big data operations. This post explores how these innovations improve cluster resilience, scalability, and efficiency, enabling you to build more robust data processing architectures on AWS. This comprehensive post introduces instance fleets, demonstrates using this new allocation strategy, explores how enhanced Availability Zone and subnet selection works, and examines how these features improve cluster’s resilience. This technical exploration will equip you with the knowledge to implement more resilient and efficient EMR clusters for your organization’s big data processing needs.

The current challenges

Organizations using big data operations might face several challenges:

  • When preferred instance types are unavailable, finding suitable alternatives often delays cluster launches and disrupts workflows
  • Selecting the optimal Availability Zone for cluster launch is challenging due to constantly changing available compute capacity, especially when considering future scaling needs
  • Maintaining uninterrupted operation of mission-critical long-running clusters becomes complex as data processing requirements evolve over time
  • Organizations frequently struggle to scale their operations to meet growing data processing demands, leading to performance bottlenecks and delayed insights

These challenges underscore the need for more advanced, flexible, and intelligent solutions in the realm of big data operations, driving the demand for innovative features in cloud-based data processing platforms.

Introducing improved EMR instance fleets

Amazon EMR, a cloud-based big data platform, allows you to process large datasets using various open source tools such as Apache Spark, Apache Flink, and Trino. To address the aforementioned challenges, Amazon EMR introduced instance fleets, with a robust set of features.

When setting up an EMR cluster, Amazon EMR offers two configuration options for configuring the primary, core, and task nodes: uniform instance groups or instance fleets.

Uniform instance groups offer a streamlined approach to cluster setup, allowing up to 50 instance groups per cluster. An EMR cluster has a primary instance group for primary node, a core instance group with one or more Amazon Elastic Compute Cloud (Amazon EC2) instances, and the option to add up to 48 task instance groups. Both core and task instance groups are flexible, allowing any number of EC2 instances within each group. Both core and task groups offer flexibility in instance count, and each node type (primary, core, or task) consists of instances sharing the same specifications and purchasing model (On-Demand or Spot). However, this approach limits the ability to mix different instance types or purchasing options within a single group.

Instance fleets provide a versatile approach to provisioning EC2 instances, offering unparalleled flexibility in cluster configuration. This setup assigns one instance fleet each for primary and core nodes, with the task instance fleet being optional. It allows you to specify up to five EC2 instance types (or up to 30 when using the Amazon Command Line Interface (AWS CLI) or API with an instance allocation strategy) for each node type in a cluster, providing enhanced instance diversity to optimize cost and performance while increasing the likelihood of fulfilling capacity requirements. Instance fleets automatically manage the mix of instance types to meet specified target capacities for On-Demand and Spot, reducing operational overhead and improving compute availability.

Key benefits of instance fleets include improved cluster resilience to capacity fluctuations, superior management of Spot Instances with the ability to set timeouts and specify actions if Spot capacity can’t be provisioned, and faster cluster provisioning. The feature also allows you to select multiple subnets for different Availability Zones, enabling Amazon EMR to optimally launch clusters and automatically route traffic away from impacted zones during large-scale events. Additionally, instance fleets offer capacity reservation options for On-Demand Instances and support allocation strategies that prioritize instance types based on user-defined criteria, further enhancing the flexibility and efficiency of EMR cluster management.

Achieve resiliency with instance fleets

Now that you have a good understanding of instance fleets, let’s explore how the new instance fleet capabilities help achieve resiliency for your workloads through the following methods:

  • EC2 instance allocation – Enables precise control over instance type selection and prioritization
  • Enhanced subnet selection – Optimizes cluster deployment across Availability Zones

EC2 instance allocation

EMR instance fleets now offer newer allocation strategies for both Spot and On-Demand Instances, giving you control over selection and prioritization of instance types and allowing you to optimize for greater flexibility, resilience, and cost-efficiency.

Amazon EMR supports the following allocation strategies for On-Demand Instances:

  • Prioritized (new) – Allows you to define a priority order for instance types, giving you precise control over instance selection
  • Lowest-price (existing) – Selects the lowest-priced instance type from the available options

Amazon EMR supports the following allocation strategies for Spot Instances:

  • Price-capacity optimized (new) – Selects instances with the lowest price while also considering the available capacity
  • Capacity-optimized-prioritized (new) – Similar to capacity-optimized, but respects instance type priorities that you specify, on a best-effort basis
  • Capacity-optimized (existing) – Selects instances from the pools with the most available capacity
  • Lowest-price (existing) – Selects the lowest-priced Spot Instances
  • Diversified (existing) – Distributes instances across all pools

When using the prioritized On-Demand allocation strategy, Amazon EMR applies the same priority value to both your On-Demand and Spot Instances when you set priorities.

For Spot Instances, Amazon EMR recommends the capacity-optimized allocation strategy. This approach allocates instances from the most available capacity pools, thereby reducing the chance of interruptions and enhancing cluster stability. Amazon EMR also allows you to launch a cluster without an allocation strategy. However, using an allocation strategy is recommended for faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.

Enhanced subnet selection

Amazon EMR on EC2 offers improved reliability and cluster launch experience for instance fleet clusters through the newly launched enhanced subnet selection. With this feature, EMR on EC2 reduces cluster launch failures resulting from an IP address shortage. Previously, the subnet selection for EMR clusters only considered the available IP addresses for the core instance fleet. Amazon EMR now employs subnet filtering at cluster launch and selects one of the subnets that have adequate available IP addresses to successfully launch all instance fleets. If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the whole cluster, it will prioritize the subnet that can at least launch the core and primary instance fleets. In this scenario, Amazon EMR will also publish an Amazon CloudWatch alert event to notify the user. If none of the configured subnets can be used to provision the core and primary fleet, Amazon EMR will fail the cluster launch and provide a critical error event. These CloudWatch events enable you to monitor your clusters and take remedial actions as necessary. This capability is enabled by default when you configure more than one subnet for cluster launch, and you don’t need to make any configuration changes to benefit from it.

Solution overview

Now that you have a comprehensive grasp of the two new features, let’s integrate the elements of instance fleets and look at the implementation flow for each feature.

EC2 instance allocation

The following diagram illustrates the instance fleet lifecycle management architecture.

The workflow consists of the following steps:

  1. Create a cluster configuration with the prioritized allocation strategy, specifying instance types, their priority, and a list of potential subnets.
  2. When you launch an EMR cluster, it evaluates compute capacity and available IPs across the specified subnets. Amazon EMR then selects a single Availability Zone that best meets capacity and instance availability needs for the entire cluster.
  3. Amazon EMR launches the cluster using available instance types in one of the configured Availability Zones based on enhanced subnet selection.
  4. During a scale-up scenario, Amazon EMR adds new instances to the clusters while following the configured compute allocation strategy.
  5. If a specific instance type is unavailable, Amazon EMR will select the next available instance types based on the priority order. This flexibility provides capacity availability for production workloads while maintaining scalability.

The following example code provisions an EMR cluster with a primary and core instance fleet configuration with both Spot and On-Demand Instances, using the Capacity-optimized-prioritized allocation strategy for Spot Instances and the Prioritized strategy for On-Demand Instances:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "myCluster": {
      "Type": "AWS::EMR::Cluster",
      "Properties": {
        "Instances": {
          "MasterInstanceFleet": {
            "Name": "cfnPrimary",
            "InstanceTypeConfigs": [
              {
                "BidPrice": "10.50",
                "InstanceType": "m5.xlarge",
                "Priority": "1",
                "EbsConfiguration": {
                  "EbsBlockDeviceConfigs": [
                    {
                      "VolumeSpecification": {
                        "VolumeType": "gp2",
                        "SizeInGB": 32
                      }
                    }
                  ]
                }
              }
            ],
            "TargetOnDemandCapacity": 1
          },
          "CoreInstanceFleet": {
            "Name": "cfnCore",
            "InstanceTypeConfigs": [
              {
                "BidPrice": "10.50",
                "InstanceType": "m5.xlarge",
                "Priority": "1",
                "WeightedCapacity": "1",
                "EbsConfiguration": {
                  "EbsBlockDeviceConfigs": [
                    {
                      "VolumeSpecification": {
                        "VolumeType": "gp2",
                        "SizeInGB": 32
                      }
                    }
                  ]
                }
              }
            ],
            "LaunchSpecifications": {
              "SpotSpecification": {
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "TimeoutDurationMinutes": 20,
                "AllocationStrategy": "CAPACITY_OPTIMIZED_PRIORITIZED"
              },
              "OnDemandSpecification": {
                "AllocationStrategy": "PRIORITIZED"
              }
            },
            "TargetOnDemandCapacity": "5",
            "TargetSpotCapacity": "0"
          }
        },
        "Name": "blog-test",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "ReleaseLabel": "emr-7.2.0"
      }
    }
  }
}

Enhanced subnet selection

To better understand Step 3 in the preceding workflow, let’s explore how enhanced subnet selection works with instance fleet EMR clusters.

For our example, let’s configure an EMR instance fleet as follows:

  • Primary fleet (1 unit) – r8g.xlarge, r6g.xlarge, r8g.2xlarge
  • Core fleet (48 units) – r6g.xlarge, r6g.2xlarge, m7g.2xlarge
  • Task fleet (48 units) – m7g.2xlarge, r6g.xlarge, r6a.4xlarge

For this example, let’s use the lowest price allocation strategy. Next, let’s check the available IP addresses in our subnets using the AWS CLI:

aws ec2 describe-subnets 
--query "sort_by(Subnets, &SubnetId)[*].[SubnetId, AvailableIpAddressCount, AvailabilityZoneId]" 
--output table

We get the following results:

--------------------------------------------------
|                 DescribeSubnets                |
+---------------------------+-------+------------+
|subnet-XXXXXXXXXXXXXXXX1   |  27  |  us-east-1a |
|subnet-XXXXXXXXXXXXXXXX2   |  251 |  us-east-1b |
|subnet-XXXXXXXXXXXXXXXX3   |  11  |  us-east-1a |
-------------------------------------------------

When launching an EMR cluster, Amazon EMR follows a specific subnet filtering process. First, EMR on EC2 evaluates subnets based on the total IP addresses required for all node types: primary, core, and task nodes. If multiple subnets have sufficient IP capacity to accommodate all instance fleets, Amazon EMR selects one based on the cluster’s allocation strategy. However, if no subnet has enough IPs to support all node types, Amazon EMR considers subnets that can at least accommodate the primary and core nodes, again using the allocation strategy to make the final selection. In our case, Amazon EMR selected a subnet in Availability Zone us-east-1b that had 251 available IPs that can support 97 instances to launch the whole cluster, bypassing smaller subnets with only 27 or 11 available IPs because they didn’t meet the minimum IP requirements for the cluster configuration.

  • Primary fleet (1 unit) – r6g.xlarge
  • Core fleet (48 units) – m7g.2xlarge
  • Task fleet (48 units) – r6g.xlarge

The EMR and CloudWatch event for this cluster would be:

Amazon EMR cluster j-X40BEI1Oxxx (Cluster) 
is being created in subnet (subnet-XXXXXXXXXXXXXXXX2) 
in VPC (vpc-XXXXXXXXXXXXXXXX1) in Availability Zone (us-east-1b), 
which was chosen from the specified VPC options.

If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the entire cluster, it will prioritize launching the core and primary instance fleets. If no configured subnet can accommodate even the core and primary fleets, Amazon EMR will fail the cluster launch and provide a critical error event. These CloudWatch events enable you to monitor your clusters and take necessary actions.

Conclusion

The latest enhancements to EMR instance fleets mark a significant advancement in cloud-based big data processing, addressing key challenges in resource allocation, scalability, and reliability. These features, including priority-based instance selection and enhanced subnet selection, provide you with greater control over resource strategies, improved cluster availability, enhanced capacity optimization across Availability Zones, and more efficient fallback mechanisms for production workloads. Instance fleets help you tackle current resource management challenges while laying the groundwork for future scalability.

Get started today by setting up an EMR cluster using the example configuration provided in this post. For additional configuration options and implementation guidance, refer here or reach out to your AWS account team.


About the Authors

Deepmala Agarwal works as an AWS Data Specialist Solutions Architect. She is passionate about helping customers build out scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!

Ravi Kumar Singh is a Senior Product Manager Technical-ES (PMT) at Amazon Web Services, specialized in building petabyte-scale data infrastructure and analytics platforms. With a passion for building innovative tools, he helps customers unlock valuable insights from their structured and unstructured data. Ravi’s expertise lies in creating robust data foundations using open source technologies and advanced cloud computing that power advanced artificial intelligence and machine learning use cases. A recognized thought leader in the field, he advances the data and AI ecosystem through pioneering solutions and collaborative industry initiatives. As a strong advocate for customer-centric solutions, Ravi constantly seeks ways to simplify complex data challenges and enhance user experiences. Outside of work, Ravi is an avid technology enthusiast who enjoys exploring emerging trends in data science, cloud computing, and machine learning.

Mandisa Nxumalo is a Cloud Engineer at Amazon Web Services (AWS) with over 5 years experience in topics related to cloud services (databases, automation, and others). Currently, specializing in Big data service Amazon EMR. She is passionate about engaging customers to effectively adopt and utilize data driven approaches to improve their big data workflows. Outside work, Mandisa enjoys hiking mountains, chasing waterfalls and travelling across countries.

Kashif Khan is a Sr. Analytics Specialist Solutions Architect at AWS, specializing in big data services like Amazon EMR, AWS Lake Formation, AWS Glue, Amazon Athena, and Amazon DataZone. With over a decade of experience in the big data domain, he possesses extensive expertise in architecting scalable and robust solutions. His role involves providing architectural guidance and collaborating closely with customers to design tailored solutions using AWS analytics services to unlock the full potential of their data.

Gaurav Sharma is a Specialist Solutions Architect (Analytics) at AWS, supporting US public sector customers on their cloud journey. Outside of work, Gaurav enjoys spending time with his family and reading books.

AWS Weekly Roundup: DeepSeek-R1, S3 Metadata, Elastic Beanstalk updates, and more (February 3, 2024)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-deepseek-r1-s3-metadata-elastic-beanstalk-updates-and-more-february-3-2024/

Last week, I had an amazing time attending AWS Community Day Thailand in Bangkok. This event came at an exciting time, following the recent launch of the AWS Asia Pacific (Bangkok) Region. We had over 300 attendees and featured 15 speakers from the community, including an AWS Hero and 4 AWS Community Builders who shared their technical expertise and experiences.

The highlight was definitely Jeff Barr, AWS Vice President & Chief Evangelist, delivering an inspiring keynote titled “Next-Generation Software Development”, which set the perfect tone for the day. The day kicked off with welcoming remarks from Vatsun Thirapatarapong, AWS Country Manager for Thailand, and was made even more special thanks to the tremendous support from both the AWS User Group volunteers and the AWS Thailand team.

Here’s a photo capturing the excitement from the event: 

Last week’s AWS Launches
There are 30+ launches last week and here are some launches that caught my attention:

DeepSeek-R1 models now available on AWS — Channy wrote on how you can now deploy DeepSeek-R1 models in Amazon Bedrock and Amazon SageMaker AI. This helps you to build and scale generative AI applications with minimal infrastructure investment.

Amazon S3 Tables increases table limit to 10,000 per bucket — S3 Tables now supports creating up to 10,000 tables in each table bucket, allowing you to scale up to 100,000 tables across 10 buckets within an AWS Region per account.

Amazon S3 Metadata now generally available — S3 Metadata provides automated and easily queried metadata that updates in near real-time, simplifying business analytics and real-time inference applications. It supports both system-defined and custom metadata, including integration with AWS analytics services.

AWS Amplify adds TypeScript Data client support for Lambda functions — Developers can now use the Amplify Data client within AWS Lambda functions, enabling consistent type-safe data operations across frontend and backend applications.

AWS Elastic Beanstalk adds Python 3.13, .NET 9, and PHP 8.4 support on Amazon Linux 2023 — AWS Elastic Beanstalk brings the latest language features and improvements to application deployments while benefiting from Amazon Linux 2023 enhanced security and performance features.

From community.aws
Here’s my top 5 personal favorites posts from community.aws:

Upcoming AWS and community events
Check your calendars and sign up for upcoming AWS and community events:

  • AWS Korea re:Invent reCap Online, February 2-4 — A virtual event recapping key announcements and innovations from re:Invent 2023 for the Korean audience.
  • AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs. Upcoming AWS Community Day is in Ahmedabad (February 8).
  • AWS Public Sector Day London, February 27 — Join public sector leaders and innovators to explore how AWS is enabling digital transformation in government, education, and healthcare.
  • AWS Innovate GenAI + Data Edition — A free online conference focusing on generative AI and data innovations. Available in multiple Regions: APJC and EMEA (March 6), North America (March 13), Greater China Region (March 14), and Latin America (April 8).

Browse more upcoming AWS led in-person and virtual developer-focused events.

AWS Community re:Invent re:Caps

Lastly, if you want to learn about top announcements and innovations from AWS re:Invent, the AWS Community shares a summary from a community perspective of these announcements so you can get up to speed. Download the AWS Community re:Invent re:Caps deck

That’s all for this week. Check back next Monday for another Weekly Roundup!

Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Automating Notifications for Future-Dated Amazon EC2 Capacity Reservation State Changes

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/automating-notifications-for-future-dated-amazon-ec2-capacity-reservation-state-changes/

This post is written by Ballu Singh, Principal Solutions Architect at AWS, Sandeep Rohilla, Senior Solutions Architect at AWS and Pranjal Gururani, Senior Solutions Architect at AWS.

AWS customers are able to proactively reserve future-dated Amazon EC2 On-Demand Capacity Reservations (known as future-dated CRs) to get capacity assurance for workloads and events. Because reservations can be created weeks in advance, customers are able to ensure that they can monitor their real-time status. Future-dated CRs can transition through various state of a Capacity Reservation, based on capacity availability. Over the course of several weeks, the status of a reservation might only change 1 or 2 times. Even with the low frequency of changes, most customers don’t want to manually poll the state of their reservations, instead they want to be proactively notified when something changes.

Amazon EventBridge is a serverless event bus service that facilitates the routing of events and EventBridge rules establish how events are processed, allowing you to filter events and route them to one or more targets for action. Amazon Simple Notification Service (SNS) is a fully managed messaging service that enables you to send messages or notifications to a variety of endpoints, facilitating scalable, and decoupled communication between different components of your application or with end-users.

In this blog post, we guide you through the process of automating notifications for future-dated CR state changes using AWS services such as Amazon EventBridge and Amazon Simple Notification Service (SNS).

Pre-requisites

  • Existing future-dated Capacity Reservation. To learn how to create future-dated CRs, visit our blog post here.
  • IAM access to create Amazon EventBridge rules
  • IAM access to create Amazon SNS topic and subscription

Architecture

Amazon EC2 continuously monitors the state of your Capacity Reservations and sends events to Amazon EventBridge when Capacity Reservation states change. Using Amazon EventBridge, you can create rules that trigger notifications through Amazon SNS in response to these events. Amazon SNS then pushes these notifications to a variety of supported endpoint types such as Amazon Data Firehose, Amazon Simple Queue Service (SQS), AWS Lambda, HTTP, email, mobile push notifications, and mobile text messages (SMS). In our example, we will send notification to our email address.

Send notification to email address

Walkthrough

To automate the notification process for future-dated CR state changes, we will use the following AWS services:

  1. Amazon Simple Notification Service (SNS): We will use SNS to send email notifications to the designated recipients.
  2. Amazon EventBridge: We will create an EventBridge rule to capture the state change events of future-dated CRs.

AWS Console Walkthrough

Step 1: Set Up Amazon SNS Topics and Subscriptions

  • Create a new topic for future-dated CR state change notifications.
    • Navigate to Amazon SNS console. On the Topics page, choose Create topic
    • In the Details section, for Type, choose Standard and enter the name of Topic (ex: Subscriber)

Set up amazon sns topics and subscriptions

    • Choose rest of details as is. Choose Create topic.
  • Create a subscription for the desired recipients, such as email addresses to receive the notifications.
    • Navigate to Amazon SNS console. In the left navigation pane, choose Subscriptions. Choose Create subscription.
    • On the Create subscription page, in the Details section, for Topic ARN, choose the Amazon Resource Name (ARN) of a topic enter topic ARN noted above. For Protocol, choose Email/Email-JSON
    • For Endpoint, enter your email address. Keep rest of details as is.

create subscription for the desired recipients

    • Choose Create subscription
    • Navigate to your email and choose Confirm subscription in the email from Amazon SNS.

Step 2: Create an Amazon EventBridge Rule

  • Next open the Amazon EventBridge console and navigate to the Rules
  • Click on Create rule and provide a name (ex: EC2CapacityReservationStateChange) and description for the rule. Keep all other setting as it is. Select
  • For Event source, choose AWS events or EventBridge partner events.
  • In the Creation method section, for Method, choose Use pattern form.
  • For Event source, choose AWS services. Under AWS Service, choose EC2 and then choose EC2 Capacity Reservation.
  • Optionally, you can narrow down the state and reservation ID you want to be alerted for by selecting Specific Capacity Reservation State and Specific Capacity reservation ID. Select Next.

Create an amazon eventbridge rule

  • In the Target section, under Target Type, select AWS Service.
  • For Select Target, choose SNS topic, under Topic, select SNS Topic you created in Step 1. Select
  • On Configure Tags page, select Next.
  • On Review and create page, select Create Rule.

CloudFormation Walkthrough

To simplify the setup process and make it easier for you to implement the automated notification solution, we have provided a CloudFormation template. This template automates the creation of the necessary Amazon EventBridge rule and Amazon SNS topic, along with the required configurations and permissions.

  1. Download yaml sample template.
  2. Navigate to AWS CloudFormation console and click on Create stack.
  3. Choose Upload a template file and select the downloaded template. Choose Next
  4. Provide a name for the stack under Stack Name. For CapacityReservationId, enter ID of the EC2 Capacity Reservation to monitor, (e.g., cr-1234567890abcdef0), for EmailAddress, enter email address you want to subscribe to the SNS topic, and for MonitoredStates enter comma-separated list of Capacity Reservation states to monitor (e.g. failed, expired, cancelled, pending). Choose Next

CloudFormation walkthrough

5. On Configure stack options, keep defaults. Choose

6. On Review and create page, choose

Clean up

To avoid ongoing charges, clean up your environment, by following these steps to delete the resources you created by following this blog, if they are no longer needed:

If you followed setup for AWS Console, please follow the below steps:

  1. Delete the Amazon EventBridge Rule
    1. Navigate to Amazon EventBridge console and choose to the Rules from left pane.
    2. Choose the rule you created earlier and click on Delete.
    3. Confirm the deletion by clicking Delete in the confirmation dialog.
  2. Delete the Amazon SNS Topic and Subscriptions
    1. Navigate to Amazon SNS console and navigate to the Topics from left pane.
    2. Choose the topic you created earlier and click on Delete.
    3. Confirm the deletion by clicking Delete in the confirmation dialog.

If you created any subscriptions for the topic (e.g., email or SMS), they will be automatically deleted along with the topic.

If you deployed CloudFormation template, please follow the below steps:

Conclusion

In this blog post we walked you through setting up an EventBridge rule to capture the state change events of your future-dated CRs and configure SNS to send notifications to the designated recipients. This automated approach eliminates the need for manual monitoring and ensures that you stay informed about the status of your capacity reservations.

By proactively managing your future-dated CRs through automated notifications, you can make informed decisions, adjust your reservation plans if need, and take corrective actions to ensure you have the necessary resources available for your critical events.

This solution enhances your operational efficiency, reduces the risk of capacity shortages, and allows you to focus on other important aspects of your business.

We encourage you to implement this automated notification system for your future-dated Amazon EC2 On-Demand Capacity Reservations and experience the benefits of streamlined monitoring and proactive capacity management.

Implementing backup for workloads running on AWS Outposts servers

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/implementing-backup-for-workloads-running-on-aws-outposts-servers/

This post is written by Leonardo Queirolo, Senior Cloud Support Engineer and Tareq Rajabi, Senior Solutions Architect, Hybrid Cloud

AWS Outposts servers provide fully managed AWS infrastructure, services, APIs, and tools to on-premises and edge locations with limited space or small capacity requirements, such as retail stores, branch offices, healthcare provider locations, or factory floors. Outposts servers provide local compute and networking services.

Outposts servers come with internal NVMe SSD instance storage, supporting local storage used for data access and processing on premises, and for launching Amazon Elastic Block Store (Amazon EBS)-backed Amazon Machine Images (AMIs). The data on these volumes persists after an instance reboot but does not persist after an instance termination. In order for data to persist beyond the lifetime of the instance, it is important to back up your data to a persistent AWS storage, such as an Amazon Simple Storage Service (Amazon S3) bucket or an Amazon Elastic Block Store (Amazon EBS) volume.

In this post, we explore several approaches to back up the data stored in the instance storage volumes of your EC2 instances running on an Outposts server to a persistent storage solution from AWS, and explore their benefits and use cases.

Planning for failure

When evaluating a backup strategy, it’s important to understand the failure modes you are looking to recover from. Some examples are ransomware attacks, accidental data deletion, hardware failure, or a wide scale issue impacting the whole facility where your Outposts servers and on-premises devices (such as network switches, storage appliances) reside. These failures come in many forms and are often unplanned and unexpected events. Next, understand what is considered acceptable recovery for your business. For example, what are the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for your workload running on Outposts servers? These two values, defined by your organization, profile how long a service can be down during recovery and quantify the acceptable amount of data loss, helping you define the appropriate backup strategy.

Scenario 1: Backup to AWS storage in an AWS Region

Backup to an AWS Region enables data redundancy outside of the data center or facility where your Outpost resides, taking advantage of the durability, high availability, and scalability provided natively by the storage in the Region. This approach offers flexibility for restoration to the Region or to an Outposts server in a different edge location if the original data center/facility is impacted by an irrecoverable incident. However, when restoring the data back to an Outposts server, this approach could result in relatively high RTO, depending on the throughput of the service link and the amount of data to restore. In the following sections, we will cover using the AWS Elastic Disaster Recovery (AWS DRS) and an open source solution based on operating system tools and AWS Systems Manager (AWS SSM).

Option 1: AWS Elastic Disaster Recovery (AWS DRS)

You can use AWS DRS to perform a continuous replication of workloads that reside on the Outposts C6id server powered by Intel processors (C6gd are not supported, since only 64-bit operating systems built for the x86 system architecture are supported by AWS DRS) to a staging area subnet in the Region. AWS DRS provides nearly continuous, block-level replication in the Region and creates periodic EBS Snapshots according to the Point in Time (PIT) state schedule for AWS DRS.

The following diagram shows the continuous replication of the data in the instance store volumes through AWS DRS. The PIT EBS Snapshots are used to create Amazon EBS-backed AMIs as a backup of the EC2 instances running on the Outposts server.

Figure 1 - Continuous replication of the Instance Store Volumes data from the instances running Outpost Server to a staging area in the parent region through DRS

Figure 1 – Continuous replication of the Instance Store Volumes data from the instances running Outpost Server to a staging area in the parent region through DRS

Despite AWS DRS not supporting the failback from the Region to Outposts servers, you can use the EBS snapshots taken by AWS DRS to restore the data back to the Outposts server at the desired PIT following the steps described in this post.

Prerequisites

The following prerequisites are required to complete the walkthrough:

  1. The EC2 instance to restore running on the Outposts server has been added as a source server to AWS DRS by installing the AWS Replication Agent.
  2. The initial sync has been completed and the data replication status is showing as healthy.

Restore the entire EC2 instance on the same or a different Outposts server

  1. Use the describe-recovery-snapshots command to list the PIT Snapshots taken by AWS DRS for the source server to restore.$ aws drs describe-recovery-snapshots --source-server <source-server-id>

2. Based on the time in which you want to restore your data, retrieve the corresponding EBS Snapshots in the output of the command. The following is an example of the output:

{
    "items": [
       {
            "ebsSnapshots": [
                "snap-07bf348d58151a432"
            ],
            "expectedTimestamp": "2024-06-13T16:40:00+00:00",
            "snapshotID": "pit-a4877ff6fa68561bf",
            "sourceServerID": "s-a080ceb10af7275a7",
            "timestamp": "2024-06-13T16:46:56.645979+00:00"
        },
        {
            "ebsSnapshots": [
                "snap-0496020ff7f83486d"
            ],
            "expectedTimestamp": "2024-06-13T16:30:00+00:00",
            "snapshotID": "pit-aece827519e1b0fbb",
            "sourceServerID": "s-a080ceb10af7275a7",
            "timestamp": "2024-06-13T16:37:06.600323+00:00"
        },
        {
            "ebsSnapshots": [
                "snap-0d7ebd23e56346cea"
            ],
            "expectedTimestamp": "2024-06-13T16:20:00+00:00",
            "snapshotID": "pit-a56960f89ff12579e",
            "sourceServerID": "s-a080ceb10af7275a7",
            "timestamp": "2024-06-13T16:27:01.595791+00:00"
        },
…
…

3. Open the Amazon EC2 console. In the navigation pane, choose Snapshots and filter by the Snapshot ID chosen in the previous step: snap-07bf348d58151a432.

4. Choose Actions, Create image from snapshot, and specify the Image name. You can leave the other information as default or customize as desired.

5. To perform the restore, launch a new EC2 instance on the same or a different Outposts server from the Amazon EBS-backed AMI created in the previous step.

Note that since AMIs are downloaded from the Region with every instance launch on Outposts servers, this approach could result in an RTO spanning hours, depending on the throughput of the service link and the size of the local instance storage from which the Snapshot and AMI were taken by AWS DRS. Alternatively, if you need to restore only some files and directories, you can do so by launching the EC2 instance in the Region from the AMI taken in Step 4 and then transferring the desired data from that instance to the source server running on Outposts.

Option 2: Backup to the Region using an open source solution

In addition to AWS DRS, you can use open source solutions and/or operating system (OS) functions to back up data from local instance storage to a Region. Consider this approach when you want a highly-customizable solution for workloads where lack of commercial support is acceptable. The open source solution uses AWS Systems Manager Automation and OS functions to take an Amazon EBS-backed AMI in the Region from a Linux EC2 instance running on your Outposts server. The following diagram provides a high-level overview of the solution.

Figure 2 – Wokflow of the open source solution

Figure 2 – Workflow of the open source solution

  1. The Automation creates a helper instance and a baseline EBS volume attached to it in the Region, using an AWS CloudFormation
  2. The Automation executes commands on the OS of the EC2 instance running on the Outposts server to perform preliminary checks and start syncing data from the local instance store volume to the baseline EBS volume in the Region.
  3. The sync continues until the data has been transferred successfully.
  4. When the sync completes, the Automation takes an EBS Snapshot of the baseline EBS volume and then creates an Amazon EBS-backed AMI from it.

Create the Automation document

  1. Open the github page of the open source solution backup-outposts-servers-linux-instance.
  2. Follow the Installation Instructions to create the Systems Manager Automation document.

Back up an EC2 instance running on Outposts server

  1. After creating the Automation document, follow the Usage Instructions to execute the Automation and initiate the backup.
  2. Monitor the Execution status in the System Manager Automation console.

Restore the entire EC2 instance on the same or a different Outposts server

  1. Open the Amazon EC2 console. In the navigation pane, choose AMIs and filter by the AMI names that contain the InstanceId to restore.

2. Select the desired AMI to restore and note its AMI ID.

3. To perform the restore, launch a new EC2 instance on the same or a different Outposts server from the Amazon EBS-backed AMI identified in the previous step.

Considerations for data residency and service link bandwidth

Data residency is a critical consideration for organizations that need to collect and store data in their own data centers for regulatory or compliance reasons. In this case, users cannot back up their data to the Region and need to consider backing up to another on-premises system.

Another consideration is the impact on the service link connectivity when performing backup and restore operations between the Outposts and the Region. When implementing the solutions described in the “Backup to AWS storage in an AWS Region” scenario, both your backup/restore and management/monitoring operations for your Outpost rely on the service link connectivity. Although AWS DRS provides block-level replication, the open source solution we discuss in this post only replicates data, resulting in smaller snapshot sizes for users with lower service link bandwidth requirement.

If you foresee bandwidth constraints for your service link, consider backing up to another on-premises system that is reachable through the local network interface (LNI) of your Outposts server.

Scenario 2: Backup to AWS storage in your on-premises environment

For the preceding reasons, you may need to back up your workload running on Outposts server to a persistent AWS storage system within the same geo political boundary. To do so, you can use an AWS Outposts rack that resides in the same or a different physical location and is reachable through the LNI of your Outposts server.

Outposts rack with Amazon S3 on Outposts allows you to run AWS infrastructure, services, and object storage to your on-premises to meet local data processing and data residency needs while offering the AWS durable storage that can be used to store your backup.

Thanks to this, you can use the same approaches described in the “Backup to AWS storage in an AWS Region” section at a high level to back up your data, while the storage is hosted on the Outposts rack. When evaluating this approach, keep in mind these important considerations for local snapshots.

With this approach, you can store your backup on premises to meet your data residency requirements. This also keeps the network traffic for your backup and restore within your on-premises network, without impacting the service link.

Conclusion

In this post, we showed different approaches to design backup and restore strategies for your workloads running on Outposts servers. Implementing the right approach can help protect your organization’s data against loss or corruption while meeting your performance, RTO, RPO, and data residency needs, with backup destinations ranging from AWS storage in the Region, locally on Outposts rack, or in a hybrid architecture.

Accelerate your AWS Graviton adoption with the AWS Graviton Savings Dashboard

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/accelerate-your-aws-graviton-adoption-with-the-aws-graviton-savings-dashboard/

This post is written by Rajani Guptan, Rosa Corley and Shankar Gopalan.

Are you looking to optimize your AWS infrastructure costs while maintaining high performance? AWS Graviton is a custom-built CPU developed by Amazon Web Services (AWS), and it is designed to deliver the best price performance for a broad range of cloud workloads running on Amazon Elastic Compute Cloud (Amazon EC2). Graviton-based instances provide up to 40% better price performance while using up to 60% less energy than comparable EC2 instances.

AWS users recognize that migrating existing workloads to Graviton-based instances results in better price performance. However, migrating to Graviton necessitates identifying comparable instance types, understanding the performance impacts, and estimating the savings opportunities. Furthermore, prioritizing and tracking migration efforts at scale across a diverse set of services such as Amazon EC2, Amazon Relational Database Service (Amazon RDS), Amazon ElastiCache, and Amazon OpenSearch can be challenging. Therefore, AWS has developed the AWS Graviton Savings Dashboard to help users address these complexities and accelerate their Graviton migration.

In this post, we walk you through the dashboard architecture, deployment steps, features, and capabilities. Whether you are an Executive, FinOps Practitioner, Product Owner, or in Engineering, you can use the dashboard to get the following:

  • Centralized visibility across accounts/workloads: The dashboard consolidates and tracks Graviton adoption across multiple management accounts, member accounts, and AWS Regions in a single view.
  • Graviton support across key AWS services: There are dedicated tabs allowing users to review current Graviton usage and potential savings across AWS compute and managed services.
  • Granular resource-level visibility for managed services: The dashboard provides granular resource level visibility for managed services such as Amazon RDS, ElastiCache, and OpenSearch.
  • Accurate savings and unit cost estimations: The dashboard provides accurate cost estimations for existing and comparable Graviton-based instance types by using the existing AWS Cost and Usage Report (CUR) data with the AWS public pricing API.
  • Categorization of migration effort: The dashboard categorizes Graviton migration opportunities into two main groups: Typically Easy and Requires Additional Planning, for EC2 instances. It also identifies Graviton-eligible resources for managed services, which may need version or database upgrades. This categorization helps users prioritize their engineering efforts for migration.

Architecture overview

The solution integrates AWS CUR, the AWS SDK, and AWS Public pricing API to generate comprehensive data on the usage, cost, and resource inventory. This data is stored in Amazon S3 and analyzed using Amazon Athena, providing deep insights into potential cost savings. Then, the results are visualized through Amazon QuickSight, enabling stakeholders to collaborate effectively and make informed, data-driven decisions, as shown in the following figure.

Graviton Savings Dashboard architecture diagram

Figure 1: Graviton Savings Dashboard architecture diagram

Although the solution typically costs between $50–$100 per month, the potential return on investment is substantial. The dashboard often identifies measurable cost savings that significantly outweigh its operational expenses. Moreover, it offers additional productivity benefits by streamlining the process of adopting Graviton, saving valuable time and effort for your team. For a detailed breakdown of the dashboard’s cost structure, we invite you to explore our comprehensive Graviton Savings Dashboard Cost Breakdown guide.

Deployment

The Graviton Savings Dashboard is part of the Cloud Intelligence Dashboards framework. You can deploy it using AWS CloudFormation Templates and a ‘cid-cmd’ command line tool. Prior to deploying the dashboard, make sure that you’ve met the prerequisites. These include the following:

  1. Setting up your AWS CUR: We highly recommend that you complete Steps 1 and 2 from the Cloud Intelligence Dashboard Deployment Guide. This makes sure that your CUR is set up with settings that allow for easy installation and troubleshooting if necessary.
  2. Setting up the Inventory Collector Module of the Optimization Data Collection lab: This provides automation to collect metadata and pricing for Amazon RDS, ElastiCache, and OpenSearch for all accounts in your AWS Organizations and AWS Regions.
  3. Preparing QuickSight: If you’re an existing QuickSight user, then you can skip this step. If not, then you must complete Step 3.1 to Prepare QuickSight.

When the prerequisites are in place, you can deploy the dashboard by running three simple commands (shown as follows) using a terminal application with permissions to run API requests in your AWS account.

python3 -m ensurepip --upgrade

pip3 install --upgrade cid-cmd

cid-cmd deploy --dashboard-id graviton-savings

For detailed instructions about the deployment and prerequisites, refer to the AWS Well-Architected Cost Optimization lab.

Examining the results: unlocking insights from your Graviton Savings Dashboard

Now that you’ve successfully deployed the dashboard, we can explore its powerful features and uncover valuable insights. As you read through this section, we encourage you to interact with your dashboard to familiarize yourself with the dashboard’s intuitive interface and functionality.

The Graviton Savings Dashboard is organized into service-specific tabs, each containing two key sections:

Current Graviton Usage and Savings (top section): This section highlights the tangible benefits you’ve already achieved by migrating workloads to Graviton. You can explore the following:

    • Monthly Graviton adoption trends
    • Usage distribution across different accounts, Regions, and processor types
    • Popular Graviton instance families
    • Unit cost trends
    • Realized Graviton savings

These metrics are calculated by comparing your Graviton usage to comparable non-Graviton instances, which provides a clear picture of your cost optimization efforts, as shown in the following figure.

Current Amazon EC2 Graviton Usage and Savings

Figure 2: Current Amazon EC2 Graviton Usage and Savings

Potential Graviton Savings Opportunities (bottom section): This section identifies areas where you can further optimize costs by adopting Graviton instances. It provides the following:

  • Actionable migration insights
  • Estimated implementation effort
  • Potential savings breakdowns by account, instance family/type, OS, and purchase option

These insights compare potential Graviton savings across various attributes, enabling targeted decision-making for future Graviton migrations and cost optimizations, as shown in the following figure.

Amazon EC2 Graviton Opportunity

Figure 3: Amazon EC2 Graviton Opportunity

Using dashboard insights: a FinOps team use case

In this section we explore a use case where you, as the lead of the Cloud Center of Excellence Team, use the insights from this dashboard to address concerns raised by your Chief Technology Officer (CTO).

Your CTO at your company approaches you with the following questions:

  1. Is our organization using the price-performance benefits of Graviton-based EC2 instances?
  2. How does our Graviton usage and spend and savings compare to other processor types within our overall EC2 compute spend?

Step 1: Initial analysis

You begin generating summary reports from the Current Graviton Usage (Figure 2) and Graviton Opportunity (Figure 3) sections of the dashboard. After reviewing these reports, the CTO asks you to engage with the Engineering team to evaluate potential opportunities for increasing Graviton coverage.

Step 2: Engaging with Engineering on Graviton Migration

When presenting the summary reports to the engineering manager, they expressed interest in understanding the effort level required for this project. This information can help them allocate resources and prioritize workloads, thus identifying what can be started in the short-term and what needs additional planning.

Step 3: Detailed analysis

As shown in the following figure, the Engineering team can focus on identifying candidate workloads with the most significant savings impact by segmenting the dashboard data by:

  • Implementation efforts
  • Linked accounts
  • Regions
  • Instance types
  • Operating systems

Amazon EC2 Graviton opportunity breakdown

Figure 4: Amazon EC2 Graviton opportunity breakdown

Furthermore, the team can use the dashboard to determine comparable Graviton-based instances for migration and their potential savings, as shown in the following figure.

Potential graviton Savings Details

Figure 5: Potential graviton Savings Details

Step 4: Tracking progress

Over time, the FinOps team and Engineering team can showcase the Graviton migration successes by highlighting the increasing Graviton coverage and realized savings using the dashboard’s charts (Figure 2).

Broader application:

Although this post primarily focuses on EC2 instance migration, the dashboard also provides similar insights for AWS managed services such as Amazon RDS, ElastiCache, and OpenSearch. Individual tabs with visualizations guide your Graviton adoption across these services, as shown in the following figure.

Potential graviton savings details

Figure 6: Graviton Savings Dashboard

As demonstrated by this use case, the Graviton Savings Dashboard enables various stakeholders in an organization to collaborate effectively, which leads to efficient outcomes and potential cost savings.

Conclusion

In summary, we showed how the Graviton Savings Dashboard provides clear insights into suitable workloads for Graviton migration, offers easy-to-understand visualizations for monitoring adoption, and automates resource matching and savings calculations. Streamlining the process of identifying and implementing cost-saving opportunities with Graviton-based instances means that the dashboard enables more informed decision-making about your AWS infrastructure. To learn more and get started with the Graviton Savings Dashboard, visit the Graviton Savings Dashboard page and take the first step toward more efficient and cost-effective cloud computing.

New Amazon EC2 P5en instances with NVIDIA H200 Tensor Core GPUs and EFAv3 networking

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/new-amazon-ec2-p5en-instances-with-nvidia-h200-tensor-core-gpus-and-efav3-networking/

Today, we’re announcing the general availability of Amazon Elastic Compute Cloud (Amazon EC2) P5en instances, powered by NVIDIA H200 Tensor Core GPUs and custom 4th generation Intel Xeon Scalable processors with an all-core turbo frequency of 3.2 GHz (max core turbo frequency of 3.8 GHz) available only on AWS. These processors offer 50 percent higher memory bandwidth and up to four times throughput between CPU and GPU with PCIe Gen5, which help boost performance for machine learning (ML) training and inference workloads.

P5en, with up to 3200 Gbps of third generation of Elastic Fabric Adapter (EFAv3) using Nitro v5, shows up to 35% improvement in latency compared to P5 that uses the previous generation of EFA and Nitro. This helps improve collective communications performance for distributed training workloads such as deep learning, generative AI, real-time data processing, and high-performance computing (HPC) applications.

Here are the specs for P5en instances:

Instance size vCPUs Memory (GiB) GPUs (H200) Network bandwidth (Gbps) GPU Peer to peer (GB/s) Instance storage (TB) EBS bandwidth (Gbps)
p5en.48xlarge 192 2048 8 3200 900 8 x 3.84 100

On September 9, we introduced Amazon EC2 P5e instances, powered by 8 NVIDIA H200 GPUs with 1128 GB of high bandwidth GPU memory, 3rd Gen AMD EPYC processors, 2 TiB of system memory, and 30 TB of local NVMe storage. These instances provide up to 3,200 Gbps of aggregate network bandwidth with EFAv2 and support GPUDirect RDMA, enabling lower latency and efficient scale-out performance by bypassing the CPU for internode communication.

With P5en instances, you can increase the overall efficiency in a wide range of GPU-accelerated applications by further reducing the inference and network latency. P5en instances increases local storage performance by up to two times and Amazon Elastic Block Store (Amazon EBS) bandwidth by up to 25 percent compared with P5 instances, which will further improve inference latency performance for those of you who are using local storage for caching model weights.

The transfer of data between CPUs and GPUs can be time-consuming, especially for large datasets or workloads that require frequent data exchanges. With PCIe Gen 5 providing up to four times bandwidth between CPU and GPU compared with P5eand P5e instances, you can further improve latency for model training, fine-tuning, and running inference for complex large language models (LLMs) and multimodal foundation models (FMs), and memory-intensive HPC applications such as simulations, pharmaceutical discovery, weather forecasting, and financial modeling.

Getting started with Amazon EC2 P5en instances
You can use EC2 P5en instances available in the US East (Ohio), US West (Oregon), and Asia Pacific (Tokyo) AWS Regions through EC2 Capacity Blocks for ML, On Demand, and Savings Plan purchase options.

I want to introduce how to use P5en instances with Capacity Reservation as an option. To reserve your EC2 Capacity Blocks, choose Capacity Reservations on the Amazon EC2 console in the US East (Ohio) AWS Region.

Select Purchase Capacity Blocks for ML and then choose your total capacity and specify how long you need the EC2 Capacity Block for p5en.48xlarge instances. The total number of days that you can reserve EC2 Capacity Blocks is 1–14, 21, or 28 days. EC2 Capacity Blocks can be purchased up to 8 weeks in advance.

When you select Find Capacity Blocks, AWS returns the lowest-priced offering available that meets your specifications in the date range you have specified. After reviewing EC2 Capacity Blocks details, tags, and total price information, choose Purchase.

Now, your EC2 Capacity Block will be scheduled successfully. The total price of an EC2 Capacity Block is charged up front, and the price does not change after purchase. The payment will be billed to your account within 12 hours after you purchase the EC2 Capacity Blocks. To learn more, visit Capacity Blocks for ML in the Amazon EC2 User Guide.

To run instances within your purchased Capacity Block, you can use AWS Management Console, AWS Command Line Interface (AWS CLI) or AWS SDKs.

Here is a sample AWS CLI command to run 16 P5en instances to maximize EFAv3 benefits. This configuration provides up to 3200 Gbps of EFA networking bandwidth and up to 800 Gbps of IP networking bandwidth with eight private IP address:

$ aws ec2 run-instances --image-id ami-abc12345 \
  --instance-type p5en.48xlarge \
  --count 16 \
  --key-name MyKeyPair \
  --instance-market-options MarketType='capacity-block' \
  --capacity-reservation-specification CapacityReservationTarget={CapacityReservationId=cr-a1234567}
--network-interfaces "NetworkCardIndex=0,DeviceIndex=0,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa" \
"NetworkCardIndex=1,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=2,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=3,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=4,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa" \
"NetworkCardIndex=5,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=6,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=7,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=8,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa" \
"NetworkCardIndex=9,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=10,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=11,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=12,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa" \
"NetworkCardIndex=13,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=14,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=15,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=16,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa" \
"NetworkCardIndex=17,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=18,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=19,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=20,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa" \
"NetworkCardIndex=21,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=22,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=23,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=24,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa" \
"NetworkCardIndex=25,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=26,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=27,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=28,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa" \
"NetworkCardIndex=29,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=30,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only" \
"NetworkCardIndex=31,DeviceIndex=1,Groups=security_group_id,SubnetId=subnet_id,InterfaceType=efa-only"
...

When launching P5en instances, you can use AWS Deep Learning AMIs (DLAMI) to support EC2 P5en instances. DLAMI provides ML practitioners and researchers with the infrastructure and tools to quickly build scalable, secure, distributed ML applications in preconfigured environments.

You can run containerized ML applications on P5en instances with AWS Deep Learning Containers using libraries for Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS).

For fast access to large datasets, you can use up to 30 TB of local NVMe SSD storage or virtually unlimited cost-effective storage with Amazon Simple Storage Service (Amazon S3). You can also use Amazon FSx for Lustre file systems in P5en instances so you can access data at the hundreds of GB/s of throughput and millions of input/output operations per second (IOPS) required for large-scale deep learning and HPC workloads.

Now available
Amazon EC2 P5en instances are available today in the US East (Ohio), US West (Oregon), and Asia Pacific (Tokyo) AWS Regions and US East (Atlanta) Local Zone us-east-1-atl-2a through EC2 Capacity Blocks for ML, On Demand, and Savings Plan purchase options. For more information, visit the Amazon EC2 pricing page.

Give Amazon EC2 P5en instances a try in the Amazon EC2 console. To learn more, see Amazon EC2 P5 instance page and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.

Channy

NEW: Simplifying the use of third-party block storage with AWS Outposts

Post Syndicated from Rachel Zheng original https://aws.amazon.com/blogs/compute/new-simplifying-the-use-of-third-party-block-storage-with-aws-outposts/

This post is written by Kate Sposato, Senior Solutions Architect, EC2 Edge Compute

AWS is excited to announce deeper collaboration with industry-leading storage solutions to streamline the use of third-party storage with AWS Outposts. You can now attach and use external block data volumes from NetApp® on-premises enterprise storage arrays and Pure Storage® FlashArray™ directly from the AWS Management Console.

Outposts is a fully managed service that extends AWS infrastructure, AWS services, APIs, and tools to customer premises. By providing local access to AWS managed infrastructure, Outposts allows you to build and run applications on premises using the same application programming interfaces (APIs) as in AWS Regions. Moreover, this is done while using local compute and storage resources to meet lower latency and local data processing needs. Outposts is available in various rack and server form factors.

Many of you have block storage systems running in your on-premises environments that provide advanced data storage and management features—such as snapshots, replication, and encryption—to protect data integrity and security. There are various uses cases that would predicate you needing to access data through these external volumes backed by external storage systems from an application running in Amazon Elastic Compute Cloud (Amazon EC2) instances on Outposts. These include: regulatory auditing requirements, government and local regulation compliance, high data durability and resiliency requirements, low-latency data access, and migration of on-premises applications that are tightly coupled with existing external storage systems. To make it easier for you to use external volumes with Outposts, AWS has validated a broad range of third-party storage solutions through the AWS Outposts Ready Program. With this program, you can easily identify storage solutions that are tested to run with Outposts.

Today, we are taking our integration with storage solutions from NetApp and Pure Storage to the next level. Outposts now has a simplified and automated way to launch EC2 instances with attached block storage from external infrastructure through the AWS Management Console. The new integration includes automated user script generation and attachment of data volumes to EC2 instances running on 42U Outposts racks and 2U Outposts servers. This integration reduces the friction associated with using the advanced data management and security features of external storage infrastructure in combination with Outposts, allowing you to create a resilient, compliant, and optimized storage and compute infrastructure.

Outposts rack storage and networking overview

Outposts racks support Amazon Elastic Block Store (Amazon EBS) volumes for EC2 instances, which provide persistent local block storage.

EC2 instances running on Outposts racks can access data stored on external block storage arrays over the Outposts local gateway (LGW). An LGW enables connectivity between the Outpost subnets, where EC2 instances run, and the on-premises network. It carries storage traffic between the EC2 instances running on the Outposts rack and the local network. The LGW is created by AWS as part of the Outposts rack installation process. Each Outposts rack supports a single LGW.

The following diagram shows an EC2 instance running on an Outposts rack with an elastic network interface (ENI) and LGW configured for instance connectivity. An external storage array communicates with the EC2 instance running on the Outposts rack through the Outpost network devices (ONDs). Customer Network Devices (CNDs) that connect to EC2 instances running on Outposts racks need to support the following:

  • Link aggregation: connections to the Outposts rack network devices are added to a link aggregation group (LAG).
  • VLANs: Virtual LANs (VLANs) are configured between each Outposts rack TOR device and any customer devices, including data stores.;
  • Dynamic routing: Border Gateway Protocol (BGP) is configured between the CND and the OND for each VLAN. Two total BGP sessions are shown in the following diagram between devices.

Figure 1. Outposts rack and Amazon EC2 networking architecture

Figure 1. Outposts rack and Amazon EC2 networking architecture

Outposts server storage and networking overview

Outposts servers come with internal NVMe SSD-based high-performance instance storage. Similar to AWS Regions, instance storage is allocated directly to the EC2 instance and follows the lifecycle of the instance. For example, if an EC2 instance is terminated, then the instance storage associated with the instance is also deleted. If you want data to persist after the instance is terminated, you can use external storage solutions to complement the instance storage included with Outposts servers.

Outposts servers have a local network interface (LNI). This logical networking component connects the EC2 instances running on the Outposts servers subnet to the on-premises network and allows communication to other on-premises storage, compute, and networking appliances.

To support the Amazon EC2 on Outposts to external storage array integration, an LNI must be created then added to the EC2 instance during instance launch. An LNI can only be created through the AWS Command Line Interface (AWS CLI) or the AWS software development toolkit (SDK) using the following command. The subnet id is the Outposts server subnet and the device index should be unique to the subnet.

aws ec2 modify-subnet-attribute --subnet-id <subnet id> --enable-lni-at-device-index <device index>

In the on-premises network, you must have a Network Interface Card (NIC) at the same device index that you specified when running the preceding CLI command.

Further detailed steps for this workflow are listed in the Outposts server user guide.

When the local network interfaces are enabled on an Outpost subnet, the EC2 instances in the Outpost subnet can be configured to include this LNI in addition to the ENI. The LNI connects to the on-premises network while the ENI connects to the VPC.

The following diagram shows an EC2 instance running on an Outposts server with both an ENI and LNI configured for instance connectivity. There is an external storage array connected to the Outposts server using a CND through NVMe-over-TCP or iSCSI protocol. Figure 2. Outposts server and Amazon EC2 networking architecture

Figure 2. Outposts server and Amazon EC2 networking architecture

Supported operating systems and AWS Support

The rest of this post covers the steps for how to launch an EC2 instance running on an Outposts 2U server or Outposts rack with a connected external block storage volume for local data access from within the EC2 instance. The current release of this feature supports EC2 instances running Microsoft Windows Server 2022 and Red Hat Enterprise Linux 9 (RHEL9) based operating systems.

Support for Outposts and all Outposts integration features, including this one, needs an active AWS Enterprise Support Plan or AWS Enterprise On-Ramp Support Plan. Support for external storage arrays and configurations can be obtained from the respective storage vendor and may need an additional support plan depending on the vendor and the storage solution implemented.

This post assumes you’re familiar with the basic functionality of Outposts servers and Outposts rack. If you would like to learn more about the Outposts family in general, then the user guide, What is AWS Outposts?, is a great place to start.

Solution deployment

The following sections outline the solution deployment.

Prerequisites:

  1. An Outposts 2U server or Outposts rack is provisioned, activated, and connected to the customer network.
  2. A block storage array is connected on the same network and accessible to Outposts subnets.
  3. A block data volume is configured and running on the storage array. The unique identifier for this volume is necessary for launching the EC2 instance on the Outpost. The volume must remain provisioned after initial provisioning on the storage array.
  4. The IP address and port number (optional for iSCSI connections) of the block storage volume, which is necessary for launching the EC2 instance on the Outpost.

Deployment architecture overview

The following deployment architecture shows the workflow attaching an external storage array to an Outpost, launching an EC2 instance through the AWS Management Console, and accessing the data on the external storage array from within the EC2 instance running on the Outpost.Figure 3. Third-party block storage on Outposts architecture overview

Figure 3. Third-party block storage on Outposts architecture overview

Deployment steps for NVMe-over-TCP connections

1. (Prerequisite) If there is no block data volume already running and configured on the compatible storage array, this must be completed in the storage solution’s interface before moving to Step 2.

a. Create an NVMe device, subsystem, and namespace for the block data volume.

b. Optionally, generate a host NQN that is used for the EC2 instance connection, and add it to the allow list for the appropriate subsystems.

c. The following pieces of information are used in later steps:

i. Host NQN: Unique identifier of the EC2 instance for attachment;

ii. Target IP: Address of the connected block volume host;

iii. Target Port Number: Port number of the connected block volume host.

You can learn more about launching and configuring external storage arrays in the Outposts family documentation or in the respective storage array vendor documentation.

2. In the Console, navigate to EC2 Launch Instance Wizard by choosing EC2, Instances, Launch instances.

a. Name the instance and add any desired tags to be applied at launch.

b. Choose the desired, compatible RHEL9 based Amazon Machine Image (AMI) from the list, or choose one from the AWS Marketplace.

c. Choose the desired EC2 Instance type.

d. Expand the Network settings section and select Edit. Choose the VPC and subnet of the target Outpost.

i. Outposts servers only: You must create an LNI in the Advanced Network settings before launching the instance.

e. Expand Advanced network configuration and select Add network device. Continue to add network devices until the Device index is equal to the volume index.

Figure 4. Advanced network configurationFigure 4. Advanced network configuration

f. Expand Configure storage and select Edit next to External storage volumes settings section and choose NVMe/TCP in Storage network protocol.

Figure 5. External storage volumes configuration

g. Enter the HostNQN in the format provided for the NVMe/TCP data volume. Make sure that the HostNQN used has been added to the storage array subsystem allow list.

h. Select Add NVMe/TCP Discovery Controller and enter the IP address and port of the controller from the storage array. Enter 4420 as the Target Port, if the target port is unknown.

i. (Optional) You can add more data volumes that use a different target discovery controller at this time by choosing the Add NVMe/TCP Data Volume button under the Target IP address. Repeat Steps 2.h for each data volume to be attached to the EC2 instance.

j. Expand the Advanced details and provide any additional Amazon EC2 behavior settings as appropriate.

k. At the bottom of the Advanced details section is the automatically generated User data. If you need to manually edit this data, you can do so by selecting Edit at the bottom.

Figure 6. Automatically generated user data file

l. When the configurations are set, choose the Launch instance button in the right-side column.

3. The EC2 Launch Instance Wizard now launches an EC2 instance configured as described on the Outpost and attaches the desired external data volume(s) to the EC2 instance.

4. Applications and users can access the data on the attached external volumes from within the EC2 instance. To verify this:

a. From within the launched EC2 instance, run sudo nvme list

b. The volumes are displayed as /dev/nvme1n1 with the number increasing for each attached volume. Local instance store volumes on Outposts servers and EBS boot volumes on Outposts racks are listed first. External volumes are listed after those with sequentially increasing node numbers.

5. External storage volume and array management, configuration, and backups continue to be managed through the storage vendor-provided toolkit. You can find more information on external storage management in the respective storage array vendor documentation.

Deployment steps for iSCSI connections

1. (Prerequisite) If there is no block data volume already running and configured on the compatible storage array, this must be completed in the storage solution’s interface before moving to Step 2.

a. Create an Initiator group (igroup) and add the Initiator IQN to the igroup. Then map the logical unit number (LUN) to the igroup.

b. Optionally, generate an initiator IQN that is used for the EC2 instance connection, and add it to the allow list for the appropriate subsystems.

c. The following pieces of information are used in later steps:

i. Initiator IQN: Unique identifier of the EC2 instance for attachment;

ii. Target IQNs: Unique identifier of the storage virtual machine (SVM);

iii. Target IP: Address of the connected block volume host;

iv. (Optional) Target Port Number: Port number of the connected block volume host.

You can learn more about launching and configuring external storage arrays in the Outposts family documentation or in the respective storage array vendor documentation.

2. In the Console, navigate to EC2 Launch Instance Wizard by choosing EC2, Instances, Launch instances.

a. Name the instance and add any desired tags to be applied at launch.

b. Choose the desired, compatible RHEL9 or Windows Server 2022 based AMI from the list, or purchase one from the AWS Marketplace.

c. Choose the desired EC2 Instance type.

d. Expand the Network settings section and choose the VPC and subnet of the target Outpost.

i. Outposts servers only: You must create an LNI in the Advanced Network settings before launching the instance.

e. Expand Advanced network configuration and select Add network device. Continue to add network devices until the Device index is equal to the volume index.

Figure 7. Advanced network configurationFigure 7. Advanced network configuration

f. Expand Configure storage and select Edit next to External storage volumes settings section and choose iSCSI in Storage network protocol.

Figure 8. External storage volumes configurationFigure 8. External storage volumes configuration

g. Enter the Initiator IQN for the iSCSI data volume in the format provided. Make sure that the Initiator IQN used has been added to the allow list for the volume.

h. Select Add iSCSI Target and enter the Target IP, Target Port, and Target IQN of the storage array. Enter 4420 for the Target Port, if the target port is unknown.

i. (Optional) You can add additional data volumes with a different Target IQN at this time by selecting the Add iSCSI Target button under the Target IP address. Repeat Steps 2.h for each data volume to be attached to the EC2 instance.

j. Expand the Advanced details and provide any additional Amazon EC2 behavior settings as appropriate.

k. At the bottom of the Advanced details section is the automatically generated User data. If you need to manually edit this data, you can do so by selecting Edit at the bottom.

Figure 9. Automatically generated user data fileFigure 9. Automatically generated user data file

l. When the configurations are set, choose the Launch instance button in the right-side column.

3. The EC2 Launch Instance Wizard now launches an EC2 instance configured as described on the Outpost and attaches the desired external data volume(s) to the EC2 instance.

4. Applications and users can access the data on the attached external volumes from within the EC2 instance. To verify this:

a. From within the launched EC2 instance, run iscsiadm -m session -P3

b. The volumes are displayed as /dev/sd0 with the number increasing for each attached volume.

5. External storage volume and array management, configuration, and backups continue to be managed through the storage vendor-provided toolkit. You can find more information on external storage management in the respective storage array vendor documentation.

Conclusion

This integration offers a streamlined workflow to attach and utilize external block data volumes on Outposts directly through the AWS Management Console, eliminating manual processes. It provides the full benefits of advanced data infrastructure from trusted storage providers in conjunction with the security, reliability, and scalability of AWS managed infrastructure. This helps you accelerate cloud migration with dependencies on third-party storage and realize the full potential of your on-premises data.

To learn more about this integration, visit the NetApp on-premises enterprise storage arrays for AWS Outposts solution page and the Pure Storage FlashArray for AWS Outposts blog post. To discuss your external storage needs with an Outposts expert, submit this form. If you are attending AWS re:Invent 2024, make sure to check out the NetApp booth (booth #1748) and Pure Storage booth (booth #454) to connect with our partner specialists.

Introducing storage optimized Amazon EC2 I8g instances powered by AWS Graviton4 processors and 3rd gen AWS Nitro SSDs

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/introducing-storage-optimized-amazon-ec2-i8g-instances-powered-by-aws-graviton4-processors-and-3rd-gen-aws-nitro-ssds/

Today, we’re announcing the general availability of Amazon EC2 I8g instances, a new storage optimized instance type to provide the highest real-time storage performance among storage-optimized EC2 instances with the third generation of AWS Nitro SSDs and AWS Graviton4 processors.

AWS Graviton4 is the most powerful and energy efficient processor we have ever designed for a broad range of workloads running on EC2 instances using a 64-bit ARM instruction set architecture. AWS Nitro System SSDs are custom built by AWS and offer high I/O performance, low latency, minimal latency variability, and security with always-on encryption.

EC2 I8g instances are the first instance type to use third-generation AWS Nitro SSDs. These instances offer up to 22.5 TB local NVME SSD storage with up to 65 percent better real-time storage performance per TB and 60 percent lower latency variability compared to the previous generation I4g instances. Based on the AWS Graviton4 processors, I8g instances deliver up to 60 percent better compute performance and two times larger caches compared to I4g.

I8g instances offer up to 96 vCPUs, 768 GiB of memory, and 22.5 TB of storage to deliver more compute and storage choices compared with I4g instances.

Instance name vCPUs Memory (Gib) Storage (GB) Network bandwidth (Gbps) EBS bandwidth (Gbps)
I8g.large 2 16 468 up to 10 up to 10 Gbps
I8g.xlarge 4 32 937 up to 10 up to 10 Gbps
I8g.2xlarge 8 64 1,875 up to 12 up to 10 Gbps
I8g.4xlarge 16 128 3,750 up to 25 up to 10 Gbps
I8g.8xlarge 32 256 7,500
(2 x 3,750)
up to 25 10 Gbps
I8g.12xlarge 48 384 11,520
(3 x 3,750)
up to 28.125 15 Gbps
I8g.16xlarge 64 512 15,000
(4 x 3,750)
up to 37.5 20 Gbps
I8g.24xlarge 96 768 22,500
(6 x 3,750)
up to 56.25 20 Gbps
I8g.metal-24×1 96 768 22,500
(6 x 3,750)
up to 56.25 30 Gbps

You can use I8g instances for I/O intensive workloads that require low latency access to data such as transactional databases (MySQL and PostgreSQL), real-time databases, NoSQL databases, (Aerospike, Apache Druid, MongoDB) and real-time analytics such as Apache Spark.

Additionally, I8g instances are built on the AWS Nitro System, which offloads CPU virtualization, storage, and networking functions to dedicated hardware and software to enhance the performance and security of your workloads. The Graviton4 processors offer you enhanced security by fully encrypting all high-speed physical hardware interfaces.

Things to know
Here are some things that you should know about EC2 I8g instances:

  • Operating system – EC2 I8g instances support Amazon Linux 2023, Amazon Linux 2, CentOS Stream 8 or newer, Ubuntu 18.04 or newer, SUSE 15 SP2 or newer, Debian 11 or newer, Red Hat Enterprise 8.2 or newer, CentOS 8.2 or newer, FreeBSD 13 or newer, Rocky Linux 8.4 or newer, Alma Linux 8.4 or newer, and Alpine Linux 3.12.7 or newer.
  • Networking – You can use I8g instances in storage intensive workloads that typically have burst network usage patterns. All I8g instance sizes have burst network bandwidth and can burst more than 60 minutes, depending on the instance sizes, to support the majority of the workloads requiring instance storage data hydration, backup, and snapshot over the network.
  • Migration – If you’re using I4g instances now, you will have straightforward experience migrating storage intensive workloads to I8g instances because these instances offer similar memory and storage ratios to existing I4g instances.

Now available
Amazon EC2 I8g instances are now available in the US East (N. Virginia) and US West (Oregon) AWS Regions through On-Demand instances, Savings Plans, Spot Instances, Dedicated Instances, or Dedicated Hosts.

Give EC2 I8g instances a try in the Amazon EC2 console. To learn more, visit the EC2 I8g instances page and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.

Channy

Faster scaling with Amazon EC2 Auto Scaling Target Tracking

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/faster-scaling-with-amazon-ec2-auto-scaling-target-tracking/

This post is written by Shahad Choudhury, Senior Cloud Support Engineer and Tiago Souza, Solutions Architect

Introduction

One of the key benefits of the AWS cloud is elasticity. It enables our users to provision and pay only for resources they need. To fully use the elasticity benefits, users needed a mechanism that is automated and can be widely operated with ease. Amazon EC2 Auto Scaling solves these challenges by helping our users automatically scale the number of Amazon Elastic Compute Cloud (Amazon EC2) instances to meet the changing workload demands, and it offers a wide suite of capabilities to manage the instance’s lifecycle.

To scale their Auto Scaling groups (ASG), users need to create scaling policies. Scaling policies provide ASGs with guidelines for adjusting Amazon EC2 capacity to match the workload demand. There are different types of scaling policies, with each having a different approach to manage capacity. One type of policy is Target Tracking, which offers a simpler yet effective way to scale automatically. To use it, users need to define a utilization metric and set a target value to maintain. For example, setting a 60% Average CPU Utilization policy causes the ASG to keep the metric as close to that value as possible across its fleet of EC2 instances.

In this post, we describe the recently released updates to Target Tracking. We also walk through the steps to create a Target Tracking policy that uses the new feature, and highlight the improvements and benefits users can expect from this new feature.

What’s new with Target Tracking policy

As users modernized their applications, we learned from them that a dynamic Auto Scaling solution must expand beyond our original implementation of the Target Tracking policy.

First, users found that the few minutes Target Tracking took to respond to a demand spike could lead to short-term performance degradation. We’ve seen many users mitigate this challenge by buffering their running capacity, leading to increased costs. Second, different workloads have different scaling requirements. This leads to users having to create tailored scaling policies for each workload, which is a time consuming, error prone, and operationally expensive activity for performance and cost optimizations.

To address these user challenges, we released an intelligent and highly responsive Target Tracking scaling policy. Target Tracking now automatically tunes its responsiveness to the unique usage patterns of individual applications and closely monitors application demand for faster scaling decisions. Automatic tuning allows users to enhance their application performance and maintain high usage for their Amazon EC2 resources to save costs without having to create tailored scaling policies for each workload. Users must specify a target utilization they want to maintain, and Target Tracking scales without any further input needed from users.

For faster auto scaling decisions, users can configure Target Tracking policies using high-resolution metrics in Amazon CloudWatch. This fine-grained monitoring allows Target Tracking to detect and respond to changing demand, not in minutes, but in seconds. This capability is ideal for applications that have volatile demand patterns, such as client-serving APIs, live streaming services, e-commerce websites, and on-demand data processing.

Getting started with the new Target Tracking policy

If you’re already using Target Tracking policies, then no action is necessary for you to upgrade to Target Tracking that automatically tunes itself. Target Tracking policies regularly analyze targeted metric history and determine the appropriate level of sensitivity to initiate scale-outs and scale-ins. Furthermore, it determines the amount of capacity that must be added or removed to optimize both availability and lower cost. These decisions depend on the unique characteristics of the application’s demand patterns, such as the range and frequency of demand changes, and whether spikes in usage are long or short-lived. Target Tracking continues to learn on an ongoing basis, and reevaluates itself to automatically adapt for your specific application and demand patterns.

Enabling faster scaling response from Target Tracking

Moreover, to enable the fastest response from Target Tracking policies, users can track metrics published at sub-minute granularity to CloudWatch (also known as high-resolution CloudWatch metrics). Users can update an existing Target Tracking policy or create a new one with a high-resolution metric as part of a CustomizedMetricSpecification. Users must describe the same metric namespace, metric name, and any dimension(s) and/or unit created when publishing the metric to CloudWatch. They must also define the metric period to indicate the metric granularity at which target tracking should evaluate the metric. The following steps walk you through how to get started on the AWS Management Console for ASG:

Step 1: Choose the ASG

In the console, choose the name of the ASG. This takes you to the Details page, as shown in the following figure.

List of ASGs in the Amazon EC2 console Auto Scaling section

Figure 1: In the Amazon EC2 console, choose the ASG that you want to scale

Choose the Automatic scaling tab that gives you the option to Create a dynamic scaling policy, as shown in the following figure.

Step 2: Create dynamic scaling policy

Step 2: Create dynamic scaling policy

Choose the target tracking policy as the policy type. For Metric Type, choose Custom CloudWatch metric. This shows a prefilled JSON snippet that you can edit to specify the metric name, namespace, and dimensions of the metric that you want to scale using the Target Tracking policy that you used to publish the CloudWatch metric, as shown in the following figure.

Figure 3: Updated CustomizedMetricSpecification section added to the Auto Scaling Console

Figure 3: Updated CustomizedMetricSpecification section added to the Auto Scaling Console

The minimum Period supported is ten seconds. To use the ten second metric periods, your metric should be published at a ten second or higher resolution, for example at one second. However, publishing at one second intervals can substantially increase your CloudWatch cost. We discuss the cost considerations later in this post. Auto Scaling imposes a limit of 60 seconds to make sure that Target Tracking can observe and respond to usage spikes quickly.

These two steps allow you to enable target tracking to scale on a high resolution metric.

Enabling faster scaling impact:

The preceding steps allow the ASG to detect changes in your utilization faster, thus it can add more instances when demand spikes.

In the following diagram, we see the results of running identical load tests against an environment with a default target tracking policy of a 60 second period and a target tracking policy configured with a ten second period. Each policy has a target value of 60% CPU Utilization. The load test ramps up to 20 threads over three minutes each sending http requests to simulate a spike in demand. We can see that, in the 60 second period case (the left diagram) there were three minutes where the application was above the CPU Utilization target of 60% (blue line). The capacity (green line) increased only after the system had reached a peak of 100% CPU Utilization. This may lead to application performance issues and, to avoid that, users would have to aim for lower utilization level so that more capacity can be provisioned, which would increase their cost. However, with the ten second periods (the right diagram), scaling happened rapidly to avoid application impact. The capacity increased after one minute, during which CPU Utilization remained closer to 60% and didn’t hit the peak 100% level. This allows users to reach a higher utilization level, saving the cost without impacting the application performance.

Two graphs of CPU Utilization and Auto Scaling InService Capacity comparing scaling based on 60 seconds period as opposed to 10 seconds period. In the first case, the CPU approaches 100% for multiple minutes before scaling occurs to bring CPU down. Comparatively, when scaling with a 10 second period, the CPU increases over two minutes but remained closer to the 60% target throughout.

Figure 4: Target tracking policy with 60 second periods as opposed to 10 seconds

Considerations

Before applying high resolution custom metrics, we recommend that you consider the following factors as they may impact your costs.

Metric types: Target Tracking assumes that metrics change proportionally to the number of instances in the ASG. Selecting the right metric is key for successful Target Tracking policies. Refer to the Target Tracking public documentation for more details.

Pricing: There is no further charge for EC2 Auto Scaling, including these new features. Users pay only for the AWS resources needed to run their applications and CloudWatch monitoring fees. However, you must understand the three CloudWatch billing items relevant to these features:

1) High-resolution alarms

2) API calls

3) Custom metrics

Target Tracking creates at least two alarms, one each to track high and low usage with a buffer in between their thresholds to reduce oscillation. If the metric period is less than sixty seconds, these alarms are billed as high-resolution alarms. As of this writing, the price for high-resolution alarm for the AWS US East (Ohio) Region is $0.30 per alarm metric as compared to $0.10 per alarm metric for standard resolution alarms.

If you’re using CloudWatch Agent, it sends API calls from each instance based on the metrics_collection_interval setting in the CloudWatch Agent config. Each instance sends an API call once per interval to CloudWatch. In CloudWatch, a metric is defined as a unique combination of a Namespace, MetricName, Dimension(s) (optional), and Unit (optional). Every unique combination of dimensions pushed from the CloudWatch Agent is billed as its custom metric.

The following is an example of expected monthly charges in USD using us-east-2 for an account that has passed the free tier, but is still in the first tier of paid usage (the price reduction for bulk usage). This example assumes an average of ten instances running over the month in an ASG with one target tracking policy where metrics and alarms are configured for ten second intervals.

1) High-resolution alarms:

2 alarms @ $0.30 each = $0.60/month

2) API calls:

10 instances * 30 days * 24 hours * 3600 seconds / 10 second_intervals = 2.592 million API calls

2.592 million API calls * $0.01 per 1,000 requests = $25.92/month

3) Custom metrics:

1 ASG aggregate metric @ $0.30/month = $0.30/month

Total estimate: $26.82/month for a 10 instance ASG

Multiple metrics can be pushed in a single PutMetricData API call. If you decide to configure the CloudWatch Agent to publish more than the single aggregate AutoScalingGroupName metric, then the API charges stay the same until the PutMetricData size limit is hit, and only the Custom metrics charge increases.

For example, if the ASG is running c8g.xlarge instances, then by running one fewer instance due to the higher utilization unlocked by these features, then the monthly cost saving in us-east-2 would be:

1 c8g.xlarge @ $0.15896/hour * 30 days * 24 hours = $114.45/month

Taking away the $26.82/month in estimated CloudWatch costs means a savings of $87.63/month per ASG. This is nearly 8% saving on the EC2 cost in this example.

Template to publish metrics and updating your scaling policies

To help you start publishing high resolution metrics, we have created a sample AWS CloudFormation template. The template provides the scaffolding to demonstrate the new faster scaling period for an existing ASG. It includes installing a CloudWatch agent and publishing the CPU Utilization of the ASG instances to CloudWatch at high resolution. The template also includes a Target Tracking policy, as described in this post.

Instructions on deployment and customization requirements can be found in the AWS Samples Repo for Faster Target Tracking. However, there are a few code snippets in the template that we want to highlight.

First, to install the CloudWatch agent, the template updates the UserData of the Launch Template used with the ASG.

UserData: 
          Fn::Base64: 
            !Sub |
              #!/bin/bash
              yum install amazon-cloudwatch-agent -y
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:/cw-agent-asg-aggregate-cpu -s

This command refers to an AWS Systems Manager parameter holding the Cloudwatch Agent configuration.

The following snippet of the Systems Manager parameter reports the CPU Utilization metric at a 10 second interval to a custom namespace called FasterScalingDemo. The metric is also aggregated with the name of the ASG as a dimension so that you can easily refer to it in CloudWatch.

CloudWatchMetricsSSMParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: cw-agent-asg-aggregate-cpu
      Type: String
      Value: '{"agent":{"metrics_collection_interval":10,"run_as_user":"cwagent"},"metrics":{"force_flush_interval":10,"aggregation_dimensions":[["AutoScalingGroupName"]],"append_dimensions":{"AutoScalingGroupName":"${aws:AutoScalingGroupName}"},"namespace":"FasterScalingDemo","metrics_collected":{"cpu":{"drop_original_metrics":["cpu_usage_active"],"measurement":[{"name":"cpu_usage_active","rename":"CPUUtilization"}]}}}}'
      Tier: Intelligent-Tiering
      Description: Custom metric specification for CloudWatch Agent

Second, the template also includes an updated AWS Identity and Access Management (IAM) Role and corresponding IAM Instance Profile with permissions to PutMetricData to CloudWatch, and to retrieve Systems Manager parameters that we created previously to configure the agent.

IAMInstanceRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      Policies:
        - PolicyName: FasterScalingDemo
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - cloudwatch:PutMetricData
                  - ec2:DescribeTags
                  - ssm:GetParameter
                Resource: '*'
      Tags:
        - Key: Name
          Value: !Sub ${EnvironmentName}_IAMROLE

Finally, the following image depicts the architecture deployed by the CloudFormation template.

Amazon VPC with an ASG launching instances in three Availability Zones, publishing high-resolution metrics to CloudWatch

Figure 4: AWS resources created in the CloudFormation example

When the template is deployed with your chosen ASG, you should be ready to test Target Tracking set with high resolution metrics. You can perform a load test to see Target Tracking in action. The closer the load test mimics your application usage pattern, the more conclusive the test would be in determining the benefits of these features.

Conclusion

This post provides an overview of the updates we have made to the Target Tracking policy that deliver higher precision in matching your demand with Amazon EC2 capacity. Specifically, this post demonstrated the value of using high resolution CloudWatch metrics with Target Tracking to increase the Auto Scaling rate to match demand, improve availability, and open possibilities for better resource utilization. We encourage you to test the feature and apply the consideration factors outlined in this post before opting for high-resolution metric scaling. You can find more details about these new features in the Target Tracking documentation.

Hosting containers at the edge using Amazon ECS and AWS Outposts server

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/hosting-containers-at-the-edge-using-amazon-ecs-and-aws-outposts-server/

This post is written by Craig Warburton, Hybrid Cloud Senior Solutions Architect and Sedji Gaouaou, Hybrid Cloud Senior Solutions Architect

In today’s fast-paced digital landscape, businesses are increasingly looking to process data and run applications closer to the source, at the edge of the network. For those seeking to use the power of containerized workloads in edge environments, AWS Outposts servers offer a compelling solution. This fully managed service brings the AWS infrastructure, services, APIs, and tools to virtually any on-premises or edge location, allowing users to run container-based applications seamlessly across their distributed environments. In this post, we explore how Outposts servers can empower organizations to deploy and manage containerized workloads at the edge, bringing cloud-native capabilities closer to where they’re needed most.

Solution overview

Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that can be used with Outposts servers. This combination allows users to run containerized applications at the edge with the same ease and flexibility as in the AWS cloud.

By using Outposts server with Amazon ECS, users can effectively extend their container-based workloads to the edge, enabling new use cases and improving application performance for latency-sensitive operations.

The following diagram illustrates an example architecture where a user is looking to deploy a microservices based PHP web application and instance based MySQL database. Furthermore, a container based load balancer appliance is used to receive and distribute traffic to the web application container. The example application writes its data to a MySQL database, which is hosted on an external storage array. The application is deployed on the Outpost server, and can communicate with the database across the user data center network.

In this post we will show how users can deploy an example microservice based application. Each section of this post walks through Steps 1 through 4 shown in the following diagram.

Figure 1: Solution overview

Figure 1: Solution overview

Walkthrough

Prerequisites

Before deploying the sample application, you must have ordered, received, and successfully installed an Outposts server. The server is operational and visible in the AWS Management Console.

This walkthrough assumes you have access to Amazon Elastic Container Registry (Amazon ECR) that is used for the container repository.

You need the following AWS Identity and Access Management (IAM) role provisioned with the necessary permissions included in the policy to permit the load balancer to read the required Amazon ECS attributes. Refer to the user guide Create a role to delegate permissions to an IAM user section to help you through creating an IAM role and associated policy. The Amazon ECS task IAM role needs the following policy configuration to read the necessary Amazon ECS information:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LoadBalancerECSReadAccess",
            "Effect": "Allow",
            "Action": [
                "ecs:ListClusters",
                "ecs:DescribeClusters",
                "ecs:ListTasks",
                "ecs:DescribeTasks",
                "ecs:DescribeContainerInstances",
                "ecs:DescribeTaskDefinition",
                "ec2:DescribeInstances",
                "ssm:DescribeInstanceInformation"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

You also need the Amazon ECS task execution IAM role (ecsTaskExecutionRole) that will grant the Amazon ECS container service the necessary permissions to make AWS API calls on your behalf.

Step 1: Setting up Amazon ECS on Outposts server

Amazon ECS is used in this walkthrough to deploy our container workloads to the Outposts server. Before deploying workloads, an ECS cluster on Outposts needs to be created.

In this configuration, the Amazon ECS cluster targets the private subnets (10.0.1.0/24 and 10.0.2.0/24) and the Amazon Elastic Compute Cloud Amazon (EC2) instances configured on the Outpost server for deployments.

To assist in targeting the deployment of our Amazon ECS services to specific instances with an attached Local Network Interface (LNI), our Amazon EC2 instances are assigned a logical role using custom Amazon ECS container instance attributes. Custom attributes are used to configure task placement constraints, as shown in the following figure.

Figure 2: Amazon ECS container instances used for tasks

Figure 2: Amazon ECS container instances used for tasks

One of the container instances is assigned the role of loadbalancer, as shown in the following figure. Follow the developer guide section to Define which container instances Amazon ECS uses for tasks, and add the following custom attribute to one of your instances:

  • Name = role, Value = loadbalancer

Figure 3: Instance with the Custom Attibutes - loadbalancer

Figure 3: Instance with the Custom Attibutes – loadbalancer

The other container instance is assigned the role of webserver, as shown in the following figure. Add the following custom attribute to each of the remaining instance:

  • Name = role, Value = webserver

Figure 4: Instance with the Custom Attibutes - webserver

Figure 4: Instance with the Custom Attributes – webserver

Step 2: Deploying a load balancer with host mode to use LNI

In this section, you deploy a task for the load balancer as seen in Step 2 of the Solution overview.

First, you must enable the private subnet, where your load balancer is deployed, for LNIs:

aws ec2 modify-subnet-attribute \

    --subnet-id subnet-1a2b3c4d \

    --enable-lni-at-device-index 1

Now add an LNI to the container instance with the attibute “loadbalancer”. This instance can now access your local network.

To deploy the load balancer, create an Amazon ECS task definition named “task-definition-loadbalancer.json”, which describes the container configuration to implement the load balancer as followed:

{
    "containerDefinitions": [
        {
            "name": "loadbalancer",
            "image": "traefik:latest",
            "cpu": 0,
            "portMappings": [
                {
                    "containerPort": 80,
                    "hostPort": 80,
                    "protocol": "tcp"
                },
                {
                    "containerPort": 8080,
                    "hostPort": 8080,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "command": [
                "--api.dashboard=true",
                "--api.insecure=true",
                "--accesslog=true",
                "--providers.ecs.ecsAnywhere=false",
                "--providers.ecs.region=<AWS_REGION>",
                "--providers.ecs.autoDiscoverClusters=true",
                "--providers.ecs.clusters=<YOUR_CLUSTER_NAME>",
                "--providers.ecs.exposedByDefault=true"
            ],
            "environment": [],
            "mountPoints": [],
            "volumesFrom": [],
            "systemControls": []
        }
    ],
    "family": "loadbalancer",
    "taskRoleArn": <TASK_ROLE_ARN>,
    "executionRoleArn": <EXECUTION_ROLE_ARN>,
    "networkMode": "host",
    "volumes": [],
    "placementConstraints": [
        {
            "type": "memberOf",
            "expression": "attribute:role == loadbalancer"
        }
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "256",
    "memory": "128",
    "tags": []
}

Replace the string <TASK_ROLE_ARN> with the Amazon Resource Name (ARN) of the IAM role configured with the LoadBalancerECSReadAccess policy and the string <EXECUTION_ROLE_ARN> with the ARN of the IAM role configured with the ecsTaskExecutionRole policy as configured in the Prerequisites section, <AWS_REGION> with the AWS Region where you deployed your ECS cluster, <YOUR_CLUSTER_NAME> with your cluster name.

Some points to consider:

  • The Amazon ECS Network mode is set to “host”. The load balancer task uses the host’s network to access the LNI.
  • The task definition includes the placement constraint matching the loadbalancer custom attribute value.

Lastly, register the task definition with your cluster and create the loadbalancer service using the following AWS Command Line Interface (AWS CLI) command:

aws ecs register-task-definition --cli-input-json file://task-definition-loadbalancer.json

aws ecs create-service--cluster <CLUSTER_NAME> --service-name loadbalancer --task-definition loadbalancer:1 --desired-count 1

Replace the string <CLUSTER_NAME> with the target Amazon ECS cluster name.

The load balancer is now running.

Connecting to the Amazon EC2 instance with the attibute loadbalancer using Session Manager, you can get the following LNI IP address:

Figure 5: Getting the LNI IP

Figure 5: Getting the LNI IP

You can access the web user interface by browsing to the URL from your local network:

http://<HOST_IP>:8080/dashboard/

Replace the string <HOST_IP> with the Amazon EC2 instance host LNI IP address, or DNS hostname.

Step 3: Deploying sample web application in awsvpc mode

First, make sure that the AWSVPC Trunking is turned on, as shown in the following figure:

Figure 6: Enabling AWSVPC Trunking

Figure 6: Enabling AWSVPC Trunking

Create an Amazon ECS task definition for our application named “task-definition-webapp.json”, which describes the container configuration to implement the example web application as followed:

Replace the <PLACEHOLDER> values for your application.

{
    "containerDefinitions": [
        {
            "name": "whoami",
            "image": "<CONTAINER-IMAGE>:latest",
            "cpu": 0,
            "portMappings": [
                {
                    "name": "<WEBAPP>",
                    "containerPort": 80,
                    "hostPort": 80,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [],
            "mountPoints": [],
            "volumesFrom": [],
            "dockerLabels": {
"traefik.http.routers.<WEBAPP>-host.rule":     "Host(`<WEBAPP>.domain.com`)",
               "traefik.http.routers.<WEBAPP>-path.rule": "Path(`/<WEBAPP>`)",
               "traefik.http.services.<WEBAPP>.loadbalancer.server.port": "80"
            },
            "systemControls": []
        }
    ],
    "family": "<WEBAPP>",
    "networkMode": "awsvpc",
    "volumes": [],
    "placementConstraints": [
        {
            "type": "memberOf",
            "expression": "attribute:role == webserver"
        }
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "256",
    "memory": "128",
    "tags": []
}

In the task-definition-webapp.json, consider the following:

  • The task definition includes the placement constraint matching the webserver custom attribute value.
  • Docker label traefik.http.routers is used to configure host and path based routing rules.
  • As the example web application container exposes the single TCP port 80, Docker label traefik.http.services.<WEBAPP> is used to configure this port for private communication with the Traefik load balancer.

Register the task definition with your cluster and create the loadbalancer service using the following AWS CLI command:

aws ecs register-task-definition --cli-input-json file://task-definition-webapp.json

aws ecs create-service--cluster <CLUSTER_NAME> --service-name <WEBAPP> --task-definition <WEBAPP>:1 --desired-count 1

Replace the string <CLUSTER_NAME> with the target Amazon ECS cluster name and the string <WEBAPP> with your application.

You can access the whoami application by browsing to the URL from your local network:

http://<HOST_IP>/<WEBAPP>

Step 4: Provision DB instance and attach an external storage

The web application has been successfully deployed, so we will move on to the deployment and configuration of the database server next. First, deploy an Amazon EC2 instance to host a MySQL database. As shown in the following screenshot, use the AWS Console to choose an instance type (this is dependent on your Outposts server instance capacity configuration) and configure its network settings to target the correct VPC and the subnet deployed to the Outposts server.

Figure 7: Provisioning a database instance

Figure 7: Provisioning a database instance

When the instance is available, deploy MySQL following a standard documented approach to install on a Linux host from the vendor. After successfully installing MySQL, configure users and tables necessary for the application. The sample application configuration file can now be updated to allow the PHP web server container to connect to the MySQL database, as well as create a user and list the users, as shown in the following figures.

Figure 8: Updating the application config file to use the database instance

Figure 8: Updating the application config file to use the database instance

Figure 9: Sample application connected to database

Figure 9: Sample application connected to database

For the database instance, make sure that the data associated with the application is stored on an existing storage array in the user data center. To do this, you must complete the following:

(a) Enable connectivity to the user network through the LNI.

(b) Mount the iSCSI volume in the EC2 instance.

(c) Configure MySQL to use this iSCSI volume.

To enable connectivity, follow the same process described in step 2 of this post to add an Elastic Network Interface (ENI) with the correct device index to present the LNI to the instance. The following screenshots show a second ENI configured on the instance and associated with the LNI along with the interface and address configuration of the instance that shows two addresses (VPC and user network addresses).

Figure 10: Network interface configuration

Figure 10: Network interface configuration

Now that connectivity has been established to the user network, you can configure the storage array to present an ISCSI volume to the database instance and mount that volume. The following screenshot shows the /mnt mountpoint being used with iSCSI multi-path across four volumes.

Figure 11: iSCSI volume mount

Figure 11: iSCSI volume mount

Finally, configure MySQL to use the iSCSI volume to store data by stopping the MySQL service, updating the default configuration file /etc/my.cnf, and restarting MySQL, as shown in the following figure.

Figure 12: MySQL configuration

Figure 12: MySQL configuration

Clean up:

Please follow the below instructions to clean up after testing:

  • Delete the <WEBAPP> service
  • Delete the loadbalancer service
  • Delete your Amazon ECS cluster
  • Delete the MySQL Database EC2 instance
  • Delete all VPCs

Conclusion

This post has demonstrated how to deploy a sample container-based web application while connecting to the user network, allowing access to the application and connecting to existing storage appliances.

AWS Outposts server allows users to run containers at the edge, addressing challenges related to low latency, local data processing, and data residency. Amazon ECS allows you to deploy consistently, whether in-Region or at the edge, allowing users to develop once and deploy many times.

Get started with Outposts servers by visiting the Outposts servers webpage and learn more about Amazon ECS to begin deploying your containarized workloads at the edge!

Announcing future-dated Amazon EC2 On-Demand Capacity Reservations

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/announcing-future-dated-amazon-ec2-on-demand-capacity-reservations/

Customers use Amazon Elastic Compute Cloud (Amazon EC2) to run every type of workload imaginable, including web hosting, big data processing, high-performance computing (HPC), virtual desktops, live event streaming, and databases. Some of these workloads are so critical that customers asked for the ability to reserve capacity for them.

To help customers flexibly reserve capacity, we launched EC2 On-Demand Capacity Reservations (ODCRs) in 2018. Since then, customers have used capacity reservations (CRs) to run critical applications like hosting consumer websites, streaming lives sporting events and processing financial transactions.

Today, we’re announcing the ability to get capacity for future workloads using CRs. Many customers have future events such as product launches, large migrations, or end-of-year sales events like Cyber Monday or Diwali. These events are critical, and customers want to ensure they have the capacity when and where they need it.

While CRs helped customers reserve capacity for these events, they were only available just-in-time. So customers either needed to provision the capacity ahead of time and pay for it or plan with precision to provision CRs just-in-time at the start of the event.

Now you can plan and schedule your CRs up to 120 days in advance. To get started you specify the capacity you need, the start date, delivery preference, and the minimum duration you commit to use the capacity reservation. There are no upfront charges to schedule a capacity reservation. After Amazon EC2 evaluates and approves the request, it will activate the reservation on the start date, and customers can use it to immediately launch instances.

Getting started with future-dated capacity reservations
To reserve your future-dated capacity, choose Capacity Reservations on the Amazon EC2 console and select Create On-Demand Capacity Reservation, and choose Get started.

To create a capacity reservation, specify the instance type, platform, Availability Zone, platform, tenancy, and number of instances you are requesting.

future-dated-2a

In the Capacity Reservation details section, choose At a future date in the Capacity Reservation starts option and choose your start date and commitment duration.

future-dated-1a

You can also choose to end the capacity reservation at a specific time or manually. If you select Manually, the reservation has no end date. It will remain active in your account and continue to be billed until you manually cancel it. To reserve this capacity, choose Create.

future-dated-4

After you create your capacity request, it appears in the dashboard with an Assessing status. During this state, AWS systems will work to determine if your request is supportable which is usually done within 5 days. Once the systems determine the request is supportable, the status will be changed to Scheduled. In rare cases, your request may be unsupported.

On your scheduled date, the capacity reservation will change to an Active state, the total instance count will be increased to the amount requested, and you can immediately launch instances.

After activation, you must hold the reservation for at least the commitment duration. After the commitment duration elapses, you can continue to hold and use the reservation if you’d like or cancel it if no longer needed.

Things to know
Here are some things that you should know about the future-dated CRs:

  • Evaluation – Amazon EC2 considers multiple factors when evaluating your request. Along with forecasted supply, Amazon EC2 considers how long you plan to hold the capacity, how early you create the Capacity Reservation relative to your start date, and the size of your request. To improve the ability of Amazon EC2 to support your request, create your reservation at least 56 days (8 weeks) before the start date. You need to submit a request for at least 100 vCPUs for only C, M, R, T, I instance types. The recommended minimum commitment for most requests will be 14 days.
  • Notification – We recommend monitoring the status of your request through the console or Amazon EventBridge You can use these notifications to trigger automation or send an email or text update. To learn more, visit Send an email when events happen using Amazon EventBridge in the Amazon EventBridge User Guide.
  • Pricing – Future dated capacity reservations are billed just like regular CRs. It is charged at the equivalent On-Demand rate whether you run instances in reserved capacity or not. For example, if you create a future dated CR for 20 instances and run 15 instances, you will be charged for 15 active instances and for 5 unused instances in the reservation including the minimum duration. Savings Plans apply to both unused reservations and instances running on the reservation. To learn more, visit Capacity Reservation pricing and billing in the Amazon EC2 User Guide.

Now available
Future dated EC2 Capacity Reservations are now available today in all AWS Regions where Amazon EC2 Capacity Reservations are available.

Give Amazon EC2 Capacity Reservations a try in the Amazon EC2 console. To learn more, visit On-Demand Capacity Reservations in the Amazon EC2 User Guide and send feedback to AWS re:Post for Amazon EC2 or through your usual AWS Support contacts.

Channy

Run high-availability long-running clusters with Amazon EMR instance fleets

Post Syndicated from Garima Arora original https://aws.amazon.com/blogs/big-data/run-high-availability-long-running-clusters-with-amazon-emr-instance-fleets/

AWS now supports high availability Amazon EMR on EC2 clusters with instance fleet configuration. With high availability instance fleet clusters, you now get the enhanced resiliency and fault tolerance of high availability architecture, along with the improved flexibility and intelligence in Amazon Elastic Compute Cloud (Amazon EC2) instance selection of instance fleets. Amazon EMR is a cloud big data platform for petabyte-scale data processing, interactive analysis, streaming, and machine learning (ML) using open source frameworks such as Apache Spark, Presto and Trino, and Apache Flink. Customers love the scalability and flexibility that Amazon EMR on EC2 offers. However, like most distributed systems running mission-critical workloads, high availability is a core requirement, especially for those with long-running workloads.

In this post, we demonstrate how to launch a high availability instance fleet cluster using the newly redesigned Amazon EMR console, as well as using an AWS CloudFormation template. We also go over the basic concepts of Hadoop high availability, EMR instance fleets, the benefits and trade-offs of high availability, and best practices for running resilient EMR clusters.

High availability in Hadoop

High availability (HA) provides continuous uptime and fault tolerance for a Hadoop cluster. The core components of Hadoop, like Hadoop Distributed File System (HDFS) NameNode and YARN ResourceManager, are single points of failure in clusters with a single primary node. In the event that any of them crash, the entire cluster goes down. High Availability removes this single point of failure by introducing redundant standby nodes that can quickly take over if the primary node fails.

In a high availability EMR cluster, one node serves as the active NameNode that handles client operations, and others act as standby NameNodes. The standby NameNodes constantly synchronize their state with the active one, enabling seamless failover to maintain service availability. To learn more, see Supported applications in an Amazon EMR Cluster with multiple primary nodes.

Key instance fleet differentiations

Amazon EMR recommends using the instance fleet configuration option for provisioning EC2 instances in EMR clusters because it offers a flexible and robust approach to cluster provisioning. Some key advantages include:

  • Flexible instance provisioning – Instance fleets provide a powerful and simple way to specify up to five EC2 instance types on the Amazon EMR console, or up to 30 when using the AWS Command Line Interface (AWS CLI) or API with an allocation strategy. This enhanced diversity helps optimize for cost and performance while increasing the likelihood of fulfilling capacity requirements.
  • Target capacity management – You can specify target capacities for On-Demand and Spot Instances for each fleet. Amazon EMR automatically manages the mix of instances to meet these targets, reducing operational overhead.
  • Improved availability – By spanning multiple instance types and purchasing options such as On-Demand and Spot, instance fleets are more resilient to capacity fluctuations in specific EC2 instance pools.
  • Enhanced Spot Instance handling – Instance fleets offer superior management of Spot Instances, including the ability to set timeouts and specify actions if Spot capacity can’t be provisioned.
  • Reliable cluster launches – You can configure your instance fleet to select multiple subnets for different Availability Zones, allowing Amazon EMR to find the best combination of instances and purchasing options across these zones to launch your cluster in. Amazon EMR will identify the best Availability Zone based on your configuration and available EC2 capacity and launch the cluster.

Prerequisites

Before you launch the high availability EMR instance fleet clusters, make sure you have the following:

  • Latest Amazon EMR release – We recommend that you use the latest Amazon EMR release to benefit from the highest level of resiliency and stability for your high availability clusters. High availability for instance fleets is supported with Amazon EMR releases 5.36.1, 6.8.1, 6.9.1, 6.10.1, 6.11.1, 6.12.0, and later.
  • Supported applications – High availability for instance fleets is supported for applications such as Apache Spark, Presto, Trino, and Apache Flink. Refer to Supported applications in an Amazon EMR Cluster with multiple primary nodes for the complete list of supported applications and their failover processes.

Launch a high availability instance fleet cluster using the Amazon EMR console

Complete the following steps on the Amazon EMR console to configure and launch a high availability EMR cluster with instance fleets:

  1. On the Amazon EMR console, create a new cluster.
  2. For Name, enter a name.
  3. For Amazon EMR release, choose the Amazon EMR release that supports high availability clusters with instance fleets. The setting will default to the latest available Amazon EMR release.

CreateHACluster-EMRRelease

  1. Under Cluster configuration, choose the desired instance types for the primary fleet. (You can select up to five when using the Amazon EMR console.)
  2. Select Use high availability to launch the cluster with three primary nodes.

CreateHACluster

  1. Choose the instance types and target On-Demand and Spot size for the core and task fleet according to your requirements.

InstanceFleet-CreateFleets

  1. Under Allocation strategy, select Apply allocation strategy.
    1. 1 We recommend that you select Price-capacity optimized for your allocation strategy for your cluster for faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.
  2. Under Networking, you can choose multiple subnets for different Availability Zones. This allows Amazon EMR to look across those subnets and launch the cluster in an Availability Zone that best suits your instance and purchasing option requirements.

allocationStrategy

  1. Review your cluster configuration and choose Create cluster.

Amazon EMR will launch your cluster in a few minutes. You can view the cluster details on the Amazon EMR console.
ClusterDetailPage

Launch a high availability cluster with AWS CloudFormation

To launch a high availability cluster using AWS CloudFormation, complete the following steps:

  1. Create a CloudFormation template with EMR resource type AWS::EMR::Cluster and JobFlowInstancesConfig property types MasterInstanceFleet, CoreInstanceFleet and (optional) TaskInstanceFleets. To launch a high availability cluster, configure TargetOnDemandCapacity=3, TargetSpotCapacity=0 for the primary instance fleet and weightedCapacity=1 for each instance type configured for the fleet. See the following code:
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "cluster": {
      "Type": "AWS::EMR::Cluster",
      "Properties": {
        "Instances": {
          "Ec2SubnetIds": [
            "subnet-003c889b8379f42d1",
            "subnet-0382aadd4de4f5da9",
            "subnet-078fbbb77c92ab099"
          ],
          "MasterInstanceFleet": {
            "Name": "HAPrimaryFleet",
            "TargetOnDemandCapacity": 3,
            "TargetSpotCapacity": 0,
            "InstanceTypeConfigs": [
              {
                "InstanceType": "m5.xlarge",
                "WeightedCapacity": 1
              },
              {
                "InstanceType": "m5.2xlarge",
                "WeightedCapacity": 1
              },
              {
                "InstanceType": "m5.4xlarge",
                "WeightedCapacity": 1
              }
            ]
          },
          "CoreInstanceFleet": {
            "Name": "cfnCore",
            "InstanceTypeConfigs": [
              {
                "InstanceType": "m5.xlarge",
                "WeightedCapacity": 1
              },
              {
                "InstanceType": "m5.2xlarge",
                "WeightedCapacity": 2
              },
              {
                "InstanceType": "m5.4xlarge",
                "WeightedCapacity": 4
              }
            ],
            "LaunchSpecifications": {
              "SpotSpecification": {
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "TimeoutDurationMinutes": 20,
                "AllocationStrategy": "PRICE_CAPACITY_OPTIMIZED"
              }
            },
            "TargetOnDemandCapacity": "4",
            "TargetSpotCapacity": 0
          },
          "TaskInstanceFleets": [
            {
              "Name": "cfnTask",
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m5.xlarge",
                  "WeightedCapacity": 1
                },
                {
                  "InstanceType": "m5.2xlarge",
                  "WeightedCapacity": 2
                },
                {
                  "InstanceType": "m5.4xlarge",
                  "WeightedCapacity": 4
                }
              ],
              "LaunchSpecifications": {
                "SpotSpecification": {
                  "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                  "TimeoutDurationMinutes": 20,
                  "AllocationStrategy": "PRICE_CAPACITY_OPTIMIZED"
                }
              },
              "TargetOnDemandCapacity": "0",
              "TargetSpotCapacity": 4
            }
          ]
        },
        "Name": "TestHACluster",
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ReleaseLabel": "emr-6.15.0",
        "PlacementGroupConfigs": [
          {
            "InstanceRole": "MASTER",
            "PlacementStrategy": "SPREAD"
          }
        ]
      }
    }
  }
}

Make sure to use an Amazon EMR release that supports high availability clusters with instance fleets.

  1. Create a CloudFormation stack with the preceding template:
aws cloudformation create-stack --stack-name HAInstanceFleetCluster --template-body file://cfn-template.json --region us-east-1
  1. Retrieve the cluster ID from the list-clusters response to use in the following steps. You can further filter this list based on filters like cluster status, creation date, and time.
aws emr list-clusters --query "Clusters[?Name=='<YourClusterName>']"
  1. Run the following describe-cluster command:
aws emr describe-cluster --cluster-id j-XXXXXXXXXXX --region us-east-1

If the high availability cluster was launched successfully, the describe-cluster response will return the state of the primary fleet as RUNNING and provisionedOnDemandCapacity as 3. By this point, all three primary nodes have been started successfully.

DescribeClusterResponse

Primary node failover with High Availability clusters

To fetch information on all EC2 instances for an instance fleet, use the list-instances command:

aws emr list-instances --cluster-id j-XXXXXXXXXXX --instance-fleet-type MASTER --region us-east-1

For high availability clusters, it will return three instances in RUNNING state for the primary fleet and other attributes like public and private DNS names.

PrimaryInstance-DescribeCluster

The following screenshot shows the instance fleet status on the Amazon EMR console.

Instancefleet status

Let’s examine two cases for primary node failover.

Case 1: One of the three primary instances is accidentally stopped

When an EC2 instance is accidentally stopped by a user, Amazon EMR detects this and performs a failover for the stopped primary node. Amazon EMR also attempts to launch a new primary node with the same private IP and DNS name to recover back the quorum. During this failover, the cluster remains fully operational, providing true resiliency to single primary node failures.

The following screenshots illustrate the instance fleet details.

InstanceFleetDetail-PrimaryInstanceTerminated

instanceFleerRecovery

This automatic recovery for primary nodes is also reflected in the MultiMasterInstanceGroupNodesRunning or MultiMasterInstanceGroupNodesRunningPercentage Amazon CloudWatch metric emitted by Amazon EMR for your cluster. The following screenshot shows an example of these metrics.

CloudwatchMetrics

Case 2: One of the three primary instances becomes unhealthy

If Amazon EMR continuously receives failures when trying to connect to a primary instance, it is deemed as unhealthy and Amazon EMR will attempt to replace it. Similar to case 1, Amazon EMR will perform a failover for the stopped primary node and also attempt to launch a new primary node with the same private IP and DNS name to recover the quorum.

UnhealthyPrimaryInstance
PrimaryInstanceFailover-2

If you list the instances for the primary fleet, the response will include information for the EC2 instance that was stopped by the user and the new primary instance that replaced it with the same private IP and DNS name.
DescribeClusterResponse-instanceFailover

The following screenshot shows an example of the CloudWatch metrics.

An instance can have connection failures for multiple reasons, including but not limited to disk space unavailable on the instance, critical cluster daemons like instance controller shut down with errors, high CPU utilization, and more. Amazon EMR is continuously improving its health monitoring criteria to better identify unhealthy nodes on an EMR cluster.

Considerations and best practices

The following are some of the key considerations and best practices for using EMR instance fleets to launch a high availability cluster with multiple primary nodes:

  • Use the latest EMR release – With the latest EMR releases, you get the highest level of resiliency and stability for your high availability EMR clusters with multiple primary nodes.
  • Configure subnets for high availability – Amazon EMR can’t replace a failed primary node if the subnet is oversubscribed (there aren’t any available private IP addresses in the subnet). This results in a cluster failure as soon as the second primary node fails. Limited availability of IP addresses in a subnet can also result in cluster launch or scaling failures. To avoid such scenarios, we recommend that you dedicate an entire subnet to an EMR cluster.
  • Configure core nodes for enhanced data availability – To minimize the risk of local HDFS data loss on your production clusters, we recommend that you set the dfs.replication parameter to 3 and launch at least four core nodes. Setting dfs.replication to 1 on clusters with fewer than four core nodes can lead to data loss if a single core node goes down. For clusters with three or fewer core nodes, set dfs.replication parameter to at least 2 to achieve sufficient HDFS data replication. For more information, see HDFS configuration.
  • Use an allocation strategy – We recommend enabling an allocation strategy option for your instance fleet cluster to provide faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.
  • Set alarms for monitoring primary nodes – You should monitor the health and status of primary nodes of your long-running clusters to maintain smooth operations. Configure alarms using CloudWatch metrics such as MultiMasterInstanceGroupNodesRunning, MultiMasterInstanceGroupNodesRunningPercentage, or MultiMasterInstanceGroupNodesRequested.
  • Integrate with EC2 placement groups – You can also choose to protect primary instances against hardware failures by using a placement group strategy for your primary fleet. This will spread the three primary instances across separate underlying hardware to avoid loss of multiple primary nodes at the same time in the event of a hardware failure. See Amazon EMR integration with EC2 placement groups for more details.

When setting up a high availability instance fleet cluster with Amazon EMR on EC2, it’s important to understand that all EMR nodes, including the three primary nodes, are launched within a single Availability Zone. Although this configuration maintains high availability within that Availability Zone, it also means that the entire cluster can’t tolerate an Availability Zone outage. To mitigate the risk of cluster failures due to Spot Instance reclamation, Amazon EMR launches the primary nodes using On-Demand instances, providing an additional layer of reliability for these critical components of the cluster.

Conclusion

This post demonstrated how you can use high availability with EMR on EC2 instance fleets to enhance the resiliency and reliability of your big data workloads. By using instance fleets with multiple primary nodes, EMR clusters can withstand failures and maintain uninterrupted operations, while providing enhanced instance diversity and better Spot capacity management within a single Availability Zone. You can quickly set up these high availability clusters using the Amazon EMR console or AWS CloudFormation, and monitor their health using CloudWatch metrics.

To learn more about the supported applications and their failover process, see Supported applications in an Amazon EMR Cluster with multiple primary nodes. To get started with this feature and launch a high availability EMR on EC2 cluster, refer to Plan and configure primary nodes.


About the Authors

Garima Arora is a Software Development Engineer for Amazon EMR at Amazon Web Services. She specializes in capacity optimization and helps build services that allow customers to run big data applications and petabyte-scale data analytics faster. When not hard at work, she enjoys reading fiction novels and watching anime.

Ravi Kumar is a Senior Product Manager Technical-ES (PMT) at Amazon Web Services, specialized in building exabyte-scale data infrastructure and analytics platforms. With a passion for building innovative tools, he helps customers unlock valuable insights from their structured and unstructured data. Ravi’s expertise lies in creating robust data foundations using open-source technologies and advanced cloud computing, that powers advanced artificial intelligence and machine learning use cases. A recognized thought leader in the field, he advances the data and AI ecosystem through pioneering solutions and collaborative industry initiatives. As a strong advocate for customer-centric solutions, Ravi constantly seeks ways to simplify complex data challenges and enhance user experiences. Outside of work, Ravi is an avid technology enthusiast who enjoys exploring emerging trends in data science, cloud computing, and machine learning.

Tarun Chanana is a Software Development Manager for Amazon EMR at Amazon Web Services.

Using zonal shift with Amazon EC2 Auto Scaling

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/using-zonal-shift-with-amazon-ec2-auto-scaling/

This post is written by Michael Haken, Senior Principal Solutions Architect, AWS

Today, we’re announcing support for zonal shift in Amazon EC2 Auto Scaling. Zonal shift gives allows you to rapidly recover from application impairments in a single Availability Zone (AZ) impacting your Auto Scaling Group (ASG) resources. In this post, we describe how performing an ASG zonal shift fits in to a multi-AZ resilience strategy and considerations for how to use the feature with different architectures.

Overview

Using multiple AZs is an architectural best practice for building resilient applications on AWS. Deploying your application across multiple AZs makes your applications more available, fault tolerant, and scalable. EC2 Auto Scaling enables you to further enhance your application’s availability and fault tolerance by dynamically scaling your Amazon Elastic Compute Cloud (Amazon EC2) instances across multiple AZs and replacing them when they’re unhealthy.

AZs in AWS represent a fault isolation boundary, meaning that failures from various sources are contained to a single AZ, whether caused by a bad deployment, networking issues, power loss, or operator error. In 2023, we launched zonal shift, part of Amazon Application Recovery Controller (ARC), which allows you to Rapidly recover from application impairments in a single AZ by shifting traffic at your Elastic Load Balancing (ELB) load balancer.

Zonal shift for EC2 Auto Scaling enhances this capability for users who have already implemented recovery patterns for single AZ impairments. It also provides recovery capabilities for architectures that aren’t load balanced by allowing you to prevent new instance launches in a specified AZ. Without zonal shift, when EC2 Auto Scaling detects consistent launch failures in an AZ, the service tries to launch instances in other AZs configured for the ASG. However, certain conditions, like gray failures, can cause post-launch problems in a single AZ that EC2 Auto Scaling doesn’t detect. For example, successfully launched instances in a single AZ experience elevated error rates downloading their configuration files over a zonal Amazon S3, Amazon Virtual Private Cloud (Amazon VPC) interface endpoint. The instances can’t correctly configure their application software and respond to requests with errors. Alternatively, the single-AZ impairment could cause the instance to fail its health checks after provisioning. This causes EC2 Auto Scaling to constantly recycle instances in the impaired AZ, leading to the application running with less capacity than desired.

Although you might choose to perform a zonal shift at your load balancer to mitigate the impact caused by the event, new instances can still be launched in the impacted AZ and don’t receive incoming requests. Even if your application architecture doesn’t use load balancers, zonal shift for EC2 Auto Scaling can help you recover from single-AZ impairments by allowing you to prevent instance launches in the impaired AZ.

Using EC2 Auto Scaling zonal shift to recover

To use zonal shift on your ASG, you need to configure it with an AvailabilityZoneImpairmentPolicy parameter either when you create a new ASG or update an existing one. This parameter has two options, ZonalShiftEnabled that enables or disables the ability to perform zonal shifts, and ImpairedZoneHealthCheckBehaviour. The latter option allows you to choose between ignoring or replacing instances identified as unhealthy by EC2 Auto Scaling. First, we look at how you can use zonal shift with a standalone ASG architecture.

Standalone ASG zonal shift

This architecture uses a standalone ASG without being integrated with an ELB load balancer. Workloads with a standalone ASG commonly perform event driven work such as generating load against a target based on a schedule or processing messages from a queue. The architecture in the following figure uses an ASG that reads messages from an Amazon Simple Queue Service (Amazon SQS) queue, performs some processing on the message data, and writes the results into an Amazon Aurora database. The instances communicate with Amazon SQS using a VPC endpoint in each AZ. Each message varies in size, thus the instances use a heartbeat pattern to update the message visibility timeout until they finish processing it. EC2 Auto Scaling scales instances based on the queue depth, which helps make sure that messages are processed in a timely manner.

Figure 1: EC2 instances deployed across three AZs that process messages from an SQS queue

Figure 1: EC2 instances deployed across three AZs that process messages from an SQS queue

Say that a networking degradation causes instances in AZ 1 to experience elevated error rates when attempting to write to the Aurora database, resulting in a 2x increase in the p50 processing latency. The instances in AZ 1 continue to heartbeat until they time out, keeping the message hidden and preventing other healthy instances from taking over the work. As a result, the queue depth grows and EC2 Auto Scaling deploys a new instance, as shown in the following figure.

Figure 2: EC2 Auto Scaling launches a new instance in AZ 1 in response to the queue depth growing

Figure 2: EC2 Auto Scaling launches a new instance in AZ 1 in response to the queue depth growing

The new instance lands in AZ 1 and experiences the same problem as the other instance, thus it can’t decrease the queue depth and processing latency. Instead, it exacerbates the issue by consuming additional messages that aren’t successfully processed. The instances in AZ 1 never appeared unhealthy, thus EC2 Auto Scaling didn’t take any actions to replace them. To mitigate this problem, you can start a zonal shift for your ASG. This makes sure that any future instance launches only happen in AZ 2 or AZ 3, as shown in the following figure.

Figure 3: After the zonal shift new instances are only launched in AZ 2 and AZ 3 by EC2 Auto Scaling

Figure 3: After the zonal shift new instances are only launched in AZ 2 and AZ 3 by EC2 Auto Scaling

You have the option to mark the instances as unhealthy using the SetInstanceHealth API to force EC2 Auto Scaling to replace these instances to prevent them from continuing to contribute to additional latency and errors. Changing the instance health state is considered a mutating change and relies on the EC2 Auto Scaling control plane. Therefore, you should avoid making this a critical step in your recovery plan. When you are confident that the impairment has abated, you can cancel the zonal shift, which causes EC2 Auto Scaling to automatically rebalance capacity across your AZs.

ASG with ELB zonal shift

In this section we observe how to use zonal shift with an ASG that is serving traffic from an ELB. We also examine how the ImpairedZoneHealthCheckBehavior affects recovery in this situation. In this architecture, the instances in the ASG read data from the database when they receive HTTP requests from the ELB, as shown in the following figure.

Figure 4: A three-tier application deployed in three AZs using an ALB, ASG, and Aurora database

Figure 4: A three-tier application deployed in three AZs using an ALB, ASG, and Aurora database

In this scenario, the instances in AZ 1 start experiencing increased latency with their EBS volumes causing them to respond to requests with errors and fail their EC2 instance status checks. Initially, to mitigate the impact, you can start a zonal shift at your load balancer to prevent your users from receiving errors. Then, you can initiate a zonal shift for your ASG to prevent new capacity from being launched into the AZ that isn’t receiving traffic.

If the ASG’s ImpairedZoneHealthCheckBehavior is set to IgnoreUnhealthy, then the instances in AZ 1 that are failing their health checks aren’t terminated by EC2 Auto Scaling, as shown in the following figure. This can be helpful if you’re pre-scaled to handle the loss of an AZ’s worth of capacity by not causing EC2 Auto Scaling to attempt to launch additional instances. It can also make recovery safer by leaving capacity in the AZ, thus when you end your load balancer zonal shift after the impairment ends, the AZ can immediately start receiving traffic again.

Figure 5: Performing a zonal shift on the ALB and ASG, choosing to ignore unhealthy instances in the ASG

Figure 5: Performing a zonal shift on the ALB and ASG, choosing to ignore unhealthy instances in the ASG

Alternatively, you can set the option to ReplaceUnhealthy. Now, instances that are found to be unhealthy by EC2 Auto Scaling are replaced. This option can be helpful if you aren’t pre-scaled to handle the loss of capacity. EC2 Auto Scaling launches new instances into the remaining AZs to bring the ASG back to its desired capacity, as shown in the following figure. However, this approach also has a tradeoff: launching new instances isn’t guaranteed to be successful, thus you might experience delays in acquiring new capacity.

Figure 6: Performing a zonal shift on the ALB and ASG, this time replacing unhealthy instances in the remaining AZs

Figure 6: Performing a zonal shift on the ALB and ASG, this time replacing unhealthy instances in the remaining AZs

In both situations you must consider whether you have cross-zone load balancing enabled or disabled. When cross-zone load balancing is enabled, each instance, regardless of its AZ, receives an approximately equal share of the traffic. This means that you can end your zonal shift for both your load balancer and ASG at the same time safely. As EC2 Auto Scaling rebalances your instances across each enabled AZ, they receive the same percentage of traffic.

If cross-zone load balancing is disabled, then each AZ receives an equal percentage of the traffic, regardless of how many instances are in the AZ. If you’ve chosen to replace unhealthy instances, or if your ASG has scaled during the event, then the capacity across your AZs could have become imbalanced. When you end your load balancer zonal shift and EC2 Auto Scaling begins to rebalance your capacity, you could end up in a situation shown in the following figure, where a single or small number of instances gets an overwhelming portion of the load.

Figure 7: A three-tier architecture with an imbalance of capacity among its three AZs

Figure 7: A three-tier architecture with an imbalance of capacity among its three AZs

This imbalance can present an overload risk, thus you must specify the –skip-zonal-shift-validation parameter when you enable zonal shift to acknowledge that you understand the risk. However, you can help prevent overload from occurring due to imbalance by using the load balancer’s target_group_health.dns_failover.minimum_healthy_targets.count option and specifying the number of instances that should be present in the AZ. If you’re using three AZs and your desired capacity is 12, then you should set the value to four (which represents one third of the ASGs total capacity). This prevents traffic from being routed to the AZ until there is enough healthy capacity there to handle the load. You may need to dynamically adjust this number as the ASG scales over time. The minimum count you set in the past may not be the right minimum count today.

Zonal shift best practices

As a set of best practices, we recommend that you:

  1. Are pre-scaled to handle the loss of an AZ’s worth of capacity
  2. Configure your impairment policy to ignore unhealthy hosts
  3. Enable cross-zone load balancing

With this configuration, you can also safely use zonal autoshift. When zonal autoshift is enabled, AWS automatically starts and ends the zonal shift on your behalf whenever the AWS telemetry indicates there is an impairment affecting a single AZ. This can be used in conjunction with zonal autoshift for your ELB load balancer. If you are not using zonal autoshift, then you can still use the EventBridge observer notifications to inform your zonal shift decisions or start automated processes. Refer to the EC2 Auto Scaling zonal shift documentation for more details on the full set of best practices when using zonal shift.

Conclusion

In this post we showed you the benefits of using zonal shift with your Amazon EC2 Auto Scaling Groups as part of enhancing your resilience in multi-AZ architectures. We explored several scenarios where zonal shift can be used, and reviewed best practices for using zonal shift safely and effectively. To get started using zonal shift with your ASGs, refer to the documentation.

AWS Lambda SnapStart for Python and .NET functions is now generally available

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/aws-lambda-snapstart-for-python-and-net-functions-is-now-generally-available/

Today, we’re announcing the general availability of AWS Lambda SnapStart for Python and .NET functions that delivers faster function startup performance, from several seconds to as low as sub-second, typically with minimal or no code changes in Python, C#, F#, and Powershell.

In November 28, 2022, we introduced Lambda SnapStart for Java functions to improve startup performance by up to 10 times. With Lambda SnapStart, you can reduce outlier latencies that come from initializing functions, without having to provision resources or spend time implementing complex performance optimizations.

Lambda SnapStart works by caching and reusing the snapshotted memory and disk state of any one-time initialization code, or code that runs only the first time a Lambda function is invoked. Lambda takes a Firecracker microVM snapshot of the memory and disk state of the initialized execution environment, encrypts the snapshot, and caches it for low-latency access.

When you invoke the function version for the first time, and as the invocations scale up, Lambda resumes new execution environments from the cached snapshot instead of initializing them from scratch, improving startup latency. Lambda SnapStart makes it easy to build highly scalable and responsive applications in Python and .NET using AWS Lambda.

For Python functions, startup latency from initialization code can be several seconds long. Some scenarios where this can occur are – loading dependencies (such as LangChain, Numpy, Pandas, and DuckDB) or using frameworks (such as Flask or Django). Many functions also perform machine learning (ML) inference using Lambda, and need to load ML models during initialization – a process that can take tens of seconds depending on the size of the model used. Using Lambda SnapStart can reduce startup latency from several seconds to as low as sub-second for these scenarios.

For .NET functions, we expect most use cases to benefit because .NET just-in-time (JIT) compilation takes up to several seconds. Latency variability associated with initialization of Lambda functions has been a long-standing barrier for customers to use .NET for AWS Lambda. SnapStart enables functions to resume quickly by caching a snapshot of their memory and disk state. Therefore, most .NET functions will experience significant improvement in latency variability with Lambda SnapStart.

Getting started with Lambda SnapStart for Python and .NET
To get started, you can use the AWS Management Console, AWS Command Line Interface (AWS CLI) or AWS SDKs to activate, update, and delete SnapStart for Python and .NET functions.

On the AWS Lambda console, go to the Functions page and choose your function to use Lambda SnapStart. Select Configuration, choose General configuration, and then choose Edit. You can see SnapStart settings on the Edit basic settings page.

You can activate Lambda functions using Python 3.12 and higher, and .NET 8 and higher managed runtimes. Choose Published versions and then choose Save.

When you publish a new version of your function, Lambda initializes your code, creates a snapshot of the initialized execution environment, and then caches the snapshot for low-latency access. You can invoke the function to confirm activation of SnapStart.

Here is an AWS CLI command to update the function configuration by running the update-function-configuration command with the --snap-start option.

aws lambda update-function-configuration \
  --function-name lambda-python-snapstart-test \
  --snap-start ApplyOn=PublishedVersions

Publish a function version with the publish-version command.

aws lambda publish-version \
  --function-name lambda-python-snapstart-test

Confirm that SnapStart is activated for the function version by running the get-function-configuration command and specifying the version number.

aws lambda get-function-configuration \
  --function-name lambda-python-snapstart-test:1

If the response shows that OptimizationStatus is On and State is Active, then SnapStart is activated, and a snapshot is available for the specified function version.

"SnapStart": { 
    "ApplyOn": "PublishedVersions",
    "OptimizationStatus": "On"
 },
 "State": "Active",

To learn more about activating, updating, and deleting a snapshot with AWS SDKs, AWS CloudFormation, AWS Serverless Application Model (AWS SAM), and AWS Cloud Development Kit (AWS CDK), visit Activating and managing Lambda SnapStart in the AWS Lambda Developer Guide.

Runtime hooks
You can use runtime hooks to run code executed before Lambda creates a snapshot or after Lambda resumes a function from a snapshot. Runtime hooks are useful to perform cleanup or resource release operations, dynamically update configuration or other metadata, integrate with external services or systems, such as sending notifications or updating external state or to fine-tune your function’s startup sequence, such as by preloading dependencies.

Python runtime hooks are available as part of the open source Snapshot Restore for Python library, which is included in Python managed runtime. This library provides two decorators @register_before_snapshot to run before Lambda creates a snapshot and @register_after_restore to run when Lambda resumes a function from a snapshot. To learn more, visit Lambda SnapStart runtime hooks for Python in the AWS Lambda Developer Guide.

Here is an example Python handler to show how to run code before checkpointing and after restoring:

from snapshot_restore_py import register_before_snapshot, register_after_restore

def lambda_handler(event, context):
    # handler code

@register_before_snapshot
def before_checkpoint():
    # Logic to be executed before taking snapshots

@register_after_restore
def after_restore():
    # Logic to be executed after restore

You can also use .NET runtime hooks available as part of the Amazon.Lambda.Core package (version 2.5 or later) from NuGet. This library provides two methods RegisterBeforeSnapshot() to run before snapshot creation and RegisterAfterRestore() to run after resuming a function from a snapshot. To learn more, visit Lambda SnapStart runtime hooks for .NET in the AWS Lambda Developer Guide.

Here is an example C# handler to show how to run code before checkpointing and after restoring:

public class SampleClass
{
    public SampleClass()
    { 
        Amazon.Lambda.Core.SnapshotRestore.RegisterBeforeSnapshot(BeforeCheckpoint); 
        Amazon.Lambda.Core.SnapshotRestore.RegisterAfterRestore(AfterRestore);
    }
    
    private ValueTask BeforeCheckpoint()
    {
        // Add logic to be executed before taking the snapshot
        return ValueTask.CompletedTask;
    }

    private ValueTask AfterRestore()
    {
        // Add logic to be executed after restoring the snapshot
        return ValueTask.CompletedTask;
    }

    public APIGatewayProxyResponse FunctionHandler(APIGatewayProxyRequest request, ILambdaContext context)
    {
        // INSERT business logic
        return new APIGatewayProxyResponse
        {
            StatusCode = 200
        };
    }
}

To learn how to implement runtime hooks for your preferred runtime, visit Implement code before or after Lambda function snapshots in the AWS Lambda Developer Guide.

Things to know
Here are some things that you should know about Lambda SnapStart:

  • Handling uniqueness – If your initialization code generates unique content that is included in the snapshot, then the content will not be unique when it’s reused across execution environments. To maintain uniqueness when using SnapStart, you must generate unique content after initialization, such as if your code uses custom random number generation that doesn’t rely on built-in-libraries or caches any information such as DNS entries that might expire during initialization. To learn how to restore uniqueness, visit Handling uniqueness with Lambda SnapStart in the AWS Lambda Developer Guide.
  • Performance tuning – To maximize the performance, we recommend that you preload dependencies and initialize resources that contribute to startup latency in your initialization code instead of in the function handler. This moves the latency associated with heavy class loading out of the invocation path, optimizing startup performance with SnapStart.
  • Networking best practices –The state of connections that your function establishes during the initialization phase isn’t guaranteed when Lambda resumes your function from a snapshot. In most cases, network connections that an AWS SDK establishes automatically resume. For other connections, review the Maximize Lambda SnapStart performance in the AWS Lambda Developer Guide.
  • Monitoring functions – You can monitor your SnapStart functions using Amazon CloudWatch log stream, AWS X-Ray active tracing, and accessing real-time telemetry data for extensions using the Telemetry API, Amazon API Gateway and function URL metrics. To learn more about differences for SnapStart functions, visit Monitoring for Lambda SnapStart in the AWS Lambda Developer Guide.

Now available
AWS Lambda SnapStart for Python and .NET functions are available today in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) AWS Regions.

With the Python and .NET managed runtimes, there are two types of SnapStart charges: the cost of caching a snapshot per function version that you publish with SnapStart enabled, and the cost of restoration each time a function instance is restored from a snapshot. So, delete unused function versions to reduce your SnapStart cache costs. To learn more, visit the AWS Lambda pricing page.

Give Lambda SnapStart for Python and .NET a try in the AWS Lambda console. To learn more, visit Lambda SnapStart page and send feedback through AWS re:Post for AWS Lambda or your usual AWS Support contacts.

Channy

Reduce your Microsoft licensing costs by upgrading to 4th generation AMD processors

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/reduce-your-microsoft-licensing-costs-by-upgrading-to-4th-generation-amd-processors/

This post is written by Jeremy Girven, Solutions Architect at AWS. 

Amazon Web Services (AWS) and AMD have collaborated since 2018 to deliver cost effective performance for a broad variety of Microsoft workloads, such as Microsoft SQL Server, Microsoft Exchange Server, Microsoft SharePoint Server, Microsoft Systems Center suite, Active Directory, and many other Microsoft workload use cases. This post shows how the performance improvements of the latest generation AMD-powered Amazon Elastic Compute Cloud (Amazon EC2) instances can help you reduce licensing costs on Microsoft workloads running on AWS.

AWS has been running Microsoft workloads for over 16 years. The most common of these workloads are those running Microsoft Windows Server and Microsoft SQL Server. Both can be brought to AWS using the Bring Your Own License (BYOL) or License Included (provided by AWS) licensing models. Many BYOL licensing restrictions need workloads to be run on dedicated tenancy and need Dedicated Hosts. For these workloads, a license would be needed to cover each physical core of the Dedicated Host (for example if the Dedicated Host has 96 physical cores, 96 licenses would be necessary to cover the host). For License Included EC2 instances, the cost of the associated Microsoft licenses is a per-vCPU fee bundled into the total price of the EC2 instance.

Regardless of which licensing option works best for you, the licensing cost is directly related to the number of virtual cores (vCPUs) or physical cores used by your workloads. Using high-performance processors allows you to potentially reduce the total number of cores necessary to run a workload. Reducing the total number of cores subsequently reduces your total cost of ownership (TCO) by reducing the number of licenses. One potential option available for running Microsoft workloads on AWS are EC2 instances, which use fourth generation processors.

The AWS Nitro EC2 instance families using fourth generation AMD EPYC processors are M7a, C7a, R7a, and Hpc7a. These fourth generation AMD EC2 instances use DDR5 memory to deliver 2.25x more memory bandwidth and up to 50% higher performance as compared with previous generation AMD EC2 instances. For performance-per-watt improvements across integer performance, floating point, and natural language processing (NLP) throughout, these fourth generation AMD EPYC processors offer up to 2.7x greater results than those of previous generation AMD EC2 instances.

AMD has publicly available performance testing comparing the General Purpose M7a instances with the previous generation M6a instances. You can find the information in this link. We wanted to expand their testing to Compute Optimized and Memory Optimized EC2 instances to observe if their results hold true for different instance families.

In the following section we dive into our performance testing methodologies, and we review our results.

Method 1: CPU calculation speed

The following is the configuration of the EC2 instances used for testing:

  • Instance Types: C6a.large and C7a.large (2 vCPUs, 4 GiB Memory, and 30 GiB (3000 IOPS, 125 MB/s) GP3 EBS volume)
  • Operating System: Microsoft Windows Server 2022 Datacenter (10.0.20348 N/A Build 20348)
  • Installed Software: AWS device drivers (NVMe 1.5.1 & ENA 2.7.0), Amazon EC2 Launch Agent v2 (2.0.1981.0), Amazon SSM Agent (3.3.551.0), and PowerShell 7.4.5 (all non-essential software has been removed)
  • AWS Region and AZ: us-west-2 / us-west-2a (usw2-az1)

We performed a direct, yet CPU-intensive math test by calculating prime numbers in a range of 2 through 10,000 using Windows PowerShell (version 7 needed). This runs in a loop ten times, which allows us to use the processing time over all the runs. The following is the code used for testing:

Function Start-PrimeNumberTest {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $True)][Int32]$TestRunLimit, #The number of times the test will run in a loop
        [Parameter(Mandatory = $True)][Int32]$UpperNumberRange #The upper number of the range to find prime numbers in (larger the number the longer it takes to process)
    )
    $DoCount = 0
    $NumberRange = 2..$UpperNumberRange
    [System.Collections.ArrayList]$TimeArray = @()
    [System.Collections.ArrayList]$OutputArray = @()
    $vCPUCount = Get-CimInstance -ClassName 'Win32_Processor' | Select-Object -ExpandProperty 'NumberOfLogicalProcessors'
    Do {
        $Time = Measure-Command {
            $Range = $NumberRange
            $Count = 0
            $Range | ForEach-Object -Parallel {
                $Number = $_
                $Divisor = [Math]::Sqrt($Number)
                2..$Divisor | ForEach-Object {
                    If ($Number % $_ -eq 0) {
                        $Prime = $False
                    } Else {
                        $Prime = $True
                    }
                }
                If ($Prime) {
                    $Count++
                    If ($Count % 10 -eq 0) {
                        $Null
                    }
                }
            } -ThrottleLimit $vCPUCount
        }
        $DoCount++
        [void]$TimeArray.Add($Time.TotalSeconds)
        Start-Sleep -Seconds 5
    } Until ($DoCount -eq $TestRunLimit)
    $Output = $TimeArray | Measure-Object -Average -Maximum -Minimum | Select-Object -Property 'Count', 'Average', 'Maximum', 'Minimum'
    [void]$OutputArray.Add("Number of runs                     : $($Output.Count)")
    [void]$OutputArray.Add("Average time to complete (seconds) : $($Output.Average)")
    [void]$OutputArray.Add("Maximum time to complete (seconds) : $($Output.Maximum)")
    [void]$OutputArray.Add("Minimum time to complete (seconds) : $($Output.Minimum)")
    Write-Output $Output
}

To run the code, invoke the function and specify the Test Run Limit and Upper Number Range. For example, the following code mimics our test by finding prime numbers up to 10,000 and run the test 10 times:

Start-PrimeNumberTest -TestRunLimit 10 -UpperNumberRange 10000

Test results: CPU calculation speed

Figure 1. C7a.large and C6a.large performance results over ten tests

Figure 1. C7a.large and C6a.large performance results over ten tests

Although this is a direct CPU performance test, it demonstrates a clear performance advantage of using the latest generation of AMD powered instances as compared with previous generations:

  • Slowest test: The C7a.large was over seven seconds faster than the quickest run on the C6a.large. This is a delta of more than 25% faster in the worst-case scenario for the C7a.large.
  • Fastest test: The C7a.large completed over 13 seconds faster than the C6a.large, showing a 47% faster processing time.
  • Average: There is an 11 second difference in processing time between the two instances. The C7a.large is averaging over 38% faster than the C6a.large.

Price-performance

The latest generation of AMD instances is more expensive than the previous generation. However, when we consider the performance delta between the two instances, using the average test duration length and the on-demand price of both instances in us-west-2, the C7a.large cost $0.000957791 per run to process the workload while the C6a.large cost $0.001352344. The C6a.large costs approximately $0.0004 per second more to process the same workload. Although that might sound small, this cost delta is greater than $12,000 over a 1-year period. These results show the value using the latest generation of AMD powered instances, especially with CPU bound workloads.

Method 2: SQL Server performance

We wanted our second testing method to focus more on real-world applications related to Microsoft workloads. For this test, we wanted to measure SQL Server performance.

SQL Server can be tested with an open source load testing tool called HammerDB. SQL Server is primarily used for OLTP workloads, thus we used the TPROC-C benchmark from HammerDB because it is specifically tailored for OLTP database testing.

The following is the configuration of the EC2 instances used for testing:

  • Instance Types:8xlarge and R6a.8xlarge (32 vCPUs, 256 GiB Memory)
  • Storage: io2 EBS volumes w/ 40,000 IOPS (EC2 instance maximum)
  • SQL Server: Microsoft SQL Server 2022 (RTM-CU14) (KB5038325) – 16.0.4135.4 (X64) Jul 10 2024 14:09:09 Copyright (C) 2022 Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2022 Datacenter 10.0 <X64> (Build 20348: ) (Hypervisor)
    • Maximum Server Memory: 240 GB
    • Database File Size: 220 GB
    • Database Data Size: 2000 warehouses (~200 GB)
    • MAXDOP: 1

HammerDB creates a test database based on “warehouses.” Each warehouse is approximately 100 MB of data. Our test server used 2000 warehouses, leaving approximately 20 GB for overhead in the 220 GB database file size. The total database size was also purposely sized smaller than the total memory allocated to our SQL Server. This allows SQL Server to cache as much of the database as possible in memory to avoid latency reading from disk.

When testing with Hammer DB, it uses “virtual users” as a method of applying load to the database. Our testing on each EC2 instance started with a small load of 32 virtual users to match the number of virtual users to vCPUs. Tests used a warmup time of five minutes and five minutes of processing. Then, the virtual users were increased by logarithmic scale to apply a larger performance load on the servers. Testing continued until we saw a decline of the of the total Transactions Per Minute (TPM). Three full runs were completed on each EC2 instance to create an average TPM at each level of virtual users.

Test results: SQL Server performance

Figure 2. R7a.8xlarge and R6a.8xlarge average TPM

Figure 2. R7a.8xlarge and R6a.8xlarge average TPM

Figure 3. R7a.8xlarge and R6a.8xlarge average TPM

Figure 3. R7a.8xlarge and R6a.8xlarge average TPM

The R7a.8xlarge consistently outperformed the R6a.8xlarge, even on tests with low load. The most notable difference was a 34% increase in TPM at peak performance. These results are similar to the 32% difference that AMD published when testing the M7a.8xlarge and M6a.8xlarge instances using another OLTP benchmark, TPROC-E.

Cost savings

Our test results are good news if you’re running SQL Server workloads. The ability to process more transactions with the same number of vCPUs translates into needing fewer vCPUs to run your current workloads, thereby lowering the total number of SQL Server licenses in your environment. With SQL Server Enterprise Edition licensing costing over $15,000 per 2-core pack as of this writing, being able to reduce your SQL Server licensing costs could save you hundreds of thousands of dollars for your total cost of ownership.

Conclusion

When evaluating the cost of CPU license-based workloads, such as those available with Microsoft workloads, the results show looking at the price alone isn’t optimal for selecting instances to use for your workloads. Commercial software such as Microsoft’s Windows Server or SQL Server are typically licensed at the vCPU level or the physical core level (BYOL). When dealing with CPU-bound workloads, choosing the instance with the highest performance to price ratio is the best evaluation method.

Author Bio

Jeremy Girven Jeremy Girven

Jeremy is a solutions architect specializing in Microsoft workloads on AWS. He has over 16 years’ experience with Microsoft Active Directory and over 25 years of industry experience. One of his fun projects is using SSM to automate the Active Directory build processes in AWS. To see more, check out the Active Directory AWS Partner Solution (https://aws.amazon.com/solutions/partners/active-directory-ds/).

Retaining Optimize CPUs configuration during Amazon EC2 scaling to save on licensing costs

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/retaining-optimize-cpus-configuration-during-amazon-ec2-scaling-to-save-on-licensing-costs/

Introduction

Amazon Elastic Compute Cloud (Amazon EC2) now lets you modify CPU configurations after an instance has launched. With this new feature, users can change instance CPU settings either by directly modifying the CPU configuration, or when changing instance size or type. You can now specify a custom number of CPUs and/or disable simultaneous multithreading (SMT) also known as hyper-threading (HT), for workloads where HT doesn’t provide performance improvement. These capabilities help Bring Your Own license (BYOL) users to optimize their CPU-based licensing costs. For more details on supported instance types, core count, and threads per core values available for each instance type, refer to the supported CPU options for Amazon EC2 instance type documentation.

Why CPU configuration matters for different workloads?

One of our users recently faced a significant challenge when their SQL Server licensing costs unexpectedly increased after scaling their EC2 instances to the next size up. This increase occurred because the Optimized CPUs feature, which can be configured to enable a custom number of CPUs to disable HT so that they can save on SQL Server BYOL licensing costs, was reset during scaling. As a result, this user quadrupled (as opposed to doubled) their licensing requirements when scaling from r7i.xlarge to r7i.2xlarge. Initially, we recommended creating a new Amazon Machine Image (AMI) and launching a new instance with the desired CPU configurations. But this approach introduced complications such as creating new AMIs, moving Amazon Elastic Block Store (Amazon EBS) volumes, and managing security groups. The user wanted a solution that would allow them to scale their workloads seamlessly without these complexities. After working backward from this and other user requirements, we are excited to bring the capability to retain the Optimized CPU configuration during scaling. This reassures you that your licensing costs are as you would expect (for example increase or decrease linearly with your instance size).

An Optimized CPU can reduce your per-CPU licensing costs by 50% by disabling HT, as long as doing so doesn’t affect application performance. You can save more by selectively disabling additional cores based on your specific workloads. Example workloads include, but are not limited to the following:

  • Compute-intensive workloads (for example scientific computing, simulations), which often perform better with one thread per core rather than two threads per core.
  • Database workloads (for example SQL Server) where reducing the thread count to one per core typically does not impact performance, because these workloads need more memory and storage but are less dependent on a high number of CPUs. For more details, refer to Optimize CPU best practices for SQL Server workloads.
  • High-performance computing (HPC) workloads, which sometimes perform better without HT because it can cause performance degradation because of context switching.

Three ways to set or modify CPU configurations

First, you can modify the EC2 instance configuration on an instance that is stopped after launch by modifying CPU Options:

  1. Go to the EC2 Dashboard: Log in to your AWS Management Console and go to the EC2 dashboard.
  2. Choose the Instance: Choose the EC2 instance that you want to modify from the list.
  3. Stop the Instance: In the Instance State dropdown, choose Stop Instance.
  4. Change CPU Options: In the Actions dropdown, choose Instance Settings, and choose Change CPU Options. You should observe the box shown in Figure 1.
  5. Configure CPU Settings:

a. Adjust the number of CPU cores (for example the dropdown box to the left of core(s))

b. Set the number of threads per core to 1 so that you disable HT (for example the dropdown box to the left of thread(s) per CPU core)

  1. Apply the Changes: When your desired CPU configuration is set, choose Apply to save the settings.

Changing CPU options after instance launch

Figure 1: Changing CPU options after instance launch

Second, you can also modify the CPU configuration during an instance size or type change:

  1. Select the Instance: From the EC2 dashboard, choose the instance to modify.
  2. Stop the Instance: In the Instance State dropdown, choose Stop Instance.
  3. Change Instance Type: In Actions, choose Instance Settings, then choose Change Instance Type. You should observe the box shown in Figure 2.
  4. Configure CPU Options: While changing the instance type, you can also:

a. Adjust the number of CPU cores (for example the dropdown box to the left of core(s))

b. Set the number of threads per core to 1 so that you disable HT (for example the dropdown box to the left of thread(s) per CPU core)

  1. Apply Changes: When configured, apply the changes.

Specifying CPU options during instance size or type change

Figure 2: Specifying CPU options during instance size or type change

Finally, you can use the CLI, API, or SDK method to configure the core count and threads per core for your instance using the new command modify-instance-cpu-options:

aws ec2 modify-instance-cpu-options --core-count "2" --threads-per-core "1" --instance-id "i-<your-instance-id>"

License tracking with optimized CPUs in AWS License Manager

You can effectively track your license usage by enabling the vCPU Optimization feature for self-managed license configuration within AWS License Manager. This feature integrates with Amazon EC2 CPU optimization, which lets you track the number of vCPUs on an instance. When the vCPU Optimization rule is set to True, License Manager counts vCPUs based on your customized core and thread count. Otherwise, it counts the default number of vCPUs for the instance type, which may not reflect your optimized CPU settings.

Conclusion

The ability to modify CPU configurations after an EC2 instance launch offers flexibility and efficiency for managing your workloads. You can adjust CPU cores, threads per core, and change instance types or sizes while retaining custom CPU settings without creating a new instance. This feature helps optimize performance, reduce licensing costs, and streamline operations.

Start using this new functionality today to improve the efficiency and scalability of your EC2 instances!

To learn more about CPU options on Amazon EC2, check out this guide, and Optimize CPU best practices for SQL Server workloads.

Author Bio

Rafet-Ducic

Rafet Ducic

Rafet Ducic is a Senior Solutions Architect at Amazon Web Services (AWS). He applies his more than 20 years of technical experience to help Global Industrial and Automotive users transition their workloads to the cloud cost-efficiently and with optimal performance. With domain expertise in Database Technologies and Microsoft licensing, Rafet is adept at guiding companies of all sizes toward reduced operational costs and top performance standards.

The attendee’s guide to the AWS re:Invent 2024 Compute track

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/the-attendees-guide-to-the-aws-reinvent-2024-compute-track/

From December 2nd to December 6th, AWS will hold its annual premier learning event: re:Invent. At this event, attendees can become stronger and more proficient in any area of AWS technology through a variety of experiences: large keynotes given by AWS leaders, smaller innovation talks and interactive working sessions given by AWS experts, and fun activities such as live music and games at re:Play.

There are over 2000+ learning sessions that focus on specific topics at various skill levels, and the compute team have created 72 unique sessions for you to choose. There are many sessions you can choose from, and we are here to help you choose the sessions that best fits your needs. Even if you are not able to join in person, you can catch-up with many of the sessions on-demand and even watch the keynote and innovation sessions live.

The Basic: Session types

If you’re able to join us, just a reminder that we offer several types of sessions which can help maximize your learning in a variety of AWS topics.

re:Invent attendees can also choose to attend chalk-talks, builder sessions, workshops, or code talk sessions. Each of these are live non-recorded interactive sessions.

  • Breakout sessions: Attendees will be in a lecture-style 60-minute informative sessions presented by AWS experts, customers, or partners. These sessions are recorded and uploaded a few days after to the AWS Events YouTube channel.
  • Chalk-talk sessions: Attendees will interact with presenters, asking questions and using a whiteboard in session.
  • Builder Sessions: Attendees participate in a one-hour session and build something.
  • Workshops sessions: Attendees join a two-hour interactive session where they work in a small team to solve a real problem using AWS services.
  • Code talk sessions: Attendees participate in engaging code-focused sessions where an expert leads a live coding session.
  • Lightning talk sessions: Attendees watch a 20-minute demo dedicated to either a specific service or customer story (located in the Expo Hall).

Getting started with Amazon EC2

The foundation of compute in AWS is Amazon Elastic Compute Cloud (Amazon EC2). Amazon EC2 offers the broadest and deepest compute platform, with over 800 instances and choice of the latest processor, storage, networking, operating system, and purchase model to help you best match the needs of your workload. We’ve created the following sessions to help you implement and manage your workloads in Amazon EC2.

  • CMP101 | What’s new with Amazon EC2
    Learn about the latest compute innovations from AWS. This session helps you better understand Amazon EC2 instances and how organizations like yours can use them to run any workload while meeting your cost, performance, and sustainability goals.
  • CMP343 | Select and launch the right instance for your workload and budget
    With more than 800 instances for various use cases, including instances best for common workloads and for workloads with specific requirements, how do you choose instances? Learn how to determine which instance is best for your specific use case and budget.
  • CMP319 | Managing Amazon EC2 capacity and availability
    Amazon EC2 offers a variety of capacity usage and reservation models, so you can choose the right combination for your workload and budget. Learn how to combine these models in a way that’s best for your business and manage your capacity to improve utilization and availability.
  • CMP207 | AWS-accelerated computing enables customer success with generative AI
    Discover how AWS provides the most performant, low-cost infrastructure for building and scaling large-scale generative AI models. Come learn what’s new in the accelerated computing portfolio including our GPU-based and AWS AI chips-powered instances.
  • CMP318 | Choose the optimal compute environment for your AI/ML workloads
    If you’re trying to decide between accelerators such as AWS Inferentia and AWS Trainium, GPUs from NVIDIA and AMD, processors such as AWS Graviton, or managed services such as Amazon Bedrock and Amazon SageMaker, this chalk covers the different options available on AWS.

Learn about AWS compute innovations

AWS has invested years designing custom silicon optimized for the cloud to deliver the best price performance for a wide range of applications and workloads using AWS services. Learn more about the AWS Nitro System, processors at AWS, and ML chips.

The AWS Nitro System is a rich collection of building block technologies that are powering the recent and future generations of Amazon EC2 instances. Dive deep into the Nitro System and see how it made the seemingly impossible possible.

Generative AI promises to revolutionize industries, but its immense computational demands and escalating costs pose significant challenges. To overcome these hurdles, AWS designed and purpose-built AI chips including AWS Trainium2 and AWS Inferentia2.

Optimize your compute costs

At AWS, we focus on delivering the best possible cost structure for our customers. Frugality is one of our founding leadership principles. Cost effective design continues to shape everything we do, from how we develop products to how we run our operations. Come learn of new ways to optimize your compute costs through AWS services, tools, and optimization strategies in the following sessions:

Maximize you workload’s performance

Your workload’s performance matters beyond just cost because it directly impacts the quality, efficiency, and effectiveness of your compute solution. It can significantly influence customer satisfaction, business growth, and overall productivity. Even if a cheaper option exists, a low-cost option with poor performance can lead to long-term financial losses due to issues such as lost customers, engineering rework, and negative reputation. We have a number of sessions that help you optimize your workload’s performance.

  • CMP411 | Everything you’ve wanted to know about performance on EC2 instances
    This session covers all the details you’ve always wanted to know to optimize your compute performance such as memory topology, accessing hardware counters, accounting for the side-effects of hyperthreading, properly running performance tests, and optimizing your latency.
  • CMP413 | Moving from naive benchmarking to application performance engineering
    Most of the time, benchmarks aren’t representative of their applications’ behaviors. In this session, learn the tools and best practices that will help you understand your applications’ performance behaviors on Amazon EC2 instances so that you can maximize your performance.
  • CMP405 | How to optimize latency and throughput
    The availability of processors with and without hyperthreading makes performance evaluation a tricky game. In this code talk, study a web application and evaluate its performance in various scenarios, and discover how to optimize throughput and latency along the way.

Customer experiences and applications with machine learning

Machine learning (ML) has been evolving for decades and has an inflection point with generative AI applications capturing widespread attention and imagination. More customers, across a diverse set of industries, choose AWS compared to any other major cloud provider to build, train, and deploy their ML applications. Learn about generative AI infrastructure at Amazon or get hands-on experience building ML applications through our ML focused sessions, such as the following:

Accelerate your AWS Graviton adoption journey

The AWS Graviton Processors are custom designed server processors designed by AWS. They deliver the best price performance for your cloud workloads running in AWS, and help you reduce your carbon footprint. Ready to realize up to 40% better price performance for your workloads? We have curated the following session to help you accelerate your Graviton adoption:

  • CMP305 | Learnings from developers adopting AWS Graviton at scale
    In this chalk talk, engage directly with AWS specialists that help customers on a daily basis with their adoption journey—from workload selection to running at scale in production. Explore AWS Graviton use cases, best practices, performance, and customer success stories.
  • CMP310 | Migrating applications to AWS Graviton on Amazon EKS
    During this hands-on workshop, walk through the steps for migrating a workload running on x86 to AWS Graviton-based instances including performing tests locally and modifying the CI/CD pipeline to build and deploy the application in Amazon EKS using Karpenter.
  • CMP316 | AWS Graviton GameDay: Optimize your Amazon EC2 workload with Graviton
    Ready to learn more about AWS Graviton in an immersive environment? In this team-based gamified learning setting, perform a live migration of your workload to Graviton. You learn how to unlock Graviton’s full price-performance potential and optimize the size of an Amazon EC2 fleet.
  • CMP404 | Exploring performance analysis with AWS Graviton instances
    In this session, AWS experts open a shell on an Amazon EC2 instance and dig into the system to see which tools and resources you can use, including the Amazon Aperf tool. Learn as they write some mini-applications to study their performance behavior and how to improve them.

Check out workload-specific sessions

Amazon EC2 offers the broadest and deepest compute platform to help you best match the needs of your workload. More SAP, high performance computing (HPC), ML, and Windows workloads run on AWS than any other cloud. Join sessions focused around your specific workload to learn about how you can leverage AWS solutions to accelerate your innovations.

Ready to unlock new possibilities?

The AWS Compute team looks forward to seeing you in Las Vegas. Come meet us at the Compute Booth in the Expo and check out our various Amazon EC2 demos. And if you’re looking for more session recommendations, check-out additional re:Invent attendee guides curated by experts.

Convert AWS console actions to reusable code with AWS Console-to-Code, now generally available

Post Syndicated from Abhishek Gupta original https://aws.amazon.com/blogs/aws/convert-aws-console-actions-to-reusable-code-with-aws-console-to-code-now-generally-available/

Today, we are announcing the general availability (GA) of AWS Console-to-Code that makes it easy to convert AWS console actions to reusable code. You can use AWS Console-to-Code to record your actions and workflows in the console, such as launching an Amazon Elastic Compute Cloud (Amazon EC2) instance, and review the AWS Command Line Interface (AWS CLI) commands for your console actions. With just a few clicks, Amazon Q can generate code for you using the infrastructure-as-code (IaC) format of your choice, including AWS CloudFormation template (YAML or JSON), and AWS Cloud Development Kit (AWS CDK) (TypeScript, Python or Java). This can be used as a starting point for infrastructure automation and further customized for your production workloads, included in pipelines, and more.

Since we announced the preview last year, AWS Console-to-Code has garnered positive response from customers. It has now been improved further in this GA version, because we have continued to work backwards from customer feedback.

New features in GA

  • Support for more services – During preview, the only supported service was Amazon EC2. At GA, AWS Console-to-Code has extended support to include Amazon Relational Database Service (RDS) and Amazon Virtual Private Cloud (Amazon VPC).
  • Simplified experience – The new user experience makes it easier for customers to manage the prototyping, recording and code generation workflows.
  • Preview code – The launch wizards for EC2 instances and Auto Scaling groups have been updated to allow customers to generate code for these resources without actually creating them.
  • Advanced code generation – AWS CDK and CloudFormation code generation is powered by Amazon Q machine learning models.

Getting started with AWS Console-to-Code
Let’s begin with a simple scenario of launching an Amazon EC2 instance. Start by accessing the Amazon EC2 console. Locate the AWS Console-to-Code widget on the right and choose Start recording to initiate the recording.

Now, launch an Amazon EC2 instance using the launch instance wizard in the Amazon EC2 console. After the instance is launched, choose Stop to complete the recording.

In the Recorded actions table, review the actions that were recorded. Use the Type dropdown list to filter by write actions (Write). Choose the RunInstances action. Select Copy CLI to copy the corresponding AWS CLI command.

This is the CLI command that I got from AWS Console-to-Code:

aws ec2 run-instances \
  --image-id "ami-066784287e358dad1" \
  --instance-type "t2.micro" \
  --network-interfaces '{"AssociatePublicIpAddress":true,"DeviceIndex":0,"Groups":["sg-1z1c11zzz1c11zzz1"]}' \
  --credit-specification '{"CpuCredits":"standard"}' \
  --tag-specifications '{"ResourceType":"instance","Tags":[{"Key":"Name","Value":"c2c-demo"}]}' \
  --metadata-options '{"HttpEndpoint":"enabled","HttpPutResponseHopLimit":2,"HttpTokens":"required"}' \
  --private-dns-name-options '{"HostnameType":"ip-name","EnableResourceNameDnsARecord":true,"EnableResourceNameDnsAAAARecord":false}' \
  --count "1"

This command can be easily modified. For this example, I updated it to launch two instances (--count 2) of type t3.micro (--instance-type). This is a simplified example, but the same technique can be applied to other workflows.

I executed the command using AWS CloudShell and it worked as expected, launching two t3.micro EC2 instances:

The single-click CLI code generation experience is based on the API commands that were used when actions were executed (while launching the EC2 instance). Its interesting to note that the companion screen surfaces recorded actions as you complete them in console. And thanks to the interactive UI with start and stop functionality, its easy to clearly scope actions for prototyping.

IaC generation using AWS CDK
AWS CDK is an open-source framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. With AWS Console-to-Code, you can generate AWS CDK code (currently in Java, Python and TypeScript) for your infrastructure workflows.

Lets continue with the EC2 launch instance use case. If you haven’t done it already, in the Amazon EC2 console, locate the AWS Console-to-Code widget on the right, choose Start recording, and launch an EC2 instance. After the instance is launched, choose Stop to complete the recording and choose the RunInstances action from the Recorded actions table.

To generate AWS CDK Python code, choose the Generate CDK Python button from the dropdown list.

You can use the code as a starting point, customizing it to make it production-ready for your specific use case.

I already had the AWS CDK installed, so I created a new Python CDK project:

mkdir c2c_cdk_demo
cd c2c_cdk_demo
cdk init app --language python

Then, I plugged in the generated code in the Python CDK project. For this example, I refactored the code into a AWS CDK Stack, changed the EC2 instance type, and made other minor changes to ensure that the code was correct. I successfully deployed it using cdk deploy.

I was able to go from the console action to launch an EC2 instance and then all the way to AWS CDK to reproduce the same result.

from aws_cdk import (
    Stack,
    aws_ec2 as ec2,
)
from constructs import Construct

class MyProjectStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        existing_vpc = ec2.Vpc.from_lookup(self, "ExistingVPC",
            is_default=True
        )

        instance = ec2.Instance(self, "Instance",
                instance_type=ec2.InstanceType("t3.micro"),
                machine_image=ec2.AmazonLinuxImage(),
                vpc=existing_vpc,
                vpc_subnets=ec2.SubnetSelection(
                    subnet_type=ec2.SubnetType.PUBLIC
                )
        )

You can also generate CloudFormation template in YAML or JSON format:

Preview code
You can also directly access AWS Console-to-Code from Preview code feature in Amazon EC2 and Amazon EC2 Auto Scaling group launch experience. This means that you don’t have to actually create the resource in order to get the infrastructure code.

To try this out, follow the steps to create an Auto Scaling group using a launch template. However, instead of Create Auto Scaling group, click Preview code. You should now see the options to generate infrastructure code or copy the AWS CLI command.

Things to know
Here are a few things you should consider while using AWS Console-to-Code:

  • Anyone can use AWS Console-to-Code to generate AWS CLI commands for their infrastructure workflows. The code generation feature for AWS CDK and CloudFormation formats has a free quota of 25 generations per month, after which you will need an Amazon Q Developer subscription.
  • It’s recommended that you test and verify the generated IaC code code before deployment.
  • At GA, AWS Console-to-Code only records actions in Amazon EC2, Amazon VPC and Amazon RDS consoles.
  • The Recorded actions table in AWS Console-to-Code only display actions taken during the current session within the specific browser tab, and it does not retain actions from previous sessions or other tabs. Note that refreshing the browser tab will result in the loss of all recorded actions.

Now available
AWS Console-to-Code is available in all commercial Regions. You can learn more about it in the Amazon EC2 documentation. Give it a try in the Amazon EC2 console and send feedback to the AWS re:Post for Amazon EC2 or through your usual AWS Support contacts.

AWS Weekly Roundup: Amazon EC2 X8g Instances, Amazon Q generative SQL for Amazon Redshift, AWS SDK for Swift, and more (Sep 23, 2024)

Post Syndicated from Abhishek Gupta original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-ec2-x8g-instances-amazon-q-generative-sql-for-amazon-redshift-aws-sdk-for-swift-and-more-sep-23-2024/

AWS Community Days have been in full swing around the world. I am going to put the spotlight on AWS Community Day Argentina where Jeff Barr delivered the keynote, talks and shared his nuggets of wisdom with the community, including a fun story of how he once followed Bill Gates to a McDonald’s!

I encourage you to read about his experience.

Last week’s launches
Here are the launches that got my attention, starting off with the GA releases.

Amazon EC2 X8g Instances are now generally availableX8g instances are powered by AWS Graviton4 processors and deliver up to 60% better performance than AWS Graviton2-based Amazon EC2 X2gd instances. These instances offer larger sizes with up to 3x more vCPU (up to 48xlarge) and memory (up to 3TiB) than Graviton2-based X2gd instances.

Amazon Q generative SQL for Amazon Redshift is now generally available – Amazon Q generative SQL in Amazon Redshift Query Editor is an out-of-the-box web-based SQL editor for Amazon Redshift. It uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.

AWS SDK for Swift is now generally availableAWS SDK for Swift provides a modern, user-friendly, and native Swift interface for accessing Amazon Web Services from Apple platforms, AWS Lambda, and Linux-based Swift on Server applications. Now that it’s GA, customers can use AWS SDK for Swift for production workloads. Learn more in the AWS SDK for Swift Developer Guide.

AWS Amplify now supports long-running tasks with asynchronous server-side function calls – Developers can use AWS Amplify to invoke Lambda function asynchronously for operations like generative AI model inferences, batch processing jobs, or message queuing without blocking the GraphQL API response. This improves responsiveness and scalability, especially for scenarios where immediate responses are not required or where long-running tasks need to be offloaded.

Amazon Keyspaces (for Apache Cassandra) now supports add-column for multi-Region tables – With this launch, you can modify the schema of your existing multi-Region tables in Amazon Keyspaces (for Apache Cassandra) to add new columns. You only have to modify the schema in one of its replica Regions and Keyspaces will replicate the new schema to the other Regions where the table exists.

Amazon Corretto 23 is now generally availableAmazon Corretto is a no-cost, multi-platform, production-ready distribution of OpenJDK. Corretto 23 is an OpenJDK 23 Feature Release that includes an updated Vector API, expanded pattern matching and switch expression, and more. It will be supported through April, 2025.

Use OR1 instances for existing Amazon OpenSearch Service domains – With OpenSearch 2.15, you can leverage OR1 instances for your existing Amazon OpenSearch Service domains by simply updating your existing domain configuration, and choosing OR1 instances for data nodes. This will seamlessly move domains running OpenSearch 2.15 to OR1 instances using a blue/green deployment.

Amazon S3 Express One Zone now supports AWS KMS with customer managed keys – By default, S3 Express One Zone encrypts all objects with server-side encryption using S3 managed keys (SSE-S3). With S3 Express One Zone support for customer managed keys, you have more options to encrypt and manage the security of your data. S3 Bucket Keys are always enabled when you use SSE-KMS with S3 Express One Zone, at no additional cost.

Use AWS Chatbot to interact with Amazon Bedrock agents from Microsoft Teams and Slack – Before, customers had to develop custom chat applications in Microsoft Teams or Slack and integrate it with Amazon Bedrock agents. Now they can invoke their Amazon Bedrock agents from chat channels by connecting the agent alias with an AWS Chatbot channel configuration.

AWS CodeBuild support for managed GitLab runners – Customers can configure their AWS CodeBuild projects to receive GitLab CI/CD job events and run them on ephemeral hosts. This feature allows GitLab jobs to integrate natively with AWS, providing security and convenience through features such as IAM, AWS Secrets Manager, AWS CloudTrail, and Amazon VPC.

We launched existing services in additional Regions:

Other AWS news
Here are some additional projects, blog posts, and news items that you might find interesting:

Secure Cross-Cluster Communication in EKS – It demonstrates how you can use Amazon VPC Lattice and Pod Identity to secure cross-EKS-cluster application communication, along with an example that you can use as a reference to adapt to your own microservices applications.

Improve RAG performance using Cohere Rerank – This post focuses on improving search efficiency and accuracy in RAG systems using Cohere Rerank.

AWS open source news and updates – My colleague Ricardo Sueiras writes about open source projects, tools, and events from the AWS Community; check out Ricardo’s page for the latest updates.

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are in Italy (Sep. 27), Taiwan (Sep. 28), Saudi Arabia (Sep. 28)), Netherlands (Oct. 3), and Romania (Oct. 5).

Browse all upcoming AWS led in-person and virtual events and developer-focused events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Abhishek

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

New: Zone Groups for Availability Zones in AWS Regions

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/new-zone-groups-for-availability-zones-in-aws-regions/

This blog post is written by Pranav Chachra, Principal Product Manager, AWS.

In 2019, AWS introduced Zone Groups for AWS Local Zones. Today, we’re announcing that we are working on extending the Zone Group construct to Availability Zones (AZs).

Zone Groups were launched to help users of AWS Local Zones identify related groups of Local Zones that reside in the same geography. For example, the two interconnected Local Zones in Los Angeles (us-west-2-lax-1a and us-west-2-lax-1b) make up the us-west-2-lax-1 Zone Group. These Zone Groups are used for opting in to the Local Zones, as shown in the following image.

In October 2024, we will extend the Zone Group construct to AZs with a consistent naming format across all AWS Regions. This update will help you differentiate group of Local Zones and AZs based on their unique GroupName, improving manageability and clarity. For example, the Zone Group of AZs in the US West (Oregon) Region will now be referred to as us-west-2-zg-1, where us-west-2 indicates the Region, and zg-1 indicates it is the group of AZs in the Region. This new identifier (such as us-west-2-zg-1) will replace the current naming (such as us-west-2) for the GroupName available through the DescribeAvailabilityZones API. The names for the Local Zones Groups (such as us-west-2-lax-1) will remain the same as before.

For example, when you list your AZs, you see the current naming (such as us-west-2) for the GroupName of AZs in the US West (Oregon) Region:

The new identifier (such as us-west-2-zg-1) will replace the current naming for the GroupName of AZs, as shown in the following image.

At AWS we remain committed to responding to customer feedback and we continuously improve our services based upon that feedback. If you have questions or need further assistance, contact AWS Support on the community forums and through AWS Support.