All posts by Sheila Busser

Configuring low latency connectivity between AWS Outposts rack and on-premises data using CoIP

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/configuring-low-latency-connectivity-between-aws-outposts-rack-using-coip-and-on-premises-data/

This blog post is written by, Leonardo Azize Martins, Cloud Infrastructure Architect, Professional Services.

AWS Outposts rack enables applications that need to run on-premises due to low latency, local data processing, or local data storage needs by connecting Outposts rack to your on-premises network via the local gateway (LGW).

Each Outpost rack includes a local gateway to provide low latency connectivity between the Outpost and any local data sources, end users, local machinery and equipment, or local databases. If you have an Outpost rack, you can include a local gateway as the target in your VPC subnet route table where the destination is your on-premises network. Local gateways are only available for Outposts rack and can only be used in route tables where the VPC has been associated with an LGW.

In a previous blog post on connecting AWS Outposts to on-premises data sources, you learned the different use cases to use AWS Outposts connected with your local network. In this post, you will dive deep into local gateway usage and specific details about it. You will learn how to use Outpost local gateway when it is configured as Customer-owned IP addresses (CoIP) mode. You will also learn how to integrate it with your Amazon Virtual Private Cloud (Amazon VPC) and how different routes work regarding Amazon Elastic Compute Cloud (Amazon EC2) instances running on Outposts.

Overview of solution

The primary role of a local gateway is to provide connectivity from an Outpost to your local on-premises LAN. It also provides AWS Outposts connectivity to the internet through your on-premises network via the LGW, so you don’t need to rely on an internet gateway (IGW). The local gateway can also provide a data plane path back to the AWS Region. If you already have connectivity between your LAN and the Region through AWS Site-to-Site VPN or AWS Direct Connect, you can use the same path to connect from the Outpost to the AWS Region privately.

Outpost subnet

Public subnets and private subnets are important concepts to understand for Outposts networking. A public subnet has a route to an internet gateway. The same concept applies to Outpost subnets, when a public subnet exists on Outposts, it will have a route to an internet gateway, which will use a service link as a communication path between an Outpost and the internet gateway in the parent AWS Region.

Outpost public route via internet gateway

A private subnet does not have a direct route to the internet gateway. It will only be local inside the VPC, or it could have a route to a Network Address Translation (NAT) gateway. In both cases, the communication between Outpost subnets and AWS Region subnets will be done via the service link.

Outpost private route via NAT gateway

Your subnet can be private and only be allowed to communicate with your on-premises network. You just need a route pointing to the LGW.

Outpost private route via LGW

You can also provide internet connectivity to your Outpost subnets via LGW. In this case, it will not use the service link. As soon as it traverses the LGW and goes to your next hop, it will follow your routing flow to the internet.

Outpost public route via LGW

Routing

By default, every Outpost subnet inherits the main route table from its VPC. You can create a custom route table and associate it with an Outpost subnet. You can include a local gateway as the target when the destination is your on-premises network. A local gateway can only be used in VPC and subnet route tables that are exclusively associated with an Outpost subnet. If the route table is associated with an Outpost subnet and a Region subnet, it will not allow you to add a local gateway as the target.

Error message: addition of a local gateway as the target is denied

Local gateways are also not supported in the main route table.

Error message: routes that target local gateways not supported in main route table

The local gateway advertises Outpost IP address ranges to your on-premises network via BGP. In the other direction, from an on-premises network to the Outpost, it doesn’t use BGP, which means there is no propagation. You need to configure your VPC route table with static routes.

As of this writing, the LGW does not support jumbo frame.

Outposts IP addresses

Outposts can be configured in customer-owned IP (CoIP) mode.

Customer-owned IP

During the installation process, uses information that you provide about your on-premises network to create an address pool, which is known as a customer-owned IP address pool (CoIP pool). AWS then assigns it to the local gateway for use and advertises back to your on-premises network through BGP.

CoIP addresses provide local or external connectivity to resources in your Outpost subnets through your on-premises network. You can assign these IP addresses to resources on your Outpost, such as an EC2 instance, by allocating a new Elastic IP address from the customer-owned IP pool and then assigning this new Elastic IP address to your EC2 instance.

A local gateway serves as NAT for EC2 instances that have been assigned addresses from your customer-owned IP pool.

You can optionally share your customer-owned pool with multiple AWS accounts in your AWS Organizations using the AWS Resource Access Manager (RAM). After you share the pool, participants can allocate and associate Elastic IP addresses from the customer-owned IP pool.

Communication between your Outpost and on-premises network will use the CoIP Elastic IP addresses to address instances in the Outpost; the VPC CIDR range is not used.

Walkthrough

You will follow the steps required to configure your VPC to use LGW configured as CoIP, including:

  • Associate your VPC with the LGW route table.
  • Create an Outposts subnet.
  • Create and associate the VPC route table with the subnet.
  • Add a route to on-premises network with LGW as the target.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • An Outpost that consists of one or more Outposts racks configured in CoIP mode.

Associate VPC with LGW route table

Use the following procedure to associate a VPC with the LGW route table. You can’t associate VPCs that have a CIDR block conflict.

  1. Open the AWS Outposts console.
  2. In the navigation pane, choose Local gateway route tables.
  3. Select the route table and then choose Actions, Associate VPC.
  4. For VPC ID, select the VPC to associate with the local gateway route table.
  5. Choose Associate VPC.

Create an Outpost subnet

You can add Outpost subnets to any VPC in the parent AWS Region for the Outpost. When you do so, the VPC also spans the Outpost.

  1. Open the AWS Outposts console.
  2. On the navigation pane, choose Outposts.
  3. Select the Outpost, and then choose Actions, Create subnet.
  4. Select the VPC and specify an IP address range for the subnet.
  5. Choose Create.

Create and associate VPC route table with the Outpost subnet

You can create a custom route table for your VPC using the Amazon VPC console. It is a best practice to have one specific route table for each subnet.

  1. Open the Amazon VPC console.
  2. In the navigation pane, choose Route Tables.
  3. Choose Create route table.
  4. For VPC, choose your VPC.
  5. Choose Create.
  6. On the Subnet associations tab, choose Edit subnet associations.
  7. Select the check box for the subnet to associate with the route table and then choose Save associations.

Add a route to on-premises network with LGW as the target

You can add a route to a route table using the Amazon VPC console.

  1. Open the Amazon VPC console.
  2. In the navigation pane, choose Route Tables and select
  3. Choose Actions, Edit routes.
  4. Choose Add route. For Destination, enter the destination CIDR block, a single IP address, or the ID of a prefix list.
  5. Choose Save routes.

Allocate and associate a customer-owned IP address

  1. Open the Amazon EC2 console.
  2. In the navigation pane, choose Elastic IPs.
  3. Choose Allocate new address.
  4. For Network Border Group, select the location from which the IP address is advertised.
  5. For Public IPv4 address pool, choose Customer owned IPv4 address pool.
  6. For Customer owned IPv4 address pool, select the pool that you configured.
  7. Choose Allocate and close the confirmation screen.
  8. In the navigation pane, choose Elastic IPs.
  9. Select an Elastic IP address and choose Actions, Associate address.
  10. Select the instance from Instance and then choose Associate.

For more information about Launch an instance on your Outpost, refer to the AWS Outposts User Guide.

Allocate Elastic IP address

Cleaning up

To avoid incurring future charges, delete the resources, like EC2 instances.

Conclusion

In this post, I covered how to use the Outposts rack local gateway to communicate with your on-premises network. You learned how a subnet route table can influence the connectivity of public or private Outpost instances.

To learn more, check out our Outposts local gateway documentation and the networking reference architecture.

Understanding the lifecycle of Amazon EC2 Dedicated Hosts

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/understanding-the-lifecycle-of-amazon-ec2-dedicated-hosts/

This post is written by Benjamin Meyer, Sr. Solutions Architect, and Pascal Vogel, Associate Solutions Architect.

Amazon Elastic Compute Cloud (Amazon EC2) Dedicated Hosts enable you to run software on dedicated physical servers. This lets you comply with corporate compliance requirements or per-socket, per-core, or per-VM licensing agreements by vendors, such as Microsoft, Oracle, and Red Hat. Dedicated Hosts are also required to run Amazon EC2 Mac Instances.

The lifecycles and states of Amazon EC2 Dedicated Hosts and Amazon EC2 instances are closely connected and dependent on each other. To operate Dedicated Hosts correctly and consistently, it is critical to understand the interplay between Dedicated Hosts and EC2 Instances. In this post, you’ll learn how EC2 instances are reliant on their (dedicated) hosts. We’ll also dive deep into their respective lifecycles, the connection points of these lifecycles, and the resulting considerations.

What is an EC2 instance?

An EC2 instance is a virtual server running on top of a physical Amazon EC2 host. EC2 instances are launched using a preconfigured template called Amazon Machine Image (AMI), which packages the information required to launch an instance. EC2 instances come in various CPU, memory, storage and GPU configurations, known as instance types, to enable you to choose the right instance for your workload. The process of finding the right instance size is known as right sizing. Amazon EC2 builds on the AWS Nitro System, which is a combination of dedicated hardware and the lightweight Nitro hypervisor. The EC2 instances that you launch in your AWS Management Console via Launch Instances are launched on AWS-controlled physical hosts.

What is an Amazon EC2 Bare Metal instance?

Bare Metal instances are instances that aren’t using the Nitro hypervisor. Bare Metal instances provide direct access to physical server hardware. Therefore, they let you run legacy workloads that don’t support a virtual environment, license-restricted business-critical applications, or even your own hypervisor. Workloads on Bare Metal instances continue to utilize AWS Cloud features, such as Amazon Elastic Block Store (Amazon EBS), Elastic Load Balancing (ELB), and Amazon Virtual Private Cloud (Amazon VPC).

What is an Amazon EC2 Dedicated Host?

An Amazon EC2 Dedicated Host is a physical server fully dedicated to a single customer. With visibility of sockets and physical cores of the Dedicated Host, you can address corporate compliance requirements, such as per-socket, per-core, or per-VM software licensing agreements.

You can launch EC2 instances onto a Dedicated Host. Instance families such as M5, C5, R5, M5n, C5n, and R5n allow for the launching of different instance sizes, such as4xlarge and 8xlarge, to the same host. Other instance families only support a homogenous launching of a single instance size. For more details, see Dedicated Host instance capacity.

As an example, let’s look at an M6i Dedicated Host. M6i Dedicated Hosts have 2 sockets and 64 physical cores. If you allocate a M6i Dedicated Host, then you can specify what instance type you’d like to support for allocation. In this case, possible instance sizes are:

  • large
  • xlarge
  • 2xlarge
  • 4xlarge
  • 8xlarge
  • 12xlarge
  • 16xlarge
  • 24xlarge
  • 32xlarge
  • metal

The number of instances that you can launch on a single M6i Dedicated Host depends on the selected instance size. For example:

  • In the case of xlarge (4 vCPUs), a maximum of 32 m6i.xlarge instances can be scheduled on this Dedicated Host.
  • In the case of 8xlarge (32 vCPUs), a maximum of 4 m6i.8xlarge instances can be scheduled on this Dedicated Host.
  • In the case of metal (128 vCPUs), a maximum of 1 m6i.metal instance can be scheduled on this Dedicated Host.

When launching an EC2 instance on a Dedicated Host, you’re billed for the Dedicated Host but not for the instance. The cost for Amazon EBS volumes is the same as in the case of regular EC2 instances.

Exemplary homogenious M6i Dedicated Host shown with 32 m6i.xlarge, four m6i.8xlarge and one m6i.metal each.

Exemplary M6i Dedicated Host instance selections: m6i.xlarge, m6i.8xlarge and m6i.metal

Understanding the EC2 instance lifecycle

Amazon EC2 instance lifecycle states and transitions

Throughout its lifecycle, an EC2 instance transitions through different states, starting with its launch and ending with its termination. Upon Launch, an EC2 instance enters the pending state. You can only launch EC2 instances on Dedicated Hosts in the available state. You aren’t billed for the time that the EC2 instance is in any state other than running. When launching an EC2 instance on a Dedicated Host, you’re billed for the Dedicated Host but not for the instance. Depending on the user action, the instance can transition into three different states from the running state:

  1. Via Reboot from the running state, the instance enters the rebooting state. Once the reboot is complete, it reenters the running state.
  2. In the case of an Amazon EBS-backed instance, a Stop or Stop-Hibernate transitions the running instance into the stopping state. After reaching the stopped state, it remains there until further action is taken. Via Start, the instance will reenter the pending and subsequently the running state. Via Terminate from the stopped state, the instance will enter the terminated state. As part of a Stop or Stop-Hibernate and subsequent Start, the EC2 instance may move to a different AWS-managed host. On Reboot, it remains on the same AWS-managed host.
  3. Via Terminate from the running state, the instance will enter the shutting-down state, and finally the terminated state. An instance can’t be started from the terminated state.

Understanding the Amazon EC2 Dedicated Host lifecycle

A diagram of the the Amazon EC2 Dedicated Host lifecycle states and transitions between them.

Amazon EC2 Dedicated Host lifecycle states and transitions

An Amazon EC2 Dedicated Host enters the available state as soon as you allocate it in your AWS account. Only if the Dedicated Host is in the available state, you can launch EC2 instances on it. You aren’t billed for the time that your Dedicated Host is in any state other than available. From the available state, the following states and state transitions can be reached:

  1. You can Release the Dedicated Host, transitioning it into the released state. Amazon EC2 Mac Instances Dedicated Hosts have a minimum allocation time of 24h. They can’t be released within the 24h. You can’t release a Dedicated Host that contains instances in one of the following states: pending, running, rebooting, stopping, or shutting down. Consequently, you must Stop or Terminate any EC2 instances on the Dedicated Host and wait until it’s in the available state before being able to release it. Once an instance is in the stopped state, you can move it to a different Dedicated Host by modifying its Instance placement configuration.
  2. The Dedicated Host may enter the pending state due to a number of reasons. In case of an EC2 Mac instance, stopping or terminating a Mac instance initiates a scrubbing workflow of the underlying Dedicated Host, during which it enters the pending state. This scrubbing workflow includes tasks such as erasing the internal SSD, resetting NVRAM, and more, and it can take up to 50 minutes to complete. Additionally, adding or removing a Dedicated Host to or from a Resource Group can cause the Dedicated Host to go into the pending state. From the pending state, the Dedicated Host will reenter the available state.
  3. The Dedicated Host may enter the under-assessment state if AWS is investigating a possible issue with the underlying infrastructure, such as a hardware defect or network connectivity event. While the host is in the under-assessment state, all of the EC2 instances running on it will have the impaired status. Depending on the nature of the underlying issue and if it’s configured, the Dedicated Host will initiate host auto recovery.

If Dedicated Host Auto Recovery is enabled for your host, then AWS attempts to restart the instances currently running on a defect Dedicated Host on an automatically allocated replacement Dedicated Host without requiring your manual intervention. When host recovery is initiated, the AWS account owner is notified by email and by an AWS Health Dashboard event. A second notification is sent after the host recovery has been successfully completed. Initially, the replacement Dedicated Host is in the pending state. EC2 instances running on the defect dedicated Host remain in the impaired status throughout this process. For more information, see the Host Recovery documentation.

Once all of the EC2 instances have been successfully relaunched on the replacement Dedicated Host, it enters the available state. Recovered instances reenter the running state. The original Dedicated Host enters the released-permanent-failure state. However, if the EC2 instances running on the Dedicated Host don’t support host recovery, then the original Dedicated Host enters the permanent-failure state instead.

Conclusion

In this post, we’ve explored the lifecycles of Amazon EC2 instances and Amazon EC2 Dedicated Hosts. We took a close look at the individual lifecycle states and how both lifecycles must be considered in unison to operate EC2 Instances on EC2 Dedicated Hosts correctly and consistently. To learn more about operating Amazon EC2 Dedicated Hosts, visit the EC2 Dedicated Hosts User Guide.

Use AWS Nitro Enclaves to perform computation of multiple sensitive datasets

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/leveraging-aws-nitro-enclaves-to-perform-computation-of-multiple-sensitive-datasets/

This blog post is written by, Jeff Wisman, Principal Solutions Architect and Andrew Lee, Solutions Architect.

Introduction

Many organizations have sensitive datasets that they do not want to share with others because of stringent security and compliance requirements. However, they would still like to use each other’s data to perform processing and aggregation. For example, B2B (business to business) companies often want to augment their customer information dataset with additional demographic or psychographic signals. This enrichment of data is often done by one party sending customer information to be matched against another party’s data universe. Naturally, privacy and the revealing of business-critical customer information to an external entity is a major concern here. In this blog, we present a solution where multiple parties can choose to give an isolated compute environment access to their encrypted data to be decrypted and processed in a secure way using AWS Nitro Enclaves.

Designing and building your own secure private computing solution can be challenging, with few out-of-the-box solutions. Our sample application uses Nitro Enclaves, which support the creation of an isolated execution environment called an enclave and a cryptographic attestation process for generating and validating the enclave’s identity. The attestation process makes it possible to ensure only authorized code is running, as well as integration with the AWS Key Management Service (AWS KMS), so that only enclaves that you choose can access sensitive data. Nitro Enlaves enables customers to focus more on their application instead of worrying about integration with external services. While many enterprise use cases involve complex datasets, we’ll use a hypothetical scenario to learn the fundamentals of how this works. The example proof of concept (POC) application will be centered around a third-party bidding service for real estate transactions. Buyers will submit encrypted bids to the application. Once all the bids have been entered, the application will decrypt the bids, determine the highest bidder, and return a result without disclosing the actual bid amounts to any party.

How it works

The POC will be deployed across three AWS accounts, one each for buyer-1, buyer-2, and the bidding service. The bidding service will be run in an enclave on the bidding service’s account.

  1. The bidding service will generate a set of measurements called platform configuration registers (PCRs) from the application code that uses the attestation process. PCRs are cryptographic measurements that are unique to an enclave. An attestation document can be used to verify the identity of the enclave and establish trust.
  2. The buyers will each generate their own AWS KMS key and use AWS KMS to authorize cryptographic requests from the bidding service enclave based on PCR values in the attestation document.
  3. The buyers will place their bids into a file, encrypt the file using AWS KMS, and store them in their own Amazon Simple Storage Service (Amazon S3)
  4. The bidding service will run the application, which will retrieve the encrypted bids from each buyer’s S3 bucket, decrypt the bids, calculate the highest bidder, and store the result in an S3 bucket.

Overall workflow of POC

Implementation

Let’s take a deeper dive into the steps involved in implementing this POC. To deploy the POC to your environment, follow the instructions in the AWS Nitro Enclaves Bidding Service GitHub project.

Enclave image generation

The first step is for the bidding service to launch the parent instance that will host the enclave. Refer to Launch the parent instance for more information about this process. The two real estate buyers, which we will call buyer-1 and buyer-2, will need to review the application code of the enclave application and agree that their data will not be exposed outside the enclave. Once they agree on the code, the bidding service generates the enclave image and a set of measurements as part of the attestation process. The buyers should also perform this process to ensure that the enclave image was generated and its measurements are from the agreed-upon application code. During the generation process, a unique set of measurements is taken of the application, which will make up its identity. When the enclave makes a request to decrypt data with AWS KMS, those measurements will be included in an attestation document to prove the enclave’s identity. Access policies in AWS KMS can then grant access to that identity. An example of a set of measurements is shown here:

Enclave Image successfully created.
{ "Measurements": { "HashAlgorithm": "Sha384 { ... }",
"PCR0":"287b24930a9f0fe14b01a71ecdc00d8be8fad90f9834d547158854b8279c74095c43f8d7f047714e98deb7903f20e3dd",
"PCR1":"aca6e62ffbf5f7deccac452d7f8cee1b94048faf62afc16c8ab68c9fed8c38010c73a669f9a36e596032f0b973d21895",
"PCR2":"0315f483ae1220b5e023d8c80ff1e135edcca277e70860c31f3003b36e3b2aaec5d043c9ce3a679e3bbd5b3b93b61d6f"
} }

Preparing encrypted data

Each buyer will create an AWS KMS key and use that key to encrypt their bids. The encrypted bids will be stored in their respective S3 buckets. Because AWS KMS integrates with Nitro Enclaves to provide built-in attestation support, each buyer can add the PCR values generated earlier as a condition to their respective AWS KMS key policies. This will ensure that only the enclave application code agreed upon by both buyers will have access to utilize the keys for decryption. The following is an example of a KMS key policy with PCR values as a condition:

{
  "Version": "2012-10-17",
  "Id": "key-default-1",
  "Statement": [{
    "Sid": "Enable decrypt from enclave",
    "Effect": "Allow",
    "Principal": < PARENT INSTANCE ROLE ARN > ,
    "Action": "kms:Decrypt",
    "Resource": "",
    "Condition": {
      "StringEqualsIgnoreCase": {
        "kms:RecipientAttestation:ImageSha384": "<PCR0 VALUE FROM BUILDING ENCLAVE IMAGE>",
        "kms:RecipientAttestation:PCR1":"<PCR1 VALUE FROM BUILDING ENCLAVE IMAGE>",
        "kms:RecipientAttestation:PCR2":"<PCR2 VALUE FROM BUILDING ENCLAVE IMAGE>"
      }
    }
  }]
}

The previous example only shows a key policy that uses PCR0, PCR1, and PCR2. You can further scope down the permissions by adding additional PCR values, for instance, role, parent instance ID, and a signing certificate for the enclave image. Refer to the AWS Nitro Enclaves User Guide for more details about PCR values.

Running the POC

The bidding service will run the enclave image generated earlier on the parent instance. The application runs as two parts, one part on the parent instance and another in the enclave. Communication between the parent and the enclave is done through a vsock connection. An AWS KMS proxy is also used on the parent to allow communication between the enclave and AWS KMS for decrypting data. The parent application will retrieve the encrypted bids from each buyer’s S3 bucket and send them to the enclave. The enclave will decrypt the data using both buyers’ AWS KMS keys and present attestation documents signed by the Nitro Hypervisor. AWS KMS will validate that the PCR values in the attestation documents match the key policy before performing the decryption. The decryption event will be logged in AWS CloudTrail for auditing purposes. Once the encrypted bids are decrypted, the values are compared, and the winning buyer is recorded in the result. The unencrypted result is then returned to the parent and written to an S3 bucket in the bidding service’s account. A diagram of this process is shown in Figure 2.

Cleanup

Be sure you delete all the resources that were created when following the included Github project:

  • Bidding service EC2 instance EC2 instance IAM role and policy
  • AWS KMS Customer managed keys
  • S3 Buckets for storing encrypted files

Conclusion

In this blog post, we introduced a sample POC utilizing Nitro Enclaves to allow a bidding service to process two parties’ encrypted data without revealing their data to any party. We did this by ensuring access to sensitive data is only allowed from an application running within an enclave. With the straightforward integration of AWS KMS and the attestation process, customers can quickly develop applications on Nitro Enclaves that can enable computing on encrypted datasets from multiple accounts. For more information on AWS Nitro Enclaves, see the official product documentation or the introductory videos on YouTube.

Implementing Attribute-Based Instance Type Selection using Terraform

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/implementing-attribute-based-instance-type-selection-using-terraform/

This blog post is written by Christian Melendez, Senior Specialist Solutions Architect, Flexible Compute – EC2 Spot and Carlos Manzanedo Rueda, WW SA Leader, Flexible Compute – EC2 Spot.

In this blog post we will cover the release of Terraform support for Attribute-Based Instance Type Selection (ABS). ABS simplifies the configuration required to acquire compute capacity for Instance Flexible workloads. Terraform  is an open-source infrastructure as code software tool by HashiCorp. Hashicorp is an AWS Partner Network (APN) Advanced Technology Partner and member of the AWS DevOps Competency.

Introduction

Amazon EC2 provides a wide selection of instance types optimized to fit different use cases. Instance types comprise varying combinations of CPU, memory, storage, and networking capacity and give you the flexibility to choose the appropriate mix of resources for your applications.

Workloads such as continuous integration, analytics, microservices on containers, etc., can use multiple instance types. Customers have been telling us that simplifying the configuration of instance flexible workloads is important. For workloads that are instance flexible, AWS released ABS to express workload requirements as a set of instance attributes such as: vCPU, memory, type of processor, etc. ABS translates these requirements and selects all matching instance types that meet the criteria. To select which instance to launch, Amazon EC2 Auto Scaling Groups and EC2 Fleet chose instances based on the allocation strategy configured. Lowest-price allocation strategy is supported on both Amazon EC2 On-Demand Instances and Amazon EC2 Spot Instances. The recommendation for Spot Instances is to use capacity-optimized, which select the optimal instances that reduce the frequency of interruptions. ABS does also future-proof EC2 Auto Scaling Group and EC2 Fleet configurations: any new instance type we launch that matches the selected attributes, will be included in the list automatically. No need to update your EC2 Auto Scaling Group or EC2 Fleet configuration.

Following our commitment to open-source projects, AWS has added support for ABS in the AWS Terraform provider. You can use ABS for launch templates, EC2 Auto Scaling Group, and EC2 Fleet resources. The minimum required version of the AWS provider is v4.16.0.

Applying instance flexibility is key for running fault-tolerant, elastic, reliable, and cost optimized workloads. By selecting a diversified choice of instances that qualify for your workload, your application will be better prepared to avoid scenarios where lack of capacity on a specific instance type could be an issue. This applies both to On-Demand and Spot Instance-based workloads. For Spot Instances, applying diversification is key. Spot Instances are spare capacity that can be reclaimed by EC2 when it is required. ABS allows you to specify diversification in simple terms, allowing EC2 Auto Scaling Group and EC2 Fleet’s allocation strategy to replace reclaimed Spot Instances with instances from other pools where capacity is available.

Instance Requirement Attributes

To represent the instance requirements for your workload using ABS, there are a set of attributes you can use within the instance_requirements block. When using Terraform, the only two required attributes are memory_mib and vcpu_count. The rest of the attributes provide default values that adhere to Instance Flexible workloads best practices. For example, bare_metal attribute is by default excluded. You can see the full list of ABS attributes in the Terraform docs site.

Once ABS attributes are configured, ABS picks a list of instance types that match the criteria. This list is especially important when you’re using Spot Instances. One of Spot Instances best practices is to diversify the instance types which, in combination with the capacity-optimized allocation strategy, gives you access to the highest amount of Spot capacity pools. For On-Demand Instances, the instance types list is important as well. There might be scenarios where On-Demand Instance pools lack capacity. By applying instance flexibility using ABS, you can avoid the InsufficientInstanceCapacity error. And in combination with the lowest-price allocation strategy, you get the lowest price instance types from your diversified selection.

There are different places where we can specify ABS attributes. We can specify them at the launch templates level and declare a base mechanism to select instances. In most cases, the recommendation is to configure ABS attributes at the EC2 Auto Scaling Group and EC2 Fleet level. Let’s explore each of these options.

Configuring instance requirements within launch templates

Launch templates are instance configuration templates where you specify parameters like AMI ID, instance type, key pair, and security groups to launch instances. You can use ABS attributes in a launch template when you need to be prescriptive, and define sane defaults or guardrails for your workloads. This way, EC2 Auto Scaling Group or EC2 Fleet simply reference and use the launch template.

You should use ABS attributes in launch templates when you want to prevent users from overriding the resources specified by a launch template. Note that is still possible to override those requirements.

Let’s say that we have a Java application that requires a minimum of 4 vCPUs and 8 GiB of memory, and has been using the c5.xlarge instance type. After performance testing we’ve identified that runs better with current instance types generations. The following code snippet represents how to define these requirements in a Launch Template. To see the full list of attributes, visit the launch template doc site.

resource "aws_launch_template" "abs" {
  name_prefix = "abs"
  image_id    = data.aws_ami.abs.id

  instance_requirements {
    memory_mib {
      min = 8192
    }
    vcpu_count {
      min = 4
    }
    instance_generations = ["current"]
  }
}

Note by using instance_requirements block in a LaunchTemplate, you’ll need to use the mixed_instances_policy block in the EC2 Auto Scaling Group.

resource "aws_autoscaling_group" "on_demand" {
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  max_size           = 1
  min_size           = 1
  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.abs.id
      }
    }
  }
}

The EC2 Auto Scaling Group will use instance types that match the requirements in the launch template.

You can preview what are the instances that the EC2 Auto Scaling Group will select. The section “How to Preview Matching Instances without Launching Them” of the ABS blog post, describes how to preview the instances that will be selected.

Launching EC2 Spot Instances with EC2 Auto Scaling

Launch templates are very powerful. They allow you to decouple the attributes such as user_data from the actual instance management. They are idempotent and can be versioned which is key for rolling out changes in the configuration and applying EC2 Auto Scaling Group Instance Refresh.

Our recommendation is to define ABS attributes as overrides within the mixed_instances_policy block in EC2 Auto Scaling Groups. For most of the applications, we recommend using EC2 Auto Scaling Group to provision EC2 instances.

Let’s get back to our previous example. Now we want to be more prescriptive for the instance requirements our Java application uses.

Let’s assume that the Java application is memory intensive, requires a vCPU to Memory ratio of 4GB for every vCPU, and has been using the m5.large instance type. Additionally, it does not need hardware accelerators like GPUs, FPGAs, etc. Our application does also require a minimum of 2 vCPUs, and the range of memory has been reduced to 32GB to avoid large garbage collection scenarios. This time, we’d like to launch only Spot Instances. As mentioned earlier in this blog post, diversification is key for Spot. To improve the experience with Spot, let’s enable the capacity rebalance feature from the EC2 Auto Scaling Group to proactively replace instances that are at an elevated risk of being interrupted. The following code snippet represents the ABS attributes we need for the more prescriptive workload:

resource “aws_autoscaling_group” “spot” {
  availability_zones = [“us-east-1a”, “us-east-1b”, “us-east-1c”]
  desired_capacity   = 1
  max_size           = 1
  min_size           = 1
  capacity_rebalance = true

  mixed_instances_policy {
    instances_distribution {
      spot_allocation_strategy = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.x86.id
      }

      override {
        instance_requirements {
          memory_mib {
            min = 4096
            max = 32768
          }

          vcpu_count {
            min = 2
          }

          memory_gib_per_vcpu {
            min = 4
            max = 4
          }

          accelerator_count {
            max = 0
          }
        }
      }
    }
  }
}

Launching EC2 Spot Instances with EC2 Fleet

Another method we have to launch EC2 instances is the EC2 Fleet API. We recommend to use EC2 Fleet for workloads that need granular controls to provision capacity. For example, tight HPC workloads where instances must be close together (single Availability Zone and within the same placement group) and need similar instance types. EC2 Fleet is also used by capacity orchestrators such as Karpenter or Atlassian Escalator that implement tuned up and optimized logic to provision capacity.

Let’s say that this time the workload is a CPU bound workload and has been using the c5.9xlarge instance type. The workload can be retried and the application supports checkpointing, so it qualifies to use Spot Instances. Given we’ll be using Spot Instances, we would like to benefit from the capacity rebalance feature as we did before in the EC2 Auto Scaling Group example. The application requires very prescriptive ranges of vCPU and memory and we also need a minimum of 100 GB SSD local storage. While in most cases EC2 Auto Scaling Groups are appropriate solution to procure and maintain capacity, in this case we will use EC2 Fleet.

The following code snippet represents the ABS attributes we need for this workload:

resource "aws_ec2_fleet" "spot" {
  target_capacity_specification {
    default_target_capacity_type = "spot"
    total_target_capacity        = 5
  }

  spot_options {
    allocation_strategy = "capacity-optimized"
    maintenance_strategies {
      capacity_rebalance {
        replacement_strategy = "launch"
      }
    }
  }

  launch_template_config {
    launch_template_specification {
      launch_template_id = aws_launch_template.x86.id
      version            = aws_launch_template.x86.latest_version
    }

    override {
      instance_requirements {
        memory_mib {
          min = 65536
          max = 73728
        }

        vcpu_count {
          min = 32
          max = 36
        }

        cpu_manufacturers   = ["intel"]
        local_storage       = "required"
        local_storage_types = ["ssd"]

        total_local_storage_gb {
          min = 100
        }
      }
    }
  }
}

Multi-Architecture workloads using Graviton and x86 with EC2 Auto Scaling Groups

Another recent feature from EC2 Auto Scaling Groups is that you can build multi-architecture workloads. EC2 Auto Scaling Group allows you to mix Graviton and x86 instance types in the same EC2 Auto Scaling Group. Unlike the x86_64 instances we have used so far, AWS Graviton processors are custom built by AWS using 64-bit Arm. You need to use different launch templates as each architecture needs to use a different AMI. This can be defined in within the override block.

In the example below, we use ABS to define different attributes depending on the CPU architecture. And what’s great about doing this is that we don’t need to exclude instance types. Instances will be launched with a compatible CPU architecture based on the AMI that you specify in our launch template.

Besides supporting mixing architectures, EC2 Auto Scaling Groups allows to combine purchase models. For our example, this time we’ll use a more complex scenario to showcase how powerful and feature rich EC2 Auto Scaling Group has become. The following code snippet applies many of the configurations we’ve seen before, but the key difference here is that we have two overrides. One is for Graviton instances, and another one for x86 instances.

resource "aws_autoscaling_group" "on_demand_spot" {
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  desired_capacity   = 4
  max_size           = 10
  min_size           = 2
  capacity_rebalance = true

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.arm.id
      }

      override {
        launch_template_specification {
          launch_template_id = aws_launch_template.arm.id
        }

        instance_requirements {
          memory_mib {
            min = 16384
            max = 16384
          }

          vcpu_count {
            max = 4
          }
        }
      }

      override {
        launch_template_specification {
          launch_template_id = aws_launch_template.x86.id
        }

        instance_requirements {
          memory_mib {
            min = 16384
          }

          vcpu_count {
            min = 4
          }
        }
      }
    }
  }
}

Multi-architecture workloads can also be applied to container orchestration. Thanks to the Amazon Elastic Container Registry (Ama­zon ECR) support for multi-architecture container images, you can use manifests to push both container images, and Amazon ECR will pull the proper image based on the CPU architecture. We have a workshop for Elastic Kubernetes Service (Amazon EKS) where you can learn more about how to deploy a multi-architecture workload in Amazon EKS.

Conclusion

In this post you’ve learned how to configure ABS to launch instances using Auto Scaling Groups and EC2 Fleet. You’ve learned and how to mix CPU architectures along with purchase models in the same EC2 Auto Scaling Group using Terraform while simplifying the configuration.

Our commitment with open-source projects such as Terraform is to help customers implement AWS best practices in a larger ecosystem easily. ABS support allows customers to get access to compute capacity by simply specifying the resource requirement attributes of their workloads rather than the instance names.

ABS simplifies the configuration for instance flexible workloads and removes the need to list the instances that qualify for your workload. Instead, it simplifies the configuration and future proof for scenarios where AWS includes new instances that qualify for the workload. For Spot workloads where instance diversification is key, ABS simplifies the selection of instances and helps to increase the total number of pools. For more information, visit the ABS user guide and Terraform documentation for Auto Scaling Groups and EC2 Fleet.

Adding approval notifications to EC2 Image Builder before sharing AMIs

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/adding-approval-notifications-to-ec2-image-builder-before-sharing-amis/

This blog post was written by, Glenn Chia Jin Wee, Associate Cloud Architect at AWS and Randall Han, Associate Professional Services Consultant at AWS.

In some situations, you may be required to manually validate the Amazon Machine Image (AMI) built from an Amazon Elastic Compute Cloud (Amazon EC2) Image Builder pipeline before sharing this AMI to other AWS accounts or to an AWS Organization. Currently, Image Builder provides an end-to-end pipeline that automatically shares AMIs after they’ve been built.

In this post, we will walk through the steps to enable approval notifications before AMIs are shared with other AWS accounts. Having a manual approval step could be useful if you would like to verify the AMI configurations before it is shared to other AWS accounts or an AWS Organization. This reduces the possibility of incorrectly configured AMIs being shared to other teams which in turn could lead to downstream issues if applications are installed using this AMI. This solution uses serverless resources to send an email with a link that automatically shares the AMI with the specified AWS accounts. Users select this link after they’ve verified that the AMI is built according to specifications.

Overview

Architecture Diagram

  1. In this solution, an Image Builder Pipeline is run that builds a Golden AMI in Account A. After the AMI is built, Image Builder publishes data about the AMI to an Amazon Simple Notification Service (Amazon SNS) topic.
  2. This SNS Topic passes the data to an AWS Lambda function that subscribes to it.
  3. The Lambda function that subscribes to this topic retrieves the data, formats it, and sends a customized email to another SNS Topic.
  4. The second SNS Topic has an email subscription with the Approver’s email. The approver will receive the customized email with a URL that interacts with the next set of Serverless resources.
  5. Selecting the URL makes a GET request to Amazon API Gateway, thereby passing the AMI ID in the query string.
  6. API Gateway then triggers another Lambda function and passes the AMI ID to it.
  7. The Lambda function obtains the AMI ID from the query string parameter of the API Gateway request, and then shares it with the provided target account.

Prerequisites

For this walkthrough, you will need the following:

Walkthrough

In this section, we will guide you through the steps required to deploy the Image Builder solution that utilizes Serverless resources. The solution is deployed with AWS SAM.

In this scenario, we deploy the solution within the approver’s account. The approval email will be sent to a predefined email address for manual approval, before the newly created AMI is shared to target accounts.

Once the approver selects the approval link, an email notification will be sent to the predefined target account email address, notifying that the AMI has been successfully shared.

The high-level steps we will follow are:

  1. In Account A, deploy the provided AWS SAM template. This includes an example Image Builder Pipeline, Amazon SNS topics, API Gateway, and Lambda functions.
  2. Approve the SNS subscription from your supplied email address.
  3. Run the pipeline from the Amazon EC2 Image Builder Console.
  4. [Optional] After the pipeline runs, launch an Amazon EC2 instance from the built AMI to conduct manual tests
  5. An Amazon SNS email will be sent to you with an API Gateway URL. When clicked, an AWS Lambda function shares the AMI to the Account B.
  6. Log in to Account B and verify that the AMI has been shared.

Step 1: Launch the AWS SAM template

  1. Clone the SAM templates from this GitHub repository.
  2. Run the following command to deploy the templates via SAM. Replace <approver email> with the Approver’s email and <AWS Account B ID> with the AWS Account ID of your second AWS Account.

sam deploy \

–template-file template.yaml \

–stack-name ec2-image-builder-approver-notifications \

–capabilities CAPABILITY_IAM \

–resolve-s3 \

–parameter-overrides \

ApproverEmail=<approver email> \

TargetAccountEmail=<target account email> \

TargetAccountlds=<AWS Account B ID>

Step 2: Verify your email address

  1. After running the deployment, you will receive an email prompting you to confirm the Subscription at the approver email address. Choose Confirm subscription.

Email to confirm SNS topic subscription

  1. This leads to the following screen, which shows that your subscription is confirmed.

SNS topic subscription confirmation

  1. Repeat the previous 2 steps for the target email address.

Step 3: Run the pipeline from the Image Builder console

  1. In the Image Builder console, under Image pipelines, select the checkbox next to the Pipeline created, choose Actions, and select Run pipeline.

Run the Image Builder Pipeline

Note that the pipeline takes approximately 20 to 30 minutes to complete.

Step 4: [Optional] Launch an Amazon EC2 instance from the built AMI

There could be a requirement to manually validate the AMI before sharing it to other AWS accounts or to the AWS organization. With this requirement, approvers will launch an Amazon EC2 instance from the built AMI and conduct manual tests on the EC2 instance to make sure that it is functional.

  1. In the Amazon EC2 console, under Images, choose AMIs. Validate that the AMI is created.

Validate the AMI has been built

  1. Follow AWS docs: Launching an EC2 instances from a custom AMI for steps on how to launch an Amazon EC2 instance from the AMI.

Step 5: Select the approval URL in the email sent

  1. When the pipeline is run successfully, you will receive another email with a URL to share the AMI.

Approval link to share the AMI to Account B

2. Selecting this URL results in the following screen which shows that the AMI share is successful.

Result showing the AMI was successfully shared after selecting the approval link

Step 6: Verify that the AMI is shared to Account B

  1. Log in to Account B.
  2. In the Amazon EC2 console, under Images, choose AMIs. Then, in the dropdown, choose Private images. Validate that the AMI is shared.

AMI is shared when Private images are selected from the dropdown

3. Verify that a success email notification was sent to the target account email address provided.

Successful AMI share email notification sent to Target Account Email Address

Clean up

This section provides the necessary information for deleting various resources created as part of this post.

1. Deregister the AMIs created and shared.

a. Log in to Account A and follow the steps at AWS documentation: Deregister your Linux AMI.

2. Delete the SAM stack with the following command. Replace <region> with the Region of choice.

sam delete –stack-name ec2-image-builder-approver-notifications –no-prompts –region <region>

3. Delete the CloudWatch log groups for the Lambda functions. You’ll identify it with the name `/aws/lambda/ec2-image-builder-approve*`.

4. Consider deleting the Amazon S3 bucket used to store the packaged Lambda artifact.

Conclusion

In this post, we explained how to use Serverless resources to enable approval notifications for an Image Builder pipeline before AMIs are shared to other accounts. This solution can be extended to share to more than one AWS account or even to an AWS organization. With this solution, you will be notified when new golden images are created, allowing you to verify the correctness of their configuration before sharing them to for wider use. This reduces the possibility of sharing AMIs with misconfigurations that the written tests may not have identified.

We invite you to experiment with different AMIs created using Image Builder, and with different Image Builder components. Check out this GitHub repository for various examples that use Image Builder. Also check out this blog on Image builder integrations with EC2 Auto Scaling Instance Refresh. Let us know your questions and findings in the comments, and have fun!

Capturing GPU Telemetry on the Amazon EC2 Accelerated Computing Instances

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/capturing-gpu-telemetry-on-the-amazon-ec2-accelerated-computing-instances/

This post is written by Amr Ragab, Principal Solutions Architect EC2.

AWS is excited to announce the native integration of monitoring GPU metrics through the CloudWatch Agent. Customers can now easily monitor GPU utilization and its memory to scale their workloads more effectively without custom scripts. In this post, we’ll describe how to allow GPU monitoring and integrate it into your Amazon Machine Images (AMI). Furthermore, we’ll extend this to include the monitoring of GPU hardware events utilizing CloudWatch Log Streams. By combining this telemetry into the Amazon CloudWatch Console, customers can have a complete picture of GPU activity across their fleets.

Capturing GPU metrics

There is an extensive list of NVIDIA accelerator metrics that can be captured. Depending on the workload type, it may be unnecessary to capture all of the metrics at all times. The following table breaks down the suggested metrics to collect by workload type. This considers a balance of cost and impactful metrics at scale.

Compute (Machine Learning (ML), High Performance Computing (HPC)) Graphics/Gaming
utilization_gpu
power_draw
utilization_memory
memory_total
memory_used
memory_free
pcie_link_gen_current
pcie_link_width_current
clocks_current_smclocks_current_memory
utilization_gpu
utilization_memory
memory_total
memory_usedmemory_free
pcie_link_gen_current
pcie_link_width_current
encoder_stats_session_count
encoder_stats_average_fps
encoder_stats_average_latency
clocks_current_graphics
clocks_current_memory
clocks_current_video

Moreover, this is supported through custom AMIs that are deployed with managed service offerings, including Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Services (Amazon ECS), and AWS ParallelCluster w/ SLURM for HPC clusters.

The following is an example screenshot from the CloudWatch Console showcasing the telemetry captured for a P4d instance. You can see that we captured the preceding metrics on a per-GPU index. Each Amazon Elastic Compute Cloud (Amazon EC2) P4d instance has 8x A100 GPUs.

Cloudwatch Console

Capturing GPU Xid events

Xid events are a reporting mechanism from GPU hardware vendors that emit notable events from the device to the OS in this case we are capturing the events through the NVRM kernel module. Current GPU architecture requires that the full GPU with protections are passed into the running instance. Thus, most errors that manifest inside of the customer instance aren’t directly visible to the Amazon EC2 virtualization stack. Although some of these errors are benign, others indicate problems with the customer application, the NVIDIA driver, and under rare circumstances a defect in the GPU hardware.

For NVIDIA-based Amazon EC2 instances, these errors will be logged in the system journal with an “NVRM:” regular expression.

These events can be collected and pushed to Amazon CloudWatch Logs as a stream. When an Xid event occurs on the GPU, it will parse the event and push it the log stream for that instance ID in the Region in which the instance is running. The following steps are required to get started capturing those events.

Deployment

We’ll cover the deployment in two different use-cases: 1. You have an existing instance running and you want to start to capture metrics and XID events. 2. You want to build and an AMI and use it within Amazon EC2 or additional services.

I. On a running Amazon EC2 instance

Step 1. Attach an IAM Role to the EC2 instance that has permission to CloudWatch Metrics/Logs. The following is an IAM policy that you can attach to your IAM Role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricStream",
                "logs:CreateLogDelivery",
                "logs:CreateLogStream",
                "cloudwatch:PutMetricData",
                "logs:UpdateLogDelivery",
                "logs:CreateLogGroup",
                "logs:PutLogEvents",
                "cloudwatch:ListMetrics"
            ],
            "Resource": "*"
        }
    ]
}

Step 2. Connect to a shell on the EC2 instance (through SSM or SSH). Install the CloudWatch Agent following the instructions here. There is support across architectures and distributions.

Step 3. Next, we can create our CloudWatch Agent JSON configuration file. The following JSON snippet will capture the logs from gpuerrors.log and push to CloudWatch Logs. Save the contents of the following JSON snippet to a file on the instance at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.

{
     "agent": {
         "run_as_user": "root"
     },
     "metrics": {
         "append_dimensions": {
             "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
             "ImageId": "${aws:ImageId}",
             "InstanceId": "${aws:InstanceId}",
             "InstanceType": "${aws:InstanceType}"
         },
         "aggregation_dimensions": [["InstanceId"]],
         "metrics_collected": {
            "nvidia_gpu": {
                "measurement": [
                    "utilization_gpu",
                    "utilization_memory",
                    "memory_total",
                    "memory_used",
                    "memory_free",
                    "clocks_current_graphics",
                    "clocks_current_sm",
                    "clocks_current_memory"
                ]
            }
         }
     },
     "logs": {
         "logs_collected": {
             "files": {
                 "collect_list": [
                     {
                         "file_path": "/var/log/gpuevent.log",
                         "log_group_name": "/ec2/accelerated/accel-event-log",
                         "log_stream_name": "{instance_id}"
                     }
                 ]
             }
         }
     }
 }

Step 4. To start capturing the logs, restart the aws cloudwatch systemd service.

sudo systemctl restart amazon-cloudwatch-agent.service

At this point, if you navigate to the CloudWatch Console in the Region that the instance is running, – All metrics – CWAgent, you should see a table of metrics similar to the following screenshot.

Cloudwatch Agent Metrics

Step 5. To capture the XID events it’s possible to use the same CloudWatch Log directive used in the preceding image were set the GPU metrics to capture. The JSON following snippet defines that we will stream the log in /var/log/gpuevent.log to CloudWatch.

"logs": {
         "logs_collected": {
             "files": {
                 "collect_list": [
                     {
                         "file_path": "/var/log/gpuevent.log",
                         "log_group_name": "/ec2/accelerated/accel-event-log",
                         "log_stream_name": "{instance_id}"
                     }
                 ]
             }
         }
     }

The GitHub project is an open source reference design for capturing these errors in the CloudWatch agent.

https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline

Step 6. Save the following file as /opt/aws/aws-hwaccel-event-parser.py|.go with the following contents, which will write the Xid errors parsed to /var/log/gpuevent.log:

The code is available in either Python3 or Go (> 1.16).

Golang code of the hwaccel-event-parser: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-ami_base/cloudwatch/nvidia/aws-hwaccel-event-parser.go

Python3 code: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-ami_base/cloudwatch/nvidia/aws-hwaccel-event-parser.py

As you can see from the code, this is a blocking thread, and it will be running during the lifetime of the instance or container.

Step 7. For ease of deployment, you can also create a systemd service (aws-hw-monitor.service), which will run at startup before the CloudWatch agent.

[Unit]
Description=HW Error Monitor
Before=amazon-cloudwatch-agent.service
After=syslog.target network-online.target

[Service]
Type=simple
ExecStart=/opt/aws/cloudwatch/aws-cloudwatch-wrapper.sh
RemainAfterExit=1
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

Where /opt/aws/cloudwatch/aws-cloudwatch-wrapper.sh is a script which contains:

#!/bin/bash
python3 /opt/aws/aws-hwaccel-event-parser.py &

Finally, enable and start the hw monitor service

sudo systemctl enable aws-hw-monitor.service –now

II. Building an AMI

For convenience, the following repo has what is needed to build the AMI for Amazon EC2, Amazon EKS, Amazon ECS, Amazon Linux 2, and Ubuntu 18.04/20.04 distributions. You must have Packer installed on your machine, and it must be authenticated to make API calls on your behalf to AWS. Generally you need to modify the variables:{} json and execute the packer build.

"variables": {
    "region": "us-east-1",
    "flag": "<flag>",
    "subnet_id": "<subnetid>",
    "security_groupids": "<security_group_id,security_group_id",
    "build_ami": "<buildami>",
    "efa_pkg": "aws-efa-installer-latest.tar.gz",
    "intel_mkl_version": "intel-mkl-2020.0-088",
    "nvidia_version": "510.47.03",
    "cuda_version": "cuda-toolkit-11-6 nvidia-gds-11-6",
    "cudnn_version": "libcudnn8",
    "nccl_version": "v2.12.7-1"
  },

After filling in the variables, check that the packer script is validated.

packer validate nvidia-efa-ml-al2.yml
packer build nvidia-efa-ml-al2.yml

The log group namespace is /ec2/accelerated/accel-event-log. However, you may change this to the namespace of your preference in the CloudWatch Agent config file created earlier.

Navigate to the CloudWatch Console – Logs – Log groups – /ec2/accelerated/accel-event-log. It’s sorted by instance ID, where the instance ID of the latest stream is on top.

CloudWatch Log-events

We can see in the preceding screenshot that an example workload ran on instance i-03a7b66de3198977e, which was a p4d.24xlarge triggered a Xid 63 event. Capturing these events is the first step. Next, we must interpret what these events mean. With each Xid error, there is a number associated with each event. As previously mentioned, these can be hardware errors, driver, and/or application errors. If you’re running on an Amazon EC2 accelerated instance, and after code execution run into one of these errors, contact AWS Support with the instance ID and Xid error. The following is a list of the more common Xid errors that you may encounter.

Xid Error Name Description Action
48 Double Bit ECC error Hardware memory error Contact AWS Support with Xid error and instance ID
74 GPU NVLink error Further SXid errors should also be populated which will inform on the error seen with the NVLink fabric Get information on which links are causing the issue by running nvidia-smi nvlink -e
63 GPU Row Remapping Event Specific to Ampere architecture –- a row bank is pending a memory remap Stop all CUDA processes, and reset the GPU (nvidia-smi -r), and make sure thatensure the remap is cleared in nvidia-smi -q
13 Graphics Engine Exception User application fault , illegal instruction or register Rerun the application with CUDA_LAUNCH_BLOCKING=1 enabled which should determine if it’s a NVIDIA driver or hardware issue
31 GPU memory page fault Illegal memory address access error Rerun the application with CUDA_LAUNCH_BLOCKING=1 enabled which should determine if it’s a NVIDIA driver or hardware issue

A quick way to check for row remapping failures is to run the below command on the instance.

nvidia-smi --query-remapped-
rows=gpu_name,gpu_bus_id,remapped_rows.failure,remapped_rows.pending,remapped_rows.correctable,remapped_rows.uncorrectable --format=csv
gpu_name, gpu_bus_id, remapped_rows.failure, remapped_rows.pending, remapped_rows.correctable, remapped_rows.uncorrectable
NVIDIA A100-SXM4-40GB, 00000000:10:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:10:1D.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:20:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:20:1D.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:90:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:90:1D.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:A0:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:A0:1D.0, 0, 0, 0, 0

This isn’t an exhaustive list of Xid events, but it provides some of the more common ones that you may come across as you develop your accelerated workload. You can find a more complete table of events here. Furthermore, if you have questions, you can reach out to AWS Support with the output of the tar ball created by executing the nvidia-bug-report.sh script included with the NVIDIA driver.

Conclusion

Get started with integrating this monitoring into your AMIs if you use custom AMIs specifically for key services, such as Amazon EKS, Amazon ECS, or Amazon EC2 with AWS ParallelCluster. This will help you discover utilization metrics for your accelerated computing workloads. If you have any questions about this post, then reach out to your account team.

Running AWS Lambda functions on AWS Outposts using AWS IoT Greengrass

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/running-aws-lambda-functions-on-aws-outposts-using-aws-iot-greengrass/

This blog post is written by Adam Imeson, Sr. Hybrid Edge Specialist Solution Architect.

Today, AWS customers can deploy serverless applications in AWS Regions using a variety of AWS services. Customers can also use AWS Outposts to deploy fully managed AWS infrastructure at virtually any datacenter, colocation space, or on-premises facility.

AWS Outposts extends the cloud by bringing AWS services to customers’ premises to support their hybrid and edge workloads. This post will describe how to deploy Lambda functions on an Outpost using AWS IoT Greengrass.

Consider a customer who has built an application that runs in an AWS Region and depends on AWS Lambda. This customer has a business need to enter a new geographic market, but the nearest AWS Region is not close enough to meet application latency or data residency requirements. AWS Outposts can help this customer extend AWS infrastructure and services to their desired geographic region. This blog post will explain how a customer can move their Lambda-dependent application to an Outpost.

Overview

In this walkthrough you will create a Lambda function that can run on AWS IoT Greengrass and deploy it on an Outpost. This architecture results in an AWS-native Lambda function running on the Outpost.

Architecture overview - Lambda functions on AWS Outposts

Deploying Lambda functions on Outposts rack

Prerequisites: Building a VPC

To get started, build a VPC in the same Region as your Outpost. You can do this with the create VPC option in the AWS console. The workflow allows you to set up a VPC with public and private subnets, an internet gateway, and NAT gateways as necessary. Do not consume all of the available IP space in the VPC with your subnets in this step, because you will still need to create Outposts subnets after this.

Now, build a subnet on your Outpost. You can do this by selecting your Outpost in the Outposts console and choosing Create Subnet in the drop-down Actions menu in the top right.

Confirm subnet details

Choose the VPC you just created and select a CIDR range for your new subnet that doesn’t overlap with the other subnets that are already in the VPC. Once you’ve created the subnet, you need to create a new subnet route table and associate it with your new subnet. Go into the subnet route tables section of the VPC console and create a new route table. Associate the route table with your new subnet. Add a 0.0.0.0/0 route pointing at your VPC’s internet gateway. This sets the subnet up as a public subnet, which for the purposes of this post will make it easier to access the instance you are about to build for Greengrass Core. Depending on your requirements, it may make more sense to set up a private subnet on your Outpost instead. You can also add a route pointing at your Outpost’s local gateway here. Although you won’t be using the local gateway during this walkthrough, adding a route to the local gateway makes it possible to trigger your Outpost-hosted Lambda function with on-premises traffic.

Create a new route table

Associate the route table with the new subnet

Add a 0.0.0.0/0 route pointing at your VPC’s internet gateway

Setup: Launching an instance to run Greengrass Core

Create a new EC2 instance in your Outpost subnet. As long as your Outpost has capacity for your desired instance type, this operation will proceed the same way as any other EC2 instance launch. You can check your Outpost’s capacity in the Outposts console or in Amazon CloudWatch:

I used a c5.large instance running Amazon Linux 2 with 20 GiB of Amazon EBS storage for this walkthough. You can pick a different instance size or a different operating system in accordance with your application’s needs and the AWS IoT Greengrass documentation. For the purposes of this tutorial, we assign a public IP address to the EC2 instance on creation.

Step 1: Installing the AWS IoT Greengrass Core software

Once your EC2 instance is up and running, you will need to install the AWS IoT Greengrass Core software on the instance. Follow the AWS IoT Greengrass documentation to do this. You will need to do the following:

  1. Ensure that your EC2 instance has appropriate AWS permissions to make AWS API calls. You can do this by attaching an instance profile to the instance, or by providing AWS credentials directly to the instance as environment variables, as in the Greengrass documentation.
  2. Log in to your instance.
  3. Install OpenJDK 11. For Amazon Linux 2, you can use sudo amazon-linux-extras install java-openjdk11 to do this.
  4. Create the default system user and group that runs components on the device, with
    sudo useradd —system —create-home ggc_user
    sudo groupadd —system ggc_group
  5. Edit the /etc/sudoers file with sudo visudosuch that the entry for the root user looks like root ALL=(ALL:ALL) ALL
  6. Enable cgroups and enable and mount the memory and devices cgroups. In Amazon Linux 2, you can do this with the grubby utility as follows:
    sudo grubby --args="cgroup_enable=memory cgroup_memory=1 systemd.unified_cgroup_hierarchy=0" --update-kernel /boot/vmlinuz-$(uname -r)
  7. Type sudo reboot to reboot your instance with the cgroup boot parameters enabled.
  8. Log back in to your instance once it has rebooted.
  9. Use this command to download the AWS IoT Greengrass Core software to the instance:
    curl -s https://d2s8p88vqu9w66.cloudfront.net/releases/greengrass-nucleus-latest.zip > greengrass-nucleus-latest.zip
  10. Unzip the AWS IoT Greengrass Core software:
    unzip greengrass-nucleus-latest.zip -d GreengrassInstaller && rm greengrass-nucleus-latest.zip
  11. Run the following command to launch the installer. Replace each argument with appropriate values for your particular deployment, particularly the aws-region and thing-name arguments.
    sudo -E java -Droot="/greengrass/v2" -Dlog.store=FILE \
    -jar ./GreengrassInstaller/lib/Greengrass.jar \
    --aws-region region \
    --thing-name MyGreengrassCore \
    --thing-group-name MyGreengrassCoreGroup \
    --thing-policy-name GreengrassV2IoTThingPolicy \
    --tes-role-name GreengrassV2TokenExchangeRole \
    --tes-role-alias-name GreengrassCoreTokenExchangeRoleAlias \
    --component-default-user ggc_user:ggc_group \
    --provision true \
    --setup-system-service true \
    --deploy-dev-tools true
  12. You have now installed the AWS IoT Greengrass Core software on your EC2 instance. If you type sudo systemctl status greengrass.service then you should see output similar to this:

Step 2: Building and deploying a Lambda function

Now build a Lambda function and deploy it to the new Greengrass Core instance. You can find example local Lambda functions in the aws-greengrass-lambda-functions GitHub repository. This example will use the Hello World Python 3 function from that repo.

  1. Create the Lambda function. Go to the Lambda console, choose Create function, and select the Python 3.8 runtime:

  1. Choose Create function at the bottom of the page. Once your new function has been created, copy the code from the Hello World Python 3 example into your function:

  1. Choose Deploy to deploy your new function’s code.
  2. In the top right, choose Actions and select Publish new version. For this particular function, you would need to create a deployment package with the AWS IoT Greengrass SDK for the function to work on the device. I’ve omitted this step for brevity as it is not a main focus of this post. Please reference the Lambda documentation on deployment packages and the Python-specific deployment package docs if you want to pursue this option.

  1. Go to the AWS IoT Greengrass console and choose Components in the left-side pop-in menu.
  2. On the Components page, choose Create component, and then Import Lambda function. If you prefer to do this programmatically, see the relevant AWS IoT Greengrass documentation or AWS CloudFormation documentation.
  3. Choose your new Lambda function from the drop-down.

Create component

  1. Scroll to the bottom and choose Create component.
  2. Go to the Core devices menu in the left-side nav bar and select your Greengrass Core device. This is the Greengrass Core EC2 instance you set up earlier. Make a note of the core device’s name.

  1. Use the left-side nav bar to go to the Deployments menu. Choose Create to create a new deployment, which will place your Lambda function on your Outpost-hosted core device.
  2. Give the deployment a name and select Core device, providing the name of your core device. Choose Next.

  1. Select your Lambda function and choose Next.

  1. Choose Next again, on both the Configure components and Configure advanced settings On the last page, choose Deploy.

You should see a green message at the top of the screen indicating that your configuration is now being deployed.

Clean up

  1. Delete the Lambda function you created.
  2. Terminate the Greengrass Core EC2 instance.
  3. Delete the VPC.

Conclusion

Many customers use AWS Outposts to expand applications into new geographies. Some customers want to run Lambda-based applications on Outposts. This blog post shows how to use AWS IoT Greengrass to build Lambda functions which run locally on Outposts.

To learn more about Outposts, please contact your AWS representative and visit the Outposts homepage and documentation.

Making your Go workloads up to 20% faster with Go 1.18 and AWS Graviton

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/making-your-go-workloads-up-to-20-faster-with-go-1-18-and-aws-graviton/

This blog post was written by Syl Taylor, Professional Services Consultant.

In March 2022, the highly anticipated Go 1.18 was released. Go 1.18 brings to the language some long-awaited features and additions, such as generics. It also brings significant performance improvements for Arm’s 64-bit architecture used in AWS Graviton server processors. In this post, we show how migrating Go workloads from Go 1.17.8 to Go 1.18 can help you run your applications up to 20% faster and more cost-effectively. To achieve this goal, we selected a series of realistic and relatable workloads to showcase how they perform when compiled with Go 1.18.

Overview

Go is an open-source programming language which can be used to create a wide range of applications. It’s developer-friendly and suitable for designing production-grade workloads in areas such as web development, distributed systems, and cloud-native software.

AWS Graviton2 processors are custom-built by AWS using 64-bit Arm Neoverse cores to deliver the best price-performance for your cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2). They provide up to 40% better price/performance over comparable x86-based instances for a wide variety of workloads and they can run numerous applications, including those written in Go.

Web service throughput

For web applications, the number of HTTP requests that a server can process in a window of time is an important measurement to determine scalability needs and reduce costs.

To demonstrate the performance improvements for a Go-based web service, we selected the popular Caddy web server. To perform the load testing, we selected the hey application, which was also written in Go. We deployed these packages in a client/server scenario on m6g Graviton instances.

Relative performance comparison for requesting a static webpage

The Caddy web server compiled with Go 1.18 brings a 7-8% throughput improvement as compared with the variant compiled with Go 1.17.8.

We conducted a second test where the client downloads a dynamic page on which the request handler performs some additional processing to write the HTTP response content. The performance gains were also noticeable at 10-11%.

Relative performance comparison for requesting a dynamic webpage

Regular expression searches

Searching through large amounts of text is where regular expression patterns excel. They can be used for many use cases, such as:

  • Checking if a string has a valid format (e.g., email address, domain name, IP address),
  • Finding all of the occurrences of a string (e.g., date) in a text document,
  • Identifying a string and replacing it with another.

However, despite their efficiency in search engines, text editors, or log parsers, regular expression evaluation is an expensive operation to run. We recommend identifying optimizations to reduce search time and compute costs.

The following example uses the Go regexp package to compile a pattern and search for the presence of a standard date format in a large generated string. We observed a 13.5% increase in completed executions with a 12% reduction in execution time.

Relative performance comparison for using regular expressions to check that a pattern exists

In a second example, we used the Go regexp package to find all of the occurrences of a pattern for character sequences in a string, and then replace them with a single character. We observed a 12% increase in evaluation rate with an 11% reduction in execution time.

Relative performance comparison for using regular expressions to find and replace all of the occurrences of a pattern

As with most workloads, the improvements will vary depending on the input data, the hardware selected, and the software stack installed. Furthermore, with this use case, the regular expression usage will have an impact on the overall performance. Given the importance of regex patterns in modern applications, as well as the scale at which they’re used, we recommend upgrading to Go 1.18 for any software that relies heavily on regular expression operations.

Database storage engines

Many database storage engines use a key-value store design to benefit from simplicity of use, faster speed, and improved horizontal scalability. Two implementations commonly used are B-trees and LSM (log-structured merge) trees. In the age of cloud technology, building distributed applications that leverage a suitable database service is important to make sure that you maximize your business outcomes.

B-trees are seen in many database management systems (DBMS), and they’re used to efficiently perform queries using indexes. When we tested a sample program for inserting and deleting in a large B-tree structure, we observed a 10.5% throughput increase with a 10% reduction in execution time.

Relative performance comparison for inserting and deleting in a B-Tree structure

On the other hand, LSM trees can achieve high rates of write throughput, thus making them useful for big data or time series events, such as metrics and real-time analytics. They’re used in modern applications due to their ability to handle large write workloads in a time of rapid data growth. The following are examples of databases that use LSM trees:

  • InfluxDB is a powerful database used for high-speed read and writes on time series data. It’s written in Go and its storage engine uses a variation of LSM called the Time-Structured Merge Tree (TSM).
  • CockroachDB is a popular distributed SQL database written in Go with its own LSM tree implementation.
  • Badger is written in Go and is the engine behind Dgraph, a graph database. Its design leverages LSM trees.

When we tested an LSM tree sample program, we observed a 13.5% throughput increase with a 9.5% reduction in execution time.

We also tested InfluxDB using comparison benchmarks to analyze writes and reads to the database server. On the load stress test, we saw a 10% increase of insertion throughput and a 14.5% faster rate when querying at a large scale.

Relative performance comparison for inserting to and querying from an InfluxDB database

In summary, for databases with an engine written in Go, you’ll likely observe better performance when upgrading to a version that has been compiled with Go 1.18.

Machine learning training

A popular unsupervised machine learning (ML) algorithm is K-Means clustering. It aims to group similar data points into k clusters. We used a dataset of 2D coordinates to train K-Means and obtain the cluster distribution in a deterministic manner. The example program uses an OOP design. We noticed an 18% improvement in execution throughput and a 15% reduction in execution time.

Relative performance comparison for training a K-means model

A widely-used and supervised ML algorithm for both classification and regression is Random Forest. It’s composed of numerous individual decision trees, and it uses a voting mechanism to determine which prediction to use. It’s a powerful method for optimizing ML models.

We ran a deterministic example to train a dense Random Forest. The program uses an OOP design and we noted a 20% improvement in execution throughput and a 15% reduction in execution time.

Relative performance comparison for training a Random Forest model

Recursion

An efficient, general-purpose method for sorting data is the merge sort algorithm. It works by repeatedly breaking down the data into parts until it can compare single units to each other. Then, it decides their order in the intermediary steps that will merge repeatedly until the final sorted result. To implement this divide-and-conquer approach, merge sort must use recursion. We ran the program using a large dataset of numbers and observed a 7% improvement in execution throughput and a 4.5% reduction in execution time.

Relative performance comparison for running a merge sort algorithm

Depth-first search (DFS) is a fundamental recursive algorithm for traversing tree or graph data structures. Many complex applications rely on DFS variants to solve or optimize hard problems in various areas, such as path finding, scheduling, or circuit design. We implemented a standard DFS traversal in a fully-connected graph. Then we observed a 14.5% improvement in execution throughput and a 13% reduction in execution time.

Relative performance comparison for running a DFS algorithm

Conclusion

In this post, we’ve shown that a variety of applications, not just those primarily compute-bound, can benefit from the 64-bit Arm CPU performance improvements released in Go 1.18. Programs with an object-oriented design, recursion, or that have many function calls in their implementation will likely benefit more from the new register ABI calling convention.

By using AWS Graviton EC2 instances, you can benefit from up to a 40% price/performance improvement over other instance types. Furthermore, you can save even more with Graviton through the additional performance improvements by simply recompiling your Go applications with Go 1.18.

To learn more about Graviton, see the Getting started with AWS Graviton guide.

Amazon EC2 DL1 instances Deep Dive

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/amazon-ec2-dl1-instances-deep-dive/

This post is written by Amr Ragab, Principal Solutions Architect, Amazon EC2.

AWS is excited to announce that the new Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances are now generally available in US-East (N. Virginia) and US-West (Oregon). DL1 provides up to 40% better price performance for training deep learning models as compared to current generation GPU-based EC2 instances. The dl1.24xlarge instance type features eight Intel-Habana Gaudi accelerators, which are custom-built to train deep learning models. Each Gaudi accelerator has 32 GB of high bandwidth memory (HBM2) and a peer-to-peer bidirectional bandwidth of 100 Gbps RoCE, for a total bidirectional interconnect bandwidth of 700 Gbps per card. Further instance specifications are as follows:

Instance Size vCPU Instance Memory (GiB) Gaudi Accelerators Network Bandwidth (Gbps) Total Accelerator Interconnect (Gbs) Local Instance Storage EBS Bandwidth (Gbps)
d1.24xlarge 96 768 8 4×100 Gbps 700 4x1TB NVMe 19

Instance Architecture

System architecture of the amazon ec2 dl1 instances.

As the preceding instance architecture indicates, pairs of Gaudi accelerators (e.g., Gaudi0 and Gaudi1) are attached directly through a PCIe Gen3x16 link. Additionally, peer-to-peer networking via 100 Gbps RoCEv2 links – with seven active links per card – provides a torus configuration with a total of 700 Gbps of interconnect bandwidth per card. This topology is a separate interconnect outside of the two NUMA domains. Furthermore, the instance supports four EFA ENIs and 4x1TB of local NVMe SSD storage. We will provide a peer-direct driver over EFA, which will let you utilize high throughput, low latency peer-direct networking between accelerators across multiple instances to efficiently scale multi-node distributed training workloads.

Quick Start

Quickly get started with DL1 and SynapseAI SDK through with the following options:

1) Habana Deep Learning AMIs provided by AWS.

2) AWS Marketplace AMIs provided by Habana.

3) Using Packer to build a custom Amazon Machine Images (AMI) provided by this GitHub repo. This repo also provides build scripts to create Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS) AMIs.

After selecting an AMI, launch a dl1.24xlarge instance in either us-east-1 or us-west-2. To help identify in which availability zone(s) dl1.24xlarge is available, run the following command:

aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--filters Name=instance-type,Values=dl1.24xlarge \
--region us-west-2 \
--output table

Once launched, you can connect to the instance over SSH (with the correct security group attached).

Habana Collectives Communication Library (HCL/HCCL)

As part of the Habana SynapseAI SDK, Habana Gaudi’s use the HCCL library for handling the collectives between HPUs. Get more information on HCCL here. On DL1 through the HCL-tests, we can confirm close to 700 Gbps (689 Gbps) per card for the collectives tested as follows.

You can confirm these tests by cloning the github repo here.

Habana DL1 HCCL tests.

Amazon EKS Quick Start

Support for DL1 on Amazon EKS is available today with Amazon EKS versions > 1.19. The following is a quick start to get up and running quickly with DL1.

The following dependencies will be needed:

eksctl – You need version 0.70.0+ of eksctl.
kubectl – You use Kubernetes version 1.20 in this post.

Create EKS cluster:

eksctl create cluster --region us-east-1 --without-nodegroup \
--vpc-public-subnets subnet-037d8e430963c2d3e,subnet-0abe898359a7d43e9

Nodegroup configuration – save the following codeblock to a file called dl1-managed-ng.yaml. Replace the AMI ID in the code block with the AMI created earlier.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: fabulous-rainbow-1635807811
  region: us-west-2

vpc:
  id: vpc-34f1894c
  subnets:
    public:
      endpoint-one:
        id: subnet-4532e73d
      endpoint-two:
        id: subnet-8f8b7dc5

managedNodeGroups:
  - name: dl1-ng-1d
    instanceType: dl1.24xlarge
    volumeSize: 200
    instancePrefix: dl1-ng-1d-worker
    ami: ami-072c632cbbc2255b3
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        cloudWatch: true
    ssh:
      allow: true
      publicKeyName: amrragab-aws
    subnets:
    - endpoint-one
    minSize: 1
    desiredCapacity: 1
    maxSize: 4
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh fabulous-rainbow-1635807811

Create the managed nodegroup with the following command:

eksctl create nodegroup -f dl1-managed-ng.yaml

Once the nodegroup has been completed, you must apply the habana-k8s-device-plugin

kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml

Once completed, you should see the Gaudi devices as an allocatable resource in your EKS
cluster, presenting 8 Gaudi accelerators per DL1 node in the cluster.

Allocatable:

attachable-volumes-aws-ebs: 39
cpu:                        95690m
ephemeral-storage:          192188443124
habana.ai/gaudi:            8
hugepages-1Gi:              0
hugepages-2Mi:              30000Mi
memory:                     753055132Ki
pods:                       15

Example Distributed Machine Learning (ML) Workloads

The following tables are examples of Mixed Precision/FP32 training results comparing DL1 to the common GPU instances used for ML training.

Model: ResNet50
Framework: TensorFlow 2
Dataset: Imagenet2012
GitHub: https://github.com/HabanaAI/Model-
References/tree/master/TensorFlow/computer_vision/Resnets/resnet_keras

Instance Type Batch Size
Mixed Precision Training Throughput (images/sec)
8x Gaudi – 32 GB (dl1.24xlarge) 256 13036
8x A100 – 40 GB (p4d.24xlarge) 256 17921
8x V100 – 32 GB (p3dn.24xlarge) 256 9685
8x V100 – 16GB (p3.16xlarge) 256 8945

Model: Bert Large – Pretraining
Framework: Pytorch 1.9
Dataset: Wikipedia/BooksCorpus
GitHub: https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/bert

Instance Type Batch Size
@128 Sequence
Length
Mixed Precision Training Throughput (seq/sec)
8x Gaudi – 32 GB (dl1.24xlarge) 256 1318
8x A100 – 40 GB (p4d.24xlarge) 8192 2979
8x V100 – 32 GB (p3dn.24xlarge) 8192 1458
8x V100 – 16GB (p3.16xlarge) 8192 1013

You can find a more comprehensive list of ML models supported with performance data here. Support for containers with TensorFlow and Pytorch are also available. Furthermore, you can stay up-to-date with the operator support for TensorFlow and Pytorch.

CONCLUSION

We are excited to innovate on behalf of our customers and provide a diverse choice in ML accelerators with DL1 instances. The DL1 instances powered by Gaudi accelerators can provide up to 40% better price performance for training deep learning models as compared to current generation GPU-based EC2 instances. DL1 instances use the Habana SynapseAI SDK with framework support in Pytorch and TensorFlow. Additional future support for EFA with peer direct HPUs across nodes will also be supported. Now it’s time to go power up your ML workloads with Amazon EC2 DL1 instances.