All posts by Neelay Thaker

Deep dive into NitroTPM and UEFI Secure Boot support in Amazon EC2

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/deep-dive-into-nitrotpm-and-uefi-secure-boot-support-in-amazon-ec2/

Contributed by Samartha Chandrashekar, Principal Product Manager, Amazon EC2

At re:Invent 2021, we announced NitroTPM, a Trusted Platform Module (TPM) 2.0, and Unified Extensible Firmware Interface (UEFI) Secure Boot support in Amazon EC2. In this blog post, we’ll share additional details on how these capabilities can help further raise the security bar of EC2 deployments.

A TPM is a security device to gather and attest system state, store and generate cryptographic data, and prove platform identity. Although TPMs are traditionally discrete chips or firmware modules, their adaptation on AWS as NitroTPM preserves their security properties without affecting the agility and scalability of EC2. NitroTPM makes it possible to use TPM-dependent applications and Operating System (OS) capabilities in EC2 instances. It conforms to the TPM 2.0 specification, which makes it easy to migrate existing on-premises workloads that use TPM functionalities to EC2.

Unified Extensible Firmware Interface (UEFI) Secure Boot is a feature of UEFI that builds on EC2’s long-standing secure boot process and provides additional defense-in-depth that helps you secure software from threats that persist across reboots. It ensures that EC2 instances run authentic software by verifying the digital signature of all boot components, and halts the boot process if signature verification fails. When used with UEFI Secure Boot, NitroTPM can verify the integrity of software that boots and runs in the EC2 instance. It can measure instance properties and components as evidence that unaltered software in the correct order was used during boot. Features such as “Measured Boot” in Windows, Linux Unified Key Setup (LUKS), and dm-verity in popular Linux distributions can use NitroTPM to further secure OS launches from malware with administrative privileges that attempts to persist across reboots.
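On a Linux guest, one way to put this into practice is to bind a LUKS volume to the TPM so that it unlocks only when the Secure Boot state (PCR 7) matches what was recorded at enrollment. A minimal sketch using systemd-cryptenroll, assuming a systemd-based distribution with TPM 2.0 support; the device path is a placeholder:

# Bind an existing LUKS volume to the TPM; it will unlock automatically
# only while PCR 7 (Secure Boot state) matches the enrolled value.
# /dev/nvme1n1p1 is a hypothetical device path; adjust for your instance.
sudo systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/nvme1n1p1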

NitroTPM derives its root-of-trust from the Nitro Security Chip and performs the same functions as a physical/discrete TPM. Similar to discrete TPMs, an immutable private and public Endorsement Key (EK) is set up inside the NitroTPM by AWS during instance creation. NitroTPM can serve as a “root-of-trust” to verify the provenance of software in the instance (e.g., NitroTPM’s EKCert as the basis for SSL certificates). Sensitive information protected by NitroTPM is made available only if the OS has booted correctly (i.e., boot measurements match expected values). If the system is tampered with, keys are not released since the TPM state is different, thereby ensuring protection from malware attempting to hijack the boot process. NitroTPM can protect volume encryption keys used by full-disk encryption utilities (such as dm-crypt and BitLocker) or private keys for certificates.
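To illustrate sealing, the sketch below uses the tpm2-tools utilities to bind a secret to the current PCR values so that it can be unsealed only if those measurements are reproduced on a later boot. This is an illustration rather than a production recipe; the passphrase and file names are placeholders:

# Create a primary key under the owner hierarchy
tpm2_createprimary -C o -c primary.ctx
# Capture current PCR 0 and 7 values and build a policy over them
tpm2_pcrread -o pcr.bin sha256:0,7
tpm2_createpolicy --policy-pcr -l sha256:0,7 -f pcr.bin -L policy.digest
# Seal a placeholder secret under that PCR policy
echo "disk-passphrase" | tpm2_create -C primary.ctx -L policy.digest -i- -u seal.pub -r seal.priv
tpm2_load -C primary.ctx -u seal.pub -r seal.priv -c seal.ctx
# Unsealing succeeds only while the PCR values still match
tpm2_unseal -c seal.ctx -p pcr:sha256:0,7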

NitroTPM can be used for attestation, a process to demonstrate that an EC2 instance meets pre-defined criteria, thereby allowing you to gain confidence in its integrity. It can be used to make an instance’s access to a resource (such as a service or a database) contingent on its health state (e.g., patching level, presence of mandated agents, etc.). For example, a private key can be “sealed” to a list of measurements of specific programs allowed to “unseal” it. This makes it well suited for use cases such as digital rights management, or gating LDAP login and database access on attestation. Access to AWS Key Management Service (KMS) keys to encrypt/decrypt data accessed by the instance can be made to require affirmative attestation of instance health. Anti-malware software (e.g., Windows Defender) can initiate remediation actions if attestation fails.
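To give a flavor of the attestation flow, the following sketch uses tpm2-tools to create an attestation key and produce a signed quote over selected PCRs, which a remote verifier can compare against expected values. The nonce is a placeholder that would normally be supplied by the verifier to prevent replay:

# Create an endorsement key and an attestation key under it
tpm2_createek -c ek.ctx -G rsa -u ek.pub
tpm2_createak -C ek.ctx -c ak.ctx -u ak.pub -n ak.name
# Produce a signed quote over PCRs 0-2 and 7 using the verifier's nonce
tpm2_quote -c ak.ctx -l sha256:0,1,2,7 -q abc123 -m quote.msg -s quote.sig -o quote.pcrs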

NitroTPM uses Platform Configuration Registers (PCR) to store system measurements. These do not change until the next boot of the instance. PCR measurements are computed during the boot process before malware can modify system state or tamper with the measuring process. These values are compared with pre-calculated known-good values, and secrets protected by NitroTPM are released only if the sequences match. PCRs are recalculated after each reboot, which ensures protection against malware aiming to hijack the boot process or persist across reboots. For example, if malware overwrites part of the kernel, measurements change, and disk decryption keys sealed to NitroTPM are not unsealed. Trust decisions can also be made based on additional criteria such as boot integrity, patching level, etc.

The workflow below shows how UEFI Secure Boot and NitroTPM work to ensure system integrity during OS startup.

[Figure: UEFI Secure Boot and NitroTPM workflow during OS startup]

To get started, you’ll need to register an Amazon Machine Image (AMI) of an Operating System that supports TPM 2.0 and UEFI Secure Boot using the register-image primitive via the CLI, API, or console (a sketch follows below). Alternatively, you can use pre-configured AMIs from AWS for both Windows and Linux to launch EC2 instances with TPM and Secure Boot. The screenshot further below shows a Windows Server 2019 instance on EC2 launched with NitroTPM using its inbox TPM 2.0 drivers to recognize a TPM device.
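A sketch of such a register-image call with the AWS CLI, assuming an EBS-backed image; the AMI name and snapshot ID are placeholders:

# Register a UEFI Secure Boot capable AMI with TPM 2.0 support
aws ec2 register-image \
    --name my-uefi-secureboot-ami \
    --architecture x86_64 \
    --boot-mode uefi \
    --tpm-support v2.0 \
    --root-device-name /dev/sda1 \
    --block-device-mappings "DeviceName=/dev/sda1,Ebs={SnapshotId=snap-0123456789abcdef0}" \
    --ena-support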

NitroTPM and UEFI Secure Boot enable you to further raise the bar in running your workloads in a secure and trustworthy manner. We’re excited for you to try out NitroTPM when it becomes publicly available in 2022. Contact [email protected] for additional information.

Announcing winners of the AWS Graviton Challenge Contest and Hackathon

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/announcing-winners-of-the-aws-graviton-challenge-contest-and-hackathon/

At AWS, we are constantly innovating on behalf of our customers so they can run virtually any workload, with optimal price and performance. Amazon EC2 now includes more than 475 instance types that offer a choice of compute, memory, networking, and storage to suit your workload needs. While we work closely with our silicon partners to offer instances based on their latest processors and accelerators, we also drive more choice for our customers by building our own silicon.

The AWS Graviton family of processors was built as part of that silicon innovation initiative with the goal of pushing the price performance envelope for a wide variety of customer workloads in EC2. We now have 12 EC2 instance families powered by AWS Graviton2 processors – general purpose (M6g, M6gd), burstable (T4g), compute optimized (C6g, C6gd, C6gn), memory optimized (R6g, R6gd, X2gd), storage optimized (Im4gn, Is4gen), and accelerated computing (G5g) available globally across 23 AWS Regions. We also announced the preview of Amazon EC2 C7g instances powered by the latest generation AWS Graviton3 processors that will provide the best price performance for compute-intensive workloads in EC2. Thousands of customers, including Discovery, DIRECTV, Epic Games, and Formula 1, have realized significant price performance benefits with AWS Graviton-based instances for a broad range of workloads. This year, AWS Graviton-based instances also powered much of Amazon Prime Day 2021 and supported 12 core retail services during the massive 2-day online shopping event.

To make it easy for customers to adopt Graviton-based instances, we launched a program called the Graviton Challenge. Working with customers, we saw that many successful adoptions of Graviton-based instances were the result of one or two developers taking a single workload and spending a few days to benchmark the price performance gains with Graviton2-based instances, before scaling it to more workloads. The Graviton Challenge provides a step-by-step plan that developers can follow to move their first workload to Graviton-based instances. With the Graviton Challenge, we also launched a Contest (US-only), and then a Hackathon (global), where developers could compete for prizes by building new applications or moving existing applications to run on Graviton2-based instances. More than a thousand participants, including enterprises, startups, individual developers, open-source developers, and Arm developers, registered and ran a variety of applications on Graviton-based instances with significant price performance benefits. We saw some fantastic entries and usage of Graviton2-based instances across a variety of use cases and want to highlight a few.

The Graviton Challenge Contest winners:

  • Best Adoption – Enterprise and Most Impactful Adoption: VMware vRealize SRE team, who migrated 60 micro-services written in Java, Rust, and Golang to Graviton2-based general purpose and compute optimized instances and realized up to 48% latency reduction and 22% cost savings.
  • Best Adoption – Startup: Kasm Technologies, who realized up to 48% better performance and 25% potential cost savings for its container streaming platform built on C/C++ and Python.
  • Best New Workload adoption: Dustin Wilson, who built a dynamic tile server based on Golang and running on Graviton2-based memory-optimized instances that helps analysts query large geospatial datasets and benchmarked up to 1.8x performance gains over comparable x86-based instances.
  • Most Innovative Adoption: Loroa, an application that translates any given text from one language into spoken words in multiple other languages, using Graviton2-based instances, Amazon Polly, and Amazon Translate.

If you are attending AWS re:Invent 2021 in person, you can hear more details on their Graviton adoption experience by attending the CMP213: Lessons learned from customers who have adopted AWS Graviton chalk talk.

Winners for the Graviton Challenge Hackathon:

  • Best New App: PickYourPlace, an open-source based data analytics platform to help users select a place to live based on property value, safety, and accessibility.
  • Best Migrated App: Genie, an image credibility checker based on deep learning that makes predictions on photographic and tampered confidence of an image.
  • Highest Potential Impact: Welly Tambunan, who’s also an AWS Community Builder, for porting the big data platforms Spark, Dremio, and AirByte to Graviton2-based instances so developers can leverage them to build big data capabilities into their applications.
  • Most Creative Use Case: OXY, a low-cost custom Oximeter with mobile and web apps that enables continuous and remote monitoring to prevent deaths due to Silent Hypoxia.
  • Best Technical Implementation: Apollonia Bot, which plays songs, playlists, or podcasts on a Discord voice channel so users can listen together.

It’s been incredibly exciting to see the enthusiasm and benefits realized by our customers. We are also thankful to our judges – Patrick Moorhead from Moor Insights, James Governor from RedMonk, and Jason Andrews from Arm, for their time and effort.

In addition to EC2, several AWS services for databases, analytics, and even serverless computing offer options to run on Graviton-based instances. These include Amazon Aurora, Amazon RDS, Amazon MemoryDB, Amazon DocumentDB, Amazon Neptune, Amazon ElastiCache, Amazon OpenSearch, Amazon EMR, AWS Lambda, and most recently, AWS Fargate. By using these managed services on Graviton2-based instances, customers can get significant price performance gains with minimal or no code changes. We also added support for Graviton to key AWS infrastructure services such as Elastic Beanstalk, Amazon EKS, Amazon ECS, and Amazon CloudWatch to help customers build, run, and scale their applications on Graviton-based instances. Additionally, a large number of Linux and BSD-based operating systems, as well as partner software for security, monitoring, containers, CI/CD, and other use cases, now support Graviton-based instances, and we recently launched the AWS Graviton Ready program as part of the AWS Service Ready program to offer Graviton-certified and validated solutions to customers.
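For EC2 itself, trying a Graviton2-based instance is often just a change of instance type, provided the AMI is built for arm64. A sketch with the AWS CLI; the AMI ID and key pair name are placeholders:

# Launch a Graviton2-based general purpose instance from an arm64 AMI
aws ec2 run-instances \
    --instance-type m6g.large \
    --image-id ami-0123456789abcdef0 \
    --key-name my-key-pair \
    --count 1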

Congrats to all of our Contest and Hackathon winners! The full list of Contest and Hackathon winners is available on the Graviton Challenge page.

P.S.: Even though the Contest and Hackathon have ended, developers can still access the step-by-step plan on the Graviton Challenge page to move their first workload to Graviton-based instances.

15 years of silicon innovation with Amazon EC2

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/15-years-of-silicon-innovation-with-amazon-ec2/

The Graviton Hackathon is now available to developers globally to build and migrate apps to run on AWS Graviton2

This week we are celebrating 15 years of Amazon EC2 live on Twitch August 23rd – 24th with keynotes and sessions from AWS leaders and experts. You can watch all the sessions on-demand later this week to learn more about innovation at Amazon EC2.

As Jeff Barr noted in his blog, EC2 started as a single instance back in 2006 with the idea of providing developers on-demand access to compute infrastructure and having them pay only for what they use. Today, EC2 has over 400 instance types with the broadest and deepest portfolio of instances in the cloud. As we strive to be the best cloud for every workload, customers consistently ask us for higher performance and lower costs for their workloads. One way we deliver on this is by building silicon specifically optimized for the cloud. Our journey with custom silicon started in 2012 when we began looking into ways to improve the performance of EC2 instances by offloading virtualization functionality that is traditionally run on underlying servers to a dedicated offload card. Today, the AWS Nitro System is the foundation upon which our modern EC2 instances are built, delivering better performance and enhanced security. The Nitro System has also enabled us to innovate faster and deliver new instances and unique capabilities including Mac Instances, AWS Outposts, AWS Wavelength, VMware Cloud on AWS, and AWS Nitro Enclaves. We also collaborate closely with partners to offer instances based on their latest generation processors, optimized for the AWS cloud, to continue delivering better performance and better price performance for our joint customers. We’ve also designed AWS Inferentia and AWS Trainium chips to drive down cost and boost performance for deep learning workloads.

One of the biggest innovations that help us deliver higher performance at lower cost for customer workloads are the AWS Graviton2 processors, which are the second generation of Arm-based processors custom-built by AWS. Instances powered by the latest generation AWS Graviton2 processors deliver up to 40% better performance at 20% lower per-instance cost over comparable x86-based instances in EC2. Additionally, Graviton2 is our most power efficient processor. In fact, Graviton2 delivers 2 to 3.5 times better performance per Watt of energy use versus any other processor in AWS.

Customers from startups to large enterprises including Intuit, Snap, Lyft, SmugMug, and Nextroll have realized significant price performance benefits for their production workloads on AWS Graviton2-based instances. Recently, Epic Games added support for Graviton2 in Unreal Engine to help its developers build high performance games. What’s even more interesting is that AWS Graviton2-based instances supported 12 core retail services during Amazon Prime Day this year.

Most customers get started with AWS Graviton2-based instances by identifying and moving one or two workloads that are easy to migrate, and after realizing the price performance benefits, they move more workloads to Graviton2. In her blog, Liz Fong-Jones, Principal Developer Advocate at Honeycomb.io, details her journey of adopting Graviton2 and realizing significant price performance improvements. Using experience from working with thousands of customers like Liz who have adopted Graviton2, we built a program called the Graviton Challenge that provides a step-by-step plan to help you move your first workload to Graviton2-based instances.

Today, to further incentivize developers to get started with Graviton2, we are launching the Graviton Hackathon, where you can build a new app or migrate an app to run on Graviton2-based instances. Whether you are an existing EC2 user looking to optimize price performance for your workload, an Arm developer looking to leverage Arm-based instances in the cloud, or an open source developer adding support for Arm, you can participate in the Graviton Hackathon for a chance to win prizes for your project, including up to $10,000 in prize money. We look forward to the new applications that will take advantage of the price performance benefits of Graviton. To learn more about Graviton2, watch the EC2 15th Birthday event sessions on-demand later this week, register to attend the Graviton workshop at the upcoming AWS Summit Online, or register for the Graviton Challenge.

Cloud computing has made cutting edge, cost effective infrastructure available to everyday developers. Startups can use a credit card to spin up instances in minutes and scale up and scale down easily based on demand. Enterprises can leverage compute infrastructure and services to drive improved operational efficiency and customer experience. The last 15 years of EC2 innovation have been at the forefront of this shift, and we are looking forward to the next 15 years.

Amazon EC2 P4d instances deep dive

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/

This post is contributed by Amr Ragab, Senior Solutions Architect, Amazon EC2

Introduction

AWS is excited to announce that the new Amazon EC2 P4d instances are now generally available. This instance type delivers up to 2.5x higher deep learning performance than the previous generation and adds new features and technical breakthroughs to our accelerated instances portfolio. This blog post details some of those key features and how to integrate them into your current workloads and architectures.

Overview

[Figure: P4d instance generalized block diagram]

As you can see from the generalized block diagram above, the p4d comes with dual socket Intel Cascade Lake 8275CL processors totaling 96 vCPUs at 3.0 GHz with 1.1 TB of RAM and 8 TB of NVMe local storage. P4d also comes with 8 NVIDIA A100 GPUs with 40 GB of GPU memory each, connected over NVSwitch, and 400 Gbps Elastic Fabric Adapter (EFA) enabled networking. This instance configuration represents the latest generation of computing for our customers spanning Machine Learning (ML), High Performance Computing (HPC), and analytics.

One of the improvements of the p4d is in the networking stack. This new instance type provides 400 Gbps of networking with support for EFA and GPUDirect RDMA. Now, on AWS, you can take advantage of point-to-point GPU-to-GPU communication (across nodes), bypassing the CPU. Look out for additional blogs and webinars detailing use cases of GPUDirect and how this feature helps decrease latency and improve performance for certain workloads.

Let’s look at some new features and performance metrics for the P4d instances.

Features

Local ephemeral NVMe storage
The p4d instance type comes with 8 TB of local NVMe storage. Each device has a maximum read/write throughput of 2.7 GB/s. To create a local namespace and staging area for input into the GPUs, you can create a local RAID 0 of all the drives (a sketch follows the table below). This results in aggregate read throughput of about 16 GB/s. The following table summarizes the I/O tests on the NVMe drives in this configuration.

   FIO Test          Block Size  Threads  Bandwidth
1  Sequential Read   128k        96       16.4 GiB/s
2  Sequential Write  128k        96       8.2 GiB/s
3  Random Read       128k        96       16.3 GiB/s
4  Random Write      128k        96       8.1 GiB/s
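A sketch of that RAID 0 setup with mdadm follows; the NVMe device names are assumptions, so check lsblk on your instance first (the EBS root volume may also appear as an NVMe device):

# Stripe the eight local NVMe drives into a single RAID 0 staging area
sudo mdadm --create /dev/md0 --level=0 --raid-devices=8 \
    /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
    /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1
sudo mkfs.xfs /dev/md0
sudo mkdir -p /scratch
sudo mount /dev/md0 /scratch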

NVSwitch

Introduced with the p4d instance type is NVSwitch. Every GPU in the node is connected to every other GPU in a full mesh topology with up to 600 GB/s of bidirectional bandwidth. ML frameworks and HPC applications that use the NVIDIA Collective Communications Library (NCCL) can take full advantage of this all-to-all communication layer.
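You can observe this bandwidth yourself with the NCCL tests suite on a single node. A sketch, assuming nccl-tests has been built under $HOME/nccl-tests (the same binary is used for the multi-node run later in this post):

# Single-node all-reduce benchmark across all 8 GPUs over NVSwitch
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8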

[Figure: P4d GPU-to-GPU bandwidth]

[Figure: P3 GPU-to-GPU bandwidth]

P4d uses a full mesh NVLink topology for optimized all-to-all communication, compared to the previous generation P3/P3dn instances, which have all-to-all communication across various data path domains (NUMA, PCIe switch, NVLink). This new topology, accessed via NCCL, will improve performance for multi-GPU workloads.
To make optimal use of NVSwitch, ensure that the application boost clocks on all GPUs in your instance are set to their maximum values:

# Set the application clocks (memory, graphics, in MHz) to their maximum values
sudo nvidia-smi -ac 1215,1410

Multi-Instance GPU (MIG)

It’s now possible, at the user level, to partition a GPU into multiple GPU slices, with each slice isolated from the others. This enables multiple users to run different workloads on the same GPU without impacting performance. I walk you through an example implementation of MIG in the following steps:

MIG is disabled on every newly launched instance, so you must first enable it with the following command:

ubuntu@ip-172-31-34-6:~# sudo nvidia-smi -mig 1 

Enabled MIG Mode for GPU 00000000:10:1C.0
You can get a list of supported MIG profiles.
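For example, the following lists the GPU instance profiles the GPU supports; the exact profiles and output format vary by driver version:

# List the supported GPU instance profiles
sudo nvidia-smi mig -lgip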
Next, you can create seven slices, and create compute instances for each slice.
ubuntu@ip-172-31-34-6:~# sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 
Successfully created GPU instance ID 9 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 7 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 8 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 11 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 12 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 13 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 14 on GPU 0 using profile MIG 1g.5gb (ID 19)
ubuntu@ip-172-31-34-6:~# nvidia-smi mig -cci -gi 7,8,9,11,12,13,14 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 7 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 8 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 9 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 11 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 12 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 13 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 14 using profile MIG 1g.5gb (ID 0)

You can split a GPU into a maximum of seven slices. To pass a slice through into a Docker container, you can specify the GPU:slice index pair at runtime:

docker run -it --gpus '"device=1:0"' nvcr.io/nvidia/tensorflow:20.09-tf1-py3

With MIG, you can run multiple smaller workloads on the same GPU without compromising performance. We will follow up with additional blogs on this feature as we integrate it with additional AWS services.

NVIDIA GPUDirect RDMA over EFA

For workloads optimized for multi-GPU capabilities, we introduced GPUDirect over the EFA fabric. This allows direct GPU-to-GPU communication across multiple p4d nodes for decreased latency and improved performance. Follow this user guide to get started with installing the EFA driver and setting up the environment. The code sample below can be used as a template to use GPUDirect RDMA over EFA.

# Run the NCCL all_reduce benchmark across the nodes in ${HOSTS_FILE};
# FI_EFA_USE_DEVICE_RDMA=1 enables GPUDirect RDMA over the EFA device
/opt/amazon/openmpi/bin/mpirun \
     -n ${NUM_PROCS} -N ${NUM_PROCS_NODE} \
     -x RDMAV_FORK_SAFE=1 -x NCCL_DEBUG=info \
     -x FI_EFA_USE_DEVICE_RDMA=1 \
     --hostfile ${HOSTS_FILE} \
     --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
     $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1 -c 1 -n 100

Machine learning optimizations

You can quickly get started with all the benefits mentioned earlier for the p4d by using our latest Deep Learning AMI (DLAMI). The DLAMI now comes with CUDA 11 and the latest NCCL and cuDNN libraries and drivers to take advantage of the p4d.

TensorFloat32 – TF32

TF32 is a new 19-bit precision datatype from NVIDIA, introduced for the first time on the p4d.24xlarge instance. This datatype improves performance with little to no loss of training and validation accuracy for most mainstream models. We have more detailed benchmarks for individual algorithms, but on the p4d.24xlarge you can achieve approximately a 2.5-fold throughput increase over FP32 on the p3dn.24xlarge for mainstream deep learning models.
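The CUDA libraries use TF32 automatically for FP32 operations on A100. If you want to compare against strict FP32 numerics, NVIDIA's libraries honor an override environment variable; in this sketch, train.py is a placeholder for your training script:

# Force FP32 math in cuBLAS/cuDNN to compare numerics against TF32
NVIDIA_TF32_OVERRIDE=0 python train.py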

We have updated our machine learning models here to show examples (see the following tables) of popular algorithms our customers are using today, including general DNNs and BERT.

                P3dn FP32   P3dn FP16   P4d TF32    P4d FP16    P4d over P3dn  P4d over P3dn
   DNN          (imgs/sec)  (imgs/sec)  (imgs/sec)  (imgs/sec)  TF32/FP32      FP16
1  Resnet50     3057        7413        6841        15621       2.2            2.1
2  Resnet152    1145        2644        2823        5700        2.5            2.2
3  Inception3   2010        4969        4808        10433       2.4            2.1
4  Inception4   847         1778        2025        3811        2.4            2.1
5  VGG16        1202        2092        4532        7240        3.8            3.5
6  Alexnet      32198       50708       82192       133068      2.6            2.6
7  SSD300       1554        2918        3467        6016        2.2            2.1

BERT Large – Wikipedia/Books Corpus

   GPUs  Sequence length  Batch size/GPU           Gradient accumulation    Throughput
                          (mixed precision, TF32)  (mixed precision, TF32)  (mixed precision)
1  1     128              64, 64                   1024, 1024               372
2  4     128              64, 64                   256, 256                 1493
3  8     128              64, 64                   128, 128                 2936
4  1     512              16, 8                    2048, 4096               77
5  4     512              16, 8                    512, 1024                303
6  8     512              16, 8                    256, 512                 596

You can find other code examples at github.com/NVIDIA/DeepLearningExamples.

If you want to build your own AMI or extend an AMI maintained by your organization, you can use the GitHub repo below, which provides Packer scripts to build AMIs for Amazon Linux 2 or Ubuntu 18.04 (a short usage sketch follows the component list):

https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline

The stack includes the following components:

  • NVIDIA Driver 450.80.02
  • CUDA 11
  • NVIDIA Fabric Manager
  • cuDNN 8
  • NCCL 2.7.8
  • EFA latest driver
  • AWS-OFI-NCCL
  • FSx kernel and client driver and utilities
  • Intel OneDNN
  • NVIDIA-runtime Docker
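A short usage sketch; the template file name below is hypothetical, so check the repo's README for the current template names and variables:

git clone https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline
cd aws-efa-nccl-baseami-pipeline
# Template name is a placeholder; see the repo for the actual file names
packer build nvidia-efa-ml-ubuntu1804.json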

Conclusion

Get started with the new P4d instances with support on Amazon EKS, AWS Batch, and Amazon SageMaker. We are excited to hear about what you develop and run with the new P4d instances. If you have any questions, please reach out to your account team. Now, go power up your ML and HPC workloads with NVIDIA A100 GPUs and the P4d instances.