Tag Archives: Amazon EC2

Multi-Region Migration using AWS Application Migration Service

Post Syndicated from Shreya Pathak original https://aws.amazon.com/blogs/architecture/multi-region-migration-using-aws-application-migration-service/

AWS customers are in various stages of their cloud journey. Frequently, enterprises begin that journey by rehosting (lift-and-shift migrating) their on-premises workloads into AWS, and running Amazon Elastic Compute Cloud (Amazon EC2) instances. You can rehost using AWS Application Migration Service (MGN), a cloud-native migration tool.

You may need to relocate instances and workloads to a Region that is closer in proximity to one of your offices or data centers. Or you may have a resilience requirement to balance your workloads across multiple Regions. This rehosting migration pattern with AWS MGN can also be used to migrate Amazon EC2-hosted workloads from one AWS Region to another.

In this blog post, we will show you how to configure AWS MGN for migrating your workloads from one AWS Region to another.

Overview of AWS MGN migration

AWS MGN, an AWS native service, minimizes time-intensive, error-prone, manual processes by automatically converting your source servers from physical, virtual, or cloud infrastructure to run natively on AWS. It reduces overall migration costs, such as investment in multiple migration solutions, specialized cloud development, or application-specific skills. With AWS MGN, you can migrate your applications from physical infrastructure, VMware vSphere, Microsoft Hyper-V, Amazon EC2, and Amazon Virtual Private Cloud (Amazon VPC) to AWS.

To migrate to AWS, install the AWS MGN Replication Agent on your source servers and define replication settings in the AWS MGN console, shown in Figure 1. Replication servers receive data from an agent running on source servers, and write this data to the Amazon Elastic Block Store (EBS) volumes. Your replicated data is compressed and encrypted in transit and at rest using EBS encryption.

AWS MGN keeps your source servers up to date on AWS using nearly continuous, block-level data replication. It uses your defined launch settings to launch instances when you conduct non-disruptive tests or perform a cutover. After confirming that your launched instances are operating properly on AWS, you can decommission your source servers.

Figure 1. MGN service architecture

Figure 1. MGN service architecture

Steps for migration with AWS MGN

This tutorial assumes that you already have your source AWS Region set up with Amazon EC2-hosted workloads running and a target AWS Region defined.

Migrating Amazon EC2 workload across AWS Regions include the following steps:

  1. Create the Replication Settings template. These settings are used to create and manage your staging area subnet with lightweight Amazon EC2 instances. These instances act as replication servers used to replicate data between your source servers and AWS.
  2. Install the AWS Replication Agent on your source instances to add them to the AWS MGN console.
  3. Configure the launch settings for each source server. These are a set of instructions that determine how a Test or Cutover instance will be launched for each source server on AWS.
  4. Initiate the test/cutover to the target Region.

Prerequisites

Following are the prerequisites:

Setting up AWS MGN for multi-Region migration

This section will guide you through AWS MGN configuration setup for multi-Region migration.

Log into your AWS account, select the target AWS Region, and complete the prerequisites. Then you are ready to configure AWS MGN:

1.      Choose Get started on the AWS MGN landing page.

2.      Create the Replication Settings template (see Figure 2):

  • Select Staging area subnet for Replication Server
  • Choose Replication Server instance type (By default, AWS MGN uses t3.small instance type)
  • Choose default or custom Amazon EBS encryption
  • Enable ‘Always use the Application Migration Service security group’
  • Add custom Replication resources tags
  • Select Create Template button
Figure 2. Replication Settings template creation

Figure 2. Replication Settings template creation

3.      Add source servers to AWS MGN:

  • Select Add Servers following Source Servers (AWS MGN > Source Servers)
  • Enter OS, Replication Preferences, IAM Access Key and Secret Access Key ID of the IAM user created following Prerequisites. This does not expose your Secret Access Key ID in any request
  • Copy the installation command and run on your source server for agent installation

After successful agent installation, the source server is listed on the Source Servers page. Data replication begins after completion of the Initial Sync steps.

4.      Monitor the Initial Sync status (shown in Figure 3):

  •  Source server name > Migration Dashboard > Data Replication Status
    (Refer to the Source Servers page documentation for more details)
  • After 100% initial data replication confirm:
    • Migration Lifecycle = Ready for testing
    • Next step = Launch test instance
Figure 3. Monitoring initial replication status

Figure 3. Monitoring initial replication status

5.      Configure Launch Settings for each server:

  • Source servers page > Select source server
  • Navigate to the Launch settings tab (see Figure 4.) For this tutorial we won’t adjust the General launch settings. We will modify the EC2 Launch Template instead
  • Click on EC2 Launch Template > About modifying EC2 Launch Templates > Modify
Figure 4. Modifying EC2 Launch Template

Figure 4. Modifying EC2 Launch Template

6.      Provide values for Launch Template:

  • AMI: Recents tab > Don’t include in launch template
  • Instance Type: Can be kept same as source server or changed as per expected workload
  • Key pair (login): Create new or use existing if already created in the Target AWS Region
  • Network Settings > Subnet: Subnet for launching Test instance
  • Advanced network configuration:
    • Security Groups: For access to the test and final cutover instances
    • Configure Storage: Size – Do not change or edit this field
    • Volume type: Select any volume type (io1 is default)
  • Review details and click Create Template Version under the Summary section on right side of the console

7.      Every time you modify the Launch template, a new version is created. Set the launch template that you want to use with MGN as the default (shown in Figure 5):

  • Navigate to Amazon EC2 dashboard > Launch Templates page
  • Select the Launch template ID
  • Open the Actions menu and choose Set default version and select the latest Launch template created
Figure 5. Setting up your Launch template as the default

Figure 5. Setting up your Launch template as the default

8.      Launch a test instance and perform a Test prior to Cutover to identify potential problems and solve them before the actual Cutover takes place:

  • Go to the Source Servers page (see Figure 6)
  • Select source server > Open Test and Cutover menu
  • Under Testing, choose Launch test instances
  • Launch test instances for X servers > Launch
  • Choose View job details on the ‘Launch Job Created’ dialog box to view the specific Job details for the test launch in the Launch History tab
Figure 6. Launching test instances

Figure 6. Launching test instances

9.      Validate launch of test instance (shown in Figure 7) by confirming:

  • Alerts column = Launched
  • Migration lifecycle column = Test in progress
  • Next step column = Complete testing and mark as ‘Ready for cutover’
Figure 7. Validating launch of test instances

Figure 7. Validating launch of test instances

10.  SSH/ RDP into Test instance (view from EC2 console) and validate connectivity. Perform acceptance tests for your application as required. Revert the test if you encounter any issues.

11.  Terminate Test instances after successful testing:

  • Go to Source servers page
  • Select source server > Open Test and Cutover menu
  • Under Testing, choose Mark as “Ready for cutover”
  • Mark X servers as “Ready for cutover” > Yes, terminate launched instances (recommended) > Continue

12.  Validate the status of termination job and cutover readiness:

  • Migration Lifecycle = Ready for cutover
  • Next step = Launch cutover instance

13.  Perform the final cutover at a set date and time:

  • Go to Source servers page (see Figure 8)
  • Select source server > Open Test and Cutover menu
  • Under Cutover, choose Launch cutover instances
  • Launch cutover instances for X > Launch
Figure 8. Performing final Cutover by launching Cutover instances

Figure 8. Performing final Cutover by launching Cutover instances

14.  Monitor the indicators to validate the success of the launch of your Cutover instance (shown in Figure 9):

  • Alerts column = Launched
  • Migration lifecycle column = Cutover in progress
  • Data replication status = Healthy
  • Next step column = Finalize cutover
Figure 9. Indicators for successful launch of Cutover instances

Figure 9. Indicators for successful launch of Cutover instances

15.  Test Cutover Instance:

  • Navigate to Amazon EC2 console > Instances (running)
  • Select Cutover instance
  • SSH/ RDP into your Cutover instance to confirm that it functions correctly
  • Validate connectivity and perform acceptance tests for your application
  • Revert Cutover if any issues

16.  Finalize the cutover after successful validation:

  • Navigate to AWS MGN console > Source servers page
  • Select source server > Open Test and Cutover menu
  • Under Cutover, choose Finalize Cutover
  • Finalize cutover for X servers > Finalize

17.  At this point, if your cutover is successful:

  • Migration lifecycle column = Cutover complete,
  • Data replication status column = Disconnected
  • Next step column = Mark as archived

The cutover is now complete and that the migration has been performed successfully. Data replication has also stopped and all replicated data will now be discarded.

Cleaning up

Archive your source servers that have launched Cutover instances to clean up your Source Servers page-

  • Navigate to Source Servers page (see Figure 10)
  • Select source server > Open Actions
  • Choose Mark as archived
  • Archive X server > Archive
Figure 10. Mark source servers as archived that are cutover

Figure 10. Mark source servers as archived that are cutover

Conclusion

In this post, we demonstrated how AWS MGN simplifies, expedites, and reduces the cost of migrating Amazon EC2-hosted workloads from one AWS Region to another. It integrates with AWS Migration Hub, enabling you to organize your servers into applications. You can track the progress of all your MGN at the server and app level, even as you move servers into multiple AWS Regions. Choose a Migration Hub Home Region for MGN to work with the Migration Hub.

Here are the AWS MGN supported AWS Regions. If your preferred AWS Region isn’t currently supported or you cannot install agents on your source servers, consider using CloudEndure Migration or AWS Server Migration Service respectively. CloudEndure Migration will be discontinued in all AWS Regions on December 30, 2022. Refer to CloudEndure Migration EOL for more information.

Note: Use of AWS MGN is free for 90 days but you will incur charges for any AWS infrastructure that is provisioned during migration and after cutover. For more information, refer to the pricing page.

Thanks for reading this blog post! If you have any comments or questions, feel free to put them in the comments section.

New – Amazon EC2 C6a Instances Powered By 3rd Gen AMD EPYC Processors for Compute-Intensive Workloads

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-ec2-c6a-instances-powered-by-3rd-gen-amd-epyc-processors-for-compute-intensive-workloads/

At AWS re:Invent 2021, we launched Amazon EC2 M6a instances powered by the 3rd Gen AMD EPYC processors, running at frequencies up to 3.6 GHz, which offer customers up to 35 percent improvement in price-performance compared to M5a instances.

Many customers are looking for ways to optimize their cloud utilization, and they are taking advantage of the compute choice that Amazon EC2 offers. Customers such as Dropbox, Capital One, and Sprinklr have been able to realize the cost benefits of AWS using EC2 instances powered by AMD EPYC processors.

Today, I am happy to announce the availability of the new compute-optimized Amazon EC2 C6a instances, which offer up to up to 15 percent improvement in price-performance versus C5a instances, and 10 percent lower cost than comparable x86-based EC2 instances.

These instances are ideal for running compute-intensive workloads such as high-performance web servers, batch processing, ad serving, machine learning, multi-player gaming, video encoding, high performance computing (HPC) such as scientific modeling, and machine learning.

Compared to C5a instances, this new instance type provides:

To increase instance security, C6a instances have always-on memory encryption with AMD Transparent Single Key Memory Encryption (TSME), and support new AVX2 instructions for accelerating encryption and decryption algorithms.

Like M6a, C6a instances are also available in 10 sizes:

Name vCPUs Memory
(GiB)
Network Bandwidth
(Gbps)
EBS Throughput
(Gbps)
c6a.large 2 4 Up to 12.5 Up to 6.6
c6a.xlarge 4 8 Up to 12.5 Up to 6.6
c6a.2xlarge 8 16 Up to 12.5 Up to 6.6
c6a.4xlarge 16 32 Up to 12.5 Up to 6.6
c6a.8xlarge 32 64 12.5 6.6
c6a.12xlarge 48 96 18.75 10
c6a.16xlarge 64 128 25 13.3
c6a.24xlarge 96 192 37.5 20
c6a.32xlarge 128 256 50 26.6
c6a.48xlarge 192 384 50 40

The new instances are built on the AWS Nitro System, a collection of building blocks that offloads many of the traditional virtualization functions to dedicated hardware for high performance, high availability, and highly secure cloud instances.

Available Now
C6a instances are available today in three AWS Regions: US East (N. Virginia), US West (Oregon), and EU (Ireland). As usual with EC2, you pay for what you use. For more information, see the EC2 pricing page.

To learn more, visit the EC2 C6a instance and AWS/AMD partner page. You can send feedback to  [email protected]AWS re:Post for EC2, or through your usual AWS Support contacts.

Channy

Running IBM MQ on AWS using High-performance Amazon FSx for NetApp ONTAP

Post Syndicated from Senthil Nagaraj original https://aws.amazon.com/blogs/architecture/running-ibm-mq-on-aws-using-high-performance-amazon-fsx-for-netapp-ontap/

Many Amazon Web Services (AWS) customers use IBM MQ on-premises and are looking to migrate it to the AWS Cloud. For persistent storage requirements with IBM MQ on AWS, Amazon Elastic File System (Amazon EFS) can be used for distributed storage and to provide high availability. The AWS QuickStart to deploy IBM MQ with Amazon EFS is an architecture used for applications where the file system throughput requirements are within the Amazon EFS limits.

However, there are scenarios where customers need increased capacity for their IBM MQ workloads. These could be applications that rely heavily on IBM MQ, which result in a much higher message data throughput. This means that the persistent messages must be written to and read from the shared file system more frequently. IBM MQ facilitates writing log information into the shared file system. These are two such situations where such application requirements translate to a higher number of read/write operations.

For applications using IBM MQ and requiring a higher file system throughput, Amazon provides Amazon FSx for NetApp ONTAP. This is a fully managed shared storage in the AWS Cloud, with the popular data access and management capabilities of ONTAP.

This blog explains how to use Amazon FSx for NetApp ONTAP for distributed storage and high availability with IBM MQ. Read more about Amazon FSx for NetApp ONTAP features, and performance details, throughput options, and performance tips.

Overview of IBM MQ architecture on AWS

For recovering queue data upon failure, you can set up IBM MQ with high availability.

The solution architecture is shown in Figure 1. This blog post assumes familiarity with AWS services such as Amazon EC2, VPCs, and subnets. For additional information on these topics, see the AWS documentation.

Figure 1. IBM MQ with Amazon FSx NetApp ONTAP

Figure 1. IBM MQ with Amazon FSx NetApp ONTAP

  1. IBM MQ is deployed in an Auto Scaling group spanning two Availability Zones.
  2. Amazon FSx NetApp ONTAP is used for data persistence and high availability of queue message data.
  3. Amazon FSx NetApp ONTAP is set up in the same Availability Zones as IBM MQ.
  4. Amazon FSx NetApp ONTAP provides automatic failover that is transparent to the application and completes in 60 seconds.

Considerations for the Amazon FSx NetApp ONTAP file system

When creating the Amazon FSx NetApp ONTAP file system as in Figure 1, consider the following:

  1. The subnets used for the file system should have connectivity with the subnets where your IBM MQ is running. See VPC documentation.
  2. Ensure that the security group(s) used by the elastic network interfaces (ENI) for Amazon FSx allow communication with the IBM MQ environment. Read more about limiting access security groups.
  3. When choosing the storage capacity, IOPS, and throughput capacity, make sure it aligns to your application requirements.
  4. If you choose to use AWS Key Management System (KMS) encryption, configure those details correctly.
  5. Be sure to provide an appropriate name for the volume junction, as you will use it to mount the file system onto your IBM MQ instance(s).
  6. Choose appropriate backup and maintenance windows according to your application needs.

Mount the Amazon FSx NetApp ONTAP file system onto the instance(s) where IBM MQ is running. Use either the DNS name or the IP address for the file system, as well as the correct volume junction name while mounting. Configure IBM MQ to make use of this mount for persisting the queue data.

This mount point must be included when updating fstab for Linux machines. This will allow for the file system to be mounted automatically in case the instance restarts. For Windows, take the appropriate steps to mount the file system automatically upon restart.

Conclusion

In this post, you have learned how to use Amazon FSx NetApp ONTAP with IBM MQ to maximize queue data throughput, while continuing to have persistent message storage. You can provision the Amazon FSx NetApp ONTAP file system, and mount its volume junction onto the IBM MQ instance(s).

Build a reliable, scalable, and cost-efficient IBM MQ solution on AWS, by using the fully elastic features that Amazon FSx NetApp ONTAP provides.

Related information:

Creating computing quotas on AWS Outposts rack with EC2 Capacity Reservations sharing

Post Syndicated from Rachel Zheng original https://aws.amazon.com/blogs/compute/creating-computing-quotas-on-aws-outposts-rack-with-ec2-capacity-reservation-sharing/

This post is written by Yi-Kang Wang, Senior Hybrid Specialist SA.

AWS Outposts rack is a fully managed service that delivers the same AWS infrastructure, AWS services, APIs, and tools to virtually any on-premises datacenter or co-location space for a truly consistent hybrid experience. AWS Outposts rack is ideal for workloads that require low latency access to on-premises systems, local data processing, data residency, and migration of applications with local system interdependencies. In addition to these benefits, we have started to see many of you need to share Outposts rack resources across business units and projects within your organization. This blog post will discuss how you can share Outposts rack resources by creating computing quotas on Outposts with Amazon Elastic Compute Cloud (Amazon EC2) Capacity Reservations sharing.

In AWS Regions, you can set up and govern a multi-account AWS environment using AWS Organizations and AWS Control Tower. The natural boundaries of accounts provide some built-in security controls, and AWS provides additional governance tooling to help you achieve your goals of managing a secure and scalable cloud environment. And while Outposts can consistently use organizational structures for security purposes, Outposts introduces another layer to consider in designing that structure. When an Outpost is shared within an Organization, the utilization of the purchased capacity also needs to be managed and tracked within the organization. The account that owns the Outpost resource can use AWS Resource Access Manager (RAM) to create resource shares for member accounts within their organization. An Outposts administrator (admin) can share the ability to launch instances on the Outpost itself, access to the local gateways (LGW) route tables, and/or access to customer-owned IPs (CoIP). Once the Outpost capacity is shared, the admin needs a mechanism to control the usage and prevent over utilization by individual accounts. With the introduction of Capacity Reservations on Outposts, we can now set up a mechanism for computing quotas.

Concept of computing quotas on Outposts rack

In the AWS Regions, Capacity Reservations enable you to reserve compute capacity for your Amazon EC2 instances in a specific Availability Zone for any duration you need. On May 24, 2021, Capacity Reservations were enabled for Outposts rack. It supports not only EC2 but Outposts services running over EC2 instances such as Amazon Elastic Kubernetes (EKS), Amazon Elastic Container Service (ECS) and Amazon EMR. The computing power of above services could be covered in your resource planning as well. For example, you’d like to launch an EKS cluster with two self-managed worker nodes for high availability. You can reserve two instances with Capacity Reservations to secure computing power for the requirement.

Here I’ll describe a method for thinking about resource pools that an admin can use to manage resource allocation. I’ll use three resource pools, that I’ve named reservation pool, bulk and attic, to effectively and extensibly manage the Outpost capacity.

A reservation pool is a resource pool reserved for a specified member account. An admin creates a Capacity Reservation to match member account’s need, and shares the Capacity Reservation with the member account through AWS RAM.

A bulk pool is an unreserved resource pool that is used when member accounts run out of compute capacity such as EC2, EKS, or other services using EC2 as underlay. All compute capacity in the bulk pool can be requested to launch until it is exhausted. Compute capacity that is not under a reservation pool belongs to the bulk pool by default.

An attic is a resource pool created to hold the compute capacity that the admin wouldn’t like to share with member accounts. The compute capacity remains in control by admin, and can be released to the bulk pool when needed. Admin creates a Capacity Reservation for the attic and owns the Capacity Reservation.

The following diagram shows how the admin uses Capacity Reservations with AWS RAM to manage computing quotas for two member accounts on an Outpost equipped with twenty-four m5.xlarge. Here, I’m going to break the idea into several pieces to help you understand easily.

  1. There are three Capacity Reservations created by admin. CR #1 reserves eight m5.xlarge for the attic, CR #2 reserves four m5.xlarge instances for account A and CR #3 reserves six m5.xlarge instances for account B.
  2. The admin shares Capacity Reservation CR #2 and CR #3 with account A and B respectively.
  3. Since eighteen m5.xlarge instances are reserved, the remaining compute capacity in the bulk pool will be six m5.xlarge.
  4. Both Account A and B can continue to launch instances exceeding the amount in their Capacity Reservation, by utilizing the instances available to everyone in the bulk pool.

Concept of defining computing quotas

  1. Once the bulk pool is exhausted, account A and B won’t be able to launch extra instances from the bulk pool.
  2. The admin can release more compute capacity from the attic to refill the bulk pool, or directly share more capacity with CR#2 and CR#3. The following diagram demonstrates how it works.

Concept of refilling bulk pool

Based on this concept, we realize that compute capacity can be securely and efficiently allocated among multiple AWS accounts. Reservation pools allow every member account to have sufficient resources to meet consistent demand. Making the bulk pool empty indirectly sets the maximum quota of each member account. The attic plays as a provider that is able to release compute capacity into the bulk pool for temporary demand. Here are the major benefits of computing quotas.

  • Centralized compute capacity management
  • Reserving minimum compute capacity for consistent demand
  • Resizable bulk pool for temporary demand
  • Limiting maximum compute capacity to avoid resource congestion.

Configuration process

To take you through the process of configuring computing quotas in the AWS console, I have simplified the environment like the following architecture. There are four m5.4xlarge instances in total. An admin account holds two of the m5.4xlarge in the attic, and a member account gets the other two m5.4xlarge for the reservation pool, which results in no extra instance in the bulk pool for temporary demand.

Prerequisites

  • The admin and the member account are within the same AWS Organization.
  • The Outpost ID, LGW and CoIP have been shared with the member account.

Architecture for configuring computing quotas

  1. Creating a Capacity Reservation for the member account

Sign in to AWS console of the admin account and navigate to the AWS Outposts page. Select the Outpost ID you want to share with the member account, choose Actions, and then select Create Capacity Reservation. In this case, reserve two m5.4xlarge instances.

Create a capacity reservation

In the Reservation details, you can terminate the Capacity Reservation by manually enabling or selecting a specific time. The first option of Instance eligibility will automatically count the number of instances against the Capacity Reservation without specifying a reservation ID. To avoid misconfiguration from member accounts, I suggest you select Any instance with matching details in most use cases.

Reservation details

  1. Sharing the Capacity Reservation through AWS RAM

Go to the RAM page, choose Create resource share under Resource shares page. Search and select the Capacity Reservation you just created for the member account.

Specify resource sharing details

Choose a principal that is an AWS ID of the member account.

Choose principals that are allowed to access

  1. Creating a Capacity Reservation for attic

Create a Capacity Reservation like step 1 without sharing with anyone. This reservation will just be owned by the admin account. After that, check Capacity Reservations under the EC2 page, and the two Capacity Reservations there, both with availability of two m5.4xlarge instances.

3.	Creating a Capacity Reservation for attic

  1. Launching EC2 instances

Log in to the member account, select the Outpost ID the admin shared in step 2 then choose Actions and select Launch instance. Follow AWS Outposts User Guide to launch two m5.4xlarge on the Outpost. When the two instances are in Running state, you can see a Capacity Reservation ID on Details page. In this case, it’s cr-0381467c286b3d900.

Create EC2 instances

So far, the member account has run out of two m5.4xlarge instances that the admin reserved for. If you try to launch the third m5.4xlarge instance, the following failure message will show you there is not enough capacity.

Launch status

  1. Allocating more compute capacity in bulk pool

Go back to the admin console, select the Capacity Reservation ID of the attic on EC2 page and choose Edit. Modify the value of Quantity from 2 to 1 and choose Save, which means the admin is going to release one more m5.4xlarge instance from the attic to the bulk pool.

Instance details

  1. Launching more instances from bulk pool

Switch to the member account console, and repeat step 4 but only launch one more m5.4xlarge instance. With the resource release on step 5, the member account successfully gets the third instance. The compute capacity is coming from the bulk pool, so when you check the Details page of the third instance, the Capacity Reservation ID is blank.

6.	Launching more instances from bulk pool

Cleaning up

  1. Terminate the three EC2 instances in the member account.
  2. Unshare the Capacity Reservation in RAM and delete it in the admin account.
  3. Unshare the Outpost ID, LGW and CoIP in RAM to get the Outposts resources back to the admin.

Conclusion

In this blog post, the admin can dynamically adjust compute capacity allocation on Outposts rack for purpose-built member accounts with an AWS Organization. The bulk pool offers an option to fulfill flexibility of resource planning among member accounts if the maximum instance need per member account is unpredictable. By contrast, if resource forecast is feasible, the admin can revise both the reservation pool and the attic to set a hard limit per member account without using the bulk pool. In addition, I only showed you how to create a Capacity Reservation of m5.4xlarge for the member account, but in fact an admin can create multiple Capacity Reservations with various instance types or sizes for a member account to customize the reservation pool. Lastly, if you would like to securely share Amazon S3 on Outposts with your member accounts, check out Amazon S3 on Outposts now supports sharing across multiple accounts to get more details.

How to Run Massively Scalable ADAS Simulation Workloads on CAEdge

Post Syndicated from Hendrik Schoeneberg original https://aws.amazon.com/blogs/architecture/how-to-run-massively-scalable-adas-simulation-workloads-on-caedge/

This post was co-written by Hendrik Schoeneberg, Sr. Global Big Data Architect, The An Binh Nguyen, Product Owner for Cloud Simulation at Continental, Autonomous Mobility – Engineering Platform, Rumeshkrishnan Mohan, Global Big Data Architect, and Junjie Tang, Principal Consultant at AWS Professional Services.

AV/ADAS simulations processing large-scale field sensor data such as radar, lidar, and high-resolution video come with many challenges. Typically, the simulation workloads are spiky with occasional, but high compute demands, so the platform must scale up or down elastically to match the compute requirements. The platform must be flexible enough to integrate specialized ADAS simulation software, use distributed computing or HPC frameworks, and leverage GPU accelerated compute resources where needed.

Continental created the Continental Automotive Edge (CAEdge) Framework to address these challenges. It is a modular multi-tenant hardware and software framework that connects the vehicle to the cloud. You can learn more about this in Developing a Platform for Software-defined Vehicles with Continental Automotive Edge (CAEdge) and Developing a platform for software-defined vehicles (re:Invent session-AUT304).

In this blog post, we’ll illustrate how the CAEdge Framework orchestrates ADAS simulation workloads with Amazon Managed Workflows for Apache Airflow (MWAA). We’ll show how it delegates the high-performance workloads to AWS Batch for elastic, highly scalable, and customizable compute needs. We’ll showcase the “bring your own software-in-the-loop” (BYO-SIL) pattern, detailing how to leverage specialized and proprietary simulation software in your workflow. We’ll also demonstrate how to integrate the simulation platform with tenant data in the CAEdge framework. This orchestrate and delegate pattern has previously been introduced in Field Notes: Deploying Autonomous Driving and ADAS Workloads at Scale with Amazon Managed Workflows for Apache Airflow and the Reference Architecture for Autonomous Driving Data Lake.

Solution overview for autonomous driving simulation

The following diagram shows a high-level overview of this solution.

Figure 1. Architecture diagram for autonomous driving simulation

Figure 1. Architecture diagram for autonomous driving simulation

It can be broken down into five major parts, illustrated in Figure 1.

  1. Simulation API: Amazon API Gateway (label 1) provides a REST API for authenticated users to schedule or monitor simulation. It runs on Amazon Managed Workflows for Apache Airflow (MWAA) using AWS Lambda (label 2).
  2. Simulation control plane: The simulation control plane (label 3) built on MWAA enables users to design and initiate workflows and integrate with AWS services.
  3. Scalable compute backend: We leverage the parallelization and elastic scalability capabilities of AWS Batch to distribute the workload (label 4). Additionally, we want to be able to run proprietary software components as part of the simulation workflows and use highly customizable Amazon EC2 compute environments. For example, we can use GPU acceleration or Graviton-based instances for workloads that must run on ARM, instead of x86 architectures.
  4. Autonomous Driving Data Lake (ADDL) integration: The simulations’ input and output data will be stored in a data lake (label 5) on Amazon S3. To provide efficient read and write access, data gets copied before each simulation run using RAID0 bundled ephemeral instance storage drives. After the simulation, results are written back to the data lake and are ready for reporting and analytics. We use AWS Lake Formation for metadata storage, data cataloging, and permission handling.
  5. Automated deployment: The solution architecture’s deployment is fully automated using a CI/CD pipeline in AWS CodePipeline and AWS CodeBuild (label 6). We can then perform automated testing and deployment to multiple target environments.

In this blog we will focus on the simulation control plane, scalable compute backend, and ADDL integration.

Simulation control plane

In a typical simulation, there are tasks like data movement (copying input data and persisting the results). These steps can require high levels of parallelization or GPU support. In addition, they can involve third-party or proprietary software, specialized runtime environments, or architectures like ARM. To facilitate all task requirements, we will delegate task initiation to the AWS services that best match the requirements. Correct sequence of tasks is achieved by an orchestration layer. Decoupling the simulation orchestration from task initiation follows the key design pattern to separate concerns. This enables the architecture to be adapted to specific requirements.

For the high-level task orchestration, we’ll introduce a simulation control plane built on MWAA. Modeling all simulation tasks in a directed acyclic graph (DAG), MWAA performs scheduling and task initiation in the correct sequence. It also handles the integration with AWS’ services. You can intuitively monitor and debug the progress of a simulation run across different services and networks. For the workflow within CAEdge, we’ll use AWS Batch to run simulations that are packaged as Docker containers.

Parameterizable workflows: Airflow supports parameterization of DAGs by providing a JSON object when triggering a DAG run that can be accessed at runtime. This enables us to create a simulation execution framework in which we model the simulation steps. It lets you specify the simulations’ parameters such as which Docker container to execute, runtime configuration, or input and output data locations, see Figures 2 and 3.

Figure 2. Specify simulations’ parameters in JSON

Figure 2. Specify simulations’ parameters in JSON

Figure 3. Read simulations’ parameters from JSON

Figure 3. Read simulations’ parameters from JSON

Scalable compute backend

Publishing the Docker image: To run a simulation, you must provide a Docker container in Amazon Elastic Container Registry (ECR). Manually push Docker images to ECR using the command line interface (CLI) or using CI/CD pipeline automation.

Choosing the compute option: Containers at AWS describes the services options for developers. Various factors can contribute to your decision-making. For the CAEdge platform, we want to run thousands of containers with fine-grained control over the underlying compute instance’s configuration. We also want to run containers in privileged mode, so AWS Batch on EC2 is a great match. Figure 4 outlines the compute backend’s architecture.

Figure 4. Compute backend architecture diagram

Figure 4. Compute backend architecture diagram

To run a Docker container on AWS Batch, we need three components: A job definition, a compute environment, and a job queue. The AWS Batch job definition specifies which job to run and how to run it. For example, we can define the Docker image to use, and specify the vCPU and memory configuration. The job definition will be submitted to a job queue, which in turn is linked to one or more compute environments. The compute environment specifies the compute configuration used to run a containerized workload, in addition to instance types and storage configurations. Creating a compute environment describes this more in detail. In the section ‘ADDL integration,’ we’ll describe how to select and configure the EC2 container instances to maximize network and disk I/O.

The AWS Batch array job feature can spin up thousands of independent, but similarly configured jobs. Instead of submitting a single job for each input file from the simulation control plane, we can instead submit a single array job. This provides the collection of input files as input. AWS Batch can spin up child jobs to process each entry of the collection in parallel, reducing operational overhead drastically.

With these components, we can now submit a job definition to a job queue from the simulation control plane, which will initiate on the corresponding compute environment.

ADDL integration

In the CAEdge platform, all simulation recordings and their metadata are stored and cataloged in the data lake. On average, single recordings are around 100 GB in size with simulation requests containing 100–300 recordings. A simulation request can therefore require 10–30 TB of data movement from S3 to the containers before the simulation starts. The containers’ performance will directly depend on I/O performance, as the input data is read and processed during execution. To provide the highest performance of data transfer and simulation workload, we need data storage options that maximize network and disk I/O throughput.

Choosing the storage option: For CAEdge’s simulation workloads, the storage solution acts as a temporary scratch space for the simulation containers’ input data. It should maximize I/O throughput. These requirements are met in the most cost-efficient way by choosing EC2 instances of the M5d family and their attached instance storage.

The M5d are general-purpose instances with NVMe-based SSDs, which are physically connected to the host server and provide block-level storage coupled to the lifetime of the instance. As described in the previous section, we configured AWS Batch to create compute environments using EC2’s M5d instances. While other storage options like Amazon Elastic File System (EFS) and Amazon Elastic Block Storage (EBS) can scale to match even demanding throughput scenarios, the simulation benefits from a high-performance, temporary storage location that is directly attached to the host.

Bundling the volumes: M5d instances can have multiple NVMe drives attached, depending on their size and storage configuration. To provide a single stable storage location for the simulation containers, bundle all attached NVMe SSDs into a single RAID0 volume during the instance’s launch, using a modified user data script. With this, we can provide a fixed storage location that can be accessed from the simulation containers. Additionally, we are distributing I/O operations evenly across all disks in the RAID0 configuration, which improves the read and write performance.

KPIs used for measurement

The SIL factor tells you how long the simulation takes compared to the duration of the recorded data.

SIL factor example:

  • For a recording with 60 minutes of data and a simulation duration of 120 minutes, the resulting SIL factor is 120 / 60 = 2. The simulation runs twice as long as real time.
  • For a recording with 60 minutes of data and a simulation duration of 60 minutes, the resulting SIL factor is 60 / 60 = 1. The simulation runs in real time.

Similarly, the aggregated SIL factor tells you how long the simulation for multiple recordings takes compared to the duration of the recorded data. It also factors in the horizontal scaling capabilities.

Aggregated SIL factor example:

  • For 3 recordings with 60 minutes of data and an overall simulation duration of 120 minutes, the resulting SIL factor is 120 / 3 * 60 = 0.67. The overall simulation is a third faster than real time.

Performance results

The storage optimizations described in the ADDL integration KPI section preceding, led to an improvement of the overall simulation duration of 50%. To benchmark the overall simulation platform’s performance, we created a simulation run with 15 recordings. The simulation run completed successfully with an aggregated SIL factor of 0.363, or, alternatively put, the simulation was roughly three times faster than real time. During production use, the platform will handle simulation runs with an average of 100-300 recordings. For these runs, the aggregated SIL factor is expected to be even smaller, as more simulations can be processed in parallel.

Conclusion

In this post, we showed how to design a platform for ADAS simulation workloads using the Bring Your Own Software-In-the-Loop (BYO-SIL) concept. We covered the key components including simulation control plane, scalable compute backend, and ADDL integration based on Continental Automotive Edge (CAEdge). We discussed the key performance benefits due to horizontal scalability and how to choose an optimized storage integration pattern. This led to an improvement of the overall simulation duration of 50%.

How to mount Linux volume and keep mount point consistency

Post Syndicated from limillan original https://aws.amazon.com/blogs/compute/how-to-mount-linux-volume-and-keep-mount-point-consistency/

This post is written by: Leonardo Azize Martins, Cloud Infrastructure Architect, Professional Services

Customers often use Amazon Elastic Compute Cloud (Amazon EC2) Linux based instances with many Amazon Elastic Block Store (Amazon EBS) volumes attached. In this case, device name can vary depending on some facts, such as virtualization type, instance type, or operating system. As the device name can change, the customer shouldn’t rely on the device name to mount volumes from it. The customer wants to avoid the situation where a volume is mounted on a different mount point just because the device name changed, or the device name doesn’t exist because the name pattern changed.

Customers who want to utilize the latest instance family usually change the instance type when a new one is available. The device name can be different between instance families, such as T2 and T3. T2 uses /dev/sd[a-z], while T3 uses /dev/nvme[0-26]n1. If you mount a device on T2 called /dev/sdc, when you change the instance family to T3 the same device won’t be called /dev/sdc anymore. In this case, it will fail to mount.

Amazon EBS volumes are exposed as NVMe block devices on instances built on the AWS Nitro System. The block device driver can assign NVMe device names in a different order than what you specified for the volumes in the block device mapping. In this situation, a device that should be mounted on /data could end-up being mounted on /logs.

On Linux, you can use the fstab file to mount devices using kernel name descriptors (the traditional way), file system labels, or the file system UUID. Kernel name descriptors aren’t persistent and can change each boot. Therefore, they shouldn’t be used in configuration files.

UUID is a mechanism for giving each filesystem a unique identifier. These identifiers’ attributes are generated by filesystem utilities (mkfs.*) when the device is formatted, and they’re designed so that collisions are unlikely. All GNU/Linux filesystems (including swap and LUKS headers of raw encrypted devices) support UUID.

As UUID is a filesystem attribute, it can also be used with Logical Volume Manager (LVM) and Linux software RAID (mdadm).

Depending on the fstab file configuration, you may find that you can’t access your instance, which requires you to follow a rescue process to fix issues. This is the case if you configure the fstab file with the device name and change the instance type.

This post shows how to mount Linux volumes and keep mount points preserved when the instance type is changed or the instance is rebooted.

Overview of solution

When you create an instance, you specify block devices mapping. It doesn’t mean that the Linux device has the same name or can be discovered in the same order as specified on the instance mapping. This situation can be more evident when using applications that require more volumes.

Using UUID to mount volumes lets you mitigate future issues when you stop and start your instance or change the instance type.

EC2 instance block device mapping

Figure 1: EC2 instance block device mapping

Walkthrough

You will create one instance with three volumes: one root volume and two data volumes. We use Amazon Linux 2 in this post. In this instance type, volumes have a specific name format. Later, you will change the instance type. The new instance type volumes will have another name format.

Follow these steps:

  • Create an instance with three volumes
  • Create the filesystem on the volumes and mount them
  • Change the instance type

Prerequisites

For this walkthrough, you should have the following prerequisites:

Create instance

Create one instance with three volumes: a root volume and two data volumes. Use Launch Instance Wizard with the following details.

Launch Amazon Linux 2 instance

  1. On Step 1, choose Amazon Linux 2 AMI (HVM), SSD Volume Type.
  2. On Step 2, choose micro.
  3. On Step 3, choose Next.
  4. On Step 4, add two new volumes, device /dev/sdb 10 GiB and device /dev/sdc 12 GiB.

Launch instances, add storage

Figure 2: Launch instances, add storage

Create filesystem and mount

Connect to your instance using EC2 Instance Connect or any other method that feels comfortable for you. Mount the device using UUID instead of the device name. Run the following instructions as the user root.

Format and mount the device

  1. Run the following command to confirm that you have three disks:
    $ lsblk
  2. Format disk as xfs, and run the following commands:
    mkfs.xfs /dev/xvdb
    mkfs.xfs /dev/xvdc
  3. Create mount point, and run the following commands:
    mkdir /mnt/disk1
    mkdir /mnt/disk2
  4. Add mount instructions, and run the following commands:
    echo "$(blkid /dev/xvdb | awk '{print $2}') /mnt/disk1 xfs defaults,noatime" | tee -a /etc/fstab
    echo "$(blkid /dev/xvdc | awk '{print $2}') /mnt/disk2 xfs defaults,noatime" | tee -a /etc/fstab
  5. Mount volumes, create dummy file, and run the following commands:
    mount -a
    touch /mnt/disk1/file1.txt
    touch /mnt/disk2/file2.txt

You will have an fstab file like the following:

cat /etc/fstab
UUID=7b355c6b-f82b-4810-94b9-4f3af651f629     /           xfs    defaults,noatime  1   1
UUID="2c160dd6-586c-4258-84bb-c79933b9ae02" /mnt/disk1 xfs defaults,noatime
UUID="3e7a0c28-7cf1-40dc-82a1-2f5cfb53f9a4" /mnt/disk2 xfs defaults,noatime

Change instance type

Change the instance type from t2.micro to t3.micro.

Launch Amazon Linux 2 instance

  1. Stop instance.
  2. Change the instance type to micro.
  3. Start instance.
  4. Connect to your instance using EC2 Instance Connect.
  5. Check the device name, and run the following command:
lsblk
  1. List files, and run the following command:
ls -l /mnt/*

Note that the device names are changed from xvd* to nvme*. All of the devices are mounted without any issue and with the correct mount points.

Cleaning up

To avoid incurring future charges, delete the instance and all of the volumes that you created in this post.

The other side

UUID is an attribute of the filesystem that was generated when you formatted your device. Therefore, it will follow your device even if you create an AMI or snapshot. So you don’t need to worry about a restore process, and it will smoothly proceed to an instance restore. You must be careful if you restore a snapshot from one volume and attach it to the same instance, as you will end-up with two volumes that are using the same UUID. If you try to mount the restored volume on the same instance, then it will fail and you will find this message on /var/log/messages file. kernel: XFS (xvdf1): Filesystem has duplicate UUID f6646b81-f2a6-46ca-9f3d-c746cf015379 - can't mount It is even more important to be careful if you attach a volume created from the snapshot of the root volume and restart your instance. Since both volumes have the same UUID, you may find that a volume other than the one attached to /dev/xvda or /dev/sda has become the root volume of your instance. See the following example for details. Note that both volumes have the same UUID, but the one mounted on / is /dev/xvdf1, not /dev/xvda1, which is the real root volume for this instance.

$ blkid
/dev/xvda1: LABEL="/" UUID="f6646b81-f2a6-46ca-9f3d-c746cf015379" TYPE="xfs" PARTLABEL="Linux" PARTUUID="79fae994-3708-4293-bb29-4d069d1c786b"
/dev/xvdf1: LABEL="/" UUID="f6646b81-f2a6-46ca-9f3d-c746cf015379" TYPE="xfs" PARTLABEL="Linux" PARTUUID="79fae994-3708-4293-bb29-4d069d1c786b"
$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   8G  0 disk
└─xvda1 202:1    0   8G  0 part
xvdf    202:80   0   8G  0 disk
└─xvdf1 202:81   0   8G  0 part /

Conclusion

In this post, we covered how to use UUID to mount Linux devices using fstab file. This keeps the mount point on the correct device. It also lets you change the instance type without changes to the fstab file. You can use UUID with LVM and Linux software RAID (mdadm), UUID, as an attribute of the filesystem, will be the same even after a backup and restore process, snapshot, or clone. To learn more, check out our block device mappings and device names on Linux instances documentation.

Scaling DLT to 1M TPS on AWS: Optimizing a Regulated Liabilities Network

Post Syndicated from Erica Salinas original https://aws.amazon.com/blogs/architecture/scaling-dlt-to-1m-tps-on-aws-optimizing-a-regulated-liabilities-network/

SETL is an open source, distributed ledger technology (DLT) company that enables tokenisation, digital custody, and DLT for securities markets and payments. In mid-2021, they developed a blueprint for a Regulated Liabilities Network (RLN) that enables holding and managing a variety of tokenized value irrespective of its form. In a December 2021 collaboration with Amazon Web Services (AWS), SETL demonstrated that their RLN platform could support one million transactions per second. These findings were published in their whitepaper, The Regulated Liability Network: Whitepaper on scalability and performance.

This AWS-hosted architecture scaled each processing component and the complete transaction flow. It was built using Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Cloud Compute (EC2), and Amazon Managed Streaming for Apache Kafka (MSK).

This post discusses the technical implementation of the simulated network. We show how scaling characteristics were achieved, while maintaining the business requirements of atomicity and finality. We also discuss how each RLN component was optimized for high performance.

Background: What is an RLN?

In The Regulated Internet of Value, Tony McLaughlin of Citi proposed a single network to record ownership of multiple token types, each which represents a liability of a regulated entity. The RLN would give commercial banks, e-money providers, and central banks, the ability to issue liabilities and immutably record all balances and owners. Because it is designed to maintain a globally sequenced set of state changes, an unambiguous ledger of token balances can be computed and updated. Transactions, which result in two or more ledger updates, are proposed to the network. They are authorized by all impacted network participants, ordered for distribution to participant ledgers, and then persisted by multiple participant ledgers. This process is completed with five processing components.

Figure 1. Core components and queues of RLN

Figure 1. Core components and queues of RLN

Given the existing performance limitations faced by many DLT platforms, a main concern with the network design was its ability to meet throughput requirements. SETL collaborated with AWS to define an architecture that integrated decoupled processing components, as shown in Figure 2. The target throughput was 1,000,000 TPS for each component through the simulated network.

Figure 2. Simulated RLN architecture in the AWS Cloud

Figure 2. Simulated RLN architecture in the AWS Cloud

A queue – process – queue architectural model

The basic architectural model applied was “queue – process – queue.” Each component consumed transactions from one or more Kafka topics, performed its requisite activities, and produced output to a second Kafka topic. Components of the message flow used Amazon EKS, Amazon EC2, and Amazon MSK.

Specific techniques were applied to achieve the scaling characteristics of components in the main message flow.

  • A distinct EKS-managed node group was used for each RLN component. Each node within the group had its own EC2 instance that housed at most two Kubernetes Pods. This technique decreased potential network throttling and enabled the monitoring of pod resource utilization using Amazon CloudWatch at the EC2 level. This aided in identifying bottlenecks, which can be challenging if large EC2 nodes house many Kubernetes Pods.
  • Each RLN queue component had its own dedicated Kafka cluster. By isolating the Kafka topics to their own cluster, we were able to scale and monitor the cluster individually. This decreased potential chokepoints.
  • A combination of eksctl, kubectl, and EC2 Auto Scaling groups was used to deploy and horizontally scale each RLN component. The tooling enabled rapid results by controlling pod configuration, automating deployments, and supporting multiple performance testing iterations.

Fine tuning RLN system components

The key metric tested was message throughput in transactions per second (TPS) consumed, processed, and produced for each component. An end-to-end test was performed, which measured throughput of the complete simulated network. Each RLN component was tuned to meet this metric as follows.

The Transaction Generator acted as the customer entering transactions into the system (see Figure 3.) The Kafka producer property buffer.memory was changed to 536870912 to simulate an increase in producer TPS for each component.

Figure 3. Simulated RLN Transaction Generator

Figure 3. Simulated RLN Transaction Generator

The Scheduler’s scaling technique was a dedicated Compute (one Pod/EC2 node) that listened to one Kafka partition in the Transaction Queue as seen in Figure 4. Additionally, turning on gzip compression (compression.type) on the producer resolved a network bottleneck for the downstream Kafka cluster.

Figure 4. Simulated RLN Scheduler

Figure 4. Simulated RLN Scheduler

The Approver’s scaling technique was like the Scheduler, however, the Proposal Queue had 10 topics as opposed to one. The approver is stateless, so it randomly selects a Kafka topic and partition to listen to. When scaling to 100 Kubernetes Pods, there would be roughly one Kafka partition per pod. As shown in Figure 5, each pod has its own EC2 instance. The Assembler’s scaling techniques were the same as for the Scheduler and Approver components.

Figure 5. Simulated RLN Approver

Figure 5. Simulated RLN Approver

The Sequencer, as designed, cannot scale horizontally due to the nature of the sequencing algorithm. However, without any special tuning of the Kafka consumer configuration or any significant increase in EC2 sizing, it was able to achieve one million TPS. Its architecture is shown in Figure 6.

Figure 6. Simulated RLN Sequencer

Figure 6. Simulated RLN Sequencer

The State Updater scaled with each Kubernetes Pod listening to an assigned Approved Proposal topic (one-to-one mapping) and receiving all sequenced hashes from the Hash Queue. Achieving 1 million TPS throughput required 10 nodes configured with c5.4xlarge, as seen in Figure 7.

Figure 7. Simulated RLN State Updater

Figure 7. Simulated RLN State Updater

Successful throughput testing

Test results demonstrate that each component can sustain over 1 million TPS. While horizontal scaling was applied, even a single instance achieved over 10,000 TPS. Individual component performance test results can be seen in Table 1.

Component Single Instance (TPS) # of Instances Multiple Instances (TPS)
Generator ~ 629,219 2 1,258,438
Scheduler ~11,019 100 1,101,928
Approver ~10,348 100 1,034,868
Assembler ~10,691 100 1,069,100
Sequencer ~1,052,845 1 N/A
State Updater ~100,256 10 1,002,569

Table 1. Test results per individual component

Network tests were performed with all components integrated to create a simulated RLN network. The tests were executed with transaction generation of ~1,000,000 TPS. Throughput was measured at the output of the final component’s (State Updater) instances (see Figure 8). It was also measured in aggregate at over 1.2 million TPS (see Figure 9).

Figure 8. Individual TPS for State Update component nodes

Figure 8. Individual TPS for State Update component nodes

Figure 9. Aggregate TPS for State Update component nodes

Figure 9. Aggregate TPS for State Update component nodes

While this study was designed to demonstrate capacity and throughput, the tests did replicate Kafka partitions across three Availability Zones in the selected AWS Region. Thus, testing demonstrated that the proposed architecture supports geographic resilience while meeting the throughput benchmark. Resiliency and security are areas of focus for testing in the next phases of architecting a production-ready version of RLN.

Looking forward

During the month of joint work between the SETL and AWS technical teams, we stood up and tuned basic RLN functionality. We successfully demonstrated the scalability of the environment to at least 1M TPS. By using Kafka queues and the horizontal and vertical scalability of AWS services, we addressed the primary concern of reaching production-level throughput.

The industry group collaborating on the RLN concept now has a solid technical foundation on which to build a production-ready RLN architecture. Future experimentation will include integration with financial institutions and technology partners that will provide the capabilities needed to build a successful RLN ecosystem.

Further explorations include enabling digital signing, verification at scale, and expanding the network to include financial institution environments integrated with their own RLN partitions.

New – Amazon EC2 X2iezn Instances Powered by the Fastest Intel Xeon Scalable CPU for Memory-Intensive Workloads

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-ec2-x2iezn-instances-powered-by-the-fastest-intel-xeon-scalable-cpu-for-memory-intensive-workloads/

Electronic Design Automation (EDA) workloads require high computing performance and a large memory footprint. These workloads are sensitive to faster CPU performance and higher clock speeds since the faster performance allows more jobs to be completed on the lower number of cores. At AWS re:Invent 2020, we launched Amazon EC2 M5zn instances which use second-generation Intel Xeon Scalable (Cascade Lake) processors with an all-core turbo clock frequency of up to 4.5 GHz, which is the fastest of any cloud instance.

Our customers have enjoyed the high single-threaded performance, high-speed networking, and balanced memory-to-vCPU ratio of EC2 M5zn instances. They have asked for instances that will leverage these features while also providing them a greater memory footprint per vCPU.

Today, we are launching Amazon EC2 X2iezn instances, which use the same Intel Xeon Scalable processors as M5zn instances, with an all-core turbo clock frequency of 4.5 GHz and up to 1.5 TiB of memory, which is the fastest of any cloud instance for EDA workloads. These instances are capable of delivering up to 55 percent better price-performance per vCPU compared to X1e instances.

X2iezn instances offer 32 GiB of memory per vCPU and will support up to 48 vCPUs and 1536 GiB of memory. Built on the AWS Nitro, they deliver up to 100 Gbps of networking bandwidth and 19 Gbps of dedicated Amazon EBS bandwidth to improve performance for EDA applications.

You might have noticed that we’re now using the “i” suffix in the instance type to specify that the instances are using an Intel processor, “e” in the memory-optimized instance family to indicate extended memory, “z” which indicates high-frequency processors, and “n” to support higher network bandwidth up to 100 Gbps.

X2iezn instances are VPC only, HVM-only, and EBS-Optimized, with support for Optimize vCPU. As you can see, the memory-to-vCPU ratio on these instances is the same as that of previous-generation X1e instances:

Instance Name vCPUs RAM (GiB) Network Bandwidth (Gbps) EBS-Optimized Bandwidth (Gbps)
x2iezn.2xlarge 8 256 Up to 25 3.170
x2iezn.4xlarge 16 512 Up to 25 4.750
x2iezn.6xlarge 24 768 50 9.5
x2iezn.8xlarge 32 1024 75 12
x2iezn.12xlarge 48 1536 100 19
x2iezn.metal 48 1536 100 19

Many customers will be able to benefit from using X2iezn instances to improve performance and efficiency for their EDA workloads. Here are some examples:

  • Annapurna Labs tested the X2iezn instances with Calibre’s Design Rule Checking, which has shown a 40 percent faster runtime compared to X1e instances, and a 25 percent faster runtime over R5d instances.
  • Astera Labs is a fabless, cloud-based semiconductor company developing purpose-built CXL, PCIe, and Ethernet connectivity solutions for data-centric systems. They were able to see performance gains of up to 25 percent compared to similar EDA workloads running on R5 instances.
  • Cadence tested the X2iezn instances using their Pegasus True Cloud feature, which allows designers to run physical verification jobs on the cloud and observed a 50 percent performance improvement over R5 instances. They see X2iezn instances as an excellent environment for testing EDA workloads.
  • NXP Semiconductors worked with AWS to run their Calibre and Spectre workloads on Amazon EC2 X2iezn instances, which measured 10-15 percent higher performance using X2iezn instances compared to their on-premises, Xeon Gold 6254 with max turbo frequency of 4.0GHz.
  • Siemens EDA worked with AWS to test the new Amazon EC2 X2iezn HPC/EDA focused instances with the industry performance and sign-off leader Calibre evaluating advanced node DRC workloads. They were pleased to demonstrate performance improvements of up to 14% using the 4.5 GHz all core turbo frequency of X2iezn instances for all VMs in the run. Additionally, they successfully demonstrated the use of a heterogeneous server configuration using the X2iezn as the primary node and other lower memory VMs for remote compute – providing an 11% speed up and attractive value. These results confirmed the X2iezn is a good fit for primary server EDA workloads for Calibre Physical and Circuit verification applications.
  • Synopsys IC Validator provides highly scalable high-performance physical verification signoff. They achieved 15 percent performance improvement, scalability to 1000s of cores, and 30 percent better efficiency using IC Validator’s unique elastic CPU management technology versus R5d instances.

Things to Know
Here are some fun facts about the X2iezn instances:

Optimizing CPU—You can disable Intel Hyper-Threading Technology for workloads that perform well with single-threaded CPUs, like some HPC applications.

NUMA—You can make use of non-uniform memory access (NUMA) on x2iezn.12xlarge instances. This advanced feature is worth exploring if you have a deep understanding of your application’s memory access patterns.

Available Now
Amazon EC2 X2iezn instances are now available in the US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) Regions. You can use On-Demand Instances, Reserved Instances, Savings Plan, and Spot Instances. Dedicated Instances and Dedicated Hosts are also available.

To learn more, visit our EC2 X2i Instances page, and please send feedback to the AWS forum for EC2 or through your usual AWS Support contacts.

Channy

Deploy and Manage Gitlab Runners on Amazon EC2

Post Syndicated from Sylvia Qi original https://aws.amazon.com/blogs/devops/deploy-and-manage-gitlab-runners-on-amazon-ec2/

Gitlab CI is a tool utilized by many enterprises to automate their Continuous integration, continuous delivery and deployment (CI/CD) process. A Gitlab CI/CD pipeline consists of two major components: A .gitlab-ci.yml file describing a pipeline’s jobs, and a Gitlab Runner, an application that executes the pipeline jobs.

Setting up the Gitlab Runner is a time-consuming process. It involves provisioning the necessary infrastructure, installing the necessary software to run pipeline workloads, and configuring the runner. For enterprises running hundreds of pipelines across multiple environments, it is essential to automate the Gitlab Runner deployment process so as to be deployed quickly in a repeatable, consistent manner.

This post will guide you through utilizing Infrastructure-as-Code (IaC) to automate Gitlab Runner deployment and administrative tasks on Amazon EC2. With IaC, you can quickly and consistently deploy the entire Gitlab Runner architecture by running a script. You can track and manage changes efficiently. And, you can enforce guardrails and best practices via code. The solution presented here also offers autoscaling so that you save costs by terminating resources when not in use. You will learn:

  • How to deploy Gitlab Runner quickly and consistently across multiple AWS accounts.
  • How to enforce guardrails and best practices on the Gitlab Runner through IaC.
  • How to autoscale Gitlab Runner based on workloads to ensure best performance and save costs.

This post comes from a DevOps engineer perspective, and assumes that the engineer is familiar with the practices and tools of IaC and CI/CD.

Overview of the solution

The following diagram displays the solution architecture. We use AWS CloudFormation to describe the infrastructure that is hosting the Gitlab Runner. The main steps are as follows:

  1. The user runs a deploy script in order to deploy the CloudFormation template. The template is parameterized, and the parameters are defined in a properties file. The properties file specifies the infrastructure configuration, as well as the environment in which to deploy the template.
  2. The deploy script calls CloudFormation CreateStack API to create a Gitlab Runner stack in the specified environment.
  3. During stack creation, an EC2 autoscaling group is created with the desired number of EC2 instances. Each instance is launched via a launch template, which is created with values from the properties file. An IAM role is created and attached to the EC2 instance. The role contains permissions required for the Gitlab Runner to execute pipeline jobs. A lifecycle hook is attached to the autoscaling group on instance termination events. This ensures graceful instance termination.
  4. During instance launch, CloudFormation uses a cfn-init helper script to install and configure the Gitlab Runner:
    1. cfn-init installs the Gitlab Runner software on the EC2 instance.
    2. cfn-init configures the Gitlab Runner as a docker executor using a pre-defined docker image in the Gitlab Container Registry. The docker executor implementation lets the Gitlab Runner run each build in a separate and isolated container. The docker image contains the software required to run the pipeline workloads, thereby eliminating the need to install these packages during each build.
    3. cfn-init registers the Gitlab Runner to Gitlab projects specified in the properties file, so that these projects can utilize the Gitlab Runner to run pipelines.
  1. The user may repeat the same steps to deploy Gitlab Runner into another environment.

Architecture diagram previously explained in post.

Walkthrough

This walkthrough will demonstrate how to deploy the Gitlab Runner, and how easy it is to conduct Gitlab Runner administrative tasks via this architecture. We will walk through the following tasks:

  • Build a docker executor image for the Gitlab Runner.
  • Deploy the Gitlab Runner stack.
  • Update the Gitlab Runner.
  • Terminate the Gitlab Runner.
  • Add/Remove Gitlab projects from the Gitlab Runner.
  • Autoscale the Gitlab Runner based on workloads.

The code in this post is available at https://github.com/aws-samples/amazon-ec2-gitlab-runner.git

Prerequisites

For this walkthrough, you need the following:

  • A Gitlab account (all tiers including Gitlab Free self-managed, Gitlab Free SaaS, and higher tiers). This demo uses gitlab.com free tire.
  • A Gitlab Container Registry.
  • Git client to clone the source code provided.
  • An AWS account with local credentials properly configured (typically under ~/.aws/credentials).
  • The latest version of the AWS CLI. For more information, see Installing, updating, and uninstalling the AWS CLI.
  • Docker is installed and running on the localhost/laptop.
  • Nodejs and npm installed on the localhost/laptop.
  • A VPC with 2 private subnets and that is connected to the internet via NAT gateway allowing outbound traffic.
  • The following IAM service-linked role created in the AWS account: AWSServiceRoleForAutoScaling
  • An Amazon S3 bucket for storing Lambda deployment packages.
  • Familiarity with Git, Gitlab CI/CD, Docker, EC2, CloudFormation and Amazon CloudWatch.

Build a docker executor image for the Gitlab Runner

The Gitlab Runner in this solution is implemented as docker executor. The Docker executor connects to Docker Engine and runs each build in a separate and isolated container via a predefined docker image. The first step in deploying the Gitlab Runner is building a docker executor image. We provided a simple Dockerfile in order to build this image. You may customize the Dockerfile to install your own requirements.

To build a docker image using the sample Dockerfile:

  1. Create a directory where we will store our demo code. From your terminal run:
mkdir demo-repos && cd demo-repos
  1. Clone the source code repository found in the following location:
git clone https://github.com/aws-samples/amazon-ec2-gitlab-runner.git
  1. Create a new project on your Gitlab server. Name the project any name you like.
  2. Clone your newly created repo to your laptop. Ignore the warning about cloning an empty repository.
git clone <your-repo-url>
  1. Copy the demo repo files into your newly created repo on your laptop, and push it to your Gitlab repository. You may customize the Dockerfile before pushing it to Gitlab.
cp -r amazon-ec2-gitlab-runner/* <your-repo-dir>
cd <your-repo-dir>
git add .
git commit -m “Initial commit”
git push
  1. On the Gitlab console, go to your repository’s Package & Registries -> Container Registry. Follow the instructions provided on the Container Registry page in order to build and push a docker image to your repository’s container registry.

Deploy the Gitlab Runner stack

Once the docker executor image has been pushed to the Gitlab Container Registry, we can deploy the Gitlab Runner. The Gitlab Runner infrastructure is described in the Cloudformation template gitlab-runner.yaml. Its configuration is stored in a properties file called sample-runner.properties. A launch template is created with the values in the properties file. Then it is used to launch instances. This architecture lets you deploy Gitlab Runner to as many environments as you like by utilizing the configurations provided in the appropriate properties files.

During the provisioning process, utilize a cfn-init helper script to run a series of commands to install and configure the Gitlab Runner.

          commands:
            01InstallDocker:
              command: sudo yum -y install docker
            02StartDocker:
              command: sudo service docker start
            03DownloadGitlabRunner:
              command: sudo wget -O /usr/bin/gitlab-runner https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64
            04ChmodGitlabRunner:
              command: sudo chmod a+x /usr/bin/gitlab-runner
            05AddUser:
              command: sudo useradd --comment 'GitLab Runner' --create-home gitlab-runner --shell /bin/bash
            06InstallGitlabRunner:
              command: sudo gitlab-runner install --user=gitlab-runner --working-directory=/home/gitlab-runner
            07SetRegion:
              command: !Sub 'aws configure set default.region ${AWS::Region}'
            08ConfigureDockerExecutor:
              command: !Sub 
                - |
                  for GitlabGroupToken in `aws ssm get-parameters --names /${AWS::StackName}/ci-tokens --query 'Parameters[0].Value' | sed -e "s/\"//g" | sed "s/,/ /g"`;do
                      sudo gitlab-runner register \
                      --non-interactive \
                      --url "${GitlabServerURL}" \
                      --registration-token $GitlabGroupToken \
                      --executor "docker" \
                      --docker-image "${DockerImagePath}" \
                      --description "Gitlab Runner with Docker Executor" \
                      --locked="${isLOCKED}" --access-level "${ACCESS}" \
                      --docker-volumes "/var/run/docker.sock:/var/run/docker.sock" \
                      --tag-list "${RunnerEnvironment}-${RunnerVersion}-docker"
                  done
                - isLOCKED: !FindInMap [GitlabRunnerRegisterOptionsMap, !Ref RunnerEnvironment, isLOCKED]
                  ACCESS: !FindInMap [GitlabRunnerRegisterOptionsMap, !Ref RunnerEnvironment, ACCESS]                              
            09StartGitlabRunner:
              command: sudo gitlab-runner start

The helper script ensures that the Gitlab Runner setup is consistent and repeatable for each deployment. If a configuration change is required, users simply update the configuration steps and redeploy the stack. Furthermore, all changes are tracked in Git, which allows for versioning of the Gitlab Runner.

To deploy the Gitlab Runner stack:

  1. Obtain the runner registration tokens of the Gitlab projects that you want registered to the Gitlab Runner. Obtain the token by selecting the project’s Settings > CI/CD and expand the Runners section.
  2. Update the sample-runner.properties file parameters according to your own environment. Refer to the gitlab-runner.yaml file for a description of these parameters. Rename the file if you like. You may also create an additional properties file for deploying into other environments.
  3. Run the deploy script to deploy the runner:
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 

<properties-file> is the name of the properties file.

<region> is the region where you want to deploy the stack.

<aws-profile> is the name of the CLI profile you set up in the prerequisites section.

<stack-name> is the name you chose for the CloudFormation stack.

For example:

./deploy-runner.sh sample-runner.properties us-east-1 dev amazon-ec2-gitlab-runner-demo

After the stack is deployed successfully, you will see the Gitlab Runner autoscaling group created in the EC2 console:

After the stack is deployed successfully, you will see the Gitlab Runner autoscaling group created in the EC2 console.

Under your Gitlab project Settings > CICD > Runners > Available specific runners, you will see the fully configured Gitlab Runner. The green circle indicates that the Gitlab Runner is ready for use.

Now go to your Gitlab project Settings  CICD  Runners  Available specific runners, you will see the fully configured Gitlab Runner. The green circle indicates that the Gitlab Runner is ready for use.

Updating the Gitlab Runner

There are times when you would want to update the Gitlab Runner. For example, updating the instance VolumeSize in order to resolve a disk space issue, or updating the AMI ID when a new AMI becomes available.

Utilizing the properties file and launch template makes it easy to update the Gitlab Runner. Simply update the Gitlab Runner configuration parameters in the properties file. Then, run the deploy script to udpate the Gitlab Runner stack. To ensure that the changes take effect immediately (e.g., existing instances are replaced by new instances with the new configuration), we utilize an AutoscalingRollingUpdate update policy to automatically update the instances in the autoscaling group.

    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: !Ref MinInstancesInService
        MaxBatchSize: !Ref MaxBatchSize
        PauseTime: "PT5M"
        WaitOnResourceSignals: true
        SuspendProcesses:
          - HealthCheck
          - ReplaceUnhealthy
          - AZRebalance
          - AlarmNotification
          - ScheduledActions

The policy tells CloudFormation that when changes are detected in the launch template, update the instances in batch size of MaxBatchSize, while keeping a number of instances (specified in MinInstanceInService) in service during the update.

Below is an example of updating the Gitlab Runner instance type.

To update the instance type of the runner instance:

  1. Update the “InstanceType” parameter in the properties file.

InstanceType=t2.medium

  1. Run the deploy-runner.sh script to update the CloudFormation stack:
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 

In the CloudFormation console, you will see that the launch template is updated first, then a rolling update is initiated. The instance type update requires a replacement of the original instance, so a temporary instance was launched and put in service. Then, the temporary instance was terminated when the new instance was launched successfully.

In the CloudFormation console, you will see that the launch template is updated first, then a rolling update is initiated. The instance type update requires a replacement of the original instance, so a temporary instance was launched and put in service. Then, the temporary instance was terminated when the new instance was launched successfully.

After the update is complete, you will see that on the Gitlab project’s console, the old Gitlab Runner, ez_5x8Rv, is replaced by the new Gitlab Runner, N1_UQ7yc.

After the update is complete, you will see that on the Gitlab project’s console, the old Gitlab Runner, ez_5x8Rv, is replaced by the new Gitlab Runner, N1_UQ7yc.

Terminate the Gitlab Runner

There are times when an autoscaling group instance must be terminated. For example, during an autoscaling scale-in event, or when the instance is being replaced by a new instance during a stack update, as seen previously. When terminating an instance, you must ensure that the Gitlab Runner finishes executing any running jobs before the instance is terminated, otherwise your environment could be left in an inconsistent state. Also, we want to ensure that the terminated Gitlab Runner is removed from the Gitlab project. We utilize an autoscaling lifecycle hook to achieve these goals.

The lifecycle hook works like this: A CloudWatch event rule actively listens for the EC2 Instance-terminate events. When one is detected, the event rule triggers a Lambda function. The Lambda function calls SSM Run Command to run a series of commands on the EC2 instances, via a SSM Document. The commands include stopping the Gitlab Runner gracefully when all running jobs are finished, de-registering the runner from Gitlab projects, and signaling the autoscaling group to terminate the instance.

The lifecycle hook works like this: A CloudWatch event rule actively listens for the EC2 Instance-terminate events. When one is detected, the event rule triggers a Lambda function. The Lambda function calls SSM Run Command to run a series of commands on the EC2 instances, via a SSM Document. The commands include stopping the Gitlab Runner gracefully when all running jobs are finished, de-registering the runner from Gitlab projects, and signaling the autoscaling group to terminate the instance.

There are also times when you want to terminate an instance manually. For example, when an instance is suspected to not be functioning properly. To terminate an instance from the Gitlab Runner autoscaling group, use the following command:

aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id="${InstanceId}" \
    --no-should-decrement-desired-capacity \
    --region="${region}" \
    --profile="${profile}"

The above command terminates the instance. The lifecycle hook ensures that the cleanup steps are conducted properly, and the autoscaling group launches another new instance to replace the old one.

Note that if you terminate the instance by using the “ec2 terminate-instance” command, then the autoscaling lifecycle hook actions will not be triggered.

Add/Remove Gitlab projects from the Gitlab Runner

As new projects are added to your enterprise, you may want to register them to the Gitlab Runner, so that those projects can utilize the Gitlab Runner to run pipelines. On the other hand, you would want to remove the Gitlab Runner from a project if it no longer wants to utilize the Gitlab Runner, or if it qualifies to utilize the Gitlab Runner. For example, if a project is no longer allowed to deploy to an environment configured by the Gitlab Runner. Our architecture offers a simple way to add and remove projects from the Gitlab Runner. To add new projects to the Gitlab Runner, update the RunnerRegistrationTokens parameter in the properties file, and then rerun the deploy script to update the Gitlab Runner stack.

To add new projects to the Gitlab Runner:

  1. Update the RunnerRegistrationTokens parameter in the properties file. For example:
RunnerRegistrationTokens=ps8RjBSruy1sdRdP2nZX,XbtZNv4yxysbYhqvjEkC
  1. Update the Gitlab Runner stack. This updates the SSM parameter which stores the tokens.
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 
  1. Relaunch the instances in the Gitlab Runner autoscaling group. The new instances will use the new RunnerRegistrationTokens value. Run the following command to relaunch the instances:
./cycle-runner.sh <runner-autoscaling-group-name> <region> <optional-aws-profile>

To remove projects from the Gitlab Runner, follow the steps described above, with just one difference. Instead of adding new tokens to the RunnerRegistrationTokens parameter, remove the token(s) of the project that you want to dissociate from the runner.

Autoscale the runner based on custom performance metrics

Each Gitlab Runner can be configured to handle a fixed number of concurrent jobs. Once this capacity is reached for every runner, any new jobs will be in a Queued/Waiting status until the current jobs complete, which would be a poor experience for our team. Setting the number of concurrent jobs too high on our runners would also result in a poor experience, because all jobs leverage the same CPU, memory, and storage in order to conduct the builds.

In this solution, we utilize a scheduled Lambda function that runs every minute in order to inspect the number of jobs running on every runner, leveraging the Prometheus Metrics endpoint that the runners expose. If we approach the concurrent build limit of the group, then we increase the Autoscaling Group size so that it can take on more work. As the number of concurrent jobs decreases, then the scheduled Lambda function will scale the Autoscaling Group back in an effort to minimize cost. The Scaling-Up operation will ignore the Autoscaling Group’s cooldown period, which will help ensure that our team is not waiting on a new instance, whereas the Scale-Down operation will obey the group’s cooldown period.

Here is the logical sequence diagram for the work:

Sequence diagram

For operational monitoring, the Lambda function also publishes custom CloudWatch Metrics for the count of active jobs, along with the target and actual capacities of the Autoscaling group. We can utilize this information to validate that the system is working properly and determine if we need to modify any of our autoscaling parameters.

For operational monitoring, the Lambda function also publishes custom CloudWatch Metrics for the count of active jobs, along with the target and actual capacities of the Autoscaling group. We can utilize this information to validate that the system is working properly and determine if we need to modify any of our autoscaling parameters.

Congratulations! You have completed the walkthrough. Take some time to review the resources you have deployed, and practice the various runner administrative tasks that we have covered in this post.

Troubleshooting

Problem: I deployed the CloudFormation template, but no runner is listed in my repository.

Possible Cause: Errors have been encountered during cfn-init, causing runner registration to fail. Connect to your runner EC2 instance, and check /var/log/cfn-*.log files.

Cleaning up

To avoid incurring future charges, delete every resource provisioned in this demo by deleting the CloudFormation stack created in the “Deploy the Gitlab Runner stack” section.

Conclusion

This article demonstrated how to utilize IaC to efficiently conduct various administrative tasks associated with a Gitlab Runner. We deployed Gitlab Runner consistently and quickly across multiple accounts. We utilized IaC to enforce guardrails and best practices, such as tracking Gitlab Runner configuration changes, terminating the Gitlab Runner gracefully, and autoscaling the Gitlab Runner to ensure best performance and minimum cost. We walked through the deploying, updating, autoscaling, and terminating of the Gitlab Runner. We also saw how easy it was to clean up the entire Gitlab Runner architecture by simply deleting a CloudFormation stack.

About the authors

Sylvia Qi

Sylvia is a Senior DevOps Architect focusing on architecting and automating DevOps processes, helping customers through their DevOps transformation journey. In her spare time, she enjoys biking, swimming, yoga, and photography.

Sebastian Carreras

Sebastian is a Senior Cloud Application Architect with AWS Professional Services. He leverages his breadth of experience to deliver bespoke solutions to satisfy the visions of his customer. In his free time, he really enjoys doing laundry. Really.

Using Amazon Aurora Global Database for Low Latency without Application Changes

Post Syndicated from Roneel Kumar original https://aws.amazon.com/blogs/architecture/using-amazon-aurora-global-database-for-low-latency-without-application-changes/

Deploying global applications has many challenges, especially when accessing a database to build custom pages for end users. One example is an application using AWS Lambda@Edge. Two main challenges include performance and availability.

This blog explains how you can optimally deploy a global application with fast response times and without application changes.

The Amazon Aurora Global Database enables a single database cluster to span multiple AWS Regions by asynchronously replicating your data within subsecond timing. This provides fast, low-latency local reads in each Region. It also enables disaster recovery from Region-wide outages using multi-Region writer failover. These capabilities minimize the recovery time objective (RTO) of cluster failure, thus reducing data loss during failure. You will then be able to achieve your recovery point objective (RPO).

However, there are some implementation challenges. Most applications are designed to connect to a single hostname with atomic, consistent, isolated, and durable (ACID) consistency. But Global Aurora clusters provide reader hostname endpoints in each Region. In the primary Region, there are two endpoints, one for writes, and one for reads. To achieve strong  data consistency, a global application requires the ability to:

  • Choose the optimal reader endpoints
  • Change writer endpoints on a database failover
  • Intelligently select the reader with the most up-to-date, freshest data

These capabilities typically require additional development.

The Heimdall Proxy coupled with Amazon Route 53 allows edge-based applications to access the Aurora Global Database seamlessly, without  application changes. Features include automated Read/Write split with ACID compliance and edge results caching.

Figure 1. Heimdall Proxy architecture

Figure 1. Heimdall Proxy architecture

The architecture in Figure 1 shows Aurora Global Databases primary Region in AP-SOUTHEAST-2, and secondary Regions in AP-SOUTH-1 and US-WEST-2. The Heimdall Proxy uses latency-based routing to determine the closest Reader Instance for read traffic, and redirects all write traffic to the Writer Instance. The Heimdall Configuration stores the Amazon Resource Name (ARN) of the global cluster. It automatically detects failover and cross-Region on the cluster, and directs traffic accordingly.

With an Aurora Global Database, there are two approaches to failover:

  • Managed planned failover. To relocate your primary database cluster to one of the secondary Regions in your Aurora global database, see Managed planned failovers with Amazon Aurora Global Database. With this feature, RPO is 0 (no data loss) and it synchronizes secondary DB clusters with the primary before making any other changes. RTO for this automated process is typically less than that of the manual failover.
  • Manual unplanned failover. To recover from an unplanned outage, you can manually perform a cross-Region failover to one of the secondaries in your Aurora Global Database. The RTO for this manual process depends on how quickly you can manually recover an Aurora global database from an unplanned outage. The RPO is typically measured in seconds, but this is dependent on the Aurora storage replication lag across the network at the time of the failure.

The Heimdall Proxy automatically detects Amazon Relational Database Service (RDS) / Amazon Aurora configuration changes based on the ARN of the Aurora Global cluster. Therefore, both managed planned and manual unplanned failovers are supported.

Solution benefits for global applications

Implementing the Heimdall Proxy has many benefits for global applications:

  1. An Aurora Global Database has a primary DB cluster in one Region and up to five secondary DB clusters in different Regions. But the Heimdall Proxy deployment does not have this limitation. This allows for a larger number of endpoints to be globally deployed. Combined with Amazon Route 53 latency-based routing, new connections have a shorter establishment time. They can use connection pooling to connect to the database, which reduces overall connection latency.
  2. SQL results are cached to the application for faster response times.
  3. The proxy intelligently routes non-cached queries. When safe to do so, the closest (lowest latency) reader will be used. When not safe to access the reader, the query will be routed to the global writer. Proxy nodes globally synchronize their state to ensure that volatile tables are locked to provide ACID compliance.

For more information on configuring the Heimdall Proxy and Amazon Route 53 for a global database, read the Heimdall Proxy for Aurora Global Database Solution Guide.

Download a free trial from the AWS Marketplace.

Resources:

Heimdall Data, based in the San Francisco Bay Area, is an AWS Advanced ISV partner. They have AWS Service Ready designations for Amazon RDS and Amazon Redshift. Heimdall Data offers a database proxy that offloads SQL improving database scale. Deployment does not require code changes.

New – Amazon EC2 Hpc6a Instance Optimized for High Performance Computing

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-ec2-hpc6a-instance-optimized-for-high-performance-computing/

High Performance Computing (HPC) allows scientists and engineers to solve complex, compute-intensive problems such as computational fluid dynamics (CFD), weather forecasting, and genomics. HPC applications typically require instances with high memory bandwidth, a low latency, high bandwidth network interconnect and access to a fast parallel file system.

Many customers have turned to AWS to run their HPC workloads. For example, Descartes Labs used AWS to power a TOP500 LINPACK benchmarking (the most powerful commercially available computer systems) run that delivered 1.93 PFLOPS, landing at position 136 on the TOP500 list in June 2019. That run made use of 41,472 cores on a cluster of Amazon EC2 C5 instances. Last year Descartes Labs ran the LINPACK benchmark again and placed within the top 40 on the June 2021 TOP500 list with 172,692 cores on a cluster of EC2 instances, which represents a 417 percent performance increase in just two years.

AWS enables you to increase the speed of research and reduce time-to-results by running HPC in the cloud and scaling to tens of thousands of parallel tasks that wouldn’t be practical in most on-premises environments. AWS helps you reduce costs by providing CPU, GPU, and FPGA instances on-demand, Elastic Fabric Adapter (EFA), an EC2 network device that improves throughput and scaling tightly coupled workloads, and AWS ParallelCluster, an open-source cluster management tool that makes it easy for you to deploy and manage HPC clusters on AWS.

Announcing EC2 Hpc6a Instances for HPC Workloads
Customers today across various industries use compute-optimized EFA-enabled Amazon EC2 instances (for example, C5n, R5n, M5n, and M5zn) to maximize the performance of a variety of HPC workloads, but as these workloads scale to tens of thousands of cores, cost-efficiency becomes increasingly important. We have found that customers are not only looking to optimize performance for their HPC workloads but want to optimize costs as well.

As we pre-announced in November 2021, Hpc6a, a new HPC-optimized EC2 instance, is generally available beginning today. This instance delivers 100 Gbps networking through EFA with 96 third-generation AMD EPYC™ processor (Milan) cores with 384 GB RAM, and offers up to 65 percent better price-performance over comparable x86-based compute-optimized instances.

You can launch Hpc6a instances today in the US East (Ohio) and GovCloud (US-West) Regions in On-Demand and Dedicated Hosting or as part of a Savings Plan. Here are the detailed specs:

Instance Name CPUs* RAM EFA Network Bandwidth Attached Storage
hpc6a.48xlarge 96 384 GiB Up to 100 Gbps EBS Only

*Hpc6a instances have simultaneous multi-threading disabled to optimize for HPC codes. This means that unlike other EC2 instances, Hpc6a vCPUs are physical cores, not threads.

To enable predictable thread performance and efficient scheduling for HPC workloads, simultaneous multi-threading is disabled. Thanks to AWS Nitro System, no cores are held back for the hypervisor, making all cores available to your code.

Hpc6a instances introduce a number of targeted features to deliver cost and performance optimizations for customers running tightly coupled HPC workloads that rely on high levels of inter-instance communications. These instances enable EFA networking bandwidth of 100 Gbps and are designed to efficiently scale large tightly coupled clusters within a single Availability Zone.

We hear from many of our engineering customers, such as those in the automotive sector, that they want to reduce the need for physical testing and move towards an increasingly virtual simulation-based product design process faster at a lower cost.

According to our benchmarking results for Siemens Simcenter STAR-CCM+ automotive CFD simulation, when the Hpc6a scales up to 400 nodes (approximately 40,000 cores), with the help of EFA networking, it is able to maintain approximately 100 percent scaling efficiency. Hpc6a instance shows 70 percent lower cost compared to c5n, meaning companies can deliver new designs faster and at a lower cost when using Hpc6a instances. This means companies can deliver new designs faster and at a lower cost when using Hpc6a instances.

You can use the Hpc6a instance with AMD EPYC third-generation (Milan) processors to run your largest and most complex HPC simulations on EC2 and optimize for cost and performance. Customers can also use the new Hpc6a instances with AWS Batch and AWS ParallelCluster to simplify workload submission and cluster creation.

To learn more, visit our Hpc6a instance page and get in touch with our HPC team, AWS re:Post for EC2, or through your usual AWS Support contacts.

Channy

Efficiently Scaling kOps clusters with Amazon EC2 Spot Instances

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/efficiently-scaling-kops-clusters-with-amazon-ec2-spot-instances/

This post is written by Carlos Manzanedo Rueda, WW SA Leader for EC2 Spot, and Brandon Wagner, Senior Software Development Engineer for EC2.

This post focuses on how you can leverage recently released tools to optimize your usage of Amazon EC2 Spot Instances on Kubernetes Operations (kOps) clusters. Spot Instances let you utilize unused capacity in the AWS cloud for up to 90% off compared to On-Demand prices, and they are a great fit for fault-tolerant, containerized applications. kOps is an open source project providing a cohesive toolset for provisioning, operating, and deleting Kubernetes clusters in the cloud.

Even with customers such as Snap Inc., Babylon Health, and Fidelity Investments telling us how Amazon Elastic Kubernetes Service (EKS) is essential for running their containerized workloads, we appreciate that there are scenarios where using Amazon EC2 instances and kOps are a viable alternative. At AWS, we understand “one size does not fit all.” While we encourage Kubernetes users to contribute their feedback to the AWS container roadmap so that we can improve our services, we also would like to reduce heavy lifting and simplify Spot best practices integration in kOps clusters.

To simplify the integration of Spot Instances in kOps clusters, in January of 2021 we introduced a new kops toolbox command: kops toolbox instance-selector. The utility is distributed as part of the standard kOps distribution. Moreover, it simplifies the creation of kOps Instance Groups by configuring them with full adherence to Spot Instances best practices.

Handling Spot interruption notifications in Kubernetes

Let’s quickly recap Spot best practices. Spot Instances perform exactly like any other EC2 Instances, except that in exchange for their discounted price, they can be interrupted with a two-minute warning when EC2 must reclaim capacity. Applications running on Spot can typically recover from transient interruptions by simply starting a new instance. Spot best practices involve measures such as diversifying into as many Spot capacity pools as possible, choosing the right Spot allocation strategy, and utilizing Spot integrated services. These handle the Spot Instances lifecycles for you. This blog post on handling Spot interruptions dives deeper into AWS’s EC2 Spot best practices.

In Kubernetes, to handle spot termination and rebalance recommendation events (both explained in this blog post on proactively managing Spot Instance lifecycle), we utilize the AWS open-source project AWS Node Termination Handler. We will be deploying the Node Termination Handler as a kOps managed addon, which simplifies its setup and configuration.

The Node Termination Handler ensures that the Kubernetes control plane responds appropriately to events that can make EC2 instances unavailable. It can be operated in two different modes: Instance Metadata Service (IMDS), deployed as a DaemonSet, or Queue Processor, deployed as a Deployment Controller. We recommend running it in Queue Processor mode. The Queue Processor controller continuously monitors an Amazon Simple Queue Service (SQS) queue for events received from Amazon EventBridge. This can lead to node termination in your cluster. When one of these events is received, the Node Termination Handler notifies the Kubernetes control plane to cordon and drain the node that is about to be interrupted. Then, the kubelet sends a SIGTERM signal to the Pods and containers running on the node. This lets your application proceed with a graceful termination – one of the recommended best practices of a Twelve-Factor App.

The kOps managed addon will let you configure the Node Termination Handler within your kOps cluster spec and, more importantly, manage provisioning the necessary infrastructure for you.

To deploy the AWS Node Termination Handler, we start by editing our cluster spec:

kops edit cluster --name ${KOPS_CLUSTER_NAME}

We append the nodeTerminationHandler configuration to the spec node:

spec:
  nodeTerminationHandler:
    enabled: true
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"

Finally, we deploy the changes made to our cluster configuration:

kops update cluster --name ${KOPS_CLUSTER_NAME} –-state {KOPS_STATE_STORE} --yes --admin

${KOPS_CLUSTER_NAME} refers to the environment variable containing the cluster name, and ${KOPS_STATE_STORE} indicates the Amazon Simple Storage Service (S3) bucket – or kOps State Store – where kOps configuration is stored.

To check that your Node Termination Handler deployment was successful, you can execute:

kops get deployment aws-node-termination-handler -n kube-system

Instance Flexibility and Diversification

Diversification and selection of multiple instances types is essential to acquire and maintain Spot capacity, as well as to successfully replace interrupted instances with others from different pools. When running kOps on AWS, this is implemented by utilizing Amazon EC2 Auto Scaling. Amazon Auto Scaling group’s capacity-optimized allocation strategy ensures that Spot capacity is provisioned from the optimal pools, thereby reducing the chances of Spot terminations.

Simplifying adoption of Spot Best practices on kOps

Before the kops toolbox instance-selector, you would have to setup Spot best practices on kOps manually. This involved writing a stub file following the InstanceGroup specification and examples, and then implementing every best practice, including finding every pool that qualifies for our workload.

The new functionality in kops toolbox instance-selector simplifies InstanceGroup creation by moving the focus of kOps users and administrators from this manual configuration over to simply selecting the vCPUs and Memory requirements for their application (or a base instance type), and then letting kops toolbox instance-selector define the right configuration. Behind the scenes, it utilizes a library allowing it to plug into the feature-set of Amazon EC2 instance selector. At its core, ec2 instance selector helps you select compatible instance types for your application to run on. Utilize ec2 instance selector CLI or library when automating your configurations. In the case of kOps, the integration already comes in the kops toolbox.

For example, let’s say your cluster runs stateless, fault tolerant applications that are CPU/Memory bound and have a ratio of vCPU to Memory requiring at least 1vCPU : 4GB of RAM. You can run the following command in order to acquire cluster spot capacity:

kops toolbox instance-selector "spot-group-" \
  --usage-class spot --flexible --cluster-autoscaler \
  --vcpus-to-memory-ratio="1:4" \
  --ig-count 2

Let’s focus first on the command, and later cover its output. You can get a list of parameters and default values by running: kops toolbox instance-selector –help. A few default parameters weren’t passed in the command above, but they will be set to sane defaults, such as the maximum and minimum number of instances in the Instance Group. The parameter –flexible refers to our request to provide a group of flexible instance types spanning multiple generations.

Once you’ve defined the InstanceGroups, start them up by using the command:

kops update cluster \
–state=${KOPS_STATE_STORE} \
–name=${KOPS_CLUSTER_NAME} \
–yes –admin

The two commands above define and create a request for spot capacity from a flexible and diversified pool set, which meet the criteria to provide at least 4GB of RAM for each vCPU. The command creates not just one, but two node groups named “spot-group-1” and “spot-group-2” (–ig-count 2).

Now, let’s check the contents of the configuration file generated by kops toolbox instance-selector. To preview a configuration without making changes, add –dry-run –output yaml.

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-08-11T10:22:16Z"
  labels:
    kops.k8s.io/cluster: spot-kops-cluster.k8s.local
  name: spot-group-1
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "1"
    k8s.io/cluster-autoscaler/spot-kops-cluster.k8s.local: "1"
    kops.k8s.io/instance-selector: "1"
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716
  machineType: m3.xlarge
  maxSize: 15
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - m3.xlarge
    - m4.xlarge
    - m5.xlarge
    - m5a.xlarge
    - t2.xlarge
    - t3.xlarge
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: spot-group-1
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
...

The configuration above lists one of the groups created by kops toolbox instance-selector in the previous example. The second group will have a very similar make-up and format, except that it will refer to instances such as: r3.xlarge, r4.xlarge, r5.xlarge, and r5a.xlarge in the mixedInstancesPolicy section. By defining the parameter –usage-class to Spot, the configuration created by kops toolbox instance-selector will add the tags identifying this Auto Scaling group as a Spot group. When the nodes are initialized, kOps controller will identify the nodes as Spot and add the label node-role.kubernetes.io/spot-worker=true. Therefore, at a later stage, we can apply placement logic to our cluster by using nodeSelector and affinity. The configuration above adheres to the definition of kOps support for mixed Instance Groups in AWS, and adds all of the right cloudLabels in order to integrate and implement not only with Spot best practices, but also with Cluster Autoscaler Auto-Discovery configuration best practices.

Kubernetes Cluster Autoscaler is a Kubernetes controller that dynamically adjusts the cluster size. According to a 2020 survey by Cloud Native Computing Foundation (CNCF), 70% of Kubernetes workloads plan to autoscale their stateless applications. Dynamically scaling applications and clusters is also a great practice for optimizing your system costs in situations where capacity is unnecessary, as well as for scaling out accordingly in order to meet business demands. If there are Pods that can’t be scheduled due to insufficient resources, then Cluster Autoscaler will issue a Scale-out action. When there are nodes in the cluster that have been under-utilized for a configurable period of time, Cluster Autoscaler will Scale-in the cluster, and even down-scale to 0 instances when applications don’t need to be run.

On Scale-out operations, Cluster Autoscaler evaluates a set of node groups. When Cluster Autoscaler runs on AWS, node groups are implemented by using Auto Scaling groups (referring to the same instance group as a kOps Instance Group). Therefore, to calculate the number of nodes to scale-out, Cluster Autoscaler assumes that every instance in a node group has the same number of vCPUs and memory size.

By creating two node groups, you apply two diversification levels. You diversify within each node group by using an Auto Scaling group with Mixed Instance Policies and capacity-optimized allocation strategy. Then, to increase the pool range you can leverage, you add more than one node group, while still adhering to the best practices required by Cluster Autoscaler.

While we’ve been focusing on Spot Instances, the parameter –usage-class can be utilized to get OnDemand instances instead of Spot. In the next example, let’s say we would like to get On-Demand capacity in order to train complex deep learning models that will take hours to run. To train our models, we need instances that have at least one GPU with 16GB of RAM on instances that have at least 32GB Ram and 8 vCPUs.

kops toolbox instance-selector "ondemand-gpu-group" \
  --gpus-min 1 --gpu-memory-total-min 16gb --memory-min 32gb --vcpus 8\
  --node-count-max 4 --node-count-min 4 --cpu-architecture amd64

The command above, followed by kops update cluster –state=${KOPS_STATE_STORE} –name=${KOPS_CLUSTER_NAME} –yes can be utilized to produce a configuration and create a nodegroup with the right requirements. This could be created at the start of the training procedure, and then – once the training is done and the capacity is no longer needed – you could automate the nodegroup removal with the following command:

kops delete instancegroup ondemand-gpu-group --name ${KOPS_CLUSTER_NAME} –yes

Conclusions

We believe the best way to run Kubernetes on AWS is by using Amazon EKS. However, scenarios may exist where kOps is utilized in AWS. By using the kOps managed add-on to install aws-node-termination-handler and kops toolbox instance-selector, it is easier than ever to apply Spot best practices to Kubernetes workloads on kOps, and cost-optimize fault-tolerant, stateless applications. These tools let kOps workloads gracefully terminate applications, as well as proactively handle the replacement of instances that are at an elevated risk of termination. kops toolbox instance-selector leverages Amazon ec2-instance-selector in order to simplify the creation of Instance Group configurations adhering to Spot Instances best practices, implementing instance type flexibility, and utilizing capacity-optimized allocation strategy.

By adhering to these best practices to reduce the frequency of Spot interruptions, we will optimize not only the cost, but also our Spot Instances selection. This will enable us to acquire capacity at a massive scale if necessary.

To start using the tools we have described, follow along this step-by-step tutorial. Also, head over to the kops toolbox documentation to learn more about the ways in which you can use it.

Deep dive into NitroTPM and UEFI Secure Boot support in Amazon EC2

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/deep-dive-into-nitrotpm-and-uefi-secure-boot-support-in-amazon-ec2/

Contributed by Samartha Chandrashekar, Principal Product Manager Amazon EC2

At re:Invent 2021, we announced NitroTPM, a Trusted Platform Module (TPM) 2.0 and Unified Extensible Firmware Interface (UEFI) Secure Boot support in Amazon EC2. In this blog post, we’ll share additional details on how these capabilities can help further raise the security bar of EC2 deployments.

A TPM is a security device to gather and attest system state, store and generate cryptographic data, and prove platform identity. Although TPMs are traditionally discrete chips or firmware modules, their adaptation on AWS as NitroTPM preserves their security properties without affecting the agility and scalability of EC2. NitroTPM makes it possible to use TPM-dependent applications and Operating System (OS) capabilities in EC2 instances. It conforms to the TPM 2.0 specification, which makes it easy to migrate existing on-premises workloads that use TPM functionalities to EC2.

Unified Extensible Firmware Interface (UEFI) Secure Boot is a feature of UEFI that builds on EC2’s long-standing secure boot process and provides additional defense-in-depth that helps you secure software from threats that persist across reboots. It ensures that EC2 instances run authentic software by verifying the digital signature of all boot components, and halts the boot process if signature verification fails. When used with UEFI Secure Boot, NitroTPM can verify the integrity of software that boots and runs in the EC2 instance. It can measure instance properties and components as evidence that unaltered software in the correct order was used during boot. Features such as “Measured Boot” in Windows, Linux Unified Key Setup (LUKS) and dm-verity in popular Linux distributions can use NitroTPM to further secure OS launches from malware with administrative that attempt to persist across reboots.

NitroTPM derives its root-of-trust from the Nitro Security Chip and performs the same functions as a physical/discrete TPM. Similar to discrete TPMs, an immutable private and public Endorsement Key (EK) is set up inside the NitroTPM by AWS during instance creation. NitroTPM can serve as a “root-of-trust” to verify the provenance of software in the instance (e.g., NitroTPM’s EKCert as the basis for SSL certificates). Sensitive information protected by NitroTPM is made available only if the OS has booted correctly (i.e., boot measurements match expected values). If the system is tampered, keys are not released since the TPM state is different, thereby ensuring protection from malware attempting to hijack the boot process. NitroTPM can protect volume encryption keys used by full-disk encryption utilities (such as dm-crypt and BitLocker) or private keys for certificates.

NitroTPM can be used for attestation, a process to demonstrate that an EC2 instance meets pre-defined criteria, thereby allowing you to gain confidence in its integrity. It can be used to authenticate an instance requesting access to a resource (such as a service or a database) to be contingent on its health state (e.g., patching level, presence of mandated agents, etc.). For example, a private key can be “sealed” to a list of measurements of specific programs allowed to “unseal”. This makes it suited for use cases such as digital rights management to gate LDAP login, and database access on attestation. Access to AWS Key Management Service (KMS) keys to encrypt/decrypt data accessed by the instance can be made to require affirmative attestation of instance health. Anti-malware software (e.g., Windows Defender) can initiate remediation actions if attestation fails.

NitroTPM uses Platform Configuration Registers (PCR) to store system measurements. These do not change until the next boot of the instance. PCR measurements are computed during the boot process before malware can modify system state or tamper with the measuring process. These values are compared with pre-calculated known-good values, and secrets protected by NitroTPM are released only if the sequences match. PCRs are recalculated after each reboot, which ensures protection against malware aiming to hijack the boot process or persist across reboots. For example, if malware overwrites part of the kernel, measurements change, and disk decryption keys sealed to NitroTPM are not unsealed. Trust decisions can also be made based on additional criteria such as boot integrity, patching level, etc.

The workflow below shows how UEFI Secure Boot and NitroTPM work to ensure system integrity during OS startup.

workflow

To get started, you’ll need to register an Amazon Machine Image (AMI) of an Operating System that supports TPM 2.0 and UEFI Secure Boot using the register-image primitive via the CLI, API, or console. Alternatively, you can use pre-configured AMIs from AWS for both Windows and Linux to launch EC2 instances with TPM and Secure Boot. The screenshot below shows a Windows Server 2019 instance on EC2 launched with NitroTPM using its inbox TPM 2.0 drivers to recognize a TPM device.

NitroTPM and UEFI Secure Boot enables you to further raise the bar in running their workloads in a secure and trustworthy manner. We’re excited for you to try out NitroTPM when it becomes publicly available in 2022. Contact [email protected] for additional information.

Creating a Multi-Region Application with AWS Services – Part 1, Compute and Security

Post Syndicated from Joe Chapman original https://aws.amazon.com/blogs/architecture/creating-a-multi-region-application-with-aws-services-part-1-compute-and-security/

Building a multi-Region application requires lots of preparation and work. Many AWS services have features to help you build and manage a multi-Region architecture, but identifying those capabilities across 200+ services can be overwhelming.

In this 3-part blog series, we’ll explore AWS services with features to assist you in building multi-Region applications. In Part 1, we’ll build a foundation with AWS security, networking, and compute services. In Part 2, we’ll add in data and replication strategies. Finally, in Part 3, we’ll look at the application and management layers.

Considerations before getting started

AWS Regions are built with multiple isolated and physically separate Availability Zones (AZs). This approach allows you to create highly available Well-Architected workloads that span AZs to achieve greater fault tolerance. There are three general reasons that you may need to expand beyond a single Region:

  • Expansion to a global audience as an application grows and its user base becomes more geographically dispersed, there can be a need to reduce latencies for different parts of the world.
  • Reducing Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) as part of disaster recovery (DR) plan.
  • Local laws and regulations may have strict data residency and privacy requirements that must be followed.

Ensuring security, identity, and compliance

Creating a security foundation starts with proper authentication, authorization, and accounting to implement the principle of least privilege. AWS Identity and Access Management (IAM) operates in a global context by default. With IAM, you specify who can access which AWS resources and under what conditions. For workloads that use directory services, the AWS Directory Service for Microsoft Active Directory Enterprise Edition can be set up to automatically replicate directory data across Regions. This allows applications to reduce lookup latencies by using the closest directory and creates durability by spanning multiple Regions.

Applications that need to securely store, rotate, and audit secrets, such as database passwords, should use AWS Secrets Manager. It encrypts secrets with AWS Key Management Service (AWS KMS) keys and can replicate secrets to secondary Regions to ensure applications are able to obtain a secret in the closest Region.

Encrypt everything all the time

AWS KMS can be used to encrypt data at rest, and is used extensively for encryption across AWS services. By default, keys are confined to a single Region. AWS KMS multi-Region keys can be created to replicate keys to a second Region, which eliminates the need to decrypt and re-encrypt data with a different key in each Region.

AWS CloudTrail logs user activity and API usage. Logs are created in each Region, but they can be centralized from multiple Regions and multiple accounts into a single Amazon Simple Storage Service (Amazon S3) bucket. As a best practice, these logs should be aggregated to an account that is only accessible to required security personnel to prevent misuse.

As your application expands to new Regions, AWS Security Hub can aggregate and link findings to a single Region to create a centralized view across accounts and Regions. These findings are continuously synced between Regions to keep you updated on global findings.

We put these features together in Figure 1.

Multi-Region security, identity, and compliance services

Figure 1. Multi-Region security, identity, and compliance services

Building a global network

For resources launched into virtual networks in different Regions, Amazon Virtual Private Cloud (Amazon VPC) allows private routing between Regions and accounts with VPC peering. These resources can communicate using private IP addresses and do not require an internet gateway, VPN, or separate network appliances. This works well for smaller networks that only require a few peering connections. However, as the number of peered connections increases, the mesh of peered connections can become difficult to manage and troubleshoot.

AWS Transit Gateway can help reduce these difficulties by creating a central transitive hub to act as a cloud router. A Transit Gateway’s routing capabilities can expand to additional Regions with Transit Gateway inter-Region peering to create a globally distributed private network.

Building a reliable, cost-effective way to route users to distributed Internet applications requires highly available and scalable Domain Name System (DNS) records. Amazon Route 53 does exactly that.

Route 53 routing policies can route traffic to a record with the lowest latency, or automatically fail over a record. If a larger failure occurs, the Route 53 Application Recovery Controller can simplify the monitoring and failover process for application failures across Regions, AZs, and on-premises.

Amazon CloudFront’s content delivery network is truly global, built across 300+ points of presence (PoP) spread throughout the world. Applications that have multiple possible origins, such as across Regions, can use CloudFront origin failover to automatically fail over the origin. CloudFront’s capabilities expand beyond serving content, with the ability to run compute at the edge. CloudFront functions make it easy to run lightweight JavaScript functions, and AWS Lambda@Edge makes it easy to run Node.js and Python functions across these 300+ PoPs.

AWS Global Accelerator uses the AWS global network infrastructure to provide two static anycast IPs for your application. It automatically routes traffic to the closest Region deployment, and if a failure is detected it will automatically redirect traffic to a healthy endpoint within seconds.

Figure 2 brings these features together to create a global network across two Regions.

AWS VPC connectivity and content delivery

Figure 2. AWS VPC connectivity and content delivery

Building the compute layer

An Amazon Elastic Compute Cloud (Amazon EC2) instance is based on an Amazon Machine Image (AMI). An AMI specifies instance configurations such as the instance’s storage, launch permissions, and device mappings. When a new standard image needs to be created, EC2 Image Builder can be used to streamline copying AMIs to selected Regions.

Although EC2 instances and their associated Amazon Elastic Block Store (Amazon EBS) volumes live in a single AZ, Amazon Data Lifecycle Manager can automate the process of taking and copying EBS snapshots across Regions. This can enhance DR strategies by providing a relatively easy cold backup-and-restore option for EBS volumes.

As an architecture expands into multiple Regions, it can become difficult to track where instances are provisioned. Amazon EC2 Global View helps solve this by providing a centralized dashboard to see Amazon EC2 resources such as instances, VPCs, subnets, security groups, and volumes in all active Regions.

Microservice-based applications that use containers benefit from quicker start-up times. Amazon Elastic Container Registry (Amazon ECR) can help ensure this happens consistently across Regions with private image replication at the registry level. An ECR private registry can be configured for either cross-Region or cross-account replication to ensure your images are ready in secondary Regions when needed.

We bring these compute layer features together in Figure 3.

AMI and EBS snapshot copy across Regions

Figure 3. AMI and EBS snapshot copy across Regions

Summary

It’s important to create a solid foundation when architecting a multi-Region application. These foundations pave the way for you to move fast in a secure, reliable, and elastic way as you build out your application. In this post, we covered options across AWS security, networking, and compute services that have built-in functionality to take away some of the undifferentiated heavy lifting. We’ll cover data, application, and management services in future posts.

Ready to get started? We’ve chosen some AWS Solutions and AWS Blogs to help you!

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Use a City Planning Analogy to Visualize and Create your Cloud Architecture

Post Syndicated from Marwan Al Shawi original https://aws.amazon.com/blogs/architecture/use-a-city-planning-analogy-to-visualize-and-create-your-cloud-architecture/

If you are new to creating cloud architectures, you might find it a daunting undertaking. However, there is an approach that can help you define a cloud architecture pattern by using a similar construct. In this blog post, I will show you how to envision your cloud architecture using this structured and simplified approach.

Such an approach helps you to envision the architecture as a whole. You can then create reusable architecture patterns that can be used for scenarios with similar requirements. It also will help you define the more detailed technological requirements and interdependencies of the different architecture components.

First, I will briefly define what is meant by an architecture pattern and an architecture component.

Architecture pattern and components

An architecture pattern can be defined as a mechanism used to structure multiple functional components of a software or a technology solution to address predefined requirements. It can be characterized by use case and requirements, and should be tested and reusable whenever possible.

Architecture patterns can be composed of three main elements: the architecture components, the specific functions or capabilities of each component, and the connectivity among those components.

A component in the context of a technology solution architecture is a building block. Modular architecture is composed of a collection of these building blocks.

To think modularly, you must look at the overall technology solution. What is its intended function as a complete system? Then, break it down into smaller parts or components. Think about how each component communicates with others. Identify and define each block or component and its specific roles and function. Consider the technical operational responsibilities each is expected to deliver.

Cloud architecture patterns and the city planning analogy

Let’s assume a content marketing company wants to provide marketing analytics to its partners. It proposes a SaaS solution, by offering an analytics dashboard on Amazon Web Services (AWS). This company may offer the same solution in other locations in the future.

How would you create a reusable architecture pattern for such a solution? To simplify the concept of a component and the architecture pattern, let’s use city planning as a frame of reference.

Subarchitectures or components

A city can be imagined as consisting of three organizing contexts or components:

  1. Overall City Architecture (the big picture)
  2. District Architecture
  3. Building Architecture

Let’s define each of these components or subarchitectures, and see how they correlate to an enterprise cloud architecture.

I. City Architecture consists of the city structures and the integrations of services required by the population, see Figure 1.

Figure 1. Oversimplified city layout

Figure 1. Oversimplified city layout

The overall anticipated capacity within a certain period must be calculatedfor roads, sewage, water, electricity grids, and overall city layout. Typically, this structure should be built from the intended purpose or vision of the city. This can be the type of services it will offer, and the function of each district.

Think of City Architecture as the overall cloud architecture for your enterprise. Include the anticipated capacity, the layout (single Region, multi-Region), type, and number of Amazon Virtual Private Cloud (VPC)s. Decide how you will connect and integrate all these different architecture components.

The initial workflow that can be used to define the high-level architecture pattern layout of the SaaS solution example is analogous to the overall city architecture. We can define its three primary elements: architecture components, specific functions of each component, and the connectivity among those components.

  1. Production environment. The front and backend of your application. It provides the marketing data analytics dashboard.
  2. Testing and development environment. A replica of, but isolated from the Production app. Users’ traffic doesn’t pass through security inspection layer.
  3. Security layer. Provides perimeter security inspection. Users’ traffic passes through security inspection layer.

Translating this workflow into an AWS architecture, Figure 2 shows the analogous structure.

  • Single AWS Region (to be offered in a specific geographical area)
  • Amazon VPC to host the production application
  • Amazon VPC to host the test/dev application
  • Separate VPC (or a layer within a VPC) to provide security services for perimeter security inspection
  • Customer’s connectivity (for example, over public internet, or VPN)
  • AWS Transit Gateway (TGW) to connect and isolate the different components (VPCs and VPN)
Figure 2. Architecture pattern (high-level layout)

Figure 2. Architecture pattern (high-level layout)

Domain-driven design

At this stage, you may also consider a domain-driven design (DDD). This is an approach to software development that centers on a domain model. With your DDD, you can break the solution into different bounded contexts. You can translate the business functions/capabilities into logical domains, and then define how they communicate.

Let’s use the same SaaS example and further analyze the requirements of the solution with the DDD approach in mind. The SaaS solution is offered based on two types of industries: regulated with specific security compliance, and non-regulated. By translating this into logical domains, we can optimize the design to offer a more modular architecture. This will minimize the blast radius of the solution, as illustrated in Figure 3. Watch How AWS Minimizes the Blast Radius of Failures.

Figure 3. DDD-based architecture pattern (high-level layout)

Figure 3. DDD-based architecture pattern (high-level layout)

Now let’s think of governmental boundaries within a city and among its districts. This can be analogous to AWS accounts structures and the trust boundaries among them. By applying this to the example preceding, the VPC with the security compliance requirements can be placed in a separate AWS account. Read Design principles for organizing your AWS accounts.

II. District Architecture consists of the structures and integrations required within a district to manage its buildings, see Figure 4.

Figure 4. City structure with districts

Figure 4. City structure with districts

It illustrates how to connect/integrate back to the city-wide architecture. It should consider the overall anticipated capacity within each district.

For instance, a district can be designed based on the type of function/service it provides, such as residential district, leisure district, or business district.

Mapping this to cloud architecture, you can envision it as the more specific functions/services you are expecting from a certain block, component, or domain. Your architecture can be within one or multiple VPCs, as shown in Figure 5. The structure of a domain or block can vary by number of Availability Zones and VPCs, type of external access, compliance requirements, and the hosted application requirements. Each of these blocks serves a different function and requires different specifications. However, they all need to integrate back to the overall cloud and network architecture to provide a cohesive design.

The architect must define and specify clearly the communication model among the architecture components. You may further break the application architecture at the module level into microservices using the DDD approach. An example is the use of Micro-frontend Architectures on AWS.

Figure 5. Architecture module structure

Figure 5. Architecture module structure

III. Building Architecture refers to the buildings’ structures and standards required to deliver the specific properties/services within a district. It also must integrate back with the district architecture.

To apply this to your architecture, envision the specialized functions/capabilities you are expecting from your application within a module (subcomponents). What are the requirements needed for the application tiers? In this example, let’s assume that the VPC without security compliance requirements will use a frontend web tier on Amazon EC2. Its backend database will be Amazon Relational Database Service (RDS).

Each of these subcomponents must integrate with other components and modules, as well as to the public internet. For example, an AWS Application Load Balancer could handle connections requests from external users, and AWS Web Application Firewall (AWS WAF) used as the perimeter security layer. AWS Transit Gateway could connect to other modules (VPCs). NAT gateways could provide connectivity to the internet for the internal systems in a VPC (shown in Figure 6.)

Figure 6. Architecture module and its subcomponents structure

Figure 6. Architecture module and its subcomponents structure

Conclusion

The vision and goal of a city architecture can set the basis for districts’ architectures. In turn, the district architecture sets the basis of the building architecture within a district. Similarly, the targeted enterprise cloud architecture goal should set the key requirements of the building blocks (or functional components) of the architecture.

Each architecture block sets the requirements of the subcomponents. They collectively construct a system or module of a system, as illustrated in Figure 7.

Figure 7. Structure of cloud architecture requirements and interdependencies

Figure 7. Structure of cloud architecture requirements and interdependencies

As a next step, assess your architecture from both a scale and reliability perspective. Designing for scale alone is not enough. Reliable scalability should be always the targeted architectural attribute. Read Architecting for Reliable Scalability.

Use New Amazon EC2 M1 Mac Instances to Build & Test Apps for iPhone, iPad, Mac, Apple Watch, and Apple TV

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/use-amazon-ec2-m1-mac-instances-to-build-test-macos-ios-ipados-tvos-and-watchos-apps/

Last year at AWS re:Invent, Jeff Barr wrote about the exciting availability of Amazon Elastic Compute Cloud (Amazon EC2) Mac instances. Today, we’re announcing the preview of a new EC2 M1 Mac instance.

The introduction of EC2 Mac instances brought the flexibility, scalability, and cost benefits of AWS to all Apple developers. EC2 Mac instances are dedicated Mac mini computers attached through Thunderbolt to the AWS Nitro System, which lets the Mac mini appear and behave like another EC2 instance. It connects to your Amazon Virtual Private Cloud (VPC), boot from Amazon Elastic Block Store (EBS) volumes, and leverage EBS snapshots, security groups and other AWS services. EC2 Mac instances let you scale your build and test fleets of Macs, paying as you go. There is no hypervisor involved, and you get full bare metal performance of the underlying Mac mini. An EC2 dedicated host reserves a Mac mini for your usage.

The availability (in preview) of EC2 M1 Mac instances lets you access machines built around the Apple-designed M1 System on Chip (SoC). If you are a Mac developer and re-architecting your apps to natively support Macs with Apple silicon, you may now build and test your apps and take advantage of all the benefits of AWS. Developers building for iPhone, iPad, Apple Watch, and Apple TV will also benefit from faster builds. EC2 M1 Mac instances deliver up to 60% better price performance over the x86-based EC2 Mac instances for iPhone and Mac app build workloads.

EC2 M1 Mac instances are powered by a combination of two hardware components:

  • The Mac mini, featuring M1 SoC with 8 CPU cores, 8 GPU cores, 16 GiB of memory, and a 16 core Apple Neural Engine.
  • The AWS Nitro System, providing up to 10 Gbps of VPC network bandwidth and 8 Gbps of EBS storage bandwidth through a high-speed Thunderbolt connection.

How to Get Started
As I explained previously, when using EC2 Mac instances, there is no virtual machine involved. These are running on bare metal servers, each hosting a Mac mini. The first step, therefore, involves grabbing a dedicated server. I open the AWS Management Console, navigate to the Amazon EC2 section, then I select Dedicated Hosts. I select Allocate Dedicated Host to allocate a server to my AWS account.

EC2 Mac2 Instances - Dedicated Hosts

Alternatively, I may use the AWS Command Line Interface (CLI).

➜  ~ aws ec2 allocate-hosts                  \
         --instance-type mac2.metal          \
         --availability-zone us-east-2b      \
         --quantity 1 
{
    "HostIds": [
        "h-0fxxxxxxx90"
    ]
}

Once the host is allocated, I start an EC2 instance on it. The procedure is no different from starting any EC2 instance type. I just have to ensure I select a macOS AMI version that suits my requirements. I select the mac2.metal instance type and select host Tenancy and the dedicated Host I just created.

EC2 Dedicated TenancyAlternatively, I may use the CLI.

➜ ~ aws ec2 run-instances                                     \
	    --instance-type mac2.metal                             \
        --key-name my_key                                      \
        --placement HostId=h-0fxxxxxxx90                       \
        --security-group-ids sg-01000000000000032              \
        --image-id AWS_OR_YOUR_AMI_ID
{
    "Groups": [],
    "Instances": [
        {
            "AmiLaunchIndex": 0,
            "ImageId": "ami-01xxxxbd",
            "InstanceId": "i-08xxxxx5c",
            "InstanceType": "mac2.metal",
            "KeyName": "my_key",
            "LaunchTime": "2021-11-08T16:47:39+00:00",
            "Monitoring": {
                "State": "disabled"
            },
... redacted for brevity ....

When you use EC2 Mac instances for the first time, you’re likely to ask questions such as, “How do I connect through Apple Remote Desktop?” or “How do I increase the size of the APFS file system on the EBS volume?” The EC2 Mac documentation covers the answers for you and provides examples of commands to run on macOS to perform these common tasks.

I use SSH to connect to the newly launched instance as usual.

EC2 Mac M1 Instance uname -a

I may enable Apple Remote Desktop and start a VNC session to the EC2 instance. The EC2 Mac instance documentation page has the details.

mac2 GUI VNC

Availability and Pricing
EC2 M1 Mac instances are now available in preview in US East (N. Virginia) and US West (Oregon), with other AWS Regions coming at launch.

Pricing metrics are similar to the previous generation of EC2 Mac instances. You are charged per hour of reservation of the dedicated host, not for the time the instance is running, and there is a minimum charge of 24 hours for reserving a dedicated host.

In the two preview Regions, the on-demand price is $0.6498 per hour. You can save up to 42 percent over the on-demand price with Savings Plans. Check our Dedicated Host on-demand pricing page, as well as the Savings Plans page to learn the details.

You can sign up for the preview of EC2 Mac M1 instances today!

— seb

Announcing winners of the AWS Graviton Challenge Contest and Hackathon

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/announcing-winners-of-the-aws-graviton-challenge-contest-and-hackathon/

At AWS, we are constantly innovating on behalf of our customers so they can run virtually any workload, with optimal price and performance. Amazon EC2 now includes more than 475 instance types that offer a choice of compute, memory, networking, and storage to suit your workload needs. While we work closely with our silicon partners to offer instances based on their latest processors and accelerators, we also drive more choice for our customers by building our own silicon.

The AWS Graviton family of processors were built as part of that silicon innovation initiative with the goal of pushing the price performance envelope for a wide variety of customer workloads in EC2. We now have 12 EC2 instance families powered by AWS Graviton2 processors – general purpose (M6g, M6gd), burstable (T4g), compute optimized (C6g, C6gd, C6gn), memory optimized (R6g, R6gd, X2gd), storage optimized (Im4gn, Is4gen), and accelerated computing (G5g) available globally across 23 AWS Regions. We also announced the preview of Amazon EC2 C7g instances powered by the latest generation AWS Graviton3 processors that will provide the best price performance for compute-intensive workloads in EC2. Thousands of customers, including Discovery, DIRECTV, Epic Games, and Formula 1, have realized significant price performance benefits with AWS Graviton-based instances for a broad range of workloads. This year, AWS Graviton-based instances also powered much of Amazon Prime Day 2021 and supported 12 core retail services during the massive 2-day online shopping event.

To make it easy for customers to adopt Graviton-based instances, we launched a program called the Graviton Challenge. Working with customers, we saw that many successful adoptions of Graviton-based instances were the result of one or two developers taking a single workload and spending a few days to benchmark the price performance gains with Graviton2-based instances, before scaling it to more workloads. The Graviton Challenge provides a step-by-step plan that developers can follow to move their first workload to Graviton-based instances. With the Graviton Challenge, we also launched a Contest (US-only), and then a Hackathon (global), where developers could compete for prizes by building new applications or moving existing applications to run on Graviton2-based instances. More than a thousand participants, including enterprises, startups, individual developers, open-source developers, and Arm developers, registered and ran a variety of applications on Graviton-based instances with significant price performance benefits. We saw some fantastic entries and usage of Graviton2-based instances across a variety of use cases and want to highlight a few.

The Graviton Challenge Contest winners:

  • Best Adoption – Enterprise and Most Impactful Adoption: VMware vRealize SRE team, who migrated 60 micro-services written in Java, Rust, and Golang to Graviton2-based general purpose and compute optimized instances and realized up to 48% latency reduction and 22% cost savings.
  • Best Adoption – Startup: Kasm Technologies, who realized up to 48% better performance and 25% potential cost savings for its container streaming platform built on C/C++ and Python.
  • Best New Workload adoption: Dustin Wilson, who built a dynamic tile server based on Golang and running on Graviton2-based memory-optimized instances that helps analysts query large geospatial datasets and benchmarked up to 1.8x performance gains over comparable x86-based instances.
  • Most Innovative Adoption: Loroa, an application that translates any given text into spoken words from one language into multiple other languages using Graviton2-based instances, Amazon Polly, and Amazon Translate.

If you are attending AWS re:Invent 2021 in person, you can hear more details on their Graviton adoption experience by attending the CMP213: Lessons learned from customers who have adopted AWS Graviton chalk talk.

Winners for the Graviton Challenge Hackathon:

  • Best New App: PickYourPlace, an open-source based data analytics platform to help users select a place to live based on property value, safety, and accessibility.
  • Best Migrated App: Genie, an image credibility checker based on deep learning that makes predictions on photographic and tampered confidence of an image.
  • Highest Potential Impact: Welly Tambunan, who’s also an AWS Community Builder, for porting big data platforms Spark, Dremio, and AirByte to Graviton2 instances so developers can leverage it to build big data capabilities into their applications.
  • Most Creative Use Case: OXY, a low-cost custom Oximeter with mobile and web apps that enables continuous and remote monitoring to prevent deaths due to Silent Hypoxia.
  • Best Technical Implementation: Apollonia Bot that plays songs, playlists, or podcasts on a Discord voice channel, so users can listen to it together.

It’s been incredibly exciting to see the enthusiasm and benefits realized by our customers. We are also thankful to our judges – Patrick Moorhead from Moor Insights, James Governor from RedMonk, and Jason Andrews from Arm, for their time and effort.

In addition to EC2, several AWS services for databases, analytics, and even serverless support options to run on Graviton-based instances. These include Amazon Aurora, Amazon RDS, Amazon MemoryDB, Amazon DocumentDB, Amazon Neptune, Amazon ElastiCache, Amazon OpenSearch, Amazon EMR, AWS Lambda, and most recently, AWS Fargate. By using these managed services on Graviton2-based instances, customers can get significant price performance gains with minimal or no code changes. We also added support for Graviton to key AWS infrastructure services such as Elastic Beanstalk, Amazon EKS, Amazon ECS, and Amazon CloudWatch to help customers build, run, and scale their applications on Graviton-based instances. Additionally, a large number of Linux and BSD-based operating systems, and partner software for security, monitoring, containers, CI/CD, and other use cases now support Graviton-based instances and we recently launched the AWS Graviton Ready program as part of the AWS Service Ready program to offer Graviton-certified and validated solutions to customers.

Congrats to all of our Contest and Hackathon winners! Full list of the Contest and Hackathon winners is available on the Graviton Challenge page.

P.S.: Even though the Contest and Hackathon have ended, developers can still access the step-by-step plan on the Graviton Challenge page to move their first workload to Graviton-based instances.

New Storage-Optimized Amazon EC2 Instances (Im4gn and Is4gen) Powered by AWS Graviton2 Processors

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-storage-optimized-amazon-ec2-instances-im4gn-and-is4gen-powered-by-aws-graviton2-processors/

EC2 storage-optimized instances are designed to deliver high disk I/O performance, and plenty of storage. Our customers use them to host high-performance real-time databases, distributed file systems, data warehouses, key-value stores, and more. Over the years we have released multiple generations of storage-optimized instances including the HS1 (2012) , D2 (2015), I2 (2013) , I3 (2017), I3en (2019), and D3/D3en (2020).

As I look back on all of these launches, it is interesting to see how we continue to provide an ever-increasing set of options that make each successive generation an even better fit for the diverse (and also ever-increasing) needs of our customers. HS1 instances were available in just one size, D2 and I2 in four, I3 in six, and I3en in eight. These instances give our customers the freedom to choose the size that best meets their current needs while also giving them room to scale up or down if those needs happen to change.

Im4gn and Is4gen
Today I am happy to introduce the two newest families of storage-optimized instances, Im4gn and Is4gen, powered by Graviton2 processors. Both instances offer up to 30 TB of NVMe storage using AWS Nitro SSD devices that are custom-built by AWS. As part of our drive to innovate on behalf of our customers, we turned our attention to storage and designed devices that were optimized to support high-speed access to large amounts of data. The AWS Nitro SSDs reduce I/O latency by up to 60% and also reduce latency variability by up to 75% when compared to the third generation of storage-optimized instances. As a result you get faster and more predictable performance for your I/O-intensive EC2 workloads.

Im4gn instances are a great fit for applications that require large amounts of dense SSD storage and high compute performance, but are not especially memory intensive such as social games, session storage, chatbots, and search engines. Here are the specs:

Instance Name vCPUs
Memory Local NVMe Storage
(AWS Nitro SSD)
Read Throughput
(128 KB Blocks)
EBS-Optimized Bandwidth Network Bandwidth
im4gn.large 2 8 GiB 937 GB 250 MB/s Up to 9.5 Gbps Up to 25 Gbps
im4gn.xlarge 4 16 GiB 1.875 TB 500 MB/s Up to 9.5 Gbps Up to 25 Gbps
im4gn.2xlarge 8 32 GiB 3.75 TB 1 GB/s Up to 9.5 Gbps Up to 25 Gbps
im4gn.4xlarge 16 64 GiB 7.5 TB 2 GB/s 9.5 Gbps 25 Gbps
im4gn.8xlarge 32 128 GiB 15 TB
(2 x 7.5 TB)
4 GB/s 19 Gbps 50 Gbps
im4gn.16xlarge 64 256 GiB 30 TB
(4 x 7.5 TB)
8 GB/s 38 Gbps 100 Gbps

Im4gn instances provide up to 40% better price performance and up to 44% lower cost per TB of storage compared to I3 instances. The new instances are available in the AWS US West (Oregon), US East (Ohio), US East (N. Virginia), and Europe (Ireland) Regions as On-Demand, Spot, Savings Plan, and Reserved instances.

Is4gen instances are a great fit for applications that do large amounts of random I/O to large amounts of SSD storage. This includes shared file systems, stream processing, social media monitoring, and streaming platforms, all of which can use the increased storage density to retain more data locally. Here are the specs:

Instance Name vCPUs
Memory Local NVMe Storage
(AWS Nitro SSD)
Read Throughput
(128 KB Blocks)
EBS-Optimized Bandwidth Network Bandwidth
is4gen.medium 1 6 GiB 937 GB 250 MB/s Up to 9.5 Gbps Up to 25 Gbps
is4gen.large 2 12 GiB 1.875 TB 500 MB/s Up to 9.5 Gbps Up to 25 Gbps
is4gen.xlarge 4 24 GiB 3.75 TB 1 GB/s Up to 9.5 Gbps Up to 25 Gbps
is4gen.2xlarge 8 48 GiB 7.5 TB 2 GB /s Up to 9.5 Gbps Up to 25 Gbps
is4gen.4xlarge 16 96 GiB 15 TB
(2 x 7.5 TB)
4 GB/s 9.5 Gbps 25 Gbps
is4gen.8xlarge 32 192 GiB 30 TB
(4 x 7.5 TB)
8 GB/s 19 Gbps 50 Gbps

Is4gen instances provide 15% lower cost per TB of storage and up to 48% better compute performance compared to I3en instances. The new instances are available in the AWS US West (Oregon), US East (Ohio), US East (N. Virginia), and Europe (Ireland) Regions as On-Demand, Spot, Savings Plan, and Reserved instances.

Available Now
As I never get tired of saying, these new instances are available now and you can start using them today. You can use Amazon Linux 2, Ubuntu 18.04.05 (and newer), Red Hat Enterprise Linux 8.0, and SUSE Enterprise Server 15 (and newer) AMIs, along with the container-optimized ECS and EKS AMIs. Learn more about the Im4gn and Is4gen instances.

Jeff;

PS – As of this launch twelve EC2 instance types are now powered by Graviton2 processors! To learn more, visit the Graviton2 page.

New – AWS Outposts Servers in Two Form Factors

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-aws-outposts-servers-in-two-form-factors/

AWS Outposts gives you on-premises compute and storage that is monitored and managed by AWS, and controlled by the same, familiar AWS APIs. You may already know about the AWS Outposts rack, which occupies a full 42U rack.

Last year I told you that we were working on new sizes of Outposts suitable for locations such as branch offices, factories, retail stores, health clinics, hospitals, and cell sites that are space-constrained and need access to low-latency compute capacity. Today we are launching three AWS Outposts servers, all powered by AWS Nitro System and with your choice of x86 or Arm/Graviton2 processors. Here’s an overview:

Name/Rack Size/Catalog ID
EC2 Instance Capacity
Processor / Architecture
vCPUs Memory
Local NVMe
SSD Storage
Outposts 1U
(STBKRBE)
c6gd.16xlarge Graviton2 / Arm 64 128 GiB 3.8 TB
( 2x 1.9 TB)
Outposts 2U
(LMXAD41)
c6id.16xlarge Intel Ice Lake / x86 64 128 GiB 3.8 TB
(2 x 1.9 TB)
Outposts 2U
(KOSKFSF)
c6id.32xlarge Intel Ice Lake / x86 128 256 GiB 7.6 TB
(4 x 1.9 GB)

You can create VPC subnets on each Outpost, and you can launch Amazon Elastic Compute Cloud (Amazon EC2) instances from EBS-backed AMIs in the parent region. The c6gd.16xlarge model supports six instance sizes, as follows:

Instance Name vCPUs Memory Local Storage
c6gd.large 2 4 GiB 118 GB
c6gd.xlarge 4 8 GiB 237 GB
c6gd.2xlarge 8 16 GiB 474 GB
c6gd.4xlarge 16 32 GiB 950 GB
c6gd.8xlarge 32 64 GiB 1.9 TB
c6gd.16xlarge 64 128 GiB 3.8 TB

The c6id.16xlarge model supports all but the largest of the following instance sizes, and the c6id.32xlarge supports all of them:

Instance Name vCPUs Memory Local Storage
c6id.large 2 4 GiB 118 GB
c6id.xlarge 4 8 GiB 237 GB
c6id.2xlarge 8 16 GiB 474 GB
c6id.4xlarge 16 32 GiB 950 GB
c6id.8xlarge 32 64 GiB 1.9 TB
c6id.16xlarge 64 128 GiB 3.8 TB
c6id.32xlarge 128 256 GiB 7.6 TB

Within each of your Outposts servers, you can launch any desired mix of instance sizes as long as you remain within the overall processing and storage available. You can create Amazon Elastic Container Service (Amazon ECS) clusters (Amazon Elastic Kubernetes Service (EKS) is coming soon) , and the code you run on-premises can make use of the entire lineup of services in the AWS Cloud.

Each Outposts server connects to the cloud via the public Internet or across a private AWS Direct Connect line. Additionally, each Outpost server supports a Local Network Interface (LNI) that provides a Level 2 presence on your local network for AWS service endpoints.

Outposts servers incorporate many powerful Nitro features including high speed networking and enhanced security. The security model is locked-down and prevents administrative access, preventing tampering or human error. Additionally, data at rest is protected by a NIST-compliant physical security key.

While I was writing this post, I stopped in to say hello to the design and development team, and met with my colleague Bianca Nagy to learn more about the Outposts server:

Ordering Outposts Servers
Let’s walk through the process of ordering an Outposts server from the AWS Management Console. I visit the AWS Outposts Console, make sure that I am in the desired AWS Region, and click Place order to get started:

I click Servers, and then choose the desired configuration. I pick the c6gd.16xlarge, and click Next to proceed:

Then I create a new Outpost:

And a new Site:

Then I review my payment options and select my shipping address:

On the next page I review all of my options, click Place order, and await delivery:

In general, we expect to be able to deliver Outposts servers in two to six weeks, starting in the first quarter of 2022. After you receive yours, you or a member of your IT team can mount it in a 19″ rack or position it on a flat surface, cable it to power and networking, and power the device on. You then use a set of temporary AWS credentials to confirm the identity of the device, and to verify that the device is able to use DHCP to obtain an IP address. Once the device has established connectivity to the designated AWS parent region, we will finalize the provisioning of EC2 instance capacity and make it available to you.

After that, you are ready to launch instances and to deploy your on-premises applications.

We will monitor hardware performance and will contact you if your device is in need of maintenance. We will ship a replacement device for arrival within 2 business days. You can migrate your workloads to a redundant device, and use tracking information & notifications to track delivery status. When the replacement arrives, you install it and then destroy the physical security key in the old one before shipping it back to AWS.

Outposts API Update
We are also enhancing the Outposts API as part of this launch. Here are some of the new functions:

ListCatalogItem – Get a list of items in the Outposts catalog, with optional filtering by EC2 family or supported storage options.

GetCatalogItem – Get full information about a single item in the Outposts catalog.

GetSiteAddress – Get the physical address of a site where an Outposts rack or server is installed.

You can use the information returned by GetCatalogItem to place an order that contains the desired quantity of one or more catalog items.

Things to Know
Here are a couple of important things to know about Outposts servers:

Availability – Outposts servers are available for order to most locations where Outposts racks are available (currently 23 regions and 49 countries), with more to follow in 2022.

Ordering at Scale – I showed you the console-based ordering process above, and also gave you a glimpse at the Outposts API. If you need hundreds or thousands of devices, get in touch and we will give you a template that you can fill in and then upload.

re:Invent 2021 Outposts Server Selfie Challenge
If you attend AWS re:Invent, be sure to visit the AWS Hybrid kiosk in the AWS Booth (#1719) to see the new Outposts Servers up close and personal. While you are there, take a fun & creative selfie, tag it with #AWSOutposts & #AWSPromotion, and share it on Twitter. I will post my three favorites at the end of the show!

Jeff;

Join the Preview – Amazon EC2 C7g Instances Powered by New AWS Graviton3 Processors

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/join-the-preview-amazon-ec2-c7g-instances-powered-by-new-aws-graviton3-processors/

We announced the first generation AWS-designed Graviton processor in late 2018, and followed it up with the second generation Graviton2 a year later. Today, AWS customers make use of twelve different Graviton2-powered instances including the new X2gd instances that are designed for memory-intensive workloads. All Graviton processors include dedicated cores & caches for each vCPU, along with additional security features courtesy of AWS Nitro System; the Graviton2 processors add support for always-on memory encryption.

C7g in the Works
I am thrilled to tell you about our upcoming C7g instances. Powered by new Graviton3 processors, these instances are going to be a great match for your compute-intensive workloads: HPC, batch processing, electronic design automation (EDA), media encoding, scientific modeling, ad serving, distributed analytics, and CPU-based machine learning inferencing.

While we are still optimizing these instances, it is clear that the Graviton3 is going to deliver amazing performance. In comparison to the Graviton2, the Graviton3 will deliver up to 25% more compute performance and up to twice as much floating point & cryptographic performance. On the machine learning side, Graviton3 includes support for bfloat16 data and will be able to deliver up to 3x better performance.

Graviton3 processors also include a new pointer authentication feature that is designed to improve security. Before return addresses are pushed on to the stack, they are first signed with a secret key and additional context information, including the current value of the stack pointer. When the signed addresses are popped off the stack, they are validated before being used. An exception is raised if the address is not valid, thereby blocking attacks that work by overwriting the stack contents with the address of harmful code. We are working with operating system and compiler developers to add additional support for this feature, so please get in touch if this is of interest to you.

C7g instances will be available in multiple sizes (including bare metal), and are the first in the cloud industry to be equipped with DDR5 memory. In addition to drawing less power, this memory delivers 50% higher bandwidth than the DDR4 memory used in the current generation of EC2 instances.

On the network side, C7g instances will offer up to 30 Gbps of network bandwidth and Elastic Fabric Adapter (EFA) support.

Join the Preview
We are now running a preview of the C7g instances so that you can be among the first to experience all of this power. Sign up now, take an instance for a spin, and let me know what you think!

Jeff;