All posts by Macey Neff

Architecting for Disaster Recovery on AWS Outposts Racks with AWS Elastic Disaster Recovery

2024-04-19 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/architecting-for-disaster-recovery-on-aws-outposts-racks-with-aws-elastic-disaster-recovery/

This blog post is written by Brianna Rosentrater, Hybrid Edge Specialist SA.

AWS Elastic Disaster Recovery Service (AWS DRS) now supports disaster recovery (DR) architectures that include on-premises Windows and Linux workloads running on AWS Outposts. AWS DRS minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications using affordable storage, minimal compute, and point-in-time recovery. Both services are billed and managed from your AWS Management Console.

Like workloads running in AWS Regions, it’s critical to plan for failures. Outposts are designed with resiliency in mind, providing redundant power, networking, and are available to order with N+M active compute instance capacity. In other words, for every physical N compute servers, you have the option of including M redundant hosts capable of handling the workload during a failure. When leveraging AWS DRS with Outpost, you can plan for larger-scale failure modes, such as data center outages, by replicating mission-critical workloads to other remote data center locations or the AWS Region.

In this post, you’ll learn how AWS DRS can be used with Outpost rack to architect for high availability in the event of a site failure. The post will examine several different architectures enabled by AWS DRS that provide DR for Outpost, and the benefits of each method described.

Prerequisites

Each of these architectures described below need the following:

At least one Outpost rack
Outpost(s) must be in Direct VPC Routing (DVR) mode
Amazon S3 on Outposts (required for all AWS DRS replication destinations)
An AWS Replication Agent installed on each source server

Public internet access isn’t needed, AWS PrivateLink and AWS Direct Connect are supported for replication and failback which is a significant security benefit.

Planning for failure

Disasters come in many forms and are often unplanned and unexpected events. Regardless of whether your workload resides on premises, in a colocation facility, or in an AWS Region, it’s critical to define the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) which are often workload-specific. These two metrics profile how long a service can be down during recovery and quantify the acceptable amount of data loss. RTO and RPO guide you in choosing the appropriate strategy such as backup and recovery, pilot light, warm standby, or a multi-site (active-active) approach.

With AWS DRS, while failing back to a test machine (not the original source server), replication of the source server continues. This allows failback drills without impacting RPO, and non-disruptive failback drills are an important part of disaster planning to validate your recovery plan meets your expected RPO/RTO as per your business requirements.

How AWS DRS integrates with Outpost

AWS DRS uses an AWS Replication Agent at the source to capture the workload and transfer it to a lightweight staging area, which resides on an Outpost equipped with Amazon S3 on Outposts. This method also provides the ability to perform low-effort, non-disruptive DR drills before making the final cutover. The AWS Replication Agent doesn’t need a reboot nor does it impact your applications during installation.

When an Outpost’s subnet is selected as the target for replication or launch, all associated AWS DRS components remain within the Outpost, including the AWS DRS server conversion technology. These conversion servers convert source disks of servers being migrated so that they can boot and run in the target infrastructure, Amazon EBS volumes, snapshots, and replication servers. The replication servers replicate the disks to the target infrastructure. With AWS DRS you can control the data replication path using private connectivity options such as a virtual private network (VPN), AWS Direct Connect, VPC peering, or another private connection. Learn more about using a private IP for data replication.

AWS DRS provides nearly continuous replication for mission-critical workloads and supports deployment patterns including on-premises to Outpost, Outpost to Region, Region to Outpost, and between two logical Outposts through local networks. To leverage Outpost with AWS DRS, simply select the Outpost subnet as your target or source for replication when configuring AWS DRS for your workload. If you are currently using CloudEndure DR for disaster recovery with Outpost, see these detailed instructions for migrating to AWS DRS from CloudEndure DR.

DR from on-premises to Outpost

Outpost can be used as a DR target for on-premises workloads. By deploying an Outpost in a remote data center or colocation a significant distance from the source within the same geo-political boundary, you can replicate workloads across great distances and increase resiliency of the data while ensuring adherence to data residency policies or legislation.

Figure 1 – DR from on-premises to Outposts

In Figure 1, on premises sources replicate traffic from a LAN to a staging area residing in an Outpost subnet via the local gateway. This allows workloads to failover from their on-premises environment to an Outpost in a different physical location during a disaster.

The staging areas and replication servers run on Amazon Elastic Compute Cloud (Amazon EC2) with Amazon EBS volumes and require Amazon S3 on Outposts where the Amazon EBS snapshots reside.

The replication agent is responsible for providing nearly continuous, block-level replication from your LAN using TCP/1500 with traffic routing to Amazon EC2 instances using the Outposts local gateway.

DR from Outpost to Region

Since its initial release, Outpost has supported Amazon EBS snapshots written to Amazon S3 located in the AWS Region. Backup to an AWS Region is one of the most cost-effective and easiest-to-configure DR approaches, enabling data redundancy outside of your Outpost and data center.

This method also offers flexibility for restoration within an AWS Region if the original deployment is irrecoverable. However, depending on the frequency of the snapshots and the timing of the failure, backup, and recovery to the Region has the potential to have an RPO/RTO spanning hours depending on the throughput of the service link.

For critical workloads, AWS DRS can reduce RTO to minutes and RPO in the sub-second range. After creating an initial replication of workloads that reside on the Outpost, AWS DRS provides nearly continuous, block-level replication in the Region. Just like replication from non-AWS virtual machines or bare metal servers, AWS DRS resources, including Replication Servers, Conversion Servers, Amazon EBS Volumes, and Snapshots reside in the Region.

Figure 2 – DR from Outpost to Region

In Figure 2, data replication is performed over the service link from Amazon EC2 instances running locally on an Outpost to an AWS Region. The service link traverses either public Region connectivity or AWS Direct Connect.

AWS Direct Connect is the recommended option because it provides low latency and consistent bandwidth for the service link back to a Region, which also improves the reliability of transmission for AWS DRS replication traffic.

The service link is comprised of redundant, encrypted VPN tunnels. Replication traffic can also be sent privately without traversing the public internet by leveraging Private Virtual Interfaces with Direct Connect for the service link.

With this architecture in place, you can mitigate disasters and reduce downtime by failing over to the AWS Region using AWS DRS.

DR from Region to Outpost

AWS provides multiple Availability Zones (AZs) within a Region and isolated AWS Regions globally for the greatest possible fault tolerance and stability. The reliability pillar of AWS’s Well-Architected Framework encourages distributing workloads across AZs and replicating data between Regions when the need for distances exceeds those of AZs.

AWS DRS supports nearly continuous replication of workloads from a Region to an Outpost within your data center or colocation facility for DR. This deployment model provides increased durability from a source AWS Region to an Outpost anchored to a different Region.

In this model, AWS DRS components remain on-premises within the Outpost, but data charges are applicable as data egresses from the Region back to the data center and Amazon S3 on Outposts is required on the destination Outpost.

Figure 3 – DR from Region to Outpost

Implementing the preceding architecture diagram enables failover of critical workloads from the Region to on-premises Outposts seamlessly. Keep in mind that AWS Regions provide the management and control plane for Outpost, making it critical to consider probability and frequency of service link interruptions as a part of your DR planning. Scenarios such as warm standby with pre-allocated Amazon EC2 and Amazon EBS resources may prove more resilient during service link disruptions.

DR between two Outposts

Each logical Outpost is comprised of one or more physical racks. Logical Outposts are in independent colocations of one another, and support deployments in disparate data centers or colocation facilities. You can elect to have multiple logical Outposts anchored to different Availability Zones or Regions. AWS DRS unlocks options for replication between two logical Outposts, leading to increased resiliency and reducing the impact of your data center as a single point of failure. In the following architecture, nearly continuous replication captured from a single Outpost source is applied at a second logical Outpost.

Figure 4 – DR between two Outposts

Supporting both directional and bidirectional replication between Outposts can minimize disruption caused by events that take down a data center, Availability Zone, or even the entire Region result in minimal disruption. In the following architecture diagram, bidirectional data replication occurs between the Outposts by routing traffic via the local gateways, minimizing outbound data charges from the Region and allowing for more direct routing between deployment sites that could potentially span significant distances. AWS DRS cannot communicate with resources directly utilizing a customer-owned IP address pool (CoIP pool).

Figure 5 – DR between two Outposts – bidirectional

Architecture Considerations

When planning an Outpost deployment leveraging AWS DRS, it’s critical to consider the impact on storage. As a general best practice, AWS recommends planning for a 2:1 ratio consisting of EBS volumes used for nearly continuous replication and Amazon EBS snapshots on Amazon S3 for point-in-time recovery. While it’s unlikely that all servers would need recovery simultaneously, it’s also important to allocate a reserve of EBS volume capacity, which will launch at the time of recovery. Amazon S3 on Outpost is needed for each Outpost used as a replication destination, and the recommendation is to plan for a 1:1 ratio consisting of S3 on Outposts storage, plus the rate of data change. For example, if your data change rate is 10%, you’d want to plan for 110% S3 on Outpost use with AWS DRS.

Amazon CloudWatch has integrated metrics for EC2, Amazon EBS, and Amazon S3 capacity on Outposts, making it easy to create custom tailored dashboards and integrate with Simple Notification Service (Amazon SNS) for alerts at defined thresholds. Monitoring these metrics is critical in making sure that proper free space is available for data replication to occur unimpeded. CloudWatch has metrics available for AWS DRS as well. You can also use the AWS DRS service page in the AWS Console to monitor the status of your recovery instances.

Consider taking advantage of Recovery Plans within AWS DRS to make sure that related services are recovered in a particular order. For example, during a disaster, it might be critical to first bring up a database before recovering application tiers. Recovery plans provide the ability to group related services and apply wait times to individual targets.

Conclusion

AWS Outpost enables low latency, data residency, or data gravity-constrained workloads by supplying managed cloud compute and storage services within your data center or colocation. When coupled with AWS DRS, you can decrease RPO and RTO through a variety of flexible deployment models with sources and destinations ranging from on-premises, the Region, or another AWS Outpost.

Deploying an EMR cluster on AWS Outposts to process data from an on-premises database

2024-02-21 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/deploying-an-emr-cluster-on-aws-outposts-to-process-data-from-an-on-premises-database/

seThis post is written by Eder de Mattos, Sr. Cloud Security Consultant, AWS and Fernando Galves, Outpost Solutions Architect, AWS.

In this post, you will learn how to deploy an Amazon EMR cluster on AWS Outposts and use it to process data from an on-premises database. Many organizations have regulatory, contractual, or corporate policy requirements to process and store data in a specific geographical location. These strict requirements become a challenge for organizations to find flexible solutions that balance regulatory compliance with the agility of cloud services. Amazon EMR is the industry-leading cloud big data platform for data processing, interactive analysis, and machine learning (ML) that uses open-source frameworks. With Amazon EMR on Outposts, you can seamlessly use data analytics solutions to process data locally in your on-premises environment without moving data to the cloud. This post focuses on creating and configuring an Amazon EMR cluster on AWS Outposts rack using Amazon Virtual Private Cloud (Amazon VPC) endpoints and keeping the networking traffic in the on-premises environment.

Architecture overview

In this architecture, there is an Amazon EMR cluster created in an AWS Outposts subnet. The cluster retrieves data from an on-premises PostgreSQL database, employs a PySpark Step for data processing, and then stores the result in a new table within the same database. The following diagram shows this architecture.

Figure 1 Architecture overview

Networking traffic on premises: The communication between the EMR cluster and the on-premises PostgreSQL database is through the Local Gateway. The core Amazon Elastic Compute Cloud (Amazon EC2) instances of the EMR cluster are associated with Customer-owned IP addresses (CoIP), and each instance has two IP addresses: an internal IP and a CoIP IP. The internal IP is used to communicate locally in the subnet, and the CoIP IP is used to communicate with the on-premises network.

Amazon VPC endpoints: Amazon EMR establishes communication with the VPC through an interface VPC endpoint. This communication is private and conducted entirely within the AWS network instead of connecting over the internet. In this architecture, VPC endpoints are created on a subnet in the AWS Region.

The support files used to create the EMR cluster are stored in an Amazon Simple Storage Service (Amazon S3) bucket. The communication between the VPC and Amazon S3 stays within the AWS network. The following files are stored in this S3 bucket:

get-postgresql-driver.sh: This is a bootstrap script to download the PostgreSQL driver to allow the Spark step to communicate to the PostgreSQL database through JDBC. You can download it through the GitHub repository for this Amazon EMR on Outposts blog post.
postgresql-42.6.0.jar: PostgreSQL binary JAR file for the JDBC driver.
spark-step-example.py: Example of a Step application in PySpark to simulate the connection to the PostgreSQL database.

AWS Systems Manager is configured to manage the EC2 instances that belong to the EMR cluster. It uses an interface VPC endpoint to allow the VPC to communicate privately with the Systems Manager.

The database credentials to connect to the PostgreSQL database are stored in AWS Secrets Manager. Amazon EMR integrates with Secrets Manager. This allows the secret to be stored in the Secrets Manager and be used through its ARN in the cluster configuration. During the creation of the EMR cluster, the secret is accessed privately through an interface VPC endpoint and stored in the variable DBCONNECTION in the EMR cluster.

In this solution, we are creating a small EMR cluster with one primary and one core node. For the correct sizing of your cluster, see Estimating Amazon EMR cluster capacity.

There is additional information to improve the security posture for organizations that use AWS Control Tower landing zone and AWS Organizations. The post Architecting for data residency with AWS Outposts rack and landing zone guardrails is a great place to start.

Prerequisites

Before deploying the EMR cluster on Outposts, you must make sure the following resources are created and configured in your AWS account:

Outposts rack are installed, up and running.
Amazon EC2 key pair is created. To create it, you can follow the instructions in Create a key pair using Amazon EC2 in the Amazon EC2 user guide.

Deploying the EMR cluster on Outposts

1. Deploy the CloudFormation template to create the infrastructure for the EMR cluster

You can use this AWS CloudFormation template to create the infrastructure for the EMR cluster. To create a stack, you can follow the instructions in Creating a stack on the AWS CloudFormation console in the AWS CloudFormation user guide.

2. Create an EMR cluster

To launch a cluster with Spark installed using the console:

Step 1: Configure Name and Applications

Sign in to the AWS Management Console, and open the Amazon EMR console.
Under EMR on EC2, in the left navigation pane, select Clusters, and then choose Create Cluster.
On the Create cluster page, enter a unique cluster name for the Name
For Amazon EMR release, choose emr-6.13.0.
In the Application bundle field, select Spark 3.4.1 and Zeppelin 0.10.1, and unselect all the other options.
For the Operating system options, select Amazon Linux release.

Figure 2: Create Cluster

Step 2: Choose Cluster configuration method

Under the Cluster configuration, select Uniform instance groups.
For the Primary and the Core, select the EC2 instance type available in the Outposts rack that is supported by the EMR cluster.
Remove the instance group Task 1 of 1.

Figure 3: Remove the instance group Task 1 of 1

Step 3: Set up Cluster scaling and provisioning, Networking and Cluster termination

In the Cluster scaling and provisioning option, choose Set cluster size manually and type the value 1 for the Core
On the Networking, select the VPC and the Outposts subnet.
For Cluster termination, choose Manually terminate cluster.

Step 4: Configure the Bootstrap actions

A. In the Bootstrap actions, add an action with the following information:

1. Name: copy-postgresql-driver.sh
2. Script location: s3://<bucket-name>/copy-postgresql-driver.sh. Modify the <bucket-name> variable to the bucket name you specified as a parameter in Step 1.

Figure 4: Add bootstrap action

Step 5: Configure Cluster logs and Tags

a. Under Cluster logs, choose Publish cluster-specific logs to Amazon S3 and enter s3://<bucket-name>/logs for the field Amazon S3 location. Modify the <bucket-name> variable to the bucket name you specified as a parameter in Step 1.

Figure 5: Amazon S3 location for cluster logs

b. In Tags, add new tag. You must enter for-use-with-amazon-emr-managed-policies for the Key field and true for Value.

Figure 6: Add tags

Step 6: Set up Software settings and Security configuration and EC2 key pair

a. In the Software settings, enter the following configuration replacing the Secret ARN created in Step 1:

[
          {
                    "Classification": "spark-defaults",
                    "Properties": {
                              "spark.driver.extraClassPath": "/opt/spark/postgresql/driver/postgresql-42.6.0.jar",
                              "spark.executor.extraClassPath": "/opt/spark/postgresql/driver/postgresql-42.6.0.jar",
                              "[email protected]":
                                         "arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>"
                    }
          }
]

This is an example of the Secret ARN replaced:

Figure 7: Example of the Secret ARN replaced

b. For the Security configuration and EC2 key pair, choose the SSH key pair.

Step 7: Choose Identity and Access Management (IAM) roles

a. Under Identity and Access Management (IAM) roles:

1. In the Amazon EMR service role:
  - Choose AmazonEMR-outposts-cluster-role for the Service role.
2. In EC2 instance profile for Amazon EMR
  - Choose AmazonEMR-outposts-EC2-role.

Figure 8: Choose the service role and instance profile

Step 8: Create cluster

Choose Create cluster to launch the cluster and open the cluster details page.

Now, the EMR cluster is starting. When your cluster is ready to process tasks, its status changes to Waiting. This means the cluster is up, running, and ready to accept work.

Figure 9: Result of the cluster creation

3. Add CoIPs to EMR core nodes

You need to allocate an Elastic IP from the CoIP pool and associate it with the EC2 instance of the EMR core nodes. This is necessary to allow the core nodes to access the on-premises environment. To allocate an Elastic IP, follow the instructions in Allocate an Elastic IP address in Amazon EC2 User Guide for Linux Instances. In Step 5, choose the Customer-owned pool of IPV4 addresses.

Once the CoIP IP is allocated, associate it with each EC2 instance of the EMR core node. Follow the instructions in Associate an Elastic IP address with an instance or network interface in Amazon EC2 User Guide for Linux Instances.

Checking the configuration

Make sure the EC2 instance of the core nodes can ping the IP of the PostgreSQL database.

Connect to the Core node EC2 instance using Systems Manager and ping the IP address of the PostgreSQL database.

Figure 10: Connectivity test

Make sure the Status of the EMR cluster is Waiting.

Figure 11: Cluster is ready and waiting

Adding a step to the Amazon EMR cluster

You can use the following Spark application to simulate the data processing from the PostgreSQL database.

spark-step-example.py:

import os
from pyspark.sql import SparkSession

if __name__ == "__main__":

    # ---------------------------------------------------------------------
    # Step 1: Get the database connection information from the EMR cluster 
    #         configuration
    dbconnection = os.environ.get('DBCONNECTION')
    #    Remove brackets
    dbconnection_info = (dbconnection[1:-1]).split(",")
    #    Initialize variables
    dbusername = ''
    dbpassword = ''
    dbhost = ''
    dbport = ''
    dbname = ''
    dburl = ''
    #    Parse the database connection information
    for dbconnection_attribute in dbconnection_info:
        (key_data, key_value) = dbconnection_attribute.split(":", 1)

        if key_data == "username":
            dbusername = key_value
        elif key_data == "password":
            dbpassword = key_value
        elif key_data == 'host':
            dbhost = key_value
        elif key_data == 'port':
            dbport = key_value
        elif key_data == 'dbname':
            dbname = key_value

    dburl = "jdbc:postgresql://" + dbhost + ":" + dbport + "/" + dbname

    # ---------------------------------------------------------------------
    # Step 2: Connect to the PostgreSQL database and select data from the 
    #         pg_catalog.pg_tables table
    spark_db = SparkSession.builder.config("spark.driver.extraClassPath",                                          
               "/opt/spark/postgresql/driver/postgresql-42.6.0.jar") \
               .appName("Connecting to PostgreSQL") \
               .getOrCreate()

    #    Connect to the database
    data_db = spark_db.read.format("jdbc") \
        .option("url", dburl) \
        .option("driver", "org.postgresql.Driver") \
        .option("query", "select count(*) from pg_catalog.pg_tables") \
        .option("user", dbusername) \
        .option("password", dbpassword) \
        .load()

    # ---------------------------------------------------------------------
    # Step 3: To do the data processing
    #
    #    TO-DO

    # ---------------------------------------------------------------------
    # Step 4: Save the data into the new table in the PostgreSQL database
    #
    data_db.write \
        .format("jdbc") \
        .option("url", dburl) \
        .option("dbtable", "results_proc") \
        .option("user", dbusername) \
        .option("password", dbpassword) \
        .save()

    # ---------------------------------------------------------------------
    # Step 5: Close the Spark session
    #
    spark_db.stop()
    # ---------------------------------------------------------------------

You must upload the file spark-step-example.py to the bucket created in Step 1 of this post before submitting the Spark application to the EMR cluster. You can get the file at this GitHub repository for a Spark step example.

Submitting the Spark application step using the Console

To submit the Spark application to the EMR cluster, follow the instructions in To submit a Spark step using the console in the Amazon EMR Release Guide. In Step 4 of this Amazon EMR guide, provide the following parameters to add a step:

choose Cluster mode for the Deploy mode
type a name for your step (such as Step 1)
for the Application location, choose s3://<bucket-name>/spark-step-example.py and replace the <bucket-name> variable to the bucket name you specified as a parameter in Step 1
leave the Spark-submit options field blank

Figure 12: Add a step to the EMR cluster

The Step is created with the Status Pending. When it is done, the Status changes to Completed.

Figure 13: Step executed successfully

Cleaning up

When the EMR cluster is no longer needed, you can delete the resources created to avoid incurring future costs by following these steps:

Follow the instructions in Terminate a cluster with the console in the Amazon EMR Documentation Management Guide. Remember to turn off the Termination protection.
Dissociate and release the CoIP IPs allocated to the EC2 instances of the EMR core nodes.
Delete the stack in the AWS CloudFormation using the instructions in Deleting a Stack on the AWS CloudFormation console in the AWS CloudFormation User Guide

Conclusion

Amazon EMR on Outposts allows you to use the managed services offered by AWS to perform big data processing close to your data that needs to remain on-premises. This architecture eliminates the need to transfer on-premises data to the cloud, providing a robust solution for organizations with regulatory, contractual, or corporate policy requirements to store and process data in a specific location. With the EMR cluster accessing the on-premises database directly through local networking, you can expect faster and more efficient data processing without compromising on compliance or agility. To learn more, visit the Amazon EMR on AWS Outposts product overview page.

Announcing IPv6 instance bundles and pricing update on Amazon Lightsail

2024-01-26 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/announcing-ipv6-instance-bundles-and-pricing-update-on-amazon-lightsail/

Amazon Lightsail is the easiest way to get started on AWS, allowing you to get your application running on your own virtual server in a matter of minutes. Lightsail bundles all the resources you need like memory, vCPU, solid-state drive (SSD), and data transfer allowance into a predictable monthly price, so budgeting is easy and straightforward.

IPv6 instance bundles

Announcing the availability of new IPv6 instance bundles on Lightsail. With the new bundles, you can now create and use Lightsail instances without a public IPv4 address. These bundles include an IPv6 address for use cases that do not require a public IPv4 address. Both Linux and Windows IPv6 bundles are available. See the full list of Amazon Lightsail instance blueprints compatible with IPv6 instances. If you have existing Lightsail instances with a public IPv4 address, you can migrate the instance to IPv6-only in a couple of steps: Create a snapshot of an existing instance, then create a new instance from the snapshot and select IPv6-only networking when choosing your instance plan.

To learn more about IPv6 bundles, read Lightsail documentation.

IPv4 instance bundles

Lightsail will continue to offer bundles that include one public IPv4 address and IPv6 address. Following AWS’s announcement on public IPv4 address charge, the prices of Lightsail bundles offered with a public IPv4 address will reflect the charge associated with the public IPv4 address.

Revised prices for bundles that include a public IPv4 address will be effective on all new and existing Lightsail bundles starting May 1, 2024.

The tables below outline all Lightsail instance bundles and pricing.

Linux-based bundles:

Windows-based bundles:

*Bundles in the Asia Pacific (Mumbai) and Asia Pacific (Sydney) AWS Regions include lower data transfer allowances than other regions.

To learn more about Lightsail’s bundled offerings and pricing, please see the Lightsail pricing page.

Optimizing video encoding with FFmpeg using NVIDIA GPU-based Amazon EC2 instances

2024-01-04 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/optimizing-video-encoding-with-ffmpeg-using-nvidia-gpu-based-amazon-ec2-instances/

This post is written by Alejandro Gil, Solutions Architect and Joseba Echevarría, Solutions Architect.

Introduction

The purpose of this blog post is to compare video encoding performance between CPUs and Nvidia GPUs to determine the price/performance ratio in different scenarios while highlighting where it would be best to use a GPU.

Video encoding plays a critical role in modern media delivery, enabling efficient storage, delivery, and playback of high-quality video content across a wide range of devices and platforms.

Video encoding is frequently performed solely by the CPU because of its widespread availability and flexibility. Still, modern hardware includes specialized components designed specifically to obtain very high performance video encoding and decoding.

Nvidia GPUs, such as those found in the P and G Amazon EC2 instances, include this kind of built-in hardware in their NVENC (encoding) and NVDEC (decoding) accelerator engines, which can be used for real-time video encoding/decoding with minimal impact on the performance of the CPU or GPU.

Figure 1: NVIDIA NVDEC/NVENC architecture. Source https://developer.nvidia.com/video-codec-sdk

Scenario

Two main transcoding job types should be considered depending on the video delivery use case, 1) batch jobs for on demand video files and 2) streaming jobs for real-time, low latency use cases. In order to achieve optimal throughput and cost efficiency, it is a best practice to encode the videos in parallel using the same instance.

The utilized instance types in this benchmark can be found in figure 2 table (i.e g4dn and p3). For hardware comparison purposes, the p4d instance has been included in the table, showing the GPU specs and total number of NVDEC & NVENC cores in these EC2 instances. Based on the requirements, multiple GPU instances types are available in EC2.

Instance size	GPUs	GPU model	NVDEC generation	NVENC generation	NVDEC cores/GPU	NVENC cores/GPU
g4dn.xlarge	1	T4	4^th	7^th	2	1
p3.2xlarge	1	V100	3^rd	6^th	1	3
p4d.24xlarge	8	A100	4^th	N/A	5	0

Figure 2: GPU instances specifications

Benchmark

In order to determine which encoding strategy is the most convenient for each scenario, a benchmark will be conducted comparing CPU and GPU instances across different video settings. The results will be further presented using graphical representations of the performance indicators obtained.

The benchmark uses 3 input videos with different motion and detail levels (still, medium motion and high dynamic scene) in 4k resolution at 60 frames per second. The tests will show the average performance for encoding with FFmpeg 6.0 in batch (using Constant Rate Factor (CRF) mode) and streaming (using Constant Bit Rate (CBR)) with x264 and x265 codecs to five output resolutions (1080p, 720p, 480p, 360p and 160p).

The benchmark tests encoding the target videos into H.264 and H.265 using the x264 and x265 open-source libraries in FFmpeg 6.0 on the CPU and the NVENC accelerator when using the Nvidia GPU. The H.264 standard enjoys broad compatibility, with most consumer devices supporting accelerated decoding. The H.265 standard offers superior compression at a given level of quality than H.264 but hardware accelerated decoding is not as widely deployed. As a result, for most media delivery scenarios having more than one video format will be required in order to provide the best possible user experience.

Offline (batch) encoding

This test consists of a batch encoding with two different standard presets (ultrafast and medium for CPU-based encoding and p1 and medium presets for GPU-accelerated encoding) defined in the FFmpeg guide.

The following chart shows the relative cost of transcoding 1 million frames to the 5 different output resolutions in parallel for CPU-encoding EC2 instance (c6i.4xlarge) and two types of GPU-powered instances (g4dn.xlarge and p3.2xlarge). The results are normalized so that the cost of x264 ultrafast preset on c6i.4xlarge is equal to one.

Figure 3: Batch encoding performance for CPU and GPU instances.

The performance of batch encoding in the best GPU instance (g4dn.xlarge) shows around 73% better price/performance in x264 compared to the c6i.4xlarge and around 82% improvement in x265.

A relevent aspect to have in consideration is that the presets used are not exactly equivalent for each hardware because FFmpeg uses different operators depending on where the process runs (i.e CPU or GPU). As a consequence, the video outputs in each case have a noticeable difference between them. Generally, NVENC-based encoded videos (GPU) tend to have a higher quality in H.264, whereas CPU outputs present more encoding artifacts. The difference is more noticeable for lower quality cases (ultrafast/p1 presets or streaming use cases).

The following images compare the output quality for the medium motion video in the ultrafast/p1 and medium presets.

It is clearly seen in the following example, that the h264_nevenc (GPU) codec outperforms the libx264 codec (CPU) in terms of quality, showing less pixelation, especially in the ultrafast preset. For the medium preset, although the quality difference is less pronounced, the GPU output file is noticeably larger (refer to Figure 6 table).

Figure 4: Result comparison between GPU and CPU for h264, ultrafast

Figure 5: Result comparison between GPU and CPU for h264, medium

The output file sizes mainly depend on the preset, codec and input video. The different configurations can be found in the following table.

Figure 6: Sizes for output batch encoded videos. Streaming not represented because the size is the same (fixed bitrate)

Live stream encoding

For live streaming use cases, it is useful to measure how many streams a single instance can maintain transcoding to five output resolutions (1080p, 720p, 480p, 360p and 160p). The following results are the relative cost of each instance, which is the ratio of number of streams the instance was able to sustain divided by the cost per hour.

Figure 6: Streaming encoding performance for CPU and GPU instances.

The previous results show that a GPU-based instance family like g4dn is ideal for streaming use cases, where they can sustain up to 4 parallel encodings from 4K to 1080p, 720p, 480p, 360p & 160p simultaneously. Notice that the GPU-based p5 family performance is not compensating the cost increase.

On the other hand, the CPU-based instances can sustain 1 parallel stream (at most). If you want to sustain the same number of parallel streams in Intel-based instances, you’d have to opt for a much larger instance (c6i.12xlarge can almost sustain 3 simultaneous streams, but it struggles to keep up with the more dynamic scenes when encoding with x265) with a much higher cost ($2.1888 hourly for c6i.12xlarge vs $0.587 for g4dn.xlarge).

The price/performance difference is around 68% better in GPU for x264 and 79% for x265.

Conclusion

The results show that for the tested scenarios there can be a price-performance gain when transcoding with GPU compared to CPU. Also, GPU-encoded videos tend to have an equal or higher perceived quality level to CPU-encoded counterparts and there is no significant performance penalty for encoding to the more advanced H.265 format, which can make GPU-based encoding pipelines an attractive option.

Still, CPU-encoders do a particularly good job with containing output file sizes for most of the cases we tested, producing smaller output file sizes even when the perceived quality is simmilar. This is an important aspect to have into account since it can have a big impact in cost. Depending on the amount of media files distributed and consumed by final users, the data transfer and storage cost will noticeably increase if GPUs are used. With this in mind, it is important to weight the compute costs with the data transfer and storage costs for your use case when chosing to use CPU or GPU-based video encoding.

One additional point to be considered is pipeline flexibility. Whereas the GPU encoding pipeline is rigid, CPU-based pipelines can be modified to the customer’s needs, including additional FFmpeg filters to accommodate future needs as required.

The test did not include any specific quality measurements in the transcoded images, but it would be interesting to perform an analysis based on quantitative VMAF (or similar algorithm) metrics for the videos. We always recommend to make your own test to validate if the results obtained meet your requirements.

Benchmarking method

This blog post extends on the original work described in Optimized Video Encoding with FFmpeg on AWS Graviton Processors and the benchmarking process has been maintained in order to preserve consistency of the benchmark results. The original article analyzes in detail the price/performance advantages of AWS Graviton 3 compared to other processors.

Figure 7: Batch encoding workflow

Hibernating EC2 Instances in Response to a CloudWatch Alarm

2023-12-19 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/hibernating-ec2-instances-in-response-to-a-cloudwatch-alarm/

This blog post is written by Jose Guay, Technical Account Manger, Enterprise Support.

A typical option to reduce costs associated with running Amazon Elastic Compute Cloud (Amazon EC2) instances is to stop them when they are idle. However, there are scenarios where stopping an idle instance is not practical. For example, instances with development environments that take time to prepare and run which benefit from not needing to do this process every day. For these instances, hibernation is a better alternative.

This blog post explores a solution that will find idle instances using an Amazon CloudWatch alarm that monitors the instance’s CPU usage. When the CPU usage consistently drops below the alarm’s threshold, the alarm enters the ALARM state and raises an event used to identify the instance and trigger hibernation.

With this solution, the instance no longer incurs in compute costs, and only accrues storage costs for any Amazon Elastic Block Store (Amazon EBS) volumes.

Overview

To hibernate an EC2 instance, there are prerequisites and required preparation. The instance must be configured to hibernate, and this is done when first launching it. This configuration cannot be changed after launching the instance.

One way to trigger instance hibernation is to use an AWS Lambda function. The Lambda function needs specific permissions configured with AWS Identity and Access Management (IAM). To connect the function with the alarm that detects the idle instance, use an Amazon EventBridge bus.

The following architecture diagram shows a solution.

Figure 1 – Solution architecture

An EC2 instance sends metrics to CloudWatch.
A CloudWatch alarm detects an idle instance and sends the event to EventBridge.
EventBridge triggers a Lambda function.
The Lambda function evaluates the execution role permissions.
The Lambda function identifies the instance and sends the hibernation signal.

To implement the solution, follow these steps:

Configure permissions with IAM
Create the Lambda function
Configure the EC2 instance to send metrics to CloudWatch
Configure EventBridge

a. Configure permissions with IAM

Create an IAM role with permissions to stop an EC2 instance. The Lambda function uses it as its execution role. The IAM role also needs permissions to save logs in CloudWatch. This is useful to log when an instance is entering hibernation.

Open the IAM console.
In the navigation pane, choose Policies.
Select Create policy.
For Select a service, search and select CloudWatch Logs.
In Actions allowed, search “createlog” and select CreateLogStream and CreateLogGroup.
Repeat the search, this time for “putlog”, and select PutLogEvents.
In Resources, choose All.
Select + Add more permissions.
For Select a service, select EC2.
In Actions allowed, search “stop” and select StopInstances from the results.
In Resources, choose Specific, and select the Add Arn
From the pop-up window select Resource in this account, type the region where the instance is, and the instance ID. This forms the ARN of the instances to monitor.
Select Add ARNs.
Select Next.
Name the policy AllowHibernateEC2InstancePolicy.

Figure 2 – IAM policy to access EC2 instances and CloudWatch logs

Figure 3 – Viewing the IAM policy in JSON format

In the navigation page, select Roles.
Select Create role.
For Trusted entity type, select AWS Service.
For Use case, select Lambda.
Select Next.
In the Permissions policies list, search and select Allow HibernateEC2InstancePolicy.
Select Next.
Name the role AllowHibernateEC2InstanceFromLambdaRole.
Select Create role.

Figure 4 – IAM role implementing the IAM policy

b. Create the Lambda function

Create a Lambda function that will find the ID of the idle instance using the event data from the CloudWatch alarm to hibernate it. The event data will be in a function parameter.

The event data is in the JSON format. The following is an example of what this data looks like.

{
	"version": "0",
	"id": "77b0f9cf-ebe3-3893-f60e-1950d2b8ef26",
	"detail-type": "CloudWatch Alarm State Change",
	"source": "aws.cloudwatch",
	"account": "<account>",
	"time": "2023-08-10T21:27:58Z",
	"region": "us-east-1",
	"resources": [
		"arn:aws:cloudwatch:<region>:<account>:alarm:alarm-name"
	],
	"detail": {
		"alarmName": "alarm-name",
		"state": {
			"value": "ALARM",
			"reason": "TEST",
			"timestamp": "2023-07-05T21:27:58.659+0000"
		},
		"previousState": {
			"value": "OK",
			"reason": "Unchecked: Initial alarm creation",
			"timestamp": "2023-07-05T21:13:51.658+0000"
		},
		"configuration": {
			"metrics": [
				{
					"id": "26c493f3-c295-4454-ff19-70ce482dca64",
					"metricStat": {
						"metric": {
							"namespace": "AWS/EC2",
							"name": "CPUUtilization",
							"dimensions": {
								"InstanceId": "<instance id>"
							}
						},
						"period": 300,
						"stat": "Average"
					},
					"returnData": true
				}
			],
			"description": "Created from EC2 Console"
		}
	}
}

Follow these steps to create the Lambda function.

Open the Functions page of the Lambda console.
Choose Create function.
Select Author from scratch.
Name the function HibernateEC2InstanceFunction.
For the Runtime, select Python 3.10 (or the latest Python version).
For Architecture, choose arm64.
Expand Change default execution role and select Use an existing role.
Select AllowHibernateEC2InstanceFromLambdaRole from the list of existing roles.
Select Create function at the bottom of the page.

In the Lambda function page, scroll down to view the Code tab at the bottom. Copy the following code onto the editor for the lambda_function.py file.

import boto3

def lambda_handler(event, context):
    instancesToHibernate = []
    region = getRegion(event)
    ec2Client = boto3.client('ec2', region_name=region)
    id = getInstanceId(event)

    if id is not None:
        instancesToHibernate.append(id)
        ec2Client.stop_instances(InstanceIds=instancesToHibernate, Hibernate=True)
        print('stopped instances: ' + str(instancesToHibernate) + ' in region ' + region)
    else:
        print('No instance id found')

def getRegion(payload):
    if 'region' in payload:
        region = payload['region']
        return region 
    
    #default to N. Virginia
    return 'us-east-1'

def getInstanceId(payload):
    if 'detail' in payload:
        detail = payload['detail']
        if 'configuration' in detail:
            configuration = detail['configuration']
            if 'metrics' in configuration:
                if len(configuration['metrics']) > 0:
                    firstMetric = configuration['metrics'][0] 
                    if 'metricStat' in firstMetric:
                        metricStat = firstMetric['metricStat']
                        if 'metric' in metricStat:
                            metric = metricStat['metric']
                            if 'dimensions' in metric:
                                dimensions = metric['dimensions']
                                if 'InstanceId' in dimensions:
                                    id = dimensions['InstanceId']
                                    return id
    
    return None

Figure 5 – Lambda function code editor

The code has the following contents:

Imports section. In this section, import the libraries that the function uses. In our case, the boto3
The main method, called lambda_handler, is the execution entry point. This is the method called whenever the Lambda function runs.
1. It defines an array to store the IDs of the instances that enter hibernation. This is necessary because the method stop_instances expects an array as opposed to a single value.
2. Using the event data, it finds the AWS Region and instance ID of the instance to hibernate.
3. It initializes the Amazon EC2 client by calling the client method.
4. If it finds an instance ID, then it adds it to the instances array.
5. Calls stop_instances passing as parameters the instances array and True to indicate the hibernation operation.

c. Configure the EC2 instance to send metrics to CloudWatch

In the scenario, an idle EC2 instance has its CPU utilization under 10% during a 15-minute period. Adjust the utilization percentage and/or period to meet your needs. To enable alarm tracking, the EC2 instance must send the CPU Usage metric to CloudWatch.

Open the Amazon EC2 console.
In the navigation pane, choose Instances.
Select an instance to monitor with the checkbox on the left.
Find the Alarm status column, and select the plus sign to add a new alarm.

Figure 6 – Creating a new CloudWatch alarm from the EC2 console

In the Manage CloudWatch alarms page, select Create an alarm. Then, turn off Alarm action. Use Alarm notification to notify when hibernating an instance, otherwise, turn off.

Figure 7 – CloudWatch alarm notification and action settings

In the Alarm thresholds section, select:
1. Group samples by Average.
2. Type of data to sample CPU utilization.
3. Alarm when less than (<).
4. Percent 10.
5. Consecutive periods 1.
6. Period 15 Minutes.
7. Alarm name Idle-EC2-Instance-LessThan10Pct-CPUUtilization-15Min.

Figure 8 – CloudWatch alarm thresholds

Select Create at the bottom of the page.
A successful creation shows a green banner at the top of the page.
Select the Alarm status column for the instance, then select the link that shows in the pop-up window to go to the new CloudWatch alarm details.

Figure 9 – Accessing the CloudWatch alarm from the EC2 console

Scroll down to view the alarm details and copy its ARN, which shows in the lower right corner. The EventBridge rule needs this.

Figure 10 – Finding the CloudWatch alarm ARN

d. Configure EventBridge to consume events from CloudWatch

When the alarm enters the ALARM state, it means it has detected an idle EC2 instance. It will then generate an event that EventBridge can consume and act upon. For this, EventBridge uses rules. EventBridge rules rely on patterns to identify the events and trigger the appropriate actions.

Open the Amazon EventBridge console.
In the navigation pane, choose Rules.
Choose Create rule.
Enter a name and description for the rule. A rule cannot have the same name as another rule in the same Region and on the same event bus.
For Event bus, choose an event bus to associate with this rule. To match events that come from the same account, select AWS default event bus. When an AWS service in the account emits an event, it always goes to the account’s default event bus.
For Rule type, choose Rule with an event pattern.
Select Next.
For Event source, choose AWS services.
Scroll down to Creation method and select Custom pattern (JSON editor).
Enter the following pattern on the Event Pattern

{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "state": {
      "value": ["ALARM"]
    },
    "resources":[
    "<ARN of CW alarms to respond to>"
    ]
  }
}

In the resources element of the pattern, add the ARN of the CloudWatch alarm created for the EC2 instance. The resources element is an array. Add the ARN of every alarm to which this rule monitors and responds. Doing this allows using a single rule to handle the same action for multiple alarms.
Select Next.
Select a target. This is the action that EventBridge executes once it has identified an event. Choose AWS service and select Lambda function.
Select HibernateEC2InstanceFunction.
Select Next.
Add tags to the rule as needed.
Select Next.
Review the rule configuration, and select Create rule.

Figure 11 – EventBridge rule event pattern

Figure 12 – EventBridge rule targets

Testing the implementation

To test the solution, wait for the instance’s CPU utilization to fall below the 10% threshold for 15 minutes. Alternatively, force the alarm to enter the ALARM state with the following AWS CLI command.

aws cloudwatch set-alarm-state --alarm-name "Idle-EC2-Instance-LessThan10Pct-CPUUtilization-15Min" --state-value ALARM --state-reason "testing"

Conclusion

Hibernating EC2 instances brings savings during periods of low utilization. Another benefit is that when they start again, they continue their work from where they left off. To hibernate the instance, set the hibernation configuration when launching it. Detect the idle instance with a CloudWatch alarm, and use EventBridge to capture the alarms and trigger a Lambda function to call the Amazon EC2 stop API with the hibernate parameter.

To learn more

Introducing instance maintenance policy for Amazon EC2 Auto Scaling

2023-11-16 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/introducing-instance-maintenance-policy-for-amazon-ec2-auto-scaling/

This post is written by Ahmed Nada, Principal Solutions Architect, Flexible Compute and Kevin OConnor, Principal Product Manager, Amazon EC2 Auto Scaling.

Amazon Web Services (AWS) customers around the world trust Amazon EC2 Auto Scaling to provision, scale, and manage Amazon Elastic Compute Cloud (Amazon EC2) capacity for their workloads. Customers have come to rely on Amazon EC2 Auto Scaling instance refresh capabilities to drive deployments of new EC2 Amazon Machine Images (AMIs), change EC2 instance types, and make sure their code is up-to-date.

Currently, EC2 Auto Scaling uses a combination of ‘launch before terminate’ and ‘terminate and launch’ behaviors depending on the replacement cause. Customers have asked for more control over when new instances are launched, so they can minimize any potential disruptions created by replacing instances that are actively in use. This is why we’re excited to introduce instance maintenance policy for Amazon EC2 Auto Scaling, an enhancement that provides customers with greater control over the EC2 instance replacement processes to make sure instances are replaced in a way that aligns with performance priorities and operational efficiencies while minimizing Amazon EC2 costs.

This post dives into varying ways to configure an instance maintenance policy and gives you tools to use it in your Amazon EC2 Auto Scaling groups.

Background

AWS launched Amazon EC2 Auto Scaling in 2009 with the goal of simplifying the process of managing Amazon EC2 capacity. Since then, we’ve continued to innovate with advanced features like predictive scaling, attribute-based instance selection, and warm pools.

A fundamental Amazon EC2 Auto Scaling capability is replacing instances based on instance health, due to Amazon EC2 Spot Instance interruptions, or in response to an instance refresh operation. The instance refresh capability allows you to maintain a fleet of healthy and high-performing EC2 instances in your Amazon EC2 Auto Scaling group. In some situations, it’s possible that terminating instances before launching a replacement can impact performance, or in the worst case, cause downtime for your applications. No matter what your requirements are, instance maintenance policy allows you to fine-tune the instance replacement process to match your specific needs.

Overview

Instance maintenance policy adds two new Amazon EC2 Auto Scaling group settings: minimum healthy percentage (MinHealthyPercentage) and maximum healthy percentage (MaxHealthyPercentage). These values represent the percentage of the group’s desired capacity that must be in a healthy and running state during instance replacement. Values for MinHealthyPercentage can range from 0 to 100 percent and from 100 to 200 percent for MaxHealthyPercentage. These settings are applied to all events that lead to instance replacement, such as Health-check based replacement, Max Instance Lifetime, EC2 Spot Capacity Rebalancing, Availability Zone rebalancing, Instance Purchase Option Rebalancing, and Instance refresh. You can also override the group-level instance maintenance policy during instance refresh operations to meet specific deployment use cases.

Before launching instance maintenance policy, an Amazon EC2 Auto Scaling group would use the previously described behaviors when replacing instances. By setting the MinHealthyPercentage of the instance maintenance policy to 100% and the MaxHealthyPercentage to a value greater than 100%, the Amazon EC2 Auto Scaling group first launches replacement instances and waits for them to become available before terminating the instances being replaced.

Setting up instance maintenance policy

You can add an instance maintenance policy to new or existing Amazon EC2 Auto Scaling groups using the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS SDK, AWS CloudFormation, and Terraform.

When creating or editing Amazon EC2 Auto Scaling groups in the Console, you are presented with four options to define the replacement behavior of your instance maintenance policy. These options include the No policy option, which allows you to maintain the default instance replacement settings that the Amazon EC2 Auto Scaling service uses today.

Image 1: The GUI for the instance maintenance policy feature within the “Create Auto Scaling group” wizard.

Using instance maintenance policy to increase application availability

The Launch before terminating policy is the right selection when you want to favor availability of your Amazon EC2 Auto Scaling group capacity. This policy setting temporarily increases the group’s capacity by launching new instances during replacement operations. In the Amazon EC2 console, you select the Launch before terminating replacement behavior, and then set your desired MaxHealthyPercentage value to determine how many more instances should be launched during instance replacement.

For example, if you are managing a workload that requires optimal availability during instance replacements, choose the Launch before terminating policy type with a MinHealthyPercentage set to 100%. If you set your MaxHealthyPercentage to 150%, then Amazon EC2 Auto Scaling launches replacement instances before terminating instances to be replaced. You should see the desired capacity increase by 50%, exceeding the group maximum capacity during the operation to provide you with the needed availability. The chart in the following figure illustrates what an instance refresh operation would behave like with a Launch before terminating policy.

Figure 1: A graph simulating the instance replacement process with a policy configured to launch before terminating.

Overriding a group’s instance maintenance policy during instance refresh

Instance maintenance policy settings apply to all instance replacement operations, but they can be overridden at the start of a new instance refresh operation. Overriding instance maintenance policy is helpful in situations like a bad code deployment that needs replacing without downtime. You could configure an instance maintenance policy to bring an entirely new group’s worth of instances into service before terminating the instances with the problematic code. In this situation, you set the MaxHealthyPercentage to 200% for the instance refresh operation and the replacement happens in a single cycle to promptly address the bad code issue. Setting the MaxHealthyPercentage to 200% will allow the replacement settings to breach the Auto Scaling Group’s Max capacity value, but would be constrained by any account level quotas, so be sure to factor these into application of this feature. See the following figure for a visualization of how this operation would behave.

Figure 2: A graph simulating the instance replacement process with a policy configured to accelerate a new deployment.

Controlling costs during replacements and deployments

The Terminate and launch policy option allows you to favor cost control during instance replacement. By configuring this policy type, Amazon EC2 Auto Scaling terminates existing instances and then launches new instances during the replacement process. To set a Terminate and launch policy, you must specify a MinHealthyPercentage to establish how low the capacity can drop, and keep your MaxHealthyPercentage set to 100%. This configuration keeps the Auto Scaling group’s capacity at or below the desired capacity setting.

The following figure shows behavior with the MinHealthyPercentage set to 80%. During the instance replacement process, the Auto Scaling group first terminates 20% of the instances and immediately launches replacement instances, temporarily reducing the group’s healthy capacity to 80%. The group waits for the new instances to pass its configured health checks and complete warm up before it moves on to replacing the remaining batches of instances.

Figure 3: A graph simulating the instance replacement process with a policy configured to terminate and launch.

Note that the difference between MinHealthyPercentage and MaxHealthyPercentage values impacts the speed of the instance replacement process. In the preceding figure, the Amazon EC2 Auto Scaling group replaces 20% of the instances in each cycle. The larger the gap between the MinHealthyPercentage and MaxHealthyPercentage, the faster the replacement process.

Using a custom policy for maximum flexibility

You can also choose to adopt a Custom behavior option, where you have the flexibility to set the MinHealthyPercentage and MinHealthyPercentage values to whatever you choose. Using this policy type allows you to fine-tune the replacement behavior and control the capacity of your instances within the Amazon EC2 Auto Scaling group to tailor the instance maintenance policy to meet your unique needs.

What about fractional replacement calculations?

Amazon EC2 Auto Scaling always favors availability when performing instance replacements. When instance maintenance policy is configured, Amazon EC2 Auto Scaling also prioritizes launching a new instance rather than going below the MinHealthyPercentage. For example, in an Amazon EC2 Auto Scaling group with a desired capacity of 10 instances and an instance maintenance policy with MinHealthyPercentage set to 99% and MaxHealthyPercentage set to 100%, your settings do not allow for a reduction in capacity of at least one instance. Therefore, Amazon EC2 Auto Scaling biases toward launch before terminating and launches one new instance before terminating any instances that need replacing.

Configuring an instance maintenance policy is not mandatory. If you don’t configure your Amazon EC2 Auto Scaling groups to use an instance maintenance policy, then there is no change in the behavior of your Amazon EC2 Auto Scaling groups’ existing instance replacement process.

You can set a group-level instance maintenance policy through your CloudFormation or Terraform templates. Within your templates, you must set values for both the MinHealthyPercentage and MaxHealthyPercentage settings to determine the instance replacement behavior that aligns with the specific requirements of your Amazon EC2 Auto Scaling group.

Conclusion

In this post, we introduced the new instance maintenance policy feature for Amazon EC2 Auto Scaling groups, explored its capabilities, and provided examples of how to use this new feature. Instance maintenance policy settings apply to all instance replacement processes with the option to override the settings on a per instance refresh basis. By configuring instance maintenance policies, you can control the launch and lifecycle of instances in your Amazon EC2 Auto Scaling groups, increase application availability, reduce manual intervention, and improve cost control for your Amazon EC2 usage.

To learn more about the feature and how to get started, refer to the Amazon EC2 Auto Scaling User Guide.

Maintaining a local copy of your data in AWS Local Zones

2023-10-19 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/maintaining-a-local-copy-of-your-data-in-aws-local-zones/

This post is written by Leonardo Solano, Senior Hybrid Cloud SA and Obed Gutierrez, Solutions Architect, Enterprise.

This post covers data replication strategies to back up your data into AWS Local Zones. These strategies include database replication, file based and object storage replication, and partner solutions for Amazon Elastic Compute Cloud (Amazon EC2).

Customers running workloads in AWS Regions are likely to require a copy of their data in their operational location for either their backup strategy or data residency requirements. To help with these requirements, you can use Local Zones.

Local Zones is an AWS infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers. With Local Zones, customers can build and deploy workloads to comply with state and local data residency requirements in sectors such as healthcare, financial services, gaming, and government.

Solution overview

This post assumes the database source is Amazon Relational Database Service (Amazon RDS). To backup an Amazon RDS database to Local Zones, there are three options:

Figure 1. Amazon RDS replication to Local Zones with AWS DMS

To replicate data, AWS DMS needs a source and a target database. The source database should be your existing Amazon RDS database. The target database is placed in an EC2 instance in the Local Zone. A replication job is created in AWS DMS, which maintains the source and target databases in sync. The replicated database in the Local Zone can be accessed through a VPN. Your database administrator can directly connect to the database engine with your preferred tool.

With this architecture, you can maintain a locally accessible copy of your databases, allowing you to comply with regulatory requirements.

Prerequisites

The following prerequisites are required before continuing:

An AWS Account with Administrator permissions;
Installation of the latest version of AWS Command Line Interface (AWS CLI v2);
An Amazon RDS database.

Walkthrough

1. Enabling Local Zones

First, you must enable Local Zones. Make sure that the intended Local Zone is parented to the AWS Region where the environment is running. Edit the commands to match your parameters, group-name makes reference to your local zone group and region to the region identifier to use.

aws ec2 modify-availability-zone-group \
  --region us-east-1 \
  --group-name us-east-1-qro-1\
  --opt-in-status opted-in

If you have an error when calling the ModifyAvailabilityZoneGroup operation, you must sign up for the Local Zone.

After enabling the Local Zone, you must extend the VPC to the Local Zone by creating a subnet in the Local Zone:

aws ec2 create-subnet \
  --region us-east-1 \
  --availability-zone us-east-1-qro-1a \
  --vpc-id vpc-02a3eb6585example \
  --cidr-block my-subnet-cidr

If you need a step-by-step guide, refer to Getting started with AWS Local Zones. Enabling Local Zones is free of charge. Only deployed services in the Local Zone incur billing.

2. Set up your target database

Now that you have the Local Zone enabled with a subnet, set up your target database instance in the Local Zone subnet that you just created.

You can use AWS CLI to launch it as an EC2 instance:

aws ec2 run-instances \
  --region us-east-1 \
  --subnet-id subnet-08fc749671example \
  --instance-type t3.medium \
  --image-id ami-0abcdef123example \
  --security-group-ids sg-0b0384b66dexample \
  --key-name my-key-pair

You can verify that your EC2 instance is running with the following command:

aws ec2 describe-instances --filters "Name=availability-zone,Values=us-east-1-qro-1a" --query "Reservations[].Instances[].InstanceId"

Output:

$ ["i-0cda255374example"]

Note that not all instance types are available in Local Zones. You can verify it with the following AWS CLI command:

aws ec2 describe-instance-type-offerings --location-type "availability-zone" \
--filters Name=location,Values=us-east-1-qro-1a --region us-east-1

Once you have your instance running in the Local Zone, you can install the database engine matching your source database. Here is an example of how to install MariaDB:

Updates all packages to the latest OS versionsudo yum update -y
Install MySQL server on your instance, this also creates a systemd servicesudo yum install -y mariadb-server
Enable the service created in previous stepsudo systemctl enable mariadb
Start the MySQL server service on your Amazon Linux instancesudo systemctl start mariadb
Set root user password and improve your DB securitysudo mysql_secure_installation

You can confirm successful installation with these commands:

mysql -h localhost -u root -p
SHOW DATABASES;

3. Configure databases for replication

In order for AWS DMS to replicate ongoing changes, you must use change data capture (CDC), as well as set up your source and target database accordingly before replication:

Source database:

Make sure that the binary logs are available to AWS DMS:

call mysql.rds_set_configuration('binlog retention hours', 24);

Set the binlog_format parameter to “ROW“.
Set the binlog_row_image parameter to “Full“.
If you are using Read replica as source, then set the log_slave_updates parameter to TRUE.

For detailed information, refer to Using a MySQL-compatible database as a source for AWS DMS, or sources for your migration if your database engine is different.

Target database:

Create a user for AWS DMS that has read/write privileges to the MySQL-compatible database. To create the necessary privileges, run the following commands.

CREATE USER ''@'%' IDENTIFIED BY '';
GRANT ALTER, CREATE, DROP, INDEX, INSERT, UPDATE, DELETE, SELECT ON .* TO 
''@'%';
GRANT ALL PRIVILEGES ON awsdms_control.* TO ''@'%';

Disable foreign keys on target tables, by adding the next command in the Extra connection attributes section of the AWS DMS console for your target endpoint.

Initstmt=SET FOREIGN_KEY_CHECKS=0;

Set the database parameter local_infile = 1 to enable AWS DMS to load data into the target database.

4. Set up AWS DMS

Now that you have our Local Zone enabled with the target database ready and the source database configured, you can set up AWS DMS Replication instance.

Go to AWS DMS in the AWS Management Console, and under Migrate data select Replication Instances, then select the Create Replication button:

This shows the Create replication Instance, where you should fill up the parameters required:

Note that High Availability is set to Single-AZ, as this is a test workload, while Multi-AZ is recommended for Production workloads.

Refer to the AWS DMS replication instance documentation for details about how to size your replication instance.

Important note

To allow replication, make sure that you set up the replication instance in the VPC that your environment is running, and configure security groups from and to the source and target database.

Now you can create the DMS Source and Target endpoints:

5. Set up endpoints

Source endpoint:

In the AWS DMS console, select Endpoints, select the Create endpoint button, and select Source endpoint option. Then, fill the details required:

Make sure you select your RDS instance as Source by selecting the check box as show in the preceding figure. Moreover, include access to endpoint database details, such as user and password.

You can test your endpoint connectivity before creating it, as shown in the following figure:

If your test is successful, then you can select the Create endpoint button.

Target endpoint:

In the same way as the Source in the console, select Endpoints, select the Create endpoint button, and select Target endpoint option, then enter the details required, as shown in the following figure:

In the Access to endpoint database section, select Provide access information manually option, next add your Local Zone target database connection details as shown below. Notice that Server name value, should be the IP address of your target database.

Make sure you go to the bottom of the page and configure Extra connection attributes in the Endpoint settings, as described in the Configure databases for replication section of this post:

Like the source endpoint, you can test your endpoint connection before creating it.

6. Create the replication task

Once the endpoints are ready, you can create the migration task to start the replication. Under the Migrate Data section, select Database migration tasks, hit the Create task button, and configure your task:

Select Migrate existing data and replicate ongoing changes in the Migration type parameter.

Enable Task logs under Task Settings. This is recommended as it can help you with troubleshooting purposes.

In Table mappings, include the schema you want to replicate to the Local Zone database:

Once you have defined Task Configuration, Task Settings, and Table Mappings, you can proceed to create your database migration task.

This will trigger your migration task. Now wait until the migration task completes successfully.

7. Validate replicated database

After the replication job completes the Full Load, proceed to validate at your target database. Connect to your target database and run the following commands:

USE example;
SHOW TABLES;

As a result you should see the same tables as the source database.

MySQL [example]> SHOW TABLES;
+----------------------------+
| Tables_in_example          |
+----------------------------+
| actor                      |
| address                    |
| category                   |
| city                       |
| country                    |
| customer                   |
| customer_list              |
| film                       |
| film_actor                 |
| film_category              |
| film_list                  |
| film_text                  |
| inventory                  |
| language                   |
| nicer_but_slower_film_list |
| payment                    |
| rental                     |
| sales_by_film_category     |
| sales_by_store             |
| staff                      |
| staff_list                 |
| store                      |
+----------------------------+
22 rows in set (0.06 sec)

If you get the same tables from your source database, then congratulations, you’re set! Now you can maintain and navigate a live copy of database in the Local Zone for data residency purposes.

Clean up

When you have finished this tutorial, you can delete all the resources that have been deployed. You can do this in the Console or by running the following commands in the AWS CLI:

Delete target DB:

aws ec2 terminate-instances --instance-ids i-abcd1234

Decommision AWS DMS

Replication Task:

aws dms delete-replication-task --replication-task-arn arn:aws:dms:us-east-1:111111111111:task:K55IUCGBASJS5VHZJIIEXAMPLE

Endpoints:

aws dms delete-endpoint --endpoint-arn arn:aws:dms:us-east-1:111111111111:endpoint:OUJJVXO4XZ4CYTSEG5XEXAMPLE

Replication instance:

aws dms delete-replication-instance --replication-instance-arn us-east-1:111111111111:rep:T3OM7OUB5NM2LCVZF7JEXAMPLE

Delete Local Zone subnet

aws ec2 delete-subnet --subnet-id subnet-9example

Conclusion

Local Zones is a useful tool for running applications with low latency requirements or data residency regulations. In this post, you have learned how to use AWS DMS to seamlessly replicate your data to Local Zones. With this architecture you can efficiently maintain a local copy of your data in Local Zones and access it securly.

If you are interested on how to automate your workloads deployments in Local Zones, make sure you check this workshop.

Enabling highly available connectivity from on premises to AWS Local Zones

2023-10-18 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/enabling-highly-available-connectivity-from-on-premises-to-aws-local-zones/

This post is written by Leonardo Solano, Senior Hybrid Cloud SA and Robert Belson SA Developer Advocate.

Planning your network topology is a foundational requirement of the reliability pillar of the AWS Well-Architected Framework. REL02-BP02 defines how to provide redundant connectivity between private networks in the cloud and on-premises environments using AWS Direct Connect for resilient, redundant connections using AWS Site-to-Site VPN, or AWS Direct Connect failing over to AWS Site-to-Site VPN. As more customers use a combination of on-premises environments, Local Zones, and AWS Regions, they have asked for guidance on how to extend this pillar of the AWS Well-Architected Framework to include Local Zones. As an example, if you are on an application modernization journey, you may have existing Amazon EKS clusters that have dependencies on persistent on-premises data.

AWS Local Zones enables single-digit millisecond latency to power applications such as real-time gaming, live streaming, augmented and virtual reality (AR/VR), virtual workstations, and more. Local Zones can also help you meet data sovereignty requirements in regulated industries such as healthcare, financial services, and the public sector. Additionally, enterprises can leverage a hybrid architecture and seamlessly extend their on-premises environment to the cloud using Local Zones. In the example above, you could extend Amazon EKS clusters to include node groups in a Local Zone (or multiple Local Zones) or on premises using AWS Outpost rack.

To provide connectivity between private networks in Local Zones and on-premises environments, customers typically consider Direct Connect or software VPNs available in the AWS Marketplace. This post provides a reference implementation to eliminate single points of failure in connectivity while offering automatic network impairment detection and intelligent failover using both Direct Connect and software VPNs in AWS Market place. Moreover, this solution minimizes latency by ensuring traffic does not hairpin through the parent AWS Region to the Local Zone.

Solution overview

In Local Zones, all architectural patterns based on AWS Direct Connect follow the same architecture as in AWS Regions and can be deployed using the AWS Direct Connect Resiliency Toolkit. As of the date of publication, Local Zones do not support AWS managed Site-to-Site VPN (view latest Local Zones features). Thus, for customers that have access to only a single Direct Connect location or require resiliency beyond a single connection, this post will demonstrate a solution using an AWS Direct Connect failover strategy with a software VPN appliance. You can find a range of third-party software VPN appliances as well as the throughput per VPN tunnel that each offering provides in the AWS Marketplace.

Prerequisites:

To get started, make sure that your account is opt-in for Local Zones and configure the following:

Extend a Virtual Private Cloud (VPC) from the Region to the Local Zone, with at least 3 subnets. Use Getting Started with AWS Local Zones as a reference.
1. Public subnet in Local Zone (public-subnet-1)
2. Private subnets in Local Zone (private-subnet-1 and private-subnet-2)
3. Private subnet in the Region (private-subnet-3)
4. Modify DNS attributes in your VPC, including both “enableDnsSupport” and “enableDnsHostnames”;
Attach an Internet Gateway (IGW) to the VPC;
Attach a Virtual Private Gateway (VGW) to the VPC;
Create an ec2 vpc-endpoint attached to the private-subnet-3;
Define the following routing tables (RTB):
1. Private-subnet-1 RTB: enabling propagation for VGW;
2. Private-subnet-2 RTB: enabling propagation for VGW;
3. Public-subnet-1 RTB: with a default route with IGW-ID as the next hop;
Configure a Direct Connect Private Virtual Interface (VIF) from your on-premises environment to Local Zones Virtual Gateway’s VPC. For more details see this post: AWS Direct Connect and AWS Local Zones interoperability patterns;
Launch any software VPN appliance from AWS Marketplace on Public-subnet-1. In this blog post on simulating Site-to-Site VPN customer gateways using strongSwan, you can find an example that provides the steps to deploy a third-party software VPN in AWS Region;
Capture the following parameters from your environment:
1. Software VPN Elastic Network Interface (ENI) ID
2. Private-subnet-1 RTB ID
3. Probe IP, which must be an on-premises resource that can respond to Internet Control Message Protocol (ICMP) requests.

High level architecture

This architecture requires a utility Amazon Elastic Compute Cloud (Amazon EC2) instance in a private subnet (private-subnet-2), sending ICMP probes over the Direct Connect connection. Once the utility instance detects lost packets to on-premises network from the Local Zone it initiates a failover by adding a static route with the on-premises CIDR range as the destination and the VPN Appliance ENI-ID as the next hop in the production private subnet (private-subnet-1), taking priority over the Direct Connect propagated route. Once healthy, this utility will revert back to the default route to the original Direct Connect connection.

On-premises considerations

To add redundancy in the on-premises environment, you can use two routers using any First Hop Redundancy Protocol (FHRP) as Hot Standby Router Protocol (HSRP) or Virtual Router Redundancy Protocol (VRRP). The router connected to the Direct Connect link has the highest priority, taking the Primary role in the FHRP process while the VPN router remain the Secondary router. The failover mechanism in the FHRP relies on interface or protocol state as BGP, which triggers the failover mechanism.

Figure 1. High level HA architecture for Software VPN

Figure 2. Failover by modifying the production subnet RTB

Step-by-step deployment

Create IAM role with permissions to create and delete routes in your private-subnet-1 route table:

Create ec2-role-trust-policy.json file on your local machine:

cat > ec2-role-trust-policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF

Create your EC2 IAM role, such as my_ec2_role:

aws iam create-role --role-name my_ec2_role --assume-role-policy-document file://ec2-role-trust-policy.json

Create a file with the necessary permissions to attach to the EC2 IAM role. Name it ec2-role-iam-policy.json.

aws iam create-policy --policy-name my-ec2-policy --policy-document file://ec2-role-iam-policy.json

Create the IAM policy and attach the policy to the IAM role my_ec2_role that you previously created:

aws iam create-policy --policy-name my-ec2-policy --policy-document file://ec2-role-iam-policy.json

aws iam attach-role-policy --policy-arn arn:aws:iam::<account_id>:policy/my-ec2-policy --role-name my_ec2_role

Create an instance profile and attach the IAM role to it:

aws iam create-instance-profile –instance-profile-name my_ec2_instance_profile
aws iam add-role-to-instance-profile –instance-profile-name my_ec2_instance_profile –role-name my_ec2_role

Launch and configure your utility instance

Capture the Amazon Linux 2 AMI ID through CLI:

aws ec2 describe-images --filters "Name=name,Values=amzn2-ami-kernel-5.10-hvm-2.0.20230404.1-x86_64-gp2" | grep ImageId

Sample output:

"ImageId": "ami-069aabeee6f53e7bf",

Create an EC2 key for the utility instance:

aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem

Launch the utility instance in the Local Zone (replace the variables with your account and environment parameters):

aws ec2 run-instances --image-id ami-069aabeee6f53e7bf --key-name MyKeyPair --count 1 --instance-type t3.medium  --subnet-id <private-subnet-2-id> --iam-instance-profile Name=my_ec2_instance_profile_linux

Deploy failover automation shell script on the utility instance

Create the following shell script in your utility instance (replace the health check variables with your environment values):

cat > vpn_monitoring.sh <<EOF
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT-0
# Health Check variables
Wait_Between_Pings=2
RTB_ID=<private-subnet-1-rtb-id>
PROBE_IP=<probe-ip>
remote_cidr=<remote-cidr>
GW_ENI_ID=<software-vpn-eni_id>
Active_path=DX

echo `date` "-- Starting VPN monitor"

while [ . ]; do
  # Check health of main VPN Appliance path to remote probe ip
  pingresult=`ping -c 3 -W 1 $PROBE_IP | grep time= | wc -l`
  # Check to see if any of the health checks succeeded
  if ["$pingresult" == "0"]; then
    if ["$Active_path" == "DX"]; then
      echo `date` "-- Direct Connect failed. Failing over vpn"
      aws ec2 create-route --route-table-id $RTB_ID --destination-cidr-block $remote_cidr --network-interface-id $GW_ENI_ID --region us-east-1
      Active_path=VPN
      DX_tries=10
      echo "probe_ip: unreachable – active_path: vpn"
    else
      echo "probe_ip: unreachable – active_path: vpn"
    fi
  else     
    if ["$Active_path" == "VPN"]; then
      let DX_tries=DX_tries-1
      if ["$DX_tries" == "0"]; then
        echo `date` "-- failing back to Direct Connect"
        aws ec2 delete-route --route-table-id $RTB_ID --destination-cidr-block $remote_cidr --region us-east-1
        Active_path=DX
        echo "probe_ip: reachable – active_path: Direct Connect"
      else
        echo "probe_ip: reachable – active_path: vpn"
      fi
    else
      echo "probe:ip: reachable – active_path: Direct Connect"	    
    fi
  fi    
done EOF

Modify permissions to your shell script file:

chmod +x vpn_monitoring.sh

Start the shell script:

./vpn_monitoring.sh

Test the environment

Figure 3. Failover process between Direct Connect and software VPN

Simulate failure of the Direct Connect link, breaking the available path from the Local Zone to the on-premises environment. You can simulate the failure using the failure test feature in Direct Connect console.

Figure 4. Bringing BGP session down

Figure 5. Setting the failure time

In the utility instance you will see the following logs:

Thu Sep 21 14:39:34 UTC 2023 -- Direct Connect failed. Failing over vpn

The shell script in action will detect packet loss by ICMP probes against a probe IP destination on premises, triggering the failover process. As a result, it will make an API call (aws ec2 create-route) to AWS using the EC2 interface endpoint.

The script will create a static route in the private-subnet-1-RTB toward on-premises CIDR with the VPN Elastic-Network ID as the next hop.

Figure 6. private-subnet-1-RTB during the test

The FHRP mechanisms detect the failure in the Direct Connect Link and then reduce the FHRP priority on this path, which triggers the failover to the secondary link through the VPN path.

Once you cancel the test or the test finishes, the failback procedure will revert the private-subnet-1 route table to its initial state, resulting in the following logs to be emitted by the utility instance:

Thu Sep 21 14:42:34 UTC 2023 -- failing back to Direct Connect

Figure 7. private-subnet-1 route table initial state

Cleaning up

To clean up your AWS based resources, run following AWS CLI commands:

aws ec2 terminate-instances --instance-ids <your-utility-instance-id>
aws iam delete-instance-profile --instance-profile-name my_ec2_instance_profile
aws iam delete-role my_ec2_role

Conclusion

This post demonstrates how to create a failover strategy for Local Zones using the same resilience mechanisms already established in the AWS Regions. By leveraging Direct Connect and software VPNs, you can achieve high availability in scenarios where you are constrained to a single Direct Connect location due to geographical limitations. In the architectural pattern illustrated in this post, the failover strategy relies on a utility instance with least-privileged permissions. The utility instance identifies network impairment and dynamically modify your production route tables to keep the connectivity established from a Local Zone to your on-premises location. This same mechanism provides capabilities to automatically failback from the software VPN to Direct Connect once the utility instance validates that the Direct Connect Path is sufficiently reliable to avoid network flapping. To learn more about Local Zones, you can visit the AWS Local Zones user guide.

Training machine learning models on premises for data residency with AWS Outposts rack

2023-10-17 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/training-machine-learning-models-on-premises-for-data-residency-with-aws-outposts-rack/

This post is written by Sumit Menaria, Senior Hybrid Solutions Architect, and Boris Alexandrov, Senior Product Manager-Tech.

In this post, you will learn how to train machine learning (ML) models on premises using AWS Outposts rack and datasets stored locally in Amazon S3 on Outposts. With the rise in data sovereignty and privacy regulations, organizations are seeking flexible solutions that balance compliance with the agility of cloud services. Healthcare and financial sectors, for instance, harness machine learning for enhanced patient care and transaction safety, all while upholding strict confidentiality. Outposts rack provide a seamless hybrid solution by extending AWS capabilities to any on-premises or edge location, providing you the flexibility to store and process data wherever you choose. Data sovereignty regulations are highly nuanced and vary by country. This blog post addresses data sovereignty scenarios where training datasets need to be stored and processed in a geographic location without an AWS Region.

Amazon S3 on Outposts

As you prepare datasets for ML model training, a key component to consider is the storage and retrieval of your data, especially when adhering to data residency and regulatory requirements.

You can store training datasets as object data in local buckets with Amazon S3 on Outposts. In order to access S3 on Outposts buckets for data operations, you need to create access points and route the requests via an S3 on Outposts endpoint associated with your VPC. These endpoints are accessible both from within the VPC as well as on premises via the local gateway.

Solution overview

Using this sample architecture, you are going to train a YOLOv5 model on a subset of categories of the Common Objects in Context (COCO) dataset. The COCO dataset is a popular choice for object detection tasks offering a wide variety of image categories with rich annotations. It is also available under the AWS Open Data Sponsorship Program via fast.ai datasets.

This example is based on an architecture using an Amazon Elastic Compute Cloud (Amazon EC2) g4dn.8xlarge instance for model training on the Outposts rack. Depending on your Outposts rack compute configuration, you can use different instance sizes or types and make adjustments to training parameters, such as learning rate, augmentation, or model architecture accordingly. You will be using the AWS Deep Learning AMI to launch your EC2 instance, which comes with frameworks, dependencies, and tools to accelerate deep learning in the cloud.

For the training dataset storage, you are going to use an S3 on Outposts bucket and connect to it from your on-premises location via the Outposts local gateway. The local gateway routing mode can be direct VPC routing or Customer-owned IP (CoIP) depending on your workload’s requirements. Your local gateway routing mode will determine the S3 on Outposts endpoint configuration that you need to use.

1. Download and populate training dataset

You can download the training dataset to your local client machine using the following AWS CLI command:

aws s3 sync s3://fast-ai-coco/ .

After downloading, unzip annotations_trainval2017.zip, val2017.zip and train2017.zip files.

$ unzip annotations_trainval2017.zip
$ unzip val2017.zip
$ unzip train2017.zip

In the annotations folder, the files which you need to use are instances_train2017.json and instances_val2017.json, which contain the annotations corresponding to the images in the training and validation folders.

2. Filtering and preparing training dataset

You are going to use the training, validation, and annotation files from the COCO dataset. The dataset contains over 100K images across 80 categories, but to keep the training simple, you can focus on 10 specific categories of popular food items in supermarket shelves: banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, and cake. (Because who doesn’t like a bite after a model training.) Applications for training such models could be self-stock monitoring, automatic checkouts, or product placement optimization using computer vision in retail stores. Since YOLOv5 uses a specific annotations (labels) format, you need to convert the COCO dataset annotation to the target annotation.

3. Load training dataset to S3 on Outposts bucket

In order to load the training data on S3 on Outposts you need to first create a new bucket using the AWS Console or CLI, as well as an access point and endpoint for the VPC. You can use a bucket style access point alias to load the data, using the following CLI command:

$ cd /your/local/target/upload/path/
$ aws s3 sync . s3://trainingdata-o0a2b3c4d5e6d7f8g9h10f--op-s3

Replace the alias in the above CLI command with corresponding bucket alias name for your environment. The s3 sync command syncs the folders in the same structure containing the images and labels for the training and validation data, which you will be using later for loading it to the EC2 instance for model training.

4. Launch the EC2 instance

You can launch the EC2 instance with the Deep Learning AMI based on this getting started tutorial. For this exercise, the Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) has been used.

5. Download YOLOv5 and install dependencies

Once you ssh into the EC2 instance, activate the pre-configured PyTorch environment and clone the YOLOv5 repository.

$ ssh -i /path/key-pair-name.pem ubuntu@instance-ip-address
$ conda activate pytorch
$ git clone https://github.com/ultralytics/yolov5.git
$ cd yolov5

Then, and install its necessary dependencies.

$ pip install -U -r requirements.txt

To ensure the compatibility between various packages, you may need to modify existing packages on your instance running the AWS Deep Learning AMI.

6. Load the training dataset from S3 on Outposts to the EC2 instance

For copying the training dataset to the EC2 instance, use the s3 sync CLI command and point it to your local workspace.

aws s3 sync s3://trainingdata-o0a2b3c4d5e6d7f8g9h10f--op-s3 .

7. Prepare the configuration files

Create the data configuration files to reflect your dataset’s structure, categories, and other parameters.
data.yml

train: /your/ec2/path/to/data/images/train 
val: /your/ec2/path/to/data/images/val 
nc: 10 # Number of classes in your dataset 
names: ['banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake']

Create the model training parameter file using the sample configuration file from the YOLOv5 repository. You will need to update the number of classes to 10, but you can also change other parameters as you fine tune the model for performance.

parameters.yml:

# Parameters
nc: 10 # number of classes in your dataset
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.50 # layer channel multiple
anchors:
- [10,13, 16,30, 33,23] # P3/8
- [30,61, 62,45, 59,119] # P4/16
- [116,90, 156,198, 373,326] # P5/32

# Backbone
backbone:
[[-1, 1, Conv, [64, 6, 2, 2]], # 0-P1/2
[-1, 1, Conv, [128, 3, 2]], # 1-P2/4
[-1, 3, C3, [128]],
[-1, 1, Conv, [256, 3, 2]], # 3-P3/8
[-1, 6, C3, [256]],
[-1, 1, Conv, [512, 3, 2]], # 5-P4/16
[-1, 9, C3, [512]],
[-1, 1, Conv, [1024, 3, 2]], # 7-P5/32
[-1, 3, C3, [1024]],
[-1, 1, SPPF, [1024, 5]], # 9
]

# Head
head:
[[-1, 1, Conv, [512, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 6], 1, Concat, [1]], # cat backbone P4
[-1, 3, C3, [512, False]], # 13

[-1, 1, Conv, [256, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 4], 1, Concat, [1]], # cat backbone P3
[-1, 3, C3, [256, False]], # 17 (P3/8-small)

[-1, 1, Conv, [256, 3, 2]],
[[-1, 14], 1, Concat, [1]], # cat head P4
[-1, 3, C3, [512, False]], # 20 (P4/16-medium)

[-1, 1, Conv, [512, 3, 2]],
[[-1, 10], 1, Concat, [1]], # cat head P5
[-1, 3, C3, [1024, False]], # 23 (P5/32-large)

[[17, 20, 23], 1, Detect, [nc, anchors]], # Detect(P3, P4, P5)

At this stage, the directory structure should look like below:

8. Train the model

You can run the following command to train the model. The batch-size and epochs can vary depending on your vCPU and GPU configuration and you can further modify these values or add weights as you try with additional rounds of training.

$ python3 train.py —img-size 640 —batch-size 32 —epochs 50 —data /your/path/to/configuation_files/dataconfig.yaml —cfg /your/path/to/configuation_files/parameters.yaml

You can monitor the model performance as it iterates through each epoch

Starting training for 50 epochs...

Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/49 6.7G 0.08403 0.05 0.04359 129 640: 100%|██████████| 455/455 [06:14<00:00,
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 9/9 [00:05<0
all 575 2114 0.216 0.155 0.0995 0.0338

Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
1/49 8.95G 0.07131 0.05091 0.02365 179 640: 100%|██████████| 455/455 [06:00<00:00,
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 9/9 [00:04<00:00, 1.97it/s]
all 575 2114 0.242 0.144 0.11 0.04

Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
2/49 8.96G 0.07068 0.05331 0.02712 154 640: 100%|██████████| 455/455 [06:01<00:00, 1.26it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 9/9 [00:04<00:00, 2.23it/s]
all 575 2114 0.185 0.124 0.0732 0.0273

Once the model training finishes, you can see the validation results against the batch of validation dataset and evaluate the model’s performance using standard metrics.

Validating runs/train/exp/weights/best.pt...
Fusing layers... 
YOLOv5 summary: 157 layers, 7037095 parameters, 0 gradients, 15.8 GFLOPs
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 9/9 [00:06<00:00,  1.48it/s]
                   all        575       2114      0.282      0.222       0.16     0.0653
                banana        575        280      0.189      0.143     0.0759      0.024
                 apple        575        186      0.206      0.085     0.0418     0.0151
              sandwich        575        146      0.368      0.404      0.343      0.146
                orange        575        188      0.265      0.149     0.0863     0.0362
              broccoli        575        226      0.239      0.226      0.138     0.0417
                carrot        575        310      0.182      0.203     0.0971     0.0267
               hot dog        575        108      0.242      0.111     0.0929     0.0311
                 pizza        575        208      0.405      0.418      0.333       0.15
                 donut        575        228      0.352      0.241       0.19     0.0973
                  cake        575        234      0.369      0.235      0.203     0.0853
Results saved to runs/train/exp

Use the model for inference

In order to test the model performance, you can test it by passing a new image which is from a shelf in a supermarket with some of the objects that you trained the model on.

(pytorch) ubuntu@ip-172-31-48-165:~/workspace/source/yolov5$ python3 detect.py --weights /home/ubuntu/workspace/source/yolov5/runs/train/exp/weights/best.pt —source /home/ubuntu/workspace/inference/Inference-image.jpg
<<omitted output>>
Fusing layers...
YOLOv5 summary: 157 layers, 7037095 parameters, 0 gradients, 15.8 GFLOPs
image 1/1 /home/ubuntu/workspace/inference/Inference-image.jpg: 640x640 4 apples, 6 oranges, 1 cake, 5.3ms
Speed: 0.6ms pre-process, 5.3ms inference, 1.1ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp7

The response from the preceding model inference indicates that it predicted 4 apples, 6 oranges, and 1 cake in the image. The prediction may differ based on the image type used, and while a single sample image can give you a sense of the model’s performance, it will not provide a comprehensive understanding. For a more complete evaluation, it’s always recommended to test the model on a larger and more diverse set of validation images. Additional training and tuning of your parameters or datasets may be required to achieve better prediction.

Clean Up

You can terminate the following resources used in this tutorial after you have successfully trained and tested the model:

Terminate the EC2 instance used for the model training;
Delete the S3 on Outposts resources if you no longer need the training dataset.

Conclusion

The seamless integration of compute on AWS Outposts with S3 on Outposts, coupled with on-premises ML model training capabilities, offers organizations a robust solution to tackle data residency requirements. By setting up this environment, you can ensure that your datasets remain within desired geographies while still utilizing advanced machine learning models and cloud infrastructure. In addition to this, it remains essential to diligently review and fine-tune your implementation strategies and guard rails in place to ensure your data remains within the boundaries of your regulatory requirements. You can read more about architecting for data residency in this blog post.

Reference

COCO – Common Objects in Context – fast.ai datasets was accessed on 20-08-2023 from https://registry.opendata.aws/fast-ai-coco.

Quickly Restore Amazon EC2 Mac Instances using Replace Root Volume capability

2023-10-12 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/new-reset-amazon-ec2-mac-instances-to-a-known-state-using-replace-root-volume-capability/

This post is written by Sebastien Stormacq, Principal Developer Advocate.

Amazon Elastic Compute Cloud (Amazon EC2) now supports replacing the root volume on a running EC2 Mac instance, enabling you to restore the root volume of an EC2 Mac instance to its initial launch state, to a specific snapshot, or to a new Amazon Machine Image (AMI).

Since 2021, we have offered on-demand and pay-as-you-go access to Amazon EC2 Mac instances, in the same manner as our Intel, AMD and Graviton-based instances. Amazon EC2 Mac instances integrate all the capabilities you know and love from macOS with dozens of AWS services such as Amazon Virtual Private Cloud (VPC) for network security, Amazon Elastic Block Store (EBS) for expandable storage, Elastic Load Balancing (ELB) for distributing build queues, Amazon FSx for scalable file storage, and AWS Systems Manager Agent (SSM Agent) for configuring, managing, and patching macOS environments.

Just like for every EC2 instance type, AWS is responsible for protecting the infrastructure that runs all of the services oﬀered in the AWS cloud. To ensure that EC2 Mac instances provide the same security and data privacy as other Nitro-based EC2 instances, Amazon EC2 performs a scrubbing workflow on the underlying Dedicated Host as soon as you stop or terminate an instance. This scrubbing process erases the internal SSD, clears the persistent NVRAM variables, and updates the device firmware to the latest version enabling you to run the latest macOS AMIs. The documentation has more details about this process.

The scrubbing process ensures a sanitized dedicated host for each EC2 Mac instance launch and takes some time to complete. Our customers have shared two use cases where they may need to set back their instance to a previous state in a shorter time period or without the need to initiate the scrubbing workflow. The first use case is when patching an existing disk image to bring OS-level or applications-level updates to your fleet, without manually patching individual instances in-place. The second use case is during continuous integration and continuous deployment (CI/CD) when you need to restore an Amazon EC2 Mac instance to a defined well-known state at the end of a build.

To restart your EC2 Mac instance in its initial state without stopping or terminating them, we created the ability to replace the root volume of an Amazon EC2 Mac instance with another EBS volume. This new EBS volume is created either from a new AMI, an Amazon EBS Snapshot, or from the initial volume state during boot.

You just swap the root volume with a new one and initiate a reboot at OS-level. Local data, additional attached EBS volumes, networking configurations, and IAM profiles are all preserved. Additional EBS volumes attached to the instance are also preserved, as well as the instance IP addresses, IAM policies, and security groups.

Let’s see how Replace Root Volume works

To prepare and initiate an Amazon EBS root volume replacement, you can use the AWS Management Console, the AWS Command Line Interface (AWS CLI), or one of our AWS SDKs. For this demo, I used the AWS CLI to show how you can automate the entire process.

To start the demo, I first allocate a Dedicated Host and then start an EC2 Mac instance, SSH-connect to it, and install the latest version of Xcode. I use the open-source xcodeinstall CLI tool to download and install Xcode. Typically, you also download, install, and configure a build agent and additional build tools or libraries as required by your build pipelines.

Once the instance is ready, I create an Amazon Machine Image (AMI). AMIs are disk images you can reuse to launch additional and identical EC2 Mac instances. This can be done from any machine that has the credentials to make API calls on your AWS account. In the following, you can see the commands I issued from my laptop’s Terminal application.

#
# Find the instance’s ID based on the instance name tag
#
~ aws ec2 describe-instances \
--filters "Name=tag:Name,Values=RRV-Demo" \
--query "Reservations[].Instances[].InstanceId" \
--output text 

i-0fb8ffd5dbfdd5384

#
# Create an AMI based on this instance
#
~ aws ec2 create-image \
--instance-id i-0fb8ffd5dbfdd5384 \
--name "macOS_13.3_Gold_AMI"	\
--description "macOS 13.2 with Xcode 13.4.1"

{
 
"ImageId": "ami-0012e59ed047168e4"
}

It takes a few minutes to complete the AMI creation process.

After I created this AMI, I can use my instance as usual. I can use it to build, test, and distribute my application, or make any other changes on the root volume.

When I want to reset the instance to the state of my AMI, I initiate the replace root volume operation:

~ aws ec2 create-replace-root-volume-task	\
--instance-id i-0fb8ffd5dbfdd5384 \
--image-id ami-0012e59ed047168e4
{
"ReplaceRootVolumeTask": {
"ReplaceRootVolumeTaskId": "replacevol-07634c2a6cf2a1c61", "InstanceId": "i-0fb8ffd5dbfdd5384",
"TaskState": "pending", "StartTime": "2023-05-26T12:44:35Z", "Tags": [],
"ImageId": "ami-0012e59ed047168e4", "SnapshotId": "snap-02be6b9c02d654c83", "DeleteReplacedRootVolume": false
}
}

The root Amazon EBS volume is replaced with a fresh one created from the AMI, and the system triggers an OS-level reboot.

I can observe the progress with the DescribeReplaceRootVolumeTasks API

~ aws ec2 describe-replace-root-volume-tasks \
--replace-root-volume-task-ids replacevol-07634c2a6cf2a1c61

{
"ReplaceRootVolumeTasks": [
{
"ReplaceRootVolumeTaskId": "replacevol-07634c2a6cf2a1c61", "InstanceId": "i-0fb8ffd5dbfdd5384",
"TaskState": "succeeded", "StartTime": "2023-05-26T12:44:35Z",
"CompleteTime": "2023-05-26T12:44:43Z", "Tags": [],
"ImageId": "ami-0012e59ed047168e4", "DeleteReplacedRootVolume": false
}
]
}

After a short time, the instance becomes available again, and I can connect over ssh.

~ ssh [email protected]
Warning: Permanently added '3.0.0.86' (ED25519) to the list of known hosts.
Last login: Wed May 24 18:13:42 2023 from 81.0.0.0

┌───┬──┐	 |  |_ )
│ ╷╭╯╷ │	_| (	/
│ └╮	│   |\  |  |
│ ╰─┼╯ │ Amazon EC2
└───┴──┘ macOS Ventura 13.2.1
 
ec2-user@ip-172-31-58-100 ~ %

Additional thoughts

There are a couple of additional points to know before using this new capability:

By default, the old root volume is preserved. You can pass the –-delete-replaced-root-volume option to delete it automatically. Do not forget to delete old volumes and their corresponding Amazon EBS Snapshots when you don’t need them anymore to avoid being charged for them.
During the replacement, the instance will be unable to respond to health checks and hence might be marked as unhealthy if placed inside an Auto Scaled Group. You can write a custom health check to change that behavior.
When replacing the root volume with an AMI, the AMI must have the same product code, billing information, architecture type, and virtualization type as that of the instance.
When replacing the root volume with a snapshot, you must use snapshots from the same lineage as the instance’s current root volume.
The size of the new volume is the largest of the AMI’s block device mapping and the size of the old Amazon EBS root volume.
Any non-root Amazon EBS volume stays attached to the instance.
Finally, the content of the instance store (the internal SSD drive) is untouched, and all other meta-data of the instance are unmodified (the IP addresses, ENI, IAM policies etc.).

Pricing and availability

Replace Root Volume for EC2 Mac is available in all AWS Regions where Amazon EC2 Mac instances are available. There is no additional cost to use this capability. You are charged for the storage consumed by the Amazon EBS Snapshots and AMIs.

Check other options available on the API or AWS CLI and go configure your first root volume replacement task today!

Integrating AWS WAF with your Amazon Lightsail instance

2023-09-26 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/integrating-aws-waf-with-your-amazon-lightsail-instance/

This blog post is written by Riaz Panjwani, Solutions Architect, Canada CSC and Dylan Souvage, Solutions Architect, Canada CSC.

Security is the top priority at AWS. This post shows how you can level up your application security posture on your Amazon Lightsail instances with an AWS Web Application Firewall (AWS WAF) integration. Amazon Lightsail offers easy-to-use virtual private server (VPS) instances and more at a cost-effective monthly price.

Lightsail provides security functionality built-in with every instance through the Lightsail Firewall. Lightsail Firewall is a network-level firewall that enables you to define rules for incoming traffic based on IP addresses, ports, and protocols. Developers looking to help protect against attacks such as SQL injection, cross-site scripting (XSS), and distributed denial of service (DDoS) can leverage AWS WAF on top of the Lightsail Firewall.

As of this post’s publishing, AWS WAF can only be deployed on Amazon CloudFront, Application Load Balancer (ALB), Amazon API Gateway, and AWS AppSync. However, Lightsail can’t directly act as a target for these services because Lightsail instances run within an AWS managed Amazon Virtual Private Cloud (Amazon VPC). By leveraging VPC peering, you can deploy the aforementioned services in front of your Lightsail instance, allowing you to integrate AWS WAF with your Lightsail instance.

Solution Overview

This post shows you two solutions to integrate AWS WAF with your Lightsail instance(s). The first uses AWS WAF attached to an Internet-facing ALB. The second uses AWS WAF attached to CloudFront. By following one of these two solutions, you can utilize rule sets provided in AWS WAF to secure your application running on Lightsail.

Solution 1: ALB and AWS WAF

This first solution uses VPC peering and ALB to allow you to use AWS WAF to secure your Lightsail instances. This section guides you through the steps of creating a Lightsail instance, configuring VPC peering, creating a security group, setting up a target group for your load balancer, and integrating AWS WAF with your load balancer.

Creating the Lightsail Instance

For this walkthrough, you can utilize an AWS Free Tier Linux-based WordPress blueprint.

1. Navigate to the Lightsail console and create the instance.

2. Verify that your Lightsail instance is online and obtain its private IP, which you will need when configuring the Target Group later.

Attaching an ALB to your Lightsail instance

You must enable VPC peering as you will be utilizing an ALB in a separate VPC.

1. To enable VPC peering, navigate to your account in the top-right corner, select the Account dropdown, select Account, then select Advanced, and select Enable VPC Peering. Note the AWS Region being selected, as it is necessary later. For this example, select “us-east-2”. 2. In the AWS Management Console, navigate to the VPC service in the search bar, select VPC Peering Connections and verify the created peering connection.

3. In the left navigation pane, select Security groups, and create a Security group that allows HTTP traffic (port 80). This is used later to allow public HTTP traffic to the ALB.

4. Navigate to the Amazon Elastic Compute Cloud (Amazon EC2) service, and in the left pane under Load Balancing select Target Groups. Proceed to create a Target Group, choosing IP addresses as the target type.

5. Proceed to the Register targets section, and select Other private IP address. Add the private IP address of the Lightsail instance that you created before. Select Include as Pending below and then Create target group (note that if your Lightsail instance is re-launched the target group must be updated as the private IP address may change).

6. In the left pane, select Load Balancers, select Create load balancers and choose Application Load Balancer. Ensure that you select the “Internet-facing” scheme, otherwise, you will not be able to connect to your instance over the internet.

7. Select the VPC in which you want your ALB to reside. In this example, select the default VPC and all the Availability Zones (AZs) to make sure of the high availability of the load balancer.

8. Select the Security Group created in Step 3 to make sure that public Internet traffic can pass through the load balancer.

9. Select the target group under Listeners and routing to the target group you created earlier (in Step 5). Proceed to Create load balancer.

10. Retrieve the DNS name from your load balancer again by navigating to the Load Balancers menu under the EC2 service.

11. Verify that you can access your Lightsail instance using the Load Balancer’s DNS by copying the DNS name into your browser.

Integrating AWS WAF with your ALB

Now that you have ALB successfully routing to the Lightsail instance, you can restrict the instance to only accept traffic from the load balancer, and then create an AWS WAF web Access Control List (ACL).

1. Navigate back to the Lightsail service, select the Lightsail instance previously created, and select Networking. Delete all firewall rules that allow public access, and under IPv4 Firewall add a rule that restricts traffic to the IP CIDR range of the VPC of the previously created ALB.

2. Now you can integrate the AWS WAF to the ALB. In the Console, navigate to the AWS WAF console, or simply navigate to your load balancer’s integrations section, and select Create web ACL.

3. Choose Create a web ACL, and then select Add AWS resources to add the previously created ALB.

4. Add any rules you want to your ACL, these rules will govern the traffic allowed or denied to your resources. In this example, you can add the WordPress application managed rules.

5. Leave all other configurations as default and create the AWS WAF.

6. You can verify your firewall is attached to the ALB in the load balancer Integrations section.

Solution 2: CloudFront and AWS WAF

Now that you have set up ALB and VPC peering to your Lightsail instance, you can optionally choose to add CloudFront to the solution. This can be done by setting up a custom HTTP header rule in the Listener of your ALB, setting up the CloudFront distribution to use the ALB as an origin, and setting up an AWS WAF web ACL for your new CloudFront distribution. This configuration makes traffic limited to only accessing your application through CloudFront, and is still protected by WAF.

1. Navigate to the CloudFront service, and click Create distribution.

2. Under Origin domain, select the load balancer that you had created previously.

3. Scroll down to the Add custom header field, and click Add header.

4. Create your header name and value. Note the header name and value as you will need it later in the walkthrough.

5. Scroll down to the Cache key and origin requests section. Under Cache policy, choose CachingDisabled.

6. Scroll to the Web Application Firewall (WAF) section, and select Enable security protections.

7. Leave all other configurations as default, and click Create distribution.

8. Wait until your CloudFront distribution has been deployed, and verify that you can access your Lightsail application using the DNS under Domain name.

9. Navigate to the EC2 service, and in the left pane under Load Balancing, select Load Balancers.

10. Select the load balancer you created previously, and under the Listeners tab, select the Listener you had created previously. Select Actions in the top right and then select Manage rules.

11. Select the edit icon at the top, and then select the edit icon beside the Default rule.

12. Select the delete icon to delete the Default Action.

13. Choose Add action and then select Return fixed response.

14. For Response code, enter 403, this will restrict access to CloudFront.

15. For Response body, enter “Access Denied”.

16. Select Update in the top right corner to update the Default rule.

17. Select the insert icon at the top, then select Insert Rule.

18. Choose Add Condition, then select Http header. For Header type, enter the Header name, and then for Value enter the Header Value chosen previously.

19. Choose Add Action, then select Forward to and select the target group you had created in the previous section.

20. Choose Save at the top right corner to create the rule.

21. Retrieve the DNS name from your load balancer again by navigating to the Load Balancers menu under the EC2 service.

22. Verify that you can no longer access your Lightsail application using the Load Balancer’s DNS.

23. Navigate back to the CloudFront service and select the Distribution you had created. Under the General tab, select the Web ACL link under the AWS WAF section. Modify the Web ACL to leverage any managed or custom rule sets.

You have successfully integrated AWS WAF to your Lightsail instance! You can access your Lightsail instance via your CloudFront distribution domain name!

Clean Up Lightsail console resources

To start, you will delete your Lightsail instance.

Sign in to the Lightsail console.
For the instance you want to delete, choose the actions menu icon (⋮), then choose Delete.
Choose Yes to confirm the deletion.

Next you will delete your provisioned static IP.

Sign in to the Lightsail console.
On the Lightsail home page, choose the Networking tab.
On the Networking page choose the vertical ellipsis icon next to the static IP address that you want to delete, and then choose Delete.

Finally you will disable VPC peering.

In the Lightsail console, choose Account on the navigation bar.
Choose Advanced.
In the VPC peering section, clear Enable VPC peering for all AWS Regions.

Clean Up AWS console resources

To start, you will delete your Load balancer.

Navigate to the EC2 console, choose Load balancers on the navigation bar.
Select the load balancer you created previously.
Under Actions, select Delete load balancer.

Next, you will delete your target group.

Navigate to the EC2 console, choose Target Groups on the navigation bar.
Select the target group you created previously.
Under Actions, select Delete.

Now you will delete your CloudFront distribution.

Navigate to the CloudFront console, choose Distributions on the navigation bar.
Select the distribution you created earlier and select Disable.
Wait for the distribution to finish deploying.
Select the same distribution after it is finished deploying and select Delete.

Finally, you will delete your WAF ACL.

Navigate to the WAF console, and select Web ACLS on the navigation bar.
Select the web ACL you created previously, and select Delete.

Conclusion

Adding AWS WAF to your Lightsail instance enhances the security of your application by providing a robust layer of protection against common web exploits and vulnerabilities. In this post, you learned how to add AWS WAF to your Lightsail instance through two methods: Using AWS WAF attached to an Internet-facing ALB and using AWS WAF attached to CloudFront.

Security is top priority at AWS and security is an ongoing effort. AWS strives to help you build and operate architectures that protect information, systems, and assets while delivering business value. To learn more about Lightsail security, check out the AWS documentation for Security in Amazon Lightsail.

Introducing Intra-VPC Communication Across Multiple Outposts with Direct VPC Routing

2023-08-30 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/introducing-intra-vpc-communication-across-multiple-outposts-with-direct-vpc-routing/

This blog post is written by Jared Thompson, Specialist Solutions Architect, Hybrid Edge.

Today, we announced AWS Outposts rack support for intra-VPC communication across multiple Outposts. You can now add routes in your Outposts rack subnet route table to forward traffic between subnets within the same VPC spanning across multiple Outposts using the Outpost local gateways (LGW). The LGW enables connectivity between your Outpost subnets and your on-premises network. With this enhancement, you can establish intra-VPC instance-to-instance IP communication across Outposts through your on-premise network, via direct VPC routing.

You can take advantage of this enhancement to architect for high availability for your on-premises applications and, at the same time, improve application performance by reducing the latency between application components that are in the same VPC but running on different Outposts.

This post shows you how you can use intra-VPC communication across multiple Outposts to build a Multi-AZ like architecture for your on-premises applications and services by leveraging direct VPC routing.

To clarify a few concepts before we go into the details: Outposts rack is the 42U form factor of the AWS Outposts Family of services. An Outpost is a pool of AWS compute and storage capacity deployed at a customer’s site. An Outpost may comprise of one or more racks connected together at the site.

Overview

Prior to today’s announcement, applications and services running on multiple Outposts were not able to communicate with each other if they were in the same VPC and if the Outpost was configured to use direct VPC routing. To overcome this limitation it was necessary to separate workloads into multiple VPCs and align each VPC with a separate Outpost, or to configure the Outpost local gateway route table to use Customer-owned IP (CoIP) mode. This limitation was because the traffic between two subnets that are in the same VPC but in disparate Outposts was not able to communicate each other through the service link, as it was blocked in the Region. (See the following diagram in Figure 1)

To show how this worked previously, as an example, let’s assume we have a VPC CIDR range of 10.77.0.0/16, and we want to route 10.77.11.0/24 using the local gateway:

When we attempted to apply this change, we would get the following error message:

The destination CIDR block 10.77.11.0/24 is equal or more specific than one of this VPC’s CIDR blocks. This route can target only an interface or an instance.

Because we were not able to specify a more specific route, we were not able to route between these subnets.

Figure 1 – Prior to this feature, you could not send traffic to the local gateway, as you could not set a route that was more specific than the VPC’s CIDR Range

Using intra-VPC communication across multiple Outposts with direct VPC routing you can now define routes that are more specific than the local VPC CIDR range and has local gateway as target. This enables you to direct traffic from one subnet to another within the same VPC, using the Outpost’s local gateways (LGW). (See Figure 2)

Figure 2 – Two Outpost racks in the same VPC can be configured to communicate over the Outpost local gateways

With this feature, you can design highly available architectures on the edge with multiple Outpost racks, eliminating the need to use multiple VPCs.

Let’s see it in action!

For this example, we will assume that we have a VPC CIDR of 10.77.0.0/16, Outpost A has a subnet CIDR of 10.77.7.0/24, and Outpost B has a subnet CIDR of 10.77.11.0/24. By default, resources on these racks will not be able to communicate with each other since the default local route of each route table within the VPC is set to 10.77.0.0/16. If the traffic is on another Outpost, the traffic would be blocked because service link traffic cannot hairpin through the region. We are going to route this traffic across our on-premises infrastructure. (See Figure 3)

Figure 3 –This is what our example environment looks like. Note, we have one VPC with two Outpost subnets

For the purposes of this example, we are going to assume that the Customer WAN (See Figure 3) is already set up to route traffic between Outpost A and Outpost B subnets. For more information, see Local gateway BGP connectivity in the AWS Outpost documentation. Additionally, we will want to ensure that our local gateway routing tables are in direct VPC routing mode.

Let’s suppose that we want Instance A (10.77.7.88/24) to reach Instance B (10.77.11.119). We will try this with a ping:We can see that none of our pings worked. Since both of these subnets are on two different Outposts, we will need to configure our subnets to route traffic to each other by using intra-VPC communication across multiple Outposts with direct VPC routing.

To enable traffic between these two private subnets, we will configure the routing table to direct traffic towards the neighboring Outpost Subnet to use the Outpost local gateway, allowing traffic to flow between your on-premises network infrastructure. We do this by specifying a more specific route than the default VPC CIDR range.

1.To accomplish this, we will need to associate our VPC with the Outpost’s local gateway route table on each Outpost. From the console, navigate to AWS Outposts / Local gateway route tables. Find the local gateway route table that is associated with each Outpost, go to the VPC associations tab, and select Associate VPC.

Now that these VPC are associated to the local gateway routing table, we will be able to configure the route tables for these subnets to target the Outpost local gateway.

2. For our 10.77.7.0/24 subnet on Outpost Rack A, we will add a route to our other subnet, 10.77.11.0/24 in the subnet’s routing table. One of the target options is Outpost Local Gateway:

Selecting this option will bring up two options, for each of our local gateways. Be sure to select the correct local gateway ID for Outpost A’s local gateway, which is lgw-008e7656cf09c9c21 for my Outpost Rack A.

3. Do the same for our 10.77.11.0/24 subnet, this time setting a destination of 10.77.7.0/24 via the local gateway ID of Outpost Rack B:

Now that we have our routes updated, let’s try our ping again.

Success! We are now able to reach the other instance over the local gateways. This is because our route tables in the Outposts subnets are forwarding traffic over the local gateway, utilizing our on-premises network infrastructure for the communication backbone.

Availability

Intra-VPC communication across multiple Outposts with direct VPC routing is available in all AWS Regions where Outposts rack is available. Your existing Outposts racks may require an update to enable support for Intra-VPC communication across multiple Outposts. If this feature does not work for you, please contact AWS Support.

Conclusion

Utilizing intra-VPC communication across Outposts with direct VPC routing allows you to route traffic between subnets within the same VPC. This feature will allow traffic to route across different Outposts by utilizing Outposts local gateway and your on-premises network, without needing to divide your infrastructure into multiple VPCs. You can take advantage of this enhancement for your on-premises applications, while improving application performance by reducing latency between application components running on multiple Outposts.

Using and Managing Security Groups on AWS Snowball Edge devices

2023-08-25 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/using-and-managing-security-groups-on-aws-snowball-edge-devices/

This blog post is written by Jared Novotny & Tareq Rajabi, Specialist Hybrid Edge Solution Architects.

The AWS Snow family of products are purpose-built devices that allow petabyte-scale movement of data from on-premises locations to AWS Regions. Snow devices also enable customers to run Amazon Elastic Compute Cloud (Amazon EC2) instances with Amazon Elastic Block Storage (Amazon EBS), and Amazon Simple Storage Service (Amazon S3) in edge locations.

Security groups are used to protect EC2 instances by controlling ingress and egress traffic. Once a security group is created and associated with an instance, customers can add ingress and egress rules to control data flow. Just like the default VPC in a region, there is a default security group on Snow devices. A default security group is applied when an instance is launched and no other security group is specified. This default security group in a region allows all inbound traffic from network interfaces and instances that are assigned to the same security group, and allows and all outbound traffic. On Snowball Edge, the default security group allows all inbound and outbound traffic.

In this post, we will review the tools and commands required to create, manage and use security groups on the Snowball Edge device.

Some things to keep in mind:

AWS Snowball Edge is limited to 50 security groups.
An instance will only have one security group, but each group can have a total of 120 rules. This is comprised of 60 inbound and 60 outbound rules.
Security groups can only have allow statements to allow network traffic.
Deny statements aren’t allowed.
Some commands in the Snowball Edge client (AWS CLI) don’t provide an output.
AWS CLI commands can use the name or the security group ID.

Prerequisites and tools

Customers must place an order for Snowball Edge from their AWS Console to be able to run the following AWS CLI commands and configure security groups to protect their EC2 instances.

The AWS Snowball Edge client is a standalone terminal application that customers can run on their local servers and workstations to manage and operate their Snowball Edge devices. It supports Windows, Mac, and Linux systems.

AWS OpsHub is a graphical user interface that you can use to manage your AWS Snowball devices. Furthermore, it’s the easiest tool to use to unlock Snowball Edge devices. It can also be used to configure the device, launch instances, manage storage, and provide monitoring.

Customers can download and install the Snowball Edge client and AWS OpsHub from AWS Snowball resources.

Getting Started

To get started, when a Snow device arrives at a customer site, the customer must unlock the device and launch an EC2 instance. This can be done via AWS OpsHub or the AWS Snowball Edge Client. AWS Snow Family of devices support both Virtual Network Interfaces (VNI) and Direct Network interfaces (DNI), customers should review the types of interfaces before deciding which one is best for their use case. Note that security groups are only supported with VNIs, so that is what was used in this post. A post explaining how to use these interfaces should be reviewed before proceeding.

Viewing security group information

Once the AWS Snowball Edge is unlocked, configured, and has an EC2 instance running, we can dig deeper into using security groups to act as a virtual firewall and control incoming and outgoing traffic.

Although the AWS OpsHub tool provides various functionalities for compute and storage operations, it can only be used to view the name of the security group associated to an instance in a Snowball Edge device:

Every other interaction with security groups must be through the AWS CLI.

The following command shows how to easily read the outputs describing the protocols, sources, and destinations. This particular command will show information about the default security group, which allows all inbound and outbound traffic on EC2 instances running on the Snowball Edge.

In the following sections we review the most common commands with examples and outputs.

View (all) existing security groups:

aws ec2 describe-security-groups --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{
    "SecurityGroups": [
        {
            "Description": "default security group",
            "GroupName": "default",
            "IpPermissions": [
                {
                    "IpProtocol": "-1",
                    "IpRanges": [
                        {
                            "CidrIp": "0.0.0.0/0"
                        }
                    ]
                }
            ],
            "GroupId": "s.sg-8ec664a23666db719",
            "IpPermissionsEgress": [
                {
                    "IpProtocol": "-1",
                    "IpRanges": [
                        {
                            "CidrIp": "0.0.0.0/0"
                        }
                    ]
                }
            ]
        }
    ]
}

Create new security group:

aws ec2 create-security-group --group-name allow-ssh--description "allow only ssh inbound" --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

The output returns a GroupId:

{  "GroupId": "s.sg-8f25ee27cee870b4a" }

Add port 22 ingress to security group:

aws ec2 authorize-security-group-ingress --group-ids.sg-8f25ee27cee870b4a --protocol tcp --port 22 --cidr 10.100.10.0/24 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{    "Return": true }

Note that if you’re using the default security group, then the outbound rule is still to allow all traffic.

Revoke port 22 ingress rule from security group

aws ec2 revoke-security-group-ingress --group-ids.sg-8f25ee27cee870b4a --ip-permissions IpProtocol=tcp,FromPort=22,ToPort=22, IpRanges=[{CidrIp=10.100.10.0/24}] --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{ "Return": true }

Revoke default egress rule:

aws ec2 revoke-security-group-egress --group-ids.sg-8f25ee27cee870b4a --ip-permissions IpProtocol="-1",IpRanges=[{CidrIp=0.0.0.0/0}] --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{ "Return": true }

Note that this rule will remove all outbound ephemeral ports.

Add default outbound rule (revoked above):

aws ec2 authorize-security-group-egress --group-id s.sg-8f25ee27cee870b4a --ip-permissions IpProtocol="-1", IpRanges=[{CidrIp=0.0.0.0/0}] --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{  "Return": true }

Changing an instance’s existing security group:

aws ec2 modify-instance-attribute --instance-id s.i-852971d05144e1d63 --groups s.sg-8f25ee27cee870b4a --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

Note that this command produces no output. We can verify that it worked with the “aws ec2 describe-instances” command. See the example as follows (command output simplified):

aws ec2 describe-instances --instance-id s.i-852971d05144e1d63 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge


    "Reservations": [{
            "Instances": [{
                    "InstanceId": "s.i-852971d05144e1d63",
                    "InstanceType": "sbe-c.2xlarge",
                    "LaunchTime": "2022-06-27T14:58:30.167000+00:00",
                    "PrivateIpAddress": "34.223.14.193",
                    "PublicIpAddress": "10.100.10.60",
                    "SecurityGroups": [
                        {
                            "GroupName": "allow-ssh",
                            "GroupId": "s.sg-8f25ee27cee870b4a"
                        }      ], }  ] }

Changing and instance’s security group back to default:

aws ec2 modify-instance-attribute --instance-ids.i-852971d05144e1d63 --groups s.sg-8ec664a23666db719 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

Note that this command produces no output. You can verify that it worked with the “aws ec2 describe-instances” command. See the example as follows:

aws ec2 describe-instances –instance-ids.i-852971d05144e1d63 –endpoint Https://MySnowIPAddress:8008 –profile SnowballEdge

{
    "Reservations": [
        {  "Instances": [ {
                    "AmiLaunchIndex": 0,
                    "ImageId": "s.ami-8b0223704ca8f08b2",
                    "InstanceId": "s.i-852971d05144e1d63",
                    "InstanceType": "sbe-c.2xlarge",
                    "LaunchTime": "2022-06-27T14:58:30.167000+00:00",
                    "PrivateIpAddress": "34.223.14.193",
                    "PublicIpAddress": "10.100.10.60",
                             "SecurityGroups": [
                        {
                            "GroupName": "default",
                            "GroupId": "s.sg-8ec664a23666db719" ] }

Delete security group:

aws ec2 delete-security-group --group-ids.sg-8f25ee27cee870b4a --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

Sample walkthrough to add a SSH Security Group

As an example, assume a single EC2 instance “A” running on a Snowball Edge device. By default, all traffic is allowed to EC2 instance “A”. As per the following diagram, we want to tighten security and allow only the management PC to SSH to the instance.

1. Create an SSH security group:

aws ec2 create-security-group --group-name MySshGroup--description “ssh access” --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

2. This will return a “GroupId” as an output:

{   "GroupId": "s.sg-8a420242d86dbbb89" }

3. After the creation of the security group, we must allow port 22 ingress from the management PC’s IP:

aws ec2 authorize-security-group-ingress --group-name MySshGroup -- protocol tcp --port 22 -- cidr 192.168.26.193/32 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

4. Verify that the security group has been created:

aws ec2 describe-security-groups ––group-name MySshGroup –endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{
	“SecurityGroups”:   [
		{
			“Description”: “SG for web servers”,
			“GroupName”: :MySshGroup”,
			“IpPermissinos”:  [
				{ “FromPort”: 22,
			 “IpProtocol”: “tcp”,
			 “IpRanges”: [
			{
				“CidrIp”: “192.168.26.193.32/32”
						} ],
					“ToPort”:  22 }],}

5. After the security group has been created, we must associate it with the instance:

aws ec2 modify-instance-attribute –-instance-id s.i-8f7ab16867ffe23d4 –-groups s.sg-8a420242d86dbbb89 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

6. Optionally, we can delete the Security Group after it is no longer required:

aws ec2 delete-security-group --group-id s.sg-8a420242d86dbbb89 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

Note that for the above association, the instance ID is an output of the “aws ec2 describe-instances” command, while the security group ID is an output of the “describe-security-groups” command (or the “GroupId” returned by the console in Step 2 above).

Conclusion

This post addressed the most common commands used to create and manage security groups with the AWS Snowball Edge device. We explored the prerequisites, tools, and commands used to view, create, and modify security groups to ensure the EC2 instances deployed on AWS Snowball Edge are restricted to authorized users. We concluded with a simple walkthrough of how to restrict access to an EC2 instance over SSH from a single IP address. If you would like to learn more about the Snowball Edge product, there are several resources available on the AWS Snow Family site.

Providing durable storage for AWS Outpost servers using AWS Snowcone

2023-08-18 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/providing-durable-storage-for-aws-outpost-servers-using-aws-snowcone/

This blog post is written by Rob Goodwin, Specialist Solutions Architect, Secure Hybrid Edge.

With the announcement of AWS Outposts servers, you now have a streamlined means to deploy AWS Cloud infrastructure to regional offices using the 1 rack unit (1U) or 2 rack unit (2U) Outposts servers where the 42U AWS Outposts rack wasn’t an economical or physical fit.

This post discusses how you can use AWS Snowcone to provide persistent storage for AWS Outposts servers in the case of Amazon Elastic Compute Cloud (Amazon EC2) instance termination or if the Outposts server fails. In this post, we show:

How to leverage the built-in features of Snowcone to provide persistent storage to an EC2 instance.
Optionally replicate the data back to an AWS Region with AWS DataSync. Replicating data back to an AWS Region with DataSync allows for a seamless way to copy data offsite to improve resiliency. Furthermore, it allows the ability to leverage regional AWS Services for machine learning (ML) training.

Background

Outposts servers ship with internal NVMe SSD instance storage. Just like in the Regions, instance storage is allocated directly to the EC2 instance and tied to the lifecycle of the instance. This means that if the EC2 instance is terminated, then the data associated with the instance is deleted. In the event you want data to persist after the instance is terminated, you must use operating system (OS) functions to save and back up to other media or save your data to an external network attached storage or file system.

Mounting an external file system to an EC2 instance is not a new concept in AWS. Using Amazon Elastic File System (Amazon EFS), you can mount the EFS file system to EC2 instance(s).

This architecture may look similar to the following diagram:

Figure 1: AWS VPC showing EC2 Instances mounting Amazon EFS in the Region

In this architecture, EC2 instances are using Amazon EFS for a shared file system.

A main use case for Outposts servers is to deploy applications closer to an end user for quicker response times. If we move our application to the Outposts server to improve the response time to the end user, then we could still use Amazon EFS as a shared file system. However, the latency to read the file system over the service link may affect application performance.

There are third-party network attached storage systems available that could work with Outposts servers. However, Snowcone provides the built-in service of DataSync to replicate data back to the Region and is ideal where physical space and power are limited.

By leveraging Snowcone, we can provide persistent and durable network attached storage external to the Outposts server along with a means to replicate data to and from an AWS Region. Snowcone is a small, rugged, and secure device offering edge computing, data storage, and data transfer.

Solution overview

In this solution, we combine multiple AWS services to provide a durable environment. We use Snowcone as our Network File System (NFS) mount point and leverage the built-in DataSync Agent to replicate the bucket on the Snowcone back to an Amazon Simple Storage Service (Amazon S3) bucket in-Region.

When EC2 instances are launched on the Outposts server, we map the NFS mount point from the Snowcone into the file system of a Linux host through the Outposts server’s Logical Network Interface (LNI). For a Windows system, using the NFS Client for Windows, we can map a drive letter to the NFS mount point as well. The following diagram illustrates this.

Figure 2: EC2 instances on Outposts server attaching to the NFS mount on Snowcone with DataSync replicating data back to Amazon S3 in the AWS Region

Prerequisites

To deploy this solution, you must:

Have the Outposts server installed and authorized.
1. The Outposts server must be fully capable of launching an EC2 instance and being able to communicate through the LNI to local network resources.
Have an AWS Snowcone ordered, connected to the local network, and unlocked.
1. To make sure that NFS is available, the job type must be either Import into Amazon S3 or Export from Amazon S3, as shown in the following figure.
1. Figure 3: Screenshot of Job Type when ordering Snow devices
Have a local client with AWS OpsHub installed.
1. You can use an instance launched on the Outposts server to configure the Snowcone if:
  1. · The LNI is connected on the instance
  2. · The Snowcone is on the network

Steps to activate

Configure NFS on the Snowcone manually.
1. Either statically assign the IP address, or if you’re using DHCP, create an IP reservation to make sure that the NFS mount is consistent. In the following figure, we use 10.0.0.32 as a static IP assigned to the NFS Mount.
(Optional) Start the DataSync Agent on the Snowcone.
1. We assume that the Snowcone has access to the internet in the same way the Outposts server does. Configure the Agent, and then enable tasks. The Agent is used to replicate data from the Snowcone to the Region or from the Region to the Snowcone. The tasks that are created in this step enable replication.
Launch the EC2 instance (either a. or b.)
1. a. Using a Linux OS – When launching an instance on the Outposts server to attach to the NFS mount, make sure that the LNI is configured when launching the instance. In the User data section, enter the commands shown in the following figure to mount the NFS file system from the Snowcone.

Figure 5: Screenshot of User Data section within the Amazon EC2 Launch Wizard

#!/bin/bash
sudo mkdir /var/snowcone
sudo mount -t nfs SNOW-NFS-IP:/buckets /var/snowcone
sudo sh -c “echo ’ SNOW-NFS-IP:/buckets /var/snowcone nfs defaults 0 0’ >> /etc/fstab”

In this OS, we create a directory and then mount the NFS file system to that directory. The echo is used to place the mount into fstab to make sure that the mount is persistent if the instance is rebooted.

b Windows OS – The AMI being used during the launch must include the NFS client. The client is required to mount the NFS. When launching an instance on the Outposts server to attach to the NFS mount, make sure that the LNI is configured when launching the instance. In the User data section, enter the commands shown in the following figure to mount the NFS from the Snowcone as a drive letter.

Figure 6: A screenshot of User Data section of Amazon EC2 Launch wizard with commands to mount NFS to the Windows File System

<powershell>
NET USE Z: \\SNOW-NFS-IP\buckets -P
</powershell>

The NET USE command maps the Z: drive to the NFS mount, and the -P makes it persistent between reboots.

This solution also works with Snowball Edge Storage Optimized. When ordering the Snowball Edge, choose NFS based data transfer for the storage type.

Figure 4: Screenshot of Select the storage type for the Snowball Edge

Conclusion

In this post, we examined how to mount NFS file systems in Snowcone to EC2 instances running on Outposts servers. We also covered starting DataSync Agent on Snowcone to enable data transfer from the edge to an AWS Region. By pairing these services together, you can build persistent and durable storage external to the Outposts servers and replicate your data back to the AWS Region.

If you want to learn more about how to get started with Outposts servers, my colleague Josh Coen and I have published a video series on this topic. The demo series shows you how to unbox an Outposts server, activate the Outposts server, and what you can do with your Outposts server after it is activated. Make sure to check it out!

Create, Use, and Troubleshoot Launch Scripts on Amazon Lightsail

2023-08-15 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/create-use-and-troubleshoot-launch-scripts-on-amazon-lightsail/

This blog post is written by Brian Graf, Senior Developer Advocate, Amazon Lightsail and Sophia Parafina, Senior Developer Advocate.

Amazon Lightsail is a virtual private server (VPS) for deploying both operating systems (OS) and pre-packaged applications, such as WordPress, Plesk, cPanel, PrestaShop, and more. When deploying these instances, you can run launch scripts with additional commands such as installation of applications, configuration of system files, or installing pre-requisites for your application.

Where do I add a launch script?

If you’re deploying an instance with the Lightsail console, launch scripts can be added to an instance at deployment. They are added in the ‘deploy instance’ page:

The launch script must be added before the instance is deployed, because launch scripts can’t retroactively run after deployment.

Anatomy of a Windows Launch Script

When deploying a Lightsail Windows instance, you can use a batch script or a PowerShell script in the ‘launch script’ textbox. Of the two options, PowerShell is more extensible and provides greater flexibility for configuration and control.

If you choose to write your launch script as a batch file, you must add <script> </script> tags at the beginning and end of your code respectively. Alternatively, a launch script in PowerShell, must use the <powershell></powershell> tags in a similar fashion.

After the closing </script> or </powershell> tag, you must add a <persist></persist> tag on the following line. The persist tag is used to determine if this is a run-once command or if it should run every time your instance is rebooted or changed from the ‘Stop’ to ‘Start’ state. If you want your script to run every time the instance is rebooted or started, then you must set the persist tag to ‘true’. If you want your launch script to just run once, then you would set your persist tag to ‘false’.

Anatomy of a Linux Launch Script

Like a Windows launch script, a Linux launch script requires specific code on the first row of the textbox to successfully execute during deployment. You must place ‘#!/bin/bash’ as the first line of code to set the shell that executes the rest of the script. After first line of code, you can continue adding additional commands to achieve the results you want.

How do I know if my Launch Script ran successfully?

Although running launch scripts is convenient to create a baseline instance, it’s possible that your instance doesn’t achieve the desired end-state because of an error in your script or permissions issues. You must troubleshoot to see why the launch script didn’t complete successfully. To find if the launch script ran successfully, refer to the instance logs to determine whether your launch script was successful or not.

For Windows, the launch log can be found in: C:\ProgramData\Amazon\EC2-Windows\launch\Log\UserdataExecution.log. Note that ProgramData is a hidden folder, and unless you access the file from PowerShell or Command Prompt, you must use Windows File Explorer (`View > Show > Hidden items`) folders to see it.

For Linux, the launch log can be found in: /var/log/cloud-init-output.log and can be monitored after your instance launches by tailing the log by typing the following in the terminal:

tail -f /var/log/cloud-init-output.log

If you want to see the entire log file including commands that have already run before you opened the log file, then you can type the following in the terminal:

less +F /var/log/cloud-init-output.log

On a Windows instance, an easy way to monitor the UserdataExecution.log is to add the following code in your launch script, which creates a shortcut to tail or watch the log as commands are executing:

# Create a log-monitoring script to monitor the progress of the launch script execution

$monitorlogs = @"
get-content C:\ProgramData\Amazon\EC2-Windows\launch\Log\UserdataExecution.log -wait
"@

# Save the log-monitoring script to the desktop for the user

$monitorlogs | out-file -FilePath C:\Users\Administrator\Desktop\MonitorLogs.ps1 -Encoding utf8 -Force

</powershell>
<persist>false</persist>

If the script was executed, then the last line of the log should say ‘{Timestamp}: User data script completed’.

However, if you want more detail, you can build the logging into your launch script. For example, you can append a text or log file with each command so that you can read the output in an easy-to-access location:

<powershell>
# Set the location for the log file. In this case,
# it will appear on the desktop of your Lightsail instance
$loc = "c:\Users\Administrator\Desktop\mylog.txt"

# Write text to the log file
Write-Output "Starting Script" >> $loc

# Download and install Chocolatey to do unattended installations of the rest of the apps.
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))

# You could run commands like this to output the progress to the log file:

# Install vscode and all dependencies
choco install -y vscode --force --force-dependencies --verbose >> $loc

# Install git and all dependencies
choco install -y git --force --force-dependencies --verbose >> $loc

# Completed
Write-Output "Completed" >> $loc
</powershell>
<persist>false</persist>

This code creates a log file, outputs data, and appends it along the way. If there is an issue, then you can see where the logs stopped or errors appeared.

For Ubuntu and Amazon Linux 2

If the cloud-init-output.log isn’t comprehensive enough, then you can re-direct the output from your commands to a log file of your choice. In this example, we create a log file in the /tmp/ directory and push all output from our commands to this file.

# Create the log file
touch /tmp/launchscript.log

# Add text to the log file if you so choose
echo 'Starting' >> /tmp/launchscript.log

# Update package index
sudo apt update >> /tmp/launchscript.log

# Install software to manage independent software vendor sources
sudo apt -y install software-properties-common >> /tmp/launchscript.log

# Add the repository for all PHP versions
sudo add-apt-repository -y ppa:ondrej/php >> /tmp/launchscript.log

# Install Web server, mySQL client, PHP (and packages), unzip, and curl
sudo apt -y install apache2 mysql-client-core-8.0 php8.0 libapache2-mod-php8.0 php8.0-common php8.0-imap php8.0-mbstring php8.0-xmlrpc php8.0-soap php8.0-gd php8.0-xml php8.0-intl php8.0-mysql php8.0-cli php8.0-bcmath php8.0-ldap php8.0-zip php8.0-curl unzip curl >> /tmp/launchscript.log

# Any final text you want to include
echo 'Completed' >> /tmp/launchscript.log

It’s possible to check the logs before the launch script has finished executing. One way to follow along is to ‘tail’ the log file. This lets you stream all updates as they occur. You can monitor the log using:

‘tail -f /tmp/launchscript.log’. </code>

Using Launch Scripts from AWS Command Line Interface (AWS CLI)

You can deploy their Lightsail instances from the AWS Command Line Interface (AWS CLI) instead of the Lightsail console. You can add launch scripts to the AWS CLI command as a parameter by creating a variable with the script and referencing the variable, or by saving the launch script as a file and referencing the local file location on your computer.

The launch script is still written the same way as the previous examples. For a Windows instance with a PowerShell launch script, you can deploy a Lightsail instance with a launch script with the following code:

# PowerShell script saved in the Downloads folder:

$loc = "c:\Users\Administrator\Desktop\mylog.txt"

# Write text to the log file

Write-Output "Starting Script" >> $loc

# Download and install Chocolatey to do unattended installations of the rest of the apps.

iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))

# You could run commands like this to output the progress to the log file:

# Install vscode and all dependencies

choco install -y vscode --force --force-dependencies --verbose >> $loc

# Install git and all dependencies

choco install -y git --force --force-dependencies --verbose >> $loc

# Completed

Write-Output "Completed" >> $loc

AWS CLI code to deploy a Windows Server 2019 medium instance in the us-west-2a Availability Zone:

aws lightsail create-instances \

--instance-names "my-windows-instance-1" \

--availability-zone us-west-2a \

--blueprint-id windows_server_2019 \

--bundle-id medium_win_2_0 \

--region us-west-2 \

--user-data file://~/Downloads/powershell_script.ps1

Clean up

Remember to delete resources when you are finished using them to avoid incurring future costs.

Conclusion

You now have the understanding and examples of how to create and troubleshoot Lightsail launch scripts both through the Lightsail console and AWS CLI. As demonstrated in this blog, using launch scripts, you can increase your productivity and decrease the deployment time and configuration of your applications. For more examples of using launch scripts, check out the aws-samples GitHub repository. You now have all the foundational building blocks you need to successfully script automated instance configuration. To learn more about Lightsail, visit the Lightsail service page.

Building a high-performance Windows workstation on AWS for graphics intensive applications

2023-08-09 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/building-a-high-performance-windows-workstation-on-aws-for-graphics-intensive-applications/

This blog post is written by Mike Lim, Senior Public Sector SA.

Video editing, professional visualization, and video games can be resource demanding applications that require high performance Windows workstations with GPUs. When developing these resource demanding applications, a high-performance remote display protocol is desirable in order to access the instances’ graphical desktops from the internet. Using NICE DCV provides a bandwidth-adaptive streaming protocol that provides near real-time responsiveness without compromising image quality. Customers using NICE DCV can leverage Amazon EC2 G4 and Amazon EC2 G5 GPU instances which support graphic-intensive applications in the cloud using a pay-as-you-go pricing model. By using Amazon Elastic Compute Cloud (Amazon EC2) with NICE DCV, customers can run graphically intensive applications remotely and stream their user interface to simpler client machines, eliminating the need for expensive dedicated workstations.

This post shows how you can provision and manage an Amazon EC2 GPU Windows instance and access it via the high-performance NICE DCV remote display protocol.

Solution overview

The solution is illustrated in the following figure.

Figure 1: Solution overview

We used the AWS CloudFormation Infrastructure-as-Code (IaC) service to provision our solution. Our CloudFormation template provides the following functionality:

1. Using AWS CloudFormation, you can specify your choice of EC2 instance type, the Amazon Virtual Private Cloud (Amazon VPC), and subnet in which to provision. You also have the option to assign a static IPv4 address. NICE DCV server is installed to provide remote access, and you can specify the choice of graphics driver to install.

2. A security group is created and associated with the EC2 instance, and it acts as a firewall.

3. An AWS Identity and Access Management (IAM) role is created and associated with the EC2 instance using an instance profile. It lets your instance access Amazon Simple Storage Service (Amazon S3) buckets for NICE DCV server license validation, and download and install the latest graphics drivers.

4. The IAM role also makes sure that your EC2 instance can be managed by AWS Systems Manager. This service provides in-browser command line and graphical access to your Windows instance via Session Manager and Fleet Manager from the AWS Management Console.

Walkthrough

The following sections walk through the steps to setup and maintain your graphics workstation. To begin, you need an AWS account. For this walkthrough, we provision a g5.xlarge instance for cloud gaming.

Check instance type availability

For best performance and lowest latency, you will want to provision EC2 in the AWS Region nearest to you. Before proceeding, verify that the g5.xlarge instance type is available in your desired AWS Region, and the Availability Zones (AZs) in which it is available.

Log in to your Amazon EC2 console and select your AWS Region. From the navigation pane, choose Instance Types to view the instance types available. In the search bar, filter instances types to the specific instance type you want, in this case g5.xlarge. Toggle the display preferences (gear) icon to display Availability zones column.

In the following screenshot, the g5.xlarge instance is available in two of the three AZs in eu-west-2 Europe London Region.

Figure 2: Amazon EC2 console instance types

Check Amazon EC2 running on-demand G instances quota

Your AWS account has a limit on the number and type of EC2 instances types you can run, and you need to make sure you have enough quota to run the g5.xlarge instance.

Go to the Service Quotas console for your AWS Region. Under AWS services in the navigation pane, select Amazon Elastic Compute Cloud (Amazon EC2) and search for Running On-Demand G and VT instances. Verify that the Applied quota value number is equal or more than the number of vCPUs for the instance size you need. In the following screenshot, the applied quota value is 64. It lets us launch instance sizes from 4 vCPUs xlarge up to 64 vCPUs 16xlarge instance size.

Figure 3: Service Quotas console

You can request a higher quota value by selecting Request quota increase.

Using CloudFormation template

Download the CloudFormation template file from aws-samples GitHub repository. Go to the CloudFormation console for your AWS Region to create a stack, and upload your downloaded file.

The CloudFormation parameters page is divided into the following sections:

AMI and instance type
EC2 configuration
Allowed inbound source IP prefixes to NICE DCV port 8443
EBS volume configuration

We go through the configuration settings for each section in detail.

AMI and instance type

In this section, we select the Windows Amazon Machine Image (AMI) to use, EC2 instance type to provision, and graphics driver to install. The default AMI is Microsoft Windows Server 2022.

Replace the instanceType value with g5.xlarge.

Figure 4: CloudFormation parameters: AMI and instance type

Select driverType based on your instance type and the following use case:

AMD: select this for instance types with AMD GPU (G4ad instance).
NVIDIA-Gaming: select this to install the NVIDIA gaming driver, which is optimized for gaming (G5 and G4dn instances).
NVIDIA-GRID: select this to install the GRID driver, which is optimized for professional visualization applications that render content such as 3D models or high-resolution videos (G5, G4dn, and G3 instances).
none: select this option for accelerated computing instances, such as P2 and P3 instances where you download and install public NVIDIA drivers manually.
NICE-DCV: this installs the NICE DCV Virtual Display driver and is suitable for all other instance types.

Note that GRID and NVIDIA gaming drivers’ downloads are available to AWS customers only. Upon installation of the software, you are bound by the terms of the NVIDIA GRID Cloud End User License Agreement.

For our walkthrough, select NVIDIA-Gaming for driverType.

Amazon EC2 configuration

In this section, we specify the VPC and subnet in which to provision our EC2 instance. You can select default VPC from the vpcID dropdown. Make sure that the subnetID value you select is in your selected VPC and resides in an AZ that has your instance type offering. You can also change the EC2 instance name.

Select Yes for the assignStaticIP option if you want to associate a static Internet IPv4 address. Note that there is a small hourly charge when the instance is not running.

Figure 5: CloudFormation parameters: Amazon EC2 configuration

Allowed inbound source IP prefixes to NICE DCV port 8443

Here, we specify the source prefixes allowed to access our instance. The default values allow access from all addresses. To secure access to your instance, you may want to limit the source prefix to your IP address.

To get your IPv4 address, go to https://checkip.amazonaws.com and append /32 to the value for ingressIPv4. The default VPC and subnet is IPv4 only. Therefore, you can enter ::1/128 to explicitly block all IPv6 access for ingressIPv6.

Figure 6: CloudFormation parameters: Allowed inbound source IP prefixes

Amazon EBS volume configuration

The default Amazon Elastic Block Store (Amazon EBS) volume size is 30 GiB. You can specify a larger size by changing the volumeSize value.

Figure 7: CloudFormation parameters: Amazon EBS volume configuration

Continue to provision your stack.

NICE DCV client

NICE DCV provides an HTML5 client for web browser access. For performance and additional features, such as QUIC UDP transport protocol support and USB remotization support, install the native client from the NICE DCV download page. NICE DCV offers native clients for Windows, MacOS for both Intel and Apple M1 processors, and modern Linux distributions including RHEL, SUSE Linux, and Ubuntu.

Cloudformation Outputs

Once provisioning is complete, go to the Outputs section.

Figure 8: CloudFormation Outputs

The following URLs are available.

DCVwebConsole
EC2instance
RdpConnect
SSMsessionManager

We go through the purpose of each URL in the following sections.

SSMsessionManager: Change administrator password

To log in, you must specify an administrator password. Open the SSMsessionManager value URL in a new browser tab and run the command net user administrator <YOUR-PASSWORD> where <YOUR-PASSWORD> is the password on which you decided.

Figure 9: Systems Manager session manager

DCVwebConsole: Connecting to the EC2 instance

Copy DCVWebConsole value, open the NICE DCV client from your local machine and either use the copied value or the IP address to connect. Log in as administrator with the password that you have configured. Alternatively, enter the URL in the format dcv://<EC2-IP-Address> in a browser URL bar to automatically launch and connect a locally installed NICE DCV client to your EC2 instance.

Figure 10: NICE DCV client

EC2instance: manage EC2 instance

Use this link to manage your EC2 instance in the Amazon EC2 console. If you did not select the static IP address option, then use this page to get the assigned IP address whenever you stop and start your instance.

RdpConnect: Fleet Manager console access

The RdpConnect link provides in-browser Remote Desktop Protocol (RDP) console access to your Windows instance. Choose User credentials for Authentication Type. Enter administrator for username and the password that you have configured.

Figure 11: Fleet Manager Remote Desktop

Updating NICE DCV server

To update NICE DCV server, log in via Fleet Manager Remote Desktop and run c:\users\administrator\update-DCV.cmd script. In the following screenshot, we successfully upgraded NICE DCV server from version 2022.2-14521 to 2023.0-15065.

Figure 12: Updating NICE DCV server

Updating graphics drivers

You can use the download-<DRIVER_TYPE>-driver.cmd batch file to download the latest graphics driver for your instance type GPU. Downloaded files are located in the Downloads\Drivers folder.

Figure 13: Graphics driver download scripts

AWS Command Line interface (AWS CLI v2) is installed in the instance. You can use it to view the different versions available on the driver S3 bucket. For example, the command aws s3 ls –recursive s3://nvidia-gaming/windows/ | sort /R lists NVIDIA gaming drivers available for download. NVIDIA GRID and AMD drivers are in the s3://ec2-windows-nvidia-drivers and s3://ec2-amd-windows-drivers S3 buckets respectively.

Figure 14: Listing graphics drivers on S3 bucket

Use the command aws s3 cp s3://<S3_BUCKET_PATH>/<FILE-NAME>. to copy a specific driver from the S3 bucket to your local directory.

You can refer to Install NVIDIA drivers on Windows instances and Install AMD drivers on Windows instances for NVIDIA and AMD drivers installation instructions respectively.

Customizing your EC2 instance environment

You may want to customize the instance to your needs. For NVIDIA GPU instances, you can optimize GPU settings to achieve the best performance.

If you are doing video editing, then you can enable high color accuracy, configure multi-channel audio, and enable accurate audio/video sync. For gaming, you may enable gamepad support to use a DualShock 4 or Xbox 360 controller. NICE DCV session storage is enabled. This lets you transfer files using NICE DCV client. More configuration options are available from the NICE DCV User Guide and Administrator Guide.

Terminating your EC2 instance

When you have finished using your EC2 instance, you can release all provisioned resources by going to CloudFormation console to delete your stack.

Conclusion

The Amazon G4 and G5 GPU instance types are suitable for graphics-intensive applications, and NICE DCV provides a responsive and high image quality display protocol for remote access. Using the CloudFormation template from amazon-ec2-nice-dcv-samples GitHub site, you can build and maintain your own high performance Windows graphics workstation in the AWS cloud

Mixing AWS Graviton with x86 CPUs to optimize cost and resiliency using Amazon EKS

2023-07-13 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/mixing-aws-graviton-with-x86-cpus-to-optimize-cost-and-resilience-using-amazon-eks/

This post is written by Yahav Biran, Principal SA, and Yuval Dovrat, Israel Head Compute SA.

This post shows you how to integrate AWS Graviton-based Amazon EC2 instances into an existing Amazon Elastic Kubernetes Service (Amazon EKS) environment running on x86-based Amazon EC2 instances. Customers use mixed-CPU architectures to enable their application to utilize a wide selection of Amazon EC2 instance types and improve overall application resilience. In order to successfully run a mixed-CPU application, it is strongly recommended that you test application performance in a test environment before running production applications on Graviton-based instances. You can follow AWS’ transition guide to learn more about porting your application to AWS Graviton.

This example shows how you can use KEDA for controlling application capacity across CPU types in EKS. KEDA will trigger a deployment based on the application’s response latency as measured by the Application Load Balancer (ALB). To simplify resource provisioning, Karpenter, an open-source Kubernetes node provisioning software, and AWS Load Balancer Controller, are shown as well.

Solution Overview

There are two solutions that this post covers to test a mixed-CPU application. The first configuration (shown in Figure 1 below) is the “A/B Configuration”. It uses an Application Load Balancer (ALB)-based Ingress to control traffic flowing to x86-based and Graviton-based node pools. You use this configuration to gradually migrate a live application from x86-based instances to Graviton-based instances, while validating the response time with Amazon CloudWatch.

Figure 1, config 1: A/B Configuration

In the second configuration, the “Karpenter Controlled Configuration” (shown in Figure 2 below as Config 2), Karpenter automatically controls the instance blend. Karpenter is configured to use weighted provisioners with values that prioritize AWS Graviton-based Amazon EC2 instances over x86-based Amazon EC2 instances.

Figure 2, config II: Karpenter Controlled Configuration, with Weighting provisioners topology

It is recommended that you start with the “A/B” configuration to measure the response time of live requests. Once your workload is validated on Graviton-based instances, you can build the second configuration to simplify the deployment configuration and increase resiliency. This enables your application to automatically utilize x86-based instances if needed, for example, during an unplanned large-scale event.

You can find the step-by-step guide on GitHub to help you to examine and try the example app deployment described in this post. The following provides an overview of the step-by-step guide.

Code Migration to AWS Graviton

The first step is migrating your code from x86-based instances to Graviton-based instances. AWS has multiple resources to help you migrate your code. These include AWS Graviton Fast Start Program, AWS Graviton Technical Guide GitHub Repository, AWS Graviton Transition Guide, and Porting Advisor for Graviton.

After making any required changes, you might need to recompile your application for the Arm64 architecture. This is necessary if your application is written in a language that compiles to machine code, such as Golang and C/C++, or if you need to rebuild native-code libraries for interpreted/JIT compiled languages such as the Python/C API or Java Native Interface (JNI).

To allow your containerized application to run on both x86 and Graviton-based nodes, you must build OCI images for both the x86 and Arm64 architectures, push them to your image repository (such as Amazon ECR), and stitch them together by creating and pushing an OCI multi-architecture manifest list. You can find an overview of these steps in this AWS blog post. You can also find the AWS Cloud Development Kit (CDK) construct on GitHub to help get you started.

To simplify the process, you can use a Linux distribution package manager that supports cross-platform packages and avoid platform-specific software package names in the Linux distribution wherever possible. For example, use:

RUN pip install httpd

instead of:

ARG ARCH=aarch64 or amd64
RUN yum install httpd.${ARCH}

This blog post shows you how to automate multi-arch OCI image building in greater depth.

Application Deployment

Config 1 – A/B controlled topology

This topology allows you to migrate to Graviton while validating the application’s response time (approximately 300ms) on both x86 and Graviton-based instances. As shown in Figure 1, this design has a single Listener that forwards incoming requests to two Target Groups. One Target Group is associated with Graviton-based instances, while the other Target Group is associated with x86-based instances. The traffic ratio associated with each target group is defined in the Ingress configuration.

Here are the steps to create Config 1:

Create two KEDA ScaledObjects that scale the number of pods based on the latency metric (AWS/ApplicationELB-TargetResponseTime) that matches the target group (triggers.metadata.dimensionValue). Declare the maximum acceptable latency in targetMetricValue:0.3.
Below is the Graviton deployment scaledObject (spec.scaleTargetRef), note the comments that denote the value of the x86 deployment scaledObject

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
…
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: armsimplemultiarchapp #amdsimplemultiarchapp
…
  triggers:                 
    - type: aws-cloudwatch
      metadata:
        namespace: "AWS/ApplicationELB"
        dimensionName: "LoadBalancer"
        dimensionValue: "app/simplemultiarchapp/xxxxxx"
        metricName: "TargetResponseTime"
        targetMetricValue: "0.3"

Once the topology has been created, add Amazon CloudWatch Container Insights to measure CPU, network throughput, and instance performance.
To simplify testing and control for potential performance differences in instance generations, create two dedicated Karpenter provisioners and Kubernetes Deployments (replica sets) and specify the instance generation, CPU count, and CPU architecture for each one. This example uses c7g (Graviton3) and c6i (Intel) . You will remove these constraints in the next topology to allow more allocation flexibility.

The x86-based instances Karpenter provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: x86provisioner
spec:
  requirements:
  - key: karpenter.k8s.aws/instance-generation
    operator: In
    values:
    - "6"
  - key: karpenter.k8s.aws/instance-cpu
    operator: In 
    values:
    - "2"
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64

The Graviton-based instances Karpenter provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: arm64provisioner
spec:
  requirements:
  - key: karpenter.k8s.aws/instance-generation
    operator: In
    values:
    - "7"
  - key: karpenter.k8s.aws/instance-cpu
    operator: In
    values:
    - "2"
  - key: kubernetes.io/arch
    operator: In
    values:
    - arm64

Create two Kubernetes Deployment resources—one per CPU architecture—that use nodeSelector to schedule one Deployment on Graviton-based instances, and another Deployment on x86-based instances. Similarly, create two NodePort Service resources, where each Service points to its architecture-specific ReplicaSet.
Create an Application Load Balancer using the AWS Load Balancer Controller to distribute incoming requests among the different pods. Control the traffic routing in the ingress by adding an ingress.kubernetes.io/actions.weighted-routing annotation. You can adjust the weight in the example below to meet your needs. This migration example started with a 100%-to-0% x86-to-Graviton ratio, adjusting over time by 10% increments until it reached a 0%-to-100% x86-to-Graviton ratio.

…
alb.ingress.kubernetes.io/actions.weighted-routing: | 
{
…
  "targetGroups":[
    {
      "serviceName":"armsimplemultiarchapp-svc",
      "servicePort":"80","weight":50
    },
    {
      "serviceName":"amdsimplemultiarchapp-svc",
      "servicePort":"80","weight":50}]
    }
 }

spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: weighted-routing

You can simulate live user requests to an example application ALB endpoint. Amazon CloudWatch populates ALB Target Group request/second metrics, dimensioned by HTTP response code, to help assess the application throughput and CPU usage.

During the simulation, you will need to verify the following:

Both Graviton-based instances and x86-based instances pods process a variable amount of traffic.
The application response time (p99) meets the performance requirements (300ms).

The orange (Graviton) and blue (x86) curves of HTTP 2xx responses (figure 4) show the application throughput (HTTP requests/seconds) for each CPU architecture during the migration.

Figure 3 HTTP 2XX per CPU architecture

Figure 4 shows an example of application response time during the transition from x86-based instances to Graviton-based instances. The latency associated with each instance family grows and shrinks as the live request simulation changes the load on the application. In this example, the latency on x86 instances (until 07:00) grew up to 300ms because most of the request load was directed at to x86-based pods. It began to converge at around 08:00 when more pods were powered by Graviton-based instances. Finally, after 15:00, the request load was processed by Graviton-based instances entirely.

Figure 4: Target Response Time p99

Config 2 – Karpenter Controlled Configuration

After fully testing the application on Graviton-based EC2 instances, you are ready to simplify the deployment topology with weighted provisioners while preserving the ability to launch x86-based instances as needed.

Here are the steps to create Config 2:

Reuse the CPU-based provisioners from the previous topology, but assign a higher .spec.weight to Graviton-based instances provisioner. The x86 provisioner is still deployed in case x86-based instances are required. The karpenter.k8s.aws/instance-family can be expanded beyond those set in Config 1 or excluded by switching the operator to NotIn.

The x86-based Amazon EC2 instances Karpenter provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: x86provisioner
spec:
  requirements:
  - key: kubernetes.io/arch
    operator: In
    values: [amd64]

The Graviton-based Amazon EC2 instances Karpenter provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: priority-arm64provisioner
spec:
  weight: 10
  requirements:
  - key: kubernetes.io/arch
    operator: In
    values: [arm64]

Next, merge the two Kubernetes deployments into one deployment similar to the original before migration (i.e., no specific nodeSelector that points to a CPU-specific provisioner).

The two services are also combined into a single Kubernetes service and the actions.weighted-routing annotation is removed from the ingress resources:

spec:
  rules:
    - http:
        paths:
          - path: /app
            pathType: Prefix
            backend:
              service:
                name: simplemultiarchapp-svc

Unite the two KEDA ScaledObject resources from the first configuration and point them to a single deployment, e.g., simplemultiarchapp. The new KEDA ScaledObject will be:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: simplemultiarchapp-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: simplemultiarchapp
…

Figure 5 Config 2 – Weighting provisioners results

A synthetic limit on Graviton CPU capacity is set to illustrate the scaling to x86_64 CPUs (Provisioner.limits.resources.cpu). The total application throughput (figure 6) is shown by aarch64_200 (blue) and x86_64_200 (orange). Mixing CPUs did not impact the target response time (Figure 6). Karpenter behaved as expected: prioritizing Graviton-based instances, and bursting to x86-based Amazon EC2 instances when CPU limits were crossed.

Figure 6 Config 2 -HTTP response time p99 with mixed-CPU provisioner

Conclusion

The use of a mixed-CPU architecture enables your application to utilize a wide selection of Amazon EC2 instance types and improves your applications’ resilience while meeting your service-level objectives. Application metrics can be used to control the migration with AWS ALB Ingress, Karpenter, and KEDA. Moreover, AWS Graviton-based Amazon EC2 instances can deliver up to 40% better price performance than x86-based Amazon EC2 instances. Learn more about this example on GitHub and more announcements about Gravtion.