
Adopt Recommendations and Monitor Predictive Scaling for Optimal Compute Capacity

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/evaluating-predictive-scaling-for-amazon-ec2-capacity-optimization/

This post is written by Ankur Sethi, Sr. Product Manager, EC2, and Kinnar Sen, Sr. Specialist Solution Architect, AWS Compute.

Amazon EC2 Auto Scaling helps customers optimize their Amazon EC2 capacity by dynamically responding to varying demand. Based on customer feedback, we enhanced the scaling experience with the launch of predictive scaling policies. Predictive scaling proactively adds EC2 instances to your Auto Scaling group in anticipation of demand spikes. This results in better availability and performance for your applications that have predictable demand patterns and long initialization times. We recently launched a couple of features designed to help you assess the value of predictive scaling – prescriptive recommendations on whether to use predictive scaling based on its potential availability and cost impact, and integration with Amazon CloudWatch to continuously monitor the accuracy of predictions. In this post, we discuss the features in detail and the steps that you can easily adopt to enjoy the benefits of predictive scaling.

Recap: Predictive Scaling

EC2 Auto Scaling helps customers maintain application availability by managing the capacity and health of the underlying cluster. Prior to predictive scaling, EC2 Auto Scaling offered dynamic scaling policies such as target tracking and step scaling. These dynamic scaling policies are configured with an Amazon CloudWatch metric that represents an application’s load. EC2 Auto Scaling constantly monitors this metric and responds according to your policies, thereby triggering the launch or termination of instances. Although it’s extremely effective and widely used, this model is reactive in nature, and for larger spikes it may momentarily leave capacity unfulfilled while the cluster scales out. Customers mitigate this by scaling out aggressively and scaling in conservatively to maintain a buffer of additional instances. However, sometimes applications take a long time to initialize or have a recurring pattern with a sudden spike of high demand. These can have an impact on the initial response of the system when it is scaling out. Customers asked for a proactive scaling mechanism that can scale capacity ahead of predictable spikes, and so we delivered predictive scaling.

Predictive scaling was launched to make the scaling action proactive, anticipating the changes required in compute demand and scaling accordingly. The scaling action is determined by ensemble machine learning (ML) models built with data from your Auto Scaling group’s scaling patterns, as well as billions of data points from our observations. Predictive scaling should be used for applications where demand changes rapidly but with a recurring pattern, instances require a long time to initialize, or where you’re manually invoking scheduled scaling for routine demand patterns. Predictive scaling not only forecasts capacity requirements based on historical usage, but also learns continuously, thereby making forecasts more accurate with time. Furthermore, a predictive scaling policy is designed to only scale out and never scale in your Auto Scaling groups, eliminating the risk of ending up with less capacity because of inexact predictions. You must use a dynamic scaling policy, scheduled scaling, or your own custom mechanism for scale-in. In the case of exceptional demand spikes, adding a dynamic scaling policy can also improve your application performance by bridging the gap between demand and predicted capacity.

What’s new with predictive scaling

Predictive scaling policies can be configured in a non-mutative ‘Forecast Only’ mode to evaluate the accuracy of forecasts. When you’re ready to start scaling, you can switch to the ‘Forecast and Scale’ mode. Now we prescriptively recommend switching your policy to ‘Forecast and Scale’ mode when doing so can potentially lead to better availability and lower costs, saving you the time and effort of performing such an evaluation manually. You can test different configurations by creating multiple predictive scaling policies in ‘Forecast Only’ mode, and choose the one that performs best in terms of availability and cost improvements.
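If you prefer to work from the command line, the following is a minimal sketch of creating a policy in Forecast Only mode with the AWS CLI. The group name, policy name, and 40% CPU target are assumptions to adjust for your workload; switching the Mode value to ForecastAndScale later turns scaling on.

# Group and policy names below are placeholders.
cat << 'EOF' > predictive-policy.json
{
  "MetricSpecifications": [
    {
      "TargetValue": 40,
      "PredefinedMetricPairSpecification": {
        "PredefinedMetricType": "ASGCPUUtilization"
      }
    }
  ],
  "Mode": "ForecastOnly"
}
EOF

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name cpu-forecast-only \
  --policy-type PredictiveScaling \
  --predictive-scaling-configuration file://predictive-policy.json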

Monitoring and observability are key elements of the AWS Well-Architected Framework. Now we also offer CloudWatch metrics for your predictive scaling policies so that you can programmatically monitor your predictive scaling policy for demand pattern changes or prolonged periods of inaccurate predictions. This enables you to monitor the key performance metrics and makes it easier to adopt AWS Well-Architected best practices.

In the following sections, we deep dive into the details of these two features.

Recommendations for predictive scaling

Once you set up an Auto Scaling group with a predictive scaling policy in Forecast Only mode, as explained in this introduction to predictive scaling blog post, you can review the results of the forecast visually and adjust any parameters to more accurately reflect the behavior that you desire. Evaluating purely on the basis of visualization may not be very intuitive if the scaling patterns are erratic. Moreover, if you keep higher minimum capacities, then the graph may show a flat line for the actual capacity, because your Auto Scaling group capacity is an outcome of your existing scaling policy configurations and the minimum capacity that you configured. This makes it difficult to judge whether the lower capacity predicted by predictive scaling would leave your Auto Scaling group under-scaled.

This new feature provides prescriptive guidance on switching predictive scaling to Forecast and Scale mode based on availability and cost savings. To determine the availability and cost savings, we compare the predictions against the actual capacity and the optimal, required capacity. This required capacity is inferred based on whether your instances were running at a higher or lower value than the target value for the scaling metric that you defined as part of the predictive scaling policy configuration. For example, if an Auto Scaling group is running 10 instances at 20% CPU utilization while the target defined in the predictive scaling policy is 40%, then the instances are running under-utilized by 50% and the required capacity is assumed to be 5 instances (half of your current capacity). For an Auto Scaling group, based on the time range in which you’re interested (two weeks by default), we aggregate the cost saving and availability impact of predictive scaling. The availability impact measures the amount of time that the actual metric value was higher than the target value that you defined to be optimal for each policy. Similarly, cost savings measures the aggregated savings based on the capacity utilization of the underlying Auto Scaling group for each defined policy. The final cost and availability lead to a recommendation based on the following:

  • If availability increases (or remains same) and cost reduces (or remains same), then switch on Forecast and Scale
  • If availability reduces, then disable predictive scaling
  • If an availability increase comes at an increased cost, then you should decide based on your cost-availability tradeoff threshold

Figure 1: Predictive Scaling Recommendations on EC2 Auto Scaling console

The preceding figure shows how the console reflects the recommendation for a predictive scaling policy. You get information on whether the policy can lead to higher availability and lower cost, which leads to a recommendation to switch to Forecast and Scale. To achieve this cost saving, you might have to lower your minimum capacity and aim for higher utilization in dynamic scaling policies.

To get the most value from this feature, we recommend that you create multiple predictive scaling policies in Forecast Only mode with different configurations, choosing different metrics and/or different target values. The target value is an important lever that changes how aggressive the capacity forecasts must be. A lower target value increases your capacity forecast, resulting in better availability for your application. However, this also means higher Amazon EC2 costs. Similarly, a higher target value can leave you under-scaled while reactive scaling bridges the gap in just a few minutes. Separate estimates of cost and availability impact are provided for each of the predictive scaling policies. We recommend using a policy if either availability or cost improves and the other variable improves or stays the same. As long as there is a predictable pattern, Auto Scaling enhanced with predictive scaling maintains high availability for your applications.

Continuous Monitoring of predictive scaling

Once you’re using a predictive scaling policy in Forecast and Scale mode based on the recommendation, you must monitor the predictive scaling policy for demand pattern changes or inaccurate predictions. We introduced two new CloudWatch metrics for predictive scaling, ‘PredictiveScalingLoadForecast’ and ‘PredictiveScalingCapacityForecast’. Using the CloudWatch metric math feature, you can create a customized metric that measures the accuracy of predictions. For example, to monitor whether your policy is over- or under-forecasting, you can publish separate metrics to measure the respective errors. In the following graphic, we show how metric math expressions can be used to create a Mean Absolute Error for over-forecasting on the load forecasts. Because predictive scaling can only increase capacity, it is useful to alert when the policy is excessively over-forecasting to prevent unnecessary cost.

Figure 2: Graphing an accuracy metric using metric math on CloudWatch

In the previous graph, the total CPU utilization of the Auto Scaling group is represented by the m1 metric (orange), while the load predicted by the policy is represented by the m2 metric (green). We used the following expression to get the ratio of the over-forecasting error with respect to the actual value.

IF((m2-m1)>0, (m2-m1), 0)/m1

Next, we will set up an alarm to automatically send notifications using Amazon Simple Notification Service (Amazon SNS). You can create similar accuracy monitoring for capacity forecasts, but remember that once the policy is in Forecast and Scale mode, it already starts influencing the actual capacity. Hence, putting alarms on load forecast accuracy may be more intuitive, as load is generally independent of the capacity of an Auto Scaling group.

Figure 3: Creating a CloudWatch Alarm on the accuracy metric

In the preceding screenshot, we have set an alarm that triggers when our custom accuracy metric goes above 0.02 (2%) for 10 out of the last 12 data points, which translates to 10 of the last 12 hours. We prefer to alarm on a greater number of data points so that we get notified only when predictive scaling is consistently giving inaccurate results.
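If you would rather configure the same alarm programmatically, here is a minimal sketch using the AWS CLI. The Auto Scaling group name, policy name, metric namespace and dimensions, and the SNS topic ARN are assumptions; adjust them, and the m1 query, to match the load metric that your policy actually forecasts.

# Namespace and dimensions for the forecast metric are assumptions; confirm them in your account.
cat << 'EOF' > accuracy-alarm-metrics.json
[
  {
    "Id": "m1",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [ { "Name": "AutoScalingGroupName", "Value": "my-asg" } ]
      },
      "Period": 3600,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "m2",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/AutoScaling",
        "MetricName": "PredictiveScalingLoadForecast",
        "Dimensions": [
          { "Name": "AutoScalingGroupName", "Value": "my-asg" },
          { "Name": "PolicyName", "Value": "my-predictive-policy" }
        ]
      },
      "Period": 3600,
      "Stat": "Average"
    },
    "ReturnData": false
  },
  {
    "Id": "e1",
    "Expression": "IF((m2-m1)>0, (m2-m1), 0)/m1",
    "Label": "OverForecastError",
    "ReturnData": true
  }
]
EOF

# Alarm when the error exceeds 0.02 for 10 of the last 12 hourly data points.
aws cloudwatch put-metric-alarm \
  --alarm-name predictive-scaling-over-forecast \
  --metrics file://accuracy-alarm-metrics.json \
  --comparison-operator GreaterThanThreshold \
  --threshold 0.02 \
  --evaluation-periods 12 \
  --datapoints-to-alarm 10 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-sns-topic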

Conclusion

With these new features, you can make a more informed decision about whether predictive scaling is right for you and which configuration makes the most sense. We recommend that you start off with Forecast Only mode and switch over to Forecast and Scale based on the recommendations. Once in Forecast and Scale mode, predictive scaling starts taking proactive scaling actions so that your instances are launched and ready to contribute to the workload in advance of the predicted demand. Then continuously monitor the forecast to maintain high availability and cost optimization of your applications. You can also use the new predictive scaling metrics and CloudWatch features, such as metric math, alarms, and notifications, to monitor and take actions when predictions are off by a set threshold for prolonged periods.

Building Sustainable, Efficient, and Cost-Optimized Applications on AWS

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/building-sustainable-efficient-and-cost-optimized-applications-on-aws/

This blog post is written by Isha Dua, Sr. Solutions Architect, AWS; Ananth Kommuri, Solutions Architect, AWS; and Dr. Sam Mokhtari, Sr. Sustainability Lead SA WA, AWS.

Today, more than ever, sustainability and cost-savings are top of mind for nearly every organization. Research has shown that AWS’ infrastructure is 3.6 times more energy efficient than the median of U.S. enterprise data centers and up to five times more energy efficient than the average in Europe. That said, simply migrating to AWS isn’t enough to meet the Environmental, Social, Governance (ESG) and Cloud Financial Management (CFM) goals that today’s customers are setting. In order to make conscious use of our planet’s resources, applications running on the cloud must be built with efficiency in mind.

That’s because cloud sustainability is a shared responsibility. At AWS, we’re responsible for optimizing the sustainability of the cloud – building efficient infrastructure, enough options to meet every customer’s needs, and the tools to manage it all effectively. As an AWS customer, you’re responsible for sustainability in the cloud – building workloads in a way that minimizes the total number of resource requirements and makes the most of what must be consumed.

Most AWS service charges are correlated with hardware usage, so reducing resource consumption also has the added benefit of reducing costs. In this blog post, we’ll highlight best practices for running efficient compute environments on AWS that maximize utilization and decrease waste, with both sustainability and cost-savings in mind.

First: Measure What Matters

Application optimization is a continuous process, but it has to start somewhere. The AWS Well-Architected Framework Sustainability pillar includes an improvement process that helps customers map their journey and understand the impact of possible changes. There is a saying, “you can’t improve what you don’t measure,” which is why it’s important to define and regularly track metrics that are important to your business. Scope 2 carbon emissions, such as those provided by the AWS Customer Carbon Footprint Tool, are one metric that many organizations use to benchmark their sustainability initiatives, but they shouldn’t be the only one.

Even after AWS meets our 2025 goal of powering our operations with 100% renewable energy, it will still be important to maximize the utilization and minimize the consumption of the resources that you use. Just like installing solar panels on your house, it’s important to limit your total consumption to ensure you can be powered by that energy. That’s why many organizations use proxy metrics such as vCPU hours, storage usage, and data transfer to evaluate their hardware consumption and measure improvements made to infrastructure over time.

In addition to these metrics, it’s helpful to baseline utilization against the value delivered to your end-users and customers. Tracking utilization alongside business metrics (orders shipped, page views, total API calls, etc) allows you to normalize resource consumption with the value delivered to your organization. It also provides a simple way to track progress towards your goals over time. For example, if the number of orders on your ecommerce site remained constant over the last month, but your AWS infrastructure usage decreased by 20%, you can attribute the efficiency gains to your optimization efforts, not changes in your customer behavior.
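As a simple sketch of this idea (the namespace, metric name, and value below are assumptions, not part of any AWS service), you can publish a business metric to CloudWatch and then graph or divide it against usage metrics such as vCPU hours:

# Publish a custom business metric (names and values are illustrative).
aws cloudwatch put-metric-data \
  --namespace "MyApp/Business" \
  --metric-name OrdersShipped \
  --value 1250 \
  --unit Count

# In a CloudWatch dashboard, a metric math expression such as
# vcpu_hours / orders_shipped then tracks consumption per unit of business value.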

Utilize all of the available pricing models

Compute tasks are the foundation of many customers’ workloads, so they typically see the biggest benefit from optimization. Amazon EC2 provides resizable compute across a wide variety of instance types, is well-suited to virtually every use case, and is available via a number of highly flexible pricing options. One of the simplest changes you can make to decrease your costs on AWS is to review the purchase options for the compute and storage resources that you already use.

Amazon EC2 provides multiple purchasing options to enable you to optimize your costs based on your needs. Because every workload has different requirements, we recommend a combination of purchase options tailored to your specific workload needs. For steady-state workloads that can commit to a one- to three-year term, Compute Savings Plans help you save costs while retaining the flexibility to move from one instance type to a newer, more energy-efficient alternative, or even between compute solutions (e.g., from EC2 instances to AWS Lambda functions or AWS Fargate).

EC2 Spot Instances are another great way to decrease cost and increase efficiency on AWS. Spot Instances make unused Amazon EC2 capacity available to customers at discounted prices. At AWS, one of our goals is to maximize utilization of our physical resources. By choosing EC2 Spot Instances, you’re running on hardware that would otherwise be sitting idle in our datacenters. This increases the overall efficiency of the cloud, because more of our physical infrastructure is being used for meaningful work. Spot Instances use market-based pricing that changes automatically based on supply and demand. This means that the hardware with the most spare capacity sees the highest discounts, sometimes up to 90% off On-Demand prices, to encourage our customers to choose that configuration.

Savings Plans are ideal for predictable, steady-state work. On-Demand is best suited for new, stateful, and spiky workloads which can’t be instance, location, or time flexible. Finally, Spot Instances are a great way to supplement the other options for applications that are fault tolerant and flexible. AWS recommends using a mix of pricing models based on your workload needs and ability to be flexible.
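One common way to combine these models is an Auto Scaling group with a mixed instances policy that keeps a small On-Demand base (which a Savings Plan could cover) and fills the rest with Spot. The following AWS CLI sketch uses placeholder names, subnets, and instance types:

# Group name, launch template, subnets, and instance types are placeholders.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-flexible-asg \
  --min-size 2 --max-size 20 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": { "LaunchTemplateName": "my-launch-template", "Version": "$Latest" },
      "Overrides": [ { "InstanceType": "c6g.large" }, { "InstanceType": "c5.large" }, { "InstanceType": "c5a.large" } ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 25,
      "SpotAllocationStrategy": "price-capacity-optimized"
    }
  }'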

By using these pricing models, you’re creating signals for your future compute needs, which helps AWS better forecast resource demands, manage capacity, and run our infrastructure in a more sustainable way.

Choose efficient, purpose-built processors whenever possible

Choosing the right processor for your application is an equally important consideration, because for certain use cases a more efficient processor can deliver the same level of compute performance with a smaller carbon footprint. AWS has the broadest choice of processors, such as Intel Xeon Scalable processors, AMD EPYC processors, GPUs, FPGAs, and custom ASICs for accelerated computing.

AWS Graviton3, AWS’s latest and most power-efficient processor, delivers 3X better CPU performance per-watt than any other processor in AWS, provides up to 40% better price performance over comparable current generation x86-based instances for various workloads, and helps customers reduce their carbon footprint. Consider transitioning your workload to Graviton-based instances to improve the performance efficiency of your workload (see AWS Graviton Fast Start and AWS Graviton2 for ISVs). Note the considerations when transitioning workloads to AWS Graviton-based Amazon EC2 instances.

For machine learning (ML) workloads, use Amazon EC2 instances based on purpose-built ML chips, such as AWS Trainium and AWS Inferentia, and Amazon EC2 DL1 instances.

Optimize for hardware utilization

The goal of efficient environments is to use only as many resources as required in order to meet your needs. Thankfully, this is made easier on the cloud because of the variety of instance choices, the ability to scale dynamically, and the wide array of tools to help track and optimize your cloud usage. At AWS, we offer a number of tools and services that can help you to optimize both the size of individual resources, as well as scale the total number of resources based on traffic and load.

Two of the most important tools to measure and track utilization are Amazon CloudWatch and the AWS Cost & Usage Report (CUR). With CloudWatch, you can get a unified view of your resource metrics and usage, then analyze the impact of user load on capacity utilization over time. The CUR can help you understand which resources are contributing the most to your AWS usage, allowing you to fine-tune your efficiency and save on costs. CUR data is stored in Amazon S3, which allows you to query it with tools like Amazon Athena, generate custom reports in Amazon QuickSight, or integrate it with AWS Partner tools for better visibility and insights.
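For example, once the CUR is integrated with Athena, a query like the following sketch summarizes usage and cost by service. The database, table, partition values, and output bucket are assumptions for your own CUR setup:

# Database, table, partitions, and output location are placeholders.
aws athena start-query-execution \
  --query-execution-context Database=cur_database \
  --result-configuration OutputLocation=s3://my-athena-results/ \
  --query-string "
    SELECT line_item_product_code,
           SUM(line_item_usage_amount)   AS usage_amount,
           SUM(line_item_unblended_cost) AS unblended_cost
    FROM   cur_table
    WHERE  year = '2022' AND month = '12'
    GROUP  BY line_item_product_code
    ORDER  BY unblended_cost DESC"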

An example of a tool powered by CUR data is the AWS Cost Intelligence Dashboard. The Cost Intelligence Dashboard provides a detailed, granular, and recommendation-driven view of your AWS usage. With its prebuilt visualizations, it can help you identify which service and underlying resources are contributing the most towards your AWS usage, and see the potential savings you can realize by optimizing. It even provides right sizing recommendations and the appropriate EC2 instance family to help you optimize your resources.

Cost Intelligence Dashboard is also integrated with AWS Compute Optimizer, which makes instance type and size recommendations based on workload characteristics. For example, it can identify if the workload is CPU-intensive, if it exhibits a daily pattern, or if local storage is accessed frequently. Compute Optimizer then infers how the workload would have performed on various hardware platforms (for example, Amazon EC2 instance types) or using different configurations (for example, Amazon EBS volume IOPS settings, and AWS Lambda function memory sizes) to offer recommendations. For stable workloads, check AWS Compute Optimizer at regular intervals to identify right-sizing opportunities for instances. By right sizing with Compute Optimizer, you can increase resource utilization and reduce costs by up to 25%. Similarly, AWS Lambda Power Tuning can help you choose the memory allocated to your Lambda functions, an optimization process that balances speed (duration) and cost while lowering your carbon emissions in the process.
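As a quick sketch, you can pull these findings with the AWS CLI; the query below assumes the standard GetEC2InstanceRecommendations response fields and prints only the finding and the top recommended instance type:

# Requires Compute Optimizer to be enabled for the account.
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[].{Instance:instanceArn,Finding:finding,Recommended:recommendationOptions[0].instanceType}' \
  --output table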

CloudWatch metrics are used to power EC2 Auto Scaling, which can automatically choose the right instance to fit your needs with attribute-based instance selection and scale your entire instance fleet up and down based on demand in order to maintain high utilization. AWS Auto Scaling makes scaling simple with recommendations that let you optimize performance, costs, or the balance between them. Configuring and testing workload elasticity will help save money, maintain performance benchmarks, and reduce the environmental impact of workloads. You can utilize the elasticity of the cloud to automatically increase the capacity during user load spikes, and then scale down when the load decreases. Amazon EC2 Auto Scaling allows your workload to automatically scale up and down based on demand. You can set up scheduled or dynamic scaling policies based on metrics such as average CPU utilization or average network in or out. Then, you can use AWS Instance Scheduler and scheduled scaling for Amazon EC2 Auto Scaling to shut down or terminate resources that only need to run during business hours or on weekdays, further reducing your carbon footprint.
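For example, a development Auto Scaling group can be scaled to zero overnight and restored each weekday morning with scheduled actions. The group name, sizes, and cron expressions (in UTC) below are assumptions:

# Scale the group to zero each weekday evening...
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-dev-asg \
  --scheduled-action-name scale-in-evenings \
  --recurrence "0 20 * * 1-5" \
  --min-size 0 --max-size 0 --desired-capacity 0

# ...and bring it back each weekday morning.
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-dev-asg \
  --scheduled-action-name scale-out-mornings \
  --recurrence "0 6 * * 1-5" \
  --min-size 1 --max-size 10 --desired-capacity 2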

Design applications to minimize overhead and use fewer resources

Using the latest Amazon Machine Image (AMI) gives you updated operating systems, packages, libraries, and applications, which enable easier adoption as more efficient technologies become available. Up-to-date software includes features to measure the impact of your workload more accurately, as vendors deliver features to meet their own sustainability goals.

By reducing the amount of equipment that your company has on-premises and using managed services, you can help facilitate the move to a smaller, greener footprint. Instead of buying, storing, maintaining, disposing of, and replacing expensive equipment, businesses can purchase services as they need them, services that are already optimized for a greener footprint. Managed services also shift responsibility for maintaining high average utilization and sustainability optimization of the deployed hardware to AWS. Using managed services helps distribute the sustainability impact of the service across all of the service tenants, thereby reducing your individual contribution. The following services help reduce your environmental impact because capacity management is automatically optimized.

AWS managed service and recommendation for sustainability improvement:

  • Amazon Aurora: You can use Amazon Aurora Serverless to automatically start up, shut down, and scale capacity up or down based on your application’s needs.
  • Amazon Redshift: You can use Amazon Redshift Serverless to run and scale data warehouse capacity.
  • AWS Lambda: You can migrate AWS Lambda functions to Arm-based AWS Graviton2 processors.
  • Amazon ECS: You can run Amazon ECS on AWS Fargate to avoid the undifferentiated heavy lifting by leveraging sustainability best practices AWS put in place for management of the control plane.
  • Amazon EMR: You can use EMR Serverless to avoid over- or under-provisioning resources for your data processing jobs.
  • AWS Glue: You can use Auto Scaling for AWS Glue to enable on-demand scaling up and scaling down of the computing resources.

 Centralized data centers consume a lot of energy, produce a lot of carbon emissions and cause significant electronic waste. While more data centers are moving towards green energy, an even more sustainable approach (alongside these so-called “green data centers”) is to actually cut unnecessary cloud traffic, central computation and storage as much as possible by shifting computation to the edge. Edge Computing stores and uses data locally, on or near the device it was created on. This reduces the amount of traffic sent to the cloud and, at scale, can limit the overall energy used and carbon emissions.

Use storage that best supports how your data is accessed and stored to minimize the resources provisioned while supporting your workload. Solid state drives (SSDs) are more energy intensive than magnetic drives and should be used only for active data use cases. You should look into using ephemeral storage whenever possible and categorize, centralize, deduplicate, and compress persistent storage.

AWS Outposts, AWS Local Zones, and AWS Wavelength deliver data processing, analysis, and storage close to your endpoints, allowing you to deploy APIs and tools to locations outside AWS data centers. Build high-performance applications that can process and store data close to where it’s generated, enabling ultra-low latency, intelligent, and real-time responsiveness. By processing data closer to the source, edge computing can reduce latency, which means that less energy is required to keep devices and applications running smoothly. Edge computing can also help to reduce the carbon footprint of data centers by using renewable energy sources such as solar and wind power.

Conclusion

In this blog post, we discussed key methods and recommended actions you can take to optimize your AWS compute infrastructure for resource efficiency. Using the appropriate EC2 instance types with the right size, processor, instance storage, and pricing model can enhance the sustainability of your applications. Using AWS managed services, exploring options for edge computing, and continuously optimizing your resource usage can further improve the energy efficiency of your workloads. You can also analyze the changes in your emissions over time as you migrate workloads to AWS, re-architect applications, or deprecate unused resources using the Customer Carbon Footprint Tool.

Ready to get started? Check out the AWS Sustainability page to find out more about our commitment to sustainability and learn more about renewable energy usage, case studies on sustainability through the cloud, and more.

Building a Cloud in the Cloud: Running Apache CloudStack on Amazon EC2, Part 2

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/building-a-cloud-in-the-cloud-running-apache-cloudstack-on-amazon-ec2-part-2/

This blog is written by Mark Rogers, SDE II – Customer Engineering AWS.

In part 1, I showed you how to run Apache CloudStack with KVM on a single Amazon Elastic Compute Cloud (Amazon EC2) instance. That simple setup is great for experimentation and light workloads. In this post, things will get a lot more interesting. I’ll show you how to create an overlay network in your Amazon Virtual Private Cloud (Amazon VPC) that allows CloudStack to scale horizontally across multiple EC2 instances. This same method could work with other hypervisors, too.

If you haven’t read it yet, then start with part 1. It explains why this network setup is necessary. The same prerequisites apply to both posts.

Making things easier

I wrote some scripts to automate the CloudStack installation and OS configuration on CentOS 7. You can customize them to meet your needs. I also wrote some AWS CloudFormation templates you can copy in order to create a demo environment. The README file has more details.

The scalable method

Our team started out using a single EC2 instance, as described in my last post. That worked at first, but it didn’t have the capacity we needed. We were limited to a couple of dozen VMs, but we needed hundreds. We also needed to scale up and down as our needs changed. This meant we needed the ability to add and remove CloudStack hosts. Using a Linux bridge as a virtual subnet was no longer adequate.

To support adding hosts, we need a subnet that spans multiple instances. The solution I found is Virtual Extensible LAN (VXLAN). It’s lightweight, easy to configure, and included in the Linux kernel. VXLAN creates a layer 2 overlay network that abstracts away the details of the underlying network. It allows machines in different parts of a network to communicate as if they’re all attached to the same simple network switch.

Another example of an overlay network is an Amazon VPC. It acts like a physical network, but it’s actually a layer on top of other networks. It’s networks all the way down. VXLAN provides a top layer where CloudStack can sit comfortably, handling all of your VM needs, blissfully unaware of the world below it.

An overlay network comes with some big advantages. The biggest improvement is that you can have multiple hosts, allowing for horizontal scaling. Having more hosts not only gives you more computing power, but also lets you do rolling maintenance. Instead of putting the database and file storage on the management server, I’ll show you how to use Amazon Elastic File System (Amazon EFS) and Amazon Relational Database Service (Amazon RDS) for scalable and reliable storage.

EC2 Instances

Let’s start with three Amazon EC2 instances. One will be a router between the overlay network and your Amazon VPC, the second one will be your CloudStack management server, and the third one will be your VM host. You’ll also need a way to connect to your instances, such as a bastion host or a VPN endpoint.

Three EC2 instances are connected to an AWS subnet, with an overlay network that spans all three instances. The router instance connects the overlay network to the AWS subnet. The management instance contains the CloudStack management service, which is attached to the overlay network. The host instance contains the CloudStack agent and some VMs, all of which are connected to the overlay network.

VXLAN must send and receive multicast traffic. Only Nitro instances can be multicast senders. As you plan, look at the list of Nitro instance types.

The router won’t need much computing power, but it will need enough network bandwidth to meet your needs. If you put Amazon EFS in the same subnet as your instances, then they’ll communicate with it directly, thereby reducing the load on the router. Decide how much network throughput you want, and then pick a suitable Nitro instance type.

After creating the router instance, configure AWS to use it as a router. Stop source/destination checking in the instance’s network settings. Then update the applicable AWS route tables to use the router as the target for the overlay network. The router’s security group needs to allow ingress to the CloudStack UI (TCP port 8080) and any services you plan to offer from VMs.

For the management server, you’ll want a Nitro instance type. It’s going to use more CPU than your router, so plan accordingly.

In addition to being a Nitro type, the host instance must also be a metal type. Metal instances have hardware virtualization support, which is needed by KVM. If you have a new AWS account with a low on-demand vCPU limit, then consider starting with an m5zn.metal, which has 48 vCPUs. Otherwise, I suggest going directly to a c5.metal because it provides 96 vCPUs for a similar price. There are bigger types available depending on your compute needs, budget, and vCPU limit. If your account’s on-demand vCPU limit is too low, then you can file a support ticket to have it raised.

Networking

All of the instances should be on a dedicated subnet. Sharing the subnet with other instances can cause communication issues. For an example, refer to the following figure. The subnet has an instance named TroubleMaker that’s not on the overlay network. If TroubleMaker sends a request to the management instance’s overlay network address, then here’s what happens:

  1. The request goes through the AWS subnet to the router.
  2. The router forwards the request via the overlay network.
  3. The CloudStack management instance has a connection to the same AWS subnet that TroubleMaker is on. Therefore, it responds directly instead of using the router. This isn’t the return path that AWS is expecting, so the response is dropped.

This diagram depicts the steps described in the previous paragraph.

If you move TroubleMaker to a different subnet, then the requests and responses will all go through the router. That will fix the communication issues.

The instances in the overlay network will use special interfaces that serve as VXLAN tunnel endpoints (VTEPs). The VTEPs must know how to contact each other via the underlay network. You could manually give each instance a list of all of the other instances, but that’s a maintenance nightmare. It’s better to let the VTEPs discover each other, which they can do using multicast. You can add multicast support using AWS Transit Gateway.

Here are the steps to make VXLAN multicasts work:

  1. Enable multicast support when you create the transit gateway.
  2. Attach the transit gateway to your subnet.
  3. Create a transit gateway multicast domain with IGMPv2 support enabled.
  4. Associate the multicast domain with your subnet.
  5. Configure the eth0 interface on each instance to use IGMPv2. The networking snippet later in this section shows how to do this.
  6. Make sure that your instance security groups allow ingress for IGMP queries (protocol 2 traffic from 0.0.0.0/32) and VXLAN traffic (UDP port 4789 from the other instances). A CLI sketch of the transit gateway and security group steps follows this list.
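The following AWS CLI sketch shows what steps 1 through 4 and step 6 can look like; all of the resource IDs are placeholders that you must replace with your own.

# Steps 1-4: transit gateway with multicast and IGMPv2 support, attached and
# associated with the subnet. IDs below are placeholders.
aws ec2 create-transit-gateway --options MulticastSupport=enable
aws ec2 create-transit-gateway-vpc-attachment \
  --transit-gateway-id tgw-0123456789abcdef0 \
  --vpc-id vpc-0123456789abcdef0 \
  --subnet-ids subnet-0123456789abcdef0
aws ec2 create-transit-gateway-multicast-domain \
  --transit-gateway-id tgw-0123456789abcdef0 \
  --options Igmpv2Support=enable
aws ec2 associate-transit-gateway-multicast-domain \
  --transit-gateway-multicast-domain-id tgw-mcast-domain-0123456789abcdef0 \
  --transit-gateway-attachment-id tgw-attach-0123456789abcdef0 \
  --subnet-ids subnet-0123456789abcdef0

# Step 6: allow IGMP queries and VXLAN traffic between the instances.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --ip-permissions 'IpProtocol=2,IpRanges=[{CidrIp=0.0.0.0/32}]'
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol udp --port 4789 \
  --source-group sg-0123456789abcdef0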

CloudStack VMs must connect to the same bridge as the VXLAN interface. As mentioned in the previous post, CloudStack cares about names. I recommend giving the interface a name starting with “eth”. Moreover, this naming convention tells CloudStack which bridge to use, thereby avoiding the need for a dummy interface like the one in the simple setup.

The following snippet shows how I configured the networking in CentOS 7. You must provide values for these variables:

  • $overlay_host_ip_address, $overlay_netmask, and $overlay_gateway_ip: Use values for the overlay network that you’re creating.
  • $dns_address: I recommend using the base of the VPC IPv4 network range, plus two. You shouldn’t use 169.254.169.253 because CloudStack reserves link-local addresses for its own use.
  • $multicast_address: The multicast address that you want VXLAN to use. Pick something in the multicast range that won’t conflict with anything else. I recommend choosing from the IPv4 local scope (239.255.0.0/16).
  • $interface_name: The name of the interface VXLAN should use to communicate with the physical network. This is typically eth0.

A couple of the steps are different for the router instance than for the other instances. Pay attention to the comments!

yum install -y bridge-utils net-tools

# IMPORTANT: Omit the GATEWAY setting on the router instance!
cat << EOF > /etc/sysconfig/network-scripts/ifcfg-cloudbr0
DEVICE=cloudbr0
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=none
IPV6INIT=no
IPV6_AUTOCONF=no
DELAY=5
STP=no
USERCTL=no
NM_CONTROLLED=no
IPADDR=$overlay_host_ip_address
NETMASK=$overlay_netmask
DNS1=$dns_address
GATEWAY=$overlay_gateway_ip
EOF

cat << EOF > /sbin/ifup-local
#!/bin/bash
# Set up VXLAN once cloudbr0 is available.
if [[ \$1 == "cloudbr0" ]]
then
    ip link add ethvxlan0 type vxlan id 100 dstport 4789 group "$multicast_address" dev "$interface_name"
    brctl addif cloudbr0 ethvxlan0
    ip link set up dev ethvxlan0
fi
EOF

chmod +x /sbin/ifup-local

# Transit Gateway requires IGMP version 2
echo "net.ipv4.conf.$interface_name.force_igmp_version=2" >> /etc/sysctl.conf
sysctl -p

# Enable IPv4 forwarding
# IMPORTANT: Only do this on the router instance!
echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf
sysctl -p

# Restart the network service to make the changes take effect.
systemctl restart network

Storage

Let’s look at storage. Create an Amazon RDS database using the MySQL 8.0 engine, and set a master password for CloudStack’s database setup tool to use. Refer to the CloudStack documentation to find the MySQL settings that you’ll need. You can put the settings in an RDS parameter group. In case you’re wondering why I’m not using Amazon Aurora, it’s because CloudStack needs the MyISAM storage engine, which isn’t available in Aurora.
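Here’s a rough sketch of that setup with the AWS CLI. The identifiers, instance class, storage size, subnet group, and security group are placeholders, and you still need to apply the MySQL parameters from the CloudStack documentation to the parameter group:

# All names and IDs below are placeholders.
aws rds create-db-parameter-group \
  --db-parameter-group-name cloudstack-mysql8 \
  --db-parameter-group-family mysql8.0 \
  --description "MySQL settings required by CloudStack"

aws rds create-db-instance \
  --db-instance-identifier cloudstack-db \
  --engine mysql \
  --db-instance-class db.m5.large \
  --allocated-storage 100 \
  --master-username cloudstack \
  --master-user-password 'REPLACE_ME' \
  --db-parameter-group-name cloudstack-mysql8 \
  --db-subnet-group-name my-db-subnet-group \
  --vpc-security-group-ids sg-0123456789abcdef0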

I recommend Amazon EFS for file storage. For efficiency, create a mount target in the subnet with your EC2 instances. That will enable them to communicate directly with the mount target, thereby bypassing the overlay network and router. Note that the system VMs will use Amazon EFS via the router.

If you want, you can consolidate your CloudStack file systems. Just create a single file system with directories for each zone and type of storage. For example, I use directories named /zone1/primary, /zone1/secondary, /zone2/primary, etc. You should also consider enabling provisioned throughput on the file system, or you may run out of bursting credits after booting a few VMs.
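A minimal sketch of creating such a file system and a mount target with the AWS CLI follows; the throughput value and resource IDs are placeholders:

# Provisioned throughput avoids running out of bursting credits after booting VMs.
aws efs create-file-system \
  --performance-mode generalPurpose \
  --throughput-mode provisioned \
  --provisioned-throughput-in-mibps 128 \
  --tags Key=Name,Value=cloudstack-storage

aws efs create-mount-target \
  --file-system-id fs-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --security-groups sg-0123456789abcdef0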

One consequence of the file system’s scalability is that the amount of free space (8 exabytes) will cause an integer overflow in CloudStack! To avoid this problem, reduce storage.overprovisioning.factor in CloudStack’s global settings from 2 to 1.

When your environment is ready, install CloudStack. When it asks for a default gateway, remember to use the router’s overlay network address. When you add a host, make sure that you use the host’s overlay network IP address.

Cleanup

If you used my CloudFormation template, delete the stack and remove any route table entries you added.

If you didn’t use CloudFormation, here are the things to delete:

  1. The CloudStack EC2 instances
  2. The Amazon RDS database and parameter group
  3. The Amazon EFS file system
  4. The transit gateway multicast domain subnet association
  5. The transit gateway multicast domain
  6. The transit gateway VPC attachment
  7. The transit gateway
  8. The route table entries that you created
  9. The security groups that you created for the instances, database, and file system

Conclusion

The approach I shared has many steps, but they’re not bad when you have a plan. Whether you need a simple setup for experiments, or you need a scalable environment for a data center migration, you now have a path forward. Give it a try, and comment here about the things you learned. I hope you find it useful and fun!

“Apache”, “Apache CloudStack”, and “CloudStack” are trademarks of the Apache Software Foundation.

Building a Cloud in the Cloud: Running Apache CloudStack on Amazon EC2, Part 1

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/building-a-cloud-in-the-cloud-running-apache-cloudstack-on-amazon-ec2-part-1/

This blog is written by Mark Rogers, SDE II – Customer Engineering AWS.

How do you put a cloud inside another cloud? Some features that make Amazon Elastic Compute Cloud (Amazon EC2) secure and wonderful also make running CloudStack difficult. The biggest obstacle is that AWS and CloudStack both want to manage network resources. Therefore, we must keep them out of each other’s way. This requires some steps that aren’t obvious, and it took a long time to figure out. I’m going to share what I learned, so that you can navigate the process more easily.

Apache CloudStack is an open-source platform for deploying and managing virtual machines (VMs) and the associated network and storage infrastructure. You would normally run it on your own hardware to create your own cloud. But there can be advantages to running it inside of an Amazon Virtual Private Cloud (Amazon VPC), including how it could help you migrate out of a data center. It’s a great way to create disposable environments for experiments or training. Furthermore, it’s a convenient way to test-drive the new CloudStack support in Amazon Elastic Kubernetes Service (Amazon EKS) Anywhere. In my case, I needed to create development and test environments for a project that uses the CloudStack API. The environments needed to be shared and scalable. Our build pipelines were already in AWS, so it made sense to put the new environments there, too.

CloudStack can work with a number of hypervisors. The instructions in this article will use Kernel-based Virtual Machine (KVM) on Linux. KVM will manage the VMs at a low level, and CloudStack will manage KVM.

Prerequisites

Most of the information in this article should be applicable to a range of CloudStack versions. I targeted CloudStack 4.14 on CentOS 7. I also tested CloudStack versions 4.16 and 4.17, and I recommend them.

The official CentOS 7 x86_64 HVM image works well. If you use a different Linux flavor or version, then you might have to modify some of the implementation details.

You’ll need to know the basics of CloudStack. The scope of this article is making CloudStack and AWS coexist peacefully. Once CloudStack is running, I’m assuming that you’ll handle things from there.  Refer to the AWS documentation and CloudStack documentation for information on security and other best practices.

Making things easier

I wrote some scripts to automate the installation. You can run them on EC2 instances with CentOS 7, and they’ll do all the installation and OS configuration for you. You can use them as they are, or customize them to meet your needs. I also wrote some AWS CloudFormation templates you can copy in order to create a demo environment. The README file has more details.

Amazon EC2 instance types

KVM requires hardware virtualization support. Most EC2 instances are VMs that don’t support nested virtualization. To get access to the bare hardware, you need a metal instance type.

I like c5.metal because it’s one of the least expensive metal types, and has a low cost per vCPU. It has 96 vCPUs and 192 GiB of memory. If you run 20 VMs on it, with 4 CPU cores and 8 GiB of memory each, then you’d still have 16 vCPUs and 32 GiB to share between the operating system, CloudStack, and MySQL. Using CloudStack’s overprovisioning feature, you could fit even more VMs if they’re running light loads.

Networking

The biggest challenge is the network. AWS knows which IP and MAC addresses should exist, and it knows the machines to which they should belong. It blocks any traffic that doesn’t fit its idea of how the network should behave. Simultaneously, CloudStack assumes that any IP or MAC address it invents should work just fine. When CloudStack assigns addresses to VMs on an AWS subnet, their network traffic gets blocked.

You could get around this by enabling network address translation (NAT) on the instance running CloudStack. That’s a great solution if it fits your needs, but it makes it hard for other machines in your Amazon VPC to contact your VMs. I recommend a different approach.

Although AWS restricts what you can do with its layer 2 network, it’s perfectly happy to let you run your own layer 3 router. Your EC2 instance can act as a router to a new virtual subnet that’s outside of the jurisdiction of AWS. The instance integrates with AWS just like a VPN appliance, routing traffic to wherever it needs to go. CloudStack can do whatever it wants in the virtual subnet, and everybody’s happy.

What do I mean by a virtual subnet? This is a subnet that exists only inside the EC2 instance.  It consists of logical network interfaces attached to a Linux bridge. The entire subnet exists inside a single EC2 instance. It doesn’t scale well, but it’s simple. In my next post, I’ll cover a more complicated setup with an overlay network that spans multiple instances to allow horizontal scaling.

The simple way

The simple way is to put everything in one EC2 instance, including the database, file storage, and a virtual subnet. Because everything’s stored locally, allocate enough disk space for your needs. 500 GB will be enough to support a few basic VMs. Create or select a security group for your instance that gives users access to the CloudStack UI (TCP port 8080). The security group should also allow access to any services that you’ll offer from your VMs.

EC2 instance summary info showing 1 instance, CentOS 7 (x86_64) AMI, c5.metal instance type, a security group name, and a 500 GiB volume

When you have your instance, configure AWS to treat it as a router (a CLI sketch of these steps follows the list).

  1. Go to Amazon EC2 in the AWS Management Console.
  2. Select your instance, and stop source/destination checking.

In the EC2 Actions menu, select Networking, then Change source/destination check.

3. Update the subnet route tables.

a. Go to the VPC settings, and select Route Tables.

b. Identify the tables for subnets that need CloudStack access.

c. In each of these tables, add a route to the new virtual subnet. The route target should be your EC2 instance.

4. Depending on your network needs, you may also need to add routes to transit gateways, VPN endpoints, etc.
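If you prefer the AWS CLI over the console, steps 2 and 3 look roughly like this; the instance ID, route table ID, and virtual subnet CIDR are placeholders:

# Stop source/destination checking on the CloudStack instance.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --no-source-dest-check

# Route the virtual subnet to the instance in each applicable route table.
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 10.100.0.0/16 \
  --instance-id i-0123456789abcdef0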

Because everything will be on one server, creating a virtual subnet is simply a matter of creating a Linux bridge. CloudStack must find a network adapter attached to the bridge. Therefore, add a dummy interface with a name that CloudStack will recognize.

A single EC2 instance contains the CloudStack management service, the CloudStack agent, a dummy network interface, several virtual machines, and a router. All of those things are connected to each other by a virtual subnet that exists inside the instance. The instance's elastic network interface is connected between the router and the Amazon VPC.

The following snippet shows how I configure networking in CentOS 7. You must provide values for the variables $virtual_host_ip_address and $virtual_netmask to reflect the virtual subnet that you want to create. For $dns_address, I recommend the base of the VPC IPv4 network range, plus two. You shouldn’t use 169.254.169.253 because CloudStack reserves link-local addresses for its own use.

yum install -y bridge-utils net-tools

# The bridge must be named cloudbr0.

cat << EOF > /etc/sysconfig/network-scripts/ifcfg-cloudbr0
DEVICE=cloudbr0
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=none
IPV6INIT=no
IPV6_AUTOCONF=no
DELAY=5
STP=yes
USERCTL=no
NM_CONTROLLED=no
IPADDR=$virtual_host_ip_address
NETMASK=$virtual_netmask
DNS1=$dns_address
EOF

# Create a dummy network interface.
cat << EOF > /etc/sysconfig/modules/dummy.modules
#!/bin/sh
/sbin/modprobe dummy numdummies=1
/sbin/ip link set name ethdummy0 dev dummy0
EOF

chmod +x /etc/sysconfig/modules/dummy.modules
/etc/sysconfig/modules/dummy.modules

cat << EOF > /etc/sysconfig/network-scripts/ifcfg-ethdummy0
TYPE=Ethernet
BOOTPROTO=none
NAME=ethdummy0
DEVICE=ethdummy0
ONBOOT=yes
BRIDGE=cloudbr0
NM_CONTROLLED=no
EOF

# Turn the instance into a router

echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf
sysctl -p

# Must kill dhclient or the network service won't restart properly.
# A reboot would also work, if you’d rather do that.

pkill dhclient
systemctl restart network

CloudStack must know which IP addresses to use for inter-service communication. It will select one by resolving the machine’s fully qualified domain name (FQDN) to an address. The following commands will make it choose the right one. You must provide a value for $virtual_host_ip_address.

hostnamectl set-hostname cloudstack.localdomain

echo "$virtual_host_ip_address cloudstack.localdomain" >> 
/etc/hosts

You can finish the setup by following the Quick Installation Guide.

Remember that CloudStack is only directly connected to your virtual network. The EC2 instance is the router that connects the virtual subnet to the Amazon VPC. When you’re configuring CloudStack, use your instance’s virtual subnet address as the default gateway.

Use the EC2 instance's virtual subnet IP address as the default gateway in CloudStack. In this example, the virtual subnet is 10.100.0.0/16, and the instance's address in that subnet is 10.100.0.1. CloudStack then uses 10.100.0.1 as the default gateway.

To access CloudStack from your workstation, you’ll need a connection to your VPC. This can be through a client VPN or a bastion host. If you use a bastion, its subnet needs a route to your virtual subnet, and you’ll need an SSH tunnel for your browser to access the CloudStack UI. The UI is at http://x.x.x.x:8080/client/, where x.x.x.x is your CloudStack instance’s virtual subnet address. Note that CloudStack’s console viewer won’t work if you’re using an SSH tunnel.
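For example, assuming a bastion host and a CloudStack instance at 10.100.0.1 (both placeholders), the tunnel could look like this:

# Forward local port 8080 to the CloudStack UI through the bastion.
ssh -L 8080:10.100.0.1:8080 ec2-user@bastion.example.com
# Then browse to http://localhost:8080/client/ on your workstation.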

If you’re just experimenting with CloudStack, then I suggest saving money by stopping your instance when it isn’t needed. The safe way to do that is:

  1. Disable your zone in the CloudStack UI.
  2. Put the primary storage into maintenance mode.
  3. Wait for the switch to maintenance mode to be complete.
  4. Stop the EC2 instance.

When you’re ready to turn everything back on, simply reverse those steps. If you have any virtual routers in CloudStack, then you may need to start those, too.

Cleanup

If you used my CloudFormation template, then delete the stack and remove any route table entries you added. If you didn’t use CloudFormation, then terminate the EC2 instance, delete the security group you created for it, and remove any route table entries that you added.

Conclusion

Getting CloudStack to run on AWS isn’t so bad. The hardest part is simply knowing how. The setup explained here is great for small installations, but it can only scale vertically. In my next post, I’ll show you how to create an installation that scales horizontally. Instead of using a virtual subnet that exists in a single EC2 instance, we’ll build an overlay network that spans multiple instances. It will use more components and features, including some that might be new to you. I hope you find it interesting!

Now that you can create a simple setup, give it a try! I hope you have fun and learn something new along the way. Comment with the results of your experiments.

“Apache”, “Apache CloudStack”, and “CloudStack” are trademarks of the Apache Software Foundation.

Serverless ICYMI Q4 2022

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/serverless-icymi-q4-2022/

Welcome to the 20th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed! In case you missed our last ICYMI, check out what happened last quarter here.

AWS Lambda

For developers using Java, AWS Lambda has introduced Lambda SnapStart. SnapStart is a new capability that can improve the start-up performance of functions using Corretto (java11) runtime by up to 10 times, at no extra cost.

To use this capability, you must enable it in your function and then publish a new version. This triggers the optimization process. This process initializes the function, takes an immutable, encrypted snapshot of the memory and disk state, and caches it for reuse. When the function is invoked, the state is retrieved from the cache in chunks, on an as-needed basis, and it is used to populate the execution environment.
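As a quick sketch with the AWS CLI (the function name is a placeholder), enabling SnapStart and publishing a version looks like this:

# SnapStart applies to published versions only.
aws lambda update-function-configuration \
  --function-name my-java-function \
  --snap-start ApplyOn=PublishedVersions

aws lambda publish-version --function-name my-java-function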

The ICYMI: Serverless pre:Invent 2022 post shares some of the launches for Lambda before November 21, like the support of Lambda functions using Node.js 18 as a runtime, the Lambda Telemetry API, and new .NET tooling to support .NET 7 applications.

Also, now Amazon Inspector supports Lambda functions. You can enable Amazon Inspector to scan your functions continually for known vulnerabilities. The log4j vulnerability shows how important it is to scan your code for vulnerabilities continuously, not only after deployment. Vulnerabilities can be discovered at any time, and with Amazon Inspector, your functions and layers are rescanned whenever a new vulnerability is published.

AWS Step Functions

There were many new launches for AWS Step Functions, like intrinsic functions, cross-account access capabilities, and the new executions experience for Express Workflows covered in the pre:Invent post.

During AWS re:Invent this year, we announced Step Functions Distributed Map. If you need to process many files, or items inside CSV or JSON files, this new flow can help you. The new distributed map flow orchestrates large-scale parallel workloads.

This feature is optimized for files stored in Amazon S3. You can either process in parallel multiple files stored in a bucket, or process one large JSON or CSV file, in which each line contains an independent item. For example, you can convert a video file into multiple .gif animations using a distributed map, or process over 37 GB of aggregated weather data to find the highest temperature of the day. 
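The following is a hedged sketch of a state machine definition that uses a distributed map to process every object in an S3 bucket with a Lambda function; the bucket, function, role, and account details are assumptions.

# Bucket, Lambda function, role ARN, and account ID are placeholders.
cat << 'EOF' > distributed-map.asl.json
{
  "StartAt": "ProcessObjects",
  "States": {
    "ProcessObjects": {
      "Type": "Map",
      "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": { "Bucket": "my-input-bucket" }
      },
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
        "StartAt": "ProcessOneObject",
        "States": {
          "ProcessOneObject": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-object",
            "End": true
          }
        }
      },
      "MaxConcurrency": 1000,
      "End": true
    }
  }
}
EOF

aws stepfunctions create-state-machine \
  --name distributed-map-demo \
  --role-arn arn:aws:iam::123456789012:role/states-exec-role \
  --definition file://distributed-map.asl.json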

Amazon EventBridge

Amazon EventBridge launched two major features: Scheduler and Pipes. Amazon EventBridge Scheduler allows you to create, run, and manage scheduled tasks at scale. You can schedule one-time or recurring tasks across 270 services and over 6,000 APIs.
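For example, creating a recurring schedule that invokes a Lambda function with EventBridge Scheduler looks roughly like this; the schedule name, cron expression, function ARN, and role ARN are placeholders:

# The role must allow scheduler.amazonaws.com to invoke the target function.
aws scheduler create-schedule \
  --name nightly-report \
  --schedule-expression "cron(0 2 * * ? *)" \
  --flexible-time-window Mode=OFF \
  --target '{"Arn":"arn:aws:lambda:us-east-1:123456789012:function:nightly-report","RoleArn":"arn:aws:iam::123456789012:role/scheduler-invoke-role"}'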

Amazon EventBridge Pipes allows you to create point-to-point integrations between event producers and consumers. With Pipes you can now connect different sources, like Amazon Kinesis Data Streams, Amazon DynamoDB Streams, Amazon SQS, Amazon Managed Streaming for Apache Kafka, and Amazon MQ to over 14 targets, such as Step Functions, Kinesis Data Streams, Lambda, and others. It not only allows you to connect these different event producers to consumers, but also provides filtering and enriching capabilities for events.

EventBridge now supports enhanced filtering capabilities, including the following (a sample pattern follows the list):

  • Matching against characters at the end of a value (suffix filtering)
  • Ignoring case sensitivity (equals-ignore-case)
  • OR matching: A single rule can match if any conditions across multiple separate fields are true.
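The sample pattern below sketches suffix and case-insensitive matching in one rule; the field names and values are illustrative, not from a real event source. OR matching across separate fields uses a "$or" array of sub-patterns.

# Field names and values are illustrative placeholders.
cat << 'EOF' > pattern.json
{
  "detail": {
    "fileName": [ { "suffix": ".png" } ],
    "status": [ { "equals-ignore-case": "success" } ]
  }
}
EOF

aws events put-rule --name sample-filter-rule --event-pattern file://pattern.json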

It’s now also simpler to build rules, and you can generate AWS CloudFormation from the console pages and generate event patterns from a schema.

AWS Serverless Application Model (AWS SAM)

There were many announcements for AWS SAM during this quarter, summarized in the ICYMI: Serverless pre:Invent 2022 post, like AWS SAM Connectors, SAM CLI Pipelines support for OpenID Connect Protocol, and AWS SAM CLI Terraform support.

AWS Application Composer

AWS Application Composer is a new visual designer that you can use to build serverless applications using multiple AWS services. This is ideal if you want to build a prototype, review architectures with others, generate diagrams for your projects, or onboard new team members to a project.

Within a simple user interface, you can drag and drop the different AWS resources and configure them visually. You can use AWS Application Composer together with AWS SAM Accelerate to build and test your applications in the AWS Cloud.

AWS Serverless digital learning badges

The new AWS Serverless digital learning badges let you show your AWS Serverless knowledge and skills. This is a verifiable digital badge that is aligned with the AWS Serverless Learning Plan.

This badge proves your knowledge and skills for Lambda, Amazon API Gateway, and designing serverless applications. To earn this badge, you must score at least 80 percent on the assessment associated with the Learning Plan. Visit this link if you are ready to get started learning or just jump directly to the assessment. 

News from other services:

Amazon SNS

Amazon SQS

AWS AppSync and AWS Amplify

Observability

AWS re:Invent 2022

AWS re:Invent was held in Las Vegas from November 28 to December 2, 2022. Werner Vogels, Amazon’s CTO, highlighted event-driven applications during his keynote. He stated that the world is asynchronous and showed how strange a synchronous world would be. During the keynote, he showcased Serverlesspresso as an example of an event-driven application. The Serverless DA team presented many breakouts, workshops, and chalk talks, and you can rewatch all of our breakout content online.

In addition, we brought Serverlesspresso back to Vegas. Serverlesspresso is a contactless, serverless order management system for a physical coffee bar. The architecture comprises several serverless apps that support an ordering process from a customer’s smartphone to a real espresso bar. The customer can check the virtual line, place an order, and receive a notification when their drink is ready for pickup.

Serverless blog posts

October

November

December

Videos

Serverless Office Hours – Tuesday 10 AM PT

Weekly live virtual office hours: in each session, we talk about a specific topic or technology related to serverless and open the floor to help with your real serverless challenges and issues. Ask us anything about serverless technologies and applications.

YouTube: youtube.com/serverlessland

Twitch: twitch.tv/aws

October

November

December

FooBar Serverless YouTube Channel

Marcia Villalba frequently publishes new videos on her popular FooBar Serverless YouTube channel.

October

November

December

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. If you want to learn more about event-driven architectures, read our new guide that will help you get started.

You can also follow the Serverless Developer Advocacy team on Twitter and LinkedIn to see the latest news, follow conversations, and interact with the team.

For more serverless learning resources, visit Serverless Land.

Enabling load-balancing of non-HTTP(s) traffic on AWS Wavelength

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/enabling-load-balancing-of-non-https-traffic-on-aws-wavelength/

This blog post is written by Jack Chen, Telco Solutions Architect, and Robert Belson, Developer Advocate.

AWS Wavelength embeds AWS compute and storage services within 5G networks, providing mobile edge computing infrastructure for developing, deploying, and scaling ultra-low-latency applications. AWS recently introduced support for Application Load Balancer (ALB) in AWS Wavelength zones. Although ALB addresses Layer-7 load balancing use cases, some low latency applications that get deployed in AWS Wavelength Zones rely on UDP-based protocols, such as QUIC, WebRTC, and SRT, which can’t be load-balanced by Layer-7 Load Balancers. In this post, we’ll review popular load-balancing patterns on AWS Wavelength, including a proposed architecture demonstrating how DNS-based load balancing can address customer requirements for load-balancing non-HTTP(s) traffic across multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. This solution also builds a foundation for automatic scale-up and scale-down capabilities for workloads running in an AWS Wavelength Zone.

Load balancing use cases in AWS Wavelength

In the AWS Regions, customers looking to deploy highly-available edge applications often consider Amazon Elastic Load Balancing (Amazon ELB) as an approach to automatically distribute incoming application traffic across multiple targets in one or more Availability Zones (AZs). However, at the time of this publication, AWS-managed Network Load Balancer (NLB) isn’t supported in AWS Wavelength Zones and ALB is being rolled out to all AWS Wavelength Zones globally. As a result, this post will seek to document general architectural guidance for load balancing solutions on AWS Wavelength.

As one of the most prominent AWS Wavelength use cases, highly immersive video streaming over UDP at scale, using protocols such as WebRTC, often requires a load balancing solution to accommodate surges in traffic, whether due to live events or general customer access patterns. These use cases, relying on Layer-4 traffic, can’t be load-balanced by a Layer-7 ALB. Instead, Layer-4 load balancing is needed.

To date, two infrastructure deployments involving Layer-4 load balancers are most often seen:

  • Amazon EC2-based deployments: Often the environment of choice for earlier-stage enterprises and ISVs, a fleet of EC2 instances will leverage a load balancer for high-throughput use cases, such as video streaming, data analytics, or Industrial IoT (IIoT) applications
  • Amazon EKS deployments: Customers looking to optimize performance and cost efficiency of their infrastructure can leverage containerized deployments at the edge to manage their AWS Wavelength Zone applications. In turn, external load balancers could be configured to point to exposed services via NodePort objects. Furthermore, a more popular choice might be to leverage the AWS Load Balancer Controller to provision an ALB when you create a Kubernetes Ingress.

Regardless of deployment type, the following design constraints must be considered:

  • Target registration: For load balancing solutions not managed by AWS, seamless solutions to load balancer target registration must be managed by the customer. As one potential solution, visit a recent HAProxyConf presentation, Practical Advice for Load Balancing at the Network Edge.
  • Edge Discovery: Although DNS records can be populated into Amazon Route 53 for each carrier-facing endpoint, DNS won’t deterministically route mobile clients to the most optimal mobile endpoint. When available, edge discovery services are required to most effectively route mobile clients to the lowest latency endpoint.
  • Cross-zone load balancing: Given the hub-and-spoke design of AWS Wavelength, customer-managed load balancers should proxy traffic only to that AWS Wavelength Zone.

Solution overview – Amazon EC2

In this post, we present a highly available, DNS-based load balancing approach for an Amazon EC2-based deployment in a single AWS Wavelength Zone. In a separate post, we’ll cover the configurations needed for the AWS Load Balancer Controller in AWS Wavelength for Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

The proposed solution introduces DNS-based load balancing, a technique to abstract away the complexity of intelligent load-balancing software and allow your Domain Name System (DNS) resolvers to distribute traffic (equally, or in a weighted distribution) to your set of endpoints.

Our solution leverages the weighted routing policy in Route 53 to resolve inbound DNS queries to multiple EC2 instances running within an AWS Wavelength zone. As EC2 instances for a given workload get deployed in an AWS Wavelength zone, Carrier IP addresses can be assigned to the network interfaces at launch.

Through this solution, Carrier IP addresses attached to AWS Wavelength instances are automatically added as DNS records for the customer-provided public hosted zone.

To determine how Route 53 responds to queries, given an arbitrary number of records in a public hosted zone, Route 53 offers several routing policies:

Simple routing policy – In the event that you must route traffic to a single resource in an AWS Wavelength Zone, simple routing can be used. A single record can contain multiple IP addresses, but Route 53 returns the values in a random order to the client.

Weighted routing policy – To route traffic more deterministically using a set of proportions that you specify, this policy can be selected. For example, if you would like Carrier IP A to receive 50% of the traffic and Carrier IP B to receive 50% of the traffic, we’ll create two individual A records (one for each Carrier IP) with a weight of 50 and 50, respectively. Learn more about Route 53 routing policies by visiting the Route 53 Developer Guide.
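As a minimal sketch of what the weighted records look like in practice, the following Python (boto3) snippet upserts two weighted A records, one per Carrier IP, each with a weight of 50. The hosted zone ID, record name, and Carrier IP addresses are placeholders.

import boto3

route53 = boto3.client("route53")

# Hypothetical weighted A records: one per Carrier IP, 50/50 split.
records = [("wavelength-a", "155.146.1.10"), ("wavelength-b", "155.146.1.11")]

for set_identifier, carrier_ip in records:
    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789EXAMPLE",
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com",
                    "Type": "A",
                    "SetIdentifier": set_identifier,
                    "Weight": 50,
                    "TTL": 30,
                    "ResourceRecords": [{"Value": carrier_ip}],
                },
            }]
        },
    )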

The proposed solution leverages weighted routing policy in Route 53 DNS to route traffic to multiple EC2 instances running within an AWS Wavelength zone.

Reference architecture

The following diagram illustrates the load-balancing component of the solution, where EC2 instances in an AWS Wavelength zone are assigned Carrier IP addresses. A weighted DNS record for a host (e.g., www.example.com) is updated with Carrier IP addresses.

DNS-based load balancing

When a device makes a DNS query, it receives one of the Carrier IP addresses associated with the given domain name. With a large number of devices, we expect a fair distribution of load across all EC2 instances in the resource pool. Given the highly ephemeral nature of mobile edge environments, it’s likely that Carrier IPs are frequently allocated to accommodate a workload and released shortly thereafter. However, this unpredictable behavior can yield stale DNS records, resulting in a “blackhole” – routes to endpoints that no longer exist.

Time-To-Live (TTL) is a DNS attribute that specifies the amount of time, in seconds, that you want DNS recursive resolvers to cache information about this record.

In our example, we set the TTL to 30 seconds to force DNS resolvers to retrieve the latest records from the authoritative nameservers and minimize stale DNS responses. However, a lower TTL has a direct impact on cost because of the increased number of calls from recursive resolvers to Route 53 to retrieve the latest records.

The core components of the solution are as follows:

Alongside the services above in the AWS Wavelength Zone, the following services are also leveraged in the AWS Region:

  • AWS Lambda – a serverless, event-driven function that makes API calls to the Route 53 service to update DNS records.
  • Amazon EventBridge – a serverless event bus that reacts to EC2 instance lifecycle events and invokes the Lambda function to make DNS updates.
  • Route 53 – a cloud DNS service with a domain record pointing to AWS Wavelength-hosted resources.

In this post, we intentionally leave the specific load balancing software up to the customer. Customers can leverage various popular load balancers available on AWS Marketplace, such as HAProxy and NGINX. To keep the focus on the auto-registration of DNS records, this solution is designed to support stateless workloads only. To support stateful workloads, sticky sessions – a process in which requests are routed to the same target in a target group – must be configured by the underlying load balancer solution and are outside the scope of what DNS can provide natively.

Automation overview

Using the aforementioned components, we can implement the following workflow automation:

Event-driven Auto Scaling Workflow

An Amazon CloudWatch alarm can trigger an Auto Scaling group scale-out or scale-in event by adding or removing EC2 instances. EventBridge detects the EC2 instance state change event and invokes the Lambda function. This function updates the DNS records in Route 53 by either adding (scale out) or deleting (scale in) a weighted A record associated with the EC2 instance changing state.
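The exact function code is packaged with the solution’s CloudFormation template; as an illustrative approximation only, a handler along these lines could implement the workflow. The environment variable names, the SSM parameter naming convention, and the lookup_recorded_ip helper are assumptions introduced for this sketch.

import os
import boto3

ec2 = boto3.client("ec2")
ssm = boto3.client("ssm")
route53 = boto3.client("route53")

HOSTED_ZONE_ID = os.environ["HOSTED_ZONE_ID"]  # assumed environment variable
RECORD_NAME = os.environ["RECORD_NAME"]        # for example, www.example.com


def lookup_recorded_ip(instance_id):
    # Assumed convention: the Carrier IP was stored in Parameter Store at launch.
    return ssm.get_parameter(Name=f"/dns-lb/{instance_id}")["Parameter"]["Value"]


def handler(event, context):
    # Reacts to an EC2 instance state-change notification from EventBridge.
    instance_id = event["detail"]["instance-id"]
    state = event["detail"]["state"]

    if state == "running":
        # Look up the Carrier IP assigned to the instance's network interface.
        reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
        interface = reservation["Instances"][0]["NetworkInterfaces"][0]
        carrier_ip = interface["Association"]["CarrierIp"]
        ssm.put_parameter(Name=f"/dns-lb/{instance_id}", Value=carrier_ip,
                          Type="String", Overwrite=True)
        action = "UPSERT"
    elif state in ("shutting-down", "terminated", "stopped"):
        carrier_ip = lookup_recorded_ip(instance_id)
        action = "DELETE"
    else:
        return

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": action,
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "A",
                "SetIdentifier": instance_id,
                "Weight": 50,
                "TTL": 30,
                "ResourceRecords": [{"Value": carrier_ip}],
            },
        }]},
    )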

Configuration of the automatic scaling policy is out of scope for this post. There are many scaling triggers that you can consider, based on predefined and custom metrics such as memory utilization. For demo purposes, we use manual scaling.

In addition to the core components already described, our solution also uses AWS Identity and Access Management (IAM) policies and CloudWatch. Both services are key components for building Well-Architected solutions on AWS. We also use AWS Systems Manager Parameter Store to keep track of user input parameters. The deployment of the solution is automated via AWS CloudFormation templates. The Lambda function provided should be uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.

Amazon Virtual Private Cloud (Amazon VPC), subnets, Carrier Gateway, and Route Tables are foundational building blocks for AWS-based networking infrastructure. In our deployment, we are creating a new VPC, one subnet in an AWS Wavelength zone of your choice, a Carrier Gateway, and updating the route table for this subnet to point the default route to the Carrier Gateway.

Wavelength VPC architecture.

Deployment prerequisites

The following are prerequisites to deploy the described solution in your account:

  • Access to an AWS Wavelength zone. If your account is not allow-listed to use AWS Wavelength zones, then opt-in to AWS Wavelength zones here.
  • Public DNS Hosted Zone hosted in Route 53. You must have access to a registered public domain to deploy this solution. The zone for this domain should be hosted in the same account where you plan to deploy AWS Wavelength workloads.
    If you don’t have a public domain, then you can register a new one. Note that there will be a service charge for the domain registration.
  • Amazon S3 bucket. For the Lambda function that updates DNS records in Route 53, store the source code as a .zip file in an Amazon S3 bucket.
  • Amazon EC2 key pair. You can use an existing key pair for the deployment. If you don’t have a key pair in the Region where you plan to deploy this solution, then create one by following these instructions.
  • 4G or 5G-connected device. Although the infrastructure can be deployed independent of the underlying connected devices, testing the connectivity will require a mobile device on one of the Wavelength partner’s networks. View the complete list of Telecommunications providers and Wavelength Zone locations to learn more.

Conclusion

In this post, we demonstrated how to implement DNS-based load balancing for workloads running in an AWS Wavelength Zone. We deployed a solution that uses an EventBridge rule and a Lambda function to update DNS records hosted in Route 53. If you want to learn more about AWS Wavelength, subscribe to the AWS Compute Blog channel here.

Run fault tolerant and cost-optimized Spark clusters using Amazon EMR on EKS and Amazon EC2 Spot Instances

Post Syndicated from Kinnar Kumar Sen original https://aws.amazon.com/blogs/big-data/run-fault-tolerant-and-cost-optimized-spark-clusters-using-amazon-emr-on-eks-and-amazon-ec2-spot-instances/

Amazon EMR on EKS is a deployment option in Amazon EMR that allows you to run Spark jobs on Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances save you up to 90% over On-Demand Instances, and are a great way to cost optimize the Spark workloads running on Amazon EMR on EKS. Because Spot is an interruptible service, if we can move or reuse the intermediate shuffle files, it improves the overall stability and SLA of the job. The latest versions of Amazon EMR on EKS have integrated Spark features to enable this capability.

In this post, we discuss these features—Node Decommissioning and Persistent Volume Claim (PVC) reuse—and their impact on increasing the fault tolerance of Spark jobs on Amazon EMR on EKS when cost optimizing using EC2 Spot Instances.

Amazon EMR on EKS and Spot

EC2 Spot Instances are spare EC2 capacity provided at a steep discount of up to 90% over On-Demand prices. Spot Instances are a great choice for stateless and flexible workloads. The caveat with this discount and spare capacity is that Amazon EC2 can interrupt an instance with a proactive or reactive (2-minute) warning when it needs the capacity back. You can provision compute capacity in an EKS cluster using Spot Instances with managed or self-managed node groups to cost optimize your workloads.

Amazon EMR on EKS uses Amazon EKS to run jobs with the EMR runtime for Apache Spark, which can be cost optimized by running the Spark executors on Spot. It provides up to 61% lower costs and up to 68% performance improvement for Spark workloads on Amazon EKS. The Spark application launches a driver and executors to run the computation. Spark is a semi-fault tolerant framework that is resilient to executor loss due to an interruption and therefore can run on EC2 Spot. On the other hand, when the driver is interrupted, the job fails. Hence, we recommend running drivers on On-Demand Instances. Some of the best practices for running Spark on Amazon EKS are applicable with Amazon EMR on EKS.

EC2 Spot Instances also help in cost optimization by improving the overall throughput of the job. This can be achieved by auto scaling the cluster using the Cluster Autoscaler (for managed node groups) or Karpenter.

Though Spark executors are resilient to Spot interruptions, the shuffle files and RDD data are lost when the executor gets killed. The lost shuffle files need to be recomputed, which increases the overall runtime of the job. Apache Spark has released two features (in versions 3.1 and 3.2) that address this issue. Amazon EMR on EKS released features such as node decommissioning (version 6.3) and PVC reuse (version 6.8) to simplify recovery and reuse shuffle files, which increases the overall resiliency of your application.

Node decommissioning

The node decommissioning feature works by preventing scheduling of new jobs on the nodes that are to be decommissioned. It also moves any shuffle files or cache present in those nodes to other executors (peers). If there are no other available executors, the shuffle files and cache are moved to a remote fallback storage.

Fig 1: Node Decommissioning

Let’s look at the decommission steps in more detail.

If one of the nodes that is running executors is interrupted, the executor starts the process of decommissioning and sends the message to the driver:

21/05/05 17:41:41 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 7 decommissioned message
21/05/05 17:41:41 DEBUG TaskSetManager: Valid locality levels for TaskSet 2.0: NO_PREF, ANY
21/05/05 17:41:41 INFO KubernetesClusterSchedulerBackend: Decommission executors: 7
21/05/05 17:41:41 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_2.0, runningTasks: 10
21/05/05 17:41:41 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(7, 192.168.82.107, 39007, None)) as being decommissioning.
21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
21/05/05 20:22:17 INFO BlockManager: Starting block manager decommissioning process...
21/05/05 20:22:17 DEBUG FileSystem: Looking for FS supporting s3a

The executor looks for RDD or shuffle files and tries to replicate or migrate those files. It first tries to find a peer executor. If successful, it will move the files to the peer executor:

22/06/07 20:41:38 INFO ShuffleStatus: Updating map output for 46 to BlockManagerId(4, 192.168.13.235, 34737, None)
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle data block update for 0 46, ignore.
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 46, updating.

However, if it is not able to find a peer executor, it will try to move the files to a fallback storage if available.

Fig 2: Fallback Storage

The executor is then decommissioned. When a new executor comes up, the shuffle files are reused:

22/06/07 20:42:50 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle
22/06/07 20:42:50 DEBUG ExecutorPodsAllocator: Requested executor with id 19 from Kubernetes.
22/06/07 20:42:50 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-bfd0a5813fd1b80f-exec-19, action ADDED
22/06/07 20:42:50 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 52, updating.
22/06/07 20:42:50 INFO ShuffleStatus: Recover 52 BlockManagerId(fallback, remote, 7337, None)

The key advantage of this process is that it migrates blocks and shuffle data, thereby reducing recomputation, which adds to the overall resiliency of the system and reduces runtime. This process can be triggered by a Spot interruption signal (SIGTERM) or by node draining. Node draining may happen due to high-priority task scheduling or independently.

When you use Amazon EMR on EKS with managed node groups/Karpenter, the Spot interruption handling is automated, wherein Amazon EKS gracefully drains and rebalances the Spot nodes to minimize application disruption when a Spot node is at elevated risk of interruption. If you’re using managed node groups/Karpenter, the decommission gets triggered when the nodes are getting drained and, because it’s proactive, it gives you more time (at least 2 minutes) to move the files. In the case of self-managed node groups, we recommend installing the AWS Node Termination Handler to handle the interruption, and the decommission is triggered when the reactive (2-minute) notification is received. We recommend using Karpenter with Spot Instances as it has faster node scheduling with early pod binding and bin packing to optimize resource utilization.

The following code enables this configuration; more details are available on GitHub:

"spark.decommission.enabled": "true"
"spark.storage.decommission.rddBlocks.enabled": "true"
"spark.storage.decommission.shuffleBlocks.enabled" : "true"
"spark.storage.decommission.enabled": "true"
"spark.storage.decommission.fallbackStorage.path": "s3://<<bucket>>"

PVC reuse

Apache Spark enabled dynamic PVC in version 3.1, which is useful with dynamic allocation because we don’t have to pre-create the claims or volumes for the executors and delete them after completion. PVC enables true decoupling of data and processing when we’re running Spark jobs on Kubernetes, because we can use it as a local storage to spill in-process files too. The latest version of Amazon EMR 6.8 has integrated the PVC reuse feature of Spark, wherein if an executor is terminated due to an EC2 Spot interruption or any other reason (for example, a JVM failure), then the PVC is not deleted but persisted and reattached to another executor. If there are shuffle files in that volume, then they are reused.

As with node decommissioning, this reduces the overall runtime because we don’t have to recompute the shuffle files. We also save the time required to request a new volume for an executor, and shuffle files can be reused without moving the files around.

The following diagram illustrates this workflow.

Fig 3: PVC Reuse

Let’s look at the steps in more detail.

If one or more of the nodes that are running executors are interrupted, the underlying pods get terminated and the driver gets the update. Note that the driver is the owner of the PVCs of the executors, and the PVCs are not terminated. See the following code:

22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action MODIFIED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action MODIFIED

The ExecutorPodsAllocator tries to allocate new executor pods to replace the ones terminated due to interruption. During the allocation, it figures out how many of the existing PVCs have files and can be reused:

22/06/15 23:25:23 INFO ExecutorPodsAllocator: Found 2 reusable PVCs from 10 PVCs

The ExecutorPodsAllocator requests a pod, and when it launches, the PVC is reused. In the following example, the PVC from executor 6 is reused for new executor pod 11:

22/06/15 23:25:23 DEBUG ExecutorPodsAllocator: Requested executor with id 11 from Kubernetes.
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action ADDED
22/06/15 23:25:24 INFO KubernetesClientUtils: Spark configuration files loaded from Some(/usr/lib/spark/conf) : log4j.properties,spark-env.sh,hive-site.xml,metrics.properties
22/06/15 23:25:24 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action MODIFIED
22/06/15 23:25:24 INFO ExecutorPodsAllocator: Reuse PersistentVolumeClaim amazon-reviews-word-count-9ee82b8169a75183-exec-6-pvc-0

The shuffle files, if present in the PVC, are reused.

The key advantage of this technique is that it allows us to reuse pre-computed shuffle files in their original location, thereby reducing the time of the overall job run.

This works for both static and dynamic PVCs. Amazon EKS supports three different storage options, which can also be encrypted: Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. We recommend using dynamic PVCs with Amazon EBS because with static PVCs, you would need to create multiple PVCs.

The following code enables this configuration; more details are available on GitHub:

"spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
"spark.kubernetes.driver.reusePersistentVolumeClaim": "true"

For this to work, we need to enable PVC with Amazon EKS and mention the details in the Spark runtime configuration. For instructions, refer to How do I use persistent storage in Amazon EKS? The following code contains the Spark configuration details for using PVC as local storage; other details are available on GitHub:

"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "spark-sc"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "10Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/var/data/spill"

Conclusion

With Amazon EMR on EKS (6.9) and the features discussed in this post, you can further reduce the overall runtime for Spark jobs when running with Spot Instances. This also improves the overall resiliency and flexibility of the job while cost optimizing the workload on EC2 Spot.

Try out the EMR on EKS workshop for improved performance when running Spark workloads on Kubernetes and cost optimize using EC2 Spot Instances.


About the Author

Kinnar Kumar Sen is a Sr. Solutions Architect at Amazon Web Services (AWS) focusing on Flexible Compute. As a part of the EC2 Flexible Compute team, he works with customers to guide them to the most elastic and efficient compute options that are suitable for their workload running on AWS. Kinnar has more than 15 years of industry experience working in research, consultancy, engineering, and architecture.

Scaling AWS Outposts rack deployments with ACE racks

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/scaling-aws-outposts-rack-deployments-with-ace-racks/

This blog post is written by Eric Vasquez, Specialist Hybrid Edge Solutions Architect, and Paul Scherer, Senior Network Service Tech.

Overview

AWS Outposts brings managed, monitored AWS infrastructure, compute, and storage to your on-premises environment. It provides the same AWS APIs and console experience you would get within the AWS Region to which the Outpost is homed. You may already have an Outposts rack. An Outpost can consist of one or more racks creating a pool of consumable resources as a single logical Outpost. In this post, we will introduce you to an Aggregation, Core, Edge (ACE) rack.

Depending on your familiarity with the Outpost family, you might have already heard about an ACE rack. An ACE rack serves as an aggregation point for multi-rack Outpost deployments. ACE racks reduce the physical networking port requirements as well as the logical interfaces needed, while allowing for connectivity between multiple racks in your logical Outpost. ACE racks are recommended for customers with planned deployments beyond three racks excluding the ACE rack itself.

We recommend that all customers leverage an ACE rack if planning expansions beyond three racks in the long-term, even if the initial deployment is a single rack. An ACE rack contains four routers, and these routers can connect to either two or four customer upstream devices. For the best redundancy, reliability, and resiliency, we recommend deploying an ACE rack to four upstream customer devices.

ACE racks support 10G, 40G, and 100G connections to a customer network. However, 100G connections between each ACE router to a customer device are recommended.

Outpost extension from the AWS Region and ACE rack deployment in a 15-rack Outpost configuration

Each Outposts rack comes standard with redundant Outpost networking devices, power supplies, and two top-of-rack patch panels which serve as demarcation points between the Outpost rack and your customer networking device (CND). For the remainder of this post, we’ll refer to the Outpost Networking Devices as OND and customer switches/routers as CND. The Outpost rack ONDs form Border Gateway Protocol (BGP) neighbor relationships with either your CND or the ACE rack using point-to-point (P2P) Virtual LAN (VLAN) interfaces.

For an Outposts installation without an ACE rack, each Outposts OND connects to your LAN using single-mode or multi-mode fiber with LC connectors supporting 1G, 10G, 40G, or 100G connectivity. We provide flexibility for the CNDs and allow either Layer 2 or Layer 3 devices, including firewalls. Each OND uses a single LACP port channel that carries two P2P VLAN virtual interfaces (VIFs) to establish two BGP relationships over the port channel to your upstream CND and aggregate total bandwidth. This results in each Outpost rack requiring a minimum of two physical uplinks, but as a general best practice we recommend two per device for a total of four uplinks, along with two LACP port channels and four VLANs to establish P2P BGP peerings. Note that the IPs used in the following diagram are just examples.

Outpost Service link and Local Gateway VLAN

As we continue to expand rack deployments, so will the number of physical uplinks and VLAN interfaces required for the added OND to a CND. When we introduce the ACE rack, the OND is no longer attached to your CND. Instead, it goes directly to ACE devices, which provide at least one uplink to your network switch/router. In this topology, AWS owns the VLAN interface allocation and configuration between compute rack OND and the ACE routers.

Let’s cover the potential downsides of a multi-rack installation without an ACE rack. In this case, we have a three-rack Outpost deployment, with one uplink from each rack OND (two per rack) to the CND. This would require you to provide six physical ports on your devices, six fiber cables, 12 VLAN VIFs, 12 P2P subnets (potentially exhausting 24 IP addresses), and six port channels.

In comparison, for a three-rack installation that sits behind an ACE rack, you provide fewer physical network ports on your devices, fewer fiber cabling uplinks, fewer VLAN VIFs, fewer port channels, and fewer P2P subnets. Each ACE router has its own LACP port channel with two VLAN VIFs in each channel (the same as an OND-to-CND connection). The following table highlights the advantages of using an ACE rack when running a multi-rack Outpost, which become more pronounced as you continue to scale.

| Requirement | 2-Rack Outpost, without ACE | 2-Rack Outpost, with ACE | 3-Rack Outpost, without ACE | 3-Rack Outpost, with ACE | 4-Rack Outpost, without ACE | 4-Rack Outpost, with ACE |
| --- | --- | --- | --- | --- | --- | --- |
| Physical Ports | 4 | 4 | 6 | 4 | 8 | 4 |
| Fiber Cables | 4 | 4 | 6 | 4 | 8 | 4 |
| LACP Port Channels | 4 | 4 | 6 | 4 | 8 | 4 |
| VLAN VIFs | 8 | 8 | 12 | 8 | 16 | 8 |
| P2P Subnets | 8 | 8 | 12 | 8 | 16 | 8 |

ACE vs. non-ACE rack components comparison

Furthermore, you should consider the additional weight and power requirements that an ACE rack introduces when planning for multi-rack deployments. In addition to the initial kVA requirements for the Outpost racks, you must account for the resources required for an ACE rack. An ACE rack consumes up to 10 kVA of power and weighs up to 705 lbs. Carefully planning additional capacity for these resources with your AWS account team will be critical for a successful deployment.

Similar to an Outpost rack, an ACE rack deployment is monitored by AWS. The rack provides telemetry data transmitted over a set of VPN tunnels back to the anchor points in the Region to which the Outpost is homed. This allows AWS to monitor the rack for hardware failures, performance degradation, and other alarm conditions, including links or interfaces going down and BGP drops.

As part of the Outpost ordering process, AWS will work closely with you to determine the location for the install, power availability on-site, and the network configuration of both the Outposts rack and the ACE rack. This includes BGP configuration and the Customer-owned IP address (CoIP) pool, which is the pool of IP addresses for route advertisements back to your CND. The CoIP pool allows resources inside your Outpost rack to communicate with on-premises resources and vice versa. Another connectivity option is Direct VPC Routing (DVR), where we advertise VPC subnets associated with your LGW to your on-premises networks. Outposts uses network connectivity back to the Region for management purposes called the service link (SL). The SL is an encrypted set of VPN connections used whenever the Outpost communicates with your chosen home Region.

Conclusion

This post addressed the most common questions surrounding ACE racks, how an ACE rack can be deployed, and why an ACE rack would be leveraged for a multi-rack Outpost deployment. We demonstrated how an ACE rack serves as a consolidation point in your on-premises environment, making multi-rack deployments scalable while reducing complexity and physical port allocation for connectivity between an Outpost and your LAN. In addition, we described how you can get this process started. If you want to learn more about Outposts fundamentals and how you can build your applications with AWS services using Outposts for hybrid cloud deployments, check out the Outposts user guide.

Securing Lambda Function URLs using Amazon Cognito, Amazon CloudFront and AWS WAF

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/securing-lambda-function-urls-using-amazon-cognito-amazon-cloudfront-and-aws-waf/

This post is written by Madhu Singh (Solutions Architect), and Krupanidhi Jay (Solutions Architect).

A Lambda function URL is a dedicated HTTPS endpoint for an AWS Lambda function. You can configure a function URL to have one of two methods of authentication: IAM and NONE. IAM authentication means that you are restricting access to the function URL (and in turn, access to invoke the Lambda function) to certain AWS principals (such as roles or users). An authentication type of NONE means that the Lambda function URL has no authentication and is open for anyone to invoke the function.

This blog shows how to use Lambda function URLs with an authentication type of NONE, implement custom authorization logic as part of the function code, and only allow requests that present valid Amazon Cognito credentials to invoke the function. You also learn ways to protect the Lambda function URL against common security threats like DDoS using AWS WAF and Amazon CloudFront.

Lambda function URLs provide a simpler way to invoke your function using HTTP calls. However, they are not a replacement for Amazon API Gateway, which provides advanced features like request validation and rate throttling.

Solution overview

There are four core components in the example.

1. A Lambda function with function URLs enabled

At the core of the example is a Lambda function with the function URLs feature enabled with the authentication type of NONE. This function responds with a success message if a valid authorization code is passed during invocation. If not, it responds with a failure message.

2. Amazon Cognito User Pool

Amazon Cognito user pools enable user authentication on websites and mobile apps. You can also enable publicly accessible login and sign-up pages in your applications using the Amazon Cognito user pools feature called the hosted UI.

In this example, you use a user pool and the associated hosted UI to enable user login and sign-up on the website used as the entry point. The Lambda function validates the authorization code against this Amazon Cognito user pool.

3. CloudFront distribution using AWS WAF

CloudFront is a content delivery network (CDN) service that helps deliver content to end users with low latency, while also improving the security posture for your applications.

AWS WAF is a web application firewall that helps protect your web applications or APIs against common web exploits and bots. AWS Shield is a managed distributed denial of service (DDoS) protection service that safeguards applications running on AWS. AWS WAF inspects the incoming request according to the configured Web Access Control List (web ACL) rules.

Adding CloudFront in front of your Lambda function URL helps to cache content closer to the viewer, and activating AWS WAF and AWS Shield helps in increasing security posture against multiple types of attacks, including network and application layer DDoS attacks.

4. Public website that invokes the Lambda function

The example also creates a public website built on React JS and hosted in AWS Amplify as the entry point for the demo. This website works both in authenticated mode and in guest mode. For authentication, the website uses Amazon Cognito user pools hosted UI.

Solution architecture

This shows the architecture of the example and the information flow for user requests.

In the request flow:

  1. The entry point is the website hosted in AWS Amplify. In the home page, when you choose “sign in”, you are redirected to the Amazon Cognito hosted UI for the user pool.
  2. Upon successful login, Amazon Cognito returns the authorization code, which is stored as a cookie with the name “code”. The user is redirected back to the website, which has an “execute Lambda” button.
  3. When the user chooses “execute Lambda”, the value from the “code” cookie is passed in the request body to the CloudFront distribution endpoint.
  4. The AWS WAF web ACL rules are configured to determine whether the request originates from US or Canada IP addresses and whether the request should be allowed to invoke the Lambda function URL origin.
  5. Allowed requests are forwarded to the CloudFront distribution endpoint.
  6. CloudFront is configured to allow CORS headers and has the origin set to the Lambda function URL. The request that CloudFront receives is passed to the function URL.
  7. This invokes the Lambda function associated with the function URL, which validates the token.
  8. The function code does the following, in order (a hedged sketch of this flow follows the list):
    1. Exchanges the authorization code in the request body (passed as the event object to the Lambda function) for an access_token using Amazon Cognito’s token endpoint (check the documentation for more details).
      1. Amazon Cognito user pool attributes like the user pool URL, Client ID, and Secret are retrieved from AWS Systems Manager Parameter Store (SSM Parameters).
      2. These values are stored in SSM Parameter Store at the time these resources are deployed via AWS CDK (see the “how to deploy” section).
    2. The access token is then verified to determine its authenticity.
    3. If valid, the Lambda function returns a message stating the user is authenticated as <username> and the execution was successful.
    4. If the authorization code was not present (for example, the user was in “guest mode” on the website), or the code is invalid or expired, the Lambda function returns a message stating that the user is not authorized to execute the function.
  9. The webpage displays the Lambda function return message as an alert.
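The exact handler ships with the example repository; as a hedged sketch only, the core of such a function could look like the following Python. The SSM parameter names, Cognito domain, and redirect URI are assumptions, and production code would also verify the token signature, expiry, and audience against the user pool’s JWKS before trusting it.

import base64
import json
import urllib.error
import urllib.parse
import urllib.request

import boto3

ssm = boto3.client("ssm")


def get_param(name, decrypt=False):
    return ssm.get_parameter(Name=name, WithDecryption=decrypt)["Parameter"]["Value"]


def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    auth_code = body.get("code")
    if not auth_code:
        return {"statusCode": 401, "body": json.dumps({"message": "User is not authorized"})}

    # Hypothetical SSM parameter names written at deploy time by the CDK stack.
    domain = get_param("/lambda-url-demo/cognito-domain")  # e.g. https://myapp.auth.us-east-1.amazoncognito.com
    client_id = get_param("/lambda-url-demo/client-id")
    client_secret = get_param("/lambda-url-demo/client-secret", decrypt=True)
    redirect_uri = get_param("/lambda-url-demo/redirect-uri")

    # Exchange the authorization code for tokens at Cognito's /oauth2/token endpoint.
    data = urllib.parse.urlencode({
        "grant_type": "authorization_code",
        "client_id": client_id,
        "code": auth_code,
        "redirect_uri": redirect_uri,
    }).encode()
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    request = urllib.request.Request(
        f"{domain}/oauth2/token",
        data=data,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": f"Basic {basic}",
        },
    )
    try:
        with urllib.request.urlopen(request) as response:
            tokens = json.loads(response.read())
    except urllib.error.HTTPError:
        return {"statusCode": 401, "body": json.dumps({"message": "User is not authorized"})}

    access_token = tokens.get("access_token")
    # Token verification (signature, expiry, audience) and username extraction
    # from the verified token would happen here before trusting the caller.
    if not access_token:
        return {"statusCode": 401, "body": json.dumps({"message": "User is not authorized"})}
    return {"statusCode": 200, "body": json.dumps({"message": "Execution successful"})}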

Getting started

Pre-requisites:

Before deploying the solution, please follow the README from the GitHub repository and take the necessary steps to fulfill the pre-requisites.

Deploy the sample solution

1. From the code directory, download the dependencies:

$ npm install

2. Start the deployment of the AWS resources required for the solution:

$ cdk deploy

Note:

  • Optionally pass in the --profile argument if needed
  • The deployment can take up to 15 minutes

3. Once the deployment completes, the output looks similar to this:

Open the amplifyAppUrl from the output in your browser. This is the URL for the demo website. If you don’t see the “Welcome to Compute Blog” page, the Amplify app is still building, and the website is not available yet. Retry in a few minutes. This website works either in an authenticated or unauthenticated state.

Test the authenticated flow

  1. To test the authenticated flow, choose “Sign In”.

2. In the sign-in page, choose sign-up (for the first time) and create a user name and password.

3. To use an existing user name and password, enter those credentials and choose login.

4. Upon successful sign-in or sign up, you are redirected back to the webpage with “Execute Lambda” button.

5. Choose this button. In a few seconds, an alert pop-up shows the logged-in user and that the Lambda execution is successful.

Testing the unauthenticated flow

1. To test the unauthenticated flow, from the Home page, choose “Continue”.

2. Choose “Execute Lambda” and in a few seconds, you see a message that you are not authorized to execute the Lambda function.

Testing the geo-block feature of AWS WAF

1. Access the website from a Region other than US or Canada. If you are physically in the US or Canada, you may use a VPN service to connect to a Region other than US or Canada.

2. Choose the “Execute Lambda” button. In the browser’s network trace, you can see that the call to invoke the Lambda function was blocked with a Forbidden response.

3. To try either the authenticated or unauthenticated flow again, choose “Return to Home Page” to go back to the home page with “Sign In” and “Continue” buttons.

Cleaning up

To delete the resources provisioned, run the cdk destroy command from the AWS CDK CLI.

Conclusion

In this blog, you created a Lambda function with function URLs enabled with NONE as the authentication type. You then implemented a custom authentication mechanism as part of your Lambda function code. You also increased the security of your Lambda function URL by setting it as the origin for a CloudFront distribution and using AWS WAF geo and IP limiting rules for protection against common web threats, like DDoS.

For more serverless learning resources, visit Serverless Land.

Journey to adopt Cloud-Native DevOps platform Series #1: OfferUp modernized DevOps platform with Amazon EKS and Flagger to accelerate time to market

Post Syndicated from Purna Sanyal original https://aws.amazon.com/blogs/devops/journey-to-adopt-cloud-native-devops-platform-series-1-offerup-modernized-devops-platform-with-amazon-eks-and-flagger-to-accelerate-time-to-market/

In this two part series, we discuss the challenges faced by OfferUp, a Digital Native customer, to meet business growth and time-to-market. Their journey involved modernizing their existing DevOps platform, from the traditional monolith virtual machine (VM) based architecture to modern containerized architecture and running cloud-native applications for secured progressive delivery to accelerate time to market. This series will provide strategies, architecture patterns, and technical steps you can adopt to become more agile and innovative like OfferUp has.

OfferUp engineers were using the homegrown DevOps platform to build and release new services on the marketplace platform. In this first post, we discuss the key challenges encountered by OfferUp engineers with the existing DevOps platform, as well as how OfferUp modernized its DevOps platform with Amazon Elastic Kubernetes Service (Amazon EKS) and Flagger, automating production releases with progressive delivery techniques for faster time-to-market with new products and services. Amazon EKS is a managed container service to run and scale Kubernetes applications in the cloud or on-premises.

Previous DevOps architecture

OfferUp is a leading online and mobile customer to customer (C2C) marketplace where users can both buy and sell goods on the platform. Users can browse and purchase products from a broad range of categories, including furniture, clothing, sports equipment, toys, and many more. As a mobile-first company, OfferUp puts a great deal of emphasis on in-person communication between buyers and sellers.

OfferUp built a home grown, self-managed DevOps platform. This platform used a set of manual processes and third-party applications that allows both developers and operations engineers to build and deploy code to a production environment. The DevOps pipeline included topic areas such as source code control, continuous integration/continuous delivery (CI/CD), microservices, as well as development and test Methodologies. The following diagram depicts the previous architecture of OfferUp’s DevOps platform, which was self-managed on Amazon Elastic Compute Cloud (Amazon EC2).

Figure 1: Previous DevOps architecture of OfferUp

OfferUp used GitHub for code repositories. Once the source code was committed in the code repository, Jenkins pulled the source code from the code repositories on a scheduled or on-demand basis and built Amazon Machine Images (AMIs). The built image was deployed in production by a custom-built deployment tool, Vanaheim, which supports one-box canary deployment and full roll-out deployment strategies. The DevOps engineers would manually create a deployment job in the Vanaheim portal and then manually monitor the test success rate and service metrics to detect any impact from the deployment. Once the success rate was reached, a full production roll-out was performed from the Vanaheim portal.

Key challenges with previous DevOps pipeline

In 2020, OfferUp experienced significant transaction volume growth on its Marketplace platform with the increase of its user base. With OfferUp’s acquisition of LetGo in 2020, there was a need to build a scalable DevOps platform to support future integration and organic growth. The previous DevOps platform, designed and deployed over seven years ago, had reached the limits of its scalability, and could no longer keep up with the platform’s growth. The previous architecture was expensive to run and had a complex infrastructure that made it difficult to upgrade and add new features.

The following key factors drove the push for modernization:

  • Manual verification was required to check if the code was correctly deployed in one of the servers in production, and if the deployment was right in one server, then it was rolled out to other production servers. Full Rollout to production wasn’t automated due to frequent failures requiring manual rollbacks.
  • The previous platform required a longer deployment time (1–2 hours) due to the authoritative batch process, which sometimes caused delays in releasing and testing of new features.
  • The self-managed nature of the Jenkins and Vanaheim clusters was consuming far too much engineering time. Most of the institutional knowledge of this legacy platform was lost over the years and it didn’t align with OfferUp’s philosophy of small DevOps engineering teams. Innovation had stalled partly due to the difficulty of simultaneously upgrading the DevOps platform and releasing new features.

DevOps platform automation with Flagger and Gloo Ingress Controller on Amazon EKS

A key requirement for the next-generation system was that the new architecture would reduce the operational burden on engineering teams, deployment lifecycle, and total cost of ownership. OfferUp evaluated multiple managed container orchestration platforms for its DevOps Platform. It finally selected Amazon EKS for high availability, reducing the average time to deploy a change to the stack from hours to just a few minutes and reducing the complexity in managing and upgrading the Kubernetes cluster. On the Amazon EKS platform, OfferUp uses Flagger, a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements several deployment strategies (Canary releases, A/B testing, and Blue/Green mirroring) using the Gloo Edge ingress controller for traffic routing. Datadog is used as an observability service for monitoring the health of the deployments and effectively managing the canary to progressive delivery. For release analysis, Flagger runs a query on Datadog logs and uses Slack for alerting and notifications. The cloud native technology components of the architecture are described as follows:

Kubernetes and Amazon EKS – Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Kubernetes is a graduate project in the CNCF. Amazon EKS is a fully-managed, certified Kubernetes conformant service that simplifies the process of building, securing, operating, and maintaining Kubernetes clusters on AWS. Amazon EKS integrates with core AWS services, such as Amazon CloudWatch, Auto Scaling Groups, and AWS Identity and Access Management (IAM) to provide a seamless experience for monitoring, scaling, and load balancing your containerized applications.

Helm – Helm helps you manage Kubernetes applications. Helm Charts define, install, and upgrade even the most complex Kubernetes applications. Charts are easy to create, version, share, and publish. If Kubernetes were an operating system, then Helm would be the package manager. Helm is a graduate project in the CNCF and is maintained by the Helm community.

Flagger – Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators such as HTTP requests success rate, requests average duration, and pods health. Based on the set thresholds, a canary is either promoted or aborted and its analysis is pushed to a Slack channel. Flagger became a CNCF project – part of the Flux family of GitOps tools.

Gloo EdgeGloo Edge is a feature-rich, Kubernetes-native ingress controller. Gloo Edge is exceptional in its function-level routing; its support for legacy apps, microservices, and serverless; its discovery capabilities; and its tight integration with leading open-source projects. Gloo Edge is uniquely designed to support hybrid applications, in which multiple technologies, architectures, protocols, and clouds can coexist.

Observability platformDatadog’s integrations with Kubernetes, Docker, and AWS will let you track the full range of Amazon EKS metrics, as well as logs and performance data from your cluster and applications. Datadog gives you comprehensive coverage of your dynamic infrastructure and applications with features like auto discovery to track services across containers, sophisticated graphing, and alerting options.

Modernized DevOps architecture

In the new architecture, OfferUp uses Github as a version control tool and Github actions as their CI/CD tool. On every Pull request, tests are run, artifacts are built and stored in the JFrog Artifactory, and docker Images are stored in the Amazon Elastic Container Registry (Amazon ECR). Separate deployment pipelines are triggered based on the environment (dev, staging, and production) of choice. Flagger detects any changes in the version of the application and gradually shifts production traffic to the canary. It measures the requests success rate and average response duration metrics from Datadog to decide full rollout in production. For an application deployment, a canary promotion can be defined using Flagger’s custom resource. Flagger rolls back the deployment when the success rate falls below the defined desired success rate metrics.

Figure 2: Modernized DevOps architecture of OfferUp

With the modernized DevOps platform, OfferUp moved from a monolithic to a microservice architecture, where front-end applications and GraphQL run on the Amazon EKS cluster. The production cluster runs 110 services and 650+ pods on 60 nodes. The cluster scales up to 100 nodes with an Auto Scaling group based on the traffic pattern. On the networking front, the cluster has a private endpoint and uses both the VPC CNI plugin and the CoreDNS add-on. There are four Amazon EKS clusters, one each for the production, test, utility, and staging environments. OfferUp plans to explore the Karpenter open-source autoscaling project, and it will move new applications to the Amazon EKS cluster, allowing the total node count to scale up to 200.

Benefits of modernized architecture

The new architecture helped OfferUp make automated decisions to deploy new releases and improve time to market while reducing unplanned production downtime:

  • Faster deployments and Quicker rollbacks – The new architecture reduces the Service Deployment time from one hour down to five minutes, and automates rollback time to five minutes from the manual rollback time of one hour.
  • Automate deployment of new releases – The lack of canary deployment processes in the previous architecture required OfferUp engineers to manually intervene to validate the deployment status, which led to administrative overhead and production outages. The canary deployments take care of the traffic shifting by automatically measuring the requests’ success rate and latency metrics from Datadog and subsequently release the service to production. Deployments are automatically rolled back when the success rate falls below the defined success rate metric thresholds.
  • Simplified Configuration – Configuration has been simplified drastically and integrated within the CI/CD pipeline in the new architecture, thereby reducing configuration complexity, eliminating manual processes, and saving Developers time.
  • More time to Focus on Innovation – With fully automated progressive delivery, the developers no longer need to spend time testing and releasing source code in production. Similarly, migrating from a Self-managed DevOps platform to the Managed Amazon EKS services lowered the DevOps platform’s infrastructure management burden on the engineering team. This helps developers spend more time focusing on building and testing new features and innovations.
  • Cost reduction – Moving from the self-managed Amazon EC2-based architecture to the Amazon EKS cluster reduced the cost of operations through shared nodes and improved pod density. The previous architecture used 200 Amazon EC2 instances. The same workload was moved to a 50-node Amazon EKS cluster. Furthermore, custom applications (Vanaheim and Jenkins) were retired, further reducing costs.

Conclusion

In this post, you see how OfferUp embarked on the journey to modernize its DevOps platform to support its growth and developers’ velocity. The key factors that drove the modernization decisions were the ability to scale the platform to support the automated testing of features in production, the faster release of new features, cost reduction, and to facilitate future innovation. The modernized DevOps platform on Amazon EKS also decreased the ongoing operational support burden for engineers, and the scalability of the design opens up a lot of headroom for growth.

We encourage you to look into modernizing your existing CI/CD pipeline on Amazon EKS with the Flagger progressive delivery mechanism. Amazon EKS removes the undifferentiated heavy lifting of managing and updating the Kubernetes cluster. Managed node groups automate the provisioning and lifecycle management of worker nodes in an Amazon EKS cluster, which greatly simplifies operational activities, such as new Kubernetes version deployments.

In the next part of the series, you’ll discover how to implement Flagger and Gloo Edge Ingress Controller on Amazon EKS to automate the release process for applications running on Kubernetes.

Further Reading

Journey to adopt Cloud-Native DevOps platform Series #2: Progressive delivery on Amazon EKS with Flagger and Gloo Edge Ingress Controller

About the authors:

Purna Sanyal

Purna Sanyal is a technology enthusiast and an architect at AWS, helping digital native customers solve their business problems with successful adoption of cloud native architecture. He provides technical thought leadership, architecture guidance, and conducts PoCs to enable customers’ digital transformation. He is also passionate about building innovative solutions around Kubernetes, database, analytics, and machine learning.

Alan Liu

Alan Liu is Sr Director of Engineering at OfferUp. He is a technology enthusiast who has worked across a wide variety of industries. He is a highly effective, adaptable, and experienced leader with a proven track record.

Monitoring shared AWS Outposts rack capacity

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/monitoring-shared-aws-outposts-rack-capacity/

This post is written by Adam Imeson, Sr. Hybrid Edge Specialist Solutions Architect.

AWS Outposts rack is a fully-managed service that offers the same AWS infrastructure, APIs, tools, and a subset of AWS services to any data center, colocation space, or on-premises facility for a consistent hybrid experience. Outposts rack is ideal for workloads that require low latency, access to on-premises systems, local data processing, data residency, and migration of applications with local system interdependencies.

An Outpost is a pool of AWS compute and storage capacity deployed at a customer site. In an Outposts rack deployment, an Outpost may comprise one or more racks connected together at the site. It’s common for customers to order their Outpost in a dedicated account and then integrate with their multi-account organizational architecture by sharing the Outpost via AWS Resource Access Manager (AWS RAM). This post will explain how to set up cross-account Amazon CloudWatch metrics so that disparate stakeholders within your organization can effectively monitor your Outpost’s capacity to meet their specific needs.

Overview

The AWS account that you use to order an Outpost owns that Outpost. This includes all metrics and health events pertaining to that Outpost. Many customers must integrate Outposts into their multi-account environments, as discussed in the “Best practices: AWS Outposts in a multi-account AWS environment” posts (part 1 and part 2). This post will go into more detail on how to monitor Outposts in these environments.

The nuance here stems from the different ways to share access to AWS resources. AWS RAM allows infrastructure resources to be shared across multiple accounts. Then, the consumer accounts can launch resources on the infrastructure as though they owned it. AWS Identity and Access Management (IAM) allows customers to modify a given account’s permissions such that users in other accounts can make AWS API calls that affect the given account.

An Outpost provides infrastructure resources, so customers can share Outposts via AWS RAM. CloudWatch metrics about Outposts are data which customers retrieve using AWS API calls, so customers can share access to those metrics using IAM.

In a typical customer’s AWS Organization, there are two cases to consider. First, when the customer is sharing an Outpost to multiple development accounts, each account needs to view metrics relevant to the Outpost so that the development accounts can deploy and operate their applications.

Diagram depicting an Outpost that a customer has shared to three different accounts using RAM. The three different accounts each have a different application deployed in them.

Second, when the customer has several accounts that each own different Outposts, the customer’s centralized monitoring account needs to track metrics relevant to each of the Outposts.

Diagram depicting three accounts that each own a separate Outpost, with all three accounts sharing Outpost metrics to CloudWatch in the customer’s central monitoring account.

This post will explain strategies for both cases.

Customers must monitor the health of the Outpost’s connection to its regional control plane (the Outpost’s service link), as an Outpost is an extension of an AWS Availability Zone (AZ) and is designed to be connected to an AZ at all times. The health of the Outpost’s service link is a crucial variable when application owners are diagnosing disruptions to their application, and also when infrastructure owners are diagnosing disruptions to a site. Customers can monitor their service link’s status with the ConnectedStatus metric.
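
If you prefer to pull the ConnectedStatus metric programmatically rather than through the console, the following is a minimal boto3 sketch. It assumes the metric lives in the AWS/Outposts CloudWatch namespace with an OutpostId dimension, and the Outpost ID shown is a placeholder.

import boto3
from datetime import datetime, timedelta

# Assumes Outposts metrics are published in the AWS/Outposts CloudWatch
# namespace with an OutpostId dimension; the Outpost ID is a placeholder.
cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Outposts",
    MetricName="ConnectedStatus",
    Dimensions=[{"Name": "OutpostId", "Value": "op-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

# An average of 1 indicates the service link was connected for the whole
# period; anything below 1 suggests a disconnection during that period.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])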

Customers also must monitor their Outposts’ current capacity. Outposts necessarily have a limited capacity footprint when compared to an AWS Region. Application owners must make informed decisions about capacity as they scale their apps over time or respond to occasional hardware failures. Infrastructure owners also must maintain a holistic view of capacity across all of the Outposts for which they are responsible so that they can plan for capacity expansion over time. Customers can monitor their Outposts’ capacity using the various capacity metrics that Outposts provide.

For an overview of how to set up a capacity dashboard and capacity-based CloudWatch alarms within a single account, see “Monitoring AWS Outposts capacity.” This post will expand on the single-account strategy by introducing cross-account capabilities. See also “Cross-Account Cross-Region Dashboards with Amazon CloudWatch.” These two posts provide practical walkthroughs for setting up the metric flows explained below.

Setting up Outposts metric permissions for your organization

This post assumes that you have multiple Outposts in different accounts that are all part of the same Organization. You’re sharing these Outposts into accounts that development teams use to deploy and operate their applications. You also have a centralized monitoring account where your infrastructure team tracks various metrics across all accounts. Your Organization might look something like this:

A base diagram depicting six AWS accounts with different names. Outpost Account 1 contains an Outpost. Outpost Account 2 contains a different Outpost. Monitoring Account contains Amazon CloudWatch. Accounts A through C contain Applications A through C respectively.

The first Outpost is shared to Accounts A and B, and the second Outpost is only shared to Account B. This is just an example of how a customer might set up their environment so that Application A can deploy on Outpost 1, and Application B can deploy on both Outpost 1 and 2.

The same base diagram of the six AWS accounts as before, with arrows added to depict AWS RAM resource shares. Outpost Account 1 shares its Outpost to Accounts A and B. Outpost Account 2 shares its Outpost to Account B.

To enable centralized monitoring, each account shares CloudWatch metrics with the central monitoring account as described in “Cross-Account Cross-Region Dashboards with Amazon CloudWatch.”

The same base diagram of the six AWS accounts, with arrows added to depict CloudWatch metrics being shared from all five of the other accounts to the Monitoring Account.

Now there are application accounts which can launch on the desired Outposts, and all of the accounts are sharing metrics with the central monitoring account. The team responsible for procuring and managing the Outposts can now set up dashboards in the central monitoring account in accordance with “Monitoring AWS Outposts capacity” to get a holistic view of capacity. This is valuable for capacity planning as applications naturally grow over time.

However, this may not be sufficient for operations. Consider that each application team needs to understand how much capacity is available on the Outpost that they’re using. This is crucial for teams operating highly available applications to maintain awareness of whether they still have N+1 capacity available on the Outpost to use in the event of a hardware failure. This is also important for planning expansions to the application ahead of time, as application teams have the best understanding of the future needs of their applications. Finally, application teams can use the metrics to track the operational health of the Outpost, which is crucial for root-causing any application disruptions.

You can implement this by sharing CloudWatch metrics from the Outpost accounts to the application accounts which are consuming the Outposts’ capacity, as shown in the following diagram.

The same base diagram of the six AWS accounts, with arrows depicting CloudWatch metrics being shared. Outpost Account 1 is sharing CloudWatch metrics to Accounts A and B. Outpost Account 2 is sharing CloudWatch metrics to Account B.

Walkthrough

Log in to your application account and navigate to the CloudWatch console. Open the Settings menu and choose Configure.

Screenshot of the CloudWatch Console’s Settings menu.

Scroll to the bottom. In the View cross-account cross-region section, choose Edit.

Screenshot of the Cross-account cross-region sub-menu in the CloudWatch console.

Choose your preferred account selection method from the three options and choose Save changes. I recommend the Custom account selector option, as it strikes a good balance between a simple setup and ease of use. If you choose this option, then input the Outpost owner account’s account ID and a human-readable name for the account. This name will appear in the drop-down when you’re using the CloudWatch console to view metrics from other accounts later.

Screenshot of the Cross-account cross-region sub-menu in the CloudWatch console, with the “Custom account selector” option selected and “123456789012 Outpost owner account” in the input field.

Your application account is now prepared to view metrics from the Outpost owner account. Now log in to the account that owns the Outpost and navigate to the CloudWatch console. You still need to share the Outpost’s metrics to the application account. Open the Settings page again, and choose Configure in the Cross-account cross-region section as before. This time, choose Share data in the Share your CloudWatch data section:

Screenshot of the Cross-account cross-region sub-menu in the CloudWatch console, with the “Share data” button circled in red in the “Share your CloudWatch data” section.

Choose Add account and input the application account’s account ID. Then scroll to the bottom of the page and choose Launch CloudFormation template.

Screenshot of the “Share your CloudWatch data” sub-menu in the CloudWatch console. The “Specific accounts” option in the “Sharing” section is highlighted, and the sample account ID “234567890123” is typed into the input field.

The AWS CloudFormation template will create the CloudWatch-CrossAccountSharingRole. This role gives CloudWatch read access to the AWS account that you specified, the application account. You can view and modify this role using the IAM console if you want to. For example, you might adjust the role to allow read access to an entire Organizational Unit (OU).
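
If you’d rather inspect or adjust the role programmatically than through the IAM console, here is a hedged boto3 sketch. The role name comes from the paragraph above; the organization path used in the condition is a placeholder, and widening trust to an entire OU with the aws:PrincipalOrgPaths condition key is shown as one possible approach, not a prescribed configuration.

import json
import boto3

iam = boto3.client("iam")

# Inspect the trust policy created by the CloudFormation template.
role = iam.get_role(RoleName="CloudWatch-CrossAccountSharingRole")
print(json.dumps(role["Role"]["AssumeRolePolicyDocument"], indent=2))

# One possible adjustment (illustration only): trust every account under a
# specific OU using the aws:PrincipalOrgPaths condition key. The organization,
# root, and OU IDs below are placeholders.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "ForAnyValue:StringLike": {
                    "aws:PrincipalOrgPaths": ["o-exampleorgid/r-exampleroot/ou-exampleou/*"]
                }
            },
        }
    ],
}
iam.update_assume_role_policy(
    RoleName="CloudWatch-CrossAccountSharingRole",
    PolicyDocument=json.dumps(trust_policy),
)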

Now, log back in to the application account and navigate to the CloudWatch console. Choose All metrics in the left-side menu. In the Metrics section, select the Outpost owner account from the drop-down.

Screenshot of the CloudWatch console’s “All metrics” sub-page. The account selection drop-down is circled in red in the “Metrics” subsection.

You can now view the metrics from the Outpost owner account and incorporate them into the dashboards in the application account. Now the application teams can track the Outposts’ ConnectedStatus metrics to be alerted on any disconnections from the region, and they can track the Outposts’ capacity metrics as well. It’s a best practice to alarm on Outpost capacity metrics once a consumption threshold defined by business needs has been breached.
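
As a sketch of what such an alarm could look like in boto3, the example below fires when an assumed capacity utilization metric stays above 80%. The metric name (InstanceTypeCapacityUtilization), its dimensions, and the 80% threshold are assumptions for illustration; substitute the capacity metrics and threshold that fit your Outpost and business needs.

import boto3

cloudwatch = boto3.client("cloudwatch")

# The metric name, dimensions, and threshold below are assumptions for
# illustration; use the Outposts capacity metrics and the business-defined
# threshold that apply to your environment.
cloudwatch.put_metric_alarm(
    AlarmName="outpost-c5-large-capacity-above-80-percent",
    Namespace="AWS/Outposts",
    MetricName="InstanceTypeCapacityUtilization",
    Dimensions=[
        {"Name": "OutpostId", "Value": "op-0123456789abcdef0"},  # placeholder
        {"Name": "InstanceType", "Value": "c5.large"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmDescription="Example: c5.large capacity utilization above 80% on the Outpost",
)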

Conclusion

Outposts rack allows customers to deploy AWS infrastructure into virtually any data center, colocation space, or on-premises facility. Outposts are tied to the AWS account that ordered them, and customers can share Outposts among AWS accounts within the same Organization. When multiple teams within a customer’s Organization are interacting with the same Outpost, that introduces additional monitoring surface area for capacity and service health. This post explains how customers can accommodate their teams’ different needs by sharing Outposts metrics around their Organization along with their Outposts. As best practices, customers should share their Outposts capacity and ConnectedStatus metrics to teams who are running applications on Outposts. Customers’ operations teams should also work with their stakeholders to define a maximum capacity utilization threshold for a given Outpost and alarm on that threshold.

Our guide to AWS Compute at re:Invent 2022

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/our-guide-to-aws-compute-at-reinvent-2022/

This blog post is written by Shruti Koparkar, Senior Product Marketing Manager, Amazon EC2.

AWS re:Invent is the most transformative event in cloud computing, and it starts on November 28, 2022. The AWS Compute team has many exciting sessions planned for you, covering everything from foundational content to technology deep dives, customer stories, and even hands-on workshops. To help you build out your calendar for this year’s re:Invent, let’s look at some highlights from the AWS Compute track in this blog. Please visit the session catalog for a full list of AWS Compute sessions.

Learn what powers AWS Compute

AWS offers the broadest and deepest functionality for compute. Amazon Elastic Compute Cloud (Amazon EC2) offers granular control for managing your infrastructure with the choice of processors, storage, and networking.

The AWS Nitro System is the underlying platform for all our modern EC2 instances. It enables AWS to innovate faster, further reduce cost for our customers, and deliver added benefits like increased security and new instance types.

Discover the benefits of AWS Silicon

AWS has invested years designing custom silicon optimized for the cloud. This investment helps us deliver high performance at lower costs for a wide range of applications and workloads using AWS services.

  • Explore the AWS journey into silicon innovation with our “CMP201: Silicon Innovation at AWS” session. We will cover some of the thought processes, learnings, and results from our experience building silicon for AWS Graviton, AWS Nitro System, and AWS Inferentia.
  • To learn about customer-proven strategies to help you make the move to AWS Graviton quickly and confidently while minimizing uncertainty and risk, attend “CMP410: Framework for adopting AWS Graviton-based instances”.

Explore different use cases

Amazon EC2 provides secure and resizable compute capacity for several different use cases, including general purpose computing for cloud native and enterprise applications, and accelerated computing for machine learning and high performance computing (HPC) applications.

High performance computing

  • HPC on AWS can help you design your products faster with simulations, predict the weather, detect seismic activity with greater precision, and more. To learn how to solve the world’s toughest problems with extreme-scale compute, come join us for “CMP205: HPC on AWS: Solve complex problems with pay-as-you-go infrastructure”.
  • Single on-premises general-purpose supercomputers can fall short when solving increasingly complex problems. Attend “CMP222: Redefining supercomputing on AWS” to learn how AWS is reimagining supercomputing to provide scientists and engineers with more access to world-class facilities and technology.
  • AWS offers many solutions to design, simulate, and verify the advanced semiconductor devices that are the foundation of modern technology. Attend “CMP320: Accelerating semiconductor design, simulation, and verification” to hear from Arm and Marvell about how they are using AWS to accelerate EDA workloads.

Machine Learning

Cost Optimization

Hear from our customers

We have several sessions this year where AWS customers are taking the stage to share their stories and details of exciting innovations made possible by AWS.

Get started with hands-on sessions

There’s nothing like a hands-on session where you can learn by doing and get started easily with AWS Compute. Our speakers and workshop assistants will help you every step of the way. Just bring your laptop to get started!

You’ll get to meet the global cloud community at AWS re:Invent and get an opportunity to learn, get inspired, and rethink what’s possible. So build your schedule in the re:Invent portal and get ready to hit the ground running. We invite you to stop by the AWS Compute booth and chat with our experts. We look forward to seeing you in Las Vegas!

Doubling down on local development with Workers: Miniflare meets workerd

Post Syndicated from Brendan Coll original https://blog.cloudflare.com/miniflare-and-workerd/

Local development gives you a fully-controllable and easy-to-debug testing environment. At the start of this year, we brought this experience to Workers developers by launching Miniflare 2.0: a local Cloudflare Workers simulator. Miniflare 2 came with features like step-through debugging support, detailed console.logs, pretty source-mapped error pages, live reload and a highly-configurable unit testing environment. Not only that, but we also incorporated Miniflare into Wrangler, our Workers CLI, to enable wrangler dev's --local mode.

Today, we’re taking local development to the next level! In addition to introducing new support for migrating existing projects to your local development environment, we’re making it easier to work with your remote data—locally! Most importantly, we’re releasing a much more accurate Miniflare 3, powered by the recently open-sourced workerd runtime—the same runtime used by Cloudflare Workers!

Enabling local development with workerd

One of the superpowers of having a local development environment is that you can test changes without affecting users in production. A great local environment offers a level of fidelity on par with production.

The way we originally approached local development was with Miniflare 2, which reimplemented Workers runtime APIs in JavaScript. Unfortunately, there were subtle behavior mismatches between these re-implementations and the real Workers runtime. These types of issues are really difficult for developers to debug, as they don’t appear locally, and step-through debugging of deployed Workers isn’t possible yet. For example, the following Worker returns responses successfully in Miniflare 2, so we might assume it’s safe to publish:

let cachedResponsePromise;
export default {
  async fetch(request, env, ctx) {
    // Let's imagine this fetch takes a few seconds. To speed up our worker, we
    // decide to only fetch on the first request, and reuse the result later.
    // This works fine in Miniflare 2, so we must be good right?
    cachedResponsePromise ??= fetch("https://example.com");
    return (await cachedResponsePromise).clone();
  },
};

However, as soon as we send multiple requests to our deployed Worker, it fails with Error: Cannot perform I/O on behalf of a different request. The problem here is that response bodies created in one request’s handler cannot be accessed from a different request’s handler. This limitation allows Cloudflare to improve overall Worker performance, but it was almost impossible for Miniflare 2 to detect these types of issues locally. In this particular case, the best solution is to cache using fetch itself.

Additionally, because the Workers runtime uses a very recent version of V8, it supports some JavaScript features that aren’t available in all versions of Node.js. This meant a few features implemented in Workers, like Array#findLast, weren’t always available in Miniflare 2.

With the Workers runtime now open-sourced, Miniflare 3 can leverage the same implementations that are deployed on Cloudflare’s network, giving bug-for-bug compatibility and practically eliminating behavior mismatches. 🎉

Miniflare 3's new simplified architecture using workerd

This radically simplifies our implementation too. We were able to remove over 50,000 lines of code from Miniflare 2. Of course, we still kept all the Miniflare special-sauce that makes development fun like live reload and detailed logging. 🙂

Local development with real data

We know that many developers choose to test their Workers remotely on the Cloudflare network as it gives them the ability to test against real data. Testing against fake data in staging and local environments is sometimes difficult, as it never quite matches the real thing.

With Miniflare 3, we’re blurring the lines between local and remote development, by bringing real data to your machine as an experimental opt-in feature. If enabled, Miniflare will read and write data to namespaces on the Cloudflare network, as your Worker would when deployed. This is only supported with Workers KV for now, but we’re exploring similar solutions for R2 and D1.

Miniflare's system for accessing real KV data; reads and writes are cached locally for future accesses

A new default for Wrangler

With Miniflare 3 now effectively as accurate as the real Workers environment, and with the ability to access real data locally, we’re revisiting the decision to make remote development the initial Wrangler experience. In a future update, wrangler dev will run in local mode by default, and the --local flag will no longer be required. Benchmarking suggests this will bring an approximate 10x reduction in startup time and a massive 60x reduction in script reload times! Over the next few weeks, we’ll be focusing on further optimizing Wrangler’s performance to bring you the fastest Workers development experience yet!

wrangler init --from-dash

We want all developers to be able to take advantage of the improved local experience, so we’re making it easy to start a local Wrangler project from an existing Worker that’s been developed in the Cloudflare dashboard. With Node.js installed, run npx wrangler init --from-dash <your_worker_name> in your terminal to set up a new project with all your existing code and bindings, such as KV namespaces, configured. You can now seamlessly continue development of your application locally, taking advantage of all the developer experience improvements Wrangler and Miniflare provide. When you’re ready to deploy your worker, run npx wrangler publish.

Looking to the future

Over the next few months, the Workers team is planning to further improve the local development experience with a specific focus on automated testing. Already, we’ve released a preliminary API for programmatic end-to-end tests with wrangler dev, but we’re also investigating ways of bringing Miniflare 2’s Jest/Vitest environments to workerd. We’re also considering creating extensions for popular IDEs to make developing workers even easier. 👀

Miniflare 3.0 is now included in Wrangler! Try it out by running npx wrangler@latest dev --experimental-local. Let us know what you think in the #wrangler channel on the Cloudflare Developers Discord, and please open a GitHub issue if you hit any unexpected behavior.

Keep track of Workers’ code and configuration changes with Deployments

Post Syndicated from Kabir Sikand original https://blog.cloudflare.com/deployments-for-workers/

Today we’re happy to introduce Deployments for Workers. Deployments allow developers to keep track of changes to their Worker; not just the code, but the configuration and bindings as well. With deployments, developers now have access to a powerful audit log of changes to their production applications.

And tracking changes is just the beginning! Deployments provide a strong foundation to add: automated deployments, rollbacks, and integration with version control.

Today we’ll dive into the details of deployments, how you can use them, and what we’re thinking about next.

Deployments

Deployments are a powerful new way to track changes to your Workers. With them, you can track who’s making changes to your Workers, where those changes are coming from, and when those changes are being made.

Cloudflare reports on deployments made from wrangler, API, dashboard, or Terraform anytime you make changes to your Worker’s code, edit resource bindings and environment variables, or modify configuration like name or usage model.

We expose the source of your deployments, so you can track where changes are coming from. For example, if you have a CI job that’s responsible for changes, and you see a user made a change through the Cloudflare dashboard, it’s easy to flag that and dig into whether the deployment was a mistake.

Interacting with deployments

Cloudflare tracks the authors, sources, and timestamps of deployments. If you have a set of users responsible for deployment, or an API token that’s associated with your CI tool, it’s easy to see which of them made recent deployments. Each deployment also includes a timestamp, so you can track when those changes were made.

You can access all this deployment information in your Cloudflare dashboard, under your Worker’s Deployments tab. We also report on the active version right at the front of your Worker’s detail page. Wrangler will also report on deployment information. wrangler publish now reports the latest deployed version, and a new `wrangler deployments` command can be used to view a deployment history.

To learn more about the details of deployments, head over to our Developer Documentation.

What’s next?

We’re excited to share deployments with our customers, available today in an open beta. As we mentioned up front, we’re just getting started with deployments. We’re also excited for more on-platform tooling like rollbacks, deploy status, deployment rules, and a view-only mode to historical deployments. Beyond that, we want to ensure deployments can be automated from commits to your repository, which means working on version control integrations to services like GitHub, Bitbucket, and Gitlab. We’d love to hear more about how you’re currently using Workers and how we can improve developer experience. If you’re interested, let’s chat.

If you’d like to join the conversation, head over to Cloudflare’s Developer Discord and give us a shout! We love hearing from our customers, and we’re excited to see what you build with Cloudflare.

UPDATE Supercloud SET status = ‘open alpha’ WHERE product = ‘D1’;

Post Syndicated from Nevi Shah original https://blog.cloudflare.com/d1-open-alpha/

In May 2022, we announced our quest to simplify databases – building them, maintaining them, integrating them. Our goal is to empower you with the tools to run a database that is powerful, scalable, with world-beating performance without any hassle. And we first set our sights on reimagining the database development experience for every type of user – not just database experts.

Over the past couple of months, we’ve been working to create just that, while learning some very important lessons along the way. As it turns out, building a global relational database product on top of Workers pushes the boundaries of the developer platform to their absolute limit, and often beyond them, but in a way that’s absolutely thrilling to us at Cloudflare. It means that while our progress might seem slow from the outside, every improvement, bug fix, or stress test helps lay down a path for all of our customers to build the world’s most ambitious serverless applications.

However, as we continue down the road to making D1 production ready, it wouldn’t be “the Cloudflare way” unless we stopped for feedback first – even though it’s not quite finished yet. In the spirit of Developer Week, there is no better time to introduce the D1 open alpha!

An “open alpha” is a new concept for us. You’ll likely hear the term “open beta” in various announcements at Cloudflare, and while it makes sense for many products here, it wasn’t quite right for D1. There are still some crucial pieces in active development and testing, so before we release the fully-formed D1 as a public beta for you to start building real-world apps with, we want to make sure everybody can start to get a feel for the product on their hobby apps or side projects.

What’s included in the alpha?

While a lot is still changing behind the scenes with D1, we’ve put a lot of thought into how you, as a developer, interact with it – even if you’re new to databases.

Using the D1 dashboard

In a few clicks you can get your D1 database up and running right from within your dashboard. In our D1 interface, you can create, maintain and view your database as you please. Changes made in the UI are instantly available to your Worker – no redeploy required!

Use Wrangler

If you’re looking to get your hands a little dirty, you can also work with your database using our Wrangler CLI. Create your database and begin adding your data manually, or bootstrap your database in one of two ways:

1.  Execute an SQL file

$ wrangler d1 execute my-database-name --file ./customers.sql

where your .sql file looks something like this:

customers.sql

DROP TABLE IF EXISTS Customers;
CREATE TABLE Customers (CustomerID INT, CompanyName TEXT, ContactName TEXT, PRIMARY KEY (`CustomerID`));
INSERT INTO Customers (CustomerID, CompanyName, ContactName) 
VALUES (1, 'Alfreds Futterkiste', 'Maria Anders'),(4, 'Around the Horn', 'Thomas Hardy'),(11, 'Bs Beverages', 'Victoria Ashworth'),(13, 'Bs Beverages', 'Random Name');

2. Create and run migrations

Migrations are a way to version your database changes. With D1, you can create a migration and then apply it to your database.

To create the migration, execute:

wrangler d1 migrations create <my-database-name> <short description of migration>

This will create an SQL file in a migrations folder where you can then go ahead and add your queries. Then apply the migrations to your database by executing:

wrangler d1 migrations apply <my-database-name>

Access D1 from within your Worker

You can attach your D1 to a Worker by adding the D1 binding to your wrangler.toml configuration file. Then interact with D1 by executing queries inside your Worker like so:

export default {
 async fetch(request, env) {
   const { pathname } = new URL(request.url);

   if (pathname === "/api/beverages") {
     const { results } = await env.DB.prepare(
       "SELECT * FROM Customers WHERE CompanyName = ?"
     )
       .bind("Bs Beverages")
       .all();
     return Response.json(results);
   }

   return new Response("Call /api/beverages to see Bs Beverages customers");
 },
};

Or access D1 from within your Pages Function

In this Alpha launch, D1 also supports integration with Cloudflare Pages! You can add a D1 binding inside the Pages dashboard, and write your queries inside a Pages Function to build a full-stack application! Check out the full documentation to get started with Pages and D1.

Community built tooling

During our private alpha period, the excitement behind D1 led to some valuable contributions to the D1 ecosystem and developer experience by members of the community. Here are some of our favorite projects to date:

d1-orm

An Object Relational Mapping (ORM) is a way for you to query and manipulate data by using JavaScript. Created by a Cloudflare Discord Community Champion, the d1-orm seeks to provide a strictly typed experience while using D1:

const users = new Model(
    // table name, primary keys, indexes etc
    tableDefinition,
    // column types, default values, nullable etc
    columnDefinitions
)

// TS helper for typed queries
type User = Infer<typeof users>;

// ORM-style query builder
const user = await users.First({
    where: {
        id: 1,
    },
});

You can check out the full documentation, and provide feedback by making an issue on the GitHub repository.

workers-qb

This is a zero-dependency query builder that provides a simple standardized interface while keeping the benefits and speed of using raw queries over a traditional ORM. While not intended to provide ORM-like functionality, workers-qb makes it easier to interact with the database from code for direct SQL access:

const qb = new D1QB(env.DB)

const fetched = await qb.fetchOne({
  tableName: 'employees',
  fields: 'count(*) as count',
  where: {
    conditions: 'department = ?1',
    params: ['HQ'],
  },
})

You can read more about the query builder here.

d1-console

Instead of running the wrangler d1 execute command in your terminal every time you want to interact with your database, you can interact with D1 from within the d1-console. Created by a Discord Community Champion, this gives the benefit of executing multi-line queries, obtaining command history, and viewing a cleanly formatted table output.

While this is a community project today, we plan to natively support a “D1 Console” in the future. For now, get started by checking out the d1-console package here.

D1 adapter for Kysely

Kysely is a type-safe and autocompletion-friendly TypeScript SQL query builder. With this adapter you can interact with D1 using the familiar Kysely interface:

// Create Kysely instance with kysely-d1
const db = new Kysely<Database>({ 
  dialect: new D1Dialect({ database: env.DB })
});
    
// Read row from D1 table
const result = await db
  .selectFrom('kv')
  .selectAll()
  .where('key', '=', key)
  .executeTakeFirst();

Check out the project here.

What’s still in testing?

The biggest pieces that have been disabled for this alpha release are replication and JavaScript transaction support. While we’ll be rolling out these changes gradually, we want to call out some limitations that exist today that we’re actively working on testing:

  • Database location: Each D1 database only runs a single instance. It’s created close to where you, as the developer, create the database, and does not currently move regions based on access patterns. Workers running elsewhere in the world will see higher latency as a result.
  • Concurrency limitations: Under high load, read and write queries may be queued rather than triggering new replicas to be created. As a result, the performance & throughput characteristics of the open alpha won’t be representative of the final product.
  • Availability limitations: Backups will block access to the DB while they’re running. In most cases this should only be a second or two, and any requests that arrive during the backup will be queued.

You can also check out a more detailed, up-to-date list on D1 alpha Limitations.

Request for feedback

While we can make all sorts of guesses and bets on the kind of databases you want to use D1 for, we are not the users – you are! We want developers from all backgrounds to preview the D1 tech at its early stages, and let us know where we need to improve to make it suitable for your production apps.

For general feedback about your experience and to interact with other folks in the alpha, join our #d1-open-alpha channel in the Cloudflare Developers Discord. We plan to make any important announcements and changes in this channel as well as on our monthly community calls.

To file more specific feature requests (no matter how wacky) and report any bugs, create a thread in the Cloudflare Community forum under the D1 category. We will be maintaining this forum as a way to plan for the months ahead!

Get started

Want to get started right away? Check out our D1 documentation to get started today. Build our classic Northwind Traders demo to explore the D1 experience and deploy your first D1 database!

Introducing the price-capacity-optimized allocation strategy for EC2 Spot Instances

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/introducing-price-capacity-optimized-allocation-strategy-for-ec2-spot-instances/

This blog post is written by Jagdeep Phoolkumar, Senior Specialist Solution Architect, Flexible Compute and Peter Manastyrny, Senior Product Manager Tech, EC2 Core.

Amazon EC2 Spot Instances are unused Amazon Elastic Compute Cloud (Amazon EC2) capacity in the AWS Cloud available at up to a 90% discount compared to On-Demand prices. One of the best practices for using EC2 Spot Instances is to be flexible across a wide range of instance types to increase the chances of getting the aggregate compute capacity. Amazon EC2 Auto Scaling and Amazon EC2 Fleet make it easy to configure a request with a flexible set of instance types, as well as use a Spot allocation strategy to determine how to fulfill Spot capacity from the Spot Instance pools that you provide in your request.

The existing allocation strategies available in Amazon EC2 Auto Scaling and Amazon EC2 Fleet are called “lowest-price” and “capacity-optimized”. The lowest-price allocation strategy allocates Spot Instance pools where the Spot price is currently the lowest. Customers told us that in some cases the lowest-price strategy picks the Spot Instance pools that are not optimized for capacity availability and results in more frequent Spot Instance interruptions. As an improvement over lowest-price allocation strategy, in August 2019 AWS launched the capacity-optimized allocation strategy for Spot Instances, which helps customers tap into the deepest Spot Instance pools by analyzing capacity metrics. Since then, customers have seen a significantly lower interruption rate with capacity-optimized strategy when compared to the lowest-price strategy. You can read more about these customer stories in the Capacity-Optimized Spot Instance Allocation in Action at Mobileye and Skyscanner blog post. The capacity-optimized allocation strategy strictly selects the deepest pools. Therefore, sometimes it can pick high-priced pools even when there are low-priced pools available with marginally less capacity. Customers have been telling us that, for an optimal experience, they would like an allocation strategy that balances the best trade-offs between lowest-price and capacity-optimized.

Today, we’re excited to share the new price-capacity-optimized allocation strategy that makes Spot Instance allocation decisions based on both the price and the capacity availability of Spot Instances. The price-capacity-optimized allocation strategy should be the first preference and the default allocation strategy for most Spot workloads.

This post illustrates how the price-capacity-optimized allocation strategy selects Spot Instances in comparison with lowest-price and capacity-optimized. Furthermore, it discusses some common use cases of the price-capacity-optimized allocation strategy.

Overview

The price-capacity-optimized allocation strategy makes Spot allocation decisions based on both capacity availability and Spot prices. In comparison to the lowest-price allocation strategy, the price-capacity-optimized strategy doesn’t always attempt to launch in the absolute lowest priced Spot Instance pool. Instead, price-capacity-optimized attempts to diversify as much as possible across the multiple low-priced pools with high capacity availability. As a result, the price-capacity-optimized strategy in most cases has a higher chance of getting Spot capacity and delivers lower interruption rates when compared to the lowest-price strategy. If you factor in the cost associated with retrying the interrupted requests, then the price-capacity-optimized strategy becomes even more attractive from a savings perspective over the lowest-price strategy.

We recommend the price-capacity-optimized allocation strategy for workloads that require optimization of cost savings, Spot capacity availability, and interruption rates. For existing workloads using lowest-price strategy, we recommend price-capacity-optimized strategy as a replacement. The capacity-optimized allocation strategy is still suitable for workloads that either use similarly priced instances, or ones where the cost of interruption is so significant that any cost saving is inadequate in comparison to a marginal increase in interruptions.

Walkthrough

In this section, we illustrate how the price-capacity-optimized allocation strategy deploys Spot capacity when compared to the other two allocation strategies. The following example configuration shows how Spot capacity could be allocated in an Auto Scaling group using the different allocation strategies:

{
    "AutoScalingGroupName": "myasg ",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-abcde12345"
            },
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": {
                            "Min": 4,
                            "Max": 4
                        },
                        "MemoryMiB": {
                            "Min": 0,
                            "Max": 16384
                        },
                        "InstanceGenerations": [
                            "current"
                        ],
                        "BurstablePerformance": "excluded",
                        "AcceleratorCount": {
                            "Max": 0
                        }
                    }
                }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "spot-allocation-strategy"
        }
    },
    "MinSize": 10,
    "MaxSize": 100,
    "DesiredCapacity": 60,
    "VPCZoneIdentifier": "subnet-a12345a,subnet-b12345b,subnet-c12345c"
}
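
If you save this configuration to a file, one way to create the Auto Scaling group from it is with boto3, as sketched below. The file name is a placeholder, and the snippet simply substitutes the spot-allocation-strategy placeholder from the configuration above before making the call.

import json
import boto3

autoscaling = boto3.client("autoscaling")

# Load the example configuration shown above; the file name is a placeholder.
with open("my_asg_spot_allocation_configuration.json") as f:
    config = json.load(f)

# Set the strategy you want to test, e.g. "price-capacity-optimized",
# "capacity-optimized", or "lowest-price".
config["MixedInstancesPolicy"]["InstancesDistribution"]["SpotAllocationStrategy"] = (
    "price-capacity-optimized"
)

autoscaling.create_auto_scaling_group(**config)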

First, Amazon EC2 Auto Scaling attempts to balance capacity evenly across Availability Zones (AZs). Next, Amazon EC2 Auto Scaling applies the Spot allocation strategy using the 30+ instance types selected by attribute-based instance type selection in each Availability Zone. The results after testing the different allocation strategies are as follows:

  • The price-capacity-optimized strategy diversifies over multiple low-priced Spot Instance pools that are optimized for capacity availability.
  • The capacity-optimized strategy identifies Spot Instance pools that are only optimized for capacity availability.
  • The lowest-price strategy by default allocates the two lowest-priced Spot Instance pools that aren't optimized for capacity availability.

To find out how each allocation strategy fares regarding Spot savings and capacity, we compare ‘Cost of Auto Scaling group’ (number of instances x Spot price/hour for each type of instance) and ‘Spot interruptions rate’ (number of instances interrupted/number of instances launched) for each allocation strategy. We use fictional numbers for the purpose of this post. However, you can use the Cloud Intelligence Dashboards to find the actual Spot Saving, and the Amazon EC2 Spot interruption dashboard to log Spot Instance interruptions. The example results after a 30-day period are as follows:

Allocation strategy         Instance allocation             Cost of Auto Scaling group   Spot interruptions rate
price-capacity-optimized    40 c6i.xlarge, 20 c5.xlarge     $4.80/hour                   3%
capacity-optimized          60 c5.xlarge                    $5.00/hour                   2%
lowest-price                30 c5a.xlarge, 30 m5n.xlarge    $4.75/hour                   20%

As per the above table, with the price-capacity-optimized strategy, the cost of the Auto Scaling group is only 5 cents (1%) higher, whereas the rate of Spot interruptions is six times lower (3% vs 20%) than the lowest-price strategy. In summary, from this exercise you learn that the price-capacity-optimized strategy provides the optimal Spot experience that is the best of both the lowest-price and capacity-optimized allocation strategies.

Common use-cases of price-capacity-optimized allocation strategy

Earlier we mentioned that the price-capacity-optimized allocation strategy is recommended for most Spot workloads. To elaborate further, in this section we explore some of these common workloads.

Stateless and fault-tolerant workloads

Stateless workloads that can complete ongoing requests within two minutes of a Spot interruption notice, and the fault-tolerant workloads that have a low cost of retries, are the best fit for the price-capacity-optimized allocation strategy. This category has workloads such as stateless containerized applications, microservices, web applications, data and analytics jobs, and batch processing.

Workloads with a high cost of interruption

Workloads that have a high cost of interruption associated with an expensive cost of retries should implement checkpointing to lower the cost of interruptions. By using checkpointing, you make the price-capacity-optimized allocation strategy a good fit for these workloads, as it allocates capacity from the low-priced Spot Instance pools that offer a low Spot interruptions rate. This category has workloads such as long Continuous Integration (CI), image and media rendering, Deep Learning, and High Performance Compute (HPC) workloads.

Conclusion

We recommend that customers use the price-capacity-optimized allocation strategy as the default option. The price-capacity-optimized strategy helps Amazon EC2 Auto Scaling groups and Amazon EC2 Fleet provision target capacity with an optimal experience. Updating to the price-capacity-optimized allocation strategy is as simple as updating a single parameter in an Amazon EC2 Auto Scaling group and Amazon EC2 Fleet.
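
For example, a minimal boto3 sketch of that single-parameter update might look like the following, assuming an existing group named myasg and that properties you don't specify in MixedInstancesPolicy are left as they are:

import boto3

autoscaling = boto3.client("autoscaling")

# Switch an existing Auto Scaling group's Spot allocation strategy. The group
# name is a placeholder; properties of the mixed instances policy that aren't
# specified here are expected to remain unchanged.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="myasg",
    MixedInstancesPolicy={
        "InstancesDistribution": {
            "SpotAllocationStrategy": "price-capacity-optimized"
        }
    },
)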

To learn more about allocation strategies for Spot Instances, visit the Spot allocation strategies documentation page.

Simplifying Amazon EC2 instance type flexibility with new attribute-based instance type selection features

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/simplifying-amazon-ec2-instance-type-flexibility-with-new-attribute-based-instance-type-selection-features/

This blog is written by Rajesh Kesaraju, Sr. Solution Architect, EC2-Flexible Compute and Peter Manastyrny, Sr. Product Manager, EC2.

Today AWS is adding two new attributes for the attribute-based instance type selection (ABS) feature to make it even easier to create and manage instance type flexible configurations on Amazon EC2. The new network bandwidth attribute allows customers to request instances based on the network requirements of their workload. The new allowed instance types attribute is useful for workloads that have some instance type flexibility but still need more granular control over which instance types to run on.

The two new attributes are supported in EC2 Auto Scaling Groups (ASG), EC2 Fleet, Spot Fleet, and Spot Placement Score.

Before exploring the new attributes in detail, let us review the core ABS capability.

ABS refresher

ABS lets you express your instance type requirements as a set of attributes, such as vCPU, memory, and storage when provisioning EC2 instances with ASG, EC2 Fleet, or Spot Fleet. Your requirements are translated by ABS to all matching EC2 instance types, simplifying the creation and maintenance of instance type flexible configurations. ABS identifies the instance types based on attributes that you set in ASG, EC2 Fleet, or Spot Fleet configurations. When Amazon EC2 releases new instance types, ABS will automatically consider them for provisioning if they match the selected attributes, removing the need to update configurations to include new instance types.

ABS helps you to shift from an infrastructure-first to an application-first paradigm. ABS is ideal for workloads that need generic compute resources and do not necessarily require the hardware differentiation that the Amazon EC2 instance type portfolio delivers. By defining a set of compute attributes instead of specific instance types, you allow ABS to always consider the broadest and newest set of instance types that qualify for your workload. When you use EC2 Spot Instances to optimize your costs and save up to 90% compared to On-Demand prices, instance type diversification is the key to access the highest amount of Spot capacity. ABS provides an easy way to configure and maintain instance type flexible configurations to run fault-tolerant workloads on Spot Instances.

We recommend ABS as the default compute provisioning method for instance type flexible workloads including containerized apps, microservices, web applications, big data, and CI/CD.

Now, let us dive deep on the two new attributes: network bandwidth and allowed instance types.

How network bandwidth attribute for ABS works

Network bandwidth attribute allows customers with network-sensitive workloads to specify their network bandwidth requirements for compute infrastructure. Some of the workloads that depend on network bandwidth include video streaming, networking appliances (e.g., firewalls), and data processing workloads that require faster inter-node communication and high-volume data handling.

The network bandwidth attribute uses the same min/max format as other ABS attributes (e.g., vCPU count or memory) that assume a numeric value or range (e.g., min: ‘10’ or min: ‘15’; max: ‘40’). Note that setting the minimum network bandwidth does not guarantee that your instance will achieve that network bandwidth. ABS will identify instance types that support the specified minimum bandwidth, but the actual bandwidth of your instance might go below the specified minimum at times.

Two important things to remember when using the network bandwidth attribute are:

  • ABS will only take burst bandwidth values into account when evaluating maximum values. When evaluating minimum values, only the baseline bandwidth will be considered.
    • For example, if you specify the minimum bandwidth as 10 Gbps, instances that have burst bandwidth of “up to 10 Gbps” will not be considered, as their baseline bandwidth is lower than the minimum requested value (e.g., m5.4xlarge is burstable up to 10 Gbps with a baseline bandwidth of 5 Gbps).
    • Alternatively, c5n.2xlarge, which is burstable up to 25 Gbps with a baseline bandwidth of 10 Gbps will be considered because its baseline bandwidth meets the minimum requested value.
  • Our recommendation is to only set a value for maximum network bandwidth if you have specific requirements to restrict instances with higher bandwidth. That would help to ensure that ABS considers the broadest possible set of instance types to choose from.

Using the network bandwidth attribute in ASG

In this example, let us look at a high-performance computing (HPC) workload or similar network bandwidth sensitive workload that requires a high volume of inter-node communications. We use ABS to select instances that have at minimum 10 Gbps of network bandwidth and at least 32 vCPUs and 64 GiB of memory.

To get started, you can create or update an ASG or EC2 Fleet set up with ABS configuration and specify the network bandwidth attribute.

The following example shows an ABS configuration with network bandwidth attribute set to a minimum of 10 Gbps. In this example, we do not set a maximum limit for network bandwidth. This is done to remain flexible and avoid restricting available instance type choices that meet our minimum network bandwidth requirement.

Create the following configuration file and name it: my_asg_network_bandwidth_configuration.json

{
    "AutoScalingGroupName": "network-bandwidth-based-instances-asg",
    "DesiredCapacityType": "units",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "LaunchTemplate-x86",
                "Version": "$Latest"
            },
            "Overrides": [
                {
                "InstanceRequirements": {
                    "VCpuCount": {"Min": 32},
                    "MemoryMiB": {"Min": 65536},
                    "NetworkBandwidthGbps": {"Min": 10} }
                 }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 30,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    },
    "MinSize": 1,
    "MaxSize": 10,
    "DesiredCapacity":10,
    "VPCZoneIdentifier": "subnet-f76e208a, subnet-f76e208b, subnet-f76e208c"
}

Next, let us create an ASG from the my_asg_network_bandwidth_configuration.json file using the following command:

aws autoscaling create-auto-scaling-group --cli-input-json file://my_asg_network_bandwidth_configuration.json

As a result, you have created an ASG that may include instance types m5.8xlarge, m5.12xlarge, m5.16xlarge, m5n.8xlarge, and c5.9xlarge, among others. The actual selection at the time of the request is made by the capacity-optimized Spot allocation strategy. If EC2 releases an instance type in the future that satisfies the attributes provided in the request, that instance type will also be automatically considered for provisioning.

Considered Instances (not an exhaustive list)

Instance Type    Network Bandwidth
m5.8xlarge       10 Gbps
m5.12xlarge      12 Gbps
m5.16xlarge      20 Gbps
m5n.8xlarge      25 Gbps
c5.9xlarge       10 Gbps
c5.12xlarge      12 Gbps
c5.18xlarge      25 Gbps
c5n.9xlarge      50 Gbps
c5n.18xlarge     100 Gbps
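
If you would like to preview which instance types match a set of attributes before creating an ASG, one option is the GetInstanceTypesFromInstanceRequirements API. The boto3 sketch below mirrors the attributes used above; the architecture and virtualization values are assumptions you would adjust to match your own launch template.

import boto3

ec2 = boto3.client("ec2")

# Preview instance types that match the attributes used in the ASG example
# above. ArchitectureTypes and VirtualizationTypes are required by this API;
# the values here are assumptions matching a typical x86_64 launch template.
response = ec2.get_instance_types_from_instance_requirements(
    ArchitectureTypes=["x86_64"],
    VirtualizationTypes=["hvm"],
    InstanceRequirements={
        "VCpuCount": {"Min": 32},
        "MemoryMiB": {"Min": 65536},
        "NetworkBandwidthGbps": {"Min": 10},
    },
)

for match in response["InstanceTypes"]:
    print(match["InstanceType"])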

Now let us focus our attention on another new attribute – allowed instance types.

How allowed instance types attribute works in ABS

As discussed earlier, ABS lets us provision compute infrastructure based on our application requirements instead of selecting specific EC2 instance types. Although this infrastructure-agnostic approach is suitable for many workloads, some workloads, while having some instance type flexibility, still need to limit the selection to specific instance families and/or generations for reasons like licensing or compliance requirements, application performance benchmarking, and others. Furthermore, customers have asked us for the ability to restrict the automatic consideration of newly released instance types in their ABS configurations so that they can meet their specific hardware qualification requirements before adopting them for their workload. To provide this functionality, we added a new allowed instance types attribute to ABS.

The allowed instance types attribute allows ABS customers to narrow down the list of instance types that ABS considers for selection to a specific list of instances, families, or generations. It takes a comma separated list of specific instance types, instance families, and wildcard (*) patterns. Please note, that it does not use the full regular expression syntax.

For example, consider container-based web application that can only run on any 5th generation instances from compute optimized (c), general purpose (m), or memory optimized (r) families. It can be specified as “AllowedInstanceTypes”: [“c5*”, “m5*”,”r5*”].

Another example could be to limit the ABS selection to only memory-optimized instances for big data Spark workloads. It can be specified as “AllowedInstanceTypes”: [“r6*”, “r5*”, “r4*”].

Note that you cannot use the existing excluded instance types attribute and the new allowed instance types attribute together; doing so results in a validation error.

Using the allowed instance types attribute in an ASG

Let us look at the InstanceRequirements section of an ASG configuration file for a sample web application. The AllowedInstanceTypes attribute is configured as ["c5.*", "m5.*", "c4.*", "m4.*"], which means that ABS will limit the instance type consideration set to any instance from the 4th and 5th generations of the c or m families. Additional attributes require a minimum of 4 vCPUs and 16 GiB of RAM, and allow both Intel and AMD processors.

Create the following configuration file and name it: my_asg_allow_instance_types_configuration.json

{
    "AutoScalingGroupName": "allow-instance-types-based-instances-asg",
    "DesiredCapacityType": "units",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "LaunchTemplate-x86",
                "Version": "$Latest"
            },
            "Overrides": [
                {
                "InstanceRequirements": {
                    "VCpuCount": {"Min": 4},
                    "MemoryMiB": {"Min": 16384},
                    "CpuManufacturers": ["intel","amd"],
                    "AllowedInstanceTypes": ["c5.*", "m5.*","c4.*", "m4.*"] }
            }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 30,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    },
    "MinSize": 1,
    "MaxSize": 10,
    "DesiredCapacity":10,
    "VPCZoneIdentifier": "subnet-f76e208a, subnet-f76e208b, subnet-f76e208c"
}

As a result, you have created an ASG that may include instance types like m5.xlarge, m5.2xlarge, c5.xlarge, and c5.2xlarge, among others. The actual selection at the time of the request is made by the capacity-optimized Spot allocation strategy. Note that if EC2 releases a new instance type in the future that satisfies the other attributes provided in the request, but is not a member of the 4th or 5th generation of the m or c families specified in the allowed instance types attribute, that instance type will not be considered for provisioning.

Selected Instances (not an exhaustive list)

m5.xlarge
m5.2xlarge
m5.4xlarge
c5.xlarge
c5.2xlarge
m4.xlarge
m4.2xlarge
m4.4xlarge
c4.xlarge
c4.2xlarge

As you can see, ABS considers a broad set of instance types for provisioning; however, they all meet the compute attributes required for your workload.

Cleanup

To delete both ASGs and terminate all the instances, execute the following commands:

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name network-bandwidth-based-instances-asg --force-delete

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name allow-instance-types-based-instances-asg --force-delete

Conclusion

In this post, we explored the two new ABS attributes – network bandwidth and allowed instance types. You can use these attributes to select instances based on network bandwidth and to limit the set of instances that ABS selects from. The two new attributes, together with the existing set of ABS attributes, help you save time on creating and maintaining instance type flexible configurations and make it even easier to express the compute requirements of your workload.

ABS represents a paradigm shift in the way our customers interact with compute, making it easier than ever to request diversified compute resources at scale. We recommend ABS as a tool to help you identify and access the largest amount of EC2 compute capacity for your instance type flexible workloads.

Building highly resilient applications with on-premises interdependencies using AWS Local Zones

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/building-highly-resilient-applications-with-on-premises-interdependencies-using-aws-local-zones/

This blog post is written by Rachel Rui Liu, Senior Solutions Architect.

AWS Local Zones are a type of infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers.

Following the successful launch of AWS Local Zones in 16 US cities since 2019, AWS announced in February 2022 plans to launch new AWS Local Zones in 32 metropolitan areas across 26 countries worldwide.

With Local Zones, we’ve seen use cases in two common categories.

The first category of use cases is for workloads that require extremely low latency between end-user devices and workload servers. For example, let’s consider media content creation and real-time multiplayer gaming. For these use cases, deploying the workload to a Local Zone can help achieve down to single-digit milliseconds latency between end-user devices and the AWS infrastructure, which is ideal for a good end-user experience.

This post focuses on the second category of use cases, commonly seen in enterprise hybrid architectures, where customers must achieve low latency between AWS infrastructure and existing on-premises data centers. Compared to the first category of use cases, these use cases can tolerate slightly higher latency between the end-user devices and the AWS infrastructure. However, these workloads depend on on-premises systems, so the lowest possible latency between AWS infrastructure and on-premises data centers is required for better application performance. Here are a few examples of these systems:

  • Financial services sector mainframe workloads hosted on premises serving regional customers.
  • Enterprise Active Directory hosted on premises serving cloud and on-premises workloads.
  • Enterprise applications hosted on premises processing a high volume of locally generated data.

For workloads deployed in AWS, the time taken for each interaction with components still hosted in the on-premises data center is increased by the latency. In turn, this delays responses received by the end-user. The total latency accumulates and results in suboptimal user experiences.

By deploying modernized workloads in Local Zones, you can reduce latency while continuing to access systems hosted in on-premises data centers, thereby reducing the total latency for the end-user. At the same time, you can enjoy the benefits of agility, elasticity, and security offered by AWS, and can apply the same automation, compliance, and security best practices that you’ve been familiar with in the AWS Regions.

Enterprise workload resiliency with Local Zones

While designing hybrid architectures with Local Zones, resiliency is an important consideration. You want to route traffic to the nearest Local Zone for low latency. However, when disasters happen, it’s critical to fail over to the parent Region automatically.

Let’s look at the details of hybrid architecture design based on real world deployments from different angles to understand how the architecture achieves all of the design goals.

Hybrid architecture with resilient network connectivity

The following diagram shows a high-level overview of a resilient enterprise hybrid architecture with Local Zones, where you have redundant connections between the AWS Region, the Local Zone, and the corporate data center.

Resilient network connectivity

Here are a few key points with this network connectivity design:

  1. Use AWS Direct Connect or Site-to-Site VPN to connect the corporate data center and AWS Region.
  2. Use Direct Connect or self-hosted VPN to connect the corporate data center and the Local Zone. This connection will provide dedicated low-latency connectivity between the Local Zone and corporate data center.
  3. Transit Gateway is a regional service. When attaching the VPC to AWS Transit Gateway, you can only add subnets provisioned in the Region. Instances on subnets in the Local Zone can still use Transit Gateway to reach resources in the Region.
  4. For subnets provisioned in the Region, the VPC route table should be configured to route the traffic to the corporate data center via Transit Gateway.
  5. For subnets provisioned in the Local Zone, the VPC route table should be configured to route the traffic to the corporate data center via the self-hosted VPN instance or Direct Connect (a minimal sketch follows this list).
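
Here is a minimal, hypothetical sketch of step 5, adding a route from a Local Zone subnet's route table toward a self-hosted VPN instance; the route table ID, on-premises CIDR, and network interface ID are placeholder assumptions.

# Hypothetical sketch: route corporate data center traffic from a Local Zone
# subnet's route table to the self-hosted VPN instance's network interface.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # Region is an assumption

ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",        # route table of the Local Zone subnet
    DestinationCidrBlock="10.100.0.0/16",        # corporate data center network (placeholder)
    NetworkInterfaceId="eni-0123456789abcdef0",  # ENI of the self-hosted VPN instance
)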

Hybrid architecture with resilient workload deployment

The next examples show a public and a private facing workload.

To simplify the diagram and focus on application layer architecture, the following diagrams assume that you are using Direct Connect to connect between AWS and the on-premises data center.

Example 1: Resilient public facing workload

With a public facing workload, end-user traffic will be routed to the Local Zone. If the Local Zone is unavailable, then the traffic will be routed to the Region automatically using an Amazon Route 53 failover policy.

Public facing workload resiliency

Here are the key design considerations for this architecture:

  1. Deploy the workload in the Local Zone and put the compute layer in an Amazon EC2 Auto Scaling group, so that the application can scale up and down depending on the volume of requests.
  2. Deploy the workload in both the Local Zone and an AWS Region, and put the compute layer into an Auto Scaling group. The regional deployment will act as a pilot light or warm standby with a minimal footprint, but it can scale out when the Local Zone is unavailable.
  3. Two Application Load Balancers (ALBs) are required: one in the Region and one in the Local Zone. Each ALB will dispatch the traffic to the workload cluster inside the Auto Scaling group local to it.
  4. An internet gateway is required for public facing workloads. When using a Local Zone, there’s no extra configuration needed: define a single internet gateway and attach it to the VPC.

If you want to specify an Elastic IP address as the workload's public endpoint, note that the Local Zone has a different address pool than the Region, and that BYOIP is unsupported for Local Zones.

  5. Create a Route 53 DNS record with “Failover” as the routing policy.
  • For the primary record, point it to the alias of the ALB in the Local Zone. This sets the Local Zone as the preferred destination for application traffic, which minimizes latency for end-users.
  • For the secondary record, point it to the alias of the ALB in the AWS Region.
  • Enable health checks for the primary record. If the health check against the primary record fails, indicating that the workload deployed in the Local Zone is not responding, then Route 53 automatically fails over to the secondary record, which points to the workload deployed in the AWS Region.
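
The failover records described above can be created in several ways; the following is a minimal, hypothetical boto3 sketch in which the hosted zone ID, record name, ALB DNS names, and ALB alias hosted zone IDs are all placeholder assumptions, and target health evaluation stands in for the health check on the primary record.

# Hypothetical sketch of the Route 53 failover records described above.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",   # public hosted zone (placeholder)
    ChangeBatch={
        "Changes": [
            {   # Primary record: alias of the ALB in the Local Zone
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "local-zone-primary",
                    "Failover": "PRIMARY",
                    "AliasTarget": {
                        "HostedZoneId": "Z1111111111111",  # hosted zone ID of the Local Zone ALB (placeholder)
                        "DNSName": "local-zone-alb-123.us-west-2.elb.amazonaws.com",
                        "EvaluateTargetHealth": True,      # health evaluation for the primary endpoint
                    },
                },
            },
            {   # Secondary record: alias of the ALB in the AWS Region
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "region-secondary",
                    "Failover": "SECONDARY",
                    "AliasTarget": {
                        "HostedZoneId": "Z2222222222222",  # hosted zone ID of the regional ALB (placeholder)
                        "DNSName": "regional-alb-456.us-west-2.elb.amazonaws.com",
                        "EvaluateTargetHealth": True,
                    },
                },
            },
        ]
    },
)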

Example 2: Resilient private workload

For a private workload that’s only accessible by internal users, a few extra considerations must be made to keep the traffic inside of the trusted private network.

Private workload resiliency

The architecture for a resilient private facing workload follows the same steps as the public facing workload, with some key differences:

  1. Instead of using a public hosted zone, create private hosted zones in Route 53 to respond to DNS queries for the workload.
  2. Create the primary and secondary records in Route 53 just like the public workload but referencing the private ALBs.
  3. To allow end-users on the corporate network (within offices or connected via VPN) to resolve the workload, use the Route 53 Resolver with an inbound endpoint (a minimal sketch follows this list). This allows end-users located on premises to resolve the records in the private hosted zone. Route 53 Resolver is designed to integrate with an on-premises DNS server.
  4. No internet gateway is required for hosting the private workload. You might need an internet gateway in the Local Zone for other purposes: for example, to host a self-managed VPN solution to connect the Local Zone with the corporate data center.
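
As a minimal, hypothetical sketch of step 3, a Route 53 Resolver inbound endpoint could be created as follows; the subnet IDs, security group, and names are placeholder assumptions, and two IP addresses are specified for redundancy.

# Hypothetical sketch: inbound Resolver endpoint so on-premises DNS servers can
# forward queries for the private hosted zone into the VPC.
import boto3

resolver = boto3.client("route53resolver", region_name="us-west-2")  # Region is an assumption

resolver.create_resolver_endpoint(
    CreatorRequestId="private-workload-inbound-1",   # idempotency token
    Name="onprem-to-private-hosted-zone",
    Direction="INBOUND",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    IpAddresses=[
        {"SubnetId": "subnet-0aaaaaaaaaaaaaaa0"},
        {"SubnetId": "subnet-0bbbbbbbbbbbbbbb0"},
    ],
)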

Hosting multiple workloads

Customers who host multiple workloads in a single VPC generally must consider how to segregate those workloads. As with workloads in the AWS Region, segregation can be implemented at a subnet or VPC level.

If you want to segregate workloads at the subnet level, you can extend your existing VPC architecture by provisioning extra sets of subnets in the Local Zone (a minimal sketch follows the diagram).

segregate workloads at subnet level
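
The extra subnets are created the same way as Regional subnets, by targeting the Local Zone's name. The following is a minimal, hypothetical sketch; the VPC ID, CIDR block, and Local Zone name are placeholder assumptions, and the Local Zone must already be enabled for your account.

# Hypothetical sketch: add a subnet in a Local Zone to an existing VPC.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # parent Region of the Local Zone (assumption)

ec2.create_subnet(
    VpcId="vpc-0123456789abcdef0",
    CidrBlock="10.0.8.0/24",              # placeholder CIDR for the new workload subnet
    AvailabilityZone="us-west-2-lax-1a",  # Local Zone name instead of a Regional AZ (placeholder)
)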

Although not shown in the diagram, for those of you using a self-hosted VPN to connect the Local Zone with an on-premises data center, the VPN solution can be deployed in a centralized subnet.

You can continue to use security groups, network access control lists (NACLs), and VPC route tables to segregate the workloads, just as you would in the Region.

If you want to segregate workloads at the VPC level, as many of our customers do, note that within the Region inter-VPC routing is generally handled by Transit Gateway. In this case, however, it may be undesirable to send traffic to the Region to reach a subnet in another VPC that is also extended to the Local Zone.

segregate workloads at VPC level

Key considerations for this design are as follows:

  1. Direct Connect is deployed to connect the Local Zone with the corporate data center. Therefore, each VPC will have a dedicated Virtual Private Gateway provisioned to allow association with the Direct Connect Gateway.
  2. To enable inter-VPC traffic within the Local Zone, peer the two VPCs together.
  3. Create a VPC route table in VPC A. Add a route for Subnet Y where the destination is the peering link. Assign this route table to Subnet X.
  4. Create a VPC route table in VPC B. Add a route for Subnet X where the destination is the peering link. Assign this route table to Subnet Y.
  5. If necessary, add routes for on-premises networks and the transit gateway to both route tables.

This design allows traffic between subnets X and Y to stay within the Local Zone, thereby avoiding any latency from the Local Zone to the AWS Region while still permitting full connectivity to all other networks.
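
As a minimal, hypothetical sketch of steps 2 through 4, the peering connection and routes could be created as follows; every ID and CIDR below is a placeholder assumption.

# Hypothetical sketch: peer VPC A and VPC B and keep subnet-to-subnet traffic
# on the peering link within the Local Zone.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # Region is an assumption

# Step 2: peer the two VPCs and accept the peering request.
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-0aaaaaaaaaaaaaaa0",       # VPC A
    PeerVpcId="vpc-0bbbbbbbbbbbbbbb0",   # VPC B
)
peering_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)

# Step 3: in VPC A's route table (assigned to Subnet X), route Subnet Y via the peering link.
ec2.create_route(
    RouteTableId="rtb-0aaaaaaaaaaaaaaa0",
    DestinationCidrBlock="10.1.1.0/24",  # Subnet Y CIDR (placeholder)
    VpcPeeringConnectionId=peering_id,
)

# Step 4: in VPC B's route table (assigned to Subnet Y), route Subnet X via the peering link.
ec2.create_route(
    RouteTableId="rtb-0bbbbbbbbbbbbbbb0",
    DestinationCidrBlock="10.0.1.0/24",  # Subnet X CIDR (placeholder)
    VpcPeeringConnectionId=peering_id,
)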

Conclusion

In this post, we summarized the use cases for enterprise hybrid architecture with Local Zones, and showed you:

  • Reference architectures to host workloads in Local Zones with low-latency connectivity to corporate data centers and resiliency that enables automatic failover to the AWS Region.
  • Different design considerations for public and private facing workloads utilizing this hybrid architecture.
  • Segregation and connectivity considerations when extending this hybrid architecture to host multiple workloads.

Hopefully you will be able to follow along with these reference architectures to build and run highly resilient applications with local system interdependencies using Local Zones.

Amazon EC2 Trn1 Instances for High-Performance Model Training are Now Available

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/amazon-ec2-trn1-instances-for-high-performance-model-training-are-now-available/

Deep learning (DL) models have been increasing in size and complexity over the last few years, pushing the time to train from days to weeks. Training large language models the size of GPT-3 can take months, leading to an exponential growth in training cost. To reduce model training times and enable machine learning (ML) practitioners to iterate fast, AWS has been innovating across chips, servers, and data center connectivity.

At AWS re:Invent 2021, we announced the preview of Amazon EC2 Trn1 instances powered by AWS Trainium chips. AWS Trainium is optimized for high-performance deep learning training and is the second-generation ML chip built by AWS, following AWS Inferentia.

Today, I’m excited to announce that Amazon EC2 Trn1 instances are now generally available! These instances are well-suited for large-scale distributed training of complex DL models across a broad set of applications, such as natural language processing, image recognition, and more.

Compared to Amazon EC2 P4d instances, Trn1 instances deliver 1.4x the teraFLOPS for BF16 data types, 2.5x more teraFLOPS for TF32 data types, 5x the teraFLOPS for FP32 data types, 4x inter-node network bandwidth, and up to 50 percent cost-to-train savings. Trn1 instances can be deployed in EC2 UltraClusters that serve as powerful supercomputers to rapidly train complex deep learning models. I’ll share more details on EC2 UltraClusters later in this blog post.

New Trn1 Instance Highlights
Trn1 instances are available today in two sizes and are powered by up to 16 AWS Trainium chips with 128 vCPUs. They provide high-performance networking and storage to support efficient data and model parallelism, popular strategies for distributed training.

Trn1 instances offer up to 512 GB of high-bandwidth memory, deliver up to 3.4 petaFLOPS of TF32/FP16/BF16 compute power, and feature an ultra-high-speed NeuronLink interconnect between chips. NeuronLink helps avoid communication bottlenecks when scaling workloads across multiple Trainium chips.

Trn1 instances are also the first EC2 instances to enable up to 800 Gbps of Elastic Fabric Adapter (EFA) network bandwidth for high-throughput network communication. This second generation EFA delivers lower latency and up to 2x more network bandwidth compared to the previous generation. Trn1 instances also come with up to 8 TB of local NVMe SSD storage for ultra-fast access to large datasets.

The following table lists the sizes and specs of Trn1 instances in detail.

Instance Name    vCPUs    AWS Trainium Chips    Accelerator Memory    NeuronLink    Instance Memory    Instance Networking    Local Instance Storage
trn1.2xlarge     8        1                     32 GB                 N/A           32 GB              Up to 12.5 Gbps        1x 500 GB NVMe
trn1.32xlarge    128      16                    512 GB                Supported     512 GB             800 Gbps               4x 2 TB NVMe

Trn1 EC2 UltraClusters
For large-scale model training, Trn1 instances integrate with Amazon FSx for Lustre high-performance storage and are deployed in EC2 UltraClusters. EC2 UltraClusters are hyperscale clusters interconnected with a non-blocking petabit-scale network. This gives you on-demand access to a supercomputer to cut model training time for large and complex models from months to weeks or even days.

Amazon EC2 Trn1 UltraCluster

AWS Trainium Innovation
AWS Trainium chips include specific scalar, vector, and tensor engines that are purpose-built for deep learning algorithms. This ensures higher chip utilization as compared to other architectures, resulting in higher performance.

Here is a short summary of additional hardware innovations:

  • Data Types: AWS Trainium supports a wide range of data types, including FP32, TF32, BF16, FP16, and UINT8, so you can choose the most suitable data type for your workloads. It also supports a new, configurable FP8 (cFP8) data type, which is especially relevant for large models because it reduces the memory footprint and I/O requirements of the model.
  • Hardware-Optimized Stochastic Rounding: Stochastic rounding achieves close to FP32-level accuracy with faster BF16-level performance when you enable auto-casting from FP32 to BF16 data types. Stochastic rounding is a different way of rounding floating-point numbers that is better suited to machine learning workloads than the commonly used round-to-nearest-even rounding. By setting the environment variable NEURON_RT_STOCHASTIC_ROUNDING_EN=1 to use stochastic rounding (a minimal snippet follows this list), you can train a model up to 30 percent faster.
  • Custom Operators, Dynamic Tensor Shapes: AWS Trainium also supports custom operators written in C++ and dynamic tensor shapes. Dynamic tensor shapes are key for models with unknown input tensor sizes, such as models processing text.
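
Below is a minimal sketch of enabling the stochastic rounding environment variable mentioned above from within a Python training script; exporting it in the shell before launching training works equally well, and setting it before the Neuron runtime initializes is an assumption on my part.

# Minimal sketch: enable stochastic rounding before training starts.
import os

os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "1"  # enable hardware stochastic rounding

# ...then import torch / torch_xla and run training as shown later in this post.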

AWS Trainium shares the same AWS Neuron SDK as AWS Inferentia, making it easy for everyone who is already using AWS Inferentia to get started with AWS Trainium.

For model training, the Neuron SDK consists of a compiler, framework extensions, a runtime library, and developer tools. The Neuron plugin natively integrates with popular ML frameworks, such as PyTorch and TensorFlow.

The AWS Neuron SDK supports just-in-time (JIT) compilation, in addition to ahead-of-time (AOT) compilation, to speed up model compilation, and Eager Debug Mode, for a step-by-step execution.

To compile and run your model on AWS Trainium, you need to change only a few lines of code in your training script. You don’t need to tweak your model or think about data type conversion.

Get Started with Trn1 Instances
In this example, I train a PyTorch model on an EC2 Trn1 instance using the available PyTorch Neuron packages. PyTorch Neuron is based on the PyTorch XLA software package and enables conversion of PyTorch operations to AWS Trainium instructions.

Each AWS Trainium chip includes two NeuronCore accelerators, which are the main neural network compute units. With only a few changes to your training code, you can train your PyTorch model on AWS Trainium NeuronCores.

SSH into the Trn1 instance and activate a Python virtual environment that includes the PyTorch Neuron packages. If you’re using a Neuron-provided AMI, you can activate the preinstalled environment by running the following command:

source aws_neuron_venv_pytorch_p36/bin/activate

Before you can run your training script, you need to make a few modifications. On Trn1 instances, the default XLA device should be mapped to a NeuronCore.

Let’s start by adding the PyTorch XLA imports to your training script:

import torch, torch_xla
import torch_xla.core.xla_model as xm

Then, place your model and tensors onto an XLA device:

model.to(xm.xla_device())
tensor.to(xm.xla_device())

When the model is moved to the XLA device (NeuronCore), subsequent operations on the model are recorded for later execution. This is XLA’s lazy execution, which differs from PyTorch’s eager execution. Within the training loop, you have to mark the graph to be optimized and run on the XLA device using xm.mark_step(). Without this mark, XLA cannot determine where the graph ends.

...
for data, target in train_loader:
	optimizer.zero_grad()   # clear gradients accumulated from the previous step
	output = model(data)    # forward pass (recorded lazily by XLA)
	loss = loss_fn(output, target)
	loss.backward()         # backward pass
	optimizer.step()        # update the model parameters
	xm.mark_step()          # cut and execute the recorded graph on the NeuronCore
...

You can now run your training script using torchrun <my_training_script>.py.

When running the training script, you can configure the number of NeuronCores to use for training with the torchrun --nproc_per_node option.

For example, to run a multi-worker data parallel model training on all 32 NeuronCores in one trn1.32xlarge instance, run torchrun --nproc_per_node=32 <my_training_script>.py.

Data parallel is a strategy for distributed training that allows you to replicate your script across multiple workers, with each worker processing a portion of the training dataset. The workers then share their results with each other.
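
To make that concrete, here is a minimal sketch of a data parallel training loop on Trn1 using the torch_xla utilities shown earlier; the dataset, model, loss function, optimizer settings, and batch size are placeholder assumptions, and it assumes the script is launched with torchrun as described above.

# Minimal data parallel sketch for NeuronCore workers launched via torchrun.
import torch
import torch_xla.core.xla_model as xm
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(dataset, model, loss_fn, epochs=1):
    device = xm.xla_device()                 # map this worker to a NeuronCore
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # placeholder optimizer

    # Each worker reads a distinct shard of the dataset.
    sampler = DistributedSampler(
        dataset,
        num_replicas=xm.xrt_world_size(),    # total workers in the job
        rank=xm.get_ordinal(),               # this worker's index
        shuffle=True,
    )
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)             # reshuffle shards each epoch
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(data), target)
            loss.backward()
            xm.optimizer_step(optimizer)     # all-reduce gradients across workers, then step
            xm.mark_step()                   # cut and execute the recorded XLA graph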

For more details on supported ML frameworks, model types, and how to prepare your model training script for large-scale distributed training across trn1.32xlarge instances, have a look at the AWS Neuron SDK documentation.

Profiling Tools
Let’s have a quick look at useful tools to keep track of your ML experiments and profile Trn1 instance resource consumption. Neuron integrates with TensorBoard to track and visualize your model training metrics.

AWS Neuron SDK TensorBoard integration

On the Trn1 instance, you can use the neuron-ls command to describe the number of Neuron devices present in the system, along with the associated NeuronCore count, memory, connectivity/topology, PCI device information, and the Python process that currently has ownership of the NeuronCores:

AWS Neuron SDK neuron-ls command

Similarly, you can use the neuron-top command to see a high-level view of the Neuron environment. This shows the utilization of each of the NeuronCores, any models that are currently loaded onto one or more NeuronCores, process IDs for any processes that are using the Neuron runtime, and basic system statistics relating to vCPU and memory usage.

AWS Neuron SDK neuron-top command

Available Now
You can launch Trn1 instances today in the AWS US East (N. Virginia) and US West (Oregon) Regions as On-Demand, Reserved, and Spot Instances or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 pricing.

Trn1 instances can be deployed using AWS Deep Learning AMIs, and container images are available via managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.

To learn more, visit our Amazon EC2 Trn1 instances page, and please send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.

— Antje

Let’s Architect! Architecting with custom chips and accelerators

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-custom-chips-and-accelerators/

It’s hard to imagine a world without computer chips. They are at the heart of the devices that we use to work and play every day. Currently, Amazon Web Services (AWS) is offering customers the next generation of computer chips, with lower cost, higher performance, and a reduced carbon footprint.

This edition of Let’s Architect! focuses on custom computer chips, accelerators, and technologies developed by AWS, such as the AWS Nitro System, custom-designed Arm-based AWS Graviton processors that support data-intensive workloads, and the AWS Trainium and AWS Inferentia chips optimized for machine learning training and inference.

In this post, we discuss these new AWS technologies, their main characteristics, and how to take advantage of them in your architecture.

Deliver high performance ML inference with AWS Inferentia

As Deep Learning models become increasingly large and complex, the training cost for these models increases, as well as the inference time for serving.

With AWS Inferentia, machine learning practitioners can deploy complex neural-network models that are built and trained on popular frameworks, such as TensorFlow, PyTorch, and MXNet, on AWS Inferentia-based Amazon EC2 Inf1 instances.

This video introduces you to the main concepts of AWS Inferentia, a service designed to reduce both cost and latency for inference. To speed up inference, AWS Inferentia shares a model across multiple chips, places the pieces inside the on-chip cache, and then streams the data through a pipeline for low-latency predictions.

The presenters walk through the structure of the chip and software considerations, and share anecdotes from the Amazon Alexa team, which uses AWS Inferentia to serve predictions. If you want to learn more about high throughput coupled with low latency, explore Achieve 12x higher throughput and lowest latency for PyTorch Natural Language Processing applications out-of-the-box on AWS Inferentia on the AWS Machine Learning Blog.

AWS Inferentia shares a model across different chips to speed up inference

AWS Lambda Functions Powered by AWS Graviton2 Processor – Run Your Functions on Arm and Get Up to 34% Better Price Performance

AWS Lambda is a serverless, event-driven compute service that enables code to run from virtually any type of application or backend service, without provisioning or managing servers. Lambda uses a high-availability compute infrastructure and performs all of the administration of the compute resources, including server- and operating-system maintenance, capacity-provisioning, and automatic scaling and logging.

AWS Graviton processors are designed to deliver the best price and performance for cloud workloads. AWS Graviton3 processors are the latest in the AWS Graviton processor family and provide up to 25% higher compute performance, two times higher floating-point performance, and two times faster cryptographic workload performance compared with AWS Graviton2 processors. You can migrate AWS Lambda functions to Graviton in minutes and get as much as 19% better performance at approximately 20% lower cost compared with x86 (a minimal sketch follows the figure).

Comparison between x86 and Arm/Graviton2 results for the AWS Lambda function computing prime numbers
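
As a minimal, hypothetical illustration of targeting Graviton, a Lambda function can be created with the arm64 architecture; the function name, role ARN, runtime, handler, and deployment package below are placeholder assumptions.

# Hypothetical sketch: create a Lambda function that runs on Graviton (arm64).
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")  # Region is an assumption

with open("function.zip", "rb") as package:                      # placeholder deployment package
    lambda_client.create_function(
        FunctionName="prime-numbers-arm64",
        Runtime="python3.9",
        Role="arn:aws:iam::123456789012:role/lambda-execution-role",  # placeholder role
        Handler="app.handler",
        Code={"ZipFile": package.read()},
        Architectures=["arm64"],            # run on Graviton instead of x86_64
    )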

Powering next-gen Amazon EC2: Deep dive on the Nitro System

The AWS Nitro System is a collection of building-block technologies that includes AWS-built hardware offload and security components. It is powering the next generation of Amazon EC2 instances, with a broadening selection of compute, storage, memory, and networking options.

In this session, dive deep into the Nitro System, reviewing its design and architecture, exploring new innovations to the Nitro platform, and understanding how it allows for faster innovation and increased security while reducing costs.

Traditionally, hypervisors protect the physical hardware and BIOS; virtualize the CPU, storage, and networking; and provide a rich set of management capabilities. With the AWS Nitro System, AWS breaks apart those functions and offloads them to dedicated hardware and software.

AWS Nitro System separates functions and offloads them to dedicated hardware and software, in place of a traditional hypervisor

How Amazon migrated a large ecommerce platform to AWS Graviton

In this re:Invent 2021 session, we learn about the benefits Amazon’s ecommerce Datapath platform has realized with AWS Graviton.

With a range of 25%-40% performance gains across 53,000 Amazon EC2 instances worldwide for Prime Day 2021, the Datapath team is lowering their internal costs with AWS Graviton’s improved price performance. Explore the software updates that were required to achieve this and the testing approach used to optimize and validate the deployments. Finally, learn about the Datapath team’s migration approach that was used for their production deployment.

AWS Graviton2: core components

See you next time!

Thanks for exploring custom computer chips, accelerators, and technologies developed by AWS. Join us in a couple of weeks when we talk more about architectures and the daily challenges faced while working with distributed systems.

Other posts in this series

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!