Tag Archives: Amazon CloudWatch

Introducing Client-side Evaluation for Amazon CloudWatch Evidently

Post Syndicated from Cole Thienes original https://aws.amazon.com/blogs/architecture/introducing-client-side-evaluation-for-amazon-cloudwatch-evidently/

Amazon CloudWatch Evidently enables developers to test new features on a small percentage of traffic and gauge the outcome before rolling it out to the rest of their users. Evidently feature flags are defined ahead of your release and, at runtime, your application code queries a remote service to determine whether to show the new feature to a given user. The remote call to fetch feature flags for a user is susceptible to network latency, adding several hundred milliseconds of delay in bad cases. Any additional latency added to fetching feature flags can directly impact the speed of a web page, where milliseconds matter. Our solution: client-side evaluation for Amazon CloudWatch Evidently. With client-side evaluation, developers can significantly decrease latency by fetching feature flags locally and avoiding network overhead altogether.

The term “client-side” does not refer to the browser in this case, but to the evaluation taking place in your container application rather than through a remote API call. Client-side evaluation removes the need for a network call by fetching feature flags for a user from the AWS AppConfig agent, a sidecar container running alongside your container application backend. The agent enables container runtimes to leverage AWS AppConfig, a service that lets customers change the way an application behaves while running, without deploying new code. In this post, we’ll walk through the solution architecture and how to instrument client-side evaluation in an Amazon Elastic Container Service (Amazon ECS) application.

Overview of solution

Figure 1 illustrates how client-side evaluation works in an application running on Amazon ECS. The webpage makes a call to the webpage backend to determine which website features to show an end-user. Let’s explore how this works.

High-level architecture of Client-side Evaluation for Amazon CloudWatch Evidently

Figure 1. High-level architecture of client-side evaluation for Amazon CloudWatch Evidently

  1. Create an Evidently project, feature, and launch through the AWS Management Console, API, or AWS CloudFormation.
  2. Create an Amazon ECS task for the backend application container and attach an AWS AppConfig agent container to the task. At runtime, the application container will invoke the EvaluateFeature API to fetch feature flags. Without client-side evaluation, this API call would perform a remote call to the Evidently cloud service.
  3. With client-side evaluation, the API call is made from the application container to the AWS AppConfig agent container on localhost, short-circuiting the network overhead.
  4. Evidently maintains a synchronized copy of the Evidently features in an AWS AppConfig configuration within your AWS account. When subsequent changes are made to the features, the configuration is updated (usually within a minute).
  5. When the backend application initializes, the agent fetches the necessary configuration profile and caches it, polling periodically to refresh the cache. When the AWS AppConfig agent is invoked from the backend application, it evaluates the requested feature flag using the cached data.
  6. After each successful EvaluateFeature call, a transaction record called an evaluation event is generated. This bookkeeping mechanism helps developers see which of their users saw which feature and when. As evaluation events are generated, they are placed in a buffer within the agent. Once the buffer reaches a certain size or age, the events in the buffer are uploaded to Evidently via the PutProjectEvents API.
  7. The evaluation events are then available for offline analysis in developer-configured storage, including CloudWatch Logs, Amazon Simple Storage Service (Amazon S3), and CloudWatch Metrics.

Walkthrough

Let’s take a practical example to demonstrate client-side evaluation. I have a simple webpage with a search bar on it. I’ve implemented a newer, fancier search bar, but I only want to show it to 10 percent of my visitors to make sure it doesn’t cause issues on my existing webpage before rolling it out to everyone, as in Figure 2.

Web page experience where 10% of users see the new search bar

Figure 2. A webpage experience where 10% of users see the new search bar

We could set up the necessary AWS resources by hand, but let’s use a pre-built AWS Cloud Development Kit (AWS CDK) example to save time. The sample code for this example is available on GitHub. Here are the high-level steps:

  1. Provision the infrastructure. The infrastructure will consist of:
    • An Amazon ECS service within an Amazon Virtual Private Cloud (Amazon VPC) to serve as our backend application and return the search bar variation
    • An Evidently launch to split the traffic between the two search bars
    • An AWS AppConfig environment, which the AWS AppConfig agent will fetch Evidently data from
  2. Test our webpage. Once our code is deployed, we will visit our webpage to fetch feature flags using client-side evaluation.
  3. Clean up by removing our infrastructure.

Prerequisites

Steps

Clone the repository

First, clone the official AWS CDK example repository:

git clone https://github.com/aws-samples/aws-cdk-examples

The repository contains many examples of setting up AWS infrastructure with the AWS CDK. Let’s go to the client-side evaluation example.

Code explanation

Let’s take a look at the code example. When we visit our webpage, a request is routed to our application deployed on AWS Fargate, which allows us to run containers directly through Amazon ECS without having to manage Amazon Elastic Compute Cloud (Amazon EC2) instances. The application code is written in Node.js with TypeScript and leverages the Express framework:

// local-image/app.ts

import * as express from 'express';
import {Evidently} from '@aws-sdk/client-evidently';

const app = express();

const evidently = new Evidently({
    endpoint: 'http://localhost:2772',
    disableHostPrefix: true
});

app.get("/", async (_, res) => {
    try {
        console.time('latency')
        const evaluation = await evidently.evaluateFeature({
            project: 'WebPage',
            feature: 'SearchBar',
            entityId: 'WebPageVisitor43'
        })
        console.timeEnd('latency')
        res.send(evaluation.variation)
    } catch (err: any) {
        console.timeEnd('latency')
        res.send(err.toString())
    }
});

The container application will invoke the EvaluateFeature API using the AWS SDK for JavaScript and return the search bar variation, either the old or the new search bar. Here we also log the latency of the operation. The EvaluateFeature request is forwarded to the endpoint we configured for the Evidently client: http://localhost:2772. This is the local address where the AWS AppConfig agent can be reached. To make this possible, we add the AWS AppConfig agent as a container to the Amazon ECS task definition:

// index.ts

service.taskDefinition.addContainer('AppConfigAgent', {
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/aws-appconfig/aws-appconfig-agent:2.x'),
    portMappings: [{
        containerPort: 2772
    }]
})

We also need to set up an AppConfig environment for the Evidently project. This tells Evidently where to create the configuration to keep a synchronized copy of the features in the project:

// index.ts

const application = new appconfig.CfnApplication(this,'AppConfigApplication', {
    name: 'MyApplication'
});

const environment = new appconfig.CfnEnvironment(this, 'AppConfigEnvironment', {
    applicationId: application.ref,
    name: 'MyEnvironment'
});

const project = new evidently.CfnProject(this, 'EvidentlyProject', {
    name: 'WebPage',
    appConfigResource: {
        applicationId: application.ref,
        environmentId: environment.ref
    }
});
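
The launch we define next references an Evidently feature with two variations. As a minimal sketch of how that feature might be declared in the CDK stack (the OLD_SEARCH_BAR and NEW_SEARCH_BAR constants and the variation values are assumptions inferred from the launch configuration; the sample repository may define them differently):

// index.ts

const OLD_SEARCH_BAR = 'oldSearchBar';
const NEW_SEARCH_BAR = 'newSearchBar';

const feature = new evidently.CfnFeature(this, 'EvidentlyFeature', {
    project: project.name,
    name: 'SearchBar',
    // Each variation maps a variation name to the string the backend returns
    variations: [
        { variationName: OLD_SEARCH_BAR, stringValue: OLD_SEARCH_BAR },
        { variationName: NEW_SEARCH_BAR, stringValue: NEW_SEARCH_BAR }
    ],
    defaultVariation: OLD_SEARCH_BAR
});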

Finally, we set up an Evidently launch that ensures only 10 percent of traffic receives the new search bar:

// index.ts

const launch = new evidently.CfnLaunch(this, 'EvidentlyLaunch', {
  project: project.name,
  name: 'MyLaunch',
  executionStatus: {
    status: 'START'
  },
  groups: [
    {
      feature: feature.name,
      variation: OLD_SEARCH_BAR,
      groupName: OLD_SEARCH_BAR
    },
    {
      feature: feature.name,
      variation: NEW_SEARCH_BAR,
      groupName: NEW_SEARCH_BAR
    }
  ],
  scheduledSplitsConfig: [{
    startTime: '2022-01-01T00:00:00Z',
    groupWeights: [
      {
        groupName: OLD_SEARCH_BAR,
        splitWeight: 90000
      },
      {
        groupName: NEW_SEARCH_BAR,
        splitWeight: 10000
      }
    ]
  }]
})

We start the launch immediately by setting executionStatus to START and startTime to a timestamp in the past. If you want to wait to show the new search bar, you can specify a future start time.

Install dependencies

Install the necessary Node modules:

npm install

Build and deploy

Build the AWS CDK template from the source code:

npm run build

Before deploying the app:

  1. Ensure that you have set up AWS credentials in your environment.
  2. Ensure that the AWS CDK Toolkit is bootstrapped in your AWS account.
  3. Confirm that the maximum number of VPCs has not been reached in your AWS account; the Amazon ECS service will deploy an Amazon VPC.

After that, you can deploy the AWS CDK template to your AWS account:

cdk deploy

Test the webpage

After the previous step, you should see the following output in your console:

Console output showing a successful CDK deployment

Figure 3. Console output showing a successful AWS CDK deployment

In your browser, visit the URL specified by the FargateServiceServiceURL output above and you will see oldSearchBar. We can check the application logs for the Amazon ECS task in CloudWatch Logs: in the AWS console, open the CloudWatch log groups page and select the log group with the prefix EvidentlyClientSideEvaluationEcs. There, we can see that fetching feature flags took under two milliseconds, as in Figure 4.

CloudWatch Logs for the Amazon ECS task showing an EvaluateFeature latency of 1.238 milliseconds

Figure 4. CloudWatch Logs for the Amazon ECS task showing an EvaluateFeature latency of 1.238 milliseconds

Additionally, we can see how many visitors have seen each version of the search bar. On the AWS console, visit the CloudWatch metrics page and find the Evidently metrics under All > Evidently > Feature, Project, Variation, as in Figure 5:

CloudWatch metrics showing the VariationCount: the number of times each feature flag variation was fetched

Figure 5. CloudWatch metrics showing the VariationCount (the number of times each feature flag variation was fetched)

We can increase or decrease the percentage of visitors seeing the new search bar at any time. On the AWS console, open the CloudWatch Evidently page, go to Projects > WebPage > Launches > MyLaunch, select Modify launch traffic, and adjust the Traffic percentage, as in Figure 6.

Adjusting the traffic percentage of an Evidently launch

Figure 6. Adjusting the traffic percentage of an Evidently launch

Cleaning up

To avoid incurring future charges, delete the resources. Let’s run:

cdk destroy

You can verify the removal by going into the AWS CloudFormation console and confirming that the resources were deleted.

Conclusion

In this blog post, we learned how to set up a webpage backend in Amazon ECS with client-side evaluation for Amazon CloudWatch Evidently. We easily deployed the example CloudFormation stack with the AWS CDK Toolkit. Then we visited the example webpage and demonstrated the improved speed of fetching feature flags with client-side evaluation. If you’re interested in using client-side evaluation with AWS Lambda instead of Amazon ECS, check out this AWS CDK example.

Week in Review – February 13, 2023

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/week-in-review-february-13-2023/

AWS announced 32 capabilities since we published the last Week in Review blog post a week ago. I also read a couple of other news items and blog posts.

Here is my summary.

The VPC section of the AWS Management Console now allows you to visualize your VPC resources, such as the relationships between a VPC and its subnets, routing tables, and gateways. Previously, this visualization was available only at VPC creation time; now you can return to it using the Resource Map tab in the console. You can read the details in Channy’s blog post.

CloudTrail Lake now gives you the ability to ingest activity events from non-AWS sources. This lets you immutably store and then process activity events regardless of their origin: AWS, on-premises servers, and so forth. All of this power is available to you with a single API call: PutAuditEvents. We launched AWS CloudTrail Lake about a year ago. It is a managed organization-scale data lake that aggregates, immutably stores, and allows querying of events recorded by CloudTrail. You can use it for auditing, security investigation, and troubleshooting. Again, my colleague Channy wrote a post with the details.
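
As an illustration only (not taken from the announcement), a single ingestion call with the AWS SDK for JavaScript might look like the following sketch; the channel ARN and event fields are placeholders, and the eventData payload must follow the CloudTrail Lake open audit events schema:

import { CloudTrailDataClient, PutAuditEventsCommand } from '@aws-sdk/client-cloudtrail-data';

const client = new CloudTrailDataClient({});

export const ingestEvent = async () =>
    client.send(new PutAuditEventsCommand({
        channelArn: 'arn:aws:cloudtrail:us-east-1:123456789012:channel/EXAMPLE', // placeholder channel
        auditEvents: [{
            id: 'event-id-1', // unique ID that you assign to the event
            eventData: JSON.stringify({
                version: '1.0',
                userIdentity: { type: 'OnPremisesUser', principalId: 'alice' },
                eventSource: 'my-datacenter.example.com',
                eventName: 'LoginAttempt',
                eventTime: new Date().toISOString(),
                UID: 'event-id-1'
            })
        }]
    }));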

There are three new Amazon CloudWatch metrics for asynchronous AWS Lambda function invocations: AsyncEventsReceived, AsyncEventAge, and AsyncEventsDropped. These metrics provide visibility for asynchronous Lambda function invocations. They help you to identify the root cause of processing issues such as throttling, concurrency limit, function errors, processing latency because of retries, or missing events. You can learn more and have access to a sample application in this blog post.

Amazon Simple Notification Service (Amazon SNS) now supports AWS X-Ray to visualize, analyze, and debug applications. Developers can now trace messages going through Amazon SNS, making it easier to understand or debug microservices or serverless applications.

Amazon EC2 Mac instances now support replacing root volumes for quick instance restoration. Stopping and starting EC2 Mac instances triggers a scrubbing workflow that can take up to one hour to complete. Now you can swap the root volume of the instance with an EBS snapshot or an AMI, which resets your instance to a previous known state in just 10–15 minutes. This significantly speeds up your CI/CD pipelines.

Amazon Polly launches two new Japanese NTTS voices. Neural Text To Speech (NTTS) produces the most natural and human-like text-to-speech voices possible. You can try these voices in the Polly section of the AWS Management Console. With this addition, according to my count, you can now choose among 52 NTTS voices in 28 languages or language variants (French from France or from Quebec, for example).

The AWS SDK for Java now includes the AWS CRT HTTP Client. The HTTP client is the centerpiece powering our SDKs. Every single AWS API call triggers a network call to our API endpoints. It is therefore important to use a low-footprint and low-latency HTTP client library in our SDKs. AWS created a common HTTP client for all SDKs using the C programming language. We also offer wrappers for 11 programming languages, from C++ to Swift. When you develop in Java, you now have the option to use this common HTTP client. It provides up to 76 percent cold start time reduction on AWS Lambda functions and up to 14 percent less memory usage compared to the Netty-based HTTP client provided by default. My colleague Zoe has more details in her blog post.

X in Y
Jeff started this section a while ago to list the expansion of new services and capabilities to additional Regions. I noticed 10 Regional expansions this week:

Other AWS News
This week, I also noticed these AWS news items:

My colleague Mai-Lan shared some impressive customer stories and metrics related to the use and scale of Amazon S3 Glacier. Check it out to learn how to put your cold data to work.

Space is the final (edge) frontier. I read this blog post published on aviationweek.com. It explains how AWS helps deploy AI/ML models on observation satellites to analyze image quality before sending images to Earth, saving up to 40 percent of satellite bandwidth. Interestingly, the main cause of unusable satellite images is…clouds.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent recaps in your area. During re:Invent week, we made lots of new announcements, and in the coming weeks you can find a recap of all these launches at an event in your area. All the events are posted on this site, so check it regularly to find an event nearby.

AWS re:Invent keynotes, leadership sessions, and breakout sessions are available on demand. I recommend that you check the playlists and find the talks about your favorite topics in one collection.

AWS Summits season will restart in Q2 2023. The dates and locations will be announced here. Paris and Sydney are kicking off the season on April 4th. You can register today to attend these free, in-person events (Paris, Sydney).

Stay Informed
That was my selection for this week! To better keep up with all of this news, do not forget to check out the following resources:

— seb
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Scaling an ASG using target tracking with a dynamic SQS target

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/scaling-an-asg-using-target-tracking-with-a-dynamic-sqs-target/

This blog post is written by Wassim Benhallam, Sr. Cloud Application Architect, AWS WWCO ProServe, and Rajesh Kesaraju, Sr. Specialist Solution Architect, EC2 Flexible Compute.

Scaling an Amazon EC2 Auto Scaling group based on Amazon Simple Queue Service (Amazon SQS) is a commonly used design pattern in decoupled applications. For example, an Auto Scaling group can be used as a worker tier to offload the processing of audio files, images, or other files sent to the queue from an upstream tier (e.g., a web tier). For latency-sensitive applications, AWS guidance describes a common pattern that allows an Auto Scaling group to scale in response to the backlog of an Amazon SQS queue while accounting for the average message processing duration (MPD) and the application’s desired latency.

This post builds on that guidance to focus on latency-sensitive applications where the MPD varies over time. Specifically, we demonstrate how to dynamically update the target value of the Auto Scaling group’s target tracking policy based on observed changes in the MPD. We also cover the use of Amazon EC2 Spot Instances, mixed instance policies, and attribute-based instance selection in Auto Scaling groups, as well as best practices to achieve greater cost savings.

The challenge

The key challenge that this post addresses is applications that fail to honor their acceptable/target latency in situations where the MPD varies over time. Latency refers here to the time required for any queue message to be consumed and fully processed.

Consider the example of a customer using a worker tier to process image files (e.g., resizing, rescaling, or transformation) uploaded by users within a target latency of 100 seconds. The worker tier consists of an Auto Scaling group configured with a target tracking policy. To achieve the target latency mentioned previously, the customer assumes that each image can be processed in one second, and configures the target value of the scaling policy so that the average image backlog per instance is maintained at approximately 100 images.

In the first week, the customer submits 1000 images to the Amazon SQS queue for processing, each of which takes one second of processing time. Therefore, the Auto Scaling group scales to 10 instances, each of which processes 100 images in 100s, thereby honoring the target latency of 100s.

In the second week, the customer submits 1000 slightly larger images for processing. Since an image’s processing duration scales with its size, each image takes two seconds to process. As in the first week, the Auto Scaling group scales to 10 instances, but this time each instance processes 100 images in 200s, which is twice as long as was needed in the first round. As a result, the application fails to process the latter images within its acceptable latency.

Therefore, the challenge is common to any latency-sensitive application where the MPD is subject to change. Applications where the processing duration scales with input data size are particularly vulnerable to this problem. This includes image processing, document processing, computational jobs, and others.

Solution overview

Before we dive into the solution, let’s briefly review the target tracking policy’s scaling metric and its corresponding target value. A target tracking scaling policy works by adjusting the capacity to keep a scaling metric at, or close to, the specified target value. When scaling in response to an Amazon SQS backlog, it’s good practice to use a scaling metric known as the Backlog Per Instance (BPI) and a target value based on the acceptable BPI. These are computed as follows:

Acceptable BPI (target value) = acceptable latency / average message processing duration (MPD)    (1)

BPI (scaling metric) = number of messages visible in the SQS queue / number of in-service instances in the Auto Scaling group    (2)

Given the acceptable BPI equation, a longer MPD requires us to use a smaller target value if we are to process these messages in the same acceptable latency, and vice versa. Therefore, the solution we propose here works by monitoring the average MPD over time and dynamically adjusting the target value of the Auto Scaling group’s target tracking policy (acceptable BPI) based on the observed changes in the MPD. This allows the scaling policy to adapt to variations in the average MPD over time, and thus enables the application to honor its acceptable latency.

Solution architecture

To demonstrate how the approach above can be implemented in practice, we put together an example architecture highlighting the services involved (see the following figure). We also provide an automated deployment solution for this architecture using an AWS Serverless Application Model (AWS SAM) template and some Python code (repository link). The repository also includes a README file with detailed instructions that you can follow to deploy the solution. The AWS SAM template deploys several resources, including an Auto Scaling group, a launch template, a target tracking scaling policy, an Amazon SQS queue, and a few AWS Lambda functions that serve the purposes described as follows.

The Amazon SQS queue is used to accumulate messages intended for processing, while the Auto Scaling group instances are responsible for polling the queue and processing any messages received. To do this, a launch template defines a bootstrap script that allows the group’s instances to download and execute Python code when first launched. The Python code consumes messages from the Amazon SQS queue and simulates their processing by sleeping for the MPD duration specified in the message body. After processing each message, the instance publishes the MPD as an Amazon CloudWatch metric (see the following figure).

Figure 1: Architecture diagram showing the components deployed by the AWS SAM template. These include an SQS queue, an Auto Scaling group responsible for polling and processing queue messages, a Lambda function that regularly updates the BPI CloudWatch metric, and a “Target Setter” Lambda function that regularly updates the Auto Scaling group’s target tracking scaling policy.

Figure 1: Architecture diagram showing the components deployed by the AWS SAM template.

To enable scaling, the Auto Scaling group is configured with a target tracking scaling policy that specifies BPI as the scaling metric and with an initial target value provided by the user.

The BPI CloudWatch metric is calculated and published by the “Metric-Publisher” Lambda function, which is invoked every minute using an Amazon EventBridge rate expression. To calculate BPI, the Lambda function simply takes the ratio of the number of messages visible in the Amazon SQS queue to the total number of in-service instances in the Auto Scaling group, as shown in equation (2) above.
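
The repository implements this function in Python; the following TypeScript sketch shows the equivalent logic, with the namespace, metric name, and environment variables assumed for illustration:

import { SQSClient, GetQueueAttributesCommand } from '@aws-sdk/client-sqs';
import { AutoScalingClient, DescribeAutoScalingGroupsCommand } from '@aws-sdk/client-auto-scaling';
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const sqs = new SQSClient({});
const autoscaling = new AutoScalingClient({});
const cloudwatch = new CloudWatchClient({});

export const handler = async () => {
    // Number of messages visible in the queue
    const attrs = await sqs.send(new GetQueueAttributesCommand({
        QueueUrl: process.env.QUEUE_URL!,
        AttributeNames: ['ApproximateNumberOfMessagesVisible']
    }));
    const backlog = Number(attrs.Attributes!.ApproximateNumberOfMessagesVisible);

    // Number of in-service instances in the Auto Scaling group
    const groups = await autoscaling.send(new DescribeAutoScalingGroupsCommand({
        AutoScalingGroupNames: [process.env.ASG_NAME!]
    }));
    const inService = groups.AutoScalingGroups![0].Instances!
        .filter(i => i.LifecycleState === 'InService').length;

    // Equation (2): BPI = visible messages / in-service instances
    await cloudwatch.send(new PutMetricDataCommand({
        Namespace: 'SqsBacklogScaling', // assumed namespace
        MetricData: [{
            MetricName: 'BacklogPerInstance',
            Value: inService > 0 ? backlog / inService : backlog,
            Unit: 'Count'
        }]
    }));
};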

The scaling policy’s target value, on the other hand, is updated by the “Target-Setter” Lambda function, which is invoked every 30 minutes using another EventBridge rate expression. To calculate the new target value, the Lambda function simply takes the ratio of the user-defined acceptable latency value to the current average MPD queried from the corresponding CloudWatch metric, as shown in equation (1) above.
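
Again, the repository’s implementation is in Python; here is a hedged TypeScript sketch of the same logic. The metric names, namespace, policy name, and environment variables are assumptions, and PutScalingPolicy with an existing policy name overwrites that policy’s configuration:

import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';
import { AutoScalingClient, PutScalingPolicyCommand } from '@aws-sdk/client-auto-scaling';

const cloudwatch = new CloudWatchClient({});
const autoscaling = new AutoScalingClient({});

const ACCEPTABLE_LATENCY_SECONDS = 300; // user-defined acceptable latency

export const handler = async () => {
    // Average message processing duration (MPD) over the last 30 minutes
    const stats = await cloudwatch.send(new GetMetricStatisticsCommand({
        Namespace: 'SqsBacklogScaling',          // assumed namespace
        MetricName: 'MessageProcessingDuration', // assumed metric name
        StartTime: new Date(Date.now() - 30 * 60 * 1000),
        EndTime: new Date(),
        Period: 1800,
        Statistics: ['Average']
    }));
    const avgMpd = stats.Datapoints?.[0]?.Average;
    if (!avgMpd) return; // no recent data; keep the current target value

    // Equation (1): acceptable BPI = acceptable latency / average MPD
    const acceptableBpi = ACCEPTABLE_LATENCY_SECONDS / avgMpd;

    await autoscaling.send(new PutScalingPolicyCommand({
        AutoScalingGroupName: process.env.ASG_NAME!,
        PolicyName: 'bpi-target-tracking', // assumed policy name
        PolicyType: 'TargetTrackingScaling',
        TargetTrackingConfiguration: {
            CustomizedMetricSpecification: {
                Namespace: 'SqsBacklogScaling',
                MetricName: 'BacklogPerInstance',
                Statistic: 'Average'
            },
            TargetValue: acceptableBpi
        }
    }));
};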

Finally, to help you quickly test this solution, a Lambda function “Testing Lambda” is also provided and can be used to send messages to the Amazon SQS queue with a processing duration of your choice. This is specified within each message’s body. You can invoke this Lambda function with different MPDs (by modifying the corresponding environment variable) to verify how the Auto Scaling group scales in response. A CloudWatch dashboard is also deployed to enable you to track key scaling metrics through time. These include the number of messages visible in the queue, the number of in-service instances in the Auto Scaling group, the MPD, and BPI vs acceptable BPI.

Solution testing

To demonstrate the solution in action and its impact on application latency, we conducted two tests that you can reproduce by following the instructions described in the “Testing” section of the repository’s README file (repository link). In both tests, we assume a hypothetical application with a target latency of 300s. We also modified the invocation frequency of the “Target Setter” Lambda function to one minute to quickly assess the impact of target value changes. In both tests, we submit 50 messages to the Amazon SQS queue through the provided helper Lambda function. An MPD of 25s and 50s was used for the first and second test, respectively. The provided CloudWatch dashboard shows that the Auto Scaling group scales to a total of four instances in the first test, and eight instances in the second (see the following figure). See the README file for a detailed description of how various metrics evolve over time.

Comparison of Tests 1 and 2

Since Test 2 messages take twice as long to process, the Auto Scaling group launched twice as many instances to attempt to process all of the messages in the same amount of time as Test 1 (latency). The following figure shows that the total time to process all 50 messages in Test 1 was 9 mins vs 10 mins in Test 2. In contrast, if we were to use a static/fixed acceptable BPI of 12, then a total of four instances would have been operational in Test 2, thereby requiring double the time of Test 1 (~20 minutes) to process all of the messages. This demonstrates the value of using a dynamic scaling target when processing messages from Amazon SQS queues, especially in circumstances where the MPD is prone to vary with time.

Figure 2: CloudWatch dashboard showing Auto Scaling group scaling test results (Test 1 and 2). Although Test 2 messages require double the MPD of Test 1 messages, the Auto Scaling group processed Test 2 messages in the same amount of time as Test 1 by launching twice as many instances.

Figure 2: CloudWatch dashboard showing Auto Scaling group scaling test results (Test 1 and 2).

Recommended best practices for Auto Scaling groups

This section highlights a few key best practices that we recommend adopting when deploying and working with Auto Scaling groups.

Reducing cost using EC2 Spot instances

Amazon SQS helps build loosely coupled application architectures, while providing reliable asynchronous communication between the various layers/components of an application. If a worker node fails to process a message within the Amazon SQS message visibility timeout, then the message is returned to the queue and another worker node can pick up and process that message. This makes Amazon SQS-backed applications fault-tolerant by design and thus a great fit for EC2 Spot Instances, which are spare compute capacity in the AWS Cloud available at steep discounts compared to On-Demand prices.

Maximizing capacity using attribute-based instance selection

With the recently released attribute-based instance selection feature, you can define infrastructure requirements based on application needs, such as vCPU, RAM, and processor family (e.g., x86, Arm). This removes the need to define specific instances in your Auto Scaling group configuration, and it eliminates the burden of identifying the correct instance families and sizes. In addition, newly released instance types will be automatically considered if they fit your requirements. Attribute-based instance selection lets you tap into hundreds of different EC2 instance pools, which increases the chance of getting EC2 (Spot/On-Demand) instances. When using attribute-based instance selection with the capacity-optimized allocation strategy, Amazon EC2 allocates instances from deeper Spot capacity pools, thereby further reducing the chance of Spot interruption.

The following sample configuration creates an Auto Scaling group with attribute-based instance selection:

AutoScalingGroupName: 'my-asg' # [REQUIRED] 
MixedInstancesPolicy:
  LaunchTemplate:
    LaunchTemplateSpecification:
      LaunchTemplateId: 'lt-0537239d9aef10a77'
    Overrides:
    - InstanceRequirements:
        VCpuCount: # [REQUIRED] 
          Min: 2
          Max: 4
        MemoryMiB: # [REQUIRED] 
          Min: 2048
  InstancesDistribution:
    SpotAllocationStrategy: 'capacity-optimized'
MinSize: 0 # [REQUIRED] 
MaxSize: 100 # [REQUIRED] 
DesiredCapacity: 4
VPCZoneIdentifier: 'subnet-e76a128a,subnet-e66a128b,subnet-e16a128c'

Conclusion

As can be seen from the test results, this approach demonstrates how an Auto Scaling group can honor a user-provided acceptable latency constraint while accommodating variations in the MPD over time. This is possible because the average MPD is monitored and regularly updated as a CloudWatch metric, which in turn is continuously used to update the target value of the group’s target tracking policy. Moreover, we have covered additional Auto Scaling group best practices suitable for this use case, including the use of Spot Instances to reduce costs and attribute-based instance selection to simplify the selection of relevant instance types.

For more information on scaling options for Auto Scaling groups, visit the Amazon EC2 Auto Scaling documentation page and the SQS-based scaling guide.

Adopt Recommendations and Monitor Predictive Scaling for Optimal Compute Capacity

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/evaluating-predictive-scaling-for-amazon-ec2-capacity-optimization/

This post is written by Ankur Sethi, Sr. Product Manager, EC2, and Kinnar Sen, Sr. Specialist Solution Architect, AWS Compute.

Amazon EC2 Auto Scaling helps customers optimize their Amazon EC2 capacity by dynamically responding to varying demand. Based on customer feedback, we enhanced the scaling experience with the launch of predictive scaling policies. Predictive scaling proactively adds EC2 instances to your Auto Scaling group in anticipation of demand spikes. This results in better availability and performance for your applications that have predictable demand patterns and long initialization times. We recently launched a couple of features designed to help you assess the value of predictive scaling – prescriptive recommendations on whether to use predictive scaling based on its potential availability and cost impact, and integration with Amazon CloudWatch to continuously monitor the accuracy of predictions. In this post, we discuss the features in detail and the steps that you can easily adopt to enjoy the benefits of predictive scaling.

Recap: Predictive Scaling

EC2 Auto Scaling helps customers maintain application availability by managing the capacity and health of the underlying cluster. Prior to predictive scaling, EC2 Auto Scaling offered dynamic scaling policies such as target tracking and step scaling. These dynamic scaling policies are configured with an Amazon CloudWatch metric that represents an application’s load. EC2 Auto Scaling constantly monitors this metric and responds according to your policies, thereby triggering the launch or termination of instances. Although it’s extremely effective and widely used, this model is reactive in nature, and for larger spikes, may lead to unfulfilled capacity momentarily as the cluster is scaling out. Customers mitigate this by adopting aggressive scale out and conservative scale in to manage the additional buffer of instances. However, sometimes applications take a long time to initialize or have a recurring pattern with a sudden spike of high demand. These can have an impact on the initial response of the system when it is scaling out. Customers asked for a proactive scaling mechanism that can scale capacity ahead of predictable spikes, and so we delivered predictive scaling.

Predictive scaling was launched to make the scaling action proactive: it anticipates the changes required in the compute demand and scales accordingly. The scaling action is determined by ensemble machine learning (ML) built with data from your Auto Scaling group’s scaling patterns, as well as billions of data points from our observations. Predictive scaling should be used for applications where demand changes rapidly but with a recurring pattern, where instances require a long time to initialize, or where you’re manually invoking scheduled scaling for routine demand patterns. Predictive scaling not only forecasts capacity requirements based on historical usage, but also learns continuously, making forecasts more accurate over time. Furthermore, a predictive scaling policy is designed to only scale out, and never scale in, your Auto Scaling groups, eliminating the risk of ending up with less capacity because of inexact predictions. You must use a dynamic scaling policy, scheduled scaling, or your own custom mechanism for scale-ins. In the case of exceptional demand spikes, adding a dynamic scaling policy can also improve your application performance by bridging the gap between demand and predicted capacity.

What’s new with predictive scaling

Predictive scaling policies can be configured in a non-mutative ‘Forecast Only’ mode to evaluate the accuracy of forecasts. When you’re ready to start scaling, you can switch to the ‘Forecast and Scale’ mode. Now we prescriptively recommend whether your policy should be switched to ‘Forecast and Scale’ mode if it can potentially lead to better availability and lower costs, saving you the time and effort of doing such an evaluation manually. You can test different configurations by creating multiple predictive scaling policies in ‘Forecast Only’ mode, and choose the one that performs best in terms of availability and cost improvements.

Monitoring and observability are key elements of the AWS Well-Architected Framework. Now we also offer CloudWatch metrics for your predictive scaling policies so that you can programmatically monitor them for demand pattern changes or prolonged periods of inaccurate predictions. This lets you monitor the key performance metrics and makes it easier to adopt AWS Well-Architected best practices.

In the following sections, we deep dive into the details of these two features.

Recommendations for predictive scaling

Once you set up an Auto Scaling group with a predictive scaling policy in Forecast Only mode as explained in this introduction to predictive scaling blog post, you can review the results of the forecast visually and adjust any parameters to more accurately reflect the behavior that you desire. Evaluating simply on the basis of visualization may not be very intuitive if the scaling patterns are erratic. Moreover, if you keep higher minimum capacities, then the graph may show a flat line for the actual capacity, because your Auto Scaling group capacity is an outcome of the existing scaling policy configurations and the minimum capacity that you configured. This makes it difficult to judge whether the lower capacity predicted by predictive scaling would leave your Auto Scaling group under-scaled.

This new feature provides prescriptive guidance on switching predictive scaling to Forecast and Scale mode based on the factors of availability and cost savings. To determine the availability and cost savings, we compare the predictions against the actual capacity and the optimal, required capacity. This required capacity is inferred from whether your instances were running at a higher or lower value than the target value for the scaling metric that you defined as part of the predictive scaling policy configuration. For example, if an Auto Scaling group is running 10 instances at 20% CPU utilization while the target defined in the predictive scaling policy is 40%, then the instances are running under-utilized by 50% and the required capacity is assumed to be five instances (half of your current capacity). For an Auto Scaling group, based on the time range in which you’re interested (two weeks by default), we aggregate the cost savings and availability impact of predictive scaling. The availability impact measures the amount of time that the actual metric value was higher than the target value that you defined to be optimal for each policy. Similarly, cost savings measures the aggregated savings based on the capacity utilization of the underlying Auto Scaling group for each defined policy. The final cost and availability lead to a recommendation based on:

  • If availability increases (or remains same) and cost reduces (or remains same), then switch on Forecast and Scale
  • If availability reduces, then disable predictive scaling
  • If an availability increase comes at an increased cost, then make the call based on your cost-availability tradeoff threshold

This figure shows the console view of the recommendations on the Auto Scaling console. For each policy, we make a prescriptive recommendation on whether to switch to Forecast and Scale mode, along with whether doing so can lead to better availability and lower cost

Figure 1: Predictive scaling recommendations on the EC2 Auto Scaling console

The preceding figure shows how the console reflects the recommendation for a predictive scaling policy. You get information on whether the policy can lead to higher availability and lower cost, which leads to a recommendation to switch to Forecast and Scale. To achieve this cost saving, you might have to lower your minimum capacity and aim for higher utilization in dynamic scaling policies.

To get the most value from this feature, we recommend that you create multiple predictive scaling policies in Forecast Only mode with different configurations, choosing different metrics and/or different target values. The target value is an important lever that changes how aggressive the capacity forecasts must be. A lower target value increases your capacity forecast, resulting in better availability for your application. However, this also means spending more on Amazon EC2. Similarly, a higher target value can leave you under-scaled while reactive scaling bridges the gap in just a few minutes. Separate estimates of cost and availability impact are provided for each of the predictive scaling policies. We recommend using a policy if either availability or cost improves while the other variable improves or stays the same. As long as there is a predictable pattern, Auto Scaling enhanced with predictive scaling maintains high availability for your applications.

Continuous Monitoring of predictive scaling

Once you’re using a predictive scaling policy in Forecast and Scale mode based on the recommendation, you must monitor it for demand pattern changes or inaccurate predictions. We introduced two new CloudWatch metrics for predictive scaling: ‘PredictiveScalingLoadForecast’ and ‘PredictiveScalingCapacityForecast’. Using the CloudWatch metric math feature, you can create a customized metric that measures the accuracy of predictions. For example, to monitor whether your policy is over- or under-forecasting, you can publish separate metrics to measure the respective errors. In the following graphic, we show how metric math expressions can be used to create a Mean Absolute Error for over-forecasting on the load forecasts. Because predictive scaling can only increase capacity, it is useful to alert when the policy is excessively over-forecasting, to prevent unnecessary cost.

This figure shows the CloudWatch graph of three metrics: the total CPU utilization of the Auto Scaling group, the load forecast generated by predictive scaling, and the derived metric using metric math that measures error for over-forecasting

Figure 2: Graphing an accuracy metric using metric math on CloudWatch

In the previous graph, the total CPU utilization of the Auto Scaling group is represented by the m1 metric (orange), while the load predicted by the policy is represented by the m2 metric (green). We used the following expression to get the ratio of the over-forecasting error with respect to the actual value.

IF((m2-m1)>0, (m2-m1), 0)/m1

Next, we will set up an alarm to automatically send notifications using Amazon Simple Notification Service (Amazon SNS). You can create similar accuracy monitoring for capacity forecasts, but remember that once the policy is in Forecast and Scale mode, it already starts influencing the actual capacity. Hence, putting alarms on load forecast accuracy may be more intuitive, as load is generally independent of the capacity of an Auto Scaling group.

This figure shows the creation of an alarm that triggers when 10 out of 12 data points breach the 0.02 threshold for the accuracy metric

Figure 3: Creating a CloudWatch alarm on the accuracy metric

In the preceding screenshot, we have set an alarm that triggers when our custom accuracy metric goes above 0.02 (2 percent) for 10 out of the last 12 data points, which translates to 10 of the last 12 hours. We prefer to alarm on a greater number of data points so that we are notified only when predictive scaling is consistently giving inaccurate results.
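
If you prefer to create this alarm programmatically rather than through the console, a sketch like the following could work. The namespace and dimensions of the forecast metric, the Auto Scaling group and policy names, and the SNS topic ARN are assumptions to verify against your own setup:

import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({});

export const createOverForecastAlarm = async () =>
    cloudwatch.send(new PutMetricAlarmCommand({
        AlarmName: 'predictive-scaling-over-forecast',
        // Alarm when 10 of the last 12 hourly data points exceed 0.02
        EvaluationPeriods: 12,
        DatapointsToAlarm: 10,
        Threshold: 0.02,
        ComparisonOperator: 'GreaterThanThreshold',
        AlarmActions: ['arn:aws:sns:us-east-1:123456789012:ops-alerts'], // assumed SNS topic
        Metrics: [
            {
                Id: 'm1', ReturnData: false,
                MetricStat: {
                    Metric: {
                        Namespace: 'AWS/EC2', MetricName: 'CPUUtilization',
                        Dimensions: [{ Name: 'AutoScalingGroupName', Value: 'my-asg' }]
                    },
                    Period: 3600, Stat: 'Average'
                }
            },
            {
                Id: 'm2', ReturnData: false,
                MetricStat: {
                    Metric: {
                        // Namespace and dimensions assumed; check your policy's forecast metrics
                        Namespace: 'AWS/AutoScaling', MetricName: 'PredictiveScalingLoadForecast',
                        Dimensions: [
                            { Name: 'AutoScalingGroupName', Value: 'my-asg' },
                            { Name: 'PolicyName', Value: 'my-predictive-policy' }
                        ]
                    },
                    Period: 3600, Stat: 'Average'
                }
            },
            // Ratio of over-forecasting error with respect to the actual value
            { Id: 'e1', Expression: 'IF((m2-m1)>0, (m2-m1), 0)/m1', Label: 'OverForecastError' }
        ]
    }));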

Conclusion

With these new features, you can make a more informed decision about whether predictive scaling is right for you and which configuration makes the most sense. We recommend that you start off with Forecast Only mode and switch over to Forecast and Scale based on the recommendations. Once in Forecast and Scale mode, predictive scaling starts taking proactive scaling actions so that your instances are launched and ready to contribute to the workload in advance of the predicted demand. Then continuously monitor the forecast to maintain high availability and cost optimization of your applications. You can also use the new predictive scaling metrics and CloudWatch features, such as metric math, alarms, and notifications, to monitor and take actions when predictions are off by a set threshold for prolonged periods.

AWS Lambda: Resilience under-the-hood

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/aws-lambda-resilience-under-the-hood/

This post is written by Adrian Hornsby (Principal System Dev Engineer) and Marcia Villalba (Principal Developer Advocate).

AWS Lambda comprises over 80 services working together to provide the serverless compute service that it offers to customers. Under the hood, many of these services are built on top of Amazon Elastic Compute Cloud (Amazon EC2) instances, provisioned within Availability Zones. However, AWS Lambda is a Regional service. This means that customers use Lambda at the Region level, and its services are designed to be resilient to impairments in the underlying Availability Zones.

This blog post discusses how a Regional service such as Lambda takes advantage of Availability Zones and static stability to achieve its high availability target, and shows how Lambda teams verify their service’s static stability using AWS Fault Injection Simulator (AWS FIS). It also provides a solution using AWS services and tools to achieve Lambda’s resiliency strategy, using FIS, Amazon CloudWatch, and Amazon Route 53 Application Recovery Controller (Route 53 ARC).

The role of Availability Zones

Availability Zones are physically isolated sections of an AWS Region, designed to operate but also fail independently. They are separated by a meaningful distance from each other, up to 100 kilometers (60 miles), to prevent correlated failures, but close enough to use synchronous replication with single-digit millisecond latency.

Customers and AWS services have been using Availability Zones for years to build highly available, fault tolerant, and scalable applications. In particular, AWS Regional services such as AWS Lambda, Amazon DynamoDB, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Storage Service (Amazon S3) have achieved their high availability promises by spreading multiple independent replicas of their services across multiple Availability Zones. These services use the principles of independence and redundancy of Availability Zones to maximize their overall availability.

Each replica is called a zonal replica. The system is designed so that any of the replicas can fail at any time. When a replica fails, it can be temporarily removed from the system until everything works as expected again. When that happens, the load is shared between the remaining zonal replicas.

Designing for failures

One lesson we learned at AWS when building services is that, when there is an Availability Zone impairment, it is better not to rely on control plane operations to remediate the failure. A control plane operation can, for example, be provisioning more capacity in an Availability Zone that is not affected by the impairment.

This principle is called static stability, and it describes the capability for a system to keep its original steady-state (or behavior) even when subjected to disruptive events without having to make any changes. A statically stable service should have as few dependencies as possible for its recovery process.

For a Regional service like AWS Lambda, this means that the remaining capacity in the healthy Availability Zones can absorb the traffic from a potentially impaired Availability Zone without having to scale up. This implies over-provisioning resources in all Availability Zones. Having that extra capacity pre-provisioned helps Lambda achieve its static stability. It is a tradeoff between the cost of over-provisioning resources and service availability. Since AWS Lambda promises high availability to its customers, with a monthly uptime service commitment of 99.95%, that tradeoff falls towards service availability.

How to prepare for failures

Preparing for an Availability Zone impairment is difficult because the symptoms and size of the impact can vary widely. An Availability Zone may be partially accessible or totally unreachable, and everything in between. Causes for the impairment can range from fiber cuts, power issues, overheating, hardware malfunctions, networking problems, capacity issues, and other unexpected situations. While those happen, they happen rarely. The most common categories of failures are bad deployments and bad configurations.

While some of these failures can be difficult to infer or reproduce, common symptoms include disruption of connectivity, increased latency, increased traffic due to retry storms, increased CPU and memory usage, and slow I/O.

At AWS, we learned to expect the unexpected and plan for failure. This means injecting faults in the system to reproduce some of the common symptoms of Availability Zone impairments, then observe how the system responds, and implement improvements. In addition, injecting faults in the system helps uncover potential monitoring and alarming blind spots, and gives an opportunity for teams to practice and improve their response to events with a focus on reducing time to recovery.

How Lambda tests its response to an Availability Zone impairment

Lambda’s approach to being resilient to Availability Zone impairments is to rely on static stability and automated systems. Humans are slower than machines for detecting issues and mitigating them. Therefore, Lambda must ensure that its services can detect issues within a zonal replica and remediate automatically within minutes and with no operator intervention. This auto-remediation is done by shifting customer traffic away from the affected Availability Zone to healthy ones, and it is called Availability Zone evacuation.

To do this, Lambda built a tool that detects failures and performs the Availability Zone evacuation when needed. This tool does a statistical comparison of metrics between different Availability Zones and EC2 instances in order to identify unhealthy Availability Zones. If an Availability Zone is found to have issues, the tool starts the evacuation out of the unhealthy Availability Zone automatically. This automation cuts the time to the first action from 30 minutes to less than 3 minutes.

How AWS Lambda uses AWS FIS

To verify the automation continuously works as expected, Lambda performs a wide variety of tests, which includes Availability Zone failure testing in their pre-production environment. The main objective of these tests is to verify the services are statically stable in the presence of Availability Zone impairments, and to verify that the Availability Zone evacuation can be successfully initiated. The benefit of having an automated test is that teams can repeat it regularly and don’t need to have special skills. One click is all it takes to launch the test.

For these tests, Lambda uses AWS FIS to inject faults into their large fleet of EC2 instances. They use AWS FIS with support of the AWS Systems Manager (SSM) agent and resource filters to target their fleet of EC2 instances in a particular Availability Zone. This is a versatile approach that can inject resource faults, such as CPU and memory exhaustion, and networking faults, such as packet latency, loss, or drop.

Injecting packet loss or latency is very important, since these symptoms can have a serious impact on application and network performance. Indeed, latency and loss, even in small quantities, can create inefficiencies and prevent applications from running at their peak performance. For Lambda, being able to detect increased latency or loss before it affects customers is critical.
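
As an illustrative sketch only (not Lambda’s actual internal tooling), an FIS experiment template that injects network latency into EC2 instances in a single Availability Zone through the SSM agent might look like this; the role ARN, resource tags, Region, and zone name are placeholders:

import { FisClient, CreateExperimentTemplateCommand } from '@aws-sdk/client-fis';

const fis = new FisClient({});

export const createLatencyTemplate = async () =>
    fis.send(new CreateExperimentTemplateCommand({
        description: 'Inject network latency in one Availability Zone',
        roleArn: 'arn:aws:iam::123456789012:role/FisExperimentRole', // assumed role
        stopConditions: [{ source: 'none' }],
        targets: {
            'az-instances': {
                resourceType: 'aws:ec2:instance',
                selectionMode: 'ALL',
                resourceTags: { Service: 'my-zonal-replica' }, // assumed tag
                filters: [{ path: 'Placement.AvailabilityZone', values: ['us-east-1a'] }]
            }
        },
        actions: {
            'add-latency': {
                actionId: 'aws:ssm:send-command',
                targets: { Instances: 'az-instances' },
                parameters: {
                    documentArn: 'arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency',
                    documentParameters: JSON.stringify({
                        DelayMilliseconds: '200',
                        Interface: 'eth0',
                        DurationSeconds: '300',
                        InstallDependencies: 'True'
                    }),
                    duration: 'PT5M'
                }
            }
        }
    }));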

How to recover your applications rapidly from Availability Zones failures

You can build a similar solution to rapidly recover your applications from a zonal failure. The solution must have a mechanism to evacuate an impaired Availability Zone, a monitoring system that allows you to detect when a zonal replica is impaired, and a way to test the static stability of your system. AWS provides many tools and services that can help you build this solution to achieve Lambda’s resiliency strategy.

For performing Availability Zone evacuation, you can use the new zonal shift capability from Route 53 ARC, which at the time of writing is in preview. Zonal shift lets you evacuate an Availability Zone for applications that use Elastic Load Balancing. If you find that a zonal replica is impaired or unhealthy, you can use zonal shift to evacuate the Availability Zone for a period of time while the issue gets fixed.
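
As a hedged sketch of what starting a zonal shift could look like with the AWS SDK for JavaScript (the load balancer ARN and zone ID are placeholders):

import { ARCZonalShiftClient, StartZonalShiftCommand } from '@aws-sdk/client-arc-zonal-shift';

const arc = new ARCZonalShiftClient({});

export const evacuateZone = async () =>
    arc.send(new StartZonalShiftCommand({
        resourceIdentifier: 'arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/0123456789abcdef', // placeholder
        awayFrom: 'use1-az1',  // zone ID of the impaired Availability Zone
        expiresIn: '2h',       // the shift expires automatically unless extended
        comment: 'Evacuating impaired zonal replica'
    }));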

For performing the zonal shift, you must detect when a zonal replica is unhealthy. Your application must provide a signal of its health per Availability Zone. There are two common ways to capture this signal. First, passively, you can check your metrics, like response times, HTTP status codes, and other metrics that can help track fatal errors in your applications. Or actively, using synthetic monitoring, which allows you to create synthetic requests against your production application to provide a more complete view of the customer experience.

Amazon CloudWatch Synthetics provides canaries, which are scripts that run on a schedule and perform synthetic requests in your application endpoints and APIs. Canaries perform the same actions as customers and continuously verify the customer experience. You can create a canary for each zonal replica of your application and monitor the results independently.

With this information, if the user experience diminishes in one of the replicas, you can start an Availability Zone evacuation using zonal shift and minimize the bad experience for the user while you find and fix the sources of the failure.

To ensure that you can successfully recover from a failure, you must test the solution in advance. Without testing, it is just an assumption. To prove or disprove your assumptions about your system’s capability to handle disruptive events such as issues within an Availability Zones, you can use FIS.

With FIS, you can inject faults simultaneously in multiple resources within the same failure domain, such as Availability Zones. FIS currently integrates with several AWS services including EC2, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), Amazon Relational Database Service (Amazon RDS), AWS Networking, and CloudWatch.

Typical use cases for testing a workload’s resilience to Availability Zones impairment are, for example, terminating all compute resources and databases within a particular Availability Zone, injecting latency or packet loss, increasing resource consumption (CPU, memory, and I/O) in compute resources in a particular Availability Zone, or impacting network communication within or between Availability Zones.

For more information and a step-by-step example of how to recover rapidly from application failures in a single Availability Zone and testing it with AWS FIS, read this blog post.

Conclusion

This article discusses static stability, a mechanism that is used by AWS services such as Lambda to build resilient Regional services. It also discusses how AWS takes advantage of the same services and infrastructure as customers. It shows how Lambda uses multiple Availability Zones and services like AWS FIS to build highly available services and improve its recovery time from unexpected failures to only a few minutes without human intervention. Finally, it shows a solution that you can implement for your applications to achieve Lambda’s resilience strategy.

To learn more about AWS FIS, there are many tutorials and a workshop you can check out.

For more serverless learning resources, visit Serverless Land.

An elastic deployment of Stable Diffusion with Discord on AWS

Post Syndicated from Steven Warren original https://aws.amazon.com/blogs/architecture/an-elastic-deployment-of-stable-diffusion-with-discord-on-aws/

Stable Diffusion is a state-of-the-art text-to-image model that generates images from text. Deploying text-to-image models such as Stable Diffusion can be difficult. Currently, Stable Diffusion requires specific computer hardware known as graphics processing units (GPUs). You can lower the bar to entry by offloading the text-to-image generation onto Amazon Web Services (AWS).

Discord is a popular voice, video, and text communication service. It provides a user interface that people can use to make text-to-image requests. When deployed, all members of a Discord server can create images by using Discord Slash Commands.

In this post, we discuss how to deploy a highly available solution on AWS. This solution will perform text-to-image generation with Stable Diffusion and use Discord as the user interface.

Solution architecture

Many of the services selected are serverless, which offers many benefits. At the time of writing, Stable Diffusion requires a GPU for inference. Amazon Elastic Compute Cloud (Amazon EC2) was selected because it provides GPUs. The solution architecture is shown in Figure 1.

Solution architecture diagram

Figure 1. Solution architecture diagram

Let us walk through the architecture of this solution.

Auto scaling with custom metrics

To properly scale the system, a custom Amazon CloudWatch metric is created. This custom CloudWatch metric calculates the number of Amazon Elastic Container Service (Amazon ECS) tasks required to adequately handle the number of Amazon Simple Queue Service (Amazon SQS) messages. You need a high-resolution CloudWatch metric to scale up quickly; for this use case, a high-resolution metric published every 10 seconds was implemented.

Next, let’s create the custom CloudWatch metric. Amazon EventBridge rules provide a serverless solution for starting actions on a schedule. Here we use an Amazon EventBridge rule, which initiates an AWS Step Functions Express Workflow every minute. With the Express Workflow, we can create serverless workflows that take less than five minutes, which helps us avoid long-running AWS Lambda functions. The Express Workflow runs a Lambda function every 10 seconds over a one-minute period, which generates the custom CloudWatch metric.
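
As a rough sketch, the metric-publishing Lambda function could look like the following; the queue URL, namespace, metric name, and the one-message-per-task assumption are illustrative rather than taken from the actual solution.

import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/requests.fifo"  # assumed
MESSAGES_PER_TASK = 1  # assumed: each ECS task processes one request at a time

def handler(event, context):
    # Read the current queue backlog (approximate, per SQS semantics)
    attributes = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]
    backlog = int(attributes["ApproximateNumberOfMessages"])

    # Publish the required task count as a high-resolution metric
    cloudwatch.put_metric_data(
        Namespace="Custom/Diffusion",  # assumed namespace
        MetricData=[{
            "MetricName": "RequiredEcsTasks",
            "Value": -(-backlog // MESSAGES_PER_TASK),  # ceiling division
            "Unit": "Count",
            "StorageResolution": 1,  # marks the metric as high resolution
        }],
    )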

Two high-resolution CloudWatch alarms scale the system up and down, and are initiated by the custom CloudWatch metric. One CloudWatch alarm increases the ECS tasks and EC2 machines. The other alarm decreases the ECS tasks and EC2 machines.

Handling Discord requests

Someone on Discord sends a request. The Amazon API Gateway HTTP API receives the request and passes the information to an AWS Lambda function. The HTTP API provides a cost-effective option compared to REST APIs and provides tools for authentication and authorization. The HTTP API uses cross-origin resource sharing (CORS), which provides security because it only allows discord.com as an origin.

The AWS Lambda function provides a serverless solution for responding to the HTTP API requests. It transforms the HTTP API request and sends a message to the SQS First-In-First-Out (FIFO) queue. SQS seamlessly decouples the architecture between user requests and backend processing. A FIFO queue ensures that user requests are processed in the order they were requested. The AWS Lambda function sends a response back to the HTTP API within three seconds, which is a requirement of Discord Slash Commands.
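
A minimal sketch of such a handler follows, assuming content-based deduplication is enabled on the FIFO queue and omitting Discord’s mandatory request-signature verification; the queue URL and payload fields are illustrative. Returning interaction response type 5 (a deferred response) acknowledges the command within the three-second window while the image is generated asynchronously.

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/requests.fifo"  # assumed

def handler(event, context):
    # Request-signature verification (required by Discord) is omitted for brevity
    body = json.loads(event["body"])

    # Queue the prompt for the GPU-backed ECS tasks to pick up later;
    # assumes content-based deduplication is enabled on the FIFO queue
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageGroupId="diffusion-requests",
        MessageBody=json.dumps({
            "prompt": body["data"]["options"][0]["value"],
            "interaction_token": body["token"],  # used later to edit the reply
        }),
    )

    # Interaction response type 5 is a deferred response: it acknowledges the
    # Slash Command within the three-second window while work continues
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"type": 5}),
    }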

When scaling up, an EC2 instance is registered with the ECS cluster. Amazon EC2 was selected because it provides GPU instances; this solution currently uses only the g4dn.xlarge instance type. ECS provides a repeatable solution to deploy the container across a variety of instance types. The ECS service then places an ECS task onto the eligible EC2 instance. The ECS task uses the Amazon Elastic Container Registry (Amazon ECR) private registry to pull the image, perform text-to-image processing, and respond to the Discord request. The ECR private registry is a managed container registry that manages the image.

Once there is an ECS task running on an Amazon EC2 instance, the ECS task consumes messages from the queue using long polling. This reduces the number of ReceiveMessage requests the ECS task needs to send. When the ECS task receives a message from the queue, it then processes the request.
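
A sketch of that consumption loop might look like this, with an assumed queue URL and a hypothetical process() helper standing in for the Stable Diffusion inference and Discord reply:

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/requests.fifo"  # assumed

def process(message_body):
    # Hypothetical helper: runs Stable Diffusion inference and replies to Discord
    ...

while True:
    # Long polling: wait up to 20 seconds for a message rather than returning
    # immediately, which reduces the number of empty ReceiveMessage responses
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
    )
    for message in response.get("Messages", []):
        process(message["Body"])
        # Delete only after successful processing
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])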

Estimated monthly cost

The example assumes 1,000 requests per month and each request takes 16 seconds to complete. Extra EC2 time was added for the time to begin processing messages (seven minutes) and auto scaling cooldown time (30 minutes). You can adjust the pricing calculations with the AWS Pricing Calculator to reflect your usage and estimated cost.

Prerequisites

This blog assumes familiarity with Terraform, Docker, Discord, Amazon EC2, Amazon Elastic Block Store (Amazon EBS), Amazon Virtual Private Cloud (Amazon VPC), AWS Identity and Access Management (IAM), Amazon API Gateway, AWS Lambda, Amazon SQS, Amazon Elastic Container Registry (Amazon ECR), Amazon ECS, Amazon EventBridge, AWS Step Functions, and Amazon CloudWatch.

For this walkthrough, you should have the following prerequisites:

  • Access to an AWS account, with permissions to create the resources described in the installation steps section
  • A virtual private cloud (VPC) with public subnets that is associated with an internet gateway in the region you are deploying into
    We suggest using the default VPC. The subnets will need the tag key Tier with value Public and must be attached to the VPC. If you decide to create your own VPC with subnets, make sure that the auto-assign public IP setting is enabled.
  • An IAM user with the required permissions to deploy the infrastructure
  • A new Discord application that is registered to a Discord server you own with the scope applications.commands. Use this tutorial if you need a starting point on creating a Discord application.
    • Discord Bot token
    • Discord Application ID
    • Discord Public Key
  • A Hugging Face account
  • A computer with the following packages installed:

Walkthrough

Complete the following steps to deploy this solution on AWS.

Increase EC2 limits

This solution uses the g4dn.xlarge instance type, which might require you to request an EC2 limit increase. Check your current limit of Running On-Demand All G and VT instances. Make sure you have at least 4 vCPUs available; a single g4dn.xlarge requires 4 vCPUs. We suggest requesting 8 vCPUs so that you can access two g4dn.xlarge instances.

Deploy the infrastructure

  1. Ensure you have at least 60 GB of storage available and you’re running on a 64-bit x86 architecture system.
  2. Open a command line on the machine you will be deploying from.
  3. Log in as your AWS user through the AWS CLI with the command aws configure. If you are using an EC2 instance, create and use an instance profile rather than using the AWS CLI.
    The region you select will be the one you will deploy into.
  4. Clone the Terraform repository:
    git clone https://github.com/aws-samples/amazon-scalable-infra-discord-diffusion.git
  5. Navigate into the Terraform repository:
    cd amazon-scalable-infra-discord-diffusion
  6. Customize the variables in terraform.tfvars to match your deployment.
  7. Export the following secrets to the command line:
    • export TF_VAR_discord_bot_secret='DISCORD_BOT_SECRET_HERE'
    • export TF_VAR_huggingface_password='HUGGINGFACE_PASSWORD_HERE'
  8. Initialize the repository:
    terraform init
  9. Apply the infrastructure (this takes about 2 minutes):
    terraform apply
  10. Save the outputs for future use.

Set up Discord

This setup adds the Discord interactions URL to your Discord application. After terraform apply comes back successfully, move on to these steps.

  1. Open Discord Application Page -> General Information.
  2. Copy and paste the value from discord_interactions_endpoint_url into the Interactions Endpoint URL, and then save the changes.

If successful, there should be a green box with All your edits have been carefully recorded.

Docker image and Amazon Elastic Container Registry

In this section, you will create a Docker image with the Stable Diffusion model.

  1. Exit the terraform repository:
    cd ..
  2. Clone the Docker build repository:
    git clone https://github.com/aws-samples/amazon-scalable-discord-diffusion.git
  3. Navigate to the Docker repository:
    cd amazon-scalable-discord-diffusion
  4. Build and push the Docker image to ECR. This requires Docker to be installed on the machine and actively running.
    You can find the commands for your deployment from the Amazon ECR repository.

    View push commands for Amazon ECR

    Figure 2. View push commands for Amazon ECR

This is a large image (about 10 GB) and can take over 20 minutes to push, depending on your machine’s internet connection.

Request an image with Discord Slash Commands

This section describes how to request a text-to-image response with Discord.

  1. Log in to Discord and navigate to the server with your Discord application deployed.
  2. Navigate to a text channel.
  3. Type the command /sparkle.
    A box with COMMANDS MATCHING /sparkle will appear. Select the /sparkle command box.
    Depending on how you customized your Discord Application, the avatar image shown in Figure 3 might be different from what you have.

    Writing a Discord Slash Command

    Figure 3. Writing a Discord Slash Command

  4. Type in a prompt such as a corgi, style of monet.
    A response from YourBotName should appear with the response Submitted to Sparkle: YourPromptHere, as shown in Figure 4.

    First response from AWS Lambda function

    Figure 4. First response from AWS Lambda function

    It can take about 10 minutes for an EC2 instance to start and for an ECS task to be running on the instance. Once an ECS task is running on the instance, inference times should drop to under 30 seconds, depending on the request.
    When an ECS Task is running your request, you will see a Processing your Sparkle message, as shown in Figure 5.

    Amazon ECS task processing a request

    Figure 5. Amazon ECS task processing a request

    The message is complete when it says Completed your Sparkle! as shown in Figure 6.

    Amazon ECS task returning the final response

    Figure 6. Amazon ECS task returning the final response

Cleaning up

To avoid incurring future charges, delete the resources created by the Terraform script.

  1. Return to the directory where you deployed your terraform script.
  2. To destroy the infrastructure in AWS, run the command terraform destroy.
  3. When prompted to confirm that you want to destroy the infrastructure, type yes and press Enter.

Conclusion

In summary, we created a solution that allows members of a Discord server to create images from text with a Stable Diffusion model. With this implementation, the deployment can scale to many Discord servers and handle over one hundred requests per second.

Create projects on AWS that lower the bar to entry for people wanting to try text-to-image models.

Enabling load-balancing of non-HTTP(s) traffic on AWS Wavelength

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/enabling-load-balancing-of-non-https-traffic-on-aws-wavelength/

This blog post is written by Jack Chen, Telco Solutions Architect, and Robert Belson, Developer Advocate.

AWS Wavelength embeds AWS compute and storage services within 5G networks, providing mobile edge computing infrastructure for developing, deploying, and scaling ultra-low-latency applications. AWS recently introduced support for Application Load Balancer (ALB) in AWS Wavelength zones. Although ALB addresses Layer-7 load balancing use cases, some low latency applications that get deployed in AWS Wavelength Zones rely on UDP-based protocols, such as QUIC, WebRTC, and SRT, which can’t be load-balanced by Layer-7 Load Balancers. In this post, we’ll review popular load-balancing patterns on AWS Wavelength, including a proposed architecture demonstrating how DNS-based load balancing can address customer requirements for load-balancing non-HTTP(s) traffic across multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. This solution also builds a foundation for automatic scale-up and scale-down capabilities for workloads running in an AWS Wavelength Zone.

Load balancing use cases in AWS Wavelength

In the AWS Regions, customers looking to deploy highly available edge applications often consider Elastic Load Balancing (ELB) as an approach to automatically distribute incoming application traffic across multiple targets in one or more Availability Zones (AZs). However, at the time of this publication, the AWS-managed Network Load Balancer (NLB) isn’t supported in AWS Wavelength Zones, and ALB is still being rolled out to all AWS Wavelength Zones globally. As a result, this post documents general architectural guidance for load balancing solutions on AWS Wavelength.

As one of the most prominent AWS Wavelength use cases, highly immersive video streaming over UDP at scale, using protocols such as WebRTC, often requires a load balancing solution to accommodate surges in traffic, whether due to live events or general customer access patterns. These use cases, relying on Layer-4 traffic, can’t be load-balanced by a Layer-7 ALB. Instead, Layer-4 load balancing is needed.

To date, two infrastructure deployments involving Layer-4 load balancers are most often seen:

  • Amazon EC2-based deployments: Often the environment of choice for earlier-stage enterprises and ISVs, a fleet of EC2 instances will leverage a load balancer for high-throughput use cases, such as video streaming, data analytics, or Industrial IoT (IIoT) applications
  • Amazon EKS deployments: Customers looking to optimize performance and cost efficiency of their infrastructure can leverage containerized deployments at the edge to manage their AWS Wavelength Zone applications. In turn, external load balancers could be configured to point to exposed services via NodePort objects. Furthermore, a more popular choice might be to leverage the AWS Load Balancer Controller to provision an ALB when you create a Kubernetes Ingress.

Regardless of deployment type, the following design constraints must be considered:

  • Target registration: For load balancing solutions not managed by AWS, seamless solutions to load balancer target registration must be managed by the customer. As one potential solution, visit a recent HAProxyConf presentation, Practical Advice for Load Balancing at the Network Edge.
  • Edge Discovery: Although DNS records can be populated into Amazon Route 53 for each carrier-facing endpoint, DNS won’t deterministically route mobile clients to the most optimal mobile endpoint. When available, edge discovery services are required to most effectively route mobile clients to the lowest latency endpoint.
  • Cross-zone load balancing: Given the hub-and-spoke design of AWS Wavelength, customer-managed load balancers should proxy traffic only to that AWS Wavelength Zone.

Solution overview – Amazon EC2

In this section, we present a highly available load balancing solution for an Amazon EC2-based deployment in a single AWS Wavelength Zone. In a separate post, we’ll cover the needed configurations for the AWS Load Balancer Controller in AWS Wavelength for Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

The proposed solution introduces DNS-based load balancing, a technique to abstract away the complexity of intelligent load-balancing software and allow your Domain Name System (DNS) resolvers to distribute traffic (equally, or in a weighted distribution) to your set of endpoints.

Our solution leverages the weighted routing policy in Route 53 to resolve inbound DNS queries to multiple EC2 instances running within an AWS Wavelength zone. As EC2 instances for a given workload get deployed in an AWS Wavelength zone, Carrier IP addresses can be assigned to the network interfaces at launch.

Through this solution, Carrier IP addresses attached to AWS Wavelength instances are automatically added as DNS records for the customer-provided public hosted zone.

To determine how Route 53 responds to queries, given an arbitrary number of records in a public hosted zone, Route 53 offers numerous routing policies:

Simple routing policy – In the event that you must route traffic to a single resource in an AWS Wavelength Zone, simple routing can be used. A single record can contain multiple IP addresses, but Route 53 returns the values in a random order to the client.

Weighted routing policy – To route traffic more deterministically using a set of proportions that you specify, this policy can be selected. For example, if you would like Carrier IP A to receive 50% of the traffic and Carrier IP B to receive 50% of the traffic, we’ll create two individual A records (one for each Carrier IP) with a weight of 50 and 50, respectively. Learn more about Route 53 routing policies by visiting the Route 53 Developer Guide.

The proposed solution leverages the weighted routing policy in Route 53 DNS to route traffic to multiple EC2 instances running within an AWS Wavelength zone.

Reference architecture

The following diagram illustrates the load-balancing component of the solution, where EC2 instances in an AWS Wavelength zone are assigned Carrier IP addresses. A weighted DNS record for a host (e.g., www.example.com) is updated with Carrier IP addresses.

DNS-based load balancing

When a device makes a DNS query, it will be returned to one of the Carrier IP addresses associated with the given domain name. With a large number of devices, we expect a fair distribution of load across all EC2 instances in the resource pool. Given the highly ephemeral mobile edge environments, it’s likely that Carrier IPs could frequently be allocated to accommodate a workload and released shortly thereafter. However, this unpredictable behavior could yield stale DNS records, resulting in a “blackhole” – routes to endpoints that no longer exist.

Time-To-Live (TTL) is a DNS attribute that specifies the amount of time, in seconds, that you want DNS recursive resolvers to cache information about this record.

In our example, we set the TTL to 30 seconds to force DNS resolvers to retrieve the latest records from the authoritative name servers and minimize stale DNS responses. However, a lower TTL has a direct impact on cost as a result of the increased number of calls from recursive resolvers to Route 53 to constantly retrieve the latest records.

The core components of the solution are as follows:

Alongside the services above in the AWS Wavelength Zone, the following services are also leveraged in the AWS Region:

  • AWS Lambda – a serverless event-driven function that makes API calls to the Route 53 service to update DNS records.
  • Amazon EventBridge – a serverless event bus that reacts to EC2 instance lifecycle events and invokes the Lambda function to make DNS updates.
  • Route 53 – cloud DNS service with a domain record pointing to AWS Wavelength-hosted resources.

In this post, we intentionally leave the specific load balancing software solution up to the customer. Customers can leverage various popular load balancers available on the AWS Marketplace, such as HAProxy and NGINX. To keep our solution focused on the auto-registration of DNS records, it is designed to support stateless workloads only. To support stateful workloads, sticky sessions – a process in which requests are routed to the same target in a target group – must be configured by the underlying load balancer solution and are outside the scope of what DNS can provide natively.

Automation overview

Using the aforementioned components, we can implement the following workflow automation:

Event-driven Auto Scaling Workflow

An Amazon CloudWatch alarm can trigger an Auto Scaling group scale-out or scale-in event by adding or removing EC2 instances. EventBridge detects the EC2 instance state change event and invokes the Lambda function. This function updates the DNS record in Route 53 by either adding (scale out) or deleting (scale in) a weighted A record associated with the EC2 instance changing state.
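
A condensed sketch of that Lambda function follows; the hosted zone ID, record name, and weight are placeholder assumptions, and the delete path for terminating instances is omitted. The event fields follow the EC2 instance state-change notification schema.

import boto3

ec2 = boto3.client("ec2")
route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789ABCDEFGHIJ"  # assumed; could be read from Parameter Store
RECORD_NAME = "www.example.com"           # assumed domain

def handler(event, context):
    # Fields follow the "EC2 Instance State-change Notification" event schema
    instance_id = event["detail"]["instance-id"]
    state = event["detail"]["state"]

    if state != "running":
        return  # a DELETE change for terminated instances would mirror the logic below

    # Look up the Carrier IP attached to the instance's network interface
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    interface = reservations[0]["Instances"][0]["NetworkInterfaces"][0]
    carrier_ip = interface["Association"]["CarrierIp"]

    # Register the instance as one more weighted target for the domain
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "A",
                "SetIdentifier": instance_id,  # must be unique per weighted record
                "Weight": 50,                  # equal weights give an even split
                "TTL": 30,                     # low TTL limits stale answers
                "ResourceRecords": [{"Value": carrier_ip}],
            },
        }]},
    )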

Configuration of the auto scaling policy is out of the scope of this post. There are many auto scaling triggers you can consider using, based on predefined and custom metrics such as memory utilization. For demo purposes, we will adjust the Auto Scaling group manually.

In addition to the core components that were already described, our solution also utilizes AWS Identity and Access Management (IAM) policies and CloudWatch. Both services are key components to building AWS Well-Architected solutions on AWS. We also use AWS Systems Manager Parameter Store to keep track of user input parameters. The deployment of the solution is automated via AWS CloudFormation templates. The Lambda function provided should be uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.

Amazon Virtual Private Cloud (Amazon VPC), subnets, Carrier Gateway, and Route Tables are foundational building blocks for AWS-based networking infrastructure. In our deployment, we are creating a new VPC, one subnet in an AWS Wavelength zone of your choice, a Carrier Gateway, and updating the route table for this subnet to point the default route to the Carrier Gateway.

Wavelength VPC architecture.

Deployment prerequisites

The following are prerequisites to deploy the described solution in your account:

  • Access to an AWS Wavelength zone. If your account is not allow-listed to use AWS Wavelength zones, then opt in to AWS Wavelength zones here.
  • Public DNS Hosted Zone hosted in Route 53. You must have access to a registered public domain to deploy this solution. The zone for this domain should be hosted in the same account where you plan to deploy AWS Wavelength workloads.
    If you don’t have a public domain, then you can register a new one. Note that there will be a service charge for the domain registration.
  • Amazon S3 bucket. For the Lambda function that updates DNS records in Route 53, store the source code as a .zip file in an Amazon S3 bucket.
  • Amazon EC2 key pair. You can use an existing key pair for the deployment. If you don’t have a key pair in the Region where you plan to deploy this solution, then create one by following these instructions.
  • 4G or 5G-connected device. Although the infrastructure can be deployed independent of the underlying connected devices, testing the connectivity will require a mobile device on one of the Wavelength partner’s networks. View the complete list of Telecommunications providers and Wavelength Zone locations to learn more.

Conclusion

In this post, we demonstrated how to implement DNS-based load balancing for workloads running in an AWS Wavelength zone. We deployed a solution that uses an EventBridge rule and a Lambda function to update DNS records hosted by Route 53. If you want to learn more about AWS Wavelength, subscribe to the AWS Compute Blog channel here.

Genomics workflows, Part 3: automated workflow manager

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-3-automated-workflow-manager/

Genomics workflows are high-performance computing workloads. Life-science research teams make use of various genomics workflows. With each invocation, they specify custom sets of data and processing steps, and translate them into commands. Furthermore, team members must stay on to monitor progress and troubleshoot errors, which can be cumbersome, non-differentiated, administrative work.

In Part 3 of this series, we describe the architecture of a workflow manager that simplifies the administration of bioinformatics data pipelines. The workflow manager dynamically generates the launch commands based on user input and keeps track of the workflow status. This workflow manager can be adapted to many scientific workloads—effectively becoming a bring-your-own-workflow-manager for each project.

Use case

In Part 1, we demonstrated how life-science research teams can use Amazon Web Services to remove the heavy lifting of conducting genomic studies, and our design pattern was built on AWS Step Functions with AWS Batch. We mentioned that we’ve worked with life-science research teams to put failed job logs onto Amazon DynamoDB. Some teams prefer to use command-line interface tools, such as the AWS Command Line Interface; other interfaces, such as PyBDA with Apache Spark, or CWL experimental grammar in combination with the Amazon Simple Storage Service (Amazon S3) API, are also used when access to the AWS Management Console is prohibited. In our use case, scientists used the console to easily update table items, plus initiate retry via DynamoDB streams.

In this blog post, we extend this idea to a new frontend layer in our design pattern. This layer automates command generation and monitors the invocations of a variety of workflows—becoming a workflow manager. Life-science research teams use multiple workflows for different datasets and use cases, each with different syntax and commands. The workflow manager we create removes the administrative burden of formulating workflow-specific commands and tracking their launches.

Solution overview

We allow scientists to upload their requested workflow configuration as objects in Amazon S3. We use S3 Event Notifications on PUT requests to invoke an AWS Lambda function. The function parses the uploaded S3 object and registers the new launch request as a DynamoDB item using the PutItem operation. Each item corresponds to a distinct launch request, stored as a key-value pair. Item values store the:

  • S3 data path containing genomic datasets
  • Workflow endpoint
  • Preferred compute service (optional)

Another Lambda function monitors for change data capture events in the DynamoDB stream (Figure 1). With each PutItem operation, the Lambda function prepares a workflow invocation, which includes translating the user input into the syntax and launch commands of the respective workflow.

In the case of Snakemake (discussed in Part 2), the function creates a Snakefile that declares processing steps and commands. The function spins up an AWS Fargate task that builds the computational tasks, distributes them with AWS Batch, and monitors for completion. An AWS Step Functions state machine orchestrates job processing, for example, initiated by Tibanna.

Amazon CloudWatch provides a consolidated overview of performance metrics, like time elapsed, failed jobs, and error types. We store log data, including status updates and errors, in Amazon CloudWatch Logs. A third Lambda function parses those logs and updates the status of each workflow launch request in the corresponding DynamoDB item (Figure 1).

Workflow manager for genomics workflows

Figure 1. Workflow manager for genomics workflows

Implementation considerations

In this section, we describe some of our past implementation considerations.

Register new workflow requests

DynamoDB items are key-value pairs. We use launch IDs as keys, and the value includes the workflow type, compute engine, S3 data path, S3 object path to the user-defined configuration file, and workflow status. Our Lambda function parses the configuration file and generates all commands plus ancillary artifacts, such as Snakefiles.
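
A minimal sketch of that registration, with an assumed table name and illustrative attribute names:

import boto3

table = boto3.resource("dynamodb").Table("WorkflowLaunches")  # assumed table name

def register_launch(launch_id, workflow_type, config_s3_path):
    # One item per launch request, keyed by launch ID
    table.put_item(Item={
        "launch_id": launch_id,            # partition key
        "workflow_type": workflow_type,    # e.g., "snakemake"
        "compute_engine": "aws-batch",     # default engine; overridable per request
        "config_s3_path": config_s3_path,  # s3://bucket/object
        "status": "REGISTERED",
    })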

Launch workflows

Launch requests are picked up by a Lambda function from the DynamoDB stream. The function has the following required parameters:

  • Launch ID: unique identifier of each workflow launch request
  • Configuration file: the Amazon S3 path to the configuration sheet with launch details (in s3://bucket/object format)
  • Compute service (optional): our workflow manager allows you to select a particular service on which to run computational tasks, such as Amazon Elastic Compute Cloud (Amazon EC2) or AWS ParallelCluster with Slurm Workload Manager. The default is the pre-defined compute engine.

These parameters assume that the configuration sheet has already been uploaded to an accessible location in an S3 bucket. A valid request issues a new Snakemake Fargate launch task. If either of the required parameters is not provided or access fails, the workflow manager returns MissingRequiredParametersError.

Log workflow launches

Logs are written to CloudWatch Logs automatically. We write the location of the CloudWatch log group and log stream into the DynamoDB table. To send logs to Amazon CloudWatch, specify the awslogs driver in the Fargate task definition settings in your provisioning template.
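
Expressed through the SDK rather than a provisioning template, the logging-relevant part of such a task definition might look like the following sketch; the family, image, log group, and role ARN are placeholder assumptions.

import boto3

ecs = boto3.client("ecs")

# Only the logging-relevant settings are shown; other task settings are elided
ecs.register_task_definition(
    family="snakemake-headnode",  # assumed family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="2048",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # assumed
    containerDefinitions=[{
        "name": "snakemake",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/snakemake:latest",
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "/ecs/snakemake",  # assumed log group
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "workflow",
            },
        },
    }],
)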

Our Lambda function writes Fargate task launch logs from CloudWatch Logs to our DynamoDB table. For example, OutOfMemoryError can occur if the process utilizes more memory than the container is allocated.

AWS Batch job state logs are written to the following log group in CloudWatch Logs: /aws/batch/job. Our Lambda function writes status updates to the DynamoDB table. AWS Batch jobs may encounter errors, such as being stuck in RUNNABLE state.

Manage state transitions

We manage the status of each job in DynamoDB. Whenever a Fargate task changes state, it is picked up by a CloudWatch rule that references the Fargate compute cluster. This CloudWatch rule invokes a notifier Lambda function that updates the workflow status in DynamoDB.
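
A sketch of such a notifier function follows, under the assumption that the launch ID was passed as the task’s startedBy value when the Fargate task was started; the table and attribute names are illustrative.

import boto3

table = boto3.resource("dynamodb").Table("WorkflowLaunches")  # assumed table name

def handler(event, context):
    # Fields follow the "ECS Task State Change" event schema
    detail = event["detail"]
    launch_id = detail.get("startedBy")  # assumed: set to the launch ID at run time
    if not launch_id:
        return

    table.update_item(
        Key={"launch_id": launch_id},
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},  # "status" is a reserved word
        ExpressionAttributeValues={":s": detail["lastStatus"]},  # e.g., RUNNING, STOPPED
    )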

Conclusion

In this blog post, we demonstrated how life-science research teams can simplify genomic analysis across an array of workflows. These workflows usually have their own command syntax and workflow management system, such as Snakemake. The presented workflow manager removes the administrative burden of preparing and formulating workflow launches, increasing reliability.

The pattern is broadly reusable with any scientific workflow and related high-performance computing systems. The workflow manager provides persistence to enable historical analysis and comparison, which enables us to automatically benchmark workflow launches for cost and performance.

Stay tuned for Part 4 of this series, in which we explore how to enable our workflows to process archival data stored in the Amazon S3 Glacier storage classes.

Related information

AWS Week in Review – December 19, 2022

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-week-in-review-december-19-2022/

We are halfway between the re:Invent conference and the end-of-year holidays, and I did expect the cadence of releases and news to slow down a bit, but nothing could be further from reality. Our teams continue to listen to your feedback and release new capabilities and incremental improvements.

This week, many items caught my attention. Here is my summary.

The AWS Pricing Calculator for Amazon EC2 is getting a redesign to provide you with a simplified, consistent, and efficient calculator to estimate costs. It also added a way to bulk estimate costs for EC2 instances, EC2 Dedicated Hosts, and Amazon EBS services. Try it for yourself today.

AWS Pricing Calculator

Amazon CloudWatch Metrics Insights alarms now enable you to trigger alarms on entire fleets of dynamically changing resources (such as automatically scaling EC2 instances) with a single alarm, using standard SQL queries. For example, you can now write a query like this to collect data about CPU utilization over your entire dynamic fleet of EC2 instances.

SELECT AVG(CPUUtilization) FROM SCHEMA("AWS/EC2", InstanceId)

AWS Amplify is a command line tool and a set of libraries to help you to build web and mobile applications connected to a cloud backend. We released Amplify Library for Android 2.0, with improvements and simplifications for user authentication. The team also released Amplify JavaScript library version 5, with improvements for React and React Native developers, such as a new notifications channel, also known as in-app messaging, that developers can use to display contextual messages to their users based on their behavior. The Amplify JavaScript library has also received improvements to reduce the overall bundle size and installation size.

Amazon Connect added granular access control based on resource tags for routing profiles, security profiles, users, and queues. It also adds bulk import for user hierarchy tags. This allows you to use attribute-based access control policies for Amazon Connect resources.

Amazon RDS Proxy now supports PostgreSQL major version 14. RDS Proxy is a fully managed, highly available database proxy for Amazon Relational Database Service (Amazon RDS) that makes applications more scalable, more resilient to database failures, and more secure. It is typically used by serverless applications that can have a large number of open connections to the database server and may open and close database connections at a high rate, exhausting database memory and compute resources.

AWS Gateway Load Balancer endpoints now support IPv6 addresses. You can now send IPv6 traffic through Gateway Load Balancers and their endpoints to distribute traffic flows to dual-stack appliance targets.

Amazon Location Service now provides Open Data Maps, in addition to ESRI and HERE maps. I also noticed that Amazon is a core member of the new Overture Maps Foundation, officially hosted by the Linux Foundation. The mission of the Overture Maps Foundation is to power new map products through openly available datasets that can be used and reused across applications and businesses. The program is driven by Amazon Web Services (AWS), Facebook’s parent company Meta, Microsoft, and Dutch mapping company TomTom.

AWS Mainframe Modernization is a set of managed tools providing infrastructure and software for migrating, modernizing, and running mainframe applications. It is now available in three additional AWS Regions and supports AWS CloudFormation, AWS PrivateLink, and AWS Key Management Service.

X in Y. Jeff started this section a while ago to list the expansion of new services and capabilities to additional Regions. I noticed 11 Regional expansions this week:

Other AWS News
This week, I also noticed these AWS news items:

Amazon SageMaker turned 5 years old 🎉🎂. You can read the initial blog post we published at the time. To celebrate the event, Amazon Science published this article, where AWS Vice President Bratin Saha reflects on the past and future of AWS’s machine learning tools and AI services.

The security blog published a great post about the Cedar policy language. It explains how Amazon Verified Permissions provides a pre-built, flexible permissions system that you can use to build permissions based on both ABAC and RBAC in your applications. Cedar policy language is also at the heart of Amazon Verified Access I blogged about during re:Invent.

And just like every week, my most excellent colleague Ricardo published the open source newsletter.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent recaps in your area. During the re:Invent week, we had lots of new announcements, and in the next weeks, you can find in your area a recap of all these launches. All the events will be posted on this site, so check it regularly to find an event nearby.

AWS re:Invent keynotes, leadership sessions, and breakout sessions are available on demand. I recommend that you check the playlists and find the talks about your favorite topics in one collection.

AWS Summits season will restart in Q2 2023. The dates and locations will be announced here.

Stay Informed
That is my selection for this week! Heads up – the Week in Review will be taking a short break for the end of the year, but we’ll be back with regular updates starting on January 9, 2023. To better keep up with all of this news, do not forget to check out the following resources:

— seb
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Monitor AWS workloads without a single line of code with Logz.io and Kinesis Firehose

Post Syndicated from Amos Etzion original https://aws.amazon.com/blogs/big-data/monitor-aws-workloads-without-a-single-line-of-code-with-logz-io-and-kinesis-firehose/

Observability data provides near real-time insights into the health and performance of AWS workloads, so that engineers can quickly address production issues and troubleshoot them before widespread customer impact.

As AWS workloads grow, observability data has been exploding, which requires flexible big data solutions to handle the throughput of large and unpredictable volumes of observability data.

Solution overview

One option is Amazon Kinesis Data Firehose, which is a popular service for streaming huge volumes of AWS data for storage and analytics. By pulling data from Amazon CloudWatch, Amazon Kinesis Data Firehose can deliver data to observability solutions.

Among these observability solutions is Logz.io, which can now ingest metric data from Amazon Kinesis Data Firehose and make it easier to get metrics from your AWS account to your Logz.io account for analysis, alerting, and correlation with logs and traces.

In a few clicks and a few configurations, we’ll see how you can start streaming your metric data (and soon, log data!) to Logz.io for storage and analysis.

Prerequisites

  • Logz.io account – Create a free trial here
  • Logz.io shipping token – Learn about metrics tokens here. You need to be a Logz.io administrator.
  • Access to Amazon CloudWatch and Amazon Kinesis Data Firehose with the appropriate permissions to manage HTTP endpoints.
  • Appropriate permissions to create an Amazon Simple Storage Service (Amazon S3) bucket

Sending Amazon CloudWatch metric data to Logz.io with an Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose is a service for ingesting, processing, and loading data from large, distributed sources such as logs or clickstreams into multiple consumers for storage and real-time analytics. Kinesis Data Firehose supports more than 50 sources and destinations as of today. This integration can be set up in minutes without a single line of code and enables near real-time analytics for observability data generated by AWS services by using Amazon CloudWatch, Amazon Kinesis Data Firehose, and Logz.io.

Once the integration is configured, Logz.io customers can open the Infrastructure Monitoring product to see their data coming in and populating their dashboards. To see some of the data analytics and correlation you get with Logz.io, check out this short demonstration.

Let’s begin a step-by-step tutorial for setting up the integration.

  • Start by going to the Amazon Kinesis Data Firehose console and creating a delivery stream.

Kinesis Firehose Console

  • Next, you select a source and destination. Select Direct Put as the source and Logz.io as the destination.
  • Next, configure the destination settings. Give the HTTP endpoint a name, which should include logz.io.
  • Select from the dropdown the appropriate endpoint you would like to use.

If you’re sending data to a European region, then set it to Logz.io Metrics EU. Or you can use the us-east-1 destination by selecting Logz.io Metrics US.

  • Next, add your Logz.io shipping token. You can find this by going to Settings in Logz.io and selecting Manage Tokens, which requires Logz.io administrator access. This ensures that your account only ingests data from the defined sources (e.g., this Amazon Kinesis Data Firehose delivery stream).

Kinesis Stream config

Keep Content encoding on Disabled and set your desired Retry Duration.

You can also configure Buffer hints to your preferences.

  • Next, determine your Backup settings in case something goes wrong. In most cases, it’s only necessary to back up the failed data. Simply choose an Amazon S3 bucket or create a new one to store data if it doesn’t make it to Logz.io. Then, select Create a delivery stream.

Now it’s time to connect Amazon CloudWatch to our Amazon Kinesis Data Firehose Delivery Stream.

  • Navigate to Amazon CloudWatch and select Streams in the Metrics menu. Select Create metrics stream.
  • Next, you can choose either to send all your Amazon CloudWatch metrics to Logz.io, or only metrics from specified namespaces.

In this case, we chose Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), AWS Lambda, and Elastic Load Balancing (ELB).

  • Under Configuration, choose the Select an existing Firehose owned by your account option and choose the Amazon Kinesis Data Firehose you just configured.

Metric Streams Config

If you’d like, you can choose additional statistics in the Add additional statistics box, which provides helpful percentile metrics to monitor, such as latency (i.e., which services have the highest average latency). This may increase your costs.

  • Lastly, give your metric stream a name and hit Create metric stream.

That’s it! Without writing a single line of code, we configured an integration with AWS and Logz.io that enables fast and easy infrastructure monitoring through Amazon CloudWatch data collection.

Your metrics will be stored in Logz.io for 18 months out of the box, without requiring any overhead management.

You can also begin to build dashboards and alerts to begin monitoring – like this Amazon EC2 monitoring dashboard below.

ec2 monitoring dashboard Logz.io

Conclusion

This post demonstrated how to configure an integration with AWS and Logz.io for efficient infrastructure monitoring through Amazon CloudWatch.

To learn more about building metrics dashboards in Logz.io, you can watch this video.

Currently, some users might find that they are sending more data than they really need, which can raise costs. In future versions of this integration, it will be easier to narrow down the metrics to reduce costs.

Want to try it yourself? Create a Logz.io account today, navigate to our infrastructure monitoring product, and start streaming metric data to Logz.io to start monitoring.


About the authors

Amos Etzion – Product Manager at Logz.io

Charlie Klein – Product Marketing Manager at Logz.io

Mark Kriaf – Partner Solutions Architect at AWS

Genomics workflows, Part 2: simplify Snakemake launches

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-2-simplify-snakemake-launches/

Genomics workflows are high-performance computing workloads. In Part 1 of this series, we demonstrated how life-science research teams can focus on scientific discovery without the associated heavy lifting. We used regenie for large genome-wide association studies. Our design pattern built on AWS Step Functions with AWS Batch and Amazon FSx for Lustre.

In Part 2, we explore genomics workloads with built-in workflow logic. Historically, running bioinformatics data pipelines was a manual and error-prone task. Over the last few years, multiple workflow management systems have emerged. An example of these is the Snakemake workflow management system with Tibanna orchestration. We discuss the solution design and how you can fully automate the launch with Amazon Web Services (AWS).

Use case

We focus on the use case of Snakemake, an open-source utility for whole genome sequence mapping in directed acyclic graph (DAG) format. Snakemake uses Snakefiles to declare workflow steps and commands. A Snakefile extends Python syntax to declare workflow steps such as mapping data sets to DAG structure and identifying variants. Consult the Snakemake tutorial for further information on workflow rules.

Snakefiles provide an exception to the general design pattern and an alternative to modeling workflow logic granularly in Amazon States Language. In our real-life use case, we used Tibanna to orchestrate Snakemake. Tibanna is an open-source, AWS-native software that runs bioinformatics data pipelines. It supports Snakefile syntax, plus other workflow languages, including Common Workflow Language and Workflow Description Language (WDL).

We recommend using Amazon Genomics CLI, if Tibanna is not needed for your use case, and Amazon Omics, if your workflow definitions are compliant with the supported WDL and Nextflow specifications.

Solution overview

Snakemake is available as a Docker image on GitHub. We push the image to Amazon Elastic Container Registry. Tibanna is also available as a Docker image on GitHub—it comes with Snakemake. Consult the Tibanna installation guide for more information.

We store Snakefiles on Amazon Simple Storage Service (Amazon S3). We configure S3 Event Notifications on PUT request operations. The event notification triggers an AWS Lambda function. The Lambda function launches an AWS Fargate task, which overrides the task definition command with the appropriate Snakemake start command and arguments.
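
A condensed sketch of that Lambda function might look like the following; the cluster, task definition, container name, subnet, and Snakemake arguments are placeholder assumptions that would differ per deployment.

import boto3

ecs = boto3.client("ecs")

CLUSTER = "snakemake-cluster"            # assumed cluster name
TASK_DEFINITION = "snakemake-headnode"   # assumed task definition family
SUBNETS = ["subnet-0123456789abcdef0"]   # assumed subnet

def handler(event, context):
    # S3 event notification for the PUT request; extract the Snakefile location
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Launch the Fargate task, overriding the container command with the
    # Snakemake start command and arguments for this particular Snakefile
    ecs.run_task(
        cluster=CLUSTER,
        launchType="FARGATE",
        taskDefinition=TASK_DEFINITION,
        networkConfiguration={"awsvpcConfiguration": {
            "subnets": SUBNETS,
            "assignPublicIp": "ENABLED",
        }},
        overrides={"containerOverrides": [{
            "name": "snakemake",  # assumed container name
            "command": ["/bin/sh", "-c",
                        f"aws s3 cp s3://{bucket}/{key} Snakefile && "
                        f"snakemake --tibanna --default-remote-prefix={bucket}/data"],
        }]},
    )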

The launched AWS Fargate task pulls the Snakefiles at launch time for each job and prepares the Snakemake initiation commands. Once the Snakefiles are downloaded on the Fargate task, the Snakemake head initiation command is invoked to begin launching jobs using Tibanna. Tibanna invokes a Step Functions state machine which orchestrates the launch of Snakemake on Amazon Elastic Compute Cloud (Amazon EC2).

Amazon CloudWatch provides a consolidated overview of performance metrics, including elapsed time, failed jobs, and error types. You can keep logs of your failed jobs in CloudWatch Logs (Figure 1). You can set up filters to match specific error types, plus create subscriptions to deliver a real-time stream of your log events to Amazon Kinesis or Lambda for further retry.

Solution architecture for Snakemake with Tibanna on AWS

Figure 1. Solution architecture for Snakemake with Tibanna on AWS

Implementation considerations

Here, we describe some of the implementation considerations.

Creating Snakefiles

The launching point for each initiation is a Snakefile. Each Snakefile may contain one or more samples to be launched. The Snakefile resides in an S3 bucket. This adds flexibility and the ability to purge any sensitive or restrictive information after the job has been processed.

Invoking Tibanna

In order to launch Snakemake DAGs using Tibanna, we need to set up a new Tibanna Unicorn. A Tibanna Unicorn is a Step Functions state machine and a corresponding Lambda function for provisioning EC2 instances.

The state machine runs the following sequence:

  1. Create EC2 instance
  2. Check EC2 status
  3. Exit

After the Tibanna Unicorn has been created, we can start a Snakemake DAG using the following sample commands inside of the Fargate task.

$ export TIBANNA_DEFAULT_STEP_FUNCTION_NAME=YOUR_UNICORN_PROJECT
$ snakemake --tibanna --tibanna-config spot_instance=true --default-remote-prefix=YOUR_S3_BUCKET/BUCKET_PREFIX --retries 3

The Snakemake command is used with the --tibanna flag to send launch requests to the Step Functions state machine in order to provision EC2 instances and run DAG tasks.

We recommend deploying the solution with AWS Serverless Application Model or the AWS Cloud Development Kit, both of which launch AWS CloudFormation.

Logging and troubleshooting

With this solution, each launch will automatically capture and retain start logs in a centralized location in Amazon CloudWatch Logs for tracing and auditing.

If there are issues during the launch of the Tibanna Step Function state machine, such as Amazon EC2 capacity limits, logs will be available in the S3 bucket that was specified during the Tibanna Unicorn creation process. There will be a file available in the format of <EXECUTION_ID>.log inside of the S3 bucket. This information is easily accessible via the command line interface. Use the following command to display specific log results or error messages.

tibanna log -j <EXECUTION_ID> -T 

Retries and EC2 Spot Instances

We advise using Amazon EC2 Spot Instances, if possible, for additional cost savings. This option is available in the --tibanna-config arguments with the setting spot_instance=true.

This is optional, and you need to create retry logic in the event a Spot Instance gets reclaimed. You can include --retries=3 in your Tibanna launch command. This would ensure all rules are retried three times. You can also specify the number of retries for individual rules when defining the Snakemake DAG definition; for example:

rule a:
    output:
        "test.txt"
    retries: 3
    shell:
        "curl https://some.unreliable.server/test.txt > {output}"

If the EC2 Spot Instance capacity limit is reached, you can automatically switch to using EC2 On-Demand Instances instead. Add the behavior_on_capacity_limit argument and set retry_without_spot=true.

Adding services

The presented solution can be adapted to use other compute services supported by Snakemake. These include Amazon Elastic Kubernetes Service and AWS ParallelCluster with Slurm Workload Manager plus Amazon FSx for Lustre volumes attached to the head node and cluster nodes.

To initiate jobs on ParallelCluster, install the AWS Systems Manager agent on the head node. This is the launching point into the cluster and used for submitting jobs to the initiation queue. Systems Manager is a secure way to remotely invoke commands on an EC2 instance without the need for SSH access. You can restrict access to your EC2 instance through IAM policies.
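
As a sketch, submitting a job to the head node’s queue through Systems Manager could look like this; the instance ID and job script path are placeholder assumptions.

import boto3

ssm = boto3.client("ssm")

HEAD_NODE_ID = "i-0123456789abcdef0"  # assumed head node instance ID

# Submit a job to the Slurm queue without opening an SSH session
response = ssm.send_command(
    InstanceIds=[HEAD_NODE_ID],
    DocumentName="AWS-RunShellScript",  # AWS-managed document for shell commands
    Parameters={"commands": [
        "sbatch /shared/jobs/run_snakemake.sh",  # assumed job script path
    ]},
)
command_id = response["Command"]["CommandId"]  # can be polled for status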

Conclusion

In this blog post, we demonstrated how life-science research teams can simplify the launch of Snakemake using AWS. We used Snakefiles and Tibanna to orchestrate workflow steps. Snakefiles provide an exception from the general design pattern and an alternative to Amazon States Language. File uploads to Amazon S3 served as our launching point for workflow initiations.

Stay tuned for Part 3 of this series, in which we create a job manager that administrates multiple workflows.

Related information

Email delta cost usage report in a multi-account organization using AWS Lambda

Post Syndicated from Ashutosh Dubey original https://aws.amazon.com/blogs/architecture/email-delta-cost-usage-report-in-a-multi-account-organization-using-aws-lambda/

Overview of solution

AWS Organizations gives customers the ability to consolidate their billing across accounts. This reduces billing complexity and centralizes cost reporting to a single account. These reports and cost information are available only to users with billing access to the primary AWS account.

In many cases, there are members of senior leadership or finance decision makers who don’t have access to AWS accounts, and therefore depend on individuals or additional custom processes to share billing information. This task becomes especially complicated when a complex account organization structure is in place.

In such cases, you can email cost reports periodically and automatically to these groups or individuals using AWS Lambda. In this blog post, you’ll learn how to send automated emails for AWS billing usage and consumption drifts from previous days.

Solution architecture

Account structure and architecture diagram

Figure 1. Account structure and architecture diagram

AWS provides the Cost Explorer API to enable you to programmatically query data for cost and usage of AWS services. This solution uses a Lambda function to query aggregated data from the API, format that data and send it to a defined list of recipients.

  1. Amazon EventBridge (Amazon CloudWatch Events) is configured to invoke the Lambda function at a specific time.
  2. The function uses the AWS Cost Explorer API to fetch the cost details for each account.
  3. The Lambda function calculates the change in cost over time and formats the information to be sent in an email (a sketch of these API calls follows this list).
  4. The formatted information is passed to Amazon Simple Email Service (Amazon SES).
  5. The report is emailed to the recipients configured in the environment variables of the function.
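
The cost lookup and delta calculation at the heart of the function might resemble this sketch; the account ID and dates are illustrative, and the real function iterates over all configured accounts.

import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

def daily_cost(account_id, start, end):
    # UnblendedCost for one linked account; the End date is exclusive
    result = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},  # dates as YYYY-MM-DD strings
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": [account_id]}},
    )
    return float(result["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])

today = daily_cost("111122223333", "2022-04-12", "2022-04-13")
yesterday = daily_cost("111122223333", "2022-04-11", "2022-04-12")
change_pct = (today - yesterday) / yesterday * 100 if yesterday else 0.0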

Prerequisites

For this walkthrough, you should have the following prerequisites:

Walkthrough

  • Download the AWS CloudFormation template from this link: AWS CloudFormation template
  • Once downloaded, open the template in your favorite text editor
  • Update account-specific variables in the template. You need to update the tuple, dictionary, display list, and display list monthly sections of the script for all the accounts which you want to appear in the daily report email. Refer to Figure 2 for an example of some dummy account IDs and email IDs.
A screenshot showing account IDs in AWS Lambda

Figure 2. Account IDs in AWS Lambda

  • Optionally, locate “def send_report_email” in the template. The subject variable controls the subject line of the email. This can be modified to something meaningful to the recipients.

After these changes are made according to your requirements, you can deploy the CloudFormation template:

  1. Log in to the Cloud Formation console.
  2. Choose Create Stack. From the dropdown, choose With new resources (standard).
  3. On the next screen under Specify Template, choose Upload a template file.
  4. Click Choose file. Choose the local template you modified earlier, then choose Next.
  5. Fill out the parameter fields with valid email addresses. For ScheduleExpression, use a valid cron expression for when you would like the report sent. Choose Next.
    Here is an example for a cron schedule:  18 11 * * ? *
    (This example cron expression sets the schedule to send every day at 11:18 UTC time.)
    This creates the Lambda function and needed AWS Identity and Access Management (AWS IAM) roles.

You will now need to make a few modifications to the created resources.

  1. Log in to the IAM console.
  2. Choose Roles.
  3. Locate the role created by the CloudFormation template called “daily-services-usage-lambdarole”.
  4. Under the Permissions tab, choose Add Permissions. From the dropdown, choose Attach Policy.
  5. In the search bar, search for “Billing”.
  6. Select the check box next to the AWS Managed Billing Policy and then choose Attach Policy.
  7. Log in to the AWS Lambda console.
  8. Choose the DailyServicesUsage function.
  9. Choose the Configuration tab.
  10. In the options that appear, choose General Configuration.
  11. Choose the Edit button.
  12. Change the timeout option to 10 seconds, because the default of three seconds may not be enough time to run the function to retrieve the cost details from multiple accounts.
  13. Choose Save.
  14. Still under the General Configuration tab, choose the Permissions option and validate the execution role.
    The edited IAM execution role should display the resource details to which access has been granted. Figure 3 shows that the allow actions to aws-portal for Billing, Usage, PaymentMethods, and ViewBilling are enabled. If the Resource summary does not show these permissions, the IAM role is likely not correct. Go back to the IAM console and confirm that you updated the correct role with billing access.
A screenshot of the AWS Lambda console showing Lambda role permissions

Figure 3. Lambda role permissions

  • Optionally, in the left navigation pane, choose Environment variables. Here you will see the email recipients you configured in the Cloud Formation template. If changes are needed to the list in the future, you can add or remove recipients by editing the environment variables. You can skip this step if you’re satisfied with the parameters you specified earlier.

Next, you will create a few Amazon SES identities for the email addresses that were provided as environment variables for the sender and recipients:

  1. Log in to the SES console.
  2. Under Configuration, choose Verified Identities.
  3. Choose Create Identity.
  4. Choose the identity type Email Address, fill out the Email address field with the sender email, and choose Create Identify.
  5. Repeat this step for all receiver emails.

The email IDs included will receive an email for the confirmation. Once confirmed, the status shows as verified in the Verified Identities tab of the SES console. The verified email IDs will start receiving the email with the cost reports.
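
Once the identities are verified, the function can hand the formatted report to Amazon SES with a call along these lines; the addresses and body are placeholders.

import boto3

ses = boto3.client("ses")

ses.send_email(
    Source="sender@example.com",  # must be a verified identity
    Destination={"ToAddresses": ["receiver@example.com"]},  # verified recipients
    Message={
        "Subject": {"Data": "AWS Daily Cost Report for Selected Accounts"},
        "Body": {"Html": {"Data": "<html>…formatted report table…</html>"}},
    },
)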

Amazon EventBridge (CloudWatch) event configuration

To configure events:

    1. Go to the Amazon EventBridge console.
    2. Choose Create rule.
    3. Fill out the rule details with meaningful descriptions.
    4. Under Rule Type, choose Schedule.
    5. Schedule the cron pattern from when you would like the report to run.

Figure 4 shows that the highlighted rule is configured to run the Lambda function every 24 hours.

A screenshot of the Amazon EventBridge console showing an EventBridge rule

Figure 4. EventBridge rule

An example AWS Daily Cost Report email

From: [email protected] (the email ID mentioned as “sender”)
Sent: Tuesday, April 12, 2022 1:43 PM
To: [email protected] (the email ID mentioned as “receiver”)
Subject: AWS Daily Cost Report for Selected Accounts (the subject of the email as set in the Lambda function)

Figure 5 shows the first part of the cost report. It provides the cost summary and the delta of the cost variance percentage compared to the previous day. You can also see the trend based on the last seven days in the same table. This helps in understanding patterns around cost and usage.

This summary is broken down per account, and then totaled, in order to help you understand the accounts contributing to the cost changes. The daily change percentages are also color coded to highlight significant variations.

AWS Daily Cost Report email body part 1

Figure 5. AWS Daily Cost Report email body part 1

The second part of the report in the email provides the service-level cost breakdown for each account configured in the Account dictionary section of the function. This is a further drilldown report; you will get one for each configured account.

AWS Daily Cost Report email body part 2

Figure 6. AWS Daily Cost Report email body part 2

Cleanup

  • Delete the Amazon CloudFormation stack.
  • Delete the identities on Amazon SES.
  • Delete the Amazon EventBridge (CloudWatch) event rule.

Conclusion

This blog demonstrates how you can automatically and seamlessly share your AWS accounts’ billing and change information with your leadership and finance teams daily (or on any schedule you choose). While the solution was designed for accounts that are part of an organization in AWS Organizations, it could also be deployed in a standalone account without making any changes. This allows information sharing without the need to provide account access to the recipients, and avoids any dependency on other manual processes. As a next step, you could also store these reports in Amazon Simple Storage Service (Amazon S3), generate a historical trend summary for consumption, and continue making informed decisions.

Additional reading

Amazon CloudWatch Insights for Amazon EKS on EC2 using AWS Distro for OpenTelemetry Helm charts

Post Syndicated from Vimala Pydi original https://aws.amazon.com/blogs/architecture/amazon-cloudwatch-insights-for-amazon-eks-on-ec2-using-aws-distro-for-opentelemetry-helm-charts/

This blog provides a simplified three-step solution to collect metrics and logs from an Amazon Elastic Kubernetes Service (Amazon EKS) cluster on Amazon Elastic Compute Cloud (Amazon EC2) using the AWS Distro for OpenTelemetry (ADOT) Helm charts repository and send them to Amazon CloudWatch Logs and Amazon CloudWatch Container Insights. The ADOT Helm charts repository contains Helm charts that provide easy mechanisms to set up the ADOT Collector and other collection agents like fluentbit to collect telemetry data such as metrics, logs, and traces and send it to AWS monitoring services.

Amazon EKS is a managed Kubernetes service that makes it easy for organizations to run Kubernetes on AWS Cloud and on premises. Organizations use Amazon EKS to automatically manage the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and performing other key tasks. ADOT is a secure, production-ready, AWS-supported distribution of the OpenTelemetry project. Applications can set up ADOT Collector and other collector agents only once to send correlated metrics and traces to multiple AWS and Partner monitoring solutions. Fluent Bit is an open-source log processor and forwarder that you can use to collect data such as metrics and logs from different sources. Helm deploys packaged applications to Kubernetes and structures them into Helm charts.

Solution overview

A high-level architecture diagram depicted in Figure 1 shows a simple solution for collecting metrics and logs to send to Amazon CloudWatch Container Insights by installing an ADOT Helm chart on your existing or new Amazon EKS cluster.

Here are the steps to set up an ADOT and fluentbit collector:

  1. Set up your environment and install the necessary tools to connect to an existing or newly created Amazon EKS cluster.
  2. Configure AWS Identity and Access Management (IAM) roles for service accounts and install the Helm charts for ADOT, enabling fluentbit.
  3. Monitor logs, metrics, and traces from Amazon CloudWatch Logs and Container Insights.

Architecture diagram for Helm chart installation of ADOT and fluentbit to an existing Amazon EKS cluster

Figure 1. Architecture diagram for Helm chart installation of ADOT and fluentbit to an existing Amazon EKS cluster

Prerequisites

  • Existing AWS account with access to AWS Management Console
  • Intermediate-level knowledge and understanding of Amazon EKS
  • An existing or new Amazon EKS cluster

Install the tools

In this blog, AWS Cloud9 is used as an environment to connect to the Amazon EKS cluster and install Helm charts. If you choose to use AWS Cloud9, follow the step-by-step instructions provided in Creating an EC2 Environment. Refer to Getting started with Amazon EKS for additional instructions to install eksctl, create EKS clusters, and set up required IAM permissions for connecting to an EKS cluster.

  1. Log in to your Amazon EKS cluster and inspect the cluster. Select an EKS cluster in the AWS Management Console. On the Resources tab, check the DaemonSets, as in Figure 2a.

    EKS cluster DaemonSets

    Figure 2a. EKS cluster DaemonSets

  2. Open Amazon CloudWatch and inspect the Log groups and Amazon CloudWatch Container Insights. Note that the Log groups and Amazon CloudWatch Container Insights in Figure 2b do not show any EKS cluster-specific logs.

    Container Insights before ADOT and fluentbit collector installation

    Figure 2b. Container Insights before ADOT and fluentbit collector installation

Install Helm and configure IAM roles

  1. Run the following command to install Helm, verify the version, and configure Bash completion for the Helm command:
    # Download and run the Helm 3 install script (-sSL: silent, show errors, follow redirects)
    curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
    helm version --short
    
    # Configure Bash completion for the helm command
    helm completion bash >> ~/.bash_completion
    . /etc/profile.d/bash_completion.sh
    . ~/.bash_completion
    source <(helm completion bash)
  2. Set up IAM roles for service accounts.
    Replace XXX in the following commands with your EKS Cluster name.

    eksctl create iamserviceaccount \
    --name fluent-bit \
    --role-name EKS-ADOT-CWCI-Helm-Chart-Role-CW \
    --namespace amazon-cloudwatch \
    --cluster XXX \
    --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
    --role-only \
    --approve
    
    eksctl create iamserviceaccount \
    --name adot-collector-sa \
    --role-name EKS-ADOT-CWCI-Helm-Chart-Role-METRICS \
    --namespace amazon-metrics \
    --cluster XXX \
    --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
    --role-only \
    --approve
    
  3. Deploy the ADOT Helm chart.
    Replace XXX in the following code with your EKS Cluster name.

    CWCI_ADOT_HELM_ROLE_ARN_CW=$(aws iam get-role --role-name EKS-ADOT-CWCI-Helm-Chart-Role-CW | jq .Role.Arn -r)
    CWCI_ADOT_HELM_ROLE_ARN_METRICS=$(aws iam get-role --role-name EKS-ADOT-CWCI-Helm-Chart-Role-METRICS | jq .Role.Arn -r)
    helm repo add adot-helm-repo https://aws-observability.github.io/aws-otel-helm-charts
    helm install adot-release adot-helm-repo/adot-exporter-for-eks-on-ec2  \
    --set clusterName=XXX --set awsRegion=us-east-1 --set fluentbit.enabled=true \
    --set adotCollector.daemonSet.service.metrics.receivers={awscontainerinsightreceiver} \
    --set adotCollector.daemonSet.service.metrics.exporters={awsemf} \
    --set adotCollector.daemonSet.cwexporters.logStreamName=EKSNode
    
  4. Run the following commands to validate the successful deployment.
    • Verify that two new namespaces have been created.
      kubectl get ns
      The result should be:

      $ kubectl get ns
      NAME                STATUS           AGE
      amazon-cloudwatch   Active           2d20h
      amazon-metrics      Active           2d20h
    • Verify that fluentbit pods were deployed as part of the ADOT Helm chart under the amazon-cloudwatch namespace.
      kubectl get all -n amazon-cloudwatch
      The result should be:

      kubectl get all -n amazon-cloudwatch
      NAME                   READY   STATUS    RESTARTS   AGE
      pod/fluent-bit-9lrnt   1/1     Running   0          2d20h
      pod/fluent-bit-h9lvt   1/1     Running   0          2d20h
      pod/fluent-bit-nbqjm   1/1     Running   0          2d20h
      
      NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    • Verify the adot-collector pods under the amazon-metrics namespace.
      kubectl get all -n amazon-metrics
      The result should be:

      $ kubectl get all -n amazon-metrics
      NAME                                 READY   STATUS    RESTARTS   AGE
      pod/adot-collector-daemonset-6qcsd   1/1     Running   0          2d20h
      pod/adot-collector-daemonset-f92fr   1/1     Running   0          2d20h
      pod/adot-collector-daemonset-gmhbx   1/1     Running   0          2d20h
      
      NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
      daemonset.apps/adot-collector-daemonset   3         3         3       3            3           <none>          2d20h
  5. Validate the installation through the Amazon EKS cluster.
    Go to the Amazon EKS cluster and select the Resources tab. Under Workloads, select DaemonSets, and find the fluent-bit and adot-collector-daemonsets as demonstrated in Figure 3.

    DaemonSet under Amazon EKS cluster resources

    Figure 3. DaemonSet under Amazon EKS cluster resources

Monitor logs, metrics, and traces

Monitor CloudWatch Logs and CloudWatch Container Insights.

  • In the Logs section, choose Log groups to view Amazon EKS cluster log groups with a prefix of /aws/containerinsights, as in Figure 4a. (A CLI alternative is shown after this list.)

    EKS cluster log groups

    Figure 4a. EKS cluster log groups

  • In the Insights section, choose Container Insights to view all the resources within your Amazon EKS cluster, as in Figure 4b.

    EKS cluster's Container Insights resources

    Figure 4b. EKS cluster’s Container Insights resources

  • On the Container Insights page, select Container map from the dropdown to check the container map for Amazon EKS clusters, as demonstrated in Figure 4c.

    EKS cluster's Container Insights container map

    Figure 4c. EKS cluster’s Container Insights container map

  • On the Container Insights page, select Performance monitoring from the dropdown to view various performance metrics for the Amazon EKS cluster, as demonstrated in Figure 4d.

    EKS cluster's Container Insights performance monitoring

    Figure 4d. EKS cluster’s Container Insights performance monitoring
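As a CLI alternative to the first step above, you can list the Container Insights log groups directly (a sketch, assuming your credentials point at the cluster’s account and Region):

aws logs describe-log-groups \
  --log-group-name-prefix /aws/containerinsights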

Cleanup

If you are no longer using the resources discussed in this blog, remove them to avoid incurring charges. After you finish setting up the ADOT and fluentbit collectors to send logs and metrics to Amazon CloudWatch Logs and Container Insights, clean up by uninstalling the ADOT Helm chart, deleting the IAM roles created for the services, deleting the CloudWatch log groups, and deleting the Container Insights resources.

Conclusion

In this blog we walked through a simple three-step solution to set up Amazon EKS cluster logs and Container Insights using Helm charts. The Helm chart installs ADOT and fluentbit as DaemonSets in the existing EKS cluster to collect and forward logs, metrics, and traces to Amazon CloudWatch Logs and Container Insights. Amazon CloudWatch Container Insights then provides resource insights, performance monitoring, and a container map of all the resources within the Amazon EKS cluster.

Monitoring shared AWS Outposts rack capacity

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/monitoring-shared-aws-outposts-rack-capacity/

This post is written by Adam Imeson, Sr. Hybrid Edge Specialist Solutions Architect.

AWS Outposts rack is a fully-managed service that offers the same AWS infrastructure, APIs, tools, and a subset of AWS services to any data center, colocation space, or on-premises facility for a consistent hybrid experience. Outposts rack is ideal for workloads that require low latency, access to on-premises systems, local data processing, data residency, and migration of applications with local system interdependencies.

An Outpost is a pool of AWS compute and storage capacity deployed at a customer site. In an Outposts rack deployment, an Outpost may comprise one or more racks connected together at the site. It’s common for customers to order their Outpost in a dedicated account and then integrate with their multi-account organizational architecture by sharing the Outpost via AWS Resource Access Manager (AWS RAM). This post will explain how to set up cross-account Amazon CloudWatch metrics so that disparate stakeholders within your organization can effectively monitor your Outpost’s capacity to meet their specific needs.

Overview

The AWS account that you use to order an Outpost owns that Outpost. This includes all metrics and health events pertaining to that Outpost. Many customers must integrate Outposts into their multi-account environments, as discussed in the “Best practices: AWS Outposts in a multi-account AWS environment” posts (part 1 and part 2). This post will go into more detail on how to monitor Outposts in these environments.

The nuance here stems from the different ways to share access to AWS resources. AWS RAM allows infrastructure resources to be shared across multiple accounts. Then, the consumer accounts can launch resources on the infrastructure as though they owned it. AWS Identity and Access Management (IAM) allows customers to modify a given account’s permissions such that users in other accounts can make AWS API calls that affect the given account.

An Outpost provides infrastructure resources, so customers can share Outposts via AWS RAM. CloudWatch metrics about Outposts are data which customers retrieve using AWS API calls, so customers can share access to those metrics using IAM.

In a typical customer’s AWS Organization, there are two cases to consider. First, when the customer is sharing an Outpost to multiple development accounts, each account needs to view metrics relevant to the Outpost so that the development accounts can deploy and operate their applications.

Diagram depicting an Outpost that a customer has shared to three different accounts using RAM. The three different accounts each have a different application deployed in them.

Second, when the customer has several accounts that each own different Outposts, the customer’s centralized monitoring account needs to track metrics relevant to each of the Outposts.

Diagram depicting three accounts that each own a separate Outpost, with all three accounts sharing Outpost metrics to CloudWatch in the customer’s central monitoring account.

This post will explain strategies for both cases.

Customers must monitor the health of the Outpost’s connection to its regional control plane (the Outpost’s service link), as an Outpost is an extension of an AWS Availability Zone (AZ) and is designed to be connected to an AZ at all times. The health of the Outpost’s service link is a crucial variable when application owners are diagnosing disruptions to their application, and also when infrastructure owners are diagnosing disruptions to a site. Customers can monitor their service link’s status with the ConnectedStatus metric.

Customers also must monitor their Outposts’ current capacity. Outposts necessarily have a limited capacity footprint when compared to an AWS Region. Application owners must make informed decisions about capacity as they scale their apps over time or respond to occasional hardware failures. Infrastructure owners also must maintain a holistic view of capacity across all of the Outposts for which they are responsible so that they can plan for capacity expansion over time. Customers can monitor their Outposts’ capacity using the various capacity metrics that Outposts provide.

For an overview of how to set up a capacity dashboard and capacity-based CloudWatch alarms within a single account, see “Monitoring AWS Outposts capacity.” This post will expand on the single-account strategy by introducing cross-account capabilities. See also “Cross-Account Cross-Region Dashboards with Amazon CloudWatch.” These two posts provide practical walkthroughs for setting up the metric flows explained below.

Setting up Outposts metric permissions for your organization

This post assumes that you have multiple Outposts in different accounts that are all part of the same Organization. You’re sharing these Outposts into accounts that development teams use to deploy and operate their applications. You also have a centralized monitoring account where your infrastructure team tracks various metrics across all accounts. Your Organization might look something like this:

A base diagram depicting six AWS accounts with different names. Outpost Account 1 contains an Outpost. Outpost Account 2 contains a different Outpost. Monitoring Account contains Amazon CloudWatch. Accounts A through C contain Applications A through C respectively.

The first Outpost is shared to Accounts A and B, and the second Outpost is only shared to Account B. This is just an example of how a customer might set up their environment so that Application A can deploy on Outpost 1, and Application B can deploy on both Outpost 1 and 2.

The same base diagram of the six AWS accounts as before, with arrows added to depict AWS RAM resource shares. Outpost Account 1 shares its Outpost to Accounts A and B. Outpost Account 2 shares its Outpost to Account B.

To enable centralized monitoring, each account shares CloudWatch metrics with the central monitoring account as described in “Cross-Account Cross-Region Dashboards with Amazon CloudWatch.”

The same base diagram of the six AWS accounts, with arrows added to depict CloudWatch metrics being shared from all five of the other accounts to the Monitoring Account.

Now there are application accounts which can launch on the desired Outposts, and all of the accounts are sharing metrics with the central monitoring account. The team responsible for procuring and managing the Outposts can now set up dashboards in the central monitoring account in accordance with “Monitoring AWS Outposts capacity” to get a holistic view of capacity. This is valuable for capacity planning as applications naturally grow over time.

However, this may not be sufficient for operations. Consider that each application team needs to understand how much capacity is available on the Outpost that they’re using. This is crucial for teams operating highly available applications to maintain awareness of whether they still have N+1 capacity available on the Outpost to use in the event of a hardware failure. This is also important for planning expansions to the application ahead of time, as application teams have the best understanding of the future needs of their applications. Finally, application teams can use the metrics to track the operational health of the Outpost, which is crucial for root-causing any application disruptions.

You can implement this by sharing CloudWatch metrics from the Outpost accounts to the application accounts which are consuming the Outposts’ capacity, as shown in the following diagram.

The same base diagram of the six AWS accounts, with arrows depicting CloudWatch metrics being shared. Outpost Account 1 is sharing CloudWatch metrics to Accounts A and B. Outpost Account 2 is sharing CloudWatch metrics to Account B.

Walkthrough

Log in to your application account and navigate to the CloudWatch console. Open the Settings menu and choose Configure.

Screenshot of the CloudWatch Console’s Settings menu.

Scroll to the bottom. In the View cross-account cross-region section, choose Edit.

Screenshot of the Cross-account cross-region sub-menu in the CloudWatch console.

Choose your preferred account selection method from the three options and choose Save changes. I recommend the Custom account selector option, as it strikes a good balance between a simple setup and ease of use. If you choose this option, then input the Outpost owner account’s account ID and a human-readable name for the account. This name will appear in the drop-down when you’re using the CloudWatch console to view metrics from other accounts later.

Screenshot of the Cross-account cross-region sub-menu in the CloudWatch console, with the “Custom account selector” option selected and “123456789012 Outpost owner account” in the input field.

Your application account is now prepared to view metrics from the Outpost owner account. Now log in to the account that owns the Outpost and navigate to the CloudWatch console. You still need to share the Outpost’s metrics to the application account. Open the Settings page again, and choose Configure in the Cross-account cross-region section as before. This time, choose Share data in the Share your CloudWatch data section:

Screenshot of the Cross-account cross-region sub-menu in the CloudWatch console, with the “Share data” button circled in red in the “Share your CloudWatch data” section.

Choose Add account and input the application account’s account ID. Then scroll to the bottom of the page and choose Launch CloudFormation template.

Screenshot of the “Share your CloudWatch data sub-menu in the CloudWatch console. The “Specific accounts” option in the “Sharing” section is highlighted, and the sample account ID “234567890123” is typed into the input field.

The AWS CloudFormation template will create the CloudWatch-CrossAccountSharingRole. This role gives CloudWatch read access to the AWS account that you specified, the application account. You can view and modify this role using the IAM console if you want to. For example, you might adjust the role to allow read access to an entire Organizational Unit (OU).

Now, log back in to the application account and navigate to the CloudWatch console. Choose All metrics in the left-side menu. In the Metrics section, select the Outpost owner account from the drop-down.

Screenshot of the CloudWatch console’s “All metrics” sub-page. The account selection drop-down is circled in red in the “Metrics” subsection.

You can now view the metrics from the Outpost owner account and incorporate them into the dashboards in the application account. Now the application teams can track the Outposts’ ConnectedStatus metrics to be alerted on any disconnections from the region, and they can track the Outposts’ capacity metrics as well. It’s a best practice to alarm on Outpost capacity metrics once a consumption threshold defined by business needs has been breached.
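As a sketch of that practice, the following command creates an alarm that fires when the service link’s ConnectedStatus (which reports 1 while the Outpost is connected) averages below 1; the Outpost ID and SNS topic ARN are placeholders, and the same pattern applies to the capacity metrics in the AWS/Outposts namespace:

aws cloudwatch put-metric-alarm \
  --alarm-name outpost-service-link-down \
  --namespace AWS/Outposts \
  --metric-name ConnectedStatus \
  --dimensions Name=OutpostId,Value=op-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:outposts-alerts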

Conclusion

Outposts rack allows customers to deploy AWS infrastructure into virtually any data center, colocation space, or on-premises facility. Outposts are tied to the AWS account that ordered them, and customers can share Outposts among AWS accounts within the same Organization. When multiple teams within a customer’s Organization are interacting with the same Outpost, that introduces additional monitoring surface area for capacity and service health. This post explains how customers can accommodate their teams’ different needs by sharing Outposts metrics around their Organization along with their Outposts. As best practices, customers should share their Outposts capacity and ConnectedStatus metrics to teams who are running applications on Outposts. Customers’ operations teams should also work with their stakeholders to define a maximum capacity utilization threshold for a given Outpost and alarm on that threshold.

Protect Sensitive Data with Amazon CloudWatch Logs

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/protect-sensitive-data-with-amazon-cloudwatch-logs/

Today we are announcing Amazon CloudWatch Logs data protection, a new set of capabilities for Amazon CloudWatch Logs that leverage pattern matching and machine learning (ML) to detect and protect sensitive log data in transit.

While developers try to prevent logging sensitive information such as Social Security numbers, credit card details, email addresses, and passwords, sometimes it gets logged. Until today, customers relied on manual investigation or third-party solutions to detect and mitigate sensitive information from being logged. If sensitive data is not redacted during ingestion, it will be visible in plain text in the logs and in any downstream system that consumed those logs.

Enforcing prevention across the organization is challenging, which is why quick detection and prevention of access to sensitive data in the logs is important from a security and compliance perspective. Starting today, you can enable Amazon CloudWatch Logs data protection to detect and mask sensitive log data as it is ingested into CloudWatch Logs or as it is in transit.

Customers from all industries that want to take advantage of native data protection capabilities can benefit from this feature. But in particular, it is useful for industries under strict regulations that need to make sure that no personal information gets exposed. Also, customers building payment or authentication services where personal and sensitive information may be captured can use this new feature to detect and mask sensitive information as it’s logged.

Getting Started
You can enable a data protection policy for new or existing log groups from the AWS Management Console, AWS Command Line Interface (CLI), or AWS CloudFormation. From the console, select any log group and create a data protection policy in the Data protection tab.

Enable data protection policy

When you create the policy, you can specify the data you want to protect. Choose from over 100 managed data identifiers, which are a repository of common sensitive data patterns spanning financial, health, and personal information. This feature provides you with complete flexibility in choosing from a wide variety of data identifiers that are specific to your use cases or geographical region.

Configure data protection policy
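As an illustrative sketch, here is how a policy that audits and masks email addresses might be attached from the AWS CLI. The log group name is a placeholder, and this assumes the managed EmailAddress identifier and the 2021-06-01 policy schema; data protection policies pair an Audit statement with a Deidentify statement:

aws logs put-data-protection-policy \
  --log-group-identifier my-log-group \
  --policy-document '{
    "Name": "data-protection-policy",
    "Version": "2021-06-01",
    "Statement": [
      {"Sid": "audit-policy",
       "DataIdentifier": ["arn:aws:dataprotection::aws:data-identifier/EmailAddress"],
       "Operation": {"Audit": {"FindingsDestination": {}}}},
      {"Sid": "redact-policy",
       "DataIdentifier": ["arn:aws:dataprotection::aws:data-identifier/EmailAddress"],
       "Operation": {"Deidentify": {"MaskConfig": {}}}}
    ]
  }'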

You can also enable audit reports and send them to another log group, an Amazon Simple Storage Service (Amazon S3) bucket, or Amazon Kinesis Data Firehose. These reports contain a detailed log of data protection findings.

If you want to monitor and get notified when sensitive data is detected, you can create an alarm around the metric LogEventsWithFindings. This metric shows how many findings there are in a particular log group. This allows you to quickly understand which application is logging sensitive data.

When sensitive information is logged, CloudWatch Logs data protection will automatically mask it per your configured policy. This is designed so that none of the downstream services that consume these logs can see the unmasked data. From the AWS Management Console, AWS CLI, or any third party, the sensitive information in the logs will appear masked.

Example of log file with masked data

Only users with elevated privileges in their IAM policy (add logs:Unmask action in the user policy) can view unmasked data in CloudWatch Logs Insights, logs stream search, or via FilterLogEvents and GetLogEvents APIs.
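A minimal IAM policy statement granting that ability might look like the following; the log group ARN is a placeholder:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "logs:Unmask",
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:my-log-group:*"
    }
  ]
}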

You can use the following query in CloudWatch Logs Insights to unmask data for a particular log group:

fields @timestamp, @message, unmask(@message)
| sort @timestamp desc
| limit 20

Available Now
Data protection is available in US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Africa (Cape Town), Asia Pacific (Hong Kong), Asia Pacific (Jakarta), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Milan), Europe (Paris), Europe (Stockholm), Middle East (Bahrain), and South America (São Paulo) AWS Regions.

Amazon CloudWatch Logs data protection pricing is based on the amount of data that is scanned for masking. You can check the CloudWatch Logs pricing page to learn more about the pricing of this feature in your Region.

Learn more about data protection on the CloudWatch Logs User Guide.

Marcia

New – Amazon CloudWatch Cross-Account Observability

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-cross-account-observability/

Deploying applications using multiple AWS accounts is a good practice to establish security and billing boundaries between teams and reduce the impact of operational events. When you adopt a multi-account strategy, you have to analyze telemetry data that is scattered across several accounts. To give you the flexibility to monitor all the components of your applications from a centralized view, we are introducing today Amazon CloudWatch cross-account observability, a new capability to search, analyze, and correlate cross-account telemetry data stored in CloudWatch such as metrics, logs, and traces.

You can now set up a central monitoring AWS account and connect your other accounts as sources. Then, you can search, audit, and analyze logs across your applications to drill down into operational issues in a matter of seconds. You can discover and visualize metrics from many accounts in a single place and create alarms that evaluate metrics belonging to other accounts. You can start with an aggregated cross-account view of your application to visually identify the resources exhibiting errors and dive deep into correlated traces, metrics, and logs to find the root cause. This seamless cross-account data access and navigation helps reduce the time and effort required to troubleshoot issues.

Let’s see how this works in practice.

Configuring CloudWatch Cross-Account Observability
To enable cross-account observability, CloudWatch has introduced the concept of monitoring and source accounts:

  • A monitoring account is a central AWS account that can view and interact with observability data shared by other accounts.
  • A source account is an individual AWS account that shares observability data and resources with one or more monitoring accounts.

You can configure multiple monitoring accounts with the level of visibility you need. CloudWatch cross-account observability is also integrated with AWS Organizations. For example, I can have a monitoring account with wide access to all accounts in my organization for central security and operational teams and then configure other monitoring accounts with more restricted visibility across a business unit for individual service owners.
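Behind the console experience, these relationships are modeled by CloudWatch Observability Access Manager (OAM) as a sink in the monitoring account and a link in each source account. Here is a minimal CLI sketch; the names and sink ARN are placeholders, and the monitoring account must also attach a sink policy (aws oam put-sink-policy) authorizing the source accounts before links can be created:

# In the monitoring account: create the sink that receives shared telemetry
aws oam create-sink --name monitoring-account-sink

# In each source account: link to the sink and choose what to share
aws oam create-link \
  --label-template '$AccountName' \
  --resource-types AWS::CloudWatch::Metric AWS::Logs::LogGroup AWS::XRay::Trace \
  --sink-identifier arn:aws:oam:eu-west-1:111122223333:sink/EXAMPLE-sink-id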

First, I configure the monitoring account. In the CloudWatch console, I choose Settings in the navigation pane. In the Monitoring account configuration section, I choose Configure.

Console screenshot.

Now I can choose which telemetry data can be shared with the monitoring account: Logs, Metrics, and Traces. I leave all three enabled.

Console screenshot.

To list the source accounts that will share data with this monitoring account, I can use account IDs, organization IDs, or organization paths. I can use an organization ID to include all the accounts in the organization or an organization path to include all the accounts in a department or business unit. In my case, I have only one source account to link, so I enter the account ID.

Console screenshot.

When using the CloudWatch console in the monitoring account to search and display telemetry data, I see the account ID that shared that data. Because account IDs are not easy to remember, I can display a more descriptive “account label.” When configuring the label via the console, I can choose between the account name or the email address used to identify the account. When using an email address, I can also choose whether to include the domain. For example, if all the emails used to identify my accounts are using the same domain, I can use as labels the email addresses without that domain.

There is a quick reminder that cross-account observability only works in the selected Region. If I have resources in multiple Regions, I can configure cross-account observability in each Region. To complete the configuration of the monitoring account, I choose Configure.

Console screenshot.

The monitoring account is now enabled, and I choose Resources to link accounts to determine how to link my source accounts.

Console screenshot.

To link source accounts in an AWS organization, I can download an AWS CloudFormation template to be deployed in a CloudFormation delegated administration account.

To link individual accounts, I can either download a CloudFormation template to be deployed in each account or copy a URL that helps me use the console to set up the accounts. I copy the URL and paste it into another browser where I am signed in as the source account. Then, I can configure which telemetry data to share (logs, metrics, or traces). The Amazon Resource Name (ARN) of the monitoring account configuration is pre-filled because I copy-pasted the URL in the previous step. If I don’t use the URL, I can copy the ARN from the monitoring account and paste it here. I confirm the label used to identify my source account and choose Link.

In the Confirm monitoring account permission dialog, I type Confirm to complete the configuration of the source account.

Using CloudWatch Cross-Account Observability
To see how things work with cross-account observability, I deploy a simple cross-account application using two AWS Lambda functions, one in the source account (multi-account-function-a) and one in the monitoring account (multi-account-function-b). When triggered, the function in the source account publishes an event to an Amazon EventBridge event bus in the monitoring account. There, an EventBridge rule triggers the execution of the function in the monitoring account. This is a simplified setup using only two accounts. You’d probably have your workloads running in multiple source accounts.

Architectural diagram.

In the Lambda console, the two Lambda functions have Active tracing and Enhanced monitoring enabled. To collect telemetry data, I use the AWS Distro for OpenTelemetry (ADOT) Lambda layer. The Enhanced monitoring option turns on Amazon CloudWatch Lambda Insights to collect and aggregate Lambda function runtime performance metrics.

Console screenshot.

I prepare a test event in the Lambda console of the source account. Then, I choose Test and run the function a few times.

Console screenshot.

Now, I want to understand what the components of my application, running in different accounts, are doing. I start with logs and then move to metrics and traces.

In the CloudWatch console of the monitoring account, I choose Log groups in the Logs section of the navigation pane. There, I search for and find the log groups created by the two Lambda functions running in different AWS accounts. As expected, each log group shows the account ID and label originating the data. I select both log groups and choose View in Logs Insights.

Console screenshot.

I can now search and analyze logs from different AWS accounts using the CloudWatch Logs Insights query syntax. For example, I run a simple query to see the last twenty messages in the two log groups. I include the @log field to see the account ID that the log belongs to.
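With the log groups selected, the query itself is short; the following version adds @log so each result shows its owning account:

fields @timestamp, @log, @message
| sort @timestamp desc
| limit 20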

Console screenshot.

I can now also create Contributor Insights rules on cross-account log groups. This enables me, for example, to have a holistic view of what security events are happening across accounts or identify the most expensive Lambda requests in a serverless application running in multiple accounts.

Then, I choose All metrics in the Metrics section of the navigation pane. To see the Lambda function runtime performance metrics collected by CloudWatch Lambda Insights, I choose LambdaInsights and then function_name. There, I search for multi-account and memory to see the memory metrics. Again, I see the account IDs and labels that tell me that these metrics are coming from two different accounts. From here, I can just select the metrics I am interested in and create cross-account dashboards and alarms. With the metrics selected, I choose Add to dashboard in the Actions dropdown.

Console screenshot.

I create a new dashboard and choose the Stacked area widget type. Then, I choose Add to dashboard.

Console screenshot.

I do the same for the CPU and memory metrics (but using different widget types) to quickly create a cross-account dashboard where I can keep under control my multi-account setup. Well, there isn’t a lot of traffic yet but I am hopeful.

Console screenshot.

Finally, I choose Service map from the X-Ray traces section of the navigation pane to see the flow of my multi-account application. In the service map, the client triggers the Lambda function in the source account. Then, an event is sent to the other account to run the other Lambda function.

Console screenshot.

In the service map, I select the gear icon for the function running in the source account (multi-account-function-a) and then View traces to look at the individual traces. The traces contain data from multiple AWS accounts. I can search for traces coming from a specific account using a syntax such as:

service(id(account.id: "123412341234"))

Console screenshot.

The service map now stitches together telemetry from multiple accounts in a single place, delivering a consolidated view to monitor their cross-account applications. This helps me to pinpoint issues quickly and reduces resolution time.

Availability and Pricing
Amazon CloudWatch cross-account observability is available today in all commercial AWS Regions using the AWS Management Console, AWS Command Line Interface (CLI), and AWS SDKs. AWS CloudFormation support is coming in the next few days. Cross-account observability in CloudWatch comes with no extra cost for logs and metrics, and the first trace copy is free. See the Amazon CloudWatch pricing page for details.

Having a central point of view to monitor all the AWS accounts that you use gives you a better understanding of your overall activities and helps solve issues for applications that span multiple accounts.

Start using CloudWatch cross-account observability to monitor all your resources.

Danilo

Amazon CloudWatch Internet Monitor Preview – End-to-End Visibility into Internet Performance for your Applications

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/cloudwatch-internet-monitor-end-to-end-visibility-into-internet-performance-for-your-applications/

How many times have you had monitoring dashboards show you a normal situation, and at the same time, you have received customer tickets reporting your app is “slow” or unavailable to them? How much time did it take to diagnose these customer reports?

You told us one of your challenges when monitoring internet-facing applications is to gather data outside of AWS to build a realistic picture of how your application behaves for your customers connected to multiple and geographically distant internet providers. Capturing and monitoring data about internet traffic before it reaches your infrastructure is either difficult or very expensive.

I am happy to announce the public preview of Amazon CloudWatch Internet Monitor, a new capability of CloudWatch that gives visibility into how an internet issue might impact the performance and availability of your applications. It allows you to reduce the time it takes to diagnose internet issues from days to minutes.

Internet Monitor uses the connectivity data that we capture from our global networking footprint to calculate a baseline of performance and availability for internet traffic. This is the same data that we use at AWS to monitor our own internet uptime and availability. With Internet Monitor, you can gain awareness of problems that arise on the internet experienced by your end users in different geographic locations and networks.

There is no need to instrument your application code. You can enable the service in the CloudWatch section of the AWS Management Console and start to use it immediately.

Let’s See It in Action
Getting started with Internet Monitor is easy. Let’s imagine I want to monitor the network paths between my customers and my AWS resources. I open the AWS Management Console and navigate to CloudWatch. I select Internet Monitor on the left-side navigation menu. Then, I select Create monitor.

Internet Monitor - Create

On the Create monitor page, I enter a Monitor name, and I select Add resources to choose the resources to monitor. For this demo, I select the VPC and the CloudFront distribution hosting my customer-facing application.

Internet Monitor - Select resources

I have the opportunity to review my choices. Then, I select Create monitor.

Internet Monitor - Final screen
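For completeness, here is a hedged CLI sketch of the same step; this assumes the internetmonitor commands are available in your (preview-era) AWS CLI version, and the resource ARNs are placeholders for your own VPC and CloudFront distribution:

aws internetmonitor create-monitor \
  --monitor-name Monitor_example \
  --resources \
    arn:aws:ec2:us-east-1:123456789012:vpc/vpc-0123456789abcdef0 \
    arn:aws:cloudfront::123456789012:distribution/EDFDVBD6EXAMPLE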

From that moment on, Internet Monitor starts to collect data based on my application’s resource logs behind the scenes. There is no need for you to activate (or pay for) VPC Flow Logs, CloudFront logs, or other log types.

After a while, I receive customer complaints about our application being slow. I open Internet Monitor again, I select the monitor I created earlier (Monitor_example), and I immediately see that the application suffers from internet performance issues.

The Health scores graph shows you performance and availability information for your global traffic. AWS has substantial historical data about internet performance and availability for network traffic between geographic locations for different network providers and services. By applying statistical analysis to the data, we can detect when the performance and availability towards your application have dropped, compared to an estimated baseline that we’ve calculated. To make it easier to see those drops, we report that information to you in the form of an estimated performance score and an availability score.

Internet Monitor - Health score

I scroll a bit down the page. The Internet traffic overview map shows the overall event status across all monitored locations. I look at the details in the Health events table. It also highlights other events that are happening globally, sorted by total traffic impact. I notice that a performance issue in Las Vegas, Nevada, US, is affecting my application traffic the most.

Internet Monitor - Internet Traffic Overview

Now that I have identified the issue, I am curious about the historical data. Has it happened before?

I select the Historical Explorer tab to understand trends and see earlier data related to this location and network provider. I can view aggregated metrics such as performance score, availability score, bytes transferred, and round-trip time at p50, p90, and p95 percentiles, for a customized timeframe, up to 18 months in the past.

Internet Monitor - Historical data

I can see today’s incident is not the first one. This specific client location and network provider has had multiple issues in the past few months.

Internet Monitor - Historical data details

Now that I understand the context, I wonder what action I can take to mitigate the issue.

I switch to the Traffic insights tab. I see overall traffic data and top client locations that are being monitored based on total traffic (bytes). Apparently, Las Vegas, Nevada, US, is one of the top client locations.

Internet Monitor - Traffic insights 1

I select the graph to see traffic details for Las Vegas, Nevada, US. In the Lowest Time To First Byte (TTFB) column, I see AWS service and AWS Region setup recommendations for all of the top client location and network combinations. The Predicted Time To First Byte in the table shows the potential impact if I make the suggested architectural change.

In this example, Internet Monitor suggests having CloudFront distribute the traffic currently distributed by EC2 and to allow for some additional traffic to be served by EC2 instances in us-east-1 in addition to us-east-2.

Internet Monitor - Traffic insights 2

Available Today
Internet Monitor is available in public preview today in 20 AWS Regions:

  • In the Americas: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Canada (Central), South America (São Paulo).
  • In Asia and Pacific: Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo).
  • In Europe, Middle East, and Africa: Africa (Cape Town), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Milan), Europe (Paris), Europe (Stockholm), Middle East (Bahrain).

Note that AWS CloudFormation support is missing at the moment; it will be added soon.

There are no costs associated with the service during the preview period. Just keep in mind that Internet Monitor vends metrics and logs to CloudWatch; you will be charged for these additional CloudWatch logs and CloudWatch metrics.

Whether you work for a startup or a large enterprise, CloudWatch Internet Monitor helps you be proactive about your application performance and availability. Give it a try today!

— seb

AWS Week in Review – November 21, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-november-21-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and the News Blog team is getting ready for AWS re:Invent! Many of us will be there next week and it would be great to meet in person. If you’re coming, do you know about PeerTalk? It’s an onsite networking program for re:Invent attendees available through the AWS Events mobile app (which you can get on Google Play or Apple App Store) to help facilitate connections among the re:Invent community.

If you’re not coming to re:Invent, no worries, you can get a free online pass to watch keynotes and leadership sessions.

Last Week’s Launches
It was a busy week for our service teams! Here are the launches that got my attention:

AWS Region in Spain – The AWS Region in Aragón, Spain, is now open. The official name is Europe (Spain), and the API name is eu-south-2.

Amazon Athena – You can now apply AWS Lake Formation fine-grained access control policies with all table and file formats supported by Amazon Athena to centrally manage permissions and access data catalog resources in your Amazon Simple Storage Service (Amazon S3) data lake. With fine-grained access control, you can restrict access to data in query results using data filters to achieve column-level, row-level, and cell-level security.

Amazon EventBridge – With these additional filtering capabilities, you can now filter events by suffix, ignore case, and match if at least one condition is true. This makes it easier to write complex rules when building event-driven applications.

AWS Controllers for Kubernetes (ACK) – The ACK for Amazon Elastic Compute Cloud (Amazon EC2) is now generally available and lets you provision and manage EC2 networking resources, such as VPCs, security groups, and internet gateways, using the Kubernetes API. Also, the ACK for Amazon EMR on EKS is now generally available to allow you to declaratively define and manage EMR on EKS resources such as virtual clusters and job runs as Kubernetes custom resources. Learn more about ACK for Amazon EMR on EKS in this blog post.

Amazon HealthLake – New analytics capabilities make it easier to query, visualize, and build machine learning (ML) models. Now HealthLake transforms customer data into an analytics-ready format in near real time so that you can query it and use the resulting data to build visualizations or ML models. Also new is Amazon HealthLake Imaging (preview), a new HIPAA-eligible capability that enables you to easily store, access, and analyze medical images at any scale. More on HealthLake Imaging can be found in this blog post.

Amazon RDS – You can now transfer files between Amazon Relational Database Service (RDS) for Oracle and an Amazon Elastic File System (Amazon EFS) file system. You can use this integration to stage files like Oracle Data Pump export files when you import them. You can also use EFS to share a file system between an application and one or more RDS Oracle DB instances to address specific application needs.

Amazon ECS and Amazon EKS – We added centralized logging support for Windows containers to help you easily process and forward container logs to various AWS and third-party destinations such as Amazon CloudWatch, S3, Amazon Kinesis Data Firehose, Datadog, and Splunk. See these blog posts for how to use this new capability with ECS and with EKS.

AWS SAM CLI – You can now use the Serverless Application Model CLI to locally test and debug an AWS Lambda function defined in a Terraform application. You can see a walkthrough in this blog post.

AWS Lambda – Now supports Node.js 18 as both a managed runtime and a container base image, which you can learn more about in this blog post. Also check out this interesting article on why and how you should use AWS SDK for JavaScript V3 with Node.js 18. And last but not least, there is new tooling support to build and deploy native AOT compiled .NET 7 applications to AWS Lambda. With this tooling, you can enable faster application starts and benefit from reduced costs through the faster initialization times and lower memory consumption of native AOT applications. Learn more in this blog post.

AWS Step Functions – Now supports cross-account access for more than 220 AWS services to process data, automate IT and business processes, and build applications across multiple accounts. Learn more in this blog post.

AWS Fargate – Adds the ability to monitor the utilization of the ephemeral storage attached to an Amazon ECS task. You can track the storage utilization with Amazon CloudWatch Container Insights and the ECS Task Metadata endpoint.

AWS Proton – Now has a centralized dashboard for all resources deployed and managed by AWS Proton, which you can learn more about in this blog post. You can now also specify custom commands to provision infrastructure from templates. In this way, you can manage templates defined using the AWS Cloud Development Kit (AWS CDK) and other templating and provisioning tools. More on CDK support and AWS CodeBuild provisioning can be found in this blog post.

AWS IAM – You can now use more than one multi-factor authentication (MFA) device for root account users and IAM users in your AWS accounts. More information is available in this post.

Amazon ElastiCache – You can now use IAM authentication to access Redis clusters. With this new capability, IAM users and roles can be associated with ElastiCache for Redis users to manage their cluster access.

Amazon WorkSpaces – You can now use version 2.0 of the WorkSpaces Streaming Protocol (WSP) host agent that offers significant streaming quality and performance improvements, and you can learn more in this blog post. Also, with Amazon WorkSpaces Multi-Region Resilience, you can implement business continuity solutions that keep users online and productive with less than 30-minute recovery time objective (RTO) in another AWS Region during disruptive events. More on multi-region resilience is available in this post.

Amazon CloudWatch RUM – You can now send custom events (in addition to predefined events) for better troubleshooting and application specific monitoring. In this way, you can monitor specific functions of your application and troubleshoot end user impacting issues unique to the application components.

AWS AppSync – You can now define GraphQL API resolvers using JavaScript. You can also mix functions written in JavaScript and Velocity Template Language (VTL) inside a single pipeline resolver. To simplify local development of resolvers, AppSync released two new NPM libraries and a new API command. More info can be found in this blog post.

AWS SDK for SAP ABAP – This new SDK makes it easier for ABAP developers to modernize and transform SAP-based business processes and connect to AWS services natively using the SAP ABAP language. Learn more in this blog post.

AWS CloudFormation – CloudFormation can now send event notifications via Amazon EventBridge when you create, update, or delete a stack set.

AWS Console – With the new Applications widget on the Console home, you have one-click access to applications in AWS Systems Manager Application Manager and their resources, code, and related data. From Application Manager, you can view the resources that power your application and your costs using AWS Cost Explorer.

AWS Amplify – Expands Flutter support (developer preview) to Web and Desktop for the API, Analytics, and Storage use cases. You can now build cross-platform Flutter apps with Amplify that target iOS, Android, Web, and Desktop (macOS, Windows, Linux) using a single codebase. Learn more on Flutter Web and Desktop support for AWS Amplify in this post. Amplify Hosting now supports fully managed CI/CD deployments and hosting for server-side rendered (SSR) apps built using Next.js 12 and 13. Learn more in this blog post and see how to deploy a NextJS 13 app with the AWS CDK here.

Amazon SQS – With attribute-based access control (ABAC), you can define permissions based on tags attached to users and AWS resources. With this release, you can now use tags to configure access permissions and policies for SQS queues. More details can be found in this blog.

AWS Well-Architected Framework – The latest version of the Data Analytics Lens is now available. The Data Analytics Lens is a collection of design principles, best practices, and prescriptive guidance to help you run analytics on AWS.

AWS Organizations – You can now manage accounts, organizational units (OUs), and policies within your organization using CloudFormation templates.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more things you might have missed:

Introducing our final AWS Heroes of the year – As the end of 2022 approaches, we are recognizing individuals whose enthusiasm for knowledge-sharing has a real impact with the AWS community. Please meet them here!

The Distributed Computing Manifesto – Werner Vogels, VP & CTO at Amazon.com, shared the Distributed Computing Manifesto, a canonical document from the early days of Amazon that transformed the way we built architectures and highlights the challenges faced at the end of the 20th century.

AWS re:Post – To make this community more accessible globally, we expanded the user experience to support five additional languages. You can now interact with AWS re:Post also using Traditional Chinese, Simplified Chinese, French, Japanese, and Korean.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
As usual, there are many opportunities to meet:

AWS re:Invent – Our yearly event is next week from November 28 to December 2. If you can’t be there in person, get your free online pass to watch live the keynotes and the leadership sessions.

AWS Community Days – AWS Community Day events are community-led conferences to share and learn together. Join us in Sri Lanka (on December 6-7), Dubai, UAE (December 10), Pune, India (December 10), and Ahmedabad, India (December 17).

That’s all from me for this week. Next week we’ll focus on re:Invent, and then we’ll take a short break. We’ll be back with the next Week in Review on December 12!

Danilo

Introducing container, database, and queue utilization metrics for the Amazon MWAA environment

Post Syndicated from David Boyne original https://aws.amazon.com/blogs/compute/introducing-container-database-and-queue-utilization-metrics-for-the-amazon-mwaa-environment/

This post is written by Uma Ramadoss (Senior Specialist Solutions Architect), and Jeetendra Vaidya (Senior Solutions Architect).

Today, AWS is announcing the availability of container, database, and queue utilization metrics for Amazon Managed Workflows for Apache Airflow (Amazon MWAA). This is a new collection of metrics published by Amazon MWAA in addition to existing Apache Airflow metrics in Amazon CloudWatch. With these new metrics, you can better understand the performance of your Amazon MWAA environment, troubleshoot issues related to capacity, delays, and get insights on right-sizing your Amazon MWAA environment.

Previously, customers were limited to Apache Airflow metrics such as DAG processing parse times, pool running slots, and scheduler heartbeat to measure the performance of the Amazon MWAA environment. While these metrics are often effective in diagnosing Airflow behavior, they lack the ability to provide complete visibility into the utilization of the various Apache Airflow components in the Amazon MWAA environment. This could limit the ability for some customers to monitor the performance and health of the environment effectively.

Overview

Amazon MWAA is a managed service for Apache Airflow. There are a variety of deployment techniques with Apache Airflow. The Amazon MWAA deployment architecture of Apache Airflow is carefully chosen to allow customers to run workflows in production at scale.

Amazon MWAA has a distributed architecture with multiple schedulers, auto-scaled workers, and a load-balanced web server. They are deployed in their own Amazon Elastic Container Service (ECS) cluster using the AWS Fargate compute engine. An Amazon Simple Queue Service (SQS) queue is used to decouple Airflow workers and schedulers as part of the Celery Executor architecture. Amazon Aurora PostgreSQL-Compatible Edition is used as the Apache Airflow metadata database. From today, you can get complete visibility into the scheduler, worker, web server, database, and queue metrics.

In this post, you can learn about the new metrics published for Amazon MWAA environment, build a sample application with a pre-built workflow, and explore the metrics using CloudWatch dashboard.

Container, database, and queue utilization metrics

  1. In the CloudWatch console, in Metrics, select All metrics.
  2. From the metrics console that appears on the right, expand AWS namespaces and select the MWAA tile.

    MWAA metrics

  3. You can see a tile of dimensions, each corresponding to the container (cluster), database, and queue metrics.

    MWAA metrics drilldown

Cluster metrics

The base MWAA environment comes with three Amazon ECS clusters – a scheduler, one worker (BaseWorker), and a web server. Workers can be configured with minimum and maximum counts. When you configure a minimum of more than one worker, Amazon MWAA creates another ECS cluster (AdditionalWorker) to host workers 2 through n, where n is the maximum number of workers configured in your environment.
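
If you want to change that scaling range on an existing environment, the minimum and maximum worker counts can be updated through the Amazon MWAA API. Here is a minimal sketch with boto3; the environment name is a placeholder for your own:

    import boto3

    mwaa = boto3.client("mwaa")

    # Widen the worker scaling range of an existing environment;
    # "my-mwaa-env" is a placeholder for your environment name.
    mwaa.update_environment(
        Name="my-mwaa-env",
        MinWorkers=2,
        MaxWorkers=10,
    )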

When you select Cluster from the console, you can see the list of metrics for all the clusters. To learn more about the metrics, visit the Amazon ECS product documentation.

MWAA metrics list

CPU usage is the most important factor for schedulers due to DAG file processing. When you have many DAGs, CPU usage can be higher. You can improve performance by setting a higher min_file_process_interval. Similarly, you can apply the other techniques described on the Apache Airflow Scheduler page to fine-tune performance.
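
The same update_environment call shown earlier can carry Apache Airflow configuration overrides. A minimal sketch, again with a placeholder environment name:

    import boto3

    mwaa = boto3.client("mwaa")

    # Re-parse DAG files at most every 300 seconds to reduce scheduler
    # CPU usage; "my-mwaa-env" is a placeholder environment name.
    mwaa.update_environment(
        Name="my-mwaa-env",
        AirflowConfigurationOptions={
            "scheduler.min_file_process_interval": "300",
        },
    )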

Higher CPU or memory utilization in the worker can be due to moving large files or doing computation on the worker itself. This can be resolved by offloading the compute to purpose-built services such as Amazon ECS, Amazon EMR, and AWS Glue.

Database metrics

The Amazon Aurora DB clusters used by Amazon MWAA are provisioned with a primary DB instance and a read replica to support read operations. Amazon MWAA publishes database metrics for both the READER and WRITER instances. When you select the Database tile, you can view the list of metrics available for the database cluster.

Database metrics

Amazon MWAA uses connection pooling, so the database connections from the scheduler, workers, and web servers are taken from a connection pool. If you have many DAGs scheduled to start at the same time, they can overload the scheduler and open database connections at a high frequency. This can be minimized by staggering the DAG schedules, as sketched below.
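
As a sketch of that idea, the two hypothetical DAGs below run hourly but offset by 15 minutes, so their scheduler work and database connections do not spike at the same moment (EmptyOperator requires Apache Airflow 2.3+; use DummyOperator on older versions):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    # Two hypothetical DAGs with staggered hourly schedules: one fires
    # on the hour, the other at 15 minutes past.
    with DAG("ingest_dag", start_date=datetime(2022, 1, 1),
             schedule_interval="0 * * * *", catchup=False):
        EmptyOperator(task_id="start")

    with DAG("report_dag", start_date=datetime(2022, 1, 1),
             schedule_interval="15 * * * *", catchup=False):
        EmptyOperator(task_id="start")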

SQS metrics

An SQS queue helps decouple the scheduler and workers so they can scale independently. When workers read messages, the messages are considered in flight and are not available to other workers. Messages become available to other workers to read if they are not deleted before the 12-hour visibility timeout expires. Amazon MWAA publishes the in-flight message count (RunningTasks), the count of messages available for reading (QueuedTasks), and the approximate age of the earliest non-deleted message (ApproximateAgeOfOldestTask).
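
If you prefer the API to the console, the same queue metrics can be discovered and queried with boto3. The AWS/MWAA namespace below is an assumption based on the console path shown earlier (AWS namespaces > MWAA), so it is worth confirming the metric names and dimensions that list_metrics actually returns for your environment:

    from datetime import datetime, timedelta

    import boto3

    cw = boto3.client("cloudwatch")

    # Discover the queue metric exactly as CloudWatch reports it,
    # including its dimensions, rather than hard-coding them.
    queue_metrics = [
        metric
        for page in cw.get_paginator("list_metrics").paginate(Namespace="AWS/MWAA")
        for metric in page["Metrics"]
        if metric["MetricName"] == "ApproximateAgeOfOldestTask"
    ]

    # Pull the maximum age of the oldest task message over the last hour.
    for metric in queue_metrics:
        stats = cw.get_metric_statistics(
            Namespace="AWS/MWAA",
            MetricName=metric["MetricName"],
            Dimensions=metric["Dimensions"],
            StartTime=datetime.utcnow() - timedelta(hours=1),
            EndTime=datetime.utcnow(),
            Period=300,
            Statistics=["Maximum"],
        )
        print(metric["Dimensions"], stats["Datapoints"])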


Getting started with container, database, and queue utilization metrics for Amazon MWAA

The following sample project explores some key metrics using an Amazon CloudWatch dashboard to help you find the number of workers running in your environment at any given moment.

The sample project deploys the following resources:

  • An Amazon Virtual Private Cloud (Amazon VPC).
  • An Amazon MWAA environment of size small, with a minimum of 2 and a maximum of 10 workers.
  • A sample DAG that fetches NOAA Global Historical Climatology Network Daily (GHCN-D) data, uses an AWS Glue crawler to create tables, and uses an AWS Glue job to produce an output dataset in Apache Parquet format containing precipitation readings for the US between 2010 and 2022.
  • An Amazon MWAA execution role.
  • Two Amazon S3 buckets – one for Amazon MWAA DAGs, and one for AWS Glue job scripts and weather data.
  • An AWSGlueServiceRole to be used by the AWS Glue crawler and AWS Glue job.

Prerequisites

There are a few tools required to deploy the sample application. Ensure that you have each of the following in your working environment:

Setting up the Amazon MWAA environment and associated resources

  1. From your local machine, clone the project from the GitHub repository.

    git clone https://github.com/aws-samples/amazon-mwaa-examples

  2. Navigate to the mwaa_utilization_cw_metric directory.

    cd usecases/mwaa_utilization_cw_metric

  3. Run the makefile.

    make deploy

  4. The makefile runs the Terraform template from the infra/terraform directory. While the template is being applied, you are prompted to confirm that you want to perform these actions.

    MWAA utilization terminal

This provisions the resources and copies the necessary files and variables for the DAG to run. This process can take approximately 30 minutes to complete.

Generating metric data and exploring the metrics

  1. Log in to your AWS account through the AWS Management Console.
  2. In the Amazon MWAA environment console, you can see your environment with the Airflow UI link on the right of the console.

    MWAA environment console

  3. Select the Open Airflow UI link. This loads the Apache Airflow UI.

    Apache Airflow UI

  4. From the Apache Airflow UI, enable the DAG using the Pause/Unpause DAG toggle button and run the DAG using the Trigger DAG link.
  5. You can see the Tree view of the DAG run with the tasks running.
  6. Navigate to the Amazon CloudWatch dashboard in another browser tab. You can see a dashboard by the name MWAA_Metric_Environment_env_health_metric_dashboard.
  7. Access the dashboard to view the different key metrics across the cluster, database, and queue.

    MWAA dashboard

  8. After the DAG run is complete, you can look into the dashboard for worker count metrics. The worker count started at 2 and increased to 4.

When you trigger the DAG, it runs 13 tasks in parallel to fetch weather data from 2010 to 2022. With two small-size workers, the environment can run 10 parallel tasks; the remaining tasks wait for either the running tasks to complete or automatic scaling to start. As the tasks take more than a few minutes to finish, Amazon MWAA automatic scaling adds additional workers to handle the workload. The worker count graph now plots higher, with the AdditionalWorker count increased from 1 to 3.

Cleanup

To delete the sample application infrastructure, use the following command from the usecases/mwaa_utilization_cw_metric directory.

make undeploy

Conclusion

This post introduces the new Amazon MWAA container, database, and queue utilization metrics. The example shows the key metrics and how you can use them to answer a common question: how many Amazon MWAA workers are running. These metrics are available today, at no additional cost, for all versions supported by Amazon MWAA.

Start using this feature in your account to monitor the health and performance of your Amazon MWAA environment, troubleshoot issues related to capacity and delays, and get insights into right-sizing the environment.

Build your own CloudWatch dashboard using the metrics data JSON and Airflow metrics. To deploy more solutions in Amazon MWAA, explore the Amazon MWAA samples GitHub repo.

For more serverless learning resources, visit Serverless Land.

Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

AWS Week In Review — September 26, 2022

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-week-in-review-september-26-2022/

It looks like my travel schedule is coupled with this Week In Review series of blog posts. This week, I am traveling to Fort-de-France in the French Caribbean islands to meet our customers and partners. I enjoy the travel time when I am offline. It gives me the opportunity to reflect on the past or plan for the future.

Last Week’s Launches
Here are some of the launches that caught my eye last week:

Amazon SageMaker Autopilot – SageMaker Autopilot has added a new ensemble training mode powered by AutoGluon that is 8X faster than the current hyperparameter optimization mode and supports a wide range of algorithms, including LightGBM, CatBoost, XGBoost, Random Forest, Extra Trees, linear models, and neural networks based on PyTorch and FastAI.

AWS Outposts and Amazon EKS – You can now deploy both the worker nodes and the Kubernetes control plane on an Outposts rack. This allows you to maximize your application availability in case of temporary network disconnection on premises. The Kubernetes control plane continues to manage the worker nodes, and no pod eviction happens when on-premises network connectivity is reestablished.

Amazon Corretto 19 – Corretto is a no-cost, multiplatform, production-ready distribution of OpenJDK. Corretto is distributed by Amazon under an open source license. This version supports the latest OpenJDK feature release and is available on Linux, Windows, and macOS. You can download Corretto 19 from our downloads page.

Amazon CloudWatch Evidently – Evidently is a fully managed service that makes it easier to introduce experiments and launches in your application code. Evidently adds support for Client-Side Evaluations (CSE) for AWS Lambda, powered by AWS AppConfig. Evidently CSE allows application developers to generate feature evaluations in single-digit milliseconds from within their own Lambda functions. Check the client-side evaluation documentation to learn more.
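
As a minimal sketch of the API shape involved, the boto3 call below evaluates a feature for a given user; the project, feature, and entity names are placeholders. With client-side evaluation enabled, the same evaluation can be served locally via AWS AppConfig rather than a remote call to the Evidently endpoint (see the documentation for the exact Lambda setup):

    import boto3

    evidently = boto3.client("evidently")

    # Placeholder project, feature, and entity names; the response says
    # which variation this user should see.
    response = evidently.evaluate_feature(
        project="my-website",
        feature="new-checkout-flow",
        entityId="user-1234",
    )
    print(response["variation"], response["value"])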

Amazon S3 on AWS Outposts – S3 on Outposts now supports object versioning. Versioning helps you to locally preserve, retrieve, and restore each version of every object stored in your buckets. Versioning objects makes it easier to recover from both unintended user actions and application failures.
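
Enabling versioning on an existing Outposts bucket is a single S3 Control API call. A minimal sketch, with a placeholder account ID and Outposts bucket ARN:

    import boto3

    s3control = boto3.client("s3control")

    # Placeholder account ID and Outposts bucket ARN; turns on
    # versioning so every object version is preserved on the Outpost.
    s3control.put_bucket_versioning(
        AccountId="111122223333",
        Bucket="arn:aws:s3-outposts:us-east-1:111122223333:outpost/op-0123456789abcdef0/bucket/example-bucket",
        VersioningConfiguration={"Status": "Enabled"},
    )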

Amazon Polly – Amazon Polly is a service that turns text into lifelike speech. This week, we announced the general availability of Hiujin, Amazon Polly’s first Cantonese-speaking neural text-to-speech (NTTS) voice. With this launch, the Amazon Polly portfolio now includes 96 voices across 34 languages and language variants.
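
A quick sketch of trying the new voice with boto3 (the text is a placeholder greeting):

    import boto3

    polly = boto3.client("polly")

    # Synthesize Cantonese speech with the neural Hiujin voice and
    # save the audio stream as a local MP3 file.
    response = polly.synthesize_speech(
        Engine="neural",
        VoiceId="Hiujin",
        OutputFormat="mp3",
        Text="你好，歡迎收聽。",
    )
    with open("hello.mp3", "wb") as f:
        f.write(response["AudioStream"].read())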

X in Y – We launched existing AWS services in additional Regions:

Other AWS News
Introducing the Smart City Competency program – The AWS Smart City Competency provides best-in-class partner recommendations to our customers and the broader market. With the AWS Smart City Competency, you can quickly and confidently identify AWS Partners to help you address Smart City focused challenges.

An update to IAM role trust policy behavior – This is potentially a breaking change. AWS Identity and Access Management (IAM) is changing an aspect of how role trust policy evaluation behaves when a role assumes itself. Previously, roles implicitly trusted themselves. AWS is changing role assumption behavior to always require self-referential role trust policy grants. This change improves consistency and visibility with regard to role behavior and privileges. This blog post shares the details and explains how to evaluate whether your roles are impacted by this change and what to modify. According to our data, only 0.0001 percent of roles are impacted. We notified the account owners by email.
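
If one of your roles does rely on assuming itself, the fix is an explicit self-referential statement in its trust policy. A hedged sketch with a placeholder account ID and role name; note that update_assume_role_policy replaces the whole trust policy, so merge this statement with your existing ones:

    import json

    import boto3

    iam = boto3.client("iam")

    # Placeholder account ID and role name; grants the role explicit
    # permission to assume itself. This call REPLACES the existing
    # trust policy, so include your current statements as well.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::111122223333:role/example-role"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

    iam.update_assume_role_policy(
        RoleName="example-role",
        PolicyDocument=json.dumps(trust_policy),
    )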

Amazon Music Unifies Music Queuing – The Amazon Music team published a blog post to explain how they created a unified music queue across devices. They used AWS AppSync and AWS Amplify to build a robust solution that scales to millions of music lovers.

Upcoming AWS Events
Check your calendar and sign up for an AWS event in your Region and language:

AWS re:Invent – Learn the latest from AWS and get energized by the community present in Las Vegas, Nevada. Registrations are open for re:Invent 2022 which will be held from Monday, November 28 to Friday, December 2.

AWS Summits – Come together to connect, collaborate, and learn about AWS. Registration is open for the following in-person AWS Summits: Bogotá (October 4) and Singapore (October 6).

Natural Language Processing (NLP) Summit – The AWS NLP Summit 2022 will host over 25 sessions focusing on the latest trends, hottest research, and innovative applications leveraging NLP capabilities on AWS. It is happening at our UK headquarters in London, October 5–6, and you can register now.

AWS Innovate for every app – This regional online conference is designed to inspire and educate executives and IT professionals about AWS. It offers dozens of technical sessions in eight languages (English, Spanish, French, German, Italian, Japanese, Korean, and Indonesian). Register today: Americas, September 28; Europe, Middle-East, and Africa, October 6; Asia Pacific & Japan, October 20.

AWS Community Days – AWS Community Day events are community-led conferences to share and learn with one another. In September, the AWS community in the US will run events in Arlington, Virginia (September 30). In Europe, Community Day events will be held in October. Join us in Amersfoort, Netherlands (October 3), Warsaw, Poland (October 14), and Dresden, Germany (October 19).

AWS Tour du Cloud – The AWS Team in France has prepared a roadshow to meet customers and partners with a one-day free conference in seven cities across the country (Aix en Provence, Lille, Toulouse, Bordeaux, Strasbourg, Nantes, and Lyon), and in Fort-de-France, Martinique.

AWS Fest – This third-party event will feature AWS influencers, community heroes, industry leaders, and AWS customers, all sharing AWS optimization secrets (this week on Wednesday, September). You can register for AWS Fest here.

Stay Informed
That is my selection for this week! To better keep up with all of this news, please check out the following resources:

— seb
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!