Tag Archives: Amazon CloudWatch

Introducing the AWS Network Firewall CloudWatch Dashboard

2024-12-12 Ajinkya Patil

Post Syndicated from Ajinkya Patil original https://aws.amazon.com/blogs/security/introducing-the-aws-network-firewall-cloudwatch-dashboard/

Amazon CloudWatch dashboards are customizable pages in the CloudWatch console that you can use to monitor your resources in a single view. This post focuses on deploying a CloudWatch dashboard that you can use to create a customizable monitoring solution for your AWS Network Firewall firewall. It’s designed to provide deeper insights into your firewall’s performance and security events simplifying security monitoring.

Network Firewall is a managed service that you can use to deploy essential network protections to Amazon Virtual Private Clouds (Amazon VPCs). Network Firewall provides comprehensive logs and metrics through CloudWatch, and we’re expanding its capabilities with this CloudWatch dashboard. This enhancement makes it easier to visualize, analyze, and act on the wealth of data generated by your firewall.

This open source solution streamlines network security monitoring with a user-friendly AWS CloudFormation template that quickly deploys a dedicated monitoring dashboard. This solution incorporates a suite of CloudWatch features—basic monitoring metrics, vended logs, Logs Insights queries, Contributor Insights rules, and the dashboard itself—into a centralized view. Preconfigured widgets provide instant insights into critical areas such as top talkers, protocol distributions, and alert log trends, in addition to HTTP and TLS flow analysis. A consolidated view of key metrics and logs enables faster identification of potential security threats or performance issues. With all of this relevant network firewall data in one place, your team can respond more quickly to emerging security events.

In this blog post, we provide an overview of the dashboard and a step-by-step guide to deploy it in your environment.

Solution overview

The CloudWatch dashboard can be deployed in all AWS Regions where Network Firewall is available today, including the AWS GovCloud (US) Regions and China Regions. While the dashboard comes pre-configured, you can quickly adjust queries, time ranges, and refresh intervals to help meet your specific needs. By default, the dashboard queries firewall flow and alert log events over a 3-hour period, impacting the number of log events scanned. Logs Insights and Contributor Insights widgets showcase the top 10 data points by default, but you can enhance results by modifying queries or adjusting the Top Contributors value, though this might lead to increased costs. You can configure the auto-refresh interval of the widgets to get real-time visibility and optimize costs. See the Amazon CloudWatch Pricing guide for up-to-date free and paid tier pricing considerations.

The dashboard, shown in Figure 1, can be deployed using CloudFormation and includes data and analytics from the following sources:

Native CloudWatch metrics from the AWS/NetworkFirewall and AWS/PrivateLinkEndpoints namespaces
CloudWatch Logs Insights queries that analyze Network Firewall flow and alert logs
CloudWatch Contributor Insights rules that aggregate data from Network Firewall flow and alert logs.

Figure 1: CloudWatch dashboard

Walkthrough

In the dashboard, the Logs Insights and Contributor Insights widgets display the top 10 data points by default. You can edit the Insights queries or change the Top Contributors to a larger value to display more results, as shown in Figure 2.

Figure 2: Top Talkers dashboard showing a change to the Top Contributors value

You can also manually refresh the data within a single or multiple widgets, or you can configure the entire dashboard to automatically refresh at a configured time interval as shown in Figure 3. The dashboard won’t automatically refresh the widget data by default.

Figure 3: Configuring the dashboard to automatically refresh

Prerequisites

Deploying the Network Firewall CloudWatch Dashboard is straightforward. You will need the following:

A Network Firewall in your VPC.
Your Network Firewall must be configured to publish firewall flow and alert logs to two different CloudWatch log groups. For example, firewall flow logs are published to /my-firewall-flow-logs and alert logs are published to /my-firewall-alert-logs.

If you haven’t deployed Network Firewall in your VPC, you can use one of the available AWS Network Firewall Deployment Architecture templates to create a firewall. After creating a firewall, configure CloudWatch log groups for the firewall flow and alert logs and configure stateful logging as described previously. Fine-tune your firewall policy and rule configuration and make sure that you’re routing traffic symmetrically through the firewall. With the firewall now in the routed path and publishing metrics and log events, you can proceed with this Network Firewall CloudWatch dashboard template.

Deployment

The Network Firewall dashboard CloudFormation template creates a monitoring dashboard for a single Network Firewall firewall. Make sure that you launch this CloudFormation stack in the same AWS Region and account as the firewall, regardless of whether the firewall is set up centrally or in a distributed manner.

To deploy the dashboard:

Choose Launch Stack for the relevant AWS Region. Make sure that you’re signed in to the appropriate AWS account and Region.
- Region: China
- Region: Gov Cloud
- Region: All other regions supported by AWS Network Firewall
You will be redirected to the Create stack page in the AWS Management Console for CloudFormation. Make sure that you’re in the correct Region and using the correct template. Choose Next. The following are the Regions and their template names:
1. China Region: nfw-cloudwatch-dashboard-china.yaml
2. Gov Cloud Region: nfw-cloudwatch-dashboard-govcloud.yaml
3. All other Regions: nfw-cloudwatch-dashboard.yaml

Figure 4: Make sure that you’re using the correct template

When launching the stack, you will need to enter the following parameters:

Stack name: A descriptive name for this CloudFormation stack. For example, my-firewall-dashboard.
Firewall name: The firewall name as seen in the Amazon VPC console. In the Amazon VPC console, choose Network Firewall in the navigation pane, then choose Firewalls.
Firewall subnets: The firewall subnet IDs to which your firewall endpoints are attached. The firewall subnets can be found on the Firewall details tab of your firewall in the Amazon VPC
Flow log group name: The name of the CloudWatch log group where your firewall flow logs are stored.
Alert log group name: The name of the CloudWatch log group where your firewall alert logs are stored.
Contributor Insights rule state: Enable or disable the Contributor Insights rules (the template defaults to enabled). Disabling will stop the rules from scanning log data and displaying results in the Contributor Insights widgets. After the rules are created, you can change the state of one or more Contributor Insights rules from CloudWatch console by choosing Insights from the navigation pane, and then choosing Contributor Insights.

After the stack reaches CREATE_COMPLETE status, go to the Outputs tab and choose the FirewallDashboardURI link to open the new dashboard in the CloudWatch Dashboards console. It might take a few minutes for the Logs Insights and Contributor Insights widgets to start displaying data. For more details about each widget, see the README. If you don’t have log events matching the query parameters in the widgets, some widgets might not show data points.

Troubleshooting

If you encounter issues during or after deployment, review the following:

Firewall logging is enabled and configured to use CloudWatch instead of Amazon Simple Storage Service (Amazon S3) or Amazon Kinesis.
Both firewall flow and alert logging are enabled, not just one.
Log group names are entered correctly; incorrect names will cause widgets to point to invalid data.
Correct subnets are selected. Incorrect choices can impact the PrivateLink metrics widgets.
Firewall name is entered correctly. An incorrect name can disrupt metrics widgets, dashboard, and Contributor Insights widget names and break the firewall link.

Cleaning up

You can delete the Network Firewall CloudWatch dashboard and all of the associated resources with a few clicks. Deleting the dashboard will not impact the routing and network traffic inspection performed by the firewall.

Sign in to the CloudFormation console in the Region where you launched the stack and choose Stacks from the navigation pane.
Select the Stack name you chose when launching the stack. For example, my-firewall-dashboard.
Choose Delete.

Conclusion

We encourage you to see for yourself how this new dashboard can enhance your network security management. To get started with the AWS Network Firewall CloudWatch Dashboard, visit our GitHub repository for detailed instructions and the CloudFormation template. For a visual overview of the dashboard and its capabilities, check out our YouTube video.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Investigate and remediate operational issues with Amazon Q Developer (in preview)

2024-12-03 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/investigate-and-remediate-operational-issues-with-amazon-q-developer/

The growing complexity of modern software makes troubleshooting difficult, requiring deep knowledge and manual work across various systems. This results in slower problem-solving and less efficient operations. More and more customers need automated tools to handle routine tasks and simplify complex processes, so they can resolve issues faster and focus on delivering inovations for their customers.

Today, we’re announcing a new capability in Amazon Q Developer to investigate and remediation operational issues, which is now in preview. This generative AI-powered capability guides you through operational diagnostics and automates root cause analysis for problems in your workloads.

Here’s a quick look at how you can now use Amazon Q Developer for operational investigations.

AWS has more operational experience and scale than any other major cloud provider, delivering cloud services to customers around the world for over 17 years. AWS built this experience into Amazon Q Developer operational capabilities to create and present investigation hypotheses, and guide you through troubleshooting and remediation – capabilities that no other major cloud provider offers.

Get started with operational investigation using Amazon Q Developer
This new capability from Amazon Q Developer seamlessly integrates with Amazon CloudWatch and AWS Systems Manager, providing a unified experience while troubleshooting issues. To get started with this capability, you need to complete some prerequisites. You can learn more on the Get Started with Amazon Q Developer Operational Investigations page.

I’ve completed the setup and configured a CloudWatch alarm to monitor the metrics for my application. After receiving a notification email, I navigate to that alarm in Amazon CloudWatch. I observe that the metric has exceeded its threshold over several time periods.

With this finding, I select Investigate. Then, I have two options: Start new investigation or Add to existing investigation. Because I’m just getting started, I select Start a new investigation and provide some details and notes if necessary.

After I’ve created the investigation, I can view the details by choosing View Details on the banner.

The investigation page is divided into two main sections: the left-hand Feed panel, which contains all findings added during the investigation, and the right-hand Suggestions panel, which displays a list of finding suggestions from Amazon Q Developer to assist in the investigation.

Amazon Q Developer uses its knowledge of my AWS resources to automatically discover the relationships between them and create a topology map of the application. This makes it possible for Amazon Q Developer to follow the architecture and quickly find the component that caused an alarm, helping me get back into production faster than ever before.

As I investigate further, Amazon Q Developer proposes hypotheses based on a series of related metrics from various AWS services such as Amazon DynamoDB, AWS Lambda, Amazon Elastic Container Service (Amazon ECS) and others. I can choose Show reasoning to understand why.

One of the hypotheses suggests that the slowness is caused by throttling on a DynamoDB table, with read and write capacity units frequently exceeding the provisioned limits. I find this hypothesis makes sense, and I can Accept it, which will bring it into my Feed.

With all these findings, I can collect all the supporting data to troubleshoot this issue. In one of the hypotheses from Amazon Q Developer, I can also view suggested actions. I select View actions to understand my options for remediation.

In the Suggested actions menu, Amazon Q Developer proposes AWS Systems Manager Automation runbooks related to the hypothesis. Where applicable, it suggests automated runbooks from the AWS Systems Manager library, which includes over 400 AWS-authored and thousands of customer-authored runbooks to help remediate observed issues. Each runbook defines the actions that Systems Manager performs to help resolve the issue. Additionally, Amazon Q Developer provides relevant documentation links from AWS re:Post articles and AWS Documentation pages.

Here’s the list of suggested actions from Amazon Q Developer. I choose View runbook to understand more on how I can solve this issue by modifying DynamoDB provisioned capacity.

Here, I can read more information on this runbook. It will offer a description of the runbook, including execution history telling me if I ran this runbook successfully in this account in the past.

I can enter the required parameters as defined in the configuration. Under Execution preview segment, I can review a summary highlighting the impact on targeted resources. After confirming the details, I select Execute to implement the necessary changes for my workloads.

After running the runbook, I can see the results, which are then added to my feed.

Another feature I appreciate is the multiple ways to access this capability. For example, in my CloudWatch metrics for my AWS Lambda function, I can initiate an investigation and add findings directly. I can also select the Amazon Q Developer operational investigations icon to open the investigation panel.

This new capability from Amazon Q Developer feels like having an AWS expert available 24/7 to assist with operational troubleshooting. It lowers the barrier to operational experience and saves valuable time and effort.

Now in preview
The new capability of Amazon Q Developer to help you investigate and remediate operational issues is now in preview in the US East (N. Virginia) Region. Transform your operational investigation today and accelerate remediation with Amazon Q Developer. Visit Amazon CloudWatch documentation page to get started.

Happy troubleshooting!

— Donnie

Container Insights with enhanced observability now available in Amazon ECS

2024-12-02 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/container-insights-with-enhanced-observability-now-available-in-amazon-ecs/

Last year, we announced enhanced observability in Amazon CloudWatch Container Insights, a new capability to improve your observability for Amazon Elastic Kubernetes Service (Amazon EKS). This capability helps you detect and fix container issues faster by providing detailed performance metrics and logs.

Expanding this capability, today we’re launching enhanced observability for your container workloads running on Amazon Elastic Container Service (Amazon ECS). This new capability will help reduce your mean time to detect (MTTD) and mean time to repair (MTTR) for your overall applications, helping prevent issues that could negatively impact your user experience.

Here’s a quick look at Container Insights with enhanced observability for Amazon ECS.

Container Insights with enhanced observability addresses a critical gap in container monitoring. Previously, correlating metrics with logs and events was a time-consuming process, often requiring manual searches and expertise in application architecture. Now, with this capability, CloudWatch and Amazon ECS automatically collect granular performance metrics such as CPU utilization at both the task and container levels while providing visual drill downs enabling easy root-cause analysis.

This new capability enables the following use cases:

Quickly identify root causes by viewing granular resource usage patterns and correlating telemetry data.
Proactively manage your ECS resources using curated dashboards based on AWS best practices.
Track your recent deployments and root causes of your deployment failures with the matching infrastructure anomalies enabling faster issue detection and quicker rollbacks when necessary.
Effortlessly monitor resources across multiple accounts without manual setup. Built-in cross-account support reduces operational overhead with single pane of glass observability.
Integration with other CloudWatch services such as Application Signals and CloudWatch Logs provides a seamless experience to correlate infrastructure with the services running and identify the impacted services.

Using container insights with enhanced observability for Amazon ECS
There are two ways to enable Container Insights with enhanced observability:

Cluster-level onboarding – You can enable it for specific clusters individually.
Account-level onboarding – You can also enable it at the account level, which automatically enables observability for all new clusters created in your account. This approach saves time and effort by eliminating the need to manually enable it for each new cluster.

To enable this feature at the account level, I navigate to the Amazon ECS console and select Account settings. Under the CloudWatch Container Insights observability section, I can see it’s currently disabled. I choose Update.

On this page, I find a new option called Container Insights with enhanced observability. I select this option and then choose Save changes.

If I need to enable this capability at the cluster level, I can do so when creating a new cluster.

I can also enable this capability for my existing clusters. To do so, I select Update cluster, and then choose the option.

Once enabled, I can see task-level metrics by navigating to the Metrics tab in my cluster overview console. To access health and performance metrics across my clusters, I can select View Container Insights, which will redirect me to the Container Insights page.

To get a big picture of all my workloads across different clusters, I can navigate to Amazon CloudWatch and then to Container Insights.

This view addresses the challenge of effectively monitoring clusters, services, tasks, and containers by providing a honeycomb visualization that offers an intuitive, high-level summary of cluster health. The dashboard employs a dual-state monitoring approach:

Alarm state (red or green) – Reflects customer-defined thresholds and alerts, allowing teams to configure monitoring based on their specific requirements
Utilization state (dark blue or light blue) – Uses CloudWatch built-in best practices to monitor resource usage patterns across containers. The darker blue indicates clusters operating under higher utilization, enabling teams to proactively identify potential resource constraints before they impact performance

Let’s say there’s an issue in one of my clusters. I can hover over the cluster to display all the alarms created under that cluster at different layers, from the cluster layer down to the container layer.

I also have the option to view all clusters in a list format. The list format is essential for cross-account observability, displaying account IDs and labels for cluster ownership. This helps DevOps engineers quickly identify and collaborate with account owners to resolve potential application issues.

Now, I’d like to explore further. I select my cluster link, which redirects me to the Container Insights detailed dashboard view. Here, I can see a spike in memory utilization for this cluster.

I can dive deeper into container-level details, which help me quickly identify which services are causing this issue.

Another useful feature I found is the Filters option, which helps me conduct more thorough investigations across containers, services, or tasks in this cluster.

If I need to delve deeper into the application logs to understand the root cause of this issue, I can select the task, choose Actions, and choose which logs I would like to view.

On top of using AWS X-Ray traces, I can investigate another two types of logs here. First, I can use performance logs—structured logs containing metric data—to drill down and identify container-level root causes. Second, I examine collected application or container logs . These logs give me detailed insights into application behavior within the container, helping me trace the sequence of events that led to any issues.

In this case, I use application logs.

This streamlines my journey to troubleshoot my application. In this case, the issue is on the downstream calls to third-party applications, which return timeouts.

This enhanced capability also works with Amazon CloudWatch Application Signals to automatically instrument my application. I can monitor current application health and track long-term application performance against service-level objectives.

I select the Application Signals tab.

This integration with Amazon CloudWatch Application Signals provides me with end-to-end visibility, helping me correlate container performance with end-user experience.

When I select datapoints in the graphs, I can see associated traces, which show me all correlated services and their impact. I can also access relevant logs to understand root causes.

Additional things to know
Here are a couple of important points to note:

Availability – Container Insights with enhanced observability for ECS is now available in all AWS Regions including the China Regions.
Pricing – Container Insights with enhanced observability for ECS comes with a flat metric pricing, visit the Amazon CloudWatch Pricing page.

Get started today and experience improved observability for your container workloads. Learn more on the Amazon CloudWatch documentation page.

Happy monitoring,
— Donnie Prakoso

New Amazon CloudWatch Database Insights: Comprehensive database observability from fleets to instances

2024-12-01 Jeff Barr

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-database-insights-comprehensive-database-observability-from-fleets-to-instances/

Observing your Amazon Aurora databases is now a whole lot easier. Instead of spending time setting up telemetry, building dashboards, and configuring alarms, you just open Amazon CloudWatch Database Insights and take a look. With no further setup, you can monitor the health of all of your Amazon Aurora MySQL and PostgreSQL instances in the selected Region:

Each of the sections contains a wealth of detail and I’ll get to that in a moment (this may be the ultimate “but wait, there’s more” post). From this view, I can open the filter control on the left and filter the set of instances in a couple of different ways. For example, I can filter for all of the instances running Amazon Aurora MySQL, and see that I have 66 such instances, with 3 raising alarms:

I can save the filter as a Fleet (note that Fleets are defined by specific properties and tags of the database instances and as such are inherently dynamic):

And then I can see the overall health of the fleet with a click. The entire page updates to reflect the fleet; I focus on the summary:

Behind the scenes, Database Insights looks for CloudWatch alarms that include a DBInstanceIdentifier dimension, and uses these alarms to establish a correlation between database instances and alarms. This, along with other built-in heuristics and correlation steps, allows Database Insights to deliver helpful, well-organized information that will help you to better understand the overall health of your fleet and to dive deep in order to find bottlenecks and other issues.

Clicking on an instance (represented by a hexagon) reveals details; I click on the instance name (demo-mysql-reader0) to learn more:

In the per-instance view I can also see a myriad of details:

Each of the tabs at the bottom provides additional insights into what’s happening inside the database instance. For example, selecting DB Load Analysis / Top SQL / SQL Metrics shows me which SQL statements are imposing the heaviest load, along with 29 additional metrics (not shown):

From past experience, I know that finding and understanding slow queries is a tedious yet important task. with Database Insights I can see patterns that are common to the slow queries, as well as the actual queries:

With help from AWS X-Ray, Application Signals, and the AWS Distro for OpenTelemetry SDK, I can see the services and operations that originate the queries to the database instance:

The red X indicates that this operation is failing the associated Service Level Objective (SLO), an application performance monitoring aspect of Application Signals. An SLO defines the reliability of a service against customer expectations, and can be set up by selecting the service and clicking Create SLO. There are a couple of steps and some very helpful options, but at the core a SLO is measured as a percentage of successful requests over an extended period of time:

If the database instance is configured to send logs to CloudWatch Logs, I can see and search the logs, filtered by the selected time period, and within a particular log group:

There’s still a lot more to explore at the fleet level. For example, I can see the ten calling services which drive the highest DB load (again, this is powered by AWS X-Ray, Application Signals, and the AWS Distro for OpenTelemetry SDK):

And I can see the top 10 instances with respect to any of eight different metrics:

I could go on all day, but I will leave the rest for you to explore. As I never tire of saying, this feature is available now and you can start using it today.

Things to Know
Here are a couple of things to know about Database Insights:

Supported Databases – You can use Database Insights with Amazon Aurora MySQL and Amazon Aurora PostgreSQL database instances.

Pricing – There is a per-hour, per-database instance charge based on the average number of vCPUs used (for provisioned instances) or Aurora Capacity Units (for Serverless v2 databases) monitored, with separate charges for ingestion and storage of database logs. See the CloudWatch Pricing page for more information.

Regions – This feature is available in all commercial AWS Regions.

— Jeff;

New Amazon CloudWatch and Amazon OpenSearch Service launch an integrated analytics experience

2024-12-01 Elizabeth Fuentes

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-and-amazon-opensearch-service-launch-an-integrated-analytics-experience/

Today, Amazon Web Services (AWS) announces a new integrated analytics experience and zero-ETL integration between Amazon CloudWatch and Amazon OpenSearch Service. This integration simplifies log data analysis and visualization without data duplication, streamlining log management while reducing technical overhead and operational costs. CloudWatch Logs customers now have access to two additional query languages beyond CloudWatch Logs Insights QL, while OpenSearch customers can query CloudWatch logs in place without creating separate extract, transform, and load (ETL) pipelines.

Organizations often need different analytics capabilities for their log data. Some teams prefer CloudWatch Logs for its scalability and simplicity in centralizing logs from all their systems, applications, and AWS services. Others require OpenSearch Service for advanced analytics and visualizations. Previously, integration between these services required maintaining separate ingestion pipelines or creating ETL processes. This new integration helps customers get the best of both services by eliminating this complexity by bringing the power of OpenSearch analytics directly to CloudWatch Logs, without any data copy.

Amazon CloudWatch Logs now supports OpenSearch Piped Processing Language (PPL) and OpenSearch SQL directly within the CloudWatch Logs Insights console. You can use SQL to analyze data and correlate logs using JOIN. You can use SQL functions (such as JSON, mathematical, datetime, and string functions) for intuitive log analytics. You can also use the OpenSearch PPL to filter, aggregate, and analyze data. With a few clicks, you can access pre-built, out-of-the-box dashboards for vended logs, such as Amazon Virtual Private Cloud (VPC), AWS CloudTrail, and AWS WAF. These dashboards enable faster monitoring and troubleshooting through visualizations, such as analyzing flows over time, top talkers, megabytes, and packets transferred over time, without having to configure individual widgets or build specific queries. You can analyze VPC flows over time, identify top talkers, track network traffic metrics, monitor web request trends in AWS WAF, or analyze API activity patterns in AWS CloudTrail.

Additionally, OpenSearch Service users can now analyze CloudWatch logs using OpenSearch Discover and run SQL and PPL, similar to how they analyze data in Amazon Simple Storage (Amazon S3), and build indexes and create dashboards directly without any ETL operations or separate ingestion pipelines.

Let’s explore how this integration works
To demonstrate the new OpenSearch SQL and PPL query capabilities in CloudWatch, I start in the CloudWatch console. In the navigation pane, I choose Logs then Logs Insights. After selecting log groups for the query, I can now use OpenSearch PPL or OpenSearch SQL query languages directly within CloudWatch Logs Insights, with no additional setup or integration required. Using this new capability, I can write complex queries using familiar SQL syntax or OpenSearch PPL, making log analysis more intuitive and efficient. In the Query commands menu, you can find sample queries to help you get started.

This example demonstrates how to use SQL JOIN to combine data from two log groups: pet adoptions and pet availability. By filtering for specific customer IDs, you can analyze related log records and trace IDs for troubleshooting purposes.

One of the powerful features of this integration for CloudWatch Logs customers is the ability to create pre-built dashboards for Amazon VPC Flows, AWS CloudTrail and AWS WAF logs. Let’s explore this by creating a dashboard for AWS WAF logs. In the Analyze with OpenSearch tab, I choose Settings and follow the steps.

After a few minutes, my integration is ready and I go to Create an OpenSearch dashboard. In the options Select automatic dashboard type, I choose AWS WAF logs.

In the Dashboard data configuration tab, I can select Data synchronization frequency to occur every 15 minutes. I Select the log groups and View log samples of the selected log groups. I finish by choosing Create a dashboard.

After creating my dashboard, I can explore my logs. The AWS WAF logs dashboard provides comprehensive visibility into web application firewall metrics and events, with automatically configured visualizations that help you monitor and analyze security patterns.

Similarly, the CloudTrail dashboard offers deep insights into API activity across your AWS environment. It’s useful for monitoring API activity, auditing actions, and identifying potential security or compliance issues.

The VPC Flow Logs dashboard provides detailed visualization of key metrics from your logs for network traffic analysis. You can analyze network traffic, detect unusual patterns, and monitor resource usage. The dashboard currently supports only VPC v2 fields (default format). Custom formatted fields are not supported.

With zero-ETL to access CloudWatch data from OpenSearch Services, I also can build an OpenSearch dashboard from the OpenSearch Service console without having to build and maintain an ETL process. For this, I go to Central management, then I select the new Connected data sources menu, click choose Connect to create a new connected data source, and choose CloudWatch Logs.

In the next step, I name my data source and choose to Create a new role, which must have the necessary permissions to execute actions on OpenSearch Service. You can see them in the Sample custom policy.

In the Set up OpenSearch step, configure a OpenSearch data connection for CloudWatch Logs by selecting Create a new collection. As part of setting up the CloudWatch Logs source, a new OpenSearch Service serverless collection and OpenSearch UI application is created to store the indexed views and provide a user interface to analyze your CloudWatch Logs data. I create a new collection, name it, and conﬁgure the OpenSearch application and workspace within the application. After setting the Data retention days, I choose Next and ﬁnish with Review and connect.

When the integration with CloudWatch is ready, I can choose between Explore logs without indexing data which will take me to a querying interface in Discover or Explore vended logs by creating a dashboard for Amazon VPC Flows, CloudTrail and AWS WAF logs.

After I select Explore logs, OpenSearch UI takes me to Discover in the application workspace I created during the data source setup. In Discover, I select the data picker and choose View all available data to access my CloudWatch Logs data source and log groups.

After I select the log groups, I can analyze my CloudWatch logs using OpenSearch SQL and PPL directly in Discover, without having to switch between applications.

To create a dashboard, I return to the Connected data sources overview page on the console. From there, I select Create dashboard, which allows me to visually analyze my CloudWatch data without having to define queries or build visualizations, as I previously did in the CloudWatch console

After the dashboard is created, I navigate to OpenSearch resources where I can see the newly created indexes being populated with data in my Collection. After I have the data, I can go to the dashboard with the data from the CloudWatch logs that I selected in the configuration, and as more data comes in, it will be displayed in near real-time on the OpenSearch dashboard.

With this zero-ETL integration you can ingest data directly into OpenSearch, using its powerful query capabilities and visualization features while maintaining data consistency and reducing operational overhead.

Integration Highlights
For CloudWatch customers:

Query capabilities – Streamline log investigation by using OpenSearch SQL and PPL queries directly within the CloudWatch Logs Insights console.
Analytics features – With a few clicks, access pre-built, out-of-the-box dashboards for vended logs, such as VPC, AWS WAF, and CloudTrail logs. These dashboards enable faster monitoring and troubleshooting through visualizations for analyzing flows over time, top talkers, megabytes, and packets transferred over time, without having to configure individual widgets or build specific queries.
Getting started for CloudWatch users – Configure integration from CloudWatch Logs to OpenSearch Service. For more information refer to the Amazon CloudWatch Logs query capabilities and Amazon CloudWatch Logs vended dashboard documentation.

For OpenSearch Service customers:

Zero-ETL integration – Access and analyze CloudWatch data directly from OpenSearch Service without building or maintaining ETL processes. This integration eliminates separate ingestion pipelines while reducing storage costs and operational overhead through simplified data management and zero data duplication.
Getting started for OpenSearch users – Create a data connection selecting CloudWatch as a data source from OpenSearch Service. For more information, refer to the Amazon OpenSearch Service Developer Guide.

Regional availability and pricing
This integration is now available in AWS Regions where Amazon OpenSearch Service direct query is available. For pricing details and free trial information, you can visit the Amazon CloudWatch Pricing and Amazon OpenSearch Service Pricing pages.

PS: Writing a blog post at AWS is always a team effort, even when you see only one name under the post title. In this case, I want to thank Joshua Bright, Ashok Swaminathan, Abeetha Bala, Calvin Weng, and Ronil Prasad for their generous help with screenshots, technical guidance, and sharing their expertise in both services, which made this integration overview possible and comprehensive.

— Eli

Introducing Provisioned Mode for Kafka Event Source Mappings with AWS Lambda

2024-11-26 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/introducing-provisioned-mode-for-kafka-event-source-mappings-with-aws-lambda/

This post is written by Tarun Rai Madan, Principal Product Manager, Serverless Compute and Rajesh Kumar Pandey, Principal Software Engineer, Serverless Compute

AWS is announcing the general availability of Provisioned Mode for AWS Lambda Event Source Mappings (ESMs) that subscribe to Apache Kafka event sources including Amazon MSK and self-managed Kafka. Provisioned Mode allows you to optimize the throughput of your Kafka ESM by provisioning event polling resources that remain ready to handle sudden spikes in traffic. Controlling the throughput of your ESM helps you build highly responsive and scalable event-driven Kafka applications with stringent performance requirements.

Overview

When you build modern applications using Event-Driven Architectures (EDAs), your event producers publish events, which are then processed by event source connectors like an ESM, and routed to serverless compute consumers like Lambda functions. Apache Kafka is a popular open-source platform for building real-time streaming data applications using Lambda functions as consumers. AWS Lambda’s fully-managed MSK ESM or self-managed Kafka ESM reads events from Kafka as an event source, performs operations like filtering and batching, and invokes Lambda functions. Both ESMs offer built-in integrations with event sources, auto-scaling, and features like batching and filtering. When a Kafka ESM is created, Lambda ESM allocates one event poller to start polling for messages in a Kafka topic. The ESM then evaluates the message backlog – using the OffsetLag metric – for all partitions in the topic, and auto-scales event pollers to process messages efficiently.

Many real-time applications using Kafka are sensitive to sudden spikes in traffic, which could lead to noticeable delays in your end users’ experience. Previously, there were no controls to optimize the throughput for performance-sensitive workloads when using Kafka ESMs. This forced you to explore alternative solutions for workloads with strict performance requirements, which added architectural complexity. To harness the power of Lambda for such performance-sensitive applications, you need to be able to control your Kafka ESM’s throughput and ensure responsive auto-scaling behavior.

What’s new

Provisioned Mode for ESM is a feature that helps you control the throughput of your ESM, and achieve an enhanced performance profile for performance-sensitive applications, particularly ones that see sudden spikes in traffic. You can use Provisioned Mode for Kafka ESM with a range of Kafka or Kafka-compatible streaming data platform providers like Amazon MSK, Confluent, Redpanda, and self-managed Kafka. Key benefits include:

Controls to optimize throughput: You can now fine-tune the throughput of your ESM by configuring a minimum and maximum number of resources called “event pollers”. An event poller (or a “poller”) represents a compute resource that underpins an ESM in the Provisioned Mode, and allocates up to 5 MB/s throughput.
Responsive auto-scaling: With Provisioned Mode, your Kafka ESM detects the increase in OffsetLag metric for all partitions in your Kafka topic, and auto-scales event pollers in a responsive manner. During idle periods, your ESM automatically scales down to the minimum event pollers set by you.
Simplified networking experience and charges: Previously, you were required to configure AWS PrivateLink or NAT Gateway to enable Lambda to poll messages from Kafka clusters in your VPC and invoke Lambda functions. With Provisioned Mode, you are no longer required to configure PrivateLink or NAT Gateway. This approach reduces overhead and improves the developer experience, allowing you to focus on building applications rather than managing networking setup. Consequently, you are not charged for PrivateLink VPC endpoints when using Kafka as an event source with Lambda in the Provisioned Mode for ESM, which reduces your networking charges.

Activating Provisioned Mode for ESM

To activate Provisioned Mode for a new or existing Kafka ESM, you can configure the minimum event pollers, the maximum event pollers, or both for your ESM. The allowed values range from 1 to 200 for minimum event pollers, and from 1 to 2000 for maximum event pollers.

Note that you must configure at least one of minimum or maximum event pollers to activate Provisioned Mode. When you configure only the minimum number of event pollers (‘Min-only’), your ESM allocates this minimum quantity and can dynamically scale up to a maximum. This maximum is determined by the OffsetLag and is limited by either the number of partitions or the default maximum event pollers, whichever is lower. When you configure only the maximum number of event pollers (“Max-only”), your ESM starts with one minimum poller by default, and can scale up to the maximum event pollers or number of partitions, whichever is lower. When you configure both the minimum and maximum number of event pollers (“Min and Max”), your ESM can auto-scale between this range of minimum and maximum event pollers configured.

Activating using AWS CLI

You can activate Provisioned Mode for ESM during creation of a new ESM, or by updating an existing ESM. Specify the –provisioned-poller-config parameter.

aws lambda create-event-source-mapping \
    --region <region-name> \
    --function-name <function-name> \
    --event-source-arn <event-source-arn> \
    --provisioned-poller-config '{"MinimumPollers":<number>, "MaximumPollers":<number>}'

Activating using AWS Lambda Console

Select Configure provisioned mode to activate Provisioned Mode when creating a new ESM, or updating an existing one.

Figure 1: Activating Provisioned Mode for ESM in Console

Provisioned Mode for Kafka ESM in action

To see the performance profile with Provisioned Mode for Kafka ESM, deploy a Lambda function that subscribes to an Amazon MSK topic. Use the reference pattern on Serverless Land and see this blog post outlining steps to configure MSK ESM for a Lambda function. In this case, a producer writes 20 million messages, each with 1KB payload size to an MSK topic – distributed evenly across 100 partitions. Use a batch size of 100, with function duration at 100ms, and set the StartingPosition to TRIM_HORIZON to process from the beginning of the stream.

Note the baseline performance profile observed with the default On-Demand mode. Then analyze two configurations with the Provisioned Mode activated.

Scenario 1 uses different configurations for minimum event pollers
Scenario 2 uses the default minimum event pollers and lets Lambda manage the event pollers through autoscaling.

Baseline performance profile for Kafka ESM On-demand

With Provisioned Mode disabled, Lambda takes approximately 20 minutes to drain the backlog of 20 million messages. It takes 4 minutes to reach the maximum concurrent executions. Use this result as a baseline to compare against Provisioned Mode for ESM.

Figure 2: Baseline performance without Provisioned Mode for ESM

Scenario 1: Configuring minimum event pollers, and auto-scaling

To optimize the ESM throughput for this workload and reduce the time to drain the message backlog, configure the minimum event pollers. Select values of 10 and 100 for minimum event pollers, and observe the results.

Configuring 10 minimum event pollers

Lambda drains the backlog of 20 million messages in approximately 11 minutes with minimum pollers set to 10. This is 45% faster than the baseline without Provisioned Mode. It takes approximately 6 minutes to reach maximum concurrent executions.

Figure 3: Performance profile with minimum event pollers set to 10

Configuring 100 minimum event pollers

To further improve the processing performance, configure the minimum event pollers to 100. Lambda now takes 6 minutes to drain the backlog of 20 million messages, which is 70% faster than the baseline. It instantly reaches the maximum concurrent executions.

Figure 4: Performance profile with minimum event pollers set to 100

Scenario 2: Default minimum event pollers, and auto-scaling

In some cases, the workload may not be as performance-sensitive. With the same volume of 20M messages in your Kafka topic, activate Provisioned Mode for ESM. Start with the default minimum event pollers (set to 1) and let Lambda auto-scale the event pollers based on incoming traffic.

Lambda automatically scales up your event pollers to process the incoming messages, and scales them down as the backlog is cleared. With the default minimum and maximum event pollers, Lambda takes approximately 12 minutes to clear the backlog of 20 million messages, which is 40% faster than the baseline. Lambda takes 7 minutes to reach maximum concurrent executions.

Figure 5: Performance profile with minimum event pollers set to 1

The following table summarizes the performance improvement for the analyzed workload using Provisioned Mode for ESM.

ESM Mode	Time to drain message backlog	Percentage improvement
On-demand Mode	20 minutes	Baseline
Provisioned Mode: Scenario 1 (fine-tuned minimum event pollers)
Minimum event pollers = 10	11 minutes	45%
Minimum event pollers = 100	6 minutes	70%
Provisioned Mode: Scenario 2 (default minimum event pollers)
Minimum event pollers = 1	12 minutes	40%

Table: Performance profile for reference test case before and after activating Provisioned Mode for ESM

Observability and Pricing

You can observe the usage of event pollers by monitoring the ProvisionedPollers Amazon CloudWatch metric, which measures the number of event pollers that actively processed at least one event in the last 5-minute window.

Pricing is based on the provisioned minimum event pollers and the number of event pollers consumed during automatic scaling. Provisioned Mode introduces a billing unit called Event Poller Unit (EPU). Each EPU supports up to 20 MB/s of throughput for event polling. The number of event pollers allocated on an EPU depends on the throughput consumed by each event poller. You pay for the number of EPUs used and the duration they run for, measured in Event Poller Unit hours. For details, refer to AWS Lambda pricing.

Best practices and considerations

The optimal configuration of minimum and maximum event pollers for your Kafka Event Source Mapping (ESM) depends on your application’s performance requirements. Start with the default minimum event pollers to baseline the performance profile, and adjust event pollers based on observed message processing patterns and your application’s performance requirements. For workloads with spiky traffic and strict performance needs, increase the minimum event pollers to handle sudden surges. You can fine-tune the minimum required event pollers by evaluating your desired throughput, your observed throughput – which depends on factors like the ingested messages per second and average payload size, and using the throughput capacity of one event poller (up to 5 MB/s) as reference. Note that to maintain ordered processing within a partition, Lambda caps the maximum event pollers at the number of partitions in the topic.

Update your network settings to remove PrivateLink VPC endpoints and associated permissions for existing ESMs when you activate Provisioned Mode.

Conclusion

Provisioned Mode for Lambda ESM allows you to fine-tune the throughput for your Kafka ESMs by configuring a minimum and maximum number of event pollers. This provides a responsive auto-scaling behavior for Kafka applications that have stringent performance requirements and see unpredictable and spiky traffic. You can fine-tune your configured event pollers based on your requirements and monitor usage via CloudWatch metrics. Provisioned Mode also simplifies network configuration by removing the requirement to configure PrivateLink.

For more serverless learning resources, visit Serverless Land.

Introducing new Event Source Mapping (ESM) metrics for AWS Lambda

2024-11-22 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/introducing-new-event-source-mapping-esm-metrics-for-aws-lambda/

This post is written by Tarun Rai Madan, Principal Product Manager – Serverless, and Rajesh Kumar Pandey, Principal Software Engineer, Serverless

Today, AWS is announcing new opt-in Amazon CloudWatch metrics for AWS Lambda Event Source Mappings that subscribe to Amazon Simple Queue Service (Amazon SQS), Amazon Kinesis, and Amazon DynamoDB event sources. These metrics include PolledEventCount, InvokedEventCount, FilteredOutEventCount, FailedInvokeEventCount, DeletedEventCount, DroppedEventCount, and OnFailureDestinationDeliveredEventCount. The new metrics enable customers to monitor the processing state of events read by Event Source Mappings (ESMs), and helps them diagnose processing issues.

Previously, customers found it challenging to monitor the processing state of events read by an ESM. An ESM is a resource that polls events from an event source and invokes a Lambda function. With the new metrics for ESMs, you can count events by their processing state, which includes events that were polled, invoked, filtered out, deleted, dropped, failed, or sent to on-failure destination.

Overview

Customers building modern event-driven applications use services like SQS, Kinesis, and DynamoDB as fundamental building blocks for developing decoupled architectures, and use a Lambda function as a consumer to benefit from its simplicity, auto-scaling and cost effectiveness. To subscribe to an event source, customers configure a Lambda Event Source Mapping (ESM). An ESM is a fully-managed Lambda resource that runs an event poller which polls, processes (e.g., filters and batches), and delivers the events to a Lambda function. Due to the processing that happens on an ESM, for example, filtering, batching, and delivery to on-failure destinations, events can end up in varying terminal states. As a result, some polled events may not invoke a Lambda function. Previously, the count of polled, filtered, invoked, deleted or dropped events was not visible to customers. This made it challenging for customers to diagnose processing issues with their ESM, resulting from faulty permissions, misconfiguration, or function errors.

What’s new

With today’s announcement, customers can opt-in to CloudWatch metrics to monitor the processing state of events that are read by an ESM configured with SQS, Kinesis and DynamoDB as event sources.

PolledEventCount metric counts the number of events read by an ESM from the event source.

InvokedEventCount metric counts the number of events that invoked your Lambda function. For an event that experiences function errors, this metric may increase the count multiple times for the same polled event, due to retries.

FilteredOutEventCount metric counts the number of events filtered out by your ESM, based on the Filter Criteria defined by you.

FailedInvokeEventCount metric counts the number of events that attempted to invoke a Lambda function, but encountered partial or complete failure.

DeletedEventCount metric counts the events that have been deleted from the SQS queue by Lambda upon successful processing.

DroppedEventCount metric counts the number of events dropped due to event expiry or exhaustion of retry attempts, for Kinesis and DynamoDB ESMs configured with MaximumRecordAgeInSeconds or MaximumRetryAttempts.

OnFailureDestinationDeliveredEventCount metric counts the events sent to an on-failure destination upon reaching the MaximumRecordAgeInSeconds or MaximumRetryAttempts, for ESMs configured with DestinationConfig.

How to use the new ESM metrics

Once an ESM is created and reaches enabled state, it continuously polls the event source for new events. You can monitor the PolledEventCount metric to catch issues with polling due to misconfigured or deleted event source, misconfigured or deleted Lambda function execution role, incorrect permissions, or throttles from the event source. This metric typically increases when there is an increase in traffic in the event source. You can observe the InvokedEventCount metric to catch issues with the Lambda function, and whether the events are properly invoking your Lambda function. In case of Lambda function errors, InvokedEventCount could be more than PolledEventCount due to retries. This metric would also increase when there is an increase in events processed by an ESM. For ESMs that have filter criteria configured, you can monitor the FilteredOutEventCount to count events that were not sent to a Lambda function because they were filtered out per the defined filter criteria.

You can monitor the FailedInvokeEventCount metric to observe the number of events that failed processing when Lambda service tried to invoke your Lambda function. Invocations can fail due to network configuration issues, incorrect permissions, or a deleted Lambda function, version, or alias. If your event source mapping has partial batch responses enabled, this metric includes any event with a non-empty BatchItemFailures in the response. If all events in a batch are successfully processed by your Lambda function, Lambda service emits a 0 value for this metric. You can use the DeletedEventCount metric to ensure that processed events have been successfully deleted from your SQS queue after being processed by the Lambda function. You can use the DroppedEventCount metric to identify issues with message backlogs or misconfigured event expiry rules. You can use the OnFailureDestinationDeliveredEventCount metric to monitor issues such as failed events caused by Lambda function invocation errors.

The classification for available Lambda ESM metrics by event source is presented below:

CloudWatch metric	SQS	DynamoDB	Kinesis Data Stream
PolledEventCount	√	√	√
InvokedEventCount	√	√	√
FilteredOutEventCount	√	√	√
FailedInvokeEventCount	√	√	√
DeletedEventCount	√
DroppedEventCount		√	√
OnFailureDestinationDeliveredEventCount		√	√

Activating and testing the new ESM metrics

You can enable the new ESM metrics using AWS Lambda Console, AWS Command Line Interface (CLI), Lambda ESM API, AWS SDK, AWS CloudFormation, and AWS Serverless Application Model (SAM). The metrics will be published under the AWS/Lambda namespace and EventSourceMappingUUID dimension in the CloudWatch console. To learn more, see CloudWatch metrics for Lambda.

Using AWS CLI

To turn on the new metrics using AWS CLI, use the –metrics-config parameter.

aws lambda create-event-source-mapping \
    --region <region-name> \
    --function-name <function-name> \
    --event-source-arn <event-source-arn> \
    --metrics-config '{"Metrics": ["EventCount"]}'

Using AWS Lambda Console

To turn on the new metrics using AWS Lambda Console, click on “Enable metrics” while adding the trigger for your function.

Figure 1: Enabling ESM metrics in AWS Console

A typical scenario where the new ESM metrics can help with better observability is an ESM that uses event filtering. To test the ESM metrics, you can deploy a sample Lambda application with Kinesis as an event source using this serverless pattern, which uses event filtering with a certain criteria to control which events are sent to Lambda. Use this pattern for both the example scenarios; please follow the setup guidelines for this pattern and continue with testing for the scenarios. Running this sample project in your account may incur charges. See AWS Lambda pricing and Amazon Kinesis pricing.

Figure 2: Configuring Lambda function with Kinesis event source

Example scenario 1: ESM metrics with event filtering configured

The following diagram shows the results for the test scenario with Kinesis ESM, where the total polled events, filtered events, invoked events, and failed events are represented by PolledEventCount, FilteredOutEventCount, InvokedEventCount and FailedInvokeEventCount.

Figure 3: ESM metrics for scenario 1

Example Scenario 2: ESM metrics with event filtering and On-Failure Destination configured

Another common scenario is where you want to have visibility around the number of events delivered to Lambda function, events filtered, and additionally, the count of events routed to on-failure destination upon failure. To test this scenario, follow a setup similar to the one in scenario 1. Create or update the ESM with an on-failure destination, and set MaximumRetryCount to 1, as shown below.

aws lambda update-event-source-mapping \
    --uuid <event-source-mapping-uuid> \
    --maximum-retry-attempts 1 \
    --filter-criteria '{"Filters": [{"Pattern": "{\"data\": { \"tire_pressure\": [ { \"numeric\": [ \"<\", 32 ] } ] } }"}]}' \
    --destination-config '{"OnFailure": {"Destination": "<your_SQS_queue_ARN>"}}' \
    --function-name <lambda-function-name>

Publish a sample payload which matches the FilterCriteria defined above. Also generate sample data with different “tire_pressure” < 32 to match the event and invoke the Lambda function.

Sample Data:

{
    "time": "2021-11-09 13:32:04",
    "fleet_id": "fleet-452",
    "vehicle_id": "a42bb15c-43eb-11ec-81d3-0242ac130003",
    "lat": 47.616226213162406,
    "lon": -122.33989110734133,
    "speed": 43,
    "odometer": 43519,
    "tire_pressure": [41, 40, 31, 41],
    "weather_temp": 76,
    "weather_pressure": 1013,
    "weather_humidity": 66,
    "weather_wind_speed": 8,
    "weather_wind_dir": "ne"
}

Once you have published these records to the stream, you should be able to see the CloudWatch metrics under AWS/Lambda namespace with the EventSourceMappingUUID dimension, as shown below. Note that if an event experiences a function error, InvokedEventCount may increase multiple times for the same polled event due to automatic retries.

Figure 4: ESM metrics for scenario 2

Available Now

The new ESM metrics are generally available in all commercial regions that Lambda service is available in. Support is also available through AWS Lambda partners like Datadog, Elastic, and Lumigo. The Lambda service sends these new metrics to CloudWatch at no additional cost to you. However, charges apply for CloudWatch metrics at standard CloudWatch metrics pricing for these opt-in metrics, in addition to your AWS Lambda pricing and event source pricing.

Conclusion

With these new CloudWatch metrics, you can gain visibility into the processing state of your events that are polled by Lambda Event Source Mapping (ESM) for queue-based or stream-based applications. The blog explains the new metrics PolledEventCount, InvokedEventCount, FilteredOutEventCount, FailedInvokeEventCount, DeletedEventCount, DroppedEventCount, and OnFailureDestinationDeliveredEventCount, and how to use them to troubleshoot event processing issues for Lambda functions. These metrics help you track the invocation requests sent to Lambda via an ESM, monitor any delays or issues in processing, and take corrective actions if required. To learn more about these metrics, visit Lambda developer guide.

For more serverless learning resources, visit Serverless Land.

Track performance of serverless applications built using AWS Lambda with Application Signals

2024-11-21 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/track-performance-of-serverless-applications-built-using-aws-lambda-with-application-signals/

In November 2023, we announced Amazon CloudWatch Application Signals, an AWS built-in application performance monitoring (APM) solution, to solve the complexity associated with monitoring performance of distributed systems for applications hosted on Amazon EKS, Amazon ECS, and Amazon EC2. Application Signals automatically correlates telemetry across metrics, traces, and logs, to speed up troubleshooting and reduce application disruption. By providing an integrated experience for analyzing performance in the context of your applications, Application Signals gives you improved productivity focusing on the applications that support your most critical business functions.

Today we’re announcing the availability of Application Signals for AWS Lambda to eliminate the complexities of manual setup and performance issues required to assess application health for Lambda functions. With CloudWatch Application Signals for Lambda, you can now collect application golden metrics (the incoming and outgoing volume of requests, latency, faults, and errors).

AWS Lambda abstracts away the complexity of the underlying infrastructure, enabling you to focus on building your application without having to monitor server health. This allows you to shift your focus toward monitoring the performance and health of your applications, which is necessary to operate your applications at peak performance and availability. This requires deep visibility into performance insights such as volume of transactions, latency spikes, availability drops, and errors for your critical business operations and application programming interfaces (APIs).

Previously, you had to spend signiﬁcant time correlating disjointed logs, metrics, and traces across multiple tools to establish the root cause of anomalies, increasing mean time to recovery (MTTR) and operational costs. Additionally, building your own APM solutions with custom code or manual instrumentation using open source (OSS) libraries was time-consuming, complex, operationally expensive, and often resulted in increased cold start times and deployment challenges when managing large ﬂeets of Lambda functions. Now, you can use Application Signals to seamlessly monitor and troubleshoot health and performance issues in serverless applications, without requiring any manual instrumentation or code changes from your application developers.

How it works
Using the pre-built, standardized dashboards of Application Signals, you can identify the root cause of performance anomalies in just a few clicks by drilling down into performance metrics for critical business operations and APIs. This helps you visualize application topology which shows interactions between the function and its dependencies. In addition, you can define Service Level Objectives (SLOs) on your applications to monitor specific operations that matter most to you. An example of an SLO could be to set a goal that a webpage should render within 2000 ms 99.9 percent of the time in a rolling 28-day interval.

Application Signals auto-instruments your Lambda function using enhanced AWS Distro for OpenTelemetry (ADOT) libraries. This delivers better performance such as lower cold start latency,
memory consumption, and function invocation duration, so you can quickly monitor your applications.

I have an existing Lambda function appsignals1 and I will configure Application Signals in the Lambda Console to collect various telemetry on this application.

In the Configuration tab of the function I select Monitoring and operations tools to enable both the Application signals and the Lambda service traces.

I have an application myAppSignalsApp that has this Lambda function attached as a resource. I’ve defined an SLO for my application to monitor specific operations that matter most to me. I’ve defined a goal that states that the application executes within 10 ms 99.9 percent of the time in a rolling 1-day interval.

It can take 5-10 minutes for Application Signals to discover the function after it’s been invoked. As a result you’ll need to refresh the Services page before you can see the service.

Now I’m in the Services page and I can see a list of all my Lambda functions that have been discovered by Application Signals. Any telemetry that is emitted will be displayed here.

I can then visualize the complete application topology from the Service Map and quickly spot anomalies across my service’s individual operations and dependencies, using the newly collected metrics of volume of requests, latency, faults, and errors. To troubleshoot, I can click into any point in time for any application metric graph to discover correlated traces and logs related to that metric, to quickly identify if issues impacting end users are isolated to an individual task or deployment.

Available now
Amazon CloudWatch Application Signals for Lambda is now generally available and you can start using it today in all AWS Regions where Lambda and Application Signals are available. Today, Application Signals is available for Lambda functions that use Python and Node.js managed runtimes. We’ll continue to add support for other Lambda runtimes in near future.

To learn more, visit the AWS Lambda developer guide and Application Signals developer guide. You can submit your questions to AWS re:Post for Amazon CloudWatch, or through your usual AWS Support contacts.

– Veliswa.

Efficiently monitor your On Demand Capacity Reservations (ODCR) by Grouping on CloudWatch Dimensions

2024-11-06 aostan

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/efficiently-monitor-your-on-demand-capacity-reservations-odcr-by-grouping-on-cloudwatch-dimensions/

This post is written by Ballu Singh, Principal Solutions Architect at AWS, Ankush Goyal, Enterprise Support Lead in AWS Enterprise Support, Hasan Tariq, Principal Solutions Architect with AWS and Ninad Joshi, Senior Solutions Architect at AWS.

The On-Demand Capacity Reservations (ODCR) allows you to reserve compute capacity for your Amazon Elastic Compute Cloud (Amazon EC2) instances in a specific Availability Zone (AZ) for any duration. It makes sure that you always have access to your Amazon EC2 capacity when you need it. This is ideal for users who need to make sure their instances are available during critical events, even when it is stopped and restarted. Users can create ODCR anytime, without the need for a one or three-year term commitment.

In the post Automate the Creation of On-Demand Capacity Reservations for running EC2 instances, we discussed a solution for automating ODCR operations for existing EC2 instances. The post included creating, modifying, and canceling Capacity Reservations. We also showed monitoring Capacity Reservation usage using the Amazon CloudWatch metric InstanceUtilization, which indicates the percentage of reserved capacity currently in use. This metric is essential for effectively monitoring and optimizing your ODCR consumption.

On August 1st, 2024, AWS introduced new CloudWatch dimensions for Amazon EC2 ODCR. Using these enhancements you can now group CloudWatch metrics for ODCR by dimensions, such as InstanceType, AvailabilityZone, InstanceMatchCriteria, InstancePlatform, Tenancy, CapacityReservationId, or across the Capacity Reservations within a selected AWS Region. With these new dimensions, you no longer need to create a new alarm each time a new Capacity Reservation ID (CRID) is added. Furthermore, there is no longer a need to poll ODCR metadata using the describe-capacity-reservations AWS CLI command or API, because this information is now readily available through CloudWatch metrics.

This post shows you how to create CloudWatch alarms for ODCR using these new dimensions. The setup methodology helps you get the information directly in the CloudWatch console instead of having to call the DescribeCapacityReservations API or invoking the describe-capacity-reservationsCLI command.

Summary

The Prerequisites section outlines the necessary prerequisites and assumptions that should be completed before implementing the technical steps described later in this post. This includes any accounts, services, permissions, or configurations that need to be set up in advance.
The Setup section describes the specific scenario and infrastructure environment assumed for demonstrating the CloudWatch dimensions and alarms discussed in this post.
In the Implementation details section, we do a deep dive into the technical implementation, such as code snippets and step-by-step configurations for creating CloudWatch alarms for metric InstanceUtilization grouped by six dimensions outlined earlier.
The Cleaning up section provides steps to prevent ongoing charges after experimenting with the infrastructure and alarms created here.
Finally, in the Conclusion section, we recap the key points explored around CloudWatch, dimensions, metrics, and alarms. This content can serve as a solid foundation for implementing more advanced monitoring, optimization, and architectural best practices going forward.

Prerequisites

This solution needs you to complete the following prerequisites:

Create Capacity Reservations in your account by following the ODCR Workshop self-paced lab. The solution needs scripts from this lab. If you are using any other Capacity Reservations for this lab, then you must use parameters according to your environment setup (for example AWS Region, AZ, platform, and instance match criteria)
All the code used in this post is publicly available in the accompanying GitHub repository. Refer to the json included in the GitHub repository for the AWS Identity and Access Management (IAM) role permissions for IAM users used in the solution.
Refer to the preceding GitHub repository for the code, and save the txt file in the same directory with other Python scripts. You may want to run the requirements.txt file if you don’t have appropriate dependencies to run the rest of the Python scripts. You can run this using the following command:

pip3 install -r requirements.txt

Setup

For this post, we have provided Python scripts to create CloudWatch alarms for Capacity Reservation usage metric for ODCR, for example InstanceUtilization. These alarms can be grouped using the new dimensions: InstanceType, AvailabilityZone, InstanceMatchCriteria, InstancePlatform, Tenancy, CapacityReservationId, or across the Capacity Reservations within a selected Region. We also created an Amazon Simple Notification Service (Amazon SNS) topic named ODCRAlarmTopic to notify you when there’s a breach with your CloudWatch alarm’s threshold.

To get started, download the scripts for creating a CloudWatch alarm using each aforementioned dimension from the GitHub repository in the Prerequisites section.

We envision a scenario where multiple Capacity Reservations exist across a Region in your AWS account. The goal is to identify any unused Capacity Reservations to optimize capacity usage and reduce unnecessary charges. Unused capacity can be identified by creating a CloudWatch alarm for the InstanceUtilization metric. The alarm can be grouped by one of six dimensions: AZ, Instance Match Criteria, Instance Type, InstancePlatform, Tenancy, or across the Capacity Reservations. You must set the alarm threshold that aligns to your usage optimization targets.

With Capacity Reservations, charges apply to any unused capacity. Users accept these charges because reservations provide capacity assurance. However, with improved reservation usage, users can make sure their reserved capacity is fully used. A triggered CloudWatch alarm signifies unused capacity. It can notify users to take near real-time action to optimize capacity and eliminate charges for unused reservations.

Implementation details

The following sections show the implementation details of alarms for each dimension.

For each alarm that we create in this section, you can set a threshold based on your usage optimization goals. For this post, we are setting the threshold to 75%. When these alarms are in place and the CloudWatch alarm breaches that threshold, the system enters an alarm state and sends an SNS notification to ODCRAlarmTopic. This process helps identify and address potential issues or opportunities for optimization related to the specific monitored dimension.

Creating CloudWatch alarm using AllCapacityReservations dimension

In this scenario, an organization is currently using the Capacity Reservations at 100% usage, but it needs to be notified when the total capacity usage drops to less than or equal to the threshold value. To do so, we use the InstanceUtilization metric for ODCR and group it with the AllCapacityReservations dimension. You can run the by_all_capacity_reservations.py script provided in the GitHub repository to create this CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
EmailAddress: The email address to receive notifications (for example [email protected]).

Optional input parameters

Dimension (default: AllCapacityReservations): The dimension for the CloudWatch alarm.
MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
Protocol (default: email): The protocol for the SNS subscription.
TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_all_capacity_reservations.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to ODCRAlarmTopic>>

This creates a CloudWatch alarm that monitors InstanceUtilization for the Capacity Reservations in the Region you specified. You can confirm the alarm has been created by reviewing the by_all_capacity_reservations.log created in the same folder where you ran the script from. The following is an example log file content that confirms the creation of the alarm.

2024-10-17 19:32:34,922 __main__ INFO: Creating CloudWatch Alarm for the InstanceUtilization metric with AllCapacityReservations dimension in the us-east-1 region.

2024-10-17 19:32:35,514 __main__ INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:32:35,516 __main__ INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1: XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:32:35,986 __main__ INFO: Successfully created CloudWatch Alarm for the InstanceUtilization metric with AllCapacityReservations dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms and search for ODCRAlarm-InstanceUtilization-AllCapacityReservations.

Figure 1: Example AllCapacityReservations Alarm Setup validation using CloudWatch console

Creating CloudWatch alarm using InstanceType dimension

After receiving an alert on total Capacity Reservations usage dropping below the threshold, you may want to view the usage drop by a specific instance type. To do so you can use the InstanceUtilization metric for ODCR and group it with the InstanceType dimension. You can use the by_instanceType.py script provided in the GitHub repository to create this CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
EmailAddress: The email address to receive notifications (for example [email protected]).
InstanceType: The instance type for the CloudWatch alarm (for example t2.micro).

Optional input parameters

Dimension (default: InstanceType): The dimension for the CloudWatch alarm.
MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
Protocol (default: email): The protocol for the SNS subscription.
TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_instanceType.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to the ODCRAlarmTopic>> --InstanceType t2.micro

This should create the InstanceType dimension alarm. You can confirm this by reviewing the by_instanceType.log created in the same folder where you ran the script from. The following is an example log file content that confirms the creation of this alarm.

2024-10-17 19:46:07,288 __main__ INFO: Creating CloudWatch Alarm for the InstanceUtilization metric with InstanceType dimension in the us-east-1 region.

2024-10-17 19:46:07,804 __main__ INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:46:07,804 __main__ INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:46:08,285 __main__ INFO: Successfully created CloudWatch Alarm for the InstanceUtilization metric with InstanceType dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms and, for example, search for ODCRAlarm-InstanceUtilization-InstanceType-t2.micro.

Figure 2: Example InstanceType Alarm Setup validation using CloudWatch console

Creating CloudWatch alarm using AvailabilityZone dimension

After receiving an alert on total Capacity Reservations usage at the instance level dropping below the threshold, you may want to view the usage drop by a specific AZ. You can do so by using the InstanceUtilization metric for ODCR and group it with the AvailabilityZone dimension. You can use the by_availabilityZone.py script provided in the GitHub repository to create this CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
EmailAddress: The email address to receive notifications (for example [email protected]).
AvailabilityZone: The AZ for the CloudWatch alarm (for example us-east-1a).

Optional input parameters

Dimension (default: AvailabilityZone): The dimension for the CloudWatch alarm.
MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
Protocol (default: email): The protocol for the SNS subscription.
TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_availabilityZone.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to the ODCRAlarmTopic>> --AvailabilityZone us-east-1b

This should create the AvailabilityZone alarm. You can confirm this by reviewing the by_availabilityZone.log created in the same folder where you ran the script from. The following is an example log file content that confirms the creation of this alarm.

2024-10-17 19:38:39,141 __main__ INFO: Creating CloudWatch Alarm for the InstanceUtilization with AvailabilityZone dimension in the us-east-1 region.

2024-10-17 19:38:39,667 __main__ INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:38:39,667 __main__ INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:38:40,172 __main__ INFO: Successfully created CloudWatch Alarm for the InstanceUtilization metric with AvailabilityZone dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms, and search for ODCRAlarm-InstanceUtilization-AvailabilityZone-us-east-1b.

Figure 3: Example AvailabilityZone Alarm Setup validation using CloudWatch console

Create CloudWatch alarm using the InstancePlatform dimension

Based on workload requirements, organizations use different platforms such as Windows and Linux/UNIX for EC2 instances. They may want to be notified when the usage drops below threshold for a particular platform. To achieve this, we can use the InstanceUtilization metric for ODCR and group it with the InstancePlatform dimension. You can use the by_platform.py script provided in the GitHub repository to create the CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
EmailAddress: The email address to receive notifications (for example [email protected]).
InstancePlatform: The InstancePlatform for the CloudWatch alarm. For exampleE: ‘Linux/UNIX’. Supported InstancePlatform are’Linux/UNIX’,’Red Hat Enterprise Linux’,’SUSE Linux’,’Windows’,’Windows with SQL Server’,’Windows with SQL Server Enterprise’,’Windows with SQL Server Standard’,’Windows with SQL Server Web’,’Linux with SQL Server Standard’,’Linux with SQL Server Web’,’Linux with SQL Server Enterprise’,’RHEL with SQL Server Standard’,’RHEL with SQL Server Enterprise’,’RHEL with SQL Server Web’,’RHEL with HA’,’RHEL with HA and SQL Server Standard’,’RHEL with HA and SQL Server Enterprise’,’Ubuntu Pro’

Optional input parameters

Dimension (default: InstancePlatform): The dimension for the CloudWatch alarm.
MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
Protocol (default: email): The protocol for the SNS subscription.
TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_platform.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to the ODCRAlarmTopic>> --InstancePlatform Linux/Unix

This should create the InstancePlatform alarm. You can confirm this by reviewing the by_platform.log created in the same folder where you ran the script from. The following log entry shows a confirmation the creation of this alarm.

2024-10-17 19:52:03,839 __main__ INFO: Creating CloudWatch Alarm for the InstanceUtilization with Platform dimension in the us-east-1 region.

2024-10-17 19:52:04,345 __main__ INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:52:04,345 __main__ INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:52:04,854 __main__ INFO: Successfully created CloudWatch Alarm for the InstanceUtilization with Platform dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms, and search for ODCRAlarm-InstanceUtilization-Platform-Linux/Unix.

Figure 4: Example AvailabilityZone Alarm Setup validation using CloudWatch console

Creating CloudWatch alarm using the InstanceMatchCriteria dimension

Capacity Reservations are configured as either open or targeted. If the Capacity Reservation is open, then the new and existing instances that have matching attributes automatically run in the capacity of the Capacity Reservation. If the Capacity Reservation is targeted, then instances must specifically target it to run in the reserved capacity. Organizations using these configurations may want to be notified when instance usage drops in either open or targeted Capacity Reservations. To achieve this, we use the InstanceUtilization metric for ODCR and group it with the InstanceMatchCriteria dimension. You can use the by_instanceMatchCriteria.py script provided in the GitHub repository to create the CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
EmailAddress: The email address to receive notifications (for example [email protected]).
InstanceMatchCriteria: The tenancy for the CloudWatch alarm. Supported values are ‘open’ and ‘targeted’.

Optional input parameters

Dimension (default: InstanceMatchCriteria): The dimension for the CloudWatch alarm.
MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
Protocol (default: email): The protocol for the SNS subscription.
TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_instanceMatchCriteria.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to the ODCRAlarmTopic>> --InstanceMatchCriteria open

This should create the InstanceMatchCriteria alarm. You can confirm this by reviewing the by_instanceMatchCriteria.log created in the same folder where you ran the script from. The following log entry confirms the creation of such alarm.

2024-10-17 19:43:25,463 __main__ INFO: Creating CloudWatch Alarm for the InstanceUtilization with InstanceMatchCriteria dimension in the us-east-1 region.

2024-10-17 19:43:25,996 __main__ INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:43:25,996 __main__ INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:43:26,552 __main__ INFO: Successfully created CloudWatch Alarm for the InstanceUtilization metric with InstanceMatchCriteria dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms, and search for ODCRAlarm-InstanceUtilization-InstanceMatchCriteria-open.

Figure 5: Example InstanceMatchCriteria Alarm Setup validation using CloudWatch console

Creating CloudWatch alarm using the Tenancy dimension

By default, EC2 instances run on shared tenancy hardware. However, if users want, they can also choose dedicated tenancy. Organizations using both types of tenancy for their workload may want to be notified when instance usage drops in either of these tenancies. To achieve this, we use the InstanceUtilization metric for ODCR and group it with the Tenancy dimension. You can run the by_tenancy.py script provided in the GitHub repository to create the CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
EmailAddress: The email address to receive notifications (for example [email protected]).
Tenancy: The tenancy for the CloudWatch alarm. Supported Tenancy are ‘default’ and ‘dedicated’.

Optional input parameters

Dimension (default: Tenancy): The dimension for the CloudWatch alarm.
MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
Protocol (default: email): The protocol for the SNS subscription.
TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

This should create the Tenancy dimension alarm successfully. You can confirm this by reviewing the by_tenancy.log created in the same folder where you ran the script from. The following entry confirms the creation of the alarm.

2024-10-17 19:56:14,331 __main__ INFO: Creating CloudWatch Alarm for the InstanceUtilization with Tenancy dimension in the us-east-1 region.

2024-10-17 19:56:14,809 __main__ INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:56:14,810 __main__ INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:56:15,287 __main__ INFO: Successfully created CloudWatch Alarm for the InstanceUtilization with Tenancy dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms and search for ODCRAlarm-InstanceUtilization-Tenancy-default.

Figure 6: Example Tenancy Dimension Alarm Setup validation using CloudWatch console

Other options

As of this post’s publication, there is no native support to create a CloudWatch alarm on two dimensions. However, you can create a custom CloudWatch metric and create an alarm on that metric.

Cleaning up

If you used the ODCR workshop to create Capacity Reservations in your account, then follow the Clean-up step of the workshop to delete the Capacity Reservations and EC2 instances to stop incurring any charges. If you created any other EC2 instances or Capacity Reservations for this post, terminate those EC2 instances and cancel those Capacity Reservations.

To delete the alarms you created in this post, follow these steps given in the CloudWatch documentation.

Conclusion

In this post, we explored how to use the new Amazon CloudWatch dimensions for Amazon EC2 ODCR to efficiently monitor and maintain constant Capacity Reservations and achieve a higher level of usage, thereby saving costs associated with unused capacity. By automating the creation of CloudWatch alarms for Capacity Reservation usage metrics, specifically InstanceUtilization, you can gain more granular insights into your reserved capacity. This includes grouping metrics by Instance Type, Availability Zone, Platform, Instance Match Criteria, Tenancy, or across the Capacity Reservations in a Region.

We also used an Amazon SNS topic to receive near-real time alerts when thresholds are breached. These tools enable you to effectively monitor and optimize your ODCR usage, making sure that you maintain efficient and cost-effective capacity management during critical events.

For more details, refer to the updated Capacity Reservations documentation. If you have any questions or feedback, feel free to share them in the comments section or contact AWS Support.

Author Bios

	Ballu Singh Ballu Singh is a Principal Solutions Architect at AWS. He lives in the San Francisco Bay area and helps users architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.
	Ankush Goyal Ankush is an Enterprise Support Lead in AWS Enterprise Support who helps Enterprise Support users streamline their cloud operations on AWS. He enjoys working with users to help them design, implement, and support cloud infrastructure. He is a results-driven IT professional with over 20 years of experience.
	Hasan Tariq Hasan Tariq is a Principal Solutions Architect with AWS. He helps Financial Services users accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions.
	Ninad Joshi Ninad Joshi is a Senior Solutions Architect at AWS, helping global AWS users design secure, scalable, and cost-effective solutions in cloud to solve their complex real-world business challenges. Ninad specializes in AI/ML and Generative AI. Prior to joining AWS, Ninad worked as a software developer for 13+ years. Outside of his professional endeavors, Ninad enjoys playing chess and exploring different gambits.

AWS Weekly Roundup: AWS Lambda, Amazon Bedrock, Amazon Redshift, Amazon CloudWatch, and more (Nov 4, 2024)

2024-11-04 Matheus Guimaraes

Post Syndicated from Matheus Guimaraes original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-aws-lambda-amazon-bedrock-amazon-redshift-amazon-cloudwatch-and-more-nov-4-2024/

The spooky season has come and gone now. While there aren’t any Halloween-themed releases, AWS has celebrated it in big style by having a plethora of exciting releases last week! I think it’s safe to say that we have truly entered the ‘pre’ re:Invent stage as more and more interesting things are being released every week on the countdown to AWS re:Invent 2024.

There is a lot to cover, so let me put my wizard hat on, open the big bag of treats, and dive into last week’s goodies!

Something for developers
There was no shortage of treats from AWS for developers this Halloween!

AWS enhances the Lambda application building experience with VS Code IDE and AWS Toolkit — AWS has enhanced AWS Lambda development with the AWS Toolkit for Visual Studio Code, providing a guided setup for coding, testing, and deploying Lambda applications directly within the IDE. It includes sample walkthroughs and one-click deployment, simplifying the development process. Now, building apps with Lambda is as intuitive as crafting a spell in a wizard’s workshop!

AWS Amplify integration with Amazon S3 for static website hosting — AWS Amplify Hosting now integrates with Amazon S3 for seamless static website hosting, with global CDN support via Amazon CloudFront. This simplifies set up, offering secure, high-performance delivery with custom domains and SSL certificates. Hosting your site is now easier than spotting a jack-o’-lantern on Halloween night!

AWS Lambda now supports AWS Fault Injection Service (FIS) actions — AWS Lambda now supports AWS Fault Injection Simulator (FIS) actions, enabling developers to test resilience by injecting controlled faults like latency and execution errors. This helps simulate real-world failures without code changes, improving monitoring and operational readiness. Great for testing that old candy dispenser!

AWS CodeBuild now supports retrying builds automatically — AWS CodeBuild now offers automatic build retries, allowing developers to set a retry limit for failed builds. This reduces manual intervention by automatically retrying builds up to the specified limit, tackling those pesky, intermittent failures like a ghostbuster clearing a haunted pipeline!

Amazon Virtual Private Cloud launches new security group sharing features — Amazon VPC now supports sharing security groups across multiple VPCs within the same account and with participant accounts in shared VPCs. This streamlines security management and ensures consistent traffic filtering across your organization. Now, keeping your network secure is as seamless as warding off digital goblins!

Amazon DataZone expands data access with tools like Tableau, Power BI, and more — Amazon DataZone now supports the Amazon Athena JDBC Driver, allowing seamless access to data lake assets from BI tools, like Tableau and Power BI. This lets analysts connect and analyze data with ease. Now, accessing data is as effortless as a witch flying on her broomstick!

Generative AI
Amazon Q and Amazon Bedrock continue to make generative AI seem like magic. Here are some releases from last week.

Amazon Q Developer inline chat — Amazon Q Developer has introduced inline chat support, allowing developers to engage directly within their code editor for actions like optimization, commenting, and test generation. Real-time inline diffs make it simple to review changes, available in Visual Studio Code and JetBrains IDEs. It’s practically code magic – no witch’s cauldron needed!

Meta’s Llama 3.1 8B and 70B models are now available for fine-tuning in Amazon Bedrock — Amazon Bedrock now supports fine-tuning for Meta’s Llama 3.1 8B and 70B models, allowing developers to customize these AI models with their own data. With a 128K context length, Llama 3.1 processes large text volumes efficiently, making it perfect for domain-specific applications. Now, your AI won’t be scared of handling monstrous amounts of data—even on a dark, stormy night!

Fine-tuning for Anthropic’s Claude 3 Haiku in Amazon Bedrock is now generally available — Fine-tuning for the Claude 3 Haiku model in Amazon Bedrock is now generally available, enabling customization with your data for better accuracy. Make your AI as unique as your Halloween costume!

Cost Planning, Saving, and Tracking
Here are some new releases that can help you stay on top of your budget and keep an eye on the amount of candy that you buy.

AWS now accepts partial card payments — AWS now supports partial payments with credit or debit cards, letting users split monthly bills across multiple cards. This flexibility makes managing your budget as smooth as a ghost gliding through a haunted house!

Amazon Bedrock now supports cost allocation tags on inference profiles — Amazon Bedrock now supports cost allocation tags for inference profiles, allowing customers to track and manage generative AI costs by department or application. This makes financial management a treat, not a trick!

AWS Deadline Cloud now adds budget-related events — AWS Deadline Cloud, a service used for rendering and managing visual effects and animation workloads, now sends budget-related events via Amazon EventBridge, enabling real-time spending updates and automated notifications. This helps keep project costs under control without any unexpected scares!

And the busiest team of the week award goes to…Amazon Redshift!
Clearly, the Amazon Redshift team loves Halloween and decided to celebrate in big style with many releases! Here are the highlights:

Amazon Redshift integration with Amazon Bedrock for generative AI — Amazon Redshift now integrates with Amazon Bedrock for generative AI tasks using SQL, adding AI capabilities like text generation directly in your data warehouse. Now, you can conjure up rich insights without the need for complicated spells!

Announcing general availability of auto-copy for Amazon Redshift — Auto-copy for continuous data ingestion from Amazon S3 into Amazon Redshift is now generally available. This streamlines workflows, making data integration as smooth as carving a soft pumpkin!

Amazon Redshift now supports incremental refresh on Materialized Views (MVs) for data lake tables — Amazon Redshift now supports incremental refresh for materialized views on data lake tables, updating only changed data to boost efficiency. This keeps your data fresh without any haunting overhead!

Announcing Amazon Redshift Serverless with AI-driven scaling and optimization — Amazon Redshift Serverless now offers AI-driven scaling, adjusting resources automatically based on workload. This ensures smooth performance without any chilling surprises!

CSV result format support for Amazon Redshift Data API — Amazon Redshift Data API now supports CSV output for SQL query results, enhancing data processing flexibility. This makes handling data as smooth as a ghost’s whisper!

Halloween week contest runner-up…Amazon CloudWatch!
The Amazon CloudWatch team has also been busy giving out candy this Halloween! Let’s check it out.

Amazon CloudWatch now monitors EBS volumes exceeding provisioned performance — Amazon CloudWatch now provides metrics to check if Amazon EBS volumes exceed their IOPS or throughput limits. This helps quickly spot and resolve performance issues before they turn into a haunting problem!

New Amazon CloudWatch metrics for monitoring I/O latency of Amazon EBS volumes — Amazon CloudWatch now offers metrics for average read and write I/O latency of Amazon EBS volumes, aiding in identifying performance issues. These insights are available per minute at no extra cost. This should help you prevent latency from sneaking up on you like a Halloween ghost!

Amazon ElastiCache for Valkey adds new CloudWatch metrics to monitor server-side response time — Amazon ElastiCache now includes metrics for read and write request latency, helping monitor server response times. This aids in quickly spotting and resolving performance issues before they become a frightful surprise!

Conclusion
And that’s a wrap on Halloween 2024. I don’t know about you, but this is my favorite time of the year, followed by News Year’s. Both carry an element of unpredictability that I very much enjoy. With Halloween, you can get excited about what costume you’re going to wear, whereas New Year’s is all about new possibilities and conquering new horizons.

Luckily, you don’t have to wait for the new year to unlock new frontiers with AWS as we bring excitement and innovation throughout the year. And what better way to see this in action than at AWS re:Invent 2024!

I wonder what kinds of spells and surprises we’ll be conjuring up come Halloween 2025. Until next time, keep your eyes on the horizon—and your broomsticks at the ready!

Introducing an enhanced local IDE experience for AWS Lambda developers

2024-10-31 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-an-enhanced-local-ide-experience-for-aws-lambda-developers/

AWS Lambda is introducing an enhanced local IDE experience to simplify Lambda-based application development. The new features help developers to author, build, debug, test, and deploy Lambda applications more efficiently in their local IDE when using Visual Studio Code (VS Code).

Overview

The IDE experience is part of the AWS Toolkit for Visual Studio Code. A new guided walkthrough helps developers set up their local environment and install required tools. The toolkit includes a set of sample applications which show you how to iterate on your code locally and in the cloud. You can configure and save build settings to speed up application builds. Generate a configuration file to set up the debugging environment for VS Code to attach and launch the step-through debugger. Iterate faster by choosing to sync local application changes quickly to the cloud or perform a full application deploy. Test functions locally and in the cloud and create and share test events to speed up local and cloud testing. There are quick action buttons for build, deploy to cloud, and local or remote invoke. The toolkit integrates with AWS Infrastructure Composer, providing a visual application building experience directly from the IDE.

Using the new features

Installing the extension

To use the updated IDE experience, ensure you have the AWS Toolkit minimum version 3.31.0 installed as a VS Code Extension.

The AWS Toolkit now includes an additional section called Application Builder within the AWS extension side-bar. This allows you to view template resources and create, build, debug, and test serverless applications.

Using Application Builder for existing applications

You can open an existing local application template using Open Folder.

Lambda’s enhanced in-console editing experience allows you to download existing function code and an AWS Serverless Application Model (AWS SAM) template. This allows you to start in the console and more easily move to using infrastructure as code, which is a serverless best practice.

Using the guided walkthrough

The guided walkthrough helps you install dependencies, select an application template, and explains how to use Application Builder to iterate locally and deploy to the cloud.

Choose Open Walkthrough which opens the walkthrough.
Complete installation takes you through a wizard to install required dependencies and select application templates.

The wizard provides download links to install the three dependencies:

If you have installed the dependencies, selecting the links recognizes the installations.

Select Choose your application template, which allows you to create example applications in VS Code.

The Iterate locally tile provides guidance on how to use Application Builder to build and invoke the function, and how to view the results.

Deploy to the cloud provides a link to Configure credentials and explains how to deploy your function to the cloud, remote invoke from your IDE, and view the results.

Creating an application from the samples

The following steps show how to create a function locally from an included template. You build the code artifact, locally test and debug, deploy, and remotely invoke and view results and logs, all without leaving your IDE.

Navigate back to Choose your application template.
New template with visual builder allows you to use Infrastructure Composer to create a new application using a visual canvas.
See more application examples provides additional sample applications across a number of managed runtimes.

There are also two specific example applications to explore Lambda functionality.

Rest API: Learn how to build a synchronous Lambda function behind an API.
File processing.: Learn how to build an asynchronous Amazon S3 file processing application.

Building a synchronous Rest API application

Select Rest API and chose Initialize your project.
Select a language runtime. Select Python for this example.
Open the file explorer, create a folder to download the example application and choose Create Project.

Application Builder downloads the application. This includes the function code hello_world\app.py, with dependencies in requirements.txt, an AWS SAM template, template.yaml file, and an example event trigger, event.json. A README.md file explains the application structure and provides build and deploy instructions.

The Application Builder section populates with the template resources.

The icons provide shortcuts to view, build, and deploy the application.

You can also use the Command Pallete to initiate the AWS SAM commands.

Selecting the Open Template File icon opens the AWS SAM template in Infrastructure Composer.

View the application resources and select Details to edit the template using the visual canvas.

Navigate to the function resource and select Open Function Handler to show the function code.

Building the application

The build step helps you build artifacts from the files in your application project directory.

Select the Build SAM template icon.
Specify build flags allows you to configure AWS SAM builds settings.

Select build settings particular to your configuration. Cached and parallel are useful to speed up future builds. Use container builds your function in a Lambda-like container. This allows you to build applications without having the language runtime and build tools installed locally.

Save parameters adds the default build options in samconfig.toml.

version = 0.1
[default.build.parameters]
template_file = "c:\\Code\\lambda-dx\\Rest-API\\template.yaml"
cached = true
parallel = true
use_container = true

AWS SAM builds the application. It downloads the build container image, installs the dependencies, and copies the function code.

Press any key to close the additional terminal.

Iterate locally: invoke and debug

You can locally invoke and debug your serverless application before uploading it to the cloud. This helps you to test the logic of your function faster. Step-through debugging allows you to identify and fix issues in your application one instruction at a time in your local environment.

Local invoke

In the Application Builder section, navigate to the function and select Local Invoke and Debug Configuration.

Initiating Local Invoke and Debug Configuration

This brings up another window which allows you to configure how to invoke the function locally and set up a debug configuration.

Viewing Local Invoke and Debug Configuration Options

You can create sample event payloads to test your function. Select an event provides a list of common trigger event payloads you can use and customize.

Selecting an example event template

This example application has an included sample event.

Select Local file and choose the events\event.json file.
Select the Invoke button.

This builds the application and locally invokes the function within a Lambda emulation environment, using the event input file.

View the function output within the IDE Output pane.

Viewing function output

Local debugging

You can also debug the function locally using VS Code’s built-in debugger.

Add a breakpoint to the function code.

Adding a breakpoint to the function code

Select the Invoke button again.

This locally invokes the function and attaches a debugger to the Lambda emulation environment.

The debugger stops at the breakpoint and you can view the function variables and call stack.

Viewing step through debugging

Use the VS Code debugger icons to step through the code.

Using VS Code debugger icons to step through the code.

In the Local Invoke and Debug Configuration panel. Chose Save Debug Config.
Choose Add Local Invoke and Debug Configuration.

Saving debug configuration

Enter a debug configuration name which creates a launch.json file and adds the debug configuration.

Naming debug configuration

You can create and save multiple debug configurations for different scenarios. See the AWS SAM documentation for more launch.json configuration options.

Once you save the debug configuration, you can use VS Code’s Run and Debug panel and select which debug configuration to run.

Using the Run and Debug panel

Deploying the application

Navigate to the Application Builder section and chose the Deploy SAM Application icon.

Selecting Deploy SAM Application icon

AWS SAM provides two deployment options:

Sync uses AWS SAM sync to perform an initial CloudFormation deploy and then allows for quick syncing of your application code, which allows for rapid prototyping. Use this for development environments only, as it doesn’t do a full CloudFormation deploy on code changes.
Deploy does a full CloudFormation deploy, which is preferred for non-quick development environments.

Viewing AWS SAM deployment options

Select Sync.
Select Specify required parameters and save as defaults.

Specifying SAM sync parameters

Select a Region to deploy the stack and enter a stack name. It is good practice to specify that this is a dev stack to avoid confusion when using the Deploy option.

Entering dev stack name

Select an existing S3 bucket to store the artifacts, or create a new one.

Selecting S3 bucket

Specify the Sync parameters. Ensure you select Watch as this automatically watches for code changes and quickly syncs code changes to the Lambda service

Setting sync parameters

AWS SAM sync does an initial CloudFormation deploy to build the resources and then waits for code changes.
Make a change to the handler file code and save the file,

Amending code

This performs a quick sync which reduces the time to test in the cloud.

Quickly syncing code

You can use the Deploy option to deploy a non-quick sync test version, amending the stack name to differentiate it from the dev stack.

Naming test version stack

Remote invoke

You can invoke the function in the cloud from your IDE. This allows you to test functionality without having to mock security, external services, or other environment variables.

Once the application is deployed, Application Builder detects changes to samconfig.toml and template.yaml, it updates the resources list with the cloud resources.

Viewing cloud resources

You can browse directly to the CloudFormation stack to view resources.

Browsing to CloudFormation stack

Selecting the function provides quick link functionality, which includes function details and a link directly to the Lambda console for the function.

Viewing function quick link options

Select Invoke in cloud.
Select the same local event file for the local invoke.

Selecting local file for remote invoke

Choose Remote Invoke.

The function invokes in the cloud using the local test event and displays the remote invoke results in the local IDE Output pane.

Viewing remote invoke results

Name and save the local event file as a remote event which becomes available in the Lambda console.

Saving remote test event

Viewing logs

You can fetch the Amazon CloudWatch Log streams generated by your Lambda function in the IDE.

Select the Search Logs icon.

Selecting Search Logs icon

You can optionally filter the results.

Optionally filtering log results

Conclusion

Lambda is introducing an enhanced local IDE experience to simplify the development of Lambda-based applications using the VS Code IDE and AWS Toolkit. This streamlines the code-test-deploy-debug cycle. A guided walkthrough helps set up your local development environment and provides sample applications to explore Lambda functionality. You can then build, debug, test, and deploy Lambda applications using icon shortcuts and the Command Pallette. This allows you to more easily iterate on your Lambda-based applications without switching between multiple interfaces.

For more serverless learning resources, visit Serverless Land.

Simplifying Lambda function development using CloudWatch Logs Live Tail and Metrics Insights

2024-10-17 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/simplifying-lambda-function-development-using-cloudwatch-logs-live-tail-and-metrics-insights/

This post is written by Shridhar Pandey, Senior Product Manager, AWS Lambda

Today, AWS is announcing two new features which make it easier for developers and operators to build and operate serverless applications using AWS Lambda. First, the Lambda console now natively supports Amazon CloudWatch Logs Live Tail which provides you real-time visibility into Lambda function logs, making it easier to develop and troubleshoot Lambda functions. Second, the Lambda console now offers Amazon CloudWatch Metrics Insights dashboard for key Lambda function metrics, enabling you to easily identify and troubleshoot the source of errors or performance issues.

This blog post dives into the new capabilities enabled by these launches, how these capabilities simplify the developer and operator experience for building serverless applications with Lambda, and how you can get started with them.

Native CloudWatch Live Tail logs in Lambda console

Customers building serverless applications using Lambda want visibility into the behavior of their Lambda functions in real time, such as when an error occurs or a code change causes unexpected behavior. For example, developers want to instantly see the result of their code or configuration changes on the behavior of the function, and operators want to quickly troubleshoot any critical issues which would prevent the function from operating smoothly.

To help you monitor and troubleshoot the behavior of your function, the Lambda service automatically captures and sends logs to CloudWatch Logs. However, previously, you had to wait for Lambda function logs to be ingested, processed, and stored by CloudWatch Logs before you could view them. Then, you had to navigate to the CloudWatch console to view, search, and query logs using tools like CloudWatch Logs Insights. This caused frequent context switching between the Lambda and CloudWatch consoles in order to access logs, which introduced friction in the process of rapidly developing and troubleshooting Lambda functions.

The Lambda console now natively supports CloudWatch Logs Live Tail, an interactive log streaming and analytics capability which enables developers and operators to view and analyze their Lambda function logs in real time. This capability provides a built-in, real-time view of function logs as they become available in the Lambda console, as seen in the following image. Developers can now easily start a Live Tail session with one click and view the latest log entries as their function is executing. So, they can now edit the function code, deploy changes, invoke the function, and view the result of their code change in real time, without navigating away from the Lambda console. This streamlines and accelerates the author-test-deploy cycle (also known as the “inner dev loop”) when building serverless applications using Lambda.

Figure 1 CloudWatch Logs Live Tail in Lambda console

The Live Tail experience in Lambda console also offers fine-grained log analysis capabilities to filter logs, making it easier for operators and DevOps teams to debug and troubleshoot issues and critical errors in their Lambda functions. For example, while investigating errors in the Lambda function, operators can apply filter patterns to only display log events containing keywords of interest e.g., ERROR, exception, etc. This helps narrow down the search to relevant log events and cut out the noise, reducing the mean time to recovery (MTTR) when troubleshooting Lambda function errors. Thus, Live Tail enables operators to proactively monitor the health of their serverless applications built using Lambda and react faster to resolve errors or unexpected behavior.

Live Tail in action

To use Live Tail capabilities on any CloudWatch log group, you must have logs:StartLiveTail, logs:StopLiveTail, and logs:DescribeLogGroups AWS Identity and Access Management (IAM) permissions for that CloudWatch log group. Alternatively, you can add CloudWatchLogsReadOnlyAccess managed IAM policy (which contains these IAM permissions) to your IAM role. See Overview of managing access permissions to your CloudWatch Logs resources to learn more.

To get started with Live Tail in the Lambda console:

Navigate to the Lambda console at https://console.aws.amazon.com/lambda/
In the Functions page, select the Lambda function for which you want to view Live Tail logs.
In the Code tab, select Run and Debug icon in the Activity Bar on the left-hand side of the code editor. This opens the Run and Debug view.
Select Open CloudWatch Live Tail. This opens the CloudWatch Logs Live Tail bottom drawer.Figure 2: Starting Live Tail from code editor in Lambda console
Select Start to start a Live Tail session and view your Lambda function logs stream in real time. Alternatively, visit Test tab and select CloudWatch Logs Live Tail to start a Live Tail session.

Figure 3: Active Live Tail session

The CloudWatch Logs Live Tail bottom drawer features a Filter panel on the left-hand side, which contains useful controls such as CloudWatch log group selection, option to filter logs, and the Start and Stop buttons. You can collapse this panel if you want to utilize the entire width of your screen to view logs (without horizontal scrolling) in the Live Tail session, as shown in the following image.

Figure 4: Active Live Tail session with collapsed Filter Panel

The Filter panel has the CloudWatch log group corresponding to your Lambda function selected by default, but you can select other log groups. You can select up to 5 log groups at a time. You can also stop the Live Tail session at any time by selecting Stop in the Filter panel.

To filter logs based on specific terms or keywords, apply patterns using the “Add filter pattern” option in the Filter panel. The filters field is case sensitive. You can specify keywords, phrases, numeric values, or regular expressions in the filter pattern. See Filter pattern syntax for metric filters, subscription filters, filter log events, and Live Tail to learn more about how to use filter patterns to display only log events of interest. For example, you can filter log events containing specific keywords or phrases (e.g., ERROR, FATAL, exception, etc.) as shown in the following image.

Figure 5: Active Live Tail session with multiple log groups selected and filter patterns applied

The Live Tail session automatically stops after 15 minutes (i.e., 900 seconds) of inactivity or when the Lambda console session times out. However, when you restart the Live Tail session, the previously applied filtering criteria will be retained. This means, you can pick up where you left off with just one click.

You get 1,800 minutes of Live Tail session usage per month for free with the AWS Free Tier, after which you pay $0.01 per minute of usage. See CloudWatch pricing page for Live Tail pricing details.

The Live Tail experience in Lambda console is available in all commercial AWS Regions where AWS Lambda and Amazon CloudWatch Logs are available. See Lambda documentation and this introductory video to learn more about native Live Tail support for Lambda.

CloudWatch Metrics Insights dashboard in Lambda console

In order to effectively operate distributed applications, easily identifying the source of errors or performance issues is critical. For example, when you notice a spike in critical metrics like errors or invocation duration in your Lambda dashboard, you want to quickly find out which Lambda functions are causing these spikes. Previously, you had to navigate to the CloudWatch console and query metrics or create custom dashboards.

Now, the Lambda console features a new built-in CloudWatch Metrics Insights dashboard which provides you instant visibility into critical insights for Lambda functions in your account, such as “most invoked Lambda functions”, “functions with highest number of errors”, and “functions taking the longest to run”. The dashboard leverages CloudWatch Metrics Insights capability to enable you to easily identify functions driving the highest usage, errors, and performance issues. Thus, the Metrics Insights dashboard surfaces key insights right where you need them, reducing friction due to context switching and making it easy for your operator team(s) to identify and fix errors and performance anomalies for your serverless applications built using Lambda.

Metrics Insights dashboard in action

You can easily get started with Metrics Insights dashboard without making any code changes or creating custom dashboards. Simply navigate to the Dashboard page in the Lambda console to start accessing the insights surfaced in the Metrics Insights dashboard. The following image shows the Metric Insights dashboard in the Lambda console.

Figure 6: Lambda Dashboard page with Metrics Insights dashboard

The dashboard shows the top 10 Lambda functions in your AWS account with highest number of invocations, errors, and longest invocation duration. In the example shown in the following image, the Lambda function named 1-LambdaConsoleStack-er4D14B2288-VulWZHExuSFp shows the highest error rate among all functions experiencing errors. This could be a signal to the operator team to prioritize identifying the root cause behind the high error rate for this function.

Figure 7: Metrics Insights dashboard showing top 10 functions with highest errors, invocations, and concurrent executions

The Metrics Insights dashboard displays data for the most recent 3 hours. You can view and query metrics in the CloudWatch console if you require metrics for longer than 3 hours.

Metrics Insights dashboard in Lambda console is now available in all commercial AWS Regions where AWS Lambda and Amazon CloudWatch are available, including the AWS GovCloud (US) Regions, at no additional cost.

Conclusion

This post introduces and illustrates two new Lambda features — native support for CloudWatch Logs Live Tail and Metrics Insights dashboard in the Lambda console. These features simplify the developer and operator experience for serverless applications built using Lambda. Live Tail enables you to view and analyze Lambda logs in real time, which simplifies and accelerates the author-test-deploy cycle and makes it easy to troubleshoot errors in Lambda functions. On the other hand, Metrics Insights dashboard shows key Lambda metrics like errors, invocations, and duration to reduce the mean time to recover (MTTR) from errors and performance issues for Lambda functions.

For more serverless learning resources, visit Serverless Land.

How CyberArk is streamlining serverless governance by codifying architectural blueprints

2024-10-11 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-cyberark-is-streamlining-serverless-governance-by-codifying-architectural-blueprints/

This post was co-written with Ran Isenberg, Principal Software Architect at CyberArk and an AWS Serverless Hero.

Serverless architectures enable agility and simplified cloud resource management. Organizations embracing serverless architectures build robust, distributed cloud applications. As organizations grow and the number of development teams increases, maintaining architectural consistency, standardization, and governance across projects becomes crucial.

In this post, you will discover how CyberArk, a leading identity security company, efficiently implements serverless architecture governance, reduces duplicative efforts, and saves months of development time by codifying architectural blueprints. This approach helps to prevent redundant efforts and promotes uniform architectural standards, facilitating the seamless adoption of organizational best practices and governance across diverse teams.

Overview

The risk of duplicative efforts and architectural inconsistencies is particularly pronounced in large organizations, especially for requirements unrelated to specific business domains owned by individual teams. Diverse approaches to Infrastructure-as-Code, CI/CD, observability, and security can lead to inconsistent implementations across teams. Application developers should focus on delivering business value efficiently, rather than navigating the complexities of building and operating distributed architectures while adhering to organizational best practices. To achieve this, you need an approach that empowers developers and provides guardrails to ensure vetted architectural patterns are consistently applied. This solution should enable accelerated delivery without sacrificing agility and innovation.

Some organizations implement internal wiki consolidating architectural guidance. While well-intentioned, relying solely on documentation assumes development teams diligently follow the guidelines, which often requires manual validation and limits scalability. To overcome this limitation, organizations should adopt a scalable approach that codifies, automates, and promotes architectural best practices. This mechanism allows developers to focus on delivering business-domain value and drives standardized operational excellence, governance, and organizational policies adherence.

Introducing serverless blueprints

CyberArk engineering team had over 900 developers. It was looking for ways to ensure they build their serverless services based on vetted architectural and security best practices with fully automated governance controls enforcement. The solution came in the form of codified architecture blueprints and automated tooling.

Serverless architectures are composed using loosely coupled services, integrated based on the application requirements. Application developers use IaC tools such as AWS CDK and HashiCorp Terraform to define their serverless architectures and integration patterns. CyberArk has augmented the IaC with governance tools, such as cdk-nag, AWS Config, and AWS Control Tower. With these complementary tools in place, they’ve built serverless blueprints which include architectural definitions based on organizational best practices, as well as automatically applied governance controls

To illustrate this, consider a simple serverless architecture pattern. In this common pattern, an SQS queue serves as the event source for a Lambda function, which parses incoming messages and updates an Amazon S3 bucket.

Figure 1. A simple serverless architecture with SQS Queue, Lambda function, and S3 Bucket

While this pattern seems simple, turning it into an enterprise-ready service requires additional effort. You must consider aspects like resiliency, security, governance, observability, and coding best practices. Let’s examine several examples codified in architectural blueprints at CyberArk.

Error-handling best practices

Your services should be resilient. Retries can help to overcome occasional network hiccups, but you also need to handle scenarios when your function consistently fails to process particular messages (known as poison message) – for example, because of a code bug. This can lead to endless processing loops, data loss, and potential extra charges. To address this, a blueprint can implement a failure handling mechanism with a dead letter queue, alerting, and redrive. This pattern is straightforward to implement and adds extra resiliency to your architecture. It is also generic and does not contain any business domain code. This is a typical example of an architectural pattern that can be codified in a blueprint and reused across development teams.

Figure 2. The simple serverless architecture with added resiliency best practices

Security best practices

Another example is securing S3 buckets. Organizations must enforce S3 security best practices, such as enabling access logs, blocking public access, and enabling encryption at rest. Codifying these guardrails in architectural blueprints adds an extra layer that allows your developers to comply with organization standards without having to explicitly implement adherence to each best practice and policy on their own.

Figure 3. The simple serverless architecture with added security best practices

The following code snippet uses AWS CDK to create an S3 bucket with common best practices:

Enables bucket versioning on production environments only to save costs in non-production environments
Enforces data encryption with AWS-managed keys.
Blocks all public access
Enforces SSL to block all non-secure-transport access
Enables access logs

def _create_bucket(self, server_access_logs_bucket: s3.Bucket, is_production_env: bool) -> s3.Bucket:
    # Create an S3 bucket with AWS-managed keys encryption
    bucket = s3.Bucket(
        self,
        constants.BUCKET_NAME,
        versioned=True if is_production_env else False,
        encryption=s3.BucketEncryption.S3_MANAGED,
        block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        enforce_ssl=True,
        server_access_logs_bucket=server_access_logs_bucket, 
        # redacted
    )

Additional security best practices you can codify in your blueprints include the principle of least privilege access, VPC-attachment, and code signing for sensitive Lambda functions, and using KMS keys for encryption.

Lambda best practices

Your Lambda functions are another example of where blueprints can help. By providing a function blueprint implementing the baseline for capabilities like observability, idempotency, and batch processing out-of-the-box, you enable developers to focus on their business domain code.

Figure 4. Layered view of a Lambda function in CyberArk’s serverless architecture blueprint

CyberArk embeds Powertools for AWS Lambda, a toolkit that implements serverless best practices to increase developer velocity, into their blueprints. The following code snippets embed Powertools for enabling enhanced observability and implementing batch processing.

# CDK code
lambda_function = lambda.Function(
    environment={
        constants.POWERTOOLS_SERVICE_NAME: constants.SERVICE_NAME,
        constants.POWER_TOOLS_LOG_LEVEL: 'INFO',  
    },
    tracing=lambda.Tracing.ACTIVE,
    layers=["powertools-layer"],
    log_format=lambda.LogFormat.JSON.value,
    system_log_level=lambda.SystemLogLevel.INFO.value
    # redacted
)

# Function handler code
processor = BatchProcessor(event_type=EventType.SQS, model=OrderSqsRecord)

@logger.inject_lambda_context
@metrics.log_metrics
@tracer.capture_lambda_handler(capture_response=False)
def lambda_handler(event, context: LambdaContext):
    return process_partial_response(
        event=event,
        record_handler=record_handler,
        processor=processor,
        context=context,
)

Governance controls

Blueprints are not static; they evolve as you adopt new best practices and governance policies. Developers start with a vetted blueprint but can deviate as they evolve their serverless apps. To enable continuous adherence, it is important to use a combination of organizational governance tools, such as AWS Control Tower and Service Control Policies, and architecture blueprints that embed governance controls automatically enforced by CI/CD. This ensures that any architectural modification will be validated for adhering to organizational standards.

AWS defines proactive controls as mechanisms that prevent developers from deploying resources that violate governance policies. Detective controls are mechanisms that detect, log, and alert on resource or configuration changes that violate governance policies.

Figure 5. Applying governance controls at all stages of CI/CD

Depending on the IaC tool, you can leverage different types of governance tools for proactive control enforcement. The following screenshot shows a proactive control violation identified during CI/CD via the cdk-nag framework. You can see cdk-nag throwing an error for the stack deployment due to Lambda execution role being assigned wild-card permissions.

Figure 6. Exception thrown by cdk-nag for using wildcard permissions

See the practical guide for implementing serverless governance.

Sample code

Ran Isenberg has open-sourced a sample Lambda Handler Cookbook blueprint illustrating some of the patterns CyberArk has adopted.

Additional serverless architecture patterns you might consider implementing in your blueprints are server-side encryption for an Amazon SNS topic with an encrypted Amazon SQS queue subscribed, auto-adjusting provisioned concurrency for Lambda functions, secure Serverless Aurora Cluster with bastion host, and more.

See more patterns implemented at serverlessland.com and cdkpatterns.com

Conclusion

Translating architectural and security best practices into modular IaC definitions, such as CDK constructs or Terraform modules, is a scalable and reusable technique that allows CyberArk to reduce duplicative efforts and save months of development time. Using IaC tools like AWS CDK or Terraform, augmented with governance tools like cdk-nag or checkov, enabled CyberArk to share implementation best practices and encode governance policies into architectural blueprints. Development teams adopting these blueprints do not need to reinvent the wheel, each trying to solve the same problem on their own. Instead, they leverage the knowledge codified in the blueprint.

Monitoring best practices for event delivery with Amazon EventBridge

2024-09-30 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/monitoring-best-practices-for-event-delivery-with-amazon-eventbridge/

This post is written by Maximilian Schellhorn, Senior Solutions Architect and Michael Gasch, Senior Product Manager, EventBridge

Amazon EventBridge is a serverless event router that allows you to decouple your applications, using events to communicate important changes between event producers and consumers (targets). With EventBridge, producers publish events through an event bus, where you can configure rules to filter, transform, and route your events to a variety of targets such as AWS Lambda functions, Amazon Kinesis Data Streams, and public HTTPS endpoints (API destinations).

In event-driven architectures, the flow of sending and receiving events is asynchronous. There is no direct feedback to the producer when targets are invoked or if the invocation was successful. Therefore, to make sure business logic executes reliably in event-driven applications, it’s essential to get an understanding of your event delivery behavior with metrics, such as the number of delivery retries, failed delivery attempts, and the time it takes to deliver events. These metrics allow you to monitor the health of your event-driven architectures, and understand and mitigate event delivery issues caused by underperforming, undersized or unresponsive targets.

This post discusses how to monitor event delivery with EventBridge metrics to detect common event delivery issues and increase the reliability of your event-driven architectures on AWS.

Background

EventBridge is a multi-tenant system that handles more than 2.6 trillion events per month as of February 2024. EventBridge maintains fairness and availability under high load using mechanisms to detect and isolate noisy neighbors. As part of the AWS shared responsibility model, you are responsible to monitor and respond to target-related issues for reliable event delivery. For example, an underprovisioned Kinesis data stream or throttled API destination as a target will lead to delivery retries, delays, and failures.

Solution overview

EventBridge provides a variety of metrics to observe, troubleshoot, and optimize event delivery. For example, counter-based metrics such as InvocationAttempts, SuccessfulInvocationAttempts, RetryInvocationAttempts, and FailedInvocations allow you to observe throttling and calculate error rates. Latency-based metrics such as IngestionToInvocationSuccessLatency provide insights into event delivery and delays.

In the following sections, we demonstrate the behavior of these metrics through an example application and discuss best practices for reliable event delivery. The example is composed of three key components, as numbered in the following architecture:

An HTTP load generator to simulate different load patterns through Amazon API Gateway.
An EventBridge event bus and a rule with an API destination target, throttled at 50 requests per second to simulate an under-scaled resource.
A dead-letter queue (DLQ) that makes sure events are retained in case of invocations that fail permanently.

Example application architecture

The load generator creates varying load over multiple phases. To observe the number of incoming events, use the EventBridge metrics MatchedEvents or TriggeredRules on the rule name dimension, as illustrated in the following graph.

Number of incoming events visualized in CloudWatch Metrics

The following use cases focus on monitoring event delivery. Therefore, cases where event producers are not able to publish events due to permission errors or are experiencing throttling quotas on PutEvents are not covered.

Use case 1: Detecting event delivery issues due to target rate limiting

In this use case, event delivery will experience retries due to an under-scaled API destination target. The API processes all requests successfully. The load generator runs in three phases:

First, it warms up with a low number of requests and slowly increases the load while staying below the API destination rate limit of 50 requests per second
In the second phase, the load generator increases to 100 requests per second, exceeding the configured invocation rate on the API destination
Finally, the load generator slows to 50 requests per second, and eventually finishes

The following graph was created via CloudWatch Metrics and illustrates this scenario.

Load pattern of the example application

EventBridge supports new rule name dimensions for selected metrics, making it straightforward to observe invocations (event delivery) per rule. The following metrics are recommended:

InvocationAttempts – The overall number of times EventBridge attempts to invoke the target, including retries
SuccessfulInvocationAttempts – The number of invocation attempts that were successful
RetryInvocationAttempts – The number attempts that originated from retries

The following graph visualizes the metrics within the first phase of the example scenario. In this phase, the load stays below the configured rate limit of the target. When events are delivered successfully without throttling or errors, InvocationAttempts and SuccessfulInvocationAttempts are equivalent and RetryInvocationAttempts is 0 (the metric is only emitted if there are retries).

EventBridge Metrics during the first phase without throttling or errors

In the second phase (06:55), the load generator creates more events than the target can handle, exceeding the API destination invocation rate limit. This is reflected in the graph by InvocationAttempts and MatchedEvents increasing, while SuccessfulInvocationAttempts stays at the configured API destination rate limit. At the beginning of the phase, RetryInvocationAttempts is 0 because retries due to rate limiting from API destinations are not immediately executed, but delayed with exponential backoff. After the delay, RetryInvocationAttempts starts increasing (06:58), as shown in the following graph.

EventBridge Metrics during the ramp-up phase of the load generator

Because InvocationAttempts also includes retries, the overall number of InvocationAttempts is higher than the incoming MatchedEvents.

Lastly, during the cool down period, when the number of incoming events is decreasing significantly (7:03), more retry attempts succeed, and therefore InvocationAttempts and RetryAttempts reduce. Even though there are no more new incoming events (07:05), there are still events being retried that will eventually finish (07:14).

EventBridge Metrics during the cool-down phase of the load generator

Based on the observations during this scenario, we can calculate the overall custom metric SuccessfulInvocationRate. If you consider retries as a first sign of degraded system state, you can calculate this rate as SuccessfulInvocationAttempts/InvocationAttempts. For example, in Amazon CloudWatch, you can use metric math. Depending on your requirements, you can set up CloudWatch alarms to create notifications when a certain threshold is hit.

Custom SuccessfulInvocationRate metric generated with CloudWatch metric math

Although an occasional decrease of SuccessfulInvocationRate due to temporary traffic spikes or invocation errors can be considered normal, a constant mismatch is an indication of a misconfigured target and needs to be addressed as part of the shared responsibility model.

Use case 2: Detecting and handling event delivery failures

By default, EventBridge retries delivering an event for 24 hours and up to 185 times. After all retry attempts are exhausted, the event is dropped or sent to a DLQ. See Using dead-letter queues to process undelivered events in EventBridge for more information on how to configure a DLQ with EventBridge. These events can be visualized through the FailedInvocations or InvocationsSentToDlq metrics. Because FailedInvocations doesn’t consider retries that eventually succeed as failed invocations, this metric wasn’t visible in the previous example.

The following graph represents the same application and load pattern, but the EventBridge rule is configured with a maximum of three retries. During the first phase, there are no failed attempts because the load stays below the throttling limit.

EventBridge Metrics with FailedInvocations after maximum retries exceed

In the second phase, you can observe FailedInvocations starting after the initial retries (three) have been exceeded. Because the example application has a DLQ configured, InvocationsSentToDlq can provide the same insight, and can be used for alerting.

If you’re experiencing a large amount of FailedInvocations or InvocationsSentToDlq, it’s recommended to investigate if the target is properly scaled and able to receive the given traffic. For cases where retries are expected, the retry policy should be configured accordingly.

Use case 3: Detecting event delivery delays

The metrics outlined in the previous scenarios provided an overview of how to monitor your event delivery by the total number of retries or failures during a given time period. However, EventBridge also provides a metric that lets you observe the end-to-end latency (the time it takes from event ingestion to successful delivery to the target).

This can be achieved with the new IngestionToInvocationSuccessLatency metric. This metric surfaces effects from retries and delayed delivery, for example due to timeouts and slow responses from targets. In the following graph, you can observe 50th and 99th percentiles (p50 and p99) for IngestionToInvocationSuccessLatency on the right Y axis. During the second phase of the load generator, where invocations exceed the number of events the target can process, retries occur. Therefore, the overall time until events are delivered successfully to the target increases to almost 10 minutes (597,621ms, p99).

Combination of counter based metrics and latency based metrics

IngestionToInvocationSuccessLatency includes the time the target takes to successfully respond to event delivery. This allows you to monitor the end-to-end latency between EventBridge and your target, and detect performance variations and degradations of targets, even when there is no target throttling or errors. For example, the following graph displays constant successful invocations while the latency increases due to longer response times of the target over a 5-minute period (starting at 09:07).

Visualization of increased target latency without errors or retries

Conclusion

In this post, we explored best practices for observing event delivery with EventBridge. By using key metrics like SuccessfulInvocationAttempts, RetryInvocationAttempts, and FailedInvocations, you can gain visibility and identify issues early. With CloudWatch metric math, you can calculate a SuccessfulInvocationRate metric, allowing you to define thresholds and alerts on a single key metric.

Furthermore, the new IngestionToInvocationSuccessLatency metric provides insights into the end-to-end event delivery latency between EventBridge and your targets, enabling you to detect and respond to performance degradation. It’s recommended to combine these key metrics into a holistic overview, such as using CloudWatch dashboards. By setting up appropriate alarms and taking a proactive approach to observability, you can mitigate event delivery problems and build resilient, scalable, event-driven applications on AWS with EventBridge. Navigate to Monitoring Amazon EventBridge to get an overview of the available metrics and how to get started.

Try these metrics out with your own use case!

To find more serverless patterns, check out Serverless Land.

Amazon EMR Serverless observability, Part 1: Monitor Amazon EMR Serverless workers in near real time using Amazon CloudWatch

2024-09-27 Kashif Khan

Post Syndicated from Kashif Khan original https://aws.amazon.com/blogs/big-data/amazon-emr-serverless-observability-part-1-monitor-amazon-emr-serverless-workers-in-near-real-time-using-amazon-cloudwatch/

Amazon EMR Serverless allows you to run open source big data frameworks such as Apache Spark and Apache Hive without managing clusters and servers. With EMR Serverless, you can run analytics workloads at any scale with automatic scaling that resizes resources in seconds to meet changing data volumes and processing requirements.

We have launched job worker metrics in Amazon CloudWatch for EMR Serverless. This feature allows you to monitor vCPUs, memory, ephemeral storage, and disk I/O allocation and usage metrics at an aggregate worker level for your Spark and Hive jobs.

This post is part of a series about EMR Serverless observability. In this post, we discuss how to use these CloudWatch metrics to monitor EMR Serverless workers in near real time.

CloudWatch metrics for EMR Serverless

At the per-Spark job level, EMR Serverless emits the following new metrics to CloudWatch for both driver and executors. These metrics provide granular insights into job performance, bottlenecks, and resource utilization.

WorkerCpuAllocated	The total numbers of vCPU cores allocated for workers in a job run
WorkerCpuUsed	The total numbers of vCPU cores utilized by workers in a job run
WorkerMemoryAllocated	The total memory in GB allocated for workers in a job run
WorkerMemoryUsed	The total memory in GB utilized by workers in a job run
WorkerEphemeralStorageAllocated	The number of bytes of ephemeral storage allocated for workers in a job run
WorkerEphemeralStorageUsed	The number of bytes of ephemeral storage used by workers in a job run
WorkerStorageReadBytes	The number of bytes read from storage by workers in a job run
WorkerStorageWriteBytes	The number of bytes written to storage from workers in a job run

The following are the benefits of monitoring your EMR Serverless jobs with CloudWatch:

Optimize resource utilization – You can gain insights into resource utilization patterns and optimize your EMR Serverless configurations for better efficiency and cost savings. For example, underutilization of vCPUs or memory can reveal resource wastage, allowing you to optimize worker sizes to achieve potential cost savings.
Diagnose common errors – You can identify root causes and mitigation for common errors without log diving. For example, you can monitor the usage of ephemeral storage and mitigate disk bottlenecks by preemptively allocating more storage per worker.
Gain near real-time insights – CloudWatch offers near real-time monitoring capabilities, allowing you to track the performance of your EMR Serverless jobs as and when they are running, for quick detection of any anomalies or performance issues.
Configure alerts and notifications – CloudWatch enables you to set up alarms using Amazon Simple Notification Service (Amazon SNS) based on predefined thresholds, allowing you to receive notifications through email or text message when specific metrics reach critical levels.
Conduct historical analysis – CloudWatch stores historical data, allowing you to analyze trends over time, identify patterns, and make informed decisions for capacity planning and workload optimization.

Solution overview

To further enhance this observability experience, we have created a solution that gathers all these metrics on a single CloudWatch dashboard for an EMR Serverless application. You need to launch one AWS CloudFormation template per EMR Serverless application. You can monitor all the jobs submitted to a single EMR Serverless application using the same CloudWatch dashboard. To learn more about this dashboard and deploy this solution into your own account, refer to the EMR Serverless CloudWatch Dashboard GitHub repository.

In the following sections, we walk you through how you can use this dashboard to perform the following actions:

Optimize your resource utilization to save costs without impacting job performance
Diagnose failures due to common errors without the need for log diving and resolve those errors optimally

Prerequisites

To run the sample jobs provided in this post, you need to create an EMR Serverless application with default settings using the AWS Management Console or AWS Command Line Interface (AWS CLI), and then launch the CloudFormation template from the GitHub repo with the EMR Serverless application ID provided as the input to the template.

You need to submit all the jobs in this post to the same EMR Serverless application. If you want to monitor a different application, you can deploy this template for your own EMR Serverless application ID.

Optimize resource utilization

When running Spark jobs, you often start with the default configurations. It can be challenging to optimize your workload without any visibility into actual resource utilization. Some of the most common configurations that we’ve seen customers adjust are spark.driver.cores, spark.driver.memory, spark.executor.cores, and spark.executors.memory.

To illustrate how the newly added CloudWatch dashboard worker-level metrics can help you fine-tune your job configurations for better price-performance and enhanced resource utilization, let’s run the following Spark job, which uses the NOAA Integrated Surface Database (ISD) dataset to run some transformations and aggregations.

Use the following command to run this job on EMR Serverless. Provide your Amazon Simple Storage Service (Amazon S3) bucket and EMR Serverless application ID for which you launched the CloudFormation template. Make sure to use the same application ID to submit all the sample jobs in this post. Additionally, provide an AWS Identity and Access Management (IAM) runtime role.

aws emr-serverless start-job-run \
--name emrs-cw-dashboard-test-1 \
 --application-id <APPLICATION_ID> \
 --execution-role-arn <JOB_ROLE_ARN> \
 --job-driver '{
 "sparkSubmit": {
 "entryPoint": "s3://<BUCKETNAME>/scripts/windycity.py",
 "entryPointArguments": ["s3://noaa-global-hourly-pds/2024/", "s3://<BUCKET_NAME>/emrs-cw-dashboard-test-1/"]
 } }'

Now let’s check the executor vCPUs and memory from the CloudWatch dashboard.

This job was submitted with default EMR Serverless Spark configurations. From the Executor CPU Allocated metric in the preceding screenshot, the job was allocated 396 vCPUs in total (99 executors * 4 vCPUs per executor). However, the job only used a maximum of 110 vCPUs based on Executor CPU Used. This indicates oversubscription of vCPU resources. Similarly, the job was allocated 1,584 GB memory in total based on Executor Memory Allocated. However, from the Executor Memory Used metric, we see that the job only used 176 GB of memory during the job, indicating memory oversubscription.

Now let’s rerun this job with the following adjusted configurations.

	Original Job (Default Configuration)	Rerun Job (Adjusted Configuration)
spark.executor.memory	14 GB	3 GB
spark.executor.cores	4	2
spark.dynamicAllocation.maxExecutors	99	30
Total Resource Utilization	6.521 vCPU-hours 26.084 memoryGB-hours 32.606 storageGB-hours	1.739 vCPU-hours 3.688 memoryGB-hours 17.394 storageGB-hours
Billable Resource Utilization	7.046 vCPU-hours 28.182 memoryGB-hours 0 storageGB-hours	1.739 vCPU-hours 3.688 memoryGB-hours 0 storageGB-hours

We use the following code:

aws emr-serverless start-job-run \
--name emrs-cw-dashboard-test-2 \
 --application-id <APPLICATION_ID> \
 --execution-role-arn <JOB_ROLE_ARN> \
 --job-driver '{
 "sparkSubmit": {
 "entryPoint": "s3://<BUCKETNAME>/scripts/windycity.py",
 "entryPointArguments": ["s3://noaa-global-hourly-pds/2024/", "s3://<BUCKET_NAME>/emrs-cw-dashboard-test-2/"],
 "sparkSubmitParameters": "--conf spark.driver.cores=2 --conf spark.driver.memory=3g --conf spark.executor.memory=3g --conf spark.executor.cores=2 --conf spark.dynamicAllocation.maxExecutors=30"
 } }'

Let’s check the executor metrics from the CloudWatch dashboard again for this job run.

In the second job, we see lower allocation of both vCPUs (396 vs. 60) and memory (1,584 GB vs. 120 GB) as expected, resulting in better utilization of resources. The original job ran for 4 minutes, 41 seconds. The second job took 4 minutes, 54 seconds. This reconfiguration has resulted in 79% lower cost savings without affecting the job performance.

You can use these metrics to further optimize your job by increasing or decreasing the number of workers or the allocated resources.

Diagnose and resolve job failures

Using the CloudWatch dashboard, you can diagnose job failures due to issues related to CPU, memory, and storage such as out of memory or no space left on the device. This enables you to identify and resolve common errors quickly without having to check the logs or navigate through Spark History Server. Additionally, because you can check the resource utilization from the dashboard, you can fine-tune the configurations by increasing the required resources only as much as needed instead of oversubscribing to the resources, which further saves costs.

Driver errors

To illustrate this use case, let’s run the following Spark job, which creates a large Spark data frame with a few million rows. Typically, this operation is done by the Spark driver. While submitting the job, we also configure spark.rpc.message.maxSize, because it’s required for task serialization of data frames with a large number of columns.

aws emr-serverless start-job-run \
--name emrs-cw-dashboard-test-3 \
--application-id <APPLICATION_ID> \
--execution-role-arn <JOB_ROLE_ARN> \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<BUCKETNAME>/scripts/create-large-disk.py"
"sparkSubmitParameters": "--conf spark.rpc.message.maxSize=2000"
} }'

After a few minutes, the job failed with the error message “Encountered errors when releasing containers,” as seen in the Job details section.

When encountering non-descriptive error messages, it becomes crucial to investigate further by examining the driver and executor logs to troubleshoot further. But before further log diving, let’s first check the CloudWatch dashboard, specifically the driver metrics, because releasing containers is generally performed by the driver.

We can see that the Driver CPU Used and Driver Storage Used are well within their respective allocated values. However, upon checking Driver Memory Allocated and Driver Memory Used, we can see that the driver was using all of the 16 GB memory allocated to it. By default, EMR Serverless drivers are assigned 16 GB memory.

Let’s rerun the job with more driver memory allocated. Let’s set driver memory to 27 GB as the starting point, because spark.driver.memory + spark.driver.memoryOverhead should be less than 30 GB for the default worker type. park.rpc.messsage.maxSize will be unchanged.

aws emr-serverless start-job-run \
—name emrs-cw-dashboard-test-4 \
—application-id <APPLICATION_ID> \
—execution-role-arn <JOB_ROLE_ARN> \
—job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<BUCKETNAME>/scripts/create-large-disk.py"
"sparkSubmitParameters": "--conf spark.driver.memory=27G --conf spark.rpc.message.maxSize=2000"
} }'

The job succeeded this time around. Let’s check the CloudWatch dashboard to observe driver memory utilization.

As we can see, the allocated memory is now 30 GB, but the actual driver memory utilization didn’t exceed 21 GB during the job run. Therefore, we can further optimize costs here by reducing the value of spark.driver.memory. We reran the same job with spark.driver.memory set to 22 GB, and the job still succeeded with better driver memory utilization.

Executor errors

Using CloudWatch for observability is ideal for diagnosing driver-related issues because there is only one driver per job and driver resources used is the actual resource usage of the single driver. On the other hand, executor metrics are aggregated across all the workers. However, you can use this dashboard to provide only an adequate amount of resources to make your job succeed, thereby avoiding oversubscription of resources.

To illustrate, let’s run the following Spark job, which simulates uniform disk over-utilization across all workers by processing very large NOAA datasets from several years. This job also transiently caches a very large data frame on disk.

aws emr-serverless start-job-run \
--name emrs-cw-dashboard-test-5 \
--application-id <APPLICATION_ID> \
--execution-role-arn <JOB_ROLE_ARN> \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<BUCKETNAME>/scripts/noaa-disk.py"
} }'

After a few minutes, we can see that the job failed with “No space left on device” error in the Job details section, which indicates that some of the workers have run out of disk space.

Checking the Running Executors metric from the dashboard, we can identify that there were 99 executor workers running. Each worker comes with 20 GB storage by default.

Because this is a Spark task failure, let’s check the Executor Storage Allocated and Executor Storage Used metrics from the dashboard (because the driver won’t run any tasks).

As we can see, the 99 executors have used up a total of 1,940 GB from the total allocated executor storage of 2,126 GB. This includes both the data shuffled by the executors and the storage used for caching the data frame. We don’t see the full 2,126 GB being utilized from this graph because there might be a few executors out of the 99 executors that weren’t holding much data when the job failed (before these executors could start processing tasks and store the data frame chunks).

Let’s rerun the same job but with increased executor disk size using the parameter spark.emr-serverless.executor.disk. Let’s try with 40 GB disk per executor as a starting point.

aws emr-serverless start-job-run \
--name emrs-cw-dashboard-test-6 \
--application-id <APPLICATION_ID> \
--execution-role-arn <JOB_ROLE_ARN> \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<BUCKETNAME>/scripts/noaa-disk.py"
"sparkSubmitParameters": "--conf spark.emr-serverless.executor.disk=40G"
}
}'

This time, the job ran successfully. Let’s check the Executor Storage Allocated and Executor Storage Used metrics.

Executor Storage Allocated is now 4,251 GB because we’ve doubled the value of spark.emr-serverless.executor.disk. Although there is now twice as much aggregated executors’ storage, the job still used only a maximum of 1,940 GB out of 4,251 GB. This indicates that our executors were likely running out of disk space only by a few GBs. Therefore, we can try to set spark.emr-serverless.executor.disk to an even lower value like 25 GB or 30 GB instead of 40 GB to save storage costs as we did in the previous scenario. In addition, you can monitor Executor Storage Read Bytes and Executor Storage Write Bytes to see if your job is I/O intensive. In this case, you can use the Shuffle-optimized disks feature of EMR Serverless to further enhance your job’s I/O performance.

The dashboard is also useful to capture information about transient storage used while caching or persisting the data frames, including spill-to-disk scenarios. The Storage tab of Spark History Server records any caching activities, as seen in the following screenshot. However, this data will be lost from Spark History Server after the cache is evicted or when the job finishes. Therefore, Executor Storage Used can be used to do an analysis of a failed job run due to transient storage issues.

In this particular example, the data was evenly distributed among the executors. However, if you have a data skew (for, example only 1–2 executors out of 99 process the most amount of data, and as a result, your job runs out of disk space), the CloudWatch dashboard won’t accurately capture this scenario because the storage data is aggregated across all the executors for a job. For diagnosing issues at the individual executor level, we need to track per-executor-level metrics. We explore more advanced examples of how per-worker-level metrics can help you identify, mitigate, and resolve hard-to-find issues through EMR Serverless integration with Amazon Managed Service for Prometheus.

Conclusion

In this post, you learned how to effectively manage and optimize your EMR Serverless application using a single CloudWatch dashboard with enhanced EMR Serverless metrics. These metrics are available in all AWS Regions where EMR Serverless is available. For more details about this feature, refer to Job-level monitoring.

About the Authors

Kashif Khan is a Sr. Analytics Specialist Solutions Architect at AWS, specializing in big data services like Amazon EMR, AWS Lake Formation, AWS Glue, Amazon Athena, and Amazon DataZone. With over a decade of experience in the big data domain, he possesses extensive expertise in architecting scalable and robust solutions. His role involves providing architectural guidance and collaborating closely with customers to design tailored solutions using AWS analytics services to unlock the full potential of their data.

Veena Vasudevan is a Principal Partner Solutions Architect and Data & AI specialist at AWS. She helps customers and partners build highly optimized, scalable, and secure solutions; modernize their architectures; and migrate their big data, analytics, and AI/ML workloads to AWS.

Automate detection and response to website defacement with Amazon CloudWatch Synthetics

2024-09-20 Agus Komang

Post Syndicated from Agus Komang original https://aws.amazon.com/blogs/security/automate-detection-and-response-to-website-defacement-with-amazon-cloudwatch-synthetics/

Website defacement occurs when threat actors gain unauthorized access to a website, most commonly a public website, and replace content on the site with their own messages. In this blog post, we show you how to detect website defacement, and then automate both defacement verification and your defacement response by using Amazon CloudWatch Synthetics visual monitoring canaries. Canaries are configurable scripts that run on a schedule and compare screenshots taken during a canary run with screenshots taken during a baseline canary run. If the discrepancy between the two screenshots exceeds a threshold percentage, the canary fails. We will show you how to quickly deploy a maintenance page through AWS WAF after you verify the defacement.

Common causes of defacement include unauthorized access, SQL injection, cross-site scripting (XSS), or malware. You can use AWS services such as AWS WAF, Amazon Route 53, and Amazon GuardDuty to put additional mechanisms in place to help improve your security posture.

Solution overview

The architectural diagram in Figure 1 shows a typical web application where users access the application by using Amazon CloudFront protected by AWS WAF.

Figure 1: Defacement detection and response with CloudWatch Synthetics

As shown in the diagram, the solution consists of two parts: 1) visual monitoring for defacement detection, and 2) automation of the verification and defacement response.

Part 1: Visual monitoring for defacement detection

Defacement detection uses CloudWatch Synthetics visual monitoring canaries to perform visual monitoring. You can create canaries in CloudWatch Synthetics that periodically take a screenshot of the monitored URLs. Because the canaries only need network access to the monitored URLs, you can implement this solution without affecting the application or modifying its code. For more details on how to create CloudWatch Synthetics visual monitoring canaries, see Visual monitoring of applications with Amazon CloudWatch Synthetics.

You can use the CloudWatch Synthetics visual monitoring blueprint to compare screenshots taken during a canary run with screenshots taken during a baseline canary run. This solution is suitable for static a target=”_blank” hrefs where a discrepancy between the two screenshots that exceeds a threshold percentage could indicate a possible defacement attempt, causing the canary to trigger a failure event.

The threshold percentage is defined by the visual variance that occurs when the current screenshot differs from the baseline screenshot that was captured during the first run of the canary. To reduce false positives, you can adjust the threshold for detecting visual variance.

In the following script, we updated the visual variance to 5% in the visual monitoring blueprint:

# Setting Threshold to 5%
syntheticsConfiguration.withVisualVarianceThresholdPercentage(5);

Figure 2 shows the first baseline screenshot of a webpage with visual variance set to 5%.

Figure 2: Image taken during a baseline canary run

Figure 3 shows the visual variance of a defaced webpage. In this case, the visual variance was set to 5% in the script, and the visual variance detected was 30.92%.

Figure 3: Failed canary run due to differences from the baseline screenshot

Figure 4 shows a webpage with dynamic content that triggered a false positive because the visual monitoring canary was unable to differentiate between real dynamic content and variation from the baseline. In this case, the visual variance was set to 5% in the script, and the visual variance detected was 5.25%.

Figure 4: Dynamic content in Feedback form that triggered canary failure

You can select the dynamic content to exclude it from the visual comparison for subsequent canary runs. To exclude the dynamic content, edit the baseline screenshot in CloudWatch Synthetics. Using a simple click-drag, you can select the area to exclude from visual comparison for subsequent canary runs, as shown in Figure 5.

Figure 5: Exclusion of dynamic content

If your applications have additional areas with dynamic content, you can select more than one area to exclude from comparison.

Figure 6 shows a successful canary run after exclusion of the area that contains the dynamic content.

Figure 6: Canary succeeded after the exclusion of dynamic content

You can automate the defacement response by using Amazon EventBridge rules to trigger Amazon Simple Notification Service (Amazon SNS) when a canary run fails. By using the publish-subscribe pattern, you can customize and add on the response functions based on your organization’s needs.

The following shows the event pattern script in EventBridge. Make sure to update the canary name with the name of the CloudWatch Synthetics visual monitoring canary that you created earlier to serve as the event source.

 // Event patterns in EventBridge to get event source from canary

{
  "source": ["aws.synthetics"],
  "detail-type": ["Synthetics Canary TestRun Failure"],
  "detail": {
    "canary-name": ["<replace-with-canary-name>"]
  }
}

When the event pattern matches the rules that you configured in EventBridge, the Amazon SNS topic triggers the approval flow, as shown in Figure 7. This begins automation of the verification and defacement response, which we describe in the next section.

Figure 7: Amazon SNS topic triggered when the event pattern matches

Part 2: Automation of the verification and defacement response

Figure 8 outlines how to automate the verification and defacement response. When alerts are received upon detection of defacement, the notified team can choose to verify the defacement. This defacement monitor uses CloudWatch Synthetics while maintaining the flexibility to configure and verify threshold settings through manual verification. If you are confident in your thresholds, you can bypass the approval flow and directly block site traffic by using an AWS WAF rule during a defacement attempt.

Figure 8: Defacement detection and response with CloudWatch Synthetics

As shown in the diagram, this is what the traffic flow looks like during a defacement:

The canary from the CloudWatch Synthetics visual monitor identifies defacement through visual variance against the baseline screenshot taken during the first canary run and emits an event.
If the emitted event matches the rules configured in EventBridge, Amazon SNS is triggered. This triggers the subscribed AWS Lambda function that sends a Slack notification with the event details asking for approval.
The notified team receives a Slack message about the defacement and makes an approval decision.
If approval is granted, an AWS WAF rule is added to block traffic and a maintenance page is served to users.
The user that accessed the origin is shown a maintenance page served by AWS WAF.

Although this example shows the use of Slack as an approval mechanism, you can use the communication mechanism of your choice.

Conclusion

In this post, you learned how to use CloudWatch Synthetics to monitor for defacement and display a maintenance page through AWS WAF and CloudFront while you work on recovering the service. You also learned how to use manual approval to identify the optimal threshold and exclude the area that contains dynamic content to reduce false positives.

Although most web applications already use CloudFront and AWS WAF, you can integrate this solution to your existing environment without affecting the application or modifying its code. This solution helps detect potential defacement, providing you with an additional layer of protection for your environment.

We recommend that you explore the capabilities of CloudWatch Synthetics monitoring to detect and use the capabilities of the cloud through services such as EventBridge, Amazon SNS, and Lambda to enable automation. This can help you proactively protect your application against defacement attempts.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

AWS Lambda introduces recursive loop detection APIs

2024-08-20 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/aws-lambda-introduces-recursive-loop-detection-apis/

This post is written by James Ngai, Senior Product Manager, AWS Lambda, and Aneel Murari, Senior Specialist SA, Serverless.

Today, AWS Lambda is announcing new recursive loop detection APIs that allow you to set recursive loop detection configuration on individual Lambda functions. This allows you to turn off recursive loop detection on functions that intentionally use recursive patterns, avoiding disruption of these workloads. You can use these APIs to avoid disruption to any intentionally recursive workflows as Lambda expands support of recursive loop detection to other AWS services.

Overview

AWS Lambda functions are triggered in response to events generated by various AWS services. These Lambda functions may interact with other AWS services by invoking the corresponding service APIs. Typically, the service and resource that generates the triggering event is distinct from the service and resource that the Lambda function calls. However, due to coding errors or configuration issues, there may be situations where these two resources are the same, leading to an infinite or recursive loop. Such misconfigurations can result in runaway workloads, which can incur unplanned usage and charges to your AWS account. For example, a Lambda function processes messages from an Amazon Simple Notification Service (SNS) topic but then puts the resulting notification back to the same SNS topic. This causes an infinite loop.

Lambda provides a built-in preventative guardrail that detects and stops functions running in a recursive or infinite loop between Lambda, Amazon Simple Queue Service (SQS), and SNS. This feature, known as recursive loop detection, is enabled by default for all Lambda functions. This serves as a protective mechanism against unintended usage and unexpected billing from runaway workloads.

Lambda uses an AWS X-Ray trace header primitive called “lineage” to track the number of times a function has been invoked with an event. When your function code sends an event using a supported AWS SDK version, Lambda increments the counter in the lineage header. If your function is then invoked with the same triggering event more than 16 times, Lambda stops the next invocation for that event and emits an Amazon CloudWatch RecursiveInvocationsDropped metric. If the function is invoked synchronously, Lambda returns a RecursiveInvocationException to the caller. For asynchronous invocations, Lambda sends the event to a dead-letter queue or on-failure destination if one is configured.

You do not need to configure active X-Ray tracing for this feature to work. For more information on this feature and an example scenario, please refer to Detecting and stopping recursive loops in AWS Lambda functions.

Although AWS generally discourages this practice due to the possibility of runaway workloads, some customers intentionally employ recursive patterns in their workflows. Previously, customers that run workloads that intentionally use recursive patterns could only opt-out of recursive loop detection on a per-account basis by contacting AWS Support. With these new APIs, customers can selectively opt-out of recursive loop detection on individual functions while maintaining this preventative guardrail for the remaining functions in their account that do not use recursive code.

Today we are introducing two new API actions for recursive loop detection:

GetFunctionRecursiveConfig returns details about a function’s recursive loop detection configuration.
PutFunctionRecursiveConfig sets the recursive loop detection configuration for a function. By default, recursive loop detection is turned ON for all functions.

How to use the new recursive loop detection APIs

You can configure recursive loop detection for Lambda functions through the Lambda Console, the AWS CLI, or Infrastructure as Code tools like AWS CloudFormation, AWS Serverless Application Model (AWS SAM), or AWS Cloud Development Kit (CDK). This new configuration option is supported in AWS SAM CLI version 1.123.0 and CDK v2.153.0.

If you turn recursive loop detection off for a function, the metric for RecursiveInvocationsDropped is no longer emitted for that function.

Turning off recursive loop detection on your function means that Lambda no longer prevents recursive invocations caused by misconfiguration. This may lead to unexpected usage and billing to your AWS account. You should explore alternate ways of architecting your workload that do not use recursive patterns. AWS recommends you exercise caution when turning off this guardrail feature.

Setting recursive loop detection configuration on a function using the Lambda Console

You can get recursive loop detection configuration in the AWS Lambda console:

In the AWS Lambda Console, navigate to the Functions page. Select the function that uses intentionally recursive patterns.
Select Configuration. You can find recursive loop detection controls under the Concurrency and recursion detection section.

Recursive loop detection controls in the Lambda console

Recursive loop detection is turned on by default for all functions. You can change the recursive loop detection configuration of a function by choosing Edit.
To turn off recursive loop detection for a function, select Allow recursive loops and select Save.

Setting to allow recursive loops

Setting recursive loop detection configuration using the AWS CLI

You can get the current recursion loop detection configuration of a Lambda function by using the following CLI command:

aws lambda get-function-recursion-config \
--region $AWS_REGION \
--function-name $FUNCTION_NAME

You can update the recursion loop detection configuration for a Lambda function by using the following CLI command:

aws lambda put-function-recursion-config \
--region $AWS_REGION \
--function-name $FUNCTION_NAME \
--recursive-loop Allow|Terminate

Make sure to set appropriate values for AWS_REGION and FUNCTION_NAME in the previous commands. Setting the put-function-recursion-config parameter to Allow turns off the default behavior of detecting recursive loops. Set this value to Terminate to switch back to default behavior.

Setting recursive loop detection configuration using AWS CloudFormation

You can control the recursive loop detection configuration for a Lambda function by setting the RecursiveLoop resource property in CloudFormation. Setting the value of this property to Allow turns off the default behavior of automatically detecting recursive loops. Set this property to Terminate if you want to switch it back to the default behavior. The following CloudFormation snippet shows RecursiveLoop set to Allow.

LambdaFunction:
    Type: AWS::Lambda::Function                                                                                                                                                                                    
    Properties:                                                                                                                                                                                       
      Code:                                                                                                                                                                                          
        S3Bucket:S3_BUCKET                                                                                                                                                                            
        S3Key: S3_KEY      
      Handler: com.example.App::handleRequest                                                                                                                                                        
      MemorySize: 1024
      Role:                                                                                                                                                                                          
        Fn::GetAtt:                                                                                                                                                                                  
        - LambdaFunctionRole                                                                                                                                                                         
        - Arn                                                                                                                                                                               
      Runtime: java17
      RecursiveLoop : Allow                                                                                                                                                                                                                                                                           
      Timeout: 20                                                                                                                                                                        
      TracingConfig:                                                                                                                                                                               
        Mode: Active

Extending recursive loop detection to additional AWS services

Today, recursive loop detection detects and stops loops between Lambda, SQS, and SNS after approximately 16 invocations. Lambda plans to extend support for recursive loop detection to additional AWS services. Using the APIs, you can turn off recursive loop detection for specific functions that use recursive patterns so that they are not impacted when Lambda expands recursive loop detection to additional AWS services in the future.

One way you can identify functions that use recursive patterns is by using the CloudWatch metric RecursiveInvocationsDropped.

Set a CloudWatch alarm on all Lambda functions for the CloudWatch metric RecursiveInvocationsDropped. Configure the alarm to trigger when the metric is greater than a threshold of zero. Refer to CloudWatch documentation to set alarms. You can use the following CLI command to set this alarm:

aws cloudwatch put-metric-alarm --alarm-name lambda-recursive-alarm --metric-name RecursiveInvocationsDropped --namespace AWS/Lambda --statistic Sum --period 60 --threshold 0 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1 --alarm-actions $arn-of-sns-notification-topic

When Lambda detects recursive invocations, it will emit the RecursiveInvocationsDropped metric, which will trigger the alarm. Note that Lambda will only detect and stop recursive invocations if all the services within the loop support recursive loop detection.
Navigate to the CloudWatch Console and determine which function has emitted the RecursiveInvocationsDropped metric. On the Browse tab, under Metrics, choose to view metrics By Function Name and search for RecursiveInvocationsDropped. This will list all functions that have emitted that metric.

RecursiveInvocationsDropped metric

Determine if recursion is the intended pattern for that function. If so, use the recursive loop detection API to turn off recursive loop detection for this function.

Conclusion

Lambda recursive loop detection automatically detects and stops recursive invocations between Lambda and supported services, preventing runaway workloads. In most cases, you should architect your workloads to avoid any recursive loops. In rare and special circumstances, you may want to turn off the default behavior on a case-by-case basis. The recursive loop detection APIs allow you to set recursive loop detection configuration on individual functions.

This feature is available in all AWS Regions where Lambda supports recursive loop detection.

To learn more about these APIs, refer to the AWS Lambda API Reference.

For more serverless learning resources, visit Serverless Land

Enabling high availability of Amazon EC2 instances on AWS Outposts servers: (Part 2)

2024-08-08 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/enabling-high-availability-of-amazon-ec2-instances-on-aws-outposts-servers-part-2/

This blog post was written by Brianna Rosentrater – Hybrid Edge Specialist SA and Jessica Win – Software Development Engineer

This post is Part 2 of the two-part series ‘Enabling high availability of Amazon EC2 instances on AWS Outposts servers’, providing you with code samples and considerations for implementing custom logic to automate Amazon Elastic Compute Cloud (Amazon EC2) relaunch on AWS Outposts servers. This post focuses on stateful applications where the Amazon EC2 instance store state needs to be maintained at relaunch.

Overview

AWS Outposts servers provide compute and networking services that are ideal for low-latency, local data processing needs for on-premises locations such as retail stores, branch offices, healthcare provider locations, or environments that are space-constrained. Outposts servers use EC2 instance store storage to provide non-durable block-level storage to the instances, and many applications use the instance store to save stateful information that must be retained in a Disaster Recovery (DR) type event. In this post, you will learn how to implement custom logic to provide High Availability (HA) for your applications running on an Outposts server using two or more servers for N+1 fault tolerance. The code provided is meant to help you get started with creating your own custom relaunch logic for workloads that require HA, and can be modified further for your unique workload needs.

Architecture

This solution is scoped to work for two Outposts servers set up as a resilient pair. For three or more servers running in the same data center, each server would need to be mapped to a secondary server for HA. One server can be the relaunch destination for multiple other servers, as long as Amazon EC2 capacity requirements are met. If both the source and destination Outposts servers are unavailable or experience a failure at the same time, then additional user action is required to resolve. In this case, a notification email is sent to the address specified in the notification email parameter that you supplied when executing the init.py script from Part 1 of this series. This lets you know that the attempted relaunch of your EC2 instances failed.

Figure 1: Amazon EC2 auto-relaunch custom logic on Outposts server architecture.

Refer to Part 1 of this series for a detailed breakdown of Steps 1-6 that discusses how the Amazon EC2 relaunch automation works, as shown in the preceding figure. For stateful applications, this logic has been extended to capture the EC2 instance store state. In order to save the state of the instance store, AWS Systems Manager automation is being used to create an Amazon Elastic Block Store (Amazon EBS)-backed Amazon Machine Image (AMI) in the Region of the EC2 instance running on the source Outposts server. Then, this AMI can be relaunched on another Outposts server in the event of a source server hardware or service link failure. The EBS volume associated with the AMI is automatically converted to the instance store root volume when relaunched on another Outposts server.

Prerequisites

The following prerequisites are required to complete the walkthrough:

EC2 instances being protected must be Linux instances.
EC2 instances being protected must have access to either an Internet Gateway (IGW) or NAT gateway in an AWS Region to download and install the required Linux packages and future updates.
Create an Amazon Simple Storage Service (Amazon S3) bucket in your Outposts servers’ parent region to store the Systems Manager automation document.
If you are setting this up from an Outpost consumer account, you need to configure Amazon CloudWatch cross-account observability between the consumer account and the Outpost owning account to view Outposts metrics in your consumer account.
Download the repository backup-outposts-servers-linux-instance.
- Note the prerequisites and limitations listed in the README file of the GitHub repository.

This post builds on the Amazon EC2 auto-relaunch logic implemented in Part 1 of this series. In Part 1, we covered the implementation for achieving HA for stateless applications. In Part 2, we extend the Part 1 implementation to achieve HA for stateful applications, which must retain EC2 instance store data when instances are relaunched.

Deploying Outposts Servers Linux Instance Backup Solution

For the purpose of this post, a virtual private cloud (VPC) named “Production-Application-A”, and subnets on each of the two Outposts servers being used for this post named “source-outpost-a” and “destination-outpost-b” have been created. The destination-outpost-b subnet is supplied in the launch template being used for this walkthrough. The Amazon EC2 auto-relaunch logic discussed in Part 1 of this series has already been implemented, and the focus here is on the next steps required to extend that auto-relaunch capability to stateful applications.

Following the installation instructions available in the GitHub repository README file, you first open an AWS CloudShell terminal from within the account that has access to your Outposts servers. Next, clone the GitHub repository and cd into the “backup-outposts-servers-linux-instance” directory:

From here you can build the Systems Manager Automation document with its attachments using the make documents command. Your output should look similar to the following after successful execution:

Finally, upload the Systems Manager Automation document you just created to the S3 bucket you created in your Outposts server’s parent region for this purpose. For the purpose of this post, an S3 bucket named “ssm-bucket07082024” was created. Following Step 4 in the GitHub installation instructions, the command looks like the following:

BUCKET_NAME="ssm-bucket07082024"
DOC_NAME="BackupOutpostsServerLinuxInstanceToEBS"
OUTPOST_REGION="us-east-1"
aws s3 cp Output/Attachments/attachment.zip s3://${BUCKET_NAME}
aws ssm create-document --content file://Output/BackupOutpostsServerLinuxInstanceToEBS.json --name ${DOC_NAME} --document-type "Automation" --document-format JSON --attachments Key=S3FileUrl,Values=s3://${BUCKET_NAME}/attachment.zip,Name=attachment.zip --region ${OUTPOST_REGION}

After you have successfully created the Systems Manager Automation document, the output of the command shows the content of your newly created file. After reviewing it, you can exit the terminal and confirm that a new file named “attachments.zip” is in the S3 bucket that you specified.

Now you’re ready to put this automation logic in place. Following the GitHub instructions for usage, navigate to Systems Manager in the account that has access to your Outposts servers, and execute the automation. The default document name is used for the purpose of this post “BackupOutpostsServerLinuxInstanceToEBS”, so that is the document selected. You may have other documents available to you for quick setup, and those can be disregarded for now.

Select the chosen document to execute this automation using the button in the top right-hand corner of the document details page.

After executing the automation, you are asked to configure the runbook for this automation. Leave the default Simple execution option selected:

For the Input parameters section, review the parameter definitions given in the GitHub repository README file. For the purpose of this post, the following is used:

Note that you may need to create a service role for Systems Manager to perform this automation on your behalf. For the purposes of this post, I have done so using the Required IAM Permissions to run this runbook section of the GitHub repository README file. The other settings can be left as default. Finish your set up by selecting Execute at the bottom of this page. It could take up to 30 minutes for all necessary steps to execute. Note that the automation document shows 32 steps, but the number of steps that are executed varies based on the type of Linux AMI that you started with. As long as your automation’s overall status shows as successful, you have completed implementation successfully. Here is a sample output:

You can find the AMI that was produced from this automation in your Amazon EC2 console under the Images section:

The final implementation step is creating a Systems Manager parameter for the AMI you just created. This prevents you from having to manually update the launch template for your application each time a new AMI is created and the AMI ID changes. Since this AMI is essentially a backup of your application and its current instance store state, you should expect the AMI ID to change with each new backup or new AMI that you create for your application, and determine the cadence for creating these AMIs that aligns to your application Recovery Point Objectives (RPO).

To create a Systems Manager parameter for your AMI, first navigate to your Systems Manager console. Under Application Management, select Parameter Store and Create parameter. You can select either the Standard or Advanced tier depending on your needs. The AMI ID I have is ami-038c878d31d9d0bfb and the following is an example of how the parameter details are filled in for this walkthrough:

Now you can modify your application’s launch template that you created in Part 1 of this series, and specify the Systems Manager parameter you just created. To do this, navigate to your Amazon EC2 console, and under Instances select the Launch Templates option. Create a new version of your launch template, select the Browse more AMIs option, and choose the arrow button to the right of the search bar. Select Specify custom value/Systems Manager parameter.

Now enter the name of your parameter in one of the listed formats, and select Save.

You should see your parameter listed in the launch template summary under Software Image (AMI):

Make sure that your launch template is set to the latest version. Your installation is now complete, and in the event of a source Outposts server failure, your application will be automatically relaunched on a new EC2 instance on your destination Outposts server. You will also receive a notification email sent to the address specified in the notification email parameter of the init.py script from Part 1 of this series. This means you can start triaging why your source Outposts server experienced a failure immediately without worrying about getting your application(s) back up and running. This helps make sure that your application(s) are highly available and reduces your Recovery Time Objective (RTO).

Cleaning up

The custom Amazon EC2 relaunch logic is implemented through AWS CloudFormation, so the only clean up required is to delete the CloudFormation stack from your AWS account. Doing so deletes the resources that were deployed through the CloudFormation stack. To remove the Systems Manager automation, un-enroll your EC2 instance from Host Management and delete the Amazon EBS-backed AMI in the Region.

Conclusion

The use of custom logic through AWS tools such as CloudFormation, CloudWatch, Systems Manager, and AWS Lambda enables you to architect for HA for stateful workloads on Outposts server. By implementing the custom logic we walked through in this post, you can automatically relaunch EC2 instances running on a source Outposts server to a secondary destination Outposts server while maintaining your application’s state data. This also reduces the downtime of your application(s) in the event of a hardware or service link failure. The code provided in this post can also be further expanded upon to meet the unique needs of your workload.

Note that while the use of Infrastructure-as-Code (IaC) can improve your application’s availability and be used to standardize deployments across multiple Outposts servers, it is crucial to do regular failure drills to test the custom logic in place to make sure that you understand your application’s expected behavior on relaunch in the event of a failure. To learn more about Outposts servers, please visit the Outposts servers user guide.

Enabling high availability of Amazon EC2 instances on AWS Outposts servers: (Part 1)

2024-08-08 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/enabling-high-availability-of-amazon-ec2-instances-on-aws-outposts-servers-part-1/

This blog post is written by Brianna Rosentrater – Hybrid Edge Specialist SA and Jessica Win – Software Development Engineer.

This post is part 1 of the two-part series ‘Enabling high availability of Amazon EC2 instances on AWS Outposts servers’, providing you with code samples and considerations for implementing custom logic to automate Amazon Elastic Compute Cloud (EC2) relaunch on Outposts servers. This post focuses on guidance for stateless applications, whereas part 2 focuses on stateful applications where the Amazon EC2 instance store state needs to be maintained at relaunch.

Outposts servers provide compute and networking services that are ideal for low-latency, local data processing needs for on-premises locations such as retail stores, branch offices, healthcare provider locations, or environments that are space-constrained. Outposts servers use EC2 instance store storage to provide non-durable block-level storage to the instances running stateless workloads, and while stateless workloads don’t require resilient storage, many application owners still have uptime requirements for these types of workloads. In this post, you will learn how to implement custom logic to provide high availability (HA) for your applications running on Outposts servers using two or more servers for N+1 fault tolerance. The code provided is meant to help you get started, and can be modified further for your unique workload needs.

Overview

In this post, we have provided an init.py script. This script takes your input parameters and creates a custom AWS CloudFormation template that is deployed in the specified account. Users can run “./init.py –-help” or “./init.py -h” to view parameter descriptions. The following input parameters are needed:

Parameter	Description
Launch template ID(s)	This is used to relaunch your EC2 instances on the destination Outposts server in the event of a source server hardware or service link failure. You can specify multiple Launch Template IDs for multiple applications.
Source Outpost ID	This is the Outpost ID of the server actively running your EC2 workload.
Template file	This is the base CloudFormation template. The `init.py` script customizes the `AutoRestartTemplate.yaml` template based on your inputs. Make sure to execute the `init.py` in the file directory that contains the `AutoRestartTemplate.yaml` file.
Stack name	This is the name you’d like to give your CloudFormation stack.
Region	This should be the same AWS Region to which your Outposts servers are anchored.
Notification email	This is the email Amazon Simple Notification Service (SNS) uses to alert you if Amazon CloudWatch detects that your source Outposts server has failed.
Launch template description	This is the description of the launch template(s) used to relaunch your EC2 instances on the destination Outposts server in the event of a source server failure.

After collecting the preceding parameters, the

script generates a CloudFormation template. You are asked to review the template and confirm that it meets your expectations. Once you select yes, the CloudFormation template is deployed in your account, and you can view the stack from your AWS Management Console. You also receive a confirmation email sent to the address specified in the notification email parameter, confirming your subscription to the SNS topic. This SNS topic was created by the CloudFormation stack to alert you if your source Outposts server experiences a hardware or service link failure.

The init.py script and AutoRestartTemplate.yaml CloudFormation template provided in this post is intended to be used to implement custom logic that relaunches EC2 instances running on the source Outposts server to a specified destination Outposts server for improved application availability. This logic works by essentially creating a mapping between the source and destination Outpost, and only works between two Outposts servers. This code can be further customized to meet your application requirements, and is meant to help you get started with implementing custom logic for your Outposts server environment. Now that we have covered the init.py parameters, the intended use case, scope, and limitations of the code provided, read on for more information on the architecture for this solution.

Architecture diagram

This solution is scoped to work for two Outposts servers set up as a resilient pair. For more than two servers running in the same data center, each server would need to be mapped to a secondary server for HA. One server can be the relaunch destination for multiple other servers, as long as Amazon EC2 capacity requirements are met. If both the source and destination Outposts servers are unavailable or experience a failure at the same time, then additional user action is required to resolve. In this case, a notification email is sent to the address specified in the notification email parameter letting you know that the attempted relaunch of your EC2 instances failed.

Figure 1: Amazon EC2 auto-relaunch custom logic on AWS Outposts server architecture.

Input environment parameters required for the CloudFormation template AutoRestartTemplate.yaml. After confirming that the customized template looks correct, agree to allow the init.py script to deploy the CloudFormation stack in your desired AWS account.
The CloudFormation stack is created and deployed in your AWS account with two or more Outposts servers. The CloudFormation stack creates the following resources:

- A CloudWatch alarm to monitor the source Outpost server ConnectedStatus metric;
- An SNS topic that alerts you if your source Outposts server ConnectedStatus shows as down;
- An AWS Lambda function that relaunches the source Outposts server EC2 instances on the destination Outposts server according to the launch template you provided.

A CloudWatch alarm monitors the ConnectedStatus metric of the source Outposts server to detect hardware or service link failure.
If the ConnectedStatus metric shows the source Outposts server service link as down, then a Lambda function coordinates relaunching the EC2 instances on the destination Outposts server according to the launch template that you provided.
In the event of a source Outposts server hardware or service link failure and Amazon EC2 relaunch, Amazon SNS sends a notification to the notification email provided in the init.py script as an environment parameter. You will be notified when the CloudWatch alarm is triggered, and when the automation finishes executing with an execution status included.
The EC2 instances described in your launch template are launched on the destination Outposts server automatically, with no manual action needed.

Now that we’ve covered the architecture and workflow for this solution, read on for step-by-step instructions on how to implement this code in your AWS account.

Prerequisites

The following prerequisites are required to complete the walkthrough:

Python is used to run the init.py script that dynamically creates a CloudFormation stack in the account specified as an input parameter.
Two Outposts servers that can be set up as an active/active or active/passive resilient pair depending on the size of the workload.
Create Launch Templates for the applications you want to protect—make sure that an instance type is selected that is available on your destination Outposts server.
Make sure that you have the credentials needed to programmatically deploy the CloudFormation stack in your AWS account.
If you are setting this up from an Outposts consumer account, you will need to configure CloudWatch cross-account observability between the consumer account and the Outposts owning account to view Outposts metrics.
Download the repository ec2-outposts-autorestart.

Deploying the AutoRestart CloudFormation stack

Make sure that you are in the directory that contains the init.py and AutoRestartTemplate.yaml files. Next, run the following command to execute the init.py file. Note that you may need to change the file permissions to do this. If so, then run “chmod a+x init.py” to give all users execute permissions for this file: ./init.py --launch-template-id <value> --source-outpost-id <value> --template-file AutoRestartTemplate.yaml --stack-name <value> --region <value> --notification-email <value>

After executing the preceding command, the init.py script asks you for a launch template description. Provide a brief description for the launch template that describes to which application it correlates. After that, the init.py script customizes the AutoRestartTemplate.yaml file using the parameter values you entered, and the content of the file is displayed in the terminal for you to verify before confirming everything looks correct.
After verifying the AutoRestartTemplate.yaml file looks correct, enter ‘y’ to confirm. Then, the script deploys a CloudFormation stack in your AWS account using the AutoRestartTemplate.yaml file as its template. It takes a few moments for the stack to deploy, after which it is visible in your AWS account under your CloudFormation console.
Verify the CloudFormation stack is visible in your AWS account.
You receive an email that looks like the preceding example asking you to confirm your subscription to the SNS topic that was created for your CloudWatch alarm. This alarm monitors your Outposts server ConnectedStatus metric. This is a crucial step, without confirming your SNS topic subscription for this alarm, you won’t be notified in the event that your source Outposts server experiences a hardware or service link failure and this relaunch logic is used. Once you have confirmed your email address, the implementation of this Amazon EC2 Auto-Relaunch logic is now complete, and in the event of a service link or source Outposts server failure, your EC2 instances now automatically relaunch on the destination Outposts server subnet you supplied as a parameter in your launch template. You also receive an email notifying you that your source Outpost went down and a relaunch event occurred.

A service link failure is simulated on the source-outpost-a server for the purpose of this post. Within a minute or so of the CloudWatch alarm being triggered, you receive an email alert from the SNS topic to which you subscribed earlier in the post. The email alert looks like the following image:

After receiving this alert, you can navigate to your EC2 Dashboard and view your running instances. There you should see a new instance being launched. It takes a minute or two to finish initializing before showing that both status checks passed:

Now that your EC2 instance(s) has been relaunched on your healthy destination Outposts server, you can start triaging why your source Outposts server experienced a failure without worrying about getting your application(s) back up and running.

Cleaning up

Because this custom logic is implemented through CloudFormation, the only clean up required is to delete the CloudFormation stack from your AWS account. Doing so deletes all resources that were deployed through the CloudFormation stack.

Conclusion

The use of custom logic through AWS tools such as CloudFormation, CloudWatch, and Lambda enables you to architect for HA for stateless workloads on an Outposts server. By implementing the custom logic we walked through in this post, you can automatically relaunch EC2 instances running on a source Outposts server to a secondary destination Outposts server, reducing the downtime of your applications in the event of a hardware or service link failure. The code provided in this post can also be further expanded upon to meet the unique needs of your workload.

Note that, while the use of Infrastructure-as-Code (IaC) can improve your application’s availability and be used to standardize deployments across multiple Outposts servers, it is crucial to do regular failure drills to test the custom logic in place. This helps make sure that you understand your application’s expected behavior on relaunch in the event of a hardware failure. Check out part 2 of this series to learn more about enabling HA on Outposts servers for stateful workloads.

Create a customizable cross-company log lake for compliance, Part I: Business Background

2024-08-01 Colin Carson

Post Syndicated from Colin Carson original https://aws.amazon.com/blogs/big-data/create-a-customizable-cross-company-log-lake-for-compliance-part-i-business-background/

As described in a previous post, AWS Session Manager, a capability of AWS Systems Manager, can be used to manage access to Amazon Elastic Compute Cloud (Amazon EC2) instances by administrators who need elevated permissions for setup, troubleshooting, or emergency changes. While working for a large global organization with thousands of accounts, we were asked to answer a specific business question: “What did employees with privileged access do in Session Manager?”

This question had an initial answer: use logging and auditing capabilities of Session Manager and integration with other AWS services, including recording connections (StartSession API calls) with AWS CloudTrail, and recording commands (keystrokes) by streaming session data to Amazon CloudWatch Logs.

This was helpful, but only the beginning. We had more requirements and questions:

After session activity is logged to CloudWatch Logs, then what?
How can we provide useful data structures that minimize work to read out, delivering faster performance, using more data, with more convenience?
How do we support a variety of usage patterns, such as ongoing system-to-system bulk transfer, or an ad-hoc query by a human for a single session?
How should we share and implement governance?
Thinking bigger, what about the same question for a different service or across more than one use case? How do we add what other API activity happened before or after a connection—in other words, context?

We needed more comprehensive functionality, more customization, and more control than a single service or feature could offer. Our journey began where previous customer stories about using Session Manager for privileged access (similar to our situation), least privilege, and guardrails ended. We had to create something new that combined existing approaches and ideas:

Low-level primitives such as Amazon Simple Storage Service (Amazon S3).
Latest features and approaches of AWS, such as vertical and horizontal scaling in AWS Glue.
Our experience working with legal, audit, and compliance in large enterprise environments.
Customer feedback.

In this post, we introduce Log Lake, a do-it-yourself data lake based on logs from CloudWatch and AWS CloudTrail. We share our story in three parts:

Part 1: Business background – We share why we created Log Lake and AWS alternatives that might be faster or easier for you.
Part 2: Build – We describe the architecture and how to set it up using AWS CloudFormation templates.
Part 3: Add – We show you how to add invocation logs, model input, and model output from Amazon Bedrock to Log Lake.

Do you really want to do it yourself?

Before you build your own log lake, consider the latest, highest-level options already available in AWS–they can save you a lot of work. Whenever possible, choose AWS services and approaches that abstract away undifferentiated heavy lifting to AWS so you can spend time on adding new business value instead of managing overhead. Know the use cases services were designed for, so you have a sense of what they already can do today and where they’re going tomorrow.

If that doesn’t work, and you don’t see an option that delivers the customer experience you want, then you can mix and match primitives in AWS for more flexibility and freedom, as we did for Log Lake.

Session Manager activity logging

As we mentioned in our introduction, you can save logging data to AmazonS3, add a table on top, and query that table using Amazon Athena—this is what we recommend you consider first because it’s straightforward.

This would result in files with the sessionid in the name. If you want, you can process these files into a calendarday, sessionid, sessiondata format using an S3 event notification that invokes a function (and make sure to save it to a different bucket, in a different table, to avoid causing recursive loops). The function could derive the calendarday and sessionid from the S3 key metadata, and sessiondata would be the entire file contents.

Alternatively, you can sign to one log group in CloudWatch logs, have an Amazon Data Firehose subscription filter move that to S3 (this file would have additional metadata in the JSON content and more customization potential from filters). This was used in our situation, but it wasn’t enough by itself.

AWS CloudTrail Lake

CloudTrail Lake is for running queries on events over years of history and with near real-time latency and offers a deeper and more customizable view of events than CloudTrail Event history. CloudTrail Lake enables you to federate an event data store, which lets you view the metadata in the AWS Glue catalog and run Athena queries. For needs involving one organization and ongoing ingesting from a trail (or point-in-time import from Amazon S3, or both), you can consider CloudTrail Lake.

We considered CloudTrail Lake, as either a managed lake option or source for CloudTrail only, but ended up creating our own AWS Glue job instead. This was because of a combination of reasons, including full control over schema and jobs, ability to ingest data from an S3 bucket of our choosing as an ongoing source, fine-grained filtering on account, AWS Region, and eventName (eventName filtering wasn’t supported for management events ), and cost.

The cost of CloudTrail lake based on uncompressed data ingested (data size can be 10 times larger than in Amazon S3) was a factor for our use case. In one test, we found CloudTrail Lake to be 38 times faster to process the same workload as Log Lake, but Log Lake was 10–100 times less costly depending on filters, timing, and account activity. Our test workload was 15.9 GB file size in S3, 199 million events, and 400 thousand files, spread across over 150 accounts and 3 Regions. Filters Log Lake applied were eventname='StartSession', 'AssumeRole', 'AssumeRoleWithSAML', and five arbitrary allow listed accounts. These tests might be different from your use case, so you should do your own testing, gather your own data, and decide for yourself.

Other services

The products mentioned previously are the most relevant to the outcomes we were trying to accomplish, but you should consider security, identity, and compliance products on AWS, too. These products and features can be used either as an alternative to Log Lake or to add functionality.

As an example, Amazon Bedrock can add functionality in three ways:

To skip the search and query Log Lake for you
To summarize across logs
As a source for logs (similar to Session Manager as a source for CloudWatch logs)

Querying means you can have an AI agent query your AWS Glue catalog (such as the Log Lake catalog) for data-based results. Summarizing means you can use generative artificial intelligence (AI) to summarize your text logs from a knowledge base as part of retrieval augmented generation (RAG), to ask questions like “How many log files are exactly the same? Who changed IAM roles last night?” Considerations and limitations apply.

Adding Amazon Bedrock as a source means using invocation logging to collect requests and responses.

Because we wanted to store very large amounts of data frugally (compressed and columnar format, not text) and produce non-generative (data-based) results that can be used for legal compliance and security, we didn’t use Amazon Bedrock in Log Lake—but we will revisit this topic in Part 3 when we detail how to use the approach we used for Session Manager for Amazon Bedrock.

Business background

When we began talking with our business partners, sponsors, and other stakeholders, important questions, problems, opportunities, and requirements emerged.

Why we needed to do this

Legal, security, identity, and compliance authorities of the large enterprise we were working for had created a customer-specific control. To comply with the control objective, use of elevated privileges required a manager to manually review all available data (including any session manager activity) to confirm or deny if use of elevated privileges was justified. This was a compliance use case that, when solved, could be applied to more use cases such as auditing and reporting.

Note on terms:

Here, the customer in customer-specific control means a control that is solely the responsibility of a customer, not AWS, as described in the AWS Shared Responsibility Model.
In this article, we define auditing broadly as testing information technology (IT) controls to mitigate risk, by anyone, at any cadence (ongoing as part of day-to-day operations, or one time only). We don’t refer to auditing that is financial, only conducted by an independent third-party, or only at certain times. We use self-review and auditing interchangeably.
We also define reporting broadly as presenting data for a specific purpose in a specific format to evaluate business performance and facilitate data-driven decisions—such as answering “how many employees had sessions last week?”

The use case

Our first and most important use case was a manager who needed to review activity, such as from an after-hours on-call page the previous night. If the manager needed to have additional discussions with their employee or needed additional time to consider activity, they had up to a week (7 calendar days) before they needed to confirm or deny elevated privileges were needed, based on their team’s procedures. A manager needed to review an entire set of events that all share the same session, regardless of known keywords or specific strings, as part of all available data in AWS. This was the workflow:

Employee uses homegrown application and standardized workflow to access Amazon EC2 with elevated privileges using Session Manager.
API activity in CloudTrail and continuous logging to CloudWatch logs.
The problem space – Data somehow gets procured, processed, and provided (this would become Log Lake later).
Another homegrown system (different from step 1) presents session activity to managers and applies access controls (a manager should only review activity for their own employees, and not be able to peruse data outside their team). This data might be only one StartSession API call and no session details, or might be thousands of lines from cat file
The manager reviews all available activity, makes an informed decision, and confirms or denies if use was justified.

This was an ongoing day-to-day operation, with a narrow scope. First, this meant only data available in AWS; if something couldn’t be captured by AWS, it was out of scope. If something was possible, it should be made available. Second, this meant only certain workflows; using Session Manager with elevated privileges for a specific, documented standard operating procedure.

Avoiding review

The simplest solution would be to block sessions on Amazon EC2 with elevated privileges, and fully automate build and deployment. This was possible for some but not all workloads, because some workloads required initial setup, troubleshooting, or emergency changes of Marketplace AMIs.

Is accurate logging and auditing possible?

We won’t extensively detail ways to bypass controls here, but there are important limitations and considerations we had to consider, and we recommend you do too.

First, logging isn’t available for sessionType Port, which includes SSH. This could be mitigated by ensuring employees can only use a custom application layer to start sessions without SSH. Blocking direct SSH access to EC2 instances using security group policies is another option.

Second, there are many ways to intentionally or accidentally hide or obfuscate activity in a session, making review of a specific command difficult or impossible. This was acceptable for our use case for multiple reasons:

A manager would always know if a session started and needed review from CloudTrail (our source signal). We joined to CloudWatch to meet our all available data requirement.
Continuous streaming to CloudWatch logs would log activity as it happened. Additionally, streaming to CloudWatch Logs supported interactive shell access, and our use case only used interactive shell access (sessionType Standard_Stream). Streaming isn’t supported for sessionType, InteractiveCommands, or NonInteractiveCommands.
The most important workflow to review involved an engineered application with one standard operating procedure (less variety than all the ways Session Manager could be used).
Most importantly, the manager was responsible for reviewing the reports and expected to apply their own judgement and interpret what happened. For example, a manager review could result in a follow up conversation with the employee that could improve business processes. A manager might ask their employee, “Can you help me understand why you ran this command? Do we need to update our runbook or automate something in deployment?”

To protect data against tampering, changes, or deletion, AWS provides tools and features such as AWS Identity and Access Management (IAM) policies and permissions and Amazon S3 Object Lock.

Security and compliance are a shared responsibility between AWS and the customer, and customers need to decide what AWS services and features to use for their use case. We recommend customers consider a comprehensive approach that considers overall system design and includes multiple layers of security controls (defense in depth). For more information, see the Security pillar of the AWS Well-Architected Framework.

Avoiding automation

Manual review can be a painful process, but we couldn’t automate review for two reasons: Legal requirements and to add friction to the feedback loop felt by a manager whenever an employee used elevated privileges, to discourage using elevated privileges.

Works with existing

We had to work with existing architecture, spanning thousands of accounts and multiple AWS Organizations. This meant sourcing data from buckets as an edge and point of ingress. Specifically, CloudTrail data was managed and consolidated outside of CloudTrail, across organizations and trails, into S3 buckets. CloudWatch data was also consolidated to S3 buckets, from Session Manager to CloudWatch Logs, with Amazon Data Firehose subscription filters on CloudWatch Logs pointing to S3. To avoid negative side effects on existing business processes, our business partners didn’t want to change settings in CloudTrail, CloudWatch, and Firehose. This meant Log Lake needed features and flexibility that enabled changes without impacting other workstreams using the same sources.

Event filtering is not a data lake

Before we were asked to help, there were attempts to do event filtering. One attempt tried to monitor session activity using Amazon EventBridge. This was limited to AWS API operations recorded by CloudTrail such as StartSession and didn’t include the information from inside the session, which was in CloudWatch Logs. Another attempt tried event filtering CloudWatch in the form of a subscription filter. Also, an attempt was made using EventBridge Event Bus with EventBridge rules, and storage in Amazon DynamoDB. These attempts didn’t deliver the expected results because of a combination of factors:

Size

Couldn’t accept large session log payloads because of the EventBridge PutEvents limit of 256 KB entry size. Saving large entries to Amazon S3 and using the object URL in the PutEvents entry would avoid this limitation in EventBridge, but wouldn’t pass the most important information the manager needed to review (the event’s sessionData element). This meant managing files and physical dependencies, and losing the metastore benefit of working with data as logical sets and objects.

Storage

Event filtering was a way to process data, not storage or a source of truth. We asked, how do we restore data lost in flight or destroyed after landing? If components are deleted or undergoing maintenance, can we still procure, process, and provide data—at all three layers independently? Without storage, no.

Data quality

No source of truth meant data quality checks weren’t possible. We couldn’t answer questions like: “Did the last job process more than 90 percent of events from CloudTrail in DynamoDB?” or“What percentage are we missing from source to target?”

Anti-patterns

DynamoDB as long-term storage wasn’t the most appropriate data store for large analytical workloads, low I/O, and highly complex many-to-many joins.

Reading out

Deliveries were fast, but work (and time and cost) was needed after delivery. In other words, queries had to do extra work to transform raw data into the needed format at time of read, which had a significant, cumulative effect on performance and cost. Imagine users running a select * from table without any filters on years of data and paying for storage and compute of those queries.

Cost of ownership

Filtering by event contents (sessionData from CloudWatch) required knowledge of session behavior, which was business logic. This meant changes to business logic required changes to event filtering. Imagine being asked to change CloudWatch filters or EventBridge rules based on a business process change, and trying to remember where to make the change, or troubleshoot why expected events weren’t being passed. This meant a higher cost of ownership and slower cycle times at best, and inability to meet SLA and scale at worst.

Accidental coupling

Creates accidental coupling between downstream consumers and low-level events. Consumers who directly integrate against events might get different schemas at different times for the same events, or events they don’t need. There’s no way to manage data at a higher level than event, at the level of sets (like all events for one sessionid), or at the object level (a table designed for dependencies). In other words, there was no metastore layer that separated the schema from the files, like in a data lake.

More sources (data to load in)

There were other, less important use cases that we wanted to expand to later: inventory management and security.

For inventory management, such as identifying EC2 instances running a Systems Manager agent that’s missing a patch, finding IAM users with inline policies, or finding Redshift clusters with nodes that aren’t RA3. This data would come from AWS Config unless it isn’t a supported resource type. We cut inventory management from scope because AWS Config data could be added to an AWS Glue catalog later, and queried from Athena using an approach like the one described in How to query your AWS resource configuration states using AWS Config and Amazon Athena.

For security, Splunk and OpenSearch were already in use for serviceability and operational analysis, sourcing files from Amazon S3. Log Lake is a complementary approach sourcing from the same data, which adds metadata and simplified data structures at the cost of latency. For more information about having different tools analyze the same data, see Solving big data problems on AWS.

More use cases (reasons to read out)

We knew from the first meeting that this was a bigger opportunity than just building a dataset for sessions from Systems Manager for manual manager review. Once we had procured logs from CloudTrail and CloudWatch, set up Glue jobs to process logs into convenient tables, and were able to join across these tables, we could change filters and configuration settings to answer questions about additional services and use cases, too. Similar to how we process data for Session Manager, we could expand the filters on Log Lake’s Glue jobs, and add data for Amazon Bedrock model invocation logging. For other use cases, we could use Log Lake as a source for automation (rules-based or ML), deep forensic investigations, or string-match searches (such as IP addresses or user names).

Additional technical considerations

*How did we define session? We would always know if a session started from StartSession event in CloudTrail API activity. Regarding when a session ended, we did not use TerminateSession because this was not always present and we considered this domain-specific logic. Log Lake enabled downstream customers to decide how to interpret the data. For example, our most important workflow had a Systems Manager timeout of 15 minutes, and our SLA was 90 minutes. This meant managers knew a session with a start time more than 2 hours prior to the current time was already ended.

*CloudWatch data required additional processing compared to CloudTrail, because CloudWatch logs from Firehose were saved in gzip format without gz suffix and had multiple JSON documents in the same line that needed to be processed to be on separate lines. Firehose can transform and convert records, such as invoking a Lambda function to transform, convert JSON to ORC, and decompress data, but our business partners didn’t want to change existing settings.

How to get the data (a deep dive)

To support the dataset needed for a manager to review, we needed to identify API-specific metadata (time, event source, and event name), and then join it to session data. CloudTrail was necessary because it was the most authoritative source for AWS API activity, specifically StartSession and AssumeRole and AssumeRoleWithSAML events, and contained context that didn’t exist in CloudWatch Logs (such as the error code AccessDenied) which could be useful for compliance and investigation. CloudWatch was necessary because it contained the keystrokes in a session, in the CloudWatch log’s sessionData element. We needed to obtain the AWS source of record from CloudTrail, but we recommend you check with your authorities to confirm you really need to join to CloudTrail. We mention this in case you hear this question “why not derive some sort of earliest eventTime from CloudWatch logs, and skip joining to CloudTrail entirely? That would cut size and complexity by half.”

To join CloudTrail (eventTime, eventname, errorCode, errorMessage, and so on) with CloudWatch (sessionData), we had to do the following:

Get the higher level API data from CloudTrail (time, event source, and event name), as the authoritative source for auditing Session Manager. To get this, we needed to look inside all CloudTrail logs and get only the rows with eventname=‘StartSession’ and eventsource=‘ssm.amazonaws.com’ (events from Systems Manager)—our business partners described this as looking for a needle in a haystack, because this could be only one session event across millions or billions of files. After we obtained this metadata, we needed to extract the sessionid to know what session to join it to, and we chose to extract sessionid from responseelements. Alternatively, we could use useridentity.sessioncontext.sourceidentity if a principal provided it while assuming a role (requires sts:SetSourceIdentity in the role trust policy).

Sample of a single record’s responseelements.sessionid value: "sessionid":"theuser-thefederation-0b7c1cc185ccf51a9"

The actual sessionid was the final element of the logstream: 0b7c1cc185ccf51a9.

Next we needed to get all logs for a single session from CloudWatch. Similarly to CloudTrail, we needed to look inside all CloudWatch logs landing in Amazon S3 from Firehose to identify only the needles that contained "logGroup":"/aws/ssm/sessionlogs". Then, we could get sessionid from logstream or sessionId, and get session activity from the message.sessionData.

Sample of a single record’s logStream element: "sessionId": "theuser-thefederation-0b7c1cc185ccf51a9"

Note: Looking inside the log isn’t always necessary. We did it because we had to work with existing logs Firehose put to Amazon S3, which didn’t have the logstream (and sessionid) in the file name. For example, a file from Firehose might have a name like

cloudwatch-logs-otherlogs-3-2024-03-03-22-22-55-55239a3d-622e-40c0-9615-ad4f5d4381fa

If we were able to use the ability of Session Manager to send to S3 directly, the file name in S3 is the loggroup (theuser-thefederation-0b7c1cc185ccf51a9.dms)and could be used to derive sessionid without looking inside the file.

Downstream of Log Lake, consumers could join on sessionid which was derived in the previous step.

What’s different about Log Lake

If you remember one thing about Log Lake, remember this: Log Lake is a data lake for compliance-related use cases, uses CloudTrail and CloudWatch as data sources, has separate tables for writing (original raw) and reading (read-optimized or readready), and gives you control over all components so you can customize it for yourself.

Here are some of the signature qualities of Log Lake:

Legal, identity, or compliance use cases

This includes deep dive forensic investigation, meaning use cases that are large volume, historical, and analytical. Because Log Lake uses Amazon S3, it can meet regulatory requirements that require write-once-read-many (WORM) storage.

AWS Well-Architected Framework

Log Lake applies real-world, time-tested design principles from the AWS Well-Architected Framework. This includes, but is not limited to:

Using the technology approach that best aligns with your goals (Performance Efficiency pillar).
Horizontal scaling, workload utilization, avoiding resource saturation, and recovering from failure (Reliability pillar). There were two scenarios we especially wanted to enable that didn’t exist when we began: regular run” jobs that ran at a predefined cadence with auto backfill ability (a failure of one or more jobs could be recovered by a subsequent job), and large backfills (hours or more). Another way we implemented resiliency was timeouts and retries in both file processing (AddAPart function) and metadata processing (AWS Glue jobs). If you want more resiliency you can add additional independent replicas of functionality, such as jobs, storage, or delivery doors, as part of a cell-based architecture.
Anticipating and learning from failure and implementing observability (Operational Excellence pillar). For jobs, this meant monitoring and alarming on workload utilization and timeouts. For storage this meant CloudWatch metrics for Amazon S3.

Operational Excellence also meant knowing service quotas, performing workload testing, and defining and documenting runbook processes. If we hadn’t tried to break something to see where the limit is, then we considered it untested and inappropriate for production use. To test, we would determine the highest single day volume we’d seen in the past year, and then run that same volume in an hour to see if (and how) it would break.

High-Performance, Portable Partition Adding (AddAPart)

Log Lake adds partitions to tables using Lambda functions with SQS, a pattern we call AddAPart. This uses Amazon Simple Query Service (SQS) to decouple triggers (files landing in Amazon S3) from actions (associating that file with metastore partition). Think of this as having four F’s:

Fair (first in, first out)
Flat (idempotent logic to deduplicate and avoid unnecessary API calls from different files triggering the same action)
File Filtering (on name).

This means no AWS Glue crawlers, no alter table or msck repair table to add partitions in Athena, and can be reused across sources and buckets. The management of partitions in Log Lake makes using partition-related features available in AWS Glue, including AWS Glue partition indexes and workload partitioning and bounded execution.

File name filtering uses the same central controls for lower cost of ownership, faster changes, troubleshooting from one location, and emergency levers—this means that if you want to avoid log recursion happening from a specific account, or want to exclude a Region because of regulatory compliance, you can do it in one place, managed by your change control process, before you pay for processing in downstream jobs.

If you want to tell a team, “onboard your data source to our log lake, here are the steps you can use to self-serve,” you can use AddAPart to do that. We describe this in Part 2.

Readready Tables

In Log Lake, data structures offer differentiated value to users, and original raw data isn’t directly exposed to downstream users by default. For each source, Log Lake has a corresponding read-optimized readready table.

Instead of this:

from_cloudtrail_raw

from_cloudwatch_raw

Log Lake exposes only these to users:

from_cloudtrail_readready

from_cloudwatch_readready

In Part 2, we describe these tables in detail. Here are our answers to frequently asked questions about readready tables:

Q: Doesn’t this have an up-front cost to process raw into readready? Why not pass the work (and cost) to downstream users?

A: Yes, and for us the cost of processing partitions of raw into readready happened once and was fixed, and was offset by the variable costs of querying, which was from many company-wide callers (systemic and human), with high frequency, and large volume.

Q: How much better are readready tables in terms of performance, cost, and convenience? How do you achieve these gains? How do you measure “convenience”?

A: In most tests, readready tables are 5–10 times faster to query and more than 2 times smaller in Amazon S3. Log Lake applies more than one technique: omitting columns, partition design, AWS Glue partition indexes, data types (readready tables don’t allow any nested complex data types within a column, such as struct<struct>), columnar storage (ORC), and compression (ZLIB). We measure convenience as the amount of operations required to join on a sessionid; using Log Lake’s readready tables this is 0 (zero).

Q: Do raw and readready use the same files or buckets?

A: No, files and buckets are not shared. This decouples writes from reads, improves both write and read performance, and adds resiliency.

This question is important when designing for large sizes and scaling, because a single job or downstream read alone can span millions of files in Amazon S3. S3 scaling doesn’t happen immediately, so queries against raw or original data involving many tiny JSON files can cause S3 503 errors when it exceeds 5,500 GET/HEAD per second. More than one bucket helps avoid resource saturation. There is another option that we didn’t have when we created Log Lake: S3 Express One Zone. For reliability, we still recommend not putting all your files in one bucket. Also, don’t forget to filter your data.

Customization and control

You can customize and control all components (columns or schema, data types, compression, job logic, job schedule, and so on) because Log Lake is built using AWS primitives—such as Amazon SQS and Amazon S3—for the most comprehensive combination of features with the most freedom to customize. If you want to change something, you can.

From mono to many

Rather than one large, monolithic lake that is tightly coupled to other systems, Log Lake is just one node in a larger network of distributed data products across different data domains—this concept is data mesh. Just like the AWS APIs it is built on, Log Lake abstracts away heavy lifting and enables users to move faster, more efficiently, and not wait for centralized teams to make changes. Log Lake does not try to cover all use cases—instead, Log Lake’s data can be accessed and consumed by domain-specific teams, empowering business experts to self-serve.

When you need more flexibility and freedom

As builders, sometimes you want to dissect a customer experience, find problems, and figure out ways to make it better. That means going a layer down to mix and match primitives together to get more comprehensive features and more customization, flexibility, and freedom.

We built Log Lake for our long-term needs, but it would have been easier in the short-term to save Session Manager logs to Amazon S3 and query them with Athena. If you have considered what already exists in AWS, and you’re sure you need more comprehensive abilities or customization, read on to Part 2: Build, which explains Log Lake’s architecture and how you can set it up.

If you have feedback and questions, let us know in the comments section.

References

About the authors

Colin Carson is a Data Engineer at AWS ProServe. He has designed and built data infrastructure for multiple teams at Amazon, including Internal Audit, Risk & Compliance, HR Hiring Science, and Security.

Sean O’Sullivan is a Cloud Infrastructure Architect at AWS ProServe. He has over 8 years industry experience working with customers to drive digital transformation projects, helping architect, automate, and engineer solutions in AWS.