Tag Archives: Amazon CloudWatch

Amazon S3 Storage Lens adds performance metrics, support for billions of prefixes, and export to S3 Tables

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/amazon-s3-storage-lens-adds-performance-metrics-support-for-billions-of-prefixes-and-export-to-s3-tables/

Today, we’re announcing three new capabilities for Amazon S3 Storage Lens that give you deeper insights into your storage performance and usage patterns. With the addition of performance metrics, support for analyzing billions of prefixes, and direct export to Amazon S3 Tables, you have the tools you need to optimize application performance, reduce costs, and make data-driven decisions about your Amazon S3 storage strategy.

New performance metric categories
S3 Storage Lens now includes eight new performance metric categories that help identify and resolve performance constraints across your organization. These are available at organization, account, bucket, and prefix levels. For example, the service helps you identify small objects in a bucket or prefix that can  slow down application performance. This can be mitigated by batching small objects or using the Amazon S3 Express One Zone storage class for higher performance small object workloads.

To access the new performance metrics, you need to enable performance metrics in the S3 Storage Lens advanced tier when creating a new Storage Lens dashboard or editing an existing configuration.

Metric category Details Use case Mitigation
Read request size Distribution of read request sizes (GET) by day Identify dataset with small read request patterns that slow down performance Small request: Batch small objects or use Amazon S3 Express One Zone for high-performance small object workloads
Write request size Distribution of write request sizes (PUT, POST, COPY, and UploadPart) by day Identify dataset with small write request patterns that slow down performance Large request: Parallelize requests, use MPU or use AWS CRT
Storage size Distribution of object sizes Identify dataset with small small objects that slow down performance Small object sizes: Consider bundling small objects
Concurrent PUT 503 errors Number of 503s due to concurrent PUT operation on same object Identify prefixes with concurrent PUT throttling that slow down performance For single writer, modify retry behavior or use Amazon S3 Express One Zone. For multiple writers, use consensus mechanism or use Amazon S3 Express One Zone
Cross-Region data transfer Bytes transferred and requests sent across Region, in Region Identify potential performance and cost degradation due to cross-Region data access Co-locate compute with data in the same AWS Region
Unique objects accessed Number or percentage of unique objects accessed per day Identify datasets where small subset of objects are being frequently accessed. These can be moved to higher performance storage tier for better performance Consider moving active data to Amazon S3 Express One Zone or other caching solutions
FirstByteLatency (existing Amazon CloudWatch metric) Daily average of first byte latency metric The daily average per-request time from the complete request being received to when the response starts to be returned
TotalRequestLatency (existing Amazon CloudWatch metric) Daily average of Total Request Latency The daily average elapsed per request time from the first byte received to the last byte sent

How it works
On the Amazon S3 console I choose Create Storage Lens dashboard to create a new dashboard. You can also edit an existing dashboard configuration. I then configure general settings such as providing a Dashboard name, Status, and the optional Tags. Then, I choose Next.


Next, I define the scope of the dashboard by selecting Include all Regions and Include all buckets and specifying the Regions and buckets to be included.


I opt in to the Advanced tier in the Storage Lens dashboard configuration, select Performance metrics, then choose Next.


Next, I select Prefix aggregation as an additional metrics aggregation, then leave the rest of the information as default before I choose Next.


I select the Default metrics report, then General purpose bucket as the bucket type, and then select the Amazon S3 bucket in my AWS account as the Destination bucket. I leave the rest of the information as default, then select Next.


I review all the information before I choose Submit to finalize the process.


After it’s enabled, I’ll receive daily performance metrics directly in the Storage Lens console dashboard. You can also choose to export report in CSV or Parquet format to any bucket in your account or publish to Amazon CloudWatch. The performance metrics are aggregated and published daily and will be available at multiple levels: organization, account, bucket, and prefix. In this dropdown menu, I choose the % concurrent PUT 503 error for the Metric, Last 30 days for the Date range, and 10 for the Top N buckets.


The Concurrent PUT 503 error count metric tracks the number of 503 errors generated by simultaneous PUT operations to the same object. Throttling errors can degrade application performance. For a single writer, modify retry behavior or use higher performance storage tier such as Amazon S3 Express One Zone to mitigate concurrent PUT 503 errors. For multiple writers scenario, use a consensus mechanism to avoid concurrent PUT 503 errors or use higher performance storage tier such as Amazon S3 Express One Zone.

Complete analytics for all prefixes in your S3 buckets
S3 Storage Lens now supports analytics for all prefixes in your S3 buckets through a new Expanded prefixes metrics report. This capability removes previous limitations that restricted analysis to prefixes meeting a 1% size threshold and a maximum depth of 10 levels. You can now track up to billions of prefixes per bucket for analysis at the most granular prefix level, regardless of size or depth.

The Expanded prefixes metrics report includes all existing S3 Storage Lens metric categories: storage usage, activity metrics (requests and bytes transferred), data protection metrics, and detailed status code metrics.

How to get started
I follow the same steps outlined in the How it works section to create or update the Storage Lens dashboard. In Step 4 on the console, where you select export options, you can select the new Expanded prefixes metrics report. Thereafter, I can export the expanded prefixes metrics report in CSV or Parquet format to any general purpose bucket in my account for efficient querying of my Storage Lens data.


Good to know
This enhancement addresses scenarios where organizations need granular visibility across their entire prefix structure. For example, you can identify prefixes with incomplete multipart uploads to reduce costs, track compliance across your entire prefix structure for encryption and replication requirements, and detect performance issues at the most granular level.

Export S3 Storage Lens metrics to S3 Tables
S3 Storage Lens metrics can now be automatically exported to S3 Tables, a fully managed feature on AWS with built-in Apache Iceberg support. This integration provides daily automatic delivery of metrics to AWS managed S3 Tables for immediate querying without requiring additional processing infrastructure.

How to get started
I start by following the process outlined in Step 5 on the console, where I choose the export destination. This time, I choose Expanded prefixes metrics report. In addition to General purpose bucket, I choose Table bucket.

The new Storage Lens metrics are exported to new tables in an AWS managed bucket aws-s3.


I select the expanded_prefixes_activity_metrics table to view API usage metrics for expanded prefix reports.


I can preview the table on the Amazon S3 console or use Amazon Athena to query the table.


Good to know
S3 Tables integration with S3 Storage Lens simplifies metric analysis using familiar SQL tools and AWS analytics services such as Amazon Athena, Amazon QuickSight, Amazon EMR, and Amazon Redshift, without requiring a data pipeline. The metrics are automatically organized for optimal querying, with custom retention and encryption options to suit your needs.

This integration enables cross-account and cross-Region analysis, custom dashboard creation, and data correlation with other AWS services. For example, you can combine Storage Lens metrics with S3 Metadata to analyze prefix-level activity patterns and identify objects in prefixes with cold data that are eligible for transition to lower-cost storage tiers.

For your agentic AI workflows, you can use natural language to query S3 Storage Lens metrics in S3 Tables with the S3 Tables MCP Server. Agents can ask questions such as ‘which buckets grew the most last month?’ or ‘show me storage costs by storage class’ and get instant insights from your observability data.

Now available
All three enhancements are available in all AWS Regions where S3 Storage Lens is currently offered (except the China Regions and AWS GovCloud (US)).

These features are included in the Amazon S3 Storage Lens Advanced tier at no additional charge beyond standard advanced tier pricing. For the S3 Tables export, you pay only for S3 Tables storage, maintenance, and queries. There is no additional charge for the export functionality itself.

To learn more about Amazon S3 Storage Lens performance metrics, support for billions of prefixes, and export to S3 Tables, refer to the Amazon S3 user guide. For pricing details, visit the Amazon S3 pricing page.

Veliswa Boya.

Amazon CloudWatch introduces unified data management and analytics for operations, security, and compliance

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/amazon-cloudwatch-introduces-unified-data-management-and-analytics-for-operations-security-and-compliance/

Today we’re expanding Amazon CloudWatch capabilities to unify and manage log data across operational, security, and compliance use cases with flexible and powerful analytics in one place and with reduced data duplication and costs.

This enhancement means that CloudWatch can automatically normalize and process data to offer consistency across sources with built-in support for Open Cybersecurity Schema Framework (OCSF) and Open Telemetry (OTel) formats, so you can focus on analytics and insights. CloudWatch also introduces Apache Iceberg compatible access to your data through Amazon Simple Storage Service (Amazon S3) Tables, so that you can run analytics, not only locally but also using Amazon Athena, Amazon SageMaker Unified Studio, or any other Iceberg-compatible tool.

You can also correlate your operational data in CloudWatch with other business data from your preferred tools to correlate with other data. This unified approach streamlines management and provides comprehensive correlation across security, operational, and business use cases.

Here are the detailed enhancements:

  • Streamline data ingestion and normalization – CloudWatch automatically collects AWS vended logs across accounts and AWS Regions, integrating with AWS Organizations from AWS services including AWS CloudTrail, Amazon Virtual Private Cloud (Amazon VPC) Flow Logs, AWS WAF access logs, Amazon Route 53 resolver logs, and pre-built connectors for third-party sources such as endpoint (CrowdStrike, SentinelOne), identity (Okta, Entra ID), cloud security (Wiz), network security (Zscaler, Palo Alto Networks), productivity and collaboration (Microsoft Office 365, Windows Event Logs, and GitHub), along with IT service manager with ServiceNow CMBD. To normalize and process your data as they are being ingested, CloudWatch offers managed OCSF conversion for various AWS and third-party data sources and other processors such ad Grok for custom parsing, field-level operations, and string manipulations.
  • Reduce costly log data management – CloudWatch consolidates log management into a single service with built-in governance capabilities without storing and maintaining multiple copies of the same data across different tools and data stores. The unified data store of CloudWatch eliminates the need for complex ETL pipelines and reduces your operational costs and management overhead needed to maintain multiple separate data stores and tools.
  • Discover business insights from log data – You can run queries in CloudWatch using natural language queries and popular query languages such as LogsQL, PPL, and SQL through a single interface, or query your data using your preferred analytics tools through Apache Iceberg-compatible tables. The new Facets interface gives you intuitive filtering by source, application, account, region, and log type, which you can use to run queries across log groups of multiple AWS accounts and Regions with intelligent parameter inference.

In the next sections we explore the new log management and analytics features of the CloudWatch Logs!

1. Data discovery and management by data sources and types

You can see a high-level overview of logs and all data sources with a new Logs Management View in the CloudWatch console. To get started, go to the CloudWatch console and choose Log Management under the Logs menu in the left navigation pane. In the Summary tab, you can observe your logs data sources and types, insights into how your log groups are doing across ingestion, and anomalies.

Choose the Data sources tab to find and manage your log data by data sources, types, and fields. CloudWatch ingests and automatically categorizes data sources by AWS services, third-party, or custom sources such as application logs.

Choose the Data source actions to integrate S3 Tables to make future logs for selected data sources. You have the flexibility to analyze the logs through Athena and Amazon Redshift and other query engines such as Spark using Iceberg compatible access patterns. With this integration, logs from CloudWatch are available in a read-only aws-cloudwatch S3 Tables bucket.

When you choose a specific data source such as CloudTrail data, you can view the details of the data source that includes information regarding data format, pipeline, facets/field indexes, S3 Tables association, and the number of logs with that data source. You can observe all log groups included in this data source and type and edit a source/type field index policy using the new schema support.

To learn more about how to manage your data sources and index policy, visit Data sources in the Amazon CloudWatch Logs User Guide.

2. Ingestion and transformation using CloudWatch pipelines

You can create pipelines to streamline collecting, transforming, and routing telemetry and security data while standardizing data formats to optimize observability and security data management. The new pipeline feature of CloudWatch connects data from a catalogue of data sources, so that you can add and configure pipeline processors from a library to parse, enrich, and standardize data.

In the Pipeline tab, choose Add pipeline. It shows you the pipeline configuration wizard. This wizard guides you through five steps where you can choose the data source and other source details such as log source types, configure destination, configure up to 19 processors to perform an action on your data (such as filtering, transforming, or enriching), and finally review and deploy the pipeline.

You also have the option to create pipelines through the new Ingestion experience in CloudWatch. To learn more about how to set up and manage the pipelines, visit Pipelines in the Amazon CloudWatch Logs User Guide.

3. Enhanced analytics and querying based on data sources

You can enhance analytics with support for Facets and querying based on data sources. Facets enable interactive exploration and drill-down into logs and their values are automatically extracted based on the selected time period.

Choose the Facets tab in the Log Insights under the Logs menu in the left navigation pane. You can view available facets and values that appear in the panel. Choose one or more facets and values to interactively explore your data. I choose Facets regarding a VPC Flow Logs group and action, query to list the five most frequent patterns in my VPC Flow Logs through the AI query generator, and get the result patterns.

You can save your query with the selected Facets and values that you have specified. When you next choose your saved query, the logs to be queried have the pre-specified facets and values. To learn more about Facet management, visit Facets in the CloudWatch Logs User Guide.

As I previously noted, you can integrate data sources into S3 Tables and query together. For example, using a Query Editor in Athena, you can query correlates network traffic with AWS API activity from a specific IP range (174.163.137.*) by joining VPC Flow Logs with CloudTrail logs based on matching source IP addresses.

This type of integrated search is particularly valuable for security monitoring, incident investigation, and suspicious behavior detection. You can view if an IP that’s making network connections is also performing sensitive AWS operations such as creating users, modifying security groups, or accessing data.

To learn more, visit S3 Tables integration with CloudWatch in the CloudWatch Logs User Guide.

Now available
New log management features of Amazon CloudWatch are available today in all AWS Regions except the AWS GovCloud (US) Regions and China Regions. For Regional availability and future roadmap, visit the AWS Capabilities by Region. There are no upfront commitments or minimum fees, and you pay for the usage of existing CloudWatch Logs for data ingestion, storage, and queries. To learn more, visit the CloudWatch pricing page.

Give it a try in the CloudWatch console. To learn more, visit the CloudWatch product page and send feedback to AWS re:Post for CloudWatch Logs or through your usual AWS Support contacts.

Channy

Monitor network performance and traffic across your EKS clusters with Container Network Observability

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/monitor-network-performance-and-traffic-across-your-eks-clusters-with-container-network-observability/

Organizations are increasingly expanding their Kubernetes footprint by deploying microservices to incrementally innovate and deliver business value faster. This growth places increased reliance on the network, giving platform teams exponentially complex challenges in monitoring network performance and traffic patterns in EKS. As a result, organizations struggle to maintain operational efficiency as their container environments scale, often delaying application delivery and increasing operational costs.

Today, I’m excited to announce Container Network Observability in Amazon Elastic Kubernetes Service (Amazon EKS), a comprehensive set of network observability features in Amazon EKS that you can use to better measure your network performance in your system and dynamically visualize the landscape and behavior of network traffic in EKS.

Here’s a quick look at Container Network Observability in Amazon EKS:

Container Network Observability in EKS addresses observability challenges by providing enhanced visibility of workload traffic. It offers performance insights into network flows within the cluster and those with cluster-external destinations. This makes your EKS cluster network environment more observable while providing built-in capabilities for more precise troubleshooting and investigative efforts.

Getting started with Container Network Observability in EKS

I can enable this new feature for a new or existing EKS cluster. For a new EKS cluster, during the Configure observability setup, I navigate to the Configure network observability section. Here, I select Edit container network observability. I can see there are three included features: Service map, Flow table, and Performance metric endpoint, which are enabled by Amazon CloudWatch Network Flow Monitor.

On the next page, I need to install the AWS Network Flow Monitor Agent.

After it’s enabled, I can navigate to my EKS cluster and select Monitor cluster.

This will bring me to my cluster observability dashboard. Then, I select the Network tab.


Comprehensive observability features
Container Network Observability in EKS provides several key features, including performance metrics, service map, and flow table with three views: AWS service view, cluster view, and external view.

With Performance metrics, you can now scrape network-related system metrics for pods and worker nodes directly from the Network Flow Monitor agent and send them to your preferred monitoring destination. Available metrics include ingress/egress flow counts, packet counts, bytes transferred, and various allowance exceeded counters for bandwidth, packets per second, and connection tracking limits. The following screenshot shows an example of how you can use Amazon Managed Grafana to visualize the performance metrics scraped using Prometheus.


With the Service map feature, you can dynamically visualize intercommunication between workloads in your cluster, making it straightforward to understand your application topology with a quick look. The service map helps you quickly identify performance issues by highlighting key metrics such as retransmissions, retransmission timeouts, and data transferred for network flows between communicating pods.

Let me show you how this works with a sample e-commerce application. The service map provides both high-level and detailed views of your microservices architecture. In this e-commerce example, we can see three core microservices working together: the GraphQL service acts as an API gateway, orchestrating requests between the frontend and backend services.

When a customer browses products or places an order, the GraphQL service coordinates communication with both the products service (for catalog data, pricing, and inventory) and the orders service (for order processing and management). This architecture allows each service to scale independently while maintaining clear separation of concerns.

For deeper troubleshooting, you can expand the view to see individual pod instances and their communication patterns. The detailed view reveals the complexity of microservices communication. Here, you can see multiple pod instances for each service and the network of connections between them.

This granular visibility is crucial for identifying issues like uneven load distribution, pod-to-pod communication bottlenecks, or when specific pod instances are experiencing higher latency. For example, if one GraphQL pod is making disproportionately more calls to a particular products pod, you can quickly spot this pattern and investigate potential causes.

Use the Flow table to monitor the top talkers across Kubernetes workloads in your cluster from three different perspectives, each providing unique insights into your network traffic patterns.

Flow table – Monitor the top talkers across Kubernetes workloads in your cluster from three different perspectives, each providing unique insights into your network traffic patterns:

  • AWS service view shows which workloads generate the most traffic to Amazon Web Services (AWS) services such as Amazon DynamoDB and Amazon Simple Storage Service (Amazon S3), so you can optimize data access patterns and identify potential cost optimization opportunities.
  • The Cluster view reveals the heaviest communicators within your cluster (east-west traffic), which means you can spot chatty microservices that might benefit from optimization or colocation strategies
  • External viewidentifies workloads with the highest traffic to destinations outside AWS (internet or on premises), which is useful for security monitoring and bandwidth management.

The flow table provides detailed metrics and filtering capabilities to analyze network traffic patterns. In this example, we can see the flow table displaying cluster view traffic between our e-commerce services. The table shows that the orders pod is communicating with multiple products pods, transferring amounts of data. This pattern suggests the orders service is making frequent product lookups during order processing.

The filtering capabilities are useful for troubleshooting, for example, to focus on traffic from a specific orders pod. This granular filtering helps you quickly isolate communication patterns when investigating performance issues. For instance, if customers are experiencing slow checkout times, you can filter to see if the orders service is making too many calls to the products service, or if there are network bottlenecks between specific pod instances.

Additional things to know
Here are key points to note about Container Network Observability in EKS:

  • Pricing – For network monitoring, you pay standard Amazon CloudWatch Network Flow Monitor pricing.
  • Availability – Container Network Observability in EKS is available in all commercial AWS regions where Amazon CloudWatch Network Flow Monitor is available.
  • Export metrics to your preferred monitoring solution – Metrics are available in OpenMetrics format, compatible with Prometheus and Grafana. For configuration details, refer to Network Flow Monitor documentation.

Get started with Container Network Observability in Amazon EKS today to improve network observability in your cluster.

Happy building!
Donnie

Monitor, analyze, and manage capacity usage from a single interface with Amazon EC2 Capacity Manager

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/monitor-analyze-and-manage-capacity-usage-from-a-single-interface-with-amazon-ec2-capacity-manager/

Today, I’m happy to announce Amazon EC2 Capacity Manager, a centralized solution to monitor, analyze, and manage capacity usage across all accounts and AWS Regions from a single interface. This service aggregates capacity information with hourly refresh rates and provides prioritized optimization opportunities, streamlining capacity management workflows that previously required custom automation or manual data collection from multiple AWS services.

Organizations using Amazon Elastic Compute Cloud (Amazon EC2) at scale operate hundreds of instance types across multiple Availability Zones and accounts, using On-Demand Instances, Spot Instances, and Capacity Reservations. This complexity means customers currently access capacity data through various AWS services including the AWS Management Console, Cost and Usage Reports, Amazon CloudWatch, and EC2 describe APIs. This distributed approach can create operational overhead through manual data collection, context switching between tools, and the need for custom automation to aggregate information for capacity optimization analysis.

EC2 Capacity Manager helps you overcome these operational complexities by consolidating all capacity data into a unified dashboard. You can now view cross-account and cross-Region capacity metrics for On-Demand Instances, Spot Instances, and Capacity Reservations across all commercial AWS Regions from a single location, eliminating the need to build custom data collection tools or navigate between multiple AWS services.

This consolidated visibility can help you discover cost savings by highlighting underutilized Capacity Reservations, analyzing usage patterns across instance types, and providing insights into Spot Instance interruption patterns. By having access to comprehensive capacity data in one place, you can make more informed decisions about rightsizing your infrastructure and optimizing your EC2 spending.

Let me show you the capabilities of EC2 Capacity Manager in detail.

Getting started with EC2 Capacity Manager
On the AWS Management Console, I navigate to Amazon EC2 and select Capacity Manager from the navigation pane. I enable EC2 Capacity Manager through the service settings. The service aggregates historical data from the previous 14 days during initial setup.

The main Dashboard displays capacity utilization across all instance types through a comprehensive overview section that presents key metrics at a glance. The capacity overview cards for Reservations, Usage, and Spot show trend indicators and percentage changes to help you identify capacity patterns quickly. You can apply filtering through the date filter controls, which include date range selection, time zone configuration, and interval settings.

You can select different units to analyze data by vCPUs, instance counts, or estimated costs to understand resource consumption patterns. Estimated costs are based on published On-Demand rates and do not include Savings Plans or other discounts. This pricing reference helps you compare the relative impact of underutilized capacity across different instance types—for example, 100 vCPU hours of unused p5 reservations represents a larger cost impact than 100 vCPU hours of unused t3 reservations.

The dashboard includes detailed Usage metrics with both total usage visualization and usage over time charts. The total usage section shows the breakdown between reserved usage, unreserved usage, and Spot usage. The usage over time chart provides visualization that tracks capacity trends over time, helping you identify usage patterns and peak demand periods.

Under Reservation metrics, Reserved capacity trends visualizes used and unused reserved capacity across the selected period, showing the proportion of reserved vCPU hours that remain unutilized compared with those actively consumed, helping you track reservation efficiency patterns and identify periods of consistent low utilization. This visibility can help you reduce costs by identifying underutilized reservations and helping you to make informed decisions about capacity adjustments.

The Unused capacity section lists underutilized capacity reservations by instance type and Availability Zone combinations, displaying specific utilization percentages and instance types across different Availability Zones. This prioritized list helps you identify potential savings with direct visibility into unused capacity costs.

The Usage tab provides detailed historical trends and usage statistics across all AWS Regions for Spot Instances, On-Demand Instances, Capacity Reservations, Reserved Instances, and Savings Plans. Dedicated Hosts usage is not included. The Dimension filter helps you group by and filter capacity data by Account ID, Region, Instance Family, Availability Zone, and Instance Type, creating custom views that reveal usage patterns across your accounts and AWS Organizations. This helps you analyze specific configurations and compare performance across accounts or Regions.

The Aggregations section provides a comprehensive usage table across EC2 and Spot Instances. You can select different units to analyze data by vCPUs, instance counts, or estimated costs to understand resource consumption patterns. The table shows instance family breakdowns with total usage statistics, reserved usage hours, unreserved usage hours, and Spot usage data. Each row includes a View breakdown action for a detailed analysis.

The Capacity usage or estimated cost trends section visualizes usage trends, reserved usage, unreserved usage, and Spot usage. You can filter the displayed data and adjust the unit of measurement to view historical patterns. These filtering and analysis tools help you identify usage trends, compare costs across dimensions, and make informed decisions for capacity planning and optimization.

When you choose View breakdown from the Aggregations table, you access detailed Usage breakdown based on the dimension filters you selected. This breakdown view shows usage patterns for individual instance types within the selected family and Availability Zone combinations, helping you identify specific optimization opportunities.

The Reservations tab displays capacity reservation utilization with automated analysis capabilities that generate prioritized lists of optimization opportunities. Similar to the Usage tab, you can apply dimension filters by Account ID, Region, Instance Family, Availability Zone, and Instance Type along with additional options related to the reservation details. On each of the tabs you can drill down to see data for individual line items. For reservations specifically, you can view specific reservations and access detailed information about On-Demand Capacity Reservations (ODCRs), including utilization history, configuration parameters, and current status. When the ODCR exists in the same account as Capacity Manager, you can modify reservation parameters directly from this interface, eliminating the need to navigate to separate EC2 console sections for reservation management.

The Statistics section provides summary metrics, including total reservations count, overall utilization percentage, reserved capacity totals, used and unused capacity volumes, average scheduled reservations, and counts of accounts, instance families, and Regions with reservations.

This consolidated view helps you understand reservation distribution and utilization patterns across your infrastructure. For example, you might discover that your development accounts consistently show 30% reservation utilization while production accounts exceed 95%, indicating an opportunity to redistribute or modify reservations. Similarly, you could identify that specific instance families in certain Regions have sustained low utilization rates, suggesting candidates for reservation adjustments or workload optimization. These insights help you make data-driven decisions about reservation purchases, modifications, or cancellations to better align your reserved capacity with actual usage patterns.

The Spot tab focuses on Spot Instance usage and displays the amount of time your Spot instances run before being interrupted. This analysis of Spot Instance usage patterns helps you identify optimization opportunities for Spot Instance workloads. You can use Spot placement score recommendations to improve workload flexibility.

For organizations requiring data export capabilities, Capacity Manager includes data exports to Amazon Simple Storage Service (Amazon S3) buckets for capacity analysis. You can view and manage your data exports through the Data exports tab, which helps you create new exports, monitor delivery status, and configure export schedules to analyze capacity data outside the AWS Management Console.

Data exports extend your analytical capabilities by storing capacity data beyond the 90-day retention period available through the console and APIs. This extended retention enables long-term trend analysis and historical capacity planning. You can also integrate exported data with existing analytics workflows, business intelligence tools, or custom reporting systems to incorporate EC2 capacity metrics into broader infrastructure analysis and decision-making processes.

The Settings section provides configuration options for AWS Organizations integration, enabling centralized capacity management across multiple accounts. Organization administrators can enable enterprise-wide capacity visibility or delegate access to specific accounts while maintaining appropriate permissions and access controls.

Now available
EC2 Capacity Manager eliminates the operational overhead of collecting and analyzing capacity data from multiple sources. The service provides automated optimization opportunities, centralized multi-account visibility, and direct access to capacity management tools. You can reduce manual analysis time while improving capacity utilization and cost optimization across your EC2 infrastructure.

Amazon EC2 Capacity Manager is available at no additional cost. To begin using Amazon EC2 Capacity Manager, visit the Amazon EC2 console or access the service APIs. The service is available in all commercial AWS Regions.

To learn more, visit the EC2 Capacity Manager documentation.

— Esra

Integrate Amazon CloudWatch alarms with WhatsApp using AWS End User Messaging

Post Syndicated from Ruchikka Chaudhary original https://aws.amazon.com/blogs/messaging-and-targeting/integrate-amazon-cloudwatch-alarms-with-whatsapp-using-aws-end-user-messaging/

Timely and accessible alert notifications are crucial for maintaining operational excellence, but traditional alerting mechanisms often fall short. Email notifications can get buried in crowded inboxes, SMS messages might incur high costs for international teams, and pager systems lack the rich context modern teams need for rapid incident response. WhatsApp, with over 2 billion users worldwide, offers several advantages:

  • Universal accessibility – Available on virtually every smartphone
  • Rich media support – Sends formatted messages, images, and links
  • Global reach – No international SMS fees
  • High engagement – Messages are typically delivered within seconds
  • Familiar interface – Many teams already use WhatsApp for daily communication

This post walks through a solution to build a serverless alerting system that delivers Amazon CloudWatch alarms to WhatsApp using AWS End User Messaging.

Overview of solution

The solution uses AWS End User Messaging Social to create a seamless bridge between CloudWatch alarms and WhatsApp notifications. This serverless architecture provides real-time infrastructure alerts through WhatsApp messaging platform teams already know and use.

The architecture consists of four main AWS services working together. The data flow begins when CloudWatch alarms detect breaches of predefined thresholds and publish notifications to an Amazon Simple Notification Service (Amazon SNS) topic. The SNS topic triggers an AWS Lambda function that processes the alarm data, formats contextual WhatsApp messages, and uses AWS End User Messaging Social to deliver notifications to specified recipients.

The following diagram illustrates the solution architecture.

Prerequisites

Implementation requires an AWS account with appropriate permissions for AWS CloudFormation, Lambda, Amazon SNS, and CloudWatch. You must also have a WhatsApp Business Account integrated with AWS End User Messaging and the WhatsApp phone number ID from the AWS End User Messaging console. (For instructions to locate this information, see View a phone number’s ID in AWS End User Messaging Social).

For more information about how to set up WhatsApp using AWS End User Messaging Social, refer to Automate workflows with WhatsApp using AWS End User Messaging Social.

Before you deploy this solution, create an approved template in your Meta account named cw_alarm_notification. Alternatively, use your preferred template name and modify the Lambda function code accordingly (as shown in the following screenshot).

Lambda function code

The solution uses a Python-based Lambda function that processes CloudWatch alarm notifications and formats them for WhatsApp delivery. The function receives SNS events containing CloudWatch alarm data, extracts relevant information, formats contextual messages, and delivers notifications through AWS End User Messaging Social.

The following function code shows an example to parse the SNS message content and extract key alarm information, including alarm name, state value, and reason for state change:

def process_alarm_notification(record):
    # Parse SNS message
    sns_message = json.loads(record['Sns']['Message'])
    
    # Extract alarm details
    alarm_name = sns_message.get('AlarmName', 'Unknown Alarm')
    alarm_description = sns_message.get('AlarmDescription', '')
    new_state = sns_message.get('NewStateValue', 'UNKNOWN')
    old_state = sns_message.get('OldStateValue', 'UNKNOWN')
    reason = sns_message.get('NewStateReason', '')
    timestamp = sns_message.get('StateChangeTime', '')
    region = sns_message.get('Region', '')
    
    # Format WhatsApp message
    message = format_alarm_message(
        alarm_name, alarm_description, new_state, 
        old_state, reason, timestamp, region
    )
    
    # Send WhatsApp message
    send_whatsapp_message(message)
    

The following code shows an example to create a template message for WhatsApp (you can change the template name if required):

              # Build message
              template_name = 'cw_alarm_notification'
              template_message = {
                      "name": template_name,
                      "language": {
                          "code": "en"
                      },
                      "components": [
                          {
                              "type": "header",
                              "parameters": [{
                                  "type": "text",
                                  "parameter_name": "emoji",
                                  "text": state_emoji
                                  }
                              ]
                          },
                          {
                              "type": "body",
                              "parameters": [
                                  {
                                      "type": "text",
                                      "parameter_name": "alarm_name",
                                      "text": alarm_name
                                  },
                                  {
                                      "type": "text",
                                      "parameter_name": "status",
                                      "text": old_state + " → " + new_state
                                  },
                                  {
                                      "type": "text",
                                      "parameter_name": "region_account",
                                      "text": region + " - " + account_id
                                  },
                                  {
                                      "type": "text",
                                      "parameter_name": "time",
                                      "text": formatted_time
                                  },
                                  {
                                      "type": "text",
                                      "parameter_name": "description",
                                      "text": alarm_description + reason
                                  }
                              ]
                          }

                      ]
                  }
              
              return template_message

The send_whatsapp_message function uses AWS End User Messaging Social to deliver formatted messages through the socialmessaging client:

def send_whatsapp_message(message):
    """Send message via AWS End User Messaging"""
    client = boto3.client('socialmessaging')
    
    response = client.send_whats_app_message(
        originationPhoneNumberId=os.environ['WHATSAPP_PHONE_NUMBER_ID'],
        destinationPhoneNumber=os.environ['ALERT_RECIPIENT'],
        messageBody={'text': message}
    )

Deploy the solution

The solution uses AWS CloudFormation for infrastructure as code (IaC) deployment. The main template creates an SNS topic for alarm notifications, a Lambda function for message processing, and required AWS Identity and Access Management (IAM) roles with least-privilege permissions.

The CloudFormation template requires a recipient number with an active WhatsApp account to receive alarm notifications as messages. The template also requires the WhatsApp phone number ID retrieved from the AWS End User Messaging Social console, as noted in the prerequisites. The template must be deployed in the same AWS Region as AWS End User Messaging Social. See the following code:

aws cloudformation deploy \
  --template-file <cloudwatch-eum-whatsapp-alerts.yaml> \
  --stack-name cloudwatch-eum-whatsapp-alerts \
  --parameter-overrides \
    WhatsAppPhoneNumberId=<your-phone-number-id-from-eum> \
    AlertRecipient=<+1234567890> \
    Environment=dev \
  --capabilities CAPABILITY_IAM \
  --region <EUM-region>

The preceding template deploys an alarm called SampleHighCPUAlarm that triggers an SNS topic. The SNS topic triggers the WhatsApp alarm notifier Lambda function, which processes and sends the message using AWS End User Messaging Social.

Test the solution

You can test the solution using the sample alarm created by the CloudFormation stack. The following screenshot shows an example alarm configuration and its CloudWatch metrics, currently in the OK state.

You can use the following code to trigger the alarm by updating this metric:

 aws cloudwatch put-metric-data --namespace "Custom/EC2" --metric-data MetricName=CPUUtilization,Value=85,Unit=Percent

As the alarm switches from OK to ALARM state, a WhatsApp Message is delivered.

Conclusion

Integrating CloudWatch with WhatsApp notifications represents a significant step forward in modern infrastructure monitoring. By using AWS End User Messaging, teams can receive critical alerts through their preferred communication channel while maintaining the reliability and scalability of AWS services.

The solution’s modular architecture, simple yet effective security model, and cost-effective design make it suitable for organizations of various sizes.


About the author

AWS Weekly Roundup: Amazon Bedrock, AWS Outposts, Amazon ECS Managed Instances, AWS Builder ID, and more (October 6, 2025)

Post Syndicated from Prasad Rao original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-bedrock-aws-outposts-amazon-ecs-managed-instances-aws-builder-id-and-more-october-6-2025/

Last week, Anthropic’s Claude Sonnet 4.5—the world’s best coding model according to SWE-Bench – became available in Amazon Q command line interface (CLI) and Kiro. I’m excited about this for two reasons:

First, a few weeks ago I spent 4 intensive days with a global customer delivering an AI-assisted development workshop, where I experienced firsthand how Amazon Q CLI boosts developer productivity. During the workshop, the customer was able to add a new feature in their application within a day using Amazon Q CLI, which would have traditionally taken them at least a couple of weeks. With Anthropic’s Claude Sonnet 4.5 in Amazon Q CLI, I know developer productivity will be enhanced further.

Second, I’ve started preparing for my code talk at AWS re:Invent 2025, where my co-speaker and I will show live coding to modernize a legacy codebase using Kiro. I can’t wait to use Anthropic’s Claude Sonnet 4.5 in Kiro to create a live demo. If you want to see this demo and over a thousand other sessions on cloud and AI, join us at AWS re:Invent 2025 in Las Vegas from December 1–5.

Last week’s launches
Here are some launches that got my attention:

  • Availability of Claude Sonnet 4.5 in Amazon Bedrock – Anthropic’s most intelligent model, best for coding and complex agents, is now available in Amazon Bedrock. By using Claude Sonnet 4.5 in Amazon Bedrock, developers gain access to a fully managed service that not only provides a unified API for foundation models (FMs) but keeps their data under complete control with enterprise-grade tools for security, and optimization.
  • AWS Outposts supports third-party storage integration with Dell and HPE – AWS Outposts third-party storage integration now includes Dell PowerStore and HPE Alletra Storage MP B10000 systems, joining the list of existing integrations with NetApp on-premises enterprise storage arrays and Pure Storage FlashArray. This integration serves three key purposes. First, it helps you maintain your existing storage infrastructure while migrating VMware workloads to AWS. Second, it helps you meet strict data residency requirements by keeping your data on premises while using AWS services. Third, it means you can use AWS Outposts with third-party storage arrays through AWS tooling.
  • Amazon ECS Managed Instances now available – Amazon ECS Managed Instances for containerized applications is a new fully managed compute option for Amazon ECS designed to eliminate infrastructure management overhead while giving you access to the full capabilities of Amazon EC2. ECS Managed Instances helps you quickly launch and scale your workloads while enhancing performance and reducing your total cost of ownership.
  • Application map is now generally available for Amazon CloudWatch – Amazon CloudWatch now helps you monitor large-scale distributed applications by automatically discovering and organizing services into groups based on configurations and their relationships. With this new application performance monitoring (APM) capability, you can quickly visualize which applications and dependencies to focus on while troubleshooting your distributed applications.
  • Amazon Bedrock AgentCore Model Context Protocol (MCP) server now available – With built-in support for runtime, gateway integration, identity management, and agent memory, the AgentCore MCP server is purpose-built to speed up creation of components compatible with Bedrock AgentCore. You can use the AgentCore MCP server for rapid prototyping, production AI solutions, or to scale your agent infrastructure.

Additional Updates
Here are some additional news items and blog posts that I found interesting:

  • AWS Builder ID now supports Sign in with Google – You can now create an AWS Builder ID using sign in with Google. AWS Builder ID is a personal profile that provides access to AWS applications including Kiro, AWS Builder Center, AWS Training and Certification, AWS re:Post and AWS Startups.
  • AWS API MCP Server v1.0.0 release – AWS API MCP server acts as a bridge between AI assistants and AWS services enabling foundation models to interact with any AWS API through natural language by creating and executing syntactically correct CLI commands. The AWS API MCP Server is open-source and available now on AWS Labs GitHub repository.
  • AWS Knowledge MCP Server now generally available – The AWS Knowledge server gives AI agents and MCP clients access to authoritative knowledge, including documentation, blog posts, What’s New announcements, and Well-Architected best practices, in an LLM-compatible format. With this release, the server also includes knowledge about the regional availability of AWS APIs and CloudFormation resources.
  • AWS Transform now enables Terraform for VMware network automation – AWS Transform now offers Terraform as an additional option to generate network infrastructure code automatically from VMware environments. The service converts your source network definitions into reusable Terraform modules, complementing current AWS CloudFormation and AWS Cloud Development Kit (CDK) support.

Upcoming AWS events
Check your calendar and sign up for upcoming AWS events:

  • AWS AI Agent Global Hackathon – This is your chance to dive deep into our powerful generative AI stack and create something truly awesome. From September 8th to October 20th, you have the opportunity to create AI agents using AWS suite of AI services, competing for over $45,000 in prizes and exclusive go-to-market opportunities.
  • AWS Gen AI Lofts – You can learn AWS AI products and services with exclusive sessions, meet industry-leading experts, and have valuable networking opportunities with investors and peers. Register in your nearest city: Paris (October 7–21), London (Oct 13–21), and Tel Aviv (November 11–19).
  • AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Munich (October 7), Budapest (October 16).

You can browse all upcoming AWS events and AWS startup events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Prasad

Amazon OpenSearch Serverless monitoring: A CloudWatch setup guide

Post Syndicated from Urmila Iyer original https://aws.amazon.com/blogs/big-data/amazon-opensearch-serverless-monitoring-a-cloudwatch-setup-guide/

Amazon OpenSearch Serverless simplifies the deployment and management of OpenSearch workloads by automatically scaling based on your usage patterns. The service considers key metrics such as shard utilization, storage consumption, and CPU usage while maintaining millisecond-level response times, with the simplicity of a serverless environment.

While OpenSearch Serverless handles scaling automatically, implementing robust monitoring remains crucial for understanding usage patterns, optimizing costs, helping to ensure performance, and maintaining reliability. Proactive monitoring helps organizations detect critical issues with the applications or infrastructure in real time and identify root causes quickly.

This post is part of our Amazon OpenSearch service monitoring series, focusing on OpenSearch Serverless workloads and deployments. In this post, we explore commonly used Amazon CloudWatch metrics and alarms for OpenSearch Serverless, walking through the process of selecting relevant metrics, setting appropriate thresholds, and configuring alerts. This guide will provide you with a comprehensive monitoring strategy that complements the serverless nature of your OpenSearch deployment while maintaining full operational visibility.

Key benefits of CloudWatch monitoring for OpenSearch Serverless

Implementing CloudWatch monitoring for your OpenSearch Serverless collections offers several key advantages:

  • Near real-time performance monitoring – CloudWatch provides near real-time monitoring, enabling you to track your OpenSearch Serverless collections’ performance as they operate. This immediate visibility allows for swift detection of anomalies or performance issues, enabling prompt response to potential problems.
  • Efficient error diagnosis – You can quickly identify and address common errors without extensive log analysis. For instance, by monitoring ingestion request errors, you can preemptively mitigate bulk indexing request failures.
  • Proactive alerting system – Use the CloudWatch alarm functionality in conjunction with Amazon Simple Notification Service (SNS) to set up custom alerts. By defining specific thresholds for critical metrics, you can receive instant notifications through email or SMS when your OpenSearch Serverless collections approach or exceed these limits.
  • Comprehensive historical analysis – The data retention capabilities of CloudWatch allow for in-depth historical analysis. This helps you to identify long-term performance trends, recognize recurring patterns in resource utilization and optimize workload distribution based on historical insights.

Solution overview

Understanding which metrics to monitor in OpenSearch Serverless helps optimize your system’s performance and reliability. This guide explains the key metrics to monitor, their significance, how to determine appropriate thresholds, and the step-by-step process for setting up alarms. Understanding these fundamentals will help you establish effective monitoring for your OpenSearch Serverless collections and help maintain optimal performance and reliability.

Prerequisites

Before getting started, you must have the following prerequisites:

CloudWatch metrics and recommended alarms for OpenSearch Serverless

The following table summarizes key CloudWatch metrics for OpenSearch Serverless, including recommended alarm thresholds, metric descriptions, and applicable workload types.

Alarm Metric Level Metric Description Alarm Description Use case
IndexingOCU maximum is >= 10 for 5 minutes, three consecutive times Account Level

Serverless compute capacity is measured in OpenSearch Compute Units (OCUs). Each OCU is a combination of 6 GiB of memory and corresponding virtual CPU (vCPU), in addition to data transfer to Amazon Simple Storage Service (Amazon S3).

The IndexingOCU metric reports the number of OCUs used for data ingestion across all collections.

This alarm will alert you when Indexing OCUs scale upto / beyond 10 for more than 15 minutes. Monitor and Optimize Costs
SearchOCU maximum is >= 10 for 5 minutes, three consecutive times Account Level

Serverless compute capacity is measured in OCUs. Each OCU is a combination of 6 GiB of memory and corresponding virtual CPU (vCPU), in addition to data transfer to Amazon S3.

The SearchOCU metric reports the number of OCUs used to search collection data across all collections.

This alarm will alert you when Search OCUs scale upto / beyond 10 for more than 15 minutes. Monitor and Optimize Costs
IngestionRequestLatency maximum is >= 3 secs for 1 minutes, five consecutive times. Collection Level The IngestionRequestLatency metric reports the latency, in seconds, for bulk write operations to a collection. This alarm monitors the maximum latency of bulk write operations to a collection. It triggers when the maximum IngestionRequestLatency exceeds 3 seconds for five consecutive 1-minute intervals (for a total of 5 minutes). This indicates a sustained performance degradation in data ingestion operations, which could impact application performance and data availability. This metric might be crucial to monitor for log-based workloads, where indexing time is critical.
SearchRequestLatency maximum is >= 2 secs for 1 minutes, five consecutive times. Collection Level The SearchRequestLatency metric reports the latency, in seconds, that it takes to complete a search operation against a collection. This alarm monitors the maximum latency of search operations against a collection. It triggers when the maximum SearchRequestLatency exceeds 2 seconds for five consecutive 1-minute intervals (for a total of 5 minutes). Consistently high search latency indicates performance issues that could degrade user experience and application responsiveness. This metric might be crucial to monitor for vector and search-based workloads, where search time is critical.
IngestionRequestErrors sum is >= 100 errors for 1 minute, five consecutive times Collection Level The IngestionRequestErrors metric reports the total number of bulk indexing request errors to a collection. OpenSearch Serverless emits this metric when there are bulk indexing request failures, such as an authentication or availability issue. This alarm monitors the total count of failed bulk indexing operations to a collection. It triggers when the number of IngestionRequestErrors equals or exceeds 100 errors for five consecutive 1-minute intervals (for a total of 5 minutes). Persistent ingestion errors indicate systemic issues that could lead to data loss or inconsistency.
SearchRequestErrors sum is >= 50 errors for 1 minute, five consecutive times Collection Level The SearchRequestErrors metric reports the total number of query errors per minute for a collection. This alarm monitors the total count of failed search query operations in a collection. It triggers when the number of SearchRequestErrors equals or exceeds 50 errors for five consecutive 1-minute intervals (for a total of 5 minutes). Persistent search errors indicate potential issues that could impact application functionality and user experience.
ActiveCollection minimum is 0 for 1 minutes, three consecutive times. Collection Level This metric indicates whether a collection is active. A value of 1 means that the collection is in an ACTIVE state. This value is emitted upon successful creation of a collection and remains 1 until you delete the collection. The metric can’t have a value of 0. The alarm triggers when the metric is missing for three consecutive 1-minute intervals (for a total of 3 minutes). Because an active collection always emits a value of 1, missing data indicates the collection has been deleted or is experiencing serious issues.
Note: Make sure to setup the CloudWatch alarm so that it will treat missing data as breaching.
Monitor Availability of Collection

The specific threshold values mentioned are examples. However, you may need to adjust these thresholds based on the unique requirements and SLAs of your own applications and workloads running on OpenSearch Serverless.

To decide when to raise the global OCU limits, you should regularly review the IndexingOCU and SearchOCU metrics at the account level. If you notice the metrics consistently approaching the set threshold, it’s a good indication that you should consider increasing the overall account limits to accommodate your growing usage.

Additionally, monitor the collection-level metrics like IngestionRequestLatency and SearchRequestLatency. If you notice certain collections have consistently high latency, it might be a sign that the OCU allocation for those specific collections is insufficient. In such cases, you could consider increasing the OCU limits for those high-usage collections, rather than raising the global account limits.

By closely monitoring both the account-level and collection-level metrics, you can make informed decisions about when and how to adjust your OCU limits to maintain optimal performance and cost efficiency for your OpenSearch Serverless deployment.

Steps to create a CloudWatch alarm

CloudWatch Alarms can be created using any of the following methods:

Detailed steps and a / sample code snippet for each method are provided in the following sections.

Using the console

The AWS Management Console provides a user-friendly, visual interface for creating CloudWatch alarms. Follow these step-by-step instructions to set up your alarm through the console.

  1. Navigate to the CloudWatch console
  2. In the navigation pane, choose Alarms and then, All alarms.
  3. Choose Create alarm.

Create an alarm

  1. Choose Select Metric.
  2. Select the namespace AOSS 

Choose CloudWatch Namespace

  1. To setup alerting on IndexingOCU across all collections, navigate to ClientId and select the metric.
  2. Under Conditions:
    1. For Statistic: Select Maximum.
    2. For Period: Select 5 minutes.
    3. For Threshold type: Choose Static and Greater.

Specify metric and conditions

  1. Choose Next. Under Notification, select an SNS topic to notify when the alarm is in ALARM state, OK state, or INSUFFICIENT_DATA state.

Configure Actions

  1. When finished, choose Next. Enter a name and description for the alarm. The name must contain only UTF-8 characters, and can’t contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm Details tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources. Then choose Next.
  2. Under Preview and create, confirm that the information and conditions are what you want, then choose Create alarm.

For detailed documentation, refer to Create a CloudWatch alarm based on a static threshold.

Using the AWS CLI

For those who prefer command-line interfaces or need to automate alarm creation, the AWS CLI offers an efficient alternative. This section demonstrates how to create a CloudWatch alarm using a single CLI command.

To set up a CloudWatch alarm using the AWS CLI, you can use the put-metric-alarm command. The following example demonstrates how to create an alarm that sends an Amazon SNS email when the IndexingOCU exceeds 2 for 15 minutes at the account level. Replace [region] and [account-id] with your AWS Region and account ID.

aws cloudwatch put-metric-alarm \
--alarm-description '# IndexingOCU scaling out' \
--actions-enabled \
--alarm-actions 'arn:aws:sns:[region]:[account-id]:SecurityHubRecurringSummary' \
--metric-name 'IndexingOCU' \
--namespace 'AWS/AOSS' \
--statistic 'Maximum' \
--dimensions '[{"Name":"ClientId","Value":"[account-id]"}]' \
--period 300 \
--evaluation-periods 3 \
--datapoints-to-alarm 3 \
--threshold 2 \
--comparison-operator 'GreaterThanThreshold' \
--treat-missing-data 'ignore'

CloudFormation JSON

Infrastructure as Code (IaC) enables version-controlled, repeatable deployments. This JSON template shows how to define a CloudWatch alarm using AWS CloudFormation, suitable for those who prefer JSON syntax for their IaC implementations.

Replace [region] and [account-id] with your AWS Region and account ID.

{
    "Type": "AWS::CloudWatch::Alarm",
    "Properties": {
        "AlarmDescription": "# IndexingOCU scaling out",
        "ActionsEnabled": true,
        "OKActions": [],
        "AlarmActions": [
            "arn:aws:sns:[region]:[account-id]:SecurityHubRecurringSummary"
        ],
        "InsufficientDataActions": [],
        "MetricName": "IndexingOCU",
        "Namespace": "AWS/AOSS",
        "Statistic": "Maximum",
        "Dimensions": [
            {
                "Name": "ClientId",
                "Value": "[account-id]"
            }
        ],
        "Period": 300,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": 2,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "ignore"
    }
}

CloudFormation YAML

For teams that prefer YAML’s more readable format, this section provides the equivalent CloudFormation template in YAML. The template creates the same CloudWatch alarm with identical configurations as the JSON version.

Replace [region] and [account-id] with your AWS Region and account ID.

Type: AWS::CloudWatch::Alarm
Properties:
    AlarmDescription: "# IndexingOCU scaling out"
    ActionsEnabled: true
    OKActions: []
    AlarmActions:
        - arn:aws:sns:[region]:[account-id]:SecurityHubRecurringSummary
    InsufficientDataActions: []
    MetricName: IndexingOCU
    Namespace: AWS/AOSS
    Statistic: Maximum
    Dimensions:
        - Name: ClientId
          Value: "[account-id]"
    Period: 300
    EvaluationPeriods: 3
    DatapointsToAlarm: 3
    Threshold: 2
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: ignore

CloudWatch dashboards

You can use Amazon CloudWatch dashboards to monitor multiple resources in a unified view. For example, the following dashboard provides a consolidated view of OpenSearch Serverless OCU usage, helping you track and manage costs.

View dashboards

Clean up

To avoid incurring unintended future charges, delete the following resources that were created as part of solution walk-through of this post:

  • CloudWatch alarms
  • CloudFormation stacks
  • SNS topics

Conclusion

Effective monitoring helps maintain optimal performance and reliability of your OpenSearch Serverless collections. By implementing the CloudWatch alarms and monitoring strategies outlined in this post, you can work towards proactively identifying and responding to performance issues before they impact your applications, optimize costs by tracking OCU usage patterns, support high availability objectives by monitoring collection health and error rates, and help maintain consistent performance through latency monitoring. Remember that the thresholds suggested in this guide serve as a starting point, you should adjust them based on your specific use cases, performance requirements, and budget constraints. Regular review and refinement of these alarms will help you maintain an efficient and cost-effective OpenSearch Serverless deployment.

Related links

Monitoring Amazon OpenSearch Serverless

Create a CloudWatch alarm based on a static threshold


About the authors

Urmila Iyer

Urmila Iyer

Urmila is a Technical Account Manager at AWS, where she partners with enterprise customers to understand their business objectives and architect solutions that drive meaningful outcomes. With 15 years of experience in IT, including 6 years at AWS, she specializes in data-driven solutions, bringing enthusiasm and expertise to data analytics projects using OpenSearch and real-time analytics platforms.

Parth Shah

Parth Shah

Parth is a Senior Solutions Architect at AWS passionate about solving complex data challenges for strategic customers. As a analytics enthusiast, he helps organizations make sense of their data through innovative cloud solutions, with deep expertise in OpenSearch implementations.
Outside of work, he enjoys spending time with family, exploring different cuisines and playing cricket.

Automate and orchestrate Amazon EMR jobs using AWS Step Functions and Amazon EventBridge

Post Syndicated from Senthil Kamala Rathinam original https://aws.amazon.com/blogs/big-data/automate-and-orchestrate-amazon-emr-jobs-using-aws-step-functions-and-amazon-eventbridge/

Many enterprises are adopting Apache Spark for scalable data processing tasks such as extract, transform, and load (ETL), batch analytics, and data enrichment. As data pipelines evolve, the need for flexible and cost-efficient execution environments that support automation, governance, and performance at scale also evolve in parallel. Amazon EMR provides a powerful environment to run Spark workloads, and depending on workload characteristics and compliance requirements, teams can choose between fully managed options like Amazon EMR Serverless or more customizable configurations using Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2).

In use cases where infrastructure control, data locality, or strict security postures are essential, such as in financial services, healthcare, or government, running transient EMR on EC2 clusters becomes a preferred choice. However, orchestrating the full lifecycle of these clusters, from provisioning to job submission and eventual teardown, can introduce operational overhead and risk if done manually.

To streamline this process, the AWS Cloud offers built-in orchestration capabilities using AWS Step Functions and Amazon EventBridge. Together, these services help you automate and schedule the entire EMR job lifecycle, reducing manual intervention while optimizing cost and compliance. Step Functions provides the workflow logic to manage cluster creation, Spark job execution, and cluster termination, and EventBridge schedules these workflows based on business or operational needs.

In this post, we discuss how to build a fully automated, scheduled Spark processing pipeline using Amazon EMR on EC2, orchestrated with Step Functions and triggered by EventBridge. We walk through how to deploy this solution using AWS CloudFormation, processes COVID-19 public dataset data in Amazon Simple Storage Service (Amazon S3), and store the aggregated results in Amazon S3. This architecture is ideal for periodic or scheduled batch processing scenarios where infrastructure control, auditability, and cost-efficiency are critical.

Solution overview

This solution uses the publicly available COVID-19 dataset to illustrate how to build a modular, scheduled architecture for scalable and cost-efficient batch processing for time-bound data workloads.The solution follows these steps:

  1. Raw COVID-19 data in CSV format is stored in an S3 input bucket.
  2. A scheduled rule in EventBridge triggers a Step Functions workflow.
  3. The Step Functions workflow provisions a transient Amazon EMR cluster using EC2 instances.
  4. A PySpark job is submitted to the cluster to calculate COVID-19 hospital utilization data to compute monthly state-level averages of inpatient and ICU bed utilization, and COVID-19 patient percentages.
  5. The processed results are written back to an S3 output bucket.
  6. After successful job completion, the EMR cluster is automatically deleted.
  7. Logs are persisted to Amazon S3 for observability and troubleshooting.

By automating this workflow, you alleviate the need to manually manage EMR clusters while gaining cost-efficiency by running compute only when needed. This architecture is ideal for periodic Spark jobs such as ETL pipelines, regulatory reporting, and batch analytics, especially when control, compliance, and customization are required.The following diagram illustrates the architecture for this use case.

The infrastructure is deployed using AWS CloudFormation to provide consistency and repeatability. AWS Identity and Access Management (IAM) roles grant least‑privilege access to Step Functions, Amazon EMR, EC2 instances, and S3 buckets, and optional AWS Key Management Service (AWS KMS) encryption can secure data at rest in Amazon S3 and Amazon CloudWatch Logs. By combining a scheduled trigger, stateful orchestration, and centralized logging, this solution delivers a fully automated, cost‑optimized, and secure way to run transient Spark workloads in production.

Prerequisites

Before you get started, make sure you have the following prerequisites:

Set up resources with AWS CloudFormation

To provision the required resources using a single CloudFormation template, complete the following steps:

  1. Sign in to the AWS Management Console as an admin user.
  2. Clone the sample repository to your local machine or AWS CloudShell and navigate into the project directory.
    git clone https://github.com/aws-samples/sample-emr-transient-cluster-step-functions-eventbridge.git
    cd sample-emr-transient-cluster-step-functions-eventbridge

  3. Set an environment variable for the AWS Region where you plan to deploy the resources. Replace the placeholder with your Region code, for example, us-east-1.
    export AWS_REGION=<YOUR AWS REGION>

  4. Deploy the stack using the following command. Update the stack name if needed. In this example, the stack is created with the name covid19-analysis.
    aws cloudformation deploy \
    --template-file emr_transient_cluster_step_functions_eventbridge.yaml \
    --stack-name covid19-analysis \
    --capabilities CAPABILITY_IAM \
    --region $AWS_REGION 

You can monitor the stack creation progress on the AWS CloudFormation console on the Events tab. The deployment typically completes in under 5 minutes.

After the stack is successfully created, go to the Outputs tab on the AWS CloudFormation console and note the following values for use in later steps:

  • InputBucketName
  • OutputBucketName
  • LogBucketName

Set up the COVID-19 dataset

With your infrastructure in place, complete the following steps to set up the input data:

  1. Download the COVID-19 data CSV file from HealthData.gov to your local machine.
  2. Rename the downloaded file to covid19-dataset.csv.
  3. Upload the renamed file to your S3 input bucket under the raw/ folder path.

Set up the PySpark Script

Complete the following steps to set up the PySpark script:

  1. Open AWS CloudShell from the console.
  2. Confirm that you are working inside the sample-emr-transient-cluster-step-functions-eventbridge directory before running the next command.
  3. Copy the PySpark script needed for this walkthrough into your input bucket:
    aws s3 cp covid19_processor.py s3://<InputBucketName>/scripts/

This script processes COVID-19 hospital utilization data stored as CSV files in your S3 input bucket. When running the job, provide the following command-line arguments:

  • --input – The S3 path to the input CSV files
  • --output – The S3 path to store the processed results

The script reads the raw dataset, standardizes various date formats, and filters out records with invalid or missing dates. It then extracts key utilization metrics such as inpatient bed usage, ICU bed usage, and the percentage of beds occupied by COVID-19 patients and calculates monthly averages grouped by state. The aggregated output is saved as timestamped CSV files in the specified S3 location.

This example demonstrates how you can use PySpark to efficiently clean, transform, and analyze large-scale healthcare data to gain actionable insights on hospital capacity trends during the pandemic.

Configure a schedule in EventBridge

The Step Functions state machine is by default scheduled to run on December 31, 2025, as a one-time execution. You can update the schedule for recurring or one-time execution as needed. Complete the following steps:

  1. On the EventBridge console, choose Schedules under Scheduler in the navigation pane.
  2. Select the schedule named <StackName>-covid19-analysis and choose Edit.
  3. Set your preferred schedule pattern.
    1. If you want to run the schedule one time, select One-time schedule for Occurrence and enter a date and time.
    2. If you want to run this on a recurring basis, select Recurring schedule. Specify the schedule type as either Cron-based schedule or Rate-based schedule as needed.
  4. Choose Next twice and choose Save schedule.

Start the workflow in Step Functions

Based on your EventBridge schedule, the Step Functions workflow will run automatically. For this walkthrough, complete the following steps to trigger it manually:

  1. On the Step Functions console, choose State machines in the navigation pane.
  2. Choose the state machine that begins with Covid19AnalysisStateMachine-*.
  3. Choose Start execution.
  4. In the Input section, provide the following JSON (provide the log bucket and output bucket names with the appropriate values captured earlier):
    {
      "LogUri": "s3://<LogBucketName>/logs/",
      "OutputS3Location": "s3://<OutputBucketName>/processed/"
    }

  5. Choose Start execution to initiate the workflow.

Monitor the EMR job and workflow execution

After you start the workflow, you can track both the Step Functions state transitions and the EMR job progress in real time on the console.

Monitor the Step Functions state machine

Complete the following steps to monitor the Step Functions state machine:

  1. On the Step Functions console, choose State machines in the navigation pane.
  2. Choose the state machine that begins with Covid19AnalysisStateMachine-*.
  3. Choose the running execution to view the visual workflow.

    Each state node will update as it progresses—green for success, red for failure.

  4. To explore a step, choose its node and inspect the input, output, and error details in the side pane.

The following screenshot shows an example of a successfully executed workflow.

Monitor the EMR cluster and EMR step

Complete the following steps to monitor the EMR cluster and EMR step status:

  1. While the cluster is active, open the Amazon EMR console and choose Clusters in the navigation pane.
  2. Locate the Covid19Cluster transient EMR cluster.
    Initially, it will be in Starting status.

    On the Steps tab, you can see your Spark submit step listed. As the job progresses, the step status changes from Pending to Running to finally Completed or Failed.

  3. Choose the Applications tab to view the application UIs, in which you can access the Spark History Server and YARN Timeline Server for monitoring and troubleshooting.

Monitor CloudWatch logs

To enable CloudWatch logging and enhanced monitoring for your EMR on EC2 cluster, refer to Amazon EMR on EC2 – Enhanced Monitoring with CloudWatch using custom metrics and logs. This guide explains how to install and configure the CloudWatch agent using a bootstrap action, so you can stream system-level metrics (such as CPU, memory, and disk usage) and application logs from EMR nodes directly to CloudWatch. With this setup, you can gain real-time visibility into cluster health and performance, simplify troubleshooting, and retain critical logs even after the cluster is terminated.

For this walkthrough, check the logs in the S3 log output location.

Confirm cluster deletion

When the Spark step is complete, Step Functions will automatically delete the Amazon EMR cluster. Refresh the Clusters page on the Amazon EMR console. You should see your cluster status change from Terminating to Terminated within a minute.

By following these steps, you gain full end-to-end visibility into your workflow from the moment the Step Functions state machine is triggered to the automatic shutdown of the EMR cluster. You can monitor execution progress, troubleshoot issues, confirm job success, and continuously optimize your transient Spark workloads.

Verify job output in Amazon S3

When the job is complete, complete the following steps to check the processed results in the S3 output bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Open the output S3 bucket you noted earlier.
  3. Open the processed folder.
  4. Navigate into the timestamped subfolder to view the CSV output file.
  5. Download the CSV file to view the processed results, as shown in the following screenshot.

Monitoring and troubleshooting

To monitor the progress of your Spark job running on a transient EMR on EC2 cluster, use the Step Functions console. It provides real-time visibility into each state transition in your workflow, from cluster creation and job submission to cluster deletion. This makes it straightforward to track execution flow and identify where issues might occur.During job execution, you can use the Amazon EMR console to access cluster-level monitoring. This includes YARN application statuses, step-level logs, and overall cluster health. If CloudWatch logging is enabled in your job configuration, driver and executor logs stream in near real time, so you can quickly detect and diagnose errors, resource constraints, or data skew within your Spark application.

After the workflow is complete, regardless of whether it succeeds or fails, you can perform a detailed post-execution analysis by reviewing the logs stored in the S3 bucket specified in the LogUri parameter. This log directory includes standard output and error logs, along with Spark history files, offering insights into execution behavior and performance metrics.

For continued access to the Spark UI during job execution, you can use persistent application UIs on the EMR console. These links remain accessible even after the cluster is stopped, enabling deeper root-cause analysis and performance tuning for future runs.

This visibility into both workflow orchestration and job execution can help teams optimize their Spark workloads, reduce troubleshooting time, and build confidence in their EMR automation pipelines.

Clean up

To avoid incurring ongoing charges, clean up the resources provisioned during this walkthrough:

  1. Empty the S3 buckets:
    1. On the Amazon S3 console, choose Buckets in the navigation pane.
    2. Select the input, output, and log buckets used in this tutorial.
    3. Choose Empty to remove all objects before deleting the buckets (optional).
  2. Delete the CloudFormation stack:
    1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
    2. Select the stack you created for this solution and choose Delete.
    3. Confirm the deletion to remove associated resources.

Conclusion

In this post, we showed how to build a fully automated and cost-effective Spark processing pipeline using Step Functions, EventBridge, and Amazon EMR on EC2. The workflow provisions a transient EMR cluster, runs a Spark job to process data, and stops the cluster after the job completes. This approach helps reduce costs while giving you full control over the process. This solution is ideal for scheduled data processing tasks such as ETL jobs, log analytics, or batch reporting, especially when you need detailed control over infrastructure, security, and compliance settings.

To get started, deploy the solution in your environment using the CloudFormation stack provided and adjust it to fit your data processing needs. Check out the Step Functions Developer Guide and Amazon EMR Management Guide to explore further.

Share your feedback and ideas in the comments or connect with your AWS Solutions Architect to fine-tune this pattern for your use case.


About the authors

Senthil Kamala Rathinam

Senthil Kamala Rathinam

Senthil is a Solutions Architect at Amazon Web Services, specializing in Data and Analytics for banking customers across North America. With deep expertise in Data and Analytics, AI/ML, and Generative AI, he helps organizations unlock business value through data-driven transformation. Beyond work, Senthil enjoys spending time with his family and playing badminton.

Shashi Makkapati

Shashi Makkapati

Shashi is a Senior Solutions Architect serving banking customers across North America. He specializes in data analytics, AI/ML, and generative AI, focusing on innovative solutions that transform financial organizations. Shashi is passionate about leveraging technology to solve complex business challenges in the banking sector. Outside of work, he enjoys traveling and spending quality time with his family.

AWS Weekly Roundup: SQS fair queues, CloudWatch generative AI observability, and more (July 28, 2025)

Post Syndicated from Micah Walter original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-sqs-fair-queues-cloudwatch-generative-ai-observability-and-more-july-28-2025/

To be honest, I’m still recovering from the AWS Summit in New York, doing my best to level up on launches like Amazon Bedrock AgentCore (Preview) and Amazon Simple Storage Service (S3) Vectors. There’s a lot of new stuff to learn!

Meanwhile, it’s been an exciting week for AWS builders focused on reliability and observability. The standout announcement has to be Amazon SQS fair queues, which tackles one of the most persistent challenges in multi-tenant architectures: the “noisy neighbor” problem. If you’ve ever dealt with one tenant’s message processing overwhelming shared infrastructure and affecting other tenants, you’ll appreciate how this feature enables more balanced message distribution across your applications.

On the AI front, we’re also seeing AWS continue to enhance our observability capabilities with the preview launch of Amazon CloudWatch generative AI observability. This brings AI-powered insights directly into your monitoring workflows, helping you understand infrastructure and application performance patterns in new ways. And for those managing Amazon Connect environments, the addition of AWS CloudFormation for message template attachments makes it easier to programmatically deploy and manage email campaign assets across different environments.

Last week’s launches

  • Amazon SQS Fair Queues — AWS launched Amazon SQS fair queues to help mitigate the “noisy neighbor” problem in multi-tenant systems, enabling more balanced message processing and improved application resilience across shared infrastructure.
  • Amazon CloudWatch Generative AI Observability (Preview) — AWS launched a preview of Amazon CloudWatch generative AI observability, enabling users to gain AI-powered insights into their cloud infrastructure and application performance through advanced monitoring and analysis capabilities.
  • Amazon Connect CloudFormation Support for Message Template Attachments —AWS has expanded the capabilities of Amazon Connect by introducing support for AWS CloudFormation for Outbound Campaign message template attachments, enabling customers to programmatically manage and deploy email campaign attachments across different environments.
  • Amazon Connect Forecast Editing — Amazon Connect introduces a new forecast editing UI that allows contact center planners to quickly adjust forecasts by percentage or exact values across specific date ranges, queues, and channels for more responsive workforce planning.
  • Bloom Filters for Amazon ElastiCache — Amazon ElastiCache now supports Bloom filters in version 8.1 for Valkey, offering a space-efficient way to quickly check if an item is in a set with over 98% memory efficiency compared to traditional sets.
  • Amazon EC2 Skip OS Shutdown Option — AWS has introduced a new option for Amazon EC2 that allows customers to skip the graceful operating system shutdown when stopping or terminating instances, enabling faster application recovery and instance state transitions.
  • AWS HealthOmics Git Repository Integration — AWS HealthOmics now supports direct Git repository integration for workflow creation, allowing researchers to seamlessly pull workflow definitions from GitHub, GitLab, and Bitbucket repositories while enabling version control and reproducibility.
  • AWS Organizations Tag Policies Wildcard Support — AWS Organizations now supports a wildcard statement (ALL_SUPPORTED) in Tag Policies, allowing users to apply tagging rules to all supported resource types for a given AWS service in a single line, simplifying policy creation and reducing complexity.

Blogs of note

Beyond IAM Access Keys: Modern Authentication Approaches — AWS recommends moving beyond traditional IAM access keys to more secure authentication methods, reducing risks of credential exposure and unauthorized access by leveraging modern, more robust approaches to identity management.

Upcoming AWS events

AWS re:Invent 2025 (December 1-5, 2025, Las Vegas) — AWS’s flagship annual conference offering collaborative innovation through peer-to-peer learning, expert-led discussions, and invaluable networking opportunities.

AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Mexico City (August 6) and Jakarta (August 7).

AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Singapore (August 2), Australia (August 15), Adria (September 5), Baltic (September 10), and Aotearoa (September 18).

How Zapier runs isolated tasks on AWS Lambda and upgrades functions at scale

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-zapier-runs-isolated-tasks-on-aws-lambda-and-upgrades-functions-at-scale/

Zapier is a leading no-code automation provider whose customers use their solution to automate workflows and move data across over 8,000 applications such as Slack, Salesforce, Asana, and Dropbox. Zapier runs these automations through integrations called Zaps, which are implemented using a serverless architecture running on Amazon Web Services (AWS). Each Zap is powered by an AWS Lambda function.

In this post, you’ll learn how Zapier has built their serverless architecture focusing on three key aspects: using Lambda functions to build isolated Zaps, operating over a hundred thousand Lambda functions through Zapier’s control plane infrastructure, and enhancing security posture while reducing maintenance efforts by introducing automated function upgrades and cleanup workflows into their platform architecture.

Architecting a secure and isolated runtime environment

Zaps created by Zapier’s users implement tenant-specific business logic, hence they require cross-tenant compute isolation. Code implementing one Zap can’t share an execution environment with code implementing another Zap. Moreover, the same Zap type used by two different tenants can’t share execution environments as well.

To achieve the required level of isolation, Zapier’s engineering team adopted AWS Lambda, a serverless compute service that runs code in response to events and automatically manages cloud compute resources. Minimal operational overhead, built-in high availability, automated scaling, high level of isolation, and pay-per-use model made Lambda a great fit for this use case. Currently, Zapier’s architecture is running over a hundred thousand Lambda functions to support their customer’s integration workflows.

Because they’re powered by the open source Firecracker microVMs, each function is completely isolated from the others. Moreover, each execution environment belonging to the same function (sometimes referred to as function instances) is also isolated from other execution environments. The following architecture topology diagram uses red lines to represent isolation boundaries. Each execution environment of every function is isolated from its peers and is getting its own virtual resources such as disk, memory, and CPU. For more details, read Security in AWS Lambda.

Isolation boundary

Zapier’s control plane is architected using Amazon Elastic Kubernetes Service (Amazon EKS). A designated database is used to maintain the up-to-date function inventory. Whenever a user creates a new Zap, the control plane creates a corresponding Lambda function and stores a reference in the inventory database. When a Zap is triggered, the control plane retrieves information about a relevant Lambda function and invokes it to facilitate the integration workflow, as illustrated in the following diagram.

Control and Data planes

Understanding the runtime deprecation process

When building architectures using the traditional non-serverless compute, cloud engineers are the ones responsible for keeping operating systems and software on their compute instances up to date and applying security and maintenance patches. With serverless architectures and Lambda functions, security patches and minor runtime upgrades are handled by AWS automatically, which means customers can focus on delivering business value instead of the undifferentiated heavy lifting of infrastructure management.

When a major Lambda managed runtime version reaches end-of-life, AWS initiates a deprecation process through the AWS Health Dashboard and direct email communications to affected customers. Because deprecated runtimes eventually lose access to security updates and support, organizations must upgrade to supported runtime versions to avoid potential security risks. Read more about the shared responsibility model, runtime use after deprecation, and receiving runtime deprecation notifications.

As Zapier’s user base and architectural complexity – and consequently the number of Zaps – were growing, keeping all functions on the most up-to-date major runtime versions became a laborious task. Top contributing factors were:

  • High number of functions. At its peak, the Zapier platform was running Zaps using hundreds of thousands of unique Lambda functions. Approximately 35% of these functions were using a runtime that was scheduled for deprecation in the next 12 months.
  • Zapier architected their data plane environment to be ephemeral – the control plane creates and deletes Lambda functions on demand and manages their lifecycle dynamically. Identifying a specific owner for each affected function wasn’t always straightforward.
  • Security is paramount at Zapier and upgrading affected functions runtime prior to the deprecation date was an absolute must. At no point could Zapier functions use runtimes after their deprecation date. This was a task which required extra resources.
  • The upgrade process shouldn’t have had any impact on the end customer experience. At no point should customer experience be affected.

With a short runway, high-volume workload, and the strict requirements of not impacting customer experience, Zapier’s Platform Engineering team took on this challenge of maintaining high security posture in their platform architecture.

Applying the solution

The solution had three work streams:

  1. Reducing the risk by analyzing the architecture and identifying and cleaning up unused functions.
  2. Prioritizing upgrades by identifying the most critical and impactful functions.
  3. Empowering engineering teams with automated tools and knowledge to streamline the upgrade process in future.

Identify and clean up unused functions

The first step in streamlining the upgrade process was identifying and removing unused functions. This reduced the total number of functions in Zapier’s architecture that required upgrades, eliminating unnecessary work for the team.

Zapier started by augmenting the function inventory with runtime information using AWS Trusted Advisor and Amazon Cloud Intelligence Trusted Advisor dashboards, as illustrated in the following diagram.

Gathering data

This meant the team could build a detailed inventory of functions that were running on soon-to-be deprecated runtimes. Using Amazon CloudWatch, Zapier’s platform team started to monitor metrics such as number of invocations. They identified which functions were active, which functions weren’t used for an extended period, and which functions didn’t have an active owner and could be removed.

One of the primary mechanisms for ownership validation within the organization was using resource tags. Functions that were active, but didn’t have clear ownership, were flagged for additional review before removal. Functions that were confirmed as unused or didn’t have an active owner were marked for deletion. Removing such functions allowed Zapier to significantly simplify their architecture and reduce the number of functions that had to be upgraded.

Prioritizing upgrades

With a smaller volume of functions to upgrade, Zapier’s platform team prioritized function upgrades based on usage patterns, criticality, and potential customer impact. Three primary prioritization categories were:

  • Customer-facing functions – Any functions directly involved in executing user Zaps were marked as high priority. These had to be upgraded first to avoid service disruptions.
  • Backend infrastructure functions – Internal functions that supported system operations were evaluated based on their importance to platform stability.
  • High-volume functions – Functions with the highest execution frequency were prioritized because upgrading them would have the greatest impact on reducing operational risk.

Using these factors, Zapier’s platform team has created an upgrade roadmap, ensuring that critical assets were addressed first while minimizing potential disruptions.

Refer to Retrieve data about Lambda functions that use a deprecated runtime in the Lambda Developer Guide to learn how to identify most commonly and most frequently used Lambda functions in your serverless architecture.

Empowering engineering teams with automated tools and knowledge

To ensure a smooth and efficient upgrade process across their serverless architecture, Zapier’s team empowered engineering teams with clear guidelines and automated solutions. The platform incorporated two main approaches: Terraform-managed functions and a custom-built Lambda runtime canary tool. Implementing and adopting these tools and practices resulted in reducing the number of functions using soon-to-be deprecated runtimes by 95%.

For functions managed through infrastructure-as-code (IaC), Zapier’s team developed standardized Terraform modules that specified supported runtime versions. Development teams implemented these modules in their configurations:

resource "aws_lambda_function" "example" {
    runtime = "python3.13"  # Updated to supported runtime
}

After applying the new module version, teams validated changes by testing the new runtime in staging environments and monitoring Terraform plan outputs to ensure proper runtime version updates.

To efficiently manage most Lambda functions in their architecture, Zapier developed the Lambda runtime canary tool suite. Using this solution, they automated the runtime upgrade process for thousands of active Lambda functions with minimal manual intervention. The tool suite implements several key features:

  • Architected for gradual traffic shifting with the Lambda built-in routing mechanism through function version and aliasing. The tool can gradually shift traffic distribution from an old to a new function version. During this gradual traffic shift, the system monitors CloudWatch metrics for errors and automatically rolls back if error rates exceed acceptable thresholds.
  • Optimistic upgrade strategy implements direct upgrades for infrequently used functions using a flag value stored in a cache to detect potential issues during the first post-upgrade invocation. If this invocation fails, the control plane retries it using the previous function version. If the retried invocation succeeds, Zapier’s control plane initiates a rollback, assuming the error is most likely due to the runtime upgrade. After rollback, it will log the error and alert relevant stakeholders.
  • Integration with existing infrastructure uses an administrative interface and task queue for automated traffic shifting. A database ledger maintains tracking of function states and rollback information.
  • Operational controls provide manual rollback capabilities and implement centralized control switches for process management. After a function was upgraded to a new runtime and no rollback activity was detected within a set time period, an automated pruning task cleans up older versions.

Zapier’s Lambda canary tool, through its integration of gradual traffic shifting, real-time CloudWatch monitoring, and automated rollback mechanisms, established a sustainable framework for managing runtime upgrades across their serverless architecture. This approach not only automated the upgrade process and minimized operational risks but also created a scalable solution that provides continuous runtime upgrades, preventing the use of deprecated runtimes at any point. By allowing continuous function runtime updates with minimal disruption to end user experience, Zapier maintains security and stability while requiring minimal manual intervention. This framework efficiently manages their growing serverless infrastructure, providing both security and operational efficiency for future runtime updates.

Conclusion

In this post, you’ve learned how Zapier architected their software-as-a-service (SaaS) platform to provide secure, isolated execution environments using AWS Lambda and Amazon EKS, enabling their customers to create hundreds of thousands of Zaps. You’ve learned how Zapier’s team implemented the function runtime upgrade process at scale and reduced the number of functions running on soon-to-be deprecated runtimes by 95%. You’ve seen best practices that were established and techniques that helped Zapier to keep high security posture without impacting customer experience.

Use the following links to learn more about Lambda runtimes and upgrading your functions to the latest runtime versions:


About the authors

Streamline DevOps troubleshooting: Integrate CloudWatch investigations with Slack

Post Syndicated from Paige Broderick original https://aws.amazon.com/blogs/devops/streamline-devops-troubleshooting-integrate-cloudwatch-investigations-with-slack/

Infrastructure alerts pose a challenge for DevOps teams, particularly when they occur outside of regular business hours. The complexity isn’t merely in receiving notifications, it lies in rapidly assessing their severity and determining the root cause. This challenge is compounded when upstream service disruptions cascade into multiple downstream alerts, creating a confusion of notifications that mask the true source of the problem. DevOps teams find themselves working backwards through a complex web of interconnected services, unsure whether to start investigating at the application, network, or infrastructure level.

To reduce resolution time and alert root cause analysis, AWS introduced CloudWatch Investigations, a generative AI-powered capability within Amazon CloudWatch. Powered by Amazon Q Developer, a generative AI–powered assistant for software development, CloudWatch investigations analyzes multiple metrics, logs, and deployment events to provide suggestions for remediation and root-cause analyses, reducing alarm resolution time. A key advantage of this feature is the ability to integrate these findings directly into Microsoft Teams and Slack, making sure developers and stakeholders receive immediate alerts when issues arise. This centralized collaboration approach enables teams to work together efficiently, reducing duplicate efforts and facilitating consistent problem-solving across the organization.

In this blog post, we will walk through how to integrate CloudWatch Investigations with Slack channels and demonstrate how to interact with investigations in Slack.

Overview of the solution

CloudWatch Investigations can be started in multiple ways, like from existing Amazon CloudWatch log insights, metrics, or alarms. To demonstrate CloudWatch Investigations functionalities, we will use CloudWatch alarms in a sample web application available in the aws-samples GitHub repository. Steps on how to deploy this web app in your AWS environment, via a CloudFormation template, can be found here. You can learn more about the architecture of the resources deployed in the AWS One Observability workshop. If you choose to deploy the sample web application, you will be responsible for all service charges associated with the CloudFormation template deployment. Alternatively, you can use existing CloudWatch alarms in your environment. Examples of common Amazon CloudWatch alarms include: MemoryUtilization, CPUUtiliziation, 5xxErrors and 4xxError. A full list of available alarms can be found here.

For this blog, we will utilize a pre-configured alarm to monitor when one of the website services, backed by an Application Load Balancer, experiences abnormal response times. When the alarm triggers, CloudWatch Investigations automatically initiates an investigation, analyzing both the current alarm state and 90 days of CloudTrail event history to generate hypotheses and determine potential root causes. The investigation insights are published to a Slack channel via Amazon Q Developer in Chat Applications and Amazon Simple Notification Service (SNS).

Figure 1. Architecture diagram of the services involved in the investigation integration in Slack

Prerequisites

  1. Launch the Amazon CloudFormation template associated with the One Observability lab outlined in the AWS Samples GitHub.
  2. Set up a Standard Amazon SNS topic by following the instructions outlined here. To enable CloudWatch investigations to send notifications to Slack, you must add an access policy to the Amazon SNS topic, an example can be found here.
  3. When the topic configuration is complete, navigate to Amazon Q Developer in Chat Applications (formerly AWS Chatbot) to configure the integration between Amazon Q and Slack by following the instructions outlined here. To allow channel members to interact with the investigation in Slack, add the following permission templates to the Channel role settings: Notification Permissions, Amazon Q Permissions, and Amazon Q Operations assistant permissions. More details on these permissions can be located here.

Setting up CloudWatch Investigations

To get started, navigate to the Amazon CloudWatch console. Choose AI Operations and then Configuration.

Figure 2. Configure for this account button within the AWS Console

Before we can set up an investigation, we need to create an investigation group. This is an organizational structure to manage common properties of the investigation like retention requirements, encryption, access permissions and the SNS topic linked. Click Configure for this account and follow the prompts in the console to set up the investigation group. Detailed explanations for each prompt are located in the documentation here. For this demo, we left the default options for steps 1 and 2 of the prompts. In step 3, please select the existing SNS topic created in the prerequisites section.

Figure 3. Select SNS topic for Q Developer Operational Insights

For the investigation trigger, we will use an existing alarm created by the CloudFormation deployment mentioned at the beginning of this blog. The sample alarm is named:

ApplicationInsights/Services/AWS/ApplicationELB/TargetResponseTime/app/Servic-lista-... 

and it goes into ALARM state when one of the website services, backed by an Application Load Balancer, experiences abnormal response times.

To configure this alarm to automatically start an investigation when it goes into an ALARM state:

  1. In the CloudWatch console, choose Alarms, All alarms
  2. Search for the alarm name and click on it
  3. Choose Actions, Edit
  4. Choose Next once to skip the metrics and conditions section
  5. Choose Add investigation action and then select your investigation group as outlined in figure 4
  6. Choose Skip to Preview and create, then choose Update alarm

Figure 4. Configure alarm to automatically start investigations

Testing the solution

At this point, we are ready to test the solution. To simulate a website traffic overload and trigger the alarm, we are going to use Amazon ECS tasks deployed as part of the sample web application. Open up CloudShell and run the following command:

PETLISTADOPTIONS_CLUSTER=$(aws ecs list-clusters | jq '.clusterArns[]|select(contains("PetList"))' -r)

TRAFFICGENERATOR_SERVICE=$(aws ecs list-services --cluster $PETLISTADOPTIONS_CLUSTER | jq '.serviceArns[]|select(contains("trafficgenerator"))' -r)

aws ecs update-service --cluster $PETLISTADOPTIONS_CLUSTER --service $TRAFFICGENERATOR_SERVICE --desired-count 5

The command will launch 5 instances of the Amazon ECS traffic generator container task. Once the tasks are running (after about 5 minutes), the ALB will become overloaded with requests, forcing the alarm into ALARM state as shown below. You should also see a new investigation created.

Figure 5. CloudWatch Alarm in ALARM state

Interacting with the investigation via Slack

Once the alarm is triggered, an investigation is initiated. Since we associated the investigation with an Amazon SNS topic and subscribed our Slack client to it, we can see a message in our Slack channel from Amazon Q as seen in figure 6.

Figure 6. Slack notification for open investigation

Within Slack, channel members can accept useful hypotheses and discard unhelpful ones by clicking on the Accept or Discard button. They can also add text-based notes of observations or evidence to the investigation by clicking on the Add Note button. Amazon Q will respond to messages within the same thread as the original investigation message. Channel members will be able to track who has accepted or discarded messages, as well as notes made about the investigation. This emphasizes the power of Slack integration, as teams can collaborate on the investigation and track who is actively working on it. It is important to note that CloudWatch Investigations uses Generative AI and may provide suggestions different from those below based on your specific account environment.

Figure 7. Accept or discard investigation suggestions from Slack

When integrated with Slack, CloudWatch Investigations can provide suggestions and root-cause hypotheses. Channel members with appropriate permissions can access metrics, charts, and additional information related to the investigation by clicking the blue header at the top of the investigation message. This link will direct users to the CloudWatch Investigations feed in the AWS console as shown below in figure 8.

Figure 8. CloudWatch Investigations in CloudWatch console.

Integrating CloudWatch Investigations with Slack or Teams channels improves developers’ visibility of arising issues and provides targeted recommendations to reduce alarm resolution time. The Accept and Discard buttons make it straightforward to track who is actively working on an investigation, fostering a culture of collaboration. The best part? The integration is quick to set up, especially with existing alarms.

Clean Up

If you launched the CloudFormation template mentioned at the beginning of this blog, the services will continue to run unless you delete them. To make sure that you are not charged for use of the resources after the demo, please follow the below steps to delete the resources created as part of the steps performed on this blog.

  1. Remove the Amazon Q in Chat Applications Slack integration by clicking on Remove Workspace Integration and policy as explained here.
  2. Delete Amazon SNS topic and subscription as explained here.
  3. Remove the CloudWatch Investigations as explained here.
  4. Delete the images under the Amazon ECR repository named cdk-…-container-assets… as explained here.
  5. Open the CloudShell console or AWS CLI and execute the two commands below:
curl https://raw.githubusercontent.com/aws-samples/one-observability-demo/main/PetAdoptions/cdk/pet_stack/resources/destroy_stack.sh | bash

aws cloudformation delete-stack –stack-name CDKToolkit

After executing the above command, the resources of the demo should be destroyed. Look at the CloudFormation console in case of potential errors.

Conclusion

The new CloudWatch Investigations feature reduces alarm resolution time for development teams by providing actionable insights and recommendations. It is straightforward to connect investigations to a team’s primary form of communication, such as Teams or Slack, to improve notification awareness and interaction. To learn more about the capabilities of CloudWatch Investigations check out the feature announcement and documentation.

Happy investigating!

Implement monitoring for Amazon EKS with managed services

Post Syndicated from Aritra Nag original https://aws.amazon.com/blogs/architecture/implement-monitoring-for-amazon-eks-with-managed-services/

In this post, we show you how to implement comprehensive monitoring for Amazon Elastic Kubernetes Service (Amazon EKS) workloads using AWS managed services. Amazon EKS offers compelling solutions with EKS Auto Mode and AWS Fargate, each designed for different use cases. This solution demonstrates building an EKS platform that combines flexible compute options with enterprise-grade observability using AWS native services and OpenTelemetry.

Modern containerized environments require observability that goes beyond basic CPU and memory metrics. Our approach addresses three critical challenges: reducing compute management complexity, closing observability gaps, and enabling metrics-driven automatic scaling that responds to real application demand rather than infrastructure utilization alone.

Architecture components

Amazon Managed Service for Prometheus is a fully managed Prometheus-compatible service that alleviates the operational overhead of running Prometheus infrastructure while providing automatic scaling to handle billions of metrics, built-in high availability across multiple Availability Zones, 150 days of metrics retention by default, and native integration with Grafana and other visualization tools.

AWS Distro for OpenTelemetry (ADOT) is a secure, enterprise-grade distribution of OpenTelemetry that provides standardized metrics, traces, and logs collection, native AWS service integration, automatic instrumentation for popular frameworks, and efficient data processing and export.

Amazon CloudWatch is a centralized logging and monitoring service offering structured log aggregation and search, custom metrics and alarms, integration with AWS services, and long-term log retention and analysis.

Solution overview

This section outlines the comprehensive monitoring solution architecture and its key components. We explore how the different AWS services work together to provide complete observability for your Amazon EKS workloads.

Our solution addresses key challenges through a comprehensive observability pipeline using Amazon Managed Service for Prometheus, AWS X-Ray, and Amazon CloudWatch; real metrics-based automatic scaling using custom Prometheus metrics instead of basic resource utilization; and cost optimization through strategic virtual private cloud (VPC) endpoints and compute mode selection.

The architecture showcases a Kubernetes environment with two distinct compute modes, each optimized for different use cases. EKS Auto Mode represents AWS’s latest approach to managed Kubernetes compute. It eliminates the need for node management by removing the requirement to configure node groups or instance types. The platform automatically scales compute resources based on your actual workload demands, ensuring you pay only for the resources your applications consume. It comes with integrated services including automatic configuration of VPC CNI, EBS CSI driver, and load balancer integration, making it ideal for general workloads and cost-optimized deployments. The Amazon EKS Auto Mode architecture (shown in the following diagram) provides zero node management with automatic scaling based on workload demands. This mode includes integrated networking, storage, and load balancing capabilities, making it ideal for general workloads and cost-optimized deployments.

Amazon EKS Auto Mode Architecture

AWS Fargate takes a different approach by providing true serverless container execution. With Fargate, you don’t need to manage any Amazon EC2 instances, as each pod runs in its own isolated compute environment. This isolation extends to billing, where costs are tracked at the individual pod level, providing granular control over your expenses. Pods can scale independently without requiring capacity planning, making Fargate particularly well-suited for security-sensitive workloads and applications requiring strict resource isolation.The Amazon EKS Fargate architecture (shown in the following diagram) offers serverless container execution with strong isolation, where each pod runs in its own compute environment. This approach works best for security-sensitive workloads and applications requiring granular cost control.

Amazon EKS Fargate Architecture

The key architectural difference lies in networking and scaling behavior. Auto Mode uses shared node networking with cluster-wide scaling decisions, whereas Fargate provides isolated pod networking with individual pod scaling.

Comprehensive observability pipeline

The following diagram illustrates the workflow of the observability pipeline.

Open Telemetry Collector Agent

The observability architecture implements the three pillars of observability using AWS native services:

  • Metrics collection and storage:
    • Dual collection strategy combining direct Prometheus scraping and OpenTelemetry SDK
    • Local Prometheus server for Horizontal Pod Autoscaler (HPA) metrics and Prometheus Adapter integration
    • Amazon Managed Service for Prometheus for long-term storage and querying
    • Custom metrics exposed through Kubernetes custom metrics API
  • Distributed tracing:
    • OpenTelemetry SDK integration for automatic trace collection
    • AWS Distro for OpenTelemetry (ADOT) collector for data processing
    • AWS X-Ray for trace storage and service map visualization
    • End-to-end transaction monitoring across microservices
  • Centralized logging:
    • OpenTelemetry SDK for structured application logging
    • FluentBit for container log collection
    • CloudWatch Logs with proper retention policies
    • Log correlation with traces and metrics for comprehensive debugging

The below diagram demonstrates a modern cloud-native monitoring solution that collects and analyzes performance data from containerized applications, with data flowing from the Kubernetes workloads through the metrics pipeline to CloudWatch for centralized monitoring and observability.

Amazon EKS Fargate and Auto Mode Telemetry Collection

In the following sections, we walk you through deploying the complete observability stack. We start with the foundational AWS services, then configure the collection agents, and finally instrument your applications.

Prerequisites

Before implementing this solution, you must have the following:

Create the observability stack

The first step to implement the observability stack involves creating the core AWS services that will store and process your observability data using the AWS CDK:

from aws_cdk import (
    Stack,
    aws_logs as logs,
    aws_aps as aps,
    aws_iam as iam,
    RemovalPolicy,
    CfnOutput
)

class ObservabilityStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

       # Create workspace for storing Prometheus metrics
        self.prometheus_workspace = aps.CfnWorkspace(
            self, "AmpWorkspace",
            alias="eks-observability-platform"
        )

        # Create CloudWatch Log Groups for storing Application Logs
        self.app_log_group = logs.LogGroup(
            self, "ApplicationLogGroup",
            log_group_name="/aws/eks/observability/applications",
            removal_policy=RemovalPolicy.DESTROY,
            retention=logs.RetentionDays.ONE_WEEK
        )
        
        # Create Otel Log Group for OpenTelemetry Logs
        self.otel_log_group = logs.LogGroup(
            self, "OtelLogGroup",
            log_group_name="/aws/eks/observability/otel",
            removal_policy=RemovalPolicy.DESTROY,
            retention=logs.RetentionDays.ONE_WEEK
        )

Deploy the infrastructure stack using the following commands:

pip install aws-cdk-lib constructs
cdk deploy ObservabilityStack

Deploy local Prometheus for HPA

This step configures Prometheus for service discovery and remote write to Amazon Managed Service for Prometheus. The local Prometheus instance enables the HPA to access custom metrics:

prometheus_config = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {
        "name": "prometheus-config",
        "namespace": "monitoring"
    },
    "data": {
        "prometheus.yml": f"""
global:
  scrape_interval: 15s
  evaluation_interval: 15s

remote_write:
  - url: https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}/api/v1/remote_write
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500
    sigv4:
      region: {region}

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
"""
    }
}

Apply the configuration to your cluster:

kubectl apply -f prometheus-config.yaml

Configure the ADOT Collector

Deploy the ADOT Collector with proper AWS service integration. This collector processes telemetry data from your applications and exports it to AWS services:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: opentelemetry
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      resource:
        attributes:
          - key: ClusterName
            value: ${CLUSTER_NAME}
            action: upsert

    exporters:
      awsxray:
        region: ${AWS_REGION}
      awscloudwatchlogs:
        region: ${AWS_REGION}
        log_group_name: "/aws/eks/observability/otel"
      prometheusremotewrite:
        endpoint: https://aps-workspaces.${AWS_REGION}.amazonaws.com/workspaces/${PROMETHEUS_WORKSPACE_ID}/api/v1/remote_write
        auth:
          authenticator: sigv4auth

    extensions:
      sigv4auth:
        region: ${AWS_REGION}

    service:
      extensions: [sigv4auth]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [resource, batch]
          exporters: [awsxray]
        metrics:
          receivers: [otlp, prometheus]
          processors: [resource, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [resource, batch]
          exporters: [awscloudwatchlogs]

Deploy the collector:

kubectl apply -f adot-collector.yaml

Instrument your applications

This section shows how to instrument your applications to emit telemetry data. We cover both Python and Java applications.

Instrument a Python Flask application

The following code demonstrates how to add OpenTelemetry instrumentation to a Python Flask application:

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Configure OpenTelemetry
resource = Resource.create({
    "service.name": "flask-app",
    "service.version": "1.0.0",
    "deployment.environment": "production"
})

# Setup tracing
trace_provider = TracerProvider(resource=resource)
otlp_trace_exporter = OTLPSpanExporter(endpoint="http://otel-collector.opentelemetry:4317")
trace_provider.add_span_processor(BatchSpanProcessor(otlp_trace_exporter))
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer(__name__)

# Setup metrics
metric_provider = MeterProvider(resource=resource)
otlp_metric_exporter = OTLPMetricExporter(endpoint="http://otel-collector.opentelemetry:4317")
metric_provider.add_metric_reader(PeriodicExportingMetricReader(otlp_metric_exporter))
metrics.set_meter_provider(metric_provider)
meter = metrics.get_meter(__name__)

# Create application metrics
request_counter = meter.create_counter(
    name="http_requests_total",
    description="Total HTTP requests",
    unit="1"
)

@app.route('/api/users')
def users():
    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("endpoint", "api_users")
        # Record metrics
        request_counter.add(1, {"endpoint": "api_users", "status": "success"})
        return jsonify({"users": ["user1", "user2", "user3"]})

Instrument a Java application

For Java applications using Spring Boot, add the following instrumentation:

@RestController
public class ApiController {
    private final Counter httpRequestsTotal;

    public ApiController(MeterRegistry meterRegistry) {
        this.httpRequestsTotal = Counter.builder("http_requests_total")
            .description("Total HTTP requests")
            .register(meterRegistry);
    }

    @GetMapping("/api/users")
    public Map<String, Object> getUsers() {
        httpRequestsTotal.increment();
        Map<String, Object> response = new HashMap<>();
        response.put("users", Arrays.asList("user1", "user2", "user3"));
        return response;
    }
}

Build and deploy your instrumented applications to the EKS cluster with the appropriate annotations for Prometheus scraping.

Configure the Prometheus Adapter for custom metrics

The Prometheus Adapter exposes custom metrics from Prometheus to the Kubernetes custom metrics API, enabling the HPA to use application-specific metrics:

prometheus_adapter_config = """
rules:
- seriesQuery: 'http_requests_total{app="flask-app"}'
  resources:
    overrides:
      kubernetes_namespace: {resource: "namespace"}
      kubernetes_pod_name: {resource: "pod"}
  name:
    as: "flask_app_requests_rate"
  metricsQuery: 'rate(http_requests_total{app="flask-app",<<.LabelMatchers>>}[1m]) * 60'
"""

# Deploy Prometheus Adapter
prometheus_adapter_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        "name": "prometheus-adapter",
        "namespace": "monitoring"
    },
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "prometheus-adapter"}},
        "template": {
            "metadata": {"labels": {"app": "prometheus-adapter"}},
            "spec": {
                "containers": [{
                    "name": "prometheus-adapter",
                    "image": "k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.12.0",
                    "args": [
                        "--prometheus-url=http://prometheus-service.monitoring:9090",
                        "--config=/etc/adapter/config.yaml"
                    ]
                }]
            }
        }
    }
}

Deploy the Prometheus Adapter:

kubectl apply -f prometheus-adapter.yaml

Configure HPAs with custom metrics

Create HPAs that use custom metrics instead of basic resource utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flask-app-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flask-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: flask_app_requests_rate
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60

Apply the HPA configuration:

kubectl apply -f hpa-custom-metrics.yaml

Monitoring and visualization

After implementing this solution, you can create custom dashboards in Amazon Managed Grafana to monitor the following:

  • Application performance metrics
  • Request rates and latencies
  • Resource utilization
  • Error rates

For dashboard examples and templates, refer to the Amazon Managed Grafana documentation. The following screenshots are examples of some of the dashboards you can build:

  • OpenTelemetry Prometheus Dashboard – This dashboard displays Python application performance with request rate by endpoints, response time percentiles (P50, P95, P99), CPU utilization trends, memory usage patterns, and error rates segmented by HTTP status codes.

Python App OpenTelemetry Prometheus Dashboard

  • Go OpenTelemetry Application Dashboard – This dashboard focuses on Go-specific metrics including HTTP request rate, active concurrent users, goroutine counts, CPU usage, and memory allocation patterns with garbage collection insights.

Go App OpenTelemetry Prometheus Dashboard

  • Java OTEL Sample App Monitoring – This dashboard shows JVM-specific metrics like heap memory utilization, alongside application-level metrics such as requests per second, garbage collection insights, and thread pool utilization.

Java App OpenTelemetry Prometheus Dashboard

The dashboards enable real-time application performance monitoring, infrastructure resource utilization tracking, error rate monitoring and alerting, and automatic scaling visualization and trends.

Best practices and recommendations

Choose Amazon EKS Auto Mode for the following use cases and features:

  • You’re building general-purpose applications that benefit from cost optimization and operational simplicity
  • You’re managing mixed workload types and want to use integrated AWS service features
  • Teams want to avoid the complexity of node management
  • Cost-efficiency and ease of operations are priorities for production workloads

Choose Amazon EKS with Fargate in the following scenarios:

  • Security isolation is paramount for your applications
  • You’re running batch or event-driven workloads that require strong container isolation
  • Your organization requires granular cost attribution at the pod level
  • Compliance mandates dictate complete container isolation from the underlying infrastructure

For your observability strategy, consider the following monitoring approach:

  • Use business metrics for HPA scaling decisions
  • Implement proper metric labeling for filtering and aggregation
  • Monitor both application and infrastructure metrics
  • Set up alerting based on Service Level Indicator (SLI) and Service Level Objective (SLO) definitions

Additionally, implement the following tracing approach:

  • Instrument critical code paths with OpenTelemetry
  • Use consistent trace context propagation
  • Monitor service dependencies through AWS X-Ray service maps
  • Implement proper error handling and trace sampling

Benefits of the solution

Instead of relying on basic CPU and memory metrics, this solution configures the Prometheus Adapter to expose custom metrics to the Kubernetes HPA. The HPA configuration shown in this post enables more intelligent scaling decisions based on actual application load, resulting in better resource efficiency and improved application performance. This approach allows your applications to scale based on business-relevant metrics such as request rate, queue length, or custom application metrics rather than generic infrastructure utilization. This solution offers reduced management overhead through the following features:

  • Fully managed – Amazon Managed Service for Prometheus eliminates infrastructure management
  • Automatic scaling – Built-in high availability and scaling
  • Integrated security – Native IAM integration
  • Cost-effective – Pay only for metrics ingested and stored

You also benefit from enhanced observability:

  • Three pillars – Complete metrics, traces, and logs coverage
  • Real-time monitoring – Custom metrics for intelligent automatic scaling
  • Correlation – Trace IDs link logs, metrics, and traces
  • Business metrics – Scale based on application behavior, not just infrastructure

Troubleshooting

If the ADOT Collector isn’t receiving data, troubleshoot as follows:

  • Verify the collector service is running: $ kubectl get pods -n opentelemetry
  • Check application configuration for correct endpoint URLs
  • Verify IAM roles have proper permissions for AWS services

If the custom metrics aren’t available in the HPA, check the following:

  • Confirm the Prometheus Adapter is deployed and running
  • Verify metrics are being scraped by Prometheus: $ kubectl port-forward svc/prometheus 9090:9090
  • Check the Prometheus Adapter configuration for correct metric queries

Deployment cost considerations

In this section, we provide an estimate of the cost that will incur with the preceding solutions:

  • Amazon Managed Service for Prometheus – $0.90 per million samples ingested + $0.03 per GB-month storage
  • AWS X-Ray – $5.00 per million traces recorded
  • Amazon CloudWatch Logs – $0.50 per GB ingested + $0.03 per GB-month storage
  • Amazon EKS – $73/month control plane + compute costs (Auto Mode/Fargate variable)

For a medium-scale application (5 microservices, 2 million samples/hour, 100,000 traces/day, 10 GB logs/day), the costs are as follows:|

Service Cost
Amazon Managed Prometheus ~$80
AWS X-Ray ~$45
CloudWatch Logs ~$165
EKS Control Plane ~$73
Compute costs ~$200-400
Total ~$563-763/month

Costs are estimates based on US East (N. Virginia) pricing as of 2025 and might vary based on AWS Region, usage patterns, and AWS pricing changes. Consider the following cost optimization methods:

  • Sampling – Implement intelligent sampling for high-cardinality metrics
  • Retention – Set appropriate log retention (7–30 days for debug logs)
  • Monitoring – Use CloudWatch billing alarms to track spending
  • Regional – Deploy in single Region to minimize data transfer costs

Clean up

To avoid ongoing charges, delete the resources created in this walkthrough:

  1. Remove IAM roles and policies created for this solution through the IAM console or AWS CLI.
  2. Delete the AWS CDK stack:
cdk destroy ObservabilityStack

Conclusion

This solution demonstrates how organizations can achieve enterprise-grade Kubernetes deployments that balance flexibility, observability, and cost optimization. By combining Amazon EKS Auto Mode or Fargate with comprehensive AWS native observability services, teams can focus on application development while maintaining deep visibility into system performance. The real metrics-based automatic scaling approach represents a significant improvement over traditional resource-based scaling, enabling more intelligent infrastructure decisions that align with actual application behavior. Combined with the flexible compute options and modular architecture, this platform provides a robust foundation for modern containerized applications at scale. Key takeaways include:

  • Use AWS managed services – Reduce operational overhead with Amazon Managed Service for Prometheus and CloudWatch
  • Implement OpenTelemetry – Standardize observability across all applications
  • Custom metrics for HPA – Scale based on business metrics, not just CPU/memory
  • Structured logging – Enable better debugging and correlation
  • Security first – Implement proper IAM roles and network isolation

Organizations implementing this solution can expect reduced operational complexity, improved cost-efficiency, and enhanced visibility into their containerized applications, enabling faster development cycles and more reliable production deployments.


About the author

Monitor and debug event-driven applications with new Amazon EventBridge logging

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/monitor-and-debug-event-driven-applications-with-new-amazon-eventbridge-logging/

Starting today, you can use enhanced logging capability in Amazon EventBridge to monitor and debug your event-driven applications with comprehensive logs. These new enhancements help improve how you monitor and troubleshoot event flows.

Here’s how you can find this new capability on the Amazon EventBridge console:

The new observability capabilities address microservices and event-driven architecture monitoring challenges by providing comprehensive event lifecycle tracking. EventBridge now generates detailed log entries every time a matched event against rules is published, delivered to subscribers, or encounters failures and retries.

You gain visibility into the complete event journey with detailed information about successes, failures, and status codes that make identifying and diagnosing issues straightforward. What used to take hours of trial-and-error debugging now takes minutes with detailed event lifecycle tracking and built-in query tools.

Using Amazon EventBridge enhanced observability
Let me walk you through a demonstration that showcases the logging capability in Amazon EventBridge.

I can enable logging for an existing event bus or when creating a new custom event bus. First, I navigate to the EventBridge console and choose Event buses in the left navigation pane. In Custom event bus, I choose Create event bus.

I can see this new capability in the Logs section. I have three options to configure the Log destination: Amazon CloudWatch Logs, Amazon Data Firehose Stream, and Amazon Simple Storage Service (Amazon S3). If I want to stream my logs into a data lake, I can select Amazon Kinesis Data Firehose Stream. Logs are encrypted in transit with TLS and at rest if a customer-managed key (CMK) is provided for the event bus. CloudWatch Logs supports customer-managed keys, and Data Firehose offers server-side encryption for downstream destinations.

For this demo, I select CloudWatch logs and S3 logs.

I can also choose Log level, from Error, Info, or Trace. I choose Trace and select Include execution data because I need to review the payloads. You need to be mindful as logging payload data may contain sensitive information, and this setting applies to all log destinations you select. Then, I configure two destinations, one each for CloudWatch log group and S3 logs. Then I choose Create.

After logging is enabled, I can start publishing test events to observe the logging behavior.

For the first scenario, I’ve built an AWS Lambda function and configured this Lambda function as a target.

I navigate to my event bus to send a sample event by choosing Send events.

Here’s the payload that I use:

{
  "Source": "ecommerce.orders",
  "DetailType": "Order Placed",
  "Detail": {
    "orderId": "12345",
    "customerId": "cust-789",
    "amount": 99.99,
    "items": [
      {
        "productId": "prod-456",
        "quantity": 2,
        "price": 49.99
      }
    ]
  }
}

After I sent the sample event, I can see the logs are available in my S3 bucket.

I can also see the log entries appearing in the Amazon CloudWatch logs. The logs show the event lifecycle, from EVENT_RECEIPT to SUCCESS. Learn more about the complete event lifecycle on TBD:DOC_PAGE.

Now, let’s evaluate these logs. For brevity, I only include a few logs and have redacted them for readability. Here’s the log from when I triggered the event:

{
    "resource_arn": "arn:aws:events:us-east-1:123:event-bus/demo-logging",
    "message_timestamp_ms": 1751608776896,
    "event_bus_name": "demo-logging",
// REDACTED FOR BREVITY //
    "message_type": "EVENT_RECEIPT",
    "log_level": "TRACE",
    "details": {
        "caller_account_id": "123",
        "source_time_ms": 1751608775000,
        "source": "ecommerce.orders",
        "detail_type": "Order Placed",
        "resources": [],
        "event_detail": "REDACTED FOR BREVITY"
    }
}

Here’s the log when the event was successfully invoked:

{
    "resource_arn": "arn:aws:events:us-east-1:123:event-bus/demo-logging",
    "message_timestamp_ms": 1751608777091,
    "event_bus_name": "demo-logging",
// REDACTED FOR BREVITY //
    "message_type": "INVOCATION_SUCCESS",
    "log_level": "INFO",
    "details": {
// REDACTED FOR BREVITY //
        "total_attempts": 1,
        "final_invocation_status": "SUCCESS",
        "ingestion_to_start_latency_ms": 105,
        "ingestion_to_complete_latency_ms": 183,
        "ingestion_to_success_latency_ms": 183,
        "target_duration_ms": 53,
        "target_response_body": "<REDACTED FOR BREVITY>",
        "http_status_code": 202
    }
}

The additional log entries include rich metadata that makes troubleshooting straightforward. For example, on a successful event, I can see the latency timing from starting to completing the event, duration for the target to finish processing, and HTTP status code.

Debugging failures with complete event lifecycle tracking
The benefit of EventBridge logging becomes apparent when things go wrong. To test failure scenarios, I intentionally misconfigure a Lambda function’s permissions and change the rule to point to a different Lambda function without proper permissions.

The attempt failed with a permanent failure due to missing permissions. The log shows it’s a FIRST attempt that resulted in NO_PERMISSIONS status.

{
    "message_type": "INVOCATION_ATTEMPT_PERMANENT_FAILURE",
    "log_level": "ERROR",
    "details": {
        "rule_arn": "arn:aws:events:us-east-1:123:rule/demo-logging/demo-order-placed",
        "role_arn": "arn:aws:iam::123:role/service-role/Amazon_EventBridge_Invoke_Lambda_123",
        "target_arn": "arn:aws:lambda:us-east-1:123:function:demo-evb-fail",
        "attempt_type": "FIRST",
        "attempt_count": 1,
        "invocation_status": "NO_PERMISSIONS",
        "target_duration_ms": 25,
        "target_response_body": "{\"requestId\":\"a4bdfdc9-4806-4f3e-9961-31559cb2db62\",\"errorCode\":\"AccessDeniedException\",\"errorType\":\"Client\",\"errorMessage\":\"User: arn:aws:sts::123:assumed-role/Amazon_EventBridge_Invoke_Lambda_123/db4bff0a7e8539c4b12579ae111a3b0b is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-east-1:123:function:demo-evb-fail because no identity-based policy allows the lambda:InvokeFunction action\",\"statusCode\":403}",
        "http_status_code": 403
    }
}

The final log entry summarizes the complete failure with timing metrics and the exact error message.

{
    "message_type": "INVOCATION_FAILURE",
    "log_level": "ERROR",
    "details": {
        "rule_arn": "arn:aws:events:us-east-1:123:rule/demo-logging/demo-order-placed",
        "role_arn": "arn:aws:iam::123:role/service-role/Amazon_EventBridge_Invoke_Lambda_123",
        "target_arn": "arn:aws:lambda:us-east-1:123:function:demo-evb-fail",
        "total_attempts": 1,
        "final_invocation_status": "NO_PERMISSIONS",
        "ingestion_to_start_latency_ms": 62,
        "ingestion_to_complete_latency_ms": 114,
        "target_duration_ms": 25,
        "http_status_code": 403
    },
    "error": {
        "http_status_code": 403,
        "error_message": "User: arn:aws:sts::123:assumed-role/Amazon_EventBridge_Invoke_Lambda_123/db4bff0a7e8539c4b12579ae111a3b0b is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-east-1:123:function:demo-evb-fail because no identity-based policy allows the lambda:InvokeFunction action",
        "aws_service": "AWSLambda",
        "request_id": "a4bdfdc9-4806-4f3e-9961-31559cb2db62"
    }
}

The logs provide detailed performance metrics that help identify bottlenecks. The ingestion_to_start_latency_ms: 62 shows the time from event ingestion to starting invocation, while ingestion_to_complete_latency_ms: 114 represents the total time from ingestion to completion. Additionally, target_duration_ms: 25 indicates how long the target service took to respond, helping distinguish between EventBridge processing time and target service performance.

The error message clearly states what failed, lambda:InvokeFunction action, why it failed, (no identity-based policy allows the action), which role was involved (Amazon_EventBridge_Invoke_Lambda_1428392416), and which specific resource was affected, which was indicated by the Lambda function Amazon Resource Name (ARN).

Debugging API Destinations with EventBridge Logging
One particular use case that I think EventBridge logging capability will be helpful is to debug issues with API destinations. EventBridge API destinations are HTTPS endpoints that you can invoke as the target of an event bus rule or pipe. HTTPS endpoints help you to route events from your event bus to external systems, software-as-a-service (SaaS) applications, or third-party APIs using HTTPS calls. They use connections to handle authentication and credentials, making it easy to integrate your event-driven architecture with any HTTPS-based service. 

API destinations are commonly used to send events to external HTTPS endpoints and debugging failures from the external endpoint can be a challenge. These problems typically stem from changes to the endpoint authentication requirements or modified credentials.

To demonstrate this debugging capability, I intentionally configured an API destination with incorrect credentials in the connection resource.

When I send an event to this misconfigured endpoint, the enhanced logging shows the root cause of this failure.

{
    "resource_arn": "arn:aws:events:us-east-1:123:event-bus/demo-logging",
    "message_timestamp_ms": 1750344097251,
    "event_bus_name": "demo-logging",
    //REDACTED FOR BREVITY//,
    "message_type": "INVOCATION_FAILURE",
    "log_level": "ERROR",
    "details": {
        //REDACTED FOR BREVITY//,
        "total_attempts": 1,
        "final_invocation_status": "SDK_CLIENT_ERROR",
        "ingestion_to_start_latency_ms": 135,
        "ingestion_to_complete_latency_ms": 549,
        "target_duration_ms": 327,
        "target_response_body": "",
        "http_status_code": 400
    },
    "error": {
        "http_status_code": 400,
        "error_message": "Unable to invoke ApiDestination endpoint: The request failed because the credentials included for the connection are not authorized for the API destination."
    }
}

The log provides immediate clarity about the failure. The target_arn shows this involves an API destination, the final_invocation_status indicates SDK_CLIENT_ERROR, and the http_status_code of 400 , which points to a client-side issue. Most importantly, the error_message explicitly states that: Unable to invoke ApiDestination endpoint: The request failed because the credentials included for the connection are not authorized for the API destination.

This complete log sequence provides useful debugging insights because I can see exactly how the event moved through EventBridge — from event receipt, to ingestion, to rule matching, to invocation attempts. This level of detail eliminates guesswork and points directly to the root cause of the issue.

Additional things to know
Here are a couple of things to note:

  • Architecture support – Logging works with all EventBridge features including custom event buses, partner event sources, and API destinations for HTTPS endpoints.
  • Performance impact – Logging operates asynchronously with no measurable impact on event processing latency or throughput.
  • Pricing – You pay standard Amazon S3, Amazon CloudWatch Logs or Amazon Data Firehose pricing for log storage and delivery. EventBridge logging itself incurs no additional charges. For details, visit the Amazon EventBridge pricing page .
  • Availability – Amazon EventBridge logging capability is available in all AWS Regions where EventBridge is supported.
  • Documentation — For more details, refer to the Amazon EventBridge monitoring and debugging Documentation.

Get started with Amazon EventBridge logging capability by visiting the EventBridge console and enabling logging on your event buses.

Happy building!
— Donnie 

AWS Weekly Roundup: AWS Builder Center, Amazon Q, Oracle Database@AWS, and more (July 14, 2025)

Post Syndicated from Matheus Guimaraes original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-aws-builder-center-amazon-q-oracle-databaseaws-and-more-july-14-2025/

Summer is well and truly here in the UK! I’m a bit of a summer grinch though so, unlike most people, I’m not crazy about “the glorious sun” scorching me when I’m out and about. On the upside, this provides the perfect excuse to retreat to the comfort of a well-ventilated room where I can focus on coding and curating the latest AWS releases to bring you the highlights.

I also managed to escape the heat for most of yesterday while recording an episode for the AWS Developers Podcast where the wonderful Sebastien Stormaq and Tiffany Souterre interviewed me about games development. If you haven’t discovered it yet, I highly recommend you give it a go as the episodes are full of interesting lessons and insights from not just AWS, but customers and community members who share their stories and expertise in a relaxed conversation.

Alright, ready to discover some of the new things we released last week? Here are the highlights.

AWS Builder Center
There is a new home for AWS builders and community members! AWS Builder Center is a new place where cloud builders can connect, share knowledge, and access resources to enhance their AWS journey. The platform enables users to join community programs, discover trending topics, access AWS Skill Builder courses, participate in technical challenges, and more, using a single Builder ID sign-in.

One the features that I’m personally most excited about is the Wishlist. You can now create wishes and tell AWS directly about ways to improve our products and services or share original ideas that you think could help you and your teams. You can also browse and upvote existing wishes to support any suggestions that you think should be prioritized. The AWS teams will keep an eye on this and if a wish has enough traction it may just be considered!

Read the news blog post for a quick tour through some of the most exciting features or head over to AWS Builder Center and start exploring!

AI
The world of AI keeps moving fast and changing our world, by providing new and exciting ways to do things and become more productive. Here are two releases from last week that caught my attention.

  • Amazon Q chat in the AWS Management Console can now query AWS service data – Amazon Q Developer expands its capabilities by enabling natural language queries of data stored across AWS services like S3, DynamoDB, and CloudWatch, directly from the AWS Console, Slack, Microsoft Teams, and AWS Console Mobile Application. This enhancement streamlines cloud management and troubleshooting by allowing users to access and analyze service data through conversational interfaces, with access controls managed through IAM permissions.
  • Amazon CloudWatch and Application Signals MCP servers for AI-assisted troubleshooting – AWS has released two new Model Context Protocol (MCP) servers – CloudWatch MCP and Application Signals MCP – that enable AI agents to leverage observability data for automated troubleshooting through conversational interfaces. These open-source servers allow AI assistants to analyze metrics, alarms, logs, traces, and service health data across AWS environments, streamlining incident response and root cause analysis without requiring developers to manually navigate multiple AWS consoles.

Oracle Database@AWS
It seems like yesterday when Andy Jassy announced our partnership with Oracle to create Oracle Database@AWS, a jointly offered service that runs Oracle databases on Exadata infrastructure directly within AWS data centers, providing a unified AWS-Oracle experience. Fast forward to last week and Oracle Database@AWS has reached a significant milestone with its general availability release. It is now available in US East (N. Virginia) and US West (Oregon) regions, with plans to expand to 20 additional regions globally.

In addition, VPC Lattice has added support for Oracle Database@AWS enabling seamless connectivity between applications in VPCs and on-premises environments to Oracle database networks. The integration simplifies network management and provides secure access from Oracle Database@AWS to AWS services like Amazon S3 and Amazon Redshift, without requiring complex networking setup.

So if you’re looking to migrate your Oracle database workloads, now is a great time to explore Oracle Database@AWS as it offers a compelling path forward with minimal modifications required.

Additional highlights
Here are some other releases that I think many people will be happy about.

  • AWS Config now supports 12 new resource types – AWS Config has expanded its monitoring capabilities with support for 12 new resource types across services including BackupGateway, CloudFront, EntityResolution, Bedrock, and more. These additions are automatically tracked if you have enabled recording for all resource types, enhancing your ability to discover, assess, and audit AWS resources.
  • Amazon SageMaker Studio now supports remote connections from Visual Studio Code – Amazon SageMaker Studio now supports remote connections from Visual Studio Code, allowing developers to use their familiar VS Code setup while leveraging SageMaker’s scalable compute resources for AI development.
  • AWS Network Firewall: Native AWS Transit Gateway support in all regions – AWS Network Firewall now offers native integration with AWS Transit Gateway across all supported regions, enabling direct attachment and simplified traffic inspection between VPCs and on-premises networks. This integration eliminates the need for managing dedicated VPC subnets and route tables while providing multi-AZ redundancy for improved security and reliability.

Upcoming AWS Events
AWS Summit New York – this is definitely one to watch…literally! Registrations are closed due to capacity but you can tune in to watch live all the announcements and launches! No spoilers, but, trust me, there are a quite a few exciting things in store, so make sure to check it out.

AWS Gen AI LoftsAWS Gen AI Lofts are multi-day events offering hands-on workshops, expert guidance, and networking opportunities for developers and business leaders looking to explore or advance their generative AI journey. These events are hosted across multiple global locations including San Francisco, Berlin, Dubai, Dublin, Bengaluru, Manchester, Paris, and Tel Aviv, providing accessible opportunities to accelerate your generative AI adoption.

And that’s it for this week! Come back next Monday for more highlights and keep your AWS knowledge up to date as we cover the latest releases.

Matheus Guimaraes | @codingmatheus

AWS Weekly Roundup: Project Rainier, Amazon CloudWatch investigations, AWS MCP servers, and more (June 30, 2025)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-project-rainier-amazon-cloudwatch-investigations-aws-mcp-servers-and-more-june-30-2025/

Every time I visit Seattle, the first thing that greets me at the airport is Mount Rainier. Did you know that the most innovative project at Amazon Web Services (AWS) is named after this mountain?

Project Rainier is a new project to create what is expected to be the world’s most powerful computer for training AI models across multiple data centers in the United Stages. Anthropic will develop the advanced versions of its Claude models with five times more computing power than its current largest training cluster.

The key technology powering Project Rainier is AWS custom-designed Trainium2 chips, which are specialized for the immense data processing required to train complex AI models. Thousands of these Trainium2 chips will be connected in a new type of Amazon EC2 UltraServer and EC2 UltraCluster architecture that allows ultra-fast communication and data sharing across the massive system.

Learn about the AWS vertical integration of Project Rainer, where it designs every component of the technology stack from chips to software, allows it to optimize the entire system for maximum efficiency and reliability.

Last week’s launches
Here are some launches that got my attention:

  • Amazon S3 access for Amazon FSx for OpenZFS – You can access and analyze your FSx for OpenZFS file data through Amazon S3 Access Points, enabling seamless integration with AWS AI/ML, and analytics services without moving your data out of the file system. You can treat your FSx for OpenZFS data as if it were stored in S3, making it accessible through the S3 API for various applications including Amazon Bedrock, Amazon SageMaker, AWS Glue, and other S3 based cloud-native applications.
  • Amazon S3 with sort and z-order compaction for Apache Iceberg tables – You can optimize query performance and reduce costs with new sort and z-order compaction. With S3 Tables, sort compaction automatically organizes data files based on defined column orders, while z-order compaction can be enabled through the maintenance API for efficient multicolumn queries.
  • Amazon CloudWatch investigations – You can accelerate your operational troubleshooting in AWS environments using the Amazon CloudWatch AI-powered investigation feature, which helps identify anomalies, surface related signals, and suggest remediation steps. This capability can be initiated through CloudWatch data widgets, multiple AWS consoles, CloudWatch alarm actions, or Amazon Q chat and enables team collaboration and integration with Slack and Microsoft Teams.
  • Amazon Bedrock Guardrails Standard tier – You can enhance your AI content safety measures using the new Standard tier. It offers improved content filtering and topic denial capabilities across up to 60 languages, better detection of variations including typos, and stronger protection against prompt attacks. This feature lets you configure safeguards to block harmful content, prevent model hallucinations, redact personally identifiable information (PII), and verify factual claims through automated reasoning checks.
  • Amazon Route 53 Resolver endpoints for private hosted zone – You can simplify DNS management across AWS and on-premises infrastructure using the new Route 53 DNS delegation feature for private hosted zone subdomains, which works with both inbound and outbound Resolver endpoints. You can delegate subdomain authority between your on-premises infrastructure and Route 53 Resolver cloud service using name server records, eliminating the need for complex conditional forwarding rules.
  • Amazon Q Developer CLI for Java transformation – You can automate and scale Java application upgrades using the new Amazon Q Developer Java transformation command line interface (CLI). This feature perform upgrades from Java versions 8, 11, 17, or 21 to versions 17 or 21 directly from the command line. This tool offers selective transformation options so you can choose specific steps from transformation plans and customize library upgrades.
  • New AWS IoT Device Management managed integrations – You can simplify Internet of Things (IoT) device management across multiple manufacturers and protocols using the new managed integrations feature, which provides a unified interface for controlling devices whether they connect directly, through hubs or third-party clouds. The feature includes pre-built cloud-to-cloud (C2C) connectors, device data model templates, and SDKs that support ZigBee, Z-Wave, and Wi-Fi protocols, while you can still create custom connectors and data models.

For a full list of AWS announcements, be sure to keep an eye on the What’s New with AWS? page.

Other AWS news
Various Model Context Protocol (MCP) servers for AWS services have been released. Here are some tutorials about MCP servers that you might find interesting:

Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events:

  • AWS re:Invent – Register now to get a head start on choosing your best learning path, booking travel and accommodations, and bringing your team to learn, connect, and have fun. If you’re an early-career professional, you can apply to the All Builders Welcome Grant program, which is designed to remove financial barriers and create diverse pathways into cloud technology.
  • AWS NY Summits – You can gain insights from Swami’s keynote featuring the latest cutting-edge AWS technologies in compute, storage, and generative AI. My News Blog team is also preparing some exciting news for you. If you’re unable to attend in person, you can still participate by registering for the global live stream. Also, save the date for these upcoming Summits in July and August near your city.
  • AWS Builders Online Series – If you’re based in one of the Asia Pacific time zones, join and learn fundamental AWS concepts, architectural best practices, and hands-on demonstrations to help you build, migrate, and deploy your workloads on AWS.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Channy

AWS Weekly Roundup: Amazon Nova Premier, Amazon Q Developer, Amazon Q CLI, Amazon CloudFront, AWS Outposts, and more (May 5, 2025)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-nova-premier-amazon-q-developer-amazon-q-cli-amazon-cloudfront-aws-outposts-and-more-may-5-2025/

Last week I went to Thailand to attend the AWS Summit Bangkok. It was an energizing and exciting event. We hosted the Developer Lounge, where developers can meet, discuss ideas, enjoy lightning talks, win SWAGs at AWS Builder ID Prize Wheel, take a challenge at Amazon Q Developer Coding Challenge, or learn Generative AI at Learn Amazon Bedrock booth.

Here’s a quick look:

Thank you to AWS Heroes, AWS Community Builders, AWS User Group leaders and developers for your collaboration.

Coming up next in ASEAN is AWS Summit Singapore—make sure you don’t miss it by registering now.

Last Week’s Launches
Here are some launches last week that caught my attention:

  • Amazon Nova Premier Now Generally Available — Amazon Nova Premier, our most capable model for complex tasks and teacher for model distillation, is now generally available in Amazon Bedrock. It excels at complex tasks requiring deep context understanding and multistep planning, while processing text, images, and videos with a 1M token context length. With Nova Premier and Amazon Bedrock Model Distillation, you can create highly capable, cost-effective, and low-latency versions of Nova Pro, Lite, and Micro, for your specific needs.

  • Amazon Q Developer elevates the IDE experience with new agentic coding experience — This new interactive, agentic coding experience for Visual Studio Code allows Q Developer to intelligently take actions on behalf of the developer. Amazon Q Developer introduces an interactive coding experience in Visual Studio Code, offering real-time collaboration for coding, documentation, and testing. It provides transparent reasoning, and supports automated or step-by-step changes in multiple languages.

  • New Foundation Models in Amazon Bedrock — Amazon Bedrock expands its model offerings with two significant additions:
    • Writer’s Palmyra X5 and X4 models feature extensive context windows (1M and 128K tokens respectively) and excel in complex reasoning for enterprise applications. They support multistep tool-calling and adaptive thinking with high reliability standards.
    • Meta’s Llama 4 Scout 17B and Maverick 17B models offer natively multimodal capabilities using mixture-of-experts architecture for enhanced reasoning and image understanding. They support multiple languages and extended context processing, with simplified integration through the Bedrock Converse API.
  • Second-Generation AWS Outposts Racks Released AWS announces the general availability of second-generation Outposts racks with significant enhancements including the latest x86 EC2 instances, simplified networking, and accelerated networking options. These improvements deliver doubled vCPU, memory, and network bandwidth, 40% better performance, and support for ultra-low latency workloads, making them ideal for demanding on-premises deployments.

  • Amazon CloudFront SaaS Manager Launches — Amazon CloudFront SaaS Manager helps SaaS providers and web hosting platforms efficiently manage content delivery across multiple customer domains. The service dramatically reduces operational complexity while providing high-performance content delivery and enterprise-grade security for every customer domain.

  • Amazon Aurora Now Supports PostgreSQL 17 — Amazon Aurora now supports PostgreSQL 17.4, offering community improvements and Aurora-specific enhancements like optimized memory management and faster failovers. The release includes new features for Babelfish, security fixes, and updated extensions, available in all AWS Regions.
  • CloudWatch Introduces Tiered Pricing for Lambda Logs — Amazon CloudWatch launches tiered pricing for AWS Lambda logs and new delivery destinations. Pricing in US East starts at $0.50/GB for CloudWatch and $0.25/GB for S3 and Firehose, both tiering down to $0.05/GB. This update enhances flexibility in log management across all supporting Regions.
  • RDS for MySQL Updates Minor VersionsAmazon RDS for MySQL now supports minor versions 8.0.42 and 8.4.5, delivering security fixes, bug fixes, and performance improvements. Users can upgrade automatically during maintenance windows or use Blue/Green deployments for safer updates.
  • Amazon Bedrock Model Distillation Generally AvailableAmazon Bedrock Model Distillation is now generally available, supporting new models like Amazon Nova and Claude 3.5. It enables smaller models to accurately predict function calling for Agents, delivering up to 500% faster responses and 75% lower costs with minimal accuracy loss for RAG use cases. The service includes automated workflows for data synthesis and student model training.
  • AI Search Flow Builder for Amazon OpenSearch Service Amazon OpenSearch Service now offers an AI search flow builder for OpenSearch 2.19+ domains. This low-code designer enables creation of sophisticated AI-enhanced search flows using AWS and third-party services, supporting use cases like RAG, query rewriting, and semantic encoding.

From Community.AWS
Here’s my personal favorites posts from community.aws:

Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events:

  • AWS Summit — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Poland (6 May), Bengaluru (May 7 – 8), Hong Kong (May 8), Seoul (May 14-15), Singapore (May 29), and Sydney (June 4–5).
  • AWS re:Inforce – Mark your calendars for AWS re:Inforce (June 16–18) in Philadelphia, PA. AWS re:Inforce is a learning conference focused on AWS security solutions, cloud security, compliance, and identity. You can subscribe for event updates now!
  • AWS Partners Events – You’ll find a variety of AWS Partner events that will inspire and educate you, whether you are just getting started on your cloud journey or you are looking to solve new business challenges.
  • AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Yerevan, Armenia (May 24), Zurich, Switzerland (May 25), and Bengaluru, India (May 25).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!


How is the News Blog doing? Take this 1 minute survey!

(This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.)

AWS Weekly Review: Amazon S3 Express One Zone price cuts, Pixtral Large on Amazon Bedrock, Amazon Nova Sonic, and more (April 14, 2025)

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/aws-weekly-review-amazon-s3-express-one-zone-price-cuts-pixtral-large-on-amazon-bedrock-amazon-nova-sonic-and-more-april-14-2025/

The Amazon Web Services (AWS) Summit 2025 season launched this week, starting with the Paris Summit. These free events bring together the global cloud computing community for learning and collaboration. AWS Community Day Romania, held on April 11th, showcased how the local community creates opportunities for collective growth and inclusion.

Last week’s launches
Announcing up to 85% price reductions for Amazon S3 Express One Zone S3 Express One Zone, a high-performance storage class, now has reduced storage prices by 31 percent, PUT request prices by 55 percent, and GET request prices by 85 percent. In addition, S3 Express One Zone has reduced the per-GB charges for data uploads and retrievals by 60 percent. These charges now apply to all bytes transferred rather than just portions of requests greater than 512 KB.

Here is a price reduction table in the US East (N. Virginia) AWS Region:

Price Previous New Price reduction
Storage
(per GB-Month)
$0.16 $0.11 31%
Writes
(PUT requests)
$0.0025 per 1,000 requests up to 512 KB $0.00113 per 1,000 requests 55%
Reads
(GET requests)
$0.0002 per 1,000 requests up to 512 KB $0.00003 per 1,000 requests 85%
Data upload
(per GB)
$0.008 $0.0032 60%
Data retrievals
(per GB)
$0.0015 $0.0006 60%

AWS announces Pixtral Large 25.02 model in Amazon Bedrock serverless The Pixtral Large 25.02, developed by Mistral AI, combines advanced vision and language understanding, boasting a 128K context window and multilingual capabilities. This agent-centric design simplifies integration with existing systems. Prompt adherence improves reliability when working with Retrieval Augmented Generation (RAG) applications and large context scenarios.

Introducing Amazon Nova Sonic: Human-like voice conversations for generative AI applications Amazon Nova Sonic, the newest addition to the Amazon Nova family of foundation models (FMs) is available in Amazon Bedrock to create human-like voice conversations for applications. It unifies speech and text processing into one model, reducing complexity and enhancing natural interactions. Start today with the Amazon Nova model cookbook repository.

Amazon Bedrock Guardrails enhances generative AI application safety with new capabilitiesAmazon Bedrock Guardrails introduces new capabilities to enhance generative AI application safety, including multimodal toxicity detection, enhanced Personally Identifiable Information (PII) protection, AWS Identity and Access Management (AWS IAM) policy enforcement, selective guardrail application, and monitor mode for pre-deployment analysis.

AWS App Studio introduces a prebuilt solutions catalog and cross-instance Import and Export — This is a prebuilt solutions catalog with ready-to-use applications and patterns and cross-instance Import and Export functionality. These features help you streamline development applications, reducing setup time to under 15 minutes. Learn more about this in AWS App Studio introduces a prebuilt solutions catalog and cross-instance Import and Export blog.

Amazon Nova Reel 1.1: Featuring up to 2-minutes multi-shot videos Amazon Nova Reel 1.1 enhances video generation through Amazon Bedrock with support for 2-minute multi-shot videos. You can now create content using either single prompts for automatic generation or custom prompts for individual shots, offering flexible options for marketing and social media content creation.

AWS IAM Identity Center now offers improved error messages and AWS CloudTrail logging for provisioning issues AWS Identity and Access Management (IAM) Identity Center has enhanced its service with improved error messages and AWS CloudTrail logging capabilities. These updates help users better troubleshoot synchronization issues when managing workforce identities across AWS accounts and applications, while enabling automated monitoring and auditing of provisioning problems.

AWS WAF Console adds new top insights visualizations in additional regionsAWS WAF Console now offers enhanced traffic visualization features in AWS GovCloud (US) Regions. The all traffic dashboard includes new top insights based on Amazon CloudWatch logs, helping customers analyze traffic patterns, identify security threats, and optimize WAF configurations through detailed metrics.

AWS Step Functions expands data source and output options for Distributed MapAWS Step Functions enhances Distributed Map with expanded data source support, including JSONL and various delimited file formats from Amazon Simple Storage Service (Amazon S3). The update also adds new output transformation options, enabling more flexible parallel processing workflows and better integration with downstream systems.

Amazon CloudWatch now provides lock contention diagnostics for Aurora PostgreSQL Amazon CloudWatch Database Insights introduces lock contention diagnostics for Amazon Aurora PostgreSQL in Advanced mode. The feature visualizes blocking and waiting sessions, helping users identify root causes of lock contention issues, with 15-month historical data retention for comprehensive troubleshooting.

Get updated with all the announcements of AWS announcements on the What’s New with AWS? page.

Other AWS blog posts
Reduce ML training costs with Amazon SageMaker HyperPodAmazon SageMaker HyperPod addresses hardware failures in large-scale Machine Learning (ML) model training by automatically detecting and replacing faulty instances. The solution reduces downtime from 280 to 40 minutes per failure, potentially saving 32% of training time for large clusters. For a 10-million GPU-hour training job, this translates to $25.6M in cost savings.

Model customization, RAG, or both: A case study with Amazon Nova — A study comparing model customization with fine-tuning and Retrieval Augmented Generation (RAG) approaches with Amazon Nova models. Key findings show combining both methods yields best results: RAG works well for dynamic data and domain insights, while fine-tuning excels in specialized tasks and latency reduction.

Generate user-personalized communication with Amazon Personalize and Amazon BedrockAmazon Personalize and Amazon Bedrock work together to create personalized marketing emails. Learn how to create personalized user communications by combining Amazon Personalize for movie recommendations with Amazon Bedrock for generating tailored email content based on user preferences and demographics.

Implement human-in-the-loop confirmation with Amazon Bedrock Agents — When implementing human validation in Amazon Bedrock Agents, developers have two primary frameworks at their disposal: user confirmation and return of control (ROC). Using an HR application example, user confirmation allows simple yes/no validation before executing actions, while ROC enables users to modify parameters before execution.

Multi-LLM routing strategies for generative AI applications on AWS — Learn how to implement multi-Large Language Model (LLM) routing strategies for AWS generative AI applications using static routing, dynamic routing with Amazon Bedrock, or custom solutions for optimal model selection and cost efficiency.

Here are my personal favorites posts from community.aws:

Building a RAG System for Video Content Search and Analysis — In this blog, I’ll show you how to build a RAG system that makes video content searchable and analyzable. Unlocking video content has never been more crucial in today’s digital landscape. Whether you’re managing educational materials, corporate training, or entertainment content, the ability to search and analyze video content efficiently can transform how we interact with multimedia resources.

Build Serverless GenAI Apps Faster with Amazon Q Developer CLI AgentAmazon Q Developer CLI Agent enables rapid serverless GenAI app development. With one prompt, it generates infrastructure code, Lambda functions, and integrates with Claude 3 Haiku on Amazon Bedrock.

Speech-to-Speech AI: From Dr. Sbaitso to Amazon Nova Sonic — The evolution of speech-to-speech AI, from Dr. Sbaitso (1990s) to Amazon Nova Sonic. New AWS service enables real-time bidirectional conversations through Amazon Bedrock for more natural applications.

Setup Model Context Protocol (MCP) using Amazon Bedrock — A guide to setting up Model Context Protocol (MCP) desktop client with Amazon Bedrock models, enabling seamless integration between AI applications and external tools using Goose client.

Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events:

AWS GenAI LoftsGenAI Lofts available around the world, offer collaborative spaces and immersive experiences for startups and developers. You can join in-person GenAI Loft San Francisco events such as GenAI in EdTech: A Hands-On Workshop (April 15), and Unstructured Data Meetup SF (April 16). Find your nearest event at GenAI Lofts.

AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Amsterdam (April 16), London (April 30), and Poland (May 5).

AWS re:Inforce — AWS re:Inforce (June 16–18) in Philadelphia, PA, is our annual learning event devoted to all things AWS cloud security. Registration is open. Be ready to join more than 5,000 security builders and leaders.

AWS Community Days — Join community-led conferences featuring technical discussions, workshops, and hands-on labs driven by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are scheduled for April 19 in Turkey, and on April 29 in Prague with Jeff Barr as Opening Keynote Speaker.

You can browse all upcoming in-person and virtual events.

Create your AWS Builder ID and reserve your alias. Builder ID is a universal login credential that gives you access—beyond the AWS Management Console—to AWS tools and resources, including over 600 free training courses, community features, and developer tools such as Amazon Q Developer.

That’s all for this week. Stay tuned for next week’s Weekly Roundup!

Eli

Thanks to Andra Somesan for the AWS Community Romania photo and Thembile Martis for the AWS Paris Summit photo.

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!


How is the News Blog doing? Take this 1 minute survey!

(This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.)

Serverless ICYMI 2025 Q1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/serverless-icymi-2025-q1/

Welcome to the 28th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. At the end of a quarter, we share the most recent product launches, feature enhancements, blog posts, videos, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened in Q4 2024 here.

Serverless calendar Q1 2025

Serverless calendar Q1 2025

AWS Step Functions

The AWS Step Functions team continues to improve developer experience. Workflow Studio is now available within Visual Studio Code (VS Code) through the AWS Toolkit extension.

AWS Step Functions in IDE

AWS Step Functions in IDE

You can now design, test, and deploy your Step Functions workflows without leaving your IDE. The extension provides a drag-and-drop interface with all the familiar Workflow Studio capabilities, making it even easier to build state machines locally.

To get started, install the AWS Toolkit for Visual Studio Code and visit the user guide on Workflow Studio integration.

Step Functions private integrations now allows you to integrate applications seamlessly across private networks, on-premises infrastructure, and cloud platforms. Learn more in a blog post and explanation video.

AWS Step Functions private integrations video

AWS Step Functions private integrations video

Step Functions now integrates with 36 more AWS services that support user messaging capabilities. You can orchestrate notifications through Amazon SNS, Amazon SQS, Amazon EventBridge, Amazon Pinpoint, and more, all using the optimized integrations you’re familiar with.

Step Functions has increased the default quota for state machines and activities from 10,000 to 100,000 per AWS account. This tenfold increase means you can create more workflows to automate your business processes without worrying about hitting quota limits.

Distributed Map is expanding capabilities by adding support for JSON Lines (JSONL) format. JSONL, a highly efficient text-based format, stores structured data as individual JSON objects separated by newlines, making it particularly suitable for processing large datasets.

AWS Step Functions Distributed Map

AWS Step Functions Distributed Map

Distributed Map can also process data from a broader range of delimited file formats stored in Amazon S3 and offers new output transformations for greater control over result formatting.

Developer Tools

Serverless Land patterns are now available directly within VS Code.

You no longer need to switch between your IDE and external resources when building serverless architectures. Browse, search, and implement pre-built serverless patterns directly in VS Code.

Example Serverless Pattern

Example Serverless Pattern

AWS Lambda

Learn how AWS Lambda handles billions of invocations.

AWS Lambda asynchronous invocations

AWS Lambda asynchronous invocations

This blog post provides recommendations and insights for implementing highly distributed applications based on the Lambda service team’s experience building its robust asynchronous event processing system. It dives into challenges you might face, solution techniques, and best practices for handling noisy neighbors.

A new video walks through using the enhanced local IDE experience for Lambda developers.

AWS Lambda new IDE experience

AWS Lambda new IDE experience

The VS Code extension for Lambda now supports live tailing of CloudWatch Logs directly in your IDE following on from previous support for Live Tail in the Lambda console. Watch logs in real-time as your functions execute, making debugging and troubleshooting more efficient than ever.

You can now enable Application Performance Monitoring (APM) for Java and .NET runtimes using Amazon CloudWatch Application Signals.

Amazon CloudWatch Application Signals for Java and .NET AWS Lambda runtimes

Amazon CloudWatch Application Signals for Java and .NET AWS Lambda runtimes

This provides deep visibility into your function’s performance, including method-level tracing, memory profiling, and automated anomaly detection.

Amazon Bedrock features

Multi-agent collaboration is now available in Bedrock as a preview, enabling you to create systems where multiple AI agents work together to solve complex problems. Agents can specialize in different domains, share context, and coordinate their actions to achieve goals that would be difficult for a single agent.

RAG evaluation is now generally available. This provides metrics to assess and improve your retrieval augmented generation pipelines. GraphRAG for Bedrock Knowledge Bases is now generally available, allowing you to enhance retrievals with graph-based context.

Amazon Bedrock Flows now supports multi-turn conversations, allowing you to build dynamic AI applications that maintain context across multiple user interactions. Bedrock data automation is now generally available, streamlining the process of preparing, ingesting, and maintaining data for your GenAI applications. Bedrock now offers LLM-as-a-judge capability for model evaluation, providing automated assessment of model outputs without requiring human reviewers. Compare different models or prompt strategies against your specific criteria at scale.

Bedrock’s capabilities are now integrated into the Amazon SageMaker Unified Studio, creating a seamless experience for machine learning practitioners who want to incorporate foundation models into their workflows. Access Bedrock models, fine-tuning, and evaluation directly from SageMaker.

Amazon Nova is a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry leading price-performance. Nova has expanded its tool use and converse API capabilities, making it easier for developers to build AI assistants that can use external tools to complete tasks.

Amazon Bedrock Guardrails image content filters are now generally available. Define and enforce boundaries for your AI applications with controls for both text and image content, ensuring outputs align with your organization’s policies.

Bedrock Knowledge Bases now supports using your existing OpenSearch clusters as the vector storage backend. This integration allows you to leverage your investments in OpenSearch while benefiting from the managed RAG capabilities of Bedrock.

New Amazon Bedrock models

  • Anthropic’s Claude 3.7 Sonnet hybrid reasoning allows you to toggle between standard and extended thinking modes. In standard mode, it functions as an upgraded version of Claude 3.5 Sonnet. While in extended thinking mode, it employs self-reflection to achieve improved results across a wide range of tasks.
  • DeepSeek R1, an advanced model specialized in research and scientific reasoning excels at complex problem-solving tasks and technical content generation.
  • Cohere Embed 3 models are now available in both multilingual and English-specific versions. These embedding models support text and images, providing more accurate representation for multimodal content and improving retrieval augmented generation (RAG) applications.
  • Ray2, Luma AI’s new visual AI model is capable of creating realistic visuals with fluid, natural movement. You can use it for image understanding, 3D scene reconstruction, and visual content generation, opening new possibilities for immersive and visual applications.
  • Bedrock now supports fine-tuning of Meta’s latest Llama 3.2 models. These upgraded models deliver improved performance across reasoning, coding, and multilingual tasks while being more efficient with computational resources.

Amazon Q Developer

Amazon Q Developer is now available as a CLI agent, bringing AI-assisted development to the command line. Get contextual recommendations, generate shell commands, and solve coding problems without leaving your terminal.

Amazon Q CLI

Amazon Q CLI

Amazon Q Developer transformation now supports upgrading Java applications using Maven to Java 21. It offers enhanced code suggestions, refactoring, and optimization recommendations for applications using the latest Java features, like virtual threads and pattern matching.

AWS AppSync

AWS AppSync Events now supports events publishing for WebSocket APIs, enabling real-time publish-subscribe functionality. This feature makes it easier to build applications requiring instant updates, like chat applications, collaborative tools, and real-time dashboards.

AWS AppSync Events

AWS AppSync Events

There are new AWS Cloud Development Kit (AWS CDK) L2 constructs for AppSync WebSocket APIs. These make it simpler to define and deploy real-time APIs using infrastructure as code. These high-level constructs handle the details of WebSocket connections, authorization, and messaging patterns.

Amazon SNS

Amazon SNS now supports high throughput mode for SNS FIFO topics, with default throughput matching SNS standard topics. When you enable high-throughput mode, SNS FIFO topics will maintain order within message group, while reducing the de-duplication scope to the message-group level.

Amazon EventBridge

Amazon EventBridge now supports direct delivery to targets across AWS accounts, simplifying multi-account architectures. This reduces latency and improves reliability when routing events between accounts in your organization.

Amazon EventBridge cross account

Amazon EventBridge cross account

The EventBridge console now features event source discovery, making it easier to find and visualize available event sources in your AWS environment. This tool helps you identify potential event producers and understand the event schemas they emit.

AWS Amplify

AWS Amplify now offers a TypeScript data client optimized for server-side Lambda functions, providing type-safe access to your data sources. This client reduces code complexity and improves reliability when working with databases and APIs in server environments.

Serverless compute blog posts

January

February

March

Serverless Office Hours weekly livestream

February

March

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Developer Advocacy team members who work on Serverless to see the latest news, follow conversations, and interact with the team.

And finally, visit the Serverless Land  for all your serverless needs.

Effortlessly execute AWS CLI commands using natural language with Amazon Q Developer

Post Syndicated from Xipu Li original https://aws.amazon.com/blogs/devops/effortlessly-execute-aws-cli-commands-using-natural-language-with-amazon-q-developer/

The CLI struggle is real

Command-line tools are meant to simplify infrastructure and DevOps workflows, but the reality is often the opposite. Instead of speeding things up, the vast array of commands, flags, and syntax turns the CLI into a puzzle. Tools meant to enhance productivity have developers endlessly tab-switching between searches, forums, and docs just to find basic commands.

Much like learning a new language, mastering the CLI involves navigating intricate and unfamiliar syntax. In the real world, we have translators to help bridge communication gaps—and now, there’s Amazon Q Developer to do the same for your CLI.

In this post, we’ll guide you through setting up Amazon Q Developer on your command line and demonstrate how it simplifies complex tasks.

Getting started

Make sure you have AWS CLI installed and configured to have a smooth experience as you go through this post.

To use Amazon Q Developer in your CLI, start by installing Amazon Q. Currently, Amazon Q’s CLI capability is only available for macOS and Linux users, so make sure you are on the right platform.

Using Amazon Q Developer requires you to have an AWS Builder ID or be part of an organization with an AWS IAM Identity Center instance set up.

We will be interacting with Amazon S3 and Amazon CloudFront, check that your AWS account has the necessary permissions:

s3:CreateBucket 
s3:PutObject 
s3:PutBucketPolicy 
cloudfront:CreateDistribution 
cloudfront:CreateOriginAccessControl 
cloudfront:GetDistributionConfig 
cloudfront:UpdateDistribution

Once you have these prerequisites in place, you’re ready to see how Amazon Q Developer can simplify your CLI workflow. Let’s get started.

Transforming your CLI experience

Open a new terminal window, and type q translate, followed by a natural language prompt to execute a complex command. Amazon Q Developer will take your input and process it through a Large Language Model (LLM), generating the corresponding bash command.

Let’s start with something simple and fun: count down from 10 second by second in your terminal:

q translate “Count down from 10 second by second”

When you execute the command, you’ll see Amazon Q suggest an executable shell command with a few action items:

Screenshot of terminal executing the command

  • Execute command: executes the proposed command immediately
  • Edit command: allows you to modify the proposed command
  • Regenerate answer: proposes another command
  • Ask another question: allows you to enter a new prompt
  • Cancel: exits the session

For now, we can just execute the command and see it in action:

Building a website with AWS from the CLI

Now that we’ve covered the basics, let’s get our hands dirty with something practical. In this tutorial, we’ll walk through creating a static website hosted on AWS, entirely using the command line.

Step 1: Create an S3 Bucket to Host the Website

Amazon S3 is great for static website hosting. First, we need to create an S3 bucket to store our website files. We can simply instruct Amazon Q Developer to “create an S3 bucket called ‘ai-generated-bucket’ on us-west-2”:

Screenshot of terminal executing the command

Note: that if you already have a bucket with the same name, feel free to change it to something else.

We can see that Amazon Q Developer has correctly generated the command:

 aws s3 mb s3://ai-generated-bucket —region us-west-2

If you see something that’s not quite right, you can try to “Edit command” or “Regenerate answer”. For now, we can just execute that generated command and verify in the AWS S3 console that the S3 bucket has indeed been created.

Step 2: Create an HTML File with “Hello World”

Now, we need to upload a basic HTML file to the bucket. Instead of manually writing it, we can let Amazon Q Developer guide us. Prompt:

q translate "Create a simple index.html file with 'Hello World' message"

Screenshot of terminal executing the command

Amazon Q Developer is suggesting echo '<h1>Hello World</h1>' > index.html, which looks good. We can go ahead execute that command.

Step 3: Upload the HTML File to S3

Now, we’ll upload the index.html file to the S3 bucket:

q translate "Upload index.html to the 'ai-generated-bucket' S3 bucket"

Screenshot of terminal executing the command

After executing the command, we can check on the Amazon S3 console to see we have successfully uploaded the index.html file to our bucket.

Screenshot of Amazon S3 console

Step 4: Set up CloudFront distribution

By default, Amazon S3 blocks public access to your account and buckets. We recommend keeping Block Public Access enabled for production use cases and securely serve your site through Amazon CloudFront, a global Content Delivery Network (CDN) that securely delivers your content with low latency and high transfer speeds, making it ideal for hosting static websites.

Use Amazon Q Developer to create a CloudFront distribution for your S3 bucket:

q translate "Create a CloudFront distribution for my S3 bucket 'ai-generated-bucket'"

Screenshot of terminal executing the command
Hmm, that doesn’t look right. Amazon Q Developer is trying to supply —distribution-config with an unspecified configuration file. To fix this, you can either manually provide the CloudFront distribution configuration or we can push the limit of what Amazon Q Developer can do.

Let’s try to refine our prompt so that we don’t use the --distribution-config flag:

q translate "Create a CloudFront distribution for my S3 bucket 'ai-generated-bucket' without using the ‘—distribution-config' flag"

Screenshot of terminal executing the command

Much better! Looks like we need to replace the ‘X’s with an actual origin domain name. In our case, the S3 bucket origin uses the following format:

bucket-name.s3.amazonaws.com

Navigate to and toggle the Edit command option on your CLI, and replace the ‘X’s. In our case, the replacement value would be ai-generated-bucket.s3.amazonaws.com. Execute the command, and upon success, you should be able to see the CloudFront configuration printed on the console:

Screenshot of terminal printing out the CloudFront configuration

Copy the “Id” from the output, we will need it for the next step.

Step 5: Preparing for public access

We want to set index.html as our default root object for the CloudFront distribution we just created, this will ensure that when users access your website’s root URL, CloudFront automatically serves the index.html file without requiring the user to explicitly specify it in the URL.

q translate "Update my CloudFront distribution CLOUDFRONT_ID to have 'index.html' as default root object without using the --distribution-config flag"

Screenshot of terminal executing the command
Don’t forget to replace CLOUDFRONT_ID with the Id you retrieved from the previous step. Once you execute that command, you should see the CloudFront configuration printed on the console like the step above.

Now, to securely serve our HTML file to the public, we need to perform the following:

  1. set up an Origin Access Control (OAC)
  2. attach that OAC to our CloudFront distribution
  3. update our S3 bucket policy

Let’s have Amazon Q generate an OAC for us:

q translate "Create an OAC named 'ai-generated-oac' for my CloudFront distribution CLOUDFRONT_ID"

Screenshot of terminal executing the command
When we execute the command, we get an error saying:

An error occurred (MalformedXML) when calling the CreateOriginAccessControl operation: 1 validation error detected: Value 'origin-access-identity' at 'originAccessControlConfig.originAccessControlOriginType' failed to satisfy constraint: Member must satisfy enum value set: [s3, lambda, mediastore, mediapackagev2]

It might be difficult to debug this error message without consulting the internet. Luckily, Amazon Q Developer offers another powerful capability, q chat, which is like a chatbot in your terminal. Let’s try it out to troubleshoot:

q chat "@history An error occurred (MalformedXML) when calling the CreateOriginAccessControl operation: 1 validation error detected: Value 'origin-access-identity' at 'originAccessControlConfig.originAccessControlOriginType' failed to satisfy constraint: Member must satisfy enum value set: [s3, lambda, mediastore, mediapackagev2]. Help me resolve this error"

Note that I included @history in the prompt to pass our shell history to Amazon Q so it can respond based on the context provided.

Screenshot of terminal executing the command

We get a detailed AI generated response on our terminal with step by step instructions and citations! By using the @history tag, Amazon Q Developer knows to relate the error message to the correct shell command we executed (we can see that it knows we want to create an OAC with name ai-generated-oac without us explicitly telling it so).

Let’s try executing the revised bash command:

aws cloudfront create-origin-access-control \
    --origin-access-control-config \
    Name=ai-generated-oac,\
    OriginAccessControlOriginType=s3,\
    SigningBehavior=always,\
    SigningProtocol=sigv4
This time, it succeeded:
Screenshot of terminal printing the OAC response

Take a note of the Id field, we will need it later.

We can now ask q chat about how to attach the generated OAC to our CloudFront distribution:

q chat "@history Update my CloudFront distribution CLOUDFRONT_ID with the OAC we just generated, be specific"

Screenshot of terminal executing the command

Notice how @history even helps Amazon Q fill in CLOUDFRONT_ID for me!

We just need to follow the instruction and once we completed the last step, we should see an updated CloudFront distribution in our console output:

Screenshot of terminal printing the updated CloudFront distribution
It shows a “Status” of “InProgress”. To proceed to the next step, we will need to wait for CloudFront to finish deployment. Let’s ask Amazon Q Developer how to get the latest status:

q translate "Get the status of my CloudFront distribution with CLOUDFRONT_ID on us-west-2"

Screenshot of terminal executing the command

Replace $CLOUDFRONT_ID with your own and execute the command to get the latest status. It might take a few minutes, it’s time to grab a coffee or go for a short walk!

Once the status is Deployed, we can go ahead and update our S3 bucket policy to give public read access to our CloudFront distribution:

Screenshot of terminal executing the command

You can see we are using a very vague prompt but @history made sure that Amazon Q has enough context to generate helpful responses.

Simply follow the instruction and we’re ready to access the site!

Step 6: Access the site 🎉

Let’s get the public URL of our CloudFront distribution.

Screenshot of terminal executing the command

Once you get the URL, you can open it in a browser, and it should show “Hello World”!

Screenshot of web page displaying “Hello World”

Conclusion

In this post, we’ve successfully:

  1. Created an S3 bucket to host our website files
  2. Generated a simple HTML file with a “Hello World” message
  3. Uploaded the HTML file to our S3 bucket
  4. Created a CloudFront distribution to serve our website securely
  5. Set up an Origin Access Control (OAC) for our CloudFront distribution
  6. Updated our S3 bucket policy to allow access from CloudFront
  7. Accessed our website through the CloudFront URL

All of these steps were accomplished on our terminal using Amazon Q Developer on the CLI.

You’ve seen how effortlessly complex CLI commands can be translated into natural language, making AWS more accessible than ever. But this is just the start. As your needs grow, Amazon Q scales with you, delivering faster deployments, enhanced productivity, and seamless access to the full AWS ecosystem.

We encourage you to explore AWS services you haven’t yet tried, with Amazon Q Developer simplifying the process every step of the way. Let it streamline your workflow and unlock new opportunities to innovate and grow.