Tag Archives: Amazon CodeGuru

New- Amazon DevOps Guru Helps Identify Application Errors and Fixes

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/amazon-devops-guru-machine-learning-powered-service-identifies-application-errors-and-fixes/

Today, we are announcing Amazon DevOps Guru, a fully managed operations service that makes it easy for developers and operators to improve application availability by automatically detecting operational issues and recommending fixes. DevOps Guru applies machine learning informed by years of operational excellence from Amazon.com and Amazon Web Services (AWS) to automatically collect and analyze data such as application metrics, logs, and events to identify behavior that deviates from normal operational patterns.

Once a behavior is identified as an operational problem or risk, DevOps Guru alerts developers and operators to the details of the problem so they can quickly understand the scope of the problem and possible causes. DevOps Guru provides intelligent recommendations for fixing problems, saving you time resolving them. With DevOps Guru, there is no hardware or software to deploy, and you only pay for the data analyzed; there is no upfront cost or commitment.

Distributed/Complex Architecture and Operational Excellence
As applications become more distributed and complex, operators need more automated practices to maintain application availability and reduce the time and effort spent on detecting, debugging, and resolving operational issues. Application downtime, for example, as caused by misconfiguration, unbalanced container clusters, or resource depletion, can result in significant revenue loss to an enterprise.

In many cases, companies must invest developer time in deploying and managing multiple monitoring tools, such as metrics, logs, traces, and events, and storing them in various locations for analysis. Developers or operators also spend time developing and maintaining custom alarms to alert them to issues such as sudden spikes in load balancer errors or unusual drops in application request rates. When a problem occurs, operators receive multiple alerts related to the same issue and spend time combining alerts to prioritize those that need immediate attention.

How DevOps Guru Works
The DevOps Guru machine learning models leverages AWS expertise in running highly available applications for the world’s largest e-commerce business for the past 20 years. DevOps Guru automatically detects operational problems, details the possible causes, and recommends remediation actions. DevOps Guru provides customers with a single console experience to search and visualize operational data by integrating data across multiple sources supporting Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS CloudFormation, and AWS X-Ray and reduces the need to use multiple tools.

Getting Started with DevOps Guru
Activating DevOps Guru is as easy as accessing the AWS Management Console and clicking Enable. When enabling DevOps Guru, you can select the IAM role. You’ll then choose the AWS resources to analyze, which may include all resources in your AWS account or just specified CloudFormation StackSets. Finally, you can set an Amazon SNS topic if you want to send notifications from DevOps Guru via SNS.

DevOps Guru starts to accumulate logs and analyze your environment; it can take up to several hours. Let’s assume we have a simple serverless architecture as shown in this illustration.

When the system has an error, the operator needs to investigate if the error came from Amazon API Gateway, AWS Lambda, or AWS DynamoDB. They must then determine the root cause and how to fix the issue. With DevOps Guru, the process is now easy and simple.

When a developer accesses the management console of DevOps Guru, they will see a list of insights which is a collection of anomalies that are created during the analysis of the AWS resources configured within your application. In this case, Amazon API Gateway, AWS Lambda, and Amazon DynamoDB. Each insight contains observations, recommendations, and contextual data you can use to better understand and resolve the operational problem.

The list below shows the insight name, the status (closed or ongoing), severity, and when the insight was created. Without checking any logs, you can immediately see that in the most recent issue (line1), a problem with a Lambda function within your stack was the cause of the issue, and it was related to duration. If the issue was still occurring, the status would be listed as Ongoing. Since this issue was temporary, the status is showing Closed.

Insights

Let’s look deeper at the most recent anomaly by clicking through the first insight link. There are two tabs: Aggregated metrics and Graphed anomalies.

Aggregated metrics display metrics that are related to the insight. Operators can see which AWS CloudFormation stack created the resource that emitted the metric, the name of the resource, and its type. The red lines on a timeline indicate spans of time when a metric emitted unusual values. In this case, the operator can see the specific time of day on Nov 24 when the anomaly occurred for each metric.

Graphed anomalies display detailed graphs for each of the insight’s anomalies. Operators can investigate and look at an anomaly at the resource level and per statistic. The graphs are grouped by metric name.

metrics

By reviewing aggregated and graphed anomalies, an operator can see when the issue occurred, whether it is still ongoing, as well as the resources impacted. It appears the increased Lambda duration had a corresponding impact on API Gateway causing timeouts and resulted in 5XX errors in API Gateway.

Dev Ops Guru also provides Relevant events which are related to activities that changed your application’s configuration as illustrated below.

Events

We can now see that a configuration change happened 2 hours before this issue occurred. If we click the point on the graph at 20:30 on 11/24, we can learn more and see the details of that change.

If you click through to the Ops event, the AWS CloudTrail logs would show that the configuration change was twofold: 1) a change in the concurrency provisioned capacity on a Lambda function and 2) the reduction in the integration timeout on an API integration latency.

recommendations to fix

The recommendations tell the operator to evaluate the provisioned concurrency for Lambda and how to troubleshoot errors in API Gateway. After further evaluation, the operator will discover this is exactly correct. The root cause is a mismatch between the Lambda provisioned concurrency setting and the API Gateway integration latency timeout. When the Lambda configuration was updated in the last deployment, it altered how this application responded to burst traffic, and it no longer fit within the API Gateway timeout window. This error is unlikely to have been found in unit testing and will occur repeatedly if the configurations are not updated.

DevOps Guru can send alerts of anomalies to operators via Amazon SNS, and it is integrated with AWS Systems Manager OpsCenter, enabling customers to receive insights directly within OpsCenter as quickly diagnose and remediate issues.

Available for Preview Today
Amazon DevOps Guru is available for preview in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). To learn more about DevOps Guru, please visit our web site and technical documentation, and get started today.

– Kame

 

 

Improving customer experience and reducing cost with CodeGuru Profiler

Post Syndicated from Rajesh original https://aws.amazon.com/blogs/devops/improving-customer-experience-and-reducing-cost-with-codeguru-profiler/

Amazon CodeGuru is a set of developer tools powered by machine learning that provides intelligent recommendations for improving code quality and identifying an application’s most expensive lines of code. Amazon CodeGuru Profiler allows you to profile your applications in a low impact, always on manner. It helps you improve your application’s performance, reduce cost and diagnose application issues through rich data visualization and proactive recommendations. CodeGuru Profiler has been a very successful and widely used service within Amazon, before it was offered as a public service. This post discusses a few ways in which internal Amazon teams have used and benefited from continuous profiling of their production applications. These uses cases can provide you with better insights on how to reap similar benefits for your applications using CodeGuru Profiler.

Inside Amazon, over 100,000 applications currently use CodeGuru Profiler across various environments globally. Over the last few years, CodeGuru Profiler has served as an indispensable tool for resolving issues in the following three categories:

  1. Performance bottlenecks, high latency and CPU utilization
  2. Cost and Infrastructure utilization
  3. Diagnosis of an application impacting event

API latency improvement for CodeGuru Profiler

What could be a better example than CodeGuru Profiler using itself to improve its own performance?
CodeGuru Profiler offers an API called BatchGetFrameMetricData, which allows you to fetch time series data for a set of frames or methods. We noticed that the 99th percentile latency (i.e. the slowest 1 percent of requests over a 5 minute period) metric for this API was approximately 5 seconds, higher than what we wanted for our customers.

Solution

CodeGuru Profiler is built on a micro service architecture, with the BatchGetFrameMetricData API implemented as set of AWS Lambda functions. It also leverages other AWS services such as Amazon DynamoDB to store data and Amazon CloudWatch to record performance metrics.

When investigating the latency issue, the team found that the 5-second latency spikes were happening during certain time intervals rather than continuously, which made it difficult to easily reproduce and determine the root cause of the issue in pre-production environment. The new Lambda profiling feature in CodeGuru came in handy, and so the team decided to enable profiling for all its Lambda functions. The low impact, continuous profiling capability of CodeGuru Profiler allowed the team to capture comprehensive profiles over a period of time, including when the latency spikes occurred, enabling the team to better understand the issue.
After capturing the profiles, the team went through the flame graphs of one of the Lambda functions (TimeSeriesMetricsGeneratorLambda) and learned that all of its CPU time was spent by the thread responsible to publish metrics to CloudWatch. The following screenshot shows a flame graph during one of these spikes.

TimeSeriesMetricsGeneratorLambda taking 100% CPU

As seen, there is a single call stack visible in the above flame graph, indicating all the CPU time was taken by the thread invoking above code. This helped the team immediately understand what was happening. Above code was related to the thread responsible for publishing the CloudWatch metrics. This thread was publishing these metrics in a synchronized block and as this thread took most of the CPU, it caused all other threads to wait and the latency to spike. To fix the issue, the team simply changed the TimeSeriesMetricsGeneratorLambda Lambda code, to publish CloudWatch metrics at the end of the function, which eliminated contention of this thread with all other threads.

Improvement

After the fix was deployed, the 5 second latency spikes were gone, as seen in the following graph.

Latency reduction for BatchGetFrameMetricData API

Cost, infrastructure and other improvements for CAGE

CAGE is an internal Amazon retail service that does royalty aggregation for digital products, such as Kindle eBooks, MP3 songs and albums and more. Like many other Amazon services, CAGE is also customer of CodeGuru Profiler.

CAGE was experiencing latency delays and growing infrastructure cost, and wanted to reduce them. Thanks to CodeGuru Profiler’s always-on profiling capabilities, rich visualization and recommendations, the team was able to successfully diagnose the issues, determine the root cause and fix them.

Solution

With the help of CodeGuru Profiler, the CAGE team identified several reasons for their degraded service performance and increased hardware utilization:

  • Excessive garbage collection activity – The team reviewed the service flame graphs (see the following screenshot) and identified that a lot of CPU time was spent getting garbage collection activities, 65.07% of the total service CPU.

Excessive garbage collection activities for CAGE

  • Metadata overhead – The team followed CodeGuru Profiler recommendation to identify that the service’s DynamoDB responses were consuming higher CPU, 2.86% of total CPU time. This was due to the response metadata caching in the AWS SDK v1.x HTTP client that was turned on by default. This was causing higher CPU overhead for high throughput applications such as CAGE. The following screenshot shows the relevant recommendation.

Response metadata recommendation for CAGE

  • Excessive logging – The team also identified excessive logging of its internal Amazon ION structures. The team initially added this logging for debugging purposes, but was unaware of its impact on the CPU cost, taking 2.28% of the overall service CPU. The following screenshot is part of the flame graph that helped identify the logging impact.

Excessive logging in CAGE service

The team used these flame graphs and CodeGuru Profiler provided recommendations to determine the root cause of the issues and systematically resolve them by doing the following:

  • Switching to a more efficient garbage collector
  • Removing excessive logging
  • Disabling metadata caching for Dynamo DB response

Improvements

After making these changes, the team was able to reduce their infrastructure cost by 25%, saving close to $2600 per month. Service latency also improved, with a reduction in service’s 99th percentile latency from approximately 2,500 milliseconds to 250 milliseconds in their North America (NA) region as shown below.

CAGE Latency Reduction

The team also realized a side benefit of having reduced log verbosity and saw a reduction in log size by 55%.

Event Analysis of increased checkout latency for Amazon.com

During one of the high traffic times, Amazon retail customers experienced higher than normal latency on their checkout page. The issue was due to one of the downstream service’s API experiencing high latency and CPU utilization. While the team quickly mitigated the issue by increasing the service’s servers, the always-on CodeGuru Profiler came to the rescue to help diagnose and fix the issue permanently.

Solution

The team analyzed the flame graphs from CodeGuru Profiler at the time of the event and noticed excessive CPU consumption (69.47%) when logging exceptions using Log4j2. See the following screenshot taken from an earlier version of CodeGuru Profiler user interface.

Excessive CPU consumption when logging exceptions using Log4j2

With CodeGuru Profiler flame graph and other metrics, the team quickly confirmed that the issue was due to excessive exception logging using Log4j2. This downstream service had recently upgraded to Log4j2 version 2.8, in which exception logging could be expensive, due to the way Log4j2 handles class-loading of certain stack frames. Log4j 2.x versions enabled class loading by default, which was disabled in 1.x versions, causing the increased latency and CPU utilization. The team was not able to detect this issue in pre-production environment, as the impact was observable only in high traffic situations.

Improvement

After they understood the issue, the team successfully rolled out the fix, removing the unnecessary exception trace logging to fix the issue. Such performance issues and many others are proactively offered as CodeGuru Profiler recommendations, to ensure you can proactively learn about such issues with your applications and quickly resolve them.

Conclusion

I hope this post provided a glimpse into various ways CodeGuru Profiler can benefit your business and applications. To get started using CodeGuru Profiler, see Setting up CodeGuru Profiler.
For more information about CodeGuru Profiler, see the following:

Investigating performance issues with Amazon CodeGuru Profiler

Optimizing application performance with Amazon CodeGuru Profiler

Find Your Application’s Most Expensive Lines of Code and Improve Code Quality with Amazon CodeGuru

 

Learn why AWS is the best cloud to run Microsoft Windows Server and SQL Server workloads

Post Syndicated from Fred Wurden original https://aws.amazon.com/blogs/compute/learn-why-aws-is-the-best-cloud-to-run-microsoft-windows-server-and-sql-server-workloads/

Fred Wurden, General Manager, AWS Enterprise Engineering (Windows, VMware, RedHat, SAP, Benchmarking)

For companies that rely on Windows Server but find it daunting to move those workloads to the cloud, there is no easier way to run Windows in the cloud than AWS. Customers as diverse as Expedia, Pearson, Seven West Media, and RepricerExpress have chosen AWS over other cloud providers to unlock the Microsoft products they currently rely on, including Windows Server and SQL Server. The reasons are several: by embracing AWS, they’ve achieved cost savings through forthright pricing options and expanded breadth and depth of capabilities. In this blog, we break down these advantages to understand why AWS is the simplest, most popular and secure cloud to run your business-critical Windows Server and SQL Server workloads.

AWS lowers costs and increases choice with flexible pricing options

Customers expect accurate and transparent pricing so you can make the best decisions for your business. When assessing which cloud to run your Windows workloads, customers look at the total cost of ownership (TCO) of workloads.

Not only does AWS provide cost-effective ways to run Windows and SQL Server workloads, we also regularly lower prices to make it even more affordable. Since launching in 2006, AWS has reduced prices 85 times. In fact, we recently dropped pricing by and average of 25% for Amazon RDS for SQL Server Enterprise Edition database instances in the Multi-AZ configuration, for both On-Demand Instance and Reserved Instance types on the latest generation hardware.

The AWS pricing approach makes it simple to understand your costs, even as we actively help you pay AWS less now and in the future. For example, AWS Trusted Advisor provides real-time guidance to provision your resources more efficiently. This means that you spend less money with us. We do this because we know that if we aren’t creating more and more value for you each year, you’ll go elsewhere.

In addition, we have several other industry-leading initiatives to help lower customer costs, including AWS Compute Optimizer, Amazon CodeGuru, and AWS Windows Optimization and Licensing Assessments (AWS OLA). AWS Compute Optimizer recommends optimal AWS Compute resources for your workloads by using machine learning (ML) to analyze historical utilization metrics. Customers who use Compute Optimizer can save up to 25% on applications running on Amazon Elastic Compute Cloud (Amazon EC2). Machine learning also plays a key role in Amazon CodeGuru, which provides intelligent recommendations for improving code quality and identifying an application’s most expensive lines of code. Finally, AWS OLA helps customers to optimize licensing and infrastructure provisioning based on actual resource consumption (ARC) to offer cost-effective Windows deployment options.

Cloud pricing shouldn’t be complicated

Other cloud providers bury key pricing information when making comparisons to other vendors, thereby incorrectly touting pricing advantages. Often those online “pricing calculators” that purport to clarify pricing neglect to include hidden fees, complicating costs through licensing rules (e.g., you can run this workload “for free” if you pay us elsewhere for “Software Assurance”). At AWS, we believe such pricing and licensing tricks are contrary to the fundamental promise of transparent pricing for cloud computing.

By contrast, AWS makes it straightforward for you to run Windows Server applications where you want. With our End-of-Support Migration Program (EMP) for Windows Server, you can easily move your legacy Windows Server applications—without needing any code changes. The EMP technology decouples the applications from the underlying OS. This enables AWS Partners or AWS Professional Services to migrate critical applications from legacy Windows Server 2003, 2008, and 2008 R2 to newer, supported versions of Windows Server on AWS. This allows you to avoid extra charges for extended support that other cloud providers charge.

Other cloud providers also may limit your ability to Bring-Your-Own-License (BYOL) for SQL Server to your preferred cloud provider. Meanwhile, AWS improves the BYOL experience using EC2 Dedicated Hosts and AWS License Manager. With EC2 Dedicated Hosts, you can save costs by moving existing Windows Server and SQL Server licenses do not have Software Assurance to AWS. AWS License Manager simplifies how you manage your software licenses from software vendors such as Microsoft, SAP, Oracle, and IBM across AWS and on-premises environments. We also work hard to help our customers spend less.

How AWS helps customers save money on Windows Server and SQL Server workloads

The first way AWS helps customers save money is by delivering the most reliable global cloud infrastructure for your Windows workloads. Any downtime costs customers in terms of lost revenue, diminished customer goodwill, and reduced employee productivity.

With respect to pricing, AWS offers multiple pricing options to help our customers save. First, we offer AWS Savings Plans that provide you with a flexible pricing model to save up to 72 percent on your AWS compute usage. You can sign up for Savings Plans for a 1- or 3-year term. Our Savings Plans help you easily manage your plans by taking advantage of recommendations, performance reporting and budget alerts in AWS Cost Explorer, which is a unique benefit only AWS provides. Not only that, but we also offer Amazon EC2 Spot Instances that help you save up to 90 percent on your compute costs vs. On-Demand Instance pricing.

Customers don’t need to walk this migration path alone. In fact, AWS customers often make the most efficient use of cloud resources by working with assessment partners like Cloudamize, CloudChomp, or Migration Evaluator (formerly TSO Logic), which is now part of AWS. By running detailed assessments of their environments with Migration Evaluator before migration, customers can achieve an average of 36 percent savings using AWS over three years. So how do you get from an on-premises Windows deployment to the cloud? AWS makes it simple.

AWS has support programs and tools to help you migrate to the cloud

Though AWS Migration Acceleration Program (MAP) for Windows is a great way to reduce the cost of migrating Windows Server and SQL Server workloads, MAP is more than a cost savings tool. As part of MAP, AWS offers a number of resources to support and sustain your migration efforts. This includes an experienced APN Partner ecosystem to execute migrations, our AWS Professional Services team to provide best practices and prescriptive advice, and a training program to help IT professionals understand and carry out migrations successfully. We help you figure out which workloads to move first, then leverage the combined experience of our Professional Services and partner teams to guide you through the process. For customers who want to save even more (up to 72% in some cases) we are the leaders in helping customers transform legacy systems to modernized managed services.

Again, we are always available to help guide you in your Windows journey to the cloud. We guide you through our technologies like AWS Launch Wizard, which provides a guided way of sizing, configuring, and deploying AWS resources for Microsoft applications like Microsoft SQL Server Always On, or through our comprehensive ecosystem of tens of thousands of partners and third-party solutions, including many with deep expertise with Windows technologies.

Why run Windows Server and SQL Server anywhere else but AWS?

Not only does AWS offer significantly more services than any other cloud, with over 48 services without comparable equivalents on other clouds, but AWS also provides better ways to use Microsoft products than any other cloud. This includes Active Directory as a managed service and FSx for Windows File Server, the only fully managed file storage service for Windows. If you’re interested in learning more about how AWS improves the Windows experience, please visit this article on our Modernizing with AWS blog.

Bring your Windows Server and SQL Server workloads to AWS for the most secure, reliable, and performant cloud, providing you with the depth and breadth of capabilities at the lowest cost. To learn more, visit Windows on AWS. Contact us today to learn more on how we can help you move your Windows to AWS or innovate on open source solutions.

About the Author
Fred Wurden is the GM of Enterprise Engineering (Windows, VMware, Red Hat, SAP, benchmarking) working to make AWS the most customer-centric cloud platform on Earth. Prior to AWS, Fred worked at Microsoft for 17 years and held positions, including: EU/DOJ engineering compliance for Windows and Azure, interoperability principles and partner engagements, and open source engineering. He lives with his wife and a few four-legged friends since his kids are all in college now.