Many customers use Amazon CloudWatch dashboards to monitor applications and often ask how they can integrate Amazon DevOps Guru insights to get a unified monitoring dashboard. This blog post shows how to add DevOps Guru proactive and reactive insights to a CloudWatch dashboard by using Custom Widgets. Displaying related data from different sources side by side helps you correlate trends over time, spot issues more efficiently, and get a single-pane-of-glass view in the CloudWatch dashboard.
Amazon DevOps Guru is a machine learning (ML) powered service that helps developers and operators automatically detect anomalies and improve application availability. DevOps Guru’s anomaly detectors can proactively detect anomalous behavior even before it occurs, helping you address issues before they happen; detailed insights provide recommendations to mitigate that behavior.
An Amazon CloudWatch dashboard is a customizable home page in the CloudWatch console that monitors multiple resources in a single view. You can use CloudWatch dashboards to create customized views of the metrics and alarms for your AWS resources.
Solution overview
This post will help you create a Custom Widget for an Amazon CloudWatch dashboard that displays DevOps Guru insights. A custom widget is part of your CloudWatch dashboard that calls an AWS Lambda function containing your custom code. The Lambda function accepts custom parameters, generates your dataset or visualization, and then returns HTML to the CloudWatch dashboard. The CloudWatch dashboard displays this HTML as a widget. In this post, we provide sample code for a Lambda function that calls DevOps Guru APIs to retrieve the insights information and displays it as a widget in the CloudWatch dashboard. The architecture diagram of the solution is below.
Figure 1: Reference architecture diagram
Prerequisites and Assumptions
An AWS account. To sign up:
Create an AWS account. For instructions, see Sign Up For AWS.
DevOps Guru should be enabled in the account. For enabling DevOps Guru, see DevOps Guru Setup.
Follow this Workshop to deploy a sample application in your AWS Account which can help generate some DevOps Guru insights.
Solution Deployment
We provide two options to deploy the solution: using the AWS console and using AWS CloudFormation. The first section has instructions for deploying with the AWS console, followed by instructions for using CloudFormation. The key difference is that the console walkthrough creates one widget, while the CloudFormation template creates three widgets.
Using the AWS Console:
We will first create a Lambda function that will retrieve the DevOps Guru insights. We will then modify the default IAM role associated with the Lambda function to add DevOps Guru permissions. Finally we will create a CloudWatch dashboard and add a custom widget to display the DevOps Guru insights.
Navigate to the Lambda console after logging in to your AWS account and click on Create function.
Figure 2a: Create Lambda Function
Choose Author from Scratch and use the runtime Node.js 16.x. Leave the rest of the settings at default and create the function.
Figure 2b: Create Lambda Function
After a few seconds, the Lambda function will be created and you will see a code source box. Copy the code from the text box below and replace the code present in the code source editor, as shown in the screenshot below.
// SPDX-License-Identifier: MIT-0
// CloudWatch Custom Widget sample: displays count of Amazon DevOps Guru Insights
const aws = require('aws-sdk');

const DOCS = `## DevOps Guru Insights Count
Displays the total counts of Proactive and Reactive Insights in DevOps Guru.
`;

// Counts proactive insights in the given time range, following NextToken pagination.
async function getProactiveInsightsCount(DevOpsGuru, StartTime, EndTime) {
  let NextToken = null;
  let proactiveCount = 0;
  do {
    const args = {
      StatusFilter: { Any: { StartTimeRange: { FromTime: StartTime, ToTime: EndTime }, Type: 'PROACTIVE' } },
      NextToken: NextToken || undefined
    };
    const result = await DevOpsGuru.listInsights(args).promise();
    NextToken = result.NextToken;
    proactiveCount += result.ProactiveInsights.length;
  } while (NextToken);
  return proactiveCount;
}

// Counts reactive insights in the given time range, following NextToken pagination.
async function getReactiveInsightsCount(DevOpsGuru, StartTime, EndTime) {
  let NextToken = null;
  let reactiveCount = 0;
  do {
    const args = {
      StatusFilter: { Any: { StartTimeRange: { FromTime: StartTime, ToTime: EndTime }, Type: 'REACTIVE' } },
      NextToken: NextToken || undefined
    };
    const result = await DevOpsGuru.listInsights(args).promise();
    NextToken = result.NextToken;
    reactiveCount += result.ReactiveInsights.length;
  } while (NextToken);
  return reactiveCount;
}

// Renders the two counts as simple HTML for the custom widget.
function getHtmlOutput(proactiveCount, reactiveCount) {
  return `DevOps Guru Proactive Insights<br><font size="+10" color="#FF9900">${proactiveCount}</font>
<p>DevOps Guru Reactive Insights</p><font size="+10" color="#FF9900">${reactiveCount}</font>`;
}

exports.handler = async (event, context) => {
  // CloudWatch calls the function with { describe: true } to fetch the widget documentation.
  if (event.describe) {
    return DOCS;
  }
  // Use the dashboard's current (or zoomed) time range as the insight query window.
  const widgetContext = event.widgetContext;
  const timeRange = widgetContext.timeRange.zoom || widgetContext.timeRange;
  const StartTime = new Date(timeRange.start);
  const EndTime = new Date(timeRange.end);
  const region = event.region || process.env.AWS_REGION;
  const DevOpsGuru = new aws.DevOpsGuru({ region });
  const proactiveCount = await getProactiveInsightsCount(DevOpsGuru, StartTime, EndTime);
  const reactiveCount = await getReactiveInsightsCount(DevOpsGuru, StartTime, EndTime);
  return getHtmlOutput(proactiveCount, reactiveCount);
};
Figure 3: Lambda Function Source Code
Click on Deploy to save the function code
Since we used the default settings while creating the function, a default Execution role is created and associated with the function. We will need to modify the IAM role to grant DevOps Guru permissions to retrieve Proactive and Reactive insights.
Click on the Configuration tab and select Permissions from the left side option list. You can see the IAM execution role associated with the function as shown in figure 4.
Figure 4: Lambda function execution role
Click on the IAM role name to open the role in the IAM console. Click on Add Permissions and select Attach policies.
Figure 5: IAM Role Update
Search for DevOps and select the AmazonDevOpsGuruReadOnlyAccess policy. Click on Add permissions to update the IAM role.
Figure 6: IAM Role Policy Update
Now that we have created the Lambda function for our custom widget and assigned appropriate permissions, we can navigate to CloudWatch to create a Dashboard.
Navigate to CloudWatch and click on dashboards from the left side list. You can choose to create a new dashboard or add the widget in an existing dashboard.
We will choose to create a new dashboard.
Figure 7: Create New CloudWatch dashboard
Choose Custom Widget on the Add widget page.
Figure 8: Add widget
Click Next on the Custom widget page without choosing a sample.
Figure 9: Custom Widget Selection
Choose the region where DevOps Guru is enabled. Select the Lambda function that we created earlier. In the preview pane, click on Preview to view the DevOps Guru metrics. Once the preview is successful, create the widget.
Figure 10: Create Custom Widget
Congratulations, you have now successfully created a CloudWatch dashboard with a custom widget to get insights from DevOps Guru. The sample code that we provided can be customized to suit your needs.
Using AWS CloudFormation
You may skip this section if you have already created the resources using the AWS Console.
In this step we will show you how to deploy the solution using AWS CloudFormation. AWS CloudFormation lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code. Customers define an initial template and then revise it as their requirements change. For more information on CloudFormation stack creation refer to this blog post.
The following resources are created.
Three Lambda functions that will support CloudWatch Dashboard custom widgets
A CloudWatch dashboard with widgets to pull data from the Lambda Functions
To deploy the solution by using the CloudFormation template
You can use this downloadable template to set up the resources. To launch directly through the console, choose the Launch Stack button, which creates the stack in the us-east-1 AWS Region.
Choose Next to go to the Specify stack details page.
(Optional) On the Configure Stack Options page, enter any tags, and then choose Next.
On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.
It takes approximately 2-3 minutes for the provisioning to complete. After the stack status is CREATE_COMPLETE, proceed to validate the resources as listed below.
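If you prefer not to use the console, the same template can be deployed with the AWS SDK. The following is a minimal boto3 sketch; the TemplateURL is a placeholder for the downloadable template linked above.

import boto3

# Deploy the stack programmatically; the template URL is a placeholder for the
# downloadable template linked above.
cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(
    StackName="devopsguru-cloudwatch-dashboard",
    TemplateURL="https://example-bucket.s3.amazonaws.com/devopsguru-dashboard.yaml",  # placeholder
    Capabilities=["CAPABILITY_IAM"],  # the template creates IAM resources
)
# Block until the stack reaches CREATE_COMPLETE, then validate the resources below.
cfn.get_waiter("stack_create_complete").wait(StackName="devopsguru-cloudwatch-dashboard")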
Validate the resources
Now that the stack creation has completed successfully, you should validate the resources that were created.
On AWS Console, head to CloudWatch, under Dashboards – there will be a dashboard created with name <StackName-Region>.
On AWS Console, head to CloudWatch, under Log groups there will be three new log groups created with the names:
lambdaProactiveLogGroup
lambdaReactiveLogGroup
lambdaSummaryLogGroup
On AWS Console, head to Lambda, there will be three Lambda functions with the names:
lambdaFunctionDGProactive
lambdaFunctionDGReactive
lambdaFunctionDGSummary
On AWS Console, head to IAM, under Roles there will be a new role created with name “lambdaIAMRole”
To View Results/Outcome
With the appropriate time range set on the CloudWatch dashboard, you will be able to navigate through the insights that DevOps Guru has generated.
Figure 11: DevOps Guru Insights in CloudWatch Dashboard
Cleanup
For cost optimization, after you complete and test this solution, clean up the resources. You can delete them manually if you used the AWS Console or by deleting the AWS CloudFormation stack called devopsguru-cloudwatch-dashboard if you used AWS CloudFormation.
This blog post outlined how you can integrate DevOps Guru insights into a CloudWatch Dashboard. As a customer, you can start leveraging CloudWatch Custom Widgets to include DevOps Guru Insights in an existing Operational dashboard.
AWS Customers are now using Amazon DevOps Guru to monitor and improve application performance. You can start monitoring your applications by following the instructions in the product documentation. Head over to the Amazon DevOps Guru console to get started today.
To learn more about AIOps for Serverless using Amazon DevOps Guru check out this video.
Streaming alert evaluation scales much better than the traditional approach of polling time-series databases.
It allows us to overcome high dimensionality/cardinality limitations of the time-series database.
It opens doors to support more exciting use-cases.
Engineers want their alerting system to be realtime, reliable, and actionable. While actionability is subjective and may vary by use-case, reliability is non-negotiable. In other words, false positives are bad but false negatives are the absolute worst!
A few years ago, we were paged by our SRE team due to our Metrics Alerting System falling behind — critical application health alerts reached engineers 45 minutes late! As we investigated the alerting delay, we found that the number of configured alerts had recently increased dramatically, by 5 times! The alerting system queried Atlas, our time series database, on a cron for each configured alert query, and was seeing an elevated throttle rate and excessive retries with backoffs. This, in turn, increased the time between two consecutive checks for an alert, causing a global slowdown for all alerts. On further investigation, we discovered that one user had programmatically created tens of thousands of new alerts. This user represented a platform team at Netflix, and their goal was to build alerting automation for their users.
While we were able to put out the immediate fire by disabling the newly created alerts, this incident raised some critical concerns around the scalability of our alerting system. We also heard from other platform teams at Netflix who wanted to build similar automation for their users but who, given our state at the time, wouldn't have been able to do so without impacting Mean Time To Detect (MTTD) for everyone else. In fact, we were looking at an order of magnitude increase in the number of alert queries over just the next 6 months!
Since querying Atlas was the bottleneck, our first instinct was to scale it up to meet the increased alert query demand; however, we soon realized that would increase Atlas cost prohibitively. Atlas is an in-memory time-series database that ingests multiple billions of time-series per day and retains the last two weeks of data. It is already one of the largest services at Netflix both in size and cost. While Atlas is architected around compute & storage separation, and we could theoretically just scale the query layer to meet the increased query demand, every query, regardless of its type, has a data component that needs to be pushed down to the storage layer. To serve the increasing number of push down queries, the in-memory storage layer would need to scale up as well, and it became clear that this would push the already expensive storage costs far higher. Moreover, common database optimizations like caching recently queried data don’t really work for alerting queries because, generally speaking, the last received datapoint is required for correctness. Take for example, this alert query that checks if errors as a % of total RPS exceeds a threshold of 50% for 4 out of the last 5 minutes:
Say if the datapoint received for the last time interval leads to a positive evaluation for this query, relying on stale/cached data would either increase MTTD or result in the perception of a false negative, at least until the missing data is fetched and evaluated. It became clear to us that we needed to solve the scalability problem with a fundamentally different approach. Hence, we started down the path of alert evaluation via real-time streaming metrics.
High Level Architecture
The idea, at a high level, was to avoid the need to query the Atlas database almost entirely and transition most alert queries to streaming evaluation.
Alert queries are submitted either via our Alerting UI or by API clients, and are then saved to a custom config database that supports streaming config updates (full snapshot + update notifications). The Alerting Service receives these config updates and hashes every new or updated alert query to one of its nodes for evaluation by leveraging Edda Slots. The node responsible for evaluating a query starts by breaking it down into a set of “data expressions” and with them subscribes to an upstream “broker” service. Data expressions define what data needs to be sourced in order to evaluate a query. For the example query listed above, the data expressions are name,errors,:eq,:sum and name,rps,:eq,:sum. The broker service acts as a subscription manager that maps a data expression to a set of subscriptions. In addition, it also maintains a Query Index of all active data expressions, which is consulted to discern whether an incoming datapoint is of interest to an active subscriber. The internals here are outside the scope of this blog post.
Next, the Alerting service (via the atlas-eval library) maps the received data points for a data expression to the alert query that needs them. For alert queries that resolve to more than one data expression, we align the incoming data points for each one of those data expressions on the same time boundary before emitting the accumulated values to the final eval step. For the example above, the final eval step would be responsible for computing the ratio and maintaining the rolling-count, which is keeping track of the number of intervals in which the ratio crossed the threshold as shown below:
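As an illustration only (this is not the atlas-eval implementation), the final eval step for the example above could be sketched like this in Python, assuming one-minute intervals and that the two data expressions are labeled "errors" and "rps":

from collections import deque

THRESHOLD = 0.5      # errors as a fraction of total RPS
WINDOW = 5           # number of most recent 1-minute intervals to consider
MIN_BREACHES = 4     # intervals that must breach the threshold to fire the alert

class RollingCountEvaluator:
    def __init__(self):
        self.pending = {}                       # timestamp -> {"errors": value, "rps": value}
        self.breaches = deque(maxlen=WINDOW)    # True/False per evaluated interval

    def on_datapoint(self, timestamp, expression, value):
        # Accumulate datapoints per time boundary; evaluate the ratio only once both
        # data expressions have arrived for that interval.
        slot = self.pending.setdefault(timestamp, {})
        slot[expression] = value
        if "errors" in slot and "rps" in slot:
            ratio = slot["errors"] / slot["rps"] if slot["rps"] else 0.0
            self.breaches.append(ratio > THRESHOLD)
            del self.pending[timestamp]
            return sum(self.breaches) >= MIN_BREACHES   # True -> fire the alert
        return False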
The atlas-eval library supports streaming evaluation for most if not all Query, Data, Math and Stateful operators supported by Atlas today. Certain operators such as offset, integral, des are not supported on the streaming path.
OK, Results?
First and foremost, we have successfully alleviated our initial scalability problem with the polling based architecture. Today, we run 20X the number of queries we used to run a few years ago, with ease and at a fraction of what it would have cost to scale up the Atlas storage layer to serve the same volume. Multiple platform teams at Netflix programmatically generate and maintain alerts on behalf of their users without having to worry about impacting other users of the system. We are able to maintain strong SLAs around Mean Time To Detect (MTTD) regardless of the number of alerts being evaluated by the system.
Additionally, streaming evaluation allowed us to relax restrictions around high cardinality that our users were previously running into — alert queries that were rejected by Atlas Backend before due to cardinality constraints are now getting checked correctly on the streaming path. In addition, we are able to use Atlas Streaming to monitor and alert on some very high cardinality use-cases, such as metrics derived from free-form log data.
Finally, we switched Telltale, our holistic application health monitoring system, from polling a metrics cache to using realtime Atlas Streaming. The fundamental idea behind Telltale is to detect anomalies on SLI metrics (for example, latency, error rates, etc). When such anomalies are detected, Telltale is able to compute correlations with similar metrics emitted from either upstream or downstream services. In addition, it also computes correlations between SLI metrics and custom metrics like the log derived metrics mentioned above. This has proven to be valuable towards reducing Mean Time to Recover (MTTR). For example, we are able to now correlate increased error rates with increased rate of specific exceptions occurring in logs and even point to an exemplar stacktrace, as shown below:
Our logs pipeline fingerprints every log message and attaches a (very high cardinality) fingerprint tag to a log events counter that is then emitted to Atlas Streaming. Telltale consumes this metric in a streaming fashion to identify fingerprints that correlate with anomalies seen in SLI metrics. Once an anomaly is found, we query the logs backend with the fingerprint hash to obtain the exemplar stacktrace. What’s more, we are now able to identify correlated anomalies (and exceptions) occurring in services that may be N hops away from the affected service. A system like Telltale becomes more effective as more services are onboarded (and for that matter the full service graph), because otherwise it becomes difficult to root cause the problem, especially in a microservices-based architecture. A few years ago, as noted in this blog, only about a hundred services were using Telltale; thanks to Atlas Streaming we have now managed to onboard thousands of other services at Netflix.
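To make the fingerprinting idea concrete, here is a minimal Python sketch (not the actual Netflix pipeline) that derives a fingerprint from a log message and counts events per fingerprint; a real pipeline would emit this count as a metric tagged with the fingerprint:

import hashlib
import re
from collections import Counter

fingerprint_counts = Counter()

def fingerprint(message: str) -> str:
    # Replace obviously variable tokens (numbers, hex ids) so that messages produced
    # by the same log statement collapse to the same fingerprint.
    template = re.sub(r"0x[0-9a-fA-F]+|[0-9a-fA-F-]{16,}|\d+", "#", message)
    return hashlib.sha1(template.encode()).hexdigest()[:16]

def on_log_event(message: str) -> None:
    # In a real pipeline, this increment would be emitted as a (high cardinality)
    # fingerprint-tagged counter to the metrics backend.
    fingerprint_counts[fingerprint(message)] += 1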
Finally, we realized that once you remove limits on the number of monitored queries, and start supporting much higher metric dimensionality/cardinality without impacting the cost/performance profile of the system, it opens doors to many exciting new possibilities. For example, to make alerts more actionable, we may now be able to compute correlations between SLI anomalies and custom metrics with high cardinality dimensions, for example an alert on elevated HTTP error rates may be able to point to impacted customer cohorts, by linking to precisely correlated exemplars. This would help developers with reproducibility.
Transitioning to the streaming path has been a long journey for us. One of the challenges was difficulty in debugging scenarios where the streaming path didn’t agree with what is returned by querying the Atlas database. This is especially true when either the data is not available in Atlas or the query is not supported because of (say) cardinality constraints. This is one of the reasons it has taken us years to get here. That said, early signs indicate that the streaming paradigm may help with tackling a cardinal problem in observability — effective correlation between the metrics & events verticals (logs, and potentially traces in the future), and we are excited to explore the opportunities that this presents for Observability in general.
As organizations operate day-to-day, having insights into their cloud infrastructure state can be crucial for the durability and availability of their systems. Industry research estimates[1] that downtime costs small businesses around $427 per minute of downtime, and medium to large businesses an average of $9,000 per minute of downtime. Amazon DevOps Guru customers want to monitor and generate alerts using a single dashboard. This allows them to reduce context switching between applications, providing them an opportunity to respond to operational issues faster.
DevOps Guru can integrate with Amazon Managed Grafana to create and display operational insights. Alerts can be created and communicated for any critical events captured by DevOps Guru and notifications can be sent to operation teams to respond to these events. The key telemetry data types of logs and metrics are parsed and filtered to provide the necessary insights into observability.
Furthermore, Amazon Managed Grafana provides plug-ins to popular open-source databases, third-party ISV monitoring tools, and other cloud services. With Amazon Managed Grafana, you can easily visualize information from multiple AWS services, AWS accounts, and Regions in a single Grafana dashboard.
In this post, we will walk you through integrating the insights generated from DevOps Guru with Amazon Managed Grafana.
Solution Overview:
This architecture diagram shows the flow of the logs and metrics that will be utilized by Amazon Managed Grafana: DevOps Guru generates insights, Amazon EventBridge saves the insight event logs to an Amazon CloudWatch Logs log group, and Amazon Managed Grafana parses these logs along with the DevOps Guru service metrics to create new dashboards in Grafana.
Now we will walk you through how to do this and set up notifications to your operations team.
Prerequisites:
The following prerequisites are required for this walkthrough:
DevOps Guru enabled on your account, monitoring either a CloudFormation stack or tagged resources.
Using Amazon CloudWatch Metrics
DevOps Guru sends service metrics to CloudWatch Metrics. We will use these to track metrics for insights and metrics for your DevOps Guru usage; the DevOps Guru service reports the metrics to the AWS/DevOps-Guru namespace in CloudWatch by default.
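As a quick check that these metrics are present, a small boto3 sketch like the following lists whatever DevOps Guru is currently publishing in the AWS/DevOps-Guru namespace:

import boto3

cloudwatch = boto3.client("cloudwatch")
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="AWS/DevOps-Guru"):
    for metric in page["Metrics"]:
        # Print each DevOps Guru service metric and its dimensions.
        print(metric["MetricName"], metric.get("Dimensions", []))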
First, we will provision an Amazon Managed Grafana workspace and then create a Dashboard in the workspace that uses Amazon CloudWatch as a data source.
Setting up Amazon CloudWatch Metrics
Create a Grafana workspace: navigate to Amazon Managed Grafana from the AWS console, then click Create workspace.
For insights beyond the service metrics, such as the number of insights per service or the average for a Region or a specific AWS account, we will need to parse the event logs. These logs first need to be sent to Amazon CloudWatch Logs. We will go over the details on how to set this up and how to parse these logs in Amazon Managed Grafana using CloudWatch Logs Query Syntax. In this post, we will show a couple of examples. For more details, please check out this User Guide documentation. This is not done by default, and we will need to use Amazon EventBridge to pass these logs to CloudWatch.
DevOps Guru logs include other details that can be helpful when building dashboards, such as Region, insight severity (High, Medium, or Low), associated resources, and the DevOps Guru dashboard URL, among other things. For more information, please check out this User Guide documentation.
EventBridge offers a serverless event bus that helps you receive, filter, transform, route, and deliver events. It provides one to many messaging solutions to support decoupled architectures, and it is easy to integrate with AWS Services and 3rd-party tools. Using Amazon EventBridge with DevOps Guru provides a solution that is easy to extend to create a ticketing system through integrations with ServiceNow, Jira, and other tools. It also makes it easy to set up alert systems through integrations with PagerDuty, Slack, and more.
Setting up Amazon CloudWatch Logs
Let’s dive in to creating the EventBridge rule and enhance our Grafana dashboard:
a. First head to Amazon EventBridge in the AWS console.
b. Click Create rule.
Type in a rule Name and Description. You can leave the Event bus as default and Rule type as Rule with an event pattern.
c. Select AWS events or EventBridge partner events.
For Event pattern, change to Custom patterns (JSON editor) and use:
{"source": ["aws.devops-guru"]}
This filters for all events generated by DevOps Guru. You can use the same mechanism to filter for specific messages, such as new insights or closed insights, and route them to a different channel. For this demonstration, let's extract all events.
d. Next, for Target, select AWS service.
Then choose CloudWatch log group.
For the Log Group, give your group a name, such as “devops-guru”.
e. Click Create rule.
f. Navigate back to Amazon Managed Grafana. It's time to add a couple of additional panels to our dashboard. Click Add panel. Then select Amazon CloudWatch, change from Metrics to CloudWatch Logs, and select the log group we created previously.
g. For the query use the following to get the number of closed insights:
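The query is not shown in this excerpt; a CloudWatch Logs Insights query along the following lines, assuming the detail.messageType field shown in the next query, should return the count of closed insights:

filter detail.messageType="CLOSED_INSIGHT"
| stats count(*) as closedInsights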
Next, depending on the visualizations, you can work with the Logs and metrics data types to parse and filter the data.
i. For our fourth panel, we will add a direct link to the DevOps Guru dashboard in the AWS console.
Repeat the same process as demonstrated previously one more time with this query:
fields detail.messageType, detail.insightSeverity, detail.insightUrl
| filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"
Switch to table when prompted on the panel.
This will give us a direct link to the DevOps Guru dashboard and help us get to the insight details and Recommendations.
Save your dashboard.
You can extend observability by configuring alerts on dashboard panels that display metrics. Alerts are triggered when a condition is met, and they are communicated through the Amazon SNS notification mechanism. The following shows our SNS notification channel setup.
The previously created notification channel is then used to communicate alerts whenever a condition is met on the metrics being observed.
Cleanup
To avoid incurring future charges, delete the resources.
Navigate to EventBridge in AWS console and delete the rule created in step 4 (a-e) “devops-guru”.
Navigate to CloudWatch Logs in the AWS console and delete the log group created as a result of step 4 (a-e), named “devops-guru”.
Amazon Managed Grafana: Navigate to Amazon Managed Grafana service and delete the Grafana services you created in step 1.
Conclusion
In this post, we have demonstrated how to incorporate Amazon DevOps Guru insights into Amazon Managed Grafana and use Grafana as the observability tool. This allows operations teams to observe the state of their AWS resources and be notified through alarms when preset thresholds on DevOps Guru metrics and logs are crossed. You can expand on this to create other panels and dashboards specific to your needs. If you don't have DevOps Guru enabled, you can start monitoring your AWS applications with AWS DevOps Guru today using this link.
Effective security incident response depends on adequate logging, as described in the AWS Security Incident Response Guide. If you have the proper logs and the ability to query them, you can respond more rapidly and effectively to security events. If a security event occurs, you can use various log sources to validate what occurred and understand the scope. Then, you can use the results of your analysis to take remediation actions. To learn more about logging best practices, see Configure service and application logging and Analyze logs, findings, and metrics centrally.
In this blog post, we will show you how to achieve an effective strategy for logging for security incident response. We will share logging options across the typical cloud application stack, log analysis options, and sample queries. AWS offers managed services, such as Amazon GuardDuty for threat detection and Amazon Detective for incident analysis. If you want to collect additional logs or perform custom analysis, then you should consider the options described in this blog post.
Selection of logs
To select the appropriate logs for security incident response, you should start with the common cloud application stack, which consists of the components and layers of your application deployed on AWS. For each component, we will describe the logging sources that you have. For each log source, we will describe why you should log it for security incident response, how to enable the logs, and what your log storage options are.
To select the logs for security incident response, first consider the following questions:
What are your compliance and regulatory requirements for logging?
Note: Make sure that you comply with the log retention requirements of compliance standards relevant to your organization, as well as your organization’s incident response strategy.
What AWS services do you commonly use?
What AWS services have access to or contain sensitive data?
What threats are most relevant to you?
Note: Performing a threat model of your cloud architectures can help you answer this question. For more information, see How to approach threat modelling.
Considering these questions can help you develop requirements for logging that will guide your selection of the following log sources.
AWS account logs
An AWS account is the first, fundamental component of an application deployed on AWS. The account is a container for your AWS resources. You create and manage your AWS resources in this account, and the account provides administrative capabilities for access and billing.
AWS CloudTrail
Within an account, each action performed is an API call. From a console sign-in to the deployment of each resource in an AWS CloudFormation stack, events are generated to provide transparency on what has occurred in the account. With AWS CloudTrail, you can log, continuously monitor, and retain account activity related to actions across supported AWS services. CloudTrail provides the event history of your account activity, including actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. CloudTrail logs API calls as three types of events:
Management events (also known as control plane operations) show management operations that are performed on resources in your account. This includes actions like creating an Amazon Simple Storage Service (Amazon S3) bucket and setting up logging.
Data events (also known as data plane operations) show the resource operations performed on or within resources in your account. These operations are often high-volume activities, such as Amazon S3 object-level API activity (for example, GetObject, DeleteObject, and PutObject API operations) and AWS Lambda function invocation activity.
Insights events capture unusual API call rate or error rate activity in your account. You must enable these events on a trail in order to capture them, and they are logged to a different folder prefix in the destination S3 bucket for your trail. Insights events provide you with information such as the type of event, the incident time period, the associated API, the error code, and statistics to help you understand and respond effectively to unusual activity.
For security investigations, CloudTrail provides context on the creation, modification, and deletion of AWS resources. Therefore, CloudTrail is one of your most important log sources for security incident response in an AWS environment. You have the following primary ways to set up and work with CloudTrail:
CloudTrail Event history — CloudTrail is enabled by default with 90-day retention of management events that you can retrieve through the CloudTrail Event history facility using the console, AWS Command Line Interface (AWS CLI), or AWS SDKs. You don’t need to take any action to get started using the Event history feature.
CloudTrail trail — For longer retention and visibility of data events, you need to create a CloudTrail trail and associate it with an S3 bucket and optionally with an Amazon CloudWatch log group. If you use AWS Organizations, you can create an organization trail that will log events for each account in the organization. By default, trails are multi-Region, so you don’t need to enable CloudTrail logs in each AWS Region.
AWS CloudTrail Lake — You can create a CloudTrail lake, which retains CloudTrail logs for up to seven years and provides a SQL-based querying facility. You don’t need to have a trail configured in your account to use CloudTrail Lake.
Amazon Security Lake — You can use Security Lake to ingest CloudTrail events, which include management and data events. You can further analyze these events with Amazon QuickSight or another third-party security information and event management (SIEM) tool.
AWS Config
Creating and modifying resources is an integral part of your account use. Tracking resource configuration changes made by calling the AWS API helps you review changes throughout the resource lifecycle. AWS Config provides a detailed view of the configuration of AWS resources in your account, examines the resource configurations periodically, and tracks configuration changes that were not initiated by the API. This includes how the resources are related to one another and how they were configured in the past so that you can see how configurations and relationships change over time.
You should enable AWS Config in each Region where you have resources deployed, and you should configure an S3 bucket to receive configuration history and configuration snapshot files, which contain details on the resources that AWS Config records. You can then review configuration compliance and analyze activities performed before, during, and after an event using the configuration history in S3. You should centralize AWS Config resource tracking across multiple accounts in the same organization by setting up an aggregator. You can use AWS Control Tower to automate the setup.
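If you want to turn on AWS Config programmatically rather than through the console or AWS Control Tower, a boto3 sketch along these lines enables recording in the current Region; the role ARN and bucket name are placeholders:

import boto3

config = boto3.client("config")
# Record all supported resource types, including global resources such as IAM.
config.put_configuration_recorder(
    ConfigurationRecorder={
        "name": "default",
        "roleARN": "arn:aws:iam::111122223333:role/aws-config-role",  # placeholder
        "recordingGroup": {"allSupported": True, "includeGlobalResourceTypes": True},
    }
)
# Deliver configuration history and snapshot files to an S3 bucket you own.
config.put_delivery_channel(
    DeliveryChannel={"name": "default", "s3BucketName": "my-config-history-bucket"}  # placeholder
)
config.start_configuration_recorder(ConfigurationRecorderName="default")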
During a security investigation, you might want to understand how a resource configuration has changed over time. For example, you might want to investigate the changes to an S3 bucket policy before and after a security event that involves an S3 bucket. AWS Config provides a configuration history for resources that can help you track activities performed during a security event.
Operating system and application logs
To record interactions with applications, you must capture operating system (OS) and application logs, especially custom logs generated by the application development framework. OS and local application logs are relevant for security events that involve an Amazon Elastic Compute Cloud (Amazon EC2) instance. These instances could be standalone, in an auto scaling group behind a load balancer, or compute workloads for Amazon Elastic Container Service (Amazon ECS) or an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. OS logs track privileged use, processes, login events, access to directory services, and file system activity on a server. To analyze a potential compromise to an EC2 instance, you will want to review the security event logs for Windows OS and the system logs for Linux-based OS.
With the unified CloudWatch agent, you can collect metrics and logs from EC2 instances and on-premises servers. The CloudWatch agent aggregates log data into CloudWatch logs, which can then be exported to Amazon S3 for long-term retention and analyzed with a SIEM tool of your choice or Amazon Athena, as shown in Figure 1.
Figure 1: Aggregate OS and application logs using CloudWatch Logs
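For the long-term retention step shown in Figure 1, a boto3 sketch like the following exports a CloudWatch Logs log group to S3; the log group name and bucket are placeholders, and the destination bucket must have a policy that allows CloudWatch Logs to write to it:

import boto3
from datetime import datetime, timedelta, timezone

logs = boto3.client("logs")
now = datetime.now(timezone.utc)
# Export the last 30 days of OS/application logs to S3 for retention and later analysis.
logs.create_export_task(
    logGroupName="/my-app/os-and-application-logs",        # placeholder log group
    fromTime=int((now - timedelta(days=30)).timestamp() * 1000),
    to=int(now.timestamp() * 1000),
    destination="my-security-log-archive-bucket",          # placeholder S3 bucket
    destinationPrefix="cloudwatch-exports",
)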
Database logs
With SQL databases, you can log transactions to help track modifications to the databases, such as additions or deletions. After an engine or system failure, you will need transaction logs to restore a database to a consistent state. Transaction logs are designed to be secure, and they require additional processing to access valuable information. It’s important that you understand data interactions during a security investigation, especially if your databases hold personally identifiable information (PII), financial and payments information, or other information subject to regulatory controls.
Network logs
The goal of logging network activity is to gain insight into the communications that traverse your network. You might need this data for a variety of reasons, such as network troubleshooting or for use in a forensic investigation of suspected malware activity within your network.
In the AWS Cloud, you can log network activity by creating a proxy that logs network traffic or by using Traffic Mirroring to send a copy of network traffic to a logging server. You can adopt cloud-native approaches to capture this type of data using Amazon Route 53 DNS query logs and Amazon VPC Flow Logs.
There are also a variety of third-party networking solutions available like Palo Alto Networks and Fortinet, so you can continue to use the network logging mechanisms that you might have used in an on-premises environment.
Route 53 DNS query logs
You can configure Amazon Route 53 to log Domain Name System (DNS) queries. These logs are categorized into two groups:
Public DNS query logging
Resolver query logging
Logging public DNS queries against domains that you have hosted in Route 53 provides query information, such as the domain or subdomain requested, date and time stamp of the request, DNS record type, Route 53 edge location that responded, and response code.
Amazon VPC Flow Logs
You can configure VPC Flow Logs for a VPC in your account to capture traffic that enters and moves around your VPC network, without the addition of instances or products. From these logs, you can review information, such as source and destination IP, ports, timestamps, protocol, account ID, and whether the traffic was accepted or rejected. For a complete list of the fields available for flow log records, see Available fields. You can create a flow log for a VPC, a subnet, or a network interface. If you create a flow log for a subnet or VPC, IP traffic going to and from each network interface in that subnet or VPC will be logged. For more details on VPC Flow Logs, see Logging IP traffic using VPC Flow Logs.
You can forward flow logs to Amazon CloudWatch Logs to create CloudWatch alarms based on metric filters. You can also forward flow logs to an S3 bucket for long-term retention and further analysis. Figure 2 demonstrates these configurations.
Figure 2: Sending VPC Flow logs to CloudWatch Logs and S3
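To illustrate the configuration in Figure 2, a boto3 sketch like the following enables flow logs for a VPC and delivers them to S3; the VPC ID and bucket ARN are placeholders:

import boto3

ec2 = boto3.client("ec2")
# Capture accepted and rejected traffic for the whole VPC and deliver it to S3.
ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],                 # placeholder VPC ID
    TrafficType="ALL",                                     # ACCEPT, REJECT, or ALL
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-logs-bucket",     # placeholder bucket ARN
    MaxAggregationInterval=60,
)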
Access logs
To identify access patterns for accessible endpoints, especially public endpoints, you should use access logs. Access logs capture detailed information about requests sent to your load balancer. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. With services built in layers behind a load balancer, unless you track the X-Forwarded-For request header, the requestor’s context is lost. Access logs help bridge that gap during investigations and analysis.
Amazon S3 server access logs
Access logs are critical to track object level access when using S3 buckets to store confidential or sensitive data. You can also turn on CloudTrail to capture S3 data events. You can store access logs in S3 buckets for long-term storage for compliance purposes and to run analyses during and after an event.
Load balancing logs
Elastic Load Balancing provides access logs that capture detailed information about requests sent to load balancers. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. You can use this log to analyze traffic patterns and to troubleshoot issues.
Access logs are an optional feature of Elastic Load Balancing that is turned off by default. To enable access logs for load balancers, see Access logs for your Application Load Balancer.
If you implement your own reverse proxy for load balancing needs, make sure that you capture the reverse proxy access logs. You can use the unified CloudWatch agent to forward the logs to CloudWatch. As with OS logs, you can export CloudWatch logs to an S3 bucket for long-term retention and analysis.
If you use an Amazon CloudFront distribution as the public endpoint for end users with load balancers as the custom origin, then load balancing access logs will represent the CloudFront distribution as the requestor, rather than the actual end user. If this information doesn’t add value to your incident handling process, then you can use CloudFront access logs as the log source that provides end user request details.
CloudFront access logs
You should enable standard logs, also known as access logs, when using CloudFront. Specify an S3 bucket where you want CloudFront to save the files.
CloudFront access logs are delivered on a best-effort basis. For information about requests made to a distribution in real time, use real-time logs that are delivered within seconds of receiving the requests. You should use real-time logs to monitor, analyze, and take action based on content delivery performance. For more details on the fields available from these logs, see the CloudFront standard log file format.
AWS WAF logs
When associated with a supported resource like a CloudFront distribution, Amazon API Gateway REST API, Application Load Balancer, AWS AppSync GraphQL API, Amazon Cognito user pool, or AWS App Runner, AWS WAF can help you monitor HTTP and HTTPS requests that are forwarded to the resource. You should configure web access control lists (ACLs) to gain fine-grained control over the requests, and enable logging for such ACLs to get detailed information about traffic that is analyzed by AWS WAF. Log information includes time of the request being received by AWS WAF from the AWS resource, details about the request, and the AWS WAF rules that the request matched. You can use this log information to monitor access patterns of public endpoints and configure rules to inspect requests in detail. For more information about AWS WAF logging, see Logging web ACL traffic.
Serverless logs
Serverless computing has become increasingly popular in the cloud-computing space. It provides on-demand compute power in a relatively short burst, meaning that cloud-based instances don’t need to be provisioned and kept around, idle, when there are no tasks to be completed. Although more and more compute tasks are being moved to serverless solutions, the need to log has not changed, but how the logs are generated has. In a serverless environment, security investigations not only benefit from logs that demonstrate the interactions and changes made by the code deployed, but that also document changes to the deployed code itself and access permissions of the Lambda execution role that is granting privileged access.
AWS Lambda
The logging of Lambda functions involves two components: how the function itself is operating, and what is happening inside the function (what your code is actually doing).
The logging of a Lambda function itself occurs through data events captured by CloudTrail. As noted earlier in this post, you must configure data events on a trail created in CloudTrail. During configuration, you will need to specify the function from which logs will be captured by your trail, and the destination S3 bucket where they will be stored. These logs contain details on the invocation of the function and help identify the IAM principals that called the Invoke API for Lambda.
AWS Lambda automatically monitors Lambda functions on your behalf and sends logs to CloudWatch. Your Lambda function comes with a CloudWatch Logs log group and a log stream for each instance of your function. The Lambda runtime environment sends details about each invocation to the log stream, and relays logs and other output from your function’s code. For more details on how to monitor Lambda functions, see Accessing Amazon CloudWatch logs for AWS Lambda.
Log analysis
For incident response, you need to be able to analyze and query your logs to validate what occurred and to understand the scope.
To begin, you can aggregate logs from various sources in S3 buckets for long-term storage, and you can integrate that data with query tools for further investigation. Logs can be exported and either parsed through directly, or ingested by another tool to help with the analysis. The following are some options that you can use to query these logs:
Amazon Athena — You can directly query CloudTrail events stored in S3 with Athena using SQL commands, specifying the LOCATION of the log files. You would generally use this approach if you have advanced queries to run, and you don’t have a SIEM. To set up Athena to query logs, you can use this open-source solution from AWS.
Amazon OpenSearch Service — OpenSearch is a distributed search and log analytics suite. Because it’s open source, it can ingest logs from more than just AWS log sources. To set this up, you can use this open-source SIEM solution from AWS.
CloudTrail Event History — Either from the console, or programmatically, you can query CloudTrail management events from the last 90-day period. This is ideal for when you have simple queries to make within the last 90 days, and you don’t need stored logs or more complex queries.
AWS CloudTrail Lake — Either from the console, or programmatically, you can query stored events in your configured CloudTrail Lake from the time of its configuration, up until the maximum storage duration of 2,557 days (7 years) from the time that you make your query. This approach allows for SQL-based queries, and it is ideal for when you need to make more complex queries against events, but don’t require the additional features of a SIEM solution.
Parse through raw JSON using CLI — This is achieved programmatically and parsed through terminal commands. It’s more a legacy method of parsing through logs. You might choose to use this approach for analysis if another service or solution isn’t feasible (for example, if you can’t use the service due to your corporate security policy).
Third-party SIEM — A third-party SIEM might be ideal if you already have a SIEM solution on AWS or elsewhere, and you don’t need a duplicated solution elsewhere. Typically, SIEM solutions will import logs from an S3 bucket and process and index events for analysis. To learn more about SIEM options, see the SIEM solutions in the AWS Marketplace, or the AWS Security Competency Partners for a partner local to you with threat detection and incident response (TDIR) expertise.
Sample queries
In this section, we provide samples of SQL queries. Both Athena and CloudTrail Lake accept SQL queries, but the following samples have been tested for use in Athena only. This is because some samples are for VPC Flow Logs, which you can’t query from CloudTrail Lake. To query CloudTrail logs in Athena, you must first create a table definition that points to the location of your logs stored in S3. You can do this from the CloudTrail Events console by using a hyperlinked suggestion, or from the Athena console directly. Alternatively, for Athena, you can use the AWS Security Analytics Bootstrap.
For each of these queries, you might need to modify some of the fields, such as the time frame that you are investigating, the IAM entity involved, and the account and Region in scope. For example, you might want to modify the time frame based on the current time and when you believe the security event began. This often involves expanding the time frame after running additional queries and learning more about the scope and timeline.
By using partitions for tables, you can restrict the amount of data scanned by each Athena query, helping to improve performance and reduce cost. For example, you can partition your CloudTrail Athena table manually or by using partition projection. You can include the partition column (for example, the timestamp) in your queries to limit the amount of data scanned.
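If you prefer to run the following sample queries programmatically instead of in the Athena console, a boto3 sketch like this submits a query and prints the results; the database, table, and output location are placeholders:

import boto3
import time

athena = boto3.client("athena")
execution = athena.start_query_execution(
    QueryString="SELECT eventname, count(*) AS calls FROM cloudtrail GROUP BY eventname ORDER BY calls DESC LIMIT 20",
    QueryExecutionContext={"Database": "default"},                                  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results-bucket/"}, # placeholder bucket
)
query_id = execution["QueryExecutionId"]
# Poll until the query finishes, then print each result row.
while athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)
for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])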
Unauthorized attempts
When a security event occurs, you might want to review API calls that were attempted but failed due to the IAM principal not having access to perform the action on that resource. To discover this activity, run the following query (be sure to modify the time window first):
SELECT *
FROM cloudtrail
WHERE errorcode IN ('Client.UnauthorizedOperation','Client.InvalidPermission.NotFound','Client.OperationNotPermitted','AccessDenied')
AND useridentity.arn LIKE '%iam%'
AND eventtime >= '2023-01-01T00:00:00Z'
AND eventtime < '2023-03-01T00:00:00Z'
ORDER BY eventtime desc
This sample query can help you identify whether certain IAM principals have a significant amount of unauthorized API calls, which can indicate that an IAM principal is compromised.
Rejected TCP connections
During a security event, the unauthorized user that is interacting with the resources in your account is probably trying to establish persistence through the network layer. To get a list of rejected TCP connections and extract from it the day that these events occurred, run the following query:
SELECT day_of_week(date) AS day, date, interface_id, srcaddr, action, protocol
FROM vpc_flow_logs
WHERE action = 'REJECT' AND protocol = 6
LIMIT 100;
Connections over older TLS versions
You might want to see how many calls to AWS APIs were made using older versions of the TLS protocol, as part of a forensic follow-up or a discovery job after a risk analysis. You can get this data by querying CloudTrail logs.
SELECT eventSource, COUNT(*) AS numOutdatedTlsCalls
FROM cloudtrail
WHERE tlsDetails.tlsVersion IN ('TLSv1', 'TLSv1.1') AND eventTime > '2023-01-01 00:00:00'
GROUP BY eventSource
ORDER BY numOutdatedTlsCalls DESC
Filter connections from an IP
With an IP address that you’d like to investigate, as a part of your forensic analysis, you might want to see the connections made to resources in a VPC. You can obtain this information by querying VPC Flow Logs. As with the server access logs, if you’re using Athena, you will first need to create a new table.
SELECT day_of_week(date) AS day, date, srcaddr, dstaddr, action, protocol
FROM vpc_flow_logs
WHERE day >= '2023/01/01' AND day < '2023/03/01' AND srcaddr LIKE '172.50.%'
ORDER BY day DESC
LIMIT 100
Investigate user actions
If you have identified a user who has been compromised, or that you suspect has been compromised, you might want to know the API calls that they made over a specific time period. Understanding the activity of a user can help you understand the scope of impact during an incident, as well as the reach of user permissions when you design your access management strategy.
SELECT eventID, eventName, eventSource, eventTime, userIdentity.arn AS user
FROM cloudtrail
WHERE userIdentity.arn LIKE '%<username>%' AND eventTime > '2022-12-05 00:00:00' AND eventTime < '2022-12-08 00:00:00'
Conclusion
It is essential that you capture logs from various layers within your application architecture, so that you can effectively respond to a security event at various layers of the application stack. If a security event occurs, logs can help provide a clear picture of what happened and the scope of the affected resources. This post helps you build a logging strategy for security incident response by understanding what logs you want to analyze, where you want to store those logs, and how you will analyze them.
At Cloudflare, we reuse existing core systems to power multiple products, and testing of these core systems is essential. In particular, we need wide and thorough visibility into our live APIs' behaviors. We want to be able to detect regressions, prevent incidents and maintain healthy APIs. That is why we built Scout.
Scout is an automated system periodically running Python tests verifying the end to end behavior of our APIs. Scout allows us to evaluate APIs in production-like environments and thus ensures we can green light a production deployment while also monitoring the behavior of APIs in production.
Why Scout?
Before Scout, we were using an automated test system leveraging the Robot Framework. This older system limited our testing capabilities. For instance, we could not easily match json responses against the keys we were looking for. We often abandoned covering different behaviors of our APIs because it was impossible to control which resources a given test suite would run on, and two different test suites would create false negatives when they ran against the same account.
Regarding schema validation, only API responses were validated against a json schema, and tests would not fail if the response did not match the schema. Moreover, it was impossible to validate API requests.
Test suites were run in a queue, so the delay before a new feature could be assessed depended on the number of test suites waiting to run. The queue could also push newer test suites to the following day. Hence we often ended up with a mismatch between test and API versions. Test steps could not be run in parallel either.
We could not split test suites between different environments. If a new API feature was being developed, it was impossible to write a test for it without the actual feature first being released to production.
We built Scout to overcome all these difficulties. We wanted the developer experience to be easy and we wanted Scout to be fast and reliable while spotting any live API issue.
A Scout test example
Scout is built in Python and leverages the functionalities of Pytest. Before diving into the exact capabilities of Scout and its architecture, let’s have a quick look at how to use it!
Following is an example of a Scout test on the Rulesets API (the docs are available here):
from scout import requires, validate, Account, Zone

@validate(schema="rulesets", ignorePaths=["accounts/[^/]+/rules/lists"])
@requires(
    account=Account(
        entitlements={"rulesets.max_rules_per_ruleset": 2}),
    zone=Zone(plan="ENT",
        entitlements={"rulesets.firewall_custom_phase_allowed": True},
        account_entitlements={"rulesets.max_rules_per_ruleset": 2}))
class TestZone:
    def test_create_custom_ruleset(self, cfapi):
        response = cfapi.zone.request(
            "POST",
            "rulesets",
            payload=f"""{{
                "name": "My zone ruleset",
                "description": "My ruleset description",
                "phase": "http_request_firewall_custom",
                "kind": "zone",
                "rules": [
                    {{
                        "description": "My rule",
                        "action": "block",
                        "expression": "http.host eq \"fake.net\""
                    }}
                ]
            }}""")
        response.expect_json_success(
            200,
            result=f"""{{
                "name": "My zone ruleset",
                "version": "1",
                "source": "firewall_custom",
                "phase": "http_request_firewall_custom",
                "kind": "zone",
                "rules": [
                    {{
                        "description": "My rule",
                        "action": "block",
                        "expression": "http.host eq \"fake.net\"",
                        "enabled": true,
                        ...
                    }}
                ],
                ...
            }}""")
A Scout test is a succession of round trips of requests and responses against a given API. We use the functionalities of Pytest fixtures and marks to target specific resources while validating the requests and responses. Pytest marks in Scout allow an extra set of information to be provided to test suites. Pytest fixtures are contexts with information and methods which can be used across tests to enhance their capabilities. The conjunction of marks with fixtures therefore allows Scout to build the whole harness required to run a test suite against APIs.
Being able to exactly describe the resources against which a given test will run provides us confidence the live API behaves as expected under various conditions.
The cfapi fixture provides the capability to target different resources such as a Cloudflare account or a zone. In the test above, we use a Pytest mark @requires to describe the characteristics of the resources we want, e.g. we need here an account with a flag allowing us to have 2 rules for a ruleset. This will allow the test to only be run against accounts with such entitlements.
The @validate mark provides the capability to validate requests and responses to a given OpenAPI schema (here the rulesets OpenAPI schema). Any validation failure will be reported and flagged as a test failure.
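As a generic illustration of this mark-plus-fixture mechanism (this is not Scout's actual code), a fixture can read the arguments of a custom mark and use them to prepare the harness the test runs against:

import pytest

# Register the "requires" mark in pytest.ini/pyproject.toml to avoid unknown-mark warnings.
@pytest.fixture
def cfapi(request):
    marker = request.node.get_closest_marker("requires")
    requirements = marker.kwargs if marker else {}
    # In Scout, this is where a runner bound to a matching account/zone would be allocated.
    return {"requirements": requirements}

@pytest.mark.requires(account={"rulesets.max_rules_per_ruleset": 2})
def test_example(cfapi):
    assert cfapi["requirements"]["account"]["rulesets.max_rules_per_ruleset"] == 2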
Regarding the actual requests and responses, their payloads are described as f-strings, in particular the response f-string can be written as a “semi-json”:
Among many test assertions possible, Scout can assert the validity of a partial json response and it will log the information. We added the handling of ellipsis … as an indication for Scout not to care about any further fields at a given json nesting level. Hence, we are able to do partial matching on JSON API responses, thus focusing only on what matters the most in each test.
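A minimal sketch of this kind of partial matching (not Scout's implementation) could look like the following, where "..." at a nesting level tells the matcher to ignore anything not explicitly listed:

ELLIPSIS = "..."

def partial_match(expected, actual):
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return False
        keys = [k for k in expected if k != ELLIPSIS]
        if ELLIPSIS not in expected and set(actual) != set(keys):
            return False  # without "...", the keys must match exactly
        return all(k in actual and partial_match(expected[k], actual[k]) for k in keys)
    if isinstance(expected, list):
        if not isinstance(actual, list):
            return False
        items = [i for i in expected if i != ELLIPSIS]
        if ELLIPSIS not in expected and len(items) != len(actual):
            return False  # without "...", the lengths must match
        return all(partial_match(e, a) for e, a in zip(items, actual))
    return expected == actual

# Only the asserted fields matter; "version", "kind", and anything else are ignored.
assert partial_match({"name": "My zone ruleset", ELLIPSIS: ELLIPSIS},
                     {"name": "My zone ruleset", "version": "1", "kind": "zone"})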
Once a test suite run is complete, the results are pushed by the service and stored using Cloudflare Workers KV. They are displayed via a Cloudflare Worker.
Scout is run in separate environments such as production-like and production environments. It is part of our deployment process to verify Scout is green in our production-like environment prior to deploying to production where Scout is also used for monitoring purposes.
How we built it
The core of Scout is written in Python and it is a combination of three components interacting together:
The Scout plugin: a Pytest plugin to write tests easily
The Scout service: a scheduler service to run the test suites periodically
The Scout Worker: a collector and presenter of test reports
The Scout plugin
This is the core component of the Scout system. It allows us to write self-explanatory tests while ensuring a high level of compliance against OpenAPI schemas and verifying the APIs' behaviors.
The Scout plugin architecture can be split into three components: setup, resource allocator, and runners. Setup is a conjunction of multiple sub components in charge of setting up the plugin.
The Registry contains all the information about the pool of accounts and zones we use for testing. Entitlements, for example, are flags that gate customers' access to product features; the Registry can describe entitlements per account and zone so that Scout can run a test against a specific setup.
As explained earlier, Scout can validate requests and responses against OpenAPI schemas. This is the responsibility of validators. A validator is built per OpenAPI schema and can be selected via the @validate mark we saw above.
As soon as a validator is selected, all of a given test's interactions with an API are validated. If there is a validation failure, it is marked as a test failure.
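As a rough illustration of what a validator does, the sketch below checks a response body against a hand-written schema using the jsonschema package; Scout's real validators are built from the OpenAPI schemas themselves, so treat this purely as a sketch:

# Rough illustration of response validation -- Scout's real validators are generated from
# the OpenAPI schemas; this sketch uses the jsonschema package and a hand-written schema.
import jsonschema

RULESET_RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["name", "phase", "kind", "rules"],
    "properties": {
        "name": {"type": "string"},
        "phase": {"type": "string"},
        "kind": {"type": "string"},
        "rules": {"type": "array"},
    },
}

def validate_response(body: dict) -> None:
    # Raises jsonschema.ValidationError, which a test harness can turn into a test failure
    jsonschema.validate(instance=body, schema=RULESET_RESPONSE_SCHEMA)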
The last element of the setup is the config reader, the sub-component in charge of providing all the URLs and authentication details the Scout plugin needs to communicate with the APIs.
Next in the chain is the resource allocator. This component consumes the configuration and objects of the setup to build multiple runners; it is a factory which makes those runners available through the cfapi fixture.
When a line such as cfapi.zone.request("POST", "rulesets", ...) from the test above is processed, it is the request method of the zone runner allocated for the test that is executed. The resource allocator can provide specialized runners (account, zone or default), which makes it possible to target specific API endpoints for a given account or zone.
Runners are in charge of handling the execution of requests, managing the test expectations and using the validators for request/response schema validation.
Any failure on expectation or validation and any exceptions are recorded in the stash. The stash is shared across all runners. As such, when a test setup, run or cleanup is processed, the timeline of execution and potential retries are logged in the stash. The stash contents are later used for building the test suite reports.
Scout is able to run multiple test steps in parallel. Each resource couple (account runner, zone runner) is associated with a Pytest-xdist worker which runs test steps independently; there can be as many workers as there are resource couples. An extra “default” runner is provided for reaching our different APIs and/or URLs with or without authentication.
Testing a test system was not the easiest part. We had to build a fake API and assert that the Scout plugin behaves as it should in different situations. We reached and maintain a test coverage level we consider good (close to 90%) for running the Scout plugin permanently.
The Scout service
The Scout service is meant to schedule test suites periodically. It is a configurable scheduler providing a reporting harness for the test suites as well as multiple metrics. It was a design decision to build a scheduler instead of using cron jobs.
We wanted to be aware of any scheduling issues as well as run issues, and for this we used Prometheus metrics. The catch is that Prometheus' default mode is to scrape metrics advertised by services, and this scraping happens periodically; we were concerned about missing metrics if a cron job finished before the next Prometheus scrape. We therefore decided a small scheduler was better suited for overall observability of the test runs. Among the metrics the Scout service provides are network failures, general test failures, reporting failures, lagging tests and more.
The Scout service runs threads on configured periods. Each thread runs a test suite as a separate Pytest-with-Scout-plugin process, followed by a reporting step that consumes the results and publishes them to the relevant parties.
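A stripped-down sketch of such a thread is shown below; the metric names, suite path, period and port are invented for illustration and this is not the Scout service's actual code:

# Sketch of a scheduler thread: run a test suite with Pytest + the Scout plugin on a fixed
# period and expose failure counters for Prometheus to scrape. All names are made up.
import subprocess
import threading
import time

from prometheus_client import Counter, start_http_server

RUNS = Counter("scout_runs_total", "Test suite runs started", ["suite"])
TEST_FAILURES = Counter("scout_test_failures_total", "Test suite runs that failed", ["suite"])

def run_suite_forever(suite_path: str, period_seconds: int) -> None:
    while True:
        RUNS.labels(suite=suite_path).inc()
        result = subprocess.run(["pytest", suite_path])  # the Scout plugin is active via Pytest
        if result.returncode != 0:
            TEST_FAILURES.labels(suite=suite_path).inc()
        # reporting of the results (e.g. to Workers KV) would happen here
        time.sleep(period_seconds)

start_http_server(9100)  # metrics endpoint for Prometheus to scrape
threading.Thread(target=run_suite_forever, args=("tests/rulesets", 300)).start()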
The reporting component provided to each thread publishes the report to Workers KV and notifies us on chat if there is a failure. Reporting also takes care of publishing the information needed to build API testing coverage; it is mandatory for us to cover all the API endpoints and their possible methods so that we have wide and thorough visibility of our live APIs.
As a fallback, if there is any thread, test or reporting failure, we are alerted based on the Prometheus metrics updated throughout the service execution. The logs of the Scout service, as well as the logs of each Pytest/Scout-plugin execution, are the last resort if no metrics are available and reporting is failing.
The service can be deployed with a minimal YAML configuration and be set up for different environments. We can for example decide to run different test suites based on the environment, publish or not to Cloudflare Workers, set different periods and retry mechanisms and so on.
We keep the tests as part of our code base alongside the configuration of the Scout service, and that’s about it, the Scout service is a separate entity.
The Scout Worker
The Scout Worker is a Cloudflare Worker in charge of fetching the most recent entries from Workers KV and displaying them in an eye-pleasing manner. The Scout service publishes each test report as JSON, so the Scout Worker parses the report and displays its content based on the status of the test suite run.
For example, an authentication failure during a test resulted in the following display in the Worker:
What does Scout let us do
By leveraging the capabilities of Pytest and Cloudflare Workers, we have been able to build a configurable, robust and reliable system which allows us to easily write self-explanatory tests for our APIs.
We can validate requests and responses against OpenAPI schemas and test behaviors over specific resources while getting alerted through multiple means if something goes wrong.
For specific use cases, we can write a test verifying that the API behaves as it should, that the configuration pushed to the edge is valid, and that a given zone reacts as it should to security threats, going beyond a plain end-to-end API test.
Scout quickly became our permanent live tester and monitor of APIs. We wrote tests for all endpoints to maintain a wide coverage of all our APIs. Scout has since been used for verifying an API version prior to its deployment to production. In fact, after a deployment in a production-like environment we can know in a couple of minutes if a new feature is good to go to production and assess if it is behaving correctly.
We hope you enjoyed this deep dive description into one of our systems!
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to use SQL for analyzing structured and semi-structured data with best price performance along with secure access to the data.
As more users start querying data in a data warehouse, access control is paramount to protect valuable organizational data. Database administrators want to continuously monitor and manage user privileges to maintain proper data access in the data warehouse. Amazon Redshift provides granular access control on the database, schema, table, column, row, and other database objects by granting privileges to roles, groups, and users from a SQL interface. To monitor privileges configured in Amazon Redshift, you can retrieve them by querying system tables.
Although Amazon Redshift provides broad capabilities for managing access to database objects, we have heard from customers that they want to visualize and monitor privileges without using a SQL interface. In this post, we introduce predefined Grafana dashboards that visualize database privileges without writing SQL. These dashboards help database administrators reduce the time spent on database administration and increase the frequency of monitoring cycles.
This post focuses on database access, which relates to user access control against database objects. For more information, see Managing database security.
Amazon Redshift uses the GRANT command to define permissions in the database. For most database objects, GRANT takes three parameters:
Identity – The entity you grant access to. This could be a user, role, or group.
Object – The type of database object. This could be a database, schema, table or view, column, row, function, procedure, language, datashare, machine learning (ML) model, and more.
Privilege – The type of operation. Examples include CREATE, SELECT, ALTER, DROP, DELETE, and INSERT. The level of privilege depends on the object.
Generally, database administrators monitor and review the identities, objects, and privileges periodically to ensure proper access is configured. They also need to investigate access configurations when database users encounter permission errors. These tasks require a SQL interface to query multiple system tables, which can be a repetitive and undifferentiated operation. Therefore, database administrators need a single pane of glass to quickly navigate through identities, objects, and privileges without writing SQL.
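For illustration, here is a hedged sketch of what such a check looks like when scripted against the Amazon Redshift Data API with boto3; the cluster identifier, database and user are placeholders, and SVV_RELATION_PRIVILEGES is just one of several system views an administrator might query:

# Hedged sketch: pulling privilege information through the Amazon Redshift Data API.
# Cluster identifier, database and user are placeholders for your own environment.
import time
import boto3

client = boto3.client("redshift-data")

statement = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder cluster identifier
    Database="dev",
    DbUser="awsuser",                          # placeholder admin user
    Sql="SELECT * FROM svv_relation_privileges LIMIT 20;",
)

# The Data API is asynchronous: poll until the statement finishes, then fetch the rows
while client.describe_statement(Id=statement["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

for row in client.get_statement_result(Id=statement["Id"])["Records"]:
    print(row)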
Solution overview
The following diagram illustrates the solution architecture and its key components:
Amazon Redshift contains database privilege information in system tables.
Grafana provides a predefined dashboard to visualize database privileges. The dashboard runs queries against the Amazon Redshift system table via the Amazon Redshift Data API.
Note that the dashboard focuses on visualization; a SQL interface is still required to configure privileges in Amazon Redshift. You can use query editor v2, a web-based SQL interface that enables users to run SQL commands from a browser.
Prerequisites
Before moving to the next section, you should have the following prerequisites:
While Amazon Managed Grafana controls the plugin version and updates it periodically, local Grafana allows users to control the version. Therefore, local Grafana could be an option if you need earlier access to the latest features. Refer to the plugin changelog for released features and versions.
Import the dashboards
After you have finished the prerequisites, you should have access to Grafana configured with Amazon Redshift as a data source. Next, import two dashboards for visualization.
In the Grafana console, go to the Amazon Redshift data source you created and choose Dashboards.
Import the Amazon Redshift Identities and Objects dashboard.
Go to the data source again and import the Amazon Redshift Privileges dashboard.
Each dashboard will appear once imported.
Amazon Redshift Identities and Objects dashboard
The Amazon Redshift Identities and Objects dashboard shows identities and database objects in Amazon Redshift, as shown in the following screenshot.
The Identities section shows the detail of each user, role, and group in the source database.
One of the key features of this dashboard is the Role assigned to Role, User section, which uses a node graph panel to visualize the hierarchical structure of roles and users from multiple system tables. This visualization helps administrators quickly examine which roles are inherited by which users instead of querying multiple system tables. For more information about role-based access, refer to Role-based access control (RBAC).
Amazon Redshift Privileges dashboard
The Amazon Redshift Privileges dashboard shows privileges defined in Amazon Redshift.
In the Role and Group assigned to User section, open the Role assigned to User panel to list the roles for a specific user. In this panel, you can list and compare roles assigned to multiple users. Use the User drop-down at the top of the dashboard to select users.
The dashboard refreshes immediately and shows the filtered result for the selected users. The following screenshot shows the filtered result for users hr1, hr2, and it3.
The Object Privileges section shows the privileges granted for each database object and identity. Note that objects with no privileges granted are not listed here. To show the full list of database objects, use the Amazon Redshift Identities and Objects dashboard.
The Object Privileges (RLS) section contains visualizations for row-level security (RLS). The Policy attachments panel lets you examine the RLS configuration by visualizing the relationships between tables, policies, roles, and users.
Conclusion
In this post, we introduced a visualization for database privileges of Amazon Redshift using predefined Grafana dashboards. Database administrators can use these dashboards to quickly navigate through identities, objects, and privileges without writing SQL. You can also customize the dashboard to meet your business requirements. The JSON definition file of this dashboard is maintained as part of OSS in the Redshift data source for Grafana GitHub repository.
For more information about the topics described in this post, refer to the following:
Yota Hamaoka is an Analytics Solution Architect at Amazon Web Services. He is focused on driving customers to accelerate their analytics journey with Amazon Redshift.
We use Prometheus as our core monitoring system. We’ve been heavy Prometheus users since 2017 when we migrated off our previous monitoring system which used a customized Nagios setup. Despite growing our infrastructure a lot, adding tons of new products and learning some hard lessons about operating Prometheus at scale, our original architecture of Prometheus (see Monitoring Cloudflare’s Planet-Scale Edge Network with Prometheus for an in depth walk through) remains virtually unchanged, proving that Prometheus is a solid foundation for building observability into your services.
One of the key responsibilities of Prometheus is to alert us when something goes wrong. In this blog post we'll talk about how we make those alerts more reliable, introduce an open source tool we've developed to help us with that, and share how you can use it too. If you're not familiar with Prometheus you might want to start by watching this video to better understand the topic we'll be covering here.
Prometheus works by collecting metrics from our services and storing those metrics inside its database, called TSDB. We can then query these metrics using PromQL, the Prometheus query language, either with ad-hoc queries (for example to power Grafana dashboards) or via alerting or recording rules. A rule is basically a query that Prometheus will run for us in a loop, and when that query returns any results it will either be recorded as new metrics (with recording rules) or trigger alerts (with alerting rules).
Prometheus alerts
Since we’re talking about improving our alerting we’ll be focusing on alerting rules.
To create alerts we first need to have some metrics collected. For the purposes of this blog post let’s assume we’re working with http_requests_total metric, which is used on the examples page. Here are some examples of how our metrics will look:
Let’s say we want to alert if our HTTP server is returning errors to customers.
Since all we need to do is check the metric that tracks how many responses with HTTP status code 500 there were, a simple alerting rule could use an expression like http_requests_total{status="500"} > 0.
This will alert us if we serve any 500 errors to our customers. Prometheus will run our query looking for a time series named http_requests_total that also has a status label with the value "500". Then it will filter all those matched time series and only return the ones with a value greater than zero.
If our alert rule returns any results, an alert will fire, one for each returned result.
If our rule doesn’t return anything, meaning there are no matched time series, then alert will not trigger.
The whole flow from metric to alert is pretty simple here as we can see on the diagram below.
If we want to provide more information in the alert we can do so by setting additional labels and annotations, but the alert and expr fields are all we need to get a working rule.
But the problem with the above rule is that our alert starts when we have our first error, and then it will never go away.
After all, our http_requests_total is a counter, so it gets incremented every time there’s a new request, which means that it will keep growing as we receive more requests. What this means for us is that our alert is really telling us “was there ever a 500 error?” and even if we fix the problem causing 500 errors we’ll keep getting this alert.
A better alert would be one that tells us if we’re serving errors right now.
For that we can use the rate() function, for example rate(http_requests_total{status="500"}[2m]) > 0, to calculate the per-second rate of errors.
The query above will calculate the rate of 500 errors in the last two minutes. If we start responding with errors to customers our alert will fire, but once errors stop so will this alert.
This is great because if the underlying issue is resolved the alert will resolve too.
We can improve our alert further by, for example, alerting on the percentage of errors, rather than absolute numbers, or even calculate error budget, but let’s stop here for now.
It’s all very simple, so what do we mean when we talk about improving the reliability of alerting? What could go wrong here?
What could go wrong?
We can craft a valid YAML file with a rule definition that has a perfectly valid query that will simply not work how we expect it to work. Which, when it comes to alerting rules, might mean that the alert we rely upon to tell us when something is not working correctly will fail to alert us when it should. To better understand why that might happen let’s first explain how querying works in Prometheus.
Prometheus querying basics
There are two basic types of queries we can run against Prometheus. The first one is an instant query. It allows us to ask Prometheus for a point in time value of some time series. If we write our query as http_requests_total we'll get all time series named http_requests_total along with the most recent value for each of them. We can further customize the query and filter results by adding label matchers, like http_requests_total{status="500"}.
Let's say we have two instances of our server, green and red, and that each one is scraped (Prometheus collects metrics from it) every minute, independently of the other.
This is what happens when we issue an instant query:
There’s obviously more to it as we can use functions and build complex queries that utilize multiple metrics in one expression. But for the purposes of this blog post we’ll stop here.
The important thing to know about instant queries is that they return the most recent value of a matched time series, and they will look back for up to five minutes (by default) into the past to find it. If the last value is older than five minutes then it’s considered stale and Prometheus won’t return it anymore.
The second type of query is a range query – it works similarly to instant queries, the difference is that instead of returning us the most recent value it gives us a list of values from the selected time range. That time range is always relative so instead of providing two timestamps we provide a range, like “20 minutes”. When we ask for a range query with a 20 minutes range it will return us all values collected for matching time series from 20 minutes ago until now.
An important distinction between those two types of queries is that range queries don’t have the same “look back for up to five minutes” behavior as instant queries. If Prometheus cannot find any values collected in the provided time range then it doesn’t return anything.
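If you want to see the difference for yourself, both kinds of queries can be issued against the Prometheus HTTP API; the sketch below assumes a Prometheus server reachable on localhost:9090 and reuses the metric from our examples:

# Hedged sketch: trying an instant query and a range selector via the Prometheus HTTP API.
# Adjust the URL and metric names to match your own Prometheus server.
import requests

PROM = "http://localhost:9090/api/v1/query"

# Instant query: the most recent value of each matching series
# (Prometheus looks back up to 5 minutes by default to find it)
instant = requests.get(PROM, params={"query": 'http_requests_total{status="500"}'})
print(instant.json()["data"]["result"])

# Range selector: every sample collected in the window, here the last 20 minutes;
# an empty window means an empty result, there is no look-back
ranged = requests.get(PROM, params={"query": "http_requests_total[20m]"})
print(ranged.json()["data"]["result"])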
If we modify our example to request a [3m] range query, we should expect Prometheus to return three data points for each time series:
When queries don’t return anything
Knowing a bit more about how queries work in Prometheus we can go back to our alerting rules and spot a potential problem: queries that don’t return anything.
If our query doesn't match any time series, or if they're considered stale, then Prometheus will return an empty result. This might be because we've made a typo in the metric name or label filter, the metric we ask for is no longer being exported or was never there in the first place, or we've added some condition that isn't satisfied, like requiring a non-zero value in our http_requests_total{status="500"} > 0 example.
Prometheus won't return an error in any of the scenarios above because none of them are really problems; that's just how querying works. If you ask for something that doesn't match your query, you get empty results. This means there's no distinction between “all systems are operational” and “you've made a typo in your query”. So if you're not receiving any alerts from your service, it's either a sign that everything is working fine, or that you've made a typo and have no working monitoring at all, and it's up to you to verify which one it is.
For example, we could be trying to query for http_requests_totals instead of http_requests_total (an extra “s” at the end) and although our query will look fine it won’t ever produce any alert.
Range queries can add another twist – they’re mostly used in Prometheus functions like rate(), which we used in our example. This function will only work correctly if it receives a range query expression that returns at least two data points for each time series, after all it’s impossible to calculate rate from a single number.
Since the number of data points depends on the time range we passed to the range query, which we then pass to our rate() function, if we provide a time range that only contains a single value then rate won’t be able to calculate anything and once again we’ll return empty results.
The number of values collected in a given time range depends on the interval at which Prometheus collects all metrics, so to use rate() correctly you need to know how your Prometheus server is configured. You can read more about this here and here if you want to better understand how rate() works in Prometheus.
For example if we collect our metrics every one minute then a range query http_requests_total[1m] will be able to find only one data point. Here’s a reminder of how this looks:
Since, as we mentioned before, we can only calculate rate() if we have at least two data points, calling rate(http_requests_total[1m]) will never return anything and so our alerts will never work.
There are more potential problems we can run into when writing Prometheus queries; for example, any operation between two metrics will only work if both have the same set of labels (you can read about this here). But we'll stop here, as listing all the gotchas could take a while. The point to remember is simple: if your alerting query doesn't return anything, it might be that everything is OK and there's no need to alert, but it might also be that you've mistyped your metric name, your label filter cannot match anything, your metric disappeared from Prometheus, or you are using too small a time range for your range queries, and so on.
Renaming metrics can be dangerous
We’ve been running Prometheus for a few years now and during that time we’ve grown our collection of alerting rules a lot. Plus we keep adding new products or modifying existing ones, which often includes adding and removing metrics, or modifying existing metrics, which may include renaming them or changing what labels are present on these metrics.
A lot of metrics come from metrics exporters maintained by the Prometheus community, like node_exporter, which we use to gather some operating system metrics from all of our servers. Those exporters also undergo changes which might mean that some metrics are deprecated and removed, or simply renamed.
A problem we’ve run into a few times is that sometimes our alerting rules wouldn’t be updated after such a change, for example when we upgraded node_exporter across our fleet. Or the addition of a new label on some metrics would suddenly cause Prometheus to no longer return anything for some of the alerting queries we have, making such an alerting rule no longer useful.
It’s worth noting that Prometheus does have a way of unit testing rules, but since it works on mocked data it’s mostly useful to validate the logic of a query. Unit testing won’t tell us if, for example, a metric we rely on suddenly disappeared from Prometheus.
Chaining rules
When writing alerting rules we try to limit alert fatigue by ensuring that, among many things, alerts are only generated when there’s an action needed, they clearly describe the problem that needs addressing, they have a link to a runbook and a dashboard, and finally that we aggregate them as much as possible. This means that a lot of the alerts we have won’t trigger for each individual instance of a service that’s affected, but rather once per data center or even globally.
For example, we might alert if the rate of HTTP errors in a data center is above 1% of all requests. To do that we first need to calculate the overall rate of errors across all instances of our server. For that we would use two recording rules, producing the metrics job:http_requests_total:rate2m and job:http_requests_status500:rate2m.
The first rule tells Prometheus to calculate the per-second rate of all requests and sum it across all instances of our server. The second rule does the same but only sums time series with a status label equal to “500”. Both rules produce new metrics named after the value of the record field.
Now we can modify our alert rule to use the metrics generated by our recording rules, with an expression like job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01.
If we have a data center wide problem then we will raise just one alert, rather than one per instance of our server, which can be a great quality of life improvement for our on-call engineers.
But at the same time we’ve added two new rules that we need to maintain and ensure they produce results. To make things more complicated we could have recording rules producing metrics based on other recording rules, and then we have even more rules that we need to ensure are working correctly.
What if all those rules in our chain are maintained by different teams? What if the rule in the middle of the chain suddenly gets renamed because one of the teams needs that? Problems like these can easily crop up now and then if your environment is sufficiently complex, and when they do, they're not always obvious; after all, the only sign that something stopped working is, well, silence: your alerts no longer trigger. If you're lucky you're plotting your metrics on a dashboard somewhere and hopefully someone will notice if they become empty, but it's risky to rely on this.
We definitely felt that we needed something better than hope.
Introducing pint: a Prometheus rule linter
To avoid running into such problems in the future we’ve decided to write a tool that would help us do a better job of testing our alerting rules against live Prometheus servers, so we can spot missing metrics or typos easier. We also wanted to allow new engineers, who might not necessarily have all the in-depth knowledge of how Prometheus works, to be able to write rules with confidence without having to get feedback from more experienced team members.
Since we believe that such a tool will have value for the entire Prometheus community we’ve open-sourced it, and it’s available for anyone to use – say hello to pint!
You can run it against a file (or files) with Prometheus rules
It can run as a part of your CI pipeline
Or you can deploy it as a side-car to all your Prometheus servers
It doesn’t require any configuration to run, but in most cases it will provide the most value if you create a configuration file for it and define some Prometheus servers it should use to validate all rules against. Running without any configured Prometheus servers will limit it to static analysis of all the rules, which can identify a range of problems, but won’t tell you if your rules are trying to query non-existent metrics.
The first mode is where pint reads a file (or a directory containing multiple files), parses it, does all the basic syntax checks and then runs a series of checks for all Prometheus rules in those files.
The second mode is optimized for validating git based pull requests. Instead of testing all rules from all files, pint will only test rules that were modified and report only problems affecting modified lines.
The third mode is where pint runs as a daemon and tests all rules on a regular basis. If it detects any problem it will expose those problems as metrics. You can then collect those metrics using Prometheus and alert on them as you would for any other problems. This way you can basically use Prometheus to monitor itself.
What kind of checks can it run for us and what kind of problems can it detect?
All the checks are documented here, along with some tips on how to deal with any detected problems. Let’s cover the most important ones briefly.
As mentioned above the main motivation was to catch rules that try to query metrics that are missing or when the query was simply mistyped. To do that pint will run each query from every alerting and recording rule to see if it returns any result, if it doesn’t then it will break down this query to identify all individual metrics and check for the existence of each of them. If any of them is missing or if the query tries to filter using labels that aren’t present on any time series for a given metric then it will report that back to us.
So if someone tries to add a new alerting rule with http_requests_totals typo in it, pint will detect that when running CI checks on the pull request and stop it from being merged. Which takes care of validating rules as they are being added to our configuration management system.
Another useful check will try to estimate the number of times a given alerting rule would trigger an alert. Which is useful when raising a pull request that’s adding new alerting rules – nobody wants to be flooded with alerts from a rule that’s too sensitive so having this information on a pull request allows us to spot rules that could lead to alert fatigue.
Similarly, another check will provide information on how many new time series a recording rule adds to Prometheus. In our setup a single unique time series uses, on average, 4KiB of memory. So if a recording rule generates 10 thousand new time series it will increase Prometheus server memory usage by 10000*4KiB=40MiB. 40 megabytes might not sound like much, but our peak time series usage in the last year was around 30 million time series in a single Prometheus server, so we pay attention to anything that might add a substantial amount of new time series, which pint helps us notice before such a rule gets added to Prometheus.
On top of all the Prometheus query checks, pint allows us also to ensure that all the alerting rules comply with some policies we’ve set for ourselves. For example, we require everyone to write a runbook for their alerts and link to it in the alerting rule using annotations.
We also require all alerts to have priority labels, so that high priority alerts are generating pages for responsible teams, while low priority ones are only routed to karma dashboard or create tickets using jiralert. It’s easy to forget about one of these required fields and that’s not something which can be enforced using unit testing, but pint allows us to do that with a few configuration lines.
With pint running on all stages of our Prometheus rule life cycle, from initial pull request to monitoring rules deployed in our many data centers, we can rely on our Prometheus alerting rules to always work and notify us of any incident, large or small.
Our rule now passes the most basic checks, so we know it’s valid. But to know if it works with a real Prometheus server we need to tell pint how to talk to Prometheus. For that we’ll need a config file that defines a Prometheus server we test our rule against, it should be the same server we’re planning to deploy our rule to. Here we’ll be using a test instance running on localhost. Let’s create a “pint.hcl” file and define our Prometheus server there:
prometheus "prom1" {
uri = "http://localhost:9090"
timeout = "1m"
}
Now we can re-run our check using this configuration file:
$ pint -c pint.hcl lint rules.yml
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:5: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
expr: sum(rate(http_requests_total[2m])) without(method, status, instance)
rules.yml:8: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)
level=info msg="Problems found" Bug=2
level=fatal msg="Execution completed with error(s)" error="problems found"
Yikes! It’s a test Prometheus instance, and we forgot to collect any metrics from it.
Let’s fix that by starting our server locally on port 8080 and configuring Prometheus to collect metrics from it:
$ pint -c pint.hcl lint rules.yml
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_total:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01
level=info msg="Problems found" Information=2
It all works according to pint, and so we now can safely deploy our new rules file to Prometheus.
Notice that pint recognised that both metrics used in our alert come from recording rules, which aren’t yet added to Prometheus, so there’s no point querying Prometheus to verify if they exist there.
Now what happens if we deploy a new version of our server that renames the “status” label to something else, like “code”?
$ pint -c pint.hcl lint rules.yml
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:8: prometheus "prom1" at http://localhost:9090 has "http_requests_total" metric but there are no series with "status" label in the last 1w (promql/series)
expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01
level=info msg="Problems found" Bug=1 Information=1
level=fatal msg="Execution completed with error(s)" error="problems found"
Luckily pint will notice this and report it, so we can adapt our rule to match the new label name.
But what if that happens after we deploy our rule? For that we can use the “pint watch” command that runs pint as a daemon periodically checking all rules.
Please note that validating all metrics used in a query will eventually produce some false positives. In our example, metrics with the status="500" label might not be exported by our server until there's at least one request ending in an HTTP 500 error.
The promql/series check responsible for validating the presence of all metrics has some documentation on how to deal with this problem. In most cases you'll want to add a comment that instructs pint to ignore some missing metrics entirely, or to stop checking label values (only check that the "status" label is present, without checking whether there are time series with status="500").
Summary
Prometheus metrics don’t follow any strict schema, whatever services expose will be collected. At the same time a lot of problems with queries hide behind empty results, which makes noticing these problems non-trivial.
We use pint to find such problems and report them to engineers, so that our global network is always monitored correctly and we can be confident that a lack of alerts means our infrastructure really is healthy.
Grafana is a rich interactive open-source tool by Grafana Labs for visualizing data across one or many data sources. It’s used in a variety of modern monitoring stacks, allowing you to have a common technical base and apply common monitoring practices across different systems. Amazon Managed Grafana is a fully managed, scalable, and secure Grafana-as-a-service solution developed by AWS in collaboration with Grafana Labs.
Amazon Redshift is the most widely used data warehouse in the cloud. You can view your Amazon Redshift cluster’s operational metrics on the Amazon Redshift console, use AWS CloudWatch, and query Amazon Redshift system tables directly from your cluster. The first two options provide a set of predefined general metrics and visualizations. The last one allows you to use the flexibility of SQL to get deep insights into the details of the workload. However, querying system tables requires knowledge of system table structures. To address that, we came up with a consolidated Amazon Redshift Grafana dashboard that visualizes a set of curated operational metrics and works on top of the Amazon Redshift Grafana data source. You can easily add it to an Amazon Managed Grafana workspace, as well as to any other Grafana deployments where the data source is installed.
This post guides you through a step-by-step process to create an Amazon Managed Grafana workspace and configure an Amazon Redshift cluster with a Grafana data source for it. Lastly, we show you how to set up the Amazon Redshift Grafana dashboard to visualize the cluster metrics.
Solution overview
The following diagram illustrates the solution architecture.
The solution includes the following components:
The Amazon Redshift cluster to get the metrics from.
Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data Service API.
The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.
We walk you through the following steps during the configuration process:
Configure an Amazon Redshift cluster.
Create a database user for Amazon Managed Grafana on the cluster.
Configure a user in AWS Single Sign-On (AWS SSO) for Amazon Managed Grafana UI access.
Configure an Amazon Managed Grafana workspace and sign in to Grafana.
Set up Amazon Redshift as the data source in Grafana.
Import the Amazon Redshift dashboard supplied with the data source.
Prerequisites
To follow along with this walkthrough, you should have the following prerequisites:
Familiarity with the basic concepts of the following services:
Amazon Redshift
Amazon Managed Grafana
AWS SSO
Configure an Amazon Redshift cluster
If you don’t have an Amazon Redshift cluster, create a sample cluster before proceeding with the following steps. For this post, we assume that the cluster identifier is called redshift-demo-cluster-1 and the admin user name is awsuser.
On the Amazon Redshift console, choose Clusters in the navigation pane.
Choose your cluster.
Choose the Properties tab.
To make the cluster discoverable by Amazon Managed Grafana, you must add a special tag to it.
Choose Add tags.
For Key, enter GrafanaDataSource.
For Value, enter true.
Choose Save changes.
Create a database user for Amazon Managed Grafana
Grafana will be directly querying the cluster, and it requires a database user to connect to the cluster. In this step, we create the user redshift_data_api_user and apply some security best practices.
On the cluster details page, choose Query data, then Query in query editor v2.
Choose the redshift-demo-cluster-1 cluster we created previously.
For Database, enter the default dev.
Enter the user name and password that you used to create the cluster.
Choose Create connection.
In the query editor, enter the following statements and choose Run:
CREATE USER redshift_data_api_user PASSWORD '<password>' CREATEUSER;
ALTER USER redshift_data_api_user SET readonly TO TRUE;
ALTER USER redshift_data_api_user SET query_group TO 'superuser';
The first statement creates a user with superuser privileges necessary to access system tables and views (make sure to use a unique password). The second prohibits the user from making modifications. The last statement isolates the queries the user can run to the superuser queue, so they don’t interfere with the main workload.
In this example, we use service managed permissions in Amazon Managed Grafana and a workspace AWS Identity and Access Management (IAM) role as the authentication provider in the Amazon Redshift Grafana data source. That role uses the AmazonGrafanaRedshiftAccess policy, which grants access through the database user redshift_data_api_user; this is why we created the user with that exact name.
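If you want to sanity-check the new database user before wiring up Grafana, one hedged option is to call the Amazon Redshift Data API directly as that user; the cluster identifier and database below are the ones used in this walkthrough:

# Optional sanity check (a sketch, not part of the official walkthrough): confirm the new
# database user works through the Amazon Redshift Data API, which is what the Grafana
# data source relies on.
import time
import boto3

client = boto3.client("redshift-data")

statement = client.execute_statement(
    ClusterIdentifier="redshift-demo-cluster-1",
    Database="dev",
    DbUser="redshift_data_api_user",
    Sql="SELECT current_user;",
)

while client.describe_statement(Id=statement["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

print(client.get_statement_result(Id=statement["Id"])["Records"])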
Configure a user in AWS SSO for Amazon Managed Grafana UI access
Two authentication methods are available for accessing Amazon Managed Grafana: AWS SSO and SAML. In this example, we use AWS SSO.
On the AWS SSO console, choose Users in the navigation pane.
Choose Add user.
In the Add user section, provide the required information.
In this post, we select Send an email to the user with password setup instructions. You need to be able to access the email address you enter because you use this email further in the process.
Choose Next to proceed to the next step.
Choose Add user.
An email is sent to the email address you specified.
Choose Accept invitation in the email.
You’re redirected to sign in as a new user and set a password for the user.
Enter a new password and choose Set new password to finish the user creation.
Configure an Amazon Managed Grafana workspace and sign in to Grafana
Now you’re ready to set up an Amazon Managed Grafana workspace.
On the Amazon Grafana console, choose Create workspace.
For Workspace name, enter a name, for example grafana-demo-workspace-1.
Choose Next.
For Authentication access, select AWS Single Sign-On.
For Permission type, select Service managed.
Choose Next to proceed.
For IAM permission access settings, select Current account.
For Data sources, select Amazon Redshift.
Choose Next to finish the workspace creation.
You’re redirected to the workspace page.
Next, we need to enable AWS SSO as an authentication method.
On the workspace page, choose Assign new user or group.
Under Select users and groups, select the previously created AWS SSO user in the Users table.
You need to make the user an admin, because we set up the Amazon Redshift data source with it.
Select the user from the Users list and choose Make admin.
Go back to the workspace and choose the Grafana workspace URL link to open the Grafana UI.
Sign in with the user name and password you created in the AWS SSO configuration step.
Set up an Amazon Redshift data source in Grafana
To visualize the data in Grafana, we need to access the data first. To do so, we must create a data source pointing to the Amazon Redshift cluster.
On the navigation bar, choose the lower AWS icon (there are two) and then choose Redshift from the list.
For Regions, choose the Region of your cluster.
Select the cluster from the list and choose Add 1 data source.
On the Provisioned data sources page, choose Go to settings.
For Name, enter a name for your data source.
By default, Authentication Provider should be set as Workspace IAM Role, Default Region should be the Region of your cluster, and Cluster Identifier should be the name of the chosen cluster.
For Database, enter dev.
For Database User, enter redshift_data_api_user.
Choose Save & Test.
A success message should appear.
Import the Amazon Redshift dashboard supplied with the data source
As the last step, we import the default Amazon Redshift dashboard and make sure that it works.
In the data source we just created, choose Dashboards on the top navigation bar and choose Import to import the Amazon Redshift dashboard.
Under Dashboards on the navigation sidebar, choose Manage.
In the dashboards list, choose Amazon Redshift.
The dashboard appears, showing operational data from your cluster. When you add more clusters and create data sources for them in Grafana, you can choose them from the Data source list on the dashboard.
Clean up
To avoid incurring unnecessary charges, delete the Amazon Redshift cluster, AWS SSO user, and Amazon Managed Grafana workspace resources that you created as part of this solution.
Conclusion
In this post, we covered the process of setting up an Amazon Redshift dashboard working under Amazon Managed Grafana with AWS SSO authentication and querying from the Amazon Redshift cluster under the same AWS account. This is just one way to create the dashboard. You can modify the process to set it up with SAML as an authentication method, use custom IAM roles to manage permissions with more granularity, query Amazon Redshift clusters outside of the AWS account where the Grafana workspace is, use an access key and secret or AWS Secrets Manager based connection credentials in data sources, and more. You can also customize the dashboard by adding or altering visualizations using the feature-rich Grafana UI.
Because the Amazon Redshift data source plugin is an open-source project, you can install it in any Grafana deployment, whether it’s in the cloud, on premises, or even in a container running on your laptop. That allows you to seamlessly integrate Amazon Redshift monitoring into virtually all your existing Grafana-based monitoring stacks.
For more details about the systems and processes described in this post, refer to the following:
Sergey Konoplev is a Senior Database Engineer on the Amazon Redshift team. Sergey has been focusing on automation and improvement of database and data operations for more than a decade.
Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.
AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD) provides a fully managed service for Microsoft Active Directory (AD) in the AWS cloud. When you create your directory, AWS deploys two domain controllers in separate Availability Zones that are exclusively yours for high availability. For use cases requiring even higher resilience and performance, in a specific Region or during specific hours, AWS Managed Microsoft AD allows you to scale by deploying additional domain controllers to meet your needs. These domain controllers can help load-balance, increase overall performance, or simply provide additional nodes to protect against temporary availability issues. AWS Managed Microsoft AD allows you to define the correct number of domain controllers for your directory based on your individual use case.
This post will walk you through how to automate scaling in AWS Managed Microsoft AD using utilization metrics from your directory. You’ll do this using Amazon CloudWatch Alarms, SNS notifications, and a Lambda function to increase the number of domain controllers in your directory based on utilization peaks.
Simplified directory scaling
AWS Managed Microsoft AD has now simplified this directory scaling process by integrating with Amazon CloudWatch metrics. This new integration enables you to:
Analyze your directory to identify expected average and peak directory utilization
Scale your directory based on utilization data to adequately address the expected load
Automate the addition of domain controllers to handle unexpected load.
Integration is available for both domain controller utilization metrics such as CPU, Memory, Disk and Network, and for AD-specific metrics, such as LDAP searches, binds, DNS queries, and Directory reads/writes. Analyzing this data over time to identify expected average and peak utilization on your directory can help you deploy additional domain controllers in Regions that need them. Once you’ve established this utilization baseline, you can deploy additional domain controllers to service this load, and configure alarms for anything exceeding this baseline.
Solution overview
In this example, our AWS Managed Microsoft AD has the default two domain controllers; once your utilization threshold is reached, you’ll add one additional domain controller (domain controller 3 in the diagram) to cover this additional load.
Choose the Directory Service namespace, then choose AWS Managed Microsoft AD.
In the Directory ID column, select your directory and choose Search for this only.
In the Metric Category column, select Processor and choose Add to search. This view shows the processor utilization for your directory.
Figure 2. Processor utilization metrics
To see the average utilization across all domain controllers, choose Add Math, then All Functions, then AVG to create a metric math expression for average CPU utilization across all domain controllers.
Figure 3. Adding a math function to compute average
Next, choose the Graphed Metrics tab in the CloudWatch metrics console, select the newly created expression, then select the bell icon from the Actions column to create a CloudWatch alarm based on this metric.
Figure 4. Create a CloudWatch Alarm using Metric Math Expression
Configure the threshold alarm to trigger when CPU utilization exceeds 70%.
In the Metrics section, under Period, choose 1 Hour. In the Conditions section, under Threshold Type, choose Static. Under Define the alarm condition, choose Greater than threshold. Under Define the threshold value, enter 70. See Figure 5 for an image of how alarm parameters should look on your screen. Choose Next to Configure actions.
Figure 5. Configure the alarm parameters
On the Configure actions screen, configure the actions using the parameters listed below to send an email notification when the alarm state is triggered. See Figure 6 for an image of how email notifications are configured.
In the Notification section, set Alarm state trigger to In alarm. Set Select an SNS topic to Create topic. Fill in the name of the alarm in the Create a new topic field, and add the email where notifications should be sent to the Email endpoints that will receive notification field. An email address is required to create the SNS topic and you should use an email address that’s accessible by your operations team. This SNS topic will be used to trigger the Lambda automation described in a later section. Note: make a note of the SNS topic name you chose; you will use it later when creating the Lambda function in the To create an AWS Lambda function to automate scale out procedure below.
Figure 6. Create SNS topic and email notification
In the Alarm name field, provide a name for the alarm. You can optionally also add an Alarm description. Choose Next.
Review your configuration, and choose Create alarm to create the alarm.
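If you prefer to script this step, the hedged sketch below creates an equivalent alarm with boto3; the namespace, metric name, dimension names and values, and the SNS topic ARN are placeholders, so copy the exact values from the metrics you graphed and the topic you created above.

# Hedged sketch: create the same "average CPU above 70%" alarm with boto3 instead of the
# console. Namespace, metric name, dimensions and the SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

dc_metric = {
    "Namespace": "Directory Service",                            # placeholder namespace
    "MetricName": "Processor Usage",                             # placeholder metric name
    "Dimensions": [
        {"Name": "Directory ID", "Value": "d-906752246f"},
        {"Name": "Domain Controller IP", "Value": "10.0.0.10"},  # repeat per domain controller
    ],
}

cloudwatch.put_metric_alarm(
    AlarmName="ManagedAD-AverageCPUAbove70",
    # m1 is the per-domain-controller metric; avg_cpu is the metric math expression
    Metrics=[
        {"Id": "m1", "MetricStat": {"Metric": dc_metric, "Period": 3600, "Stat": "Average"},
         "ReturnData": False},
        {"Id": "avg_cpu", "Expression": "AVG(METRICS())", "ReturnData": True},
    ],
    ComparisonOperator="GreaterThanThreshold",
    Threshold=70,
    EvaluationPeriods=1,
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alarm-topic"],  # placeholder topic ARN
)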
Once you’ve completed these steps, you will now have an alarm implemented for when domain controller CPU utilization exceeds an average of 70% across both domain controllers. This will trigger an SNS topic when your directory is experiencing a heavy load, which will be used to start the Lambda automation and will send an informational email notification. In the next section, we’ll configure an AWS Lambda function to automate the addition of a domain controller based on this SNS topic.
The sample Lambda function shown below checks the number of domain controllers in this Region, and increases that by adding one additional domain controller. This procedure describes how to configure the IAM role required for this Lambda function, then how to deploy the Lambda function to execute when the alarm is triggered to automatically add a domain controller when your load exceeds your typical usage baseline.
Choose Next:Tags to add tags (optional) before choosing Next:Review.
On the Create Policy screen, provide a name in the Name field. You can optionally also add a description. Choose Create policy to complete creating the new policy.
Note: make a note of the policy name you chose; you will use it later when updating the execution role for the Lambda function.
Figure 7. Provide a name to create the IAM policy
In the AWS Console, navigate to Lambda and choose Create Function
On the Create Function screen, select Author from Scratch and provide a Name, then choose Create Function.
Figure 8. Create a Lambda function
Once created, on the Lambda function’s page, choose the Configuration tab, then choose Permissions from the sidebar and choose the execution role name linked under Role name. This will open the IAM console in another tab, preloaded to your Lambda execution role.
Figure 9: Select the Execution Role
On the execution role screen, choose Attach policies and select the IAM policy you just created (e.g. DirectoryService-DCNumber Update). On the Attach Permissions screen, choose Attach policy to finish updating the execution role. Once you have completed this step, you can close this tab and return to the previous browser tab.
Figure 10. Select and attach the IAM policy
On the Lambda function screen, choose the Configuration tab, then choose Triggers from the sidebar.
On the Add Trigger screen, choose the pulldown under Trigger configuration and select SNS. On the SNS topic box, select the SNS topic you created in Step 9 of the To create a CloudWatch Alarm with SNS topic notifications procedure above. Then choose Add to complete the trigger configuration.
On the Lambda function screen, choose the Configuration tab, then choose Environment variables from the sidebar.
On the Environment variables card, click Edit.
On the Edit environment variables screen, choose Add environment variable and add the key DIRECTORY_ID with the directory ID of your AWS Managed Microsoft AD as its value.
Figure 11. The “Edit environment variables” screen
On the Lambda function screen, choose the Code tab to open the in-browser code editor experience inside the Code source card. Paste in the sample Lambda function code given below to complete the implementation.
Figure 12. Paste sample code to complete the Lambda function setup
Sample Lambda function code
The sample Lambda function given below automates adding another domain controller to your directory. When your CloudWatch alarm triggers, you will receive a notification email, and an additional domain controller will be deployed to provide the added capacity to support the increase in directory usage.
Note: The example code contains a variable for the maximum number of domain controllers (maxDcNum), to prevent you from over provisioning in the event of a missed configuration. This value is set to 3 for this blog post’s example and can be increased to suit your use case.
import json
import os
import boto3

maxDcNum = 3
minDcNum = 2
region = "us-east-1"
# Read the directory ID from the DIRECTORY_ID environment variable configured earlier,
# falling back to a sample value
dsId = os.environ.get("DIRECTORY_ID", "d-906752246f")

ds = boto3.client('ds', region_name=region)

def lambda_handler(event, context):
    ## get the current number of domain controllers
    dcs = ds.describe_domain_controllers(DirectoryId=dsId)
    DomainControllers = dcs["DomainControllers"]
    DCcount = len(DomainControllers)
    print(">>> Current number of DCs: " + str(DCcount))
    # increase the number of DCs, staying within the configured maximum
    if DCcount < maxDcNum:
        NewDCnumber = DCcount + 1
        ds.update_number_of_domain_controllers(DirectoryId=dsId, DesiredNumber=NewDCnumber)
        return {
            'statusCode': 200,
            'body': json.dumps("New DC number will be " + str(NewDCnumber))
        }
    else:
        return {
            'statusCode': 200,
            'body': json.dumps("Max number of DCs reached. The number of DCs is " + str(DCcount))
        }
Note: When testing this Lambda function, remember that this will increase the number of domain controllers for your directory in that Region. If the additional domain controller is not needed, please reduce the count after the test to avoid costs for an additional domain controller. The same principles used in this article to automate the addition of domain controllers can be applied to automate the reduction of domain controllers and you should consider automating the reduction to optimize for resilience, performance and cost.
Conclusion
In this post, you’ve implemented alarms based on domain controller utilization thresholds using Amazon CloudWatch, and automation to increase the number of domain controllers using an AWS Lambda function. This solution helps you cost-effectively improve the resilience and performance of your directory by scaling it based on historical load patterns.
As your monitoring infrastructure evolves, you might hit a point where there’s no avoiding the Zabbix API. The Zabbix API can be used to automate a particular part of your day-to-day workflow, troubleshoot your monitoring, or simply analyze or get statistics about a specific set of entities.
In this blog post, we will take a look at some of the more advanced API methods and specific method parameters and learn how they can be used to improve your API workflows.
1. Count entities with countOutput
Let’s start by gathering some statistics. Say you have to count the number of matching entities – here we can use the countOutput parameter. For a more advanced use case – what if we have to count the number of events for a given time period? Let’s combine countOutput with time_from and time_till (in unixtime) and get the number of events created for the month of November. Let’s get all of the events for the month of November that have the Disaster severity:
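A hedged sketch of such a request is shown below; the unixtime values correspond to November 2021 and the token placeholder is an assumption – adjust both to your environment.

{
    "jsonrpc": "2.0",
    "method": "event.get",
    "params": {
        "countOutput": true,
        "time_from": "1635724800",
        "time_till": "1638316799",
        "severities": [5]
    },
    "auth": "<your API token>",
    "id": 1
}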
2. Exporting and importing configuration
Now let’s copy and paste the result of the configuration export and import the template into another environment with configuration.import. It’s extremely important to remember that for this method to work exactly as we intend, we need to include the parameters that specify the behavior of the particular entities contained in the configuration string, such as items, value maps, templates, etc. For example, if I exclude the templates parameter here, no templates will be imported.
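A hedged example of such an import request might look like the following; the source value is a placeholder for the exported configuration string, and the selection of rules shown here is illustrative rather than exhaustive.

{
    "jsonrpc": "2.0",
    "method": "configuration.import",
    "params": {
        "format": "json",
        "rules": {
            "templates": { "createMissing": true, "updateExisting": true },
            "items": { "createMissing": true, "updateExisting": true },
            "valueMaps": { "createMissing": true, "updateExisting": true }
        },
        "source": "<paste the exported configuration string here>"
    },
    "auth": "<your API token>",
    "id": 1
}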
3. Expand trigger functions and macros with expand parameters
Using trigger.get to obtain information about a particular set of triggers is a relatively common practice. One particular caveat that we have to consider is that, by default, macros in the trigger name, expression or description are not expanded. To expand the available macros we need to use the expand parameters:
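A hedged sketch of a trigger.get call using the expand parameters is shown below; the trigger ID and the output field selection are placeholders.

{
    "jsonrpc": "2.0",
    "method": "trigger.get",
    "params": {
        "triggerids": "<trigger ID>",
        "output": ["triggerid", "description", "expression", "comments"],
        "expandDescription": true,
        "expandExpression": true,
        "expandComment": true
    },
    "auth": "<your API token>",
    "id": 1
}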
4. Obtaining additional LLD information for a discovered item
If we wish to display additional LLD information for a discovered entity, in this case – an item, we can use the selectDiscoveryRule and selectItemDiscovery parameters.
While selectDiscoveryRule will provide the ID of the LLD rule that created the item, selectItemDiscovery can point us at the parent item prototype id from which the item was created, last discovery time, item prototype key, and more.
The example below will return the item details and will also provide the LLD rule and Item prototype IDs, the time when the lost item will be deleted and the last time the item was discovered:
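A hedged sketch of such an item.get call follows; the item ID is a placeholder, and the selected itemDiscovery fields (parent_itemid, key_, lastcheck, ts_delete) cover the details mentioned above.

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "itemids": "<item ID>",
        "output": ["itemid", "name", "key_"],
        "selectDiscoveryRule": ["itemid", "name", "key_"],
        "selectItemDiscovery": ["parent_itemid", "key_", "lastcheck", "ts_delete"]
    },
    "auth": "<your API token>",
    "id": 1
}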
5. Searching through the matched entities with search parameters
Zabbix API provides a couple of standard parameters for performing a search. With the search parameter, we can search string or text fields and try to find objects based on a single entry or multiple entries. The searchByAny parameter can extend the search – if you set it to true, the search matches ANY of the criteria in the search array, instead of trying to find an entity that matches ALL of them (the default behavior).
The following API call will find items that match agent and Zabbix keys on a particular template:
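A hedged sketch of such a call is given below; the template ID is a placeholder and the key fragments simply illustrate matching both agent and Zabbix internal keys.

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "output": ["itemid", "name", "key_"],
        "templateids": "<template ID>",
        "search": {
            "key_": ["agent.", "zabbix"]
        },
        "searchByAny": true
    },
    "auth": "<your API token>",
    "id": 1
}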
Feel free to take the above examples and adapt them to fit your use case – you should be able to implement them in your environment quite easily. There are many other use cases that we might potentially cover down the line – if you have a specific API use case that you wish for us to cover, feel free to leave a comment under this post and we just might cover it in one of the upcoming blog posts!
With Zabbix Summit Online 2021 just around the corner, it’s time to have a quick overview of the 6.0 LTS features that we can expect to see featured during the event. The Zabbix 6.0 LTS release aims to deliver some of the long-awaited enterprise-level features while also improving the general user experience, performance, scalability, and many other aspects of Zabbix.
Native Zabbix server cluster
Many of you will be extremely happy to hear that Zabbix 6.0 LTS release comes with out-of-the-box High availability for Zabbix Server. This means that HA will now be supported natively, without having to use external tools to create Zabbix Server clusters.
The native Zabbix Server cluster will have a speech dedicated to it during the Zabbix Summit Online 2021. You can expect to learn both the inner workings of the HA solution, the configuration and of course the main benefits of using the native HA solution. You can also take a look at the in-development version of the native Zabbix server cluster in the latest Zabbix 6.0 LTS alpha release.
Business service monitoring and root cause analysis
Service monitoring is also about to go through a significant redesign, focusing on delivering additional value by providing robust Business service monitoring (BSM) features. This is achieved by delivering significant additions to the existing service status calculation logic. With features such as service weights, service status analysis based on child problem severities, and the ability to calculate service status based on the number or percentage of children in a problem state, users will be able to implement BSM on a whole new level. BSM will also support root cause analysis – users will be informed about the root cause problem of the service status change.
All of this and more, together with examples and use cases will be covered during a separate speech dedicated to BSM. In addition, some of the BSM features are available in the latest Zabbix 6.0 LTS alpha release – with more to come as we continue working on the Zabbix 6.0 release.
Audit log redesign
The Audit log is another existing feature that has received a complete redesign. With the ability to log each and every change performed both by the Zabbix Server and Zabbix Frontend, the Audit log will become an invaluable source of audit information. Of course, the redesign also takes performance into consideration – the redesign was developed with the least possible performance impact in mind.
The audit log is constantly in development and the current Zabbix 6.0 LTS alpha release offers you an early look at the feature. We will also be covering the technical details of the new audit log implementation during the Summit and will explain how we are able to achieve minimal performance impact with major improvements to Zabbix audit logging.
Geographical maps
With Geographical maps, our users can finally display their entities on a geographical map based on the coordinates of the entity. Geographical maps can be used with multiple geographical map providers and display your hosts with their most severe problems. In addition, geographical maps will react dynamically to Zoom levels and support filtering.
The latest Zabbix 6.0 Alpha release includes the Geomap widget – feel free to deploy it in your QA environment, check out the different map providers, filter options and other great features that come with this widget.
Machine learning
When it comes to problem detection, Zabbix 6.0 LTS will deliver multiple new trend functions. A specific set of these functions provides machine learning functionality for anomaly detection and baseline monitoring.
The topic will be covered in-depth during the Zabbix Summit Online 2021. We will look at the configuration of the new functions and also take a deeper dive at the logic and algorithms used under the hood.
During the Zabbix Summit Online 2021, we will also cover many other new features, such as:
New Dashboard widgets
New items for Zabbix Agent
New templates and integrations
Zabbix login password complexity settings
Performance improvements for Zabbix Server, Zabbix Proxy, and Zabbix Frontend
UI and UX improvements
New history and trend functions
And more!
Not only will you get the chance to have an early look at many new features not yet available in the latest alpha release, but also you will have a great chance to learn the inner workings of the new features, the upgrade and migration process to Zabbix 6.0 LTS and much more!
We are extremely excited to share all of the new features with our community, so don’t miss out – take a look at the full Zabbix Summit online 2021 agenda and register for the event by visiting our Zabbix Summit page, and we will see you at the Zabbix Summit Online 2021 on November 25!
Zabbix API enables you to collect any and all information from your Zabbix instance by using a multitude of API methods. You can even utilize Zabbix API calls in your HTTP items. For example, this can be used to monitor the number of particular sets of metrics and visualize their growth over time. With named Zabbix API tokens, such use cases are a lot more simple to implement.
Before Zabbix 5.4 we had to perform the user.login API call to obtain the authentication token. Once the user session was closed, we had to relog, obtain the new authentication token and use it in the subsequent API calls.
With the pre-defined named Zabbix API tokens, you don’t have to constantly check if the authentication token needs to be updated. Starting from Zabbix 5.4 you can simply create a new named Zabbix API token with an expiration date and use it in your API calls.
Creating a new named Zabbix API token
The Zabbix API token creation process is extremely simple. All you have to do is navigate to Administration – General – API tokens and create a new API token. The named API tokens are created for a particular user and can have an optional expiration date and time – otherwise, the tokens are defined without an expiry date.
You can create a named API token in the API tokens section, under Administration – General
Once the Token has been created, make sure to store it somewhere safe, since you won’t be able to recover it afterward. If the token is lost – you will have to recreate it.
Make sure to store the auth token!
Don’t forget that when defining a role for a particular API user, we can restrict which API methods this user has access to.
Simplifying API tasks with the named API token
There are many different use cases where you could implement Zabbix API calls to collect some additional information. For this blog post, I will create an HTTP item that uses item.get API call to monitor the number of unsupported items.
To achieve that, I will create an HTTP item on a host (This can be the default Zabbix server host or a host dedicated to collecting metrics via Zabbix API calls) and provide the API call in the request body. Since the named API token now provides a static authentication token until it expires, I can simply use it in my API call without the need to constantly keep it updated.
An HTTP agent item that uses a Zabbix API call in its request body
HTTP item request body, which returns a count of unsupported items
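A hedged sketch of such a request body is shown below – it counts items in the unsupported state (state 1) and uses the named API token in the auth field; the token value is a placeholder.

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "countOutput": true,
        "filter": {
            "state": 1
        }
    },
    "auth": "<your named API token>",
    "id": 1
}

The API returns the count as a string inside the result field of the JSON response, which is why the regular expression preprocessing step described next is needed – a pattern along the lines of "result":"(\d+)" with \1 as the output would extract the numeric value.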
I will also use regular expression preprocessing to obtain the numeric value from the API call result – otherwise, we won’t be able to graph our value or calculate trends for it.
Regular expression preprocessing step to obtain a numeric value from our Zabbix API call result
Utilizing Zabbix API scripts in Actions
In one of our previous blog posts, we covered resolving problems automatically with the event.acknowledge API method. The logic defined in the blog post was quite complex since we needed to keep an eye out for the authentication tokens and use a custom script to keep them up to date. With named Zabbix API tokens, this use case is a lot more simple.
All I have to do is create an Action operation script containing my API call and pass it to an action operation.
Action operation script that invokes Zabbix event.acknowledge API method
curl -sk -X POST -H "Content-Type: application/json" -d "
{
    \"jsonrpc\": \"2.0\",
    \"method\": \"event.acknowledge\",
    \"params\": {
        \"eventids\": \"{EVENT.ID}\",
        \"action\": 1,
        \"message\": \"Problem resolved.\"
    },
    \"auth\": \"<Place your authentication token here>\",
    \"id\": 2
}" <Place your Zabbix API endpoint URL here>
Problem remediation script example
Now my problems will get closed automatically after the time period which I have defined in my action.
Action operation which runs our event.acknowledge Zabbix API script
These are but a few examples that we can now achieve by using API tokens. A lot of information can be obtained and filtered in a unique way via Zabbix API, thus providing you with a granular analysis of your monitored environment. If you have recently upgraded to Zabbix 5.4 or plan to upgrade to Zabbix 6.0 LTS in the future, I would recommend implementing named Zabbix API tokens to simplify your day-to-day workflow and consider the possibilities that this new feature opens up for you.
If you have any questions or if you wish to share your particular use case for data collection or task automation with Zabbix API – feel free to share them in the comments section below!
There are many design choices to consider when we build our monitoring environment for high-frequency monitoring. How to minimize performance impact? What are the data retention policies with storage space in mind? What are the available out-of-the-box features to solve these potential problems?
In this blog post, we will discuss when you should use preprocessing and when it is better to use the “Do not keep history” option for your metrics, and what are the pros and cons for both of these approaches.
Throttling and other preprocessing steps
We’ve discussed throttling previously as the go-to approach for high-frequency monitoring. Indeed, with throttling, you can discard repeated values and do so with a heartbeat. This is extremely useful with metrics that come as discrete values – service states, network port statuses, and so on.
Example of throttling with and without heartbeat
In addition, starting from Zabbix 4.2, all preprocessing can also be performed by Zabbix proxies. This means we can discard the repeated values before they ever reach the Zabbix server. This helps both with performance (fewer metrics to insert into the Zabbix server DB) and with reducing the DB size (fewer metrics stored in the DB), which also improves overall Zabbix performance.
There are a few caveats with this approach – since metrics get discarded before they reach the Zabbix server, triggers will not react to these metrics (this is where having a heartbeat is useful) and, since trends are calculated by the Zabbix server based on the received history data, there can be a lack of trend information for these metrics. Keep in mind that this applies not only to throttling preprocessing rules – any preprocessing can be done on the proxy and any preprocessing rule can be used to transform your data.
Understanding “Do not keep history” option
The behavior of the “Do not keep history” option, which we can define when configuring an item, is a bit different though. If an item is collected by a proxy and configured with “Do not keep history”, the history won’t always get discarded! There are a couple of reasons for this.
First off, let’s not forget that some of our values can populate host inventory! If the particular item is configured to populate an inventory field – it will be forwarded to the Zabbix server, but it will not get stored in the history tables.
If the item does not populate an inventory field, text data (character, log and text types) will indeed get discarded before reaching the Zabbix server, but numeric values – both float and integer – will still get forwarded to the server, because the server needs them to derive trend information. Mind that the numeric data will still not get stored in the history tables; only trends will be available for these items.
Note: This behavior has been properly implemented starting from Zabbix 5.2. See ZBX-17548
Setting the “Do not keep history” option for an item
Using trend functions with high-frequency monitoring
With the specifics of “Do not keep history” in mind, we should now recall that starting from Zabbix 5.2 we have trend functions available at our disposal!
Trend functions such as trendavg, trendcount, trendmax, trendmin and trendsum allow us to perform different kinds of trend calculations – from counting the number of trend values to retrieving min/max/avg trend values for a time period.
This means, that if we require only the metric trend for specific time periods (hours, days, weeks, etc) we can use these trend functions together with “Do not keep history” option, thus discarding unnecessary data and improving our Zabbix server performance!
There are two approaches to using trend functions:
If you wish to collect and display the trend data, you need to create the item which will collect the metrics (say, a net.if.in Agent item for collecting incoming network traffic) and then create a separate calculated item that uses the trend function to calculate the avg/min/max value for the trend over a time period. The original item can then have “Do not keep history” option selected for it.
trendavg item for calculating hourly trends from the net.if.in[ifHCInOctets.5] item
If you wish to simply define triggers and react on long-term trends and are not required to collect the trend values, then we can skip the creation of the calculated item and simply use the trend function on the original item in the trigger.
This trigger fires if the hourly average trend value exceeds 100M. Note: In this case only the original item is required.
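For illustration, hedged sketches of both approaches are given below using Zabbix 6.0 expression syntax (the host name is a placeholder, and the exact syntax differs slightly in the 5.2/5.4 releases): a calculated item formula that stores the hourly average, and a trigger expression that fires directly on the trend value.

# Calculated item formula (hourly average of the previous hour's trend data)
trendavg(/MyHost/net.if.in[ifHCInOctets.5],1h:now/h)

# Trigger expression that fires when the hourly average trend exceeds 100M
trendavg(/MyHost/net.if.in[ifHCInOctets.5],1h:now/h)>100M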
By combining these approaches in our environment – using preprocessing when we wish to discard or transform the data, and opting out of storing the history data whenever this is appropriate – we can minimize the performance impact on our Zabbix instance. Add a layer of distributed Zabbix proxies on top of this and you can truly achieve a large, scalable Zabbix infrastructure optimized for high-frequency ingestion and processing of your data.
There are many monitoring solutions and monitoring tools that you can use for different monitoring tasks. But in this post and video, we will focus only on Zabbix and the top five features making Zabbix the best choice to monitor your home office as well as enterprise instances or projects.
First, Zabbix is a free and open-source solution covered by General Public License (GPL) v2. This means that the Zabbix source code is readily available and can be redistributed or modified. With this in mind, you can always create your own version of Zabbix, if you’re willing to play around with the source code or have a great idea on how to improve the product.
Zabbix software properties
There are no paid versions of Zabbix, no paid functionality, and no hidden costs. You can monitor any number of devices and define your own data retention policies at no cost at all.
All of the latest features are absolutely free and available in the latest version of Zabbix. You can visit zabbix.com, click the Download button, choose the platform that is best for you and install Zabbix packages on it. Zabbix can be deployed on any kind of environment, be it a virtual machine, physical servers, cloud environments, or even a Docker container. After you have downloaded Zabbix, you are ready to go ahead with the latest Zabbix feature set.
Selecting the platform to install Zabbix on zabbix.com/download
Rich feature set
Zabbix is a fully enterprise-ready product with a wide set of features that you can use to achieve any of your monitoring tasks. As a tool, Zabbix is not focused on any single thing, offering users extreme flexibility. For instance, you can monitor Windows or Linux machines agentlessly or opt-in to install a Zabbix agent on them. On the other hand, to monitor network devices, SNMP monitoring might be the easiest approach. All it takes to start monitoring your end-points is creating an item, specifying the metric that you want to monitor together with the data collection interval, and you are good to go.
Configuring SNMP monitoring parameters
After we have collected the data, we can configure our problem thresholds (also known as triggers in Zabbix) by navigating to Configuration > Hosts > Triggers. Triggers are definitions of our problem thresholds – they define when you consider your metric to have crossed the threshold and how you recover from the problem.
There is a wide array of so-called trigger functions – these allow us to define thresholds in different ways. For example, we can analyze the last received value, averages, minimum and maximum values over some time, look for a specific string in a value and much, much more!
We also need to define a way of reacting to a problem – should we receive an e-mail if something goes wrong? Or maybe we want to try and remediate the issue automatically by executing a command or a script? This is where the so-called Actions come in. Actions are based on and/or conditions, that allow you to very granularly define how you’re going to react to a particular problem.
For example, you might define an action that states “If the trigger name contains ‘SNMPSim’, send an email or a mobile text message to our network administrator. If the problem still persists after 10 minutes, execute a locally stored script that should fix the problem.”
Trigger actions
Once you have defined your items, triggers and actions, it’s time to present this information in a user-friendly fashion. For this, you can create a set of multi-page dashboards where you can see all of your collected data by combining different dashboard widgets – a list of active problems, graphs of the collected metrics, and network maps that provide an overview of your infrastructure state.
The dashboard is completely dynamic and interactive — you can zoom in on any point in time in your graphs, create interactive map hierarchies, navigate to different sections of Zabbix and much more.
Zabbix also supports SLA monitoring for your IT business services. You can define your IT service trees, link them to existing triggers and have access to different SLA related views.
Configuring and monitoring SLA
Inventory collection and storage is also natively supported by Zabbix. You can collect any inventory information from your devices – your device serial numbers, locations, software versions and much more. This can be done in many different ways – the inventory information can be captured from the collected metrics, populated manually or updated by using the Zabbix API. This information can be used to access different inventory views and group your devices based on the collected inventory data.
You own your data
There are many different ways to deploy Zabbix. You can navigate to zabbix.com/download page and select the installation method that fits your requirements. For Zabbix packages you have the option to choose the required Zabbix version, select from multiple operating systems – from the Red Hat Enterprise Linux to Raspberry Pi, as well as the specific OS version, the database backend, and the web server backend. After everything is selected, you will be presented with a comprehensive list of commands that you can use to get Zabbix up and running in minutes.
Zabbix Packages
If you are interested in cloud deployments then you can also run Zabbix in many different cloud environments, such as AWS, Microsoft Azure, Google Cloud, DigitalOcean, Linode, Red Hat OpenShift, Oracle Cloud or Yandex Cloud. All of these options offer full Zabbix functionality with native cloud images.
Zabbix Cloud images
Docker images are also available for different Zabbix components. You can run a single component in a container or deploy the whole Zabbix architecture in a containerized environment. The Docker hub page contains a comprehensive list of environment variables and examples of how to deploy container images with a single command.
The quickest way to deploy Zabbix, especially in a PoC environment, would be the Zabbix Appliance. Zabbix Appliance is a virtual machine image with all of the Zabbix components already pre-configured for you. Simply download the image for the hypervisor of your choice, deploy it and you are good to go.
The Zabbix source code is also available for download for different Zabbix versions. This approach is useful for more exotic environments, where installing via packages is not an option.
Zabbix Sources
Agents are available for download via packages, but if packages are not an option, you can always download the precompiled agents for many different operating systems, including Windows.
Zabbix agents
No matter which option you choose, Zabbix LLC never has any access to your configuration or history data. You are fully in control of your deployment and the data in it. This way you have the guarantee that your data belongs only to you.
Balance of flexibility and simplicity
A good monitoring solution should be simple and approachable even for users who are not experts in monitoring, Linux systems, scripting or other DevOps-related skills. However, simplicity usually comes at a cost to functionality. If a tool focuses too much on simplicity, it inevitably restricts end users in the set of functionality available to them.
Zabbix provides a balance of simplicity and flexibility. While the amount of features in Zabbix may seem daunting at first, the flexibility it provides is the key benefit of Zabbix. With Zabbix, you can easily extend the out-of-the-box monitoring approaches with your custom monitoring methods. If you have in-house applications specific to your company, you can always extend the Zabbix monitoring functionality and create your own custom checks or use scripts or commands to collect the data. This way you can define custom methods not only for data collection, but also use scripting for robust automatic issue remediation.
This gives you a huge variety of features that you can utilize in Zabbix either by using out-of-the-box templates or defining your own custom checks. All of this can be done within a single central frontend.
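As a hedged illustration of how such a custom check might be wired up, the Zabbix agent’s UserParameter mechanism lets you map an item key to any command or script; the key name, file name and script path below are hypothetical.

# /etc/zabbix/zabbix_agentd.d/userparameter_custom.conf (hypothetical file name)
# Maps a custom item key to a local script whose output becomes the item value
UserParameter=custom.app.queue_length,/usr/local/bin/check_queue.sh

After restarting the agent, an item with the key custom.app.queue_length can be created on the host and used just like any built-in check.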
Even if you’re monitoring 50 branches in different countries within one Zabbix installation, a Zabbix administrator will be able to maintain and change the configuration — add new items, triggers, etc. Zabbix is also a great fit for multi-tenant environments. The robust permission and role schema enable you to define multiple Zabbix administrators that can have granular access to monitoring entities within their organization.
Commercial services
Open-source solutions are great as you can download the product at any time, irrespective of your goals and use them for both small home lab environments and large enterprise infrastructures. Similar to many open-source solutions, Zabbix also has a large and passionate international community of users ready to help you out on the official forums, different social networks, Zabbix subreddit and other communication channels.
If this is not sufficient and you’re still feeling overwhelmed by all of the available features and require additional help to deploy your environment with best practices in mind – this is where the Zabbix commercial services come into play.
Commercial services
The Zabbix team offers multiple commercial services, starting with multi-tier technical support. With technical support services, Zabbix experts will have your back, help you fix any issues and answer all of your questions 24/7.
The Zabbix team also offers consulting services, where you can bring any topic you wish to discuss to Zabbix experts — how to deploy Zabbix and start monitoring your infrastructure, whether Zabbix is able to cover all of your needs, help with tuning your Zabbix configuration and much more.
Turnkey solutions allow you to engage Zabbix professionals and build everything from scratch with best practices and scalability in mind.
Zabbix team can lend you a hand with Template building services for your custom in-house application.
The Zabbix team will document all of the performed steps, so you can have a clear view of what has been done and what was the reason behind it. You can utilize this knowledge down the line to learn and be able to follow the best practice approaches on your own.
Upgrade procedures can be extremely stressful – you may be worried about minimizing downtime, following your organizational SLA’s or maybe you simply aren’t sure how to properly perform an upgrade. Once again, the Zabbix team can do this for you, document it and guide you through the process so you can learn from it and do it yourself in the later versions.
Need help with troubleshooting an ongoing issue? Then the remote troubleshooting services are for you. Zabbix team can help you get to the bottom of any issues you may have with your Zabbix architecture.
Zabbix is an extremely fast-growing, enterprise-ready project with a vast set of functionality trusted by global brands, capable of supporting the collection of hundreds of thousands of metrics with real-time 24/7 data analysis, powerful visualization options, a robust permission schema, out-of-the-box reports and the ability to tailor the tool to your specific needs.
If you have never tried Zabbix, this might be the perfect moment to visit zabbix.com, click Download, deploy Zabbix in your local test environment and try monitoring a couple of hosts to get acquainted with the product. I am sure that you’re going to be more than satisfied with the results!
We use Kubernetes to run many of the diverse services that help us control Cloudflare’s edge. We have five geographically diverse clusters, with hundreds of nodes in our largest cluster. These clusters are self-managed on bare-metal machines which gives us a good amount of power and flexibility in the software and integrations with Kubernetes. However, it also means we don’t have a cloud provider to rely on for virtualizing or managing the nodes. This distinction becomes even more prominent when considering all the different reasons that nodes degrade. With self-managed bare-metal machines, the list of reasons that cause a node to become unhealthy include:
Hardware failures
Kernel-level software failures
Kubernetes cluster-level software failures
Degraded network communication
Software updates are required
Resource exhaustion1
Unhappy Nodes
We have plenty of examples of failures in the aforementioned categories, but one example has been particularly tedious to deal with. It starts with the following log line from the kernel:
unregister_netdevice: waiting for lo to become free. Usage count = 1
The issue is further observed with the number of network interfaces on the node owned by the Container Network Interface (CNI) plugin getting out of proportion with the number of running pods:
$ ip link | grep cali | wc -l
1088
This is unexpected as it shouldn’t exceed the maximum number of pods allowed on a node (we use the default limit of 110). While this issue is interesting and perhaps worthy of a whole separate blog, the short of it is that the Linux network interfaces owned by the CNI are not getting cleaned up after a pod terminates.
Some history on this can be read in a Docker GitHub issue. We found this seems to plague nodes with a longer uptime, and after rebooting the node it would be fine for about a month. However, with a significant number of nodes, this was happening multiple times per day. Each instance would need rebooting, which means going through our worker reboot procedure which looked like this:
Cordon off the affected node to prevent new workloads from scheduling on it.
Collect any diagnostic information for later investigation.
Drain the node of current workloads.
Reboot and wait for the node to come back.
Verify the node is healthy.
Re-enable scheduling of new workloads to the node.
While solving the underlying issue would be ideal, we needed a mitigation to avoid toil in the meantime — an automated node remediation process.
Existing Detection and Remediation Solutions
While not complicated, the manual remediation process outlined previously became tedious and distracting, as we had to reboot nodes multiple times a day. Some manual intervention is unavoidable, but for cases matching the following, we wanted automation:
Generic worker nodes
Software issues confined to a given node
Already researched and diagnosed issues
Limiting automatic remediation to generic worker nodes is important as there are other node types in our clusters where more care is required. For example, for control-plane nodes the process has to be augmented to check etcd cluster health and ensure proper redundancy for components servicing the Kubernetes API. We are also going to limit the problem space to known software issues confined to a node where we expect automatic remediation to be the right answer (as in our ballooning network interface problem). With that in mind, we took a look at existing solutions that we could use.
Node Problem Detector
Node problem detector is a daemon that runs on each node that detects problems and reports them to the Kubernetes API. It has a pluggable problem daemon system such that one can add their own logic for detecting issues with a node. Node problems are distinguished between temporary and permanent problems, with the latter being persisted as status conditions on the Kubernetes node resources.2
Draino and Cluster-Autoscaler
Draino, as its name implies, drains nodes, but does so based on Kubernetes node conditions. It is meant to be used with cluster-autoscaler, which can then add or remove nodes via the cluster plugins to scale node groups.
Kured
Kured is a daemon that looks at the presence of a file on the node to initiate a drain, reboot and uncordon of the given node. It uses a locking mechanism via the Kubernetes API to ensure only a single node is acted upon at a time.
Cluster-API
The Kubernetes cluster-lifecycle SIG has been working on the cluster-api project to enable declaratively defining clusters to simplify provisioning, upgrading, and operating multiple Kubernetes clusters. It has a concept of machine resources which back Kubernetes node resources and furthermore has a concept of machine health checks. Machine health checks use node conditions to determine unhealthy nodes and then the cluster-api provider is then delegated to replace that machine via create and delete operations.
Proof of Concept
Interestingly, with all the above except for Kured, there is a theme of pluggable components centered around Kubernetes node conditions. We wanted to see if we could build a proof of concept using the existing theme and solutions. For the existing solutions, draino with cluster-autoscaler didn’t make sense in a non-cloud environment like our bare-metal set up. The cluster-api health checks are interesting, however they require a more complete investment into the cluster-api project to really make sense. That left us with node-problem-detector and kured. Deploying node-problem-detector was simple, and we ended up testing a custom-plugin-monitor like the following:
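The original configuration isn’t reproduced in this excerpt, but a hedged sketch of a node-problem-detector custom plugin monitor for the ballooning-interface issue might look like the following; the source name, script path, intervals and condition names are illustrative assumptions.

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "5m",
    "timeout": "30s",
    "max_output_length": 120,
    "concurrency": 1
  },
  "source": "calico-interface-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "TooManyCalicoInterfaces",
      "reason": "CalicoInterfacesOK",
      "message": "node has a reasonable number of calico interfaces"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "TooManyCalicoInterfaces",
      "reason": "TooManyCalicoInterfaces",
      "path": "/config/check-calico-interfaces.sh",
      "timeout": "30s"
    }
  ]
}

The referenced script would simply count cali* interfaces (as in the ip link example above) and exit non-zero when the count crosses a threshold.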
With that in place, the actual remediation needed to happen. Kured seemed to do most everything we needed, except that it was looking at a file instead of Kubernetes node conditions. We hacked together a patch to change that and tested it successfully end to end — we had a working proof of concept!
Revisiting Problem Detection
While the above worked, we found that node-problem-detector was unwieldy because we were replicating our existing monitoring into shell scripts and node-problem-detector configuration. A 2017 blog post describes Cloudflare’s monitoring stack, although some things have changed since then. What hasn’t changed is our extensive usage of Prometheus and Alertmanager.
For the network interface issue and other issues we wanted to address, we already had the necessary exported metrics and alerting to go with them. Here is what our already existing alert looked like3:
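The original rule isn’t shown in this excerpt; a hedged reconstruction of what such a Prometheus alerting rule might have looked like is below – the expression, threshold, duration and chat receiver value are illustrative assumptions (and label names depend on your relabeling setup), while the notify label itself is the part referenced next.

- alert: CalicoTooManyInterfaces
  expr: count by (node, instance) (node_network_info{device=~"cali.*"}) > 500
  for: 1h
  labels:
    notify: chat-sre
  annotations:
    summary: "Kubernetes node {{ $labels.node }} has too many Calico interfaces"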
Note that we use a “notify” label to drive the routing logic in Alertmanager. However, that got us asking, could we just route this to a Kubernetes node condition instead?
Introducing Sciuro
Sciuro is our open-source replacement of node-problem-detector that has one simple job: synchronize Kubernetes node conditions with currently firing alerts in Alertmanager. Node problems can be defined with a more holistic view and using already existing exporters such as node exporter, cadvisor or mtail. It also doesn’t run on affected nodes which allows us to rely on out-of-band remediation techniques. Here is a high level diagram of how Sciuro works:
Starting from the top, nodes are scraped by Prometheus, which collects those metrics and fires relevant alerts to Alertmanager. Sciuro polls Alertmanager for alerts with a matching receiver, matches them with a corresponding node resource in the Kubernetes API and updates that node’s conditions accordingly.
In more detail, we can start by defining an alert in Prometheus like the following:
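The exact rule isn’t reproduced here either, but a hedged sketch consistent with the description below might be (threshold, duration and label plumbing are assumptions):

- alert: CalicoTooManyInterfacesEarly
  expr: count by (node, instance) (node_network_info{device=~"cali.*"}) > 150
  for: 30m
  labels:
    notify: node-condition-k8s
    priority: "6"
  annotations:
    summary: "Kubernetes node {{ $labels.node }} has too many Calico interfaces"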
Note the two differences from the previous alert. First, we use a new name with a more sensitive trigger. The idea is that we want automatic node remediation to try fixing the node first as soon as possible, but if the problem worsens or automatic remediation is failing, humans will still get notified to act. The second difference is that instead of notifying chat rooms, we route to a target called “node-condition-k8s”.
Sciuro then comes into play, polling the Alertmanager API for alerts matching the “node-condition-k8s” receiver. The following shows the equivalent using amtool:
$ amtool alert query -r node-condition-k8s
Alertname Starts At Summary
CalicoTooManyInterfacesEarly 2021-05-11 03:25:21 UTC Kubernetes node worker1a has too many Calico interfaces
Note the node and instance labels which Sciuro will use for matching with the corresponding Kubernetes node. Sciuro uses the excellent controller-runtime to keep track of and update node resources in the Kubernetes API. We can observe the updated node condition on the status field via kubectl:
$ kubectl get node worker1a -o json | jq '.status.conditions[] | select(.type | test("^AlertManager"))'
{
  "lastHeartbeatTime": "2021-05-11T03:31:20Z",
  "lastTransitionTime": "2021-05-11T03:26:53Z",
  "message": "[P6] Kubernetes node worker1a has too many Calico interfaces",
  "reason": "AlertIsFiring",
  "status": "True",
  "type": "AlertManager_CalicoTooManyInterfacesEarly"
}
One important note is Sciuro added the AlertManager_ prefix to the node condition type to prevent conflicts with other node condition types. For example, DiskPressure, a kubelet managed condition, could also be an alert name. Sciuro will also properly update heartbeat and transition times to reflect when it first saw the alert and its last update. With node conditions synchronized by Sciuro, remediation can take place via one of the existing tools. As mentioned previously we are using a modified version of Kured for now.
We’re happy to announce that we’ve open sourced Sciuro, and it can be found on GitHub where you can read the code, find the deployment instructions, or open a Pull Request for changes.
Managing Node Uptime
While we began using automatic node remediation for obvious problems, we’ve expanded its purpose to additionally keep node uptime low. Low node uptime is desirable to further reduce drift on nodes, keep the node initialization process well-oiled, and encourage the best deployment practices on the Kubernetes clusters. To expand on the last point, services that are deployed with best practices and in a high availability fashion should see negligible impact when a single node leaves the cluster. However, services that are not deployed with best practices will most likely have problems especially if they rely on singleton pods. By draining nodes more frequently, it introduces regular chaos that encourages best practices. To enable this with automatic node remediation the following alert was defined:
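The rule itself isn’t included in this excerpt; a simplified, hedged sketch is shown below. The real expression also joins against the kube_node_roles metric to restrict the alert to generic worker nodes (the “juggling” described next), and the alert name, threshold and duration here are assumptions.

- alert: NodeUptimeTooHigh
  # Simplified: the join with kube_node_roles that limits this to generic workers is omitted
  expr: (time() - node_boot_time_seconds) > (30 * 24 * 3600)
  for: 6h
  labels:
    notify: node-condition-k8s
    priority: "9"
  annotations:
    summary: "Node {{ $labels.instance }} has been up for more than 30 days"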
There is a bit of juggling with the kube_node_roles metric in the above to isolate the alert to generic worker nodes, but at a high level it looks at node_boot_time_seconds, a metric from prometheus node_exporter. Again the notify label is configured to send to node conditions which kicks off the automatic node remediation. One further detail is the priority here is set to “9” which is of lower precedence than our other alerts. Note that the message field of the node condition is prefixed with the alert priority in brackets. This allows the remediation process to take priority into account when choosing which node to remediate first, which is important because Kured uses a lock to act on a single node at a time.
Wrapping Up
In the past 30 days, we’ve used the above automatic node remediation process to action 571 nodes. That has saved our humans a considerable amount of time. We’ve also been able to reduce the time to repair for some issues as automatic remediation can act at all times of the day and with a faster response time.
As mentioned before, we’re open sourcing Sciuro and its code can be found on GitHub. We’re open to issues, suggestions, and pull requests. We do have some ideas for future improvements. For Sciuro, we may look to reduce latency, which is mainly due to polling, and potentially add a push model from Alertmanager, although this isn’t a need we’ve had yet. For the larger node remediation story, we hope to do an overhaul of the remediating component. As mentioned previously, we are currently using a fork of kured, but a future replacement component should include the following:
Use out-of-band management interfaces to be able to shut down and power on nodes without a functional operating system.
Move from decentralized architecture to a centralized one that can integrate more complicated logic. This might include being able to act on entire failure domains in parallel.
Handle specialized nodes such as masters or storage nodes.
Finally, we’re looking for more people passionate about Kubernetes to join our team. Come help us push Kubernetes to the next level to serve Cloudflare’s many needs!
1 Exhaustion can be applied to hardware resources, kernel resources, or logical resources like the amount of logging being produced.
2 Nearly all Kubernetes objects have spec and status fields. The status field is used to describe the current state of an object. For node resources, typically the kubelet manages a conditions field under the status field for reporting things like if the node is ready for servicing pods.
3 The format of the following alert is documented on Prometheus Alerting Rules.
This post is written by Sebastian Lee, Solution Architect, Startup Singapore.
Amazon Lightsail is a great starting point for those looking to get started on AWS. Lightsail is ideal for startups, SMBs, and hobbyist developers because it simplifies the deployment of instances, databases, load-balancers, CDNs, and even containers. However, you cannot track metrics beyond CPU utilization, network utilization, and error messages. Many startups and small businesses need to review more metrics like memory usage and disk usage.
In this blog, I walk through the steps to configure a Lightsail instance to send memory usage to Amazon CloudWatch for monitoring, alarming and notifications.
Product and Solution Overview
Amazon CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site-reliability engineers and IT managers. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events. It provides a unified view of your AWS resources, applications and services that run on AWS and on-premise servers. You can configure your Lightsail resources to work with Amazon CloudWatch to receive more metrics.
The following sections include steps to install a CloudWatch agent on your Amazon Lightsail instance and configure it to have the necessary permission to send memory usage metrics to Amazon CloudWatch.
Prerequisites
Before you begin the walkthrough, you must have an instance running in your Lightsail account. You can follow the steps here if you need help creating an instance.
Walkthrough
1. Create IAM user
First, you must create an IAM user to provide permission to send data to CloudWatch.
Sign in to the AWS Management Console and open the IAM console.
In the navigation pane, choose Users, and then choose Add user.
Enter “lightsail-cloudwatch-agent” in the User name text box.
For Access type, select Programmatic access, and then choose Next: Permissions.
For Set permissions, choose Attach existing policies directly.
In the list of policies, select the check box next to CloudWatchAgentServerPolicy. You can use the search text box to find the policy.
Choose Next: Tags.
Optionally, you can add one or more tag-key value pairs to organize, track, or control access for this role, and then choose Next: Review.
Confirm that the correct policies are listed, and then choose Create user.
In the row for the new user, choose Show. Copy the access key and secret key to a file so that you can use them when installing the agent.
Important: You will not be able to copy the secret key after leaving this page. If you lose it, you will have to create a new one.
Choose Close.
Now that you created an IAM user, you can SSH into your Lightsail instance.
2. SSH into Amazon Lightsail instance
You can connect to your instance using the browser-based SSH client available in the Lightsail console, or by using your own SSH client with the SSH key of your instance.
Complete the following steps to connect to your instance using the browser-based SSH client in the Lightsail console:
Click the terminal icon, next to the instance, as shown in the following screenshot.
3. Installing the CloudWatch agent
Now that you have SSH’d into your instance, you are ready to install the CloudWatch agent. The CloudWatch agent is available as a package on Amazon Linux 2 instances. For other operating systems, see Download and configure the CloudWatch agent using the command line.
Enter the following command to install the CloudWatch agent on a Linux instance.
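The command itself isn’t shown in this excerpt; on Amazon Linux 2 (the case this walkthrough assumes), it would be along these lines:

sudo yum install -y amazon-cloudwatch-agent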
Configure the AWS CLI credentials on the instance (see the command below) and follow the prompts to enter the access key ID and secret access key you copied earlier in this tutorial.
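A hedged example of the command that produces the prompts below; the profile name follows the CloudWatch agent’s shared-credentials convention and is an assumption – adjust it to your own setup.

sudo aws configure --profile AmazonCloudWatchAgent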
AWS Access Key ID [None]: <Enter the access key from step 1>
AWS Secret Access Key [None]: <Enter the secret key from step 1>
Default region name [None]:
Default output format [None]:
5. Create CloudWatch configuration file to collect memory usage metrics
To tell CloudWatch agent to collect memory usage metrics, you will need to create a CloudWatch config file.
Enter the following command to create a config file for the CloudWatch agent.
> sudo vim /opt/aws/amazon-cloudwatch-agent/bin/config.json
Press “I” to enter insert mode in Vim, and paste the following text into the file.
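The configuration isn’t reproduced in this excerpt; a minimal sketch that collects memory usage and adds the instance dimensions referenced later in this walkthrough is shown below (the 60-second collection interval is an assumption).

{
  "metrics": {
    "append_dimensions": {
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      }
    }
  }
}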
Press “ESC”, and then type “:wq!” to save the file and exit Vim.
7. Start CloudWatch agent
Now the necessary configuration for the CloudWatch agent is set up. Let’s start the agent.
Enter the following command to start the CloudWatch agent.
> sudo amazon-cloudwatch-agent-ctl -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -a fetch-config -s
****** processing cwagent-otel-collector ******
cwagent-otel-collector will not be started as it has not been configured yet.
****** processing amazon-cloudwatch-agent ******
…
Redirecting to /bin/systemctl restart amazon-cloudwatch-agent.service
Enter the following command to verify that the CloudWatch agent is running.
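One way to do this (a hedged suggestion) is the agent’s control script, which reports the agent status:

sudo amazon-cloudwatch-agent-ctl -a status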
Open the CloudWatch console and choose Metrics. Under “Custom Namespaces”, you should see a link for “CWAgent”.
Choose CWAgent.
Choose ImageId, InstanceId, InstanceType.
Select the checkbox next to a metric to display it on the graph.
In addition, you can create a CloudWatch alarm to monitor the memory usage metrics to automatically send you a notification when the metric reaches a threshold you specify. To create an alarm in CloudWatch, you can follow this guide.
Conclusion
In this blog, I covered how you can install the CloudWatch agent on your Amazon Lightsail instance to send memory metrics to Amazon CloudWatch. For more information on additional metrics and logs supported by the CloudWatch agent, see the CloudWatch User Guide.
To get started with Amazon Lightsail, check out our getting started page for more tutorials and resources.
This post was written by Dean Suzuki, Senior Solutions Architect.
Customers who run Windows or Linux instances on AWS frequently ask, “How do I know if my disks are almost full?” or “How do I know if my application is using all the available memory and is paging to disk?” This blog helps answer these questions by walking you through how to set up monitoring to capture these internal performance metrics.
Solution overview
If you open the Amazon EC2 console, select a running Amazon EC2 instance, and select the Monitoring tab, you can see Amazon CloudWatch metrics for that instance. Amazon CloudWatch is an AWS monitoring service. The Monitoring tab (shown in the following image) shows the metrics that can be measured external to the instance (for example, CPU utilization, network bytes in/out). However, to understand what percentage of the disk or the memory is being used, these metrics require an internal operating system view of the instance. AWS places an extra safeguard on gathering data inside a customer’s instance, so this capability is not enabled by default.
To capture the server’s internal performance metrics, a CloudWatch agent must be installed on the instance. For Windows, the CloudWatch agent can capture any of the Windows performance monitor counters. For Linux, the CloudWatch agent can capture system-level metrics. For more details, please see Metrics Collected by the CloudWatch Agent. The agent can also capture logs from the server. The agent then sends this information to Amazon CloudWatch, where rules can be created to alert on certain conditions (for example, low free disk space) and automated responses can be set up (for example, perform backup to clear transaction logs). Also, dashboards can be created to view the health of your Windows servers.
There are four steps to implement internal monitoring:
Install the CloudWatch agent onto your servers. AWS provides a service called AWS Systems Manager Run Command, which enables you to do this agent installation across all your servers.
Run the CloudWatch agent configuration wizard, which captures what you want to monitor. These items could be performance counters and logs on the server. This configuration is then stored in AWS Systems Manager Parameter Store.
Configure CloudWatch agents to use agent configuration stored in Parameter Store using the Run Command.
Validate that the CloudWatch agents are sending their monitoring data to CloudWatch.
The following image shows the flow of these four steps.
In this blog, I walk through these steps so that you can follow along. Note that you are responsible for the cost of running the environment outlined in this blog. So, once you are finished with the steps in the blog, I recommend deleting the resources if you no longer need them. For the cost of running these servers, see Amazon EC2 On-Demand Pricing. For CloudWatch pricing, see Amazon CloudWatch pricing.
The first step is to deploy the Amazon CloudWatch agent. There are multiple ways to deploy the CloudWatch agent (see this documentation on Installing the CloudWatch Agent). In this blog, I walk through how to use the AWS Systems Manager Run Command to deploy the agent. AWS Systems Manager uses the Systems Manager agent, which is installed by default on each AWS instance. This AWS Systems Manager agent must be given the appropriate permissions to connect to AWS Systems Manager, and to write the configuration data to the AWS Systems Manager Parameter Store. These access rights are controlled through the use of IAM roles.
Create two IAM roles
IAM roles are identity objects to which you attach IAM policies. IAM policies define what access is allowed to AWS services. You can have users, services, or applications assume the IAM roles and get the assigned rights defined in the permissions policies.
To use System Manager, you typically create two IAM roles. The first role has permissions to write the CloudWatch agent configuration information to System Manager Parameter Store. This role is called CloudWatchAgentAdminRole.
The second role only has permissions to read the CloudWatch agent configuration from the System Manager Parameter Store. This role is called CloudWatchAgentServerRole.
Once you create the roles, you attach them to your Amazon EC2 instances. By attaching the IAM roles to the EC2 instances, you provide the processes running on the EC2 instance the permissions defined in the IAM role. In this blog, you create two Amazon EC2 instances. Attach the CloudWatchAgentAdminRole to the first instance that is used to create the CloudWatch agent configuration. Attach CloudWatchAgentServerRole to the second instance and any other instances that you want to monitor. For details on how to attach or assign roles to EC2 instances, please see the documentation on How do I assign an existing IAM role to an EC2 instance?.
Install the CloudWatch agent
Now that you have setup the permissions, you can install the CloudWatch agent onto the servers that you want to monitor. For details on installing the CloudWatch agent using Systems Manager, please see the documentation on Download and Configure the CloudWatch Agent.
Create the CloudWatch agent configuration
Now that you installed the CloudWatch agent on your server, run the CloudWatch agent configuration wizard to create the agent configuration. For instructions on how to run the CloudWatch agent configuration wizard, please see this documentation on Create the CloudWatch Agent Configuration File with the Wizard. To establish a command shell on the server, you can use AWS Systems Manager Session Manager to establish a session to the server and then run the CloudWatch agent configuration wizard. If you want to monitor both Linux and Windows servers, you must run the CloudWatch agent configuration wizard on a Linux instance and on a Windows instance to create a configuration file per OS type. The configuration is unique to the OS type.
To run the Agent configuration wizard on Linux instances, run the following command:
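The command isn’t shown in this excerpt; for a standard package installation the wizard lives under the agent’s installation directory, so it would typically be:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard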
To run the Agent configuration wizard on Windows instances, run the following commands:
cd "C:\Program Files\Amazon\AmazonCloudWatchAgent"
amazon-cloudwatch-agent-config-wizard.exe
Note for Linux instances: do not select to collect the collectd metrics in the agent configuration wizard unless you have collectd installed on your Linux servers. Otherwise, you may encounter an error.
Review the Agent configuration
The CloudWatch agent configuration generated from the wizard is stored in Systems Manager Parameter Store. You can review and modify this configuration if you need to capture extra metrics. To review the agent configuration, perform the following steps:
Open the AWS Systems Manager console and click Parameter Store in the left-hand navigation.
You should see the parameter that was created by the CloudWatch agent configuration program. For Linux servers, the configuration is stored in: AmazonCloudWatch-linux and for Windows servers, the configuration is stored in: AmazonCloudWatch-windows.
Click on the parameter’s hyperlink (for example, AmazonCloudWatch-linux) to see all the configuration parameters that you specified in the configuration program.
In the following steps, I walk through an example of modifying the Windows configuration parameter (AmazonCloudWatch-windows) to add an additional metric (“Available Mbytes”) to monitor.
Click the AmazonCloudWatch-windows parameter.
In the parameter overview, scroll down to the “metrics” section and under “metrics_collected”, you can see the Windows performance monitor counters that will be gathered by the CloudWatch agent. If you want to add an additional perfmon counter, then you can edit and add the counter here.
Press Edit at the top right of the AmazonCloudWatch-windows Parameter Store page.
Scroll down in the Value section and look for “Memory.”
After “% Committed Bytes In Use”, put a comma “,” and then press Enter to add a blank line. Then, on that line, put “Available Mbytes”. The following screenshot demonstrates what this configuration should look like.
Press Save Changes.
To modify the Linux configuration parameter (AmazonCloudWatch-linux), you perform similar steps except you click on the AmazonCloudWatch-linux parameter. Here is additional documentation on creating the CloudWatch agent configuration and modifying the configuration file.
Start the CloudWatch agent and use the configuration
In this step, start the CloudWatch agent and instruct it to use the agent configuration stored in Systems Manager Parameter Store.
Open another tab in your web browser and go to the Systems Manager console.
Choose Run Command in the left-hand navigation of the Systems Manager console.
Press Run Command
In the search bar:
Select Document name prefix
Select Equal
Specify AmazonCloudWatch (note that the field is case-sensitive)
Press Enter
Select AmazonCloudWatch-ManageAgent. This is the command that configures the CloudWatch agent.
In the command parameters section,
For Action, select Configure
For Mode, select ec2
For Optional Configuration Source, select ssm
For Optional Configuration Location, specify the Parameter Store name: AmazonCloudWatch-windows for Windows instances or AmazonCloudWatch-linux for Linux instances. Note that the field is case-sensitive. This tells the command to read the configuration from the Parameter Store parameter specified here.
For optional restart, leave yes
For Targets, choose your target servers that you wish to monitor.
Scroll down and press Run. The Run Command may take a couple of minutes to complete; press the refresh button to check its progress. The command configures the CloudWatch agent by reading the configuration from Parameter Store and applying those settings. If you prefer to script this step, an equivalent AWS CLI invocation is sketched below.
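This sketch uses the AWS-managed AmazonCloudWatch-ManageAgent document with its standard parameters; the instance ID is a placeholder, and you would specify AmazonCloudWatch-linux instead for Linux targets:
aws ssm send-command \
    --document-name AmazonCloudWatch-ManageAgent \
    --targets Key=InstanceIds,Values=<your-instance-id> \
    --parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["AmazonCloudWatch-windows"],"optionalRestart":["yes"]}'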
In the CloudWatch console, choose Metrics, and then All metrics. You should see a custom namespace named CWAgent; click on CWAgent. Please note that this namespace might take a couple of minutes to appear, so refresh the page periodically until it does.
Then click the ImageId, InstanceId hyperlinks to see the counters under that section.
Review the metrics captured by the CloudWatch agent. Notice the metrics that are only observable from inside the instance (for example, LogicalDisk % Free Space). These types of metrics would not be observable without installing the CloudWatch agent on the instance. From these metrics, you could create a CloudWatch Alarm to alert you if they go beyond a certain threshold. You can also add them to a CloudWatch Dashboard to review. To learn more about the metrics collected by the CloudWatch agent, see the documentation Metrics Collected by the CloudWatch Agent.
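As an illustration of that alarm idea, here is a hedged AWS CLI sketch. The metric name, dimensions, threshold, and SNS topic ARN are all assumptions for illustration; replace them with the exact names and values you see under the CWAgent namespace in your own account, because an alarm only matches a metric when every dimension matches:
aws cloudwatch put-metric-alarm \
    --alarm-name low-disk-free-space \
    --namespace CWAgent \
    --metric-name "LogicalDisk % Free Space" \
    --dimensions Name=InstanceId,Value=<your-instance-id> Name=instance,Value=C: Name=objectname,Value=LogicalDisk \
    --statistic Average \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 10 \
    --comparison-operator LessThanOrEqualToThreshold \
    --alarm-actions <your-sns-topic-arn>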
Conclusion
In this blog, you learned how to deploy and configure the CloudWatch agent to capture metrics on either Linux or Windows instances. When you are done with this walkthrough, we recommend deleting the Systems Manager Parameter Store entries, the CloudWatch data, and the EC2 instances to avoid further charges. If you would like a video tutorial of this process, please see the Monitoring Amazon EC2 Windows Instances using Unified CloudWatch Agent video.
Many Amazon Web Services (AWS) customers need enhanced insight into IP network flow. Traditionally, cost, the complexity of collection, and the time required for analysis have led to incomplete investigations of network flows. Having good telemetry is paramount, and VPC Flow Logs are a very important part of a robust centralized logging architecture. The information that VPC Flow Logs provide is frequently used by security analysts to determine the scope of security issues, to validate that network access rules are working as expected, and to help analysts investigate issues and diagnose network behaviors. Flow logs capture information about the IP traffic going to and from EC2 network interfaces in a VPC. Each record describes aspects of the traffic flow, such as where it originated, where it was sent, which network ports were used, and how many bytes were sent.
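For context, a representative flow log record in the default format (example values) looks like the following; it captures the source and destination addresses and ports, the protocol, packet and byte counts, the capture window, and whether the traffic was accepted or rejected:
2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK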
Amazon Detective now enables you to interactively examine the details of the virtual private cloud (VPC) network flows of your Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon Detective makes it easy to analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities. Detective automatically collects VPC flow logs from your monitored accounts, aggregates them by EC2 instance, and presents visual summaries and analytics about these network flows. Detective doesn’t require VPC Flow Logs to be configured and doesn’t impact existing flow log collection.
In this blog post, I describe how to use the new VPC flow feature in Detective to investigate an UnauthorizedAccess:EC2/TorClient finding from Amazon GuardDuty. Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts, workloads, and data stored in Amazon S3. GuardDuty documentation states that this alert can indicate unauthorized access to your AWS resources with the intent of hiding the unauthorized user’s true identity. I’ll demonstrate how to use Amazon Detective to investigate an instance that was flagged by Amazon GuardDuty to determine whether it is compromised or not.
Starting the investigation in GuardDuty
In my GuardDuty console, I’m going to select the UnauthorizedAccess:EC2/TorClient finding shown in Figure 1, choose the Actions menu, and select Investigate.
Figure 1: Investigating from the GuardDuty console
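If you prefer to locate findings of this type programmatically before pivoting to Detective, you can filter for them with the AWS CLI; this is a minimal sketch in which the detector ID is a placeholder:
aws guardduty list-findings \
    --detector-id <your-detector-id> \
    --finding-criteria '{"Criterion":{"type":{"Eq":["UnauthorizedAccess:EC2/TorClient"]}}}'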
This opens a new browser tab and launches the Amazon Detective console, where I’m presented with the profile page for this finding, shown in Figure 2. You must have Detective enabled to pivot between a GuardDuty finding and Detective. Detective provides profile pages for supported GuardDuty findings and AWS resources (for example, IP address, EC2 instance, user, and role) that include information and data visualizations that summarize observed behaviors and give guidance for interpreting them. Profiles help analysts to determine whether the finding is of genuine concern or a false positive. For resources, profiles provide supporting details for an investigation into a finding or for a general hunt for suspicious activity.
Figure 2: Finding profile panel
In this case, the profile page for this GuardDuty UnauthorizedAccess:EC2/TorClient finding provides contextual and behavioral data about the EC2 instance on which GuardDuty has noted the issue. As I dive into this finding, I’m going to be asking questions that help assess whether the instance was in fact accessed unintentionally, such as, “What IP port or network service was in use at that time?,” “Were any large data transfers involved?,” “Was the traffic allowed by my security groups?” Profile pages in Detective organize content that helps security analysts investigate GuardDuty findings, examine unexpected network behavior, and identify other AWS resources that might be affected by a potential security issue.
I begin scrolling down the page and notice the Findings associated with EC2 instance i-9999999999999999 panel. Detective displays related findings to provide analysts with additional evidence and context about potentially related issues. The finding I'm investigating is listed there, as well as an Unusual Behaviors/VM/Behavior:EC2-NetworkPortUnusual finding. GuardDuty builds a baseline of your network traffic and generates findings when traffic falls outside the calculated norm. While we might not investigate every instance of anomalous traffic, having these alerts correlated by Detective provides context for validating the issue. Keeping this in mind as I scroll down, at the bottom of this profile page I find the Overall VPC flow volume panel. If you choose the Info link next to the panel title, you can see helpful tips that describe how to use the visualizations and provide ideas for questions to ask within your investigation. These info links are available throughout Detective. Check them out!
Investigating VPC flow in Detective
In this investigation, I’m very curious about the two large spikes in inbound traffic that I see in the Overall VPC flow volume panel, which seem to be visually associated with some unusual outbound traffic spikes. It’s most likely that these outbound spikes are related to the Unusual Behaviors/VM/Behavior:EC2-NetworkPortUnusual finding I mentioned earlier. To start the investigation, I choose the display details for scope time button, shown circled at the bottom of Figure 2. This expands the VPC Flow Details, shown in Figure 3.
Figure 3: Our first look at VPC Flow Details
We now can see that each entry displays the volume of inbound traffic, the volume of outbound traffic, and whether the access request was accepted or rejected. Detective provides annotations on the VPC flows to help guide your investigation. These From finding annotations make it clear which flows and resources were involved in the finding. In this case, we can easily see (in Figure 3) the three IP addresses at the top of the list that triggered this GuardDuty finding.
I’m first going to focus on the spikes in traffic that are above the baseline. When I click on one of the spikes in the graph, the time window for the VPC flow activity now matches the dates of these spikes I’m investigating.
If I choose the Inbound Traffic column header, shown in Figure 4, I can find the flows that contributed to the spike during this time window.
Figure 4: Inbound traffic spikes
Note that the two large inbound spikes aren’t associated with the IP address from the UnauthorizedAccess:EC2/TorClient finding, based on the Detective annotation From finding. Let’s check the outbound traffic. If I do a quick sort of the table based on the outbound traffic column, as shown in Figure 5, we can also see the outbound spikes, and it isn’t immediately evident whether the spikes are associated with this finding. I could continue to investigate the spikes (because they are a visual anomaly), or focus just on the VPC flow traffic that GuardDuty and Detective have labeled as associated with this TOR finding.
Figure 5: Outbound traffic spikes
Let's focus on the outbound and inbound spikes and see if we can determine what's happening. The inbound spikes are on port 443, typically an HTTPS port, or a secure web connection. The outbound spikes are on port 22 (SSH), but go to IP addresses that appear to be internal, based on their 172.16.x.x addresses (a private RFC 1918 range). The port 443 traffic might indicate a web server that's open to the internet and receiving traffic. With further investigation, we can determine whether this idea is valid and continue hunting for potentially malicious traffic.
A good next step would be to investigate the two specific IP addresses to rule out their involvement in the finding. I can do this by right-clicking on either of the external IP addresses and opening a new tab, where I can focus on investigating these two specific IP addresses. I would take this line of investigation to possibly rule out the involvement of these IP addresses in this finding, determine if they regularly communicate with my resources, find out what instance(s) they’re related to, and see if there are other findings associated with these instances or IP addresses. This deeper investigation is outside the scope of this blog post, but it’s something you should be doing in your own environment.
IP addresses in AWS are ephemeral in nature. The unique identifier in VPC flow logs is the Instance ID. At the time of this investigation, 172.16.0.7 is assigned to the instance related to this finding, so let’s continue to take a look at the internal 172.16.0.7 IP address with 218 MB outbound traffic on port 22. I choose 172.16.0.7, and Detective opens up the profile page for this specific IP address, as shown in Figure 6. Here we see some interesting correlations: two other GuardDuty findings related to SSH brute-force attacks. These could be related to our outbound port 22 spikes, because they’re certainly in the window of time we’re investigating.
Figure 6: IP address profile panel
As part of a deeper investigation, you would look into the SSH brute-force findings for 198.51.100.254 and 203.0.113.83, but for now I'm interested in what this IP address is involved in. Detective easily associates this 172.16.0.7 IP address with the instance that was assigned the IP during the scope time. I scroll down to the bottom of the profile page for 172.16.0.7 and investigate the i-9999999999999999 instance by choosing the instance name.
Filtering VPC flow activity
As the investigator, I'm now looking at an instance profile panel in Detective, similar to the one in Figure 2. Because we're interested in VPC flow details, I scroll down and select display details for scope time.
To focus on specific activity, I can filter the activity details by the following values:
IP address
Local or remote port
Direction
Protocol
Whether the request was accepted or rejected
I’m going to filter these VPC flow details and just look at port 22 (sshd) inbound traffic. I select the Filter check box and select Local Port and 22, as shown in Figure 7. Detective fills in all the available ports for you, making it easy to complete this filter.
Figure 7: Port 22 traffic
The activity details show a few IP addresses related to port 22, and we’re still following the large outbound spikes of traffic. It’s outside the scope of this blog post, but now it would be time to start looking at your security groups and network access control lists (ACLs) and determine why port 22 is open to the internet and sending all this traffic.
Understanding traffic behavior
As an investigator, I now have a good picture of the traffic related to the initial finding, and by diving deeper we’re able to discover other interesting traffic during the same timeframe. While we may not always determine “who has done it,” the goal should be to improve our understanding of the behavior of our environment and gather important technical evidence. Detective helps you identify and investigate anomalies to give you insight into your environment. If we were to continue our investigation into the finding, here are some actions we can take within Detective.
Investigate VPC findings with Detective:
Perform ports and utilization analysis
Identify service and ephemeral ports
Determine whether traffic was accepted or rejected based on security groups and NACL configurations
Investigate possible reconnaissance traffic by exploring the significant amount of rejected traffic
Correlate EC2 instances to TCP/IP ports and IPs
Analyze traffic spikes and anomalies
Discover traffic patterns and make behavioral correlations
Explore EC2 instance behavior with Detective:
Perform directional traffic analysis
Investigate possible data exfiltration events by digging into large transfers
Enumerate distinct IP connections and sort and filter by protocol, amount of traffic, and traffic direction
Gather data related to a spike in port count from a single IP address (potential brute force) or multiple IP addresses (distributed denial of service (DDoS))
Additional forensics steps to consider
Snapshot EC2 Volumes
Memory dump of EC2 instance
Isolate EC2 instance
Review your authentication strategy and assess whether the chosen authentication method is sufficient to protect your asset
Summary
Without requiring you to set up infrastructure or spend time configuring log ingestion, Detective collects, organizes, and presents relevant data for your threat analysis and investigations. Security and operations teams will find this new capability helpful for simplifying EC2 traffic analysis, validating security group permissions, and diagnosing EC2 instance behavior. Detective does the heavy lifting of storing and analyzing VPC flow data so that you can focus on quickly answering your investigative questions. VPC network flow details are available now in all Regions where Detective is supported and are included as part of your service subscription.
Are you a visual learner? Check out Amazon Detective Overview and Demonstration. This video helps you learn how and when to use Amazon Detective to improve the security of your AWS resources.