Tag Archives: monitoring

Free network flow monitoring for all enterprise customers

2024-03-07 Chris Draper

Post Syndicated from Chris Draper original https://blog.cloudflare.com/free-network-monitoring-for-enterprise

A key component of effective corporate network security is establishing end to end visibility across all traffic that flows through the network. Every network engineer needs a complete overview of their network traffic to confirm their security policies work, to identify new vulnerabilities, and to analyze any shifts in traffic behavior. Often, it’s difficult to build out effective network monitoring as teams struggle with problems like configuring and tuning data collection, managing storage costs, and analyzing traffic across multiple visibility tools.

Today, we’re excited to announce that a free version of Cloudflare’s network flow monitoring product, Magic Network Monitoring, is available to all Enterprise Customers. Every Enterprise Customer can configure Magic Network Monitoring and immediately improve their network visibility in as little as 30 minutes via our self-serve onboarding process.

Enterprise Customers can visit the Magic Network Monitoring product page, click “Talk to an expert”, and fill out the form. You’ll receive access within 24 hours of submitting the request. Over the next month, the free version of Magic Network Monitoring will be rolled out to all Enterprise Customers. The product will automatically be available by default without the need to submit a form.

How it works

Cloudflare customers can send their network flow data (either NetFlow or sFlow) from their routers to Cloudflare’s network edge.

Magic Network Monitoring will pick up this data, parse it, and instantly provide insights and analytics on your network traffic. These analytics include traffic volume overtime in bytes and packets, top protocols, sources, destinations, ports, and TCP flags.

Dogfooding Magic Network Monitoring during the remediation of the Thanksgiving 2023 security incident

Let’s review a recent example of how Magic Network Monitoring improved Cloudflare’s own network security and traffic visibility during the Thanksgiving 2023 security incident. Our security team needed a lightweight method to identify malicious packet characteristics in our core data center traffic. We monitored for any network traffic sourced from or destined to a list of ASNs associated with the bad actor. Our security team setup Magic Network Monitoring and established visibility into our first core data center within 24 hours of the project kick-off. Today, Cloudflare continues to use Magic Network Monitoring to monitor for traffic related to bad actors and to provide real time traffic analytics on more than 1 Tbps of core data center traffic.

*Magic Network Monitoring – Traffic Analytics*

Monitoring local network traffic from IoT devices

Magic Network Monitoring also improves visibility on any network traffic that doesn’t go through Cloudflare. Imagine that you’re a network engineer at ACME Corporation, and it’s your job to manage and troubleshoot IoT devices in a factory that are connected to the factory’s internal network. The traffic generated by these IoT devices doesn’t go through Cloudflare because it is destined to other devices and endpoints on the internal network. Nonetheless, you still need to establish network visibility into device traffic over time to monitor and troubleshoot the system.

To solve the problem, you configure a router or other network device to securely send encrypted traffic flow summaries to Cloudflare via an IPSec tunnel. Magic Network Monitoring parses the data, and instantly provides you with insights and analytics on your network traffic. Now, when an IoT device goes down, or a connection between IoT devices is unexpectedly blocked, you can analyze historical network traffic data in Magic Network Monitoring to speed up the troubleshooting process.

Monitoring cloud network traffic

As cloud networking becomes increasingly prevalent, it is essential for enterprises to invest in visibility across their cloud environments. Let’s say you’re responsible for monitoring and troubleshooting your corporation’s cloud network operations which are spread across multiple public cloud providers. You need to improve visibility into your cloud network traffic to analyze and troubleshoot any unexpected traffic patterns like configuration drift that leads to an exposed network port.

To improve traffic visibility across different cloud environments, you can export cloud traffic flow logs from any virtual device that supports NetFlow or sFlow to Cloudflare. In the future, we are building support for native cloud VPC flow logs in conjunction with Magic Cloud Networking. Cloudflare will parse this traffic flow data and provide alerts plus analytics across all your cloud environments in a single pane of glass on the Cloudflare dashboard.

Improve your security posture today in less than 30 minutes

If you’re an existing Enterprise customer, and you want to improve your corporate network security, you can get started right away. Visit the Magic Network Monitoring product page, click “Talk to an expert”, and fill out the form. You’ll receive access within 24 hours of submitting the request. You can begin the self-serve onboarding tutorial, and start monitoring your first batch of network traffic in less than 30 minutes.

Over the next month, the free version of Magic Network Monitoring will be rolled out to all Enterprise Customers. The product will be automatically available by default without the need to submit a form.

If you’re interested in becoming an Enterprise Customer, and have more questions about Magic Network Monitoring, you can talk with an expert. If you’re a free customer, and you’re interested in testing a limited beta of Magic Network Monitoring, you can fill out this form to request access.

Creating a User Activity Dashboard for Amazon CodeWhisperer

2024-03-04 David Ernst

Post Syndicated from David Ernst original https://aws.amazon.com/blogs/devops/creating-a-user-activity-dashboard-for-amazon-codewhisperer/

Maximizing the value from Enterprise Software tools requires an understanding of who and how users interact with those tools. As we have worked with builders rolling out Amazon CodeWhisperer to their enterprises, identifying usage patterns has been critical.

This blog post is a result of that work, builds on Introducing Amazon CodeWhisperer Dashboard blog and Amazon CloudWatch metrics and enables customers to build dashboards to support their rollouts. Note that these features are only available in CodeWhisperer Professional plan.

Organizations have leveraged the existing Amazon CodeWhisperer Dashboard to gain insights into developer usage. This blog explores how we can supplement the existing dashboard with detailed user analytics. Identifying leading contributors has accelerated tool usage and adoption within organizations. Acknowledging and incentivizing adopters can accelerate a broader adoption.

The architecture diagram outlines a streamlined process for tracking and analyzing Amazon CodeWhisperer usage events. It begins with logging these events in CodeWhisperer and AWS CloudTrail and then forwarding them to Amazon CloudWatch Logs. Configuring AWS CloudTrail involves using Amazon S3 for storage and AWS Key Management Service (KMS) for log encryption. An AWS Lambda function analyzes the logs, extracting information about user activity. This blog also introduces a AWS CloudFormation template that simplifies the setup process, including creating the CloudTrail with an S3 bucket KMS key and the Lambda function. The template also configures AWS IAM permissions, ensuring the Lambda function has access rights to interact with other AWS services.

Configuring CloudTrail for CodeWhisperer User Tracking

This section details the process for monitoring user interactions while using Amazon CodeWhisperer. The aim is to utilize AWS CloudTrail to record instances where users receive code suggestions from CodeWhisperer. This involves setting up a new CloudTrail trail tailored to log events related to these interactions. By accomplishing this, you lay a foundational framework for capturing detailed user activity data, which is crucial for the subsequent steps of analyzing and visualizing this data through a custom AWS Lambda function and an Amazon CloudWatch dashboard.

Setup CloudTrail for CodeWhisperer

1. Navigate to AWS CloudTrail Service.

2. Create Trail

3. Choose Trail Attributes

a. Click on Create Trail

b. Provide a Trail Name, for example, “cwspr-preprod-cloudtrail”

c. Choose Enable for all accounts in my organization

d. Choose Create a new Amazon S3 bucket to configure the Storage Location

e. For Trail log bucket and folder, note down the given unique trail bucket name in order to view the logs at a future point.

f. Check Enabled to encrypt log files with SSE-KMS encryption

j. Enter an AWS Key Management Service alias for log file SSE-KMS encryption, for example, “cwspr-preprod-cloudtrail”

h. Select Enabled for CloudWatch Logs

i. Select New

j. Copy the given CloudWatch Log group name, you will need this for the testing the Lambda function in a future step.

k. Provide a Role Name, for example, “CloudTrailRole-cwspr-preprod-cloudtrail”

l. Click Next.

This image depicts how to choose the trail attributes within CloudTrail for CodeWhisperer User Tracking.

4. Choose Log Events

a. Check “Management events“ and ”Data events“

b. Under Management events, keep the default options under API activity, Read and Write

c. Under Data event, choose CodeWhisperer for Data event type

d. Keep the default Log all events under Log selector template

e. Click Next

f. Review and click Create Trail

This image depicts how to choose the log events for CloudTrail for CodeWhisperer User Tracking.

Please Note: The logs will need to be included on the account which the management account or member accounts are enabled.

Gathering Application ARN for CodeWhisperer application

Step 1: Access AWS IAM Identity Center

1. Locate and click on the Services dropdown menu at the top of the console.

2. Search for and select IAM Identity Center (SSO) from the list of services.

Step 2: Find the Application ARN for CodeWhisperer application

1. In the IAM Identity Center dashboard, click on Application Assignments. -> Applications in the left-side navigation pane.

2. Locate the application with Service as CodeWhisperer and click on it

An image displays where you can find the Application in IAM Identity Center.

3. Copy the Application ARN and store it in a secure place. You will need this ID to configure your Lambda function’s JSON event.

An image shows where you will find the Application ARN after you click on you AWS managed application.

User Activity Analysis in CodeWhisperer with AWS Lambda

This section focuses on creating and testing our custom AWS Lambda function, which was explicitly designed to analyze user activity within an Amazon CodeWhisperer environment. This function is critical in extracting, processing, and organizing user activity data. It starts by retrieving detailed logs from CloudWatch containing CodeWhisperer user activity, then cross-references this data with the membership details obtained from the AWS Identity Center. This allows the function to categorize users into active and inactive groups based on their engagement within a specified time frame.

The Lambda function’s capability extends to fetching and structuring detailed user information, including names, display names, and email addresses. It then sorts and compiles these details into a comprehensive HTML output. This output highlights the CodeWhisperer usage in an organization.

Creating and Configuring Your AWS Lambda Function

1. Navigate to the Lambda service.

2. Click on Create function.

3. Choose Author from scratch.

4. Enter a Function name, for example, “AmazonCodeWhispererUserActivity”.

5. Choose Python 3.11 as the Runtime.

6. Click on ‘Create function’ to create your new Lambda function.

7. Access the Function: After creating your Lambda function, you will be directed to the function’s dashboard. If not, navigate to the Lambda service, find your function “AmazonCodeWhispererUserActivity”, and click on it.

8. Copy and paste your Python code into the inline code editor on the function’s dashboard. The lambda function code can be found here.

9. Click ‘Deploy’ to save and deploy your code to the Lambda function.

10. You have now successfully created and configured an AWS Lambda function with our Python code.

This image depicts how to configure your AWS Lambda function for tracking user activity in CodeWhisperer.

Updating the Execution Role for Your AWS Lambda Function

After you’ve created your Lambda function, you need to ensure it has the appropriate permissions to interact with other AWS services like CloudWatch Logs and AWS Identity Store. Here’s how you can update the IAM role permissions:

Locate the Execution Role:

1. Open Your Lambda Function’s Dashboard in the AWS Management Console.

2. Click on the ‘Configuration’ tab located near the top of the dashboard.

3. Set the Time Out setting to 15 minutes from the default 3 seconds

4. Select the ‘Permissions’ menu on the left side of the Configuration page.

5. Find the ‘Execution role’ section on the Permissions page.

6. Click on the Role Name to open the IAM (Identity and Access Management) role associated with your Lambda function.

7. In the IAM role dashboard, click on the Policy Name under the Permissions policies.

8. Edit the existing policy: Replace the policy with the following JSON.

9. Save the changes to the policy.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Action":[
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:StartQuery",
            "logs:GetQueryResults",
            "sso:ListInstances",
            "sso:ListApplicationAssignments"
            "identitystore:DescribeUser",
            "identitystore:ListUsers",
            "identitystore:ListGroupMemberships"
         ],
         "Resource":"*",
         "Effect":"Allow"
      },
      {
         "Action":[
            "cloudtrail:DescribeTrails",
            "cloudtrail:GetTrailStatus"
         ],
         "Resource":"*",
         "Effect":"Allow"
      }
   ]
} Your AWS Lambda function now has the necessary permissions to execute and interact with CloudWatch Logs and AWS Identity Store.

Testing Lambda Function with custom input

1. On your Lambda function’s dashboard.

2. On the function’s dashboard, locate the Test button near the top right corner.

3. Click on Test. This opens a dialog for configuring a new test event.

4. In the dialog, you’ll see an option to create a new test event. If it’s your first test, you’ll be prompted automatically to create a new event.

5. For Event name, enter a descriptive name for your test, such as “TestEvent”.

6. In the event code area, replace the existing JSON with your specific input:

{
"log_group_name": "{Insert Log Group Name}",
"start_date": "{Insert Start Date}",
"end_date": "{Insert End Date}",
"codewhisperer_application_arn": "{Insert Codewhisperer Application ARN}", 
"identity_store_region": "{Insert Region}", 
"codewhisperer_region": "{Insert Region}"
}

7. This JSON structure includes:

a. log_group_name: The name of the log group in CloudWatch Logs.

b. start_date: The start date and time for the query, formatted as “YYYY-MM-DD HH:MM:SS”.

c. end_date: The end date and time for the query, formatted as “YYYY-MM-DD HH:MM:SS”.

e. codewhisperer_application_arn: The ARN of the Code Whisperer Application in the AWS Identity Store.

f. identity_store_region: The region of the AWS Identity Store.

f. codewhisperer_region: The region of where Amazon CodeWhisperer is configured.

8. Click on Save to store this test configuration.

This image depicts an example of creating a test event for the Lambda function with example JSON parameters entered.

9. With the test event selected, click on the Test button again to execute the function with this event.

10. The function will run, and you’ll see the execution result at the top of the page. This includes execution status, logs, and output.

11. Check the Execution result section to see if the function executed successfully.

This image depicts what a test case that successfully executed looks like.

Visualizing CodeWhisperer User Activity with Amazon CloudWatch Dashboard

This section focuses on effectively visualizing the data processed by our AWS Lambda function using a CloudWatch dashboard. This part of the guide provides a step-by-step approach to creating a “CodeWhispererUserActivity” dashboard within CloudWatch. It details how to add a custom widget to display the results from the Lambda Function. The process includes configuring the widget with the Lambda function’s ARN and the necessary JSON parameters.

1.Navigate to the Amazon CloudWatch service from within the AWS Management Console

2. Choose the ‘Dashboards’ option from the left-hand navigation panel.

3. Click on ‘Create dashboard’ and provide a name for your dashboard, for example: “CodeWhispererUserActivity”.

4. Click the ‘Create Dashboard’ button.

5. Select “Other Content Types” as your ‘Data sources types’ option before choosing “Custom Widget” for your ‘Widget Configuration’ and then click ‘Next’.

6. On the “Create a custom widget” page click the ‘Next’ button without making a selection from the dropdown.

7. On the ‘Create a custom widget’ page:

a. Enter your Lambda function’s ARN (Amazon Resource Name) or use the dropdown menu to find and select your “CodeWhispererUserActivity” function.

b. Add the JSON parameters that you provided in the test event, without including the start and end dates.

{
"log_group_name": "{Insert Log Group Name}",
“codewhisperer_application_arn”:”{Insert Codewhisperer Application ARN}”,
"identity_store_region": "{Insert identity Store Region}",
"codewhisperer_region": "{Insert Codewhisperer Region}"
}

This image depicts an example of creating a custom widget.

8. Click the ‘Add widget’ button. The dashboard will update to include your new widget and will run the Lambda function to retrieve initial data. You’ll need to click the “Execute them all” button in the upper banner to let CloudWatch run the initial Lambda retrieval.

This image depicts the execute them all button on the upper right of the screen.

9. Customize Your Dashboard: Arrange the dashboard by dragging and resizing widgets for optimal organization and visibility. Adjust the time range and refresh settings as needed to suit your monitoring requirements.

10. Save the Dashboard Configuration: After setting up and customizing your dashboard, click ‘Save dashboard’ to preserve your layout and settings.

This image depicts what the dashboard looks like. It showcases active users and inactive users, with first name, last name, display name, and email.

CloudFormation Deployment for the CodeWhisperer Dashboard

The blog post concludes with a detailed AWS CloudFormation template designed to automate the setup of the necessary infrastructure for the Amazon CodeWhisperer User Activity Dashboard. This template provisions AWS resources, streamlining the deployment process. It includes the configuration of AWS CloudTrail for tracking user interactions, setting up CloudWatch Logs for logging and monitoring, and creating an AWS Lambda function for analyzing user activity data. Additionally, the template defines the required IAM roles and permissions, ensuring the Lambda function has access to the needed AWS services and resources.

The blog post also provides a JSON configuration for the CloudWatch dashboard. This is because, at the time of writing, AWS CloudFormation does not natively support the creation and configuration of CloudWatch dashboards. Therefore, the JSON configuration is necessary to manually set up the dashboard in CloudWatch, allowing users to visualize the processed data from the Lambda function. The CloudFormation template can be found here.

Create a CloudWatch Dashboard and import the JSON below.

{
   "widgets":[
      {
         "height":30,
         "width":20,
         "y":0,
         "x":0,
         "type":"custom",
         "properties":{
            "endpoint":"{Insert ARN of Lambda Function}",
            "updateOn":{
               "refresh":true,
               "resize":true,
               "timeRange":true
            },
            "params":{
               "log_group_name":"{Insert Log Group Name}",
               "codewhisperer_application_arn":"{Insert Codewhisperer Application ARN}",
               "identity_store_region":"{Insert identity Store Region}",
               "codewhisperer_region":"{Insert Codewhisperer Region}"
            }
         }
      }
   ]
}

Conclusion

In this blog, we detail a comprehensive process for establishing a user activity dashboard for Amazon CodeWhisperer to deliver data to support an enterprise rollout. The journey begins with setting up AWS CloudTrail to log user interactions with CodeWhisperer. This foundational step ensures the capture of detailed activity events, which is vital for our subsequent analysis. We then construct a tailored AWS Lambda function to sift through CloudTrail logs. Then, create a dashboard in AWS CloudWatch. This dashboard serves as a central platform for displaying the user data from our Lambda function in an accessible, user-friendly format.

You can reference the existing CodeWhisperer dashboard for additional insights. The Amazon CodeWhisperer Dashboard offers a view summarizing data about how your developers use the service.

Overall, this dashboard empowers you to track, understand, and influence the adoption and effective use of Amazon CodeWhisperer in your organizations, optimizing the tool’s deployment and fostering a culture of informed data-driven usage.

About the authors:

How to develop an Amazon Security Lake POC

2024-02-28 Anna McAbee

Post Syndicated from Anna McAbee original https://aws.amazon.com/blogs/security/how-to-develop-an-amazon-security-lake-poc/

You can use Amazon Security Lake to simplify log data collection and retention for Amazon Web Services (AWS) and non-AWS data sources. To make sure that you get the most out of your implementation requires proper planning.

In this post, we will show you how to plan and implement a proof of concept (POC) for Security Lake to help you determine the functionality and value of Security Lake in your environment, so that your team can confidently design and implement in production. We will walk you through the following steps:

Understand the functionality and value of Security Lake
Determine success criteria for the POC
Define your Security Lake configuration
Prepare for deployment
Enable Security Lake
Validate deployment

Understand the functionality of Security Lake

Figure 1 summarizes the main features of Security Lake and the context of how to use it:

Figure 1: Overview of Security Lake functionality

As shown in the figure, Security Lake ingests and normalizes logs from data sources such as AWS services, AWS Partner sources, and custom sources. Security Lake also manages the lifecycle, orchestration, and subscribers. Subscribers can be AWS services, such as Amazon Athena, or AWS Partner subscribers.

There are four primary functions that Security Lake provides:

Centralize visibility to your data from AWS environments, SaaS providers, on-premises, and other cloud data sources — You can collect log sources from AWS services such as AWS CloudTrail management events, Amazon Simple Storage Service (Amazon S3) data events, AWS Lambda data events, Amazon Route 53 Resolver logs, VPC Flow Logs, and AWS Security Hub findings, in addition to log sources from on-premises, other cloud services, SaaS applications, and custom sources. Security Lake automatically aggregates the security data across AWS Regions and accounts.
Normalize your security data to an open standard — Security Lake normalizes log sources in a common schema, the Open Security Schema Framework (OCSF), and stores them in compressed parquet files.
Use your preferred analytics tools to analyze your security data — You can use AWS tools, such as Athena and Amazon OpenSearch Service, or you can utilize external security tools to analyze the data in Security Lake.
Optimize and manage your security data for more efficient storage and query — Security Lake manages the lifecycle of your data with customizable retention settings with automated storage tiering to help provide more cost-effective storage.

Determine success criteria

By establishing success criteria, you can assess whether Security Lake has helped address the challenges that you are facing. Some example success criteria include:

I need to centrally set up and store AWS logs across my organization in AWS Organizations for multiple log sources.
I need to more efficiently collect VPC Flow Logs in my organization and analyze them in my security information and event management (SIEM) solution.
I want to use OpenSearch Service to replace my on-premises SIEM.
I want to collect AWS log sources and custom sources for machine learning with Amazon Sagemaker.
I need to establish a dashboard in Amazon QuickSight to visualize my Security Hub findings and a custom log source data.

Review your success criteria to make sure that your goals are realistic given your timeframe and potential constraints that are specific to your organization. For example, do you have full control over the creation of AWS services that are deployed in an organization? Do you have resources that can dedicate time to implement and test? Is this time convenient for relevant stakeholders to evaluate the service?

The timeframe of your POC will depend on your answers to these questions.

Important: Security Lake has a 15-day free trial per account that you use from the time that you enable Security Lake. This is the best way to estimate the costs for each Region throughout the trial, which is an important consideration when you configure your POC.

Define your Security Lake configuration

After you establish your success criteria, you should define your desired Security Lake configuration. Some important decisions include the following:

Determine AWS log sources — Decide which AWS log sources to collect. For information about the available options, see Collecting data from AWS services.
Determine third-party log sources — Decide if you want to include non-AWS service logs as sources in your POC. For more information about your options, see Third-party integrations with Security Lake; the integrations listed as “Source” can send logs to Security Lake.

Note: You can add third-party integrations after the POC or in a second phase of the POC. Pre-planning will be required to make sure that you can get these set up during the 15-day free trial. Third-party integrations usually take more time to set up than AWS service logs.
Select a delegated administrator – Identify which account will serve as the delegated administrator. Make sure that you have the appropriate permissions from the organization admin account to identify and enable the account that will be your Security Lake delegated administrator. This account will be the location for the S3 buckets with your security data and where you centrally configure Security Lake. The AWS Security Reference Architecture (AWS SRA) recommends that you use the AWS logging account for this purpose. In addition, make sure to review Important considerations for delegated Security Lake administrators.
Select accounts in scope — Define which accounts to collect data from. To get the most realistic estimate of the cost of Security Lake, enable all accounts across your organization during the free trial.
Determine analytics tool — Determine if you want to use native AWS analytics tools, such as Athena and OpenSearch Service, or an existing SIEM, where the SIEM is a subscriber to Security Lake.
Define log retention and Regions — Define your log retention requirements and Regional restrictions or considerations.

Prepare for deployment

After you determine your success criteria and your Security Lake configuration, you should have an idea of your stakeholders, desired state, and timeframe. Now you need to prepare for deployment. In this step, you should complete as much as possible before you deploy Security Lake. The following are some steps to take:

Create a project plan and timeline so that everyone involved understands what success look like and what the scope and timeline is.
Define the relevant stakeholders and consumers of the Security Lake data. Some common stakeholders include security operations center (SOC) analysts, incident responders, security engineers, cloud engineers, finance, and others.
Define who is responsible, accountable, consulted, and informed during the deployment. Make sure that team members understand their roles.
Make sure that you have access in your management account to delegate and administrator. For further details, see IAM permissions required to designate the delegated administrator.
Consider other technical prerequisites that you need to accomplish. For example, if you need roles in addition to what Security Lake creates for custom extract, transform, and load (ETL) pipelines for custom sources, can you work with the team in charge of that process before the POC?

Enable Security Lake

The next step is to enable Security Lake in your environment and configure your sources and subscribers.

Deploy Security Lake across the Regions, accounts, and AWS log sources that you previously defined.
Configure custom sources that are in scope for your POC.
Configure analytics tools in scope for your POC.

Validate deployment

The final step is to confirm that you have configured Security Lake and additional components, validate that everything is working as intended, and evaluate the solution against your success criteria.

Validate log collection — Verify that you are collecting the log sources that you configured. To do this, check the S3 buckets in the delegated administrator account for the logs.
Validate analytics tool — Verify that you can analyze the log sources in your analytics tool of choice. If you don’t want to configure additional analytics tooling, you can use Athena, which is configured when you set up Security Lake. For sample Athena queries, see Amazon Security Lake Example Queries on GitHub and Security Lake queries in the documentation.
Obtain a cost estimate — In the Security Lake console, you can review a usage page to verify that the cost of Security Lake in your environment aligns with your expectations and budgets.
Assess success criteria — Determine if you achieved the success criteria that you defined at the beginning of the project.

Next steps

Next steps will largely depend on whether you decide to move forward with Security Lake.

Determine if you have the approval and budget to use Security Lake.
Expand to other data sources that can help you provide more security outcomes for your business.
Configure S3 lifecycle policies to efficiently store logs long term based on your requirements.
Let other teams know that they can subscribe to Security Lake to use the log data for their own purposes. For example, a development team that gets access to CloudTrail through Security Lake can analyze the logs to understand the permissions needed for an application.

Conclusion

In this blog post, we showed you how to plan and implement a Security Lake POC. You learned how to do so through phases, including defining success criteria, configuring Security Lake, and validating that Security Lake meets your business needs.

As a customer, this guide will help you run a successful proof of value (POV) with Security Lake. It guides you in assessing the value and factors to consider when deciding to implement the current features.

Further resources

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Keeping Remote Teams Connected: The Zabbix Advantage

2024-02-27 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/keeping-remote-teams-connected-the-zabbix-advantage/27551/

The popularity of remote teams may have exploded in popularity during the COVID-19 pandemic, but it’s not a phenomenon that’s likely to trend downward anytime soon. High-profile organizations like 3M, Dropbox, Shopify, and LinkedIn are continuing to enthusiastically embrace remote working, essentially making it the “default setting” for their employees.

The shift toward remote working is not without its challenges, however. Organizations of all sizes often have little time to set up the kind of networking infrastructure and efficient processes that make sure remote workers are just as connected and productive as their on-site counterparts. In this article, we’ll take a quick look at some of the most important network monitoring challenges that remote teams face and show how Zabbix can help you tackle them as efficiently as possible.

Table of Contents

Infrastructure and connectivity issues

A remote network is essentially a grouping of multiple smaller network setups, each with their own set of variables that can affect performance. The differences between network system and infrastructure quality at different remote destinations can often lead to low overall network performance, which in turn makes it a challenge to provide the kind of high-speed communication needed to run the remote automation tools and software applications used by remote employees and teams.

By providing straightforward and easy-to-understand visibility into a network’s connected devices and how data moves between them, Zabbix makes it easy to automatically compare data and identify any drop in network performance.

With Zabbix, you can easily keep an eye on network routers and switches, especially internet provider and uplink ports up/down. You can also monitor network latency, the error rate on ports, the packet loss to important devices, and network utilization on important ports with net.if.in/net.if.out. Here are some example triggers:

High Network Utilization: avg(/Router ABC/net.if.in[eth0],5m)>80MB
High Packet Loss: avg(/Router ABC/icmppingloss,5m)>5
High Latency: avg(/Router ABC/icmppingsec,5m)>0.1

What’s more, Zabbix allows you to create network maps with important network devices and real-time data, as well as dashboards with maps and single item/gauge widgets, all of which makes it far easier to achieve the uninterrupted connectivity that remote teams depend on.

Staying safe

Remote locations aren’t islands that can be completely isolated from external traffic. Staying vigilant and doing everything possible to eliminate data breaches is important, and taking advantage of strong encryption methods, network scanning tools, and firewalls to protect your systems is a good start. However, using a whole suite of tools to protect security can add more difficulty when it comes to integrating and monitoring them.

With Zabbix, you can count on enterprise-grade security, including encrypted communication between components, a flexible user permission schema that can be easily applied to a distributed environment, and custom user roles with a granular set of permissions for different types of users.

Zabbix also provides native support for HTTP, LDAP, and SAML authentication (which gives you an additional layer of security and improves your user experience while working with Zabbix), the ability to restrict access to sensitive information by limiting which metrics can be collected in your environment, and the ability to track changes in your environment by utilizing the Audit log. It’s all designed to make sure that there are no compromises on the security of your data when you decide to go remote.

Scalability

As a remote organization grows and its distributed systems expand, a good monitoring solution needs to be able to grow along with it in order to prevent gaps in coverage while maintaining performance and reliability. Zabbix gives you limitless scalability in the form of Zabbix proxies, which act as independent intermediaries that collect performance and availability data on behalf of a Zabbix server. You can roll out new proxies as fast as you need them, and because Zabbix is free and open source, you don’t have to worry about additional licensing costs.

Zabbix proxies allow you to see at a glance what resources are being used on your network at any given moment, which is especially handy if, like most remote teams, you have tens or even hundreds of servers and network appliances to monitor. You can also execute remote commands in remote locations – either on the proxies themselves or on the agents monitored by the proxy, and multiple frontends can be deployed for load balancing as well as for improved security and connectivity. Proxy docker containers and cloud options are available as well, enhancing flexibility and making Zabbix ideal for any organization that spans the globe (or aspires to).

Managing multiple solutions

The legacy software and systems you use were most likely designed to work in a traditional networking model. Remote working, as we’ve seen, presents a whole new range of challenges when it comes to compatibility and support.

We’ve created Zabbix to be as easy as possible to integrate with existing systems. You can easily monitor any operating system, cloud service, IP telephony service, docker container, or web server/database backend. We provide out-of-the-box monitoring for the world’s leading hardware and software vendors, and our extensively documented API makes it easy to create workflows and integrate with other systems. In addition, you can also integrate Zabbix with the most popular helpdesk, messaging, and ITSM systems, such as Slack, Jira, MS Teams, and many others.

Not only that, Zabbix is designed to serve as the ideal monitoring solution for multi-tenant environments. It serves as a single pane of glass for your entire infrastructure, and it’s easy to visualize everything that’s happening with your network with unique maps, dashboards, and templates.

Conclusion

The days of large teams all working together under the same roof are a thing of the past – the remote working trend will only accelerate as technology improves and employees get more accustomed to working with colleagues across multiple locations. That’s why it’s of paramount importance to make sure your monitoring solution has the built-in flexibility and scalability to grow with your team and your business.

If you want to see for yourself how Zabbix can help you effectively monitor a globally distributed network, contact us.

The post Keeping Remote Teams Connected: The Zabbix Advantage appeared first on Zabbix Blog.

HPC Monitoring: Transitioning from Nagios and Ganglia to Zabbix 6

2024-02-08 Mark Vilensky

Post Syndicated from Mark Vilensky original https://blog.zabbix.com/hpc-monitoring-transitioning-from-nagios-and-ganglia-to-zabbix-6/27313/

My name is Mark Vilensky, and I’m currently the Scientific Computing Manager at the Weizmann Institute of Science in Rehovot, Israel. I’ve been working in High-Performance Computing (HPC) for the past 15 years.

Our base is at the Chemistry Faculty at the Weizmann Institute, where our HPC activities follow a traditional path — extensive number crunching, classical calculations, and a repertoire that includes handling differential equations. Over the years, we’ve embraced a spectrum of technologies, even working with actual supercomputers like the SGI Altix.

Table of Contents

Our setup

As of now, our system boasts nearly 600 compute nodes, collectively wielding about 25,000 cores. The interconnect is Infiniband, and for management, provisioning, and monitoring, we rely on Ethernet. Our storage infrastructure is IBM GPFS on DDN hardware, and job submissions are facilitated through PBS Professional.

We use VMware for the system management. Surprisingly, the team managing this extensive system comprises only three individuals. The hardware landscape features HPE, Dell, and Lenovo servers.

The path to Zabbix

Recent challenges have surfaced in the monitoring domain, prompting considerations for an upgrade to Red Hat 8 or a comparable distribution. Our existing monitoring framework involved Nagios and Ganglia, but they had some severe limitations — Nagios’ lack of scalability and Ganglia’s Python 2 compatibility issues have become apparent.

Exploring alternatives led us to Zabbix, a platform not commonly encountered in supercomputing conferences but embraced by the community. Fortunately, we found a great YouTube channel by Dmitry Lambert that not only gives some recipes for doing things but also provides an overview required for planning, sizing, and avowing future troubles.

Our Zabbix setup resides in a modest VM, sporting 16 CPUs, 32 GB RAM, and three Ethernet interfaces, all operating within the Rocky 8.7 environment. The database relies on PostgreSQL 14 and Timescale DB2 version 2.8, with slight adjustments to the default configurations for history and trend settings.

Getting the job done

The stability of our Zabbix system has been noteworthy, showcasing its ability to automate tasks, particularly in scenarios where nodes are taken offline, prompting Zabbix to initiate maintenance cycles automatically. Beyond conventional monitoring, we’ve tapped into Zabbix’s capabilities for external scripts, querying the PBS server and GPFS server, and even managing specific hardware anomalies.

The Zabbix dashboard has emerged as a comprehensive tool, offering a differentiated approach through host groups. These groups categorize our hosts, differentiating between CPU compute nodes, GPU compute nodes, and infrastructure nodes, allowing tailored alerts based on node types.

Alerting and visualization

Our alerting strategy involves receiving email alerts only for significant disasters, a conscious effort to avoid alert fatigue. The presentation emphasizes the nuanced differences in monitoring compute nodes versus infrastructure nodes, focusing on availability and potential job performance issues for the former and services, memory, and memory leaks for the latter.

The power of visual representations is underscored, with the utilization of heat maps offering quick insights into the cluster’s performance.

Final thoughts

In conclusion, our journey with Zabbix has not only delivered stability and automation but has also provided invaluable insights for optimizing resource utilization. I’d like to express my special appreciation for Andrei Vasilev, a member of our team whose efforts have been instrumental in making the transition to Zabbix.

The post HPC Monitoring: Transitioning from Nagios and Ganglia to Zabbix 6 appeared first on Zabbix Blog.

A Look Back at Zabbix Summit 2023

2023-10-12 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/a-look-back-at-zabbix-summit-2023/26744/

Autumn in the Latvian capital of Riga is marked by a variety of traditions. The leaves fall, the rainy season arrives, the birds migrate, and IT professionals from around the world descend on the city for the annual Zabbix Summit.

On October 6 and 7, the Radisson Blu Hotel Latvija was packed with 450 delegates from 38 countries, all there for Zabbix Summit 2023, the 11th in-person version of Zabbix’s premier yearly event.

This year’s Summit was marked by presentations, partner activities, and moments of relaxation and celebration that will energize the Zabbix community and spark ideas that attendees will take home to every corner of the world.

If you couldn’t make it, here’s a little taste of how it felt to be there!

Zabbix Summit 2023 in numbers

The stage hosted 27 speakers from 17 different countries who gave 31 speeches, including both lectures and lightning talks. There were four workshops with deep dives into technical topics, conducted by the Zabbix technical team as well as our partners from Opensource ICT Solutions and IZI-IT. Summit attendees also enjoyed three parties designed to provide a relaxing experience and networking opportunities.

Zabbix Summit 2023 proudly featured 10 sponsors, all part of Zabbix’s official partner network. They included:

initMAX – Diamond Sponsor
IntelliTrend – Platinum Sponsor
IZI-IT – Platinum Sponsor
Quadrata – Platinum Sponsor
Allenta – Gold Sponsor
Metricio – Gold Sponsor
Opensource ICT Solutions – Gold Sponsor
Docomo Business – Gold Sponsor
SRA OSS – Silver Sponsor
Enthus – Lunch and coffee break sponsor

We’d also like to give a shout-out to our Zabbix Fans, who played a crucial role in supporting the Summit this year (as every year) with their attendance, merchandise purchases, and enthusiasm!

We’re grateful to everyone who played a role and helped us make Zabbix Summit 2023 happen!

Highlights from the main stage

This year we continued a Summit tradition and allowed our in-person audience as well those tuning in via livestream and YouTube to ask questions during live Q&A sessions – a feature that made the proceedings more interactive and helped everyone feel more involved. The speeches were all fascinating and well received, but a few in particular stood out:

What the future holds for Zabbix

Zabbix CEO and Founder Alexei Vladishev kicked off the presentations on Day 1 with a keynote speech about his current plans for Zabbix’s development, including a detailed look at enhancements requested by users.

Avoiding alert fatigue

Bringing a less technical and more conceptual approach to addressing day-to-day data monitoring issues, Rihards Olups, SaaS Architect at Nokia, discussed alert fatigue and how science explains it. During his presentation, Rihards showed how an excess of alerts can negatively affect selective attention and shared his thoughts about how professionals can intervene to prevent problems.

Making Zabbix’s latest offerings accessible to everyone

Day 2 began with Zabbix Director of Business Development Sergey Sorokin focusing on new plans and offerings, including a subscription system for technical support, consulting services, and monitoring tailored for managed service providers.

Monitoring everything (and we do mean everything!)

Janne Pikkarainen, Lead Site Reliability Engineer at Forcepoint, provided detailed and entertaining insights into how he connects Zabbix to smart accessories and uses it to monitor aspects of his home, including the location of personal items, noise levels, and even the frequency of his daughter’s naps and cries.

Implementing ideas and design in MSP environments

In tackling the topic of data collection and analysis for service providers, Brian van Baekel, Zabbix Trainer at Opensource ICT Solutions, presented details on the development of projects focused on monitoring service providers. He also highlighted best practices for data collection in Zabbix Server, data storage, and presenting on the Zabbix Frontend.

Monitoring the London transportation system

A use case presented by Nathan Liefting, Zabbix Consultant and Trainer at Opensource ICT Solutions, and Adan Mohamed, DevOps Manager at Boldyn Networks, showed how Zabbix monitors the availability of the London Underground subway system. Data is collected from 136 “tube” stations in a high-level architecture and used to assess the availability of Wi-Fi networks, emergency connections, and other services.

Bringing the Olympics and World Cup to life with Zabbix

Marianna Portela, a Tech Lead at Globo in Brazil, shared her insights into how Zabbix supports Globo’s digital transformation and helps her monitor live event infrastructure at massive events like the Olympics and World Cup.

Don’t forget the fun part!

Zabbix Summits are renowned for their friendly, informal atmosphere, which is probably most clearly on display at our famous Summit parties.

Zabbix Summit 2023’s Welcome party was held at the Stargorod Riga brewery in the heart of Riga’s old town. It featured arm wrestling, a selection of delicious foods and beverages, and plenty of opportunities for Summit participants to get to know each other.

The Main party saw live music, dancing, quizzes, and other fun events take place within the historic confines of the Latvian Railway History Museum. The atmosphere, food, drinks, and good company all combined to create an event that nobody who attended will soon forget!

Last but not least, the Closing party at the Burzma food hall was a true celebration of the diversity of the global Zabbix community, with food and music from every country with a Zabbix presence as well as plenty of opportunities for Summit attendees to swap stories and exchange contact details.

Open door, open minds

The traditional Zabbix open-door day was held on Thursday, October 5, and while past Summits have typically seen around 50 visitors, we were proud to welcome closer to 100 this time around. Attendees could have a coffee with their favorite Zabbix employees, play a friendly game of foosball or table tennis, and get a behind-the-scenes look at where the magic happens.

Testify!

One new feature that made a big splash at this year’s Summit was the testimonial booth, which allowed Summit attendees to share their thoughts and experiences about Zabbix with the rest of our community. Sharing a testimonial or leaving a review allowed attendees to collect a piece of exclusive Zabbix Summit 2023 merchandise, and we went through a lot of it – the booth provided us with 28 filmed and 17 written testimonials about Zabbix products and services, far more than we anticipated.

Where to find the presentations

If you couldn’t attend but want to stay informed about what was discussed at the event (or if you’d just like to revisit the stage presentations), both days of recordings are available on Zabbix’s YouTube channel at the following links:

Streaming – Zabbix Summit Day 1

Streaming – Zabbix Summit Day 2

The graphics and texts of the presentations are also available for reference and download on the official event website.

We hope that Zabbix Summit 2023 was a time of valuable learning, connections, and idea exchange for everyone who attended or followed along through social media. If you’ve enjoyed the photos, you can see several more on our Instagram.

If you had an amazing time at Zabbix Summit 2023 (and we certainly hope you did), registration for Zabbix Summit 2024 is already open and Early Bird tickets are available.

See you next year!

The post A Look Back at Zabbix Summit 2023 appeared first on Zabbix Blog.

The Zabbix Advantage for Business

2023-09-27 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/the-zabbix-advantage-for-business/26497/

CIOs and CITOs know all too well that a smoothly functioning network is the backbone of any business. Your network has to guarantee reliability, performance, and security. An unreliable network, by contrast, means damaged productivity, negative customer perceptions, and haphazard security. The solution is network monitoring, and in this post we’ll explore the reasons why Zabbix is the ideal monitoring solution for any business.

What is network monitoring?

Network monitoring is a critical IT process where all networking components (as well as key performance indicators like CPU utilization and network bandwidth) are constantly monitored to improve performance and eliminate bottlenecks. It provides real-time information that network administrators need to determine whether a network is running optimally.

Why Zabbix?

At Zabbix, we’re here to help you deliver for your customers, flawlessly and without interruptions. Our monitoring solution is 100% open source, available in over 20 languages, and able to collect an unlimited amount of data. Designed with enterprise requirements in mind, Zabbix provides a comprehensive, “single pane of glass” view of any size environment. Put simply, Zabbix allows you to monitor anything – from physical and virtual servers or containers to network infrastructure, applications, and cloud services.

What’s more, we offer a wide variety of additional professional services to go along with our solution, including:

Multiple technical support subscriptions that are tailored to the needs of your business
Certified training programs that are designed to help you master Zabbix under the guidance of top experts
A wide range of professional services, including template building, upgrades, consulting, and more

Keep reading to find out more about the difference Zabbix can make for your business.

The Zabbix advantage

IT teams are under enormous pressure to have their networks functioning perfectly 100% of the time, and with good reason. It’s simply not possible to run a business with a malfunctioning network. Here are 5 key reasons why you need to make network monitoring a top priority, and why Zabbix is the right answer for all of them.

Reliability

A network monitoring solution’s main reason for being is to show whether a device is working or not. Taking a proactive approach to maintaining a healthy network will keep tech support requests and downtime to an absolute minimum. Zabbix makes it easy to do so by automatically detecting problem states in your metric flow. Not only that, but our automated predictive functions can also help you react proactively. They do this by forecasting a value for early alerting and predicting the time left until you reach a problem threshold. Automation then allows you to remove additional inefficiencies.

Visibility

Having complete visibility of all your hardware and software assets allows you to easily monitor the health of your network. Zabbix lets businesses access metrics, issues, reports, and maps with a single click, allowing you to:

Analyze and correlate your metrics with easy-to-read graphs
Track your monitoring targets on an interactive geo-map
Display the statuses of your elements together with real-time data to get a detailed overview of your infrastructure on a Zabbix map
Generate scheduled PDF reports from any Zabbix dashboard
Extend the native Zabbix frontend functionality by developing your own frontend widgets and modules

Performance

By making it easy to monitor anything, Zabbix lets you know which parts of your network are being properly used, overused, or underused. This can help you uncover unnecessary costs that can be eliminated or identify a network component that needs upgrading.

Compliance

Today’s IT teams need to meet strict regulatory and protection standards in increasingly complex networks. Zabbix can spot changes in normal system behavior and unusual data flow. It can then either leverage multiple messaging channels to notify your team about anomalies or simply resolve any issues automatically.

Profitability

Zabbix has an extensive track record of making businesses more productive by saving network management time and lowering operating costs. Servers, for example, are machines that inevitably break down from time to time. Being able to quickly re-launch after a failure has occurred and minimizing the server downtime are vital. By making sure your team is aware of any and all current and impending issues, Zabbix can reduce downtime and increase the productivity and efficiency of your business.

Zabbix across industries

Whatever field you’re in, there’s no substitute for consistent, problem-free service when it comes to gaining the trust and loyalty of customers. Zabbix has an extensive track record of helping clients in multiple industries achieve their goals.

Zabbix for healthcare

A typical hospital relies on tens of thousands of connected devices. Manually checking each one for anomalies simply isn’t practical. Establishing a stable service level is a vital issue in most industries, but in healthcare it’s literally a matter of life and death. With Zabbix, hospital IT teams receive potentially life-saving alerts if anything is out of the ordinary.

What’s more, Zabbix can monitor progress toward expected outcomes, providing up-to-the-minute statistics on data errors or IT system failures. Issues, response times, and potential bottlenecks are displayed in easy-to-read graphs and charts. This allows hospital staff to follow up on the presence or absence of problems.

Zabbix for banking and finance

Financial institutions of all sizes rely on their networks to maintain connectivity and productivity. By processing millions of checks per minute and considering very complex dependencies between different elements of infrastructure, Zabbix allows banks to proactively detect and resolve network problems before they turn into major business disruptions.

Zabbix is also designed to seamlessly connect distributed architecture, including remote offices, branches, and even individual ATMs. Some of our financial industry clients previously used up to 20 different monitoring tools. Each alert sent hundreds of emails to different people, making it impossible to effectively monitor the environment. Naturally, they found Zabbix’s ability to monitor many thousands of devices and “single pane of glass” view to be a significant upgrade.

Zabbix for education

In an age of digital course materials and resources, schools and universities can’t operate without functioning IT infrastructures. Our clients in education typically have heterogeneous infrastructures with thousands of servers and clients. They also possess all kinds of connected devices, dozens of different operating systems, multiple locations, and hundreds of IT staff.

Zabbix has proven itself to be a simple, cost-effective method of monitoring geographically distributed campuses and educational sites. We’ve done this by:

Providing early notification of possible viruses, worms, Trojan horses, and other transmitters of system infection
Monitoring IT systems for intellectual property (IP) protection purposes
Saving human resources by reducing manual work

Zabbix for government

Network monitoring is critical for government agencies, as downtime can bring a halt to vital public services. Our public-sector clients range from city-wide public transportation companies all the way up to entire prefectures. They use Zabbix to monitor the availability of utilities, transport, lighting, and many other public services.

In the process, Zabbix increases the effectiveness of budget expenditures by providing precise and accountable data on how public resources are used. This makes it easier to justify further expenditures. In most business software, agents are required for each monitored host and costs increase in proportion to the number of monitored hosts. By contrast, Zabbix is open source and the software itself is free of charge, resulting in anticipated cost reductions of up to 25% in many cases.

Zabbix for retail

Retail environments increasingly depend on network-connected equipment, particularly when it comes to warehouse monitoring and tracking SKUs (stock keeping units). Zabbix delivers an all-in-one tool to monitor different applications, metrics, processes, and equipment while providing a complete picture about the availability and performance of all the components that make a retail business successful. This makes it possible for retailers to easily automate store openings and closings, monitor cash machines, and keep track of access system log entries.

Not only that, the quantity and quality of information that Zabbix collects makes it easy for retailers to conduct a more accurate analysis of what is happening (or what may happen) and take preventive measures. Our retail clients find that having this level of control over their resources and services increases the confidence of their teams as well as their customers.

Zabbix for telecom

Internet, telephony, and television verticals require availability and consistency. The key to success is providing your services 24/7/365.

Zabbix makes this possible by providing full visibility of all network and customer devices, allowing operators to know of any outage before customers do and take necessary actions. Some of our telecommunications clients are able to effortlessly monitor well over 100,000 devices with a single Zabbix server. This helps them improve the customer experience and driving growth in the process.

Zabbix for aerospace

In the aerospace industry, timely data delivery and issue notification are the keys to safe operations. Aircraft depend on complex electronic systems that can diagnose the slightest deviations and make malfunctions known. Unfortunately, this is often in the form of either an indicator light on an instrument panel or a log message that is accessible only with specialized software or tools.

With Zabbix, all data transfers from the aircraft’s diagnostic system to the responsible employees can happen automatically. Error prioritization and escalation to further levels can also happen automatically if any aircraft has an ongoing issue that remains active for multiple days.

Conclusion

At Zabbix, our goal is a world without interruptions, powered by a world-class universal monitoring solution that’s available and affordable to any business. Our open-source software allows you to monitor your entire IT stack, no matter what size your infrastructure is or where it’s hosted.

That’s why government institutions across the globe as well as some of the world’s largest companies trust us with their network monitoring needs.

Get in touch with us to learn more and get started on the path to maximum efficiency and uptime today!

The post The Zabbix Advantage for Business appeared first on Zabbix Blog.

Monitoring the London Underground with Nathan Liefting and Adan Mohamed

2023-09-26 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/monitoring-the-london-underground-with-nathan-liefting-and-adan-mohamed/26693/

With just a few days remaining until Zabbix Summit 2023, our series of speaker interviews draws to a close as we talk to Opensource ICT Solutions trainer and consultant Nathan Liefting about how he worked with Adan Mohamed of Boldyn Networks to monitor the London Underground with Zabbix.

Please tell us a bit about yourself and your work.

I’m a Zabbix trainer and consultant for Opensource ICT Solutions. You might also know me from the books Brian van Baekel and I wrote called “Zabbix IT Infrastructure Monitoring.”

How long have you been using Zabbix? What kind of daily Zabbix tasks are you involved in at your company?

My tasks are easy to explain – Zabbix, Zabbix, and some more Zabbix! Opensource ICT Solutions is one of the few companies that focus solely on Zabbix, so I get to work full time with the product, 40 hours a week. I build new environments, integrations, automations, and anything that you might need for your Zabbix environment.

Can you give us a sneak peek at what we can expect to hear during your Zabbix Summit speech?

Definitely! Adan from Boldyn Networks and I will be presenting you with a real use case for Zabbix monitoring. We’ll have a look at how Boldyn has brought broadband network connectivity to the London Underground tunnels and why it’s so important to monitor the equipment that makes that all possible. Of course, since this is THE Zabbix summit, we’ll also look at what the Zabbix setup looks like and share a pretty interesting use case for SNMP traps.

How and why did you come to the decision to use Zabbix as the monitoring solution for your use case?

Boldyn was looking for the best network monitoring solution for their project. Since we offer exactly that, we got to talking and we decided that our favorite open-source network monitoring tool was the way to go. Since then, we’ve been building amazing custom monitoring implementations together. The rest is history.

Can you mention some other noteworthy non-standard Zabbix monitoring use cases that you’ve worked on?

Definitely! Since I get to work on monitoring all day long, we’ve got a lot to choose from. I do most of my work in the Netherlands, the United Kingdom, and the United States, and all those markets are super exciting. We’re monitoring infrastructure that keeps planes flying safely and makes sure power grids are up and running, and now we’re also helping to keep people connected to the internet even when they go underground. If you ask me, it doesn’t get a lot more exciting than that – and that’s just the tip of the iceberg.

The post Monitoring the London Underground with Nathan Liefting and Adan Mohamed appeared first on Zabbix Blog.

Simplifying Digital Transformation with Marianna Portela

2023-09-13 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/simplifying-digital-transformation-with-marianna-portela/26609/

To help everyone in our community get up to speed with Zabbix Summit speakers and their topics, we’re continuing our series of interviews and sitting down for a chat with Marianna Portela of Brazilian mass media conglomerate Globo. Read on to get a preview of her Summit speech topic and see how she uses Zabbix to bring massive live events to millions of users around the globe.

Please tell us a bit about yourself and your work.

I’m a tech lead at Globo, the largest media group in Latin America. It includes over-the-air broadcasting, television and film production, a pay television subscription service, streaming media, publishing, and online services.

How long have you been using Zabbix? What kind of daily Zabbix tasks are you involved in at your company?

I have been working at Globo for 15 years. I’ve been involved in monitoring for 11 of those years, and I’ve been using Zabbix for 10. I help monitor the applications that generate data for live events, and I use Zabbix to generate metrics that support decision-making related to better content delivery quality.

Can you name a few of the specific challenges that Zabbix has helped you solve?

Zabbix allows us to empower our users and supports our entire digital transformation – including many things related to Globoplay streaming. It also helps us monitor live event infrastructure, like the Olympics and World Cup. Previously, when there were technical issues during live events, we would try to figure out what happened after the fact, but no longer – Zabbix gives us a proactive analysis of potential occurrences within live production.

Can you give us a sneak peek at what we can expect to hear during your Zabbix Summit speech?

I’m planning to talk about how we use Zabbix to help ensure the quality monitoring of live production, which is essentially the production and the part of Globo that deals with any type of live event and generates data for things like games, for example. I’ll introduce how we started with actual infrastructure monitoring and how this digital transformation at Globo began, specifically how we managed to enter new areas like content generation, especially live content. Then I’ll also discuss some specifics of how we monitor live event infrastructure.

The post Simplifying Digital Transformation with Marianna Portela appeared first on Zabbix Blog.

What is Server Monitoring? Everything You Need to Know

2023-09-12 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/what-is-server-monitoring-everything-you-need-to-know/26617/

Servers are the foundation of a company’s IT infrastructure, and the cost of server downtime can include anything from days without system access to the loss of important business data. This can lead to operational issues, service outages, and steep repair costs.

Viewed against this backdrop, server monitoring is an investment with massive benefits to any organization. The latest generation of server monitoring tools make it easier to assess server health and deal with any underlying issues as quickly and painlessly as possible.

What are servers, and how do they work?

Servers are computers (or applications) that run software services for other computers or devices on a network. The computer takes requests from the client computers or devices and performs tasks in response to the requests. These tasks can involve processing data, providing content, or performing calculations. Some servers are dedicated to hosting web services, which are software services offered on any computer connected to the internet.

What is server monitoring? Why does it matter?

Servers are some of the most important pieces of any company’s IT infrastructure. If a server is offline, running slowly, or experiencing outages, website performance will be affected and customers may decide to go elsewhere. If an internal file server is generating errors, important business data like accounting files or customer records could be compromised.

A server monitoring system is designed to watch your systems and provide a number of key metrics regarding their operation. In general, server monitoring software tests for accessibility (making sure that the server is alive and can be reached) and response time (guaranteeing that it is running fast enough to keep users happy). What’s more, it sends notifications about missing or corrupt files, security violations, and other issues.

Server monitoring is most often used for processing data in real time, but quality server monitoring is also predictive, letting users know when disks will reach capacity and whether memory or CPU utilization is about to be throttled. By evaluating historical data, it’s possible to find out if a server’s performance is degrading over time and even predict when a complete crash might occur.

How can server monitoring help businesses?

Here are a few of the most important business benefits of server monitoring:

Server monitoring tools give you a bird’s-eye view of your server’s health and performance

A quality server monitoring tool keeps IT administrators aware of metrics like CPU usage, RAM, disk space, and network bandwidth. This helps them to see when servers are slowing down or failing, allowing them to act before users are affected.

Server monitoring simplifies process automation

IT teams have long checklists when it comes to managing servers. They need to monitor hard disk space, keep an eye on infrastructure, schedule system backups, and update antivirus software. They also need to be able to foresee and solve critical events, while managing any disruptions.

A server monitoring tool helps IT professionals by automating all or many aspects of these jobs. It can show whether a backup was successful, if software is patched, and whether a server is in good condition. This allows IT teams to focus on tasks that benefit more from their involvement and expertise.

Server monitoring makes it easier to retain customers as well as employees

Acting quickly when servers develop issues (or even before) makes sure that employee workflows aren’t disrupted, allowing them to perform their duties, see results, and reach their goals. It also guarantees a positive customer experience by providing early notification of any issues.

Server monitoring keeps costs down

By automating processes and tasks (and freeing up time in the process) server monitoring systems make the most of resources and reduce costs. And by solving potential issues before they affect the organization, they help businesses avoid lost revenue from unfinished employee tasks, operational delays, and unfinished purchases.

What should you look for in a server monitoring solution?

Now that you’re sold on the benefits of server monitoring, you’ll want to choose the server monitoring solution that’s right for you. Here are a few capabilities to keep in mind:

Ease of use

Does the solution include an intuitive dashboard that makes it easy to monitor events and react to problems quickly? It should, and it should also allow you to make the most of the data it exports by providing graphs, reports, and integrations.

Customer support

Is it easy to contact support? How quickly do they respond? A quality server monitoring solution will provide a defined SLA and stick to it with no exceptions.

Breadth of coverage

A good solution will support all the server types (hardware, software, on-premises, cloud) that your enterprise uses. It should also be flexible enough to support any server types you may implement in the future.

Alert management

There are a few important questions to ask when it comes to alerts:

Does the solution include a dashboard or display that makes it easy to track events and react to problems quickly?
Is it easy to set up alerts via the configuration of thresholds that trigger them? How are alerts delivered?
Does the solution have a way to help you determine why a problem has occurred, instead of just telling you that something has gone wrong without context?

What are some best practices to keep in mind?

Here are a few best practices that will help you avoid the more common server monitoring pitfalls:

Proactively check for failures

Keep a sharp eye out for any issues that may affect your software or hardware. The tools included with a good monitoring solution can alert you to errors caused by a corrupted database (for example) and let you know if a security incident has left important services disabled.

Don’t forget your historical data

Server problems rarely occur in a vacuum, so look into the context of issues that emerge. You can do that by exploring metrics across a specific period, typically between 30 to 90 days. For example, you may find that CPU temperature has increased within the past week, which may suggest a problem with a server cooling system.

Operate your hardware in line with recommended tolerance levels

File servers are commonly pushed to the limit, rarely getting a break. That’s why it’s important to monitor metrics like CPU utilization, RAM utilization, storage capacity usage, and CPU temperature. Check these metrics regularly to identify issues before it’s too late.

Keep track of alerts

Always monitor your alerts in real time as they occur and explore reliable ways to manage and prioritize them. When escalating an incident, make sure it goes to the right individual as soon as possible.

Use server monitoring data to plan short-term cloud capacity

Server monitoring systems can help you plan the right computing power for specific moments. If services become slower or users experience other problems with performance, an IT manager can assess the situation through the server monitor. They’ll then be able to allocate extra resources to solve the problem.

Take advantage of capacity planning

Data center workloads have almost doubled in the past 5 years, and servers have had to keep up with this ongoing change. Analyzing long-term server utilization trends can prepare you for future server requirements.

Go beyond asset management

With server monitoring, you can discover which systems are approaching the end of their lives and whether any assets have disappeared from your network. You can also let your server monitoring tool handle the heavy lifting for you when it comes to tracking physical hardware.

The Zabbix Advantage

Zabbix is designed to make server monitoring easy. Our solution allows you to track any possible server performance metrics and incidents, including server performance, availability, and configuration changes.

Intuitive dashboards, network graphs, and topology maps allow you to visualize server performance and availability, and our flexible alerting allows for multiple delivery methods and customized message content.

Not only that, our out-of-the-box templates come with preconfigured items, triggers, graphs, applications, screens, low-level discovery rules, and web scenarios – all designed to have you up and running in just a few minutes.

And because Zabbix is open-source, it’s not just affordable, it’s free. Contact us to find out more and enjoy the peace of mind that comes from knowing that your servers are under control.

FAQ

Why do we need server monitoring?

Server monitoring allows IT professionals to:

Monitor the responsiveness of a server
Know a server’s capacity, user load, and speed
Proactively detect and prevent any issues that might affect the server

Why do companies choose to monitor their servers?

Companies monitor servers so that they can:

Proactively identify any performance issues before they impact users
Understand a server’s system resource usage
Analyze a server for its reliability, availability, performance, security, etc.

How is server monitoring done?

Server monitoring tools constantly collect system data across an entire IT infrastructure, giving administrators a clear view of when certain metrics are above or below thresholds. They also automatically notify relevant parties if a critical system error is detected, allowing them to act in a timely manner to resolve issues.

What should you monitor on a server?

Key areas to monitor on a server include:

A server’s physical status
Server performance, including CPU utilization, memory resources, and disk activity
Server uptime
Page file usage
Context switches
Time synchronization
Process activity
Server capacity, user load, and speed

If I want to monitor a server, how easy is it to set things up?

Setting up a server monitoring tool is easy, provided you’ve taken into account these 5 steps:

Assess and create a monitoring plan
Discover how data can be collected
Define any and all metrics
Set up alerts
Have an established workflow

The post What is Server Monitoring? Everything You Need to Know appeared first on Zabbix Blog.

What is Network Monitoring? Everything You Need to Know

2023-09-07 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/what-is-network-monitoring-everything-you-need-to-know/26539/

Your company’s network is the glue that bonds your enterprise together. The technology of networking is growing more stable and reliable all the time, but it doesn’t mean you can leave your network unattended – quality network monitoring is an absolute must-have.

What are network monitoring systems?

At its most basic, network monitoring is a critical IT process where all networking components (as well as key performance indicators like network hardware CPU utilization and network bandwidth) are continuously and proactively monitored to improve performance, eliminate bottlenecks, and prevent network congestion and downtime.

Put more simply, it’s the act of keeping an eye on all the connected elements that are relevant to your business. That means all your hardware and software resources, including routers, switches, firewalls, servers, PCs, printers, phones, and tablets.

A network monitoring system is a set of software tools that lets you program this action. It allows you to constantly monitor your network infrastructure by doing systematic tests to look for issues and notifying you if any are found. A good system makes monitoring your network easy by:

Allowing you to see all information in dashboards
Generating reports on demand
Sending alerts
Displaying the monitoring data you need in easy-to-read graphs

What are some key benefits of network monitoring?

A quality network monitoring solution allows you to:

Benchmark standard performance

Monitoring gives you the visibility to benchmark your network’s everyday performance. It also makes it easy to spot any fluctuations in performance, which in turn allows you to identify any unwanted changes.

Effectively allocate resources

IT teams need a clear understanding of the source of problems. They also need the ability to minimize tedious troubleshooting and put in place proactive measures to stay ahead of IT outages. To use a plumbing analogy, monitoring lets them fix cracks before a leak happens.

Identify security threats

Preventing security breaches is a major challenge for any organization. As attacks become increasingly more sophisticated and difficult to trace, detecting and mitigating any form of network threat before it escalates is critical. Network monitoring makes it easier to protect data and systems by providing early warning of any suspicious anomalies.

Manage a changing IT environment

New technologies like internet-enabled sensors, wireless devices, and cloud technologies make it harder for IT teams to track performance fluctuations or suspicious activity. A network monitoring solution can:

Give IT teams a comprehensive inventory of wired and wireless devices
Make it easy to analyze long-term trends
Help you get the most out of your available assets

Proactively detect and resolve issues before they affect users

Monitoring a network closely allows an organization to quickly resolve issues and prevent major disruptions. This means fewer interruptions to operations and better utilization of IT resources.

Deploy new technology and system upgrades successfully

Thanks to monitoring, IT teams can learn how equipment has performed over time and use trend analysis to see whether current technology can scale to meet business needs. This can:

Give a clear picture of whether a network is able to support the launch of a new technology
Mitigate any risks associated with a major change
Easily demonstrate ROI by providing comprehensive metrics

What are some different types of network monitoring?

Different types of monitoring exist depending on what exactly needs to be monitored. Some of the most common include fault monitoring, log monitoring, network performance monitoring, configuration monitoring, and availability monitoring.

Fault monitoring

As the name suggests, fault monitoring involves finding and reporting faults in a computer network. It is crucial for maintaining uninterrupted network uptime and is essential to keeping all programs and services running smoothly.

Log monitoring

Resources such as servers, applications, and websites continuously generate logs, which can:

Provide valuable insights into user activity
Help a business comply with regulations
Promptly resolve incidents
Boost network security

Network performance monitoring (NPM)

NPM tracks monitoring parameters like latency, network traffic, bandwidth usage, and throughput, with the goal of optimizing user experience. NPM tools provide valuable information that can be used to minimize downtime and troubleshoot network issues.

Configuration monitoring

Monitoring network configuration involves keeping track of the software and firmware in use on the network and making sure that any inconsistencies are identified and addressed. This prevents any gaps in visibility or security.

Network availability monitoring

Availability monitoring is the monitoring of all IT infrastructure to determine the uptime of devices. By consistently monitoring devices and servers, organizations can receive alerts when there is a network crash or when a device becomes unavailable. ICMP, SNMP, and Syslogs are the most commonly used availability monitoring techniques.

How does it work?

Network monitoring uses multiple techniques to test the availability and functionality of a network. Here are a few of the most common techniques used to collect data for monitoring software:

Ping

A ping is the simplest technique that monitoring software uses to test hosts within a network. The monitoring system sends out a signal and records:

Whether the signal was received,
How long it took the host to receive the signal
Whether any signal data was lost

That data is then used to determine:

Whether the host is active
How efficient the host is
The transmission time and packet loss experienced when communicating with the host
Any other vital information

Simple network management protocol (SNMP)

SNMP is the most widely used protocol for modern network management systems. It uses monitoring software to monitor individual devices in a network. In this system, each monitored device has SNMP agent monitoring software that sends information about the device’s performance to the monitoring solution, which collects this information in a database and then analyzes it for errors.

Syslog

Syslog is an automated messaging system that sends messages when an event affects a network device. Technicians can set up devices to send out messages when the device encounters an error, shuts down unexpectedly, encounters a configuration failure, and more. These messages often contain information that can be used for system management as well as security systems.

Scripts

Scripts are simple programs that collect basic information and instruct the network to perform an action within certain conditions. They can fill gaps in monitoring software functionality, performing scheduled tasks such as resetting and reconfiguring a public access computer every night.

Scripts can also be used to collect data, sending out an alert if results don’t fall within certain thresholds. Network managers will usually set these thresholds, programming the network software to send out an alert if data indicates issues, including:

Slow throughput
High error rates
Unavailable devices
Slower-than-usual response times

How can businesses benefit from network monitoring?

Here are 5 ways that quality network monitoring can benefit any business:

Increased reliability

The main function of any monitoring solution is to show whether a device is working or not. A proactive approach to maintaining a healthy network will keep tech support requests and downtime to an absolute minimum.

Improved visibility

Having complete visibility of all your hardware and software assets allows you to easily monitor the health of your network. Monitoring tracks the data moving along cables and through servers, switches, connections, and routers. In the event of a problem, your IT team can identify the root cause and fix the issue quickly.

Enhanced performance

Network monitoring software lets you know which parts of your network are being properly used, overused, or underused. You can also uncover unnecessary costs that can be eliminated or identify a network component that needs upgrading.

Stricter compliance

Today’s IT teams need to meet strict regulatory and protection standards in increasingly complex networks. The latest compliance guidelines recommend actively watching for changes in normal system behavior and unusual data flow. The data provided by monitoring tools makes it easy to assess your entire system and deliver a service that meets all required standards.

Greater profitability

Network monitoring makes businesses more productive by saving network management time and lowering operating costs. If your team is aware of current and impending issues, you can reduce downtime and increase productivity and efficiency.

The Zabbix advantage

At Zabbix, we’ve perfected an enterprise IT infrastructure monitoring software that can deploy anywhere and monitor any device, system, or app in any environment while providing comprehensive data protection, easy integration, and unlimited visualization options.

You can also count on complete transparency, a predictable release cycle, a vibrant and active user community, and an outstanding user experience.

Everything we do scales easily, so we’re able to grow right along with you. What’s more, we offer a comprehensive range of professional services, including implementation, integration, custom development, consulting services, technical support, and a full suite of training programs.

The best part? Because Zabbix is open-source, it’s not just affordable – it’s free. Get in touch with us to find out more and get started on the path to maximum network efficiency today.

FAQ

What is an example of basic network monitoring?

An example of basic network monitoring is a network engineer collecting real-time data from a data center and setting up alerts when a problem (such as a device failure, a temperature spike, a power outage, or a network capacity issue) appears.

What is network monitoring used for?

Network monitoring can:

• Determine whether a network is running optimally in real time
• Proactively identify deficiencies and optimize efficiency
• Catch and repair problems before they impact operations
• Reduce downtime and make sure employees have access to the resources they need
• Boost the availability of APIs and webpages
• Optimize network performance and availability

What is the most popular network monitoring program?

Some of the most popular network monitoring programs available on the market include:

• Zabbix
• SolarWinds Network Performance Monitor
• Auvik
• Datadog
• ManageEngine OpManager
• Site24x7
• Checkmk
• Progress WhatsUp Gold
• Microsoft Resource Monitor
• Wireshark
• Nagios
• Ntop
• Cacti
• FreeNATS
• Icinga

What are the key steps in network monitoring?

A network monitoring process includes all phases involved in executing efficient network monitoring. These phases include:

Locating all key network components
Actively monitoring the components
Creating alerts for component health and metrics
Making a plan for managing issues
Analyzing generated reports
Adjusting the process as necessary

The post What is Network Monitoring? Everything You Need to Know appeared first on Zabbix Blog.

Monitoring green power and distributed edge computing infrastructure with Hiroshi Abe

2023-09-05 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/monitoring-green-power-and-distributed-edge-computing-infrastructure-with-hiroshi-abe/26451/

With Zabbix Summit 2023 almost upon us, we’ve prepared a short and direct interview with Summit presenter Dr. Hiroshi Abe. Dr. Abe, a Research Engineer at the Toyota Motor Corporation, will share his thoughts about how Zabbix is the ideal solution when it comes to monitoring green power and distributed edge computing.

Please tell us a bit about yourself and your work.

I have been working for the Toyota Motor Corporation as a Research Engineer since 2019. My current research topics are related to large-scale monitoring systems that target connected car communications, edge computing, and green IT.

How long have you been using Zabbix? What Zabbix tasks are you involved in every day at your company?

Although I am technically retired, since 2015 I have been a member of the Monitoring team of the Network Operation Center, which is part of the ShowNet building team for the “Interop Tokyo” show event in Japan. I have been working with Kodai Terashima, CEO of Zabbix Japan, to build a monitoring system using Zabbix to monitor the event network required for ShowNet. In my office, we use Zabbix to monitor network and server equipment as well as our R&D environment.

Can you give us a sneak peek at what we can expect to hear during your Zabbix Summit speech?

You might expect to hear something deeply related to cars, and it’s true that much of the data created by cars can be processed using edge computing before being transported to the cloud. However, edge computing for the optimal use of green power will be the main topic of my talk. I’ll discuss a distributed monitoring system that uses Zabbix and Zabbix Proxy as a monitoring system for edge environments and green power in multiple data centers.

What made you go with Zabbix as a monitoring solution for green power and edge computing?

Zabbix Proxy is an easy-to-use distributed monitoring solution. A distributed monitoring system is a must for us because there will be multiple edge computing locations all over Japan.

Are you deploying Zabbix using containers to monitor your DCs?

We used RedHat’s OpenShift to implement the edge computing and data synchronization mechanism. We were able to easily deploy Zabbix in OpenShift as a container using Operator, and the monitoring environment using Zabbix containers is implemented in multiple DCs.

The post Monitoring green power and distributed edge computing infrastructure with Hiroshi Abe appeared first on Zabbix Blog.

Integrating DevOps Guru Insights with CloudWatch Dashboard

2023-05-04 Suresh Babu

Post Syndicated from Suresh Babu original https://aws.amazon.com/blogs/devops/integrating-devops-guru-insights-with-cloudwatch-dashboard/

Many customers use Amazon CloudWatch dashboards to monitor applications and often ask how they can integrate Amazon DevOps Guru Insights in order to have a unified dashboard for monitoring. This blog post showcases integrating DevOps Guru proactive and reactive insights to a CloudWatch dashboard by using Custom Widgets. It can help you to correlate trends over time and spot issues more efficiently by displaying related data from different sources side by side and to have a single pane of glass visualization in the CloudWatch dashboard.

Amazon DevOps Guru is a machine learning (ML) powered service that helps developers and operators automatically detect anomalies and improve application availability. DevOps Guru’s anomaly detectors can proactively detect anomalous behavior even before it occurs, helping you address issues before they happen; detailed insights provide recommendations to mitigate that behavior.

Amazon CloudWatch dashboard is a customizable home page in the CloudWatch console that monitors multiple resources in a single view. You can use CloudWatch dashboards to create customized views of the metrics and alarms for your AWS resources.

Solution overview

This post will help you to create a Custom Widget for Amazon CloudWatch dashboard that displays DevOps Guru Insights. A custom widget is part of your CloudWatch dashboard that calls an AWS Lambda function containing your custom code. The Lambda function accepts custom parameters, generates your dataset or visualization, and then returns HTML to the CloudWatch dashboard. The CloudWatch dashboard will display this HTML as a widget. In this post, we are providing sample code for the Lambda function that will call DevOps Guru APIs to retrieve the insights information and displays as a widget in the CloudWatch dashboard. The architecture diagram of the solution is below.

Solution Architecture

Figure 1: Reference architecture diagram

Prerequisites and Assumptions

An AWS account. To sign up:
- Create an AWS account. For instructions, see Sign Up For AWS.
DevOps Guru should be enabled in the account. For enabling DevOps guru, see DevOps Guru Setup
Follow this Workshop to deploy a sample application in your AWS Account which can help generate some DevOps Guru insights.

Solution Deployment

We are providing two options to deploy the solution – using the AWS console and AWS CloudFormation. The first section has instructions to deploy using the AWS console followed by instructions for using CloudFormation. The key difference is that we will create one Widget while using the Console, but three Widgets are created when we use AWS CloudFormation.

Using the AWS Console:

We will first create a Lambda function that will retrieve the DevOps Guru insights. We will then modify the default IAM role associated with the Lambda function to add DevOps Guru permissions. Finally we will create a CloudWatch dashboard and add a custom widget to display the DevOps Guru insights.

Navigate to the Lambda Console after logging to your AWS Account and click on Create function.

Figure 2a: Create Lambda Function
Choose Author from Scratch and use the runtime Node.js 16.x. Leave the rest of the settings at default and create the function.

Figure 2b: Create Lambda Function

After a few seconds, the Lambda function will be created and you will see a code source box. Copy the code from the text box below and replace the code present in code source as shown in screen print below.

// SPDX-License-Identifier: MIT-0
// CloudWatch Custom Widget sample: displays count of Amazon DevOps Guru Insights
const aws = require('aws-sdk');

const DOCS = `## DevOps Guru Insights Count
Displays the total counts of Proactive and Reactive Insights in DevOps Guru.
`;

async function getProactiveInsightsCount(DevOpsGuru, StartTime, EndTime) {
    let NextToken = null;
    let proactivecount=0;

    do {
        const args = { StatusFilter: { Any : { StartTimeRange: { FromTime: StartTime, ToTime: EndTime }, Type: 'PROACTIVE'  }}}
        const result = await DevOpsGuru.listInsights(args).promise();
        console.log(result)
        NextToken = result.NextToken;
        result.ProactiveInsights.forEach(res =&gt; {
        console.log(result.ProactiveInsights[0].Status)
        proactivecount++;
        });
        } while (NextToken);
    return proactivecount;
}

async function getReactiveInsightsCount(DevOpsGuru, StartTime, EndTime) {
    let NextToken = null;
    let reactivecount=0;

    do {
        const args = { StatusFilter: { Any : { StartTimeRange: { FromTime: StartTime, ToTime: EndTime }, Type: 'REACTIVE'  }}}
        const result = await DevOpsGuru.listInsights(args).promise();
        NextToken = result.NextToken;
        result.ReactiveInsights.forEach(res =&gt; {
        reactivecount++;
        });
        } while (NextToken);
    return reactivecount;
}

function getHtmlOutput(proactivecount, reactivecount, region, event, context) {

    return `DevOps Guru Proactive Insights&lt;br&gt;&lt;font size="+10" color="#FF9900"&gt;${proactivecount}&lt;/font&gt;
    &lt;p&gt;DevOps Guru Reactive Insights&lt;/p&gt;&lt;font size="+10" color="#FF9900"&gt;${reactivecount}`;
}

exports.handler = async (event, context) =&gt; {
    if (event.describe) {
        return DOCS;
    }
    const widgetContext = event.widgetContext;
    const timeRange = widgetContext.timeRange.zoom || widgetContext.timeRange;
    const StartTime = new Date(timeRange.start);
    const EndTime = new Date(timeRange.end);
    const region = event.region || process.env.AWS_REGION;
    const DevOpsGuru = new aws.DevOpsGuru({ region });

    const proactivecount = await getProactiveInsightsCount(DevOpsGuru, StartTime, EndTime);
    const reactivecount = await getReactiveInsightsCount(DevOpsGuru, StartTime, EndTime);

    return getHtmlOutput(proactivecount, reactivecount, region, event, context);
    
};

Figure 3: Lambda Function Source Code

Click on Deploy to save the function code
Since we used the default settings while creating the function, a default Execution role is created and associated with the function. We will need to modify the IAM role to grant DevOps Guru permissions to retrieve Proactive and Reactive insights.
Click on the Configuration tab and select Permissions from the left side option list. You can see the IAM execution role associated with the function as shown in figure 4.

Figure 4: Lambda function execution role
Click on the IAM role name to open the role in the IAM console. Click on Add Permissions and select Attach policies.

Figure 5: IAM Role Update
Search for DevOps and select the AmazonDevOpsGuruReadOnlyAccess. Click on Add permissions to update the IAM role.

Figure 6: IAM Role Policy Update
Now that we have created the Lambda function for our custom widget and assigned appropriate permissions, we can navigate to CloudWatch to create a Dashboard.
Navigate to CloudWatch and click on dashboards from the left side list. You can choose to create a new dashboard or add the widget in an existing dashboard.
We will choose to create a new dashboard

Figure 7: Create New CloudWatch dashboard
Choose Custom Widget in the Add widget page

Figure 8: Add widget
Click Next in the custom widge page without choosing a sample

Figure 9: Custom Widget Selection
Choose the region where devops guru is enabled. Select the Lambda function that we created earlier. In the preview pane, click on preview to view DevOps Guru metrics. Once the preview is successful, create the Widget.

Figure 10: Create Custom Widget
Congratulations, you have now successfully created a CloudWatch dashboard with a custom widget to get insights from DevOps Guru. The sample code that we provided can be customized to suit your needs.

Using AWS CloudFormation

You may skip this step and move to future scope section if you have already created the resources using AWS Console.

In this step we will show you how to deploy the solution using AWS CloudFormation. AWS CloudFormation lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code. Customers define an initial template and then revise it as their requirements change. For more information on CloudFormation stack creation refer to this blog post.

The following resources are created.

Three Lambda functions that will support CloudWatch Dashboard custom widgets
An AWS Identity and Access Management (IAM) role to that allows the Lambda function to access DevOps Guru Insights and to publish logs to CloudWatch
Three Log Groups under CloudWatch
A CloudWatch dashboard with widgets to pull data from the Lambda Functions

To deploy the solution by using the CloudFormation template

You can use this downloadable template to set up the resources. To launch directly through the console, choose Launch Stack button, which creates the stack in the us-east-1 AWS Region.
Choose Next to go to the Specify stack details page.
(Optional) On the Configure Stack Options page, enter any tags, and then choose Next.
On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.

It takes approximately 2-3 minutes for the provisioning to complete. After the status is “Complete”, proceed to validate the resources as listed below.

Validate the resources

Now that the stack creation has completed successfully, you should validate the resources that were created.

On AWS Console, head to CloudWatch, under Dashboards – there will be a dashboard created with name <StackName-Region>.
On AWS Console, head to CloudWatch, under LogGroups there will be 3 new log-groups created with the name as:
- lambdaProactiveLogGroup
- lambdaReactiveLogGroup
- lambdaSummaryLogGroup
On AWS Console, head to Lambda, there will be lambda function(s) under the name:
- lambdaFunctionDGProactive
- lambdaFunctionDGReactive
- lambdaFunctionDGSummary
On AWS Console, head to IAM, under Roles there will be a new role created with name “lambdaIAMRole”

To View Results/Outcome

With the appropriate time-range setup on CloudWatch Dashboard, you will be able to navigate through the insights that have been generated from DevOps Guru on the CloudWatch Dashboard.

Figure 11: DevOpsGuru Insights in Cloudwatch Dashboard

Cleanup

For cost optimization, after you complete and test this solution, clean up the resources. You can delete them manually if you used the AWS Console or by deleting the AWS CloudFormation stack called devopsguru-cloudwatch-dashboard if you used AWS CloudFormation.

For more information on deleting the stacks, see Deleting a stack on the AWS CloudFormation console.

Conclusion

This blog post outlined how you can integrate DevOps Guru insights into a CloudWatch Dashboard. As a customer, you can start leveraging CloudWatch Custom Widgets to include DevOps Guru Insights in an existing Operational dashboard.

AWS Customers are now using Amazon DevOps Guru to monitor and improve application performance. You can start monitoring your applications by following the instructions in the product documentation. Head over to the Amazon DevOps Guru console to get started today.

To learn more about AIOps for Serverless using Amazon DevOps Guru check out this video.

Improved Alerting with Atlas Streaming Eval

2023-04-27 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/improved-alerting-with-atlas-streaming-eval-e691c60dc61e

Ruchir Jha, Brian Harrington, Yingwu Zhao

TL;DR

Streaming alert evaluation scales much better than the traditional approach of polling time-series databases.
It allows us to overcome high dimensionality/cardinality limitations of the time-series database.
It opens doors to support more exciting use-cases.

Engineers want their alerting system to be realtime, reliable, and actionable. While actionability is subjective and may vary by use-case, reliability is non-negotiable. In other words, false positives are bad but false negatives are the absolute worst!

A few years ago, we were paged by our SRE team due to our Metrics Alerting System falling behind — critical application health alerts reached engineers 45 minutes late! As we investigated the alerting delay, we found that the number of configured alerts had recently increased dramatically, by 5 times! The alerting system queried Atlas, our time series database on a cron for each configured alert query, and was seeing an elevated throttle rate and excessive retries with backoffs. This, in turn, increased the time between two consecutive checks for an alert, causing a global slowdown for all alerts. On further investigation, we discovered that one user had programmatically created tens of thousands of new alerts. This user represented a platform team at Netflix, and their goal was to build alerting automation for their users.

While we were able to put out the immediate fire by disabling the newly created alerts, this incident raised some critical concerns around the scalability of our alerting system. We also heard from other platform teams at Netflix who wanted to build similar automation for their users who, given our state at the time, wouldn’t have been able to do so without impacting Mean Time To Detect (MTTD) for all others. Rather, we were looking at an order of magnitude increase in the number of alert queries just over the next 6 months!

Since querying Atlas was the bottleneck, our first instinct was to scale it up to meet the increased alert query demand; however, we soon realized that would increase Atlas cost prohibitively. Atlas is an in-memory time-series database that ingests multiple billions of time-series per day and retains the last two weeks of data. It is already one of the largest services at Netflix both in size and cost. While Atlas is architected around compute & storage separation, and we could theoretically just scale the query layer to meet the increased query demand, every query, regardless of its type, has a data component that needs to be pushed down to the storage layer. To serve the increasing number of push down queries, the in-memory storage layer would need to scale up as well, and it became clear that this would push the already expensive storage costs far higher. Moreover, common database optimizations like caching recently queried data don’t really work for alerting queries because, generally speaking, the last received datapoint is required for correctness. Take for example, this alert query that checks if errors as a % of total RPS exceeds a threshold of 50% for 4 out of the last 5 minutes:

name,errors,:eq,:sum,
name,rps,:eq,:sum,
:div,
100,:mul,
50,:gt,
5,:rolling-count,4,:gt,

Say if the datapoint received for the last time interval leads to a positive evaluation for this query, relying on stale/cached data would either increase MTTD or result in the perception of a false negative, at least until the missing data is fetched and evaluated. It became clear to us that we needed to solve the scalability problem with a fundamentally different approach. Hence, we started down the path of alert evaluation via real-time streaming metrics.

High Level Architecture

The idea, at a high level, was to avoid the need to query the Atlas database almost entirely and transition most alert queries to streaming evaluation.

Alert queries are submitted either via our Alerting UI or by API clients, which are then saved to a custom config database that supports streaming config updates (full snapshot + update notifications). The Alerting Service receives these config updates and hashes every new or updated alert query for evaluation to one of its nodes by leveraging Edda Slots. The node responsible for evaluating a query, starts by breaking it down into a set of “data expressions” and with them subscribes to an upstream “broker” service. Data expressions define what data needs to be sourced in order to evaluate a query. For the example query listed above, the data expressions are name,errors,:eq,:sum and name,rps,:eq,:sum. The broker service acts as a subscription manager that maps a data expression to a set of subscriptions. In addition, it also maintains a Query Index of all active data expressions which is consulted to discern if an incoming datapoint is of interest to an active subscriber. The internals here are outside the scope of this blog post.

Next, the Alerting service (via the atlas-eval library) maps the received data points for a data expression to the alert query that needs them. For alert queries that resolve to more than one data expression, we align the incoming data points for each one of those data expressions on the same time boundary before emitting the accumulated values to the final eval step. For the example above, the final eval step would be responsible for computing the ratio and maintaining the rolling-count, which is keeping track of the number of intervals in which the ratio crossed the threshold as shown below:

The atlas-eval library supports streaming evaluation for most if not all Query, Data, Math and Stateful operators supported by Atlas today. Certain operators such as offset, integral, des are not supported on the streaming path.

OK, Results?

First and foremost, we have successfully alleviated our initial scalability problem with the polling based architecture. Today, we run 20X the number of queries we used to run a few years ago, with ease and at a fraction of what it would have cost to scale up the Atlas storage layer to serve the same volume. Multiple platform teams at Netflix programmatically generate and maintain alerts on behalf of their users without having to worry about impacting other users of the system. We are able to maintain strong SLAs around Mean Time To Detect (MTTD) regardless of the number of alerts being evaluated by the system.

Additionally, streaming evaluation allowed us to relax restrictions around high cardinality that our users were previously running into — alert queries that were rejected by Atlas Backend before due to cardinality constraints are now getting checked correctly on the streaming path. In addition, we are able to use Atlas Streaming to monitor and alert on some very high cardinality use-cases, such as metrics derived from free-form log data.

Finally, we switched Telltale, our holistic application health monitoring system, from polling a metrics cache to using realtime Atlas Streaming. The fundamental idea behind Telltale is to detect anomalies on SLI metrics (for example, latency, error rates, etc). When such anomalies are detected, Telltale is able to compute correlations with similar metrics emitted from either upstream or downstream services. In addition, it also computes correlations between SLI metrics and custom metrics like the log derived metrics mentioned above. This has proven to be valuable towards reducing Mean Time to Recover (MTTR). For example, we are able to now correlate increased error rates with increased rate of specific exceptions occurring in logs and even point to an exemplar stacktrace, as shown below:

Our logs pipeline fingerprints every log message and attaches a (very high cardinality) fingerprint tag to a log events counter that is then emitted to Atlas Streaming. Telltale consumes this metric in a streaming fashion to identify fingerprints that correlate with anomalies seen in SLI metrics. Once an anomaly is found, we query the logs backend with the fingerprint hash to obtain the exemplar stacktrace. What’s more is we are now able to identify correlated anomalies (and exceptions) occurring in services that may be N hops away from the affected service. A system like Telltale becomes more effective as more services are onboarded (and for that matter the full service graph), because otherwise it becomes difficult to root cause the problem, especially in a microservices-based architecture. A few years ago, as noted in this blog, only about a hundred services were using Telltale; thanks to Atlas Streaming we have now managed to onboard thousands of other services at Netflix.

Finally, we realized that once you remove limits on the number of monitored queries, and start supporting much higher metric dimensionality/cardinality without impacting the cost/performance profile of the system, it opens doors to many exciting new possibilities. For example, to make alerts more actionable, we may now be able to compute correlations between SLI anomalies and custom metrics with high cardinality dimensions, for example an alert on elevated HTTP error rates may be able to point to impacted customer cohorts, by linking to precisely correlated exemplars. This would help developers with reproducibility.

Transitioning to the streaming path has been a long journey for us. One of the challenges was difficulty in debugging scenarios where the streaming path didn’t agree with what is returned by querying the Atlas database. This is especially true when either the data is not available in Atlas or the query is not supported because of (say) cardinality constraints. This is one of the reasons it has taken us years to get here. That said, early signs indicate that the streaming paradigm may help with tackling a cardinal problem in observability — effective correlation between the metrics & events verticals (logs, and potentially traces in the future), and we are excited to explore the opportunities that this presents for Observability in general.

Improved Alerting with Atlas Streaming Eval was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Monitoring Amazon DevOps Guru insights using Amazon Managed Grafana

2023-04-19 MJ Kubba

Post Syndicated from MJ Kubba original https://aws.amazon.com/blogs/devops/monitoring-amazon-devops-guru-insights-using-amazon-managed-grafana/

As organizations operate day-to-day, having insights into their cloud infrastructure state can be crucial for the durability and availability of their systems. Industry research estimates[1] that downtime costs small businesses around $427 per minute of downtime, and medium to large businesses an average of $9,000 per minute of downtime. Amazon DevOps Guru customers want to monitor and generate alerts using a single dashboard. This allows them to reduce context switching between applications, providing them an opportunity to respond to operational issues faster.

DevOps Guru can integrate with Amazon Managed Grafana to create and display operational insights. Alerts can be created and communicated for any critical events captured by DevOps Guru and notifications can be sent to operation teams to respond to these events. The key telemetry data types of logs and metrics are parsed and filtered to provide the necessary insights into observability.

Furthermore, it provides plug-ins to popular open-source databases, third-party ISV monitoring tools, and other cloud services. With Amazon Managed Grafana, you can easily visualize information from multiple AWS services, AWS accounts, and Regions in a single Grafana dashboard.

In this post, we will walk you through integrating the insights generated from DevOps Guru with Amazon Managed Grafana.

Solution Overview:

This architecture diagram shows the flow of the logs and metrics that will be utilized by Amazon Managed Grafana. Insights originate from DevOps Guru, each insight generating an event. These events are captured by Amazon EventBridge, and then saved as logs to Amazon CloudWatch Log Group DevOps Guru service metrics, and then parsed by Amazon Managed Grafana to create new dashboards.

This architecture diagram shows the flow of the logs and metrics that will be utilized by Amazon Managed Grafana, starting with DevOps Guru and then using Amazon EventBridge to save the insight event logs to Amazon CloudWatch Log Group DevOps Guru service metrics to be parsed by Amazon Managed Grafana and create new dashboards in Grafana from these logs and Metrics.

Now we will walk you through how to do this and set up notifications to your operations team.

Prerequisites:

The following prerequisites are required for this walkthrough:

An AWS Account
Enabled DevOps Guru on your account with CloudFormation stack, or tagged resources monitored.

Using Amazon CloudWatch Metrics

DevOps Guru sends service metrics to CloudWatch Metrics. We will use these to track metrics for insights and metrics for your DevOps Guru usage; the DevOps Guru service reports the metrics to the AWS/DevOps-Guru namespace in CloudWatch by default.

First, we will provision an Amazon Managed Grafana workspace and then create a Dashboard in the workspace that uses Amazon CloudWatch as a data source.

Setting up Amazon CloudWatch Metrics

Create Grafana Workspace
Navigate to Amazon Managed Grafana from AWS console, then click Create workspace

a. Select the Authentication mechanism

i. AWS IAM Identity Center (AWS SSO) or SAML v2 based Identity Providers

ii. Service Managed Permission or Customer Managed

iii. Choose Next

b. Under “Data sources and notification channels”, choose Amazon CloudWatch

c. Create the Service.

You can use this post for more information on how to create and configure the Grafana workspace with SAML based authentication.

Next, we will show you how to create a dashboard and parse the Logs and Metrics to display the DevOps Guru insights and recommendations.

2. Configure Amazon Managed Grafana

a. Add CloudWatch as a data source:
From the left bar navigation menu, hover over AWS and select Data sources.

b. From the Services dropdown select and configure CloudWatch.

3. Create a Dashboard

a. From the left navigation bar, click on add a new Panel.

b. You will see a demo panel.

c. In the demo panel – Click on Data source and select Amazon CloudWatch.

The Amazon Grafana Workspace dashboard with the Grafana data source dropdown menu open. The drop down has 'Amazon CloudWatch (region name)' highlighted, other options include 'Mixed, 'Dashboard', and 'Grafana'.

d. For this panel we will use CloudWatch metrics to display the number of insights.

e. From Namespace select the AWS/DevOps-Guru name space, Insights as Metric name and Average for Statistics.

In the Amazon Grafana Workspace dashboard the user has entered values in three fields. "Grafana Query with Namespace" has the chosen value: AWS/DevOps-Guru. "Metric name" has the chosen value: Insights. "Statistic" has the chosen value: Average.

click apply

Time series graph contains a single new data point, indicting a recent event.

f. This is our first panel. We can change the panel name from the right-side bar under Title. We will name this panel “Insights”

g. From the top right menu, click save dashboard and give your new dashboard a name

Using Amazon CloudWatch Logs via Amazon EventBridge

For other insights outside of the service metrics, such as a number of insights per specific service or the average for a region or for a specific AWS account, we will need to parse the event logs. These logs first need to be sent to Amazon CloudWatch Logs. We will go over the details on how to set this up and how we can parse these logs in Amazon Managed Grafana using CloudWatch Logs Query Syntax. In this post, we will show a couple of examples. For more details, please check out this User Guide documentation. This is not done by default and we will need to use Amazon EventBridge to pass these logs to CloudWatch.

DevOps Guru logs include other details that can be helpful when building Dashboards, such as region, Insight Severity (High, Medium, or Low), associated resources, and DevOps guru dashboard URL, among other things. For more information, please check out this User Guide documentation.

EventBridge offers a serverless event bus that helps you receive, filter, transform, route, and deliver events. It provides one to many messaging solutions to support decoupled architectures, and it is easy to integrate with AWS Services and 3rd-party tools. Using Amazon EventBridge with DevOps Guru provides a solution that is easy to extend to create a ticketing system through integrations with ServiceNow, Jira, and other tools. It also makes it easy to set up alert systems through integrations with PagerDuty, Slack, and more.

Setting up Amazon CloudWatch Logs

Let’s dive in to creating the EventBridge rule and enhance our Grafana dashboard:

a. First head to Amazon EventBridge in the AWS console.

b. Click Create rule.

Type in rule Name and Description. You can leave the Event bus to default and Rule type to Rule with an event pattern.

c. Select AWS events or EventBridge partner events.

For event Pattern change to Customer patterns (JSON editor) and use:

{"source": ["aws.devops-guru"]}

This filters for all events generated from DevOps Guru. You can use the same mechanism to filter out specific messages such as new insights, or insights closed to a different channel. For this demonstration, let’s consider extracting all events.

As the user configures their EventBridge Rule, for the Creation method they have chosen "Custom pattern (JSON editor) write an event pattern in JSON." For the Event pattern editor just below they have entered {"source":["aws.devops-guru"]}

d. Next, for Target, select AWS service.

Then use CloudWatch log Group.

For the Log Group, give your group a name, such as “devops-guru”.

In the prompt for the new Target's configurations, the user has chosen AWS service as the Target type. For the Select a target drop down, they chose CloudWatch log Group. For the log group, they selected the /aws/events radio option, and then filled in the following input text box with the kebab case group name devops-guru.

e. Click Create rule.

f. Navigate back to Amazon Managed Grafana.
It’s time to add a couple more additional Panels to our dashboard. Click Add panel.
Then Select Amazon CloudWatch, and change from metrics to CloudWatch Logs and select the Log Group we created previously.

In the Grafana Workspace, the user has "Data source" selected as Amazon CloudWatch us-east-1. Underneath that they have chosen to use the default region and CloudWatch Logs. Below that, for the Log Groups they have entered /aws/events/DevOpsGuru

g. For the query use the following to get the number of closed insights:

fields @detail.messageType
| filter detail.messageType="CLOSED_INSIGHT"
| count(detail.messageType)

You’ll see the new dashboard get updated with “Data is missing a time field”.

New panel suggestion with switch to table or open visualization suggestions

You can either open the suggestions and select a gauge that makes sense;

New Suggestions display a dial graph, a bar graph, and a count numerical tracker

Or choose from multiple visualization options.

Now we have 2 panels:

Two panels are shown, one is the new dial graph, and the other is the time series graph that was created earlier.

h. You can repeat the same process. To create 3^rd panel for the new insights using this query:

fields @detail.messageType 
| filter detail.messageType="NEW_INSIGHT" 
| count(detail.messageType)

Now we have 3 panels:

Grafana now shows three 3 panels. Two dial graphs, and the time series graph.

Next, depending on the visualizations, you can work with the Logs and metrics data types to parse and filter the data.

Setting up a 4th panel as table. Under the Query tab, in the query editor, the user has entered the text: fields detail.messageType, detail.insightSeverity, detail.insightUrlfilter | filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"

i. For our fourth panel, we will add DevOps Guru dashboard direct link to the AWS Console.

Repeat the same process as demonstrated previously one more time with this query:

fields detail.messageType, detail.insightSeverity, detail.insightUrlfilter 
| filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"

Switch to table when prompted on the panel.

Grafana now shows 4 panels. The new panel displays a data table that contains information about the most recent DevOps Guru insights. There are also the two dial graphs, and the time series graph from before.

This will give us a direct link to the DevOps Guru dashboard and help us get to the insight details and Recommendations.

Save your dashboard.

You can extend observability by sending notifications through alerts on dashboards of panels providing metrics. The alerts will be triggered when a condition is met. The Alerts are communicated with Amazon SNS notification mechanism. This is our SNS notification channel setup.

Screenshot: notification settings show Name: DevopsGuruAlertsFromGrafana and Type: SNS

A previously created notification is used next to communicate any alerts when the condition is met across the metrics being observed.

Screenshot: notification setting with condition when count of query is above 5, a notification is sent to DevopsGuruAlertsFromGrafana with message, "More than 5 insights in the past 1 hour"

Cleanup

To avoid incurring future charges, delete the resources.

Navigate to EventBridge in AWS console and delete the rule created in step 4 (a-e) “devops-guru”.
Navigate to CloudWatch logs in AWS console and delete the log group created as results of step 4 (a-e) named “devops-guru”.
Amazon Managed Grafana: Navigate to Amazon Managed Grafana service and delete the Grafana services you created in step 1.

Conclusion

In this post, we have demonstrated how to successfully incorporate Amazon DevOps Guru insights into Amazon Managed Grafana and use Grafana as the observability tool. This will allow Operations team to successfully observe the state of their AWS resources and notify them through Alarms on any preset thresholds on DevOps Guru metrics and logs. You can expand on this to create other panels and dashboards specific to your needs. If you don’t have DevOps Guru, you can start monitoring your AWS applications with AWS DevOps Guru today using this link.

[1] https://www.atlassian.com/incident-management/kpis/cost-of-downtime

About the authors:

Logging strategies for security incident response

2023-04-04 Anna McAbee

Post Syndicated from Anna McAbee original https://aws.amazon.com/blogs/security/logging-strategies-for-security-incident-response/

Effective security incident response depends on adequate logging, as described in the AWS Security Incident Response Guide. If you have the proper logs and the ability to query them, you can respond more rapidly and effectively to security events. If a security event occurs, you can use various log sources to validate what occurred and understand the scope. Then, you can use the results of your analysis to take remediation actions. To learn more about logging best practices, see Configure service and application logging and Analyze logs, findings, and metrics centrally.

In this blog post, we will show you how to achieve an effective strategy for logging for security incident response. We will share logging options across the typical cloud application stack, log analysis options, and sample queries. AWS offers managed services, such as Amazon GuardDuty for threat detection and Amazon Detective for incident analysis. If you want to collect additional logs or perform custom analysis, then you should consider the options described in this blog post.

Selection of logs

To select the appropriate logs for security incident response, you should start with the common cloud application stack, which consists of the components and layers of your application deployed on AWS. For each component, we will describe the logging sources that you have. For each log source, we will describe why you should log it for security incident response, how to enable the logs, and what your log storage options are.

To select the logs for security incident response, first consider the following questions:

What are your compliance and regulatory requirements for logging?

Note: Make sure that you comply with the log retention requirements of compliance standards relevant to your organization, as well as your organization’s incident response strategy.
What AWS services do you commonly use?
What AWS services have access to or contain sensitive data?
What threats are most relevant to you?

Note: Performing a threat model of your cloud architectures can help you answer this question. For more information, see How to approach threat modelling.

Considering these questions can help you develop requirements for logging that will guide your selection of the following log sources.

AWS account logs

An AWS account is the first, fundamental component of an application deployed on AWS. The account is a container for your AWS resources. You create and manage your AWS resources in this account, and the account provides administrative capabilities for access and billing.

AWS CloudTrail

Within an account, each action performed is an API call. From a console sign-in to the deployment of each resource in an AWS CloudFormation stack, events are generated to provide transparency on what has occurred in the account. With AWS CloudTrail, you can log, continuously monitor, and retain account activity related to actions across supported AWS services. CloudTrail provides the event history of your account activity, including actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. CloudTrail logs API calls as three types of events:

Management events (also known as control plane operations) show management operations that are performed on resources in your account. This includes actions like creating an Amazon Simple Storage Service (Amazon S3) bucket and setting up logging.
Data events (also known as data plane operations) show the resource operations performed on or within resources in your account. These operations are often high-volume activities, such as Amazon S3 object-level API activity (for example, GetObject, DeleteObject, and PutObject API operations) and AWS Lambda function invocation activity.
Insights events capture unusual API call rate or error rate activity in your account. You must enable these events on a trail in order to capture them, and they are logged to a different folder prefix in the destination S3 bucket for your trail. Insights events provide you with information such as the type of event, the incident time period, the associated API, the error code, and statistics to help you understand and respond effectively to unusual activity.

For security investigations, CloudTrail provides context on the creation, modification, and deletion of AWS resources. Therefore, CloudTrail is one of your most important log sources for security incident response in an AWS environment. You have three primary ways to set up CloudTrail:

CloudTrail Event history — CloudTrail is enabled by default with 90-day retention of management events that you can retrieve through the CloudTrail Event history facility using the console, AWS Command line Interface (AWS CLI), or AWS SDK. You don’t need to take any action to get started using the Event history feature.
CloudTrail trail — For longer retention and visibility of data events, you need to create a CloudTrail trail and associate it with an S3 bucket and optionally with an Amazon CloudWatch log group. If you use AWS Organizations, you can create an organization trail that will log events for each account in the organization. By default, trails are multi-Region, so you don’t need to enable CloudTrail logs in each AWS Region.
AWS CloudTrail Lake — You can create a CloudTrail lake, which retains CloudTrail logs for up to seven years and provides a SQL-based querying facility. You don’t need to have a trail configured in your account to use CloudTrail Lake.
Amazon Security Lake — You can use Security Lake to ingest CloudTrail events, which include management and data events. You can further analyze these events with Amazon QuickSight or another other third-party security information and event management (SIEM) tool.

AWS Config

Creating and modifying resources is an integral part of your account use. Tracking resource configuration changes made by calling the AWS API helps you review changes throughout the resource lifecycle. AWS Config provides a detailed view of the configuration of AWS resources in your account, examines the resource configurations periodically, and tracks configuration changes that were not initiated by the API. This includes how the resources are related to one another and how they were configured in the past so that you can see how configurations and relationships change over time.

You should enable AWS Config in each Region where you have resources deployed, and you should configure an S3 bucket to receive configuration history and configuration snapshot files, which contain details on the resources that AWS Config records. You can then review configuration compliance and analyze activities performed before, during, and after an event using the configuration history in S3. You should centralize AWS Config resource tracking across multiple accounts in the same organization by setting up an aggregator. You can use AWS Control Tower to automate the setup.

During a security investigation, you might want to understand how a resource configuration has changed over time. For example, you might want to investigate the changes to an S3 bucket policy before and after a security event that involves an S3 bucket. AWS Config provides a configuration history for resources that can help you track activities performed during a security event.

Operating system and application logs

To record interactions with applications, you must capture operating system (OS) and application logs, especially custom logs generated by the application development framework. OS and local application logs are relevant for security events that involve an Amazon Elastic Compute Cloud (Amazon EC2) instance. These instances could be standalone, in an auto scaling group behind a load balancer, or compute workloads for Amazon Elastic Container Service (Amazon ECS) or an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. OS logs track privileged use, processes, login events, access to directory services, and file system activity on a server. To analyze a potential compromise to an EC2 instance, you will want to review the security event logs for Windows OS and the system logs for Linux-based OS.

With the unified CloudWatch agent, you can collect metrics and logs from EC2 instances and on-premises servers. The CloudWatch agent aggregates log data into CloudWatch logs, which can then be exported to Amazon S3 for long-term retention and analyzed with a SIEM tool of your choice or Amazon Athena, as shown in Figure 1.

Figure 1: Aggregate OS and application logs using CloudWatch Logs

Database logs

With SQL databases, you can log transactions to help track modifications to the databases, such as additions or deletions. After an engine or system failure, you will need transaction logs to restore a database to a consistent state. Transaction logs are designed to be secure, and they require additional processing to access valuable information. It’s important that you understand data interactions during a security investigation, especially if your databases hold personally identifiable information (PII), financial and payments information, or other information subject to regulatory controls.

When you use Amazon Relational Database Service (Amazon RDS), you can publish database logs to Amazon CloudWatch Logs. For NoSQL databases, tracking atomic interactions is useful. You can find logs for managed NoSQL databases like Amazon DynamoDB in CloudTrail. DynamoDB integrates with CloudTrail, providing a record of actions taken by a user, role, or service. These events are classified as data events in CloudTrail.

Network logs

The goal of logging network activity is to gain insight into the communications that traverse your network. You might need this data for a variety of reasons, such as network troubleshooting or for use in a forensic investigation of suspected malware activity within your network.

In the AWS Cloud, you can log network activity by creating a proxy that logs network traffic or by using Traffic Mirroring to send a copy of network traffic to a logging server. You can adopt cloud-native approaches to capture this type of data using Amazon Route 53 DNS query logs and Amazon VPC Flow Logs.

There are also a variety of third-party networking solutions available like Palo Alto Networks and Fortinet, so you can continue to use the network logging mechanisms that you might have used in an on-premises environment.

Route 53 DNS query logs

You can configure Amazon Route 53 to log Domain Name System (DNS) queries. These logs are categorized into two groups:

Public DNS query logging
Resolver query logging

Logging public DNS queries against domains that you have hosted in Route 53 provides query information, such as the domain or subdomain requested, date and time stamp of the request, DNS record type, Route 53 edge location that responded, and response code.

The Amazon Route 53 Resolver comes with Amazon Virtual Private Cloud (Amazon VPC) by default. Capturing Resolver query logs provides the same information as public queries, as well as additional information such as the Instance ID of the resource that the query originated from. You can also capture Resolver query logs against different types of queries.

VPC Flow Logs

You can configure VPC Flow Logs for a VPC in your account to capture traffic that enters and moves around your VPC network, without the addition of instances or products. From these logs, you can review information, such as source and destination IP, ports, timestamps, protocol, account ID, and whether the traffic was accepted or rejected. For a complete list of the fields available for flow log records, see Available fields. You can create a flow log for a VPC, a subnet, or a network interface. If you create a flow log for a subnet or VPC, IP traffic going to and from each network interface in that subnet or VPC will be logged. For more details on VPC Flow Logs, see Logging IP traffic using VPC Flow Logs.

You can forward flow logs to Amazon CloudWatch Logs to create CloudWatch alarms based on metric filters. You can also forward flow logs to an S3 bucket for long-term retention and further analysis. Figure 2 demonstrates these configurations.

Figure 2: Sending VPC Flow logs to CloudWatch Logs and S3

Access logs

To identify access patterns for accessible endpoints, especially public endpoints, you should use access logs. Access logs capture detailed information about requests sent to your load balancer. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. With services built in layers behind a load balancer, unless you track the X-Forwarded-For request header, the requestor’s context is lost. Access logs help bridge that gap during investigations and analysis.

Amazon S3 server access logs

Access logs are critical to track object level access when using S3 buckets to store confidential or sensitive data. You can also turn on CloudTrail to capture S3 data events. You can store access logs in S3 buckets for long-term storage for compliance purposes and to run analyses during and after an event.

Load balancing logs

Elastic Load Balancing provides access logs that capture detailed information about requests sent to load balancers. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. You can use this log to analyze traffic patterns and to troubleshoot issues.

Access logs is an optional feature of Elastic Load Balancing that is turned off by default. To enable access logs for load balancers, see Access logs for your Application Load Balancer.

If you implement your own reverse proxy for load balancing needs, make sure that you capture the reverse proxy access logs. You can use the unified CloudWatch agent to forward the logs to CloudWatch. As with OS logs, you can export CloudWatch logs to an S3 bucket for long-term retention and analysis.

If you use an Amazon CloudFront distribution as the public endpoint for end users with load balancers as the custom origin, then load balancing access logs will represent the CloudFront distribution as the requestor, rather than the actual end user. If this information doesn’t add value to your incident handling process, then you can use CloudFront access logs as the log source that provides end user request details.

CloudFront access logs

You should enable standard logs, also known as access logs, when using CloudFront. Specify an S3 bucket where you want CloudFront to save the files.

CloudFront access logs are delivered on a best-effort basis. For information about requests made to a distribution in real time, use real-time logs that are delivered within seconds of receiving the requests. You should use real-time logs to monitor, analyze, and take action based on content delivery performance. For more details on the fields available from these logs, see the CloudFront standard log file format.

AWS WAF logs

When associated with a supported resource like a CloudFront distribution, Amazon API Gateway REST API, Application Load Balancer, AWS AppSync GraphQL API, Amazon Cognito user pool, or AWS App Runner, AWS WAF can help you monitor HTTP and HTTPS requests that are forwarded to the resource. You should configure web access control lists (ACLs) to gain fine-grained control over the requests, and enable logging for such ACLs to get detailed information about traffic that is analyzed by AWS WAF. Log information includes time of the request being received by AWS WAF from the AWS resource, details about the request, and the AWS WAF rules that the request matched. You can use this log information to monitor access patterns of public endpoints and configure rules to inspect requests in detail. For more information about AWS WAF logging, see Logging web ACL traffic.

Serverless logs

Serverless computing has become increasingly popular in the cloud-computing space. It provides on-demand compute power in a relatively short burst, meaning that cloud-based instances don’t need to be provisioned and kept around, idle, when there are no tasks to be completed. Although more and more compute tasks are being moved to serverless solutions, the need to log has not changed, but how the logs are generated has. In a serverless environment, security investigations not only benefit from logs that demonstrate the interactions and changes made by the code deployed, but that also document changes to the deployed code itself and access permissions of the Lambda execution role that is granting privileged access.

AWS Lambda

The logging of Lambda functions involves two components: how the function itself is operating, and what is happening inside the function (what your code is actually doing).

The logging of a Lambda function itself occurs through data events captured by CloudTrail. As noted earlier in this post, you must configure data events on a trail created in CloudTrail. During configuration, you will need to specify the function from which logs will be captured by your trail, and the destination S3 bucket where they will be stored. These logs contain details on the invocation of the function and help identify the IAM principals that called the Invoke API for Lambda.

AWS Lambda automatically monitors Lambda functions on your behalf and sends logs to CloudWatch. Your Lambda function comes with a CloudWatch Logs log group and a log stream for each instance of your function. The Lambda runtime environment sends details about each invocation to the log stream, and relays logs and other output from your function’s code. For more details on how to monitor Lambda functions, see Accessing Amazon CloudWatch logs for AWS Lambda.

Log analysis

For incident response, you need to be able to analyze and query your logs to validate what occurred and to understand the scope.

To begin, you can aggregate logs from various sources in S3 buckets for long-term storage, and you can integrate that data with query tools for further investigation. Logs can be exported and either parsed through directly, or ingested by another tool to help with the analysis. The following are some options that you can use to query these logs:

Amazon Athena — You can directly query CloudTrail events stored in S3 with Athena using SQL commands, specifying the LOCATION of the log files. You would generally use this approach if you have advanced queries to run, and you don’t have a SIEM. To set up Athena to query logs, you can use this open-source solution from AWS.
Amazon OpenSearch Service — OpenSearch is a distributed search and log analytics suite. Because it’s open source, it can ingest logs from more than just AWS log sources. To set this up, you can use this open-source SIEM solution from AWS.
CloudTrail Event History — Either from the console, or programmatically, you can query CloudTrail management events from the last 90-day period. This is ideal for when you have simple queries to make within the last 90 days, and you don’t need stored logs or more complex queries.
AWS CloudTrail Lake — Either from the console, or programmatically, you can query stored events in your configured CloudTrail Lake from the time of its configuration, up until the maximum storage duration of 2,557 days (7 years) from the time that you make your query. This approach allows for SQL-based queries, and it is ideal for when you need to make more complex queries against events, but don’t require the additional features of a SIEM solution.
Parse through raw JSON using CLI — This is achieved programmatically and parsed through terminal commands. It’s more a legacy method of parsing through logs. You might choose to use this approach for analysis if another service or solution isn’t feasible (for example, if you can’t use the service due to your corporate security policy).
Third-party SIEM — A third-party SIEM might be ideal if you already have a SIEM solution on AWS or elsewhere, and you don’t need a duplicated solution elsewhere. Typically, SIEM solutions will import logs from an S3 bucket and process and index events for analysis. To learn more about SIEM options, see the SIEM solutions in the AWS Marketplace, or the AWS Security Competency Partners for a partner local to you with threat detection and incident response (TDIR) expertise.

Sample queries

In this section, we provide samples of SQL queries. Both Athena and CloudTrail Lake accept SQL queries, but the following samples have been tested for use in Athena only. This is because some samples are for VPC Flow Logs, which you can’t query from CloudTrail Lake. To query CloudTrail logs in Athena, you must first create a table definition that points to the location of your logs stored in S3. You can do this from the CloudTrail Events console by using a hyperlinked suggestion, or from the Athena console directly. Alternatively, for Athena, you can use the AWS Security Analytics Bootstrap.

For each of these queries, you might need to modify some of the fields, such as the time frame that you are investigating, the IAM entity involved, and the account and Region in scope. For example, you might want to modify the time frame based on the current time and when you believe the security event began. This often involves expanding the time frame after running additional queries and learning more about the scope and timeline.

By using partitions for tables, you can restrict the amount of data scanned by each Athena query, helping to improve performance and reduce cost. For example, you can partition your CloudTrail Athena table manually or by using partition projection. You can include the partition column (for example, the timestamp) in your queries to limit the amount of data scanned.

Unauthorized attempts

When a security event occurs, you might want to review API calls that were attempted but failed due to the IAM principal not having access to perform the action on that resource. To discover this activity, run the following query (be sure to modify the time window first):

SELECT *
FROM cloudtrail
WHERE errorcode IN ('Client.UnauthorizedOperation','Client.InvalidPermission.NotFound','Client.OperationNotPermitted','AccessDenied')
AND useridentity.arn LIKE '%iam%'
AND eventtime >= '2023-01-01T00:00:00Z'
AND eventtime < '2023-03-01T00:00:00Z'
ORDER BY eventtime desc

This sample query can help you identify whether certain IAM principals have a significant amount of unauthorized API calls, which can indicate that an IAM principal is compromised.

Rejected TCP connections

During a security event, the unauthorized user that is interacting with the resources in your account is probably trying to establish persistence through the network layer. To get a list of rejected TCP connections and extract from it the day that these events occurred, run the following query:

SELECT day_of_week(date) AS
day,date,interface_id,srcaddr,action,protocol
FROM vpc_flow_logs
WHERE action = 'REJECT' AND protocol = 6
LIMIT 100;

Connections over older TLS versions

You might want to see how many calls to AWS APIs were made using older versions of the TLS protocol, as part of a forensic follow-up or a discovery job after a risk analysis. You can get this data by querying CloudTrail logs.

SELECT eventSource
COUNT(*) AS numOutdatedTlsCalls FROM cloudtrail WHERE tlsDetails.tlsVersion IN ('TLSv1', 'TLSv1.1') AND eventTime > '2023-01-01 00:00:00' GROUP BY eventSource ORDER BY numOutdatedTlsCalls DESC

Filter connections from an IP

With an IP address that you’d like to investigate, as a part of your forensic analysis, you might want to see the connections made to resources in a VPC. You can obtain this information by querying VPC Flow Logs. As with the server access logs, if you’re using Athena, you will first need to create a new table.

SELECT day_of_week(date) AS 
day, date, srcaddr, dstaddr, action, protocol
FROM vpc_flow_logs
WHERE day >= '2023/01/01' AND day < '2023/03/01' AND srcaddr LIKE '172.50.%'
ORDER BY day DESC
LIMIT 100

Investigate user actions

If you have identified a user who has been compromised, or that you suspect has been compromised, you might want to know the API calls that they made over a specific time period. Understanding the activity of a user can help you understand the scope of impact during an incident, as well as the reach of user permissions when you design your access management strategy.

SELECT eventID, eventName, eventSource, eventTime, userIdentity.arn
AS user
FROM cloudtrail
WHERE userIdentity.arn = '%<username>%' AND eventTime > '2022-12-05 00:00:00' AND eventTime < '2022-12-08 00:00:00'

Conclusion

It is essential that you capture logs from various layers within your application architecture, so that you can effectively respond to a security event at various layers of the application stack. If a security event occurs, logs can help provide a clear picture of what happened and the scope of the affected resources. This post helps you build a logging strategy for security incident response by understanding what logs you want to analyze, where you want to store those logs, and how you will analyze them.

Further resources

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Keeping the Cloudflare API ‘all green’ using Python-based testing

2023-03-07 Elie Mitrani

Post Syndicated from Elie Mitrani original https://blog.cloudflare.com/keeping-cloudflare-api-all-green-using-python-based-testing/

Keeping the Cloudflare API 'all green' using Python-based testing

At Cloudflare, we reuse existing core systems to power multiple products and testing of these core systems is essential. In particular, we require being able to have a wide and thorough visibility of our live APIs’ behaviors. We want to be able to detect regressions, prevent incidents and maintain healthy APIs. That is why we built Scout.

Scout is an automated system periodically running Python tests verifying the end to end behavior of our APIs. Scout allows us to evaluate APIs in production-like environments and thus ensures we can green light a production deployment while also monitoring the behavior of APIs in production.

Why Scout?

Before Scout, we were using an automated test system leveraging the Robot Framework. This older system was limiting our testing capabilities. In fact, we could not easily match json responses against keys we were looking for. We would abandon covering different behaviors of our APIs as it was impossible to decide on which resources a given test suite would run. Two different test suites would create false negatives as they were running on the same account.

Regarding schema validation, only API responses were validated against a json schema and tests would not fail if the response did not match the schema. Moreover, It was impossible to validate API requests.

Test suites were run in a queue, making the delay to a new feature assessment dependent on the number of test suites to run. The queue would as well potentially make newer test suites run the following day. Hence we often ended up with a mismatch between tests and APIs versions. Test steps could not be run in parallel either.

We could not split test suites between different environments. If a new API feature was being developed it was impossible to write a test without first needing the actual feature to be released to production.

We built Scout to overcome all these difficulties. We wanted the developer experience to be easy and we wanted Scout to be fast and reliable while spotting any live API issue.

A Scout test example

Scout is built in Python and leverages the functionalities of Pytest. Before diving into the exact capabilities of Scout and its architecture, let’s have a quick look at how to use it!

Following is an example of a Scout test on the Rulesets API (the docs are available here):

from scout import requires, validate, Account, Zone

@validate(schema="rulesets", ignorePaths=["accounts/[^/]+/rules/lists"])
@requires(
    account=Account(
        entitlements={"rulesets.max_rules_per_ruleset": 2),
    zone=Zone(plan="ENT",
        entitlements={"rulesets.firewall_custom_phase_allowed": True},
        account_entitlements={"rulesets.max_rules_per_ruleset": 2 }))
class TestZone:
    def test_create_custom_ruleset(self, cfapi):
        response = cfapi.zone.request(
            "POST",
            "rulesets",
            payload=f"""{{
            "name": "My zone ruleset",
            "description": "My ruleset description",
            "phase": "http_request_firewall_custom",
            "kind": "zone",
            "rules": [
                {{
                    "description": "My rule",
                    "action": "block",
                    "expression": "http.host eq \"fake.net\""
                }}
            ]
        }}""")
        response.expect_json_success(
            200,
            result=f"""{{
            "name": "My zone ruleset",
            "version": "1",
            "source": "firewall_custom",
            "phase": "http_request_firewall_custom",
            "kind": "zone",
            "rules": [
                {{
                    "description": "My rule",
                    "action": "block",
                    "expression": "http.host eq \"fake.net\"",
                    "enabled": true,
                    ...
                }}
            ],
            ...
        }}""")

A Scout test is a succession of roundtrips of requests and responses against a given API. We use the functionalities of Pytest fixtures and marks to be able to target specific resources while validating the request and responses. Pytest marks in Scout allow to provide an extra set of information to test suites. Pytest fixtures are contexts with information and methods which can be used across tests to enhance their capabilities. Hence the conjunction of marks with fixtures allow Scout to build the whole harness required to run a test suite against APIs.

Being able to exactly describe the resources against which a given test will run provides us confidence the live API behaves as expected under various conditions.

The cfapi fixture provides the capability to target different resources such as a Cloudflare account or a zone. In the test above, we use a Pytest mark @requires to describe the characteristics of the resources we want, e.g. we need here an account with a flag allowing us to have 2 rules for a ruleset. This will allow the test to only be run against accounts with such entitlements.

The @validate mark provides the capability to validate requests and responses to a given OpenAPI schema (here the rulesets OpenAPI schema). Any validation failure will be reported and flagged as a test failure.

Regarding the actual requests and responses, their payloads are described as f-strings, in particular the response f-string can be written as a “semi-json”:

 response.expect_json_success(
            200,
            result=f"""{{
            "name": "My zone ruleset",
            "version": "1",
            "source": "firewall_custom",
            "phase": "phase_http_request_firewall_custom",
            "kind": "zone",
            "rules": [
                {{
                    "description": "My rule",
                    "action": "block",
                    "expression": "http.host eq \"fake.net\"",
                    "enabled": true,
                    ...
                }}
            ],
            ...
        }}""")

Among many test assertions possible, Scout can assert the validity of a partial json response and it will log the information. We added the handling of ellipsis … as an indication for Scout not to care about any further fields at a given json nesting level. Hence, we are able to do partial matching on JSON API responses, thus focusing only on what matters the most in each test.

Once a test suite run is complete, the results are pushed by the service and stored using Cloudflare Workers KV. They are displayed via a Cloudflare Worker.

Scout is run in separate environments such as production-like and production environments. It is part of our deployment process to verify Scout is green in our production-like environment prior to deploying to production where Scout is also used for monitoring purposes.

How we built it

The core of Scout is written in Python and it is a combination of three components interacting together:

The Scout plugin: a Pytest plugin to write tests easily
The Scout service: a scheduler service to run the test suites periodically
The Scout Worker: a collector and presenter of test reports

The Scout plugin

This is the core component of the Scout system. It allows us to write self explanatory tests while ensuring a high level of compliance against OpenAPI schemas and verifying the APIs’ behaviors.

The Scout plugin architecture can be split into three components: setup, resource allocator, and runners. Setup is a conjunction of multiple sub components in charge of setting up the plugin.

The Registry contains all the information regarding a pool of accounts and zones we use for testing. As an example, entitlements are flags gating customers for using products features, the Registry provides the capability to describe entitlements per account and zone so that Scout can run a test against a specific setup.

As explained earlier, Scout can validate requests and responses against OpenAPI schemas. This is the responsibility of validators. A validator is built per OpenAPI schema and can be selected via the @validate mark we saw above.

@validate(schema="rulesets", ignorePaths=["accounts/[^/]+/rules/lists"])

As soon as a validator is selected, all the interaction of a given test with an API will be validated. If there is a validation failure, it will be marked as a test failure.

Last element of the setup, the config reader. It is the sub component in charge of providing all the URLs and authentication elements required for the Scout plugin to communicate with APIs.

Next in the chain, the resources allocator. This component is in charge of consuming the configuration and objects of the setup to build multiple runners. This is a factory which will make available the runners in the cfapi fixture.

response = cfapi.zone.request(method, path, payload)

When such a line of code is processed, it is the actual method request of the zone runner allocated for the test which is executed. Actually, the resources allocator is able to provide specialized runners (account, zone or default) which grant the possibility of targeting specific API endpoints for a given account or zone.

Runners are in charge of handling the execution of requests, managing the test expectations and using the validators for request/response schema validation.

Any failure on expectation or validation and any exceptions are recorded in the stash. The stash is shared across all runners. As such, when a test setup, run or cleanup is processed, the timeline of execution and potential retries are logged in the stash. The stash contents are later used for building the test suite reports.

Scout is able to run multiple test steps in parallel. Actually, each resource couple (Account Runner, Zone Runner) is associated with a Pytest-xdist worker which runs test steps independently. There can be as many workers as there are resource couples. An extra “default” runner is provided for reaching our different APIs and/or URLs with or without authentication.

Testing a test system was not the easiest part. We have been required to build a fake API and assert the Scout plugin would behave as it should in different situations. We reached and maintained a test coverage confidence which was considered good (close to 90%) for using the Scout plugin permanently.

The Scout service

The Scout service is meant to schedule test suites periodically. It is a configurable scheduler providing a reporting harness for the test suites as well as multiple metrics. It was a design decision to build a scheduler instead of using cron jobs.

We wanted to be aware of any scheduling issue as well as run issues. For this we used Prometheus metrics. The problem is that the Prometheus default configuration is to scrape metrics advertised by services. This scraping happens periodically and we were concerned about the eventuality of missing metrics if a cron job was to finish prior to the next Prometheus metrics scraping. As such we decided a small scheduler was better suited for overall observability of the test runs. Among the metrics the Scout service provides are network failures, general test failures, reporting failures, tests lagging and more.

The Scout service runs threads on configured periods. Each thread is a test suite run as a separate Pytest with Scout plugin process followed by a reporting execution consuming the results and publishing them to the relevant parties.

The reporting component provided to each thread publishes the report to Workers KV and notifies us on chat in case there is a failure. Reporting takes also care of publishing the information relevant for building API testing coverage. In fact it is mandatory for us to have coverage of all the API endpoints and their possible methods so that we can achieve a wide and thorough visibility of our live APIs.

As a fallback, if there are any thread failure, test failure or reporting failure we are alerted based on the Prometheus metrics being updated across the service execution. The logs of the Scout service as well as the logs of each Pytest-Scout plugin execution provide the last resort information if no metrics are available and reporting is failing.

The service can be deployed with a minimal YAML configuration and be set up for different environments. We can for example decide to run different test suites based on the environment, publish or not to Cloudflare Workers, set different periods and retry mechanisms and so on.

We keep the tests as part of our code base alongside the configuration of the Scout service, and that’s about it, the Scout service is a separate entity.

The Scout Worker

It is a Cloudflare worker in charge of fetching the most recent Worker KVs and displaying them in an eye pleasing manner. The Scout service publishes a test report as JSON, thus the Scout worker parses the report and displays its content based on the status of the test suite run.

For example, we present below an authentication failure during a test which resulted in such a display in the worker:

What does Scout let us do

Through leveraging the capabilities of Pytest and Cloudflare Workers, we have been able to build a configurable, robust and reliable system which allows us to easily write self explanatory tests for our APIs.

We can validate requests and responses against OpenAPI schemas and test behaviors over specific resources while getting alerted through multiple means if something goes wrong.

For specific use cases, we can write a test verifying the API behaves as it should, the configuration to be pushed at the edge is valid and a given zone will react as it should to security threats. Thus going beyond an end-to-end API test.

Scout quickly became our permanent live tester and monitor of APIs. We wrote tests for all endpoints to maintain a wide coverage of all our APIs. Scout has since been used for verifying an API version prior to its deployment to production. In fact, after a deployment in a production-like environment we can know in a couple of minutes if a new feature is good to go to production and assess if it is behaving correctly.

We hope you enjoyed this deep dive description into one of our systems!

Visualize database privileges on Amazon Redshift using Grafana

2023-03-02 Yota Hamaoka

Post Syndicated from Yota Hamaoka original https://aws.amazon.com/blogs/big-data/visualize-database-privileges-on-amazon-redshift-using-grafana/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to use SQL for analyzing structured and semi-structured data with best price performance along with secure access to the data.

As more users start querying data in a data warehouse, access control is paramount to protect valuable organizational data. Database administrators want to continuously monitor and manage user privileges to maintain proper data access in the data warehouse. Amazon Redshift provides granular access control on the database, schema, table, column, row, and other database objects by granting privileges to roles, groups, and users from a SQL interface. To monitor privileges configured in Amazon Redshift, you can retrieve them by querying system tables.

Although Amazon Redshift provides a broad capability of managing access to database objects, we have heard from customers that they want to visualize and monitor privileges without using a SQL interface. In this post, we introduce predefined dashboards using Grafana which visualizes database privileges without writing SQL. This dashboard will help database administrators to reduce the time spent on database administration and increase the frequency of monitoring cycles.

Database security in Amazon Redshift

Security is the top priority at AWS. Amazon Redshift provides four levels of control:

Cluster management
Cluster connectivity
Database access
Temporary database credentials and single sign-on

This post focuses on database access, which relates to user access control against database objects. For more information, see Managing database security.

Amazon Redshift uses the GRANT command to define permissions in the database. For most database objects, GRANT takes three parameters:

Identity – The entity you grant access to. This could be a user, role, or group.
Object – The type of database object. This could be a database, schema, table or view, column, row, function, procedure, language, datashare, machine leaning (ML) model, and more.
Privilege – The type of operation. Examples include CREATE, SELECT, ALTER, DROP, DELETE, and INSERT. The level of privilege depends on the object.

To remove access, use the REVOKE command.

Additionally, Amazon Redshift offers granular access control with the Row-level security (RLS) feature. You can attach or detach RLS policies to identities with the ATTACH RLS POLICY and DETACH RLS POLICY commands, respectively. See RLS policy ownership and management for more details.

Generally, database administrator monitors and reviews the identities, objects, and privileges periodically to ensure proper access is configured. They also need to investigate access configurations if database users face permission errors. These tasks require a SQL interface to query multiple system tables, which can be a repetitive and undifferentiated operation. Therefore, database administrators need a single pane of glass to quickly navigate through identities, objects, and privileges without writing SQL.

Solution overview

The following diagram illustrates the solution architecture and its key components:

Amazon Redshift contains database privilege information in system tables.
Grafana provides a predefined dashboard to visualize database privileges. The dashboard runs queries against the Amazon Redshift system table via the Amazon Redshift Data API.

Note that the dashboard focuses on visualization. SQL interface is required to configure privileges in Amazon Redshift. You can use query editor v2, a web-based SQL interface which enables users to run SQL commands from a browser.

Prerequisites

Before moving to the next section, you should have the following prerequisites:

An AWS account
Create an Amazon Redshift cluster
Setup Amazon Managed Grafana or local Grafana
- To setup with Amazon Managed Grafana, go through the following sections in Query and visualize Amazon Redshift operational metrics using the Amazon Redshift plugin for Grafana:
  - Configure an Amazon Redshift cluster
  - Configure an Amazon Managed Grafana workspace and sign in to Grafana
  - Create a database user for Amazon Managed Grafana
  - Configure a user in AWS SSO for Amazon Managed Grafana UI access
  - Configure an Amazon Managed Grafana workspace and sign in to Grafana
  - Set up an Amazon Redshift data source in Grafana
- To setup with local Grafana, first install Grafana following the official setup documentation. Next, go through the following sections in Query and visualize Amazon Redshift operational metrics using the Amazon Redshift plugin for Grafana:
  - Create a database user for Amazon Managed Grafana
  - Set up an Amazon Redshift data source in Grafana

While Amazon Managed Grafana controls the plugin version and updates periodically, local Grafana allows user to control the version. Therefore, local Grafana could be an option if you need earlier access for the latest features. Refer to plugin changelog for released features and versions.

Import the dashboards

After you have finished the prerequisites, you should have access to Grafana configured with Amazon Redshift as a data source. Next, import two dashboards for visualization.

In Grafana console, go to the created Redshift data source and click Dashboards
Import the Amazon Redshift Identities and Objects
Go to the data source again and import the Amazon Redshift Privileges

Each dashboard will appear once imported.

Amazon Redshift Identities and Objects dashboard

The Amazon Redshift Identities and Objects dashboard shows identites and database objects in Amazon Redshift, as shown in the following screenshot.

The Identities section shows the detail of each user, role, and group in the source database.

One of the key features in this dashboard is the Role assigned to Role, User section, which uses a node graph panel to visualize the hierarchical structure of roles and users from multiple system tables. This visualization can help administrators quickly examine which roles are inherited to users instead of querying multiple system tables. For more information about role-based access, refer to Role-based access control (RBAC).

Amazon Redshift Privileges dashboard

The Amazon Redshift Privileges dashboard shows privileges defined in Amazon Redshift.

In the Role and Group assigned to User section, open the Role assigned to User panel to list the roles for a specific user. In this panel, you can list and compare roles assigned to multiple users. Use the User drop-down at the top of the dashboard to select users.

The dashboard will refresh immediately and show filtered result for selected users. Following screenshot is the filtered result for user hr1, hr2 and it3.

The Object Privileges section shows the privileges granted for each database object and identity. Note that objects with no privileges granted are not listed here. To show the full list of database objects, use the Amazon Redshift Identities and Objects dashboard.

The Object Privileges (RLS) section contains visualizations for row-level security (RLS). The Policy attachments panel enables you to examine RLS configuration by visualizing relation between of tables, policies, roles and users.

Conclusion

In this post, we introduced a visualization for database privileges of Amazon Redshift using predefined Grafana dashboards. Database administrators can use these dashboards to quickly navigate through identities, objects, and privileges without writing SQL. You can also customize the dashboard to meet your business requirements. The JSON definition file of this dashboard is maintained as part of OSS in the Redshift data source for Grafana GitHub repository.

For more information about the topics described to in this post, refer to the following:

About the author

Yota Hamaoka is an Analytics Solution Architect at Amazon Web Services. He is focused on driving customers to accelerate their analytics journey with Amazon Redshift.

Monitoring our monitoring: how we validate our Prometheus alert rules

2022-05-19 Lukasz Mierzwa

Post Syndicated from Lukasz Mierzwa original https://blog.cloudflare.com/monitoring-our-monitoring/

Background

Monitoring our monitoring: how we validate our Prometheus alert rules

We use Prometheus as our core monitoring system. We’ve been heavy Prometheus users since 2017 when we migrated off our previous monitoring system which used a customized Nagios setup. Despite growing our infrastructure a lot, adding tons of new products and learning some hard lessons about operating Prometheus at scale, our original architecture of Prometheus (see Monitoring Cloudflare’s Planet-Scale Edge Network with Prometheus for an in depth walk through) remains virtually unchanged, proving that Prometheus is a solid foundation for building observability into your services.

One of the key responsibilities of Prometheus is to alert us when something goes wrong and in this blog post we’ll talk about how we make those alerts more reliable – and we’ll introduce an open source tool we’ve developed to help us with that, and share how you can use it too. If you’re not familiar with Prometheus you might want to start by watching this video to better understand the topic we’ll be covering here.

Prometheus works by collecting metrics from our services and storing those metrics inside its database, called TSDB. We can then query these metrics using Prometheus query language called PromQL using ad-hoc queries (for example to power Grafana dashboards) or via alerting or recording rules. A rule is basically a query that Prometheus will run for us in a loop, and when that query returns any results it will either be recorded as new metrics (with recording rules) or trigger alerts (with alerting rules).

Prometheus alerts

Since we’re talking about improving our alerting we’ll be focusing on alerting rules.

To create alerts we first need to have some metrics collected. For the purposes of this blog post let’s assume we’re working with http_requests_total metric, which is used on the examples page. Here are some examples of how our metrics will look:

http_requests_total{job="myserver", handler="/", method=”get”, status=”200”}
http_requests_total{job="myserver", handler="/", method=”get”, status=”500”}
http_requests_total{job="myserver", handler="/posts", method=”get”, status=”200”}
http_requests_total{job="myserver", handler="/posts", method=”get”, status=”500”}
http_requests_total{job="myserver", handler="/posts/new", method=”post”, status=”201”}
http_requests_total{job="myserver", handler="/posts/new", method=”post”, status=”401”}

Let’s say we want to alert if our HTTP server is returning errors to customers.

Since, all we need to do is check our metric that tracks how many responses with HTTP status code 500 there were, a simple alerting rule could like this:

- alert: Serving HTTP 500 errors
  expr: http_requests_total{status=”500”} > 0

This will alert us if we have any 500 errors served to our customers. Prometheus will run our query looking for a time series named http_requests_total that also has a status label with value “500”. Then it will filter all those matched time series and only return ones with value greater than zero.

If our alert rule returns any results a fire will be triggered, one for each returned result.

If our rule doesn’t return anything, meaning there are no matched time series, then alert will not trigger.

The whole flow from metric to alert is pretty simple here as we can see on the diagram below.

If we want to provide more information in the alert we can by setting additional labels and annotations, but alert and expr fields are all we need to get a working rule.

But the problem with the above rule is that our alert starts when we have our first error, and then it will never go away.

After all, our http_requests_total is a counter, so it gets incremented every time there’s a new request, which means that it will keep growing as we receive more requests. What this means for us is that our alert is really telling us “was there ever a 500 error?” and even if we fix the problem causing 500 errors we’ll keep getting this alert.

A better alert would be one that tells us if we’re serving errors right now.

For that we can use the rate() function to calculate the per second rate of errors.

Our modified alert would be:

- alert: Serving HTTP 500 errors
  expr: rate(http_requests_total{status=”500”}[2m]) > 0

The query above will calculate the rate of 500 errors in the last two minutes. If we start responding with errors to customers our alert will fire, but once errors stop so will this alert.

This is great because if the underlying issue is resolved the alert will resolve too.

We can improve our alert further by, for example, alerting on the percentage of errors, rather than absolute numbers, or even calculate error budget, but let’s stop here for now.

It’s all very simple, so what do we mean when we talk about improving the reliability of alerting? What could go wrong here?

Maybe a spot for a subheading here as you move on from the intro?

What could go wrong?

We can craft a valid YAML file with a rule definition that has a perfectly valid query that will simply not work how we expect it to work. Which, when it comes to alerting rules, might mean that the alert we rely upon to tell us when something is not working correctly will fail to alert us when it should. To better understand why that might happen let’s first explain how querying works in Prometheus.

Prometheus querying basics

There are two basic types of queries we can run against Prometheus. The first one is an instant query. It allows us to ask Prometheus for a point in time value of some time series. If we write our query as http_requests_total we’ll get all time series named http_requests_total along with the most recent value for each of them. We can further customize the query and filter results by adding label matchers, like http_requests_total{status=”500”}.

Let’s consider we have two instances of our server, green and red, each one is scraped (Prometheus collects metrics from it) every one minute (independently of each other).

This is what happens when we issue an instant query:

There’s obviously more to it as we can use functions and build complex queries that utilize multiple metrics in one expression. But for the purposes of this blog post we’ll stop here.

The important thing to know about instant queries is that they return the most recent value of a matched time series, and they will look back for up to five minutes (by default) into the past to find it. If the last value is older than five minutes then it’s considered stale and Prometheus won’t return it anymore.

The second type of query is a range query – it works similarly to instant queries, the difference is that instead of returning us the most recent value it gives us a list of values from the selected time range. That time range is always relative so instead of providing two timestamps we provide a range, like “20 minutes”. When we ask for a range query with a 20 minutes range it will return us all values collected for matching time series from 20 minutes ago until now.

An important distinction between those two types of queries is that range queries don’t have the same “look back for up to five minutes” behavior as instant queries. If Prometheus cannot find any values collected in the provided time range then it doesn’t return anything.

If we modify our example to request [3m] range query we should expect Prometheus to return three data points for each time series:

When queries don’t return anything

Knowing a bit more about how queries work in Prometheus we can go back to our alerting rules and spot a potential problem: queries that don’t return anything.

If our query doesn’t match any time series or if they’re considered stale then Prometheus will return an empty result. This might be because we’ve made a typo in the metric name or label filter, the metric we ask for is no longer being exported, or it was never there in the first place, or we’ve added some condition that wasn’t satisfied, like value of being non-zero in our http_requests_total{status=”500”} > 0 example.

Prometheus will not return any error in any of the scenarios above because none of them are really problems, it’s just how querying works. If you ask for something that doesn’t match your query then you get empty results. This means that there’s no distinction between “all systems are operational” and “you’ve made a typo in your query”. So if you’re not receiving any alerts from your service it’s either a sign that everything is working fine, or that you’ve made a typo, and you have no working monitoring at all, and it’s up to you to verify which one it is.

For example, we could be trying to query for http_requests_totals instead of http_requests_total (an extra “s” at the end) and although our query will look fine it won’t ever produce any alert.

Range queries can add another twist – they’re mostly used in Prometheus functions like rate(), which we used in our example. This function will only work correctly if it receives a range query expression that returns at least two data points for each time series, after all it’s impossible to calculate rate from a single number.

Since the number of data points depends on the time range we passed to the range query, which we then pass to our rate() function, if we provide a time range that only contains a single value then rate won’t be able to calculate anything and once again we’ll return empty results.

The number of values collected in a given time range depends on the interval at which Prometheus collects all metrics, so to use rate() correctly you need to know how your Prometheus server is configured. You can read more about this here and here if you want to better understand how rate() works in Prometheus.

For example if we collect our metrics every one minute then a range query http_requests_total[1m] will be able to find only one data point. Here’s a reminder of how this looks:

Since, as we mentioned before, we can only calculate rate() if we have at least two data points, calling rate(http_requests_total[1m]) will never return anything and so our alerts will never work.

There are more potential problems we can run into when writing Prometheus queries, for example any operations between two metrics will only work if both have the same set of labels, you can read about this here. But for now we’ll stop here, listing all the gotchas could take a while. The point to remember is simple: if your alerting query doesn’t return anything then it might be that everything is ok and there’s no need to alert, but it might also be that you’ve mistyped your metrics name, your label filter cannot match anything, your metric disappeared from Prometheus, you are using too small time range for your range queries etc.

Renaming metrics can be dangerous

We’ve been running Prometheus for a few years now and during that time we’ve grown our collection of alerting rules a lot. Plus we keep adding new products or modifying existing ones, which often includes adding and removing metrics, or modifying existing metrics, which may include renaming them or changing what labels are present on these metrics.

A lot of metrics come from metrics exporters maintained by the Prometheus community, like node_exporter, which we use to gather some operating system metrics from all of our servers. Those exporters also undergo changes which might mean that some metrics are deprecated and removed, or simply renamed.

A problem we’ve run into a few times is that sometimes our alerting rules wouldn’t be updated after such a change, for example when we upgraded node_exporter across our fleet. Or the addition of a new label on some metrics would suddenly cause Prometheus to no longer return anything for some of the alerting queries we have, making such an alerting rule no longer useful.

It’s worth noting that Prometheus does have a way of unit testing rules, but since it works on mocked data it’s mostly useful to validate the logic of a query. Unit testing won’t tell us if, for example, a metric we rely on suddenly disappeared from Prometheus.

Chaining rules

When writing alerting rules we try to limit alert fatigue by ensuring that, among many things, alerts are only generated when there’s an action needed, they clearly describe the problem that needs addressing, they have a link to a runbook and a dashboard, and finally that we aggregate them as much as possible. This means that a lot of the alerts we have won’t trigger for each individual instance of a service that’s affected, but rather once per data center or even globally.

For example, we might alert if the rate of HTTP errors in a datacenter is above 1% of all requests. To do that we first need to calculate the overall rate of errors across all instances of our server. For that we would use a recording rule:

- record: job:http_requests_total:rate2m
  expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

- record: job:http_requests_status500:rate2m
  expr: sum(rate(http_requests_total{status=”500”}[2m])) without(method, status, instance)

First rule will tell Prometheus to calculate per second rate of all requests and sum it across all instances of our server. Second rule does the same but only sums time series with status labels equal to “500”. Both rules will produce new metrics named after the value of the record field.

Now we can modify our alert rule to use those new metrics we’re generating with our recording rules:

- alert: Serving HTTP 500 errors
  expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

If we have a data center wide problem then we will raise just one alert, rather than one per instance of our server, which can be a great quality of life improvement for our on-call engineers.

But at the same time we’ve added two new rules that we need to maintain and ensure they produce results. To make things more complicated we could have recording rules producing metrics based on other recording rules, and then we have even more rules that we need to ensure are working correctly.

What if all those rules in our chain are maintained by different teams? What if the rule in the middle of the chain suddenly gets renamed because that’s needed by one of the teams? Problems like that can easily crop up now and then if your environment is sufficiently complex, and when they do, they’re not always obvious, after all the only sign that something stopped working is, well, silence – your alerts no longer trigger. If you’re lucky you’re plotting your metrics on a dashboard somewhere and hopefully someone will notice if they become empty, but it’s risky to rely on this.

We definitely felt that we needed something better than hope.

Introducing pint: a Prometheus rule linter

To avoid running into such problems in the future we’ve decided to write a tool that would help us do a better job of testing our alerting rules against live Prometheus servers, so we can spot missing metrics or typos easier. We also wanted to allow new engineers, who might not necessarily have all the in-depth knowledge of how Prometheus works, to be able to write rules with confidence without having to get feedback from more experienced team members.

Since we believe that such a tool will have value for the entire Prometheus community we’ve open-sourced it, and it’s available for anyone to use – say hello to pint!

You can find sources on github, there’s also online documentation that should help you get started.

Pint works in 3 different ways:

You can run it against a file(s) with Prometheus rules
It can run as a part of your CI pipeline
Or you can deploy it as a side-car to all your Prometheus servers

It doesn’t require any configuration to run, but in most cases it will provide the most value if you create a configuration file for it and define some Prometheus servers it should use to validate all rules against. Running without any configured Prometheus servers will limit it to static analysis of all the rules, which can identify a range of problems, but won’t tell you if your rules are trying to query non-existent metrics.

First mode is where pint reads a file (or a directory containing multiple files), parses it, does all the basic syntax checks and then runs a series of checks for all Prometheus rules in those files.

Second mode is optimized for validating git based pull requests. Instead of testing all rules from all files pint will only test rules that were modified and report only problems affecting modified lines.

Third mode is where pint runs as a daemon and tests all rules on a regular basis. If it detects any problem it will expose those problems as metrics. You can then collect those metrics using Prometheus and alert on them as you would for any other problems. This way you can basically use Prometheus to monitor itself.

What kind of checks can it run for us and what kind of problems can it detect?

All the checks are documented here, along with some tips on how to deal with any detected problems. Let’s cover the most important ones briefly.

As mentioned above the main motivation was to catch rules that try to query metrics that are missing or when the query was simply mistyped. To do that pint will run each query from every alerting and recording rule to see if it returns any result, if it doesn’t then it will break down this query to identify all individual metrics and check for the existence of each of them. If any of them is missing or if the query tries to filter using labels that aren’t present on any time series for a given metric then it will report that back to us.

So if someone tries to add a new alerting rule with http_requests_totals typo in it, pint will detect that when running CI checks on the pull request and stop it from being merged. Which takes care of validating rules as they are being added to our configuration management system.

Another useful check will try to estimate the number of times a given alerting rule would trigger an alert. Which is useful when raising a pull request that’s adding new alerting rules – nobody wants to be flooded with alerts from a rule that’s too sensitive so having this information on a pull request allows us to spot rules that could lead to alert fatigue.

Similarly, another check will provide information on how many new time series a recording rule adds to Prometheus. In our setup a single unique time series uses, on average, 4KiB of memory. So if a recording rule generates 10 thousand new time series it will increase Prometheus server memory usage by 10000*4KiB=40MiB. 40 megabytes might not sound like but our peak time series usage in the last year was around 30 million time series in a single Prometheus server, so we pay attention to anything that’s might add a substantial amount of new time series, which pint helps us to notice before such rule gets added to Prometheus.

On top of all the Prometheus query checks, pint allows us also to ensure that all the alerting rules comply with some policies we’ve set for ourselves. For example, we require everyone to write a runbook for their alerts and link to it in the alerting rule using annotations.

We also require all alerts to have priority labels, so that high priority alerts are generating pages for responsible teams, while low priority ones are only routed to karma dashboard or create tickets using jiralert. It’s easy to forget about one of these required fields and that’s not something which can be enforced using unit testing, but pint allows us to do that with a few configuration lines.

With pint running on all stages of our Prometheus rule life cycle, from initial pull request to monitoring rules deployed in our many data centers, we can rely on our Prometheus alerting rules to always work and notify us of any incident, large or small.

GitHub: https://github.com/cloudflare/pint

Putting it all together

Let’s see how we can use pint to validate our rules as we work on them.

We can begin by creating a file called “rules.yml” and adding both recording rules there.

The goal is to write new rules that we want to add to Prometheus, but before we actually add those, we want pint to validate it all for us.

groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m]) without(method, status, instance)

Next we’ll download the latest version of pint from GitHub and run check our rules.

$ pint lint rules.yml 
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:8: syntax error: unclosed left parenthesis (promql/syntax)
    expr: sum(rate(http_requests_total{status="500"}[2m]) without(method, status, instance)

level=info msg="Problems found" Fatal=1
level=fatal msg="Execution completed with error(s)" error="problems found"

Whoops, we have “sum(rate(…)” and so we’re missing one of the closing brackets. Let’s fix that and try again.

groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

$ pint lint rules.yml 
level=info msg="File parsed" path=rules.yml rules=2

Our rule now passes the most basic checks, so we know it’s valid. But to know if it works with a real Prometheus server we need to tell pint how to talk to Prometheus. For that we’ll need a config file that defines a Prometheus server we test our rule against, it should be the same server we’re planning to deploy our rule to. Here we’ll be using a test instance running on localhost. Let’s create a “pint.hcl” file and define our Prometheus server there:

prometheus "prom1" {
  uri     = "http://localhost:9090"
  timeout = "1m"
}

Now we can re-run our check using this configuration file:

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:5: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

rules.yml:8: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

level=info msg="Problems found" Bug=2
level=fatal msg="Execution completed with error(s)" error="problems found"

Yikes! It’s a test Prometheus instance, and we forgot to collect any metrics from it.

Let’s fix that by starting our server locally on port 8080 and configuring Prometheus to collect metrics from it:

scrape_configs:
  - job_name: webserver
    static_configs:
      - targets: ['localhost:8080’]

Let’ re-run our checks once more:

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2

This time everything works!

Now let’s add our alerting rule to our file, so it now looks like this:

groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

- name: Demo alerting rules
  rules:
  - alert: Serving HTTP 500 errors
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

And let’s re-run pint once again:

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_total:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

level=info msg="Problems found" Information=2

It all works according to pint, and so we now can safely deploy our new rules file to Prometheus.

Notice that pint recognised that both metrics used in our alert come from recording rules, which aren’t yet added to Prometheus, so there’s no point querying Prometheus to verify if they exist there.

Now what happens if we deploy a new version of our server that renames the “status” label to something else, like “code”?

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:8: prometheus "prom1" at http://localhost:9090 has "http_requests_total" metric but there are no series with "status" label in the last 1w (promql/series)
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

level=info msg="Problems found" Bug=1 Information=1
level=fatal msg="Execution completed with error(s)" error="problems found"

Luckily pint will notice this and report it, so we can adopt our rule to match the new name.

But what if that happens after we deploy our rule? For that we can use the “pint watch” command that runs pint as a daemon periodically checking all rules.

Please note that validating all metrics used in a query will eventually produce some false positives. In our example metrics with status=”500” label might not be exported by our server until there’s at least one request ending in HTTP 500 error.

The promql/series check responsible for validating presence of all metrics has some documentation on how to deal with this problem. In most cases you’ll want to add a comment that instructs pint to ignore some missing metrics entirely or stop checking label values (only check if there’s “status” label present, without checking if there are time series with status=”500”).

Summary

Prometheus metrics don’t follow any strict schema, whatever services expose will be collected. At the same time a lot of problems with queries hide behind empty results, which makes noticing these problems non-trivial.

We use pint to find such problems and report them to engineers, so that our global network is always monitored correctly, and we have confidence that lack of alerts proves how reliable our infrastructure is.

Query and visualize Amazon Redshift operational metrics using the Amazon Redshift plugin for Grafana

2022-03-09 Sergey Konoplev

Post Syndicated from Sergey Konoplev original https://aws.amazon.com/blogs/big-data/query-and-visualize-amazon-redshift-operational-metrics-using-the-amazon-redshift-plugin-for-grafana/

Grafana is a rich interactive open-source tool by Grafana Labs for visualizing data across one or many data sources. It’s used in a variety of modern monitoring stacks, allowing you to have a common technical base and apply common monitoring practices across different systems. Amazon Managed Grafana is a fully managed, scalable, and secure Grafana-as-a-service solution developed by AWS in collaboration with Grafana Labs.

Amazon Redshift is the most widely used data warehouse in the cloud. You can view your Amazon Redshift cluster’s operational metrics on the Amazon Redshift console, use AWS CloudWatch, and query Amazon Redshift system tables directly from your cluster. The first two options provide a set of predefined general metrics and visualizations. The last one allows you to use the flexibility of SQL to get deep insights into the details of the workload. However, querying system tables requires knowledge of system table structures. To address that, we came up with a consolidated Amazon Redshift Grafana dashboard that visualizes a set of curated operational metrics and works on top of the Amazon Redshift Grafana data source. You can easily add it to an Amazon Managed Grafana workspace, as well as to any other Grafana deployments where the data source is installed.

This post guides you through a step-by-step process to create an Amazon Managed Grafana workspace and configure an Amazon Redshift cluster with a Grafana data source for it. Lastly, we show you how to set up the Amazon Redshift Grafana dashboard to visualize the cluster metrics.

Solution overview

The following diagram illustrates the solution architecture.

Architecture Diagram

The solution includes the following components:

The Amazon Redshift cluster to get the metrics from.
Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data Service API.
The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.

We walk you through the following steps during the configuration process:

Configure an Amazon Redshift cluster.
Create a database user for Amazon Managed Grafana on the cluster.
Configure a user in AWS Single Sign-On (AWS SSO) for Amazon Managed Grafana UI access.
Configure an Amazon Managed Grafana workspace and sign in to Grafana.
Set up Amazon Redshift as the data source in Grafana.
Import the Amazon Redshift dashboard supplied with the data source.

Prerequisites

To follow along with this walkthrough, you should have the following prerequisites:

An AWS account
Familiarity with the basic concepts of the following services:
- Amazon Redshift
- Amazon Managed Grafana
- AWS SSO

Configure an Amazon Redshift cluster

If you don’t have an Amazon Redshift cluster, create a sample cluster before proceeding with the following steps. For this post, we assume that the cluster identifier is called redshift-demo-cluster-1 and the admin user name is awsuser.

On the Amazon Redshift console, choose Clusters in the navigation pane.
Choose your cluster.
Choose the Properties tab.

Redshift Cluster Properties

To make the cluster discoverable by Amazon Managed Grafana, you must add a special tag to it.

Choose Add tags.
For Key, enter GrafanaDataSource.
For Value, enter true.
Choose Save changes.

Redshift Cluster Tags

Create a database user for Amazon Managed Grafana

Grafana will be directly querying the cluster, and it requires a database user to connect to the cluster. In this step, we create the user redshift_data_api_user and apply some security best practices.

On the cluster details page, choose Query data and Query in query editor v2.
Choose the redshift-demo-cluster-1 cluster we created previously.
For Database, enter the default dev.
Enter the user name and password that you used to create the cluster.
Choose Create connection.
In the query editor, enter the following statements and choose Run:

CREATE USER redshift_data_api_user PASSWORD '&lt;password&gt;' CREATEUSER;
ALTER USER redshift_data_api_user SET readonly TO TRUE;
ALTER USER redshift_data_api_user SET query_group TO 'superuser';

The first statement creates a user with superuser privileges necessary to access system tables and views (make sure to use a unique password). The second prohibits the user from making modifications. The last statement isolates the queries the user can run to the superuser queue, so they don’t interfere with the main workload.

In this example, we use service managed permissions in Amazon Managed Grafana and a workspace AWS Identity and Access Management (IAM) role as an authentication provider in the Amazon Redshift Grafana data source. We create the database user redshift_data_api_user using the AmazonGrafanaRedshiftAccess policy.

Configure a user in AWS SSO for Amazon Managed Grafana UI access

Two authentication methods are available for accessing Amazon Managed Grafana: AWS SSO and SAML. In this example, we use AWS SSO.

On the AWS SSO console, choose Users in the navigation pane.
Choose Add user.
In the Add user section, provide the required information.

SSO add user

In this post, we select Send an email to the user with password setup instructions. You need to be able to access the email address you enter because you use this email further in the process.

Choose Next to proceed to the next step.
Choose Add user.

An email is sent to the email address you specified.

Choose Accept invitation in the email.

You’re redirected to sign in as a new user and set a password for the user.

Enter a new password and choose Set new password to finish the user creation.

Configure an Amazon Managed Grafana workspace and sign in to Grafana

Now you’re ready to set up an Amazon Managed Grafana workspace.

On the Amazon Grafana console, choose Create workspace.
For Workspace name, enter a name, for example grafana-demo-workspace-1.
Choose Next.
For Authentication access, select AWS Single Sign-On.
For Permission type, select Service managed.
Chose Next to proceed.
For IAM permission access settings, select Current account.
For Data sources, select Amazon Redshift.
Choose Next to finish the workspace creation.

You’re redirected to the workspace page.

Next, we need to enable AWS SSO as an authentication method.

On the workspace page, choose Assign new user or group.
Select the previously created AWS SSO user under Users and Select users and groups tables.

You need to make the user an admin, because we set up the Amazon Redshift data source with it.

Select the user from the Users list and choose Make admin.
Go back to the workspace and choose the Grafana workspace URL link to open the Grafana UI.
Sign in with the user name and password you created in the AWS SSO configuration step.

Set up an Amazon Redshift data source in Grafana

To visualize the data in Grafana, we need to access the data first. To do so, we must create a data source pointing to the Amazon Redshift cluster.

On the navigation bar, choose the lower AWS icon (there are two) and then choose Redshift from the list.
For Regions, choose the Region of your cluster.
Select the cluster from the list and choose Add 1 data source.
On the Provisioned data sources page, choose Go to settings.
For Name, enter a name for your data source.
By default, Authentication Provider should be set as Workspace IAM Role, Default Region should be the Region of your cluster, and Cluster Identifier should be the name of the chosen cluster.
For Database, enter dev.
For Database User, enter redshift_data_api_user.
Choose Save & Test.

A success message should appear.

Data source working

Import the Amazon Redshift dashboard supplied with the data source

As the last step, we import the default Amazon Redshift dashboard and make sure that it works.

In the data source we just created, choose Dashboards on the top navigation bar and choose Import to import the Amazon Redshift dashboard.
Under Dashboards on the navigation sidebar, choose Manage.
In the dashboards list, choose Amazon Redshift.

The dashboard appear, showing operational data from your cluster. When you add more clusters and create data sources for them in Grafana, you can choose them from the Data source list on the dashboard.

Clean up

To avoid incurring unnecessary charges, delete the Amazon Redshift cluster, AWS SSO user, and Amazon Managed Grafana workspace resources that you created as part of this solution.

Conclusion

In this post, we covered the process of setting up an Amazon Redshift dashboard working under Amazon Managed Grafana with AWS SSO authentication and querying from the Amazon Redshift cluster under the same AWS account. This is just one way to create the dashboard. You can modify the process to set it up with SAML as an authentication method, use custom IAM roles to manage permissions with more granularity, query Amazon Redshift clusters outside of the AWS account where the Grafana workspace is, use an access key and secret or AWS Secrets Manager based connection credentials in data sources, and more. You can also customize the dashboard by adding or altering visualizations using the feature-rich Grafana UI.

Because the Amazon Redshift data source plugin is an open-source project, you can install it in any Grafana deployment, whether it’s in the cloud, on premises, or even in a container running on your laptop. That allows you to seamlessly integrate Amazon Redshift monitoring into virtually all your existing Grafana-based monitoring stacks.

For more details about the systems and processes described in this post, refer to the following:

Amazon Managed Grafana Documentation (in particular, its Amazon Redshift section)
Amazon Redshift Documentation
AWS Single Sign-On Documentation
Redshift data source for Grafana on the Grafana Labs website
Monitor all your Redshift clusters in Grafana with the new Amazon Redshift data source plugin
Redshift data source for Grafana on GitHub

About the Authors

Sergey Konoplev is a Senior Database Engineer on the Amazon Redshift team. Sergey has been focusing on automation and improvement of database and data operations for more than a decade.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.