Autumn in the Latvian capital of Riga is marked by a variety of traditions. The leaves fall, the rainy season arrives, the birds migrate, and IT professionals from around the world descend on the city for the annual Zabbix Summit.
On October 6 and 7, the Radisson Blu Hotel Latvija was packed with 450 delegates from 38 countries, all there for Zabbix Summit 2023, the 11th in-person edition of Zabbix’s premier yearly event.
This year’s Summit was marked by presentations, partner activities, and moments of relaxation and celebration that will energize the Zabbix community and spark ideas that attendees will take home to every corner of the world.
If you couldn’t make it, here’s a little taste of how it felt to be there!
Zabbix Summit 2023 in numbers
The stage hosted 27 speakers from 17 different countries who gave 31 speeches, including both lectures and lightning talks. There were four workshops with deep dives into technical topics, conducted by the Zabbix technical team as well as our partners from Opensource ICT Solutions and IZI-IT. Summit attendees also enjoyed three parties designed to provide a relaxing experience and networking opportunities.
Zabbix Summit 2023 proudly featured 10 sponsors, all part of Zabbix’s official partner network. They included:
We’d also like to give a shout-out to our Zabbix Fans, who played a crucial role in supporting the Summit this year (as every year) with their attendance, merchandise purchases, and enthusiasm!
We’re grateful to everyone who played a role and helped us make Zabbix Summit 2023 happen!
Highlights from the main stage
This year we continued a Summit tradition and allowed our in-person audience as well as those tuning in via livestream and YouTube to ask questions during live Q&A sessions – a feature that made the proceedings more interactive and helped everyone feel more involved. The speeches were all fascinating and well received, but a few in particular stood out:
What the future holds for Zabbix
Zabbix CEO and Founder Alexei Vladishev kicked off the presentations on Day 1 with a keynote speech about his current plans for Zabbix’s development, including a detailed look at enhancements requested by users.
Avoiding alert fatigue
Bringing a less technical and more conceptual approach to addressing day-to-day data monitoring issues, Rihards Olups, SaaS Architect at Nokia, discussed alert fatigue and how science explains it. During his presentation, Rihards showed how an excess of alerts can negatively affect selective attention and shared his thoughts about how professionals can intervene to prevent problems.
Making Zabbix’s latest offerings accessible to everyone
Day 2 began with Zabbix Director of Business Development Sergey Sorokin focusing on new plans and offerings, including a subscription system for technical support, consulting services, and monitoring tailored for managed service providers.
Monitoring everything (and we do mean everything!)
Janne Pikkarainen, Lead Site Reliability Engineer at Forcepoint, provided detailed and entertaining insights into how he connects Zabbix to smart accessories and uses it to monitor aspects of his home, including the location of personal items, noise levels, and even the frequency of his daughter’s naps and cries.
Implementing ideas and design in MSP environments
In tackling the topic of data collection and analysis for service providers, Brian van Baekel, Zabbix Trainer at Opensource ICT Solutions, presented details on the development of projects focused on monitoring service providers. He also highlighted best practices for data collection in Zabbix Server, data storage, and data presentation on the Zabbix Frontend.
Monitoring the London transportation system
A use case presented by Nathan Liefting, Zabbix Consultant and Trainer at Opensource ICT Solutions, and Adan Mohamed, DevOps Manager at Boldyn Networks, showed how Zabbix monitors the availability of the London Underground subway system. Data is collected from 136 “tube” stations in a high-level architecture and used to assess the availability of Wi-Fi networks, emergency connections, and other services.
Bringing the Olympics and World Cup to life with Zabbix
Marianna Portela, a Tech Lead at Globo in Brazil, shared her insights into how Zabbix supports Globo’s digital transformation and helps her monitor live event infrastructure at massive events like the Olympics and World Cup.
Don’t forget the fun part!
Zabbix Summits are renowned for their friendly, informal atmosphere, which is probably most clearly on display at our famous Summit parties.
Zabbix Summit 2023’s Welcome party was held at the Stargorod Riga brewery in the heart of Riga’s old town. It featured arm wrestling, a selection of delicious foods and beverages, and plenty of opportunities for Summit participants to get to know each other.
The Main party saw live music, dancing, quizzes, and other fun events take place within the historic confines of the Latvian Railway History Museum. The atmosphere, food, drinks, and good company all combined to create an event that nobody who attended will soon forget!
Last but not least, the Closing party at the Burzma food hall was a true celebration of the diversity of the global Zabbix community, with food and music from every country with a Zabbix presence as well as plenty of opportunities for Summit attendees to swap stories and exchange contact details.
Open door, open minds
The traditional Zabbix open-door day was held on Thursday, October 5, and while past Summits have typically seen around 50 visitors, we were proud to welcome closer to 100 this time around. Attendees could have a coffee with their favorite Zabbix employees, play a friendly game of foosball or table tennis, and get a behind-the-scenes look at where the magic happens.
Testify!
One new feature that made a big splash at this year’s Summit was the testimonial booth, which allowed Summit attendees to share their thoughts and experiences about Zabbix with the rest of our community. Sharing a testimonial or leaving a review allowed attendees to collect a piece of exclusive Zabbix Summit 2023 merchandise, and we went through a lot of it – the booth provided us with 28 filmed and 17 written testimonials about Zabbix products and services, far more than we anticipated.
Where to find the presentations
If you couldn’t attend but want to stay informed about what was discussed at the event (or if you’d just like to revisit the stage presentations), both days of recordings are available on Zabbix’s YouTube channel at the following links:
The graphics and texts of the presentations are also available for reference and download on the official event website.
We hope that Zabbix Summit 2023 was a time of valuable learning, connections, and idea exchange for everyone who attended or followed along through social media. If you’ve enjoyed the photos, you can see several more on our Instagram.
If you had an amazing time at Zabbix Summit 2023 (and we certainly hope you did), registration for Zabbix Summit 2024 is already open and Early Bird tickets are available.
CIOs and CITOs know all too well that a smoothly functioning network is the backbone of any business. Your network has to guarantee reliability, performance, and security. An unreliable network, by contrast, means damaged productivity, negative customer perceptions, and haphazard security. The solution is network monitoring, and in this post we’ll explore the reasons why Zabbix is the ideal monitoring solution for any business.
What is network monitoring?
Network monitoring is a critical IT process where all networking components (as well as key performance indicators like CPU utilization and network bandwidth) are constantly monitored to improve performance and eliminate bottlenecks. It provides real-time information that network administrators need to determine whether a network is running optimally.
Why Zabbix?
At Zabbix, we’re here to help you deliver for your customers, flawlessly and without interruptions. Our monitoring solution is 100% open source, available in over 20 languages, and able to collect an unlimited amount of data. Designed with enterprise requirements in mind, Zabbix provides a comprehensive, “single pane of glass” view of any size environment. Put simply, Zabbix allows you to monitor anything – from physical and virtual servers or containers to network infrastructure, applications, and cloud services.
What’s more, we offer a wide variety of additional professional services to go along with our solution, including:
Multiple technical support subscriptions that are tailored to the needs of your business
Certified training programs that are designed to help you master Zabbix under the guidance of top experts
A wide range of professional services, including template building, upgrades, consulting, and more
Keep reading to find out more about the difference Zabbix can make for your business.
The Zabbix advantage
IT teams are under enormous pressure to have their networks functioning perfectly 100% of the time, and with good reason. It’s simply not possible to run a business with a malfunctioning network. Here are 5 key reasons why you need to make network monitoring a top priority, and why Zabbix is the right answer for all of them.
Reliability
A network monitoring solution’s main reason for being is to show whether a device is working or not. Taking a proactive approach to maintaining a healthy network will keep tech support requests and downtime to an absolute minimum. Zabbix makes it easy to do so by automatically detecting problem states in your metric flow. Not only that, but our automated predictive functions can also help you react proactively. They do this by forecasting a value for early alerting and predicting the time left until you reach a problem threshold. Automation then allows you to remove additional inefficiencies.
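As a rough illustration of those predictive functions (shown in Zabbix 6.x-style trigger syntax, with a hypothetical host name and item keys), the first expression below fires if free disk space is forecast to drop below zero within the next six hours based on the last hour of data, and the second fires when less than four hours remain until disk usage reaches 100%:
forecast(/MyHost/vfs.fs.size[/,free],1h,6h)<0
timeleft(/MyHost/vfs.fs.size[/,pused],1h,100)<4h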
Visibility
Having complete visibility of all your hardware and software assets allows you to easily monitor the health of your network. Zabbix lets businesses access metrics, issues, reports, and maps with a single click, allowing you to:
Analyze and correlate your metrics with easy-to-read graphs
Track your monitoring targets on an interactive geo-map
Display the statuses of your elements together with real-time data to get a detailed overview of your infrastructure on a Zabbix map
Generate scheduled PDF reports from any Zabbix dashboard
Extend the native Zabbix frontend functionality by developing your own frontend widgets and modules
Performance
By making it easy to monitor anything, Zabbix lets you know which parts of your network are being properly used, overused, or underused. This can help you uncover unnecessary costs that can be eliminated or identify a network component that needs upgrading.
Compliance
Today’s IT teams need to meet strict regulatory and protection standards in increasingly complex networks. Zabbix can spot changes in normal system behavior and unusual data flow. It can then either leverage multiple messaging channels to notify your team about anomalies or simply resolve any issues automatically.
Profitability
Zabbix has an extensive track record of making businesses more productive by saving network management time and lowering operating costs. Servers, for example, are machines that inevitably break down from time to time. Being able to relaunch quickly after a failure and minimize server downtime is vital. By making sure your team is aware of any and all current and impending issues, Zabbix can reduce downtime and increase the productivity and efficiency of your business.
Zabbix across industries
Whatever field you’re in, there’s no substitute for consistent, problem-free service when it comes to gaining the trust and loyalty of customers. Zabbix has an extensive track record of helping clients in multiple industries achieve their goals.
Zabbix for healthcare
A typical hospital relies on tens of thousands of connected devices. Manually checking each one for anomalies simply isn’t practical. Establishing a stable service level is a vital issue in most industries, but in healthcare it’s literally a matter of life and death. With Zabbix, hospital IT teams receive potentially life-saving alerts if anything is out of the ordinary.
What’s more, Zabbix can monitor progress toward expected outcomes, providing up-to-the-minute statistics on data errors or IT system failures. Issues, response times, and potential bottlenecks are displayed in easy-to-read graphs and charts. This allows hospital staff to follow up on the presence or absence of problems.
Zabbix for banking and finance
Financial institutions of all sizes rely on their networks to maintain connectivity and productivity. By processing millions of checks per minute and considering very complex dependencies between different elements of infrastructure, Zabbix allows banks to proactively detect and resolve network problems before they turn into major business disruptions.
Zabbix is also designed to seamlessly connect distributed architecture, including remote offices, branches, and even individual ATMs. Some of our financial industry clients previously used up to 20 different monitoring tools. Each alert sent hundreds of emails to different people, making it impossible to effectively monitor the environment. Naturally, they found Zabbix’s ability to monitor many thousands of devices and “single pane of glass” view to be a significant upgrade.
Zabbix for education
In an age of digital course materials and resources, schools and universities can’t operate without functioning IT infrastructures. Our clients in education typically have heterogeneous infrastructures with thousands of servers and clients. They also possess all kinds of connected devices, dozens of different operating systems, multiple locations, and hundreds of IT staff.
Zabbix has proven itself to be a simple, cost-effective method of monitoring geographically distributed campuses and educational sites. We’ve done this by:
Providing early notification of possible viruses, worms, Trojan horses, and other transmitters of system infection
Monitoring IT systems for intellectual property (IP) protection purposes
Saving human resources by reducing manual work
Zabbix for government
Network monitoring is critical for government agencies, as downtime can bring a halt to vital public services. Our public-sector clients range from city-wide public transportation companies all the way up to entire prefectures. They use Zabbix to monitor the availability of utilities, transport, lighting, and many other public services.
In the process, Zabbix increases the effectiveness of budget expenditures by providing precise and accountable data on how public resources are used. This makes it easier to justify further expenditures. In most business software, agents are required for each monitored host and costs increase in proportion to the number of monitored hosts. By contrast, Zabbix is open source and the software itself is free of charge, resulting in anticipated cost reductions of up to 25% in many cases.
Zabbix for retail
Retail environments increasingly depend on network-connected equipment, particularly when it comes to warehouse monitoring and tracking SKUs (stock keeping units). Zabbix delivers an all-in-one tool to monitor different applications, metrics, processes, and equipment while providing a complete picture about the availability and performance of all the components that make a retail business successful. This makes it possible for retailers to easily automate store openings and closings, monitor cash machines, and keep track of access system log entries.
Not only that, the quantity and quality of information that Zabbix collects makes it easy for retailers to conduct a more accurate analysis of what is happening (or what may happen) and take preventive measures. Our retail clients find that having this level of control over their resources and services increases the confidence of their teams as well as their customers.
Zabbix for telecom
Internet, telephony, and television verticals require availability and consistency. The key to success is providing your services 24/7/365.
Zabbix makes this possible by providing full visibility of all network and customer devices, allowing operators to know of any outage before customers do and take necessary actions. Some of our telecommunications clients are able to effortlessly monitor well over 100,000 devices with a single Zabbix server. This helps them improve the customer experience and drive growth in the process.
Zabbix for aerospace
In the aerospace industry, timely data delivery and issue notification are the keys to safe operations. Aircraft depend on complex electronic systems that can diagnose the slightest deviations and make malfunctions known. Unfortunately, this is often in the form of either an indicator light on an instrument panel or a log message that is accessible only with specialized software or tools.
With Zabbix, all data transfers from the aircraft’s diagnostic system to the responsible employees can happen automatically. Error prioritization and escalation to further levels can also happen automatically if any aircraft has an ongoing issue that remains active for multiple days.
Conclusion
At Zabbix, our goal is a world without interruptions, powered by a world-class universal monitoring solution that’s available and affordable to any business. Our open-source software allows you to monitor your entire IT stack, no matter what size your infrastructure is or where it’s hosted.
That’s why government institutions across the globe as well as some of the world’s largest companies trust us with their network monitoring needs.
Get in touch with us to learn more and get started on the path to maximum efficiency and uptime today!
With just a few days remaining until Zabbix Summit 2023, our series of speaker interviews draws to a close as we talk to Opensource ICT Solutions trainer and consultant Nathan Liefting about how he worked with Adan Mohamed of Boldyn Networks to monitor the London Underground with Zabbix.
Please tell us a bit about yourself and your work.
I’m a Zabbix trainer and consultant for Opensource ICT Solutions. You might also know me from the books Brian van Baekel and I wrote called “Zabbix IT Infrastructure Monitoring.”
How long have you been using Zabbix? What kind of daily Zabbix tasks are you involved in at your company?
My tasks are easy to explain – Zabbix, Zabbix, and some more Zabbix! Opensource ICT Solutions is one of the few companies that focus solely on Zabbix, so I get to work full time with the product, 40 hours a week. I build new environments, integrations, automations, and anything that you might need for your Zabbix environment.
Can you give us a sneak peek at what we can expect to hear during your Zabbix Summit speech?
Definitely! Adan from Boldyn Networks and I will be presenting you with a real use case for Zabbix monitoring. We’ll have a look at how Boldyn has brought broadband network connectivity to the London Underground tunnels and why it’s so important to monitor the equipment that makes that all possible. Of course, since this is THE Zabbix summit, we’ll also look at what the Zabbix setup looks like and share a pretty interesting use case for SNMP traps.
How and why did you come to the decision to use Zabbix as the monitoring solution for your use case?
Boldyn was looking for the best network monitoring solution for their project. Since we offer exactly that, we got to talking and we decided that our favorite open-source network monitoring tool was the way to go. Since then, we’ve been building amazing custom monitoring implementations together. The rest is history.
Can you mention some other noteworthy non-standard Zabbix monitoring use cases that you’ve worked on?
Definitely! Since I get to work on monitoring all day long, we’ve got a lot to choose from. I do most of my work in the Netherlands, the United Kingdom, and the United States, and all those markets are super exciting. We’re monitoring infrastructure that keeps planes flying safely and makes sure power grids are up and running, and now we’re also helping to keep people connected to the internet even when they go underground. If you ask me, it doesn’t get a lot more exciting than that – and that’s just the tip of the iceberg.
To help everyone in our community get up to speed with Zabbix Summit speakers and their topics, we’re continuing our series of interviews and sitting down for a chat with Marianna Portela of Brazilian mass media conglomerate Globo. Read on to get a preview of her Summit speech topic and see how she uses Zabbix to bring massive live events to millions of users around the globe.
Please tell us a bit about yourself and your work.
I’m a tech lead at Globo, the largest media group in Latin America. It includes over-the-air broadcasting, television and film production, a pay television subscription service, streaming media, publishing, and online services.
How long have you been using Zabbix? What kind of daily Zabbix tasks are you involved in at your company?
I have been working at Globo for 15 years. I’ve been involved in monitoring for 11 of those years, and I’ve been using Zabbix for 10. I help monitor the applications that generate data for live events, and I use Zabbix to generate metrics that support decision-making related to better content delivery quality.
Can you name a few of the specific challenges that Zabbix has helped you solve?
Zabbix allows us to empower our users and supports our entire digital transformation – including many things related to Globoplay streaming. It also helps us monitor live event infrastructure, like the Olympics and World Cup. Previously, when there were technical issues during live events, we would try to figure out what happened after the fact, but no longer – Zabbix gives us a proactive analysis of potential occurrences within live production.
Can you give us a sneak peek at what we can expect to hear during your Zabbix Summit speech?
I’m planning to talk about how we use Zabbix to help ensure the quality monitoring of live production, which is essentially the part of Globo that deals with any type of live event and generates data for things like games, for example. I’ll introduce how we started with actual infrastructure monitoring and how this digital transformation at Globo began, specifically how we managed to enter new areas like content generation, especially live content. Then I’ll also discuss some specifics of how we monitor live event infrastructure.
Servers are the foundation of a company’s IT infrastructure, and the cost of server downtime can include anything from days without system access to the loss of important business data. This can lead to operational issues, service outages, and steep repair costs.
Viewed against this backdrop, server monitoring is an investment with massive benefits to any organization. The latest generation of server monitoring tools make it easier to assess server health and deal with any underlying issues as quickly and painlessly as possible.
What are servers, and how do they work?
Servers are computers (or applications) that run software services for other computers or devices on a network. The server takes requests from client computers or devices and performs tasks in response to those requests. These tasks can involve processing data, providing content, or performing calculations. Some servers are dedicated to hosting web services, which are software services offered on any computer connected to the internet.
What is server monitoring? Why does it matter?
Servers are some of the most important pieces of any company’s IT infrastructure. If a server is offline, running slowly, or experiencing outages, website performance will be affected and customers may decide to go elsewhere. If an internal file server is generating errors, important business data like accounting files or customer records could be compromised.
A server monitoring system is designed to watch your systems and provide a number of key metrics regarding their operation. In general, server monitoring software tests for accessibility (making sure that the server is alive and can be reached) and response time (guaranteeing that it is running fast enough to keep users happy). What’s more, it sends notifications about missing or corrupt files, security violations, and other issues.
Server monitoring is most often used for processing data in real time, but quality server monitoring is also predictive, letting users know when disks will reach capacity and whether memory or CPU utilization is about to be throttled. By evaluating historical data, it’s possible to find out if a server’s performance is degrading over time and even predict when a complete crash might occur.
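To make the predictive idea concrete, here is a minimal sketch (written in Node.js, with entirely hypothetical sample data) that fits a straight line to recent disk-usage samples and estimates how many hours remain until the disk is full:
// Minimal sketch: estimate when a disk will reach capacity by fitting a straight
// line (least squares) to recent usage samples. The sample data below is hypothetical.
function hoursUntilFull(samples, capacityGb) {
    // samples: [{ hoursAgo, usedGb }], ordered oldest to newest, most recent has hoursAgo = 0
    const n = samples.length;
    const xs = samples.map((s) => -s.hoursAgo); // time axis in hours
    const ys = samples.map((s) => s.usedGb);
    const meanX = xs.reduce((a, b) => a + b, 0) / n;
    const meanY = ys.reduce((a, b) => a + b, 0) / n;
    let num = 0;
    let den = 0;
    for (let i = 0; i < n; i++) {
        num += (xs[i] - meanX) * (ys[i] - meanY);
        den += (xs[i] - meanX) ** 2;
    }
    const slopeGbPerHour = num / den;
    if (slopeGbPerHour <= 0) return Infinity; // usage is flat or shrinking
    const currentUsed = ys[n - 1]; // most recent sample
    return (capacityGb - currentUsed) / slopeGbPerHour;
}

// Hypothetical usage: roughly 2 GB of growth per day on a 500 GB volume
const samples = [
    { hoursAgo: 72, usedGb: 441 },
    { hoursAgo: 48, usedGb: 443 },
    { hoursAgo: 24, usedGb: 445 },
    { hoursAgo: 0, usedGb: 447 },
];
console.log(`Estimated hours until full: ${hoursUntilFull(samples, 500).toFixed(0)}`);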
How can server monitoring help businesses?
Here are a few of the most important business benefits of server monitoring:
Server monitoring tools give you a bird’s-eye view of your server’s health and performance
A quality server monitoring tool keeps IT administrators aware of metrics like CPU usage, RAM, disk space, and network bandwidth. This helps them to see when servers are slowing down or failing, allowing them to act before users are affected.
Server monitoring simplifies process automation
IT teams have long checklists when it comes to managing servers. They need to monitor hard disk space, keep an eye on infrastructure, schedule system backups, and update antivirus software. They also need to be able to foresee and solve critical events, while managing any disruptions.
A server monitoring tool helps IT professionals by automating all or many aspects of these jobs. It can show whether a backup was successful, if software is patched, and whether a server is in good condition. This allows IT teams to focus on tasks that benefit more from their involvement and expertise.
Server monitoring makes it easier to retain customers as well as employees
Acting quickly when servers develop issues (or even before) makes sure that employee workflows aren’t disrupted, allowing them to perform their duties, see results, and reach their goals. It also guarantees a positive customer experience by providing early notification of any issues.
Server monitoring keeps costs down
By automating processes and tasks (and freeing up time in the process) server monitoring systems make the most of resources and reduce costs. And by solving potential issues before they affect the organization, they help businesses avoid lost revenue from unfinished employee tasks, operational delays, and unfinished purchases.
What should you look for in a server monitoring solution?
Now that you’re sold on the benefits of server monitoring, you’ll want to choose the server monitoring solution that’s right for you. Here are a few capabilities to keep in mind:
Ease of use
Does the solution include an intuitive dashboard that makes it easy to monitor events and react to problems quickly? It should, and it should also allow you to make the most of the data it exports by providing graphs, reports, and integrations.
Customer support
Is it easy to contact support? How quickly do they respond? A quality server monitoring solution will provide a defined SLA and stick to it with no exceptions.
Breadth of coverage
A good solution will support all the server types (hardware, software, on-premises, cloud) that your enterprise uses. It should also be flexible enough to support any server types you may implement in the future.
Alert management
There are a few important questions to ask when it comes to alerts:
Does the solution include a dashboard or display that makes it easy to track events and react to problems quickly?
Is it easy to set up alerts via the configuration of thresholds that trigger them? How are alerts delivered?
Does the solution have a way to help you determine why a problem has occurred, instead of just telling you that something has gone wrong without context?
What are some best practices to keep in mind?
Here are a few best practices that will help you avoid the more common server monitoring pitfalls:
Proactively check for failures
Keep a sharp eye out for any issues that may affect your software or hardware. The tools included with a good monitoring solution can alert you to errors caused by a corrupted database (for example) and let you know if a security incident has left important services disabled.
Don’t forget your historical data
Server problems rarely occur in a vacuum, so look into the context of issues that emerge. You can do that by exploring metrics across a specific period, typically between 30 and 90 days. For example, you may find that CPU temperature has increased within the past week, which may suggest a problem with a server cooling system.
Operate your hardware in line with recommended tolerance levels
File servers are commonly pushed to the limit, rarely getting a break. That’s why it’s important to monitor metrics like CPU utilization, RAM utilization, storage capacity usage, and CPU temperature. Check these metrics regularly to identify issues before it’s too late.
Keep track of alerts
Always monitor your alerts in real time as they occur and explore reliable ways to manage and prioritize them. When escalating an incident, make sure it goes to the right individual as soon as possible.
Use server monitoring data to plan short-term cloud capacity
Server monitoring systems can help you plan the right computing power for specific moments. If services become slower or users experience other problems with performance, an IT manager can assess the situation through the server monitor. They’ll then be able to allocate extra resources to solve the problem.
Take advantage of capacity planning
Data center workloads have almost doubled in the past 5 years, and servers have had to keep up with this ongoing change. Analyzing long-term server utilization trends can prepare you for future server requirements.
Go beyond asset management
With server monitoring, you can discover which systems are approaching the end of their lives and whether any assets have disappeared from your network. You can also let your server monitoring tool handle the heavy lifting for you when it comes to tracking physical hardware.
The Zabbix Advantage
Zabbix is designed to make server monitoring easy. Our solution allows you to track any server metric or incident, including performance, availability, and configuration changes.
Intuitive dashboards, network graphs, and topology maps allow you to visualize server performance and availability, and our flexible alerting allows for multiple delivery methods and customized message content.
Not only that, our out-of-the-box templates come with preconfigured items, triggers, graphs, applications, screens, low-level discovery rules, and web scenarios – all designed to have you up and running in just a few minutes.
And because Zabbix is open-source, it’s not just affordable, it’s free. Contact us to find out more and enjoy the peace of mind that comes from knowing that your servers are under control.
FAQ
Why do we need server monitoring?
Server monitoring allows IT professionals to:
Monitor the responsiveness of a server
Know a server’s capacity, user load, and speed
Proactively detect and prevent any issues that might affect the server
Why do companies choose to monitor their servers?
Companies monitor servers so that they can:
Proactively identify any performance issues before they impact users
Understand a server’s system resource usage
Analyze a server for its reliability, availability, performance, security, etc.
How is server monitoring done?
Server monitoring tools constantly collect system data across an entire IT infrastructure, giving administrators a clear view of when certain metrics are above or below thresholds. They also automatically notify relevant parties if a critical system error is detected, allowing them to act in a timely manner to resolve issues.
What should you monitor on a server?
Key areas to monitor on a server include:
A server’s physical status
Server performance, including CPU utilization, memory resources, and disk activity
Server uptime
Page file usage
Context switches
Time synchronization
Process activity
Server capacity, user load, and speed
If I want to monitor a server, how easy is it to set things up?
Setting up a server monitoring tool is easy, provided you’ve taken into account these 5 steps:
Your company’s network is the glue that bonds your enterprise together. The technology of networking is growing more stable and reliable all the time, but that doesn’t mean you can leave your network unattended – quality network monitoring is an absolute must-have.
What are network monitoring systems?
At its most basic, network monitoring is a critical IT process where all networking components (as well as key performance indicators like network hardware CPU utilization and network bandwidth) are continuously and proactively monitored to improve performance, eliminate bottlenecks, and prevent network congestion and downtime.
Put more simply, it’s the act of keeping an eye on all the connected elements that are relevant to your business. That means all your hardware and software resources, including routers, switches, firewalls, servers, PCs, printers, phones, and tablets.
A network monitoring system is a set of software tools that lets you program this action. It allows you to constantly monitor your network infrastructure by doing systematic tests to look for issues and notifying you if any are found. A good system makes monitoring your network easy by:
Allowing you to see all information in dashboards
Generating reports on demand
Sending alerts
Displaying the monitoring data you need in easy-to-read graphs
What are some key benefits of network monitoring?
A quality network monitoring solution allows you to:
Benchmark standard performance
Monitoring gives you the visibility to benchmark your network’s everyday performance. It also makes it easy to spot any fluctuations in performance, which in turn allows you to identify any unwanted changes.
Effectively allocate resources
IT teams need a clear understanding of the source of problems. They also need the ability to minimize tedious troubleshooting and put in place proactive measures to stay ahead of IT outages. To use a plumbing analogy, monitoring lets them fix cracks before a leak happens.
Identify security threats
Preventing security breaches is a major challenge for any organization. As attacks become increasingly sophisticated and difficult to trace, detecting and mitigating any form of network threat before it escalates is critical. Network monitoring makes it easier to protect data and systems by providing early warning of any suspicious anomalies.
Manage a changing IT environment
New technologies like internet-enabled sensors, wireless devices, and cloud technologies make it harder for IT teams to track performance fluctuations or suspicious activity. A network monitoring solution can:
Give IT teams a comprehensive inventory of wired and wireless devices
Make it easy to analyze long-term trends
Help you get the most out of your available assets
Proactively detect and resolve issues before they affect users
Monitoring a network closely allows an organization to quickly resolve issues and prevent major disruptions. This means fewer interruptions to operations and better utilization of IT resources.
Deploy new technology and system upgrades successfully
Thanks to monitoring, IT teams can learn how equipment has performed over time and use trend analysis to see whether current technology can scale to meet business needs. This can:
Give a clear picture of whether a network is able to support the launch of a new technology
Mitigate any risks associated with a major change
Easily demonstrate ROI by providing comprehensive metrics
What are some different types of network monitoring?
Different types of monitoring exist depending on what exactly needs to be monitored. Some of the most common include fault monitoring, log monitoring, network performance monitoring, configuration monitoring, and availability monitoring.
Fault monitoring
As the name suggests, fault monitoring involves finding and reporting faults in a computer network. It is crucial for maintaining uninterrupted network uptime and is essential to keeping all programs and services running smoothly.
Log monitoring
Resources such as servers, applications, and websites continuously generate logs, which can:
Provide valuable insights into user activity
Help a business comply with regulations
Help resolve incidents promptly
Boost network security
Network performance monitoring (NPM)
NPM tracks monitoring parameters like latency, network traffic, bandwidth usage, and throughput, with the goal of optimizing user experience. NPM tools provide valuable information that can be used to minimize downtime and troubleshoot network issues.
Configuration monitoring
Monitoring network configuration involves keeping track of the software and firmware in use on the network and making sure that any inconsistencies are identified and addressed. This prevents any gaps in visibility or security.
Network availability monitoring
Availability monitoring is the monitoring of all IT infrastructure to determine the uptime of devices. By consistently monitoring devices and servers, organizations can receive alerts when there is a network crash or when a device becomes unavailable. ICMP, SNMP, and syslog are the most commonly used availability monitoring techniques.
How does it work?
Network monitoring uses multiple techniques to test the availability and functionality of a network. Here are a few of the most common techniques used to collect data for monitoring software:
Ping
A ping is the simplest technique that monitoring software uses to test hosts within a network (a minimal scripted example follows the lists below). The monitoring system sends out a signal and records:
Whether the signal was received
How long it took the host to receive the signal
Whether any signal data was lost
That data is then used to determine:
Whether the host is active
How efficient the host is
The transmission time and packet loss experienced when communicating with the host
Any other vital information
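To illustrate the idea, the sketch below (in Node.js, with a placeholder address) shells out to the operating system’s ping command and extracts round-trip time and packet loss; the output parsing assumes Linux-style ping and would need adjusting on other platforms:
// Minimal sketch: probe a host with the system ping command and report
// round-trip time and packet loss. Output parsing assumes Linux-style ping output.
const { exec } = require('child_process');

function pingHost(host) {
    return new Promise((resolve) => {
        exec(`ping -c 4 ${host}`, (error, stdout) => {
            if (error) {
                // A non-zero exit code usually means the host did not answer
                return resolve({ host, reachable: false });
            }
            const lossMatch = stdout.match(/(\d+(?:\.\d+)?)% packet loss/);
            const rttMatch = stdout.match(/rtt [^=]*= [\d.]+\/([\d.]+)\//);
            resolve({
                host,
                reachable: true,
                packetLossPercent: lossMatch ? Number(lossMatch[1]) : null,
                avgRttMs: rttMatch ? Number(rttMatch[1]) : null,
            });
        });
    });
}

// Example usage with a placeholder (documentation-range) address
pingHost('192.0.2.10').then((result) => console.log(result));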
Simple network management protocol (SNMP)
SNMP is the most widely used protocol for modern network management systems. It uses monitoring software to monitor individual devices in a network. In this system, each monitored device has SNMP agent monitoring software that sends information about the device’s performance to the monitoring solution, which collects this information in a database and then analyzes it for errors.
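As a sketch of what an SNMP poll looks like in practice, the Node.js fragment below reads a device’s uptime and interface count over SNMPv2c; it assumes the third-party net-snmp package is installed, and the address and community string are placeholders:
// Minimal sketch: poll two standard MIB-2 OIDs from a device over SNMPv2c.
// Requires the third-party "net-snmp" package (npm install net-snmp).
const snmp = require('net-snmp');

const session = snmp.createSession('192.0.2.1', 'public'); // placeholder address and community
const oids = [
    '1.3.6.1.2.1.1.3.0', // sysUpTime
    '1.3.6.1.2.1.2.1.0', // ifNumber (number of network interfaces)
];

session.get(oids, (error, varbinds) => {
    if (error) {
        console.error('SNMP request failed:', error.message);
    } else {
        for (const vb of varbinds) {
            if (snmp.isVarbindError(vb)) {
                console.error(snmp.varbindError(vb));
            } else {
                console.log(`${vb.oid} = ${vb.value}`);
            }
        }
    }
    session.close();
});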
Syslog
Syslog is an automated messaging system that sends messages when an event affects a network device. Technicians can set up devices to send out messages when the device encounters an error, shuts down unexpectedly, encounters a configuration failure, and more. These messages often contain information that can be used for system management as well as security systems.
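To make the mechanism concrete, here is a minimal Node.js listener that accepts syslog-style messages over UDP and pulls the severity out of the leading <PRI> field; it is an illustrative sketch rather than a full RFC 3164/5424 implementation, and the port is a placeholder (standard syslog uses UDP 514):
// Minimal sketch: receive syslog-style messages over UDP and extract the severity.
const dgram = require('dgram');

const server = dgram.createSocket('udp4');

server.on('message', (msg, rinfo) => {
    const text = msg.toString();
    const match = text.match(/^<(\d+)>/); // leading <PRI> field
    const priority = match ? Number(match[1]) : null;
    const severity = priority !== null ? priority % 8 : null; // 0 = emergency ... 7 = debug
    console.log(`from ${rinfo.address}: severity=${severity} message=${text}`);
});

server.bind(5140); // placeholder port; standard syslog uses 514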
Scripts
Scripts are simple programs that collect basic information and instruct the network to perform an action within certain conditions. They can fill gaps in monitoring software functionality, performing scheduled tasks such as resetting and reconfiguring a public access computer every night.
Scripts can also be used to collect data, sending out an alert if results don’t fall within certain thresholds. Network managers will usually set these thresholds, programming the network software to send out an alert (as in the example after this list) if data indicates issues, including:
Slow throughput
High error rates
Unavailable devices
Slower-than-usual response times
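Here is what such a threshold script might look like as a minimal Node.js sketch; the URL and threshold are placeholders, and a real deployment would send the alert through email, chat, or a monitoring API rather than just logging it:
// Minimal sketch: measure HTTP response time and flag it when a threshold is exceeded.
// Schedule with cron or a task runner; the URL and threshold are placeholders.
const https = require('https');

const TARGET_URL = 'https://example.com/health';
const THRESHOLD_MS = 2000;

function checkResponseTime(url) {
    const started = Date.now();
    https.get(url, (res) => {
        res.resume(); // drain the body; only the timing and status code matter here
        res.on('end', () => {
            const elapsed = Date.now() - started;
            if (res.statusCode >= 400 || elapsed > THRESHOLD_MS) {
                console.error(`ALERT: ${url} status=${res.statusCode} time=${elapsed}ms`);
            } else {
                console.log(`OK: ${url} responded in ${elapsed}ms`);
            }
        });
    }).on('error', (err) => {
        console.error(`ALERT: ${url} unreachable (${err.message})`);
    });
}

checkResponseTime(TARGET_URL);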
How can businesses benefit from network monitoring?
Here are 5 ways that quality network monitoring can benefit any business:
Increased reliability
The main function of any monitoring solution is to show whether a device is working or not. A proactive approach to maintaining a healthy network will keep tech support requests and downtime to an absolute minimum.
Improved visibility
Having complete visibility of all your hardware and software assets allows you to easily monitor the health of your network. Monitoring tracks the data moving along cables and through servers, switches, connections, and routers. In the event of a problem, your IT team can identify the root cause and fix the issue quickly.
Enhanced performance
Network monitoring software lets you know which parts of your network are being properly used, overused, or underused. You can also uncover unnecessary costs that can be eliminated or identify a network component that needs upgrading.
Stricter compliance
Today’s IT teams need to meet strict regulatory and protection standards in increasingly complex networks. The latest compliance guidelines recommend actively watching for changes in normal system behavior and unusual data flow. The data provided by monitoring tools makes it easy to assess your entire system and deliver a service that meets all required standards.
Greater profitability
Network monitoring makes businesses more productive by saving network management time and lowering operating costs. If your team is aware of current and impending issues, you can reduce downtime and increase productivity and efficiency.
The Zabbix advantage
At Zabbix, we’ve perfected enterprise IT infrastructure monitoring software that can be deployed anywhere and monitor any device, system, or app in any environment while providing comprehensive data protection, easy integration, and unlimited visualization options.
You can also count on complete transparency, a predictable release cycle, a vibrant and active user community, and an outstanding user experience.
Everything we do scales easily, so we’re able to grow right along with you. What’s more, we offer a comprehensive range of professional services, including implementation, integration, custom development, consulting services, technical support, and a full suite of training programs.
The best part? Because Zabbix is open-source, it’s not just affordable – it’s free. Get in touch with us to find out more and get started on the path to maximum network efficiency today.
FAQ
What is an example of basic network monitoring?
An example of basic network monitoring is a network engineer collecting real-time data from a data center and setting up alerts when a problem (such as a device failure, a temperature spike, a power outage, or a network capacity issue) appears.
What is network monitoring used for?
Network monitoring can:
Determine whether a network is running optimally in real time
Proactively identify deficiencies and optimize efficiency
Catch and repair problems before they impact operations
Reduce downtime and make sure employees have access to the resources they need
Boost the availability of APIs and webpages
Optimize network performance and availability
What is the most popular network monitoring program?
Some of the most popular network monitoring programs available on the market include:
With Zabbix Summit 2023 almost upon us, we’ve prepared a short and direct interview with Summit presenter Dr. Hiroshi Abe. Dr. Abe, a Research Engineer at the Toyota Motor Corporation, will share his thoughts about how Zabbix is the ideal solution when it comes to monitoring green power and distributed edge computing.
Please tell us a bit about yourself and your work.
I have been working for the Toyota Motor Corporation as a Research Engineer since 2019. My current research topics are related to large-scale monitoring systems that target connected car communications, edge computing, and green IT.
How long have you been using Zabbix? What Zabbix tasks are you involved in every day at your company?
Although I am technically retired, since 2015 I have been a member of the Monitoring team of the Network Operation Center, which is part of the ShowNet building team for the “Interop Tokyo” show event in Japan. I have been working with Kodai Terashima, CEO of Zabbix Japan, to build a monitoring system using Zabbix to monitor the event network required for ShowNet. In my office, we use Zabbix to monitor network and server equipment as well as our R&D environment.
Can you give us a sneak peek at what we can expect to hear during your Zabbix Summit speech?
You might expect to hear something deeply related to cars, and it’s true that much of the data created by cars can be processed using edge computing before being transported to the cloud. However, edge computing for the optimal use of green power will be the main topic of my talk. I’ll discuss a distributed monitoring system that uses Zabbix and Zabbix Proxy as a monitoring system for edge environments and green power in multiple data centers.
What made you go with Zabbix as a monitoring solution for green power and edge computing?
Zabbix Proxy is an easy-to-use distributed monitoring solution. A distributed monitoring system is a must for us because there will be multiple edge computing locations all over Japan.
Are you deploying Zabbix using containers to monitor your DCs?
We used Red Hat’s OpenShift to implement the edge computing and data synchronization mechanism. We were able to easily deploy Zabbix in OpenShift as a container using an Operator, and the monitoring environment based on Zabbix containers is implemented in multiple DCs.
Many customers use Amazon CloudWatch dashboards to monitor applications and often ask how they can integrate Amazon DevOps Guru Insights in order to have a unified dashboard for monitoring. This blog post showcases integrating DevOps Guru proactive and reactive insights into a CloudWatch dashboard by using Custom Widgets. Displaying related data from different sources side by side helps you correlate trends over time, spot issues more efficiently, and get a single-pane-of-glass view in the CloudWatch dashboard.
Amazon DevOps Guru is a machine learning (ML) powered service that helps developers and operators automatically detect anomalies and improve application availability. DevOps Guru’s anomaly detectors can proactively detect anomalous behavior even before it occurs, helping you address issues before they happen; detailed insights provide recommendations to mitigate that behavior.
An Amazon CloudWatch dashboard is a customizable home page in the CloudWatch console that monitors multiple resources in a single view. You can use CloudWatch dashboards to create customized views of the metrics and alarms for your AWS resources.
Solution overview
This post will help you to create a Custom Widget for the Amazon CloudWatch dashboard that displays DevOps Guru Insights. A custom widget is part of your CloudWatch dashboard that calls an AWS Lambda function containing your custom code. The Lambda function accepts custom parameters, generates your dataset or visualization, and then returns HTML to the CloudWatch dashboard. The CloudWatch dashboard will display this HTML as a widget. In this post, we are providing sample code for the Lambda function that calls DevOps Guru APIs to retrieve the insights information and displays it as a widget in the CloudWatch dashboard. The architecture diagram of the solution is below.
Figure 1: Reference architecture diagram
Prerequisites and Assumptions
An AWS account. To sign up:
Create an AWS account. For instructions, see Sign Up For AWS.
DevOps Guru should be enabled in the account. To enable DevOps Guru, see DevOps Guru Setup.
Follow this Workshop to deploy a sample application in your AWS Account which can help generate some DevOps Guru insights.
Solution Deployment
We are providing two options to deploy the solution – using the AWS console and AWS CloudFormation. The first section has instructions to deploy using the AWS console followed by instructions for using CloudFormation. The key difference is that we will create one Widget while using the Console, but three Widgets are created when we use AWS CloudFormation.
Using the AWS Console:
We will first create a Lambda function that will retrieve the DevOps Guru insights. We will then modify the default IAM role associated with the Lambda function to add DevOps Guru permissions. Finally we will create a CloudWatch dashboard and add a custom widget to display the DevOps Guru insights.
Navigate to the Lambda Console after logging in to your AWS Account and click on Create function.
Figure 2a: Create Lambda Function
Choose Author from Scratch and use the runtime Node.js 16.x. Leave the rest of the settings at default and create the function.
Figure 2b: Create Lambda Function
After a few seconds, the Lambda function will be created and you will see a Code source box. Copy the code from the text box below and replace the existing code in the Code source editor, as shown in the screenshot below.
// SPDX-License-Identifier: MIT-0
// CloudWatch Custom Widget sample: displays count of Amazon DevOps Guru Insights
const aws = require('aws-sdk');
const DOCS = `## DevOps Guru Insights Count
Displays the total counts of Proactive and Reactive Insights in DevOps Guru.
`;
async function getProactiveInsightsCount(DevOpsGuru, StartTime, EndTime) {
    let NextToken = null;
    let proactivecount = 0;
    do {
        // Include NextToken in the request so pagination advances instead of re-reading the first page
        const args = { StatusFilter: { Any: { StartTimeRange: { FromTime: StartTime, ToTime: EndTime }, Type: 'PROACTIVE' } }, NextToken: NextToken || undefined };
        const result = await DevOpsGuru.listInsights(args).promise();
        NextToken = result.NextToken;
        // Count every proactive insight returned in this page
        proactivecount += result.ProactiveInsights.length;
    } while (NextToken);
    return proactivecount;
}
async function getReactiveInsightsCount(DevOpsGuru, StartTime, EndTime) {
    let NextToken = null;
    let reactivecount = 0;
    do {
        // Include NextToken in the request so pagination advances instead of re-reading the first page
        const args = { StatusFilter: { Any: { StartTimeRange: { FromTime: StartTime, ToTime: EndTime }, Type: 'REACTIVE' } }, NextToken: NextToken || undefined };
        const result = await DevOpsGuru.listInsights(args).promise();
        NextToken = result.NextToken;
        // Count every reactive insight returned in this page
        reactivecount += result.ReactiveInsights.length;
    } while (NextToken);
    return reactivecount;
}
function getHtmlOutput(proactivecount, reactivecount, region, event, context) {
return `DevOps Guru Proactive Insights<br><font size="+10" color="#FF9900">${proactivecount}</font>
<p>DevOps Guru Reactive Insights</p><font size="+10" color="#FF9900">${reactivecount}`;
}
exports.handler = async (event, context) => {
    if (event.describe) {
        return DOCS;
    }
    const widgetContext = event.widgetContext;
    const timeRange = widgetContext.timeRange.zoom || widgetContext.timeRange;
    const StartTime = new Date(timeRange.start);
    const EndTime = new Date(timeRange.end);
    const region = event.region || process.env.AWS_REGION;
    const DevOpsGuru = new aws.DevOpsGuru({ region });
    const proactivecount = await getProactiveInsightsCount(DevOpsGuru, StartTime, EndTime);
    const reactivecount = await getReactiveInsightsCount(DevOpsGuru, StartTime, EndTime);
    return getHtmlOutput(proactivecount, reactivecount, region, event, context);
};
Figure 3: Lambda Function Source Code
Click on Deploy to save the function code
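Before wiring the function into a dashboard, you can smoke-test it from the Lambda console’s Test tab with an event like the one below; the describe path returns the documentation string without calling DevOps Guru, so it works even before the permissions are updated in the next step.
{
  "describe": true
}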
Since we used the default settings while creating the function, a default Execution role is created and associated with the function. We will need to modify the IAM role to grant DevOps Guru permissions to retrieve Proactive and Reactive insights.
Click on the Configuration tab and select Permissions from the left side option list. You can see the IAM execution role associated with the function as shown in figure 4.
Figure 4: Lambda function execution role
Click on the IAM role name to open the role in the IAM console. Click on Add Permissions and select Attach policies.
Figure 5: IAM Role Update
Search for DevOps and select the AmazonDevOpsGuruReadOnlyAccess policy. Click on Add permissions to update the IAM role.
Figure 6: IAM Role Policy Update
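If you prefer a tighter scope than the managed policy, an inline policy granting only the ListInsights action is sufficient for the sample function above; the statement below is a minimal sketch (the name you give the policy is up to you).
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "devops-guru:ListInsights",
      "Resource": "*"
    }
  ]
}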
Now that we have created the Lambda function for our custom widget and assigned appropriate permissions, we can navigate to CloudWatch to create a Dashboard.
Navigate to CloudWatch and click on Dashboards in the left-side menu. You can choose to create a new dashboard or add the widget to an existing dashboard.
We will choose to create a new dashboard.
Figure 7: Create New CloudWatch dashboard
Choose Custom Widget on the Add widget page.
Figure 8: Add widget
Click Next on the custom widget page without choosing a sample.
Figure 9: Custom Widget Selection
Choose the region where DevOps Guru is enabled. Select the Lambda function that we created earlier. In the preview pane, click on Preview to view the DevOps Guru metrics. Once the preview is successful, create the widget.
Figure 10: Create Custom Widget
Congratulations, you have now successfully created a CloudWatch dashboard with a custom widget to get insights from DevOps Guru. The sample code that we provided can be customized to suit your needs.
Using AWS CloudFormation
You may skip this step and move to the future scope section if you have already created the resources using the AWS Console.
In this step we will show you how to deploy the solution using AWS CloudFormation. AWS CloudFormation lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code. Customers define an initial template and then revise it as their requirements change. For more information on CloudFormation stack creation refer to this blog post.
The following resources are created:
Three Lambda functions that will support CloudWatch Dashboard custom widgets
A CloudWatch dashboard with widgets to pull data from the Lambda Functions
To deploy the solution by using the CloudFormation template
You can use this downloadable template to set up the resources. To launch directly through the console, choose the Launch Stack button, which creates the stack in the us-east-1 AWS Region.
Choose Next to go to the Specify stack details page.
(Optional) On the Configure Stack Options page, enter any tags, and then choose Next.
On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.
It takes approximately 2-3 minutes for the provisioning to complete. After the status is “CREATE_COMPLETE”, proceed to validate the resources as listed below.
Validate the resources
Now that the stack creation has completed successfully, you should validate the resources that were created.
On the AWS Console, head to CloudWatch; under Dashboards, there will be a dashboard created with the name <StackName-Region>.
On the AWS Console, head to CloudWatch; under Log groups, there will be three new log groups created with the following names:
lambdaProactiveLogGroup
lambdaReactiveLogGroup
lambdaSummaryLogGroup
On the AWS Console, head to Lambda; there will be three Lambda functions with the following names:
lambdaFunctionDGProactive
lambdaFunctionDGReactive
lambdaFunctionDGSummary
On the AWS Console, head to IAM; under Roles, there will be a new role created with the name “lambdaIAMRole”.
To View Results/Outcome
With the appropriate time range set on the CloudWatch dashboard, you will be able to navigate through the insights that DevOps Guru has generated.
Figure 11: DevOps Guru Insights in the CloudWatch Dashboard
Cleanup
For cost optimization, clean up the resources after you complete and test this solution. If you used the AWS Console, delete the resources manually; if you used AWS CloudFormation, delete the stack called devopsguru-cloudwatch-dashboard.
This blog post outlined how you can integrate DevOps Guru insights into a CloudWatch Dashboard. As a customer, you can start leveraging CloudWatch Custom Widgets to include DevOps Guru Insights in an existing Operational dashboard.
AWS Customers are now using Amazon DevOps Guru to monitor and improve application performance. You can start monitoring your applications by following the instructions in the product documentation. Head over to the Amazon DevOps Guru console to get started today.
To learn more about AIOps for Serverless using Amazon DevOps Guru, check out this video.
Streaming alert evaluation scales much better than the traditional approach of polling time-series databases.
It allows us to overcome high dimensionality/cardinality limitations of the time-series database.
It opens doors to support more exciting use-cases.
Engineers want their alerting system to be realtime, reliable, and actionable. While actionability is subjective and may vary by use-case, reliability is non-negotiable. In other words, false positives are bad but false negatives are the absolute worst!
A few years ago, we were paged by our SRE team due to our Metrics Alerting System falling behind — critical application health alerts reached engineers 45 minutes late! As we investigated the alerting delay, we found that the number of configured alerts had recently increased dramatically, by 5 times! The alerting system queried Atlas, our time series database, on a cron for each configured alert query, and was seeing an elevated throttle rate and excessive retries with backoffs. This, in turn, increased the time between two consecutive checks for an alert, causing a global slowdown for all alerts. On further investigation, we discovered that one user had programmatically created tens of thousands of new alerts. This user represented a platform team at Netflix, and their goal was to build alerting automation for their users.
While we were able to put out the immediate fire by disabling the newly created alerts, this incident raised some critical concerns about the scalability of our alerting system. We also heard from other platform teams at Netflix who wanted to build similar automation for their users but, given our state at the time, wouldn't have been able to do so without impacting Mean Time To Detect (MTTD) for everyone else. In fact, we were looking at an order of magnitude increase in the number of alert queries over just the next 6 months!
Since querying Atlas was the bottleneck, our first instinct was to scale it up to meet the increased alert query demand; however, we soon realized that would increase Atlas cost prohibitively. Atlas is an in-memory time-series database that ingests multiple billions of time-series per day and retains the last two weeks of data. It is already one of the largest services at Netflix both in size and cost. While Atlas is architected around compute & storage separation, and we could theoretically just scale the query layer to meet the increased query demand, every query, regardless of its type, has a data component that needs to be pushed down to the storage layer. To serve the increasing number of push down queries, the in-memory storage layer would need to scale up as well, and it became clear that this would push the already expensive storage costs far higher. Moreover, common database optimizations like caching recently queried data don’t really work for alerting queries because, generally speaking, the last received datapoint is required for correctness. Take for example, this alert query that checks if errors as a % of total RPS exceeds a threshold of 50% for 4 out of the last 5 minutes:
Say the datapoint received for the last time interval leads to a positive evaluation for this query. Relying on stale or cached data would either increase MTTD or create the perception of a false negative, at least until the missing data is fetched and evaluated. It became clear to us that we needed to solve the scalability problem with a fundamentally different approach, so we started down the path of alert evaluation via real-time streaming metrics.
High Level Architecture
The idea, at a high level, was to avoid the need to query the Atlas database almost entirely and transition most alert queries to streaming evaluation.
Alert queries are submitted either via our Alerting UI or by API clients, and are then saved to a custom config database that supports streaming config updates (full snapshot + update notifications). The Alerting Service receives these config updates and hashes every new or updated alert query to one of its nodes for evaluation by leveraging Edda Slots. The node responsible for evaluating a query starts by breaking it down into a set of "data expressions" and with them subscribes to an upstream "broker" service. Data expressions define what data needs to be sourced in order to evaluate a query. For the example query listed above, the data expressions are name,errors,:eq,:sum and name,rps,:eq,:sum. The broker service acts as a subscription manager that maps a data expression to a set of subscriptions. It also maintains a Query Index of all active data expressions, which is consulted to discern whether an incoming datapoint is of interest to an active subscriber. The internals here are outside the scope of this blog post.
Next, the Alerting service (via the atlas-eval library) maps the received data points for a data expression to the alert query that needs them. For alert queries that resolve to more than one data expression, we align the incoming data points for each one of those data expressions on the same time boundary before emitting the accumulated values to the final eval step. For the example above, the final eval step would be responsible for computing the ratio and maintaining the rolling-count, which is keeping track of the number of intervals in which the ratio crossed the threshold as shown below:
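As an illustration only (not the actual atlas-eval implementation), the final eval step described above can be sketched in a few lines of Python: keep the outcomes of the last five intervals and trigger when at least four of them breached the threshold.
from collections import deque

class RollingCountEvaluator:
    """Illustrative sketch of a '4 out of the last 5 intervals' threshold check."""

    def __init__(self, window: int = 5, required: int = 4, threshold: float = 0.5):
        self.outcomes = deque(maxlen=window)  # most recent interval results
        self.required = required
        self.threshold = threshold

    def evaluate(self, errors: float, rps: float) -> bool:
        # Compute the ratio from the two aligned data expressions, then update the rolling count
        ratio = errors / rps if rps else 0.0
        self.outcomes.append(ratio > self.threshold)
        return sum(self.outcomes) >= self.required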
The atlas-eval library supports streaming evaluation for most if not all Query, Data, Math and Stateful operators supported by Atlas today. Certain operators such as offset, integral, des are not supported on the streaming path.
OK, Results?
First and foremost, we have successfully alleviated our initial scalability problem with the polling based architecture. Today, we run 20X the number of queries we used to run a few years ago, with ease and at a fraction of what it would have cost to scale up the Atlas storage layer to serve the same volume. Multiple platform teams at Netflix programmatically generate and maintain alerts on behalf of their users without having to worry about impacting other users of the system. We are able to maintain strong SLAs around Mean Time To Detect (MTTD) regardless of the number of alerts being evaluated by the system.
Additionally, streaming evaluation allowed us to relax restrictions around high cardinality that our users were previously running into — alert queries that were rejected by Atlas Backend before due to cardinality constraints are now getting checked correctly on the streaming path. In addition, we are able to use Atlas Streaming to monitor and alert on some very high cardinality use-cases, such as metrics derived from free-form log data.
Finally, we switched Telltale, our holistic application health monitoring system, from polling a metrics cache to using realtime Atlas Streaming. The fundamental idea behind Telltale is to detect anomalies on SLI metrics (for example, latency, error rates, etc). When such anomalies are detected, Telltale is able to compute correlations with similar metrics emitted from either upstream or downstream services. In addition, it also computes correlations between SLI metrics and custom metrics like the log derived metrics mentioned above. This has proven to be valuable towards reducing Mean Time to Recover (MTTR). For example, we are able to now correlate increased error rates with increased rate of specific exceptions occurring in logs and even point to an exemplar stacktrace, as shown below:
Our logs pipeline fingerprints every log message and attaches a (very high cardinality) fingerprint tag to a log events counter that is then emitted to Atlas Streaming. Telltale consumes this metric in a streaming fashion to identify fingerprints that correlate with anomalies seen in SLI metrics. Once an anomaly is found, we query the logs backend with the fingerprint hash to obtain the exemplar stacktrace. What's more, we are now able to identify correlated anomalies (and exceptions) occurring in services that may be N hops away from the affected service. A system like Telltale becomes more effective as more services are onboarded (and, for that matter, the full service graph), because otherwise it becomes difficult to root cause a problem, especially in a microservices-based architecture. A few years ago, as noted in this blog, only about a hundred services were using Telltale; thanks to Atlas Streaming we have now managed to onboard thousands of other services at Netflix.
We also realized that once you remove limits on the number of monitored queries and start supporting much higher metric dimensionality/cardinality without impacting the cost/performance profile of the system, many exciting new possibilities open up. For example, to make alerts more actionable, we may now be able to compute correlations between SLI anomalies and custom metrics with high-cardinality dimensions; an alert on elevated HTTP error rates might point to the impacted customer cohorts by linking to precisely correlated exemplars. This would help developers with reproducibility.
Transitioning to the streaming path has been a long journey for us. One of the challenges was difficulty in debugging scenarios where the streaming path didn’t agree with what is returned by querying the Atlas database. This is especially true when either the data is not available in Atlas or the query is not supported because of (say) cardinality constraints. This is one of the reasons it has taken us years to get here. That said, early signs indicate that the streaming paradigm may help with tackling a cardinal problem in observability — effective correlation between the metrics & events verticals (logs, and potentially traces in the future), and we are excited to explore the opportunities that this presents for Observability in general.
As organizations operate day-to-day, having insights into their cloud infrastructure state can be crucial for the durability and availability of their systems. Industry research estimates[1] that downtime costs small businesses around $427 per minute of downtime, and medium to large businesses an average of $9,000 per minute of downtime. Amazon DevOps Guru customers want to monitor and generate alerts using a single dashboard. This allows them to reduce context switching between applications, providing them an opportunity to respond to operational issues faster.
DevOps Guru can integrate with Amazon Managed Grafana to create and display operational insights. Alerts can be created and communicated for any critical events captured by DevOps Guru and notifications can be sent to operation teams to respond to these events. The key telemetry data types of logs and metrics are parsed and filtered to provide the necessary insights into observability.
Furthermore, Amazon Managed Grafana provides plug-ins to popular open-source databases, third-party ISV monitoring tools, and other cloud services. With Amazon Managed Grafana, you can easily visualize information from multiple AWS services, AWS accounts, and Regions in a single Grafana dashboard.
In this post, we will walk you through integrating the insights generated from DevOps Guru with Amazon Managed Grafana.
Solution Overview:
This architecture diagram shows the flow of the logs and metrics that Amazon Managed Grafana will use: DevOps Guru insight events are routed through Amazon EventBridge into an Amazon CloudWatch log group, and Amazon Managed Grafana parses these logs together with the DevOps Guru service metrics to create new dashboards in Grafana.
Now we will walk you through how to do this and set up notifications to your operations team.
Prerequisites:
The following prerequisites are required for this walkthrough:
DevOps Guru enabled on your account, with either a CloudFormation stack or tagged resources being monitored.
Using Amazon CloudWatch Metrics
DevOps Guru sends service metrics to CloudWatch. We will use these metrics to track insights and your DevOps Guru usage; by default, the DevOps Guru service reports metrics to the AWS/DevOps-Guru namespace in CloudWatch.
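For example, you can confirm which DevOps Guru metrics are available in that namespace with a small boto3 sketch; the AWS/DevOps-Guru namespace is the documented default, and no other parameters are assumed:
import boto3

cloudwatch = boto3.client("cloudwatch")

# List the service metrics that DevOps Guru reports to the AWS/DevOps-Guru namespace
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="AWS/DevOps-Guru"):
    for metric in page["Metrics"]:
        print(metric["MetricName"], metric.get("Dimensions", []))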
First, we will provision an Amazon Managed Grafana workspace and then create a Dashboard in the workspace that uses Amazon CloudWatch as a data source.
Setting up Amazon CloudWatch Metrics
Create a Grafana workspace: navigate to Amazon Managed Grafana from the AWS console, then choose Create workspace.
For insights beyond the service metrics, such as the number of insights per service or the average for a Region or a specific AWS account, we will need to parse the event logs. These logs first need to be sent to Amazon CloudWatch Logs; this is not done by default, so we will use Amazon EventBridge to pass the logs to CloudWatch. We will go over how to set this up and how to parse these logs in Amazon Managed Grafana using CloudWatch Logs Query Syntax, and we will show a couple of examples in this post. For more details, please check out this User Guide documentation.
DevOps Guru logs include other details that can be helpful when building dashboards, such as the Region, insight severity (High, Medium, or Low), associated resources, and the DevOps Guru dashboard URL, among other things. For more information, please check out this User Guide documentation.
EventBridge offers a serverless event bus that helps you receive, filter, transform, route, and deliver events. It provides one-to-many messaging to support decoupled architectures, and it is easy to integrate with AWS services and third-party tools. Using Amazon EventBridge with DevOps Guru makes it easy to extend the solution into a ticketing system through integrations with ServiceNow, Jira, and other tools, and to set up alerting through integrations with PagerDuty, Slack, and more.
Setting up Amazon CloudWatch Logs
Let’s dive in to creating the EventBridge rule and enhance our Grafana dashboard:
a. First head to Amazon EventBridge in the AWS console.
b. Click Create rule.
Enter a rule name and description. You can leave Event bus set to the default and Rule type set to Rule with an event pattern.
c. Select AWS events or EventBridge partner events.
For Event pattern, change to Custom patterns (JSON editor) and use:
{"source": ["aws.devops-guru"]}
This filters for all events generated by DevOps Guru. You can use the same mechanism to filter specific messages, such as new insights or closed insights, and route them to a different channel. For this demonstration, let's capture all events.
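If you later want to route only new insights to a separate target, a narrower pattern along these lines should work; the detail.messageType values are an assumption based on the NEW_INSIGHT and CLOSED_INSIGHT message types used in the panel queries below:
{"source": ["aws.devops-guru"], "detail": {"messageType": ["NEW_INSIGHT"]}}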
d. Next, for Target, select AWS service.
Then choose CloudWatch log group as the target.
For the Log Group, give your group a name, such as “devops-guru”.
e. Click Create rule.
f. Navigate back to Amazon Managed Grafana. It's time to add a few more panels to our dashboard. Click Add panel, select Amazon CloudWatch, switch the query mode from metrics to CloudWatch Logs, and select the log group we created previously.
g. For the query use the following to get the number of closed insights:
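The original query is not reproduced here, but a plausible CloudWatch Logs Insights query for counting closed insights, assuming the same detail.messageType field used in the panel below, would be:
filter detail.messageType = "CLOSED_INSIGHT"
| stats count(*) as closedInsights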
Next, depending on the visualizations, you can work with the Logs and metrics data types to parse and filter the data.
i. For our fourth panel, we will add a direct link to the DevOps Guru dashboard in the AWS console.
Repeat the same process as demonstrated previously one more time with this query:
fields detail.messageType, detail.insightSeverity, detail.insightUrl
| filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"
Switch the visualization to a table when prompted on the panel.
This will give us a direct link to the DevOps Guru dashboard and help us get to the insight details and Recommendations.
Save your dashboard.
You can extend observability by sending notifications through alerts on dashboard panels that provide metrics. The alerts are triggered when a condition is met and are communicated through the Amazon SNS notification mechanism. This is our SNS notification channel setup.
The previously created notification channel is then used to communicate any alerts when a condition is met across the metrics being observed.
Cleanup
To avoid incurring future charges, delete the resources.
Navigate to EventBridge in the AWS console and delete the rule "devops-guru" created in step 4 (a-e).
Navigate to CloudWatch Logs in the AWS console and delete the log group named "devops-guru" created as a result of step 4 (a-e).
Amazon Managed Grafana: Navigate to the Amazon Managed Grafana service and delete the Grafana workspace you created in step 1.
Conclusion
In this post, we have demonstrated how to incorporate Amazon DevOps Guru insights into Amazon Managed Grafana and use Grafana as the observability tool. This allows operations teams to observe the state of their AWS resources and be notified through alarms when preset thresholds on DevOps Guru metrics and logs are crossed. You can expand on this to create other panels and dashboards specific to your needs. If you don't have DevOps Guru enabled yet, you can start monitoring your AWS applications with Amazon DevOps Guru today using this link.
Effective security incident response depends on adequate logging, as described in the AWS Security Incident Response Guide. If you have the proper logs and the ability to query them, you can respond more rapidly and effectively to security events. If a security event occurs, you can use various log sources to validate what occurred and understand the scope. Then, you can use the results of your analysis to take remediation actions. To learn more about logging best practices, see Configure service and application logging and Analyze logs, findings, and metrics centrally.
In this blog post, we will show you how to achieve an effective strategy for logging for security incident response. We will share logging options across the typical cloud application stack, log analysis options, and sample queries. AWS offers managed services, such as Amazon GuardDuty for threat detection and Amazon Detective for incident analysis. If you want to collect additional logs or perform custom analysis, then you should consider the options described in this blog post.
Selection of logs
To select the appropriate logs for security incident response, you should start with the common cloud application stack, which consists of the components and layers of your application deployed on AWS. For each component, we will describe the logging sources that you have. For each log source, we will describe why you should log it for security incident response, how to enable the logs, and what your log storage options are.
To select the logs for security incident response, first consider the following questions:
What are your compliance and regulatory requirements for logging?
Note: Make sure that you comply with the log retention requirements of compliance standards relevant to your organization, as well as your organization’s incident response strategy.
What AWS services do you commonly use?
What AWS services have access to or contain sensitive data?
What threats are most relevant to you?
Note: Performing a threat model of your cloud architectures can help you answer this question. For more information, see How to approach threat modelling.
Considering these questions can help you develop requirements for logging that will guide your selection of the following log sources.
AWS account logs
An AWS account is the first, fundamental component of an application deployed on AWS. The account is a container for your AWS resources. You create and manage your AWS resources in this account, and the account provides administrative capabilities for access and billing.
AWS CloudTrail
Within an account, each action performed is an API call. From a console sign-in to the deployment of each resource in an AWS CloudFormation stack, events are generated to provide transparency on what has occurred in the account. With AWS CloudTrail, you can log, continuously monitor, and retain account activity related to actions across supported AWS services. CloudTrail provides the event history of your account activity, including actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. CloudTrail logs API calls as three types of events:
Management events (also known as control plane operations) show management operations that are performed on resources in your account. This includes actions like creating an Amazon Simple Storage Service (Amazon S3) bucket and setting up logging.
Data events (also known as data plane operations) show the resource operations performed on or within resources in your account. These operations are often high-volume activities, such as Amazon S3 object-level API activity (for example, GetObject, DeleteObject, and PutObject API operations) and AWS Lambda function invocation activity.
Insights events capture unusual API call rate or error rate activity in your account. You must enable these events on a trail in order to capture them, and they are logged to a different folder prefix in the destination S3 bucket for your trail. Insights events provide you with information such as the type of event, the incident time period, the associated API, the error code, and statistics to help you understand and respond effectively to unusual activity.
For security investigations, CloudTrail provides context on the creation, modification, and deletion of AWS resources. Therefore, CloudTrail is one of your most important log sources for security incident response in an AWS environment. You have three primary ways to set up CloudTrail:
CloudTrail Event history — CloudTrail is enabled by default with 90-day retention of management events that you can retrieve through the CloudTrail Event history facility using the console, AWS Command line Interface (AWS CLI), or AWS SDK. You don’t need to take any action to get started using the Event history feature.
CloudTrail trail — For longer retention and visibility of data events, you need to create a CloudTrail trail and associate it with an S3 bucket and optionally with an Amazon CloudWatch log group. If you use AWS Organizations, you can create an organization trail that will log events for each account in the organization. By default, trails are multi-Region, so you don’t need to enable CloudTrail logs in each AWS Region.
AWS CloudTrail Lake — You can create a CloudTrail lake, which retains CloudTrail logs for up to seven years and provides a SQL-based querying facility. You don’t need to have a trail configured in your account to use CloudTrail Lake.
Amazon Security Lake — You can use Security Lake to ingest CloudTrail events, which include management and data events. You can further analyze these events with Amazon QuickSight or a third-party security information and event management (SIEM) tool.
AWS Config
Creating and modifying resources is an integral part of your account use. Tracking resource configuration changes made by calling the AWS API helps you review changes throughout the resource lifecycle. AWS Config provides a detailed view of the configuration of AWS resources in your account, examines the resource configurations periodically, and tracks configuration changes that were not initiated by the API. This includes how the resources are related to one another and how they were configured in the past so that you can see how configurations and relationships change over time.
You should enable AWS Config in each Region where you have resources deployed, and you should configure an S3 bucket to receive configuration history and configuration snapshot files, which contain details on the resources that AWS Config records. You can then review configuration compliance and analyze activities performed before, during, and after an event using the configuration history in S3. You should centralize AWS Config resource tracking across multiple accounts in the same organization by setting up an aggregator. You can use AWS Control Tower to automate the setup.
During a security investigation, you might want to understand how a resource configuration has changed over time. For example, you might want to investigate the changes to an S3 bucket policy before and after a security event that involves an S3 bucket. AWS Config provides a configuration history for resources that can help you track activities performed during a security event.
Operating system and application logs
To record interactions with applications, you must capture operating system (OS) and application logs, especially custom logs generated by the application development framework. OS and local application logs are relevant for security events that involve an Amazon Elastic Compute Cloud (Amazon EC2) instance. These instances could be standalone, in an auto scaling group behind a load balancer, or compute workloads for Amazon Elastic Container Service (Amazon ECS) or an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. OS logs track privileged use, processes, login events, access to directory services, and file system activity on a server. To analyze a potential compromise to an EC2 instance, you will want to review the security event logs for Windows OS and the system logs for Linux-based OS.
With the unified CloudWatch agent, you can collect metrics and logs from EC2 instances and on-premises servers. The CloudWatch agent aggregates log data into CloudWatch logs, which can then be exported to Amazon S3 for long-term retention and analyzed with a SIEM tool of your choice or Amazon Athena, as shown in Figure 1.
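As a minimal sketch, the logs section of the unified CloudWatch agent configuration for collecting a Linux system log might look like the following; the file path and log group name are placeholders you would adapt to your environment:
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/secure",
            "log_group_name": "ec2-os-security-logs",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}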
Figure 1: Aggregate OS and application logs using CloudWatch Logs
Database logs
With SQL databases, you can log transactions to help track modifications to the databases, such as additions or deletions. After an engine or system failure, you will need transaction logs to restore a database to a consistent state. Transaction logs are designed to be secure, and they require additional processing to access valuable information. It’s important that you understand data interactions during a security investigation, especially if your databases hold personally identifiable information (PII), financial and payments information, or other information subject to regulatory controls.
Network logs
The goal of logging network activity is to gain insight into the communications that traverse your network. You might need this data for a variety of reasons, such as network troubleshooting or for use in a forensic investigation of suspected malware activity within your network.
In the AWS Cloud, you can log network activity by creating a proxy that logs network traffic or by using Traffic Mirroring to send a copy of network traffic to a logging server. You can adopt cloud-native approaches to capture this type of data using Amazon Route 53 DNS query logs and Amazon VPC Flow Logs.
There are also a variety of third-party networking solutions available like Palo Alto Networks and Fortinet, so you can continue to use the network logging mechanisms that you might have used in an on-premises environment.
Route 53 DNS query logs
You can configure Amazon Route 53 to log Domain Name System (DNS) queries. These logs are categorized into two groups:
Public DNS query logging
Resolver query logging
Logging public DNS queries against domains that you have hosted in Route 53 provides query information, such as the domain or subdomain requested, date and time stamp of the request, DNS record type, Route 53 edge location that responded, and response code.
VPC Flow Logs
You can configure VPC Flow Logs for a VPC in your account to capture traffic that enters and moves around your VPC network, without the addition of instances or products. From these logs, you can review information, such as source and destination IP, ports, timestamps, protocol, account ID, and whether the traffic was accepted or rejected. For a complete list of the fields available for flow log records, see Available fields. You can create a flow log for a VPC, a subnet, or a network interface. If you create a flow log for a subnet or VPC, IP traffic going to and from each network interface in that subnet or VPC will be logged. For more details on VPC Flow Logs, see Logging IP traffic using VPC Flow Logs.
You can forward flow logs to Amazon CloudWatch Logs to create CloudWatch alarms based on metric filters. You can also forward flow logs to an S3 bucket for long-term retention and further analysis. Figure 2 demonstrates these configurations.
Figure 2: Sending VPC Flow logs to CloudWatch Logs and S3
Access logs
To identify access patterns for accessible endpoints, especially public endpoints, you should use access logs. Access logs capture detailed information about requests sent to your load balancer. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. With services built in layers behind a load balancer, unless you track the X-Forwarded-For request header, the requestor’s context is lost. Access logs help bridge that gap during investigations and analysis.
Amazon S3 server access logs
Amazon S3 server access logs are critical for tracking object-level access when you use S3 buckets to store confidential or sensitive data. You can also turn on CloudTrail to capture S3 data events. You can store access logs in S3 buckets for long-term storage for compliance purposes and to run analyses during and after an event.
Load balancing logs
Elastic Load Balancing provides access logs that capture detailed information about requests sent to load balancers. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. You can use this log to analyze traffic patterns and to troubleshoot issues.
Access logging is an optional feature of Elastic Load Balancing that is turned off by default. To enable access logs for your load balancers, see Access logs for your Application Load Balancer.
If you implement your own reverse proxy for load balancing needs, make sure that you capture the reverse proxy access logs. You can use the unified CloudWatch agent to forward the logs to CloudWatch. As with OS logs, you can export CloudWatch logs to an S3 bucket for long-term retention and analysis.
If you use an Amazon CloudFront distribution as the public endpoint for end users with load balancers as the custom origin, then load balancing access logs will represent the CloudFront distribution as the requestor, rather than the actual end user. If this information doesn’t add value to your incident handling process, then you can use CloudFront access logs as the log source that provides end user request details.
CloudFront access logs
You should enable standard logs, also known as access logs, when using CloudFront. Specify an S3 bucket where you want CloudFront to save the files.
CloudFront access logs are delivered on a best-effort basis. For information about requests made to a distribution in real time, use real-time logs that are delivered within seconds of receiving the requests. You should use real-time logs to monitor, analyze, and take action based on content delivery performance. For more details on the fields available from these logs, see the CloudFront standard log file format.
AWS WAF logs
When associated with a supported resource like a CloudFront distribution, Amazon API Gateway REST API, Application Load Balancer, AWS AppSync GraphQL API, Amazon Cognito user pool, or AWS App Runner, AWS WAF can help you monitor HTTP and HTTPS requests that are forwarded to the resource. You should configure web access control lists (ACLs) to gain fine-grained control over the requests, and enable logging for such ACLs to get detailed information about traffic that is analyzed by AWS WAF. Log information includes time of the request being received by AWS WAF from the AWS resource, details about the request, and the AWS WAF rules that the request matched. You can use this log information to monitor access patterns of public endpoints and configure rules to inspect requests in detail. For more information about AWS WAF logging, see Logging web ACL traffic.
Serverless logs
Serverless computing has become increasingly popular in the cloud-computing space. It provides on-demand compute power in a relatively short burst, meaning that cloud-based instances don’t need to be provisioned and kept around, idle, when there are no tasks to be completed. Although more and more compute tasks are being moved to serverless solutions, the need to log has not changed, but how the logs are generated has. In a serverless environment, security investigations not only benefit from logs that demonstrate the interactions and changes made by the code deployed, but that also document changes to the deployed code itself and access permissions of the Lambda execution role that is granting privileged access.
AWS Lambda
The logging of Lambda functions involves two components: how the function itself is operating, and what is happening inside the function (what your code is actually doing).
The logging of a Lambda function itself occurs through data events captured by CloudTrail. As noted earlier in this post, you must configure data events on a trail created in CloudTrail. During configuration, you will need to specify the function from which logs will be captured by your trail, and the destination S3 bucket where they will be stored. These logs contain details on the invocation of the function and help identify the IAM principals that called the Invoke API for Lambda.
AWS Lambda automatically monitors Lambda functions on your behalf and sends logs to CloudWatch. Your Lambda function comes with a CloudWatch Logs log group and a log stream for each instance of your function. The Lambda runtime environment sends details about each invocation to the log stream, and relays logs and other output from your function’s code. For more details on how to monitor Lambda functions, see Accessing Amazon CloudWatch logs for AWS Lambda.
Log analysis
For incident response, you need to be able to analyze and query your logs to validate what occurred and to understand the scope.
To begin, you can aggregate logs from various sources in S3 buckets for long-term storage, and you can integrate that data with query tools for further investigation. Logs can be exported and either parsed through directly, or ingested by another tool to help with the analysis. The following are some options that you can use to query these logs:
Amazon Athena — You can directly query CloudTrail events stored in S3 with Athena using SQL commands, specifying the LOCATION of the log files. You would generally use this approach if you have advanced queries to run, and you don’t have a SIEM. To set up Athena to query logs, you can use this open-source solution from AWS.
Amazon OpenSearch Service — OpenSearch is a distributed search and log analytics suite. Because it’s open source, it can ingest logs from more than just AWS log sources. To set this up, you can use this open-source SIEM solution from AWS.
CloudTrail Event History — Either from the console, or programmatically, you can query CloudTrail management events from the last 90-day period. This is ideal for when you have simple queries to make within the last 90 days, and you don’t need stored logs or more complex queries.
AWS CloudTrail Lake — Either from the console, or programmatically, you can query stored events in your configured CloudTrail Lake from the time of its configuration, up until the maximum storage duration of 2,557 days (7 years) from the time that you make your query. This approach allows for SQL-based queries, and it is ideal for when you need to make more complex queries against events, but don’t require the additional features of a SIEM solution.
Parse through raw JSON using the CLI — This is achieved programmatically and parsed through terminal commands. It's more of a legacy method of parsing through logs. You might choose this approach for analysis if another service or solution isn't feasible (for example, if you can't use the service due to your corporate security policy).
Third-party SIEM — A third-party SIEM might be ideal if you already have a SIEM solution on AWS or elsewhere, and you don’t need a duplicated solution elsewhere. Typically, SIEM solutions will import logs from an S3 bucket and process and index events for analysis. To learn more about SIEM options, see the SIEM solutions in the AWS Marketplace, or the AWS Security Competency Partners for a partner local to you with threat detection and incident response (TDIR) expertise.
Sample queries
In this section, we provide samples of SQL queries. Both Athena and CloudTrail Lake accept SQL queries, but the following samples have been tested for use in Athena only. This is because some samples are for VPC Flow Logs, which you can’t query from CloudTrail Lake. To query CloudTrail logs in Athena, you must first create a table definition that points to the location of your logs stored in S3. You can do this from the CloudTrail Events console by using a hyperlinked suggestion, or from the Athena console directly. Alternatively, for Athena, you can use the AWS Security Analytics Bootstrap.
For each of these queries, you might need to modify some of the fields, such as the time frame that you are investigating, the IAM entity involved, and the account and Region in scope. For example, you might want to modify the time frame based on the current time and when you believe the security event began. This often involves expanding the time frame after running additional queries and learning more about the scope and timeline.
By using partitions for tables, you can restrict the amount of data scanned by each Athena query, helping to improve performance and reduce cost. For example, you can partition your CloudTrail Athena table manually or by using partition projection. You can include the partition column (for example, the timestamp) in your queries to limit the amount of data scanned.
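For example, if your CloudTrail table is partitioned, a query that restricts the scan to one Region and one month might look like the following; the partition column names region and event_date are illustrative and depend on how you defined the table:
SELECT eventname, useridentity.arn, eventtime
FROM cloudtrail
WHERE region = 'us-east-1'
AND event_date >= '2023/01/01' AND event_date < '2023/02/01'
LIMIT 100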
Unauthorized attempts
When a security event occurs, you might want to review API calls that were attempted but failed due to the IAM principal not having access to perform the action on that resource. To discover this activity, run the following query (be sure to modify the time window first):
SELECT *
FROM cloudtrail
WHERE errorcode IN ('Client.UnauthorizedOperation','Client.InvalidPermission.NotFound','Client.OperationNotPermitted','AccessDenied')
AND useridentity.arn LIKE '%iam%'
AND eventtime >= '2023-01-01T00:00:00Z'
AND eventtime < '2023-03-01T00:00:00Z'
ORDER BY eventtime desc
This sample query can help you identify whether certain IAM principals have a significant amount of unauthorized API calls, which can indicate that an IAM principal is compromised.
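Building on the query above, a variation that counts denied calls per IAM principal (same table and error codes assumed) can make the noisiest identities easier to spot:
SELECT useridentity.arn, COUNT(*) AS denied_calls
FROM cloudtrail
WHERE errorcode IN ('Client.UnauthorizedOperation','Client.InvalidPermission.NotFound','Client.OperationNotPermitted','AccessDenied')
AND eventtime >= '2023-01-01T00:00:00Z'
AND eventtime < '2023-03-01T00:00:00Z'
GROUP BY useridentity.arn
ORDER BY denied_calls DESC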
Rejected TCP connections
During a security event, the unauthorized user that is interacting with the resources in your account is probably trying to establish persistence through the network layer. To get a list of rejected TCP connections and extract from it the day that these events occurred, run the following query:
SELECT day_of_week(date) AS day, date, interface_id, srcaddr, action, protocol
FROM vpc_flow_logs
WHERE action = 'REJECT' AND protocol = 6
LIMIT 100;
Connections over older TLS versions
You might want to see how many calls to AWS APIs were made using older versions of the TLS protocol, as part of a forensic follow-up or a discovery job after a risk analysis. You can get this data by querying CloudTrail logs.
SELECT eventSource, COUNT(*) AS numOutdatedTlsCalls
FROM cloudtrail
WHERE tlsDetails.tlsVersion IN ('TLSv1', 'TLSv1.1') AND eventTime > '2023-01-01 00:00:00'
GROUP BY eventSource
ORDER BY numOutdatedTlsCalls DESC
Filter connections from an IP
With an IP address that you’d like to investigate, as a part of your forensic analysis, you might want to see the connections made to resources in a VPC. You can obtain this information by querying VPC Flow Logs. As with the server access logs, if you’re using Athena, you will first need to create a new table.
SELECT day_of_week(date) AS day, date, srcaddr, dstaddr, action, protocol
FROM vpc_flow_logs
WHERE day >= '2023/01/01' AND day < '2023/03/01' AND srcaddr LIKE '172.50.%'
ORDER BY day DESC
LIMIT 100
Investigate user actions
If you have identified a user who has been compromised, or that you suspect has been compromised, you might want to know the API calls that they made over a specific time period. Understanding the activity of a user can help you understand the scope of impact during an incident, as well as the reach of user permissions when you design your access management strategy.
SELECT eventID, eventName, eventSource, eventTime, userIdentity.arn AS user
FROM cloudtrail
WHERE userIdentity.arn LIKE '%<username>%' AND eventTime > '2022-12-05 00:00:00' AND eventTime < '2022-12-08 00:00:00'
Conclusion
It is essential that you capture logs from various layers within your application architecture, so that you can effectively respond to a security event at various layers of the application stack. If a security event occurs, logs can help provide a clear picture of what happened and the scope of the affected resources. This post helps you build a logging strategy for security incident response by understanding what logs you want to analyze, where you want to store those logs, and how you will analyze them.
At Cloudflare, we reuse existing core systems to power multiple products, and testing these core systems is essential. In particular, we need wide and thorough visibility into the behavior of our live APIs. We want to be able to detect regressions, prevent incidents, and maintain healthy APIs. That is why we built Scout.
Scout is an automated system periodically running Python tests verifying the end to end behavior of our APIs. Scout allows us to evaluate APIs in production-like environments and thus ensures we can green light a production deployment while also monitoring the behavior of APIs in production.
Why Scout?
Before Scout, we were using an automated test system built on the Robot Framework, and that older system was limiting our testing capabilities. We could not easily match JSON responses against the keys we were looking for, and we sometimes abandoned covering certain API behaviors because it was impossible to decide which resources a given test suite would run against. Two different test suites could produce false negatives because they ran against the same account.
Regarding schema validation, only API responses were validated against a JSON schema, and tests would not fail if a response did not match the schema. Moreover, it was impossible to validate API requests.
Test suites were run in a queue, so the delay before a new feature could be assessed depended on the number of test suites ahead of it. The queue could even push newer test suites to run the following day, so we often ended up with a mismatch between test and API versions. Test steps could not be run in parallel either.
We also could not split test suites between different environments. If a new API feature was being developed, it was impossible to write a test for it before the feature was released to production.
We built Scout to overcome all these difficulties. We wanted the developer experience to be easy and we wanted Scout to be fast and reliable while spotting any live API issue.
A Scout test example
Scout is built in Python and leverages the functionalities of Pytest. Before diving into the exact capabilities of Scout and its architecture, let’s have a quick look at how to use it!
Following is an example of a Scout test on the Rulesets API (the docs are available here):
from scout import requires, validate, Account, Zone
@validate(schema="rulesets", ignorePaths=["accounts/[^/]+/rules/lists"])
@requires(
account=Account(
entitlements={"rulesets.max_rules_per_ruleset": 2),
zone=Zone(plan="ENT",
entitlements={"rulesets.firewall_custom_phase_allowed": True},
account_entitlements={"rulesets.max_rules_per_ruleset": 2 }))
class TestZone:
def test_create_custom_ruleset(self, cfapi):
response = cfapi.zone.request(
"POST",
"rulesets",
payload=f"""{{
"name": "My zone ruleset",
"description": "My ruleset description",
"phase": "http_request_firewall_custom",
"kind": "zone",
"rules": [
{{
"description": "My rule",
"action": "block",
"expression": "http.host eq \"fake.net\""
}}
]
}}""")
response.expect_json_success(
200,
result=f"""{{
"name": "My zone ruleset",
"version": "1",
"source": "firewall_custom",
"phase": "http_request_firewall_custom",
"kind": "zone",
"rules": [
{{
"description": "My rule",
"action": "block",
"expression": "http.host eq \"fake.net\"",
"enabled": true,
...
}}
],
...
}}""")
A Scout test is a succession of request/response round trips against a given API. We use Pytest fixtures and marks to target specific resources while validating the requests and responses. Pytest marks in Scout attach an extra set of information to test suites, while Pytest fixtures are contexts with information and methods that can be used across tests to enhance their capabilities. The combination of marks and fixtures allows Scout to build the whole harness required to run a test suite against our APIs.
Being able to exactly describe the resources against which a given test will run provides us confidence the live API behaves as expected under various conditions.
The cfapi fixture provides the capability to target different resources such as a Cloudflare account or a zone. In the test above, we use a Pytest mark @requires to describe the characteristics of the resources we want, e.g. we need here an account with a flag allowing us to have 2 rules for a ruleset. This will allow the test to only be run against accounts with such entitlements.
The @validate mark provides the capability to validate requests and responses to a given OpenAPI schema (here the rulesets OpenAPI schema). Any validation failure will be reported and flagged as a test failure.
Regarding the actual requests and responses, their payloads are described as f-strings, in particular the response f-string can be written as a “semi-json”:
Among many test assertions possible, Scout can assert the validity of a partial json response and it will log the information. We added the handling of ellipsis … as an indication for Scout not to care about any further fields at a given json nesting level. Hence, we are able to do partial matching on JSON API responses, thus focusing only on what matters the most in each test.
Once a test suite run is complete, the results are pushed by the service and stored using Cloudflare Workers KV. They are displayed via a Cloudflare Worker.
Scout is run in separate environments such as production-like and production environments. It is part of our deployment process to verify Scout is green in our production-like environment prior to deploying to production where Scout is also used for monitoring purposes.
How we built it
The core of Scout is written in Python and it is a combination of three components interacting together:
The Scout plugin: a Pytest plugin to write tests easily
The Scout service: a scheduler service to run the test suites periodically
The Scout Worker: a collector and presenter of test reports
The Scout plugin
This is the core component of the Scout system. It allows us to write self-explanatory tests while ensuring a high level of compliance against OpenAPI schemas and verifying the APIs' behaviors.
The Scout plugin architecture can be split into three components: setup, resource allocator, and runners. Setup is a conjunction of multiple sub components in charge of setting up the plugin.
The Registry contains all the information regarding the pool of accounts and zones we use for testing. For example, entitlements are flags gating customers' access to product features; the Registry can describe entitlements per account and zone so that Scout can run a test against a specific setup.
As explained earlier, Scout can validate requests and responses against OpenAPI schemas. This is the responsibility of validators. A validator is built per OpenAPI schema and can be selected via the @validate mark we saw above.
As soon as a validator is selected, all the interaction of a given test with an API will be validated. If there is a validation failure, it will be marked as a test failure.
Last element of the setup, the config reader. It is the sub component in charge of providing all the URLs and authentication elements required for the Scout plugin to communicate with APIs.
Next in the chain, the resources allocator. This component is in charge of consuming the configuration and objects of the setup to build multiple runners. This is a factory which will make available the runners in the cfapi fixture.
When a line of code such as the response = cfapi.zone.request(...) call from the earlier example is processed, it is the request method of the zone runner allocated for the test that is actually executed. The resources allocator is able to provide specialized runners (account, zone, or default), which makes it possible to target specific API endpoints for a given account or zone.
Runners are in charge of handling the execution of requests, managing the test expectations and using the validators for request/response schema validation.
Any failure on expectation or validation and any exceptions are recorded in the stash. The stash is shared across all runners. As such, when a test setup, run or cleanup is processed, the timeline of execution and potential retries are logged in the stash. The stash contents are later used for building the test suite reports.
Scout is able to run multiple test steps in parallel. Actually, each resource couple (Account Runner, Zone Runner) is associated with a Pytest-xdist worker which runs test steps independently. There can be as many workers as there are resource couples. An extra “default” runner is provided for reaching our different APIs and/or URLs with or without authentication.
Testing a test system was not the easiest part. We had to build a fake API and assert that the Scout plugin behaves as it should in different situations. We reached and maintained a level of test coverage we considered good (close to 90%) before relying on the Scout plugin permanently.
The Scout service
The Scout service is meant to schedule test suites periodically. It is a configurable scheduler providing a reporting harness for the test suites as well as multiple metrics. It was a design decision to build a scheduler instead of using cron jobs.
We wanted to be aware of any scheduling issue as well as run issues. For this we used Prometheus metrics. The problem is that the Prometheus default configuration is to scrape metrics advertised by services. This scraping happens periodically and we were concerned about the eventuality of missing metrics if a cron job was to finish prior to the next Prometheus metrics scraping. As such we decided a small scheduler was better suited for overall observability of the test runs. Among the metrics the Scout service provides are network failures, general test failures, reporting failures, tests lagging and more.
The Scout service runs threads on configured periods. Each thread is a test suite run as a separate Pytest with Scout plugin process followed by a reporting execution consuming the results and publishing them to the relevant parties.
The reporting component provided to each thread publishes the report to Workers KV and notifies us on chat if there is a failure. Reporting also takes care of publishing the information needed to build API testing coverage. It is mandatory for us to cover all API endpoints and their possible methods so that we maintain wide and thorough visibility of our live APIs.
As a fallback, if there is any thread, test, or reporting failure, we are alerted based on the Prometheus metrics updated throughout the service execution. The logs of the Scout service, as well as the logs of each Pytest-Scout plugin execution, provide last-resort information if no metrics are available and reporting is failing.
The service can be deployed with a minimal YAML configuration and be set up for different environments. We can for example decide to run different test suites based on the environment, publish or not to Cloudflare Workers, set different periods and retry mechanisms and so on.
We keep the tests as part of our code base alongside the configuration of the Scout service, and that’s about it, the Scout service is a separate entity.
The Scout Worker
The Scout Worker is a Cloudflare Worker in charge of fetching the most recent entries from Workers KV and displaying them in an eye-pleasing manner. The Scout service publishes each test report as JSON, so the Scout Worker parses the report and displays its content based on the status of the test suite run.
For example, we present below an authentication failure during a test which resulted in such a display in the worker:
What does Scout let us do
Through leveraging the capabilities of Pytest and Cloudflare Workers, we have been able to build a configurable, robust, and reliable system which allows us to easily write self-explanatory tests for our APIs.
We can validate requests and responses against OpenAPI schemas and test behaviors over specific resources while getting alerted through multiple means if something goes wrong.
For specific use cases, we can write a test verifying that the API behaves as it should, that the configuration pushed to the edge is valid, and that a given zone reacts as it should to security threats, thus going beyond a plain end-to-end API test.
Scout quickly became our permanent live tester and monitor of APIs. We wrote tests for all endpoints to maintain a wide coverage of all our APIs. Scout has since been used for verifying an API version prior to its deployment to production. In fact, after a deployment in a production-like environment we can know in a couple of minutes if a new feature is good to go to production and assess if it is behaving correctly.
We hope you enjoyed this deep dive description into one of our systems!
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to use SQL for analyzing structured and semi-structured data with best price performance along with secure access to the data.
As more users start querying data in a data warehouse, access control is paramount to protect valuable organizational data. Database administrators want to continuously monitor and manage user privileges to maintain proper data access in the data warehouse. Amazon Redshift provides granular access control on the database, schema, table, column, row, and other database objects by granting privileges to roles, groups, and users from a SQL interface. To monitor privileges configured in Amazon Redshift, you can retrieve them by querying system tables.
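For example, you can list relation-level privileges by querying the SVV_RELATION_PRIVILEGES system view. The following is a sketch; the identity filter is a placeholder, and the column list may need to be adjusted against the view's documented schema:
SELECT namespace_name, relation_name, identity_name, identity_type, privilege_type
FROM svv_relation_privileges
WHERE identity_name = 'hr1';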
Although Amazon Redshift provides broad capabilities for managing access to database objects, we have heard from customers that they want to visualize and monitor privileges without using a SQL interface. In this post, we introduce predefined Grafana dashboards that visualize database privileges without writing SQL. These dashboards help database administrators reduce the time spent on database administration and increase the frequency of monitoring cycles.
This post focuses on database access, which relates to user access control against database objects. For more information, see Managing database security.
Amazon Redshift uses the GRANT command to define permissions in the database. For most database objects, GRANT takes three parameters:
Identity – The entity you grant access to. This could be a user, role, or group.
Object – The type of database object. This could be a database, schema, table or view, column, row, function, procedure, language, datashare, machine leaning (ML) model, and more.
Privilege – The type of operation. Examples include CREATE, SELECT, ALTER, DROP, DELETE, and INSERT. The level of privilege depends on the object.
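For example, creating a role, granting it read access to a table, and then assigning that role to a user follows this pattern; the role, table, and user names here are placeholders:
CREATE ROLE sales_read;
GRANT SELECT ON TABLE sales TO ROLE sales_read;
GRANT ROLE sales_read TO hr1;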
Generally, database administrators monitor and review identities, objects, and privileges periodically to ensure proper access is configured. They also need to investigate access configurations when database users encounter permission errors. These tasks require a SQL interface to query multiple system tables, which can be a repetitive and undifferentiated operation. Therefore, database administrators need a single pane of glass to quickly navigate through identities, objects, and privileges without writing SQL.
Solution overview
The following diagram illustrates the solution architecture and its key components:
Amazon Redshift contains database privilege information in system tables.
Grafana provides a predefined dashboard to visualize database privileges. The dashboard runs queries against the Amazon Redshift system table via the Amazon Redshift Data API.
Note that the dashboards focus on visualization; a SQL interface is still required to configure privileges in Amazon Redshift. You can use query editor v2, a web-based SQL interface that enables users to run SQL commands from a browser.
Prerequisites
Before moving to the next section, you should have the following prerequisites:
While Amazon Managed Grafana controls the plugin version and updates it periodically, local Grafana lets you control the version yourself. Therefore, local Grafana could be an option if you need earlier access to the latest features. Refer to the plugin changelog for released features and versions.
Import the dashboards
After you have finished the prerequisites, you should have access to Grafana configured with Amazon Redshift as a data source. Next, import two dashboards for visualization.
In the Grafana console, go to the Redshift data source you created and choose Dashboards.
Import the Amazon Redshift Identities and Objects dashboard.
Go to the data source again and import the Amazon Redshift Privileges dashboard.
Each dashboard will appear once imported.
Amazon Redshift Identities and Objects dashboard
The Amazon Redshift Identities and Objects dashboard shows identities and database objects in Amazon Redshift, as shown in the following screenshot.
The Identities section shows the detail of each user, role, and group in the source database.
One of the key features in this dashboard is the Role assigned to Role, User section, which uses a node graph panel to visualize the hierarchical structure of roles and users from multiple system tables. This visualization helps administrators quickly examine which roles are inherited by which users instead of querying multiple system tables. For more information about role-based access, refer to Role-based access control (RBAC).
Amazon Redshift Privileges dashboard
The Amazon Redshift Privileges dashboard shows privileges defined in Amazon Redshift.
In the Role and Group assigned to User section, open the Role assigned to User panel to list the roles for a specific user. In this panel, you can list and compare roles assigned to multiple users. Use the User drop-down at the top of the dashboard to select users.
The dashboard refreshes immediately and shows the filtered result for the selected users. The following screenshot shows the filtered result for the users hr1, hr2, and it3.
The Object Privileges section shows the privileges granted for each database object and identity. Note that objects with no privileges granted are not listed here. To show the full list of database objects, use the Amazon Redshift Identities and Objects dashboard.
The Object Privileges (RLS) section contains visualizations for row-level security (RLS). The Policy attachments panel enables you to examine the RLS configuration by visualizing the relationships between tables, policies, roles, and users.
Conclusion
In this post, we introduced a visualization for database privileges of Amazon Redshift using predefined Grafana dashboards. Database administrators can use these dashboards to quickly navigate through identities, objects, and privileges without writing SQL. You can also customize the dashboard to meet your business requirements. The JSON definition file of this dashboard is maintained as part of OSS in the Redshift data source for Grafana GitHub repository.
For more information about the topics described in this post, refer to the following:
Yota Hamaoka is an Analytics Solution Architect at Amazon Web Services. He is focused on driving customers to accelerate their analytics journey with Amazon Redshift.
We use Prometheus as our core monitoring system. We’ve been heavy Prometheus users since 2017 when we migrated off our previous monitoring system which used a customized Nagios setup. Despite growing our infrastructure a lot, adding tons of new products and learning some hard lessons about operating Prometheus at scale, our original architecture of Prometheus (see Monitoring Cloudflare’s Planet-Scale Edge Network with Prometheus for an in depth walk through) remains virtually unchanged, proving that Prometheus is a solid foundation for building observability into your services.
One of the key responsibilities of Prometheus is to alert us when something goes wrong. In this blog post we’ll talk about how we make those alerts more reliable, introduce an open source tool we’ve developed to help us with that, and share how you can use it too. If you’re not familiar with Prometheus you might want to start by watching this video to better understand the topic we’ll be covering here.
Prometheus works by collecting metrics from our services and storing those metrics inside its database, called TSDB. We can then query these metrics using the Prometheus query language, PromQL, either with ad-hoc queries (for example, to power Grafana dashboards) or via alerting and recording rules. A rule is basically a query that Prometheus runs for us in a loop, and when that query returns any results it will either be recorded as new metrics (with recording rules) or trigger alerts (with alerting rules).
Prometheus alerts
Since we’re talking about improving our alerting we’ll be focusing on alerting rules.
To create alerts we first need to have some metrics collected. For the purposes of this blog post let’s assume we’re working with http_requests_total metric, which is used on the examples page. Here are some examples of how our metrics will look:
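As a rough illustration, the individual time series could look something like the lines below – the label names match the ones used in the rules later in this post (method, status, instance), but the job label, instance addresses, and sample values are made up for this example:

http_requests_total{job="myserver", instance="10.0.0.1:8080", method="GET", status="200"} 1593
http_requests_total{job="myserver", instance="10.0.0.1:8080", method="GET", status="500"} 57
http_requests_total{job="myserver", instance="10.0.0.2:8080", method="POST", status="200"} 841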
Let’s say we want to alert if our HTTP server is returning errors to customers.
All we need to do is check the metric that tracks how many responses with HTTP status code 500 there were, so a simple alerting rule could look like this:
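A minimal sketch of such a rule file – the group and alert names here are our own choice, but the expression is the one discussed below:

groups:
  - name: example
    rules:
      - alert: Serving HTTP 500 errors
        expr: http_requests_total{status="500"} > 0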
This will alert us if we have any 500 errors served to our customers. Prometheus will run our query looking for a time series named http_requests_total that also has a status label with value “500”. Then it will filter all those matched time series and only return ones with value greater than zero.
If our alert rule returns any results an alert will fire, one for each returned result.
If our rule doesn’t return anything, meaning there are no matched time series, then the alert will not trigger.
The whole flow from metric to alert is pretty simple here as we can see on the diagram below.
If we want to provide more information in the alert we can do so by setting additional labels and annotations, but the alert and expr fields are all we need to get a working rule.
But the problem with the above rule is that our alert fires as soon as we have our first error, and then it never goes away.
After all, our http_requests_total is a counter, so it gets incremented every time there’s a new request, which means that it will keep growing as we receive more requests. What this means for us is that our alert is really telling us “was there ever a 500 error?” and even if we fix the problem causing 500 errors we’ll keep getting this alert.
A better alert would be one that tells us if we’re serving errors right now.
For that we can use the rate() function to calculate the per second rate of errors.
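A sketch of the improved rule, again with an alert name of our own choosing and the two-minute window described below:

- alert: Serving HTTP 500 errors
  expr: rate(http_requests_total{status="500"}[2m]) > 0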
The query above will calculate the rate of 500 errors in the last two minutes. If we start responding with errors to customers our alert will fire, but once errors stop so will this alert.
This is great because if the underlying issue is resolved the alert will resolve too.
We can improve our alert further by, for example, alerting on the percentage of errors, rather than absolute numbers, or even calculate error budget, but let’s stop here for now.
It’s all very simple, so what do we mean when we talk about improving the reliability of alerting? What could go wrong here?
What could go wrong?
We can craft a valid YAML file with a rule definition that has a perfectly valid query that will simply not work how we expect it to work. Which, when it comes to alerting rules, might mean that the alert we rely upon to tell us when something is not working correctly will fail to alert us when it should. To better understand why that might happen let’s first explain how querying works in Prometheus.
Prometheus querying basics
There are two basic types of queries we can run against Prometheus. The first one is an instant query. It allows us to ask Prometheus for a point in time value of some time series. If we write our query as http_requests_total we’ll get all time series named http_requests_total along with the most recent value for each of them. We can further customize the query and filter results by adding label matchers, like http_requests_total{status="500"}.
Let’s consider we have two instances of our server, green and red, each one scraped (Prometheus collects metrics from it) every minute, independently of the other.
This is what happens when we issue an instant query:
There’s obviously more to it as we can use functions and build complex queries that utilize multiple metrics in one expression. But for the purposes of this blog post we’ll stop here.
The important thing to know about instant queries is that they return the most recent value of a matched time series, and they will look back for up to five minutes (by default) into the past to find it. If the last value is older than five minutes then it’s considered stale and Prometheus won’t return it anymore.
The second type of query is a range query – it works similarly to instant queries, the difference is that instead of returning us the most recent value it gives us a list of values from the selected time range. That time range is always relative so instead of providing two timestamps we provide a range, like “20 minutes”. When we ask for a range query with a 20 minutes range it will return us all values collected for matching time series from 20 minutes ago until now.
An important distinction between those two types of queries is that range queries don’t have the same “look back for up to five minutes” behavior as instant queries. If Prometheus cannot find any values collected in the provided time range then it doesn’t return anything.
If we modify our example to request [3m] range query we should expect Prometheus to return three data points for each time series:
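Conceptually, the result would look something like the listing below – the instance names follow the green/red example above, while the sample values and relative timestamps are purely illustrative:

http_requests_total{instance="green"} =>
  97 @ t-2m
  98 @ t-1m
  99 @ t

http_requests_total{instance="red"} =>
  200 @ t-2m
  201 @ t-1m
  202 @ t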
When queries don’t return anything
Knowing a bit more about how queries work in Prometheus we can go back to our alerting rules and spot a potential problem: queries that don’t return anything.
If our query doesn’t match any time series or if they’re considered stale then Prometheus will return an empty result. This might be because we’ve made a typo in the metric name or label filter, the metric we ask for is no longer being exported, or it was never there in the first place, or we’ve added some condition that wasn’t satisfied, like requiring the value to be non-zero in our http_requests_total{status="500"} > 0 example.
Prometheus will not return any error in any of the scenarios above because none of them are really problems, it’s just how querying works. If you ask for something that doesn’t match your query then you get empty results. This means that there’s no distinction between “all systems are operational” and “you’ve made a typo in your query”. So if you’re not receiving any alerts from your service it’s either a sign that everything is working fine, or that you’ve made a typo, and you have no working monitoring at all, and it’s up to you to verify which one it is.
For example, we could be trying to query for http_requests_totals instead of http_requests_total (an extra “s” at the end) and although our query will look fine it won’t ever produce any alert.
Range queries can add another twist – they’re mostly used in Prometheus functions like rate(), which we used in our example. This function will only work correctly if it receives a range query expression that returns at least two data points for each time series, after all it’s impossible to calculate rate from a single number.
Since the number of data points depends on the time range we passed to the range query, which we then pass to our rate() function, if we provide a time range that only contains a single value then rate won’t be able to calculate anything and once again we’ll return empty results.
The number of values collected in a given time range depends on the interval at which Prometheus collects all metrics, so to use rate() correctly you need to know how your Prometheus server is configured. You can read more about this here and here if you want to better understand how rate() works in Prometheus.
For example if we collect our metrics every one minute then a range query http_requests_total[1m] will be able to find only one data point. Here’s a reminder of how this looks:
Since, as we mentioned before, we can only calculate rate() if we have at least two data points, calling rate(http_requests_total[1m]) will never return anything and so our alerts will never work.
There are more potential problems we can run into when writing Prometheus queries; for example, any operations between two metrics will only work if both have the same set of labels, which you can read about here. But we’ll stop here, since listing all the gotchas could take a while. The point to remember is simple: if your alerting query doesn’t return anything then it might be that everything is ok and there’s no need to alert, but it might also be that you’ve mistyped your metric name, your label filter cannot match anything, your metric disappeared from Prometheus, you are using too small a time range for your range queries, etc.
Renaming metrics can be dangerous
We’ve been running Prometheus for a few years now and during that time we’ve grown our collection of alerting rules a lot. Plus we keep adding new products or modifying existing ones, which often includes adding and removing metrics, or modifying existing metrics, which may include renaming them or changing what labels are present on these metrics.
A lot of metrics come from metrics exporters maintained by the Prometheus community, like node_exporter, which we use to gather some operating system metrics from all of our servers. Those exporters also undergo changes which might mean that some metrics are deprecated and removed, or simply renamed.
A problem we’ve run into a few times is that sometimes our alerting rules wouldn’t be updated after such a change, for example when we upgraded node_exporter across our fleet. Or the addition of a new label on some metrics would suddenly cause Prometheus to no longer return anything for some of the alerting queries we have, making such an alerting rule no longer useful.
It’s worth noting that Prometheus does have a way of unit testing rules, but since it works on mocked data it’s mostly useful to validate the logic of a query. Unit testing won’t tell us if, for example, a metric we rely on suddenly disappeared from Prometheus.
Chaining rules
When writing alerting rules we try to limit alert fatigue by ensuring that, among many things, alerts are only generated when there’s an action needed, they clearly describe the problem that needs addressing, they have a link to a runbook and a dashboard, and finally that we aggregate them as much as possible. This means that a lot of the alerts we have won’t trigger for each individual instance of a service that’s affected, but rather once per data center or even globally.
For example, we might alert if the rate of HTTP errors in a datacenter is above 1% of all requests. To do that we first need to calculate the overall rate of errors across all instances of our server. For that we would use a recording rule:
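Based on the recording rule names and expressions that appear in the pint output later in this post, the two recording rules would look roughly like this (the rule group name is our own choice):

groups:
  - name: recording rules
    rules:
      - record: job:http_requests_total:rate2m
        expr: sum(rate(http_requests_total[2m])) without(method, status, instance)
      - record: job:http_requests_status500:rate2m
        expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)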
The first rule tells Prometheus to calculate the per-second rate of all requests and sum it across all instances of our server. The second rule does the same but only sums time series with a status label equal to "500". Both rules produce new metrics named after the value of the record field.
Now we can modify our alert rule to use those new metrics we’re generating with our recording rules:
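Using the expression that pint reports for this alert later in this post (with the 1% threshold from the example above), the modified rule might look like this – the alert name is ours:

- alert: High HTTP 500 error rate
  expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01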
If we have a data center wide problem then we will raise just one alert, rather than one per instance of our server, which can be a great quality of life improvement for our on-call engineers.
But at the same time we’ve added two new rules that we need to maintain and ensure they produce results. To make things more complicated we could have recording rules producing metrics based on other recording rules, and then we have even more rules that we need to ensure are working correctly.
What if all those rules in our chain are maintained by different teams? What if the rule in the middle of the chain suddenly gets renamed because that’s needed by one of the teams? Problems like that can easily crop up now and then if your environment is sufficiently complex, and when they do, they’re not always obvious, after all the only sign that something stopped working is, well, silence – your alerts no longer trigger. If you’re lucky you’re plotting your metrics on a dashboard somewhere and hopefully someone will notice if they become empty, but it’s risky to rely on this.
We definitely felt that we needed something better than hope.
Introducing pint: a Prometheus rule linter
To avoid running into such problems in the future we’ve decided to write a tool that would help us do a better job of testing our alerting rules against live Prometheus servers, so we can spot missing metrics or typos easier. We also wanted to allow new engineers, who might not necessarily have all the in-depth knowledge of how Prometheus works, to be able to write rules with confidence without having to get feedback from more experienced team members.
Since we believe that such a tool will have value for the entire Prometheus community we’ve open-sourced it, and it’s available for anyone to use – say hello to pint!
You can run it against one or more files with Prometheus rules
It can run as a part of your CI pipeline
Or you can deploy it as a side-car to all your Prometheus servers
It doesn’t require any configuration to run, but in most cases it will provide the most value if you create a configuration file for it and define some Prometheus servers it should use to validate all rules against. Running without any configured Prometheus servers will limit it to static analysis of all the rules, which can identify a range of problems, but won’t tell you if your rules are trying to query non-existent metrics.
First mode is where pint reads a file (or a directory containing multiple files), parses it, does all the basic syntax checks and then runs a series of checks for all Prometheus rules in those files.
Second mode is optimized for validating git based pull requests. Instead of testing all rules from all files pint will only test rules that were modified and report only problems affecting modified lines.
Third mode is where pint runs as a daemon and tests all rules on a regular basis. If it detects any problem it will expose those problems as metrics. You can then collect those metrics using Prometheus and alert on them as you would for any other problems. This way you can basically use Prometheus to monitor itself.
What kind of checks can it run for us and what kind of problems can it detect?
All the checks are documented here, along with some tips on how to deal with any detected problems. Let’s cover the most important ones briefly.
As mentioned above the main motivation was to catch rules that try to query metrics that are missing or when the query was simply mistyped. To do that pint will run each query from every alerting and recording rule to see if it returns any result, if it doesn’t then it will break down this query to identify all individual metrics and check for the existence of each of them. If any of them is missing or if the query tries to filter using labels that aren’t present on any time series for a given metric then it will report that back to us.
So if someone tries to add a new alerting rule with http_requests_totals typo in it, pint will detect that when running CI checks on the pull request and stop it from being merged. Which takes care of validating rules as they are being added to our configuration management system.
Another useful check will try to estimate the number of times a given alerting rule would trigger an alert. Which is useful when raising a pull request that’s adding new alerting rules – nobody wants to be flooded with alerts from a rule that’s too sensitive so having this information on a pull request allows us to spot rules that could lead to alert fatigue.
Similarly, another check will provide information on how many new time series a recording rule adds to Prometheus. In our setup a single unique time series uses, on average, 4KiB of memory. So if a recording rule generates 10 thousand new time series it will increase Prometheus server memory usage by 10000*4KiB=40MiB. 40 megabytes might not sound like much, but our peak time series usage in the last year was around 30 million time series in a single Prometheus server, so we pay attention to anything that might add a substantial amount of new time series, which pint helps us notice before such a rule gets added to Prometheus.
On top of all the Prometheus query checks, pint allows us also to ensure that all the alerting rules comply with some policies we’ve set for ourselves. For example, we require everyone to write a runbook for their alerts and link to it in the alerting rule using annotations.
We also require all alerts to have priority labels, so that high priority alerts are generating pages for responsible teams, while low priority ones are only routed to karma dashboard or create tickets using jiralert. It’s easy to forget about one of these required fields and that’s not something which can be enforced using unit testing, but pint allows us to do that with a few configuration lines.
With pint running on all stages of our Prometheus rule life cycle, from initial pull request to monitoring rules deployed in our many data centers, we can rely on our Prometheus alerting rules to always work and notify us of any incident, large or small.
Our rule now passes the most basic checks, so we know it’s valid. But to know if it works with a real Prometheus server we need to tell pint how to talk to Prometheus. For that we’ll need a config file that defines a Prometheus server we test our rule against, it should be the same server we’re planning to deploy our rule to. Here we’ll be using a test instance running on localhost. Let’s create a “pint.hcl” file and define our Prometheus server there:
prometheus "prom1" {
uri = "http://localhost:9090"
timeout = "1m"
}
Now we can re-run our check using this configuration file:
$ pint -c pint.hcl lint rules.yml
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:5: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
expr: sum(rate(http_requests_total[2m])) without(method, status, instance)
rules.yml:8: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)
level=info msg="Problems found" Bug=2
level=fatal msg="Execution completed with error(s)" error="problems found"
Yikes! It’s a test Prometheus instance, and we forgot to collect any metrics from it.
Let’s fix that by starting our server locally on port 8080 and configuring Prometheus to collect metrics from it:
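A minimal scrape configuration for such a local test setup might look something like this – the job name is our own choice and the target assumes the server from this example exposes metrics on localhost:8080:

scrape_configs:
  - job_name: myserver
    static_configs:
      - targets: ["localhost:8080"]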
$ pint -c pint.hcl lint rules.yml
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_total:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01
level=info msg="Problems found" Information=2
It all works according to pint, so now we can safely deploy our new rules file to Prometheus.
Notice that pint recognised that both metrics used in our alert come from recording rules, which aren’t yet added to Prometheus, so there’s no point querying Prometheus to verify if they exist there.
Now what happens if we deploy a new version of our server that renames the “status” label to something else, like “code”?
$ pint -c pint.hcl lint rules.yml
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:8: prometheus "prom1" at http://localhost:9090 has "http_requests_total" metric but there are no series with "status" label in the last 1w (promql/series)
expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01
level=info msg="Problems found" Bug=1 Information=1
level=fatal msg="Execution completed with error(s)" error="problems found"
Luckily pint will notice this and report it, so we can adapt our rule to match the new name.
But what if that happens after we deploy our rule? For that we can use the “pint watch” command that runs pint as a daemon periodically checking all rules.
Please note that validating all metrics used in a query will eventually produce some false positives. In our example, metrics with the status="500" label might not be exported by our server until there’s at least one request ending in an HTTP 500 error.
The promql/series check responsible for validating the presence of all metrics has some documentation on how to deal with this problem. In most cases you’ll want to add a comment that instructs pint to ignore some missing metrics entirely or to stop checking label values (only check that there’s a "status" label present, without checking whether there are time series with status="500").
Summary
Prometheus metrics don’t follow any strict schema, whatever services expose will be collected. At the same time a lot of problems with queries hide behind empty results, which makes noticing these problems non-trivial.
We use pint to find such problems and report them to engineers, so that our global network is always monitored correctly, and we have confidence that lack of alerts proves how reliable our infrastructure is.
Grafana is a rich interactive open-source tool by Grafana Labs for visualizing data across one or many data sources. It’s used in a variety of modern monitoring stacks, allowing you to have a common technical base and apply common monitoring practices across different systems. Amazon Managed Grafana is a fully managed, scalable, and secure Grafana-as-a-service solution developed by AWS in collaboration with Grafana Labs.
Amazon Redshift is the most widely used data warehouse in the cloud. You can view your Amazon Redshift cluster’s operational metrics on the Amazon Redshift console, use AWS CloudWatch, and query Amazon Redshift system tables directly from your cluster. The first two options provide a set of predefined general metrics and visualizations. The last one allows you to use the flexibility of SQL to get deep insights into the details of the workload. However, querying system tables requires knowledge of system table structures. To address that, we came up with a consolidated Amazon Redshift Grafana dashboard that visualizes a set of curated operational metrics and works on top of the Amazon Redshift Grafana data source. You can easily add it to an Amazon Managed Grafana workspace, as well as to any other Grafana deployments where the data source is installed.
This post guides you through a step-by-step process to create an Amazon Managed Grafana workspace and configure an Amazon Redshift cluster with a Grafana data source for it. Lastly, we show you how to set up the Amazon Redshift Grafana dashboard to visualize the cluster metrics.
Solution overview
The following diagram illustrates the solution architecture.
The solution includes the following components:
The Amazon Redshift cluster to get the metrics from.
Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data Service API.
The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.
We walk you through the following steps during the configuration process:
Configure an Amazon Redshift cluster.
Create a database user for Amazon Managed Grafana on the cluster.
Configure a user in AWS Single Sign-On (AWS SSO) for Amazon Managed Grafana UI access.
Configure an Amazon Managed Grafana workspace and sign in to Grafana.
Set up Amazon Redshift as the data source in Grafana.
Import the Amazon Redshift dashboard supplied with the data source.
Prerequisites
To follow along with this walkthrough, you should have the following prerequisites:
Familiarity with the basic concepts of the following services:
Amazon Redshift
Amazon Managed Grafana
AWS SSO
Configure an Amazon Redshift cluster
If you don’t have an Amazon Redshift cluster, create a sample cluster before proceeding with the following steps. For this post, we assume that the cluster identifier is called redshift-demo-cluster-1 and the admin user name is awsuser.
On the Amazon Redshift console, choose Clusters in the navigation pane.
Choose your cluster.
Choose the Properties tab.
To make the cluster discoverable by Amazon Managed Grafana, you must add a special tag to it.
Choose Add tags.
For Key, enter GrafanaDataSource.
For Value, enter true.
Choose Save changes.
Create a database user for Amazon Managed Grafana
Grafana will be directly querying the cluster, and it requires a database user to connect to the cluster. In this step, we create the user redshift_data_api_user and apply some security best practices.
On the cluster details page, choose Query data and Query in query editor v2.
Choose the redshift-demo-cluster-1 cluster we created previously.
For Database, enter the default dev.
Enter the user name and password that you used to create the cluster.
Choose Create connection.
In the query editor, enter the following statements and choose Run:
CREATE USER redshift_data_api_user PASSWORD '<password>' CREATEUSER;
ALTER USER redshift_data_api_user SET readonly TO TRUE;
ALTER USER redshift_data_api_user SET query_group TO 'superuser';
The first statement creates a user with superuser privileges necessary to access system tables and views (make sure to use a unique password). The second prohibits the user from making modifications. The last statement isolates the queries the user can run to the superuser queue, so they don’t interfere with the main workload.
In this example, we use service managed permissions in Amazon Managed Grafana and a workspace AWS Identity and Access Management (IAM) role as the authentication provider in the Amazon Redshift Grafana data source. We create the database user with the name redshift_data_api_user because that is the database user the AmazonGrafanaRedshiftAccess policy attached to the workspace role allows Grafana to connect as.
Configure a user in AWS SSO for Amazon Managed Grafana UI access
Two authentication methods are available for accessing Amazon Managed Grafana: AWS SSO and SAML. In this example, we use AWS SSO.
On the AWS SSO console, choose Users in the navigation pane.
Choose Add user.
In the Add user section, provide the required information.
In this post, we select Send an email to the user with password setup instructions. You need to be able to access the email address you enter because you use this email further in the process.
Choose Next to proceed to the next step.
Choose Add user.
An email is sent to the email address you specified.
Choose Accept invitation in the email.
You’re redirected to sign in as a new user and set a password for the user.
Enter a new password and choose Set new password to finish the user creation.
Configure an Amazon Managed Grafana workspace and sign in to Grafana
Now you’re ready to set up an Amazon Managed Grafana workspace.
On the Amazon Grafana console, choose Create workspace.
For Workspace name, enter a name, for example grafana-demo-workspace-1.
Choose Next.
For Authentication access, select AWS Single Sign-On.
For Permission type, select Service managed.
Choose Next to proceed.
For IAM permission access settings, select Current account.
For Data sources, select Amazon Redshift.
Choose Next to finish the workspace creation.
You’re redirected to the workspace page.
Next, we need to enable AWS SSO as an authentication method.
On the workspace page, choose Assign new user or group.
Under Select users and groups, select the previously created AWS SSO user in the Users table.
You need to make the user an admin, because we set up the Amazon Redshift data source with it.
Select the user from the Users list and choose Make admin.
Go back to the workspace and choose the Grafana workspace URL link to open the Grafana UI.
Sign in with the user name and password you created in the AWS SSO configuration step.
Set up an Amazon Redshift data source in Grafana
To visualize the data in Grafana, we need to access the data first. To do so, we must create a data source pointing to the Amazon Redshift cluster.
On the navigation bar, choose the lower AWS icon (there are two) and then choose Redshift from the list.
For Regions, choose the Region of your cluster.
Select the cluster from the list and choose Add 1 data source.
On the Provisioned data sources page, choose Go to settings.
For Name, enter a name for your data source.
By default, Authentication Provider should be set as Workspace IAM Role, Default Region should be the Region of your cluster, and Cluster Identifier should be the name of the chosen cluster.
For Database, enter dev.
For Database User, enter redshift_data_api_user.
Choose Save & Test.
A success message should appear.
Import the Amazon Redshift dashboard supplied with the data source
As the last step, we import the default Amazon Redshift dashboard and make sure that it works.
In the data source we just created, choose Dashboards on the top navigation bar and choose Import to import the Amazon Redshift dashboard.
Under Dashboards on the navigation sidebar, choose Manage.
In the dashboards list, choose Amazon Redshift.
The dashboard appears, showing operational data from your cluster. When you add more clusters and create data sources for them in Grafana, you can choose them from the Data source list on the dashboard.
Clean up
To avoid incurring unnecessary charges, delete the Amazon Redshift cluster, AWS SSO user, and Amazon Managed Grafana workspace resources that you created as part of this solution.
Conclusion
In this post, we covered the process of setting up an Amazon Redshift dashboard working under Amazon Managed Grafana with AWS SSO authentication and querying from the Amazon Redshift cluster under the same AWS account. This is just one way to create the dashboard. You can modify the process to set it up with SAML as an authentication method, use custom IAM roles to manage permissions with more granularity, query Amazon Redshift clusters outside of the AWS account where the Grafana workspace is, use an access key and secret or AWS Secrets Manager based connection credentials in data sources, and more. You can also customize the dashboard by adding or altering visualizations using the feature-rich Grafana UI.
Because the Amazon Redshift data source plugin is an open-source project, you can install it in any Grafana deployment, whether it’s in the cloud, on premises, or even in a container running on your laptop. That allows you to seamlessly integrate Amazon Redshift monitoring into virtually all your existing Grafana-based monitoring stacks.
For more details about the systems and processes described in this post, refer to the following:
Sergey Konoplev is a Senior Database Engineer on the Amazon Redshift team. Sergey has been focusing on automation and improvement of database and data operations for more than a decade.
Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.
AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD) provides a fully managed service for Microsoft Active Directory (AD) in the AWS Cloud. When you create your directory, AWS deploys two domain controllers in separate Availability Zones that are exclusively yours for high availability. For use cases requiring even higher resilience and performance, in a specific Region or during specific hours, AWS Managed Microsoft AD allows you to scale by deploying additional domain controllers to meet your needs. These domain controllers can help load-balance, increase overall performance, or simply provide additional nodes to protect against temporary availability issues. AWS Managed Microsoft AD allows you to define the correct number of domain controllers for your directory based on your individual use case.
This post will walk you through how to automate scaling in AWS Managed Microsoft AD using utilization metrics from your directory. You’ll do this using Amazon CloudWatch Alarms, SNS notifications, and a Lambda function to increase the number of domain controllers in your directory based on utilization peaks.
Simplified directory scaling
AWS Managed Microsoft AD has now simplified this directory scaling process by integrating with Amazon CloudWatch metrics. This new integration enables you to:
Analyze your directory to identify expected average and peak directory utilization
Scale your directory based on utilization data to adequately address the expected load
Automate the addition of domain controllers to handle unexpected load.
Integration is available for both domain controller utilization metrics such as CPU, Memory, Disk and Network, and for AD-specific metrics, such as LDAP searches, binds, DNS queries, and Directory reads/writes. Analyzing this data over time to identify expected average and peak utilization on your directory can help you deploy additional domain controllers in Regions that need them. Once you’ve established this utilization baseline, you can deploy additional domain controllers to service this load, and configure alarms for anything exceeding this baseline.
Solution overview
In this example, our AWS Managed Microsoft AD has the default two domain controllers; once your utilization threshold is reached, you’ll add one additional domain controller (domain controller 3 in the diagram) to cover this additional load.
Choose the Directory Service namespace, then choose AWS Managed Microsoft AD.
In the Directory ID column, select your directory and choose Search for this only.
In the Metric Category column, select Processor and choose Add to search. This view shows the processor utilization for your directory.
Figure 2. Processor utilization metrics
To see the average utilization across all domain controllers, choose Add Math, then All Functions, then AVG to create a metric math expression for average CPU utilization across all domain controllers.
Figure 3. Adding a math function to compute average
Next, choose the Graphed Metrics tab in the CloudWatch metrics console, select the newly created expression, then select the bell icon from the Actions column to create a CloudWatch alarm based on this metric.
Figure 4. Create a CloudWatch Alarm using Metric Math Expression
Configure the threshold alarm to trigger when CPU utilization exceeds 70%.
In the Metrics section, under Period, choose 1 Hour. In the Conditions section, under Threshold Type, choose Static. Under Define the alarm condition, choose Greater than threshold. Under Define the threshold value, enter 70. See Figure 5 for an image of how alarm parameters should look on your screen. Choose Next to Configure actions.
Figure 5. Configure the alarm parameters
On the Configure actions screen, configure the actions using the parameters listed below to send an email notification when the alarm state is triggered. See Figure 6 for an image of how email notifications are configured.
In the Notification section, set Alarm state trigger to In alarm. Set Select an SNS topic to Create topic. Fill in the name of the alarm in the Create a new topic field, and add the email where notifications should be sent to the Email endpoints that will receive notification field. An email address is required to create the SNS topic and you should use an email address that’s accessible by your operations team. This SNS topic will be used to trigger the Lambda automation described in a later section. Note: make a note of the SNS topic name you chose; you will use it later when creating the Lambda function in the To create an AWS Lambda function to automate scale out procedure below.
Figure 6. Create SNS topic and email notification
In the Alarm name field, provide a name for the alarm. You can optionally also add an Alarm description. Choose Next.
Review your configuration, and choose Create alarm to create the alarm.
Once you’ve completed these steps, you will now have an alarm implemented for when domain controller CPU utilization exceeds an average of 70% across both domain controllers. This will trigger an SNS topic when your directory is experiencing a heavy load, which will be used to start the Lambda automation and will send an informational email notification. In the next section, we’ll configure an AWS Lambda function to automate the addition of a domain controller based on this SNS topic.
The sample Lambda function shown below checks the number of domain controllers in this Region, and increases that by adding one additional domain controller. This procedure describes how to configure the IAM role required for this Lambda function, then how to deploy the Lambda function to execute when the alarm is triggered to automatically add a domain controller when your load exceeds your typical usage baseline.
Choose Next:Tags to add tags (optional) before choosing Next:Review.
On the Create Policy screen, provide a name in the Name field. You can optionally also add a description. Choose Create policy to complete creating the new policy.
Note: make a note of the policy name you chose; you will use it later when updating the execution role for the Lambda function.
Figure 7. Provide a name to create the IAM policy
In the AWS Console, navigate to Lambda and choose Create Function
On the Create Function screen, select Author from Scratch and provide a Name, then choose Create Function.
Figure 8. Create a Lambda function
Once created, on the Lambda function’s page, choose the Configuration tab, then choose Permissions from the sidebar and choose the execution role name linked under Role name. This will open the IAM console in another tab, preloaded to your Lambda execution role.
Figure 9: Select the Execution Role
On the execution role screen, choose Attach policies and select the IAM policy you’ve just created (e.g. DirectoryService-DCNumber Update). On the Attach Permissions screen, choose Attach policy to complete updating the execution role. Once you’ve completed this step, you can close this tab and return to the previous browser tab.
Figure 10. Select and attach the IAM policy
On the Lambda function screen, choose the Configuration tab, then choose Triggers from the sidebar.
On the Add Trigger screen, choose the pulldown under Trigger configuration and select SNS. On the SNS topic box, select the SNS topic you created in Step 9 of the To create a CloudWatch Alarm with SNS topic notifications procedure above. Then choose Add to complete the trigger configuration.
On the Lambda function screen, choose the Configuration tab, then choose Environment variables from the sidebar.
On the Environment variables card, click Edit.
On the Edit environment variables screen, choose Add environment variable and use the Key “DIRECTORY_ID” with the Value set to the directory ID of your AWS Managed Microsoft AD.
Figure 11. The “Edit environment variables” screen
On the Lambda function screen, choose the Code tab to open the in-browser code editor experience inside the Code source card. Paste in the sample Lambda function code given below to complete the implementation.
Figure 12. Paste sample code to complete the Lambda function setup
Sample Lambda function code
The sample Lambda function given below automates adding another domain controller to your directory. When your CloudWatch alarm triggers, you will receive a notification email, and an additional domain controller will be deployed to provide the added capacity to support the increase in directory usage.
Note: The example code contains a variable for the maximum number of domain controllers (maxDcNum) to prevent you from over-provisioning in the event of a misconfiguration. This value is set to 3 for this blog post’s example and can be increased to suit your use case.
import json
import os
import boto3

# Guardrails for scaling: never go above maxDcNum or below minDcNum domain controllers.
# maxDcNum is set to 3 for this blog post's example; adjust it to suit your use case.
maxDcNum = 3
minDcNum = 2
region = "us-east-1"
# Read the directory ID from the DIRECTORY_ID environment variable configured earlier
dsId = os.environ["DIRECTORY_ID"]

ds = boto3.client('ds', region_name=region)

def lambda_handler(event, context):
    # Get the current number of domain controllers in the directory
    dcs = ds.describe_domain_controllers(DirectoryId=dsId)
    DomainControllers = dcs["DomainControllers"]
    DCcount = len(DomainControllers)
    print(">>> Current number of DCs: " + str(DCcount))

    # Increase the number of DCs by one, unless the maximum has already been reached
    if DCcount < maxDcNum:
        NewDCnumber = DCcount + 1
        response = ds.update_number_of_domain_controllers(DirectoryId=dsId, DesiredNumber=NewDCnumber)
        return {
            'statusCode': 200,
            'body': json.dumps("New DC number will be " + str(NewDCnumber))
        }
    else:
        return {
            'statusCode': 200,
            'body': json.dumps("Max number of DCs reached. The number of DCs is " + str(DCcount))
        }
Note: When testing this Lambda function, remember that this will increase the number of domain controllers for your directory in that Region. If the additional domain controller is not needed, please reduce the count after the test to avoid costs for an additional domain controller. The same principles used in this article to automate the addition of domain controllers can be applied to automate the reduction of domain controllers and you should consider automating the reduction to optimize for resilience, performance and cost.
Conclusion
In this post, you’ve implemented alarms based on domain controller utilization thresholds using Amazon CloudWatch, and automation to increase the number of domain controllers using an AWS Lambda function. This solution helps to cost-effectively improve the resilience and performance of your directory by scaling it based on historical load patterns.
As your monitoring infrastructures evolve, you might hit a point when there’s no avoiding using the Zabbix API. The Zabbix API can be used to automate a particular part of your day-to-day workflow, troubleshoot your monitoring or to simply analyze or get statistics about a specific set of entities.
In this blog post, we will take a look at some of the more advanced API methods and specific method parameters and learn how they can be used to improve your API workflows.
1. Count entities with countOutput
Let’s start with gathering some statistics. Say you have to count the number of matching entities – here we can use the countOutput parameter. For a more advanced use case, what if we have to count the number of events for some time period? Let’s combine countOutput with time_from and time_till (in unixtime) and get the number of events created for the month of November. Let’s get all of the events for the month of November that have the Disaster severity:
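A sketch of such an event.get call – the timestamps below assume November 2021 in UTC, Disaster corresponds to severity 5, and the token placeholder is yours to fill in:

{
    "jsonrpc": "2.0",
    "method": "event.get",
    "params": {
        "countOutput": true,
        "time_from": "1635724800",
        "time_till": "1638316799",
        "severities": [5]
    },
    "auth": "<Place your API token here>",
    "id": 1
}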
Now let’s copy and paste the result of the export and import the template into another environment. It’s extremely important to remember that for this method to work exactly as we intend, we need to include the parameters that specify the behavior of particular entities contained in the configuration string, such as items, value maps, templates, etc. For example, if I exclude the templates parameter here, no templates will be imported.
3. Expand trigger functions and macros with expand parameters
Using trigger.get to obtain information about a particular set of triggers is a relatively common practice. One particular caveat to consider is that, by default, macros in the trigger name, expression, or description are not expanded. To expand the available macros we need to use the expand parameters:
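A minimal sketch of such a call – the trigger ID is a placeholder and the output fields are chosen just for this example:

{
    "jsonrpc": "2.0",
    "method": "trigger.get",
    "params": {
        "triggerids": "23262",
        "output": ["description", "expression", "comments"],
        "expandDescription": true,
        "expandExpression": true,
        "expandComment": true
    },
    "auth": "<Place your API token here>",
    "id": 1
}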
4. Obtaining additional LLD information for a discovered item
If we wish to display additional LLD information for a discovered entity, in this case – an item, we can use the selectDiscoveryRule and selectItemDiscovery parameters.
While selectDiscoveryRule will provide the ID of the LLD rule that created the item, selectItemDiscovery can point us at the parent item prototype id from which the item was created, last discovery time, item prototype key, and more.
The example below will return the item details and will also provide the LLD rule and Item prototype IDs, the time when the lost item will be deleted and the last time the item was discovered:
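A sketch of such a call – the item ID is a placeholder, and the selected itemDiscovery fields correspond to the details mentioned above:

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "itemids": "44919",
        "selectDiscoveryRule": "extend",
        "selectItemDiscovery": ["parent_itemid", "key_", "lastcheck", "ts_delete"]
    },
    "auth": "<Place your API token here>",
    "id": 1
}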
5. Searching through the matched entities with search parameters
Zabbix API provides a couple of standard parameters for performing a search. With the search parameter, we can search string or text fields and try to find objects based on a single entry or multiple entries. The searchByAny parameter can extend the search – if you set it to true, the search will match ANY of the criteria in the search array, instead of trying to find an entity that matches ALL of them (the default behavior).
The following API call will find items that match agent and Zabbix keys on a particular template:
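A sketch of such a call – the template ID is a placeholder for the particular template mentioned above:

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "output": ["itemid", "name", "key_"],
        "templateids": "10001",
        "search": {
            "key_": ["agent", "zabbix"]
        },
        "searchByAny": true
    },
    "auth": "<Place your API token here>",
    "id": 1
}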
Feel free to take the above examples, change them around so they fit your use case and you should be able to quite easily implement them in your environment. There are many other use cases that we might potentially cover down the line – if you have a specific API use case that you wish for us to cover, feel free to leave a comment under this post and we just might cover it in one of the upcoming blog posts!
With Zabbix Summit Online 2021 just around the corner, it’s time to have a quick overview of the 6.0 LTS features that we can expect to see featured during the event. The Zabbix 6.0 LTS release aims to deliver some of the long-awaited enterprise-level features while also improving the general user experience, performance, scalability, and many other aspects of Zabbix.
Native Zabbix server cluster
Many of you will be extremely happy to hear that Zabbix 6.0 LTS release comes with out-of-the-box High availability for Zabbix Server. This means that HA will now be supported natively, without having to use external tools to create Zabbix Server clusters.
The native Zabbix Server cluster will have a speech dedicated to it during the Zabbix Summit Online 2021. You can expect to learn both the inner workings of the HA solution, the configuration and of course the main benefits of using the native HA solution. You can also take a look at the in-development version of the native Zabbix server cluster in the latest Zabbix 6.0 LTS alpha release.
Business service monitoring and root cause analysis
Service monitoring is also about to go through a significant redesign, focusing on delivering additional value by providing robust Business service monitoring (BSM) features. This is achieved by delivering significant additions to the existing service status calculation logic. With features such as service weights, service status analysis based on child problem severities, and the ability to calculate service status based on the number or percentage of children in a problem state, users will be able to implement BSM on a whole new level. BSM will also support root cause analysis – users will be informed about the root cause problem of the service status change.
All of this and more, together with examples and use cases will be covered during a separate speech dedicated to BSM. In addition, some of the BSM features are available in the latest Zabbix 6.0 LTS alpha release – with more to come as we continue working on the Zabbix 6.0 release.
Audit log redesign
The Audit log is another existing feature that has received a complete redesign. With the ability to log each and every change performed both by the Zabbix Server and Zabbix Frontend, the Audit log will become an invaluable source of audit information. Of course, the redesign also takes performance into consideration – the redesign was developed with the least possible performance impact in mind.
The audit log is constantly in development and the current Zabbix 6.0 LTS alpha release offers you an early look at the feature. We will also be covering the technical details of the new audit log implementation during the Summit and will explain how we are able to achieve minimal performance impact with major improvements to Zabbix audit logging.
Geographical maps
With Geographical maps, our users can finally display their entities on a geographical map based on the coordinates of the entity. Geographical maps can be used with multiple geographical map providers and display your hosts with their most severe problems. In addition, geographical maps will react dynamically to Zoom levels and support filtering.
The latest Zabbix 6.0 Alpha release includes the Geomap widget – feel free to deploy it in your QA environment, check out the different map providers, filter options and other great features that come with this widget.
Machine learning
When it comes to problem detection, Zabbix 6.0 LTS will deliver multiple new trend functions. A specific set of functions provides machine learning functionality for Anomaly detection and Baseline monitoring.
The topic will be covered in-depth during the Zabbix Summit Online 2021. We will look at the configuration of the new functions and also take a deeper dive at the logic and algorithms used under the hood.
During the Zabbix Summit Online 2021, we will also cover many other new features, such as:
New Dashboard widgets
New items for Zabbix Agent
New templates and integrations
Zabbix login password complexity settings
Performance improvements for Zabbix Server, Zabbix Proxy, and Zabbix Frontend
UI and UX improvements
New history and trend functions
And more!
Not only will you get the chance to have an early look at many new features not yet available in the latest alpha release, but also you will have a great chance to learn the inner workings of the new features, the upgrade and migration process to Zabbix 6.0 LTS and much more!
We are extremely excited to share all of the new features with our community, so don’t miss out – take a look at the full Zabbix Summit online 2021 agenda and register for the event by visiting our Zabbix Summit page, and we will see you at the Zabbix Summit Online 2021 on November 25!
Zabbix API enables you to collect any and all information from your Zabbix instance by using a multitude of API methods. You can even utilize Zabbix API calls in your HTTP items. For example, this can be used to monitor the number of particular sets of metrics and visualize their growth over time. With named Zabbix API tokens, such use cases are a lot more simple to implement.
Before Zabbix 5.4 we had to perform the user.login API call to obtain the authentication token. Once the user session was closed, we had to relog, obtain the new authentication token and use it in the subsequent API calls.
With the pre-defined named Zabbix API tokens, you don’t have to constantly check if the authentication token needs to be updated. Starting from Zabbix 5.4 you can simply create a new named Zabbix API token with an expiration date and use it in your API calls.
Creating a new named Zabbix API token
The Zabbix API token creation process is extremely simple. All you have to do is navigate to Administration – General – API tokens and create a new API token. The named API tokens are created for a particular user and can have an optional expiration date and time – otherwise, the tokens are defined without an expiry date.
You can create a named API token in the API tokens section, under Administration – General
Once the Token has been created, make sure to store it somewhere safe, since you won’t be able to recover it afterward. If the token is lost – you will have to recreate it.
Make sure to store the auth token!
Don’t forget, that when defining a role for the particular API user, we can restrict which API methods this user has access to.
Simplifying API tasks with the named API token
There are many different use cases where you could implement Zabbix API calls to collect some additional information. For this blog post, I will create an HTTP item that uses the item.get API call to monitor the number of unsupported items.
To achieve that, I will create an HTTP item on a host (This can be the default Zabbix server host or a host dedicated to collecting metrics via Zabbix API calls) and provide the API call in the request body. Since the named API token now provides a static authentication token until it expires, I can simply use it in my API call without the need to constantly keep it updated.
An HTTP agent item that uses a Zabbix API call in its request body
HTTP item request body, which returns a count of unsupported items
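A rough sketch of what such a request body could look like – it assumes counting items in the unsupported state (state 1), and the token placeholder is yours to fill in:

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "countOutput": true,
        "filter": {
            "state": 1
        }
    },
    "auth": "<Place your named API token here>",
    "id": 1
}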
I will also use regular expression preprocessing to obtain the numeric value from the API call result – otherwise, we won’t be able to graph our value or calculate trends for it.
Regular expression preprocessing step to obtain a numeric value from our Zabbix API call result
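Assuming a countOutput response such as {"jsonrpc":"2.0","result":"5","id":1}, the preprocessing step could look roughly like this (the pattern below is just one way to extract the count):
Preprocessing step: Regular expression
Pattern: "result":"(\d+)"
Output: \1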
Utilizing Zabbix API scripts in Actions
In one of our previous blog posts, we covered resolving problems automatically with the event.acknowledge API method. The logic defined in the blog post was quite complex since we needed to keep an eye out for the authentication tokens and use a custom script to keep them up to date. With named Zabbix API tokens, this use case is a lot more simple.
All I have to do is create an Action operation script containing my API call and pass it to an action operation.
Action operation script that invokes Zabbix event.acknowledge API method
curl -sk -X POST -H "Content-Type: application/json" -d "
{
    \"jsonrpc\": \"2.0\",
    \"method\": \"event.acknowledge\",
    \"params\": {
        \"eventids\": \"{EVENT.ID}\",
        \"action\": 1,
        \"message\": \"Problem resolved.\"
    },
    \"auth\": \"<Place your authentication token here>\",
    \"id\": 2
}" <Place your Zabbix API endpoint URL here>
Problem remediation script example
Now my problems will get closed automatically after the time period which I have defined in my action.
Action operation which runs our event.acknowledge Zabbix API script
These are but a few examples of what we can now achieve by using named API tokens. A lot of information can be obtained and filtered in a unique way via Zabbix API, thus providing you with a granular analysis of your monitored environment. If you have recently upgraded to Zabbix 5.4 or plan to upgrade to Zabbix 6.0 LTS in the future, I would recommend implementing named Zabbix API tokens to simplify your day-to-day workflow and to consider the possibilities that this new feature opens up for you.
If you have any questions or if you wish to share your particular use case for data collection or task automation with Zabbix API – feel free to share them in the comments section below!
There are many design choices to consider when we build our monitoring environment for high-frequency monitoring. How to minimize performance impact? What are the data retention policies with storage space in mind? What are the available out-of-the-box features to solve these potential problems?
In this blog post, we will discuss when you should use preprocessing, when it is better to use the “Do not keep history” option for your metrics, and what the pros and cons of both approaches are.
Throttling and other preprocessing steps
We’ve discussed throttling previously as the go-to approach for high-frequency monitoring. Indeed, with throttling you can discard repeated values, optionally combined with a heartbeat. This is extremely useful for metrics that come as discrete values – service states, network port statuses, and so on.
Example of throttling with and without heartbeat
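As a simple sketch, for an item that polls a service state every few seconds, a single preprocessing step along these lines discards every repeated value while still forwarding at least one value per hour (the 1h heartbeat is just an illustrative choice):
Preprocessing step: Discard unchanged with heartbeat
Parameter: 1h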
In addition, starting from Zabbix 4.2, preprocessing is also performed by Zabbix proxies. This means we can discard repeated values before they ever reach the Zabbix server, which helps both with performance (fewer metrics to insert into the Zabbix server DB) and with DB size (fewer metrics stored in the DB, which in turn also improves overall Zabbix performance).
There are a few caveats with this approach: since metrics get discarded before they reach the Zabbix server, triggers will not react to these metrics (this is where having a heartbeat is useful), and since trends are calculated by the Zabbix server based on the received history data, there can be gaps in the trend information for these metrics. Keep in mind that this applies not only to throttling rules – any preprocessing can be done on the proxy, and any preprocessing rule can be used to transform your data.
Understanding “Do not keep history” option
The behavior of the “Do not keep history” option, which we can define when configuring an item, is a bit different though. If an item is collected by a proxy and configured with “Do not keep history”, the history won’t always get discarded! There are a couple of reasons for this.
First off, let’s not forget that some of our values can populate host inventory! If the particular item is configured to populate an inventory field, its value will be forwarded to the Zabbix server, but it will not get stored in the history tables.
If the item does not populate an inventory field, text data types such as character, log, and text will indeed get discarded before reaching the Zabbix server, but numeric values – both float and integer – will still be forwarded to the server. The reason is that the Zabbix server derives trend information from the numeric values. Mind that the numeric data will still not get stored in the history tables – only trends will be available for these items.
Note: This behavior has been properly implemented starting from Zabbix 5.2. See ZBX-17548
Setting the “Do not keep history” option for an item
Using trend functions with high-frequency monitoring
With the specifics of “Do not keep history” in mind, we should now recall that starting from Zabbix 5.2 we have trend functions at our disposal!
Trend functions such as trendavg, trendcount, trendmax, trendmin, and trendsum allow us to perform different kinds of trend calculations – from counting the number of trend values to retrieving the min/max/avg trend value for a time period.
This means that if we require only the metric trends for specific time periods (hours, days, weeks, etc.), we can use these trend functions together with the “Do not keep history” option, thus discarding unnecessary data and improving our Zabbix server performance!
There are two approaches to using trend functions:
If you wish to collect and display the trend data, you need to create the item that will collect the metrics (say, a net.if.in agent item for collecting incoming network traffic) and then create a separate calculated item that uses a trend function to calculate the avg/min/max value of the trend over a time period. The original item can then have the “Do not keep history” option selected for it.
trendavg item for calculating hourly trends from the net.if.in[ifHCInOctets.5] item
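As an illustration, using the Zabbix 5.4+ expression syntax and a hypothetical host name, the formula of such a calculated item could look roughly like this:
trendavg(/MyRouter/net.if.in[ifHCInOctets.5],1h:now/h)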
If you simply wish to define triggers and react to long-term trends, and you don’t need to collect the trend values, then you can skip creating the calculated item and simply use the trend function on the original item in the trigger expression.
This trigger fires if the hourly average trend value exceeds 100M. Note: In this case only the original item is required.
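In the same Zabbix 5.4+ expression syntax, and again with a hypothetical host name, such a trigger expression could be written roughly as:
trendavg(/MyRouter/net.if.in[ifHCInOctets.5],1h:now/h)>100M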
By combining these approaches in our environment – using preprocessing when we wish to discard or transform the data, and opting out of storing history data whenever this is appropriate – we can minimize the performance impact on our Zabbix instance. Add a layer of distributed Zabbix proxies on top of this, and you can achieve a truly large, scalable Zabbix infrastructure optimized for high-frequency ingestion and processing of your data.