Tag Archives: high availability

Making PaperCut NG Observable with Zabbix

2025-11-18 Patrik Uytterhoeven

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/making-papercut-ng-observable-with-zabbix/31244/

In most organizations, printing is an essential but often invisible service. When it works, nobody notices. When it fails, productivity stalls. That’s why monitoring your print environment is just as important as monitoring servers, databases, or network devices.

At Opensource ICT Solutions, we specialize in turning complex systems into observable services. One recent example is our integration of PaperCut NG with Zabbix. This allows IT teams to track the health of their print infrastructure in real time, from server resources to individual printers and devices.

Why monitoring PaperCut matters

PaperCut NG does much more than queue print jobs. It enforces quotas, integrates with authentication systems, and manages fleets of devices. If the database runs out of connections, the disk fills up, or the license expires, users feel the impact instantly.

By integrating PaperCut with Zabbix, we make these risks visible long before they become business problems. The result is:

Proactive detection of printer errors, low toner, or license issues.
Capacity planning through trend analysis of disk usage, memory, and DB connections.
Unified visibility — PaperCut health checks appear right alongside servers, networks, and applications in Zabbix dashboards.

How the integration works

The magic happens through the PaperCut System Health API and Zabbix’s flexible data collection methods.

HTTP agent items

Zabbix fetches raw JSON data directly from PaperCut using an HTTP agent item that queries the System Health API.

This single call provides a full snapshot of server health.

Dependent items + JSONPATH

Instead of hammering the API with multiple requests, we extract the needed fields using dependent items with JSONPATH preprocessing.

This design means one request can populate dozens of metrics, keeping monitoring both efficient and lightweight.
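The master/dependent pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the template's actual configuration: the payload and its field names (`database.activeConnections` and so on) are hypothetical, and `extract` handles only the simple `$.a.b` form of JSONPath.

```python
import json

# Hypothetical payload; the real field names come from the PaperCut health API.
SAMPLE_HEALTH_JSON = """
{
  "database": {"activeConnections": 9, "maxConnections": 30},
  "license": {"valid": true, "licenseRemainingDays": 120},
  "diskSpace": {"freeMB": 51200}
}
"""


def extract(document, path):
    """Resolve a simple JSONPath such as '$.database.activeConnections'.

    Only the dotted '$.a.b.c' form is supported, which is enough to mimic
    one JSONPath preprocessing step on a nested health payload.
    """
    if not path.startswith("$."):
        raise ValueError("only '$.field.subfield' paths are supported")
    value = document
    for key in path[2:].split("."):
        value = value[key]
    return value


# One "master item" request, parsed once...
master = json.loads(SAMPLE_HEALTH_JSON)

# ...feeds many "dependent items", each with its own path (names are made up).
dependent_items = {
    "db.connections.active": "$.database.activeConnections",
    "db.connections.max": "$.database.maxConnections",
    "license.days.remaining": "$.license.licenseRemainingDays",
}
metrics = {name: extract(master, path) for name, path in dependent_items.items()}
print(metrics)
```

Adding another metric means adding one more path to `dependent_items`; the number of requests sent to PaperCut stays at one.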

Calculated items

Some values aren’t directly available from PaperCut. In those cases, we create calculated items inside Zabbix.

For example, the percentage of active DB connections is derived in Zabbix from the connection counts that PaperCut reports.

This allows us to set intelligent triggers like “DB connections > 90%” without requiring PaperCut to calculate it for us.
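As a rough sketch of what such a calculated item and its trigger amount to, the arithmetic looks like this in Python. The function names and the assumption that PaperCut reports an active and a maximum connection count are illustrative; the 90% threshold is the one mentioned above.

```python
def db_connection_pct(active, maximum):
    """Percentage of the DB connection pool that is currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return round(100.0 * active / maximum, 2)


def trigger_fires(pct, threshold=90.0):
    """Mimic a 'DB connections > 90%' trigger (strictly greater than)."""
    return pct > threshold


for active in (9, 27, 28):
    pct = db_connection_pct(active, 30)
    print(active, pct, trigger_fires(pct))
```

Because the comparison is strict, a pool that sits at exactly 90% does not raise the problem; only values above the threshold do.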

Low-level discovery (LLD) for devices and printers

Perhaps the most powerful part of this integration is automatic discovery.

Printer LLD → Queries /api/health/printers and creates items and triggers per printer. If a printer goes into Paper Jam or No Toner, Zabbix knows immediately.
Device LLD → Queries /api/health/devices and builds items dynamically for each discovered device, tracking states like OK, WARNING, or ERROR.

This ensures that new printers and devices are monitored automatically — no manual configuration required!
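To make the discovery step more concrete, here is a minimal Python sketch of turning a printer list into LLD rows and expanding an item prototype key for each row. The response shape, the `{#PRINTER.NAME}` macro, and the `papercut.printer.status[...]` key are assumptions made for illustration and are not taken from the published template.

```python
import json

# Hypothetical response shape for /api/health/printers; real fields may differ.
SAMPLE_PRINTERS = {
    "printers": [
        {"name": "lobby-mfp", "status": "OK"},
        {"name": "finance-laser", "status": "NO_TONER"},
    ]
}


def to_lld_rows(payload):
    """Build one LLD row per printer, keyed by an LLD macro."""
    return [{"{#PRINTER.NAME}": p["name"]} for p in payload["printers"]]


def expand_prototype(key_prototype, row):
    """Substitute LLD macros in an item prototype key, one row at a time."""
    key = key_prototype
    for macro, value in row.items():
        key = key.replace(macro, value)
    return key


rows = to_lld_rows(SAMPLE_PRINTERS)
item_keys = [
    expand_prototype("papercut.printer.status[{#PRINTER.NAME}]", row)
    for row in rows
]
print(json.dumps(rows))
print(item_keys)
```

When a new printer shows up in the API response, it produces a new row, and with it a new set of items and triggers, without anyone editing the host.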

Why this matters

Bringing all of this together, the integration turns PaperCut NG into a fully observable service inside Zabbix.

Efficiency → One API call, dozens of metrics.
Scalability → Automatic discovery of printers and devices.
Robustness → Alerts and dashboards for licenses, resources, and print queues.

For IT teams, this means fewer surprises, faster troubleshooting, and more confidence in a service that often goes unnoticed until it fails.

Our expertise

This PaperCut integration is just one example of how we at Opensource ICT Solutions help organizations unlock the full potential of Zabbix. We don’t just install monitoring – we design intelligent, scalable integrations that make hidden systems visible. Whether it’s print management, databases, custom applications, or network devices, we know how to extend Zabbix to fit your environment and give you the insights that matter most.

You can download our template and documentation for free from our GitHub: https://github.com/OpensourceICTSolutions/ZabbixPapercutNG

Want to make your business-critical systems truly observable? Let’s talk about how we can tailor Zabbix to your needs: [email protected]

The post Making PaperCut NG Observable with Zabbix appeared first on Zabbix Blog.

Improving Customer Satisfaction and Experience with Zabbix

2025-11-07 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/improving-customer-satisfaction-and-experience-with-zabbix/31692/

No matter what business you’re in, there is one universal truth – your success or failure depends on customer satisfaction and trust. And when your IT systems fail, it’s your customers who pay the price. Being unable to place an order due to unexpected downtime (which can cost a large organization as much as $9,000 per minute) or having their credit card data compromised in a preventable security breach (which costs the average organization nearly $5 million) will force even your most loyal customers to go somewhere else.

Monitoring with Zabbix doesn’t just keep your infrastructure safe, it keeps your reputation safe and makes sure that your customers continue to be your customers. It does this by guaranteeing the performance, reliability, and security of your digital services – while also supporting better customer service and continuous improvement. Keep reading to see how it’s possible.

Say goodbye to downtime

Your customers are looking to meet their needs quickly and effectively. Unexpected service disruptions cause them to feel neglected and force them to look elsewhere for solutions.
Monitoring your infrastructure with Zabbix can effectively eliminate downtime through proactive issue detection, which locates anomalies and performance issues like high CPU usage, packet loss, and latency in real time – before they have a chance to make life harder for customers.

If an issue does occur, Zabbix’s predictive alerting capabilities let your tech teams know about anything that could potentially impact an application or service. That helps them meet SLAs and provide a more reliable customer experience with fewer service disruptions, which in turn leads to higher levels of trust and satisfaction.

Outperform your competitors

No matter how good your products or services happen to be, you still need to provide a smooth and fast online user experience if you want repeat use and positive reviews. Monitoring with Zabbix optimizes network traffic by helping you to identify bandwidth bottlenecks or misconfigured devices with a single glance at a dashboard, allowing better traffic management and a better online experience for customers.

It also improves response times, which allows you to be confident that your applications and services remain responsive. This is especially important for real-time services like video conferencing, e-commerce, or customer support.

Turn good customer service into outstanding customer service

What turns a casual, one-time user into a repeat customer? In most cases, it all comes down to making that user feel seen, informed, and supported. Zabbix helps you maintain consistent system performance, and nothing builds trust like stability.

With a bit of configuration and the help of IT service management tools like ServiceNow, Zabbix can provide clear, easy-to-access logs and metrics that help your customer service reps better understand your customers and the process of serving them, including:

• Customer satisfaction (CSAT)
• Preferred communication channel
• Average ticket count
• Average response time
• Average ticket resolution time
• Ticket resolution rate
• Ticket backlog
• Interactions per ticket

With this information, your team will be able to communicate proactively when issues happen, giving customers accurate information about the issue and the expected resolution time.

Keep your customers safe from cyber threats

The consequences of a data breach are deep and far-reaching, and they include financial losses, reputational damage, legal troubles, regulatory fines, and a loss of customer trust. Despite a greater emphasis on data security, hackers are constantly finding new ways to gain access to valuable corporate data and credentials by combining next-generation AI technologies with long-established tools.

Monitoring with Zabbix gives IT and security teams the visibility and early warning systems they need to spot and react to potential threats. Zabbix continuously monitors systems, networks, and applications for predefined thresholds and anomalies, identifying possible network intrusions or misconfigurations and notifying the relevant security stakeholders.

On top of that, Zabbix can monitor any existing security tools your team runs, tracking antivirus software, firewalls, IDS/IPS tools, and endpoint protection solutions to make sure they are functioning properly and running the latest versions. It can also integrate with SIEM systems (like Splunk, ELK, or Wazuh) as well as custom scripts in order to provide extended security analytics.

Meet (and exceed) your SLAs

Service Level Agreements (SLAs) are a framework for managing the expectations of both customers and businesses. They define agreed-on standards of service, but tracking them is more than just a way to measure compliance – it’s a tool that you can use to improve your overall service delivery and operations.

With Zabbix, you can monitor any quantifiable metric that’s relevant to your SLAs, such as system uptime/downtime, response time, the availability of web services, databases, or network devices, transaction success and failure rates, and much more. In addition, Zabbix can use real-time data and built-in SLA calculation to automatically calculate current SLA compliance and send an alert if an SLA is at risk of being breached, by using triggers based on thresholds.
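The arithmetic behind an availability SLA is simple enough to sketch in Python. This is not how Zabbix implements its SLA calculation; the function names, the 30-day period, and the 80% early-warning fraction are illustrative assumptions.

```python
def availability_pct(period_s, downtime_s):
    """Availability over a reporting period, as a percentage."""
    if period_s <= 0 or not 0 <= downtime_s <= period_s:
        raise ValueError("need period_s > 0 and 0 <= downtime_s <= period_s")
    return 100.0 * (period_s - downtime_s) / period_s


def allowed_downtime_s(period_s, slo_pct):
    """Downtime budget implied by the SLO for this period."""
    return period_s * (100.0 - slo_pct) / 100.0


def sla_at_risk(period_s, downtime_s, slo_pct, warn_fraction=0.8):
    """Warn once a chosen fraction of the downtime budget has been used."""
    return downtime_s >= warn_fraction * allowed_downtime_s(period_s, slo_pct)


PERIOD = 30 * 24 * 3600  # a 30-day reporting period, in seconds
print(round(availability_pct(PERIOD, 1800), 3))
print(round(allowed_downtime_s(PERIOD, 99.9), 1))
print(sla_at_risk(PERIOD, 1800, 99.9), sla_at_risk(PERIOD, 2200, 99.9))
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is why an early warning before the budget is spent is more useful than an alert after the breach.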

If you’d rather track the metrics on your own, no problem – by using Zabbix dashboards, you can visualize SLA compliance in real-time, with the dashboards showing availability percentages, event timelines, and breach summaries, while giving you easy-to-understand views of service health. The result is better products and services that are aligned with customer expectations.

Build a continuous improvement culture

When it’s time to roll out a new feature or upgrade, you naturally want to have ALL the necessary data at your fingertips. Monitoring usage patterns and performance metrics with Zabbix not only gives you advanced visualizations (forecasting, capacity planning insights, etc.) but can also highlight cases where data analysis led to tangible improvements.

Want more input from customers and users? Zabbix can make sure that the improvements to your product are community-driven by giving you the data you need to run regular user surveys and forums to gather product feedback. It can even help you publish a public roadmap with transparent prioritization based on community input.

Conclusion

Customer satisfaction is about a lot more than just good service – it’s also about consistency, reliability, and transparency. Zabbix empowers businesses to deliver all three by providing a comprehensive, proactive, and scalable monitoring solution.

That’s why customers in verticals as diverse as aerospace and education turn to Zabbix to keep them informed about what’s working – and what isn’t. By integrating Zabbix into your IT operations, you’re not just improving system performance – you’re actively investing in customer satisfaction and loyalty.

Find out more about what Zabbix can do for you and your customers by taking a look at real-world case studies from companies like yours.

The post Improving Customer Satisfaction and Experience with Zabbix appeared first on Zabbix Blog.

Running Zabbix with MariaDB and Galera Active/Active Clustering

2025-09-30 Nathan Liefting

Post Syndicated from Nathan Liefting original https://blog.zabbix.com/running-zabbix-with-mariadb-and-galera-active-active-clustering/31104/

High availability on a platform like Zabbix is a hard requirement for many users. With native high availability on the Zabbix servers, proxies, and at the frontend through various solutions for web servers, all that’s left is at the database layer. Any downtime in your MariaDB database would disrupt your monitoring availability, at the least on the frontend side of things in case of proxy buffering. Let’s have a look at the easiest way to create a high availability (HA) architecture for Zabbix using MariaDB with built-in Galera clustering – by removing single points of failure from your database and finalizing the HA puzzle for Zabbix.

Architecture overview

Let’s start off with the number one design requirement for MariaDB + Galera. For a proper quorum, the cluster should have 3 nodes. With only two nodes in a Galera cluster, quorum rules become a bit of a headache, as Galera uses a majority vote (more than half the nodes) to decide if the cluster can still accept writes. In a two-node setup, all is good while both nodes are online. But when we lose one node, the remaining node is no longer a majority, so quorum is lost until that node rejoins.

This makes a two-node setup fragile but not impossible, and it does work with Zabbix since we only have one Zabbix server active at a time. In a split-brain scenario where both nodes think they were the last to leave, you might have to decide which node you think has your up-to-date data. We will detail both scenarios, but the principle remains the same. We will use MariaDB as our database and Galera will be used to create a primary/primary cluster. In such a cluster, all nodes in the cluster are writeable, which is great for the Zabbix native HA.
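The majority rule can be written down as a small Python sketch. It is deliberately simplified: it counts every node as one vote and ignores how Galera handles weighted votes and nodes that leave gracefully, so treat it as an illustration of the "more than half" rule rather than Galera's exact algorithm.

```python
def has_quorum(reachable, total):
    """True when the reachable nodes form a strict majority of the cluster."""
    if not 0 <= reachable <= total:
        raise ValueError("need 0 <= reachable <= total")
    return reachable * 2 > total


for total in (2, 3):
    for reachable in range(total, 0, -1):
        print(f"{reachable}/{total} nodes reachable:", has_quorum(reachable, total))
```

With three nodes, losing one still leaves two of three votes. With two nodes, losing one leaves exactly half, which is not a majority.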

When we look in the Zabbix database, we can see that Zabbix keeps all of its Zabbix server HA information and states in the database.

This means that whatever one Zabbix server node writes into the database will also be replicated to all other nodes in the MariaDB Galera cluster.
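As a rough illustration, the HA state can be read with a query such as `SELECT name, status FROM ha_node;`. The Python sketch below interprets rows of that shape. The sample rows are invented, and the status codes (0 standby, 1 stopped, 2 unavailable, 3 active) are my reading of the Zabbix documentation, so verify them against your Zabbix version.

```python
# Assumed meaning of ha_node.status values; check your Zabbix version's docs.
HA_STATUS = {0: "standby", 1: "stopped", 2: "unavailable", 3: "active"}

# Invented sample rows, shaped like: SELECT name, status FROM ha_node;
SAMPLE_ROWS = [
    ("zbx-server-1", 3),
    ("zbx-server-2", 0),
]


def describe(rows):
    """Map each node name to a readable HA state."""
    return {name: HA_STATUS.get(status, "unknown") for name, status in rows}


def active_node(rows):
    """Return the single active node name, or None if there is not exactly one."""
    active = [name for name, status in rows if status == 3]
    return active[0] if len(active) == 1 else None


print(describe(SAMPLE_ROWS))
print(active_node(SAMPLE_ROWS))
```

Since these rows are replicated by Galera like any other table, every database node gives the same answer about which Zabbix server is active.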

The design

Knowing what we know now, we can create a very simple design for a solid Zabbix HA setup with MariaDB + Galera. When we have a single Zabbix frontend and we keep to the MariaDB + Galera requirement of having 3 database nodes, we get a fairly simple setup, as seen below.

In this setup, each Zabbix server connects to its own database node, so we don’t need the added complexity of load balancers. We still get automatic failover for the Zabbix servers, as they know exactly which node is active through the database. However, in this situation we are still left with 3 frontends that do not have automatic failover, simply because we do not have a database-aware Apache or NGINX. This also works in a two-database setup, with the side note that you might have quorum issues to resolve manually after an outage:

Adding onto this setup, we could install a VIP, a load balancer, or something like HAProxy in front of the frontends to make a failover happen there as well. Keep in mind, though, that the failover needs to happen based on whether or not the web frontend can reach a writeable database.
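One way to base that decision on database health is to look at the Galera status variables of the node behind each frontend. The Python sketch below assumes the values were already read with `SHOW GLOBAL STATUS LIKE 'wsrep_%'`. The function names and the 200/503 mapping are illustrative, and how the result is exposed to a VIP or load balancer is left open.

```python
def accepts_writes(status):
    """Decide whether a Galera node is in a state where writes are safe."""
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_ready") == "ON"
        and status.get("wsrep_local_state_comment") == "Synced"
    )


def health_check_code(status):
    """HTTP-style result a health check script could hand to a load balancer."""
    return 200 if accepts_writes(status) else 503


healthy = {
    "wsrep_cluster_status": "Primary",
    "wsrep_ready": "ON",
    "wsrep_local_state_comment": "Synced",
}
no_quorum = dict(healthy, wsrep_cluster_status="non-Primary", wsrep_ready="OFF")

print(health_check_code(healthy), health_check_code(no_quorum))
```

A frontend whose database node has dropped out of the primary component then fails its check and is taken out of rotation, even though the web server itself is still running.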

Optional Arbitrator

If you are set on running only 2 database nodes (your wallet is thankful), but still worried about quorums, we can bring in the ARBITRATOR.

If there are only 2 database nodes in your Galera cluster, not to worry! The arbitrator takes part in quorum votes without holding a copy of the database, so a two-node cluster can still resolve quorum properly in case of an outage.

What about load balancing?

Lastly, it is also possible to add load balancing to the mix. Let’s say, for example, you cannot add a VIP to your environment but still need your WEB servers to fail over. A load balancer can provide the solution here.

We still prefer to run the Zabbix servers with a direct database connection, but even there a load balancer could be added if you wish. However, please keep in mind that the more load balancers you add, the more complex troubleshooting might become. The whole idea about the setup without load balancers is to have a solid Zabbix setup that is easy to maintain, while providing high availability.

Conclusion

In the end, even with a minimal setup of 2 DB nodes, 2 Zabbix servers, and 2 WEB frontends, we can make a high availability setup. As we’ve shown with Galera, this setup becomes highly flexible, allowing us to run without automatic WEB failover all the way up to including complicated load balancers.

High availability doesn’t have to be overly complicated in a setup like this – it really is all about how far you want to push things. Besides that, in this setup everything is horizontally scalable on the database side. Do keep in mind, however, that Zabbix does still run in an Active/Passive setup.

I hope you enjoyed reading this blog post. If you have any questions or need help configuring anything in your Zabbix setup feel free to contact me and the team at Opensource ICT Solutions. We build a ton of cool stuff like this and more!

Nathan Liefting

https://oicts.com

The post Running Zabbix with MariaDB and Galera Active/Active Clustering appeared first on Zabbix Blog.

Optimizing Financial Routines and Infrastructure with Banpará

2025-09-23 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/optimizing-financial-routines-and-infrastructure-with-banpara/30815/

Banco do Estado do Pará (Banpará) is the main public financial institution in the Brazilian state of Pará. It is a mixed-capital company, organized as a multiple bank with the mission of generating value for the state of Pará. It currently has approximately 198 physical customer service units and is present in all 144 municipalities in the state.

The challenge

Until 2016, Banpará used a monitoring environment installed on a single physical server. This environment was centralized, not very scalable, and vulnerable due to the lack of updates to recent versions of the software used. Centralization created a critical dependency – if there was a server failure, the entire monitoring system would be compromised.

There was no integration with the tool that orchestrates the company’s routine activities (which also generated alerts and needed proper support from the bank’s infrastructure). In addition, the routines of the internal demand generation tool had to be added to the monitoring panel manually.

With each new routine created, it was necessary to open calls with the technical teams for inclusion in the monitoring plan, which were then entered into a list of tasks. This process, in addition to being time-consuming, was subject to human error and delays, which compromised real-time visibility of critical operations.

The lack of proactive and integrated monitoring in Banpará’s structure resulted in operational gaps that created real risks to the continuous functioning of banking operations.

The solution

Given the challenges posed, the project developed with Zabbix had as its main objective to recreate the monitoring environment in a virtualized, scalable and resilient way, without dependence on a physical server. From rebuilding the infrastructure to integrating it with critical banking systems, the primary requirements included the following:

Integration with existing systems
Intelligent data processing and analysis
Reduction of manual processes and operational dependency
Development of customized solutions
Reorganization of the technological infrastructure

After implementing and structuring Zabbix at the bank (with the help of Master Support, an official Zabbix Certified Partner in Brazil), the structure became modular, scalable, and resilient, aligned with best practices, and able to expand monitoring without compromising system performance as the bank integrated new routines and services.

The results

The modernization of the monitoring environment with Zabbix brought immediate benefits for Banpará’s IT monitoring scenario, especially with regard to operational efficiency, reliability and process automation:

More than 2,000 monitored devices
Around 100,000 metrics collected
More than 26,000 active alerts in Zabbix
Automated coverage of around 2,300 routines
An estimated gain of 2,300 operational hours

The adoption of Zabbix as a monitoring tool at Banpará was a practical response to the need to modernize the bank’s IT infrastructure. The project contributed to the elimination of manual processes, reduction of operational time, and increased visibility over critical routines. It also enabled the monitoring of a greater number of services, with greater agility in identifying failures and supporting decision-making.

In conclusion

With the current structure, Banpará now has a more integrated monitoring system, adjusted to operational demands and with the capacity to monitor the evolution of the bank’s activities in an organized and secure manner.

To learn more about what Zabbix can do for customers in banking and finance, visit our website.

The post Optimizing Financial Routines and Infrastructure with Banpará appeared first on Zabbix Blog.

Building HA Zabbix with PostgreSQL and Patroni

2025-09-16 Patrik Uytterhoeven

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/building-ha-zabbix-with-postgresql-and-patroni/30960/

Running a monitoring platform like Zabbix in a production environment demands reliability and resilience. When your monitoring solution is down, you’re flying blind – and for many organizations, that simply isn’t acceptable. This post introduces a robust high-availability (HA) architecture for Zabbix, using PostgreSQL, Patroni, etcd, HAProxy, Keepalived and PgBackRest. Built on RHEL 9 or derivatives, this solution combines modern open-source tools to provide automatic failover, load balancing, and seamless monitoring, all while maintaining consistency and performance.

Architecture overview

The HA design consists of multiple layers working in tandem to maintain continuity even during node or service failures:

Database Cluster Layer

2 or more nodes form the PostgreSQL cluster, managed by Patroni and coordinated using etcd. At any given time, one node is the primary (read/write), and the others are hot standbys ready to take over automatically.

Consensus layer

etcd runs on the same nodes and acts as the distributed configuration store and coordination layer for Patroni. It ensures a consistent cluster state and enables safe failover decisions.

Load balancing layer

Two HAProxy nodes provide a single point of entry for all clients (including Zabbix), routing requests to the current PostgreSQL primary. These nodes are monitored and coordinated via Keepalived to maintain a floating Virtual IP (VIP), ensuring seamless failover at the connection layer.

Backup layer

A separate backup server is responsible for running PgBackRest, which handles full and incremental backups, WAL archiving, and Point-In-Time Recovery (PITR). This server communicates securely with all database nodes over SSH.

Monitoring layer

Two Zabbix servers, running in active-passive mode, continuously monitor all layers of this stack, including HAProxy health, Patroni cluster role, and etcd status, and use the PostgreSQL VIP for backend connectivity.

This multi-tiered setup ensures that no single failure, be it a database, load balancer, or monitoring server, brings down the monitoring platform.

Why HA matters for Zabbix

Zabbix depends heavily on its PostgreSQL database backend. Every metric, trigger, event, and alert is stored there. If PostgreSQL becomes unavailable, even briefly, data loss or monitoring blind spots can occur. That’s why introducing HA at the database layer is a crucial step when scaling Zabbix for enterprise environments.

While Zabbix itself supports HA at the application level, this architecture ensures that the database backend is also fully fault-tolerant, using modern consensus-based clustering with automatic failover.

Component overview

To achieve HA, we bring together several specialized components, each fulfilling a critical role in the system:

PostgreSQL

The relational database engine used by Zabbix. In this example setup, it runs on three nodes, forming a cluster managed by Patroni.

Patroni

Patroni is the orchestrator for the PostgreSQL cluster. It monitors node health, manages replication, promotes standbys when needed, and ensures only one writable leader exists at any time. Patroni leverages a distributed consensus store (etcd in this case, though other DCSs are possible) to coordinate decisions across the cluster.

etcd

etcd is a lightweight and highly available key-value store used by Patroni to maintain the cluster’s state. It stores leader election data, health statuses, and locks. We deploy it as a three-node cluster, co-located with the PostgreSQL nodes for convenience, though etcd can be moved to its own nodes and scaled independently if needed, as it is very sensitive to latency.
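The choice of three etcd members follows from majority-based consensus. The short Python sketch below works out the quorum size and the number of member failures a cluster of a given size can survive. The function names are made up for illustration.

```python
def quorum_size(members):
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members):
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


for n in (1, 2, 3, 5):
    print(n, quorum_size(n), failures_tolerated(n))
```

A two-member cluster tolerates no failures at all, so it is no more resilient than a single member. Three is the smallest size that survives the loss of one node.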

HAProxy

To simplify application connectivity, HAProxy acts as a load balancer in front of the database cluster. It monitors the role of each node using Patroni’s REST API and routes connections to the active primary server. If the leader fails, HAProxy automatically reroutes traffic to the new primary.
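The routing decision can be sketched as follows. It relies on the behaviour of Patroni’s REST API, where a request to the `/primary` endpoint is answered with HTTP 200 by the current leader and with 503 by other nodes. The node names and the already-collected status codes are assumptions for illustration, and the sketch does not perform any HTTP requests itself.

```python
def pick_primary(check_results):
    """Return the node whose /primary health check answered 200.

    check_results maps node name -> HTTP status code of the last check.
    Anything other than exactly one 200 is treated as 'no safe target'.
    """
    primaries = [node for node, code in check_results.items() if code == 200]
    return primaries[0] if len(primaries) == 1 else None


# Hypothetical check results before and after a failover.
before = {"pg-node-1": 200, "pg-node-2": 503, "pg-node-3": 503}
after = {"pg-node-1": 503, "pg-node-2": 200, "pg-node-3": 503}
during = {"pg-node-1": 503, "pg-node-2": 503, "pg-node-3": 503}

print(pick_primary(before), pick_primary(after), pick_primary(during))
```

During a leader election no node answers 200, so there is briefly no backend to route to. Clients see connection errors until a new leader is promoted, and then reconnect through the same address.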

Keepalived

Keepalived provides a floating virtual IP address (VIP) across the HAProxy nodes. This VIP allows client systems, such as the Zabbix frontend, to connect to a single stable IP even if one HAProxy node fails.

PgBackRest

To protect the data itself, we use PgBackRest for full and incremental backups, as well as Point-In-Time Recovery (PITR). A dedicated backup server is included to pull and store archive logs and backups securely via SSH.

Zabbix server

Finally, we run two Zabbix servers in active-passive mode. Both are configured to connect to the PostgreSQL cluster through the VIP exposed by HAProxy. The Zabbix frontend is deployed on both nodes as well, ensuring continued accessibility through the load-balanced setup.

Topology at a glance

Here’s a simplified view of the architecture:

2 or more database nodes (PostgreSQL + Patroni + etcd)
Two HAProxy nodes, each configured with Keepalived to manage a floating virtual IP
One backup node for PgBackRest
Two Zabbix servers pointing to the PostgreSQL VIP

All systems are tied together with consistent hostname mappings, time synchronization (Chrony), and service monitoring.

Notes:

PgBackRest is directly connected to all three PostgreSQL nodes, allowing it to archive WAL segments and pull backups regardless of which node is primary.
This design enables full standby backups and supports Point-In-Time Recovery (PITR).
HAProxy ensures Zabbix always talks to the current primary node, while Patroni and etcd handle automatic failover and cluster state management.

Design rationale

This setup prioritizes resilience and self-healing. If any single component fails (a database node, a load balancer, or even a monitoring server), the system continues to function.

Using Patroni with etcd ensures that failovers are handled automatically, without human intervention. HAProxy ensures client traffic is always routed to the current primary, while Keepalived ensures that this routing layer itself is highly available.

We opted for PgBackRest over simple scripts or base backups because it provides not just efficient incremental backups, but also full WAL archiving and point-in-time recovery, which are invaluable for both disaster recovery and debugging.

Lastly, we chose to integrate Zabbix itself into this HA design, treating it not just as an application but as a fully resilient service, able to monitor itself, so to speak.

Real-world considerations

Resource planning: While our nodes run comfortably, scaling this setup to heavy workloads requires careful tuning of memory, I/O, and PostgreSQL parameters.
etcd placement: Although we run etcd co-located with the database nodes in this example, separating etcd onto dedicated infrastructure is ideal for large-scale environments. This avoids resource contention and preserves quorum in extreme failure scenarios.
Monitoring the monitors: Zabbix itself must be monitored. In our setup, each component, including etcd, Patroni, and PostgreSQL, exposes health endpoints that can be used by Zabbix agents or scripts to generate alerts on replication lag, cluster health, and failover events.

Conclusion

This architecture provides a solid foundation for running Zabbix in a fault-tolerant, production-ready environment. It not only ensures high availability for the database layer but also offers flexibility, observability, and operational safety.

Whether you’re running internal infrastructure monitoring or offering Zabbix as a managed service, adopting this type of HA setup removes single points of failure and gives you peace of mind — all using open-source technologies that are battle-tested and widely supported.

If you need assistance with the migration or want to ensure best practices for scaling and optimizing Zabbix, don’t hesitate to reach out to OICTS. We are a Zabbix Premium Partner operating globally, with offices in the USA, UK, Netherlands, and Belgium, and we’re ready to help you every step of the way.

The post Building HA Zabbix with PostgreSQL and Patroni appeared first on Zabbix Blog.

Zabbix at the Zhongnan University of Economics and Law

2025-08-14 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/zabbix-at-the-zhongnan-university-of-economics-and-law/30949/

Zhongnan University of Economics and Law (ZUEL), located in Wuhan City, Hubei Province, China, is a key university with two campuses – Nanhu and Shouyi. The school boasts over 20,000 full-time undergraduate students, more than 8,800 graduate students, and over 2,500 faculty and staff members. ZUEL enjoys an outstanding reputation in the fields of law and economics, with four national key disciplines. Its law discipline, meanwhile, has been included in the list of national “Double First-Class” disciplines.

The challenge

As the information infrastructure at ZUEL continues to expand, the scale of the university’s IT infrastructure has rapidly grown to encompass power systems, dynamic environmental systems, servers, network devices, security appliances, storage systems, virtualization platforms, operating systems, databases, data lakes, and campus application systems.

At the same time, the daily academic and administrative activities of faculty and students increasingly demand higher levels of stability and reliability from information systems. To ensure the efficient operation of these systems, the Information Management department needed a monitoring and management system that could cover the entire university’s IT resources and address the growing complexities of operational maintenance.

The university found that traditional monitoring and management systems often fall short when faced with such large-scale and diverse monitoring demands, revealing problems like insufficient monitoring points, poor real-time capabilities, and limited scalability. To address these challenges, the university decided to adopt Zabbix 7.0 and develop a custom IP Radar platform to further meet its refined operational maintenance needs.

The solution

When combined with Zabbix 7.0, the IP Radar system can achieve comprehensive monitoring and management of the university’s entire IT infrastructure through the integrated application of multiple monitoring protocols and technologies. Specifically, the system collects data and performs monitoring with the help of the following core technologies:

Zabbix 7.0. As an enterprise-level open-source monitoring platform renowned for its robust data collection and analysis capabilities, Zabbix enhances the system’s high availability, supporting large-scale concurrent processing to make sure that the monitoring system remains stable and delivers uninterrupted service even under heavy loads.
Parallel monitoring with multiple protocols. The system collects data through a variety of protocols, including Agent, SNMP, IPMI, MODBUS, MQTT, and more, enabling the real-time monitoring of a wide variety of IT hardware.
High-availability design. To accommodate the monitoring demands of a massive number of devices and thousands of users, the Zabbix 7.0 platform supports multi-node deployment and redundancy design, enabling load balancing and failover among proxy servers. Even in the event of a node failure, the system maintains uninterrupted monitoring services, and it’s also equipped with an automated fault alerting and repair mechanism.
The self-developed IP Radar platform. To meet a demanding set of operation and maintenance management needs, ZUEL has developed the IP Radar system based on the Zabbix 7.0 platform, further customizing its business monitoring capabilities. IP Radar not only conducts real-time monitoring of the IT infrastructure, but it also provides detailed performance analysis reports and trend predictions, while integrating behavior monitoring capabilities to enhance the school’s network security management.

The IP Radar platform itself contains a variety of unique and innovative features, including:

Comprehensive monitoring coverage. The IP Radar system monitors over a million items – everything from hardware devices to application systems, affecting everything from network performance to user experience. This extensive coverage gives the Information Management department a comprehensive understanding of the operational status of the school’s IT resources while providing sufficient data support for troubleshooting and performance optimization.
Customized monitoring strategies. Compared to traditional monitoring systems, IP Radar offers highly customized monitoring strategies. ZUEL can tailor different business dashboards for networks, computing resources, user experience, data center environments, and more, based on its own needs and the permissions granted to operation and maintenance personnel. Depending on different monitoring thresholds and alerting strategies, the system can automatically generate alerts and notify relevant personnel through enterprise WeChat, SMS, and other channels.
Intelligent alerting and automated handling. The intelligent alerting system of the IP Radar platform leverages machine learning algorithms to analyze historical monitoring data, enabling it to predict potential fault risks and issue early warnings. At the same time, the system integrates automated operation and maintenance capabilities, which allow it to automatically execute predetermined repair operations when certain common faults occur, reducing the time and cost of manual intervention.
Network security monitoring. In terms of network security, the IP Radar system is capable of identifying abnormal traffic patterns and promptly detecting potential security threats through real-time analysis of the school’s entire network traffic. The system also supports the monitoring of online behavior to ensure that network access activities comply with the school’s security policies.
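The source does not describe the algorithms IP Radar uses for trend prediction, so the following is only a sketch of the general idea: fit a straight line to historical samples and estimate when it will cross a threshold. The function name and the sample values are hypothetical and are not taken from ZUEL's implementation.

```python
from typing import Optional, Sequence


def seconds_until_threshold(
    times: Sequence[float], values: Sequence[float], threshold: float
) -> Optional[float]:
    """Fit value = slope * t + intercept by least squares and extrapolate.

    Returns the number of seconds after the last sample at which the fitted
    line reaches the threshold, or None if that never happens in the future.
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) pairs")

    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")

    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / var_t
    intercept = mean_v - slope * mean_t
    if slope == 0:
        return None  # flat trend: the threshold is never reached

    remaining = (threshold - intercept) / slope - times[-1]
    return remaining if remaining > 0 else None


if __name__ == "__main__":
    # Hypothetical disk usage samples (percent), one per hour.
    print(seconds_until_threshold([0, 3600, 7200], [70.0, 75.0, 80.0], 90.0))
```

Zabbix itself ships trigger functions such as forecast and timeleft for this kind of linear extrapolation, so a custom script like this is mainly useful for experimenting with the idea outside the platform.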

The results

After implementing the Zabbix-based system, ZUEL was able to measure a wide range of monitoring performance improvements, including:

Improved operational and maintenance efficiency. Through the IP Radar system, the school’s Information Management department has been able to monitor the operational status of over 28,000 hosts in real-time, significantly enhancing operational efficiency. The system’s automated fault handling capabilities reduce the complexity of manual operations, allowing operations and maintenance personnel to focus on addressing only the complex issues that the system is unable to resolve automatically. At the same time, the system’s intelligent alerting feature enables the early detection of potential problems, preventing sudden failures.
Enhanced system stability and reliability. The high-availability design of Zabbix 7.0 ensures that the system remains stable even under heavy loads. Its redundant design and automatic failover mechanisms guarantee the reliability of the system, and the trend analysis functionality provided by IP Radar helps administrators identify factors that may affect system stability in advance and make corresponding adjustments, enhancing the overall reliability of the IT system in the process.
Advancing detailed information management. The IP Radar platform lets schools manage multiple IT resources with greater precision. The system not only monitors the operational status of hardware devices, but it also analyzes the performance of business systems, helping administrators to optimize system configurations and enhance user experiences. During project development, historical data from the monitoring platform serves as an essential basis for decision-making. In the acceptance phase, the monitoring platform provides evaluation reference data for operational efficiency and stability.

The IP Radar monitoring and management system developed by ZUEL and based on Zabbix 7.0 has become the largest, most widely used, and most effective system of its kind (in terms of the volume of monitored data) in the Chinese education sector. The successful implementation of this system not only provides strong support for the school’s information management, but it also offers valuable references for information operation and maintenance at other universities.

In conclusion

Looking ahead, the IP Radar system is poised to expand its functionalities further by integrating more intelligent operation and maintenance management tools. Through the introduction of emerging technologies such as big data analysis and artificial intelligence, the system will achieve more breakthroughs in areas like automated operation and maintenance as well as intelligent fault prediction, providing even more comprehensive technical support for the university’s information management.

To learn more about what Zabbix can do for educational institutions, visit our website.

The post Zabbix at the Zhongnan University of Economics and Law appeared first on Zabbix Blog.

Running Zabbix with PostgreSQL and PG Auto Failover

2025-08-12 Patrik Uytterhoeven

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/running-zabbix-with-postgresql-and-pg-auto-failover/31026/

Running a monitoring platform like Zabbix in a production environment requires bulletproof availability at the database layer. Any downtime in PostgreSQL, even for seconds, can disrupt monitoring visibility, triggering blind spots in alerts and data collection.

This post introduces a streamlined High-Availability (HA) architecture for Zabbix using PostgreSQL, pg_auto_failover, HAProxy, and PgBackRest. Built on RHEL 9 or derivatives, this architecture removes single points of failure and automates failover using minimal external dependencies, making it a strong candidate for modern observability backends.

Architecture overview

This HA design simplifies deployment by using a dedicated monitor node to orchestrate automatic failover between two PostgreSQL database nodes. With pg_auto_failover, we avoid the need for complex consensus layers like etcd or Consul while still achieving fast, reliable failover and recovery.

Database layer

Two PostgreSQL nodes are deployed in a primary/secondary configuration. These nodes are registered with a dedicated pg_auto_failover monitor, which continuously checks node health and replication status. In the event of a failure, the monitor promotes the secondary to primary with no manual intervention.

Each node is securely configured using scram-sha-256 authentication and SSL certificates (either self-signed or your own) to ensure encrypted communication within the cluster.

Monitor node (Arbiter)

The monitor node is a lightweight PostgreSQL instance that runs the pgautofailover extension. It holds state information about all participating nodes and acts as the arbiter during failover events. It requires only one node, reducing complexity compared to consensus-based DCS (Distributed Configuration Store) systems like etcd or ZooKeeper.

Load balancing layer

Two HAProxy nodes route all client (Zabbix) connections to the current PostgreSQL primary. A lightweight HTTP service on each DB node reports its current role (primary or not) and allows HAProxy to determine which node is writable. These proxies are kept highly available using Keepalived, which manages a shared Virtual IP (VIP) across both proxy servers.

This way, applications like Zabbix always connect to a stable endpoint, even during failover events.
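The role-check service described above can be sketched in a few lines. This is a minimal illustration and not the service used in the original setup: the handler, the response bodies, and the check_recovery callable are assumptions. In a real deployment the callable would run SELECT pg_is_in_recovery() through a PostgreSQL driver. Port 9877 is the example health check port mentioned later in this post.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple


def role_response(in_recovery: bool) -> Tuple[int, str]:
    """Map the PostgreSQL recovery state to an HTTP status for the proxy."""
    # pg_is_in_recovery() returns false on a writable primary, true on a standby.
    return (503, "standby") if in_recovery else (200, "primary")


def make_handler(check_recovery: Callable[[], bool]):
    """Build a request handler around a callable that reports recovery state."""

    class RoleHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                status, body = role_response(check_recovery())
            except Exception:
                # If the database cannot be queried, never claim to be primary.
                status, body = 503, "unknown"
            payload = body.encode()
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # keep health check polling out of the logs

    return RoleHandler


def serve(check_recovery: Callable[[], bool], port: int = 9877) -> None:
    """Serve role checks until interrupted (port number is an example)."""
    HTTPServer(("0.0.0.0", port), make_handler(check_recovery)).serve_forever()
```

With a service like this in place, an HTTP health check in HAProxy that expects status 200 would mark only the current primary as available, which is what keeps writes away from the standby.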

Backup layer

Backups are handled using PgBackRest, deployed on a dedicated backup server. This server connects to both PostgreSQL nodes over SSH and performs the following:

Full and incremental backups
WAL archiving
Point-In-Time Recovery (PITR)

Passwordless SSH and proper pgbackrest.conf mappings are set up to support seamless interaction regardless of which node is currently primary.

Component overview

Component	Role
PostgreSQL	Relational backend storing all Zabbix metrics, alerts, events
pg_auto_failover	Ensures continuous availability by promoting replicas automatically
Monitor Node	Decides failover based on health checks and cluster state
HAProxy	Routes client traffic to the current primary
Keepalived	Provides VIP failover between HAProxy nodes
PgBackRest	Performs PITR-capable backups from any node
Zabbix Server	Connects to PostgreSQL via VIP to ensure continuity

Topology at a glance

Design

Unlike Patroni, which requires a distributed configuration store like etcd, pg_auto_failover uses a dedicated monitor node that simplifies orchestration. This setup reduces the operational burden while still delivering robust failover, automatic reconfiguration, and synchronization safeguards, including:

synchronous_standby_names to enforce replication integrity
Service integration with systemd for reliable restarts
Failover detection with minimal latency

This design also ensures SSL-enabled encrypted communication, self-healing role changes, and full observability using Zabbix itself, which can be configured to monitor the PostgreSQL cluster through exposed health endpoints.

Real-world considerations

Upgrade Planning: The pg_auto_failover version in RPM repos may lag behind the latest upstream features like set_monitor_setting. Pin the package version if consistency is required.
Network Security: Only HAProxy nodes are allowed to query the internal role-check API on the DB nodes using custom firewall rules.
Cluster Hygiene: Always clean up config folders (~postgres/.config/pg_autoctl/…) if a node is misconfigured or needs to rejoin.
SELinux: Configure SELinux with semanage and audit2allow to allow custom ports (e.g., 9877 for health checks).
Hybrid Logging: Set up PostgreSQL to log to both journald and traditional log files via stderr + logging_collector.

Conclusion

This architecture strikes a balance between simplicity and resilience. While Patroni is great for large-scale, multi-region setups requiring distributed consensus, pg_auto_failover offers a lighter-weight solution that covers most enterprise needs without complex dependencies.

By layering the following…

PostgreSQL 17
pg_auto_failover with a single monitor
HAProxy + Keepalived for VIP failover
PgBackRest for backups

…you can then confidently run Zabbix in a highly available and secure fashion with minimal operational overhead.

If you’re considering implementing this setup or migrating from a single-node database backend, reach out to Opensource ICT Solutions, a Zabbix Premium Partner with global presence in the USA, the UK, the Netherlands, and Belgium. We can help you architect, deploy, and monitor Zabbix environments that scale with your needs.

The post Running Zabbix with PostgreSQL and PG Auto Failover appeared first on Zabbix Blog.

Keeping Latvia Connected with Zabbix and LMT

2025-07-22 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/keeping-latvia-connected-with-zabbix-and-lmt/30834/

LMT is a mobile GSM/UMTS/LTE operator in Latvia. Founded on January 2, 1992, it was the first mobile network operator in the country. In addition to providing mobile network and ISP services, LMT uses innovative technologies and solutions to develop and maintain a variety of IT solutions for public and private organizations. Currently, LMT is the largest telecommunications service provider in the country, with over 1,660 base stations and over 1.5 million users as of 2024.

The challenge

LMT utilizes a variety of monitoring solutions for a variety of purposes – from tools performing and monitoring ping responses to vendor-specific solutions and all-in-one tools such as Zabbix. LMT has 2 data centers, and since the vast majority of services delivered by LMT can be considered critical, most of the relevant infrastructure is duplicated across them.

Multiple Zabbix instances are used in the environment, including Zabbix 5.0 with a MySQL database backend and Zabbix 7.0 with PostgreSQL and TimescaleDB. Over 3,000 hosts with approximately 500,000 items are monitored by Zabbix.

The solution

Here is one example of how Zabbix is used to monitor switch cabinets in LMT data centers. Switch cabinets contain devices to measure the electric current, which support Modbus protocol and which can in turn be used to collect data.

Modbus monitoring was achieved by using Zabbix agent2 with the official Modbus plugin. This was combined with NetBox and GraphQL. NetBox was used as the source of truth, providing information about power feed and various electrical characteristics, such as voltage, amperage, utilization, phase, and more. The data was collected from NetBox via HTTP agent checks and GraphQL, and a JSON result was created by utilizing Zabbix preprocessing features.

The information collected from NetBox is combined with Modbus data collection utilizing Zabbix agent2. The data collected by Zabbix agent2 is preprocessed after the collection. The collected data is normalized and used by Zabbix low-level discovery features to automatically create Zabbix items and triggers for the available resources. Finally, the resulting data is visualized on Zabbix dashboards.
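To make the normalization step more concrete, here is a rough sketch of turning power feed records into the JSON array that Zabbix low-level discovery consumes. LMT did this with Zabbix preprocessing rather than an external script, so the function, the input field names, and the {#FEED.*} macro names below are illustrative assumptions.

```python
import json
from typing import Dict, List


def to_lld(feeds: List[Dict[str, object]]) -> str:
    """Convert power feed records into a Zabbix low-level discovery JSON array.

    Each record becomes one discovered entity; LLD macro names use the
    {#NAME} form so item and trigger prototypes can reference them.
    """
    rows = []
    for feed in feeds:
        rows.append(
            {
                "{#FEED.NAME}": feed["name"],
                "{#FEED.VOLTAGE}": feed["voltage"],
                "{#FEED.AMPERAGE}": feed["amperage"],
                "{#FEED.PHASE}": feed["phase"],
            }
        )
    return json.dumps(rows)


if __name__ == "__main__":
    # Hypothetical records, shaped like a simplified inventory query result.
    sample = [
        {"name": "cab-01-feed-a", "voltage": 230, "amperage": 16, "phase": "single-phase"},
        {"name": "cab-01-feed-b", "voltage": 230, "amperage": 16, "phase": "single-phase"},
    ]
    print(to_lld(sample))
```

Item and trigger prototypes can then use the discovered macros, for example to compare a measured current against the amperage recorded in the source of truth.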

The results

Monitoring with Zabbix has made reacting to changes in the monitored power feed (detecting spikes, observing gradual power feed changes, etc.) a much simpler proposition for LMT, which in turn improves service for its millions of users.

In conclusion

Zabbix has proven itself to be an ideal solution for telecommunications clients, making it easier than ever to keep track of network health and performance, driving a more positive customer experience and greater revenue growth in the process.

To learn more about what Zabbix can do for customers in telecommunications, get in touch with us.

The post Keeping Latvia Connected with Zabbix and LMT appeared first on Zabbix Blog.

The ATS Group and a Regional Telecom Provider

2025-02-14 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/the-ats-group-and-a-regional-telecom-provider/29671/

Our Premium Partners at the ATS Group have a regional telecom provider on the West Coast of the United States as one of their key clients. The provider covers a massive geographical area on a limited budget and serves thousands of (primarily rural) customers.

The Challenge

After recent price hikes by the “big-box” monitoring solutions, the provider needed an alternative with a more stable pricing model. Simply put, their budget was shrinking, but their software monitoring costs were expanding.

The provider had a large stock of non-traditional IT equipment that all needed to be monitored effectively, and they also had only one month to get all monitored devices and endpoints over to a new solution.

On top of that, many of the provider’s legacy systems were directly related to regulatory compliance and therefore needed to be operational from day one.

The Solution

The provider set about migrating to a complete and robust Zabbix 7.0 solution that would eliminate any foreseeable issues – even the loss of an entire data center.

There were a few initial hiccups in the implementation when it came to getting PostgreSQL set up with database proxies, but the ATS Group team quickly arrived at an architecture that the provider was happy with. The clear and easy-to-follow Zabbix documentation was of particular help.

The Results

The new Zabbix solution, as implemented, was able to monitor a number of things that had previously been challenging, including:

• Doors. The provider badly needed a solution for monitoring doors, including entrance and exit doors as well as cabinet doors in data centers. They had long-term compliance issues with doors sticking open, employees forgetting to close doors, etc. Zabbix made it easy to develop custom SNMP traps that send alerts in case of open doors, solving the issue.

• Weather. The provider’s services are available over a large and varied geographical area that encompasses multiple states. The ability of Zabbix to predict weather changes across this area has been an important added bonus, with the provider now being able to get future weather alerts that can be used to compare against equipment tolerance levels. Personnel can then be sent to affected areas in anticipation of weather events, instead of being purely reactive.

• SLAs. The provider functions as an ISP that provides internet access to customers in rural areas, many of whom may not have other means of accessing the world around them. As such, they not only feel a strong sense of duty to provide consistent uptime, but they are bound by a strict set of service level agreements (SLAs). With Zabbix, it’s possible to provide SLAs for some of the remote edge equipment involved by building an integration with ServiceNow.

In conclusion

The telecom provider in question trusts Zabbix to guarantee rural broadband access for thousands of customers over an enormous geographic area. Zabbix not only gets the job done more effectively than other monitoring solutions, it does so at a fraction of the cost.

The post The ATS Group and a Regional Telecom Provider appeared first on Zabbix Blog.

How a Custom Zabbix Solution Maximized Efficiency for an MSP

2024-09-26 Kristy Slimmer

Post Syndicated from Kristy Slimmer original https://blog.zabbix.com/how-a-custom-zabbix-solution-maximized-efficiency-for-an-msp/28810/

Discover how our partners at ATS Group designed and implemented a custom Zabbix solution that allowed a large managed service provider (MSP) to monitor and manage a vast array of client devices across multiple data centers.

The Challenge: Addressing Infrastructure Monitoring Complexities

When a federal government contractor specializing in IT managed services secured a contract to manage the infrastructure for a large federal agency, they faced a daunting challenge: how to effectively monitor and manage the vast array of devices under their purview using a single, comprehensive solution.

Real-time monitoring and immediate alerts for any issues were non-negotiable requirements. The sheer scale and complexity of the infrastructure demanded a robust monitoring system capable of providing insights across multiple data centers and diverse technologies.

The MSP, aware of the need for a trusted and experienced partner, turned to ATS Group to tackle the complexities of its observability and management challenge.

The Solution: Architecting a Custom Zabbix Solution

ATS Group, North America’s exclusive Zabbix Premium Partner, brings over two decades of experience in monitoring and optimizing enterprise IT environments. ATS Group architected and implemented a custom solution that leveraged Zabbix’s flexibility and scalability, demonstrating their deep knowledge of the technology and ability to handle complex challenges.

The ATS team deployed an on-premise Zabbix Server, accompanied by Zabbix Proxy Servers placed in each data center. This distributed architecture was a key factor in ensuring seamless monitoring across geographically dispersed environments while minimizing latency, a critical factor in managing such a vast infrastructure.

Custom Zabbix Solutions delivered by the ATS Group

From there, ATS implemented various Zabbix customizations that were integral to meeting the agency’s unique and diverse infrastructure needs, including developing templates and integrations.

Templates. ATS developed numerous templates covering a broad spectrum of technologies (including OpenShift, VMware, Dell, HP, Cisco UCS, Hitachi, NetApp, Pure, Brocade, Commvault, Linux, and Windows) to provide comprehensive monitoring capabilities tailored to the specifics of each component, ensuring a detailed view of the entire infrastructure stack.

Integrations. ATS built customized integrations for several third-party products. An integration with OpenShift allowed for alerts configured within OpenShift to be directly ingested and processed by Zabbix. The integration with VMware allowed Zabbix to detect when an administrator put a host in maintenance in VMware, automatically creating a maintenance period for that host within Zabbix to eliminate unwanted alerts while the host was being serviced.

Finally, integrations with ServiceNow and Operations Bridge Manager (OBM) enabled streamlined incident management workflows, ensuring that issues were promptly detected, triaged, and addressed with minimal manual intervention – and the proper stakeholders were notified within the customer and service provider organizations.
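As an illustration of the VMware maintenance integration, the sketch below builds the kind of JSON-RPC request that could create a one-time maintenance period through the Zabbix API. It is not the ATS implementation. The parameter names follow the maintenance.create method as documented for recent Zabbix versions and should be verified against the API reference for your release; the host ID and the maintenance name are hypothetical.

```python
import json
from typing import Dict, List


def maintenance_request(
    host_ids: List[str], name: str, start: int, duration_s: int, request_id: int = 1
) -> Dict[str, object]:
    """Build a JSON-RPC body for a one-time Zabbix maintenance period.

    start is a Unix timestamp; duration_s is the length in seconds.
    """
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": name,
            "active_since": start,
            "active_till": start + duration_s,
            "hosts": [{"hostid": host_id} for host_id in host_ids],
            "timeperiods": [
                {
                    "timeperiod_type": 0,  # 0 = one-time period
                    "start_date": start,
                    "period": duration_s,
                }
            ],
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # Hypothetical host ID and window: one hour starting at a fixed timestamp.
    body = maintenance_request(["10084"], "VMware maintenance: esx-01", 1700000000, 3600)
    print(json.dumps(body, indent=2))
```

The request would be sent as an HTTP POST to the api_jsonrpc.php endpoint of the Zabbix frontend with an API token for authentication. The integration would also need to remove or expire the maintenance period once the host leaves maintenance mode in VMware.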

Trigger Actions. ATS implemented custom trigger actions to automate responses to predefined events. Whether restarting a service upon failure or executing remediation scripts, these trigger actions helped maintain system stability, minimize downtime, and reduce the workload (and callouts in the middle of the night!) for system administrators.

Dashboards. ATS designed custom dashboards to provide stakeholders with intuitive, real-time insights into the infrastructure’s health and performance. These dashboards served as a centralized hub, offering a comprehensive view of the entire environment with actionable insights to drive informed decision-making.

The Results

A custom Zabbix solution delivers visibility, streamlined monitoring, proactive management, and enhanced client satisfaction. The impact of the custom Zabbix solution was immediate and profound. By leveraging the power of Zabbix and the expert skill of the ATS team, the MSP gained unprecedented visibility and control over their client’s sprawling infrastructure. The benefits included:

Greater Operational Efficiency. With a unified view of the entire infrastructure and real-time alerts for any issues, our client experienced a significant improvement in operational efficiency. Proactive management and automated responses minimized downtime, allowing resources to be allocated more strategically.

Faster Incident Response. Issues were detected instantaneously, and relevant stakeholders were promptly alerted, enabling swift resolution and minimizing the impact on operations. This streamlined incident response mechanism reduced mean time to resolution (MTTR) and enhanced overall system reliability.

Increased Revenue. Delighted by the efficiency and effectiveness of our client’s management and monitoring capabilities, the end-user federal agency recognized the value of their partnership and expanded the scope of the contract.

This testament to our client’s success underscores the transformative impact of our solution, paving the way for further collaboration and growth opportunities. As a result, ATS Group and the managed services provider continue to expand their partnership and are solving complex infrastructure problems for numerous additional clients.

The post How a Custom Zabbix Solution Maximized Efficiency for an MSP appeared first on Zabbix Blog.

My Zabbix is down, now what? Restoring Zabbix functionality

2024-09-18 Aurea Araujo

Post Syndicated from Aurea Araujo original https://blog.zabbix.com/my-zabbix-is-down-now-what/28776/

We’ve all been in a situation in which Zabbix was somehow unavailable. It can happen for a variety of reasons, and our goal is always to help you get everything back up and running as quickly as possible. In this blog post, we’ll show you what to do in the event of a Zabbix failure, and we’ll also go into detail about how to work with the Zabbix technical support team to resolve more complex issues.

Step by step: Understanding why Zabbix is unavailable

When Zabbix becomes unavailable, it’s important to follow a few key steps to try to resolve the problem as quickly as possible.

Check the service status. First, verify if your Zabbix service is truly inactive. You can do this by accessing the machine where Zabbix is installed and checking the service status using a command like systemctl status zabbix-server on Linux.
Analyze the Zabbix logs. Check the Zabbix logs for any error messages or clues about what may have caused the failure.
Restart the service. If the Zabbix service has stopped, try restarting it using the appropriate command for your operating system. For example, on Linux, you can use sudo systemctl restart zabbix-server.
Check the database connectivity. Zabbix uses a database to store data and Zabbix server configurations. Make sure that the database is accessible and functioning properly. You can test database connectivity using tools like ping or telnet.
Check your available disk space. Verify that there is available disk space on the machine where Zabbix is installed. A lack of disk space is a common cause of system failures.
Evaluate dependencies. Make sure all Zabbix dependencies are installed and working correctly. This includes libraries, services, and any other software required for Zabbix to function.
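Several of the checks above are easy to script so they can be run in one go during an outage. The following is a minimal sketch for a Linux host with systemd. The function names, the database host, and the port are assumptions for illustration (5432 is the PostgreSQL default, 3306 the MySQL default), and a successful TCP connection only shows that the port is reachable, not that the database is healthy.

```python
import shutil
import socket
import subprocess


def service_is_active(unit: str = "zabbix-server") -> bool:
    """Return True if systemd reports the unit as active."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit], check=False
        )
    except FileNotFoundError:
        return False  # systemctl is not available on this machine
    return result.returncode == 0


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def disk_free_percent(path: str = "/") -> float:
    """Return the percentage of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


if __name__ == "__main__":
    print("zabbix-server active:", service_is_active())
    print("database port reachable:", tcp_port_open("127.0.0.1", 5432))
    print("free disk space: %.1f%%" % disk_free_percent("/"))
```

If the service is inactive but the database and disk checks pass, the Zabbix server log is usually the next place to look before attempting a restart.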

If the problem persists after carrying out these steps, it may be necessary to refer to the official Zabbix documentation, seek help from the official Zabbix forum, or contact the Zabbix technical support team, depending on the severity and urgency of the situation.

Making the most of a Zabbix technical support contract

If you or your company have a Zabbix technical support contract, access to our global team of technical experts is guaranteed. This is an ideal option for resolving more complex or urgent issues. Here are a few steps you can follow when contacting the Zabbix technical support team:

Gather all important information. Before contacting the Zabbix technical support team, gather all relevant information about the issue you’re facing. This can include error messages, logs, screenshots, and any steps you’ve already taken to try to resolve the issue.
Open a ticket with the Zabbix technical support team. Contact Zabbix technical support by opening a ticket on the Zabbix Support System. Provide all the information gathered in the previous step to help the technicians understand the problem and find a solution as quickly as possible.
Explain exactly how Zabbix crashed. When describing the problem, be as precise and detailed as possible. Include information such as the Zabbix version you are using, your operating system, your network configuration, and any other relevant details that might help our team diagnose the issue.
Be available to follow up on the ticket. Once you’ve opened a ticket, be available to provide additional information or clarify any questions the support technicians may have. This will help speed up the problem resolution process.
Follow the Zabbix technical support team’s recommendations. After receiving recommendations, follow them carefully and test to see whether they resolve the issue. If the problem persists or if new issues arise, inform the Zabbix technical support team immediately so they can continue assisting you.

A Zabbix technical support subscription gives you access to a team of Zabbix experts who can help you configure and troubleshoot your Zabbix environment. Check out the benefits of each type of subscription on the Zabbix website and make sure you have all the support you need to keep your monitoring fully operational.

The post My Zabbix is down, now what? Restoring Zabbix functionality appeared first on Zabbix Blog.

Making Patient Care Easier with Zabbix and Open-Future

2024-07-11 Brian van Baekel

Post Syndicated from Brian van Baekel original https://blog.zabbix.com/making-patient-care-easier-with-zabbix-and-open-future/28406/

The Antwerp University Hospital (UZA) is a university center known for top clinical and customer-friendly patient care, high-quality academic training, and groundbreaking scientific research with an important international dimension. The UZA has 593 hospital beds in 26 nursing units, as well as 41 highly specialized medical services where more than 800,000 patients are consulted every year. It employs over 4,000 people, including 642 doctors. Keep reading to see how Zabbix premium partner Open-Future rises to the challenge of monitoring this massive IT infrastructure.

The challenge

Due to the large number of users connecting on a daily basis, the UZA’s Zabbix server was set up as a virtual machine with a front-end separate from the Zabbix server and database. Splitting the front-end from the Zabbix server allows them to use dedicated resources for the front-end and the Zabbix server.

Most of the monitoring is done by Zabbix agents on Linux and Windows. In order for the applications to see if everything is working as it should be, the Open-Future team leverages UserParameters and database monitoring with Zabbix Agent 2. For some more specific monitoring cases, we also make use of custom SQL scripts.

Because one server can have multiple teams responsible for just the application or the OS, getting the correct information to the right team proved to be a challenge. A simple solution was the creation of different trigger actions for every team that included only the triggers that were needed. Unfortunately this proved to be very difficult to manage over time and error-prone when changes were needed.

The solution

By making extensive use of tags in Zabbix, our team could add labels to the items and link them back to the correct user groups. This made it easier to send the right information to the correct teams and allowed them to both drastically reduce the number of actions that had to be created and simplify the actions that were created.
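The routing logic behind this approach can be modeled in a few lines. In Zabbix itself this is configured with tag conditions on actions rather than code, so the sketch below only illustrates the idea; the tag name "team", the tag values, and the user group names are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical mapping from a "team" tag value to a Zabbix user group name.
GROUP_BY_TEAM: Dict[str, str] = {
    "os": "Linux administrators",
    "database": "Database administrators",
    "application": "Application owners",
}


def groups_to_notify(
    event_tags: List[Dict[str, str]], tag_name: str = "team"
) -> Set[str]:
    """Return the user groups whose team tag appears on the event."""
    groups = set()
    for tag in event_tags:
        # Event tags are name/value pairs; unrelated tags are ignored.
        if tag.get("tag") == tag_name and tag.get("value") in GROUP_BY_TEAM:
            groups.add(GROUP_BY_TEAM[tag["value"]])
    return groups


if __name__ == "__main__":
    event = [
        {"tag": "team", "value": "database"},
        {"tag": "scope", "value": "performance"},
    ]
    print(groups_to_notify(event))
```

Because the team is carried by the tag and not by the action, one generic action per team is enough, and adding a new item only requires tagging it correctly.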

The results

Zabbix has proven itself as a powerful and versatile monitoring and management platform that allows our team to gain real-time insight into the performance of the UZA’s IT infrastructure and applications. Zabbix’s ability to collect and visualize various types of data (including network traffic, server load, application performance, and more) makes it easy to identify and resolve issues before they impact operations or patient care.

At present, Open-Future monitors about 1,400 hosts, a mix of Windows, Linux, and bare metal monitored by proxies. This allows us to monitor more than 10,000 metrics with more than 55,000 triggers to notify us in case of any potential issues. We make use of custom templates, plugins, and scripts to gather all needed information.

The impact of Zabbix on our operational efficiency cannot be overstated. Automated alerts and reporting functionality let us respond quickly to incidents and issues, which reduces downtime and maximizes the availability of critical systems. This has direct benefits for the UZA’s patients, as we can make sure that vital systems like electronic medical records are always available and that the quality of care is maintained at the highest level.

The post Making Patient Care Easier with Zabbix and Open-Future appeared first on Zabbix Blog.

Case Study: Monitoring with Zabbix and AI

2024-05-23 Aurea Araujo

Post Syndicated from Aurea Araujo original https://blog.zabbix.com/case-study-monitoring-with-zabbix-and-ai/28045/

Artificial intelligence (AI) and data monitoring are working together to digitally transform relationships, businesses, and people. In telecommunications, predictive analysis based on data collection plays a crucial role in development. Starting with version 6.0 of Zabbix, users have benefited from updates in predictive functions and machine learning, which make it possible for them to study the data monitored by Zabbix and integrate it with AI modules.

Danilo Barros, co-founder of Lunio (a Zabbix Certified Partner in Brazil), presented the results of using Zabbix combined with telecom data monitoring through AI and machine learning at Zabbix Conference Brazil in 2022. Keep reading to get the whole story!

The scenario

With over 600 OLTs (Optical Line Terminals – the fiber-optic infrastructure used by internet providers) as well as 400,000 customers across more than 800 cities and 20 states in Brazil, Lunio’s client manages a staggering amount of data. Monitoring that data is essential for smooth operations and to guarantee that there are no negative impacts on users and no overload for customer service agents in the event of incidents.

A primary challenge for telecom clients is the overload of calls to customer service in the event of massive network incidents. With so many customers, every precaution must be taken to avoid clogging phone lines during outages or service failures.

“You can’t achieve customer satisfaction under such circumstances, and the Net Promoter Score (NPS) drops drastically.”

Danilo Barros, co-founder of Lunio

Mapping needs

Considering the client’s operational structure, a series of customer needs were identified, focusing on six main points:

1. Automation: With notifications via digital channels for each event
2. Speed: Aiming for improved customer service
3. Operational costs: Budget optimization
4. Root cause analysis: Quick identification of the cause of events
5. Predictability: The ability to analyze problems and identify trends
6. Reporting: Identifying incidents and following regulations from ANATEL (National Telecommunications Agency)

With these interests in mind, it was possible to reassess the use of tools previously employed by the telecom client, which at the time served unique functions in the process. Each tool had its usage and information verification time, which could impact hundreds of users in a massive-scale incident. The key challenges identified by the Lunio team included:

Integrations: Systems needed to be interconnected
Integrity: Constant data updates
Topology: With system mapping through specific programs
Business rules: Respecting the development of local processes
Performance: The monitoring and automation of 600,000 assets
High availability: Dozens of data centers catering to local demand

Once the needs and challenges were identified, it was time to promote change within the client. By integrating systems and using Zabbix to monitor over 600,000 items, understand incidents, and predict potential future errors, the technical teams at Lunio created LunioAI, a “super attendant” with analytical and predictive capabilities as well as the ability to continuously learn.

“This guy (LunioIA) learns from each event, understanding each topology that occurs in the client’s network.”

Danilo Barros, co-founder of Lunio

In the initial response tests, LunioAI was able to analyze and evaluate massive events in a minute and a half. Over time, this was reduced to 30 seconds, making the return to the technical team increasingly swift and positively impacting incident resolution.

The results

Throughout the development and improvement of LunioIA, the operations chain was involved in predictive analyses of potential events on the network, providing technical professionals with the information needed to perform preventive maintenance on monitored items.

LunioIA considers data from integrated systems, FTTH (fiber to the home) environments, data centers, and items, all as part of the Zabbix monitoring environment. It can then diagnose events, understand the severity of an event, and find resolution points – without the need for human resources in the process.

As a result, when human attendants were contacted by customers experiencing difficulties with the service, instead of going through the entire process to understand what happened, the attendant could perform a search using the customer’s CPF (Individual Taxpayer Registry Identification) and then access a summary of the events, causes, and solutions identified by artificial intelligence combined with data monitoring through Zabbix.

In conclusion

This example happens to come from the telecommunications industry, but it’s not difficult to see how the ability of Zabbix to integrate the data it monitors with AI modules can benefit companies in almost any industry.

You can find out more about what we can do across a variety of industries by visiting our website or requesting a demo.

The post Case Study: Monitoring with Zabbix and AI appeared first on Zabbix Blog.

Keeping Remote Teams Connected: The Zabbix Advantage

2024-02-27 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/keeping-remote-teams-connected-the-zabbix-advantage/27551/

Remote teams may have exploded in popularity during the COVID-19 pandemic, but they’re not a phenomenon that’s likely to trend downward anytime soon. High-profile organizations like 3M, Dropbox, Shopify, and LinkedIn are continuing to enthusiastically embrace remote working, essentially making it the “default setting” for their employees.

The shift toward remote working is not without its challenges, however. Organizations of all sizes often have little time to set up the kind of networking infrastructure and efficient processes that make sure remote workers are just as connected and productive as their on-site counterparts. In this article, we’ll take a quick look at some of the most important network monitoring challenges that remote teams face and show how Zabbix can help you tackle them as efficiently as possible.

Infrastructure and connectivity issues

A remote network is essentially a grouping of multiple smaller network setups, each with their own set of variables that can affect performance. The differences between network system and infrastructure quality at different remote destinations can often lead to low overall network performance, which in turn makes it a challenge to provide the kind of high-speed communication needed to run the remote automation tools and software applications used by remote employees and teams.

By providing straightforward and easy-to-understand visibility into a network’s connected devices and how data moves between them, Zabbix makes it easy to automatically compare data and identify any drop in network performance.

With Zabbix, you can easily keep an eye on network routers and switches, especially the up/down status of internet provider and uplink ports. You can also monitor network latency, the error rate on ports, the packet loss to important devices, and network utilization on important ports with net.if.in/net.if.out. Here are some example triggers:

High Network Utilization: avg(/Router ABC/net.if.in[eth0],5m)>80M
High Packet Loss: avg(/Router ABC/icmppingloss,5m)>5
High Latency: avg(/Router ABC/icmppingsec,5m)>0.1
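
To make the logic of these triggers concrete, here is a minimal Python sketch of the same idea: average the samples collected over a window and compare the result with a threshold. It does not use Zabbix itself. The function name, the sample values, and the reading of the first threshold as 80 * 1024 * 1024 are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical thresholds mirroring the three example triggers above.
# The first value assumes the size suffix means 80 * 1024 * 1024.
THRESHOLDS = {
    "net.if.in[eth0]": 80 * 1024 * 1024,  # traffic threshold
    "icmppingloss": 5,                    # percent of lost packets
    "icmppingsec": 0.1,                   # seconds of round-trip time
}


def fired_triggers(samples):
    """Return the item keys whose window average exceeds its threshold.

    `samples` maps an item key to the values collected in the window
    (the role played by the 5m argument in the trigger expressions).
    """
    return sorted(
        key
        for key, values in samples.items()
        if values and mean(values) > THRESHOLDS[key]
    )


# Invented sample data for one five-minute window.
window = {
    "net.if.in[eth0]": [90e6, 95e6, 100e6],
    "icmppingloss": [0, 2, 4],
    "icmppingsec": [0.05, 0.2, 0.2],
}
print(fired_triggers(window))
```

With this sample data the traffic and latency averages exceed their thresholds, while the packet loss average stays below 5.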

What’s more, Zabbix allows you to create network maps with important network devices and real-time data, as well as dashboards with maps and single item/gauge widgets, all of which makes it far easier to achieve the uninterrupted connectivity that remote teams depend on.

Staying safe

Remote locations aren’t islands that can be completely isolated from external traffic. Staying vigilant and doing everything possible to eliminate data breaches is important, and taking advantage of strong encryption methods, network scanning tools, and firewalls to protect your systems is a good start. However, using a whole suite of tools to protect security can add more difficulty when it comes to integrating and monitoring them.

With Zabbix, you can count on enterprise-grade security, including encrypted communication between components, a flexible user permission schema that can be easily applied to a distributed environment, and custom user roles with a granular set of permissions for different types of users.

Zabbix also provides native support for HTTP, LDAP, and SAML authentication (which gives you an additional layer of security and improves your user experience while working with Zabbix), the ability to restrict access to sensitive information by limiting which metrics can be collected in your environment, and the ability to track changes in your environment by utilizing the Audit log. It’s all designed to make sure that there are no compromises on the security of your data when you decide to go remote.

Scalability

As a remote organization grows and its distributed systems expand, a good monitoring solution needs to be able to grow along with it in order to prevent gaps in coverage while maintaining performance and reliability. Zabbix gives you limitless scalability in the form of Zabbix proxies, which act as independent intermediaries that collect performance and availability data on behalf of a Zabbix server. You can roll out new proxies as fast as you need them, and because Zabbix is free and open source, you don’t have to worry about additional licensing costs.

Zabbix proxies allow you to see at a glance what resources are being used on your network at any given moment, which is especially handy if, like most remote teams, you have tens or even hundreds of servers and network appliances to monitor. You can also execute remote commands in remote locations – either on the proxies themselves or on the agents monitored by the proxy, and multiple frontends can be deployed for load balancing as well as for improved security and connectivity. Proxy docker containers and cloud options are available as well, enhancing flexibility and making Zabbix ideal for any organization that spans the globe (or aspires to).

Managing multiple solutions

The legacy software and systems you use were most likely designed to work in a traditional networking model. Remote working, as we’ve seen, presents a whole new range of challenges when it comes to compatibility and support.

We’ve created Zabbix to be as easy as possible to integrate with existing systems. You can easily monitor any operating system, cloud service, IP telephony service, docker container, or web server/database backend. We provide out-of-the-box monitoring for the world’s leading hardware and software vendors, and our extensively documented API makes it easy to create workflows and integrate with other systems. In addition, you can also integrate Zabbix with the most popular helpdesk, messaging, and ITSM systems, such as Slack, Jira, MS Teams, and many others.

Not only that, Zabbix is designed to serve as the ideal monitoring solution for multi-tenant environments. It serves as a single pane of glass for your entire infrastructure, and it’s easy to visualize everything that’s happening with your network with unique maps, dashboards, and templates.

Conclusion

The days of large teams all working together under the same roof are a thing of the past – the remote working trend will only accelerate as technology improves and employees get more accustomed to working with colleagues across multiple locations. That’s why it’s of paramount importance to make sure your monitoring solution has the built-in flexibility and scalability to grow with your team and your business.

If you want to see for yourself how Zabbix can help you effectively monitor a globally distributed network, contact us.

The post Keeping Remote Teams Connected: The Zabbix Advantage appeared first on Zabbix Blog.

Technical Support: The Zabbix Advantage

2023-10-19 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/technical-support-the-zabbix-advantage/26709/

If you’ve ever been part of a technical support team (or dealt with technical support as a customer, for that matter) you’re aware that there are as many different types of technical support teams as there are types of businesses.

However, there are a few best practices that all technical support teams share, no matter what industry they’re in. Read on to learn a bit more about them and see how our technical support team at Zabbix embodies each one.

Offer omnichannel technical support

Omnichannel support is the practice of providing support across every touchpoint that a customer uses to interact with your business. It’s not to be confused with multi-channel support, where teams work in silos and have little or no interaction.

Omnichannel support provides a unified experience across different channels, including email, phone, live chat, in-app chat, etc. Customers can start a conversation on any channel, at any time, and pick it up from where they left off on any other channel, any other time. Businesses can keep all customer data in the form of contacts inside a single platform, so that their support representatives can address issues with the proper context.

The goal of our technical support service at Zabbix has always been to provide responsive, dependable, quality support to resolve any issues regarding the installation, operation, and use of Zabbix.

Our specialists also leverage their skills, experience, and proximity to the design and development teams to pass along the kind of helpful hints, tips, and tricks that help customers get the most out of their Zabbix installation.

The backbone of our support delivery is the Zabbix Support System. Available to every Zabbix customer, it guarantees swift and easy communication between customers and our technical specialists.

Email and remote sessions can be used to communicate with Zabbix support at any time. Customers with our Global or Enterprise support tiers can access support services by phone or take advantage of on-site visits anywhere in the world by lead technical engineers.

No matter the channel, there’s no guesswork involved – all information is automatically entered into the support system to keep track of issues and resolutions.

Set realistic SLAs and stick to them

Service level agreements (SLAs) are critical to technical support performance. They help set clear expectations for both the service provider and the customer by outlining what services will be provided, how they will be delivered, and the expected level of performance. This gives everyone involved a clear understanding of what to expect, prevents any misunderstandings, and makes sure that the customer’s needs are being met.

SLAs also provide a way to measure the performance of the service provider. By defining metrics and targets, both parties can track the provider’s performance and make sure that they’re meeting the agreed-upon standards. This can help identify areas for improvement and provide a way to hold the service provider accountable if they don’t meet their obligations.

Here at Zabbix, we don’t just meet our SLAs – we exceed them by getting to the root cause of customer issues and providing extensive documentation with the aim of making sure they don’t happen again.

Our support goes far beyond the support of Zabbix as software – we do our best to support the whole monitoring infrastructure, which in some cases can even mean troubleshooting issues that are only tangentially connected to Zabbix as a monitoring system.

That might involve architecture questions, best practices in gathering data from one or another data source, or helping a customer understand and optimize some third-party scripts. No matter what the case may be, we do our best to help.

Listen to what customers need and communicate effectively

As anyone who’s ever contacted technical support knows, the best support isn’t necessarily provided by someone with genius level knowledge who understands every function of a product in minute detail.

For efficient technical support, communication skills are just as vital as technical knowledge. Specialists need to listen first, ask questions to confirm that they understand the problem, and restate what they’ve heard to give the customer the opportunity to provide more information. Above all, they need to speak to the customer’s level of understanding, avoiding jargon and needless details.

Our technical support team sees itself as a bridge between the customer’s needs and the solutions we provide. A key principle of technical support at Zabbix is the notion that the information our customers are sharing with us is precious. The way we see it, our customers give us valuable insights into what’s working in our product and what isn’t.

Listening carefully to the different support queries that we get allows us to form a complete feedback loop between our users and our solutions. For example, if our support team notices issues with collecting specific types of data or monitoring particular endpoints, they can prevent further queries by including a link to our FAQ section while our developers work to fix the issue.

Help customers help themselves

In technical support as in life, self-help is often the best help. It may seem illogical, but the best technical support is usually when the customer is either not asking for help or is able to help themselves.

Allowing customers to perform self-service saves them the time needed to call in or submit an online ticket, and it also improves turnaround time and serves them in the channel they prefer.

Giving customers the tools to be self-sufficient has been a part of our technical support philosophy at Zabbix from day one, and we’re fortunate to have a dedicated and devoted user community to help us do it.

Our users regularly post troubleshooting articles on our blog, and all official product documentation (including an extensive FAQ section) is available on our website. What’s more, our employees make a habit of sharing their knowledge on community Telegram channels, on-site and virtual meetups, and free webinars.

The official Zabbix forum is also a great place to go for support. Customers can interact with each other and get their problems solved easily. No matter what the issue, there’s a good chance that somebody somewhere has experienced it as well and may have a clever workaround or trick to share.

Self-help has limits, however. Your data and infrastructure are the core of your business, and some things are simply best left to experts. That’s why it’s a good idea to add to your knowledge via our official training sessions and have your most urgent issues taken care of by the skilled professionals on our support team.

Embrace automation where it makes sense to do so

Not all technical support tasks can or should be automated. Most require complex problem-solving, creativity, and emotional intelligence. Others are repetitive, simple, or predictable. It’s important to identify the best use cases for automation based on what the customer expects, the nature and value of the task, and what resources are available.

Our philosophy at Zabbix has always been that there’s no substitute for the human element when it comes to technical support. Our customers trust us to handle complicated, urgent, and sensitive issues, and there’s no substitute for the hands-on assistance that our support team can provide. It’s why we take great pains to make sure that all our team members display soft skills like interpersonal communication, personality traits, and social awareness.

However, we also harness the power of automation to assist with lower-level and more menial tasks, such as ticket assignment and processing counts. Thanks to automation, our specialists don’t need to manually grab tickets from a pool. Instead, everything is automated based on the calendar and who is working on a particular shift.

The ultimate goal is always the same – choosing the best automation path for the particular task at hand so that the customer’s issue gets resolved as quickly as possible with a minimum of disruption.

Conclusion

At Zabbix, we see our role as solving problems, not questions. 95.7% of our resolved support tickets receive positive reviews, and it’s because of the hard work, dedication, knowledge, and soft skills of our support team, as well as their goal of providing sustainable growth, long-term success, and measurable outcomes.

Our support team is a truly global entity that can provide round-the-clock support, and it’s made up of highly skilled Zabbix professionals, experts, and trainers who can boast years or even decades of Zabbix experience.

We offer multiple support tiers at a variety of price points, so you can be sure that no matter what the support needs of your organization happen to be, we have a plan that will fit perfectly.
Contact us to learn more and find the support tier that’s right for you.

The post Technical Support: The Zabbix Advantage appeared first on Zabbix Blog.

What is Server Monitoring? Everything You Need to Know

2023-09-12 Michael Kammer

Post Syndicated from Michael Kammer original https://blog.zabbix.com/what-is-server-monitoring-everything-you-need-to-know/26617/

Servers are the foundation of a company’s IT infrastructure, and the cost of server downtime can include anything from days without system access to the loss of important business data. This can lead to operational issues, service outages, and steep repair costs.

Viewed against this backdrop, server monitoring is an investment with massive benefits to any organization. The latest generation of server monitoring tools makes it easier to assess server health and deal with any underlying issues as quickly and painlessly as possible.

What are servers, and how do they work?

Servers are computers (or applications) that run software services for other computers or devices on a network. A server takes requests from the client computers or devices and performs tasks in response to the requests. These tasks can involve processing data, providing content, or performing calculations. Some servers are dedicated to hosting web services, which are software services offered on any computer connected to the internet.

What is server monitoring? Why does it matter?

Servers are some of the most important pieces of any company’s IT infrastructure. If a server is offline, running slowly, or experiencing outages, website performance will be affected and customers may decide to go elsewhere. If an internal file server is generating errors, important business data like accounting files or customer records could be compromised.

A server monitoring system is designed to watch your systems and provide a number of key metrics regarding their operation. In general, server monitoring software tests for accessibility (making sure that the server is alive and can be reached) and response time (guaranteeing that it is running fast enough to keep users happy). What’s more, it sends notifications about missing or corrupt files, security violations, and other issues.

Server monitoring is most often used for processing data in real time, but quality server monitoring is also predictive, letting users know when disks will reach capacity and whether memory or CPU utilization is about to be throttled. By evaluating historical data, it’s possible to find out if a server’s performance is degrading over time and even predict when a complete crash might occur.
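
As a rough illustration of the predictive side, the sketch below fits a straight line to historical disk usage and extrapolates to the point where the disk would be full. It is a simplified, assumption-laden example (linear growth, invented function name and sample values), not the method used by any particular monitoring product.

```python
def seconds_until_full(samples, capacity):
    """Estimate the seconds from the last sample until usage reaches capacity.

    `samples` is a list of (timestamp_seconds, used_bytes) pairs and
    `capacity` is the disk size in bytes. Returns None when there is too
    little data or when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope: growth in bytes per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


# Invented history: usage grows by 1 byte per second from 10 bytes.
history = [(0, 10), (10, 20), (20, 30)]
print(seconds_until_full(history, capacity=100))
```

Real disk growth is rarely this regular, so a forecast of this kind is best treated as an early warning rather than an exact deadline.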

How can server monitoring help businesses?

Here are a few of the most important business benefits of server monitoring:

Server monitoring tools give you a bird’s-eye view of your server’s health and performance

A quality server monitoring tool keeps IT administrators aware of metrics like CPU usage, RAM, disk space, and network bandwidth. This helps them to see when servers are slowing down or failing, allowing them to act before users are affected.

Server monitoring simplifies process automation

IT teams have long checklists when it comes to managing servers. They need to monitor hard disk space, keep an eye on infrastructure, schedule system backups, and update antivirus software. They also need to be able to foresee and solve critical events, while managing any disruptions.

A server monitoring tool helps IT professionals by automating all or many aspects of these jobs. It can show whether a backup was successful, if software is patched, and whether a server is in good condition. This allows IT teams to focus on tasks that benefit more from their involvement and expertise.

Server monitoring makes it easier to retain customers as well as employees

Acting quickly when servers develop issues (or even before) makes sure that employee workflows aren’t disrupted, allowing them to perform their duties, see results, and reach their goals. It also guarantees a positive customer experience by providing early notification of any issues.

Server monitoring keeps costs down

By automating processes and tasks (and freeing up time in the process) server monitoring systems make the most of resources and reduce costs. And by solving potential issues before they affect the organization, they help businesses avoid lost revenue from unfinished employee tasks, operational delays, and unfinished purchases.

What should you look for in a server monitoring solution?

Now that you’re sold on the benefits of server monitoring, you’ll want to choose the server monitoring solution that’s right for you. Here are a few capabilities to keep in mind:

Ease of use

Does the solution include an intuitive dashboard that makes it easy to monitor events and react to problems quickly? It should, and it should also allow you to make the most of the data it exports by providing graphs, reports, and integrations.

Customer support

Is it easy to contact support? How quickly do they respond? A quality server monitoring solution will provide a defined SLA and stick to it with no exceptions.

Breadth of coverage

A good solution will support all the server types (hardware, software, on-premises, cloud) that your enterprise uses. It should also be flexible enough to support any server types you may implement in the future.

Alert management

There are a few important questions to ask when it comes to alerts:

Does the solution include a dashboard or display that makes it easy to track events and react to problems quickly?
Is it easy to set up alerts via the configuration of thresholds that trigger them? How are alerts delivered?
Does the solution have a way to help you determine why a problem has occurred, instead of just telling you that something has gone wrong without context?

What are some best practices to keep in mind?

Here are a few best practices that will help you avoid the more common server monitoring pitfalls:

Proactively check for failures

Keep a sharp eye out for any issues that may affect your software or hardware. The tools included with a good monitoring solution can alert you to errors caused by a corrupted database (for example) and let you know if a security incident has left important services disabled.

Don’t forget your historical data

Server problems rarely occur in a vacuum, so look into the context of issues that emerge. You can do that by exploring metrics across a specific period, typically between 30 and 90 days. For example, you may find that CPU temperature has increased within the past week, which may suggest a problem with a server cooling system.

Operate your hardware in line with recommended tolerance levels

File servers are commonly pushed to the limit, rarely getting a break. That’s why it’s important to monitor metrics like CPU utilization, RAM utilization, storage capacity usage, and CPU temperature. Check these metrics regularly to identify issues before it’s too late.

Keep track of alerts

Always monitor your alerts in real time as they occur and explore reliable ways to manage and prioritize them. When escalating an incident, make sure it goes to the right individual as soon as possible.

Use server monitoring data to plan short-term cloud capacity

Server monitoring systems can help you plan the right computing power for specific moments. If services become slower or users experience other problems with performance, an IT manager can assess the situation through the server monitor. They’ll then be able to allocate extra resources to solve the problem.

Take advantage of capacity planning

Data center workloads have almost doubled in the past 5 years, and servers have had to keep up with this ongoing change. Analyzing long-term server utilization trends can prepare you for future server requirements.

Go beyond asset management

With server monitoring, you can discover which systems are approaching the end of their lives and whether any assets have disappeared from your network. You can also let your server monitoring tool handle the heavy lifting for you when it comes to tracking physical hardware.

The Zabbix Advantage

Zabbix is designed to make server monitoring easy. Our solution allows you to track any possible server performance metrics and incidents, including server performance, availability, and configuration changes.

Intuitive dashboards, network graphs, and topology maps allow you to visualize server performance and availability, and our flexible alerting allows for multiple delivery methods and customized message content.

Not only that, our out-of-the-box templates come with preconfigured items, triggers, graphs, applications, screens, low-level discovery rules, and web scenarios – all designed to have you up and running in just a few minutes.

And because Zabbix is open-source, it’s not just affordable, it’s free. Contact us to find out more and enjoy the peace of mind that comes from knowing that your servers are under control.

FAQ

Why do we need server monitoring?

Server monitoring allows IT professionals to:

Monitor the responsiveness of a server
Know a server’s capacity, user load, and speed
Proactively detect and prevent any issues that might affect the server

Why do companies choose to monitor their servers?

Companies monitor servers so that they can:

Proactively identify any performance issues before they impact users
Understand a server’s system resource usage
Analyze a server for its reliability, availability, performance, security, etc.

How is server monitoring done?

Server monitoring tools constantly collect system data across an entire IT infrastructure, giving administrators a clear view of when certain metrics are above or below thresholds. They also automatically notify relevant parties if a critical system error is detected, allowing them to act in a timely manner to resolve issues.

What should you monitor on a server?

Key areas to monitor on a server include:

A server’s physical status
Server performance, including CPU utilization, memory resources, and disk activity
Server uptime
Page file usage
Context switches
Time synchronization
Process activity
Server capacity, user load, and speed

If I want to monitor a server, how easy is it to set things up?

Setting up a server monitoring tool is easy, provided you’ve taken into account these 5 steps:

Assess and create a monitoring plan
Discover how data can be collected
Define any and all metrics
Set up alerts
Have an established workflow

The post What is Server Monitoring? Everything You Need to Know appeared first on Zabbix Blog.

Curbing Connection Churn in Zuul

2023-08-16 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/curbing-connection-churn-in-zuul-2feb273a3598

By Arthur Gonigberg, Argha C

Plaintext Past

When Zuul was designed and developed, there was an inherent assumption that connections were effectively free, given we weren’t using mutual TLS (mTLS). It’s built on top of Netty, using event loops for non-blocking execution of requests, one loop per core. To reduce contention among event loops, we created connection pools for each, keeping them completely independent. The result is that the entire request-response cycle happens on the same thread, significantly reducing context switching.

There is also a significant downside. It means that if each event loop has a connection pool that connects to every origin (our name for backend) server, there would be a multiplication of event loops by servers by Zuul instances. For example, a 16-core box connecting to an 800-server origin would have 12,800 connections. If the Zuul cluster has 100 instances, that’s 1,280,000 connections. That’s a significant amount and certainly more than is necessary relative to the traffic on most clusters.
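
The multiplication described above can be checked with a few lines of Python. The function name is made up for this illustration. The figures are the ones quoted in the paragraph, and the event loop count assumes one loop per core.

```python
def total_origin_connections(event_loops, origin_servers, zuul_instances):
    """Connections held open when every event loop keeps its own pool
    with one connection to every origin server."""
    return event_loops * origin_servers * zuul_instances


# One 16-core Zuul instance talking to an 800-server origin.
per_instance = total_origin_connections(16, 800, 1)

# The same origin behind a 100-instance Zuul cluster.
per_cluster = total_origin_connections(16, 800, 100)

print(per_instance, per_cluster)
```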

As streaming has grown over the years, these numbers multiplied with bigger Zuul and origin clusters. More acutely, if a traffic spike occurs and Zuul instances scale up, the number of connections open to origins multiplies accordingly, because every new instance opens connections from each of its event loops to every origin server. Although this has been a known issue for a long time, it was never a critical pain point until we moved large streaming applications to mTLS and our Envoy-based service mesh.

Fixing the Flows

The first step in improving connection overhead was implementing HTTP/2 (H2) multiplexing to the origins. Multiplexing allows the reuse of existing connections by creating multiple streams per connection, each able to send a request. Rather than requiring a connection for every request, we could reuse the same connection for many simultaneous requests. The more we reuse connections, the less overhead we have in establishing mTLS sessions with roundtrips, handshaking, and so on.

Although Zuul has had H2 proxying for some time, it never supported multiplexing. It effectively treated H2 connections as HTTP/1 (H1). For backward compatibility with existing H1 functionality, we modified the H2 connection bootstrap to create a stream and immediately release the connection back into the pool. Future requests will then be able to reuse the existing connection without creating a new one. Ideally, the connections to each origin server should converge towards 1 per event loop. It seems like a minor change, but it had to be seamlessly integrated into our existing metrics and connection bookkeeping.

The standard way to initiate H2 connections is, over TLS, via an upgrade with ALPN (Application-Layer Protocol Negotiation). ALPN allows us to gracefully downgrade back to H1 if the origin doesn’t support H2, so we can broadly enable it without impacting customers. Service mesh being available on many services made testing and rolling out this feature very easy because it enables ALPN by default. It meant that no work was required by service owners who were already on service mesh and mTLS.

Sadly, our plan hit a snag when we rolled out multiplexing. Although the feature was stable and functionally there was no impact, we didn’t get a reduction in overall connections. Because some origin clusters were so large, and we were connecting to them from all event loops, there wasn’t enough re-use of existing connections to trigger multiplexing. Even though we were now capable of multiplexing, we weren’t utilizing it.

Divide and Conquer

H2 multiplexing will improve connection spikes under load when there is a large demand for all the existing connections, but it didn’t help in steady-state. Partitioning the whole origin into subsets would allow us to reduce total connection counts while leveraging multiplexing to maintain existing throughput and headroom.

We had discussed subsetting many times over the years, but there was concern about disrupting load balancing with the algorithms available. An even distribution of traffic to origins is critical for accurate canary analysis and preventing hot-spotting of traffic on origin instances.

Subsetting was also top of mind after reading a recent ACM paper published by Google. It describes an improvement on their long-standing Deterministic Subsetting algorithm that they’ve used for many years. The Ringsteady algorithm (figure below) creates an evenly distributed ring of servers (yellow nodes) and then walks the ring to allocate them to each front-end task (blue nodes).

The figure above is from Google’s ACM paper.

The algorithm relies on the idea of low-discrepancy numeric sequences to create a naturally balanced distribution ring that is more consistent than one built on a randomness-based consistent hash. The particular sequence used is a binary variant of the Van der Corput sequence. As long as the sequence of added servers is monotonically incrementing, for each additional server, the distribution will be evenly balanced between 0–1. Below is an example of what the binary Van der Corput sequence looks like.
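As a rough illustration, the binary Van der Corput sequence can be generated by mirroring the binary digits of an index around the binary point. The sketch below is a generic textbook construction, not code from Zuul or from Google's paper, and it assumes the sequence is indexed from 0.

```python
def van_der_corput(n: int) -> float:
    """Return the n-th element of the base-2 Van der Corput sequence.

    The binary digits of n are reflected around the binary point,
    e.g. 6 = 0b110 becomes 0.011 in binary = 0.375.
    """
    value, denominator = 0.0, 1.0
    while n > 0:
        denominator *= 2.0
        n, bit = divmod(n, 2)
        value += bit / denominator
    return value


print([van_der_corput(i) for i in range(8)])
# [0.0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```

Each new value lands in the largest remaining gap of the 0–1 range, which is what keeps the ring balanced as servers are added one at a time.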

Another big benefit of this distribution is that it provides a consistent expansion of the ring as servers are removed and added over time, evenly spreading new nodes among the subsets. This results in the stability of subsets and no cascading churn based on origin changes over time. Each node added or removed will only affect one subset, and new nodes will be added to a different subset every time.

Here’s a more concrete demonstration of the sequence above, in decimal form, with each number between 0–1 assigned to 4 subsets. In this example, each subset has 0.25 of that range depicted with its own color.

You can see that each new node added is balanced across subsets extremely well. If 50 nodes are added quickly, they will get distributed just as evenly. Similarly, if a large number of nodes are removed, it will affect all subsets equally.
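The balance described above can be reproduced with a small simulation. This is a simplified sketch under stated assumptions: nodes receive consecutive sequence numbers starting at 0, the 0–1 range is cut into 4 equal slices, and the helper names are hypothetical. It does not model Zuul's shuffling or its actual ring data structure.

```python
from collections import Counter


def van_der_corput(n: int) -> float:
    """n-th element of the base-2 Van der Corput sequence (index starts at 0)."""
    value, denominator = 0.0, 1.0
    while n > 0:
        denominator *= 2.0
        n, bit = divmod(n, 2)
        value += bit / denominator
    return value


def subset_for(position: float, subset_count: int) -> int:
    """Map a ring position in [0, 1) to the equal-width slice that contains it."""
    return min(int(position * subset_count), subset_count - 1)


def subset_sizes(node_count: int, subset_count: int = 4) -> list:
    """Count how many of the first node_count nodes fall into each subset."""
    counts = Counter(subset_for(van_der_corput(i), subset_count)
                     for i in range(node_count))
    return [counts[s] for s in range(subset_count)]


print(subset_sizes(8))   # [2, 2, 2, 2]
print(subset_sizes(50))  # [13, 12, 13, 12]
print(subset_sizes(51))  # [13, 13, 13, 12]  (only one subset changed)
```

In this sketch, the subset sizes never differ by more than one node, and adding the 51st node touches a single subset.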

The real killer feature, though, is that if a node is removed or added, it doesn’t require all the subsets to be shuffled and recomputed. Every single change will generally only create or remove one connection. This will hold for bigger changes, too, reducing almost all churn in the subsets.

Zuul’s Take

Our approach to implementing this in Zuul was to integrate with Eureka service discovery changes and feed them into a distribution ring, based on the ideas discussed above. When new origins register in Zuul, we load their instances and create a new ring, and from then on, manage it with incremental deltas. We also take the additional step of shuffling the order of nodes before adding them to the ring. This helps prevent accidental hot spotting or overlap among Zuul instances.

The quirk in any load balancing algorithm from Google is that they do their load balancing centrally. Their centralized service creates subsets and load balances across their entire fleet, with a global view of the world. To use this algorithm, the key insight was to apply it to the event loops rather than the instances themselves. This allows us to continue having decentralized, client-side load balancing while also having the benefits of accurate subsetting. Although Zuul continues connecting to all origin servers, each event loop’s connection pool only gets a small subset of the whole. We end up with a singular, global view of the distribution that we can control on each instance — and a single sequence number that we can increment for each origin’s ring.

When a request comes in, Netty assigns it to an event loop, and it remains there for the duration of the request-response lifecycle. After running the inbound filters, we determine the destination and load the connection pool for this event loop. This will pull from a mapping of loop-to-subset, giving us the limited set of nodes we’re looking for. We then load balance using a modified choice-of-2, as discussed before. If this sounds familiar, it’s because there are no fundamental changes to how Zuul works. The only difference is that we provide a loop-bound subset of nodes to the load balancer as a starting point for its decision.
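To make the flow concrete, here is a minimal sketch of a plain choice-of-2 pick over a loop-bound subset. It is not Zuul's modified variant, whose details are not given here. The node names, the `in_flight` load metric, and the `loop_subsets` mapping are all assumptions for illustration.

```python
import random


def pick_node(subset, in_flight, rng=random):
    """Plain power-of-two-choices: sample two candidates, keep the less loaded one.

    subset    -- list of node names available to this event loop
    in_flight -- dict of node name -> current in-flight request count (assumed metric)
    """
    if len(subset) == 1:
        return subset[0]
    first, second = rng.sample(subset, 2)
    return first if in_flight.get(first, 0) <= in_flight.get(second, 0) else second


# Hypothetical loop-to-subset mapping: each event loop only sees its own slice.
loop_subsets = {
    0: ["origin-a", "origin-b", "origin-c"],
    1: ["origin-d", "origin-e", "origin-f"],
}
in_flight = {"origin-a": 7, "origin-b": 2, "origin-c": 4}

# A request handled by event loop 0 can only be routed to loop 0's subset.
print(pick_node(loop_subsets[0], in_flight))
```

The only subsetting-specific part is the lookup in `loop_subsets`; the balancing decision itself is unchanged, which mirrors the point made above.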

Another insight we had was that we needed to replicate the number of subsets among the event loops. This allows us to maintain low connection counts for large and small origins. At the same time, having a reasonable subset size ensures we can continue providing good balance and resiliency features for the origin. Most origins require this because they are not big enough to create enough instances in each subset.

However, we also don’t want to change this replication factor too often because it would cause a reshuffling of the entire ring and introduce a lot of churn. After a lot of iteration, we ended up implementing this by starting with an “ideal” subset size. We achieve this by computing the subset size that would achieve the ideal replication factor for a given cardinality of origin nodes. We can scale the replication factor across origins by growing our subsets until the desired subset size is achieved, especially as they scale up or down based on traffic patterns. Finally, we work backward to divide the ring into even slices based on the computed subset size.

Our ideal subset size is roughly 25–50 nodes, so an origin with 400 nodes will have 8 subsets of 50 nodes. On a 32-core instance, we’ll have a replication factor of 4. However, that also means that between 200 and 400 nodes, we’re not shuffling the subsets at all. An example of this subset recomputation is in the rollout graphs below.
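The numbers above can be reproduced with a small sizing sketch. This is one plausible reading, not Zuul's actual implementation: it assumes the number of subsets doubles until each subset holds at most 50 nodes, that the subset count never exceeds the number of event loops, and that the replication factor is simply event loops divided by subsets. The function name and the doubling rule are hypothetical.

```python
def plan_subsets(node_count: int, event_loops: int, max_subset_size: int = 50):
    """Return (subsets, subset_size, replication_factor) for one origin.

    Assumption: subsets double (1, 2, 4, ...) until a subset fits within
    max_subset_size, capped at one subset per event loop.
    """
    subsets = 1
    while -(-node_count // subsets) > max_subset_size and subsets * 2 <= event_loops:
        subsets *= 2
    subset_size = -(-node_count // subsets)  # ceiling division
    replication_factor = event_loops // subsets
    return subsets, subset_size, replication_factor


print(plan_subsets(400, 32))  # (8, 50, 4)
print(plan_subsets(250, 32))  # (8, 32, 4)  same 8 slices, so no reshuffle
print(plan_subsets(400, 16))  # (8, 50, 2)
```

Under these assumptions, any origin with 201 to 400 nodes keeps the same 8 slices; only the subset size floats within the 25–50 range.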

An interesting challenge here was to satisfy the dual constraints of origin nodes with a range of cardinality, and the number of event loops that hold the subsets. Our goal is to scale the subsets as we run on instances with more event loops, with a sub-linear increase in overall connections, and sufficient replication for availability guarantees. Scaling the replication factor elastically, as described above, helped us achieve this successfully.

Subsetting Success

The results were outstanding. We saw improvements across all key metrics on Zuul, but most importantly, there was a significant reduction in total connection counts and churn.

Total Connections

This graph (as well as the ones below) shows a week’s worth of data, with the typical diurnal cycle of Netflix usage. Each of the 3 colors represents our deployment regions in AWS, and the blue vertical line shows when we turned on the feature.

Total connections at peak were significantly reduced in all 3 regions by a factor of 10x. This is a huge improvement, and it makes sense if you dig into how subsetting works. For example, a machine running 16 event loops could have 8 subsets — each subset is on 2 event loops. That means we’re dividing an origin by 8, hence an 8x improvement. As to why peak improvement goes up to 10x, it’s probably related to reduced churn (below).

Churn

This graph is a good proxy for churn. It shows how many TCP connections Zuul is opening per second. You can see the before and after very clearly. Looking at the peak-to-peak improvement, there is roughly an 8x improvement.

The decrease in churn is a testament to the stability of the subsets, even as origins scale up, down, and redeploy over time.

Looking specifically at connections created in the pool, the reduction is even more impressive:

The peak-to-peak reduction is massive and clearly shows how stable this distribution is. Although hard to see on the graph, the reduction went from thousands per second at peak down to about 60. There is effectively no churn of connections, even at peak traffic.

Load Balancing

The key constraint to subsetting is ensuring that the load balance on the backends is still consistent and evenly distributed. You’ll notice all the RPS on origin nodes grouped tightly, as expected. The thicker lines represent the subset size and the total origin size.

In the second graph, you’ll note that we recompute the subset size (blue line) because the origin (purple line) became large enough that we could get away with less replication in the subsets. In this case, we went from a subset size of 100 for 400 servers (a division of 4) to 50 (a division of 8).

System Metrics

Given the significant reduction in connections, we saw reduced CPU utilization (~4%), heap usage (~15%), and latency (~3%) on Zuul, as well.

Rolling it Out

As we rolled this feature out to our largest origins — streaming playback APIs — we saw the pattern above continue, but with scale, it became more impressive. On some Zuul shards, we saw a reduction of as much as 13 million connections at peak, with almost no churn.

Today the feature is rolled out widely. We’re serving the same amount of traffic but with tens of millions fewer connections. Despite the reduction of connections, there is no decrease in resiliency or load balancing. H2 multiplexing allows us to scale up requests separately from connections, and our subsetting algorithm ensures an even traffic balance.

Although challenging to get right, subsetting is a worthwhile investment.

Acknowledgments

We would also like to thank Peter Ward, Paul Wankadia, and Kavita Guliani at Google for developing this algorithm and publishing their work for the benefit of the industry.

Curbing Connection Churn in Zuul was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

What’s Up, Home? – No More Blackouts with Zabbix HA Cluster

2022-12-02 Janne Pikkarainen

Post Syndicated from Janne Pikkarainen original https://blog.zabbix.com/whats-up-home-no-more-blackouts-with-zabbix-ha-cluster/24738/

Can you have a Zabbix HA cluster at home? Of course, you can! By day, I am a monitoring tech lead in a global cyber security company. By night, I monitor my home with Zabbix & Grafana and do some weird experiments with them. Welcome to my blog about this project.

The winter has come, and due to world events, it might bring one to two hours of rolling blackouts here in Finland, too. As I have my home Zabbix running on my Raspberry Pi, without a UPS this would mean my Zabbix possibly could not monitor the actual duration of the outages, as my Zabbix server would be without power, too, right?

No. Thanks to the simplicity of setting up a HA cluster with Zabbix, I now have a two-node Zabbix server setup at home, with the standby node running on my laptop, which of course can run on battery for the duration of the blackout. So, while this post is kind of boring — I’m not introducing anything weird to monitor today — I hope the post encourages you to try out the high-availability features of Zabbix. It’s easy!

Set up the nodes

As described in the Zabbix documentation, setting up HA in Zabbix takes two additional lines in your zabbix_server.conf file:

HANodeName for the descriptive, unique name of the node
NodeAddress, which should be the address Zabbix front-end will then use

That’s it! And, that is what I did. Then make sure your Zabbix servers point to the same database, and that all your Zabbix servers can connect to that database.
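As a quick sanity check, a few lines of Python can confirm that both HA parameters are present in each node's configuration and that the node names are unique. The node names and addresses below are made-up examples, not my real setup, and the helper functions are hypothetical.

```python
def parse_conf(text: str) -> dict:
    """Parse key=value lines from zabbix_server.conf-style text, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_ha_settings(settings: dict) -> list:
    """Return the HA parameters that are absent or empty."""
    return [name for name in ("HANodeName", "NodeAddress") if not settings.get(name)]


# Hypothetical example values for two nodes.
pi_conf = """
# Raspberry Pi node
HANodeName=zabbix-pi
NodeAddress=192.168.1.10:10051
"""
laptop_conf = """
# Laptop node
HANodeName=zabbix-laptop
NodeAddress=192.168.1.20:10051
"""

nodes = [parse_conf(pi_conf), parse_conf(laptop_conf)]
print([missing_ha_settings(node) for node in nodes])          # [[], []]
print(len({node["HANodeName"] for node in nodes}) == len(nodes))  # True
```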

But does it work?

Of course, it does! Here’s the status as seen from Zabbix Reports → System Information:

And here’s the status as reported by sudo zabbix_server -R ha_status from the command line on my Raspberry Pi:

Out of curiosity, I tried out what happens if I try the same command on my laptop. This happens:

Still to do

As my time is very limited nowadays due to our baby, I still have one remaining task to make this perfect: setting up a database cluster. For now, MariaDB is running on my Raspberry Pi only, so I would need to spread it to run on my laptop, too. I will most likely do this with MariaDB Galera Cluster, but that will be another story.

Winter, you might take out my electricity, but you won’t take down my Zabbix.

I have been working at Forcepoint since 2014 and I won’t let my systems go down. — Janne Pikkarainen

This post was originally published on the author’s LinkedIn account.

Fast Way to Upgrade Your Zabbix Knowledge

2022-11-29 Nicole Makarova

Post Syndicated from Nicole Makarova original https://blog.zabbix.com/fast-way-to-upgrade-your-zabbix-knowledge/20267/

Since Zabbix 6.0 LTS has been released with a lot of new features and improvements, it might be tricky for one to figure out how to use these features on their own. Here, Zabbix comes to the rescue with Upgrade Training Courses to boost your knowledge in just one day.

If you previously completed the Zabbix 5.0 core training, the Upgrade Program will be an excellent way to learn about the recent improvements and add-ons of the new version without retaking the entire course. It is akin to a crash course that saves you both time and effort.

Zabbix Upgrade Training Program Overview

The Upgrade Training Program consists of two courses: Zabbix 6.0 Certified Specialist Upgrade Course and Zabbix 6.0 Certified Professional Upgrade Course. Let us tell you more about each upgrade course.

The Certified Specialist Upgrade Course covers the updated and new features of the basics, such as different data collection approaches, problem detection, data preprocessing, different visualization features, and more. Some of the long-awaited features you will get familiar with during the upgrade course are new Dashboard Widgets (e.g., the Item Value Widget and the Top Hosts Widget) and the ability to display your monitored infrastructure on the Geomap Widget. Another thing that might interest you is the Service Monitoring section, which has been completely redesigned with a focus on flexible business service monitoring, alerting, and root cause analysis.

In contrast, the Certified Professional Upgrade Course focuses on advanced environments, where infrastructure scalability and redundancy are the common requirements. Hence, this course includes six major features, with two of them being High Availability and Advanced Problem Detection. The Zabbix server High Availability feature allows you to deploy multiple Zabbix servers that remain in standby mode and take over if the currently active server becomes unavailable. The Advanced Problem Detection section focuses on anomaly detection and baseline monitoring features, as Zabbix now supports history functions. This means that Zabbix can semi-automatically detect anomalous values and create alerts if such values are detected. The same approach can be used in baseline monitoring: Zabbix can now calculate baseline values for your metrics and react if your values are outside of this baseline.

As you see, such extensive training wraps up all the meaningful recent improvements of Zabbix 6.0 and delivers them to you in one day, not requiring you to spend a week on the course retake. And besides, it is also cheaper than the entire course.

This is the first time Zabbix is providing a quick and easy way to upgrade existing Zabbix 5.0 Specialist and Zabbix 5.0 Professional certifications to Zabbix 6.0. The course is designed for experienced Zabbix administrators who are working with Zabbix 5.0 on a daily basis. The one-day course featuring all important changes and updates in the most recent Zabbix LTS version is a very cost- and time-efficient option.
– Kaspars Mednis, Chief Trainer at Zabbix

Applying for the Right Course

Now it is time to pick the right Upgrade Course to apply for if you are ready to evolve your Zabbix skills. Here are a few hints on how to do it.

If you have already completed our core training and received the Zabbix 5.0 Certified Specialist Certificate, you should apply for the Zabbix 6.0 Certified Specialist Upgrade Course. This one-day course includes 5 hours of training and a one-hour exam that will challenge you to check your knowledge of the whole Zabbix 6.0 LTS version and its new features.

Alternatively, if you were certified as a 5.0 Certified Professional, go for the Zabbix 6.0 Certified Specialist + Professional Upgrade Course bundle. This one includes both the Specialist and Professional courses and lasts a little longer. After completing the Specialist course, Professionals will have an additional 1.5-hour training and a 30-minute exam to master their knowledge.

Useful Things to Know

We suggest reviewing your knowledge of Zabbix usage, as the exam of the Upgrade Training Program includes questions about both the Zabbix 6.0 LTS release as a whole and its new features and improvements. Feel free to use your Zabbix 5.0 materials from the previous core training you have completed or explore Zabbix Documentation in case the materials are unavailable to you for some reason.

Please bear in mind that the 6.0 Certified Professional Upgrade training is meant only for 5.0 Certified Professionals who have previously acquired the 6.0 Certified Specialist level.

The Upgrade Training Program is available online all over the world in different languages and for various time zones. And what’s more, upon successful course completion, you will receive an official Zabbix training certificate stating you have upgraded to the Zabbix 6.0 Certified Specialist or Professional.

Ready for takeoff? Then check out the full schedule and cost of the program on our Upgrade Courses page and pick your training. For even more details, please contact our Sales Team.

Extra Links to Grab Your Attention

Discover more courses and make a solid investment into your Zabbix skills by applying to:
• Core Training Courses to become a professional or an expert
• Extra Courses to study in depth one specific monitoring topic
• Exams to prove your Zabbix knowledge

Top 10 reasons to migrate to Zabbix 6.0 LTS by Dmitry Krupornitsky / Zabbix Summit Online 2021

2021-12-29 Arturs Lontons

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/top-10-reasons-to-migrate-to-zabbix-6-0-lts-by-dmitry-krupornitsky-zabbix-summit-online-2021/18445/

Today we will take a look at the top 10 reasons to migrate to Zabbix 6.0 LTS. We will discuss features and changes included not only in Zabbix 6.0 LTS but also in intermediate major versions – Zabbix 5.2 and Zabbix 5.4.

The full recording of the speech is available on the official Zabbix YouTube channel.

High availability

With Zabbix 6.0 LTS, native support for Zabbix server high availability clusters is finally here. High availability setups can protect you from software and hardware failures and allow you to minimize downtime while performing maintenance tasks. Before Zabbix 6.0 LTS, users were required to use a dedicated piece of clustering software to enable high availability. Most users used a combination of Corosync and Pacemaker. This required additional knowledge of these tools to ensure proper setup, configuration, and maintenance of the Zabbix high availability cluster. You could also use other third-party vendor solutions, but such solutions also require additional knowledge and in many cases incur additional licensing costs.

The native Zabbix server high availability cluster is an opt-in solution that provides high availability for the Zabbix server component. This solution consists of multiple Zabbix server instances – nodes, where each node is configured separately and uses the same database. Each node has two modes of operation – active or standby. Only a single node can be active at a time. The standby nodes do not perform any data collection, data processing, or any other Zabbix server activities. The standby nodes do not listen for connections on ports and have a minimal number of connections established to the Zabbix backend database. The high availability nodes are compatible with one another across different minor Zabbix server versions.

Learn how to deploy your own Zabbix server high availability cluster by following the steps provided in our Zabbix Summit blog post dedicated to this topic.

New Zabbix interface options

Zabbix 6.0 LTS provides multiple Zabbix interface improvements. One of the major changes that the users will notice when switching to Zabbix 6.0 LTS is the migration from screens to dashboards. The screens will be migrated to dashboards automatically during the upgrade. Dashboards consist of multiple highly customizable widgets, which can be placed on a dashboard with a click of a button. With Zabbix 6.0 LTS many new widgets will be available for different purposes – more flexible views of your metrics with the Single item value widget, a Geomap widget for a better overview of your infrastructure state, Top N/Bottom N views provide a whole new way to look at your metrics and more.

Now you will be able to save your favorite problem filters and access them in tabs for simpler filtering of commonly accessed problem views.

Zabbix 6.0 LTS introduces timezone configuration on a per-user basis. Users can now have their preferred timezone configured via the user settings in the Zabbix frontend. The same is also true for language – this can also now be configured individually for each user.

The Zabbix frontend is now more customizable than ever. There are several ways in which you can customize your Zabbix frontend:

Replace the Zabbix logo with your company’s branding
Hide links to Zabbix support/integration pages
Set a custom help page link
Change the copyright notice in the footer of the frontend.

Implementing these changes requires customizing the underlying PHP code – we tried to make this as simple and accessible as possible, so you can quickly make the necessary changes yourself.

There are also many other interface improvements, such as multi-page dashboards, third-level menus, and graph improvements.

Improved security

Security is always something that we focus on when developing Zabbix. Zabbix 6.0 LTS brings many new security-related improvements and features:

User roles allow you to define roles with granular permissions related to the frontend access and the actions that each user role is permitted to perform
- Roles are still based on user types – Zabbix User, Admin, Super admin, and user type restrictions still apply, but can be further customized per each role
- User group to host group permissions (Read, Read/Write, Deny) still need to be used in combination with roles to ensure granular access to your data
- For example, now we can define users that have access to host configuration but restrict access to other configuration sections.

In Zabbix 6.0 LTS it is possible to define custom password complexity requirements for Zabbix frontend logins. We can define password length/complexity policies and prohibit the usage of easy to guess common passwords.

The Zabbix API has also seen some security improvements. Now it is possible to generate a persistent API token for a particular user, define an expiration date and use the token in your API calls, without the need to regularly re-issue a new API token.

The Zabbix 5.2 release also added the ability to store sensitive information in an external vault. As of the release of Zabbix 6.0 LTS, only HashiCorp Vault is supported, but CyberArk Vault support is also coming in the Zabbix 6.2 release.

A set of architectural and structural measures have been taken to completely restructure the Zabbix Audit log. The updated Audit log entry contains records of all configuration changes made by the Zabbix server and Zabbix frontend. The new Audit log also contains additional filtering options, such as filtering Audit log entries based on the operation during which the changes were performed. The new Audit log is not only more detailed but also reworked with minimum performance impact in mind.

Scalability improvements

Many scalability improvements have been introduced between the Zabbix 5.0 LTS release and Zabbix 6.0 LTS release. These improvements not only improve the performance of existing Zabbix instances but also lay the groundwork for the design of upcoming features in later releases.

Previously, trend-based trigger functions would always use database queries to obtain the required data. Starting from Zabbix 5.4, a new type of cache – Trend function cache, has been introduced. This cache stores the results of calculated trend functions. When processing the trend functions, the Zabbix server will check the Trend function cache for the cached results. On a cache miss, the Zabbix server will read the data from the database and cache the results.
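The lookup pattern described here is a classic read-through cache. The sketch below only illustrates that general pattern in Python; the class, its fields, and the fake database loader are hypothetical and do not reflect the Zabbix server's internal implementation.

```python
class TrendFunctionCache:
    """Read-through cache: return a cached result, or load it once and remember it."""

    def __init__(self, load_from_database):
        self._load = load_from_database  # callable standing in for a database query
        self._results = {}
        self.database_reads = 0

    def get(self, item_id, function, period):
        key = (item_id, function, period)
        if key not in self._results:     # cache miss: fall back to the database
            self.database_reads += 1
            self._results[key] = self._load(item_id, function, period)
        return self._results[key]


def fake_trend_query(item_id, function, period):
    """Stand-in for an expensive trend query (hypothetical values)."""
    return 42.0


cache = TrendFunctionCache(fake_trend_query)
cache.get(1001, "trendavg", "1M:now/M")
cache.get(1001, "trendavg", "1M:now/M")
print(cache.database_reads)  # 1
```

The second call is served from memory, which is the saving the cache is meant to provide for triggers that are evaluated repeatedly.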

The scalability improvements allow for better parallel data processing on Zabbix servers with heavy loads. Zabbix instances with tens of thousands or more new values per second will greatly benefit from the improved performance.

The introduction of the graceful startup of the Zabbix server can help you improve performance and prevent unwanted downtimes, especially with large distributed environments. Whenever a Zabbix server gets started up after downtime, the existing Zabbix proxies start sending the data backlog to the Zabbix server. It is extremely important to maintain the stability and performance of the Zabbix server during this time window. Graceful startup improves the Zabbix server data backlog handling logic during such situations.

To prevent unwanted delays and other issues when using zabbix_get and zabbix_sender command-line tools, it is now possible to define a custom Timeout parameter for these tools.

Advanced business service monitoring

The new Business service monitoring features allow Zabbix users to not only define complex service trees but also receive alerts in situations where the status of a business service has been changed. This is valuable to every user that wishes to monitor their business services, no matter how simple or complex the service is.

This is combined with a large number of new and improved service status calculation rules. By defining custom service weights and advanced service status propagation rules, the business services can be defined in an extremely flexible fashion. Services are also no longer linked to individual triggers; instead, we use tag-based service mapping to map our services to problem events.

The service functionality has also received scalability improvements. Zabbix can support the monitoring of over 100 000 business services. The scalability improvements have been implemented from both the UI/UX and the performance perspectives.

The old all-or-nothing business service permission approach has been redesigned to provide granular read/write permissions for individual business services. This is not only an improvement from the security perspective, but also adds the ability to define services in a multi-tenant fashion, where each tenant has access only to the services that they own.

With the redesign of the business services, we have added the support for root cause analysis, allowing users to see the underlying problem which caused a particular service to change its state.

You can read more about Business service monitoring in our Zabbix Summit blog post dedicated to this topic.

Tag and template improvements

Item applications have been replaced with tags. This design decision adds consistency to filtering, mapping, grouping, and other tag-related functions when it comes to different Zabbix entities. Tags can also be used to provide additional information related to your entities in a manner that is much more flexible than it was with applications.

Universal template IDs introduced for each of the template elements allow you to define much more robust template management workflows, especially when you combine this with a CI/CD template management approach. These IDs are unique and can be used to match a particular template entity, such as item, trigger, graph, and so on. By utilizing the Universal template IDs, Zabbix now understands which entity we are trying to update, which entity no longer exists, whether it is a new entity or we are adjusting an existing entity. The default template export format is now YAML, though JSON and XML formats are still supported. This was done to improve the template management usability since the YAML format is more user-friendly and easier to edit manually. All of the official Zabbix templates available on the Zabbix git page have already been converted to the YAML format.

The redesign of the templates has also allowed us to improve the visualization of the changes made when importing a template. Now users can see the list of changes in a diff-like display and understand the impact that the template import will have on the Zabbix entities.

Value maps have been moved to host and template levels. This is another design decision that we made to enable support for fully self-contained templates that are easy to manage and deploy, and can be easily imported into different Zabbix environments. While global value maps might be easy to manage in small environments, this is not the case in larger environments, where different teams are working with a single Zabbix instance or across multiple instances. Therefore, the global value maps have been removed.

Reporting and visualization

With the addition of the Scheduled reports functionality, any dashboard can now be converted into a scheduled report. While this feature was originally added in Zabbix 5.4, the new widgets released with Zabbix 6.0 LTS make it considerably more valuable for reporting. Users can create scheduled reports and receive them in their mailbox at a specific time on a daily, weekly, monthly, or yearly basis. The time period that the report covers can also be selected.

The new Geographical map widget allows you to quickly deploy a geomap with an overview of the state of your infrastructure. The geomap widget supports filters, so you can display only a particular part of your infrastructure. Zabbix uses Leaflet, an open-source JavaScript library for interactive maps, and supports multiple map providers such as OpenStreetMap, OpenTopoMap, USGS US Topo, and more. Users can also define and use a custom map tile provider. The map displays your infrastructure, highlights any detected problems, and shows problem counters. This is a major step forward from the old approach, which required users to combine the regular map functionality with Zabbix API scripting to show information on a geographical map.

Advanced problem detection

The Zabbix 5.4 release introduced a new unified syntax for defining trigger expressions, calculated items, and aggregated items. The new syntax brings multiple benefits. First, it is unified: the same syntax is used for defining triggers and calculated items and for providing values in maps or graph names. Second, it takes a functional approach instead of an object-oriented one. This allows us to solve many complex use cases, for example, dynamically calculating or aggregating a value from all hosts tagged with a specific tag or belonging to a specific host group. The Aggregated item type has also been removed, and users can now define aggregate checks under the calculated item type.
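To make the shape of the unified syntax concrete, here is a small Python sketch that assembles two expression strings. The helper names `func_expr` and `aggregate_expr` are hypothetical, and the host name, item key, and group name are placeholders; only the resulting string format is meant to reflect the syntax described above.

```python
def func_expr(function: str, host: str, key: str, *params: str) -> str:
    """Build an expression in the unified form: function(/host/key,param,...)."""
    args = ",".join((f"/{host}/{key}",) + params)
    return f"{function}({args})"


def aggregate_expr(agg: str, key: str, group: str) -> str:
    """Build an aggregate over every host in a host group via last_foreach()."""
    return f'{agg}(last_foreach(/*/{key}?[group="{group}"]))'


# Placeholder host, key, and group names, for illustration only.
# Before 5.4, the same trigger would have been written as
# {web01:system.cpu.load.min(5m)}>2
trigger = func_expr("min", "web01", "system.cpu.load", "5m") + ">2"
calculated = aggregate_expr("avg", "system.cpu.load", "Linux servers")

print(trigger)     # min(/web01/system.cpu.load,5m)>2
print(calculated)  # avg(last_foreach(/*/system.cpu.load?[group="Linux servers"]))
```

The second string is the kind of formula that now lives in a calculated item, replacing the removed Aggregated item type.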

New monitoring functionality and integrations

As with every major release, Zabbix 6.0 LTS comes with a set of new items and improves the functionality of already existing items:

It is now possible to monitor SSL certificate validity and expiration data, such as the expiry date, issuer, version, subject, and more
New Zabbix Agent 2 metrics allow you to collect file owner information, file properties, extended interface info, extended TCP info, SHA2 hashes for files, and more
New templates for NGINX+, HPE/Dell servers, Cisco ASAv, Cloudflare

Finally – Zabbix 6.0 LTS

Many of our users and customers prefer sticking with the LTS releases instead of upgrading between each major version. As with every LTS release, there are major benefits to sticking with Zabbix 6.0 LTS:

LTS releases receive thorough testing and full long-term support
- 3 years of full support – general, critical, and security fixes/improvements
- 5 years of limited support – critical and security fixes

Questions

Q: Which of the current versions are still supported and for how long are they going to remain supported? What updates can we expect these versions to receive?

A: Currently we have three supported major versions available. The first is Zabbix 5.4, which will no longer be supported after the release of Zabbix 6.0 LTS. We also still provide support for Zabbix 5.0 LTS and Zabbix 4.0 LTS. Zabbix 5.0 LTS will continue receiving full support until the middle of 2023 and limited support until the middle of 2025, while Zabbix 4.0 LTS will receive limited support until November 2023.

Q: Could you elaborate on how tags are more flexible than applications and are there any other benefits to using tags?

A: Zabbix already supports tags for most of the essential Zabbix objects, such as triggers, hosts, host prototypes, and templates. With the introduction of tags for items, tags can now be found everywhere. This way you can attach tags, with or without values, that carry different kinds of additional information about your objects. Tags have several uses. For example, we can use them to mark events: if we have an item with a tag, this tag will mark any problem related to this item. Problem events inherit tags from the whole tag chain, which includes hosts, templates, triggers, items, and more. Further down the line, we can use our actions to react to specific tags. If you recall, Business services are also mapped to problems based on tag mapping. Of course, tags can also be used for filtering and grouping different Zabbix objects.
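The inheritance chain can be pictured with a short Python sketch. It is a simplified model rather than Zabbix's implementation: the `event_tags` and `matches` helpers and all tag names and values are hypothetical.

```python
from typing import Iterable, Set, Tuple

Tag = Tuple[str, str]  # (tag name, tag value)


def event_tags(*sources: Iterable[Tag]) -> Set[Tag]:
    """Union of the tags from every level of the chain; duplicates collapse."""
    merged: Set[Tag] = set()
    for tags in sources:
        merged.update(tags)
    return merged


def matches(tags: Set[Tag], name: str, value: str) -> bool:
    """A simplified 'tag equals value' condition, such as an action might use."""
    return (name, value) in tags


# Made-up tags on each level, for illustration only.
template_tags = [("class", "os")]
host_tags = [("env", "production")]
item_tags = [("component", "cpu")]
trigger_tags = [("scope", "performance"), ("component", "cpu")]

problem = event_tags(template_tags, host_tags, item_tags, trigger_tags)
print(sorted(problem))
print(matches(problem, "env", "production"))
```

In this model, a condition on the `env` tag matches the problem even though that tag was defined only on the host.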

Q: Is there a guideline to the migration process from an older version to Zabbix 6.0 LTS? Is there a change list that I can look at to see what other features have received an overhaul?

A: Regarding the upgrade itself, our documentation contains guidelines for both upgrading from packages and upgrading from sources. The documentation may also contain upgrade notes regarding any extra steps or precautions required when upgrading to a particular version. Regarding the feature changes, we recommend reading through the major version release notes. For example, if you’re upgrading from Zabbix 5.0 LTS to Zabbix 6.0 LTS, make sure to familiarize yourself not only with the Zabbix 6.0 LTS release notes but also with the Zabbix 5.2 and Zabbix 5.4 release notes, since changes introduced in these versions will also be a part of Zabbix 6.0 LTS.

The post Top 10 reasons to migrate to Zabbix 6.0 LTS by Dmitry Krupornitsky / Zabbix Summit Online 2021 appeared first on Zabbix Blog.