Scalability improvements

Post Syndicated from Sergey Simonenko original https://blog.zabbix.com/scalability-improvements/14832/

New improvements might be unnoticed by many Zabbix users since they come to scalability, rather than to new features or some aspects of the user interface experience. However, these improvements might be beneficial for those Zabbix users who run really large instances.

Contents

I. More efficient database use (1:15)

1. New worker processes (3:03)

2. In-memory trend cache (4:49)
3. More server resiliency (7:35)

II. Questions & Answers (10:54)

In case of large instances, the main performance bottleneck would be the database. Zabbix doesn’t establish ad-hoc connections and uses only persistent connections to the database. In Zabbix 5.4, the use of database connections has been further drastically optimized.

More efficient database use

  • In earlier versions, not only database syncers, but also pollers, and some other processes had a dedicated persistent connection to the database. These connections were necessary for calculated items and aggregate checks. These calculated items and aggregate checks are not real items, since they’re based on the queries to the database, particularly to history tables.

Connections were also required to update host availability status. Pollers (unreachable pollers, JMX pollers, as well as the IPMI manager) were updating it directly in the database.

  • In addition, in some cases, when proxies were used (that would be true for large instances) host availability was updated by the proxy poller, in case of a passive proxy, and trapper.

Why was it decided to avoid these connections in Zabbix 5.4?

  • First, they don’t really work smoothly with the default database configuration (PostgreSQL, Oracle). For instance, in PostgreSQL, max_connections is by default set to 100.
  • They can cause locking on the database side.
  • They also result in inefficient memory and CPU utilization.
  • Finally, in earlier versions, it was impossible to perfectly fine-tune the number of connections to the database.

New worker processes

In Zabbix 5.4, two new processes were introduced: history pollers and availability manager. If you have upgraded your Zabbix instance already when you log onto your server and run ps aux | grep zabbix_server, you will notice these new processes:

/usr/sbin/zabbix_server: history poller #1 [got 0 values in 0.000008 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #2 [got 2 values in 0.000186 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #3 [got 0 values in 0.000050 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #4 [got 0 values in 0.000010 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #5 [got 0 values in 0.000012 sec, idle 1 sec] 
/usr/sbin/zabbix_server: availability manager #1 [queued 0, processed 0 values, idle 5.016162 sec during 5.016415 sec]

History pollers

Since calculated items and aggregate checks represent a different types of items, now they have their own poller – history poller. History pollers are also used for several internal items (zabbix[*] item keys) as well.

New configuration parameters

History poller comes with a new configuration parameter. Here, it is important to keep in mind that more is not always better. So, the StartHistoryPollers value (how many history pollers are being pre-forked) should be increased only if history pollers are too busy according to internal self-monitoring, but should be kept as low as possible to avoid unnecessary connections to the database.

### Option: StartHistoryPollers
#     Number of pre-forked instances of history pollers.
#     Only required for calculated, aggregated and internal checks.
#     A database connection is required for each history poller instance.
#
# Mandatory: no
# Range: 0-1000
# Default:
# StartHistoryPollers=5

Availability manager

In earlier versions, pollers, unreachable pollers, JMX pollers, and the IPMI manager updated host availability directly in the database with a separate transaction for each host. Now, we have this separate availability manager, and all processes — pollers, trappers, etc. — communicate with the availability manager, and the statistics queue is flushed by the availability manager to the database every 5 seconds.

In-memory trend cache

Since Zabbix 5.2, new trigger functions like trendavg, trendmax, etc. were introduced, which operate with the trends data for long periods. Similarly to calculated items, these triggers used database queries to obtain the necessary data.

In Zabbix 5.4, finally, the trend cache has been implemented. It stores the results of calculated trends functions. If the value is not available in the cache yet, Zabbix will query the database and update the cache.

As with all newly introduced processes, this cache’s effectiveness can be monitored using internal check zabbix[tcache,cache,], which can be used to set the relevant TrendFunctionCacheSize parameter value.

### Option: TrendFunctionCacheSize
#           Size of trend function cache, in bytes.
#           Shared memory size for caching calculated trend function data.
#
# Mandatory: no
# Range: 128K-2G
# Default:
# TrendFunctionCacheSize=4M

To sum it up, with all these database-related optimizations:

  • Now it is possible to have as many database connections as you really need. So, if you, for instance, operate a very large instance and you need a hundred or more pollers, and at the same time, you don’t rely much on some calculated items or aggregate checks functionality, before Zabbix 5.4 you would end up with hundreds or more database connections that you didn’t need.

Moreover, with PostgreSQL with default configuration, if you increased the number of pollers, your database server could go down and bring down your Zabbix instance. For each PostgreSQL worker process, you would have had a limited work_mem as you had too many database connections. So, your overall database performance would have been sacrificed. That is not the case anymore.

  • In addition, if you are using trend functions with triggers using large periods of time, in the past you might have noticed, for instance, slow queries. Now, these changes will help you to drastically decrease the database load.

More server resiliency

  • Another important feature — graceful start. Active proxies can keep a backlog, which is useful if the communication between the server and the proxy breaks for any reason, for instance:

— server maintenance during upgrade to the next minor release;
— loss of Internet access at a remote site due to fiber cut, etc.

When communication restores, the proxies can easily overload the server after long downtimes, especially in large instances.

  • Since Zabbix 5.4, the server lets the proxies know if it’s busy, so the proxies throttle data sending.

Earlier, the data uploaded by the proxies was throttled when the history cache usage was 80% or greater. However, as the server was responsible for that task, all proxies were getting disabled in some situations. That meant the history data upload, as well as other tasks, such as processing of regular data and processing tasks, were getting suspended until the history cache utilization dropped lower than 80%.

This method was ineffective and unacceptable in large environments. Now, the proxies are responsible for checking whether the server can handle the data. When the history cache usage hits 80%, the following scenario is used:

  • the proxies send the data to the server and the data is accepted;
  • if the server thinks it’s busy it will respond with a special JSON tag upload set to ‘disable’;
  • the proxies will stop uploading history data, but will keep polling the server for tasks and uploading other data;
  • in a while, the proxies will upload data again;
  • if the server is not too busy, it will respond with the JSON tag upload set to ‘enable’.

Unlike the previous two scalability improvements which are based on serious architectural changes, this change was backported to earlier Zabbix versions — 5.0 and 5.2.

Questions & Answers

Question. Would you recommend using proxies even on the local site to allow for the server to be upgraded without losing data or for performance improvements?

Answer. Yes, in some cases there’re such setups. This idea mainly is to have a unified configuration, not only to improve performance. And in some cases, if you use a lot of proxies, you might want to monitor all the items only with the proxies. Such scenarios are used by many Zabbix customers.

Question. So, throttling can give you some noticeable performance benefits. Which version is required on the server and on the proxy for throttling?

Answer. All these changes have been backported to earlier versions, so you can use either Zabbix 5.4.0 released recently or the latest releases of Zabbix 5.0 or Zabbix 5.2.

Question. Is it possible to have two databases in a cluster and point the select queries to one database and, for instance, execution queries to another database? How would database clustering generally work? Is it of benefit to Zabbix? Can Zabbix utilize it?

Answer. In general, our HA setups use some basic features, which are built-in into database servers. They use replication. So, you have to use the servers that will provide some virtual IP for your cluster. That is completely transparent to Zabbix.

However, it is not recommended to split different queries on different nodes. They should still hit a single specific note. So, it is more of an HA approach rather than a horizontal scalability approach.

Question. Would you elaborate on what a large, or medium, or small instance means? What new values per second should we be looking at?

Answer. We can judge from large instances of our customers, and might not know about even larger instances managed by the customers themselves. Large instances can have, for instance, 100,000 NVPS and more. Sometimes, we upgrade really large instances with databases of dozens of terabytes. Some users like keeping really long records.

In my experience, large instances of 20,000 to 40,000 NPVS are quite common and they could benefit a lot from these changes.