Tag Archives: Technical

Zabbix 7.0 Proxy Load Balancing

Post Syndicated from Markku Leiniö original https://blog.zabbix.com/zabbix-7-0-proxy-load-balancing/28173/

One of the new features in Zabbix 7.0 LTS is proxy load balancing. As the documentation says:

Proxy load balancing allows monitoring hosts by a proxy group with automated distribution of hosts between proxies and high proxy availability.

If one proxy from the proxy group goes offline, its hosts will be immediately distributed among other proxies having the least assigned hosts in the group.

Proxy group is the new construct that enables Zabbix server to make dynamic decisions about the monitoring responsibilities within the group(s) of proxies. As you can see in the documentation, the proxy group has only a minimal set of configurable settings.

One important background information to understand is that Zabbix server always knows (within reasonable timeframe) which proxies in the proxy groups are online and which are not. That’s because all active proxies connect to the Zabbix server every 1 second by default (DataSenderFrequency setting in the proxy), and Zabbix server connects to the passive proxies also every 1 second by default (ProxyDataFrequency setting in the server), so if those connections are not happening anymore, then something is wrong with using the proxy.

Initially Zabbix server will balance the hosts between the proxies in the proxy group. It can also rebalance the hosts later if needed, the algorithm is described in the documentation. That’s something we don’t need to configure (that’s the “automated distribution of hosts” mentioned above). The idea is that, at any given time, any host configured to be monitored by the proxy group is monitored by one proxy only.

Now let’s see how the actual connections work with active and passive Zabbix agents. The active/passive modes of the proxies (with the Zabbix server connectivity) don’t matter in this context, but I’m using active proxies in my tests for simplicity.

Disclaimer: These are my own observations from my own Zabbix setup using 7.0.0, and they are not necessarily based on any official Zabbix documentation. I’m open for any comments or corrections in any case.

At the very end of this post I have included samples of captured agent traffic for each of the cases mentioned below.

Passive agents monitored by a proxy group

For passive agents the proxy load balancing really is this simple: Whenever a proxy goes down in a proxy group, all the hosts that were previously monitored by that proxy will then be monitored by the other available proxies in the same proxy group.

There is nothing new to configure in the passive agents, only the usual Server directive to allow specific proxies (IP addresses, DNS names, subnets) to communicate with the agent.

As a reminder, a passive agent means that it listens to incoming requests from Zabbix proxies (or the Zabbix server), and then collects and returns the requested data. All relevant firewalls also need to be configured to allow the connections from the Zabbix proxies to the agent TCP port 10050.

As yet another reminder, each agent (or monitored host) can have both passive and active items configured, which means that it will both listen to incoming Zabbix requests but also actively request any active tasks from Zabbix proxies or servers. But again, this is long-existing functionality, nothing new in Zabbix 7.0.

Active agents monitored by a proxy group

For active agents the proxy load balancing needs a bit new tweaking in the agent side.

By definition, an active agent is the party that initiates the connection to the Zabbix proxy (or server), to TCP port 10051 by default. The configuration happens with the ServerActive directive in the agent configuration. According to the official documentation, providing multiple comma-separated addresses in the ServerActive directive has been possible for ages, but it is for the purpose of providing data to multiple independent Zabbix installations at the same time. (Think about a Zabbix agent on a monitored host, being monitored by both a service provider and the inhouse IT department.)

Using semicolon-separated server addresses in ServerActive directive has been possible since Zabbix 6.0 when Zabbix servers are configured in high-availability cluster. That requires specific Zabbix server database implementation so that all the cluster nodes use the same database, and some other shared configurations.

Now in Zabbix 7.0 this same configuration style can be used for the agent to connect to all proxies in the proxy group, by entering all the proxy addresses in the ServerActive configuration, semicolon-separated. However, to be exact, this is not described in the ServerActive documentation as of this writing. Rather, it specifically says “More than one Zabbix proxy should not be specified from each Zabbix server/cluster.” But it works, let’s see how.

Using multiple semicolon-separated proxy addresses works because of the new redirection functionality in the proxy-agent communication: Whenever an active agent sends a message to a proxy, the proxy tells the agent to connect to another proxy, if the agent is currently assigned to some other proxy. The agent then ceases connecting to that previous proxy, and starts using the proxy address provided in the redirection instead. Thus the agent converges to using only that one designated proxy address.

In this simple example the Zabbix server determined that the agent should be monitored by Proxy 1, so when the agent initially contacted Proxy 1 (because its IP address is first in the ServerActive list), the proxy responded normally and agent was happy with that.

In case the Zabbix server had for any reason determined that the agent should be monitored by Proxy 2, then Proxy 1 would have responded with a redirection, and agent would have followed that. (There will be examples of redirections in the capture files below.)

To be clear, this agent redirection from the proxy group works only with Zabbix 7.0 agents as of this writing.

Note: In the initial version of this post I used comma-separated proxy addresses in ServerActive (instead of semicolon-separated), and that caused duplicate connections from the agent to the designated proxy (because the agent is not equipped to recognize that it connects to the same proxy twice), eventually causing data duplication in Zabbix database. Using comma-separated proxy addresses is thus not a working solution for proxy load balancing usage.

If the host-proxy assignments are changed by the Zabbix server for balancing the load between the proxies, the previously designated proxy will redirect the agent to the correct proxy address, and the situation is optimized again.

Side note: When configuring the proxies in Zabbix UI, there is a new Address for active agents field. That is the address value that is used by the proxies when responding with redirection messages to agents.

Proxy group failure scenarios with active agents

Proxy goes down

If the designated proxy of an active agent goes offline so that it doesn’t respond to the agent anymore, agent realizes the situation, discards the redirection information it had, and reverts to using the proxy addresses from ServerActive directive again.

Now, this is an interesting case because of some timing dependencies. In the proxy group configuration there is the Failover period configuration that controls the Zabbix server’s sensitivity to proxy availability in regards to agent rebalancing within the proxy group. Thus, if the agent reverts to using the other proxies faster than Zabbix server recognizes the situation and notifies the other proxies in the proxy group, the agent will get redirection responses from the other proxies, telling it to use the currently offline proxy. And the same happens again: agent fails to connect to the redirected proxy, and reverts to using the other locally configured proxies, and so on.

In my tests this looping was not very intense, only two rounds every second, so it was not very significant network-wise, and the situation will converge automatically when the Zabbix server has notified the proxies about the host rebalancing.

So this temporary looping is not a big deal. The takeaway is that the whole system converges automatically from a failed proxy.

After the failed proxy has recovered to online mode, the agents stay with their designated proxies in the proxy group.

As mentioned in the beginning, Zabbix server will automatically rebalance the hosts again after some time if needed.

Proxy is online but unreachable from the active agent

Another interesting case is one where the proxy itself is running and communicating with Zabbix server, thus being in online mode in the proxy group, but the active agent is not able to reach it, while still being able to connect to the other proxies in the group. This can happen due to various Internet-related routing issues for example, if the proxies are geographically distributed and far away from the agent.

Let’s start with the situation where the agent is currently monitored by Proxy 2 (as per the last picture above). When the failure starts and agent realizes that the connections to Proxy 2 are not succeeding anymore, the agent reverts to using the configured proxies in ServerActive, connecting to Proxy 1.

But, Proxy 1 knows (by the information given by Zabbix server) that Proxy 2 is still online and that the agent should be monitored by Proxy 2, so Proxy 1 responds to the agent with a redirection.

Obviously that won’t work for the agent as it doesn’t have connectivity to Proxy 2 anymore.

This is a non-recoverable situation (at least with the current Zabbix 7.0.0) while the reachability issue persists: The agent keeps on contacting Proxy 1, keeps receiving the redirection, and the same repeats over and over again.

Note that it does not matter if the agent is now locally reconfigured to only use Proxy 1 in this situation, because the load balancing of the hosts in the proxy group is not controlled by any of the agent-local configuration. The proxy group (led by Zabbix server) has the only authority to assign the hosts to the proxies.

One way to escape from this situation is to stop the unreachable Proxy 2. That way the Zabbix server will eventually notice that Proxy 2 is offline, and the hosts will be automatically rebalanced to other proxies in the group, thus removing the agent-side redirection to the unreachable proxy.

Keep this potential scenario in mind when planning proxy groups with proxy location diversity.

This is also something to think about if your Zabbix proxies have multiple network interfaces, where Zabbix server connectivity is using different interface from the agent connectivity. In that case the same problem can occur due to your own configurations.

Closing words

All in all, proxy load balancing looks very promising feature as it does not require any network-level tricks to achieve load balancing and high availability. In Zabbix 7.0 this is a new feature, so we can expect some further development for the details and behavior in the upcoming releases.


Appendix: Sample capture files

Ideally these capture files should be viewed with Wireshark version 4.3.0rc1 or newer because only the latest Wireshark builds include support for latest Zabbix protocol features. Wireshark 4.2.x should also show most of the Zabbix packet fields. Use display filter “zabbix” to see only the Zabbix protocol packets, but when examining cases more carefully you should also check the plain TCP packets (without any display filter) to get more understanding about the cases.

These samples are taken with Zabbix components version 7.0.0, using default timers in the Zabbix process configurations, and 20 seconds as the proxy group failover period.

Passive agent, with proxy failover

    • After frame #50 Proxy 1 was stopped and Proxy 2 eventually took over the monitoring

Active agent, with proxy failover

    • The agent initially communicates with Proxy 1
    • Proxy 1 was stopped before frame #425
    • Agent connected to Proxy 2, but Proxy 2 keeps sending redirects
    • Proxy 2 was assigned the agent before frame #1074, so it took over the monitoring and accepted the agent connections
    • Proxy 1 was later restarted (but agent didn’t try to connect to it yet)
    • The agent was manually restarted before frame #1498 and it connected to Proxy 1 again, was given a redirection to Proxy 2, and continued with Proxy 2 again

Active agent, with proxy unreachable

    • Started with Proxy 2 monitoring the agent normally
    • Network started dropping all packets from the agent to Proxy 2 before frame #179, agent started connecting to Proxy 1 almost immediately
    • From frame #181 on Proxy 1 responds with redirection to Proxy 2 (which is not working)
    • Proxy 2 was eventually stopped manually
    • Redirections continue until frame #781 when Proxy 1 is assigned the monitoring of the agent, and Proxy 1 starts accepting the agent requests

This post was originally published on the author’s blog.

The post Zabbix 7.0 Proxy Load Balancing appeared first on Zabbix Blog.

Make your interaction with Zabbix API faster: Async zabbix_utils.

Post Syndicated from Aleksandr Iantsen original https://blog.zabbix.com/make-your-interaction-with-zabbix-api-faster-async-zabbix_utils/27837/

In this article, we will explore the capabilities of the new asynchronous modules of the zabbix_utils library. Thanks to asynchronous execution, users can expect improved efficiency, reduced latency, and increased flexibility in interacting with Zabbix components, ultimately enabling them to create efficient and reliable monitoring solutions that meet their specific requirements.

There is a high demand for the Python library zabbix_utils. Since its release and up to the moment of writing this article, zabbix_utils has been downloaded from PyPI more than 15,000 times. Over the past week, the library has been downloaded more than 2,700 times. The first article about the zabbix_utils library has already gathered around 3,000 views. Among the array of tools available, the library has emerged as a popular choice, offering developers and administrators a comprehensive set of functions for interacting with Zabbix components such as Zabbix server, proxy, and agents.

Considering the demand from users, as well as the potential of asynchronous programming to optimize interaction with Zabbix, we are pleased to present a new version of the library with new asynchronous modules in addition to the existing synchronous ones. The new zabbix_utils modules are designed to provide a significant performance boost by taking advantage of the inherent benefits of asynchronous programming to speed up communication between Zabbix and your service or script.

You can read the introductory article about zabbix_utils for a more comprehensive understanding of working with the library.

Benefits and Usage Scenarios

From expedited data retrieval and real-time event monitoring to enhanced scalability, asynchronous programming empowers you to build highly efficient, flexible, and reliable monitoring solutions adapted to meet your specific needs and challenges.

The new version of zabbix_utils and its asynchronous components may be useful in the following scenarios:

  • Mass data gathering from multiple hosts: When it’s necessary to retrieve data from a large number of hosts simultaneously, asynchronous programming allows requests to be executed in parallel, significantly speeding up the data collection process;
  • Mass resource exporting: When templates, hosts or problems need to be exported in parallel. This parallel execution reduces the overall export time, especially when dealing with a large number of resources;
  • Sending alerts from or to your system: When certain actions need to be performed based on monitoring conditions, such as sending alerts or running scripts, asynchronous programming provides rapid condition processing and execution of corresponding actions;
  • Scaling the monitoring system: With an increase in the number of monitored resources or the volume of collected data, asynchronous programming provides better scalability and efficiency for the monitoring system.

Installation and Configuration

If you already use the zabbix_utils library, simply updating the library to the latest version and installing all necessary dependencies for asynchronous operation is sufficient. Otherwise, you can install the library with asynchronous support using the following methods:

  • By using pip:
~$ pip install zabbix_utils[async]

Using [async] allows you to install additional dependencies (extras) needed for the operation of asynchronous modules.

  • By cloning from GitHub:
~$ git clone https://github.com/zabbix/python-zabbix-utils
~$ cd python-zabbix-utils/
~$ pip install -r requirements.txt
~$ python setup.py install

The process of working with the asynchronous version of the zabbix_utils library is similar to the synchronous one, except for some syntactic differences of asynchronous code in Python.

Working with Zabbix API

To work with the Zabbix API in asynchronous mode, you need to import the AsyncZabbixAPI class from the zabbix_utils library:

from zabbix_utils import AsyncZabbixAPI

Similar to the synchronous ZabbixAPI, the new AsyncZabbixAPI can use the following environment variables: ZABBIX_URL, ZABBIX_TOKEN, ZABBIX_USER, ZABBIX_PASSWORD. However, when creating an instance of the AsyncZabbixAPI class you cannot specify a token or a username and password, unlike the synchronous version. They can only be passed when calling the login() method. The following usage scenarios are available here:

  • Use preset values of environment variables, i.e., not pass any parameters to AsyncZabbixAPI:
~$ export ZABBIX_URL="https://zabbix.example.local"
api = AsyncZabbixAPI()
  • Pass only the Zabbix API address as input, which can be specified as either the server IP/FQDN address or DNS name (in this case, the HTTP protocol will be used) or as an URL of Zabbix API:
api = AsyncZabbixAPI(url="127.0.0.1")

After declaring an instance of the AsyncZabbixAPI class, you need to call the login() method to authenticate with the Zabbix API. There are two ways to do this:

  • Using environment variable values:
~$ export ZABBIX_USER="Admin"
~$ export ZABBIX_PASSWORD="zabbix"

or

~$ export ZABBIX_TOKEN="xxxxxxxx"

and then:

await api.login()
  • Passing the authentication data when calling login():
await api.login(user="Admin", password="zabbix")

Like ZabbixAPI, the new AsyncZabbixAPI class supports version getting and comparison:

# ZabbixAPI version field
ver = api.version
print(type(ver).__name__, ver) # APIVersion 6.0.29

# Method to get ZabbixAPI version
ver = api.api_version()
print(type(ver).__name__, ver) # APIVersion 6.0.29

# Additional methods
print(ver.major)     # 6.0
print(ver.minor)     # 29
print(ver.is_lts())  # True

# Version comparison
print(ver < 6.4)        # True
print(ver != 6.0)       # False
print(ver != "6.0.24")  # True

After authentication, you can make any API requests described for all supported versions in the Zabbix documentation.

The format for calling API methods looks like this:

await api_instance.zabbix_object.method(parameters)

For example:

await api.host.get()

After completing all needed API requests, it is necessary to call logout() to close the API session if authentication was done using username and password, and also close the asynchronous sessions:

await api.logout()

More examples of usage can be found here.

Sending Values to Zabbix Server/Proxy

The asynchronous class AsyncSender has been added, which also helps to send values to the Zabbix server or proxy for items of the Zabbix Trapper data type.

AsyncSender can be imported as follows:

from zabbix_utils import AsyncSender

Values ​​can be sent in a group, for this it is necessary to import ItemValue:

import asyncio
from zabbix_utils import ItemValue, AsyncSender


items = [
    ItemValue('host1', 'item.key1', 10),
    ItemValue('host1', 'item.key2', 'Test value'),
    ItemValue('host2', 'item.key1', -1, 1702511920),
    ItemValue('host3', 'item.key1', '{"msg":"Test value"}'),
    ItemValue('host2', 'item.key1', 0, 1702511920, 100)
]

async def main():
    sender = AsyncSender('127.0.0.1', 10051)
    response = await sender.send(items)
    # processing the received response

asyncio.run(main())

As in the synchronous version, it is possible to specify the size of chunks when sending values in a group using the parameter chunk_size:

sender = AsyncSender('127.0.0.1', 10051, chunk_size=2)
response = await sender.send(items)

In the example, the chunk size is set to 2. So, 5 values passed in the code above will be sent in three requests of two, two, and one value, respectively.

Also it is possible to send a single value:

sender = AsyncSender(server='127.0.0.1', port=10051)
resp = await sender.send_value('example_host', 'example.key', 50, 1702511920))

If your server has multiple network interfaces, and values need to be sent from a specific one, the AsyncSender provides the option to specify a source_ip for sent values:

sender = AsyncSender(
    server='zabbix.example.local',
    port=10051,
    source_ip='10.10.7.1'
)
resp = await sender.send_value('example_host', 'example.key', 50, 1702511920)

AsyncSender also supports reading connection parameters from the Zabbix agent/agent2 configuration file. To do this, you need to set the use_config flag and specify the path to the configuration file if it differs from the default /etc/zabbix/zabbix_agentd.conf:

sender = AsyncSender(
    use_config=True,
    config_path='/etc/zabbix/zabbix_agent2.conf'
)

More usage examples can be found here.

Getting values from Zabbix Agent/Agent2 by item key.

In cases where you need the functionality of our standart zabbix_get utility but native to your Python project and working asynchronously, consider using the AsyncGetter class. A simple example of its usage looks like this:

import asyncio
from zabbix_utils import AsyncGetter

async def main():
    agent = AsyncGetter('10.8.54.32', 10050)
    resp = await agent.get('system.uname')
    print(resp.value) # Linux zabbix_server 5.15.0-3.60.5.1.el9uek.x86_64

asyncio.run(main())

Like AsyncSender, the AsyncGetter class supports specifying the source_ip address:

agent = AsyncGetter(
    host='zabbix.example.local',
    port=10050,
    source_ip='10.10.7.1'
)

More usage examples can be found here.

Conclusions

The new version of the zabbix_utils library provides users with the ability to implement efficient and scalable monitoring solutions, ensuring fast and reliable communication with the Zabbix components. Asynchronous way of interaction gives a lot of room for performance improvement and flexible task management when handling a large volume of requests to Zabbix components such as Zabbix API and others.

We have no doubt that the new version of zabbix_utils will become an indispensable tool for developers and administrators, helping them create more efficient, flexible, and reliable monitoring solutions that best meet their requirements and expectations.

The post Make your interaction with Zabbix API faster: Async zabbix_utils. appeared first on Zabbix Blog.

Extending Zabbix: the power of scripting

Post Syndicated from Giedrius Stasiulionis original https://blog.zabbix.com/extending-zabbix-the-power-of-scripting/27401/

Scripts can extend Zabbix in various different aspects. If you know your ways around a CLI, you will be able to extend your monitoring capabilities and streamline workflows related to most Zabbix components.

What I like about Zabbix is that it is very flexible and powerful tool right out of the box. It has many different ways to collect, evaluate and visualize data, all implemented natively and ready to use.

However, in more complex environments or custom use cases, you will inevitably face situations when something can’t be collected (or displayed) in a way that you want. Luckily enough, Zabbix is flexible even here! It provides you with ways to apply your knowledge and imagination so that even most custom monitoring scenarios would be covered. Even though Zabbix is an open-source tool, in this article I will talk about extending it without changing its code, but rather by applying something on top, with the help of scripting. I will guide you through some examples, which will hopefully pique your curiosity and maybe you will find them interesting enough to experiment and create something similar for yourself.

Although first idea which comes to ones mind when talking about scripts in Zabbix is most likely data collection, it is not the only place where scripts can help. So I will divide those examples / ideas into three sub categories:

  • Data collection
  • Zabbix internals
  • Visualization

Data collection

First things first. Data collection is a starting point for any kind of monitoring. There are multiple ways how to collect data in “custom” ways, but the easiest one is to use UserParameter capabilities. Basics of it are very nicely covered by official documentation or in other sources, e.g. in this video by Dmitry Lambert, so I will skip the “Hello World” part and provide some more advanced ideas which might be useful to consider. Also, the provided examples use common scripting themes/scenarios and you can find many similar solutions in the community, so maybe this will serve better as a reminder or a showcase for someone who has never created any custom items before.

Data collection: DB checks

There is a lot of good information on how to setup DB checks for Zabbix, so this is just a reminder, that one of the ways to do it is via custom scripts. I personally have done it for various different databases: MySQL, Oracle, PostgreSQL, OpenEdge Progress. Thing is ODBC is not always a great or permitted way to go, since some security restrictions might be in place and you can’t get direct access to DB from just anywhere you want. Or you want to transform your retrieved data in a ways that are complex and could hardly be covered by preprocessing. Then you have to rely on Zabbix agent running those queries either from localhost where DB resides or from some other place which is allowed to connect to your DB. Here is an example how you can do it for PostgreSQL

#!/bin/bash

my_dir="$(dirname ${0})"
conf_file="${my_dir}/sms_queue.conf"

[[ ! -f $conf_file ]] && echo -1 && exit 1

. ${conf_file}

export PGPASSWORD="${db_pass}"

query="SELECT COUNT(*) FROM sms WHERE sms.status IN ('retriable', 'validated');"

psql -h "${db_host}" -p "${db_port}" -U "${db_user}" -d "${db}" -c "${query}" -At 2>/dev/null

[[ $? -ne 0 ]] && echo -1 && exit 1

exit 0

Now what’s left is to feed the output of this script into Zabbix via UserParameter. Similar approach can be applied to Oracle (via sqlplus) or MySQL.

Data collection: log delay statistics

I once faced a situation when some graphs which are based on log data started having gaps. It meant something was wrong either with data collection (Zabbix agent) or with data not being there at the moment of collection (so nothing to collect). Quick check suggested it was the second one, but I needed to prove it somehow.

Since these log lines had timestamps of creation, it was a logical step to try to measure, how much do they differ from “current time” of reading. And this is how I came up with the following custom script to implement such idea.

First of all, we need to read the file, say once each minute. We are talking about log with several hundreds of thousands lines per minute, so this script should be made efficient. It should read the file in portions created between two script runs. I have explained such reading in details here so now we will not focus on it.

Next what this script does is it greps timestamps only from each line and counts immediately number of unique lines with the same timestamp (degree of seconds). That is where it becomes fast – it doesn’t need to analyze each and every line individually but it can analyze already grouped content!

Finally, delay is calculated based on the difference between “now” and collected timestamps, and those counters are exactly what is then passed to Zabbix.

#!/bin/bash

my_log="${1}"

my_project="${my_log##*\/}"
my_project="${my_project%%.log}"

me="$(basename ${0})"
my_dir="/tmp/log_delays/${my_project}"

[[ ! -d ${my_dir} ]] && mkdir -p ${my_dir}

# only one instance of this script at single point of time
# this makes sure you don't damage temp files

me_running="${my_dir}/${me}.running"

# allow only one process
# but make it more sophisticated:
# script is being run each minute
# if .running file is here for more than 10 minutes, something is wrong
# delete .running and try to run once again

[[ -f $me_running && $(($(date +%s)-$(stat -c %Y $me_running))) -lt 600 ]] && exit 1

touch $me_running

[[ "${my_log}" == "" || ! -f "${my_log}" ]] && exit 1

log_read="${my_dir}/${me}.read"

# get current file size in bytes

current_size=$(wc -c < "${my_log}")

# remember how many bytes you have now for next read
# when run for first time, you don't know the previous

[[ ! -f "${log_read}" ]] && echo "${current_size}" > "${log_read}"

bytes_read=$(cat "${log_read}")
echo "${current_size}" > "${log_read}"

# if rotated, let's read from the beginning

if [[ ${bytes_read} -gt ${current_size} ]]; then
  bytes_read=0
fi



# get the portion

now=$(date +%s)

delay_1_min=0
delay_5_min=0
delay_10_min=0
delay_30_min=0
delay_45_min=0
delay_60_min=0
delay_rest=0

while read line; do

  [[ ${line} == "" ]] && continue

  line=(${line})

  ts=$(date -d "${line[1]}+00:00" +%s)

  delay=$((now-ts))

  if [[ ${delay} -lt 60 ]]; then
    delay_1_min=$((${delay_1_min}+${line[0]}))
  elif [[ ${delay} -lt 300 ]]; then
    delay_5_min=$((${delay_5_min}+${line[0]}))
  elif [[ ${delay} -lt 600 ]]; then
    delay_10_min=$((${delay_10_min}+${line[0]}))
  elif [[ ${delay} -lt 1800 ]]; then
    delay_30_min=$((${delay_30_min}+${line[0]}))
  elif [[ ${delay} -lt 2700 ]]; then
    delay_45_min=$((${delay_45_min}+${line[0]}))
  elif [[ ${delay} -lt 3600 ]]; then
    delay_60_min=$((${delay_60_min}+${line[0]}))
  else
    delay_rest=$((${delay_rest}+${line[0]}))
  fi

done <<< "$(tail -c +$((bytes_read+1)) "${my_log}" | head -c $((current_size-bytes_read)) | grep -Po "(?<=timestamp\":\")(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?=\.)" | sort | uniq -c | sort -k1nr)"

echo "delay_1_min=${delay_1_min}
delay_5_min=${delay_5_min}
delay_10_min=${delay_10_min}
delay_30_min=${delay_30_min}
delay_45_min=${delay_45_min}
delay_60_min=${delay_60_min}
delay_rest=${delay_rest}"



rm -f "${me_running}"

exit 0

Now on Zabbix side, there is an item running this script and 7 dependent items, representing the degree of delay. Since there are many logs for which this data is collected, it is all put into LLD based on contents of specific directory:

vfs.dir.get[/var/log/logs,".*log$",,,,,1000]

This LLD then provides two macros:

And item prototypes will look like:

Those dependent items have one simple preprocessing step which takes needed number out of the script output:

So the final result is the nice graph in dashboard, showing exactly when and what degree delays do appear:

So as you see, it is relatively easy to collect just about any data you wish, once you know how. As you can see from these examples, it might be something more complex but it can also be just a simple one-liner – in any case it should be obvious that possibilities are endless when talking about scripts in data collection. If something is executable from the CLI and has a valuable output, go ahead and collect it!

Zabbix internals

Another area where scripts can be really useful is adjusting how Zabbix behaves or controlling this behavior automatically. And in this case, we will employ Zabbix API, since it’s designed exactly for such or similar purposes.

Zabbix internals: automatically disabling problematic item

In our environment, we have many logs to be analyzed. And some of them sometimes go crazy – something that we intend to catch starts appearing there too often and requires attention – typically we would have to adjust the regexp, temporarily suppress some patterns and inform responsible teams about too extensive logging. If you don’t (or can’t) pay attention quick, it might kill Zabbix – history write cache starts filling up. So what we do is automatically detect such an item with most values received during some most recent short period of time and automatically disable it.

First of all there are two items – the one measuring history write cache and the other one extracting top item in the given table

[root@linux ~]# zabbix_agentd -t zabbix.db.max[history_log,30] 2>/dev/null
zabbix.db.max[history_log,30] [t|463 1997050]
[root@linux ~]#

First number here is values gathered during provided period, second one is item id. The script behind this item looks like this

[root@linux ~]# grep zabbix.db.max /etc/zabbix/zabbix_agentd.d/userparameter_mysql.conf
UserParameter=zabbix.db.max[*],HOME=/etc/zabbix mysql -BN -e "USE zabbix; SELECT count(*), itemid FROM $1 WHERE clock >= unix_timestamp(NOW() - INTERVAL $2 MINUTE) GROUP BY itemid ORDER BY count(*) DESC LIMIT 1;"
[root@linux ~]#

And now relying on the history write cache item values showing us drop, we construct a trigger:

And as a last step, such trigger invokes action, which is running the script that disables the item with given ID with the help of Zabbix API, method “item.update”

Now we are able to avoid unexpected behavior of our data sources affecting Zabbix performance, all done automatically – thanks to the scripts!

Zabbix internals: add host to group via frontend scripts

Zabbix maintenance mode is a great feature allowing us to reduce noise or avoid some false positive alerts once specific host is known to have issues. At some point we found it would be convenient to be able to add (or remove) specific host into (from) maintenance directly from “Problems” window. And that is possible and achieved via a frontend script, again with the help of Zabbix API, this time methods “host.get”, “hostgroup.get”, “hostgroup.massadd” and “hostgroup.massremove”

Data visualization

Zabbix has many different widgets that are able to cover various different ways of displaying your collected data. But in some cases, you might find yourself missing some small type of “something” which would allow your dashboards to shine even more – at least I constantly face it. Starting From version 6.4 Zabbix allows you to create your own widgets but it might be not such a straightforward procedure if you have little or no programming experience. However, you can employ two already existing widgets in order to customize your dashboard look in pretty easy way.

Data visualization: URL widget

First one example is done using the URL widget. You might feed just about any content there, so if you have any web development skills, you can easily create something which would look like custom widget. Here is an example. I need a clock but not the one already provided by Zabbix as a separate clock widget – I want to have a digital clock and I also want this clock to have a section, which would display the employee on duty now and in an upcoming shift. So with a little bit of HTML, CSS and JavaScript / AJAX, I have this

With styles properly chosen, such content can be smoothly integrated into dashboards, along with other widgets.

Data visualization: plain text widget with HTML formatting

Another useful widget which is often overlooked is the “Plain text” widget – in combination with the following parameters:

It becomes a very powerful tool to display nicely formatted data snapshots. Simple yet very good example here would be to display some content, which requires human readable structure – a table.

So again, integration with other dashboard widgets is so smooth – with just some custom HTML / CSS around your data you wrap it into something that looks like brand new “table” widget. Isn’t it awesome? And you are of course not limited to tables… Just use your imagination!

Conclusion

Although I personally prefer bash as the first option to solve things, there is no big difference regarding which scripting or programming languages to choose when extending Zabbix in these ways. Just try anything you feel most comfortable with.

I hope that examples shown here inspired you in some ways. Happy scripting!

The post Extending Zabbix: the power of scripting appeared first on Zabbix Blog.

Introducing zabbix_utils – the official Python library for Zabbix API

Post Syndicated from Aleksandr Iantsen original https://blog.zabbix.com/python-zabbix-utils/27056/

Zabbix is a flexible and universal monitoring solution that integrates with a wide variety of different systems right out of the box. Despite actively expanding the list of natively supported systems for integration (via templates or webhook integrations), there may still be a need to integrate with custom systems and services that are not yet supported. In such cases, a library taking care of implementing interaction protocols with the Zabbix API, Zabbix server/proxy, or Agent/Agent2 becomes extremely useful. Given that Python is widely adopted among DevOps and SRE engineers as well as server administrators, we decided to release a library for this programming language first.

We are pleased to introduce zabbix_utils – a Python library for seamless interaction with Zabbix API, Zabbix server/proxy, and Zabbix Agent/Agent2. Of course, there are popular community solutions for working with these Zabbix components in Python. Keeping this fact in mind, we have tried to consolidate popular issues and cases along with our experience to develop as convenient a tool as possible. Furthermore, we made sure that transitioning to the tool is as straightforward and clear as possible. Thanks to official support, you can be confident that the current version of the library is compatible with the latest Zabbix release.

In this article, we will introduce you to the main capabilities of the library and provide examples of how to use it with Zabbix components.

Usage Scenarios

The zabbix_utils library can be used in the following scenarios, but is not limited to them:

  • Zabbix automation
  • Integration with third-party systems
  • Custom monitoring solutions
  • Data export (hosts, templates, problems, etc.)
  • Integration into your Python application for Zabbix monitoring support
  • Anything else that comes to mind

You can use zabbix_utils for automating Zabbix tasks, such as scripting the automatic monitoring setup of your IT infrastructure objects. This can involve using ZabbixAPI for the direct management of Zabbix objects, Sender for sending values to hosts, and Getter for gathering data from Agents. We will discuss Sender and Getter in more detail later in this article.

For example, let’s imagine you have an infrastructure consisting of different branches. Each server or workstation is deployed from an image with an automatically configured Zabbix Agent and each branch is monitored by a Zabbix proxy since it has an isolated network. Your custom service or script can fetch a list of this equipment from your CMDB system, along with any additional information. It can then use this data to create hosts in Zabbix and link the necessary templates using ZabbixAPI based on the received information. If the information from CMDB is insufficient, you can request data directly from the configured Zabbix Agent using Getter and then use this information for further configuration and decision-making during setup. Another part of your script can access AD to get a list of branch users to update the list of users in Zabbix through the API and assign them the appropriate permissions and roles based on information from AD or CMDB (e.g., editing rights for server owners).

Another use case of the library may be when you regularly export templates from Zabbix for subsequent import into a version control system. You can also establish a mechanism for loading changes and rolling back to previous versions of templates. Here a variety of other use cases can also be implemented – it’s all up to your requirements and the creative usage of the library.

Of course, if you are a developer and there is a requirement to implement Zabbix monitoring support for your custom system or tool, you can implement sending data describing any events generated by your custom system/tool to Zabbix using Sender.

Installation and Configuration

To begin with, you need to install the zabbix_utils library. You can do this in two main ways:

  • By using pip:
~$ pip install zabbix_utils
  • By cloning from GitHub:
~$ git clone https://github.com/zabbix/python-zabbix-utils
~$ cd python-zabbix-utils/
~$ python setup.py install

No additional configuration is required. But you can specify values for the following environment variables: ZABBIX_URL, ZABBIX_TOKEN, ZABBIX_USER, ZABBIX_PASSWORD if you need. These use cases are described in more detail below.

Working with Zabbix API

To work with Zabbix API, it is necessary to import the ZabbixAPI class from the zabbix_utils library:

from zabbix_utils import ZabbixAPI

If you are using one of the existing popular community libraries, in most cases, it will be sufficient to simply replace the ZabbixAPI import statement with an import from our library.

At that point you need to create an instance of the ZabbixAPI class. T4here are several usage scenarios:

  • Use preset values of environment variables, i.e., not pass any parameters to ZabbixAPI:
~$ export ZABBIX_URL="https://zabbix.example.local"
~$ export ZABBIX_USER="Admin"
~$ export ZABBIX_PASSWORD="zabbix"
from zabbix_utils import ZabbixAPI


api = ZabbixAPI()
  • Pass only the Zabbix API address as input, which can be specified as either the server IP/FQDN address or DNS name (in this case, the HTTP protocol will be used) or as an URL, and the authentication data should still be specified as values for environment variables:
~$ export ZABBIX_USER="Admin"
~$ export ZABBIX_PASSWORD="zabbix"
from zabbix_utils import ZabbixAPI

api = ZabbixAPI(url="127.0.0.1")
  • Pass only the Zabbix API address to ZabbixAPI, as in the example above, and pass the authentication data later using the login() method:
from zabbix_utils import ZabbixAPI

api = ZabbixAPI(url="127.0.0.1")
api.login(user="Admin", password="zabbix")
  • Pass all parameters at once when creating an instance of ZabbixAPI; in this case, there is no need to subsequently call login():
from zabbix_utils import ZabbixAPI

api = ZabbixAPI(
    url="127.0.0.1",
    user="Admin",
    password="zabbix"
)

The ZabbixAPI class supports working with various Zabbix versions, automatically checking the API version during initialization. You can also work with the Zabbix API version as an object as follows:

from zabbix_utils import ZabbixAPI

api = ZabbixAPI()

# ZabbixAPI version field
ver = api.version
print(type(ver).__name__, ver) # APIVersion 6.0.24

# Method to get ZabbixAPI version
ver = api.api_version()
print(type(ver).__name__, ver) # APIVersion 6.0.24

# Additional methods
print(ver.major)    # 6.0
print(ver.minor)    # 24
print(ver.is_lts()) # True

As a result, you will get an APIVersion object that has major and minor fields returning the respective minor and major parts of the current version, as well as the is_lts() method, returning true if the current version is LTS (Long Term Support), and false otherwise. The APIVersion object can also be compared to a version represented as a string or a float number:

# Version comparison
print(ver < 6.4)      # True
print(ver != 6.0)     # False
print(ver != "6.0.5") # True

If the account and password (or starting from Zabbix 5.4 – token instead of login/password) are not set as environment variable values or during the initialization of ZabbixAPI, then it is necessary to call the login() method for authentication:

from zabbix_utils import ZabbixAPI

api = ZabbixAPI(url="127.0.0.1")
api.login(token="xxxxxxxx")

After authentication, you can make any API requests described for all supported versions in the Zabbix documentation.

The format for calling API methods looks like this:

api_instance.zabbix_object.method(parameters)

For example:

api.host.get()

After completing all the necessary API requests, it’s necessary to execute logout() if authentication was done using login and password:

api.logout()

More examples of usage can be found here.

Sending Values to Zabbix Server/Proxy

There is often a need to send values to Zabbix Trapper. For this purpose, the zabbix_sender utility is provided. However, if your service or script sending this data is written in Python, calling an external utility may not be very convenient. Therefore, we have developed the Sender, which will help you send values to Zabbix server or proxy one by one or in groups. To work with Sender, you need to import it as follows:

from zabbix_utils import Sender

After that, you can send a single value:

from zabbix_utils import Sender

sender = Sender(server='127.0.0.1', port=10051)
resp = sender.send_value('example_host', 'example.key', 50, 1702511920)

Alternatively, you can put them into a group for simultaneous sending, for which you need to additionally import ItemValue:

from zabbix_utils import ItemValue, Sender


items = [
    ItemValue('host1', 'item.key1', 10),
    ItemValue('host1', 'item.key2', 'Test value'),
    ItemValue('host2', 'item.key1', -1, 1702511920),
    ItemValue('host3', 'item.key1', '{"msg":"Test value"}'),
    ItemValue('host2', 'item.key1', 0, 1702511920, 100)
]

sender = Sender('127.0.0.1', 10051)
response = sender.send(items)

For cases when there is a necessity to send more values than Zabbix Trapper can accept at one time, there is an option for fragmented sending, i.e. sequential sending in separate fragments (chunks). By default, the chunk size is set to 250 values. In other words, when sending values in bulk, the 400 values passed to the send() method for sending will be sent in two stages. 250 values will be sent first, and the remaining 150 values will be sent after receiving a response. The chunk size can be changed, to do this, you simply need to specify your value for the chunk_size parameter when initializing Sender:

from zabbix_utils import ItemValue, Sender


items = [
    ItemValue('host1', 'item.key1', 10),
    ItemValue('host1', 'item.key2', 'Test value'),
    ItemValue('host2', 'item.key1', -1, 1702511920),
    ItemValue('host3', 'item.key1', '{"msg":"Test value"}'),
    ItemValue('host2', 'item.key1', 0, 1702511920, 100)
]

sender = Sender('127.0.0.1', 10051, chunk_size=2)
response = sender.send(items)

In the example above, the chunk size is set to 2. So, 5 values passed will be sent in three requests of two, two, and one value, respectively.

If your server has multiple network interfaces, and values need to be sent from a specific one, the Sender provides the option to specify a source_ip for the sent values:

from zabbix_utils import Sender

sender = Sender(
    server='zabbix.example.local',
    port=10051,
    source_ip='10.10.7.1'
)
resp = sender.send_value('example_host', 'example.key', 50, 1702511920)

It also supports reading connection parameters from the Zabbix Agent/Agent2 configuration file. To do this, set the use_config flag, after which it is not necessary to pass connection parameters when creating an instance of Sender:

from zabbix_utils import Sender

sender = Sender(
    use_config=True,
    config_path='/etc/zabbix/zabbix_agent2.conf'
)
response = sender.send_value('example_host', 'example.key', 50, 1702511920)

Since the Zabbix Agent/Agent2 configuration file can specify one or even several Zabbix clusters consisting of multiple Zabbix server instances, Sender will send data to the first available server of each cluster specified in the ServerActive parameter in the configuration file. In case the ServerActive parameter is not specified in the Zabbix Agent/Agent2 configuration file, the server address from the Server parameter with the standard Zabbix Trapper port – 10051 will be taken.

By default, Sender returns the aggregated result of sending across all clusters. But it is possible to get more detailed information about the results of sending for each chunk and each cluster:

print(response)
# {"processed": 2, "failed": 0, "total": 2, "time": "0.000108", "chunk": 2}

if response.failed == 0:
    print(f"Value sent successfully in {response.time}")
else:
    print(response.details)
    # {
    #     127.0.0.1:10051: [
    #         {
    #             "processed": 1,
    #             "failed": 0,
    #             "total": 1,
    #             "time": "0.000051",
    #             "chunk": 1
    #         }
    #     ],
    #     zabbix.example.local:10051: [
    #         {
    #             "processed": 1,
    #             "failed": 0,
    #             "total": 1,
    #             "time": "0.000057",
    #             "chunk": 1
    #         }
    #     ]
    # }
    for node, chunks in response.details.items():
        for resp in chunks:
            print(f"processed {resp.processed} of {resp.total} at {node.address}:{node.port}")
            # processed 1 of 1 at 127.0.0.1:10051
            # processed 1 of 1 at zabbix.example.local:10051

More usage examples can be found here.

Getting values from Zabbix Agent/Agent2 by item key.

Sometimes it can also be useful to directly retrieve values from the Zabbix Agent. To assist with this task, zabbix_utils provides the Getter. It performs the same function as the zabbix_get utility, allowing you to work natively within Python code. Getter is straightforward to use; just import it, create an instance by passing the Zabbix Agent’s address and port, and then call the get() method, providing the data item key for the value you want to retrieve:

from zabbix_utils import Getter

agent = Getter('10.8.54.32', 10050)
resp = agent.get('system.uname')

In cases where your server has multiple network interfaces, and requests need to be sent from a specific one, you can specify the source_ip for the Agent connection:

from zabbix_utils import Getter

agent = Getter(
    host='zabbix.example.local',
    port=10050,
    source_ip='10.10.7.1'
)
resp = agent.get('system.uname')

The response from the Zabbix Agent will be processed by the library and returned as an object of the AgentResponse class:

print(resp)
# {
#     "error": null,
#     "raw": "Linux zabbix_server 5.15.0-3.60.5.1.el9uek.x86_64",
#     "value": "Linux zabbix_server 5.15.0-3.60.5.1.el9uek.x86_64"
# }

print(resp.error)
# None

print(resp.value)
# Linux zabbix_server 5.15.0-3.60.5.1.el9uek.x86_64

More usage examples can be found here.

Conclusions

The zabbix_utils library for Python allows you to take full advantage of monitoring using Zabbix, without limiting yourself to the integrations available out of the box. It can be valuable for both DevOps and SRE engineers, as well as Python developers looking to implement monitoring support for their system using Zabbix.

In the next article, we will thoroughly explore integration with an external service using this library to demonstrate the capabilities of zabbix_utils more comprehensively.

Questions

Q: Which Agent versions are supported for Getter?

A: Supported versions of Zabbix Agents are the same as Zabbix API versions, as specified in the readme file. Our goal is to create a library with full support for all Zabbix components of the same version.

Q: Does Getter support Agent encryption?

A: Encryption support is not yet built into Sender and Getter, but you can create your wrapper using third-party libraries for both.

from zabbix_utils import Sender

def psk_wrapper(sock, tls):
    # ...
    # Implementation of TLS PSK wrapper for the socket
    # ...

sender = Sender(
    server='zabbix.example.local',
    port=10051,
    socket_wrapper=psk_wrapper
)

More examples can be found here.

Q: Is it possible to set a timeout value for Getter?

A: The response timeout value can be set for the Getter, as well as for ZabbixAPI and Sender. In all cases, the timeout is set for waiting for any responses to requests.

# Example of setting a timeout for Sender
sender = Sender(server='127.0.0.1', port=10051, timeout=30)

# Example of setting a timeout for Getter
agent = Getter(host='127.0.0.1', port=10050, timeout=30)

Q: Is parallel (asynchronous) mode supported?

A: Currently, the library does not include asynchronous classes and methods, but we plan to develop asynchronous versions of ZabbixAPI and Sender.

Q: Is it possible to specify multiple servers when sending through Sender without specifying a configuration file (for working with an HA cluster)?

A: Yes, it’s possible by the following way:

from zabbix_utils import Sender


zabbix_clusters = [
    [
        'zabbix.cluster1.node1',
        'zabbix.cluster1.node2:10051'
    ],
    [
        'zabbix.cluster2.node1:10051',
        'zabbix.cluster2.node2:20051',
        'zabbix.cluster2.node3'
    ]
]

sender = Sender(clusters=zabbix_clusters)
response = sender.send_value('example_host', 'example.key', 10, 1702511922)

print(response)
# {"processed": 2, "failed": 0, "total": 2, "time": "0.000103", "chunk": 2}

print(response.details)
# {
#     "zabbix.cluster1.node1:10051": [
#         {
#             "processed": 1,
#             "failed": 0,
#             "total": 1,
#             "time": "0.000050",
#             "chunk": 1
#         }
#     ],
#     "zabbix.cluster2.node2:20051": [
#         {
#             "processed": 1,
#             "failed": 0,
#             "total": 1,
#             "time": "0.000053",
#             "chunk": 1
#         }
#     ]
# }

The post Introducing zabbix_utils – the official Python library for Zabbix API appeared first on Zabbix Blog.

Improving SNMP monitoring performance with bulk SNMP data collection

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/improving-snmp-monitoring-performance-with-bulk-snmp-data-collection/27231/

Zabbix 6.4 introduced major improvements to SNMP monitoring, especially when it comes to collecting large numbers of metrics from a single device. This is done by utilizing master-dependent item logic and combining it with low-level discovery and newly introduced preprocessing rules. This blog post will cover the drawbacks of the legacy SNMP monitoring approach, the benefits of the new approach, and the steps required to deploy bulk SNMP metric collection.

The legacy SNMP monitoring approach – potential pitfalls

Let’s take a look at the SNMP monitoring logic that all of us are used to. For our example here, we will look at network interface discovery on a network switch.

To start off, we create a low-level discovery rule. In the discovery rule, we specify which low-level discovery macros are collected from which OIDs. This way, we create multiple low-level discovery macro and OID pairs. Zabbix then goes through the list of indexes at the end of the specified OIDs and matches the collected values to low-level discovery macros. Zabbix also collects the list of discovered indexes for the specified OIDs and automatically matches them with the {#SNMPINDEX} low-level discovery macros.

An example of regular SNMP discovery key:

discovery[{#IFOPERSTATUS},1.3.6.1.2.1.2.2.1.8,{#IFADMINSTATUS},1.3.6.1.2.1.2.2.1.7,{#IFALIAS},1.3.6.1.2.1.31.1.1.1.18,{#IFNAME},1.3.6.1.2.1.31.1.1.1.1,{#IFDESCR},1.3.6.1.2.1.2.2.1.2,{#IFTYPE},1.3.6.1.2.1.2.2.1.3]
An example of regular SNMP low-level discovery rule

The collected low-level discovery data will look something like this:

[
{
"{#SNMPINDEX}":"3",
"{#IFOPERSTATUS}":"2",
"{#IFADMINSTATUS}":"1",
"{#IFALIAS}":"",
"{#IFNAME}":"3",
"{#IFDESCR}":"3",
"{#IFTYPE}":"6"
},
{
"{#SNMPINDEX}":"4",
"{#IFOPERSTATUS}":"2",
"{#IFADMINSTATUS}":"1",
"{#IFALIAS}":"",
"{#IFNAME}":"4",
"{#IFDESCR}":"4",
"{#IFTYPE}":"6"
},
{
"{#SNMPINDEX}":"5",
"{#IFOPERSTATUS}":"2",
"{#IFADMINSTATUS}":"1",
"{#IFALIAS}":"",
"{#IFNAME}":"5",
"{#IFDESCR}":"5",
"{#IFTYPE}":"6"
},
{
"{#SNMPINDEX}":"6",
"{#IFOPERSTATUS}":"2",
"{#IFADMINSTATUS}":"1",
"{#IFALIAS}":"",
"{#IFNAME}":"6",
"{#IFDESCR}":"6",
"{#IFTYPE}":"6"
},
{
"{#SNMPINDEX}":"7",
"{#IFOPERSTATUS}":"2",
"{#IFADMINSTATUS}":"1",
"{#IFALIAS}":"",
"{#IFNAME}":"7",
"{#IFDESCR}":"7",
"{#IFTYPE}":"6"
}
]

Once the low-level discovery rule is created, we move on to creating item prototypes.

Items created based on this item prototype will collect metrics from the OIDs specified in the SNMP OID field and will create an item per index ( {#SNMPINDEX} macro) collected by the low-level discovery rule. Note that the item type is SNMP agent – each discovered and created item will be a regular SNMP item, polling the device and collecting metrics based on the item OID.

Now, imagine we have hundreds of interfaces and we’re polling a variety of metrics at a rapid interval for each interface. If our device has older or slower hardware, this can cause an issue where the device simply cannot process that many requests. To resolve this, a better way to collect SNMP metrics is required.

Bulk data collection with master – dependent items

Before we move on to the improved SNMP metric collection approach, we need to first take a look at how master-dependent item bulk metric collection and low-level discovery logic are implemented in Zabbix.

  • First, we create a master item, which collects both the metrics and low-level discovery information in a single go.
  • Next, we create a low-level discovery rule of type dependent item and point at the master item created in the previous step. At this point, we need to either ensure that the data collected by the master item is formatted in JSON or convert the data to JSON by using preprocessing.
  • Once we have ensured that our data is JSON-formatted, we can use the LLD macros tab to populate our low-level discovery macro values via JSONPath. Note: Here the SNMP low-level discovery with bulk metric collection uses a DIFFERENT APPROACH, designed specifically for SNMP checks.
  • Finally, we create item prototypes of type dependent item and once again point them at the master item created in the first step (Remember – our master item contains not only low-level discovery information, but also all of the required metrics). Here we use JSONPath preprocessing together with low-level discovery macros to specify which values should be collected. Remember that low-level discovery macros will be resolved to their values for each of the items created from the item prototype.

Improving SNMP monitoring performance with bulk metric collection

The SNMP bulk metric collection and discovery logic is very similar to what is discussed in the previous section, but it is more tailored to SNMP nuances.

Here, to avoid excessive polling, a new walk[] item has been introduced. The item utilizes GetBulk requests with SNMPv2 and v3 interfaces and GetNext for SNMPv1 interfaces to collect SNMP data. GetBulk requests perform much better by design. A GetBulk request retrieves values of all instances at the end of the OID tree in a single go, instead of issuing individual Get requests per each instance.

To utilize this in Zabbix, first we have to create a walk[] master item, specifying the list of OIDs from which to collect values. The retrieved values will be used in both low-level discovery (e.g.: interface names) and items created from low-level discovery item prototypes (e.g.: incoming and outgoing traffic).

Two new preprocessing steps have been introduced to facilitate SNMP bulk data collection:

  • SNMP walk to JSON is used to specify the OIDs from which the low-level discovery macros will be populated with their values
  • SNMP walk value is used in the item prototypes to specify the OID from which the item value will be collected

The workflow for SNMP bulk data collection can be described in the following steps:

  • Create a master walk[] item containing the required OIDs
  • Create a low-level discovery rule of type dependent item which depends on the walk[] master item
  • Define low-level discovery macros by using the SNMP walk to JSON preprocessing step
  • Create item prototypes of type dependent item which depend on the walk[] master item, and use the SNMP walk value preprocessing step to specify which OID should be used for value collection

Monitoring interface traffic with bulk SNMP data collection

Let’s take a look at a simple example which you can use as a starting point for implementing bulk SNMP metric collection for your devices. In the following example we will create a master walk[] item, a dependent low-level discovery rule to discover network interfaces, and dependent item prototypes for incoming and outgoing traffic.

Creating the master item

We will start by creating the walk[] SNMP agent master item. The name and the key of the item can be specified arbitrarily. What’s important here is the OID field, where we will specify the list of comma separated OIDs from which their instance values will be collected.

walk[1.3.6.1.2.1.31.1.1.1.6,1.3.6.1.2.1.31.1.1.1.10,1.3.6.1.2.1.31.1.1.1.1,1.3.6.1.2.1.2.2.1.2,1.3.6.1.2.1.2.2.1.3]

The walk[] item will collect values from the following OIDs:

  • 1.3.6.1.2.1.31.1.1.1.6 – Incoming traffic
  • 1.3.6.1.2.1.31.1.1.1.10 – Outgoing traffic
  • 1.3.6.1.2.1.31.1.1.1.1 – Interface names
  • 1.3.6.1.2.1.2.2.1.2 – Interface descriptions
  • 1.3.6.1.2.1.2.2.1.3 – Interface types

SNMP bulk metric collection master walk[] item

Here we can see the resulting values collected by this item:

Note: For readability, the output has been truncated and some of the interfaces have been left out.

.1.3.6.1.2.1.2.2.1.2.102 = STRING: DEFAULT_VLAN
.1.3.6.1.2.1.2.2.1.2.104 = STRING: VLAN3
.1.3.6.1.2.1.2.2.1.2.105 = STRING: VLAN4
.1.3.6.1.2.1.2.2.1.2.106 = STRING: VLAN5
.1.3.6.1.2.1.2.2.1.2.4324 = STRING: Switch loopback interface
.1.3.6.1.2.1.2.2.1.3.102 = INTEGER: 53
.1.3.6.1.2.1.2.2.1.3.104 = INTEGER: 53
.1.3.6.1.2.1.2.2.1.3.105 = INTEGER: 53
.1.3.6.1.2.1.2.2.1.3.106 = INTEGER: 53
.1.3.6.1.2.1.2.2.1.3.4324 = INTEGER: 24
.1.3.6.1.2.1.31.1.1.1.1.102 = STRING: DEFAULT_VLAN
.1.3.6.1.2.1.31.1.1.1.1.104 = STRING: VLAN3
.1.3.6.1.2.1.31.1.1.1.1.105 = STRING: VLAN4
.1.3.6.1.2.1.31.1.1.1.1.106 = STRING: VLAN5
.1.3.6.1.2.1.31.1.1.1.1.4324 = STRING: lo0
.1.3.6.1.2.1.31.1.1.1.10.102 = Counter64: 0
.1.3.6.1.2.1.31.1.1.1.10.104 = Counter64: 0
.1.3.6.1.2.1.31.1.1.1.10.105 = Counter64: 0
.1.3.6.1.2.1.31.1.1.1.10.106 = Counter64: 0
.1.3.6.1.2.1.31.1.1.1.10.4324 = Counter64: 12073
.1.3.6.1.2.1.31.1.1.1.6.102 = Counter64: 0
.1.3.6.1.2.1.31.1.1.1.6.104 = Counter64: 0
.1.3.6.1.2.1.31.1.1.1.6.105 = Counter64: 0
.1.3.6.1.2.1.31.1.1.1.6.106 = Counter64: 0
.1.3.6.1.2.1.31.1.1.1.6.4324 = Counter64: 12457

By looking at these values we can confirm that the item collects values required for both the low-level discovery rule (interface name, type, and description) and the items created from item prototypes (incoming/outgoing traffic).

Creating the low-level discovery rule

As our next step, we will create a dependent low-level discovery rule which will discover interfaces based on the data from the master walk[] item.

Interface discovery dependent low-level discovery rule

The most important part of configuring the low-level discovery rule lies in defining the SNMP walk to JSON preprocessing step. Here we can assign low-level discovery macros to OIDs. For our example, we will assign the {#IFNAME} macro to the OID containig the values of interface names:

Field name: {#IFNAME}
OID prefix: 1.3.6.1.2.1.31.1.1.1.1
Dependent low-level discovery rule preprocessing steps

The name and the key of the dependent item can be specified arbitrarily.

Creating item prototypes

Finally, let’s create two dependent item prototypes to collect traffic data from our master item.

Here we will provide an arbitrary name and key containing low-level discovery macros. On items created from the item prototypes, the macros will resolve as our OID values, thus giving each item a unique name and key.

Note: The {#SNMPINDEX} macro is automatically collected by the low-level discovery rule and contains the indexes from the OIDs specified in the SNMP walk to JSON preprocessing step.

The final step in creating the item prototype is using the SNMP walk value preprocessing step to define which value will be collected by the item. We will also append the {#SNMPINDEX} macro at the end of the OID. This way, each item created from the prototype will collect data from a unique OID corresponding to the correct object instance.

Incoming traffic item prototype

Incoming traffic item prototype preprocessing step:

SNMP walk value: 1.3.6.1.2.1.31.1.1.1.6.{#SNMPINDEX}
Incoming traffic item preprocessing steps

 

Outgoing traffic item prototype

Outgoing traffic item prototype preprocessing step:

SNMP walk value: 1.3.6.1.2.1.31.1.1.1.10.{#SNMPINDEX}
Outgoing traffic item preprocessing steps

Note: Since the collected traffic values are counter values (always increasing), the Change per second preprocessing step is required to collect the traffic per second values.

Note: Since the values are collected in bytes, we will use the Custom multiplier preprocessing step to convert bytes to bits.

Final notes

And we’re done! Now all we have to do is wait until the master item update interval kicks in and we should see our items getting discovered by the low-level discovery rule.

Items created from the item prototypes

After we have confirmed that our interfaces are getting discovered and the items are collecting metrics from the master item, we should also implement the Discard unchanged with heartbeat preprocessing step on our low-level discovery rule. This way, the low-level discovery rule will not try and discover new entities in situations where we’re getting the same set of interfaces over and over again from our master item. This in turn improves the overall performance of internal low-level discovery processes.

Discard unchanged with heartbeat preprocessing on the low-level discovery rule

Note that we discovered other interface parameters than just the interface name – interface description and type are also collected in the master item. To use this data, we would have to add additional fields in the low-level discovery rule SNMP walk to JSON preprocessing step and assign low-level discovery macros to the corresponding OIDs containing this information. Once that is done, we can use the new macros in the item prototype to provide additional information in item name or key, or filter the discovered interfaces based on this information (e.g.: only discover interfaces of a particular type).

If you have any questions, comments, or suggestions regarding a topic you wish to see covered next in our blog, don’t hesitate to leave a comment below!

The post Improving SNMP monitoring performance with bulk SNMP data collection appeared first on Zabbix Blog.

Decrypting Zabbix TLS with Wireshark

Post Syndicated from Markku Leiniö original https://blog.zabbix.com/decrypting-zabbix-tls-with-wireshark/26832/

One of the built-in security features in Zabbix is TLS (Transport Layer Security) support for external connections. This means that when your distributed Zabbix proxies or Zabbix agents connect to the Zabbix server (or vice versa), TLS can be used to encrypt all the connections. When the connections are encrypted, third parties cannot read the Zabbix components’ communication, even though they would be able to catch the network traffic in some way.

In specific cases you may still want to inspect the encrypted traffic, for example to troubleshoot some problems with Zabbix agents or proxies. I already wrote a post about troubleshooting Zabbix agent with Wireshark, but the TLS encryption prevents anyone seeing the actual contents of the packets.

Since the traffic is encrypted in the Zabbix components (the server, agents and proxies), there still is a way for you, the Zabbix administrator, to intervene with the encryption so that you can get hold of the unencrypted traffic as well. In this post I will explain the process.

First, let’s demonstrate the TLS encryption between a Zabbix agent and a Zabbix server. I have configured the agent (Zabbix Agent 2 actually) with these lines:

Hostname=Zabbix70-agent
ServerActive=zabbixtest.lein.io
TLSConnect=psk
TLSPSKIdentity=agent-ident
TLSPSKFile=/etc/zabbix/psk

In this example I’m using TLS with pre-shared key (PSK), and the key itself is saved in /etc/zabbix/psk. My favorite way of generating a PSK is using OpenSSL:

markku@agent:~$ openssl rand -hex 32
afa34bf1104a1457e11e7d3a9b1ff7f5fb4f494c92ca1a8a9c5e1437f8897416
markku@agent:~$

The same key must also be configured on the Zabbix server frontend, see the Zabbix documentation for the PSK configuration details:

After the configurations I captured the Zabbix traffic for some time on the Zabbix server (using sudo tcpdump port 10051 -v -w zabbix70-tls-agent.pcap), stopped the capture, copied the capture file to my workstation, and opened it with Wireshark.

The capture file can be downloaded here:

Note: I recommend using Wireshark version 4.1.0 or later when analyzing captures containing Zabbix traffic because the built-in Zabbix protocol support was only added to Wireshark in version 4.1.0.

The packet list looks like this:

As we can see in the Protocol column, there are no Zabbix packets recognized in this capture, there are only TCP and TLS packets (the TCP-marked packets being the “empty” packets for negotiating the actual connectivity).

A side detail: Even though the traffic is encrypted, you can still see the configured Zabbix TLS PSK identity (“agent-ident” in my configuration above) in plain text inside the TLS Client Hello packets, if you ever need to check that in the traffic.

Now that we confirmed that TLS encryption is used and we cannot see the Zabbix traffic contents in the capture, let’s prepare the Zabbix server for the TLS decryption.

As I hinted in the beginning, since we have the TLS connection endpoints under our management, we can do tricks on the hosts to get the encryption keys. TLS negotiates the encryption keys dynamically for each connection, but there is a way to save the keys to a file so that we can later decrypt the captured traffic. (Note: I’m not a protocol-level TLS expert, so please forgive me any possible technical inaccuracies in the detailed explanations. I’ll just call “TLS keys” whatever is needed to get the encryption/decryption done.)

Peter Wu (who, in contrast to me, is a protocol-level TLS expert, and also one of the Wireshark core developers) has kindly published code for a helper library that makes it possible for us to save the TLS session keys on the TLS endpoint. In this demo I will save the keys on the Zabbix server, but the same could be done on the agents/proxies instead if needed.

First I’ll see the TLS library that my Debian-based Zabbix server is using:

markku@zabbixserver:~$ ldd /usr/sbin/zabbix_server | grep ssl
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f62ee47a000)
markku@zabbixserver:~$ dpkg -l libssl* | grep ^ii
ii  libssl3:amd64  3.0.9-1   amd64  Secure Sockets Layer toolkit - shared libraries
markku@zabbixserver:~$

To get and compile the helper library I’ll need to install some utilities:

markku@zabbixserver:~$ sudo apt install git gcc make libssl-dev
...
markku@zabbixserver:~$ dpkg -l libssl* | grep ^ii
ii  libssl-dev:amd64 3.0.9-1 amd64 Secure Sockets Layer toolkit - development files
ii  libssl3:amd64    3.0.9-1 amd64 Secure Sockets Layer toolkit - shared libraries
markku@zabbixserver:~$

I’ll the clone the Peter’s wireshark-notes repo to the server:

markku@zabbixserver:~$ git clone --depth=1 https://git.lekensteyn.nl/peter/wireshark-notes
Cloning into 'wireshark-notes'...
...
markku@zabbixserver:~$ cd wireshark-notes/src
markku@zabbixserver:~/wireshark-notes/src$ ls -l
total 28
-rw-r--r-- 1 markku markku   534 Oct  7 15:39 Makefile
-rw-r--r-- 1 markku markku 11392 Oct  7 15:39 sslkeylog.c
-rw-r--r-- 1 markku markku  7278 Oct  7 15:39 sslkeylog.py
-rwxr-xr-x 1 markku markku  2325 Oct  7 15:39 sslkeylog.sh
markku@zabbixserver:~/wireshark-notes/src$

Now I can compile the library and make it available on the server:

markku@zabbixserver:~/wireshark-notes/src$ make
cc   sslkeylog.c -shared -o libsslkeylog.so -fPIC -ldl
markku@zabbixserver:~/wireshark-notes/src$ sudo install libsslkeylog.so /usr/local/lib
markku@zabbixserver:~/wireshark-notes/src$ ls -l /usr/local/lib/libsslkeylog.so
-rwxr-xr-x 1 root root 17336 Oct  7 15:40 /usr/local/lib/libsslkeylog.so
markku@zabbixserver:~/wireshark-notes/src$ cd
markku@zabbixserver:~$

To use the helper library, a couple of environment variables need to be set. For Zabbix server the easy way is to edit the systemd configuration for zabbix-server service:

markku@zabbixserver:~$ sudo systemctl edit zabbix-server

In the editor that opens I’ll add these in the configuration:

[Service]
Environment=LD_PRELOAD=/usr/local/lib/libsslkeylog.so
Environment=SSLKEYLOGFILE=/tmp/tls.keys

The variables are kind of self-explanatory: Whenever Zabbix server service is started, the libsslkeylog.so library is loaded first, and the SSLKEYLOGFILE variable sets the location of the file where the keys will be saved.

Now the word of warning: The libsslkeylog.so library, when loaded by a process that uses TLS communication, will save the encryption/decryption keys of all the TLS sessions of the process to the configured file. This means that whoever gets that file and the saved TLS communication will be able to see the decrypted contents of the packets, defeating the whole idea of the TLS encryption. You really don’t want to do this TLS key saving for any longer periods of time. Be sure to remove the configurations (and restart the service) after you have inspected whatever you were inspecting in your system. Or, don’t do any of this at all.

After saving the configuration the Zabbix server needs to be restarted:

markku@zabbixserver:~$ sudo systemctl restart zabbix-server
markku@zabbixserver:~$

The TLS keys have now started being saved in the configured file:

markku@zabbixserver:~$ ls -l /tmp/tls.keys
-rw-rw-r-- 1 zabbix zabbix 10157 Oct  7 15:45 tls.keys
markku@zabbixserver:~$

At this point the Zabbix agent is still communicating actively with the Zabbix server, so I’ll take a new capture with tcpdump (sudo tcpdump port 10051 -v -w zabbix70-tls-agent-2.pcap).

After a short while I’ll stop the capture, and copy the capture file and the TLS key file on my workstation.

Now it’s a good time to disable the TLS key saving as well (besides containing sensitive data, the key file will also grow with each new TLS session so it can quickly get very large), so I’ll edit the Zabbix service configuration, remove the configured lines and restart the service:

markku@zabbixserver:~$ sudo systemctl edit zabbix-server
markku@zabbixserver:~$ sudo systemctl restart zabbix-server
markku@zabbixserver:~$

When opening the new capture file in Wireshark there is no immediate change in the packet list: the TLS packets are still shown encrypted. Wireshark needs to be specifically configured to read the TLS keys from the separate file.

In Wireshark, I’ll go to Edit – Preferences – Protocols – TLS:

There is the “(Pre)-Master-Secret log filename” field, I’ll use Browse button to select the copied tls.keys file, and save the configuration with OK.

At this point Wireshark reloads the capture file and the Zabbix agent TLS sessions will be decrypted:

Using the “zabbix” display filter will show just the Zabbix protocol packets:

When selecting a Zabbix protocol packet and looking at the packet details, in the lower right pane there are now three tabs: Frame (the encrypted TLS data), Decrypted TLS, and Uncompressed data.

This is because in this example the Zabbix agent 2 also compresses the traffic, and the compressed traffic is then encrypted when sending out to the network. Wireshark can interpret all this because of its built-in knowledge about TLS encryption and the Zabbix protocol structure, as well as the user-supplied TLS decryption keys.

We are now able to analyze the Zabbix agent communication with Wireshark even though the traffic was TLS-encrypted when we captured it.

One more trick about the TLS keys in Wireshark: It is also possible to save the keys inside the capture file when analyzing the traffic, instead of having the keys in a separate file (tls.keys in this example). I’ll go in Edit menu and select Inject TLS Secrets, and then save the capture file in pcapng format. Now the previously loaded keys are embedded in the capture file, and I can clear the “(Pre)-Master-Secret log filename” field in the TLS settings (as the filename setting is not useful in any later Wireshark analysis). The same can also be done in the command line by using editcap --inject-secrets (editcap is part of Wireshark install, see the manual page of editcap for more details).

Here is the second capture file of this demo, with the embedded TLS keys:

Finally some closing comments:

  • As demonstrated, when you have administrator/root-level access to the TLS session endpoint (Zabbix server in this example), there can be a possibility to save and decrypt the TLS sessions using external tooling. After all, TLS encryption is based on the negotiation between the TLS-connected endpoints, so if you are the TLS connection endpoint, you have ways to access the plaintext data. If you don’t have sufficient access to the TLS session endpoint, there is no way you can get the decryption keys mid-path.
  • Act responsibly when saving the TLS session keys for any traffic, on Zabbix server or otherwise. The encryption is there for a purpose, and saving the TLS keys always carries the risk that someone else gets access to data they wouldn’t have access to otherwise.
  • Do not save the TLS session keys with the capture file, unless you are dealing with a test/demo environment, like I had here.
  • When troubleshooting Zabbix connections, TLS decryption with Wireshark is not the only way. You should also consider if just increasing the logging level in the Zabbix components brings you enough information to solve your case, or maybe in some specific case you can just disable TLS encryption for an agent for a moment to not have to deal with the decryption at all. But again, usually the encryption is there for a purpose, so you need to evaluate your own situation.

The post Decrypting Zabbix TLS with Wireshark appeared first on Zabbix Blog.

Zabbix in: exploratory data analysis rehearsal – Part 3

Post Syndicated from Paulo R. Deolindo Jr. original https://blog.zabbix.com/zabbix-in-exploratory-data-analysis-rehearsal-part-2-2/26266/

Abstract

This will be the last blog post of the “Zabbix in… exploratory data analysis rehearsal” series. To continue our initial proposal, we’ll close the third and fourth data distribution moments concept. This time, we’ll talk about Skewness and Kurtosis.

Remember the first and second articles of the series to be aware of what we are discussing here.

The four moments for a data distribution

While the first moment helps us with the location estimate for some data distribution, the second moment works with its variance. The third moment, called asymmetry, allows us to understand the value trends and the degree of the asymmetry. The fourth moment is called kurtosis and is about the probability of the peak’s existence (outliers).

These four moments are not the final study about the data distribution. There is so much to learn and apply to data science when considering statistical concepts, but for now we must finish the initial proposal and bring forward some insights for decision makers.

Let’s get started!

Asymmetry

Based on our web application scenario, we can see a certain asymmetry in response time in most cases. This is normal and expected – so far, no problems. But it is also true that some symmetry is also possible in certain cases. Again, no problems here.

So, where is the problem? When does it happen?

Sometimes, the web application response time can be too different from the previous one, and we have no control over it. In these cases, the outliers must be found, and the correct interpretation must be applied. At that point, we must consider anomalies in the environment. Sometimes, the outliers are just a deviation. In all cases, we must pay attention and monitor the metrics that can make the difference.

Speaking of asymmetry – why is this topic so special? One of the possible answers is that we need to understand the degree of the asymmetry – whether it is high or moderated and whether the values were in most cases smaller than or bigger than the mean or median. In other words, what does the asymmetry say about the web application performance?

Let’s check some implementations.

The key skewness

From version 6.0, Zabbix introduced the item key skewness.

For example, it can be used like this:

skewness(/host/key,1h) # the skewness for the last hour until now

Now, let’s see how this formula could be applied to our scenario:

skewness(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h)

Using skewness and time shift “1h:now/h”, we are looking for some web application response time asymmetry at the previous hour.

The asymmetry can be negative (left skew), zero, positive (right skew) or undefined.

Definition: a left-skewed distribution is longer on the left side of its peak than on its right. In other words, a left-skewed distribution has a long tail on its left side.

Considering a left skew, it is possible to state that at the previous hour, the web application had more higher values than smaller values. This means that our web application does not perform as well as it should.

Look at the graph above. You can see some bars on the left side of the mean and other bars on the right side of the mean, with the same size as a mirror. We can consider this a normal distribution for web application response time, but it does not mean that the response times were good or bad – it only means that they had some balance and suggests more investigation.

Definition: a right-skewed distribution is longer on the right side of its peak than on its left. In other words, a right-skewed distribution has a long tail on its right side.

Considering a right skew, it is possible to state that at the previous hour, the web application had more smaller values than higher values. This means our web application performs as expected.

Value Map for Skewness

You must create in your template the following value map:

If you wish, the value map can also be as below:

“is greater than or equals”                                 0.1              à Mais tempos bons, se comparados à média

“equals”                                                               0               à Tempos de resposta simétricos ou bem distribuídos em bons e ruins

“is less than or equals”                                       0                à Mais tempos ruins, se comparados à média

Pearson Skewness Coefficient

The Pearson’s Coefficient is a very interesting indicator. Considering some skewness for a data distribution, it tells us if the asymmetry is strong or only moderate.

We can create a calculated item for the Pearson’s Coefficient:

(3*(avg-median))/stddevpop

In Zabbix, we need:

  • One item for the response time average considering the previous hour:
    • Key: resp.time.previous.hour
      • Formula: trendavg(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h)
    • One item for calculating the median (for percentile, see the previous blog post):
      • Key: response.time.previous.hour
        • Formula: (last(//p51.previous.hour)+last(//p50.previous.hour))/2
      • One item for the standard deviation calculation:
        • Key:
          • Formula: response.time.previous.hour stddevpop(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h)

Finally, a calculated item for a Pearson’s Coefficient:

  • Key: coefficient.requests.previous.hour
    • Formula:
      ((last(//trendavg.requests.per.minute)-
      last(//median.access.previous.hour))*3)
      /
      last(//stddevpop.requests.previous.hour)

To finish the exercise, create a value map:

Curtose

Kurtosis is the fourth data distribution and can indicate if the values are prone to peaks.

In Zabbix, you can implement or calculate kurtosis with other calculated items:

kurtosis(/host/key,1h)

Um valor negativo de curtose nos diz que a distribuição não está propensa ou produziu poucos outliers. Já um valor positivo de curtose nos diz que a distribuição está propensa ou produziu muitos outliers. Tudo gira em torno da média da distribuição de dados. Já um valor neutro, ou zero, nos diz que a distribuição é considerada simétrica.

Para nosso cenário web, temos:

kurtosis(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h)

Explanatory Dashboard

Considering the image above, the data distribution for the web application response time at the previous hour has a left skew. This suggests that, when considering the response time mean at the previous hour, there were more higher values than smallest values. That’s sad – our web application performed poorly. Why?

However, because the skewness was moderated (considering the Pearson’s Coefficient), it means that the response time values were not so different from the others at the same hour.

As for kurtosis, we can say that the data distribution is prone to peaks because we have positive kurtosis. In basic statistics, the values are near the mean and there is a high probability of producing outliers.

This is a point to pay attention to if you are looking at a critical service.

Looking at other metrics

To check and validate our interpretation, let’s visualize the collected values on a simple Zabbix graph, using a graph widget.

Please consider using the following “time period” configuration:

The graph will show only the data collected at the previous hour – in this case, from 11:00 to 11:59.

The graph in Zabbix allows us to visualize the 50th percentile as well. We do not display the mean on the graph because it wouldn’t be well represented visually as it was collected only once, thus lacking an interesting visual trend line like the 50th percentile. However, notice that the mean and the 50th percentile values are very close, which will give us an idea of the data distribution around this measure.

Partial Conclusion

Skewness and Kurtosis, respectively, are the third and fourth moments of a data distribution. They help us understand the environment’s behavior and allow us to gain insight into a lot of things (in this case, we simply applied these concepts to IT infrastructure monitoring and focused on our web application to analyze its performance).

In most cases, the asymmetry will exist – it’s normal and expected. However, knowing some properties of the skewness can help us understand the response times, indicating good or poor performance. The skewness coefficient allows us to know if the asymmetry was strong or just moderated. Meanwhile, kurtosis helps us to understand if the data distribution produced some peaks considering an observation period or whether it is prone to produce peaks. We can then create some triggers for that and avoid some undesirable behaviors in the future, based on our data distribution observation. This is applied data science at its best.

Conclusion

Data science can be easily applied to (and bring up insights about) Zabbix and its Aggregate functions. It’s true that there are some special functions such as skewness, kurtosis, stddevpop, stddevsamp, mad, and so on, but there are old functions to help us too, such as percentile, forecast, timeleft, etc. All these functions must be used in calculated item formulas.

One of the interesting advantages of using Zabbix in data science and performing an exploratory data analysis is the fact that Zabbix can monitor everything. This means that the database already exists with the relevant date to analyze, in real time.

In the blog posts in this series, we took as a basis some data referring to the previous hour, the previous day, and so on, but we did nothing regarding the current hour. If we applied the concepts studied in real time data, we would have other “live” results, including support for decision-making. This is because we will not only study historical data, but instead will have the opportunity to change the course of events.

Zabbix is improving dashboards significantly. From Zabbix 6.4, we have many new out-of-the-box widgets and the possibility to create our own. However, there is a concern – Zabbix administrators sometimes show unnecessary data in dashboards, which can cloud the decision-making process. Zabbix administrators, in general, might want to learn storytelling techniques to rectify this situation. Maybe in a perfect world!

I hope you have enjoyed this blog post series.

Keep studying!

 

The post Zabbix in: exploratory data analysis rehearsal – Part 3 appeared first on Zabbix Blog.

Zabbix in: exploratory data analysis rehearsal – Part 2

Post Syndicated from Paulo R. Deolindo Jr. original https://blog.zabbix.com/zabbix-in-exploratory-data-analysis-rehearsal-part-2/26151/

Abstract

In the previous blog post, we just explored some of the basic statistic concepts to estimate KPIs for a web application response time: in that case, average, median and percentile. Additionally, we improved the nginx out-of-the-box template and showed some results in simple dashboards. Now we must continue our work but this time, analyzing some variances of the collected metrics, considering a certain period.

Please, read the previous blog post to understand the context in a better way. I wish you a good reading.

A little about basic statistic

In basic statistic, a data distribution has at least four moments:

  • Location estimate
  • Variance
  • Skewness
  • Kurtosis

In the previous blog post, we introduced the 1st moment, knowing some estimates of our data distribution. It means that we have analyzed some values of our web application response time. It reveals that the response time can have minimum and maximum values, average, a value that can represent a central value of the distribution and so on. Some metrics, such as average, can be influenced by outliers, but other metrics do not, such as 50th percentile or median. To conclude, now we know something about the variance of those values, but it isn’t enough. Let’s check the 2nd moment of the data distribution: Variance.

Variance

So, we have some notion about the variance of the web application response time, meaning that it can have some asymmetry (in most cases) and we also know that some KPIs must be considered but, which of them?

In exploratory data analysis, we can discover some key metrics but, in most cases, we won’t use all of them, so we have to know each one’s relevance to choose properly which metric can represent the reality of our scenario.

Yes! There are some cases when some metrics must be added to other metrics so that they make sense otherwise, we can discard them: we must create and understand the context for all those metrics.

Let’s check some concepts of the variance:

  • Variance
  • Standard deviation
  • Median absolute deviation (MAD)
  • Amplitude
  • IQR – Interquartile range

Amplitude

This concept is simple and its formula too: it is the difference between the maximum and the minimum value in a data distribution. In this case, we are talking about data distribution at the previous hour (,1h:now/h). We are interested in knowing the range of variation in response times in that period.

Let’s create a Calculated item to Amplitude metric in “Nginx by HTTP modified” template.

  • trendmax(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h)
  • trendmin(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h)

In other terms, it could be:

  • max(/host/key)-min(/host/key)

However, we are analysing a data distribution based on the previous hour, so…

  • trendmax(/host/key,1h:now/h)-trendmin(/host/key,1h:now/h)

Modifying our dashboard, we’ll see something like this:

 

This result interpretation could be: between the worst and the best response times, the variance is too small. It means, during that hour the response times had no significant differences.

However, amplitude itself is not enough to get some web application diagnosis at that moment. It’s necessary to combine this result with other results and we’ll see how to do it.

To complement, we can create some triggers based on it:

  • Fire if the response time amplitude was bigger than 5 seconds at the previous hour. It means that the web application did not perform as expected considering the web application requests.
    • Expression = last(/Nginx by HTTP modified/amplitude.previous.hour)>5
    • Level = Information
  • Fire if the response time amplitude reaches 5 seconds at least 3 consecutive times. It means at the last 3 hours, there was too much variance among the web application response times and it is not the expected.
    • Expression = max(/Nginx by HTTP modified/amplitude.previous.hour,#3)>5
    • Level = Warning

Remember, we are evaluating the previous hour and it makes no sense to generate this metric every single minute. Let’s create a Custom interval period for it.

 

By doing it, we are avoiding flapping on triggers environment.

IQR – Interquartile range

Consider these values below:

3, 5, 2, 1, 3, 3, 2, 6, 7, 8, 6, 7, 6

Open the shell environment. Create the file “vaules.txt” and insert each one, one per line. Now, read the file:

# cat values.txt

3
5
2
1
3
3
2
6
7
8
6
7
6

Now, send the value to Zabbix using Zabbix sender:

# for x in `cat values.txt`; do zabbix_sender -z 127.0.0.1 -s “Web server A” -k input.values -o $x; done

Look at the historical data using Zabbix frontend.

 

Now, let’s create some Calculated items to 75th  percentile and 25th  percentile.

  • Key: iqr.test.75
    Formula: percentile(//input.values,#13,75)
    Type: Numeric (float)
  • Key: iqr.test.25
    Formula: percentile(//input.values,#13,25)
    Type: Numeric (float)

If we apply it on a Linux terminal the command “sort values.txt”, we’ll get the same values ordered by size. Let’s check:

# sort values.txt

 

We’ll use the same concept here.

From the left to the right, go to the 25th percentile. You will get the number 3.

Do it again, but this time go to the 50th percentile. You will get the number 5.

And again, go to the 75th percentile. You will get the number 6.

The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). So, we are excluding the outliers (the smallest values on the left and the biggest values on the right).

To calculate the IQR, you can create the following Calculate item:

  • key: iqr.test
    Formula: last(//iqr.test.75)-last(//iqr.test.25)

Now, we’ll apply this concept in Web Application Response Time.

The Calculated Item for the 75th percentile:

key: percentile.75.response.time.previous.hour
Formula: percentile(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h,75)

The Calculated item for the 25th percentile:

key: percentile.25.response.time.previous.hour
Formula: percentile(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h,25)

The Calculate item for the IQR:

key: iqr.response.time.previous.hour
Formula: last(//percentile.75.response.time.previous.hour)-last(//percentile.25.response.time.previous.hour)

Keep the monitoring schedule at the 1st minute for each hour, to avoid repetition (it’s very important) and adjust the dashboard.

Considering the worst web response time and the best web response time at the previous hour, the AMPLITUDE returns a big value in comparison to the IQR and it happens because the outliers were discarded in IQR calculation. So, just as the mean is a location estimate that is influenced by outliers and the median is not, so are the RANGE and IQR. The IQR is a robust indicator and allows us to know the difference between the web response time variance in a central position.

P.S.: we are considering only the previous hour, however, you can apply the IQR concept to an entire period, such as the previous day, or the previous week, the previous month and so on, using the correct time shift notation. You can use it to compare the web application response time variance between the periods you wish to observe and then, get some insights about the web application behavior at different times and situations.

Variance

The variance is a way to calculate the dispersion of data, considering its average. In Zabbix, calculating the variance is simple, since there is a specific formula for that, through a Calculated item.

The formula is the following:

  • Key: varpop.response.time.previous.hour
    Formula: varpop(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h)

In this case, the formula returns the dispersion of data, however, there is a characteristic: at some point, the data is squared and then, the data scale changes.

Let’s check the steps for calculating the variance of the data:

1st) Calculate the mean;
2nd) Subtract the mean of each value;
3rd) Square each subtraction result;
4th) Perform the sum of all squares;
5th) Divide the result of the sum by the total observations.

At the 3rd step, we have the scale change. This new data can be used to other calculations in the future.

Standard Deviation

The root square of the variance.

Calculating the root square of the variance, the data can come back to its original scale!

There are at least two ways to do it:

  1. Using the root square key and formula in Zabbix:
    1. Key: varpop.previous.hour
    2. Formula: sqrt(last(//varpop.response.time.previous.hour))
  2. Using the standard deviation key and formula in Zabbix:
    1. Key:previous.hour
    2. Formula: stddevpop(//host,key,1h:now/h) # an example for the previous hour

A simple way to understand the standard deviation concept is: a way of knowing how “far” values are from the average. So, applying the specific formula, we’ll get that indicator.

Look at this:

The image above is a common image that can be found on the Internet, and it can help us understand some results. The standard deviation value must be near “zero”, otherwise, we’ll have serious deviations.

Let’s check the following Calculated item:

  • stddevpop(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h)

We are calculating the standard deviation based on the collected values at the previous hour. Let’s check the Test item on the frontend:

The test returned some value less than 1, at about 0.000446. If this value is less than 1, we don’t have a complete deviation and it means that the collected values at the previous hour are near the average.

For a web application response time, it can represent a good behavior, with no significant variances and as expected, off course, other indicators must be checked to a complete and reliable diagnosis.

Important notes about standard deviation:

  • Sensitive to outliers
  • Can be calculated based on the population of a data distribution or based on its sample.
    • Using this formula: stddevsamp. In this case, it can return a different value from the the previous one.

Median Absolute Deviation (MAD)

While the Standard Deviation is a simple way to understand if the data of a distribution are far from its mean, MAD help us understand if these values are far from its median. So, MAD can be considered a robust estimate, because is not sensitive to outliers.

Warning: If you need to identify outliers or consider them in your analysis, the MAD function is not recommended, because it ignores them.

Let’s check our dashboard and compare different deviation calculations for the same data distribution:

Note that the last one is based on MAD function, and it is less than the other items, just because it is not considering outliers.

In this particular case, the web application is stable, and its response times are near to the mean or median (considering the MAD algorithm).

Exploratory Dashboard

Partial conclusion

In this post, we have just introduced the data distribution moments, presenting the variability or variance concept and then, we learned some techniques to achieve some KPIs or indicators.

What do we know? The response time for a web application can be different from the previous one and so on, so the knowledge of the variance can help us understand about the application behaviors using some extraordinary data. Then, we can decide if an application had a good or poor performance, for example.

Of course, it was a didactic example for some data distribution and the location estimate and variance concepts can be applied to other data exploratory analysis considering a long period, such as days, weeks, months, years, and so on. In those case, is very important consider using trends instead of history data.

Our goal is bringing to the light some extraordinary data and insights, instead of common data, allow us knowing better our application.

In the next posts, we’ll talk about Skewness and Kurtosis, the 3rd and 4th moments for a data distribution, respectively.

The post Zabbix in: exploratory data analysis rehearsal – Part 2 appeared first on Zabbix Blog.

Zabbix in: exploratory data analysis rehearsal – Part 1

Post Syndicated from Paulo R. Deolindo Jr. original https://blog.zabbix.com/zabbix-in-exploratory-data-analysis-rehearsal-part-1/25802/

Abstract

Imagine your happiness when you start a new enterprise device and application monitoring project using Zabbix[i]. Indeed, doing this is so easy that the first results bring up a lot of satisfaction very quickly. For example, when you apply a specific template [ii]in a specific host and the data comes (like magic) and you can create some dashboards with these data and visualize then.

If you haven’t done this yet, you must try it as soon as possible. You can create a web server host using both Apache or Nginx web services, applying the appropriate template and getting metrics by HTTP checks: “Apache by HTTP” template or “Nginx by HTTP” template. You will see interesting metrics being collected and you will be able to create and view some graphs or dashboards. But the work is not finished yet, because using Zabbix, you can do much more!

In this article, I’ll talk about how we can think of new metrics, new use cases, how to support our business and help the company with important results and insights using exploratory data analysis introducing and implementing some data science concepts using only Zabbix.

What is our goal?

Testing and learning some new Zabbix functions introduced from 6.0 version, compare some results and discuss insights.

Contextualizing

Let’s keep the focus on the web server metrics. However, all the results of this study can be used later in different scenarios.

The web server runs nginx version 1.18.0 and we are using “Nginx by HTTP” template to collect the following metrics:

  • HTTP agent master item: get_stub_status
  • Dependent items[i]:

Nginx: Connections accepted per second

Nginx: Connections active

Nginx: Connections dropped per second

Nginx: Connections handled per second

Nginx: Connections reading

Nginx: Connections waiting

Nginx: Connections writing

Nginx: Requests per second

Nginx: Requests total

Nginx: Version

  • Simple check items:

Nginx: Service response time

Nginx: Service status

 

That are the possibilities at the moment and below we have a simple dashboard created to view the initial results:

All widgets are reflecting metrics collected by using out-of-the-box “Nginx by HTTP” template.

Despite being Zabbix specialist and having some knowledge about our monitored application, there are some questions we need to ask ourselves. These questions do not need to be exhaustive, but they are relevant for our exercise. So, let’s jump to the next topic.

Generating new metrics! Bringing up some thoughts!

Let’s think about the collected metrics in the beginning of this monitoring project:

  1. Why the does number of requests only increase?
  2. When did we have more or fewer connections, considering for example, the last hour?
  3. What’s the percentage change comparing the current hour with the previous one?
  4. Which value can be representing the best or the worst response time performance?
  5. Considering some collected values, can we predict an application downtime?
  6. Can we detect anomalies in the application based on the amount of collected values and application behavior?
  7. How to stablish a baseline? Is it possible?

These are some questions we need to answer using this article and the next ones to come.

Generating new metrics

1st step: Let’s create a new template. From “nginx by HTTP”, clone it and change its name to “Nginx by HTTP modified”;

2nd step: Modify the “Nginx: Requests total” item, adding a new pre-processing step: “Simple change”. It will look like the image below:

It’s a Dependent item from the Master item “Nginx: Get stub status page” and the last one is based on HTTP agent to retrieve the main metric. So, if the number of the total connections always increase, the current value will be decreased from the last collected value. A simple mathematical operation: subtraction. And then, from this moment on we’ll have the number of new connections per minute.

The formula for the “Simple change” pre-processing step can be represented using the following images:

I also suggest you change the name of the item to: “Nginx: Requests at last minute”.

I can add some Tags[i] too. These tags can be used in the future to filter the views and so on.

Same metrics variations

With the modified nginx template we can retrieve how many new connections our web application receives per minute and then, we can create new metrics from the previous one. Using Zabbix timeshift[i] [ii]function, we can create metrics such as the number of connections:

  • At the last hour
  • Today so far and Yesterday
  • This week and at the previous week
  • This month and the previous month
  • This year and at the previous year
  • Etc

This exercise can be very interesting. Let’s create some Calculated items with the following formulas:

sum(//nginx.requests.total,1h:now/h)                                                    # Somatório de novas conexões na hora anterior

sum(//nginx.requests.total,1h:now/h+1h)                                              # Somatório de novas conexões da hora atual

In Zabbix official documentation we have lots of examples to creating Calculated items using “time shift” parameter. Please, see this link.

Improving our dashboard

Using the new metrics, we can improve our dashboard and enhance the data visualization. Briefly:

The same framework could be used to show the daily, weekly, monthly and yearly data, depending on your business rule, of course. Please, be patient because some items will have some delay in collecting operation (monthly, yearly, etc).

Basic statistics metrics using Zabbix

As we know, it is perfectly possible to generate some statistics values with Zabbix by using Calculated items. However, there are questions that can guide us to other thoughts and some answers will come in format of metrics, again.

  1. Today, which response time was the best?
  2. And if we think about the worst response time?
  3. And about the average?

We can start with these basic statistics and then, growing up latter.

All data in dashboard above were retrieved using simple Zabbix functions.

 

The best response time today so far.

min(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1d:now/d+1d)

The worst response time today so far.

max(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1d:now/d+1d)

The average of the response time today so far.

avg(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1d:now/d+1d)

 

It’s ok. Nothing is new, so far. But let’s check some thoughts.

Why we are Looking for the best, the worst and the average using min, max e avg functions, instead of  trendmin, trendmax e trendavg functions? The Trend-based functions retrieve data from trends tables, while History-based functions calculates in “real time”. If you wish to use History-based functions to calculate something in a short period, ok. But if you wish to use it to calculate some values considering a long period such as month or year… hum! It can be complicated, and it can take a lot of resources of your infrastructure power.

We need to remember an important thing: to use Trend-based functions, we must consider only data retrieved until the last full hour, because we have to consider the trend-cache sync process.

Look at the dashboard below, this time, using Trend-based Functions for the statistics.

Look at the current results. Basically, they are the same. There aren’t so many differences and, I guess, using it’s an intelligent way to retrieve the desired values.

Insights

If a response time is too short such as 0.06766 (the best of the day) and another value is too big and is representing the worst response time, such as 3.1017, can you imagine which and how many values exist between then?

How to calculate de average? You know: the sum of all collected values within a period, divided by the number of values.

So far, so good. The avg or trendavg functions can retrieve this average based on the desired period. However, if you look at the graph above, you will see some “peaks” in certain periods. These “peaks” are called “outliers”. The outliers are influencers of the average.

The outliers are important but because it exists, the average sometimes may not represent the reality. Think about this: a response time of the web application having stayed between 0.0600 and 0.0777 at the previous hour. During a specific minute within the same monitored period, for some reason, the response time was 3.0123. In this case, the average will increase. But, what if we discard the outlier? Obviously, the average will be as expected. In this case, the outlier was a deviation, “an error in the matrix”. So, we need to be able to calculate de average or other estimative location for our values, without the outlier.

And we cannot forget: if we are looking for anomalies based on the web application response time, we need to consider the outliers. If not, I guess, outliers can be removed on the calculation for now.

Ok! Outliers can influence the common average. So, how can we calculate something without the outliers?

Introduction to Median

About data timeline, we can affirm that the database is respecting the collected timestamp. Look at the collected data below:

2023-04-06 16:24:50 1680809090 0.06977
2023-04-06 16:23:50 1680809030 0.06981
2023-04-06 16:22:50 1680808970 0.07046
2023-04-06 16:21:50 1680808910 0.0694
2023-04-06 16:20:50 1680808850 0.06837
2023-04-06 16:19:50 1680808790 0.06941
2023-04-06 16:18:53 1680808733 3.1101
2023-04-06 16:17:51 1680808671 0.06942
2023-04-06 16:16:50 1680808610 0.07015
2023-04-06 16:15:50 1680808550 0.06971
2023-04-06 16:14:50 1680808490 0.07029

For the average, the timestamp or the collected order will not be important. However, if we ignore its timestamp and order the values from smallest to biggest, we’ll get something like this:

0.06837 0.0694 0.06941 0.06942 0.06971 0.06977 0.06981 0.07015 0.07029 0.07046 3.1101

Table 1.0 – 11 collected values, ordered by from smallest to biggest

In this case, the values are ordered by from the smallest one to the biggest one, ignoring their timestamp.

Look at the outlier at the end. It’s not important for us right now.

The timeline has an odd number of values and the value in green, is the central value. The Median. And what if it was an even number of values? How could we calculate the median? There is a formula for it.

0.0694 0.06941 0.06942 0.06971 0.06977 0.06981 0.07015 0.07029 0.07046 3.1101

Table 2.0 – 10 collected values, ordered by from smallest to biggest

Now, we have two groups of values. There is not a central position.

This time, we can use the median formula (in general): Calculate de average for the last value for the “Group A” and the first value for the “Group B”. Look at the timeline below and the values in green and orange colors.

Percentile in Zabbix

Despite considering the concept of median, we can also use the percentile calculation.

In most of cases, the median has a synonymous: “50th percentile”.

I’m proposing you an exercise:

 

1. You must create a Zabbix trapper item and send to it the following values using zabbix-sender:

0.06837, 0.0694, 0.06941, 0.06942, 0.06971, 0.06977, 0.06981, 0.07015, 0.07029, 0.07046, 3.1101

# for x in `cat numbers.txt`; do zabbix_sender -z 159.223.145.187 -s “Web server A” -k percentile.test -o “$x”; done

At the end, we’ll have 11 values in Zabbix database, and we’ll calculate the 50th percentile

 

2. You must create a Zabbix Calculated item with the following formula:

percentile(//percentile.test,#11,50)

In this case, we can read it: consider the last 11 values and return the value in the 50th position in the array. And you can check in advance the result using “Test” button in Zabbix.

Now, we’ll work with an even number of values, excluding the value 0.06837. Our values for the next test will be:

0.0694, 0.06941, 0.06942, 0.06971, 0.06977, 0.06981, 0.07015, 0.07029, 0.07046, 3.1101

Please, before sending the values with zabbix-sender again, clear the history and trends for this Calculated item and then, adjust the formula:

percentile(//percentile.test,#10,50)

Checking the result, something curious happened: the 50th percentile was the same value.

There is a simple explanation for this.

Considering the last 10 values, in green we have the “Group A” and in orange, we have the “Group B”. The value retrieved using 50th percentile formula occupies the same position in both first and second tests.

We can test it again but this time, let’s change the formula to 51st percentile. The next value will be the first value for the second group.

percentile(//percentile.test,#10,51)

The result was changed. Now, we have something different to work and then, in the next steps, we’ll retrieve the median.

So, the percentile can be considered the central value for an odd number of values, but when we have an even number of values, the result cannot be the expected.

Average ou Percentile?

Two different calculations. Two different results. Neither the first is wrong nor the second. Both values can be considered correct, but we need some context for this affirmation.

The average is considering the outliers. The last one, percentile, is not.

Let’s update our dashboard.

We don’t need to prove anything to anyone about the values, but we need to show the values and their context.

Median

It’s simple: If the median is the central value, we can retrieve the average for the 50th percentile and 51st percentile, in this case. Remember, our new connections are being collected every minute, so at the end of each hour, we’ll have an even number of values.

Fantastic. We can calculate de median in a simple way:

(last(//percentile.50.input.values)+last(//percentile.51.input.values))/2

 

This is the median formula in this case using Zabbix. Let’s check the results in our dashboard.

Partial conclusion

In this article, we have just explored some Zabbix functions to calculate basic statistics and bring up some insights about a symbolic web application and its response time.

There is no absolute truth about those metrics but each one of them needs a context.

In Exploratory Data Analysis, asking questions can guide us to interesting answers, but remember that we need to know where we are going or what we wish.

With Zabbix, you and me can perform a Data Scientist function, knowing Zabbix too and knowing it very well.

You don’t need to use python or R for all tasks in Data Science. We’ll talk about it in next articles for this series.

Keep in mind: Zabbix is your friend. Let’s learn Zabbix and get wonderful results!

_____________

[1] Infográfico Zabbix (unirede.net)

[1] [1] https://www.unirede.net/zabbix-templates-onde-conseguir/

[1] https://www.unirede.net/monitoramento-de-certificados-digitais-de-websites-com-zabbix-agent2/

[1] Tagging: Monitorando todos os serviços! – YouTube

[1] [1] Timeshift – YouTube

Integrate Zabbix with your data pipelines by configuring real-time metric and event streaming

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/integrate-zabbix-with-your-data-pipelines-by-configuring-real-time-metric-and-event-streaming/25728/

Modern IT infrastructures tend to utilize multiple data sources to evaluate and react to the current state of the infrastructure. A set of internal solutions and dedicated software tools are used to correlate the collected information and react in a proper way to changes in the environment – be it a gradual increase or decrease in resource usage, unexpected load spikes or run-of-the-mill outages.

With the release of Zabbix 6.4, metrics collected by Zabbix and events generated based on trigger expressions can be integrated into such a data pipeline by using the new real-time metric and event streaming feature.

Zabbix real-time metric and event streaming

Before Zabbix 6.4 Zabbix supported real-time export of history, trends and events to files. The export was performed by exporting the data in newline-delimited JSON format. Additional scripting was mandatory if a user wanted to integrate this data within their data pipeline since Zabbix only exported the data to files without any further transformations or streaming.

On the other hand – real-time metric streaming can be a lot more flexible when it comes to integrating Zabbix with data pipelines, filtering the required data and securing the connection to the third party endpoint.

In addition to specifiying the streaming endpoint URL, Zabbix users can choose between streaming Item values and Events. On top of that, Tag filtering can be used to further narrow down the data that will be streamed to the endpoint.

Connectors

Real-time item value and event streaming can be configured by accessing the Administration – General – Connectors section. Here Zabbix administrators will have to create connectors and specify what kind of data the connector will be streaming. A single connector can only stream either Item values or events – if you wish to stream both, you will have to define at least two connectors, one for values and the other for events.

Configuring a new connector

Marking the Advanced configuration checkbox enables Zabbix administrators to further configure each individual connector – from specifying the number of concurrent sessions, limiting the number of attempts and specifying the connection Timeout to configuring HTTP proxies and enabling SSL connections.

An HTTP authentication method can also be specified for each connector. It is possible to use one of the following methods:

None – no authentication used;

Basic – basic authentication is used;

NTLM – NTLM (Windows NT LAN Manager) authentication is used;

Kerberos – Kerberos authentication is used;

Digest – Digest authentication is used;

Bearer – Bearer authentication is used.

In addition, the number of pre-forked connector worker instances must be specified in the Zabbix server configuration file in the StartConnectors parameter.

Setting StartConnectors parameter in Zabbix server configuration file

Protocol

Under the hood, the data is sent over HTTP using newline-delimited JSON export protocol.

The following example shows a trigger event for Zabbix agent being unreachable:

{"clock":1519304285,"ns":123456789,"value":1,"name":"Either Zabbix agent is unreachable on Host B or pollers are too busy on Zabbix Server","severity":3,"eventid":42, "hosts":[{"host":"Host B", "name":"Host B visible"},{"host":"Zabbix Server","name":"Zabbix Server visible"}],"groups":["Group X","Group Y","Group Z","Zabbix servers"],"tags":[{"tag":"availability","value":""},{"tag":"data center","value":"Riga"}]}

Do not get confused by the value field, the value if which will always be 1 for problem events, while for items, it will contain the item value:

{"host":{"host":"Host B","name":"Host B visible"},"groups":["Group X","Group Y","Group Z"],"item_tags":[{"tag":"foo","value":"test"}],"itemid":4,"name":"CPU Load","clock":1519304285,"ns":123456789,"value":0.1,"type":0}

Use cases

Finally, what particular use cases can we use the real-time metric streaming feature for? As I mentioned in the introduction, Zabbix item values and events could serve as an additional source of near real-time information about the current system behavior.

For example, we could stream item values and events to message brokers such as Kafka, RabbitMQ or Amazon Kinesis. Combine this with additional automation solutions and your services could dynamically scale (think K8s/Docker containers) depending on the current (or expected) load. Zabbix Kubernetes and Docker container monitoring templates very much complement such an approach.

The streamed data could also be used to gain new insights about the system behavior by streaming it to an AI engine or data lakes/data warehouses for long-term storage and analysis.

Kubernetes monitoring with Zabbix – Part 3: Extracting Prometheus metrics with Zabbix preprocessing

Post Syndicated from Michaela DeForest original https://blog.zabbix.com/kubernetes-monitoring-with-zabbix-part-3-extracting-prometheus-metrics-with-zabbix-preprocessing/25639/

In the previous Kubernetes monitoring blog post, we explored the functionality provided by the Kubernetes integration in Zabbix and discussed use cases for monitoring and alerting to events in a cluster, such as changes in replicas or CPU pressure.

In the final part of this series on monitoring Kubernetes with Zabbix, we will show how the Kubernetes integration uses Prometheus to parse data from kube-state-metrics and how users can leverage this functionality to monitor the many cloud-native applications that expose Prometheus metrics by default.

Want to see Kubernetes monitoring in action? Watch Part 3 of our Kubernetes monitoring video guide.

Prometheus Data Model

Prometheus is an open-source toolkit for monitoring and alerting created by SoundCloud. Prometheus was the second hosted project to join the Cloud-native Computing Foundation in 2016, after Kubernetes. As such, users of Kubernetes have adopted Prometheus extensively.

Lines in the model begin with or without a pound sign. Lines beginning with a pound sign specify metadata that includes help text and type information. Additional lines follow where the first key is the metric name with optional labels specified, followed by the value, and optionally concluding with a timestamp. If a timestamp is absent, the assumption is that the timestamp is equal to the time of collection.

http_requests_total{job=”nginx”,instance=”10.0.0.1:443”} 15 1677507349983

Using Prometheus with Kubernetes Monitoring

Let’s start with an example from the kube-state-metrics endpoint, installed in the first part of this series. Below is the output for the /metrics endpoint used by the Kubernetes integration, showing the metric kube_job_created. Each metric has help text followed by a line starting with that metric name, labels describing each job, and creation time as the sample value.

# HELP kube_job_created Unix creation timestamp
# TYPE kube_job_created gauge
kube_job_created{namespace="jdoe",job_name="supportreport-supportreport-27956880"} 1.6774128e+09
kube_job_created{namespace="default",job_name="core-backup-data-default-0-27957840"} 1.6774704e+09
kube_job_created{namespace="default",job_name="core-backup-data-default-1-27956280"} 1.6773768e+09
kube_job_created{namespace="jdoe",job_name="activetrials-activetrials-27958380"} 1.6775028e+09
kube_job_created{namespace="default",job_name="core-cache-tags-27900015"} 1.6740009e+09
kube_job_created{namespace="default",job_name="core-cleanup-pipes-27954860"} 1.6772916e+09
kube_job_created{namespace="jdoe",job_name="salesreport-salesreport-27954060"} 1.6772436e+09
kube_job_created{namespace="default",job_name="core-correlation-cron-1671562914"} 1.671562914e+09
kube_job_created{namespace="jtroy",job_name="jtroy-clickhouse-default-0-maintenance-27613440"} 1.6568064e+09
kube_job_created{namespace="default",job_name="core-backup-data-default-0-27956880"} 1.6774128e+09
kube_job_created{namespace="default",job_name="core-cleanup-sessions-27896445"} 1.6737867e+09
kube_job_created{namespace="default",job_name="report-image-findings-report-27937095"} 1.6762257e+09
kube_job_created{namespace="jdoe",job_name="salesreport-salesreport-27933900"} 1.676034e+09
kube_job_created{namespace="default",job_name="core-cache-tags-27899775"} 1.6739865e+09
kube_job_created{namespace="ssmith",job_name="test-auto-merger"} 1.653574763e+09
kube_job_created{namespace="default",job_name="report-image-findings-report-1650569984"} 1.650569984e+09
kube_job_created{namespace="ssmith",job_name="auto-merger-and-mailer-auto-merger-and-mailer-27952200"} 1.677132e+09
kube_job_created{namespace="default",job_name="core-create-pipes-pxc-user"} 1.673279381e+09
kube_job_created{namespace="jdoe",job_name="activetrials-activetrials-1640610000"} 1.640610005e+09
kube_job_created{namespace="jdoe",job_name="salesreport-salesreport-27943980"} 1.6766388e+09
kube_job_created{namespace="default",job_name="core-cache-accounting-map-27958085"} 1.6774851e+09

Zabbix collects data from this endpoint in the “Get state metrics.” The item uses a script item type to get data from the /metrics endpoint. Dependent items that use a Prometheus pattern as a preprocessing step to obtain data relevant to the dependent item are created.

Prometheus and Out-Of-The-Box Templates

Zabbix also offers many templates for applications that expose Prometheus metrics, including etcd. Etcd is a distributed key-value store that uses a simple HTTP interface. Many cloud applications use etcd, including Kubernetes. Following is a description of how to set up an etcd “host” using the built-in etcd template.

A new host is created called “Etcd Application” with an agent interface specified that provides the location of the application API. The interface port does not matter because a macro sets the port. The “Etcd by HTTP” template is attached to the host.

The “Get node metrics” item is the master item that collects Prometheus metrics. Testing this item shows that it returns Prometheus formatted metrics. The master item creates many dependent items that parse the Prometheus metrics. In the dependent item, “Maximum open file descriptors,” the maximum number of open file descriptors is obtained by adding the “Prometheus pattern” preprocessing step. This metric is available with the metric name process_max_fds.

Custom Prometheus Templates

 

While it is convenient when Zabbix has a template for the application you want to monitor, creating a new template for an application that exposes a /metrics endpoint but does not have an associated template is easy.

One such application is Argo CD. Argo CD is a GitOps continuous delivery tool for Kubernetes. An “application” represents each deployment in Kubernetes. Argo CD uses Git to keep applications in sync.

Argo CD exposes a Prometheus metrics endpoint that we can be used to monitor the application. The Argo CD documentation site includes information about available metrics.

In Argo CD, the metrics service is available at the argocd-metrics service. Following is a demonstration of creating an Argo CD template that collects Prometheus metrics. Install Argo CD in a cluster with a Zabbix proxy installed before starting. To do this, follow the Argo CD “Getting Started” guide.

Create a new template called, “Argo CD by HTTP” in the “Templates/Applications” group. Add three macros to the template. Set {$ARGO.METRICS.SERVICE.PORT} to the default of 8082. Set {$ARGO.METRICS.API.PATH} to “/metrics.” Set the last macro, {$ARGO.METRICS.SCHEME} to the default of “http.”

Open the template and click “Items -> Create item.” Name this item “Get Application Metrics” and give it the “HTTP agent” type. Set the key to argocd.get_metrics with a “Text” information type. Set the URL to {$ARGO.METRICS.SCHEME}://{HOST.CONN}:{$ARGO.METRICS.SERVICE.PORT}/metrics. Set the History storage period to “Do not keep history.”

Create a new host to represent Argo. Go to “Hosts -> Create host”. Name the host “Argo CD Application” and assign the newly created template. Define an interface and set the DNS name to the name of the metrics service, including the namespace, if the Argo CD deployment is not in the same namespace as the Zabbix proxy deployment. Connect to DNS and leave the port as the default because the template does not use this value. Like in the etcd template, a macro sets the port. Set the proxy to the proxy located in the cluster. In most cases, the macros do not need to be updated.

Click “Test -> Get value and test” to test the item. Prometheus metrics are returned, including a metric called argocd_app_info. This metric collects the status of the applications in Argo. We can collect all deployed applications with a discovery rule.

Navigate to the Argo CD template and click “Discovery rules -> Create discovery rule.” Call the rule “Discover Applications.” The type should be “Dependent item” because it depends on the metrics collection item. Set the master item to the “Get Application Metrics” item. The key will be argocd.applications.discovery. Go to the preprocessing tab and add a new step called, “Prometheus to JSON.” The preprocessing step will convert the application data to JSON, which will look like the one below.

[{"name":"argocd_app_info","value":"1","line_raw":"argocd_app_info{dest_namespace=\"monitoring\",dest_server=\"https://kubernetes.default.svc\",health_status=\"Healthy\",name=\"guestbook\",namespace=\"argocd\",operation=\"\",project=\"default\",repo=\"https://github.com/argoproj/argocd-example-apps\",sync_status=\"Synced\"} 1","labels":{"dest_namespace":"monitoring","dest_server":"https://kubernetes.default.svc","health_status":"Healthy","name":"guestbook","namespace":"argocd","operation":"","project":"default","repo":"https://github.com/argoproj/argocd-example-apps","sync_status":"Synced"},"type":"gauge","help":"Information about application."}]

Set the parameters to “argocd_app_info” to gather all metrics with that name. Under “LLD Macros”, set three macros. {#NAME} is set to the .labels.name key, {#NAMESPACE} is set to the .labels.dest_namespace key, and {#SERVER} is set to .labels.dest_server.

Let us create some item prototypes. Click “Create item prototype” and name it “{#NAME}: Health Status.” Set it as a dependent item with a key of argocd.applications[{#NAME}].health. The type of information will be “Character.” Set the master item to “Get Application Metrics.”

In preprocessing, add a Prometheus pattern step with parameters argocd_app_info{name=”{#NAME}”}. Use “label” and set the label to health_status. Add a second step to “Discard unchanged with heartbeat” with the heartbeat set to 2h.

Clone the prototype to create another item called “{#NAME}: Sync status.” Change the key to argocd.applications.sync[{#NAME}]. Under “Preprocessing” change the label to sync_status.

Now, when viewing “Latest Data” the sync and health status are available for each discovered application.

Conclusion

We have shown how Zabbix templates, such as the Kubernetes template, and the etcd template utilize Prometheus patterns to extract metric data. We have also created templates for new applications that expose Prometheus data. Because of the adoption of Prometheus in Kubernetes and cloud-native applications, Zabbix benefits by parsing this data so that Zabbix can monitor Kubernetes and cloud-native applications.

I hope you enjoyed this series on monitoring Kubernetes and cloud-native applications with Zabbix. Good luck on your monitoring journey as you learn to monitor with Zabbix in a containerized world.

About the Author

Michaela DeForest is a Platform Engineer for The ATS Group. She is a Zabbix Certified Specialist on Zabbix 6.0 with additional areas of expertise, including Terraform, Amazon Web Services (AWS), Ansible, and Kubernetes, to name a few. As ATS’s resident authority in DevOps, Michaela is critical in delivering cutting-edge solutions that help businesses improve efficiency, reduce errors, and achieve a faster ROI.

About ATS Group:

The ATS Group provides a fully inclusive set of technology services and tools designed to innovate and transform IT. Their systems integration, business resiliency, cloud enablement, infrastructure intelligence, and managed services help businesses of all sizes “get IT done.” With over 20 years in business, ATS has become the trusted advisor to nearly 500 customers across multiple industries. They have built their reputation around honesty, integrity, and technical expertise unrivaled by the competition.

Just-in-Time user provisioning explained

Post Syndicated from Evgeny Yurchenko original https://blog.zabbix.com/just-in-time-user-provisioning-explained/25515/

Zabbix 6.4 finally brings a very much waited feature called “Just-In-Time user provisioning”. Zabbix “What’s new in 6.4” LDAP/SAML user provisioning paragraph is very brief and can not (not that I am saying it should) deliver any excitement about this new really game changing feature. This blog post was born to address two points:

  • explain in more details why it is “game changing” feature
  • configuration of this feature is very flexible and as it often happens flexibility brings complexity and sometimes confusion about how to actually not only get it working but also to get the most of this feature

NOTE: I am talking about LDAP in this blog post but SAML works exactly the same way so you can easily apply this article to SAML JIT user provisioning configuration.

Old times (before 6.4)

Let’s do a quick reminder how it worked before Zabbix 6.4:Obvious problem here is that a User must be pre-created in Zabbix to be able to log in using LDAP. The database user records do not have any fields noticing that the user will be authenticated via LDAP, it’s just users’ passwords stored in the database are ignored, instead, Zabbix goes to an LDAP server to verify whether:

  • a user with a given username exists
  • user provided the correct password

no other attributes configured for the user on the LDAP server side are taken into account.

So when Zabbix is used by many users and groups, user management becomes not a very trivial task as new people join different teams (or leave).

Zabbix 6.4 with JIT user provisioning enabled

Now let’s take a look at what is happening in Zabbix 6.4 (very simplified picture). The picture depicts what happens when memberOf method is selected for Group Configuration (more on that later):Now when Zabbix gets a username and password from the Login form it goes to the LDAP server and gets all the information available for this user including his/her LDAP groups membership and e-mail address. Obviously, it gets all that only if the correct (from LDAP server perspective) username and password were provided. Then Zabbix goes through pre-configured mapping that defines users from which LDAP group goes to which Zabbix user group. If at least one match is found then a user is created in the Zabbix database belonging to a Zabbix user group and having a Zabbix user role according to configured “match”. So far sounds pretty simple, right? Now let’s go into detail about how all this should be configured.

LDAP server data

To experiment with the feature I built a Docker container which is a fully functional LDAP server with some pre-configured data, you can easily spin it up using this image. Start the container this way:

docker run -p 3389:389 -p 6636:636 --name openldap-server --detach bgmot42/openldap-server:0.1.1

To visually see LDAP server data (and add your own configuration like users and groups) you can start this standard container

docker run -p 8081:80 -p 4443:443 --name phpldapadmin --hostname phpldapadmin --link openldap-server:ldap-host --env PHPLDAPADMIN_LDAP_HOSTS=ldap-host --detach osixia/phpldapadmin:0.9.0

Now you can access this LDAP server via https://<ip_address>:4443 (or any other port you configure to access this Docker container), click Login, enter “cn=admin,dc=example,dc=org” in Login DN field and “password” in Password field, click Authenticate. You should see the following structure of the LDAP server (picture shows ‘zabbix-admins’ group configuration):All users in this container for convenience are configured with “password” word as their passwords.

General LDAP authentication configuration in Zabbix

No surprises here, you need to enable LDAP authentication, just a couple of additions here:

  • You must provide Deprovisioned users group. This group must be literally “disabled” otherwise you won’t be able to select it here. This is the Zabbix user group where all “de-provisioned” users will be put into so effectively will get disabled from accessing Zabbix.
  • Enable JIT provisioing check-box which obviously needs to be checked for this feature to work.

And again already familiar interface to configure a LDAP server and search parameters, however, this picture depicts how we actually fill in these parameters according to data in our LDAP server:

“Special” Distinguished Name (DN) cn=ldap_search,dc=example,dc=org is used for searching, i.e. Zabbix uses this DN to connect to LDAP server and of course when you connect to LDAP server you need to be authenticated – this is why you need to provide Bind password. This DN should have access to a sub-tree in LDAP data hierarchy where all your users are configured. In our case all the users configured “under” ou-Users,dc=example,dc=org, this DN is called base DN and used by Zabbix as so to say “starting point” to start searching.
Note: technically it is possible to bind to LDAP server anonymously, without providing a password but this is a huge breach in security as the whole users sub-tree becomes available for anonymous (unauthenticated) search, i.e. effectively exposed to any LDAP client that can connect to LDAP server over TCP. The LDAP server we deployed previously in Docker container does not provide this functionality.

Group configuration method “memberOf”

All users in our LDAP server have memberOf attribute which defines what LDAP groups every user belongs to, e.g. if you perform a LDAP query for user1 user you’ll get that its memberOf attribute has this value:
memberOf: cn=zabbix-admins,ou=Group,dc=example,dc=org
Note, that your real LDAP server can have totally different LDAP attribute that provides users’ group membership, and of course, you can easily configure what attribute to use when searching for user’s LDAP groups by putting it into User group membership attribute field:

In the picture above we are telling Zabbix to use memberOf attribute to extract DN defining user’s group membership (in this case it is cn=zabbix-admins,out=Group,dc=example,dc=org) and take only cn attribute from that DN (in this case it is zabbix-admins) to use in searching for a match in User group mapping rules. Then we define as many mapping rules as we want. In the picture above we have two rules:

  • All users belonging to zabbix-users LDAP group will be created in Zabbix as members of Zabbix users group with User role
  • All users belonging to zabbix-admins LDAP group will be created in Zabbix as members of Zabbix administrators group with Super admin role

Group configuration method “groupOfNames”

There is another method of finding users’ group membership called “groupOfNames” it is not as efficient as “memberOf” method but can provide much more flexibility if needed. Here Zabbix is not querying LDAP server for a user instead it is searching for LDAP groups based on a given criterion (filter). It’s easier to explain with pictures depicting an example:

Firstly we define LDAP “sub-tree” where Zabbix will be searching for LDAP groups – note ou=Group,dc=example,dc=org in Group base DN field. Then in the field Group name attribute field we what attribute to use when we search in mapping rules (in this case we take cn, i.e. only zabbix-admins from full DN cn=zabbix-admins,ou=Group,dc=example,dc=org). Each LDAP group in our LDAP server has member attribute that has all users that belong to this LDAP group (look at the right picture) so we put member in Group member attribute field. Each user’s DN will help us construct Group filter field. Now pay attention: Reference attribute field defines what LDAP user’s attribute Zabbix will use in the Group filter, i.e. %{ref} will be replaced with the value of this attribute (here we are talking about the user’s attributes – we already authenticated this user, i.e. got all its attributes from LDAP server). To sum up what I’ve said above Zabbix

  1. Authenticate the user with entered Username and Password against LDAP server getting all user’s LDAP attributes
  2. Uses Reference attribute and Group filter fields to construct a filter (when user1 logs in the filter will be (member=uid=user1,ou=Users,dc=example,dc=org)
  3. Performs LDAP query to get all LDAP groups with member attribute (configured in Group member attribute field) containing constructed in step 2 filter
  4. Goes through all LDAP groups received in step 3 and picks cn attribute (configured in Group name attribute field) and finds a match in User group mapping rules

Looks a bit complicated but all you really need to know is the structure of your LDAP data.

Demo time

Finally let’s see what happens when user1 belonging to zabbix-admins LDAP group and user3 belonging to zabbix-users LDAP group log in:

That’s it. Happy JIT user provisioning!

Kubernetes monitoring with Zabbix – Part 2: Understanding the discovered resources

Post Syndicated from Michaela DeForest original https://blog.zabbix.com/kubernetes-monitoring-with-zabbix-part-2-understanding-the-discovered-resources/25476/

In the previous blog post, we installed the Zabbix Agent Helm Chart and set up official Kubernetes templates to monitor a cluster in Zabbix. In this edition, part 2, we will explore the functionality provided by the Kubernetes integration in Zabbix and discuss use cases for monitoring and alerting on events in a cluster. (This post assumes that the Kubernetes integration has been set up in at least one cluster using the helm chart and provided templates.)

Want to see Kubernetes monitoring in action? Watch Part 2 of our Kubernetes monitoring video guide.

Node and Component Discovery

Following integration setup, the templates will discover control plane components, each node, and the kubelet associated with it using the Kubernetes API via a “Script” item type.

Note:

In the last blog post, I showed a managed EKS cluster. Control plane components cannot be discovered in an EKS cluster because AWS does not make them directly available through the API. For the sake of demonstrating the full capabilities of the integration, this post will use screenshots depicting a cluster that was created using the kubeadm utility.

In the latest version of Zabbix (6.2 at the time of writing), control plane components are discovered via node labels added only for clusters created with kubeadm. Depending on your setup, you may be able to add the same node labels to your own control plane nodes or modify the template to use your specific labels.

This example cluster has 4 worker nodes and 1 master node. The control plane runs entirely on the master node.

Zabbix’s “Low-Level Discovery” is the backbone of the Kubernetes integration. Zabbix discovers each node and creates two hosts to represent them in the cluster. The first host attaches the “Linux by Zabbix Agent” template to it, and the second attaches a custom Kubelet template called “Kubernetes Kubelet by HTTP. Zabbix also creates items for most standard objects like pods, deployments, replicasets, job, cronjob, etc.

Node and Kubernetes Performance Metrics

In this example, there are four discovered worker nodes with the “Linux by Zabbix Agent” template attached to them. The template will provide metrics about the machines running in the cluster.

Each worker host’s “System performance” dashboard shows system load, CPU usage, and memory usage metrics.

Zabbix will also collect Kubernetes-specific metrics related to the nodes. “Latest Data” for the Kubernetes Nodes host shows metrics such as the Allocatable CPU available to pods and the node’s memory capacity.

Alerts are generated for events such as the allocation of too much CPU. This could indicate that capacity should be increased, assuming that the memory and CPU limits set on the pod label are accurate.

The Kubernetes integration also monitors object states. As a best practice, any tool used to monitor Kubernetes should be monitoring and alerting critical status changes within the cluster. The image above shows the triggers related to the health of a pod. There are also triggers when certain conditions are detected by the nodes, like memory or CPU pressure.

Zabbix discovers objects like pods, deployments, and Replicasets, and triggers on object states.  For example, pods that are not up or deployments that do not have the correct number of replicas up.

In this example, a cluster is running a Kubernetes dashboard deployment with 3 replicas. By running the following command, we can see that all 3 replicas are up. Under “Latest Data,” Zabbix shows those 3 replicas available out of the 3 desired.

kubectl get deployment kubernetes-dashboard



To mimic a pod crashing, the pod is edited to use an invalid image tag.

kubectl edit pod <pod name>

The image tag is changed to  “invalid.tag, “ which is unavailable for the image. This causes the pod to fail because it can no longer pull the image. Output now shows that one pod is no longer ready.

Looking at the data in Zabbix, the number of available replicas is only 3, while the number of unavailable replicas is now 1.

On the problems page, there are two new problems. Both alerted that there is a mismatch between the number of replicas for the dashboard and the number of desired replicas.

Changing the tag back to a valid one should cause those problems to be resolved.

The Kubernetes templates offer many metrics and triggers, including most provided by Prometheus and Alert Manager. With some Zabbix experience and the ability to navigate kube-state-metrics and Kubernetes APIs, creating new items is possible.

What’s Next?

Above is an example of the output from the kube-state-metrics API. Unlike most APIs that return data in JSON format, the kube-state-metrics API uses the Prometheus data model to supply metrics.

As you get comfortable with Kubernetes monitoring in Zabbix, you may want to parse your own metrics from kube-state-metrics and create new items.

In the next video, we will learn how to monitor applications with Prometheus in Zabbix.

About the Author

Michaela DeForest is a Platform Engineer for The ATS Group.  She is a Zabbix Certified Specialist on Zabbix 6.0 with additional areas of expertise, including Terraform, Amazon Web Services (AWS), Ansible, and Kubernetes, to name a few.  As ATS’s resident authority in DevOps, Michaela is critical in delivering cutting-edge solutions that help businesses improve efficiency, reduce errors, and achieve a faster ROI.

About ATS Group: The ATS Group provides a fully inclusive set of technology services and tools designed to innovate and transform IT.  Their systems integration, business resiliency, cloud enablement, infrastructure intelligence, and managed services help businesses of all sizes “get IT done.” With over 20 years in business, ATS has become the trusted advisor to nearly 500 customers across multiple industries.  They have built their reputation around honesty, integrity, and technical expertise unrivaled by the competition.

How to write a webhook for Zabbix

Post Syndicated from Andrey Biba original https://blog.zabbix.com/how-to-write-a-webhook-for-zabbix/25298/

As you know, a picture is worth a thousand words. Therefore, I would like to share the process of creating a webhook from scratch. In this article, we will walk through the creation process step by step – starting with studying the target service with which Zabbix will integrate and finishing with tests for sending events from Zabbix. Although it may seem complicated, writing your own integrations is not so difficult.

Preparation

First, we need to decide what we want to see as a result of the webhook. In most cases, the services to which we will send events are divided into 2 types:

  • Messengers to which you can send messages. For example, Telegram, Slack, Discord, etc.
  • Service Desks where you can open, close, and update tickets. For example, Jira, Redmine, ServiceNow, etc.

In both cases, the principle of creating a webhook will not differ – the difference is only in the complexity of one type from the other.

In this article, I will describe the process of creating a webhook for messengers – and specifically for Line messenger.

After we have decided on the type, we need to find out whether this service supports the possibility of API requests and, if it does, what is required for this. Usually, all the services you want to integrate Zabbix with have somewhat detailed documentation about the API methods they support. By the way, Zabbix also has its own API, which is documented in detail.

After we are done studying the Line documentation, we find out that messages are sent using the POST method to the https://api.line.me/v2/bot/message/push endpoint, using the Line bot token in the request header for authorization and passing a specially formatted JSON in the request body with the content of the message. Confused? No problem. Let’s take a closer look.

HTTP requests

The operation of the API is based on HTTP requests, which are executed with parameters provided by the developers of this API.

Several types of HTTP requests are used more often than others:

  • GET – is perhaps the most common one that all of us encounter on a daily basis. This request only involves getting data. For example, the browser used a GET request from the web server to fetch the article you are currently reading.
  • POST – is a request that sends data to a resource. This is exactly the case when we want to pass something to the service using API requests.
  • PUT – is much less common than the previous 2, but no less important. This query replaces the values in a resource.

These are not all HTTP request methods, but these three will suffice for a general introduction.

We are done with methods. Let’s move on to the endpoint.

An endpoint is a permanent address of a resource via which we transfer, receive, or change data. In this case, https://api.line.me/v2/bot/message/push is the endpoint that accepts POST requests to send messages.

So, the method and the endpoint are defined. What’s next?

Generally, any HTTP request consists of:

  1. URL
  2. Method
  3. Headers
  4. Body
HTTP request structure

We have already dealt with the first two, but the headers and the request body remain.

Headers usually contain service information that allows you to process a request correctly. For example, the Content-Type: application/json header implies that our request body should be interpreted as a json object. Also, quite often, authorization information is passed in the headers. As in the case of Line, the Authorization: Bearer {channel access token} header contains the authorization token of the bot on behalf of which messages will be sent.

The request body usually contains the information we want to pass on to the service. In our case, this will be the subject and body of the event in Zabbix.

Checking the service API

The documentation is good, but it is necessary to check that everything we read works exactly how it is documented. It is not uncommon that a service can be developed faster than the documentation can keep up with it. So field testing never hurts. Excluding unexpected behavior will significantly reduce the time spent searching for problems.

I recommend using Postman to work with API requests – a handy tool that saves time. But for this article, we will use cURL due to its prevalence and ease of use.

I will not describe the process of creating the Line Bot API token because this is not directly related to the article. However, for those interested in this process, I will leave a link here.

As we have already found out, the request type will be POST, the access point URL is https://api.line.me/v2/bot/message/push, and additional headers must be passed: Content-Type: application/json which specifies the type of data to be sent (in our case it is JSON) and Authorization: Bearer {token value}. And the messages themselves are in JSON format. For example, I used 2 messages – “Hello, world1” and “Hello, world2”. As a result, I got the following query:

After executing the request, we got the expected result of 2 messages that came to the messenger, which were in the request body.

Excellent! So half of the work has already been done: there is a ready-made request that works in manual mode and successfully sends messages to Line. The only thing left is to put the necessary information in the right places and automate the process using JS and Zabbix.

Integration with Zabbix

After successfully completing the tests, go to Zabbix, create a new notification method in the Administration section, select the webhook type, and name it Line.

For webhook integrations with external services, Zabbix uses the built-in JavaScript engine on Duktape. Parameters are passed to the script, which is used to build the logic of the webhook. As a result of the script, tags can be returned that will be assigned to the event. This is usually necessary in case of integration with service desks in order to be able to update the status of tickets.

Let’s take a closer look at the webhook setup interface.

The Media type section contains the general settings for the new media type:

  • Name – Name of the media type.
  • Type – The type of media type. There are 4 types: email, SMS, webhook, and script.
  • Parameters – This is a list of variables passed to the code. All necessary data can be passed through parameters: event id, event type, trigger severity, event source, etc. You can specify macros and text values in parameters. The parameters are passed as a JSON string, accessible through the built-in variable value.
  • Script – JS script describing the logic of the webhook.
  • Timeout – The time after which the script will be terminated.
  • Process tags   – If this option is enabled, the webhook will support generating tags for events sent using this hook.
  • Include event menu entry – This option makes the Menu Entry Name and Menu Entry URL fields available for use.
  • Menu entry name – The text displayed in the event dropdown menu for the Menu entry URL submitted using this hook.
  • Menu entry URL – A link to an external resource in the event menu.
  • Description – A text field that contains a description of the notification method.
  • Enabled – an Option that allows enabling or disabling the media type.

The Message templates section contains templates that are used by webhook to send alerts. Each template contains:

  • Message type – The event type to which the message will apply. For example, Problem – when the trigger fires and Problem recovery – when the problem is resolved.
  • Subject  – The headline of the message.
  • Message – A message template that contains useful information about the event. For example, event time, date, event name, host name, etc.

The Options section contains additional options:

  • Concurrent sessions – The number of concurrent sessions to send an alert.
  • Attempts – The number of retries in case of send failure.
  • Attempt interval  – The frequency of attempts to send an alert.

When writing your own webhook, you can take an existing one as a basis – Zabbix has more than thirty ready-made webhook solutions of varying complexity. All basic functions are usually repeated from hook to hook with little or no change at all, as are the parameters passed to them.

Let’s set the following parameters:

It is convenient to set parameter values with macros. A macro is a variable in Zabbix that contains a specific value. Macros allow you to optimize and automate your work. They can be used in various places, such as triggers, filters, alerts, and so on.

A little more about each macro separately in order to understand why each of them is needed:

  • {ALERT.SUBJECT} – The subject of the event message. This value is taken from the Subject field of the corresponding Message template type.
  • {ALERT.MESSAGE} – The event message body. This value is taken from the Message field of the corresponding Message template type.
  • {EVENT.ID} – The event id in Zabbix. Could be used for generating a link to an event
  • {EVENT.NSEVERITY} – The numerical definition of the event’s severity from 0-5. We will use this to change the message in case of different severity.
  • {EVENT.SOURCE} – The event source. Needed to handle events correctly. In most cases, we are interested in triggers; this corresponds to source value 0.
  • {EVENT.UPDATE.STATUS} – Returns 1 if it is an update event. For example, in case of acknowledge operations or a change in severity.
  • {EVENT.VALUE} – The event state. 0 for recovery and 1 for the problem.
  • {ALERT.SENDTO} – The field from the media type assigned to the user. It returns the ID of the user or group in the Line, where it will be necessary to send a message
  • {TRIGGER.DESCRIPTION} – A macro that will be expanded if the event source is a trigger. Returns the description of the trigger
  • {TRIGGER.ID} – The trigger ID. Required to generate a link to an event in Zabbix

Webhooks can use other macros if needed. A list of all macros can be viewed on the documentation page. Be careful – not all macros can be used in webhooks.

Writing the script

Before writing the script, let’s define the main points that the webhook will need to be able to perform:

  • the script should describe the logic for sending messages
  • handle possible errors
  • logging for debugging

I will not describe the entire code in order not to repeat the same type of blocks and concentrate only on important aspects.

To send messages, let’s write a function that will accept messages and params variables. We got the following function:

function sendMessage(messages, params) {
    // Declaring variables
    var response,
        request = new HttpRequest();

    // Adding the required headers to the request
    request.addHeader('Content-Type: application/json');
    request.addHeader('Authorization: Bearer ' + params.bot_token);

    // Forming the request that will send the message
    response = request.post('https://api.line.me/v2/bot/message/push', JSON.stringify({
        "to": params.send_to,
        "messages": messages
    }));

    // If the response is different from 200 (OK), return an error with the content of the response
    if (request.getStatus() !== 200) {
        throw "API request failed: " + response;
    }
}

Of course, this is not a reference function, and depending on the requirements for the request may differ. There may be other required headers and a different request body. In some cases, it may be necessary to add an additional step to obtain authorization data through another API request.

In this case, the request to send a message returns an empty {} object, so it makes no sense to return it from the function. But for example, when sending a message to Telegram, an object with data about this message is returned. If you pass this data to tags, you can write logic that will change the already sent message – for example, in case of closing or updating the problem.

Now let’s describe a function that will accept webhook parameters and validate their values. In the example, we will not describe all the conditions because they are of the same type:

function validateParams(params) {
    // Checking that the bot_token parameter is a string and not empty
    if (typeof params.bot_token !== 'string' || params.bot_token.trim() === '') {
        throw 'Field "bot_token" cannot be empty';
    }

    // Checking that the event_source parameter is only a number from 0-3
    if ([0, 1, 2, 3].indexOf(parseInt(params.event_source)) === -1) {
        throw 'Incorrect "event_source" parameter given: "' + params.event_source + '".nMust be 0-3.';
    }

    // If an event of type "Discovery" or "Autoregistration" set event_value 1, 
    // which means "Problem", and we will process these events same as problems
    if (params.event_source === '1' || params.event_source === '2') {
        params.event_value = '1';
    }

    ...

    // Checking that trigger_id is a number and not equal to zero
    if (isNaN(params.trigger_id) && params.event_source === '0') {
        throw 'field "trigger_id" is not a number';
    }
}

As you can see from the code, in most cases these are simple checks that allow you to avoid errors associated with the input data. Validation is necessary because there is no guarantee that the expected value will be in the parameter.

The main block of code is placed inside the try…catch block in order to correctly handle errors:

try {
    // Declaring the params variable and writing the webhook parameters to it
    var params = JSON.parse(value);

    // Calling the validation function and passing parameters to it for verification
    validateParams(params);

    // If the event is a trigger and it is in the problem status, compose the message body
    if (params.event_source === '0' && params.event_value === '1') {
        var line_message = [
            {
                "type": "text",
                "text": params.alert_subject + 'nn' +
                    params.alert_message + 'n' + params.trigger_description
            }
        ];
    }

    ...

    // Sending a composed message
    sendMessage(line_message, params);

    // Returning OK so that the webhook understands that the script has completed with OK status
    return 'OK';
}
catch (err) {
    // Adding a log function so in case of problems you can see the error in the Zabbix server console
    Zabbix.log(4, '[ Line Webhook ] Line notification failed : ' + err);

    // In case of an error, return it from the webhook
    throw 'Line notification failed : ' + err;
}

Here we assign parameter values to the params variable, then validate them using the validateParams() function, describe the main conditions for generating a message, and send this message to the messenger. At the same time, the try…catch block allows you to catch all errors, log them to Zabbix and return them in a readable form to the user in the web interface.

For writing webhooks in Zabbix, there is a guideline dedicated to this topic. Please read this information because it will help you write better code and avoid common mistakes.

Testing

After we’ve finished with the webhook script, it’s time to test how our code works. To do this, Zabbix provides a function to send test messages. Go to the AdministrationMedia types, find Line, and click on the Test button opposite it. In the window that appears, fill in all the fields with the necessary data and press the Test button. Check the messenger and see that the message came with the data we specified in the test.

Ready-made Line integration can be found in the Zabbix git repository and in all recent Zabbix instance builds.

Troubleshooting

Of course, everything in the article looks like I did it on the first attempt and did not encounter a single error or problem. Naturally, this is not the case in practice. Work with each new product includes Research & Development. How can you catch errors and, most importantly, understand the problem?

Well, as I wrote earlier – read the documentation and test all requests before writing code. At this stage, it is easiest to catch all the problems. The response to the HTTP request will explicitly describe the error. For example, if you make a mistake in the request body and send an object with incorrect values, the service will return the body with an error description and the response status 400 (Bad request).

There are several options for debugging in case of errors that may occur when writing a webhook script:

  • Focus on the errors displayed when the notification method is executed. For example, if you mistyped or set the wrong name of the function and variable.
  • Include logging in the code for displaying service information. For example, while you are in the script development stage, the result of the function can be logged using the Zabbix.log() function. Zabbix supports 6 debug levels (0-5), which can be set in this function. Usually, webhooks use level 4, which contains information for debugging.
  • Use the zabbix_js utility. You can transfer a file with a script and parameters to it. You can read more about it here.

Conclusion

I hope this article has helped you better understand how webhooks work in Zabbix and highlighted the basic steps for creating, diagnosing, and preparing to write your integration. The Zabbix community is constantly adding custom templates and media types. I expect that after reading this article, more people will be interested in creating their own webhooks and sharing them with the community. We appreciate any contribution to the development and expansion of the base of integration solutions.

Questions

Q: I don’t know JS, but I know other languages. Is native support of other languages planned in Zabbix, such as Python?

A: For now, there are no such plans.

Q: Are there any restrictions with writing a JS script for a webhook?

A: Yes, there are. The built-in Duktape engine is used to execute the code, and it does not have all the functionality that is available in the latest JS releases. Therefore, I recommend that you read the documentation of this engine and the built-in objects to learn more about the available methods.

Deep dive into Zabbix Frontend Modules

Post Syndicated from Evgeny Yurchenko original https://blog.zabbix.com/deep-dive-into-zabbix-frontend-modules/24863/

Zabbix gives every community member the ability to extend their frontend functionality by writing their own frontend modules. In this video, we will go through the steps required to write a Zabbix frontend module and look at multiple code examples that will explain the steps behind successfully implementing a custom frontend module. The article is based on a Zabbix Summit 2022 speech by Evgeny Yurchenko.

Why?

Zabbix 5.0 introduced a pretty cool feature called “Frontend Modules” (or WEB modules) that lets anybody extend Zabbix WebUI (add new menu items, modify current menu element behavior or even delete some menu items completely). We see a constant growth of the number of Modules created, but there is not too much written on how to efficiently write your own Modules. This article tries to give you as much detail as possible on how the Modules subsystem is implemented in Zabbix which obviously should help you understand how Modules function thus easing the process of writing your own Modules.

Disclaimer

  • Information in this presentation is a result of Zabbix source code analysis and in no way a replacement but rather an addition to the official documentation:
 https://www.zabbix.com/documentation/current/en/manual/modules
  • Modules can be harmful as they work in Zabbix WebUI process space and have the same level of access to the Zabbix database as Zabbix web UI itself. People will have to trust the modules you are developing so be careful.
  • Errors in your module code may crash the frontend. Currently, there is no version compatibility check during module installation; keep that in mind and implement strict versioning of these modules, clearly stating what version of the module works with what version of Zabbix.
  • All the references to code in this article assume Zabbix 6.0.

MVC framework

Zabbix WebUI is built based on so-called “Model-Controller-View” (MVC) framework. The concept is quite old and you can find a lot of information about MVC on the Internet. The following picture depicts every Zabbix WebUI application execution flow (every user’s click on a menu item, a button like “Apply” or “Filter” etc.)HTTP request from the user’s browser is accepted by the Controller component and analyzed. Based on the action parameter received Controller makes a decision on how to serve this request. Controller (optionally) “talks” to Model component to get needed data (usually) from Zabbix Database, then massages data to ultimately create a set of data to be returned to end user. Then (again if needed) Controller “talks” to the View component asking to prepare the data for the user’s consumption (in essence make it look nice and clean in Browser thus View component, in most cases generates HTML/CSS/Javascript). And once the controller has everything to send back to the user it returns HTTP response to the end user’s browser. I prefer to think that Zabbix Frontend Modules cover Controller and View components functionality as it’s a good idea to consume data available in Zabbix by re-using what’s already implemented in Zabbix WebUI code (huge amount of classes with their methods delivering any data you want) though strictly speaking, you can implement your own data feed in your Module.

Fronted module example

In this blog post, I’ll be using https://github.com/BGmot/zabbix-module-hosts-tree as an example which brings a brand new main menu element Monitoring -> Hosts tree showing the end-user a hierarchy of host groups instead of a flat list of hosts that comes with default Zabbix installation in Monitoring -> Hosts.

Refer to Zabbix official documentation on how to install modules and I won’t waste any time here on what is perfectly well documented. After this module is installed we’ll have the following files:

/usr/share/zabbix/modules/zabbix-module-hosts-tree# tree
.
|-- Module.php
|-- actions
| |-- CControllerBGHost.php
| |-- CControllerBGHostView.php
| `-- CControllerBGHostViewRefresh.php
|-- manifest.json
|-- partials
| |-- js
| | `-- monitoring.host.view.refresh.js.php
| `-- module.monitoring.host.view.html.php
`-- views
|-- js
| `-- monitoring.host.view.js.php
|-- module.monitoring.bghost.view.php
`-- module.monitoring.bghost.view.refresh.php

WebUI application

As mentioned earlier Zabbix WebUI is just a PHP application, yes, sophisticated, very complex but still it is a PHP application which means after every user click (and upon some timeouts), it initializes, executes and terminates producing some output (in most cases) which is passed to the user’s browser.

Top-level of this application is an object of class APP which inherits ZBase adding literally nothing, declared in file ./include/classes/core/APP.php:

class APP extends ZBase {
}

Application starts with ZBase::run() method, file ./include/config.inc.php:

APP::getInstance()->run(APP::EXEC_MODE_DEFAULT);

Method run() does many things, but in the light of this blog post it is important to mention these two:

  • router initialization
  • modules initialization

Router initialization

The so-called “Router” is a crucial part of the Controller component of MVC that drives a decision on how to handle a user’s request. The Router works based on an associative array $routes which is defined in ./include/classes/mvc/CRouter.php. Here is a code snippet illustrating how this array is organized:

private $routes = [
// action                   controller                        layout             view
‘action.operation.get'  => [‘CControllerActionOperationGet',  ‘layout.json',     null],
‘audit.settings.update' => [‘CControllerAuditSettingsUpdate',  null,             null],
‘dashboard.view'        => [‘CControllerDashboardView', ‘layout.htmlpage',   'monitoring.dashboard.view'],

As you can see in this array for every action three elements are defined:

  • controller (a class that will be used to prepare data)

  • layout (in which form to present data generated by the controller)

  • view (how to present/show the data to end-user/requestor)

We will talk about each of these components later. For now – just remember that the controller here is a class name while layout and view is a name of a .php file that is included (i.e. executed) at a certain point of webUI application execution. So Router Initialization is basically the class CRouter instantiation making $routes array available to other classes via CRouter‘s methods.

Modules initialization

This is the moment the web UI application goes through all enabled Modules in the system calling their init() method. Here in ZBase::run() methods init() are called and all the actions from enabled modules are added to the Router, file ./include/classes/core/ZBase.php:

$this->initModuleManager();
$router = $this->component_registry->get('router');
$router->addActions($this->module_manager->getActions());
$router->setAction($action_name);

Your module must be a child of Core\CModule class defined in ./include/classes/core/CModule.php. You can redefine init() method to fit your needs. Since init() method of all enabled Modules is called, this makes it a perfect place to have new menu items added to Zabbix WebUI main menu here.

We talked about Router in the previous clause and your module’s manifest.json file defines what needs to be added to this Router for your Module to function properly. If you defined an action in your Module that already exists in out-of-the-box Zabbix then the Module’s action overwrites the “default” entry in $router array thus, you can re-define the behavior of any menu item.

Overall what happens during the module initialization is easier to describe by following this picture:

So we added a new menu item “Hosts tree” under “Hosts” in the main menu. When a user clicks on this menu item a request will be generated with the parameter “action” set to “bghost.view” and now the Router “knows” that to serve this action it needs to use CControllerBGHostView class as a Controller and module.monitoring.bghost.view.php file (with default layout.html.php layout) to generate HTML code.

Processing request

After initialization is done the ZBase::processRequest() method is called, file ./include/classes/core/ZBase.php, passing initialized Router (let’s assume user selected Monitoring -> Hosts tree menu item implemented in my module):

$this->processRequest($router);
...
private function processRequest(CRouter $router): void {
  $action_name = $router->getAction();        // returns “bghost.view”
  $action_class = $router->getController();   // “Modules\BGmotHosts\Actions\CControllerBGHostView”
  ...
  $action = new $action_class();              // Controller defined for this action is instantiated
  ...
  register_shutdown_function(function() use ($action) {
    $this->module_manager->publishEvent($action, 'onTerminate');
  });

As we can see it gets the action name and action class name from the Router and then instantiates in variable $action a Controller class that will handle the requested action.

Then it registers all enabled modules onTerminate() methods – it means that before PHP execution exits all these functions will be executed regardless of what menu item a user selected. One important caveat: if a user selected an action covered by your module then your module’s onTerminate() method will be executed the last (later than all other modules’ onTerminate() methods) right when WebUI PHP application is about to finish its execution (i.e. exit). onTerminate() method of base class Core/CModule is empty and you can redefine it in your module, e.g., here is how I am adding my JavaScript to every page that generates HTML, ./modules/zabbix-module-menu/Module.php:

public function onTerminate(CAction $action): void {
  $action_page = $action->getAction();
  $router = clone APP::Component()->get('router');
  $layout = $router->getLayout();
  if ($action_page) {
    if ($action_page != 'jsrpc.php' &&
        $layout != 'layout.widget' &&
        $layout != 'layout.json') {
      echo '<script type="text/javascript">';
      echo file_get_contents(__DIR__.'/js/bg_menu.js');
      echo '</script>';
    }
  }
}

This approach works well only with layout.htmlpage layout as it just adds whatever you output here right before closing </body> tag in the final HTML page returned to browser.

Keep in mind we are still in ZBase::processRequest() method and now it executes all enabled modules’ onBeforeAction() methods:

$this->module_manager->publishEvent($action, 'onBeforeAction');

Again if a user selected an action covered by your module then your module’s onBeforeAction() method will be executed last. You can define onBeforeAction() method in your module this way:

public function onBeforeAction(CAction $action): void {
...
}

Processing request – Controller

Now ZBase::processRequest() passes execution to our Controller (remember $action was instantiated with our Controller class):

 $action->run();

All controllers have CController as a parent class. Method run() is defined in ./include/classes/mvc/CController.php and you cannot re-define it in your controller; it does some standards checks, validates input and if everything is ok passes control directly to your controller’s doAction() method and returns the result:

final public function run(): ?CControllerResponse {
  if ($this->checkInput()) {
    $this->doAction();
  return $this->getResponse();

Use your Controller’s checkInput() method to check whether all the parameters passed in the HTTP request are valid. You can implement it like this (see ./modules/zabbix-module-hosts-tree/actions/CControllerBGHostView.php):

 class CControllerBGHostView extends CControllerBGHost { 
        protected function checkInput(): bool {
                $fields = [             
                       'name' =>       'string',
                       'groupids' =>   'array_id',
                       'status' =>     'in -1,’.HOST_STATUS_MONITORED.','.HOST_STATUS_NOT_MONITORED,
                ...
                $ret = $this->validateInput($fields);
                return $ret;

All possible validation rules are defined in ./include/classes/validators/CNewValidator.php Just take a peek and select proper validation:

 # grep "case ‘" ./include/classes/validators/CNewValidator.php 
                                case ‘not_empty':
                                case 'json':
                                case ‘in':
                                ...
                                case 'array_id':
                                ...

If checkInput() method returns true (input is valid) then doAction() of your Controller is called. You can go as fancy as you want preparing data: executing internal Zabbix functions (The Model component of MVC), performing selects/updates in the database directly from your code, talk to other APIs, etc. At the end you need to prepare one massive associative array (usually it is named $data) and return it as shown here (./modules/zabbix-module-hosts-tree/actions/CControllerBGHostView.php):

 protected function doAction(): void {
     ...
     $data = [ ... ];
     $response = new CControllerResponseData($data);
     $response->setTitle(_('Hosts'));
     $this->setResponse($response);
 }

Whatever you place into the associative array $data will be available later in your View code.

Processing request – View

The final step in ZBase::processRequest() is calling one more method that in fact handles the layout and view you defined in the Router for this action:

$this->processResponseFinal($router, $action);

The name of your view file is what you put into the view field for your action in manifest.json + .php extension and must be in the views/ folder.

First, it fills in the layout data with defaults and if a view is defined for given action then it constructs an instance of CView class (defined in ./include/classes/mvc/CView.php) passing data you prepared in Controller to CView constructor:

private function processResponseFinal(CRouter $router, CAction $action): void {
  ...
  if ($router->getView() !== null && $response->isViewEnabled()) {
    $view = new CView($router->getView(), $response->getData());

CView constructor just tries to find a file with the name of your View + .php extension, e.g. module.monitoring.bghost.view.php and, if found, initializes two member variables $name and $data.

public function __construct($name, array $data = []) {

Everything you implement in your View .php file will feed the $layout_data variable.

$layout_data = array_replace($layout_data_defaults, [
  'main_block' => $view->getOutput(),
  'javascript' => [ 'files' => $view->getJsFiles() ],
  'stylesheet' => [ 'files' => $view->getCssFiles() ],
  'web_layout_mode' => $view->getLayoutMode()
]);

CView::getOutput() simply performs PHP include of your view .php file, so whatever you “print” in your view will be assigned to $layout_data[‘main_block’], file ./include/classes/mvc/CView.php:

public function getOutput() {
  $file_path = $this->directory.’/'.$this->name.'.php';
  ob_start();
  if ((include $file_path) === false) {
    ...
  return ob_get_clean();

If you have a lot of time and want to produce a real “piece of art” web page then you can print pure HTML/CSS in your view .php file, but it will most probably not look like the “Zabbix native style”. For example:

Fortunately, there is an easy solution to make your web pages look elegant and totally “Zabbix’ish”: it is very easy to use Zabbix classes, e.g., to add a table to your page use CTableInfo class, then prepare an array with CColHeader elements and add them with CTableInfo::setHeader(), then construct rows with CRow class and add them via CTableInfo::addRow(), etc.

Look at the Zabbix source code for an example of how to use them.
 See the list of all out-of-the-box classes here:

# ls -1 ./include/classes/html/
CActionButtonList.php
CBarGauge.php
CButton.php
...

Since we passed all the data generated by your Controller to your View object, it is very easy to use the data in your View code, e.g., in Controller you do the following:

 # ./modules/zabbix-module-hosts-tree/actions/CControllerBGHostView.php
 protected function doAction(): void {
     $data = [ 'hosts_count' => API::Host()->get(['countOutput' => true]) ];
     $response = new CControllerResponseData($data);
     $this->setResponse($response);
 }

And in View you use this data:

If you want to add some CSS files to the page, do it in your View code this way:

 # ./views/module.monitoring.bghost.view.php
 $this->addCssFile('modules/zabbix-module-hosts-tree/views/css/mycool.css');

Your CSS file should not contain <style> tags, just pure CSS:

 # ./views/css/mycool.css
 .list-table thead th {color: #ff0000;}

Interestingly enough you cannot add JavaScript files from your module folder using a similar addJsFile() method. You can use this method only to use the JS files that come with Zabbix and are located in the root /js folder, e.g.:

 # ./views/module.monitoring.bghost.view.php
 $this->addJsFile('multiselect.js');

To include your JavaScript code use this function. Note this .php file that must have PHP code that by “printing” generates JavaScript code, not .js (containing pure JavaScript code).

 # ./views/module.monitoring.bghost.view.php
 $this->includeJsFile('monitoring.host.view.js.php', $data);

– Again: this .php file must “print” JavaScript code and will be searched in the ./js
 subfolder of the view file this function is invoked from.
– A copy of $data variable will be available for use within the file!

So all the HTML code you generated went into $layout_data[‘main_block’], all JavaScript files that need to be included went into $layout_data[‘javascript’] and all CSS files that need to be included went into $layout_data[‘stylesheet’] (see $layout_data variable initialization above in this article).

The last thing ZBase::processResponseFinal() does is instantiate the new CView class with a layout .php file (you define the layout for every action in Router, remember?) and “printing” everything according to the selected layout, file ./include/classes/core/ZBase.php:

echo (new CView($router->getLayout(), $layout_data))->getOutput();

That is it! At this point, everything you “printed” (don’t forget the onTerminate() function) is returned as an HTTP response to the user’s browser.

Wrapping up this article I must tell you about one more thing – module configuration. In your modules’ manifest.json file you can have a “config” section –  below you can see how it is reflected in the Zabbix database:

I would not mention this, but there is one interesting caveat: the “config” section of manifest.json file is copied into the database only when the module is discovered for the first time during modules directory scanning. Changing the manifest.json file later has no effect; module’s config is always taken from the database so if you want to change something you either need to first wipe the module or make changes directly in the database.
Access this config data in your code with $this->getConfig();

Good luck developing Web Modules!
Your BGmot.

Real Life Business Service Monitoring

Post Syndicated from Alexander Petrov-Gavrilov original https://blog.zabbix.com/real-life-business-service-monitoring/24915/

Learn more about Zabbix business service monitoring features and check out our real-life use cases. The article is based on a Zabbix Summit 2022 speech by Aleksandrs Petrovs-Gavrilovs.

Business service monitoring with Zabbix

Hello everyone, my name is Alex and today I am going to write about Advanced business service and SLA monitoring and the related use cases.

Some of you may already be familiar with business services and the core idea behind them. In the vast majority of organizations, we have services that we provide to our customers or/and for internal use. The availability of those services is usually based either on hardware, software or people’s presence and availability. 

But no matter how well our monitoring is configured, there are times when we can miss how each specific device affects our business and that is where business service monitoring can help us.

With the help of business service monitoring it is possible to see what exactly is going on with your business depending on the state of every single part of your infrastructure. This allows us, the admins and service owners, to understand what it really means when a piece of hardware breaks or a device becomes unreachable. With business service monitoring, we see what exactly impacts our business and how severe the situation is, including calculating SLA (Service Level Agreement) and evaluating it against the defined SLO (Service Level Objective).

Business service hierarchy example

So let’s check out some examples of what business real-life business services may look like.

An average service tree example

In this example, we have a service tree that is based on support services. It has phones and phones are plugged into PBX, while PBX is plugged into the switch. And this is just one example, in reality, we could have a more complex infrastructure consisting of containers, CRM services and so on. And we of course monitor all of them, but what if we are interested in the business perspective as well?

To see the business perspective we need to go to the new service section in the main menu, where we can create and view the service tree itself. In addition, in the same section, we can configure the actions, which enable us to react in cases when something happens with one of the services.
We can also specify the SLO we strive to achieve and see SLA reports on the current situation.

Basic service overview

The service view also lets us see if we have problems affecting our services and track their root cause.

Service with active problems

Defining which service is affected by what problem is done by utilizing problem tags, which essentially link them together. Services can also have their own tags, which we use to group services and understand how one service relates to another. We can also use service tags to build an SLA report or execute actions in case a service is affected by a problem. Permissions too are based on service tags, allowing to creation of different views for different users.

But those are just the basics – what’s more interesting are the actual use cases. Let’s take a look at how Zabbix users actually use business service monitoring to their advantage based on real business examples. 

Business service tree for a financial institution

Real business service use cases can be helpful examples that can help you design your own Zabbix business service trees. Maybe you already have a similar business of your own and you need that starting point for everything to “click” – that starting point can be a real-life example.

Example service tree of a bank

The first example will seem a bit convoluted while actually being very straightforward. Here we can see an actual financial institution business service tree. You can see they have quite a lot of different interconnected services. First look at the service tree raw schema may be a bit confusing, but in Zabbix it’s pretty straightforward.

The internal service is connected to emails and emails are related to customer services at the same since we do need to communicate with the customers, not only internally! In addition, we also have to define services representing the underlying systems and applications which support our email services. That is easy to do with Zabbix services.

Easy to read e-mail service state

Imagine now, if you don’t have the services functionality at all, how fast can you check the status of the email service when all you have is only a list of problems for multiple devices? How can you check the service statistics for an entire year? That was the question that the service owners and administrators had in this use case and they solved it by defining Zabbix business service trees.

The “root” service

We start by defining the root business service – Financial institution. It is linked to 15 main services. The 15 services are grouped into internal or external ones. The lower-level services also contain the sub-services that the main services are based on. I.e., we have an Accounting service based on specific VM availability, where the accounting software resides on.

Detailed service tree

The services are divided into specific categories so the service owners can read the situation a lot easier without spending a lot of time figuring out which problem causes which situation. With a single click, the service owners can immediately see which components or child services each service is based on and the actual service SLA. This also gives the benefit of displaying the root cause problem and being able to quickly identify which child services are causing issues with a particular business service.

Parent-Child service relationship

Don’t forget, that the business service trees can be multi-level –  child services can have their own child services and services can also be interconnected with each other. For example – in the Parent-Child service relationship screenshot, we can see that we have an Accounting service. Accounting uses Microsoft services. Microsoft services are also used internally. So what happens when Microsoft services stop working? We will know that accounting will be affected, the internal services will be affected and we will see the exact chain of events – what and how exactly went wrong in the organization and which components need fixing.

Service state configuration

Services can have a varying impact on your business. Some services are more critical than others. Additional rules enable Zabbix to take the potential service impact into account. The first two additional rules analyze the percentage of affected child services and set the severity of the service problem accordingly.
But if the two most critical services are affected, that will immediately become a disaster. For example, online banking – you can imagine that any bank now has an online banking service and if it goes down – all the customers will be affected; it could even hit the news, not only monitoring. So of course they want to immediately know about that kind of a disaster, and with Zabbix services – they will. By defining additional rules and service weights, you can react to problems preemptively and fix the issues before they cause downtime for your end users.

SLA reporting

In Zabbix, we can choose for what periods SLA should be calculated – daily, weekly, monthly, yearly, or a mixed selection of those. Based on our selection, we can see real-time reports for services and as an example, by the end of the year or a day, understand what needs the most attention and review the service performance. Or to put in a closer-to-reality example – find out by accounting reports if the licenses were renewed in time so that the software which is used by accounting is always available.  We can also build a dashboard that will contain the reports, showing what is the current summary for the service so they can plan, buy new software, buy a new license and get new hardware and always be ahead again of whatever might happen.

Service state dashboard

Service permissions in user roles can be used to create different service views. This can be used to hide sensitive service information or simply display the services at the required level of detail. For example, a more detailed view can be provided for internal support users since they will need as much information as possible to fix any service-related issues. Separate views can be provided for Accounting and Management teams, showing only the relevant data to ensure a quick and reliable decision-making process. 

What if we want to make things even more simple for our Accounting and Management teams? We can use actions and scheduled report functionality to deliver the required information to the user’s mailbox without having them periodically log into Zabbix.

Service permissions

Business service tree for an MSP

Another example is an MSP (managed service provider) service tree. This use case is encountered pretty frequently and the tree is always easy to read even in the raw schema view as this:

Manager Service Provider service tree

We use a hosting company for our example. The company provides a particular set of services for its users. There are also some internal services that can also be used by the customers – for example, Zabbix itself.

Zabbix can be a great tool in MSP scenarios since it’s straightforward to provide customers with access to Zabbix and build a dashboard view with the latest statistics related to a particular user.
In this example, we can see the main service which is hosting, divided across customers, where each customer is a branch of that tree, using the hosting services the company provides. We also see that monitoring is a service itself because in this case customers also have the advantage of using Zabbix to get detailed information about the services they use and their current state. Seeing the current level of SLA for the servers they use and does it match the expectations.

Customer overview

The MSP, of course, retains the full view of the customers and all customers are equally important and deserve a proper quality of service so of course each customer will have an equal weight assigned to them. As soon as any customer has a problem, the related service will be marked with a high-level severity on the service tree. This way, the MSP will immediately see which customer is affected, making it possible to assist them as quickly as possible.

If you have a bigger environment – maybe you have hundreds of customers, you may opt out of defining service weights in your configuration since the number of services changes very rapidly. How can we react to global issues then?
We can use percentage rules instead of reacting to just the static weight number. This way, we can check is the problem related to a single customer or is it something global and everyone is now affected.

Root cause view in the services will allow you to start fixing everything immediately. Meanwhile, each customer can be informed individually using the service actions and conditions. This should be easy to do if we have properly named or tagged the services.

Customer service configuration

Don’t forget to define the permissions so that any customer, as Mooyani here, can have access to their Services immediately after login, ensuring that information not only remains private but also relevant for the current user.

Customer view

All information for Customers can be placed on their personal dashboards where they can see all the details whenever they need to. Monitoring the traffic going through their VMs, resource usage, application statuses and any other monitored entities. Don’t forget that service SLA reports can also be placed on Zabbix dashboards. This way your customers can see that the MSP meets the terms defined in the agreement and everything is performing as expected. 

To summarize – monitoring your infrastructure is great from any perspective, including business monitoring. it’s always a good idea to provide this view as an MSP to your customers, so they can see we meet the standards we define for ourselves and course promise for our users.

Generating Zabbix Health CSV reports with custom frontend module

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/generating-zabbix-health-csv-reports-with-custom-frontend-module/24369/

Have you ever wanted a button to export specific data to CSV directly from Zabbix frontend? My friend Gregory and I have solved this problem.

In this blog post, I will show you how with the help of a custom Zabbix frontend module, you can add an “Export to CSV” button and use it to export custom information. This can be installed as a frontend module. It’s applicable for versions Zabbix 5.0 LTS and Zabbix 6.0 LTS. As an additional benefit, we can use this module to polish our Zabbix instance, find misconfiguration, summarize existing configuration and locate potential bottlenecks.

Characteristics of the module

  • Define and store custom SQL queries in the user profile. It uses the “profiles” database table, specifically the columns “idx” and “value_str” to store the settings/data. We are not changing the database structure but using what’s available already.
  • Export an unlimited amount of rows via the CSV button.
  • Compatible with MySQL and PostgreSQL and general SQL syntax. Use the same language as you use directly in the SQL client.
  • SQL highlighter. It’s based on a well-known CodeMirror product. The Zabbix module is still completely standalone.
  • Protection to disallow running UPDATE, DELETE, INSERT, CREATE, ALTER, DROP commands.
  • Hyperlink support in the SQL output. Useful to generate a link used to navigate closer to the problem. Only hyperlinks which are related to zabbix sections (items.php, triggers.php) are supported.

What is it NOT?

  • The module does not provide an environment for “Zabbix User” or “Zabbix Admin” types of users to perform any reporting. Only users of “Zabbix Super Admin” type can use this module.

Install

Below you can find a sequence of commands used to download and install the “Export to CSV” module

Navigate to default directory of frontend modules:
cd /usr/share/zabbix/modules
Pick and create a name of directory:
mkdir -p zabbix-module-sqlexplorer
Navigate inside:
cd zabbix-module-sqlexplorer
Download version 1.4 of module:
wget https://github.com/gr8b/zabbix-module-sqlexplorer/releases/download/v1.6/v1.6.zip
Extract the archive:
unzip v1.6.zip
Remove the original source file, it is unnecessary:
rm v1.6.zip

Enable

Open “Administration” => “General” => “Modules” => click on “Scan directory“. After that, click “Enable“:

A new menu is available:

Now we can run an SQL query, see the result on the screen, and export the data to CSV:

Let’s talk about 5 beneficial use cases that this module can enable. Here Zabbix technical support engineers are sharing a few frequently used SQL commands.

1) Ensure the basic data collection works

This section would be the basic minimum to maintain. In a perfect world, every host object (a device, a server) must be online 24/7. For each category (ZBX, SNMP, IPMI, JMX) we need to use a dedicated query. Let’s work with the 2 most popular categories – ZBX and SNMP.

Unreachable Zabbix agent (ZBX) hosts

Zabbix 5.0:

SELECT proxy.host AS proxy, hosts.host, hosts.error AS hostError, CONCAT('hosts.php?form=update&hostid=',hosts.hostid) AS goTo FROM hosts LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) WHERE hosts.status=0 AND LENGTH(hosts.error) > 0;

Zabbix 6.0:

SELECT proxy.host AS proxy, hosts.host, interface.error FROM hosts LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) JOIN interface ON (interface.hostid=hosts.hostid) WHERE LENGTH(interface.error) > 0 AND interface.type=1;

In the output, we receive a proxy object, host title, an error message, a clickable link to navigate immediately to the object.
Per every row in output, either we need to: 1) Fix the issue. Most likely it is a firewall/DNS/credential/timeout or network quality issue; 2) Delete host object; or 3) Disable host object.

Unreachable network (SNMP) devices

On Zabbix 5.0 use:

SELECT proxy.host AS proxy, hosts.host, hosts.snmp_error AS hostError, CONCAT('hosts.php?form=update&hostid=',hosts.hostid) AS goTo FROM hosts LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) WHERE hosts.status=0 AND LENGTH(hosts.snmp_error) > 0;

On Zabbix 6.0 use:

SELECT proxy.host AS proxy, hosts.host, interface.error, CONCAT('zabbix.php?action=host.edit&hostid=',hosts.hostid) AS goTo FROM hosts LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) JOIN interface ON (interface.hostid=hosts.hostid) WHERE LENGTH(interface.error) > 0 AND interface.type=2;

We receive the proxy title, host object, reason the host is not working and a clickable link to the object.

2) Improve speed of frontend

Ensure we have the best response time for all sections in GUI. If tables contain “unnecessary” data, the user experience will suffer for it. Nowadays, no one wants to spend longer than a few seconds waiting for data to be displayed on the screen.

Amount of user sessions

Zabbix 5.0:

SELECT COUNT(*) AS count, users.alias FROM sessions JOIN users ON (users.userid=sessions.userid) GROUP BY 2 ORDER BY 1 DESC;

Zabbix 6.0:

SELECT COUNT(*) AS count, users.username FROM sessions JOIN users ON (users.userid=sessions.userid) GROUP BY 2 ORDER BY 1 DESC;

The total number of sessions should not exceed 1000; it’s hard to imagine why it should be over 100. It is OK to delete all data in the “sessions” table and optimize/vacuum the table. This will improve the overall Zabbix GUI responsiveness and performance.

Don’t keep too many open problems onboard

Every monitoring tool is about identifying problems. If you keep too many open problems, then frontend will be slow. The following query will print trigger problems (the ones we receive in email) and so as internal problems, which reflect the health of the monitoring tool.

SELECT COUNT(*) AS count, CASE WHEN source=0 THEN 'surface' WHEN source>0 THEN 'internal' END AS level, CASE WHEN source=0 AND object=0 THEN 'trigger in a problem state' WHEN source=3 AND object=0 THEN 'cannot evaluate trigger expression' WHEN source=3 AND object=4 THEN 'data collection not working' WHEN source=3 AND object=5 THEN 'low level discovery not perfect' END AS problemCategory FROM problem GROUP BY 2,3 ORDER BY 2 DESC;

To decrease the number of “internal” problems, have a study on the point nr. 5 in this blog post.

3) Identify exceptions

Monitoring administrators can change item update frequency at a host level, can install a different trigger threshold at the host level, or install a different threshold inside a nested template tree. This section will highlight all overrides.

Item update interval differs between host/template levels

This query will print items and LLD rules with different update frequencies on the host level while comparing them with the template level. Most of the time,  having different update interval at the host level is done by accident.

SELECT h2.host AS Source, i2.name AS itemName, i2.key_ AS itemKey, i2.delay AS OriginalUpdateFrequency,h1.host AS exceptionInstalledOn, i1.delay AS FrequencyChild, CASE WHEN i1.flags=1 THEN 'LLD rule' WHEN i1.flags IN (0,4) THEN 'data collection' END AS itemCategory , CASE WHEN i1.flags=1 THEN CONCAT('host_discovery.php?form=update&context=host&itemid=',i1.itemid) WHEN i1.flags IN (0,4) THEN CONCAT('items.php?form=update&context=host&hostid=',h1.hostid,'&itemid=',i1.itemid) END AS goTo FROM items i1 JOIN items i2 ON (i2.itemid=i1.templateid) JOIN hosts h1 ON (h1.hostid=i1.hostid) JOIN hosts h2 ON (h2.hostid=i2.hostid) WHERE i1.delay <> i2.delay;

In the output, the “Source” column is a host object (a device) or a template object. The “Source” column is heavily related to “exceptionInstalledOn” column. “Source” VS “exceptionInstalledOn” practically tells there is a relation between a host and a template or between a template and another template.

FrequencyChild” column is the most important field, which describes an exception installed which differs from the original object.

The output allows navigating directly to the item where the update frequency stands out.

Connection characteristics and custom trigger thresholds

The query will show every installed override between host and template object or between template and parent template object. If template nesting is used at multiple levels, it will highlight if an overriding value if that is installed somewhere in the middle.

SELECT hm1.macro AS Macro, child.host AS owner, hm2.value AS defaultValue, parent.host AS OverrideInstalled, hm1.value AS OverrideValue FROM hosts parent, hosts child, hosts_templates rel, hostmacro hm1, hostmacro hm2 WHERE parent.hostid=rel.hostid AND child.hostid=rel.templateid AND hm1.hostid = parent.hostid AND hm2.hostid = child.hostid AND hm1.macro = hm2.macro AND parent.flags=0 AND child.flags=0 AND hm1.value <> hm2.value;

Abandoned items

A very lonely item that does not belong to any template.

SELECT hosts.host, items.key_,CONCAT('items.php?form=update&context=host&itemid=',items.itemid) AS goTo FROM items JOIN hosts ON (hosts.hostid=items.hostid) WHERE hosts.status=0 AND hosts.flags=0 AND items.status=0 AND items.templateid IS NULL AND items.flags=0;

To keep up centralized management, it would be better to move and maintain the definition of the item to the template level.

4) Reporting

We can summarize an item and host configurations, enabled and disabled data collector items, and linked templates.

What data collection method is used?

To summarize all data collection techniques and see the membership by Zabbix proxy use the following command:

SELECT proxy.host AS proxy, CASE items.type WHEN 0 THEN 'Zabbix agent' WHEN 1 THEN 'SNMPv1 agent' WHEN 2 THEN 'Zabbix trapper' WHEN 3 THEN 'Simple check' WHEN 4 THEN 'SNMPv2 agent' WHEN 5 THEN 'Zabbix internal' WHEN 6 THEN 'SNMPv3 agent' WHEN 7 THEN 'Zabbix agent (active) check' WHEN 8 THEN 'Aggregate' WHEN 9 THEN 'HTTP test (web monitoring scenario step)' WHEN 10 THEN 'External check' WHEN 11 THEN 'Database monitor' WHEN 12 THEN 'IPMI agent' WHEN 13 THEN 'SSH agent' WHEN 14 THEN 'TELNET agent' WHEN 15 THEN 'Calculated' WHEN 16 THEN 'JMX agent' WHEN 17 THEN 'SNMP trap' WHEN 18 THEN 'Dependent item' WHEN 19 THEN 'HTTP agent' WHEN 20 THEN 'SNMP agent' WHEN 21 THEN 'Script item' END AS type,COUNT(*) FROM items JOIN hosts ON (hosts.hostid=items.hostid) LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) WHERE hosts.status=0 AND items.status=0 GROUP BY proxy.host, items.type ORDER BY 1,2,3 DESC;

Devices and linked templates

If one server runs as a database server, a web server, and an application server, there must be multiple templates linked. The following query can help to detect the linked templates.

MySQL:

SELECT proxy.host AS proxy, hosts.host, GROUP_CONCAT(template.host SEPARATOR ', ') AS templates FROM hosts JOIN hosts_templates ON (hosts_templates.hostid=hosts.hostid) LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) LEFT JOIN hosts template ON (hosts_templates.templateid=template.hostid) WHERE hosts.status IN (0,1) AND hosts.flags=0 GROUP BY 1,2 ORDER BY 1,3,2;

PostgreSQL:

SELECT proxy.host AS proxy, hosts.host, ARRAY_TO_STRING(ARRAY_AGG(template.host),', ') AS templates FROM hosts JOIN hosts_templates ON (hosts_templates.hostid=hosts.hostid) LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) LEFT JOIN hosts template ON (hosts_templates.templateid=template.hostid) WHERE hosts.status IN (0,1) AND hosts.flags=0 GROUP BY 1,2 ORDER BY 1,3,2;
Devices and all inventory fields
SELECT h.host,i.type,i.type_full,i.name,i.alias,i.os,i.os_full,i.os_short,i.serialno_a,i.serialno_b,i.tag,i.asset_tag,i.macaddress_a,i.macaddress_b,i.hardware,i.hardware_full,i.software,i.software_full,i.software_app_a,i.software_app_b,i.software_app_c,i.software_app_d,i.software_app_e,i.contact,i.location,i.location_lat,i.location_lon,i.notes,i.chassis,i.model,i.hw_arch,i.vendor,i.contract_number,i.installer_name,i.deployment_status,i.url_a,i.url_b,i.url_c,i.host_networks,i.host_netmask,i.host_router,i.oob_ip,i.oob_netmask,i.oob_router,i.date_hw_purchase,i.date_hw_install,i.date_hw_expiry,i.date_hw_decomm,i.site_address_a,i.site_address_b,i.site_address_c,i.site_city,i.site_state,i.site_country,i.site_zip,i.site_rack,i.site_notes,i.poc_1_name,i.poc_1_email,i.poc_1_phone_a,i.poc_1_phone_b,i.poc_1_cell,i.poc_1_screen,i.poc_1_notes,i.poc_2_name,i.poc_2_email,i.poc_2_phone_a,i.poc_2_phone_b,i.poc_2_cell,i.poc_2_screen,i.poc_2_notes FROM host_inventory i, hosts h WHERE i.hostid=h.hostid AND h.flags=0;

By default, it prints all columns. You can cut away the unnecessary ones. You can use the extracted data for analytics in other software, e.g., MS Excel.

External scripts in use

When migrating to the next release of Zabbix, it’s better to be aware of all external scripts and ensure the new server will support that. An external script is a custom solution for how you collect data. For example, a Python 2 language is not available on a newer operating system out of box.

SELECT items.key_,COUNT(*) FROM items JOIN hosts ON (hosts.hostid=items.hostid) WHERE hosts.status=0 AND items.status=0 AND items.type=10 GROUP BY 1 ORDER BY 2;

 

5) Polish your Zabbix instance

This section will require more work. Being a non-perfectionist is an advantage.

To work with internal events (data collection not working, trigger not working), we have to have a least one internal action enabled. It can be an action with a condition that will never be true:

Be careful, if you have huge infrastructure and currently everything is disabled, then don’t enable! Or enable only for 4h to generate some statistics and then turn it off. If you keep internal events ON and a lot of things are not working, it will create frontend very slow. And it will get even slower every day.

Data collection

Monitoring is based on data collection. If new data does not come in, we cannot detect if the service is up or down. The following query will list all data collector items which cannot receive the data.

SELECT hosts.name, items.key_ AS keyName, problem.name AS error, CONCAT('items.php?form=update&itemid=',objectid) AS goTo FROM problem JOIN items ON (items.itemid=problem.objectid) JOIN hosts ON (hosts.hostid=items.hostid) WHERE problem.source>0 AND problem.object=4;

The output provides a clickable link to navigate to the item and investigate. Common issues can be a timeout, wrong credentials, and permissions.

Trigger evaluation

When data is collected and if an item has a trigger linked, it’s possible that there is a problem in evaluating trigger logic. The result of the following query will show you why it’s impossible to detect the problem.

SELECT DISTINCT CONCAT('triggers.php?form=update&triggerid=',problem.objectid) AS goTo, hosts.name AS hostName, triggers.description AS triggerTitle, problem.name AS error FROM problem JOIN triggers ON (triggers.triggerid=problem.objectid) JOIN functions ON (functions.triggerid=triggers.triggerid) JOIN items ON (items.itemid=functions.itemid) JOIN hosts ON (hosts.hostid=items.hostid) WHERE problem.source > 0 AND problem.object=0;

Low-level discovery

The purpose of low-level discovery is to find all elements which exist in a particular system. For example, find all services on a windows system. If discovery is not working, then additional elements will not get covered, which means no data collection, no trigger, and no notification. The following query will show all erroneous LLD rules and the reason there is a problem:

SELECT hosts.name AS hostName, items.key_ AS itemKey, problem.name AS LLDerror, CONCAT('host_discovery.php?form=update&itemid=',problem.objectid) AS goTo FROM problem JOIN items ON (items.itemid=problem.objectid) JOIN hosts ON (hosts.hostid=items.hostid) WHERE problem.source > 0 AND problem.object=5;

That is it for today.

In the comments section below, let us know what kind of report your company desires to see.

Appendix

Know issues of the module:
  • When linking tables in SQL output, if there are multiple columns with the same name, only the first column will be printed. As a workaround, always use “AS ColumnName” directive to specify unique column names in the output.

Deploying containers to AWS using ECS and CodePipeline 

Post Syndicated from Jessica Veljanovska original https://www.anchor.com.au/blog/2022/10/deploying-containers-to-aws/

Deploying containers to AWS using ECS and CodePipeline 

In this post, it will look at deploying containers to AWS using ECS and CodePipeline. Note that this is only an overview of the process and is not a complete tutorial on how to set this up. 

Containers 101 

However, before we look at how we can deploy Containers onto AWS, it is first worth looking at why we want to use containers, so what are containers?

Containers provide environments for applications to run in isolation. Unlike virtualisation, containers do not require a full guest operating system for each container instance, instead, they share the kernel and other lower-level components of the host operating system as provided by the underlying containerization platform (the most popular of which is Docker, which we will be using in examples from here onwards). This makes containers more efficient at using the underlying hardware resources to run applications. Since Containers are isolated, they can include all their dependencies separate from other running Containers. Suppose that you have two applications that each require specific, conflicting, versions of Python to run, they will not run correctly if this requirement is not met. With containers, you can run both applications with their own environments and dependencies on the same Host Operating system without conflict. 

Containers are also portable, as a developer you can create, run, and test an image on your local workstation and then deploy the same image into your production environment. Of course, deploying into production is never that simple, but that example highlights the flexibility that containers can afford to developers in creating their applications. 

This is achieved by using images, which are read-only templates that provide the containerization platform the instructions required to run the image as a Container. Each image is built from a DockerFile that provides the specification on how to build the image and can also include other images. This can be as simple or as complicated as it needs to be to successfully run the application.

FROM ubuntu:22.04 

COPY . /app 

RUN make /app 

CMD python /app/app.py

However, it is important to know that each image is made up of different layers, which are created based on each line of instruction in the DockerFile that is used to build the image. Each layer is cached by Docker which provides performance benefits to well optimised DockerFiles and the resulting images they create. When the image is run by Docker it creates a Container that flips the layers from the image and adds a runtime Read-Write layer on top, which can be used for logging and any other activity that the application needs to perform and cannot be read from the image. You can use the same image to run as many Containers (running instances of the image) as you desire. Finally, when a Container is removed, the image is retained and only the Read-Write layer is lost. 


Elastic Container Service 

Elastic Container Service or ECS is AWS’ native Container management system. As an orchestration system it makes it easy to deploy and manage containerized applications at scale with built in scheduling that can allow you to spin up/down Containers at specific times or even configure auto-scaling. ECS has three primary modes of operation, known as launch types. Elastic Container Service or ECS is AWS’ native Container management system. As an orchestration system it makes it easy to deploy and manage containerized applications at scale with built in scheduling that can allow you to spin up/down Containers at specific times or even configure auto-scaling. ECS has three primary modes of operation, known as launch types. The first launch type is the AWS EC2 launch type, where you run the ECS agent on EC2 instances that you maintain and manage. The available resource capacity is dependent on the number and type of EC2 instances that you are using. As you are already billed for the AWS resources consumed (EC2, EBS, etc.), there are no additional charges for using ECS with this launch type. If you are using AWS Outposts, you can use also utilise that capacity instead of Cloud-based EC2 instances. 

The second launch type is the AWS Fargate launch type. This is similar to the EC2 launch type, however, the major difference is that AWS are managing the underlying infrastructure. Because of this there are no inherent capacity constraints, and the billing models is based on the amount of vCPU and memory your Containers consume. 

The last launch type is the External launch type, which is known as ECS Anywhere. This allows you to use the features of ECS with your own on-premises hardware/infrastructure. This is billed at an hourly rate per managed instance. 

ECS operates using a hierarchy that connects together the different aspects of the service and allows flexibility in exactly how the application is run. This hierarchy consists of ECS Clusters, ECS Services, ECS Tasks, and finally the running Containers. We will briefly look at each of these services. ECS Clusters 

ECS Clusters are a logical grouping of ECS Services and their associated ECS Task. A Cluster is where scheduled tasks can be defined, either using fixed intervals or cron expression. If you are using the AWS EC2 Launch Type, it is also where you can configure Auto-Scaling for the EC2 instances in the form of Capacity Providers. 

ECS Services 

ECS Services are responsible for running ECS Tasks. They will ensure that the desired number of Tasks are running and will start new Tasks if an existing Task stops or fails for any reason in order to maintain the desired number of running Tasks. Auto-Scaling can be configured here to automatically update the desired number of Tasks based on CPU and Memory. You can also associate the Service with an Elastic Load Balancer Target Group and if you do this you can also use the number of requests as a metric for Auto-Scaling. 

ECS Tasks 

ECS Tasks are running instances of an ECS Task Definition. The Task Definition describes one or more Containers that make up the Task. This is where the main configuration for the Containers is defined, including the Image/s that are used, the Ports exposed, Environment variables, and if any storage volumes (for example EFS) are required. The Task as a whole can also have Resource sizing configured, which is a hard requirement for Fargate due to its billing model. Task definitions are versioned, so each time you make a change it will create a new revision, allowing you to easily roll back to an older configuration if required.


Elastic Container Registry 

Elastic Container Registry or ECR, is not formally part of ECS but instead supports it. ECR can be used to publicly or privately store Container images and like all AWS services has granular permissions provided using IAM. The main features of ECR, besides its ability to integrate with ECS, is the built-in vulnerability scanning for images, and the lack of throttling limitations that public container registries might impose. It is not a free service though, so you will need to pay for any usage that falls outside of the included Free-Tier inclusions.

ECR is not strictly required for using ECS, you can continue to use whatever image registry you want. However, if you are using a CodePipeline to deploy your Containers we recommend using ECR purely to avoid throttling issues preventing the CodePipeline from successfully running. 


CodePipeline 

In software development, a Pipeline is a collection of multiple tools and services working together in order to achieve a common goal, most commonly building and deploying application code. AWS CodePipeline is AWS’ native form of a Pipeline that supports both AWS services as well as external services in order to help automate deployments/releases.

CodePipeline is one part of the complete set of developer tools that AWS provides, which commonly have names starting with “Code”. This is important as by itself CodePipeline can only orchestrate other tooling. For a Pipeline that will deploy a Container to ECS, we will need AWS CodeCommit, AWS CodeBuild, and AWS CodeDeploy. In addition to running the components that are configured, CodePipeline can also provide and retrieve artifacts stored in S3 for each step. For example, it will store the application code from CodeCommit into S3 and provide this to CodeBuild, Code Build will then take this and will create its own artifact files that are provided to CodeDeploy. 

AWS CodeCommit 

AWS CodeCommit is a fully managed source control service that hosts private Git repositories. While this is not strictly required for the CodePipeline, we recommend using it to ensure that AWS has a cached copy of the code at all times. External git repositories can use actions or their own pipelines to push code to CodeCommit when it has been committed. CodeCommit is used in the Source stage of the CodePipeline to both provide the code that is used and to trigger the Pipeline to run when a commit is made. Alternatively, you can use CodeStar Connections to directly use GitHub or BitBucket as the Source stage instead.

AWS CodeBuild 

AWS CodeBuild is a fully managed integration service that can be used to run commands as defined in the BuildSpec that it draws its configuration from, either in CodeBuild itself, or from the Source repository. This flexibility allows it to compile source code, run tests, make API calls, or in our case build Docker Images using the DockerFile that is part of a code repository. CodeBuild is used in the Build stage of the CodePipeline to build the Docker Image, push it to ECR, and update any Environment Variables used later in the deployment.

The following is an example of what a typical BuildSpec might look like for our purposes. 

version: 0.2
  
  
  phases:
  
    pre_build:
  
      commands:
  
        - echo Logging in to Amazon ECR...
  
        - aws --version
  
        - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $ACCOUNTID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
  
    build:
  
      commands:
  
        - echo Build started on `date`
  
        - echo Building the Docker image...
  
        - docker build -t "$REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME" .
  
        - docker tag "$REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME" "$REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME"
  
    post_build:
  
      commands:
  
        - echo Build completed on `date`
  
        - echo Pushing the Docker images...
  
        - docker push "$REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME"
  
        - echo Writing image definitions file...
  
        - printf '{"ImageURI":"%s"}'" $REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME"  > imageDetail.json
  
        - echo $REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME
  
        - sed -i 's|<APP_NAME>|'$IMAGE_REPO_NAME'|g' appspec.yaml taskdef.json
  
        - sed -i 's|<SERVICE_PORT>|'$SERVICE_PORT'|g' appspec.yaml taskdef.json
  
        - sed -i 's|<AWS_ACCOUNT_ID>|'$AWS_ACCOUNT_ID'|g' taskdef.json
  
        - sed -i 's|<MEMORY_RESV>|'$MEMORY_RESV'|g' taskdef.json
  
        - sed -i 's|<IMAGE_NAME>|'$REPOSITORY_URI'\:'$IMAGE_TAG-$CODEBUILD_START_TIME'|g' taskdef.json
  
        - sed -i 's|<IAM_EXEC_ROLE>|'$IAM_EXEC_ROLE'|g' taskdef.json
  
        - sed -i 's|<REGION>|'$REGION'|g' taskdef.json
  
  artifacts:
  
    files:
  
      - appspec.yaml
  
      - taskdef.json
  
      - imageDetail.json

Failures in this step are most likely caused by an incorrect, non-functioning DockerFile or hitting throttling from external Docker repositories. 

AWS CodeDeploy 

AWS CodeDeploy is a fully managed deployment service that integrates with several AWS services including ECS. It can also be used to deploy software to on-premises servers. The configuration of the deployment is defined using an appspec file. CodeDeploy offers several different deployment types and configurations. We tend to use the `Blue/Green` deployment type and the `CodeDeployDefault.ECSAllAtOnce` deployment configuration by default. 

The Blue/Green deployment model allows for the deployment to rollback to the previously deployed Tasks if the deployment is not successful. This makes use of Load Balancer Target Groups to determine if the created Tasks in the ECS Service are healthy. If the health checks fail, then the deployment will fail and trigger a rollback.  

In our ECS CodePipeline, CodeDeploy will run based on the following example appspec.yaml file. Note that the place holder variables in <> are updated with real configuration by the CodeBuild stage.

version: 0.0 

Resources: 

  - TargetService: 

      Type: AWS::ECS::Service 

      Properties: 

        TaskDefinition: <TASK_DEFINITION> 

        LoadBalancerInfo: 

          ContainerName: "<APP_NAME>" 

          ContainerPort: <SERVICE_PORT> 

You may have noticed a lack of configuration regarding the actual Container we will be running up to this point. As we noted earlier, ECS Tasks can use a taskdef file to configure the Task and Containers, which is exactly what we are doing here. One of the files the CodeBuild configuration above expects to be present is taskdef.json. Below is an example taskdef.json file, ss with the appspec.yaml file there are placeholder variables as indicated by <>.


{ 

  "executionRoleArn": "<IAM_EXEC_ROLE>", 

  "containerDefinitions": [ 

    { 

      "name": "<APP_NAME>", 

      "image": "<IMAGE_NAME>", 

      "essential": true, 

      "memoryReservation": <MEMORY_RESV>, 

      "portMappings": [ 

        { 

          "hostPort": 0, 

          "protocol": "tcp", 

          "containerPort": <SERVICE_PORT> 

        } 

      ], 

      "environment": [ 

        { 

          "name": "PORT", 

          "value": "<SERVICE_PORT>" 

        }, 

        { 

          "name": "APP_NAME", 

          "value": "<APP_NAME>" 

        }, 

        { 

          "name": "IMAGE_BUILD_DATE", 

          "value": "<IMAGE_BUILD_DATE>" 

        }, 

      ], 

      "mountPoints": [] 

    } 

  ], 

  "volumes": [], 

  "requiresCompatibilities": [ 

    "EC2" 

  ], 

  "networkMode": "bridge", 

  "family": "<APP_NAME>" 

Failures in this stage of the CodePipeline generally mean that there is something wrong with the Container that is preventing it from starting successfully and passing its health check. Causes of this are varied and rely heavily on having good logging in place. Failures can range from the DockerFile being configured to execute a command that does not exist, to the application itself erroring out when starting for whatever reason might be logged, such as pulling incomplete data from RDS or failing to connect to an external service. 

Summary 

In summary, we have seen what Containers are, why they are useful, and what options AWS provides with its Elastic Container Service (ECS) for running them. Additionally, we looked at what parts of AWS CodePipeline we would need to use in order to deploy our Containers to ECS using CodePipeline.

For further information on the services used, I would highly recommend looking at the documentation that AWS provides. 

For a more hands-on guided walkthrough on setting up ECS and CodePipeline, AWS provide the following resources, there is also plenty of third party material you can find online. 

The post Deploying containers to AWS using ECS and CodePipeline  appeared first on Anchor | Cloud Engineering Services.

Docker Container Monitoring With Zabbix

Post Syndicated from Dmitry Lambert original https://blog.zabbix.com/docker-container-monitoring-with-zabbix/20175/

In this blog post, I will cover Docker container monitoring with Zabbix. We will use the official Docker by Zabbix agent 2 template to make things as simple as possible. The template download link and configuration steps can be found on the Zabbix Integrations page. If you require a visual guide, I invite you to check out my video covering this topic.

Importing the official Docker template

Importing the Docker by Zabbix agent 2 template

Since we will be using the official Docker by Zabbix agent 2 template, first, we need to make sure that the template is actually available in our Zabbix instance. The template is available for Zabbix versions 5.0, 5.4, and 6.0. If you cannot find this template under Configuration – Templates, chances are that you haven’t imported it into your environment after upgrading Zabbix to one of the aforementioned versions. Remember that Zabbix does not modify or import any templates during the upgrade process, so we will have to import the template manually. If that is so, simply download the template from the official Zabbix git page (or use the link in the introduction) and import it into your Zabbix instance by using the Import button in the Configuration – Templates section.

Installing and configuring Zabbix agent 2

Before we get started with configuring our host, we first have to install Zabbix agent 2 and configure it according to the template guidelines. Follow the steps in the download section of the Zabbix website and install the zabbix-agent2 package. Feel free to use any other agent deployment methods if you want to (like compiling the agent from the source files)

Installing Zabbix agent2 from packages takes just a few simple steps:

Install the Zabbix repository package:

rpm -Uvh https://repo.zabbix.com/zabbix/6.0/rhel/8/x86_64/zabbix-release-6.0-1.el8.noarch.rpm

Install the Zabbix agent 2 package:

dnf install zabbix-agent2

Configure the Server parameter by populating it with your Zabbix server/proxy address

vi /etc/zabbix/zabbix_agent2.conf
### Option: Server
# List of comma delimited IP addresses, optionally in CIDR notation, or DNS names of Zabbix servers and Zabbix proxies.
# Incoming connections will be accepted only from the hosts listed here.
# If IPv6 support is enabled then '127.0.0.1', '::127.0.0.1', '::ffff:127.0.0.1' are treated equally
# and '::/0' will allow any IPv4 or IPv6 address.
# '0.0.0.0/0' can be used to allow any IPv4 address.
# Example: Server=127.0.0.1,192.168.1.0/24,::1,2001:db8::/32,zabbix.example.com
#
# Mandatory: yes, if StartAgents is not explicitly set to 0
# Default:
# Server=

Server=192.168.50.49

Plugin specific Zabbix agent 2 configuration

Zabbix agent 2 provides plugin-specific configuration parameters. Mostly these are optional parameters related to a specific plugin. You can find the full list of plugin-specific configuration parameters in the Zabbix documentation. In the newer versions of Zabbix agent 2, the plugin-specific parameters are defined in separate plugin configuration files, located in /etc/zabbix/zabbix_agent2.d/plugins.d/, while in older versions, they are defined directly in the zabbix_agent2.conf file.

For the Zabbix agent 2 Docker plugin, we have to provide the Docker daemon unix-socket location. This can be done by specifying the following plugin parameter:

### Option: Plugins.Docker.Endpoint
# Docker API endpoint.
#
# Mandatory: no
# Default: unix:///var/run/docker.sock
# Plugins.Docker.Endpoint=unix:///var/run/docker.sock

The default socket location will be correct for your Docker environment – in that case, you can leave the configuration file as-is.

Once we have made the necessary changes in the Zabbix agent 2 configuration files, start and enable the agent:

systemctl enable zabbix-agent2 --now

Check if the Zabbix agent2 is running:

tail -f /var/log/zabbix/zabbix_agent2.log

Before we move on to Zabbix frontend, I would like to point your attention to the Docker socket file permission – the zabbix user needs to have access to the Docker socket file. The zabbix user should be added to the docker group to resolve the following error messages.

[Docker] cannot fetch data: Get http://1.28/info: dial unix /var/run/docker.sock: connect: permission denied
ZBX_NOTSUPPORTED: Cannot fetch data.

You can add the zabbix user to the Docker group by executing the following command:

usermod -aG docker zabbix

Configuring the docker host

Configuring the host representing our Docker environment

After importing the template, we have to create a host which will represent our Docker instance. Give the host a name and assign it to a Host group – I will assign it to the Linux servers host group. Assign the Docker by Zabbix agent 2 template to the host. Since the template uses Zabbix agent 2 to collect the metrics, we also have to add an agent interface on this host. The address of the interface should point to the machine running your Docker containers. Finish up the host configuration by clicking the Add button.

Docker by Zabbix agent 2 template

Regular docker template items

The template contains a set of regular items for the general Docker instance metrics, such as the number of available images, Docker architecture information, the total number of containers, and more.

Docker tempalte Low-level discovery rules

On top of that, the template also gathers container and image-specific information by using low-level discovery rules.

Once Zabbix discovers your containers and images, these low-level discovery rules will then be used to create items, triggers, and graphs from prototypes for each of your containers and images. This way, we can monitor container or image-specific metrics, such as container memory, network information, container status, and more.

Docker templates Low-level discovery item prototypes

Verifying the host and template configuration

To verify that the agent and the host are configured correctly, we can use Zabbix get command-line tool and try to poll our agent. If you haven’t installed Zabbix get, do so on your Zabbix server or Zabbix proxy host:

dnf install zabbix-get

Now we can use zabbix-get to verify that our agent can obtain the Docker-related metrics. Execute the following command:

zabbix_get -s docker-host -k docker.info

Use the -s parameter to specify your agent host’s host name or IP address. The -k parameter specifies the item key for which we wish to obtain the metrics by polling the agent with Zabbix get.

zabbix_get -s 192.168.50.141 -k docker.info

{"Id":"SJYT:SATE:7XZE:7GEC:XFUD:KZO5:NYFI:L7M5:4RGO:P2KX:QJFD:TAVY","Containers":2,"ContainersRunning":2,"ContainersPaused":0,"ContainersStopped":0,"Images":2,"Driver":"overlay2","MemoryLimit":true,"SwapLimit":true,"KernelMemory":true,"KernelMemoryTCP":true,"CpuCfsPeriod":true,"CpuCfsQuota":true,"CPUShares":true,"CPUSet":true,"PidsLimit":true,"IPv4Forwarding":true,"BridgeNfIptables":true,"BridgeNfIP6tables":true,"Debug":false,"NFd":39,"OomKillDisable":true,"NGoroutines":43,"LoggingDriver":"json-file","CgroupDriver":"cgroupfs","NEventsListener":0,"KernelVersion":"5.4.17-2136.300.7.el8uek.x86_64","OperatingSystem":"Oracle Linux Server 8.5","OSVersion":"8.5","OSType":"linux","Architecture":"x86_64","IndexServerAddress":"https://index.docker.io/v1/","NCPU":1,"MemTotal":1776848896,"DockerRootDir":"/var/lib/docker","Name":"localhost.localdomain","ExperimentalBuild":false,"ServerVersion":"20.10.14","ClusterStore":"","ClusterAdvertise":"","DefaultRuntime":"runc","LiveRestoreEnabled":false,"InitBinary":"docker-init","SecurityOptions":["name=seccomp,profile=default"],"Warnings":null}

In addition, we can also use the low-level discovery key – docker.containers.discovery[false] to check the result of the low-level discovery.

zabbix_get -s 192.168.50.141 -k docker.containers.discovery[false]

[{"{#ID}":"a1ad32f5ee680937806bba62a1aa37909a8a6663d8d3268db01edb1ac66a49e2","{#NAME}":"/apache-server"},{"{#ID}":"120d59f3c8b416aaeeba50378dee7ae1eb89cb7ffc6cc75afdfedb9bc8cae12e","{#NAME}":"/mysql-server"}]

We can see that Zabbix will discover and start monitoring two containers – apache-server and mysql-server. Any agent low-level discovery rule or item can be checked with Zabbix get.

Docker template in action

Discovered items on our Docker host

Now that we have configured our agent and host, applied the Docker template, and verified that everything is working, we should be able to see the discovered entities in the frontend.

Collected Docker container metrics

In addition, our metrics should have also started coming in. We can check the Latest data section and verify that they are indeed getting collected.

Macros inherited from the Docker template

Lastly, we have a few additional options for further modifying the template and the results of our low-level discovery. If you open the Macros section of your host and select Inherited and host macros, you will notice that there are 4 macros inherited from the Docker template. These macros are responsible for filtering in/out the discovered containers and images. Feel free to modify these values if you wish to filter in/out the discovery of these entities as per your requirements.

Notice that the container discovery item also has one parameter, which is defined as false on the template:

  • docker.containers.discovery[false] – Discover only running containers
  • docker.containers.discovery[true] – Discover all containers, no matter their state.

And that’s it! We successfully imported the template, installed and configured Zabbix agent 2, created a host, and applied the Docker template. Finally – our Zabbix instance is now monitoring our Docker environment! If you have any other questions or comments, feel free to leave a response in the comments section of this post.

 

The post Docker Container Monitoring With Zabbix appeared first on Zabbix Blog.

Webhooks in Zabbix

Post Syndicated from Andrey Biba original https://blog.zabbix.com/webhooks-in-zabbix/19935/

Zabbix is not only a flexible and versatile monitoring system but also a convenient tool for generating alerts and integrating with existing service desks. Among the various integration methods, webhooks have become the most popular. In this blog post, we will take a look at what are webhooks, how they can be used to integrate Zabbix with an external solution, and also take a look at some use case examples for webhook integrations.

What is a webhook?

Generally speaking, a webhook is a method of augmenting or altering the behavior of a web page or web application with custom callbacks. But to put it simply, a webhook is an automatic reaction to an event. If an event occurs (for example, a problem appears), then the webhook makes a call (via HTTP / HTTPS) to a third-party service to notify it about the event. Many existing solutions provide an API that allows you to interact with them via webhooks.

The webhook in Zabbix is implemented using JavaScript, so writing code does not require knowledge of a specific Zabbix syntax, and due to the prevalence of the JavaScript language, you can find many examples, tips, and guides on the Internet.

How does a webhook work?

Essentially, a webhook is code that makes a sequence of calls to achieve some result. In the case of Zabbix, a JavaScript code is executed that accesses the service API and transfers, updates, and retrieves data from there. For example, we need to open a ticket at the service desk and leave a comment on the ticket, which will contain information about the problem. For this we need:

  • Log in to the service and get a token
  • Make a request with the token to create a ticket
  • Create a comment on the newly created ticket using a token

In different services, the details may differ, but the general idea will be preserved from service to service.

How to use it?

Our integration team constantly communicates with the community and monitors the most popular services to develop official out-of-the-box integrations for them. At the moment, Zabbix provides a vast selection of out-of-the-box webhooks for the most popular services, and we review new ones and improve current ones every day.

In most cases, setting up a ready-made webhook comes down to 3-4 steps, which are described in the README file in the Zabbix repository. Usually, it is necessary to generate an API key in the service, set it in Zabbix, set the URL to the service endpoint URL, and specify a couple of parameters required for the webhook to work.

In addition to ready-made solutions, there is a Github community repository where custom templates and webhooks are laid out! If you are the author of a webhook or a template, please share it with the community by submitting it to this repository!

Example – Telegram webhook

The theory is good, but we are all interested in how it is implemented in practice. Let’s look at a Telegram webhook as an example. Now this messenger is very popular and it will be relevant to use it as an example.

First of all, let’s go to the Zabbix repository or navigate to the Zabbix website integrations section to read the setup instructions. In the repository, all templates and notification methods are located in the /templates folder, and for each of them, there is a README file with a detailed description.

From the Telegram side, we need to create a bot and get its token following the instructions and set it in the Token parameter.

After that, we create a user, set up a Telegram media type for this user, and in the “Send to” field we write the id of the user or group chat.

Voila! Your webhook is set up and ready to send notifications or event information!

As you may have noticed, the setup did not take much time and did not require deep knowledge. Naturally, for finer tuning, it is possible to edit the content of messages, the type of problems, intervals, and other parameters. But even without additional changes, notifications are already ready to go.

Is it difficult to write a webhook on your own?

Of course, creating a webhook requires certain skills.

First of all, knowledge of JavaScript is required. The language itself is not difficult and can be mastered relatively quickly. The Zabbix documentation site has a guideline for writing webhooks with recommendations and best practices.

Secondly, understanding how Zabbix works. This does not require an in-depth understanding of Zabbix and the ability to follow basic instructions will be enough. You can read more about setting up notification methods in the official documentation. It is important to properly configure the webhook itself, grant rights to users, and set up a notification action for the necessary triggers.

And thirdly, study the documentation of the service for which the webhook will be written. Although all APIs work on the same principle, they can differ greatly from each other in methods and request structure. It is also necessary to understand the service itself to understand how it works. It is difficult to write an integration if it is not clear how Zabbix should properly interact with the service being integrated.

Summarizing

Webhook is a modern and flexible way of integration that allows Zabbix to be a universal solution. Since the realities of our world imply a large number of different systems, and as a result – many people working together – webhooks are becoming an indispensable tool in notification automation. A properly written and configured webhook is an effective solution for flexible notifications.

In the next article, together we will learn the basic methods and requests that are needed to send alerts, receive updates and assign tags. For this purpose, we will completely inspect some webhook in close detail.

Questions

Q: We have a ready-made notification system built on scripts. Does it make sense to rewrite it to a webhook?

A: Certainly. Firstly, the webhook is executed natively in Zabbix, which will be much more productive than in an external script. Secondly, the webhook is much more flexible, more functional, and much easier to make changes to.

 

Q: We have a service for which we would like to write an integration, but we do not have qualified specialists who could do it. Is it possible to request such integration from Zabbix?

A: Yes, if you are a Zabbix partner, you can leave a request to create such integration.

 

The post Webhooks in Zabbix appeared first on Zabbix Blog.