Tag Archives: Technical

Decrypting Zabbix TLS with Wireshark

Post Syndicated from Markku Leiniö original https://blog.zabbix.com/decrypting-zabbix-tls-with-wireshark/26832/

One of the built-in security features in Zabbix is TLS (Transport Layer Security) support for external connections. This means that when your distributed Zabbix proxies or Zabbix agents connect to the Zabbix server (or vice versa), TLS can be used to encrypt all the connections. When the connections are encrypted, third parties cannot read the Zabbix components’ communication, even if they are able to capture the network traffic in some way.

In specific cases you may still want to inspect the encrypted traffic, for example to troubleshoot some problems with Zabbix agents or proxies. I already wrote a post about troubleshooting the Zabbix agent with Wireshark, but the TLS encryption prevents anyone from seeing the actual contents of the packets.

Since the traffic is encrypted by the Zabbix components themselves (the server, agents and proxies), there is still a way for you, the Zabbix administrator, to intervene in the encryption so that you can get hold of the unencrypted traffic as well. In this post I will explain the process.

First, let’s demonstrate the TLS encryption between a Zabbix agent and a Zabbix server. I have configured the agent (Zabbix Agent 2 actually) with these lines:

Hostname=Zabbix70-agent
ServerActive=zabbixtest.lein.io
TLSConnect=psk
TLSPSKIdentity=agent-ident
TLSPSKFile=/etc/zabbix/psk

In this example I’m using TLS with pre-shared key (PSK), and the key itself is saved in /etc/zabbix/psk. My favorite way of generating a PSK is using OpenSSL:

markku@agent:~$ openssl rand -hex 32
afa34bf1104a1457e11e7d3a9b1ff7f5fb4f494c92ca1a8a9c5e1437f8897416
markku@agent:~$

The same key must also be configured in the Zabbix server frontend; see the Zabbix documentation for the PSK configuration details.

After the configurations I captured the Zabbix traffic for some time on the Zabbix server (using sudo tcpdump port 10051 -v -w zabbix70-tls-agent.pcap), stopped the capture, copied the capture file to my workstation, and opened it with Wireshark.

The capture file can be downloaded here:

Note: I recommend using Wireshark version 4.1.0 or later when analyzing captures containing Zabbix traffic because the built-in Zabbix protocol support was only added to Wireshark in version 4.1.0.

The packet list looks like this:

As we can see in the Protocol column, there are no Zabbix packets recognized in this capture, there are only TCP and TLS packets (the TCP-marked packets being the “empty” packets for negotiating the actual connectivity).

A side detail: Even though the traffic is encrypted, you can still see the configured Zabbix TLS PSK identity (“agent-ident” in my configuration above) in plain text inside the TLS Client Hello packets, if you ever need to check that in the traffic.

Now that we confirmed that TLS encryption is used and we cannot see the Zabbix traffic contents in the capture, let’s prepare the Zabbix server for the TLS decryption.

As I hinted in the beginning, since we have the TLS connection endpoints under our management, we can do tricks on the hosts to get the encryption keys. TLS negotiates the encryption keys dynamically for each connection, but there is a way to save the keys to a file so that we can later decrypt the captured traffic. (Note: I’m not a protocol-level TLS expert, so please forgive me any possible technical inaccuracies in the detailed explanations. I’ll just call “TLS keys” whatever is needed to get the encryption/decryption done.)

Peter Wu (who, in contrast to me, is a protocol-level TLS expert, and also one of the Wireshark core developers) has kindly published code for a helper library that makes it possible for us to save the TLS session keys on the TLS endpoint. In this demo I will save the keys on the Zabbix server, but the same could be done on the agents/proxies instead if needed.

First I’ll check which TLS library my Debian-based Zabbix server is using:

markku@zabbixserver:~$ ldd /usr/sbin/zabbix_server | grep ssl
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f62ee47a000)
markku@zabbixserver:~$ dpkg -l libssl* | grep ^ii
ii  libssl3:amd64  3.0.9-1   amd64  Secure Sockets Layer toolkit - shared libraries
markku@zabbixserver:~$

To get and compile the helper library I’ll need to install some utilities:

markku@zabbixserver:~$ sudo apt install git gcc make libssl-dev
...
markku@zabbixserver:~$ dpkg -l libssl* | grep ^ii
ii  libssl-dev:amd64 3.0.9-1 amd64 Secure Sockets Layer toolkit - development files
ii  libssl3:amd64    3.0.9-1 amd64 Secure Sockets Layer toolkit - shared libraries
markku@zabbixserver:~$

I’ll then clone Peter’s wireshark-notes repo to the server:

markku@zabbixserver:~$ git clone --depth=1 https://git.lekensteyn.nl/peter/wireshark-notes
Cloning into 'wireshark-notes'...
...
markku@zabbixserver:~$ cd wireshark-notes/src
markku@zabbixserver:~/wireshark-notes/src$ ls -l
total 28
-rw-r--r-- 1 markku markku   534 Oct  7 15:39 Makefile
-rw-r--r-- 1 markku markku 11392 Oct  7 15:39 sslkeylog.c
-rw-r--r-- 1 markku markku  7278 Oct  7 15:39 sslkeylog.py
-rwxr-xr-x 1 markku markku  2325 Oct  7 15:39 sslkeylog.sh
markku@zabbixserver:~/wireshark-notes/src$

Now I can compile the library and make it available on the server:

markku@zabbixserver:~/wireshark-notes/src$ make
cc   sslkeylog.c -shared -o libsslkeylog.so -fPIC -ldl
markku@zabbixserver:~/wireshark-notes/src$ sudo install libsslkeylog.so /usr/local/lib
markku@zabbixserver:~/wireshark-notes/src$ ls -l /usr/local/lib/libsslkeylog.so
-rwxr-xr-x 1 root root 17336 Oct  7 15:40 /usr/local/lib/libsslkeylog.so
markku@zabbixserver:~/wireshark-notes/src$ cd
markku@zabbixserver:~$

To use the helper library, a couple of environment variables need to be set. For the Zabbix server, the easy way is to edit the systemd configuration for the zabbix-server service:

markku@zabbixserver:~$ sudo systemctl edit zabbix-server

In the editor that opens I’ll add these in the configuration:

[Service]
Environment=LD_PRELOAD=/usr/local/lib/libsslkeylog.so
Environment=SSLKEYLOGFILE=/tmp/tls.keys

The variables are kind of self-explanatory: whenever the Zabbix server service is started, the libsslkeylog.so library is loaded first, and the SSLKEYLOGFILE variable sets the location of the file where the keys will be saved.
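If you want to verify the drop-in before restarting, the resulting unit configuration and environment can be checked with systemd itself, for example:

markku@zabbixserver:~$ systemctl cat zabbix-server
markku@zabbixserver:~$ systemctl show zabbix-server --property=Environment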

Now a word of warning: The libsslkeylog.so library, when loaded by a process that uses TLS communication, will save the encryption/decryption keys of all the TLS sessions of the process to the configured file. This means that whoever gets hold of that file and the captured TLS communication will be able to see the decrypted contents of the packets, defeating the whole idea of the TLS encryption. You really don’t want to do this TLS key saving for any longer period of time. Be sure to remove the configuration (and restart the service) after you have inspected whatever you were inspecting in your system. Or, don’t do any of this at all.

After saving the configuration the Zabbix server needs to be restarted:

markku@zabbixserver:~$ sudo systemctl restart zabbix-server
markku@zabbixserver:~$

The TLS keys have now started being saved in the configured file:

markku@zabbixserver:~$ ls -l /tmp/tls.keys
-rw-rw-r-- 1 zabbix zabbix 10157 Oct  7 15:45 tls.keys
markku@zabbixserver:~$

At this point the Zabbix agent is still communicating actively with the Zabbix server, so I’ll take a new capture with tcpdump (sudo tcpdump port 10051 -v -w zabbix70-tls-agent-2.pcap).

After a short while I’ll stop the capture, and copy the capture file and the TLS key file on my workstation.

Now it’s a good time to disable the TLS key saving as well (besides containing sensitive data, the key file will also grow with each new TLS session so it can quickly get very large), so I’ll edit the Zabbix service configuration, remove the configured lines and restart the service:

markku@zabbixserver:~$ sudo systemctl edit zabbix-server
markku@zabbixserver:~$ sudo systemctl restart zabbix-server
markku@zabbixserver:~$

When opening the new capture file in Wireshark there is no immediate change in the packet list: the TLS packets are still shown encrypted. Wireshark needs to be specifically configured to read the TLS keys from the separate file.

In Wireshark, I’ll go to Edit – Preferences – Protocols – TLS:

There is the “(Pre)-Master-Secret log filename” field; I’ll use the Browse button to select the copied tls.keys file and save the configuration with OK.

At this point Wireshark reloads the capture file and the Zabbix agent TLS sessions will be decrypted:

Using the “zabbix” display filter will show just the Zabbix protocol packets:
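The same decryption also works on the command line with tshark (installed with Wireshark). A minimal sketch, assuming the capture and the key file are in the current directory:

tshark -r zabbix70-tls-agent-2.pcap -o tls.keylog_file:tls.keys -Y zabbix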

When selecting a Zabbix protocol packet and looking at the packet details, in the lower right pane there are now three tabs: Frame (the encrypted TLS data), Decrypted TLS, and Uncompressed data.

This is because in this example the Zabbix agent 2 also compresses the traffic, and the compressed traffic is then encrypted when sending out to the network. Wireshark can interpret all this because of its built-in knowledge about TLS encryption and the Zabbix protocol structure, as well as the user-supplied TLS decryption keys.

We are now able to analyze the Zabbix agent communication with Wireshark even though the traffic was TLS-encrypted when we captured it.

One more trick about the TLS keys in Wireshark: It is also possible to save the keys inside the capture file when analyzing the traffic, instead of having the keys in a separate file (tls.keys in this example). I’ll go to the Edit menu and select Inject TLS Secrets, and then save the capture file in pcapng format. Now the previously loaded keys are embedded in the capture file, and I can clear the “(Pre)-Master-Secret log filename” field in the TLS settings (as the filename setting is not useful in any later Wireshark analysis). The same can also be done on the command line by using editcap --inject-secrets (editcap is part of the Wireshark install, see the manual page of editcap for more details).
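For example, a command along these lines produces a pcapng file with the keys embedded (the output file name is just an example):

editcap --inject-secrets tls,tls.keys zabbix70-tls-agent-2.pcap zabbix70-tls-agent-2-with-keys.pcapng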

Here is the second capture file of this demo, with the embedded TLS keys:

Finally some closing comments:

  • As demonstrated, when you have administrator/root-level access to the TLS session endpoint (Zabbix server in this example), there can be a possibility to save and decrypt the TLS sessions using external tooling. After all, TLS encryption is based on the negotiation between the TLS-connected endpoints, so if you are the TLS connection endpoint, you have ways to access the plaintext data. If you don’t have sufficient access to the TLS session endpoint, there is no way you can get the decryption keys mid-path.
  • Act responsibly when saving the TLS session keys for any traffic, on Zabbix server or otherwise. The encryption is there for a purpose, and saving the TLS keys always carries the risk that someone else gets access to data they wouldn’t have access to otherwise.
  • Do not save the TLS session keys with the capture file, unless you are dealing with a test/demo environment, like I had here.
  • When troubleshooting Zabbix connections, TLS decryption with Wireshark is not the only way. You should also consider if just increasing the logging level in the Zabbix components brings you enough information to solve your case, or maybe in some specific case you can just disable TLS encryption for an agent for a moment to not have to deal with the decryption at all. But again, usually the encryption is there for a purpose, so you need to evaluate your own situation.

The post Decrypting Zabbix TLS with Wireshark appeared first on Zabbix Blog.

Zabbix in: exploratory data analysis rehearsal – Part 3

Post Syndicated from Paulo R. Deolindo Jr. original https://blog.zabbix.com/zabbix-in-exploratory-data-analysis-rehearsal-part-2-2/26266/

Abstract

This will be the last blog post of the “Zabbix in… exploratory data analysis rehearsal” series. Following our initial proposal, we’ll close with the concepts of the third and fourth moments of a data distribution. This time, we’ll talk about skewness and kurtosis.

Refer to the first and second articles of the series to be aware of what we are discussing here.

The four moments for a data distribution

While the first moment helps us with the location estimate of a data distribution, the second moment deals with its variance. The third moment, called skewness (asymmetry), allows us to understand the tendency of the values and the degree of the asymmetry. The fourth moment is called kurtosis and is about the probability of peaks (outliers) existing.

These four moments are not the final study about the data distribution. There is so much to learn and apply to data science when considering statistical concepts, but for now we must finish the initial proposal and bring forward some insights for decision makers.

Let’s get started!

Asymmetry

Based on our web application scenario, we can see a certain asymmetry in response time in most cases. This is normal and expected – so far, no problems. But it is also true that some symmetry is also possible in certain cases. Again, no problems here.

So, where is the problem? When does it happen?

Sometimes, the web application response time can be too different from the previous one, and we have no control over it. In these cases, the outliers must be found, and the correct interpretation must be applied. At that point, we must consider anomalies in the environment. Sometimes, the outliers are just a deviation. In all cases, we must pay attention and monitor the metrics that can make the difference.

Speaking of asymmetry – why is this topic so special? One of the possible answers is that we need to understand the degree of the asymmetry – whether it is strong or moderate – and whether the values were, in most cases, smaller or bigger than the mean or median. In other words, what does the asymmetry say about the web application performance?

Let’s check some implementations.

The key skewness

From version 6.0, Zabbix introduced the skewness function, which can be used in calculated items.

For example, it can be used like this:

skewness(/host/key,1h) # the skewness for the last hour until now

Now, let’s see how this formula could be applied to our scenario:

skewness(//net.tcp.service.perf[http,”{HOST.CONN}”,”{$NGINX.STUB_STATUS.PORT}”],1h:now/h)

Using skewness and time shift “1h:now/h”, we are looking for some web application response time asymmetry at the previous hour.

The asymmetry can be negative (left skew), zero, positive (right skew) or undefined.

Definition: a left-skewed distribution is longer on the left side of its peak than on its right. In other words, a left-skewed distribution has a long tail on its left side.

Considering a left skew, it is possible to state that in the previous hour the web application had more high values than low values. This means that our web application did not perform as well as it should.

Look at the graph above. You can see some bars on the left side of the mean and other bars on the right side of the mean, with the same size, mirroring each other. We can consider this a normal distribution for the web application response time, but it does not mean that the response times were good or bad – it only means that they had some balance, and it suggests more investigation.

Definition: a right-skewed distribution is longer on the right side of its peak than on its left. In other words, a right-skewed distribution has a long tail on its right side.

Considering a right skew, it is possible to state that in the previous hour the web application had more low values than high values. This means our web application performs as expected.

Value Map for Skewness

You must create in your template the following value map:

If you wish, the value map can also be as below:

“is greater than or equals” 0.1 → More good response times, compared to the mean

“equals” 0 → Symmetric response times, or evenly distributed between good and bad

“is less than or equals” 0 → More bad response times, compared to the mean

Pearson Skewness Coefficient

The Pearson’s Coefficient is a very interesting indicator. Considering some skewness for a data distribution, it tells us if the asymmetry is strong or only moderate.

We can create a calculated item for the Pearson’s Coefficient:

(3*(avg-median))/stddevpop

In Zabbix, we need:

  • One item for the response time average, considering the previous hour:
    • Key: resp.time.previous.hour
    • Formula: trendavg(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)
  • One item for calculating the median (for the percentile items, see the previous blog post):
    • Key: response.time.previous.hour
    • Formula: (last(//p51.previous.hour)+last(//p50.previous.hour))/2
  • One item for the standard deviation calculation:
    • Key: response.time.previous.hour
    • Formula: stddevpop(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)

Finally, a calculated item for a Pearson’s Coefficient:

  • Key: coefficient.requests.previous.hour
    • Formula:
      ((last(//trendavg.requests.per.minute)-
      last(//median.access.previous.hour))*3)
      /
      last(//stddevpop.requests.previous.hour)

To finish the exercise, create a value map:

Kurtosis

Kurtosis is the fourth moment of a data distribution and can indicate whether the values are prone to peaks (outliers).

In Zabbix, you can calculate kurtosis with another calculated item:

kurtosis(/host/key,1h)

A negative kurtosis value tells us that the distribution is not prone to outliers, or produced only a few of them. A positive kurtosis value tells us that the distribution is prone to outliers, or produced many of them. Everything revolves around the mean of the data distribution. A neutral value, or zero, tells us that the distribution is considered symmetric.

For our web scenario, we have:

kurtosis(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)

Explanatory Dashboard

Considering the image above, the data distribution for the web application response time in the previous hour has a left skew. This suggests that, when considering the response time mean of the previous hour, there were more high values than low values. That’s sad – our web application performed poorly. Why?

However, because the skewness was moderate (considering Pearson’s coefficient), the response time values were not so different from one another within that hour.

As for kurtosis, we can say that the data distribution is prone to peaks because we have positive kurtosis. In basic statistical terms, the values are near the mean, but there is a high probability of producing outliers.

This is a point to pay attention to if you are looking at a critical service.

Looking at other metrics

To check and validate our interpretation, let’s visualize the collected values on a simple Zabbix graph, using a graph widget.

Please consider using the following “time period” configuration:

The graph will show only the data collected at the previous hour – in this case, from 11:00 to 11:59.

The graph in Zabbix allows us to visualize the 50th percentile as well. We do not display the mean on the graph because it wouldn’t be well represented visually as it was collected only once, thus lacking an interesting visual trend line like the 50th percentile. However, notice that the mean and the 50th percentile values are very close, which will give us an idea of the data distribution around this measure.

Partial Conclusion

Skewness and Kurtosis, respectively, are the third and fourth moments of a data distribution. They help us understand the environment’s behavior and allow us to gain insight into a lot of things (in this case, we simply applied these concepts to IT infrastructure monitoring and focused on our web application to analyze its performance).

In most cases, the asymmetry will exist – it’s normal and expected. However, knowing some properties of the skewness can help us understand the response times, indicating good or poor performance. The skewness coefficient allows us to know whether the asymmetry was strong or just moderate. Meanwhile, kurtosis helps us understand whether the data distribution produced some peaks during an observation period, or whether it is prone to produce peaks. We can then create some triggers for that and avoid some undesirable behaviors in the future, based on our data distribution observation. This is applied data science at its best.
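As a sketch of that idea (the item key names here are hypothetical – adjust them to the keys you used for your skewness and kurtosis items):

  • Fire if the previous hour was left-skewed, i.e. there were more bad response times than good ones:
    • Expression = last(/Nginx by HTTP modified/skewness.response.time.previous.hour)<0
    • Level = Information
  • Fire if the previous hour’s distribution shows positive kurtosis, i.e. it is prone to outliers:
    • Expression = last(/Nginx by HTTP modified/kurtosis.response.time.previous.hour)>0
    • Level = Information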

Conclusion

Data science can easily be applied with Zabbix and its aggregate functions (and bring up insights in the process). It’s true that there are some special functions such as skewness, kurtosis, stddevpop, stddevsamp, mad, and so on, but there are older functions to help us too, such as percentile, forecast, timeleft, etc. All these functions must be used in calculated item formulas.

One of the interesting advantages of using Zabbix in data science and performing an exploratory data analysis is the fact that Zabbix can monitor everything. This means that the database already exists with the relevant data to analyze, in real time.

In the blog posts in this series, we took as a basis some data referring to the previous hour, the previous day, and so on, but we did nothing regarding the current hour. If we applied the concepts studied here to real-time data, we would get other “live” results, including support for decision-making. This is because we would not only study historical data, but would also have the opportunity to change the course of events.

Zabbix is improving dashboards significantly. From Zabbix 6.4, we have many new out-of-the-box widgets and the possibility to create our own. However, there is a concern – Zabbix administrators sometimes show unnecessary data in dashboards, which can cloud the decision-making process. Zabbix administrators, in general, might want to learn storytelling techniques to rectify this situation. Maybe in a perfect world!

I hope you have enjoyed this blog post series.

Keep studying!

 

The post Zabbix in: exploratory data analysis rehearsal – Part 3 appeared first on Zabbix Blog.

Zabbix in: exploratory data analysis rehearsal – Part 2

Post Syndicated from Paulo R. Deolindo Jr. original https://blog.zabbix.com/zabbix-in-exploratory-data-analysis-rehearsal-part-2/26151/

Abstract

In the previous blog post, we explored some of the basic statistics concepts to estimate KPIs for a web application response time: in that case, average, median and percentile. Additionally, we improved the nginx out-of-the-box template and showed some results in simple dashboards. Now we must continue our work, but this time analyzing the variance of the collected metrics over a certain period.

Please read the previous blog post to better understand the context. I wish you a good read.

A little about basic statistics

In basic statistics, a data distribution has at least four moments:

  • Location estimate
  • Variance
  • Skewness
  • Kurtosis

In the previous blog post, we introduced the 1st moment and got to know some estimates of our data distribution. It means that we analyzed some values of our web application response time, revealing that the response time has minimum and maximum values, an average, a value that can represent the central value of the distribution, and so on. Some metrics, such as the average, can be influenced by outliers, while others, such as the 50th percentile (the median), are not. So we already have some notion about those values, but it isn’t enough. Let’s check the 2nd moment of the data distribution: variance.

Variance

So, we have some notion about the variance of the web application response time, meaning that it can show some asymmetry (in most cases), and we also know that some KPIs must be considered – but which of them?

In exploratory data analysis, we can discover some key metrics but, in most cases, we won’t use all of them, so we have to know each one’s relevance to choose properly which metric can represent the reality of our scenario.

Yes! There are cases when some metrics must be combined with other metrics so that they make sense; otherwise, we can discard them. We must create and understand the context for all those metrics.

Let’s check some concepts of the variance:

  • Variance
  • Standard deviation
  • Median absolute deviation (MAD)
  • Amplitude
  • IQR – Interquartile range

Amplitude

This concept is simple, and so is its formula: it is the difference between the maximum and the minimum value in a data distribution. In this case, we are talking about the data distribution of the previous hour (1h:now/h). We are interested in knowing the range of variation in response times in that period.

Let’s create a Calculated item for the amplitude metric in the “Nginx by HTTP modified” template.

  • trendmax(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)
  • trendmin(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)

In other terms, it could be:

  • max(/host/key)-min(/host/key)

However, we are analysing a data distribution based on the previous hour, so…

  • trendmax(/host/key,1h:now/h)-trendmin(/host/key,1h:now/h)
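Putting the pieces together, the resulting calculated item could look like this (the key name matches the one used in the triggers below; the formula simply combines the two trend functions shown above):

  • Key: amplitude.previous.hour
  • Type: Calculated
  • Formula: trendmax(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)-trendmin(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)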

Modifying our dashboard, we’ll see something like this:

 

This result could be interpreted as follows: the variation between the worst and the best response times is very small. It means that during that hour the response times had no significant differences.

However, the amplitude by itself is not enough to diagnose the web application at that moment. It’s necessary to combine this result with other results, and we’ll see how to do that.

To complement, we can create some triggers based on it:

  • Fire if the response time amplitude was bigger than 5 seconds at the previous hour. It means that the web application did not perform as expected considering the web application requests.
    • Expression = last(/Nginx by HTTP modified/amplitude.previous.hour)>5
    • Level = Information
  • Fire if the response time amplitude reaches 5 seconds at least 3 consecutive times. It means that over the last 3 hours there was too much variation among the web application response times, which is not expected.
    • Expression = max(/Nginx by HTTP modified/amplitude.previous.hour,#3)>5
    • Level = Warning

Remember, we are evaluating the previous hour, so it makes no sense to generate this metric every single minute. Let’s create a custom interval for it.
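One way to do that (a sketch, assuming the amplitude item created above) is to set the item’s Update interval to 0 and add a single Scheduling custom interval that runs at the first minute of every hour:

  • Update interval: 0
  • Custom intervals: Scheduling, m1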

 

By doing this, we avoid flapping in our trigger environment.

IQR – Interquartile range

Consider these values below:

3, 5, 2, 1, 3, 3, 2, 6, 7, 8, 6, 7, 6

Open a shell environment. Create the file "values.txt" and insert each value, one per line. Now, read the file:

# cat values.txt

3
5
2
1
3
3
2
6
7
8
6
7
6

Now, send the value to Zabbix using Zabbix sender:

# for x in `cat values.txt`; do zabbix_sender -z 127.0.0.1 -s "Web server A" -k input.values -o $x; done

Look at the historical data using Zabbix frontend.

 

Now, let’s create Calculated items for the 75th and 25th percentiles.

  • Key: iqr.test.75
    Formula: percentile(//input.values,#13,75)
    Type: Numeric (float)
  • Key: iqr.test.25
    Formula: percentile(//input.values,#13,25)
    Type: Numeric (float)

If we run the command "sort values.txt" in a Linux terminal, we’ll get the same values ordered by size. Let’s check:

# sort values.txt

 

We’ll use the same concept here.

From the left to the right, go to the 25th percentile. You will get the number 3.

Do it again, but this time go to the 50th percentile. You will get the number 5.

And again, go to the 75th percentile. You will get the number 6.

The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). So, we are excluding the outliers (the smallest values on the left and the biggest values on the right).

To calculate the IQR, you can create the following Calculated item:

  • key: iqr.test
    Formula: last(//iqr.test.75)-last(//iqr.test.25)

Now, we’ll apply this concept in Web Application Response Time.

The Calculated Item for the 75th percentile:

key: percentile.75.response.time.previous.hour
Formula: percentile(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h,75)

The Calculated item for the 25th percentile:

key: percentile.25.response.time.previous.hour
Formula: percentile(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h,25)

The Calculated item for the IQR:

key: iqr.response.time.previous.hour
Formula: last(//percentile.75.response.time.previous.hour)-last(//percentile.25.response.time.previous.hour)

Keep the monitoring schedule at the 1st minute of each hour to avoid repetition (it’s very important), and adjust the dashboard.

Considering the worst and the best web response times at the previous hour, the amplitude returns a big value in comparison to the IQR, and this happens because the outliers were discarded in the IQR calculation. Just as the mean is a location estimate that is influenced by outliers while the median is not, the range is influenced by outliers while the IQR is not. The IQR is a robust indicator and lets us know the spread of the web response times around the central position.

P.S.: we are considering only the previous hour; however, you can apply the IQR concept to a longer period, such as the previous day, week or month, using the correct time shift notation. You can use it to compare the web application response time variance between the periods you wish to observe and get some insights about the web application behavior at different times and in different situations.

Variance

Variance is a way to calculate the dispersion of data around its average. In Zabbix, calculating the variance is simple, since there is a specific function for that, used in a Calculated item.

The formula is the following:

  • Key: varpop.response.time.previous.hour
    Formula: varpop(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)

In this case, the formula returns the dispersion of the data; however, there is one caveat: at some point the data is squared, so the data scale changes.

Let’s check the steps for calculating the variance of the data:

1st) Calculate the mean;
2nd) Subtract the mean of each value;
3rd) Square each subtraction result;
4th) Perform the sum of all squares;
5th) Divide the result of the sum by the total observations.

At the 3rd step, we have the scale change. This new data can be used for other calculations in the future.
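In mathematical notation, these five steps amount to the population variance:

\mathrm{varpop} = \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2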

Standard Deviation

The square root of the variance.

By calculating the square root of the variance, the data comes back to its original scale!

There are at least two ways to do it:

  1. Using the square root function, with the following key and formula in Zabbix:
    1. Key: varpop.previous.hour
    2. Formula: sqrt(last(//varpop.response.time.previous.hour))
  2. Using the standard deviation function, with the following key and formula in Zabbix:
    1. Key: previous.hour
    2. Formula: stddevpop(/host/key,1h:now/h) # an example for the previous hour

A simple way to understand the standard deviation concept is: a way of knowing how “far” values are from the average. So, applying the specific formula, we’ll get that indicator.

Look at this:

The image above is a common one that can be found on the Internet, and it can help us understand some results. The standard deviation value should be near zero; otherwise, we have serious deviations.

Let’s check the following Calculated item:

  • stddevpop(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)

We are calculating the standard deviation based on the collected values at the previous hour. Let’s check the Test item on the frontend:

The test returned a value less than 1, about 0.000446. If this value is less than 1, we don’t have a full deviation, and it means that the values collected in the previous hour are near the average.

For a web application response time, this can represent good behavior, with no significant variance, as expected. Of course, other indicators must be checked for a complete and reliable diagnosis.

Important notes about standard deviation:

  • Sensitive to outliers
  • Can be calculated based on the population of a data distribution or based on its sample.
    • Using this formula: stddevsamp. In this case, it can return a different value from the previous one.

Median Absolute Deviation (MAD)

While the standard deviation is a simple way to understand whether the data of a distribution are far from its mean, MAD helps us understand whether these values are far from its median. MAD can be considered a robust estimate, because it is not sensitive to outliers.

Warning: If you need to identify outliers or consider them in your analysis, the MAD function is not recommended, because it ignores them.
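For our web scenario, a MAD-based calculated item could look like this (the key name is an assumption; the formula follows the same pattern as the other items in this post):

  • Key: mad.response.time.previous.hour
  • Formula: mad(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)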

Let’s check our dashboard and compare different deviation calculations for the same data distribution:

Note that the last one is based on the MAD function, and its value is lower than the others, simply because it does not consider the outliers.

In this particular case, the web application is stable, and its response times are near the mean or the median (considering the MAD algorithm).

Exploratory Dashboard

Partial conclusion

In this post, we have introduced the variance moment of a data distribution, presenting the variability concept, and then we learned some techniques to obtain KPIs and indicators.

What do we know? Each response time of a web application can be different from the previous one, so knowledge of the variance can help us understand the application’s behavior using some extraordinary data. Then we can decide whether an application had good or poor performance, for example.

Of course, this was a didactic example for one data distribution, and the location estimate and variance concepts can be applied to other exploratory data analyses considering longer periods, such as days, weeks, months, years, and so on. In those cases, it is very important to consider using trends instead of history data.

Our goal is to bring to light extraordinary data and insights, instead of common data, allowing us to know our application better.

In the next posts, we’ll talk about Skewness and Kurtosis, the 3rd and 4th moments for a data distribution, respectively.

The post Zabbix in: exploratory data analysis rehearsal – Part 2 appeared first on Zabbix Blog.

Zabbix in: exploratory data analysis rehearsal – Part 1

Post Syndicated from Paulo R. Deolindo Jr. original https://blog.zabbix.com/zabbix-in-exploratory-data-analysis-rehearsal-part-1/25802/

Abstract

Imagine your happiness when you start a new enterprise device and application monitoring project using Zabbix [i]. Indeed, doing this is so easy that the first results bring a lot of satisfaction very quickly. For example, you apply a specific template [ii] to a specific host, the data comes in (like magic), and you can create some dashboards with this data and visualize it.

If you haven’t done this yet, you must try it as soon as possible. You can create a web server host using either the Apache or Nginx web service, applying the appropriate template and getting metrics via HTTP checks: the “Apache by HTTP” template or the “Nginx by HTTP” template. You will see interesting metrics being collected, and you will be able to create and view some graphs or dashboards. But the work is not finished yet, because with Zabbix you can do much more!

In this article, I’ll talk about how we can think of new metrics and new use cases, and how to support our business and help the company with important results and insights, using exploratory data analysis and implementing some data science concepts with only Zabbix.

What is our goal?

Testing and learning some new Zabbix functions introduced in version 6.0, comparing some results, and discussing insights.

Contextualizing

Let’s keep the focus on the web server metrics. However, all the results of this study can be used later in different scenarios.

The web server runs nginx version 1.18.0 and we are using “Nginx by HTTP” template to collect the following metrics:

  • HTTP agent master item: get_stub_status
  • Dependent items[i]:

Nginx: Connections accepted per second

Nginx: Connections active

Nginx: Connections dropped per second

Nginx: Connections handled per second

Nginx: Connections reading

Nginx: Connections waiting

Nginx: Connections writing

Nginx: Requests per second

Nginx: Requests total

Nginx: Version

  • Simple check items:

Nginx: Service response time

Nginx: Service status

 

Those are the possibilities at the moment, and below is a simple dashboard created to view the initial results:

All widgets are reflecting metrics collected by using out-of-the-box “Nginx by HTTP” template.

Despite being Zabbix specialists and having some knowledge about our monitored application, there are some questions we need to ask ourselves. These questions do not need to be exhaustive, but they are relevant for our exercise. So, let’s jump to the next topic.

Generating new metrics! Bringing up some thoughts!

Let’s think about the collected metrics in the beginning of this monitoring project:

  1. Why does the number of requests only increase?
  2. When did we have more or fewer connections, considering, for example, the last hour?
  3. What’s the percentage change when comparing the current hour with the previous one?
  4. Which value can represent the best or the worst response time performance?
  5. Considering some collected values, can we predict an application downtime?
  6. Can we detect anomalies in the application based on the collected values and the application’s behavior?
  7. How do we establish a baseline? Is it possible?

These are some questions we need to answer using this article and the next ones to come.

Generating new metrics

1st step: Let’s create a new template. Clone “Nginx by HTTP” and change the clone’s name to “Nginx by HTTP modified”;

2nd step: Modify the “Nginx: Requests total” item, adding a new pre-processing step: “Simple change”. It will look like the image below:

It’s a Dependent item of the Master item “Nginx: Get stub status page”, and the latter is based on the HTTP agent to retrieve the main metric. Since the total number of connections always increases, the last collected value will be subtracted from the current value. A simple mathematical operation: subtraction. From this moment on, we’ll have the number of new connections per minute.

The formula for the “Simple change” pre-processing step can be represented using the following images:
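In essence, the “Simple change” pre-processing step performs the following calculation on every new value:

new value = (currently received value) - (previously received value)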

I also suggest you change the name of the item to: “Nginx: Requests at last minute”.

I can add some Tags[i] too. These tags can be used in the future to filter the views and so on.

Variations of the same metric

With the modified nginx template we can retrieve how many new connections our web application receives per minute, and then we can create new metrics from that one. Using the Zabbix time shift [i] [ii] function, we can create metrics such as the number of connections:

  • At the last hour
  • Today so far and Yesterday
  • This week and at the previous week
  • This month and the previous month
  • This year and at the previous year
  • Etc

This exercise can be very interesting. Let’s create some Calculated items with the following formulas:

sum(//nginx.requests.total,1h:now/h)        # Sum of new connections in the previous hour

sum(//nginx.requests.total,1h:now/h+1h)     # Sum of new connections in the current hour

In the official Zabbix documentation we have lots of examples of creating Calculated items using the “time shift” parameter. Please see this link.

Improving our dashboard

Using the new metrics, we can improve our dashboard and enhance the data visualization. Briefly:

The same framework could be used to show the daily, weekly, monthly and yearly data, depending on your business rule, of course. Please, be patient because some items will have some delay in collecting operation (monthly, yearly, etc).

Basic statistics metrics using Zabbix

As we know, it is perfectly possible to generate statistical values with Zabbix by using Calculated items. However, there are questions that can guide us to other thoughts, and some answers will again come in the form of metrics.

  1. Today, which response time was the best?
  2. And if we think about the worst response time?
  3. And about the average?

We can start with these basic statistics and grow from there later.

All data in the dashboard above were retrieved using simple Zabbix functions.

 

The best response time today so far.

min(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1d:now/d+1d)

The worst response time today so far.

max(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1d:now/d+1d)

The average of the response time today so far.

avg(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1d:now/d+1d)

 

It’s ok. Nothing is new, so far. But let’s check some thoughts.

Why are we looking for the best, the worst and the average using the min, max and avg functions, instead of the trendmin, trendmax and trendavg functions? The trend-based functions retrieve data from the trends tables, while the history-based functions calculate in “real time”. If you wish to use history-based functions to calculate something over a short period, fine. But if you wish to use them to calculate values over a long period such as a month or a year… hmm! It can be complicated, and it can take a lot of your infrastructure’s resources.

We need to remember an important thing: to use trend-based functions, we must consider only data collected up to the last full hour, because of the trend-cache sync process.

Look at the dashboard below, this time, using Trend-based Functions for the statistics.

Look at the current results. Basically, they are the same. There aren’t many differences and, I guess, using trend-based functions is an intelligent way to retrieve the desired values.

Insights

If one response time is very short, such as 0.06766 (the best of the day), and another value is very big and represents the worst response time, such as 3.1017, can you imagine which and how many values exist between them?

How do we calculate the average? You know: the sum of all collected values within a period, divided by the number of values.

So far, so good. The avg or trendavg functions can retrieve this average for the desired period. However, if you look at the graph above, you will see some “peaks” in certain periods. These “peaks” are called “outliers”. Outliers influence the average.

The outliers are important, but because they exist, the average sometimes may not represent reality. Think about this: the response time of the web application stayed between 0.0600 and 0.0777 during the previous hour. During one specific minute within the same monitored period, for some reason, the response time was 3.0123. In this case, the average will increase. But what if we discard the outlier? Obviously, the average will be as expected. In this case, the outlier was a deviation, “an error in the matrix”. So, we need to be able to calculate the average, or another location estimate, for our values without the outlier.

And we cannot forget: if we are looking for anomalies based on the web application response time, we need to consider the outliers. If not, outliers can be left out of the calculation for now.

Ok! Outliers can influence the common average. So, how can we calculate something without the outliers?

Introduction to Median

Regarding the data timeline, we can affirm that the database respects the collection timestamps. Look at the collected data below:

2023-04-06 16:24:50 1680809090 0.06977
2023-04-06 16:23:50 1680809030 0.06981
2023-04-06 16:22:50 1680808970 0.07046
2023-04-06 16:21:50 1680808910 0.0694
2023-04-06 16:20:50 1680808850 0.06837
2023-04-06 16:19:50 1680808790 0.06941
2023-04-06 16:18:53 1680808733 3.1101
2023-04-06 16:17:51 1680808671 0.06942
2023-04-06 16:16:50 1680808610 0.07015
2023-04-06 16:15:50 1680808550 0.06971
2023-04-06 16:14:50 1680808490 0.07029

For the average, the timestamp or the collection order is not important. However, if we ignore the timestamps and order the values from smallest to biggest, we’ll get something like this:

0.06837 0.0694 0.06941 0.06942 0.06971 0.06977 0.06981 0.07015 0.07029 0.07046 3.1101

Table 1.0 – 11 collected values, ordered from smallest to biggest

In this case, the values are ordered from the smallest to the biggest, ignoring their timestamps.

Look at the outlier at the end. It’s not important for us right now.

The list has an odd number of values, and the central value – 0.06977 – is the median. And what if there were an even number of values? How could we calculate the median? There is a formula for that.

0.0694 0.06941 0.06942 0.06971 0.06977 0.06981 0.07015 0.07029 0.07046 3.1101

Table 2.0 – 10 collected values, ordered from smallest to biggest

Now, we have two groups of values. There is not a central position.

This time, we can use the general median formula: calculate the average of the last value of “Group A” and the first value of “Group B” – the two central values in the ordered list above.
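Using the ordered values from Table 2.0, the last value of Group A is 0.06977 and the first value of Group B is 0.06981, so:

median = (0.06977 + 0.06981) / 2 = 0.06979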

Percentile in Zabbix

Despite considering the concept of median, we can also use the percentile calculation.

In most cases, the median has a synonym: the “50th percentile”.

I’m proposing you an exercise:

 

1. You must create a Zabbix trapper item and send to it the following values using zabbix-sender:

0.06837, 0.0694, 0.06941, 0.06942, 0.06971, 0.06977, 0.06981, 0.07015, 0.07029, 0.07046, 3.1101

# for x in `cat numbers.txt`; do zabbix_sender -z 159.223.145.187 -s "Web server A" -k percentile.test -o "$x"; done

In the end, we’ll have 11 values in the Zabbix database, and we’ll calculate the 50th percentile.

 

2. You must create a Zabbix Calculated item with the following formula:

percentile(//percentile.test,#11,50)

In this case, we can read it as: consider the last 11 values and return the value at the 50th percentile of that ordered array. You can check the result in advance using the “Test” button in Zabbix.

Now, we’ll work with an even number of values, excluding the value 0.06837. Our values for the next test will be:

0.0694, 0.06941, 0.06942, 0.06971, 0.06977, 0.06981, 0.07015, 0.07029, 0.07046, 3.1101

Please, before sending the values with zabbix_sender again, clear the history and trends for this item and then adjust the formula:

percentile(//percentile.test,#10,50)

Checking the result, something curious happened: the 50th percentile was the same value.

There is a simple explanation for this.

Considering the last 10 values, the first half forms “Group A” and the second half forms “Group B”. The value retrieved by the 50th percentile formula turns out to be the same in both the first and the second test.

We can test it again, but this time let’s change the formula to the 51st percentile. The next value will be the first value of the second group.

percentile(//percentile.test,#10,51)

The result changed. Now we have something different to work with, and in the next steps we’ll retrieve the median.

So, the 50th percentile can be considered the central value for an odd number of values, but when we have an even number of values, the result may not be what is expected.

Average or Percentile?

Two different calculations. Two different results. Neither the first is wrong nor the second. Both values can be considered correct, but we need some context for this affirmation.

The average considers the outliers. The latter, the percentile, does not.

Let’s update our dashboard.

We don’t need to prove anything to anyone about the values, but we need to show the values and their context.

Median

It’s simple: if the median is the central value, in this case we can retrieve the average of the 50th and 51st percentiles. Remember, our new connections are collected every minute, so at the end of each hour we’ll have an even number of values.

Fantastic. We can calculate the median in a simple way:

(last(//percentile.50.input.values)+last(//percentile.51.input.values))/2

 

This is the median formula in this case using Zabbix. Let’s check the results in our dashboard.
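The formula above assumes two supporting calculated items for the 50th and 51st percentiles. A minimal sketch of how they could be defined (the source item and the period here are assumptions for illustration):

  • Key: percentile.50.input.values
    Formula: percentile(//nginx.requests.total,1h:now/h,50)
  • Key: percentile.51.input.values
    Formula: percentile(//nginx.requests.total,1h:now/h,51)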

Partial conclusion

In this article, we have just explored some Zabbix functions to calculate basic statistics and bring up some insights about a symbolic web application and its response time.

There is no absolute truth about those metrics but each one of them needs a context.

In Exploratory Data Analysis, asking questions can guide us to interesting answers, but remember that we need to know where we are going or what we wish.

With Zabbix, you and I can perform a data scientist’s role – as long as we know Zabbix, and know it very well.

You don’t need to use Python or R for all tasks in data science. We’ll talk about this in the next articles of this series.

Keep in mind: Zabbix is your friend. Let’s learn Zabbix and get wonderful results!

_____________

[1] Infográfico Zabbix (unirede.net)

[2] https://www.unirede.net/zabbix-templates-onde-conseguir/

[3] https://www.unirede.net/monitoramento-de-certificados-digitais-de-websites-com-zabbix-agent2/

[4] Tagging: Monitorando todos os serviços! – YouTube

[5] Timeshift – YouTube

Integrate Zabbix with your data pipelines by configuring real-time metric and event streaming

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/integrate-zabbix-with-your-data-pipelines-by-configuring-real-time-metric-and-event-streaming/25728/

Modern IT infrastructures tend to utilize multiple data sources to evaluate and react to the current state of the infrastructure. A set of internal solutions and dedicated software tools are used to correlate the collected information and react in a proper way to changes in the environment – be it a gradual increase or decrease in resource usage, unexpected load spikes or run-of-the-mill outages.

With the release of Zabbix 6.4, metrics collected by Zabbix and events generated based on trigger expressions can be integrated into such a data pipeline by using the new real-time metric and event streaming feature.

Zabbix real-time metric and event streaming

Before Zabbix 6.4, Zabbix supported real-time export of history, trends and events to files, written in newline-delimited JSON format. Additional scripting was mandatory if a user wanted to integrate this data into their data pipeline, since Zabbix only exported the data to files without any further transformation or streaming.

On the other hand – real-time metric streaming can be a lot more flexible when it comes to integrating Zabbix with data pipelines, filtering the required data and securing the connection to the third party endpoint.

In addition to specifying the streaming endpoint URL, Zabbix users can choose between streaming Item values and Events. On top of that, Tag filtering can be used to further narrow down the data that will be streamed to the endpoint.

Connectors

Real-time item value and event streaming can be configured by accessing the Administration – General – Connectors section. Here Zabbix administrators will have to create connectors and specify what kind of data the connector will be streaming. A single connector can only stream either Item values or events – if you wish to stream both, you will have to define at least two connectors, one for values and the other for events.

Configuring a new connector

Marking the Advanced configuration checkbox enables Zabbix administrators to further configure each individual connector – from specifying the number of concurrent sessions, limiting the number of attempts and specifying the connection Timeout to configuring HTTP proxies and enabling SSL connections.

An HTTP authentication method can also be specified for each connector. It is possible to use one of the following methods:

None – no authentication used;

Basic – basic authentication is used;

NTLM – NTLM (Windows NT LAN Manager) authentication is used;

Kerberos – Kerberos authentication is used;

Digest – Digest authentication is used;

Bearer – Bearer authentication is used.

In addition, the number of pre-forked connector worker instances must be specified in the Zabbix server configuration file in the StartConnectors parameter.

Setting StartConnectors parameter in Zabbix server configuration file
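For example, in zabbix_server.conf (the value itself is just an illustration):

StartConnectors=2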

Protocol

Under the hood, the data is sent over HTTP using newline-delimited JSON export protocol.

The following example shows a trigger event for Zabbix agent being unreachable:

{"clock":1519304285,"ns":123456789,"value":1,"name":"Either Zabbix agent is unreachable on Host B or pollers are too busy on Zabbix Server","severity":3,"eventid":42, "hosts":[{"host":"Host B", "name":"Host B visible"},{"host":"Zabbix Server","name":"Zabbix Server visible"}],"groups":["Group X","Group Y","Group Z","Zabbix servers"],"tags":[{"tag":"availability","value":""},{"tag":"data center","value":"Riga"}]}

Do not get confused by the value field, the value of which will always be 1 for problem events, while for items it will contain the item value:

{"host":{"host":"Host B","name":"Host B visible"},"groups":["Group X","Group Y","Group Z"],"item_tags":[{"tag":"foo","value":"test"}],"itemid":4,"name":"CPU Load","clock":1519304285,"ns":123456789,"value":0.1,"type":0}

Use cases

Finally, what particular use cases can we use the real-time metric streaming feature for? As I mentioned in the introduction, Zabbix item values and events could serve as an additional source of near real-time information about the current system behavior.

For example, we could stream item values and events to message brokers such as Kafka, RabbitMQ or Amazon Kinesis. Combine this with additional automation solutions and your services could dynamically scale (think K8s/Docker containers) depending on the current (or expected) load. Zabbix Kubernetes and Docker container monitoring templates very much complement such an approach.

The streamed data could also be used to gain new insights about the system behavior by streaming it to an AI engine or data lakes/data warehouses for long-term storage and analysis.

Kubernetes monitoring with Zabbix – Part 3: Extracting Prometheus metrics with Zabbix preprocessing

Post Syndicated from Michaela DeForest original https://blog.zabbix.com/kubernetes-monitoring-with-zabbix-part-3-extracting-prometheus-metrics-with-zabbix-preprocessing/25639/

In the previous Kubernetes monitoring blog post, we explored the functionality provided by the Kubernetes integration in Zabbix and discussed use cases for monitoring and alerting to events in a cluster, such as changes in replicas or CPU pressure.

In the final part of this series on monitoring Kubernetes with Zabbix, we will show how the Kubernetes integration uses Prometheus to parse data from kube-state-metrics and how users can leverage this functionality to monitor the many cloud-native applications that expose Prometheus metrics by default.

Want to see Kubernetes monitoring in action? Watch Part 3 of our Kubernetes monitoring video guide.

Prometheus Data Model

Prometheus is an open-source toolkit for monitoring and alerting created by SoundCloud. Prometheus was the second hosted project to join the Cloud Native Computing Foundation in 2016, after Kubernetes. As such, users of Kubernetes have adopted Prometheus extensively.

Lines in the exposition format either begin with a pound sign or not. Lines beginning with a pound sign contain metadata that includes help text and type information. The remaining lines contain samples, where the first field is the metric name with optional labels, followed by the value, and optionally concluding with a timestamp. If a timestamp is absent, it is assumed to be equal to the time of collection.

http_requests_total{job="nginx",instance="10.0.0.1:443"} 15 1677507349983

Using Prometheus with Kubernetes Monitoring

Let’s start with an example from the kube-state-metrics endpoint, installed in the first part of this series. Below is the output for the /metrics endpoint used by the Kubernetes integration, showing the metric kube_job_created. Each metric has help text followed by a line starting with that metric name, labels describing each job, and creation time as the sample value.

# HELP kube_job_created Unix creation timestamp
# TYPE kube_job_created gauge
kube_job_created{namespace="jdoe",job_name="supportreport-supportreport-27956880"} 1.6774128e+09
kube_job_created{namespace="default",job_name="core-backup-data-default-0-27957840"} 1.6774704e+09
kube_job_created{namespace="default",job_name="core-backup-data-default-1-27956280"} 1.6773768e+09
kube_job_created{namespace="jdoe",job_name="activetrials-activetrials-27958380"} 1.6775028e+09
kube_job_created{namespace="default",job_name="core-cache-tags-27900015"} 1.6740009e+09
kube_job_created{namespace="default",job_name="core-cleanup-pipes-27954860"} 1.6772916e+09
kube_job_created{namespace="jdoe",job_name="salesreport-salesreport-27954060"} 1.6772436e+09
kube_job_created{namespace="default",job_name="core-correlation-cron-1671562914"} 1.671562914e+09
kube_job_created{namespace="jtroy",job_name="jtroy-clickhouse-default-0-maintenance-27613440"} 1.6568064e+09
kube_job_created{namespace="default",job_name="core-backup-data-default-0-27956880"} 1.6774128e+09
kube_job_created{namespace="default",job_name="core-cleanup-sessions-27896445"} 1.6737867e+09
kube_job_created{namespace="default",job_name="report-image-findings-report-27937095"} 1.6762257e+09
kube_job_created{namespace="jdoe",job_name="salesreport-salesreport-27933900"} 1.676034e+09
kube_job_created{namespace="default",job_name="core-cache-tags-27899775"} 1.6739865e+09
kube_job_created{namespace="ssmith",job_name="test-auto-merger"} 1.653574763e+09
kube_job_created{namespace="default",job_name="report-image-findings-report-1650569984"} 1.650569984e+09
kube_job_created{namespace="ssmith",job_name="auto-merger-and-mailer-auto-merger-and-mailer-27952200"} 1.677132e+09
kube_job_created{namespace="default",job_name="core-create-pipes-pxc-user"} 1.673279381e+09
kube_job_created{namespace="jdoe",job_name="activetrials-activetrials-1640610000"} 1.640610005e+09
kube_job_created{namespace="jdoe",job_name="salesreport-salesreport-27943980"} 1.6766388e+09
kube_job_created{namespace="default",job_name="core-cache-accounting-map-27958085"} 1.6774851e+09

Zabbix collects data from this endpoint with the “Get state metrics” item. The item uses the “Script” item type to fetch data from the /metrics endpoint. Dependent items are then created that use a “Prometheus pattern” preprocessing step to extract the data relevant to each of them.
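
For illustration, a dependent item that extracts the creation timestamp of a single job from this master item could use a “Prometheus pattern” preprocessing step roughly like the sketch below (the job name is just one taken from the sample output above):

Pattern:            kube_job_created{namespace="default",job_name="core-backup-data-default-0-27956880"}
Result processing:  value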

Prometheus and Out-Of-The-Box Templates

Zabbix also offers many templates for applications that expose Prometheus metrics, including etcd. Etcd is a distributed key-value store that uses a simple HTTP interface. Many cloud applications use etcd, including Kubernetes. Following is a description of how to set up an etcd “host” using the built-in etcd template.

A new host is created called “Etcd Application” with an agent interface specified that provides the location of the application API. The interface port does not matter because a macro sets the port. The “Etcd by HTTP” template is attached to the host.

The “Get node metrics” item is the master item that collects Prometheus metrics. Testing this item shows that it returns Prometheus formatted metrics. The master item creates many dependent items that parse the Prometheus metrics. In the dependent item, “Maximum open file descriptors,” the maximum number of open file descriptors is obtained by adding the “Prometheus pattern” preprocessing step. This metric is available with the metric name process_max_fds.

Custom Prometheus Templates

 

While it is convenient when Zabbix has a template for the application you want to monitor, creating a new template for an application that exposes a /metrics endpoint but does not have an associated template is easy.

One such application is Argo CD, a GitOps continuous delivery tool for Kubernetes. Each deployment in Kubernetes is represented by an Argo CD “application”, and Argo CD uses Git to keep applications in sync.

Argo CD exposes a Prometheus metrics endpoint that can be used to monitor the application. The Argo CD documentation site includes information about available metrics.

In Argo CD, metrics are exposed by the argocd-metrics service. Following is a demonstration of creating an Argo CD template that collects Prometheus metrics. Install Argo CD in a cluster with a Zabbix proxy installed before starting. To do this, follow the Argo CD “Getting Started” guide.

Create a new template called “Argo CD by HTTP” in the “Templates/Applications” group. Add three macros to the template. Set {$ARGO.METRICS.SERVICE.PORT} to the default of 8082. Set {$ARGO.METRICS.API.PATH} to “/metrics”. Set the last macro, {$ARGO.METRICS.SCHEME}, to the default of “http”.

Open the template and click “Items -> Create item.” Name this item “Get Application Metrics” and give it the “HTTP agent” type. Set the key to argocd.get_metrics with a “Text” information type. Set the URL to {$ARGO.METRICS.SCHEME}://{HOST.CONN}:{$ARGO.METRICS.SERVICE.PORT}/metrics. Set the History storage period to “Do not keep history.”

Create a new host to represent Argo CD. Go to “Hosts -> Create host”. Name the host “Argo CD Application” and assign the newly created template. Define an interface and set the DNS name to the name of the metrics service (including the namespace, if the Argo CD deployment is not in the same namespace as the Zabbix proxy deployment). Set “Connect to” to DNS and leave the port at its default, because the template does not use this value – like in the etcd template, a macro sets the port. Set the proxy to the proxy located in the cluster. In most cases, the macros do not need to be updated.

Click “Test -> Get value and test” to test the item. Prometheus metrics are returned, including a metric called argocd_app_info. This metric collects the status of the applications in Argo. We can collect all deployed applications with a discovery rule.

Navigate to the Argo CD template and click “Discovery rules -> Create discovery rule.” Call the rule “Discover Applications.” The type should be “Dependent item” because it depends on the metrics collection item. Set the master item to the “Get Application Metrics” item. The key will be argocd.applications.discovery. Go to the preprocessing tab and add a new step, “Prometheus to JSON.” The preprocessing step will convert the application data to JSON, which will look like the example below.

[{"name":"argocd_app_info","value":"1","line_raw":"argocd_app_info{dest_namespace=\"monitoring\",dest_server=\"https://kubernetes.default.svc\",health_status=\"Healthy\",name=\"guestbook\",namespace=\"argocd\",operation=\"\",project=\"default\",repo=\"https://github.com/argoproj/argocd-example-apps\",sync_status=\"Synced\"} 1","labels":{"dest_namespace":"monitoring","dest_server":"https://kubernetes.default.svc","health_status":"Healthy","name":"guestbook","namespace":"argocd","operation":"","project":"default","repo":"https://github.com/argoproj/argocd-example-apps","sync_status":"Synced"},"type":"gauge","help":"Information about application."}]

Set the parameters to “argocd_app_info” to gather all metrics with that name. Under “LLD Macros”, set three macros. {#NAME} is set to the .labels.name key, {#NAMESPACE} is set to the .labels.dest_namespace key, and {#SERVER} is set to .labels.dest_server.
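
Expressed as JSONPath against the JSON produced by the “Prometheus to JSON” step, that mapping might look roughly like this (a sketch – the exact paths may differ in your template):

{#NAME}       -> $.labels.name
{#NAMESPACE}  -> $.labels.dest_namespace
{#SERVER}     -> $.labels.dest_server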

Let us create some item prototypes. Click “Create item prototype” and name it “{#NAME}: Health Status.” Set it as a dependent item with a key of argocd.applications[{#NAME}].health. The type of information will be “Character.” Set the master item to “Get Application Metrics.”

In preprocessing, add a “Prometheus pattern” step with the pattern argocd_app_info{name="{#NAME}"}. Set result processing to “label” and set the label to health_status. Add a second step, “Discard unchanged with heartbeat,” with the heartbeat set to 2h.

Clone the prototype to create another item called “{#NAME}: Sync status.” Change the key to argocd.applications.sync[{#NAME}]. Under “Preprocessing” change the label to sync_status.

Now, when viewing “Latest Data” the sync and health status are available for each discovered application.

Conclusion

We have shown how Zabbix templates, such as the Kubernetes template, and the etcd template utilize Prometheus patterns to extract metric data. We have also created templates for new applications that expose Prometheus data. Because of the adoption of Prometheus in Kubernetes and cloud-native applications, Zabbix benefits by parsing this data so that Zabbix can monitor Kubernetes and cloud-native applications.

I hope you enjoyed this series on monitoring Kubernetes and cloud-native applications with Zabbix. Good luck on your monitoring journey as you learn to monitor with Zabbix in a containerized world.

About the Author

Michaela DeForest is a Platform Engineer for The ATS Group. She is a Zabbix Certified Specialist on Zabbix 6.0 with additional areas of expertise, including Terraform, Amazon Web Services (AWS), Ansible, and Kubernetes, to name a few. As ATS’s resident authority in DevOps, Michaela is critical in delivering cutting-edge solutions that help businesses improve efficiency, reduce errors, and achieve a faster ROI.

About ATS Group:

The ATS Group provides a fully inclusive set of technology services and tools designed to innovate and transform IT. Their systems integration, business resiliency, cloud enablement, infrastructure intelligence, and managed services help businesses of all sizes “get IT done.” With over 20 years in business, ATS has become the trusted advisor to nearly 500 customers across multiple industries. They have built their reputation around honesty, integrity, and technical expertise unrivaled by the competition.

Just-in-Time user provisioning explained

Post Syndicated from Evgeny Yurchenko original https://blog.zabbix.com/just-in-time-user-provisioning-explained/25515/

Zabbix 6.4 finally brings a long-awaited feature called “Just-In-Time user provisioning”. The Zabbix “What’s new in 6.4” LDAP/SAML user provisioning paragraph is very brief and cannot (not that I am saying it should) convey any excitement about this really game-changing feature. This blog post was born to address two points:

  • explain in more detail why it is a “game changing” feature
  • walk through the configuration: it is very flexible, and as often happens, flexibility brings complexity and sometimes confusion about how to not only get it working but also get the most out of this feature

NOTE: I am talking about LDAP in this blog post but SAML works exactly the same way so you can easily apply this article to SAML JIT user provisioning configuration.

Old times (before 6.4)

Let’s quickly recall how it worked before Zabbix 6.4:

The obvious problem here is that a user must be pre-created in Zabbix to be able to log in using LDAP. The database user records do not have any field indicating that the user will be authenticated via LDAP; the users’ passwords stored in the database are simply ignored, and instead Zabbix goes to an LDAP server to verify whether:

  • a user with a given username exists
  • user provided the correct password

no other attributes configured for the user on the LDAP server side are taken into account.

So when Zabbix is used by many users and groups, user management becomes a non-trivial task as new people join different teams (or leave them).

Zabbix 6.4 with JIT user provisioning enabled

Now let’s take a look at what is happening in Zabbix 6.4 (a very simplified picture). The picture depicts what happens when the memberOf method is selected for Group configuration (more on that later):

When Zabbix gets a username and password from the login form, it goes to the LDAP server and fetches all the information available for this user, including LDAP group membership and e-mail address. Obviously, it gets all that only if the correct (from the LDAP server’s perspective) username and password were provided. Then Zabbix goes through the pre-configured mapping that defines which LDAP group maps to which Zabbix user group. If at least one match is found, a user is created in the Zabbix database, belonging to a Zabbix user group and having a Zabbix user role according to the configured “match”. So far it sounds pretty simple, right? Now let’s go into detail about how all this should be configured.

LDAP server data

To experiment with the feature I built a Docker container which is a fully functional LDAP server with some pre-configured data, you can easily spin it up using this image. Start the container this way:

docker run -p 3389:389 -p 6636:636 --name openldap-server --detach bgmot42/openldap-server:0.1.1

To visually see LDAP server data (and add your own configuration like users and groups) you can start this standard container

docker run -p 8081:80 -p 4443:443 --name phpldapadmin --hostname phpldapadmin --link openldap-server:ldap-host --env PHPLDAPADMIN_LDAP_HOSTS=ldap-host --detach osixia/phpldapadmin:0.9.0

Now you can access this LDAP server via https://<ip_address>:4443 (or any other port you configure to access this Docker container), click Login, enter “cn=admin,dc=example,dc=org” in the Login DN field and “password” in the Password field, then click Authenticate. You should see the following structure of the LDAP server (the picture shows the ‘zabbix-admins’ group configuration):

For convenience, all users in this container are configured with “password” as their password.

General LDAP authentication configuration in Zabbix

No surprises here, you need to enable LDAP authentication, just a couple of additions here:

  • You must provide a Deprovisioned users group. This group must be literally “disabled”, otherwise you won’t be able to select it here. This is the Zabbix user group that all “de-provisioned” users will be put into, effectively blocking them from accessing Zabbix.
  • The Enable JIT provisioning check-box, which obviously needs to be checked for this feature to work.

And again, the already familiar interface to configure an LDAP server and search parameters; this picture, however, depicts how we actually fill in these parameters according to the data in our LDAP server:

A “special” Distinguished Name (DN), cn=ldap_search,dc=example,dc=org, is used for searching, i.e. Zabbix uses this DN to connect to the LDAP server, and of course when you connect to the LDAP server you need to be authenticated – this is why you need to provide the Bind password. This DN should have access to the sub-tree in the LDAP data hierarchy where all your users are configured. In our case all the users are configured “under” ou=Users,dc=example,dc=org; this DN is called the base DN and is used by Zabbix as, so to say, the “starting point” for searching.
Note: technically it is possible to bind to an LDAP server anonymously, without providing a password, but this is a huge security hole, as the whole users sub-tree becomes available for anonymous (unauthenticated) search, i.e. effectively exposed to any LDAP client that can connect to the LDAP server over TCP. The LDAP server we deployed previously in the Docker container does not allow this.

Group configuration method “memberOf”

All users in our LDAP server have a memberOf attribute that defines what LDAP groups each user belongs to, e.g. if you perform an LDAP query for the user1 user, you’ll see that its memberOf attribute has this value:
memberOf: cn=zabbix-admins,ou=Group,dc=example,dc=org
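
You can reproduce such a query against the demo container with ldapsearch, for example (a sketch – here I assume that the bind password of cn=ldap_search is also “password”):

ldapsearch -x -H ldap://localhost:3389 \
  -D "cn=ldap_search,dc=example,dc=org" -w password \
  -b "ou=Users,dc=example,dc=org" "(uid=user1)" memberOf
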
Note that your real LDAP server can have a totally different LDAP attribute that provides users’ group membership, and of course you can easily configure which attribute to use when searching for a user’s LDAP groups by putting it into the User group membership attribute field:

In the picture above we are telling Zabbix to use the memberOf attribute to extract the DN defining the user’s group membership (in this case it is cn=zabbix-admins,ou=Group,dc=example,dc=org) and take only the cn attribute from that DN (in this case it is zabbix-admins) to use when searching for a match in the User group mapping rules. Then we define as many mapping rules as we want. In the picture above we have two rules:

  • All users belonging to zabbix-users LDAP group will be created in Zabbix as members of Zabbix users group with User role
  • All users belonging to zabbix-admins LDAP group will be created in Zabbix as members of Zabbix administrators group with Super admin role

Group configuration method “groupOfNames”

There is another method of finding users’ group membership called “groupOfNames”. It is not as efficient as the “memberOf” method but can provide much more flexibility if needed. Here Zabbix does not query the LDAP server for a user; instead it searches for LDAP groups based on a given criterion (filter). It’s easier to explain with pictures depicting an example:

Firstly, we define the LDAP “sub-tree” where Zabbix will search for LDAP groups – note ou=Group,dc=example,dc=org in the Group base DN field. Then, in the Group name attribute field, we specify which attribute to use when searching for a match in the mapping rules (in this case we take cn, i.e. only zabbix-admins from the full DN cn=zabbix-admins,ou=Group,dc=example,dc=org). Each LDAP group in our LDAP server has a member attribute that lists all users belonging to this LDAP group (look at the right picture), so we put member in the Group member attribute field. Each user’s DN will help us construct the Group filter field. Now pay attention: the Reference attribute field defines which attribute of the LDAP user Zabbix will use in the Group filter, i.e. %{ref} will be replaced with the value of this attribute (here we are talking about the user’s attributes – we have already authenticated this user, i.e. got all its attributes from the LDAP server). To sum up what I’ve said above, Zabbix:

  1. Authenticates the user with the entered username and password against the LDAP server, getting all the user’s LDAP attributes
  2. Uses the Reference attribute and Group filter fields to construct a filter (when user1 logs in, the filter will be (member=uid=user1,ou=Users,dc=example,dc=org))
  3. Performs an LDAP query to get all LDAP groups whose member attribute (configured in the Group member attribute field) contains the value from the filter constructed in step 2
  4. Goes through all LDAP groups received in step 3, picks the cn attribute (configured in the Group name attribute field), and finds a match in the User group mapping rules

Looks a bit complicated but all you really need to know is the structure of your LDAP data.
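
For illustration, the search Zabbix performs in step 3 can be reproduced by hand against the demo container (again a sketch, with the same bind password assumption as before):

ldapsearch -x -H ldap://localhost:3389 \
  -D "cn=ldap_search,dc=example,dc=org" -w password \
  -b "ou=Group,dc=example,dc=org" \
  "(member=uid=user1,ou=Users,dc=example,dc=org)" cn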

Demo time

Finally let’s see what happens when user1 belonging to zabbix-admins LDAP group and user3 belonging to zabbix-users LDAP group log in:

That’s it. Happy JIT user provisioning!

Kubernetes monitoring with Zabbix – Part 2: Understanding the discovered resources

Post Syndicated from Michaela DeForest original https://blog.zabbix.com/kubernetes-monitoring-with-zabbix-part-2-understanding-the-discovered-resources/25476/

In the previous blog post, we installed the Zabbix Agent Helm Chart and set up official Kubernetes templates to monitor a cluster in Zabbix. In this edition, part 2, we will explore the functionality provided by the Kubernetes integration in Zabbix and discuss use cases for monitoring and alerting on events in a cluster. (This post assumes that the Kubernetes integration has been set up in at least one cluster using the helm chart and provided templates.)

Want to see Kubernetes monitoring in action? Watch Part 2 of our Kubernetes monitoring video guide.

Node and Component Discovery

Following integration setup, the templates will discover control plane components, each node, and the kubelet associated with it using the Kubernetes API via a “Script” item type.

Note:

In the last blog post, I showed a managed EKS cluster. Control plane components cannot be discovered in an EKS cluster because AWS does not make them directly available through the API. For the sake of demonstrating the full capabilities of the integration, this post will use screenshots depicting a cluster that was created using the kubeadm utility.

In the latest version of Zabbix (6.2 at the time of writing), control plane components are discovered via node labels added only for clusters created with kubeadm. Depending on your setup, you may be able to add the same node labels to your own control plane nodes or modify the template to use your specific labels.
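
For example, if your control plane nodes lack such labels, you could add a kubeadm-style role label manually – the command below is a sketch; verify the exact label that your template version filters on before relying on it:

kubectl label node <control-plane-node-name> node-role.kubernetes.io/control-plane=""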

This example cluster has 4 worker nodes and 1 master node. The control plane runs entirely on the master node.

Zabbix’s “Low-Level Discovery” is the backbone of the Kubernetes integration. Zabbix discovers each node and creates two hosts to represent it in the cluster. The first host has the “Linux by Zabbix Agent” template attached to it, and the second has a custom kubelet template called “Kubernetes Kubelet by HTTP” attached. Zabbix also creates items for most standard objects like pods, deployments, replicasets, jobs, cronjobs, etc.

Node and Kubernetes Performance Metrics

In this example, there are four discovered worker nodes with the “Linux by Zabbix Agent” template attached to them. The template will provide metrics about the machines running in the cluster.

Each worker host’s “System performance” dashboard shows system load, CPU usage, and memory usage metrics.

Zabbix will also collect Kubernetes-specific metrics related to the nodes. “Latest Data” for the Kubernetes Nodes host shows metrics such as the Allocatable CPU available to pods and the node’s memory capacity.

Alerts are generated for events such as the allocation of too much CPU. This could indicate that capacity should be increased, assuming that the memory and CPU limits set at the pod level are accurate.

The Kubernetes integration also monitors object states. As a best practice, any tool used to monitor Kubernetes should be monitoring and alerting critical status changes within the cluster. The image above shows the triggers related to the health of a pod. There are also triggers when certain conditions are detected by the nodes, like memory or CPU pressure.

Zabbix discovers objects like pods, deployments, and Replicasets, and triggers on object states.  For example, pods that are not up or deployments that do not have the correct number of replicas up.

In this example, a cluster is running a Kubernetes dashboard deployment with 3 replicas. By running the following command, we can see that all 3 replicas are up. Under “Latest Data,” Zabbix shows those 3 replicas available out of the 3 desired.

kubectl get deployment kubernetes-dashboard



To mimic a pod crashing, the pod is edited to use an invalid image tag.

kubectl edit pod <pod name>

The image tag is changed to “invalid.tag”, which does not exist for the image. This causes the pod to fail because it can no longer pull the image. The output now shows that one pod is no longer ready.

Looking at the data in Zabbix, the number of available replicas is only 3, while the number of unavailable replicas is now 1.

On the problems page, there are two new problems. Both alerted that there is a mismatch between the number of replicas for the dashboard and the number of desired replicas.

Changing the tag back to a valid one should cause those problems to be resolved.

The Kubernetes templates offer many metrics and triggers, including most provided by Prometheus and Alert Manager. With some Zabbix experience and the ability to navigate kube-state-metrics and Kubernetes APIs, creating new items is possible.

What’s Next?

Above is an example of the output from the kube-state-metrics API. Unlike most APIs that return data in JSON format, the kube-state-metrics API uses the Prometheus data model to supply metrics.

As you get comfortable with Kubernetes monitoring in Zabbix, you may want to parse your own metrics from kube-state-metrics and create new items.

In the next video, we will learn how to monitor applications with Prometheus in Zabbix.

About the Author

Michaela DeForest is a Platform Engineer for The ATS Group.  She is a Zabbix Certified Specialist on Zabbix 6.0 with additional areas of expertise, including Terraform, Amazon Web Services (AWS), Ansible, and Kubernetes, to name a few.  As ATS’s resident authority in DevOps, Michaela is critical in delivering cutting-edge solutions that help businesses improve efficiency, reduce errors, and achieve a faster ROI.

About ATS Group: The ATS Group provides a fully inclusive set of technology services and tools designed to innovate and transform IT.  Their systems integration, business resiliency, cloud enablement, infrastructure intelligence, and managed services help businesses of all sizes “get IT done.” With over 20 years in business, ATS has become the trusted advisor to nearly 500 customers across multiple industries.  They have built their reputation around honesty, integrity, and technical expertise unrivaled by the competition.

How to write a webhook for Zabbix

Post Syndicated from Andrey Biba original https://blog.zabbix.com/how-to-write-a-webhook-for-zabbix/25298/

As you know, a picture is worth a thousand words. Therefore, I would like to share the process of creating a webhook from scratch. In this article, we will walk through the creation process step by step – starting with studying the target service with which Zabbix will integrate and finishing with tests for sending events from Zabbix. Although it may seem complicated, writing your own integrations is not so difficult.

Preparation

First, we need to decide what we want to see as a result of the webhook. In most cases, the services to which we will send events are divided into 2 types:

  • Messengers to which you can send messages. For example, Telegram, Slack, Discord, etc.
  • Service Desks where you can open, close, and update tickets. For example, Jira, Redmine, ServiceNow, etc.

In both cases, the principle of creating a webhook will not differ – the only difference is the complexity of one type compared to the other.

In this article, I will describe the process of creating a webhook for messengers – and specifically for Line messenger.

After we have decided on the type, we need to find out whether this service supports the possibility of API requests and, if it does, what is required for this. Usually, all the services you want to integrate Zabbix with have somewhat detailed documentation about the API methods they support. By the way, Zabbix also has its own API, which is documented in detail.

After we are done studying the Line documentation, we find out that messages are sent using the POST method to the https://api.line.me/v2/bot/message/push endpoint, using the Line bot token in the request header for authorization and passing a specially formatted JSON in the request body with the content of the message. Confused? No problem. Let’s take a closer look.

HTTP requests

The operation of the API is based on HTTP requests, which are executed with parameters provided by the developers of this API.

Several types of HTTP requests are used more often than others:

  • GET – perhaps the most common one, which all of us encounter on a daily basis. This request only involves getting data. For example, your browser used a GET request to fetch the article you are currently reading from the web server.
  • POST – a request that sends data to a resource. This is exactly the case when we want to pass something to a service using API requests.
  • PUT – much less common than the previous two, but no less important. This request replaces the values in a resource.

These are not all HTTP request methods, but these three will suffice for a general introduction.

We are done with methods. Let’s move on to the endpoint.

An endpoint is a permanent address of a resource via which we transfer, receive, or change data. In this case, https://api.line.me/v2/bot/message/push is the endpoint that accepts POST requests to send messages.

So, the method and the endpoint are defined. What’s next?

Generally, any HTTP request consists of:

  1. URL
  2. Method
  3. Headers
  4. Body
HTTP request structure

We have already dealt with the first two, but the headers and the request body remain.

Headers usually contain service information that allows you to process a request correctly. For example, the Content-Type: application/json header implies that our request body should be interpreted as a json object. Also, quite often, authorization information is passed in the headers. As in the case of Line, the Authorization: Bearer {channel access token} header contains the authorization token of the bot on behalf of which messages will be sent.

The request body usually contains the information we want to pass on to the service. In our case, this will be the subject and body of the event in Zabbix.

Checking the service API

The documentation is good, but it is necessary to check that everything we read works exactly how it is documented. It is not uncommon that a service can be developed faster than the documentation can keep up with it. So field testing never hurts. Excluding unexpected behavior will significantly reduce the time spent searching for problems.

I recommend using Postman to work with API requests – a handy tool that saves time. But for this article, we will use cURL due to its prevalence and ease of use.

I will not describe the process of creating the Line Bot API token because this is not directly related to the article. However, for those interested in this process, I will leave a link here.

As we have already found out, the request type will be POST, the endpoint URL is https://api.line.me/v2/bot/message/push, and additional headers must be passed: Content-Type: application/json, which specifies the type of data to be sent (in our case JSON), and Authorization: Bearer {token value}. The messages themselves are in JSON format. For example, I used two messages – “Hello, world1” and “Hello, world2”. As a result, I got the following request:
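
An equivalent request with cURL looks roughly like this (the channel access token and the recipient ID are placeholders):

curl -X POST https://api.line.me/v2/bot/message/push \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <channel access token>' \
  -d '{
        "to": "<user or group ID>",
        "messages": [
          {"type": "text", "text": "Hello, world1"},
          {"type": "text", "text": "Hello, world2"}
        ]
      }'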

After executing the request, we got the expected result of 2 messages that came to the messenger, which were in the request body.

Excellent! So half of the work has already been done: there is a ready-made request that works in manual mode and successfully sends messages to Line. The only thing left is to put the necessary information in the right places and automate the process using JS and Zabbix.

Integration with Zabbix

After successfully completing the tests, go to Zabbix, create a new notification method in the Administration section, select the webhook type, and name it Line.

For webhook integrations with external services, Zabbix uses a built-in JavaScript engine based on Duktape. Parameters are passed to the script, which implements the logic of the webhook. The script can return tags that will be assigned to the event. This is usually needed for integrations with service desks, in order to be able to update the status of tickets.

Let’s take a closer look at the webhook setup interface.

The Media type section contains the general settings for the new media type:

  • Name – Name of the media type.
  • Type – The type of media type. There are 4 types: email, SMS, webhook, and script.
  • Parameters – This is a list of variables passed to the code. All necessary data can be passed through parameters: event id, event type, trigger severity, event source, etc. You can specify macros and text values in parameters. The parameters are passed as a JSON string, accessible through the built-in variable value.
  • Script – JS script describing the logic of the webhook.
  • Timeout – The time after which the script will be terminated.
  • Process tags   – If this option is enabled, the webhook will support generating tags for events sent using this hook.
  • Include event menu entry – This option makes the Menu Entry Name and Menu Entry URL fields available for use.
  • Menu entry name – The text displayed in the event dropdown menu for the Menu entry URL submitted using this hook.
  • Menu entry URL – A link to an external resource in the event menu.
  • Description – A text field that contains a description of the notification method.
  • Enabled – an Option that allows enabling or disabling the media type.

The Message templates section contains templates that are used by webhook to send alerts. Each template contains:

  • Message type – The event type to which the message will apply. For example, Problem – when the trigger fires and Problem recovery – when the problem is resolved.
  • Subject  – The headline of the message.
  • Message – A message template that contains useful information about the event. For example, event time, date, event name, host name, etc.

The Options section contains additional options:

  • Concurrent sessions – The number of concurrent sessions to send an alert.
  • Attempts – The number of retries in case of send failure.
  • Attempt interval  – The frequency of attempts to send an alert.

When writing your own webhook, you can take an existing one as a basis – Zabbix has more than thirty ready-made webhook solutions of varying complexity. All basic functions are usually repeated from hook to hook with little or no change at all, as are the parameters passed to them.

Let’s set the following parameters:
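
A sketch of how these parameters could be named is shown below – the names themselves are arbitrary and only have to match what the script later reads from the params object:

alert_message:        {ALERT.MESSAGE}
alert_subject:        {ALERT.SUBJECT}
bot_token:            <Line channel access token>
event_id:             {EVENT.ID}
event_nseverity:      {EVENT.NSEVERITY}
event_source:         {EVENT.SOURCE}
event_update_status:  {EVENT.UPDATE.STATUS}
event_value:          {EVENT.VALUE}
send_to:              {ALERT.SENDTO}
trigger_description:  {TRIGGER.DESCRIPTION}
trigger_id:           {TRIGGER.ID}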

It is convenient to set parameter values with macros. A macro is a variable in Zabbix that contains a specific value. Macros allow you to optimize and automate your work. They can be used in various places, such as triggers, filters, alerts, and so on.

A little more about each macro separately in order to understand why each of them is needed:

  • {ALERT.SUBJECT} – The subject of the event message. This value is taken from the Subject field of the corresponding Message template type.
  • {ALERT.MESSAGE} – The event message body. This value is taken from the Message field of the corresponding Message template type.
  • {EVENT.ID} – The event id in Zabbix. Could be used for generating a link to an event
  • {EVENT.NSEVERITY} – The numerical definition of the event’s severity from 0-5. We will use this to change the message in case of different severity.
  • {EVENT.SOURCE} – The event source. Needed to handle events correctly. In most cases, we are interested in triggers; this corresponds to source value 0.
  • {EVENT.UPDATE.STATUS} – Returns 1 if it is an update event. For example, in case of acknowledge operations or a change in severity.
  • {EVENT.VALUE} – The event state. 0 for recovery and 1 for the problem.
  • {ALERT.SENDTO} – The field from the media type assigned to the user. It returns the ID of the user or group in the Line, where it will be necessary to send a message
  • {TRIGGER.DESCRIPTION} – A macro that will be expanded if the event source is a trigger. Returns the description of the trigger
  • {TRIGGER.ID} – The trigger ID. Required to generate a link to an event in Zabbix

Webhooks can use other macros if needed. A list of all macros can be viewed on the documentation page. Be careful – not all macros can be used in webhooks.

Writing the script

Before writing the script, let’s define the main points that the webhook will need to be able to perform:

  • the script should describe the logic for sending messages
  • handle possible errors
  • logging for debugging

I will not describe the entire code in order not to repeat the same type of blocks and concentrate only on important aspects.

To send messages, let’s write a function that will accept messages and params variables. We got the following function:

function sendMessage(messages, params) {
    // Declaring variables
    var response,
        request = new HttpRequest();

    // Adding the required headers to the request
    request.addHeader('Content-Type: application/json');
    request.addHeader('Authorization: Bearer ' + params.bot_token);

    // Forming the request that will send the message
    response = request.post('https://api.line.me/v2/bot/message/push', JSON.stringify({
        "to": params.send_to,
        "messages": messages
    }));

    // If the response is different from 200 (OK), return an error with the content of the response
    if (request.getStatus() !== 200) {
        throw "API request failed: " + response;
    }
}

Of course, this is not a reference function, and depending on the requirements the request may differ: there may be other required headers and a different request body. In some cases, it may be necessary to add an extra step to obtain authorization data through another API request.

In this case, the request to send a message returns an empty {} object, so it makes no sense to return it from the function. But for example, when sending a message to Telegram, an object with data about this message is returned. If you pass this data to tags, you can write logic that will change the already sent message – for example, in case of closing or updating the problem.

Now let’s describe a function that will accept webhook parameters and validate their values. In the example, we will not describe all the conditions because they are of the same type:

function validateParams(params) {
    // Checking that the bot_token parameter is a string and not empty
    if (typeof params.bot_token !== 'string' || params.bot_token.trim() === '') {
        throw 'Field "bot_token" cannot be empty';
    }

    // Checking that the event_source parameter is only a number from 0-3
    if ([0, 1, 2, 3].indexOf(parseInt(params.event_source)) === -1) {
        throw 'Incorrect "event_source" parameter given: "' + params.event_source + '".\nMust be 0-3.';
    }

    // If an event of type "Discovery" or "Autoregistration" set event_value 1, 
    // which means "Problem", and we will process these events same as problems
    if (params.event_source === '1' || params.event_source === '2') {
        params.event_value = '1';
    }

    ...

    // Checking that trigger_id is a number and not equal to zero
    if (isNaN(params.trigger_id) && params.event_source === '0') {
        throw 'field "trigger_id" is not a number';
    }
}

As you can see from the code, in most cases these are simple checks that allow you to avoid errors associated with the input data. Validation is necessary because there is no guarantee that the expected value will be in the parameter.

The main block of code is placed inside the try…catch block in order to correctly handle errors:

try {
    // Declaring the params variable and writing the webhook parameters to it
    var params = JSON.parse(value);

    // Calling the validation function and passing parameters to it for verification
    validateParams(params);

    // If the event is a trigger and it is in the problem status, compose the message body
    if (params.event_source === '0' && params.event_value === '1') {
        var line_message = [
            {
                "type": "text",
                "text": params.alert_subject + '\n\n' +
                    params.alert_message + '\n' + params.trigger_description
            }
        ];
    }

    ...

    // Sending a composed message
    sendMessage(line_message, params);

    // Returning OK so that the webhook understands that the script has completed with OK status
    return 'OK';
}
catch (err) {
    // Adding a log function so in case of problems you can see the error in the Zabbix server console
    Zabbix.log(4, '[ Line Webhook ] Line notification failed : ' + err);

    // In case of an error, return it from the webhook
    throw 'Line notification failed : ' + err;
}

Here we assign parameter values to the params variable, then validate them using the validateParams() function, describe the main conditions for generating a message, and send this message to the messenger. At the same time, the try…catch block allows you to catch all errors, log them to Zabbix and return them in a readable form to the user in the web interface.

For writing webhooks in Zabbix, there is a guideline dedicated to this topic. Please read this information because it will help you write better code and avoid common mistakes.

Testing

After we’ve finished with the webhook script, it’s time to test how our code works. To do this, Zabbix provides a function to send test messages. Go to Administration → Media types, find Line, and click on the Test button opposite it. In the window that appears, fill in all the fields with the necessary data and press the Test button. Check the messenger and see that the message came with the data we specified in the test.

Ready-made Line integration can be found in the Zabbix git repository and in all recent Zabbix instance builds.

Troubleshooting

Of course, everything in the article looks like I did it on the first attempt and did not encounter a single error or problem. Naturally, this is not the case in practice. Work with each new product includes Research & Development. How can you catch errors and, most importantly, understand the problem?

Well, as I wrote earlier – read the documentation and test all requests before writing code. At this stage, it is easiest to catch all the problems. The response to the HTTP request will explicitly describe the error. For example, if you make a mistake in the request body and send an object with incorrect values, the service will return the body with an error description and the response status 400 (Bad request).

There are several options for debugging in case of errors that may occur when writing a webhook script:

  • Focus on the errors displayed when the notification method is executed. For example, if you mistyped or set the wrong name of the function and variable.
  • Include logging in the code for displaying service information. For example, while you are in the script development stage, the result of the function can be logged using the Zabbix.log() function. Zabbix supports 6 debug levels (0-5), which can be set in this function. Usually, webhooks use level 4, which contains information for debugging.
  • Use the zabbix_js utility. You can pass it a file with a script and the parameters (see the example below). You can read more about it here.
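
For example, a local test run of the script might look like this (a sketch – the file name and all parameter values are placeholders):

zabbix_js -s line_webhook.js -l 4 \
  -p '{"bot_token":"<token>","send_to":"<user ID>","event_source":"0","event_value":"1","event_nseverity":"3","event_update_status":"0","event_id":"42","trigger_id":"1","trigger_description":"","alert_subject":"Test problem","alert_message":"This is a test"}'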

Conclusion

I hope this article has helped you better understand how webhooks work in Zabbix and highlighted the basic steps for creating, diagnosing, and preparing to write your integration. The Zabbix community is constantly adding custom templates and media types. I expect that after reading this article, more people will be interested in creating their own webhooks and sharing them with the community. We appreciate any contribution to the development and expansion of the base of integration solutions.

Questions

Q: I don’t know JS, but I know other languages. Is native support of other languages planned in Zabbix, such as Python?

A: For now, there are no such plans.

Q: Are there any restrictions with writing a JS script for a webhook?

A: Yes, there are. The built-in Duktape engine is used to execute the code, and it does not have all the functionality that is available in the latest JS releases. Therefore, I recommend that you read the documentation of this engine and the built-in objects to learn more about the available methods.

Deep dive into Zabbix Frontend Modules

Post Syndicated from Evgeny Yurchenko original https://blog.zabbix.com/deep-dive-into-zabbix-frontend-modules/24863/

Zabbix gives every community member the ability to extend their frontend functionality by writing their own frontend modules. In this video, we will go through the steps required to write a Zabbix frontend module and look at multiple code examples that will explain the steps behind successfully implementing a custom frontend module. The article is based on a Zabbix Summit 2022 speech by Evgeny Yurchenko.

Why?

Zabbix 5.0 introduced a pretty cool feature called “Frontend Modules” (or web modules) that lets anybody extend Zabbix WebUI (add new menu items, modify the behavior of existing menu elements, or even delete some menu items completely). We see constant growth in the number of Modules created, but not much has been written on how to efficiently write your own. This article tries to give you as much detail as possible on how the Modules subsystem is implemented in Zabbix, which should help you understand how Modules function and thus ease the process of writing your own.

Disclaimer

  • Information in this presentation is the result of Zabbix source code analysis and is in no way a replacement for, but rather an addition to, the official documentation:
 https://www.zabbix.com/documentation/current/en/manual/modules
  • Modules can be harmful as they work in Zabbix WebUI process space and have the same level of access to the Zabbix database as Zabbix web UI itself. People will have to trust the modules you are developing so be careful.
  • Errors in your module code may crash the frontend. Currently, there is no version compatibility check during module installation; keep that in mind and implement strict versioning of these modules, clearly stating what version of the module works with what version of Zabbix.
  • All the references to code in this article assume Zabbix 6.0.

MVC framework

Zabbix WebUI is built on a so-called “Model-View-Controller” (MVC) framework. The concept is quite old, and you can find a lot of information about MVC on the Internet. The following picture depicts the execution flow of every Zabbix WebUI request (every user’s click on a menu item, a button like “Apply” or “Filter”, etc.):

The HTTP request from the user’s browser is accepted by the Controller component and analyzed. Based on the action parameter received, the Controller decides how to serve this request. The Controller (optionally) “talks” to the Model component to get the needed data (usually) from the Zabbix database, then massages it to ultimately create the set of data to be returned to the end user. Then (again, if needed) the Controller “talks” to the View component, asking it to prepare the data for the user’s consumption (in essence, make it look nice and clean in the browser – thus the View component in most cases generates HTML/CSS/JavaScript). And once the Controller has everything to send back to the user, it returns an HTTP response to the end user’s browser.

I prefer to think that Zabbix Frontend Modules cover the Controller and View components’ functionality, as it’s a good idea to consume data available in Zabbix by re-using what’s already implemented in the Zabbix WebUI code (a huge number of classes with methods delivering any data you want), though strictly speaking you can implement your own data feed in your Module.

Fronted module example

In this blog post, I’ll be using https://github.com/BGmot/zabbix-module-hosts-tree as an example; it brings a brand new main menu element, Monitoring -> Hosts tree, showing the end user a hierarchy of host groups instead of the flat list of hosts that the default Zabbix installation offers in Monitoring -> Hosts.

Refer to the official Zabbix documentation on how to install modules – I won’t waste any time here on what is perfectly well documented. After this module is installed, we’ll have the following files:

/usr/share/zabbix/modules/zabbix-module-hosts-tree# tree
.
|-- Module.php
|-- actions
|   |-- CControllerBGHost.php
|   |-- CControllerBGHostView.php
|   `-- CControllerBGHostViewRefresh.php
|-- manifest.json
|-- partials
|   |-- js
|   |   `-- monitoring.host.view.refresh.js.php
|   `-- module.monitoring.host.view.html.php
`-- views
    |-- js
    |   `-- monitoring.host.view.js.php
    |-- module.monitoring.bghost.view.php
    `-- module.monitoring.bghost.view.refresh.php

WebUI application

As mentioned earlier, Zabbix WebUI is just a PHP application – yes, a sophisticated and very complex one, but still a PHP application – which means that after every user click (and upon some timeouts) it initializes, executes, and terminates, producing some output (in most cases) that is passed to the user’s browser.

The top level of this application is an object of the APP class, which inherits from ZBase adding literally nothing; it is declared in the file ./include/classes/core/APP.php:

class APP extends ZBase {
}

Application starts with ZBase::run() method, file ./include/config.inc.php:

APP::getInstance()->run(APP::EXEC_MODE_DEFAULT);

Method run() does many things, but in the light of this blog post it is important to mention these two:

  • router initialization
  • modules initialization

Router initialization

The so-called “Router” is a crucial part of the Controller component of MVC that drives a decision on how to handle a user’s request. The Router works based on an associative array $routes which is defined in ./include/classes/mvc/CRouter.php. Here is a code snippet illustrating how this array is organized:

private $routes = [
// action                   controller                        layout             view
'action.operation.get'  => ['CControllerActionOperationGet',  'layout.json',     null],
'audit.settings.update' => ['CControllerAuditSettingsUpdate',  null,             null],
'dashboard.view'        => ['CControllerDashboardView', 'layout.htmlpage',   'monitoring.dashboard.view'],

As you can see in this array for every action three elements are defined:

  • controller (a class that will be used to prepare data)

  • layout (in which form to present data generated by the controller)

  • view (how to present/show the data to end-user/requestor)

We will talk about each of these components later. For now, just remember that the controller here is a class name, while layout and view are names of .php files that are included (i.e. executed) at a certain point of the WebUI application execution. So Router initialization is basically the instantiation of the CRouter class, making the $routes array available to other classes via CRouter's methods.

Modules initialization

This is the moment the WebUI application goes through all enabled Modules in the system, calling their init() method. Here in ZBase::run() the init() methods are called, and all the actions from the enabled modules are added to the Router, file ./include/classes/core/ZBase.php:

$this->initModuleManager();
$router = $this->component_registry->get('router');
$router->addActions($this->module_manager->getActions());
$router->setAction($action_name);

Your module must be a child of the Core\CModule class defined in ./include/classes/core/CModule.php. You can redefine the init() method to fit your needs. Since the init() method of every enabled Module is called, it is a perfect place to add new menu items to the Zabbix WebUI main menu.

We talked about the Router in the previous section; your module’s manifest.json file defines what needs to be added to this Router for your Module to function properly. If you define an action in your Module that already exists in out-of-the-box Zabbix, then the Module’s action overwrites the “default” entry in the $routes array – thus you can re-define the behavior of any menu item.

Overall what happens during the module initialization is easier to describe by following this picture:

So we added a new menu item, “Hosts tree”, under “Hosts” in the main menu. When a user clicks on this menu item, a request is generated with the parameter “action” set to “bghost.view”, and now the Router “knows” that to serve this action it needs to use the CControllerBGHostView class as the Controller and the module.monitoring.bghost.view.php file (with the default layout.htmlpage layout) to generate the HTML code.
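
For reference, the actions part of such a module’s manifest.json could look roughly like the sketch below (the field values are approximated from the class and view names mentioned above – check the actual module repository for the authoritative file):

{
    "manifest_version": 1.0,
    "id": "zabbix-module-hosts-tree",
    "name": "Hosts tree",
    "namespace": "BGmotHosts",
    "version": "1.0",
    "actions": {
        "bghost.view": {
            "class": "CControllerBGHostView",
            "view": "module.monitoring.bghost.view",
            "layout": "layout.htmlpage"
        }
    }
}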

Processing request

After initialization is done, the ZBase::processRequest() method is called (file ./include/classes/core/ZBase.php), passing the initialized Router (let’s assume the user selected the Monitoring -> Hosts tree menu item implemented in my module):

$this->processRequest($router);
...
private function processRequest(CRouter $router): void {
  $action_name = $router->getAction();        // returns “bghost.view”
  $action_class = $router->getController();   // “Modules\BGmotHosts\Actions\CControllerBGHostView”
  ...
  $action = new $action_class();              // Controller defined for this action is instantiated
  ...
  register_shutdown_function(function() use ($action) {
    $this->module_manager->publishEvent($action, 'onTerminate');
  });

As we can see it gets the action name and action class name from the Router and then instantiates in variable $action a Controller class that will handle the requested action.

Then it registers the onTerminate() methods of all enabled modules – this means that before the PHP execution exits, all these functions will be executed regardless of which menu item the user selected. One important caveat: if the user selected an action covered by your module, then your module’s onTerminate() method will be executed last (later than all other modules’ onTerminate() methods), right when the WebUI PHP application is about to finish its execution (i.e. exit). The onTerminate() method of the base class Core\CModule is empty, and you can redefine it in your module; e.g., here is how I am adding my JavaScript to every page that generates HTML, ./modules/zabbix-module-menu/Module.php:

public function onTerminate(CAction $action): void {
  $action_page = $action->getAction();
  $router = clone APP::Component()->get('router');
  $layout = $router->getLayout();
  if ($action_page) {
    if ($action_page != 'jsrpc.php' &&
        $layout != 'layout.widget' &&
        $layout != 'layout.json') {
      echo '<script type="text/javascript">';
      echo file_get_contents(__DIR__.'/js/bg_menu.js');
      echo '</script>';
    }
  }
}

This approach works well only with layout.htmlpage layout as it just adds whatever you output here right before closing </body> tag in the final HTML page returned to browser.

Keep in mind we are still in ZBase::processRequest() method and now it executes all enabled modules’ onBeforeAction() methods:

$this->module_manager->publishEvent($action, 'onBeforeAction');

Again if a user selected an action covered by your module then your module’s onBeforeAction() method will be executed last. You can define onBeforeAction() method in your module this way:

public function onBeforeAction(CAction $action): void {
...
}

Processing request – Controller

Now ZBase::processRequest() passes execution to our Controller (remember $action was instantiated with our Controller class):

 $action->run();

All controllers have CController as a parent class. The run() method is defined in ./include/classes/mvc/CController.php and you cannot re-define it in your controller; it does some standard checks, validates the input, and if everything is OK, passes control directly to your controller’s doAction() method and returns the result:

final public function run(): ?CControllerResponse {
  if ($this->checkInput()) {
    $this->doAction();
  return $this->getResponse();

Use your Controller’s checkInput() method to check whether all the parameters passed in the HTTP request are valid. You can implement it like this (see ./modules/zabbix-module-hosts-tree/actions/CControllerBGHostView.php):

 class CControllerBGHostView extends CControllerBGHost { 
        protected function checkInput(): bool {
                $fields = [             
                       'name' =>       'string',
                       'groupids' =>   'array_id',
                       'status' =>     'in -1,'.HOST_STATUS_MONITORED.','.HOST_STATUS_NOT_MONITORED,
                ...
                $ret = $this->validateInput($fields);
                return $ret;

All possible validation rules are defined in ./include/classes/validators/CNewValidator.php. Just take a peek and select the proper validation rule:

 # grep "case '" ./include/classes/validators/CNewValidator.php 
                                case 'not_empty':
                                case 'json':
                                case 'in':
                                ...
                                case 'array_id':
                                ...

If the checkInput() method returns true (the input is valid), then your Controller’s doAction() is called. You can go as fancy as you want preparing data: executing internal Zabbix functions (the Model component of MVC), performing selects/updates in the database directly from your code, talking to other APIs, etc. At the end you need to prepare one big associative array (usually named $data) and return it as shown here (./modules/zabbix-module-hosts-tree/actions/CControllerBGHostView.php):

 protected function doAction(): void {
     ...
     $data = [ ... ];
     $response = new CControllerResponseData($data);
     $response->setTitle(_('Hosts'));
     $this->setResponse($response);
 }

Whatever you place into the associative array $data will be available later in your View code.

Processing request – View

The final step in ZBase::processRequest() is calling one more method that in fact handles the layout and view you defined in the Router for this action:

$this->processResponseFinal($router, $action);

The name of your view file is what you put into the view field for your action in manifest.json + .php extension and must be in the views/ folder.

First, it fills in the layout data with defaults and if a view is defined for given action then it constructs an instance of CView class (defined in ./include/classes/mvc/CView.php) passing data you prepared in Controller to CView constructor:

private function processResponseFinal(CRouter $router, CAction $action): void {
  ...
  if ($router->getView() !== null && $response->isViewEnabled()) {
    $view = new CView($router->getView(), $response->getData());

The CView constructor just tries to find a file with the name of your View plus the .php extension, e.g. module.monitoring.bghost.view.php, and, if found, initializes two member variables: $name and $data.

public function __construct($name, array $data = []) {

Everything you implement in your View .php file will feed the $layout_data variable.

$layout_data = array_replace($layout_data_defaults, [
  'main_block' => $view->getOutput(),
  'javascript' => [ 'files' => $view->getJsFiles() ],
  'stylesheet' => [ 'files' => $view->getCssFiles() ],
  'web_layout_mode' => $view->getLayoutMode()
]);

CView::getOutput() simply performs a PHP include of your view .php file, so whatever you “print” in your view will be assigned to $layout_data['main_block'], file ./include/classes/mvc/CView.php:

public function getOutput() {
  $file_path = $this->directory.'/'.$this->name.'.php';

  ob_start();

  if ((include $file_path) === false) {
    ...
  }

  return ob_get_clean();
}

If you have a lot of time and want to produce a real “piece of art” web page then you can print pure HTML/CSS in your view .php file, but it will most probably not look like the “Zabbix native style”. For example:
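For illustration, a view .php file could simply print raw markup like this (a purely hypothetical snippet; everything printed here ends up in the page as-is):

 # ./views/module.monitoring.bghost.view.php (hypothetical plain-HTML version)
 <h1>My hosts</h1>
 <table border="1">
     <tr><th>Host</th><th>Status</th></tr>
     <tr><td>web-server-01</td><td>Enabled</td></tr>
 </table>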

Fortunately, there is an easy way to make your web pages look elegant and totally “Zabbix'ish”: use the Zabbix HTML classes. For example, to add a table to your page use the CTableInfo class, prepare an array of CColHeader elements and add them with CTableInfo::setHeader(), then construct rows with the CRow class and add them via CTableInfo::addRow(), and so on.
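As a hedged sketch (class and method names are taken from the ./include/classes/html/ directory listed below; the row data is made up), the same table built with Zabbix classes could look like this:

 # ./views/module.monitoring.bghost.view.php (sketch using Zabbix HTML classes)
 $table = (new CTableInfo())
     ->setHeader([_('Host'), _('Status')]);

 $table->addRow(new CRow(['web-server-01', _('Enabled')]));
 $table->addRow(new CRow(['db-server-01', _('Disabled')]));

 $table->show();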

Look at the Zabbix source code for examples of how to use them. You can see the list of all out-of-the-box classes here:

# ls -1 ./include/classes/html/
CActionButtonList.php
CBarGauge.php
CButton.php
...

Since we passed all the data generated by your Controller to your View object, it is very easy to use the data in your View code, e.g., in Controller you do the following:

 # ./modules/zabbix-module-hosts-tree/actions/CControllerBGHostView.php
 protected function doAction(): void {
     $data = [ 'hosts_count' => API::Host()->get(['countOutput' => true]) ];
     $response = new CControllerResponseData($data);
     $this->setResponse($response);
 }

And in View you use this data:
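A minimal sketch of the corresponding View could look like this (whatever the view “prints” ends up in main_block; $data is the array prepared by the Controller above):

 # ./views/module.monitoring.bghost.view.php (sketch)
 echo 'Total number of hosts: '.$data['hosts_count'];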

If you want to add some CSS files to the page, do it in your View code this way:

 # ./views/module.monitoring.bghost.view.php
 $this->addCssFile('modules/zabbix-module-hosts-tree/views/css/mycool.css');

Your CSS file should not contain <style> tags, just pure CSS:

 # ./views/css/mycool.css
 .list-table thead th {color: #ff0000;}

Interestingly enough, you cannot add JavaScript files from your module folder using the similar addJsFile() method. You can use this method only for the JS files that come with Zabbix and are located in the root /js folder, e.g.:

 # ./views/module.monitoring.bghost.view.php
 $this->addJsFile('multiselect.js');

To include your own JavaScript code, use the includeJsFile() function shown below. Note that it takes a .php file that must generate JavaScript code by “printing” it, not a .js file containing pure JavaScript code (a sketch is shown after the notes below).

 # ./views/module.monitoring.bghost.view.php
 $this->includeJsFile('monitoring.host.view.js.php', $data);

– Again: this .php file must “print” JavaScript code and will be searched for in the ./js
 subfolder relative to the view file this function is invoked from.
– A copy of $data variable will be available for use within the file!
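A minimal hedged sketch of such a file, assuming the <script> wrapper style used by the stock Zabbix js.php views and reusing the hypothetical hosts_count value from the earlier example:

 # ./views/js/monitoring.host.view.js.php (sketch)
 <script type="text/javascript">
     // $data prepared by the Controller is available here
     console.log('Hosts count: <?= (int) $data['hosts_count'] ?>');
 </script>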

So all the HTML code you generated went into $layout_data['main_block'], all JavaScript files that need to be included went into $layout_data['javascript'] and all CSS files that need to be included went into $layout_data['stylesheet'] (see the $layout_data variable initialization above in this article).

The last thing ZBase::processResponseFinal() does is instantiate a new CView object with the layout .php file (you define the layout for every action in the Router, remember?) and “print” everything according to the selected layout, file ./include/classes/core/ZBase.php:

echo (new CView($router->getLayout(), $layout_data))->getOutput();

That is it! At this point, everything you “printed” (don’t forget the onTerminate() function) is returned as an HTTP response to the user’s browser.

Wrapping up this article, I must tell you about one more thing – module configuration. In your module's manifest.json file you can have a “config” section – below you can see how it is reflected in the Zabbix database:

I would not mention this, but there is one interesting caveat: the “config” section of the manifest.json file is copied into the database only when the module is discovered for the first time during the modules directory scan. Changing the manifest.json file later has no effect; the module's config is always taken from the database, so if you want to change something you either need to wipe the module first or make the changes directly in the database.
Access this config data in your code with $this->getConfig();
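As a hedged illustration (the option name below is purely hypothetical), the config section and its usage could look like this:

 # manifest.json (fragment)
 "config": {
     "refresh_interval": 60
 }

 # In your PHP code (wherever $this->getConfig() is available, as noted above)
 $config = $this->getConfig();
 $interval = $config['refresh_interval'];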

Good luck developing Web Modules!
Your BGmot.

Real Life Business Service Monitoring

Post Syndicated from Alexander Petrov-Gavrilov original https://blog.zabbix.com/real-life-business-service-monitoring/24915/

Learn more about Zabbix business service monitoring features and check out our real-life use cases. The article is based on a Zabbix Summit 2022 speech by Aleksandrs Petrovs-Gavrilovs.

Business service monitoring with Zabbix

Hello everyone, my name is Alex and today I am going to write about Advanced business service and SLA monitoring and the related use cases.

Some of you may already be familiar with business services and the core idea behind them. In the vast majority of organizations, we have services that we provide to our customers and/or for internal use. The availability of those services is usually based on hardware, software, or people's presence and availability.

But no matter how well our monitoring is configured, there are times when we can miss how each specific device affects our business and that is where business service monitoring can help us.

With the help of business service monitoring it is possible to see what exactly is going on with your business depending on the state of every single part of your infrastructure. This allows us, the admins and service owners, to understand what it really means when a piece of hardware breaks or a device becomes unreachable. With business service monitoring, we see what exactly impacts our business and how severe the situation is, including calculating SLA (Service Level Agreement) and evaluating it against the defined SLO (Service Level Objective).

Business service hierarchy example

So let's check out some examples of what real-life business services may look like.

An average service tree example

In this example, we have a service tree that is based on support services. It has phones, the phones are plugged into a PBX, and the PBX is plugged into a switch. And this is just one example; in reality, we could have a more complex infrastructure consisting of containers, CRM services, and so on. We of course monitor all of them, but what if we are interested in the business perspective as well?

To see the business perspective we need to go to the new Services section in the main menu, where we can create and view the service tree itself. In addition, in the same section, we can configure actions, which enable us to react when something happens to one of the services.
We can also specify the SLO we strive to achieve and see SLA reports on the current situation.

Basic service overview

The service view also lets us see if we have problems affecting our services and track their root cause.

Service with active problems

Defining which service is affected by which problem is done by utilizing problem tags, which essentially link them together. Services can also have their own tags, which we use to group services and understand how one service relates to another. We can also use service tags to build an SLA report or execute actions in case a service is affected by a problem. Permissions, too, are based on service tags, allowing the creation of different views for different users.
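For reference, here is a hedged sketch of how such tag-based linking might look when creating a service through the Zabbix 6.0 API (parameter names per the service.create method; the tag names and values are examples only):

{
    "jsonrpc": "2.0",
    "method": "service.create",
    "params": {
        "name": "E-mail service",
        "algorithm": 2,
        "sortorder": 1,
        "problem_tags": [
            {"tag": "service", "operator": 0, "value": "email"}
        ],
        "tags": [
            {"tag": "department", "value": "it"}
        ]
    },
    "auth": "<api token>",
    "id": 1
}

Any problem event carrying the tag service:email would then affect this service, and the service tag department:it could be used for grouping, permissions, reports, and actions as described above.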

But those are just the basics – what’s more interesting are the actual use cases. Let’s take a look at how Zabbix users actually use business service monitoring to their advantage based on real business examples. 

Business service tree for a financial institution

Real business service use cases can be helpful examples when designing your own Zabbix business service trees. Maybe you already have a similar business of your own and you need that starting point for everything to “click” – that starting point can be a real-life example.

Example service tree of a bank

The first example will seem a bit convoluted while actually being very straightforward. Here we can see an actual financial institution's business service tree. You can see they have quite a lot of different interconnected services. A first look at the raw service tree schema may be a bit confusing, but in Zabbix it's pretty straightforward.

The internal service is connected to emails, and emails are related to customer services at the same time, since we do need to communicate with the customers, not only internally! In addition, we also have to define services representing the underlying systems and applications which support our email services. That is easy to do with Zabbix services.

Easy to read e-mail service state

Imagine now, if you don’t have the services functionality at all, how fast can you check the status of the email service when all you have is only a list of problems for multiple devices? How can you check the service statistics for an entire year? That was the question that the service owners and administrators had in this use case and they solved it by defining Zabbix business service trees.

The “root” service

We start by defining the root business service – Financial institution. It is linked to 15 main services. The 15 services are grouped into internal and external ones. The lower-level services also contain the sub-services that the main services are based on. For example, we have an Accounting service based on the availability of a specific VM where the accounting software resides.

Detailed service tree

The services are divided into specific categories so the service owners can read the situation a lot easier without spending a lot of time figuring out which problem causes which situation. With a single click, the service owners can immediately see which components or child services each service is based on and the actual service SLA. This also gives the benefit of displaying the root cause problem and being able to quickly identify which child services are causing issues with a particular business service.

Parent-Child service relationship

Don't forget that business service trees can be multi-level – child services can have their own child services, and services can also be interconnected with each other. For example, in the Parent-Child service relationship screenshot we can see that we have an Accounting service. Accounting uses Microsoft services. Microsoft services are also used internally. So what happens when Microsoft services stop working? We will know that accounting will be affected, the internal services will be affected, and we will see the exact chain of events – what and how exactly went wrong in the organization and which components need fixing.

Service state configuration

Services can have a varying impact on your business. Some services are more critical than others. Additional rules enable Zabbix to take the potential service impact into account. The first two additional rules analyze the percentage of affected child services and set the severity of the service problem accordingly.
But if the two most critical services are affected, that will immediately become a disaster. For example, online banking – you can imagine that any bank now has an online banking service, and if it goes down, all the customers will be affected; it could even hit the news, not just the monitoring system. So of course they want to know immediately about that kind of disaster, and with Zabbix services – they will. By defining additional rules and service weights, you can react to problems preemptively and fix the issues before they cause downtime for your end users.

SLA reporting

In Zabbix, we can choose for which periods the SLA should be calculated – daily, weekly, monthly, yearly, or a mixed selection of those. Based on our selection, we can see real-time reports for services and, for example, by the end of the year or the day, understand what needs the most attention and review the service performance. Or, to put it in a closer-to-reality example: find out from the accounting reports whether licenses were renewed in time, so that the software used by accounting is always available. We can also build a dashboard containing the reports, showing the current summary for the service, so the teams can plan, buy new software, renew licenses, get new hardware, and always stay ahead of whatever might happen.

Service state dashboard

Service permissions in user roles can be used to create different service views. This can be used to hide sensitive service information or simply display the services at the required level of detail. For example, a more detailed view can be provided for internal support users since they will need as much information as possible to fix any service-related issues. Separate views can be provided for Accounting and Management teams, showing only the relevant data to ensure a quick and reliable decision-making process. 

What if we want to make things even simpler for our Accounting and Management teams? We can use actions and the scheduled report functionality to deliver the required information to the users' mailboxes without having them periodically log into Zabbix.

Service permissions

Business service tree for an MSP

Another example is an MSP (managed service provider) service tree. This use case is encountered pretty frequently, and the tree is always easy to read, even in the raw schema view shown here:

Managed Service Provider service tree

We use a hosting company for our example. The company provides a particular set of services for its users. There are also some internal services that can also be used by the customers – for example, Zabbix itself.

Zabbix can be a great tool in MSP scenarios since it’s straightforward to provide customers with access to Zabbix and build a dashboard view with the latest statistics related to a particular user.
In this example, we can see the main service, which is hosting, divided across customers, where each customer is a branch of that tree, using the hosting services the company provides. We also see that monitoring is a service itself, because in this case customers also have the advantage of using Zabbix to get detailed information about the services they use and their current state, seeing the current SLA level for the servers they use and whether it matches expectations.

Customer overview

The MSP, of course, retains the full view of all customers. All customers are equally important and deserve a proper quality of service, so each customer is assigned an equal weight. As soon as any customer has a problem, the related service will be marked with a high-level severity on the service tree. This way, the MSP will immediately see which customer is affected, making it possible to assist them as quickly as possible.

If you have a bigger environment – maybe you have hundreds of customers – you may opt out of defining service weights in your configuration, since the number of services changes very rapidly. How can we react to global issues then?
We can use percentage rules instead of reacting to a static weight number. This way, we can check whether the problem is related to a single customer or whether it is something global that affects everyone.

Root cause view in the services will allow you to start fixing everything immediately. Meanwhile, each customer can be informed individually using the service actions and conditions. This should be easy to do if we have properly named or tagged the services.

Customer service configuration

Don't forget to define the permissions so that any customer, like Mooyani in this example, has access to their services immediately after login, ensuring that the information remains not only private but also relevant for the current user.

Customer view

All information for Customers can be placed on their personal dashboards where they can see all the details whenever they need to. Monitoring the traffic going through their VMs, resource usage, application statuses and any other monitored entities. Don’t forget that service SLA reports can also be placed on Zabbix dashboards. This way your customers can see that the MSP meets the terms defined in the agreement and everything is performing as expected. 

To summarize: monitoring your infrastructure is great from any perspective, including the business perspective. As an MSP, it's always a good idea to provide this view to your customers, so they can see that you meet the standards you define for yourself and, of course, promise to your users.

Generating Zabbix Health CSV reports with custom frontend module

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/generating-zabbix-health-csv-reports-with-custom-frontend-module/24369/

Have you ever wanted a button to export specific data to CSV directly from Zabbix frontend? My friend Gregory and I have solved this problem.

In this blog post, I will show you how, with the help of a custom Zabbix frontend module, you can add an “Export to CSV” button and use it to export custom information. The module is applicable to Zabbix 5.0 LTS and Zabbix 6.0 LTS. As an additional benefit, we can use this module to polish our Zabbix instance, find misconfigurations, summarize the existing configuration, and locate potential bottlenecks.

Characteristics of the module

  • Define and store custom SQL queries in the user profile. It uses the “profiles” database table, specifically the columns “idx” and “value_str” to store the settings/data. We are not changing the database structure but using what’s available already.
  • Export an unlimited amount of rows via the CSV button.
  • Compatible with MySQL and PostgreSQL and general SQL syntax. Use the same language as you use directly in the SQL client.
  • SQL highlighter. It’s based on a well-known CodeMirror product. The Zabbix module is still completely standalone.
  • Protection to disallow running UPDATE, DELETE, INSERT, CREATE, ALTER, DROP commands.
  • Hyperlink support in the SQL output. Useful to generate a link used to navigate closer to the problem. Only hyperlinks which are related to zabbix sections (items.php, triggers.php) are supported.

What is it NOT?

  • The module does not provide an environment for “Zabbix User” or “Zabbix Admin” types of users to perform any reporting. Only users of “Zabbix Super Admin” type can use this module.

Install

Below you can find the sequence of commands used to download and install the “Export to CSV” module:

Navigate to default directory of frontend modules:
cd /usr/share/zabbix/modules
Pick and create a name of directory:
mkdir -p zabbix-module-sqlexplorer
Navigate inside:
cd zabbix-module-sqlexplorer
Download version 1.6 of the module:
wget https://github.com/gr8b/zabbix-module-sqlexplorer/releases/download/v1.6/v1.6.zip
Extract the archive:
unzip v1.6.zip
Remove the downloaded archive, as it is no longer needed:
rm v1.6.zip

Enable

Open “Administration” => “General” => “Modules” => click on “Scan directory“. After that, click “Enable“:

A new menu is available:

Now we can run an SQL query, see the result on the screen, and export the data to CSV:

Let's talk about 5 beneficial use cases that this module can enable. Here, Zabbix technical support engineers share a few frequently used SQL queries.

1) Ensure the basic data collection works

This section would be the basic minimum to maintain. In a perfect world, every host object (a device, a server) must be online 24/7. For each category (ZBX, SNMP, IPMI, JMX) we need to use a dedicated query. Let’s work with the 2 most popular categories – ZBX and SNMP.

Unreachable Zabbix agent (ZBX) hosts

Zabbix 5.0:

SELECT proxy.host AS proxy, hosts.host, hosts.error AS hostError, CONCAT('hosts.php?form=update&hostid=',hosts.hostid) AS goTo FROM hosts LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) WHERE hosts.status=0 AND LENGTH(hosts.error) > 0;

Zabbix 6.0:

SELECT proxy.host AS proxy, hosts.host, interface.error FROM hosts LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) JOIN interface ON (interface.hostid=hosts.hostid) WHERE LENGTH(interface.error) > 0 AND interface.type=1;

In the output, we receive a proxy object, host title, an error message, a clickable link to navigate immediately to the object.
For every row in the output, we need to either: 1) fix the issue (most likely a firewall/DNS/credential/timeout or network quality issue); 2) delete the host object; or 3) disable the host object.

Unreachable network (SNMP) devices

On Zabbix 5.0 use:

SELECT proxy.host AS proxy, hosts.host, hosts.snmp_error AS hostError, CONCAT('hosts.php?form=update&hostid=',hosts.hostid) AS goTo FROM hosts LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) WHERE hosts.status=0 AND LENGTH(hosts.snmp_error) > 0;

On Zabbix 6.0 use:

SELECT proxy.host AS proxy, hosts.host, interface.error, CONCAT('zabbix.php?action=host.edit&hostid=',hosts.hostid) AS goTo FROM hosts LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) JOIN interface ON (interface.hostid=hosts.hostid) WHERE LENGTH(interface.error) > 0 AND interface.type=2;

We receive the proxy title, the host object, the reason the host is not working, and a clickable link to the object.

2) Improve speed of frontend

Ensure we have the best response time for all sections in the GUI. If tables contain “unnecessary” data, the user experience will suffer. Nowadays, no one wants to spend longer than a few seconds waiting for data to be displayed on the screen.

Amount of user sessions

Zabbix 5.0:

SELECT COUNT(*) AS count, users.alias FROM sessions JOIN users ON (users.userid=sessions.userid) GROUP BY 2 ORDER BY 1 DESC;

Zabbix 6.0:

SELECT COUNT(*) AS count, users.username FROM sessions JOIN users ON (users.userid=sessions.userid) GROUP BY 2 ORDER BY 1 DESC;

The total number of sessions should not exceed 1000; it’s hard to imagine why it should be over 100. It is OK to delete all data in the “sessions” table and optimize/vacuum the table. This will improve the overall Zabbix GUI responsiveness and performance.
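For example, assuming you are fine with all frontend users being logged out, the cleanup could be as simple as the following, run directly in your SQL client (the module itself blocks DELETE statements). MySQL is shown first, PostgreSQL second:

-- MySQL
DELETE FROM sessions;
OPTIMIZE TABLE sessions;

-- PostgreSQL
DELETE FROM sessions;
VACUUM FULL sessions;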

Don’t keep too many open problems onboard

Every monitoring tool is about identifying problems. If you keep too many open problems, then the frontend will be slow. The following query will print trigger problems (the ones we receive in email) as well as internal problems, which reflect the health of the monitoring tool.

SELECT COUNT(*) AS count, CASE WHEN source=0 THEN 'surface' WHEN source>0 THEN 'internal' END AS level, CASE WHEN source=0 AND object=0 THEN 'trigger in a problem state' WHEN source=3 AND object=0 THEN 'cannot evaluate trigger expression' WHEN source=3 AND object=4 THEN 'data collection not working' WHEN source=3 AND object=5 THEN 'low level discovery not perfect' END AS problemCategory FROM problem GROUP BY 2,3 ORDER BY 2 DESC;

To decrease the number of “internal” problems, have a look at point 5 in this blog post.

3) Identify exceptions

Monitoring administrators can change item update frequency at a host level, can install a different trigger threshold at the host level, or install a different threshold inside a nested template tree. This section will highlight all overrides.

Item update interval differs between host/template levels

This query will print items and LLD rules with different update frequencies on the host level while comparing them with the template level. Most of the time, having a different update interval at the host level is done by accident.

SELECT h2.host AS Source, i2.name AS itemName, i2.key_ AS itemKey, i2.delay AS OriginalUpdateFrequency,h1.host AS exceptionInstalledOn, i1.delay AS FrequencyChild, CASE WHEN i1.flags=1 THEN 'LLD rule' WHEN i1.flags IN (0,4) THEN 'data collection' END AS itemCategory , CASE WHEN i1.flags=1 THEN CONCAT('host_discovery.php?form=update&context=host&itemid=',i1.itemid) WHEN i1.flags IN (0,4) THEN CONCAT('items.php?form=update&context=host&hostid=',h1.hostid,'&itemid=',i1.itemid) END AS goTo FROM items i1 JOIN items i2 ON (i2.itemid=i1.templateid) JOIN hosts h1 ON (h1.hostid=i1.hostid) JOIN hosts h2 ON (h2.hostid=i2.hostid) WHERE i1.delay <> i2.delay;

In the output, the “Source” column is a host object (a device) or a template object. It is closely related to the “exceptionInstalledOn” column: together, “Source” and “exceptionInstalledOn” tell you whether the relation is between a host and a template or between a template and another template.

The “FrequencyChild” column is the most important field: it describes the installed exception that differs from the original object.

The output allows navigating directly to the item where the update frequency stands out.

Connection characteristics and custom trigger thresholds

The query will show every installed override between a host and a template object or between a template and a parent template object. If template nesting is used at multiple levels, it will highlight an overriding value even if it is installed somewhere in the middle.

SELECT hm1.macro AS Macro, child.host AS owner, hm2.value AS defaultValue, parent.host AS OverrideInstalled, hm1.value AS OverrideValue FROM hosts parent, hosts child, hosts_templates rel, hostmacro hm1, hostmacro hm2 WHERE parent.hostid=rel.hostid AND child.hostid=rel.templateid AND hm1.hostid = parent.hostid AND hm2.hostid = child.hostid AND hm1.macro = hm2.macro AND parent.flags=0 AND child.flags=0 AND hm1.value <> hm2.value;

Abandoned items

A very lonely item that does not belong to any template.

SELECT hosts.host, items.key_,CONCAT('items.php?form=update&context=host&itemid=',items.itemid) AS goTo FROM items JOIN hosts ON (hosts.hostid=items.hostid) WHERE hosts.status=0 AND hosts.flags=0 AND items.status=0 AND items.templateid IS NULL AND items.flags=0;

To keep management centralized, it would be better to move the item definition to the template level and maintain it there.

4) Reporting

We can summarize an item and host configurations, enabled and disabled data collector items, and linked templates.

What data collection method is used?

To summarize all data collection techniques and see their distribution by Zabbix proxy, use the following query:

SELECT proxy.host AS proxy, CASE items.type WHEN 0 THEN 'Zabbix agent' WHEN 1 THEN 'SNMPv1 agent' WHEN 2 THEN 'Zabbix trapper' WHEN 3 THEN 'Simple check' WHEN 4 THEN 'SNMPv2 agent' WHEN 5 THEN 'Zabbix internal' WHEN 6 THEN 'SNMPv3 agent' WHEN 7 THEN 'Zabbix agent (active) check' WHEN 8 THEN 'Aggregate' WHEN 9 THEN 'HTTP test (web monitoring scenario step)' WHEN 10 THEN 'External check' WHEN 11 THEN 'Database monitor' WHEN 12 THEN 'IPMI agent' WHEN 13 THEN 'SSH agent' WHEN 14 THEN 'TELNET agent' WHEN 15 THEN 'Calculated' WHEN 16 THEN 'JMX agent' WHEN 17 THEN 'SNMP trap' WHEN 18 THEN 'Dependent item' WHEN 19 THEN 'HTTP agent' WHEN 20 THEN 'SNMP agent' WHEN 21 THEN 'Script item' END AS type,COUNT(*) FROM items JOIN hosts ON (hosts.hostid=items.hostid) LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) WHERE hosts.status=0 AND items.status=0 GROUP BY proxy.host, items.type ORDER BY 1,2,3 DESC;

Devices and linked templates

If one server runs as a database server, a web server, and an application server, there must be multiple templates linked. The following query can help to detect the linked templates.

MySQL:

SELECT proxy.host AS proxy, hosts.host, GROUP_CONCAT(template.host SEPARATOR ', ') AS templates FROM hosts JOIN hosts_templates ON (hosts_templates.hostid=hosts.hostid) LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) LEFT JOIN hosts template ON (hosts_templates.templateid=template.hostid) WHERE hosts.status IN (0,1) AND hosts.flags=0 GROUP BY 1,2 ORDER BY 1,3,2;

PostgreSQL:

SELECT proxy.host AS proxy, hosts.host, ARRAY_TO_STRING(ARRAY_AGG(template.host),', ') AS templates FROM hosts JOIN hosts_templates ON (hosts_templates.hostid=hosts.hostid) LEFT JOIN hosts proxy ON (hosts.proxy_hostid=proxy.hostid) LEFT JOIN hosts template ON (hosts_templates.templateid=template.hostid) WHERE hosts.status IN (0,1) AND hosts.flags=0 GROUP BY 1,2 ORDER BY 1,3,2;

Devices and all inventory fields

SELECT h.host,i.type,i.type_full,i.name,i.alias,i.os,i.os_full,i.os_short,i.serialno_a,i.serialno_b,i.tag,i.asset_tag,i.macaddress_a,i.macaddress_b,i.hardware,i.hardware_full,i.software,i.software_full,i.software_app_a,i.software_app_b,i.software_app_c,i.software_app_d,i.software_app_e,i.contact,i.location,i.location_lat,i.location_lon,i.notes,i.chassis,i.model,i.hw_arch,i.vendor,i.contract_number,i.installer_name,i.deployment_status,i.url_a,i.url_b,i.url_c,i.host_networks,i.host_netmask,i.host_router,i.oob_ip,i.oob_netmask,i.oob_router,i.date_hw_purchase,i.date_hw_install,i.date_hw_expiry,i.date_hw_decomm,i.site_address_a,i.site_address_b,i.site_address_c,i.site_city,i.site_state,i.site_country,i.site_zip,i.site_rack,i.site_notes,i.poc_1_name,i.poc_1_email,i.poc_1_phone_a,i.poc_1_phone_b,i.poc_1_cell,i.poc_1_screen,i.poc_1_notes,i.poc_2_name,i.poc_2_email,i.poc_2_phone_a,i.poc_2_phone_b,i.poc_2_cell,i.poc_2_screen,i.poc_2_notes FROM host_inventory i, hosts h WHERE i.hostid=h.hostid AND h.flags=0;

By default, it prints all columns. You can cut away the unnecessary ones. You can use the extracted data for analytics in other software, e.g., MS Excel.

External scripts in use

When migrating to the next release of Zabbix, it's better to be aware of all external scripts and ensure the new server will support them. An external script is a custom solution for how you collect data. For example, Python 2 is not available on newer operating systems out of the box.

SELECT items.key_,COUNT(*) FROM items JOIN hosts ON (hosts.hostid=items.hostid) WHERE hosts.status=0 AND items.status=0 AND items.type=10 GROUP BY 1 ORDER BY 2;

 

5) Polish your Zabbix instance

This section will require more work. Being a non-perfectionist is an advantage.

To work with internal events (data collection not working, trigger not working), we have to have at least one internal action enabled. It can be an action with a condition that will never be true:

Be careful: if you have a huge infrastructure and internal actions are currently disabled, don't just enable them! Or enable them only for about 4 hours to generate some statistics and then turn them off. If you keep internal events on while a lot of things are not working, the frontend will become very slow, and it will get even slower every day.

Data collection

Monitoring is based on data collection. If new data does not come in, we cannot detect if the service is up or down. The following query will list all data collector items which cannot receive the data.

SELECT hosts.name, items.key_ AS keyName, problem.name AS error, CONCAT('items.php?form=update&itemid=',objectid) AS goTo FROM problem JOIN items ON (items.itemid=problem.objectid) JOIN hosts ON (hosts.hostid=items.hostid) WHERE problem.source>0 AND problem.object=4;

The output provides a clickable link to navigate to the item and investigate. Common issues can be a timeout, wrong credentials, or missing permissions.

Trigger evaluation

When data is collected and an item has a trigger linked to it, it's possible that there is a problem in evaluating the trigger logic. The result of the following query will show you why it's impossible to detect the problem.

SELECT DISTINCT CONCAT('triggers.php?form=update&triggerid=',problem.objectid) AS goTo, hosts.name AS hostName, triggers.description AS triggerTitle, problem.name AS error FROM problem JOIN triggers ON (triggers.triggerid=problem.objectid) JOIN functions ON (functions.triggerid=triggers.triggerid) JOIN items ON (items.itemid=functions.itemid) JOIN hosts ON (hosts.hostid=items.hostid) WHERE problem.source > 0 AND problem.object=0;

Low-level discovery

The purpose of low-level discovery is to find all elements which exist in a particular system. For example, find all services on a Windows system. If discovery is not working, then additional elements will not get covered, which means no data collection, no triggers, and no notifications. The following query will show all erroneous LLD rules and the reason there is a problem:

SELECT hosts.name AS hostName, items.key_ AS itemKey, problem.name AS LLDerror, CONCAT('host_discovery.php?form=update&itemid=',problem.objectid) AS goTo FROM problem JOIN items ON (items.itemid=problem.objectid) JOIN hosts ON (hosts.hostid=items.hostid) WHERE problem.source > 0 AND problem.object=5;

That is it for today.

In the comments section below, let us know what kind of report your company desires to see.

Appendix

Known issues of the module:
  • When linking tables in SQL output, if there are multiple columns with the same name, only the first column will be printed. As a workaround, always use the “AS ColumnName” directive to specify unique column names in the output.

Deploying containers to AWS using ECS and CodePipeline 

Post Syndicated from Jessica Veljanovska original https://www.anchor.com.au/blog/2022/10/deploying-containers-to-aws/

Deploying containers to AWS using ECS and CodePipeline 

In this post, we will look at deploying containers to AWS using ECS and CodePipeline. Note that this is only an overview of the process and not a complete tutorial on how to set everything up.

Containers 101 

Before we look at how we can deploy Containers onto AWS, it is first worth looking at why we want to use containers. So, what are containers?

Containers provide environments for applications to run in isolation. Unlike virtualisation, containers do not require a full guest operating system for each container instance; instead, they share the kernel and other lower-level components of the host operating system as provided by the underlying containerization platform (the most popular of which is Docker, which we will be using in examples from here onwards). This makes containers more efficient at using the underlying hardware resources to run applications. Since Containers are isolated, they can include all their dependencies separate from other running Containers. Suppose that you have two applications that each require specific, conflicting versions of Python to run; they will not run correctly if this requirement is not met. With containers, you can run both applications with their own environments and dependencies on the same Host Operating system without conflict.

Containers are also portable, as a developer you can create, run, and test an image on your local workstation and then deploy the same image into your production environment. Of course, deploying into production is never that simple, but that example highlights the flexibility that containers can afford to developers in creating their applications. 

This is achieved by using images, which are read-only templates that provide the containerization platform the instructions required to run the image as a Container. Each image is built from a DockerFile that provides the specification on how to build the image and can also include other images. This can be as simple or as complicated as it needs to be to successfully run the application.

FROM ubuntu:22.04
COPY . /app
RUN make /app
CMD python /app/app.py

However, it is important to know that each image is made up of different layers, which are created based on each line of instruction in the DockerFile that is used to build the image. Each layer is cached by Docker, which provides performance benefits to well-optimised DockerFiles and the resulting images they create. When the image is run by Docker, it creates a Container that uses the layers from the image and adds a runtime Read-Write layer on top, which can be used for logging and any other activity that the application needs to perform and cannot be read from the image. You can use the same image to run as many Containers (running instances of the image) as you desire. Finally, when a Container is removed, the image is retained and only the Read-Write layer is lost.


Elastic Container Service 

Elastic Container Service or ECS is AWS' native Container management system. As an orchestration system it makes it easy to deploy and manage containerized applications at scale, with built-in scheduling that can allow you to spin up/down Containers at specific times or even configure auto-scaling. ECS has three primary modes of operation, known as launch types. The first launch type is the AWS EC2 launch type, where you run the ECS agent on EC2 instances that you maintain and manage. The available resource capacity is dependent on the number and type of EC2 instances that you are using. As you are already billed for the AWS resources consumed (EC2, EBS, etc.), there are no additional charges for using ECS with this launch type. If you are using AWS Outposts, you can also utilise that capacity instead of Cloud-based EC2 instances.

The second launch type is the AWS Fargate launch type. This is similar to the EC2 launch type; however, the major difference is that AWS manages the underlying infrastructure. Because of this there are no inherent capacity constraints, and the billing model is based on the amount of vCPU and memory your Containers consume.

The last launch type is the External launch type, which is known as ECS Anywhere. This allows you to use the features of ECS with your own on-premises hardware/infrastructure. This is billed at an hourly rate per managed instance. 

ECS operates using a hierarchy that connects the different aspects of the service and allows flexibility in exactly how the application is run. This hierarchy consists of ECS Clusters, ECS Services, ECS Tasks, and finally the running Containers. We will briefly look at each of these.

ECS Clusters

ECS Clusters are a logical grouping of ECS Services and their associated ECS Tasks. A Cluster is where scheduled tasks can be defined, either using fixed intervals or a cron expression. If you are using the AWS EC2 launch type, it is also where you can configure Auto-Scaling for the EC2 instances in the form of Capacity Providers.

ECS Services 

ECS Services are responsible for running ECS Tasks. They will ensure that the desired number of Tasks are running and will start new Tasks if an existing Task stops or fails for any reason in order to maintain the desired number of running Tasks. Auto-Scaling can be configured here to automatically update the desired number of Tasks based on CPU and Memory. You can also associate the Service with an Elastic Load Balancer Target Group and if you do this you can also use the number of requests as a metric for Auto-Scaling. 

ECS Tasks 

ECS Tasks are running instances of an ECS Task Definition. The Task Definition describes one or more Containers that make up the Task. This is where the main configuration for the Containers is defined, including the Image/s that are used, the Ports exposed, Environment variables, and if any storage volumes (for example EFS) are required. The Task as a whole can also have Resource sizing configured, which is a hard requirement for Fargate due to its billing model. Task definitions are versioned, so each time you make a change it will create a new revision, allowing you to easily roll back to an older configuration if required.
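For example, assuming you have the taskdef.json file described later in this post, a new Task Definition revision could be registered with the AWS CLI like this (a sketch, not part of the pipeline itself; the family name is a placeholder):

aws ecs register-task-definition --cli-input-json file://taskdef.json

# List the existing revisions of a task definition family
aws ecs list-task-definitions --family-prefix my-app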


Elastic Container Registry 

Elastic Container Registry, or ECR, is not formally part of ECS but instead supports it. ECR can be used to publicly or privately store Container images and, like all AWS services, has granular permissions provided using IAM. The main features of ECR, besides its ability to integrate with ECS, are the built-in vulnerability scanning for images and the lack of throttling limitations that public container registries might impose. It is not a free service though, so you will need to pay for any usage that falls outside of the included Free-Tier inclusions.

ECR is not strictly required for using ECS; you can continue to use whatever image registry you want. However, if you are using CodePipeline to deploy your Containers, we recommend using ECR purely to avoid throttling issues preventing the CodePipeline from successfully running.
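As a hedged sketch, creating a private ECR repository and logging Docker in to it could look like this (repository name, region, and account ID are placeholders):

aws ecr create-repository --repository-name my-app --region ap-southeast-2

aws ecr get-login-password --region ap-southeast-2 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.ap-southeast-2.amazonaws.com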


CodePipeline 

In software development, a Pipeline is a collection of multiple tools and services working together in order to achieve a common goal, most commonly building and deploying application code. AWS CodePipeline is AWS’ native form of a Pipeline that supports both AWS services as well as external services in order to help automate deployments/releases.

CodePipeline is one part of the complete set of developer tools that AWS provides, which commonly have names starting with “Code”. This is important, as by itself CodePipeline can only orchestrate other tooling. For a Pipeline that will deploy a Container to ECS, we will need AWS CodeCommit, AWS CodeBuild, and AWS CodeDeploy. In addition to running the components that are configured, CodePipeline can also provide and retrieve artifacts stored in S3 for each step. For example, it will store the application code from CodeCommit into S3 and provide this to CodeBuild; CodeBuild will then take this and create its own artifact files that are provided to CodeDeploy.

AWS CodeCommit 

AWS CodeCommit is a fully managed source control service that hosts private Git repositories. While this is not strictly required for the CodePipeline, we recommend using it to ensure that AWS has a cached copy of the code at all times. External git repositories can use actions or their own pipelines to push code to CodeCommit when it has been committed. CodeCommit is used in the Source stage of the CodePipeline to both provide the code that is used and to trigger the Pipeline to run when a commit is made. Alternatively, you can use CodeStar Connections to directly use GitHub or BitBucket as the Source stage instead.

AWS CodeBuild 

AWS CodeBuild is a fully managed continuous integration service that can be used to run commands as defined in the BuildSpec that it draws its configuration from, either in CodeBuild itself or from the Source repository. This flexibility allows it to compile source code, run tests, make API calls, or, in our case, build Docker Images using the DockerFile that is part of a code repository. CodeBuild is used in the Build stage of the CodePipeline to build the Docker Image, push it to ECR, and update any Environment Variables used later in the deployment.

The following is an example of what a typical BuildSpec might look like for our purposes. 

version: 0.2

phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - aws --version
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $ACCOUNTID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - docker build -t "$REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME" .
      - docker tag "$REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME" "$REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME"
  post_build:
    commands:
      - echo Build completed on `date`
      - echo Pushing the Docker images...
      - docker push "$REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME"
      - echo Writing image definitions file...
      - printf '{"ImageURI":"%s"}' "$REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME" > imageDetail.json
      - echo $REPOSITORY_URI:$IMAGE_TAG-$CODEBUILD_START_TIME
      - sed -i 's|<APP_NAME>|'$IMAGE_REPO_NAME'|g' appspec.yaml taskdef.json
      - sed -i 's|<SERVICE_PORT>|'$SERVICE_PORT'|g' appspec.yaml taskdef.json
      - sed -i 's|<AWS_ACCOUNT_ID>|'$AWS_ACCOUNT_ID'|g' taskdef.json
      - sed -i 's|<MEMORY_RESV>|'$MEMORY_RESV'|g' taskdef.json
      - sed -i 's|<IMAGE_NAME>|'$REPOSITORY_URI'\:'$IMAGE_TAG-$CODEBUILD_START_TIME'|g' taskdef.json
      - sed -i 's|<IAM_EXEC_ROLE>|'$IAM_EXEC_ROLE'|g' taskdef.json
      - sed -i 's|<REGION>|'$REGION'|g' taskdef.json
artifacts:
  files:
    - appspec.yaml
    - taskdef.json
    - imageDetail.json

Failures in this step are most likely caused by an incorrect, non-functioning DockerFile or hitting throttling from external Docker repositories. 

AWS CodeDeploy 

AWS CodeDeploy is a fully managed deployment service that integrates with several AWS services including ECS. It can also be used to deploy software to on-premises servers. The configuration of the deployment is defined using an appspec file. CodeDeploy offers several different deployment types and configurations. We tend to use the `Blue/Green` deployment type and the `CodeDeployDefault.ECSAllAtOnce` deployment configuration by default. 

The Blue/Green deployment model allows for the deployment to rollback to the previously deployed Tasks if the deployment is not successful. This makes use of Load Balancer Target Groups to determine if the created Tasks in the ECS Service are healthy. If the health checks fail, then the deployment will fail and trigger a rollback.  

In our ECS CodePipeline, CodeDeploy will run based on the following example appspec.yaml file. Note that the placeholder variables in <> are updated with real configuration by the CodeBuild stage.

version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: <TASK_DEFINITION>
        LoadBalancerInfo:
          ContainerName: "<APP_NAME>"
          ContainerPort: <SERVICE_PORT>

You may have noticed a lack of configuration regarding the actual Container we will be running up to this point. As we noted earlier, ECS Tasks can use a taskdef file to configure the Task and Containers, which is exactly what we are doing here. One of the files the CodeBuild configuration above expects to be present is taskdef.json. Below is an example taskdef.json file; as with the appspec.yaml file, there are placeholder variables as indicated by <>.


{
  "executionRoleArn": "<IAM_EXEC_ROLE>",
  "containerDefinitions": [
    {
      "name": "<APP_NAME>",
      "image": "<IMAGE_NAME>",
      "essential": true,
      "memoryReservation": <MEMORY_RESV>,
      "portMappings": [
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": <SERVICE_PORT>
        }
      ],
      "environment": [
        {
          "name": "PORT",
          "value": "<SERVICE_PORT>"
        },
        {
          "name": "APP_NAME",
          "value": "<APP_NAME>"
        },
        {
          "name": "IMAGE_BUILD_DATE",
          "value": "<IMAGE_BUILD_DATE>"
        }
      ],
      "mountPoints": []
    }
  ],
  "volumes": [],
  "requiresCompatibilities": [
    "EC2"
  ],
  "networkMode": "bridge",
  "family": "<APP_NAME>"
}

Failures in this stage of the CodePipeline generally mean that there is something wrong with the Container that is preventing it from starting successfully and passing its health check. Causes of this are varied and rely heavily on having good logging in place. Failures can range from the DockerFile being configured to execute a command that does not exist, to the application itself erroring out when starting for whatever reason, which is hopefully logged, such as pulling incomplete data from RDS or failing to connect to an external service.

Summary 

In summary, we have seen what Containers are, why they are useful, and what options AWS provides with its Elastic Container Service (ECS) for running them. Additionally, we looked at what parts of AWS CodePipeline we would need to use in order to deploy our Containers to ECS using CodePipeline.

For further information on the services used, I would highly recommend looking at the documentation that AWS provides. 

For a more hands-on guided walkthrough on setting up ECS and CodePipeline, AWS provides the following resources; there is also plenty of third-party material you can find online.

The post Deploying containers to AWS using ECS and CodePipeline  appeared first on Anchor | Cloud Engineering Services.

Docker Container Monitoring With Zabbix

Post Syndicated from Dmitry Lambert original https://blog.zabbix.com/docker-container-monitoring-with-zabbix/20175/

In this blog post, I will cover Docker container monitoring with Zabbix. We will use the official Docker by Zabbix agent 2 template to make things as simple as possible. The template download link and configuration steps can be found on the Zabbix Integrations page. If you require a visual guide, I invite you to check out my video covering this topic.

Importing the official Docker template

Importing the Docker by Zabbix agent 2 template

Since we will be using the official Docker by Zabbix agent 2 template, first, we need to make sure that the template is actually available in our Zabbix instance. The template is available for Zabbix versions 5.0, 5.4, and 6.0. If you cannot find this template under Configuration – Templates, chances are that you haven’t imported it into your environment after upgrading Zabbix to one of the aforementioned versions. Remember that Zabbix does not modify or import any templates during the upgrade process, so we will have to import the template manually. If that is so, simply download the template from the official Zabbix git page (or use the link in the introduction) and import it into your Zabbix instance by using the Import button in the Configuration – Templates section.

Installing and configuring Zabbix agent 2

Before we get started with configuring our host, we first have to install Zabbix agent 2 and configure it according to the template guidelines. Follow the steps in the download section of the Zabbix website and install the zabbix-agent2 package. Feel free to use any other agent deployment method if you want to (like compiling the agent from the source files).

Installing Zabbix agent 2 from packages takes just a few simple steps:

Install the Zabbix repository package:

rpm -Uvh https://repo.zabbix.com/zabbix/6.0/rhel/8/x86_64/zabbix-release-6.0-1.el8.noarch.rpm

Install the Zabbix agent 2 package:

dnf install zabbix-agent2

Configure the Server parameter by populating it with your Zabbix server/proxy address

vi /etc/zabbix/zabbix_agent2.conf
### Option: Server
# List of comma delimited IP addresses, optionally in CIDR notation, or DNS names of Zabbix servers and Zabbix proxies.
# Incoming connections will be accepted only from the hosts listed here.
# If IPv6 support is enabled then '127.0.0.1', '::127.0.0.1', '::ffff:127.0.0.1' are treated equally
# and '::/0' will allow any IPv4 or IPv6 address.
# '0.0.0.0/0' can be used to allow any IPv4 address.
# Example: Server=127.0.0.1,192.168.1.0/24,::1,2001:db8::/32,zabbix.example.com
#
# Mandatory: yes, if StartAgents is not explicitly set to 0
# Default:
# Server=

Server=192.168.50.49

Plugin specific Zabbix agent 2 configuration

Zabbix agent 2 provides plugin-specific configuration parameters. Mostly these are optional parameters related to a specific plugin. You can find the full list of plugin-specific configuration parameters in the Zabbix documentation. In the newer versions of Zabbix agent 2, the plugin-specific parameters are defined in separate plugin configuration files, located in /etc/zabbix/zabbix_agent2.d/plugins.d/, while in older versions, they are defined directly in the zabbix_agent2.conf file.

For the Zabbix agent 2 Docker plugin, we have to provide the Docker daemon unix-socket location. This can be done by specifying the following plugin parameter:

### Option: Plugins.Docker.Endpoint
# Docker API endpoint.
#
# Mandatory: no
# Default: unix:///var/run/docker.sock
# Plugins.Docker.Endpoint=unix:///var/run/docker.sock

If the default socket location is correct for your Docker environment, you can leave the configuration file as is.

Once we have made the necessary changes in the Zabbix agent 2 configuration files, start and enable the agent:

systemctl enable zabbix-agent2 --now

Check if the Zabbix agent2 is running:

tail -f /var/log/zabbix/zabbix_agent2.log

Before we move on to the Zabbix frontend, I would like to draw your attention to the Docker socket file permissions – the zabbix user needs to have access to the Docker socket file. The zabbix user should be added to the docker group to resolve the following error messages:

[Docker] cannot fetch data: Get http://1.28/info: dial unix /var/run/docker.sock: connect: permission denied
ZBX_NOTSUPPORTED: Cannot fetch data.

You can add the zabbix user to the Docker group by executing the following command:

usermod -aG docker zabbix
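
Note that group membership changes only apply to newly started processes, so restart the agent afterwards:

systemctl restart zabbix-agent2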

Configuring the docker host

Configuring the host representing our Docker environment

After importing the template, we have to create a host which will represent our Docker instance. Give the host a name and assign it to a Host group – I will assign it to the Linux servers host group. Assign the Docker by Zabbix agent 2 template to the host. Since the template uses Zabbix agent 2 to collect the metrics, we also have to add an agent interface on this host. The address of the interface should point to the machine running your Docker containers. Finish up the host configuration by clicking the Add button.

Docker by Zabbix agent 2 template

Regular docker template items

The template contains a set of regular items for the general Docker instance metrics, such as the number of available images, Docker architecture information, the total number of containers, and more.

Docker template Low-level discovery rules

On top of that, the template also gathers container and image-specific information by using low-level discovery rules.

Once Zabbix discovers your containers and images, these low-level discovery rules will then be used to create items, triggers, and graphs from prototypes for each of your containers and images. This way, we can monitor container or image-specific metrics, such as container memory, network information, container status, and more.

Docker template Low-level discovery item prototypes

Verifying the host and template configuration

To verify that the agent and the host are configured correctly, we can use the Zabbix get command-line tool and try to poll our agent. If you haven't installed Zabbix get, do so on your Zabbix server or Zabbix proxy host:

dnf install zabbix-get

Now we can use zabbix-get to verify that our agent can obtain the Docker-related metrics. Execute the following command:

zabbix_get -s docker-host -k docker.info

Use the -s parameter to specify your agent host’s host name or IP address. The -k parameter specifies the item key for which we wish to obtain the metrics by polling the agent with Zabbix get.

zabbix_get -s 192.168.50.141 -k docker.info

{"Id":"SJYT:SATE:7XZE:7GEC:XFUD:KZO5:NYFI:L7M5:4RGO:P2KX:QJFD:TAVY","Containers":2,"ContainersRunning":2,"ContainersPaused":0,"ContainersStopped":0,"Images":2,"Driver":"overlay2","MemoryLimit":true,"SwapLimit":true,"KernelMemory":true,"KernelMemoryTCP":true,"CpuCfsPeriod":true,"CpuCfsQuota":true,"CPUShares":true,"CPUSet":true,"PidsLimit":true,"IPv4Forwarding":true,"BridgeNfIptables":true,"BridgeNfIP6tables":true,"Debug":false,"NFd":39,"OomKillDisable":true,"NGoroutines":43,"LoggingDriver":"json-file","CgroupDriver":"cgroupfs","NEventsListener":0,"KernelVersion":"5.4.17-2136.300.7.el8uek.x86_64","OperatingSystem":"Oracle Linux Server 8.5","OSVersion":"8.5","OSType":"linux","Architecture":"x86_64","IndexServerAddress":"https://index.docker.io/v1/","NCPU":1,"MemTotal":1776848896,"DockerRootDir":"/var/lib/docker","Name":"localhost.localdomain","ExperimentalBuild":false,"ServerVersion":"20.10.14","ClusterStore":"","ClusterAdvertise":"","DefaultRuntime":"runc","LiveRestoreEnabled":false,"InitBinary":"docker-init","SecurityOptions":["name=seccomp,profile=default"],"Warnings":null}

In addition, we can use the low-level discovery key – docker.containers.discovery[false] – to check the result of the low-level discovery.

zabbix_get -s 192.168.50.141 -k docker.containers.discovery[false]

[{"{#ID}":"a1ad32f5ee680937806bba62a1aa37909a8a6663d8d3268db01edb1ac66a49e2","{#NAME}":"/apache-server"},{"{#ID}":"120d59f3c8b416aaeeba50378dee7ae1eb89cb7ffc6cc75afdfedb9bc8cae12e","{#NAME}":"/mysql-server"}]

We can see that Zabbix will discover and start monitoring two containers – apache-server and mysql-server. Any agent low-level discovery rule or item can be checked with Zabbix get.
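
As an additional check, you can poll a container-specific key directly. The sketch below assumes the docker.container_info key provided by the Zabbix agent 2 Docker plugin; exact key names and parameters may vary between agent and template versions, so treat it as an illustration:

zabbix_get -s 192.168.50.141 -k 'docker.container_info["/apache-server"]'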

Docker template in action

Discovered items on our Docker host

Now that we have configured our agent and host, applied the Docker template, and verified that everything is working, we should be able to see the discovered entities in the frontend.

Collected Docker container metrics

In addition, our metrics should have also started coming in. We can check the Latest data section and verify that they are indeed getting collected.

Macros inherited from the Docker template

Lastly, we have a few additional options for further modifying the template and the results of our low-level discovery. If you open the Macros section of your host and select Inherited and host macros, you will notice that there are 4 macros inherited from the Docker template. These macros control which of the discovered containers and images are filtered in or out. Feel free to modify their values if you wish to adjust the discovery of these entities to your requirements.

Notice that the container discovery item also has one parameter, which is defined as false on the template:

  • docker.containers.discovery[false] – discover only running containers
  • docker.containers.discovery[true] – discover all containers, regardless of their state

And that’s it! We successfully imported the template, installed and configured Zabbix agent 2, created a host, and applied the Docker template. Finally – our Zabbix instance is now monitoring our Docker environment! If you have any other questions or comments, feel free to leave a response in the comments section of this post.

 

The post Docker Container Monitoring With Zabbix appeared first on Zabbix Blog.

Webhooks in Zabbix

Post Syndicated from Andrey Biba original https://blog.zabbix.com/webhooks-in-zabbix/19935/

Zabbix is not only a flexible and versatile monitoring system but also a convenient tool for generating alerts and integrating with existing service desks. Among the various integration methods, webhooks have become the most popular. In this blog post, we will take a look at what webhooks are, how they can be used to integrate Zabbix with an external solution, and go over some use case examples for webhook integrations.

What is a webhook?

Generally speaking, a webhook is a method of augmenting or altering the behavior of a web page or web application with custom callbacks. But to put it simply, a webhook is an automatic reaction to an event. If an event occurs (for example, a problem appears), then the webhook makes a call (via HTTP / HTTPS) to a third-party service to notify it about the event. Many existing solutions provide an API that allows you to interact with them via webhooks.

The webhook in Zabbix is implemented using JavaScript, so writing one does not require knowledge of any Zabbix-specific syntax, and thanks to the prevalence of JavaScript, you can find many examples, tips, and guides on the Internet.

How does a webhook work?

Essentially, a webhook is code that makes a sequence of calls to achieve some result. In the case of Zabbix, JavaScript code is executed that accesses the service API and transfers, updates, and retrieves data from there. For example, suppose we need to open a ticket at the service desk and leave a comment on the ticket containing information about the problem. For this we need to (a rough command-line sketch of the flow follows the list):

  • Log in to the service and get a token
  • Make a request with the token to create a ticket
  • Create a comment on the newly created ticket using the token
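
As a rough command-line illustration of this flow, here is a sketch using curl and jq. The service desk URL, endpoints, and field names are entirely hypothetical, and a real Zabbix webhook would perform the same steps in JavaScript instead:

# 1. Log in to the (hypothetical) service desk and obtain a token
TOKEN=$(curl -s -X POST https://servicedesk.example.com/api/login \
  -H "Content-Type: application/json" \
  -d '{"user": "zabbix", "password": "secret"}' | jq -r '.token')

# 2. Create a ticket using the token
TICKET_ID=$(curl -s -X POST https://servicedesk.example.com/api/tickets \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"subject": "Zabbix problem: High CPU load on web01"}' | jq -r '.id')

# 3. Add a comment with the problem details to the newly created ticket
curl -s -X POST "https://servicedesk.example.com/api/tickets/$TICKET_ID/comments" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"text": "Problem started at 2022-01-01 12:00, severity: High"}'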

The details may differ from service to service, but the general idea stays the same.

How to use it?

Our integration team constantly communicates with the community and monitors the most popular services to develop official out-of-the-box integrations for them. At the moment, Zabbix provides a vast selection of out-of-the-box webhooks for the most popular services, and we review new ones and improve current ones every day.

In most cases, setting up a ready-made webhook comes down to 3-4 steps, which are described in the README file in the Zabbix repository. Usually, it is necessary to generate an API key in the service, set it in Zabbix, point Zabbix to the service endpoint URL, and specify a couple of parameters required for the webhook to work.

In addition to the ready-made solutions, there is a GitHub community repository where custom templates and webhooks are published. If you are the author of a webhook or a template, please share it with the community by submitting it to this repository!

Example – Telegram webhook

The theory is good, but we are all interested in how it works in practice. Let’s look at the Telegram webhook – this messenger is very popular, which makes it a fitting example.

First of all, let’s go to the Zabbix repository or navigate to the Zabbix website integrations section to read the setup instructions. In the repository, all templates and notification methods are located in the /templates folder, and for each of them, there is a README file with a detailed description.

On the Telegram side, we need to create a bot, obtain its token by following the instructions, and set it in the Token parameter of the media type.

After that, we create a user, assign the Telegram media type to this user, and enter the ID of the user or group chat in the “Send to” field.

Voila! Your webhook is set up and ready to send notifications or event information!

As you may have noticed, the setup did not take much time and did not require deep knowledge. Naturally, for finer tuning, it is possible to edit the content of messages, the type of problems, intervals, and other parameters. But even without additional changes, notifications are already ready to go.

Is it difficult to write a webhook on your own?

Of course, creating a webhook requires certain skills.

First of all, knowledge of JavaScript is required. The language itself is not difficult and can be mastered relatively quickly. The Zabbix documentation site has a guideline for writing webhooks with recommendations and best practices.

Secondly, an understanding of how Zabbix works is needed. This does not require in-depth knowledge – being able to follow basic instructions will be enough. You can read more about setting up notification methods in the official documentation. It is important to properly configure the webhook itself, grant rights to users, and set up a notification action for the necessary triggers.

And thirdly, you need to study the documentation of the service for which the webhook will be written. Although all APIs work on the same principle, they can differ greatly in their methods and request structure. You also need to understand the service itself – it is difficult to write an integration if it is not clear how Zabbix should interact with it.

Summarizing

Webhooks are a modern and flexible integration method that helps make Zabbix a universal solution. With so many different systems in use, and so many people working together as a result, webhooks are becoming an indispensable tool for notification automation. A properly written and configured webhook is an effective solution for flexible notifications.

In the next article, we will learn the basic methods and requests needed to send alerts, receive updates, and assign tags. For this purpose, we will inspect one of the webhooks in close detail.

Questions

Q: We have a ready-made notification system built on scripts. Does it make sense to rewrite it to a webhook?

A: Certainly. Firstly, a webhook is executed natively in Zabbix, which performs much better than calling an external script. Secondly, a webhook is more flexible, more functional, and much easier to modify.

 

Q: We have a service for which we would like to write an integration, but we do not have qualified specialists who could do it. Is it possible to request such integration from Zabbix?

A: Yes, if you are a Zabbix partner, you can submit a request for such an integration to be created.

 

The post Webhooks in Zabbix appeared first on Zabbix Blog.

Zabbix NOT AFFECTED by the Log4j exploit

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/zabbix-not-affected-by-the-log4j-exploit/17873/

A newly revealed vulnerability impacting Apache Log4j 2 versions 2.0 to 2.14.1 was disclosed on GitHub on 9 December 2021 and registered as CVE-2021-44228 with the highest severity rating. Log4j is an open-source, Java-based logging utility widely used by enterprise applications and cloud services. By utilizing this vulnerability, a remote attacker could take control of the affected system.

Zabbix is aware of this vulnerability, has completed verification, and can conclude that the only product where we use Java is Zabbix Java Gateway, which does not utilize the Log4j library and is therefore not impacted by this vulnerability.

For customers who use the Log4j library with other Java applications, here are some proactive measures they can take to reduce the risk posed by CVE-2021-44228:

  • Upgrade to Apache Log4j 2.15.0 or a later fixed version, as all prior 2.x versions (2.0 to 2.14.1) are vulnerable.
  • For Log4j version 2.10.0 or later, block JNDI from making requests to untrusted servers by setting the configuration value log4j2.formatMsgNoLookups to “true”, which prevents LDAP lookups and other JNDI queries (see the example after this list).
  • Make sure that both com.sun.jndi.rmi.object.trustURLCodebase and com.sun.jndi.cosnaming.object.trustURLCodebase are set to “false” (the default since Java 8u121) to prevent remote code execution via remote codebases.
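
For example, the message lookup behavior can be disabled either with a JVM system property or with an environment variable (the application jar name below is just a placeholder):

# Disable message lookups via a JVM system property
java -Dlog4j2.formatMsgNoLookups=true -jar your-application.jar

# Or via an environment variable set before starting the application
export LOG4J_FORMAT_MSG_NO_LOOKUPS=true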

Advanced Zabbix API – 5 API use cases to improve your API workflows

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/advanced-zabbix-api-5-api-use-cases-to-improve-your-api-workfows/16801/

As your monitoring infrastructure evolves, you might hit a point where using the Zabbix API becomes unavoidable. The Zabbix API can be used to automate a part of your day-to-day workflow, troubleshoot your monitoring, or simply analyze and gather statistics about a specific set of entities.

In this blog post, we will take a look at some of the more advanced API methods and specific method parameters and learn how they can be used to improve your API workflows.

1. Count entities with countOutput

Let’s start by gathering some statistics. Say you have to count the number of matching entities – this is where the countOutput parameter comes in. For a more advanced use case, what if we have to count the number of events for some time period? We can combine countOutput with time_from and time_till (in unixtime) and get the number of events created during the month of November. In the example below, we count all of the events for the month of November that have the Disaster severity:

{
    "jsonrpc": "2.0",
    "method": "event.get",
    "params": {
        "output": "extend",
        "time_from": "1635717600",
        "time_till": "1638223200",
        "severities": "5",
        "countOutput": "true"
    },
    "auth": "xxxxxx",
    "id": 1
}
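
If you want to try this call from the command line, a minimal sketch with curl could look like this (replace the endpoint URL and the auth token with your own values):

curl -s -X POST https://zabbix.example.com/zabbix/api_jsonrpc.php \
  -H "Content-Type: application/json-rpc" \
  -d '{"jsonrpc":"2.0","method":"event.get","params":{"output":"extend","time_from":"1635717600","time_till":"1638223200","severities":"5","countOutput":"true"},"auth":"xxxxxx","id":1}'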

2. Use API to perform Configuration export/import

Next, let’s take a look at how we can use the configuration.export method to export one of our templates in YAML:

{
    "jsonrpc": "2.0",
    "method": "configuration.export",
    "params": {
        "options": {
            "templates": [
                "10001"
            ]
        },
        "format": "yaml"
    },
    "auth": "xxxxxx",
    "id": 1
}

Now let’s copy the result of the export and import the template into another environment. It’s extremely important to remember that for this method to work as intended, we need to include the rules that specify the behavior for the particular entity types contained in the configuration string, such as items, value maps, and templates. For example, if I excluded the templates rule here, no templates would be imported.

{
    "jsonrpc": "2.0",
    "method": "configuration.import",
    "params": {
        "format": "yaml",
        "rules": {
            "valueMaps": {
                "createMissing": true,
                "updateExisting": true
            },
            "items": {
                "createMissing": true,
                "updateExisting": true,
                "deleteMissing": true
            },
            "templates": {
                "createMissing": true,
                "updateExisting": true
            },
            "templateLinkage": {
                "createMissing": true
            }
        },
        "source": "zabbix_export:\n version: '5.4'\n date: '2021-11-13T09:31:29Z'\n groups:\n -\n uuid: 846977d1dfed4968bc5f8bdb363285bc\n name: 'Templates/Operating systems'\n templates:\n -\n uuid: e2307c94f1744af7a8f1f458a67af424\n template: 'Linux by Zabbix agent active'\n name: 'Linux by Zabbix agent active'\n 
        ...
    },
    "auth": "xxxxxx",
    "id": 1
}

3. Expand trigger functions and macros with expand parameters

Using trigger.get to obtain information about a particular set of triggers is a relatively common practice. One caveat that we have to consider is that, by default, the macros in the trigger name, expression, and description are not expanded. To expand them, we need to use the corresponding expand parameters:

{
    "jsonrpc": "2.0",
    "method": "trigger.get",
    "params": {
        "triggerids": "18135",
        "output": "extend",
        "expandExpression": "1",
        "selectFunctions": "extend"
    },
    "auth": "xxxxxx",
    "id": 1
}

4. Obtaining additional LLD information for a discovered item

If we wish to display additional LLD information for a discovered entity – in this case, an item – we can use the selectDiscoveryRule and selectItemDiscovery parameters.
While selectDiscoveryRule provides the ID of the LLD rule that created the item, selectItemDiscovery can return the parent item prototype ID from which the item was created, the last discovery time, the item prototype key, and more.

The example below will return the item details and also provide the LLD rule and item prototype IDs, the time when the lost item will be deleted, and the last time the item was discovered:

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "itemids": "36717",
        "selectDiscoveryRule": "1",
        "selectItemDiscovery": ["lastcheck", "ts_delete", "parent_itemid"]
    },
    "auth": "xxxxxx",
    "id": 1
}

5. Searching through the matched entities with search parameters

The Zabbix API provides a couple of standard parameters for performing a search. With the search parameter, we can search string or text fields and find objects based on one or multiple entries. The searchByAny parameter extends the search – if you set it to true, the search matches ANY of the criteria in the search array, instead of requiring an entity to match ALL of them (the default behavior).

The following API call will find items on a particular template whose keys contain either “agent.” or “zabbix”:

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "output": "extend",
        "templateids": "10001",
        "search": {
            "key_": ["agent.", "zabbix"]
        },
        "searchByAny": "true",
        "sortfield": "name"
    },
    "auth": "xxxxxx",
    "id": 1
}

Feel free to take the above examples and adjust them to fit your use case – you should be able to implement them in your environment quite easily. There are many other use cases that we might cover down the line – if you have a specific API use case that you wish for us to cover, feel free to leave a comment under this post and we just might cover it in one of the upcoming blog posts!

Simplifying Zabbix API workflows with named Zabbix API tokens

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/simplifying-zabbix-api-workflows-with-named-zabbix-api-tokens/16653/

The Zabbix API enables you to collect any and all information from your Zabbix instance by using a multitude of API methods. You can even utilize Zabbix API calls in your HTTP items – for example, to monitor the count of a particular set of entities and visualize its growth over time. With named Zabbix API tokens, such use cases are a lot simpler to implement.

Before Zabbix 5.4, we had to perform the user.login API call to obtain an authentication token. Once the user session was closed, we had to log in again, obtain a new authentication token, and use it in the subsequent API calls.

With the pre-defined named Zabbix API tokens, you don’t have to constantly check if the authentication token needs to be updated. Starting from Zabbix 5.4 you can simply create a new named Zabbix API token with an expiration date and use it in your API calls.

Creating a new named Zabbix API token

The Zabbix API token creation process is extremely simple. All you have to do is navigate to Administration – General – API tokens and create a new API token. The named API tokens are created for a particular user and can have an optional expiration date and time – otherwise, the tokens are defined without an expiry date.

You can create a named API token in the API tokens section, under Administration – General

Once the token has been created, make sure to store it somewhere safe, since you won’t be able to recover it afterward. If the token is lost, you will have to recreate it.

Make sure to store the auth token!

Don’t forget that when defining a role for the particular API user, we can restrict which API methods this user has access to.
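
Named tokens can also be created through the API itself, using the token.create and token.generate methods added in Zabbix 5.4. The sketch below is only an outline; the userid value and the endpoint URL are placeholders, so check the API reference for the exact parameters:

# Create a named token object for a given user
curl -s -X POST https://zabbix.example.com/zabbix/api_jsonrpc.php \
  -H "Content-Type: application/json-rpc" \
  -d '{"jsonrpc":"2.0","method":"token.create","params":{"name":"Unsupported item count","userid":"1"},"auth":"<your auth token>","id":1}'

# Generate the actual token string for the returned tokenid
curl -s -X POST https://zabbix.example.com/zabbix/api_jsonrpc.php \
  -H "Content-Type: application/json-rpc" \
  -d '{"jsonrpc":"2.0","method":"token.generate","params":["<tokenid from the previous call>"],"auth":"<your auth token>","id":2}'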

Simplifying API tasks with the named API token

There are many different use cases where you could implement Zabbix API calls to collect some additional information. For this blog post, I will create an HTTP item that uses the item.get API call to monitor the number of unsupported items.

To achieve that, I will create an HTTP agent item on a host (this can be the default Zabbix server host or a host dedicated to collecting metrics via Zabbix API calls) and provide the API call in the request body. Since the named API token stays valid until it expires, I can simply use it in my API call without the need to constantly keep it updated.

An HTTP agent item that uses a Zabbix API call in its request body

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "countOutput": "1",
        "filter": {
            "state": "1"
        }
    },
    "id": 2,
    "auth": "b72be8cf163438aacc5afa40a112155e307c3548ae63bd97b87ff4e98b1f7657"
}

HTTP item request body, which returns a count of unsupported items

I will also use regular expression preprocessing to obtain the numeric value from the API call result – otherwise, we won’t be able to graph our value or calculate trends for it.

Regular expression preprocessing step to obtain a numeric value from our Zabbix API call result
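
To illustrate what this preprocessing step does, here is the same kind of extraction performed with sed on a sample countOutput-style response; in the item itself you would configure an equivalent regular expression with \1 as the output template:

echo '{"jsonrpc":"2.0","result":"5","id":2}' | sed -E 's/.*"result":"?([0-9]+)"?.*/\1/'
# Output: 5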

Utilizing Zabbix API scripts in Actions

In one of our previous blog posts, we covered resolving problems automatically with the event.acknowledge API method. The logic defined in that blog post was quite complex, since we needed to keep an eye on the authentication tokens and use a custom script to keep them up to date. With named Zabbix API tokens, this use case is a lot simpler.

All I have to do is create a script containing my API call and use it in an action operation.

Action operation script that invokes Zabbix event.acknowledge API method

curl -sk -X POST -H "Content-Type: application/json" -d "
{
    \"jsonrpc\": \"2.0\",
    \"method\": \"event.acknowledge\",
    \"params\": {
        \"eventids\": \"{EVENT.ID}\",
        \"action\": 1,
        \"message\": \"Problem resolved.\"
    },
    \"auth\": \"<Place your authentication token here>\",
    \"id\": 2
}" <Place your Zabbix API endpoint URL here>

Problem remediation script example

Now my problems will get closed automatically after the time period which I have defined in my action.

Action operation which runs our event.acknowledge Zabbix API script

These are but a few examples of what we can now achieve by using named API tokens. A lot of information can be obtained and filtered in a unique way via the Zabbix API, providing you with a granular analysis of your monitored environment. If you have recently upgraded to Zabbix 5.4 or plan to upgrade to Zabbix 6.0 LTS in the future, I would recommend implementing named Zabbix API tokens to simplify your day-to-day workflow and considering the possibilities that this new feature opens up for you.

If you have any questions or if you wish to share your particular use case for data collection or task automation with Zabbix API – feel free to share them in the comments section below!

Combining preprocessing with storing only trend data for high-frequency monitoring

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/combining-preprocessing-with-storing-only-trend-data-for-high-frequency-monitoring/16568/

There are many design choices to consider when we build our monitoring environment for high-frequency monitoring. How do we minimize the performance impact? What data retention policies should we define with storage space in mind? Which out-of-the-box features are available to solve these potential problems?
In this blog post, we will discuss when you should use preprocessing and when it is better to use the “Do not keep history” option for your metrics, and what the pros and cons of both approaches are.

Throttling and other preprocessing steps

We’ve discussed throttling previously as the go-to approach for high-frequency monitoring. Indeed, with throttling you can discard repeated values, optionally combined with a heartbeat. This is extremely useful for metrics that come as discrete values – service states, network port statuses, and so on.
Example of throttling with and without heartbeat
In addition, starting from Zabbix 4.2, all preprocessing is also performed by Zabbix proxies. This means we can discard the repeated values before they even reach the Zabbix server. This helps both with performance (fewer metrics to insert into the Zabbix server DB) and with reducing the DB size (fewer metrics stored in the DB), which in turn improves overall Zabbix performance.
There are a few caveats with this approach – since the metrics get discarded before they reach the Zabbix server, triggers will not react to these metrics (this is where having a heartbeat is useful), and, since trends are calculated by the Zabbix server based on the received history data, there could be a lack of trend information for these metrics. Keep in mind that this applies not only to throttling rules – any preprocessing can be done on the proxy, and any preprocessing rules can be used to transform your data.

Understanding “Do not keep history” option

The behavior of the “Do not keep history” option, which we can define when configuring an item, is a bit different though. If we collect an item through a proxy and configure the item with “Do not keep history”, the history won’t always get discarded! There are a couple of reasons for this.
  • First off, let’s not forget that some of our values can populate host inventory! If the particular item is configured to populate an inventory field – it will be forwarded to the Zabbix server, but it will not get stored in the history tables.
  • If the item does not populate an inventory field, text data (character, log, and text types) will indeed get discarded before reaching the Zabbix server, but numeric values – both float and integer – will still get forwarded to the server. The reason for that is that trend information is derived from the numeric values. Mind that the numeric data will still not get stored in the history tables – only trends will be available for these items.

Note: This behavior has been properly implemented starting from Zabbix 5.2. See ZBX-17548

Setting the “Do not keep history” option for an item

Using trend functions with high-frequency monitoring

With the specifics of “Do not keep history” in mind, we should now recall that starting from Zabbix 5.2 we have trend functions at our disposal!
Trend functions such as trendavg, trendcount, trendmax, trendmin, and trendsum allow us to perform different kinds of trend calculations – from counting the number of trend values to retrieving the min/max/avg trend values for a time period.
This means that if we require only the metric trends for specific time periods (hours, days, weeks, etc.), we can use these trend functions together with the “Do not keep history” option, thus discarding unnecessary data and improving our Zabbix server performance!
There are two approaches to using trend functions (a sketch of the resulting expressions follows the list):
  • If you wish to collect and display the trend data, you need to create the item that will collect the metrics (say, a net.if.in agent item for collecting incoming network traffic) and then create a separate calculated item that uses the trend function to calculate the avg/min/max trend value over a time period. The original item can then have the “Do not keep history” option selected for it.

trendavg item for calculating hourly trends from the net.if.in[ifHCInOctets.5] item

 

  • If you simply wish to define triggers and react to long-term trends, and you don’t need to collect the trend values, then you can skip the creation of the calculated item and use the trend function on the original item directly in the trigger.

This trigger fires if the hourly average trend value exceeds 100M.
Note: In this case only the original item is required.
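
To make both approaches more concrete, here is a hedged sketch using the Zabbix 5.4+ expression syntax, where the host name My host and the interface key are placeholders: the calculated item from the first approach could use the formula trendavg(/My host/net.if.in[ifHCInOctets.5],1h:now/h), while the trigger from the second approach could use an expression such as trendavg(/My host/net.if.in[ifHCInOctets.5],1h:now/h)>100M.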

By combining these approaches in our environment – using preprocessing when we wish to discard or transform the data, and opting out of storing the history data whenever this is appropriate – we can minimize the performance impact on our Zabbix instance. Add a layer of distributed Zabbix proxies on top of this, and you can truly achieve a large, scalable Zabbix infrastructure optimized for high-frequency ingestion and processing of your data.

Keeping your Zabbix templates up to date

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/keeping-your-zabbix-templates-up-to-date/16412/

Have you recently updated your Zabbix environment but are still wondering – why haven’t the templates been updated? Where can I obtain the latest official Zabbix templates, and how should I update them? In this blog post, we will discuss why it is vital to keep your templates up to date and how the template update process looks.

Updating your templates

“Will updating Zabbix also update my templates?” is a question that I receive quite often. The answer to that question is – no changes are made to your templates whenever you update your Zabbix instance – be it a minor or a major update. The reasoning behind that is quite simple – we always recommend that you tune the out-of-the-box templates as per your particular requirements. That may consist of changing update intervals, disabling items/triggers, or even changing the existing trigger expressions or adding whole new entities to the template.

This is where the current behavior with template updates starts to make more sense. If Zabbix were to automatically update your templates, there would be a chance of overwriting your custom changes, which could potentially disrupt the monitoring of your environment. That is something we definitely wish to avoid.

The question still stands – how am I supposed to update my templates, then?

The answer – you can find the latest official Zabbix templates on our official git page – https://git.zabbix.com/

First, navigate to the Zabbix repository and open the Templates folder. Then, select the release branch that matches your Zabbix instance version. Here you can find all of our official templates and also the official media types under the media folder. All you have to do now is open the template up and download the raw template file.
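
If you prefer working from the command line, you can also clone the repository and check out the branch that matches your version. The clone URL below is an assumption based on the public repository layout, so verify it on git.zabbix.com:

git clone --branch release/5.4 https://git.zabbix.com/scm/zbx/zabbix.git
ls zabbix/templates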

Zabbix 5.4 release git templates/db folder

Once that is done, we can import the template into our Zabbix environment.

Template template_db_oracle_agent2 import

Don’t forget to back up your existing templates, especially if you have made some custom changes to them! Ideally – add a prefix to their names, so the new and old templates can live side by side, and you can then manually copy over the changes from the latest official template to your custom template.

The benefits of keeping templates up to date

But what is the point of updating your templates – what do you get out of it? Well, that depends on the specific fixes or improvements we make to the particular template over time. Sometimes the updated template will provide improved trigger expressions or preprocessing logic. Other times the updated template will provide extra value to your monitoring with completely new items and triggers. In the case of webhook media types, the updates usually contain fixes or improvements for some particular use cases, for example, fixing a compatibility issue for a specific OS.

You can always track these changes either in the release notes of a particular Zabbix version or by looking up a specific bug or a feature request in our bug tracker – https://support.zabbix.com

Some of the template changes in Zabbix 5.4 major update

Zabbix self-monitoring templates

Another key aspect of why it’s important to keep your templates up to date is so you can implement the changes made to the Zabbix self-monitoring templates. For example, if we compare Zabbix 5.0 to Zabbix 5.4, there are multiple new Zabbix processes and caches added to Zabbix 5.4, such as report writer/manager process, availability manager process, trend function caches, and other new components.

Zabbix server health template version 5.0 and 5.4 difference in the number of entities

So, if you update from Zabbix 5.0 to 5.4 (or Zabbix 6.0 if you’re sticking with LTS versions), you WILL NOT be monitoring these processes and caches if you don’t update your Zabbix server and Zabbix proxy templates to the ones matching your new Zabbix version. This means that you will be completely unaware of any potential performance issues related to these processes or caches.

Tracking template changes

With Zabbix 5.4 and later, you will notice some great improvements to the template import process. If you’re wondering what has changed when comparing an older template version with a newer one, you will now be able to see the changes during the import process. The added and removed elements will be highlighted accordingly.

Preview of the changes made during the template import process

How often should you update the templates? Ideally, you would follow the Zabbix update release notes and take note of any changes made to the templates that are of use in your environment. At the very least – definitely check for changes in the self-monitoring templates when moving to a newer major version of Zabbix. Otherwise, you risk losing track of potential issues in your Zabbix environment.

Now that you know the answer to the question “How can I update my Zabbix templates?” try and think back to when you last updated your Zabbix instance to a new version – did you also check the official templates for updates? If not, then don’t hesitate and visit https://git.zabbix.com/ to find the latest templates for your Zabbix version. Chances are that you will be pleasantly surprised with a set of new and updated templates for your monitoring endpoints and new webhook media types to help you integrate Zabbix with your existing systems.