Tag Archives: Technical

Advanced Zabbix API – 5 API use cases to improve your API workflows

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/advanced-zabbix-api-5-api-use-cases-to-improve-your-api-workfows/16801/

As your monitoring infrastructure evolves, you might reach a point where using the Zabbix API becomes unavoidable. The Zabbix API can be used to automate parts of your day-to-day workflow, troubleshoot your monitoring, or simply analyze and gather statistics about a specific set of entities.

In this blog post, we will take a look at some of the more advanced API methods and specific method parameters and learn how they can be used to improve your API workflows.

1. Count entities with countOutput

Let’s start with gathering some statistics. Say you have to count the number of matching entities – here we can use the countOutput parameter. For a more advanced use case – what if we have to count the number of events over a specific time period? Let’s combine countOutput with time_from and time_till (in unixtime) and get the number of events created during the month of November that have the Disaster severity:

{
"jsonrpc": "2.0",
"method": "event.get",
"params": {
"output": "extend",
"time_from": "1635717600",
"time_till": "1638223200",
"severities": "5",
"countOutput": "true"
},
"auth": "xxxxxx",
"id": 1
}

2. Use API to perform Configuration export/import

Next, let’s take a look at how we can use the configuration.export method to export one of our templates in yaml:

{
"jsonrpc": "2.0",
"method": "configuration.export",
"params": {
"options": {
"templates": [
"10001"
]
},
"format": "yaml"
},
"auth": "xxxxxx",
"id": 1
}

Now let’s copy the result of the export and import the template into another environment. It’s extremely important to remember that for this method to work exactly as we intend, we need to include the parameters that specify the behavior of the particular entities contained in the configuration string, such as items, value maps, templates, etc. For example, if I exclude the templates parameter here, no templates will be imported.

{
"jsonrpc": "2.0",
"method": "configuration.import",
"params": {
"format": "yaml",
"rules": {
"valueMaps": {
"createMissing": true,
"updateExisting": true
},
"items": {
"createMissing": true,
"updateExisting": true,
"deleteMissing": true
},
"templates": {
"createMissing": true,
"updateExisting": true
},

"templateLinkage": {
"createMissing": true
}
},
"source": "zabbix_export:\n version: '5.4'\n date: '2021-11-13T09:31:29Z'\n groups:\n -\n uuid: 846977d1dfed4968bc5f8bdb363285bc\n name: 'Templates/Operating systems'\n templates:\n -\n uuid: e2307c94f1744af7a8f1f458a67af424\n template: 'Linux by Zabbix agent active'\n name: 'Linux by Zabbix agent active'\n 
...
},
"auth": "xxxxxx",
"id": 1
}

3. Expand trigger functions and macros with expand parameters

Using trigger.get to obtain information about a particular set of triggers is a relatively common practice. One caveat to consider is that, by default, macros in the trigger name, expression, or description are not expanded. To expand the available macros, we need to use the corresponding expand parameters:

{
"jsonrpc": "2.0",
"method": "trigger.get",
"params": {
"triggerids": "18135",
"output": "extend",
"expandExpression":"1",
"selectFunctions": "extend"
},
"auth": "xxxxxx",
"id": 1
}

4. Obtaining additional LLD information for a discovered item

If we wish to display additional LLD information for a discovered entity – in this case, an item – we can use the selectDiscoveryRule and selectItemDiscovery parameters.
While selectDiscoveryRule provides the ID of the LLD rule that created the item, selectItemDiscovery can point us to the parent item prototype ID from which the item was created, the last discovery time, the item prototype key, and more.

The example below returns the item details and also provides the LLD rule and item prototype IDs, the time when the lost item will be deleted, and the last time the item was discovered:

{
"jsonrpc": "2.0",
"method": "item.get",
"params": {
"itemids":"36717",
"selectDiscoveryRule":"1",
"selectItemDiscovery":["lastcheck","ts_delete","parent_itemid"]
}, "auth":"xxxxxx",
"id": 1
}

5. Searching through the matched entities with search parameters

The Zabbix API provides a couple of standard parameters for performing a search. With the search parameter, we can search string or text fields and try to find objects based on one or multiple entries. The searchByAny parameter extends the search – if you set it to true, the search will match ANY of the criteria in the search array instead of requiring an entity to match ALL of them (the default behavior).

The following API call will find items on a particular template whose keys match either 'agent.' or 'zabbix':

{
"jsonrpc": "2.0",
"method": "item.get",
"params": {
"output": "extend",
"templateids": "10001",
"search": {
"key_": ["agent.","zabbix"]
},
"searchByAny":"true",
"sortfield": "name"
},
"auth": "xxxxxx",
"id": 1
}

Feel free to take the above examples, change them around so they fit your use case and you should be able to quite easily implement them in your environment. There are many other use cases that we might potentially cover down the line – if you have a specific API use case that you wish for us to cover, feel free to leave a comment under this post and we just might cover it in one of the upcoming blog posts!

Simplifying Zabbix API workflows with named Zabbix API tokens

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/simplifying-zabbix-api-workflows-with-named-zabbix-api-tokens/16653/

Zabbix API enables you to collect any and all information from your Zabbix instance by using a multitude of API methods. You can even utilize Zabbix API calls in your HTTP items. For example, this can be used to monitor the count of a particular set of metrics and visualize its growth over time. With named Zabbix API tokens, such use cases are a lot simpler to implement.

Before Zabbix 5.4, we had to perform the user.login API call to obtain an authentication token. Once the user session was closed, we had to log in again, obtain a new authentication token, and use it in subsequent API calls.

With the pre-defined named Zabbix API tokens, you don’t have to constantly check if the authentication token needs to be updated. Starting from Zabbix 5.4 you can simply create a new named Zabbix API token with an expiration date and use it in your API calls.
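For comparison, here is a minimal sketch of the old approach (the URL and credentials below are placeholders – adjust them to your own instance):

curl -sk -X POST -H "Content-Type: application/json" -d '
{
"jsonrpc": "2.0",
"method": "user.login",
"params": {
"user": "Admin",
"password": "zabbix"
},
"id": 1
}' https://zabbix.example.com/api_jsonrpc.php

The "result" field of the response is the session token that you previously had to store and refresh. With a named API token, this step disappears entirely – you simply place the pre-generated token into the "auth" field of your API calls.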

Creating a new named Zabbix API token

The Zabbix API token creation process is extremely simple. All you have to do is navigate to Administration – General – API tokens and create a new API token. The named API tokens are created for a particular user and can have an optional expiration date and time – otherwise, the tokens are defined without an expiry date.

You can create a named API token in the API tokens section, under Administration – General

Once the Token has been created, make sure to store it somewhere safe, since you won’t be able to recover it afterward. If the token is lost – you will have to recreate it.

Make sure to store the auth token!

Don’t forget that when defining a role for the particular API user, we can restrict which API methods this user has access to.

Simplifying API tasks with the named API token

There are many different use cases where you could implement Zabbix API calls to collect some additional information. For this blog post, I will create an HTTP item that uses item.get API call to monitor the number of unsupported items.

To achieve that, I will create an HTTP item on a host (This can be the default Zabbix server host or a host dedicated to collecting metrics via Zabbix API calls) and provide the API call in the request body. Since the named API token now provides a static authentication token until it expires, I can simply use it in my API call without the need to constantly keep it updated.

An HTTP agent item that uses a Zabbix API call in its request body

{
    "jsonrpc": "2.0",
    "method": "item.get",
    "params": {
        "countOutput": "1",
        "filter": {
            "state": "1"
        }
    },
    "id": 2,
    "auth": "b72be8cf163438aacc5afa40a112155e307c3548ae63bd97b87ff4e98b1f7657"
}

HTTP item request body, which returns a count of unsupported items

I will also use a regular expression preprocessing step to obtain the numeric value from the API call result – otherwise, we won’t be able to graph the value or calculate trends for it.

Regular expression preprocessing step to obtain a numeric value from our Zabbix API call result
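As a rough sketch of that preprocessing step (assuming the API response has the usual {"jsonrpc":"2.0","result":"5","id":2} shape – verify against your own response), the regular expression and output template could look like this:

Preprocessing step: Regular expression
Pattern: "result":\s*"?(\d+)"?
Output: \1

This extracts just the numeric count, which can then be graphed and used for trend calculations.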

Utilizing Zabbix API scripts in Actions

In one of our previous blog posts, we covered resolving problems automatically with the event.acknowledge API method. The logic defined in that blog post was quite complex, since we needed to keep an eye on the authentication tokens and use a custom script to keep them up to date. With named Zabbix API tokens, this use case is a lot simpler.

All I have to do is create an Action operation script containing my API call and pass it to an action operation.

Action operation script that invokes Zabbix event.acknowledge API method

curl -sk -X POST -H "Content-Type: application/json" -d "
{
\"jsonrpc\": \"2.0\",
\"method\": \"event.acknowledge\",
\"params\": {
\"eventids\": \"{EVENT.ID}\",
\"action\": 1,
\"message\": \"Problem resolved.\"
},
\"auth\": \"<Place your authentication token here>",
\"id\": 2
}" <Place your Zabbix API endpoint URL here>

Problem remediation script example

Now my problems will get closed automatically after the time period which I have defined in my action.

Action operation which runs our event.acknowledge Zabbix API script

These are but a few examples of what we can now achieve by using API tokens. A lot of information can be obtained and filtered in a unique way via the Zabbix API, thus providing you with a granular analysis of your monitored environment. If you have recently upgraded to Zabbix 5.4 or plan to upgrade to Zabbix 6.0 LTS in the future, I would recommend implementing named Zabbix API tokens to simplify your day-to-day workflow and consider the possibilities that this new feature opens up for you.

If you have any questions or if you wish to share your particular use case for data collection or task automation with Zabbix API – feel free to share them in the comments section below!

Combining preprocessing with storing only trend data for high-frequency monitoring

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/combining-preprocessing-with-storing-only-trend-data-for-high-frequency-monitoring/16568/

There are many design choices to consider when we build our monitoring environment for high-frequency monitoring. How do we minimize the performance impact? What are the data retention policies with storage space in mind? What out-of-the-box features are available to solve these potential problems?
In this blog post, we will discuss when you should use preprocessing and when it is better to use the “Do not keep history” option for your metrics, and what the pros and cons of both approaches are.

Throttling and other preprocessing steps

We’ve discussed throttling previously as the go-to approach for high-frequency monitoring. Indeed, with throttling, you can discard repeated values and do so with a heartbeat. This is extremely useful with metrics that come as discrete values – service states, network port statuses, and so on.
Example of throttling with and without heartbeat
In addition, starting from Zabbix 4.2, all preprocessing can also be performed by Zabbix proxies. This means we can discard the repeated values before they reach the Zabbix server. This helps both with performance (fewer metrics to insert into the Zabbix server DB) and with reducing the DB size (fewer metrics stored in the DB, which also improves overall Zabbix performance).
There are a few caveats with this approach – since metrics get discarded before they reach the Zabbix server, triggers will not react to these metrics (this is where having a heartbeat is useful) and, since trends are calculated by the Zabbix server based on the received history data, there could be a lack of trend information for these metrics. Keep in mind that this applies not only to throttling preprocessing rules – any preprocessing can be done on the proxy, and any preprocessing rules can be used to transform your data.
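To illustrate how throttling behaves (hypothetical values, using a ‘Discard unchanged with heartbeat’ step with a 1h heartbeat):

10:00  value 1  -> stored (first value)
10:01  value 1  -> discarded (unchanged)
10:05  value 0  -> stored (value changed)
10:06  value 0  -> discarded (unchanged)
11:06  value 0  -> stored (heartbeat of 1h reached, value stored even though it is unchanged)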

Understanding “Do not keep history” option

The behavior of the “Do not keep history” option, which we can define when configuring an item, is a bit different though. If we collect an item via a proxy and configure the item with “Do not keep history”, the history won’t always get discarded! There are a couple of reasons for this.
  • First off, let’s not forget that some of our values can populate host inventory! If the particular item is configured to populate an inventory field, its value will be forwarded to the Zabbix server, but it will not get stored in the history tables.
  • If the item does not populate an inventory field, text data types such as character, log, and text will indeed get discarded before reaching the Zabbix server, but numeric values – both float and integer – will get forwarded to the server. The reason for that is deriving trend information from the numeric values. Mind that the numeric data will still not get stored in the history tables – only trends will be available for these items.

Note: This behavior has been properly implemented starting from Zabbix 5.2. See ZBX-17548

Setting the “Do not keep history” option for an item

Using trend functions with high-frequency monitoring

With the specifics of “Do not keep history” in mind, we should now recall that starting from Zabbix 5.2 we have trend functions available at our disposal!
Trend functions such as trendavg, trendcount, trendmax, trendmin, and trendsum allow us to perform different kinds of trend calculations – from counting the number of trend values to retrieving min/max/avg trend values for a time period.
This means that if we require only the metric trend for specific time periods (hours, days, weeks, etc.), we can use these trend functions together with the “Do not keep history” option, thus discarding unnecessary data and improving our Zabbix server performance!
There are two approaches to using trend functions:
  • If you wish to collect and display the trend data, you need to create the item which will collect the metrics (say, a net.if.in agent item for collecting incoming network traffic) and then create a separate calculated item that uses the trend function to calculate the avg/min/max value of the trend over a time period. The original item can then have the “Do not keep history” option selected.

trendavg item for calculating hourly trends from the net.if.in[ifHCInOctets.5] item

 

  • If you wish to simply define triggers and react to long-term trends, without needing to collect the trend values, then you can skip the creation of the calculated item and simply use the trend function on the original item directly in the trigger (a sketch of both approaches follows below).

This trigger fires if the hourly average trend value exceeds 100M.
Note: In this case only the original item is required.
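As a sketch of both approaches (assuming the Zabbix 5.4 expression syntax and a hypothetical host named 'Switch01' – adjust the host and item key to your setup), the calculated item formula and the trigger expression could look like this:

Calculated item formula (hourly average of the original item):
trendavg(/Switch01/net.if.in[ifHCInOctets.5],1h:now/h)

Trigger expression using the trend function directly on the original item:
trendavg(/Switch01/net.if.in[ifHCInOctets.5],1h:now/h)>100M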

By combining these approaches in our environment – using preprocessing when we wish to discard or transform the data, and opting out of storing history data whenever appropriate – we can minimize the performance impact on our Zabbix instance. Add a layer of distributed Zabbix proxies on top of this, and you can truly achieve a large, scalable Zabbix infrastructure optimized for high-frequency ingestion and processing of your data.

Keeping your Zabbix templates up to date

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/keeping-your-zabbix-templates-up-to-date/16412/

Have you recently updated your Zabbix environment but are still wondering – why haven’t the templates been updated? Where can I obtain the latest official Zabbix templates, and how should I update them? In this blog post, we will discuss why it is vital to keep your templates up to date and what the template update process looks like.

Updating your templates

“Will updating Zabbix also update my templates?” is a question that I receive quite often. The answer to that question is – no changes are made to your templates whenever you update your Zabbix instance – be it a minor or a major update. The reasoning behind that is quite simple – we always recommend that you tune the out-of-the-box templates as per your particular requirements. That may consist of changing update intervals, disabling items/triggers, or even changing the existing trigger expressions or adding whole new entities to the template.

This is where the current behavior with template updates starts to make more sense. If Zabbix were to automatically update your templates, there would be a chance of overwriting your custom changes, which could potentially disrupt the monitoring of your environment. That is something that we definitely wish to avoid.

The question still stands – Then how am I supposed to update my templates?

The answer – you can find the latest official Zabbix templates on our official git page – https://git.zabbix.com/

First, navigate to the Zabbix repository and open the Templates folder. Then, select the release branch that matches your Zabbix instance version. Here you can find all of our official templates and also the official media types under the media folder. All you have to do now is open the template up and download the raw template file.

Zabbix 5.4 release git templates/db folder

Once that is done, we can import the template into our Zabbix environment

Template template_db_oracle_agent2 import

Don’t forget to back up your existing templates, especially if you have made some custom changes to them! Ideally – add a prefix to their names, so the new and old templates can live side by side, and you can then manually copy over the changes from the latest official template to your custom template.

The benefits of keeping templates up to date

But what is the point of updating your templates – what do you get out of it? Well, that depends on the specific fixes or improvements we make to the particular template over time. Sometimes the updated template will provide improved trigger expressions or preprocessing logic. Other times the updated template will provide extra value to your monitoring with completely new items and triggers. In the case of webhook media types, the updates usually contain fixes or improvements for some particular use cases – for example, fixing a compatibility issue for a specific OS.

You can always track these changes either in the release notes of a particular Zabbix version or by looking up a specific bug or a feature request in our bug tracker – https://support.zabbix.com

Some of the template changes in Zabbix 5.4 major update

Zabbix self-monitoring templates

Another key aspect of why it’s important to keep your templates up to date is so you can implement the changes made to the Zabbix self-monitoring templates. For example, if we compare Zabbix 5.0 to Zabbix 5.4, there are multiple new Zabbix processes and caches added to Zabbix 5.4, such as report writer/manager process, availability manager process, trend function caches, and other new components.

Zabbix server health template version 5.0 and 5.4 difference in the number of entities

So, if you update from Zabbix 5.0 to 5.4 (or Zabbix 6.0 if you’re sticking with LTS versions), you WILL NOT be monitoring these processes and caches if you don’t update your Zabbix server and Zabbix proxy templates to the current Zabbix versions. This means that you will be completely unaware of any potential performance issues related to these processes or caches.

Tracking template changes

With Zabbix 5.4 and later, you will notice some great improvements to the template import process. If you’re wondering what has changed when comparing an older template version with a newer one, you will now be able to see the changes during the import process. The added and removed elements will be highlighted in red or green accordingly.

Preview of the changes made during the template import process

How often should you update the templates? Ideally, you would follow the Zabbix update release notes and take note of any changes made to the templates that are of use in your environment. At the very least – definitely check for changes in the self-monitoring templates when moving to a newer major version of Zabbix. Otherwise, you risk losing track of potential issues in your Zabbix environment.

Now that you know the answer to the question “How can I update my Zabbix templates?” try and think back to when you last updated your Zabbix instance to a new version – did you also check the official templates for updates? If not, then don’t hesitate and visit https://git.zabbix.com/ to find the latest templates for your Zabbix version. Chances are that you will be pleasantly surprised with a set of new and updated templates for your monitoring endpoints and new webhook media types to help you integrate Zabbix with your existing systems.

Monitoring MongoDB nodes and clusters with Zabbix

Post Syndicated from Dmitry Lambert original https://blog.zabbix.com/monitoring-mongodb-nodes-and-clusters-with-zabbix/16031/

Zabbix Agent 2 enables our users to monitor a whole set of new systems with minimal configuration required on the monitored systems. Forget about writing custom monitoring scripts, deploying additional packages, or configuring ODBC. A great use-case for Zabbix Agent 2  is monitoring one of the most popular NoSQL DB backends – MongoDB. Below, you can read a detailed description and step-by-step guide through the use case or refer to the video available here.

Zabbix MongoDB template

For this example, we will be using Zabbix version 5.4, but MongoDB monitoring by Zabbix Agent 2 is supported starting from version 5.0. If you have a fresh deployment of Zabbix version 5.0 or newer, you will be able to find the MongoDB template in your ‘Configuration‘ – ‘Templates‘ section.

MongoDB Node and Cluster templates

On the other hand, if you have an instance that you deployed before the release of Zabbix 5.0 and then upgraded to Zabbix 5.0 or newer, you will have to import the template manually from our git page. Let’s remember that Zabbix DOES NOT apply new templates or modify existing templates during an upgrade. Therefore, newly released templates have to be IMPORTED MANUALLY!

We can see that we have two MongoDB templates – ‘MongoDB cluster by Zabbix Agent 2’ and ‘MongoDB node by Zabbix agent 2’. Depending on your MongoDB setup – individual nodes or a cluster, apply the corresponding template. Note that the MongoDB cluster template can automatically create hosts for your config servers and shards and apply the MongoDB node template on these hosts.

Host prototypes for config servers and shards

Deploying Zabbix Agent 2 on your Host

Since the data collection is done by Zabbix Agent 2, let’s first deploy Zabbix Agent 2 on our MongoDB node or cluster host. We’ll start by adding the Zabbix 5.4 repository and installing Zabbix Agent 2 from a package.

Add the Zabbix 5.4 repository:

rpm -Uvh https://repo.zabbix.com/zabbix/5.4/rhel/8/x86_64/zabbix-release-5.4-1.el8.noarch.rpm

Install Zabbix Agent 2:

yum install zabbix-agent2

What if you already have the regular Zabbix Agent running on this machine? In this case, we have two options for how we can proceed. We can simply remove the regular Zabbix Agent and deploy Zabbix Agent 2. In this case, make sure you back up the Zabbix Agent configuration file and migrate all of your changes to the Zabbix Agent 2 configuration file.

The second option is running both of the Zabbix Agents in parallel. In this case, we need to make sure that both agents – Zabbix Agent and Zabbix Agent 2 – are listening on their own specific ports, because by default both agents listen for connections on port 10050. This can be configured in the agent configuration file by changing the ‘ListenPort’ parameter.

Don’t forget to specify the ‘Server‘ parameter in the Zabbix Agent 2 configuration file. This parameter should contain your Zabbix Server address or DNS name. By defining it here, you will allow Zabbix Agent 2 to accept the metric poll requests from Zabbix Server.
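As a minimal sketch, the relevant lines in /etc/zabbix/zabbix_agent2.conf could look like this (the address and port below are placeholders for illustration):

# Zabbix server that is allowed to poll this agent
Server=192.168.1.10
# Change this only if the regular Zabbix agent already occupies the default port 10050
ListenPort=10060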

After you have made the configuration changes in the Zabbix Agent 2 configuration file, don’t forget to restart Zabbix Agent 2 to apply the changes:

systemctl restart zabbix-agent2

Creating a MongoDB user for monitoring

Once the agent has been deployed and configured, you need to ensure that you have a MongoDB database user that we can use for monitoring purposes. Below you can find a brief example of how you can create a MongoDB user:

Access the MongoDB shell:

mongosh

Switch to the MongoDB admin database:

use admin

Create a user with ‘userAdminAnyDatabase‘ permissions:

db.createUser(
  {
    user: "zabbix_mon",
    pwd: "zabbix_mon",
    roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
  }
)

The username for the newly created user is ‘zabbix_mon’. The password is also ‘zabbix_mon‘ – feel free to change these as per your security policy.

Creating and configuring a MongoDB host

Next, you need to open your Zabbix frontend and create a new host representing your MongoDB node. You can see that in our example, we called our node ‘MongoDB’ and assigned it to the ‘MongoDB Servers’ host group. You can use more detailed naming in a production environment and use your own host group assignment logic. But remember – a host needs to belong to at least one host group!

Since the metrics are collected by Zabbix Agent 2, you must also create an Agent interface on the host. The Zabbix server will connect to this interface and request the metrics from Zabbix Agent 2. Define the IP address or DNS name of your MongoDB host, where you previously deployed Zabbix Agent 2. Mind the port – by default, port 10050 is defined here, but if you have modified the ‘ListenPort’ parameter in the Zabbix Agent 2 config and changed it from the default value (10050), you need to use the same port number here.

MongoDB host configuration example

Next, navigate to the ‘Templates’ tab and assign the corresponding template – either ‘MongoDB node by Zabbix agent 2’ or ‘MongoDB cluster by Zabbix Agent 2’. In our example, we will assign the MongoDB node template.

Before adding the host, you also need to provide authentication and connection parameters by editing the corresponding user macros. These user macros are used by the items that define which metrics we should be collecting. Essentially, we are forwarding the connectivity and authentication information to Zabbix Agent 2, telling it to use these values when collecting the metrics from our MongoDB instance.

To do this, navigate to the ‘Macros’ tab in the host configuration screen. Then, select ‘Inherited and host macros’ to display macros inherited from the MongoDB template.

We can see a bunch of macros here – some of them are related to trigger thresholds and discovery filters, but what we’re interested in right now are the following macros:

  • {$MONGODB.USER} – MongoDB username. For our example, we will set this to zabbix_mon
  • {$MONGODB.PASSWORD} – MongoDB password. For our example, we will set this to zabbix_mon
  • {$MONGODB.CONNSTRING} – MongoDB connection string. Specify the MongoDB address and port here to which the Zabbix Agent 2 should connect and perform the metric collection
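If you wish to verify these values before adding the host, a quick optional check is to test the MongoDB plugin key directly on the monitored machine (a hypothetical example – the connection string below assumes MongoDB is listening locally on the default port):

zabbix_agent2 -t 'mongodb.ping["tcp://localhost:27017","zabbix_mon","zabbix_mon"]'

A returned value of 1 means that the connection string and credentials are valid.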

Now we are ready to add the host. Once the host has been added, we might have to wait for a minute or so before Zabbix begins to monitor the host. This is because Zabbix Server doesn’t instantly pick up the configuration changes. By default, Zabbix Server updates the Configuration Cache once a minute.

Fine-tuning MongoDB monitoring

At this point, we should see a green ZBX Icon next to our MongoDB host.

Data collection on the MongoDB host has started – note the green ‘ZBX’ icon.

This means that the Zabbix Server has successfully connected to our Zabbix Agent 2, and the metric collection has begun. You can now navigate to the ‘Monitoring’ – ‘Latest data’ section, filter the view by your MongoDB host, and you should see all of the collected metrics here.

MongoDB metrics in ‘Monitoring’ – ‘Latest data’

The final task is to tune the MongoDB monitoring on your hosts, collecting only the required metrics. Navigate to ‘Configuration’ – ‘Hosts’, find your MongoDB hosts, and go through the different entity types on the host – items, triggers, discovery rules. See an item that you don’t wish to collect metrics for? Feel free to disable it. Open up the discovery rules – change their update intervals or disable the unnecessary ones.

Note: Be careful not to disable master items. Many of the items and discovery rules here are of the ‘Dependent item’ type, which means that they require a so-called ‘master item’. Feel free to read more about dependent items here.

Remember the ‘Macros’ section in the host configuration? Let’s return to it. Here we can see some macros which are used in our trigger thresholds, like:

  • {$MONGODB.REPL.LAG.MAX.WARN} – Maximum replication lag in seconds
  • {$MONGODB.CURSOR.OPEN.MAX.WARN} – Maximum number of open cursors

Feel free to change these as per your problem threshold requirements.

One last thing here – we can filter which elements get discovered by our discovery rules. This is once again defined by user macros like:

  • {$MONGODB.LLD.FILTER.DB.MATCHES} – Databases that should be discovered (By default, the value here is ‘.*’, which will match everything)
  • {$MONGODB.LLD.FILTER.DB.NOT_MATCHES} – Databases that should be excluded from the discovery
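For example (a hypothetical value – adjust the regular expression to your own naming scheme), to exclude the MongoDB system databases from discovery you could set:

{$MONGODB.LLD.FILTER.DB.NOT_MATCHES} = (admin|config|local)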

And that’s it! After some additional tuning has been applied, we are good to go – our MongoDB entities are being discovered, metrics are getting collected, and problem thresholds have been defined. And all of it has been done with the native Zabbix Agent 2 functionality and an out-of-the-box MongoDB template!

Zabbix frontend as a control panel for your devices

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/zabbix-frontend-as-a-control-panel-for-your-devices/15545/

The ability to define and execute scripts on different Zabbix components in different scenarios can be extremely powerful. There are many different use cases where we can execute these scripts – to remediate an issue, forward our alerts to an external system, and much more. In this post, we will cover one of the lesser-known use cases – creating a control panel of sorts in which we can execute different scripts directly from our frontend.

 

Configuration cache

Let’s use two very popular Zabbix runtime commands for our use case –  ‘zabbix_server -R config_cache_reload’ and ‘zabbix_proxy -R config_cache_reload’. These commands can be used to force the Zabbix server and Zabbix proxy components to load the configuration changes on demand.

First, let’s discuss how these commands work:

It all starts with the configuration cache frequency, which is configured for the central Zabbix server. Have a look at the output:

grep CacheUpdateFrequency= /etc/zabbix/zabbix_server.conf

And on the Zabbix proxy side, there is a similar setting. Let’s take a look:

grep ConfigFrequency= /etc/zabbix/zabbix_proxy.conf

With a stock installation we have ‘CacheUpdateFrequency=60‘ for ‘zabbix-server‘ and we have ‘ConfigFrequency=3600‘ for ‘zabbix-proxy‘. This parameter represents how fast the Zabbix component will pick up the configuration changes that we have made in the GUI.

Apart from the frequency, there is also another variable: how long it actually takes to run one configuration sync cycle. To find the precise time value, we can use this command:

ps auxww | egrep -o "[s]ynced.*sec"

The output will produce a line like:

synced configuration in 14.295782 sec, idle 60 sec

This means that it takes approximately 14 seconds to load the configuration cache from the database. Then there is a break for the next 60 seconds. After that, the process repeats.

When the monitoring infrastructure gets big, we might need to start using larger values for ‘CacheUpdateFrequency‘ and ‘ConfigFrequency‘. By reducing the configuration reload frequency, we can offload our database. The best possible configuration performance-wise is to set ‘CacheUpdateFrequency=3600‘ in ‘zabbix_server.conf‘ and use ‘ConfigFrequency=3600‘ (the default value) in ‘zabbix_proxy.conf‘.

Some repercussions arise with such a configuration. When we use values this large, there will be a delay of up to one hour until newly created entities are monitored or changes are applied to the existing entities.

Setting up the scripts

I would like to introduce a way to force the configuration to be reloaded via the GUI.
Some prerequisites must be configured first:

1) Make sure the  ‘Zabbix server‘ host belongs to the “Zabbix servers” host group.

2) On the server where service ‘zabbix-server‘ runs, install a new sudoers rule:

cd /etc/sudoers.d
echo 'zabbix ALL=(ALL) NOPASSWD: /usr/sbin/zabbix_server -R config_cache_reload' | sudo tee zabbix_server_config_cache_reload
chmod 0440 zabbix_server_config_cache_reload

The sudoers file is required because, out of the box, the ‘zabbix-server‘ service runs under the ‘zabbix‘ user, which does not have the permissions to interact with the local system in this way.

3) We will also create Zabbix hosts representing our Zabbix proxies. These hosts must belong to the ‘Zabbix proxies’ host group.

Notice that in the screenshot the host ‘127.0.0.1′ is using ‘Monitored by proxy‘. This is extremely important, since we do not care about the agent interface in the use case with proxies – the interface can contain an arbitrary address/DNS name. What we care about is the ‘Monitored by proxy’ field. Our command will be executed on the proxy that we select here.

4) On the server where service ‘zabbix-proxy‘ runs, install a new sudoers rule:

cd /etc/sudoers.d
echo 'zabbix ALL=(ALL) NOPASSWD: /usr/sbin/zabbix_proxy -R config_cache_reload' | sudo tee zabbix_proxy_config_cache_reload
chmod 0440 zabbix_proxy_config_cache_reload

5) Make the following changes in the ‘/etc/zabbix/zabbix_proxy.conf‘ proxy configuration file: ‘EnableRemoteCommands=1‘. Restart the ‘zabbix-proxy’ service afterwards.

6) Open ‘Administration’ => ‘Scripts’ and define the following commands:
For the ‘Zabbix servers’ host group:

sudo /usr/sbin/zabbix_server -R config_cache_reload	

Since this is a custom command that we will execute, the type of the script will be ‘Script’. The first script will be executed on the Zabbix server – we are forcing the central Zabbix server to reload its configuration cache. In this example, all users with at least ‘Read’ access to the Zabbix server host will be able to execute the script. You can limit this as per your internal Zabbix policies.

Below you can see how it should look:

For the ‘Zabbix proxies’ host group:

sudo /usr/sbin/zabbix_proxy -R config_cache_reload	

The only thing that we change for the proxy script is the ‘Command’ and ‘Execute on’ parameters, since now the command will be executed on the Zabbix proxy which is monitoring the target host:

Frontend as a control panel

I prefer to add an additional host group “Control panel” which contains the central Zabbix server and all Zabbix proxies.

Now when we need to reload our configuration cache, we can open ‘Monitoring’ => ‘Hosts‘ and filter out host group ‘Control panel’. Then click on the proxy host in question and select ‘config cache reload proxy’:

It takes 5 seconds to complete and then we will see the result of script execution. In this case – ‘command sent successfully’:

By the way, we can bookmark this page too 😉

With this approach, you can create ‘Control panel’ host groups and scripts for different types of tasks that you can execute directly from the Zabbix frontend! This allows us to use our Zabbix frontend not just for configuration and data overview, but also as a control panel of sorts for our hosts.
If you have any questions, comments, or wish to share your use cases for using scripts in the frontend – leave us a comment! Your use case could be the one to inspire many other Zabbix community members to give it a try.

Agentless Oracle database monitoring with ODBC

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/agentless-oracle-database-monitoring-with-odbc/15589/

Did you know that Zabbix has an out-of-the-box template for collecting Oracle database metrics? With this template, we can collect data like database, tablespace, ASM, and many other metrics agentlessly, by using ODBC. This blog post will guide you on how to set up ODBC monitoring for Oracle 11.2, 12.1, 18.5, or 19.2 database servers. This post can serve as the perfect set of guidelines for deploying Oracle database monitoring in your environment.

Download Instant client and SQLPlus

The provided commands apply for the following operating systems: CentOS 8, Oracle Linux 8, or Rocky Linux.

First we have to download the following packages:
oracle-instantclient19.12-basic-19.12.0.0.0-1.x86_64.rpm
oracle-instantclient19.12-sqlplus-19.12.0.0.0-1.x86_64.rpm
oracle-instantclient19.12-odbc-19.12.0.0.0-1.x86_64.rpm

Here we are downloading:

  • Oracle Instant Client – required to establish connectivity to an Oracle database
  • SQLPlus – a tool that we can use to test the connectivity to an Oracle database
  • Oracle ODBC package – contains the required ODBC drivers and configuration scripts to enable ODBC connectivity to an Oracle database

Upload the packages to the Zabbix server (or proxy, if you wish to monitor your Oracle DB via a proxy) and place them in:

/tmp/oracle-instantclient19.12-basic-19.12.0.0.0-1.x86_64.rpm
/tmp/oracle-instantclient19.12-sqlplus-19.12.0.0.0-1.x86_64.rpm
/tmp/oracle-instantclient19.12-odbc-19.12.0.0.0-1.x86_64.rpm

Solve OS dependencies

Install ‘libaio’ and ‘libnsl’ library:

dnf -y install libaio-devel libnsl

Otherwise, we will receive errors:

# rpm -ivh /tmp/oracle-instantclient19.12-basic-19.12.0.0.0-1.x86_64.rpm
error: Failed dependencies:
        libaio is needed by oracle-instantclient19.12-basic-19.12.0.0.0-1.x86_64
        libnsl.so.1()(64bit) is needed by oracle-instantclient19.12-basic-19.12.0.0.0-1.x86_64
# rpm -ivh /tmp/oracle-instantclient19.12-basic-19.12.0.0.0-1.x86_64.rpm
error: Failed dependencies:
        libnsl.so.1()(64bit) is needed by oracle-instantclient19.12-basic-19.12.0.0.0-1.x86_64

Check whether any Oracle components have been previously deployed on the system. The commands below should produce empty output:

rpm -qa | grep oracle
ldconfig -p | grep oracle

Install Oracle Instant Client

rpm -ivh /tmp/oracle-instantclient19.12-basic-19.12.0.0.0-1.x86_64.rpm

Make sure that the package ‘oracle-instantclient19.12-basic-19.12.0.0.0-1.x86_64’ is installed:

rpm -qa | grep oracle

LD config

The official Oracle template page at git.zabbix.com describes a method of configuring Oracle environment variables for the Zabbix server service. For version 19.12 of the instant client, it is NOT REQUIRED to create a ‘/etc/sysconfig/zabbix-server’ file with the following content:

export ORACLE_HOME=/usr/lib/oracle/19.12/client64
export PATH=$PATH:$ORACLE_HOME/bin
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:/usr/lib64:/usr/lib:$ORACLE_HOME/bin
export TNS_ADMIN=$ORACLE_HOME/network/admin

When we installed the rpm package, the Oracle 19.12 client package auto-configured the LD path at the global level – this means every user on the system can use the Oracle instant client. We can see that the LD path has been configured in:

cat /etc/ld.so.conf.d/oracle-instantclient.conf

This will print:

/usr/lib/oracle/19.12/client64/lib

To ensure that the required Oracle libraries are recognized by the OS, we can run:

ldconfig -p | grep oracle

It should print:

liboramysql19.so (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/liboramysql19.so
libocijdbc19.so (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libocijdbc19.so
libociei.so (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libociei.so
libocci.so.19.1 (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libocci.so.19.1
libnnz19.so (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libnnz19.so
libmql1.so (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libmql1.so
libipc1.so (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libipc1.so
libclntshcore.so.19.1 (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libclntshcore.so.19.1
libclntshcore.so (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libclntshcore.so
libclntsh.so.19.1 (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libclntsh.so.19.1
libclntsh.so (libc6,x86-64) => /usr/lib/oracle/19.12/client64/lib/libclntsh.so

Note: If for some reason the ldconfig command shows links to other dynamic libraries – that’s when we might have to create a separate ENV file for Zabbix server/Proxy, which would link the Zabbix application to the correct dynamic libraries, as per the example at the start of this section.

Check if the Oracle service port is reachable

To save us some headaches down the line, let’s first check the network connectivity to our Oracle database host and see if we can reach the default Oracle port at the network level. In this example, we will try to connect to the default Oracle database port, 1521. Depending on which port your Oracle database is listening on, adjust accordingly. Make sure the output says ‘Connected to 10.1.10.15:1521’:

nc -zv 10.1.10.15 1521

Test connection with SQLPlus

We can simulate the connection to the Oracle database before moving on with the ODBC configuration. Make sure that the Oracle username and password used in the command are correct. For this task, we will first need to install the SQLPlus package:

rpm -ivh /tmp/oracle-instantclient19.12-sqlplus-19.12.0.0.0-1.x86_64.rpm

To simulate the connection, we can use a one-liner command. In the example command I’m using the username ‘system’ together with the password ‘oracle’ to reach out to the Oracle database server ‘10.1.10.15’ via port ‘1521’ and connect to the service name ‘xe’:

sqlplus64 'system/oracle@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.10.15)(PORT=1521)))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=xe)))'

In the output we can see that we are using the 19.12 client to connect to an 11.2 server:

SQL*Plus: Release 19.0.0.0.0 - Production on Mon Sep 6 13:47:36 2021
Version 19.12.0.0.0
Copyright (c) 1982, 2021, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production

Note: This gives us an extra hint regarding the Oracle instant client – newer versions of the client are backwards compatible with older versions of the Oracle database server. Though this doesn’t apply to every combination of Oracle client/server versions, please check the Oracle instant client documentation first.

ODBC connector

When it comes to configuring ODBC, let’s first install the ODBC driver manager:

dnf -y install unixODBC

Now we can see that we have two new files –  ‘/etc/odbc.ini’ (possibly empty) and ‘/etc/odbcinst.ini’.

The ‘/etc/odbcinst.ini’ file describes the installed ODBC drivers. Currently, no Oracle driver is registered – the output is empty when we grep for the keyword ‘oracle’:

grep -i oracle /etc/odbcinst.ini

Our next step is to Install Oracle ODBC driver package:

rpm -ivh /tmp/oracle-instantclient19.12-odbc-19.12.0.0.0-1.x86_64.rpm

The ‘oracle-instantclient*-odbc’ package contains a script that will update the ‘/etc/odbcinst.ini’ configuration automatically:

cd /usr/lib/oracle/19.12/client64/bin
./odbc_update_ini.sh / /usr/lib/oracle/19.12/client64/lib

It will print:

 *** ODBCINI environment variable not set,defaulting it to HOME directory!

Now when we print the file on the screen:

cat /etc/odbcinst.ini

We will see that the Oracle 19 ODBC driver section has been added at the end of the file:

[Oracle 19 ODBC driver]
Description     = Oracle ODBC driver for Oracle 19
Driver          = /usr/lib/oracle/19.12/client64/lib/libsqora.so.19.1
Setup           =
FileUsage       =
CPTimeout       =
CPReuse         =

It’s important to check that no errors are produced in the output when executing the ‘ldd’ command. This ensures that the dependencies are satisfied and accessible, and that there are no conflicts with the library versioning:

ldd /usr/lib/oracle/19.12/client64/lib/libsqora.so.19.1

It will print something similar to:

linux-vdso.so.1 (0x00007fff121b5000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fb18601c000)
libm.so.6 => /lib64/libm.so.6 (0x00007fb185c9a000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fb185a7a000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00007fb185861000)
librt.so.1 => /lib64/librt.so.1 (0x00007fb185659000)
libaio.so.1 => /lib64/libaio.so.1 (0x00007fb185456000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fb18523f000)
libclntsh.so.19.1 => /usr/lib/oracle/19.12/client64/lib/libclntsh.so.19.1 (0x00007fb1810e6000)
libclntshcore.so.19.1 => /usr/lib/oracle/19.12/client64/lib/libclntshcore.so.19.1 (0x00007fb180b42000)
libodbcinst.so.2 => /lib64/libodbcinst.so.2 (0x00007fb18092c000)
libc.so.6 => /lib64/libc.so.6 (0x00007fb180567000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb1864da000)
libnnz19.so => /usr/lib/oracle/19.12/client64/lib/libnnz19.so (0x00007fb17fdba000)
libltdl.so.7 => /lib64/libltdl.so.7 (0x00007fb17fbb0000)

When we executed the ‘odbc_update_ini.sh’ script, a new DSN (data source name) file was created in ‘/root/.odbc.ini’. This is a sample ODBC configuration file that describes which settings this version of the ODBC driver supports.

Let’s move this configuration file from the user directories to a location accessible system-wide:

cat /root/.odbc.ini | sudo tee -a /etc/odbc.ini

And remove the file from the user directory completely:

rm /root/.odbc.ini

This way, every user in the system will use only this one ODBC configuration file.

We can now alter the existing configuration – /etc/odbc.ini. I’m highlighting things that have been changed from the defaults:

[Oracle11g]
AggregateSQLType = FLOAT
Application Attributes = T
Attributes = W
BatchAutocommitMode = IfAllSuccessful
BindAsFLOAT = F
CacheBufferSize = 20
CloseCursor = F
DisableDPM = F
DisableMTS = T
DisableRULEHint = T
Driver = Oracle 19 ODBC driver
DSN = Oracle11g
EXECSchemaOpt =
EXECSyntax = T
Failover = T
FailoverDelay = 10
FailoverRetryCount = 10
FetchBufferSize = 64000
ForceWCHAR = F
LobPrefetchSize = 8192
Lobs = T
Longs = T
MaxLargeData = 0
MaxTokenSize = 8192
MetadataIdDefault = F
QueryTimeout = T
ResultSets = T
ServerName = //10.1.10.15:1521/xe
SQLGetData extensions = F
SQLTranslateErrors = F
StatementCache = F
Translation DLL =
Translation Option = 0
UseOCIDescribeAny = F
UserID = system
Password = oracle

  • DSN – Data source name. Should match the section name in brackets, e.g. [Oracle11g]
  • ServerName – Oracle server address
  • UserID – Oracle user name
  • Password – Oracle user password

To test the connection from the command line, let’s use the isql command-line tool, which simulates an ODBC connection akin to what Zabbix does when gathering metrics:

isql -v Oracle11g

The isql command in this example picks up the ODBC settings (username, password, server address) from the odbc.ini file. All we have to do is reference the particular DSN – Oracle11g.

On the other hand, if we prefer not to keep the password on the filesystem (/etc/odbc.ini), we can remove the ‘UserID’ and ‘Password’ lines. Then we can test the ODBC connection with:

isql -v Oracle11g 'system' 'oracle'

In case of a successful connection it should print:

+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL>
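As an optional extra check (a hypothetical example – adjust the DSN and credentials to your setup), you can also pipe a test query through isql to confirm that data can actually be retrieved over the ODBC connection:

echo "select sysdate from dual" | isql -v Oracle11g system oracle -b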

And that’s it for the ODBC configuration! Now we should be able to apply the ‘Oracle by ODBC’ template in Zabbix.

Don’t forget that we also need to provide the necessary Oracle credentials to start collecting Oracle database metrics:

The lessons learned in this blog post can be easily applied to ODBC monitoring and troubleshooting in general, not just Oracle. If you’re having any issues or wish to share your experience with ODBC or Oracle database monitoring – feel free to leave us a comment!

Maintaining Zabbix API token via JavaScript

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/maintaining-zabbix-api-token-via-javascript/15561/

In this blog post, we will talk about maintaining and storing the Zabbix API session key in an automated fashion. This post builds upon the “Close problem automatically via Zabbix API” topic and can be used as extra configuration for that particular use case. It also shares a great example of synthetic monitoring by way of JavaScript preprocessing – how to emulate a scenario in an automated fashion and get alerted in case of any problems.

Prerequisites

First, let us create the Zabbix API user and user macros where we will store our username, password, Zabbix URL and the API session key.

1) Open “Administration” => “Users”. Create a new user ‘api’ with password ‘zabbix’. At the permissions tab set User Type “Zabbix Super Admin”.

2) Go to “Administration” => “General” => “Macros”. Configure base characteristics:

      {$Z_API_PHP} = http://demo.zabbix.demo/api_jsonrpc.php
     {$Z_API_USER} = api
 {$Z_API_PASSWORD} = zabbix
{$Z_API_SESSIONID} =

It’s OK to leave {$Z_API_SESSIONID} empty for now.

3) Let’s check if the Zabbix backend server can reach the Zabbix frontend server. Make sure that you are logged into the Zabbix backend server by looking up the zabbix_server process:

ps auxww | grep "[z]abbix_server.*conf"

Ensure that we can reach the Zabbix frontend by curling the Zabbix frontend server from the Zabbix backend server:

curl -s "http://demo.zabbix.demo/index.php" | grep Zabbix

4) Download template “Check and repair Zabbix API token” and import it in your instance.

5) Create a new host with an Agent interface and link the template. The IP address of the host does not matter. The template will use an agentless check to do the monitoring, it will use an “HTTP agent” item.

How it works

Our goal for today is to figure out a way to keep the Zabbix API authentication token up to date in a user macro. This way, we can reuse the macro repeatedly for items, action operations, and scripts that require us to use the Zabbix API. We need to ensure that even if the token changes, the macro gets automatically updated with the new token value! Let’s try to understand each step of the underlying workflow required to achieve this goal.

The first component of our workflow is the “Validate session key raw” item. This is an HTTP agent item that performs a POST request with an arbitrary method – proxy.get in this case, but we could have used ANY other method. We simply want to check if an arbitrary Zabbix API call can be executed with the current {$Z_API_SESSIONID} macro value.

The second part of the workflow is the “Repair session key” dependent item. This item utilizes the JavaScript preprocessing step with custom JavaScript code to check the values obtained by the previous item and generate a new authentication token if that is necessary.

The third item – “Status”, is another dependent item that uses regular expression preprocessing steps to check for different error messages or status codes in the value of the “Validate session key raw” item. Most of the triggers defined in this template will react to the values obtained by this item.
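For illustration, the request body of the “Validate session key raw” HTTP agent item could look roughly like this (the exact body in the template may differ – the important part is that some lightweight API method is called with the {$Z_API_SESSIONID} macro in the auth field):

{
"jsonrpc": "2.0",
"method": "proxy.get",
"params": {
"output": ["host"]
},
"auth": "{$Z_API_SESSIONID}",
"id": 1
}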

 

Below you can see the full underlying workflow:

Code-wise, the magic is implemented with the following JavaScript code snippet:

if (value.match(/Session terminated/)) {

var req = new CurlHttpRequest();

// Zabbix API
var json_rpc='{$Z_API_PHP}'

// lib curl header
req.AddHeader('Content-Type: application/json');

// First request to get authentication token
var token =  JSON.parse(req.Post(json_rpc,
'{"jsonrpc":"2.0","method":"user.login","params":{"user":"{$Z_API_USER}","password":"{$Z_API_PASSWORD}"},"id":1,"auth":null}'
));

// If authentication was unsuccessful
if ( token.error )
{
// Login name or password is incorrect
return 32500;
}

else {
// Update the macro

// Get the global macro ID
// We cannot plot here a very native Zabbix macro because it will be automatically expanded
// Must use a workaround to distinguish a dollar sign from the actual macro name and merge with '+'
var id = JSON.parse(req.Post(json_rpc,
'{"jsonrpc":"2.0","method":"usermacro.get","params":{"output":["globalmacroid"],"globalmacro":true,"filter":{"macro":"{$'+'Z_API_SESSIONID'+'}"}},"auth":"'+token.result+'","id":1}'
)).result[0].globalmacroid;

// This line contains a keyword '+value+' which will grab exactly the previous value outside this JavaScript snippet
var overwrite = JSON.parse(req.Post(json_rpc,
'{"jsonrpc":"2.0","method":"usermacro.updateglobal","params":{"globalmacroid":"'+id+'","value":"'+token.result+'"},"auth":"'+token.result+'","id":1}'
));

// Return the id (an integer) of the macro, which was updated
return overwrite.result.globalmacroids[0];
}

} else {
return 0;
}

Throughout the JavaScript code, we are extracting a value from one step and using that as an input for the next step.

After the data has been collected, the values are analyzed by 4 triggers. 3 out of these 4 triggers report a misconfiguration problem that requires human investigation. We also have a “Repairing Zabbix API session key in background” trigger – this is the main trigger, indicating that the token has expired and is being repaired automatically.

And that’s it – now any integration that requires a Zabbix API authentication token can obtain the current token value by referencing the user macro that we created in this article. We’ve ensured that the token will always stay up to date, and in case of any issues, we will receive an alert from Zabbix!

Setting up manual ticket creation using Zabbix frontend scripts

Post Syndicated from Nathan Liefting original https://blog.zabbix.com/setting-up-manual-ticket-creation-using-zabbix-frontend-scripts/15550/

In this blog post, you will learn how to set up manual ticket creation through the use of Zabbix frontend scripts. We will use Jira Service Desk as an example, but this guide should work for any type of service desk or help desk system, as we can apply this technique for other systems in a similar fashion.

 

Introduction

Zabbix already has the ability to automatically create tickets through the use of Media Types and Actions. This way we can filter out certain issues based on host groups, severity, tags and a lot more. But what if your action misses a problem because of your filters? Or what if you simply want to limit the number of tickets that are created by actions? That’s when we can use Zabbix Frontend Scripts to manually create a ticket from the Zabbix Frontend, while automatically filling in the ticket info with a click of a button.

Let’s take a look at how we can achieve this.

How to

Setting up Jira

As mentioned, in this example we’ll use Jira. Why? Simply because personally I am a big fan of the Atlassian product suite and it’s what I had available on hand. Feel free to apply this technique to any service desk system out there though, as we are definitely not limited to Jira Service Desk.

To make the frontend script work we are going to need a working integration (Media Type in Zabbix terminology). You can check out https://zabbix.com/integrations to see if the service desk system of your choosing is already available out of the box. If not, you can always use a community solution or build your own integration. Jira Service Desk is already available though, meaning we can use the settings pre-defined in Zabbix. For Jira Service Desk we’ll need:

  • jira_url – The actual URL of your Jira instance. For example: https://company.atlassian.net/
  • jira_user – Jira user login, in our instance this is my email.
  • jira_project_key – Numeric key of the Jira project. We can find this in the URL once we go to our Jira Service Desk project settings. For example: pid=10054
  • jira_issue_type – Number of the issue type to be used when creating new issues from Zabbix. Check out Project Settings -> Request types and use the number in the URL for the request type you’d like to use.
  • jira_password – Password or API token. We can create this at: https://id.atlassian.com/manage/api-tokens

Make sure to save the API token somewhere, as you can only copy it over once.
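Before plugging these values into Zabbix, it can be worth verifying them with a quick call to the Jira REST API. A minimal sketch, assuming Jira Cloud, the example URL above and your own user/token (jq is only used here to prettify the check):

# Sanity-check the Jira URL, user and API token before configuring Zabbix (values are placeholders)
curl -s -u "you@example.com:YOUR_API_TOKEN" "https://company.atlassian.net/rest/api/2/myself" | jq .displayName

If this returns your display name, the same values can be used for the jira_user and jira_password parameters later on.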

Setting up the frontend script

Now once you have set up everything on the Jira (or whichever other service desk you use) side, we can continue with the next step, which is setting up the frontend script. If we look at the particular Zabbix Media Type, we can see something interesting.

The Jira ServiceDesk default Media Type is of course already set up, but that’s not the most interesting part – although super useful nevertheless. Looking at the Media Type we have our Parameters and the Javascript script. If we take a look at the Frontend Script configuration and compare it with our Media Type, that’s when it gets interesting:

With the release of Zabbix 5.4 it is now possible to execute webhooks directly with a frontend script, as we can see we have the same Parameter and Script options available in this section. This means that most of what we have to do is navigate to Administration -> Scripts and copy over the Media Type Parameters and script. So, let’s do that and fill in the parameters with default values:

Do not forget to do the same thing for the script part:

Once you’ve set this up, we aren’t completely done yet. We have copied the default parameter values as defined by Zabbix in the default Jira ServiceDesk Media Type. Now that we’ve copied those from the Media Type to the Frontend script, we still need to edit them to reflect our Jira ServiceDesk parameters. We need to edit the following fields:

alert_message: We can add our own alert message here, which will fill out the body of our Jira Service Desk ticket. Use the Zabbix built-in macros wisely here. Something like:

There is a problem on {HOST.HOST} - {EVENT.NAME} with severity: {EVENT.SEVERITY} since {EVENT.DATE}

alert_subject: This will be the subject of our Jira ticket, again heavily reliant on built-in macros. It can be something like:

Severity:{EVENT.SEVERITY} Problem on:{HOST.HOST} - {EVENT.NAME}

event_source: We set this to 0, meaning it is a problem. Normally this is dynamic, but we can make it static for the script.

0

 

Then, as discussed in the earlier part of this post we need to edit the following parameters:

jira_password 

jira_request_type_id 

jira_servicedesk_id

jira_url 

jira_user 

 

Once you have set up all of those parameters, it should now look like this:

The result

Now, if all of the information from the previous steps is filled in, we can test the integration. Navigate to Monitoring -> Problems and click on any of the problem names:

Which will open a dropdown menu. Go to ServiceDesk and click the Create Jira ticket button.

This will then kick off our new Zabbix frontend script. The webhook script will be using the Jira API to create a ticket and voila:

A ticket is created, right from the frontend. It’s just that easy.

Conclusion

Using Media Types is great and I would definitely recommend using them. If you have set up the correct filtering using host groups, severities, tags and perhaps more than that, you can already keep the ticket count to a minimum. But there’s always a downside to using a lot of filters: you might miss something. That’s where our frontend script implementation kicks in. Maybe your filter missed a problem, maybe you don’t want automated tickets at all. We can use the script to manually create tickets quickly from the frontend and solve that issue.

You’ve now got a powerful way to do so, fully compatible with Zabbix out of the box most of the time! As Zabbix adds more and more of this functionality, it becomes easier and easier to do so, if you just know where to look.

I hope you enjoyed reading this blog post and if you have any questions or need help configuring anything on your Zabbix setup feel free to contact me and the team at Opensource ICT Solutions. We build a ton of cool integrations like this and much more!

Nathan Liefting

https://oicts.com


“ICMP Rings” for better Dependencies

Post Syndicated from Olger Diekstra original https://blog.zabbix.com/icmp-rings-for-better-dependencies/15157/

When you have devices spread across different locations and monitor these with a single Zabbix instance, you’ll encounter a challenge managing the various latencies to each location, especially when these locations span the world. Ping times can vary wildly from 10ms to 500ms and more depending on the internet connections.

Flexible latency threshold with User macros

Setting the max latency for all devices at 500ms isn’t really a good option, and overriding the {$ICMP_RESPONSE_TIME_WARN} for each individual device doesn’t scale well.

There is a better way though.

First move the {$ICMP_RESPONSE_TIME_WARN} macro from the “Template Module ICMP Ping” into a global macro (with the default value of 0.15, which is 150ms) and remove the macro from the template.

You can find global macros in the “Administration->General” menu. Click on the “GUI” dropdown and select “Macros”.


Then create templates for each location and set a custom {$ICMP_RESPONSE_TIME_WARN} macro (overriding the global macro) based on what was best for that particular location.
Lastly, add all devices for a location to the template for that location.

Now when a new device is added to a location, all that is needed is to add it to the correct template. An added benefit is that when the latency changes, changing the macro in the affected template changes the response time for every device that relies on that template.


Defining dependencies

Because networks often work as a tree structure, using dependencies can help suppress alerts for devices downstream. However, Zabbix’s dependency structure isn’t very intelligent. If an upstream device is checked just before an issue occurs, then a downstream device might hit
the 3 failed checks before the upstream device does, and alert. This can lead to many alerts even though Zabbix might suppress most of these alerts on the dashboard once it detects the upstream device is offline too.

Every network is different and this approach may need some tweaking for your network, but in our case, each remote location has a firewall that Zabbix pings, a switch that connects to that firewall, hosts that connect to the switch, and in some cases VM’s that reside on a host.
Each firewall has an external and an internal interface, which we monitor separately (if the internal interface is down but the external interface is up, that means something happened to the connected switch).

So, we created a concept of “ICMP rings”.

The four rings

Zabbix pings the external interface of a remote firewall every 60s (default ICMP update interval). We’ll call this “Ring 0”. The internal firewall interface is dependent on the external interface; this internal interface sits in “Ring 1”. Usually, the next device is a switch, which is dependent on the internal firewall interface; this switch lives in “Ring 2”. Hosts that we want to monitor on the remote network are dependent on the switch, and therefore live in “Ring 3”. We also have remote VMs which exist on a remote host, and these live in “Ring 4”.
In each ring we increase the update interval time.


This means the further downstream a device is, the longer it takes to detect an outage.

But the upshot is that we won’t get alert storms any more.

First move the {$ICMP_UPDATE_INTERVAL} macro from the “Template Module ICMP Ping” to a global macro (with a value of 1m) and create templates called:

ICMP_Ring1_FW
ICMP_Ring2_SW
ICMP_Ring3_Hosts
ICMP_Ring4_VMs

Each Ring template has a macro {$ICMP_UPDATE_INTERVAL} set (overriding the global macro) with the below values for each ring.

ring 0 = 60s
ring 1 = 80s
ring 2 = 107s
ring 3 = 143s
ring 4 = 191s

These are not random. In order to always have an upstream device alert before a downstream device, you need to take the update interval (in seconds) of the upstream ring, multiply it by the number of failed checks plus one (3 + 1 = 4), and then divide it by three (rounding up).
The resulting update interval will ensure an upstream device always alerts before a downstream device. The downside of this approach is that the further downstream a device is, the longer it takes to detect a problem.
(This could be mitigated by using a Zabbix proxy, but we’re working with a single instance).
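The arithmetic can be sanity-checked with a few lines of shell. This is only a sketch of the formula described above, using integer math that rounds up to the next whole second:

#!/bin/sh
# Derive per-ring ICMP update intervals from the 60s base interval.
# Each downstream ring = upstream interval * (3 failed checks + 1) / 3, rounded up.
interval=60
echo "ring 0 = ${interval}s"
for ring in 1 2 3 4
do
    interval=$(( (interval * 4 + 2) / 3 ))
    echo "ring $ring = ${interval}s"
done

Running it reproduces the values listed above: 60s, 80s, 107s, 143s and 191s.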


🤔  How did I get to this formula? That’s pretty simple. Consider a firewall and switch, where the switch is dependent on the firewall. Now let’s assume that seconds after the firewall is checked (and found responsive), the link goes down. Seconds later the switch is checked and found unresponsive. Strike one for the switch. Now the firewall is checked again, and is now unresponsive too. Strike one for the firewall. Next the switch is checked again, and it’s strike two. By the time the firewall hits strike 3, the switch is already well and truly alerting.

To allow the firewall to fail 3 checks before the switch does, the switch’s 3 checks need to span the time of 4 firewall checks. We could simply increase the interval for every subsequent ring, but that’s not as efficient (we’d be adding 60s every time we increase the interval).

Rather, we can work with the seconds. 4 x 60 seconds equals 240 seconds. Divide that by 3 intervals (for the next ring) and that results in 80 seconds. If we check the switch every 80 seconds, those 3 checks total 240 seconds. We could add 1 or 2 seconds of margin to ensure the firewall has some extra time, but every second we add, we also need to add to subsequent downstream rings, which means the time we need to detect issues becomes longer and longer.


ℹ  A tip on managing dependencies.

In Zabbix version 5.0 the URL https://zabbix/triggers.php isn’t available via the menu, but it allows you to filter triggers, for instance all ICMP unavailable triggers, and see what each is dependent on. We use host groups to set locations for devices, and combining a location host group with an ICMP unavailable trigger allows you to quickly review whether dependencies are (correctly) set, and if so, to what. In later versions this link will not work, since it now requires a context in the URL.

Deploying and configuring Zabbix 5.4 in a multi-tenant environment

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/deploying-and-configuring-zabbix-5-4-in-a-multi-tenant-environment/15109/

In this post and the video, we will discuss deploying and configuring Zabbix 5.4 in a multi-tenant environment and how Zabbix is finally ready for real multi-tenant use cases thanks to multiple features.

Contents

I. Monitoring requirements of multi-tenant environments (0:30)
II. Supported monitoring approaches (2:32)

III. Zabbix and multi-tenant environment (5:56)

IV. To-do list (21:36)
V. Questions & Answers (23:32)

Monitoring requirements of multi-tenant environments

Before talking about Zabbix, let us first analyze the core requirements behind multi-tenant environments. Such environments can be quite complex, with a particular set of prerequisites that we have to be sure we can satisfy before continuing further.

  • The core idea behind multi-tenancy is support for multiple customers. Therefore we need to support granular role/permission schema. The ability to define different roles for different customers and limit what they can access is key to success for such deployments.
  • Multiple customers means a lot of data. No matter if we’re talking about a single Zabbix instance or scaling by deploying multiple Zabbix instances (say, for different regions) we need to have the ability to process large amounts of data. 
  • On top of that, we must be able to scale upwards, ideally – both horizontally and vertically. More customers, different requirements, varying amounts of data to process – all of this needs to be accounted for in advance.
  • Redundancy is another key factor for us. As service providers, we absolutely cannot afford any downtime or data loss. While this may be acceptable in our own home labs or classrooms, this is not the case here. Unscheduled downtime could potentially result in a loss of a customer.

Supported monitoring approaches

Now that we have covered the architectural requirements, let’s focus on data collection. No matter the monitoring solution, the easiest approach in most cases would simply be to tell our customer to deploy an agent and be done with it. Unfortunately, this often doesn’t sit well with the end user and their security team. Let us also not forget that it is simply not possible to deploy an agent on some devices or environments – what then? Having a vast selection of monitoring methods is key for successful deployment of a multi-tenant monitoring service.

Let us take a look at this in the context of Zabbix.

Agent

  • With Zabbix agents we can obtain data in two ways – in passive (polling) and active modes (trapping). This is extremely useful while working with multiple customers, since each of them will have different internal network security policies. I have personally seen cases where only one of these approaches is supported, while the other is restricted by the security policies.
  • Agents also support deployment on different platforms (Windows, Unix, etc.), as well as execution of external third-party scripts either by way of User parameters or system.run item key.
  • Active agents are also capable of reading log files and event logs on Windows environments. This can be extremely useful, since many applications, even in-house ones, can provide a lot of monitoring data by logging it.

Agent-supported deployment on various platforms

 

Since we need to stay flexible, there are many other monitoring approaches supported by Zabbix that we can utilize:

  • SNMP, HTTP, IPMI and SSH agentless monitoring.
  • Simple checks (ICMP pings, port statuses).
  • Database and Java application monitoring.
  • External scripts (executed by the Zabbix server, Zabbix proxy or Zabbix agent).
  • Aggregations and calculations of existing data.
  • VMware monitoring and integration.
  • Web monitoring by creating web scenarios.
  • Synthetic monitoring for simulating real life user transactions.

Latest improvements

Why are we putting emphasis on multi-tenancy just now? The reason is a couple of great features added in the last few releases. These features finally allow us to utilize Zabbix in a truly multi-tenant environment:

Added in Zabbix 5.2:

  • Ability to create customizable user roles based on user types;
  • Secrets can now be stored in a highly secure external vault;
  • Improvements in configuring frontend were also added. For example, each user can now select their time zone for frontend data display. This will be relevant for users in different geographical locations.

Added in Zabbix 5.4:

  • Users now have the ability to send scheduled reports. This is extremely useful for customers who may wish to receive scheduled reports about their environments. Now, instead of utilizing third-party scripts to export data and generate reports, you can use the native Zabbix functionality.
  • Major performance improvements have also been added, especially for really large instances with tens of thousands of new incoming values per second.

Zabbix and multi-tenant environment

How do we use Zabbix in a multi-tenant environment? Essentially, we provide Zabbix as a service. We use the Zabbix monitoring tool to monitor our clients (ABC and BCD in the image). We monitor their network traffic, their operating system statistics, application statistics, log files, etc. For each tenant, these monitoring requirements are going to be different.

Multi-tenant environment

Zabbix proxy

Multi-tenancy would not be possible without Zabbix proxies. We can deploy Zabbix proxies in customer offices, data centers and organization branches and collect data locally. Since proxies also perform preprocessing, we can even utilize them to transform and normalize metrics, or even discard some of the collected metrics before forwarding them to the central Zabbix backend server.

  • Proxies are capable of performing preprocessing ever since Zabbix version 5.0. This allows us to normalize and transform data, for example – change our textual data to numeric data, use throttling and other pre-processing approaches. Even custom JavaScript is supported nowadays to format or normalize the data before we send it to our central Zabbix backend server. So, instead of the server being responsible for all of the preprocessing and having quite a large preprocessing overhead, now the proxy can do it and then forward the data to the server.
  • In addition, on the proxy, the data gets compressed before forwarding to the server thus saving some network traffic overhead.
  • The proxy still continues collecting data and storing it in its own database even in case of a network outage on the customer’s site.
  • Once the data is collected by the proxy, it gets sent to the server via a single connection, which is a lot more feasible from the network security perspective. In this case we need to create only a single firewall rule, as opposed to a wide array of rules if we were to monitor the customer’s site directly from the central Zabbix backend server.
  • We can execute remote scripts on the proxy.
  • We can also deploy multiple proxies to improve scalability. If a single proxy cannot handle the amount of data that we are gathering or preprocessing, we can always deploy an extra proxy. They are easy to deploy, and can even use out-of-the-box SQLite databases (a minimal configuration sketch follows below).
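To illustrate the last point, a minimal active proxy configuration backed by SQLite could look like the sketch below. The server address, proxy name and file paths are placeholders, not values from this article:

# /etc/zabbix/zabbix_proxy.conf - minimal active proxy with SQLite (sketch)
# 0 = active proxy
ProxyMode=0
# Central Zabbix server the proxy reports to
Server=zabbix.example.com
# Must match the proxy name configured in the frontend
Hostname=customer-abc-proxy
# SQLite database file, created automatically on first start
DBName=/var/lib/zabbix/zabbix_proxy.db
# How often the proxy pulls its configuration from the server, in seconds
ConfigFrequency=300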

Passive and active proxies

With proxies we can also select the direction of the connection. We can deploy passive proxies, which get polled by the Zabbix backend server. In that case, the Zabbix server pulls the data from the proxy. In this scenario the Zabbix server is the one responsible for establishing the connection to the proxy. This adds a minor performance overhead to the Zabbix backend server. On the other hand, we can also deploy active proxies, where we remove that overhead from the server and the proxy sends the data autonomously to the server.

At the end of the day, similar to how it goes with agent requirements, the proxy mode will depend on the security policies of the customer. Don’t forget that we aren’t restricted to a single type of proxy –  we can have both of these proxy types running at the same time.

Selecting the connection direction

Data preprocessing — throttling

Preprocessing can help us not only normalize our data, but we can also utilize it to save up on storage and performance overhead, which is vital in large environments.

When monitoring a service or an application state, we are going to be obtaining discrete values such as 1, 2, or 3, or any number. These numbers have a tendency to repeat – if our server stays up, we are going to continue receiving a number which represents “Up”. By using the preprocessing method called throttling, we can decrease the amount of these numbers stored by discarding repeating values. Only status changes are stored, therefore we can potentially save some database space and remove unneeded data processing overhead.

Discarding unchanged values

 

At this point in time, this feature sees more and more usage in many Zabbix environments, though it was severely underutilized initially when Zabbix 5.0 came out. So, if you aren’t using throttling yet and you’re running on 5.0 or newer, I definitely suggest trying to implement it to some extent. It is available in Preprocessing section of the item configuration.

Permissions

Robust permission design is essential to a multi-tenant environment. Even though permission logic has seen the addition of roles, the user group to host group relations haven’t been abandoned and still play a vital role in the overall permission schema.

With roles we still have to utilize the three user types – Users, Admins, and Super admins.

User role overview

Here you can see the user role and the UI elements the user has access to together with API restrictions and the actions the user can perform.

Roles grant the ability to configure access to specific UI elements, actions and restrict API calls in a granular fashion. So, when you’re configuring a role, you will see a screen similar to the one below:

Configuring user roles

User roles

Here you can select the User type. The user type restrictions still apply: Users can get access only to Monitoring and Inventory, Admins can get access to every section except Administration, and Super admins can get access to every section, including Administration.

With roles we can further restrict these user types. You can have Super admins with some limitations, so that they could only do specific actions and access specific UI sections.

This option has two core benefits. The first one is security as we can limit what our customers can do and what they can access. The other benefit is in the UX, as we can simplify the UI for our users, especially people not experienced with Zabbix. We can restrict the visibility of the sections that the end users don’t have access to, so they will not be concerned with navigating through multiple sections and subsections that they are not familiar with.

User groups

We still have user groups and user groups to host groups relations, which we have to take into consideration. Access to hosts is defined on User groups. So, we have to define our user groups and assign Full/Read only/Deny permissions on particular host groups. This is how we limit what specific customers can access.

User groups

In addition, we can have host groups defined in a hierarchical manner. For instance, if you have two customers, each of them having a “Network Devices” subgroup, we can select to include the root group and all of its subgroups when assigning user group to host group permissions. This is a really elegant and quick way to give a User or an Admin on the customer’s side access to all of their hosts, or to limit a specific organizational unit to only access what they need, e.g.: only permit access to network devices for network administrators.

Using group hierarchy

High availability

The next important decision is the HA implementation. Going without some sort of HA solution is simply too risky and therefore is not an option with such environments.

  • HA can be used to minimize downtime and add redundancy.
  • Zabbix supports Linux HA tools – PCS, Corosync, Pacemaker – which are used to enable HA. You are also welcome to try and use other third-party tools for HA.
  • Out-of-the-box HA is planned for Zabbix 6.0.

HA setup

To achieve a quorum in our HA environment, we will require an odd number of nodes. For Zabbix backend HA it is very much recommended to have at least three nodes. Does that mean that you have to deploy 3 Zabbix servers? Not really – our third node is going to be a really small arbitration node, which is simply going to be checking connections to the two other Zabbix nodes and giving a vote to achieve quorum in case of issues with one of the nodes.

In the end we will have three nodes: Zabbix server A, Zabbix server B, and the Arbiter node.

  • An odd number of nodes is recommended to achieve quorum.
  • Only Active/Passive cluster architecture is supported.
  • We cannot have two active Zabbix nodes running at the same time and talking to the same database. It is important to use a ‘shoot the other node in the head’ mechanism (STONITH) to avoid such split-brain scenarios.

Failure to abide by these requirements can result in issues with database consistency, issues with underlying queries and cleaning up or inserting data. This can cause unexpected Zabbix backend server crashes down the line.
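As an illustration only (this is not an official Zabbix procedure, and pcs syntax differs between versions), an active/passive Pacemaker setup usually boils down to a virtual IP plus the zabbix-server service, kept together on one node:

# Sketch: Pacemaker/PCS resources for an active/passive Zabbix server pair (IP and names are placeholders)
pcs resource create zabbix_vip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s
pcs resource create zabbix_server systemd:zabbix-server op monitor interval=30s
# Keep the server process on the node that currently holds the virtual IP, and start the VIP first
pcs constraint colocation add zabbix_server with zabbix_vip INFINITY
pcs constraint order zabbix_vip then zabbix_server

Fencing (STONITH) still has to be configured separately, as noted above.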

In addition, it is very common to have a requirement for proxies to be deployed with HA. Before implementing HA for proxies, we need to decide if we really do need it. HA adds a significant configuration management overhead. We can have hundreds if not thousands of proxies, and managing HA for each of those can add a significant overhead. Of course, the more comfortable you feel with the HA tools, the easier the deployment and the management of the environment.

Another approach to Zabbix proxy HA can be implemented by using Zabbix API scripts. We can essentially have two proxies running without the need for the HA suite. In this case, if proxy A is down, we can use the Zabbix API to move a host from proxy A to proxy B.

Using Zabbix API script to change the proxy

Here, host.massupdate is used to change the proxy on the hosts. Combine this with a robust scripting logic and you end up with a very viable approach to move your hosts between proxies in failover scenarios.
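For reference, the API call at the heart of such a failover script could look roughly like the request below. The host and proxy IDs are placeholders, and this is only a sketch of the idea, not the exact script from the presentation:

curl -s -X POST -H 'Content-Type: application/json-rpc' https://zabbix.example.com/api_jsonrpc.php -d '
{
"jsonrpc": "2.0",
"method": "host.massupdate",
"params": {
"hosts": [{"hostid": "10084"}, {"hostid": "10085"}],
"proxy_hostid": "10452"
},
"auth": "xxxxxx",
"id": 1
}'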

Database replication

We have covered HA for the Zabbix server backend. As for the frontend, let’s remember that we can simply bring up additional frontends, for instance, by utilizing Docker containers. But what about DB redundancy?

  • Database replication can be used as a form of redundancy for the Zabbix DB. No matter the DB backend – Postgres, MySQL, Oracle, we can deploy multiple DB nodes and utilize the native DB replication or use third-party tools for replication, for instance, Galera Cluster.
  • I personally prefer using native replication tools as it is a bit more simple and you don’t have to concern yourself with another configuration and management layer that could potentially fail and be a bother to troubleshoot. But this will depend on your requirements, design and skillset.

Let’s look at an example with MySQL replication. You can set it up in many different ways as multiple replication approaches are supported: master/slave, master/master, or even have multiple masters replicating to one another. It is completely up to you how to implement replication, especially if you are already experienced with such deployments.

Which approach is best? At the end of the day it will all depend on your company policies, database backend, and a compromise between simplicity and extra redundancy. I definitely suggest delving deeper and studying use cases and articles for the DB backend of your choice before you decide to go with any particular approach.

Database replication

Database performance tuning

Database tuning is vital for the long term stability of your Zabbix instance. The database defaults might be sufficient for your home office, but for large multi-tenant environments with tens of thousands new values per second they will not suffice. The database defaults depend on the database backend and the database version used, but ideally, these should be tuned and tested, preferably during the design stage, before you have deployed your Zabbix instance in production.

After installing the database backend we need to take a look at the hardware resources available. Ideally, you have already estimated the hardware resources required for your instance and ensured that DB hosts have sufficient memory, CPU resources and storage has been selected according to the I/O requirements. Now, you can move on to tuning your database backend.

As an example with Postgres I used PGTune — an online database tuning tool. This is a simple estimate that should still provide you with a somewhat adequate configuration. Though ideally, you should have a DBA on board that is aware of what kind of data loads you will be dealing with to help you with an optimal database configuration.

Database performance tuning

History table partitioning

In such large environments, you will most likely see that the housekeeper cannot keep up with the amount of data stored and is unable to clean it up in a timely fashion, with housekeeper process utilization reaching 100 percent for 20-30 minutes at a time. This will have a negative effect on the overall database performance for the duration of housekeeping.

At this point, it is recommended to implement partitioning for history/trend tables. We can use Postgres with TimescaleDB plugin for this. Partitioning is supported out of the box, and you can configure it in Administration > General > Housekeeping.

For MySQL and Oracle backends we would have to rely on custom partitioning scripts or procedures. In addition, community-provided partitioning scripts are publicly available.

As always – don’t forget to test 3rd party scripts in a test environment before deploying it in production!

Community partitioning solution for MySQL

You can always create your own partitioning script, but you should be aware of what you’re doing and how things should be partitioned. We should always be partitioning only history and trend tables.

History table partitioning with TimescaleDB

  • TimescaleDB plugin for PostgreSQL DB backends supports out-of-the-box partitioning. You don’t have to rely on community scripts.
  • On TimescaleDB, we need to specify the chunk_time_interval parameter, which will define the partition chunk size (see the sketch after this list).
  • In addition, we can also add compression of history/trends, which helps to reduce the history table size by 60-80 percent. Again, in such scenarios, your database is going to be huge — terabytes in size with hundreds of customers, each having thousands of metrics per second. So, compression is a really valuable asset.
  • The only thing we have to take into account is that compressed data is read-only and cannot be changed post-compression. So, no more changes or inserts are possible for the compressed chunks.
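For reference, converting the history table into a hypertable comes down to a query along the lines of the sketch below. Zabbix ships a timescaledb.sql script that does this for all history and trend tables; the one-day (86400 s) chunk size here is only an example value:

# Sketch: turn the history table into a TimescaleDB hypertable with 1-day chunks
psql -U zabbix -d zabbix -c \
  "SELECT create_hypertable('history', 'clock', chunk_time_interval => 86400, migrate_data => true);"

The related housekeeping overrides and compression are then configured under Administration > General > Housekeeping, as mentioned above.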

History and trends compression

To-do list

  • Deploy the latest available Zabbix version. Ideally stick with an LTS version.
  • Deploy proxy servers, define and configure HA/Replication on Zabbix proxies, as well as on Zabbix servers and databases.
  • Implement partitioning to improve database performance.
  • Implement throttling to reduce the volume of the incoming data.
  • Tune your database! Either use online guides or consult with your DBA.

With our to-do list completed, we can have our Zabbix environment deployed with redundancy in mind, providing monitoring as a service for hundreds of customers, with multiple proxies running for each of the customers, HA in place, and Zabbix performing up to our expectations.

Questions & Answers

Question. Will Zabbix have its native HA solution? Will it be the whole package or does it involve installing individual components and maintaining them?

Answer. It’s planned on the roadmap to have a native HA solution in Zabbix 6.0. You should be able to get your hands on it when the 6.0 beta version gets released – hopefully, you’ll be able to test it out yourselves and give us feedback. From the looks of it, it should be very much plug-and-play and will remove a lot of management overhead compared with the current HA implementation. Right now this is being developed only for the Zabbix backend server. As for the frontend – nothing is stopping you from having multiple frontends pointed at the same DB/Zabbix backend server.

Question. Can I run Zabbix on a single server and sell monitoring service to several customers with fully isolated environments, not just GUI, but also items, triggers, etc.

Answer. Yes, you can. You can have a single Zabbix instance and multiple customers being monitored by this instance. The only extra step that might be required is deploying proxies on the customer’s side. By using permission restrictions, proxy servers, roles, etc., we can then monitor multiple customers from a single Zabbix instance.

Question. When we change proxy, the agent configuration has to be updated. What about HA configuration on the proxy?

Answer. That really depends on the approach. If the agent is getting pointed at the virtual IP address and HA is managed by PCS, Corosync, or Pacemaker, then it should be fine as is and the VIP should just be on the currently active host. So, you’ll be essentially rerouted. With the HA by way of API approach, you can simply allow your agent to accept connections from both proxies. With ServerActive we can also specify multiple endpoints, so agents can actually be prepared for such an environment.

Question. How to merge two different instances into a single monitoring instance?

Answer. This is a complex task. First off, both instances need to have the same major Zabbix backend version. You might simply migrate the history from one instance to another, but then you will have some problems with underlying element IDs. So, in one instance you have your own set of items, triggers, users, etc. with your own set of IDs. These will most likely conflict with the set of IDs on the other instance.

You can do partial migration or use the export function to export your templates, hosts, value maps, network maps. I would try to export as much as I can as migration on SQL level will be a real pain. It is possible if you’re stubborn enough, but it can end up being a really complex task that can take days if not weeks to fully implement and test.

Question. Do subgroups relate to templates as well?

Answer. Subgroups relate to templates in a way where we can also define permissions to reading and modifying templates. For templates, you can also create per-customer templates and assign them to host groups. Users that have access to these host groups can then read or modify the templates.

Scheduled report generation in Zabbix 5.4

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/scheduled-report-generation-in-zabbix-5-4/14776/

The release of version 5.4 grants Zabbix users the ability to receive scheduled PDF reports in their mailbox, which is a very sought-after feature. This post and the video will cover all-new report-related configuration parameters and walk you through setting up scheduled report generation.

Contents

I. Reporting in Zabbix 5.4 (0:45)
II. Scheduled reports (2:26)

III. Questions & Answers (13:28)

Reporting in Zabbix 5.4

Zabbix 5.4 is our first big step in bringing out-of-the-box reporting for our end users. With this feature, we now have a foundation to build upon in the future and make reporting more robust and versatile over time. Since reports are 100% based on dashboard widgets, it’s only a matter of time until more report-focused widgets get released, thus enabling not only better dashboards, but also improving the reporting functionality.

  • We have implemented a new web service component responsible for generating reports — of course, you can install this server in a quick and easy fashion by using the provided packages.
  • Reporting works out of the box without the need to deploy or develop any custom scripts.
  • The initial configuration is easy to understand and implement.
  • Reporting will use the existing Email media types to send out these reports.
  • The reports do respect your permissions, as well as roles introduced in Zabbix 5.2.
  • You will be able to test the report before implementing it as per our schedules just by clicking the Test button.

Scheduled reports

We have added a new  Scheduled report section, where the list of reports is available, displaying the report Name, their Owner, Repeats (daily, weekly, etc.), the Period for which the report is generated, and the Last sent date.

Scheduled reports

NOTE. When you configure new reports, and they have not been sent out yet, the Last sent date will be set to ‘Never.’

Creating a report

When you create a report, you will also have to fill in a couple of fields:

  • Owner,
  • Name of the report,
  • Dashboard, the report will be based on,
  • Period — if you send the report for the Previous day, Previous week, Previous month, or Previous year,
  • Cycle — how often you send the report Daily/Weekly/Monthly/ Yearly,
  • Start time (Zabbix server time is used here),
  • Start date and end date.

Creating reports

Receiving a report

When you receive a PDF report to your mailbox, you can also use the {TIME} macro to display server time both in the subject and the body of the message.

Receiving a report

In the PDF report, you can display any information from the included dashboard – Graphs, Problems, Latest data, and much more. Thanks to all of the available widgets, we will be able to customize our reports in a very granular fashion.

Receiving a report example

The report does respect user permissions. So, in the example above, the report shows only the data to which the user (either the recipient or the report creator) has access.

Permissions

After upgrading to Zabbix 5.4, you will see two new options in the User roles section:

  • Scheduled reports UI element. Under the UI elements, you can grant or deny access to the Scheduled reports section. This is accessible only to Super admin and Admin user Types.

Permissions

If the Scheduled reports UI element is unchecked for the role, the user won’t be able to access the Scheduled report section and will see an error message. The same behavior is true if you use a URL to access the Scheduled reports.

Access to scheduled reports denied message for the users of a user role

You can also manage scheduled report permissions in the Access to actions section by checking the Manage scheduled reports box. This action permission grants or denies the ability to create or edit scheduled reports and is also accessible to Admin and Super admin user types.

Manage scheduled reports

If this check box is unchecked, the users won’t be able to create new or edit existing reports, though they will be able to access the UI section and see the list of reports and how they are configured.

Access to Manage scheduled reports restricted

Recipients of scheduled reports

When you are defining a new report, you can select the recipient. Report subscription can contain a user or a user group.

  • When selecting a user, you can specify to include or exclude the user from the subscription.
  • User group to host group permissions still apply.
  • You can specify which user is going to be generating the report – the recipient or the creator of the report:

Report recipients

For example, if we need to send some extra information to our NOC team that might not be directly available to them, you can select Current user, and the report will be generated with the permissions of the report creator. Since it is the admin that is creating the report, you can add some extra information that wouldn’t be visible to your NOC team or other regular users. They still won’t be able to access it in Zabbix, but they’ll receive it in their mailbox if you configure the report for them.

Report prerequisites

Diving a bit deeper into the technical side of things, we need to set up two additional packages to enable the reports:

  • zabbix-web-service — the additional reporting service by default listening to port 10053. The service needs to be reachable from the Zabbix server and can be deployed on the same machine as our frontend or our server. We also have the option to deploy it on a completely separate machine. The zabbix-web-service package should be available if you have added the Zabbix repository.
#yum install zabbix-web-service
  • Google Chrome is required. However, on some distributions, Chromium is reported to also work, though this is not 100% tested. Note that Google Chrome packages are not included in Zabbix. The Google Chrome packages can simply be downloaded from the Google Chrome website and then installed on the zabbix-web-service host.
#wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

#yum install google-chrome-stable_current_x86_64.rpm

Configuring reports — Web service

We have a whole new configuration file for the web service. The web service supports many different configuration options (a minimal example configuration is sketched after this list):

  • Logging — similar to that for server and for proxy. You can set up debug levels, select the log types, rotations, and so on.
### Option: LogType – system (syslog), file, console (standard output)
### Option: LogFile – log file location
### Option: LogFileSize – size in MB before rotation
### Option: DebugLevel – 0-5
  • List of allowed server addresses that can access this web service.
### Option: AllowedIP – list of comma delimited IP addresses, optionally in CIDR notation, or DNS names of Zabbix servers
  • Timeout settings
### Option: Timeout – spend no more than Timeout seconds on processing (default – 3)
  • Listen port
### Option: ListenPort – the service will listen on this port for connections from the server (default – ListenPort=10053)
  • Encryption settings by using certificates. This way the communication with the web service can be secured.
### Option: TLSAccept – unencrypted or cert
### Option: TLSCAFile – pathname of a file containing top level CA(s) certificates
### Option: TLSCertFile – pathname of a file containing the service certificate
### Option: TLSKeyFile – pathname of a file containing the service private key
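Putting the most common options together, a minimal working configuration might look like the sketch below. The file path matches the packaged default on most distributions, and the addresses are placeholders:

# /etc/zabbix/zabbix_web_service.conf - minimal example (sketch)
LogType=file
LogFile=/var/log/zabbix/zabbix_web_service.log
DebugLevel=3
# Zabbix server address that is allowed to request reports
AllowedIP=192.168.1.155
ListenPort=10053
Timeout=10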

Configuring reports — Server

In addition, the server settings now contain report-related parameters:

  • The number of report writer instances.
### Option: StartReportWriters -Number of pre-forked report writer instances.
(Default –0)

NOTE. You need to have at least one StartReportWriter

NOTE. The number of the necessary report writers will depend on the number of reports and how often you generate them.

  • Zabbix Web Service URL (to be passed on to the server)
### Option: WebServiceURL -URL to Zabbix web service, used to perform web-related tasks. (No default value)
#Example: http://192.168.1.156:10053/report

You need to make sure that the Zabbix server can communicate with the Zabbix web service URL and that incoming traffic to the web service on this port is permitted.

Configuring reports — Frontend

As the last step, you need to enable communication between the frontend and the web service.

In Administration > General > Other, we have a new configuration parameter where you need to specify your frontend URL that will be reachable by the web service.

Frontend URL

Once this is done, we can create a report.

Reports — testing

After you have created the report, you can test it. You can click the Test button and send out your test report to see if it works. The users to which we’re sending the report need to have an Email media assigned to them in the User settings.

NOTE. Currently, {TIME} macros are resolved only with the scheduled generation and are not available in test reports, though this might change in the future.

Testing reports

Common issues

Some parameters can certainly be misconfigured, so let’s look at the most common issues:

  • Make sure that you have a properly configured Email media assigned to the user that should be receiving the report. Otherwise, they will fail to receive it.

— Make sure that the Email media type settings are properly configured.
— Once you define the media type, if you’re creating it from scratch, make sure that you test the media type and generate a test report.

Media configuration failed

NOTE. Sending out the report failed in this example since no media is configured for the report recipients.

  • Make sure that the correct Web service address is configured on the Zabbix server in the WebServiceURL parameter.

— Confirm that the Zabbix server can connect to the Zabbix web service on the specified port/IP address.
— Check your firewall settings if the web service is running on a dedicated machine.
— Make sure that third-party security software, such as SELinux or firewalls don’t block the communication.

Wrong WebServiceURL parameter

Otherwise, you will receive an error message on the frontend. The error messages should be informative enough to point you in the right direction.

  • Make sure that the Web service URL is configured without any typos. Otherwise, you will reach the web service, but the report page will output an error — ‘404 page not found’.
WebServiceURL=http://192.168.1.156:10053/reportwrong

Typos in configuration error message

NOTE. If you see this error message, check for typos in the Zabbix server configuration file for WebServiceURL.

  • Don’t forget to assign the Frontend URL in Administration > General > Other.

— If a URL is misconfigured, you might start receiving empty reports.
— If the URL syntax is wrong, you will receive an error message about the malformed URL.

Malformed URL error message

Frontend URL configuration parameter

  • Google Chrome is not pre-packaged with Zabbix,

— You need to have Google Chrome package installed separately. You can download Google Chrome from the official Google Chrome website, for instance, by using wget.

— Make sure that Google Chrome is available via the $PATH environment variable. If it isn’t, you will receive an error message, and you will need to modify the path variable and make sure the executable is available there.

$PATH environmental variable error

Questions & Answers

Question. What are the possibilities to customize the page size like A4, A3?

Answer. It will be based on how you customize your Dashboard. Currently, you cannot customize the page and select portrait or landscape, for instance.

 

Scalability improvements

Post Syndicated from Sergey Simonenko original https://blog.zabbix.com/scalability-improvements/14832/

New improvements might go unnoticed by many Zabbix users, since they relate to scalability rather than to new features or some aspects of the user interface experience. However, these improvements might be beneficial for those Zabbix users who run really large instances.

Contents

I. More efficient database use (1:15)

1. New worker processes (3:03)

2. In-memory trend cache (4:49)
3. More server resiliency (7:35)

II. Questions & Answers (10:54)

In case of large instances, the main performance bottleneck would be the database. Zabbix doesn’t establish ad-hoc connections and uses only persistent connections to the database. In Zabbix 5.4, the use of database connections has been further drastically optimized.

More efficient database use

  • In earlier versions, not only database syncers, but also pollers and some other processes had a dedicated persistent connection to the database. These connections were necessary for calculated items and aggregate checks, which are not regular collected items, since they are based on queries to the database, particularly to the history tables.

Connections were also required to update host availability status. Pollers (unreachable pollers, JMX pollers, as well as the IPMI manager) were updating it directly in the database.

  • In addition, in some cases, when proxies were used (that would be true for large instances) host availability was updated by the proxy poller, in case of a passive proxy, and trapper.

Why was it decided to avoid these connections in Zabbix 5.4?

  • First, they don’t really work smoothly with the default database configuration (PostgreSQL, Oracle). For instance, in PostgreSQL, max_connections is by default set to 100.
  • They can cause locking on the database side.
  • They also result in inefficient memory and CPU utilization.
  • Finally, in earlier versions, it was impossible to perfectly fine-tune the number of connections to the database.

New worker processes

In Zabbix 5.4, two new processes were introduced: history pollers and availability manager. If you have upgraded your Zabbix instance already when you log onto your server and run ps aux | grep zabbix_server, you will notice these new processes:

/usr/sbin/zabbix_server: history poller #1 [got 0 values in 0.000008 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #2 [got 2 values in 0.000186 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #3 [got 0 values in 0.000050 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #4 [got 0 values in 0.000010 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #5 [got 0 values in 0.000012 sec, idle 1 sec] 
/usr/sbin/zabbix_server: availability manager #1 [queued 0, processed 0 values, idle 5.016162 sec during 5.016415 sec]

History pollers

Since calculated items and aggregate checks represent a different type of item, they now have their own poller – the history poller. History pollers are also used for several internal items (zabbix[*] item keys).

New configuration parameters

History poller comes with a new configuration parameter. Here, it is important to keep in mind that more is not always better. So, the StartHistoryPollers value (how many history pollers are being pre-forked) should be increased only if history pollers are too busy according to internal self-monitoring, but should be kept as low as possible to avoid unnecessary connections to the database.

### Option: StartHistoryPollers
#     Number of pre-forked instances of history pollers.
#     Only required for calculated, aggregated and internal checks.
#     A database connection is required for each history poller instance.
#
# Mandatory: no
# Range: 0-1000
# Default:
# StartHistoryPollers=5

Availability manager

In earlier versions, pollers, unreachable pollers, JMX pollers, and the IPMI manager updated host availability directly in the database with a separate transaction for each host. Now, we have this separate availability manager, and all processes — pollers, trappers, etc. — communicate with the availability manager, and the statistics queue is flushed by the availability manager to the database every 5 seconds.

In-memory trend cache

Since Zabbix 5.2, new trigger functions like trendavg, trendmax, etc. were introduced, which operate with the trends data for long periods. Similarly to calculated items, these triggers used database queries to obtain the necessary data.

In Zabbix 5.4, finally, the trend cache has been implemented. It stores the results of calculated trends functions. If the value is not available in the cache yet, Zabbix will query the database and update the cache.

As with all newly introduced processes, the effectiveness of this cache can be monitored using the internal check zabbix[tcache,cache,<parameter>], which can be used to set the relevant TrendFunctionCacheSize parameter value.

### Option: TrendFunctionCacheSize
#           Size of trend function cache, in bytes.
#           Shared memory size for caching calculated trend function data.
#
# Mandatory: no
# Range: 128K-2G
# Default:
# TrendFunctionCacheSize=4M

To sum it up, with all these database-related optimizations:

  • Now it is possible to have as many database connections as you really need. So, if you, for instance, operate a very large instance and you need a hundred or more pollers, and at the same time, you don’t rely much on some calculated items or aggregate checks functionality, before Zabbix 5.4 you would end up with hundreds or more database connections that you didn’t need.

Moreover, with PostgreSQL with default configuration, if you increased the number of pollers, your database server could go down and bring down your Zabbix instance. For each PostgreSQL worker process, you would have had a limited work_mem as you had too many database connections. So, your overall database performance would have been sacrificed. That is not the case anymore.

  • In addition, if you are using trend functions with triggers using large periods of time, in the past you might have noticed, for instance, slow queries. Now, these changes will help you to drastically decrease the database load.

More server resiliency

  • Another important feature — graceful start. Active proxies can keep a backlog, which is useful if the communication between the server and the proxy breaks for any reason, for instance:

— server maintenance during upgrade to the next minor release;
— loss of Internet access at a remote site due to fiber cut, etc.

When communication restores, the proxies can easily overload the server after long downtimes, especially in large instances.

  • Since Zabbix 5.4, the server lets the proxies know if it’s busy, so the proxies throttle data sending.

Earlier, the data uploaded by the proxies was throttled when the history cache usage was 80% or greater. However, as the server was responsible for that task, all proxies were getting disabled in some situations. That meant the history data upload, as well as other tasks, such as processing of regular data and processing tasks, were getting suspended until the history cache utilization dropped lower than 80%.

This method was ineffective and unacceptable in large environments. Now, the proxies are responsible for checking whether the server can handle the data. When the history cache usage hits 80%, the following scenario is used:

  • the proxies send the data to the server and the data is accepted;
  • if the server thinks it’s busy it will respond with a special JSON tag upload set to ‘disable’;
  • the proxies will stop uploading history data, but will keep polling the server for tasks and uploading other data;
  • in a while, the proxies will upload data again;
  • if the server is not too busy, it will respond with the JSON tag upload set to ‘enable’.

Unlike the previous two scalability improvements which are based on serious architectural changes, this change was backported to earlier Zabbix versions — 5.0 and 5.2.

Questions & Answers

Question. Would you recommend using proxies even on the local site to allow for the server to be upgraded without losing data or for performance improvements?

Answer. Yes, in some cases there are such setups. The idea is mainly to have a unified configuration, not only to improve performance. And in some cases, if you use a lot of proxies, you might want to monitor all the items only with the proxies. Such scenarios are used by many Zabbix customers.

Question. So, throttling can give you some noticeable performance benefits. Which version is required on the server and on the proxy for throttling?

Answer. All these changes have been backported to earlier versions, so you can use either Zabbix 5.4.0 released recently or the latest releases of Zabbix 5.0 or Zabbix 5.2.

Question. Is it possible to have two databases in a cluster and point the select queries to one database and, for instance, execution queries to another database? How would database clustering generally work? Is it of benefit to Zabbix? Can Zabbix utilize it?

Answer. In general, our HA setups use some basic features, which are built-in into database servers. They use replication. So, you have to use the servers that will provide some virtual IP for your cluster. That is completely transparent to Zabbix.

However, it is not recommended to split different queries across different nodes. They should still hit a single specific node. So, it is more of an HA approach rather than a horizontal scalability approach.

Question. Would you elaborate on what a large, or medium, or small instance means? What new values per second should we be looking at?

Answer. We can judge from large instances of our customers, and might not know about even larger instances managed by the customers themselves. Large instances can have, for instance, 100,000 NVPS and more. Sometimes, we upgrade really large instances with databases of dozens of terabytes. Some users like keeping really long records.

In my experience, large instances of 20,000 to 40,000 NVPS are quite common and they could benefit a lot from these changes.

Auto-healing Kafka connector tasks with Zabbix

Post Syndicated from Ronald Schouw original https://blog.zabbix.com/auto-healing-kafka-connector-tasks-with-zabbix/14269/

In this post, we will talk about the low-level discovery of Kafka connectors and tasks. When a Kafka task fails, a trigger is fired, which starts a remote command to restart the failed Kafka task. Of course, with the necessary logging around it.

You can find the template and scripts on the Zabbix share. But first, let’s talk a little bit about Kafka producers and consumers. Let’s say you have got a couple of connectors set up, pulling data from Postgres with Debezium and streaming it into Elasticsearch. The Postgres source is a bit flaky and goes offline periodically. If you view the status of the Postgres source, the producer, you notice the task has failed. Kafka does not restart the failed task out of the box. We don’t wait for the customer to complain, but we let Zabbix actively monitor the tasks. A failed connector task is easy to restart using the REST API. But manually restarting and watching a task is annoying. We used to do that at our business. Now Zabbix comes into play and restarts the failed Kafka task automatically. And we do sleep well.

About Kafka

Apache Kafka is a community distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open-sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged event streaming platform.

First, let’s do a curl and check the failed connector task.

curl -s "http://localhost:8083/connectors"| 
jq '."connector_sink-test"| .status.tasks'
[{
"id": 0,
"state": "RUNNING",
"worker_id": "connect1.test.com:8083"
},
{
"id": 1,
"state": "FAILED",
"worker_id": "connect2.test.com:8083"
}]

So this is where the fun starts – we have a connector task with id “1” which has failed. At the end of the blog, Zabbix restarts the connector, but first, let’s look at an example. This curl POST should restart task 1 of the connector_sink-test connector (the failed task running on worker connect2.test.com):

curl -X POST http://localhost:8083/connectors/connector_sink-test/tasks/1/restart
Low-level discovery

The zabbix_kafka_connector template works out of the box. To implement the use cases provided in this blog, you will need the scripts bundled together with the template. Kafka connectors can have multiple tasks. First, we determine the connectors and later the state of the connectors and tasks. Let’s run the following script – api_connectors.sh. I suggest you execute the script via a cronjob every 5 minutes, depending on how often you want to run the curl jobs.

api_connectors.sh

curl http://localhost:8083/connectors?expand=status | jq . > check_connectors
curl http://localhost:8083/connectors | jq .[] > get_connectors

It creates two files, check_connectors and get_connectors. Needless to say, we use curl with authentication in the production environment.
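For reference, a crontab entry running it every 5 minutes could look like this (assuming the script and the output files live in /etc/zabbix/; adjust the path and interval to your environment):

*/5 * * * * cd /etc/zabbix && ./api_connectors.sh >/dev/null 2>&1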

The next shell script, get_connector_data.sh, uses the check_connectors and get_connectors files as input. It defines the connector {#CONNECTOR} and the connector tasks {#CONNECTOR_ID} with the corresponding ID used by low-level discovery. Down the line it might be more efficient to rewrite it as a Python script; for now, jq is our useful friend here. The script is used by a user parameter later on.

get_connector_data.sh

#!/bin/sh
CONNECTOR=$(cat get_connectors)
CONNECTOR_IDS=$(cat get_connectors | tr -d '"')
FIRST="1"
# create zabbix lld discovery connectors
echo "{"
echo ' "data":['
for i in $CONNECTOR
do
  if [ "$FIRST" -eq 0 ]
  then
    printf ",\n"
  fi
  FIRST="0"
  printf " {\"{#CONNECTOR}\": $i}"
done
# create zabbix lld discovery task connectors
for i in $CONNECTOR_IDS
do
  IDS=$(cat check_connectors | jq -r '."'${i}'" | .status.tasks[].id')
  for z in $IDS
  do
    if [ "$FIRST" -eq 0 ]
    then
      printf ",\n"
    fi
    FIRST="0"
    printf " {\"{#CONNECTOR_ID}\": \"${i}-${z}\"}"
  done
done
printf "\n ] \n}"

Part of the script output will look like this, depending, of course, on how many connectors and tasks there are in your Kafka environment.

{
"data":[
{"{#CONNECTOR}": "source_invoices-prod"},
{"{#CONNECTOR}": "employee_sink-prod"},
{"{#CONNECTOR_ID}": "ource_invoices-prod-0"},
{"{#CONNECTOR_ID}": "source_invoices-prod-1"},
{"{#CONNECTOR_ID}": "employee_sink-prod-0"},
{"{#CONNECTOR_ID}": "employee_sink-prod-1"},
{"{#CONNECTOR_ID}": "employee_sink-prod-2"},
{"{#CONNECTOR_ID}": "employee_sink-prod-3"}
]
}
Template

We will define a template with the LLD rule in it and later attach the template to a host. Create a template via Configuration > Templates > Create template. Give it a name of your choice: Template_kafka_connector, or some other name depending on your template naming policies.

Discovery rule

Next, we create a discovery rule. Keep lost resources period is an arbitrary value here – once again, depending on your policies regarding LLD entities.
In this case, we will discard the lost resources immediately – Keep lost resources period (0). This can be a bit more database friendly in cases where Kafka creates hundreds of connectors. The update interval is the same as the cronjob interval.

Configuration > Templates > your created template > discovery > create discovery rule

The key is used by the user parameter defined further in the blog.
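For reference, the discovery rule could be configured roughly like this (the key name just has to match the UserParameter defined later in this post, and the interval should match the cronjob):

Type: Zabbix agent
Key: connector.discovery
Update interval: 5m
Keep lost resources period: 0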

Item prototypes

We will create two item prototypes, one for the connector and one for the task of the connector with the corresponding ID of the task. The ID is important because we want to restart the correct task later.

Name: State of {#CONNECTOR} connector
Key: state[{#CONNECTOR}]

Configuration > Templates > your created template > item prototypes > create item prototype
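The second item prototype, for the connector task, can follow the same pattern (the name is a suggestion; the key must match the task[*] user parameter defined later):

Name: State of {#CONNECTOR_ID} connector task
Key: task[{#CONNECTOR_ID}]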

Trigger prototypes

Four trigger prototypes have been created, in two sets with different severities. The highest severity only fires after six hours and is intended for the operation center. Most of the time, Zabbix will restart the failed task within 5 or 10 minutes, so it is not necessary to burden the operation center with it. I will explain the most important trigger. This trigger will soon be used in an action to start the remote command. The URL macro {TRIGGER.URL} is used, which carries the ID of the task that should be restarted. There are probably other solutions, but this one works well and is stable.

Configuration > Templates > your created template > item prototypes > create trigger prototype
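As a rough sketch (field values are assumptions based on the description above), the most important trigger could look like this, with the URL field carrying {#CONNECTOR_ID} so that the restart script can later extract it from {TRIGGER.URL}:

Name: Kafka Connector task {#CONNECTOR_ID} on {HOST.NAME} is FAILED
Expression: {C_Template Kafka Connector:task[{#CONNECTOR_ID}].str(FAILED)}=1
URL: http://localhost:8083/connectors/{#CONNECTOR_ID}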


The other trigger examples are provided below.

Name: Kafka Connector task {#CONNECTOR_ID} on {HOST.NAME} is not RUNNING
Expression: {C_Template Kafka Connector:task[{#CONNECTOR_ID}].str(RUNNING,6h)}=0 and {C_Template Kafka Connector:task[{#CONNECTOR_ID}].str(FAILED)}=1
Severity: Warning

Name: Kafka Connector {#CONNECTOR} on {HOST.NAME} is FAILED
Expression: {C_Template Kafka Connector:state[{#CONNECTOR}].str(FAILED)}=1
Severity: Not classified

Name: Kafka Connector {#CONNECTOR} on {HOST.NAME} is not RUNNING
Expression: {C_Template Kafka Connector:state[{#CONNECTOR}].str(RUNNING,6h)}=0 and {C_Template Kafka Connector:state[{#CONNECTOR}].str(FAILED)}=1
Severity: Warning
User parameters

Three user parameters are required: one for the low-level discovery and two for the items.

UserParameter=connector.discovery,sh /etc/zabbix/get_connector_data.sh
UserParameter=state[*],/etc/zabbix/check_connector.sh $1
UserParameter=task[*],/etc/zabbix/check_task_connector.sh $1

The check_connector.sh script gets the state of the connector.

#!/bin/sh
CONNECTOR="$1"
cat /etc/zabbix/check_connectors | jq -r '."'${CONNECTOR}'" | .status.connector.state'

check_task_connector.sh does a check on the connector task. A disadvantage of this construction is that a connector can have a maximum of 10 tasks; at ID 10 or higher, the check fails. But it is unusual in Kafka to deploy a connector with that many tasks.

#!/bin/bash
value=$1
CONNECTOR=$(echo ${value::-2})
IDS=$(echo ${value:(-1)})
cat /etc/zabbix/check_connectors | jq -r '."'${CONNECTOR}'" | .status.tasks[] | select(.id=='${IDS}').state'
Zabbix-agent

When all scripts are in the right place, we make a small adjustment to the Zabbix agent config. The LogRemoteCommands option is not necessary, but it is useful for debugging. Restart the Zabbix agent afterward. Add the Kafka template to a host, and we can proceed.

EnableRemoteCommands=1
LogRemoteCommands=1
Action auto-healing

Let’s define some actions that can heal our connector tasks by automatically restarting a Kafka task with an action. Create a new action – you can choose any conditions that can be applied to your trigger.

Configuration > actions > event source – triggers > create action.

Create an operation. This can be a bit tricky. In my case, I restart the tasks every five minutes for the first half-hour. If unsuccessful, the Kafka admins will receive an email. After that, the tasks are restarted every hour for three days. In practice, this has never happened, but such a situation can occur over the weekend, for example. After three days, the operation stops and sends a final email. Usually, the task starts the first time – if not, then the second attempt is sufficient in 99% of the cases.
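As an illustration of the remote command itself (the script name and path are assumptions; use whatever matches your environment), the operation executed on the Kafka Connect host via the Zabbix agent can be as simple as:

/etc/zabbix/restart_connector.sh "{TRIGGER.URL}"

The script, shown below, parses the connector name and task ID out of the {TRIGGER.URL} value and calls the Connect REST API.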

Restart script

You will probably have to adapt the script to your own environment. We have built in some extra logging, which is certainly useful during the initial setup.

#!/bin/bash
LOG=/var/log/zabbix/restarted-connector.log
value=$(echo $1 | awk -F "/" '{print $(NF)}')
echo $value
CONNECTOR=$(echo ${value::-2})
IDS=$(echo ${value:(-1)})
curl -v -X POST http://localhost:8083/connectors/"${CONNECTOR}"/tasks/"${IDS}"/restart 2>&1 | tee -a $LOG
echo "Connector $CONNECTOR ID $IDS has been restarted at $(date)" >> $LOG

The {TRIGGER.URL} macro is used here – not intended to be used this way out of the box by Zabbix, but it gets the job done for this use case. The awk takes only the last field of the URL (everything after the final slash), stripping the http:// prefix and host and leaving the connector name and task ID.

If you have any other suggestions on how to improve the scripts or the templates – you are very much welcome to leave a comment with your idea!

Credits

I was inspired by Robin Moffatt at Confluent and, not least, by my colleague Werner Dijkerman at fullstaq.

Correlation between devices across client site

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/correlation-between-devices-across-client-site/14657/

In this blog post, we will talk about aggregating the status of different kinds of devices that get disconnected from the general network and finding out how many devices of each kind are “down” right now. This can be useful in an Internet Service Provider type of situation.

A property

It all starts with a property.

A property can be a building or a block. Most likely it has a firewall and a core switch at the top of everything.

A building can have floors, and each floor can have its own switch: an edge switch.

Each floor can have rooms or departments. It may be enough to put a router there to feed all the devices around it.

Vision

When something goes down, we want to see “what is the damage?” If a major component goes down, that should be the priority to concentrate on.

In general, we aim for a much more descriptive message like:

=> 2 edge switches down

=> A core switch is down

=> 15 routers down

=> Firewall is down

For the best experience, we aim to have only one message.

Prerequisites

To inform us how many devices are down, we need to make sure:

1) Each client host must belong to a host group. The name of the host group describes the location of the property, for example “Riga/Block7”:

2) Each host object owns a macro {$PROPERTY_HOST_GROUP}. This can be delivered through the template. The macro value must be the same as the name of the host group: “Riga/Block7”

3) There is one virtual host in the client pool. This host will do the aggregations, determining what kinds of devices are down and how many of them.

4) At least one passive check must work for devices. SNMP polling must be in place.

How does it work?

Monitoring software is executing passive checks:

As a result, it will generate red/green icons:

A “Zabbix internal” type can read the status of the icon:

zabbix[host,agent,"available"]
zabbix[host,snmp,"available"]

If the icon is red, a number 0 will be reported.

If the icon is green, a number 2 will be reported.

2 items in the template

There are 2 items in the template per category. At first, the “availability” item fetches the status of the icon, and then a dependent item transforms this information into another number:

// Router:
if (value == 0) {return 1} else {return 0}

// Switch:
if (value == 0) {return 100} else {return 0}

// Core switch:
if (value == 0) {return 100000} else {return 0}

// Firewall:
if (value == 0) {return 1000000} else {return 0}

Aggregation

Each device type will generate numbers like 1 or 100 or 100000 or 1000000.
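The summing itself can be done on the virtual host with an aggregate item against the property host group; a minimal sketch, assuming the dependent item that produces these numbers has the hypothetical key status.value on every device:

grpsum["{$PROPERTY_HOST_GROUP}","status.value",last]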

We can have 2 options:

1) Link a trigger directly to the calculated number. The bigger the integer number, the more critical it is for the client.

2) We can also operate with dependent items. Here is one method to transform the calculated item back into dependent items:

// Routers down:
if (value > 0) {return value % 100} else {return 0}
// % 100 is because the client can have 99 routers

// Switches down:
if (value > 99) {return value.replace(/..$/,"") % 1000} else {return 0}
// % 1000 is because the client can have 999 switches

// Core switches down:
if (value > 99999) {return value.replace(/.....$/,"") % 10} else {return 0}
// % 10 is because the client can have 9 core switches

// Firewalls down:
if (value > 999999) {return value.replace(/......$/,"") % 10} else {return 0}
// % 10 is because the client can have 9 firewalls

Detect flapping

It would be quite useful to detect when some devices are changing up/down too frequently. There is no elegant solution here, but we can have a workaround. We can clone all 4 items:

and, per each item, add a second preprocessing step “Discard unchanged”:

We will end up with 4 additional items:

One last step is to create 4 more items to count how many metrics the “changes” item received. Here is a sample of one:

Macro value of {$FLAP} can be ‘1d’.

Known issues with the solution to detect flapping

If the ‘zabbix-server’ service is restarted, it will generate “+1 flap” for each device type.

If a device in one category changes state to “up” and another device in the same category changes state to “down” within the same minute, this will not be detected 🙁

How far can we go?

How many classifications can we use? The calculated item is limited to a 64-bit integer, ‘18446744073709551615’, which has 20 digits. Because it starts with a ‘1’, we can safely use only 19 digits.

Proof of concept template

available

There are 6 templates included in one XML file:

To use this solution:

1) Import XML file.

2) Clone “Property” template.

3) Open a cloned “Property” template and install the correct value of the macro {$PROPERTY_HOST_GROUP}. The value must be the same as the host group where all client devices are in.

4) In the same host group where all client devices are in, create a dummy host, apply the “aggregate status” template and assign “Property” template to this host.

5) Assign “binary” templates (the ones which contain a ‘1’ in the name) to the devices (switches, core switches, firewalls) the client owns.

Alright, that is it for today. Bye!

Anchor | Cloud Engineering Services 2022-05-27 07:02:18

Post Syndicated from Gerald Bachlmayr original https://www.anchor.com.au/blog/2022/05/death-by-nodevops/

The CEO of ‘Waterfall & Silo’ walks into the meeting room and asks his three internal advisors: How are we progressing with our enterprise transformation towards DevOps, business agility and simplification? 

The well-prepared advisors, who had read at least a book and a half about organisational transformation and also watched a considerable number of YouTube videos, confidently reply: We are nearly there. We only need to get one more team on board. We have the first CI/CD pipelines established, and the containers are already up and running.

Unfortunately the advisors overlooked some details.

Two weeks later, the CEO asks the same question, and this time the response is: We only need to get two more teams on board, agree on some common tooling, the delivery methodology and relaunch our community of practice.

A month later, an executive decision is made to go back to the previous processes, tooling and perceived ‘customer focus’.

Two years later, the business closes its doors whilst other competitors achieve record revenues.

What has gone wrong, and why does this happen so often?

To answer this question, let’s have a look… 

Why do you need to transform your business?

Without transforming your business, you will run the risk of falling behind because you are potentially: 

  1. Dealing with the drag of outdated processes and ways of working. Therefore your organisation cannot react swiftly to new business opportunities and changing market trends.
  2. Wasting a lot of time and money on Undifferentiated heavy lifting (UHL). These are tasks that don’t differentiate your business from others but can be easily done better, faster and cheaper by someone else, for example, providing cloud infrastructure. Every minute you spend on UHL distracts you from focusing on your customer.
  3. Not focusing enough on what your customers need. If you don’t have sufficient data insights or experiment with new customer features, you will probably mainly focus on your competition. That makes you a follower. Customer-focused organisations will figure out earlier what works for them and what doesn’t. They will take the lead. 

How do you get started?

The biggest enablers for your transformation are the people in your business. If they work together in a collaborative way, they can leverage synergies and coach each other. This will ultimately motivate them. Delivering customer value is like in a team sport: not the team with the best player wins, but the team with the best strategy and overall team performance.  

How do we get there?

Establishing top-performing DevOps teams

Moving towards cross-functional DevOps teams, also called squads, helps to reduce manual hand-offs and waiting times in your delivery. It is also a very scalable model that is used by many modern organisations that have a good customer experience at their forefront. This applies to a variety of industries, from financial services to retail and professional services. Squad members have different skills and work together towards a shared outcome. A top-performing squad that understands the business goals will not only figure out how to deliver effectively but also how to simplify the solution and reduce Undifferentiated Heavy Lifting. A mature DevOps team will always try out new ways to solve problems. The experimental aspect is crucial for continuous improvement, and it keeps the team excited. Regular feedback in the form of metrics and retrospectives will make it easier for the team to know that they are on the right track.

Understand your customer needs and value chain

There are different methodologies to identify customer needs. Amazon has the “working backwards from the customer” methodology to come up with new ideas, and Google has the “design sprint” methodology. Identifying your actual opportunities and understanding the landscape you are operating in are big challenges. It is easy to get lost in detail and head in the wrong direction. Getting the strategy right is only one aspect of the bigger picture. You also need to get the execution right, experiment with new approaches and establish strong feedback loops between execution and strategy. 

This brings us to the next point that describes how we link those two aspects.

A bidirectional governance approach

DevOps teams operate autonomously and figure out how to best work together within their scope. They do not necessarily know what capabilities are required across the business. Hence you will need a governing working group that has complete visibility of this. That way, you can leverage synergies organisation-wide and not just within a squad. It is important that this working group gets feedback from the individual squads who are closer to specific business domains. One size does not fit all, and for some edge cases, you might need different technologies or delivery approaches. A bidirectional feedback loop will make sure you can improve customer focus and execution across the business.

Key takeaways

Establishing a mature DevOps model is a journey, and it may take some time. Each organisation and industry deals with different challenges, and therefore the journey does not always look the same. It is important to continuously tweak the approach and measure progress to make sure the customer focus can improve.

But if you don’t start the DevOps journey, you could turn into another ‘Waterfall & Silo’.


Zabbix proxy performance tuning and troubleshooting

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/zabbix-proxy-performance-tuning-and-troubleshooting/14013/

Most Zabbix users use proxies, and those running medium to large instances might have encountered some performance issues. From this post and the video, you will learn more about the most common troubleshooting steps to resolve any proxy issues and to detect them as sometimes you might be unaware of an ongoing issue, as well as basic performance tuning to prevent such issues in the future.

Contents

I. Zabbix proxy (1:36)
II. Proxy performance issues (5:35)
III. Selecting and tuning the DB backend (13:27)
IV. General performance tuning (16:59)
V. Proxy network connectivity troubleshooting (20:43)

Zabbix proxy

Zabbix proxy can be deployed and is most of the time used to monitor distributed IT infrastructures, for instance at a remote location, and to prevent data loss in case of network outages, as the proxy collects the data locally and it is then pushed to or pulled from the Zabbix server.

Zabbix proxy supports active and passive modes, so we can push the data to the Zabbix server or have the Zabbix server pull the data from the proxy. Even if we don’t have any remote locations and have a single data center, it is still a good practice to delegate most of your data collection to a proxy running next to your server, especially in medium-sized and large instances. This allows for offloading our data collection and preprocessing performance overhead from the server to the proxy.

Active vs. passive

Whether an active or a passive mode is better for your company at the end of the day will depend on your security policies. We can use passive mode with the server pulling the data from the proxy or active mode with the proxy establishing the connection to the Zabbix server and pushing the data.

  • Active mode is the default configuration parameter as it is a bit simpler to configure — almost all of the configuration can be done only on the proxy side. Then, we need to add the proxy on the frontend and we’re good to go.
### Option: ProxyMode
#   Proxy operating mode.
#   0 - proxy in the active mode
#   1 - proxy in the passive mode
#
# Mandatory: no
# Default:
# ProxyMode=0
  • In the case of a passive proxy, we have to make some changes in the Zabbix server configuration file, which would involve a restart of the Zabbix server and, as a consequence, downtime.

Finally, it is all going to boil down to our networking team and the network and security policies, for instance, allowing for passive or active mode only. If both modes are supported, then the active mode is a bit more elegant.

Proxy versions

Another common question is about the proxy version to install and the database backend to use.

  • The main point here is that the major proxy version should match the major version of the Zabbix server, while minor versions can differ. For instance, Proxy 5.0.4 can be used with Server 5.0.3 and Web 5.0.9 (in this example, the first and the second number should match). Otherwise, the proxy won’t be able to send the data to the server and you will see some error messages in your log files about version mismatch and data formatting not fitting your server requirements.
  • Proxies support SQLite / MySQL / PostgreSQL / Oracle backends. To install the proxy, we need to select the proper package for either SQLite3, MySQL, PostgreSQL, or just compile the proxy with Oracle database backend support.

— SQLite proxy package:

# yum install zabbix-proxy-sqlite3

— MySQL proxy package:

# yum install zabbix-proxy-mysql

— PostgreSQL proxy package:

# yum install zabbix-proxy-pgsql

For instance, if we do # yum install zabbix-proxy-sqlite3 or copy and paste the instructions from the Zabbix website for SQLite, we will later wonder why it is not working for MySQL as there are some unique dependencies for each of these packages.

NOTE. Don’t forget to select the proper package in relation to the proxy DB backend

Proxy performance issues

After we have installed everything and covered the basics of what needs to be done and how to set things up, we can start tuning our proxy and try to detect any potential performance issues.

Detecting proxy performance issues

How can we find out what the root cause of performance issues is or if we are having them at all?

  1. First, we need to make sure that we are actually monitoring our proxy. So, we need to:
  • Create a host in Zabbix,
  • Assign this host to be monitored by the proxy. If the host is monitored by the server, it will report the wrong metrics — the Zabbix server metrics, not the Zabbix proxy metrics.

So, we need to create a host and configure it to be monitored by the proxy itself. Then we can use an out-of-the-box proxy monitoring template — Template App Zabbix Proxy.

Template App Zabbix Proxy

NOTE. Template App Zabbix Proxy gets updated on the git.zabbix page, when we add new components to Zabbix, new internal processes, new gathering processes, and so on, to support these new components.

If you are running an older version of Zabbix, for example, all the way back from version 2.0, make sure that you download the newer template from our git page not to stay in the blind about the newer internal component performance.

Once we have applied the template, we will see performance graphs with information about gathering processes, internal processes, cache usage, and proxy performance, and both the queue and the new values received per second. So, we can actually react to predefined triggers provided by the template, if there is an issue.

Performance graphs

  2. Then, we need to have a look at the administration queue. A large or growing proxy-specific queue can be a sign of performance issues or a misconfiguration. We might have failed to allow our agents to communicate with our proxies, or we might have some network issue on our proxy preventing us from collecting data from the proxy.

An issue on the proxy

In this case:

  • Check the proxy status, graphs, and log files. In the example above, the proxy has been down for over a year, so it should be decommissioned and removed from the Zabbix environment.
  • Check the agent logs for issues related to connecting to the proxy. For instance, the proxy might be trying to pull the data but have no rights to do so due to no permissions in the agent configuration file.

Lack of server resources

In some cases, we might simply try to monitor way too much on a really small server, for instance, an older version of a Raspberry Pi device. So, we should use tools such as sar or top to identify resource bottlenecks on the proxy server coming, for example, from the storage performance.

sar -wdp 3 5 > disk.perf.txt

sar is a part of a sysstat package, and this command can provide us with information about our storage performance, serialization, wait times, queues, input/output operations per second, and so on. sar can tell us when something might be overloaded especially if we have longer wait times.

NOTE. Don’t get confused by high %util, which is relevant on hard drives, but on an SSD or a RAID setup the utilization is normally very high. While hard drives can handle only one operation at a time, SSD disks or RAID setups support parallel operations. This can cause SSD or RAID util% to skyrocket, which might not necessarily be a sign of an issue.

Proxy queue

Another useful, though a bit hackish, indicator of the proxy performance is the proxy queue on the proxy database — the count of the metrics pending but not yet sent to the server.

  • We can observe this in real-time by queueing the proxy DB.
  • A constantly growing number means that we cannot catch up with our backlog — the network is down or there are some performance issues on the server or the proxy, so more data is getting backlogged than sent.
  • The list of unsent metrics is stored in proxy_history table.
  • The last sent metric is marked in the IDs table.
select count(*) from proxy_history where id>(select nextid from ids where
table_name='proxy_history');

This value will keep growing if the proxy is unable to send the data at all or due to performance issues. If the network is down, this is to be expected between the proxy and the server. However, if everything is working but the count still keeps growing, we need to investigate for any spamming items, thousands of log lines coming per second, or other performance issues with our storage and/or our database. There might be performance problems on the server due to the server being unable to ingest all of this data in time after a restart, a long downtime, etc. Such a problem should get resolved over time on its own. Otherwise, if there are no significant factors regarding the performance or any recent changes, we need to investigate deeper.

If this value is steadily decreasing, the proxy is actually catching up with the backlog and the incoming data, and is sending data to the server faster than it is collecting new metrics. So, this backlog will get resolved over time.

Configuration frequency

Don’t forget about the configuration frequency. Any configuration changes will be applied on the proxy after ConfigFrequency interval. By default, these changes get applied once an hour, so ConfigFrequency is 3600.

### Option: ConfigFrequency
#   How often proxy retrieves configuration data from Zabbix Server in seconds.
#   For a proxy in the passive mode this parameter will be ignored.
#
# Mandatory: no
# Range: 1-3600*24*7
# Default:
# ConfigFrequency=3600

On active proxies, we can force configuration cache reload by executing config_cache_reload for Zabbix proxy.

#zabbix_proxy -R config_cache_reload
#zabbix_proxy [1972]: command sent successfully

This is another good reason to use active proxies to pick up all of the configuration changes from the server. However, on passive proxies, the only thing we can do is a proxy restart to force a reload of the configuration changes, which is not a good idea. Otherwise, we have to wait for an hour or some other configuration interval until the changes are picked up by the proxy.

Selecting and tuning the DB backend

The next important step is a selection of the database.

SQLite

A common question, which has no clear answer is when to use SQLite and when should we switch to a more robust DB backend.

  • SQLite is perfect for small instances as it runs well even on embedded hardware. So, if I were to run a proxy on a Raspberry Pi or an older desktop machine, I might use SQLite. Even embedded hardware aside, on smaller instances with fewer than 1,000 new values per second, the SQLite backend should feel quite comfortable, though a lot will depend on the underlying hardware.
  • So, in most cases, when proxies collect fewer than 1,000 NVPS, the SQLite proxy DB backend is sufficient. With SQLite, you don’t need to additionally configure the database.
  • With SQLite, there’s no need to have additional database configuration, preparation, or tuning. In the proxy configuration file, we just point at the location of the SQLite file.
  • A single file is created at the proxy startup, which can be deleted if data cleanup is necessary.
### Option: DBName
#   Database name.
#   For SQLite3 path to database file must be provided. DBUser and DBPassword are ignored.
#   Warning: do not attempt to use the same database Zabbix server is using.
#
# Mandatory: yes
# Default:
# DBName=
DBName=/tmp/zabbix_proxy

All in all, the SQLite backend is comparatively easy to manage. However, it comes with a set of negatives. If we need something more robust that we can tune and tweak, then SQLite won’t do. Essentially, if we reach over 1,000 new values per second, I would consider deploying something more robust: MySQL, PostgreSQL, or Oracle.

Other proxy DB backends

  • Any of the supported DB backends can be used for a proxy. In addition, the Zabbix server and Zabbix proxy can use different DB backends. The DB configuration parameters are very similar in Zabbix server and Zabbix proxy configuration files, so users should feel right at home with configuring the proxy DB backend.
  • DB and DB user should be created beforehand with the proper collation and permissions.
shell> mysql -uroot -p<password>
mysql> create database zabbix_proxy character set utf8 collate utf8_bin; 
mysql> create user 'zabbix'@'localhost' identified by '<password>'; 
mysql> grant all privileges on zabbix_proxy.* to 'zabbix'@'localhost'; 
mysql> quit;
  • DB schema import is also a prerequisite. The command for proxy schema import is very similar to the server import.
zcat /usr/share/doc/zabbix-proxy-mysql*/schema.sql.gz | mysql -uzabbix -p zabbix_proxy

DB Tuning

  • Make sure to use the DB backend you are most familiar with.
  • The same tuning rules apply to the Zabbix proxy DB as to the Zabbix server DB.
  • Default configuration parameters of the backend will depend on the version of the backend used. For instance, different MySQL versions will have different default parameters, so we need to have a look at MySQL documentation, the default parameters, and the way to tune them.
  • For PostgreSQL, it is possible to use the online tuner — PGTune. Though it is not an ideal instrument, it is a good starting point not to leave the proxy hanging without any tuning as we might encounter issues sooner rather than later. With tuning, the database will be more robust and will last longer before we will have to add any resources and rescale the database config.

PGTune

General performance tuning

Proxy configuration tuning

Database aside, how we can tune the proxy itself?

Proxy configuration is similar to the configuration of the Zabbix server: we still need to take into account and tune our gathering processes, internal processes, such as preprocessors, and our cache sizes. So, we need to have a look at our gathering graphs, internal process graphs, and our cache graphs to see how busy the processes are and how full the caches are, and adjust accordingly. This is a lot easier to do on the proxy than on the server since a proxy restart is usually quicker, a lot less critical, and less impactful than server downtime.

In addition, these will differ on each of the proxy servers depending on the proxy size and types of items. For instance, if on proxy A we are capturing SNMP traps, we need to enable the SNMP trapper process and configure our trap handler — Perl, snmptrapd, etc.  If we are doing a lot of ICMP pings for another proxy, we’ll require many ICMP pingers. A really large proxy will need to have its History Syncers increased. So, each proxy will be different, and there is no one-fit-all configuration.
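As an illustration, the gathering and preprocessing parameters most commonly adjusted in zabbix_proxy.conf look like this (the values are placeholders only; size them based on the internal process and cache usage graphs):

# Examples only - tune based on the 'busy' graphs of each process type
StartPollers=10
StartPingers=5
StartSNMPTrapper=1
StartPreprocessors=5
CacheSize=32M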

  • Since proxies are distributed and scaled out, most of the time they handle fewer values, so we will have a lesser number of History Syncers on proxies in comparison with the Zabbix server. In the vast majority of cases, the default number of History Syncers is more than sufficient. Though sometimes we might need to change the count of History Syncers on the proxy.
### Option: StartDBSyncers
#             Number of pre-forked instances of DB Syncers.
#
# Mandatory: no
# Range: 1-100
# Default:
# StartDBSyncers=4

There are always exceptions to the rule. For instance, we might want to have a single large-scale and robust proxy collecting the data from some very critical or very large location with many data points – such an infrastructure layout will still be supported.

  • If DB syncers do underperform on a seemingly small instance, chances are it is due to a lack of hardware resources or, for SQLite, DB backend limitations.

We need to monitor the resource usage via sar, or top, or any other tool to make sure that hardware resources aren’t overloaded.

Proxy data buffers

We also have the option to store the data on our proxies if the server is offline, or to keep it even if the Zabbix server is reachable and the data has already been sent to the server. We may want to keep our data in the proxy database and utilize it with other third-party tools or integrations.

On our proxies, we have a local buffer and an offline buffer, which determine for how long we can store the data. The size of Local and Offline buffers will affect the size and the performance of your database. The larger the time window for which we store the data, the larger the database is. So, the fewer resources we utilize, the better the performance is, the easier it is to scale up, etc.

  • Local buffer
### Option: ProxyLocalBuffer
#   Proxy will keep data locally for N hours, even if the data have already been synced with the server.
#
# Mandatory: no
# Range: 0-720
# Default:
# ProxyLocalBuffer=0
  • Offline buffer
### Option: ProxyOfflineBuffer
#   Proxy will keep data for N hours in case if no connectivity with Zabbix Server.
#   Older data will be lost.
#
# Mandatory: no
# Range: 1-720
# Default:
# ProxyOfflineBuffer=1

Proxy network connectivity troubleshooting

Detecting network issues

Sometimes we have network issues between proxies and the server: either the server cannot talk to proxies or proxies cannot talk to the server.

  • A good first step would be to test telnet connectivity to/from a proxy.
time telnet 192.168.1.101 10051
  • Another great method is to time your pings to see how long pinging takes or how long it takes to establish a telnet connection. This could point you towards network latency issues: slow networks, network outages, and so on.
  • Log file can help you figure out proxy connectivity issues.
125209:20210214:073505.803 cannot send proxy data to server at "192.168.1.101": ZBX_TCP_WRITE() timed out
  • Load balancers, Traffic inspectors, and other IDS/Firewall tools can hinder proxy traffic. Sometimes it can take hours troubleshooting an issue to find out that it boils down to a load balancer, a traffic inspector, or IDS/firewall tool.

Troubleshooting network issues

  • A great way to troubleshoot this would be to deploy a test proxy with a different firewall/load balancing configuration. From time to time, network connectivity drops seemingly for no reason. We can bring up another proxy with no load balancers or no traffic inspectors, and ideally, in the same network as the problematic proxies. We need to find out if the new proxy is experiencing the same problems, or if the issue is resolved after we remove the load balancers, IDS/firewall tools. If the problem gets resolved, then this might be a case of misconfigured firewall/IDS.
  • Another great approach of detecting networking issues due to transport problems, for instance, IDS/Firewalls cutting up our packets, is to perform a tcpdump on proxy and server to correlate network traffic with error messages in the log.

tcpdump on the proxy:

tcpdump -ni any host <server IP> -w /tmp/proxytoserver

tcpdump on the server:

tcpdump -ni any host <proxy IP> -w /tmp/servertoproxy

— Correlating retransmissions with errors in logs could signify a network issue.

Many retransmissions may be a sign of network issues. If there are only a few of them, for example if we open Wireshark and find just a couple of retransmits, it might not be the root cause. However, if the majority of our packet capture consists of duplicate packets, retransmits, acknowledgements without data packets being received, etc., that can be a sign of an ongoing network issue.

Ideally, we could take a look at this packet capture and correlate it with our proxy log file to figure out if these error messages in our proxy logfile or server logfile (depending on the type of communication — active or passive) correlate with packet capture issues. If so, we can be quite sure that the networking issue is at fault and then we need to figure out what is causing it — IDS, load balancers, a shoddy network, or anything else.

MySQL performance tuning 101 for Zabbix

Post Syndicated from Vittorio Cioe original https://blog.zabbix.com/mysql-performance-tuning-101-for-zabbix/13899/

In this post and the video, you will learn about a proper approach to getting the most out of Zabbix and optimizing the underlying MySQL Database configuration to improve performance while working with a database-intensive application such as Zabbix.

Contents

I. Zabbix and MySQL (1:12)
II. Optimizing MySQL for Zabbix (2:09)

III. Conclusion (15:43)

Zabbix and MySQL

Zabbix and MySQL love each other. Half of the Zabbix installations are running on MySQL. However, Zabbix is quite a write-intensive application, so we need to optimize the database configuration and usage to work smoothly with Zabbix, which reads from and writes to the database a lot.

Optimizing MySQL for Zabbix

Balancing the load on several disks

So, how can we optimize MySQL configuration to work with Zabbix? First of all, it is very important to balance the load on several hard drives by using:

    • datadir to specify the default location, that is to dedicate the hard drives to the data directory;
    • innodb_data_file_path to define the size and attributes of InnoDB system tablespace data files;
    • innodb_undo_directory to specify the path to the InnoDB undo tablespaces;
    • innodb_log_group_home_dir to specify the path to the InnoDB redo log files;
    • log-bin to enable binary logging and set path/file name prefix (dual functionality); and
    • tmpdir (Random, SSD, tmpfs).

The key here is to split the load as much as possible across different hard drives in order to avoid different operations fighting for resources.

Viewing your MySQL configuration

Now, we can jump straight to MySQL configuration. It is important to start from your current configuration and check who changed this configuration and when.

SELECT t1.*, VARIABLE_VALUE FROM performance_schema.variables_info t1 JOIN
performance_schema.global_variables t2 ON t2.VARIABLE_NAME=t1.VARIABLE_NAME WHERE
t1.VARIABLE_SOURCE not like "COMPILED"

This query can help you to understand who has changed the configuration. However, it is also important to keep track of when the configuration changes.

Viewing MySQL configuration

MySQL key variables to optimize in your configuration

InnoDB buffer pool

The king of all of the variables to be optimized is the InnoDB buffer pool, which is the main parameter determining the memory for storing the DB pages — an area in main memory where InnoDB caches table and index data as it is accessed.

  • The InnoDB default value is too low for production; use 50-75% of the available memory on the dedicated database server.
  • Since MySQL 5.7, innodb_buffer_pool_size can be changed dynamically.

Judging from experience, 50 percent of available memory will be enough for the majority of databases with a lot of connections or activities, as many other buffers that occupy memory are used as well. So, 50 percent is a good, though conservative, parameter.
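Since the variable is dynamic, the buffer pool can also be resized online; for example, to a hypothetical 8 GB:

mysql> SET GLOBAL innodb_buffer_pool_size = 8589934592;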

To check the InnoDB Buffer Pool usage (in %) and whether you need to allocate more memory for the InnoDB Buffer Pool, you can use the following query, which shows the current usage as a percentage (though there are many queries to monitor the InnoDB Buffer Pool).

SELECT CONCAT(FORMAT(DataPages*100.0/TotalPages,2),
' %') BufferPoolDataPercentage
FROM (SELECT variable_value DataPages FROM information_schema.global_status
WHERE variable_name = 'Innodb_buffer_pool_pages_data') A,
(SELECT variable_value TotalPages FROM information_schema.global_status
WHERE variable_name = 'Innodb_buffer_pool_pages_total') B;

Binary logs

Binary logs contain events that describe changes, provide data changes sent to replicas, and are used for data recovery operations.

If you work with replication, you might know that binary logs require special attention apart from having them on a separate disk. You should size the binary logs properly and set a proper expiration time (1 month by default) and maximum size, for instance 1 GB, so that you will be able to write 1 GB of data per day.

We can have about 30 log files in the binary logs. However, you should check the activities of your system to consider increasing this number, as well as the expiration of the binary logs, if you need to keep more data for operations, such as finding time recovery, for instance.

How to control binary logs:

    • log_bin, max_binlog_size, binlog_expire_logs_seconds, etc.
    • PURGE BINARY LOGS TO|BEFORE to delete all the binary log files listed in the log index file prior to the specified log file name or date.
    • In addition, consider using GTID for replication to keep track of transactions.
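Putting the variables above together, a my.cnf sketch could look like this (example values only; binlog_expire_logs_seconds exists in MySQL 8.0, older versions use expire_logs_days):

log_bin = /var/lib/mysql-binlog/binlog
max_binlog_size = 1G
binlog_expire_logs_seconds = 2592000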

InnoDB redo logs

This is yet another beast, which we want to keep control of — the redo and undo logs, which get written prior to flushing the data to the disk.

    • innodb_log_file_size

– The size of the redo logs is a trade-off between write speed and recovery time: larger redo logs improve write performance but take longer to recover from.
– The default value is too low, so consider using at least 512 MB for production.
– Total redo log capacity is determined by innodb_log_files_in_group (default value 2). For write-intensive systems, consider increasing innodb_log_files_in_group and keeping the redo logs on a separate disk.

NOTE. Here, the related parameters are innodb_log_file_size and innodb_log_files_in_group.
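A corresponding my.cnf sketch for a write-intensive setup (example values only):

innodb_log_file_size = 512M
innodb_log_files_in_group = 2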

Trading performance over consistency (ACID)

Associated with the redo and undo log discussion is the trading performance over consistency discussion about when InnoDB should flush/sync committed transactions.

innodb_flush_log_at_trx_commit defines how often InnoDB flushes the logs to the disk. This variable can have different values:

    • 0 — transactions are written to redo logs once per second;
    • 1 — (default value) fully ACID-compliant with redo logs written and flushed to disk at transaction commit;
    • 2 — transactions are written to redo logs at commit, and redo logs are flushed once per second.

If the system is write-intensive, you might consider setting this value to 2 to keep redo logs at every commit with the data written to disk once per second. This is a very good compromise between data integrity and performance successfully used in a number of write-intensive setups. This is a relief for the disk subsystem allowing you to gain that extra performance.
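If you decide to trade a bit of durability for performance, the variable can be changed at runtime:

mysql> SET GLOBAL innodb_flush_log_at_trx_commit = 2;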

NOTE. I recommend using the default (1) setting unless you are bulk-loading data (set the variable to 2 during the load), experiencing an unforeseen peak in workload that is hitting your disk system and need to survive until you can solve the problem, or you use the latest MySQL 8.0, where you can also disable redo logging completely.

table_open_cache and max_connections

Opening the cache discussions, we will start from the max_connections parameter, which sets the maximum number of connections that we want to accept on the MySQL server, and the table_open_cache parameter, which sets the value of the cache of open tables we want to keep. Both parameters affect the maximum number of files the server keeps open:

    • table_open_cache value — 2,000 (default), which means that by default you can keep 2,000 tables open per connection.
    • max_connections value — 151 (default).

If you increase both values too much, you may easily run out of memory. So, the total number of open tables in MySQL is:

N of opened tables = N of connections x N (max number of tables per join)

NOTE. This number is related to the joins operated by your database per connection.

So, having an insight into what Zabbix does and which queries it executes can help you fine-tune this parameter. In addition, you can go by the rule of thumb of checking whether the table_open_cache is full. To do that, you can check the global status variable ‘opened_tables‘ to understand what is going on.

In addition, if you are going to increase the table_open_cache and the maximum number of connections, you should check open_files_limit in MySQL and ulimit, the maximum number of open files in the operating system, as new connections are kept as open files in Linux. So, this is a parameter to fine-tune as well.
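A quick way to check whether the table cache keeps overflowing and what the current limits are (a steadily growing Opened_tables counter suggests table_open_cache is too small):

mysql> SHOW GLOBAL STATUS LIKE 'Opened_tables';
mysql> SHOW GLOBAL VARIABLES LIKE 'table_open_cache';
mysql> SHOW GLOBAL VARIABLES LIKE 'open_files_limit';
# and on the operating system side, for the user running mysqld:
ulimit -n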

Open buffers per client connection

There are other buffers that depend on the number of connections (max_connections), such as:

    • read_buffer_size,
    • read_rnd_buffer_size,
    • join_buffer_size,
    • sort_buffer_size,
    • binlog_cache_size (if binary logging is enabled),
    • net_buffer_length.

Depending on how often you get connections to the Zabbix database, you might want to increase these parameters. It is recommended to monitor your database to see how these buffers are being filled up.

You also need to reserve some extra memory for these buffers if you have many connections. That is why it is recommended to reserve only 50 percent of available memory for the InnoDB buffer pool, so that you can use a spare 25 percent for these extra buffers.

However, there might be another solution.

Enabling Automatic Configuration for a Dedicated MySQL Server

In MySQL 8.0, innodb_dedicated_server automatically configures the following variables:

    • innodb_buffer_pool_size,
    • innodb_log_file_size,
    • innodb_log_files_in_group, and
    • innodb_flush_method.

I would enable this variable as it configures the innodb_flush_method, which has a dependency on the file system.

NOTE. Enabling innodb_dedicated_server is not recommended if the MySQL instance shares system resources with other applications, as this variable enabled implicitly means that we are running only MySQL on the machine.
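On a host dedicated to MySQL, enabling it is a single my.cnf line:

innodb_dedicated_server = ON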

Conclusion

Now, you are ready to fine-tune your configuration step by step, starting from innodb_buffer_pool_size, max_connections, and table_open_cache, and see if your performance improves. Eventually, you can do further analysis and go further to really fine-tune your system up to your needs.

In general, 3-5 core parameters would be enough for operating with Zabbix in the vast majority of cases. If you tune those parameters keeping in mind dealing with a write-intensive application, you can achieve good results, especially if you separate the resources at a hardware level or at a VM level.

Performance tuning dos and don’ts

  • For a high-level performance tuning 101, think carefully and consider the whole stack together with the application.
  • In addition, think methodically:
    1. define what you are trying to solve, starting from the core of variables, which you want to fine-tune;
    2. argue why the proposed change will work;
    3. create an action plan; and
    4. verify the change worked.
  • To make things work:

— don’t micromanage;
— do not optimize too much;
— do not optimize everything; and, most importantly,
— do not take best practices as gospel truth, but try to adjust any practices to your particular environment.

 

Low-Level Discovery with Dependent items

Post Syndicated from Brian van Baekel original https://blog.zabbix.com/low-level-discovery-with-dependent-items/13634/

Low-level discovery was introduced in Zabbix 2.0 and is still one of the all-time favorite features. Before LLD was available, adding items was all manual work. For example, adding new disks, new interfaces, network ports on switches and everything else was all manual labor. And then LLD came around, and suddenly we were able to ‘discover’ entities, and based on those discovered entities we can add new items, triggers, and such automatically.

Contents

  • Low-Level Discovery setup
  • Dependent items
  • Combining Low-Level Discovery and Dependent items
  • Conclusion

For a video guide, check out the Zabbix YouTube here: Zabbix: Low Level Discovery with Dependent items – YouTube

Low-Level Discovery setup

Let’s go over the idea of Low-Level Discovery first.

For the sake of clarity, we will stick with the default Zabbix agent item. Of course, as we will discover, it’s only the format that matters for Zabbix to consider a response as LLD information. Let’s use the built-in agent key vfs.fs.discovery. Once we force the Zabbix agent to execute this item, it will reply with something like this:

[{"{#FSNAME}":"/sys","{#FSTYPE}":"sysfs"},{"{#FSNAME}":"/proc","{#FSTYPE}":"proc"},{"{#FSNAME}":"/dev","{#FSTYPE}":"devtmpfs"},{"{#FSNAME}":"/sys/kernel/security","{#FSTYPE}":"securityfs"},{"{#FSNAME}":"/dev/shm","{#FSTYPE}":"tmpfs"},{"{#FSNAME}":"/dev/pts","{#FSTYPE}":"devpts"},{"{#FSNAME}":"/run","{#FSTYPE}":"tmpfs"},{"{#FSNAME}":"/sys/fs/cgroup","{#FSTYPE}":"tmpfs"},{"{#FSNAME}":"/sys/fs/cgroup/systemd","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/pstore","{#FSTYPE}":"pstore"},{"{#FSNAME}":"/sys/firmware/efi/efivars","{#FSTYPE}":"efivarfs"},{"{#FSNAME}":"/sys/fs/bpf","{#FSTYPE}":"bpf"},{"{#FSNAME}":"/sys/fs/cgroup/net_cls,net_prio","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/devices","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/hugetlb","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/memory","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/rdma","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/freezer","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/cpu,cpuacct","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/cpuset","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/perf_event","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/blkio","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/fs/cgroup/pids","{#FSTYPE}":"cgroup"},{"{#FSNAME}":"/sys/kernel/tracing","{#FSTYPE}":"tracefs"},{"{#FSNAME}":"/sys/kernel/config","{#FSTYPE}":"configfs"},{"{#FSNAME}":"/","{#FSTYPE}":"xfs"},{"{#FSNAME}":"/sys/fs/selinux","{#FSTYPE}":"selinuxfs"},{"{#FSNAME}":"/proc/sys/fs/binfmt_misc","{#FSTYPE}":"autofs"},{"{#FSNAME}":"/dev/hugepages","{#FSTYPE}":"hugetlbfs"},{"{#FSNAME}":"/dev/mqueue","{#FSTYPE}":"mqueue"},{"{#FSNAME}":"/sys/kernel/debug","{#FSTYPE}":"debugfs"},{"{#FSNAME}":"/sys/fs/fuse/connections","{#FSTYPE}":"fusectl"},{"{#FSNAME}":"/boot","{#FSTYPE}":"ext4"},{"{#FSNAME}":"/boot/efi","{#FSTYPE}":"vfat"},{"{#FSNAME}":"/home","{#FSTYPE}":"xfs"},{"{#FSNAME}":"/run/user/0","{#FSTYPE}":"tmpfs"}]

When we put this in a more readable format (truncated) it will look like this:

[
{
"{#FSNAME}":"/sys",
"{#FSTYPE}":"sysfs"
},
{
"{#FSNAME}":"/proc",
"{#FSTYPE}":"proc"
},
{
"{#FSNAME}":"/dev",
"{#FSTYPE}":"devtmpfs"
},
{
"{#FSNAME}":"/sys/kernel/config",
"{#FSTYPE}":"configfs"
},
{
"{#FSNAME}":"/",
"{#FSTYPE}":"xfs"
},
{
"{#FSNAME}":"/boot",
"{#FSTYPE}":"ext4"
},
{
"{#FSNAME}":"/home",
"{#FSTYPE}":"xfs"
}
]

In this format, it suddenly becomes clear that we have the {#FSNAME} macro with the name of a filesystem, combined with the type captured in {#FSTYPE}.

Perfect! We feed this information into Zabbix, and LLD magic will happen.
Based on the Item prototypes, new items per {#FSNAME} will be added, and monitoring will start on those items.

Looking at the Item prototypes, they look a lot like normal items:

So, we have one discovery item that is responsible for providing the LLD information, and then the created ‘normal’ items query the filesystem statistics. As you can imagine, with just 5 filesystems and 1 metric per filesystem, queried once per minute, there is no problem. But what if we have 50 filesystems, 7 metrics per filesystem, and they get queried every 10 seconds… That’s a lot of queries against the host! Not only does that add load to the Zabbix server, but obviously also to the monitored host. It works, but is it ideal? It certainly isn’t!
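As a sketch, such an item prototype typically uses a standard agent key with the LLD macro in place of the filesystem name, for example:

Name: Free disk space on {#FSNAME} (percentage)
Key: vfs.fs.size[{#FSNAME},pfree]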

So we’ve basically just setup this:

Dependent items

But then Zabbix introduced dependent items. Let’s take a quick look at dependent items and what they are.

We have one master item that gathers all information (in bulk) and propagates that information to all the dependent items. On those dependent items we just do the cherry picking and filtering of the relevant metrics. Let’s put this to work and see how that goes.

So we create an item, with, in this case, the http agent type, which will collect the following information regarding the server status in a single request:

ServerVersion: Apache/2.4.37 (centos)
ServerMPM: event
Server Built: Nov  4 2020 03:20:37
CurrentTime: Monday, 08-Mar-2021 14:35:20 CET
RestartTime: Monday, 08-Mar-2021 11:04:09 CET
ParentServerConfigGeneration: 1
ParentServerMPMGeneration: 0
ServerUptimeSeconds: 12671
ServerUptime: 3 hours 31 minutes 11 seconds
Load1: 0.01
Load5: 0.03
Load15: 0.00
Total Accesses: 1182
Total kBytes: 10829
Total Duration: 95552
CPUUser: 5.01
CPUSystem: 7.34
CPUChildrenUser: 0
CPUChildrenSystem: 0
CPULoad: .0974667
Uptime: 12671
ReqPerSec: .0932839
BytesPerSec: 875.14
BytesPerReq: 9381.47
DurationPerReq: 80.8393
BusyWorkers: 1
IdleWorkers: 99
Processes: 4
Stopping: 0
BusyWorkers: 1
IdleWorkers: 99
ConnsTotal: 4
ConnsAsyncWriting: 0
ConnsAsyncKeepAlive: 0
ConnsAsyncClosing: 0
Scoreboard: _________________________________________________________________________________________W__________............................................................................................................................................................................................................................................................................................................

 

Now, we create some dependent items that depend on that first item (which we will call the Master item). Every time the Master item receives information, the complete reply will be pushed to the dependent items, without any altering of that data. So the master and dependent items are identical when no preprocessing is applied. That’s why on the dependent items we apply preprocessing to filter relevant information, for example, the BusyWorkers:
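On such a dependent item, the preprocessing can be a single ‘Regular expression’ step; a sketch for the BusyWorkers metric:

Preprocessing step: Regular expression
Parameters: BusyWorkers: ([0-9]+)
Output: \1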

Perfect. We query the host once, get all the metrics in bulk, and then parse them in Zabbix using preprocessing. Say goodbye to excessive load on the monitored host (and since the parsing is handled by the preprocessing processes within Zabbix, the Zabbix server side has no problem with it either).

Combining Low-Level Discovery and Dependent items

Ok, and what if we combine these two concepts? LLD with dependent items? Wouldn't that be the ultimate goal? Automatically creating new items without putting extra load on the monitored host? Let's get this going!

To stick with the first example of LLD, we will again discover filesystems, but this time not with the vfs.fs.discovery key, but with the newly introduced vfs.fs.get key. Once we force the agent to execute this key, we will see this reply:

[{"fsname":"/dev","fstype":"devtmpfs","bytes":{"total":1940963328,"free":1940963328,"used":0,"pfree":100.000000,"pused":0.000000},"inodes":{"total":473868,"free":473487,"used":381,"pfree":99.919598,"pused":0.080402}},{"fsname":"/dev/shm","fstype":"tmpfs","bytes":{"total":1958469632,"free":1958469632,"used":0,"pfree":100.000000,"pused":0.000000},"inodes":{"total":478142,"free":478141,"used":1,"pfree":99.999791,"pused":0.000209}},{"fsname":"/run","fstype":"tmpfs","bytes":{"total":1958469632,"free":1892040704,"used":66428928,"pfree":96.608121,"pused":3.391879},"inodes":{"total":478142,"free":477519,"used":623,"pfree":99.869704,"pused":0.130296}},{"fsname":"/sys/fs/cgroup","fstype":"tmpfs","bytes":{"total":1958469632,"free":1958469632,"used":0,"pfree":100.000000,"pused":0.000000},"inodes":{"total":478142,"free":478125,"used":17,"pfree":99.996445,"pused":0.003555}},{"fsname":"/","fstype":"xfs","bytes":{"total":95516360704,"free":55329644544,"used":40186716160,"pfree":57.926877,"pused":42.073123},"inodes":{"total":46661632,"free":46535047,"used":126585,"pfree":99.728717,"pused":0.271283}},{"fsname":"/boot","fstype":"ext4","bytes":{"total":1023303680,"free":705544192,"used":247296000,"pfree":74.046435,"pused":25.953565},"inodes":{"total":65536,"free":65497,"used":39,"pfree":99.940491,"pused":0.059509}},{"fsname":"/home","fstype":"xfs","bytes":{"total":5358223360,"free":5286903808,"used":71319552,"pfree":98.668970,"pused":1.331030},"inodes":{"total":2621440,"free":2621428,"used":12,"pfree":99.999542,"pused":0.000458}},{"fsname":"/run/user/0","fstype":"tmpfs","bytes":{"total":391692288,"free":391692288,"used":0,"pfree":100.000000,"pused":0.000000},"inodes":{"total":478142,"free":478137,"used":5,"pfree":99.998954,"pused":0.001046}}]

And if we format it to be more readable, it will look like this (truncated):

[
  {
    "fsname":"/",
    "fstype":"xfs",
    "bytes":{
      "total":95516360704,
      "free":55329644544,
      "used":40186716160,
      "pfree":57.926877,
      "pused":42.073123
    },
    "inodes":{
      "total":46661632,
      "free":46535047,
      "used":126585,
      "pfree":99.728717,
      "pused":0.271283
    }
  },
  {
    "fsname":"/home",
    "fstype":"xfs",
    "bytes":{
      "total":5358223360,
      "free":5286903808,
      "used":71319552,
      "pfree":98.668970,
      "pused":1.331030
    },
    "inodes":{
      "total":2621440,
      "free":2621428,
      "used":12,
      "pfree":99.999542,
      "pused":0.000458
    }
  }
]

Per filesystem, we get the original information, FSNAME and FSTYPE, but also the statistics of these filesystems… bulk metrics! So, we create a normal item (which will serve as the master item) that retrieves all those metrics in a single query:
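
In template export (YAML) form, such a master item could look roughly like this; the item name and update interval are assumptions, only the vfs.fs.get key comes from the post:

items:
  - name: 'Get filesystems'      # master item collecting the bulk JSON shown above
    type: ZABBIX_ACTIVE          # or a passive Zabbix agent item, depending on your setup
    key: vfs.fs.get
    delay: 1m
    value_type: TEXT             # the raw JSON reply is stored as text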

Once we’ve got this data in Zabbix, we feed it into the LLD rule, giving this LLD rule the dependent LLD type:

Of course, there are no ready-to-use LLD macros in this data, but since it is in JSON format, it shouldn't be too hard to create the LLD macros with the 'LLD macros' option in the frontend and the relevant JSONPath expressions:

Note: Technically we do not need to create the {#FSTYPE} macro to get this working!
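
For reference, such a mapping would look roughly like this in export form (a sketch; the macro names come from this post, and the JSONPath expressions point at the fsname and fstype fields of the vfs.fs.get output above):

lld_macro_paths:
  - lld_macro: '{#FSNAME}'
    path: '$.fsname'
  - lld_macro: '{#FSTYPE}'       # optional, see the note above
    path: '$.fstype'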

Once this is done, we should be ready to create the item prototypes for this LLD rule. The data is there, macros are available, nothing is going to stop us now!

Let’s move on to item prototypes. But of course, we do not want to poll that remote host again per discovered filesystem. That means we will make this item prototype of the dependent item type as well, pointing it back to the master item we’ve created.

For the first item prototype, we want to obtain the total size per filesystem:

But, as I mentioned earlier: a dependent item without any preprocessing is identical to the master item and of course that would be wrong in this case. We just want to see the total bytes per filesystem and not all the collected statistics. In the configuration above we already know what to get out, so the Type of information and Units are filled already. What is not visible on that screenshot is the preprocessing rule that we need. Here the ‘JSONPath’ preprocessing step comes in handy since we receive JSON data. We would like to get out this part for our item (truncated):

[
  {
    "fsname":"/",
    "fstype":"xfs",
    "bytes":{
      "total":95516360704,
      "free":55329644544,
       "used":40186716160,
      "pfree":57.926877,
      "pused":42.073123

So, if we try to get this information out using JSONPath, it would look like $.bytes.total.first(), but that would match on any filesystem, so we need to make it a bit more specific: $[?(@.fsname=='/')].bytes.total.first()

As you can see, the JSONPath is a bit more complex here. We are forcing it to match on @.fsname=='/' and, from that entity, getting out the bytes.total. Now, to take it one step further, we shouldn't hardcode the filesystem in the JSONPath, since we're working with item prototypes. It should be the LLD macro {#FSNAME} instead, as in the sketch below.
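
Put together, the item prototype could look roughly like this in export form. This is a sketch only: the item name and key are assumptions, and it points back at the vfs.fs.get master item used above:

item_prototypes:
  - name: 'FS [{#FSNAME}]: Total space'
    type: DEPENDENT
    key: 'vfs.fs.total[{#FSNAME}]'
    master_item:
      key: vfs.fs.get
    value_type: UNSIGNED                        # Type of information: Numeric (unsigned)
    units: B
    preprocessing:
      - type: JSONPATH
        parameters:
          - '$[?(@.fsname=="{#FSNAME}")].bytes.total.first()'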

Now we save this item prototype, grab a cup of coffee (or just force a config_cache_reload on the server) and just wait for the magic to happen.

We’ve now built this setup:


So the master item will get values (i.e. obtain the bulk data every minute) and push them into the LLD rule. From there, items will be created as per the item prototypes, and those items are populated from the master item as well, filtering out only the relevant metrics using preprocessing.

So far, so good, but we have one small problem to solve: we want to get metrics every minute or so, but since all those metrics will also get pushed into the LLD rule, we might be adding unnecessary extra load due to the high discovery frequency. Luckily, solving that problem is not too hard. Navigate to the discovery rule, go to the 'Preprocessing' tab and add the 'Discard unchanged with heartbeat' step with a parameter of 1h, or an even larger interval!
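
In export form, that single preprocessing step on the dependent discovery rule could look roughly like this (a sketch; the rule name and key are assumptions):

discovery_rules:
  - name: 'Filesystem discovery'
    type: DEPENDENT
    key: vfs.fs.discovery.dependent
    master_item:
      key: vfs.fs.get
    preprocessing:
      - type: DISCARD_UNCHANGED_HEARTBEAT
        parameters:
          - 1h          # pass data to discovery only when it changes, but at least once per hour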

This is insane! With just one poll/query to a host, we utilize the power of LLD and dependent items, getting all the metrics while adding only minimal extra load on that host.


Conclusion

That's it. If you've set everything up correctly, you should now be getting quite a few filesystem metrics without adding any extra performance overhead on the host through unnecessary data requests.

Of course, if you need help optimizing your Zabbix environment, support contracts, consultancy, or training, we from Opensource ICT Solutions are always available to assist you in every possible way, worldwide, 24×7.

Thanks for reading this blog post, see you in the next one.

Finalizing the installation of Zabbix Agent with Ansible

Post Syndicated from Werner Dijkerman original https://blog.zabbix.com/finalizing-the-installation-of-zabbix-agent-with-ansible/13321/

In the previous blog posts, we created a Zabbix Server with a new user, a media type, and an action. In the 2nd blog post, we continued with creating and configuring a Zabbix Proxy. In the last part of this series of blog posts, we will install the Zabbix Agent on all of the 3 nodes we have running.

This blog post is the 3rd part of a 3 part series of blog posts where Werner Dijkerman gives us an example of how to set up your Zabbix infrastructure by using Ansible.
You can find part 1 of the blog post by clicking here.

To summarize, so far we have a Zabbix Server and a Zabbix Proxy. The Zabbix Server has a MySQL instance running on a separate node, while the MySQL instance for the Zabbix Proxy runs on the same node as the proxy. But we are missing one component right now, and that is something we will install with the help of this blog post: the Zabbix Agent on all 3 nodes.

A git repository containing the code used in these blog posts is available on https://github.com/dj-wasabi/blog-installing-zabbix-with-ansible. Before we run Ansible, we need to make sure we have opened a shell on the “bastion” node. We can do that with the following command:

$ vagrant ssh bastion

Once we have opened the shell, go to the “/ansible” directory where we have all of our Ansible files present.

$ cd /ansible

In the previous blog post, we executed the "zabbix-proxy.yml" playbook. Now we are going to use the "zabbix-agent.yml" playbook. The playbook will install the Zabbix Agent on all nodes ("node-1", "node-2" and "node-3"). Next up, on both "node-2" and "node-3" (the nodes running a MySQL instance), we will add a user parameters file specifically for MySQL. With this user parameters file, we are able to monitor the MySQL instances.

$ ansible-playbook -i hosts zabbix-agent.yml

This playbook will run for a few minutes, installing the Zabbix Agent on the nodes. It will install the zabbix-agent package and add the configuration file, but it will also make a connection to the Zabbix Server API and automatically create hosts with the correct IP information and the correct templates! When the Ansible playbook has finished running, the hosts can immediately be found in the frontend. And better yet, they are automatically configured correctly, so the hosts will be monitored immediately!

We have several configurations spread over multiple files to make this work. We first start with the “all” file.

The file “/ansible/group_vars/all” contains the properties that will apply to all hosts. Here we have the majority of essential properties configured that are overriding the default properties of the Ansible Roles. Each role has some default configuration, which will work out of the box. But in our case, we need to override these, and we will discuss some of these properties next.

zabbix_url

This is the URL on which the Zabbix Frontend is available and thus also the API. This property is for example used when we create the hosts via the API as part of the Proxy and Agent installation.
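
For example, assuming the Zabbix Frontend runs on "node-1" (the Zabbix Server in this setup), the property could be set roughly like this; the exact value used in the repository may differ:

zabbix_url: "http://{{ hostvars['node-1']['ansible_host'] }}"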

zabbix_proxy

The Zabbix Agents will be monitored by the Zabbix Proxy unless the Agent runs on the Zabbix Server or the host running the database for the Zabbix Server. Like with the previous blog post, we will also use some Ansible notation to get the IP address of the host running the Zabbix Proxy to configure the Zabbix Agent.

zabbix_proxy: node-3
zabbix_agent_server: "{{ hostvars[zabbix_proxy]['ansible_host'] }}"
zabbix_agent_serveractive: "{{ hostvars[zabbix_proxy]['ansible_host'] }}"

With the above configuration, we configure both the Server and ServerActive in the Zabbix Agent’s configuration file to use the IP address of the host running the Zabbix Proxy. If you look at the files “/ansible/group_vars/zabbix_database” and “/ansible/group_vars/zabbix_server/generic” you would see that these contain the following:

zabbix_agent_server: "{{ hostvars['node-1']['ansible_host'] }}"
zabbix_agent_serveractive: "{{ hostvars['node-1']['ansible_host'] }}"

The Zabbix Agent on the Zabbix Server and on its database host thus uses the IP address of the Zabbix Server as the value for both the "Server" and "ServerActive" configuration settings of the Zabbix Agent.

zabbix_api_user & zabbix_api_pass

These are the defaults in the roles, but I have added them here so it is clear that they exist. When you change the Admin user password, don't forget to change them here as well.

zabbix_api_create_hosts & zabbix_api_create_hostgroups 

Because we automatically want to create the Zabbix Frontend hosts via the API, we need to set both these properties to true. Firstly, we create the host groups that can be found with the property named “zabbix_host_groups”. After that, as part of the Zabbix Agent installation, the hosts will be created via the API because of the property zabbix_api_create_hosts.
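
In the "all" file this simply means both flags are switched on:

zabbix_api_create_hostgroups: true
zabbix_api_create_hosts: true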

Now we need to know what kind of information we want these hosts created with. Let’s go through some of them.

zabbix_agent_interfaces

This property contains a list of all interfaces that are used to monitor the host. This is relatively simple in our case, as the hosts only have 1 interface available. You can find some more information about what to use when you have other interfaces like IPMI or SNMP at https://github.com/ansible-collections/community.zabbix/blob/main/docs/ZABBIX_AGENT_ROLE.md#other-interfaces. We use the interface with the value from the property "ansible_host" for port 10050.
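
A minimal sketch of such an interface definition, assuming a single Zabbix agent interface on port 10050 (the field layout follows the role documentation linked above; the exact values in the repository may differ):

zabbix_agent_interfaces:
  - type: 1                      # 1 = Zabbix agent interface
    main: 1
    useip: 1
    ip: "{{ ansible_host }}"     # the IP address Ansible already knows for this host
    dns: ""
    port: "10050"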

zabbix_host_groups

This property was also discussed before: we automatically assign our new host to these host groups. Again, we have a fairly basic setup, so this is a straightforward property.
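
It is just a list of host group names, for example (the group name here is purely hypothetical):

zabbix_host_groups:
  - Linux servers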

zabbix_link_templates

With this property, we provide a list of all Zabbix templates we want to assign to the hosts. It seems a bit complicated, but no worries, let's dive in!

zabbix_link_templates:
  - "{{ zabbix_link_templates_append if zabbix_link_templates_append is defined else [] }}"
  - "{{ zabbix_link_templates_default }}"

With the first line, we add the property’s value “zabbix_link_templates_append”, but we only do that if that property exists. If Ansible can not find that property, then we basically add an empty list. So where can we find this property? We can check the files in the other directories in the group_vars directory. If we check, for example “/ansible/group_vars/database/generic”, we will find the property:

zabbix_link_templates_append:
  - 'MySQL by Zabbix agent'

So on all nodes that are part of the database group, we add the value to the property “zabbix_link_templates”. All of the database servers will get this template attached to the host. If we would check the file “/ansible/group_vars/zabbix_server/generic”, then we will find the following:

zabbix_link_templates_append:
  - 'Zabbix Server'

As you probably understand now, when we create the Zabbix Server host, we will add the “Zabbix Server” template to the host, because this file is only used for the hosts that are part of the zabbix_server group.

With this setup, we can configure specific templates for the specific groups, but there is also at least 1 template that we always want to add. We don’t want to add the template to each file as that is a lot of configuration, so we use a new property for this named “zabbix_link_templates_default”. In our case, we only have Linux hosts, so we always want to add the templates:

zabbix_link_templates_default:
  - "Linux by Zabbix agent active"

On the Zabbix Server, we assign both the "Zabbix Server" template and the "Linux by Zabbix agent active" template to the host.

But what if we have Macros?

zabbix_macros

As part of some extra tasks in this playbook execution, we also need to provide a macro for some hosts. This macro is needed to make the Zabbix Template we assign to the hosts work. For the hosts running a MySQL database, we need to add a macro, which can be found with the property zabbix_macros_append in the file “/ansible/group_vars/database/generic”.

zabbix_macros_append:
  - macro_key: "MYSQL.HOST"
    macro_value: "{{ ansible_host }}"

We will create 1 macro with the key name “MYSQL.HOST” and assign a value that will be equal to the contents of the property ansible_host (For the “node-2” host, the host running the database for the Zabbix Server), which is “10.10.1.12”.

User parameters

The “problem” with assigning the MySQL template is that it also requires some UserParameter entries set. The Zabbix Agent role can deploy files containing UserParameters to the given hosts. In “/ansible/group_vars/database/generic” we can find the following properties:

zabbix_agent_userparameters_templates_src: "{{ inventory_dir }}/files/zabbix/mysql"
zabbix_agent_userparameters:
  - name: template_db_mysql.conf

The first property, "zabbix_agent_userparameters_templates_src", lets Ansible know where to find the files. The "{{ inventory_dir }}" will be translated to "/ansible", and here you will find a directory named "files" (next to the group_vars directory); drilling further down the directories, you will find the file "template_db_mysql.conf".

With the second property, "zabbix_agent_userparameters", we let Ansible know which files we want to deploy to the host. In this case, that is the only file found in that directory, named "template_db_mysql.conf".

When the Zabbix agent role is fully executed, we have everything set to monitor all the hosts automatically. Open the dashboard, and you will see something like the following:

It provides an overview, and on the right side, you will notice we have a total of 3 nodes of which 3 are available. Maybe you will see a “Problem” like in the screenshot above, but it will go away.

If we go to “Configuration” and “Hosts,” we will see that we have the 3 nodes, and they have the status “Enabled” and the “ZBX” icon is green, so we have a proper connection.

We should verify that we have some data, so go to “Monitoring” and click on “Latest data.” We select in the Host form field the “Zabbix database,” and we select “MySQL” as Application and click on “Apply.” If everything is right, it should provide us with some information and values, just like the following screenshot. If not, please wait a few minutes and try again.

Summary

This is the end of a 3-part blog post series on creating a fully working Zabbix environment with a Zabbix Server, Proxy, and Agent. With these 3 blog posts, you were able to see how you can install and configure a complete Zabbix environment with Ansible. Keep in mind that the code shown was for demo purposes and is not something you can use immediately for a production environment. We also only used some of the available functionality of the Ansible collection for Zabbix; there are many more possibilities, like creating a maintenance period or a discovery rule. Not everything is possible yet, so if you miss a task or functionality of a role that Ansible should do or configure, please create an issue on GitHub so we can make it happen.

Don’t forget to execute the following command:

$ vagrant destroy -f

With this, we clean up our environment and delete our 4 nodes, thus finishing with the task at hand!