Tag Archives: monitoring

Saving Time with a Custom Zabbix Agent Installer

Post Syndicated from Rizqi Firmansyah original https://blog.zabbix.com/saving-time-with-a-custom-zabbix-agent-installer/31843/

When managing large-scale infrastructure, the process of installing monitoring agents is often repetitive and time-consuming. Administrators must log into each server, manually run installation commands, and configure the agent to connect to the Zabbix server. To address this issue, the Zabbix Agent Deployer custom module was created. This module enables the direct installation of Zabbix agents on multiple hosts from the Zabbix Web interface.

The features of the Zabbix Agent Deployer module include:

  • Bulk host list input using a CSV file.
  • The ability to automatically add hosts to Zabbix and remotely install the Zabbix Agent on the
    associated hosts.
  • The ability to display installation log results directly within the module.

With this approach, administrators can add new hosts to the monitoring system faster and more efficiently.

Key use cases for the Zabbix Agent installer

The Zabbix Agent Deployer module enables several practical scenarios, including:

1. Faster provisioning for new servers – When adding a large number of servers, agents can be installed simultaneously without requiring a login to each machine.

2. Standardized installation – All agents are installed in the same way using a centralized script, reducing the risk of misconfiguration.

3. Easier additional provisioning – Provisioning new servers is easier for users because they don’t need to configure them directly on the server.

Getting started with the Zabbix Agent Deployer module

Solution overview architecture

To use this module, the main steps are:

1. Upload the custom module to the Zabbix frontend in the /usr/share/zabbix/modules/ directory.

2. Enable the module from the Administration → General → Modules page, and click the Scan Directory button. Locate the Zabbix agent deployer module and click Enabled.

3. Once activated, the Zabbix agent deployer module can be accessed in the Data Collection menu. Here’s a screenshot of the Zabbix agent deployer module.

4. Prepare a CSV file like the format below, or download a sample CSV from the module page.

With this CSV file, we will add two hosts to Zabbix to be monitored and automatically install the Zabbix agent on them.

5. Upload the CSV file to the Zabbix agent deployer module page and click Apply.

6. The Zabbix agent deployer module will handle the process of adding hosts to Zabbix and installing the Zabbix agent. The status can be seen as follows:

From the image above, server1 and server2 were successfully added to Zabbix, and the Zabbix agent installation was successful!

7. Check out the Zabbix hosts list page. Hosts will appear according to the uploaded CSV file.

Conclusion

The implementation of this custom Zabbix Agent installer extends Zabbix’s capabilities beyond its built-in functionality. The Zabbix Agent Deployer module enables a more efficient bulk host addition process, as all steps from adding hosts to Zabbix to installing the Zabbix agent can be integrated through a single page.

If you’re interested in implementing this, please contact us. Bangunindo is a premium Zabbix partner in Indonesia. We’re ready to help you design, implement, and optimize your Zabbix solution to suit your needs.

The post Saving Time with a Custom Zabbix Agent Installer appeared first on Zabbix Blog.

Aruba Central API Monitoring with Zabbix

Post Syndicated from Tibor Volanszki original https://blog.zabbix.com/aruba-central-api-monitoring-with-zabbix/31370/

Aruba Central is a SaaS solution that allows you to manage your Enterprise Aruba network environment. Due to the increasing number of cloud migrations, we can expect that more and more Aruba customers will move their on-premise environment to it, which will also mean a change in their monitoring environment. In this article, I will show you how to switch to API-based monitoring using Aruba Central and Zabbix. All custom resources mentioned can be found in my repository.

Aruba Central’s API

OAuth 2.0 is used, so you can forget about simple token management. In the end it is great, but for monitoring purposes it is overkill. There is pretty good documentation (referred to later) on how to generate your access token, but the token expires after two hours, so you need to refresh it continually. To do this, you use a refresh token, which gets you a new access token AND a new refresh token.

Within two hours, use the latest refresh token to repeat this action. At this point you can imagine that this is not something you can implement easily using the Zabbix GUI alone. Well, maybe with some JavaScript magic, but otherwise there is no native support for this logic at this point in time. So how can we do this? In short:

  1. Generate your client credentials
  2. Generate your first token
  3. Schedule the token refresh for every two hours
  4. Update your host macro via Zabbix API
  5. Use the token in Zabbix HTTP agent checks
  6. Monitor your environment based on JSONPath pre-processing

Initial steps within Aruba Central

To manage your API access, you need to launch your “HPE Aruba Networking Central” application, so do NOT look into your workspace modules – the “Personal API clients” menu is NOT what we are looking for. Turn off the “New Central” view – at this point the early access version is not so useful (hopefully it will change soon).

The first time you get there, you will not see any items, but under the “My Apps & Tokens” tab you can click the “Add Apps & Tokens” button and generate one. Technically, this is already enough to start monitoring your network infrastructure, but within two hours it would stop. So the relevant data for us are the “Client ID” and “Client Secret.” Feel free to revoke the recently created token at the bottom area as we do not need it.

Record your credentials

For this article, I am using a simple file to store all the credentials, which will be sourced into a bash script. Please keep in mind that storing your sensitive credentials in a single file is a BAD practice! Your SECO/CISO would probably have a few words with you about it, so please consider a better approach. A more secure way would be to use a key vault solution (like Azure, AWS, Google, or HashiCorp). Anyway, let’s continue with this insecure example:

#!/bin/bash

### ZABBIX VARS ###

# URL of your zabbix instance (assuming you do not use the "/zabbix" ending, if yes, then add it to the end)
zabbix_url="https://your.zabbix.instance.net"
# Your Zabbix API token. If you do not know how to get it, check the documentation.
zabbix_api_token="1234_your_zabbix_api_key_5678"
# Create a host with a macro, remain at the "Macros" tab, turn on debug mode, look for "[hostmacroid] =>"
zabbix_macro_id="12345"

### ARUBA VARS ###
# To find yours, go here and check "Table: Domain URLs for API Gateway Access"
base_url="YOUR_ARUBA_CENTRAL_BASE_URL"
# provided in the previous step
client_id="YOUR_CLIENT_ID"
# provided in the previous step
client_secret="YOUR_CLIENT_SECRET"
# Click on your profile in the Central app and you will find it there: 32 char long hexa string
customer_id="YOUR_CUSTOMER_ID"
# your login credential
account_username="YOUR_CENTRAL_LOGIN_USERNAME"
# your login credential
account_password="YOUR_CENTRAL_LOGIN_PASSWORD"
# to be populated later
csrftoken=""
session=""
auth_code=""
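
If you do stay with a plain file for a while, the bare minimum is to make it readable by its owner only. Here is a minimal sketch of that idea. The helper name lock_down_file is made up for this example, and it is no replacement for a proper vault.

```shell
#!/bin/bash
# Restrict a credentials file to its owner (mode 600) and verify the result.
# "lock_down_file" is a hypothetical helper name, not part of any tool.
lock_down_file() {
    local file=$1
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    chmod 600 "$file" || return 1
    # The first 10 characters of `ls -l` hold the file type and permission bits.
    local perms
    perms=$(ls -l "$file" | cut -c1-10)
    [ "$perms" = "-rw-------" ]
}
```

With the layout used later in this article, the call would be lock_down_file central/variables.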

Get or refresh your token and update the Zabbix host macro

The next steps are based on the official Aruba documentation, which you can find here. Please remember that there are many ways to achieve our target – this is just one example and probably not the most optimal one. Feel free to change / improve it with your code in your preferred scripting language.

The script below assumes that the file containing the credentials (previous step) is named “variables” and located in a folder named “central.”

Filename: aruba_central_token_new.sh

Purpose: To be used for first time token generation. Later, you only have to refresh your token with the script after this one.

Remarks: Aruba is limiting this API query set, so you can run it only ONCE every 30 minutes! If you made a typo somewhere, wait 30 minutes before your next attempt or tweak the result files.

#!/bin/bash

basedir=central
source $basedir/variables

curl -s --noproxy '*' -v --cookie-jar $basedir/cookie --location --request POST "$base_url/oauth2/authorize/central/api/login?client_id=$client_id" \
--header "Content-Type: application/json" \
--data-raw "{
    \"username\": \"$account_username\",
    \"password\": \"$account_password\"
}" > $basedir/result1.raw 2>&1

grep 'Added cookie' $basedir/result1.raw > $basedir/result1.filtered

csrftoken=$(grep csrftoken $basedir/result1.filtered | awk -F '"' '{print $2}')
session=$(grep session $basedir/result1.filtered | awk -F '"' '{print $2}')

curl -s --noproxy '*' --request POST "$base_url/oauth2/authorize/central/api?client_id=$client_id&response_type=code&scope=all" \
--header "Content-Type: application/json" \
--header "Cookie: session=$session" \
--header "X-CSRF-Token: $csrftoken" \
--data-raw "{
\"customer_id\": \"$customer_id\"
}" > $basedir/result2.raw

auth_code=$(cat $basedir/result2.raw | jq -r .auth_code)

curl -s --noproxy '*' --request POST "$base_url/oauth2/token" \
--header "Content-Type: application/json" \
--data "{
    \"client_id\": \"${client_id}\",
    \"client_secret\": \"${client_secret}\",
    \"grant_type\": \"authorization_code\",
    \"code\": \"${auth_code}\"         
}" > $basedir/result3.raw

refresh_token=$(cat $basedir/result3.raw | jq -r .refresh_token)
access_token=$(cat $basedir/result3.raw | jq -r .access_token)

if [ "$refresh_token" == "null" ]; then
    echo "something went wrong... exiting now"
    exit 1
fi

echo $access_token > $basedir/token_access.latest
echo $refresh_token > $basedir/token_refresh.latest

echo "access_token: $access_token"
echo "refresh_token: $refresh_token"

curl -s --request POST \
--url "$zabbix_url/api_jsonrpc.php" \
--header "Authorization: Bearer $zabbix_api_token" \
--header "Content-Type: application/json-rpc" \
--data "{\"jsonrpc\": \"2.0\",\"method\": \"usermacro.update\",\"params\": {\"hostmacroid\": \"${zabbix_macro_id}\",\"value\": \"${access_token}\"},\"id\": 1}"

rm -f $basedir/cookie
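
Because of the 30-minute limit mentioned in the remarks, it can help to guard this script so that an impatient retry does not burn another attempt. The following is only a sketch: the stamp file idea and the function names are my own assumptions, not something Aruba provides.

```shell
#!/bin/bash
# Guard for the 30-minute limit on first-time token generation.
# The stamp file and both function names are assumptions made for this sketch.
MIN_GAP_SECONDS=1800   # 30 minutes

# Succeeds (exit 0) when enough time has passed since the last attempt.
# $1 = epoch seconds of the last attempt, $2 = current epoch seconds
gap_elapsed() {
    [ $(( $2 - $1 )) -ge "$MIN_GAP_SECONDS" ]
}

# Prints the last attempt time stored in a stamp file (0 if the file is missing).
# $1 = path to the stamp file
last_attempt() {
    if [ -f "$1" ]; then cat "$1"; else echo 0; fi
}
```

A wrapper could then run gap_elapsed "$(last_attempt central/last_attempt.stamp)" "$(date +%s)", and only on success write date +%s into the stamp file and call aruba_central_token_new.sh.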

Filename: aruba_central_token_refresh.sh

Purpose: To refresh your existing token. It is expecting an existing refresh token in the “token_refresh.latest” file, so better to run the previous script one time before this.

Remarks: You can run this script as many times as you want, but it will result in new tokens only once every two hours (when the current one expires). Therefore, refreshing too frequently is pointless.

#!/bin/bash

basedir=central
source $basedir/variables

refresh_token_current=$(cat $basedir/token_refresh.latest | tr -d '\n')
refresh_token_new=""

curl -s --noproxy '*' --request POST "$base_url/oauth2/token?client_id=$client_id&client_secret=$client_secret&grant_type=refresh_token&refresh_token=$refresh_token_current" > $basedir/result4.raw

refresh_token_new=$(cat $basedir/result4.raw | jq -r .refresh_token)
access_token_new=$(cat $basedir/result4.raw | jq -r .access_token)
expires_in=$(cat $basedir/result4.raw | jq -r .expires_in)

if [ "$refresh_token_new" == "null" ]; then
    echo "something went wrong... exiting now"
    exit 1
fi

echo $access_token_new > $basedir/token_access.latest
echo $refresh_token_new > $basedir/token_refresh.latest

echo "access_token: $access_token_new"
echo "refresh_token: $refresh_token_new"
echo "expires_in: $expires_in"

curl -s --request POST \
--url "$zabbix_url/api_jsonrpc.php" \
--header "Authorization: Bearer $zabbix_api_token" \
--header "Content-Type: application/json-rpc" \
--data "{\"jsonrpc\": \"2.0\",\"method\": \"usermacro.update\",\"params\": {\"hostmacroid\": \"${zabbix_macro_id}\",\"value\": \"${access_token_new}\"},\"id\": 1}"
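
Neither script checks what the Zabbix API answers, so a rejected macro update can go unnoticed. Below is a small sketch of a check you could wrap around that last curl call. It assumes jq is installed (the scripts already rely on it), and the helper name check_zabbix_response is mine.

```shell
#!/bin/bash
# Inspect a Zabbix JSON-RPC response and fail when it carries an "error" member.
# "check_zabbix_response" is a hypothetical helper; it requires jq.
check_zabbix_response() {
    local response=$1
    # jq -e exits non-zero when the expression yields null or false.
    if printf '%s' "$response" | jq -e '.error' >/dev/null 2>&1; then
        printf 'Zabbix API error: %s\n' \
            "$(printf '%s' "$response" | jq -r '.error.data // .error.message')" >&2
        return 1
    fi
    # A successful usermacro.update lists the affected IDs under .result.hostmacroids
    printf '%s' "$response" | jq -e '.result.hostmacroids | length > 0' >/dev/null 2>&1
}
```

To use it, capture the output of the curl call in a variable and run check_zabbix_response "$response" || exit 1, so that whatever schedules the script sees the failure.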

In my case, both the scripts and variables files are in the same “central” folder, which is in a git repository. Each time I call one of the scripts, it will record the new tokens in files, which are committed and pushed to the repo. In my own implementation, this is how I call the refresh script and sync the result with my repo:

git checkout master

basedir=central
source $basedir/variables
bash $basedir/aruba_central_token_refresh.sh

git add .
git commit -m "save the new tokens"
git push origin master

Schedule your token management

You must run your refresh script at least once every two hours. To make this happen you have many options, including:

  • cron (old-school, outdated way)
  • systemd timer (a better way, but only if it is monitored)
  • Jenkins, GitHub Actions, etc.
  • Zabbix itself, by calling your bash script
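
For illustration, here is roughly what the cron option from the list above could look like. This is only a sketch: the helper name and both paths are placeholders I made up, and an hourly schedule is just one choice that stays inside the two-hour window.

```shell
#!/bin/bash
# Print a crontab line that runs the refresh script at the top of every hour.
# Both arguments are placeholders for this sketch; adjust them to your setup.
refresh_cron_line() {
    local workdir=$1   # directory that contains the "central" folder
    local logfile=$2   # file that collects the script output
    # The scripts use the relative path "central", so cd into its parent first.
    printf '0 * * * * cd %s && /bin/bash central/aruba_central_token_refresh.sh >> %s 2>&1\n' \
        "$workdir" "$logfile"
}
```

The printed line can be appended to the output of crontab -l and piped back into crontab - to install it. Whatever scheduler you pick, monitor the job itself.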

In my case, Jenkins does the scheduling and execution and the job is monitored via Zabbix.

Monitor your network infrastructure

When everything is in place, then the monitoring part is pretty simple. The usual JSONPath based logic can be used. API call documentation can be found here. The template contains only the wireless components, since I do not have my switches in Central. Implementing the switching part should not be difficult – just have a look at the “Switch” section, then clone and adjust one of your “get” items.

Screenshots

Latest data – tag based filtering:

Latest data – Site health

Latest data – Gateway info

Latest data – AP info

Triggers:

Some triggers are intentionally disabled, because they are a bit redundant. However, I wanted to cover all options. Sometimes less alerting is better if you have a ticketing system integration, otherwise your monitoring system will turn into a ticket factory.

Known issues and limitations

Since we are not querying the devices directly, some delay can be expected. Based on my recent testing, the delay compared to real time is between 3 and 10 minutes. In my test I disconnected my test environment and then started to do manual updates frequently. Some items got the real state earlier, some only later.

If your refresh script malfunctions for whatever reason (normally it should not), then you may have to run the other script once to generate a new token, or you can go to the GUI and check the last refresh token, with which you can overwrite the content of the “token_refresh.latest” file.

Aruba is limiting the number of API queries to 5,000 per day. This could seem annoying, but it is way more than what you need (you should expect less than 1,000 in normal conditions, depending on your update frequency).
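
To check your own setup against that limit, a quick estimate is enough. The sketch below assumes every HTTP agent item triggers one API call per update interval. The function names are made up, and token refresh calls are not counted.

```shell
#!/bin/bash
# Estimate Aruba Central API calls per day for a given polling setup.
DAILY_LIMIT=5000   # daily limit quoted in this article

# $1 = number of HTTP agent items that call the API
# $2 = update interval of those items, in seconds
calls_per_day() {
    echo $(( $1 * 86400 / $2 ))
}

# Succeeds when the estimate stays within the daily limit.
within_limit() {
    [ "$(calls_per_day "$1" "$2")" -le "$DAILY_LIMIT" ]
}
```

For example, calls_per_day 5 600 (five items, polled every 10 minutes) prints 720, while the same five items polled every 60 seconds would need 7200 calls and exceed the limit.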

The Zabbix API will not authorize your call unless you insert the following line into your Apache vhost configuration. This is a more generic Zabbix API issue that is not related to Aruba Central.

SetEnvIf Authorization "(.*)" HTTP_AUTHORIZATION=$1

If Aruba Central has a maintenance activity, then the token refresh chain could break. Running the token request script once should address the issue.

Summary

Aruba Central’s API is pretty decent, but if you start from zero it could take a while to get to the end of it. With this guide, my intention was to speed you up, but please do not consider my scripts and the shown example as the only or best possible way – I’m just hoping it can give you a good base for your own solution. Have fun!

The post Aruba Central API Monitoring with Zabbix appeared first on Zabbix Blog.

Optimize latency-sensitive workloads with Amazon EC2 detailed NVMe statistics

Post Syndicated from Sanjeev Malladi original https://aws.amazon.com/blogs/compute/optimize-latency-sensitive-workloads-with-amazon-ec2-detailed-nvme-statistics/

Amazon Elastic Compute Cloud (Amazon EC2) instances with locally attached NVMe storage can provide the performance needed for workloads demanding ultra-low latency and high I/O throughput. High-performance workloads, from high-frequency trading applications and in-memory databases to real-time analytics engines and AI/ML inference, need comprehensive performance tracking. Operating system tools like iostat and sar provide valuable system-level insights, and Amazon CloudWatch offers important disk IOPS and throughput measurements, but high-performance workloads can benefit from even more detailed visibility into instance store performance.

For latency-sensitive applications where every millisecond counts, enhanced performance monitoring tools provide deep visibility into storage systems, so your teams can track and analyze behavior at a 1 second granularity. This detailed insight can help your organization detect bottlenecks quickly, fine-tune application performance, and deliver reliable service.

In this post, we discuss how you can use Amazon EC2 detailed performance statistics for instance store NVMe volumes, a set of new metrics that provide per-second granularity, to provide real-time visibility into your locally attached storage performance. These statistics are similar to the Amazon EBS detailed performance statistics, providing a consistent monitoring experience across both storage types. You can access these statistics directly from your NVMe devices attached to the Amazon EC2 instance using nvme-cli or using CloudWatch agent to monitor I/O performance at the storage level. We also provide examples of how to use these statistics to identify performance bottlenecks.

Feature overview

Amazon EC2 Nitro-based instances with locally attached NVMe instance storage now offer 11 comprehensive metrics at per-second granularity. These metrics, similar to EBS volume metrics, include queue length measurements, IOPS, throughput data, and IO latency histograms for the locally attached NVMe instance storage. Additionally, they also include IO size-specific latency histograms to provide even more detailed insights into performance patterns of the local NVMe instance storage. These metrics are collected and presented separately for each individual NVMe volume available on an instance.

The statistics are presented in three main formats:

    1. Cumulative counters that track IO operations, throughput, and read/write times
    2. Real-time queue length, displaying the current value at the time of your query
    3. Latency histograms visualizing the distribution of IO operations across different latency ranges by displaying both cumulative view and IO size-specific distributions

Prerequisites

To access detailed performance statistics for local instance storage, complete the following steps:

    1. Launch a new Amazon EC2 Nitro instance or use an existing one, then connect to it using SSH or your preferred connection method.
    2. Identify the NVMe device associated with the local storage to query for the performance statistics. For example, you can run the nvme-cli command in the CLI to output all NVMe devices on the instance.
      $ sudo nvme list

      The following is an example output of the list command that lists the NVMe devices on the instance and their volume Serial Numbers (SN; masked in the below output for privacy). In this demonstration, consider that the local storage used by your application is /dev/nvme1n1.

      Terminal output showing five NVMe devices: one EBS volume and four EC2 instance storage volumes with 3.75TB capacity each

    3. If you are using Amazon Linux 2023 version 2023.8.20250915 (or later) or Amazon Linux 2 2.0.20251014.0 (or later) you can proceed to Step 4 because nvme-cli will use the latest version. If you are using an earlier Amazon Linux version, update the nvme-cli using the following command, where 2023.8.20250915 can be replaced with the latest Amazon Linux 2023 version:
      $ sudo dnf upgrade --releasever=2023.8.20250915
    4. Run the nvme-cli, with the correct permissions, and pass the device as a parameter. You can use --help to get details on the command usage:
      $ sudo nvme amzn stats --help

      Example output:
      Command help output for 'nvme amzn stats' showing usage syntax and format options
      If you prefer output in a JSON format, you can provide the -o json parameter to the command.

      $ sudo nvme amzn stats /dev/nvme1n1 -o json

      The following output (without the -o json parameter) shows cumulative read/write operations, read/write bytes, total processing time (read and write, in microseconds), and the duration (in microseconds) for which the application attempted to exceed the instance’s IOPS/throughput limits.
      Storage performance metrics showing read operations count, total bytes, and timing statistics for an EC2 NVMe volume
      It also displays read/write I/O latency histograms, with each row representing completed I/O operations within a specific bin of time (in microseconds).
      Read latency distribution histogram showing operation counts across different microsecond ranges, with peak activity in 2048-4096 range
      Write latency distribution histogram showing zero operations across all time ranges, indicating no write activity
      If you want to view the latency histograms across 5 different IO bands, (0, 512 Byte], (512B, 4KiB], (4KiB, 8KiB], (8KiB, 32KiB], (32KiB, MAX], you can provide the --details or -d parameter to the command:

      $ sudo nvme amzn stats -d /dev/nvme1n1

      The following image is an excerpt of the above command’s output, showing the additional latency histograms (read and write) of the 5 different IO bands.
      Dual read/write I/O latency histogram analyzing small block operations from 0-512 bytes with peak at 4096-8192 range
      Performance analysis histogram showing I/O patterns for 512-4K blocks with significant activity in 512-1024 range
      Dual histogram showing I/O latency patterns for 4K-8K block operations with concentrated activity at 4096-8192
      Performance analysis histogram displaying I/O patterns for 8K-32K blocks with peak activity in 4096-8192 range
      Comprehensive I/O latency histogram analyzing largest block sizes from 32K to maximum with concentrated activity in 4096-8192
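
To reduce a latency histogram to a single number, you can locate the bin that contains a given percentile. The following is a minimal sketch and not part of nvme-cli. It assumes you have already reduced the histogram to rows of lower bound, upper bound, and count (the literal command output is laid out differently), and percentile_bin is a hypothetical helper name.

```shell
#!/bin/bash
# Find the latency bin that contains a given percentile.
# Input rows on stdin are assumed to be "lower_us upper_us count", one bin per
# line, sorted by latency. This is NOT the literal nvme-cli output layout.
# $1 = percentile, for example 99
percentile_bin() {
    local pct=$1
    awk -v pct="$pct" '
        BEGIN { n = 0; total = 0; seen = 0 }
        NF == 3 { lo[n] = $1; hi[n] = $2; cnt[n] = $3; total += $3; n++ }
        END {
            if (total == 0) { print "no I/O recorded"; exit }
            target = total * pct / 100
            # Walk the bins until the running count reaches the target.
            for (i = 0; i < n; i++) {
                seen += cnt[i]
                if (seen >= target) { printf "%s-%s us\n", lo[i], hi[i]; exit }
            }
        }'
}
```

Because the counts are cumulative since the counters started, subtract the counts of an earlier sample from a later one per bin if you want the percentile for a specific interval.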

You can run the stats command at a per second granularity. You can also write scripts to pull the stats at a desired interval (every second or any other duration) with each subsequent output reflecting the updated cumulative totals for the metrics. Calculating the difference in the statistics across the last two outputs allows you to derive insight into the instance storage profile during the interval. Below is a sample script you can use to pull the stats at a default interval of 1 second or at your desired interval.

#!/bin/bash
# Sampling interval in seconds (defaults to 1)
INTERVAL=${1:-1}
while true; do
    echo "=== $(date) ==="
    # Stop the loop if the stats command fails
    sudo nvme amzn stats /dev/nvme1n1 || break
    echo
    sleep "$INTERVAL"
done

You can save this script, make it executable and run it at either the default 1-second interval or provide a custom interval when executing the script. For example, if you saved the script as nvme_stats.sh, you could use the following commands to make it executable and run to get the output at the default 1-second interval (assuming you are in the same directory as that of nvme_stats.sh).

chmod +x nvme_stats.sh
./nvme_stats.sh

If, for instance, you want to get the output every 5 seconds, you can use the command below (after making the script executable):

./nvme_stats.sh 5
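
Because the counters are cumulative, a rate such as IOPS or throughput comes from the difference between two samples divided by the time between them. The helper below is a minimal sketch of that calculation. The name rate_per_second is hypothetical, and you supply the counter values yourself from two consecutive outputs.

```shell
#!/bin/bash
# Turn two samples of a cumulative counter into a per-second rate.
# $1 = earlier counter value, $2 = later counter value,
# $3 = seconds between the two samples
rate_per_second() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
        # A smaller later value means the counter was reset between samples.
        if (s <= 0 || b < a) { print "n/a"; exit }
        printf "%.1f\n", (b - a) / s
    }'
}
```

For example, if the cumulative read operations grow from 120000 to 180000 across a 5-second interval, rate_per_second 120000 180000 5 prints 12000.0, which is the average read IOPS for that interval. The same function applies to the byte counters for throughput.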

You can also integrate with CloudWatch using CloudWatch agent to collect and publish these statistics for historical tracking, trend visualization through dashboards, and performance-based alerts to correlate with application metrics and automated notifications for performance issues.

Deriving insights from the Amazon EC2 instance store NVMe detailed performance statistics

Similar to EBS detailed performance statistics, you can use Amazon EC2 instance store NVMe statistics to troubleshoot various workload performance issues. As mentioned in the preceding section, you can also use the detailed statistics to view I/O latency histograms to observe the spread of I/O latency within the period. You can use the read/write operations and time spent statistics to calculate the average latency. The detailed statistics show the average latency at per-second granularity.
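
The average latency calculation can be sketched as follows: divide the change in cumulative time by the change in cumulative operations between two samples. The function name avg_latency_us is hypothetical, and the values are ones you read from two consecutive outputs.

```shell
#!/bin/bash
# Average latency over an interval, derived from cumulative counters.
# $1, $2 = operation count at the start and end of the interval
# $3, $4 = cumulative time in microseconds at the start and end
avg_latency_us() {
    awk -v o1="$1" -v o2="$2" -v t1="$3" -v t2="$4" 'BEGIN {
        ops = o2 - o1
        if (ops <= 0) { print "n/a"; exit }   # no I/O completed in the interval
        printf "%.1f\n", (t2 - t1) / ops
    }'
}
```

With 2,000 reads completed in the interval and 80,000 microseconds of added read time, avg_latency_us 1000 3000 50000 130000 prints 40.0, an average read latency of 40 microseconds.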

The next two example scenarios demonstrate key performance analysis using the statistics. In Scenario 1, we will use the EC2 Instance Local Storage Performance Exceeded (us) metric to check if I/O demands exceed instance storage capabilities, helping with instance right-sizing for sufficient I/O application performance. In Scenario 2, we will use IO-size specific histograms (using --details) to diagnose how large block writes affect subsequent read performance – an issue typically hidden by traditional monitoring tools’ aggregated metrics across all IO sizes.

Scenario 1: Identifying when applications exceed instance storage performance limits

Understanding whether your application’s I/O demands exceed your instance store volumes’ capabilities is important for performance troubleshooting. When applications generate I/O workloads that consistently attempt to exceed the IOPS and throughput limits of specific Amazon EC2 instance types, you’ll experience increased latency and degraded performance. The EC2 Instance Local Storage Performance Exceeded (us) metric helps identify these scenarios by showing the duration (in microseconds) when workloads exceeded supported instance performance. A non-zero value or increasing count between snapshots indicates your current instance size or type may not provide sufficient I/O performance for your application.

The following section shows how to identify if an application is sending more IOPS than the instance’s local storage can support.

The example scenario: An application on an i3en.xlarge instance shows elevated write latency of >1ms. You want to determine if the application’s workload is exceeding the instance’s NVMe volume supported performance.

    1. Select the Instance Storage NVMe device you want to analyze – Identify the instance you want to analyze for the application experiencing elevated latency.
    2. Identify the NVMe device – Use the following nvme-cli command, and identify the NVMe device associated with that instance storage.
      $ sudo nvme list

      Example scenario: We used the list command and identified /dev/nvme1n1 as the NVMe device associated with the i3en.xlarge instance running the application that is currently seeing elevated write latency of >1ms (while read latency is <50us, as in normal conditions), so now we want to analyze it.

    3. Collect statistics for the device at a single point in time or at desired intervals – Collect the detailed performance statistics using the nvme-cli command or use the sample script provided in previous section to capture statistics at the desired intervals, if needed.
      $ sudo nvme amzn stats /dev/nvme1n1

      Example scenario: We choose to collect the statistics only once after noticing elevated write latency of the application.

    4. Analyze the statistics to check if the application demands more than the supported performance of the instance storage – Confirm existence of overall I/O latency degradation by comparing two sets of read/write I/O latency histograms taken some time apart. Example scenario: The following output shows the Read IO histogram of the NVMe local instance storage taken 40 seconds apart with no read IO latency issues (as normal read latency for this workload is < 50 us).

      Metric captured at time T:
      AWS EC2 storage performance histogram showing read latency distribution, peak at 16-32 microsecond bucket
      Metric captured at time T+40s:
      AWS EC2 storage performance data showing increased read latency concentration in 16-32 microsecond bucket
      The following output shows Write IO histogram taken 40 seconds apart. We can discern that many write IOs fall into the 1ms – 2ms latency range, which is not expected for this application.
      Metric captured at time T:
      AWS EC2 storage write performance data showing majority of operations between 1-2ms latency
      Metric captured at time T+40s:
      AWS EC2 storage performance metrics showing increased write operations clustered in 1-2ms latency range

    5. Analyze the EC2 Instance Local Storage Performance Exceeded (us) metric, which shows the total time (in microseconds) for which IOPS requests exceeded volume limits. Ideally, the incremental count of this metric between two snapshot times should be minimal, as any value above 0 indicates that the workload demanded more IOPS than the volume could deliver. Example scenario: Comparing metrics 40 seconds apart shows that for more than 34 seconds, the application’s IOPS demands surpassed the IOPS supported by the local instance storage. This explains the elevated write latency: operations in excess of what the underlying storage can physically handle are queued, increasing wait times. This indicates that the i3en.xlarge instance chosen to run this application cannot meet the application’s performance requirements, suggesting either upgrading to a larger instance size or re-evaluating the instance type itself.
      Metric captured at time T:
      EC2 Instance Local Storage Performance exceeded output of nvme-cli for the described scenario at time T
      Metric captured at time T+40s:
      EC2 Instance Local Storage Performance exceeded output of nvme-cli for the described scenario at time T+40 with increased count of metric
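
To put the Performance Exceeded metric into perspective, you can express its increase as a share of the sampling interval. The following is a minimal sketch. The function name exceeded_percent is hypothetical, and the numbers in the usage note are illustrative values in the spirit of the scenario, not the actual output.

```shell
#!/bin/bash
# Share of an interval during which the workload exceeded the supported performance.
# $1, $2 = "Performance Exceeded (us)" value at the start and end of the interval
# $3     = length of the interval in seconds
exceeded_percent() {
    awk -v e1="$1" -v e2="$2" -v s="$3" 'BEGIN {
        if (s <= 0) { print "n/a"; exit }
        # Convert the interval to microseconds before dividing.
        printf "%.1f\n", (e2 - e1) / (s * 1000000) * 100
    }'
}
```

If the metric grows by 34,000,000 microseconds across a 40-second interval, exceeded_percent 1000000 35000000 40 prints 85.0, meaning the application was above the supported performance for 85% of that interval.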

It’s important to have the right instance size to avoid performance bottlenecks to your application. Refer to the Amazon EC2 instance documentation for more information on the different instances and their storage size.

Scenario 2: Identifying the block size causing elevated latency in your applications

Many storage performance issues arise from complex interactions between read and write operations with different I/O sizes, which traditional system-level monitoring tools like iostat or sar cannot effectively diagnose due to their aggregated metrics across all I/O sizes. EC2 instance store NVMe detailed performance statistics solve this by providing I/O-size specific latency histograms through the --details option in NVMe CLI. These histograms show latency data for different I/O size ranges: (0, 512 Byte], (512B, 4KiB], (4KiB, 8KiB], (8KiB, 32KiB], (32KiB, MAX], for a more precise correlation between application workload patterns and I/O size-specific latency metrics for targeted optimizations.
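
When you correlate application behavior with these histograms, it helps to know which band a given I/O size lands in. The sketch below maps a size in bytes to the bands listed above, treating the upper bound of each band as inclusive, as the interval notation indicates. The function name io_band is hypothetical.

```shell
#!/bin/bash
# Map an I/O size in bytes to the size band used by the detailed histograms.
# Upper bounds are inclusive, following the (lower, upper] notation.
io_band() {
    local bytes=$1
    if   [ "$bytes" -le 0 ];     then echo "invalid"
    elif [ "$bytes" -le 512 ];   then echo "(0, 512B]"
    elif [ "$bytes" -le 4096 ];  then echo "(512B, 4KiB]"
    elif [ "$bytes" -le 8192 ];  then echo "(4KiB, 8KiB]"
    elif [ "$bytes" -le 32768 ]; then echo "(8KiB, 32KiB]"
    else                              echo "(32KiB, MAX]"
    fi
}
```

Note that with inclusive upper bounds, an I/O of exactly 32 KiB (io_band 32768) is reported in the (8KiB, 32KiB] band, while anything larger falls into (32KiB, MAX].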

In this example scenario, your application performs small reads (typically <=4KiB, like metadata read) followed by large writes (>=32KiB) and shows unexpectedly high read latency. This common issue occurs when large writes impact subsequent read operations’ performance, creating a cascading effect on overall I/O performance.

    1. Gather read and write IO latency by size ranges – Use the NVMe CLI with the --details option to gather read and write IO latency by size ranges:
      $ sudo nvme amzn stats /dev/nvme1n1 --details

    2. Confirm existence of overall IO latency degradation – In the example scenario, examining overall IO latency, both read (left) and write (right) operations are showing higher than expected latency.
      NVMe storage read latency histogram highlighting concentrated IO operations in 4K-16K microsecond range
      NVMe storage write latency histogram highlighting concentrated IO operations in 8-32K microsecond range
    3. Examine the output for patterns across different IO size bands – Analyzing latency by operation sizes shows small read operations (512 bytes to 4K), typically fast, are experiencing unexpected latency spikes while large writes (32K+) show significant delays. Small reads should theoretically maintain good performance regardless of other I/O activities.
      NVMe storage read/write latency histogram highlighting concentrated IO operations in 8-16K microsecond range for IO band of 512 - 4KNVMe storage read/write latency histogram highlighting concentrated IO operations in 8-16K microsecond range in IO band 32K and above
      The observed pattern indicates that the backed-up large write operations create system-wide congestion, affecting I/O operations of all types and sizes. Despite the storage system’s capability to handle small reads efficiently, the queued large writes slow down both read and write operations at the application level.
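To compare size bands numerically rather than by eye, a percentile can be estimated from each band's histogram. The following is a minimal sketch only: the bucket layout, the field names (`upperUs`, `count`), and the sample numbers are assumptions made for illustration and do not reflect the actual output format of the NVMe CLI.

```javascript
// Hypothetical histogram for one I/O size band. Each bucket counts the I/Os
// whose latency was at or below `upperUs` microseconds and above the previous
// bucket's bound. The numbers are invented for illustration.
var smallReads = [
    { upperUs: 128, count: 50 },
    { upperUs: 1024, count: 150 },
    { upperUs: 8192, count: 300 },
    { upperUs: 16384, count: 500 }
];

// Return the upper bound of the bucket in which the requested percentile falls,
// or null when the band recorded no I/O at all.
function percentileUpperBound(buckets, percentile) {
    var total = 0;
    for (var i = 0; i < buckets.length; i++) {
        total += buckets[i].count;
    }
    if (total === 0) {
        return null;
    }
    var target = total * percentile / 100;
    var seen = 0;
    for (var j = 0; j < buckets.length; j++) {
        seen += buckets[j].count;
        if (seen >= target) {
            return buckets[j].upperUs;
        }
    }
    return buckets[buckets.length - 1].upperUs;
}

var p50 = percentileUpperBound(smallReads, 50);
var p99 = percentileUpperBound(smallReads, 99);
```

Running the same function over the small-read band and the large-write band makes it easier to see whether both bands shift toward higher latency buckets at the same time, which is the congestion pattern described above.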

Based on this analysis, we can implement several targeted optimizations to the application, like using smaller block sizes for write operations when possible, or batching smaller writes instead of performing large single writes.

Clean up

If you created an Amazon EC2 instance with NVMe volume for this exercise, then terminate and delete the appropriate instance to avoid future costs.

Conclusion

Amazon EC2 detailed performance statistics for instance store NVMe volumes provide real-time, sub-minute storage performance monitoring, similar to the detailed performance statistics available for Amazon EBS volumes. This offers a consistent monitoring experience across both storage types, with additional IO-size-based latency histograms for instance storage that support better optimization of I/O patterns and more effective troubleshooting.

To learn more about Amazon EC2 instance store NVMe volumes, optimization techniques for latency-sensitive workloads or other Amazon EC2 related topics, visit the Amazon EC2 documentation page or explore our other AWS Storage Blog posts on performance optimization.

We’d love to hear how you’re using these statistics to enhance your workloads, or if you have any questions, in the comments section below.

Improving Customer Satisfaction and Experience with Zabbix

Post Syndicated from Michael Kammer original https://blog.zabbix.com/improving-customer-satisfaction-and-experience-with-zabbix/31692/

No matter what business you’re in, there is one universal truth – your success or failure depends on customer satisfaction and trust. And when your IT systems fail, it’s your customers who pay the price. Being unable to place an order due to unexpected downtime (which can cost a large organization as much as $9,000 per minute) or having their credit card data compromised in a preventable security breach (which costs the average organization nearly $5 million) will force even your most loyal customers to go somewhere else.

Monitoring with Zabbix doesn’t just keep your infrastructure safe, it keeps your reputation safe and makes sure that your customers continue to be your customers. It does this by guaranteeing the performance, reliability, and security of your digital services – while also supporting better customer service and continuous improvement. Keep reading to see how it’s possible.

Say goodbye to downtime

Your customers are looking to meet their needs quickly and effectively. Unexpected service disruptions cause them to feel neglected and force them to look elsewhere for solutions.
Monitoring your infrastructure with Zabbix can effectively eliminate downtime through proactive issue detection, which locates anomalies and performance issues like high CPU usage, packet loss, and latency in real time – before they have a chance to make life harder for customers.

If an issue does occur, Zabbix’s predictive alerting capabilities let your tech teams know about anything that could potentially impact an application or service. This helps them meet SLAs and provide a better, more reliable customer experience with fewer service disruptions, which in turn leads to higher levels of trust and satisfaction.

Outperform your competitors

No matter how good your products or services happen to be, you still need to provide a smooth and fast online user experience if you want repeat use and positive reviews. Monitoring with Zabbix optimizes network traffic by helping you to identify bandwidth bottlenecks or misconfigured devices with a single glance at a dashboard, allowing better traffic management and a better online experience for customers.

It also improves response times, which allows you to be confident that your applications and services remain responsive. This is especially important for real-time services like video conferencing, e-commerce, or customer support.

Turn good customer service into outstanding customer service

What turns a casual, one-time user into a repeat customer? In most cases, it all comes down to making that user feel seen, informed, and supported. Zabbix helps you maintain consistent system performance, and nothing builds trust like stability.

With a bit of configuration and the help of IT service management tools like ServiceNow, Zabbix can provide clear, easy-to-access logs and metrics that help your customer service reps better understand your customers and the process of serving them, including:

• Customer satisfaction (CSAT)
• Preferred communication channel
• Average ticket count
• Average response time
• Average ticket resolution time
• Ticket resolution rate
• Ticket backlog
• Interactions per ticket

With this information, your team will be able to communicate proactively when issues happen, giving customers accurate information about the issue and the expected resolution time.

Keep your customers safe from cyber threats

The consequences of a data breach are deep and far-reaching, and they include financial losses, reputational damage, legal troubles, regulatory fines, and a loss of customer trust. Despite a greater emphasis on data security, hackers are constantly finding new ways to gain access to valuable corporate data and credentials by combining next-generation AI technologies with long-established tools.

Monitoring with Zabbix gives IT and security teams the visibility and early warning systems they need to spot and react to potential threats. Zabbix continuously monitors systems, networks, and applications for predefined thresholds and anomalies, identifying possible network intrusions or misconfigurations and notifying the relevant security stakeholders.

On top of that, Zabbix can monitor any existing security tools your team runs, tracking antivirus software, firewalls, IDS/IPS tools, and endpoint protection solutions to make sure they are functioning properly and running the latest versions. It can also integrate with SIEM systems (like Splunk, ELK, or Wazuh) as well as custom scripts in order to provide extended security analytics.

Meet (and exceed) your SLAs

Service Level Agreements (SLAs) are a framework for managing the expectations of both customers and businesses. They define agreed-on standards of service, but tracking them is more than just a way to measure compliance – it’s a tool that you can use to improve your overall service delivery and operations.

With Zabbix, you can monitor any quantifiable metric that’s relevant to your SLAs, such as system uptime/downtime, response time, the availability of web services, databases, or network devices, transaction success and failure rates, and much more. In addition, Zabbix can use real-time data and built-in SLA calculation to automatically calculate current SLA compliance and send an alert if an SLA is at risk of being breached, by using triggers based on thresholds.
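The arithmetic behind SLA compliance is simple enough to sketch. The example below is not Zabbix's built-in SLA calculation; it is a minimal illustration with hypothetical function names, assuming downtime is already known in seconds for the reporting period.

```javascript
// Availability over a reporting period, as a percentage.
function availabilityPercent(periodSeconds, downtimeSeconds) {
    if (periodSeconds <= 0) {
        throw new Error("period must be positive");
    }
    var uptime = Math.max(periodSeconds - downtimeSeconds, 0);
    return uptime / periodSeconds * 100;
}

// Downtime still allowed before the agreed objective is breached.
// A negative result means the objective has already been missed.
function remainingErrorBudgetSeconds(periodSeconds, downtimeSeconds, objectivePercent) {
    var allowed = periodSeconds * (100 - objectivePercent) / 100;
    return allowed - downtimeSeconds;
}

// Example: a 30-day period with 30 minutes of downtime and a 99.9% objective.
var thirtyDays = 30 * 24 * 60 * 60;
var availability = availabilityPercent(thirtyDays, 1800);
var budgetLeft = remainingErrorBudgetSeconds(thirtyDays, 1800, 99.9);
```

A 30-day period with a 99.9% objective allows roughly 2,592 seconds (about 43 minutes) of downtime, so an alert threshold on the remaining budget gives advance warning before a breach.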

If you’d rather track the metrics on your own, no problem – by using Zabbix dashboards, you can visualize SLA compliance in real-time, with the dashboards showing availability percentages, event timelines, and breach summaries, while giving you easy-to-understand views of service health. The result is better products and services that are aligned with customer expectations.

Build a continuous improvement culture

When it’s time to roll out a new feature or upgrade, you naturally want to have ALL the necessary data at your fingertips. Monitoring usage patterns and performance metrics with Zabbix not only gives you advanced visualizations (forecasting, capacity planning insights, etc.) but can also highlight cases where data analysis led to tangible improvements.

Want more input from customers and users? Zabbix can make sure that the improvements to your product are community-driven by giving you the data you need to run regular user surveys and forums to gather product feedback. It can even help you publish a public roadmap with transparent prioritization based on community input.

Conclusion

Customer satisfaction is about a lot more than just good service – it’s also about consistency, reliability, and transparency. Zabbix empowers businesses to deliver all three by providing a comprehensive, proactive, and scalable monitoring solution.

That’s why customers in verticals as diverse as aerospace and education turn to Zabbix to keep them informed about what’s working – and what isn’t. By integrating Zabbix into your IT operations, you’re not just improving system performance – you’re actively investing in customer satisfaction and loyalty.

Find out more about what Zabbix can do for you and your customers by taking a look at real-world case studies from companies like yours.

The post Improving Customer Satisfaction and Experience with Zabbix appeared first on Zabbix Blog.

Creating a Community-Driven Zabbix Book

Post Syndicated from Zane Lasmane original https://blog.zabbix.com/creating-a-community-driven-zabbix-book/31688/

At the recent Zabbix Summit community meeting, participants gathered to discuss an exciting initiative – the creation of the first-ever community-driven Zabbix book. While several books about Zabbix have been published in the past (often written by individual authors over a decade ago), this project marks a new milestone. For the first time, Zabbix community members from around the world are coming together to co-author a book, share their expertise, and tell the Zabbix story from many perspectives.

What is the Zabbix Book?

The project, hosted at thezabbixbook.com, is an open, collaborative effort led by Nathan Liefting and Patrik Uytterhoeven from Opensource ICT Solutions B.V. The goal is to create a community-built guide to Zabbix, written by users, for users. As Zabbix trainers, Patrik and Nathan have both been long-time (don’t want to say old) contributors to the Zabbix community, authoring multiple books and blog posts.

The Zabbix Book will cover topics ranging from cloud templates and infrastructure monitoring to host triggers, Zabbix internals, SNMP, low-level discovery, multi-factor authentication, and much more. Each contributor can choose a specific chapter or topic that matches their expertise, making it a truly collective and flexible effort.

The content is managed on GitHub, written in Markdown, and follows open contribution principles. The aim is to complete the main foundation of the book alongside the release of Zabbix 8.0 LTS (expected in 2026, Q1/Q2), with an update to include new 8.0 features approximately a month later.

Why write a Zabbix Book when documentation exists?

While the official Zabbix documentation remains the primary source for technical accuracy, the Zabbix Book serves as an alternative and more narrative approach to learning, created by everyday Zabbix users. It’s designed to introduce new users to Zabbix through practical examples, real-world use cases, and community wisdom – making it easier for newcomers to connect the dots.

How the community works together

During the Summit breakout session, the group discussed:

• The current project status and foundational setup
• How contributions are managed — commits, rules, and legal aspects
• Missing topics and a call for more writers, editors, and translators
• Ideas for practical information and real-world examples (like JMX, SNMP, etc.)
• Donations and funding goals, including ideas for supporting open-source projects, good causes, or new Zabbix community features

The project embraces an open, democratic spirit – anyone can contribute, vote, or help improve the book’s structure, content, and readability. The Zabbix Book is created by the Monitoring Penmasters Foundation, which was created in order to make it a real community project – all the intellectual rights belong to the foundation itself, and when revenue is generated there will be a vote on where to donate the money.

Currently, the Monitoring Penmasters Foundation consists of Patrik, Nathan, and Zabbix CEO and Founder Alexei Vladishev, who is involved in the book’s review and has agreed to contribute to some parts of the book while allocating design resources from Zabbix itself.

The project has also gotten a big assist from Brian van Baekel of Opensource ICT Solutions, a dedicated community member and certified Zabbix trainer who has given his fair share of presentations and written extensively about Zabbix and its capabilities.

Get involved

If you’d like to contribute, share your expertise, or simply follow the book’s progress, visit thezabbixbook.com to explore the current chapters and learn how to join the project. The project’s digital chapters are available to everyone, and while the writing and printing are still in progress, we hope to see finalized online and printed versions in spring 2026.

It’s also worth remembering that even though the book is free to download and use, the creators do have costs and financial contributions are welcome – you can chip in here.

Together, we’re not just writing a book — we’re writing a piece of Zabbix community history!

The post Creating a Community-Driven Zabbix Book appeared first on Zabbix Blog.

Monitoring a Starlink Dish with Zabbix

Post Syndicated from Alexander Petrov-Gavrilov original https://blog.zabbix.com/monitoring-a-starlink-dish-with-zabbix/31543/

Did you realize that you can monitor a Starlink dish using just Zabbix? The idea (or rather the need) to use Starlink came to me almost as soon as I moved to a fairly rural area. Local internet providers have not yet “provided” fiberoptic or stable mobile connectivity to places like this, and while searching for a solution I accidentally discovered that Starlink was already providing service to some local companies. As I later found out, they also offered service in my area for residential customers.

To make a long story short, since internet access is crucial in the IT field, I decided to acquire and then monitor my very own Starlink dish. At first, this proved challenging because regular user data access is quite limited. However, thanks to Zabbix browser monitoring, I managed to solve it fairly easily. In this post I will share my solution with you, including the template.

Monitoring configuration

First, you need to make sure you have Zabbix installed (either a Zabbix proxy or server) on the same network that the Starlink dish and router are on. The next step is to configure Zabbix for browser monitoring.

WebDriver installation
# podman run --name webdriver -d \
-p 4444:4444 \
-p 7900:7900 \
--shm-size="2g" \
--restart=always docker.io/selenium/standalone-chrome:latest

Port 4444 will be the port on which the WebDriver will be listening, and port 7900 will be used by NoVNC, which allows us to observe browser behavior in case a browser with a GUI is used.
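Before pointing Zabbix at the container, it can help to confirm that the WebDriver is ready. The W3C WebDriver protocol defines a status endpoint (GET /status) whose response reports readiness under `value.ready`. The sketch below only interprets such a response body; the function name and the sample payload are illustrative assumptions, not part of the original setup.

```javascript
// Decide whether a WebDriver /status response body reports readiness.
// Returns false for malformed JSON or for any payload without value.ready === true.
function isWebDriverReady(statusBody) {
    var parsed;
    try {
        parsed = JSON.parse(statusBody);
    } catch (e) {
        return false;
    }
    return Boolean(parsed && parsed.value && parsed.value.ready === true);
}

// Hypothetical response body, shaped as the WebDriver specification describes.
var sampleBody = '{"value":{"ready":true,"message":"ready"}}';
var ready = isWebDriverReady(sampleBody);
```

The body could be fetched with, for example, `curl http://localhost:4444/status`, assuming the port mapping shown above.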

Zabbix server/proxy configuration

After WebDriver is installed, we need to set up the communication between Zabbix and the driver. This can be done by editing the Zabbix server/proxy configuration file and updating the following parameters:

### Option: WebDriverURL 
# WebDriver interface HTTP[S] URL. For example http://localhost:4444 used with 
# Selenium WebDriver standalone server. 
# 
# WebDriverURL= 
WebDriverURL=http://localhost:4444 
### Option: StartBrowserPollers 
# Number of pre-forked instances of browser item pollers. 
# 
# Range: 0-1000 
# StartBrowserPollers=1 
StartBrowserPollers=5

With the configuration parameters in place, restart the Zabbix server/proxy to apply the changes:

systemctl restart zabbix-server

Creating a host

First, we need to navigate to the “Data collection” > “Hosts” section and create a host that represents our Starlink dish. The host in my example will look like this:

Starlink dish host

The host also has a user macro:

{$LINK} with value: http://webapp.starlink.com to point to the correct Starlink dish web app:

Link macro

Creating a browser item

We will now configure our browser item to collect and monitor the list of metrics exposed in the Starlink browser app:

Starlink browser item

We are using the bare minimum here, so make sure the update intervals are as frequent as you need. However, I would not recommend updating it more frequently than every 5 minutes. It’s also not a good idea to store the history, since it is already stored through dependent items.

The most important part of the item is the script itself:

var browser, result;
var opts = Browser.chromeOptions();

opts.capabilities.alwaysMatch['goog:chromeOptions'].args = [];
browser = new Browser(opts);
browser.setScreenSize(Number(1980), Number(1020));

try {
    var params = JSON.parse(value);
    browser.navigate(params.url);

    // Wait for the dish to report status
    Zabbix.sleep(2000);

    // Find the JSON text element(s)
    var jsonElements = browser.findElements("xpath", "//div[@id='root']/div[@class='App']/div[@class='Main']/div[2]/div[@class='Section'][2]/pre[@class='Json-Format']/div[@class='Json-Text']");
    var extractedData = [];

    for (var i = 0; i < jsonElements.length; i++) {
        var text = jsonElements[i].getText();

        // Try parsing JSON
        try {
            extractedData.push(JSON.parse(text));
        } catch (e) {
            // If not valid JSON, include raw text instead
            extractedData.push({ raw: text, error: "Invalid JSON format" });
        }
    }

    // Collect result 
    result = browser.getResult();

    // Replace with parsed JSON data
    result.extractedJsonData = extractedData.length === 1 ? extractedData[0] : extractedData;

}
catch (err) {
    if (!(err instanceof BrowserError)) {
        browser.setError(err.message);
    }
    result = browser.getResult();
}
finally {
    // Return a clean JSON object
    return JSON.stringify(result.extractedJsonData);
}

So what does this script do? It opens the Starlink web app, waits for the Starlink dish to output all the status data, and, after a bit of parsing, returns the data highlighted in the screenshot:
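The parsing step can be tried outside Zabbix, without a browser. The sketch below reproduces only the "parse, or fall back to raw text" logic from the script above; the function name `parseJsonTexts` and the sample inputs are hypothetical and stand in for the text of the elements that the XPath query returns.

```javascript
// Parse each text as JSON; keep the raw text with an error marker when parsing fails.
// A single result is returned unwrapped, matching the behavior of the item script.
function parseJsonTexts(texts) {
    var extracted = [];
    for (var i = 0; i < texts.length; i++) {
        try {
            extracted.push(JSON.parse(texts[i]));
        } catch (e) {
            extracted.push({ raw: texts[i], error: "Invalid JSON format" });
        }
    }
    return extracted.length === 1 ? extracted[0] : extracted;
}

// Invented inputs: one valid JSON document and one that is still loading.
var single = parseJsonTexts(['{"alerts":{"obstructed":false}}']);
var mixed = parseJsonTexts(['{"alerts":{"obstructed":false}}', 'Loading...']);
```

Because a single match is returned as a plain object and several matches as an array, JSONPath expressions in dependent items need to match whichever shape the page actually produces.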

Starlink dish diagnostic data

Now we can click on the three dots on the left of our newly created item in the items page and proceed to create dependent items for each value we are interested in!

Creating dependent items

Now we just click here:

As an example, to create an item that monitors the hardware version we can create an item like this:

Hardware version dependent item

With JSONPath preprocessing:

Hardware version item preprocessing

In the end we get the data in Zabbix:

Starlink dish hardware version

All other items (except alerts) will follow the same logic – just update the item name, key, and JSONPath in preprocessing to extract the required values.

Creating dependent LLD item prototypes

To automate the creation of alert items, we can create a dependent discovery rule. In the “Discovery” section, create a new discovery rule:

Starlink dish alerts discovery

With preprocessing using JavaScript:

var data = JSON.parse(value);
var alerts = data.alerts;
var lld = [];

for (var key in alerts) {
    if (alerts.hasOwnProperty(key)) {
        lld.push({
            "{#ALERT}": key
        });
    }
}

return JSON.stringify({ data: lld });

This will provide us with the following JSON data:

{
  "data": [
    {
      "{#ALERT}": "dishIsHeating"
    },
    {
      "{#ALERT}": "dishThermalThrottle"
    },
    {
      "{#ALERT}": "dishThermalShutdown"
    },
    {
      "{#ALERT}": "powerSupplyThermalThrottle"
    },
    {
      "{#ALERT}": "motorsStuck"
    },
    {
      "{#ALERT}": "mastNotNearVertical"
    },
    {
      "{#ALERT}": "slowEthernetSpeeds"
    },
    {
      "{#ALERT}": "softwareInstallPending"
    },
    {
      "{#ALERT}": "movingTooFastForPolicy"
    },
    {
      "{#ALERT}": "obstructed"
    }
  ]
}

All that’s left to do is to create a dependent item prototype:

Starlink dish alert prototype

With preprocessing, of course:

The JSONPath step extracts each specific alert, and “Boolean to Decimal” saves us some space in the database by transforming true/false booleans to digits.
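The effect of these two preprocessing steps can be sketched in plain JavaScript. This is an illustration rather than the template's actual configuration: the sample values are invented, the function name is hypothetical, and the sketch handles only real JSON booleans, whereas the Zabbix step may accept other representations.

```javascript
// Sample shaped like the "alerts" object that the discovery rule iterates over.
// The true/false values are invented for illustration.
var sample = {
    alerts: {
        dishIsHeating: false,
        motorsStuck: false,
        obstructed: true
    }
};

// Look up one alert by name (the JSONPath step) and convert it to 1 or 0
// (the boolean-to-decimal step). Anything that is not a boolean is rejected.
function alertToDecimal(data, alertName) {
    var value = data.alerts[alertName];
    if (value === true) {
        return 1;
    }
    if (value === false) {
        return 0;
    }
    throw new Error("Not a boolean: " + alertName);
}

var obstructed = alertToDecimal(sample, "obstructed");
var heating = alertToDecimal(sample, "dishIsHeating");
```

Storing 1 and 0 as unsigned integers also makes trigger expressions straightforward, since an alert can be compared directly against 1.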

Result

In the end, we can monitor all the data:

Starlink dish latest data

Even more data can be collected using exporters – if you are willing to do a bit of extra configuration, of course! Let me know if you are interested, and I will show you a completely different approach with a template.

Before I forget, the template used in this tutorial can be found here.

The post Monitoring a Starlink Dish with Zabbix appeared first on Zabbix Blog.

Community, Coffee, and Code: A Zabbix Summit 2025 Recap

Post Syndicated from Michael Kammer original https://blog.zabbix.com/community-coffee-and-code-a-zabbix-summit-2025-recap/31577/

Zabbix Summit 2025 is officially in the history books, so now is the perfect time for a casual, behind‑the‑scenes run‑through of what went down. If you were there, this should ring a few bells (or spark some “oh hey, I forgot about that” moments). If you couldn’t make it, consider this your own personal highlight reel!

Featuring approximately 550 attendees from 42 countries, the Summit took place from October 8-10 at the Radisson Blu Hotel Latvija in the heart of downtown Riga. The 13th in-person version of our premier yearly event was in many ways our biggest and boldest yet, and it included keynote sessions, two parallel tracks (including a developer track), workshops, hands-on sessions, training and certification exams, and a variety of evening social and networking events.

Open source, open house

On October 8, we welcomed nearly 100 guests to our brand-new headquarters for Zabbix Summit 2025’s Open House Day. The new facility gave us plenty of space to host everyone, and visitors got to explore our new HQ, take part in a fun quiz with Zabbix facts, and catch up with longtime colleagues while meeting new ones from the community and the Zabbix team.

Day 1: Looking ahead 

The Summit officially kicked off with Zabbix Founder and CEO Alexei Vladishev’s keynote address, entitled “Zabbix 8.0: A New Chapter in Monitoring.” The address laid out in detail what’s around the corner for Zabbix, including:

  • Zabbix Academy – a new learning hub with self-paced, expert-built courses to boost Zabbix skills anytime and from anywhere.
  • Zabbix France – Zabbix is acquiring IZI-IT and opening a new office in France to provide localized support and closer collaboration with French clients and partners.
  • Zabbix Cloud – a host of new features, including automatic upgrades and backups, plus predictable pricing and simplified user management.
  • Zabbix 8.0 LTS (coming in 2026) – a major leap forward with APM and OpenTelemetry for end-to-end visibility, Complex Event Processing (CEP) and AI-based correlation, plus new UI & visualizations for a smoother experience.
  • Zabbix Mobile App – coming with 8.0 LTS for iOS & Android, the app will offer instant push notifications, issue management, collaboration, seamless connection with Zabbix Cloud, and multi-server views in your pocket.
  • Zabbix Marketplace (2026) – A new global space to connect Zabbix users with vendor and partner solutions, Zabbix Marketplace will extend the power of Zabbix beyond our core product.

Next up was initMAX Founder and CEO Tomáš Heřmánek, who showed how to turn physical sensor data from analog inputs into Zabbix metrics with budget hardware and integrations, complete with templates and triggers.

Another crowd-pleasing session reached the audience thanks to Richard Germanus of CANCOM, who shared the story of how CANCOM consolidated six monitoring systems into one, managing approximately 30,000 hosts, deploying 162 Zabbix proxies, standardizing templates, integrating Power BI for dashboards, automating with APIs, and offering monitoring-as-a-service.

Shortly thereafter, a lightning talk by SEB Bank’s Giedrius Stasiulionis explored “Monitoring Sounds with Zabbix” – in other words, converting audio and sound waves into meaningful metrics, a fresh and inventive notion.

The day’s other lightning talk, “Monitor Your Nearby Areas and Events with Zabbix” by longtime Summit fixture and Zabbix superfan Janne Pikkarainen, showed how anyone can use Zabbix to centralize event data like train timetables, traffic patterns, or cinema showtimes.

Developer track: Something for everyone

Meanwhile, the Summit Developer track was full of special sessions for builders and extension authors, such as “Extend Zabbix Agent 2 with Your Plugin”, which saw Senior Golang Developer Eriks Sneiders show an appreciative audience how Zabbix agent 2’s plugin architecture works, how to use existing plugins, and how to build brand-new custom ones.

Other topics in the Developer track included template design, advanced scripting, API tips, and internal tooling, giving Zabbix techies some food for thought and hopefully sparking a batch of fresh ideas!

Day 2: Showing the big picture

After a long first day and night, Zabbix Summit 2025’s special guest Dylan Beattie made some noise and woke everyone up with a talk entitled “Open Source, Open Mind: The Cost of Free Software.”

Dylan took the Summit audience on a journey through the history and philosophy of free and open source software, touching on questions about licensing issues, looking at the motivations of developers, discussing edge cases and challenges, and asking whether truly sustainable open-source ecosystems can exist.

Later, Inqbeo Founder Christian Anton shared a system in which a central Zabbix instance serves multiple tenants, with the architecture leveraging Kafka to stream metric data partitioned per tenant, storing results in S3 (in Prometheus format), and visualizing via Grafana. This enables isolation and the creation of custom dashboards.

Other main-stage sessions tackled topics like scaling Zabbix, managing large datasets, tag and template strategies, and AI/automation in monitoring.

Connecting people with the Community track

Zabbix Summit 2025 also introduced a Community track, a dedicated space at Zabbix where users, enthusiasts, and contributors could share ideas and shape the future of Zabbix. Instead of deeply technical or development-level presentations, this track focused on community-driven topics like integrations, templates, connectors, media types, and open resources.

A key highlight was the “Zabbix Book Breakout Room”, led by Alexei Vladishev himself along with longtime community members Patrik Uytterhoeven, Brian van Baekel, and Nathan Liefting. Zabbix users were able to brainstorm ideas for new chapters, missing topics, translations, and community contributions to the online Zabbix Book.

Turning ideas into action

Day 2 was also full of hands-on workshops, including a fascinating one from the team at initMAX that was based on their day 1 presentation. Participants got kits with an ESP32 board, a camera, a 3D-printed counter mount, and a few other odds and ends. They were then guided step-by-step as they integrated the device into Zabbix, built monitoring scenarios, and used AI models to interpret camera images.

Meanwhile, the Summit also hosted training and certification exams before, during, and after the main event. Attendees could take courses like Automation & Integration with API, Database Monitoring, SNMP Monitoring, and level-up exams (Specialist and Professional) at discounted rates.

A different kind of networking

One of the things that makes the Zabbix Summit experience so special is the depth of the networking experience – there’s no awkward small talk or simple business card exchanges here, but rather a series of real connections made, deals closed, and new partnerships cemented.

Accordingly, a lot of the magic at Zabbix Summit 2025 happened after hours, with everyone gathering at Riga’s famed Monkey Club for the Summit Welcome Event on October 8 to enjoy a lively atmosphere, a wide selection of cocktails, and plenty of opportunities to connect with fellow monitoring and observability enthusiasts.

October 9’s Main Event took place in the Tallinn Quarter Angārs, which blended concert hall energy with an open-plan street food kitchen and bar that gave everyone plenty of room to mingle.

A special treat was provided in the form of an original Zabbix-related song by Zabbix PHP Developer and part-time rock star Vladimirs Maksimovs, which got the entire crowd on its feet and set the tone for an unforgettable evening.

In what has become a bit of a tradition within a tradition, the Summit officially wrapped up on October 10 at Riga’s Burzma Food Hall, with its relaxed atmosphere, multiple cuisines, and communal tables. It’s proven to be the perfect place for reflecting on Summit highlights, swapping contact info, or plotting collaborations.

Thank you to our sponsors!

We want to extend our heartfelt thanks to all the sponsors of Zabbix Summit 2025, whose commitment not only helped us bring everyone together under one roof but also contributed to the growth of both Zabbix and the entire global monitoring ecosystem. We value your partnership and look forward to working with you for many years to come!

Thanks again to our sponsors and everyone else who helped make Zabbix Summit 2025 possible!

In case you couldn’t make it…

If you didn’t manage to make the trip, you can still enjoy the Summit atmosphere in the privacy of your own home! Recordings of both days are available on Zabbix’s YouTube channel:

Zabbix Summit 2025 Day 1 

Zabbix Summit 2025 Day 2 

The slides and texts of the presentations are also available here.

And that’s a wrap on Zabbix Summit 2025! From mind-blowing tech talks to caffeinated hallway chats and everything in between, this year’s Summit experience delivered. Whether you came for the deep dives or just the cool merch (no shame in that), we hope you went away inspired, connected, and maybe just a little more obsessed with monitoring and observability than before. See you in 2026!

The post Community, Coffee, and Code: A Zabbix Summit 2025 Recap appeared first on Zabbix Blog.

Building HA Zabbix with PostgreSQL and Patroni

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/building-ha-zabbix-with-postgresql-and-patroni/30960/

Running a monitoring platform like Zabbix in a production environment demands reliability and resilience. When your monitoring solution is down, you’re flying blind – and for many organizations, that simply isn’t acceptable. This post introduces a robust high-availability (HA) architecture for Zabbix, using PostgreSQL, Patroni, etcd, HAProxy, Keepalived, and PgBackRest. Built on RHEL 9 or its derivatives, this solution combines modern open-source tools to provide automatic failover, load balancing, and seamless monitoring, all while maintaining consistency and performance.

Architecture overview

The HA design consists of multiple layers working in tandem to maintain continuity even during node or service failures:

Database Cluster Layer

Two or more nodes form the PostgreSQL cluster, managed by Patroni and coordinated using etcd. At any given time, one node is the primary (read/write), and the others are hot standbys ready to take over automatically.

Consensus layer

etcd runs on the same nodes and acts as the distributed configuration store and coordination layer for Patroni. It ensures a consistent cluster state and enables safe failover decisions.

Load balancing layer  

Two HAProxy nodes provide a single point of entry for all clients (including Zabbix), routing requests to the current PostgreSQL primary. These nodes are monitored and coordinated via Keepalived to maintain a floating Virtual IP (VIP), ensuring seamless failover at the connection layer.
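One common way for a load balancer to find the current primary is an HTTP health check against Patroni's REST API, where a request to the /primary endpoint is answered with status 200 only by the leader and with 503 by the other nodes. The text above does not say which check this setup uses, so the sketch below is an assumption; the node names and probe results are hypothetical.

```javascript
// Hypothetical results of probing each node's Patroni REST API at GET /primary.
var probes = [
    { node: "pg1", status: 503 },
    { node: "pg2", status: 200 },
    { node: "pg3", status: 503 }
];

// Return the single node that answered 200, or null when no node (or more than
// one node) claims to be primary, in which case no traffic should be routed.
function selectPrimary(probeResults) {
    var healthy = probeResults.filter(function (probe) {
        return probe.status === 200;
    });
    if (healthy.length !== 1) {
        return null;
    }
    return healthy[0].node;
}

var primary = selectPrimary(probes);
```

Refusing to route when two nodes both answer 200 is a deliberately conservative choice in this sketch: sending writes to the wrong node is worse than briefly rejecting connections.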

Backup layer

A separate backup server is responsible for running PgBackRest, which handles full and incremental backups, WAL archiving, and Point-In-Time Recovery (PITR). This server communicates securely with all database nodes over SSH.

Monitoring layer

Two Zabbix servers, running in active-passive mode, continuously monitor all layers of this stack including the HAProxy health, Patroni cluster role, and etcd status by accessing the PostgreSQL VIP for backend connectivity.

This multi-tiered setup ensures that no single failure, be it a database, load balancer, or monitoring server, brings down the monitoring platform.

Why HA matters for Zabbix

Zabbix depends heavily on its PostgreSQL database backend. Every metric, trigger, event, and alert is stored there. If PostgreSQL becomes unavailable, even briefly, data loss or monitoring blind spots can occur. That’s why introducing HA at the database layer is a crucial step when scaling Zabbix for enterprise environments.

While Zabbix itself supports HA at the application level, this architecture ensures that the database backend is also fully fault-tolerant, using modern consensus-based clustering with automatic failover.

Component overview

To achieve HA, we bring together several specialized components, each fulfilling a critical role in the system:

PostgreSQL

The relational database engine used by Zabbix. In this example setup, it runs on three nodes, forming a cluster managed by Patroni.

Patroni

Patroni is the orchestrator for the PostgreSQL cluster. It monitors node health, manages replication, promotes standbys when needed, and ensures only one writable leader exists at any time. To coordinate decisions across the cluster, Patroni relies on a distributed configuration store (DCS); we use etcd here, but other DCSs are supported.

etcd

etcd is a lightweight and highly available key-value store used by Patroni to maintain the cluster’s state. It stores leader election data, health statuses, and locks. We deploy it as a three-node cluster, co-located with the PostgreSQL nodes for convenience. Because etcd is very sensitive to latency, it can be moved to its own nodes and scaled independently if needed.
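The choice of three etcd members follows from majority-quorum arithmetic. The sketch below only illustrates that arithmetic; the function names are ours and are not part of etcd or Patroni.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (1, 2, 3, 5):
        print(f"members={size} quorum={quorum(size)} "
              f"tolerated_failures={tolerated_failures(size)}")
```

A two-member cluster needs both members for a majority and so tolerates no failures, which is why three is the usual minimum for a consensus store.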

HAProxy

To simplify application connectivity, HAProxy acts as a load balancer in front of the database cluster. It monitors the role of each node using Patroni’s REST API and routes connections to the active primary server. If the leader fails, HAProxy automatically reroutes traffic to the new primary.
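The role check can be pictured with a small Python sketch. It assumes Patroni's REST API listens on its default port 8008 and answers `GET /primary` with HTTP 200 only on the current leader; the node names are hypothetical, and this is not the HAProxy configuration used in the setup, only an illustration of the same decision.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def probe_status(url, timeout=2.0):
    """Return the HTTP status code for `url`, or None if the node is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as exc:
        # Non-2xx answers (e.g. 503 from a replica) still carry a status code.
        return exc.code
    except (URLError, OSError):
        return None


def pick_primary(nodes, probe=probe_status, port=8008):
    """Return the single node whose /primary check answers 200, else None."""
    primaries = [n for n in nodes if probe(f"http://{n}:{port}/primary") == 200]
    # Zero or several "primaries" means there is no safe routing target.
    return primaries[0] if len(primaries) == 1 else None


if __name__ == "__main__":
    # Hypothetical node names; replace with your own hosts.
    print(pick_primary(["pg1.example", "pg2.example", "pg3.example"]))
```

Passing the probe as a parameter keeps the routing decision testable without a live cluster.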

Keepalived

Keepalived provides a floating virtual IP address (VIP) across the HAProxy nodes. This VIP allows client systems, such as the Zabbix frontend, to connect to a single stable IP even if one HAProxy node fails.

PgBackRest

To protect the data itself, we use PgBackRest for full and incremental backups, as well as Point-In-Time Recovery (PITR). A dedicated backup server is included to pull and store archive logs and backups securely via SSH.

Zabbix server

Finally, we run two Zabbix servers in active-passive mode. Both are configured to connect to the PostgreSQL cluster through the VIP exposed by HAProxy. The Zabbix frontend is deployed on both nodes as well, ensuring continued accessibility through the load-balanced setup.

Topology at a glance

Here’s a simplified view of the architecture:

  • 2 or more database nodes (PostgreSQL + Patroni + etcd)
  • Two HAProxy nodes, each configured with Keepalived to manage a floating virtual IP
  • One backup node for PgBackRest
  • Two Zabbix servers pointing to the PostgreSQL VIP

All systems are tied together with consistent hostname mappings, time synchronization (Chrony), and service monitoring.

Notes:

  • PgBackRest is directly connected to all three PostgreSQL nodes, allowing it to archive WAL segments and pull backups regardless of which node is primary.
  • This design enables full standby backups and supports Point-In-Time Recovery (PITR).
  • HAProxy ensures Zabbix always talks to the current primary node, while Patroni and etcd handle automatic failover and cluster state management.

Design rationale

This setup prioritizes resilience and self-healing. If any single component fails (a database node, a load balancer, or even a monitoring server), the system continues to function.

Using Patroni with etcd ensures that failovers are handled automatically, without human intervention. HAProxy ensures client traffic is always routed to the current primary, while Keepalived ensures that this routing layer itself is highly available.

We opted for PgBackRest over simple scripts or base backups because it provides not just efficient incremental backups, but also full WAL archiving and point-in-time recovery, which are invaluable for both disaster recovery and debugging.

Lastly, we chose to integrate Zabbix itself into this HA design, treating it not just as an application but as a fully resilient service, able to monitor itself, so to speak.

Real-world considerations
  • Resource planning: While our nodes run comfortably, scaling this setup to heavy workloads requires careful tuning of memory, I/O, and PostgreSQL parameters.
  • etcd placement: Although we run etcd co-located with the database nodes in this example, separating etcd onto dedicated infrastructure is ideal for large-scale environments. This avoids resource contention and preserves quorum in extreme failure scenarios.
  • Monitoring the monitors: Zabbix itself must be monitored. In our setup, each component, including etcd, Patroni, and PostgreSQL, exposes health endpoints that can be used by Zabbix agents or scripts to generate alerts on replication lag, cluster health, and failover events.
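As an illustration of the replication-lag check mentioned above, here is a minimal sketch that flags lagging replicas. The sample payload is modeled on what Patroni's `GET /cluster` endpoint returns; the member names, the 16 MiB threshold, and the exact field set are assumptions to verify against your Patroni version.

```python
import json

# Illustrative payload only; not captured from a real cluster.
SAMPLE_CLUSTER = json.loads("""
{"members": [
  {"name": "pg1", "role": "leader",  "state": "running"},
  {"name": "pg2", "role": "replica", "state": "streaming", "lag": 0},
  {"name": "pg3", "role": "replica", "state": "streaming", "lag": 50331648}
]}
""")


def lagging_replicas(cluster, max_lag_bytes):
    """Return names of non-leader members whose lag is unknown or above the limit."""
    flagged = []
    for member in cluster.get("members", []):
        if member.get("role") == "leader":
            continue
        lag = member.get("lag")
        # A missing or non-numeric lag is treated as a problem, not as zero.
        if not isinstance(lag, int) or lag > max_lag_bytes:
            flagged.append(member["name"])
    return flagged


if __name__ == "__main__":
    print(lagging_replicas(SAMPLE_CLUSTER, max_lag_bytes=16 * 1024 * 1024))
```

A script like this could feed a Zabbix item or trigger; how it is wired into Zabbix is left open here.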

Conclusion

This architecture provides a solid foundation for running Zabbix in a fault-tolerant, production-ready environment. It not only ensures high availability for the database layer but also offers flexibility, observability, and operational safety.

Whether you’re running internal infrastructure monitoring or offering Zabbix as a managed service, adopting this type of HA setup removes single points of failure and gives you peace of mind — all using open-source technologies that are battle-tested and widely supported.

If you need assistance with the migration or want to ensure best practices for scaling and optimizing Zabbix, don’t hesitate to reach out to OICTS. We are a Zabbix Premium Partner operating globally, with offices in the USA, UK, Netherlands, and Belgium, and we’re ready to help you every step of the way.

 

The post Building HA Zabbix with PostgreSQL and Patroni appeared first on Zabbix Blog.

Exploring the Human Side of Software with Dylan Beatty

Post Syndicated from Michael Kammer original https://blog.zabbix.com/exploring-the-human-side-of-software-with-dylan-beatty/31320/

There are plenty of good reasons to attend Zabbix Summit 2025, but one of the most important is the fact that this year’s Summit will feature Dylan Beattie as a special guest speaker. A Software Development Consultant and Founder of Ursatile, Dylan is an international keynote speaker, and a long-time contributor to the open-source community. He’s also a Microsoft MVP and has created Rockstar, an esoteric programming language that started as an inside joke and ended up being featured in Classic Rock magazine.

At the Summit, Dylan will give a talk titled “Open Source, Open Minds. The Cost of Free Software.” We asked him about his beginnings in the tech industry, what drove the creation of Rockstar, and why communication is the key to successful software development.

Can you tell us a bit about your journey into software development? How did you get started, and was there any particular moment when you realized that you were on the right path?

Like a lot of folks in tech, I got started on the 8-bit home computers of the 1980s – mine was an Amstrad 6128, which came with a couple of fairly mediocre games, but it also had a BASIC and a LOGO interpreter, and I pretty quickly found out that writing little programs and trying to create my own games was way more fun than playing the games which were included with it. I graduated from that to a 286 PC with MS-DOS 5 and Windows 3.1 – but I really wasn’t thinking about it as a career.

The turning point was when I was sixteen years old, and I was supposed to be going to university to study mathematics. Dad brought home a new 486 PC a couple of weeks before my final exams, I spent my study leave messing around on the computer instead of studying, and when I didn’t get the grades I needed for my university course I figured maybe that was a sign I should be studying computer science instead. I went to Southampton and got a bachelor’s degree in computer science, learned C, C++, Lisp, SQL, and HTML. I graduated right as the dot-com bubble was bursting but still managed to get a job building data-driven web applications, and I’ve never really looked back.

You talk a lot about the human side of software. Why do you think communication is such a critical skill in development?

One of the perennial challenges facing the craft of programming is that it can be a profoundly solitary activity. One person working on their own can create an app or a game, put it online, and share their creation with literally millions of people – no meetings, no emails, just one person cranking out code. But then you try to translate those coding skills into domains like banking, healthcare, aviation, domains where software quality can have a real, material effect on people’s lives, and you realize that the code is actually the easy part.

The ability to talk to people, figure out what they need, help them understand your own ideas; to create consensus and avoid misunderstanding? It’s way more important than being able to crank out code. The most expensive problems I’ve had to deal with in my career haven’t been bugs in the code, they’ve been misunderstandings about what the team is doing and why it matters.

How did you end up creating a programming language (Rockstar) that can do double-duty as rock lyrics?

Good question! So, there’s always been this trope of the “rockstar programmer” – these mythical, high-powered, hyper-productive developers who can crank out millions of lines of fast, flawless code – and about a decade ago there was a massive spike in recruiters putting out adverts for “rockstar programmers.” When somebody suggested on Twitter that somebody should create a programming language called Rockstar to really confuse recruiters, that gave me an idea.

Initially it was just a piece of comedy writing – a parody of a programming language specification. I wanted to see if it was possible to extract enough clichés from rock music to create a formal grammar for a Turing-complete programming language that read exactly like song lyrics. It turns out that the answer is yes! I published the parody spec on GitHub, it got shared on Reddit and Hacker News, and the whole thing snowballed from there. Eventually I had no choice but to actually build a Rockstar interpreter, which turned out to be way more difficult than I thought, but also a lot of fun. The latest version is online here – it’s built in C#, compiles to native binaries for Windows, Linux, and macOS, plus there’s a WebAssembly version on the website so curious folks can run Rockstar right in their browser without having to download anything!

Before taking on a speaking slot at this year’s Summit, how familiar were you with Zabbix? What has your experience of using it been like?

I’ve got to be honest – I’m not sure I’d ever heard of Zabbix before I was invited to speak at Zabbix Summit 2025, but that’s not unusual. I get invited to a lot of events that are focused around a particular technology or platform, and it’s a constant reminder of just how vast our industry is that somebody will organize a conference around a product I’ve never even heard of and attract literally hundreds of smart, curious people who want to share their own experiences and learn from each other. One thing about Zabbix which was particularly interesting to me when I started researching it was the licensing model. I think it’s a relatively unusual example of a commercially sustainable product or software that’s published under the Affero GPL license, so I’m really looking forward to chatting with other attendees about that and how that’s influenced their decision to use it.

You’re famous for your detailed and theatrical presentations – what makes a technical talk memorable to you?

A great talk is one that really connects with an audience, and the best way I’ve found to do that is to look for the little things that we all do every day that we’ve all learned to just accept at face value, even when we have no idea why they work that way. Why is a capital “A” ASCII code 65 but a lowercase “a” is code 97? Why is validating email addresses difficult? Why is vertically aligning something in CSS such a big deal? There’s a good chance that a lot of folks in the audience have asked themselves that same question at some point, so the curiosity is already there. Tapping into that curiosity gets their attention, and then you can tell them the good stuff: the history, the stories, the personalities, the decisions.

There’s a lot of stuff in tech which feels kinda stupid, but none of it was designed to be stupid (well, except Rockstar!) Once you understand the context and the history, everything makes a lot more sense – and then at some point, maybe months later, you’ll hit a weird text encoding bug, or a problem with a system that won’t accept certain kinds of email addresses, and you’ll remember the talk. I get email from folks sometimes talking about how something from one of my presentations has helped them fix a weird bug years after they saw the presentation. That’s a great feeling.

Can you drop any hints about your presentation at this year’s Summit? What should audience members expect?

Sure! We’re going to talk about MIT, laser printers, software, Commander Keen, Doom, Quake, Netscape, the origins of the term “open source”, Linksys routers, WordPress, how the xz-utils backdoor nearly ended up compromising about half the computers on the internet – and a really cute story about a squirrel. It’s going to be awesome. I can’t wait!

 

The post Exploring the Human Side of Software with Dylan Beatty appeared first on Zabbix Blog.

Migrating from PRTG to Zabbix: A High-Level Guide

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/migrating-from-prtg-to-zabbix-a-high-level-guide/30845/

For companies looking to migrate from PRTG Network Monitor to Zabbix, one of the most critical aspects is ensuring a smooth migration of monitored devices and configurations. While there is no official tool to directly migrate between the two platforms, creating a bridge using custom export/import scripts allows for an effective, large-scale migration. This blog post outlines a practical approach to achieving that migration based on the export/import methodology we at Opensource ICT Solutions previously implemented for one of our clients.

Why migrate?

While PRTG offers an intuitive interface and is popular for its ease of use, Zabbix provides:

  • Greater flexibility and scalability
  • Full open-source licensing
  • More powerful automation and templating
  • A robust API for integrations
  • Lower costs, especially since Paessler was sold to an investor

These features make Zabbix an attractive choice for teams looking to scale or standardize on open-source infrastructure.

Migration overview

The migration involves two key steps:

  1. Exporting PRTG device information
  2. Importing data into Zabbix

Because the two systems are conceptually and structurally different, we focused our scripts on migrating what is most transferable: device names, IP addresses, and interface types. SNMP versions or PRTG-specific sensor details were excluded or simplified where not applicable to Zabbix. PRTG, for example, will only export probes that have an OID that was not built into PRTG but added later, making our export incomplete. This does not mean we did a partial migration; it just means those items were not included in the automated approach.

Step 1: Exporting from PRTG

We developed a Python-based script that interacts with the PRTG API to extract monitored device data and export it to a CSV file. The script filters out irrelevant objects and organizes the output for easy Zabbix processing.

This creates a clean CSV, like this:

Device Name, IP Address, Interface Type
zabbix-server,10.0.0.10,agent
ServerA,192.168.0.2,SNMP
ServerA,192.168.0.2,agent
core-switch,192.168.0.1,SNMP

This file serves as a clean, structured inventory of monitored devices.
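To show how such a file can be consumed, here is a minimal Python sketch that groups the rows by device so that a host with several interface types appears once. It is not the script used in the migration; the function name and the lowercase normalization of interface types are our assumptions.

```python
import csv
import io

SAMPLE_CSV = """Device Name, IP Address, Interface Type
zabbix-server,10.0.0.10,agent
ServerA,192.168.0.2,SNMP
ServerA,192.168.0.2,agent
core-switch,192.168.0.1,SNMP
"""


def load_inventory(text):
    """Group CSV rows by device name, collecting each interface type once."""
    # skipinitialspace handles the spaces after the commas in the header row.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    hosts = {}
    for row in reader:
        name = row["Device Name"].strip()
        entry = hosts.setdefault(
            name, {"ip": row["IP Address"].strip(), "interfaces": []}
        )
        kind = row["Interface Type"].strip().lower()
        if kind not in entry["interfaces"]:
            entry["interfaces"].append(kind)
    return hosts


if __name__ == "__main__":
    for name, info in load_inventory(SAMPLE_CSV).items():
        print(name, info["ip"], info["interfaces"])
```

Grouping first means each device maps to one host with one or more interfaces, instead of one host per CSV row.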

Note: SNMP version fields were excluded in the final export, as Zabbix does not currently display or rely on an SNMP version in the same way PRTG does.

Step 2: Importing into Zabbix

Using Zabbix’s API, we created an import script that reads the CSV and:

  • Creates host entries
  • Assigns them to the appropriate host group
  • Adds relevant interfaces (e.g., Agent, ILO, SNMP, or a combination of …)

Each host is configured based on its detected interface type in PRTG.

On the Zabbix side, we used the Zabbix API to automate the creation of hosts, interfaces, and template assignment. The import script reads the CSV line-by-line and takes action based on the interface type.
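A minimal sketch of that per-host step is shown below: it builds the JSON-RPC body for a Zabbix `host.create` call from one device's name, IP, and interface types. This is not the authors' import script. The group ID `"42"`, the SNMPv2 setting, and the `{$SNMP_COMMUNITY}` macro are placeholder assumptions, only agent and SNMP interfaces are handled, and sending the request (including authentication) is left out.

```python
import json

# Zabbix interface type IDs and default ports: 1 = agent, 2 = SNMP.
INTERFACE_TYPES = {"agent": (1, "10050"), "snmp": (2, "161")}


def build_host_create(name, ip, kinds, groupid, request_id=1):
    """Build a JSON-RPC request body for the Zabbix host.create method."""
    interfaces = []
    seen = set()
    for kind in kinds:
        key = kind.strip().lower()
        if key not in INTERFACE_TYPES:
            raise ValueError(f"unsupported interface type: {kind!r}")
        if key in seen:
            continue  # only one default interface per type
        seen.add(key)
        type_id, port = INTERFACE_TYPES[key]
        iface = {"type": type_id, "main": 1, "useip": 1,
                 "ip": ip, "dns": "", "port": port}
        if key == "snmp":
            # Assumption: SNMPv2 with the community stored in a user macro.
            iface["details"] = {"version": 2, "community": "{$SNMP_COMMUNITY}"}
        interfaces.append(iface)
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": name,
            "interfaces": interfaces,
            "groups": [{"groupid": groupid}],
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # "42" is a placeholder ID for the temporary migration host group.
    body = build_host_create("ServerA", "192.168.0.2", ["SNMP", "agent"], "42")
    print(json.dumps(body, indent=2))
```

Rejecting unknown interface types early keeps a typo in the CSV from silently creating a host without interfaces.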

Considerations and “gotchas”

  • Templates: We didn’t add templates, as there is no 1:1 solution – PRTG has a different concept and adding a standard template would be possible but probably not the best solution.
  • Host Groups: For ease of use, and given the limited time we had, we added all hosts to a temporary host group made for the migration. Although we do have scripts that extract the group structure from PRTG and recreate it in Zabbix, they were not needed in this particular migration.
  • Permissions: The API token used in the import script must have sufficient privileges to create hosts.

What is NOT migrated

Because of fundamental differences between the platforms, the following are not directly migrated:

  • Historical data or sensor readings: Mainly because the customer had no hard requirement for it.
  • Custom PRTG notifications or dependencies: It was easier to manually re-create them.
  • Maps or dashboards: The Zabbix approach is so different that it was easier to recreate it manually (and improve).
  • Sensors: Zabbix is working with a different concept.

Post-migration tips

  • Validation: After the import, verify that each host is reachable and monitored correctly in Zabbix.
  • Discovery: Consider using Zabbix’s LLD (Low-Level Discovery) to dynamically find interfaces, disks, or other entities.
  • Housekeeping: Disable PRTG monitoring only after confirming Zabbix is fully operational.

Conclusion

Migrating from PRTG to Zabbix is not a one-click operation, but with some scripting, planning, and experience from a partner like us, it can be done efficiently and with minimal disruption. The custom export/import scripts act as a reliable bridge between the two systems, allowing for a clean transfer of your monitoring inventory. From there, Zabbix’s automation and scalability features can help take your monitoring to the next level.

If you need assistance with the migration or want to ensure best practices for scaling and optimizing Zabbix, don’t hesitate to reach out to OICTS. We are a Zabbix Premium Partner operating globally, with offices in the USA, UK, Netherlands, and Belgium ready to help you every step of the way.

The post Migrating from PRTG to Zabbix: A High-Level Guide appeared first on Zabbix Blog.

Troubleshooting network connectivity and performance with Cloudflare AI

Post Syndicated from Chris Draper original https://blog.cloudflare.com/AI-troubleshoot-warp-and-network-connectivity-issues/

Monitoring a corporate network and troubleshooting any performance issues across that network is a hard problem, and it has become increasingly complex over time. Imagine that you’re maintaining a corporate network, and you get the dreaded IT ticket. An executive is having a performance issue with an application, and they want you to look into it. The ticket doesn’t have a lot of details. It simply says: “Our internal documentation is taking forever to load. PLS FIX NOW”.

In the early days of IT, a corporate network was built on-premises. It provided network connectivity between employees that worked in person and a variety of corporate applications that were hosted locally.

The shift to cloud environments, the rise of SaaS applications, and a “work from anywhere” model has made IT environments significantly more complex in the past few years. Today, it’s hard to know if a performance issue is the result of:

  • An employee’s device

  • Their home or corporate wifi

  • The corporate network

  • A cloud network hosting a SaaS app

  • An intermediary ISP

A performance ticket submitted by an employee might even be a combination of multiple performance issues all wrapped together into one nasty problem.

Cloudflare built Cloudflare One, our Secure Access Service Edge (SASE) platform, to protect enterprise applications, users, devices, and networks. In particular, this platform relies on two capabilities to simplify troubleshooting performance issues:

  • Cloudflare’s Zero Trust client, also known as WARP, forwards and encrypts traffic from devices to Cloudflare edge.

  • Digital Experience Monitoring (DEX) works alongside WARP to monitor device, network, and application performance.

We’re excited to announce two new AI-powered tools that will make it easier to troubleshoot WARP client connectivity and performance issues. We’re releasing a new WARP diagnostic analyzer in the Zero Trust dashboard and an MCP (Model Context Protocol) server for DEX. Today, every Cloudflare One customer has free access to both of these new features by default.

WARP diagnostic analyzer

The WARP client provides diagnostic logs that can be used to troubleshoot connectivity issues on a device. For desktop clients, the most common issues can be investigated with the information captured in logs called WARP diagnostic. Each WARP diagnostic log contains an extensive amount of information spanning days of captured events occurring on the client. It takes expertise to manually go through all of this information and understand the full picture of what is occurring on a client that is having issues. In the past, we’ve advised customers having issues to send their WARP diagnostic log straight to us so that our trained support experts can do a root cause analysis for them. While this is effective, we want to give our customers the tools to take control of deciphering common troubleshooting issues for even quicker resolution. 

Enter the WARP diagnostic analyzer, a new AI available for free in the Cloudflare One dashboard as of today! This AI demystifies information in the WARP diagnostic log so you can better understand events impacting the performance of your clients and network connectivity. Now, when you run a remote capture for WARP diagnostics in the Cloudflare One dashboard, you can generate an AI analysis of the WARP diagnostic file. Simply go to your organization’s Zero Trust dashboard and select DEX > Remote Captures from the side navigation bar. After you successfully run diagnostics and produce a WARP diagnostic file, you can open the status details and select View WARP Diag to generate your AI analysis.


In the WARP Diag analysis, you will find a Cloudy summary of events that we recommend a deeper dive into.


Below this summary is an events section, where the analyzer highlights occurrences of events commonly occurring when there are client and connectivity issues. 


Expanding on any of the events detected will reveal a detailed page explaining the event, recommended resources to help troubleshoot, and a list of time stamped recent occurrences of the event on the device.


To further help with troubleshooting, we’ve added a Device and WARP details section at the bottom of this page with a quick view of the device specifications and WARP configurations such as operating system, WARP version, and the device profile ID.


Finally, we’ve made it easy to take all the information created in your AI summary with you by navigating to the JSON file tab and copying the contents. Your WARP Diag file is also available to download from this screen for any further analysis.


MCP server for DEX

Alongside the new WARP Diagnostic Analyzer, we’re excited to announce that all Cloudflare One customers have access to an MCP (Model Context Protocol) server for our Digital Experience Monitoring (DEX) product. Let’s dive into how this will save our customers time and money.

Cloudflare One customers use Digital Experience Monitoring (DEX) to monitor devices across their employee network and troubleshoot any connectivity or performance issues. Like many products at Cloudflare, every data point generated by DEX is available to customers via Cloudflare’s API or log ingestion. DEX API and log data is valuable because it enables Enterprises to create custom analytics for their devices’ connectivity and performance in a SIEM (Security Information and Event Management). Building new data pipelines and various dashboards can take a lot of time and can be expensive. Some Cloudflare One customers aren’t able to dedicate the engineering time to build custom analytics, whether due to budget, resource constraints, time constraints, or other factors.

Model Context Protocol (MCP) is an open standard that lets AI models connect to other applications. In this instance, using MCP together with DEX gives users a custom analytics experience without all the work of integrating log data into a SIEM. Network security practitioners can type out a question (like “Show me the performance data for [email protected]’s device in a few graphs”), and an MCP server will answer that question with a customized graph on Alice’s device data from the DEX API.

Troubleshooting DNS query performance with the DEX MCP server

Any IT admin, security engineer, or network engineer can ask the MCP server a question like: “I received an IT ticket from [email protected] who is an enterprise employee. He is complaining about the performance of his device. Can you investigate this for me?”.

After you submit this question, the DEX MCP server will start by reviewing the device’s overall health.

Overall Device Health:
✅ Connection Status: Consistently connected for the past 24 hours with 100% availability
✅ WARP Mode: Running in optimal "warp+doh" (WARP + DNS over HTTPS) mode
✅ Client Version: Up-to-date with version 2025.1.7.2

So far, the device’s health looks great. Next, the DEX MCP server will begin analyzing the results of the performance tests that are configured on Bob’s device.

📊 Performance Metrics Analysis

HTTP Response Times:
Google.com test: Average 247ms (Range: 81-649ms)
  • Shows significant improvement throughout the day: Started at ~317ms, dropped to ~211ms by evening
Wiki.internal.acme.com: Average 1,374ms (Range: 877-2,431ms)
  • This is notably high and could be the source of performance complaints
  • DNS resolution particularly slow: Average 616ms with spikes up to 1,600ms

Now we’re making progress in our troubleshooting effort. The DEX MCP server identified that the HTTP GET test for wiki.internal.acme.com has a high average resource fetch time when compared to other websites (like google.com). In particular, high resource fetch time is the result of slow DNS resolution. Now, the DEX MCP server will summarize its findings.

🚨 Performance Issues Identified
Extremely Slow DNS Resolution:
  • Average DNS response time of 616ms is significantly high
  • Normal DNS should be <50ms for good performance
High Resource Fetch Times on wiki.internal.acme.com:
  • 1.3+ second response times indicate potential application-level issues
  • Could be related to DNS performance or server-side delays
HTTP Status Code Issues:
  • wiki.internal.acme.com is showing consistent HTTP 400 errors
  • This suggests potential configuration or routing issues

Now, we’ve successfully identified that Bob’s slow wiki performance is the result of a high average DNS response time. Slow DNS resolution increases the average resource fetch time, and also generates HTTP 400 errors for Enterprise employees that are attempting to access wiki.internal.acme.com. Notably, it’s likely that if the wiki is seeing slow DNS response times, other internal applications are experiencing performance issues as well.
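To put the reported numbers in proportion, a quick calculation shows how much of the wiki's average fetch time the DNS lookup would account for. This sketch assumes the 616ms DNS average is included in the 1,374ms total, which the output above does not state explicitly.

```python
def dns_share(total_ms: float, dns_ms: float) -> float:
    """Fraction of the total fetch time spent on DNS resolution."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    return dns_ms / total_ms


if __name__ == "__main__":
    # Averages reported for wiki.internal.acme.com in the analysis above.
    share = dns_share(total_ms=1374, dns_ms=616)
    print(f"DNS accounts for about {share:.0%} of the average fetch time")
```

Under that assumption, DNS makes up a little under half of the average fetch time, so fixing resolution alone would not bring the wiki down to google.com's response times.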

We can ask the DEX MCP server to suggest potential solutions that would fix this wider DNS performance issue.

🔧 DNS Optimization Strategies

Check WARP DNS configuration and settings
  • Since the device is using WARP+DoH mode, you can optimize the DNS settings
Implement or improve local DNS caching
  • Implement or improve local DNS caching to reduce repeated lookups
Network stack tuning
  • Optimize TCP / UDP settings for DNS

Try out the DEX MCP server today

Fast and easy option for testing an MCP server

Any Cloudflare One customer with a Free, PayGo, or ENT plan can start using the DEX MCP server in less than one minute. The fastest and easiest way to try out the DEX MCP server is to visit playground.ai.cloudflare.com. There are five steps to get started:

  1. Copy the URL for the DEX MCP server: https://dex.mcp.cloudflare.com/sse

  2. Open playground.ai.cloudflare.com in a browser

  3. Find the section in the left side bar titled MCP Servers

  4. Paste the URL for the DEX MCP server into the URL input box and click Connect

  5. Authenticate your Cloudflare account, and then start asking questions to the DEX MCP server

It’s worth noting that end users will need to ask specific and explicit questions to the DEX MCP server to get a response. For example, you may need to say, “Set my production account as the active account”, and then give the separate command, “Fetch the DEX test results for the user [email protected] over the past 24 hours”.

Better experience for MCP servers that requires additional steps

Customers will get a more flexible prompt experience by configuring the DEX MCP server with their preferred AI assistant (Claude, Gemini, ChatGPT, etc.) that has MCP server support. MCP server support may require a subscription for some AI assistants. You can read the Digital Experience Monitoring – MCP server documentation for step by step instructions on how to get set up with each of the major AI assistants that are available today.

As an example, you can configure the DEX MCP server in Claude by downloading the Claude Desktop client, then selecting Claude Code > Developer > Edit Config. You will be prompted to open “claude_desktop_config.json” in a code editor of your choice. Simply add the following JSON configuration, and you’re ready to use Claude to call the DEX MCP server.

{
  "globalShortcut": "",
  "mcpServers": {
    "cloudflare-dex-analysis": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://dex.mcp.cloudflare.com/sse"
      ]
    }
  }
}
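If the config file already lists other MCP servers, the entry has to be merged rather than pasted over the existing contents. A minimal Python sketch of that merge follows; the function name and the `other-server` entry are hypothetical, and reading or writing the actual file is left out because its location varies by platform.

```python
import copy
import json

# The server entry from the configuration shown above.
DEX_ENTRY = {
    "command": "npx",
    "args": ["mcp-remote", "https://dex.mcp.cloudflare.com/sse"],
}


def add_dex_server(config, name="cloudflare-dex-analysis"):
    """Return a copy of `config` with the DEX entry added under mcpServers."""
    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    merged.setdefault("mcpServers", {})[name] = copy.deepcopy(DEX_ENTRY)
    return merged


if __name__ == "__main__":
    # Hypothetical existing configuration with one other server defined.
    existing = {"mcpServers": {"other-server": {"command": "example"}}}
    print(json.dumps(add_dex_server(existing), indent=2))
```

Existing servers are kept, and an empty config simply gains the `mcpServers` section.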

Get started with Cloudflare One today

Are you ready to secure your Internet traffic, employee devices, and private resources without compromising speed? You can get started with our new Cloudflare One AI powered tools today.

The WARP diagnostic analyzer and the DEX MCP server are generally available to all customers. Head to the Zero Trust dashboard to run a WARP diagnostic and learn more about your client’s connectivity with the WARP diagnostic analyzer. You can test out the new DEX MCP server (https://dex.mcp.cloudflare.com/sse) in less than one minute at playground.ai.cloudflare.com, and you can also configure an AI assistant like Claude to use the new DEX MCP server.

If you don’t have a Cloudflare account, and you want to try these new features, you can create a free account for up to 50 users. If you’re an Enterprise customer, and you’d like a demo of these new Cloudflare One AI features, you can reach out to your account team to set up a demo anytime. 

You can stay up to date on the latest feature releases across the Cloudflare One platform by following the Cloudflare One changelogs and joining the conversation in the Cloudflare community hub or on our Discord Server.


Zabbix at the Zhongnan University of Economics and Law

Post Syndicated from Michael Kammer original https://blog.zabbix.com/zabbix-at-the-zhongnan-university-of-economics-and-law/30949/

Zhongnan University of Economics and Law (ZUEL), located in Wuhan City, Hubei Province, China, is a key university with two campuses – Nanhu and Shouyi. The school boasts over 20,000 full-time undergraduate students, more than 8,800 graduate students, and over 2,500 faculty and staff members. ZUEL enjoys an outstanding reputation in the fields of law and economics, with four national key disciplines. Its law discipline, meanwhile, has been included in the list of national “Double First-Class” disciplines.

The challenge

As the information infrastructure at ZUEL continues to expand, the scale of the university’s IT infrastructure has rapidly grown to encompass power systems, dynamic environmental systems, servers, network devices, security appliances, storage systems, virtualization platforms, operating systems, databases, data lakes, and campus application systems.

At the same time, the daily academic and administrative activities of faculty and students increasingly demand higher levels of stability and reliability from information systems. To ensure the efficient operation of these systems, the Information Management department needed a monitoring and management system that could cover the entire university’s IT resources and address the growing complexities of operational maintenance.

The university found that traditional monitoring and management systems often fall short when faced with such large-scale and diverse monitoring demands, revealing problems like insufficient monitoring points, poor real-time capabilities, and limited scalability. To address these challenges, the university decided to adopt Zabbix 7.0 and develop a custom IP Radar platform to further meet its refined operational maintenance needs.

The solution

When combined with Zabbix 7.0, the IP Radar system can achieve comprehensive monitoring and management of the university’s entire IT infrastructure through the integrated application of multiple monitoring protocols and technologies. Specifically, the system collects data and performs monitoring with the help of the following core technologies:

  • Zabbix 7.0. As an enterprise-level open-source monitoring platform renowned for its robust data collection and analysis capabilities, Zabbix enhances the system’s high availability, supporting large-scale concurrent processing to make sure that the monitoring system remains stable and delivers uninterrupted service even under heavy loads.
  • Parallel monitoring with multiple protocols. The system collects data through a variety of protocols, including Agent, SNMP, IPMI, MODBUS, MQTT, and more, enabling the real-time monitoring of a wide variety of IT hardware.
  • High-availability design. To accommodate the monitoring demands of a massive number of devices and thousands of users, the Zabbix 7.0 platform supports multi-node deployment and redundancy design, enabling load balancing and failover among proxy servers. Even in the event of a node failure, the system maintains uninterrupted monitoring services, and it’s also equipped with an automated fault alerting and repair mechanism.
  • The self-developed IP Radar platform. To meet a demanding set of operation and maintenance management needs, ZUEL has developed the IP Radar system based on the Zabbix 7.0 platform, further customizing its business monitoring capabilities. IP Radar not only conducts real-time monitoring of the IT infrastructure, but it also provides detailed performance analysis reports and trend predictions, while integrating behavior monitoring capabilities to enhance the school’s network security management.

The IP Radar platform itself contains a variety of unique and innovative features, including:

  • Comprehensive monitoring coverage. The IP Radar system monitors over a million items – everything from hardware devices to application systems, affecting everything from network performance to user experience. This extensive coverage gives the Information Management department a comprehensive understanding of the operational status of the school’s IT resources while providing sufficient data support for troubleshooting and performance optimization.
  • Customized monitoring strategies. Compared to traditional monitoring systems, IP Radar offers highly customized monitoring strategies. ZUEL can tailor different business dashboards for networks, computing resources, user experience, data center environments, and more, based on its own needs and the permissions granted to operation and maintenance personnel. Depending on different monitoring thresholds and alerting strategies, the system can automatically generate alerts and notify relevant personnel through enterprise WeChat, SMS, and other channels.
  • Intelligent alerting and automated handling. The intelligent alerting system of the IP Radar platform leverages machine learning algorithms to analyze historical monitoring data, enabling it to predict potential fault risks and issue early warnings. At the same time, the system integrates automated operation and maintenance capabilities, which allow it to automatically execute predetermined repair operations when certain common faults occur, reducing the time and cost of manual intervention.
  • Network security monitoring. In terms of network security, the IP Radar system is capable of identifying abnormal traffic patterns and promptly detecting potential security threats through real-time analysis of the school’s entire network traffic. The system also supports the monitoring of online behavior to ensure that network access activities comply with the school’s security policies.

The results

After implementing the Zabbix-based system, ZUEL was able to measure a wide range of monitoring performance improvements, including:

  • Improved operational and maintenance efficiency. Through the IP Radar system, the school’s Information Management department has been able to monitor the operational status of over 28,000 hosts in real-time, significantly enhancing operational efficiency. The system’s automated fault handling capabilities reduce the complexity of manual operations, allowing operations and maintenance personnel to focus on addressing only the complex issues that the system is unable to resolve automatically. At the same time, the system’s intelligent alerting feature enables the early detection of potential problems, preventing sudden failures.
  • Enhancing system stability and reliability. The high availability design of Zabbix 7.0 ensures that the system remains stable even under heavy loads. Its redundant design and automatic failover mechanisms guarantee the reliability of the system, and the trend analysis functionality provided by IP Radar helps administrators to identify factors that may affect system stability in advance and make corresponding adjustments, enhancing the overall reliability of the IT system in the process.
  • Advancing detailed information management. The IP Radar platform lets schools manage multiple IT resources with greater precision. The system not only monitors the operational status of hardware devices, but it also analyzes the performance of business systems, helping administrators to optimize system configurations and enhance user experiences. During project development, historical data from the monitoring platform serves as an essential basis for decision-making. In the acceptance phase, the monitoring platform provides evaluation reference data for operational efficiency and stability.

The IP Radar monitoring and management system developed by ZUEL and based on Zabbix 7.0 has become the largest, most widely used, and most effective system of its kind (in terms of the volume of monitored data) in the Chinese education sector. The successful implementation of this system not only provides strong support for the school’s information management, but it also offers valuable references for information operation and maintenance at other universities.

In conclusion

Looking ahead, the IP Radar system is poised to expand its functionalities further by integrating more intelligent operation and maintenance management tools. Through the introduction of emerging technologies such as big data analysis and artificial intelligence, the system will achieve more breakthroughs in areas like automated operation and maintenance as well as intelligent fault prediction, providing even more comprehensive technical support for the university’s information management.

To learn more about what Zabbix can do for educational institutions, visit our website.

The post Zabbix at the Zhongnan University of Economics and Law appeared first on Zabbix Blog.

Understanding HTTP Template Authorization in AWS

Post Syndicated from evgenii.gordymov original https://blog.zabbix.com/understanding-http-template-authorization-in-aws/30856/

Authorization in Amazon Web Services (AWS) determines what actions a user, service, or system can perform on resources. It answers the question: “Does this identity have permission to do this action on that resource?”

In AWS, authorization is primarily handled through:

  • IAM (Identity and Access Management) policies
  • Resource-based policies (like S3 bucket policies)
  • Session-based permissions (like STS AssumeRole)

What authorization types are available in Zabbix AWS templates?

  • Access key authorization
  • Role-based authorization
  • Assume role authorization

Let’s look briefly at each of them.

Before using the template, you need to create an IAM policy that grants the necessary permissions for the AWS services the template will interact with.

This policy defines what actions are allowed on which resources, and optionally, under which conditions. Once created, the policy should be attached to the IAM role or user that will run the template.

IAM policy for Zabbix

Add the following required permissions to your Zabbix IAM policy in order to collect metrics. The policy can change when new metrics and services are added in Zabbix templates.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "cloudwatch:DescribeAlarms",
                "cloudwatch:GetMetricData",
                "ec2:DescribeInstances",
                "ec2:DescribeVolumes",
                "ec2:DescribeRegions",
                "rds:DescribeEvents",
                "rds:DescribeDBInstances",
                "ecs:DescribeClusters",
                "ecs:ListServices",
                "ecs:ListTasks",
                "ecs:ListClusters",
                "s3:ListAllMyBuckets",
                "s3:GetBucketLocation",
                "s3:GetMetricsConfiguration",
                "elasticloadbalancing:DescribeLoadBalancers",
                "elasticloadbalancing:DescribeTargetGroups",
                "ec2:DescribeSecurityGroups",
                "lambda:ListFunctions"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

To create and attach the policy:

  • Go to IAM → Policies → Create policy
  • Choose JSON and paste your policy
  • Review and create the policy
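Before pasting the JSON into the console, it can help to confirm that the policy really allows every action you expect. The following is a minimal Python sketch of such a check. The function names are hypothetical, the embedded policy is a deliberately trimmed-down illustration, and wildcard actions such as ec2:* are not expanded.

```python
import json

# A small subset of the actions listed above, chosen only for illustration.
REQUIRED_ACTIONS = {
    "cloudwatch:GetMetricData",
    "ec2:DescribeInstances",
    "s3:ListAllMyBuckets",
}


def allowed_actions(policy: dict) -> set:
    """Collect the actions named in Allow statements (no wildcard expansion)."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    actions = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        actions.update([action] if isinstance(action, str) else action)
    return actions


def missing_actions(policy: dict, required: set) -> set:
    """Return the required actions that the policy does not explicitly allow."""
    return set(required) - allowed_actions(policy)


policy = json.loads("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": ["cloudwatch:GetMetricData", "ec2:DescribeInstances"],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}
""")
print(sorted(missing_actions(policy, REQUIRED_ACTIONS)))
```

An empty result means every required action is named explicitly in an Allow statement.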

Access key authorization

1. Attach the required policy to the IAM user

  • Go to IAM → Users → Select a user → Permissions tab
  • Click Attach policies
  • Select the policy you created before (IAM Policy for Zabbix)
  • Click Attach policy

2. Get your access key and secret access key

In the AWS console:

  • Go to IAM → Users → Select a user → Security credentials tab

  • Click Create access key

  • Copy:

    • Access key ID
    • Secret access key

⚠ Never expose your keys publicly!

3. Configure AWS CLI

Open your terminal and run:

aws configure --profile zabbix_user

You’ll be prompted to enter:

AWS Access Key ID [None]: AKIAXXXXXXXXXXXEXAMPLE
AWS Secret Access Key [None]: asdkjhUSADWDskhjdasd/EXAMPLEKEY
Default region name [None]: eu-central-1
Default output format [None]: json

4. Test it

List all S3 buckets:

aws s3 ls --profile zabbix_user

Get EC2 tags (use the region where you created the instance):

aws ec2 describe-instances --region eu-central-1 --query 'Reservations[*].Instances[*].Tags' --profile zabbix_user

If you get this error…

An error occurred (AccessDenied) when calling the DescribeInstances operation: User: arn:aws:iam::123456789010:user/zabbix_user is not authorized to perform: ec2:DescribeInstances on resource: arn:aws:ec2:eu-central-1:123456789010:instance/*

…check that the required permissions are attached to the user or role you are using (see IAM policy for Zabbix).
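When several permissions are missing, pulling the action and resource out of each error makes it easier to see what to add to the policy. The following is a rough Python sketch, assuming the error text has the same shape as the message shown above; the pattern and the function name are illustrative and may not match every AWS service.

```python
import re

# Matches the "is not authorized to perform: <action> on resource: <arn>" part.
DENIED = re.compile(
    r"is not authorized to perform: (?P<action>\S+) on resource: (?P<resource>\S+)"
)


def parse_access_denied(message: str):
    """Return (action, resource) from an AccessDenied message, or None."""
    match = DENIED.search(message)
    if match is None:
        return None
    return match.group("action"), match.group("resource")


error = (
    "An error occurred (AccessDenied) when calling the DescribeInstances operation: "
    "User: arn:aws:iam::123456789010:user/zabbix_user is not authorized to perform: "
    "ec2:DescribeInstances on resource: arn:aws:ec2:eu-central-1:123456789010:instance/*"
)
print(parse_access_denied(error))
```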

5. Set the following macros in Zabbix:

  • {$AWS.AUTH_TYPE} – set to access_key
  • {$AWS.ACCESS.KEY.ID} – set to your access key ID
  • {$AWS.SECRET.ACCESS.KEY} – set to your secret access key

Security tips

  • Never hardcode access keys in scripts or code.
  • Store them in ~/.aws/credentials, which is protected by file system permissions.
  • Apply least privilege with IAM policies.
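To confirm that a profile was actually written to the credentials file, the file can be read as a plain INI document. The following is a minimal Python sketch. The function name is hypothetical, the sample text reuses the placeholder keys from the example above, and in practice you would pass in the contents of ~/.aws/credentials.

```python
import configparser


def profile_has_keys(credentials_text: str, profile: str) -> bool:
    """Check that a profile section defines both access key fields."""
    # Interpolation is disabled so that special characters in keys are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id")) and bool(
        section.get("aws_secret_access_key")
    )


sample = """
[zabbix_user]
aws_access_key_id = AKIAXXXXXXXXXXXEXAMPLE
aws_secret_access_key = asdkjhUSADWDskhjdasd/EXAMPLEKEY
"""
print(profile_has_keys(sample, "zabbix_user"))
```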

Role-based authorization

1. Add the appropriate permission to the role you are using:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::{Account}:role/{RoleNameWithPath}"
        },
        {
            "Effect": "Allow",
            "Action": [
                "theSameAsIAMPolicyForZabbix"
            ],
            "Resource": "*"
        }
    ]
}

2. Add a principal to the trust relationships of the role you are using:

  • Go to IAM → Roles → Select a role → Trust relationships tab
  • Click Edit trust relationship
  • Paste the following trust policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "ec2.amazonaws.com"
                ]
            },
            "Action": [
                "sts:AssumeRole"
            ]
        }
    ]
}

⚠ Using role-based authorization is only possible when you use a Zabbix server or proxy inside AWS.
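A trust policy that names the wrong principal is a common reason for a role not being usable. The following is a minimal Python sketch that checks whether a trust policy lets a given principal call sts:AssumeRole. The function name is hypothetical, and conditions or wildcard principals are not evaluated.

```python
def trusts_principal(trust_policy: dict, kind: str, value: str) -> bool:
    """Check for an Allow statement granting sts:AssumeRole to the principal.

    kind is the principal type, for example "Service" or "AWS".
    """
    for statement in trust_policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        if "sts:AssumeRole" not in actions:
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):  # e.g. "*", not handled in this sketch
            continue
        values = principal.get(kind, [])
        values = [values] if isinstance(values, str) else values
        if value in values:
            return True
    return False


trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["ec2.amazonaws.com"]},
            "Action": ["sts:AssumeRole"],
        }
    ],
}
print(trusts_principal(trust_policy, "Service", "ec2.amazonaws.com"))
```

The same check works for the user and role principals used in the assume role variants by passing "AWS" as the kind and the ARN as the value.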

3. Attach the role to the instance

  • Go to EC2 → Instances → Select an instance → Actions → Security → Modify IAM role
  • Select the role you created before which has the policy attached (IAM Policy for Zabbix)
  • Click Apply

4. Test it

Connect to the EC2 instance over SSH. To find the connection command:

  • Go to EC2 → Instances → Select an instance → Connect → SSH client

Example:

ssh -i "zabbix_user.pem" <user>@<instance-public-dns>

Get caller identity:

aws sts get-caller-identity

Get token for metadata service:

export TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

Get IAM role from metadata service:

curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials

Get the IAM role credentials from the metadata service, using the role name returned by the previous command:

curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials/<<--role_name-->>
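The same token-then-request flow can be expressed in Python with only the standard library. The following is a sketch rather than a definitive client: the function names are hypothetical, the endpoint and header names are the ones used in the curl commands above, and fetch only works from inside an EC2 instance.

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Build the PUT request that asks the metadata service for a session token."""
    return urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def role_request(token: str, role_name: str = "") -> urllib.request.Request:
    """Build the GET request for the role name (empty role_name) or its credentials."""
    url = f"{IMDS}/meta-data/iam/security-credentials/{role_name}"
    return urllib.request.Request(url, headers={"X-aws-ec2-metadata-token": token})


def fetch(request: urllib.request.Request, timeout: float = 2.0) -> str:
    """Send a request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode()


# On the instance, the three steps would look like this:
#   token = fetch(token_request())
#   role = fetch(role_request(token)).strip()
#   credentials_json = fetch(role_request(token, role))
```

Building the requests separately from sending them keeps the sketch easy to inspect on a machine that has no access to the metadata service.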

5. Set the following macros in Zabbix:

  • {$AWS.AUTH_TYPE} – set to role_base
  • {$AWS.ASSUME.ROLE.ARN} – set to your role ARN

Assume role authorization

This method has two options:

  • Using access key authorization for getting creds for assume role
  • Using role-based authorization for getting creds for assume role

Let’s look first at using access key authorization for getting creds for assume role.

Using access key authorization for getting creds for assume role

1. Create access key for user (see Access Key Authorization)

2. Add the appropriate permission to the role you are using:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::{Account}:user/{UserName}"
        },
        {
            "Effect": "Allow",
            "Action": [
                "theSameAsIAMPolicyForZabbix"
            ],
            "Resource": "*"
        }
    ]
}

3. Add principal to the trust relationships of the role you are using:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{Account}:user/{UserName}"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}


4. Test it

Get assume role credentials using access key authorization:

aws sts assume-role --role-arn arn:aws:iam::123456789010:role/Zabbix_Role --role-session-name test-session --profile zabbix_user

An example response:

{
    "Credentials": {
        "AccessKeyId": "ASDFGHJKLEXAMPLE",
        "SecretAccessKey": "QowihdwoieuoinflksnliooEXAMPLE",
        "Expiration": "2029-09-09T22:22:22+00:00"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "ASDFGHJKLEXAMPLE:test-session",
        "Arn": "arn:aws:sts::123456789010:assumed-role/Zabbix_Role/test-session"
    }
}
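Temporary credentials expire, so it is worth checking the Expiration field when testing. The following is a minimal Python sketch that computes the remaining lifetime from a response of the shape shown above. The function name is hypothetical, and the sample contains only the field the sketch needs.

```python
import json
from datetime import datetime, timezone


def seconds_until_expiry(response: dict, now: datetime) -> float:
    """Return how many seconds remain before the temporary credentials expire."""
    expiration = datetime.fromisoformat(response["Credentials"]["Expiration"])
    return (expiration - now).total_seconds()


sample = json.loads('{"Credentials": {"Expiration": "2029-09-09T22:22:22+00:00"}}')
# A fixed reference time, one hour before the expiration in the sample.
reference = datetime(2029, 9, 9, 21, 22, 22, tzinfo=timezone.utc)
print(seconds_until_expiry(sample, reference))  # 3600.0
```

In real use, pass datetime.now(timezone.utc) as the reference time.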

5. Set the following macros in Zabbix:

  • {$AWS.AUTH_TYPE} – set to assume_role
  • {$AWS.ACCESS.KEY.ID} – set to your access key ID
  • {$AWS.SECRET.ACCESS.KEY} – set to your secret access key
  • {$AWS.ASSUME.ROLE.ARN} – set to your role ARN
  • {$AWS.ASSUME.ROLE.AUTH.METADATA} – set to false

Getting credentials for assume role using cross-account role (best practice)

1. Create role (see Role-Based Authorization)

2. Add the appropriate permission to the role you are using:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::{Account}:role/{RoleNameWithPath}"
        },
        {
            "Effect": "Allow",
            "Action": [
                "theSameAsIAMPolicyForZabbix"
            ],
            "Resource": "*"
        }
    ]
}

3. Add the principal to the trust relationships of the role you are using:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{Account}:role/{RoleNameWithPath}"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

⚠ Using assume role with role-based authorization is only possible when you use a Zabbix server or proxy inside AWS.

4. Test it

Connect to the EC2 instance over SSH. To find the connection command:

  • Go to EC2 → Instances → Select an instance → Connect → SSH client

Get assume role credentials using role name from instance metadata service:

aws sts assume-role --role-arn arn:aws:iam::123456789010:role/NewRole --role-session-name test-session

An example response:

{
    "Credentials": {
        "AccessKeyId": "ACCESS_KEY_ID",
        "SecretAccessKey": "SECRET_ACCESS_KEY",
        "SessionToken": "SESSION_TOKEN",
        "Expiration": "EXPIRATION_DATE"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "ASSUMED_ROLE_ID",
        "Arn": "arn:aws:sts::ACCOUNT_ID:assumed-role/ROLE_NAME/SESSION_NAME"
    }
}

5. Set the following macros in Zabbix:

  • {$AWS.AUTH_TYPE} – set to assume_role
  • {$AWS.ASSUME.ROLE.ARN} – set to your role ARN
  • {$AWS.ASSUME.ROLE.AUTH.METADATA} – set to true
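The macro sets for the three authorization types are easy to mix up. The following is a small Python sketch that reports which macros are still missing for a chosen type. The function name and the lookup table are hypothetical and are built only from the macro lists in this post, so they may not cover every macro that a given template version expects.

```python
# Macros listed in this post for each value of {$AWS.AUTH_TYPE}.
REQUIRED_MACROS = {
    "access_key": {"{$AWS.ACCESS.KEY.ID}", "{$AWS.SECRET.ACCESS.KEY}"},
    "role_base": {"{$AWS.ASSUME.ROLE.ARN}"},
    "assume_role": {"{$AWS.ASSUME.ROLE.ARN}", "{$AWS.ASSUME.ROLE.AUTH.METADATA}"},
}


def missing_macros(macros: dict) -> set:
    """Return the macros that are unset or empty for the selected auth type."""
    auth_type = macros.get("{$AWS.AUTH_TYPE}")
    if auth_type not in REQUIRED_MACROS:
        raise ValueError(f"unknown auth type: {auth_type!r}")
    required = set(REQUIRED_MACROS[auth_type])
    # Assume role without instance metadata also needs the access key pair.
    if (
        auth_type == "assume_role"
        and macros.get("{$AWS.ASSUME.ROLE.AUTH.METADATA}") == "false"
    ):
        required |= REQUIRED_MACROS["access_key"]
    return {name for name in required if not macros.get(name)}


host_macros = {
    "{$AWS.AUTH_TYPE}": "assume_role",
    "{$AWS.ASSUME.ROLE.ARN}": "arn:aws:iam::123456789010:role/Zabbix_Role",
    "{$AWS.ASSUME.ROLE.AUTH.METADATA}": "false",
}
print(sorted(missing_macros(host_macros)))
```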

Well done! You have successfully configured AWS authorization in Zabbix AWS templates.

Now you can use the template to collect metrics from AWS.

The post Understanding HTTP Template Authorization in AWS appeared first on Zabbix Blog.

Transforming IT Infrastructure Visibility at Doğan Trend Automotive

Post Syndicated from Michael Kammer original https://blog.zabbix.com/transforming-it-infrastructure-visibility-at-dogan-trend-automotive/30715/

Established in 2020 to consolidate Doğan Group’s automotive and mobility companies and brands under a single entity, Doğan Trend Automotive is a prominent player in their industry. Representing a diverse portfolio ranging from automobiles, motorcycles, and marine engines to electric commercial vehicles, Doğan Trend also delivers innovative solutions to customers through its e-commerce platforms, such as suvmarket.com and vespastoreturkey.com.

The challenge

Doğan Trend’s IT ecosystem spans data centers, remote locations, and multiple units, necessitating seamless operations as well as an efficient monitoring and alert system. The existing infrastructure posed challenges in monitoring, making it difficult to detect potential issues in a timely manner, thus increasing operational risks.

The solution

To address Doğan Trend’s needs, our associates at ASNSKY implemented a Zabbix-based monitoring system. Key highlights of the project included:

  • Centralized dashboards: Custom dashboards were designed for data centers and remote locations, enabling the unified monitoring of IT locations and components from a single interface.
  • A dynamic alert system: Alerts prioritized based on predefined conditions allowed for the swift and effective resolution of critical issues.
  • Seamless operations: Early detection of potential issues prevented operational disruptions and ensured continuity.

Throughout the integration process, ASNSKY’s team collaborated closely with Doğan Trend’s IT department, addressing the specific requirements of different units and providing training for effective system use.

The results

Implementing the new monitoring system rapidly delivered the following results:

  • Enhanced visibility: Real-time monitoring of all IT locations and components made potential issues easy to spot.
  • Proactive issue management: Early detection of critical issues reduced operational downtime.
  • Increased efficiency: The centralized monitoring system drastically improved the responsiveness and effectiveness of Doğan Trend’s IT team.

“This project with the ASNSKY team made our IT infrastructure more transparent and manageable. With Zabbix’s flexible and effective monitoring capabilities, we gained active control over our critical operations. We thank the ASNSKY team for this successful collaboration.” – Burak Altunalan, IT System Management Specialist at Doğan Trend Automotive

In conclusion

Doğan Trend plans to further take advantage of Zabbix’s flexibility to strengthen their resilience and operational efficiency. Their association with ASNSKY marks a significant step toward achieving these objectives.

To learn more about what Zabbix can do for retail customers in every sector, get in touch with us. 

About ASNSKY

ASNSKY enhances its customers’ competitiveness by integrating the power of enterprise-grade open-source solutions in security and infrastructure with a professional service approach and high quality standards.

Backed by deep industry expertise and a team of seasoned professionals, ASNSKY stands as a trusted partner in your digital transformation journey.

The post Transforming IT Infrastructure Visibility at Doğan Trend Automotive appeared first on Zabbix Blog.

How to Install Zabbix on Windows with a Linux Subsystem

Post Syndicated from Alexander Petrov-Gavrilov original https://blog.zabbix.com/how-to-install-zabbix-on-windows-with-a-linux-subsystem/30311/

It’s a very well known fact that Zabbix can only be installed on Linux. But what if you are in a Windows environment and getting a Linux machine is not so simple or even possible? This can obstruct the implementation of Zabbix, or at least significantly delay it. Not only that, building a POC outside of the future environment makes data procurement a lot more complicated. Is there a way to work around this and get Zabbix as close to Windows as we possibly can?

WSL

WSL/WSL 2 is a fast and easy solution for installing and using Zabbix in a smaller Windows-dominant environment, be that a POC or a small company office. WSL 2 runs a real Linux kernel in a lightweight VM while being optimized for Windows. This means a faster start, lower resource consumption, and the ability to share files with Windows directly, meaning you can use Windows File Explorer to find and manage the VM files.

WSL 2 also allows you to use Linux CLI while working with Windows (i.e. running vim from a Windows terminal and editing Windows files directly). At this point, you may be asking yourself, “Why not Hyper-V and VirtualBox?” Those are definitely options too, but they are quite heavy on system resources. In addition, boot times are a bit longer and sharing files between a host and a guest OS is clunkier.

Maybe Docker Desktop then? It’s an absolutely valid option, but that would require a bit of Docker knowledge and you would still be using WSL, technically speaking. So, with that said, WSL is definitely the fastest and most reliable way to spin up a Zabbix instance in a Windows-focused environment.

We will use WSL 2, but as a note WSL 1 is also available. Here are the differences:

  • WSL 2 is usually the better performer overall, especially for dev environments.
  • WSL 2 has better Linux compatibility (systemd, iptables, etc.).
  • In WSL 1, Linux files aren’t isolated, which can make them more accessible. In WSL 2, Linux runs on a virtual disk (ext4), so Linux and Windows files are more separate. Integration is still pretty good, however.
  • WSL 1 shares the same IP address as Windows. WSL 2 runs as a VM, so some network configuration is required.
  • With WSL 1, you can see running Linux processes in Task Manager. In WSL 2, processes are isolated.

Installing Zabbix using WSL

Install WSL

Open PowerShell as an Administrator and run:

PS C:\Windows\system32> wsl --install

If you already have WSL 1 installed, update it:

PS C:\Windows\system32> wsl --update

You can also set WSL 2 as default:

PS C:\Windows\system32> wsl --set-default-version 2

[Image: WSL installation]

Install your preferred Linux distribution (e.g. Ubuntu, Debian, Oracle Linux) from the Microsoft Store, or download it directly. I will be using Oracle Linux 9.4.

[Image: Microsoft Store WSL images]

You can also download the RootFS tarball from your preferred distribution’s portal, but then the process will be a bit different. Create a folder using PowerShell:

PS C:\Windows\system32> mkdir C:\WSL\OracleLinux9

Copy the .tar.xz file to this folder, then run:

PS C:\Windows\system32> wsl --import OracleLinux9 C:\WSL\OracleLinux9 .\oraclelinux9-rootfs.tar.xz --version 2

After the image is installed or imported, start Oracle Linux using PowerShell:

PS C:\Windows\system32> oraclelinux94

When installation is finished, you are prompted to create a default UNIX user account and a password for it. The username does not need to match your Windows username. I’ll set it to “zabbix” of course, but you can choose any other name.

Enter new UNIX username: zabbix
New password: <your-password>
passwd: all authentication tokens updated successfully.
Installation successful!

Now Oracle Linux is ready for use!

Prepare the system

You will be immediately logged in to the new environment. If you log out, you can log in again by running this in PowerShell:

PS C:\Windows\system32> oraclelinux94

Once logged in, first double-check that your selected OS is indeed installed by running the following in the same PowerShell window, which now serves as your CLI access point to the VM:

[zabbix@PC-NAME ~]$ cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="9.4"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Oracle Linux Server 9.4"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:9:4:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"

Once that is confirmed, make sure all OS updates are installed:

[zabbix@PC-NAME ~]$ sudo dnf update -y

When the update process is finished, you will need to decide whether you would like to use systemd or not (this may increase booting time). I will enable systemd. To do this, edit the wsl.conf on the Linux subsystem:

[zabbix@PC-NAME ~]$ sudo vi /etc/wsl.conf

Add to the newly created file:

[boot]
systemd=true
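Because wsl.conf is a plain INI file, the setting can be verified with a few lines of Python before restarting WSL. The following is a minimal sketch with a hypothetical function name; in practice you would pass in the contents of /etc/wsl.conf.

```python
import configparser


def systemd_enabled(wsl_conf_text: str) -> bool:
    """Return True when the [boot] section sets systemd to a true value."""
    parser = configparser.ConfigParser()
    parser.read_string(wsl_conf_text)
    # A missing section or key counts as disabled.
    return parser.getboolean("boot", "systemd", fallback=False)


print(systemd_enabled("[boot]\nsystemd=true\n"))  # True
```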

Restart WSL (this command shuts down all running distributions):

PS C:\Windows\system32> wsl.exe --shutdown

Start your Linux distribution again:

PS C:\Windows\system32> oraclelinux94

Install Zabbix database

We will need to prepare the database engine. Again, any preferred database engine can be used; in this case, I will install and configure MariaDB:

[zabbix@PC-NAME ~]$ sudo dnf install -y mariadb-server mariadb
[zabbix@PC-NAME ~]$ sudo systemctl enable --now mariadb

Confirm MariaDB is running:

[zabbix@PC-NAME ~]$ systemctl status mariadb

mariadb.service - MariaDB 10.5 database server
     Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; preset: disabled)
     Active: active (running) since Tue 2025-04-29 12:39:54 EEST; 3min 55s ago
       Docs: man:mariadbd(8)
             https://mariadb.com/kb/en/library/systemd/
   Main PID: 235 (mariadbd)
     Status: "Taking your SQL requests now..."
      Tasks: 9 (limit: 26213)
     Memory: 109.6M
     CGroup: /system.slice/mariadb.service
             └─235 /usr/libexec/mariadbd --basedir=/usr

After confirmation, secure it a bit by creating a root password and answering the prompts as shown below:

[zabbix@PC-NAME ~]$ sudo mysql_secure_installation 

Enter current password for root (enter for none):
OK, successfully used password, moving on...

Setting the root password or using the unix_socket ensures that nobody
can log into the MariaDB root user without the proper authorisation.

You already have your root account protected, so you can safely answer 'n'.

Switch to unix_socket authentication [Y/n] n
 ... skipping.

You already have your root account protected, so you can safely answer 'n'.

Change the root password? [Y/n] Y
New password:
Re-enter new password:
Password updated successfully!
Reloading privilege tables..
 ... Success!

Remove anonymous users? [Y/n] Y
 ... Success!

Disallow root login remotely? [Y/n] Y
 ... Success!

Remove test database and access to it? [Y/n] Y


Reload privilege tables now? [Y/n] Y

 ... Success!

Cleaning up...

All done!  If you've completed all of the above steps, your MariaDB
installation should now be secure.

Thanks for using MariaDB!

Now to create the Zabbix database. Log in to MariaDB:

[zabbix@PC-NAME ~]$ sudo mysql -u root -p
Enter password: <enter your password; it won’t be visible>

Follow the steps from the Zabbix installation page:

MariaDB [(none)]> create database zabbix character set utf8mb4 collate utf8mb4_bin;
MariaDB [(none)]> create user zabbix@localhost identified by '<custom-password>';
MariaDB [(none)]> grant all privileges on zabbix.* to zabbix@localhost;
MariaDB [(none)]> set global log_bin_trust_function_creators = 1;

MariaDB [(none)]> quit;

Installing Zabbix

Install the Zabbix repository:

[zabbix@PC-NAME ~]$ sudo dnf install https://repo.zabbix.com/zabbix/7.0/centos/9/x86_64/zabbix-release-latest-7.0.el9.noarch.rpm
[zabbix@PC-NAME ~]$ sudo dnf clean all

Proceed to install the Zabbix server, frontend, and agent:

[zabbix@PC-NAME ~]$ sudo dnf -y install zabbix-server-mysql zabbix-web-mysql zabbix-apache-conf zabbix-sql-scripts zabbix-selinux-policy zabbix-agent
...
zabbix-agent-7.0.12-release1.el9.x86_64 zabbix-apache-conf-7.0.12-release1.el9.noarch zabbix-selinux-policy-7.0.12-release1.el9.x86_64  zabbix-server-mysql-7.0.12-release1.el9.x86_64 zabbix-sql-scripts-7.0.12-release1.el9.noarch
 zabbix-web-7.0.12-release1.el9.noarch zabbix-web-deps-7.0.12-release1.el9.noarch zabbix-web-mysql-7.0.12-release1.el9.noarch

Complete!

Now import the initial database schema:

[zabbix@PC-NAME ~]$ zcat /usr/share/zabbix-sql-scripts/mysql/server.sql.gz | mysql -u zabbix -p zabbix
Enter password: <enter your DB user password and wait until the prompt reappears>
[zabbix@PC-NAME ~]$

Disable the log_bin_trust_function_creators option after import has finished:

[zabbix@PC-NAME ~]$ sudo mysql -u root -p
Enter password: <enter your password; it won’t be visible>
MariaDB [(none)]>  set global log_bin_trust_function_creators = 0;
MariaDB [(none)]>  quit;

Add your Zabbix user database password to the Zabbix server configuration file:

[zabbix@PC-NAME ~]$ sudo vi /etc/zabbix/zabbix_server.conf
### Option: DBPassword
#       Database password.
#       Comment this line if no password is used.
#
# Mandatory: no
# Default:
DBPassword=<your-DB-user-password>

Start the Zabbix server, agent, and frontend services, and enable them to start at boot:

[zabbix@PC-NAME ~]$ sudo systemctl restart zabbix-server zabbix-agent httpd php-fpm
[zabbix@PC-NAME ~]$ sudo systemctl enable zabbix-server zabbix-agent httpd php-fpm
Created symlink /etc/systemd/system/multi-user.target.wants/zabbix-server.service → /usr/lib/systemd/system/zabbix-server.service.
Created symlink /etc/systemd/system/multi-user.target.wants/zabbix-agent.service → /usr/lib/systemd/system/zabbix-agent.service.
Created symlink /etc/systemd/system/multi-user.target.wants/httpd.service → /usr/lib/systemd/system/httpd.service.
Created symlink /etc/systemd/system/multi-user.target.wants/php-fpm.service → /usr/lib/systemd/system/php-fpm.service.

Installation of the backend is now finished, but we still need the frontend.

Exposing and installing the Zabbix frontend for WSL

Since WSL2 does not expose services to localhost by default, you need to determine the WSL IP:

[zabbix@PC-NAME ~]$ ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:15:5d:47:32:c6 brd ff:ff:ff:ff:ff:ff
    inet 172.29.128.155/20 brd 172.29.143.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::215:5dff:fe47:32c6/64 scope link
       valid_lft forever preferred_lft forever

Look for an IP like 172.x.x.x, then using your browser go to:

http://<WSL_IP>/zabbix

In this example, that would be: 

http://172.29.128.155/zabbix

You can also port forward WSL to localhost with netsh in PowerShell:

PS C:\Windows\system32> netsh interface portproxy add v4tov4 listenport=8080 listenaddress=127.0.0.1 connectport=80 connectaddress=<WSL_IP>

Then you will be able to access Zabbix from http://localhost:8080/zabbix. Now, just finish the standard frontend setup and Zabbix is ready to use!
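
The two steps above (reading the WSL address and building the netsh command) can be scripted. The following is only a minimal sketch: the function names wsl_ipv4 and portproxy_command are made up for illustration, and the sample text is the `ip addr show eth0` output from this post. Since the WSL address can change after a restart, a helper like this may save some retyping.

```python
import re

# Output of `ip addr show eth0`, copied from the example in this post.
SAMPLE_OUTPUT = """\
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:15:5d:47:32:c6 brd ff:ff:ff:ff:ff:ff
    inet 172.29.128.155/20 brd 172.29.143.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::215:5dff:fe47:32c6/64 scope link
       valid_lft forever preferred_lft forever
"""


def wsl_ipv4(ip_addr_output: str) -> str:
    """Return the first IPv4 address listed on an 'inet' line (hypothetical helper)."""
    match = re.search(r"^\s*inet\s+(\d+\.\d+\.\d+\.\d+)/\d+", ip_addr_output, re.MULTILINE)
    if match is None:
        raise ValueError("no IPv4 address found in the output")
    return match.group(1)


def portproxy_command(wsl_ip: str, listen_port: int = 8080) -> str:
    """Build the netsh port-forwarding command used in this post (hypothetical helper)."""
    return (
        "netsh interface portproxy add v4tov4 "
        f"listenport={listen_port} listenaddress=127.0.0.1 "
        f"connectport=80 connectaddress={wsl_ip}"
    )


ip = wsl_ipv4(SAMPLE_OUTPUT)
print(f"http://{ip}/zabbix")      # direct URL inside WSL
print(portproxy_command(ip))      # command to paste into an elevated PowerShell
```

The sketch only prints the command; running it is still left to you.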

WSL advantages

Some extra advantages you get with this approach include clearer resource usage visibility:

WSL Task manager

Direct access to the Linux subsystem files through File explorer with your favorite Windows tools:

Linux subsystem file explorer

As you can see, Docker is here as well. System and configuration files are also visible and editable:

File explorer Zabbix config files

Now you can proceed with building your Zabbix or Zabbix POC, (almost) without needing to leave your regular Windows environment!

The post How to Install Zabbix on Windows with a Linux Subsystem appeared first on Zabbix Blog.

Zabbix and a Federal Government Agency

Post Syndicated from Michael Kammer original https://blog.zabbix.com/zabbix-and-a-federal-government-agency/30708/

Our Premium Partners at the ATS Group work with a large federal government agency in the United States. They primarily provide storage and compute-as-a-service for the agency, which relies on them to stay up and running at all times.

The challenge

The agency’s primary goal was to simplify their capacity and performance monitoring without extra costs. They had very strict regulatory and SLO oversight requirements that had to be met, especially when it came to capacity and performance.

There was no commercially available software that could accomplish everything they needed directly out of the box, but they still required a solution that was powerful and flexible enough to monitor almost anything.

The solution

Because the agency has several data centers of different sizes, they use a distributed proxy setup, intensive SLA reporting, a ServiceNow integration, a variety of internal integrations, and a Zabbix-based monitoring solution that includes predictive alerting.

The agency has plenty of software in the mix, but it primarily relies on storage, VMware, and Kubernetes. They also have multiple satellite offices and data centers, so that in the event of a data center failure, another can come online with minimal downtime in between.

On top of that, they have over 30 metrics and more than a trillion data points across 10 major technologies that they need to measure, primarily from a regulatory perspective. Thousands of granular metrics needed to have solutions and reporting designed for them in Zabbix, including (for example) CPU cores and frequency, processor-to-core usage metrics, and virtualization ratios from hosts to virtual machines.

Their Kubernetes-based OpenShift environment also needs to be monitored to exact specifications. Deployment took place via a Helm chart: Zabbix components were installed as Kubernetes resources, node-level resources and applications were monitored, and the data was aggregated and sent to the Zabbix server.

Metrics are collected via the Kubernetes API and kube-state-metrics, and the solution uses Prometheus-exported metrics or direct HTTP endpoint calls. When it comes to configuration, proxies and hosts are created in Zabbix to represent Kubernetes nodes and clusters, while templates and macros are configured to point to the Kubernetes API and kube-state-metrics endpoints.

The results

Thanks to Zabbix, the federal government agency in question has a solution that provides centralized monitoring of Kubernetes alongside other IT resources, supports application-specific metrics without requiring Prometheus endpoints, and offers plenty of flexibility to customize and scale.

In addition, Zabbix’s predictive alerting capabilities identify abnormalities in operational data and predictively alert the agency about anything that could potentially impact an application or service, which lets them meet SLAs, optimize user experience, and increase productivity.

In conclusion

Zabbix’s flexibility and ease of customization make it ideal for customers who need a single source of truth that can be relied on in even the most stringent regulatory environments.

To learn more about what Zabbix can do for customers in the public sector, visit us here.

The post Zabbix and a Federal Government Agency appeared first on Zabbix Blog.

Podman Container Monitoring with Prometheus Exporter, part 2

Post Syndicated from Janis Eidaks original https://blog.zabbix.com/podman-container-monitoring-with-prometheus-exporter-part-2/30538/

In the first part of this post, we explored how to get data with HTTP agent from the Prometheus Podman exporter and use the same item data for the Podman pods Discovery rule as well as item and trigger prototypes. In part 2 of the same series, we’ll learn how to discover and monitor Podman containers.

Creating a template discovery rule

I will create another discovery rule for container discovery. This discovery rule is also based on the same item, Podman info, in the template Podman containers by HTTP and Prometheus (see part one of this series to find out how to configure it). The parameters of the discovery rule are shown below. This discovery rule will allow us to discover the container name and ID.

Template: Podman containers by HTTP and Prometheus

▲ Discovery rule
  ▪ Name:                   Container discovery
  ▪ Type                    Dependent item
  ▪ Key:                    training.containers.discovery
  ▪ Master item             Podman containers by HTTP and Prometheus: Podman info
  ▪ Delete lost resources  After 10d
  ▪ Disable lost resources Immediately
♯ Preprocessing
  ▪ Prometheus to JSON     podman_container_info
♦ LLD Macros
  ▪ {#CONTAINER.ID}        $.labels.id
  ▪ {#CONTAINER.NAME}      $.labels.name
Fig 1. Discovery rule: Container discovery
Fig 2. Discovery rule: Container discovery preprocessing tab
Fig 3. Discovery rule: Container discovery LLD macros tab

Next, different dependent item prototypes are created in this container discovery rule. As the Prometheus Podman exporter provides a lot of different metrics about the containers, I will create multiple such items: state, health, creation date, input/output network traffic information, and so on. So, check out what metrics can be acquired and use what is relevant for you.

You can also add a description of each item prototype. I am interested only in metrics with the discovered container ID macro, and I am not interested in the values of the other labels, such as pod_id and pod_name, so I use =~".*", which matches any value. I will show configuration screenshots for one of the item prototypes.

These item prototypes are similar to each other, with some minor differences, such as Prometheus patterns, or in some cases, with a different master item (item prototype as master item).
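
To make the matching logic concrete, here is a rough sketch of how such a pattern selects one sample: the id label must equal the discovered container ID, while pod_id and pod_name are matched with the regular expression .* and therefore accept anything, including an empty value. This is not Zabbix's implementation; the function names, container IDs, and values are invented for illustration, and the samples are given as already-parsed label dictionaries.

```python
import re


def label_matches(labels, exact, regex):
    """Check one sample's labels against '=' matchers and '=~' matchers.

    Regex matchers are treated as fully anchored, and a missing label is
    treated as an empty string (an assumption made for this sketch).
    """
    exact_ok = all(labels.get(name) == value for name, value in exact.items())
    regex_ok = all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in regex.items()
    )
    return exact_ok and regex_ok


# Hypothetical podman_container_state samples: (labels, value).
SAMPLES = [
    ({"id": "aaa111", "pod_id": "", "pod_name": ""}, 2.0),
    ({"id": "bbb222", "pod_id": "p1", "pod_name": "zabbix"}, 5.0),
]


def select_value(samples, container_id):
    """Return the value of the sample matching id="<container_id>", or None."""
    for labels, value in samples:
        if label_matches(
            labels,
            exact={"id": container_id},
            regex={"pod_id": ".*", "pod_name": ".*"},
        ):
            return value
    return None


print(select_value(SAMPLES, "aaa111"))
print(select_value(SAMPLES, "bbb222"))
```

Because .* also matches an empty string, containers that do not belong to any pod are still selected by their ID.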

Fig 4. Discovery rule preprocessing step: Prometheus to JSON with pattern podman_container_info
Fig 5. Discovery rule LLD macros: assigning relevant JSONPATH to LLD macros

Creating a template discovery rule: Item prototypes

After the containers have been discovered, we have to create item prototypes. These prototypes will also be dependent item prototypes and will use the same item as the discovery rule: Podman info. The Prometheus Podman exporter returns a lot more metrics for the containers than it did for the pods.

You can get container metrics such as container health, state, creation date, disk read/write, memory usage, network usage, and more. In this blog post, I have added most of them, so check what metrics are relevant to your monitoring needs and start monitoring.

Fig 6. Low-level discovery rule and item prototypes based on the same item.

Screenshots of one of the item prototypes are shown below.

Fig 7. Container state item prototype tab
Fig 8. Container state item prototype tag tab
Fig 9. Container state item prototype preprocessing tab

Remember, you can also test these item prototypes in the preprocessing step: just copy the Prometheus exporter data and set the relevant macro to the value you want to check.

The configuration parameters of the item prototypes are shown below. There are a lot of metrics you can monitor, but remember to monitor what is relevant and necessary for you.

Template: Podman containers by HTTP and Prometheus; Discovery rule: Container discovery

● Item prototype #1
  ▪ Name: 		Container health: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.health[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 
♦ Tags (name:value) 	
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:health		
♯ Preprocessing
  ▪ Prometheus pattern 	podman_container_health{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #2
  ▪ Name: 		Container state: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.state[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		
♦ Tags (name:value)  			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:state		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_state{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #3
  ▪ Name: 		Created at: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.created[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		unixtime
♦ Tags (name:value) 		
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:created		
♯ Preprocessing
  ▪ Prometheus pattern 	podman_container_created_seconds{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #4
  ▪ Name: 		Disk read per second: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.disk.read[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags (name:value) 	
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:disk_read		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_block_input_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value
  ▪ Change per second

● Item prototype #5
  ▪ Name: 		Disk write per second: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.disk.write[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags (name:value) 	
  ▪ Container:{#CONTAINER.NAME}	 
  ▪ Metric:disk_write		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_block_output_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value
  ▪ Change per second

● Item prototype #6
  ▪ Name: 		Exit code: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.exit_code[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 			
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:exit_code	
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_exit_code{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #7
  ▪ Name: 		Image tags: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.image.tags[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Character
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 			
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:tag
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_info{id="{#CONTAINER.ID}",image=~".*",name=~".*",pod_id=~".*",pod_name=~".*",ports=~".*"} label image
  ▪ Regular expression	\.*(\/.\w.*)	\1

● Item prototype #8
  ▪ Name: 		Memory usage: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.mem[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:mem		
♯ Preprocessing
  ▪ Prometheus pattern podman_container_mem_usage_bytes{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #9
  ▪ Name: 		Network input dropped: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.in.drop[{#CONTAINER.NAME}]
  ▪ Type of inf: Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		packets
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_in_drop		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_input_dropped_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #10
  ▪ Name: 		Network input errors: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.in.errors[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_in_err		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_input_errors_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #11
  ▪ Name: 		Network input total: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.in.total[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_in_tot
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_input_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #12
  ▪ Name: 		Network input per second: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.in.change[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		prototype - Network input total: [{#CONTAINER.NAME}] 
  ▪ Units: 		Bps
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_in_change
♯ Preprocessing
  ▪ Change per second

● Item prototype #13
  ▪ Name: 		Network output dropped: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.out.drop[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_out_drop	
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_output_dropped_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #14
  ▪ Name: 		Network output errors: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.out.errors[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_out_err	
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_output_errors_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #15
  ▪ Name: 		Network output total: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.out.total[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_out_tot	
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_output_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #16
  ▪ Name: 		Network output per second: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.out.change[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		prototype - Network output total: [{#CONTAINER.NAME}]
  ▪ Units: 		Bps
♦ Tags 			 
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_out_change
♯ Preprocessing
  ▪ Change per second

● Item prototype #17
  ▪ Name: 		Rootfs size: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.rootfs.size[{#CONTAINER.NAME}]
  ▪ Type of inf: Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:rootfs
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_rootfs_size_bytes{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #18
  ▪ Name: 		Total system CPU time: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.cpu.time[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		s
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:sys_time
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_cpu_system_seconds_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value
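
Several of the prototypes above (disk read/write and network input/output per second) rely on the Change per second preprocessing step. As a rough illustration of the arithmetic only, the sketch below divides the difference between two consecutive counter readings by the elapsed time. The function name and the numbers are invented; dropping a reading that is lower than the previous one (for example after a container restart) is how I understand the step to behave, so treat that part as an assumption.

```python
def change_per_second(prev_value, prev_time, value, time):
    """Return (value - prev_value) / (time - prev_time), or None if the counter went backwards.

    Times are in seconds. A decreasing counter yields None in this sketch,
    meaning "store nothing and wait for the next reading".
    """
    elapsed = time - prev_time
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    if value < prev_value:
        return None
    return (value - prev_value) / elapsed


# Two hypothetical readings of a byte counter, taken 300 seconds (5m) apart.
rate = change_per_second(prev_value=1000, prev_time=0, value=4000, time=300)
print(rate)
```

With a 5m update interval on the master item, each per-second value is therefore an average over the last five minutes.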

Creating a template discovery rule: Trigger prototype

I have created a user macro {$CONTAINER.RUNNING.STATE} on the template with a value of 2, which corresponds to the container’s running state. After that, create a trigger prototype to check whether the container is in any state other than running.

Template: Podman containers by HTTP and Prometheus; Discovery rule: Container discovery

◘ Trigger prototypes
  ▪ Name:               Container [{#CONTAINER.NAME}] state has changed from running
  ▪ Severity:           Warning
  ▪ Expression:         last(/Podman containers by HTTP and Prometheus/container.state[{#CONTAINER.NAME}])<>{$CONTAINER.RUNNING.STATE}
  ▪ PROBLEM event generation mode: Single
  ▪ OK event closes: All problems

So, once all of this is done, you will get a problem event whenever a container or a pod leaves the running state.

Fig 10. Generated problem events when the podman pod and container change states.

Technically, I could also create a trigger for container health; however, because all of the container health values I receive are -1 (meaning unknown), it makes little sense to create a trigger that would fire right away. You can also add additional item/trigger prototypes to the template. If everything is set up as expected, you should see something like the screenshot below after the LLD rule execution.

Fig 11. Example of the mysql-server container and zabbix pod item values.

Summary

Now you can monitor both Podman pods and containers using the two posts in this series. The container LLD rule and its item prototypes use the same template item, Podman info, that was created in the first part.

The post Podman Container Monitoring with Prometheus Exporter, part 2 appeared first on Zabbix Blog.

Podman Container Monitoring with Prometheus Exporter, part 1

Post Syndicated from Janis Eidaks original https://blog.zabbix.com/podman-container-monitoring-with-prometheus-exporter-part-1/30513/

In part one of this blog post, I will show you how to monitor Podman pods using HTTP agent item to retrieve data from the Prometheus Podman exporter. Let’s get started!

Installing and checking Prometheus Podman exporter

First, you will need to install and enable the Prometheus Podman exporter (my OS is CentOS Stream release 9). Then, check that the service is active and running.

# dnf install -y prometheus-podman-exporter

# systemctl enable prometheus-podman-exporter --now

# systemctl status prometheus-podman-exporter

You can check that you are getting the data from the exporter with either the curl command from the machine/VM where the Prometheus podman exporter is installed and started:

# curl http://localhost:9882/metrics
Fig 1. Output of Prometheus podman exporter in CLI

Or through the browser (replace abc with the machine’s IP address or DNS name): http://abc:9882/metrics.

Fig 2. Output of Prometheus Podman exporter in browser

A line starting with # is a comment and contains an explanation regarding the metric; in this case, podman_container_block_input_total will return data in bytes.  In Figure 2, after the comments, you can see several podman_container_block_input_total metrics, one for each container, with different container IDs, pod IDs, and pod names listed in each metric. The metric’s value is displayed on the right side after curly brackets.
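
The structure of a metric line (name, labels in curly brackets, value on the right) can be illustrated with a small parser. This is only a sketch for reading the exporter output by eye or in a test; the sample lines, including the comment text and the container ID, are invented, and the parser handles just the simple cases shown here.

```python
import re

# name, optional {labels}, value, optional timestamp
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_line(line):
    """Return (name, labels, value) for a sample line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Hypothetical exporter output (values and IDs are made up).
SAMPLE = """\
# HELP podman_container_block_input_total Container block input in bytes.
# TYPE podman_container_block_input_total counter
podman_container_block_input_total{id="19286a13dc23",pod_id="",pod_name=""} 49152
"""

for raw in SAMPLE.splitlines():
    parsed = parse_line(raw)
    if parsed is not None:
        print(parsed)
```

Zabbix does this parsing for you in its Prometheus preprocessing steps; the sketch is only meant to show which part of the line is which.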

Creating a template and template items

Next, I will create a template Podman containers by HTTP and Prometheus where I will put all of the entities (everything will be created on the template). In the template, I will create an item Podman info, which will gather all of the necessary data at defined intervals. This approach will be convenient from a data collection standpoint as the same item data will be used for LLD and item prototypes. During testing, you can set “History” to store data for some time, and when everything is working as expected, then set “History” not to keep any data. This item will be used for the Low-Level Discovery rule and the item prototype.

The item Podman info parameters are as follows:

Template: Podman containers by HTTP and Prometheus

○ Item
  ▪ Name:         Podman info
  ▪ Type          HTTP agent
  ▪ Key:          podman.info
  ▪ Type of inf   text
  ▪ URL           http://{HOST.CONN}:9882/metrics
  ▪ Request type  GET
  ▪ Update int.   5m
  ▪ Req status c. 200
  ▪ History       Do not store
◊ Tags
  ▪ Podman:raw

Fig 3. Template item for data gathering

At this point, this item contains just raw data, without any preprocessing steps applied. The address for {HOST.CONN} is taken from an interface added to the host, so you will get an error message if the host has no interface.

Fig 4. Error on the host with the linked template without any interface

If you do not want to add an interface to the host, you can define a user macro on the template level and use that user macro in the item’s URL. After adding the template to the host, just modify the user macro value on the host to the correct IP address or DNS name.

Fig 5. User macro on template
Fig 6. Template item for data gathering with user macro instead of built in macro from host interface

I can also create an item to determine the number of containers created. To do this, I can count the occurrences of a specific Prometheus pattern in the master item. For this, I will use the podman_container_state metric. Likewise, I could use a different metric, such as podman_container_info, and count the occurrences of that pattern. The parameters of the Container count item are:

Template: Podman containers by HTTP and Prometheus

○ Item
  ▪ Name:         Container count
  ▪ Type          Dependent item
  ▪ Key:          container.count
  ▪ Type of inf   Numeric (unsigned)
  ▪ Master item   Podman containers by HTTP and Prometheus: Podman info
◊ Tags
  ▪ Containers:total
♯ Preprocessing
  ▪ Prometheus pattern     podman_container_state     count
Fig 7. Template item preprocessing step for counting the total number of containers
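
The counting idea is simple enough to reproduce outside Zabbix when you want to sanity-check the result. The sketch below counts how many sample lines in the exporter output carry a given metric name; the function name and the sample text (IDs, names, comment wording) are invented for illustration.

```python
def count_metric(exposition, metric):
    """Count sample lines whose metric name equals `metric` (comments are skipped)."""
    count = 0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' or at the first whitespace.
        name = line.split("{", 1)[0].split()[0]
        if name == metric:
            count += 1
    return count


# Hypothetical exporter output with three containers.
SAMPLE = """\
# HELP podman_container_state Container current state.
# TYPE podman_container_state gauge
podman_container_state{id="aaa111",pod_id="",pod_name=""} 2
podman_container_state{id="bbb222",pod_id="p1",pod_name="zabbix"} 2
podman_container_state{id="ccc333",pod_id="p1",pod_name="zabbix"} 5
podman_container_info{id="aaa111",name="mysql-server"} 1
"""

print(count_metric(SAMPLE, "podman_container_state"))
```

Since the exporter emits one podman_container_state line per container, the count equals the number of containers regardless of their state.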

Creating a Discovery rule in template

Next, the LLD rule will be created to discover Podman pods. It will be a dependent LLD rule based on a Podman info item with a preprocessing step to convert the Prometheus pattern data to JSON format. The caveat is that the LLD discovery will be executed as frequently as the data is received for the item. If there are a lot of hosts with such a template, there will be a lot of LLD processes executed, which can put a strain on your Zabbix instance.

To rectify this issue, I will add a preprocessing step: discard unchanged with heartbeat (as there are no dynamic parameters in the extracted pattern, otherwise we would need to filter out dynamically changing information). For LLD, the recommended interval is around 1h. Additionally, LLD macros will be created from selected JSONPath variables. The parameters of the LLD rule are shown below.

Template: Podman containers by HTTP and Prometheus

▲ Discovery rule
  ▪ Name:                   POD discovery
  ▪ Type                    Dependent item
  ▪ Key:                    training.pod.discovery
  ▪ Master item             Podman containers by HTTP and Prometheus: Podman info
  ▪ Delete lost resources  After 10d
  ▪ Disable lost resources Immediately
♯ Preprocessing
  ▪ Prometheus to JSON     podman_pod_info
  ▪ Discard unchanged with heartbeat 1h
♦ LLD Macros
  ▪ {#POD.ID}              $.labels.id
  ▪ {#POD.NAME}            $.labels.name
Fig 8. Discovery rule: Pod discovery
Fig 9. Discovery rule: Pod discovery preprocessing tab
Fig 10. Discovery rule: Pod discovery LLD macros tab
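
The effect of the Discard unchanged with heartbeat step with a 1h heartbeat can be sketched as follows: a value is passed on if it differs from the last one passed, or if at least one heartbeat period has elapsed since then; otherwise it is dropped. This is a simplified model written for this explanation, not Zabbix code, and the class name is made up.

```python
class DiscardUnchangedWithHeartbeat:
    """Drop repeated values unless the heartbeat period has elapsed (simplified model)."""

    def __init__(self, heartbeat_seconds):
        self.heartbeat = heartbeat_seconds
        self.last_value = None
        self.last_passed = None  # time of the last value that was passed on

    def process(self, value, now):
        """Return the value if it should be passed on, otherwise None."""
        unchanged = self.last_passed is not None and value == self.last_value
        if unchanged and now - self.last_passed < self.heartbeat:
            return None
        self.last_value = value
        self.last_passed = now
        return value


step = DiscardUnchangedWithHeartbeat(heartbeat_seconds=3600)
# The master item delivers data every 5 minutes (300 s); the pod list stays the same.
passed = [t for t in range(0, 7201, 300) if step.process("pods: zabbix", t) is not None]
print(passed)
```

In this model, 25 deliveries over two hours result in only three LLD executions (at the start and once per hour), while any change in the pod list is still passed on immediately.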

The block diagram below will show how the data is transformed. First, a preprocessing step is applied to the data to convert the Prometheus pattern to JSON format, as all data for LLD must be supplied in JSON format.

In the example below, the matching queried pattern is returned in JSON format after this preprocessing step.

Fig 11. Discovery rule preprocessing step: Prometheus to JSON with pattern podman_pod_info

After the preprocessing step, we can assign specific JSONPATH values to LLD macros.

Fig 12. Discovery rule LLD macros: assigning relevant JSONPATH to LLD macros
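
The two stages above (Prometheus to JSON, then JSONPath to LLD macros) can be sketched in a few lines. The sketch is a simplification: the objects it builds contain only name, value, and labels, whereas the real step output may carry more fields, and the JSONPath handling covers only plain dotted paths such as $.labels.id. The sample pod lines, IDs, and function names are invented.

```python
import json
import re


def prometheus_to_json(exposition, metric):
    """Convert the sample lines of one metric into a list of JSON-like objects."""
    pattern = re.compile(re.escape(metric) + r'\{(.*)\}\s+(\S+)')
    rows = []
    for line in exposition.splitlines():
        match = pattern.fullmatch(line.strip())
        if match:
            labels = dict(re.findall(r'(\w+)="([^"]*)"', match.group(1)))
            rows.append({"name": metric, "value": match.group(2), "labels": labels})
    return rows


def resolve(obj, path):
    """Resolve a plain dotted JSONPath such as '$.labels.id' (no filters or arrays)."""
    for part in path.split(".")[1:]:
        obj = obj[part]
    return obj


def lld_rows(rows, macro_paths):
    """Build one {macro: value} dictionary per discovered entity."""
    return [{macro: resolve(row, path) for macro, path in macro_paths.items()} for row in rows]


# Hypothetical exporter output for two pods.
SAMPLE = """\
podman_pod_info{id="a1b2c3",infra_id="d4e5f6",name="zabbix"} 1
podman_pod_info{id="0f9e8d",infra_id="7c6b5a",name="web"} 1
podman_pod_state{id="a1b2c3"} 4
"""

MACROS = {"{#POD.ID}": "$.labels.id", "{#POD.NAME}": "$.labels.name"}

discovered = lld_rows(prometheus_to_json(SAMPLE, "podman_pod_info"), MACROS)
print(json.dumps(discovered, indent=2))
```

Each dictionary in the result corresponds to one discovered pod, and its keys are the LLD macros that the prototypes can then use.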

Creating a template Discovery rule: item prototypes

Now that we have discovered the macros we are interested in, the discovered macros can be used for further prototype (ITEM/HOST/TRIGGER) creation. In this example, I am using the same master item for LLD discovery and the dependent item prototypes, because it is convenient for me, and all the information is available in one item. But usually, there are scenarios where you have to use one item’s data for discovery and the data of another item for populating the prototype values.

In this case, I am interested in the pod ID, when the pod was created, the number of containers in the pod, and the state of the pod. Therefore, I will create the item prototypes and use the LLD macro in the name, key, and preprocessing step. Zabbix will cycle through the discovered LLD macro values and create the items based on the prototype by replacing the LLD macro with discovered values. Although you can set matching item prototype names (which will be confusing), you still have to use the LLD macro in the item key so that different item keys are generated – otherwise, you will get an error regarding duplicate keys. The item prototype parameters are given below.

Fig 13. Low-level discovery rule and item prototypes based on the same item.
Template: Podman containers by HTTP and Prometheus; Discovery rule: POD discovery

○ Item prototype #1
  ▪ Name:         POD ID: [{#POD.NAME}]
  ▪ Type          Dependent item
  ▪ Key:          pod.id[{#POD.NAME}]
  ▪ Type of inf   Character
  ▪ Master item   Podman containers by HTTP and Prometheus: Podman info
♦ Tags
  ▪ Metric:ID
  ▪ Pod:{#POD.NAME}
♯ Preprocessing
  ▪ Prometheus pattern     podman_pod_containers{id="{#POD.ID}"}          label    id

○ Item prototype #2
  ▪ Name:         POD state: {#POD.NAME}
  ▪ Type          Dependent item
  ▪ Key:          pod.state[{#POD.NAME}]
  ▪ Type of inf   Numeric (float)
  ▪ Master item   Podman containers by HTTP and Prometheus: Podman info
  ▪ Value mapping POD state
♦ Tags
  ▪ Metric:state
  ▪ Pod:{#POD.NAME}
♯ Preprocessing
  ▪ Prometheus pattern     podman_pod_state{id="{#POD.ID}"}      value

○ Item prototype #3
  ▪ Name:         POD created at: [{#POD.NAME}]
  ▪ Type          Dependent item
  ▪ Key:          pod.created[{#POD.NAME}]
  ▪ Type of inf   Numeric (unsigned)
  ▪ Units         unixtime
  ▪ Master item   Podman containers by HTTP and Prometheus: Podman info
♦ Tags
  ▪ Metric:created
  ▪ Pod:{#POD.NAME}
♯ Preprocessing
  ▪ Prometheus pattern     podman_pod_created_seconds{id="{#POD.ID}"}     value

○ Item prototype #4
  ▪ Name:         POD container count: [{#POD.NAME}]
  ▪ Type          Dependent item
  ▪ Key:          pod.count[{#POD.ID}]
  ▪ Type of inf   Numeric (unsigned)
  ▪ Master item   Podman containers by HTTP and Prometheus: Podman info
♦ Tags
  ▪ Metric:count
  ▪ Pod:{#POD.NAME}
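
The way prototypes turn into real items, and why the key must contain an LLD macro, can be shown with a toy model. The functions and the discovered values below are made up for illustration and only mimic the macro substitution described earlier in this section.

```python
def expand(prototype, row):
    """Replace every LLD macro in the prototype string with its discovered value."""
    for macro, value in row.items():
        prototype = prototype.replace(macro, value)
    return prototype


def create_item_keys(key_prototype, discovered):
    """Expand the key prototype for each discovered row and reject duplicate keys."""
    keys = [expand(key_prototype, row) for row in discovered]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate item keys: {duplicates}")
    return keys


# Hypothetical discovery result for two pods.
DISCOVERED = [
    {"{#POD.ID}": "a1b2c3", "{#POD.NAME}": "zabbix"},
    {"{#POD.ID}": "0f9e8d", "{#POD.NAME}": "web"},
]

print(create_item_keys("pod.state[{#POD.NAME}]", DISCOVERED))

try:
    create_item_keys("pod.state", DISCOVERED)  # no LLD macro in the key
except ValueError as error:
    print(error)
```

The first call yields one distinct key per pod, while the second fails because both pods would end up with the same key.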

On the template, I have also created a value map for deciphering the numerical pod state codes to text strings for better clarity.

Fig 14. Value mapping for the POD state item

Screenshots of the POD state item prototype are shown below.

Fig 15. POD state item prototype: item prototype tab
Fig 16. POD state item prototype: tag tab
Fig 17. POD state item prototype: preprocessing tab

Creating a template Discovery rule: trigger prototype

We can also create a trigger prototype to generate an alert if there is something wrong with the pod. I have created a user macro {$POD.RUNNING.STATE} on the template with a value of 4, which corresponds to the running state.

Template: Podman containers by HTTP and Prometheus; Discovery rule: POD discovery

◘ Trigger prototypes:
  ▪ Name:               POD [{#POD.NAME}] state has changed from running
  ▪ Severity:           Warning
  ▪ Expression: last(/Podman containers by HTTP and Prometheus/pod.state[{#POD.NAME}])<>{$POD.RUNNING.STATE}
  ▪ PROBLEM event generation mode: Single
  ▪ OK event closes: All problems
Fig 18. Trigger prototype based on POD state item value

Once you link the template to the host and execute the LLD rule, you should start seeing the Podman pods (if you have them), similar to the screenshot below.

Fig 19. Latest data for the host with the linked template

Summary

This blog post shows how to get data with HTTP agent from Prometheus Podman exporter and use the same item data for the Discovery rule as well as item and trigger prototypes. Check out part 2 of this series to find out how to discover and monitor Podman containers.

The post Podman Container Monitoring with Prometheus Exporter, part 1 appeared first on Zabbix Blog.

Next-Level Alert Analysis with DeepSeek and Zabbix

Post Syndicated from Zhe Cheng original https://blog.zabbix.com/next-level-alert-analysis-with-deepseek-and-zabbix/30424/

As IT infrastructures grow increasingly complex, efficiently analyzing monitoring data and accelerating incident response have become critical challenges for operations teams. This post explores a few innovative applications of DeepSeek when integrated with Zabbix.

Requirements:

– Zabbix server 7.0 or higher
– DeepSeek API (Alternatively, other AI APIs can be used if needed)

1. Scenario One: One-Click Intelligent Alert Analysis

By integrating DeepSeek Analytics into the Zabbix frontend, users can conduct intelligent alert analysis with just one click. This integration facilitates the swift generation of comprehensive fault analyses and solution suggestions, markedly decreasing the MTTR (Mean Time to Resolution). Consequently, it streamlines the troubleshooting process, alleviates the workload on IT personnel, ensures system stability, and conserves both time and resources.

1.1  On the Zabbix home page, navigate to “Alerts” > “Scripts”, and click on the “Create script” button.

1.2  Configuration script:

  • Name: Can be customized
  • Scope: Select “Manual event action”
  • Menu path: Customize menu paths for quick access
  • Type: Select “Script”
  • Execute on: Select “Zabbix proxy or server”

1.3 Enter the following command in the command bar:

/etc/zabbix/scripts/send_alert_to_ai.sh "{TRIGGER.NAME}" "{TRIGGER.SUBJECT}"  "{HOST.NAME}" "{HOST.IP}" "{EVENT.TIME}" "{TRIGGER.SEVERITY}"

1.4  Create an API call script on zabbix-server.

1.4.1 Modify the Zabbix Server Configuration File and Enable Global Scripts:

Open the Zabbix server configuration file for editing:

vi /etc/zabbix/zabbix_server.conf

Set the EnableGlobalScripts option to 1:

EnableGlobalScripts=1

Save the changes and exit the editor. Then, restart the Zabbix server service to apply the changes:

systemctl restart zabbix-server

1.4.2 Create an API Call Script.

Create a directory for custom scripts if it does not already exist:

mkdir -p /etc/zabbix/scripts && cd /etc/zabbix/scripts

Note: If the frontend prompts that the script file cannot be found, try moving the script to the directory used by the Nginx agent.

Create a new script file named send_alert_to_ai.sh:

vi send_alert_to_ai.sh

Add the following content to the script, replacing the API_KEY placeholder value with your actual DeepSeek API key. Make sure you adjust the API call method if using a different AI service:

#!/bin/bash

# DeepSeek API configuration
API_URL="https://api.deepseek.com/chat/completions"
API_KEY="xxxxxxxxxxxxxxxxxxxx"

# jq is required to build the request body and to parse the response
if ! command -v jq &> /dev/null; then
    echo "jq could not be found, please install it first."
    exit 1
fi

# Alarm details passed in by the Zabbix script command
TRIGGER_NAME="$1"
ALERT_SUBJECT="$2"
HOST_NAME="$3"
HOST_IP="$4"
EVENT_TIME="$5"
TRIGGER_SEVERITY="$6"

# Prompt text; printf expands \n into real newlines
system_prompt="You are an assistant focused on responding quickly to system alarms."
user_prompt=$(printf 'The following alarm information is received:\n\nTrigger name: %s\nSubject: %s\nHost name: %s\nHost IP: %s\nEvent time: %s\nSeverity: %s\n\nPlease tell me the cause of the alarm and the handling measures in short, professional language with a limit of 300 words.' \
    "$TRIGGER_NAME" "$ALERT_SUBJECT" "$HOST_NAME" "$HOST_IP" "$EVENT_TIME" "$TRIGGER_SEVERITY")

# Build the JSON body with jq so that quotes and backslashes in the alarm text are escaped
alert_info=$(jq -n --arg system "$system_prompt" --arg user "$user_prompt" \
    '{model: "deepseek-chat",
      messages: [{role: "system", content: $system}, {role: "user", content: $user}],
      stream: false}')

# Send the POST request and capture the response and HTTP status code
response=$(curl -s -w "\n%{http_code}" -X POST "$API_URL" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $API_KEY" \
    -d "$alert_info")

# Separate the HTTP status code (last line) from the response body
http_code=$(echo "$response" | tail -n1)
response_body=$(echo "$response" | sed '$d')

# Extract the content field and format the output
if [ "$http_code" -eq 200 ]; then
    content=$(echo "$response_body" | jq -r '.choices[0].message.content')
    echo -e "Analysis result:\n$content"
else
    echo "failure: HTTP status code $http_code, response: $response_body"
    exit 1
fi

Make the script executable:

chmod +x send_alert_to_ai.sh

Note: The script provided invokes the official DeepSeek API. Replace the API_KEY placeholder with your actual key. If you are using another AI service, confirm the appropriate API invocation method.
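
Alarm text can contain quotes or backslashes, so the request body is best produced by a JSON serializer rather than assembled by hand. As a sketch of the same request in Python (the function name and the field labels in the prompt are assumptions, not part of the original script):

```python
import json


def build_alert_payload(trigger_name, subject, host, host_ip, event_time, severity):
    """Return the chat completion request body as a JSON string."""
    user_prompt = (
        "The following alarm information is received:\n\n"
        f"Trigger name: {trigger_name}\n"
        f"Subject: {subject}\n"
        f"Host name: {host}\n"
        f"Host IP: {host_ip}\n"
        f"Event time: {event_time}\n"
        f"Severity: {severity}\n\n"
        "Please tell me the cause of the alarm and the handling measures "
        "in short, professional language with a limit of 300 words."
    )
    body = {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system",
             "content": "You are an assistant focused on responding quickly to system alarms."},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
    }
    # json.dumps escapes quotes, backslashes, and newlines in every value
    return json.dumps(body)


if __name__ == "__main__":
    print(build_alert_payload('Disk "sda" is full', "Disk alert", "web01",
                              "192.0.2.10", "12:00:00", "High"))
```

The point of the sketch is only the escaping: a trigger name such as `Disk "sda" is full` still yields a valid JSON document.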

Important note: The script relies on jq to process JSON data. If jq is not installed on your system, follow these instructions to install it.

For Debian/Ubuntu Systems:

apt-get update
apt-get install jq

For CentOS/RHEL Systems:

yum install epel-release
yum install jq

1.5 Actual Effect Display:

1.6  Optional Optimization Items.

1.6.1 Adjust Output Box Size for Better Browsing.

After executing the script, you may find that the output box is too small and inconvenient to browse. To optimize this, you can modify the front-end CSS file as follows.

Back up the existing CSS File:

cd /usr/share/nginx/html/assets/styles/
cp blue-theme.css blue-theme.css.bak

Edit the CSS File:

vi /usr/share/nginx/html/assets/styles/blue-theme.css

Add Custom Styles at the End of the File.

Add the following CSS rules to adjust the size and behavior of the output box:

#execution-output {
    height: 500px; /* Adjust to your desired height */
    width: 540px; /* Optional: Adjust the width as required */
    overflow-y: auto; /* Displays scrollbar when content exceeds the set height */
}

Save and exit the editor. At this point, clear the browser cache and reload the page to see the changes take effect.

1.6.2 How to Optimize Slow Output Response after Executing the One-Click Analysis Script.

During actual testing, returning a 300-word result took approximately 20 to 30 seconds. You can improve the response speed by lowering the word limit in the script’s prompt, but this may reduce the richness of the analysis. It is therefore recommended to balance speed and content depth by adjusting the word limit in the prompt according to your actual needs.
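
One way to make this trade-off adjustable is to keep the word limit in a single parameter. The sketch below is shown in Python for brevity; the same idea applies to the shell script. The helper name is hypothetical, and `max_tokens` is assumed to be accepted by the chat completions endpoint as in OpenAI-compatible APIs, so check the DeepSeek API documentation before relying on it. The factor of two tokens per word is only a rough guess at a safe cap.

```python
def build_request(alert_text, word_limit=300):
    """Build a request body whose reply length is controlled by word_limit."""
    prompt = (
        f"The following alarm information is received:\n\n{alert_text}\n\n"
        "Please tell me the cause of the alarm and the handling measures "
        f"in short, professional language with a limit of {word_limit} words."
    )
    return {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: a hard cap on generated tokens, roughly 2 tokens per word
        "max_tokens": word_limit * 2,
        "stream": False,
    }


if __name__ == "__main__":
    request = build_request("High CPU load on web01", word_limit=100)
    print(request["messages"][0]["content"])
```

A lower limit shortens the wait but leaves less room for troubleshooting detail, so it is worth testing a few values against real alerts.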

Actual effect display:

2. Scenario Two: Zabbix Documentation Knowledge Base Assistant

In today’s fast-paced IT environment, managing and retrieving information efficiently is crucial. To address this need, we’ve developed the Zabbix KB Assistant, an intelligent knowledge base solution built on MaxKB—an open-source Q&A system leveraging large language models.

This assistant streamlines access to Zabbix’s extensive documentation, making it easier than ever for users to find the information they require.
MaxKB stands out for its seamless integration capabilities, allowing for quick uploads of documents and automatic crawling of online content.

Its flexibility means it can be effortlessly embedded into third-party systems, including our very own Zabbix platform. The project is available at the GitHub repository.

The development process of Zabbix KB Assistant involved configuring MaxKB to recognize and parse the official Zabbix documentation. By utilizing this URL, we ensured that the latest updates and comprehensive guides are always accessible within our assistant. After setting up the core model configurations, we created a dedicated knowledge base tailored to Zabbix’s rich content.

With the knowledge base in place, we proceeded to integrate Zabbix KB Assistant into the Zabbix frontend. This step was essential for providing instant access to users navigating the Zabbix interface. By embedding a floating window mode, users can interact with the assistant without leaving their current page—a feature that significantly enhances user experience.

Actual effect display:

3. Scenario Three: DingTalk Alert Enhancement

By integrating DeepSeek’s deep analysis capabilities, DingTalk can automatically analyze alarm information upon receiving alerts. This integration provides precise fault diagnosis and solutions, aiding IT operations and maintenance personnel in quickly identifying and resolving issues. Consequently, this improves the efficiency of system maintenance and reduces downtime.

3.1 Create a Bot and Configure Security Settings.

First, create a new bot within the DingTalk group and ensure that the keyword “Alarm” is properly configured in the security settings. Next, retrieve the webhook URL for this bot and keep it safe for later use.

3.2 Install Python3 and Necessary Libraries.

Ensure that Python3 along with the required libraries are installed on your system. Depending on your operating system, follow these instructions.

For Ubuntu/Debian systems:

sudo apt update
sudo apt install python3 python3-pip
pip3 install requests

For CentOS/RHEL systems:

sudo dnf install python3
pip3 install requests

3.3 Below is an example script (deepseekdingding.py) located at /usr/lib/zabbix/alertscripts/.

Replace the placeholder webhook URL and DeepSeek API key in the script with your actual values:

#!/usr/bin/env python3
# coding: utf-8
import json
import sys

import requests


class DingTalkBot(object):
    # Send an alarm message to the DingTalk group webhook
    def send_news_message(self, webhook_url, subject, content, ai_response):
        data = {
            "msgtype": "markdown",
            "markdown": {
                "title": subject,
                "text": f"{subject}\n{content}\n\n【DeepSeek analysis】:\n\n{ai_response}"
            }
        }
        headers = {'Content-Type': 'application/json'}
        return requests.post(webhook_url, headers=headers, data=json.dumps(data), timeout=10)


if __name__ == '__main__':
    WEBHOOK_URL = 'https://oapi.dingtalk.com/robot/send?access_token=224c1ff0c6df60a809b3c5b69b8448486b780d292e9d395ac8fbf84980214e30'  # Webhook
    API_URL = 'https://api.deepseek.com/chat/completions'
    API_KEY = "xxxxxxxxxxxxxxxxxxxx"  # DeepSeek API key

    if len(sys.argv) < 3:
        print("Error: Not enough arguments provided.")
        sys.exit(1)

    # First argument is the alert subject, second is the alert message
    subject = str(sys.argv[1])
    content = str(sys.argv[2])

    print(f"Received subject: {subject}")
    print(f"Received content: {content}")

    try:
        headers = {
            'Authorization': f'Bearer {API_KEY}',
            'Content-Type': 'application/json',
        }
        payload = {
            "model": "deepseek-chat",  # DeepSeek model name
            "messages": [
                {"role": "user", "content": f"If you are a professional IT operation and maintenance expert, please tell me the cause of these alarms and handling suggestions in concise and professional language with a limit of 100 words:\n\n{content}"}
            ]
        }
        # The timeout keeps the alert from hanging if the API does not answer
        ai_response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
        ai_response.raise_for_status()
        ai_response_content = ai_response.json().get('choices', [{}])[0].get('message', {}).get('content', '')
    except Exception:
        # Still deliver the alert, with a notice in place of the analysis
        ai_response_content = "\nThe interface call timed out or an error occurred. Please check the configuration and try again"

    bot = DingTalkBot()
    try:
        response = bot.send_news_message(WEBHOOK_URL, subject, content, ai_response_content)
    except requests.RequestException as e:
        print(f"failed: {e}")
        sys.exit(1)

    if response.status_code == 200:
        print("Message sent successfully")
    else:
        print(f"failed: {response.text}")
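
Two small checks may be worth adding around the script above. The sketch below shows a defensive way to pull the reply text out of the API response and a stricter success test for the webhook call. Both helper names are hypothetical, and the `errcode` field is an assumption about the DingTalk robot response format (commonly `{"errcode": 0, "errmsg": "ok"}`), so verify it against the DingTalk documentation.

```python
import json


def extract_ai_content(response_json):
    """Return the first choice's message content, or '' if it is missing."""
    # "or [{}]" also covers an empty "choices" list, which would otherwise
    # raise IndexError on [0]
    choices = response_json.get("choices") or [{}]
    return choices[0].get("message", {}).get("content", "")


def dingtalk_delivered(status_code, body_text):
    """Return True only if the HTTP call succeeded and the body reports errcode 0."""
    if status_code != 200:
        return False
    try:
        body = json.loads(body_text)
    except ValueError:
        return False
    return isinstance(body, dict) and body.get("errcode") == 0


if __name__ == "__main__":
    print(extract_ai_content({"choices": [{"message": {"content": "Check disk usage."}}]}))
    print(dingtalk_delivered(200, '{"errcode": 0, "errmsg": "ok"}'))
```

The second helper matters because a webhook can answer with HTTP 200 while still rejecting the message, for example when the configured security keyword is missing from the text.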

3.5 On the Zabbix home page, go to Alerts – Media types – Create Media type and then enter the following information:

  • Name: aiAlarm-Dingtalk
  • Type: script
  • Script name: deepseekdingding.py
  • Script parameters (in this order, because the script reads the subject first and the message second): {ALERT.SUBJECT} {ALERT.MESSAGE}

3.6 Create an alarm action.

Go to Alerts – Actions – Trigger actions – Create action and set the name to Alarm-deepseek. Configure the action conditions as required:

Edit the action options as follows:

  • Send to media type: aiAlarm-Dingtalk
  • Subject: fault alarm: {EVENT.NAME}
  • Message:

【Zabbix Alarm Notification】

Alarm group: {TRIGGER.HOSTGROUP.NAME}

Alarm host: {HOSTNAME1}

Alarm time: {EVENT.DATE} {EVENT.TIME}

Alert level: {TRIGGER.SEVERITY}

Problem information: {TRIGGER.NAME}

Confirm the update.

3.7 Configure notification rights for users.

Under Users – Users, open the user and add the aiAlarm-Dingtalk media type in the Media tab. Once added, click Update.

Actual effect display:

4. Scenario Four: One-Click System Service Deep Analysis

Our solution integrates DeepSeek analysis to offer a one-click intelligent inspection tool that automates the collection of service configurations, logs, and status from within your system. This information is then sent via API to DeepSeek for comprehensive analysis.

Our approach begins by extracting relevant configuration data, recent log entries, and current service statuses. These pieces of information are combined with predefined prompts and submitted to DeepSeek through its API. For instance, a prompt might look like this:

“Here are the current logs for XXX service:\n\n${recent_logs}\n\nService status is as follows:\n${service_status}\n\nPlease analyze the following four aspects based on this information and provide a concise report within 500 words: service status analysis, configuration review, historical issue examination, and troubleshooting recommendations.”
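
As a rough sketch of how such a prompt could be assembled, the Python below collects recent logs and the service status and fills them into the template. It assumes a systemd host where `journalctl` and `systemctl` are available; the function names, the 50-line log window, and the example service name are all hypothetical choices, not part of the original solution.

```python
import subprocess


def run(cmd):
    """Run a command and return its combined output; report failures as text."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError as exc:
        return f"(could not run {cmd[0]}: {exc})"
    return (result.stdout + result.stderr).strip()


def collect_service_info(service, log_lines=50):
    """Return (recent_logs, service_status) for a systemd service."""
    recent_logs = run(["journalctl", "-u", service, "-n", str(log_lines), "--no-pager"])
    service_status = run(["systemctl", "status", service, "--no-pager"])
    return recent_logs, service_status


def build_inspection_prompt(service, recent_logs, service_status, word_limit=500):
    """Fill the collected data into the analysis prompt."""
    return (
        f"Here are the current logs for {service} service:\n\n{recent_logs}\n\n"
        f"Service status is as follows:\n{service_status}\n\n"
        "Please analyze the following four aspects based on this information "
        f"and provide a concise report within {word_limit} words: "
        "service status analysis, configuration review, "
        "historical issue examination, and troubleshooting recommendations."
    )


if __name__ == "__main__":
    logs, status = collect_service_info("zabbix-server")
    print(build_inspection_prompt("zabbix-server", logs, status))
```

Keeping collection and prompt building separate makes it easy to change the depth of the analysis, for example by raising the log window or adding a configuration file to the prompt.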

DeepSeek processes this input to perform a detailed breakdown across these four areas, delivering structured feedback and actionable insights.
This integration offers deep system analysis and precise optimization suggestions, enabling swift responses to system changes or anomalies. It aids administrators in promptly identifying and addressing issues.

In addition, it’s easily integrated into existing monitoring systems, allowing adjustments to the depth and scope of analysis as needed. The solution boasts high scalability and flexibility, catering to evolving business requirements.

Actual effect display:

The post Next-Level Alert Analysis with DeepSeek and Zabbix appeared first on Zabbix Blog.

Build a Culture of Monitoring and Get Buy-In with Zabbix

Post Syndicated from Michael Kammer original https://blog.zabbix.com/build-a-culture-of-monitoring-and-get-buy-in-with-zabbix/30085/

In today’s fast-paced, interconnected IT world, simply waiting for something to fail before fixing it isn’t good enough. A proactive approach to monitoring, which aims to identify and address potential issues before they escalate into major disruptions, is a necessity rather than a luxury.

Here at Zabbix, we’ve got plenty of reason to believe that we offer the most flexible monitoring solution available on the market today. However, choosing the best monitoring tool for your organization’s needs is only half the battle – you also need to get buy-in from team members who may not understand the need for monitoring, may be fearful of and resistant to change, and may not be familiar with the technologies behind monitoring.

In this post, we’ll take a look at a few strategies you can use to help win over lukewarm or hesitant colleagues and build a culture of monitoring. We’ll also explore how choosing Zabbix for your monitoring needs can make each strategy a bit easier to implement.

Strategy 1: Explain the “why”

One of the first questions that you can anticipate during any change initiative is simply, “what for?” The ethos of “don’t fix what isn’t broken” runs strong in the tech community, and unless you go above and beyond to explain why monitoring matters, your team will remain skeptical.

Zabbix can help you make your case by providing the evidence you need. We’ve got plenty of testimonials available from tech communities worldwide (including PeerSpot, Gartner, and Capterra), and no matter what field you’re in or how big your team is, we’ve most likely got a case study or two available that shows how monitoring with Zabbix was a game changer for a company like yours.

All of this should help you explain the rationale for the change in an open and transparent way. When it comes to monitoring, sharing details on costs, expected benefits, and what will happen if no change is made will build understanding around why monitoring is necessary and why monitoring with Zabbix is the right answer for your team’s needs.

Strategy 2: Show your team what’s in it for them

One of the most effective ways to get employee buy-in for monitoring is by highlighting the benefits it will bring to individual employees. Show how monitoring can simplify their tasks, improve efficiency, and enhance their work experience, and give them concrete examples of how the technology can make their jobs easier or help them to deliver better results.

We recently had a large managed services provider (MSP) use our monitoring solution as a true “force multiplier”, allowing them to monitor their systems, automate tasks based on real-time events, and provide immediate responses to issues without manual intervention. Their engineers report higher job satisfaction because they no longer have to be “on call” at all hours to solve simple issues, while management has seen productivity skyrocket thanks to their team’s newfound ability to find potential issues before they become real problems.

Strategy 3: Turn important stakeholders into monitoring champions

Determine who monitoring will impact and who needs to be kept informed. This might be team leaders, IT staff, end users, and/or an executive sponsor. Getting input from these groups early on will help you anticipate needs and concerns, and you’ll also want to identify influential employees who are enthusiastic about monitoring and get them to help you promote it.

A great way to help them do so is by encouraging them to attend one (or more) Zabbix events – we’ve got free meetings, online meetups, regional conferences, or even our yearly Summit in Latvia. No matter where you happen to be located, there’s a pretty good chance that we’ll soon be bringing your key people a chance to network with like-minded professionals from multiple industries, expand their knowledge, get answers to their questions, and explore how Zabbix can work for them.

Strategy 4: Provide adequate training

Equipping employees with the skills and knowledge they need to get the most out of a monitoring system means gaining a solid understanding of their current capabilities and then finding out which gaps you most urgently need to fill. Chances are, you’ll need to provide guidance, documentation, hands-on demonstrations, and access to experts – and this is another area where Zabbix has you covered.

Zabbix Certified trainings are designed to help your people learn Zabbix inside and out, giving them the practical knowledge they’ll need to increase their productivity and performance. When you explore our training options, you’ll find a wide variety of courses, everything from one-day sessions that cover the basics to week-long sessions that guarantee users the ability to tackle any Zabbix challenge on their own.

In addition, we’ve got plenty of other free resources available to teams and individuals looking to upskill, including our famously active forum, blog, webinars, and newsletter.

Conclusion

Building a culture of monitoring requires commitment from every level of an organization. By choosing Zabbix as the guide to your monitoring journey and following the strategies outlined in this article, you and your team can successfully implement and maintain a robust monitoring strategy that will help you achieve your organization’s IT goals.

To learn more about what Zabbix can do for you, visit our website.

The post Build a Culture of Monitoring and Get Buy-In with Zabbix appeared first on Zabbix Blog.