Tag Archives: How-to

Saving Time with a Custom Zabbix Agent Installer

Post Syndicated from Rizqi Firmansyah original https://blog.zabbix.com/saving-time-with-a-custom-zabbix-agent-installer/31843/

When managing large-scale infrastructure, the process of installing monitoring agents is often repetitive and time-consuming. Administrators must log into each server, manually run installation commands, and configure the agent to connect to the Zabbix server. To address this issue, the Zabbix Agent Deployer custom module was created. This module enables the direct installation of Zabbix agents on multiple hosts from the Zabbix Web interface.

The features of the Zabbix Agent Deployer module include:

  • Bulk host list input using a CSV file.
  • The ability to automatically add hosts to Zabbix and remotely install the Zabbix Agent on the
    associated hosts.
  • The ability to display installation log results directly within the module.

With this approach, administrators can add new hosts to the monitoring system faster and more efficiently.

Key use cases for the Zabbix Agent installer

The Zabbix Agent Deployer module enables several practical scenarios, including:

1. Faster provisioning for new servers – When adding a large number of servers, agents can be installed simultaneously without requiring a login to each machine.

2. Standardized installation – All agents are installed in the same way using a centralized script, reducing the risk of misconfiguration.

3. Easier additional provisioning – Provisioning new servers is easier for users because they don’t need to configure them directly on the server.

Getting started with the Zabbix Agent Deployer module

Solution overview architecture

To use this module, the main steps are:

1. Upload the custom module to the Zabbix frontend in the /usr/share/zabbix/modules/ directory.

2. Enable the module from the Administration → General → Modules page, and click the Scan Directory button. Locate the Zabbix agent deployer module and click Enabled.

3. Once activated, the Zabbix agent deployer module can be accessed in the Data Collection menu. Here’s a screenshot of the Zabbix agent deployer module.

4. Prepare a CSV file like the format below, or download a sample CSV from the module page.

With this CSV file, we will add two hosts to Zabbix to be monitored and automatically install the Zabbix agent on them.

5. Upload the CSV file to the Zabbix agent deployer module page and click Apply.

6. The Zabbix agent deployer module will handle the process of adding hosts to Zabbix and installing the Zabbix agent. The status can be seen as follows:

From the image above, server1 and server2 were successfully added to Zabbix, and the Zabbix agent installation was successful!

7. Check out the Zabbix hosts list page. Hosts will appear according to the uploaded CSV file.

Conclusion

The implementation of this custom Zabbix Agent installer extends Zabbix’s capabilities beyond its built-in functionality. The Zabbix Agent Deployer module enables a more efficient bulk host addition process, as all steps from adding hosts to Zabbix to installing the Zabbix agent can be integrated through a single page.

If you’re interested in implementing this, please contact us. Bangunindo is a premium Zabbix partner in Indonesia. We’re ready to help you design, implement, and optimize your Zabbix solution to suit your needs.

The post Saving Time with a Custom Zabbix Agent Installer appeared first on Zabbix Blog.

Aruba Central API Monitoring with Zabbix

Post Syndicated from Tibor Volanszki original https://blog.zabbix.com/aruba-central-api-monitoring-with-zabbix/31370/

Aruba Central is a SaaS solution that allows you to manage your Enterprise Aruba network environment. Due to the increasing number of cloud migrations, we can expect that more and more Aruba customers will move their on-premise environment to it, which will also mean a change in their monitoring environment. In this article, I will show you how to switch to API- based monitoring using Aruba Central and Zabbix. All custom resources mentioned can be found in my repository.

Aruba Central’s API

Oauth 2.0 is used, so you can forget the simple token management. At the end it is great, but for monitoring purposes it is overkill. There is pretty good documentation (referred to later) regarding how you can generate your access token, but after two hours it expires so you need to continually refresh it. To do this, you must use a refresh token, which can help you to get a new access token AND a new refresh token.

Within two hours, use the latest refresh token to repeat this action again. At this point you can imagine that this is not something you can implement easily by using the Zabbix GUI only. Well, maybe with some javascript magic, but otherwise there is no native support for this logic at this point of time. So how can we do this? In short:

  1. Generate your client credentials
  2. Generate your first token
  3. Schedule the token refresh for every two hours
  4. Update your host macro via Zabbix API
  5. Use the token in Zabbix HTTP agent checks
  6. Monitor your environment based on JSONPath pre-processing

Initial steps within Aruba Central

To manage your API access, you need to launch your “HPE Aruba Networking Central” application, so do NOT look into your workspace modules – the “Personal API clients” menu is NOT what we are looking for. Turn off the “New Central” view – at this point the early access version is not so useful (hopefully it will change soon).

The first time you get there, you will not see any items, but under the “My Apps & Tokens” tab you can click the “Add Apps & Tokens” button and generate it. Technically, this is already enough to start to monitoring your network infrastructure, but within two hours it would stop. So the relevant data for us are the “Client ID” and “Client Secret.” Feel free to revoke the recently created token at the bottom area as we do not need it.

Record your credentials

For this article, I am using a simple file to store all the credentials, which will be sourced into a bash script. Please keep in mind that storing your sensitive credentials in a single file is a BAD practice! Your SECO/CISO would probably have a few words with you about it, so please consider a better approach. A more secure way would be to use some Key Vault solution (like Azure, AWS, Google, or Hashicorp). Anyway, let’s continue with this unsecure example:

#!/bin/bash

### ZABBIX VARS ###

# URL of your zabbix instance (assuming you do not use the "/zabbix" ending, if yes, then add it to the end)
zabbix_url="https://your.zabbix.instance.net"
# Your Zabbix API token. If you do not know how to get it, check the documentation.
zabbix_api_token="1234_your_zabbix_api_key_5678"
# Create a host with a macro, remain at the "Macros" tab, turn on debug mode, look for "[hostmacroid] =>"
zabbix_macro_id="12345"

### ARUBA VARS ###
# To find yours, go here and check "Table: Domain URLs for API Gateway Access"
base_url="YOUR_ARUBA_CENTRAL_BASE_URL"
# Click on your profile in the Central app and you will find it there: 32 char long hexa string
client_id="YOUR_CLIENT_ID"
# provided in the previous step
client_secret="YOUR_CLIENT_ID"
# provided in the previous step
customer_id="YOUR_CUSTOMER_ID"
# your login credential
account_username="YOUR_CENTRAL_LOGIN_USERNAME"
# your login credential
account_password="YOUR_CENTRAL_LOGIN_PASSWORD"
# to be populated later
csrftoken=""
session=""
auth_code=""

Get or refresh your token and update the Zabbix host macro

The next steps are based on the official Aruba documentation, which you can find here. Please remember that there are many ways to achieve our target – this is just one example and probably not the most optimal one. Feel free to change / improve it with your code in your preferred scripting language.

The below script assumes that the file containing the credentials (previous step) is named as “variables” and located in the folder named “central.

Filename: aruba_central_token_new.sh

Purpose: To be used for first time token generation. Later, you only have to refresh your token with the script after this one.

Remarks: Aruba is limiting this API query set, so you can run it only ONCE every 30 minutes! If you made a typo somewhere, wait 30 minutes before your next attempt or tweak the result files.

#!/bin/bash

basedir=central
source $basedir/variables

curl -s --noproxy '*' -v --cookie-jar $basedir/cookie --location --request POST "$base_url/oauth2/authorize/central/api/login?client_id=$client_id" \
--header "Content-Type: application/json" \
--data-raw "{
    \"username\": \"$account_username\",
    \"password\": \"$account_password\"
}" > $basedir/result1.raw 2>&1

grep 'Added cookie' $basedir/result1.raw > $basedir/result1.filtered

csrftoken=$(grep csrftoken $basedir/result1.filtered | awk -F '"' '{print $2}')
session=$(grep session $basedir/result1.filtered | awk -F '"' '{print $2}')

curl -s --noproxy '*' --request POST "$base_url/oauth2/authorize/central/api?client_id=$client_id&response_type=code&scope=all" \
--header "Content-Type: application/json" \
--header "Cookie: session=$session" \
--header "X-CSRF-Token: $csrftoken" \
--data-raw "{
\"customer_id\": \"$customer_id\"
}" > $basedir/result2.raw

auth_code=$(cat $basedir/result2.raw | jq -r .auth_code)

curl -s --noproxy '*' --request POST "$base_url/oauth2/token" \
--header "Content-Type: application/json" \
--data "{
    \"client_id\": \"${client_id}\",
    \"client_secret\": \"${client_secret}\",
    \"grant_type\": \"authorization_code\",
    \"code\": \"${auth_code}\"         
}" > $basedir/result3.raw

refresh_token=$(cat $basedir/result3.raw | jq -r .refresh_token)
access_token=$(cat $basedir/result3.raw | jq -r .access_token)

if [ "$refresh_token" == "null" ]; then
    echo "something went wrong... exiting now"
    exit 1
fi

echo $access_token > $basedir/token_access.latest
echo $refresh_token > $basedir/token_refresh.latest

echo "access_token: $access_token"
echo "refresh_token: $refresh_token"

curl -s --request POST \
--url "$zabbix_url/api_jsonrpc.php" \
--header "Authorization: Bearer $zabbix_api_token" \
--header "Content-Type: application/json-rpc" \
--data "{\"jsonrpc\": \"2.0\",\"method\": \"usermacro.update\",\"params\": {\"hostmacroid\": \"${zabbix_macro_id}\",\"value\": \"${access_token_new}\"},\"id\": 1}"

rm -f $basedir/cookie

Filename: aruba_central_token_refresh.sh

Purpose: To refresh your existing token. It is expecting an existing refresh token in the “token_refresh.latest” file, so better to run the previous script one time before this.

Remarks: You can run this script as many times you want, but it will result in new tokens only once per every two hours (when the current one expires). Therefore, refreshing too frequently is pointless.

#!/bin/bash

basedir=central
source $basedir/variables

refresh_token_current=$(cat $basedir/token_refresh.latest | tr -d '\n')
refresh_token_new=""

curl -s --noproxy '*' --request POST "$base_url/oauth2/token?client_id=$client_id&client_secret=$client_secret&grant_type=refresh_token&refresh_token=$refresh_token_current" > $basedir/result4.raw

refresh_token_new=$(cat $basedir/result4.raw | jq -r .refresh_token)
access_token_new=$(cat $basedir/result4.raw | jq -r .access_token)
expires_in=$(cat $basedir/result4.raw | jq -r .expires_in)

if [ "$refresh_token_new" == "null" ]; then
    echo "something went wrong... exiting now"
    exit 1
fi

echo $access_token_new > $basedir/token_access.latest
echo $refresh_token_new > $basedir/token_refresh.latest

echo "access_token: $access_token_new"
echo "refresh_token: $refresh_token_new"
echo "expires_in: $expires_in"

curl -s --request POST \
--url "$zabbix_url/api_jsonrpc.php" \
--header "Authorization: Bearer $zabbix_api_token" \
--header "Content-Type: application/json-rpc" \
--data "{\"jsonrpc\": \"2.0\",\"method\": \"usermacro.update\",\"params\": {\"hostmacroid\": \"${zabbix_macro_id}\",\"value\": \"${access_token_new}\"},\"id\": 1}"

In my case, both the scripts and variables files are in the same “central” folder, which is in a git repository. Each time I call one of the scripts, it will record the new tokens in files, which are committed and pushed to the repo. In my own implementation, this is how I call the refresh script and sync the result with my repo:

git checkout master

basedir=central
source $basedir/variables
bash $basedir/aruba_central_token_refresh.sh

git add .
git commit -m "save the new tokens"
git push origin master

Schedule your token management

You must run your refresh script at least once per every two hours. To make this happen you have many options, including:

  • cron (old-school, outdated way)
  • systemctl timer (a better way, but only if it is monitored)
  • Jenkins / Github Actions/etc.
  • Zabbix itself, by calling your bash script

In my case, Jenkins does the scheduling and execution and the job is monitored via Zabbix.

Monitor your network infrastructure

When everything is in place, then the monitoring part is pretty simple. The usual JSONPath based logic can be used. API call documentation can be found here. The template contains only the wireless components, since I do not have my switches in Central. Implementing the switching part should not be difficult – just have a look at the “Switch” section, then clone and adjust one of your “get” items.

Screenshots

Latest data – tag based filtering:

Latest data – Site health

Latest data – Gateway info

Latest data – AP info

Triggers:

Some triggers are intentionally disabled, because they are a bit redundant. However, I wanted to cover all options. Sometimes less alerting is better if you have a ticketing system integration, otherwise your monitoring system will turn into a ticket factory.

Known issues and limitations

Since we are not querying the devices directly, some delay can be expected. Based on my recent testing, the delay compared to real time is between 3-10 minutes. In my test I disconnected my test environment and then started to do manual updates frequently. Some items got the real state earlier, some only later.

If your refresh script will malfunction for whatever reason (normally it should not), then you may have to run the other script once to generate a new token, or you can go to the GUI and check the last refresh token, with which you can override the content of the “token_refresh.latest” file.

Aruba is limiting the number of API queries to 5,000 per day. This could seem annoying, but it is way more than what you need (you should expect less than 1,000 in normal conditions, depending on your update frequency).

Zabbix API will not authorize your call unless you insert a line into your apache vhost configuration. This is a more generic Zabbix API issue that is not related to Aruba Central.

SetEnvIf Authorization "(.*)" HTTP_AUTHORIZATION=$1

If Aruba Central has a maintenance activity, then the token refreshing way could break. Running the token request script once should address the issue.

Summary

Aruba Central’s API is pretty decent, but if you start from zero it could take a while to get to the end of it. With this guide, my intention was to speed you up, but please do not consider my scripts and the shown example as the only or best possible way – I’m just hoping it can give you a good base for your own solution. Have fun!

The post Aruba Central API Monitoring with Zabbix appeared first on Zabbix Blog.

Improving throughput of serverless streaming workloads for Kafka

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/improving-throughput-of-serverless-streaming-workloads-for-kafka/

Event-driven applications often need to process data in real-time. When you use AWS Lambda to process records from Apache Kafka topics, you frequently encounter two typical requirements: you need to process very high volumes of records in close to real-time, and you want your consumers to have the ability to scale rapidly to handle traffic spikes. Achieving both necessitates understanding how Lambda consumes Kafka streams, where the potential bottlenecks are, and how to optimize configurations for high throughput and best performance.

In this post, we discuss how to optimize Kafka processing with Lambda for both high throughput and predictable scaling. We explore the Lambda’s Kafka Event Source Mappings (ESMs) scaling, optimization techniques available during record consumption, how to use ESM Provisioned Mode for bursty workloads, and which observability metrics you need to use for performance optimization.

Overview

To start processing records from a Kafka topic with a Lambda function, whether using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or a self-managed Kafka cluster, you create an ESM: a lightweight serverless resource that consumes records from Kafka topics and invokes your function.

The scaling behavior of Kafka ESMs is based on the offset lag. This is a metric indicating the number of records in the topic that have not yet been consumed by the Lambda function. This metric typically grows when producers publish new records faster than consumers process them. As the lag grows, the Lambda service gradually adds more Kafka consumers (also known as pollers) to your ESM. To preserve ordering guarantees, the maximum number of pollers is capped by the number of partitions in the topic. Lambda also scales pollers down automatically when lag decreases.

Each ESM follows a consistent polling workflow: poll -> filter -> batch -> invoke, as shown in the following diagram. Every stage has configurable options that directly affect performance, latency, and cost.


Figure 1. ESM processing workflow.

Polling: Increasing predictability with Provisioned Mode

By default, Kafka ESM uses the on-demand polling mode. In this mode, ESM starts with one poller, automatically adds more pollers when the offset lag grows, and scales the number of pollers down as lag decreases. On-demand mode does not need upfront scaling configuration and is the lowest-cost option for steady workloads. For many applications, this behavior is sufficient: scaling up can take several minutes, but the throughput eventually catches up, and you only pay for the resources you use, such as number of invocations.

However, if your workloads are bursty and latency-sensitive, then on-demand scaling may not be fast enough and can result in a rapidly growing lag. This can be addressed by switching to Provisioned Mode, which gives you more fine-grained control to configure a minimum and maximum number of always-on pollers for your Kafka ESM. These pollers remain connected even when traffic is low, so consumption begins immediately when a spike occurs, and scaling within the configured range is faster and more predictable.

The following diagram shows the performance improvements of using the ESM in Provisioned Mode for bursty workloads. You can see that in on-demand mode it took ESM over 15 minutes to eventually catch up to the new traffic volume, while in Provisioned Mode the ESM handled the traffic increase instantly.


Figure 2. Comparing Kafka ESM on-demand and Provisioned Mode.

Best practices for using Provisioned Mode:

  • Start small: Provisioned Mode is a paid capability. AWS recommends that for smaller topics (less than 10 partitions) you start with a single provisioned poller to evaluate throughput and observe workload behavior. For larger topics, you can start with a higher number of provisioned pollers to accommodate the baseline consumption. You can adjust this configuration at any time as you learn traffic patterns and refine your performance targets.
  • Estimate throughput: A single provisioned poller can process up to 5 MB/s of Kafka data. Monitor your average record size and per-record processing time to establish a baseline for minimum and maximum pollers, then validate with real workload metrics.
  • Set a low floor and flexible ceiling: Choose a minimum number of pollers that makes sure that latency targets are met when a traffic burst occurs, then allow the ESM to scale toward a higher maximum as needed.

See Low latency processing for Kafka event sources for more information.

To summarize:

  • Use Provisioned Mode for bursty traffic, strict SLOs, or when backlogs pose downstream risk.
  • Use on-demand polling mode for steady traffic, flexible latency requirements, or when minimizing cost is the primary objective.

Filtering: Drop irrelevant records early

By default, all records from Kafka are delivered to your Lambda function. This approach is direct and flexible. Your handler code decides which records to process and which to ignore. This default behavior is highly efficient for workloads where nearly all records are valuable.

When you find yourself discarding a large portion of records in your handler code, you can use native ESM filtering capabilities to drop irrelevant records before they reach your function. You can filter early to reduce cost, free up concurrency, increase throughput, and make sure that your Lambda function spends cycles on valuable work only.

The following diagram shows the application of an ESM filter to only process telemetry that meets a specified condition.


Figure 3. ESM filtering configuration.

Batching: Processing more records per invocation

You can batch multiple Kafka records together to process more data per invocation and increase the efficiency of your Lambda functions. Larger batches help you achieve higher throughput and reduce costs by making better use of each invocation run. To get the best results, you should balance batch size and latency targets and adjust the configuration based on your workload’s specific traffic patterns and SLOs.

Lambda gives you two primary controls for configuring ESM batching behavior:

  • Batch window: This is how long the ESM waits to accumulate records before invoking your function. A shorter window produces smaller batches and more frequent invocations. A longer window (up to 5 minutes) produces larger batches and less frequent invocations.
  • Batch size: This is the maximum number of records that the ESM can accumulate before invoking your function, up to 10,000.

There’s no single setting that universally works for all workloads. Your optimal configuration depends on workload characteristics such as latency tolerance and record size. AWS recommends starting with the default values and then gradually adjusting the configuration based on your requirements. For example, you can increase the batch size while monitoring function duration, error rates, and end-to-end latency.

The following diagram shows how to configure batch window and size using Terraform:


Figure 4. ESM batch window and batch size configuration with Terraform.

The ESM invokes your function when one of the following three conditions is met:

  1. The batch window elapses.
  2. The accumulated batch reaches the configured maximum batch size.
  3. The accumulated payload approaches the 6 MB maximum invocation payload limit of Lambda.

When using higher batch window values during traffic spikes, you typically see more records-per-batch and longer function invocation durations. This is normal: larger batches can take longer to process. Always interpret the Duration metric in the context of the batch size being processed.

Invoke: Process each batch faster and more efficiently

You control how quickly each batch completes through two main factors: the efficiency of your function code and the compute resources you allocate to your functions. You can improve both to process more records per second, reduce the necessary concurrency, and lower cost.

Optimize your code: Review your function handler code to identify where you can reduce work per record. For example, eliminate redundant serialization, initialize dependencies once during function startup, and consider parallel processing within the handler (where applicable). For performance-critical workloads, you can also choose languages that compile to binary, such as Go or Rust, which typically deliver high performance with lower resource usage.

Tune compute resources: Increasing the memory function allocation proportionally increases vCPU. Use the Lambda PowerTuning tool to find the memory configuration that best balances performance and cost for your workload.

Correlate metrics: As you optimize, monitor Duration and Concurrency. You should see the concurrency drop as duration improves. That correlation confirms that your changes are improving the system throughput and efficiency.

When you combine handler optimizations with early filtering and efficient batching, even small improvements can make your pipeline noticeably faster to operate under load.

Observability drives good decisions

You can’t optimize what you can’t see. To tune your data processing pipeline, use a combination of OffsetLag, function invocation metrics, and Kafka broker metrics to understand your data processing performance. OffsetLag tells you whether your function is keeping up with incoming records, as shown in the following figure. Function metrics such as Duration, Concurrency, Errors, and Throttles show how efficiently your code is processing record batches. If you use Provisioned Mode, then you can use the Provisioned Pollers metric to track the poller capacity.


Figure 5. Kafka consumption observability with Amazon CloudWatch.

Always interpret function duration in the context of batch size. During traffic spikes, you can typically observe both duration and actual batch size increase, which is expected amortization, not a regression. For alerting, monitor lag growth, unexpected drops in invocation rate, and error spikes. With these signals in place, you can detect issues early and tune your configuration with confidence.

A sample step-by-step optimization loop

  1. Establish a clean baseline: Make your handler idempotent and batch-aware, start with a short batch window and moderate batch size. Monitor your ESM and confirm offset lag stays near zero at steady state.
  2. Filter early: Move static checks (record type, version, other custom properties) into ESM filtering and verify invoked counts drop relative to polled counts, proving the filter saves cost and concurrency.
  3. Increase batch size gradually while monitoring the duration, error rates, and latency metrics. Extend the batch window slightly if spikes cause too many invocations.
  4. Speed up the handler: Increase memory for more CPU, reduce per-record I/O, remove redundant serialization, and parallelize safely inside the batch while tracking duration and concurrency metrics together.
  5. Prove spike readiness: Replay realistic surges, monitor offset lag and drain time, and enable Provisioned Mode with a small minimum if recovery takes too long, adjusting with MB/s-per-poller estimates.
  6. Implement alerting: Watch for sustained lag growth, unexpected gaps between polled and invoked, and error spikes tied to partitions or large batches. Always read metrics in context with batch size.
  7. Re-evaluate periodically: Re-measure system throughput, confirm filter effectiveness, and retune batch and memory settings regularly as workloads evolve.

Conclusion

Optimizing Kafka streams processing with AWS Lambda necessitates understanding how ESMs work and tuning consumption components: polling, filtering, batching, and invoking. Filtering redundant records early removes unnecessary work, batching helps you process more records per invocation, and handler optimizations make sure that you make the most of the compute that you allocate. Together, these adjustments let you scale efficiently and keep offset lag under control.

When your workload is bursty, use Provisioned Mode to absorb spikes without long recovery times. With the right alerts on lag, errors, and unexpected polled versus invoked behavior, you can spot problems early and adjust before they impact users. Following this optimization guide gives you a practical way to measure, tune, and revisit your setup as traffic patterns change.

To learn more about optimizing Kafka consumption, see the AWS re:Invent 2024 session about Improving throughput and monitoring of serverless streaming workloads.

To learn more about building Serverless architectures see Serverless Land.

Safely Handle Configuration Drift with CloudFormation Drift-Aware Change Sets

Post Syndicated from JJ Lei original https://aws.amazon.com/blogs/devops/safely-handle-configuration-drift-with-cloudformation-drift-aware-change-sets/

Introduction

Is configuration drift preventing you from accessing the speed, safety, and governance benefits of AWS CloudFormation for infrastructure management? Configuration drift occurs when cloud resources are modified outside of CloudFormation, leading to a mismatch in the actual state and template definition of resources. Drift tends to accumulate from infrastructure changes that engineers make via the AWS Management Console to resolve production incidents or troubleshoot malfunctioning applications. Drift can cause unexpected changes during subsequent IaC deployments or leave resources in a non-compliant state. Unresolved drift can lead to cost increases when resources are over-provisioned outside of template definitions, or compliance violations that may result in audit penalties. Additionally, drift makes it hard to reproduce applications for testing or disaster recovery.

CloudFormation now offers drift-aware change sets that allow you to safely handle configuration drift and keep your infrastructure in sync with your templates. In this post, we will explore the process of leveraging drift-aware change sets to resolve common scenarios in which drift impacts the availability or security of your application.

Solution Overview

Drift-aware change sets are a type of CloudFormation change sets that can bring drifted resources in line with template definitions and preview the required changes to actual infrastructure states before deployment. Drift-aware change sets surface a three-way comparison of your new template, actual resource states, and previous template before deployment, allowing you to prevent unexpected overwrites of drift. Additionally, drift-aware change sets offer you a systematic mechanism to restore drifted resources to approved template definitions, strengthening the reproducibility and compliance posture of applications. You can create drift-aware change sets either from the CloudFormation Management Console or from the AWS CLI or SDK by passing the --deployment-mode REVERT_DRIFT parameter to the CreateChangeSet API.

Prerequisites

AWS CLI latest version with CloudFormation permissions configured.

AWS Identity and Access Management (IAM) permissions required: Permissions to create and manage CloudFormation stacks, AWS Lambda functions, Security Groups, Amazon Simple Storage Service (Amazon S3) buckets, and IAM roles. PowerUserAccess or Administrator access recommended for testing.

• Test environment (non-production AWS account recommended)

• Basic CloudFormation knowledge (stacks, templates, change sets)

Important Note: These sample templates are provided for educational purposes only and should not be used in production environments without proper security review and testing. You are responsible for testing, securing, and optimizing these templates based on your specific quality control practices and standards. Deploying these templates may incur AWS charges for creating or using AWS resources. Work with your security and legal teams to meet your organizational security, regulatory, and compliance requirements before any production deployment.

Scenario 1: Prevent Dangerous Overwrites

This scenario demonstrates how drift-aware change sets prevent dangerous overwrites when Lambda function memory is increased outside of CloudFormation during an outage, and a subsequent template update could accidentally reduce memory, causing performance issues.

Story: Your team deploys a Lambda function with 128 MB memory via CloudFormation. During a production outage, an engineer increases the memory to 512 MB through the Lambda Console to resolve performance issues. Later, another developer updates the template to 256 MB for a code change, unaware of the console modification. Without drift-aware change sets, CloudFormation would unexpectedly reduce memory from 512 MB to 256 MB—potentially causing the outage to recur.

User journey: Create stack with 128MB => Increase memory to 512MB via console during outage => Create drift-aware change set with 256MB template => Review three-way comparison showing dangerous memory reduction => Cancel change set to prevent outage => Update template to match production state (512MB) => Create and execute drift-aware change set with updated template (512MB) to resolve drift

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with Lambda function (128 MB memory).

Figure 1

CloudFormation stack “lambda-memory-drift-test” successfully deployed with CREATE_COMPLETE status

2. Emergency Memory Increase (Console)

Manually increase Lambda memory to 512 MB through AWS Console (simulating emergency performance fix during outage).

Figure 2

Initial Lambda function showing 128 MB memory as configured in template

Figure 3

Lambda memory increased to 512 MB through console during outage, creating drift from template

3. Create Drift-Aware Change Set

Create change set with 256 MB template using drift-aware mode to reveal the dangerous memory reduction.

Figure 4

CloudFormation console showing the new “Drift aware change set” option selected. This compares the new template with the live state of your stack and shows changes to drifted resources before deployment, unlike standard change sets that only compare templates.

aws cloudformation create-change-set \
--stack-name lambda-memory-drift-test \
--change-set-name detect-memory-overwrite \
--template-body file://lambda-memory-drift-scenario-256mb.yaml \
--deployment-mode REVERT_DRIFT \
--capabilities CAPABILITY_IAM \
--region us-east-1

4. Review Change Set – The Critical Three-Way Comparison

Examine the drift-aware change set to see the dangerous memory reduction that would occur.

Figure 5

Critical insight revealed: The change set shows Live resource state (512 MB) vs Proposed resource state (256 MB), revealing a dangerous memory reduction that would impact performance.

Figure 6: view drift

Drift analysis: Clicking “View drift” reveals the complete picture – Previous template (128 MB) vs Live resource state (512 MB). This shows the live state has 4x more memory than the original template, indicating emergency changes were made during the outage that must be preserved.

Key Insight: The drift-aware change set reveals that:

  • Previous template: 128 MB (original deployment)
  • Live resource state: 512 MB (emergency change during outage)
  • Proposed template: 256 MB (new deployment)

This would cause a dangerous reduction from 512 MB to 256 MB, potentially recreating the original performance issue. Without drift-aware change sets, this critical information would be hidden.

5. Recreate Drift-aware Change Set with Updated Template (512MB) to Resolve Drift

Update the template to match the live production state (512 MB) and create a new drift-aware change set to safely resolve the drift.

Figure 7

Resolution confirmed: The drift-aware change set shows both Live resource state and Proposed resource state at 512 MB, with change set action ” Sync with live”. This verifies that the updated template now matches production, preventing the dangerous memory reduction and safely resolving the drift without impacting performance.

CloudFormation Templates

Initial Template (128 MB):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 128
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Updated Template (256 MB – lambda-memory-drift-scenario-256mb.yaml):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 256
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name lambda-memory-drift-test --template-body file://lambda-memory-drift-scenario.yaml --capabilities CAPABILITY_IAM --region us-east-1
  1. Get function name:
aws cloudformation describe-stack-resources --stack-name lambda-memory-drift-test --logical-resource-id DriftTestFunction --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name lambda-memory-drift-test --change-set-name detect-memory-overwrite --template-body file://lambda-memory-drift-scenario-256mb.yaml --deployment-mode REVERT_DRIFT --capabilities CAPABILITY_IAM --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name detect-memory-overwrite --stack-name lambda-memory-drift-test --region us-east-1

Scenario 2: Remediate Unauthorized Changes

This scenario demonstrates how drift-aware change sets systematically remediate unauthorized changes when a developer adds temporary debugging rules to a security group but forgets to remove them, creating a compliance violation.

Story: Your team deploys a security group with only HTTP access via CloudFormation for compliance. During debugging, a developer adds SSH access (port 22) through the AWS Console for their IP address to troubleshoot an application issue. They forget to remove this rule after debugging. Later, security compliance requires reverting to the original template state. A standard change set shows no changes since the template is unchanged, but a drift-aware change set can detect and systematically remove the unauthorized SSH rule.

User journey: Create stack with HTTP-only access => Add SSH rule via console for debugging => Forget to remove SSH rule => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing SSH rule removal => Execute change set to restore compliance

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with security group allowing only HTTP traffic.

Figure 8

CloudFormation stack “sg-revert-drift-test” successfully deployed with DriftTestSecurityGroup resource

2. Make Unauthorized Changes (Console)

Manually add SSH ingress rule through AWS Console (simulating developer debugging access that wasn’t removed).

Figure 9: http only

Initial security group showing only HTTP (port 80) access as configured in template – compliant state

Figure 10: ssh-added

Security group now shows 2 permission entries: SSH (port 22) for specific IP and HTTP (port 80) for all traffic. The SSH rule creates drift and a compliance violation that needs systematic removal.

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to systematically remove the unauthorized SSH rule.

Figure 11

Creating drift-aware change set for security group compliance restoration. Note the “Drift aware change set” option is selected to compare with live state and detect unauthorized changes.

aws cloudformation create-change-set \
--stack-name sg-revert-drift-test \
--change-set-name revert-ssh-drift \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Systematic Compliance Restoration

Examine the drift-aware change set to see systematic removal of unauthorized SSH rule.

Figure 12

Compliance violation detected: The drift -aware change set shows that the SSH rule in the live resource state (rule 232 for IP 15.248.7.53/32 on port 22) is not present in the proposed resource state derived from the template. This unauthorized SSH rule violates security policy and will be systematically removed

Key Insight: The drift-aware change set enables systematic compliance restoration by:

  • Previous template: Only HTTP (port 80) access – compliant state
  • Live resource state: HTTP + SSH (port 22) for 15.248.7.53/32 – compliance violation
  • Action: Remove unauthorized SSH rule to restore compliance

This provides a systematic, auditable way to remove unauthorized changes rather than manual cleanup.

Figure 13

Stack events showing successful execution of the drift-aware change set – SSH rule removed

CloudFormation Templates

security-group-drift-scenario.yaml:

Resources:
  DriftTestSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Security group for drift testing"
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
          Description: "Allow HTTP traffic for demo purposes"
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: 0.0.0.0/0
          Description: "Allow all outbound traffic"

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name sg-revert-drift-test --template-body file://security-group-drift-scenario.yaml --region us-east-1
  1. Get security group ID:
aws ec2 describe-security-groups --filters "Name=tag:aws:cloudformation:stack-name,Values=sg-revert-drift-test" --query 'SecurityGroups[0].GroupId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name sg-revert-drift-test --change-set-name revert-ssh-drift --template-body file://security-group-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name revert-ssh-drift --stack-name sg-revert-drift-test --region us-east-1

Scenario 3: Recreate Deleted Resources

This scenario demonstrates drift detection when a dependent resource (logs bucket) is accidentally deleted outside of CloudFormation during troubleshooting. The main application bucket depends on this logs bucket for access logging. You need to recreate the deleted resource while maintaining the existing infrastructure dependencies.

Story: Your team deploys a main S3 bucket with a dependent logs bucket for access logging via CloudFormation. During troubleshooting, an operator accidentally deletes the logs bucket through the AWS Console. The main bucket still exists but its logging configuration now references a non-existent bucket. You need to recreate the deleted logs bucket while maintaining the dependency relationship.

User journey: Create stack with main and logs buckets => Accidentally delete logs bucket => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing LogBucket will be recreated => Execute change set to restore deleted resource

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with main S3 bucket and dependent logs bucket.

Figure 14

CloudFormation stack “s3-deletion-drift-test” successfully deployed with both LogBucket and MainBucket resources in CREATE_COMPLETE status

2. Accidental Deletion (Console)

Manually delete the logs bucket through AWS Console (simulating accidental deletion during troubleshooting).

Figure 15

LogBucket accidentally deleted outside of CloudFormation during troubleshooting, creating drift – the MainBucket still exists but its logging configuration now references a non-existent bucket

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to recreate the deleted LogBucket.

Figure 16

Creating drift-aware change set with “Drift aware change set” option selected to detect and recreate the deleted resource by comparing template with live state

aws cloudformation create-change-set \
--stack-name s3-deletion-drift-test \
--change-set-name recreate-deleted-bucket \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Resource Recreation

Examine change set to see LogBucket recreation while preserving MainBucket dependencies.

Figure 17

Change set preview showing LogBucket will be recreated to restore the deleted resource and MainBucket updated to maintain infrastructure dependencies

Key Insight: The drift-aware change set detects that:

  • Template expectation: Both LogBucket and MainBucket should exist
  • Live resource state: Only MainBucket exists, LogBucket is missing
  • Action: Recreate LogBucket with original configuration to restore logging functionality

This enables systematic recovery of accidentally deleted resources while maintaining infrastructure dependencies.

CloudFormation Templates

s3-drift-scenario.yaml:

Resources:
  LogBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
  
  MainBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
      LoggingConfiguration:
        DestinationBucketName: !Ref LogBucket

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name s3-deletion-drift-test --template-body file://s3-drift-scenario.yaml --region us-east-1
  1. Get LogBucket name:
aws cloudformation describe-stack-resources --stack-name s3-deletion-drift-test --logical-resource-id LogBucket --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name s3-deletion-drift-test --change-set-name recreate-deleted-bucket --template-body file://s3-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name recreate-deleted-bucket --stack-name s3-deletion-drift-test --region us-east-1

Best Practices

When working with drift-aware change sets, consider these best practices:

Always review three-way comparisons before executing change sets to understand the full impact

Use REVERT_DRIFT deployment mode when you want to bring resources back to template compliance

Document emergency changes made outside of CloudFormation to inform future template updates

Implement change management processes to minimize unauthorized drift

Regular drift detection helps identify configuration changes before they become problematic

Test drift-aware change sets in non-production environments first

Cleanup

Important: Execute these cleanup commands promptly after completing the scenarios to avoid incurring unnecessary AWS charges. Resources such as Lambda functions, S3 buckets (even if empty), and security groups may incur costs if left running. Ensure all stacks are successfully deleted by verifying the DELETE_COMPLETE status.

Commands to delete all test resources:

# Scenario 1: Lambda Memory Drift
aws cloudformation delete-stack --stack-name lambda-memory-drift-test --region us-east-1

# Scenario 2: Security Group Drift
aws cloudformation delete-stack --stack-name sg-revert-drift-test --region us-east-1

# Scenario 3: S3 Bucket Deletion Drift
aws cloudformation delete-stack --stack-name s3-deletion-drift-test --region us-east-1

# Verify all stacks are deleted
aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE --region us-east-1

Note: CloudFormation will automatically clean up all resources created by the stacks, including Lambda functions, security groups, and S3 buckets.

Conclusion

Drift-aware change sets enable you to mitigate the operational and security risks of configuration drift, allowing you to confidently automate and govern your infrastructure updates with CloudFormation. Through the scenarios described in this post, you have seen how you can leverage drift-aware change sets to prevent outages in production environments, maintain the integrity of your test environments, and manage the compliance posture of all environments. Remember to thoroughly review the infrastructure changes previewed by drift-aware change sets before executing deployments.

Available Now

Drift-aware change sets are available in AWS Regions where CloudFormation is available. Please refer to the AWS Region table to learn more.

Making PaperCut NG Observable with Zabbix

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/making-papercut-ng-observable-with-zabbix/31244/

In most organizations, printing is an essential but often invisible service. When it works, nobody notices. When it fails, productivity stalls. That’s why monitoring your print environment is just as important as monitoring servers, databases, or network devices.

At Opensource ICT Solutions, we specialize in turning complex systems into observable services. One recent example is our integration of PaperCut NG with Zabbix. This allows IT teams to track the health of their print infrastructure in real-time — everything from server resources to individual printers and devices.

Why monitoring PaperCut matters

PaperCut NG does much more than queue print jobs. It enforces quotas, integrates with authentication systems, and manages fleets of devices. If the database runs out of connections, the disk fills up, or the license expires, users feel the impact instantly.

By integrating PaperCut with Zabbix, we make these risks visible long before they become business problems. The result is:

  • Proactive detection of printer errors, low toner, or license issues.
  • Capacity planning through trend analysis of disk usage, memory, and DB connections.
  • Unified visibility — PaperCut health checks appear right alongside servers, networks, and applications in Zabbix dashboards.

How the integration works

The magic happens through the PaperCut System Health API and Zabbix’s flexible data collection methods.

HTTP agent items

Zabbix fetches raw JSON data directly from PaperCut using an HTTP agent item, such as:

This single call provides a full snapshot of server health.

Dependent items + JSONPATH

Instead of hammering the API with multiple requests, we extract the needed fields using dependent items with JSONPATH preprocessing.

For example:

This design means one request can populate dozens of metrics, keeping monitoring both efficient and lightweight.

Calculated items

Some values aren’t directly available from PaperCut. In those cases, we create calculated items inside Zabbix.

For example, the percentage of active DB connections is derived as:

This allows us to set intelligent triggers like “DB connections > 90%” without requiring PaperCut to calculate it for us.

Low-level discovery (LLD) for devices and printers

Perhaps the most powerful part of this integration is automatic discovery.

  • Printer LLD → Queries /api/health/printers and creates items and triggers per printer. If a printer goes into Paper Jam or No Toner, Zabbix knows immediately.
  • Device LLD → Queries /api/health/devices and builds items dynamically for each discovered device, tracking states like OK, WARNING, or ERROR.

This ensures that new printers and devices are monitored automatically — no manual configuration required!

Why this matters

Bringing all of this together, the integration turns PaperCut NG into a fully observable service inside Zabbix.

  • Efficiency → One API call, dozens of metrics.
  • Scalability → Automatic discovery of printers and devices.
  • Robustness → Alerts and dashboards for licenses, resources, and print queues.

For IT teams, this means fewer surprises, faster troubleshooting, and more confidence in a service that often goes unnoticed until it fails.

Our expertise

This PaperCut integration is just one example of how we at Opensource ICT Solutions help organizations unlock the full potential of Zabbix. We don’t just install monitoring – we design intelligent, scalable integrations that make hidden systems visible. Whether it’s print management, databases, custom applications, or network devices, we know how to extend Zabbix to fit your environment and give you the insights that matter most.

Feel free to download our template and documentation for free from our GitHub: https://github.com/OpensourceICTSolutions/ZabbixPapercutNG

Want to make your business-critical systems truly observable? Let’s talk about how we can tailor Zabbix to your needs: [email protected]

 

The post Making PaperCut NG Observable with Zabbix appeared first on Zabbix Blog.

Monitoring Website Changes with Zabbix Browser Item

Post Syndicated from Adi Rusmanto original https://blog.zabbix.com/monitoring-website-changes-with-zabbix-browser-item/31684/

In today’s digital era, information is an asset and most of it is obtained from websites. The ability to automatically monitor website content changes has become a crucial competitive advantage, as even small changes on a website can affect business strategies, security postures, and data-driven decision-making. Accordingly, Zabbix 7.0 saw the introduction of a new feature called Browser Item, which allowed users to perform advanced website monitoring using a browser.

The Browser Item feature includes the ability to:

● Capture screenshots of the current website state
● Measure website performance and availability metrics
● Extract and analyze data from web pages
● Generate automatic alerts based on detected changes or errors

This means Zabbix is no longer limited to traditional IT infrastructure monitoring. It can now also serve as a tool for monitoring strategic external information.

Key use cases for website change monitoring with Zabbix

The Zabbix Browser Item opens up many valuable use cases for organizations that want to proactively track website changes. Below are some key examples:

Monitoring release notes

Tracking vendor release notes is essential for IT teams. With Zabbix, we can automatically detect new releases, extract relevant information, and notify the appropriate team members so they can respond faster.

Tracking security advisories

Security advisories are critical for maintaining a strong security posture. By monitoring websites that publish vulnerability information using Zabbix, security teams can be promptly alerted about new threats and take timely actions to reduce risks.

Monitoring competitor websites

In a competitive market, staying informed about competitor activities is vital. Zabbix allows users to monitor competitor websites for pricing updates, new product offerings, marketing campaigns, or news announcements, while providing valuable business intelligence to support strategic decisions.

Monitoring tender announcements

Zabbix can also monitor websites for new tender announcements from government portals or business partners, ensuring our organization stays aware of the latest business opportunities.

Ensuring internal website integrity

Beyond external sites, we can also use the Browser Item to ensure the integrity and availability of our own websites. It helps detect unexpected content changes, broken links, or performance degradation that may affect the user experience or signal potential issues. Proactive monitoring helps maintain a high-quality user experience and protect our brand reputation.

Getting started with website change monitoring in Zabbix

Solution overview architecture

This diagram shows how Zabbix uses a WebDriver to capture and analyze website content.
The collected data is stored in Zabbix for visualization and alerts when changes are
detected.

Step-by-step configuration

In this example, we’ll monitor changes on the Nginx Security Advisories webpage.

Step 1: Prepare the Web Driver

Zabbix requires a Web Driver to perform browser-based monitoring. One commonly used option is Selenium, which can be deployed using the following Docker image:

https://hub.docker.com/r/selenium/standalone-chrome

Step 2: Configure WebDriverURL on Zabbix server or proxy

Update the WebDriverURL parameter in your Zabbix Server or Zabbix Proxy configuration to point to the Selenium service you deployed.

Step 3: Create a Browser Item in Zabbix

1. Create a host if it doesn’t already exist.

2. Add a new item with the following settings:

  • Type: Browser
  • Type of information: Text

The key part is the script section. Below is the example script.

The script uses two methods:

  • browser.navigate method defines the URL to be monitored
  • browser.findElements method specifies the page section where changes should be detected

Note: The StartBrowserPollers parameter must be enabled on the Zabbix server or proxy configuration for browser items to work. It is enabled by default with the value StartBrowserPollers=1.

Step 4: Create dependent items

The Browser Item produces a JSON result containing website data. This item serves as the master item for dependent items such as:

  • Extracting the latest security advisories
  • Capturing a website screenshot

Step 5: Create a trigger for change alerts

Create a trigger that compares the current and previous values of the “latest security advisories” item. If any change is detected, Zabbix will automatically send an alert notifying your team of the update.

Step 6: Display data on the dashboard

To visualize the monitored data, we can use the Item History widget on a Zabbix dashboard to show both the latest security advisories and the corresponding screenshot, for example.

Conclusion

The Browser Item feature in Zabbix 7.0 elevates website monitoring beyond simple availability checks. It enables comprehensive monitoring of website changes, unlocking a variety of use cases such as tracking release notes, security advisories, competitor activity, and more.

If you’re interested in implementing this capability, feel free to contact us. Bangunindo is a Zabbix Premium Partner in Indonesia, ready to help you design, implement, and optimize your Zabbix monitoring solution to fit your specific needs.

The post Monitoring Website Changes with Zabbix Browser Item appeared first on Zabbix Blog.

Monitoring MDM Certificates with Lab9 Pro and Zabbix

Post Syndicated from Michael Kammer original https://blog.zabbix.com/monitoring-mdm-certificates-with-lab9-pro-and-zabbix/31621/

Lab9 Pro is the B2B division of Lab9, Belgium’s leading Apple Premium Partner. With over 30 years of experience, Lab9 Pro specializes in integrating and supporting Apple systems within businesses, educational institutions, and public organizations. Beyond Apple expertise, Lab9 Pro also designs, implements, and maintains complete IT infrastructures, including networks, servers, storage, and security solutions.

The challenge

It’s impossible to manage devices at organizations without the use of a good MDM (Mobile Device Management) system such as Jamf. As the leading provider of Apple device management solutions, Jamf empowers organizations to deploy, manage, and secure Apple devices at scale.

Even in smaller organizations Jamf is the right solution, as small and medium-sized enterprises (SMEs) often lack the resources to manage their MDM systems. Offering an MSP model solves a lot of problems for these customers.

For Apple device management, the typical customer has a few certificates issued by Apple, which require approval of the user agreement by the Apple business or school manager. Without getting too technical about Apple Device management, depending on the customer the certificates need to be renewed on different dates. If the user agreement is not approved, automated device enrollment will stop working.

Lab9 Pro found themselves needing to check all certificates and user agreements for MSP customers manually, which involved an unacceptably high error rate that often caused discontinuity of the MDM system.

The solution

Lab9 Pro were already using Zabbix to monitor customer environments and their own infrastructure, including storage, firewalls, switches, and more. Because Zabbix offers a wide variety of options that make it possible to monitor almost anything, it was only logical to explore whether Zabbix could also be used to monitor the MDM certificates.

The research phase

Step one was to check the availability of certificate information. Unfortunately, Apple Business Manager’s API did not help much, as it does not provide certificate details. Instead, the team at Lab9 Pro investigated the Jamf API.

Although it doesn’t directly return certificate information either,  they found something even more useful – Jamf’s API provides customer instance notifications. These include alerts when certificates (VPP, PUSH, DEP, etc.) are about to expire (typically 10 days in advance) as well as when the Device Enrollment Program (user agreement) is not approved.

Zabbix implementation

Since Lab9 Pro manages multiple MSP tenants, they created a dedicated Zabbix template. This template includes both pre-filled and empty macros:

Pre-filled macros:

• {$JAMF.AUTH.INTERVAL}: Interval for retrieving the bearer token
• {$JAMF.NOTIF.INTERVAL}: Interval for retrieving Jamf notifications
• {$JAMF.PATH.AUTH}: API path for retrieving the bearer token
• {$JAMF.PATH.NOTIFICATIONS}: API path for retrieving Jamf notifications

Empty macros:

• {$JAMF.URL}: Jamf URL
• {$JAMF.API.USER}: Jamf user account for authentication
• {$JAMF.API.PASSWORD}: Jamf password (stored as a secret value)

The team configured an item to perform an API call to retrieve the bearer token. A preprocessing rule in JavaScript stores this token in a variable. Discovery rules proved very useful for executing API calls to retrieve Jamf notifications using the bearer token. This was achieved by configuring preprocessing steps and Low-Level Discovery (LLD) macros to pass the Jamf URL and bearer token. Trigger prototypes for each certificate were also added within the same discovery rule.

The results

Whenever a certificate is nearing expiration, a problem is automatically displayed on Lab9 Pro’s Zabbix dashboard, which is visible on TV screens placed throughout their office in order to make sure the entire team is aware of upcoming certificate renewals.

Since Lab9 Pro began monitoring MDM certificates through the Jamf API, they have experienced zero expired certificates, which in turn has allowed them to avoid situations where devices become unmanaged and require a full setup again.

Zabbix makes it possible for Lab9 Pro to keep their clients’ MDM systems operational, while allowing them to either proactively inform them when certificates need to be renewed or handle the renewal process on their behalf.

The post Monitoring MDM Certificates with Lab9 Pro and Zabbix appeared first on Zabbix Blog.

Monitoring a Starlink Dish with Zabbix

Post Syndicated from Alexander Petrov-Gavrilov original https://blog.zabbix.com/monitoring-a-starlink-dish-with-zabbix/31543/

Did you realize that you can monitor a Starlink dish using just Zabbix? The idea (or rather the need) to use Starlink came to me almost as soon as I moved to a fairly rural area. Local internet providers have not yet “provided” fiberoptic or stable mobile connectivity to places like this, and while searching for a solution I accidentally discovered that Starlink was already providing service to some local companies. As I later found out, they also offered service in my area for residential customers.

To make a long story short, since internet access is crucial in the IT field, I decided to acquire and then monitor my very own Starlink dish. At first, this proved challenging because regular user data access is quite limited. However, thanks to Zabbix browser monitoring, I managed to solve it fairly easily. In this post I will share my solution with you, including the template.

Monitoring configuration

First, you need to make sure you have Zabbix installed (either a Zabbix proxy or server) on the same network that the Starlink dish and router are on. The next step is to configure Zabbix for browser monitoring.

WebDriver installation
# podman run --name webdriver -d \
-p 4444:4444 \ 
-p 7900:7900 \
--shm-size="2g" \
--restart=always -d docker.io/selenium/standalone-chrome:latest

Port 4444 will be the port on which the WebDriver will be listening, and port 7900 will be used by NoVNC, which allows us to observe browser behavior in case a browser with a GUI is used.

Zabbix server/proxy configuration

After WebDriver is installed, we need to set up the communication between Zabbix and the driver. This can be done by editing the Zabbix server/proxy configuration file and updating the following parameters:

### Option: WebDriverURL 
# WebDriver interface HTTP[S] URL. For example http://localhost:4444 used with 
# Selenium WebDriver standalone server. 
# 
# WebDriverURL= 
WebDriverURL=http://localhost:4444 
### Option: StartBrowserPollers 
# Number of pre-forked instances of browser item pollers. 
# 
# Range: 0-1000 
# StartBrowserPollers=1 
StartBrowserPollers=5

With the configuration parameters in place, restart the Zabbix server/proxy to apply the changes:

systemctl restart zabbix-server
Creating a host

First, we need to navigate to the “Data collection” > “Hosts” section and create a host that represents our Starlink dish. The host in my example will look like this:

Starlink dish host
Starlink dish host

The host also has a user macro:

{$LINK} with value: http://webapp.starlink.com to point to the correct Starlink dish web app:

Link macro
Link macro
Creating a browser item

We will now configure our browser item to collect and monitor the list of metrics exposed in the Starlink browser app:

Starlink browser item
Starlink browser item

We are using the bare minimum here, so make sure the update intervals are as frequent as you need. However, I would not recommend updating it more frequently than every 5 minutes. It’s also not a good idea to store the history, since it is already stored trough dependent items.

The most important part of the item is the script itself:

var browser, result;
var opts = Browser.chromeOptions();

opts.capabilities.alwaysMatch['goog:chromeOptions'].args = [];
browser = new Browser(opts);
browser.setScreenSize(Number(1980), Number(1020));

try {
    var params = JSON.parse(value);
    browser.navigate(params.url);

 // Wait for the dish to report status
    Zabbix.sleep(2000);

    // Find the JSON text element(s)
    var jsonElements = browser.findElements("xpath", "//div[@id='root']/div[@class='App']/div[@class='Main']/div[2]/div[@class='Section'][2]/pre[@class='Json-Format']/div[@class='Json-Text']");
    var extractedData = [];

    for (var i = 0; i < jsonElements.length; i++) {
        var text = jsonElements[i].getText();

        // Try parsing JSON
        try {
            extractedData.push(JSON.parse(text));
        } catch (e) {
            // If not valid JSON, include raw text instead
            extractedData.push({ raw: text, error: "Invalid JSON format" });
        }
    }

    // Collect result 
    result = browser.getResult();

    // Replace with parsed JSON data
    result.extractedJsonData = extractedData.length === 1 ? extractedData[0] : extractedData;

}
catch (err) {
    if (!(err instanceof BrowserError)) {
        browser.setError(err.message);
    }
    result = browser.getResult();
}
finally {
    // Return a clean JSON object
    return JSON.stringify(result.extractedJsonData);
}

So what does this script do? It opens the Starlink web app, waits for the Starlink dish to output all the status data, and, after a bit of parsing, returns the data highlighted in the screenshot:

Starlink dish diagnostic data
Starlink dish diagnostic data

Now we can click on the three dots on the left of our newly created item in the items page and proceed to create dependent items for each value we are interested in!

Creating dependent items

Now we just click here:

As an example, to create an item that monitors the hardware version we can create an item like this:

Hardware version dependent item
Hardware version dependent item

With JSONPath preprocessing:

Hardware version item preprocessing
Hardware version item preprocessing

In the end we get the data in Zabbix:

Starlink dish hardware version
Starlink dish hardware version

All other items (except alerts) will follow the same logic – just update the item name, key, and JSONPath in preprocessing to extract the required values.

Creating dependent LLD item prototypes

To automate the alerts items creation, we can create a dependent discovery rule. In the “Discovery” section, create a new discovery rule:

Starlink dish alerts discovery
Starlink dish alerts discovery

With preprocessing using Java Script:

var data = JSON.parse(value);
var alerts = data.alerts;
var lld = [];

for (var key in alerts) {
    if (alerts.hasOwnProperty(key)) {
        lld.push({
            "{#ALERT}": key
        });
    }
}

return JSON.stringify({ data: lld });

This will provide us with following JSON data:

{
  "data": [
    {
      "{#ALERT}": "dishIsHeating"
    },
    {
      "{#ALERT}": "dishThermalThrottle"
    },
    {
      "{#ALERT}": "dishThermalShutdown"
    },
    {
      "{#ALERT}": "powerSupplyThermalThrottle"
    },
    {
      "{#ALERT}": "motorsStuck"
    },
    {
      "{#ALERT}": "mastNotNearVertical"
    },
    {
      "{#ALERT}": "slowEthernetSpeeds"
    },
    {
      "{#ALERT}": "softwareInstallPending"
    },
    {
      "{#ALERT}": "movingTooFastForPolicy"
    },
    {
      "{#ALERT}": "obstructed"
    }
  ]
}

All that’s left ‘to do is to create a dependent item prototype:

Starlink dish alert prototype
Starlink dish alert prototype

With preprocessing, of course:

JSONPath will transform to extract each specific alert and “Boolean to Decimal” will save us some space in the database by tranforming true/false booleans to digits.

Result

In the end, we can monitor all the data:

Starlink dish latest data
Starlink dish latest data

Even more data can be collected using exporters – if you are willing to do a bit of extra configuration, of course! Let me know if you are interested, and I will show you a completely different approach with a template.

Before I forget, the template used in this tutorial can be found  here.

The post Monitoring a Starlink Dish with Zabbix appeared first on Zabbix Blog.

Running Zabbix with MariaDB and Galera Active/Active Clustering

Post Syndicated from Nathan Liefting original https://blog.zabbix.com/running-zabbix-with-mariadb-and-galera-active-active-clustering/31104/

High availability on a platform like Zabbix is a hard requirement for many users. With native high availability on the Zabbix servers, proxies, and at the frontend through various solutions for web servers, all that’s left is at the database layer. Any downtime in your MariaDB database would disrupt your monitoring availability, at the least on the frontend side of things in case of proxy buffering. Let’s have a look at the easiest way to create a high availability (HA) architecture for Zabbix using MariaDB with built-in Galera clustering – by removing single points of failure from your database and finalizing the HA puzzle for Zabbix.

Architecture overview

Let’s start of with the MariaDB + Galera number one design requirement. For a proper quorum to be made, 3 nodes should be used in the cluster. With only two nodes in a Galera cluster, quorum rules become a bit of a headache, as Galera uses a majority vote (more than half the nodes) to decide if the cluster can still accept writes. In a two-node setup, all is good when the database is online. But when we lose one node, quorum is lost and that node needs to rejoin.

This makes a two-node setup fragile but not impossible, and it does work with Zabbix since we do only have one Zabbix server active at the time. In a split-brain scenario where both nodes either think they are the last to leave, you might have to decide which node you think has your up-to-date data. We will detail both scenario’s, but the principle remains the same. We will use MariaDB as our database and Galera will be used to create a primary/primary cluster. In such a cluster, all nodes in the cluster are writeable, which is great for the Zabbix native HA.

When we look in the Zabbix database, we can see that Zabbix keeps all of it’s Zabbix server HA information and states in the database.

This means that whatever one Zabbix server node writes into the database will also be replicated to all other nodes in the MariaDB Galera cluster.

The design

Knowing what we know now, we can create a very simple design for a solid Zabbix HA setup with Mariadb + Galera. When we have a single Zabbix frontend and we keep to the MariaDB + Galera requirement of having 3 database nodes, we get a fairly simple setup, as seen below.

In this setup, each Zabbix server connects to its own Database node and we don’t need added complexity by using load balancers. However,  we do get an automatic failover from the Zabbix servers, as they know exactly which node is active through the database. However, in this situation we are still left with 3 frontends that do not have automatic failover, simply because we do not have database aware Apache or NGINX. This also works in a two database setup, with the side note that you might have quorum issues to manually resolve after an outage:

Adding onto this setup, we could install a VIP, load balancer, or something like HA proxy in front of the frontend to make a failover happen there as well. Keep in mind though, the failover needs to happen based on whether or not the webfrontend can reach a writeable database.

Optional Arbitrator

If you are set on running only 2 database nodes (your wallet is thankful), but still worried about quorums, we can bring in the ARBITRATOR.

If there are only 2 Database nodes in your Galera cluster, not to worry! It’s definitely possible even while maintaining a good quorum resolution in case of outages.

All we have to do is add a third machine (VM) running the Galera arbitrator software. Preferably this machine would be in a third location, so it can act independently. But you can also add it into your main site if required.

What about load balancing?

Lastly, it is also possible to add load balancing to the mix. Let’s say, for example, you cannot add a VIP to your environment but still need your WEB servers to failover. A load balancer can provide the solution here.

We still prefer to run the Zabbix servers with a direct database connection, but even there a load balancer could be added if you wish. However, please keep in mind that the more load balancers you add, the more complex troubleshooting might become. The whole idea about the setup without load balancers is to have a solid Zabbix setup that is easy to maintain, while providing high availability.

Conclusion

In the end, even with a minimal setup of 2 DB nodes, 2 Zabbix servers, and 2 WEB frontends, we can make a high availability setup. As we’ve shown with Galera, this setup becomes highly flexible, allowing us to run without automatic WEB failover all the way up to including complicated load balancers.

High availability doesn’t have to be overly complicated in a setup like this – it really is all about how far you want to push things. Besides that, in this setup everything is horizontally scalable on the database side. Do keep in mind, however, that Zabbix does still run in an Active/Passive setup.

I hope you enjoyed reading this blog post. If you have any questions or need help configuring anything in your Zabbix setup feel free to contact me and the team at Opensource ICT Solutions. We build a ton of cool stuff like this and more!

Nathan Liefting

https://oicts.com

A close up of a logo Description automatically generated

The post Running Zabbix with MariaDB and Galera Active/Active Clustering appeared first on Zabbix Blog.

Building HA Zabbix with PostgreSQL and Patroni

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/building-ha-zabbix-with-postgresql-and-patroni/30960/

Running a monitoring platform like Zabbix in a production environment demands reliability and resilience. When your monitoring solution is down, you’re flying blind – and for many organizations, that simply isn’t acceptable. This post introduces a robust high-availability (HA) architecture for Zabbix, using PostgreSQL,  Patroni, etcd, HAProxy, keepalived and PgBackRest. Built on RHEL 9 or derrivates, this solution combines modern open-source tools to provide automatic failover, load balancing, and seamless monitoring, all while maintaining consistency and performance.

Architecture overview

The HA design consists of multiple layers working in tandem to maintain continuity even during node or service failures:

Database Cluster Layer

2 or more nodes form the PostgreSQL cluster, managed by Patroni and coordinated using etcd. At any given time, one node is the primary (read/write), and the others are hot standbys ready to take over automatically.

Consensus layer

etcd runs on the same nodes and acts as the distributed configuration store and coordination layer for Patroni. It ensures a consistent cluster state and enables safe failover decisions.

Load balancing layer  

Two HAProxy nodes provide a single point of entry for all clients (including Zabbix), routing requests to the current PostgreSQL primary. These nodes are monitored and coordinated via Keepalived to maintain a floating Virtual IP (VIP), ensuring seamless failover at the connection layer.

Backup layer

A separate backup server is responsible for running PgBackRest, which handles full and incremental backups, WAL archiving, and Point-In-Time Recovery (PITR). This server communicates securely with all database nodes over SSH.

Monitoring layer

Two Zabbix servers, running in active-passive mode, continuously monitor all layers of this stack including the HAProxy health, Patroni cluster role, and etcd status by accessing the PostgreSQL VIP for backend connectivity.

This multi-tiered setup ensures that no single failure be it a database, load balancer, or monitoring server brings down the monitoring platform.

Why HA matters for Zabbix

Zabbix depends heavily on its PostgreSQL database backend. Every metric, trigger, event, and alert is stored there. If PostgreSQL becomes unavailable, even briefly, data loss or monitoring blind spots can occur. That’s why introducing HA at the database layer is a crucial step when scaling Zabbix for enterprise environments.

While Zabbix itself supports HA at the application level, this architecture ensures that the database backend is also fully fault-tolerant, using modern consensus-based clustering with automatic failover.

Component overview

To achieve HA, we bring together several specialized components, each fulfilling a critical role in the system:

PostgreSQL

The relational database engine used by Zabbix. In this example setup, it runs on three nodes, forming a cluster managed by Patroni.

Patroni

Patroni is the orchestrator for the PostgreSQL cluster. It monitors node health, manages replication, promotes standbys when needed, and ensures only one writable leader exists at any time. Patroni leverages a distributed consensus store in this case, etcd but other DCS’s are possible to coordinate decisions across the cluster.

etcd

etcd is a lightweight and highly available key-value store used by Patroni to maintain the cluster’s state. It stores leader election data, health statuses, and locks. We deploy it as a three-node cluster, co-located with the PostgreSQL nodes for convenience, though this setup can be scaled independently if needed as etcd is very latency prone.

HAProxy

To simplify application connectivity, HAProxy acts as a load balancer in front of the database cluster. It monitors the role of each node using Patroni’s REST API and routes connections to the active primary server. If the leader fails, HAProxy automatically reroutes traffic to the new primary.

Keepalived

Keepalived provides a floating virtual IP address (VIP) across the HAProxy nodes. This VIP allows client systems, such as the Zabbix frontend, to connect to a single stable IP even if one HAProxy node fails.

PgBackRest

To protect the data itself, we use PgBackRest for full and incremental backups, as well as Point-In-Time Recovery (PITR). A dedicated backup server is included to pull and store archive logs and backups securely via SSH.

Zabbix server

Finally, we run two Zabbix servers in active-passive mode. Both are configured to connect to the PostgreSQL cluster through the VIP exposed by HAProxy. The Zabbix frontend is deployed on both nodes as well, ensuring continued accessibility through the load-balanced setup.

Topology at a glance

Here’s a simplified view of the architecture:

  • 2 or more database nodes (PostgreSQL + Patroni + etcd)
  • Two HAProxy nodes, each configured with Keepalived to manage a floating virtual IP
  • One backup node for PgBackRest
  • Two Zabbix servers pointing to the PostgreSQL VIP

All systems are tied together with consistent hostname mappings, time synchronization (Chrony), and service monitoring.

Notes:

  • PgBackRest is directly connected to all three PostgreSQL nodes, allowing it to archive WAL segments and pull backups regardless of which node is primary.
  • This design enables full standby backups and supports Point-In-Time Recovery (PITR).
  • HAProxy ensures Zabbix always talks to the current primary node, while Patroni and etcd handle automatic failover and cluster state management.

Design rationale

This setup prioritizes resilience and self-healing. If any single component fails a database node, a load balancer, or even a monitoring server the system continues to function.

Using Patroni with etcd ensures that failovers are handled automatically, without human intervention. HAProxy ensures client traffic is always routed to the current primary, while Keepalived ensures that this routing layer itself is highly available.

We opted for PgBackRest over simple scripts or base backups because it provides not just efficient incremental backups, but also full WAL archiving and point-in-time recovery, which are invaluable for both disaster recovery and debugging.

Lastly, we chose to integrate Zabbix itself into this HA design, treating it not just as a application but as a fully resilient service able to monitor itself, so to speak.

Real-world considerations
  • Resource planning: While our nodes run comfortably, scaling this setup to heavy workloads requires careful tuning of memory, I/O, and PostgreSQL parameters.
  • etcd placement: Although we run etcd co-located with the database nodes in this example, separating etcd onto dedicated infrastructure is ideal for large-scale environments. This avoids resource contention and preserves quorum in extreme failure scenarios.
  • Monitoring the monitors: Zabbix itself must be monitored. In our setup, each component including etcd, Patroni, and PostgreSQL exposes health endpoints that can be used by Zabbix agents or scripts to generate alerts on replication lag, cluster health, and failover events.

Conclusion

This architecture provides a solid foundation for running Zabbix in a fault-tolerant, production-ready environment. It not only ensures high availability for the database layer but also offers flexibility, observability, and operational safety.

Whether you’re running internal infrastructure monitoring or offering Zabbix as a managed service, adopting this type of HA setup removes single points of failure and gives you peace of mind — all using open-source technologies that are battle-tested and widely supported.

If you need assistance with the migration or want to ensure best practices for scaling and optimizing Zabbix, don’t hesitate to reach out to OICTS. We are a Zabbix Premium Partner operating globally, with offices in the USAUKNetherlands, and Belgium, and we’re ready to help you every step of the way.

 

The post Building HA Zabbix with PostgreSQL and Patroni appeared first on Zabbix Blog.

Revolutionizing Zabbix Maintenance with Artificial Intelligence

Post Syndicated from Grover Taipe original https://blog.zabbix.com/revolutionizing-zabbix-maintenance-with-artificial-intelligence/31284/

Can you imagine being able to schedule maintenance in Zabbix by simply telling a program: “I need to put the web server in maintenance tomorrow from 8 to 10 with ticket 100-178306”? That’s exactly what the Artificial Intelligence (AI) Scheduler Zabbix project I’ve developed does!

What problem does it solve?

Anyone who has worked with Zabbix knows that scheduling maintenance can sometimes be tedious, especially when you need to:

  • Configure complex routine maintenance
  • Handle Zabbix API bitmasks for specific days of the week or month
  • Search for specific hosts or groups
  • Document associated tickets

This project eliminates that friction by allowing the use of natural language to create both one-time and routine maintenance.

The magic behind the code

Conversational artificial intelligence

The system integrates both OpenAI GPT-4 and Google Gemini to interpret natural language requests. The AI doesn’t just understand what you want to do, but automatically:

  • Detects servers, groups, and dates
  • Identifies ticket numbers (XXX-XXXXXX format)
  • Automatically calculates complex Zabbix bitmasks
  • Generates contextual responses with examples
Fig. 1. Adding the AI Scheduler widget to your Zabbix dashboard

Advanced routine maintenance

What really stands out is its ability to handle complex patterns. Here are some practical examples that work:

  • “Daily backup for srv-backup from 2 to 4 AM with ticket 200-8341 until February 2027”
  • “Thursday and Friday maintenance from 5 to 7 AM until January 2027”
  • “Cleanup on the first Sunday of each month with ticket 100-178306 until December 2026”
Fig. 2. AI-generated maintenance summary with all calculated parameters

Elegant architecture

The project uses a three-layer architecture:

  • Frontend: Custom widget for Zabbix
  • Backend: Flask API with AI integration
  • Zabbix: Native API to create maintenance
Fig. 3. Maintenance successfully created and visible in Zabbix interface

Super-simple installation

One of the best features is how easy it is to get it running:

cp .env.example .env

You only need to configure your Zabbix URL and AI API key:

 docker compose up -d --build

And that’s it! You have an AI assistant working.

Multi-instance support

For organizations with multiple Zabbix servers, the project includes configuration for up to 5 simultaneous instances, each with its own configuration.

What impresses me most

Intelligent date detection

The system understands natural expressions like:

  • “Tomorrow from 8 to 10” → Next date with specific schedule
  • “Sunday from 2 to 4 AM” → Next Sunday at those hours
  • “24/08/25 10:00am” → Automatically converts the format

Automatic Bitmask management

Zabbix API bitmasks can be notoriously complicated. This system calculates them automatically:

  • Thursday and Friday = 8 + 16 = 24
  • Sundays only = 64
  • First week of the month with specific configuration
Fig. 4. Complex weekly maintenance scheduling with automatic bitmask calculation

Why is it important?

This project represents a natural evolution in systems administration. Instead of memorizing complex syntax or navigating multiple menus, you simply describe what you need in natural language. It’s especially valuable for:

  • Operations teams handling multiple maintenance tasks
  • Companies that need to document associated tickets
  • Organizations with complex maintenance patterns

The future is here

Projects like this demonstrate how artificial intelligence can make complex technical tools more accessible without sacrificing functionality. It’s not just automation – it’s intelligence applied to real infrastructure problems. If you work with Zabbix and are tired of manually configuring maintenance, this project is definitely worth checking out. It’s open source, well documented, and solves a real problem that many of us face every day. You can find the complete project on GitHub.

The post Revolutionizing Zabbix Maintenance with Artificial Intelligence appeared first on Zabbix Blog.

Migrating from PRTG to Zabbix: A High-Level Guide

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/migrating-from-prtg-to-zabbix-a-high-level-guide/30845/

For companies looking to migrate from PRTG Network Monitor to Zabbix, one of the most critical aspects is making sure a smooth migration of monitored devices and configurations. While there is no official tool to directly migrate between the two platforms, creating a bridge using custom export/import scripts allows for an effective and large migation. This blog post outlines a practical approach to achieving that migration based on the export/import methodology we at Opensource ICT Solutions previously implemented for one of our clients.

Why migrate?

While PRTG offers an intuitive interface and is popular for its ease of use, Zabbix provides:

  • Greater flexibility and scalability
  • Full open-source licensing
  • More powerful automation and templating
  • A robust API for integrations
  • Lower costs, especially since Paessler was sold to an investor

These features make Zabbix an attractive choice for teams looking to scale or standardize on open-source infrastructure.

Migration overview

The migration involves two key steps:

  1. Exporting PRTG device information
  2. Importing data into Zabbix

Because the two systems are conceptually and structurally different, we focused our scripts on migrating what is most transferable: device names, IP addresses, and interface types. SNMP versions or PRTG-specific sensor details were excluded or simplified where not applicable to Zabbix. PRTG, for example, will only export probes that have an OID that was not built-in in PRTG but added later, making our export incomplete. This does not mean we did a partial migration, it just means we have not included it in the automated approach.

Step 1: Exporting from PRTG

We developed a Python-based script that interacts with the PRTG API to extract monitored device data and export it to a CSV file. The script filters out irrelevant objects and organizes the output for easy Zabbix processing.

This creates a clean CSV, like this:

Device Name, IP Address, Interface Type
zabbix-server,10.0.0.10,agent
ServerA,192.168.0.2,SNMP
ServerA,192.168.0.2,agent
core-switch,192.168.0.1,SNMP

This file serves as a clean, structured inventory of monitored devices.

Note: SNMP version fields were excluded in the final export, as Zabbix does not currently display or rely on an SNMP version in the same way PRTG does.

Step 2: Importing into Zabbix

Using Zabbix’s API, we created an import script that reads the CSV and:

  • Creates host entries
  • Assigns them to the appropriate host group
  • Adds relevant interfaces (e.g., Agent,ILO,SNMP or a combination of …)

Each host is configured based on its detected interface type in PRTG.

On the Zabbix side, we used the Zabbix API to automate the creation of hosts, interfaces, and template assignment. The import script reads the CSV line-by-line and takes action based on the interface type.

Considerations and “gotchas”

  • Templates: We didn’t add templates, as there is no 1:1 solution – PRTG has a different concept and adding a standard template would be possible but probably not the best solution.
  • Host Groups: For ease of use and the limited time we had, we added all hosts in a temporary host group made for the migration. Although we do have scripts that take it out from PRTG and create it in Zabbix, in this particular migration it was not needed.
  • Permissions: The API token used in the import script must have sufficient privileges to create hosts.

What is NOT migrated

Because of fundamental differences between the platforms, the following are not directly migrated:

  • Historical data or sensor readings: Mainly because the customer had no hard requirement for it.
  • Custom PRTG notifications or dependencies: It was easier to manually re-create them.
  • Maps or dashboards: The Zabbix approach is so different that it was easier to recreate it manually (and improve).
  • Sensors: Zabbix is working with a different concept.

Post-migration tips

  • Validation: After the import, verify that each host is reachable and monitored correctly in Zabbix.
  • Discovery: Consider using Zabbix’s LLD (Low-Level Discovery) to dynamically find interfaces, disks, or other entities.
  • Housekeeping: Disable PRTG monitoring only after confirming Zabbix is fully operational.

Conclusion

Migrating from PRTG to Zabbix is not a one click operation, but with some scripting, planning, and experience from a partner like us, it can be done efficiently and with minimal disruption. The custom export/import scripts act as a reliable bridge between the two systems, allowing for a clean transfer of your monitoring inventory. From there, Zabbix’s automation and scalability features can help take your monitoring to the next level.

If you need assistance with the migration or want to ensure best practices for scaling and optimizing Zabbix, don’t hesitate to reach out to OICTS. We are a Zabbix Premium Partner operating globally, with offices in the USA, UK, Netherlands, and Belgium ready to help you every step of the way.

The post Migrating from PRTG to Zabbix: A High-Level Guide appeared first on Zabbix Blog.

Running Zabbix with PostgreSQL and PG Auto Failover

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/running-zabbix-with-postgresql-and-pg-auto-failover/31026/

Running a monitoring platform like Zabbix in a production environment requires bulletproof availability at the database layer. Any downtime in PostgreSQL, even for seconds, can disrupt monitoring visibility, triggering blind spots in alerts and data collection.

This post introduces a streamlined High-Availability (HA) architecture for Zabbix using PostgreSQL, pg_auto_failover, HAProxy, and PgBackRest. Built on RHEL 9 or derivatives, this architecture removes single points of failure and automates failover using minimal external dependencies, making it a strong candidate for modern observability backends.

Architecture overview

This HA design simplifies deployment by using a dedicated monitor node to orchestrate automatic failover between two PostgreSQL database nodes. With pg_auto_failover, we avoid the need for complex consensus layers like etcd or Consul while still achieving fast, reliable failover and recovery.

Database layer

Two PostgreSQL nodes are deployed in a primary/secondary configuration. These nodes are registered with a dedicated pg_auto_failover monitor, which continuously checks node health and replication status. In the event of a failure, the monitor promotes the secondary to primary with no manual intervention.

Each node is securely configured using scram-sha-256 authentication and self-signed / or owned SSL certificates to ensure encrypted communication within the cluster.

Monitor node (Arbiter)

The monitor node is a lightweight PostgreSQL instance that runs the pgautofailover extension. It holds state information about all participating nodes and acts as the arbiter during failover events. It requires only one node, reducing complexity compared to consensus-based DCS (Distributed Configuration Store) systems like etcd or ZooKeeper.

Load balancing layer

Two HAProxy nodes route all client (Zabbix) connections to the current PostgreSQL primary. A lightweight HTTP service on each DB node reports its current role (primary or not) and allows HAProxy to determine which node is writable. These proxies are kept highly available using Keepalived, which manages a shared Virtual IP (VIP) across both proxy servers.

This way, applications like Zabbix always connect to a stable endpoint, even during failover events.

Backup layer

Backups are handled using PgBackRest, deployed on a dedicated backup server. This server connects to both PostgreSQL nodes over SSH and performs the following:

  • Full and incremental backups
  • WAL archiving
  • Point-In-Time Recovery (PITR)

Passwordless SSH and proper pgbackrest.conf mappings are set up to support seamless interaction regardless of which node is currently primary.

Component overview

Component Role
PostgreSQL Relational backend storing all Zabbix metrics, alerts, events
pg_auto_failover Ensures continuous availability by promoting replicas automatically
Monitor Node Decides failover based on health checks and cluster state
HAProxy Routes client traffic to the current primary
Keepalived Provides VIP failover between HAProxy nodes
PgBackRest Performs PITR-capable backups from any node
Zabbix Server Connects to PostgreSQL via VIP to ensure continuity

 

Topology at a glance

Design

Unlike Patroni, which requires a distributed configuration store like etcd, pg_auto_failover uses a dedicated monitor node that simplifies orchestration. This setup reduces the operational burden while still delivering robust failover, automatic reconfiguration, and synchronization safeguards, including:

  • Synchronous_standby_names to enforce replication integrity
  • Service integration with systemd for reliable restarts
  • Failover detection with minimal latency

This design also ensures SSL-enabled encrypted communication, self-healing role changes, and full observability using Zabbix itself, which can be configured to monitor the PostgreSQL cluster through exposed health endpoints.

Real-world considerations

  • Upgrade Planning: The pg_auto_failover version in RPM repos may lag behind the latest upstream features like set_monitor_setting. Pin the package version if consistency is required.
  • Network Security: Only HAProxy nodes are allowed to query the internal role-check API on the DB nodes using custom firewall rules.
  • Cluster Hygiene: Always clean up config folders (~postgres/.config/pg_autoctl/…) if a node is misconfigured or needs to rejoin.
  • SELinux: Configure SELinux, use semanage and audit2allow to fix custom ports (e.g., 9877 for health checks).
  • Hybrid Logging: Setup PostgreSQL to log to both journald and traditional log files via stderr + logging_collector.

Conclusion

This architecture strikes a balance between simplicity and resilience. While Patroni is great for large-scale, multi-region setups requiring distributed consensus, pg_auto_failover offers a lighter-weight solution that covers most enterprise needs without complex dependencies.

By layering the following…

  • PostgreSQL 17
  • Pg_auto_failover with a single monitor
  • HAProxy + Keepalived for VIP failover
  • PgBackRest for backups

…you can then confidently run Zabbix in a highly available and secure fashion with minimal operational overhead.

If you’re considering implementing this setup or migrating from a single-node database backend, reach out to Opensource ICT Solutions, a Zabbix Premium Partner with global presence in the USA, the UK, the Netherlands, and Belgium. We can help you architect, deploy, and monitor Zabbix environments that scale with your needs.

The post Running Zabbix with PostgreSQL and PG Auto Failover appeared first on Zabbix Blog.

When Generative AI Meets Zabbix

Post Syndicated from Cesar Caceres original https://blog.zabbix.com/when-generative-ai-meets-zabbix/30908/

Zabbix has been the backbone of my infrastructure for over ten years, a journey I’ve been on from version 3.2 to 7.4. It’s a robust and reliable tool. However, in the age of intelligent assistants, I posed a question to myself: Why can’t I interact with my monitoring system as naturally as I talk with Maria, my generative AI assistant?

What is MCP?

MCP (Model Context Protocol) is a universal protocol that helps generative AI systems interact with global data securely, reliably, and at scale.
Imagine this: It’s 3 AM, and you receive a critical alert on your phone. Instead of opening multiple dashboards and manually correlating data, you simply type: “What’s happening with the production server?”

You get a response like this:

“The web-prod-01 server is experiencing high memory usage (94%). This started 15 minutes ago, coinciding with a traffic spike. I recommend checking the database connection pool and considering a restart of the Apache service. Would you like me to show you the related logs?”

This is no longer science fiction!

Design principle

The main objective is to enhance Zabbix without altering its core. The solution is based on an architecture that adheres to the following principles:

  • Zabbix intact: The original installation remains unchanged.
  • API-first: All communication is done through Zabbix’s robust JSON-RPC API.
  • Intelligent bridge: An intermediary service is created to translate between human language and Zabbix metrics.
  • Scalability: The design is prepared to grow alongside the infrastructure.

Proposed architecture:

  • Zabbix server: Debian 12, Zabbix 7.4.0, PostgreSQL 15.13
  • AI server (MCP): Rocky Linux 9, Gemini AI, Express.js, Winston (Logging), Gemini CLI, Redis, Nginx, PM2

Webhooks

We process Zabbix alerts through a webhook that sends the data to our generative AI service.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json
import requests
import sys
from datetime import datetime

def send_to_mcp(args):
""" Sends alerts to MCP server"""
# SETTINGS - EDIT ACCORDING TO YOUR ENVIRONMENT
mcp_endpoint = "http://TU_IP_MCP_SERVER:3001/alerts" # Change to the MCP server IP
mcp_token = "TU_MCP_AUTH_TOKEN" # Exchange for your MCP authentication token
zabbix_server_ip = "TU_IP_ZABBIX_SERVER" # Change to the Zabbix server IP

headers = {
'Content-Type': 'application/json',
'Authorization': f'Bearer {mcp_token}'
}

# Extracting arguments from the Zabbix webhook
eventid = args[0] if len(args) > 0 else "unknown"
severity = args[1] if len(args) > 1 else "0"
message = args[2] if len(args) > 2 else "No message"
host = args[3] if len(args) > 3 else "unknown"
value = args[4] if len(args) > 4 else ""

payload = {
"timestamp": datetime.now().isoformat(),
"source": "zabbix",
"eventid": eventid,
"severity": severity,
"message": message,
"host": host,
"value": value,
"zabbix_server": zabbix_server_ip
}

try:
print(f"Sending alert to MCP: {mcp_endpoint}")
print(f"Payload: {json.dumps(payload, indent=2)}")

response = requests.post(mcp_endpoint, json=payload, headers=headers, timeout=15)
response.raise_for_status()

print(f"Alert sent successfully: HTTP {response.status_code}")
print(f"Response: {response.text}")
return True

except requests.exceptions.Timeout:
print("? Error: Timeout connecting to MCP server", file=sys.stderr)
return False
except requests.exceptions.ConnectionError:
print("? Error: Cannot connect to MCP server", file=sys.stderr)
return False
except requests.exceptions.HTTPError as e:
print(f"? HTTP Error: {e}", file=sys.stderr)
print(f"Response: {response.text}", file=sys.stderr)
return False
except Exception as e:
print(f"? Unexpected error: {e}", file=sys.stderr)
return False

if __name__ == "__main__":
# The arguments come from Zabbix
args = sys.argv[1:]
print(f"Script called with args: {args}")

success = send_to_mcp(args)
sys.exit(0 if success else 1)

Use cases

Scenario 1: Monday morning

Cesar: Good morning, how is the infrastructure looking?

Gemini CLI: Good morning! Everything is calm. 14 active hosts, 0 critical problems. There was a scheduled reboot on ‘proxmox-desarrollo’ over the weekend which completed successfully. The average CPU usage is at 23%. Do you need details on any specific system?

Scenario 2: Proactive planning

Cesar: What worrying trends are you seeing this week?

Gemini CLI: I’ve noticed three patterns that require attention:

1. storage-01: Growth of 2GB/day in /var/log.

    • Projection: 90% capacity in 12 days.
    • Recommendation: Configure a more aggressive log rotate.

2. mikrotik: Gradually increasing temperature (+3°C over the last 7 days).

    • Recommendation: Check ventilation.

3. base-datos-01: Slow queries increasing by 15%.

    • Recommendation: Review indexes and optimize frequent queries.

MCP implementation

Server (Rocky Linux 9)
bash
#create the project
mkdir /opt/mcp-zabbix
cd /opt/mcp-zabbix

#Install dependencies
sudo dnf install -y nodejs npm redis nginx
sudo npm install -g pm2

#Set up the project
npm init -y
npm install express axios @google/generative-ai winston helmet cors dotenv

Configuration (.env)

bash
#Environment variables
ZABBIX_URL=http://tu-zabbix-server/zabbix/api_jsonrpc.php
ZABBIX_API_TOKEN=tu_token_de_zabbix_aqui
GEMINI_API_KEY=tu_api_key_de_gemini
MCP_AUTH_TOKEN=genera_un_token_seguro
PORT=3001

Webhook in Zabbix

1. Alerts → Media Types → Create
2. Script name: mcp_webhook.py
3. Parameters: {EVENT.ID} {EVENT.NSEVERITY} {ALERT.MESSAGE} {HOST.NAME} {ITEM.VALUE}

Test it

#Start the MCP server
pm2 start ecosystem.config.js

#Test curl 
curl -H "Authorization: Bearer TU_TOKEN" \
-H "Content-Type: application/json" \
-d '{"prompt":"¿How many host fo I have?"}' \
http://localhost:3001/ask-zabbix

The future

Dashboard conversations

Cesar: Show me a dashboard of the critical servers.

Gemini CLI: Creating custom dashboard with:

  • CPU/memory of your 3 production servers
  • Network latency of web services
  • Database disk space
  • Nightly backup status

Generated dashboard: http://zabbix.local/dashboard/generated-123

Errors to avoid

  • Don’t ignore security: Tokens, firewall, rate limiting from day 1
  • Don’t forget documentation: Code explains itself, workflows don’t

Resources to get started

  • Complete installation: Scripts for Rocky Linux and Debian
  • Zabbix configuration: Media types and actions
  • API reference: Endpoints and examples

Use cases

Basic monitoring: Hosts, items, triggers

  • Intelligent alerts: Automatic analysis
  • Ad-hoc queries: Quick investigation
  • Automated reports: Periodic summaries

Future integrations

The goal is to develop an application that allows natural interaction with an AI assistant called “Maria.” The idea is that based on what’s happening, Maria suggests actions and executes them proactively.

To achieve this, the assistant will integrate with Gemini’s command-line interface (CLI) and establish an additional secure communication channel. The recommended architecture will consist of several servers capable of understanding each other, including a Zabbix Server, the MCP (Model Context Protocol), and the personal assistant.You can follow the development of the base integration in this repository.

Conclusion

Zabbix will continue to be the reliable engine we all know. The difference is that it now becomes more intuitive and conversational. The goal is not to replace human experience, but to empower it. AI will allow us to create solutions that were previously unthinkable.

To fully leverage this potential, it is essential that we, as experts, continue to train and deepen our knowledge of the tool. This way, we will not only depend on what the AI suggests, but we will be able to validate and authorize its actions with our own judgment.

The post When Generative AI Meets Zabbix appeared first on Zabbix Blog.

Migrating Nagios to Zabbix: Lessons Learned

Post Syndicated from Nathan Liefting original https://blog.zabbix.com/migrating-nagios-to-zabbix-lessons-learned/30917/

Recently, a new customer of ours at Opensource ICT Solutions asked whether we could migrate their Nagios instance to Zabbix. Because Nagios and Zabbix are very different in their storage methods, we told them that we would have to investigate and see if we could come up with a viable solution. It wasn’t long until we found a way to do it and started building some script to get it done.

The customer’s wishes

  • No loss of any Nagios configuration data
  • Historic performance data migrated to Zabbix
  • Existing problems migrated from Nagios
  • Nagios XI to be disabled entirely, as the license is expiring

The customer was clear in their wishes – we needed to turn off Nagios, but without losing historic data. As such, they wanted all their old data visible in Zabbix instead of having Nagios running somewhere as a backup. This meant that a script had to be built to get that Nagios data out and into Zabbix.

The configuration data

The good part here is that it starts simple. When we dive into the Nagios configuration data, we clearly see that Nagios has hosts just like Zabbix. They just have a slightly different build than our usual Zabbix hosts. For example, we can see three different names for a host in Nagios:

  • Host Name
  • Alias = Host name
  • Display Name = Visible name

That immediately gives us a good way to hook up Nagios names to Zabbix host and visible names.

When we then take a look at the checks and how they are executed in Nagios, we also see similarities with Zabbix. In the end, both of them are monitoring solutions, of course. However, Nagios works more in a command execution kind of way, which is good for our migration. We can take this command and find an equivalent item in Zabbix. For example the check_icmp command can easily be translated into a simple check in Zabbix icmpping, icmppingloss, and icmppingsec.

For the check_tcp command we can do a similar translation. Making sure we use the simple check net.tcp.service whenever this command is executed on a Nagios host.

Because of the big differences between Nagios and Zabbix, this does mean we need to make some manual translations between the Nagios commands and Zabbix items. Depending on your Nagios instance, this could be a big task. Luckily for us, this was a smaller instance with only ICMP and TCP port checks.

The history (i.e. performance) data

Now that we know how to start creating our hosts and items, we need to understand how Nagios is storing its data. Zabbix has a big centralized MariaDB or PostgreSQL database, which makes it easy to parse through and work with our data. Unfortunately, Nagios instances use a different technique. Nagios stores data in .rrd (Round Robin Database) files and with it a .xml file to interpret the RRD file. The RRD files are not centralized like a Zabbix database, but they are more manageable in terms of storage size. We can see an RRD file per type of check in Nagios, which means we will have to grab the data from that file while understanding what it is going to belong to in Zabbix.

To see the data in the RRD file, we can use a special command line tool.

rrdtool /usr/local/nagios/share/perfdata/BeNeLux-Host-Name/Availability.rrd LAST --start -30d --end now | grep -v "nan"

Now we can clearly see that this specific RRD file above contains 8 columns, 7 with a performance value. The first column contains the timestamp in Unixtime, which is great because it will be perfect for storing in the Zabbix database. The other 7 columns in this file are different though, because we do not know what the value in the column belongs to. This is where the .xml file comes into play. The XML file belongs with the RRD file and contains details on what is included in the RRD file.

In this XML file we will find all of the required host information, which is great for creating the host in Zabbix. It also contains the check information, so we can also use this file to create the items in Zabbix. The biggest thing we will have to keep in mind is to make sure that the XML and RRD file match up in terms of number of RRD entries and columns. Column 1 in the RRD file will match with the first entry in our XML.

Let’s create a script

With the host, item and history data identified, we can start to create a script. In our case we decided to create a Python import tool. As Zabbix comes with some limitations in terms of which hostnames we can use (which are different from the limitations in Nagios), we need to sanitize our hostnames slightly.

Then all we need to do is parse through all the XML files and create new hosts in Zabbix through the Zabbix API.

It will be a very similar process for our items, as we parse through our XML file and create all of the required items in Zabbix through the API.

We can even create the triggers straight from the XML file by parsing through the different severities already set up in Nagios.

Once everything is created in Zabbix, the Python script can now start using RRDTool to parse through the RRD file, making sure to keep the XML file structure in mind when parsing through the columns.

This script can now create the hosts, the items, the triggers, and then import all of the data. We can see the hosts being created and data being imported.

The beauty of importing history data into Zabbix while the triggers are already created is then also seen below.

All of the triggers will trigger and be resolved based on the data imported, meaning that we can create problems with historic data. This means that not only do we have our historic data, but also all the problems with the correct duration as they are now discovered from the actual imported data.

To make this possible we can use the Zabbix sender tool. It has an option to include the timestamp upon every historic value imported.

Our Python script grabs the values from the RRD file and then converts them into a new _HOST_.sender file. This file will be sent to the Zabbix server using the Zabbix sender tool.

Looking at the file, we can see it contains only the name of the host, the unixtime stamp, and the actual value to send.

All we need to do is make our script send this file to the correct item in Zabbix.

Manual template and item creation

The last step will be our cleanup. We decided that we would start dirty with a one-on-one data import from Nagios. This means hostnames, item names, and trigger names are imported straight from Nagios. No templates will be created in Zabbix by the tool either, skipping the Zabbix best practice to use templates for all hosts.

We did this to make the initial import easier and not go overboard with scripting. It’s easier to have a messy Zabbix to clean up than to script everything perfectly in Python. Time is valuable.

What we did afterwards is create all the templates manually to take over the items as is from the hosts. For example, we can translate the ICMP ping and TCP stuff easily into a template.

After doing so, we do end up with some bad looking templates, but we can now start cleaning up.

We can also start creating normal trigger names and clean up…

…while changing our dynamic port names for something more expected as well.

And that’s it!

The post Migrating Nagios to Zabbix: Lessons Learned appeared first on Zabbix Blog.

Understanding HTTP Template Authorization in AWS

Post Syndicated from evgenii.gordymov original https://blog.zabbix.com/understanding-http-template-authorization-in-aws/30856/

Authorization in Amazon Web Services (AWS) determines what actions a user, service, or system can perform on resources. It answers the question: “Does this identity have permission to do this action on that resource?”

In AWS, authorization is primarily handled through:

  • IAM (Identity and Access Management) policies
  • Resource-based policies (like S3 bucket policies)
  • Session-based permissions (like STS AssumeRole)

What authorization types are available in Zabbix AWS templates?

  • Access key authorization
  • Role-based authorization
  • Assume role authorization

Let’s look briefly at each of them.

Before using the template, you need to create an IAM policy that grants the necessary permissions for the AWS services the template will interact with.

This policy defines what actions are allowedon which resources, and optionally, under which conditions. Once created, the policy should be attached to the IAM role or user that will run the template.

IAM policy for Zabbix

Add the following required permissions to your Zabbix IAM policy in order to collect metrics. The policy can change when new metrics and services are added in Zabbix templates.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "cloudwatch:DescribeAlarms",
                "cloudwatch:GetMetricData",
                "ec2:DescribeInstances",
                "ec2:DescribeVolumes",
                "ec2:DescribeRegions",
                "rds:DescribeEvents",
                "rds:DescribeDBInstances",
                "ecs:DescribeClusters",
                "ecs:ListServices",
                "ecs:ListTasks",
                "ecs:ListClusters",
                "s3:ListAllMyBuckets",
                "s3:GetBucketLocation",
                "s3:GetMetricsConfiguration",
                "elasticloadbalancing:DescribeLoadBalancers",
                "elasticloadbalancing:DescribeTargetGroups",
                "ec2:DescribeSecurityGroups",
                "lambda:ListFunctions"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

To create and attach the policy:

  • Go to IAM → Policies → Create policy
  • Choose JSON and paste your policy
  • Review and create the policy

Access key authorization

1. Attach the required policy to the IAM user

  • Go to IAM → Users → Select a user → Permissions tab
  • Click Attach policies
  • Select the policy you created before (IAM Policy for Zabbix)
  • Click Attach policy

2. Get your access key and secret access key

In the AWS console:

  • Go to IAM → Users → Select a user → Security credentials tab

  • Click Create access key

  • Copy:

    • Access key ID
    • Secret access key

⚠ Never expose your keys publicly!

3. Configure AWS CLI

Open your terminal and run:

configure aws cli

aws configure --profile zabbix_user

You’ll be prompted to enter:

AWS Access Key ID [None]: AKIAXXXXXXXXXXXEXAMPLE
AWS Secret Access Key [None]: asdkjhUSADWDskhjdasd/EXAMPLEKEY
Default region name [None]: eu-central-1
Default output format [None]: json

4. Test it

List all S3 buckets:

aws s3 ls --profile zabbix_user

Get EC2 tags:

Use region where you create instance

aws ec2 describe-instances --region eu-central-1 --query 'Reservations[*].Instances[*].Tags' --profile zabbix_user

If you get this error…

An error occurred (AccessDenied) when calling the DescribeInstances operation: User: arn:aws:iam::123456789010:user/zabbix_user is not authorized to perform: ec2:DescribeInstances on resource: arn:aws:ec2:eu-central-1:123456789010:instance/*

…you need to check the following permission to the role you are using (IAM Policy for Zabbix).

5. Set the following macros in Zabbix:

  • {$AWS.AUTH_TYPE} – set to access_key
  • {$AWS.ACCESS.KEY.ID} – set to your access key ID
  • {$AWS.SECRET.ACCESS.KEY} – set to your secret access key

Security tips

  • Never hardcode access keys in scripts or code.
  • Store them in ~/.aws/credentials, which is protected by file system permissions.
  • Apply least privilege with IAM policies.

Role-based authorization

1. Add the appropriate permission to the role you are using:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::{Account}:role/{RoleNameWithPath}"
        },
        {
            "Effect": "Allow",
            "Action": [
                "theSameAsIAMPolicyForZabbix",
            ],
            "Resource": "*"
        }
    ]
}

2. Add a principal to the trust relationships of the role you are using:

  • Go to IAM → Roles → Select a role → Trust relationships tab
  • Click Edit trust relationship
  • Add a principal to the trust relationships of the role you are using:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "ec2.amazonaws.com"
                ]
            },
            "Action": [
                "sts:AssumeRole"
            ]
        }
    ]
}

⚠ Using role-based authorization is only possible when you use a Zabbix server or proxy inside AWS.

3. Attach the role to the instance

  • Go to EC2 → Instances → Select an instance → Actions → Security → Modify IAM role
  • Select the role you created before which has the policy attached (IAM Policy for Zabbix)
  • Click Apply

4. Test it

Connect to ES2 ssh terminal in instance and run:

  • Go to EC2 → Instances → Select an instance → Connect → SSH client

Example:

ssh -i "zabbix_user.pem" [email protected]

Get caller identity:

aws sts get-caller-identity

Get token for metadata service:

export TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

Get IAM role from metadata service:

curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials

Get IAM role credentials from metadata service using role name from instance metadata service (see Get IAM role from metadata service):

curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials/<<--role_name-->>

6. Set the following macros in Zabbix:

  • {$AWS.AUTH_TYPE} – set to role_base
  • {$AWS.ASSUME.ROLE.ARN} – set to your role ARN

Assume role authorization

This method has two options:

  • Using access key authorization for getting creds for assume role
  • Using role-based authorization for getting creds for assume role

Lets look first at using access key authorization for getting creds for assume role.

Using access key authorization for getting creds for assume role

1. Create access key for user (see Access Key Authorization)

2. Add the appropriate permission in role you are using:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::{Account}:user/{UserName}"
        },
        {
            "Effect": "Allow",
            "Action": [
                "theSameAsIAMPolicyForZabbix",
            ],
            "Resource": "*"
        }
    ]
}

3. Add principal to the trust relationships of the role you are using:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{Account}:user/{UserName}"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}


4. Test It

Get assume role credentials using access key authorization

aws sts assume-role --role-arn arn:aws:iam::123456789010:role/Zabbix_Role --role-session-name test-session --profile zabbix_user

An example of response:

{
    "Credentials": {
        "AccessKeyId": "ASDFGHJKLEXAMPLE",
        "SecretAccessKey": "QowihdwoieuoinflksnliooEXAMPLE",
        "Expiration": "2029-09-09T22:22:22+00:00"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "ASDFGHJKLEXAMPLE:test-session",
        "Arn": "arn:aws:sts::123456789010:assumed-role/Zabbix_Role/test-session"
    }
}

5. Set the following macros in Zabbix:

  • {$AWS.AUTH_TYPE} – set to assume_role
  • {$AWS.ACCESS.KEY.ID} – set to your access key ID
  • {$AWS.SECRET.ACCESS.KEY} – set to your secret access key
  • {$AWS.ASSUME.ROLE.ARN} – set to your role ARN
  • {$AWS.ASSUME.ROLE.AUTH.METADATA} – set to false

Getting credentials for assume role using cross-account role (best practice)

1. Create role (see Role-Based Authorization)

2. Add the appropriate permission to the role you are using:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::{Account}:role/{RoleNameWithPath}"
        },
        {
            "Effect": "Allow",
            "Action": [
                "theSameAsIAMPolicyForZabbix",
            ],
            "Resource": "*"
        }
    ]
}

3. Add the principal to the trust relationships of the role you are using:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{Account}:role/{RoleNameWithPath}"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

⚠ Using assume role with role-based authorization is only possible when you use a Zabbix server or proxy inside AWS.

4. Test it

Connect to ES2 ssh terminal in the instance and run:

  • Go to EC2 → Instances → Select an instance → Connect → SSH client

Get assume role credentials using role name from instance metadata service:

aws sts assume-role --role-arn arn:aws:iam::123456789010:role/NewRole --role-session-name test-session

An example of response:

{
    "Credentials": {
        "AccessKeyId": "ACCESS_KEY_ID",
        "SecretAccessKey": "SECRET_ACCESS_KEY",
        "SessionToken": "SESSION_TOKEN",
        "Expiration": "EXPIRATION_DATE"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "ASSUMED_ROLE_ID",
        "Arn": "arn:aws:sts::ACCOUNT_ID:assumed-role/ROLE_NAME/SESSION_NAME"
    }
}

5. Set the following macros in Zabbix:

  • {$AWS.AUTH_TYPE} – set to assume_role
  • {$AWS.ASSUME.ROLE.ARN} – set to your role ARN
  • {$AWS.ASSUME.ROLE.AUTH.METADATA} – set to true

Well done! You have successfully configured AWS authorization in Zabbix AWS templates.

Now you can use the template to collect metrics from AWS.

The post Understanding HTTP Template Authorization in AWS appeared first on Zabbix Blog.

How to Install Zabbix on Windows with a Linux Subsystem

Post Syndicated from Alexander Petrov-Gavrilov original https://blog.zabbix.com/how-to-install-zabbix-on-windows-with-a-linux-subsystem/30311/

It’s a very well known fact that Zabbix can only be installed on Linux. But what if you are in a Windows environment and getting a Linux machine is not so simple or even possible? This can obstruct the implementation of Zabbix, or at least significantly delay it. Not only that, building a POC outside of the future environment makes data procurement a lot more complicated. Is there a way to work around this and get Zabbix as close to Windows as we possibly can?

WSL

WSL/WSL 2 is a fast and easy solution for installing and using Zabbix in a smaller Windows-dominant environment, be that a POC or a small company office. WSL 2 runs a real Linux kernel in a lightweight VM while being optimized for Windows. This means a faster start, lower resource consumption, and the ability to share files with Windows directly, meaning you can use Windows File explorer to find and manage the VM files.

WSL 2 also allows you to use Linux CLI while working with Windows (i.e. running vim from a Windows terminal and editing Windows files directly). At this point, you may be asking yourself, “Why not Hyper-V and VirtualBox?” Those are definitely options too, but they are quite heavy on system resources. In addition, boot times are a bit longer and sharing files between a host and a guest OS is clunkier.

Maybe Docker Desktop then? It’s an absolutely valid option, but that would require a bit of Docker knowledge and you would still be using WSL, technically speaking. So, with that said, WSL is definitely the fastest and most reliable way to sprung a Zabbix instance in a Windows-focused environment.

We will use WSL 2, but as a note WSL 1 is also available. Here are the differences:

  • WSL 2 is usually the better performer overall, especially for dev environments. It also has better Linux compatibility.
  • WSL 1 Linux files aren’t isolated, which can make them more accessible. In WSL 2, Linux runs in a virtual disk (ext4), so Linux and Windows files are more separate. Integration is still pretty good, however.
  • WSL 2 has better Linux compatibility – systemd, iptables, etc.
  • WSL 1 shared the same IP as Windows, WSL 2 is a VM – some networking required.
  • With WSL 1 you can see Linux running processes in Task Manager. WSL 2 will have processes isolated.

Installing Zabbix using WSL

Install WSL

Open PowerShell as an Administrator and run:

PS C:\Windows\system32> wsl --install

If you’ve already have WSL 1 installed, update it:

PS C:\Windows\system32> wsl --update

You can also set WSL 2 as default:

PS C:\Windows\system32> wsl --set-default-version 2
WSL installation
WSL installation

 

Install/Get preferred Linux Distribution using either Microsoft Store (i.e. Ubuntu, Debian, Oracle Linux) or just download directly. I will be using Oracle Linux 9.4.

Microsoft store WSL images
Microsoft store WSL images

 

You can also download the RootFS tarball from the preferred distribution portal, but then the process will be a bit different. Create a folder using PowerShell:

PS C:\Windows\system32> mkdir C:\WSL\OracleLinux9

Copy the .tar.xz file to this folder, then run:

PS C:\Windows\system32> wsl --import OracleLinux9 C:\WSL\OracleLinux9 .\oraclelinux9-rootfs.tar.xz --version 2

After the image is installed or imported, start Oracle Linux using PowerShell:

PS C:\Windows\system32> oraclelinux94

When installation is finished, there is a prompt to create a default UNIX user account and password for the said user, as the username does not need to match your Windows username. I’ll set it to “zabbix” of course, but you can set it to any other.

PS C:\Windows\system32> Enter new UNIX username: 
PS C:\Windows\system32> zabbix
PS C:\Windows\system32> New password: <your-password>
PS C:\Windows\system32> passwd: all authentication tokens updated successfully.
PS C:\Windows\system32> Installation successful!

Now OracleLinux is ready for use!

Prepare the system

You will be immediately logged in to the new environment. If logged out, to log in again just execute in PowerShel:

PS C:\Windows\system32> oraclelinux94

Being logged in, first double check that your selected OS is indeed installed by executing in the PowerShell, which will now serve as your VM CLI access point:

[zabbix@PC-NAME ~]$ cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="9.4"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Oracle Linux Server 9.4"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:9:4:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL=https://github.com/oracle/oracle-linux

Confirmation received, make sure all OS updates are installed:

[zabbix@PC-NAME ~]$ sudo dnf update -y

When the update process is finished, you will need to decide whether you would like to use systemd or not (this may increase booting time). I will enable systemd. To do this, edit the wsl.conf on the Linux subsystem:

vi /etc/wsl.conf

Add to the newly created file:

[boot]
systemd=true

Reboot the images (this command will reboot all of them):

PS C:\Windows\system32> wsl.exe --shutdown

Start back your Linux distribution:

PS C:\Windows\system32> oraclelinux94

Install Zabbix database

We will need to prepare the database engine. Again, any preferred database engine can be used, in this case I install and configure MariaDB:

[zabbix@PC-NAME ~]$ sudo dnf install -y mariadb-server mariadb
[zabbix@PC-NAME ~]$ sudo systemctl enable --now mariadb

Confirm MariaDB is running:

[zabbix@PC-NAME ~]$ Systemctl status mariadb

mariadb.service - MariaDB 10.5 database server
     Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; preset: disabled)
     Active: active (running) since Tue 2025-04-29 12:39:54 EEST; 3min 55s ago
       Docs: man:mariadbd(8)
             https://mariadb.com/kb/en/library/systemd/
   Main PID: 235 (mariadbd)
     Status: "Taking your SQL requests now..."
      Tasks: 9 (limit: 26213)
     Memory: 109.6M
     CGroup: /system.slice/mariadb.service
             └─235 /usr/libexec/mariadbd --basedir=/usr

After confirmation, secure it a bit by creating a root password and selecting the options in bold:

[zabbix@PC-NAME ~]$ sudo mysql_secure_installation 

Enter current password for root (enter for none):
OK, successfully used password, moving on...

Setting the root password or using the unix_socket ensures that nobody
can log into the MariaDB root user without the proper authorisation.

You already have your root account protected, so you can safely answer 'n'.

Switch to unix_socket authentication [Y/n] n
 ... skipping.

You already have your root account protected, so you can safely answer 'n'.

Change the root password? [Y/n] Y
New password:
Re-enter new password:
Password updated successfully!
Reloading privilege tables..
 ... Success!

Remove anonymous users? [Y/n] Y
 ... Success!

Disallow root login remotely? [Y/n] Y
 ... Success!

Remove test database and access to it? [Y/n] Y


Reload privilege tables now? [Y/n] Y

 ... Success!

Cleaning up...

All done!  If you've completed all of the above steps, your MariaDB
installation should now be secure.

Thanks for using MariaDB!

Now to create the Zabbix database. Log in to MariaDB:

[zabbix@PC-NAME ~]$ sudo mysql -u root -p 
[zabbix@PC-NAME ~]$ Enter password: <enter your password, won’t be visible>

Follow the steps from the Zabbix installation page:

MariaDB [(none)]> create database zabbix character set utf8mb4 collate utf8mb4_bin;
MariaDB [(none)]> create user zabbix@localhost identified by '<custom-password>';
MariaDB [(none)]> grant all privileges on zabbix.* to zabbix@localhost;
MariaDB [(none)]> set global log_bin_trust_function_creators = 1;

MariaDB [(none)]> quit;

Installing Zabbix

Install the Zabbix repository:

[zabbix@PC-NAME ~]$ sudo dnf install https://repo.zabbix.com/zabbix/7.0/centos/9/x86_64/zabbix-release-latest-7.0.el9.noarch.rpm
[zabbix@PC-NAME ~]$ dnf clean all

Proceed to install the Zabbix server, frontend, and agent:

[zabbix@PC-NAME ~]$ sudo dnf -y install zabbix-server-mysql zabbix-web-mysql zabbix-apache-conf zabbix-sql-scripts zabbix-selinux-policy zabbix-agent
...
[zabbix@PC-NAME ~]$zabbix-agent-7.0.12-release1.el9.x86_64 zabbix-apache-conf-7.0.12-release1.el9.noarch zabbix-selinux-policy-7.0.12-release1.el9.x86_64  zabbix-server-mysql-7.0.12-release1.el9.x86_64 zabbix-sql-scripts-7.0.12-release1.el9.noarch
 zabbix-web-7.0.12-release1.el9.noarch zabbix-web-deps-7.0.12-release1.el9.noarch zabbix-web-mysql-7.0.12-release1.el9.noarch

Complete!

Now import the initial database schema:

[zabbix@PC-NAME ~]$ zcat /usr/share/zabbix-sql-scripts/mysql/server.sql.gz | mysql -u zabbix -p zabbix
Enter password: <enter your DB user password and wait until you will see the next line appear>
[root@ZBX-5CD3221K14 zabbix]#

Disable the log_bin_trust_function_creators option after import has finished:

# mysql -uroot -p
password
MariaDB [(none)]>  set global log_bin_trust_function_creators = 0;
MariaDB [(none)]>  quit;

Add your Zabbix user database password to the Zabbix server configuration file:

[zabbix@PC-NAME ~]$ vi /etc/zabbix/zabbix_server.conf
### Option: DBPassword
#       Database password.
#       Comment this line if no password is used.
#
# Mandatory: no
# Default:
DBPassword=<your-DB-user-password>

Start the Zabbix server and frontend and add them to autorun:

[zabbix@PC-NAME ~]$  systemctl restart zabbix-server zabbix-agent httpd php-fpm
[zabbix@PC-NAME ~]$  systemctl enable zabbix-server zabbix-agent httpd php-fpm
Created symlink /etc/systemd/system/multi-user.target.wants/zabbix-server.service → /usr/lib/systemd/system/zabbix-server.service.
Created symlink /etc/systemd/system/multi-user.target.wants/zabbix-agent.service → /usr/lib/systemd/system/zabbix-agent.service.
Created symlink /etc/systemd/system/multi-user.target.wants/httpd.service → /usr/lib/systemd/system/httpd.service.
Created symlink /etc/systemd/system/multi-user.target.wants/php-fpm.service → /usr/lib/systemd/system/php-fpm.service.

Installation of the backend is now finished, but we still need the frontend.

Exposing and installing the Zabbix frontend for WSL

Since WSL2 does not expose services to localhost by default, you need to determine the WSL IP:

[zabbix@PC-NAME ~]$ ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:15:5d:47:32:c6 brd ff:ff:ff:ff:ff:ff
    inet 172.29.128.155/20 brd 172.29.143.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::215:5dff:fe47:32c6/64 scope link
       valid_lft forever preferred_lft forever

Look for an IP like 172.x.x.x, then using your browser go to:

http://<WSL_IP>/zabbix

In this example, that would be: 

http://172.29.128.155/zabbix

You can also port forward WSL to localhost with netsh in PowerShell:

PS C:\Windows\system32> netsh interface portproxy add v4tov4 listenport=8080 listenaddress=127.0.0.1 connectport=80 connectaddress=<WSL_IP>

Then you will be able to access Zabbix from http://localhost:8080/zabbix. Now, just finish the standard frontend setup and Zabbix is ready to use!

WSL advantages

Some extra advantages you get with this approach include clearer resource usage visibility:

WSL Task manager

 

Direct access to the Linux subsystem files through File explorer with your favorite Windows tools:

Linux subsystem file explorer
Linux subsystem file explorer

 

As you can see, docker is here as well. System and configuration files are also visible and editable:

File explorer Zabbix config files
File explorer Zabbix config files

 

Now you can proceed with building your Zabbix or Zabbix POC, (almost) without needing to leave your regular Windows environment!

The post How to Install Zabbix on Windows with a Linux Subsystem appeared first on Zabbix Blog.

Streamline Operational Troubleshooting with Amazon Q Developer CLI

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/devops/streamline-operational-troubleshooting-with-amazon-q-developer-cli/

Amazon Q Developer is the most capable generative AI–powered assistant for software development, helping developers perform complex workflows. Amazon Q Developer command-line interface (CLI) combines conversational AI with direct access to AWS services, helping you understand, build, and operate applications more effectively. The Amazon Q Developer CLI executes commands, analyzes outputs, and provides contextual recommendations based on best practices for troubleshooting tools and platforms available on your local machine.

In today’s cloud-native environments, troubleshooting production issues often involves juggling multiple terminal windows, parsing through extensive log files, and navigating numerous AWS console pages. This constant context-switching delays problem resolution and adds cognitive burden to teams managing cloud infrastructure.

In this blog post, you will explore how Amazon Q Developer CLI transforms the troubleshooting experience by streamlining challenging scenarios through conversational interactions.

The Traditional Troubleshooting Experience

When issues arise, engineers typically spend hours manually examining infrastructure configurations, reviewing logs across services, and analyzing error patterns. The process requires switching between multiple interfaces, correlating information from various sources, and deep AWS knowledge. This complex workflow often extends problem resolution from hours into days and increase the burden on the infrastructure teams.

Solution: Amazon Q Developer CLI

Amazon Q Developer CLI streamlines the entire troubleshooting process, from initial investigation to problem resolution, making complex AWS troubleshooting accessible and efficient through simple conversations.

How Amazon Q Developer CLI works:

  • Natural Language Interface: Execute AWS CLI commands and interact with AWS services using conversational prompts
  • Automated Discovery: Map out infrastructure and analyze configurations
  • Intelligent Log Analysis: Parse, correlate, and analyze logs across services
  • Root Cause Identification: Pinpoint issues through AI-powered reasoning
  • Guided Remediation: Implement fixes with minimal human intervention
  • Validation: Test solutions and explain complex issues simply

One of the built-in tools within the Amazon Q Developer CLI, use_aws, enables natural language interaction with AWS services, as shown in Figure 1. This tool leverages the AWS CLI permissions configured on your local machine, allowing secure and authorized access to your AWS resources.

A command line interface showing a list of tools and their permissions. The display is titled "/tools" and shows several built-in tools including execute_bash, fs_read, fs_write, report_issue, and use_aws. Each tool has an associated permission level indicated by asterisks. The use_aws tool is highlighted with "trust read-only commands" permission. At the bottom, there's a note stating "Trusted tools will run without confirmation" and a tip to "Use /tools help to edit permissions".

Figure 1: Tools selection in Amazon Q Developer CLI

Real-World Troubleshooting Scenario

Demonstration Environment Setup

This demonstration was performed with the following environment configuration:

The environment includes a local development machine with necessary tools, appropriate AWS account permissions, and terminal access. By starting Amazon Q Developer CLI in the project directory, it has immediate access to relevant code and configuration files.

Scenario: Troubleshooting NGINX 5XX Errors

The scenario demonstrates troubleshooting a multi-tier application architecture as shown in figure 2 deployed on Amazon ECS Fargate with:

  • Application Load Balancer (ALB) distributing traffic across availability zones
  • NGINX reverse proxy service handling incoming requests
  • Node.js backend service processing business logic
  • Service discovery enabling internal communication
  • CloudWatch Logs providing centralized logging

An AWS cloud architecture diagram showing the flow of traffic from an Internet user through multiple components. The diagram includes: At the top: An Internet user connecting to an Internet Gateway Within a VPC (Virtual Private Cloud): Two public subnets containing a NAT Gateway and Application Load Balancer Two private subnets within an ECS Cluster containing: An NGINX service (Fargate) A Backend service (Fargate) A 10-second timeout between them A Cloud Map Service Discovery component at the bottom CloudWatch Logs integration on the right side The diagram includes a note about gateway timeouts: "504 Gateway Timeout - Backend takes 15s to respond, NGINX timeout is 10s" All components are connected with arrows showing the flow of traffic and data through the system. The infrastructure follows AWS best practices with public and private subnet separation for security.

Figure 2: AWS Architecture diagram for the app used in this blog post

Traditional Troubleshooting Steps

For the architecture in figure 2, when 502 Gateway Timeout errors occur, traditional troubleshooting requires:

  1. Checking ALB target group health
  2. Examining ECS service status across multiple consoles
  3. Analyzing CloudWatch logs from different log groups
  4. Correlating error patterns between services
  5. Reviewing infrastructure code for configuration issues
  6. Implementing and deploying fixes

Amazon Q Developer CLI Approach

Instead, let’s see how Amazon Q Developer CLI handles this systematically, step by step:

Step1: Initial Problem Report

Amazon Q Developer CLI is provided with the initial prompt as a problem statement within the application project directory as shown in the following screenshot in figure 3. Amazon Q Developer responds back and says it is going investigate the 502 Gateway Timeout errors in the NGINX application.

Prompt:

Our production NGINX application is experiencing 502 Gateway Timeout errors. 
I have checked out the application and infrastructure code locally and the AWS CLI 
profile 'demo-profile' is configured with access to the AWS account where the 
infrastructure and application is deployed to. Can you help investigate and diagnose the issue?

A Visual Studio Code window showing a debugging session for an NGINX application. The interface has three main sections: a file explorer on the left showing project files including 'app.ts' and 'nginx-config-task.json', a terminal tab in the center displaying an "Amazon Q" ASCII art logo, and a conversation where a user is reporting 502 Gateway Timeout errors. The terminal shows AWS CLI command execution using a tool called "use_aws" with parameters including the service name "ecs" and region "us-west-2". The interface has red annotations highlighting key areas like "project files", "User provided initial prompt", and "Q CLI executing AWS CLI calls.

Figure 3: Amazon Q Developer CLI with initial prompt and problem statement

Step2: Systematic Infrastructure Discovery

Amazon Q Developer CLI start to systematically discovering the infrastructure as shown in the following screenshot in figure 4. If you see the initial prompt did not include that the app is hosted on ECS, but Amazon Q Developer CLI understood the context and executes the AWS CLI calls to describe the Cluster and the services within it. It made sure that the ECS tasks are running for both the services within the Cluster. It is a key discovery that both services show healthy status (1/1 desired count), indicating the issue isn’t service availability.

A terminal window showing three sequential AWS CLI commands being executed through a "use_aws" tool: First command: "list-clusters" operation for ECS service in us-west-2 region using demo-profile, completing in 1.244 seconds Second command: "list-services" operation targeting the NginxSimulationCluster, completing in 0.877 seconds with confirmation of finding both nginx-service and backend-service Third command: "describe-services" operation examining both services in detail, completing in 0.968 seconds with confirmation that both services are running as expected (1/1 desired count) Each command includes execution details, parameters, and completion status, with the system preparing to check CloudWatch logs next.

Figure 4: AWS Infrastructure discovery by Amazon Q Developer CLI

Step 3: Intelligent Log Analysis

Amazon Q Developer CLI retrieves and analyzes recent CloudWatch logs from the NGINX container, immediately identifying the critical error pattern as shown in the following screenshot in figure 5, where Amazon Q Developer responds: “Perfect! I found the issue. The NGINX logs show clear 504 gateway timeout with upstream timeout messages.”

A terminal window showing two AWS CloudWatch Logs commands being executed: First command: "describe-log-streams" operation for the "/ecs/nginx-service" log group, limiting to 5 most recent entries, ordered by LastEventTime in descending order Second command: "get-log-events" operation retrieving 50 log entries from a specific NGINX container log stream The output reveals a critical error message highlighted at the bottom showing an upstream timeout (error 110) occurring while reading response headers. The error details include client IP 10.0.0.247, upstream server at http://10.0.3.18:3000/, and host 52.35.62.210.

Figure 5: CloudWatch Log analysis by Amazon Q Developer CLI

Step 4: Amazon Q Developer CLI Analysis and Root Cause Identification

Amazon Q Developer examines backend service logs and discovers a mismatch between the backend service response time and NGINX timeout settings, as seen in the following screenshot in figure 6.

A terminal window showing AWS CloudWatch Logs commands and their output. The first command describes log streams for a backend container, and the second retrieves log events. The output reveals a debugging analysis showing that while health checks work fine, regular requests are being delayed by about 15 seconds, causing NGINX timeout issues. The log group is "/ecs/backend-service" in the us-west-2 region using a demo-profile.

Figure 6: Root cause identification by Amazon Q Developer CLI

Step 5: Amazon Q Developer CLI Root Cause Analysis

Amazon Q Developer CLI examines the ECS task definitions to identify the exact configuration mismatch, as shown in the following screenshot in figure 7. Amazon Q Developer finds that:

  • Backend service is configured with response_delay=15000 (15 secs)
  • NGINX proxy is configured with proxy_read_timeout 10s

This mismatch causes 504 gateway timeout errors when the backend response exceeds NGINX’s timeout threshold.

A terminal window showing two AWS CLI commands to describe ECS task definitions in the us-west-2 region. Below the commands is a highlighted "Root Cause Analysis" section that explains a timeout mismatch: the backend service is configured with a 15-second response delay while NGINX has a 10-second proxy timeout, resulting in 502 Gateway Timeout errors. Both commands use a demo-profile and are labeled as checking timeout and response delay configurations.

Figure 7: Root cause analysis and issue detection by Amazon Q Developer CLI

Step 6: Automated Code Fix

Here’s where Amazon Q Developer CLI truly excels—it doesn’t just diagnose; it implements the fix. Since Amazon Q Developer CLI is started within the project where the CDK code for ECS task definition is defined, it identified the code configuration and also modified it, as shown in the following screenshot in figure 8.

A terminal window showing file operations using fs_read and fs_write tools. The code changes show an NGINX configuration update in ecs-nginx-cdk.ts, where the proxy_read_timeout is being modified from '10s' to '20s'. The file also shows additional timeout configurations being added, including proxy_connect_timeout and proxy_send_timeout. The update is confirmed with a user prompt and completed in 0.2 seconds.

Figure 8: CDK code fix by Amazon Q Developer CLI

Step 7: Deployment

Amazon Q Developer CLI builds and deploys the fix by executing cdk synth and cdk deploy using the ‘demo-profile‘ AWS CLI profile that was initially provided in the prompt, as shown in the following screenshot in figure 9.

A terminal window showing two execute_bash commands running in sequence. The first command builds a CDK project using 'npm run build' in the nginx-app directory, completing in 4.102s. The second command deploys the updated CDK stack using 'cdk deploy' with the demo-profile, showing deployment progress including some warnings about minHealthyPercent configurations and CloudFormation stack updates in us-west-2 region.

Figure 9: CDK code build and deployment by Amazon Q Developer CLI

Step 8: Validation

Amazon Q Developer CLI validates the solution by sending a curl request to the ALB endpoint after the successful deployment, as shown in the following screenshot in figure 10.

A terminal window showing the execution of a curl command to test an NGINX application on AWS. The command targets an Elastic Load Balancer in the us-west-2 region. The response shows a successful HTTP 200 OK status after 14 seconds, with a JSON response containing the message "Hello from backend". The test completes in 15.100 seconds, indicating the fix for previous 502 errors was successful.

Figure 10: Fix validation by Amazon Q Developer CLI

In addition to that, Amazon Q Developer also sends a request to the health check endpoint and validates everything is working after the fix was deployed, as shown in the following screenshot in figure 11.

A terminal screenshot showing the results of a health check on an Nginx server using curl. The command executed shows a successful response with "healthy" status, completing in 0.65 seconds. The output displays various metrics including download speed (386 B/s), 100% completion rate, and timing statistics for real, user, and system processes.

Figure 11: Health endpoint validation by Amazon Q Developer CLI

What Amazon Q Developer CLI Accomplished

Using just conversational commands, Amazon Q Developer CLI performed a complete troubleshooting cycle:

  • Infrastructure Discovery: Automatically mapped ECS clusters, services, and dependencies
  • Log Correlation: Analyzed thousands of log entries across multiple services
  • Root Cause Analysis: Identified exact configuration mismatch between NGINX’s timeout (10s) and the backend’s response delay (15s)
  • Code-Level Diagnosis: Located problematic timeout setting in CDK infrastructure code
  • Automated Implementation: Modified infrastructure code to increase the NGINX timeout
  • End-to-End Deployment: Built, deployed, and validated the complete solution
  • Comprehensive Testing: Verified both fix effectiveness and overall system health

Amazon Q Developer CLI handles troubleshooting tasks through a single, conversational interface, eliminating the need for multiple tools or AWS CLI commands.

Conclusion

Amazon Q Developer CLI represents a significant evolution in how we troubleshoot cloud infrastructure issues. By combining natural language understanding with powerful command execution capabilities, it transforms complex troubleshooting workflows into efficient, action-oriented dialogues. Whether you’re dealing with NGINX 5XX errors or similar issues across other AWS services, Amazon Q Developer CLI can help you diagnose issues, implement fixes, and validate solutions—all through a conversational interface that feels natural and intuitive.

Give Amazon Q Developer CLI a try the next time you encounter a troubleshooting challenge, and experience the difference it can make in your operational workflow.

To learn more about Amazon Q Developer’s features and pricing details, visit the Amazon Q Developer product page.

About the Author

kirankumar.jpeg

Kirankumar Chandrashekar is a Generative AI Specialist Solutions Architect at AWS, focusing on Amazon Q Developer. Bringing deep expertise in AWS cloud services, DevOps, modernization, and infrastructure as code, he helps customers accelerate their development cycles and elevate developer productivity through innovative AI-powered solutions. By leveraging Amazon Q Developer, he enables teams to build applications faster, automate routine tasks, and streamline development workflows. Kirankumar is dedicated to enhancing developer efficiency while solving complex customer challenges, and enjoys music, cooking, and traveling.

Accelerate development with secure access to Amazon Q Developer using PingIdentity

Post Syndicated from Sid Vantair original https://aws.amazon.com/blogs/devops/accelerate-development-with-secure-access-to-amazon-q-developer-using-pingidentity/

 Overview

Customers adopting Amazon Q Developer, a generative AI-powered coding companion, often need authentication through existing identity providers like PingIdentity. By leveraging AWS IAM Identity Center, organizations can enable their developers to access Amazon Q Developer with their existing PingIdentity credentials, streamlining authentication and removing the need for separate login procedures. Amazon Q Developer can chat about code, provide inline code completions, and generate new code. It also scans your code for security vulnerabilities and makes code improvements, including language updates, debugging, and optimizations. Amazon Q Developer comes in two tiers. The Free Tier is available at no cost for individual use. The Pro Tier is a paid version offering enterprise access controls, an analytics dashboard, customization, and higher usage limits. Organizations that enable the Pro tier of Amazon Q Developer for their developers typically authenticate with AWS IAM Identity Center. This approach is popular due to its ability to federate with external identity providers. In this blog, we will show you how to set up PingIdentity as an external IdP for IAM Identity Center and allow developers to access Amazon Q Developer using their existing PingIdentity login credentials.

How it works

AWS authentication flow diagram: Developers interact with Amazon Q Developer and AWS IAM Identity Center, integrating with Ping Identity for SAML-based access.

Figure 1 – Solution Overview

The authentication workflow is as follows:

  1. The developer initiates an access request to Amazon Q Developer.
  2. IAM Identity center checks authentication status.
  3. If not authenticated, redirects to PingIdentity login.
  4. Developer provides PingIdentity Credentials.
  5. PingIdentity validates credentials and sends SAML response.
  6. IAM Identity Center verifies the SAML response.
  7. Upon successful verification, grants Amazon Q Developer access.
  8. Developer begins using Amazon Q Developer.

Prerequisites

  • AWS account
  • PingIdentity environment with users and groups already setup for Amazon Q Developer access
  • IAM identity center
  • Pro Tier subscription of Amazon Q Developer

Walkthrough

In this section, we demonstrate how to create a SAML-based connection between PingIdentity and IAM Identity Center, enabling you to access Amazon Q Developer seamlessly using your PingIdentity credentials.

Note: You will need to switch between PingIdentity portal and IAM Identity Center in your browser. We recommend opening a new browser tab for each console.

Step 1: Enable AWS Single Sign-On in PingIdentity

This step involves enabling AWS Single Sign-On application within PingIdentity.

    1. In the PingIdentity console, Navigate to the Applications Tab > Application Catalog
    2. Browse catalog for AWS Single Sign-On and select + to start the Quick Setup.
Screenshot of the PingIdentity Application Catalog interface. The search term "aws" is entered in the search bar, displaying three results: Amazon Web Services – AWS, AWS Gov-Cloud, and AWS Single Sign-On. The "AWS Single Sign-On" option is outlined with a red box and includes a plus button to add the application

Figure 2 – PingIdentity Application Catalog

Alt Text: Screenshot of the PingIdentity Application Catalog interface. The search term “aws” is entered in the search bar, displaying three results: Amazon Web Services – AWS, AWS Gov-Cloud, and AWS Single Sign-On. The “AWS Single Sign-On” option is outlined with a red box and includes a plus button to add the application

    1. Provide Name, SSO Region and SSO Tenant ID and choose Next
      • Name – Input an appropriate name for the connection
      • SSO Region – Input the appropriate region
      • Tenant ID – Identity Store ID
        You can run the following CLI command to retrieve the value. It’s a 10-digit alphanumeric prefixed by “d-“.
aws sso-admin list-instances –query ‘Instances[0].IdentityStoreId’Output: “d-XXXXXXXXXX”
    1. Navigate to PingOne Mappings and select Email Address from the drop down.
Screenshot of the AWS Single Sign-On configuration in PingIdentity. The screen shows Step 2 of the setup process where the SAML attribute SAML_SUBJECT is mapped to the PingOne attribute "Email Address". A red box highlights the mapping section under "PingOne Mappings".

Figure 3 – AWS Single Sign-On attribute mapping

Alt Text: Screenshot of the AWS Single Sign-On configuration in PingIdentity. The screen shows Step 2 of the setup process where the SAML attribute SAML_SUBJECT is mapped to the PingOne attribute “Email Address”. A red box highlights the mapping section under “PingOne Mappings”.

    1. Search and select the group that you have created earlier for enabling access to Amazon Q Developer and select + to add the group.
    2. Choose Save
Screenshot of Step 3 in the AWS Single Sign-On setup process in PingIdentity. The screen shows the group selection interface where the "Amazon Q" group is listed. A plus icon is shown next to the group to add it, and a blue "Save" button is highlighted in the bottom-right corner to confirm the configuration.

Figure 4 – Select PingIdentity directory Groups for Amazon Q Developer access

Alt Text: Screenshot of Step 3 in the AWS Single Sign-On setup process in PingIdentity. The screen shows the group selection interface where the “Amazon Q” group is listed. A plus icon is shown next to the group to add it, and a blue “Save” button is highlighted in the bottom-right corner to confirm the configuration.

Step 2: Connecting PingIdentity with IAM identity Center

This step involves configuring PingIdentity with the AWS IAM Identity Center sign-on details to complete the authentication setup.

  1. In the PingIdentity console, Navigate to the Applications Tab > Applications and select the application you created earlier in Step 1
  2. Select Enable Advanced Configuration and choose Enable.
Screenshot of the PingIdentity Applications dashboard showing the AWS Single Sign-On application selected. The overview panel displays key configuration sections including protocol (SAML), mapped attributes, selected policies, and access group (Amazon Q). The option "Enable Advanced Configuration" is highlighted near the bottom of the panel.

Figure 5 – Enable Advanced configuration for AWS single Sign-On application

Alt Text: Screenshot of the PingIdentity Applications dashboard showing the AWS Single Sign-On application selected. The overview panel displays key configuration sections including protocol (SAML), mapped attributes, selected policies, and access group (Amazon Q). The option “Enable Advanced Configuration” is highlighted near the bottom of the panel.

  1. Scroll down and select Download Metadata. This will save the Metadata file to your local computer, which you will use later during the configuration process.
  2. In another browser tab login to your AWS IAM Identity Center console and Select Choose your identity source.
  3. Under Identity source, select Change identity source from the Actions drop-down menu.
Screenshot of the IAM Identity Center settings page, focused on the "Identity source" tab. The page displays details such as identity source, authentication method, AWS access portal URL, issuer URL, and identity store ID. A dropdown menu labeled "Actions" is expanded in the top-right corner, showing options to "Customize AWS access portal URL" and "Change identity source," highlighted with a red box.

Figure 6 – Change identity source in IAM Identity Center Console

Alt Text: Screenshot of the IAM Identity Center settings page, focused on the “Identity source” tab. The page displays details such as identity source, authentication method, AWS access portal URL, issuer URL, and identity store ID. A dropdown menu labeled “Actions” is expanded in the top-right corner, showing options to “Customize AWS access portal URL” and “Change identity source,” highlighted with a red box.

  1. On the next page, select External identity provider and choose Next.
  2. Under Service provider metadata copy the IAM Identity Center Assertion Consumer Service (ACS) URL.

    Screenshot of the "Configure external identity provider" step in the AWS IAM Identity Center setup process. The screen displays service provider metadata including the AWS access portal sign-in URL, IAM Identity Center Assertion Consumer Service (ACS) URL (highlighted with a red box), and IAM Identity Center issuer URL. A button labeled "Download metadata file" is shown in the upper right.

    Figure 7 – Copy IAM Identity Center ACS URL

Alt Text: Screenshot of the “Configure external identity provider” step in the AWS IAM Identity Center setup process. The screen displays service provider metadata including the AWS access portal sign-in URL, IAM Identity Center Assertion Consumer Service (ACS) URL (highlighted with a red box), and IAM Identity Center issuer URL. A button labeled “Download metadata file” is shown in the upper right.

  1. Now go back to the PingIdentity browser tab and Navigate to the Configuration tab and select pencil icon to edit the details.
  2. Paste the ACS URL you copied from the IAM identity center console and choose Save.
Screenshots showing the configuration and editing of SAML settings for AWS Single Sign-On in PingIdentity. The first image displays the static configuration view, listing the ACS URL, signing key ("PingOne SSO Certificate for Administrators environment"), signing method ("Response"), and signing algorithm. The second image shows the editable configuration screen with the ACS URL input field highlighted in red, alongside dropdowns for selecting the signing key, options for signing method (Assertion, Response, or both), and the RSA_SHA256 signing algorithm. These screens guide users through setting up secure SAML integration with AWS SSO.

Figure 8 – Configuring AWS Single Sign-On SAML Settings in PingIdentity console

Alt Text: Two screenshots showing the configuration and editing of SAML settings for AWS Single Sign-On in PingIdentity. The first image displays the static configuration view, listing the ACS URL, signing key (“PingOne SSO Certificate for Administrators environment”), signing method (“Response”), and signing algorithm. The second image shows the editable configuration screen with the ACS URL input field highlighted in red, alongside dropdowns for selecting the signing key, options for signing method (Assertion, Response, or both), and the RSA_SHA256 signing algorithm. These screens guide users through setting up secure SAML integration with AWS SSO.

Step 3: Configure PingIdentity as external IdP in IAM identity Center

This step involves setting up PingIdentity as an external IdP in IAM Identity Center to enable federated access.

  1. Navigate back to the previous browser tab where you had IAM Identity Center console open.
  2. Upload the downloaded PingIdentity IdP SAML metadata file from step 3 of previous section and select Next.
Screenshot of the AWS Identity Center configuration screen where the user uploads the IdP SAML metadata XML file. The metadata file is shown as successfully selected. Below are empty fields for optional manual entry of IdP sign-in URL, IdP issuer URL, and IdP certificate. The "Next" button is highlighted in orange at the bottom right, indicating the next step in the setup process.

Figure 9 – AWS IAM Identity Center metadata

Alt Text: Screenshot of the AWS Identity Center configuration screen where the user uploads the IdP SAML metadata XML file. The metadata file is shown as successfully selected. Below are empty fields for optional manual entry of IdP sign-in URL, IdP issuer URL, and IdP certificate. The “Next” button is highlighted in orange at the bottom right, indicating the next step in the setup process.

  1. Review the list of changes. Once you are ready to proceed, type ACCEPT, then select Change identity source.

Step 4: Enable provisioning and identity-aware sessions in IAM identity Center

This step involves configuring user provisioning and enabling identity-aware sessions in AWS IAM Identity Center to support dynamic access control.

  1. In IAM Identity Center Console, Choose Settings in the left navigation pane.
  2. On the Settings page, locate and enable automatic provisioning. This immediately enabled automatic provisioning in IAM Identity Center and displays the necessary SCIM endpoint and access token information.
  3. In the Inbound automatic provisioning dialog box, copy each of the values for the following options. You will need to paste these later when you configure provisioning in PingIdentity.
    • SCIM endpoint
    • Access token
  4. Choose Close.
  5. Next enable identity-aware sessions and automatic provisioning.
Two options are displayed for further configuration: "Enable identity-aware sessions" and "Automatic provisioning." Both options have an "Enable" button on the right-hand side, highlighted in red.

Figure 10 – IAM Identity Center Settings for identity aware sessions and automatic provisioning

Alt Text: Two options are displayed for further configuration: “Enable identity-aware sessions” and “Automatic provisioning.” Both options have an “Enable” button on the right-hand side, highlighted in red.

Step 5: Configure connections provisioning in PingIdentity

This step involves setting up connection provisioning in PingIdentity to enable automatic user and group management.

  1. In the PingIdentity console, Navigate to the Integrations > Provisioning.
  2. Select plus icon > New Connection
  3. Under connection type Select Identity Store.
PingIdentity Provisioning configuration screen. The left sidebar highlights the "Provisioning" tab. The main panel shows the "Create a New Connection" dialog with two connection type options: "Identity Store" and "Gateway." The "Identity Store" option is selected using the "Select" button on the right. A plus (+) icon at the top indicates the option to add a new provisioning connection.

Figure 11 – PingIdentity connection provisioning

Alt Text: PingIdentity Provisioning configuration screen. The left sidebar highlights the “Provisioning” tab. The main panel shows the “Create a New Connection” dialog with two connection type options: “Identity Store” and “Gateway.” The “Identity Store” option is selected using the “Select” button on the right. A plus (+) icon at the top indicates the option to add a new provisioning connection.

  1. Select SCIM outbound from the list of options and select Next.
  2. Provide a name for the connection and select Next.
  3. Paste the SCIM endpoint URL into the SCIM BASE URL field.
  4. Navigate to Authentication Method and select OAuth 2 Bearer Token.
  5. Paste the Access token into the Oauth Access Token field.
  6. Select Test Connection to validate the connectivity and select Next.
PingIdentity interface showing the "Configure Authentication" step in the "Create a New Connection" wizard. Key fields include the SCIM Base URL, SCIM Version (2.0), Authentication Method (OAuth 2 Bearer Token), OAuth Access Token (obscured), and resource paths for Users and Groups. The "Test Connection" and "Next" buttons are visible at the bottom.

Figure 12 – Configure authentication details

Alt Text: PingIdentity interface showing the “Configure Authentication” step in the “Create a New Connection” wizard. Key fields include the SCIM Base URL, SCIM Version (2.0), Authentication Method (OAuth 2 Bearer Token), OAuth Access Token (obscured), and resource paths for Users and Groups. The “Test Connection” and “Next” buttons are visible at the bottom.

  1. Navigate to User Filter Expression and change to userName Eq “%s”.
  2. Choose Save. By default, the connection is created in a Disabled state.
Final step in the PingIdentity "Create a New Connection" wizard showing the "Configure Preferences" screen. The highlighted fields include "User Filter Expression" with the value userName Eq "%s", "User Identifier" set to userName, and group membership handling options ("Merge" and "Overwrite" with "Overwrite" selected). A "Save" button is highlighted at the bottom right.

Figure 13 – Edit UserFilter Expressions for the connection

Alt Text: Final step in the PingIdentity “Create a New Connection” wizard showing the “Configure Preferences” screen. The highlighted fields include “User Filter Expression” with the value userName Eq “%s”, “User Identifier” set to userName, and group membership handling options (“Merge” and “Overwrite” with “Overwrite” selected). A “Save” button is highlighted at the bottom right.

  1. Select the connection you created and select the toggle switch to enable the connection.
PingIdentity configuration screen showing the IAM Identity Store integration. The page displays the identity store name, and tabs for "Overview" and "Configuration." A toggle switch in the top-right corner is highlighted, indicating the integration is currently enabled.

Figure 14 – Enable the connection

Alt Text: PingIdentity configuration screen showing the IAM Identity Store integration. The page displays the identity store name, and tabs for “Overview” and “Configuration.” A toggle switch in the top-right corner is highlighted, indicating the integration is currently enabled.

Step 6: Configure rules provisioning in PingIdentity

This step involves setting up provisioning rules in PingIdentity to define how users and groups are synchronized.

  1. In the PingIdentity console, Navigate to the Integrations > Provisioning.
  2. Select plus icon > New Rule
  3. Provide a Name and Description for the rule.
  4. Choose Create.
  5. Select plus icon to select the Connection you created in the previous step.
  6. Choose Save.
Alt Text: Screenshots showing the final steps in connecting the IAM Identity Center to the IAM identity store using PingIdentity. The first image shows the IAM Identity Store connection listed under "Available Connections" with a plus (+) icon to initiate the link. The second image shows the selected connection from the PingOne Directory (P1) as the source and IAM identity store (SCIM) as the target, with the option to "Save" the configuration.

Figure 15 – Add the IAM identity center connection to the rule

Alt Text: Screenshots showing the final steps in connecting the IAM Identity Center to the IAM identity store using PingIdentity. The first image shows the IAM Identity Store connection listed under “Available Connections” with a plus (+) icon to initiate the link. The second image shows the selected connection from the PingOne Directory (P1) as the source and IAM identity store (SCIM) as the target, with the option to “Save” the configuration.

  1. If you want to sync users from your PingIdentity directory, create a user filter. To do so, navigate to User Filter and select pencil icon to edit the settings.
  2. Choose the appropriate filter from the drop down based on your use case and select Save. I have chosen Group Name which has been designated for Amazon Q Developer access.
Screenshot of the "Edit User Filter" interface in IAM Identity Center. The user filter is configured to provision users who belong to a group with names that contain "Amazon Q Developer." The condition logic is set to match if "Any" of the conditions are true.

Figure 16 – PingIdentity user filter

Alt Text: Screenshot of the “Edit User Filter” interface in IAM Identity Center. The user filter is configured to provision users who belong to a group with names that contain “Amazon Q Developer.” The condition logic is set to match if “Any” of the conditions are true.

  1. If you want to sync a group from your PingIdentity directory, create group provisioning. To do so, navigate to Group Provisioning and select pencil icon to edit the settings.
  2. Select the appropriate group which has been designated for Amazon Q Developer access and choose Save.
Screenshot of the "Edit Group Provisioning" screen in IAM Identity Center. The group "Amazon Q Developer" is selected for outbound provisioning. A "Save" button is highlighted in the bottom-left corner.

Figure 17 – PingIdentity Group Provisioning

Alt Text: Screenshot of the “Edit Group Provisioning” screen in IAM Identity Center. The group “Amazon Q Developer” is selected for outbound provisioning. A “Save” button is highlighted in the bottom-left corner.

  1. Navigate to Attribute Mapping and select the pencil icon to edit the settings.
  2. Delete the PingOne Directory attribute Primary Phone.
  3. Add a new attribute and select Username as PingOne Directory and displayName as IAM identity Store.
  4. Choose Save.
Two screenshots showing the editing of attribute mappings in IAM Identity Center. The first image displays default mappings such as 'Email Address' to 'workEmail' and 'Username' to 'userName', with an option to delete or update each field. The second image shows the addition of a new attribute mapping from 'Username' to 'displayName', along with highlighted 'Add' and 'Save' buttons.

Figure 18 – PingIdentity attribute mapping

Alt Text: Two screenshots showing the editing of attribute mappings in IAM Identity Center. The first image displays default mappings such as ‘Email Address’ to ‘workEmail’ and ‘Username’ to ‘userName’, with an option to delete or update each field. The second image shows the addition of a new attribute mapping from ‘Username’ to ‘displayName’, along with highlighted ‘Add’ and ‘Save’ buttons.

  1. Select the rule you created and select the toggle switch to enable the rule.
  2. This automatically provisions the users/groups from PingIdentity to IAM identity Center using SCIM.
IAM Identity Center sync summary showing successful user and group provisioning. The first image highlights two users impacted and successfully synced. The second image highlights one group impacted and successfully synced. Sync status is marked 'ACTIVE' in both views, confirming successful integration between PingOne and AWS IAM Identity Center.

Figure 19 – PingIdentity Users and Groups Sync status using SCIM

Alt Text: IAM Identity Center sync summary showing successful user and group provisioning. The first image highlights two users impacted and successfully synced. The second image highlights one group impacted and successfully synced. Sync status is marked ‘ACTIVE’ in both views, confirming successful integration between PingOne and AWS IAM Identity Center.

Step 7: Provide access to Amazon Q Developer

This step involves locating and subscribing the groups that need permission to use Amazon Q Developer.

  1. In the Amazon Q Developer console, under Subscriptions add the IAM identity center groups which require access to Amazon Q Developer.
  2. Select Subscribe and search for the group name.
  3. Select Assign.
Screenshot of the Amazon Q Developer Subscriptions page in the AWS Management Console. The "Groups" tab is selected, displaying “Amazon Q Developer,” with a subscription status of “Subscribed.” The “Amazon Q Developer” group is highlighted with a red box.

Figure 20 – Amazon Q Developer subscriptions page

Alt Text: Screenshot of the Amazon Q Developer Subscriptions page in the AWS Management Console. The “Groups” tab is selected, displaying “Amazon Q Developer,” with a subscription status of “Subscribed.” The “Amazon Q Developer” group is highlighted with a red box.

Setup Amazon Q Developer with IAM Identity Center

This section guides you through installing the Amazon Q Developer extension and setting up authentication with IAM Identity Center.

  1. To set up Amazon Q Developer extension in your integrated development environment (IDE), complete the steps in AWS documentation.
  2. Once extension is installed Choose Amazon Q icon in your IDE.
  3. Choose a sign-in option.
  4. Select Use with Pro license and choose
  5. Continue.
  6. Provide the Start URL. You can retrieve this AWS access portal URL from the IAM Identity Center Console.
Screenshot of the IAM Identity Center settings page in the AWS Console, displaying the identity source configuration. It shows that the identity source is set to "External identity provider" with SAML 2.0 authentication and SCIM provisioning. The highlighted section includes the AWS access portal URL and the Identity Store ID. The "Settings" tab is selected in the left navigation pane.

Figure 21 – IAM identity center access portal URL

Alt Text: Screenshot of the IAM Identity Center settings page in the AWS Console, displaying the identity source configuration. It shows that the identity source is set to “External identity provider” with SAML 2.0 authentication and SCIM provisioning. The highlighted section includes the AWS access portal URL and the Identity Store ID. The “Settings” tab is selected in the left navigation pane.

  1. Provide the region that hosts the identity directory and choose Continue
  2. Select Open on the resulting pop up which redirects to your browser.
  3. The browser redirects you to the Pingone URL where you enter your PingIdentity credentials and select Sign On.
  4. Upon successful authentication, select Allow access on the resulting pop up to login successfully.
A screen recording of Visual Studio Code where the user selects the Amazon Q icon from the sidebar. The screen transitions to a login prompt indicating that the user must authenticate using their PingIdentity credentials via IAM Identity Center before accessing Amazon Q Developer features. The message highlights that authentication is required to continue.

Figure 22 – Setup Visual Studio Code Amazon Q Developer extension

Alt Text: A screen recording of Visual Studio Code where the user selects the Amazon Q icon from the sidebar. The screen transitions to a login prompt indicating that the user must authenticate using their PingIdentity credentials via IAM Identity Center before accessing Amazon Q Developer features. The message highlights that authentication is required to continue.

Test Configuration

Upon successfully completing the previous step, you can now leverage the code suggestions by Amazon Q Developer.

A screen recording of Visual Studio Code where Amazon Q Developer generates a sample code inline.

Figure 23 – Amazon Q Developer example

Alt Text: A screen recording of Visual Studio Code where Amazon Q Developer generates a sample code inline.

Clean Up

To avoid ongoing charges after testing this solution, follow these steps to remove all provisioned resources:1. Remove PingIdentity Application Configuration

  • In the PingIdentity console, navigate to Applications.
  • Locate and delete the AWS Single Sign-On application that was configured for IAM Identity Center integration.

2. Reset IAM Identity Center Configuration

  • In the AWS IAM Identity Center console:
    • Navigate to Settings > Identity source.
    • Change the identity source back to the default IAM Identity Center directory if no longer using PingIdentity.
    • Remove any external metadata and configuration uploaded during the setup.

3. Revoke Subscriptions and Access

  • In the Amazon Q Developer console:
    • Go to Subscriptions and remove assigned groups such as Amazon Q Developer or code whisperer trial.
    • This will deactivate access and prevent any future charges tied to those subscriptions.

4. Remove Amazon Q Developer Extension

  • If desired, uninstall the Amazon Q Developer extension from Visual Studio Code to fully revert the development environment.

Conclusion

In this post, we demonstrated how to use existing PingIdentity credentials to access Amazon Q Developer through integration with IAM Identity Center. We provided a step-by-step guide for configuring PingIdentity as an external identity provider (IdP) with IAM Identity Center. Lastly, we demonstrated how to connect Amazon Q Developer extension within your IDE to AWS using your PingIdentity credentials, allowing seamless access to Amazon Q Developer.If you have any comments or questions, share them in the comments section.

To learn more about AWS Services

Amazon Q Developer

IAM Identity Center

AWS Toolkit for Visual Studio Code


About the author

Sid Vantair is a Solutions Architect with AWS covering Strategic accounts. He thrives on resolving complex technical issues to overcome customer hurdles. Outside of work, he cherishes spending time with his family and fostering inquisitiveness in his children.

Podman Container Monitoring with Prometheus Exporter, part 2

Post Syndicated from Janis Eidaks original https://blog.zabbix.com/podman-container-monitoring-with-prometheus-exporter-part-2/30538/

In the first part of this post, we explored how to get data with HTTP agent from the Prometheus Podman exporter and use the same item data for the Podman pods Discovery rule as well as item and trigger prototypes. In part 2 of the same series, we’ll learn how to discover and monitor Podman containers.

Creating a template discovery rule

I will create another discovery rule for container discovery. This discovery rule is also based on the same item [Podman info] in the template – Podman containers by HTTP and Prometheus (you can check part one of this series to find out how to configure it). The parameters of the discovery rule are shown below. This discovery rule will allow us to discover the pod name and ID.

Template: Podman containers by HTTP and Prometheus

▲ Discovery rule
  ▪ Name:                   Container discovery
  ▪ Type                    Dependent item
  ▪ Key:                    training.containers.discovery
  ▪ Master item             Podman containers by HTTP and Prometheus: Podman info
  ▪ Delete lost resources  After 10d
  ▪ Disable lost resources Immediately
♯ Preprocessing
  ▪ Prometheus to JSON     podman_container_info
♦ LLD Macros
  ▪ {#CONTAINER.ID}        $.labels.id
  ▪ {#CONTAINER.NAME}      $.labels.name
Fig 1. Discovery rule: Container discovery
Fig 2. Discovery rule: Container discovery preprocessing tab
Fig 3. Discovery rule: Container discovery LLD macros tab

Next, different dependent item prototypes are created in this container discovery rule. As the Prometheus Podman exporter provides a lot of different metrics about the containers, I will create multiple such items: state, health, creation date, input/output network traffic information, and so on. So, check out what metrics can be acquired and use what is relevant for you.

You can also add a description of each item prototype. I am interested only in metrics with the discovered container ID macros, and I am not interested in what values are for the other fields, such as pod_id, pod_name, so I use ~”.*”, which matches any value. I will show the item prototype configuration screenshots of one of the item prototypes.

These item prototypes are similar to each other, with some minor differences, such as Prometheus patterns, or in some cases, with a different master item (item prototype as master item).

Fig 4. Discovery rule preprocessing step: Prometheus to JSON with pattern podman_container_info
Fig 5. Discovery rule LLD macros: assigning relevant JSONPATH to LLD macros

Creating a template discovery rule: Item prototypes

After the containers have been discovered, we have to create item prototypes. These prototypes will also be dependent item prototypes and will use the same item as the discovery rule: Podman info. Prometheus Podman exported returns a lot more metrics for the containers than it did for the pods.

You can get container metrics such as container health, state, creation date, disk read/write, memory usage, network usage, and more. In this blog post, I have added most of them, so check what metrics are relevant to your monitoring needs and start monitoring.

Fig 6. Low-level discovery rule and item prototypes based on the same item.

The screenshots of the item prototype is shown below.

Fig 7. Container state item prototype tab
Fig 8. Container state item prototype tag tab
Fig 9. Container state item prototype preprocessing tab

Remember, you can also test these item prototypes in the preprocessing step – just copy the Prometheus exporter data and set the relevant macro to value you want to check.

The configuration parameters of the item prototypes are shown below. There are a lot of metrics you can monitor, but remember to monitor what is relevant and necessary for you.

Template: Podman containers by HTTP and Prometheus; Discovery rule: Container discovery

● Item prototype #1
  ▪ Name: 		Container health: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.health[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 
♦ Tags (name:value) 	
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:health		
♯ Preprocessing
  ▪ Prometheus pattern 	podman_container_health{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #2
  ▪ Name: 		Container state: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.state[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		
♦ Tags (name:value)  			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:state		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_state{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #3
  ▪ Name: 		Created at: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.created[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		unixtime
♦ Tags (name:value) 		
  ▪ Container:{#CONTAINER.NAME}	
  ▪Metric:created		
♯ Preprocessing
  ▪ Prometheus pattern 	podman_container_created_seconds{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #4
  ▪ Name: 		Disk read per second: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.disk.read[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags (name:value) 	
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:disk_read		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_block_output_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value
  ▪ Change per second

● Item prototype #5
  ▪ Name: 		Disk write per second: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.disk.write[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags (name:value) 	
  ▪ Container:{#CONTAINER.NAME}	 
  ▪ Metric:disk_write		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_block_input_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value
  ▪ Change per second

● Item prototype #6
  ▪ Name: 		Exit code: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.exit_code[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
    ▪ Master item	Podman containers by HTTP and Prometheus: Podman info
▪ Units: 			
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:exit_code	
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_exit_code{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #7
  ▪ Name: 		Image tags: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.image.tags[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Character
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 			
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:tag
♯ Preprocessing
▪ Prometheus pattern podman_container_info{id="{#CONTAINER.ID}",image=~".*",name=~".*",pod_id=~".*",pod_name=~".*",ports=~".*"} label image
  ▪ Regular expression	\.*(\/.\w.*)	\1

● Item prototype #8
  ▪ Name: 		Memory usage: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.mem[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:mem		
♯ Preprocessing
  ▪ Prometheus pattern podman_container_mem_usage_bytes{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #9
  ▪ Name: 		Network input dropped: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.in.drop[{#CONTAINER.NAME}]
  ▪ Type of inf: Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		packets
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_in_drop		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_input_dropped_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #10
  ▪ Name: 		Network input errors: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.in.errors[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_in_err		
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_input_errors_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #11
  ▪ Name: 		Network input total: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.in.total[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_in_tot
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_input_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #12
  ▪ Name: 		Network input per second: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.in.change[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		prototype - Network input total: [{#CONTAINER.NAME}] 
  ▪ Units: 		Bps
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_in_change
♯ Preprocessing
  ▪ Change per second

● Item prototype #13
  ▪ Name: 		Network output dropped: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.out.drop[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_out_drop	
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_output_dropped_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #14
  ▪ Name: 		Network output errors: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.out.errors[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_out_err	
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_output_errors_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #15
  ▪ Name: 		Network output total: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.out.total[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_out_tot	
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_net_output_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #16
  ▪ Name: 		Network output per second: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.net.out.change[{#CONTAINER.NAME}]
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		prototype - Network output total: [{#CONTAINER.NAME}]
  ▪ Units: 		Bps
♦ Tags 			 
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:net_out_change
♯ Preprocessing
  ▪ Name			Change per second

● Item prototype #17
  ▪ Name: 		Rootfs size: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.rootfs.size[{#CONTAINER.NAME}]
  ▪ Type of inf: Numeric (unsigned)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		B
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:rootfs
♯ Preprocessing
  ▪ Prometheus pattern	podman_container_rootfs_size_bytes{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

● Item prototype #18
  ▪ Name: 		Total system CPU time: [{#CONTAINER.NAME}]
  ▪ Type 		Dependent item
  ▪ Key: 		container.cpu.time
  ▪ Type of inf: 	Numeric (float)
  ▪ Master item		Podman containers by HTTP and Prometheus: Podman info
  ▪ Units: 		s
♦ Tags 			
  ▪ Container:{#CONTAINER.NAME}	
  ▪ Metric:sys_time
♯ Preprocessing
  ▪ Prometheus pattern: podman_container_cpu_system_seconds_total{id="{#CONTAINER.ID}",pod_id=~".*",pod_name=~".*"} value

Creating a template discovery rule: Trigger prototype

I have created a user macro {$CONTAINER.RUNNING.STATE} on the template with a value of 2, which corresponds to the containers running state. After that, create a trigger prototype to check if the container is in different state other than running.

Template: Podman containers by HTTP and Prometheus; Discovery rule: Container discovery

◘ Trigger prototypes
  ▪ Name:               Container [{#CONTAINER.NAME}] state has changed from running
  ▪ Severity:           Warning
  ▪ Expression:         last(/Podman containers by HTTP and Prometheus/container.state[{#CONTAINER.NAME}])<>{$CONTAINER.RUNNING.STATE}
  ▪ PROBLEM event generation mode: Single
  ▪ OK event closes: All problems

So, once all of this is done, and some container status changes from running and or pod status also changes from running, you will get a problem event.

Fig 10. Generated problem events when the podman pod and container change states.

Technically, I could also create a trigger for container health; however, as all of the received container values for me are -1 (meaning unknown) it makes little sense to make a trigger that will fire right away. You can also add additional item/trigger prototypes in the template. If everything is set up as expected, you should see something like the screenshot below after the LLD rule execution.

Fig 11. Example of the mysql-server container and zabbix pod item values.

Summary

Now, you can monitor both Podman pods and containers using both blog posts of this series. We used the same template item for both the container LLD and item prototypes from the first part of this post.

The post Podman Container Monitoring with Prometheus Exporter, part 2 appeared first on Zabbix Blog.