Tag Archives: Case Study

Medidata’s journey to a modern lakehouse architecture on AWS

Post Syndicated from Mike Araujo original https://aws.amazon.com/blogs/big-data/medidatas-journey-to-a-modern-lakehouse-architecture-on-aws/

This post was co-authored by Mike Araujo, Principal Engineer at Medidata Solutions.

The life sciences industry is transitioning from fragmented, standalone tools towards integrated, platform-based solutions. Medidata, a Dassault Systèmes company, is building a next-generation data platform that addresses the complex challenges of modern clinical research. In this post, we show you how Medidata created a unified, scalable, real-time data platform that serves thousands of clinical trials worldwide with AWS services, Apache Iceberg, and a modern lakehouse architecture.

Challenges with legacy architecture

As the Medidata clinical data repository expanded, the team recognized that the legacy data solution could no longer provide quality data products to customers across a growing portfolio of data offerings. Several data tenets began to erode. The following diagram shows Medidata’s legacy extract, transform, and load (ETL) architecture.

Built upon a series of scheduled batch jobs, the legacy system proved ill-equipped to provide a unified view of the data across the entire ecosystem. Batch jobs ran at different intervals, often requiring a sufficient degree of scheduling buffer to make sure upstream jobs completed within the expected window. As the data volume expanded, the jobs and their schedules continued to inflate, introducing a latency window between ingestion and processing for dependent consumers. Different consumers operating from various underlying data services further magnified the problem as pipelines had to be continuously built across a variety of data delivery stacks.

The expanding portfolio of pipelines began to overwhelm existing maintenance operations. With more operations, the opportunity for failure expanded and recovery efforts grew more complicated. Existing observability systems were inundated with operational data, and identifying the root cause of data quality issues became a multi-day endeavor. Increases in the data volume required scaling considerations across the entire data estate.

Additionally, the proliferation of data pipelines and copies of the data in different technologies and storage systems necessitated expanding access controls with enhanced security features to make sure only the correct users had access to the subset of data to which they were permitted. Making sure access control changes were correctly propagated across all systems added a further layer of complexity to consumers and producers.

Solution overview

With the advent of Clinical Data Studio (Medidata’s unified data management and analytics solution for clinical trials) and Data Connect (Medidata’s data solution for acquiring, transforming, and exchanging electronic health record (EHR) data across healthcare organizations), Medidata introduced a new world of data discovery, analysis, and integration to the life sciences industry powered by open source technologies and hosted on AWS. The following diagram illustrates the solution architecture.

Fragmented batch ETL jobs were replaced by real-time streaming pipelines built on Apache Flink, an open source, distributed engine for stateful processing, running on Amazon Elastic Kubernetes Service (Amazon EKS), a fully managed Kubernetes service. The Flink jobs write to Apache Kafka running in Amazon Managed Streaming for Apache Kafka (Amazon MSK), a streaming data service that manages Kafka infrastructure and operations, before the data lands in Iceberg tables backed by the AWS Glue Data Catalog, a centralized metadata repository for data assets. This collection of Iceberg tables gives a variety of consumers a central, single source of data without additional downstream processing, removing the need for custom pipelines to satisfy the requirements of each consumer. Through these fundamental architectural changes, the team at Medidata solved the issues presented by the legacy solution.
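To make the catalog piece of this architecture more concrete, here is a minimal sketch of how a Flink SQL job could register an Iceberg catalog backed by the AWS Glue Data Catalog. It is not Medidata's configuration: the catalog name, the bucket, and the helper function are hypothetical, and the property keys follow the publicly documented Iceberg AWS integration.

```python
def iceberg_glue_catalog_ddl(catalog_name: str, warehouse_uri: str) -> str:
    """Build a Flink SQL statement that registers an Iceberg catalog on AWS Glue.

    The catalog name and warehouse location are placeholders chosen for this
    example; they do not come from the original post.
    """
    properties = {
        "type": "iceberg",
        "catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "warehouse": warehouse_uri,
    }
    body = ",\n".join(f"  '{key}' = '{value}'" for key, value in properties.items())
    return f"CREATE CATALOG {catalog_name} WITH (\n{body}\n)"


# Hypothetical names, for illustration only.
print(iceberg_glue_catalog_ddl("lakehouse", "s3://example-bucket/warehouse"))
```

Once a catalog like this is registered, every engine that points at the same Glue Data Catalog resolves the same table metadata, which is what makes the single-copy approach possible.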

Data availability and consistency

With the introduction of the Flink jobs and Iceberg tables, the team was able to deliver a consistent view of their data across the Medidata data experience. Pipeline latency was reduced from days to minutes, helping Medidata customers realize a 99% performance gain from the data ingestion to the data analytics layers. Due to Iceberg’s interoperability, Medidata users saw the same view of the data regardless of where they viewed that data, minimizing the need for consumer-driven custom pipelines because Iceberg could plug into existing consumers.

Maintenance and durability

Iceberg’s interoperability provided a single copy of the data to satisfy their use cases, so the Medidata team could focus its observation and maintenance efforts on a five-times smaller subset of operations than previously required. Observability was enhanced by tapping into the various metadata components and metrics exposed by Iceberg and the Data Catalog. Quality management transformed from cross-system traces and queries to a single analysis of unified pipelines, with the added benefit of point-in-time data queries thanks to the Iceberg snapshot feature. Data volume increases are handled with out-of-the-box scaling supported by the entire infrastructure stack. AWS Glue Iceberg optimization features, which include compaction, snapshot retention, and orphan file deletion, provide a set-and-forget experience for solving a number of common Iceberg frustrations, such as the small file problem, orphan file retention, and query performance.
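The point-in-time queries mentioned above rest on a simple idea: each commit to an Iceberg table produces a snapshot, and a query "as of" a given time reads the newest snapshot committed at or before that time. The following sketch models only that selection step in plain Python. The `Snapshot` type and the sample history are illustrative assumptions, not Iceberg's API.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Snapshot(NamedTuple):
    snapshot_id: int
    committed_at_ms: int  # commit time in milliseconds since the epoch


def snapshot_as_of(history: Sequence[Snapshot], as_of_ms: int) -> Optional[Snapshot]:
    """Return the newest snapshot committed at or before as_of_ms, or None."""
    ordered = sorted(history, key=lambda snap: snap.committed_at_ms)
    commit_times = [snap.committed_at_ms for snap in ordered]
    index = bisect_right(commit_times, as_of_ms)
    return ordered[index - 1] if index else None


# Illustrative history: three commits at t=1000, t=2000, and t=3000.
history = [Snapshot(101, 1_000), Snapshot(102, 2_000), Snapshot(103, 3_000)]
print(snapshot_as_of(history, 2_500))
```

Snapshot retention determines how far back such a lookup can reach: once a snapshot is expired, queries can no longer travel to it.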

Security

With Iceberg at the center of its solution architecture, the Medidata team no longer had to spend the time building custom access control layers with enhanced security features at each data integration point. Iceberg on AWS centralizes the authorization layer using familiar systems such as AWS Identity and Access Management (IAM), providing a single and durable control for data access. The data also stays entirely within the Medidata virtual private cloud (VPC), further reducing the opportunity for unintended disclosures.
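As a rough illustration of what a single IAM-based control can look like, the sketch below builds a read-only policy document for one Iceberg table: Glue actions for the table metadata and an S3 action for the data files. The account ID, names, and bucket layout are hypothetical, and a real deployment would likely need additional permissions (for example, listing the bucket); treat it as a starting point rather than Medidata's policy.

```python
import json


def read_only_table_policy(region: str, account_id: str, database: str,
                           table: str, bucket: str, prefix: str) -> dict:
    """Build an IAM policy document granting read access to one Iceberg table.

    All argument values are supplied by the caller; nothing here is taken
    from the original post.
    """
    glue_arn = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTableMetadata",
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable"],
                "Resource": [
                    f"{glue_arn}:catalog",
                    f"{glue_arn}:database/{database}",
                    f"{glue_arn}:table/{database}/{table}",
                ],
            },
            {
                "Sid": "ReadTableFiles",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


# Hypothetical identifiers, for illustration only.
policy = read_only_table_policy(
    "us-east-1", "111122223333", "clinical", "events", "example-bucket", "warehouse/events"
)
print(json.dumps(policy, indent=2))
```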

Conclusion

In this post, we demonstrated how legacy universe of consumer-driven custom ETL pipelines can be replaced with a scalable, high-performant streaming lakehouses. By putting Iceberg on AWS at the center of data operations, you can have a single source of data for your consumers.

To learn more about Iceberg on AWS, refer to Optimizing Iceberg tables and Using Apache Iceberg on AWS.


About the authors

Mike Araujo

Mike is a Principal Engineer at Medidata Solutions, working on building a next generation data and AI platform for clinical data and trials. By using the power of open source technologies such as Apache Kafka, Apache Flink, and Apache Iceberg, Mike and his team have enabled the delivery of billions of clinical events and data transformations in near real time to downstream consumers, applications, and AI agents. His core skills focus on architecting and building big data and ETL solutions at scale as well as their integration in agentic workflows.

Sandeep Adwankar

Sandeep is a Senior Product Manager at AWS, who has driven feature launches across Amazon SageMaker, AWS Glue, and AWS Lake Formation. He has led initiatives in Amazon S3 Tables analytics, Iceberg compaction strategies, and AWS Glue Iceberg optimizations. His recent work focuses on generative AI and autonomous systems, including the AWS Glue Data Catalog model context protocol and Amazon Bedrock structured knowledge bases. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that accelerate their business outcomes.

Ian Beatty

Ian is a Technical Account Manager at AWS, where he specializes in supporting independent software vendor (ISV) customers in the healthcare and life sciences (HCLS) and financial services industry (FSI) sectors. Based in the Rochester, NY area, Ian helps ISV customers navigate their cloud journey by maintaining resilient and optimized workloads on AWS. With over a decade of experience building on AWS since 2014, he brings deep technical expertise from his previous roles as an AWS Architect and DevSecOps team lead for SaaS ISVs before joining AWS more than 3 years ago.

Ashley Chen

Ashley is a Solutions Architect at AWS based in Washington D.C. She supports independent software vendor (ISV) customers in the healthcare and life sciences industries, focusing on customer enablement, generative AI applications, and container workloads.

Monitoring MDM Certificates with Lab9 Pro and Zabbix

Post Syndicated from Michael Kammer original https://blog.zabbix.com/monitoring-mdm-certificates-with-lab9-pro-and-zabbix/31621/

Lab9 Pro is the B2B division of Lab9, Belgium’s leading Apple Premium Partner. With over 30 years of experience, Lab9 Pro specializes in integrating and supporting Apple systems within businesses, educational institutions, and public organizations. Beyond Apple expertise, Lab9 Pro also designs, implements, and maintains complete IT infrastructures, including networks, servers, storage, and security solutions.

The challenge

It’s impossible to manage devices at organizations without the use of a good MDM (Mobile Device Management) system such as Jamf. As the leading provider of Apple device management solutions, Jamf empowers organizations to deploy, manage, and secure Apple devices at scale.

Even in smaller organizations Jamf is the right solution, as small and medium-sized enterprises (SMEs) often lack the resources to manage their MDM systems. Offering an MSP model solves a lot of problems for these customers.

For Apple device management, the typical customer has a few certificates issued by Apple, which require approval of the user agreement by the Apple business or school manager. Without getting too technical about Apple device management, the certificates need to be renewed on different dates depending on the customer. If the user agreement is not approved, automated device enrollment will stop working.

Lab9 Pro found themselves needing to check all certificates and user agreements for MSP customers manually, which involved an unacceptably high error rate that often caused discontinuity of the MDM system.

The solution

Lab9 Pro were already using Zabbix to monitor customer environments and their own infrastructure, including storage, firewalls, switches, and more. Because Zabbix offers a wide variety of options that make it possible to monitor almost anything, it was only logical to explore whether Zabbix could also be used to monitor the MDM certificates.

The research phase

Step one was to check the availability of certificate information. Unfortunately, Apple Business Manager’s API did not help much, as it does not provide certificate details. Instead, the team at Lab9 Pro investigated the Jamf API.

Although it doesn’t directly return certificate information either, they found something even more useful: Jamf’s API provides customer instance notifications. These include alerts when certificates (VPP, PUSH, DEP, etc.) are about to expire (typically 10 days in advance) as well as when the Device Enrollment Program (user agreement) is not approved.
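To illustrate the idea, the following sketch picks the certificate-related entries out of a list of notifications. The shape of each notification (a dictionary with a `type` field) and the type names themselves are assumptions made for this example; check the values your own Jamf instance actually returns.

```python
from typing import Dict, Iterable, List

# Hypothetical notification type names, used here only as placeholders.
WATCHED_TYPES = {
    "PUSH_CERT_WILL_EXPIRE",
    "VPP_TOKEN_WILL_EXPIRE",
    "DEP_TOKEN_WILL_EXPIRE",
    "DEP_TERMS_NOT_ACCEPTED",
}


def certificate_alerts(notifications: Iterable[Dict]) -> List[Dict]:
    """Keep only notifications whose type is in the watched set."""
    return [item for item in notifications if item.get("type") in WATCHED_TYPES]


# Sample payload invented for the example.
sample = [
    {"type": "PUSH_CERT_WILL_EXPIRE", "id": "1"},
    {"type": "SOMETHING_UNRELATED", "id": "2"},
]
print(certificate_alerts(sample))
```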

Zabbix implementation

Since Lab9 Pro manages multiple MSP tenants, they created a dedicated Zabbix template. This template includes both pre-filled and empty macros:

Pre-filled macros:

• {$JAMF.AUTH.INTERVAL}: Interval for retrieving the bearer token
• {$JAMF.NOTIF.INTERVAL}: Interval for retrieving Jamf notifications
• {$JAMF.PATH.AUTH}: API path for retrieving the bearer token
• {$JAMF.PATH.NOTIFICATIONS}: API path for retrieving Jamf notifications

Empty macros:

• {$JAMF.URL}: Jamf URL
• {$JAMF.API.USER}: Jamf user account for authentication
• {$JAMF.API.PASSWORD}: Jamf password (stored as a secret value)

The team configured an item to perform an API call to retrieve the bearer token. A preprocessing rule in JavaScript stores this token in a variable. Discovery rules proved very useful for executing API calls to retrieve Jamf notifications using the bearer token. This was achieved by configuring preprocessing steps and Low-Level Discovery (LLD) macros to pass the Jamf URL and bearer token. Trigger prototypes for each certificate were also added within the same discovery rule.
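The flow described above can be sketched outside of Zabbix as two HTTP requests followed by a conversion into the JSON array that a Zabbix discovery rule consumes. This is not Lab9 Pro's template: the API paths stand in for the values of {$JAMF.PATH.AUTH} and {$JAMF.PATH.NOTIFICATIONS}, which the post does not list, and the LLD macro names are hypothetical. The sketch also assumes the token endpoint accepts HTTP Basic authentication.

```python
import base64
import json
import urllib.request
from typing import Dict, Iterable


def build_token_request(base_url: str, auth_path: str, user: str,
                        password: str) -> urllib.request.Request:
    """Prepare the request that exchanges credentials for a bearer token."""
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        base_url.rstrip("/") + auth_path,
        method="POST",
        headers={"Authorization": f"Basic {credentials}", "Accept": "application/json"},
    )


def build_notifications_request(base_url: str, notifications_path: str,
                                token: str) -> urllib.request.Request:
    """Prepare the request that fetches notifications with the bearer token."""
    return urllib.request.Request(
        base_url.rstrip("/") + notifications_path,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )


def to_lld_rows(notifications: Iterable[Dict]) -> str:
    """Convert notifications into a JSON array of LLD rows for a discovery rule."""
    rows = [
        {"{#NOTIF.TYPE}": item["type"], "{#NOTIF.ID}": str(item.get("id", ""))}
        for item in notifications
    ]
    return json.dumps(rows)


# Placeholder URL, paths, and credentials; nothing is sent over the network here.
request = build_token_request("https://example.jamfcloud.com", "/api/v1/auth/token", "user", "secret")
print(request.get_method(), request.full_url)
print(to_lld_rows([{"type": "PUSH_CERT_WILL_EXPIRE", "id": 7}]))
```

Each row in the resulting array becomes one discovered entity, so a trigger prototype can fire per certificate type rather than once per tenant.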

The results

Whenever a certificate is nearing expiration, a problem is automatically displayed on Lab9 Pro’s Zabbix dashboard, which is visible on TV screens placed throughout their office in order to make sure the entire team is aware of upcoming certificate renewals.

Since Lab9 Pro began monitoring MDM certificates through the Jamf API, they have experienced zero expired certificates, which in turn has allowed them to avoid situations where devices become unmanaged and require a full setup again.

Zabbix makes it possible for Lab9 Pro to keep their clients’ MDM systems operational, while allowing them to either proactively inform them when certificates need to be renewed or handle the renewal process on their behalf.

The post Monitoring MDM Certificates with Lab9 Pro and Zabbix appeared first on Zabbix Blog.

Multi-Cloud Code Deployments using Amazon Q Developer with Echo3D

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/multi-cloud-code-deployments-using-amazon-q-developer-with-echo3d/

Banner showing echo3D logo and Amazon Q Developer logo

Image showing 87% speed-up in development task completion, 41% of code written by Amazon Q Developer, and 60% increase in development productivity

Overview

Founded in 2018, echo3D built a revolutionary 3D digital asset management (DAM) platform to address the surging demand for immersive content across industries. The company’s platform enables enterprises to seamlessly store, secure, optimize, and share 3D content, serving over 200,000 professionals across energy, healthcare, gaming, retail, and beyond.

echo3D’s platform has become the go-to solution for managing complex 3D assets at scale, supporting major enterprises across multiple sectors. With their technology operating within clients’ own AWS accounts, echo3D delivers critical infrastructure that powers real-time 3D content management for organizations worldwide.

As customer demand grew, echo3D faced increasing pressure to maintain rapid innovation while ensuring stable multi-cloud deployments. With a streamlined development team managing expanding cross-platform requirements, the company needed an efficient solution to accelerate their build and debug processes. This led them to explore Amazon Q Developer as a way to enhance their development capabilities and meet growing market demands.

Opportunity | Building for a Multi-Cloud Reality through Amazon Q Developer

echo3D specializes in 3D digital asset management, with a critical focus on multi-cloud deployments to serve their diverse enterprise client base. The company’s commitment to cross-platform functionality isn’t optional—it’s fundamental to their business model, with many clients specifically requiring AWS compatibility.

The company’s existing cloud infrastructure needed to support seamless migrations while maintaining robust performance across different environments. “For many of our clients, AWS is the ultimate destination,” explains Ben Pedazur, CTO at echo3D. “Amazon Q Developer has proven to be an indispensable guide for these migrations, both for our infrastructure and for the solutions we build for customers.”

After evaluating various solutions, echo3D identified Amazon Q Developer as their key tool for standardizing cross-platform development. “We needed a solution that could generate consistent code across different cloud environments while resolving platform-specific challenges,” notes Pedazur. This capability became particularly crucial during a recent customer migration project, which served as a perfect test case for Amazon Q Developer’s capabilities.

Solution | Streamlining the Journey to AWS with Amazon Q Developer

To streamline their cloud migration process, echo3D implemented Amazon Q Developer across their entire development workflow. The team utilized Amazon Q Developer to handle a critical migration from Azure Cosmos DB to Amazon DynamoDB, leveraging the AI assistant to generate comprehensive migration blueprints that included code modifications, configuration changes, and testing strategies.
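One small, concrete piece of a Cosmos DB to DynamoDB migration is reshaping each document into an item the target accepts. The sketch below is not echo3D's migration code. It assumes documents were exported as Python dictionaries, drops the system properties Cosmos DB adds to every document, and converts floats to `Decimal`, which the boto3 DynamoDB resource interface requires for non-integer numbers.

```python
from decimal import Decimal
from typing import Any, Dict

# System-generated properties that Cosmos DB attaches to stored documents.
COSMOS_SYSTEM_FIELDS = {"_rid", "_self", "_etag", "_attachments", "_ts"}


def to_dynamodb_value(value: Any) -> Any:
    """Recursively convert floats to Decimal; leave other values unchanged."""
    if isinstance(value, float):
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamodb_value(inner) for key, inner in value.items()}
    if isinstance(value, list):
        return [to_dynamodb_value(inner) for inner in value]
    return value


def cosmos_to_dynamodb_item(document: Dict[str, Any]) -> Dict[str, Any]:
    """Strip Cosmos DB system fields and make numeric values DynamoDB-safe."""
    return {
        key: to_dynamodb_value(value)
        for key, value in document.items()
        if key not in COSMOS_SYSTEM_FIELDS
    }


# Sample document invented for the example.
document = {"id": "asset-1", "scale": 1.5, "tags": ["mesh"], "_etag": "abc", "_ts": 1700000000}
print(cosmos_to_dynamodb_item(document))
```

Key design (partition and sort keys), throughput settings, and query patterns differ between the two databases and need their own planning; this sketch covers only the per-item transformation.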

Developers used detailed prompts to generate migration plans and receive context-aware guidance throughout the process. Amazon Q Developer provided not just code snippets, but complete architectural solutions that considered both the source and target platforms. During implementation, the team integrated Amazon Q Developer directly into their workflow, receiving real-time suggestions for code optimization and platform-specific adjustments.

The impact of Amazon Q Developer was immediate and measurable, with 41% of the new codebase being generated or auto-completed by the tool. “Amazon Q Developer has transformed our migration efficiency,” says Pedazur. “Our development time for cloud migrations has decreased by 87%, while significantly improving code quality.”

Amazon Q Developer assists throughout the entire development lifecycle, generating test cases, deployment scripts, and documentation. This comprehensive support has led to remarkable improvements: platform-specific bugs decreased by 75%, deployment success rates reached 99.8% across multiple clouds, and code review cycles shortened by 60%.

Beyond code generation, echo3D uses Amazon Q Developer to enhance team collaboration and knowledge sharing. The tool has cut onboarding time for new engineers in half, reducing it from four weeks to two weeks. Support tickets related to deployment errors have dropped by 68%, indicating improved code stability and reliability.

The new multi-cloud infrastructure, built with AWS services including DynamoDB, enables echo3D to scale efficiently while maintaining high performance across different cloud environments. The combination of Amazon Q Developer and AWS services has empowered echo3D to accelerate their development cycle while ensuring consistent quality across platforms.

“Amazon Q Developer isn’t just about coding faster—it’s about building better,” explains Pedazur. “We’ve seen improvements across every metric, from development speed to code quality, allowing our team to focus on innovation rather than troubleshooting.”

Outcome | Reimagining Development Through AI-Powered Workflows

echo3D plans to extend its use of Amazon Q Developer across the product lifecycle, from rapid prototyping to ongoing code maintenance and enhancement.

“Amazon Q Developer has revolutionized our approach to multi-cloud development,” says Pedazur. “It’s not just about automating tasks; it’s about reimagining our entire workflow. We’re now able to prototype, test, and deploy across cloud platforms with unprecedented speed and accuracy.”

Authors

Lilly McDermott

Lilly McDermott is an AWS account manager specializing in supporting gaming companies and game tech. As a trusted advisor, she guides customers through their cloud journey, helping them implement scalable solutions that drive innovation and growth in their games and services. Lilly is dedicated to guiding her customers in transforming their creative ideas into executable plans, empowering them to thrive in the competitive gaming market.

Kevon Mayers

Kevon Mayers is a Games Solutions Architect at AWS and is the Infrastructure as Code (IaC) Focus Area Lead for the NextGen Developer Experience Technical Field Community at AWS. Kevon is a Core Contributor for Terraform and has led multiple Terraform initiatives within AWS. Prior to joining AWS, he was working as a DevOps engineer and developer, and before that was working with the GRAMMYs/The Recording Academy as a studio manager, music producer, and audio engineer. He also owns a professional production company, MM Productions.

Ben Pedazur

Ben Pedazur (CTO at echo3D) holds an MSc in Electrical Engineering from Tel Aviv University, specializing in computer vision and network communication, and a BSc in Electrical Engineering from Afeka Academic College of Engineering, specializing in image processing. He is a former engineering manager at Cisco Systems, founder of an AR+Drones startup, and algorithm engineer at AdiMap. Ben is skilled in agile leadership, engineering management, and product research & development.

Alon Grinshpoon

Alon Grinshpoon (CEO at echo3D) holds an MS in Computer Science from Columbia University, specializing in 3D/AR/VR and human-computer interaction (HCI), and a BS in Computer Science and Electrical Engineering, specializing in cloud technology. He is a former NVIDIA engineer, a published 3D UI researcher, and a frequent speaker at CES, SXSW, Augmented World Expo (AWE), NYVR, Slush, and more. Alon has published papers at top engineering venues such as SIGGRAPH 2018 Emerging Technologies and the IEEE Conference on Virtual Reality and 3D User Interfaces (VR) on AR system design and 3D interaction techniques in AR.

Optimizing Financial Routines and Infrastructure with Banpará

Post Syndicated from Michael Kammer original https://blog.zabbix.com/optimizing-financial-routines-and-infrastructure-with-banpara/30815/

Banco do Estado do Pará (Banpará) is the main public financial institution in the Brazilian state of Pará. It is a mixed-capital company, organized as a multiple bank with the mission of generating value for the state of Pará. It currently has approximately 198 physical customer service units and is present in all 144 municipalities in the state.

The challenge

Until 2016, Banpará used a monitoring environment installed on a single physical server. This environment was centralized, not very scalable, and vulnerable due to the lack of updates to recent versions of the software used. Centralization created a critical dependency – if there was a server failure, the entire monitoring system would be compromised.

There was no integration with the tool that orchestrates the company’s routine activities (which also generated an alert and a need for proper support of the bank’s infrastructure). In addition, the routines of the internal demand generation tool had to be added to the monitoring panel manually.

With each new routine created, it was necessary to open calls with the technical teams for inclusion in the monitoring plan, which were then entered into a list of tasks. This process, in addition to being time-consuming, was subject to human error and delays, which compromised real-time visibility of critical operations.

The lack of proactive and integrated monitoring in Banpará’s structure resulted in operational gaps that created real risks to the continuous functioning of banking operations.

The solution

Given the challenges posed, the project developed with Zabbix had as its main objective to recreate the monitoring environment in a virtualized, scalable and resilient way, without dependence on a physical server. From rebuilding the infrastructure to integrating it with critical banking systems, the primary requirements included the following:

  • Integration with existing systems
  • Intelligent data processing and analysis
  • Reduction of manual processes and operational dependency
  • Development of customized solutions
  • Reorganization of the technological infrastructure

After implementing and structuring Zabbix at the bank (with the help of Master Support, an official Zabbix Certified Partner in Brazil), the structure became modular, scalable, and resilient, aligned with best practices, and able to expand monitoring without compromising system performance as the bank integrated new routines and services.

The results

The modernization of the monitoring environment with Zabbix brought immediate benefits for Banpará’s IT monitoring scenario, especially with regard to operational efficiency, reliability and process automation:

  • More than 2,000 monitored devices
  • Around 100,000 metrics collected
  • More than 26,000 active alerts in Zabbix
  • Automated coverage of around 2,300 routines
  • An estimated gain of 2,300 operational hours

The adoption of Zabbix as a monitoring tool at Banpará was a practical response to the need to modernize the bank’s IT infrastructure. The project contributed to the elimination of manual processes, reduction of operational time, and increased visibility over critical routines. It also enabled the monitoring of a greater number of services, with greater agility in identifying failures and supporting decision-making.

In conclusion

With the current structure, Banpará now has a more integrated monitoring system, adjusted to operational demands and with the capacity to monitor the evolution of the bank’s activities in an organized and secure manner.

To learn more about what Zabbix can do for customers in banking and finance, visit our website.

The post Optimizing Financial Routines and Infrastructure with Banpará appeared first on Zabbix Blog.

Proxy Group Load Balancing with SNMP Traps

Post Syndicated from Nathan Liefting original https://blog.zabbix.com/proxy-group-load-balancing-with-snmp-traps/31042/

The new Zabbix proxy groups give us a way to provide both redundancy and load balancing in our Zabbix proxy setups. However, one major limitation arises when we want to use SNMP traps with these new proxy groups – it isn’t natively supported at the moment. One of our customers asked me to find a solution to that problem, so here’s how I went about it.

Getting to grips with the problem

As mentioned, many of us are now facing a problem. Either we use proxy groups and we don’t use SNMP traps, or we use proxy groups and move SNMP traps to a single proxy. Unfortunately, this is unacceptable for many environments where SNMP traps might be an essential part of monitoring. The problem, however, stems from how snmptrapd works in combination with Zabbix reading the trapper file. Improvements have already been made to provide for more room when creating our own solutions like this.

Other Zabbix users have also been proposing solutions and I’m sure Zabbix is looking into improvements. Here’s an example case to vote on.

However, that doesn’t solve many of our issues now. The problem starts when we are sending SNMP traps to a single proxy (Proxy 1 for example) and a Zabbix host (let’s say Zabbix host 2) is assigned to another proxy in the proxy group (Proxy 2 for example). In this situation, the trap is coming in on an incorrect monitoring proxy and Zabbix won’t be able to read the trap. It will simply not add it to the Zabbix database and ignore it.
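The situation is easy to model. In the toy sketch below, each proxy only knows the hosts assigned to it, so a trap is processed only if it reaches the proxy that monitors its sender. The proxy names, addresses, and the function itself are made up for illustration, and this is a simplification of the behavior described above rather than Zabbix's actual matching logic.

```python
from typing import Dict, List, Set


def proxies_accepting(trap_source: str, receiving_proxies: List[str],
                      assignments: Dict[str, Set[str]]) -> List[str]:
    """Return the proxies that would process a trap from trap_source.

    assignments maps each proxy name to the addresses of the hosts it monitors.
    """
    return [
        proxy for proxy in receiving_proxies
        if trap_source in assignments.get(proxy, set())
    ]


# Hypothetical proxy group: host 2 (10.0.0.2) is assigned to proxy2.
assignments = {"proxy1": {"10.0.0.1"}, "proxy2": {"10.0.0.2"}}

# The trap from host 2 arrives only at proxy1, so nobody processes it.
print(proxies_accepting("10.0.0.2", ["proxy1"], assignments))

# If the trap reaches every proxy, exactly one of them processes it.
print(proxies_accepting("10.0.0.2", ["proxy1", "proxy2"], assignments))
```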

The solution here is simple – we can configure our monitoring target like a switch or a router to send the SNMP trap to multiple destinations. However, this will cause our trap to be sent over the network multiple times, increasing the load on our network. This is acceptable for smaller setups, but we were dealing with a setup that is sending hundreds of traps every second.

Finding a solution

With the problem laid out for us, we came up with a simple duplication setup that included these requirements:

  1. Simple and easy to maintain/troubleshoot
  2. Traps could only be sent over the network once
  3. Works fast between failovers
  4. Works with both redundancy and load balancing
  5. Minimal extra packages
  6. No easily corruptible shared file systems

What we came up with in the end is visible in the image below:

 

It’s a simple setup that requires us to install 2 extra packages and a container.

First, we added a VIP to our proxy setup using keepalived, to provide our monitoring targets with a single SNMP trap destination. The VIP will be available on one proxy at a time, regardless of whether there are 2, 10 or more proxies in the proxy group. Our switches, routers, or any other SNMP trap host can now be configured to send traps to this VIP.

Second, we needed a way to duplicate our traps. Since only one proxy is going to be receiving traps, the other proxies still need to be able to receive the traps. Without the duplication and the VIP being present on Proxy 1, Zabbix host 2 still would not receive its trap. We installed Docker and created a tiny, lightweight container on our hosts to duplicate the SNMP trap from one proxy to all other proxies in the group. Admittedly this does slightly go against requirement number 2, as we are now sending the trap over the network between proxies. This is, however, all within our own more localized infrastructure instead of over a longer network.
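To give a feel for what such a duplicator does, here is a minimal sketch of a UDP relay that copies every received datagram to a list of peers. It is not the container we built. The listen address and peer list are placeholders, and a plain relay like this one sends from its own address and would re-forward anything a peer relayed back to it, so preserving the original sender and avoiding forwarding loops are left out of this example.

```python
import socket
from typing import Iterable, List, Tuple

Address = Tuple[str, int]


def forward_datagram(payload: bytes, peers: Iterable[Address],
                     sock: socket.socket) -> int:
    """Send one datagram to every peer and return how many copies were sent."""
    sent = 0
    for peer in peers:
        sock.sendto(payload, peer)
        sent += 1
    return sent


def relay_forever(listen: Address, peers: Iterable[Address]) -> None:
    """Receive datagrams on listen and copy each one to all peers."""
    peer_list: List[Address] = list(peers)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as inbound, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as outbound:
        inbound.bind(listen)
        while True:
            payload, _sender = inbound.recvfrom(65535)
            forward_datagram(payload, peer_list, outbound)


# Example call with placeholder addresses (commented out because it blocks):
# relay_forever(("0.0.0.0", 1162), [("192.0.2.12", 162), ("192.0.2.13", 162)])
```

Because the relay is stateless, it keeps working unchanged no matter which proxy currently holds the VIP, as long as the peer list names the other proxies in the group.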

That’s it! Whenever Proxy 1 receives a trap, it will now duplicate it to Proxy 2. The proxy that monitors the host will parse the trap and pass it to Zabbix, and the other proxies will ignore the trap. Even if the proxy restarts, fails over, or suddenly goes down, it will not read the trap twice.

The only thing to keep in mind is that it can take some time for keepalived to fail over the VIP. With SNMP traps being UDP-based, this means that any traps sent to the VIP while snmptrapd is down won’t be parsed. However, it’s definitely better to lose some in case of failover, than to lose all upon outage!

The post Proxy Group Load Balancing with SNMP Traps appeared first on Zabbix Blog.

Zabbix at the Zhongnan University of Economics and Law

Post Syndicated from Michael Kammer original https://blog.zabbix.com/zabbix-at-the-zhongnan-university-of-economics-and-law/30949/

Zhongnan University of Economics and Law (ZUEL), located in Wuhan City, Hubei Province, China, is a key university with two campuses – Nanhu and Shouyi. The school boasts over 20,000 full-time undergraduate students, more than 8,800 graduate students, and over 2,500 faculty and staff members. ZUEL enjoys an outstanding reputation in the fields of law and economics, with four national key disciplines. Its law discipline, meanwhile, has been included in the list of national “Double First-Class” disciplines.

The challenge

As the information infrastructure at ZUEL continues to expand, the scale of the university’s IT infrastructure has rapidly grown to encompass power systems, dynamic environmental systems, servers, network devices, security appliances, storage systems, virtualization platforms, operating systems, databases, data lakes, and campus application systems.

At the same time, the daily academic and administrative activities of faculty and students increasingly demand higher levels of stability and reliability from information systems. To ensure the efficient operation of these systems, the Information Management department needed a monitoring and management system that could cover the entire university’s IT resources and address the growing complexities of operational maintenance.

The university found that traditional monitoring and management systems often fall short when faced with such large-scale and diverse monitoring demands, revealing problems like insufficient monitoring points, poor real-time capabilities, and limited scalability. To address these challenges, the university decided to adopt Zabbix 7.0 and develop a custom IP Radar platform to further meet its refined operational maintenance needs.

The solution

When combined with Zabbix 7.0, the IP Radar system can achieve comprehensive monitoring and management of the university’s entire IT infrastructure through the integrated application of multiple monitoring protocols and technologies. Specifically, the system collects data and performs monitoring with the help of the following core technologies:

  • Zabbix 7.0. As an enterprise-level open-source monitoring platform renowned for its robust data collection and analysis capabilities, Zabbix enhances the system’s high availability, supporting large-scale concurrent processing to make sure that the monitoring system remains stable and delivers uninterrupted service even under heavy loads.
  • Parallel monitoring with multiple protocols. The system collects data through a variety of protocols, including Agent, SNMP, IPMI, MODBUS, MQTT, and more, enabling the real-time monitoring of a wide variety of IT hardware.
  • High-availability design. To accommodate the monitoring demands of a massive number of devices and thousands of users, the Zabbix 7.0 platform supports multi-node deployment and redundancy design, enabling load balancing and failover among proxy servers. Even in the event of a node failure, the system maintains uninterrupted monitoring services, and it’s also equipped with an automated fault alerting and repair mechanism.
  • The self-developed IP Radar platform. To meet a demanding set of operation and maintenance management needs, ZUEL has developed the IP Radar system based on the Zabbix 7.0 platform, further customizing its business monitoring capabilities. IP Radar not only conducts real-time monitoring of the IT infrastructure, but it also provides detailed performance analysis reports and trend predictions, while integrating behavior monitoring capabilities to enhance the school’s network security management.

The IP Radar platform itself contains a variety of unique and innovative features, including:

  • Comprehensive monitoring coverage. The IP Radar system monitors over a million items – everything from hardware devices to application systems, affecting everything from network performance to user experience. This extensive coverage gives the Information Management department a comprehensive understanding of the operational status of the school’s IT resources while providing sufficient data support for troubleshooting and performance optimization.
  • Customized monitoring strategies. Compared to traditional monitoring systems, IP Radar offers highly customized monitoring strategies. ZUEL can tailor different business dashboards for networks, computing resources, user experience, data center environments, and more, based on its own needs and the permissions granted to operation and maintenance personnel. Depending on different monitoring thresholds and alerting strategies, the system can automatically generate alerts and notify relevant personnel through enterprise WeChat, SMS, and other channels.
  • Intelligent alerting and automated handling. The intelligent alerting system of the IP Radar platform leverages machine learning algorithms to analyze historical monitoring data, enabling it to predict potential fault risks and issue early warnings. At the same time, the system integrates automated operation and maintenance capabilities, which allow it to automatically execute predetermined repair operations when certain common faults occur, reducing the time and cost of manual intervention.
  • Network security monitoring. In terms of network security, the IP Radar system is capable of identifying abnormal traffic patterns and promptly detecting potential security threats through real-time analysis of the school’s entire network traffic. The system also supports the monitoring of online behavior to ensure that network access activities comply with the school’s security policies.

The results

After implementing the Zabbix-based system, ZUEL was able to measure a wide range of monitoring performance improvements, including:

  • Improved operational and maintenance efficiency. Through the IP Radar system, the school’s Information Management department has been able to monitor the operational status of over 28,000 hosts in real time, significantly enhancing operational efficiency. The system’s automated fault handling capabilities reduce the complexity of manual operations, allowing operations and maintenance personnel to focus on addressing only the complex issues that the system is unable to resolve automatically. At the same time, the system’s intelligent alerting feature enables the early detection of potential problems, preventing sudden failures.
  • Enhancing system stability and reliability. The high availability design of Zabbix 7.0 ensures that the system remains stable even under heavy loads. Its redundant design and automatic failover mechanisms guarantee the reliability of the system, and the trend analysis functionality provided by IP Radar helps administrators to identify factors that may affect system stability in advance and make corresponding adjustments, enhancing the overall reliability of the IT system in the process.
  • Advancing detailed information management. The IP Radar platform lets schools manage multiple IT resources with greater precision. The system not only monitors the operational status of hardware devices, but it also analyzes the performance of business systems, helping administrators to optimize system configurations and enhance user experiences. During project development, historical data from the monitoring platform serves as an essential basis for decision-making. In the acceptance phase, the monitoring platform provides evaluation reference data for operational efficiency and stability.

The IP Radar monitoring and management system developed by ZUEL and based on Zabbix 7.0 has become the largest, most widely used, and most effective system of its kind (in terms of the volume of monitored data) in the Chinese education sector. The successful implementation of this system not only provides strong support for the school’s information management, but it also offers valuable references for information operation and maintenance at other universities.

In conclusion

Looking ahead, the IP Radar system is poised to expand its functionalities further by integrating more intelligent operation and maintenance management tools. Through the introduction of emerging technologies such as big data analysis and artificial intelligence, the system will achieve more breakthroughs in areas like automated operation and maintenance as well as intelligent fault prediction, providing even more comprehensive technical support for the university’s information management.

To learn more about what Zabbix can do for educational institutions, visit our website.

 

The post Zabbix at the Zhongnan University of Economics and Law appeared first on Zabbix Blog.

Reducing Alert Fatigue with Zabbix and China Pacific Insurance

Post Syndicated from Michael Kammer original https://blog.zabbix.com/reducing-alert-fatigue-with-zabbix-and-china-pacific-insurance/30913/

Headquartered in Shanghai, the China Pacific Insurance (Group) Co., Ltd. (CPI) is a Chinese insurance company that was established on the basis of the former China Pacific Insurance Corporation. CPI Group is the second largest property insurance company and the third largest life insurance company in Mainland China. It provides integrated insurance services (including life insurance, property insurance, and reinsurance) through its subsidiaries.

The challenge:

The overall data center operation structure of the company works along financial industry lines, with a two-site, three-center operation model. The total scale of China Pacific Insurance’s on-premises hosts is over 6,000, and the three centers add up to nearly 40,000 host devices in the production environment.

It’s an enormous amount of information to monitor, so any monitoring solution needs to significantly reduce the difficulty of overall alert analysis. The alert information provided by the mixture of cloud product components that CPI were using caused a serious case of alert fatigue for their operations and maintenance personnel, with some alerts taking as long as 4 months to process.

The solution:

The company’s cloud platforms all had their own monitoring and alerting functions, but the configuration of threshold values and notification policies was not flexible enough. Zabbix’s ability to uniformly collect data while configuring triggers and alerting proved to be a game-changer.

In addition, when compared to cloud vendors whose solutions require adjustments to thresholds in each product component, Zabbix proved to be a much simpler and more cost-effective way to notify operators of only the most essential alerts.

The results:

In practice, CPI found that the Zabbix multi-index combined alert function eliminates 30% of invalid alerts. Thanks to this success, CPI now plans to transfer the Zabbix Data Transmission Service to their digital twin data center, so that the inspection of physical facilities and the impact analysis of the application system can be quickly displayed to their operations and maintenance personnel.

Conclusion

The team at China Pacific Insurance successfully built an intelligent operation and maintenance system covering a number of key modules such as automated operation and maintenance, intelligent monitoring, logging platforms, and container platforms – all with Zabbix at the core. They are currently exploring the cutting-edge integration of monitoring systems with LLMs, further advancing intelligent monitoring and observability solutions in the process.

To learn more about what Zabbix can do for customers in banking and finance, visit our website.

The post Reducing Alert Fatigue with Zabbix and China Pacific Insurance appeared first on Zabbix Blog.

Keeping Latvia Connected with Zabbix and LMT

Post Syndicated from Michael Kammer original https://blog.zabbix.com/keeping-latvia-connected-with-zabbix-and-lmt/30834/

LMT is a mobile GSM/UMTS/LTE operator in Latvia. Founded on January 2, 1992, it was the first mobile network operator in the country. In addition to providing mobile network and ISP services, LMT uses innovative technologies and solutions to develop and maintain a variety of IT solutions for public and private organizations. Currently, LMT is the largest telecommunications service provider in the country, with over 1,660 base stations and over 1.5 million users as of 2024.

The challenge

LMT uses a range of monitoring solutions for a variety of purposes – from tools that perform and monitor ping responses to vendor-specific solutions and all-in-one tools such as Zabbix. LMT has two data centers, and since the vast majority of services delivered by LMT can be considered critical, most of the relevant infrastructure is duplicated across them.

Multiple Zabbix instances are used in the environment, including Zabbix 5.0 with a MySQL database backend and Zabbix 7.0 with PostgreSQL and TimescaleDB. Over 3,000 hosts with approximately 500,000 items are monitored by Zabbix.

The solution

Here is one example of how Zabbix is used to monitor switch cabinets in LMT data centers. Switch cabinets contain devices that measure electric current and support the Modbus protocol, which can in turn be used to collect data.

Modbus monitoring was achieved by using Zabbix agent2 with the official Modbus plugin. This was combined with NetBox and GraphQL. NetBox was used as the source of truth, providing information about power feed and various electrical characteristics, such as voltage, amperage, utilization, phase, and more. The data was collected from NetBox via HTTP agent checks and GraphQL, and a JSON result was created by utilizing Zabbix preprocessing features.

The information collected from NetBox is combined with Modbus data collection utilizing Zabbix agent2. The data collected by Zabbix agent2 is preprocessed after the collection. The collected data is normalized and used by Zabbix low-level discovery features to automatically create Zabbix items and triggers for the available resources. Finally, the resulting data is visualized on Zabbix dashboards.
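To make the discovery step more concrete, here is a minimal Python sketch of the kind of reshaping that has to happen before low-level discovery can use the NetBox data. In the setup described above this work is done by Zabbix preprocessing, not by Python; the GraphQL field name `power_feed_list`, the attribute names, and the `{#FEED.*}` macro names are all assumptions chosen for illustration.

```python
import json


def netbox_feeds_to_lld(graphql_response: dict) -> str:
    """Reshape a NetBox GraphQL reply into a JSON array for low-level discovery."""
    # "power_feed_list" and the attribute names are assumed, not taken from LMT's setup.
    feeds = (graphql_response.get("data") or {}).get("power_feed_list") or []
    rows = [
        {
            # Each key is a hypothetical LLD macro that item prototypes could reference.
            "{#FEED.NAME}": feed["name"],
            "{#FEED.VOLTAGE}": feed["voltage"],
            "{#FEED.AMPERAGE}": feed["amperage"],
            "{#FEED.PHASE}": feed["phase"],
        }
        for feed in feeds
    ]
    return json.dumps(rows)


if __name__ == "__main__":
    sample = {
        "data": {
            "power_feed_list": [
                {"name": "cabinet-12-A", "voltage": 230, "amperage": 16, "phase": "single-phase"}
            ]
        }
    }
    print(netbox_feeds_to_lld(sample))
```

Each object in the resulting array describes one discovered power feed, so item and trigger prototypes can be created per feed without manual configuration.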

The results

Monitoring with Zabbix has made reacting to changes in the monitored power feed (detecting spikes, observing gradual power feed changes, etc.) a much simpler proposition for LMT, which in turn improves service for its millions of users.

In conclusion

Zabbix has proven itself to be an ideal solution for telecommunications clients, making it easier than ever to keep track of network health and performance, driving a more positive customer experience and greater revenue growth in the process.

To learn more about what Zabbix can do for customers in telecommunications, get in touch with us.

The post Keeping Latvia Connected with Zabbix and LMT appeared first on Zabbix Blog.

Transforming IT Infrastructure Visibility at Doğan Trend Automotive

Post Syndicated from Michael Kammer original https://blog.zabbix.com/transforming-it-infrastructure-visibility-at-dogan-trend-automotive/30715/

Established in 2020 to consolidate Doğan Group’s automotive and mobility companies and brands under a single entity, Doğan Trend Automotive is a prominent player in their industry. Representing a diverse portfolio ranging from automobiles, motorcycles, and marine engines to electric commercial vehicles, Doğan Trend also delivers innovative solutions to customers through its e-commerce platforms, such as suvmarket.com and vespastoreturkey.com.

The challenge

Doğan Trend’s IT ecosystem spans data centers, remote locations, and multiple units, necessitating seamless operations as well as an efficient monitoring and alert system. The existing infrastructure posed challenges in monitoring, making it difficult to detect potential issues in a timely manner, thus increasing operational risks.

The solution

To address Doğan Trend’s needs, our associates at ASNSKY implemented a Zabbix-based monitoring system. Key highlights of the project included:

  • Centralized dashboards: Custom dashboards were designed for data centers and remote locations, enabling the unified monitoring of IT locations and components from a single interface.
  • A dynamic alert system: Alerts prioritized based on predefined conditions allowed for the swift and effective resolution of critical issues.
  • Seamless operations: Early detection of potential issues prevented operational disruptions and ensured continuity.

Throughout the integration process, ASNSKY’s team collaborated closely with Doğan Trend’s IT department, addressing the specific requirements of different units and providing training for effective system use.

The results

Implementing the new monitoring system rapidly delivered the following results:

  • Enhanced visibility: Real-time monitoring of all IT locations and components made potential issues easy to spot.
  • Proactive issue management: Early detection of critical issues reduced operational downtime.
  • Increased efficiency: The centralized monitoring system drastically improved the responsiveness and effectiveness of Doğan Trend’s IT team.

“This project with the ASNSKY team made our IT infrastructure more transparent and manageable. With Zabbix’s flexible and effective monitoring capabilities, we gained active control over our critical operations. We thank the ASNSKY team for this successful collaboration.” – Burak Altunalan, IT System Management Specialist at Doğan Trend Automotive

In conclusion

Doğan Trend plans to further take advantage of Zabbix’s flexibility to strengthen their resilience and operational efficiency. Their association with ASNSKY marks a significant step toward achieving these objectives.

To learn more about what Zabbix can do for retail customers in every sector, get in touch with us. 

About ASNSKY

ASNSKY enhances its customers’ competitiveness by integrating the power of enterprise-grade open-source solutions in security and infrastructure with a professional service approach and high quality standards.

Backed by deep industry expertise and a team of seasoned professionals, ASNSKY stands as a trusted partner in your digital transformation journey.

The post Transforming IT Infrastructure Visibility at Doğan Trend Automotive appeared first on Zabbix Blog.

Zabbix and a Federal Government Agency

Post Syndicated from Michael Kammer original https://blog.zabbix.com/zabbix-and-a-federal-government-agency/30708/

Our Premium Partners at the ATS Group work with a large federal government agency in the United States. They primarily provide storage and compute-as-a-service for the agency, which relies on them to stay up and running at all times.

The challenge

The agency’s primary goal was to simplify their capacity and performance monitoring without extra costs. They had very strict regulatory and SLO oversight requirements that had to be met, especially when it came to capacity and performance.

There was no commercially available software that could accomplish everything they needed directly out of the box, but they still required a solution that was powerful and flexible enough to monitor almost anything.

The solution

Because the agency has several data centers of different sizes, its Zabbix-based monitoring solution uses a distributed proxy setup, intensive SLA reporting, a ServiceNow integration, a variety of internal integrations, and predictive alerting.

The agency has plenty of software in the mix, but it primarily relies on storage, VMware, and Kubernetes. They also have multiple satellite offices and data centers, so that in the event of a data center failure, another can come online with minimal downtime in between.

On top of that, they have over 30 metrics and more than a trillion data points across 10 major technologies that they need to measure, primarily from a regulatory perspective. Thousands of granular metrics needed to have solutions and reporting designed for them in Zabbix, including (for example) CPU cores and frequency, processor-to-core usage metrics, and virtualization ratios from hosts to virtual machines.

Their Kubernetes-based OpenShift environment also needs to be monitored to exact specifications. Deployment took place via Helm chart, with Zabbix components installed as Kubernetes resources. Node-level resources and applications are monitored, and the data is aggregated and sent to the Zabbix server.

Metrics are collected via the Kubernetes API and kube-state-metrics, and the solution uses Prometheus-exported metrics or direct HTTP endpoint calls. When it comes to configuration, proxies and hosts are created in Zabbix to represent Kubernetes nodes and clusters, while templates and macros are configured to point to the Kubernetes API and kube-state-metrics endpoints.
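As a rough illustration of what consuming Prometheus-exported metrics involves, the following Python sketch parses a few lines of the Prometheus text format and lists the pods reported in a given phase. It is not how Zabbix does it internally (Zabbix has its own preprocessing for this), and the parser is deliberately simplified; `kube_pod_status_phase` is a kube-state-metrics metric, while the pod and namespace names are made up.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if not match:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def pods_in_phase(text: str, phase: str):
    """List pods whose kube_pod_status_phase sample for `phase` equals 1."""
    return [
        labels["pod"]
        for name, labels, value in parse_samples(text)
        if name == "kube_pod_status_phase" and labels.get("phase") == phase and value == 1
    ]


if __name__ == "__main__":
    SAMPLE = """# TYPE kube_pod_status_phase gauge
kube_pod_status_phase{namespace="default",pod="web-1",phase="Running"} 1
kube_pod_status_phase{namespace="default",pod="web-2",phase="Pending"} 1
"""
    print(pods_in_phase(SAMPLE, "Pending"))
```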

The results

Thanks to Zabbix, the federal government agency in question has a solution that provides centralized monitoring of Kubernetes alongside other IT resources, supports application-specific metrics without requiring Prometheus endpoints, and offers plenty of flexibility to customize and scale.

In addition, Zabbix’s predictive alerting capabilities identify abnormalities in operational data and predictively alert the agency about anything that could potentially impact an application or service, which lets them meet SLAs, optimize user experience, and increase productivity.

In conclusion

Zabbix’s flexibility and ease of customization make it ideal for customers who need a single source of truth that can be relied on in even the most stringent regulatory environments.

To learn more about what Zabbix can do for customers in the public sector, visit us here.

The post Zabbix and a Federal Government Agency appeared first on Zabbix Blog.

Zabbix at the Netherlands Ministry of Infrastructure and Water Management

Post Syndicated from Michael Kammer original https://blog.zabbix.com/zabbix-at-the-netherlands-ministry-of-infrastructure-and-water-management/30681/

The Ministry of Infrastructure and Water Management is the Dutch ministry responsible for transport, aviation, housing policy, public works, spatial planning, land management, and water resource management. Created in 2010 following the merger of the Ministry of Transport and Water Management and the Ministry of Housing, Spatial Planning, and Environment, the ministry works to create an efficient network of roads, railways, waterways, and airways, effective water management to protect against flooding, and improved air and water quality.

The challenge:

The ministry needed a monitoring solution that could handle not only infrastructure monitoring, but also IoT devices responsible for monitoring water levels, water quality, temperature, and other data. The infrastructure components that needed to be monitored included Red Hat Satellite and Capsule servers, Red Hat Virtual Data Centers, Red Hat Identity management, Ansible automation platforms, and a wide range of custom IoT devices.

The solution:

The Red Hat Satellite and Capsule monitoring consists of one Satellite 6 server and 15 Satellite capsules for different environments, with approximately 2,000 Linux machines connected to the capsules. The machines retrieve their packages from the capsules and the capsules act as proxies that fetch data from the Satellite server. The capsules also manage the content packages and subscriptions for the machines.

For Red Hat Satellite and Capsule monitoring, Zabbix performs capsule discovery via low-level discovery, using HTTP requests that collect data through the REST API. Each capsule’s content sync status is monitored, because if the content sync fails, new packages are not installed. Connectivity between capsules and the satellite is also monitored by performing port checks, because capsules need to be able to connect to the satellite in order for the content to be synced.

Zabbix also discovers and monitors satellite repositories, checking both when the last sync was performed and the current sync status. Software subscriptions are also discovered and monitored and alerts are sent, with the severity of the alerts raised at the point when a subscription has only 30 days remaining.

Red Hat Virtual Data Center licenses and identity management also benefit from the added flexibility that Zabbix brings to the table. Virtual DC licenses must be present on ESX hosts that run VMs, so situations where an ESX host with an active license has no VMs on it (which would mean that a license is essentially being wasted) or where VMs are migrated to a host without a license must be avoided. Whenever a Zabbix trigger detects a problem, Ansible automatically attaches or detaches a license to or from the ESX host, depending on the type of problem detected.
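The decision behind that automation can be sketched in a few lines of Python. This is only an illustration of the logic as we read it from the description above: the function name and the action labels are hypothetical, and the real remediation is carried out by Ansible in response to Zabbix triggers.

```python
def license_action(has_license: bool, vm_count: int) -> str:
    """Decide what (hypothetical) remediation a host needs, if any."""
    if has_license and vm_count == 0:
        # A licensed host with nothing running on it is wasting a license.
        return "detach"
    if not has_license and vm_count > 0:
        # VMs have landed on a host that has no license attached.
        return "attach"
    return "none"


if __name__ == "__main__":
    for licensed, vms in [(True, 0), (False, 3), (True, 5), (False, 0)]:
        print(licensed, vms, license_action(licensed, vms))
```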

When it comes to Red Hat identity management, Zabbix discovers and monitors processes on the identity management platform (including identity management service status) thanks to the ability to extend Zabbix agent with user parameters.

Meanwhile, Ansible Automation Platform monitoring focuses on controllers. The Ansible Automation Platform API is used to discover the controllers, and each controller is checked for running jobs, last seen time, capacity, and status. Sometimes controllers are disabled for maintenance and then re-enabled, so alerts are sent out for controllers that have been disabled for an extended period.

Ansible Automation Platform monitoring also includes monitoring decommissioned machines, which are assigned to a group instead of being immediately deleted. Zabbix monitors the grace period for the decommissioned machines and alerts users if the grace period is over, generating a warning if an Ansible host is disabled for seven days and then escalating it if the machine has been disabled for more than 14 days.
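The two-step escalation can be expressed as a small function. The sketch below is in Python purely for illustration (in practice this would be a pair of Zabbix triggers); the seven-day and 14-day thresholds come from the description above, while the function name and the returned labels are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARNING_AFTER = timedelta(days=7)    # warning once a host has been disabled for seven days
ESCALATE_AFTER = timedelta(days=14)  # escalation once it has been disabled for more than 14 days


def decommission_severity(disabled_since: datetime, now: datetime) -> Optional[str]:
    """Return a hypothetical severity label for a disabled host, or None inside the grace period."""
    age = now - disabled_since
    if age > ESCALATE_AFTER:
        return "escalated"
    if age >= WARNING_AFTER:
        return "warning"
    return None


if __name__ == "__main__":
    now = datetime(2025, 1, 31, tzinfo=timezone.utc)
    for days in (3, 7, 15):
        print(days, decommission_severity(now - timedelta(days=days), now))
```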

Zabbix also discovers and monitors configuration management jobs, and if a job fails it will attempt to restart it. If the issue is still not resolved, it gets escalated to the appropriate individual. These Ansible checks are primarily done via HTTP agents, from Zabbix servers or proxies.

Finally, in addition to infrastructure monitoring, Zabbix also monitors the health of IoT devices responsible for water levels, water quality, temperature, and other data. These devices are running Raspberry Pi modules and Zabbix Agent 2 is used to monitor the device status. Zabbix Agent 2 with a local agent database is used in cases where the agent is unable to send the metrics on these devices. Should a network outage happen, Zabbix stores the backlog data in the local agent database.

The results:

Trusting their monitoring to Zabbix has greatly improved processes at the ministry, saving time and money by making it easy to notice and fix issues before affected departments themselves were aware of them. In addition, having the latest historical data at their fingertips has been invaluable to the ministry’s technical teams during troubleshooting or when dealing with performance issues, saving everyone involved a great deal of time.

In conclusion

Zabbix’s flexible nature and its ability to integrate with popular platforms as well as custom devices made it the perfect “one-stop shop” for the ministry’s needs, consolidating all of their monitoring in a single pane of glass and giving them complete visibility into every layer of their infrastructure – while also integrating smoothly with their existing systems.

To learn more about what Zabbix can do for customers in the public sector, contact us.

The post Zabbix at the Netherlands Ministry of Infrastructure and Water Management appeared first on Zabbix Blog.

Melting the ice — How Natural Intelligence simplified a data lake migration to Apache Iceberg

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/melting-the-ice-how-natural-intelligence-simplified-a-data-lake-migration-to-apache-iceberg/

This post is co-written with Haya Axelrod Stern, Zion Rubin and Michal Urbanowicz from Natural Intelligence.

Many organizations turn to data lakes for the flexibility and scale needed to manage large volumes of structured and unstructured data. However, migrating an existing data lake to a new table format such as Apache Iceberg can bring significant technical and organizational challenges.

Natural Intelligence (NI) is a world leader in multi-category marketplaces. NI’s leading brands, Top10.com and BestMoney.com, help millions of people worldwide to make informed decisions every day. Recently, NI embarked on a journey to transition their legacy data lake from Apache Hive to Apache Iceberg.

In this blog post, NI shares their journey, the innovative solutions developed, and the key takeaways that can guide other organizations considering a similar path.

This article details NI’s practical approach to this complex migration, focusing less on Apache Iceberg’s technical specifications and more on the real-world challenges and solutions encountered during the transition, a challenge that many organizations are grappling with.

Why Apache Iceberg?

The architecture at NI followed the commonly used medallion architecture, comprised of a bronze-silver-gold layered framework, shown in the figure that follows:

  • Bronze layer: Unprocessed data from various sources, stored in its raw format in Amazon Simple Storage Service (Amazon S3), ingested through Apache Kafka brokers.
  • Silver layer: Contains cleaned and enriched data, processed using Apache Flink.
  • Gold layer: Holds analytics-ready datasets designed for business intelligence (BI) and reporting, produced using Apache Spark pipelines, and consumed by services such as Snowflake, Amazon Athena, Tableau, and Apache Druid. The data is stored in Apache Parquet format with AWS Glue Catalog providing metadata management.

[Figure: NI’s medallion architecture with bronze, silver, and gold layers]

While this architecture supported NI’s analytical needs, it lacked the flexibility required for a truly open and adaptable data platform. The gold layer was coupled only with query engines that supported Hive and AWS Glue Data Catalog. It was possible to use Amazon Athena; however, Snowflake required maintaining another catalog in order to query those external tables. This issue made it difficult to evaluate or adopt alternative tools and engines without costly data duplication, query rewrites, or data catalog synchronization. As the business scaled, NI needed a data platform that could seamlessly support multiple query engines simultaneously with a single data catalog while avoiding vendor lock-in.

The power of Apache Iceberg

Apache Iceberg emerged as the perfect solution: a flexible, open table format that aligns with NI’s approach of Data Lake First. Iceberg offers several critical advantages such as ACID transactions, schema evolution, time travel, performance improvements, and more. But the key strategic benefit lay in the ability to support multiple query engines simultaneously. It also has the following advantages:

  • Decoupling of storage and compute: The open table format enables you to separate the storage layer from the query engine, allowing an easy swap and support for multiple engines concurrently without data duplication.
  • Vendor independence: As an open table format, Apache Iceberg prevents vendor lock-in, giving you the flexibility to adapt to changing analytics needs.
  • Vendor adoption: Apache Iceberg is widely supported by major platforms and tools, providing seamless integration and long-term ecosystem compatibility.

By transitioning to Iceberg, NI was able to embrace a truly open data platform, providing long-term flexibility, scalability, and interoperability while maintaining a unified source of truth for all analytics and reporting needs.

Challenges faced

Migrating a live production data lake to Iceberg was challenging because of operational complexities and legacy constraints. The data service at NI runs hundreds of Spark and machine learning pipelines, manages thousands of tables, and supports over 400 dashboards—all operating 24/7. Any migration would need to be done without production interruptions; and coordinating such a migration while operations continue seamlessly was daunting.

NI needed to accommodate diverse users with varying requirements and timelines from data engineers to data analysts all the way to data scientists and BI teams.

Adding to the challenge were legacy constraints. Some of the existing tools didn’t fully support Iceberg, so there was a need to maintain Hive-backed tables for compatibility. NI also realized that not all consumers could adopt Iceberg immediately, so a plan was required to allow for incremental transitions without downtime or disruption to ongoing operations.

Key pillars for migration

To help ensure a smooth and successful transition, six critical pillars were defined:

  • Support ongoing operations: Maintain uninterrupted compatibility with existing systems and workflows during the migration process.
  • User transparency: Minimize disruption for users by preserving existing table names and access patterns.
  • Gradual consumer migration: Allow consumers to adopt Iceberg at their own pace, avoiding a forced, simultaneous switchover.
  • ETL flexibility: Migrate ETL pipelines to Iceberg without imposing constraints on development or deployment.
  • Cost effectiveness: Minimize storage and compute duplication and overhead during the migration period.
  • Minimize maintenance: Reduce the operational burden of managing dual table formats (Hive and Iceberg) during the transition.

Evaluating traditional migration approaches

Apache Iceberg supports two main approaches for migration: in-place and rewrite-based migration.

In-place migration

How it works: Converts an existing dataset into an Iceberg table without duplicating data by creating Iceberg metadata on top of the existing files while preserving their layout and format.

Advantages:

  • Cost-effective in terms of storage (no data duplication)
  • Simplified implementation
  • Maintains existing table names and locations
  • No data movement and minimal compute requirements, translating into lower cost

Disadvantages:

  • Downtime required: All write operations must be paused during conversion, which was unacceptable in NI’s case because data and analytics are considered mission critical and run 24/7
  • No gradual adoption: All consumers must switch to Iceberg simultaneously, increasing the risk of disruption
  • Limited validation: No opportunity to validate data before cutover; rollback requires restoring from backups
  • Technical constraints: Schema evolution during migration can be challenging; data type incompatibilities can halt the entire process

Rewrite-based migration

How it works: Rewrite-based migration in Apache Iceberg involves creating a new Iceberg table by rewriting and reorganizing existing dataset files into Iceberg’s optimized format and structure for improved performance and data management.

Advantages:

  • Zero downtime during migration
  • Supports gradual consumer migration
  • Enables thorough validation
  • Simple rollback mechanism

Disadvantages:

  • Resource overhead: Double storage and compute costs during migration
  • Maintenance complexity: Managing two parallel data pipelines increases operational burden
  • Consistency challenges: Maintaining perfect consistency between the two systems is challenging
  • Performance impact: Increased latency because of dual writes; potential pipeline slowdowns

Why neither option alone was good enough

NI decided that neither option could meet all critical requirements:

  • In-place migration fell short because of unacceptable downtime and lack of support for gradual migration.
  • Rewrite-based migration fell short because of prohibitive cost overhead and complex operational management.

This analysis led NI to develop a hybrid approach that combines the advantages of both methods while mitigating and minimizing limitations.

The hybrid solution

The hybrid migration strategy was designed around five foundational elements, using AWS analytical services for orchestration, processing, and state management.

  1. Hive-to-Iceberg CDC: Automatically synchronize Hive tables with Iceberg using a custom change data capture (CDC) process to support existing consumers. Unlike traditional CDC focusing on row-level changes, the process was done at the partition-level to preserve Hive’s behavior of updating tables by overwriting partitions. This helps ensure that data consistency is maintained between Hive and Iceberg without logic changes at the migration phase, making sure that the same data exists on both tables.
  2. Continuous schema synchronization: Schema evolution during the migration introduced maintenance challenges. Automated schema sync processes compared Hive and Iceberg schemas, reconciling differences while maintaining type compatibility.
  3. Iceberg-to-Hive reverse CDC: To enable the data team to transition extract, transform, and load (ETL) jobs to write directly to Iceberg while maintaining compatibility with existing Hive-based processes not yet migrated, a reverse CDC from Iceberg to Hive was implemented. This allowed ETLs to write to Iceberg while maintaining Hive tables for downstream processes that had not yet migrated and still relied on them during the migration period.
  4. Alias management in Snowflake: Snowflake aliases made sure that Iceberg tables retained their original names, making the transition transparent to users. This approach minimized reconfiguration efforts across dependent teams and workflows.
  5. Table replacement: Swap production tables while retaining original names, completing the migration.

Technical deep dive

The migration from Hive to Iceberg consisted of several steps:

1. Hive-to-Iceberg CDC pipeline

Objective: Keep Hive and Iceberg tables synchronized without duplicating effort.

The preceding figure demonstrates how every partition written to the Hive table is automatically and transparently copied to the Iceberg table using a CDC process. This process makes sure that both tables are synchronized, enabling a seamless and incremental migration without disrupting downstream systems. NI chose partition-level synchronization because the legacy Hive ETL jobs already wrote updates by overwriting entire partitions and updating the partition location. Adopting that same approach in the CDC pipeline helped ensure that it remained consistent with how data was originally managed, making the migration smoother and avoiding the need to rework row-level logic.

Implementation:

  • To keep Hive and Iceberg tables synchronized without duplicating effort, a streamlined pipeline was implemented. Whenever partitions in Hive tables are updated, the AWS Glue Data Catalog emits events such as UpdatePartition. Amazon EventBridge captured these events, filtered them for the relevant databases and tables according to the EventBridge rule, and triggered an AWS Lambda function. This function parsed the event metadata and sent the partition updates to an Apache Kafka topic.
  • A Spark job running on Amazon EMR consumed the messages from Kafka, which contained the updated partition details from the Data Catalog events. Using that event metadata, the Spark job queried the relevant Hive table and wrote the data to the Iceberg table in Amazon S3 using the Spark Iceberg overwritePartitions API. The following is an example of a Data Catalog event that drives this flow:
{
   "id":"10397e54-c049-fc7b-76c8-59e148c7cbfc",
   "detail-type":"Glue Data Catalog Table State Change",
   "source":"aws.glue",
   "time":"2024-10-27T17:16:21Z",
   "region":"us-east-1",
   "detail":{
      "databaseName":"dlk_visitor_funnel_dwh_production",
      "changedPartitions":[
         "2024-10-27"
      ],
      "typeOfChange":"UpdatePartition",
      "tableName":"fact_events"
   }
}
  • By targeting only modified partitions, the pipeline (shown in the following figure) significantly reduced the need for costly full-table rewrites. Iceberg’s robust metadata layers, including snapshots and manifest files, were seamlessly updated to capture these changes, providing efficient and accurate synchronization between Hive and Iceberg tables.
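The Lambda step described above can be sketched roughly as follows. This is a minimal illustration, not NI’s actual function: the names `WATCHED_TABLES`, `partition_messages`, `handler`, the `send` callable (standing in for a Kafka producer), the topic name `hive-partition-updates`, and the message shape are all assumptions made for the example. Only the event fields come from the sample event shown earlier.

```python
import json
from typing import Callable, Dict, List

# Hypothetical allow-list for illustration; in the flow described above,
# filtering is done by the EventBridge rule.
WATCHED_TABLES = {("dlk_visitor_funnel_dwh_production", "fact_events")}


def partition_messages(event: Dict) -> List[Dict]:
    """Turn one Data Catalog event into one message per changed partition."""
    detail = event.get("detail", {})
    if detail.get("typeOfChange") != "UpdatePartition":
        return []
    key = (detail.get("databaseName"), detail.get("tableName"))
    if key not in WATCHED_TABLES:
        return []
    return [
        {
            "database": key[0],
            "table": key[1],
            "partition": partition,
            "event_time": event.get("time"),
        }
        for partition in detail.get("changedPartitions", [])
    ]


def _discard(topic: str, value: bytes) -> None:
    """No-op sender used when no producer is supplied."""


def handler(event: Dict, context: object,
            send: Callable[[str, bytes], None] = _discard) -> int:
    """Lambda-style entry point; `send` is a stand-in for a Kafka producer call."""
    messages = partition_messages(event)
    for message in messages:
        send("hive-partition-updates", json.dumps(message).encode("utf-8"))
    return len(messages)
```

Emitting one message per partition keeps the downstream Spark job simple: each message names exactly one partition to read from Hive and overwrite in Iceberg.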

2. Iceberg-to-Hive reverse CDC pipeline

Objective: Support Hive consumers while allowing ETL pipelines to transition to Iceberg.

BDB4681-arch4

The preceding figure shows the reverse process, where every partition written to the Iceberg table is automatically and transparently copied to the Hive table using a CDC mechanism. This process helps ensure synchronization between the two systems, enabling seamless data updates for legacy systems that still rely on Hive while transitioning to Iceberg.

Implementation:

Synchronizing data from Iceberg tables back to Hive tables presented a different challenge. Unlike Hive tables, Data Catalog doesn’t track partition updates for Iceberg tables because partitions in Iceberg are managed internally and not within the catalog. This meant NI couldn’t rely on Glue Catalog events to detect partition changes.

To address this, NI implemented a solution similar to the previous flow but adapted to Iceberg’s architecture. Apache Spark was used to query Iceberg’s metadata tables—specifically the snapshots and entries tables—to identify the partitions modified since the last synchronization. The query used was:

SELECT e.data_file.partition, MAX(s.committed_at) AS last_modified_time 
FROM $target_table.snapshots s JOIN $target_table.entries e ON s.snapshot_id = e.snapshot_id 
WHERE s.committed_at > '$last_sync_time' 
GROUP BY e.data_file.partition;

This query returned only the partitions that had been updated since the last synchronization, so the sync job could focus exclusively on the changed data. Using this information, similar to the earlier process, a Spark job retrieved the updated partitions from Iceberg and wrote them back to the corresponding Hive table, keeping both tables synchronized.
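To make the join-and-group logic of that query concrete, here is a small in-memory sketch in Python. It isn’t how the Spark job runs. The row shapes (plain dicts with `snapshot_id`, `committed_at`, and a string `partition`) are simplified assumptions rather than the real layout of Iceberg’s metadata tables, and `changed_partitions` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Dict, Iterable


def changed_partitions(snapshots: Iterable[Dict], entries: Iterable[Dict],
                       last_sync_time: datetime) -> Dict[str, datetime]:
    """Return {partition: latest commit time} for partitions touched after last_sync_time."""
    # WHERE s.committed_at > last_sync_time
    recent = {
        s["snapshot_id"]: s["committed_at"]
        for s in snapshots
        if s["committed_at"] > last_sync_time
    }
    latest: Dict[str, datetime] = {}
    for entry in entries:
        # JOIN ... ON s.snapshot_id = e.snapshot_id
        committed_at = recent.get(entry["snapshot_id"])
        if committed_at is None:
            continue
        # GROUP BY partition, keeping MAX(committed_at)
        partition = entry["partition"]
        if partition not in latest or committed_at > latest[partition]:
            latest[partition] = committed_at
    return latest
```

After a successful run, the largest returned timestamp is a natural candidate for the next `last_sync_time` checkpoint, although the post doesn’t say how NI chose that value.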

3. Continuous schema synchronization

Objective: Automate schema updates to maintain consistency across Hive and Iceberg.

BDB4681-arch5

The preceding figure shows how the automatic schema sync process helps ensure consistency between Hive and Iceberg table schemas by automatically synchronizing schema changes (in this example, adding the Channel column), minimizing manual work and double maintenance during the extended migration period.

Implementation:

To handle schema changes between Hive and Iceberg, a process was implemented to detect and reconcile differences automatically. When a schema change happens in a Hive table, Data Catalog emits an UpdateTable event. This event triggers a Lambda function (routed through EventBridge), which retrieves the updated schema from Data Catalog for the Hive table and compares it to the Iceberg schema. It’s important to call out that in NI’s setup, schema changes originate from Hive because the Iceberg table is hidden behind aliases across the system. Because Iceberg is primarily used for Snowflake, a one-way sync from Hive to Iceberg is sufficient. As a result, there is no mechanism to detect or handle schema changes made directly in Iceberg, because they aren’t needed in the current workflow.

During the schema reconciliation (shown in the following figure), data types are normalized to help ensure compatibility, for example by converting Hive’s VARCHAR to Iceberg’s STRING. Any new fields or type changes are validated and applied to the Iceberg schema using a Spark job running on Amazon EMR. Amazon DynamoDB stores schema synchronization checkpoints, which allow tracking changes over time and help maintain consistency between the Hive and Iceberg schemas.
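A rough sketch of the normalize-and-compare step might look like the following. The type mapping is an illustrative assumption covering common primitive types only, and `normalize_hive_type` and `diff_schemas` are hypothetical names. The post doesn’t show NI’s actual reconciliation rules.

```python
import re
from typing import Dict, List, Tuple

# Illustrative Hive -> Iceberg primitive type mapping (an assumption for this sketch).
_SIMPLE_TYPES = {
    "tinyint": "int", "smallint": "int", "int": "int", "integer": "int",
    "bigint": "long", "float": "float", "double": "double",
    "boolean": "boolean", "string": "string", "date": "date",
    "timestamp": "timestamp", "binary": "binary",
}


def normalize_hive_type(hive_type: str) -> str:
    """Map a Hive column type to the Iceberg type name it is compared against."""
    t = hive_type.strip().lower()
    if re.fullmatch(r"(varchar|char)\(\d+\)", t):
        return "string"  # e.g. VARCHAR(64) -> string
    if re.fullmatch(r"decimal\(\d+,\s*\d+\)", t):
        return t.replace(" ", "")
    if t in _SIMPLE_TYPES:
        return _SIMPLE_TYPES[t]
    raise ValueError(f"unmapped Hive type: {hive_type}")


def diff_schemas(hive: Dict[str, str], iceberg: Dict[str, str]
                 ) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, str]]]:
    """Return (columns to add, columns whose type differs) for a one-way Hive -> Iceberg sync."""
    added: List[Tuple[str, str]] = []
    changed: List[Tuple[str, str, str]] = []
    for name, hive_type in hive.items():
        target = normalize_hive_type(hive_type)
        if name not in iceberg:
            added.append((name, target))
        elif iceberg[name] != target:
            changed.append((name, iceberg[name], target))
    return added, changed
```

Because the sync is one-way, columns that exist only on the Iceberg side are deliberately ignored here. Raising on an unmapped type, rather than guessing, means a data type incompatibility shows up as an explicit failure.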

BDB4681-arch6

Automating this schema synchronization significantly reduced maintenance overhead and freed developers from manually keeping schemas in sync, making the long migration period far more manageable.

The preceding figure depicts an automated workflow to maintain schema consistency between Hive and Iceberg tables. AWS Glue captures table state change events from Hive, which trigger an EventBridge event. The event invokes a Lambda function that fetches metadata from DynamoDB and compares schemas fetched from AWS Glue for both Hive and Iceberg tables. If a mismatch is detected, the schema in Iceberg is updated to help ensure alignment, minimizing manual intervention and supporting smooth operation during the migration.

4. Alias management in Snowflake

Objective: Enable Snowflake consumers to adopt Iceberg without changing query references.

The preceding figure shows how Snowflake aliases enable seamless migration by mapping queries like SELECT platform, COUNT(clickouts) FROM funnel.clickouts to Iceberg tables in the Glue Catalog. Even with suffixes added during the Iceberg migration, existing queries and workflows remain unchanged, minimizing disruption for BI tools and analysts.

Implementation:

To help ensure a seamless experience for BI tools and analysts during the migration, Snowflake aliases were used to map external tables to the Iceberg metadata stored in Data Catalog. By assigning aliases that matched the original Hive table names, existing queries and reports were preserved without interruption. For example, an external table was created in Snowflake and aliased to the original table name, as shown in the following query:

CREATE OR REPLACE ICEBERG TABLE dlk_visitor_funnel_dwh_production.aggregated_cost 
EXTERNAL_VOLUME = 's3_dlk_visitor_funnel_dwh_production_iceberg_migration' 
CATALOG = 'glue_dlk_visitor_funnel_dwh_production_iceberg_migration' 
CATALOG_TABLE_NAME = 'aggregated_cost'; 
ALTER ICEBERG TABLE dlk_visitor_funnel_dwh_production.aggregated_cost REFRESH;

When the migration was complete, the alias was simply updated to point to the new location or schema, making the transition seamless and minimizing any disruption to user workflows.

5. Table replacement

Objective: When all ETLs and related data workflows were successfully transitioned to use Apache Iceberg’s capabilities, and everything was functioning correctly with the synchronization flow, it was time to move on to the final phase of the migration. The primary objective was to maintain the original table names, avoiding the use of any prefixes like those employed in the earlier, intermediate migration steps. This helped ensure that the configuration remained tidy and free from unnecessary naming complications.

The preceding figure shows the table replacement to complete the migration, where Hive on Amazon EMR was used to register Parquet files as Iceberg tables while preserving original table names and avoiding data duplication, helping to ensure a seamless and tidy migration.

Implementation:

One of the challenges was that renaming tables isn’t possible within AWS Glue, which prevents the use of a straightforward renaming approach for the existing synchronization flow tables. In addition, AWS Glue doesn’t support the Migrate procedure, which creates Iceberg metadata on top of the existing data file while preserving the original table name. The strategy to overcome this limitation was to use a Hive metastore on an Amazon EMR cluster. By using Hive on Amazon EMR, NI was able to create the final tables with their original names because it operates in a separate metastore environment, giving the flexibility to define any required schema and table names without interference.

The add_files procedure was used to methodically register all the existing Parquet files, thus constructing all necessary metadata within Hive. This was a crucial step, because it helped ensure that all data files were appropriately cataloged and linked within the metastore.

The preceding figure shows the transition of a production table to Iceberg by using the add_files procedure to register existing Parquet files and create Iceberg metadata. This helped ensure a smooth migration while preserving the original data and avoiding duplication.

This setup allowed the use of existing Parquet files without duplicating data, thus saving resources. Although the sync flow used separate buckets for the final architecture, NI chose to maintain the original buckets and cleaned the intermediate files. This resulted in a different folder structure on Amazon S3. The historical data had subfolders for each partition under the root table directory, while the new Iceberg data organizes subfolders within a data folder. This difference was acceptable to avoid data duplication and preserve the original Amazon S3 buckets.
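The add_files registration described above is issued as a Spark SQL `CALL` statement. The helper below is a minimal sketch that only builds that statement as a string. The function name `add_files_statement`, the catalog name, and the S3 path in the usage note are hypothetical. The `table` and `source_table` arguments follow Iceberg’s documented Spark procedure, but the exact call NI ran isn’t shown in the post.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def add_files_statement(catalog: str, target_table: str, parquet_path: str) -> str:
    """Build a Spark SQL CALL that registers existing Parquet files in an Iceberg table."""
    for identifier in (catalog, target_table):
        if not _IDENTIFIER.fullmatch(identifier):
            raise ValueError(f"unexpected identifier: {identifier!r}")
    if "`" in parquet_path or "'" in parquet_path:
        raise ValueError("path must not contain quote characters")
    return (
        f"CALL {catalog}.system.add_files("
        f"table => '{target_table}', "
        f"source_table => '`parquet`.`{parquet_path}`')"
    )
```

For example, `add_files_statement("spark_catalog", "dlk_visitor_funnel_dwh_production.fact_events", "s3://example-bucket/fact_events/")` yields a statement that could be passed to `spark.sql(...)`. The files stay where they are and only Iceberg metadata is written.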

Technical recap

The AWS Glue Data Catalog served as the primary source of truth for schema and table updates, with Amazon EventBridge capturing Data Catalog events to trigger synchronization workflows. AWS Lambda parsed event metadata and managed schema synchronization, while Apache Kafka buffered events for real-time processing. Apache Spark on Amazon EMR handled data transformations and incremental updates, and Amazon DynamoDB maintained state, including synchronization checkpoints and table mappings. Finally, Snowflake seamlessly consumed Iceberg tables via aliases without disrupting existing workflows.

Migration outcome

The migration was completed with zero downtime; continuous operations were maintained throughout, supporting hundreds of pipelines and dashboards without interruption. The migration was carried out with a cost-optimized mindset, using incremental updates and partition-level synchronization to minimize the use of compute and storage resources. Lastly, NI established a modern, vendor-neutral platform that can scale with their evolving analytics and machine learning needs. It enables seamless integration with multiple compute and query engines, supporting flexibility and further innovation.

Conclusion

Natural Intelligence’s migration to Apache Iceberg was a pivotal step in modernizing the company’s data infrastructure. By adopting a hybrid strategy and using the power of event-driven architectures, NI helped ensure a seamless transition that balanced innovation with operational stability. The journey underscored the importance of careful planning, understanding the data ecosystem, and focusing on an organization-first approach.

Above all, the business was kept in focus, and continuity and the user experience were prioritized. By doing so, NI unlocked the flexibility and scalability of their data lake while minimizing disruption, allowing teams to use cutting-edge analytics capabilities and positioning the company at the forefront of modern data management.

If you’re considering an Apache Iceberg migration or facing similar data infrastructure challenges, we encourage you to explore the possibilities. Embrace open formats, use automation, and design with your organization’s unique needs in mind. The journey might be complex, but the rewards in scalability, flexibility, and innovation are well worth the effort. You can use the AWS prescriptive guide to learn more about how to best use Apache Iceberg for your organization.


About the Authors

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. Yonatan is an Apache Iceberg evangelist.

Haya Stern is a Senior Director of Data at Natural Intelligence. She leads the development of NI’s large-scale data platform, with a focus on enabling analytics, streamlining data workflows, and improving dev efficiency. In the past year, she led the successful migration from the previous data architecture to a modern lake house based on Apache Iceberg and Snowflake.

Zion Rubin is a Data Architect at Natural Intelligence with ten years of experience architecting large‑scale big‑data platforms, now focused on developing intelligent agent systems that turn complex data into real‑time business insight.

Michał Urbanowicz is a Cloud Data Engineer at Natural Intelligence with expertise in migrating data warehouses and implementing robust retention, cleanup, and monitoring processes to ensure scalability and reliability. He also develops automations that streamline and support campaign management operations in cloud-based environments.

The ATS Group and a Large MSP

Post Syndicated from Michael Kammer original https://blog.zabbix.com/the-ats-group-and-a-large-msp/29857/

One of the most critical clients of our Premium Partners at the ATS Group is a large MSP that acts as a service and administration platform for their own clients, providing them with hardware, software, engineers, support staff, metrics, and reporting.

The challenge

The MSP needed a stable, high-performance platform monitoring solution that would cover all the services they provided. They didn’t have the capabilities or budget to run multiple monitoring solutions – a single, flexible solution that could track every service was paramount, as was the ability to react to anomalies before they became serious problems.

After an initial trial with a different monitoring solution that was notable for poor service, a lack of integrations, no community, and almost no documentation, they took a closer look at Zabbix, thanks in large part to our focus on preventative action and automation.

The solution

Because of their focus on performance-based monitoring, the client went with a “hot-cold” architecture and an integration with Ansible EDA, which stands for Event-Driven Ansible. It turned out to be a true “force multiplier”, as using Zabbix, Ansible, and EDA together allowed the MSP to monitor their systems, automate tasks based on real-time events, and provide immediate responses to issues without manual intervention.

The integration was designed to sort issues by whether or not they were able to be automated. If an issue arose that required human intervention, alerts could be sent to ServiceNow via multiple channels. If human intervention was unnecessary, the issue was rerouted to Event-Driven Ansible, which runs automation on all monitored hosts.

For example, with the joint Zabbix/Ansible solution, a slash admin backstage management system filling up at 2AM because of an overflowing log file for some script is no longer an urgent issue. If there are multiple gigabytes of room in the volume group, Zabbix can tell Ansible it’s a problem. Ansible can then increase the file system by 25% and send a message letting the engineers know in the morning that action was taken on their behalf.
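The decision in this example can be illustrated with a few lines of Python. This is only a sketch of the reasoning, not the MSP’s actual Zabbix trigger or Ansible rulebook. The function name `planned_growth_gib` and the rule that the full 25% must fit in the volume group’s free space are assumptions.

```python
from typing import Optional


def planned_growth_gib(fs_size_gib: float, vg_free_gib: float,
                       growth: float = 0.25) -> Optional[float]:
    """Return how much to grow the file system, or None if a human should be alerted."""
    needed = fs_size_gib * growth
    if needed <= vg_free_gib:
        return needed  # safe to automate: the volume group has room
    return None        # not enough room: escalate instead of automating
```

With a 40 GiB file system and 20 GiB free in the volume group, the sketch plans a 10 GiB extension. With only 5 GiB free, it returns `None`, corresponding to the path where an alert goes to ServiceNow.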

The results

With essentially no software costs and an automation integration that can find issues and fix them independently, the MSP was able to rapidly achieve a much higher service-to-spend ratio than they’d ever imagined possible.

There has been a noted increase in employee satisfaction as well – thanks to automation, engineers no longer have to be “on call” at all hours to solve simple issues, while C-level executives have seen productivity skyrocket thanks to the joint solution’s ability to find potential issues before they become real problems.

In conclusion

At Zabbix, we work hard to stay on the forefront of automation. That means constantly improving our own product while also staying on top of new technologies like Event-Driven Ansible in order to better integrate with them. To learn more about what Zabbix can do for MSPs, visit us here.

The post The ATS Group and a Large MSP appeared first on Zabbix Blog.

The ATS Group and a Regional Telecom Provider

Post Syndicated from Michael Kammer original https://blog.zabbix.com/the-ats-group-and-a-regional-telecom-provider/29671/

Our Premium Partners at the ATS Group have a regional telecom provider on the West Coast of the United States as one of their key clients. The provider covers a massive geographical area on a limited budget and serves thousands of (primarily rural) customers.

The Challenge

After recent price hikes by the “big-box” monitoring solutions, the provider needed an alternative with a more stable pricing model. Simply put, their budget was shrinking, but their software monitoring costs were expanding.

The provider had a large stock of non-traditional IT equipment that all needed to be monitored effectively, and they also had only one month to get all monitored devices and endpoints over to a new solution.

On top of that, many of the provider’s legacy systems were directly related to regulatory compliance and therefore needed to be operational from day one.

The Solution

The provider set about migrating to a complete and robust Zabbix 7.0 solution that would eliminate any foreseeable issues – even the loss of an entire data center.

There were a few initial hiccups in the implementation when it came to getting PostgreSQL set up with database proxies, but the ATS Group team quickly arrived at an architecture that the provider was happy with. The clear and easy-to-follow Zabbix documentation was of particular help.

The Results

The new Zabbix solution, as implemented, was able to monitor a number of things that had previously been challenging, including:

• Doors. The provider badly needed a solution for monitoring doors, including entrance and exit doors as well as cabinet doors in data centers. They had long-term compliance issues with doors sticking open, employees forgetting to close doors, etc. Zabbix made it easy to develop custom SNMP traps that send alerts in case of open doors, solving the issue.

• Weather. The provider’s services are available over a large and varied geographical area that encompasses multiple states. The ability of Zabbix to predict weather changes across this area has been an important added bonus, with the provider now being able to get future weather alerts that can be used to compare against equipment tolerance levels. Personnel can then be sent to affected areas in anticipation of weather events, instead of being purely reactive.

• SLAs. The provider functions as an ISP that provides internet access to customers in rural areas, many of whom may not have other means of accessing the world around them. As such, they not only feel a strong sense of duty to provide consistent uptime, but they are bound by a strict set of service level agreements (SLAs). With Zabbix, it’s possible to provide SLAs for some of the remote edge equipment involved by building an integration with ServiceNow.

In conclusion

The telecom provider in question trusts Zabbix to guarantee rural broadband access for thousands of customers over an enormous geographic area. Zabbix not only gets the job done more effectively than other monitoring solutions, it does so at a fraction of the cost.

The post The ATS Group and a Regional Telecom Provider appeared first on Zabbix Blog.

Migration to Zabbix 7.0

Post Syndicated from Rogerio Batista original https://blog.zabbix.com/migration-to-zabbix-7-0/29594/

Based in northern Brazil, TO HOST Data Centers provides regional cloud services with a focus on cloud computing, colocation, and infrastructure management. With 35 suppliers and partners and over 5,000 monitored assets, their mission is to provide innovative IT infrastructure products and services with a high level of proficiency, in order to meet the high standards required by their clients and partners. To do this, they need to monitor internal applications, data center assets, devices, and customer environments, ensuring high availability and optimal performance.

The challenge:

TO HOST’s monitoring environment included a standalone server (Zabbix, FrontEnd, Database) with the following:

  • Hosts: ~600
  • Items/Metrics: ~90,000
  • Average period for history table: 45~60 days
  • Average period for trends table: 365 days
  • Average period for events table: 365 days
  • 3 Internal Proxies
  • 8 Client Proxies
  • ~30 External Active Agents

TO HOST needed a clean installation of Zabbix Server and Zabbix Proxy version 7.0.x on separate virtual machines with an updated operating system (Oracle 9), plus a migration of the current monitoring environment database to the new version, while preserving history and data integrity.

Their production servers were outdated, featuring a CentOS 7 version that was originally installed with Zabbix version 5.2.x and updated to version 6.0.x in 2022. The migration needed to retain historical data and ensure compatibility with Zabbix 7.0.x, while keeping service interruptions to a minimum.

A number of risks were anticipated and planned for – during the data migration process, it was understood that there may be failures in migrating the database due to version incompatibility and that there was a distinct possibility of collection failures that would require corrections after migration, if any data sources were not properly mapped.

All graphs needed to be reviewed and optimized to take advantage of the new widget models and improvements in Zabbix 7.0. Due to the changes in data sources (and because of the migration to a new operating system and a new version of the Zabbix Server) there was potential version incompatibility.

Directories containing custom scripts and images were mapped and files were copied in order to ensure integrity, and the TO HOST team was prepared for possible service interruptions during the upgrade process, standing ready to notify users about the planned maintenance and creating procedures to minimize the impact.

The solution:

Step one was to make sure that the change to Zabbix 7.0 was appropriately planned. A change schedule was created, and all relevant stakeholders were notified of the operation. A virtualized environment was then set up on Oracle 9, in order to guarantee a clean installation.

Once that was done, Zabbix 7.0 was installed, keeping in mind that the imported database could not exist on the new server. Next up was a full backup and the cloning of the database for integrity validation pre-migration. At this point, the TO HOST team stopped the data collection service, started the backup, and started the restore.

From that point, it became a simple matter of carrying out automated database versioning and data source mapping corrections. The data mapping during the Zabbix 7.0 migration involved updating the database structure to meet the new version’s requirements, such as changes to MySQL instances, fields, and storage formats.

Data mapping in the Zabbix migration process involved the following:

  • Database Version: During migration, the database structure changed to align with the requirements of Zabbix 7.0. This included different versioning of MySQL instances, as well as modifications to fields, tables, and storage formats within the database.
  • Import and Update Process: The legacy database (version 6) was exported and then imported into the new Zabbix 7.0 installation. During the process, Zabbix ran automatic update scripts to convert the old database into the new format.
  • Data Sources: Each item monitored in Zabbix was associated with a unique key (item key) that defined how data was collected and processed. No changes were identified in this process.
  • Tools and Validations: Mapping validation was performed during the import/restore process, where error logs indicated inconsistencies. During testing, inconsistencies were found in the validation, requiring a command to update the keys replicated on the migration.

Data collection services were then restarted, and all stakeholders were notified of the completion of the change.

The results:

Zabbix 7.0’s new dashboards and improved visual configuration have increased the satisfaction of internal customers, while having a tangible impact on operational efficiency and customer satisfaction.

The implementation and management of Zabbix 7.0 has enhanced the continuous visibility and integrity of TO HOST’s IT systems, enabling real-time monitoring and alerting, facilitating proactive issue resolution, and guaranteeing optimal infrastructure performance.

Many users have noted that the asynchronous polling method used in Zabbix 7.0 significantly reduces the time taken for metric collection. This allows for faster incident detection and resolution in TO HOST’s critical environment, while the addition of multi-factor authentication and improved access controls has helped to enhance security in monitoring environments and keep cyber threats at bay.

TO HOST’s future plans include exploring advanced Zabbix 7.0 features and continuous performance monitoring. A roadmap is already in place to leverage the additional automation and security enhancements that Zabbix 7.0 can provide.

The post Migration to Zabbix 7.0 appeared first on Zabbix Blog.

Solving Log Monitoring Challenges at SEB Bank

Post Syndicated from Giedrius Stasiulionis original https://blog.zabbix.com/solving-log-monitoring-challenges-at-seb-bank/29153/

SEB Bank is a major financial services group based in Stockholm, Sweden. It serves northern Europe, particularly the Nordic and Baltic regions. Known for its digital innovation and commitment to sustainability, SEB offers banking, investment, and financial advisory services to individuals, businesses, and institutions, focusing on long-term relationships and financial stability. This case study, which shows how Zabbix helped SEB solve its log monitoring challenges, discusses aspects specific to SEB’s operations in the Baltics, where distinct systems and structures are in place but are aligned with the group’s overall approach.

The challenge

Between 2016 and 2020, SEB launched a unified IT platform for all three Baltic countries. They encountered a wide variety of challenges, including a distinct need to unify the monitoring area. Different countries had different tools and different attitudes regarding the way monitoring should operate. After numerous discussions and weighing the pros and cons of different monitoring tools, SEB concluded that the most effective way to achieve unification would be to (re)implement everything necessary with Zabbix.

It turned out that a great deal of valuable data for monitoring resides in logs. The logs varied in update frequency and structure, as did the requirements for data extraction. Some monitoring items were simple regex patterns to count matching entities or catch errors, while others had more complex logic, such as joining multiple lines for evaluation or dynamically detecting specific patterns to observe.

At the start of SEB’s journey with Zabbix, they were using version 3.0, which came with some now long-forgotten limitations:

  • No log.count[*] item yet
  • No PCRE regular expressions – only ERE was available
  • Very limited dashboard and visualization capabilities

The solution

To address all the log-related challenges, SEB chose to leverage Zabbix’s “UserParameter” capabilities. This feature is invaluable for extending Zabbix functionality.

log.discovery

This custom approach relies on the ability to effectively convert regex capturing groups into LLD (Low-Level Discovery) objects. When new elements that need monitoring appear in the logs, corresponding monitoring objects can be automatically created in Zabbix. This process was covered in more detail at Zabbix Summit 2023.

For instance, an effective set of metrics is extracted from logs to monitor the SEB mobile app. Request processing durations are logged alongside other parameters, enabling efficient grouping, such as by endpoint name and HTTP status code. This approach accommodates a wide range of potential combinations for “endpoint + HTTP status code”:

[root@linux ~]# ./log_discovery.sh "${my_log}" 1000000 COMPONENT "response\":.\"status\":(\d{3}).*uriPattern\":\"([^ ]+)\",.timing" | jq '.' | grep -c COMPONENT_1
205
[root@linux ~]#
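As a rough illustration of the idea (not SEB's actual log_discovery.sh, whose source is not shown here), the following Python sketch turns regex capturing groups into LLD objects. The sample log lines, the simplified pattern, and the {#COMPONENT_n} macro naming are assumptions inferred from the command output above.

```python
import json
import re


def discover(lines, pattern, prefix="COMPONENT"):
    """Turn each unique tuple of regex capture groups into one LLD object."""
    regex = re.compile(pattern)
    seen = set()
    for line in lines:
        match = regex.search(line)
        if match:
            seen.add(match.groups())
    # One JSON object per unique combination; keys are LLD macros
    # such as {#COMPONENT_1}, {#COMPONENT_2}, ...
    return [
        {f"{{#{prefix}_{i}}}": value for i, value in enumerate(groups, start=1)}
        for groups in sorted(seen)
    ]


# Hypothetical log lines; the real log format is only partly visible above.
sample = [
    '{"response": {"status":200}, "uriPattern":"/accounts", "timing": 12}',
    '{"response": {"status":500}, "uriPattern":"/accounts", "timing": 950}',
    '{"response": {"status":200}, "uriPattern":"/accounts", "timing": 15}',
]

lld = discover(sample, r'"status":(\d{3}).*"uriPattern":"([^"]+)"')
print(json.dumps(lld, indent=2))
```

Each unique "status code + endpoint" pair yields exactly one discovery object, so repeated log lines do not create duplicate items.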

LLD is able to gather them all:

For each discovered couple, monitoring of request processing durations is added, both for individual durations and 1 minute averages:

Certain significant combinations are enhanced with triggers, efficiently managed using the “Override” section in the LLD configuration to ensure they are created only for specific cases. So with this approach, some unexpected slowness can be nicely caught:

log.reader

For complex data collection scenarios, there was a need to implement a solution that allows data to be extracted from logs with minimal limitations. The approach was to create a log reading mechanism that could support any required data extraction logic on top of it. This was covered in more detail at Zabbix Summit 2024.

Zabbix agent 2

In addition to the custom log processing techniques described above, SEB had a good reason to use “Zabbix agent 2”. Both log[*] and log.count[*] are of the “Active” item type, and the classic Zabbix agent does not process these items in parallel. In places with a large number of log-based items, “Zabbix agent 2” was used because it supports the concurrent processing of active checks.

The results

The ability to use LLD on logs was a game-changer and a lifesaver for SEB. Imagine hundreds of different items discovered from a single rule, along with the requirement to monitor any new entity matching a specific pattern as soon as it appears. Without LLD, meeting such a requirement would have been simply impossible. This approach covers many different areas, including mission-critical metrics such as counts of various requests and processing durations.

The ability to slice the logs themselves and build any needed logic on top makes almost any custom log monitoring requirement possible. It allows data to be analyzed in ways that wouldn’t be possible otherwise (e.g. average duration monitoring for a large set of data).

In conclusion

SEB Bank in the Baltics relies heavily on data collection from logs. Zabbix is flexible enough to meet most of their needs when it comes to log monitoring, and – most importantly – it allows for custom implementations where required. This flexibility is highly appreciated, as it removes many barriers when monitoring the various components of SEB’s IT ecosystem and business functions.

The post Solving Log Monitoring Challenges at SEB Bank appeared first on Zabbix Blog.

Accenture Expedites Infrastructure Deployment with Amazon Q Developer

Post Syndicated from Vikas Purohit original https://aws.amazon.com/blogs/devops/accenture-expedites-infrastructure-deployment-with-amazon-q-developer/

By Priya Mallya, Managing Director – Accenture; Sandeep Singh Bhatia, Sr Manager – Accenture; and Vikas Purohit, Sr. Solutions Architect – AWS

Setting up and managing flexible, efficient infrastructure in-house can be painful. Manually authoring Infrastructure as Code (IaC) templates is error prone and time consuming. However, the adoption of generative AI coding tools is changing how infrastructure engineers carry out their day-to-day activities. For Accenture, using Amazon Q Developer to create IaC templates for one of its US-based customers became a game changer.

We will discuss how Accenture used Amazon Q Developer to boost the productivity of their infrastructure team. They were responsible for deploying an Amazon Web Services (AWS) Control Tower based landing zone, central networking and security, and centralized service deployment using AWS Service Catalog for a large US-based financial client.

The Challenge

Accenture was working with a US-based financial client that was not using AWS and approached Accenture to support a green field deployment. They wanted help with:

  • AWS best practices for a multi-account strategy
  • Centralized logging
  • Networking and security
  • Setting up an infrastructure catalog managed by a central team
  • Distributing the catalog of newly managed services to their lines of business (LOBs)

The team involved decided on HashiCorp’s Terraform as the IaC language for this project.

The customer’s critical business needs drove a short time frame for the project. The customer wanted to build their infrastructure right from the outset, but manually creating Terraform scripts is typically time intensive. Implementing infrastructure as code (IaC) is considered a best practice, because manual “click-ops” are error prone.

Solution Overview

To follow IaC deployment best practices while meeting the customer’s delivery timelines, Accenture decided to explore Amazon Q Developer as a way to reduce the time spent writing Terraform IaC files.

Amazon Q Developer helps developers and IT professionals (IT pros) with all of their tasks across the software development lifecycle—from coding, testing, and upgrading, to troubleshooting, performing security scanning and fixes, optimizing AWS resources, and creating data engineering pipelines. It can help Terraform practitioners to focus on creating an end-to-end workflow. Amazon Q Developer features an open-source reference tracker and built-in security scans that are available while writing IaC code using Terraform. In order to generate high-quality code suggestions, HashiCorp and the Amazon Q Developer team worked together to ensure the generated code recommendations met the requirements of the Terraform practitioners.

The Accenture team created a proof of concept (PoC) to evaluate the accuracy of the recommendations generated by Amazon Q Developer. After the PoC delivered good-quality code in a short time frame, and after obtaining the customer’s approval, the Accenture team started writing Terraform IaC artifacts using Amazon Q Developer.

Amazon Q Developer was used to generate Terraform IaC artifacts for more than 50 AWS services as part of the project including AWS Control Tower, central networking using AWS Transit Gateway, AWS Network Firewall and Amazon Route 53 hosted zones. AWS Service Catalog products were created to manage central cataloging and deployment of products and services approved by the customer’s IT team.

Benefits observed by Accenture:

  • Using Amazon Q Developer accelerated the writing of the Terraform code by 30%.
  • Generated Terraform code had an accuracy of close to 99%, avoiding frequent context switching to reference the HashiCorp site to get the correct resource definition.
  • Amazon Q Developer gave the Accenture team a conversational interface for getting questions about AWS services answered quickly, further speeding up the development process.
  • Accenture also used Amazon Q Developer to identify and fix potential errors and edge cases.

Best practices followed during the project:

  • Temporarily writing variables in the local file gave Amazon Q Developer access to all the necessary information.
  • AWS recommends a “human in the loop” approach, where a team member checks the code after it’s been generated. This code review process is a best practice regardless of which person or system created the code. This way, they were able to quickly catch and fix issues in the few instances they arose.

Conclusion

Amazon Q Developer helps developers and IT professionals (IT pros) with all of their tasks, from coding, testing, and upgrading, to troubleshooting, performing security scanning and fixes, optimizing AWS resources, and creating data engineering pipelines. We highlighted how Accenture used Amazon Q Developer to generate coding recommendations for Terraform, HashiCorp’s IaC language, to increase productivity and reduce the time needed to write complex Terraform code.

You can start using Amazon Q Developer in your IDE today to automatically build entire application features, find and fix security vulnerabilities and more. Visit Amazon Q Developer to get started.

Check out more AWS Partners or contact an AWS Representative to know how we can help accelerate your business.

Further Reading

About Accenture

Accenture is an AWS Premier Tier Services Partner and MSP that provides end-to-end solutions to migrate to and manage operations on AWS. By working with the Accenture AWS Business Group (AABG), a strategic collaboration by Accenture and AWS, organizations can accelerate the pace of innovation to deliver disruptive products and services.

 

How a Custom Zabbix Solution Maximized Efficiency for an MSP

Post Syndicated from Kristy Slimmer original https://blog.zabbix.com/how-a-custom-zabbix-solution-maximized-efficiency-for-an-msp/28810/

Discover how our partners at ATS Group designed and implemented a custom Zabbix solution that allowed a large managed service provider (MSP) to monitor and manage a vast array of client devices across multiple data centers.

The Challenge: Addressing Infrastructure Monitoring Complexities

When a federal government contractor specializing in IT managed services secured a contract to manage the infrastructure for a large federal agency, they faced a daunting challenge: how to effectively monitor and manage the vast array of devices under their purview using a single, comprehensive solution.

Real-time monitoring and immediate alerts for any issues were non-negotiable requirements. The sheer scale and complexity of the infrastructure demanded a robust monitoring system capable of providing insights across multiple data centers and diverse technologies.

The MSP, aware of the need for a trusted and experienced partner, turned to ATS Group to tackle the complexities of its observability and management challenge.

The Solution: Architecting a Custom Zabbix Solution

ATS Group, North America’s exclusive Zabbix Premium Partner, brings over two decades of experience in monitoring and optimizing enterprise IT environments. ATS Group architected and implemented a custom solution that leveraged Zabbix’s flexibility and scalability, demonstrating their deep knowledge of the technology and ability to handle complex challenges.

The ATS team deployed an on-premise Zabbix Server, accompanied by Zabbix Proxy Servers placed in each data center. This distributed architecture was a key factor in ensuring seamless monitoring across geographically dispersed environments while minimizing latency, a critical factor in managing such a vast infrastructure.

Custom Zabbix Solutions delivered by the ATS Group

From there, ATS implemented various Zabbix customizations that were integral to meeting the agency’s unique and diverse infrastructure needs, including developing templates and integrations.

Templates. ATS developed numerous templates covering a broad spectrum of technologies (including OpenShift, VMware, Dell, HP, Cisco UCS, Hitachi, NetApp, Pure, Brocade, Commvault, Linux, and Windows) to provide comprehensive monitoring capabilities tailored to the specifics of each component, ensuring a detailed view of the entire infrastructure stack.

Integrations. ATS built customized integrations for several third-party products. An integration with OpenShift allowed for alerts configured within OpenShift to be directly ingested and processed by Zabbix. The integration with VMware allowed Zabbix to detect when an administrator put a host in maintenance in VMware, automatically creating a maintenance period for that host within Zabbix to eliminate unwanted alerts while the host was being serviced.

Finally, integrations with ServiceNow and Operations Bridge Manager (OBM) enabled streamlined incident management workflows, ensuring that issues were promptly detected, triaged, and addressed with minimal manual intervention – and the proper stakeholders were notified within the customer and service provider organizations.

Trigger Actions. ATS implemented custom trigger actions to automate responses to predefined events. Whether restarting a service upon failure or executing remediation scripts, these trigger actions helped maintain system stability, minimize downtime, and reduce the workload (and callouts in the middle of the night!) for system administrators.

Dashboards. ATS designed custom dashboards to provide stakeholders with intuitive, real-time insights into the infrastructure’s health and performance. These dashboards served as a centralized hub, offering a comprehensive view of the entire environment with actionable insights to drive informed decision-making.

The Results

A custom Zabbix solution delivers visibility, streamlined monitoring, proactive management, and enhanced client satisfaction. The impact of the custom Zabbix solution was immediate and profound. By leveraging the power of Zabbix and the expert skill of the ATS team, the MSP gained unprecedented visibility and control over their client’s sprawling infrastructure. The benefits included:

Greater Operational Efficiency. With a unified view of the entire infrastructure and real-time alerts for any issues, our client experienced a significant improvement in operational efficiency. Proactive management and automated responses minimized downtime, allowing resources to be allocated more strategically.

Faster Incident Response. Issues were detected instantaneously, and relevant stakeholders were promptly alerted, enabling swift resolution and minimizing the impact on operations. This streamlined incident response mechanism reduced mean time to resolution (MTTR) and enhanced overall system reliability.

Increased Revenue. Delighted by the efficiency and effectiveness of our client’s management and monitoring capabilities, the end-user federal agency recognized the value of their partnership and expanded the scope of the contract.

This testament to our client’s success underscores the transformative impact of our solution, paving the way for further collaboration and growth opportunities. As a result, ATS Group and the managed services provider continue to expand their partnership and are solving complex infrastructure problems for numerous additional clients.

The post How a Custom Zabbix Solution Maximized Efficiency for an MSP appeared first on Zabbix Blog.

How Amazon GTTS runs large-scale ETL jobs on AWS using Amazon MWAA

Post Syndicated from Louis Hourcade original https://aws.amazon.com/blogs/big-data/how-amazon-gtts-runs-large-scale-etl-jobs-on-aws-using-amazon-mwaa/

The Amazon Global Transportation Technology Services (GTTS) team owns a set of products called INSITE (Insights Into Transportation Everywhere). These products are user-facing applications that solve specific business problems across different transportation domains: network topology management, capacity management, and network monitoring. As of this writing, GTTS serves around 10,000 customers globally on a monthly basis, managing the outbound transportation network.

INSITE applications are in general data intensive. They ingest and transform large volumes of data in different formats and processing patterns (such as batch and near real time) from various sources internal and external to Amazon. Datasets are often shared between applications both within domains and across domains, and are consumed in complex data pipelines that run under tight SLAs. To enable and meet these requirements, GTTS built its own data platform.

A critical component of the data platform is the data pipeline orchestrator. GTTS built its own orchestrator named Langley in 2018, and used it to schedule and monitor extract, transform, and load (ETL) jobs on a variety of compute platforms, such as Amazon EMR, Amazon Redshift, and Amazon Relational Database Service (Amazon RDS).

As the Langley user base grew, GTTS engineers faced challenges on several key dimensions, such as maintainability, scalability, multi-tenancy, observability, and interoperability.

Amazon GTTS partnered with AWS Professional Services to modernize their orchestration platform, relying as much as possible on managed services with auto scaling capabilities. After analyzing candidate solutions, the team decided to build a target solution relying on Amazon Managed Workflows for Apache Airflow (Amazon MWAA). This post elaborates on the drivers of the migration and its achieved benefits.

Legacy platform

Amazon GTTS works with diverse and distributed data stores, storing petabytes of data. Data engineers need a tool to define ETL jobs which run on various compute environments, as illustrated in the following diagram.

Amazon GTTS orchestration platform - high-level diagram

GTTS built Langley as their custom orchestrator in 2018, and have been operating it ever since. At a high level, the core of Langley’s architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata. It also uses AWS Data Pipeline to run SQL-based workloads, Amazon Simple Storage Service (Amazon S3) to store configuration files, and Amazon CloudWatch for alarming on failures. Every day, Langley handles the lifecycle of more than 17,000 ETL jobs in Europe and 5,000 ETL jobs in North America.

The following diagram illustrates the Langley architecture.

Langley architecture diagram

Business challenges

Langley started as a simple solution to a team-internal problem, but its growth over the years surfaced key issues:

  • The maintenance of this custom solution requires considerable time from engineers, which increased over the years with the release of new features, increasing the overall complexity.
  • The Langley user base grew continuously and eventually became a key orchestration platform for multiple teams and products across Amazon. However, it wasn’t created with multi-tenancy in mind and therefore it didn’t provide the robustness and the appropriate level of isolation to guard each tenant from impacting others on the shared platform.
  • In 2023, AWS announced the upcoming deprecation of Data Pipeline, one of the core services used by Langley.

GTTS partnered with AWS to design and implement a solution to overcome those challenges. AWS used the following evaluation matrix to build a durable solution:

  • Maintainability – The level of effort required to maintain the orchestrating system in a functional state, encompassing updates, patches, bug fixes, and routine checks for optimal performance.
  • Costs – The overall expenditure associated with the orchestrator, including infrastructure costs, licensing fees, personnel expenses, and other relevant costs. This criterion particularly assesses the system’s ability to effectively control and reduce costs.
  • Scheduling – The capabilities related to running and scheduling jobs, including the ability to resume an ETL job from a failed step.
  • User experience – The overall satisfaction and usability of a system from the end-users’ perspective, considering factors such as responsiveness, accessibility, interoperability, and ease of use.
  • Security – Mechanisms in place to safeguard data and applications from unauthorized access at all times.
  • Monitoring and alerting – The continuous observation and analysis of system components and performance metrics to detect and address issues, optimize resource usage, and provide overall health and reliability.
  • Scalability – The orchestrator’s capacity to efficiently adapt its resources to handle increased workload or demand, providing sustained performance.

Among the explored solutions, Amazon MWAA emerged as the best overall performer across this matrix.

The next section dives deep into the rationale that led GTTS and AWS Professional Services to choose Amazon MWAA.

Benefits of migrating to Amazon MWAA

Amazon GTTS and AWS Professional Services worked together to release a Minimum Viable Product (MVP) of the solution described earlier, which showcases the benefits on the agreed decision criteria.

Maintainability

With their legacy system, Amazon GTTS had to manage the orchestrator database, web servers, activity queue, dispatch functions, and worker nodes.

Amazon MWAA eliminates the need for underlying infrastructure management. It takes care of provisioning and maintenance of the Apache Airflow web server, scheduler, worker nodes, and relational database, allowing GTTS teams to focus on building their ETL jobs.

Amazon MWAA offers one-click updates of the infrastructure for minor versions, like moving from Airflow version x.4.z to x.5.z. During the upgrade process, Amazon MWAA captures a snapshot of your environment metadata; upgrades the workers, schedulers, and web server to the new Airflow version; and finally restores the metadata database using the snapshot, backing it with an automated rollback mechanism.

Costs

Amazon MWAA contributes to a more cost-effective solution by automatically scaling workers depending on the workload. This dynamic scaling in and out avoids over-provisioning and allows the organization to pay for the compute they actually use, without the risk of downtime during activity spikes. Because this is an AWS-managed solution, it also reduced GTTS’s Total Cost of Ownership (TCO) by freeing up time from engineers that were managing the legacy system.

Scheduling

Amazon MWAA supports all the trigger mechanisms that the Amazon orchestrator needed:

  • Manual trigger – Users can invoke a Directed Acyclic Graph (DAG) using the Airflow API or, even more simply, through the user interface (UI).
  • Scheduler – A scheduler can be defined as code, together with the DAG definition, to make sure it will run at specific rates (from hourly to yearly) or on specific cron schedules.
  • Event-driven trigger – Airflow provides native operators that enable invoking a downstream DAG from another DAG or from a dataset update (push approach). It also includes sensors that listen for the completion of a task external to the DAG (pull approach).
  • Partial runs on DAG failures – Another key feature for GTTS was the ability to recover from partial DAG failures without having to rerun the whole DAG. Airflow provides task-level controls that make this operation straightforward to implement.
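To make the partial-run idea concrete, here is a small plain-Python sketch of the selection logic: rerun the failed tasks and everything downstream of them, and leave successful upstream work alone. It illustrates the concept only and is not Airflow’s implementation; the task names and dependency map are hypothetical, while the state names mirror Airflow’s task states.

```python
from collections import deque


def tasks_to_rerun(downstream, states):
    """Return failed tasks plus everything downstream of them."""
    queue = deque(task for task, state in states.items() if state == "failed")
    rerun = set(queue)
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in rerun:
                rerun.add(child)
                queue.append(child)
    return rerun


# Hypothetical ETL DAG: extract -> transform -> (load, report)
downstream = {
    "extract": ["transform"],
    "transform": ["load", "report"],
    "load": [],
    "report": [],
}
states = {
    "extract": "success",
    "transform": "failed",
    "load": "upstream_failed",
    "report": "upstream_failed",
}

print(sorted(tasks_to_rerun(downstream, states)))
```

In this example the successful extract step is not repeated, which is the behavior GTTS was looking for when resuming an ETL job from a failed step.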

User experience

In this section, we discuss three aspects of the user experience: the web UI, the interoperability, and the programming interface.

Web UI

Amazon MWAA comes with a managed web server that hosts the Airflow UI. As a result, and without any maintenance needed, you can use it to quickly run DAGs, check run history, visualize dependencies between DAGs, troubleshoot with a direct access to task logs, manage variables and database connections, and define granular permissions. The following screenshot shows an example of the UI.

Amazon MWAA User Interface - console screenshot

Interoperability

One of the most important features evaluated was the ability of the new orchestrator to integrate effortlessly with GTTS’s multiple data storage services, compute components, and monitoring services.

Amazon MWAA comes with a wide variety of providers preinstalled, such as apache-airflow-providers-amazon, apache-airflow-providers-postgres, and apache-airflow-providers-common-sql. This allowed GTTS to connect with those services using multiple connection methodologies, including AWS IAM Identity Center or AWS Secrets Manager password-based authentications, without having to write a single custom Airflow operator.

Amazon MWAA also makes it straightforward to upgrade provider versions and install new ones. By providing a requirements.txt file, GTTS was able to change the major version of apache-airflow-providers-amazon and install the apache-airflow-providers-mysql provider.

Programming interface

Airflow is an orchestrator with a low barrier to entry, especially for those familiar with the Python programming language. Its workflow management is defined in Python scripts, with a well-documented set of native operators and external providers, making it straightforward for Python developers to get started with Airflow and create complex data pipelines.

The following are two key Airflow features:

  • TaskFlow API – The TaskFlow API uses Python decorators to remove much of the boilerplate code required by traditional operators, which simplifies DAG editing and results in cleaner, more concise DAG files.
  • Dynamic DAG generation – The dynamic DAG generation capability allowed us to generate DAGs from the original legacy orchestrator’s configuration files. This enabled the platform team to build a centralized framework consumed by multiple teams to keep the code DRY (Don’t Repeat Yourself), providing a seamless migration journey from the legacy orchestrator.

The following screenshot shows an example of these features.

Airflow dynamic DAG definition - code sample
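A minimal sketch of how the two features can combine, assuming Airflow 2.4 or later (for the schedule argument). The legacy configuration format, job names, and DAG IDs are hypothetical, because the real Langley configuration files are not shown in the post. The guarded import is only there so that the configuration-parsing part can be exercised outside an Airflow environment.

```python
import json
from datetime import datetime

try:
    from airflow.decorators import dag, task
except ImportError:  # allows running the parsing logic without Airflow installed
    dag = task = None

# Hypothetical legacy job configuration (not the real Langley format).
LEGACY_CONFIG = json.loads("""
[
  {"job": "daily_capacity", "schedule": "@daily",
   "steps": ["extract", "transform", "load"]},
  {"job": "hourly_topology", "schedule": "@hourly",
   "steps": ["extract", "load"]}
]
""")


def to_dag_specs(config):
    """Normalize legacy job entries into DAG specifications."""
    return [
        {
            "dag_id": f"legacy_{entry['job']}",
            "schedule": entry["schedule"],
            "steps": list(entry["steps"]),
        }
        for entry in config
    ]


def build_dag(spec):
    """Create one Airflow DAG whose tasks run the legacy steps in order."""

    @dag(dag_id=spec["dag_id"], schedule=spec["schedule"],
         start_date=datetime(2024, 1, 1), catchup=False)
    def generated():
        previous = None
        for step in spec["steps"]:
            # TaskFlow: a decorated Python function becomes an Airflow task.
            @task(task_id=step)
            def run_step(name=step):
                print(f"running {name}")

            current = run_step()
            if previous is not None:
                previous >> current  # chain the steps sequentially
            previous = current

    return generated()


DAG_SPECS = to_dag_specs(LEGACY_CONFIG)

if dag is not None:
    # Expose each generated DAG at module level so the scheduler can find it.
    for spec in DAG_SPECS:
        globals()[spec["dag_id"]] = build_dag(spec)
```

Keeping the translation in one shared module is one way to stay DRY: teams only supply configuration entries, and the framework produces the DAGs.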

Security

The new Amazon MWAA-based architecture improves GTTS’s security posture by introducing granular access control. Amazon MWAA integrates with AWS services such as AWS Key Management Service (AWS KMS), Secrets Manager, and IAM Identity Center to keep data safely encrypted at all times, both at rest and in transit using TLS-based communications. Airflow also includes a role-based access control (RBAC) model to determine what users can do on the platform and enforce the principle of least privilege. Amazon MWAA also natively integrates with AWS CloudTrail for auditing purposes.

The Airflow RBAC model enables administrators to define roles with specific privileges to access Airflow system settings and DAGs themselves. This granular access control reduces the risk of data breaches and malicious activities by limiting access to critical DAGs and sensitive Airflow environment variables. Airflow includes five default roles with different sets of permissions (as shown in the following screenshot), but it is possible to create new roles depending on your security requirements.

Airflow roles - console screenshot

GTTS used the Airflow RBAC model to restrict permissions of certain teams and consumers of the application. They also used priority weights and Airflow pools to prioritize tasks and control run concurrency. However, if you want to run a multi-tenant orchestration platform, it’s recommended to use a separate environment for each team. You can assume that everything accessible by the Amazon MWAA role is also accessible to users who can write DAGs to the environment.

To ease authentication in Amazon MWAA, GTTS federated their identity provider (IdP) through Amazon Cognito and SAML. With this integration, users log in to the Amazon MWAA UI using the same identity as in other internal systems, which removes the need for new credentials. The user’s group membership is retrieved from the IdP through Amazon Cognito, and a Lambda function redirects the user to Amazon MWAA with the appropriate Airflow role. This process is illustrated in the following architecture, and is abstracted from the user and attached to a public Application Load Balancer that redirects at the end of the process to an Amazon MWAA private cluster, making the authentication workflow seamless and secure. Refer to Accessing a private Amazon MWAA environment using federated identities to implement it using your own IdP.

Amazon MWAA federation - architecture diagram
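The role-selection step of such a flow might look like the sketch below. The group names and the group-to-role mapping are hypothetical, and the rest of the Lambda function (token handling and the redirect to Amazon MWAA) is deliberately left out. The role names are Airflow’s five default roles.

```python
# Airflow's five default roles, ordered from least to most privileged.
ROLE_ORDER = ["Public", "Viewer", "User", "Op", "Admin"]

# Hypothetical mapping from IdP group names to Airflow roles.
GROUP_TO_ROLE = {
    "gtts-platform-admins": "Admin",
    "gtts-data-engineers": "User",
    "gtts-analysts": "Viewer",
}


def resolve_airflow_role(groups, default="Public"):
    """Pick the most privileged role granted by any of the user's groups.

    Users with no recognized group fall back to the least privileged role.
    """
    roles = [GROUP_TO_ROLE[group] for group in groups if group in GROUP_TO_ROLE]
    if not roles:
        return default
    return max(roles, key=ROLE_ORDER.index)


print(resolve_airflow_role(["gtts-analysts", "gtts-data-engineers"]))
print(resolve_airflow_role(["some-other-team"]))
```

Defaulting to the least privileged role keeps the mapping aligned with the principle of least privilege when a user’s groups are not recognized.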

Monitoring and alerting

Amazon MWAA integrates with CloudWatch, which manages all infrastructure logs for you. When creating an Amazon MWAA environment, you can configure what level of logs should be saved. GTTS enabled CloudWatch logging for all of the five types of components: Airflow task logs, Airflow web server logs, Airflow scheduler logs, Airflow worker logs, and Airflow DAG processing logs.

Amazon MWAA logging configuration - console screenshot
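For teams that configure this programmatically rather than in the console, the following sketch builds a logging configuration covering all five components. The key names follow the LoggingConfiguration structure of the Amazon MWAA API as we understand it; verify them against the current API reference before use. The helper function itself is hypothetical.

```python
LOG_COMPONENTS = [
    "DagProcessingLogs",
    "SchedulerLogs",
    "TaskLogs",
    "WebserverLogs",
    "WorkerLogs",
]
VALID_LEVELS = {"CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"}


def logging_configuration(level="INFO"):
    """Enable CloudWatch logging for every component at the given level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported log level: {level}")
    return {name: {"Enabled": True, "LogLevel": level} for name in LOG_COMPONENTS}


config = logging_configuration("INFO")
print(config["TaskLogs"])
```

The resulting dictionary is shaped so that it could be passed as the LoggingConfiguration argument when creating or updating an environment through an AWS SDK.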

These logs are all accessible in CloudWatch for continuous monitoring, but Amazon MWAA users can also access task logs directly from the Airflow UI by looking at the DAG run history. The following screenshot shows an example of task-level logs in Airflow 2.5.1.

Amazon MWAA task-level logs - console screenshot

You can also build CloudWatch monitoring dashboards to keep an eye on the state of your environment and alert administrators when required. Amazon MWAA natively provides Airflow environment metrics and Amazon MWAA infrastructure-related metrics.

Scalability

Each Amazon MWAA environment includes schedulers, a web server, and worker nodes. Scheduler nodes are responsible for the overall orchestration and parsing of DAG files. The tasks themselves run on worker nodes, which Amazon MWAA automatically scales up and down according to system load. When creating a new Amazon MWAA environment, you need to specify the type of worker nodes, the minimum and maximum number of worker nodes, and the scheduler count, as shown in the following screenshot.

Amazon MWAA environment classes - console screenshot

There are notably two ways GTTS controlled how Amazon MWAA scales to handle the load:

  • Minimum and maximum worker count – Amazon MWAA automatically adds or deletes workers within the boundaries you set, depending on the number of tasks that are waiting to be processed. As indicated in the AWS documentation, it is possible to request a quota increase to run up to 50 workers in a single environment.
  • Size of the node – Larger worker nodes can run more concurrent tasks. For example, mw1.small instances run 5 concurrent tasks by default, whereas mw1.large instances run 20 concurrent tasks by default. The following figure shows the specification for each instance type.

Amazon MWAA environment sizes - console screenshot

With Amazon MWAA, GTTS can therefore run up to 4,000 concurrent tasks in a single Amazon MWAA environment (50 worker nodes x 80 tasks per node with mw1.2xlarge). This figure is only an order of magnitude, because the real limit depends on the load that fits into the workers’ vCPUs and RAM, but it is possible to edit the default configuration to add even more tasks per worker. For more information regarding Amazon MWAA automatic scaling, see Configuring Amazon MWAA automatic scaling.
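The capacity arithmetic can be written down as a small helper. The per-worker defaults below are only the ones quoted in this post; the function name is hypothetical, and actual throughput still depends on what each task needs in CPU and memory.

```python
# Default concurrent tasks per worker, as quoted in this post.
DEFAULT_TASKS_PER_WORKER = {
    "mw1.small": 5,
    "mw1.large": 20,
    "mw1.2xlarge": 80,
}


def max_concurrent_tasks(environment_class, max_workers):
    """Upper bound on concurrent tasks: workers x default tasks per worker."""
    return DEFAULT_TASKS_PER_WORKER[environment_class] * max_workers


print(max_concurrent_tasks("mw1.2xlarge", 50))  # 50 x 80 = 4000
```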

The Amazon MWAA based orchestration platform

After selecting Amazon MWAA as the core service for their orchestrating system, Amazon GTTS and AWS worked together to develop an end-to-end data platform with automation capabilities, access management, monitoring, and integration with downstream systems. The following diagram illustrates the solution architecture.

MWAA-based platform - architecture diagram

The following are notable components of the architecture:

  1. DAG update – GTTS developers manage the creation, update, and deletion of Amazon MWAA DAGs through a dedicated code repository. When a developer edits DAG definitions and commits changes to the code repository, a CI/CD pipeline automatically packages the DAG definition and stores it in Amazon S3, which automatically updates DAGs in Amazon MWAA.
  2. Infrastructure as code – The entire stack is defined as IaC with the AWS CDK, which eases the process of updating components, and makes it repeatable if GTTS wants to extend the solution and redeploy the stack in multiple AWS Regions.
  3. Authentication, authorization, and permissions – Permissions are centrally managed with AWS Identity and Access Management (IAM) together with Airflow roles. GTTS integrated their identity provider with Amazon Cognito and Amazon MWAA, so Amazon employees can connect to the Amazon MWAA UI with the same authentication tool they are used to, and see only the DAGs they are allowed to access.
  4. UI and DAG runs – Amazon MWAA includes an AWS-managed web server that exposes the Airflow UI. Amazon employees can connect to this UI to list DAGs, run DAGs, and track their status. In addition, GTTS used the native Amazon MWAA scheduler to automatically invoke DAGs at a specific time.
  5. Airflow workers – Users can use native Airflow providers to run custom shell or Python code directly on the worker nodes. For compute-intensive jobs, the Amazon MWAA worker can delegate the compute to a more suitable AWS service, such as Apache Spark running on Amazon EMR on Amazon EKS, which provides compute resources only for the duration of the job and helps optimize costs.
  6. Data stores and external compute services – Amazon MWAA also comes with the AWS provider preloaded, which provides connectivity to more than 23 AWS compute and data services. GTTS can extend the connectivity to other AWS or external services by using Boto3 with the PythonOperator or by creating dedicated custom operators.
  7. Logging and alerting – Amazon MWAA is integrated with CloudWatch and CloudTrail to publish DAG logs, audit logs, and metrics. This enables GTTS to track completion, troubleshoot, and create an automated alerting and notification system so DAG owners can take remediation actions as quickly as possible.

Conclusion

Amazon GTTS partnered with AWS Professional Services to overcome the challenges of their legacy custom orchestrator across dimensions such as maintainability, cost efficiency, security, scalability, and observability.

The new Amazon MWAA-based architecture offers significant improvements in the context of the AWS Well-Architected Framework compared to their former system. In terms of operational excellence, the new orchestration platform is built with evolvability in mind and enables the GTTS team to use the most suitable ETL service to run their jobs. Regarding performance efficiency, GTTS observed up to 70% improvement in end-to-end runtime on their jobs running in Amazon MWAA. In terms of security, the new solution implements best practices such as deployment in private subnets, authentication of users through Amazon internal federation systems, and data encryption at rest and in transit. Reliability is achieved with Multi-AZ failover and built-in auto scaling to meet the workload demand at all times. Finally, cost is reduced because Amazon MWAA is an AWS-managed service, which decreases the human effort required from GTTS to maintain the orchestration platform.

Amazon GTTS is now bringing the MVP into production, where it is planned to handle petabytes of data and host more than 2,000 jobs migrated from the legacy system. Additionally, the migration to Amazon MWAA has empowered GTTS to enhance its operational scalability, paving the way for the integration of new jobs and further expansion with greater efficiency and confidence.


About the Authors

Béntor Bautista is a Senior Data Engineer at Amazon GTTS
Louis Hourcade is a Solutions Architect at AWS
Raphael Ducay is a Senior DataOps Architect at AWS
Konstantin Zarudaev is a DevOps Consultant at AWS
Dorra Elboukari is a DevOps Architect at AWS
Marcin Zapal is an Engagement Manager at AWS
Grigorios Pikoulas is a Strategic Program Lead at AWS
Antonio Cennamo is a Senior Customer Practice Manager at AWS

Elevating Code Quality: Real-Time Insights with Zabbix Integration and SonarQube

Post Syndicated from Benyamine Elmahir original https://blog.zabbix.com/elevating-code-quality-real-time-insights-with-zabbix-and-sonarqube/28452/

The objective of this project was to establish a robust and integrated environment for the continuous monitoring of code quality and performance metrics. To achieve this, SonarQube, an open-source platform for the continuous inspection of code quality, was installed on AlmaLinux. Following its setup, SonarQube was seamlessly integrated with Zabbix, an enterprise-class open-source distributed monitoring solution, to enable the dynamic monitoring of various projects. This integration aimed to provide our team at Zen Networks with real-time visibility into key metrics such as bugs, vulnerabilities, and code smells for ongoing projects.

Installing SonarQube on AlmaLinux

1. Pre-installation Requirements:
  • We conducted a detailed review to ensure that the server met the minimum hardware requirements for running SonarQube effectively.
  • Necessary dependencies, including Java Development Kit (JDK) and a supported database system, were installed and configured.
2. SonarQube Installation Steps:
  • The SonarQube server was downloaded from the official website.
  • Following best practices, a dedicated SonarQube user account was created for running the service.
  • The SonarQube service was configured to start on boot, ensuring high availability.
3. Configuration:
  • The sonar.properties file was meticulously edited to connect SonarQube to the chosen database, optimizing for performance and security.
  • Network settings were adjusted to allow SonarQube to run on the desired port (9000) and be accessible from the developers' workstations.
  • Additional plugins were installed to extend the functionality of SonarQube and to support the languages used in our projects.

Project Setup in SonarQube

Upon successful installation and configuration of SonarQube on the AlmaLinux server, the next phase involved setting up projects for code analysis. Five test projects were created to demonstrate the capabilities of SonarQube and serve as a baseline for quality assessment.

Creation of Test Projects:
  • We created a series of five distinct projects, namely app-java, backup-code, erp-app, test-app, and web-app, each configured within SonarQube.
  • The projects were configured to assess various aspects of code quality, including reliability, security, and maintainability.
  • We enabled the automated scanning of code to identify bugs, vulnerabilities, and code smells within each project.
Analysis and Metrics:
  • Each project underwent a thorough analysis, with results indicating varying levels of bugs and vulnerabilities alongside code smells.
  • Metrics such as coverage and duplication were configured to be monitored, though the initial test runs reflected 0.0% coverage, indicating a scope for further CI/CD integration.
  • The test-app project notably showed a substantial number of bugs and a significant code smell count, highlighting areas for immediate improvement.
Quality Gate Status:
  • All projects were set against predefined quality gates to ensure they met the organization’s standards for code quality.
  • Although some projects had bugs and code smells, all projects passed the quality gates, suggesting that the issues identified were non-critical and could be addressed on an ongoing basis.

Integration with Zabbix

The integration of SonarQube with Zabbix was aimed at leveraging Zabbix’s robust monitoring capabilities to keep a close eye on the projects’ health status in terms of code quality.

Zabbix Template Creation:

Our team built a Zabbix template dedicated to interfacing with the SonarQube API and designed to auto-discover SonarQube projects and their key metrics. The following API calls and configurations were used:

Authentication:
    • Example API call to authenticate:
    • curl -u token: "http://sonarqube_ip/api/authentication/validate"
Project Discovery:
    • Example API call to list projects:
    • curl -u token: "http://sonarqube_ip/api/projects/search"
Metrics Retrieval:
    • Example API call to get project metrics:
    • curl -u token: "http://sonarqube_ip/api/measures/component?component=project_key&metricKeys=bugs,vulnerabilities,code_smells"
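The same metrics call can be prepared from Python. The sketch below mirrors `curl -u token:`, where the token is sent as the user name with an empty password, and builds the request without sending it. The function names are illustrative, and `sonarqube_ip`, the token, and the project key are placeholders.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def basic_auth_header(token: str) -> str:
    """Build the header that `curl -u token:` sends (token as user, empty password)."""
    raw = f"{token}:".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def measures_request(base_url: str, token: str, project_key: str,
                     metric_keys: list[str]) -> Request:
    """Prepare a GET request for api/measures/component without sending it."""
    # safe="," keeps the comma-separated metric list readable in the URL.
    query = urlencode(
        {"component": project_key, "metricKeys": ",".join(metric_keys)},
        safe=",",
    )
    url = f"{base_url.rstrip('/')}/api/measures/component?{query}"
    return Request(url, headers={"Authorization": basic_auth_header(token)})


if __name__ == "__main__":
    # Placeholder host, token, and project key; nothing is sent over the network.
    req = measures_request("http://sonarqube_ip", "token", "project_key",
                           ["bugs", "vulnerabilities", "code_smells"])
    print(req.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would perform the call. The value returned by `basic_auth_header` is also what belongs in an `Authorization` header when the call is made by an HTTP agent item instead of curl.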
Zabbix Template Configuration:
    • A customized Zabbix template was created to interface with the SonarQube API. The template includes discovery rules, item prototypes, and preprocessing steps to extract relevant metrics.
    • Example of a discovery rule and item prototype in the Zabbix template:
<discovery_rule>
  <name>sonarqube_project_discovery</name>
  <type>HTTP_AGENT</type>
  <key>sonarqube.project.discovery</key>
  <delay>1h</delay>
  <lifetime>3d</lifetime>
  <item_prototypes>
    <item_prototype>
      <name>{#PROJECTNAME}: Metrics</name>
      <type>HTTP_AGENT</type>
      <key>sonarqube.project.metrics['{#PROJECTNAME}']</key>
      <delay>5m</delay>
      <url>{$PROTO}://{HOST.IP}:{$PORT}/api/measures/component?component={#PROJECTNAME}&amp;metricKeys=bugs,vulnerabilities,code_smells,ncloc,complexity,violations</url>
      <headers>
        <header>
          <name>Authorization</name>
          <value>Basic YOUR_BASE64_ENCODED_TOKEN</value>
        </header>
      </headers>
    </item_prototype>
  </item_prototypes>
</discovery_rule>
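For the discovery rule to create item prototypes, the project list has to reach Zabbix as low-level discovery rows keyed by LLD macros. The Python sketch below shows that mapping. It assumes the `api/projects/search` response carries a `components` array with `key` and `name` fields, and that a plain JSON array of macro objects is accepted as LLD input. The `{#PROJECTKEY}` macro and the sample payload are hypothetical additions for illustration. In the template itself, this step would typically be handled by preprocessing or LLD macro mapping.

```python
import json


def projects_to_lld(search_response: dict) -> list[dict]:
    """Map a projects/search payload to Zabbix low-level discovery rows."""
    return [
        {
            "{#PROJECTNAME}": component["name"],
            # Hypothetical extra macro: handy when a project key differs from its name.
            "{#PROJECTKEY}": component["key"],
        }
        for component in search_response.get("components", [])
    ]


if __name__ == "__main__":
    # Hypothetical payload that reuses two of the test project names from this post.
    sample = {
        "components": [
            {"key": "app-java", "name": "app-java"},
            {"key": "web-app", "name": "web-app"},
        ]
    }
    print(json.dumps(projects_to_lld(sample), indent=2))
```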

In addition, our team set up items within Zabbix to track the number of bugs, vulnerabilities, and code smells, as presented in the SonarQube dashboard. We also configured triggers within Zabbix to alert the team when certain thresholds were reached, facilitating prompt action to maintain code quality.
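The logic behind the items and triggers can be sketched in Python. The sketch assumes the `api/measures/component` response lists its measures under `component.measures`, with each value returned as a string. The threshold values and the sample payload are hypothetical and are not the ones used in our template.

```python
def extract_measures(component_response: dict) -> dict[str, float]:
    """Flatten the measures list into a metric -> numeric value mapping."""
    measures = component_response.get("component", {}).get("measures", [])
    return {m["metric"]: float(m["value"]) for m in measures}


def breached(values: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the metrics whose value is strictly above its threshold."""
    return sorted(
        metric for metric, limit in thresholds.items()
        if values.get(metric, 0.0) > limit
    )


if __name__ == "__main__":
    # Hypothetical response and thresholds, for illustration only.
    response = {
        "component": {
            "key": "test-app",
            "measures": [
                {"metric": "bugs", "value": "12"},
                {"metric": "vulnerabilities", "value": "0"},
                {"metric": "code_smells", "value": "40"},
            ],
        }
    }
    limits = {"bugs": 10, "vulnerabilities": 0, "code_smells": 100}
    print(breached(extract_measures(response), limits))
```

In Zabbix, the extraction step would normally be a JSONPath preprocessing step on each item, and the comparison would be a trigger expression. The Python version only makes the logic explicit.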

Automation and Dynamic Monitoring:

We enabled the dynamic discovery of projects in SonarQube, so that new projects are automatically detected and monitored by Zabbix without manual intervention. To achieve this, we implemented the following configurations:

  • SonarQube Configuration:
    • Webhooks: Configured SonarQube webhooks to notify Zabbix whenever a new project is created or updated.
    • Project Tags: Used consistent tagging for SonarQube projects to facilitate easy identification in Zabbix.
  • Zabbix Configuration:
    • Discovery Rules: Created discovery rules in Zabbix that periodically query the SonarQube API to check for new projects.
    • Low-Level Discovery (LLD): Implemented LLD in Zabbix to automate the creation of items, triggers, and graphs for each new SonarQube project.
    • We also established a data flow between SonarQube and Zabbix, ensuring that updates in the code quality metrics were reflected in real time on the Zabbix dashboard.
Validation and Testing:
      • We conducted a series of tests to ensure that the integration was functioning correctly.
      • Our team verified that metrics reported in SonarQube matched those displayed in Zabbix, confirming the accuracy and reliability of the monitoring setup.
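A check of this kind can be expressed as a small comparison routine. The Python sketch below assumes the values from both systems have already been collected into dictionaries keyed by metric name. The function name and the sample values are hypothetical.

```python
from typing import Optional


def find_mismatches(
    sonarqube: dict[str, float],
    zabbix: dict[str, float],
) -> dict[str, tuple[Optional[float], Optional[float]]]:
    """Return metrics whose values differ, or that exist on one side only."""
    mismatches = {}
    for metric in sorted(set(sonarqube) | set(zabbix)):
        left, right = sonarqube.get(metric), zabbix.get(metric)
        if left != right:
            mismatches[metric] = (left, right)
    return mismatches


if __name__ == "__main__":
    # Hypothetical values for one project.
    from_sonarqube = {"bugs": 12, "vulnerabilities": 0, "code_smells": 40}
    from_zabbix = {"bugs": 12, "vulnerabilities": 0, "code_smells": 38}
    print(find_mismatches(from_sonarqube, from_zabbix))
```

An empty result means both systems agree for that project. Because Zabbix polls at an interval, a mismatch seen right after a new analysis may simply reflect a value that has not been refreshed yet.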

With the projects and metrics being actively monitored, the focus shifted to presenting the data effectively. A custom dashboard was created in Zabbix to aggregate and display the information gleaned from SonarQube.

Design and Layout:

We created a user-friendly dashboard to provide a quick overview of the status of all projects.

  • The dashboard was organized to show metrics such as the number of bugs, vulnerabilities, code smells, and the Quality Gate status of each project at a glance.
  • Particular attention was paid to visual hierarchy and layout, ensuring that the most critical metrics were immediately visible.

Custom Widgets and Visualizations:

Widgets were customized for each key metric to enhance readability and instant understanding of the project statuses.
Visual indicators, such as color-coded status icons and progress bars, were incorporated to give a clear visual cue about the health of each project.

Real-time Data Representation:

We configured the dashboard to refresh at regular intervals, providing real-time updates to the development team.
We ensured that the most current data was always available, enabling a proactive approach to quality assurance and code health.

Results and Benefits

The integration of SonarQube with Zabbix and the creation of a dedicated dashboard yielded significant benefits for development workflow and project management.

Improved Code Quality Monitoring:
  • The real-time monitoring of code quality metrics allowed for quicker identification and resolution of issues.
  • Developers received immediate feedback on the quality of their code, fostering a quality-first culture in the development process.
Enhanced Visibility:
  • The Zabbix dashboard provided a centralized view of the health status of all projects, enhancing visibility for both developers and management.
  • Critical issues could be identified at a glance, allowing for prioritization and resource allocation to address the most pressing problems.
Streamlined Workflow:
  • Automated project discovery and monitoring reduced manual overhead, allowing developers to focus on coding rather than reporting.
  • Alerts and notifications from Zabbix ensured that no critical issues went unnoticed.
Decision-making Support:
  • The collected data and trends visible on the dashboard supported informed decision-making regarding code quality improvements and technical debt management.
  • The ability to track historical data enabled the team to measure the impact of implemented changes over time.
Proactive Issue Management:
  • The early detection of bugs and vulnerabilities allowed the team to address issues before they escalated, reducing potential risks to project timelines and quality.
  • The Quality Gate statuses helped maintain a consistent standard of code quality across all projects.

Special thanks to the team at Zen Networks (Oumaima Naami, Karim Chadil, and Fayçal Noushi) for their work on this project.

 

The post Elevating Code Quality: Real-Time Insights with Zabbix Integration and SonarQube appeared first on Zabbix Blog.