A Decade of Defense: Celebrating Grab’s 10th Year Bug Bounty Program

Post Syndicated from Grab Tech original https://engineering.grab.com/a-decade-of-defense

Introduction

Ten years ago, we launched our bug bounty program in partnership with HackerOne. Beyond a security initiative, it represented an open invitation to collaborative development.
As pioneers in Southeast Asia, we began the program with 23 initial researchers, and it has since evolved into a global community of security researchers.

The strategic structure and scope of our Bug Bounty Program, combined with our continuous innovation and experimentation, have successfully captured the attention of the global security research community. Over the past decade, we have partnered with more than 850 active security researchers from HackerOne’s community of over 2 million cybersecurity professionals worldwide. These dedicated researchers work alongside us across borders and time zones, forming a collaborative defense network that helps protect over 187 million users throughout Southeast Asia. Their ongoing participation demonstrates both the maturity of our program and the trust we’ve built within the security research community.

This milestone reflects the strength of shared purpose and our sustained partnership with the HackerOne platform. It demonstrates the value of human connection and the collective understanding that security is stronger through collaboration. Here’s to a decade of partnership and to many more years of building a safer future, one collaboration at a time!

Figure 1. Ten years of achievements with our HackerOne partnership.

Evolution and growth: Adapting to a dynamic threat landscape

Over the past ten years, our program has consistently adapted to the dynamic threat landscape and integrated invaluable feedback from our research community. We have grown from a private initiative to a program that consistently ranks among the top 20 worldwide and among the top 3 in Asia on HackerOne. Key milestones from our journey include:

  • Expanding our horizons: We broadened our scope significantly in 2023-2024, adding new assets and, most prominently, financial services in Indonesia and AI systems. This expansion gives researchers more avenues to contribute to Grab’s security.
  • Focused mobile security: We introduced a dedicated bounty table for mobile-specific issues, recognizing the unique challenges of mobile security.
  • Incentivizing excellence: We regularly experiment with campaigns of various types and targets, diversifying our reward methods to include both financial rewards and recognition.
  • Evolving vulnerability focus: We’ve observed a significant shift in the types of vulnerabilities reported over the decade, moving from foundational issues in early years to more sophisticated and emerging categories recently.

Figure 2. The journey of our bug bounty program.

The global stage: Connecting with the best

Our program’s success is deeply rooted in its vibrant global community, which we actively foster through continuous engagement. Our strategy extends beyond the platform to major live hacking events, including the ThreatCon Live Hacking Event 2023 in Nepal and DEFCON 32’s Live Recon Village 2024 in Las Vegas. These initiatives have been instrumental in connecting us with a diverse pool of new talent and strengthening relationships with researchers across different continents. By meeting hackers where they are, we’ve not only brought new expertise into our ecosystem but also demonstrated our commitment to being an accessible and collaborative partner on a global scale.

The high participation and quality submissions from these events demonstrate the effectiveness of this approach. They’ve expanded our global security testing coverage and strengthened our standing within the worldwide cybersecurity community. Through ongoing interactions and submitted reports, we continue to see that security is a collaborative effort with no borders.

Exclusive anniversary celebrations: Global club campaigns

To commemorate our 10th anniversary, we launched three exclusive, invite-only campaigns with HackerOne’s regional clubs in Germany, Morocco, and India. These campaigns served as cultural exchanges, bringing fresh perspectives from outside our core Southeast Asian consumer markets. By engaging with these clubs, we expanded our researcher community and connected with security experts who understand different threat landscapes and methodologies, bringing outside perspectives to our systems.

In August, we also ran a broader anniversary campaign that drew significant participation from the researcher community, resulting in 461 submissions. xchopath was awarded the Best Hacker Bonus for their contributions during this campaign.

These campaigns expanded our global security testing coverage and strengthened relationships with international researcher communities. Beyond vulnerability reports, they functioned as knowledge-sharing initiatives. We connected directly with researchers to learn from their experience and feedback, creating a continuous loop of improvement. This international collaboration also informed our global expansion security strategy by providing insights into how different regions approach digital payments and authentication.

The anniversary campaigns allowed us to validate our security frameworks against diverse regulatory environments and advanced testing methodologies from established security markets, reinforcing our commitment to maintaining robust security standards.

Voices from our community

Behind every vulnerability report is a researcher who chose to help make Grab safer. Their perspectives reveal the human side of our security evolution. These individuals are not just cybersecurity experts; they are partners in our mission to protect millions of users and ensure a safe digital environment. Here are a few testimonies from participants in our past campaigns:

  • “The triage was very fast despite the time difference, which I really appreciated. The triaging experience was better than other programs. The huge scope and business portal with different user roles made it especially interesting to explore.” – ArtSec [H1 Germany club campaign participant]

  • “I liked that different countries have different features—this gives me more attack surface to explore. Response time was great, triage was very fast, and I appreciated Grab’s effort in providing fast responses. The scope was huge with a lot of wildcards for reconnaissance.” – Sicksec [H1 Morocco club campaign participant]

  • “More than 20 bugs were reported, and was particularly happy that bounties were being paid upon triage. The Germany team spent a lot of time on the educational part, especially for newcomers. Communication overall was very good, and the immediate response even outside working hours was really cool. SSO and authentication is my expertise and I liked that aspect of exploring the platform.” – Lauritz [H1 Germany club campaign participant]

The road ahead: Our commitment to a secure future

With a strong community of security researchers across countries and a decade of collaboration, we’ve built meaningful partnerships. Every vulnerability report represents trust, and every discovery reflects dedication to our shared mission. The program demonstrates our choice to build together rather than work in isolation, to protect rather than exploit, and to collaborate rather than compete.

While we celebrate our external community, the success of our program relies equally on our dedicated internal teams. Our cybersecurity teams form the operational foundation of this initiative. Their consistent responsiveness and researcher-focused approach have enabled vulnerability reporting to evolve into a genuine partnership, maintaining researcher trust and keeping Grab secure.

The next ten years will bring challenges we can’t yet imagine, from emerging threats in artificial intelligence to novel cryptographic approaches in a quantum-powered world. We will face them together as a community that spans cultures, time zones, and expertise.

Together, we’ll continue securing Southeast Asia’s digital future, one partnership, one discovery, one shared achievement at a time.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility, and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Real-time data quality monitoring: Kafka stream contracts with syntactic and semantic tests

Post Syndicated from Grab Tech original https://engineering.grab.com/real-time-data-quality-monitoring

Introduction

In today’s data-driven landscape, monitoring data quality has become a critical need for ensuring reliable and efficient data usage across domains. High-quality data is the backbone of AI innovation, driving efficiency and unlocking new opportunities. As decentralized data ownership grows, the ability to effectively monitor data quality is essential for maintaining reliability in data systems.

Kafka streams, as a vital component of real-time data processing, play a significant role in this ecosystem. However, unreliable data within Kafka streams can lead to errors and inefficiencies for downstream users, and monitoring the quality of data within these streams has always been a challenge. This blog introduces a solution that empowers stream users to define a data contract, specifying the rules that Kafka stream data must adhere to. By leveraging this user-defined data contract, the solution performs automated real-time data quality checks, identifies problematic data as it occurs, and promptly notifies stream owners. This ensures timely action, enabling effective monitoring and management of Kafka stream data quality while supporting the broader goals of data mesh and AI-driven innovation.

Problem statement

In the past, monitoring Kafka stream data processing lacked an effective solution for data quality validation. This limitation made it challenging to identify bad data, notify users in a timely manner, and prevent the cascading impact on downstream users from further escalating.

Challenges in syntactic and semantic issue identification:

  • Syntactic issues: These are schema mismatches between producers and consumers, which can lead to deserialization errors. While schema backward compatibility can be validated upon schema evolution, there are scenarios where the actual data in the Kafka topic does not align with the defined schema. For example, this can occur when a rogue Kafka producer does not use the expected schema for a given Kafka topic. Identifying the specific fields that cause these syntactic issues is a typical challenge.
  • Semantic issues: These are inconsistencies or misalignments between producers and consumers about the expected pattern or significance of each field. Unlike Kafka stream schemas, which act as a data structure contract between producers and consumers, there is no existing framework for stakeholders to define and enforce field-level semantic rules, such as the expected length or pattern of an identifier.

Timeliness challenge in data quality monitoring: There is no real-time mechanism to automatically validate data against predefined rules, identify quality issues as they occur, and promptly alert stream stakeholders. Without real-time stream validation, data quality issues can persist undetected, impacting various online and offline downstream systems before being discovered.

Observability challenge for troubleshooting bad data: Even when problematic data is identified, stream users face difficulties in pinpointing the exact “poison data” and understanding which fields are incompatible with the schema or violate semantic rules. This lack of visibility complicates Root Cause Analysis and resolution efforts.

Solution

Our Coban platform offers a standardized data quality test and observability solution at the platform level, consisting of the following components:

  • Data Contract Definition: Enables Kafka stream stakeholders to define contracts that include schema agreements, semantic rules that Kafka topic data must comply with, and Kafka stream ownership details for alerting and notifications.
  • Automated Test Execution: Provides a long-running Test Runner to automatically execute real-time tests based on the defined contract.
  • Real-time Data Quality Issue Identification: Detects data issues at both syntactic and semantic levels in real time.
  • Alerts and Result Observability: Alerts users and makes data quality issues easy to inspect via the platform.

Architecture details

The solution includes three components: Data Contract Definition, Test Execution & Data Quality Issue Identification, and Result Observability as shown in the architecture diagram in figure 1. All mentions of “Flow” from here onwards refer to the corresponding processes illustrated in figure 1.

Figure 1. Real-time Kafka Stream Data Quality Monitoring Architecture diagram.

Data Contract Definition

The Coban Platform streamlines the process of defining Kafka stream data contracts, serving as a formal agreement among Kafka stream stakeholders. This includes the following components:

  • Kafka Stream Schema: Represents the schema used by the Kafka topic under test and helps the Test Runner to validate schema compatibility across data streams (Flow 1.1).
  • Kafka Stream Configuration: Encompasses essential configurations such as the endpoint and topic name, which the platform automatically populates (Flow 1.2).
  • Observability Metadata: Provides contact information for notifying Kafka stream stakeholders about data quality issues and includes alert configurations for monitoring (Flow 1.3).
  • Kafka Stream Semantic Test Rules: Empowers users to define intuitive semantic test rules at the field level. These rules include checks for string patterns, number ranges, constant values, etc. (Flow 1.5).
  • LLM-Based Semantic Test Rules Recommendation: Defining dozens, if not hundreds, of field-specific test rules can overwhelm users. To simplify this process, the Coban Platform uses LLM-based recommendations to predict semantic test rules from the provided Kafka stream schemas and anonymized sample data (Flow 1.4). This feature helps users set up semantic rules efficiently, as demonstrated in the sample UI in figure 2.

Figure 2. Sample UI showcasing LLM-based Kafka stream schema field-level semantic test rules. Note that the data shown is entirely fictional.
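To make the parts of a data contract concrete, here is a minimal sketch in Python. It is not the Coban Platform's actual contract format, which the post does not show. The class names, field names, rule kinds, topic, and endpoint are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SemanticRule:
    """One field-level rule (hypothetical shape, not Coban's real format)."""
    field_name: str          # field in the Kafka stream schema
    kind: str                # assumed kinds: "pattern", "range", "constant"
    params: dict[str, Any]   # rule-specific settings


@dataclass
class DataContract:
    topic: str                       # Kafka stream configuration
    endpoint: str
    schema: dict[str, str]           # field name -> type name
    owners: list[str]                # observability metadata, e.g. Slack channels
    rules: list[SemanticRule] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return problems with the contract itself, not with stream data."""
        problems = []
        for rule in self.rules:
            if rule.field_name not in self.schema:
                problems.append(f"rule targets unknown field: {rule.field_name}")
        if not self.owners:
            problems.append("no owner to notify")
        return problems


# Entirely fictional example values.
contract = DataContract(
    topic="bookings",
    endpoint="kafka.example.internal:9092",
    schema={"booking_id": "string", "fare": "double", "currency": "string"},
    owners=["#bookings-stream-owners"],
    rules=[
        SemanticRule("booking_id", "pattern", {"regex": "^BK-[0-9]{8}$"}),
        SemanticRule("fare", "range", {"min": 0, "max": 10000}),
        SemanticRule("currency", "constant", {"value": "SGD"}),
    ],
)
print(contract.validate())
```

Keeping the schema, stream configuration, ownership, and rules in one object mirrors the idea of a single agreement among stakeholders. The `validate` method only checks that the contract is internally consistent before any test is deployed.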

Data Contract Transformation

Once defined, the Coban Platform’s transformation engine converts the data contract into configurations that the Test Runner can interpret (Flow 2.1). This transformation process includes:

  • Kafka Stream Schema: Translates the schema defined in the data contract into a schema reference that the Test Runner can parse.
  • Kafka Stream Configuration: Sets up the Kafka stream as a source for the Test Runner.
  • Observability metadata: Sets contact information as configurations of the Test Runner.
  • Kafka Stream Semantic Test Rules: Transforms human-readable semantic test rules into an inverse SQL query to capture the data that violates the defined rules.

Figure 3. Illustration of semantic test rules being converted from human-readable formats into inverse SQL queries.
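The idea of an "inverse" query can be sketched as follows: each rule describes valid data, so the generated predicate is its negation, and the query returns only the violating rows. This is an illustrative sketch, not the platform's transformation engine. The rule dictionary layout and function names are assumptions, and NULL handling is deliberately left out.

```python
def sql_string(value: str) -> str:
    """Quote a Python string as a SQL string literal (single quotes doubled)."""
    return "'" + value.replace("'", "''") + "'"


def violation_predicate(rule: dict) -> str:
    """Return a SQL predicate that is TRUE when the rule is violated."""
    col = f"`{rule['field']}`"
    kind = rule["kind"]
    if kind == "pattern":
        # REGEXP is a Flink SQL built-in; negating it selects non-matching values.
        return f"NOT REGEXP({col}, {sql_string(rule['regex'])})"
    if kind == "range":
        return f"({col} < {rule['min']} OR {col} > {rule['max']})"
    if kind == "constant":
        return f"{col} <> {sql_string(rule['value'])}"
    raise ValueError(f"unsupported rule kind: {kind}")


def inverse_query(table: str, rules: list[dict]) -> str:
    """Build a query that selects only rows breaking at least one rule."""
    predicates = "\n   OR ".join(violation_predicate(r) for r in rules)
    return f"SELECT *\nFROM `{table}`\nWHERE {predicates}"


# Hypothetical rules for a fictional "bookings" stream.
rules = [
    {"field": "booking_id", "kind": "pattern", "regex": "^BK-[0-9]{8}$"},
    {"field": "fare", "kind": "range", "min": 0, "max": 10000},
    {"field": "currency", "kind": "constant", "value": "SGD"},
]
print(inverse_query("bookings", rules))
```

Joining the negated predicates with OR means a row is reported as soon as it breaks any single rule, which is what a violation-capturing query needs.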

Test Execution & Data Quality Issue Identification

Once the Test Configuration Transformation Engine generates the Test Runner configuration (Flow 2.1), the platform automatically deploys the Test Runner.

Test Runner

The Test Runner uses FlinkSQL as the compute engine to execute the tests. FlinkSQL was selected for its flexibility in defining test rules as straightforward SQL statements, enabling our platform to efficiently convert data contracts into enforceable rules.

Test Execution Workflow And Problematic Data Identification

FlinkSQL consumes data from the Kafka topic under test (Flow 2.2) using its own consumer group, ensuring it doesn’t impact other consumers. It runs the inverse SQL query (Flow 2.3) to identify any data that violates the semantic rules or that is syntactically incorrect in the first place. The Test Runner captures such data, packages it into a data quality issue event enriched with a test summary, the total count of bad records, and sample bad data, and publishes it to a dedicated Kafka topic (Flow 3.2). Additionally, the platform sinks all such data quality events to an AWS S3 bucket (Flow 3.1) to enable deeper observability and analysis.
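The shape of such a data quality issue event can be illustrated with a small in-memory sketch. It does not use Flink or Kafka. The event field names, the `build_issue_event` function, and the sample records are assumptions, not the Test Runner's real output format.

```python
import re
from typing import Any, Callable, Optional

Check = Callable[[Any], bool]  # returns True when a field value is valid


def build_issue_event(
    topic: str,
    records: list[dict],
    checks: dict[str, Check],
    max_samples: int = 3,
) -> Optional[dict]:
    """Package violating records into one event, or return None if all pass."""
    bad = []
    for record in records:
        # A missing field or a failed check both count as a violation.
        violated = [
            name for name, check in checks.items()
            if name not in record or not check(record[name])
        ]
        if violated:
            bad.append({"record": record, "violated_fields": violated})
    if not bad:
        return None
    return {
        "topic": topic,
        "summary": f"{len(bad)} of {len(records)} records violated the contract",
        "bad_record_count": len(bad),
        "samples": bad[:max_samples],
    }


# Hypothetical checks and fictional records.
checks = {
    "booking_id": lambda v: isinstance(v, str) and re.fullmatch(r"BK-[0-9]{8}", v) is not None,
    "fare": lambda v: isinstance(v, (int, float)) and 0 <= v <= 10000,
}
records = [
    {"booking_id": "BK-00000001", "fare": 12.5},
    {"booking_id": "bad-id", "fare": 8.0},
    {"booking_id": "BK-00000003"},
]
event = build_issue_event("bookings", records, checks)
print(event["summary"])
```

Recording the violated field names next to each sample is what later lets a UI highlight the exact fields at fault instead of only reporting that a record was bad.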

Result Observability

Grab’s in-house data quality observability platform, Genchi, consumes problematic data captured by the Test Runner (Flow 3.3).

Alerting

Genchi sends Slack notifications (Flow 3.5) to stream owners specified in the data contract observability metadata. These notifications include detailed information about stream issues, such as links to sample data in Coban UI, observed windows, counts of bad records, and other relevant details.

Figure 4. Sample Slack notifications

Observability

Users can access the Coban UI (Flow 3.4), which displays Kafka stream test rules and sample bad records and highlights the fields and values that violate the rules.

Figure 5. In this Sample Test Result, the highlighted fields indicate violations of the semantic test rules.

Impact

Since its deployment earlier this year, the solution has enabled Kafka stream users to define contracts with syntactic and semantic rules, automate test execution, and alert users when problematic data is detected, prompting timely action. It has been actively monitoring data quality across 100+ critical Kafka topics. The solution offers the capability to immediately identify and halt the propagation of invalid data across multiple streams.

Conclusion

We implemented and rolled out a solution to assist Grab engineers in effectively monitoring data quality in their Kafka streams. This solution empowers them to establish syntactic and semantic tests for their data. Our platform’s automatic testing feature enables real-time tracking of data quality, with instant alerts for any discrepancies. Additionally, we provide detailed visibility into test results, facilitating the easy identification of specific data fields that violate the rules. This accelerates the process of diagnosing and resolving issues, allowing users to swiftly address production data challenges.

What’s next

While our current solution emphasizes monitoring the quality of Kafka streaming data, further exploration will focus on tracing producers to pinpoint the origin of problematic data, as well as enabling more advanced semantic tests such as cross-field validations. Additionally, we aim to expand monitoring capabilities to cover broader aspects like data completeness and freshness, and integrate with Gable AI to detect Data Transfer Object (DTO) changes and semantic regressions in Go producers upon committing code to the Git repository. These enhancements will pave the way for a more robust, multidimensional data quality testing solution across a wider range.

References

Driving Data Quality with Data Contracts: A Comprehensive Guide to Building Reliable, Trusted, and Effective Data Platforms by Andrew Jones

SpellVault’s evolution: Beyond LLM apps, towards the agentic future

Post Syndicated from Grab Tech original https://engineering.grab.com/spellvault-evolution-beyond-llm

Introduction

At Grab, innovation isn’t just about building new features; it’s about evolving our platforms to meet the changing needs of our users and the broader technological landscape. SpellVault, our internal AI platform, exemplifies this philosophy. When SpellVault was first launched, our vision was straightforward: empower everyone at Grab to effortlessly build and manage AI-powered apps without the need for coding. Built on the principles of Retrieval-Augmented Generation (RAG) and enhanced by plugin support, SpellVault rapidly evolved into a powerful productivity engine for the organization, enabling the creation of thousands of apps that drive automation, foster experimentation, and support production use cases.

As the AI landscape has evolved, SpellVault has grown alongside it. Initially launched as a straightforward no-code app builder for Large Language Models (LLMs), it has now evolved into a cutting-edge platform that embraces the agentic future—a future where AI goes beyond generating responses to reasoning, acting, and dynamically adapting through the use of tools and contextual understanding.

This article outlines SpellVault’s journey towards an agentic future and how we empower users to build AI Agents that are smarter, more adaptable, and ready for the future.

A no-code platform for building LLM apps

SpellVault was founded with a clear mission: to democratize access to AI for everyone at Grab, regardless of their technical expertise. Initially launched as a no-code LLM app builder, the platform was built on a foundation of RAG pipelines and basic plugin support.

Early on, we recognized that the true potential of AI apps extends beyond the capabilities of language models alone. Their real value lies in the ability to seamlessly interact with external systems and diverse data sources. This insight drove our commitment to minimizing barriers and ensuring users could access data from various sources with ease. From the very beginning, we centered our efforts on three key focus areas:

Comprehensive RAG solution with useful integrations

From the start, the SpellVault team prioritized enabling users to enhance their LLM apps with data through RAG. Rather than solely relying on the LLM’s internal information, we wanted the apps to ground their responses in up-to-date, contextually relevant, and factual information. SpellVault has built-in integrations with knowledge sources such as Wikis and Google Docs, as well as plain text and PDF uploads. These capabilities empower users to build assistants that reference relevant knowledge and provide more accurate, verifiable answers.

Plugins to fetch information on demand

To move beyond static knowledge retrieval, we needed a way for apps to act dynamically. This was made possible through SpellVault plugins: modular components that allow apps to interact with internal systems (e.g. service dashboards, incident trackers) and external APIs (e.g. search engines, weather data). Rather than being confined to their initial prompt and data, apps can use these plugins to fetch fresh information at runtime. From the available plugin types, users can create their own plugin instances with custom settings, enabling highly specialized functionality tailored to their specific workflows. For instance, with SpellVault’s HTTP plugin, users can define custom endpoints and credentials, enabling their AI apps to make tailored HTTP calls at runtime. These custom plugins have become the backbone of many of our most impactful apps, empowering teams to seamlessly integrate SpellVault with their existing systems and processes.
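As a rough illustration of what a user-configured HTTP plugin instance might hold, the sketch below builds a request from a stored endpoint template and credentials without sending it. SpellVault's real plugin schema is not described in the post, so the class, its fields, and the example URL are hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import quote
from urllib.request import Request


@dataclass
class HttpPluginInstance:
    """Hypothetical user-configured plugin: endpoint template plus credentials."""
    name: str
    url_template: str                      # placeholders filled in at runtime
    method: str = "GET"
    headers: dict[str, str] = field(default_factory=dict)

    def build_request(self, **params: str) -> Request:
        """Fill the template with URL-encoded runtime values; does not send."""
        encoded = {key: quote(str(value), safe="") for key, value in params.items()}
        url = self.url_template.format(**encoded)
        return Request(url, method=self.method, headers=self.headers)


plugin = HttpPluginInstance(
    name="incident_lookup",
    url_template="https://api.example.com/incidents/{incident_id}",
    headers={"Authorization": "Bearer <token>"},
)
request = plugin.build_request(incident_id="INC/42")
print(request.get_method(), request.full_url)
```

Separating the stored configuration (endpoint, method, credentials) from the runtime arguments is the point of the sketch: the user configures the instance once, and the app supplies only the values that change per call.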

Figure 1. SpellVault’s early architecture.

Making SpellVault accessible via common interfaces: Web, Slack, API

One of our primary goals was to make AI seamlessly accessible and useful within the tools users already use—whether it’s a browser or Slack. With SpellVault, users can make their AI apps in minutes and start using them via browser or Slack messaging immediately and intuitively, without requiring any additional setup. We also exposed APIs that enabled other internal services to integrate with SpellVault apps for a variety of use cases. This multi-channel approach ensured that SpellVault wasn’t just a standalone sandbox but a platform woven into existing tools and processes.

Users quickly adopted the platform, creating thousands of apps for internal productivity gains, automation, and even production use cases. The platform’s success validated our hypothesis that there was significant demand for democratized AI tools within the organization.

Figure 2. SpellVault’s web interface for LLM App configuration and chat.

Evolution over time

The AI landscape over the past few years has been defined by relentless change. New frameworks, execution paradigms, and standards have emerged in quick succession, each promising to make AI systems more powerful, more reliable, or more extensible. At Grab, we recognized that for SpellVault to stay relevant, it could not remain static. It needed to evolve in tandem with the ever-changing ecosystem, continuously incorporating valuable advancements while ensuring a seamless experience for our users.

This philosophy of continuous adaptation has guided SpellVault’s journey. From its early days as a simple RAG-powered app builder with a few plugins, the platform grew to support an extensive number of plugin types, richer execution models, and eventually a unified approach to tools. Each step was a response both to the needs of our users and to the shifting definition of what “building with AI” meant in practice. Rather than opting for a complete overhaul, SpellVault has embraced incremental advancements, ensuring that users can seamlessly benefit from new capabilities without disruption.

This approach to evolution has naturally positioned SpellVault to transition from a platform for LLM apps to one designed for AI agents. The following section delves into this transition in greater detail.

Expanding capabilities

Over time, we introduced numerous new capabilities to SpellVault, driven both by user feedback and our commitment to innovation and staying ahead of industry trends. For instance, we extended support for different plugin types, enabling integrations with tools like Slack and Kibana, and continuously added more integrations to enhance the platform’s versatility. We implemented auto-updates for users’ Knowledge Vaults, ensuring their data remained current. With more users building on the platform, ensuring the trustworthiness of responses generated by SpellVault apps became increasingly important, so we added a citation capability to mitigate some of that concern. Recognizing the need for more precise answers to mathematical problems, we developed a feature that enabled LLMs to solve such problems using a Python runtime. Additionally, many users requested an automated way to trigger their LLM apps, which led to the creation of a Task Scheduler feature that allows LLMs to schedule actions based on natural language user input.

A significant milestone in SpellVault’s evolution was the introduction of “Workflow,” a drag-and-drop interface within the platform that empowered users to design deterministic workflows. These workflows enabled users to seamlessly combine various components from the SpellVault ecosystem—such as LLM calls, Python code execution, and Knowledge Vault lookups—in a predefined and structured manner. This enabled advanced use cases for many users.

Figure 3. Evolving tools landscape of SpellVault with increasing integrations.

Shifting the execution model

As SpellVault evolved, a fundamental shift took place in the way its apps were executed internally. We transitioned from our legacy executor system, which facilitated one-off information retrieval from the Knowledge Vault or user plugins, to a more advanced graph-based executor. This gave SpellVault’s app execution nodes, edges, and states that supported branching, looping, and modularity, and laid the groundwork for more sophisticated agent behaviors beyond the linear input-output paradigm.

This transformed all existing SpellVault apps into ‘Reasoning and Acting’ agents, better known as ReAct agents – a “one size fits many” solution that significantly enhanced the capabilities of these apps. By enabling them to leverage the Knowledge Vault and plugins in a more agentic and dynamic manner, the ReAct agent framework allowed apps to perform more complex tasks while seamlessly preserving their existing functionality, ensuring no disruption to their behavior.
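A graph-based executor running a ReAct-style loop can be sketched in a few lines. This is a toy model, not SpellVault's executor. The `GraphExecutor` class, the node names, and the scripted `reason` function standing in for the LLM's decision are all assumptions made for illustration.

```python
from typing import Callable, Optional

State = dict
NodeFn = Callable[[State], State]
RouteFn = Callable[[State], Optional[str]]  # next node name, or None to stop


class GraphExecutor:
    """Runs nodes over a shared state; routing functions act as edges."""

    def __init__(self) -> None:
        self._nodes: dict[str, tuple[NodeFn, RouteFn]] = {}

    def add_node(self, name: str, fn: NodeFn, route: RouteFn) -> None:
        self._nodes[name] = (fn, route)

    def run(self, start: str, state: State, max_steps: int = 20) -> State:
        current: Optional[str] = start
        steps = 0
        while current is not None:
            if steps >= max_steps:
                raise RuntimeError("step limit reached; possible infinite loop")
            fn, route = self._nodes[current]
            state = fn(state)
            current = route(state)   # branching and looping happen here
            steps += 1
        return state


def reason(state: State) -> State:
    """Stand-in for the LLM: answer once an observation exists, else act."""
    if state["observations"]:
        return {**state, "answer": f"Based on: {state['observations'][-1]}"}
    return {**state, "action": ("lookup", state["question"])}


def act(state: State) -> State:
    """Stand-in for a tool call; records its result as an observation."""
    tool, arg = state["action"]
    observation = f"{tool} result for '{arg}'"
    return {**state, "observations": state["observations"] + [observation], "action": None}


graph = GraphExecutor()
graph.add_node("reason", reason, lambda s: None if "answer" in s else "act")
graph.add_node("act", act, lambda s: "reason")

final = graph.run("reason", {"question": "who is on call?", "observations": []})
print(final["answer"])
```

The loop between `reason` and `act` is what a linear retrieve-then-respond executor cannot express. The step limit is a simple guard against a loop that never converges.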

In addition, the internal decoupling of the executor and prompt engineering components enabled us to design multiple execution pathways with ease. This allowed us to provide generic Deep Research capability to any SpellVault app via a simple UI checkbox, as well as sophisticated internal workflows that cater to high-ROI complex use cases like on-call alert analysis. The Deep Research capability came with SpellVault’s ability to search across internal information repositories (e.g., Slack messages, Wiki, Jira) within Grab, as well as searching online for relevant information.

Figure 4. SpellVault’s evolved architecture with more dynamic context gathering and advanced interaction modes.

Towards an agentic framework

Over time, several capabilities were added to SpellVault, including features like Python code execution and internal repository search. Initially, these functionalities were integrated directly into the core PromptBuilder class. For users, these features were primarily accessible through simple checkboxes in the user interface. As SpellVault gradually transitioned towards giving more agency to user-crafted apps, we recognized that these capabilities should instead be positioned as “Tools” for LLMs to use with greater autonomy, similar to how ReAct agent–backed apps have been using SpellVault’s user plugins. We also understood that this shift could bring a clearer mental model for users where they were no longer simply toggling features but creating AI agents with access to a defined set of tools. The agents could then decide when and how to use those tools intelligently to accomplish tasks, making the overall experience more natural and intuitive.

This recognition led to the consolidation of these scattered capabilities into a unified framework called “Native Tools.” These Native Tools, along with SpellVault’s existing user plugins—rebranded as “Community Built Tools”—formed a comprehensive collection of tools that LLMs could dynamically invoke at runtime. Despite being grouped under the same umbrella, a key distinction was maintained: Native Tools required no user-specific configuration (e.g., performing internet searches), whereas Community Built Tools were custom, user-configured entities (e.g., invoking specific HTTP endpoints) created from available plugin types, often requiring credentials or other personalized settings.

Consolidating these capabilities under a unified Tools abstraction, and enabling SpellVault apps to invoke them with greater autonomy, marked a pivotal milestone in the platform’s evolution. It meaningfully shifted SpellVault toward making agentic behavior more natural, discoverable, and extensible for every app.

Figure 5. SpellVault’s Unified Tools housing both Native Tools and Community Built Tools.

SpellVault as an MCP service

As we streamlined SpellVault’s internal capabilities into a unified tools framework, we also turned our focus outward to align with industry standards. The growing adoption of the Model Context Protocol (MCP) presented an opportunity for agents and clients to seamlessly interact without requiring custom integrations. To remain at the forefront of innovation, we adapted SpellVault to function as an MCP service, enabling it to actively participate in this evolving ecosystem. This extension brought two key advancements:

  • SpellVault apps as MCP tools: Each app created in SpellVault can now be exposed through the MCP protocol. This allows other agents or MCP-compatible clients, such as IDEs or external orchestration frameworks, to treat a SpellVault app as a callable tool. Instead of living only inside our web user interface or Slack interface, these apps become accessible building blocks that other systems can invoke dynamically.

  • RAG as an MCP tool: We extended the same idea to our Knowledge Vaults. Through MCP, external clients can search, retrieve, and even add information to Vaults. This effectively turns SpellVault’s RAG pipeline into an MCP-native service, making contextual grounding available to agents beyond SpellVault itself.
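As a rough illustration of what exposing apps as MCP tools involves, the sketch below answers the two JSON-RPC 2.0 methods that MCP defines for tools, `tools/list` and `tools/call`. It is not the SpellVault MCP Server or TinyMCP: the `APPS` table, the app name, and the `handle` function are hypothetical, and transport, authentication, and schemas are left out.

```python
# Hypothetical stand-ins for SpellVault apps; the name and behaviour are invented.
APPS = {
    "summarise_ticket": lambda args: f"summary of {args['ticket_id']}",
}


def handle(request: dict) -> dict:
    """Answer a JSON-RPC 2.0 request for the two MCP tool methods."""
    method = request["method"]
    params = request.get("params", {})
    if method == "tools/list":
        # Each app is advertised as one callable tool.
        result = {"tools": [{"name": n, "inputSchema": {"type": "object"}}
                            for n in sorted(APPS)]}
    elif method == "tools/call":
        text = APPS[params["name"]](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}]}
    else:
        # -32601 is the standard JSON-RPC "method not found" code.
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
```

An MCP-compatible client would first call `tools/list` to discover the app, then `tools/call` with the app's name and arguments.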

While building the SpellVault MCP Server, we also created TinyMCP – a lightweight open-source Python library that adds MCP capabilities to an existing FastAPI app as just another router, instead of mounting a separate app.

By exposing both apps and RAG through MCP, we shifted SpellVault from being a self-contained platform to becoming an interoperable service provider in the agentic ecosystem. Users still benefit from the no-code simplicity inside SpellVault. However, the outputs of their work, both apps and knowledge, are now usable by other agents and tools outside of it.

Conclusion

SpellVault’s evolution shows how a platform can adapt with the AI landscape while staying true to its original mission of making powerful technology accessible to everyone. What began as a no-code builder for LLM apps has steadily expanded into an agentic platform – one where apps can act with more intelligence, agency, and context and interact with the systems around them.

This progress wasn’t the result of a single breakthrough, but of steady, incremental improvements that introduced new capabilities while preserving ease of use. By layering in these advancements thoughtfully but boldly, SpellVault has managed to support more sophisticated agentic behaviors without compromising its original goal of democratizing AI at Grab.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Grab’s Mac Cloud Exit supercharges macOS CI/CD

Post Syndicated from Grab Tech original https://engineering.grab.com/mac-cloud-exit

Introduction

In our mission to optimize continuous integration and delivery (CI/CD), we have taken a bold step by relocating our infrastructure from a cloud vendor in the US to a colocation cluster within Southeast Asia, closer to our Git server infrastructure. This change has dramatically improved the performance of our macOS builds, primarily by reducing the network traffic delays associated with distant data centers. By bringing our infrastructure closer to home, we have not only accelerated CI/CD job completion times but also massively slashed operational costs.

Join us as we delve into the Mac Cloud Exit journey and the significant improvements it has brought to our workflows.

Our macOS CI/CD infrastructure has evolved from a single physical Mac Pro running in our office to a cluster of 250 Mac minis that is fully occupied during peak hours. The transition to the current state happened in multiple stages. The following diagram shows the focus area for this blog post.

Figure 1: Infrastructure transition path

Before and after: Visualizing the evolution

We began our journey with a much simpler setup.

Figure 2: Photo of the setup when we started

Today, that infrastructure has scaled significantly to meet the growing demands of Grab.

Figure 3: Mac mini cluster today

Economy at scale: The rent vs. own equation

At the beginning, it was a no-brainer to rent when our demand for macOS hardware increased from 1 Mac Pro to 20 times that size. However, when that grew to over 200 machines, the total cost became significant, prompting us to consider:

  1. What is the desired reliability for this cluster?
  2. What would be the total cost of ownership for us to build this cluster ourselves compared to cloud-based options?
  3. What kind of operational leverage would we gain by controlling the end-to-end stack ourselves?

What is Grab’s scale

At Grab, our iOS build needs have scaled quite significantly, so we went from running some builds on a single Mac Pro to running them on an army of 250+ Mac minis. And so did the cost.

Active jobs trend

The trend in total active jobs is one of the data points we use to understand demand. The following chart is a snapshot of our demand curve in 2022. Peak demand often exceeded the available supply, creating queues for jobs.

We estimated that we would need 200+ machines to comfortably meet peak demand, and projected demand for 400+ machines in 2025.

Figure 4: Active macOS CI/CD jobs

What is our workload

We have several iOS apps that share a common macOS compute cluster for their CI/CD workloads.
This includes, but is not limited to:

The workload primarily involves:

  • Building apps
  • Execution of tests

The Evaluation: Cloud vs colocation vs on-prem

We did a comprehensive comparison and total cost of ownership (TCO) estimation to compare many different options, including cloud vendors and colocation in different places.

Cost of macOS compute

The expense of macOS compute is notably higher, particularly in continuous integration (CI) setups, posing challenges for optimal configuration. Several factors contribute to these increased costs:

  • Apple’s restrictive EULA mandates a minimum lease period of 24 hours for macOS instances, which alters the utilization equation.
  • Economies of scale are not favorable for available macOS hardware configurations compared to alternatives. Optimized server hardware designed for racking offers various configurations that reduce operational costs, unlike macOS options such as Mac Mini and Mac Pro.

For instance, although not a direct comparison, the pricing for GitHub Actions build minutes shows macOS is ten times more costly than Linux. This reflects the pricing GitHub can offer after implementing racking optimizations.

Initially, we conducted rough estimations to assess the total cost of ownership differences between cloud, colocation, and on-premises setups. Even with conservative estimates for manpower and engineering costs, colocation or on-premises setups proved more cost-effective at our scale. This cost disparity became even more pronounced when focusing on cloud vendors providing macOS compute physically located in Southeast Asia.
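A rough estimation of this kind can be sketched as a break-even calculation between renting and owning. The function names and cost categories below are our own simplification, and every number in the usage note is a made-up placeholder rather than Grab's actual pricing.

```python
def rent_tco(machines, monthly_rate, months):
    """Total cost of renting: a flat monthly rate per machine (assumed model)."""
    return machines * monthly_rate * months


def own_tco(machines, unit_price, colo_monthly, staff_monthly, months):
    """Total cost of owning: upfront hardware plus recurring colocation and staff costs."""
    return machines * unit_price + (colo_monthly + staff_monthly) * months


def break_even_month(machines, monthly_rate, unit_price,
                     colo_monthly, staff_monthly, horizon=36):
    """First month at which owning is no more expensive than renting, or None."""
    for month in range(1, horizon + 1):
        owned = own_tco(machines, unit_price, colo_monthly, staff_monthly, month)
        if owned <= rent_tco(machines, monthly_rate, month):
            return month
    return None
```

With placeholder inputs of 200 machines, a rental rate of 100 per machine per month, a unit price of 1000, and 5000 per month each for colocation and staff, `break_even_month(200, 100, 1000, 5000, 5000)` returns 20. Past that month, every additional month of the three-year cycle widens the gap in favour of owning.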

We opted to conduct an in-depth evaluation of the following options:

  • Establishing a macOS cluster at our headquarters in Singapore, which was swiftly dismissed because scalability and cost concerns made it an unsuitable long-term solution.
  • Colocating in a Southeast Asian country where we have operational presence.

Choice of location

As a Southeast Asian company, we maintain offices in each country where we operate, some of which boast advanced data center infrastructures. We focused our location choices on Singapore and Malaysia, assessing them based on several criteria, including:

  • The maturity of existing data center infrastructure.
  • The proximity of the data centers to our offices, ensuring staff availability for infrastructure setup.
  • The cost and reliability of power.
  • The proximity to our Git servers and the expense of establishing direct network connections.

Eventually, we decided to colocate in a data center in Malaysia, one of the emerging data center powerhouses in the region, with relatively low energy costs compared to Singapore.

Choice of Mac hardware

Our choice of hardware model for our build and test workload was guided by a cost-benefit analysis. We decided to use bare-metal setups without virtualization, which simplified the migration; we may revisit this decision in the future. We ensured we neither over-specified nor under-specified the bare-metal hardware. We had a clear understanding of the resource consumption of our most demanding workload on a few reference models, as illustrated in the following graphs.

Figure 5: User and System CPU usage during build operation of our largest iOS mobile codebase
Figure 6: Memory Usage during build operation of our largest iOS mobile codebase

Virtualization vs bare-metal

Virtualization offers significant advantages in managing and provisioning clusters, including the flexibility to create ephemeral builds. However, our experience with macOS virtualization has been mixed. While off-the-shelf virtualization solutions provide maintenance benefits, they often come at the cost of performance or stability.

Key points:

  • Improved Utilization: Virtualization can improve resource utilization by consolidating multiple workloads on fewer physical servers, thereby reducing the overall cost.
  • Performance Penalty: However, the performance penalty associated with virtualization can sometimes negate these cost benefits. This is particularly true for macOS virtualization, where we have observed trade-offs in performance or stability.
  • Evolution of Virtualization: The virtualization space has been evolving and making good progress. We may re-evaluate these solutions in the future as they continue to mature and potentially address current performance and stability issues.

Our conclusion was to stick to bare-metal for the time being, as the benefits didn’t justify the downsides and cost.

Execution

Progressive Migration

Given the scale highlighted above, any disruption to the macOS CI/CD cluster would be hugely disruptive to the company. So we routed part of the workload to the new cluster for a reasonably long period of time, and monitored and compared:

  • Job failure rate
  • Jobs performance
  • Reliability

Once we were confident, we made the full switch and terminated the vendor contracts as they came due.
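The progressive approach can be sketched as two pieces: a deterministic split that sends a fixed share of jobs to the new cluster, and a guardrail that compares failure rates before cutting over. This is an illustrative sketch only; the hashing scheme, the function names, and the one-percentage-point tolerance are assumptions, not a description of our actual CI routing.

```python
import hashlib


def route(job_id: str, new_cluster_share: float) -> str:
    """Deterministically send a fixed share of jobs to the new cluster."""
    bucket = int(hashlib.sha256(job_id.encode()).hexdigest(), 16) % 100
    return "new" if bucket < new_cluster_share * 100 else "old"


def failure_rate(outcomes):
    """Fraction of jobs that failed; each outcome is True for success."""
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


def safe_to_cut_over(old_outcomes, new_outcomes, tolerance=0.01):
    """Guardrail: the new cluster may not fail noticeably more often than the old one."""
    return failure_rate(new_outcomes) <= failure_rate(old_outcomes) + tolerance
```

Hashing the job ID keeps the split stable across retries, so the same job always lands on the same cluster while the comparison is running.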

Figure 7: Total active jobs trend

Result

The migration yielded better results overall than our initial conservative estimates.

  • Cost savings: Estimated over 2.4 million USD over three years
  • Performance improvement: 20% to 40%, depending on the use case
  • Stability: No compromise

The migration is also a strategic investment in our mission to drive Southeast Asia forward, onshoring critical Mac infrastructure into the region.

Cost

We anticipate a three-year replacement cycle for our hardware. While some equipment may be utilized beyond this period, it provides a reasonable lifespan for cost estimation purposes.

The lifecycle of networking equipment involves both physical reliability, following the bathtub curve, and technological obsolescence, often necessitating replacement every 3 to 5 years. Mac minis could become outdated after approximately three years, making the opportunity cost of extended use potentially higher than the net replacement cost after benefits.

Importantly, the experience gained during this cycle could significantly reduce the engineering costs associated with future replacements.

Overall, we project total cost of ownership savings of approximately 2.4 million USD over a three-year period compared to our last cloud-based setup rented from a vendor.

Performance

We measured the performance gains in two of our largest iOS apps at Grab:

Overall gains

The following table summarizes the time measured before and after the migration for the total CI pipeline and for building the app codebase. Measurements are presented at three percentiles (p50, p75, p95).

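App / Metric | Measurement | p50 (min) | p75 (min) | p95 (min)
CI pipeline time, Grab: Taxi Ride, Food Delivery | Before | 43 | 54 | 67
CI pipeline time, Grab: Taxi Ride, Food Delivery | After | 33 | 42 | 49
CI pipeline time, Grab: Taxi Ride, Food Delivery | Gain | 23.26% | 22.22% | 26.87%
App build time, Grab: Taxi Ride, Food Delivery | Before | 10.7 | 13.2 | 17.6
App build time, Grab: Taxi Ride, Food Delivery | After | 6.45 | 9 | 10.8
App build time, Grab: Taxi Ride, Food Delivery | Gain | 39.72% | 31.82% | 38.64%
CI pipeline time, Grab Driver: App for Partners | Before | 47 | 50 | 52
CI pipeline time, Grab Driver: App for Partners | After | 26 | 31 | 32
CI pipeline time, Grab Driver: App for Partners | Gain | 44.68% | 38.00% | 38.46%
App build time, Grab Driver: App for Partners | Before | 10 | 13 | 14
App build time, Grab Driver: App for Partners | After | 6 | 8 | 8.5
App build time, Grab Driver: App for Partners | Gain | 40.00% | 38.46% | 39.29%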
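The gain figures are the relative reduction in time, (before − after) / before. As a quick check, the snippet below recomputes the first row group from the table; the helper name `gain_percent` is ours.

```python
def gain_percent(before, after):
    """Relative time saved, as a percentage rounded to two decimals."""
    return round((before - after) / before * 100, 2)


# CI pipeline time for Grab: Taxi Ride, Food Delivery (minutes, from the table).
pipeline_before = {"p50": 43, "p75": 54, "p95": 67}
pipeline_after = {"p50": 33, "p75": 42, "p95": 49}

gains = {p: gain_percent(pipeline_before[p], pipeline_after[p])
         for p in pipeline_before}
print(gains)  # {'p50': 23.26, 'p75': 22.22, 'p95': 26.87}
```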

The following trend illustrations show how the performance of various tasks has improved while we progressively migrated to the new colocation setup.

Figure 8: 14 day aggregate percentiles of p50, p75 and p95 for total CI pipeline times for the Taxi Ride, Food Delivery codebase
Figure 9: Pipeline time pulse for the Taxi Ride, Food Delivery codebase
Figure 10: 14 day aggregate percentiles of p50, p75 and p95 for total CI pipeline times for the App for Partners codebase

Stability

We measured overall job failure rates between both clusters for extended periods as a guardrail metric and ensured the stability of the new cluster before shutting down the old one.

Colocation setup and rack configuration

The following table provides an overview of the layout of our new Mac mini cluster.

Component | Description | Redundancy
Rack | We have got four 42RU (600x1200x42RU) racks housing 200+ Mac minis, plus some spare racks to house upcoming scheduled capacity upgrades. | Racks have shared resources which have their own redundancy. Generally rack separation does provide some level of redundancy for total compute.
Power | 2 power sources power the cluster. Each rack is powered by these 2 power sources. It is 1U, 2-post rack mount. | Losing 1 power source will reduce 50% of capacity.
Mac mini | We rack 2 Mac minis in a row on a mounting tray, typically racking 70 minis in one rack in total. Except for the first rack which requires extra rack units (RUs) for core switches and firewalls. |
KVM | KVM switches with adaptor for keyboard and mouse emulation when required. | N/A
Networking setup | Networking consists of Core Switches, Access Switches, Firewalls, Internet and Direct Connect Links. | Mostly active/active redundancy.

Provisioning and configuration

Zero-touch provisioning

Zero-touch provisioning is a streamlined method for setting up and configuring devices with minimal manual intervention. This section outlines the process and benefits of zero-touch provisioning using Jamf for Mac minis.

We have a setup that enables these machines to start accepting jobs as soon as they are racked and connected (power and network cables). Here is how it works:

MDM configuration and Automated Device Enrollment (ADE)

ADE, previously known as Device Enrollment Program (DEP), is an Apple service that facilitates automatic enrollment. When a new Mac Mini is acquired and registered in the organization’s ADE account, it is primed for automatic enrollment. Administrators create a PreStage enrollment configuration within Jamf Pro, encompassing account settings (e.g., creating a local admin account, hiding it in Users & Groups, skipping account creation for the user), configuration profiles (defining device settings, security policies, and restrictions), and enrollment packages (including necessary software and scripts).

Device setup: Activation and redirection

Upon powering on and connecting to the internet, the Mac Mini communicates with Apple’s activation servers. The activation servers identify the device as part of the organization’s ADE and redirect it to the Jamf MDM server, ensuring automatic enrollment without user input.

Enrollment and configuration

The Mac Mini enrolls into the Jamf MDM system automatically. Jamf applies predefined configuration profiles to set up the device’s settings, installs required applications based on configured policies, and enforces security policies such as encryption and authentication settings to ensure compliance.

Key benefits of zero-touch provisioning

  • Efficiency: Devices are ready to use right out of the box, reducing the time and effort required by IT staff.
  • Consistency: Ensures that all devices are configured uniformly according to organizational policies.
  • Security: Enforces security policies from the moment the device is first powered on, reducing vulnerabilities.
  • Scalability: Easily manage and configure a large number of devices without manual intervention.

Learnings and insights

Supply chain is as fast as the last essential component you need

The efficiency of a supply chain hinges on the delivery of its final essential component. Despite being a fundamental principle, it’s worth reiterating. Our timely launch was facilitated by a buffer period for unexpected delays. Interestingly, one of the last critical items to arrive was the rack mounting trays. The brief delay underscored the importance of prioritizing and planning for on-time delivery of every essential component, irrespective of its manufacturing simplicity.

Consistently address the question: How will this scale?

From the outset, our goal was to develop a scalable infrastructure. As the cluster expands, tasks such as preparing Mac minis for job acceptance require increasing manual input, which ultimately impacts costs. Hence, zero-touch provisioning becomes essential, as scalability is not merely a desirable feature but a necessity.

Plan and opt in for the power cost structure best suited to your needs

Power cost structures

In a colocation setup, power costs can be billed in several ways, each with pros and cons:

  • Flat Rate Per Circuit: A fixed monthly fee, predictable but limits flexibility (e.g., can’t exceed 80% without extra circuits).
  • Allocated kW: Commit to a fixed power amount (e.g., 100 kW), potentially cheaper but with penalties for overages.
  • Metered Usage: Pay for actual consumption (kWh), good for variable loads but may still charge for space.
  • All-In Space & Power: Single rate covering both, easy to compare but less flexible for upgrades.

We ultimately opted for an allocated kW commitment, sized from conservative equipment power ratings and historical usage. We structured it in phases, with the commitment increasing at each phase to cover future capacity growth.
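To compare these structures for a given load, the billing rules can be written down as small functions. The sketch below covers the allocated kW and metered models plus a phased commitment schedule; the overage rule (excess kW billed at a multiple of the committed rate) and all rates are illustrative assumptions, since real contracts differ.

```python
def allocated_kw_cost(committed_kw, peak_kw, rate_per_kw, overage_multiplier):
    """Monthly cost: pay for the commitment, plus a penalty rate on any excess (assumed rule)."""
    overage_kw = max(0.0, peak_kw - committed_kw)
    return committed_kw * rate_per_kw + overage_kw * rate_per_kw * overage_multiplier


def metered_cost(kwh_used, rate_per_kwh):
    """Monthly cost when billed on actual consumption."""
    return kwh_used * rate_per_kwh


def phased_commitment(start_kw, step_kw, phases):
    """Committed kW per phase, growing by a fixed step as capacity is added."""
    return [start_kw + i * step_kw for i in range(phases)]
```

For example, with a placeholder commitment of 100 kW at a rate of 10 per kW and a 2x overage multiplier, a 90 kW peak costs 1000 while a 110 kW peak costs 1200, which is why the commitment should be sized from conservative power ratings.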

Conclusion

The Mac Cloud Exit wasn’t just a technical migration; it was a strategic move that fundamentally enhanced our engineering efficiency. By onshoring our infrastructure into Southeast Asia, we have achieved 2.4 million USD in projected savings and supercharged our CI pipeline, delivering performance gains of 20% to 40%. This project proves that taking ownership of our core infrastructure can be a major competitive advantage, allowing us to deliver faster and more reliably for our users across the region.

How We Built a Custom Vision LLM to Improve Document Processing at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/custom-vision-llm-at-grab

Introduction

In the world of digital services, accurate extraction of information from user-submitted documents such as identification (ID) cards, driver’s licenses, and registration certificates is a critical first step for processes like electronic know-your-customer (eKYC). This task is especially challenging in Southeast Asia (SEA) due to the diversity of languages and document formats.

We began this journey to address the limitations of traditional Optical Character Recognition (OCR) systems, which struggled with the variety of document templates they had to process. While powerful proprietary Large Language Models (LLMs) were an option, they often fell short in understanding SEA languages, produced errors and hallucinations, and had high latency. Open-source Vision LLMs, on the other hand, were more efficient but not accurate enough for production.

This prompted us to fine-tune and ultimately develop a lightweight, specialized Vision LLM from the ground up. This blog is our account of the entire process.

Figure 1: Simplified overview of how Vision LLM works.

Background

What is a Vision LLM?

You’ve likely heard of LLMs that process text. You give the LLM a text prompt, and it responds with a text output. A Vision LLM takes this a step further by allowing the model to understand images. The basic architecture involves three key components:

  • Image encoder: This component ‘looks’ at an image and converts it into a numerical (vectorized) format.
  • Vision-language projector: It acts as a translator, converting the image’s numerical format into a representation that the language model can understand.
  • Language model: The familiar text-based model that processes the combined image and text input to generate a final text output.
Figure 2: Vision LLM basic architecture.
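The data flow between the three components can be shown with a toy example in plain Python. This is not a real model: the two-number "encoder", the hand-written projector weights, and the function names are all stand-ins that only illustrate how image vectors are mapped into the language model's embedding space and placed alongside text embeddings.

```python
def image_encoder(patches):
    """Toy encoder: summarise each patch (a list of pixel values) as a 2-d vector."""
    return [[sum(p) / len(p), max(p) - min(p)] for p in patches]


def projector(vision_vectors, weights):
    """Linear map from the vision space into the language model's embedding space."""
    out_dim = len(weights[0])
    return [[sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(out_dim)]
            for v in vision_vectors]


def build_llm_input(image_embeddings, text_embeddings):
    """The language model sees projected image vectors followed by text token vectors."""
    return image_embeddings + text_embeddings


patches = [[0, 2, 4], [1, 1, 1]]             # two tiny image patches
weights = [[1, 0, 1], [0, 1, 1]]             # maps 2-d vision vectors to 3-d
image_embeddings = projector(image_encoder(patches), weights)
sequence = build_llm_input(image_embeddings, [[0, 0, 0]])  # plus one text token
```

The key point is that after projection, image and text vectors share one dimensionality, so the language model can attend over them as a single sequence.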

Choosing our base Vision LLM model

We evaluated a range of LLMs capable of performing OCR and Key Information Extraction (KIE). Our exploration of open-source options—including Qwen2VL, miniCPM, Llama3.2 Vision, Pixtral 12B, GOT-OCR2.0, and NVLM 1.0—led us to select Qwen2-VL 2B as our base multimodal LLM. This decision was driven by several critical factors:

  • Efficient size: It is small enough for full fine-tuning on GPUs with limited VRAM resources.
  • SEA language support: Its tokenizer is efficient for languages like Thai and Vietnamese, indicating decent native vocabulary coverage.
  • Dynamic resolution: Unlike models that require fixed-size image inputs, Qwen2-VL can process images in their native resolution. This is crucial for OCR tasks as it prevents the distortion of text characters that can happen when images are resized or cropped.
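To see why native resolution matters for documents, consider how many visual tokens an image produces and how much a forced square resize would distort it. The sketch below assumes each visual token covers roughly a 28x28 pixel area (14-pixel patches merged 2x2, as we understand the Qwen2-VL design); treat that constant and the rounding as approximations.

```python
def visual_tokens(width, height, pixels_per_token_side=28):
    """Approximate visual token count when the image keeps its own aspect ratio."""
    cols = max(1, round(width / pixels_per_token_side))
    rows = max(1, round(height / pixels_per_token_side))
    return rows * cols


def squash_factor(width, height):
    """How unevenly characters are stretched if the image is forced into a square."""
    return max(width, height) / min(width, height)
```

Under these assumptions a 224x224 image yields 64 tokens, while a landscape ID card photographed at 1008x644 yields 828 tokens and would have its text squeezed by a factor of about 1.57 if resized to a square.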

We benchmarked Qwen2VL and miniCPM on Grab’s dataset. Our initial findings showed low accuracy, mainly due to the limited coverage of SEA languages. This motivated us to fine-tune the model to improve OCR and KIE accuracy. Training the LLM can be a very data-intensive and GPU resource-intensive process. Due to this, we had to address these two concerns before progressing further:

  • Data: How do we use open source and internal data effectively to train the model?
  • Model: How do we customize the model to reduce latency but keep high accuracy?

Training dataset generation

Synthetic OCR dataset

We extracted SEA-language text from a large online text corpus, Common Crawl. Then, we used an in-house synthetic data pipeline to generate text images by rendering this text with various fonts, backgrounds, and augmentations.

The dataset contains text in Bahasa Indonesia, Thai, Vietnamese, and English. Each image has a paragraph of random sentences extracted from the dataset as shown in Figure 3.

Figure 3: Two synthetic sample images in Thai language used for model training.

Documint: AI-powered, auto-labelling framework

Our experiments showed that applying document detection and orientation correction significantly improves OCR and information extraction. With an OCR dataset in hand, we next needed a pre-processing dataset to further improve model training.

Documint is an internal platform developed by our team that provides an auto‑labelling and pre‑processing framework for document understanding and prepares high‑quality, labelled datasets. It uses several submodules to execute the full OCR and KIE task. We ran this pipeline over the large volume of cards and documents collected by Grab to extract training labels, which a human reviewer then refined to achieve high label accuracy.

Documint has four main modules:

  • Detection module: Detects the document region in the full picture.
  • Orientation module: Returns the correction angle (e.g., 180 degrees if the document is upside down).
  • OCR module: Returns text values in an unstructured format.
  • KIE module: Returns JSON values from the unstructured text.
Figure 4: Pipeline overview of Documint.
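The four modules chain naturally into a single labelling function. The sketch below uses trivial stand-ins for each module (the real ones are models); the `Document` class, the `key=value;...` text format, and all function names are hypothetical and exist only to show the order of operations and the shape of the output.

```python
from dataclasses import dataclass


@dataclass
class Document:
    pixels: str        # stand-in for image data
    rotation: int = 0  # degrees the document is rotated in the photo


def detect(photo: Document) -> Document:
    """Detection: crop the document region (here, strip surrounding whitespace)."""
    return Document(photo.pixels.strip(), photo.rotation)


def correction_angle(doc: Document) -> int:
    """Orientation: angle that brings the document upright."""
    return (360 - doc.rotation) % 360


def ocr(doc: Document) -> str:
    """OCR: return unstructured text."""
    return doc.pixels


def kie(text: str) -> dict:
    """KIE: turn unstructured text into key-value fields."""
    return dict(pair.split("=", 1) for pair in text.split(";") if "=" in pair)


def label(photo: Document) -> dict:
    doc = detect(photo)
    angle = correction_angle(doc)
    doc.rotation = (doc.rotation + angle) % 360  # document is now upright
    return {"angle": angle, "fields": kie(ocr(doc))}
```

For an upside-down photo such as `Document("  name=Ana;id=123  ", rotation=180)`, `label` returns a correction angle of 180 and the fields `name` and `id`, which is the kind of record a human reviewer would then verify.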

Experimentation

Phase 1: The LoRA experiment

Our first attempt at fine-tuning a Vision LLM used the open-source Qwen2-VL model with a technique called Low-Rank Adaptation (LoRA). LoRA is efficient because it makes lightweight updates to the model’s parameters, minimizing the need for extensive computational resources.
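To illustrate why LoRA is lightweight: instead of updating a full d x k weight matrix W, it trains two small matrices B (d x r) and A (r x k) and adds their product to the frozen W. The plain-Python sketch below shows the parameter arithmetic and the forward pass; the layer size in the usage note is a hypothetical example, not a figure from our training runs.

```python
def full_params(d, k):
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters with rank-r adapters: B is d x r, A is r x k."""
    return r * (d + k)


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x), where W stays frozen and only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]
```

For a hypothetical 1536 x 1536 layer, full fine-tuning updates 2,359,296 weights, whereas rank-8 LoRA trains 24,576, about 1% of the layer. That saving is also the limitation: a low-rank update has less room to absorb entirely new visual patterns.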

We trained the model on our curated document data, which included various document templates in multiple languages. The performance was promising for documents with Latin scripts. Our LoRA fine-tuned Qwen2-VL 2B model achieved high field-level accuracy for Indonesian documents.

However, the fine-tuned model still struggled with:

  • Documents in Thai (a non-Latin script) and Vietnamese (a Latin-based script with dense diacritics).
  • Unstructured layouts with small, dense text.

Phase 2: The power of full fine-tuning

Our experiments revealed a key limitation. While open-source Vision LLMs often have extensive multi-lingual corpus coverage for the LLM decoder’s pre-training, they lack visual text in SEA languages during vision encoder and joint training. This insight drove our decision to pursue full parameter fine-tuning for optimal results.

Drawing from the Large Language and Vision Assistant (LLaVA) methodology, we implemented a two-stage training approach illustrated in Figure 5.

Figure 5: From left to right—two-stage training process.

Stage 1 – Continual pre-training: We first trained the vision components of the model using synthetic OCR datasets that we created for Bahasa Indonesia, Thai, Vietnamese, and English. This helps the model to learn the unique visual patterns of SEA scripts.

Stage 2 – Full-parameter fine-tuning: We then fine-tuned the entire model—vision encoder, projector, and language model—using our task-specific document data.
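The two stages differ mainly in which components are trainable and which data they see, which can be captured as a small schedule. This is a sketch, not our training configuration: in particular, we assume here that the "vision components" in Stage 1 means the vision encoder plus the projector, and the dictionary layout is hypothetical.

```python
COMPONENTS = ("vision_encoder", "projector", "language_model")

STAGES = [
    {
        "name": "continual pre-training",
        "data": "synthetic OCR",
        # Assumption: "vision components" = encoder + projector.
        "trainable": {"vision_encoder", "projector"},
    },
    {
        "name": "full-parameter fine-tuning",
        "data": "task-specific documents",
        "trainable": set(COMPONENTS),
    },
]


def frozen(stage):
    """Components whose weights are left untouched in a given stage."""
    return [c for c in COMPONENTS if c not in stage["trainable"]]


for stage in STAGES:
    print(stage["name"], "| frozen:", frozen(stage))
```

In a real training loop, the frozen list would translate into disabling gradient updates for those components before each stage starts.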

Results:

Table 1: OCR Field level accuracy between the baseline and Qwen2-VL 2B model. (pp: percentage points).

The fully fine-tuned Qwen2-VL 2B model delivered significant improvement, especially on documents that the LoRA model struggled with.

  • Thai document accuracy increased +70pp from baseline.
  • Vietnamese document accuracy rose +40pp from baseline.

Phase 3: Building a lightweight 1B model from scratch

While the Qwen2VL-2B model was a success, the full fine-tuning pushed the limits of GPUs. To optimize resources used and to create a model perfectly tailored to our needs, we decided to build a lightweight Vision LLM (~1B parameters) from scratch.

Our strategy was to combine the best parts of all models:

  • We took the powerful vision encoder from the larger Qwen2-VL 2B model.
  • We paired it with the compact and efficient language decoder from the Qwen2.5 0.5B model.
  • We connected them with an adjusted projector layer to ensure they could work together seamlessly.

This created a custom ~1B parameter Vision LLM optimized for training and deployment.

Four stages in training our custom model

We trained our new model using a comprehensive four-stage process as shown in Figure 6.

Figure 6: From left to right— four stages of model training.

Stage 1 – Projector alignment: The first step was to train the new projector layer to ensure the vision encoder and language decoder could communicate effectively.

Stage 2 – Vision tower enhancement: We then trained the vision encoder on a vast and diverse set of public multimodal datasets, covering tasks like visual Q&A, general OCR, and image captioning to improve its foundational visual understanding.

Stage 3 – Language-specific visual training: We trained the model on two types of synthetic OCR data. Without this stage, performance on non-Latin documents dropped by as much as 10%.

Stage 4 – Task-centric fine-tuning: Lastly, we performed full-parameter fine-tuning on our custom 1B model using our curated document dataset.

The final results are as follows:

Accuracy:

  • It achieved performance comparable to the larger 2B model, staying within a 3pp accuracy gap across most document types. The model also maintained strong generalization when trained on quality-augmented datasets.

Latency:

  • Our model’s latency is far lower than that of the 2B model, traditional OCR models, and external APIs such as ChatGPT or Gemini. One of the biggest weaknesses we identified with external APIs was their P99 latency, which can easily be 3 to 4x the P50 latency and would not be acceptable for Grab’s large-scale rollouts.
Table 2: Performance comparison between Qwen2-VL 2B and 1B sized Vision LLM.
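The P99-to-P50 ratio mentioned above is straightforward to compute from a list of request latencies. The sketch below uses the nearest-rank percentile definition; the function names are ours and the sample latencies in the usage note are invented purely for illustration.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def tail_ratio(samples):
    """How many times slower the P99 request is than the median request."""
    return percentile(samples, 99) / percentile(samples, 50)
```

For 100 invented samples where 98 requests take 100 ms and two take 350 ms and 400 ms, the P50 is 100 ms, the P99 is 350 ms, and `tail_ratio` is 3.5: a service can look fast on average while a small share of users wait several times longer.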

Key takeaways

Our work demonstrates that strategic training with high-quality data enables smaller, specialized models to achieve remarkable efficiency and effectiveness. Here are the critical insights from our extensive experiments:

  • Full fine-tuning is superior: For specialized, non-Latin script domains, full-parameter fine-tuning dramatically outperforms LoRA.
  • Lightweight models are effective: A smaller model (~1B) built from scratch and trained comprehensively can achieve near state-of-the-art results, validating the custom architecture.
  • Base model matters: Starting with a base model that has native support for your target languages is crucial for success.
  • Data is king: Meticulous dataset preprocessing and augmentation play a critical role in achieving consistent and accurate results.
  • Native resolution is a game changer: A model that can handle dynamic image resolutions preserves text integrity and dramatically improves OCR capabilities.

Our journey demonstrates that specialized Vision LLMs can effectively replace traditional OCR pipelines with a single, unified, highly accurate model—opening new possibilities for document processing at scale.

Table 3: Comparison of model types.

What’s next?

As we continue to enhance our Vision LLM capabilities, exciting developments are underway:

  • Smarter, more adaptable models: We’re developing Chain of Thought-based OCR and KIE models to strengthen generalisation capabilities and tackle even more diverse document scenarios.

  • Expanding across Southeast Asia: We’re extending support to all Grab markets, bringing our advanced document processing to Myanmar, Cambodia, and beyond.

References

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Machine-learning predictive autoscaling for Flink

Post Syndicated from Grab Tech original https://engineering.grab.com/ml-predictive-autoscaling-for-flink

Introduction

As Grab transitions to derive more valuable insights from our wealth of operational data, we are witnessing a steep increase in stream-processing applications. Over the past year, the number of Flink applications grew 2.5 times, driven by interest in real-time stream processing and the improved accessibility of developing such applications with Flink SQL. At this scale, it has become crucial for the internal Flink platform team to provide a cost-effective and self-service offering that supports users of diverse backgrounds.

Flink at Grab is deployed in application mode, so each pipeline has its own isolated resources for JobManager and TaskManager. Flink pipeline creators control both the application logic and the deployment configuration that affect throughput and performance, including OSS configurations:

  • Number of TaskManagers and task slots per TaskManager
  • CPU cores per TaskManager
  • Memory per TaskManager

As pipeline creation has become more accessible, users of different backgrounds (analysts, data scientists, engineers, etc.) often struggle to choose a set of configurations that works for their applications. Many go through a long process of trial and error and still end up over-provisioning their applications, leading to significant resource waste. Moreover, pipeline behavior changes over time due to changes in application logic or data patterns, invalidating previous tuning efforts and forcing users to repeat the exercise.

In this article, we focus on addressing the challenge of efficient CPU provisioning for TaskManagers, as CPU constraints are a common bottleneck in our clusters. Our solution specifically targets Flink applications sourcing data from our message bus systems (e.g. Kafka, Change Data Capture streams, DynamoDB Streams), which represent the majority of our use cases. These workloads offer significant opportunities for cost savings due to their clear seasonal patterns, making them an ideal starting point for optimising autoscaling strategies.

Limits of reactive autoscaling

Our initial reactive setup

Our first automated solution relied on Flink’s Adaptive Scheduler in Reactive Mode. In this mode, each Flink application is deployed as its own individual Flink cluster running a dedicated job. The cluster greedily uses all available TaskManagers and scales its job parallelism accordingly. Running on Kubernetes, the cluster relies on the Horizontal Pod Autoscaler (HPA) to scale the number of TaskManager pods based on metrics such as CPU usage or custom metrics such as the pipeline’s consumer latency. While this solution was helpful initially, we quickly observed multiple issues with it.
It is important to note that while the issues below can be solved by fine-tuning, doing so is a tedious trial-and-error effort that only works for specific applications, requiring users to repeat the process for every pipeline they own.

Restart spike: root cause of many issues

When autoscaling a Flink pipeline, the job restarts from the last checkpoint. This triggers an immediate spike in load, as the pipeline must reprocess records from the period between the last checkpoint and job restart, along with any new records that were backlogged at the source during the downtime. As a result, CPU usage and P99 consumer latency typically spike after scaling events, for example at 00:05 and 00:55, as shown in Figure 1. These spikes occur even though there is no change in source topic throughput. In this case, CPU usage surges from 0.5 cores to near the provisioned limit of 2.5 cores, while consumer latency temporarily spikes from sub-second levels to as high as three minutes.

Figure 1: CPU usage and consumer latency spike after a pipeline restart.
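
The size of a restart spike can be estimated with simple arithmetic. The following is a minimal sketch that assumes a constant source rate; the function names and all numbers are hypothetical and are not taken from our pipelines.

```python
def restart_backlog(source_rate_per_s: float,
                    seconds_since_checkpoint: float,
                    downtime_s: float) -> float:
    """Records the job must work through right after a restart.

    Replayed records: everything consumed since the last checkpoint.
    Backlogged records: everything that arrived while the job was down.
    """
    replayed = source_rate_per_s * seconds_since_checkpoint
    backlogged = source_rate_per_s * downtime_s
    return replayed + backlogged


def catch_up_seconds(backlog: float,
                     source_rate_per_s: float,
                     max_processing_rate_per_s: float) -> float:
    """Time to drain the backlog while new records keep arriving."""
    spare = max_processing_rate_per_s - source_rate_per_s
    if spare <= 0:
        return float("inf")  # no spare capacity: the job never catches up
    return backlog / spare


# Hypothetical pipeline: 1,000 records/s, checkpoint 45 s old, 30 s of downtime.
backlog = restart_backlog(1_000, 45, 30)
drain_time = catch_up_seconds(backlog, 1_000, 1_500)
```

With these illustrative inputs the backlog is 75,000 records and takes 150 seconds to drain, during which the pipeline runs at its maximum rate even though the source throughput has not changed.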

Reactive spiral and fluctuation

Typically, HPA scales on metrics such as CPU usage, consumer latency, or backpressure crossing a defined threshold. The challenge arises if these thresholds are misconfigured. The HPA’s reactive nature, when combined with restart spikes, can become detrimental to your Flink application. It piles additional load onto a system that’s already degrading, further amplifying the problem.
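
To make the reactive behaviour concrete, here is a simplified sketch of the replica calculation documented for the Kubernetes HPA: the desired count is the current count scaled by the ratio of the observed metric to its target, rounded up, with a small tolerance band. It ignores stabilisation windows, scaling policies, and pod readiness handling, and the example numbers are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current_replicas * current / target)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


# 4 TaskManagers at 95% CPU against a 60% target.
replicas_after_spike = hpa_desired_replicas(4, 95.0, 60.0)
```

The rule has no notion of why the metric is high. If the 95% reading is a restart spike rather than real demand, the resulting scale-out causes another restart, which produces another spike.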

Figure 2: A reactive scaling incident that demonstrates scaling fluctuations and restarts.

Figure 2 presents a case study of reactive spiral and fluctuation for a pipeline that consumes a Kafka topic with 300 partitions:

  • 07:00: As the source topic throughput increases, the P99 consumer latency rises due to insufficient processing power.
  • 07:15: Reactive scaling is triggered, resulting in a scale-out event. This is reflected in the increased TaskManager and task slot count. The pipeline continues to operate, as there is no increase in restart count.
  • 07:30: As the P99 consumer latency remains high, reactive scaling continues to scale out incrementally. The records-in rate by task rises rapidly as the pipeline reprocesses data from the checkpoint. During this period, the pipeline repeatedly restarts, CPU usage drops significantly, and P99 consumer latency spikes to nearly one hour. This marks the onset of a spiral failure.
  • 08:00: Reactive scaling reaches its upper limit of 300 slots, corresponding to the number of partitions in the source topic. This halts the spiral effect as it cannot scale out any further. Without disruption from autoscaling restart, the pipeline begins to process the backlog since the last successful checkpoint, as observed by the significant increase in records in rate by task. As the pipeline catches up, it eventually stabilizes, and the P99 consumer latency returns to normal levels.
  • 08:30 – 10:15: The P99 consumer latency returns to normal levels, below the threshold. Reactive scaling triggers scale-in events despite the source topic throughput continuing to trend upward. During these scale-in events, P99 latency fluctuates, occasionally spiking up to 15 minutes. However, these fluctuations are not severe enough to prevent the repeated scale-in process.
  • 10:15: The P99 consumer latency rises again, triggering a scale-out event back to the upper limit of 300 slots.
  • 11:15 – 11:45: Despite the source topic throughput maintaining an upward trend, the pipeline undergoes multiple scale-in events in quick succession, encounters latency issues due to reprocessing data from checkpoints, and scales out again shortly after. This is an example of fluctuation after scaling in, resulting in 6 restarts within a 30-minute window.

Limited parallelism constraints

Even with HPA, we frequently encounter a bottleneck when trying to scale our applications’ throughput. This is primarily because some of our connectors, most notably the Kafka connector, don’t inherently support dynamic parallelism changes.
Kafka topics, by design, have a fixed number of partitions. This directly limits the number of parallel consumers we can run. Consequently, once we reach this maximum parallelism for our consumers, we often have to scale up resources (for example, increasing memory/CPU per instance) instead of scaling out (adding more instances).
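
The cap can be expressed in a couple of lines. This is a minimal sketch, assuming each partition is read by at most one source subtask; the function names are illustrative.

```python
def effective_source_parallelism(task_slots: int, partitions: int) -> int:
    """Source subtasks that actually read data: capped by the partition count."""
    return min(task_slots, partitions)


def idle_source_slots(task_slots: int, partitions: int) -> int:
    """Slots that would sit idle at the source if we scaled out past the cap."""
    return max(0, task_slots - partitions)


# Scaling a 300-partition topic out to 400 slots adds no read parallelism.
useful = effective_source_parallelism(400, 300)
wasted = idle_source_slots(400, 300)
```

Past the cap, the only way to add processing power per partition is to give each instance more CPU or memory.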

Predictive Resource Advisor

Assumptions and hypothesis

To tackle the issue of reactive spirals and fluctuations, the new solution should have the following characteristics:

  • Vertical scaling: To tackle the issue of limited parallelism with our dependencies, we should be looking at vertical instead of horizontal scaling.
  • Predictive: Adjust CPU to scale up or down before demand spikes or dips occur, ensuring the system is prepared for changes in workload. This prevents artificial workload increases caused by processing backlogs on top of actual workload increase, further straining the system.
  • Deterministic: The CPU configuration must be precisely calculated based on the workload demand, ensuring predictable and consistent resource allocation. For a given workload, the calculated CPU value should remain the same every time, eliminating variability and uncertainty in scaling decisions.
  • Accurate: Determine the optimal CPU configuration required to handle workload demand in a single, precise calculation, avoiding the inefficiencies of multi-step, trial-and-error tuning.

Key observations

Our solution is conceptualized based on key observations of our Flink applications:

  1. The CPU usage of Flink applications is primarily driven by the input load.
  2. The input load of our Flink applications can be accurately forecasted using time-series forecasting techniques.
  3. Time-based autoscaling that relies solely on historical CPU usage is not robust enough to adapt to evolving workloads. This approach also carries the risk of a negative self-amplifying feedback loop: each autoscaling restart causes a CPU usage spike (as illustrated in Figure 1), which, if anomalies are not properly handled, inflates subsequent CPU calculations.

Model formulation

We then formulate the relationship between CPU usage and input load using a regression model to provide a mathematical framework for predicting CPU requirements based on workload patterns, expressed as:

Ct = f(xt)

In this equation:

  • Ct represents the CPU required at a specific point in time.
  • xt represents the input workload at the corresponding point in time.
  • f() represents the regression function that maps the input load to the required CPU capacity.

Input load, represented by Kafka source topic throughput in our case, is chosen as the independent variable xt because it reflects true business demand and is entirely independent of Flink consumers. This metric is influenced solely by the business logic of upstream producers and remains unaffected by any changes or behaviors in the Flink consumer pipeline.
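
As an illustration of this formulation, the sketch below fits f as a straight line using ordinary least squares. The linear form is an assumption made for the example, since the regression family is not specified here, and the sample data points are invented.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for C = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_cpu(slope: float, intercept: float, throughput: float) -> float:
    """Estimated CPU cores required for a given input throughput."""
    return intercept + slope * throughput


# Invented samples: source throughput (records/s) and observed CPU cores.
throughput = [1000, 2000, 3000, 4000]
cpu_cores = [0.6, 1.1, 1.6, 2.1]

slope, intercept = fit_linear(throughput, cpu_cores)
cpu_at_5000 = predict_cpu(slope, intercept, 5000)  # about 2.6 cores
```

Because the fit is a closed-form calculation, the same training data and the same input load always yield the same CPU value, which matches the deterministic property listed above.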

Proposed solution

Our predictive autoscaler operates through four key stages as shown in Figure 3.

Figure 3: The predictive autoscaling system operates through four key stages.

Stage 1: Workload forecast model

The workload forecast model is a time-series forecasting model trained on actual workload data, specifically source topic throughput from our Kafka cluster (1). This approach is particularly effective as our workload exhibits seasonal patterns. While historical data could be directly used as input for CPU prediction, time-series forecasting offers a more robust solution by enabling the model to account for organic traffic growth over time. Through periodic retraining, the model adapts to evolving workload trends, ensuring more accurate and reliable predictions for resource provisioning.
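
To show why forecasting helps over replaying raw history, here is a deliberately simple stand-in: a seasonal-naive forecast scaled by the growth observed between the last two seasons. It is not the model used in production, which is not described in detail here, and the function name and data are hypothetical.

```python
def seasonal_naive_forecast(history, season_length, horizon):
    """Repeat the last season, scaled by season-over-season growth."""
    if len(history) < 2 * season_length:
        raise ValueError("need at least two full seasons of history")
    last = history[-season_length:]
    previous = history[-2 * season_length:-season_length]
    growth = sum(last) / sum(previous)  # organic growth between seasons
    return [last[i % season_length] * growth for i in range(horizon)]


# Two invented "seasons" of three points each, with 10% growth between them.
history = [100, 200, 300, 110, 220, 330]
forecast = seasonal_naive_forecast(history, season_length=3, horizon=4)
```

Replaying the last season unchanged would predict 110, 220, 330; the growth-adjusted forecast predicts roughly 121, 242, 363, which is the kind of organic trend that periodic retraining is meant to capture.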

Stage 2: Resource prediction model

This follows the regression-based model Ct = f(xt) defined earlier. We use the same source topic throughput from our Kafka cluster (2a) as input feature xt, and the Flink application’s Kubernetes CPU usage metric (2b) as output label Ct for model training. To ensure clean and representative data for model training, we collect CPU usage metrics under conditions that simulate infinite resource availability. We include data exclusively from periods of continuous and stable operation, as determined by latency, uptime, and restart metrics (2b), eliminating biases caused by hardware limitations or disruptions.
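
The data selection step might look like the sketch below. The field names and thresholds are hypothetical, and treating "CPU usage well below the provisioned limit" as a proxy for unconstrained resources is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    throughput: float        # source topic throughput (feature)
    cpu_cores: float         # observed CPU usage (label)
    cpu_limit_cores: float   # provisioned CPU at the time
    p99_latency_s: float     # consumer latency
    restarts_in_window: int  # restarts during the sampling window


def is_clean(s: Sample,
             max_latency_s: float = 5.0,
             max_cpu_utilisation: float = 0.8) -> bool:
    """Keep only samples from stable, unthrottled operation."""
    return (s.restarts_in_window == 0
            and s.p99_latency_s <= max_latency_s
            and s.cpu_cores <= max_cpu_utilisation * s.cpu_limit_cores)


def training_pairs(samples):
    """(feature, label) pairs for the resource prediction model."""
    return [(s.throughput, s.cpu_cores) for s in samples if is_clean(s)]


samples = [
    Sample(1000, 0.6, 2.5, 0.4, 0),    # stable: kept
    Sample(1200, 2.4, 2.5, 0.5, 0),    # near the CPU limit: dropped
    Sample(900, 0.5, 2.5, 180.0, 1),   # restart and high latency: dropped
]
pairs = training_pairs(samples)
```

Dropping samples taken around restarts keeps restart spikes out of the labels, which avoids the self-amplifying feedback loop described in the key observations.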

Stage 3: Workload forecasting

To prepare for autoscaling, we forecast the workload for the future t-hour window (3) using our trained time-series forecast model.

Stage 4: Predict CPU usage

The forecasted workload (3) is fed into the resource prediction model to estimate the CPU usage required to handle that workload. The predicted value is then refined using custom safety feature adjustments to account for variability and ensure stability. This adjusted prediction is passed to the custom autoscaler controller, which evaluates the current CPU configuration of the TaskManager deployment. If the adjusted predicted value differs from the existing CPU configuration, the controller initiates vertical scaling to update the TaskManager deployment accordingly.
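
A minimal sketch of the adjustment and decision step follows. The headroom factor, bounds, and rounding step are hypothetical stand-ins for the safety adjustments, which are not detailed in this article.

```python
import math


def adjusted_cpu(predicted_cores: float,
                 headroom: float = 1.2,
                 min_cores: float = 0.5,
                 max_cores: float = 8.0,
                 step: float = 0.25) -> float:
    """Pad the prediction, round up to a fixed step, then clamp to bounds."""
    padded = predicted_cores * headroom
    rounded = math.ceil(padded / step) * step
    return min(max(rounded, min_cores), max_cores)


def plan_scaling(current_cores: float, predicted_cores: float):
    """Return the new CPU setting, or None if no change is needed."""
    target = adjusted_cpu(predicted_cores)
    if math.isclose(target, current_cores):
        return None  # already at the target: avoid a needless restart
    return target


no_change = plan_scaling(current_cores=1.25, predicted_cores=1.0)
scale_down = plan_scaling(current_cores=2.5, predicted_cores=1.0)
```

Rounding to a coarse step means small changes in the forecast do not trigger a restart, and the same prediction always maps to the same CPU setting.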

Proof of concept and results

Experiment setup

To validate our hypothesis, we present a deep dive into one of our experiments. The pipeline in question features complex business logic and aggregates data from multiple Kafka sources, with a checkpoint interval of one minute and a maximum consumer latency of five minutes.

We set up an experimental pipeline with configurations identical to the production pipeline (the control). Both applications sourced data from the same Kafka topics but sank data to alternative topics to maintain isolation. The Predictive Resource Advisor was enabled on the experimental pipeline, while the control pipeline operated with fixed CPU provisioning.

Results

Figure 4 demonstrates a strong correlation between CPU usage (yellow, green) and the total Kafka topics throughput. The variable CPU provisioning (blue) for the experimental pipeline is calculated by our autoscaler models, which were trained exclusively on data collected from the experiment pipeline. The CPU usage trend of the experimental pipeline closely mirrors that of the control pipeline and remains aligned with the Kafka throughput trend. However, the experimental pipeline’s CPU provisioning is dynamically adjusted to more closely match its actual CPU usage, whereas the control pipeline maintains a static CPU allocation (purple). This illustrates the model’s effectiveness in dynamically adjusting CPU allocation to meet variable workload demands.

Figure 4: CPU usage closely correlates with source throughput for both the experimental and control pipelines.

With no autoscaler enabled, the control pipeline experienced no disruptions and maintained latency (blue) consistently below one second, which is why it is not visible in Figure 5. The experimental pipeline’s latency (red), on the other hand, peaked at just over four minutes during a single disruption window. The other latency spikes were comparable to or lower than the three-minute peak identified earlier in the restart spike analysis. The varied durations and amplitudes of these spikes showed some correlation with heavy Kafka topic throughput during those periods. Importantly, there were only nine autoscaling events throughout the day, resulting in nine restarts for the experimental pipeline.

Figure 5: Autoscaling impacts service-level agreement requirements through latency spikes during scaling events.

Outcome

The Predictive Resource Advisor solution has been successfully deployed across more than 50% of applicable production applications, specifically those consuming from Kafka topics and exhibiting seasonal workload patterns with some tolerance for disruptions. This implementation has delivered significant results across three key areas: stability, efficiency, and user experience.

Stability

With autoscaling becoming more predictable and controllable, our Flink applications experience fewer disruptions caused by autoscaling fluctuations. The machine learning and predictive capabilities of the solution also ensure that applications remain operational during periods of increased workload by automatically learning and adapting to organic growth trends and workload surges.

Efficiency

Applications powered by the Predictive Resource Advisor demonstrated significant improvements in CPU provisioning, aligning CPU configuration more closely with actual requirements, particularly during low traffic periods. As a result of this optimization, these applications achieved average savings of more than 35% in cloud infrastructure cost.

User experience

The solution has simplified the deployment process for users, allowing them to simply deploy Flink applications with default configurations. The Predictive Resource Advisor automatically collects data, trains autoscaling models, and applies configuration changes, thus eliminating the need for manual fine-tuning. This significantly enhances the user experience by streamlining pipeline maintenance and enabling self-service capabilities, such as effortless onboarding. It empowers users to explore and derive value from real-time features with minimal effort.

What’s next?

Our journey doesn’t stop here. We’re continuously working to enhance our predictive autoscaler, with the following key areas of focus:

  • Tackling memory configuration (Predictive Resource Advisor’s next frontier)
    Memory is critical yet often misconfigured, which can lead to unrecoverable failures such as OOMKilled. Our next major goal for the Predictive Resource Advisor is to take on memory tuning, completely removing the burden of complex memory configuration from our users and further empowering them.
  • Enhancing model accuracy
    To further improve the robustness of our predictions, we are actively exploring advanced techniques in input feature engineering and anomaly detection, especially for workloads exhibiting frequent bursting patterns. By refining these aspects, we aim to extend the applicability of our solution to a broader range of Flink applications, including those connected to diverse sources such as change data capture systems or batch-like, spiky workloads, such as the Flink applications powering our real-time data lake.
  • Streamlining model training
    We’re developing a more efficient model training workflow. A particularly exciting avenue we’re investigating is the use of pretrained time-series forecasting models based on large language model architectures.

References

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

A deep dive into BPF LPM trie performance and optimization

Post Syndicated from Matt Fleming original https://blog.cloudflare.com/a-deep-dive-into-bpf-lpm-trie-performance-and-optimization/

It started with a mysterious soft lockup message in production. A single, cryptic line that led us down a rabbit hole into the performance of one of the most fundamental data structures we use: the BPF LPM trie.

BPF trie maps (BPF_MAP_TYPE_LPM_TRIE) are heavily used for things like IP and IP+Port matching when routing network packets, ensuring your request passes through the right services before returning a result. The performance of this data structure is critical for serving our customers, but the speed of the current implementation leaves a lot to be desired. We’ve run into several bottlenecks when storing millions of entries in BPF LPM trie maps, such as entry lookup times taking hundreds of milliseconds to complete and freeing maps locking up a CPU for over 10 seconds. For instance, BPF maps are used when evaluating Cloudflare’s Magic Firewall rules and these bottlenecks have even led to traffic packet loss for some customers.

This post gives a refresher of how tries and prefix matching work, benchmark results, and a list of the shortcomings of the current BPF LPM trie implementation.

A brief recap of tries

If it’s been a while since you last looked at the trie data structure (or if you’ve never seen it before), a trie is a tree data structure (similar to a binary tree) that allows you to store and search for data for a given key and where each node stores some number of key bits.

Searches are performed by traversing a path, which essentially reconstructs the key from the traversal path, meaning nodes do not need to store their full key. This differs from a traditional binary search tree (BST) where the primary invariant is that the left child node has a key that is less than the current node and the right child has a key that is greater. BSTs require that each node store the full key so that a comparison can be made at each search step.

Here’s an example that shows how a BST might store values for the keys:

  • ABC

  • ABCD

  • ABCDEFGH

  • DEF


In comparison, a trie for storing the same set of keys might look like this.


This way of splitting out bits is really memory-efficient when you have redundancy in your data, e.g. prefixes are common in your keys, because that shared data only requires a single set of nodes. It’s for this reason that tries are often used to efficiently store strings, e.g. dictionaries of words – storing the strings “ABC” and “ABCD” doesn’t require 3 bytes + 4 bytes (assuming ASCII), it only requires 3 bytes + 1 byte because “ABC” is shared by both (the exact number of bits required in the trie is implementation dependent).

Tries also allow more efficient searching. For instance, if you wanted to know whether the key “CAR” existed in the BST you are required to go to the right child of the root (the node with key “DEF”) and check its left child because this is where it would live if it existed. A trie is more efficient because it searches in prefix order. In this particular example, a trie knows at the root whether that key is in the trie or not.

This design makes tries perfectly suited for performing longest prefix matches and for working with IP routing using CIDR. CIDR was introduced to make more efficient use of the IP address space (no longer requiring that classes fall into 4 buckets of 8 bits) but comes with added complexity because now the network portion of an IP address can fall anywhere. Handling the CIDR scheme in IP routing tables requires matching on the longest (most specific) prefix in the table rather than performing a search for an exact match.

If searching a trie does a single-bit comparison at each node, that’s a binary trie. If searching compares more bits we call that a multibit trie. You can store anything you like in a trie, including IP and subnet addresses – it’s all just ones and zeroes.
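
Here’s a minimal sketch of a binary trie doing longest prefix matching on IPv4 prefixes. It is an illustration in Python, not the kernel’s implementation: the class and method names are made up, nodes are plain dictionaries, and there is no path or level compression.

```python
import ipaddress


class BinaryTrie:
    """One node per key bit; a node may carry a value for the prefix ending there."""

    def __init__(self):
        self.root = {}

    def insert(self, cidr: str, value) -> None:
        net = ipaddress.ip_network(cidr)
        # Only the first prefixlen bits of the address are part of the key.
        bits = format(int(net.network_address), "032b")[:net.prefixlen]
        node = self.root
        for bit in bits:
            node = node.setdefault(bit, {})
        node["value"] = value

    def longest_prefix_match(self, address: str):
        bits = format(int(ipaddress.ip_address(address)), "032b")
        node = self.root
        best = node.get("value")  # a /0 default route, if one was inserted
        for bit in bits:
            node = node.get(bit)
            if node is None:
                break
            if "value" in node:
                best = node["value"]  # remember the most specific match so far
        return best


routes = BinaryTrie()
routes.insert("10.0.0.0/8", "coarse")
routes.insert("10.1.0.0/16", "specific")
```

Looking up 10.1.2.3 walks past the /8 node, keeps going to the /16 node, and returns the more specific value, while 10.2.0.1 falls back to the /8 entry.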

Nodes in multibit tries use more memory than in binary tries, but since computers operate on multibit words anyhow, it’s more efficient from a microarchitecture perspective to use multibit tries because you can traverse through the bits faster, reducing the number of comparisons you need to make to search for your data. It’s a classic space vs time tradeoff.

There are other optimisations we can use with tries. The distribution of data that you store in a trie might not be uniform and there could be sparsely populated areas. For example, if you store the strings “A” and “BCDEFGHI” in a multibit trie, how many nodes do you expect to use? If you’re using ASCII, you could construct the binary trie with a root node and branch left for “A” or right for “B”. With 8-bit nodes, you’d need another 7 nodes to store “C”, “D”, “E”, “F”, “G”, “H”, “I”.


Since there are no other strings in the trie, that’s pretty suboptimal. Once you hit the first level after matching on “B” you know there’s only one string in the trie with that prefix, and you can avoid creating all the other nodes by using path compression. Path compression replaces nodes “C”, “D”, “E” etc. with a single one such as “I”.


If you traverse the tree and hit “I”, you still need to compare the search key with the bits you skipped (“CDEFGH”) to make sure your search key matches the string. Exactly how and where you store the skipped bits is implementation dependent – BPF LPM tries simply store the entire key in the leaf node. As your data becomes denser, path compression is less effective.

What if your data distribution is dense and, say, all the first 3 levels in a trie are fully populated? In that case you can use level compression and replace all the nodes in those levels with a single node that has 2**3 children. This is how level-compressed tries work, and they are used for IP route lookup in the Linux kernel (see net/ipv4/fib_trie.c).
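
The effect of level compression can be sketched as follows: three single-bit branching steps collapse into one index computation over the top three key bits. The function names and the 8-bit key width are assumptions for the example.

```python
def walk_three_binary_levels(key: int, key_bits: int = 8) -> int:
    """Follow three binary nodes, one bit per step, and return which of the
    eight third-level subtrees we end up in."""
    index = 0
    for depth in range(3):
        bit = (key >> (key_bits - 1 - depth)) & 1
        index = index * 2 + bit
    return index


def level_compressed_index(key: int, key_bits: int = 8, levels: int = 3) -> int:
    """Pick one of 2**levels children with a single shift."""
    return key >> (key_bits - levels)


child = level_compressed_index(0b10110000)  # top three bits are 101
```

Both functions select the same subtree for every key, but the compressed version touches one node instead of three, which means fewer pointer dereferences and fewer chances to miss in the cache.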

There are other optimisations too, but this brief detour is sufficient for this post because the BPF LPM trie implementation in the kernel doesn’t fully use the three we just discussed.

How fast are BPF LPM trie maps?

Here are some numbers from running BPF selftests benchmark on AMD EPYC 9684X 96-Core machines. Here the trie has 10K entries, a 32-bit prefix length, and an entry for every key in the range [0, 10K).

Operation   Throughput      Stddev          Latency
lookup      7.423M ops/s    0.023M ops/s    134.710 ns/op
update      2.643M ops/s    0.015M ops/s    378.310 ns/op
delete      0.712M ops/s    0.008M ops/s    1405.152 ns/op
free        0.573K ops/s    0.574K ops/s    1.743 ms/op

The time to free a BPF LPM trie with 10K entries is noticeably large. We recently ran into an issue where this took so long that it caused soft lockup messages to spew in production.
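
As a quick sanity check on the numbers above, the latency column is close to the reciprocal of the throughput column. The helper below is only illustrative arithmetic; small differences are expected because the benchmark reports the two figures separately.

```python
def latency_ns(throughput_ops_per_s: float) -> float:
    """Average time per operation implied by a throughput figure."""
    return 1e9 / throughput_ops_per_s


lookup_ns = latency_ns(7.423e6)       # reported: 134.710 ns/op
update_ns = latency_ns(2.643e6)       # reported: 378.310 ns/op
delete_ns = latency_ns(0.712e6)       # reported: 1405.152 ns/op
free_ms = latency_ns(0.573e3) / 1e6   # reported: 1.743 ms/op
```

Seen this way, a single free of a 10K-entry trie costs about as much as roughly 13,000 lookups.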

This benchmark gives some idea of worst case behaviour. Since the keys are so densely populated, path compression is completely ineffective. In the next section, we explore the lookup operation to understand the bottlenecks involved.

Why are BPF LPM tries slow?

The LPM trie implementation in kernel/bpf/lpm_trie.c has a couple of the optimisations we discussed in the introduction. It is capable of multibit comparisons at leaf nodes, but since there are only two child pointers in each internal node, if your tree is densely populated with a lot of data that only differs by one bit, these multibit comparisons degrade into single bit comparisons.

Here’s an example. Suppose you store the numbers 0, 1, and 3 in a BPF LPM trie. You might hope that since these values fit in a single 32 or 64-bit machine word, you could use a single comparison to decide which next node to visit in the trie. But that’s only possible if your trie implementation has 3 child pointers in the current node (which, to be fair, most trie implementations do). In other words, you want to make a 3-way branching decision but since BPF LPM tries only have two children, you’re limited to a 2-way branch.

A diagram for this 2-child trie is given below.


The leaf nodes are shown in green with the key, as a binary string, in the center. Even though a single 8-bit comparison is more than capable of figuring out which node has that key, the BPF LPM trie implementation resorts to inserting intermediate nodes (blue) to inject 2-way branching decisions into your path traversal because its parent (the orange root node in this case) only has 2 children. Once you reach a leaf node, BPF LPM tries can perform a multibit comparison to check the key. If a node supported pointers to more children, the above trie could instead look like this, allowing a 3-way branch and reducing the lookup time.


This 2-child design impacts the height of the trie. In the worst case, a completely full trie essentially becomes a binary search tree with height log2(nr_entries) and the height of the trie impacts how many comparisons are required to search for a key.
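
To put rough numbers on that worst case, the sketch below evaluates log2(nr_entries) for a few trie sizes. It assumes a fully populated trie and ignores the cap imposed by the key length; the function name is illustrative.

```python
import math


def worst_case_height(nr_entries: int) -> int:
    """Levels to traverse in a full 2-child trie, rounded up."""
    return math.ceil(math.log2(nr_entries))


heights = {n: worst_case_height(n) for n in (10_000, 100_000, 1_000_000)}
```

That works out to 14 levels for 10K entries, 17 for 100K, and 20 for 1 million, and each level is a separate node that may have to be fetched from memory.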

The above trie also shows how BPF LPM tries implement a form of path compression – you only need to insert an intermediate node where you have two nodes whose keys differ by a single bit. If instead of 3, you insert a key of 15 (0b1111), this won’t change the layout of the trie; you still only need a single node at the right child of the root.


And finally, BPF LPM tries do not implement level compression. Again, this stems from the fact that nodes in the trie can only have 2 children. IP route tables tend to have many prefixes in common and you typically see densely packed tries at the upper levels which makes level compression very effective for tries containing IP routes.

Here’s a graph showing how the lookup throughput for LPM tries (measured in million ops/sec) degrades as the number of entries increases, from 1 entry up to 100K entries.


Once you reach 1 million entries, throughput is around 1.5 million ops/sec, and continues to fall as the number of entries increases.


Why is this? Initially, this is because of the L1 dcache miss rate. All of those nodes that need to be traversed in the trie are potential cache miss opportunities.


As you can see from the graph, L1 dcache miss rate remains relatively steady and yet the throughput continues to decline. At around 80K entries, dTLB miss rate becomes the bottleneck.


Because BPF LPM tries dynamically allocate individual nodes from a freelist of kernel memory, these nodes can live at arbitrary addresses. This means that traversing a path through a trie will almost certainly incur cache misses and potentially dTLB misses. This gets worse as the number of entries, and the height of the trie, increases.


Where do we go from here?

By understanding the current limitations of the BPF LPM trie, we can now work towards building a more performant and efficient solution for the future of the Internet.

We’ve already contributed these benchmarks to the upstream Linux kernel, but that’s only the start. We have plans to improve the performance of BPF LPM tries, particularly the lookup function which is heavily used for our workloads. This post covered a number of optimisations that are already used by the net/ipv4/fib_trie.c code, so a natural first step is to refactor that code so that a common Level Compressed trie implementation can be used. Expect future blog posts to explore this work in depth.

If you’re interested in looking at more performance numbers, Jesper Brouer has recorded some here: https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org.

If the Linux kernel, performance, or optimising data structures excites you, our engineering teams are hiring.

Modernising Grab’s model serving platform with NVIDIA Triton Inference Server

Post Syndicated from Grab Tech original https://engineering.grab.com/modernising-grab-model-serving-platform

Introduction

Catwalk is Grab’s machine learning (ML) model serving platform, designed to enable data scientists and engineers to deploy production-ready inference APIs. Currently, Catwalk powers hundreds of ML models and online deployments. To accommodate this growth, the platform has adapted to the rapidly evolving machine learning technology landscape. This involved progressively integrating support for multiple frameworks such as ONNX, PyTorch, TensorFlow, and vLLM. While this approach initially worked for a limited number of frameworks, it soon became unsustainable as maintaining various inference engines, ensuring backward compatibility, and managing deprecated legacy components (such as the ONNX server) introduced significant technical debt. Over time, this resulted in degraded platform performance, with increased latency, reduced throughput, and escalating costs. These issues began to impact users, as larger models could no longer be served efficiently or cost-effectively by legacy components. Recognising the need for change, the team revisited the platform’s design to address these challenges.

Evaluation and implementation

After evaluating other industry-leading model serving platforms and studying best practices, we decided to conduct an in-depth analysis of NVIDIA Triton. Triton offers significant advantages as an inference engine, including:

  • Multi-framework support: Compatibility with major ML frameworks, including ONNX, PyTorch, and TensorFlow, ensuring versatility and broad applicability.

  • Unified inference interface: Provides a single, consistent API for various ML frameworks, simplifying user interaction and reducing overhead when switching between models or frameworks.

  • Hardware optimisation: Although optimised for NVIDIA GPUs, Triton also delivers strong performance on CPU-only environments and specialised instances like AWS Inferentia.

  • Up-to-date support: Continuously updated by upstream to support the latest optimisation and features from upstream ML frameworks, ensuring access to cutting-edge capabilities.

  • Advanced inference features: Includes capabilities like dynamic batching and model ensembling (model pipelining), which enhance throughput and efficiency for complex ML workflows.

Our extensive benchmarking demonstrated that NVIDIA Triton delivers substantial enhancements in both performance and service stability compared to our existing solutions.

We are now working towards consolidating the various inference engines we manage into a unified, all-in-one Triton engine, beginning with ONNX adoption as the first phase of implementation.

In this blog, we aim to share our journey of adopting Triton, from initial benchmarking results on one of Grab’s core models facing performance challenges to the development of the “Triton manager”, a component designed to integrate Triton into our platform seamlessly and with minimal user disruption. Ultimately, more than 50% of online deployments were successfully migrated to Triton, with some of our critical systems achieving a 50% improvement in tail latency.

Exploratory benchmark results

We conducted rigorous testing of Triton against our existing ONNX server under varying levels of request traffic.

Table 1: Benchmark results of Triton against Catwalk ONNX server.

During testing with a transformer-based model, Triton demonstrated the ability to handle at least 5 times the traffic while maintaining excellent latency. Additionally, its performance was further enhanced with features like batching enabled, and there is potential for even greater optimisation by converting the model to TensorRT, leveraging GPU support.

Through profiling, we learned that a handful of ONNX Runtime knobs have an outsized impact on throughput. One low-effort, high-return tweak is to set the intra-op thread count to match the number of physical CPU cores. In most cases, this single change yields a healthy performance lift, sparing us from time-consuming, model-by-model micro-optimisation.
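As a rough illustration of this tweak, the sketch below renders a model-configuration fragment that pins the intra-op thread count. It assumes the model is served through Triton's ONNX Runtime backend and that the backend reads an `intra_op_thread_count` parameter from `config.pbtxt`, which is our understanding of the backend's documentation and should be checked against the Triton version in use. The helper name and the core count are hypothetical and not taken from Catwalk.

```python
def intra_op_threads_fragment(physical_cores: int) -> str:
    """Render a config.pbtxt fragment that sets ONNX Runtime intra-op threads.

    Assumption: the Triton ONNX Runtime backend honours the
    `intra_op_thread_count` parameter. The caller supplies the number of
    physical cores, since the Python standard library only reports
    logical cores.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be a positive integer")
    return (
        "parameters {\n"
        '  key: "intra_op_thread_count"\n'
        f'  value: {{ string_value: "{physical_cores}" }}\n'
        "}\n"
    )


# Hypothetical example: a serving node with 8 physical cores.
fragment = intra_op_threads_fragment(8)
print(fragment)
```

Generating the fragment from a single value keeps the setting consistent across models, which fits the goal of avoiding model-by-model tuning.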

Adopting Triton at scale

While the benchmark results clearly demonstrate Triton’s advantages, the primary challenge was ensuring a seamless migration, ideally with minimal user involvement. Given the high frequency of migrations within our company, even exceptional performance improvements are often insufficient to fully motivate internal users to adopt new systems. From our point of view, a successful migration required:

  • Maintaining API compatibility with existing systems.
  • Ensuring zero downtime.
  • Preserving all existing functionality while adding new capabilities.
  • Minimising disruption to downstream services and users.

To streamline the migration process, we opted to manage it centrally within our platform, rather than relying on individual users to address the details themselves.

We landed on the idea of offering Triton to our users as a drop-in replacement for the old server, with the help of a new component, “Triton manager”. The Triton manager is a critical component that glues Triton to the Catwalk ecosystem. It consists of two major components: Triton server manager and Triton proxy.

Triton server manager is designed as the entry point of our Catwalk Triton. It downloads the model from remote storage, runs verification on the model files, prepares per-model configurations based on users’ customisation, and finally launches the Triton server. It also periodically checks the server’s health and provides observability into the server’s status.
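The periodic health check can be sketched as a small polling loop. This is a minimal sketch, not the Triton server manager's actual code: it assumes the server exposes the standard `/v2/health/ready` HTTP endpoint, and the function names, failure threshold, and interval are hypothetical.

```python
import time
import urllib.request
from typing import Callable


def http_ready_probe(base_url: str, timeout_s: float = 1.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        url = f"{base_url}/v2/health/ready"
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Connection errors, timeouts, and HTTP error statuses all end up here.
        return False


def watch(
    probe: Callable[[], bool],
    checks: int,
    failure_threshold: int = 3,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` up to `checks` times.

    Returns False as soon as `failure_threshold` consecutive probes fail,
    and True if the loop finishes without reaching that threshold.
    """
    failures = 0
    for i in range(checks):
        if probe():
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures >= failure_threshold:
                return False
        if i < checks - 1:
            sleep(interval_s)
    return True
```

A caller might run `watch(lambda: http_ready_probe("http://localhost:8000"), checks=12)` and restart or alert when it returns False. Requiring several consecutive failures avoids reacting to a single slow response.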

Triton proxy provides backward compatibility to the existing clients. It hosts endpoints that translate requests from the older API and forward them to the Triton server. The proxy layer plays a crucial role in facilitating a seamless transition from our legacy servers, eliminating the need for user code changes. The conversion logic is designed to prioritise performance, ensuring minimal overhead. Extensive benchmarks were conducted during development to validate and optimise its efficiency.
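To make the translation step concrete, here is a minimal sketch of converting a legacy-style request into the JSON body used by the KServe v2 inference protocol that Triton speaks over HTTP. The legacy request shape (a mapping of input names to nested lists) is an assumption made for illustration, as are the function names and the default `FP32` datatype. The real proxy's formats are not described in this post.

```python
from typing import Any, Dict, List


def infer_shape(value: Any) -> List[int]:
    """Derive the tensor shape from a regular nested list."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return shape


def flatten(value: Any) -> List[Any]:
    """Flatten a nested list into row-major order."""
    if isinstance(value, list):
        out: List[Any] = []
        for item in value:
            out.extend(flatten(item))
        return out
    return [value]


def to_v2_request(legacy: Dict[str, Any], datatype: str = "FP32") -> Dict[str, Any]:
    """Translate a hypothetical legacy payload into a v2 inference request."""
    return {
        "inputs": [
            {
                "name": name,
                "shape": infer_shape(tensor),
                "datatype": datatype,
                "data": flatten(tensor),
            }
            for name, tensor in legacy["inputs"].items()
        ]
    }


# Hypothetical legacy payload with one input of shape [2, 3].
legacy_request = {"inputs": {"features": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]}}
v2_request = to_v2_request(legacy_request)
```

Keeping the conversion to a single pass over the payload is one way to keep the proxy's overhead low, in line with the performance goal described above.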

Figure 1: High-level architecture for Triton Inference Server (TIS) deployment at Catwalk.

Finally, a special mode in the Triton server manager is implemented to allow the Triton Inference Server (TIS) to be backward compatible with the command line interface of the existing ONNX runtime server used in Catwalk.

We plan to enhance the Triton manager to ensure backward compatibility with other ML frameworks, as part of our efforts to onboard additional frameworks seamlessly.

Rollout result

Within just 10 days of Triton’s availability, we successfully rolled it out to over 50% of our online model deployments. Thanks to rigorous testing for backward compatibility, the rollout was seamless, with most users unaware of the transition while benefiting from the improved performance.

Triton’s impacts on critical models

Figure 2: Latency before and after rollout in ms. Blue line: XGBoost-based model. Orange line: transformer-based model. Solid line: average. Dashed line: p99

We’ve observed significant performance improvements in our business-critical models that have high demands for stability. Latency improvements were consistently observed in all models, especially in the models that suffered from highly volatile request traffic. For some larger transformer models, the p90 latency decreased dramatically from 120ms to 20ms, and the average latency remained steady at 4ms. Smaller XGBoost models maintained their average latency at 2ms across regions.

Figure 3: Number of pods, before (blue line) and after (purple line) rollout in another model.

Triton has delivered significant cost savings for certain models, with some achieving over 90% reductions due to its advanced optimisations. These improvements have come alongside enhanced performance and reliability.

It is worth noting that Triton was initially rolled out with limited capabilities to prioritise backward compatibility and ensure a seamless migration. However, we’ve noticed that high tail latency remains an issue when facing request spikes for larger models in production. To address this, we are working on enabling batching through Triton to minimise tail latency during traffic surges. This effort will involve close collaboration with model owners to optimise the capacity of each Triton instance further.

Early cost impact of the migration

To gauge the financial upside of migrating to Triton, we took a snapshot of 11 production ML services that had already completed the migration. For every ML service, we compared its infrastructure spend over the 14 days before the cut-over with the 14 days after.

Despite the staggered migration dates, the trend was uniform: average spend fell by ~20% across this small cohort within 14 days. As more models and applications migrate, we expect the absolute dollar savings to scale proportionally.
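The before-and-after comparison amounts to a simple calculation, sketched below. The service names and spend figures are made up for illustration and are not Grab's data. The sketch also assumes the average is taken over per-service percentage changes, which is one reasonable reading of the comparison above.

```python
from statistics import mean
from typing import Dict, Tuple


def percent_change(before: float, after: float) -> float:
    """Percentage change in spend relative to the pre-migration window."""
    if before <= 0:
        raise ValueError("pre-migration spend must be positive")
    return (after - before) / before * 100.0


def average_change(spend: Dict[str, Tuple[float, float]]) -> float:
    """Mean of per-service percentage changes.

    `spend` maps a service name to (spend in the 14 days before cut-over,
    spend in the 14 days after cut-over).
    """
    return mean(percent_change(before, after) for before, after in spend.values())


# Hypothetical figures for three services, in arbitrary currency units.
example_spend = {
    "service-a": (100.0, 80.0),
    "service-b": (200.0, 150.0),
    "service-c": (50.0, 45.0),
}
avg = average_change(example_spend)
print(f"average change: {avg:.1f}%")
```

Because each service is compared against its own 14-day baseline, staggered migration dates do not distort the comparison.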

Takeaways

Initial results are aligned with our benchmarks for the Triton migration. With improved performance and cost reduction, we expect model owners to either upgrade their model sizes or allow for higher Queries Per Second (QPS). While making further progress with the overall Triton migration, the model serving platform team will continue to monitor cost differences and provide consultation to model owners who seek further optimisation for their deployments.

Another key takeaway is the painless migration to Triton for our internal users. Rather than asking internal users to make the necessary code changes, our team dedicated significant time to providing Triton as a drop-in inference engine to minimise any inconvenience of migration.

Big appreciation to Shengwei Pang from the Geo team, Khai Hung Do, Nhat Minh Nguyen, and Siddharth Pandey from the Catwalk team, along with Richard Ryu from the PM team and Padarn George Wilson for the sponsorship.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

15 years of helping build a better Internet: a look back at Birthday Week 2025

Post Syndicated from Nikita Cano original https://blog.cloudflare.com/birthday-week-2025-wrap-up/

Cloudflare launched fifteen years ago with a mission to help build a better Internet. Over that time the Internet has changed and so has what it needs from teams like ours. In this year’s Founder’s Letter, Matthew and Michelle discussed the role we have played in the evolution of the Internet, from helping encryption grow from 10% to 95% of Internet traffic to more recent challenges like how people consume content.

We spend Birthday Week every year releasing the products and capabilities we believe the Internet needs at this moment and around the corner. Previous Birthday Weeks saw the launch of IPv6 gateway in 2011, Universal SSL in 2014, Cloudflare Workers and unmetered DDoS protection in 2017, Cloudflare Radar in 2020, R2 Object Storage with zero egress fees in 2021, post-quantum upgrades for Cloudflare Tunnel in 2022, and Workers AI and Encrypted Client Hello in 2023. And those are just a sample of the launches.

This year’s themes focused on helping prepare the Internet for a new model of monetization that encourages great content to be published, fostering more opportunities to build community both inside and outside of Cloudflare, and evergreen missions like making more features available to everyone and constantly improving the speed and security of what we offer.

We shipped a lot of new things this year. In case you missed the dozens of blog posts, here is a breakdown of everything we announced during Birthday Week 2025. 

Monday, September 22

| What | In a sentence … |
| --- | --- |
| Help build the future: announcing Cloudflare’s goal to hire 1,111 interns in 2026 | To invest in the next generation of builders, we announced our most ambitious intern program yet with a goal to hire 1,111 interns in 2026. |
| Supporting the future of the open web: Cloudflare is sponsoring Ladybird and Omarchy | To support a diverse and open Internet, we are now sponsoring Ladybird (an independent browser) and Omarchy (an open-source Linux distribution and developer environment). |
| Come build with us: Cloudflare’s new hubs for startups | We are opening our office doors in four major cities (San Francisco, Austin, London, and Lisbon) as free hubs for startups to collaborate and connect with the builder community. |
| Free access to Cloudflare developer services for non-profit and civil society organizations | We extended our Cloudflare for Startups program to non-profits and public-interest organizations, offering free credits for our developer tools. |
| Introducing free access to Cloudflare developer features for students | We are removing cost as a barrier for the next generation by giving students with .edu emails 12 months of free access to our paid developer platform features. |
| Cap’n Web: a new RPC system for browsers and web servers | We open-sourced Cap’n Web, a new JavaScript-native RPC protocol that simplifies powerful, schema-free communication for web applications. |
| A lookback at Workers Launchpad and a warm welcome to Cohort #6 | We announced Cohort #6 of the Workers Launchpad, our accelerator program for startups building on Cloudflare. |

Tuesday, September 23

| What | In a sentence … |
| --- | --- |
| Building unique, per-customer defenses against advanced bot threats in the AI era | New anomaly detection system that uses machine learning trained on each zone to build defenses against AI-driven bot attacks. |
| Why Cloudflare, Netlify, and Webflow are collaborating to support Open Source tools | To support the open web, we joined forces with Webflow to sponsor Astro, and with Netlify to sponsor TanStack. |
| Launching the x402 Foundation with Coinbase, and support for x402 transactions | We are partnering with Coinbase to create the x402 Foundation, encouraging the adoption of the x402 protocol to allow clients and services to exchange value on the web using a common language. |
| Helping protect journalists and local news from AI crawlers with Project Galileo | We are extending our free Bot Management and AI Crawl Control services to journalists and news organizations through Project Galileo. |
| Cloudflare Confidence Scorecards – making AI safer for the Internet | Automated evaluation of AI and SaaS tools, helping organizations to embrace AI without compromising security. |

Wednesday, September 24

| What | In a sentence … |
| --- | --- |
| Automatically Secure: how we upgraded 6,000,000 domains by default | Our Automatic SSL/TLS system has upgraded over 6 million domains to more secure encryption modes by default and will soon automatically enable post-quantum connections. |
| Giving users choice with Cloudflare’s new Content Signals Policy | The Content Signals Policy is a new standard for robots.txt that lets creators express clear preferences for how AI can use their content. |
| To build a better Internet in the age of AI, we need responsible AI bot principles | A proposed set of responsible AI bot principles to start a conversation around transparency and respect for content creators’ preferences. |
| Securing data in SaaS to SaaS applications | New security tools to give companies visibility and control over data flowing between SaaS applications. |
| Securing today for the quantum future: WARP client now supports post-quantum cryptography (PQC) | Cloudflare’s WARP client now supports post-quantum cryptography, providing quantum-resistant encryption for traffic. |
| A simpler path to a safer Internet: an update to our CSAM scanning tool | We made our CSAM Scanning Tool easier to adopt by removing the need to create and provide unique credentials, helping more site owners protect their platforms. |

Thursday, September 25

| What | In a sentence … |
| --- | --- |
| Every Cloudflare feature, available to everyone | We are making every Cloudflare feature, starting with Single Sign On (SSO), available for anyone to purchase on any plan. |
| Cloudflare’s developer platform keeps getting better, faster, and more powerful | Updates across Workers and beyond for a more powerful developer platform – such as support for larger and more concurrent Container images, support for external models from OpenAI and Anthropic in AI Search (previously AutoRAG), and more. |
| Partnering to make full-stack fast: deploy PlanetScale databases directly from Workers | You can now connect Cloudflare Workers to PlanetScale databases directly, with connections automatically optimized by Hyperdrive. |
| Announcing the Cloudflare Data Platform | A complete solution for ingesting, storing, and querying analytical data tables using open standards like Apache Iceberg. |
| R2 SQL: a deep dive into our new distributed query engine | A technical deep dive on R2 SQL, a serverless query engine for petabyte-scale datasets in R2. |
| Safe in the sandbox: security hardening for Cloudflare Workers | A deep-dive into how we’ve hardened the Workers runtime with new defense-in-depth security measures, including V8 sandboxes and hardware-assisted memory protection keys. |
| Choice: the path to AI sovereignty | To champion AI sovereignty, we’ve added locally-developed open-source models from India, Japan, and Southeast Asia to our Workers AI platform. |
| Announcing Cloudflare Email Service’s private beta | We announced the Cloudflare Email Service private beta, allowing developers to reliably send and receive transactional emails directly from Cloudflare Workers. |
| A year of improving Node.js compatibility in Cloudflare Workers | There are hundreds of new Node.js APIs now available that make it easier to run existing Node.js code on our platform. |

Friday, September 26

| What | In a sentence … |
| --- | --- |
| Cloudflare just got faster and more secure, powered by Rust | We have re-engineered our core proxy with a new modular, Rust-based architecture, cutting median response time by 10ms for millions. |
| Introducing Observatory and Smart Shield | New monitoring tools in the Cloudflare dashboard that provide actionable recommendations and one-click fixes for performance issues. |
| Monitoring AS-SETs and why they matter | Cloudflare Radar now includes Internet Routing Registry (IRR) data, allowing network operators to monitor AS-SETs to help prevent route leaks. |
| An AI Index for all our customers | We announced the private beta of AI Index, a new service that creates an AI-optimized search index for your domain that you control and can monetize. |
| Introducing new regional Internet traffic and Certificate Transparency insights on Cloudflare Radar | Sub-national traffic insights and Certificate Transparency dashboards for TLS monitoring. |
| Eliminating Cold Starts 2: shard and conquer | We have reduced Workers cold starts by 10x by implementing a new “worker sharding” system that routes requests to already-loaded Workers. |
| Network performance update: Birthday Week 2025 | The TCP Connection Time (Trimean) graph shows that we have the fastest TCP connection time in 40% of measured ISPs – and the fastest across the top networks. |
| How Cloudflare uses performance data to make the world’s fastest global network even faster | We are using our network’s vast performance data to tune congestion control algorithms, improving speeds by an average of 10% for QUIC traffic. |

Come build with us!

Helping build a better Internet has always been about more than just technology. Like the announcements about interns or working together in our offices, the community of people behind helping build a better Internet matters to its future. This week, we rolled out our most ambitious set of initiatives ever to support the builders, founders, and students who are creating the future.

For founders and startups, we are thrilled to welcome Cohort #6 to the Workers Launchpad, our accelerator program that gives early-stage companies the resources they need to scale. But we’re not stopping there. We’re opening our doors, literally, by launching new physical hubs for startups in our San Francisco, Austin, London, and Lisbon offices. These spaces will provide access to mentorship, resources, and a community of fellow builders.

We’re also investing in the next generation of talent. We announced free access to the Cloudflare developer platform for all students, giving them the tools to learn and experiment without limits. To provide a path from the classroom to the industry, we also announced our goal to hire 1,111 interns in 2026 — our biggest commitment yet to fostering future tech leaders.

And because a better Internet is for everyone, we’re extending our support to non-profits and public-interest organizations, offering them free access to our production-grade developer tools, so they can focus on their missions.

Whether you’re a founder with a big idea, a student just getting started, or a team working for a cause you believe in, we want to help you succeed.

Until next year

Thank you to our customers, our community, and the millions of developers who trust us to help them build, secure, and accelerate the Internet. Your curiosity and feedback drive our innovation.

It’s been an incredible 15 years. And as always, we’re just getting started!

Introducing Observatory and Smart Shield — see how the world sees your website, and make it faster in one click

Post Syndicated from Tim Kadlec original https://blog.cloudflare.com/introducing-observatory-and-smart-shield/

Modern users expect instant, reliable web experiences. When your application is slow, they don’t just complain; they leave. Even delays as small as 100 ms have been shown to have a measurable impact on revenue, conversions, bounce rate, engagement, and more.

If you’re responsible for delivering on these expectations to the users of your product, you know there are many monitoring tools that show you how visitors experience your website, and can let you know when things are slow or causing issues. This is essential, but we believe understanding the condition is only half the story. The real value comes from integrating monitoring and remedies in the same view, giving customers the ability to quickly identify and resolve issues.

That’s why today, we’re excited to launch the new and improved Observatory, now in open beta. This monitoring and observability tool goes beyond charts and graphs, by also telling you exactly how to improve your application’s performance and resilience, and immediately showing you the impact of those changes. And we’re releasing it to all subscription tiers (including Free!), available today.


But wait, there’s more! To make your users’ experience in Cloudflare even faster, we’re launching Smart Shield, available today for all subscription tiers. Using Observatory, you can pinpoint performance bottlenecks, and for many of the most common issues, you can now apply the fix in just a few clicks with our Smart Shield product. Double the fun!

Our unique perspective: leveraging data from 20% of the web

Every day, Cloudflare handles traffic for over 20% of the web, giving us a unique vantage point into what makes websites faster and more resilient. We built Observatory to take advantage of this position, uniting data that is normally scattered across different tools — including real-user data, synthetic testing, error rates, and backend telemetry — into a single platform. This gives you a complete, cohesive picture of your application’s health end-to-end, in one spot, and enables you to easily identify and resolve performance issues.

For this launch, we’re bringing together:

  • Real-user data: See how your application performs for real people, in the real world.

  • Back-end telemetry: Break down the lifecycle of a request to pinpoint areas for improvement.

  • Error rates: Understand the stability of your application at both the edge and origin.

  • Cache hit ratios: Ensure you’re maximizing the performance of your configuration.

  • Synthetic testing: Proactively test and monitor key endpoints with powerful, accurate simulations.

Let’s take a quick look at each data set to see how we use them in Observatory.

Real-user data

There are two primary forms of data collection: real-user data and synthetic data. Real-user data are performance metrics collected from real traffic, from real visitors, to your application. It’s how users are actually seeing your application perform in the real world. It’s unpredictable, and covers every scenario.

Synthetic data is data collected using some sort of simulated test (loading a site in a headless browser, making network requests from a testing system to an endpoint, etc.). Tests are run under a predefined set of characteristics — location, network speed, etc. — to provide a consistent baseline.

Both forms of data have their uses, and companies with a strongly established culture of operational excellence tend to use both.

The first data you’ll see when you visit Observatory is real-user data collected with Real User Monitoring (RUM), with a particular focus on the Core Web Vital metrics.


This is very intentional.

Real-user data should be the source of truth when it comes to measuring performance and resiliency of your application. Even the best of synthetic data sources are always going to be an approximation. They cannot cover every possible scenario, and because they are being run from a lab environment, they will not always reveal issues that may be more sporadic and unpredictable.

Real-user data is also the best representation of what your users are experiencing when they access your site and, at the end of the day, that’s why we focus on improving performance, resiliency, and security for our users.

We believe so strongly in the importance of every company having access to accurate, detailed RUM data that we are providing it for free, to all accounts. In fact, we’re about to make our privacy-first analytics — which doesn’t track individual users for analytics — available by default for all free zones (excluding data from EU or UK visitors), no setup necessary. We believe the right thing is arming everyone with detailed, actionable, real-user data, and we want to make it easy.

Backend telemetry

Front-end performance metrics are our best proxy for understanding the actual user experience of an application and, as a result, they work great as key performance indicators (KPIs).

But they’re not enough. Every primary metric should have some level of supporting diagnostic metrics that help us understand why our user metrics are performing like they are — so that we can quickly identify issues, bottlenecks, and areas of improvement.


While the industry has largely, and rightfully, moved on from Time to First Byte (TTFB) as a primary metric of focus, it still has value as a diagnostic metric. In fact, we analyzed our RUM data and found a very strong connection between Time to First Byte and Largest Contentful Paint.

Google’s recommended thresholds for Time to First Byte are:

  • Good: <= 800ms

  • Needs Improvement: > 800ms and <= 1800ms

  • Poor: > 1800ms

Similarly, their official thresholds for Largest Contentful Paint are:

  • Good: <= 2500ms

  • Needs Improvement: > 2500ms and <= 4000ms

  • Poor: > 4000ms
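These thresholds translate directly into a small rating function. The sketch below is illustrative only: the function and constant names are hypothetical, and the boundary values are the ones listed above.

```python
# Upper bounds in milliseconds for "good" and "needs improvement",
# taken from the thresholds listed above.
THRESHOLDS_MS = {
    "TTFB": (800, 1800),
    "LCP": (2500, 4000),
}


def rate(metric: str, value_ms: float) -> str:
    """Bucket a measurement into good / needs improvement / poor."""
    good_max, needs_improvement_max = THRESHOLDS_MS[metric]
    if value_ms <= good_max:
        return "good"
    if value_ms <= needs_improvement_max:
        return "needs improvement"
    return "poor"


# A page with a 950ms TTFB and a 2400ms LCP.
print(rate("TTFB", 950), "/", rate("LCP", 2400))
```

Both boundaries are inclusive on the lower bucket, so a TTFB of exactly 800ms still rates as "good" and exactly 1800ms as "needs improvement".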

Looking across over 9 billion events, we found that when compared to the average site, sites with a “poor” (>1800ms) TTFB are:

  • 70.1 percentage points less likely to have a “good” LCP

  • 21.9 percentage points more likely to have a “needs improvement” LCP

  • 48.2 percentage points more likely to have a “poor” LCP
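A percentage-point gap is the difference between two shares, not a ratio. The sketch below shows one way such gaps could be computed, assuming the baseline is the share across all sites. The toy data is invented for illustration and is unrelated to the 9 billion events analysed above.

```python
from collections import Counter
from typing import Dict, List, Tuple

RATINGS = ("good", "needs improvement", "poor")


def lcp_shares(lcp_ratings: List[str]) -> Dict[str, float]:
    """Share of each LCP rating, in percent."""
    if not lcp_ratings:
        raise ValueError("need at least one rating")
    counts = Counter(lcp_ratings)
    return {r: 100.0 * counts[r] / len(lcp_ratings) for r in RATINGS}


def point_gaps(sites: List[Tuple[str, str]]) -> Dict[str, float]:
    """Percentage-point gap between poor-TTFB sites and all sites.

    `sites` holds (ttfb_rating, lcp_rating) pairs, one per site.
    """
    baseline = lcp_shares([lcp for _, lcp in sites])
    poor_ttfb = lcp_shares([lcp for ttfb, lcp in sites if ttfb == "poor"])
    return {r: poor_ttfb[r] - baseline[r] for r in RATINGS}


# Invented toy data: six sites with good TTFB, four with poor TTFB.
toy_sites = (
    [("good", "good")] * 5
    + [("good", "needs improvement")]
    + [("poor", "good"), ("poor", "needs improvement")]
    + [("poor", "poor")] * 2
)
gaps = point_gaps(toy_sites)
```

Because the shares in each group sum to 100%, the three gaps always sum to zero, which also holds for the figures quoted above (-70.1 + 21.9 + 48.2 = 0).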

TTFB is an ill-defined black box, so we’re making a point to break that down into its various subparts so you can quickly pinpoint if the issue is with the connection establishment, the server response time, the network itself, and more. We’ll be working to break this down even further in the coming months as we expose the complete lifecycle of a request so you’re able to pinpoint exactly where the bottlenecks lie.

Errors & cache ratios

Degradation in stability and performance is frequently directly connected to configuration changes or an increase in errors. Clear visibility into these characteristics can often cut right to the heart of the issue at hand, as well as point to opportunities for improvement of the overall efficiency and effectiveness of your application.


Observatory prominently surfaces cache hit ratio and error rates for both the edge and origin. This complements the backend telemetry nicely, and helps to further break down the backend metrics you are seeing to help pinpoint areas of improvement.

Take cache hit ratio for example. Intuitively, we know that when content is served from cache on an edge server, it should be faster than when the request has to go all the way back to the origin server. Based on our data, again, that’s exactly what we see.

If we consider our Time To First Byte thresholds again (good is <= 800ms; needs improvement is > 800ms and <= 1800ms; poor is anything over 1800ms), when looking across 9 billion data points as collected by our RUM solution, we see that a whopping 91.7% of all pages served from Cloudflare’s cache have a “good” TTFB compared to 79.7% when the request has to be served from the origin server.

In other words, optimizing origin performance (more on that in a bit) and moving more content to the edge are sure-fire ways to give you a much stronger performance baseline.

Accurate and detailed synthetic testing

While real-user data is our source of truth, synthetic testing and monitoring is important as well. Because tests are run in a more controlled environment (test from this location, at this time, with these criteria, etc.), the resulting data is a lot less noisy and variable. In addition, because there is not a user involved and we don’t have to worry about any observer effect, synthetic tests are able to grab a lot more information about the request and page lifecycle.

As a result, synthetic data tends to work very well for arming engineers with debugging information, as well as providing a cleaner set of data for comparing and contrasting results across different platforms, releases, and other situations.

Observatory provides two different types of synthetic tests.

The first synthetic test is a browser test. A browser test will load the requested page in a headless browser, run Google’s Lighthouse on it to report on key performance metrics, and provide some light suggestions for improvement. 


The second type of synthetic test Observatory provides is a network test. This is a brand new test type in Cloudflare, and is focused on giving you a better breakdown of the network and back-end performance of an endpoint.

Each network test will hit the provided endpoint for the test and record the wait time, server response time, connect time, SSL negotiation time, and total load time for the endpoint response. Because these tests are much more targeted, a single test in itself is not as valuable and can be prone to variation. That variation isn’t necessarily a bad thing—in fact, variability in these results can actually give you a better understanding of the breadth of results when real users hit that same endpoint.

For that reason, network tests trigger a series of individual runs against the provided endpoint spread out over a short period of time. The data for each response is recorded, and then presented as a histogram on the test results page, letting you see not just a single datapoint, but the long and short-tail of each metric. This gives you a much more accurate representation of reality than what a single test run can provide.
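The value of looking at a distribution rather than a single run can be shown with a short sketch. The sample timings, bucket width, and function names below are hypothetical. This is not how Observatory computes its histograms, only an illustration of the idea.

```python
from collections import Counter
from statistics import quantiles
from typing import Dict, List


def histogram(samples_ms: List[float], bucket_ms: int = 50) -> Dict[int, int]:
    """Count samples per bucket, keyed by the bucket's lower bound in ms."""
    counts = Counter(int(s // bucket_ms) * bucket_ms for s in samples_ms)
    return dict(sorted(counts.items()))


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Median and tail percentiles across all runs of a test."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


# Hypothetical total load times (ms) from ten runs against one endpoint.
runs_ms = [112, 98, 105, 340, 101, 97, 120, 99, 103, 250]
buckets = histogram(runs_ms)
summary = summarize(runs_ms)
for lower_bound, count in buckets.items():
    print(f"{lower_bound:>4}ms+ {'#' * count}")
```

In this made-up series most runs cluster around 100ms, but two runs land far out in the tail, which a single test run would either miss entirely or overstate.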


You are also able to compare network tests in Observatory, by selecting two network tests that have been completed. Again, all the data points for each test will be provided in a histogram, where you can easily compare the results of the two.


We are working on improving both synthetic test types in Q4 2025, focusing on making them more powerful and diagnostic.

As we mentioned before, even at its best, synthetic data is an approximation of what is actually happening. Accuracy is critical. Inaccurate data can distract teams with variability and faulty measurements.

It’s important that these tools are as accurate and true to the real world as possible. It’s also important to us that we give back to the community, both because it’s the right thing to do, and because we believe the best way to have the highest level of confidence in the measurement tools and frameworks we’re using is the rigor and scrutiny that open-source provides.

For those reasons, we’ll be working on open-sourcing many of the testing agents we’re using to power Observatory. We’ll share more on that soon, as well as more details about how we’ve built each different testing tool, and why.

Doing something about it: Smart Suggestions

People don’t measure for the sake of having data and pretty charts. They measure because they want to be able to stay on top of the health of their application and find ways to improve it. Data is easy. Understanding what to do about the data you’re presented is both the hardest, and most important, part.

Monitoring without action is useless.

We’re building Observatory to have a relentless focus on actionability. Before any new metric is presented, we take some time to explore why that metric matters, when it’s something worth addressing, and what actions you should take if those metrics need improvement.

All of that leads us to our new Smart Suggestions. Wherever possible, we want to pair each metric with a set of opinionated, data-driven suggestions for how to make things better. We want to avoid vague, hand-wavy advice and instead be prescriptive, specific, and precise.

For example, let’s look at one particular recommendation we provide around improving Largest Contentful Paint.

Largest Contentful Paint is a core web vital metric that measures when the largest piece of content is displayed on the screen. That piece of content could be an image, video or text.

Much like TTFB, Largest Contentful Paint is a bit of a black box by itself. While it tells us how long it takes for that content to get on screen, there are a large number of potential bottlenecks that could be causing the delay. Perhaps the server response time was very slow. Or maybe there was something blocking the content from being displayed on the page. If the object was an image or video, perhaps the file size was large and the resulting download was slow. LCP by itself doesn’t give us that level of granularity, so it’s hard to give more than hand-wavy guidance on how to address it.

Thankfully, just like we can break TTFB into subparts, we can break LCP into its subparts as well. Specifically we can look at:

  • Time to First Byte: How quickly the server responds to the request for HTML

  • Resource Load Delay: How long it takes after TTFB for the browser to discover the LCP resource

  • Resource Load Duration: How long it takes for the browser to download the LCP resource

  • Render Delay: How long it takes the browser to render the content, after it has the resource in hand.

Breaking it down into these subparts, we can be much more diagnostic about what to do.
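
As a rough illustration of that breakdown, the sketch below computes each subpart's share of total LCP and picks out the largest one. The timing values and the `lcp_breakdown` helper are hypothetical; this is not the recommendation engine itself.

```python
def lcp_breakdown(ttfb, load_delay, load_duration, render_delay):
    """Return total LCP (ms) and each subpart's share of that total."""
    parts = {
        "time_to_first_byte": ttfb,
        "resource_load_delay": load_delay,
        "resource_load_duration": load_duration,
        "render_delay": render_delay,
    }
    total = sum(parts.values())
    if total == 0:
        return 0, {name: 0.0 for name in parts}
    return total, {name: ms / total for name, ms in parts.items()}

# Hypothetical subpart timings in milliseconds.
total, shares = lcp_breakdown(ttfb=400, load_delay=300, load_duration=1100, render_delay=200)
largest = max(shares, key=shares.get)
print(f"LCP {total} ms, largest subpart: {largest} ({shares[largest]:.0%})")
```

With these made-up numbers, downloading the LCP resource dominates, which points at the resource itself rather than the server or the renderer.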


In the example above, our recommendation engine analyzes the site’s real-user data and notices that Resource Load Duration accounts for over 10% of total LCP time. As a result, there’s a high likelihood that the resource triggering LCP is large and could potentially be compressed to reduce file size. So we make a recommendation to enable compression using Polish.

We’re very excited about the impact these suggestions will have on helping everyone quickly zero in on meaningful solutions for improving performance and resiliency, without having to wade through mountains of data to get there. As we analyze data, we’ll find more and more patterns of problems and the solutions they can map to. Expanding on our Smart Suggestions will be a constant and ongoing focus as we move forward, and we are working on adding much more content about those patterns and what we find in Q4.

Fixing the biggest pain point: Smart Shield

Observatory gives you unprecedented insight into your application’s health, but insights are only half the battle. The next challenge is acting on them, which brings us to another layer of complexity: protecting your origin. For many of our customers, proper management of origin routes and connections is one of the largest drivers of aggregate overall performance. As we mentioned before, we see a clear negative impact on user-facing performance metrics when we have to go back to the origin, and we want to make it as easy as possible for our customers to improve those experiences. Achieving this requires protecting against unnecessary load while ensuring only trusted traffic reaches your servers.

Today’s customers have powerful tools to protect their origins, but achieving basic use cases remains frustratingly complex:

  • Making applications faster

  • Reducing origin load

  • Understanding origin health issues

  • Restricting IP address access to origin servers

These fundamental needs currently require navigating multiple APIs and dashboard settings. You shouldn’t need to become an expert in each feature — we should analyze your traffic patterns and provide clear, actionable solutions.

Smart Shield: the future of origin shielding

Smart Shield transforms origin protection from a complex, multi-tool challenge into a streamlined, intelligent solution that works on your behalf. Our unified API and UI combine all origin protection essentials (dynamic traffic acceleration, intelligent caching, health monitoring, and dedicated egress IPs) into one place that enables single-click configuration.

But we didn’t stop at simplification. Smart Shield integrates with Observatory to provide both the “what” — identifying performance bottlenecks and health issues — and the “how” — delivering capabilities that increase performance, availability, and security.

This creates a continuous feedback loop: Observatory identifies problems, Smart Shield provides solutions, and real-time analytics verify the impact.



But what does this mean for you? 

  • Reduce total cost of ownership (TCO)

  • Reduce the time-to-value (TTV) for performance, availability, and security issues pertaining to customer origins

  • Enable new features without guesswork and validate effectiveness in the data

Your time stays focused on building incredible user experiences, not becoming a configuration expert. We are excited to give you back time for your customers and your engineers, while paving the way for how you make sure your origin infrastructure is easily optimized to delight your customers. 

Protecting and accelerating origins with smart Connection Reuse

Keeping your origins fast and stable is a big part of what we do at Cloudflare. When you experience a traffic surge, the last thing you want is for a flood of TLS handshakes to knock your origin down, or for those new connections to stall your requests, leaving your users to wait for slow pages to load.

This is why we’ve made significant changes to how Cloudflare’s network talks to your origins to dramatically improve the performance of our origin connections. 

When Cloudflare makes requests to your origins, we make them from a subset of the available machines in every Cloudflare data center so that we can improve your connection reuse. Until now, this pool would be sized the same by default for every application within a data center, and changes to the sizing of the pool for a particular customer would need to be made manually. This often led to suboptimal connection reuse for our customers, as we might be making requests from way more machines than were actually needed, resulting in fewer warm connection pools than we otherwise could have had. This also caused issues at our data centers from time to time, as larger applications might have more traffic than the default pool size was capable of serving, resulting in production incidents where engineers were paged and had to manually increase the fanout factor for specific customers.

Now, these pool sizes are determined automatically and dynamically. By tracking domain-level traffic volume within a data center, we can automatically scale up and scale down the number of machines that serve traffic destined for any particular customer’s origin servers, improving both the performance of customer websites and the reliability of our network. A massive, high-volume website with a considerable amount of API traffic will no longer be processed by the same number of machines as a smaller and more typical website. Our systems can respond to changes in customer traffic patterns within seconds, allowing us to quickly ramp up and respond to surges in origin traffic.
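
The post does not describe the sizing algorithm itself, so the following is only a toy model of the idea: scale the number of machines with observed traffic, within fixed bounds. The `pool_size` function, the per-machine capacity, and the limits are all assumptions made for illustration.

```python
import math

def pool_size(requests_per_second, per_machine_rps=500, min_machines=2, max_machines=64):
    """Toy model: use just enough machines for the observed traffic, within bounds.

    All parameters are hypothetical; the real system's inputs and limits are not public.
    """
    needed = math.ceil(requests_per_second / per_machine_rps)
    return max(min_machines, min(max_machines, needed))

print(pool_size(100))        # small site: stays at the minimum pool
print(pool_size(20_000))     # busy API: pool grows with traffic
print(pool_size(1_000_000))  # extreme surge: capped at the maximum
```

Keeping the pool as small as the traffic allows means each machine sees more requests for the same origin, so its connections stay warm and get reused.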

Thanks to these improvements, Cloudflare now uses over 30% fewer connections across the board to talk to origins. To put this into a more understandable perspective, this translates to saving approximately 402 years of handshake time every day across our global traffic, or 12,060 years of handshake time saved per month! This means just by proxying your traffic through Cloudflare, you’ll see, on average, a 30% reduction in the number of connections to your origin, keeping it more available while serving the same traffic volume and in turn lowering your egress fees. But, in many cases, the results observed can be far greater than 30%. For example, in one data center which is particularly heavy in API traffic, we saw a reduction in origin connections of ~60%!
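
The monthly figure follows from the daily one if a month is taken as 30 days, which is an assumption about how the number was derived:

```python
YEARS_SAVED_PER_DAY = 402   # figure quoted in the post
DAYS_PER_MONTH = 30         # assumption: a 30-day month

years_per_month = YEARS_SAVED_PER_DAY * DAYS_PER_MONTH
print(years_per_month)  # 12060
```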

Many don’t realize that making more connections to an origin requires more compute and time for systems to complete TCP and SSL handshakes. This takes time away from serving content requested by your end users and can act as a hidden tax on your performance and on your application overall. We are proud to reduce the Internet’s hidden tax by finding intelligent, innovative ways to reduce the number of connections needed while supporting the same traffic volume.

Watch out for more updates to Smart Shield at the start of 2026 — we’re working on adding self-serve support for dedicated CDN egress IP addresses, along with significant performance, reliability, and resilience improvements!

Charting the course: next steps for Observatory & Smart Shield

We’re really excited to share these two products with everyone today. Smart Shield and Observatory combine to provide a powerful one-two punch of insight and easy remediation.

As we navigate the beta launch of Observatory, we know this is just the start.

Our vision for Observatory is to be the single source of truth for your application’s health. We know that making the right decisions requires robust, accurate data, and we want to arm our customers with the most comprehensive picture available.

In the coming months, we plan to continue driving forward with our goal of providing comprehensive data, backed by a clear path to action.

  • Deeper, more diagnostic data. We’ll continue to break down data silos, bringing in more metrics to make sure you have a truly comprehensive view of your application’s health. We’ll be focused on going deeper and being more diagnostic, breaking down every aspect of both the request and page lifecycle to give you more granular data.

  • More paths to solutions. People don’t measure for the sake of looking at data, they measure to solve problems. We’re going to continue to expand our suggestions, arming you with more precise, data-driven solutions to a wider range of issues, letting you fix problems with a single click through Smart Shield and bringing a tighter feedback loop to validate the impact of your configuration updates.

  • Benchmarking against other products. Some of our customers split traffic between different CDNs due to regulatory or compliance requirements. Naturally, this brings up a whole series of questions about comparing the performance of the split traffic. In Observatory, you can compare these today, but we have a lot of things planned to make this even easier.

Try out Observatory and Smart Shield yourself today. And if you have ideas or suggestions for making Observatory and Smart Shield better, we’re all ears and would love to talk!

Network performance update: Birthday Week 2025

Post Syndicated from Lai Yi Ohlsen original https://blog.cloudflare.com/network-performance-update-birthday-week-2025/

We are committed to being the fastest network in the world because improvements in our performance translate to improvements for your application’s own end users. We are excited to share that Cloudflare continues to be the fastest network for the most peered networks in the world.

We relentlessly measure our own performance and our performance against peers. We publish those results routinely, starting with our first update in June 2021 and most recently with our last post in September 2024.

Today’s update breaks down where we have improved since our update last year and what our priorities are going into the next year. While we are excited to be the fastest in the greatest number of last-mile ISPs, we are never done improving and have more work to do.

How do we measure this metric, and what are the results?

We measure network performance by attempting to capture what the experience is like for Internet users across the globe. To do that we need to simulate what their connection is like from their last-mile ISP to our networks.

We start by taking the 1,000 largest networks in the world based on estimated population. We use that to give ourselves a representation of real users in nearly every geography.

We then measure performance itself with TCP connection time. TCP connection time is the time it takes for an end user to connect to the website or endpoint they are trying to reach. We chose this metric because we believe this most closely approximates what users perceive to be Internet speed, as opposed to other metrics which are either too scientific (ignoring real world challenges like congestion or distance) or too broad.

We take the trimean measurement of TCP connection times to calculate our metric. The trimean is a weighted average of three statistical values: the first quartile, the median, and the third quartile. This approach allows us to reduce some of the noise and outliers and get a comprehensive picture of quality.
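
A minimal sketch of the trimean, using the standard weighting of (Q1 + 2 × median + Q3) / 4. The quartile interpolation method (`"inclusive"`) and the sample values are assumptions; the post does not say which quartile method is used in production.

```python
import statistics

def trimean(samples):
    """Tukey's trimean: the median counts twice, each outer quartile once."""
    q1, q2, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4

# Hypothetical TCP connection times (ms) with one extreme outlier.
connect_ms = [100, 105, 110, 115, 900]
print(trimean(connect_ms))                 # 110.0
print(sum(connect_ms) / len(connect_ms))   # 266.0
```

The single 900 ms sample drags the mean to 266 ms, while the trimean stays at 110 ms, which is the noise reduction described above.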

For this year’s update, we examined the trimean of TCP connection times measured from August 6 to September 4. Cloudflare is the #1 provider in 40% of the top 1,000 networks. In our September 2024 update, we shared that we were the #1 provider in 44% of the top 1,000 networks.


The TCP Connection Time (Trimean) graph shows that we have the fastest TCP connection time in 383 networks, which on its own would make us the fastest in 38% of the top 1,000. However, we exclude networks that aren’t last-mile ISPs, such as transit networks, since they don’t reflect the end user experience. That brings the number of measured networks to 964 and makes Cloudflare the fastest in 40% of measured ISPs and the fastest across the top networks.
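
The two percentages come from dividing the same count by different denominators. A quick check, where the `share` helper is just for illustration:

```python
def share(count, total):
    """Fraction of networks in which a provider ranks first."""
    return count / total

fastest_in = 383
print(f"{share(fastest_in, 1000):.1%}")  # 38.3% of the top 1,000 networks
print(f"{share(fastest_in, 964):.1%}")   # 39.7% of the 964 measured last-mile ISPs
```

Rounded to the nearest whole percent, 39.7% is the 40% figure quoted above.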

How do we capture this data? 

A Cloudflare-branded error page does more than just display an error; it kicks off a real-world speed test. Behind the scenes, on a selection of our error pages, we use Real User Measurements (RUM), which involves a browser retrieving a small file from multiple networks, including Cloudflare, Amazon CloudFront, Google, Fastly and Akamai.

Running these tests lets us gather performance data directly from the user’s perspective, providing a genuine comparison of different network speeds. We do this to understand where our network is fastest and, more importantly, where we can make further improvements. For a deeper dive into the technical details, the Speed Week blog post covers the full methodology.

By using RUM data, we track key metrics like TCP Connection Time, Time to First Byte (TTFB), and Time to Last Byte (TTLB). These are widely recognized, industry-standard metrics that allow us to objectively measure how quickly and efficiently a website loads for actual users. By monitoring these benchmarks, we can objectively compare our performance against other networks.

We specifically chose the top 1000 networks by estimated population from APNIC, excluding those that aren’t last-mile ISPs. Consistency is key: by analyzing the same group of networks in every cycle, we ensure our measurements and reporting remain reliable and directly comparable over time.

How do the results compare across countries?

The map below shows the fastest providers per country and Cloudflare is fastest in dozens of countries. 


The color coding is generated by grouping all the measurements by the country they originate from. We then look at the trimean measurements for each provider to identify which is the fastest. Akamai was measured as well, but providers are only represented on the map if they ranked first in at least one country, which Akamai did not anywhere in the world.
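
The grouping step can be sketched as follows. The tuple layout, the provider names "A" and "B", and the sample timings are hypothetical, and the quartile method is an assumption.

```python
import statistics
from collections import defaultdict

def trimean(samples):
    q1, q2, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4

def fastest_by_country(measurements):
    """measurements: iterable of (country, provider, connect_ms) tuples."""
    grouped = defaultdict(lambda: defaultdict(list))
    for country, provider, ms in measurements:
        grouped[country][provider].append(ms)
    winners = {}
    for country, by_provider in grouped.items():
        # Lower trimean connect time wins the country.
        scores = {provider: trimean(ms) for provider, ms in by_provider.items()}
        winners[country] = min(scores, key=scores.get)
    return winners

# Hypothetical measurements for two made-up providers.
samples = [
    ("IN", "A", 100), ("IN", "A", 110), ("IN", "A", 120),
    ("IN", "B", 105), ("IN", "B", 115), ("IN", "B", 125),
    ("MX", "A", 120), ("MX", "A", 130),
    ("MX", "B", 100), ("MX", "B", 110),
]
print(fastest_by_country(samples))  # {'IN': 'A', 'MX': 'B'}
```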

Margins are slim: the fastest provider in a country is often ahead by less than 5%. As an example, let’s look at India, a country where we are currently the second-fastest provider.

India (IN)

| Rank | Entity | Connect Time (Trimean) | #1 Diff |
| --- | --- | --- | --- |
| #1 | CloudFront | 107 ms | |
| #2 | Cloudflare | 113 ms | +4.81% (+5.16 ms) |
| #3 | Google | 117 ms | +8.74% (+9.39 ms) |
| #4 | Fastly | 133 ms | +24% (+26 ms) |
| #5 | Akamai | 144 ms | +34% (+37 ms) |

In India, Cloudflare is 5 ms behind CloudFront, the #1 provider. (To put milliseconds into perspective, the average human eye blink lasts between 100 ms and 400 ms.) The competition for the number one spot in many countries is fierce and often shifts day by day. For example, in Mexico on Tuesday, August 5th, Cloudflare was the second-fastest provider by 0.73 ms, but on Tuesday, August 12th, Cloudflare was the fastest provider by 3.72 ms.

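Mexico (MX)

| Date | Rank | Entity | Connect Time (Trimean) | #1 Diff |
| --- | --- | --- | --- | --- |
| August 5, 2025 | #1 | CloudFront | 116 ms | |
| August 5, 2025 | #2 | Cloudflare | 116 ms | +0.63% (+0.73 ms) |
| August 12, 2025 | #1 | Cloudflare | 106 ms | |
| August 12, 2025 | #2 | CloudFront | 109 ms | +3.52% (+3.72 ms) |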

Because ranking reorderings are common, we also review country and network level rankings to evaluate and benchmark our performance. 

Focusing on where we are not the fastest yet

As mentioned above, in September 2024, Cloudflare was fastest in 44% of measured ISPs. These values can shift as providers constantly make improvements to their networks. To prioritize our improvements, we don’t just observe where we are not the fastest; we also measure how far we are from the leader.

In these locations we tend to pace extremely close to the fastest provider, giving us an opportunity to capture the top spot as we relentlessly improve. In over 50% of the networks where Cloudflare is 2nd, the difference from the top provider is less than 5% (10 ms or less).

| Country | ASN | #1 | Cloudflare Rank | #1 Diff (ms) | #1 Diff (%) |
| --- | --- | --- | --- | --- | --- |
| US | AS36352 | Google | 2 | 25 ms | 32% |
| US | AS46475 | Google | 2 | 35 ms | 29% |
| US | AS29802 | Google | 2 | 8.03 ms | 21% |
| US | AS20473 | Google | 2 | 15 ms | 13% |
| US | AS7018 | CloudFront | 2 | 23 ms | 13% |
| US | AS4181 | CloudFront | 2 | 8.19 ms | 11% |
| US | AS62240 | Google | 2 | 18 ms | 9.77% |
| US | AS22773 | CloudFront | 2 | 12 ms | 9.48% |
| US | AS6167 | CloudFront | 2 | 13 ms | 7.55% |
| US | AS11427 | Google | 2 | 9.33 ms | 5.27% |
| US | AS6614 | CloudFront | 2 | 6.68 ms | 4.12% |
| US | AS4922 | Google | 2 | 3.38 ms | 3.86% |
| US | AS11492 | Fastly | 2 | 3.73 ms | 3.33% |
| US | AS11351 | Google | 2 | 5.14 ms | 3.04% |
| US | AS396356 | Google | 2 | 4.12 ms | 2.23% |
| US | AS212238 | Google | 2 | 3.42 ms | 1.35% |
| US | AS20055 | Fastly | 2 | 1.22 ms | 1.33% |
| US | AS40021 | CloudFront | 2 | 2.06 ms | 0.91% |
| US | AS12271 | Fastly | 2 | 1.26 ms | 0.89% |
| US | AS141039 | CloudFront | 2 | 1.26 ms | 0.88% |

In 50% of the networks where Cloudflare is 3rd, the difference from the top provider is less than 10% (10 ms or less). Margins are small, which suggests that where Cloudflare isn’t number one, we’re extremely close to our competitors, and the top-ranked provider changes from day to day.

| Country | ASN | #1 | Cloudflare Rank | #1 Diff (ms) | #1 Diff (%) |
| --- | --- | --- | --- | --- | --- |
| US | AS6461 | Google | 3 | 33 ms | 39% |
| US | AS81 | Fastly | 3 | 43 ms | 35% |
| US | AS14615 | Google | 3 | 24 ms | 24% |
| US | AS13977 | CloudFront | 3 | 21 ms | 19% |
| US | AS33363 | Google | 3 | 29 ms | 18% |
| US | AS63949 | Google | 3 | 9.56 ms | 14% |
| US | AS14593 | Fastly | 3 | 17 ms | 13% |
| US | AS23089 | CloudFront | 3 | 7.4 ms | 11% |
| US | AS16509 | Fastly | 3 | 10 ms | 9.48% |
| US | AS209 | CloudFront | 3 | 9.69 ms | 6.87% |
| US | AS27364 | CloudFront | 3 | 8.76 ms | 6.61% |
| US | AS11404 | CloudFront | 3 | 6.11 ms | 6.16% |
| US | AS46690 | CloudFront | 3 | 5.91 ms | 5.43% |
| US | AS136787 | CloudFront | 3 | 8.23 ms | 5.18% |
| US | AS6079 | Fastly | 3 | 5.45 ms | 4.49% |
| US | AS5650 | Google | 3 | 3.91 ms | 3.35% |

Countries with an abundance of networks, like the United States, have a lot of noise we need to calibrate against. For example, the graph below represents the performance of all providers for a major ISP like AS701 (Verizon Business).

AS701 (Verizon Business) Connect Time (P95) between 2025-08-09 and 2025-09-09


In this chart, the “P95” value, or 95th percentile, refers to one point of a percentile distribution. The P95 shows the value below which 95% of the data points fall and is specifically good at helping identify the slowest or worst-case user experiences, such as those on poor networks or older devices. Additionally, we review the other numbers lower on the percentile chain in the table below, which tell us how performance varies across the full range of data. When we do so, the picture becomes more nuanced.

AS701 (Verizon Business) Provider Rankings for Connect Time at P95, P75 and P50

| Rank | Entity | Connect Time (P95) | Connect Time (P75) | Connect Time (P50) |
| --- | --- | --- | --- | --- |
| #1 | Fastly | 128 ms | 66 ms | 48 ms |
| #2 | Google | 134 ms | 72 ms | 54 ms |
| #3 | CloudFront | 139 ms | 67 ms | 47 ms |
| #4 | Cloudflare | 141 ms | 68 ms | 49 ms |
| #5 | Akamai | 160 ms | 84 ms | 61 ms |

At the 95th percentile for AS701, Cloudflare ranks 4th but at the 75th and 50th, Cloudflare is only 2 milliseconds slower than the fastest provider. In other words, when reviewing more than one point along the distribution at the network level, Cloudflare is keeping up with the top providers for the less extreme samples. To capture these details, it’s important to look at the range of outcomes, not just one percentile.
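
To see how much the picture depends on the percentile chosen, the sketch below re-ranks the providers at each percentile using the AS701 figures from the table. The `rank_and_gap` helper is illustrative only.

```python
# Connect times (ms) for AS701, copied from the table above.
connect_ms = {
    "Fastly":     {"P95": 128, "P75": 66, "P50": 48},
    "Google":     {"P95": 134, "P75": 72, "P50": 54},
    "CloudFront": {"P95": 139, "P75": 67, "P50": 47},
    "Cloudflare": {"P95": 141, "P75": 68, "P50": 49},
    "Akamai":     {"P95": 160, "P75": 84, "P50": 61},
}

def rank_and_gap(table, provider, percentile):
    """Return (rank, leader, gap to leader in ms) for one provider at one percentile."""
    ordered = sorted(table, key=lambda name: table[name][percentile])
    leader = ordered[0]
    gap = table[provider][percentile] - table[leader][percentile]
    return ordered.index(provider) + 1, leader, gap

for p in ("P95", "P75", "P50"):
    rank, leader, gap = rank_and_gap(connect_ms, "Cloudflare", p)
    print(f"{p}: rank {rank}, {gap} ms behind {leader}")
```

On these numbers Cloudflare is 4th and 13 ms behind at P95, but 3rd and 2 ms behind the leader at both P75 and P50.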

To better reflect the full spectrum of user experiences, we started using the trimean in July 2025 to rank providers. This metric combines values from across the distribution of data – specifically the 75th, 50th and 25th percentiles – which gives a more balanced representation of overall performance, rather than only focusing on the extremes. Summarizing user experience with a single number is always challenging, but the trimean helps us compare providers in a way that better reflects how users actually experience the Internet.

Cloudflare is the fastest provider in 40% of networks under the majority of real-world conditions, not just in worst-case scenarios. Still, the 95th percentile remains key to understanding how performance holds up in challenging conditions and where other providers might fall behind. When we review the 95th percentile across the same date range for all networks, not just AS701, Cloudflare is fastest in roughly the same number of networks as under the trimean, and in 103 more networks than the next-fastest provider. Leading by such a wide margin tells us that Cloudflare is particularly strong in the challenging, long-tail cases that other providers struggle with.


Our performance data shows that even when we are not the top-ranked provider, we remain exceptionally competitive, often trailing the leader by a mere handful of percentage points. Our strength at the 95th percentile also highlights our superior performance in the most challenging scenarios. Cloudflare’s ability to outperform other providers in the worst case is a testament to the resilience and efficiency of our network.

Moving forward, we’ll continue to share multiple metrics and continue to make improvements to our network —and we’ll use this data to do it! Let’s talk about how. 

How does Cloudflare use this data to improve?

Cloudflare applies this data to identify regions and networks that need prioritization. If we are consistently slower than other providers in a network, we want to know why, so we can fix it.

For example, the graph below shows the 95th percentile of Connect Time for AS8966. Prior to June 13, 2025, our performance was suffering, and we were the slowest provider for the network. By referencing our own measurement data, we prioritized partner data centers in the region and almost immediately performance improved for users connecting through AS8966.

Cloudflare’s partner data centers consist of collaborations with local service providers who host Cloudflare’s equipment within their own facilities. This allows us to expand our network to new locations and get closer to users more quickly. In the case of AS8966, adding a new partner data center took us from being ranked last to ranked first and improved latency by roughly 150ms in one day. By using a data-driven approach, we made our network faster and most importantly, improved the end user experience.

TCP Connect Time (P95) for AS8966


What’s next?

We are always working to build a faster network and will continue sharing our process as we go. Our approach is straightforward: identify performance bottlenecks, implement fixes, and report the results. We believe in being transparent about our methods and are committed to a continuous cycle of improvement to achieve the best possible performance. Follow our blog for the latest performance updates as we continue to optimize our network and share our progress.

The RUM Diaries: enabling Web Analytics by default

Post Syndicated from Alex Krivit original https://blog.cloudflare.com/the-rum-diaries-enabling-web-analytics-by-default/

Measuring and improving performance on the Internet can be a daunting task because it spans multiple layers: from the user’s device and browser, to DNS lookups and the network routes, to edge configurations and origin server location. Each layer introduces its own variability, such as last-mile bandwidth constraints, third-party scripts, or limited CPU resources, that is often invisible unless you have robust observability tooling in place. Even if you gather data from most of these Internet hops, performance engineers still need to correlate different metrics like front-end events, network processing times, and server-side logs to pinpoint where and why elusive “latency” occurs and to understand how to fix it.

We want to solve this problem by providing a powerful, in-depth monitoring solution that helps you debug and optimize applications, so you can understand and trace performance issues across the Internet, end to end.

That’s why we’re excited to announce the start of a major upgrade to Cloudflare’s performance analytics suite: Web Analytics as part of our real user monitoring (RUM) tools will soon be combined with network-level insights to help you pinpoint performance issues anywhere on a packet’s journey — from a visitor’s browser, through Cloudflare’s network, to your origin.

Some popular web performance monitoring tools have also sacrificed user privacy in order to achieve depth of visibility. We’re also going to remove that tradeoff. By correlating client-side metrics (like Core Web Vitals) with detailed network and origin data, developers can see where slowdowns occur — and why —  all while preserving end user privacy (by dropping client-specific information and aggregating data by visits as explained in greater detail below).

Over the next several months we’ll share:

  • How Web Analytics works

  • Real-world debugging examples from across the Internet

  • Tips to get the most value from Cloudflare’s analytics tools

The journey starts on October 15, 2025, when Cloudflare will enable Web Analytics for all free domains by default — helping you see how your site actually performs for visitors around the world in real time, without ever collecting any personal data (not applicable to traffic originating from the EU or UK, see below). By the middle of 2026, we’ll deliver something nobody has ever had before: a comprehensive, privacy-first platform for performance monitoring and debugging. Unlike many other tools, this platform won’t just show you where latency lives, it will help you fix it, all in one place. From untangling the trickiest bottlenecks, to getting a crystal-clear view of global performance, this new tool will change how you see your web application and experiment with new performance features. And we’re not building it behind closed doors, we want to bring you along as we launch it in public. Follow along in this series, The RUM Diaries, as we share the journey.

Why this matters

Performance monitoring is only as good as the detail you can see — and the trust your users have that while you’re watching traffic performance, you aren’t watching them. As we explain below, by combining real user metrics with deep, in-network instrumentation, we’ll give developers the visibility to debug any layer of the stack while maintaining Cloudflare’s zero-compromise stance on privacy.

What problem are we solving? 

Many performance monitoring solutions provide only a narrow slice of the performance layer cake, focusing on either the client or the origin while lumping everything in between under a vague “processing time” due to lack of visibility. But as web applications get more complex and user expectations continue to rise, traditional analytics alone don’t cut it. Knowing what happened is just the tip of the iceberg; modern teams need to understand why a bottleneck occurred and how network conditions, code changes, or even a single external script can degrade load times. Moreover, often the tools available can only observe performance rather than helping to optimize it, which leaves teams unable to understand what to try to move the needle on latency.

We want to pull back the curtain so you can understand performance implications of the services you use on our platform and how you can make sure you’re getting the best performance possible. 

Consider Shannon in Detroit, Michigan. She operates an e-commerce site selling hard-to-find watches to horology enthusiasts around the globe. Shannon knows that her customers are impatient (she pictures them frequently checking their wrists). If her site loads slowly, she loses sales, her SEO drops, and her customers go to a different store where they have a better online shopping experience. 

As a result, Shannon continually monitors her site performance, but she frequently runs into problems trying to understand how her site is experienced by customers in different parts of the world. After updating her site, she frequently spot checks its performance using her browser on her office wifi in Detroit, but she continually hears complaints about slow load from her customers in Germany. So Shannon shops around for a solution that monitors performance around the globe. 

This off-the-shelf performance monitoring solution offers her the ability to run similar tests from virtual machines situated around the world across various desktops, mobile devices, and even ISPs, close to her customers. Shannon receives data from these tests, ranging from how fast these synthetic clients’ DNS resolved, how quickly they connected to a particular server, and even when a response was on its way back to a client. Thankfully for Shannon, the off-the-shelf performance monitoring solution identified “server processing time” as the latency culprit in Germany. However, she can’t help but wonder, is it my server that is slow or the transit connection of my users in Germany? Can I make my site faster by adding another server in Germany, or updating my CDN configuration? It’s a three option head-scratcher: is it a networking problem, a server problem, or something else?

Cloudflare can help Shannon (and others!) because we sit in a unique place to provide richer performance analytics. As a reverse proxy positioned between the client and the origin, we are often the first web server a user connects to when requesting content. In addition to moving what’s important closer to your customers, our product suite can generate responses at our edge (e.g. Workers), steer traffic through our dedicated backbone (e.g. cloudflared and more), and route around Internet traffic jams (e.g. Argo). By tailoring a solution that brings together: 

  • client performance data, 

  • real-time network metrics,

  • customer configuration settings, and

  • origin performance measurements

we can provide more insightful information about what’s happening in the vague “processing time.” This will allow developers like Shannon to understand what they should tweak to make their sites more performant, grow their businesses, and make their customers happier. 

What is Web Analytics? 

Turning back to what’s happening on October 15, 2025: We’re enabling Web Analytics so teams can track down performance bottlenecks. Web Analytics works by adding a lightweight JavaScript snippet to your website, which helps monitor performance metrics from visitors to your site. In the Web Analytics dashboard you can see aggregate performance data related to: how a browser has painted the page (via LCP, INP, and CLS), general load time metrics associated with server processing, as well as aggregate counts of visitors.

If you’ve ever popped open DevTools in your browser and stared at the waterfall chart of a slow-loading page, you’ve had a taste of what Web Analytics is doing, except instead of measuring load times from your laptop, it measures them directly from the browsers of real visitors.

Here’s the high-level architecture:

A lightweight beacon in the browser
Every page that you track with Cloudflare’s Web Analytics includes a tiny JavaScript snippet, optimized to load asynchronously so it won’t block rendering.

  • This snippet hooks into modern browser APIs such as the Performance API and Resource Timing.

  • This is how Cloudflare collects Core Web Vitals metrics like Largest Contentful Paint and Interaction to Next Paint, plus data about resource load times and TLS handshake duration, all from the perspective of the client.

Aggregation at the edge
When the browser sends performance data, it goes to the nearest Cloudflare data center. Instead of pushing raw events straight to a database, we pre-process at the edge. This reduces storage needs, minimizes latency, and removes personal information like IP addresses. After this pre-processing, the data is sent to a core data center, where it is processed and made available for users to query.
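
As a rough illustration of the pre-processing idea above, the following Python sketch collapses raw beacon events into per-page aggregates and drops the client IP before anything is stored. The event fields (`ip`, `page`, `lcp_ms`) and the function name are hypothetical, chosen only for this example; they are not Cloudflare's actual schema or pipeline.

```python
from collections import defaultdict
from statistics import median


def aggregate_events(events):
    """Group raw beacon events by page and keep only aggregate numbers."""
    lcp_by_page = defaultdict(list)
    for event in events:
        # Only the page and the metric are carried forward; the IP is never copied.
        lcp_by_page[event["page"]].append(event["lcp_ms"])
    return {
        page: {"count": len(values), "median_lcp_ms": median(values)}
        for page, values in lcp_by_page.items()
    }


# Hypothetical raw events as they might arrive at an edge location.
raw_events = [
    {"ip": "203.0.113.7", "page": "/watches", "lcp_ms": 1800},
    {"ip": "203.0.113.9", "page": "/watches", "lcp_ms": 2200},
    {"ip": "198.51.100.4", "page": "/cart", "lcp_ms": 900},
]
summary = aggregate_events(raw_events)
print(summary)
```

Only the aggregate leaves the function, so whatever is forwarded for storage no longer contains per-visitor identifiers.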


Web Analytics sits under the Analytics & Logs section of the dashboard (at both the account and domain level). Starting on October 15, 2025, free domains will begin to see Web Analytics enabled by default and will be able to view the performance of their visitors in their dashboard. Pro, Business, and Enterprise accounts can enable Web Analytics by selecting the hostname of the website to add the snippet to and selecting Automatic Setup. Alternatively, you can manually paste the JavaScript beacon before the closing </body> tag on any HTML page you’d like to track from your origin. Just select “manage site” from the Web Analytics tab in the dashboard. 


Once enabled, the JS snippet works with visitors’ browsers to measure how the user experienced page load times and reports on critical client-side metrics. Below these metrics are resource attribution tables that show which assets take the most time to load for each metric, so that users can better optimize their site performance. 


What does privacy-first mean?

From the beginning, our Web Analytics tools have centered on providing insights without compromising privacy. Being privacy-first means we don’t track individual users for analytics. We don’t use any client-side state (like cookies or localStorage) for analytics purposes, and we don’t track users over time by IP address, User Agent, or any other fingerprinting technique.

Moreover, when enabling Web Analytics, you can choose to drop requests from European and UK visitors if you so desire (listed here specifically), meaning we will not collect any RUM metrics from traffic that passes through our European and UK data centers. The version of Web Analytics that will be enabled by default excludes data from EU visitors (this can be changed in the dashboard if you want). 

The concept of a visit is key to our privacy approach. Rather than count unique IP addresses (requiring storing state about each visitor), we simply count page views that originate from a distinct referral or navigation event, avoiding the need to store information that might be considered personal data. We believe this same concept that we’ve used for years in providing our privacy-first Web Analytics can be logically extended to network and origin metrics. This will allow customers to gain the insights they need to debug and solve performance issues while ensuring they are not collecting unneeded data on visitors.
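
To make the idea of a visit concrete, here is a minimal Python sketch that counts visits without storing anything about the visitor. It assumes a visit is a page view whose referrer is empty or points to a different hostname; that rule and the function names are illustrative assumptions, not a description of Cloudflare's implementation.

```python
from urllib.parse import urlparse


def is_visit(page_url, referrer):
    """A page view counts as a visit when it did not come from the same site."""
    if not referrer:
        return True  # direct navigation, bookmark, or typed URL
    return urlparse(referrer).hostname != urlparse(page_url).hostname


def count_visits(page_views):
    """Count visits in a list of (page_url, referrer) pairs."""
    return sum(1 for page_url, referrer in page_views if is_visit(page_url, referrer))


page_views = [
    ("https://example.com/", ""),                                    # direct
    ("https://example.com/watches", "https://example.com/"),         # internal click
    ("https://example.com/watches", "https://search.example.net/"),  # external referral
]
visits = count_visits(page_views)
print(visits)
```

Each page view is judged on its own request data, so no per-visitor state such as cookies or IP addresses needs to be kept.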


Opting-out

We built our Web Analytics service to give you the insights you need to run your website, all while maintaining a privacy-first approach. However, if you do want to opt-out, here are the steps to do so.

Via Dashboard

If you have a free domain and do not want Web Analytics automatically enabled for your zone, do the following before October 15, 2025: 

  1. Navigate to the zone in the Cloudflare dashboard

  2. In the list on the left of the screen, navigate to Web Analytics


  3. On the next page, select either `Enable Globally` or `Exclude EU` to activate the feature


  4. Once Web Analytics has been activated, navigate to `Manage RUM Settings` in the Web Analytics dashboard


  5. Then, on the next page, select `Disable` to disable Web Analytics for the zone


  6. OR, to remove Web Analytics from the zone entirely, delete the configs by clicking Advanced Options and then Delete


    Once you have disabled the product, we will not re-enable it. You can, however, choose to enable it again whenever you want.

Via API

  1. Create a Web Analytics configuration with the following API call:

    curl https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/rum/site_info \
        -H 'Content-Type: application/json' \
        -H "X-Auth-Email: $CLOUDFLARE_EMAIL" \
        -H "X-Auth-Key: $CLOUDFLARE_API_KEY" \
        -d '{
              "auto_install": false,
              "host": "example.com",
              "zone_tag": "023e105f4ecef8ad9ca31a8372d0c353"
            }'
    

    Note: This will not cause your zone to collect RUM data because auto_install is set to `false`

  2. Collect the site_tag and zone_tag fields from the response to this call

    1. site_tag in this response will correspond to $SITE_ID in the following calls

  3. EITHER Disable the Web Analytics configuration with the following API call:

    curl https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/rum/site_info/$SITE_ID \
        -X PUT \
        -H 'Content-Type: application/json' \
        -H "X-Auth-Email: $CLOUDFLARE_EMAIL" \
        -H "X-Auth-Key: $CLOUDFLARE_API_KEY" \
        -d '{
              "auto_install": true,
              "enabled": false,
              "host": "example.com",
              "zone_tag": "023e105f4ecef8ad9ca31a8372d0c353"
            }'
    
    

  4. OR Delete the Web Analytics configuration with the following API call:

    curl https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/rum/site_info/$SITE_ID \
        -X DELETE \
        -H "X-Auth-Email: $CLOUDFLARE_EMAIL" \
        -H "X-Auth-Key: $CLOUDFLARE_API_KEY"
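
If you script against the API from Python rather than curl, the disable call above can be sketched as follows. The endpoint, headers, and body mirror the curl example; the function name and placeholder values are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_disable_request(account_id, site_id, email, api_key, host, zone_tag):
    """Build (but do not send) the PUT request that disables a Web Analytics site."""
    body = {"auto_install": True, "enabled": False, "host": host, "zone_tag": zone_tag}
    return urllib.request.Request(
        f"{API_BASE}/accounts/{account_id}/rum/site_info/{site_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Auth-Email": email,
            "X-Auth-Key": api_key,
        },
        method="PUT",
    )


# Placeholder values; substitute your own account ID, site_tag, and credentials.
request = build_disable_request(
    "ACCOUNT_ID", "SITE_ID", "user@example.com", "API_KEY",
    "example.com", "023e105f4ecef8ad9ca31a8372d0c353",
)
# To actually send it: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```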

Where We’re Going Next

Today, Web Analytics gives you visibility into how people experience your site in the browser. Next, we’re expanding that lens to show what’s happening across the entire request path, from the click in a user’s browser, through Cloudflare’s global network, to your origin servers, and back.

Here’s what’s coming:

  1. Correlating Across Layers
    We’ll match RUM data from the client with network timing, Cloudflare edge processing, and origin response latency, allowing you to pinpoint whether a spike in TTFB comes from a slow script, a cache miss, or an origin bottleneck.

  2. Proactive Alerting
    Configurable alerts will tell you when performance regresses in specific geographies, when a data center underperforms, or when origin latency spikes.

  3. Actionable Insights
    We’ll go beyond “processing time” as a single number, breaking it into the real-world steps that make up the journey: proxy routing, security checks, cache lookups, origin fetches, and more.

  4. Unified View
    All of this will live in one place (your Cloudflare dashboard) alongside your analytics, logs, firewall events, and configuration settings, so you can see cause and effect in one workflow.

Conclusion

Stay tuned as we work alongside you, in public, to build the most comprehensive, privacy-focused performance analytics platform. Together, we will illuminate every corner of the request journey so you can optimize, innovate, and deliver the best experiences to your users, every time.

The next chapters of this journey will unlock proactive alerts, cross-layer correlation, and actionable insights you can’t get anywhere else. Follow along as the RUM Diaries are just getting started.

Taming the monorepo beast: Our journey to a leaner, faster GitLab repo

Post Syndicated from Grab Tech original https://engineering.grab.com/taming-monorepo-beast

At Grab, our engineering teams rely on a massive Go monorepo that serves as the backbone for a large portion of our backend services. This repository has been our development foundation for over a decade, but age brought complexity, and size brought sluggishness. What was once a source of unified code became a bottleneck that was slowing down our developers and straining our infrastructure.

A primer on GitLab, Gitaly, and replication

To understand our core problem, it’s helpful to know how GitLab handles repositories at scale. GitLab uses Gitaly, its Git RPC service, to manage all Git operations. In a high-availability setup like ours, we use a Gitaly Cluster with multiple nodes.

Here’s how it works:

  • Write operations: A primary Gitaly node handles all write operations.
  • Replication: Data is replicated to secondary nodes.
  • Read operations: Secondary nodes handle read operations, such as clones and fetches, effectively distributing the load across the cluster.
  • Failover: If the primary node fails, a secondary node can take over.

For the system to function effectively, replication must be nearly instantaneous. When secondary nodes experience significant delays syncing with the primary (a condition called replication lag), GitLab stops routing read requests to the secondary nodes to ensure data consistency. This forces all traffic back to the primary node, eliminating the benefits of our distributed setup. Figure 1 illustrates the replication architecture of Gitaly nodes.

Figure 1: The replication architecture of Gitaly nodes in a high-availability setup.
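
The routing behaviour described above can be modelled with a few lines of Python. This is a simplified sketch, not GitLab's actual logic: the node names, the `max_lag_seconds` threshold, and the function itself are assumptions made for illustration.

```python
def pick_read_nodes(primary, secondaries, lag_seconds, max_lag_seconds=5.0):
    """Return the nodes eligible to serve reads.

    Secondaries that are within the lag threshold share the read load.
    If none qualify, every read falls back to the primary.
    """
    up_to_date = [
        node for node in secondaries
        if lag_seconds.get(node, float("inf")) <= max_lag_seconds
    ]
    return up_to_date if up_to_date else [primary]


# Healthy cluster: both secondaries are nearly in sync.
healthy = pick_read_nodes("P", ["S1", "S2"], {"S1": 1.2, "S2": 0.4})
# Lagging cluster: both secondaries are minutes behind.
lagging = pick_read_nodes("P", ["S1", "S2"], {"S1": 240.0, "S2": 180.0})
print(healthy, lagging)
```

In the lagging case the primary ends up serving every read, which is the bottleneck the rest of this post describes.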

The scale of our problem

Our Go monorepo started as a simple repository 11 years ago but ballooned as Grab grew. A Git analysis using the git-sizer utility in early 2025 revealed the shocking scale:

  • 12.7 million commits accumulated over a decade.
  • 22.1 million Git trees consuming 73GB of metadata.
  • 5.16 million blob objects totaling 176GB.
  • 12 million references, mostly leftovers from automated processes.
  • 429,000 commits deep on some branches.
  • 444,000 files in the latest checkout.

This massive size wasn’t just a number—it was crippling our daily operations.

Infrastructure problems

Figure 2: Replication delays of up to four minutes during peak working hours.

In high-availability setups, replication is critical for distributing workloads and ensuring system reliability. However, when replication delays occur, they can severely impact infrastructure performance and create bottlenecks. Figure 2 illustrates replication delays of up to four minutes which caused both secondary nodes, Gitaly S1 (orange) and Gitaly S2 (blue), to lag behind the primary node, Gitaly P (green). As a result, all requests were routed exclusively to the primary node, creating significant performance challenges.

The key issues here are:

  • Single point of failure: Only one of our three Gitaly nodes could handle the load, creating a bottleneck.
  • Throttled throughput: Read capacity was limited to just one-third of the cluster’s potential.

Developer experience issues

The growing size of the monorepo directly impacted developer workflows:

  • Slow clones: 8+ minutes even on fast networks.
  • Painful Git operations: Every commit, diff, and blame had to process millions of objects.
  • CI pipeline overhead: Repository cloning added 5-8 minutes to every CI job.
  • Frustrated developers: “Why is this repo so slow?” became a common question.

Operational challenges

The repository’s scale introduced significant operational hurdles:

  • Storage issues: 250GB of Git data made backups and maintenance cumbersome.
  • GitLab UI timeouts: The web interface struggled to handle millions of commits and refs, frequently timing out.
  • Limited CI scalability: Adding more CI runners overloaded the single working node.

All these factors were dragging down developer productivity. It was clear that continuing to let the monorepo grow unchecked wasn’t sustainable. We needed to make the repository leaner and faster, without losing the important history that teams relied on.

Our solution journey

Proof of concept: Validating the theory

Before making any changes, we needed to answer a critical question: “Would trimming repository history solve our replication issues?” Without proof, committing to such a major change felt risky. So we set out to test the idea.

The test setup:

We designed a simple experiment. In our staging environment, we created two repositories:

  • Full history repository: This repository mirrored the original repository with full history.
  • Shallow history repository: This repository contained only a single commit of history.

Both repositories contained the same number of files and directories. We then simulated production-like load on both of the repositories.

The results:

  • Full history repository: 160-240 seconds replication delay.
  • Shallow history repository: 1-2.5 seconds replication delay.

This was nearly a 100x improvement in replication performance.

This proof of concept gave us confidence that history trimming was the right approach and provided baseline performance expectations.

Content preservation strategies: What to keep

Initial strategy: Time-based approach (1-2 years)

Initially, we wanted to keep commits from the last 1-2 years and archive everything else, as this seemed like a reasonable balance between recent history and size reduction. However, when we developed our custom migration script, we discovered it could only process 100 commits per hour, approximately 2,400 commits per day. With millions of commits in the original repository, even keeping 1-2 years of history would take months.

  • We can only process ~100 commits per hour in batches of 20 to avoid memory limits on GitLab runners.
  • Each batch takes 2 minutes to process, but requires 10 minutes of cleanup (git gc, git reflog expire) to prevent local disk and memory exhaustion.
  • This means each batch takes 12 minutes, allowing only 5 batches per hour (60 ÷ 12 = 5), totaling 100 commits per hour (5 × 20 = 100).
  • Larger batches increased cleanup time, and skipping cleanup caused jobs to crash after 200-300 commits.

The bottleneck wasn’t just the number of commits; it was the 10-minute cleanup process.
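
The throughput arithmetic above is easy to reproduce. The sketch below uses the batch size and timings quoted in this section; the function names are ours, and the day estimates cover processing time only (no retries or failures).

```python
def commits_per_hour(batch_size=20, process_minutes=2, cleanup_minutes=10):
    """Throughput when every batch needs processing plus cleanup time."""
    batches_per_hour = 60 / (process_minutes + cleanup_minutes)
    return batches_per_hour * batch_size


def days_to_migrate(total_commits, rate_per_hour):
    """Days of continuous processing needed at a given hourly rate."""
    return total_commits / (rate_per_hour * 24)


rate = commits_per_hour()                          # 100.0 commits per hour
rate_without_cleanup = commits_per_hour(cleanup_minutes=0)
full_history_days = days_to_migrate(12_700_000, rate)
pruned_history_days = days_to_migrate(15_800, rate)

print(f"rate: {rate:.0f}/h, without cleanup: {rate_without_cleanup:.0f}/h")
print(f"full history: {full_history_days:.0f} days, pruned: {pruned_history_days:.1f} days")
```

Setting the cleanup time to zero shows why it dominates: the same batches would run several times faster, but in practice the jobs crashed without it.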

Additional constraints discovered:

As we dug deeper, we discovered more obstacles.

  • Critical dependencies extended beyond two years. Some Go module tags from six years ago were still actively used.
  • A pure time-based cut would break existing build pipelines.
  • Development teams needed some recent history for troubleshooting and daily operations.

Revised strategy: Tag-based + recent history

Given the processing speed constraint of 100 commits per hour, we needed to drastically reduce the number of commits while preserving essential functionality. After careful evaluation, we settled on a tag-based approach combined with recent history.

What we decided to keep:

  • Critical tags: All commits reachable by 2,000+ identified tags, ensuring semantic importance for releases and dependencies.
  • Recent history: Complete commit history for the last month only, addressing stakeholder needs within processing constraints.
  • Simplified merge commits: Converted complex merge commits into single commits to further reduce processing time.

Why this approach worked:

  • Time-feasible: Reduced processing time from months to weeks.
  • Functionally complete: Preserved all tagged releases and recent development context.
  • Stakeholder satisfaction: Met development teams’ need for recent history.
  • Massive size reduction: Achieved 99.9% fewer commits while keeping what matters.

The trade-off:

We sacrificed deep historical browsing of 1 to 2 years for practical migration feasibility, while ensuring no critical functionality was lost.

Technical implementation methods: How to execute

Method 1: git filter-repo (Failed)

The approach: Use the git filter-repo tool with git replace --graft to remove commits older than a specified cutoff.

Why it failed:

  • Complex history: Our repository’s highly non-linear history, with multiple branches and merges, made this approach impractical.
  • Workflow complexity: The process required numerous git replace --graft commands to account for various branches and dependencies, significantly complicating the workflow.
  • Risk of inconsistencies: The complexity introduced a high risk of errors and inconsistencies, making this method unsuitable.

Method 2: git rebase --onto (Failed)

The approach: Use git rebase --onto to preserve selected commits while pruning unwanted history.

Why it failed:

  • Scale issues: The repository size overwhelmed the rebase process.
  • Conflict resolution: High number of unexpected conflicts that couldn’t be resolved automatically.
  • Technical limitations: Batch processing couldn’t solve the performance issues; Git’s internal mechanisms struggled with the scale.

Method 3: Patch-based implementation (Failed)

The approach: Create and apply patches for each commit individually to preserve repository history.

Why it failed:

  • Merge commit complexity: Couldn’t maintain correct parent-child relationships for merge commits.
  • History integrity: Resulted in linear sequence instead of preserving original merge structure.
  • Missing commits: Important merge commits were lost or incorrectly applied.

Method 4: Custom migration script (Success!)

The breakthrough: A sophisticated custom script that could handle our specific requirements and processing constraints. Unlike traditional Git history rewriting tools, our script implements a two-phase chronological processing approach that efficiently handles large-scale repositories.

Phase 1: Bulk migration

In this phase, the script focuses on reconstructing history based on critical tags.

  1. Fetch tags chronologically: Retrieve all tags in the order they were created.
  2. Pre-fetch Large File Storage (LFS) objects: Collect LFS objects for tag-related commits before processing.
  3. Batch processing: Process tags in batches of 20 to optimize memory and network usage. For each tag:
    • Check for associated LFS objects.
    • Perform selective LFS fetch if required.
    • Create a new commit using the original tree hash and metadata.
    • Embed the original commit hash in the commit message for traceability.
    • Gracefully handle LFS checkout failures.

  4. Push: Push the processed batch of 20 commits to the destination repository, with LFS tolerance.
  5. Cleanup and continue: Perform cleanup operations after each batch and proceed to the next.
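
The core of Phase 1, batching tags and recreating each commit from its original tree, can be sketched in Python as below. This is a minimal illustration rather than our actual script: it only builds `git commit-tree` command lines, and the `Original-Commit:` message trailer is a hypothetical format for embedding the original hash.

```python
def batches(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def commit_tree_command(tree_hash, parent, original_commit, subject):
    """Build a `git commit-tree` invocation that reuses an existing tree object.

    Author name, email, and date would be preserved separately through the
    GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, and GIT_AUTHOR_DATE environment variables.
    """
    message = f"{subject}\n\nOriginal-Commit: {original_commit}"
    command = ["git", "commit-tree"]
    if parent is not None:
        command += ["-p", parent]  # chain onto the previously created commit
    command += ["-m", message, tree_hash]
    return command


tags = [f"v1.0.{n}" for n in range(45)]  # hypothetical tag names
batch_sizes = [len(batch) for batch in batches(tags)]
first = commit_tree_command("TREE_HASH", None, "ORIGINAL_HASH", "Release v1.0.0")
print(batch_sizes)
print(first)
```

Because the tree object is reused as-is, the file contents of each recreated commit are identical to the original; only the ancestry is simplified.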

Phase 2: Delta migration

This phase integrates recent commits after the cutoff date.

  1. Fetch recent commits: Retrieve all commits created after the cutoff date in chronological order.
  2. Batch processing: Process commits in batches of 20 for efficiency. For each commit:
    • Check for associated LFS objects.
    • Perform selective LFS fetch if required.
    • Recreate the commit with its original metadata.
    • Embed the original commit hash for resumption tracking in case of interruptions.
    • Gracefully handle LFS checkout failures.

  3. Push: Push the processed batch of commits to the destination repository, with LFS tolerance.
  4. Tag mapping: Map tags to their corresponding new commit hashes.
  5. Push tags: Push related tags pointing to the correct new commits.
  6. Final validation: Validate all LFS objects to ensure completeness.
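
Tag mapping follows naturally from embedding the original hash in each new commit message. The Python sketch below assumes a hypothetical `Original-Commit: <sha>` trailer and uses made-up hashes; it illustrates the idea and is not the script we ran.

```python
import re

# Hypothetical trailer format used to record the pre-migration commit hash.
ORIGINAL_HASH = re.compile(r"^Original-Commit: ([0-9a-f]{40})$", re.MULTILINE)


def build_hash_map(new_commits):
    """Map original commit hashes to new ones using the hash embedded in each message."""
    mapping = {}
    for new_hash, message in new_commits:
        match = ORIGINAL_HASH.search(message)
        if match:
            mapping[match.group(1)] = new_hash
    return mapping


def remap_tags(old_tags, hash_map):
    """Return (tags pointing at new hashes, names of tags whose target was not migrated)."""
    remapped = {name: hash_map[old] for name, old in old_tags.items() if old in hash_map}
    missing = sorted(name for name, old in old_tags.items() if old not in hash_map)
    return remapped, missing


old_a, old_b, new_a = "a" * 40, "b" * 40, "c" * 40
new_commits = [(new_a, f"Release v1.0.0\n\nOriginal-Commit: {old_a}")]
remapped, missing = remap_tags({"v1.0.0": old_a, "v0.9.0": old_b}, build_hash_map(new_commits))
print(remapped, missing)
```

The list of missing tags doubles as a validation report: anything in it points at a commit that was not carried over.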

LFS handling

The script incorporates robust mechanisms to handle Git LFS efficiently.

  • Configure LFS for incomplete pushes.
  • Skip LFS download errors when possible.
  • Retry checkout with LFS smudge skip.
  • Perform selective LFS object fetching.
  • Gracefully degrade processing for missing LFS objects.
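
A possible shape for this fallback logic is sketched below in Python. The Git LFS settings `lfs.allowincompletepush` and `lfs.skipdownloaderrors` and the `GIT_LFS_SKIP_SMUDGE` environment variable are real; the helper names and the injectable `run` callable are assumptions made so the example stays small and testable.

```python
import os
import subprocess

# Repository-level settings that let pushes and checkouts tolerate missing LFS objects.
LFS_CONFIG = [
    ["git", "config", "lfs.allowincompletepush", "true"],
    ["git", "config", "lfs.skipdownloaderrors", "true"],
]


def checkout_env(skip_smudge):
    """Environment for a checkout, optionally leaving LFS pointers unexpanded."""
    env = dict(os.environ)
    env.pop("GIT_LFS_SKIP_SMUDGE", None)
    if skip_smudge:
        env["GIT_LFS_SKIP_SMUDGE"] = "1"
    return env


def run_git(command, env):
    """Run a git command and report whether it succeeded."""
    return subprocess.run(command, env=env).returncode == 0


def checkout_with_fallback(run, commit):
    """Try a normal checkout first; on failure retry without downloading LFS content."""
    command = ["git", "checkout", "--force", commit]
    if run(command, checkout_env(skip_smudge=False)):
        return "full"
    if run(command, checkout_env(skip_smudge=True)):
        return "pointers-only"
    return "failed"
```

In real use you would call `checkout_with_fallback(run_git, commit_hash)` inside a working clone; passing `run` as a parameter makes the retry order easy to exercise without touching a repository.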

Key features:

  • Sequential processing of tags and commits in chronological order.
  • Resumable operations that could restart from the last processed item if interrupted.
  • Batch processing to manage memory and network resources efficiently.
  • Robust error handling for network issues and Git complications.
  • Maintains repository integrity while simplifying complex merge structures.
  • Optimized for our specific preservation strategy (tags + recent history).

Implementation: Executing the migration

With our strategy defined (tags + last month), we executed the migration using our custom script. This process involved careful planning, smart processing techniques, and overcoming technical challenges.

Smart processing approach

Our custom script employed several key strategies to ensure efficient and reliable migration:

  • Sequential tag processing: Replay tags chronologically to maintain logical history.
  • Resumable operations: The migration could restart from the last processed item if interrupted.
  • Batch processing: Handle items in manageable groups to prevent resource exhaustion.
  • Progress tracking: Monitor processing rate and estimated completion time.
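
Resumption and progress tracking can both be derived from the chronological order of the work list. The Python sketch below is illustrative only; the function names and sample tags are hypothetical.

```python
def resume_from(ordered_items, last_done):
    """Return the items still to process, given the last one known to be migrated."""
    if last_done is None:
        return list(ordered_items)
    return list(ordered_items[ordered_items.index(last_done) + 1:])


def eta_hours(remaining_count, items_per_hour):
    """Estimated hours left at the observed processing rate."""
    return remaining_count / items_per_hour


tags = ["v1.0.0", "v1.1.0", "v1.2.0", "v2.0.0"]
todo = resume_from(tags, "v1.1.0")       # restart after an interruption
hours_left = eta_hours(2_000, 100)       # e.g. 2,000 items at 100 per hour
print(todo, hours_left)
```

Because items are always processed in the same order, the last migrated item is the only state a restarted job needs.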

Technical challenges solved

The migration addressed several critical technical hurdles.

  • Large file support: Handled Git LFS objects with incomplete push allowances.
  • Error handling: Robust retry logic for network issues and Git errors.
  • Merge commit simplification: Converted complex merge structures to linear commits.

Two-phase migration strategy

The migration was executed in two carefully planned phases.

  • Phase 1 – Bulk migration: Migrated 95% of tags while keeping the old repo live.
  • Phase 2 – Delta migration: Performed final synchronization during a maintenance window to migrate recent changes.

Results and impact

Infrastructure transformation

Replication delay, or the time required to sync across all Gitaly nodes, improved by 99.4% following the pruning process. As illustrated in Figures 3 and 4, the new pruned monorepo achieves replication in ~1.5 seconds on average, compared to ~240 seconds for the old repository. This transformation eliminated the previous single-node bottleneck, enabling read requests to be distributed evenly across all three storage nodes, significantly enhancing system reliability and performance.

Figure 3: In the new pruned monorepo, replication delay ranges from 200 – 2,000 ms.
Figure 4: In the old monorepo, replication delay ranged from 16,000 – 28,000 ms.

The migration significantly improved load distribution across Gitaly nodes. As shown in Figure 5, the new monorepo leverages all three Gitaly nodes to serve requests, effectively tripling read capacity. Additionally, the migration eliminated the single point of failure that existed in the old monorepo, ensuring greater reliability and scalability.

Figure 5: In the new monorepo, requests are evenly distributed across all three servers, demonstrating improved performance and replication across nodes.
Figure 6: In the old monorepo, requests were served only by a single server during working hours, creating a single point of failure.

Performance improvements

The migration resulted in significant improvements across multiple areas.

  • Clone time: Reduced from 7.9 minutes to 5.1 minutes, achieving a 36% improvement, making repository cloning faster and more efficient.
  • Commit count: Achieved a 99.9% reduction, trimming the repository from 13 million commits to just 15.8 thousand commits, drastically simplifying its structure.
  • References: Reduced by 99.9%, going from 12 million to 9.8 thousand refs, streamlining repository metadata.
  • Storage: Reduced by 59%, shrinking storage requirements from 214GB to 87GB, optimizing resource usage.

Developer experience

The migration also transformed the developer experience.

  • Faster Git operations: Commits, diffs, and history commands are noticeably snappier.
  • Responsive GitLab UI: Web interface no longer times out.
  • Scalable CI: The system can now safely run 3x more concurrent jobs.

The following table summarizes the key repository metrics, comparing the state of the repository before and after the migration:

| Metric | Old Monorepo | New Monorepo | Reduction |
|---|---|---|---|
| Commits | ~13,000,000 | ~15,800 | −99.9% (histories squashed) |
| Git trees | ~23,600,000 | ~2,080,000 | −91% (pruned) |
| Git references | ~12,200,000 | 9,860 | −99.9% (cleaned) |
| Blob storage | 214 GiB | 86.8 GiB | −59% (smaller packs) |
| Files in checkout | ~444,000 | ~444,000 | ~0% (no change) |
| Latest code size | ~9.9 GiB | ~8.4 GiB | ~−15% (slightly leaner) |
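
The reduction column can be checked directly from the before and after values. This short Python snippet recomputes each percentage from the figures in the table; the helper name is ours.

```python
def reduction_percent(before, after):
    """Percentage by which a metric shrank."""
    return (before - after) / before * 100


metrics = {
    "Commits": (13_000_000, 15_800),
    "Git trees": (23_600_000, 2_080_000),
    "Git references": (12_200_000, 9_860),
    "Blob storage (GiB)": (214, 86.8),
    "Latest code size (GiB)": (9.9, 8.4),
}
for name, (before, after) in metrics.items():
    print(f"{name}: -{reduction_percent(before, after):.1f}%")
```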

Key challenges and lessons learned

Such a large-scale migration wasn’t without its hiccups and lessons. Here are some challenges we faced and what we learned:

Git LFS woes

Initially, GitLab rejected some commits due to missing LFS objects, even old commits that we weren’t keeping. This happened because GitLab’s push hook expected the content of LFS pointers, even if the files weren’t required. To fix this, we had to allow incomplete pushes and skip LFS download errors. We also wrote logic to selectively fetch LFS objects for commits we were keeping. This ensured that any binary assets needed by tagged commits were present in the new repo. The takeaway is that LFS adds complexity to history rewrites – plan for it by adjusting Git LFS settings (e.g., lfs.allowincompletepush) and verifying important large files are carried over.

Pipeline token scoping

Right after the cutover, some CI pipelines failed to access resources. We discovered a GitLab CI/CD pipeline token issue – our new repo’s ID wasn’t in the allowed list for certain secure token scopes. We quickly updated the settings to include the new project, resolving the authorization error. If your CI jobs interact with other projects or use project-scoped tokens, remember to update those references when you migrate repositories.

Commit hash references broke

One of our internal tools was using commit SHA-1 hashes to track deployed versions. Since rewriting history means changing all commit hashes, the tool couldn’t find the expected commits. The solution was to map old hashes to new ones for the tagged releases, or better, to modify the tool to use tag names instead of raw hashes going forward. We learned to communicate early with teams that have any dependency on Git commit IDs or history assumptions. In our case, providing a mapping of old tag→new tag (which were mostly 1-to-1 except for the commit SHA) helped them adjust. In hindsight, using stable identifiers like semantic version tags is much more robust than relying on commit hashes, which are ephemeral in a rewritten history.

Developer concerns: “Where’s my history?”

A few engineers were concerned when they noticed that the git log in the new repo only showed two years of history. From their perspective, useful historical context seemed gone. We addressed this by pointing them to the archived full-history repo. In fact, we kept the old repository read-only in our GitLab, so anyone can still search the old history if needed (just not in the main repo). Additionally, we received suggestions on making the archive easily accessible or even automating a way to query old commits on demand. From this we learned that if you prune history, you should ensure there’s a plan to access legacy information for those rare times it’s needed, whether that’s an archive repo, a Git bundle, or a read-only mirror.

Office network bottleneck

Interestingly, after the migration, a few developers in certain offices didn’t feel a huge speed improvement in clones. It turned out their corporate network/VPN was the limiting factor – cloning 8 GiB vs 10 GiB over a slow link is not a night and day difference. This highlighted that we should continue to work with the IT team on improving network performance. The repo is faster, but the environment matters too. We’re using this as an opportunity to improve our office VPN throughput so that the 36% clone improvement is realized by everyone, not just CI machines.

Automation and hardcoded IDs

We had a lot of automation around the monorepo (scripts, webhooks, integrations). Most of these referenced the project by name, which remained the same, so they were fine. However, a few used the project’s numeric ID in the GitLab API, which changed when we created a new repo. Those broke. We had to scan and update some configs to use the new project ID. Our learning here is to audit all external references such as CI configs, deploy scripts, and monitoring jobs when migrating repositories. Ideally, use identifiable names instead of IDs, or ensure you’re prepared to update them during the cutover.
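
One way to avoid hardcoded numeric IDs is to address the project by its path, which the GitLab REST API accepts in URL-encoded form. The Python sketch below shows the encoding; the host and project path are hypothetical examples.

```python
from urllib.parse import quote


def project_api_url(gitlab_base, project_path):
    """Address a project by its URL-encoded path rather than its numeric ID."""
    return f"{gitlab_base}/api/v4/projects/{quote(project_path, safe='')}"


url = project_api_url("https://gitlab.example.com", "backend/go-monorepo")
print(url)
```

A path-based reference keeps working when a project is recreated under the same name, whereas a numeric ID has to be updated everywhere it is used.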

Adjusting to new boundaries

Some teams had to adjust their workflows after the prune. For instance, one team was in the habit of digging into 3 to 5 year old commit logs to debug issues. Post-migration, git log doesn’t go back that far in the main repo; they have to consult the archive for that. It’s a cultural shift to not have all history at your fingertips. We held a short information session to explain how to access the archived repo and emphasized the benefits (faster operations) that come with the lean history. After a while, teams embraced the new normal, appreciating the speed and rarely needing the older commits anyway.

In the end, we had zero data loss – all actual code and tags were preserved – and only some minor inconveniences that were resolved within a day or two. The challenges reinforced the importance of thorough testing (our staging dry-runs caught many issues) and cross-team communication when making such a change.

Impact and next steps

This migration transformed our development infrastructure from a bottleneck into a performance enabler. We eliminated the single point of failure, restored confidence in our Git operations, and created a foundation that can support our growing engineering team.

As the next step, we plan to generalize our pruning script to apply the same optimization techniques to other repositories, ensuring consistency and scalability across our infrastructure. Additionally, we will implement continuous performance monitoring to track repository health and proactively address any emerging issues. To prevent future repository bloat, we aim to establish clear best practices and guidelines, empowering teams to maintain efficiency while supporting the growth of our engineering operations.

Conclusion

What started as a performance crisis became one of our most successful infrastructure projects. By focusing on the right problems—infrastructure reliability and performance rather than just size—we achieved dramatic improvements that benefit every developer daily.

The key takeaway is that sometimes the biggest technical challenges require custom solutions, careful planning, and willingness to iterate until you find what works. Our 99% improvement in replication performance is just the beginning of what’s possible when you tackle infrastructure problems systematically.

This migration was completed by the Grab Tech Infra DevTools team, involving months of analysis, custom tooling development, and careful production migration of critical infrastructure serving thousands of developers across multiple time zones.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Troubleshooting network connectivity and performance with Cloudflare AI

Post Syndicated from Chris Draper original https://blog.cloudflare.com/AI-troubleshoot-warp-and-network-connectivity-issues/

Monitoring a corporate network and troubleshooting any performance issues across that network is a hard problem, and it has become increasingly complex over time. Imagine that you’re maintaining a corporate network, and you get the dreaded IT ticket. An executive is having a performance issue with an application, and they want you to look into it. The ticket doesn’t have a lot of details. It simply says: “Our internal documentation is taking forever to load. PLS FIX NOW”.

In the early days of IT, a corporate network was built on-premises. It provided network connectivity between employees that worked in person and a variety of corporate applications that were hosted locally.

The shift to cloud environments, the rise of SaaS applications, and a “work from anywhere” model have made IT environments significantly more complex in the past few years. Today, it’s hard to know if a performance issue is the result of:

  • An employee’s device

  • Their home or corporate Wi-Fi

  • The corporate network

  • A cloud network hosting a SaaS app

  • An intermediary ISP

A performance ticket submitted by an employee might even be a combination of multiple performance issues all wrapped together into one nasty problem.

Cloudflare built Cloudflare One, our Secure Access Service Edge (SASE) platform, to protect enterprise applications, users, devices, and networks. In particular, this platform relies on two capabilities to simplify troubleshooting performance issues:

  • Cloudflare’s Zero Trust client, also known as WARP, encrypts traffic from devices and forwards it to Cloudflare’s edge.

  • Digital Experience Monitoring (DEX) works alongside WARP to monitor device, network, and application performance.

We’re excited to announce two new AI-powered tools that will make it easier to troubleshoot WARP client connectivity and performance issues. We’re releasing a new WARP diagnostic analyzer in the Zero Trust dashboard and an MCP (Model Context Protocol) server for DEX. Today, every Cloudflare One customer has free access to both of these new features by default.

WARP diagnostic analyzer

The WARP client provides diagnostic logs that can be used to troubleshoot connectivity issues on a device. For desktop clients, the most common issues can be investigated with the information captured in logs called WARP diagnostic. Each WARP diagnostic log contains an extensive amount of information spanning days of captured events occurring on the client. It takes expertise to manually go through all of this information and understand the full picture of what is occurring on a client that is having issues. In the past, we’ve advised customers having issues to send their WARP diagnostic log straight to us so that our trained support experts can do a root cause analysis for them. While this is effective, we want to give our customers the tools to take control of deciphering common troubleshooting issues for even quicker resolution. 

Enter the WARP diagnostic analyzer, a new AI available for free in the Cloudflare One dashboard as of today! This AI demystifies information in the WARP diagnostic log so you can better understand events impacting the performance of your clients and network connectivity. Now, when you run a remote capture for WARP diagnostics in the Cloudflare One dashboard, you can generate an AI analysis of the WARP diagnostic file. Simply go to your organization’s Zero Trust dashboard and select DEX > Remote Captures from the side navigation bar. After you successfully run diagnostics and produce a WARP diagnostic file, you can open the status details and select View WARP Diag to generate your AI analysis.


In the WARP Diag analysis, you will find a Cloudy summary of events that we recommend a deeper dive into.


Below this summary is an events section, where the analyzer highlights events that commonly occur when there are client and connectivity issues.


Expanding on any of the events detected will reveal a detailed page explaining the event, recommended resources to help troubleshoot, and a list of time stamped recent occurrences of the event on the device.


To further help with troubleshooting, we’ve added a Device and WARP details section at the bottom of this page with a quick view of the device specifications and WARP configurations such as operating system, WARP version, and the device profile ID.


Finally, we’ve made it easy to take all the information created in your AI summary with you by navigating to the JSON file tab and copying the contents. Your WARP Diag file is also available to download from this screen for any further analysis.


MCP server for DEX

Alongside the new WARP Diagnostic Analyzer, we’re excited to announce that all Cloudflare One customers have access to an MCP (Model Context Protocol) server for our Digital Experience Monitoring (DEX) product. Let’s dive into how this will save our customers time and money.

Cloudflare One customers use Digital Experience Monitoring (DEX) to monitor devices across their employee network and troubleshoot any connectivity or performance issues. Like many products at Cloudflare, every data point generated by DEX is available to customers via Cloudflare’s API or log ingestion. DEX API and log data is valuable because it enables enterprises to create custom analytics for their devices’ connectivity and performance in a SIEM (Security Information and Event Management). Building new data pipelines and various dashboards can take a lot of time and can be expensive. Some Cloudflare One customers aren’t able to dedicate the engineering time to build custom analytics, whether due to budget, resource constraints, time constraints, or other factors.

Model Context Protocol (MCP) is an open standard that allows AI models to connect to other applications. In this instance, using MCP together with DEX gives users a custom analytics experience without all the work of integrating log data into a SIEM. Network security practitioners can type out a question (like “Show me the performance data for [email protected]’s device in a few graphs”), and an MCP server will answer that question with a customized graph on Alice’s device data from the DEX API.

Troubleshooting DNS query performance with the DEX MCP server

Any IT admin, security engineer, or network engineer can ask the MCP server a question like: “I received an IT ticket from [email protected] who is an enterprise employee. He is complaining about the performance of his device. Can you investigate this for me?”.

After you submit this question, the DEX MCP server will start by reviewing the device’s overall health.

Overall Device Health:
✅ Connection Status: Consistently connected for the past 24 hours with 100% availability
✅ WARP Mode: Running in optimal "warp+doh" (WARP + DNS over HTTPS) mode
✅ Client Version: Up-to-date with version 2025.1.7.2

So far, the device’s health looks great. Next, the DEX MCP server will begin analyzing the results of the performance tests that are configured on Bob’s device.

📊 Performance Metrics Analysis

HTTP Response Times:
Google.com test: Average 247ms (Range: 81-649ms)
  • Shows significant improvement throughout the day: Started at ~317ms, dropped to ~211ms by evening
Wiki.internal.acme.com: Average 1,374ms (Range: 877-2,431ms)
  • This is notably high and could be the source of performance complaints
  • DNS resolution particularly slow: Average 616ms with spikes up to 1,600ms

Now we’re making progress in our troubleshooting effort. The DEX MCP server identified that the HTTP GET test for wiki.internal.acme.com has a high average resource fetch time when compared to other websites (like google.com). In particular, high resource fetch time is the result of slow DNS resolution. Now, the DEX MCP server will summarize its findings.

🚨 Performance Issues Identified
Extremely Slow DNS Resolution:
  • Average DNS response time of 616ms is significantly high
  • Normal DNS should be <50ms for good performance
High Resource Fetch Times on wiki.internal.acme.com:
  • 1.3+ second response times indicate potential application-level issues
  • Could be related to DNS performance or server-side delays
HTTP Status Code Issues:
  • wiki.internal.acme.com is showing consistent HTTP 400 errors
  • This suggests potential configuration or routing issues

Now, we’ve successfully identified that Bob’s slow wiki performance is the result of a high average DNS response time. Slow DNS resolution increases the average resource fetch time, and also generates HTTP 400 errors for Enterprise employees that are attempting to access wiki.internal.acme.com. Notably, it’s likely that if the wiki is seeing slow DNS response times, other internal applications are experiencing performance issues as well.

We can ask the DEX MCP server to suggest potential solutions that would fix this wider DNS performance issue.

🔧 DNS Optimization Strategies

Check WARP DNS configuration and settings
  • Since the device is using WARP+DoH mode, you can optimize the DNS settings
Implement or improve local DNS caching
  • Implement or improve local DNS caching to reduce repeated lookups
Network stack tuning
  • Optimize TCP / UDP settings for DNS

Try out the DEX MCP server today

Fast and easy option for testing an MCP server

Any Cloudflare One customer with a Free, PayGo, or ENT plan can start using the DEX MCP server in less than one minute. The fastest and easiest way to try out the DEX MCP server is to visit playground.ai.cloudflare.com. There are five steps to get started:

  1. Copy the URL for the DEX MCP server: https://dex.mcp.cloudflare.com/sse

  2. Open playground.ai.cloudflare.com in a browser

  3. Find the section in the left side bar titled MCP Servers

  4. Paste the URL for the DEX MCP server into the URL input box and click Connect

  5. Authenticate your Cloudflare account, and then start asking questions to the DEX MCP server

It’s worth noting that end users will need to ask specific and explicit questions to the DEX MCP server to get a response. For example, you may need to say, “Set my production account as the active account”, and then give the separate command, “Fetch the DEX test results for the user [email protected] over the past 24 hours”.

Better experience for MCP servers that requires additional steps

Customers will get a more flexible prompt experience by configuring the DEX MCP server with their preferred AI assistant (Claude, Gemini, ChatGPT, etc.) that has MCP server support. MCP server support may require a subscription for some AI assistants. You can read the Digital Experience Monitoring – MCP server documentation for step by step instructions on how to get set up with each of the major AI assistants that are available today.

As an example, you can configure the DEX MCP server in Claude by downloading the Claude Desktop client, then selecting Claude Code > Developer > Edit Config. You will be prompted to open “claude_desktop_config.json” in a code editor of your choice. Simply add the following JSON configuration, and you’re ready to use Claude to call the DEX MCP server.

{
  "globalShortcut": "",
  "mcpServers": {
    "cloudflare-dex-analysis": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://dex.mcp.cloudflare.com/sse"
      ]
    }
  }
}
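
If claude_desktop_config.json already contains other MCP servers, it may be safer to merge the DEX entry into the existing "mcpServers" object than to paste over the whole file. A minimal Python sketch of that merge, assuming the JSON layout shown above; the helper name add_dex_server and the "other-server" entry are hypothetical, and the sketch works on an in-memory dict rather than on any particular file path:

```python
import json

# The server entry from the configuration above.
DEX_ENTRY = {
    "command": "npx",
    "args": ["mcp-remote", "https://dex.mcp.cloudflare.com/sse"],
}


def add_dex_server(config: dict) -> dict:
    """Return a copy of config with the DEX MCP server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["cloudflare-dex-analysis"] = DEX_ENTRY
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "other"}}}
updated = add_dex_server(existing)
print(json.dumps(updated, indent=2))
```

The original dict is left untouched, so the result can be inspected before anything is written back to disk.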

Get started with Cloudflare One today

Are you ready to secure your Internet traffic, employee devices, and private resources without compromising speed? You can get started with our new Cloudflare One AI-powered tools today.

The WARP diagnostic analyzer and the DEX MCP server are generally available to all customers. Head to the Zero Trust dashboard to run a WARP diagnostic and learn more about your client’s connectivity with the WARP diagnostic analyzer. You can test out the new DEX MCP server (https://dex.mcp.cloudflare.com/sse) in less than one minute at playground.ai.cloudflare.com, and you can also configure an AI assistant like Claude to use the new DEX MCP server.

If you don’t have a Cloudflare account, and you want to try these new features, you can create a free account for up to 50 users. If you’re an Enterprise customer, and you’d like a demo of these new Cloudflare One AI features, you can reach out to your account team to set up a demo anytime. 

You can stay up to date on the latest feature releases across the Cloudflare One platform by following the Cloudflare One changelogs and joining the conversation in the Cloudflare community hub or on our Discord Server.


Reducing double spend latency from 40 ms to < 1 ms on privacy proxy

Post Syndicated from Ben Yang original https://blog.cloudflare.com/reducing-double-spend-latency-from-40-ms-to-less-than-1-ms-on-privacy-proxy/

One of Cloudflare’s big focus areas is making the Internet faster for end users. Part of the way we do that is by looking at the “big rocks” or bottlenecks that might be slowing things down — particularly processes on the critical path. When we recently turned our attention to our privacy proxy product, we found a big opportunity for improvement.

What is our privacy proxy product? These proxies let users browse the web without exposing their personal information to the websites they’re visiting. Cloudflare runs infrastructure for privacy proxies like Apple’s Private Relay and Microsoft’s Edge Secure Network.

Like any secure infrastructure, we make sure that users authenticate to these privacy proxies before we open up a connection to the website they’re visiting. In order to do this in a privacy-preserving way (so that Cloudflare collects the least possible information about end-users) we use an open Internet standard – Privacy Pass – to issue tokens that authenticate to our proxy service.

Every time a user visits a website via our Privacy Proxy, we check the validity of the Privacy Pass token which is included in the Proxy-Authorization header in their request. Before we cryptographically validate a user’s token, we check if this token has already been spent. If the token is unspent, we let the user request through. Otherwise, it’s a “double-spend”. From an access control perspective, double-spends are indicative of a problem. From a privacy perspective, double-spends can reduce the anonymity set and privacy characteristics. From a performance perspective, our privacy proxies see millions of requests per second – and any time spent authenticating delays people from accessing sites – so the check needs to be fast. Let’s see how we reduced the latency of these double-spend checks from ~40 ms to <1 ms.

How did we discover the issue?

We use a tracing platform, Jaeger. It lets us see which paths our code took and how long functions took to run. When we looked into these traces, we saw latencies of ~ 40 ms. It was a good lead, but it alone was not enough to conclude it was an issue. The reason was we only sample a small percentage of our traces, so what we saw was not the whole picture. We needed to look at more data. We could’ve increased how many traces we sampled, but traces are large and heavy for our systems to process. Metrics are a lighter weight solution. We added metrics to get data on all double-spend checks.
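
To illustrate why a metric can cover every check where a trace cannot, here is a minimal Python sketch of a fixed-bucket latency histogram: each observation only increments one counter, so memory stays constant however many requests are recorded. The bucket boundaries, the sample values, and the class name LatencyHistogram are assumptions for illustration, not the metrics library used in production.

```python
import bisect


class LatencyHistogram:
    """Fixed-bucket histogram: constant memory no matter how many samples."""

    def __init__(self, upper_bounds_ms):
        self.upper_bounds_ms = sorted(upper_bounds_ms)
        # One counter per bucket, plus an overflow bucket at the end.
        self.counts = [0] * (len(self.upper_bounds_ms) + 1)

    def observe(self, latency_ms):
        # bisect_left finds the first bucket whose upper bound >= latency.
        index = bisect.bisect_left(self.upper_bounds_ms, latency_ms)
        self.counts[index] += 1

    def median_bucket(self):
        """Return the upper bound of the bucket that holds the median sample."""
        total = sum(self.counts)
        if total == 0:
            return None
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen * 2 >= total:
                break
        if i < len(self.upper_bounds_ms):
            return self.upper_bounds_ms[i]
        return float("inf")


hist = LatencyHistogram([1, 5, 10, 20, 30, 40, 50, 100])
# Made-up latencies in milliseconds.
for sample in [0.4, 41.0, 43.5, 44.2, 47.9, 38.0, 45.1]:
    hist.observe(sample)
print(hist.median_bucket())
```

The trade-off is resolution: a histogram can say the median falls in a given bucket, but it cannot say which code path a particular slow request took, which is what traces are for.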


The lines in this graph are median latencies we saw for the slowest privacy proxies around the world. The metrics data gave us confidence that it was a problem affecting a large portion of requests… assuming that ~ 45 ms was longer than expected. But, was it expected? What numbers did we expect?

The expected latency

To understand what times are reasonable to expect, let’s go into detail on what makes up a “double-spend check”. When we do a double-spend check, we ask a backing data store if a Privacy Pass token exists. The data store we use is memcached. We have many memcached instances running on servers around the world, so which server do we ask? For this, we use mcrouter. Instead of figuring out which memcached server to ask, we give our request to mcrouter, and it will handle choosing a good memcached server to use. We looked at the median time it took for mcrouter to process our request. This graph shows the average latencies per server over time. There are spikes, but most of the time the latency is < 1 ms. 


By this point, we were confident that double-spend check latencies were longer than expected everywhere, and we started looking for the root cause.

How did we investigate the issue?

We took inspiration from the scientific method. We analyzed our code, created theories for why sections of code caused latency, and used data to reject those theories. For any remaining theories, we implemented fixes and tested if they worked.

Let’s look at the code. At a high level, the double-spend checking logic is:

  1. Get a connection, which can be broken down into:

    1. Send a memcached version command. This serves as a health check for whether the connection is still good to send data on.

    2. If the connection is still good, acquire it. Otherwise, establish a new connection.

  2. Send a memcached get command on the connection.
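
The two steps above can be sketched in Python as follows. This is a simplified illustration rather than the production Rust code: FakeConnection stands in for a real memcached or mcrouter connection, a plain list stands in for the connection pool, and all names are hypothetical.

```python
class FakeConnection:
    """Stand-in for a memcached/mcrouter connection (hypothetical)."""

    def __init__(self, store, healthy=True):
        self.store = store
        self.healthy = healthy

    def version(self):
        # Health check: a real client would send "version\r\n" here.
        if not self.healthy:
            raise ConnectionError("stale connection")
        return "VERSION fake"

    def get(self, key):
        return self.store.get(key)


def get_connection(pool, store):
    """Step 1: reuse a pooled connection if it passes the health check."""
    if pool:
        conn = pool.pop()
        try:
            conn.version()
            return conn
        except ConnectionError:
            pass  # stale: fall through and establish a new connection
    return FakeConnection(store)


def is_double_spend(token, pool, store):
    """Step 2: a token that already exists in the store was spent before."""
    conn = get_connection(pool, store)
    try:
        return conn.get(token) is not None
    finally:
        pool.append(conn)  # hand the connection back for reuse


store = {"token-a": b"1"}
pool = [FakeConnection(store, healthy=False)]
print(is_double_spend("token-a", pool, store))
print(is_double_spend("token-b", pool, store))
```

Each step is a separate place where time can be spent, which is why the theories below examine them one at a time.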

Let’s go through the theories we had for each step listed above.

Theory 1: health check takes long

We measured the health check primarily as a sanity check. The version command is simple and fast to process, so it should not take long. And we remained sane. The median latency was < 1 ms.


Theory 2: waiting to get a connection

To understand why we may need to wait to get a connection, let’s go into more detail on how we get a connection. In our code, we use a connection pool. The pool is a set of ready-to-go connections to mcrouter. The benefit of having a pool is that we do not have to pay the overhead of establishing a connection every time we want to make a request. Pools have a size limit, though. Our limit was 20 per server, and this is where a potential problem lies. Imagine we have a server that processes 5,000 requests every second, and requests stay for 45 ms. We can use something called Little’s Law to estimate the average number of requests in our system: 5000 x 0.045 = 225. Due to our pool size limits, we can only have 20 connections at a time, so we can only process 20 requests at any point in time. That means 205 requests are just waiting! When we do a double-spend check, maybe we’re waiting ~ 40 ms to get a connection?
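
The estimate above is a one-line calculation. A small Python sketch using the numbers from the text (5,000 requests per second, 45 ms per request, a pool of 20); the function name is hypothetical:

```python
def requests_in_system(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law: average number in the system = arrival rate * time spent."""
    return arrival_rate_per_s * time_in_system_s


POOL_SIZE = 20

in_flight = requests_in_system(5000, 0.045)
waiting = max(0.0, in_flight - POOL_SIZE)
print(f"in flight: {in_flight:.0f}, waiting for a connection: {waiting:.0f}")

# The same law applied to a quiet server: far fewer requests than connections.
quiet = requests_in_system(20, 0.045)
print(f"quiet server in flight: {quiet:.1f}")
```

The second calculation shows what the theory predicts for a lightly loaded server: with about one request in flight on average, a pool of 20 should never make a request wait, so such a server is a useful test of the theory.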

We looked at the metrics of many different servers. No matter what the requests per second was, the latency was consistently ~ 40 ms, disproving the theory. For example, this graph shows data from a server that saw a maximum of 20 requests per second. It shows a histogram over time, and the large majority of requests fall in the 40 – 50 ms bucket.


Theory 3: delays in Nagle’s algorithm and delayed acks

We decided to chat with Gemini, giving it the observations we had so far. It suggested many things, but the most interesting was to check if TCP_NODELAY was set. If we had set this option in our code, it would’ve disabled something called Nagle’s algorithm. Nagle’s algorithm itself was not a problem, but when enabled alongside another feature, delayed ACKs, latencies could creep in. To explain why, let’s go through an analogy.

Suppose we run a group chat app. Normally, people type a full thought and send it in one message. But, we have a friend who sends one word at a time: “Hi”. Send. “how”. Send. “are”. Send. “you”. Send. That’s a lot of notifications. Nagle’s algorithm aims to prevent this. Nagle says that if the friend wants to send one short message, that’s fine, but it only lets them do it once per turn. When they try to send more single words right after, Nagle will save the words in a draft message. Once the draft message hits a certain length, Nagle sends. But what if the draft message never hits that length? To manage this, delayed ACKs initiates a 40 ms timer whenever the friend sends a message. If the app gets no further input before the timer ends, the message is sent to the group.

We took a closer look at the code, both Cloudflare-authored code and code from dependencies we rely on. We depended on the memcache-async crate for implementing the code that lets us send memcache commands. Here is the code for sending a memcached version command:

self.io.write_all(b"version\r\n").await?;
self.io.flush().await?;

Nothing out of the ordinary. Then, we looked inside the get function.

let writer = self.io.get_mut();
writer.write_all(b"get ").await?;
writer.write_all(key.as_ref()).await?;
writer.write_all(b"\r\n").await?;
writer.flush().await?;

In our code, we set io as a TcpStream, meaning that each write_all call resulted in sending a message. With Nagle’s algorithm enabled, the data flow looked like this:


Oof. We tried to send all three small messages, but after we sent the “get ”, the kernel put the token and \r\n in a buffer and started waiting. When mcrouter got the “get ”, it could not do anything because it did not have the full command. So, it waited 40 ms. Then, it sent an ACK in response. We got the ACK, and sent the rest of the command in the buffer. mcrouter got the rest of the command, processed it, and returned a response telling us if the token exists. What would the data flow look like with Nagle’s algorithm disabled?


We would send all three small messages. mcrouter would have the full command, and return a response immediately. No waiting, whatsoever.
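
For reference, disabling Nagle’s algorithm is a single socket option. The Python sketch below sets TCP_NODELAY on a fresh TCP socket and reads it back; no connection is needed for that. It only illustrates the option discussed here and is not the change that was eventually deployed.

```python
import socket

# Create a TCP socket; the option can be set and read without connecting.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    before = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    # A non-zero value disables Nagle's algorithm for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    after = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
finally:
    sock.close()

print(f"TCP_NODELAY before: {before}, after: {after}")
```

With the option set, every small write is sent immediately, so the three writes would still go out as three separate small packets, just without the wait.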

Why 40 ms?

Our Linux servers have minimum bounds for the delay. Here is a snippet of Linux source code that defines those bounds.

#if HZ >= 100
#define TCP_DELACK_MIN	((unsigned)(HZ/25))	/* minimal time to delay before sending an ACK */
#define TCP_ATO_MIN	((unsigned)(HZ/25))
#else
#define TCP_DELACK_MIN	4U
#define TCP_ATO_MIN	4U
#endif

The comment tells us that TCP_DELACK_MIN is the minimum time delayed ACKs will wait before sending an ACK. We spent some time digging through Cloudflare’s custom kernel settings and found this:

CONFIG_HZ=1000

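CONFIG_HZ eventually propagates to HZ. With HZ = 1000, TCP_DELACK_MIN is HZ/25 = 40 timer ticks (jiffies), and each tick lasts 1/1000 of a second, so the minimum delay is 40 ms. That’s where the number comes from!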
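
The arithmetic can be checked with a short Python sketch that mirrors the macro above and converts timer ticks (jiffies) to milliseconds, using the fact that one tick lasts 1/HZ seconds. The function name is hypothetical.

```python
def delack_min_ms(hz: int) -> float:
    """Mirror the TCP_DELACK_MIN macro and convert jiffies to milliseconds."""
    jiffies = hz // 25 if hz >= 100 else 4
    return jiffies * 1000 / hz


for hz in (100, 250, 1000):
    print(hz, delack_min_ms(hz))
```

For any HZ of 100 or more that is a multiple of 25, the tick count and the tick length cancel out, so the minimum comes to 40 ms regardless of the exact value.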

The fix

We were sending three separate messages for a single command when we only needed to send one. We captured what a get command looked like in Wireshark to verify we were sending three separate messages. (We captured this locally on macOS. Interestingly, we got an ACK for every message.)


The fix was to use BufWriter<TcpStream> so that write_all would buffer the small messages in a user-space memory buffer, and flush would send the entire memcached command in one message. The Wireshark capture looked much cleaner.
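
The effect of the buffer can be demonstrated without a network. In the Python sketch below, CountingSocket is a hypothetical fake stream that records every write it receives, with each recorded write standing in for one message handed to the kernel. Python’s io.BufferedWriter plays the role that BufWriter plays in the Rust code.

```python
import io


class CountingSocket(io.RawIOBase):
    """Fake unbuffered stream that records every write it receives."""

    def __init__(self):
        super().__init__()
        self.writes = []

    def writable(self):
        return True

    def write(self, data):
        self.writes.append(bytes(data))
        return len(data)


def send_get(writer, key: bytes):
    # The same three writes and one flush as the get function shown earlier.
    writer.write(b"get ")
    writer.write(key)
    writer.write(b"\r\n")
    writer.flush()


# Unbuffered: every write reaches the "socket" on its own.
unbuffered = CountingSocket()
send_get(unbuffered, b"token123")

# Buffered: the writes are collected in user space and sent on flush.
raw = CountingSocket()
buffered = io.BufferedWriter(raw)
send_get(buffered, b"token123")

print(len(unbuffered.writes), len(raw.writes))
```

The calling code is identical in both cases; only the type of the writer changes, which is what makes this kind of fix small.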


Conclusion

After deploying the fix to production, we saw the median double-spend check latency drop to expected values everywhere.


Our investigation followed a systematic, data-driven approach. We began by using observability tools to confirm the problem’s scale. From there, we formed testable hypotheses and used data to systematically disprove them. This process ultimately led us to a subtle interaction between Nagle’s algorithm and delayed ACKs, caused by how we made use of a third-party dependency.

Ultimately, our mission is to help build a better Internet. Every millisecond saved contributes to a faster and more seamless, private browsing experience for end users. We’re excited to have this rolled out and excited to continue to chase further performance improvements!

Building Jetflow: a framework for flexible, performant data pipelines at Cloudflare

Post Syndicated from Harry Hough original https://blog.cloudflare.com/building-jetflow-a-framework-for-flexible-performant-data-pipelines-at-cloudflare/

The Cloudflare Business Intelligence team manages a petabyte-scale data lake and ingests thousands of tables every day from many different sources. These include internal databases such as Postgres and ClickHouse, as well as external SaaS applications such as Salesforce. These tasks are often complex and tables may have hundreds of millions or billions of rows of new data each day. They are also business-critical for product decisions, growth planning, and internal monitoring. In total, about 141 billion rows are ingested every day.

As Cloudflare has grown, the data has become ever larger and more complex. Our existing Extract Load Transform (ELT) solution could no longer meet our technical and business requirements. After evaluating other common ELT solutions, we concluded that their performance generally did not surpass our current system, either.

It became clear that we needed to build our own framework to cope with our unique requirements — and so Jetflow was born. 

What we achieved

Over 100x efficiency improvement in GB-s:

  • Our longest running job with 19 billion rows was taking 48 hours using 300 GB of memory, and now completes in 5.5 hours using 4 GB of memory

  • We estimate that ingestion of 50 TB from Postgres via Jetflow could cost under $100 based on rates published by commercial cloud providers
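
As a worked example of the GB-s metric, the Python sketch below multiplies memory by run time for the longest-running job above. It assumes the quoted memory figure is held for the whole run, which is a simplification, and the function name is hypothetical.

```python
def gb_seconds(hours: float, memory_gb: float) -> float:
    """Memory held multiplied by the time it was held."""
    return hours * 3600 * memory_gb


before = gb_seconds(48, 300)
after = gb_seconds(5.5, 4)
print(f"before: {before:,.0f} GB-s, after: {after:,.0f} GB-s, ratio: {before / after:.0f}x")
```

Under that assumption, this one job improves by well over the 100x quoted in the heading.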

>10x performance improvement:

  • Our largest dataset was ingesting 60,000-80,000 rows per second; it now ingests 2-5 million rows per second per database connection.

  • In addition, these numbers scale well with multiple database connections for some databases.

Extensibility: 

  • The modular design makes it easy to extend and test. Today Jetflow works with ClickHouse, Postgres, Kafka, many different SaaS APIs, Google BigQuery and many others. It has continued to work well and remain flexible with the addition of new use cases.

How did we do this?

Requirements

The first step to designing our new framework had to be a clear understanding of the problems we were aiming to solve, with clear requirements to stop us creating new ones.

Performant & efficient

We needed to be able to move more data in less time as some ingestion jobs were taking ~24 hours, and our data will only grow. The data should be ingested in a streaming fashion and use less memory and compute resources than our existing solution.

Backwards compatible 

Given the daily ingestion of thousands of tables, the chosen solution needed to allow for the migration of individual tables as needed. Due to our usage of Spark downstream and Spark’s limitations in merging disparate Parquet schemas, the chosen solution had to offer the flexibility to generate the precise schemas needed for each case to match legacy.

We also required seamless integration with our custom metadata system, used for dependency checks and job status information.

Ease of use

We want a configuration file that can be version-controlled, without introducing bottlenecks on repositories with many concurrent changes.

To increase accessibility for different roles within the team, another requirement was no-code (or configuration as code) in the vast majority of cases. Users should not have to worry about availability or translation of data types between source and target systems, or writing new code for each new ingestion. The configuration needed should also be minimal — for example, data schema should be inferred from the source system and not need to be supplied by the user.

Customizable

Striking a balance with the no-code requirement above: although we want a low barrier to entry, we also want the option to tune and override settings where desired, through a flexible and optional configuration layer. For example, writing Parquet files is often more expensive than reading from the database, so we want to be able to allocate more resources and concurrency as needed.

Additionally, we wanted to allow for control over where the work is executed, with the ability to spin up concurrent workers in different threads, different containers, or on different machines. The execution of workers and communication of data was abstracted away with an interface, and different implementations can be written and injected, controlled via the job configuration. 

Testable

We wanted a solution capable of running locally in a containerized environment, which would allow us to write tests for every stage of the pipeline. With “black box” solutions, testing often means validating the output after making a change, which is a slow feedback loop, risks not testing all edge cases as there isn’t good visibility of all code paths internally, and makes debugging issues painful.

Designing a flexible framework 

To build a truly flexible framework, we broke the pipeline down into distinct stages, and then created a config layer to define the composition of the pipeline from these stages, along with any configuration overrides. Every pipeline configuration that makes sense logically should execute correctly, and users should not be able to create pipeline configs that do not work.

Pipeline configuration

This led us to a design where we created stages which were classified according to the meaningfully different categories of:

  • Consumers

  • Transformers

  • Loaders


The pipeline was constructed via a YAML file that required a consumer, zero or more transformers, and at least one loader. Consumers create a data stream (via reading from the source system), Transformers (e.g. data transformations, validations) take a data stream input and output a data stream conforming to the same API so that they can be chained, and Loaders have the same data streaming interface, but are the stages with persistent effects — i.e. stages where data is saved to an external system. 
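
The composition rule above (a consumer, zero or more transformers, at least one loader, all sharing one streaming interface) can be sketched with Python generators. This is an illustration only: Jetflow itself is not written in Python, plain dicts stand in for its internal data format, and every stage name here is hypothetical.

```python
from typing import Callable, Iterator, List

Stream = Iterator[dict]


def build_pipeline(
    consumer: Callable[[], Stream],
    transformers: List[Callable[[Stream], Stream]],
    loaders: List[Callable[[Stream], Stream]],
) -> Stream:
    """Chain stages; reject compositions without a consumer or a loader."""
    if consumer is None:
        raise ValueError("a pipeline requires a consumer")
    if not loaders:
        raise ValueError("a pipeline requires at least one loader")
    stream = consumer()
    for stage in [*transformers, *loaders]:
        stream = stage(stream)
    return stream


# Hypothetical stages for illustration.
def read_rows() -> Stream:
    yield from ({"id": i, "name": f"row-{i}"} for i in range(5))


def drop_odd_ids(rows: Stream) -> Stream:
    return (row for row in rows if row["id"] % 2 == 0)


saved = []


def save_to_list(rows: Stream) -> Stream:
    for row in rows:
        saved.append(row)  # the persistent effect
        yield row          # pass through so further loaders can be chained


for _ in build_pipeline(read_rows, [drop_odd_ids], [save_to_list]):
    pass  # draining the stream drives every stage
print(saved)
```

Because every stage takes a stream and returns a stream, rows flow through one at a time and each stage can be tested on its own with a small in-memory input.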

This modular design means that each stage is independently testable, with shared behaviour (such as error handling and concurrency) inherited from shared base stages, significantly decreasing development time for new use cases and increasing confidence in code correctness.

Data divisions

Next, we designed a breakdown for the data that would allow the pipeline to be idempotent both on whole pipeline re-run and also on internal retry of any data partition due to transient error. We decided on a design that let us parallelize processing, while maintaining meaningful data divisions that allowed the pipeline to perform cleanups of data where required for a retry.

  • RunInstance: the least granular division, corresponding to a business unit for a single run of the pipeline (e.g. one month/day/hour of data). 

  • Partition: a division of the RunInstance that allows each row to be allocated to a partition in a way that is deterministic and self-evident from the row data without external state, and is therefore idempotent on retry. (e.g. an accountId range, a 10-minute interval)

  • Batch: a division of the partition data that is non-deterministic and used only to break the data into smaller chunks, so that it can be streamed and processed in parallel faster and with fewer resources. (e.g. 10k rows, 50 MB)

The options that the user configures in the consumer stage YAML both construct the query that is used to retrieve the data from the source system, and also encode the semantic meaning of this data division in a system-agnostic way, so that later stages understand what this data represents; for example, this partition contains the data for all account IDs 0-500. This means that we can do targeted data cleanup and avoid, for example, duplicate data entries if a single data partition is retried due to error.
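
A minimal Python sketch of how partitions and batches differ, assuming partitions are fixed-width accountId ranges and the target system is an in-memory dict. The function names, the partition width of 500, and the sample rows are hypothetical.

```python
from itertools import islice


def partition_for(account_id: int, partition_width: int = 500) -> tuple:
    """Deterministic: the row's own accountId decides its partition."""
    start = (account_id // partition_width) * partition_width
    return (start, start + partition_width - 1)


def batches(rows, batch_size: int):
    """Arbitrary chunking for streaming; carries no semantic meaning."""
    it = iter(rows)
    while chunk := list(islice(it, batch_size)):
        yield chunk


def load_partition(target: dict, partition: tuple, rows, batch_size: int = 2):
    """Idempotent load: clear whatever an earlier attempt left, then write."""
    target[partition] = []
    for batch in batches(rows, batch_size):
        target[partition].extend(batch)


rows = [{"accountId": i} for i in (3, 120, 499, 500, 742)]
by_partition = {}
for row in rows:
    by_partition.setdefault(partition_for(row["accountId"]), []).append(row)

target = {}
for partition, part_rows in by_partition.items():
    load_partition(target, partition, part_rows)

# Retrying one partition does not duplicate its rows.
load_partition(target, (0, 499), by_partition[(0, 499)])
print({p: len(r) for p, r in target.items()})
```

The cleanup is only possible because the partition key is derived from the row data: the loader knows exactly which slice of the target a retry is allowed to replace, whereas batch boundaries may differ between attempts.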


Framework implementation

Standard internal state for stage compatibility 

Our most common use case is something like read from a database, convert to Parquet format, and then save to object storage, with each of these steps being a separate stage. As more use cases were onboarded to Jetflow, we had to make sure that if someone wrote a new stage it would be compatible with the other stages. We don’t want to create a situation where new code needs to be written for every output format and target system, or you end up with a custom pipeline for every different use case.

The way we have solved this problem is by having our stage extractor class only allow output data in a single format. This means that as long as any downstream stages support this format as both their input and output format, they will be compatible with the rest of the pipeline. This seems obvious in retrospect, but internally was a painful learning experience, as we originally created a custom type system and struggled with stage interoperability.

For this internal format, we chose to use Arrow, an in-memory columnar data format. The key benefits of this format for us are:

  • Arrow ecosystem: Many data projects now support Arrow as an output format. This means when we write extractor stages for new data sources, it is often trivial to produce Arrow output.

  • No serialisation overhead: This makes it easy to move Arrow data between machines and even programming languages with minimum overhead. Jetflow was designed from the start to have the flexibility to be able to run in a wide range of systems via a job controller interface, so this efficiency in data transmission means there’s minimal compromise on performance when creating distributed implementations.

  • Reserve memory in large fixed-size batches to avoid memory allocations: As Go is a garbage collected (GC) language and GC cycle times are affected mostly by the number of objects rather than the sizes of those objects, fewer heap objects reduces CPU time spent garbage collecting significantly, even if the total size is the same. As the number of objects to scan, and possibly collect, during a GC cycle increases with the number of allocations, if we have 8192 rows with 10 columns each, Arrow would only require us to do 10 allocations versus the 8192 allocations of most drivers that allocate on a row by row basis, meaning fewer objects and lower GC cycle times with Arrow.
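The allocation arithmetic in the last point can be illustrated with a small sketch. It uses Python's standard `array` module as a stand-in for a contiguous column buffer; it does not use Arrow and says nothing about Go's garbage collector. Python also boxes the individual values inside each tuple, so the sketch only models the number of top-level containers.

```python
from array import array

NUM_ROWS = 8192
NUM_COLS = 10


def row_based(num_rows: int, num_cols: int) -> list:
    """One container object per row, as a row-by-row driver would produce."""
    return [
        tuple(float(r * num_cols + c) for c in range(num_cols))
        for r in range(num_rows)
    ]


def column_based(num_rows: int, num_cols: int) -> list:
    """One contiguous buffer of doubles per column."""
    return [
        array("d", (float(r * num_cols + c) for r in range(num_rows)))
        for c in range(num_cols)
    ]


rows = row_based(NUM_ROWS, NUM_COLS)
cols = column_based(NUM_ROWS, NUM_COLS)

# Same data, addressed as rows[r][c] or cols[c][r].
print(len(rows), "row containers vs", len(cols), "column buffers")
# prints "8192 row containers vs 10 column buffers"
```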

Converting rows to columns

Another important performance optimization was reducing the number of conversion steps that happen when reading and processing data. Most data ingestion frameworks internally represent data as rows. In our case, we are mostly writing data in Parquet format, which is column based. When reading data from column-based sources (e.g. ClickHouse, where most drivers receive RowBinary format), converting into row-based memory representations for the specific language implementation is inefficient. This is then converted again from rows to columns to write Parquet files. These conversions result in a significant performance impact.

Jetflow instead reads data from column-based sources in columnar formats (e.g. for ClickHouse, the native Block format) and then copies this data into Arrow column format. Parquet files are then written directly from Arrow columns. The simplification of this process improves performance.


Writing each pipeline stage

Case study: ClickHouse

When testing an initial version of Jetflow, we discovered that due to the architecture of ClickHouse, using additional connections would not be of any benefit, since ClickHouse was reading faster than we were receiving data. It should then be possible, with a more optimized database driver, to take better advantage of that single connection to read a much larger number of rows per second, without needing additional connections.

Initially, a custom database driver was written for ClickHouse, but we ended up switching to the excellent ch-go low level library, which directly reads Blocks from ClickHouse in a columnar format. This had a dramatic effect on performance in comparison to the standard Go driver. Combined with the framework optimisations above, we now ingest millions of rows per second with a single ClickHouse connection.

A valuable lesson learned is that as with any software, tradeoffs are often made for the sake of convenience or a common use case that may not match your own. Most database drivers tend not to be optimized for reading large batches of rows, and have high per-row overhead.

Case study: Postgres

For Postgres, we use the excellent jackc/pgx driver, but instead of using the database/sql Scan interface, we directly receive the raw bytes for each row and use the jackc/pgx internal scan functions for each Postgres OID (Object Identifier) type.

The database/sql Scan interface in Go uses reflection to understand the type passed to the function and then also uses reflection to set each field with the column value received from Postgres. In typical scenarios, this is fast enough and easy to use, but falls short for our use cases in terms of performance. The jackc/pgx driver reuses the row bytes produced each time the next Postgres row is requested, resulting in zero allocations per row. This allows us to write high-performance, low-allocation code within Jetflow. With this design, we are able to achieve nearly 600,000 rows per second per Postgres connection for most tables, with very low memory usage.

Conclusion

As of early July 2025, the team ingests 77 billion records per day via Jetflow. The remaining jobs are in the process of being migrated to Jetflow, which will bring the total daily ingestion to 141 billion records. The framework has allowed us to ingest tables in cases that would not otherwise have been possible, and provided significant cost savings due to ingestions running for less time and with fewer resources. 

In the future, we plan to open source the project, and if you are interested in joining our team to help develop tools like this, then open roles can be found at https://www.cloudflare.com/careers/jobs/.

Optimizing Incident Management with Zabbix and PagerDuty

Post Syndicated from Zabbix LatAm original https://blog.zabbix.com/optimizing-incident-management-with-zabbix-and-pagerduty/30114/

When monitoring environments, we sometimes need to rely on third-party tools to better manage functionality and optimize responses to alerts. Let’s explore how to integrate Zabbix with PagerDuty, a real-time incident management solution designed to improve the reliability of digital services, including best practices and configuration details.

What is PagerDuty?

PagerDuty is a real-time incident management platform designed to help IT teams react quickly to critical events. The tool helps organizations automate and manage incident response through a system of alerts, escalation, and coordination between teams. When a problem is detected in the system, PagerDuty notifies the responsible individuals and ensures that corrective action is taken quickly. This reduces downtime and improves operational efficiency. Integration with monitoring tools such as Zabbix makes it easy to identify issues before they impact users.

Some of PagerDuty’s key features include:

• Integration with monitoring tools (such as Zabbix)
• Notifications in multiple channels (email, SMS, calls)
• Automatic escalation of incidents to ensure agile responses
• Event analysis to improve the detection of recurring problems

How to integrate PagerDuty with Zabbix

In PagerDuty, go to “Services” and click on “Service Directory.” Create a new service.

Give it a proper name and description.

Accept the escalation terms and click “Next.”

On the next screen, select “Intelligent” and the “Auto-pause incident notifications” option, then click “Next.”

The next step is to add the Zabbix Webhook service, which will allow integration with Zabbix, and then click “Next.”

In Services > Service Directory, select the name of the service. In the “Integrations” tab, copy the integration token that is generated.

It is important to note that the PagerDuty webhook only shows the option of Zabbix versions 5.0 to 5.2, but it works correctly in later versions such as Zabbix 7.2, which was tested without any issues.

On Zabbix Server, go to Alerts > Media types > PagerDuty. Enter the integration token, the Zabbix URL, and select “Update.”

Send a test message to confirm that the integration is working correctly.

In the PagerDuty application, verify that the test alert was received.

To send notifications, you need to grant permissions to a user in Zabbix. Go to Users > Create User. In the “Media” tab, select PagerDuty as the notification method. Set the severity of the alerts you want to receive.

Subsequently, set up a Trigger Action in Alerts > Actions > Trigger Actions to define what types of alerts will be received (either by item or trigger) according to the needs of your team.

Best practices for integrating Zabbix and PagerDuty

Customize notifications: Set rules to send only truly critical alerts, avoiding unnecessary notifications.
Optimize escalations: Set up escalation rules so that alerts reach the right people at the right time.
Monitor key metrics: Measure incident response times and adjust workflows as needed.
Automate incident responses: Use PagerDuty’s capabilities to perform automated tasks in response to specific events.
Notify about service failures: Use PagerDuty to start running recovery scripts, send notifications to the responsible teams, or even escalate the problem to a higher level if there is no solution in a stipulated length of time.

Conclusion

Zabbix’s integration with PagerDuty allows you to monitor the status of critical services in real time, even outside of working hours. This facilitates rapid incident response and improves your IT team’s ability to react.

This combination not only optimizes incident management but also helps minimize downtime, improve operational efficiency, and ensure the reliability of monitored systems.

With proper configuration and best practices, integrating Zabbix with PagerDuty can become essential for the proactive management of your technological infrastructure.

The post Optimizing Incident Management with Zabbix and PagerDuty appeared first on Zabbix Blog.

Network performance update: Developer Week 2025

Post Syndicated from Emily Music original https://blog.cloudflare.com/network-performance-update-developer-week-2025/

As the Internet has become enmeshed in our everyday lives, so has our need for speed. No one wants to wait when adding shoes to our shopping carts, or accessing corporate assets from across the globe. And as the Internet supports more and more of our critical infrastructure, speed becomes more than just a measure of how quickly we can place a takeout order. It becomes the connective tissue between the systems that keep us safe, healthy, and organized. Governments, financial institutions, healthcare ecosystems, transit — they increasingly rely on the Internet. This is why at Cloudflare, building the fastest network is our north star. 

We’re happy to announce that we are the fastest network in 48% of the top 1000 networks by 95th percentile TCP connection time between November 2024 and March 2025, up from 44% in September 2024.

In this post, we’re going to share with you how our network performance has changed since our last post in September 2024, and talk about what makes us faster than other networks.  But first, let’s talk a little bit about how we get this data.

How does Cloudflare get this data?

It’s happened to all of us — you casually click on a site, and suddenly you’ve reached a Cloudflare-branded error page. While you are shaking your fist at the sky, something interesting is happening on the back end. Cloudflare is using Real User Monitoring (RUM) to collect the data used to compare our performance against other networks. The monitoring we do is slightly different than the RUM Cloudflare offers to customers. When the error page loads, a 100 KB file is fetched and loaded. This file is hosted on networks like Cloudflare, Akamai, Amazon CloudFront, Fastly, and Google Cloud CDN. Your browser processes the performance data, and sends it to Cloudflare, where we use it to get a clear view of how these different networks stack up in terms of speed. 

We’ve been collecting and refining this data since June 2021.  You can read more about how we collect that data here, and we regularly track our performance during Innovation Weeks to hold ourselves accountable to you that we are always in pursuit of being the fastest network in the world.

How are we doing?

In order to evaluate Cloudflare’s speed relative to others, we measure performance across the top 1000 “eyeball” networks using the list provided by the Asia Pacific Network Information Centre (APNIC). So-called “eyeball” networks are those with a large concentration of subscribers/end users.  This information is important, because it gives us signals for where we can expand our presence or peering, or optimize our traffic engineering. When benchmarking, we assess the 95th percentile TCP connection time. This is the time it takes a user to establish a TCP connection to the server they are trying to reach. This metric helps us illustrate how Cloudflare’s network makes your traffic faster by serving your customers as locally as possible. 
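A rough sketch of how such a comparison could be computed from raw connect-time samples. The post does not say which percentile definition is used, so the nearest-rank method here is an assumption, as are the function names and the shape of the input data.

```python
from collections import Counter
from typing import Dict, List

# measurements[network][provider] -> list of TCP connect times in ms
Measurements = Dict[str, Dict[str, List[float]]]


def p95(samples: List[float]) -> float:
    """95th percentile using the nearest-rank method (an assumption)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def fastest_by_network(measurements: Measurements) -> Dict[str, str]:
    """For each network, pick the provider with the lowest P95."""
    return {
        network: min(by_provider, key=lambda p: p95(by_provider[p]))
        for network, by_provider in measurements.items()
    }


def share_of_wins(winners: Dict[str, str]) -> Dict[str, float]:
    """Fraction of networks in which each provider ranks first."""
    counts = Counter(winners.values())
    return {provider: n / len(winners) for provider, n in counts.items()}
```

With the top 1000 networks as input, a provider that wins in 487 of them would get a share of 0.487 from `share_of_wins`.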

When we look at Cloudflare’s performance across the top 1000 networks, we can see that we’re fastest in 487, or over 48%, of these networks, between November 2024 and March 2025:


In September 2024, we ranked #1 in 44% of these networks:


So why did we jump?  To get a better understanding of why, let’s take a look at the countries where we improved, which will give us a better sense of where to dive in.  This is what our network map looked like in September 2024 (grey countries mean we do not have enough data or users to derive insights):


(September 2024)

Today, using those same 95th percentile TCP connect times, we rank #1 in 48% of networks and the network map looks like this:


(March 2025)

We made most of our gains in Africa, where countries that previously didn’t have enough samples saw an increase in samples, and Cloudflare pulled ahead. This could mean that there was either an increase in Cloudflare users, or an increase in error pages shown. These countries got faster almost exclusively due to the presence of our Edge Partner deployments, which are Cloudflare locations embedded in last mile networks. In next-generation markets like many African countries, these locations are crucial to being faster, as connectivity to end users tends to fall back to places like South Africa or London if in-country peering does not exist.

But let’s take a look at a couple of other places and see why we got faster.

In Canada, we were not the fastest in September 2024, but we are the fastest today. Today, we are the fastest in 40% of networks, which is the most out of all of our competitors:


But when you look at the overall country numbers, we see that the race for the fastest network is quite close:

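Canada 95th Percentile TCP Connect Time by Provider

| Rank | Entity     | Connect Time (P95) | #1 Diff           |
|------|------------|--------------------|-------------------|
| 1    | Cloudflare | 179 ms             | -                 |
| 2    | Fastly     | 180 ms             | +0.48% (+0.87 ms) |
| 3    | Google     | 180 ms             | +0.74% (+1.32 ms) |
| 4    | CloudFront | 182 ms             | +1.74% (+3.11 ms) |
| 5    | Akamai     | 215 ms             | +20% (+36 ms)     |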

The difference between Cloudflare and the third-fastest network is a little over a millisecond!  As we’ve pointed out previously, such fluctuations are quite common, especially at higher percentiles.  But there is still a significant difference between us and the slowest network; we’re around 20% faster.

However, looking at a place like Japan, where we were not the fastest in September 2024 but are now the fastest, there is a significant difference between Cloudflare and the number two network:

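Japan 95th Percentile TCP Connect Time by Provider

| Rank | Entity     | Connect Time (P95) | #1 Diff           |
|------|------------|--------------------|-------------------|
| 1    | Cloudflare | 116 ms             | -                 |
| 2    | Fastly     | 122 ms             | +5.23% (+6.08 ms) |
| 3    | Google     | 124 ms             | +6.21% (+7.22 ms) |
| 4    | CloudFront | 127 ms             | +8.91% (+10 ms)   |
| 5    | Akamai     | 153 ms             | +32% (+37 ms)     |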

Why is this? We are in more locations in Japan than our competitors and added more Edge Partner deployments in these locations, bringing us even closer to end-users. Edge Partner deployments are collaborations with ISPs, where we take space in their data centers, and peer with them directly. 

Why?

Why do we track our network performance like this? The answer is simple: to improve user experience. This data allows us to track a key performance metric for Cloudflare and the other networks. When we see that we’re lagging in a region, it serves as a signal to dig deeper into our network. 

This data is a gold mine for the teams tasked with improving Cloudflare’s network. When there are countries where Cloudflare is behind, it gives us signals for where we should expand or investigate. If we’re slow, we may need to invest in additional peering. If a region we have invested in heavily is slower, we may need to investigate our hardware.  The example from Japan shows exactly how this can benefit: we took a location where we were previously on par with our competitors, added peering in new locations, and we pulled ahead. 

On top of this map, we have autonomous system (ASN) level granularity on how we are performing on each one of the top 1000 eyeball networks, and we continuously optimize our traffic flow with each of them.  This allows us to track individual networks that may lag and improve the customer experience in those networks through turning up peering, or even adding new deployments in those regions. 

What’s next?

We’re sharing our updates on our journey to become #1 everywhere so that you can see what goes into running the fastest network in the world. From here, our plan is the same as always: identify where we’re slower, fix it, and then tell you how we’ve gotten faster.

“You get Instant Purge, and you get Instant Purge!” — all purge methods now available to all customers

Post Syndicated from Alex Krivit original https://blog.cloudflare.com/instant-purge-for-all/

There’s a tradition at Cloudflare of launching real products on April 1, instead of the usual joke product announcements circulating online today. In previous years, we’ve introduced impactful products like 1.1.1.1 and 1.1.1.1 for Families. Today, we’re excited to continue this tradition by making every purge method available to all customers, regardless of plan type.

During Birthday Week 2024, we announced our intention to bring the full suite of purge methods — including purge by URL, purge by hostname, purge by tag, purge by prefix, and purge everything — to all Cloudflare plans. Historically, methods other than “purge by URL” and “purge everything” were exclusive to Enterprise customers. However, we’ve been openly rebuilding our purge pipeline over the past few years (hopefully you’ve read some of our blog series), and we’re thrilled to share the results more broadly. We’ve spent recent months ensuring the new Instant Purge pipeline performs consistently under 150 ms, even during increased load scenarios, making it ready for every customer.  

But that’s not all — we’re also significantly raising the default purge rate limits for Enterprise customers, allowing even greater purge throughput thanks to the efficiency of our newly developed Instant Purge system.

Building a better purge: a two-year journey

Stepping back, today’s announcement represents roughly two years of focused engineering. Near the end of 2022, our team went heads down rebuilding Cloudflare’s purge pipeline with a clear yet challenging goal: dramatically increase our throughput while maintaining near-instant invalidation across our global network.

Cloudflare operates data centers in over 335 cities worldwide. Popular cached assets can reside across all of our data centers, meaning each purge request must quickly propagate to every location caching that content. Upon receiving a purge command, each data center must efficiently locate and invalidate cached content, preventing stale responses from being served. The amount of content that must be invalidated can vary drastically, from a single file, to all cached assets associated with a particular hostname. After the content has been purged, any subsequent requests will trigger retrieval of a fresh copy from the origin server, which will be stored in Cloudflare’s cache during the response. 

Ensuring consistent, rapid propagation of purge requests across a vast network introduces substantial technical challenges, especially when accounting for occasional data center outages, maintenance, or network interruptions. Maintaining consistency under these conditions requires robust distributed systems engineering.

How did we scale purge?

We’ve previously discussed how our new Instant Purge system was architected to achieve sub-150 ms purge times. It’s worth noting that the performance improvements were only part of what our new architecture achieved, as it also helped us solve significant scaling challenges around storage and throughput that allowed us to bring Instant Purge to all users. 

Initially, our purge system scaled well, but with rapid customer growth, the storage consumption from millions of daily purge keys that needed to be stored reduced available caching space. Early attempts to manage this storage and throughput demand involved queues and batching for smoothing traffic spikes, but this introduced latency and underscored the tight coupling between increased usage and rising storage costs.

We needed to revisit our thinking on how to better store purge keys and when to remove purged content so we could reclaim space. Historically, when a customer would purge by tag, prefix or hostname, Cloudflare would mark the content as expired and allow it to be evicted later. This is known as lazy-purge because nothing is actively removed from disk. Lazy-purge is fast, but not necessarily efficient, because it consumes storage for expired but not-yet-evicted content. After examining global or data center-level indexing for purge keys, we decided that wasn’t viable due to increases in system complexity and the latency those indices could bring due to our network size. So instead, we opted for per-machine indexing, integrating indices directly alongside our cache proxies. This minimized network complexity, simplified reliability, and provided predictable scaling.

After careful analysis and benchmarking, we selected RocksDB, an embedded key-value store that we could optimize for our needs, which formed the basis of CacheDB, our Rust-based service running alongside each cache proxy. CacheDB manages indexing and immediate purge execution (active purge), significantly reducing storage needs and freeing space for caching.


Local queues within CacheDB buffer purge operations to ensure consistent throughput without latency spikes, while the cache proxies consult CacheDB to guarantee rapid, active purges. Our updated distribution pipeline broadcasts purges directly to CacheDB instances across machines, dramatically improving throughput and purge speed.

Using CacheDB, we’ve reduced storage requirements 10x by eliminating lazy purge storage accumulation, instantly freeing valuable disk space. The freed storage enhances cache retention, boosting cache HIT ratios and minimizing origin egress. These savings in storage and increased throughput allowed us to scale to the point where we can offer Instant Purge to more customers.

For more information on how we designed the new Instant Purge system, please see the previous installment of our Purge series blog posts. 

Striking the right balance: what to purge and when

Moving on to practical considerations of using these new purge methods, it’s important to use the right method for what you want to invalidate. Purging too aggressively can overwhelm origin servers with unnecessary requests, driving up egress costs and potentially causing downtime. Conversely, insufficient purging leaves visitors with outdated content. Balancing precision and speed is vital.

Cloudflare supports multiple targeted purge methods to help customers achieve this balance.

Starting today, all of these methods are available to every Cloudflare customer.    

How to purge 

Users can select their purge method directly in the Cloudflare dashboard, located under the Cache tab in the configurations section, or via the Cloudflare API. Each purge request should clearly specify the targeted URLs, hostnames, prefixes, or cache tags relevant to the selected purge type (known as purge keys). For instance, a prefix purge request might specify a directory such as example.com/foo/bar. To maximize efficiency and throughput, batching multiple purge keys in a single request is recommended over sending individual purge requests each with a single key.
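The batching advice can be sketched as a small helper that groups purge keys into request bodies. This is a sketch under assumptions: the body field names (`files`, `tags`, `hosts`, `prefixes`) reflect our reading of Cloudflare's public purge API and should be checked against the API reference, the 100-key cap is taken from the rate limit table in this post, and the helper name is hypothetical. No HTTP request is sent.

```python
import json
from typing import Dict, Iterable, List

MAX_KEYS_PER_REQUEST = 100  # "Max keys per request" in the limits table

# Assumed request body fields, one per purge type.
PURGE_FIELDS = {"files", "tags", "hosts", "prefixes"}


def build_purge_payloads(field: str,
                         keys: Iterable[str],
                         max_keys: int = MAX_KEYS_PER_REQUEST
                         ) -> List[Dict[str, List[str]]]:
    """Group purge keys into as few request bodies as possible."""
    if field not in PURGE_FIELDS:
        raise ValueError(f"unknown purge field: {field}")
    unique = list(dict.fromkeys(keys))  # drop duplicates, keep order
    return [
        {field: unique[i:i + max_keys]}
        for i in range(0, len(unique), max_keys)
    ]


prefixes = [f"example.com/foo/bar/{n}" for n in range(250)]
payloads = build_purge_payloads("prefixes", prefixes)
print(len(payloads), [len(p["prefixes"]) for p in payloads])
# prints "3 [100, 100, 50]"
print(json.dumps(payloads[2])[:60])  # start of the JSON body for request 3
```

Sending 3 requests instead of 250 consumes 3 tokens from the account's bucket rather than 250, which is why batching helps throughput.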

How much can you purge?

The new rate limits for Cloudflare’s purge by tag, prefix, hostname, and purge everything are different for each plan type. We use a token bucket rate limit system, so each account has a token bucket with a maximum size based on plan type. When we receive a purge request, we first add tokens to the account’s bucket based on the time elapsed since the account’s last purge request multiplied by the refill rate for its plan type (the amount added can be a fraction of a token). Then we check if there’s at least one whole token in the bucket, and if so we remove it and process the purge request. If not, the purge request is rate limited. An easy way to think about this rate limit is that the refill rate represents the consistent rate of requests a user can send in a given period, while the bucket size represents the maximum burst of requests available.

For example, a free user starts with a bucket size of 25 requests and a refill rate of 5 requests per minute (one request per 12 seconds). If the user were to send 26 requests all at once, the first 25 would be processed, but the last request would be rate limited. They would need to wait 12 seconds and retry their last request for it to succeed. 
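The free-plan example can be reproduced with a minimal token bucket sketch. This is not Cloudflare's implementation: the class and parameter names are hypothetical, the bucket is assumed to start full, and exact fractions are used instead of floats so that the 12-second boundary is not affected by rounding.

```python
from fractions import Fraction


class TokenBucket:
    """Token bucket that refills lazily, on each incoming request."""

    def __init__(self, bucket_size, refill_per_second, now=0):
        self.bucket_size = Fraction(bucket_size)
        self.refill_per_second = Fraction(refill_per_second)
        self.tokens = self.bucket_size  # assumption: bucket starts full
        self.last_request = Fraction(now)

    def allow(self, now) -> bool:
        """Return True if a request arriving at time `now` (seconds) may proceed."""
        now = Fraction(now)
        elapsed = now - self.last_request
        self.last_request = now
        # Add (possibly fractional) tokens, capped at the bucket size.
        self.tokens = min(self.bucket_size,
                          self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # rate limited


# Free plan: bucket size 25, refill rate 5 per minute.
free = TokenBucket(bucket_size=25, refill_per_second=Fraction(5, 60))
burst = [free.allow(now=0) for _ in range(26)]
print(burst.count(True), burst[-1])  # prints "25 False"
print(free.allow(now=12))            # prints "True"
```

The 26th request in the burst is rejected, and the retry 12 seconds later succeeds because exactly one token has been refilled by then.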

The current limits are applied per account:

| Plan       | Bucket size  | Request refill rate | Max keys per request | Total keys       |
|------------|--------------|---------------------|----------------------|------------------|
| Free       | 25 requests  | 5 per minute        | 100                  | 500 per minute   |
| Pro        | 25 requests  | 5 per second        | 100                  | 500 per second   |
| Biz        | 50 requests  | 10 per second       | 100                  | 1,000 per second |
| Enterprise | 500 requests | 50 per second       | 100                  | 5,000 per second |

More details on all purge rate limits can be found in our documentation.

What’s next?

We’ve spent a lot of time optimizing our purge platform. But we’re not done yet. Looking forward, we will continue to enhance the performance of Cloudflare’s single-file purge. The current P50 performance is around 250 ms, and we suspect that we can optimize it further to bring it under 200 ms. We will also build out our ability to allow for greater purge throughput for all of our systems, and will continue to find ways to implement filtering techniques to ensure we can continue to scale effectively and allow customers to purge whatever and whenever they choose. 

We invite you to try out our new purge system today and deliver an instant, seamless experience to your visitors.

Dynamically optimize, clip, and resize video from any origin with Media Transformations

Post Syndicated from Taylor Smith original https://blog.cloudflare.com/media-transformations-for-video-open-beta/

Today, we are thrilled to announce Media Transformations, a new service that brings the magic of Image Transformations to short-form video files wherever they are stored.

Since 2018, Cloudflare Stream has offered a managed video pipeline that empowers customers to serve rich video experiences at global scale easily, in multiple formats and quality levels. Sometimes, the greatest friction to getting started isn’t even about video, but rather the thought of migrating all those files. Customers want a simpler solution that retains their current storage strategy to deliver small, optimized MP4 files. Now you can do that with Media Transformations.

Short videos, big volume

For customers with a huge volume of short video, such as generative AI output, e-commerce product videos, social media clips, or short marketing content, uploading those assets to Stream is not always practical. Furthermore, Stream’s key features like adaptive bitrate encoding and HLS packaging offer diminishing returns on short content or small files.

Instead, content like this should be fetched from our customers’ existing storage like R2 or S3 directly, optimized by Cloudflare quickly, and delivered efficiently as small MP4 files. Cloudflare Images customers reading this will note that this sounds just like their existing Image Transformation workflows. Starting today, the same workflow can be applied to your short-form videos.

What’s in a video?

The distinction between video and images online can sometimes be blurry — consider an animated GIF: is that an image or a video? (They’re usually smaller as MP4s anyway!) As a practical example, consider a selection of product images for a new jacket on an e-commerce site. You want a consumer to know how it looks, but also how it flows. So perhaps the first “image” in that carousel is actually a video of a model simply putting the jacket on. Media Transformations empowers customers to optimize the product video and images with similar tools and identical infrastructure.

How to get started

Any website that is already enabled for Image Transformations is now enabled for Media Transformations. To enable a new zone, navigate to “Transformations” under Stream (or Images), locate your zone in the list, and click Enable. Enabling and disabling a zone for transformations affects both Images and Media transformations.


After enabling Media Transformations on a website, it is simple to construct a URL that transforms a video. The pattern is similar to Image Transformations, but uses the media endpoint instead of the image endpoint:

https://example.com/cdn-cgi/media/<OPTIONS>/<SOURCE-VIDEO>

The <OPTIONS> portion of the URL is a comma-separated list of flags written as key=value. A few noteworthy flags:

  • mode can be video (the default) to output a video, frame to pull a still image of a single frame, or even spritesheet to generate an image with multiple frames, which is useful for seek previews or storyboarding.

  • time specifies the exact start time in the input video from which to extract a frame or start a clip.

  • duration specifies the length of the output video, to make a clip shorter than the original.

  • fit, together with height and width, allows resizing and cropping the output video or frame.

  • Setting audio to false removes the sound in the output video.

The <SOURCE-VIDEO> is a full URL to a source file or a root-relative path if the origin is on the same zone as the transformation request.
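The URL pattern above can be assembled with a small helper. The function name is hypothetical, and stripping the leading slash from a root-relative source path is an assumption about how the path should be joined; the option names are the ones listed above.

```python
def media_transform_url(zone: str, source: str, **options) -> str:
    """Build https://<zone>/cdn-cgi/media/<OPTIONS>/<SOURCE-VIDEO>.

    Options are written as comma-separated key=value flags, in the order
    given. Booleans are rendered as "true"/"false".
    """
    flags = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        flags.append(f"{key}={value}")
    # Assumption: a root-relative path is appended without its leading slash.
    return f"https://{zone}/cdn-cgi/media/{','.join(flags)}/{source.lstrip('/')}"


SOURCE = "https://pub-d9fcbc1abcd244c1821f38b99017347f.r2.dev/aus-mobile.mp4"

# A 10-second, muted, vertically cropped clip.
print(media_transform_url(
    "example.com", SOURCE,
    mode="video", duration="10s", width=480, height=720,
    fit="cover", audio=False,
))

# A 120x120 still frame taken 3 seconds into the video.
print(media_transform_url(
    "example.com", SOURCE,
    mode="frame", time="3s", width=120, height=120, fit="cover",
))
```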

A full list of supported options, examples, and troubleshooting information is available in DevDocs.

A few examples

I used my phone to take this video of the randomness mobile in Cloudflare’s Austin Office and put it in an R2 bucket. Of course, it is possible to embed the original video file from R2 directly:

That video file is almost 30 MB. Let’s optimize it together — a more efficient choice would be to resize the video to the width of this blog post template. Let’s apply a width adjustment in the options portion of the URL:

https://example.com/cdn-cgi/media/width=760/https://pub-d9fcbc1abcd244c1821f38b99017347f.r2.dev/aus-mobile.mp4

That will deliver the same video, resized and optimized:

Not only is this video the right size for its container, now it’s less than 4 MB. That’s a big bandwidth savings for visitors.

As I recorded the video, the lobby was pretty quiet, but there was someone talking in the distance. If we wanted to use this video as a background, we should remove the audio, shorten it, and perhaps crop it vertically. All of these options can be combined, comma-separated, in the options portion of the URL:

https://example.com/cdn-cgi/media/mode=video,duration=10s,width=480,height=720,fit=cover,audio=false/https://pub-d9fcbc1abcd244c1821f38b99017347f.r2.dev/aus-mobile.mp4

The result:

If this were a product video, we might want a small thumbnail to add to the carousel of images so shoppers can click to zoom in and see it move. Use the “frame” mode and a “time” to generate a static image from a single point in the video. The same size and fit options apply:

https://example.com/cdn-cgi/media/mode=frame,time=3s,width=120,height=120,fit=cover/https://pub-d9fcbc1abcd244c1821f38b99017347f.r2.dev/aus-mobile.mp4

Which generates this optimized image:

Try it out yourself using our video or one of your own.

Input Limits

We are eager to start supporting real customer content, and we will right-size our input limitations with our early adopters. To start:

  • Video files must be smaller than 40 megabytes.

  • Files must be MP4s and should be H.264 encoded.

  • Videos and images generated with Media Transformations will be cached. However, in our initial beta, the original content will not be cached, which means regenerating a variant will result in a request to the origin.
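The limits above could be checked before a transformation URL is ever requested. The following is a rough pre-flight sketch in Python, not an official validator: the function name check_input is hypothetical, "40 megabytes" is read as 40,000,000 bytes (an assumption, since the post does not say whether decimal or binary units are meant), and the .mp4 extension test is only a heuristic that cannot confirm the container or the H.264 encoding.

```python
from urllib.parse import urlparse

# Assumption: "40 megabytes" interpreted as decimal megabytes.
MAX_INPUT_BYTES = 40 * 1000 * 1000


def check_input(source: str, size_bytes: int) -> list[str]:
    """Return a list of likely problems with a source video (empty if none).

    Hypothetical helper; the real service performs its own validation.
    """
    problems = []

    # The beta requires files smaller than 40 megabytes.
    if size_bytes >= MAX_INPUT_BYTES:
        problems.append(f"file is {size_bytes} bytes; must be under {MAX_INPUT_BYTES}")

    # Heuristic only: a file extension does not prove the container or codec.
    path = urlparse(source).path
    if not path.lower().endswith(".mp4"):
        problems.append("source does not look like an MP4 file")

    return problems


print(check_input("https://pub-d9fcbc1abcd244c1821f38b99017347f.r2.dev/aus-mobile.mp4", 30_000_000))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue at once.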

How it works

Unlike Stream, Media Transformations receives requests on a customer’s own website. Internally, however, these requests are passed to the same On-the-Fly Encoder (“OTFE”) platform that Stream Live uses. To achieve this, the Stream team built modules that run on our servers to act as entry points for these requests.

These entry points perform some initial validation on the URL formatting and flags before building a request to Stream’s own Delivery Worker, which in turn calls OTFE’s set of transformation handlers. The original asset is fetched from the customer’s origin, validated for size and type, and passed to the same OTFE methods responsible for manipulating and optimizing video or still frame thumbnails for videos uploaded to Stream. These tools do a final inspection of the media type and encoding for compatibility, then generate the requested variant. If any errors were raised along the way, an HTTP error response will be generated using similar error codes to Image Transformations. When successful, the result is cached for future use and delivered to the requestor as a single file. Even for new or uncached requests, all of this operates much faster than the video’s play time.


What it costs

Media Transformations will be free for all customers while in beta. We expect the beta period to extend into Q3 2025, and after that, Media Transformations will use the same subscriptions and billing mechanics as Image Transformations — including a free allocation for all websites/zones. Generating a still frame (single image) from a video counts as 1 transformation. Generating an optimized video is billed as 1 transformation per second of the output video. Each unique transformation is only billed once per month. All Media and Image Transformations cost $0.50 per 1,000 monthly unique transformation operations, with a free monthly allocation of 5,000.

Using this post as an example, recall the two transformed videos and one transformed image above — the big original doesn’t count because it wasn’t transformed. The first video (showing blog post width) was 15 seconds of output. The second video (silent vertical clip) was 10 seconds of output. The preview square is a still frame. These three operations would count as 26 transformations — and they would only bill once per month, regardless of how many visitors this page receives.
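The arithmetic above can be written out as a short Python sketch. The function names are hypothetical, and two details are assumptions the post does not state: that a partial second of output video rounds up to a whole transformation, and that usage beyond the free allocation is priced linearly at $0.50 per 1,000. These figures describe the planned post-beta pricing, since Media Transformations is free during the beta.

```python
import math

PRICE_PER_1000_USD = 0.50   # from the post
FREE_MONTHLY = 5000         # free monthly allocation, from the post


def count_transformations(video_seconds: list[float], still_frames: int) -> int:
    """Count unique transformations: 1 per output second, 1 per still frame."""
    # Assumption: a partial second of output counts as a full transformation.
    return sum(math.ceil(s) for s in video_seconds) + still_frames


def estimated_cost_usd(unique_transformations: int) -> float:
    """Estimate the monthly cost after the free allocation is used up."""
    billable = max(0, unique_transformations - FREE_MONTHLY)
    # Assumption: usage above the free allocation is priced linearly.
    return billable * PRICE_PER_1000_USD / 1000


# The three operations in this post: a 15-second video, a 10-second clip,
# and one still frame.
total = count_transformations([15, 10], still_frames=1)
print(total, estimated_cost_usd(total))
```

For this post the total is 26 unique transformations, well inside the free monthly allocation, so the estimate is zero.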

Looking ahead

Our short-term focus will be on right-sizing input limits based on real customer usage, as well as adding a caching layer for origin fetches to reduce any egress fees our customers may be facing from other storage providers. Looking further ahead, we intend to bring Image and Media Transformations closer together to simplify the developer experience, unify the features, and streamline enablement. Cloudflare’s Media Transformations will optimize your images and video, quickly and easily, wherever you need them.

Try it for yourself today using our sample asset above, or get started by enabling Transformations on a zone in your account and uploading a short file to R2, both of which offer a free tier to get you going.