All posts by Chris Draper

Troubleshooting network connectivity and performance with Cloudflare AI

2025-08-29 Chris Draper

Post Syndicated from Chris Draper original https://blog.cloudflare.com/AI-troubleshoot-warp-and-network-connectivity-issues/

Monitoring a corporate network and troubleshooting any performance issues across that network is a hard problem, and it has become increasingly complex over time. Imagine that you’re maintaining a corporate network, and you get the dreaded IT ticket. An executive is having a performance issue with an application, and they want you to look into it. The ticket doesn’t have a lot of details. It simply says: “Our internal documentation is taking forever to load. PLS FIX NOW”.

In the early days of IT, a corporate network was built on-premises. It provided network connectivity between employees that worked in person and a variety of corporate applications that were hosted locally.

The shift to cloud environments, the rise of SaaS applications, and a “work from anywhere” model have made IT environments significantly more complex in the past few years. Today, it’s hard to know if a performance issue is the result of:

An employee’s device
Their home or corporate wifi
The corporate network
A cloud network hosting a SaaS app
An intermediary ISP

A performance ticket submitted by an employee might even be a combination of multiple performance issues all wrapped together into one nasty problem.

Cloudflare built Cloudflare One, our Secure Access Service Edge (SASE) platform, to protect enterprise applications, users, devices, and networks. In particular, this platform relies on two capabilities to simplify troubleshooting performance issues:

Cloudflare’s Zero Trust client, also known as WARP, encrypts traffic from devices and forwards it to Cloudflare’s edge.
Digital Experience Monitoring (DEX) works alongside WARP to monitor device, network, and application performance.

We’re excited to announce two new AI-powered tools that will make it easier to troubleshoot WARP client connectivity and performance issues. We’re releasing a new WARP diagnostic analyzer in the Zero Trust dashboard and an MCP (Model Context Protocol) server for DEX. Today, every Cloudflare One customer has free access to both of these new features by default.

WARP diagnostic analyzer

The WARP client provides diagnostic logs that can be used to troubleshoot connectivity issues on a device. For desktop clients, the most common issues can be investigated with the information captured in logs called WARP diagnostic. Each WARP diagnostic log contains an extensive amount of information spanning days of captured events occurring on the client. It takes expertise to manually go through all of this information and understand the full picture of what is occurring on a client that is having issues. In the past, we’ve advised customers having issues to send their WARP diagnostic log straight to us so that our trained support experts can do a root cause analysis for them. While this is effective, we want to give our customers the tools to take control of deciphering common troubleshooting issues for even quicker resolution.

Enter the WARP diagnostic analyzer, a new AI available for free in the Cloudflare One dashboard as of today! This AI demystifies information in the WARP diagnostic log so you can better understand events impacting the performance of your clients and network connectivity. Now, when you run a remote capture for WARP diagnostics in the Cloudflare One dashboard, you can generate an AI analysis of the WARP diagnostic file. Simply go to your organization’s Zero Trust dashboard and select DEX > Remote Captures from the side navigation bar. After you successfully run diagnostics and produce a WARP diagnostic file, you can open the status details and select View WARP Diag to generate your AI analysis.

In the WARP Diag analysis, you will find a Cloudy summary of events that we recommend a deeper dive into.

Below this summary is an events section, where the analyzer highlights events that commonly occur alongside client and connectivity issues.

Expanding any of the detected events reveals a detailed page explaining the event, recommended resources to help troubleshoot, and a list of timestamped recent occurrences of the event on the device.

To further help with troubleshooting, we’ve added a Device and WARP details section at the bottom of this page with a quick view of the device specifications and WARP configuration, such as the operating system, WARP version, and device profile ID.

Finally, we’ve made it easy to take all the information created in your AI summary with you by navigating to the JSON file tab and copying the contents. Your WARP Diag file is also available to download from this screen for any further analysis.

MCP server for DEX

Alongside the new WARP Diagnostic Analyzer, we’re excited to announce that all Cloudflare One customers have access to an MCP (Model Context Protocol) server for our Digital Experience Monitoring (DEX) product. Let’s dive into how this will save our customers time and money.

Cloudflare One customers use Digital Experience Monitoring (DEX) to monitor devices across their employee network and troubleshoot any connectivity or performance issues. Like many products at Cloudflare, every data point generated by DEX is available to customers via Cloudflare’s API or log ingestion. DEX API and log data is valuable because it enables enterprises to create custom analytics for their devices’ connectivity and performance in a SIEM (Security Information and Event Management). Building new data pipelines and dashboards can take a lot of time and can be expensive. Some Cloudflare One customers aren’t able to dedicate the engineering time to build custom analytics, whether due to budget, resource constraints, time constraints, or other factors.

Model Context Protocol (MCP) is an open standard that lets AI models connect to other applications and data sources. In this instance, using MCP together with DEX gives users a custom analytics experience, without all the work of integrating log data into a SIEM. Network security practitioners can type out a question (like “Show me the performance data for [email protected]’s device in a few graphs”), and an MCP server will answer that question with a customized graph on Alice’s device data from the DEX API.

Troubleshooting DNS query performance with the DEX MCP server

Any IT admin, security engineer, or network engineer can ask the MCP server a question like: “I received an IT ticket from [email protected] who is an enterprise employee. He is complaining about the performance of his device. Can you investigate this for me?”.

After you submit this question, the DEX MCP server will start by reviewing the device’s overall health.

Overall Device Health:
✅ Connection Status: Consistently connected for the past 24 hours with 100% availability
✅ WARP Mode: Running in optimal "warp+doh" (WARP + DNS over HTTPS) mode
✅ Client Version: Up-to-date with version 2025.1.7.2

So far, the device’s health looks great. Next, the DEX MCP server will begin analyzing the results of the performance tests that are configured on Bob’s device.

📊 Performance Metrics Analysis

HTTP Response Times:
Google.com test: Average 247ms (Range: 81-649ms)
  • Shows significant improvement throughout the day: Started at ~317ms, dropped to ~211ms by evening
Wiki.internal.acme.com: Average 1,374ms (Range: 877-2,431ms)
  • This is notably high and could be the source of performance complaints
  • DNS resolution particularly slow: Average 616ms with spikes up to 1,600ms

Now we’re making progress in our troubleshooting effort. The DEX MCP server identified that the HTTP GET test for wiki.internal.acme.com has a high average resource fetch time when compared to other websites (like google.com). In particular, high resource fetch time is the result of slow DNS resolution. Now, the DEX MCP server will summarize its findings.

🚨 Performance Issues Identified
Extremely Slow DNS Resolution:
  • Average DNS response time of 616ms is significantly high
  • Normal DNS should be <50ms for good performance
High Resource Fetch Times on wiki.internal.acme.com:
   • 1.3+ second response times indicate potential application-level issues
   • Could be related to DNS performance or server-side delays
HTTP Status Code Issues:
        • wiki.internal.acme.com is showing consistent HTTP 400 errors
        • This suggests potential configuration or routing issues

Now, we’ve successfully identified that Bob’s slow wiki performance is the result of a high average DNS response time. Slow DNS resolution increases the average resource fetch time, and also generates HTTP 400 errors for Enterprise employees that are attempting to access wiki.internal.acme.com. Notably, it’s likely that if the wiki is seeing slow DNS response times, other internal applications are experiencing performance issues as well.

We can ask the DEX MCP server to suggest potential solutions that would fix this wider DNS performance issue.

🔧 DNS Optimization Strategies

Check WARP DNS configuration and settings
  • Since the device is using WARP+DoH mode, you can optimize the DNS settings
Implement or improve local DNS caching
  • Implement or improve local DNS caching to reduce repeated lookups
Network stack tuning
  • Optimize TCP / UDP settings for DNS
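A DNS latency figure like the one reported above can also be checked independently from the affected device. The following is a minimal sketch, assuming Python is available there; `time_dns_lookup`, `is_slow`, and the injectable `resolver` parameter are hypothetical names introduced for this illustration, and the 50 ms threshold is simply the figure quoted in the analysis above.

```python
import socket
import time


def time_dns_lookup(hostname, resolver=socket.getaddrinfo):
    """Return how long one name resolution takes, in milliseconds.

    `resolver` defaults to the system resolver via socket.getaddrinfo.
    It is injectable (an assumption of this sketch) so the timing logic
    can be exercised without network access.
    """
    start = time.perf_counter()
    resolver(hostname, None)
    return (time.perf_counter() - start) * 1000.0


def is_slow(elapsed_ms, threshold_ms=50.0):
    """Flag a lookup slower than the threshold (50 ms, per the analysis above)."""
    return elapsed_ms > threshold_ms
```

Calling `time_dns_lookup("wiki.internal.acme.com")` from a device on that network would time a real lookup. Repeated lookups may be answered from a local cache, so the first measurement is usually the most informative.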

Try out the DEX MCP server today

Fast and easy option for testing an MCP server

Any Cloudflare One customer with a Free, PayGo, or ENT plan can start using the DEX MCP server in less than one minute. The fastest and easiest way to try out the DEX MCP server is to visit playground.ai.cloudflare.com. There are five steps to get started:

Copy the URL for the DEX MCP server: https://dex.mcp.cloudflare.com/sse
Open playground.ai.cloudflare.com in a browser
Find the section in the left side bar titled MCP Servers
Paste the URL for the DEX MCP server into the URL input box and click Connect
Authenticate your Cloudflare account, and then start asking questions to the DEX MCP server

It’s worth noting that end users will need to ask specific and explicit questions to the DEX MCP server to get a response. For example, you may need to say, “Set my production account as the active account”, and then give the separate command, “Fetch the DEX test results for the user [email protected] over the past 24 hours”.

Better experience for MCP servers that requires additional steps

Customers will get a more flexible prompt experience by configuring the DEX MCP server with their preferred AI assistant (Claude, Gemini, ChatGPT, etc.) that has MCP server support. MCP server support may require a subscription for some AI assistants. You can read the Digital Experience Monitoring – MCP server documentation for step by step instructions on how to get set up with each of the major AI assistants that are available today.

As an example, you can configure the DEX MCP server in Claude by downloading the Claude Desktop client, then selecting Claude Code > Developer > Edit Config. You will be prompted to open “claude_desktop_config.json” in a code editor of your choice. Simply add the following JSON configuration, and you’re ready to use Claude to call the DEX MCP server.

{
  "globalShortcut": "",
  "mcpServers": {
    "cloudflare-dex-analysis": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://dex.mcp.cloudflare.com/sse"
      ]
    }
  }
}
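If “claude_desktop_config.json” already lists other MCP servers, the entry can be merged in rather than pasted over the existing file. Here is a minimal sketch, assuming the config is a JSON object shaped like the one above; `add_dex_server` and `update_config_file` are hypothetical helpers, and the file path is left to the caller because it varies by operating system.

```python
import copy
import json

# Entry copied from the JSON configuration shown above.
DEX_ENTRY = {
    "command": "npx",
    "args": ["mcp-remote", "https://dex.mcp.cloudflare.com/sse"],
}


def add_dex_server(config, name="cloudflare-dex-analysis"):
    """Return a copy of `config` with the DEX MCP server entry added.

    Existing top-level keys and other MCP servers are preserved, and
    the input dictionary is not modified.
    """
    merged = copy.deepcopy(config)
    merged.setdefault("mcpServers", {})[name] = copy.deepcopy(DEX_ENTRY)
    return merged


def update_config_file(path):
    """Read the JSON config at `path`, add the entry, and write it back."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(add_dex_server(config), fh, indent=2)
```

Keeping the merge in a pure function makes it easy to check the result before anything is written to disk.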

Get started with Cloudflare One today

Are you ready to secure your Internet traffic, employee devices, and private resources without compromising speed? You can get started with our new Cloudflare One AI powered tools today.

The WARP diagnostic analyzer and the DEX MCP server are generally available to all customers. Head to the Zero Trust dashboard to run a WARP diagnostic and learn more about your client’s connectivity with the WARP diagnostic analyzer. You can test out the new DEX MCP server (https://dex.mcp.cloudflare.com/sse) in less than one minute at playground.ai.cloudflare.com, and you can also configure an AI assistant like Claude to use the new DEX MCP server.

If you don’t have a Cloudflare account, and you want to try these new features, you can create a free account for up to 50 users. If you’re an Enterprise customer, and you’d like a demo of these new Cloudflare One AI features, you can reach out to your account team to set up a demo anytime.

You can stay up to date on latest feature releases across the Cloudflare One platform by following the Cloudflare One changelogs and joining the conversation in the Cloudflare community hub or on our Discord Server.

Free network flow monitoring for all enterprise customers

2024-03-07 Chris Draper

Post Syndicated from Chris Draper original https://blog.cloudflare.com/free-network-monitoring-for-enterprise

A key component of effective corporate network security is establishing end to end visibility across all traffic that flows through the network. Every network engineer needs a complete overview of their network traffic to confirm their security policies work, to identify new vulnerabilities, and to analyze any shifts in traffic behavior. Often, it’s difficult to build out effective network monitoring as teams struggle with problems like configuring and tuning data collection, managing storage costs, and analyzing traffic across multiple visibility tools.

Today, we’re excited to announce that a free version of Cloudflare’s network flow monitoring product, Magic Network Monitoring, is available to all Enterprise Customers. Every Enterprise Customer can configure Magic Network Monitoring and immediately improve their network visibility in as little as 30 minutes via our self-serve onboarding process.

Enterprise Customers can visit the Magic Network Monitoring product page, click “Talk to an expert”, and fill out the form. You’ll receive access within 24 hours of submitting the request. Over the next month, the free version of Magic Network Monitoring will be rolled out to all Enterprise Customers. The product will automatically be available by default without the need to submit a form.

How it works

Cloudflare customers can send their network flow data (either NetFlow or sFlow) from their routers to Cloudflare’s network edge.

Magic Network Monitoring will pick up this data, parse it, and instantly provide insights and analytics on your network traffic. These analytics include traffic volume over time in bytes and packets, top protocols, sources, destinations, ports, and TCP flags.
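The kind of ranking behind views like top sources or top protocols can be illustrated with a few lines of aggregation. This is a minimal sketch, not Magic Network Monitoring’s implementation; the record fields (`src`, `protocol`, `dst_port`, `bytes`) and the `top_by_bytes` helper are hypothetical, and the addresses come from a documentation-only range.

```python
from collections import Counter


def top_by_bytes(flows, field, n=3):
    """Rank the values of `field` by total bytes across flow records."""
    totals = Counter()
    for flow in flows:
        totals[flow[field]] += flow["bytes"]
    return totals.most_common(n)


# Hypothetical, already-parsed flow records.
flows = [
    {"src": "192.0.2.1", "protocol": "TCP", "dst_port": 443, "bytes": 1200},
    {"src": "192.0.2.2", "protocol": "UDP", "dst_port": 53, "bytes": 300},
    {"src": "192.0.2.1", "protocol": "TCP", "dst_port": 443, "bytes": 800},
    {"src": "192.0.2.3", "protocol": "TCP", "dst_port": 22, "bytes": 500},
]

top_sources = top_by_bytes(flows, "src")
top_protocols = top_by_bytes(flows, "protocol")
```

The same helper works for destinations or ports by changing the `field` argument.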

Dogfooding Magic Network Monitoring during the remediation of the Thanksgiving 2023 security incident

Let’s review a recent example of how Magic Network Monitoring improved Cloudflare’s own network security and traffic visibility during the Thanksgiving 2023 security incident. Our security team needed a lightweight method to identify malicious packet characteristics in our core data center traffic. We monitored for any network traffic sourced from or destined to a list of ASNs associated with the bad actor. Our security team set up Magic Network Monitoring and established visibility into our first core data center within 24 hours of the project kick-off. Today, Cloudflare continues to use Magic Network Monitoring to monitor for traffic related to bad actors and to provide real time traffic analytics on more than 1 Tbps of core data center traffic.

*Magic Network Monitoring – Traffic Analytics*

Monitoring local network traffic from IoT devices

Magic Network Monitoring also improves visibility on any network traffic that doesn’t go through Cloudflare. Imagine that you’re a network engineer at ACME Corporation, and it’s your job to manage and troubleshoot IoT devices in a factory that are connected to the factory’s internal network. The traffic generated by these IoT devices doesn’t go through Cloudflare because it is destined to other devices and endpoints on the internal network. Nonetheless, you still need to establish network visibility into device traffic over time to monitor and troubleshoot the system.

To solve the problem, you configure a router or other network device to securely send encrypted traffic flow summaries to Cloudflare via an IPSec tunnel. Magic Network Monitoring parses the data, and instantly provides you with insights and analytics on your network traffic. Now, when an IoT device goes down, or a connection between IoT devices is unexpectedly blocked, you can analyze historical network traffic data in Magic Network Monitoring to speed up the troubleshooting process.

Monitoring cloud network traffic

As cloud networking becomes increasingly prevalent, it is essential for enterprises to invest in visibility across their cloud environments. Let’s say you’re responsible for monitoring and troubleshooting your corporation’s cloud network operations which are spread across multiple public cloud providers. You need to improve visibility into your cloud network traffic to analyze and troubleshoot any unexpected traffic patterns like configuration drift that leads to an exposed network port.

To improve traffic visibility across different cloud environments, you can export cloud traffic flow logs from any virtual device that supports NetFlow or sFlow to Cloudflare. We are also building support for native cloud VPC flow logs in conjunction with Magic Cloud Networking. Cloudflare will parse this traffic flow data and provide alerts plus analytics across all your cloud environments in a single pane of glass on the Cloudflare dashboard.

Improve your security posture today in less than 30 minutes

If you’re an existing Enterprise customer, and you want to improve your corporate network security, you can get started right away. Visit the Magic Network Monitoring product page, click “Talk to an expert”, and fill out the form. You’ll receive access within 24 hours of submitting the request. You can begin the self-serve onboarding tutorial, and start monitoring your first batch of network traffic in less than 30 minutes.

Over the next month, the free version of Magic Network Monitoring will be rolled out to all Enterprise Customers. The product will be automatically available by default without the need to submit a form.

If you’re interested in becoming an Enterprise Customer, and have more questions about Magic Network Monitoring, you can talk with an expert. If you’re a free customer, and you’re interested in testing a limited beta of Magic Network Monitoring, you can fill out this form to request access.

Network flow monitoring is GA, providing end-to-end traffic visibility

2023-10-18 Chris Draper

Post Syndicated from Chris Draper original http://blog.cloudflare.com/network-flow-monitoring-generally-available/

Network engineers often find they need better visibility into their network’s traffic and operations while analyzing DDoS attacks or troubleshooting other traffic anomalies. These engineers typically have some high level metrics about their network traffic, but they struggle to collect essential information on the specific traffic flows that would clarify the issue. To solve this problem, Cloudflare has been piloting a cloud network flow monitoring product called Magic Network Monitoring that gives customers end-to-end visibility into all traffic across their network.

Today, Cloudflare is excited to announce that Magic Network Monitoring (previously called Flow Based Monitoring) is now generally available to all enterprise customers. Over the last year, the Cloudflare engineering team has significantly improved Magic Network Monitoring; we’re excited to offer a network services product that will help our customers identify threats faster, reduce vulnerabilities, and make their network more secure.

Magic Network Monitoring is automatically enabled for all Magic Transit and Magic WAN enterprise customers. The product is located at the account level of the Cloudflare dashboard and can be opened by navigating to “Analytics & Logs > Magic Monitoring”. The onboarding process for Magic Network Monitoring is self-serve, and all enterprise customers with access can begin configuring the product today.

Any enterprise customers without Magic Transit or Magic WAN that are interested in testing Magic Network Monitoring can receive access to the free version (with some limitations on traffic volume) by submitting a request to their Cloudflare account team or filling out this form to talk with an expert.

What is Magic Network Monitoring?

Magic Network Monitoring is a cloud network flow monitor. Network traffic flow refers to any stream of packets between one source and one destination with the same Internet protocol and set of ports. Customers can send network flow reports from their routers (or any other network flow generator) to a publicly available endpoint on Cloudflare’s anycast network, even if the traffic didn’t originally pass through Cloudflare’s network. Cloudflare analyzes the network flow data, then provides customers visibility into key network traffic metrics via an analytics dashboard. These metrics include: traffic volume (in bits or packets) over time, source IPs, destination IPs, ports, traffic protocols, and router IPs. Customers can also configure alerts to identify DDoS attacks and any other abnormal traffic volume activities.

Send flow data from your network to Cloudflare for analysis
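The definition of a network traffic flow given above (one source, one destination, the same protocol and ports) maps naturally onto a grouping key. A minimal sketch under that definition follows; `FlowKey`, `group_into_flows`, and the packet dictionary fields are hypothetical names for illustration, not part of any flow export format.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Fields that identify a flow, following the definition above."""
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int


def group_into_flows(packets):
    """Aggregate packets that share a flow key into packet and byte counts."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = FlowKey(pkt["src_ip"], pkt["dst_ip"], pkt["protocol"],
                      pkt["src_port"], pkt["dst_port"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["length"]
    return dict(flows)
```

Two packets between the same hosts on different ports land in different flows, which is why port-level views can be derived from flow data.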

Enterprise DDoS attack type detection

Magic Transit On Demand (MTOD) customers will experience significant traffic visibility benefits when using Magic Network Monitoring. Magic Transit is a network security solution that offers DDoS protection and traffic acceleration from every Cloudflare data center for on-premise, cloud-hosted, and hybrid networks. Magic Transit On Demand customers can activate Magic Transit for protection when a DDoS attack is detected.

In general, we noticed that some MTOD customers lacked the network visibility tools to quickly identify DDoS attacks and take the appropriate mitigation action. Now, MTOD customers can use Magic Network Monitoring to analyze their network data and receive an alert if a DDoS attack is detected.

Cloudflare detects a DDoS attack from the customer’s network flow data

Once a DDoS attack is detected, Magic Network Monitoring customers can choose to either manually or automatically enable Magic Transit to mitigate any DDoS attacks.

Activate Magic Transit for DDoS protection
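A volume-based alert of the kind described above can be thought of as a threshold that must be exceeded for some time before it fires. The sketch below is an illustration only, not how Magic Network Monitoring evaluates its alerts; `sustained_breach`, the threshold, and the `min_consecutive` parameter are all hypothetical.

```python
def sustained_breach(series_bps, threshold_bps, min_consecutive=3):
    """Return the index at which traffic has stayed above the threshold
    for `min_consecutive` samples in a row, or None if it never does."""
    run = 0
    for i, bps in enumerate(series_bps):
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if bps > threshold_bps else 0
        if run >= min_consecutive:
            return i
    return None
```

Requiring several consecutive samples above the threshold is one simple way to avoid alerting on a single short burst.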

Enterprise network monitoring

Cloudflare’s Magic WAN and Cloudflare One customers can also benefit from using Magic Network Monitoring. Today, these customers have excellent visibility into the traffic they send through Cloudflare’s network, but sometimes they may lack visibility into traffic that isn’t sent through Cloudflare. This can include traffic that remains on a local network, or network traffic sent in between cloud environments. Magic WAN and Cloudflare One customers can add Magic Network Monitoring into their suite of product solutions to establish end-to-end network visibility across all traffic on their network.

A deep dive into network flow and network traffic sampling

Magic Network Monitoring gives customers better visibility into their network traffic by ingesting and analyzing network flow data.

The process starts when a router (or other network flow generation device) collects statistical samples of inbound and / or outbound packet data. These samples are collected by examining 1 in every X packets, where X is the sampling rate configured on the router. Typical sampling rates range from 1 in every 1,000 to 1 in every 4,000 packets. The ideal sampling rate depends on the traffic volume, traffic diversity, and the compute / memory power of your router’s hardware. You can read more about the recommended network flow sampling rate in Cloudflare’s MNM Developer Docs.
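The arithmetic behind 1-in-X sampling can be sketched as follows: whatever is observed in the samples is multiplied by the sampling rate to estimate the full traffic volume. This is a simplified illustration with hypothetical helper names; it samples every Nth packet to stay deterministic, whereas a real router may choose the sampled packet differently.

```python
def sample_one_in_n(packet_lengths, n):
    """Keep the length of every Nth packet (deterministic 1-in-N sampling)."""
    return packet_lengths[n - 1::n]


def estimate_volume(sampled_lengths, n):
    """Scale sampled packet and byte counts back up by the sampling rate."""
    return {
        "packets": len(sampled_lengths) * n,
        "bytes": sum(sampled_lengths) * n,
    }


# 4,000 small packets followed by 4,000 large ones, sampled at 1 in 1,000.
traffic = [100] * 4000 + [1500] * 4000
sampled = sample_one_in_n(traffic, 1000)
estimate = estimate_volume(sampled, 1000)
```

With only eight samples standing in for 8,000 packets, each sample carries a lot of weight, which is why the choice of sampling rate affects accuracy.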

The sampled data is packaged into one of two industry standard formats for network flow data: NetFlow or sFlow. In NetFlow, the sampled packet data is grouped by different packet characteristics such as source / destination IP, port, and protocol. Each group of sampled packet data also includes a traffic volume estimate. In sFlow, the entire packet header is selected as the representative sample, and there isn’t any data summarization. As a result, sFlow is a richer data format and includes more details about network traffic than NetFlow data. Once either the NetFlow or sFlow data samples are collected, they’re sent to Magic Network Monitoring for analysis and alerting.

Why simple random sampling didn’t work for Magic Network Monitoring

Magic Network Monitoring has come a long way from its early access release one year ago. In particular, the Cloudflare engineering team invested significant time in improving the accuracy of the traffic volume estimations in MNM. In the early access version of Magic Network Monitoring, customers were unexpectedly reporting that their network traffic volume estimates were too high and didn’t match the expected value.

Magic Network Monitoring performs its own sampling of the NetFlow or sFlow data it receives, so it can effectively scale and manage the data ingested across Cloudflare’s global network. Increasing the accuracy of the traffic volume estimations was more difficult than expected, as the NetFlow or sFlow data parsed by MNM is already built on sampled packet data. This introduces multiple distinct layers of data sampling in the product’s analytics.

The first version of Magic Network Monitoring used random sampling where a random subset of network flow data with the same timestamp was selected to represent the traffic volume at that point in time. A characteristic of network flow data is that some samples are more significant than others and represent a greater volume of network traffic. In order to account for this significance, we can associate a weight with each sample based on the traffic volume it represents. Network flow data weights are always positive numbers, and they follow a long tail distribution. These data characteristics caused MNM’s random sampling to incorrectly estimate the traffic volume of a customer’s network. Customers would see false spikes in their traffic volume analytics when an outlying data sample from the long tail was randomly selected to be the representative of all traffic at that point in time.
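The false spikes described above are easy to reproduce with made-up numbers. In this minimal sketch, `scale_up` is a hypothetical helper that extrapolates from a uniformly chosen subset, and the weights are invented for illustration: 999 small samples and a single long-tail outlier.

```python
def scale_up(sample_weights, population_size):
    """Estimate total volume by extrapolating from a uniform random subset."""
    return sum(sample_weights) * population_size / len(sample_weights)


# Invented long-tail data: 999 small weights and one very large one.
weights = [1] * 999 + [100_000]
true_total = sum(weights)

# A subset of 10 that happens to miss the outlier underestimates the total.
without_outlier = scale_up([1] * 10, len(weights))

# A subset of 10 that happens to include it produces a large false spike.
with_outlier = scale_up([1] * 9 + [100_000], len(weights))
```

Here the true total is 100,999, while the two estimates come out at 1,000 and 10,000,900, depending only on whether the outlier was drawn.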

Increasing accuracy with VarOpt reservoir sampling

To solve this problem, the Cloudflare engineering team implemented an alternative reservoir sampling technique called VarOpt. VarOpt is designed to collect samples from a stream of data when the length of the data stream is unknown (a perfect application for analyzing incoming network flow data). In the MNM implementation of VarOpt, we start with an empty reservoir of a fixed size that is filled with samples of network flow data. When the reservoir is full, and there is still new incoming network flow data, an old sample is randomly discarded from the reservoir and replaced with a new one.

After a certain number of samples have been observed, we calculate the traffic volume across all weighted samples in the reservoir, and that is the estimated traffic volume of a customer’s network flow at that point in time. Finally, the reservoir is emptied, and the VarOpt loop is restarted by filling the reservoir with the next set of the latest network flow samples.

The new VarOpt sampling method significantly increased the accuracy of the traffic volume estimations in Magic Network Monitoring, and solved our customers’ problems. These sampling improvements paved the way for general availability, and we’re excited to make accurate network flow analytics available to everyone.
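To give a feel for how a weight-aware reservoir avoids the false spikes of simple random sampling, here is a sketch of priority sampling, a simpler weighted reservoir technique in the same family. It is not VarOpt itself and not the MNM implementation; the function name and parameters are hypothetical. Each item receives a priority of its weight divided by a uniform random number, the `k` highest priorities are kept, and the next-highest priority acts as a threshold that small items are counted at.

```python
import heapq
import random


def priority_sample_total(weights, k, rng=None):
    """Estimate sum(weights) from a reservoir of at most k weighted samples.

    Priority sampling: each item gets priority w / u with u uniform in
    (0, 1]. The k highest-priority items are kept, the (k+1)-th highest
    priority becomes the threshold tau, and each kept item is counted
    as max(w, tau).
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (priority, weight), holding at most k + 1 items
    for w in weights:
        u = 1.0 - rng.random()  # in (0, 1], avoids division by zero
        item = (w / u, w)
        if len(heap) <= k:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)  # drop the lowest priority
    if len(heap) <= k:
        # The whole stream fit in the reservoir, so the total is exact.
        return float(sum(w for _, w in heap))
    tau = heap[0][0]  # smallest of the k + 1 retained priorities
    return float(sum(max(w, tau) for _, w in heap[1:]))
```

Because a very heavy item almost always has a high priority, it stays in the reservoir and is counted at its own weight rather than being extrapolated across the whole stream.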

Developer Docs and Discord Community

There are detailed Developer Docs for Magic Network Monitoring that explain the product’s features and outline a step-by-step configuration guide for new customers. As you’re working through the Magic Network Monitoring documentation, please feel free to provide feedback by clicking the “Give Feedback” button in the top right corner of the Developer Docs.

We’ve also created a channel in Cloudflare’s Discord community built around debugging configuration problems, testing new features, and providing product feedback. You can follow this link to join the Cloudflare Discord server.

Free version

A free version of Magic Network Monitoring is available to all Enterprise customers on request to their Cloudflare account team. The free version is designed to enable Enterprise customers to quickly test and evaluate Magic Network Monitoring before purchasing Magic Transit, Magic WAN, or Cloudflare One. Enterprise customers can fully configure Magic Network Monitoring themselves by following the step-by-step onboarding guide in the product’s documentation. The free version has some limitations on the quantity of traffic that can be processed which are further outlined in the product’s documentation.

The free version of Magic Network Monitoring is also available to all Free, Pro, and Business plan Cloudflare customers via a closed beta. Anyone can request access to the free version by reading the free version documentation and filling out this form. Priority access is granted to anyone that joins Cloudflare’s Discord server and sends a message in the Magic Network Monitoring Discord channel.

Next steps that you can take today

Magic Network Monitoring is generally available, and all Magic Transit and Magic WAN customers have been automatically granted access to the product today. You can navigate to the product by going to the account level of the Cloudflare dashboard, then selecting “Analytics & Logs > Magic Monitoring”.

If you’re an enterprise customer without Magic Transit or Magic WAN, and you want to use Magic Network Monitoring to improve your traffic visibility, you can talk with an MNM expert today.

If you’re interested in using Magic Transit and Magic Network Monitoring for DDoS protection, you can request a demo of Magic Transit. If you want to use Magic WAN and Magic Network Monitoring together to establish end-to-end network traffic visibility, you can talk with a Magic WAN expert.

How Orpheus automatically routes around bad Internet weather

2023-06-19 Chris Draper

Post Syndicated from Chris Draper original http://blog.cloudflare.com/orpheus-saves-internet-requests-while-maintaining-speed/

Cloudflare’s mission is to help build a better Internet for everyone, and Orpheus plays an important role in realizing this mission. Orpheus identifies Internet connectivity outages beyond Cloudflare’s network in real time, then leverages the scale and speed of Cloudflare’s network to find alternative paths around those outages. This ensures that everyone can reach a Cloudflare customer’s origin server no matter what is happening on the Internet. The end result is powerful: Cloudflare protects customers from Internet incidents outside our network while maintaining the average latency and speed of our customers’ traffic.

A little less than two years ago, Cloudflare made Orpheus automatically available to all customers for free. Since then, Orpheus has saved 132 billion Internet requests from failing by intelligently routing them around connectivity outages, prevented 50+ Internet incidents from impacting our customers, and made our customers’ origins more reachable to everyone on the Internet. Let’s dive into how Orpheus accomplished these feats over the last year.

Increasing origin reachability

One service that Cloudflare offers is a reverse proxy that receives Internet requests from end users, then applies any number of services like DDoS protection, caching, load balancing, and / or encryption. If the response to an end user’s request isn’t cached, Cloudflare routes the request to our customers’ origin servers. To be successful, end users need to be able to connect to Cloudflare, and Cloudflare needs to connect to our customers’ origin servers. With end users and customer origins around the world, and ~20% of websites using our network, this task is a tall order!

Orpheus provides origin reachability benefits to everyone using Cloudflare by identifying invalid paths on the Internet in real time, then routing traffic via alternative paths that are working as expected. This ensures Cloudflare can reach an origin no matter what problems are happening on the Internet on any given day.

Reducing 522 errors

At some point while browsing the Internet, you may have run into a 522 error.

This error indicates that you, the end user, were unable to access content on a Cloudflare customer’s origin server because Cloudflare couldn’t connect to the origin. Sometimes, this error occurs because the origin is offline for everyone, and ultimately the origin owner needs to fix the problem. Other times, this error can occur even when the origin server is up and able to receive traffic. In this case, some people can reach content on the origin server, but other people using a different Internet routing path cannot because of connectivity issues across the Internet.

Some days, a specific network may have excellent connectivity, while other days that network may be congested or have paths that are impassable altogether. The Internet is a massive and unpredictable network of networks, and the “weather” of the Internet changes every day.

When you see this error, Cloudflare attempted to connect to an origin on behalf of the end user, but did not receive a response back from the origin. Either the connection request never reached the origin, or the origin’s reply was dropped on the way back to Cloudflare. In the case of 522 errors, Cloudflare and the origin server could both be working as expected, but packets are dropped on the network path between them.

These 522 errors can cause a lot of frustration, and Orpheus was built to reduce them. The goal of Orpheus is to ensure that if at least one Cloudflare data center can connect to an origin, then anyone using Cloudflare’s network can also reach that origin, even if there are Internet connectivity problems happening outside of Cloudflare’s network.

Improving origin reachability for an example customer using Cloudflare

Let’s look at a concrete example of how Orpheus makes the Internet better for everyone by saving an origin request that would have otherwise failed. Imagine that you’re running an e-commerce website that sells dog toys online, and your store is hosted by an origin server in Chicago.

Imagine there are two different customers visiting your website at the same time: the first customer lives in Seattle, and the second customer lives in Tampa. The customer in Seattle reaches your origin just fine, but the customer in Tampa tries to connect to your origin and experiences a problem. It turns out that a construction crew accidentally damaged an Internet fiber line in Tampa, and Tampa is having connectivity issues with Chicago. As a result, any customer in Tampa receives a 522 error when they try to buy your dog toys online.

This is where Orpheus comes in to save the day. Orpheus detects that users in Tampa are receiving 522 errors when connecting to Chicago. Its database shows there is another route from Tampa through Boston and then to Chicago that is valid. As a result, Orpheus saves the end user’s request by rerouting it through Boston and taking an alternative path. Now, everyone in Tampa can still buy dog toys from your website hosted in Chicago, even though a fiber line was damaged unexpectedly.

How does Orpheus save requests that would fail using only BGP?

BGP (Border Gateway Protocol) is like the postal service of the Internet. It’s the protocol that makes the Internet work by enabling data routing. When someone requests data over the Internet, BGP is responsible for looking at all the available paths a request could take, then selecting a route.

BGP is designed to route around network failures by finding alternative paths to the destination IP address after the preferred path goes down. Sometimes, BGP does not route around a network failure at all. In this case, Cloudflare still receives BGP advertisements that an origin network is reachable via a particular autonomous system (AS), when actually packets sent through that AS will be dropped. In contrast, Orpheus will test alternate paths via synthetic probes and with real time traffic to ensure it is always using valid routes. Even when working as designed, BGP takes time to converge after a network disruption; Orpheus can react faster, find alternative paths to the origin that route around temporary or persistent errors, and ultimately save more Internet requests.

Additionally, BGP routes can be vulnerable to hijacking. If a BGP route is hijacked, Orpheus can prevent Internet requests from being dropped by invalid BGP routes by frequently testing all routes and examining the results to ensure they’re working as expected. In any of these cases, Orpheus routes around these BGP issues by taking advantage of the scale of Cloudflare’s global network which directly connects to 11,000 networks, features data centers across 275 cities, and has 172 Tbps of network capacity.

Let’s give an example of how Orpheus can save requests that would otherwise fail if only using BGP. Imagine an end user in Mumbai sends a request to a Cloudflare customer with an origin server in New York. For any request that misses Cloudflare’s cache, Cloudflare forwards the request from Mumbai to the website’s origin server in New York. Now imagine something happens, and the origin is no longer reachable from India: maybe a fiber optic cable was cut in Egypt, a different network advertised a BGP route it shouldn’t have, or an intermediary AS between Cloudflare and the origin was misconfigured that day.

In any of these scenarios, Orpheus can leverage the scale of Cloudflare’s global network to reach the origin in New York via an alternate path. While the direct path from Mumbai to New York may be unreachable, an alternate path from Mumbai, through London, then to New York may be available. This alternate path is valid because it uses different physical Internet connections that are unaffected by the issues with directly connecting from Mumbai to New York. In this case, Orpheus selects the alternate route through London and saves a request that would otherwise fail via the direct connection.

How Orpheus was built by reusing components of Argo Smart Routing

Back in 2017, Cloudflare released Argo Smart Routing which decreases latency by an average of 30%, improves security, and increases reliability. To help Cloudflare achieve its goal of helping build a better Internet for everyone, we decided to take the features that offered “increased reliability” in Argo Smart Routing and make them available to every Cloudflare user for free with Orpheus.

Argo Smart Routing’s architecture has two primary components: the data plane and the control plane. The control plane is responsible for computing the fastest routes between two locations and identifying potential failover paths in case the fastest route is down. The data plane is responsible for sending requests via the routes defined by the control plane, or detecting in real-time when a route is down and sending a request via a failover path as needed.

Orpheus was born with a simple technical idea: Cloudflare could deploy an alternate version of Argo’s control plane where the routing table only includes failover paths. Today, this alternate control plane makes up the core of Orpheus. If a request that travels through Cloudflare’s network is unable to connect to the origin via a preferred path, then Orpheus’s data plane selects a failover path from the routing table in its control plane. Orpheus prioritizes failover paths that are more reliable, which increases the likelihood that a request sent over the failover route succeeds.

Orpheus also takes advantage of a complex Internet monitoring system that we had already built for Argo Smart Routing. This system is constantly testing the health of many Internet routing paths between different Cloudflare data centers and a customer’s origin by periodically opening then closing a TCP connection. This is called a synthetic probe, and the results are used for Argo Smart Routing, Orpheus, and even in other Cloudflare products. Cloudflare directly connects to 11,000 networks, and typically there are many different Internet routing paths that reach the same origin. Argo and Orpheus maintain a database of the results of all TCP connections that opened successfully or failed with their corresponding routing paths.
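To make the idea of a synthetic probe concrete, here is a minimal Python sketch that opens a TCP connection, closes it immediately, and records whether the handshake succeeded. It is only an illustration: the function name `tcp_probe`, the result fields, and the timeout are assumptions of ours, not details of Cloudflare’s monitoring system.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> dict:
    """Open a TCP connection, close it right away, and report the outcome.

    Hypothetical helper for illustration; not Cloudflare's implementation.
    """
    started = time.monotonic()
    try:
        # create_connection completes the TCP handshake;
        # leaving the with-block closes the socket again.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        ok = False
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return {"host": host, "port": port, "ok": ok, "elapsed_ms": elapsed_ms}
```

A monitoring loop could call a helper like this on a schedule for each path and store the results, which is roughly the kind of success and failure history described above.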

Scaling the Orpheus data plane to save requests for everyone

Cloudflare proxies millions of requests to customers' origins every second, and we had to make some improvements to Orpheus before it was ready to save users’ requests at scale. In particular, Cloudflare designed Orpheus to only process and reroute requests that would otherwise fail. In order to identify these requests, we added an error cache to Cloudflare’s layer 7 HTTP stack.

When you send an Internet request (TCP SYN) through Cloudflare to our customer’s origin, and Cloudflare doesn’t receive a response (TCP SYN/ACK), the end user receives a 522 error (learn more about TCP flags). Orpheus creates an entry in the error cache for each unique combination of a 522 error, origin address, and a specific route to that origin. The next time a request is sent to the same origin address via the same route, Orpheus will check the error cache for relevant entries. If there is a hit in the error cache, then Orpheus’s data plane will select an alternative route to prevent subsequent requests from failing.
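The lookup described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: the class name `ErrorCache`, the string route identifiers, and the `choose_route` helper are hypothetical, and the real error cache lives inside Cloudflare’s layer 7 HTTP stack.

```python
from typing import Dict, List, Optional, Tuple

# Cache key: (origin address, route identifier). Every entry represents a 522.
ErrorKey = Tuple[str, str]


class ErrorCache:
    """Hypothetical in-memory error cache, for illustration only."""

    def __init__(self) -> None:
        self._entries: Dict[ErrorKey, int] = {}  # key -> number of 522s seen

    def record_522(self, origin: str, route: str) -> None:
        key = (origin, route)
        self._entries[key] = self._entries.get(key, 0) + 1

    def is_failed(self, origin: str, route: str) -> bool:
        return (origin, route) in self._entries


def choose_route(cache: ErrorCache, origin: str, preferred: str,
                 failovers: List[str]) -> Optional[str]:
    """Use the preferred route unless the cache says it recently failed."""
    if not cache.is_failed(origin, preferred):
        return preferred
    for route in failovers:
        if not cache.is_failed(origin, route):
            return route
    return None  # every known route to this origin has a cached failure
```

Because the check is a single dictionary lookup, requests to healthy origins pay almost nothing, which matches the goal of only processing requests that would otherwise fail.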

To keep entries in the error cache updated, Orpheus will use live traffic to retry routes that previously failed to check their status. Routes in the error cache are periodically retried with a bounded exponential backoff. Unavailable routes are tested every 5th, 25th, 125th, 625th, and 3,125th request (the maximum bound). If the test request that’s sent down the original path fails, Orpheus saves the test request, sends it via the established alternate path, and updates the backoff counter. If a test request is successful, then the failed route is removed from the error cache, and normal routing operations are restored. Additionally, the error cache has an expiry period of 10 minutes. This prevents the cache from storing entries on failed routes that rarely receive additional requests.
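Here is one way the retry schedule could be modeled in Python. We are assuming the interval between test requests grows by a factor of five after each failed test (5, 25, 125, 625, then 3,125 requests) and that the 10 minute expiry is measured from the last request that touched the entry; both are our reading of the description, and the names `FailedRoute`, `should_test`, and `test_failed` are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

BACKOFF_BASE = 5
MAX_INTERVAL = 3125    # 5 ** 5, the maximum bound
EXPIRY_SECONDS = 600   # 10 minutes


@dataclass
class FailedRoute:
    """Hypothetical error cache entry with a bounded exponential backoff."""
    interval: int = BACKOFF_BASE
    requests_since_test: int = 0
    last_seen: float = field(default_factory=time.monotonic)

    def should_test(self, now: Optional[float] = None) -> bool:
        """Count one request; return True if it should retry the failed route."""
        self.last_seen = time.monotonic() if now is None else now
        self.requests_since_test += 1
        if self.requests_since_test >= self.interval:
            self.requests_since_test = 0
            return True
        return False

    def test_failed(self) -> None:
        """The test request failed again: back off, up to the maximum bound."""
        self.interval = min(self.interval * BACKOFF_BASE, MAX_INTERVAL)

    def expired(self, now: Optional[float] = None) -> bool:
        """True once no request has touched this entry for the expiry period."""
        now = time.monotonic() if now is None else now
        return now - self.last_seen >= EXPIRY_SECONDS
```

In this sketch, a successful test request would simply delete the entry from the cache, and the caller would resend any failed test request over the alternate path so the end user never sees the failure.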

The error cache has a notable trade-off: one direct-to-origin request must fail before Orpheus engages and saves subsequent requests. Clearly this isn’t ideal, and the Argo / Orpheus engineering team is hard at work improving Orpheus so it can prevent any request from failing.

Making Orpheus faster and more responsive

Orpheus does a great job of identifying congested or unreachable paths on the Internet, and re-routing requests that would have otherwise failed. However, there is always room for improvement, and Cloudflare has been hard at work to make Orpheus even better.

Since its release, Orpheus has selected failover paths with the highest predicted reliability when it saves a request to an origin. This was an excellent first step, but sometimes a request that was re-routed by Orpheus would take an inefficient path that had better origin reachability but also increased latency. With recent improvements, the Orpheus routing algorithm balances both latency and origin reachability when selecting a new route for a request. If an end user makes a request to an origin, and that request is re-routed by Orpheus, it’s nearly as fast as any other request on Cloudflare’s network.

In addition to decreasing the latency of Orpheus requests, we’re working to make Orpheus more responsive to connectivity changes across the Internet. Today, Orpheus leverages synthetic probes to test whether Internet routes are reachable or unreachable. In the near future, Orpheus will also leverage real-time traffic data to more quickly identify Internet routes that are unreachable and reachable. This will enable Orpheus to re-route traffic around connectivity problems on the Internet within minutes rather than hours.

Expanding Orpheus to save WebSockets requests

Previously, Orpheus focused on saving HTTP and TCP Internet requests. Cloudflare has seen amazing benefits to origin reliability and Internet stability for these types of requests, and we’ve been hard at work to expand Orpheus to also save WebSocket requests from failing.

WebSockets is a common Internet protocol that prioritizes sending real time data between a client and server by maintaining an open connection between that client and server. Imagine that you (the client) have sent a request to see a website’s home page (which is generated by the server). When using HTTP, the connection between the client and server is established by the client, and the connection is closed once the request is completed. That means that if you send three requests to a website, three different connections are opened and closed, one for each request.

In contrast, when using the WebSockets protocol, one connection is established between the client and server. All requests moving in between the client and server are sent through this connection until the connection is terminated. In this case, you could send 10 requests to a website, and all of those requests would travel over the same connection. Due to these differences in protocol, Cloudflare had to adjust Orpheus to make it capable of also saving WebSockets requests. Now all Cloudflare customers that use WebSockets in their Internet applications can expect the same level of stability and resiliency across their HTTP, TCP, and WebSockets traffic.

P.S. If you’re interested in working on Orpheus, drop us a line!

Orpheus and Argo Smart Routing

Orpheus runs on the same technology that powers Cloudflare’s Argo Smart Routing product. While Orpheus is designed to maximize origin reachability, Argo Smart Routing leverages network latency data to accelerate traffic on Cloudflare’s network and find the fastest route between an end user and a customer’s origin. On average, customers using Argo Smart Routing see that their web assets perform 30% faster. Together, Orpheus and Argo Smart Routing work to improve the end user experience for websites and contribute to Cloudflare’s goal of helping build a better Internet.

If you’re a Cloudflare customer, you are automatically using Orpheus behind the scenes and improving your website’s availability. If you want to make the web faster for your users, you can log in to the Cloudflare dashboard and add Argo Smart Routing to your contract or plan today.

Monitor your own network with free network flow analytics from Cloudflare

2022-09-28 Chris Draper

Post Syndicated from Chris Draper original https://blog.cloudflare.com/free-magic-network-monitoring/

As a network engineer or manager, answering questions about the traffic flowing across your infrastructure is a key part of your job. Cloudflare built Magic Network Monitoring (previously called Flow Based Monitoring) to give you better visibility into your network and to answer questions like, “What is my network’s peak traffic volume? What are the sources of that traffic? When does my network see that traffic?” Today, Cloudflare is excited to announce early access to a free version of Magic Network Monitoring that will be available to everyone. You can request early access by filling out this form.

Magic Network Monitoring now features a powerful analytics dashboard, self-serve configuration, and a step-by-step onboarding wizard. You’ll have access to a tool that helps you visualize your traffic and filter by packet characteristics including protocols, source IPs, destination IPs, ports, TCP flags, and router IP. Magic Network Monitoring also includes network traffic volume alerts for specific IP addresses or IP prefixes on your network.

Making Network Monitoring easy

Magic Network Monitoring allows customers to collect network analytics without installing a physical device like a network TAP (Test Access Point) or setting up overly complex remote monitoring systems. Our product works with any hardware that exports network flow data, and customers can quickly configure any router to send flow data to Cloudflare’s network. From there, our network flow analyzer will aggregate your traffic data and display it in Magic Network Monitoring analytics.

Analytics dashboard

In Magic Network Monitoring analytics, customers can take a deep dive into their network traffic data. You can filter traffic data by protocol, source IP, destination IP, TCP flags, and router IP. Customers can combine these filters together to answer questions like, “How much ICMP data was requested from my speed test server over the past 24 hours?” Visibility into traffic analytics is a key part of understanding your network’s operations and proactively improving your security. Let’s walk through some cases where Magic Network Monitoring analytics can answer your network visibility and security questions.
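To show what combining filters means in practice, here is a small Python sketch that answers the ICMP question above over a list of flow records. The `FlowRecord` shape and the `total_bytes` function are hypothetical and exist only for illustration; they are not the Magic Network Monitoring data model or API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FlowRecord:
    # Hypothetical record shape; field names are illustrative only.
    timestamp: float   # seconds since the epoch
    protocol: str      # for example "ICMP", "TCP", or "UDP"
    src_ip: str
    dst_ip: str
    router_ip: str
    byte_count: int


def total_bytes(records: Iterable[FlowRecord], since: float = 0.0,
                protocol: Optional[str] = None, src_ip: Optional[str] = None,
                dst_ip: Optional[str] = None,
                router_ip: Optional[str] = None) -> int:
    """Sum bytes over records that match every filter that was supplied."""
    total = 0
    for record in records:
        if record.timestamp < since:
            continue
        if protocol is not None and record.protocol != protocol:
            continue
        if src_ip is not None and record.src_ip != src_ip:
            continue
        if dst_ip is not None and record.dst_ip != dst_ip:
            continue
        if router_ip is not None and record.router_ip != router_ip:
            continue
        total += record.byte_count
    return total
```

With records like these, the speed test question becomes a call such as `total_bytes(records, since=time.time() - 24 * 3600, protocol="ICMP", dst_ip=speed_test_ip)`, where `speed_test_ip` stands in for your server’s address.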

Create network volume alert thresholds per IP address or IP prefix

Magic Network Monitoring is incredibly flexible, and it can be customized to meet the needs of any network hobbyist or business. You can monitor your traffic volume trends over time via the analytics dashboard and build an understanding of your network’s traffic profile. After gathering historical network data, you can set custom volumetric threshold alerts for one IP prefix or a group of IP prefixes. As your network traffic changes over time, or your network expands, you can easily update your Magic Network Monitoring configuration to receive data from new routers or destinations within your network.
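As a rough illustration of a volumetric threshold per IP prefix, the Python sketch below sums observed traffic per prefix and reports the prefixes that exceed their limit. The function name, the input shapes, and the example prefixes are assumptions made for this example; in the product, thresholds are set through the Magic Network Monitoring configuration rather than in code like this.

```python
import ipaddress
from typing import Dict, Iterable, List, Tuple


def prefixes_over_threshold(samples: Iterable[Tuple[str, float]],
                            thresholds_mbps: Dict[str, float]) -> List[str]:
    """Return the prefixes whose summed traffic exceeds their threshold.

    samples: (destination IP, Mbps observed during one interval) pairs.
    thresholds_mbps: maps an IP prefix in CIDR notation to its limit in Mbps.
    """
    limits = {ipaddress.ip_network(p): limit
              for p, limit in thresholds_mbps.items()}
    totals = {network: 0.0 for network in limits}
    for ip_text, mbps in samples:
        address = ipaddress.ip_address(ip_text)
        for network in limits:
            if address in network:
                totals[network] += mbps
    return sorted(str(network) for network, total in totals.items()
                  if total > limits[network])
```

For example, with a 550 Mbps limit on `192.0.2.0/24`, two hosts in that prefix receiving 400 Mbps and 200 Mbps in the same interval would put the prefix over its threshold.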

Monitoring a speed test server in a home lab

Let’s run through an example where you’re running a network home lab. You decide to use Magic Network Monitoring to track the volume of requests a speed test server you’re hosting receives and check for potential bad actors. Your goal is to identify when your speed test server experiences peak traffic, and the volume of that traffic. You set up Magic Network Monitoring and create a rule that analyzes all traffic destined for your speed test server’s IP address. After collecting data for seven days, the analytics dashboard shows that peak traffic occurs on weekdays in the morning, and that during this time, your traffic volume ranges from 450 – 550 Mbps.

As you’re checking over the analytics data, you also notice strange traffic spikes of 300 – 350 Mbps in the middle of the night that occur at the same time. As you investigate further, the analytics dashboard shows that these spikes all originate from the same IP prefix. You research some source IPs, and find they’re associated with malicious activity. As a result, you update your firewall to block traffic from this problematic source.

Identifying a network layer DDoS attack

Magic Network Monitoring can also be leveraged to identify a variety of L3, L4, and L7 DDoS attacks. Let’s run through an example of how ACME Corp, a small business using Magic Network Monitoring, can identify a Ping (ICMP) Flood attack on their network. Ping Flood attacks aim to overwhelm the targeted network’s ability to respond to a high number of requests or overload the network connection with bogus traffic.

At the start of a Ping Flood attack, your server’s traffic volume will begin to ramp up. Magic Network Monitoring will analyze traffic across your network, and send an email, webhook, or PagerDuty alert once an unusual volume of traffic is identified. Your network and security team can respond to the volumetric alert by checking the data in Magic Network Monitoring analytics and identifying the attack type. In this case, they’ll notice the following traffic characteristics:

Network traffic volume above your historical traffic averages
An unusually large amount of ICMP traffic
ICMP traffic coming from a specific set of source IPs

Now, your network security team has confirmed the traffic is malicious by identifying the attack type, and can begin taking steps to mitigate the attack.
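The three traffic characteristics listed above can be expressed as a simple check over flow data. The Python sketch below is a toy heuristic under our own assumptions: the function name, the tuple layout, and the default limits (twice the baseline volume, more than half of traffic being ICMP) are invented for illustration and are not how Magic Network Monitoring identifies attacks.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def ping_flood_signals(flows: Iterable[Tuple[str, str, int]],
                       baseline_bytes: int,
                       volume_factor: float = 2.0,
                       icmp_share_limit: float = 0.5,
                       top_n: int = 3) -> Dict[str, object]:
    """Summarize the signals that suggest a Ping (ICMP) Flood.

    flows: (protocol, source IP, bytes) tuples for the window under review.
    baseline_bytes: historical average volume for a window of the same length.
    """
    total = 0
    icmp_total = 0
    icmp_by_source: Counter = Counter()
    for protocol, source, size in flows:
        total += size
        if protocol == "ICMP":
            icmp_total += size
            icmp_by_source[source] += size
    icmp_share = icmp_total / total if total else 0.0
    return {
        # 1. Volume above the historical average.
        "volume_above_baseline": total > baseline_bytes * volume_factor,
        # 2. An unusually large amount of ICMP traffic.
        "icmp_heavy": icmp_share > icmp_share_limit,
        # 3. The specific sources sending most of the ICMP traffic.
        "top_icmp_sources": [s for s, _ in icmp_by_source.most_common(top_n)],
    }
```

The list of top ICMP sources is the piece a security team would most likely carry forward into a firewall rule or mitigation request.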

Magic Network Monitoring and Magic Transit

If your business is impacted by DDoS attacks, Magic Network Monitoring will identify attacks, and Magic Transit can be used to mitigate those DDoS attacks. Magic Transit protects customers’ entire network from DDoS attacks by placing our network in front of theirs. You can use Magic Transit Always On to reduce latency and mitigate attacks all the time, or Magic Transit On Demand to protect your network during active attacks. With Magic Transit, you get DDoS protection, traffic acceleration, and other network functions delivered as a service from every Cloudflare data center. Magic Transit works by allowing Cloudflare to advertise customers’ IP prefixes to the Internet with BGP to route the customer’s traffic through our network for DDoS protection. If you’re interested in protecting your network with Magic Transit, you can visit the Magic Transit product page and request a demo today.

The free version of Magic Network Monitoring (MNM) will be released in the next few weeks. You can request early access by filling out this form.

This is just the beginning for Magic Network Monitoring. In the future, you can look forward to features like advanced DDoS attack identification, network incident history and trends, and volumetric alert threshold recommendations.

How Cloudflare One solves your observability problems

2022-06-21 Chris Draper

Post Syndicated from Chris Draper original https://blog.cloudflare.com/cloudflare-one-observability/

Today, we’re excited to announce Cloudflare One Observability. Cloudflare One Observability will help customers work across Cloudflare One applications to troubleshoot network connectivity, security policies, and performance issues to ensure a consistent experience for employees everywhere. Cloudflare One, our comprehensive SASE platform, already includes visibility for individual products; Cloudflare One Observability is the next step in bringing data together across the Cloudflare One platform.

Network taps and legacy enterprise networks

Traditional enterprise networks operated like a castle protected by a moat. Employees working from a physical office location authenticated themselves at the beginning of their session, they were protected by an extensive office firewall, and the majority of the applications they accessed were on-premise.

Many enterprise networks had a strictly defined number of “entrances” for employees at office locations. Network taps (devices used to measure and report events on a local network) monitored each entrance point, and these devices gave network administrators and engineers complete visibility into their operations.

Learn more about the old castle-and-moat network security model.

Incomplete observability in today’s enterprise network

Today’s enterprise networks have expanded beyond the traditional on-premise model and have become extremely fragmented. Now, employees can work from anywhere. People access enterprise networks from across the Internet, and the applications they use every day are a mix of on-premise and SaaS cloud instances.

SaaS applications are hosted outside the enterprise network, leaving your security teams with limited observability into how users access those applications and move data through them. Without observability on the applications your employees are using, you can’t control how sensitive data is stored, shared, or exposed to third parties.

Now that enterprise networks have become more fragmented, it is increasingly difficult to understand how the various fragments are operating. To even gain limited observability, you have to implement a disorganized combination of network taps, flow data, synthetic probes, and dashboards that fail to share data across one another.

Total observability across an enterprise & cloud network built on Cloudflare One

Cloudflare One Observability is built to solve today’s issue of network fragmentation in a zero trust world. Instead of having data spread across multiple network tools, Cloudflare One Observability will combine data from different Cloudflare One functions into a single experience. Customers will be able to go to one place to troubleshoot any issues they’re experiencing with their enterprise applications and networks.

In today’s world of fragmented enterprise networks, there are some questions that can be difficult to answer. Let’s break down a couple of customer examples and walk through how Cloudflare One Observability will simplify the troubleshooting process.

Troubleshooting bandwidth issues across branch locations

A customer may want to know, “What applications are using up the majority of my bandwidth across multiple office locations?” In a typical enterprise network, a network engineer would need to install a network tap or collect flow data at each office location, aggregate the information across separate networks, then build a custom tool to visualize the bandwidth data.

Instead, for Cloudflare One customers, Cloudflare will automatically do all the upfront data collection and aggregation. Customers will be able to skip straight to troubleshooting and solving their bandwidth problem by using Cloudflare One Observability to visualize bandwidth usage across office locations.

Identifying network vulnerabilities

Another challenging question that customers face is, “What attack trends are popular, and is my network vulnerable?” Assessing a network’s vulnerability is time-consuming as administrators dive into separate applications for VPNs, firewalls, user policies, and endpoints to understand their network’s security posture.

Cloudflare One is built from the ground up to simplify this problem. Observability is straightforward when your network on-ramps, firewalls, user policies, and endpoint protection are all managed within the same platform. Customers will be able to go to the Cloudflare One Observability experience to see security patches that are automatically applied by Cloudflare so that customers don’t have to worry. Cloudflare One lets you know whether you’ve been targeted by an attack and gives you confidence that you’re protected.

Troubleshooting slow network performance

Many people have experienced logging into a slow enterprise network. The general problem of “my network is slow when I access an on-premise or SaaS application” can be tough to solve. If employees are working remotely, a network engineer would need to dig through different applications to troubleshoot latency and jitter between VPNs, firewalls, user policies, and endpoint connections.

Cloudflare One Observability simplifies this time-consuming troubleshooting process. When your on-ramps, firewalls, user policies, and endpoint monitoring are all configured on the same platform, you only need to go to one place to troubleshoot these network functions. Cloudflare One’s architecture is built on the concept of single pass inspection. When a request lands on a Cloudflare server, that request passes through instances of Cloudflare One services all on that same single server. This makes it easy to visualize end-to-end network request handling, so customers can seamlessly analyze traffic and identify a network bottleneck or misconfiguration.

Observability powered by Cloudflare’s network

Cloudflare One Observability is built on Cloudflare’s best-in-class network. We have data centers in 270+ cities and over 100 countries. Since every Cloudflare One product runs on every server, we can provide an unparalleled fast and consistent experience to customers everywhere. Cloudflare built its network and security applications from the ground up on the same infrastructure. Unlike our competitors that have strung together a zero trust platform by building siloed applications or through acquisitions, Cloudflare One applications are seamlessly integrated and designed from day one to share data between one another.

As our applications are all built on the same infrastructure, so are our data pipelines and logging services. When you use Cloudflare One, you get the full benefits of our advanced data tools, like Instant Logs for delivering live network data as it arrives and ABR for analyzing network data at scale.

Delivering the Zero Trust observability customers need today

Since 2009, Cloudflare has built one of the fastest, most reliable, and most secure networks in the world. We’ve built Cloudflare One and Cloudflare One Observability on top of this network, and we’re extending its power to meet the challenges of any company. The move to Zero Trust is a paradigm shift, and we believe the security benefits of this new paradigm will make it inevitable for every company. We’re proud of how we have helped and continue to help existing and new customers reinvent their corporate networks.

Construction of Cloudflare One Observability is still in progress. If you’re excited about this new product, you can sign up for our wait list now!