Network-layer DDoS attacks are on the rise, prompting security teams to rethink their L3 DDoS mitigation strategies to prevent business impact. Magic Transit protects customers’ entire networks from DDoS attacks by placing our network in front of theirs, either always on or on demand. Today, we’re announcing new functionality to improve the experience for on-demand Magic Transit customers: flow-based monitoring. Flow-based monitoring allows us to detect threats and notify customers when they’re under attack so they can activate Magic Transit for protection.
Magic Transit is Cloudflare’s solution to secure and accelerate your network at the IP layer. With Magic Transit, you get DDoS protection, traffic acceleration, and other network functions delivered as a service from every Cloudflare data center. With Cloudflare’s global network (59 Tbps capacity across 200+ cities) and <3sec time to mitigate at the edge, you’re covered from even the largest and most sophisticated attacks without compromising performance. Learn more about Magic Transit here.
Using Magic Transit on demand
With Magic Transit, Cloudflare advertises customers’ IP prefixes to the Internet with BGP in order to attract traffic to our network for DDoS protection. Customers can choose to use Magic Transit always on or on demand. With always on, we advertise their IPs and mitigate attacks all the time; for on demand, customers activate advertisement only when their networks are under active attack. But there’s a problem with on demand: if your traffic isn’t routed through Cloudflare’s network, by the time you notice you’re being targeted by an attack and activate Magic Transit to mitigate it, the attack may have already caused impact to your business.
On demand with flow-based monitoring
Flow-based monitoring solves the problem with on-demand by enabling Cloudflare to detect and notify you about attacks based on traffic flows from your data centers. You can configure your routers to continuously send NetFlow or sFlow (coming soon) to Cloudflare. We’ll ingest your flow data and analyze it for volumetric DDoS attacks.
When an attack is detected, we’ll notify you automatically (by email, webhook, and/or PagerDuty) with information about the attack.
You can choose whether you’d like to activate IP advertisement with Magic Transit manually – we support activation via the Cloudflare dashboard or API – or automatically, to minimize the time to mitigation. Once Magic Transit is activated and your traffic is flowing through Cloudflare, you’ll receive only the clean traffic back to your network over your GRE tunnels.
Using flow-based monitoring with Magic Transit on demand will provide your team peace of mind. Rather than acting in response to an attack after it impacts your business, you can complete a simple one-time setup and rest assured that Cloudflare will notify you (and/or start protecting your network automatically) when you’re under attack. And once Magic Transit is activated, Cloudflare’s global network and industry-leading DDoS mitigation has you covered: your users can continue business as usual with no impact to performance.
Example flow-based monitoring workflow: faster time to mitigate for Acme Corp
Let’s walk through an example customer deployment and workflow with Magic Transit on demand and flow-based monitoring. Acme Corp’s network was hit by a large ransom DDoS attack recently, which caused downtime for both external-facing and internal applications. To make sure they’re not impacted again, the Acme network team chose to set up on-demand Magic Transit. They authorize Cloudflare to advertise their IP space to the Internet in case of an attack, and set up Anycast GRE tunnels to receive clean traffic from Cloudflare back to their network. Finally, they configure their routers at each data center to send NetFlow data to a Cloudflare Anycast IP.
Cloudflare receives Acme’s NetFlow data at a location close to the data center sending it (thanks, Anycast!) and analyzes it for DDoS attacks. When traffic exceeds attack thresholds, Cloudflare triggers an automatic PagerDuty incident for Acme’s NOC team and starts advertising Acme’s IP prefixes to the Internet with BGP. Acme’s traffic, including the attack, starts flowing through Cloudflare within minutes, and the attack is blocked at the edge. Clean traffic is routed back to Acme through their GRE tunnels, causing no disruption to end users – they’ll never even know Acme was attacked. When the attack has subsided, Acme’s team can withdraw their prefixes from Cloudflare with one click, returning their traffic to its normal path.
To learn more about Magic Transit and flow-based monitoring, contact us today.
In late 2020, a major Fortune Global 500 company was targeted by a Ransom DDoS (RDDoS) attack by a group claiming to be the Lazarus Group. Cloudflare quickly onboarded them to the Magic Transit service and protected them against the lingering threat. This extortion attempt was part of wider ransom campaigns that have been unfolding throughout the year, targeting thousands of organizations around the world. Extortionists are threatening organizations with crippling DDoS attacks if they do not pay a ransom.
Throughout 2020, Cloudflare onboarded and protected many organizations with Magic Transit, Cloudflare’s DDoS protection service for critical network infrastructure, the WAF service for HTTP applications, and the Spectrum service for TCP/UDP based applications — ensuring their business’s availability and continuity.
Unwinding the attack timeline
I spoke with Daniel (a pseudonym) and his team, who work at the Incident Response and Forensics team at the aforementioned company. I wanted to learn about their experience, and share it with our readers so they could learn how to better prepare for such an event. The company has requested to stay anonymous and so some details have been omitted to ensure that. In this blog post, I will refer to them as X.
Initially, the attacker sent ransom emails to a handful of X’s publicly listed email aliases such as [email protected], [email protected], and [email protected] We’ve heard from other customers that in some cases, non-technical employees received the email and ignored it as being spam which delayed the incident response team’s time to react by hours. However, luckily for X, a network engineer that was on the email list of the [email protected] alias saw it and immediately forwarded it to Daniel’s incident response team.
In the ransom email, the attackers demanded 20 bitcoin and gave them a week to pay up, or else a second larger attack would strike, and the ransom would increase to 30 bitcoin. Daniel says that they had a contingency plan ready for this situation and that they did not intend to pay. Paying the ransom funds illegitimate activities, motivates the attackers, and does not guarantee that they won’t attack anyway.
…Please perform a google search of “Lazarus Group” to have a look at some of our previous work. Also, perform a search for “NZX” or “New Zealand Stock Exchange” in the news. You don’t want to be like them, do you?…
The current fee is 20 Bitcoin (BTC). It’s a small price to pay for what will happen if your whole network goes down. Is it worth it? You decide!…
If you decide not to pay, we will start the attack on the indicated date and uphold it until you do. We will completely destroy your reputation and make sure your services will remain offline until you pay…
–An excerpt of the ransom note
The contingency plan
Upon receiving the email from the network engineer, Daniel called him and they started combing through the network data — they noticed a significant increase in traffic towards one of their global data centers. This attacker was not playing around, firing gigabits per second towards a single server. The attack, despite just being a proof of intention, saturated the Internet uplink to that specific data center, causing a denial of service event and generating a series of failure events.
This first “teaser” attack came on a work day, towards the end of business hours as people were already wrapping up their day. At the time, X was not protected by Cloudflare and relied on an on-demand DDoS protection service. Daniel activated the contingency plan which relied on the on-demand scrubbing center service.
Daniel contacted their DDoS protection service. It took them over 30 minutes to activate the service and redirect X’s traffic to the scrubbing center. Activating the on-demand service caused networking failures and resulted in multiple incidents for X on various services — even ones that were not under attack. Daniel says hindsight is 2020 and he realized that an always-on service would have been much more effective than on-demand, reactionary control that takes time to implement, after the impact is felt. The networking failures amplified the one-hour attack resulting in incidents lasting much longer than expected.
Onboarding to Cloudflare
Following the initial attack, Daniel’s team reached out to Cloudflare and wanted to onboard to our automated always-on DDoS protection service, Magic Transit. The goal was to onboard to it before the second attack would strike. Cloudflare explained the pre-onboarding steps, provided details on the process, and helped onboard X’s network in a process Daniel defined as “quite painless and very professional. The speed and responsiveness were impressive. One of the key differentiation is the attack and traffic analytics that we see that our incumbent provider couldn’t provide us. We’re seeing attacks we never knew about being mitigated automatically.”
The attackers promised a second, huge attack which never happened. Perhaps it was just an empty threat, or it could be that the attackers detected that X is protected by Cloudflare which deterred them and they, therefore, decided to move on to their next target?
Recommendations for organizations
I asked Daniel if he has any recommendations for businesses so they can learn from his experience and be better prepared, should they be targeted by ransom attacks:
1. Utilize an automated always-on DDoS protection service
Do not rely on reactive on-demand SOC-based DDoS Protection services that require humans to analyze attack traffic. It just takes too long. Don’t be tempted to use an on-demand service: “you get all of the pain and none of the benefits”. Instead, onboard to a cloud service that has sufficient network capacity and automated DDoS mitigation systems.
2. Work with your vendor to build and understand your threat model
Work together with your DDoS protection vendor to tailor mitigation strategies to your workload. Every network is different, and each poses unique challenges when integrating with DDoS mitigation systems.
3. Create a contingency plan and educate your employees
Be prepared. Have plans ready and train your teams on them. Educate all of your employees, even the non-techies, on what to do if they receive a ransom email. They should report it immediately to your Security Incident Response team.
Cloudflare customers need not worry as they are protected. Enterprise customers can reach out to their account team if they are being extorted in order to review and optimize their security posture if needed. Customers on all other plans can reach out to our support teams and learn more about how to optimize your Cloudflare security configuration.
On the week of Black Friday, Cloudflare automatically detected and mitigated a unique ACK DDoS attack, which we’ve codenamed “Beat”, that targeted a Magic Transit customer. Usually, when attacks make headlines, it’s because of their size. However, in this case, it’s not the size that is unique but the method that appears to have been borrowed from the world of acoustics.
Acoustic inspired attack
As can be seen in the graph below, the attack’s packet rate follows a wave-shaped pattern for over 8 hours. It seems as though the attacker was inspired by an acoustics concept called beat. In acoustics, a beat is a term that is used to describe an interference of two different wave frequencies. It is the superposition of the two waves. When the two waves are nearly 180 degrees out of phase, they create the beating phenomenon. When the two waves merge they amplify the sound and when they are out of sync they cancel one another, creating the beating effect.
Acedemo.org has a nice tool where you can create your own beat wave. As you can see in the screenshot below, the two waves in blue and red are out of phase and the purple wave is their superposition, the beat wave.
Reverse engineering the attack
It looks like the attacker launched a flood of packets where the rate of the packets is determined by the equation of the beat wave: y‘beat=y1+y2. The two equations y1 and y2 represent the two waves.
Each equation is expressed as
where fi is the frequency of each wave and t is time.
Therefore, the packet rate of the attack is determined by manipulation of the equation
to achieve a packet rate that ranges from ~18M to ~42M pps.
To get to the scale of this attack we will need to multiply y‘beat by a certain variable a and also add a constant c, giving us ybeat=ay‘beat+c. Now, it’s been a while since I played around with equations, so I’m only going to try and get an approximation of the equation.
By observing the attack graph, we can guesstimate that
by playing around with desmos’s cool graph visualizer tool, if we set f1=0.0000345 and f2=0.00003455 we can generate a graph that resembles the attack graph. Plotting in those variables, we get:
Now this formula assumes just one node firing the packets. However, this specific attack was globally distributed, and if we assume that each node, or bot in this botnet, was firing an equal amount of packets at an equal rate, then we can divide the equation by the size of the botnet; the number of bots b. Then the final equation is something in the form of:
In the screenshot below, g = f 1. You can view this graph here.
Beating the drum
The attacker may have utilized this method in order to try and overcome our DDoS protection systems (perhaps thinking that the rhythmic rise and fall of the attack would fool our systems). However, flowtrackd, our unidirectional TCP state tracking machine, detected it as being a flood of ACK packets that do not belong to any existing TCP connection. Therefore, flowtrackd automatically dropped the attack packets at Cloudflare’s edge.
The attacker was beating the drum for over 19 hours with an amplitude of ~7 Mpps, a wavelength of ~4 hours, and peaking at ~42 Mpps. During the two days in which the attack took place, Cloudflare systems automatically detected and mitigated over 700 DDoS attacks that targeted this customer. The attack traffic accumulated at almost 500 Terabytes out of a total of 3.6 Petabytes of attack traffic that targeted this single customer in November alone. During those two days, the attackers utilized mainly ACK floods, UDP floods, SYN floods, Christmas floods (where all of the TCP flags are ‘lit’), ICMP floods, and RST floods.
The challenge of TCP based attacks
TCP is a stateful protocol, which means that in some cases, you’d need to keep track of a TCP connection’s state in order to know if a packet is legitimate or part of an attack, i.e. out of state. We were able to provide protection against out-of-state TCP packet attacks for our “classic” WAF/CDN service and Spectrum service because in both cases Cloudflare serves as a reverse-proxy seeing both ingress and egress traffic.
However, when we launched Magic Transit, which relies on an asymmetric routing topology with a direct server return (DSR), we couldn’t utilize our existing TCP connection tracking systems.
And so, being a software-defined company, we’re able to write code and spin up software when and where needed — as opposed to vendors that utilize dedicated DDoS protection hardware appliances. And that is what we did. We built flowtrackd, which runs autonomously on each server at our network’s edge. flowtrackd is able to classify the state of TCP flows by analyzing only the ingress traffic, and then drops, challenges, or rate-limits attack packets that do not correspond to an existing flow.
flowtrackd works together with our two additional DDoS protection systems, dosd and Gatebot, to assure our customers are protected against DDoS attacks, regardless of their size or sophistication — in this case, serving as a noise-canceling system to the Beat attack; reducing the headaches for our customers.
Read more about how our DDoS protection systems work here.
Today we’re excited to announce Magic Firewall™, a network-level firewall delivered through Cloudflare to secure your enterprise. Magic Firewall covers your remote users, branch offices, data centers and cloud infrastructure. Best of all, it’s deeply integrated with Cloudflare One™, giving you a one-stop overview of everything that’s happening on your network.
Cloudflare Magic Transit™ secures IP subnets with the same DDoS protection technology that we built to keep our own global network secure. That helps ensure your network is safe from attack and available and it replaces physical appliances that have limits with Cloudflare’s network.
That still leaves some hardware onsite, though, for a different function: firewalls. Networks don’t just need protection from DDoS attacks; administrators need a way to set policies for all traffic entering and leaving the network. With Magic Firewall, we want to help your team deprecate those network firewall appliances and move that burden to the Cloudflare global network.
Firewall boxes are miserable to manage
Network firewalls have always been clunky. Not only are they expensive, they are bound by their own hardware constraints. If you need more CPU or memory, you have to buy more boxes. If you lack capacity, the entire network suffers, directly impacting employees that are trying to do their work. To compensate, network operators and security teams are forced to buy more capacity than we need, resulting in having to pay more than necessary.
We’ve heard this problem from our Magic Transit customers who are constantly running into capacity challenges:
“We’re constantly running out of memory and running into connection limits on our firewalls. It’s a huge problem.”
Network operators find themselves piecing together solutions from different vendors, mixing and matching features, and worrying about keeping policies in sync across the network. The result is more headache and added cost.
The solution isn’t more hardware
Some organizations then turn to even more vendors and purchase additional hardware to manage the patchwork firewall hardware they have deployed. Teams then have to balance refresh cycles, updates, and end of life management across even more platforms. These are band-aid solutions that do not solve the fundamental problem: how do we create a single view of the entire network that gives insights into what is happening (good and bad) and apply policy instantaneously, globally?
Introducing Magic Firewall
Instead of more band-aids, we’re excited to launch Magic Firewall as a single, comprehensive, solution to network filtering. Unlike legacy appliances, Magic Firewall runs in the Cloudflare network. That network scales up or down with a customer’s needs at any given time.
Running in our network delivers an added benefit. Many customers backhaul network traffic to single chokepoints in order to perform firewalling operations, adding latency. Cloudflare operates data centers in 200 cities around the world and each of those points of presence is capable of delivering the same solution. Regional offices and data centers can instead rely on a Cloudflare Magic Firewall engine running within 100 milliseconds of their operation.
Integrated with Cloudflare One
Cloudflare One consists of products that allow you to apply a single filtering engine with consistent security controls to your entire network, not just part of it. The same types of controls that your organization wants to apply to traffic leaving your networks should be applied to traffic leaving your devices.
Magic Firewall will integrate with what you’re already using in Cloudflare. For example, traffic leaving endpoints outside of the network can reach Cloudflare using the Cloudflare WARP client where Gateway will apply the same rules your team configures for network level filtering. Branch offices and data centers can connect through Magic Transit with the same set of rules. This gives you a one-stop overview of your entire network instead of having to hunt down information across multiple devices and vendors.
How does it work?
So what is Magic Firewall? Magic Firewall is a way to replace your antiquated on-premises network firewall with an as-a-service solution, pushing your perimeter out to the edge. We already allow you to apply firewall rules at our edge with Magic Transit, but the process to add or change rules has previously involved working with your account team or Cloudflare support. Our first version, generally available in the next few months, will allow all our Magic Transit customers to apply static OSI Layer 3 & 4 mitigations completely self-service, at Cloudflare scale.
Cloudflare applies firewall policies at every data center
Meaning you have firewalls applying policies across the globe
Our first version of Magic Firewall will focus on static mitigations, allowing you to set a standard set of rules that apply to your entire network, whether devices or applications are sitting in the cloud, an employee’s device or a branch office. You’ll be able to express rules allowing or blocking based on:
Source or destination IP and port
Bit field match
Rules can be crafted in Wireshark syntax, a domain specific language common in the networking world and the same syntax we use across our other products. With this syntax, you can easily craft extremely powerful rules to precisely allow or deny any traffic in or out of your network. If you suspect there’s a bad actor inside or outside of your perimeter, simply log on to the dashboard and block that traffic. Rules are pushed out globally in seconds, shutting down threats at the edge.
Configuring firewalls should be easy and powerful. With Magic Firewall, rules can be configured using an easy UI that allows for complex logic. Or, just type the filter rule manually using Wireshark filter syntax and configure that way. Don’t want to mess with a UI? Rules can be added just as easily through the API.
Looking at packets is not enough… Even with firewall rules, teams still need visibility into what’s actually happening on their network: what’s happening inside of these datastreams? Is this legitimate traffic or do we have malicious actors either inside or outside of our network doing nefarious things? Deploying Cloudflare to sit between any two actors that interact with any of your assets (be they employee devices or services exposed to the Internet) allows us to enforce any policy, anywhere, either on where the traffic is coming from or what’s inside the traffic. Applying policies based on traffic type is just around the corner and we’re excited to announce that we’re planning to add additional capabilities to automatically detect intrusion events based on what’s happening inside datastreams in the near future.
We’re excited about this new journey. With Cloudflare One, we’re reinventing what the network looks like for corporations. We integrate access management, security features and performance across the board: for your network’s visitors but also for anyone inside it. All of this built on top of a network that was #BuiltForThis.
We’ll be opening up Magic Firewall in a limited beta, starting with existing Magic Transit customers. If you’re interested, please let us know.
Running a secure enterprise network is really difficult. Employees spread all over the world work from home. Applications are run from data centers, hosted in public cloud, and delivered as services. Persistent and motivated attackers exploit any vulnerability.
Enterprises used to build networks that resembled a castle-and-moat. The walls and moat kept attackers out and data in. Team members entered over a drawbridge and tended to stay inside the walls. Trust folks on the inside of the castle to do the right thing, and deploy whatever you need in the relative tranquility of your secure network perimeter.
The Internet, SaaS, and “the cloud” threw a wrench in that plan. Today, more of the workloads in a modern enterprise run outside the castle than inside. So why are enterprises still spending money building more complicated and more ineffective moats?
Today, we’re excited to share Cloudflare One™, our vision to tackle the intractable job of corporate security and networking.
Cloudflare One combines networking products that enable employees to do their best work, no matter where they are, with consistent security controls deployed globally.
Starting today, you can begin replacing traffic backhauls to security appliances with Cloudflare WARP and Gateway to filter outbound Internet traffic. For your office networks, we plan to bring next-generation firewall capabilities to Magic Transit with Magic Firewall to let you get rid of your top-of-shelf firewall appliances.
With multiple on-ramps to the Internet through Cloudflare, and the elimination of backhauled traffic, we plan to make it simple and cost-effective to manage that routing compared to MPLS and SD-WAN models. Cloudflare Magic WAN will provide a control plane for how your traffic routes through our network.
You can use Cloudflare One today to replace the other function of your VPN: putting users on a private network for access control. Cloudflare Access delivers Zero Trust controls that can replace private network security models. Later this week, we’ll announce how you can extend Access to any application – including SaaS applications. We’ll also preview our browser isolation technology to keep the endpoints that connect to those applications safe from malware.
Finally, the products in Cloudflare One focus on giving your team the logs and tools to both understand and then remediate issues. As part of our Gateway filtering launch this week we’re including logs that provide visibility into the traffic leaving your organization. We’ll be sharing how those logs get smarter later this week with a new Intrusion Detection System that detects and stops intrusion attempts.
Many of those components are available today, some new features are arriving this week, and other pieces will be launching soon. All together, we’re excited to share this vision and for the future of the corporate network.
Problems in enterprise networking and security
The demands placed on a corporate network have changed dramatically. IT has gone from a back-office function to mission critical. In parallel with networks becoming more integral, users spread out from offices to work from home. Applications left the datacenter and are now being run out of multiple clouds or are being delivered by vendors directly over the Internet.
Direct network paths became hairpin turns
Employees sitting inside of an office could connect over a private network to applications running in a datacenter nearby. When team members left the office, they could use a VPN to sneak back onto the network from outside the walls. Branch offices hopped on that same network over expensive MPLS links.
When applications left the data center and users left their offices, organizations responded by trying to force that scattered world into the same castle-and-moat model. Companies purchased more VPN licenses and replaced MPLS links with difficult SD-WAN deployments. Networks became more complex in an attempt to mimic an older model of networking when in reality the Internet had become the new corporate network.
Attackers looking to compromise corporate networks have a multitude of tools at their disposal, and may execute surgical malware strikes, throw a volumetric kitchen sink at your network, or any number of things in between. Traditionally, defense against each class of attack was provided by a separate, specialized piece of hardware running in a datacenter.
Security controls used to be relatively easy when every user and every application sat in the same place. When employees left offices and workloads left data centers, the same security controls struggled to follow. Companies deployed a patchwork of point solutions, attempting to rebuild their topside firewall appliances across hybrid and dynamic environments.
High-visibility required high-effort
The move to a patchwork model sacrificed more than just defense-in-depth — companies lost visibility into what was happening in their networks and applications. We hear from customers that this capture and standardization of logs has become one of their biggest hurdles. They purchased expensive data ingestion, analysis, storage, and analytics tools.
Enterprises now rely on multiple point solutions that one of the biggest hurdles is the capture and standardization of logs. Increasing regulatory and compliance pressures place more emphasis on data retention and analysis. Splintered security solutions become a data management nightmare.
Fixing issues relied on best guesses
Without visibility into this new networking model, security teams had to guess at what could go wrong. Organizations who wanted to adopt an “assume breach” model struggled to determine what kind of breach could even occur, so they threw every possible solution at the problem.
We talk to enterprises who purchase new scanning and filtering services, delivered in virtual appliances, for problems they are unsure they have. These teams attempt to remediate every possible event manually, because they lack visibility, rather than targeting specific events and adapting the security model.
How does Cloudflare One fit?
Over the last several years, we’ve been assembling the components of Cloudflare One. We launched individual products to target some of these problems one-at-a-time. We’re excited to share our vision for how they all fit together in Cloudflare One.
Flexible data planes
Cloudflare launched as a reverse proxy. Customers put their Internet-facing properties on our network and their audience connected to those specific destinations through our network. Cloudflare One represents years of launches that allow our network to process any type of traffic flowing in either the “reverse” or “forward” direction.
In 2019, we launchedCloudflare WARP — a mobile application that kept Internet-bound traffic private with an encrypted connection to our network while also making it faster and more reliable. We’re now packaging that same technology into an enterprise version launching this week to connect roaming employees to Cloudflare Gateway.
Your data centers and offices should have the same advantage. We launchedMagic Transit last year to secure your networks from IP-layer attacks. Our initial focus with Magic Transit has been delivering best-in-class DDoS mitigation to on-prem networks. DDoS attacks are a persistent thorn in network operators’ sides, and Magic Transit effectively diffuses their sting without forcing performance compromises. That rock-solid DDoS mitigation is the perfect platform on which to build higher level security functions that apply to the same traffic already flowing across our network.
Earlier this year, we expanded that model when we launchedCloudflare Network Interconnect (CNI) to allow our customers to interconnect branch offices and data centers directly with Cloudflare. As part of Cloudflare One, we’ll apply outbound filtering to that same connection.
Cloudflare One should not just help your team move to the Internet as a corporate network, it should be faster than the Internet. Our network is carrier-agnostic, exceptionally well-connected and peered, and delivers the same set of services globally. In each of these on-ramps, we’re adding smarter routing based on our Argo Smart Routing technology, which has been shown to reduce latency by 30% or more in the real-world. Security + Performance, because they’re better together.
A single, unified control plane
When users connect to the Internet from branch offices and devices, they skip the firewall appliances that used to live in headquarters altogether. To keep pace, enterprises need a way to secure traffic that no longer lives entirely within their own network. Cloudflare One applies standard security controls to all traffic – regardless of how that connection starts or where in the network stack it lives.
Cloudflare Access starts by introducing identity into Cloudflare’s network. Teams apply filters based on identity and context to both inbound and outbound connections. Every login, request, and response proxies through Cloudflare’s network regardless of the location of the server or user. The scale of our network and its distribution can filter and log enterprise traffic without compromising performance.
Cloudflare Gateway keeps connections to the rest of the Internet safe. Gateway inspects traffic leaving devices and networks for threats and data loss events that hide inside of connections at the application layer. Launching soon, Gateway will bring that same level of control lower in the stack to the transport layer.
You should have the same level of control over how your networks send traffic. We’re excited to announce Magic Firewall, a next-generation firewall for all traffic leaving your offices and data centers. With Gateway and Magic Firewall, you can build a rule once and run it everywhere, or tailor rules to specific use cases in a single control plane.
We know some attacks can’t be filtered because they launch before filters can be built to stop them. Cloudflare Browser, our isolated browser technology gives your team a bulletproof pane of glass from threats that can evade known filters. Later this week, we’ll invite customers to sign up to join the beta to browse the Internet on Cloudflare’s edge without the risk of code leaping out of the browser to infect an endpoint.
Finally, the PKI infrastructure that secures your network should be modern and simpler to manage. We heard from customers who described certificate management as one of the core problems of moving to a better model of security. Cloudflare works with, not against, modern encryption standards like TLS 1.3. Cloudflare made it easy to add encryption to your sites on the Internet with one click. We’re going to bring that ease-of-management to the network functions you run on Cloudflare One.
One place to get your logs, one location for all of your security analysis
Cloudflare’s network serves 18 million HTTP requests per second on average. We’ve built logging pipelines that make it possible for some of the largest Internet properties in the world to capture and analyze their logs at scale. Cloudflare One builds on that same capability.
Cloudflare Access and Gateway capture every request, inbound or outbound, without any server-side code changes or advanced client-side configuration. Your team can export those logs to the SIEM provider of your choice with our Cloudflare Logpush service – the same pipeline that exports HTTP request events at scale for public sites. Magic Transit expands that logging capability to entire networks and offices to ensure you never lose visibility from any location.
We’re going beyond just logging events. Available today for your websites, Cloudflare Web Analytics converts logs into insights. We plan to keep expanding that visibility into how your network operates, as well. Just as Cloudflare has replaced the “band-aid boxes” that performed disparate network functions and unified them into a cohesive, adaptable edge, we intend to do the same for the fragmented, hard to use, and expensive security analytics ecosystem. More to come on this soon.Smarter, faster remediation
Data and analytics should surface events that a team can remediate. Log systems that lead to one-click fixes can be powerful tools, but we want to make that remediation automatic.
Launching into a closed preview later this week, Cloudflare Intrusion Detection System (IDS) will proactively scan your network for anomalous events and recommend actions or, better yet, take actions for you to remediate problems. We plan to bring that same proactive scanning and remediation approach to Cloudflare Access and Cloudflare Gateway.
Run your network on our globally scaled network
Over 25 million Internet properties rely on Cloudflare’s network to reach their audiences. More than 10% of all websites connect through our reverse proxy, including 16% of the Fortune 1000. Cloudflare accelerates traffic for huge chunks of the Internet by delivering services from datacenters around the world.
We deliver Cloudflare One from those same data centers. And critically, every datacenter we operate delivers the same set of services, whether that is Cloudflare Access, WARP, Magic Transit, or our WAF. As an example, when your employees connect through Cloudflare WARP to one of our data centers, there is a real chance they never have to leave our network or that data center to reach the site or data they need. As a result, their entire Internet experience becomes extraordinarily fast, no matter where they are in the world.
We expect that performance bonus to become even more meaningful as browsing moves to Cloudflare’s edge with Cloudflare Browser. The isolated browsers running in Cloudflare’s data centers can request content that sits just centimeters away. Even further, as more web properties rely on Cloudflare Workers to power their applications, entire workflows can stay inside of a data center within 100 ms of your employees.
While many of these features are available today, we’re going to be launching several new features over the next several days as part of Cloudflare’s Zero Trust week. Stay tuned for announcements each day this week that add new pieces to the Cloudflare One featureset.
On July 3, Cloudflare’s global DDoS protection system, Gatebot, automatically detected and mitigated a UDP-based DDoS attack that peaked at 654 Gbps. The attack was part of a ten-day multi-vector DDoS campaign targeting a Magic Transit customer and was mitigated without any human intervention. The DDoS campaign is believed to have been generated by Moobot, a Mirai-based botnet. No downtime, service degradation, or false positives were reported by the customer.
Over those ten days, our systems automatically detected and mitigated over 5,000 DDoS attacks against this one customer, mainly UDP floods, SYN floods, ACK floods, and GRE floods. The largest DDoS attack was a UDP flood and lasted a mere 2 minutes. This attack targeted only one IP address but hit multiple ports. The attack originated from 18,705 unique IP addresses, each believed to be a Moobot-infected IoT device.
The attack was observed in Cloudflare’s data centers in 100 countries around the world. Approximately 89% of the attack traffic originated from just 10 countries with the US leading at 41%, followed by South Korea and Japan in second place (12% each), and India in third (10%). What this likely means is that the malware has infected at least 18,705 devices in 100 countries around the world.
Moobot – Self Propagating Malware
‘Moobot’ sounds like a cute name, but there’s nothing cute about it. According to Netlab 360, Moobot is the codename of a self-propagating Mirai-based malware first discovered in 2019. It infects IoT (Internet of Things) devices using remotely exploitable vulnerabilities or weak default passwords. IoT is a term used to describe smart devices such as security hubs and cameras, smart TVs, smart speakers, smart lights, sensors, and even refrigerators that are connected to the Internet.
Once a device is infected by Moobot, control of the device is transferred to the operator of the command and control (C2) server, who can issue commands remotely such as attacking a target and locating additional vulnerable IoT devices to infect (self-propagation).
Self-propagation – The self-propagation module is in charge of the botnet’s growth. After an IoT device is infected, it randomly scans the Internet for open telnet ports and reports back to the C2 server. Once the C2 server gains knowledge of open telnet ports around the world, it tries to leverage known vulnerabilities or brute force its way into the IoT devices with common or default credentials.
Synchronized attacks – The C2 server orchestrates a coordinated flood of packets or HTTP requests with the goal of creating a denial of service event for the target’s website or service.
The botnet operator may use multiple C2 servers in various locations around the world in order to reduce the risk of exposure. Infected devices may be assigned to different C2 servers varying by region and module; one server for self-propagation and another for launching attacks. Thus if a C2 server is compromised and taken down by law enforcement authorities, only parts of the botnet are deactivated.
Why this attack was not successful
This is the second large scale attack in the past few months that we observed on Cloudflare’s network. The previous one peaked at 754M packets per second and attempted to take down our routers with a high packet rate. Despite the high packet rate, the 754Mpps attack peaked at a mere 253 Gbps.
As opposed to the high packet rate attack, this attack was a high bit rate attack, peaking at 654 Gbps. Due to the high bit rates of this attack, it seems as though the attacker tried (and failed) to cause a denial of service event by saturating our Internet link capacity. So let’s explore why this attack was not successful.
Avoiding link saturation & keeping appliances running
Cloudflare’s global network capacity is over 42 Tbps and growing. Our network spans more than 200 cities in over 100 countries, including 17 cities in mainland China. It interconnects with over 8,800 networks globally, including major ISPs, cloud services, and enterprises. This level of interconnectivity along with the use of Anycast ensures that our network can easily absorb even the largest attacks.
After traffic arrives at an edge data center, it is then load-balanced efficiently using our own Layer 4 load-balancer that we built, Unimog, which uses our appliances’ health and other metrics to load-balance traffic intelligently within a data center to avoid overwhelming any single server.
Besides the use of Anycast for inter-data center load balancing and Unimog for intra-data center load balancing, we also utilize various forms of traffic engineering in order to deal with sudden changes in traffic loads across our network. We utilize both automatic and manual traffic engineering methods that can be employed by our 24/7/365 Site Reliability Engineering (SRE) team.
These combined factors significantly reduce the likelihood of a denial of service event due to link saturation or appliances being overwhelmed — and as seen in this attack, no link saturation occurred.
Detecting & Mitigating DDoS attacks
Once traffic arrives at our edge, it encounters our three software-defined DDoS protection systems:
Gatebot – Cloudflare’s centralized DDoS protection systems for detecting and mitigating globally distributed volumetric DDoS attacks. Gatebot runs in our network’s core data center. It receives samples from every one of our edge data centers, analyzes them, and automatically sends mitigation instructions when attacks are detected. Gatebot is also synchronized to each of our customers’ web servers to identify its health and triggers mitigation accordingly.
dosd (denial of service daemon) – Cloudflare’s decentralized DDoS protection systems. dosd runs autonomously in each server in every Cloudflare data center around the world, analyzing traffic and applying local mitigation rules when needed. Besides being able to detect and mitigate attacks at super-fast speeds, dosd significantly improves our network resilience by delegating the detection and mitigation capabilities to the edge.
flowtrackd (flow tracking daemon) – Cloudflare’s TCP state tracking machine for detecting and mitigating the most randomized and sophisticated TCP-based DDoS attacks in unidirectional routing topologies (such as the case for Magic Transit). flowtrackd is able to identify the state of a TCP connection and then drops, challenges, or rate-limits packets that don’t belong to a legitimate connection.
The three DDoS protection systems collect traffic samples in order to detect DDoS attacks. The types of traffic data that they sample include:
Packet fields such as the source IP, source port, destination IP, destination port, protocol, TCP flags, sequence number, options, and packet rate.
HTTP request metadata such as HTTP headers, user agent, query-string, path, host, HTTP method, HTTP version, TLS cipher version, and request rate.
HTTP response metrics such as error codes returned by customers’ origin servers and their rates.
Our systems then crunch these sample data points together to form a real-time view of our network’s security posture and our customer’s origin server health. They look for attack patterns and traffic anomalies. When found, a mitigation rule with a dynamically crafted attack signature is generated in real-time. Rules are propagated to the most optimal place for cost-effective mitigation. For example, an L7 HTTP flood might be dropped at L4 to reduce the CPU consumption.
Rules that are generated by dosd and flowtrackd are propagated within a single data center for rapid mitigation. Gatebot’s rules are propagated to all of the edge data centers which then take priority over dosd’s rules for an even and optimal mitigation. Even if the attack is detected in a subset of edge data centers, Gatebot propagates the mitigation instructions to all of Cloudflare’s edge data centers — effectively sharing the threat intelligence across our network as a form of proactive protection.
In the case of this attack, in each edge data center, dosd generated rules to mitigate the attack promptly. Then as Gatebot received and analyzed samples from the edge, it determined that this was a globally distributed attack. Gatebot propagated unified mitigation instructions to the edge, which prepared each and every one of our 200+ data centers to tackle the attack as the attack traffic may shift to a different data center due to Anycast or traffic engineering.
No inflated bills
DDoS attacks obviously pose the risk of an outage and service disruption. But there is another risk to consider — the cost of mitigation. During these ten days, more than 65 Terabytes of attack traffic were generated by the botnet. However, as part of Cloudflare’s unmetered DDoS protection guarantee, Cloudflare mitigated and absorbed the attack traffic without billing the customer. The customer doesn’t need to submit a retroactive credit request. Attack traffic is automatically excluded from our billing system. We eliminated the financial risk.
Today we’re excited to announce Cloudflare’s Network Interconnection Partner Program, in support of our new CNI product. As ever more enterprises turn to Cloudflare to secure and accelerate their branch and core networks, the ability to connect privately and securely becomes increasingly important. Today’s announcement significantly increases the interconnection options for our customers, allowing them to connect with us in the location of their choice using the method or vendors they prefer.
In addition to our physical locations, our customers can now interconnect with us at any of 23 metro areas across five continents using software-defined layer 2 networking technology. Following the recent release of CNI (which includes PNI support for Magic Transit), customers can now order layer 3 DDoS protection in any of the markets below, without requiring physical cross connects, providing private and secure links, with simpler setup.
We’re very excited to announce that five of the world’s premier interconnect platforms are available at launch.Console Connect by PCCW Global in 14 locations, Megaport in 14 locations, PacketFabric in 15 locations, Equinix ECX Fabric™ in 8 locations and Zayo Tranzact in 3 locations, spanning North America, Europe, Asia, Oceania and Africa.
What is an Interconnection Platform?
Like much of the networking world, there are many terms in the interconnection space for the same thing: Cloud Exchange, Virtual Cross Connect Platform and Interconnection Platform are all synonyms. They are platforms that allow two networks to interconnect privately at layer 2, without requiring additional physical cabling. Instead the customer can order a port and a virtual connection on a dashboard, and the interconnection ‘fabric’ will establish the connection. Since many large customers are already connected to these fabrics for their connections to traditional Cloud providers, it is a very convenient method to establish private connectivity with Cloudflare.
Why interconnect virtually?
Cloudflare has an extensive peering infrastructure and already has private links to thousands of other networks. Virtual private interconnection is particularly attractive to customers with strict security postures and demanding performance requirements, but without the added burden of ordering and managing additional physical cross connects and expanding their physical infrastructure.
Key Benefits of Interconnection Platforms
Secure Similar to physical PNI,traffic does not pass across the Internet. Rather, it flows from the customer router, to the Interconnection Platform’s network and ultimately to Cloudflare. So while there is still some element of shared infrastructure, it’s not over the public Internet.
Efficient Modern PNIs are typically a minimum of 1Gbps, but if you have the security motivation without the sustained 1Gbps data transfer rates, then you will have idle capacity. Virtual connections provide for “sub-rate” speeds, which means less than 1Gbps, such as 100Mbps, meaning you only pay for what you use. Most providers also allow some level of “burstiness”, which is to say you can exceed that 100Mbps limit for short periods.
Performance By avoiding the public Internet, virtual links avoid Internet congestion.
Price The major cloud providers typically have different pricing for egressing data to the Internet compared to an Interconnect Platform. By connecting to your cloud via an Interconnect Partner, you can benefit from those reduced egress fees between your cloud and the Interconnection Platform. This builds on our Bandwidth Alliance to give customers more options to continue to drive down their network costs.
Less Overhead By virtualizing, you reduce physical cable management to just one connection into the Interconnection Platform. From there, everything defined and managed in software. For example, ordering a 100Mbps link to Cloudflare can be a few clicks in a Dashboard, as would be a 100Mbps link into Salesforce.
Data Center Independence Is your infrastructure in the same metro, but in a different facility to Cloudflare? An Interconnection Platform can bring us together without the need for additional physical links.
In any of the 23 metro areas where we are currently connected to an Interconnection Platform (see below)
If you’d like to connect virtually in a location not yet listed below, simply get in touch via our interconnection page and we’ll work out the best way to connect.
The metro areas below have currently active connections. New providers and locations can be turned up on request.
Our customers have been asking for direct on-ramps to our global network for a long time and we’re excited to deliver that today with both physical and virtual connectivity of the world’s leading interconnection Platforms.
Already a Cloudflare customer and connected with one of our Interconnection partners? Then contact your account team today to get connected and benefit from improved reliability, security and privacy of Cloudflare Network Interconnect via our interconnection partners.
Are you an Interconnection Platform with customers demanding direct connectivity to Cloudflare? Head to our partner program page and click “Become a partner”. We’ll continue to add platforms and partners according to customer demand.
"Equinix and Cloudflare share the vision of software-defined, virtualized and API-driven network connections. The availability of Cloudflare on the Equinix Cloud Exchange Fabric demonstrates that shared vision and we’re excited to offer it to our joint customers today."
– Joseph Harding, Equinix, Vice President, Global Product & Platform MarketingSoftware Developer
"Cloudflare and Megaport are driven to offer greater flexibility to our customers. In addition to accessing Cloudflare’s platform on Megaport’s global internet exchange service, customers can now provision on-demand, secure connections through our Software Defined Network directly to Cloudflare Network Interconnect on-ramps globally. With over 700 enabled data centres in 23 countries, Megaport extends the reach of CNI onramps to the locations where enterprises house their critical IT infrastructure. Because Cloudflare is interconnected with our SDN, customers can point, click, and connect in real time. We’re delighted to grow our partnership with Cloudflare and bring CNI to our services ecosystem — allowing customers to build multi-service, securely-connected IT architectures in a matter of minutes."
– Matt Simpson, Megaport, VP of Cloud Services
“The ability to self-provision direct connections to Cloudflare’s network from Console Connect is a powerful tool for enterprises as they come to terms with new demands on their networks. We are really excited to bring together Cloudflare’s industry-leading solutions with PCCW Global’s high-performance network on the Console Connect platform, which will deliver much higher levels of network security and performance to businesses worldwide.”
– Michael Glynn, PCCW Global, VP of Digital Automated Innovation
"Our customers can now connect to Cloudflare via a private, secure, and dedicated connection via the PacketFabric Marketplace. PacketFabric is proud to be the launch partner for Cloudflare’s Interconnection program. Our large U.S. footprint provides the reach and density that Cloudflare customers need."
– Dave Ward, PacketFabric CEO
Today we’re thrilled to announce general availability of Bring Your Own IP (BYOIP) across our Layer 7 products as well as Spectrum and Magic Transit services. When BYOIP is configured, the Cloudflare edge will announce a customer’s own IP prefixes and the prefixes can be used with our Layer 7 services, Spectrum, or Magic Transit. If you’re not familiar with the term, an IP prefix is a range of IP addresses. Routers create a table of reachable prefixes, known as a routing table, to ensure that packets are delivered correctly across the Internet.
As part of this announcement, we are listing BYOIP on the relevant product pages, developer documentation, and UI support for controlling your prefixes. Previous support was API only.
Customers choose BYOIP with Cloudflare for a number of reasons. It may be the case that your IP prefix is already allow-listed in many important places, and updating firewall rules to also allow Cloudflare address space may represent a large administrative hurdle. Additionally, you may have hundreds of thousands, or even millions, of end users pointed directly to your IPs via DNS, and it would be hugely time consuming to get them all to update their records to point to Cloudflare IPs.
Over the last several quarters we have been building tooling and processes to support customers bringing their own IPs at scale. At the time of writing this post we’ve successfully onboarded hundreds of customer IP prefixes. Of these, 84% have been for Magic Transit deployments, 14% for Layer 7 deployments, and 2% for Spectrum deployments.
When you BYOIP with Cloudflare, this means we announce your IP space in over 200 cities around the world and tie your IP prefix to the service (or services!) of your choosing. Your IP space will be protected and accelerated as if they were Cloudflare’s own IPs. We can support regional deployments for BYOIP prefixes as well if you have technical and/or legal requirements limiting where your prefixes can be announced, such as data sovereignty.
You can turn on advertisement of your IPs from the Cloudflare edge with a click of a button and be live across the world in a matter of minutes.
All BYOIP customers receive network analytics on their prefixes. Additionally all IPs in BYOIP prefixes can be considered static IPs. There are also benefits specific to the service you use with your IP prefix on Cloudflare.
Layer 7 + BYOIP:
Cloudflare has a robust Layer 7 product portfolio, including products like Bot Management, Rate Limiting, Web Application Firewall, and Content Delivery, to name just a few. You can choose to BYOIP with our Layer 7 products and receive all of their benefits on your IP addresses.
For Layer 7 services, we can support a variety of IP to domain mapping requests including sharing IPs between domains or putting domains on dedicated IPs, which can help meet requirements for things such as non-SNI support.
If you are also an SSL for SaaS customer, using BYOIP, you have increased flexibility to change IP address responses for custom_hostnames in the event an IP is unserviceable for some reason.
Spectrum + BYOIP:
Spectrum is Cloudflare’s solution to protect and accelerate applications that run any UDP or TCP protocol. The Spectrum API supports BYOIP today. Spectrum customers who use BYOIP can specify, through Spectrum’s API, which IPs they would like associated with a Spectrum application.
Magic Transit + BYOIP:
Magic Transit is a Layer 3 security service which processes all your network traffic by announcing your IP addresses and attracting that traffic to the Cloudflare edge for processing. Magic Transit supports sophisticated packet filtering and firewall configurations. BYOIP is a requirement for using the Magic Transit service. As Magic Transit is an IP level service, Cloudflare must be able to announce your IPs in order to provide this service
Bringing Your IPs to Cloudflare: What is Required?
Before Cloudflare can announce your prefix we require some documentation to get started. The first is something called a ‘Letter of Authorization’ (LOA), which details information about your prefix and how you want Cloudflare to announce it. We then share this document with our Tier 1 transit providers in advance of provisioning your prefix. This step is done to ensure that Tier 1s are aware we have authorization to announce your prefixes.
Secondly, we require that your Internet Routing Registry (IRR) records are up to date and reflect the data in the LOA. This typically means ensuring the entry in your regional registry is updated (i.e. ARIN, RIPE, APNIC).
Once the administrivia is out of the way, work with your account team to learn when your prefixes will be ready to announce.
We also encourage customers to use RPKI and can support this for customer prefixes. We have blogged and built extensive tooling to make adoption of this protocol easier. If you’re interested in BYOIP with RPKI support just let your account team know!
Each customer prefix can be announced via the ‘dynamic advertisement’ toggle in either the UI or API, which will cause the Cloudflare edge to either announce or withdraw a prefix on your behalf. This can be done as soon as your account team lets you know your prefixes are ready to go.
Once the IPs are ready to be announced, you may want to set up ‘delegations’ for your prefixes. Delegations manage how the prefix can be used across multiple Cloudflare accounts and have slightly different implications depending on which service your prefix is bound to. A prefix is owned by a single account, but a delegation can extend some of the prefix functionality to other accounts. This is also captured on our developer docs. Today, delegations can affect Layer 7 and Spectrum BYOIP prefixes.
Layer 7: If you use BYOIP + Layer 7 and also use the SSL for SaaS service, a delegation to another account will allow that account to also use that prefix to validate custom hostnames in addition to the original account which owns the prefix. This means that multiple accounts can use the same IP prefix to serve up custom hostname traffic. Additionally, all of your IPs can serve traffic for custom hostnames, which means you can easily change IP addresses for these hostnames if an IP is blocked for any reason.
Spectrum: If you used BYOIP + Spectrum, via the Spectrum API, you can specify which IP in your prefix you want to create a Spectrum app with. If you create a delegation for prefix to another account, that second account will also be able to specify an IP from that prefix to create an app.
If you are interested in learning more about BYOIP across either Magic Transit, CDN, or Spectrum, please reach out to your account team if you’re an existing customer or contact [email protected] if you’re a new prospect.
Magic Transit is Cloudflare’s L3 DDoS Scrubbing service for protecting network infrastructure. As part of our ongoing investment in Magic Transit and our DDoS protection capabilities, we’re excited to talk about a new piece of software helping to protect Magic Transit customers: flowtrackd. flowrackd is a software-defined DDoS protection system that significantly improves our ability to automatically detect and mitigate even the most complex TCP-based DDoS attacks. If you are a Magic Transit customer, this feature will be enabled by default at no additional cost on July 29, 2020.
TCP-Based DDoS Attacks
In the first quarter of 2020, one out of every two L3/4 DDoS attacks Cloudflare mitigated was an ACK Flood, and over 66% of all L3/4 attacks were TCP based. Most types of DDoS attacks can be mitigated by finding unique characteristics that are present in all attack packets and using that to distinguish ‘good’ packets from the ‘bad’ ones. This is called “stateless” mitigation, because any packet that has these unique characteristics can simply be dropped without remembering any information (or “state”) about the other packets that came before it. However, when attack packets have no unique characteristics, then “stateful” mitigation is required, because whether a certain packet is good or bad depends on the other packets that have come before it.
The most sophisticated types of TCP flood require stateful mitigation, where every TCP connection must be tracked in order to know whether any particular TCP packet is part of an active connection. That kind of mitigation is called “flow tracking”, and it is typically implemented in Linux by the iptables conntrack module. However, DDoS protection with conntrack is not as simple as flipping the iptable switch, especially at the scale and complexity that Cloudflare operates in. If you’re interested to learn more, in this blog we talk about the technical challenges of implementing iptables conntrack.
Complex TCP DDoS attacks pose a threat as they can be harder to detect and mitigate. They therefore have the potential to cause service degradation, outages and increased false positives with inaccurate mitigation rules. So how does Cloudflare block patternless DDoS attacks without affecting legitimate traffic?
Bidirectional TCP Flow Tracking
Using Cloudflare’s traditional products, HTTP applications can be protected by the WAF service, and TCP/UDP applications can be protected by Spectrum. These services are “reverse proxies“, meaning that traffic passes through Cloudflare in both directions. In this bidirectional topology, we see the entire TCP flow (i.e., segments sent by both the client and the server) and can therefore track the state of the underlying TCP connection. This way, we know if a TCP packet belongs to an existing flow or if it is an “out of state” TCP packet. Out of state TCP packets look just like regular TCP packets, but they don’t belong to any real connection between a client and a server. These packets are most likely part of an attack and are therefore dropped.
While not trivial, tracking TCP flows can be done when we serve as a proxy between the client and server, allowing us to absorb and mitigate out of state TCP floods. However it becomes much more challenging when we only see half of the connection: the ingress flow. This visibility into ingress but not egress flows is the default deployment method for Cloudflare’s Magic Transit service, so we had our work cut out for us in identifying out of state packets.
The Challenge With Unidirectional TCP Flows
With Magic Transit, Cloudflare receives inbound internet traffic on behalf of the customer, scrubs DDoS attacks, and routes the clean traffic to the customer’s data center over a tunnel. The data center then responds directly to the eyeball client using a technique known as Direct Server Return (DSR).
Using DSR, when a TCP handshake is initiated by an eyeball client, it sends a SYN packet that gets routed via Cloudflare to the origin data center. The origin then responds with a SYN-ACK directly to the client, bypassing Cloudflare. Finally, the client responds with an ACK that once again routes to the origin via Cloudflare and the connection is then considered established.
In a unidirectional flow we don’t see the SYN+ACK sent from the origin to the eyeball client, and therefore can’t utilize our existing flow tracking capabilities to identify out of state packets.
Unidirectional TCP Flow Tracking
To overcome the challenges of unidirectional flows, we recently completed the development and rollout of a new system, codenamed flowtrackd (“flow tracking daemon”). flowtrackd is a state machine that hooks into the network interface. It tracks unidirectional TCP flows using only the ingress traffic that routes through Cloudflare to determine the state of the TCP connection. flowtrackd is then able to determine if a packet is part of a new connection, an open one, a connection that is closing, one that is closed, or if it’s an out of state packet. Once a high volume of out-of-state packets is detected, flowtrackd will either challenge (force RST) or drop the packets.
The state machine that determines the state of the flows was developed in-house and complements Gatebot and dosd. Together Gatebot, dosd, and flowtrackd provide a comprehensive multi layer DDoS protection.
Releasing flowtrackd to the Wild
And it works! Less than a day after releasing flowtrackd to an early access customer, flowtrackd automatically detected and mitigated an ACK flood that peaked at 6 million packets per second. No downtime, service disruption, or false positives were reported.
Cloudflare’s DDoS Protection – Delivered From Every Data Center
As opposed to legacy scrubbing center providers with limited network infrastructures, Cloudflare provides DDoS Protection from every one of our data centers in over 200 locations around the world. We write our own software-defined DDoS protection systems. Notice I say systems, because as opposed to vendors that use a dedicated third party appliance, we’re able to write and spin up whatever software we need, deploy it in the optimal location in our tech stack and are therefore not dependent on other vendors or be limited to the capabilities of one appliance.
flowtrackd joins the Cloudflare DDoS protection family which includes our veteran Gatebot and the younger and energetic dosd. flowtrackd will be available from every one of our data centers, with a total mitigation capacity of over 37 Tbps, protecting our Magic Transit customers against the most complex TCP DDoS attacks.
New to Magic Transit? Replace your legacy provider with Magic Transit and pay nothing until your current contract expires. Offer expires September 1, 2020. Click here for details.
On Cloudflare’s 8th birthday in 2017, we announced free unmetered DDoS Protection as part of all of our plans, regardless if you’re an independent blogger using WordPress on Cloudflare’s Free plan or part of a large enterprise operating global network infrastructures. Our DDoS protection covers attack vectors on Layers 3-7; whether highly distributed and volumetric (rate-intensive) or small and sneaky. We protect over 26 million Internet properties, and at this scale, identifying small and sneaky DDoS attacks can be challenging, especially at L7. In this post, we discuss this challenge along with trends that we’ve seen, interesting DDoS attacks, and how we’ve responded to them so that you don’t have to worry.
Let’s Talk Trends
When analyzing attacks on the Cloudflare network, we’ve seen a steady decline in the proportion of L3/L4 DDoS attacks that exceed a rate of 30 Gbps in recent months. From September 2019 to March 2020, attacks peaking over 30 Gbps decreased by 82%, and in March 2020, more than 95% of all network-layer DDoS attacks peaked below 30 Gbps. Over the same time period, the average size of a DDoS attack has also steadily decreased by 53%, to just 11.88 Gbps. Yet, very large attacks have not disappeared: we’re still seeing attacks with intensive rates peaking at 330 Gbps on average and up to 400 millions packets per second. Some of our customers are being targeted with as many as 890 DDoS attacks in a single day and 1,750 DDoS attacks in a month.
As the average rate of these L3/L4 attacks has decreased, they have become more localized and less geographically distributed. Increasingly, we’re seeing attacks hit just one or two of our data centers, which means that these hyper-localized attacks were launched in the catchment of the data center, otherwise our Anycast network would have spread the attack surface across our global fleet of data centers. Counterintuitively, these hyper-localized floods can be more difficult to detect on a global scale as the attack samples get diluted when aggregated from all of our data centers in the core. Therefore we’ve had to change our tactics and systems to roll with the change in attacker behavior.
Keeping things interesting in the penthouse floor of the OSI Model, over the same time period we’ve also observed some of the most rate-intensive and highly distributed L7 HTTP DDoS attacks we’ve ever seen. These attacks have pushed our engineering teams to invent even more efficient and intelligent ways to defend our network and our customers at scale. Let’s take a look at some of these trends and attacks.
Before we released dosd late last year, the primary automated system responsible for protecting Cloudflare and our customers against distributed rate-intensive attacks was Gatebot. Gatebot works by ingesting samples of flow data from routers and samples of HTTP requests from servers. It then analyzes these samples for anomalies, and when attacks are detected, pushes mitigation instructions automatically to the edge.
Gatebot requires a lot of computational power to analyze these samples, and correlate them across all the data centers, so it runs centrally in our “core” data centers, rather than at the edge. It does a terrific job at mitigating large attacks, and on average stops over 4,000 L3/L4 DDoS attacks every month.
Edge Analyzed, Edge Enforced Mitigations
The persistent increase we’ve observed in smaller, more localized attacks was one of the main factors that drove us to develop a new, complementary system to Gatebot. We call this new system our denial of service daemon, or “dosd”, and this past month alone it mitigated 281,746 L3/4 DDoS attacks. This figure is roughly 6 times greater than what Gatebot dropped over the same period, thanks to dosd’s ability to detect smaller network attacks that would previously have flown under the radar (or taken longer to mitigate).
To complement the computationally heavy, centralized deployments of Gatebot, dosd was architected as a decentralized system that runs on every single server in every one of our data centers. Each instance detects and mitigates attacks independent of the other instances, or any sort of centralized data center whatsoever. As a result, the system is much faster than Gatebot, and can detect and mitigate attacks within 0-3 seconds (and less than 10 seconds on average). The speed of dosd enables it to generate real-time rules to quickly protect our customers at the data center. Then Gatebot, which samples traffic globally, can determine a mitigation that applies to all data centers if needed. In such a case, Gatebot will push rules to the data centers which will take priority over dosd’s rules.
dosd is also a leaner piece of software, consumes less memory and CPU, and significantly improves the resiliency of our network by removing the need to communicate with our core data centers to mitigate attacks. dosd detects and mitigates attacks using a similar logic to Gatebot’s methods, but in the scope of a single server, across a subset of servers in the same data center, or even across the entire data center.
Changing L7 Trends
Example #1 – Highly Distributed DDoS Attack Targeting A Customer Website
In July 2019, Cloudflare mitigated an HTTP DDoS attack that peaked at 1.4M requests per second. While this isn’t the most rate-intensive attack that we’ve seen, what is interesting is that the attack originated from almost 1.1M unique IP addresses. These were actual clients with the ability to complete a TCP and HTTPS handshake, they were not spoofed IP addresses. As it turns out, responding (rather than dropping at the network level) to over a million clients at a max rate of 1.4M requests per second can be quite costly.
Example #2 – Rate-Intensive DDoS Attack Targeting A Customer Website
The second attack took place in September 2019. We mitigated an HTTP DDoS attack that peaked and persisted just below 5M requests per second for a little over an hour. What’s interesting is the sustained capability of the attacker to reach those rates from only 371K unique IPs (also not spoofed).
These attacks highlighted to us what needed to be optimized and consequently drove us to improve our L7 mitigations even more so, and significantly reduced the cost of mitigating an attack.
Using IP Jails to Reduce the Cost of Mitigation
With the goal of reducing the computational cost to Cloudflare of mitigating rate-intensive attacks, we recently rolled out a new Gatebot capability called IP Jails. IP Jails excelsat efficiently mitigating extremely rate-intensive and distributed HTTP DDoS attacks. It is triggered when an attack exceeds a certain request rate and then pushes the mitigation from the application layer (L7 in the OSI model) to the transport layer (L4). Therefore instead of responding with an error or challenge page from the proxy, we simply drop the connection for that IP. Mitigating at L4 is more computationally efficient, it reduces our CPU and memory consumption in addition to saving bandwidth. It allows us to keep mitigating the largest of attacks without sacrificing performance.
IP Jails in action
In the first graph below, you can see an HTTP flood peaking just below 8M rps before the IPs are ‘jailed’ for misbehaving. In the second graph, you can see that same attack being dropped as packets at L4.
The flood requests generated over 130 Gbps in responses. IP Jails slashed it by a factor of 10.
Similarly, you can see a spike in the attack mitigation CPU usage which then drops back to normal after IP Jails kicks in.
Using Origin Errors to Catch Low-Rate Attacks
We see one or two of these rate-intensive attacks every month. But the vast majority of attacks we observe are mostly of a lower request rate, trying to sneak under the radar. To tackle these low-rate attacks better, last month we completed the rollout of a new capability that synchronizes Gatebot’s detection sensitivity with our customers’ origin server health. Gatebot uses the origin’s error response codes as an additional adaptive feedback signal.
However, when we take a step back and think about what a DDoS attack is actually, we usually think of a malicious actor that targets traffic at a specific website or IP address with the intent to degrade performance or cause an outage. However, malicious attackers are not the only threats to your applications availability.
As the migration of functionality to the edge increases, the cloud becomes smarter and more powerful, which often allows administrators to scale down their origin servers and infrastructure leaving the origin server weaker and under-configured. Evidently, there are many cases where an origin was taken down by small floods of traffic that were neither malicious nor generated with bad intentions. These floods may be generated by an overly excited good bot or even faulty client applications calling home too frequently. Fixing a home-sick client application or strengthening a server can be lengthy and costly processes during which the origin remains susceptible. Consequently, if a website is taken offline, no matter the reason, the end-users still experience it as if it were an attack.
Therefore this new capability not only protects our customers against DDoS attacks, but also protects the origin against all kinds of unwanted floods. It is designed to protect every one of our customers; big or small. It’s available on all of our plans including the Free plan.
When an origin responds to Cloudflare with an increasing rate of errors from the 500 range (Internal Server Error), Gatebot initiates automatically and analyzes traffic to reduce or eliminate the impact on the origin even faster than before. The current error rate is also compared to the average error rate to minimize false positives. Once an attack is detected, dynamically generated, ephemeral mitigation rules are propagated to Cloudflare’s edge data centers to mitigate the flood. Mitigation rules may use a block action (403), rate-limit (429), or even a challenge based on the fingerprint logic and confidence.
In March 2020, we mitigated 812 HTTP DDoS attacks on average every day, and approximately 20,000 HTTP DDoS attacks in total.
Don’t Take Our Word For It, See For Yourself
Whether it’s Gatebot or dosd that mitigated L3/4 DDoS attacks, you can see both types of attack events for yourself in our new Network Analytics dashboard.
Today this dashboard provides Magic Transit & BYOIP customers real-time visibility into L3/4 traffic and DDoS attacks, and in the future we plan to expand access to customers of our other products.
Whether you’re part of a large global enterprise, or use Cloudflare for your personal site on the Free plan, we want to make sure that you’re protected and also have the visibility that you need.
DDoS Protection is included as part of every Cloudflare service; from Magic Transit at L3, through Spectrum at L4, to the WAF/CDN service at L7. Our mission is to help build a better Internet – and this means a safer, faster, and more reliable Internet. For everyone.
If you’re a Cloudflare customer of any plan (Free, Pro, Business or Enterprise), these new protections are now enabled by default at no additional charge.
Back in March 2019, we released Firewall Analytics which provides insights into HTTP security events across all of Cloudflare’s protection suite; Firewall rule matches, HTTP DDoS Attacks, Site Security Level which harnesses Cloudflare’s threat intelligence, and more. It helps customers tailor their security configurations more effectively. The initial release was for Enterprise customers, however we believe that everyone should have access to powerful tools, not just large enterprises, and so in December 2019 we extended those same enterprise-level analytics to our Business and Pro customers.
Since then, we’ve built on top of our analytics platform; improved the usability, added more functionality and extended it to additional Cloudflare services in the form of Account Analytics, DNS Analytics, Load Balancing Analytics, Monitoring Analytics and more.
Our entire analytics platform harnesses the powerful GraphQL framework which is also available to customers that want to build, export and share their own custom reports and dashboards.
Extending Visibility From L7 To L3
Until recently, all of our dashboards were mostly HTTP-oriented and provided visibility into HTTP attributes such as the user agent, hosts, cached resources, etc. This is valuable to customers that use Cloudflare to protect and accelerate HTTP applications, mobile apps, or similar. We’re able to provide them visibility into the application layer (Layer 7 in the OSI model) because we proxy their traffic at L7.
However with Magic Transit, we don’t proxy traffic at L7 but rather route it at L3 (network layer). Using BGP Anycast, customer traffic is routed to the closest point of presence of Cloudflare’s network edge where it is filtered by customer-defined network firewall rules and automatic DDoS mitigation systems. Clean traffic is then routed via dynamic GRE Anycast tunnels to the customer’s data-centers. Routing at L3 means that we have limited visibility into the higher layers. So in order to provide Magic Transit customers visibility into traffic and attacks, we needed to extend our analytics platform to the packet-layer.
On January 16, 2020, we released the Network Analytics dashboard for Magic Transit customers and Bring Your Own IP (BYOIP) customers. This packet and bit oriented dashboard provides near real-time visibility into network- and transport-layer traffic patterns and DDoS attacks that are blocked at the Cloudflare edge in over 200 cities around the world.
Analytics For A Year
The way we’ve architected the analytics data-stores enables us to provide our customers one year’s worth of insights. Traffic is sampled at the edge data-centers. From those samples we structure IP flow logs which are similar to SFlow and bucket one minute’s worth of traffic, grouped by destination IP, port and protocol. IP flows includes multiple packet attributes such as TCP flags, source IPs and ports, Cloudflare data-center, etc. The source IP is considered PII data and is therefore only stored for 30 days, after which the source IPs are discarded and logs are rolled up into one hour groups, and then one day groups. The one hour roll-ups are stored for 6 months and the one day roll-ups for 1 year.
Similarly, attack logs are also stored efficiently. Attacks are stored as summaries with start/end timestamps, min/max/average/total bits and packets per second, attack type, action taken and more. A DDoS attack could easily consist of billions of packets which could impact performance due to the number of read/write calls to the data-store. By storing attacks as summary logs, we’re able to overcome these challenges and therefore provide attack logs for up to 1 year back.
Network Analytics via GraphQL API
We built this dashboard on the same analytics platform, meaning that our packet-level analytics are also available by GraphQL. As an example, below is an attack report query that would show the top attacker IPs, the data-center cities and countries where the attack was observed, the IP version distribution, the ASNs that were used by the attackers and the ports. The query is done at the account level, meaning it would provide a report for all of your IP ranges. In order to narrow the report down to a specific destination IP or port range, you can simply add additional filters. The same filters also exist in the UI.
After running the query using Altair GraphQL Client, the response is returned in a JSON format:
What Do Customers Want?
As part of our product definition and design research stages, we interviewed internal customer-facing teams including Customer Support, Solution Engineering and more. I consider these stakeholders as super-user-aggregators because they’re customer-facing teams and are constantly engaging and helping our users. After the internal research phase, we expanded externally to customers and prospects; particularly network and security engineers and leaders. We wanted to know how they expect the dashboard to fit in their work-flow, what are their use cases and how we can tailor the dashboard to their needs. Long story short, we identified two main use cases: Incident Response and Reporting. Let’s go into each of these use cases in more detail.
I started off by asking them a simple question – “what do you do when you’re paged?” We wanted to better understand their incident response process; specifically, how they’d expect to use this dashboard when responding to an incident and what are the metrics that matter to them to help them make quick calculated decisions.
You’ve Just Been Paged
Let’s say that you’re a security operations engineer. It’s Black Friday. You’re on call. You’ve just been paged. Traffic levels to one of your data-centers has exceeded a safe threshold. Boom. What do you do? Responding quickly and resolving the issue as soon as possible is key.
If your workflows are similar to our customers’, then your objective is to resolve the page as soon as possible. However, before you can resolve it, you need to determine if there is any action that you need to take. For instance, is this a legitimate rise in traffic from excited Black Friday shoppers, perhaps a new game release or maybe an attack that hasn’t been mitigated? Do you need to shift traffic to another data-center or are levels still stable? Our customers tell us that these are the metrics that matter the most:
Top Destination IP and port – helps understand what services are being impacted
Top source IPs, port, ASN, data-center and data-center Country – helps identify the source of the traffic
Real-time packet and bit rates – helps understand the traffic levels
Protocol distribution – helps understand what type of traffic is abnormal
TCP flag distribution – an abnormal distribution could indicate an attack
Attack Log – shows what types of traffic is being dropped/rate-limited
As network and transport layer attacks can be highly distributed and the packet attributes can be easily spoofed, it’s usually not practical to block IPs. Instead, the dashboard enables you to quickly identify patterns such as an increased frequency of a specific TCP flag or increased traffic from a specific country. Identifying these patterns brings you one step closer to resolving the issue. After you’ve identified the patterns, packet-level filtering can be applied to drop or rate-limit the malicious traffic. If the attack was automatically mitigated by Cloudflare’s systems, you’ll be able to see it immediately along with the attack attributes in the activity log. By filtering by the Attack ID, the entire dashboard becomes your attack report.
During our interviews with security and network engineers, we also asked them what metrics and insights they need when creating reports for their managers, C-levels, colleagues and providing evidence to law-enforcement agencies. After all, processing data and creating reports can consume over a third (36%) of a security team’s time (~3 hours a day) and is also one of the most frequent DDoS asks by our customers.
On top of all of these customizable insights, we wanted to also provide a one-line summary that would reflect your recent activity. The one-liner is dynamic and changes based on your activity. It tells you whether you’re currently under attack, and how many attacks were blocked. If your CISO is asking for a security update, you can simply copy-paste it and convey the efficiency of the service:
Our customers say that they want to reflect the value of Cloudflare to their managers and peers:
How much potential downtime and bandwidth did Cloudflare spare me?
What are my top attacked IPs and ports?
Where are the attacks coming from? What types and what are the trends?
The Secret To Creating Good Reports
What does everyone love? Cool Maps! The key to a good report is adding a map; showing where the attack came from. But given that packet attributes can be easily spoofed, including the source IP, it won’t do us any good to plot a map based on the locations of the source IPs. It would result in a spoofed source country and is therefore useless data. Instead, we decided to show the geographic distribution of packets and bits based on the Cloudflare data-center in which they were ingested. As opposed to legacy scrubbing center solutions with limited network infrastructures, Cloudflare has data-centers in more than 200 cities around the world. This enables us to provide precise geographic distribution with high confidence, making your reports accurate.
Tailored For You
One of the main challenges both our customers and we struggle with is how to process and generate actionable insights from all of the data points. This is especially critical when responding to an incident. Under this assumption, we built this dashboard with the purpose of speeding up your reporting and investigation processes. By tailoring it to your needs, we hope to make you more efficient and make the most out of Cloudflare’s services. Got any feedback or questions? Post them below in the comments section.
If you’re an existing Magic Transit or BYOIP customer, then the dashboard is already available to you. Not a customer yet? Click here to learn more.
It was a scorching Monday on July 22 as temperatures soared above 37°C (99°F) in Austin, TX, the live music capital of the world. Only hours earlier, the last crowds dispersed from the historic East 6th Street entertainment district. A few blocks away, Cloudflarians were starting to make their way to the office. Little did those early arrivers know that they would soon be unknowingly participating in a Cloudflare time honored tradition of dogfooding new services before releasing them to the wild.
6th East Street, Austin Texas
Dogfooding is when an organization uses its own products. In this case, we dogfed our newest cloud service, Magic Transit, which both protects and accelerates our customers’ entire network infrastructure—not just their web properties or TCP/UDP applications. With Magic Transit, Cloudflare announces your IP prefixes via BGP, attracts (routes) your traffic to our global network edge, blocks bad packets, and delivers good packets to your data centers via Anycast GRE.
We decided to use Austin’s network because we wanted to test the new service on a live network with real traffic from real people and apps. With the target identified, we began onboarding the Austin office in an always-on routing topology.
In an always-on routing mode, Cloudflare data centers constantly advertise Austin’s prefix, resulting in faster, almost immediate mitigation. As opposed to traditional on-demand scrubbing center solutions with limited networks, Cloudflare operates within 100 milliseconds of 99% of the Internet-connected population in the developed world. For our customers, this means that always-on DDoS mitigation doesn’t sacrifice performance due to suboptimal routing. On the contrary, Magic Transit can actually improve your performance due to our network’s reach.
Cloudflare’s Global Network
Now that we’ve completed onboarding Austin to Magic Transit, all we needed was a motivated attacker to launch a DDoS attack. Luckily, we found more than a few willing volunteers on our Site Reliability Engineering (SRE) team to execute the attack. While the teams were still assembling in multiple locations around the world, our SRE volunteer started firing packets at our target from an undisclosed location.
Without Magic Transit, the Austin office would’ve been hit directly with the packet flood. Two things could have happened in this case (not mutually exclusive):
Austin’s on-premise equipment (routers, firewalls, servers, etc.) would have been overwhelmed and failed
Austin’s service providers would have dropped packets that exceeded its bandwidth allowance
Both cases would result in a very bad day for everyone.
Cloudflare DDoS Mitigation
Instead, when our SRE attacker launched the flood the packets were automatically routed via BGP to Cloudflare’s network. The packets reached the closest data center via Anycast and encountered multiple defenses in the form of XDP, eBPF and iptables. Those defenses are populated with pre-configured static firewall rules as well as dynamic rules generated by our DDoS mitigation systems.
Static rules can vary from straightforward IP blocking and rate-limiting to more sophisticated expressions that match against specific packet attributes. Dynamic rules, on the other hand, are generated automatically in real-time. To play fair with our attacker, we didn’t pre-configure any special rules against the attack. We wanted to give our attacker a fair opportunity to take Austin down. Although due to our multi-layered protection approach, the odds were never actually in their favor.
Generating Dynamic Rules
As part of our multi-layered protection approach, Dynamic Rules are generated on-the-fly by analyzing the packets that route through our network. While the packets are being routed, flow data is asynchronously sampled, collected, and analyzed by two main detection systems. The first is called Gatebot and runs across the entire Cloudflare network; the second is our newly deployed DoSD (denial of service daemon) which operates locally within each data center. DoSD is an exciting improvement that we’ve just recently rolled out and we look forward to writing more about its technical details here soon. DoSD samples at a much faster rate (1/100 packets) versus Gatebot which samples at a lower rate (~1/8000 packets), allowing it to detect even more attacks and block them faster.
The asynchronous attack detection lifecycle is represented as the dotted lines in the diagram below. Attacks are detected out of path to assure that we don’t add any latency, and mitigation rules are pushed in line and removed as needed.
Multiple packet attributes and correlations are taken into consideration during analysis and detection. Gatebot and DoSD search for both new network anomalies and already known attacks. Once an attack is detected, rules are automatically generated, propagated, and applied in the optimal location within 10 seconds or less. Just to give you an idea of the scale, we’re talking about hundreds of thousands of dynamic rules that are applied and removed every second across the entire Cloudflare network.
One of the beauties of Gatebot and DoSD is that they don’t require a traffic learning period. Once a customer is onboarded, they’re protected immediately. They don’t need to sample traffic for weeks before kicking in. While we can always apply specific firewall rules if requested by the customer, no manual configuration is required by the customer or our teams. It just works.
What this mitigation process looks like in practice
Let’s look at what happened in Austin when one of our SREs tried to DDoS Austin and failed. During one of the first attempts, before DoSD had rolled out globally, a degradation in audio and video quality was noticed for Austin employees on video calls for a few seconds before Gatebot kicked in. However, as soon as Gatebot kicked in, the quality was immediately restored. If we hadn’t had Magic Transit in-line, the degradation of service would’ve worsened until the point of full denial of service. Austin would have been offline and our Austin colleagues wouldn’t have had a very productive day.
On a subsequent attack attempt which took place after DoSD was deployed, our SRE launched a SYN flood on Austin. The attack targeted multiple IP addresses in Austin’s prefix and peaked just above 250,000 packets per second. DoSD detected the attack and blocked it in approximately 3 seconds. DoSD’s quick response resulted in no degradation of service for the Austin team.
What We Learned
Dogfooding Magic Transit served as a valuable experiment for us with lots of lessons learned both from the engineering and procedural aspects. From the engineering aspect, we fine-tuned our mitigations and optimized routings. From the procedural aspects, we drilled members of multiple teams including the Security Operations Center and Solution Engineering teams to help refine our run-books. By doing so, we reduced the onboarding duration to hours instead of days in order to assure a quick and smooth onboarding experience for our customers.
Want To Learn More?
Request a demo and learn how you can protect and accelerate your network with Cloudflare.
Today we’re excited to announce Cloudflare Magic Transit. Magic Transit provides secure, performant, and reliable IP connectivity to the Internet. Out-of-the-box, Magic Transit deployed in front of your on-premise network protects it from DDoS attack and enables provisioning of a full suite of virtual network functions, including advanced packet filtering, load balancing, and traffic management tools.
Magic Transit is built on the standards and networking primitives you are familiar with, but delivered from Cloudflare’s global edge network as a service. Traffic is ingested by the Cloudflare Network with anycast and BGP, announcing your company’s IP address space and extending your network presence globally. Today, our anycast edge network spans 193 cities in more than 90 countries around the world.
Once packets hit our network, traffic is inspected for attacks, filtered, steered, accelerated, and sent onward to the origin. Magic Transit will connect back to your origin infrastructure over Generic Routing Encapsulation (GRE) tunnels, private network interconnects (PNI), or other forms of peering.
Enterprises are often forced to pick between performance and security when deploying IP network services. Magic Transit is designed from the ground up to minimize these trade-offs: performance and security are better together. Magic Transit deploys IP security services across our entire global network. This means no more diverting traffic to small numbers of distant “scrubbing centers” or relying on on-premise hardware to mitigate attacks on your infrastructure.
We’ve been laying the groundwork for Magic Transit for as long as Cloudflare has been in existence, since 2010. Scaling and securing the IP network Cloudflare is built on has required tooling that would have been impossible or exorbitantly expensive to buy. So we built the tools ourselves! We grew up in the age of software-defined networking and network function virtualization, and the principles behind these modern concepts run through everything we do.
When we talk to our customers managing on-premise networks, we consistently hear a few things: building and managing their networks is expensive and painful, and those on-premise networks aren’t going away anytime soon.
Traditionally, CIOs trying to connect their IP networks to the Internet do this in two steps:
Source connectivity to the Internet from transit providers (ISPs).
Purchase, operate, and maintain network function specific hardware appliances. Think hardware load balancers, firewalls, DDoS mitigation equipment, WAN optimization, and more.
Each of these boxes costs time and money to maintain, not to mention the skilled, expensive people required to properly run them. Each additional link in the chain makes a network harder to manage.
This all sounded familiar to us. We had an aha! moment: we had the same issues managing our datacenter networks that power all of our products, and we had spent significant time and effort building solutions to those problems. Now, nine years later, we had a robust set of tools we could turn into products for our own customers.
Magic Transit aims to bring the traditional datacenter hardware model into the cloud, packaging transit with all the network “hardware” you might need to keep your network fast, reliable, and secure. Once deployed, Magic Transit allows seamless provisioning of virtualized network functions, including routing, DDoS mitigation, firewalling, load balancing, and traffic acceleration services.
Magic Transit is your network’s on-ramp to the Internet
Magic Transit delivers its connectivity, security, and performance benefits by serving as the “front door” to your IP network. This means it accepts IP packets destined for your network, processes them, and then outputs them to your origin infrastructure.
Connecting to the Internet via Cloudflare offers numerous benefits. Starting with the most basic, Cloudflare is one of the most extensively connected networks on the Internet. We work with carriers, Internet exchanges, and peering partners around the world to ensure that a bit placed on our network will reach its destination quickly and reliably, no matter the destination.
An example deployment: Acme Corp
Let’s walk through how a customer might deploy Magic Transit. Customer Acme Corp. owns the IP prefix 203.0.113.0/24, which they use to address a rack of hardware they run in their own physical datacenter. Acme currently announces routes to the Internet from their customer-premise equipment (CPE, aka a router at the perimeter of their datacenter), telling the world 203.0.113.0/24 is reachable from their autonomous system number, AS64512. Acme has DDoS mitigation and firewall hardware appliances on-premise.
Acme wants to connect to the Cloudflare Network to improve the security and performance of their own network. Specifically, they’ve been the target of distributed denial of service attacks, and want to sleep soundly at night without relying on on-premise hardware. This is where Cloudflare comes in.
Deploying Magic Transit in front of their network is simple:
Cloudflare uses Border Gateway Protocol (BGP) to announce Acme’s 203.0.113.0/24 prefix from Cloudflare’s edge, with Acme’s permission.
Cloudflare begins ingesting packets destined for the Acme IP prefix.
Magic Transit applies DDoS mitigation and firewall rules to the network traffic. After it is ingested by the Cloudflare network, traffic that would benefit from HTTPS caching and WAF inspection can be “upgraded” to our Layer 7 HTTPS pipeline without incurring additional network hops.
Acme would like Cloudflare to use Generic Routing Encapsulation (GRE) to tunnel traffic back from the Cloudflare Network back to Acme’s datacenter. GRE tunnels are initiated from anycast endpoints back to Acme’s premise. Through the magic of anycast, the tunnels are constantly and simultaneously connected to hundreds of network locations, ensuring the tunnels are highly available and resilient to network failures that would bring down traditionally formed GRE tunnels.
Cloudflare egresses packets bound for Acme over these GRE tunnels.
Let’s dive deeper on how the DDoS mitigation included in Magic Transit works.
Magic Transit protects networks from DDoS attack
Customers deploying Cloudflare Magic Transit instantly get access to the same IP-layer DDoS protection system that has protected the Cloudflare Network for the past 9 years. This is the same mitigation system that stopped a 942Gbps attack dead in its tracks, in seconds. This is the same mitigation system that knew how to stop memcached amplification attacks days before a 1.3Tbps attack took down Github, which did not have Cloudflare watching its back. This is the same mitigation we trust every day to protect Cloudflare, and now it protects your network.
Cloudflare has historically protected Layer 7 HTTP and HTTPS applications from attacks at all layers of the OSI Layer model. The DDoS protection our customers have come to know and love relies on a blend of techniques, but can be broken into a few complementary defenses:
Anycast and a network presence in 193 cities around the world allows our network to get close to users and attackers, allowing us to soak up traffic close to the source without introducing significant latency.
30+Tbps of network capacity allows us to soak up a lot of traffic close to the source. Cloudflare’s network has more capacity to stop DDoS attacks than that of Akamai Prolexic, Imperva, Neustar, and Radware — combined.
Our HTTPS reverse proxy absorbs L3 (IP layer) and L4 (TCP layer) attacks by terminating connections and re-establishing them to the origin. This stops most spurious packet transmissions from ever getting close to a customer origin server.
Layer 7 mitigations and rate limiting stop floods at the HTTPS application layer.
Looking at the above description carefully, you might notice something: our reverse proxy servers protect our customers by terminating connections, but our network and servers still get slammed by the L3 and 4 attacks we stop on behalf of our customers. How do we protect our own infrastructure from these attacks?
Gatebot is a suite of software running on every one of our servers inside each of our datacenters in the 193 cities we operate, constantly analyzing and blocking attack traffic. Part of Gatebot’s beauty is its simple architecture; it sits silently, in wait, sampling packets as they pass from the network card into the kernel and onward into userspace. Gatebot does not have a learning or warm-up period. As soon as it detects an attack, it instructs the kernel of the machine it is running on to drop the packet, log its decision, and move on.
Historically, if you wanted to protect your network from a DDoS attack, you might have purchased a specialized piece of hardware to sit at the perimeter of your network. This hardware box (let’s call it “The DDoS Protection Box”) would have been fantastically expensive, pretty to look at (as pretty as a 2U hardware box could get), and required a ton of recurring effort and money to stay on its feet, keep its licence up to date, and keep its attack detection system accurate and trained.
For one thing, it would have to be carefully monitored to make sure it was stopping attacks but not stopping legitimate traffic. For another, if an attacker managed to generate enough traffic to saturate your datacenter’s transit links to the Internet, you were out of luck; no box sitting inside your datacenter can protect you from an attack generating enough traffic to congest the links running from the outside world to the datacenter itself.
Early on, Cloudflare considered buying The DDoS Protection Box(es) to protect our various network locations, but ruled them out quickly. Buying hardware would have incurred substantial cost and complexity. In addition, buying, racking, and managing specialized pieces of hardware makes a network hard to scale. There had to be a better way. We set out to solve this problem ourselves, starting from first principles and modern technology.
To make our modern approach to DDoS mitigation work, we had to invent a suite of tools and techniques to allow us to do ultra-high performance networking on a generic x86 server running Linux.
At the core of our network data plane is the eXpress Data Path (XDP) and the extended Berkeley Packet Filter (eBPF), a set of APIs that allow us to build ultra-high performance networking applications in the Linux kernel. My colleagues have written extensively about how we use XDP and eBPF to stop DDoS attacks:
At the end of the day, we ended up with a DDoS mitigation system that:
Is delivered by our entire network, spread across 193 cities around the world. To put this another way, our network doesn’t have the concept of “scrubbing centers” — every single one of our network locations is always mitigating attacks, all the time. This means faster attack mitigation and minimal latency impact for your users.
Has exceptionally fast times to mitigate, with most attacks mitigated in 10s or less.
Was built in-house, giving us deep visibility into its behavior and the ability to rapidly develop new mitigations as we see new attack types.
Is deployed as a service, and is horizontally scalable. Adding x86 hardware running our DDoS mitigation software stack to a datacenter (or adding another network location) instantly brings more DDoS mitigation capacity online.
Gatebot is designed to protect Cloudflare infrastructure from attack. And today, as part of Magic Transit, customers operating their own IP networks and infrastructure can rely on Gatebot to protect their own network.
Magic Transit puts your network hardware in the cloud
We’ve covered how Cloudflare Magic Transit connects your network to the Internet, and how it protects you from DDoS attack. If you were running your network the old-fashioned way, this is where you’d stop to buy firewall hardware, and maybe another box to do load balancing.
With Magic Transit, you don’t need those boxes. We have a long track record of delivering common network functions (firewalls, load balancers, etc.) as services. Up until this point, customers deploying our services have relied on DNS to bring traffic to our edge, after which our Layer 3 (IP), Layer 4 (TCP & UDP), and Layer 7 (HTTP, HTTPS, and DNS) stacks take over and deliver performance and security to our customers.
Magic Transit is designed to handle your entire network, but does not enforce a one-size-fits-all approach to what services get applied to which portion of your traffic. To revisit Acme, our example customer from above, they have brought 203.0.113.0/24 to the Cloudflare Network. This represents 256 IPv4 addresses, some of which (eg 203.0.113.8/30) might front load balancers and HTTP servers, others mail servers, and others still custom UDP-based applications.
Each of these sub-ranges may have different security and traffic management requirements. Magic Transit allows you to configure specific IP addresses with their own suite of services, or apply the same configuration to large portions (or all) of your block.
Taking the above example, Acme may wish that the 203.0.113.8/30 block containing HTTP services fronted by a traditional hardware load balancer instead deploy the Cloudflare Load Balancer, and also wants HTTP traffic analyzed with Cloudflare’s WAF and content cached by our CDN. With Magic Transit, deploying these network functions is straight-forward — a few clicks in our dashboard or API calls will have your traffic handled at a higher layer of network abstraction, with all the attendant goodies applying application level load balancing, firewall, and caching logic bring.
This is just one example of a deployment customers might pursue. We’ve worked with several who just want pure IP passthrough, with DDoS mitigation applied to specific IP addresses. Want that? We got you!
Magic Transit runs on the entire Cloudflare Global Network. Or, no more scrubs!
When you connect your network to Cloudflare Magic Transit, you get access to the entire Cloudflare network. This means all of our network locations become your network locations. Our network capacity becomes your network capacity, at your disposal to power your experiences, deliver your content, and mitigate attacks on your infrastructure.
How expansive is the Cloudflare Network? We’re in 193 cities worldwide, with more than 30Tbps of network capacity spread across them. Cloudflare operates within 100 milliseconds of 98% of the Internet-connected population in the developed world, and 93% of the Internet-connected population globally (for context, the blink of an eye is 300-400 milliseconds).
Just as we built our own products in house, we also built our network in house. Every product runs in every datacenter, meaning our entire network delivers all of our services. This might not have been the case if we had assembled our product portfolio piecemeal through acquisition, or not had completeness of vision when we set out to build our current suite of services.
The end result for customers of Magic Transit: a network presence around the globe as soon you come on board. Full access to a diverse set of services worldwide. All delivered with latency and performance in mind.
We’ll be sharing a lot more technical detail on how we deliver Magic Transit in the coming weeks and months.
Magic Transit lowers total cost of ownership
Traditional network services don’t come cheap; they require high capital outlays up front, investment in staff to operate, and ongoing maintenance contracts to stay functional. Just as our product aims to be disruptive technically, we want to disrupt traditional network cost-structures as well.
Magic Transit is delivered and billed as a service. You pay for what you use, and can add services at any time. Your team will thank you for its ease of management; your management will thank you for its ease of accounting. That sounds pretty good to us!
Magic Transit is available today
We’ve worked hard over the past nine years to get our network, management tools, and network functions as a service into the state they’re in today. We’re excited to get the tools we use every day in customers’ hands.
So that brings us to naming. When we showed this to customers the most common word they used was ‘whoa.’ When we pressed what they meant by that they almost all said: ‘It’s so much better than any solution we’ve seen before. It’s, like, magic!’ So it seems only natural, if a bit cheesy, that we call this product what it is: Magic Transit.
Today we announced Cloudflare Magic Transit, which makes Cloudflare’s network available to any IP traffic on the Internet. Up until now, Cloudflare has primarily operated proxy services: our servers terminate HTTP, TCP, and UDP sessions with Internet users and pass that data through new sessions they create with origin servers. With Magic Transit, we are now also operating at the IP layer: in addition to terminating sessions, our servers are applying a suite of network functions (DoS mitigation, firewalling, routing, and so on) on a packet-by-packet basis.
Over the past nine years, we’ve built a robust, scalable global network that currently spans 193 cities in over 90 countries and is ever growing. All Cloudflare customers benefit from this scale thanks to two important techniques. The first is anycast networking. Cloudflare was an early adopter of anycast, using this routing technique to distribute Internet traffic across our data centers. It means that any data center can handle any customer’s traffic, and we can spin up new data centers without needing to acquire and provision new IP addresses. The second technique is homogeneous server architecture. Every server in each of our edge data centers is capable of running every task. We build our servers on commodity hardware, making it easy to quickly increase our processing capacity by adding new servers to existing data centers. Having no specialty hardware to depend on has also led us to develop an expertise in pushing the limits of what’s possible in networking using modern Linux kernel techniques.
Magic Transit is built on the same network using the same techniques, meaning our customers can now run their network functions at Cloudflare scale. Our fast, secure, reliable global edge becomes our customers’ edge. To explore how this works, let’s follow the journey of a packet from a user on the Internet to a Magic Transit customer’s network.
Putting our DoS mitigation to work… for you!
In the announcement blog post we describe an example deployment for Acme Corp. Let’s continue with this example here. When Acme brings their IP prefix 203.0.113.0/24 to Cloudflare, we start announcing that prefix to our transit providers, peers, and to Internet exchanges in each of our data centers around the globe. Additionally, Acme stops announcing the prefix to their own ISPs. This means that any IP packet on the Internet with a destination address within Acme’s prefix is delivered to a nearby Cloudflare data center, not to Acme’s router.
Let’s say I want to access Acme’s FTP server on 203.0.113.100 from my computer in Cloudflare’s office in Champaign, IL. My computer generates a TCP SYN packet with destination address 203.0.113.100 and sends it out to the Internet. Thanks to anycast, that packet ends up at Cloudflare’s data center in Chicago, which is the closest data center (in terms of Internet routing distance) to Champaign. The packet arrives on the data center’s router, which uses ECMP (Equal Cost Multi-Path) routing to select which server should handle the packet and dispatches the packet to the selected server.
Once at the server, the packet flows through our XDP- and iptables-based DoS detection and mitigation functions. If this TCP SYN packet were determined to be part of an attack, it would be dropped and that would be the end of it. Fortunately for me, the packet is permitted to pass.
So far, this looks exactly like any other traffic on Cloudflare’s network. Because of our expertise in running a global anycast network we’re able to attract Magic Transit customer traffic to every data center and apply the same DoS mitigation solution that has been protecting Cloudflare for years. Our DoS solution has handled some of the largest attacks ever recorded, including a 942Gbps SYN flood in 2018. Below is a screenshot of a recent SYN flood of 300M packets per second. Our architecture lets us scale to stop the largest attacks.
Network namespaces for isolation and control
The above looked identical to how all other Cloudflare traffic is processed, but this is where the similarities end. For our other services, the TCP SYN packet would now be dispatched to a local proxy process (e.g. our nginx-based HTTP/S stack). For Magic Transit, we instead want to dynamically provision and apply customer-defined network functions like firewalls and routing. We needed a way to quickly spin up and configure these network functions while also providing inter-network isolation. For that, we turned to network namespaces.
Namespaces are a collection of Linux kernel features for creating lightweight virtual instances of system resources that can be shared among a group of processes. Namespaces are a fundamental building block for containerization in Linux. Notably, Docker is built on Linux namespaces. A network namespace is an isolated instance of the Linux network stack, including its own network interfaces (with their own eBPF hooks), routing tables, netfilter configuration, and so on. Network namespaces give us a low-cost mechanism to rapidly apply customer-defined network configurations in isolation, all with built-in Linux kernel features so there’s no performance hit from userspace packet forwarding or proxying.
When a new customer starts using Magic Transit, we create a brand new network namespace for that customer on every server across our edge network (did I mention that every server can run every task?). We built a daemon that runs on our servers and is responsible for managing these network namespaces and their configurations. This daemon is constantly reading configuration updates from Quicksilver, our globally distributed key-value store, and applying customer-defined configurations for firewalls, routing, etc, inside the customer’s namespace. For example, if Acme wants to provision a firewall rule to allow FTP traffic (TCP ports 20 and 21) to 203.0.113.100, that configuration is propagated globally through Quicksilver and the Magic Transit daemon applies the firewall rule by adding an nftables rule to the Acme customer namespace:
Getting the customer’s traffic to their network namespace requires a little routing configuration in the default network namespace. When a network namespace is created, a pair of virtual ethernet (veth) interfaces is also created: one in the default namespace and one in the newly created namespace. This interface pair creates a “virtual wire” for delivering network traffic into and out of the new network namespace. In the default network namespace, we maintain a routing table that forwards Magic Transit customer IP prefixes to the veths corresponding to those customers’ namespaces. We use iptables to mark the packets that are destined for Magic Transit customer prefixes, and we have a routing rule that specifies that these specially marked packets should use the Magic Transit routing table.
(Why go to the trouble of marking packets in iptables and maintaining a separate routing table? Isolation. By keeping Magic Transit routing configurations separate we reduce the risk of accidentally modifying the default routing table in a way that affects how non-Magic Transit traffic flows through our edge.)
Network namespaces provide a lightweight environment where a Magic Transit customer can run and manage network functions in isolation, letting us put full control in the customer’s hands.
GRE + anycast = magic
After passing through the edge network functions, the TCP SYN packet is finally ready to be delivered back to the customer’s network infrastructure. Because Acme Corp. does not have a network footprint in a colocation facility with Cloudflare, we need to deliver their network traffic over the public Internet.
This poses a problem. The destination address of the TCP SYN packet is 203.0.113.100, but the only network announcing the IP prefix 203.0.113.0/24 on the Internet is Cloudflare. This means that we can’t simply forward this packet out to the Internet—it will boomerang right back to us! In order to deliver this packet to Acme we need to use a technique called tunneling.
Tunneling is a method of carrying traffic from one network over another network. In our case, it involves encapsulating Acme’s IP packets inside of IP packets that can be delivered to Acme’s router over the Internet. There are a number of common tunneling protocols, but Generic Routing Encapsulation (GRE) is often used for its simplicity and widespread vendor support.
GRE tunnel endpoints are configured both on Cloudflare’s servers (inside of Acme’s network namespace) and on Acme’s router. Cloudflare servers then encapsulate IP packets destined for 203.0.113.0/24 inside of IP packets destined for a publicly-routable IP address for Acme’s router, which decapsulates the packets and emits them into Acme’s internal network.
Now, I’ve omitted an important detail in the diagram above: the IP address of Cloudflare’s side of the GRE tunnel. Configuring a GRE tunnel requires specifying an IP address for each side, and the outer IP header for packets sent over the tunnel must use these specific addresses. But Cloudflare has thousands of servers, each of which may need to deliver packets to the customer through a tunnel. So how many Cloudflare IP addresses (and GRE tunnels) does the customer need to talk to? The answer: just one, thanks to the magic of anycast.
Cloudflare uses anycast IP addresses for our GRE tunnel endpoints, meaning that any server in any data center is capable of encapsulating and decapsulating packets for the same GRE tunnel. How is this possible? Isn’t a tunnel a point-to-point link? The GRE protocol itself is stateless—each packet is processed independently and without requiring any negotiation or coordination between tunnel endpoints. While the tunnel is technically bound to an IP address it need not be bound to a specific device. Any device that can strip off the outer headers and then route the inner packet can handle any GRE packet sent over the tunnel. Actually, in the context of anycast the term “tunnel” is misleading since it implies a link between two fixed points. With Cloudflare’s Anycast GRE, a single “tunnel” gives you a conduit to every server in every data center on Cloudflare’s global edge.
One very powerful consequence of Anycast GRE is that it eliminates single points of failure. Traditionally, GRE-over-Internet can be problematic because an Internet outage between the two GRE endpoints fully breaks the “tunnel”. This means reliable data delivery requires going through the headache of setting up and maintaining redundant GRE tunnels terminating at different physical sites and rerouting traffic when one of the tunnels breaks. But because Cloudflare is encapsulating and delivering customer traffic from every server in every data center, there is no single “tunnel” to break. This means Magic Transit customers can enjoy the redundancy and reliability of terminating tunnels at multiple physical sites while only setting up and maintaining a single GRE endpoint, making their jobs simpler.
Our scale is now your scale
Magic Transit is a powerful new way to deploy network functions at scale. We’re not just giving you a virtual instance, we’re giving you a global virtual edge. Magic Transit takes the hardware appliances you would typically rack in your on-prem network and distributes them across every server in every data center in Cloudflare’s network. This gives you access to our global anycast network, our fleet of servers capable of running your tasks, and our engineering expertise building fast, reliable, secure networks. Our scale is now your scale.
The collective thoughts of the interwebz
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.