Tag Archives: Traffic

The crawl-to-click gap: Cloudflare data on AI bots, training, and referrals

2025-08-29 João Tomé

Post Syndicated from João Tomé original https://blog.cloudflare.com/crawlers-click-ai-bots-training/

In 2025, Generative AI is reshaping how people and companies use the Internet. Search engines once drove traffic to content creators through links. Now, AI training crawlers — the engines behind commonly-used LLMs — are consuming vast amounts of web data, while sending far fewer users back. We covered this shift, along with related trends and Cloudflare features (like pay per crawl) in early July. Studies from Pew Research Center (1, 2) and Authoritas already point to AI overviews — Google’s new AI-generated summaries shown at the top of search results — contributing to sharp declines in news website traffic. For a news site, this means lots of bot hits, but far fewer real readers clicking through — which in turn means fewer people clicking on ads or chances to convert to subscriptions.

Cloudflare’s data shows the same pattern. Crawling by search engines and AI services surged in the first half of 2025 — up 24% year-over-year in June — before slowing to just 4% year-over-year growth in July. How is the space evolving? Which crawling purposes are most common, and how is that changing? Spoiler: training-related crawling is leading the way. In this post, we track AI and search bot crawl activity, what purposes dominate, and which platforms contribute the least referral traffic back to creators.

Key takeaways

Training crawling grows: Training now drives nearly 80% of AI bot activity, up from 72% a year ago.
Publisher referrals drop: Google referrals to news sites fell, with March 2025 down ~9% compared to January.
AI & search crawling increase: Crawling rose 32% year-over-year in April 2025, before slowing to 4% year-over-year growth in July.
AI-only crawler shifts: OpenAI’s GPTBot more than doubled in share of AI crawling traffic (4.7% to 11.7%), Anthropic’s ClaudeBot rose (6% to ~10%), while ByteDance’s Bytespider fell from 14.1% to 2.4%.
Crawl-to-refer imbalance (how many pages a bot crawls per page that a user clicks back to): Anthropic increased referrals but still leads with 38,000 crawls per visitor in July (down from 286,000:1 in January). Perplexity decreased referrals in 2025 — with more crawling but fewer referrals at 194 crawls per visitor in July.

Several of the trends in this blog use Cloudflare Radar’s new AI Insights features, explained in more detail in the post: “A deeper look at AI crawlers: breaking down traffic by purpose and industry.”

Google referrals fall as AI Overviews expand

Referral traffic from search is already shifting, as we noted above and as studies have shown. In our dataset of news-related customers (spanning the Americas, Europe, and Asia), Google’s referrals have been clearly declining since February 2025. This drop is unusual, since overall Internet traffic (and referrals as well) historically has only dipped during July and August — the summer months when the Northern Hemisphere is largely on break from school or work. The sharpest and least seasonal decline came in March. Despite being a 31-day month, March had almost the same referral volume as the shorter, 28-day February.

Looking at longer comparisons: March 2025 referral traffic from Google was 9% lower than January, the same drop seen in June. April was worse, down 15% compared with January.

This drop seems to coincide with some of Google’s changes. AI Overviews launched in the U.S. in May 2024, but in March 2025, Google upgraded AI Overviews with Gemini 2.0, introduced AI Mode in Labs, and expanded Overviews to more European countries. By May 2025, AI Mode rolled out broadly in the U.S. with Gemini 2.5, adding conversational search, Deep Search, and personalized recommendations.

The search-to-news site pipeline seems to be weakening, replaced in part by AI-driven results.

Looking at a daily perspective, we can also spot a clear U.S.-election-related peak in referrals from Google to the cohort of known news sites on November 5–6, 2024.

AI and search crawling: spring surge (+24%), summer slowdown

In June, we talked about search and AI crawler growth, and our picture of the trend is now more complete with more data. To focus only on AI and search crawlers, and to remove the bias of customer growth, we analyzed a fixed set of customers from specific weeks, a method we’ve also used in the Cloudflare Radar Year in Review.

What the data shows: crawling spiked twice: first in November 2024, then again between March and April 2025. April 2025 alone was up 32% compared with May 2024, the first full month where we have comparable data. After that surge, growth stabilized. In June 2025, crawling traffic was still 24% higher year-over-year, but by July the increase was down to just 4%. That shift highlights how quickly crawler activity can accelerate and then cool down.

As the chart below shows, crawling traffic rose sharply in March and April. It remained high but slightly lower in May, before starting to drop in June. The seasonal dip is similar to what we see in overall Internet traffic during the Northern Hemisphere’s summer months (August and September are often the quietest), though in the case of crawlers, this is likely due to reduced overall web activity rather than bots themselves taking a “break.” Historically, activity tends to rise again in November — as it did in 2024 for AI and search bot traffic — when people spend more time online for shopping and seasonal habits (a pattern we’ve seen in past years).

Googlebot is still the anchor, accounting for 39% of all AI and search crawler traffic, but the fastest growth now comes from AI-specific crawlers, though bots related to Amazon and ByteDance (Bytespider) have lost significant ground. GPTBot’s share grew from 4.7% in July 2024 to 11.7% in July 2025. ClaudeBot also increased, from 6% to nearly 10%, while Meta’s crawler jumped from 0.9% to 7.5%. By contrast, Amazonbot dropped from 10.2% to 5.9%, and ByteDance’s Bytespider dropped from 14.1% to just 2.4%.

The table below shows how market shares have shifted between July 2024 and July 2025:

	Bot name	% share July 2024	% share July 2025	Δ percentage-point change
1	Googlebot	37.5	39	1.5
2	GPTBot	4.7	11.7	7
3	ClaudeBot	6	9.9	3.9
4	Bingbot	8.7	9.3	0.6
5	Meta-ExternalAgent	0.9	7.5	6.5
6	Amazonbot	10.2	5.9	-4.3
7	Googlebot-Image	4.1	3.3	-0.8
8	Yandex	5	2.9	-2.1
9	GoogleOther	4.6	2.7	-1.8
10	Bytespider	14.1	2.4	-11.6
11	Applebot	1.8	1.5	-0.3
12	ChatGPT-User	0.1	0.9	0.9
13	OAI-SearchBot	0	0.9	0.9
14	Baiduspider	0.5	0.5	0
15	Googlebot-Mobile	0.2	0.4	0.2

AI-only crawlers: OpenAI rises, ByteDance falls

Looking only at AI bot traffic (as tracked on our Radar AI page), the trend is clear. Since January 2025, GPTBot has steadily increased its crawling volume, driven mainly by training-related activity. ClaudeBot crawling accelerated in June, while Amazonbot and Bytespider activity slowed.

The chart below shows how GPTBot surged over the past 12 months, overtaking Amazonbot and Bytespider, which both fell sharply:

A comparison between July 2024 and July 2025 makes the shift even more obvious. GPTBot gained 16 percentage points, Meta’s crawler rose by more than 15, and ClaudeBot grew by 8. On the shrinking side, Amazonbot dropped 12 percentage points and Bytespider dropped over 31 percentage points.

	AI-only bots	July 2024 %	July 2025 %	Δ percentage-point change
1	GPTBot	11.9	28.1	16.1
2	ClaudeBot	15	23.3	8.3
3	Meta-ExternalAgent	2.4	17.7	15.3
4	Amazonbot	26.4	14.1	-12.3
5	Bytespider	37.3	5.8	-31.5
6	Applebot	4.9	3.7	-1.2
7	ChatGPT-User	0.2	2.4	2.2
8	OAI-SearchBot	0	2.2	2.2
9	TikTokSpider	0	0.7	0.7
10	imgproxy	0	0.7	0.7
11	PerplexityBot	0	0.4	0.4
12	Google-CloudVertexBot	0	0.3	0.3
13	AI2Bot	0	0.2	0.2
14	Timpibot	0.6	0.1	-0.5
15	CCBot	0.1	0.1	0

We covered the functionality of these bots in our June blog post.

Crawling by purpose: training dominates

Training is the clear leader. (We classify purpose based on operator disclosures and industry sources, a method we explained in this AI Week blog.) Over the past 12 months, 80% of AI crawling was for training, compared with 18% for search and just 2% for user actions. In the last six months, the share for training rose further to 82%, while search dropped to 15% and user actions increased slightly to 3%.

The chart below shows how training-related crawling steadily grew over the past year, far outpacing other purposes:

The year-over-year comparison reinforces this trend. In July 2024, training accounted for 72% of AI crawling. By July 2025, it had risen to 79%. Over the same period, search fell from 26% to 17%, while user actions grew modestly from 2% to 3.2%.

Crawl-to-refer ratios shifts: tens of thousands of bot crawls per human click

The crawl-to-refer ratio measures how many pages a platform crawls compared with how often it drives users to a website. In practice, a high ratio means heavy crawling but little referral traffic. For example, for every visitor Anthropic refers back to a website, its crawlers have already visited tens of thousands of pages.

Why does this metric matter? It highlights the imbalance between how much content AI systems consume and how little traffic they return. For publishers, it can feel like giving away the raw material for free. With that in mind, here’s how different platforms compare from January to July 2025.

Anthropic remains the most crawl-heavy platform. Even after an 87% decline this year, it still crawled 38,000 pages for every referred page visit in July 2025 — the highest imbalance among major AI players. Referrals may be improving, though, after Anthropic added web search to Claude in March 2025 (initially for U.S. paid users) and expanded it globally by May to all users, including the free tier. The feature introduced direct citations with clickable URLs, creating new referral pathways.

The full dataset is below, showing January–July 2025 ratios by platform ordered by the highest ratio average:
(Note: a rising ratio means more bot crawling per human click sent back, while a falling ratio means less bot crawling per human click sent back)

Crawl-to-refer ratio (from Cloudflare Radar’s data)

Service	Jan	Feb	Mar	Apr	May	Jun	Jul	Average	% Change Jan-Jul
Anthropic	286,930.1	271,748.2	121,612.7	130,330.2	114,313	71,282.8	38,065.7	147,754.7	-86.7%
OpenAI	1,217.4	1,774.5	2,217	1200	995.6	1,655.9	1,091.4	1,437.8	-10.4%
Perplexity	54.6	55.3	201.3	300.9	199.1	200.6	194.8	172.4	256.7%
Microsoft	38.5	44.2	42.3	43.3	45.1	42	40.7	42.3	5.7%
Yandex	15.5	13.1	13.1	15.7	14.7	15.9	21.4	15.6	38.3%
Google	3.8	6.3	14.6	22.5	16.7	13.1	5.4	11.8	43%
ByteDance	18	16.4	3.5	2.3	1.6	1.6	0.9	6.3	-95%
Baidu	0.6	0.7	0.8	1.5	1.2	1	0.9	1	44.5%
DuckDuckGo	0.1	0.2	0.2	0.2	0.3	0.3	0.3	0.2	116.3%

Looking at the changes from January to July 2025:

Anthropic recorded the steepest decrease in bot to human traffic, down 86.7%. From 286,930 bots per human in January, to 38,065 bots per human in July, the change shows a dramatic increase in referrals. Despite the change, it remains by far the most crawl-heavy platform, with tens of thousands of pages still crawled for every referral.
Perplexity moved in the opposite direction, with bot crawling increasing +256.7% relative to human visitors; climbing from 54 bots per human in January to 195 bots per human in July. While the ratio is still far below Anthropic, the increase shows it is crawling more heavily, relative to the traffic it refers, than it did earlier.
OpenAI ratio dropped slightly, from 1,217 bots per human in January to 1,091 in July (-10%). The shift is smaller than Anthropic’s but suggests OpenAI is sending a bit more referral traffic relative to its crawling.
Microsoft stayed steady, with its ratio moving only slightly, from 38.5 bots per human in January to 40.7 in July (+6%). This consistency suggests stable behavior from Bing-linked services.
Yandex increased from 15.5 bots per human in January to 21.4 in July (+38%). The overall ratio is far smaller than Anthropic’s or Perplexity’s, but it shows Yandex is crawling more heavily relative to the traffic it sends back.

Alongside measuring crawling volumes and referral traffic (now also visible on the AI Insights page of Cloudflare Radar), it’s worth looking at whether AI operators follow good practices when deploying their bots. Cloudflare data shows that most leading AI crawlers are on our verified bots list, meaning their IP addresses match published ranges and they respect robots.txt. But adoption of newer standards like WebBotAuth — which uses cryptographic signatures in HTTP messages to confirm a request comes from a specific bot, and is especially relevant today — is still missing.

Google, Meta, and OpenAI run distinct bots for different purposes, while Anthropic lags in verification. That makes it easier for bad actors to spoof its crawler and ignore robots.txt, since without verification, it’s hard to distinguish real from fake traffic — leaving its compliance effectively unclear. (A longer list of AI bots is available here).

Conclusion and what’s next

If training-related crawling continues to dominate while referrals stay flat, creators face a paradox: feeding AI systems without gaining traffic in return. Many want their content to appear in chatbot answers, but without monetization or cooperation, the incentive to produce quality work declines.

The Web now stands at a fork in the road. Either a new balance emerges — one where the new AI era helps sustain publishers and creators — or AI turns the open web into a one-way training set, extracting value with little flowing back.

You can learn more about some of these data trends on Cloudflare Radar’s updated AI Insights page.

How the power outage of April 28, 2025, in Portugal and Spain impacted Internet traffic and connectivity

2025-04-29 David Belson

Post Syndicated from David Belson original https://blog.cloudflare.com/how-power-outage-in-portugal-spain-impacted-internet/

A massive power outage struck significant portions of Portugal and Spain at 10:34 UTC on April 28, grinding transportation to a halt, shutting retail businesses, and otherwise disrupting everyday activities and services. Parts of France were also reportedly impacted by the power outage. Portugal’s electrical grid operator blamed the outage on a “fault in the Spanish electricity grid”, and later stated that “due to extreme temperature variations in the interior of Spain, there were anomalous oscillations in the very high voltage lines (400 kilovolts), a phenomenon known as ‘induced atmospheric vibration’” and that “These oscillations caused synchronisation failures between the electrical systems, leading to successive disturbances across the interconnected European network.”

The breadth of Cloudflare’s network and our customer base provides us with a unique perspective on Internet resilience, enabling us to observe the Internet impact of this power outage at both a local and national level, as well as at a network level, across traffic, network quality, and routing metrics.

Impacts in Portugal

Country level

In Portugal, Internet traffic dropped as the power grid failed, with traffic immediately dropping by half as compared to the previous week, falling to approximately 90% below the previous week within the next five hours.

Request traffic from users in Portugal to Cloudflare’s 1.1.1.1 DNS resolver also fell when the power went out, initially dropping by 40% as compared to the previous week, and falling further over the next several hours.

Network level

At a network level, the loss of Internet traffic from local providers including NOS, Vodafone, MEO, and NOWO was swift and significant. The Cloudflare Radar graphs below show that traffic from those networks effectively evaporated over the hours after the power outage began. The autonomous systems (ASNs) shown below for these providers may carry a mix of fixed and mobile broadband traffic. However, MEO breaks out at least some of their mobile traffic onto a separate ASN, and the graph below for MEO-MOVEL (AS42863) shows that request traffic from that network more than doubled after the power went out, as subscribers turned to their mobile devices for information about what was happening. However, despite the initial spike, this mobile traffic also fell over the next several hours, dropping to approximately half of the volume seen the prior week.

Regional level

In addition to looking at traffic at a national and network level, we can also look at traffic at a regional level. As noted above, the power outage did not impact every region of the country. The traffic graphs below show the changes in Internet traffic from the parts of Portugal where an impact was observed.

In Lisbon and Porto, a sharp, but limited drop in traffic was observed as the power outage began, with traffic recovering slightly almost as quickly. However, traffic gradually declined in the subsequent hours, in contrast to the other regions reviewed below.

The most significant immediate traffic drops were observed in Aveiro, Beja, Bragança, Castelo Branco, Évora, Faro, Guarda, Portalegre, Santarém, Viana do Castelo, Vila Real, and Viseu. In these areas, traffic fell and then quickly stabilized at very low volumes. In Braga and Setúbal, traffic declined more gradually after the initial drop.

Network quality

The power outage also impacted the quality of connectivity at a national level in Portugal. Prior to the loss of power, median download speeds across the country were around 40 Mbps, but within several hours after the state of the outage, fell as low as 15 Mbps. As expected, latency at a country level saw an opposite impact. Prior to the loss of power, median latency was around 20 ms. However, it gradually grew to as much as 50 ms. The lower download speeds and higher latency are likely due to the congestion of the network links that remained available.

Routing

Network infrastructure in Portugal was also impacted by the power outage, with the impact seen as a drop in announced IP address space. (This means that portions of Portuguese providers’ networks are no longer visible to the rest of the Internet.) The number of announced IPv4 /24s (blocks of 256 IPv4 addresses) dropped by ~300 (around 1.2%), and the number of announced IPv6 /48s (blocks of over 1.2 octillion IPv6 addresses) dropped from 17,928,551 to 16,355,607 (around 9%). Address space began to drop further after 16:00 UTC, possibly as a result of backup power being exhausted and associated network infrastructure falling offline.

Impacts in Spain

Country level

In Spain, Internet traffic dropped as the power grid failed, with traffic immediately dropping by around 60% as compared to the previous week, falling to approximately 80% below the previous week within the next five hours.

Request traffic from users in Spain to Cloudflare’s 1.1.1.1 DNS resolver also fell when the power went out, initially dropping by 54% as compared to the previous week, but quickly stabilizing.

Network level

At a network level, traffic volumes from the top five ASNs in Spain fell rapidly once power was lost, with most declining gradually over the next several hours. In contrast, traffic from Digi Spain Telecom (AS57269) fell quickly, but then stabilized at the lower level. In comparison to the previous week, traffic from these providers fell between 75% and 93% in the hours after the power outage began.

Regional level

In most of the impacted regions in Spain, traffic dropped off quickly and stabilized, or continued to fall further. However, some recovery in traffic is also evident, and can be seen in Navarre, La Rioja, Cantabria, and Basque Country. This traffic recovery is likely associated with an initial restoration of power in those regions, as an update from Red Eléctrica (operator of Spain’s national electricity grid) noted that “Electricity is now available in parts of Catalonia, Aragon, the Basque Country, Galicia, Asturias, Navarre, Castile and León, Extremadura, Andalusia, and La Rioja.”

Network quality

The power outage also impacted the quality of connectivity at a national level in Spain. Prior to the loss of power, median download speeds across the country were around 35 Mbps, but within several hours after the state of the outage, fell as low as 19 Mbps. Interestingly, the median bandwidth didn’t see the clean gradual decline as it did in Portugal, instead falling and recovering twice before gradually declining.

As expected, latency at a country level saw a significant increase. Prior to the loss of power, median latency was around 22 ms, but grew to as much as 40 ms. As in Portugal, the lower download speeds and higher latency are likely due to the congestion of the network links that remained available.

Routing

Similar to Portugal, network infrastructure in Spain was also impacted by the power outage, with the impact seen as a drop in announced IP address space. By 14:30 UTC, the number of announced IPv4 /24 address blocks had fallen by around 2.4%, and continued to drop further over the following hours. The number of announced IPv6 /48 address blocks fell by over 8% during that same time span, and also continued to drop in the following hours.

Impacts in other European countries

Parts of Andorra and France were also reportedly impacted by the power outage, with additional outages reported as far away as Belgium. At a national level, no traffic disruptions were evident in any of the countries.

Analysis of traffic at a regional level in France shows a slight decline concurrent with the power outage in several regions, but the drops were nominal in comparison to Spain and Portugal, and traffic volumes recovered to expected levels within 90 minutes. No impact was evident at a regional level in Andorra.

It appears that Morocco may have been impacted in some fashion by the power outage, or at least Orange Maroc was. In a post on X, the provider stated (translated) “Internet traffic has been disrupted following a massive power outage in Spain and Portugal, which is affecting international connections.” Cloudflare Radar shows that traffic from the network fell sharply around 12:00 UTC, 90 minutes after the power outage began, with a full outage beginning around 15:00 UTC.

Conclusion

Power restoration in Spain had already started as this post was being written, and full recovery will likely take hours to days. As power is restored, Internet traffic and other metrics will recover as well. The current state of Internet connectivity in Spain and Portugal can be tracked on Cloudflare Radar.

The Cloudflare Radar team is constantly monitoring for Internet disruptions, sharing our observations on the Cloudflare Radar Outage Center, via social media, and in posts on blog.cloudflare.com. Follow us on social media at @CloudflareRadar (X), noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky), or contact us via email.

Some TXT about, and A PTR to, new DNS insights on Cloudflare Radar

2025-02-27 David Belson

Post Syndicated from David Belson original https://blog.cloudflare.com/new-dns-section-on-cloudflare-radar/

No joke – Cloudflare’s 1.1.1.1 resolver was launched on April Fool’s Day in 2018. Over the last seven years, this highly performant and privacy–conscious service has grown to handle an average of 1.9 Trillion queries per day from approximately 250 locations (countries/regions) around the world. Aggregated analysis of this traffic provides us with unique insight into Internet activity that goes beyond simple Web traffic trends, and we currently use analysis of 1.1.1.1 data to power Radar’s Domains page, as well as the Radar Domain Rankings.

In December 2022, Cloudflare joined the AS112 Project, which helps the Internet deal with misdirected DNS queries. In March 2023, we launched an AS112 statistics page on Radar, providing insight into traffic trends and query types for this misdirected traffic. Extending the basic analysis presented on that page, and building on the analysis of resolver data used for the Domains page, today we are excited to launch a dedicated DNS page on Cloudflare Radar to provide increased visibility into aggregate traffic and usage trends seen across 1.1.1.1 resolver traffic. In addition to looking at global, location, and autonomous system (ASN) traffic trends, we are also providing perspectives on protocol usage, query and response characteristics, and DNSSEC usage.

The traffic analyzed for this new page may come from users that have manually configured their devices or local routers to use 1.1.1.1 as a resolver, ISPs that set 1.1.1.1 as the default resolver for their subscribers, ISPs that use 1.1.1.1 as a resolver upstream from their own, or users that have installed Cloudflare’s 1.1.1.1/WARP app on their device. The traffic analysis is based on anonymised DNS query logs, in accordance with Cloudflare’s Privacy Policy, as well as our 1.1.1.1 Public DNS Resolver privacy commitments.

Below, we walk through the sections of Radar’s new DNS page, reviewing the included graphs and the importance of the metrics they present. The data and trends shown within these graphs will vary based on the location or network that the aggregated queries originate from, as well as on the selected time frame.

Traffic trends

As with many Radar metrics, the DNS page leads with traffic trends, showing normalized query volume at a worldwide level (default), or from the selected location or autonomous system (ASN). Similar to other Radar traffic-based graphs, the time period shown can be adjusted using the date picker, and for the default selections (last 24 hours, last 7 days, etc.), a comparison with traffic seen over the previous period is also plotted.

For location-level views (such as Latvia, in the example below), a table showing the top five ASNs by query volume is displayed alongside the graph. Showing the network’s share of queries from the selected location, the table provides insights into the providers whose users are generating the most traffic to 1.1.1.1.

When a country/region is selected, in addition to showing an aggregate traffic graph for that location, we also show query volumes for the country code top level domain (ccTLD) associated with that country. The graph includes a line showing worldwide query volume for that ccTLD, as well as a line showing the query volume based on queries from the associated location. Anguilla’s ccTLD is .ai, and is a popular choice among the growing universe of AI-focused companies. While most locations see a gap between the worldwide and “local” query volume for their ccTLD, Anguilla’s is rather significant — as the graph below illustrates, this size of the gap is driven by both the popularity of the ccTLD and Anguilla’s comparatively small user base. (Traffic for .ai domains from Anguilla is shown by the dark blue line at the bottom of the graph.) Similarly, sizable gaps are seen with other “popular” ccTLDs as well, such as .io (British Indian Ocean Territory), .fm (Federated States of Micronesia), and .co (Colombia). A higher “local” ccTLD query volume in other locations results in smaller gaps when compared to the worldwide query volume.

Depending on the strength of the signal (that is, the volume of traffic) from a given location or ASN, this data can also be used to corroborate reported Internet outages or shutdowns, or reported blocking of 1.1.1.1. For example, the graph below illustrates the result of Venezuelan provider CANTV reportedly blocking access to 1.1.1.1 for its subscribers. A comparable drop is visible for Supercable, another Venezuelan provider that also reportedly blocked access to Cloudflare’s resolver around the same time.

Individual domain pages (like the one for cloudflare.com, for example) have long had a choropleth map and accompanying table showing the popularity of the domain by location, based on the share of DNS queries for that domain from each location. A similar view is included at the bottom of the worldwide overview page, based on the share of total global queries to 1.1.1.1 from each location.

Query and response characteristics

While traffic trends are always interesting and important to track, analysis of the characteristics of queries to 1.1.1.1 and the associated responses can provide insights into the adoption of underlying transport protocols, record type popularity, cacheability, and security.

Published in November 1987, RFC 1035 notes that “The Internet supports name server access using TCP [RFC-793] on server port 53 (decimal) as well as datagram access using UDP [RFC-768] on UDP port 53 (decimal).” Over the subsequent three-plus decades, UDP has been the primary transport protocol for DNS queries, falling back to TCP for a limited number of use cases, such as when the response is too big to fit in a single UDP packet. However, as privacy has become a significantly greater concern, encrypted queries have been made possible through the specification of DNS over TLS (DoT) in 2016 and DNS over HTTPS (DoH) in 2018. Cloudflare’s 1.1.1.1 resolver has supported both of these privacy-preserving protocols since launch. The DNS transport protocol graph shows the distribution of queries to 1.1.1.1 over these four protocols. (Setting up 1.1.1.1 on your device or router uses DNS over UDP by default, although recent versions of Android support DoT and DoH. The 1.1.1.1 app uses DNS over HTTPS by default, and users can also configure their browsers to use DNS over HTTPS.)

Note that Cloudflare’s resolver also services queries over DoH and Oblivious DoH (ODoH) for Mozilla and other large platforms, but this traffic is not currently included in our analysis. As such, DoH adoption is under-represented in this graph.

Aggregated worldwide between February 19 – February 26, distribution of transport protocols was 86.6% for UDP, 9.6% for DoT, 2.0% for TCP, and 1.7% for DoH. However, in some locations, these ratios may shift if users are more privacy conscious. For example, the graph below shows the distribution for Egypt over the same time period. In that country, the UDP and TCP shares are significantly lower than the global level, while the DoT and DoH shares are significantly higher, suggesting that users there may be more concerned about the privacy of their DNS queries than the global average, or that there is a larger concentration of 1.1.1.1 users on Android devices who have set up 1.1.1.1 using DoT manually. (The 2024 Cloudflare Radar Year in Review found that Android had an 85% mobile device traffic share in Egypt, so mobile device usage in the country leans very heavily toward Android.)

RFC 1035 also defined a number of standard and Internet specific resource record types that return the associated information about the submitted query name. The most common record types are A and AAAA, which return the hostname’s IPv4 and IPv6 addresses respectively (assuming they exist). The DNS query type graph below shows that globally, these two record types comprise on the order of 80% of the queries received by 1.1.1.1. Among the others shown in the graph, HTTPS records can be used to signal HTTP/3 and HTTP/2 support, PTR records are used in reverse DNS records to look up a domain name based on a given IP address, and NS records indicate authoritative nameservers for a domain.

A response code is sent with each response from 1.1.1.1 to the client. Six possible values were originally defined in RFC 1035, with the list further extended in RFC 2136 and RFC 2671. NOERROR, as the name suggests, means that no error condition was encountered with the query. Others, such as NXDOMAIN, SERVFAIL, REFUSED, and NOTIMP define specific error conditions encountered when trying to resolve the requested query name. The response codes may be generated by 1.1.1.1 itself (like REFUSED) or may come from an upstream authoritative nameserver (like NXDOMAIN).

The DNS response code graph shown below highlights that the vast majority of queries seen globally do not encounter an error during the resolution process (NOERROR), and that when errors are encountered, most are NXDOMAIN (no such record). It is worth noting that NOERROR also includes empty responses, which occur when there are no records for the query name and query type, but there are records for the query name and some other query type.

With DNS being a first-step dependency for many other protocols, the amount of queries of particular types can be used to indirectly measure the adoption of those protocols. But to effectively measure adoption, we should also consider the fraction of those queries that are met with useful responses, which are represented with the DNS record adoption graphs.

The example below shows that queries for A records are met with a useful response nearly 88% of the time. As IPv4 is an established protocol, the remaining 12% are likely to be queries for valid hostnames that have no A records (e.g. email domains that only have MX records). But the same graph also shows that there’s still a significant adoption gap where IPv6 is concerned.

When Cloudflare’s DNS resolver gets a response back from an upstream authoritative nameserver, it caches it for a specified amount of time — more on that below. By caching these responses, it can more efficiently serve subsequent queries for the same name. The DNS cache hit ratio graph provides insight into how frequently responses are served from cache. At a global level, as seen below, over 80% of queries have a response that is already cached. These ratios will vary by location or ASN, as the query patterns differ across geographies and networks.

As noted in the preceding paragraph, when an authoritative nameserver sends a response back to 1.1.1.1, each record inside it includes information about how long it should be cached/considered valid for. This piece of information is known as the Time-To-Live (TTL) and, as a response may contain multiple records, the smallest of these TTLs (the “minimum” TTL) defines how long 1.1.1.1 can cache the entire response for. The TTLs on each response served from 1.1.1.1’s cache decrease towards zero as time passes, at which point 1.1.1.1 needs to go back to the authoritative nameserver. Hostnames with relatively low TTL values suggest that the records may be somewhat dynamic, possibly due to traffic management of the associated resources; longer TTL values suggest that the associated resources are more stable and expected to change infrequently.

The DNS minimum TTL graphs show the aggregate distribution of TTL values for five popular DNS record types, broken out across seven buckets ranging from under one minute to over one week. During the third week of February, for example, A and AAAA responses had a concentration of low TTLs, with over 80% below five minutes. In contrast, NS and MX responses were more concentrated across 15 minutes to one hour and one hour to one day. Because MX and NS records change infrequently, they are generally configured with higher TTLs. This allows them to be cached for longer periods in order to achieve faster DNS resolution.

DNS security

DNS Security Extensions (DNSSEC) add an extra layer of authentication to DNS establishing the integrity and authenticity of a DNS response. This ensures subsequent HTTPS requests are not routed to a spoofed domain. When sending a query to 1.1.1.1, a DNS client can indicate that it is DNSSEC-aware by setting a specific flag (the “DO” bit) in the query, which lets our resolver know that it is OK to return DNSSEC data in the response. The DNSSEC client awareness graph breaks down the share of queries that 1.1.1.1 sees from clients that understand DNSSEC and can require validation of responses vs. those that don’t. (Note that by default, 1.1.1.1 tries to protect clients by always validating DNSSEC responses from authoritative nameservers and not forwarding invalid responses to clients, unless the client has explicitly told it not to by setting the “CD” (checking-disabled) bit in the query.)

Unfortunately, as the graph below shows, nearly 90% of the queries seen by Cloudflare’s resolver are made by clients that are not DNSSEC-aware. This broad lack of client awareness may be due to several factors. On the client side, DNSSEC is not enabled by default for most users, and enabling DNSSEC requires extra work, even for technically savvy and security conscious users. On the authoritative side, for domain owners, supporting DNSSEC requires extra operational maintenance and knowledge, and a mistake can cost your domain to disappear from the Internet, causing significant (including financial) issues.

The companion End-to-end security graph represents the fraction of DNS interactions that were protected from tampering, when considering the client’s DNSSEC capabilities and use of encryption (use of DoT or DoH). This shows an even greater imbalance at a global level, and highlights the importance of further adoption of encryption and DNSSEC.

For DNSSEC validation to occur, the query name being requested must be part of a DNSSEC-enabled domain, and the DNSSEC validation status graph represents the share of queries where that was the case under the Secure and Invalid labels. Queries for domains without DNSSEC are labeled as Insecure, and queries where DNSSEC validation was not applicable (such as various kinds of errors) fall under the Other label. Although nearly 93% of generic Top Level Domains (TLDs) and 65% of country code Top Level Domains (ccTLDs) are signed with DNSSEC (as of February 2025), the adoption rate across individual (child) domains lags significantly, as the graph below shows that over 80% of queries were labeled as Insecure.

Conclusion

DNS is a fundamental, foundational part of the Internet. While most Internet users don’t think of DNS beyond its role in translating easy-to-remember hostnames to IP addresses, there’s a lot going on to make even that happen, from privacy to performance to security. The new DNS page on Cloudflare Radar endeavors to provide visibility into what’s going on behind the scenes, at a global, national, and network level.

While the graphs shown above are taken from the DNS page, all the underlying data is available via the API and can be interactively explored in more detail across locations, networks, and time periods using Radar’s Data Explorer and AI Assistant. And as always, Radar and Data Assistant charts and graphs are downloadable for sharing, and embeddable for use in your own blog posts, websites, or dashboards.

If you share our DNS graphs on social media, be sure to tag us: @CloudflareRadar and @1111Resolver (X), noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky). If you have questions or comments, you can reach out to us on social media, or contact us via email.

No hallucinations here: track the latest AI trends with expanded insights on Cloudflare Radar

2025-02-04 David Belson

Post Syndicated from David Belson original https://blog.cloudflare.com/expanded-ai-insights-on-cloudflare-radar/

During 2024’s Birthday Week, we launched an AI bot & crawler traffic graph on Cloudflare Radar that provides visibility into which bots and crawlers are the most aggressive and have the highest volume of requests, which crawl on a regular basis, and more. Today, we are launching a new dedicated “AI Insights” page on Cloudflare Radar that incorporates this graph and builds on it with additional metrics that you can use to understand AI-related trends from multiple perspectives. In addition to the traffic trends, the new section includes a view into the relative popularity of publicly available Generative AI services based on 1.1.1.1 DNS resolver traffic, the usage of robots.txt directives to restrict AI bot access to content, and open source model usage as seen by Cloudflare Workers AI.

Below, we’ll review each section of the new AI Insights page in more detail.

AI bots and crawlers traffic trends

Tracking traffic trends for AI bots can help us better understand their activity over time. Initially launched in September 2024 on Radar’s Traffic page, the AI bot & crawler traffic graph has moved to the AI Insights page and provides visibility into traffic trends gathered globally over the selected time period for the top five most active AI bots & crawlers. The associated list of user agents tracked here is based on the ai.robots.txt list, and will be updated with new entries as they are identified. The time series and summary data for this graph is available from the Radar API, and traffic trends for the full set of AI bots & crawlers we see traffic from can be viewed in the Data Explorer.

Popularity of Generative AI services

Over the last several years, the Cloudflare Radar Year in Review has analyzed request traffic data from our 1.1.1.1 DNS resolver to present rankings of the most popular Internet services, both generally and across several categories. In both 2023 and 2024, this section included rankings for publicly-available Generative AI services, with ChatGPT topping the list both years. While an accompanying blog post provides a more detailed look at how the rankings shifted over the course of the year, it too is looking through the rearview mirror. That is, it doesn’t provide visibility into the changes as they are occurring. The new Generative AI services popularity graph shows the relative rankings of these services and platforms based on DNS request traffic for domains associated with these services aggregated at a daily level. The underlying time series data is available through the Radar API, using the serviceCategory=Generative%20AI parameter.

The graph below shows that as of the end of January 2025, the top five services were fairly stable over the preceding four weeks, but there was regular movement among those ranked #6-10. We expect that the rankings will continue to change over time. DeepSeek, a Generative AI service that took the industry by storm at the end of January, can be seen making its initial appearance at #9 on January 26, rising rapidly to #3 on January 29, just three days later.

Analysis of robots.txt files

Content providers can attempt to control access to their full site, or specific portions of it, through the use of Allow or Disallow directives in a robots.txt file. However, successful access control is dependent on the bots respecting the listed directives. Cloudflare’s AI Audit gives you visibility and control into how AI bots are interacting with your website, and now Cloudflare Radar gives you insights into how other sites are handling them.

On a weekly basis, we analyze Radar’s top 10,000 domains to determine which associated sites publish robots.txt files, as well as aggregating the AI-specific directives within those files. In our new AI user agents found in robots.txt graph, seen below, we are now providing insights into actions that these top sites are taking with respect to AI bots. These actions are specified by directives that allow or disallow access by a given user agent (bot identifier) for either all content on the site (Fully Allowed/Disallowed) or certain sections (Partially Allowed/Disallowed).

In addition, we have also organized these domains by category (for example, Ecommerce or News & Media), highlighting the specific bots that the sites within those categories have listed in their directives. For example, the News & Media domain category graph shown below illustrates that these types of sites almost universally fully disallow access to their sites by AI user agents.

Changing the directive to “Allow” shows a much smaller set of user agents, with a drastically smaller set of sites explicitly allowing full or partial access. (Note that if a user agent is not listed in a robots.txt file, and a wildcard “*” user agent is not specified, then access is fully allowed by default.)

In addition to appearing on the AI Insights page, the underlying data is available for further exploration and analysis through the Radar API and the Data Explorer.

Popularity of models and tasks on Workers AI

The AI model landscape is rapidly evolving, with providers regularly releasing more powerful models, capable of tasks like text and image generation, speech recognition, and image classification. Cloudflare works closely with AI model providers to ensure that Workers AI supports these models as soon as possible following their release. On the new AI Insights page, Radar now provides visibility into the popularity of publicly available supported models (Workers AI model popularity) as well as the types of tasks (Workers AI task popularity) that these models perform, based on customer account share. Extended insights, including share trends and summary shares for the full list of models and tasks, as well as the ability to compare model and task shares across time periods, are available through the Data Explorer. The underlying model popularity and task popularity data is also available through API endpoints.

Conclusion

The AI space is extremely dynamic, with new platforms, services, and models regularly appearing. In some cases, these new entrants even have the power to upset the market as they see rapid growth in interest and usage. And over two years since ChatGPT was announced, there continues to be tension between content providers and AI platforms about scraping content for model training. The new “AI Insights” page on Cloudflare Radar provides timely trends and information about this dynamic space, enabling industry observers and participants to better understand how it is changing and evolving over time.

If you share AI Insights graphs on social media, be sure to tag us: @CloudflareRadar (X), noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky). You can also reach out on social media, or contact us via email, with suggestions for AI metrics that we can explore adding to the page in the future.

Resilient Internet connectivity in Europe mitigates impact from multiple cable cuts

2024-11-20 David Belson

Post Syndicated from David Belson original https://blog.cloudflare.com/resilient-internet-connectivity-baltic-cable-cuts

When cable cuts occur, whether submarine or terrestrial, they often result in observable disruptions to Internet connectivity, knocking a network, city, or country offline. This is especially true when there is insufficient resilience or alternative paths — that is, when a cable is effectively a single point of failure. Associated observations of traffic loss resulting from these disruptions are frequently covered by Cloudflare Radar in social media and blog posts. However, two recent cable cuts that occurred in the Baltic Sea resulted in little-to-no observable impact to the affected countries, as we discuss below, in large part because of the significant redundancy and resilience of Internet infrastructure in Europe.

BCS East-West Interlink

Traffic volume indicators

On Sunday, November 17 2024, the BCS East-West Interlink submarine cable connecting Sventoji, Lithuania and Katthammarsvik, Sweden was reportedly damaged around 10:00 local (Lithuania) time (08:00 UTC). A Data Center Dynamics article about the cable cut quotes the CTO of Telia Lietuva, the telecommunications provider that operates the cable, and notes “The Lithuanian cable carried about a third of the nation’s Internet capacity, but capacity was carried via other routes.”

As the Cloudflare Radar graphs below show, there was no apparent impact to traffic volumes in either country at the time that the cables were damaged. The NetFlows graphs represent the number of bytes that Cloudflare sends to users and clients in response to their requests.

Internet quality

Internet quality metrics for both countries show changes in measured bandwidth and latency throughout the day on Sunday, but with no sudden anomalous shifts visible around the time of the cable cut. (The loss of connectivity associated with a cable cut potentially manifests itself as an increase in latency and concurrent decrease in bandwidth due to loss of capacity.) The latency graph for Sweden does show an increase in latency, but it began before the cable cut occurred, is similar to a pattern visible several hours earlier, and is matched by an increase in measured bandwidth, so it is unlikely to be related to the cable cut event.

Visibility in BGP events, announced IP address space unaffected

BGP announcements are a way for network providers to communicate routing information to other networks, and announcement activity observed on Telia Lietuva’s autonomous systems around the time of the cable cut may be related to the re-routing referenced in the article. No change in announced IP address space was visible for any of these autonomous systems, suggesting no loss of connectivity as the capacity was re-routed.

Telegeography’s submarinecablemap.com illustrates, at least in part, the resilience in connectivity enjoyed by these two countries. In addition to the damaged cable, it shows that Lithuania is connected to neighboring Latvia as well as to the Swedish mainland. Over 20 submarine cables land in Sweden, connecting it to multiple countries across Europe. In addition to the submarine resilience, network providers in both countries can take advantage of terrestrial fiber connections to neighboring countries, such as those illustrated in a European network map from Arelion (formerly Telia), which is only one of the large European backbone providers.

C-Lion1

Traffic volume indicators

Less than a day later, the C-Lion1 submarine cable, which connects Helsinki, Finland and Rostock Germany was reportedly damaged during the early morning hours of Monday, November 18. Cinia, the telecommunications company that owns the cable, said that the cable stopped working at about 02:00 UTC.

In this situation as well, as the Cloudflare Radar graphs below show, there was no apparent impact to traffic volumes in either country at the time that the cables were damaged. The Finland graphs, week-on-week, show fewer bytes transferred and fewer HTTP requests, but that difference is present before the cable cut at 02:00 UTC. However, the trend of the current line does not change after the cable cut, so the two events would appear unrelated.

Internet quality

By looking at volume-related metrics, alone, Internet connectivity would appear to be unaffected by the cable cut.

If, however, we change perspective and look at Internet quality, a brief yet interesting change is visible for Finland around the reported time of the cable damage, though it isn’t clear whether it is related in any way. Just after midnight, median measured bandwidth, previously consistent around 50 Mbps begins to grow, peaking just over 200 Mbps around 03:00 UTC. Around that same time, measured median latency also begins to drop, falling from around 30 ms to a low of 13 ms, also around 03:00 UTC. Median bandwidth returned to normal levels around 06:00 UTC, while latency took about two hours longer to return to normal levels. These observed improvements in bandwidth and latency could have been due to traffic being re-routed to along paths with better connectivity to measurement endpoints, but because the shifts began before the cable damage occurred, and recovered shortly thereafter, that is unlikely to be the root cause.

In Germany, a brief minor increase in median bandwidth peaked around 02:45 UTC, while no notable changes were observed in latency.

BGP business as usual

From a routing perspective, there was no notable BGP announcement activity observed for top autonomous systems in either Finland or Germany around 02:00 on November 18, and total announced IP address space aggregated at a country level also demonstrated no change.

Telegeography’s submarinecablemap.com shows that both Finland and Germany also have significant redundancy and resilience from a submarine cable perspective, with over 10 cables landing in Finland, and nearly 10 landing in Germany, including Atlantic Crossing-1 (AC-1), which connects to the United States over two distinct paths. Terrestrial fiber maps from Arelion and eunetworks (as just two examples) show multiple redundant fiber routes within both countries, as well as cross-border routes to other neighboring countries, enabling more resilient Internet connectivity.

Conclusion

As we have discussed in multiple prior blog posts (Jersey, 2016; AAE-1/SMW5, 2022; WACS/MainOne/SAT3/ACE, 2024; EASSy/Seacom, 2024), cable cuts often cause significant disruptions to Internet connectivity, in many cases because they represent a concentrated point of vulnerability, whether for an individual network provider, city/state, or country. These disruptions are often quite lengthy as well, due to the time needed to marshal repair resources, identify the location of the damage, etc. Although it is not always feasible due to financial or geographic constraints, building redundant and resilient network architecture, at multiple levels, is a best practice. This includes the sending traffic over multiple physical cables (both submarine and terrestrial), connecting to multiple peer and upstream network providers, and even avoiding single points of failure in core Internet resources like DNS servers.

The Cloudflare Radar team continually monitors the status of Internet connectivity in countries/regions around the world, and we share our observations on the Cloudflare Radar Outage Center, via social media, and in posts on blog.cloudflare.com. Follow us on social media at @CloudflareRadar (X), https://noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky), or contact us via email.

Waiting Room adds multi-host and path coverage, unlocking broader protection and multilingual setups

2023-10-04 Arielle Olache

Post Syndicated from Arielle Olache original http://blog.cloudflare.com/multihost-waiting-room/

Waiting Room adds multi-host and path coverage, unlocking broader protection and multilingual setups

Cloudflare Waiting Room protects sites from overwhelming traffic surges by placing excess visitors in a fully customizable virtual waiting room, admitting them dynamically as spots become available. Instead of throwing error pages or delivering poorly-performing site pages, Waiting Room empowers customers to take control of their end-user experience during unmanageable traffic surges.

A key decision customers make when setting up a waiting room is what pages it will protect. Before now, customers could select one hostname and path combination to determine what pages would be covered by a waiting room. Today, we are thrilled to announce that Waiting Room now supports coverage of multiple hostname and path combinations with a single waiting room, giving customers more flexibility and offering broader site coverage without interruptions to end-user flows. This new capability is available to all Enterprise customers with an Advanced Purchase of Waiting Room.

Waiting Room site placement

As part of the simple, no-coding-necessary process for deploying a waiting room, customers specify a hostname and path combination to indicate which pages are covered by a particular waiting room. When a site visitor makes a preliminary request to that hostname and path or any of its subpaths, they will be issued a waiting room cookie and either be admitted to the site or placed in a waiting room if the site is at capacity.

Last year, we added Waiting Room Bypass Rules, giving customers many options for creating exceptions to this hostname and path coverage. This unlocked capabilities such as User Agent Bypassing, geo-targeting, URL exclusions, and administrative IP bypassing. It also allowed for some added flexibility regarding which pages a waiting room would apply to on a customer’s site by adding the ability to exclude URLs, paths, and query strings. While this update allowed for greater specificity regarding which traffic should be gated by Waiting Room, it didn’t offer the broader coverage that many customers still needed to protect greater portions of their sites with a single waiting room.

Why customers needed broader Waiting Room coverage

Let’s review some simple yet impactful examples of why this broader coverage option was important for our customers. Imagine you have an online store, example.com, and you want to ensure that a single waiting room covers the entire customer journey — from the homepage, to product browsing, to checkout. Many sites would delineate these steps in the flow using paths: example.com/, example.com/shop/product1, example.com/checkout. Because Waiting Room assumes an implied wildcard at the end of a waiting room’s configured path, this use case was already satisfied for these sites. Thus, placing a waiting room at example.com/ would cover all the URLs associated with every step of this customer journey. In this setup, once a site visitor has passed the waiting room, they would not be re-queued at any step in their user flow because they are still using the same waiting room’s cookie, which indicates to Waiting Room that they are the same user between URLs.

However, many sites delineate different steps of this type of shopping flow using subdomains instead of or as well as paths. For example, many sites will have their checkout page on a different subdomain, such as checkout.example.com. Before now, if a customer had this site structure and wanted to protect their entire site with a single waiting room, they would have needed to deploy a waiting room at example.com/ and another at checkout.example.com/. This option was not ideal for many customers because a site visitor could be queued at two different parts of the same customer journey. This is because the waiting room at checkout.example.com/ would count the same visitor as a different user than the waiting room covering example.com/.

That said, there are cases where it is wise to separate waiting rooms on a single site. For example, a ticketing website could place a waiting room at its apex domain (example.com) and distinct waiting rooms with pre-queues on pages for specific events (example.com/popular_artist_tour). The waiting room set at example.com/ ensures that the main point of entry to the site does not get overwhelmed and crash when ticket sales for any one event open. The waiting room placed on the specific event page ensures that traffic for a single event can start queuing ahead of the event without impacting traffic going to other parts of the site.

Ultimately, whether a customer wants one or multiple waiting rooms to protect their site, we wanted to give our customers the flexibility to deploy Waiting Room however was best for their use cases and site structure. We’re thrilled to announce that Waiting Room now supports multi-hostname and path coverage with a single waiting room!

Getting started with multi-host and path coverage

Customers can now configure a waiting room on multiple hostname and path combinations — or routes — belonging to the same zone. To do so, navigate to Traffic > Waiting Room and select Create. The name of your domain will already be populated. To add additional routes to your waiting room configuration, select Add Hostname and Path. You can then enter another hostname and path to be covered by the same waiting room. Remember, there is an implied wildcard after each path. Therefore, creating a waiting room for each URL you want your waiting room to cover is unnecessary. Only create additional routes for URLs that wouldn’t be covered by the other hostname and path combinations you have already entered.

When deploying a waiting room that covers multiple hostname and path combinations, you must create a unique cookie name for this waiting room (more on that later!). Then, proceed with the same workflow as usual to deploy your waiting room.

Deploying a multi-language waiting room

A frequent customer request was the ability to cover a multi-language site with a single waiting room — offering different text per language while counting all site traffic toward the same waiting room limits. There are various ways a site can be structured to delineate between different language options, but the two most common are either by subdomain or path. For a site that uses path delineation, this could look like example.com/en and example.com/es for English and Spanish, respectively. For a site using subdomain delineation, this could look like en.example.com/ and es.example.com/. Before multihost Waiting Room coverage, the subdomain variation would not have been able to be covered by a single waiting room.

Waiting Room’s existing configuration options already supported the path variation. However, this was only true if the customer wanted to gate the entire site, which they could do by placing the waiting room at example.com/. Many e-commerce customers asked for the ability to gate different high-demand product pages selling the same product but with different language options. For instance, consider example.com/en/product_123 and example.com/es/product_123, where the customer wants the same waiting room and traffic limits to cover both URLs. Before now, this would not have been possible without some complex bypass rule logic.

Now, customers can deploy a waiting room that accommodates either the subdomain or path approach to structuring a multi-language site. The only remaining step is setting up your waiting room to serve different languages when a user is queued in a waiting room. This can be achieved by constructing a template that reads the URL to determine the locale and define the appropriate translations for each of the locales within the template.

Here is an example of a template that determines the locale from the URL path, and renders the translated text:

<!DOCTYPE html>
<html>
  <head>
    <title>Waiting Room powered by Cloudflare</title>
  </head>
  <body>
    <section>
      <h1 id="inline-msg">
        You are now in line.
      </h1>
      <h1 id="patience-msg">
        Thank you for your patience.
      </h1>
    </section>
    <h2 id="waitTime"></h2>

    <script>
      var locale = location.pathname.split("/")[1] || "en";
      var translations = {
        "en": {
          "waittime_60_less": "Your estimated wait time is {{waitTime}} minute.",
          "waittime_60_greater": "Your estimated wait time is {{waitTimeHours}} hours and {{waitTimeHourMinutes}} minutes.",
          "inline-msg": "You are now in line.",
          "patience-msg": "Thank you for your patience.",
        },
        "es": {
          "waittime_60_less": "El tiempo de espera estimado es {{waitTime}} minuto.",
          "waittime_60_greater": "El tiempo de espera estimado es {{waitTimeHours}} de horas y {{waitTimeHourMinutes}} minutos.",
          "inline-msg": "Ahora se encuentra en la fila de espera previa.",
          "patience-msg": "Gracias por su paciencia.",
        }
      };

      {{#waitTimeKnown}}
      var minutes = parseInt( {{waitTime}} , 10);
      var time = document.getElementById('waitTime');

      if ( minutes < 61) {
        time.innerText = translations[locale]["waittime_60_less"]
      } else {
        time.innerText = translations[locale]["waittime_60_greater"]
      }
      {{/waitTimeKnown}}

      // translate template text for each of the elements with “id” suffixed with “msg”
      for (const msg of ["inline-msg", "patience-msg"]) {
        const el = document.getElementById(msg);
        if (el == null) continue;
        el.innerText = translations[locale][msg];
      }
    </script>
  </body>
</html>

Here’s how the template works: given a site distinguishes different locales with various paths such as /en/product_123 or /es/product_123 in the <script /> body after the rendering the page, the locale is extracted from the location.pathname via locale = location.pathname.split(“/”)[1]. If there isn’t a locale specified within the translations object we default it to “en”. For example, if a user visits example.com/product_123, then by default render the English text template.

Similarly, in order to support a multi-language waiting room for sites that delineate site structure via subdomain, it would only require you to update how you extract the locale from the URL. Instead of looking at the pathname we look at the hostname property of the window.location object like locale = location.hostname.split(“.”)[0], given, we have site structure as following: en.example.com, es.example.com.

The above code is a pared down example of starter templates we have added to our developer documentation, which we have included to make it easier for you to get started with a multi-language waiting room. You can download these templates and edit them to have the look and feel aligned with your site and with the language options your site offers.

These templates can further be expanded to include other languages by adding translations to the `translations` object for each of the locales. Thus, a single template is able to serve multiple languages depending on whatever the locale the site is being served as either via subdomain or path.

How we handle cookies with a multihostname and path waiting room

The waiting room assigns a __cfwaitingroom cookie to each user to manage the state of the user that determines the position of the user in line among other properties needed to make the queueing decisions for the user. Traditionally, for a single hostname and path configuration it was trivial to just set the cookie as __cfwaitingroom=[cookie-value]; Domain=example.com; Path=/es/product_123. This ensured that when the page refreshed it would send us the same Waiting Room cookie for us to examine and update. However, this became non-trivial when we wanted to share the same cookie across different subdomain or path combinations, for example, on example.com/en/product_123.

Approaches to setting multiple cookies

There were two approaches that we brainstormed and evaluated to allow the sharing of the cookie for the same waiting room.

The first approach we considered was to issue multiple Set-Cookie headers in the HTTP response. For example, in the multi-language example above, we could issue the following in the response header:

Set-Cookie: __cfwaitingroom=[cookie-value]…Domain=example.com; Path=/en/product_123 …
Set-Cookie: __cfwaitingroom=[cookie-value]...Domain=example.com; Path=/es/product_123 …

This approach would allow us to handle the multiple hostnames and paths for the same waiting room. However, it does not present itself as a very scalable solution. Per RFC6265, there isn’t a strict limit defined in the specification, browsers have their own implementation-specific limits. For example, Chrome allows the response header to grow up to 256K bytes before throwing the transaction with ERR_RESPONSE_HEADERS_TOO_BIG. Additionally, in this case, the response header would grow proportionally to the number of unique hostname and path combinations, and we would be redundantly repeating the same cookie value N (where N is the number of additional routes) number of times. While this approach would have allowed us to effectively handle a bounded list of hostname and path combinations that need to be set, it was not ideal for cases where we can expect more than a handful routes for a particular site.

The second approach that we considered and chose to move forward with was to set the cookie on the apex domain (or the most common subdomain). In other words, we would figure out the most common subdomain from the list of routes and use that as the domain. Similarly, for the paths, this would entail determining the least common prefix from the list of paths and use that as the value to be set on the path attribute. For example, consider a waiting room with the following two routes configured, a.example.com/shoes/product_123 and b.example.com/shoes/product_456.

To determine the domain that would be set for the cookie, we can see that example.com is the most common subdomain among the list of domains. Applying the most common subdomain algorithm, we would get example.com as the domain. Applying the least common prefix algorithm on the set of paths, /shoes/product_123 and /shoes/product_456, we would get /shoes as the path. Thus, the set-cookie header would be the following:

Set-Cookie: … __cfwaitingroom=[cookie-value]; Domain=example.com; Path=/shoes …

Now, when a visitor accesses any of the pages covered by this waiting room, we can guarantee Waiting Room receives the right cookie and there will only be Set-Cookie included in the response header.

However, we are still not there yet. Although we are able to identify the common subdomain (or apex domain) and common path prefix, there would still be a problem if routes from one waiting room would overlap with another waiting room. For example, say we configure two waiting rooms for the same site where we are selling our special shoes:

Waiting Room A
a.example.com/shoes/product_123
b.example.com/shoes/product_456

Waiting Room B
c.example.com/shoes/product_789
d.example.com/shoes/product_012

If we apply the same algorithm as described above, we would get the same domain and path properties set for the two cookies. For Waiting Room A, the domain would be example.com and the path would be /shoes. Similarly, for Waiting Room B, the most common subdomain would also be example.com and least common path prefix would be /shoes. This would effectively prevent us from distinguishing the two cookies and route the visitor to the right waiting room. In order to solve the issue of overlapping cookie names, we introduced the requirement of setting a custom suffix that is unique to the customer’s zone. This is why it is required to provide a custom cookie suffix when configuring a waiting room that covers multiple hostnames or paths. Therefore, if Waiting Room A was assigned cookie suffix “a” and Waiting Room B was assigned cookie suffix “b”, the two Set-Cookie definitions would look like the following:

Waiting Room A

Set-Cookie: __cfwaitingroom_a=[cookie-value]; Domain=example.com; Path=/shoes

Waiting Room B

Set-Cookie: __cfwaitingroom_b=[cookie-value]; Domain=example.com; Path=/shoes

When a visitor makes a request to any pages where the two different waiting rooms are configured, Waiting Room can distinguish and select which cookie corresponds to the given request and use this to determine which waiting room’s settings apply to that user depending on where they are on the site.

With the addition of multihost and multipath Waiting Room coverage, we’re thrilled to offer more flexible options for incorporating Waiting Room into your site, giving your visitors the best experience possible while protecting your site during times of high traffic. Make sure your site is always protected from traffic surges. Try out Waiting Room today or reach out to us to learn more about the Advanced Waiting Room!

How Cloudflare’s systems dynamically route traffic across the globe

2023-09-25 David Tuber

Post Syndicated from David Tuber original http://blog.cloudflare.com/meet-traffic-manager/

How Cloudflare’s systems dynamically route traffic across the globe

Picture this: you’re at an airport, and you’re going through an airport security checkpoint. There are a bunch of agents who are scanning your boarding pass and your passport and sending you through to your gate. All of a sudden, some of the agents go on break. Maybe there’s a leak in the ceiling above the checkpoint. Or perhaps a bunch of flights are leaving at 6pm, and a number of passengers turn up at once. Either way, this imbalance between localized supply and demand can cause huge lines and unhappy travelers — who just want to get through the line to get on their flight. How do airports handle this?

Some airports may not do anything and just let you suffer in a longer line. Some airports may offer fast-lanes through the checkpoints for a fee. But most airports will tell you to go to another security checkpoint a little farther away to ensure that you can get through to your gate as fast as possible. They may even have signs up telling you how long each line is, so you can make an easier decision when trying to get through.

At Cloudflare, we have the same problem. We are located in 300 cities around the world that are built to receive end-user traffic for all of our product suites. And in an ideal world, we always have enough computers and bandwidth to handle everyone at their closest possible location. But the world is not always ideal; sometimes we take a data center offline for maintenance, or a connection to a data center goes down, or some equipment fails, and so on. When that happens, we may not have enough attendants to serve every person going through security in every location. It’s not because we haven’t built enough kiosks, but something has happened in our data center that prevents us from serving everyone.

So, we built Traffic Manager: a tool that balances supply and demand across our entire global network. This blog is about Traffic Manager: how it came to be, how we built it, and what it does now.

The world before Traffic Manager

The job now done by Traffic Manager used to be a manual process carried out by network engineers: our network would operate as normal until something happened that caused user traffic to be impacted at a particular data center.

When such events happened, user requests would start to fail with 499 or 500 errors because there weren’t enough machines to handle the request load of our users. This would trigger a page to our network engineers, who would then remove some Anycast routes for that data center. The end result: by no longer advertising those prefixes in the impacted data center, user traffic would divert to a different data center. This is how Anycast fundamentally works: user traffic is drawn to the closest data center advertising the prefix the user is trying to connect to, as determined by Border Gateway Protocol. For a primer on what Anycast is, check out this reference article.

Depending on how bad the problem was, engineers would remove some or even all the routes in a data center. When the data center was again able to absorb all the traffic, the engineers would put the routes back and the traffic would return naturally to the data center.

As you might guess, this was a challenging task for our network engineers to do every single time any piece of hardware on our network had an issue. It didn’t scale.

Never send a human to do a machine’s job

But doing it manually wasn’t just a burden on our Network Operations team. It also resulted in a sub-par experience for our customers; our engineers would need to take time to diagnose and re-route traffic. To solve both these problems, we wanted to build a service that would immediately and automatically detect if users were unable to reach a Cloudflare data center, and withdraw routes from the data center until users were no longer seeing issues. Once the service received notifications that the impacted data center could absorb the traffic, it could put the routes back and reconnect that data center. This service is called Traffic Manager, because its job (as you might guess) is to manage traffic coming into the Cloudflare network.

Accounting for second order consequences

When a network engineer removes a route from a router, they can make the best guess at where the user requests will move to, and try to ensure that the failover data center has enough resources to handle the requests — if it doesn’t, they can adjust the routes there accordingly prior to removing the route in the initial data center. To be able to automate this process, we needed to move from a world of intuition to a world of data — accurately predicting where traffic would go when a route was removed, and feeding this information to Traffic Manager, so it could ensure it doesn’t make the situation worse.

Meet Traffic Predictor

Although we can adjust which data centers advertise a route, we are unable to influence what proportion of traffic each data center receives. Each time we add a new data center, or a new peering session, the distribution of traffic changes, and as we are in over 300 cities and 12,500 peering sessions, it has become quite difficult for a human to keep track of, or predict the way traffic will move around our network. Traffic manager needed a buddy: Traffic Predictor.

In order to do its job, Traffic Predictor carries out an ongoing series of real world tests to see where traffic actually moves. Traffic Predictor relies on a testing system that simulates removing a data center from service and measuring where traffic would go if that data center wasn’t serving traffic. To help understand how this system works, let’s simulate the removal of a subset of a data center in Christchurch, New Zealand:

First, Traffic Predictor gets a list of all the IP addresses that normally connect to Christchurch. Traffic Predictor will send a ping request to hundreds of thousands of IPs that have recently made a request there.
Traffic Predictor records if the IP responds, and whether the response returns to Christchurch using a special Anycast IP range specifically configured for Traffic Predictor.
Once Traffic Predictor has a list of IPs that respond to Christchurch, it removes that route containing that special range from Christchurch, waits a few minutes for the Internet routing table to be updated, and runs the test again.
Instead of being routed to Christchurch, the responses are instead going to data centers around Christchurch. Traffic Predictor then uses the knowledge of responses for each data center, and records the results as the failover for Christchurch.

This allows us to simulate Christchurch going offline without actually taking Christchurch offline!

But Traffic Predictor doesn’t just do this for any one data center. To add additional layers of resiliency, Traffic Predictor even calculates a second layer of indirection: for each data center failure scenario, Traffic Predictor also calculates failure scenarios and creates policies for when surrounding data centers fail.

Using our example from before, when Traffic Predictor tests Christchurch, it will run a series of tests that remove several surrounding data centers from service including Christchurch to calculate different failure scenarios. This ensures that even if something catastrophic happens which impacts multiple data centers in a region, we still have the ability to serve user traffic. If you think this data model is complicated, you’re right: it takes several days to calculate all of these failure paths and policies.

Here’s what those failure paths and failover scenarios look like for all of our data centers around the world when they’re visualized:

This can be a bit complicated for humans to parse, so let’s dig into that above scenario for Christchurch, New Zealand to make this a bit more clear. When we take a look at failover paths specifically for Christchurch, we see they look like this:

In this scenario we predict that 99.8% of Christchurch’s traffic would shift to Auckland, which is able to absorb all Christchurch traffic in the event of a catastrophic outage.

Traffic Predictor allows us to not only see where traffic will move to if something should happen, but it allows us to preconfigure Traffic Manager policies to move requests out of failover data centers to prevent a thundering herd scenario: where sudden influx of requests can cause failures in a second data center if the first one has issues. With Traffic Predictor, Traffic Manager doesn’t just move traffic out of one data center when that one fails, but it also proactively moves traffic out of other data centers to ensure a seamless continuation of service.

From a sledgehammer to a scalpel

With Traffic Predictor, Traffic Manager can dynamically advertise and withdraw prefixes while ensuring that every datacenter can handle all the traffic. But withdrawing prefixes as a means of traffic management can be a bit heavy-handed at times. One of the reasons for this is that the only way we had to add or remove traffic to a data center was through advertising routes from our Internet-facing routers. Each one of our routes has thousands of IP addresses, so removing only one still represents a large portion of traffic.

Specifically, Internet applications will advertise prefixes to the Internet from a /24 subnet at an absolute minimum, but many will advertise prefixes larger than that. This is generally done to prevent things like route leaks or route hijacks: many providers will actually filter out routes that are more specific than a /24 (for more information on that, check out this blog here). If we assume that Cloudflare maps protected properties to IP addresses at a 1:1 ratio, then each /24 subnet would be able to service 256 customers, which is the number of IP addresses in a /24 subnet. If every IP address sent one request per second, we’d have to move 4 /24 subnets out of a data center if we needed to move 1,000 requests per second (RPS).

But in reality, Cloudflare maps a single IP address to hundreds of thousands of protected properties. So for Cloudflare, a /24 might take 3,000 requests per second, but if we needed to move 1,000 RPS out, we would have no choice but to move a single /24 out. And that’s just assuming we advertise at a /24 level. If we used /20s to advertise, the amount we can withdraw gets less granular: at a 1:1 website to IP address mapping, that’s 4,096 requests per second for each prefix, and even more if the website to IP address mapping is many to one.

While withdrawing prefix advertisements improved the customer experience for those users who would have seen a 499 or 500 error — there may have been a significant portion of users who wouldn’t have been impacted by an issue who still were moved away from the data center they should have gone to, probably slowing them down, even if only a little bit. This concept of moving more traffic out than is necessary is called “stranding capacity”: the data center is theoretically able to service more users in a region but cannot because of how Traffic Manager was built.

We wanted to improve Traffic Manager so that it only moved the absolute minimum of users out of a data center that was seeing a problem and not strand any more capacity. To do so, we needed to shift percentages of prefixes, so we could be extra fine-grained and only move the things that absolutely need to be moved. To solve this, we built an extension of our Layer 4 load balancer Unimog, which we call Plurimog.

A quick refresher on Unimog and layer 4 load balancing: every single one of our machines contains a service that determines whether that machine can take a user request. If the machine can take a user request then it sends the request to our HTTP stack which processes the request before returning it to the user. If the machine can’t take the request, the machine sends the request to another machine in the data center that can. The machines can do this because they are constantly talking to each other to understand whether they can serve requests for users.

Plurimog does the same thing, but instead of talking between machines, Plurimog talks in between data centers and points of presence. If a request goes into Philadelphia and Philadelphia is unable to take the request, Plurimog will forward to another data center that can take the request, like Ashburn, where the request is decrypted and processed. Because Plurimog operates at layer 4, it can send individual TCP or UDP requests to other places which allows it to be very fine-grained: it can send percentages of traffic to other data centers very easily, meaning that we only need to send away enough traffic to ensure that everyone can be served as fast as possible. Check out how that works in our Frankfurt data center, as we are able to shift progressively more and more traffic away to handle issues in our data centers. This chart shows the number of actions taken on free traffic that cause it to be sent out of Frankfurt over time:

But even within a data center, we can route traffic around to prevent traffic from leaving the datacenter at all. Our large data centers, called Multi-Colo Points of Presence (MCPs) contain logical sections of compute within a data center that are distinct from one another. These MCP data centers are enabled with another version of Unimog called Duomog, which allows for traffic to be shifted between logical sections of compute automatically. This makes MCP data centers fault-tolerant without sacrificing performance for our customers, and allows Traffic Manager to work within a data center as well as between data centers.

When evaluating portions of requests to move, Traffic Manager does the following:

Traffic Manager identifies the proportion of requests that need to be removed from a data center or subsection of a data center so that all requests can be served.
Traffic Manager then calculates the aggregated space metrics for each target to see how many requests each failover data center can take.
Traffic Manager then identifies how much traffic in each plan we need to move, and moves either a proportion of the plan, or all of the plan through Plurimog/Duomog, until we've moved enough traffic. We move Free customers first, and if there are no more Free customers in a data center, we'll move Pro, and then Business customers if needed.

For example, let’s look at Ashburn, Virginia: one of our MCPs. Ashburn has nine different subsections of capacity that can each take traffic. On 8/28, one of those subsections, IAD02, had an issue that reduced the amount of traffic it could handle.

During this time period, Duomog sent more traffic from IAD02 to other subsections within Ashburn, ensuring that Ashburn was always online, and that performance was not impacted during this issue. Then, once IAD02 was able to take traffic again, Duomog shifted traffic back automatically. You can see these actions visualized in the time series graph below, which tracks the percentage of traffic moved over time between subsections of capacity within IAD02 (shown in green):

How does Traffic Manager know how much to move?

Although we used requests per second in the example above, using requests per second as a metric isn’t accurate enough when actually moving traffic. The reason for this is that different customers have different resource costs to our service; a website served mainly from cache with the WAF deactivated is much cheaper CPU wise than a site with all WAF rules enabled and caching disabled. So we record the time that each request takes in the CPU. We can then aggregate the CPU time across each plan to find the CPU time usage per plan. We record the CPU time in ms, and take a per second value, resulting in a unit of milliseconds per second.

CPU time is an important metric because of the impact it can have on latency and customer performance. As an example, consider the time it takes for an eyeball request to make it entirely through the Cloudflare front line servers: we call this the cfcheck latency. If this number goes too high, then our customers will start to notice, and they will have a bad experience. When cfcheck latency gets high, it’s usually because CPU utilization is high. The graph below shows 95th percentile cfcheck latency plotted against CPU utilization across all the machines in the same data center, and you can see the strong correlation:

So having Traffic Manager look at CPU time in a data center is a very good way to ensure that we’re giving customers the best experience and not causing problems.

After getting the CPU time per plan, we need to figure out how much of that CPU time to move to other data centers. To do this, we aggregate the CPU utilization across all servers to give a single CPU utilization across the data center. If a proportion of servers in the data center fail, due to network device failure, power failure, etc., then the requests that were hitting those servers are automatically routed elsewhere within the data center by Duomog. As the number of servers decrease, the overall CPU utilization of the data center increases. Traffic Manager has three thresholds for each data center; the maximum threshold, the target threshold, and the acceptable threshold:

Maximum: the CPU level at which performance starts to degrade, where Traffic Manager will take action
Target: the level to which Traffic Manager will try to reduce the CPU utilization to restore optimal service to users
Acceptable: the level below which a data center can receive requests forwarded from another data center, or revert active moves

When a data center goes above the maximum threshold, Traffic Manager takes the ratio of total CPU time across all plans to current CPU utilization, then applies that to the target CPU utilization to find the target CPU time. Doing it this way means we can compare a data center with 100 servers to a data center with 10 servers, without having to worry about the number of servers in each data center. This assumes that load increases linearly, which is close enough to true for the assumption to be valid for our purposes.

Target ratio equals current ratio:

Therefore:

Subtracting the target CPU time from the current CPU time gives us the CPU time to move:

For example, if the current CPU utilization was at 90% across the data center, the target was 85%, and the CPU time across all plans was 18,000, we would have:

This would mean Traffic Manager would need to move 1,000 CPU time:

Now we know the total CPU time needed to move, we can go through the plans, until the required time to move has been met.

What is the maximum threshold?

A frequent problem that we faced was determining at which point Traffic Manager should start taking action in a data center – what metric should it watch, and what is an acceptable level?

As said before, different services have different requirements in terms of CPU utilization, and there are many cases of data centers that have very different utilization patterns.

To solve this problem, we turned to machine learning. We created a service that will automatically adjust the maximum thresholds for each data center according to customer-facing indicators. For our main service-level indicator (SLI), we decided to use the cfcheck latency metric we described earlier.

But we also need to define a service-level objective (SLO) in order for our machine learning application to be able to adjust the threshold. We set the SLO for 20ms. Comparing our SLO to our SLI, our 95th percentile cfcheck latency should never go above 20ms and if it does, we need to do something. The below graph shows 95th percentile cfcheck latency over time, and customers start to get unhappy when cfcheck latency goes into the red zone:

If customers have a bad experience when CPU gets too high, then the goal of Traffic Manager’s maximum thresholds are to ensure that customer performance isn’t impacted and to start redirecting traffic away before performance starts to degrade. At a scheduled interval the Traffic Manager service will fetch a number of metrics for each data center and apply a series of machine learning algorithms. After cleaning the data for outliers we apply a simple quadratic curve fit, and we are currently testing a linear regression algorithm.

After fitting the models we can use them to predict the CPU usage when the SLI is equal to our SLO, and then use it as our maximum threshold. If we plot the cpu values against the SLI we can see clearly why these methods work so well for our data centers, as you can see for Barcelona in the graphs below, which are plotted against curve fit and linear regression fit respectively.

In these charts the vertical line is the SLO, and the intersection of this line with the fitted model represents the value that will be used as the maximum threshold. This model has proved to be very accurate, and we are able to significantly reduce the SLO breaches. Let’s take a look at when we started deploying this service in Lisbon:

Before the change, cfcheck latency was constantly spiking, but Traffic Manager wasn’t taking actions because the maximum threshold was static. But after July 29, we see that cfcheck latency has never hit the SLO because we are constantly measuring to make sure that customers are never impacted by CPU increases.

Where to send the traffic?

So now that we have a maximum threshold, we need to find the third CPU utilization threshold which isn’t used when calculating how much traffic to move – the acceptable threshold. When a data center is below this threshold, it has unused capacity which, as long as it isn’t forwarding traffic itself, is made available for other data centers to use when required. To work out how much each data center is able to receive, we use the same methodology as above, substituting target for acceptable:

Therefore:

Subtracting the current CPU time from the acceptable CPU time gives us the amount of CPU time that a data center could accept:

To find where to send traffic, Traffic Manager will find the available CPU time in all data centers, then it will order them by latency from the data center needing to move traffic. It moves through each of the data centers, using all available capacity based on the maximum thresholds before moving onto the next. When finding which plans to move, we move from the lowest priority plan to highest, but when finding where to send them, we move in the opposite direction.

To make this clearer let's use an example:

We need to move 1,000 CPU time from data center A, and we have the following usage per plan: Free: 500ms/s, Pro: 400ms/s, Business: 200ms/s, Enterprise: 1000ms/s.

We would move 100% of Free (500ms/s), 100% of Pro (400ms/s), 50% of Business (100ms/s), and 0% of Enterprise.

Nearby data centers have the following available CPU time: B: 300ms/s, C: 300ms/s, D: 1,000ms/s.

With latencies: A-B: 100ms, A-C: 110ms, A-D: 120ms.

Starting with the lowest latency and highest priority plan that requires action, we would be able to move all the Business CPU time to data center B and half of Pro. Next we would move onto data center C, and be able to move the rest of Pro, and 20% of Free. The rest of Free could then be forwarded to data center D. Resulting in the following action: Business: 50% → B, Pro: 50% → B, 50% → C, Free: 20% → C, 80% → D.

Reverting actions

In the same way that Traffic Manager is constantly looking to keep data centers from going above the threshold, it is also looking to bring any forwarded traffic back into a data center that is actively forwarding traffic.

Above we saw how Traffic Manager works out how much traffic a data center is able to receive from another data center — it calls this the available CPU time. When there is an active move we use this available CPU time to bring back traffic to the data center — we always prioritize reverting an active move over accepting traffic from another data center.

When you put this all together, you get a system that is constantly measuring system and customer health metrics for every data center and spreading traffic around to make sure that each request can be served given the current state of our network. When we put all of these moves between data centers on a map, it looks something like this, a map of all Traffic Manager moves for a period of one hour. This map doesn’t show our full data center deployment, but it does show the data centers that are sending or receiving moved traffic during this period:

Data centers in red or yellow are under load and shifting traffic to other data centers until they become green, which means that all metrics are showing as healthy. The size of the circles represent how many requests are shifted from or to those data centers. Where the traffic is going is denoted by where the lines are moving. This is difficult to see at a world scale, so let’s zoom into the United States to see this in action for the same time period:

Here you can see Toronto, Detroit, New York, and Kansas City are unable to serve some requests due to hardware issues, so they will send those requests to Dallas, Chicago, and Ashburn until equilibrium is restored for users and data centers. Once data centers like Detroit are able to service all the requests they are receiving without needing to send traffic away, Detroit will gradually stop forwarding requests to Chicago until any issues in the data center are completely resolved, at which point it will no longer be forwarding anything. Throughout all of this, end users are online and are not impacted by any physical issues that may be happening in Detroit or any of the other locations sending traffic.

Happy network, happy products

Because Traffic Manager is plugged into the user experience, it is a fundamental component of the Cloudflare network: it keeps our products online and ensures that they’re as fast and reliable as they can be. It’s our real time load balancer, helping to keep our products fast by only shifting necessary traffic away from data centers that are having issues. Because less traffic gets moved, our products and services stay fast.

But Traffic Manager can also help keep our products online and reliable because they allow our products to predict where reliability issues may occur and preemptively move the products elsewhere. For example, Browser Isolation directly works with Traffic Manager to help ensure the uptime of the product. When you connect to a Cloudflare data center to create a hosted browser instance, Browser Isolation first asks Traffic Manager if the data center has enough capacity to run the instance locally, and if so, the instance is created right then and there. If there isn’t sufficient capacity available, Traffic Manager tells Browser Isolation which the closest data center with sufficient available capacity is, thereby helping Browser Isolation to provide the best possible experience for the user.

Happy network, happy users

At Cloudflare, we operate this huge network to service all of our different products and customer scenarios. We’ve built this network for resiliency: in addition to our MCP locations designed to reduce impact from a single failure, we are constantly shifting traffic around on our network in response to internal and external issues.

But that is our problem — not yours.

Similarly, when human beings had to fix those issues, it was customers and end users who would be impacted. To ensure that you’re always online, we’ve built a smart system that detects our hardware failures and preemptively balances traffic across our network to ensure it’s online and as fast as possible. This system works faster than any person — not only allowing our network engineers to sleep at night — but also providing a better, faster experience for all of our customers.

And finally: if these kinds of engineering challenges sound exciting to you, then please consider checking out the Traffic Engineering team's open position on Cloudflare’s Careers page!

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution

2023-09-08 Brian Batraski

Post Syndicated from Brian Batraski original http://blog.cloudflare.com/elevate-load-balancing-with-private-ips-and-cloudflare-tunnels-a-secure-path-to-efficient-traffic-distribution/

Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution

In the dynamic world of modern applications, efficient load balancing plays a pivotal role in delivering exceptional user experiences. Customers commonly leverage load balancing, so they can efficiently use their existing infrastructure resources in the best way possible. Though, load balancing is not a ‘one-size-fits-all, out of the box’ solution for everyone. As you go deeper into the details of your traffic shaping requirements and as your architecture becomes more complex, different flavors of load balancing are usually required to achieve these varying goals, such as steering between datacenters for public traffic, creating high availability for critical internal services with private IPs, applying steering between servers in a single datacenter, and more. We are extremely excited to announce a new addition to our Load Balancing solution, Local Traffic Management (LTM) with deep integrations with Zero Trust!

A common problem businesses run into is that almost no providers can satisfy all these requirements, resulting in a growing list of vendors to manage disparate data sources to get a clear view of your traffic pipeline, and investment into incredibly expensive hardware that is complicated to set up and maintain. Not having a single source of truth to dwindle down ‘time to resolution’ and a single partner to work with in times when things are not operating within the ideal path can be the difference between a proactive, healthy growing business versus one that is reactive and constantly having to put out fires. The latter can result in extreme slowdown to developing amazing features/services, reduction in revenue, tarnishing of brand trust, decreases in adoption – the list goes on!

For eight years, we have provided top-tier global traffic load balancing (GTM) capabilities to thousands of customers across the globe. But why should the steering intelligence, failover, and reliability we guarantee stop at the front door of the selected datacenter and only operate with public traffic? We came to the conclusion that we should go even further. Today is the start of a long series of new features that allow traffic steering, failover, session persistence, SSL/TLS offloading and much more to take place between servers after datacenter selection has occurred! Instead of relying only on the relative weight to determine which server traffic should be sent to, you can now bring the same intelligent steering policies, such as least outstanding requests steering or hash steering, to any of your many data centers. This also means you have a single partner for all of your load balancing initiatives and a single pane of glass to inform business decisions! Cloudflare is thrilled to introduce the powerful combination of private IP support for Load Balancing with Cloudflare Tunnels and Local Traffic Management, offering customers a solution that blends unparalleled efficiency, security, flexibility, and privacy.

What is a load balancer?

Load balancing — functionality that’s been around for the last 30 years to help businesses leverage their existing infrastructure resources. Load balancing works by proactively steering traffic away from unhealthy origin servers and — for more advanced solutions — intelligently distributing traffic load based on different steering algorithms. This process ensures that errors aren’t served to end users and empowers businesses to tightly couple overall business objectives to their traffic behavior. Cloudflare Load Balancing has made it simpler and easier to securely and reliably manage your traffic across multiple data centers around the world. With Cloudflare Load Balancing, your traffic will be directed reliably regardless of the scale of traffic or where it originates with customizable steering, affinity and failover. This clearly has an advantage over a physical load balancer since it can be configured easily and traffic doesn’t have to reach one of your data centers to be routed to another location, introducing single points of failure and significant latency. When compared with other global traffic management load balancers, Cloudflare’s Load Balancing offering is easier to set up, simpler to understand, and is fully integrated with the Cloudflare platform as one single product for all load balancing needs.

What are Cloudflare Tunnels?

In 2018, Cloudflare introduced Cloudflare Tunnels, a private, secure connection between your data center and Cloudflare. Traditionally, from the moment an Internet property is deployed, developers spend an exhaustive amount of time and energy locking it down through access control lists, rotating IP addresses, or more complex solutions like GRE tunnels. We built Tunnel to help alleviate that burden. With Tunnels, users can create a private link from their origin server directly to Cloudflare without exposing your services directly to the public internet or allowing incoming connections in your data center’s firewall. Instead, this private connection is established by running a lightweight daemon, cloudflared, in your data center, which creates a secure, outbound-only connection. This means that only traffic that you’ve configured to pass through Cloudflare can reach your private origin.

Unleashing the potential of Cloudflare Load Balancing with Cloudflare Tunnels

Combining Cloudflare Tunnels with Cloudflare Load Balancing allows you to remove your physical load balancers from your data center and have your Cloudflare load balancer reach out to your servers directly via their private IP addresses with health checks, steering, and all other Load Balancing features currently available. Instead of configuring your on-premise load balancer to expose each service and then updating your Cloudflare load balancer, you can configure it all in one place. This means that from the end-user to the server handling the request, all your configuration can be done in a single place – the Cloudflare dashboard. On top of this, you can say goodbye to the multi hundred thousand dollar price tag to hardware appliances, the incredible management overhead and investing in a solution that has a time limit for its delivered value.

Load Balancing serves as the backbone for online services, ensuring seamless traffic distribution across servers or data centers. Traditional load balancing techniques often require exposing services on a data center’s public IP addresses, forcing organizations to create complex configurations vulnerable to security risks and potential data exposure. By harnessing the power of private IP support for Load Balancing in conjunction with Cloudflare Tunnels, Cloudflare is revolutionizing the way businesses protect and optimize their applications. With clear steps to install the cloudflared agent to connect your private network to Cloudflare’s network via Cloudflare Tunnels, directly and securely routing traffic into your data centers becomes easier than ever before!

Publicly exposing services in private data centers is complicated

Load balancing within a private data center can be expensive and difficult to manage. The idea of keeping security first while ensuring ease of use and flexibility for your internal workforce is a tricky balance to strike. It’s not only the ‘how’ of securely exposing internal services, but how to best balance traffic between servers at a single location within your private network!

In a private data center, even a very simple website can be fairly complex in terms of networking and configuration. Let’s walk through a simple example of a customer device connecting to a website. A customer device performs a DNS lookup for the business’s website and receives an IP address corresponding to a customer data center. The customer then makes an HTTPS request to that IP address, passing the original hostname via Server Name Indication (SNI). That load balancer forwards that request to the corresponding origin server and returns the response to the customer device.

This example doesn’t have any advanced functionality and the stack is already difficult to configure:

Expose the service or server on a private IP.
Configure your data center’s networking to expose the LB on a public IP or IP range.
Configure your load balancer to forward requests for that hostname and/or public IP to your server’s private IP.
Configure a DNS record for your domain to point to your load balancer’s public IP.

In large enterprises, each of these configuration changes likely requires approval from several stakeholders and modified through different repositories, websites and/or private web interfaces. Load balancer and networking configurations are often maintained as complex configuration files for Terraform, Chef, Puppet, Ansible or a similar infrastructure-as-code service. These configuration files can be syntax checked or tested but are rarely tested thoroughly prior to deployment. Each deployment environment is often unique enough that thorough testing is often not feasible given the time and hardware requirements needed to do so. This means that changes to these files can negatively affect other services within the data center. In addition, opening up an ingress to your data center widens the attack surface for varying security risks such as DDoS attacks or catastrophic data breaches. To make things worse, each vendor has a different interface or API for configuring their devices or services. For example, some registrars only have XML APIs while others have JSON REST APIs. Each device configuration may have different Terraform providers or Ansible playbooks. This results in complex configurations accumulating over time that are difficult to consolidate or standardize, inevitably resulting in technical debt.

Now let’s add additional origins. For each additional origin for our service, we’ll have to go set up and expose that origin and configure the physical load balancer to use our new origin. Now let’s add another data center. Now we need another solution to distribute across our data centers. This results in a separate global traffic management system and local traffic management system. These solutions have in the past come from different vendors and will have to be configured in different ways even though they should serve the same purpose: load balancing. This makes managing your web traffic unnecessarily difficult. Why should you have to configure your origins in two different load balancers? Why can’t you manage all the traffic for all the origins for a service in the same place?

Simpler and better: Load Balancing with Tunnels

With Cloudflare Load Balancing and Cloudflare Tunnel, you can manage all your public and private origins in one place: the Cloudflare dashboard. Cloudflare load balancers can be easily configured using the Cloudflare dashboard or the Cloudflare API. There’s no need to SSH or open a remote desktop to modify load balancer configurations for your public or private servers. All configurations can be done through the dashboard UI or Cloudflare API, with full parity between the two.

With Cloudflare Tunnel set up and running in your data center, everything is ready to connect your origin server to Cloudflare network and load balancers. You do not need to configure any ingress to your data center since Cloudflare Tunnel operates only over outbound connections and can securely reach out to privately addressed services inside your data center. To expose your service to Cloudflare, you just set up your private IP range to be routed over that tunnel. Then, you can create a Cloudflare load balancer and input the corresponding private IP address and virtual network ID into your origin pool. After that, Cloudflare manages the DNS and load balancing across your private servers. Now your origin is receiving traffic exclusively via Cloudflare Tunnel and your physical load balancer is no longer needed!

This groundbreaking integration enables organizations to deploy load balancers while keeping their applications securely shielded from the public Internet. The customer’s traffic passes through Cloudflare’s data centers, allowing customers to continue to take full advantage of Cloudflare’s security and performance services. Also, by leveraging Cloudflare Tunnels, traffic between Cloudflare and customer origins remains isolated within trusted networks, bolstering privacy, security, and peace of mind.

The advantages of Private IP support with Cloudflare Tunnels

Combining Global and Local Traffic Management: All the features and ease of use that were part of Cloudflare Load Balancing for Global Traffic Management are also available with Local Traffic Management. You can configure your public and private origins in one dashboard as opposed to several services and vendors. Now, all your private origins can benefit from the features that Cloudflare Load Balancing is known for: instant failover, customizable steering between data centers, ease of use, custom rules and configuration updates in a matter of seconds. They will also benefit from our newer features including least connection steering, least outstanding request steering, and session affinity by header. This is just a small subset of the expansive feature set for Load Balancing. See our dev docs for more features and details on the offering.

Enhanced Security: By combining private IP support with Cloudflare Tunnels, organizations can fortify their security posture and protect sensitive data. With private IP addresses and encrypted connections via Cloudflare Tunnel, the risk of unauthorized access and potential attacks is significantly reduced – traffic remains within trusted networks. You can also configure Cloudflare Access to add single sign-on support for your application and restrict your application to a subset of authorized users. In addition, you still benefit from Firewall rules, Rate Limiting rules, Bot Management, DDoS protection and all the other Cloudflare products available today allowing comprehensive security configurations.

Uncompromising Privacy: As data privacy continues to take center stage, businesses must ensure the confidentiality of user information. Cloudflare's private IP support with Cloudflare Tunnels enables organizations to segregate applications and keep sensitive data within their private network boundaries. Custom rules also allow you to direct traffic for specific devices to specific data centers. For example, you can use custom rules to direct traffic from Eastern and Western Europe to your European data centers, so you can easily keep those users’ data within Europe. This minimizes the exposure of data to external entities, preserving user privacy and complying with strict privacy regulations across different geographies.

Flexibility & Reliability: Scale and adaptability are some of the major foundations of a well-operating business. Implementing solutions that fit your business’ needs today is not enough. Customers must find solutions that meet their needs for the next three or more years. The blend of Load Balancing with Cloudflare Tunnels within our Zero Trust solution lends to the very definition of flexibility and reliability! Changes to load balancer configurations propagate around the world in a matter of seconds, making load balancers an effective way to respond to incidents. Also, instant failover, health monitoring, and steering policies all help to maintain high availability for your applications, so you can deliver the reliability that your users expect. This is all in addition to best in class Zero Trust capabilities that are deeply integrated such as, but not limited to Secure Web Gateway (SWG), remote browser isolation, network logs. Data loss prevention.

Streamlined Infrastructure: Organizations can consolidate their network architecture and establish secure connections across distributed environments. This unification reduces complexity, lowers operational overhead, and facilitates efficient resource allocation. Whether you need to apply a global traffic manager to intelligently direct traffic between datacenters within your private network, or steer between specific servers after datacenter selection has taken place, there is now a clear, single lens to manage your global and local traffic, regardless of whether the source or destination of the traffic is public or private. Complexity can be a large hurdle in achieving and maintaining fast, agile business units. Consolidating into a single provider, like Cloudflare, that provides security, reliability, and observability will not only save significant cost but allows your teams to move faster and focus on growing their business, enhancing critical services, and developing incredible features, rather than taping together infrastructure that may not work in a few years. Leave the heavy lifting to us, and let us empower you and your team to focus on creating amazing experiences for your employees and end-users.

The lack of agility, flexibility, and lean operations of hardware appliances for Local Traffic Management does not justify the hundreds of thousands of dollars spent on them, along with the huge overhead of managing CPU, memory, power, cooling, etc. Instead, we want to help businesses move this logic to the cloud by abstracting away the needless overhead and bringing more focus back to teams to do what they do best, building amazing experiences, and allowing Cloudflare to do what we do best, protecting, accelerating, and building heightened reliability. Stay tuned for more updates on Cloudflare's Local Traffic Manager and how it can reduce architecture complexity while bringing more insight, security, and control to your teams. In the meantime, check out our new whitepaper!

Looking to the future

Cloudflare's impactful solution, private IP support for Load Balancing with Cloudflare Tunnels as part of the Zero Trust solution, reaffirms our commitment to providing cutting-edge tools that prioritize security, privacy, and performance. By leveraging private IP addresses and secure tunnels, Cloudflare empowers businesses to fortify their network infrastructure while ensuring compliance with regulatory requirements. With enhanced security, uncompromising privacy, and streamlined infrastructure, load balancing becomes a powerful driver of efficient and secure public or private services.

As a business grows and its systems scale up, they'll need the features that Cloudflare Load Balancing is known for: health monitoring, steering, and failover. As availability requirements increase due to growing demands and standards from end-users, customers can add health checks, enabling automatic failover to healthy servers when an unhealthy server begins to fail. When the business begins to receive more traffic from around the world, they can create new pools for different regions and use dynamic steering to reduce latency between the user and the server. For intensive or long-running requests, such as complex datastore queries, customers can benefit from leveraging least outstanding requests steering to reduce the number of concurrent requests per server. Before, this could all be done with publicly addressable IPs, but it is now available for pools with public IPs, private servers, or combinations of the two. Private IP Load Balancing along with Local Traffic Management is live and ready to use today! Check out our dev docs for instructions on how to get started.

Stay tuned for our next addition to add new Load Balancing onramp support for Spectrum and WARP with Cloudflare Tunnels with private IPs for your Layer 4 traffic, allowing us to support TCP and UDP applications in your private data centers!

Cloudflare’s view of the Virgin Media outage in the UK

2023-04-04 David Belson

Post Syndicated from David Belson original https://blog.cloudflare.com/virgin-media-outage-april-4-2023/

Cloudflare’s view of the Virgin Media outage in the UK

Just after midnight (UTC) on April 4, subscribers to UK ISP Virgin Media (AS5089) began experiencing an Internet outage, with subscriber complaints multiplying rapidly on platforms including Twitter and Reddit.

Cloudflare Radar data shows Virgin Media traffic dropping to near-zero around 00:30 UTC, as seen in the figure below. Connectivity showed some signs of recovery around 02:30 UTC, but fell again an hour later. Further nominal recovery was seen around 04:45 UTC, before again experiencing another complete outage between around 05:45-06:45 UTC, after which traffic began to recover, reaching expected levels around 07:30 UTC.

After the initial set of early-morning disruptions, Virgin Media experienced another round of issues in the afternoon. Cloudflare observed instability in traffic from Virgin Media’s network (called an autonomous system in Internet jargon) AS5089 starting around 15:00 UTC, with a significant drop just before 16:00 UTC. However in this case, it did not appear to be a complete outage, with traffic recovering approximately a half hour later.

Cloudflare’s view of the Virgin Media outage in the UK

Virgin Media’s Twitter account acknowledged the early morning disruption several hours after it began, posting responses stating “We’re aware of an issue that is affecting broadband services for Virgin Media customers as well as our contact centres. Our teams are currently working to identify and fix the problem as quickly as possible and we apologise to those customers affected.” Further responses after service restoration noted “We’ve restored broadband services for customers but are closely monitoring the situation as our engineers continue to investigate. We apologise for any inconvenience caused.”

However, the second disruption was acknowledged on Virgin Media’s Twitter account much more rapidly, with a post at 16:25 UTC stating “Unfortunately we have seen a repeat of an earlier issue which is causing intermittent broadband connectivity problems for some Virgin Media customers. We apologise again to those impacted, our teams are continuing to work flat out to find the root cause of the problem and fix it.”

At the time of the outages, www.virginmedia.com, which includes the provider’s status page, was unavailable. As seen in the figure below, a DNS lookup for the hostname resulted in a SERVFAIL error, indicating that the lookup failed to return a response. This is because the authoritative nameservers for virginmedia.com are listed as ns{1-4}.virginmedia.net, and these nameservers are all hosted within Virgin Media’s network (AS5089) and thus are not accessible during the outage.

Although Virgin Media has not publicly released a root cause for the series of disruptions that its network has experienced, looking at BGP activity can be instructive.

BGP is a mechanism to exchange routing information between networks on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver each network packet to its final destination. Without BGP, the Internet routers wouldn’t know what to do, and the Internet wouldn’t exist.

The Internet is literally a network of networks, or for math fans, a graph, with each individual network a node in it, and the edges representing the interconnections. All of this is bound together by BGP, which allows one network (Virgin Media, for instance) to advertise its presence to other networks that form the Internet. When Virgin Media is not advertising its presence, other networks can’t find its network and it becomes effectively unavailable.

BGP announcements inform a router of changes made to the routing of a prefix (a group of IP addresses) or entirely withdraws the prefix, removing it from the routing table. The figure below shows aggregate BGP announcement activity from AS5089 with spikes that align with the decreases and increases seen in the traffic graph above, suggesting that the underlying cause may in fact be BGP-related, or related to problems with core network infrastructure.

We can drill down further to break out the observed activity between BGP announcements (dark blue) and withdrawals (light blue) seen in the figure below, with key activity coincident with the loss and return of traffic. An initial set of withdrawals are seen just after midnight, effectively removing Virgin Media from the Internet resulting in the initial outage.

A set of announcements occurred just before 03:00 UTC, aligning with the nominal increase in traffic noted above, but those were followed quickly by another set of withdrawals. A similar announcement/withdrawal exchange was observed at 05:00 and 05:30 UTC respectively, before a final set of announcements restored connectivity at 07:00 UTC.

Things remained relatively stable through the morning into the afternoon, before another set of withdrawals presaged the afternoon’s connectivity problems, with a spike of withdrawals at 15:00 UTC, followed by additional withdrawal/announcement exchanges over the next several hours.

Conclusion

Track ongoing traffic trends for Virgin Media on Cloudflare Radar, and follow us on Twitter and Mastodon for regular updates.

Who won Super Bowl LV? A look at Internet traffic during the game

2021-02-06 Marc Lamik

Post Syndicated from Marc Lamik original https://blog.cloudflare.com/who-won-super-bowl-lv/

Who won Super Bowl LV? A look at Internet traffic during the game

The obvious answer is the Tampa Bay Buccaneers but the less obvious answer comes from asking “which Super Bowl advertiser got the biggest Internet bump?”. This blog aims to answer that question.

Before, during, and after the game a crack team of three people who work on Cloudflare Radar looked at real time statistics for traffic to advertisers’ websites, social media in the US, US food delivery services, and websites covering (American) football. Luckily, one of us (Kari) is (a) American and (b) a fan of football. Unluckily, one of us (Kari) is a fan of the Kansas City Chiefs.

Cloudflare Radar uses a variety of sources to provide aggregate information about Internet traffic and attack trends. In this blog post we use DNS name resolution data to estimate traffic to websites. We can’t see who visited the websites mentioned below, or what anyone did on the websites, but DNS can give us an estimate of the interest generated by the commercials. This analysis only looked at the top-level names in each domain (so example.com and www.example.com and not any other subdomains).

The Big Picture

To get the ball rolling here’s a look at traffic to NFL team websites and sports websites. Traffic builds to a peak as the game begins just after 1830 local time. As the game progresses traffic to those websites drops off hitting a mid-game low at about 2015 before jumping up for 30 minutes during the halftime show.

The big peak at around 2145 appears to be at the same time as a streaker ran onto the field. A lesser peak comes soon after 2200 when the Buccaneers sealed their victory.

As well as reading about the game, fans were also using social media to get news and add their own commentary. Here’s a look at US social media use during the game.

Social media use dips a little just as the game is about to begin and then ramps up until the Buccaneers’ victory is final. And then, as people go to sleep, social media use falls away.

Taking a look at food delivery services it looks like folks orders ramped up about 90 minutes before the game started and people’s hunger was sated before the game ended.

The Internet Impact of Commercials

One question is “Does a Super Bowl commercial drive traffic to the company’s web site while the game is on?” Answer: yes. Here’s one that went vroom peaking at over 40x baseline:

Vehicles and related services played a big part. GM tried to take on Norway with a commercial about their electric vehicle batteries.

And the electric vehicle theme continued with Cadillac:

And Jeep’s commercial caused a similar spike:

Food and Drink

Anheuser-Busch had a “corporate” commercial this time (as opposed to a commercial for their individual brands). You can clearly see from this chart when that aired.

And Bud Light got a boost that seemed to keep people thinking about beer through the game.

Some brands see a spike when the commercial airs and then traffic reverts fairly quickly towards the baseline. Another brand that lingered on post-commercial is Mountain Dew. Prior to the commercial, traffic to the Mountain Dew website was steady, then came the airing of the commercial with a greater than 25x peak followed by interest throughout the game and into the night.

Pepsi sponsored the halftime show and that’s clearly visible in the charts as they get a boost through The Weeknd’s performance.

Moving on from drinks and reaching for the snacks we can see that Doritos were pretty popular all evening long with visible commercial induced spikes.

Brands that might not be so well known get a large traffic boost from their Super Bowl commercials. Here’s the impact on oat milk company Oatly:

And ever popular chain Jimmy John’s saw a large traffic jump when their commercial aired.

Keeping Clean

Cleaning products (household and personal) were also the order of the day. First up, we have Dr. Squatch advertising their soap products for men.

Microban24 makes hand sanitizer and got a jump from their commercial:

But perhaps the biggest surprise in this category is Tide, which not only jumped up but stayed up throughout the game. Maybe it was the sight of all the sportswear on the field that was going to need cleaning:

Highlights

The folks at WeatherTech showed their commercial more than once and hit over 20x baseline.

Rocket Mortgage got a lot of people thinking about mortgages well into the night:

Financial services firm Klarna got a big jump as the game was wrapping up.

Throughout the game Paramount+ was touted more than once. Another streaming service? Definitely looked interesting to many.

Skechers’ got people thinking about what they put on their feet throughout the first half of the game:

And lastly, for this look at just some of the Super Bowl LV commercials, Fiverr got the message out about about freelancing:

Which brings us to the all important question…

So, who won Super Bowl LV?

Taking “highest commercial induced peak” as the measure then it’s Dexcom. Dexcom makes wearable continuous glucose monitors for people with diabetes. They got a 100x boost.

A close runner up is Inspiration4, a privately funded trip into space where one seat is up for grabs via a lottery. Inspiration4 went from very little traffic to over 70x and continuous interest throughout.

Of course, this doesn’t tell the whole story. Inspiration4 had very little traffic prior to the game so the magnitude of the peak isn’t that surprising.

After a tough 2020 perhaps it’s not a surprise that a healthcare product and an inspirational project should “win” Super Bowl LV.

Of course, for brands that already get a lot of Internet traffic that spikes aren’t so high yet represent a great deal of traffic because the baseline is so much higher. Here’s online marketplace Mercari getting a 2x jump.

And businesses like Disney or Amazon have so much traffic that commercials might drive a small increase in overall traffic but it tends to get lost.

One More Thing: Tom Brady

One person who didn’t advertise during the game but nevertheless got a bump in traffic to their website was Tom Brady. Brady’s fitness and nutrition brand TB12 saw traffic grow as the game ended and continued interest into the night.

Want more?

Visit Cloudflare Radar for up to date Internet traffic and attack trends.

Uganda’s January 13, 2021 Internet Shut Down

2021-01-15 Celso Martinho

Post Syndicated from Celso Martinho original https://blog.cloudflare.com/uganda-january-13-2021-internet-shut-down/

Uganda's January 13, 2021 Internet Shut Down

Two days ago, through its communications regulator, Uganda’s government ordered the “Suspension Of The Operation Of Internet Gateways” the day before the country’s general election. This action was confirmed by several users and journalists who got access to the letter sent to Internet providers. In other words, the government effectively cut off Internet access from the population to the rest of the world.

Ahead of tomorrow’s election the Internet has been shutdown in Uganda (confirmed by a few friends in Kampala).
Letter from communications commission below: pic.twitter.com/tRpTIXTPcW

— Samira Sawlani (@samirasawlani) January 13, 2021

On Cloudflare Radar, we want to help anyone understand what happens on the Internet. We are continually monitoring our network and exposing insights, threats, and trends based on the aggregated data that we see.

Uganda’s unusual traffic patterns quickly popped up in our charts. Our 7-day change in Internet Traffic chart in Uganda shows a clear drop to near zero starting around 1900 local time, when the providers received the letter.

Uganda's January 13, 2021 Internet Shut Down

This is also obvious in the Application-level Attacks chart.

The traffic drop was also confirmed by the Uganda Internet eXchange point, a place where many providers exchange their data traffic, on their public statistics page.

We keep an eye on traffic levels and BGP routing to our edge network, and are able to see which networks carry traffic to and from Uganda and their relative traffic levels. The cutoff is clear in those statistics also. Each colored line is a different network inside Uganda (such as ISPs, mobile providers, etc.)

We will continue to keep an eye on traffic levels from Uganda and update the blog when we see significant changes. At the time of writing, Internet access appears to be still cut off.

Cloudflare Radar’s 2020 Year In Review

2021-01-12 John Graham-Cumming

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-radar-2020-year-in-review/

Cloudflare Radar's 2020 Year In Review

Throughout 2020, we tracked changing Internet trends as the SARS-Cov-2 pandemic forced us all to change the way we were living, working, exercising and learning. In early April, we created a dedicated website https://builtforthis.net/ that showed some of the ways in which Internet use had changed, suddenly, because of the crisis.

On that website, we showed how traffic patterns had changed; for example, where people accessed the Internet from, how usage had jumped up dramatically, and how Internet attacks continued unabated and ultimately increased.

Today we are launching a dedicated Year In Review page with interactive maps and charts you can use to explore what changed on the Internet in 2020. Year In Review is part of Cloudflare Radar. We launched Radar in September 2020 to give anyone access to Internet use and abuse trends that Cloudflare normally had reserved only for employees.

Where people accessed the Internet

To get a sense for the Year In Review, let’s zoom in on London (you can do the same with any city from a long list of locations that we’ve analyzed). Here’s a map showing the change in Internet use comparing April (post-lockdown) and February (pre-lockdown). This map compares working hours Internet use on a weekday between those two months.

As you can clearly see, with offices closed in central London (and elsewhere), Internet use dropped (the blue colour) while usage increased in largely residential areas. Looking out to the west of London, a blue area near Windsor shows how Internet usage dropped at London’s Heathrow airport and surrounding areas.

A similar story plays out slightly later in the San Francisco Bay Area.

But that trend reverses in July, with an increase in Internet use in many places that saw a rapid decrease in April.

When you select a city from the map, a second chart shows the overall trend in Internet use for the country in which that city is located. For example, here’s the chart for the United States. The Y-axis shows the percentage change in Internet traffic compared to the start of the year.

Internet use really took off in March (when the lockdowns began) and rapidly increased to 40% higher than the start of the year. And usage has pretty much stayed there for all of 2020: that’s the new normal.

Here’s what happened in France (when selecting Paris) on the map view.

Internet use was flat until the lockdowns began. At that point, it took off and grew close to 40% over the beginning of the year. But there’s a visible slow down during the summer months, with Internet use up “only” 20% over the start of the year. Usage picked up again at “la rentrée” in September, with a new normal of about 30% growth in 2020.

What people did on the Internet

Returning to London, we can zoom into what people did on the Internet as the lockdowns began. The UK government announced a lockdown on March 23. On that day, the mixture of Internet use looked like this:

A few days later, the E-commerce category had jumped from 12.9% to 15.1% as people shopped online for groceries, clothing, webcams, school supplies, and more. Travel dropped from 1.5% of traffic to 1.1% (a decline of 30%).

And then by early mid-April E-commerce had increased to 16.2% of traffic with Travel remaining low.

But not all the trends are pandemic-related. One question is: to what extent is Black Friday (November 27, 2020) an event outside the US? We can answer that by moving the London slider to late November and look at the change in E-commerce. Watch carefully as E-commerce traffic grows towards Black Friday and actually peaks at 21.8% of traffic on Saturday, November 28.

As Christmas approached, E-commerce dropped off, but another category became very important: Entertainment. Notice how it peaked on Christmas Eve, as Britons, no doubt, turned to entertainment online during a locked-down Christmas.

And Hacking 2020

Of course, a pandemic didn’t mean that hacking activity decreased. Throughout 2020 and across the world, hackers continued to run their tools to attack websites, overwhelm APIs, and try to exfiltrate data.

Explore More

To explore data for 2020, you can check out Cloudflare Radar’s Year In Review page. To go deep into any specific country with up-to-date data about current trends, start at Cloudflare Radar’s homepage.

Internet traffic disruption caused by the Christmas Day bombing in Nashville

2021-01-06 John Graham-Cumming

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/internet-traffic-disruption-caused-by-the-christmas-day-bombing-in-nashville/

Internet traffic disruption caused by the Christmas Day bombing in Nashville

On Christmas Day 2020, an apparent suicide bomb exploded in Nashville, TN. The explosion happened outside an AT&T network building on Second Avenue in Nashville at 1230 UTC. Damage to the AT&T building and its power supply and generators quickly caused an outage for telephone and Internet service for local people. These outages continued for two days.

Looking at traffic flow data for AT&T in the Nashville area to Cloudflare we can see that services continued operating (on battery power according to reports) for over five hours after the explosion, but at 1748 UTC we saw a dramatic drop in traffic. 1748 UTC is close to noon in Nashville when reports indicate that people lost phone and Internet service.

Internet traffic disruption caused by the Christmas Day bombing in Nashville

We saw traffic from Nashville via AT&T start to recover over a 45 minute period on December 27 at 1822 UTC making the total outage 2 days and 34 minutes.

Traffic flows continue to be normal and no further disruption has been seen.

Key takeaways

Google referrals fall as AI Overviews expand

AI and search crawling: spring surge (+24%), summer slowdown

AI-only crawlers: OpenAI rises, ByteDance falls

Crawling by purpose: training dominates

Crawl-to-refer ratios shifts: tens of thousands of bot crawls per human click

Conclusion and what’s next

Impacts in Portugal

Country level

Network level

Regional level

Network quality

Routing

Impacts in Spain

Country level

Network level

Regional level

Network quality

Routing

Impacts in other European countries

Conclusion

Traffic trends

Query and response characteristics

DNS security

Conclusion

AI bots and crawlers traffic trends

Popularity of Generative AI services

Analysis of robots.txt files

Popularity of models and tasks on Workers AI

Conclusion

BCS East-West Interlink

Traffic volume indicators

Internet quality

Visibility in BGP events, announced IP address space unaffected

C-Lion1

Traffic volume indicators

Internet quality

BGP business as usual

Conclusion

Waiting Room site placement

Why customers needed broader Waiting Room coverage

Getting started with multi-host and path coverage

Deploying a multi-language waiting room

How we handle cookies with a multihostname and path waiting room

Approaches to setting multiple cookies

The world before Traffic Manager

Never send a human to do a machine’s job

Accounting for second order consequences

Meet Traffic Predictor

From a sledgehammer to a scalpel

How does Traffic Manager know how much to move?

What is the maximum threshold?

Where to send the traffic?

Reverting actions

Happy network, happy products

Happy network, happy users

What is a load balancer?

What are Cloudflare Tunnels?

Unleashing the potential of Cloudflare Load Balancing with Cloudflare Tunnels

Publicly exposing services in private data centers is complicated

Simpler and better: Load Balancing with Tunnels

The advantages of Private IP support with Cloudflare Tunnels

Looking to the future

Conclusion

The Big Picture

The Internet Impact of Commercials

Food and Drink

Keeping Clean

Highlights

So, who won Super Bowl LV?

One More Thing: Tom Brady

Want more?

Where people accessed the Internet

What people did on the Internet

And Hacking 2020

Explore More

The collective thoughts of the interwebz