domains | Noise

Post Syndicated from Celso Martinho original https://blog.cloudflare.com/radar-domain-rankings/

Goodbye, Alexa. Hello, Cloudflare Radar Domain Rankings

The Internet is a living organism. Technology changes, shifts in human behavior, social events, intentional disruptions, and other occurrences change the Internet in unpredictable ways, even to the trained eye.

Cloudflare Radar has long been the place to visit for accessing data and getting unique insights into how people and organizations are using the Internet across the globe, as well as those unpredictable changes to the Internet.

One of the most popular features on Radar has always been the “Most Popular Domains,” with both global and country-level perspectives. Domain usage signals provide a proxy for user behavior over time and are a good representation of what people are doing on the Internet.

Today, we’re going one step further and launching a new dataset called Radar Domain Rankings (Beta). Domain Rankings is based on aggregated 1.1.1.1 resolver data that is anonymized in accordance with our privacy commitments. The dataset aims to identify the top most popular domains based on how people use the Internet globally, without tracking individuals’ Internet use.

There are a few reasons why we’re doing this now. One is obviously to improve our Radar features with better data and incorporate new learnings. But also, ranking lists are used all over the Internet in all sorts of systems. One of the most used and trusted sources of domain rankings was Alexa, but that service was recently deprecated. We believe we are in a good position to provide a strong alternative.

Let’s see how we built it.

Differences in domain names

Before we dig into the data science behind Domain Rankings, it’s important to understand what a domain and DNS are. Internet domain names are human-readable dot-separated letters, digits and hyphens that correspond to a network resource, like a server or a website. However, your computer and applications don’t know what to do with a domain name; they need IP addresses to send and receive information over the network. DNS is the system that converts, or resolves, a domain name into an IP address. Think of it as an Internet phonebook for domain names.

Note: This is a simplification. A new standard called Internationalized Domain Names, or IDN, allows using Unicode strings in domain names.

Each dot defines a new hierarchy level, reading right to left. Domains can have multiple levels of depth. The highest level corresponds to country code top-level domains (ccTLDs) like .uk, .fr or .pt, or generic top-level domains (gTLDs) like .com, .org, or .net. These are normally assigned to and managed by either country-level entities or administrative organizations operating a registry.

Then there are the second-level domains like cloudflare.com or google.com. These are normally purchased and registered by individuals or organizations, which are then free to create and manage as many hostnames and hierarchy levels as they want.

Unfortunately, however, there are exceptions. For instance, many countries use second-level domain registration. One such example is the United Kingdom, where commercial domains can only be registered under the .co.uk hierarchy. That’s why Google in the UK isn’t google.uk, but rather google.co.uk.

But that’s not all. Some countries use 3rd level domain registrations. One example is Japan, which offers Regional Domain registration under cities like *.aisai.aichi.jp.

Projects like the Public Suffix List are a good starting point for understanding the variations involved, and how they affect validations and assumptions in other systems, such as cookies in web browsers.

Domain Rankings takes some of this nuance into account to inform the definition of our current ruleset:

We boil everything down to second-level domains, such as cloudflare.com or google.com.
However, if the second level is .edu, .com, .org, .gov, .net, .gov, .net, .co or .mil, then we use third-level domains.
We don’t distinguish between what we think is a website or an infrastructure system. A domain represents an Internet-available resource.
We will also semi-automate, curate and maintain a list of domains that map to popular platforms and services in the future. Example: fb.audio, fb.com, fb.watch, all map to a “facebook” platform.

Defining popularity

Definitions are important. We established what we consider a domain, but what does domain popularity mean exactly? Our research showed that the volume of traffic generated to a given domain doesn’t really work as a proxy for what we perceive as popular. Instead, Domain Rankings looks at the size of the population of users that look up a domain per unit of time. The more people who are interested in a domain, the more popular it is.

Sounds pretty straightforward, right? Well, it’s not. Our databases don’t have cookies, IPs, or other tracking artifacts, and we strip information that leads to identifying an individual from all of our data, by design.

The good news, however, is that we do a very good job at identifying automated traffic (for instance, you can read about Bot Management and how we use Machine Learning to detect bots in HTTP traffic in our blog) and we found we could develop a reasonable proxy for the unique users metric without sacrificing privacy (using other data points that we store for a limited period of time, like the ASN and high-level geolocation information of the request or the Cloudflare data center that served it).

Domain Rankings’ popularity metric is best described as the estimated relative size of the user population that accesses a domain over some period of time.

Our approach

We announced 1.1.1.1, our privacy-first consumer DNS resolver in 2018, and over the years it’s grown to become one of the top DNS services in the world. 1.1.1.1 is also part of a Research Agreement with APNIC in which we collaborate with them doing public research and DNS data insights.

The data we collect from it honors our privacy commitments, and is aggregated and stripped of any information that could lead to identifying or tracking users. We conducted a privacy examination by a Big Four accounting firm to determine whether the 1.1.1.1 resolver was effectively configured to meet our privacy commitments. You can read more about it in this blog, and the full report is publicly available on our compliance page.

Even without this personally identifying information, the resulting collection is vast and representative of Internet activity.

The 1.1.1.1 service is used in many ways. Regular (human) Internet users use it as their DNS resolver, either because they explicitly configured it in their devices, or their ISP did, or because they use WARP, or their browser uses 1.1.1.1 under the hood. However, servers and cloud infrastructure, IoT devices, home routers, and bots also use 1.1.1.1 extensively, which creates a lot of challenges for us when trying to identify human traffic.

We’ve been using DNS data to calculate the top and trending domains found on both the global and country pages on Cloudflare Radar. It’s been quite a learning experience trying to improve these lists. We have implemented aggregations, counts, filters, handling exceptions, and tried reducing noise, and yet they’re far from perfect. We felt that there had to be a better way.

We’ve spent the last six months building a variety of machine learning models to help us predict the rank of a domain.

Building the model was no easy feat. We experimented with multiple regression types first, to know exactly what the model was doing, and then more complex algorithms to get better performance. We played with different datasets, changed the population groups, variables (features), and combinations of variables, and used synthetic data.

After evaluation, one of our first conclusions was that building a model that could produce good results for the highest ranked domains and the long tail would be difficult.

The paper “A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists” describes this problem well. “The ranking of domains in the long tail should be based on significantly smaller and hence less reliable numbers.” Talking to our Research Team who submitted the collaboration paper “Toppling Top Lists: Evaluating the Accuracy of Popular Website Lists” to IMC 2022, got us to the same conclusion: the most popular domains (like google.com and facebook.com) have feature values disproportionally higher than the lower-ranked domains.

Therefore, we selected the two models that performed best. One model was trained on the population with the highest feature values, uses more features, and is used to generate the ordered top 100 domain list. A second model was trained on a more general group of domains, uses fewer features, and is used to get the top one million most popular domains, which we then divide into ranking buckets.

These buckets are ranked, but each bucket’s contents are intentionally unordered. For example, the second bucket of 10,000 most popular domains includes the set of domains that rank from 10,001 to 20,000, but give no further indication of the individual ranking of domains in that bucket. Given the size of some of these buckets and the window of time we use to populate them, they will inherently be exposed to more instability, too. We feel this is a good compromise between the described natural uncertainties of our long tail model and providing a reasonable idea of how close to the top a domain is.

Results

It’s important to mention there is no global view that can establish the perfect rank, and there’s no easy mechanism to confirm if a ranking is, ultimately, good. Data-driven results are always subject to some bias and skewing, related to the context of the organizations and systems that collect them. Sometimes all that can be done is to be transparent about potential sources of bias. The geographical distribution of customers and users, product characteristics, platform features, and behavioral diversity play an essential role in the final result. We are presenting the Cloudflare view, what we see.

Having said this, Cloudflare sits in a privileged position and handles a significant amount of Internet traffic. We have plenty of signals we can extract from our aggregated data, and believe that makes it possible to generate high quality domain rankings.

Domain Rankings are available today. You can head up to the Domains page and check it out:

Ordered list of the top 100 most popular domains globally and per country, based on our first model. Last 24 hours, updated daily.
Unordered global most popular domains datasets divided into buckets of the following sizes: 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, 100,000, 200,000, 500,000, 1,000,000. Last 7 days, updated weekly.

Next steps

We will keep improving Domain Rankings and monitoring the results. Anyone can access them on Cloudflare Radar, read the results, and download the CSV files.

Feel free to explore our Domain Rankings and share feedback with us.

Noise

Tag Archives: domains