Tag Archives: Bot Management

Holistic web protection: industry recognition for a prolific 2020

Post Syndicated from Patrick R. Donahue original https://blog.cloudflare.com/cloudflare-named-the-innovation-leader-in-holistic-web-protection/

Holistic web protection: industry recognition for a prolific 2020

I love building products that solve real problems for our customers. These days I don’t get to do so as much directly with our Engineering teams. Instead, about half my time is spent with customers listening to and learning from their security challenges, while the other half of my time is spent with other Cloudflare Product Managers (PMs) helping them solve these customer challenges as simply and elegantly as possible. While I miss the deeply technical engineering discussions, I am proud to have the opportunity to look back every year on all that we’ve shipped across our application security teams.

Taking the time to reflect on what we’ve delivered also helps to reinforce my belief in the Cloudflare approach to shipping product: release early, stay close to customers for feedback, and iterate quickly to deliver incremental value. To borrow a term from the investment world, this approach brings the benefits of compounded returns to our customers: we put new products that solve real-world problems into their hands as quickly as possible, and then reinvest the proceeds of our shared learnings immediately back into the product.

It is these sustained investments that allow us to release a flurry of small improvements over the course of a year, and be recognized by leading industry analyst firms for the capabilities we’ve accumulated and distributed to our customers. Today we’re excited to announce that Frost & Sullivan has named Cloudflare the Innovation Leader in their Frost Radar™: Global Holistic Web Protection Market Report. Frost & Sullivan’s view that this market “will gradually absorb the markets formed around legacy and point solutions” is consistent with our view of the world, and we’re leading the way in “the consolidation of standalone WAF, DDoS mitigation, and Bot Risk Management solutions” they believe is “poised to happen before 2025”.

Holistic web protection: industry recognition for a prolific 2020
Image © 2020 Frost & Sullivan from Frost Radar™: Global Holistic Web Protection Market Report

We are honored to receive this recognition, based on the analysis of 10 providers’ competitive strengths and opportunities as assessed by Frost & Sullivan. The rest of this post explains some of the capabilities that we shipped in 2020 across our Web Application Firewall (WAF), Bot Management, and Distributed Denial-of-Service product lines—the scope of Frost & Sullivan’s report. Get a copy of the Frost & Sullivan Frost Radar report to see why Cloudflare was named the Innovation Leader here.

2020 Web Security Themes and Roundup

Before jumping into specific product and feature launches, I want to briefly explain how we think about building and delivering our web security capabilities. The most important “product” by far that’s been built at Cloudflare over the past 10 years is the massive global network that moves bits securely around the world, as close to the speed of light as possible. Building our features atop this network allows us to reject the legacy tradeoff of performance or security. And equipping customers with the ability to program and extend the network with Cloudflare Workers and Firewall Rules allows us to focus on quickly delivering useful security primitives such as functions, operators, and ML-trained data—then later packaging them up in streamlined user interfaces.

We talk internally about building up the “toolbox” of security controls so customers can express their desired security posture, and that’s how we think about many of the releases over the past year that are discussed below. We begin by providing the saw, hammer, and nails, and let expert builders construct whatever defenses they see fit. By watching how these tools are put to use and observing the results of billions of attempts to evade the erected defenses, we learn how to improve and package them together as a whole for those less inclined to build from components. Most recently we did this with API Shield, providing a guided template to create “positive security” models within Firewall Rules using existing primitives plus new data structures for strong authentication such as Cloudflare-managed client SSL/TLS certificates. Each new tool added to the toolbox increases the value of the existing tools. Each new web request—good or bad—improves the models that our threat intelligence and Bot Management capabilities depend upon.

Web application firewall (WAF) usability at scale

Holistic web protection: industry recognition for a prolific 2020

Last year we spoke with many customers about our plan to decouple configuration from the zone/domain model and allow rules to be set for arbitrary paths and groups of services across an account. In 4Q2020 we put this granular control in the hands of a few developers and some of our most sophisticated enterprise customers, and we’re currently collecting and incorporating feedback before defaulting the capabilities on for new customers.

Rules are great, especially with increased flexibility, but without data structures and request enrichment at the edge (such as the Bot Management techniques described below) they cannot act on anything beyond static properties of the request. In 3Q2020 we released our IP Lists capabilities and customers have been steadily uploading their home-grown and third-party subscription lists. These lists can be referenced anywhere in a customer’s account as named variables and then combined with all other attributes of the request, even Bot Management scores, e.g., http.request.uri.path contains “/login” and (not ip.src in $pingdom_probes and cf.bot_management.score < 30) is a Firewall Rule filter that blocks all bots except Pingdom from accessing the login endpoint.

Requests that are blocked or challenged need to find their way as quickly as possible to our customers’ SOCs for triage, investigation and, occasionally, incident response, so we upgraded our edge-logging framework in 2Q2020 to push real time security-specific logs directly to customer SIEMs. And in 4Q2020, we released the ability to encrypt sensitive payloads within these logs using customer-provided encryption keys and novel encryption algorithms termed “Hybrid Public Key Encryption” (HPKE), and a data localization suite to provide control over where our customers’ data is stored and protected.

Built predominantly in 4Q2020 and currently being tested in the Firewall Rules engine is a brand new implementation of our Rate Limiting engine. By moving this matching and enforcement logic from a standalone tool to a component within a performant, memory-safe, expressive engine built in Rust, we have increased the utility of existing functions. Additional examples of improving this library of capabilities include the work completed in 1Q2020 to add HMAC functions and regex-based HTTP header and body inspection to the engine.

Bots and machine learning (ML)

Holistic web protection: industry recognition for a prolific 2020

In addition to making edge data sets accessible for request evaluation, we continued to invest heavily within our Bot Management team to provide actionable data so that our customers could decide what (if any) automated traffic they wanted to allow to interact with their applications. Our highest priority for Bot research and development has always been efficacy, and last year was no different. A significant portion of our engineering effort was dedicated to our detection engines — both updating and iterating on existing systems or creating entirely new detection engines from scratch.

In 1Q2020 we completed a total rewrite of our Machine Learning engine, and are continually focused on improving the efficacy of our ML engines. To do this, we draw on one of our major competitive advantages: the massive amount of data flowing through Cloudflare’s network. The early 2020 upgrade to our ML model nearly doubled the number of features we use to evaluate and score requests. And to help customers better understand why requests are flagged as bots, we have recently complemented the bot likelihood score in our logs with attribution to the specific engine that generated the score.

Also in 1Q2020, we upgraded our behavioral analysis engine to incorporate more features and increase overall accuracy. This engine conducts histogram-based outlier scoring and is now fully deployed to nearly all Bot Management zones.

In 2Q2020, we developed a lightweight JavaScript element that further advanced our browser fingerprinting capabilities and aids in detection. Specifically, we now silently challenge browsers and detect if a browser is misrepresenting its User Agent. This technique will be incorporated into our ML models and combined with our heuristics engine for more accurate browser fingerprinting. This feature is entirely optional and can be enabled or disabled by customers through our UI and API. Customers with extremely performance sensitive zones or traffic types that are unsuitable for JavaScript (such as API or some mobile app traffic) can still be accurately scored by our Bot Management engine.

In addition to detection, we also spent (and will continue to spend) engineering effort on mitigation. Our entire JavaScript and CAPTCHA challenge platform was rewritten in the last year and deployed to our customer zones in a staged fashion in the second half of 2020. Our new platform is faster and more robust at detecting automated systems attempting to solve the challenges. More importantly, this platform allows us to further invest in new challenge types and modes as we enter 2021.

The biggest and most well received feature released in 2020 was our dedicated Bot Management analytics, released in 3Q2020. We now present informative graphs that double as diagnostic tools. Customers have found that analytics are far more than interesting charts and statistics: in the case of Bot Management, analytics are essential to spotting and subsequently eliminating false positives.

Last but definitely not least, we announced the deprecation of the __cfduid cookie in 4Q2020 which was used primarily to detect bots but caused confusion for some customers including questions about whether they needed to display a cookie banner because of what we do.

To get a sense of the Bot Attack trends we saw in the first half of 2020, take a read through this blog post. And if you’re curious about how our ML models and heuristic engines work to keep your properties safe, this deep dive by Alex Bocharov, Machine Learning Tech Lead on the Bots team, is an excellent guide.

API and IoT security and protection

Holistic web protection: industry recognition for a prolific 2020

At the beginning of 4Q2020, we released a product called API Shield that was purpose built to secure, protect, and accelerate API traffic — and will eventually provide much of the common functionality expected in traditional API Gateways. The UI for API Shield was built on top of Firewall Rules for maximum flexibility, and will serve as the jump-off point for configuring additional API security features we have planned this year.

As part of API Shield, every customer now gets a fully managed, domain-scoped private CA generated for each of their zones, and we plan to continue working closely with the SSL/TLS team to expand CA management options based on feedback. Since the release, we’ve seen great adoption from in particular IoT companies focused on locking down their APIs using short-lived client certificates distributed out to devices. Customers can also now upload OpenAPI schemas to be matched against incoming requests from these devices, with bad requests being dropped at the edge rather than passed on to origin infrastructure.

Another capability we released in 4Q2020 was support for gRPC-based API traffic. Since that release, customers have expressed significant interest in using Cloudflare as a secure API gateway between easy-to-use customer-facing JSON endpoints and internal-facing gRPC or GraphQL endpoints. Like most customer challenges at Cloudflare, early adopters are looking to solve these use cases initially with Cloudflare Workers, but we’re keeping an eye on whether there are aspects for which we’ll want to provide first-class feature support.

Distributed Denial-of-Service (DDoS) protections for web applications and APIs

Holistic web protection: industry recognition for a prolific 2020

The application-layer security of a web application or API is of minimal importance if the service itself is not available due to a persistent DDoS attack at L3-L7. While mitigating such attacks has long been one of Cloudflare’s strengths, attack methodologies evolve and we continued to invest heavily in 2020 to drop attacks more quickly, more efficiently, and more precisely; as a result, automatic mitigation techniques are applied immediately and most malicious traffic is blocked in less than 3 seconds.

Early in 2020 we responded to a persistent increase in smaller, more localized attacks by fine-tuning a system that can autonomously detect attacks on any server in any datacenter. In the month prior to us first posting about this tool, it mitigated almost 300,000 network-layer attacks, roughly 55 times greater than the tool we previously relied upon. This new tool, dubbed “dosd”, leverages Linux’s eXpress Data Path (XDP) and allows our system to quickly — and automatically — deploy rules eBPF rules that run on each packet received. We further enhanced our edge mitigation capabilities in 3Q2020 by developing and releasing a protection layer that can operate even in environments where we only see one side of the TCP flow. These network layer protections help protect our customers who leverage both Magic Transit to protect their IP ranges and our WAF to protect their applications and APIs.

To document and provide visibility into these attacks, we released a GraphQL-backed interface in 1Q2020 called Network Analytics. Network Analytics extends the visibility of attacks against our customers’ services from L7 to L3, and includes detailed attack logs containing data such as top source and destination IPs and ports, ASNs, data centers, countries, bit rates, protocol and TCP flag distributions. A litany of improvements made to this graphical rendering engine over the course of 2020 have benefitted all analytics tools using the same front-end. In 4Q2020, Network Analytics was extended to provide traffic and attack insights into Cloudflare Spectrum-protected applications, which are terminated at L4 (TCP/UDP).

Towards the end of 4Q2020, we released real-time DDoS attack alerting capable of sending emails or pages via PagerDuty to alert security teams of ongoing attacks and mitigations. This capability was released just in time to assist with the onslaught of ransomware attacks that Cloudflare helped detect and defend against. For additional context on unique attacks we fought off in 2020, consider reading about an acoustics inspired attack, a 754 million packet-per-second, or a roundup of attacks from 1Q2020, 2Q2020, or 3Q2020.

Wrapping up and looking towards 2021

2020 was a tough year around the world. Throughout what has also been, and continues to be, a period of heightened cyberattacks and breaches, we feel proud that our teams were able to release a steady flow of new and improved capabilities across several critical security product areas reviewed by Frost & Sullivan. These releases culminated in far greater protections for customers at the end of the year than the beginning, and a recognition for our sustained efforts.

We are pleased to have been named the Innovation Leader in their Frost Radar™: Global Holistic Web Protection Market Report, which “addresses organizations’ demand for consolidated, single pane of glass solutions, which not only reduce the security gaps of legacy products but also provide simplified management capabilities”.

As we look towards 2021 we plan to continue releasing early and often, listening to feedback from our customers, and delivering incremental value along the way. If you have ideas on what additional capabilities you’d like to use to protect your applications and networks, we’d love to hear them below in the comments.

Introducing Bot Analytics

Post Syndicated from Ben Solomon original https://blog.cloudflare.com/introducing-bot-analytics/

Introducing Bot Analytics

Introducing Bot Analytics

Bots — both good and bad — are everywhere on the Internet. Roughly 40% of Internet traffic is automated. Fortunately, Cloudflare offers a tool that can detect and block unwanted bots: we call it Bot Management. This is the most recent platform in our long history of detecting bots for our customers. In fact, Cloudflare has always offered some form of bot detection. Over the past two years, our team has focused on building advanced detection engines, innovating as bots become more sophisticated, and creating new features.

Today, we are releasing Bot Analytics to help you visualize your automated traffic.

Background

It’s worth including some background for those who are new to bots.

Many websites expect human behavior. When I shop online, I behave as anyone else would: I might search for a few items, read reviews when I find something interesting, and eventually complete an order. This is expected. It is a standard use of the Internet.

Introducing Bot Analytics

Unfortunately, without protection these sites can be ripe for exploitation. Those shoes I was looking at? They are limited edition sneakers that resell for five times the price. Sneaker hoarders clamor at the chance to buy a pair (or fifty). Or perhaps I just added a book to my cart: there are probably hundreds of online retailers that sell the same book, each one eager to offer the best price. These retailers desperately want to know what their competitors’ prices are.

You can see where this is going. While most humans make good use of the Internet, some use automated tools to perform abuse at scale. For example, attackers will deplete sneaker inventories by using automated bots to check out quickly. By the time humans click “add to cart,” bots have already paid for shipping. Humans hardly stand a chance. Similarly, online retailers keep track of their competitors with “price scraping” bots that collect pricing information. So when one retailer lowers a book price to $10, another retailer’s bot will respond by pricing at $9.99. This is how we end up with weird prices like $12.32 for toilet paper. Worst of all, malicious bots are incentivized to hide their identities. They’re hidden among us.

Introducing Bot Analytics

Not all bots are bad. Cloudflare maintains a list of verified good bots that we keep separated from the rest. Verified bots are usually transparent about who they are: DuckDuckGo, for example, publicly lists the IP addresses it uses for its search engine. This is a well-intentioned service that happens to be automated, so we verified it. We also verify bots for error monitoring and other tools.

Enter: Bot Analytics

Introducing Bot Analytics

As discussed earlier, we built a Bot Management platform that intelligently detects bots on the Internet, allowing our customers to block bad ones and allow good ones. If you’re curious about how our solution works, read here.

Beginning today, we are going to show you the bots that reach your website. You can see these bots with a new tool called Bot Analytics. It’s fast, accurate, and loaded with information. You can query data up to one month in the past with no noticeable lag. To accomplish this, we exposed the data with GraphQL and paired it with adaptive bitrate (ABR) technology to dynamically load content. If you already have Bot Management added to your Cloudflare account, Bot Analytics is included in your service. Open up your dashboard and let’s take a tour…

The Tour

First: where to go? Bot Analytics lives under the Firewall tab of the dashboard. Once you’re in the Firewall, go to “Overview” and click the second thumbnail on the left. Remember, Bot Management must be added to your account for full access to analytics.

Introducing Bot Analytics

It’s worth noting that Enterprise sites without Bot Management can see a snapshot of their bot traffic. This data is updated in real time and should help you determine if you have a bot problem. Generally speaking, if you have a double-digit percentage of automated traffic, you might be spending more on origin costs than you have to. More importantly, you might be losing revenue or sensitive information to inventory hoarding and credential stuffing.

“Requests by bot score” is the first section on the page. Here, we show traffic over time, but we split it vertically by the traffic type. Green segments represent verified bots, while shades of purple and blue show varying degrees of bot/human likelihood.

Introducing Bot Analytics

“Bot score distribution” is next. This shows similar data, but we display it horizontally without the notion of time. Use the slider below to filter on subsets of traffic and watch the rest of the page adapt.

Introducing Bot Analytics

We recommend that you use the slider to find your ideal bot threshold. In other words: what is the cutoff for suspicious traffic on your site? We generally consider traffic below 30 to be automated, but customers might choose to challenge traffic below 40 or block traffic below 10 (you can even do both!). You should set a threshold that is ambitious but not too aggressive. If your traffic looks like the example below, consider setting a threshold at a “drop off” point like 3 or 14. Why? Notice that the request density is very high near scores 1-2 and 12-13. Many of these requests will have similar characteristics, meaning that the scores immediately above them (3 and 14) offer some differentiating quality. These are the most promising places to segment your bot rules. Notably, not every graph is this pronounced.

Introducing Bot Analytics

“Bot score source” sits lower on the page. Here, you can examine the detection engines that are responsible for scoring your traffic. If you can’t remember the purpose of each engine, simply hover over the tooltip to view a brief description. Customers may wonder why some requests are flagged as “not computed.” This commonly occurs when Cloudflare has issued an error page on your behalf. Perhaps a visitor’s request was met with a gateway timeout (error 504), in which case Cloudflare responded with a branded error page. The error page would not have warranted a challenge or a block, so we did not spend time calculating a bot score. We published another blog post that provides an overview of the most common sources, including machine learning and heuristics.

Introducing Bot Analytics

“Top requests by source” is the final section of Bot Analytics. Although it’s not quite as colorful as the sections above, this section grounds Bot Analytics in highly specific data. You can filter or exclude request attributes, including IP addresses, user agents, and ASNs. In the next section, we’ll use this to spot a bot attack.

Let’s Spot A Bot Attack!

First, I’m going to use the “bot score source” tool to select the most obvious bot requests — those detected by our heuristics engine. This provides us with the following information, some of which has been redacted for privacy reasons:

Introducing Bot Analytics

I already suspect a correlation between a few of these attributes. First, the IP addresses all have very similar request counts. No human would access a site 22,000 times, and the uniformity across IPs 2-5 suggests foul play. Not surprisingly, the same pattern occurs for user agents on the right. User agents tell us about the browser and device associated with a particular request. When Bot Analytics shows this much uniformity and presents clear anomalies in country and ASN, I get suspicious (and you should too). I’m now going to filter on these anomalies to see if my instinct is right:

Introducing Bot Analytics

The trends hold true — to be sure, I briefly expanded the table and found nine separate IP addresses exhibiting the same behavior. This is likely an aggressive content scraper. Notably, it is not marked as a verified bot, so Bot Management issued the lowest possible score and flagged it as “automated.” At the top of Bot Analytics, I will narrow down the traffic and keep the time period at 24 hours:

Introducing Bot Analytics

The most severe attacks come and go. This traffic is clearly sustained, and my best guess is that someone is frequently scraping the homepage for content. This isn’t the most malicious of attacks, but content is still being taken. If I wanted to, I could set a firewall rule to target this bot score or any of the filters I used.

Try It Out

As a reminder, all Enterprise customers will be able to see a snapshot of their bot traffic. Even if you don’t have Bot Management for your site, visit the Firewall for some high-level insights that are updated in real time.

Introducing Bot Analytics

And for those of you with Bot Management — check out Bot Analytics! It’s live now, and we hope you’ll have fun using it. Keep your eyes open for new analytics features in the coming months.

Bot Attack trends for Jan-Jul 2020

Post Syndicated from Ricardo Pacheco original https://blog.cloudflare.com/bot-attack-trends-for-jan-jul-2020/

Bot Attack trends for Jan-Jul 2020

Bot Attack trends for Jan-Jul 2020

Now that we’re a long way through 2020, let’s take a look at automated traffic, which makes up almost 40% of total Internet traffic.

This blog post is a high-level overview of bot traffic on Cloudflare’s network. Cloudflare offers a comprehensive Bot Management tool for Enterprise customers, along with an effective free tool called Bot Fight Mode. Because of the tremendous amount of traffic that flows through our network each day, Cloudflare is in a unique position to analyze global bot trends.

In this post, we will cover the basics of bot traffic and distinguish between automated requests and other human requests (What Is A Bot?). Then, we’ll move on to a global overview of bot traffic around the world (A RoboBird’s Eye View, A Bot Day and Bots All Over The World), and dive into North American traffic (A Look into North American Traffic).  Lastly, we’ll finish with an overview of how the coronavirus pandemic affected global traffic, and we’ll take a deeper look at European traffic (Bots During COVID-19 In Europe).

On average, Cloudflare processes 18 million HTTP requests every second. This is a great opportunity to understand how bots shape the Internet, how much infrastructure is dedicated to these automated requests, and why our customers need a great bot management solution.

What Is A Bot?

Bot Attack trends for Jan-Jul 2020

Cloudflare groups traffic into four bot-related categories:

1. Verified
2. Definitely automated
3. Likely automated
4. Likely human

Our goal is to stop malicious and unwanted bots from harming our customers, while giving customers the opportunity to control how other automated traffic is managed.

We label each request that comes into Cloudflare with a “bot score” 1 through 99, where a lower score means that a request probably came from a bot. A higher score means that a request probably came from a human. This score is available in our Firewall, logs, and Workers, giving customers the flexibility to act on any score.

Cloudflare also maintains a challenge platform that customers can choose to deploy on suspected bots. You’ll recognize these as CAPTCHA challenges or JavaScript challenges. In fact, having the score available in Firewall Rules means that customers can take any action they choose. This platform can be used for mitigation, ensuring that unwanted traffic is stopped in its tracks.

To learn more about how Bot Management interacts with our firewall, check out our support page.

We track successes and failures during these challenges, which ultimately allows us to improve our detection systems. Assuming that our challenges are solvable by humans, effective detections should have low solve rates, given that they are usually presented to bots.

Bot Attack trends for Jan-Jul 2020

Verified bots are registered in an internal verified bot directory. These good bots power search engines and monitoring tools. Good bots enable our customers’ web pages to be found by search engines, for example.

For known non-verified bots (such as a scraper using a simple curl library), we keep a similar directory that is managed by our heuristics engine. If not otherwise verified, we consider requests caught by this engine to be definitely automated.

Our machine learning engine provides another way to identify potential bots. This engine identifies requests with a high probability of automation and marks them as likely automated. This detection mechanism benefits from models built on data from our global network.

If a request is not marked as automated, we mark it as likely human and pass along the bot score from our machine learning system.

We also have a behavioral analysis engine and a JavaScript detections engine. You can learn more about these systems by checking out Alex Bocharov’s previous post on Cloudflare Bot Management.

The two bot definitions for automated traffic are somewhat complementary. Requests caught by heuristic detections will not count towards machine learning detections. Requests that are reliably caught by our machine learning detections won’t need to be registered in our known heuristics bot directory. Because of this, we combine these two together when we discuss “automated traffic” in general.

A RoboBird’s Eye View

Data from this piece comes from information about Cloudflare’s customers, analyzed between January 15, 2020 and July 31, 2020.

First, let’s get a basic understanding of the traffic on our network.

Bot Attack trends for Jan-Jul 2020
Figure 1.1 Traffic type on Cloudflare’s network.

Figure 1.1 has a global breakdown regarding classification; 60.6% of traffic is likely human, 19.3% is likely automated, 18.1% is definitely automated and only 2.1% is from verified bots. In total, 39.5% of requests we score come from some kind of bot.

A Bot Day

Regular traffic fluctuates throughout the day. Do bots follow suit? Let’s check. Figure 2.1 represents traffic deviation from the average hourly traffic. An increase of 10% would mean that the hour is 10% busier than the average hour (measuring requests per hour). We include the total overall traffic in this chart to serve as a comparison to other types of traffic.

Bot Attack trends for Jan-Jul 2020
Figure 2.1 Hourly traffic as a deviation from the average hour.
Bot Attack trends for Jan-Jul 2020
Figure 2.2 Bot classification over an average day. 

We can clearly see a difference between human traffic and bot traffic. Human traffic varies heavily, but predictably, throughout the day. We can see a 15% decrease in human traffic early in the day, between midnight and 05:00 UTC, corresponding to the end of business hours in the Americas, and up to a 25% increase during business hours, 14:00 to 17:00 UTC, where traffic is highest. Conversely, bot traffic is more consistent. Slow hours still see a smaller drop than overall traffic, and busy hours are less busy. The difference between good and bad bots is also apparent: good bots are even more consistent, with small fluctuations in hourly traffic.

But why would this happen? A large portion of bots, good and bad, perform the same task across the Internet. Bad bots may be scraping websites or looking to infect unprotected machines, and they will do this with little intervention from human operators. Good bots could be doing some of these operations, but less frequently and in a more targeted fashion. A good bot scraping a website may be doing so to add it to a search engine, while a bad bot will do the same thing at a much higher rate, for other reasons.

A lot of bots follow business hours. For example, sneaker bots—focused on nabbing exclusive items from sneaker stores—will naturally be active when new products launch.

This difference in volume does not mean that our classifications are affected: our scores remain consistent throughout the day, as Figure 2.1 shows.

Bot Attack trends for Jan-Jul 2020
Figure 2.3 Daily traffic as a deviation from the average day. Grouped by day of week.
Bot Attack trends for Jan-Jul 2020
Figure 2.4 Bot classification over an average week.

We can also see that good bots don’t take weekends off. Weekdays and weekends have fairly marked differences for most traffic, but good bots keep a consistent schedule. Whereas a typical weekday is slightly above average, we can see a drop of about 4% in overall traffic. This does not fully apply to verified bots, which only see a small 1% drop in traffic.

Bots All Over The World

Now that we’ve taken a look at global traffic, let’s dig a little deeper.

Different regions have distinct traffic landscapes regarding automated traffic.

Bot Attack trends for Jan-Jul 2020
Figure 3.1 Traffic type by region.

Figure 3.1 breaks down traffic by region, letting us peek into where each type of traffic comes from. North America stands out as a major automated traffic source; over 50% of definitely automated traffic comes from there, and they also contribute almost 80% of all verified bot traffic. Europe makes up the second largest chunk of traffic, followed by Asia.

Bot Attack trends for Jan-Jul 2020
Figure 3.2 Traffic classification within each region.

Looking at regional breakdown of traffic in Figure 3.2, we can see just how much North American traffic is automated, well above the global average.

A Look into North American Traffic

As the largest source of automated traffic, North America deserves a closer look.

First, we’ll start with a breakdown of each country.

Bot Attack trends for Jan-Jul 2020
Figure 3.3 Percentage of traffic within North America.

Most of our requests in North America come from just three countries—the United States, Canada and Mexico. These account for 98% of all requests from North America, 97% of all requests from likely human sources and 100% of requests from verified bots. The United States alone accounts for 88% of total requests, 82% of requests from likely human sources, 96% of requests from definitely automated sources, 88% of requests from likely automated traffic sources and  98% of requests from verified bot.

However, this alone does not mean that the United States has an unusual amount of activity. These countries have a combined population of roughly 497 million people. The United States accounts for 66.5% of that, Mexico 25.9% and Canada 7.6%. With this context, we can see that the United States is overrepresented in terms of raw requests, but underrepresented in terms of how much of that traffic is likely to be human. Conversely, Canadian traffic is more likely to be human.

Let’s take another look at each country.

Bot Attack trends for Jan-Jul 2020
Figure 3.4 Percentage of traffic within each country.

Over half of the traffic from the United States is automated in some way, which is a clear departure from trends in Mexico and Canada.

American Bots

So far, we’ve seen how much the United States contributes to automated traffic. If we want to go deeper, a good place to start is by understanding how these bots get online. We can do this by examining the networks from which the traffic originates. Networks are identified by Autonomous System Numbers, or ASNs. These form the backbone of the Internet infrastructure.

Think of these as Internet Service Providers, but facing inward towards the network instead of outward towards end consumers. ISPs like Comcast and Verizon are examples of residential ASNs, where we expect mostly human traffic. Cloud providers such as Google and Amazon are also ASNs, but targeted towards cloud services. We expect most of these requests to be automated in some way.

Looking at traffic on the ASN level is important because we can identify cloud-based traffic, or traffic using residential proxies, among others.

Let’s take a look at which ASNs are associated with visitors in the United States. We’ll restrict ourselves to “eyeball” traffic, which is the term we use for requests coming from site visitors.

Bot Attack trends for Jan-Jul 2020
Figure 4.1 Top ASN in the United States.

From figure 4.1 we can clearly see the impact that cloud services have on traffic; 11.5% of all eyeball traffic comes from Amazon and Google.

Bot Attack trends for Jan-Jul 2020
Figure 4.2 Top ASN in the United States for verified bot traffic.

Verified bots operate in a different landscape, coming from cloud providers such as Amazon, Google, Microsoft, Advanced Hosting and Wowrack.

Bot Attack trends for Jan-Jul 2020
Figure 4.3 Top ASN in the United States for likely and definitely automated traffic.

Automated traffic has a variety of ASNs. Cloud providers such as Amazon, Google and Microsoft make up the 30% of automated traffic. Comcast also makes up a significant portion of traffic at 4.8%, indicating that some bots come from residential services.

Bots During COVID-19 In Europe

Lockdowns and limits on public events came as a consequence of the ongoing coronavirus pandemic. Many people have been working from home, and even those who do not have this option are using the Internet in new ways. Overall, this has meant that Cloudflare’s network has grown tremendously.

But how does this impact bot traffic? First let’s get an idea of how it impacted traffic in general. Countries were impacted by the virus at different times, so we expect to see differences, right?

Bot Attack trends for Jan-Jul 2020
Figure 5.1 Total traffic across all regions.

Figure 5.1 has just the traffic increase. Globally, we are seeing an average increase of 10%, while North America saw an increase of over 40% compared to the beginning of the year. Some regions did not change much, such as Africa and Asia, while others, such as Europe saw an increased period, but has since normalized to previous levels.

Let’s look at a few countries, so we can understand what this looks like.

Bot Attack trends for Jan-Jul 2020
Figure 5.2 Daily traffic evolution for Italy, the United Kingdom and Portugal, overlaid with Europe.

Figure 5.2 shows daily traffic relative to January 15, when data collection started. For comparison, we have overall European traffic, and three selected countries: Italy, the United Kingdom and Portugal. Italy was picked because it was one of the first countries in Europe to face the worst of the coronavirus and enact lockdown measures. The United Kingdom took another strategy, with an initial focus on herd immunity, and enacted measures later than the others. Portugal is somewhere in between, locking down later than Italy, in slightly different circumstances.

At the beginning of the year, traffic kept stable and fluctuations kept in line with the European average. As lockdown measures began, traffic increased. Italy was first out of these countries, rising a few weeks before the others, and keeping well above average. Eventually, all countries saw a growth in traffic, followed by a stabilization. Italy seems to have adjusted to a normal, with its growth in line with the European average. Portugal has also stabilized, but with busier weekdays. Conversely, the United Kingdom showed no signs of stopping, exceeding a growth of 40% compared to the beginning of the year.

Bot Attack trends for Jan-Jul 2020
Figure 5.3 Daily definitely automated traffic evolution for Italy, the United Kingdom and Portugal, overlaid with Europe.

Definitely automated traffic did not have that much of a pronounced variation. Italian traffic kept steady throughout, and Portugal had a rather large increase. The biggest one, however, was the United Kingdom, which tripled its initial count.

Bot Attack trends for Jan-Jul 2020
Figure 5.4 Verified bot traffic evolution for Italy, the United Kingdom and Portugal, overlaid with Europe. 

Verified bot traffic is steady, except in Italy, with a massive increase between March and May. What could be the cause of this? Are these a few zones, getting a massive number of requests?

Bot Attack trends for Jan-Jul 2020
Figure 5.5 Verified bot traffic in Italy for the top 10 000 zones, relative to January 15th 2020.

Well, no. If we only examine the top 10,000 zones (by total verified bot requests), we can still see a massive increase in traffic for other zones. So, what’s happening?

Let’s look at user agents. We can separate the top 10 user agents during the bump, and see how they evolve over time.

Bot Attack trends for Jan-Jul 2020
Figure 5.6 Verified bot traffic in Italy for the top 10 user agents, relative to January 15th 2020.

We can see that these 10 user agents are responsible for the majority of verified traffic coming from Italy.

Bot Attack trends for Jan-Jul 2020
Figure 5.7 Verified bot traffic in Italy for the top user agent, relative to January 15 2020.

In fact, most of this increase is from a single user agent. This instance of Google image proxy anonymizes image requests from Gmail, which explains its popularity.

Where does this increase come from? Did this bot suddenly appear and disappear?

Not quite. One thing to keep in mind when dealing with bots is that they cross borders easily. As a proxy service, this bot is making calls on behalf of the end user – people opening emails. These requests will originate from a data center, which can be anywhere in the world. To see this in action, let’s take a look at traffic for this bot in a few select countries.

Bot Attack trends for Jan-Jul 2020
Figure 5.8. Countries of origin for GoogleImageProxy.

We can see that the global average barely budges. It appears that Google may be moving image proxy traffic between data centers and during the period we observed above that traffic was coming from Italy.

Summary

With Cloudflare’s global reach, we’re in a position to understand how bots behave.

The first half of 2020 saw a massive increase in web traffic of around 35% since the beginning of the year, driven by the ongoing coronavirus pandemic, and some bots have taken advantage of it.

We explained how bot management works for our customers, and how we distinguish between likely automated and human traffic.

We showed an overview of how much of our global traffic is automated, and how bots change their behavior throughout the day and the week. Notably, 39.4% of all traffic Cloudflare processes comes from a suspected automated source.

A regional overview of automated traffic lets us know which regions were the source of traffic from likely automated agents. North America, Europe and Asia were the primary sources of traffic, and also of automated traffic in particular.

We then focused on North America, where the majority of automated traffic originates. The United States alone accounted for the majority of requests, over half of which come from automated sources.

To explore this further, we briefly dived into ASN traffic in the United States, so we could see where these requests were coming from. ASNs like Comcast and AT&T were the top ASNs for overall traffic, but unsurprisingly, data centers like Google and Amazon AWS were the main drivers of automated traffic.

Finally, we examined how the coronavirus has impacted traffic in Europe, with a deeper dive on Italian traffic. This led to some interesting insights on verified bot traffic, which saw a massive increase in Italy for a few months.

This post is a small peek into bot management at Cloudflare. In the future, we hope to expand this series of blog posts on bot management, exposing even more insights about bots on the Internet.

How a Customer’s Trust in Cloudflare Led to a Big Win against Bots

Post Syndicated from Kate Fleming original https://blog.cloudflare.com/how-a-customers-trust-in-cloudflare-led-to-a-big-win-against-bots/

How a Customer's Trust in Cloudflare Led to a Big Win against Bots

Key Points

  • Anyone with public-facing web properties is likely to have bot traffic on their website.
  • One type of bot that commonly targets eCommerce and online portals is a ‘scraper bot’.
  • Some scraper bots are good (such as those used by search engines to assess your website’s content to inform search results, or price comparison sites to help inform consumer decisions), however many are malicious, and will work to scrape not only images but also pricing data from your site for use by a competitor.
  • Many Bot Management providers will need to divert your traffic to a dedicated data centre to analyse your traffic and ‘scrub’ it clean from malicious bot traffic before sending it on to your site. While effective this will almost certainly add latency to the traffic’s ‘journey’ resulting in degraded user experience. Look for a technology partner with an expansive network who can scan your traffic in real time as it passes through any data centre on their network.
  • Good and bad scraper bots behave in largely the same way, making it difficult for bot protection systems to differentiate between the two. A common challenge with Bot Management solutions is that they can return a high number of false positives (legitimate bot or customer traffic blocked as though it were malicious). This can result in legitimate customers being challenged to various and repeated authentication challenges, or in extreme cases, blocked altogether. Look for a technology partner who can consistently return low rates of false positives on your traffic.

The Success Story

How a Customer's Trust in Cloudflare Led to a Big Win against Bots

I often joke that the key to understanding what that role of Customer Success is all about is to say: repeat the sentence again slowly… it’s there, in the name. Customer Success is, at its core, about making customers successful. And this is what makes our day, and makes us happy. Allow me to share a short story on how Cloudflare made a successful customer even more so with our Bot Management solution.

Once upon a time, an online property portal, let’s call them Property Portal, came to Cloudflare for our DDoS and WAF solution. We worked well together. The customer liked our ease of use and we delivered on our promise to provide performance and security to them. As Property Portal’s brand and digital footprint grew, so did the instances of malicious bot traffic, in particular ‘scraper bots’.

They say that imitation is the sincerest form of flattery. But when that imitation turns into someone else profiting off your IP, the shine starts to wear off, and that ‘imitation’ starts to become something more akin to outright theft. This is a challenge common to market leaders in the eCommerce and online portal space, where market leading organizations who pride themselves on presenting a solid portfolio of quality product offerings often find that competing sites seem to be not only replicating their content but matching or undercutting their prices in near real time.

Cloudflare wasn’t working with Property Portal when they first started facing these challenges, and as such, Property Portal engaged another party – at that time, one of the market leaders in the space – to provide a Bot Management solution for them.

At first pass, this seemed to solve the problem, however it wasn’t long before additional challenges became apparent:

  • performance was impacted slightly as this solution required traffic to be re-routed to a scrubbing centre to be ‘scrubbed’ of requests from bad bots before coming to their site;
  • a small percentage of malicious bots were still getting through and scraping valuable content of their website, and most damagingly;
  • Property Portal discovered that they were seeing a significant number of ‘false positives’, resulting in rising frustration for legitimate visitors (buyers, sellers and renters) repeatedly being asked to complete challenges in order to validate that they were human and not bots as they tried to navigate through the site.

During this time, Cloudflare released and matured our own Bot Management offering. Being aware that Property Portal weren’t seeing success from their existing solution, the account team began discussing the value of consolidating their bot solution with their DDoS and WAF offering from Cloudflare. We were given very clear success criteria which in technical terms, translated to the following:

  1. Don’t mess anything up, deprecate our user experience, or make us change our domains if we switch to you;
  2. Stop more of the bad traffic.;
  3. And most importantly, let more of the good users in, and stop challenging them as much as our current provider does (reduce the number of false positives).

We passed with flying colours. In addition, Property Portal was happy that they were able to consolidate additional services under one vendor.

For our side, Cloudflare now has the privilege of knowing that we are helping to improve the experience for many of Property Portal’s end customers, while at the same time working to protect their IP and hard work by keeping the ‘imitators’ and their scraper bots at a safe distance.

Happy Customer = Happy Customer Success Managers & Account Teams. Day made.

Would you like to know more?

Does any of the above feel familiar to you? Do your competitors have the uncanny ability to present near identical inventory, images or pricing to yours just as soon as you publish changes to your site? Or, are you keen to learn more in order to stay ahead of the bot armies?

Well, you’ve come to the right place, friend:

  • If you’re the self-serve type then take a look at our learning centre here and here, or our product pages. (And yes, we have a (good) chat bot there waiting to help you).
  • If ‘tuning in and geeking out’ is your preferred method of learning, then tune into the next episode of ‘Customers + Success’ on Cloudflare TV where I’ll be interviewing some of the people involved in this case, and hearing more about the challenges that this customer faced first-hand. The segment will air at 4PM PST October 21st / 7AM SGT October 22nd / 10AM AEST October 22nd, and will be appearing on the CFTV schedule in the next week.
  • Alternatively, if you consume your knowledge in old-fashioned human style feel free to contact us here. Someone from our team will be in touch to get you the answers you are looking for.

Cloudflare Bot Management: machine learning and more

Post Syndicated from Alex Bocharov original https://blog.cloudflare.com/cloudflare-bot-management-machine-learning-and-more/

Cloudflare Bot Management: machine learning and more

Introduction

Cloudflare Bot Management: machine learning and more

Building Cloudflare Bot Management platform is an exhilarating experience. It blends Distributed Systems, Web Development, Machine Learning, Security and Research (and every discipline in between) while fighting ever-adaptive and motivated adversaries at the same time.

This is the ongoing story of Bot Management at Cloudflare and also an introduction to a series of blog posts about the detection mechanisms powering it. I’ll start with several definitions from the Bot Management world, then introduce the product and technical requirements, leading to an overview of the platform we’ve built. Finally, I’ll share details about the detection mechanisms powering our platform.

Let’s start with Bot Management’s nomenclature.

Some Definitions

Bot – an autonomous program on a network that can interact with computer systems or users, imitating or replacing a human user’s behavior, performing repetitive tasks much faster than human users could.

Good bots – bots which are useful to businesses they interact with, e.g. search engine bots like Googlebot, Bingbot or bots that operate on social media platforms like Facebook Bot.

Bad bots – bots which are designed to perform malicious actions, ultimately hurting businesses, e.g. credential stuffing bots, third-party scraping bots, spam bots and sneakerbots.

Cloudflare Bot Management: machine learning and more

Bot Management – blocking undesired or malicious Internet bot traffic while still allowing useful bots to access web properties by detecting bot activity, discerning between desirable and undesirable bot behavior, and identifying the sources of the undesirable activity.

WAF – a security system that monitors and controls network traffic based on a set of security rules.

Gathering requirements

Cloudflare has been stopping malicious bots from accessing websites or misusing APIs from the very beginning, at the same time helping the climate by offsetting the carbon costs from the bots. Over time it became clear that we needed a dedicated platform which would unite different bot fighting techniques and streamline the customer experience. In designing this new platform, we tried to fulfill the following key requirements.

  • Complete, not complex – customers can turn on/off Bot Management with a single click of a button, to protect their websites, mobile applications, or APIs.
  • Trustworthy – customers want to know whether they can trust the website visitor is who they say they are and provide a certainty indicator for that trust level.
  • Flexible – customers should be able to define what subset of the traffic Bot Management mitigations should be applied to, e.g. only login URLs, pricing pages or sitewide.
  • Accurate – Bot Management detections should have a very small error, e.g. none or very few human visitors ever should be mistakenly identified as bots.
  • Recoverable – in case a wrong prediction was made, human visitors still should be able to access websites as well as good bots being let through.

Moreover, the goal for new Bot Management product was to make it work well on the following use cases:

Cloudflare Bot Management: machine learning and more

Technical requirements

Additionally to the product requirements above, we engineers had a list of must-haves for the new Bot Management platform. The most critical were:

  • Scalability – the platform should be able to calculate a score on every request, even at over 10 million requests per second.
  • Low latency – detections must be performed extremely quickly, not slowing down request processing by more than 100 microseconds, and not requiring additional hardware.
  • Configurability – it should be possible to configure what detections are applied on what traffic, including on per domain/data center/server level.
  • Modifiability – the platform should be easily extensible with more detection mechanisms, different mitigation actions, richer analytics and logs.
  • Security – no sensitive information from one customer should be used to build models that protect another customer.
  • Explainability & debuggability – we should be able to explain and tune predictions in an intuitive way.

Equipped with these requirements, back in 2018, our small team of engineers got to work to design and build the next generation of Cloudflare Bot Management.

Meet the Score

“Simplicity is the ultimate sophistication.”
– Leonardo Da Vinci

Cloudflare operates on a vast scale. At the time of this writing, this means covering 26M+ Internet properties, processing on average 11M requests per second (with peaks over 14M), and examining more than 250 request attributes from different protocol levels. The key question is how to harness the power of such “gargantuan” data to protect all of our customers from modern day cyberthreats in a simple, reliable and explainable way?

Bot management is hard. Some bots are much harder to detect and require looking at multiple dimensions of request attributes over a long time, and sometimes a single request attribute could give them away. More signals may help, but are they generalizable?

When we classify traffic, should customers decide what to do with it or are there decisions we can make on behalf of the customer? What concept could possibly address all these uncertainty problems and also help us to deliver on the requirements from above?

As you might’ve guessed from the section title, we came up with the concept of Trusted Score or simply The Scoreone thing to rule them all – indicating the likelihood between 0 and 100 whether a request originated from a human (high score) vs. an automated program (low score).

Cloudflare Bot Management: machine learning and more
“One Ring to rule them all” by idreamlikecrazy, used under CC BY / Desaturated from original

Okay, let’s imagine that we are able to assign such a score on every incoming HTTP/HTTPS request, what are we or the customer supposed to do with it? Maybe it’s enough to provide such a score in the logs. Customers could then analyze them on their end, find the most frequent IPs with the lowest scores, and then use the Cloudflare Firewall to block those IPs. Although useful, such a process would be manual, prone to error and most importantly cannot be done in real time to protect the customer’s Internet property.

Fortunately, around the same time we started worked on this system , our colleagues from the Firewall team had just announced Firewall Rules. This new capability provided customers the ability to control requests in a flexible and intuitive way, inspired by the widely known Wireshark®  language. Firewall rules supported a variety of request fields, and we thought – why not have the score be one of these fields? Customers could then write granular rules to block very specific attack types. That’s how the cf.bot_management.score field was born.

Having a score in the heart of Cloudflare Bot Management addressed multiple product and technical requirements with one strike – it’s simple, flexible, configurable, and it provides customers with telemetry about bots on a per request basis. Customers can adjust the score threshold in firewall rules, depending on their sensitivity to false positives/negatives. Additionally, this intuitive score allows us to extend our detection capabilities under the hood without customers needing to adjust any configuration.

So how can we produce this score and how hard is it? Let’s explore it in the following section.

Architecture overview

What is powering the Bot Management score? The short answer is a set of microservices. Building this platform we tried to re-use as many pipelines, databases and components as we could, however many services had to be built from scratch. Let’s have a look at overall architecture (this overly simplified version contains Bot Management related services):

Cloudflare Bot Management: machine learning and more

Core Bot Management services

In a nutshell our systems process data received from the edge data centers, produce and store data required for bot detection mechanisms using the following technologies:

  • Databases & data storesKafka, ClickHouse, Postgres, Redis, Ceph.
  • Programming languages – Go, Rust, Python, Java, Bash.
  • Configuration & schema management – Salt, Quicksilver, Cap’n Proto.
  • Containerization – Docker, Kubernetes, Helm, Mesos/Marathon.

Each of these services is built with resilience, performance, observability and security in mind.

Edge Bot Management module

All bot detection mechanisms are applied on every request in real-time during the request processing stage in the Bot Management module running on every machine at Cloudflare’s edge locations. When a request comes in we extract and transform the required request attributes and feed them to our detection mechanisms. The Bot Management module produces the following output:

Firewall fieldsBot Management fields
cf.bot_management.score – an integer indicating the likelihood between 0 and 100 whether a request originated from an automated program (low score) to a human (high score).
cf.bot_management.verified_bot – a boolean indicating whether such request comes from a Cloudflare whitelisted bot.
cf.bot_management.static_resource – a boolean indicating whether request matches file extensions for many types of static resources.

Cookies – most notably it produces cf_bm, which helps manage incoming traffic that matches criteria associated with bots.

JS challenges – for some of our detections and customers we inject into invisible JavaScript challenges, providing us with more signals for bot detection.

Detection logs – we log through our data pipelines to ClickHouse details about each applied detection, used features and flags, some of which are used for analytics and customer logs, while others are used to debug and improve our models.

Once the Bot Management module has produced the required fields, the Firewall takes over the actual bot mitigation.

Firewall integration

The Cloudflare Firewall’s intuitive dashboard enables users to build powerful rules through easy clicks and also provides Terraform integration. Every request to the firewall is inspected against the rule engine. Suspicious requests can be blocked, challenged or logged as per the needs of the user while legitimate requests are routed to the destination, based on the score produced by the Bot Management module and the configured threshold.

Cloudflare Bot Management: machine learning and more

Firewall rules provide the following bot mitigation actions:

  • Log – records matching requests in the Cloudflare Logs provided to customers.
  • Bypass – allows customers to dynamically disable Cloudflare security features for a request.
  • Allow – matching requests are exempt from challenge and block actions triggered by other Firewall Rules content.
  • Challenge (Captcha) – useful for ensuring that the visitor accessing the site is human, and not automated.
  • JS Challenge – useful for ensuring that bots and spam cannot access the requested resource; browsers, however, are free to satisfy the challenge automatically.
  • Block – matching requests are denied access to the site.

Our Firewall Analytics tool, powered by ClickHouse and GraphQL API, enables customers to quickly identify and investigate security threats using an intuitive interface. In addition to analytics, we provide detailed logs on all bots-related activity using either the Logpull API and/or LogPush, which provides the easy way to get your logs to your cloud storage.

Cloudflare Workers integration

In case a customer wants more flexibility on what to do with the requests based on the score, e.g. they might want to inject new, or change existing, HTML page content, or serve incorrect data to the bots, or stall certain requests, Cloudflare Workers provide an option to do that. For example, using this small code-snippet, we can pass the score back to the origin server for more advanced real-time analysis or mitigation:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})
 
async function handleRequest(request) {
  request = new Request(request);
 
  request.headers.set("Cf-Bot-Score", request.cf.bot_management.score)
 
  return fetch(request);
}

Now let’s have a look into how a single score is produced using multiple detection mechanisms.

Detection mechanisms

Cloudflare Bot Management: machine learning and more

The Cloudflare Bot Management platform currently uses five complementary detection mechanisms, producing their own scores, which we combine to form the single score going to the Firewall. Most of the detection mechanisms are applied on every request, while some are enabled on a per customer basis to better fit their needs.

Cloudflare Bot Management: machine learning and more
Cloudflare Bot Management: machine learning and more

Having a score on every request for every customer has the following benefits:

  • Ease of onboarding – even before we enable Bot Management in active mode, we’re able to tell how well it’s going to work for the specific customer, including providing historical trends about bot activity.
  • Feedback loop – availability of the score on every request along with all features has tremendous value for continuous improvement of our detection mechanisms.
  • Ensures scaling – if we can compute for score every request and customer, it means that every Internet property behind Cloudflare is a potential Bot Management customer.
  • Global bot insights – Cloudflare is sitting in front of more than 26M+ Internet properties, which allows us to understand and react to the tectonic shifts happening in security and threat intelligence over time.

Overall globally, more than third of the Internet traffic visible to Cloudflare is coming from bad bots, while Bot Management customers have the ratio of bad bots even higher at ~43%!

Cloudflare Bot Management: machine learning and more
Cloudflare Bot Management: machine learning and more

Let’s dive into specific detection mechanisms in chronological order of their integration with Cloudflare Bot Management.

Machine learning

The majority of decisions about the score are made using our machine learning models. These were also the first detection mechanisms to produce a score and to on-board customers back in 2018. The successful application of machine learning requires data high in Quantity, Diversity, and Quality, and thanks to both free and paid customers, Cloudflare has all three, enabling continuous learning and improvement of our models for all of our customers.

At the core of the machine learning detection mechanism is CatBoost  – a high-performance open source library for gradient boosting on decision trees. The choice of CatBoost was driven by the library’s outstanding capabilities:

  • Categorical features support – allowing us to train on even very high cardinality features.
  • Superior accuracy – allowing us to reduce overfitting by using a novel gradient-boosting scheme.
  • Inference speed – in our case it takes less than 50 microseconds to apply any of our models, making sure request processing stays extremely fast.
  • C and Rust API – most of our business logic on the edge is written using Lua, more specifically LuaJIT, so having a compatible FFI interface to be able to apply models is fantastic.

There are multiple CatBoost models run on Cloudflare’s Edge in the shadow mode on every request on every machine. One of the models is run in active mode, which influences the final score going to Firewall. All ML detection results and features are logged and recorded in ClickHouse for further analysis, model improvement, analytics and customer facing logs. We feed both categorical and numerical features into our models, extracted from request attributes and inter-request features built using those attributes, calculated and delivered by the Gagarin inter-requests features platform.

We’re able to deploy new ML models in a matter of seconds using an extremely reliable and performant Quicksilver configuration database. The same mechanism can be used to configure which version of an ML model should be run in active mode for a specific customer.

A deep dive into our machine learning detection mechanism deserves a blog post of its own and it will cover how do we train and validate our models on trillions of requests using GPUs, how model feature delivery and extraction works, and how we explain and debug model predictions both internally and externally.

Heuristics engine

Not all problems in the world are the best solved with machine learning. We can tweak the ML models in various ways, but in certain cases they will likely underperform basic heuristics. Often the problems machine learning is trying to solve are not entirely new. When building the Bot Management solution it became apparent that sometimes a single attribute of the request could give a bot away. This means that we can create a bunch of simple rules capturing bots in a straightforward way, while also ensuring lowest false positives.

The heuristics engine was the second detection mechanism integrated into the Cloudflare Bot Management platform in 2019 and it’s also applied on every request. We have multiple heuristic types and hundreds of specific rules based on certain attributes of the request, some of which are very hard to spoof. When any of the requests matches any of the heuristics – we assign the lowest possible score of 1.

The engine has the following properties:

  • Speed – if ML model inference takes less than 50 microseconds per model, hundreds of heuristics can be applied just under 20 microseconds!
  • Deployability – the heuristics engine allows us to add new heuristic in a matter of seconds using Quicksilver, and it will be applied on every request.
  • Vast coverage – using a set of simple heuristics allows us to classify ~15% of global traffic and ~30% of Bot Management customers’ traffic as bots. Not too bad for a few if conditions, right?
  • Lowest false positives – because we’re very sure and conservative on the heuristics we add, this detection mechanism has the lowest FP rate among all detection mechanisms.
  • Labels for ML – because of the high certainty we use requests classified with heuristics to train our ML models, which then can generalize behavior learnt from from heuristics and improve detections accuracy.

So heuristics gave us a lift when tweaked with machine learning and they contained a lot of the intuition about the bots, which helped to advance the Cloudflare Bot Management platform and allowed us to onboard more customers.

Behavioral analysis

Machine learning and heuristics detections provide tremendous value, but both of them require human input on the labels, or basically a teacher to distinguish between right and wrong. While our supervised ML models can generalize well enough even on novel threats similar to what we taught them on, we decided to go further. What if there was an approach which doesn’t require a teacher, but rather can learn to distinguish bad behavior from the normal behavior?

Enter the behavioral analysis detection mechanism, initially developed in 2018 and integrated with the Bot Management platform in 2019. This is an unsupervised machine learning approach, which has the following properties:

  • Fitting specific customer needs – it’s automatically enabled for all Bot Management customers, calculating and analyzing normal visitor behavior over an extended period of time.
  • Detects bots never seen before – as it doesn’t use known bot labels, it can detect bots and anomalies from the normal behavior on specific customer’s website.
  • Harder to evade – anomalous behavior is often a direct result of the bot’s specific goal.

Please stay tuned for a more detailed blog about behavioral analysis models and the platform powering this incredible detection mechanism, protecting many of our customers from unseen attacks.

Verified bots

So far we’ve discussed how to detect bad bots and humans. What about good bots, some of which are extremely useful for the customer website? Is there a need for a dedicated detection mechanism or is there something we could use from previously described detection mechanisms? While the majority of good bot requests (e.g. Googlebot, Bingbot, LinkedInbot) already have low score produced by other detection mechanisms, we also need a way to avoid accidental blocks of useful bots. That’s how the Firewall field cf.bot_management.verified_bot came into existence in 2019, allowing customers to decide for themselves whether they want to let all of the good bots through or restrict access to certain parts of the website.

The actual platform calculating Verified Bot flag deserves a detailed blog on its own, but in the nutshell it has the following properties:

  • Validator based approach – we support multiple validation mechanisms, each of them allowing us to reliably confirm good bot identity by clustering a set of IPs.
  • Reverse DNS validator – performs a reverse DNS check to determine whether or not a bots IP address matches its alleged hostname.
  • ASN Block validator – similar to rDNS check, but performed on ASN block.
  • Downloader validator – collects good bot IPs from either text files or HTML pages hosted on bot owner sites.
  • Machine learning validator – uses an unsupervised learning algorithm, clustering good bot IPs which are not possible to validate through other means.
  • Bots Directory – a database with UI that stores and manages bots that pass through the Cloudflare network.
Cloudflare Bot Management: machine learning and more
Bots directory UI sample‌‌

Using multiple validation methods listed above, the Verified Bots detection mechanism identifies hundreds of unique good bot identities, belonging to different companies and categories.

JS fingerprinting

When it comes to Bot Management detection quality it’s all about the signal quality and quantity. All previously described detections use request attributes sent over the network and analyzed on the server side using different techniques. Are there more signals available, which can be extracted from the client to improve our detections?

As a matter of fact there are plenty, as every browser has unique implementation quirks. Every web browser graphics output such as canvas depends on multiple layers such as hardware (GPU) and software (drivers, operating system rendering). This highly unique output allows precise differentiation between different browser/device types. Moreover, this is achievable without sacrificing website visitor privacy as it’s not a supercookie, and it cannot be used to track and identify individual users, but only to confirm that request’s user agent matches other telemetry gathered through browser canvas API.

This detection mechanism is implemented as a challenge-response system with challenge injected into the webpage on Cloudflare’s edge. The challenge is then rendered in the background using provided graphic instructions and the result sent back to Cloudflare for validation and further action such as  producing the score. There is a lot going on behind the scenes to make sure we get reliable results without sacrificing users’ privacy while being tamper resistant to replay attacks. The system is currently in private beta and being evaluated for its effectiveness and we already see very promising results. Stay tuned for this new detection mechanism becoming widely available and the blog on how we’ve built it.

This concludes an overview of the five detection mechanisms we’ve built so far. It’s time to sum it all up!

Summary

Cloudflare has the unique ability to collect data from trillions of requests flowing through its network every week. With this data, Cloudflare is able to identify likely bot activity with Machine Learning, Heuristics, Behavioral Analysis, and other detection mechanisms. Cloudflare Bot Management integrates seamlessly with other Cloudflare products, such as WAF  and Workers.

Cloudflare Bot Management: machine learning and more

All this could not be possible without hard work across multiple teams! First of all thanks to everybody on the Bots Team for their tremendous efforts to make this platform come to life. Other Cloudflare teams, most notably: Firewall, Data, Solutions Engineering, Performance, SRE, helped us a lot to design, build and support this incredible platform.

Cloudflare Bot Management: machine learning and more
Bots team during Austin team summit 2019 hunting bots with axes 🙂

Lastly, there are more blogs from the Bots series coming soon, diving into internals of our detection mechanisms, so stay tuned for more exciting stories about Cloudflare Bot Management!