Tag Archives: bots

Introducing Ephemeral IDs: a new tool for fraud detection

2024-09-23 Oliver Payne

Post Syndicated from Oliver Payne original https://blog.cloudflare.com/turnstile-ephemeral-ids-for-fraud-detection

In the early days of the Internet, a single IP address was a reliable indicator of a single user. However, today’s Internet is more complex. Shared IP addresses are now common, with users connecting via mobile IP address pools, VPNs, or behind CGNAT (Carrier Grade Network Address Translation). This makes relying on IP addresses alone a weak method to combat modern threats like automated attacks and fraudulent activity. Additionally, many Internet users have no option but to use an IP address which they don’t have sole control over, and as such, should not be penalized for that.

At Cloudflare, we are solving this complexity with Turnstile, our CAPTCHA alternative. And now, we’re taking the next step in advancing security with Ephemeral IDs, a new feature that generates a unique short-lived ID, without relying on any network-level information.

When a website visitor interacts with Turnstile, we now calculate an Ephemeral ID that can link behavior to a specific client instead of an IP address. This means that even when attackers rotate through large pools of IP addresses, we can still identify and block malicious actions. For example, in attacks like credential stuffing or account signups, where fraudsters attempt to disguise themselves using different IP addresses, Ephemeral IDs allow us to detect abuse patterns more accurately beyond just determining whether the visitor is a human or a bot. Multiple fraudulent actions from the same client are grouped together, improving our detection rate while reducing false positives.

How Ephemeral IDs work

Turnstile detects bots by analyzing browser attributes and signals. Using these aggregated client-side signals, we generate a short-lived Ephemeral ID without setting any cookies or using similar client-side storage. These IDs are intentionally not 100% unique and have a brief lifespan, making them highly effective in identifying patterns of fraud and abuse, without compromising user privacy.

When the same visitor interacts with Turnstile widgets from different Cloudflare customers, they receive different Ephemeral IDs for each one. Additionally, because these IDs change frequently, they cannot be used to track a single visitor over multiple days.

_{Blue: A single IP address | Green: A single Ephemeral ID}_{The bigger the node, the more frequently seen that ID or IP address was in our dataset.}

The graphic above illustrates the complex reality of the modern Internet, where the relationship between clients and IP addresses is far from a simple one-to-one mapping. While some straightforward mappings still exist, they are no longer the norm.

During a period where a site or service is under attack, we observe a “nest” of highly correlated Ephemeral IDs. In the example below, the correlation is based on both Ephemeral ID and IP address.

_{Nest in the center of the diagram visualizes thousands of IP addresses (blue) which are correlated by the commonly identified Ephemeral IDs (green). The bigger the node, the more frequently seen that ID or IP address was in our dataset.}

This is real-world data showing fraudulent activity on one of Cloudflare’s public-facing forms. Even with access to a broad range of IP addresses, attackers struggle to completely disguise their requests because Ephemeral IDs are generated based on patterns beyond IP addresses. This means that even if they rotate addresses, the underlying client characteristics are still detected, making it harder for them to evade our security measures. This makes it easier for us to group these requests and apply appropriate business logic, whether that means discarding the requests, requiring further validation, enforcing multi-factor authentication (MFA), or other actions.

This new client identification technology seamlessly integrates into the broader advancements we’ve made to Turnstile over the past year. Whether you’re protecting login forms, signup pages, or high value transactions, you’ll immediately benefit from this extra layer of abuse detection without needing to change a single line of code. We’ll take care of all the heavy lifting and analysis behind the scenes, and our system will continue to improve its accuracy and effectiveness over time.

What does this mean for you? Starting today, Turnstile will go beyond just identifying bots. All websites protected by Turnstile will automatically benefit from the integration of Ephemeral IDs into our detection logic. This means we can more effectively identify and penalize offending clients without impacting other users on the same network, or IP address, improving security and user experience for everyone.

Ephemeral IDs in action

Everyone benefits from the addition of Ephemeral IDs to the Challenge Platform, but for those who want to use it beyond that, the Ephemeral ID is available through the Turnstile siteverify response. A practical use case for Ephemeral IDs is preventing fraudulent account signups. Imagine a bad actor, a real person using a real device, creating hundreds of fake accounts while rotating IP addresses to avoid detection. By ingesting Ephemeral IDs and logging them alongside your account creation logs, you can set up alerts based on account creation thresholds in real-time or retroactively investigate suspicious activity. Even though Ephemeral IDs are short-lived and may have changed by the time an investigation begins, they still provide valuable insights through aggregate analysis, and provide an extra dimension to identify fraud and abuse.

For our Turnstile Enterprise and Bot Management Enterprise customers, you now have the option to access Ephemeral IDs directly through the Turnstile siteverify response. Get in touch with your Account Executive to enable it on your account.

Below is an example of siteverify response for those who have enabled Ephemeral IDs.

curl 'https://challenges.cloudflare.com/turnstile/v0/siteverify' --data 'secret=verysecret&response=<RESPONSE>'

{
    "success": true,
    "error-codes": [],
    "challenge_ts": "2024-09-10T17:29:00.463Z",
    "hostname": "example.com",
    "metadata": {
        "ephemeral_id": "x:9f78e0ed210960d7693b167e"
    }
}

What’s next for Turnstile?

We launched Turnstile with a bold mission: to redefine CAPTCHAs with a frictionless, privacy-first solution that eliminates the annoyance of picking puzzles, selecting stoplights, and clicking crosswalks to prove our humanity. It’s incredible to think that Turnstile has been generally available for a whole year now! During this time, it has blocked over one trillion bots, and is actively protecting more than 350,000 domains worldwide.

As we celebrate Turnstile’s second birthday, we’re proud of the progress we’ve made and thrilled to introduce our latest innovations. While Ephemeral IDs represent the newest evolution of Turnstile, they’re part of our ongoing commitment to continuous improvement. Over the past year, we’ve also introduced a Cloudflare Pages Plugin and partnered with Google Firebase, ensuring that developers have easy access to Turnstile.

Earlier this year, we also launched Pre-Clearance for Turnstile, integrating it with Cloudflare WAF’s Challenge action, making it easier for customers to use Cloudflare’s Application Security products together. If you want to learn more about how to use Turnstile with Cloudflare’s Bot Management and WAF in more detail, check it out here!

We’re incredibly excited about what’s ahead. The introduction of Ephemeral IDs is just one of many innovations on the horizon. We’re committed to making the Internet a safer, more private place for everyone, eliminating the need for frustrating CAPTCHA puzzles while keeping security our top priority. And with our free tier remaining open and unlimited for all, there’s no barrier to getting started with Turnstile today.

Join us in revolutionizing online security – get started with Turnstile now or dive straight into our how-to guides. Let’s help make the Internet a better place, together!

Declaring your AIndependence: block AI bots, scrapers and crawlers with a single click

2024-07-03 Alex Bocharov

Post Syndicated from Alex Bocharov original https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click

To help preserve a safe Internet for content creators, we’ve just launched a brand new “easy button” to block all AI bots. It’s available for all customers, including those on our free tier.

The popularity of generative AI has made the demand for content used to train models or run inference on skyrocket, and, although some AI companies clearly identify their web scraping bots, not all AI companies are being transparent. Google reportedly paid $60 million a year to license Reddit’s user generated content, Scarlett Johansson alleged OpenAI used her voice for their new personal assistant without her consent, and most recently, Perplexity has been accused of impersonating legitimate visitors in order to scrape content from websites. The value of original content in bulk has never been higher.
Last year, Cloudflare announced the ability for customers to easily block AI bots that behave well. These bots follow robots.txt, and don’t use unlicensed content to train their models or run inference for RAG applications using website data. Even though these AI bots follow the rules, Cloudflare customers overwhelmingly opt to block them.

We hear clearly that customers don’t want AI bots visiting their websites, and especially those that do so dishonestly. To help, we’ve added a brand new one-click to block all AI bots. It’s available for all customers, including those on the free tier. To enable it, simply navigate to the Security > Bots section of the Cloudflare dashboard, and click the toggle labeled AI Scrapers and Crawlers.

This feature will automatically be updated over time as we see new fingerprints of offending bots we identify as widely scraping the web for model training. To ensure we have a comprehensive understanding of all AI crawler activity, we surveyed traffic across our network.

AI bot activity today

The graph below illustrates the most popular AI bots seen on Cloudflare’s network in terms of their request volume. We looked at common AI crawler user agents and aggregated the number of requests on our platform from these AI user agents over the last year:

When looking at the number of requests made to Cloudflare sites, we see that Bytespider, Amazonbot, ClaudeBot, and GPTBot are the top four AI crawlers. Operated by ByteDance, the Chinese company that owns TikTok, Bytespider is reportedly used to gather training data for its large language models (LLMs), including those that support its ChatGPT rival, Doubao. Amazonbot and ClaudeBot follow Bytespider in request volume. Amazonbot, reportedly used to index content for Alexa’s question-answering, sent the second-most number of requests and ClaudeBot, used to train the Claude chat bot, has recently increased in request volume.

Among the top AI bots that we see, Bytespider not only leads in terms of number of requests but also in both the extent of its Internet property crawling and the frequency with which it is blocked. Following closely is GPTBot, which ranks second in both crawling and being blocked. GPTBot, managed by OpenAI, collects training data for its LLMs, which underpin AI-driven products such as ChatGPT. In the table below, “Share of websites accessed” refers to the proportion of websites protected by Cloudflare that were accessed by the named AI bot.

AI Bot	Share of Websites Accessed
Bytespider	40.40%
GPTBot	35.46%
ClaudeBot	11.17%
ImagesiftBot	8.75%
CCBot	2.14%
ChatGPT-User	1.84%
omgili	0.10%
Diffbot	0.08%
Claude-Web	0.04%
PerplexityBot	0.01%

While our analysis identified the most popular crawlers in terms of request volume and number of Internet properties accessed, many customers are likely not aware of the more popular AI crawlers actively crawling their sites. Our Radar team performed an analysis of the top robots.txt entries across the top 10,000 Internet domains to identify the most commonly actioned AI bots, then looked at how frequently we saw these bots on sites protected by Cloudflare.

In the graph below, which looks at disallowed crawlers for these sites, we see that customers most often reference GPTBot, CCBot, and Google in robots.txt, but do not specifically disallow popular AI crawlers like Bytespider and ClaudeBot.

With the Internet now flooded with these AI bots, we were curious to see how website operators have already responded. In June, AI bots accessed around 39% of the top one million Internet properties using Cloudflare, but only 2.98% of these properties took measures to block or challenge those requests. Moreover, the higher-ranked (more popular) an Internet property is, the more likely it is to be targeted by AI bots, and correspondingly, the more likely it is to block such requests.

Top N Internet properties by number of visitors seen by Cloudflare	% accessed by AI bots	% blocking AI bots
10	80.0%	40.0%
100	63.0%	16.0%
1,000	53.2%	8.8%
10,000	47.99%	8.92%
100,000	44.53%	6.36%
1,000,000	38.73%	2.98%

We see website operators completely block access to these AI crawlers using robots.txt. However, these blocks are reliant on the bot operator respecting robots.txt and adhering to RFC9309 (ensuring variations on user against all match the product token) to honestly identify who they are when they visit an Internet property, but user agents are trivial for bot operators to change.

How we find AI bots pretending to be real web browsers

Sadly, we’ve observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent. We’ve monitored this activity over time, and we’re proud to say that our global machine learning model has always recognized this activity as a bot, even when operators lie about their user agent.

Take one example of a specific bot that others observed to be hiding their activity. We ran an analysis to see how our machine learning models scored traffic from this bot. In the diagram below, you can see that all bot scores are firmly below 30, indicating that our scoring thinks this activity is likely to be coming from a bot.

The diagram reflects scoring of the requests using our newest model, where “hotter” colors indicate more requests falling in that band, and “cooler” colors meaning fewer requests did. We can see the vast majority of requests fell into the bottom two bands, showing that Cloudflare’s model gave the offending bot a score of 9 or less. The user agent changes have no effect on the score, because this is the very first thing we expect bot operators to do.

Any customer with an existing WAF rule set to challenge visitors with a bot score below 30 (our recommendation) automatically blocked all of this AI bot traffic with no new action on their part. The same will be true for future AI bots that use similar techniques to hide their activity.

We leverage Cloudflare global signals to calculate our Bot Score, which for AI bots like the one above, reflects that we correctly identify and score them as a “likely bot.”

When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint. To power our models, we compute global aggregates across many signals. Based on these signals, our models were able to appropriately flag traffic from evasive AI bots, like the example mentioned above, as bots.

The upshot of this globally aggregated data is that we can immediately detect new scraping tools and their behavior without needing to manually fingerprint the bot, ensuring that customers stay protected from the newest waves of bot activity.

If you have a tip on an AI bot that’s not behaving, we’d love to investigate. There are two options you can use to report misbehaving AI crawlers:

1. Enterprise Bot Management customers can submit a False Negative Feedback Loop report via Bot Analytics by simply selecting the segment of traffic where they noticed misbehavior:

2. We’ve also set up a reporting tool where any Cloudflare customer can submit reports of an AI bot scraping your website without permission.

We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection. We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on.

Using machine learning to detect bot attacks that leverage residential proxies

2024-06-24 Bob AminAzad

Post Syndicated from Bob AminAzad original https://blog.cloudflare.com/residential-proxy-bot-detection-using-machine-learning

Bots using residential proxies are a major source of frustration for security engineers trying to fight online abuse. These engineers often see a similar pattern of abuse when well-funded, modern botnets target their applications. Advanced bots bypass country blocks, ASN blocks, and rate-limiting. Every time, the bot operator moves to a new IP address space until they blend in perfectly with the “good” traffic, mimicking real users’ behavior and request patterns. Our new Bot Management machine learning model (v8) identifies residential proxy abuse without resorting to IP blocking, which can cause false positives for legitimate users.

Background

One of the main sources of Cloudflare’s bot score is our bot detection machine learning model which analyzes, on average, over 46 million HTTP requests per second in real time. Since our first Bot Management ML model was released in 2019, we have continuously evolved and improved the model. Nowadays, our models leverage features based on request fingerprints, behavioral signals, and global statistics and trends that we see across our network.

Each iteration of the model focuses on certain areas of improvement. This process starts with a rigorous R&D phase to identify the emerging patterns of bot attacks by reviewing feedback from our customers and reports of missed attacks. In v8, we mainly focused on two areas of abuse. First, we analyzed the campaigns that leverage residential IP proxies, which are proxies on residential networks commonly used to launch widely distributed attacks against high profile targets. In addition to that, we improved model accuracy for detecting attacks that originate from cloud providers.

Residential IP proxies

Proxies allow attackers to hide their identity and distribute their attack. Moreover, IP address rotation allows attackers to directly bypass traditional defenses such as IP reputation and IP rate limiting. Knowing this, defenders use a plethora of signals to identify malicious use of proxies. In its simplest forms, IP reputation signals (e.g., data center IP addresses, known open proxies, etc.) can lead to the detection of such distributed attacks.

However, in the past few years, bot operators have started favoring proxies operating in residential network IP address space. By using residential IP proxies, attackers can masquerade as legitimate users by sending their traffic through residential networks. Nowadays, residential IP proxies are offered by companies that facilitate access to large pools of IP addresses for attackers. Residential proxy providers claim to offer 30-100 million IPs belonging to residential and mobile networks across the world. Most commonly, these IPs are sourced by partnering with free VPN providers, as well as including the proxy SDKs into popular browser extensions and mobile applications. This allows residential proxy providers to gain a foothold on victims’ devices and abuse their residential network connections.

Figure 1: Architecture of a residential proxy network

Figure 1 depicts the architecture of a residential proxy. By subscribing to these services, attackers gain access to an authenticated proxy gateway address commonly using the HTTPS/SOCKS5 proxy protocol. Some residential proxy providers allow their users to select the country or region for the proxy exit nodes. Alternatively, users can choose to keep the same IP address throughout their session or rotate to a new one for each outgoing request. Residential proxy providers then identify active exit nodes on their network (on devices that they control within residential networks across the world) and route the proxied traffic through them.

The large pool of IP addresses and the diversity of networks poses a challenge to traditional bot defense mechanisms that rely on IP reputation and rate limiting. Moreover, the diversity of IPs enables the attackers to rotate through them indefinitely. This shrinks the window of opportunity for bot detection systems to effectively detect and stop the attacks. Effective defense against residential proxy attacks should be able to detect this type of bot traffic either based on single request features to stop the attack immediately, or identify unique fingerprints from the browsing agent to track and mitigate the bot traffic regardless of the IP source. Overly broad blocking actions, such as IP block-listing, by definition, would result in blocking legitimate traffic from residential networks where at least one device is acting as a residential proxy node.

ML model training

At its heart, our model is built using a chain of modules that work together. Initially, we fetch and prepare training and validation datasets from our Clickhouse data storage. We use datasets with high confidence labels as part of our training. For model validation, we use datasets consisting of missed attacks reported by our customers, known sources of bot traffic (e.g., verified bots), and high confidence detections from other bot management modules (e.g., heuristics engine). We orchestrate these steps using Apache Airflow, which enables us to customize each stage of the ML model training and define the interdependencies of our training, validation, and reporting modules in the form of directed acyclic graphs (DAGs).

The first step of training a new model is fetching labeled training data from our data store. Under the hood, our dataset definitions are SQL queries that will materialize by fetching data from our Clickhouse cluster where we store feature values and calculate aggregates from the traffic on our network. Figure 2 depicts these steps as train and validation dataset fetch operations. Introducing new datasets can be as straightforward as writing the SQL queries to filter the desired subset of requests.

Figure 2: Airflow DAG for model training and validation

After fetching the datasets, we train our Catboost model and tune its hyper parameters. During evaluation, we compare the performance of the newly trained model against the current default version running for our customers. To capture the intricate patterns in subsets of our data, we split certain validation datasets into smaller slivers called specializations. For instance, we use the detections made by our heuristics engine and managed rulesets as ground truth for bot traffic. To ensure that larger sources of traffic (large ASNs, different HTTP versions, etc.) do not mask our visibility into patterns for the rest of the traffic, we define specializations for these sources of traffic. As a result, improvements in accuracy of the new model can be evaluated for common patterns (e.g., HTTP/1.1 and HTTP/2) as well as less common ones. Our model training DAG will provide a breakdown report for the accuracy, score distribution, feature importance, and SHAP explainers for each validation dataset and its specializations.

Once we are happy with the validation results and model accuracy, we evaluate our model against a checklist of steps to ensure the correctness and validity of our model. We start by ensuring that our results and observations are reproducible over multiple non-overlapping training and validation time ranges. Moreover, we check for the following factors:

Check for the distribution of feature values to identify irregularities such as missing or skewed values.
Check for overlaps between training and validation datasets and feature values.
Verify the diversity of training data and the balance between labels and datasets.
Evaluate performance changes in the accuracy of the model on validation datasets based on their order of importance.
Check for model overfitting by evaluating the feature importance and SHAP explainers.

After the model passes the readiness checks, we deploy it in shadow mode. We can observe the behavior of the model on live traffic in log-only mode (i.e., without affecting the bot score). After gaining confidence in the model’s performance on live traffic, we start onboarding beta customers, and gradually switch the model to active mode all while closely monitoring the real-world performance of our new model.

ML features for bot detection

Each of our models uses a set of features to make inferences about the incoming requests. We compute our features based on single request properties (single request features) and patterns from multiple requests (i.e., inter-request features). We can categorize these features into the following groups:

Global features: inter-request features that are computed based on global aggregates for different types of fingerprints and traffic sources (e.g., for an ASN) seen across our global network. Given the relatively lower cardinality of these features, we can scalably calculate global aggregates for each of them.
High cardinality features: inter-request features focused on fine-grained aggregate data from local traffic patterns and behaviors (e.g., for an individual IP address)
Single request features: features derived from each individual request (e.g., user agent).

Our Bot Management system (named BLISS) is responsible for fetching and computing these feature values and making them available on our servers for inference by active versions of our ML models.

Detecting residential proxies using network and behavioral signals

Attacks originating from residential IP addresses are commonly characterized by a spike in the overall traffic towards sensitive endpoints on the target websites from a large number of residential ASNs. Our approach for detecting residential IP proxies is twofold. First, we start by comparing direct vs proxied requests and looking for network level discrepancies. Revisiting Figure 1, we notice that a request routed through residential proxies (red dotted line) has to traverse through multiple hops before reaching the target, which affects the network latency of the request.

Based on this observation alone, we are able to characterize residential proxy traffic with a high true positive rate (i.e., all residential proxy requests have high network latency). While we were able to replicate this in our lab environment, we quickly realized that at the scale of the Internet, we run into numerous exceptions with false positive detections (i.e., non-residential proxy traffic with high latency). For instance, countries and regions that predominantly use satellite Internet would exhibit a high network latency for the majority of their requests due to the use of performance enhancing proxies.

Realizing that relying solely on network characteristics of connections to detect residential proxies is inadequate given the diversity of the connections on the Internet, we switched our focus to the behavior of residential IPs. To that end, we observe that the IP addresses from residential proxies express a distinct behavior during periods of peak activity. While this observation singles out highly active IPs over their peak activity time, given the pool size of residential IPs, it is not uncommon to only observe a small number of requests from the majority of residential proxy IPs.

These periods of inactivity can be attributed to the temporary nature of residential proxy exit nodes. For instance, when the client software (i.e., browser or mobile application) that runs the exit nodes of these proxies is closed, the node leaves the residential proxy network. One way to filter out periods of inactivity is to increase the monitoring time and punish each IP address that exhibits residential proxy behavior for a period of time. This block-listing approach, however, has certain limitations. Most importantly, by relying only on IP-based behavioral signals, we would block traffic from legitimate users that may unknowingly run mobile applications or browser extensions that turn their devices into proxies. This is further detrimental for mobile networks where many users share their IPs behind CGNATs. Figure 3 demonstrates this by comparing the share of direct vs proxied requests that we received from active residential proxy IPs over a 24-hour period. Overall, we see that 4 out of 5 requests from these networks belong to direct and benign connections from residential devices.

Figure 3: Percentage of direct vs proxied requests from residential proxy IPs.

Using this insight, we combined behavioral and latency-based features along with new datasets to train a new machine learning model that detects residential proxy traffic on a per-request basis. This scheme allows us to block residential proxy traffic while allowing benign residential users to visit Cloudflare-protected websites from the same residential network.

Detection results and case studies

We started testing v8 in shadow mode in March 2024. Every hour, v8 is classifying more than 17 million unique IPs that participate in residential proxy attacks. Figure 4 shows the geographic distribution of IPs with residential proxy activity belonging to more than 45 thousand ASNs in 237 countries/regions. Among the most commonly requested endpoints from residential proxies, we observe patterns of account takeover attempts, such as requests to /login, /auth/login, and /api/login.

Figure 4: Countries and regions with residential network activity. Size of markers are proportionate to the number of IPs with residential proxy activity.

Furthermore, we see significant improvements when evaluating our new machine learning model on previously missed attacks reported by our customers. In one case, v8 was able to correctly classify 95% of requests from distributed residential proxy attacks targeting the voucher redemption endpoint of the customer’s website. In another case, our new model successfully detected a previously missed content scraping attack evident by increased detection during traffic spikes depicted in Figure 5. We are continuing to monitor the behavior of residential proxy attacks in the wild and work with our customers to ensure that we can provide robust detection against these distributed attacks.

Figure 5: Spikes in bot requests from residential proxies detected by ML v8

Improving detection for bots from cloud providers

In addition to residential IP proxies, bot operators commonly use cloud providers to host and run bot scripts that attack our customers. To combat these attacks, we improved our ground truth labels for cloud provider attacks in our latest ML training datasets. Early results show that v8 detects 20% more bots from cloud providers, with up to 70% more bots detected on zones that are marked as under attack. We further plan to expand the list of cloud providers that v8 detects as part of our ongoing updates.

Check out ML v8

For existing Bot Management customers we recommend toggling “Auto-update machine learning model” to instantly gain the benefits of ML v8 and its residential proxy detection, and to stay up to date with our future ML model updates. If you’re not a Cloudflare Bot Management customer, contact our sales team to try out Bot Management.

New Consent and Bot Management features for Cloudflare Zaraz

2024-05-15 Yo'av Moshe

Post Syndicated from Yo'av Moshe original https://blog.cloudflare.com/new-consent-and-bot-management-features-for-cloudflare-zaraz

Managing consent online can be challenging. After you’ve figured out the necessary regulations, you usually need to configure some Consent Management Platform (CMP) to load all third-party tools and scripts on your website in a way that respects these demands. Cloudflare Zaraz manages the loading of all of these third-party tools, so it was only natural that in April 2023 we announced the Cloudflare Zaraz CMP: the simplest way to manage consent in a way that seamlessly integrates with your third-party tools manager.

As more and more third-party tool vendors are required to handle consent properly, our CMP has evolved to integrate with these new technologies and standardization efforts. Today, we’re happy to announce that the Cloudflare Zaraz CMP is now compatible with the Interactive Advertising Bureau Transparency and Consent Framework (IAB TCF) requirements, and fully supports Google’s Consent Mode v2 signals. Separately, we’ve taken efforts to improve the way Cloudflare Zaraz handles traffic coming from online bots.

IAB TCF Compatibility

Earlier this year, Google announced that websites that would like to use AdSense and other advertising solutions in the European Economic Area (EEA), the UK, and Switzerland, will be required to use a CMP that is approved by IAB Europe, an association for digital marketing and advertising. Their Transparency and Consent Framework sets guidelines for how CMPs should operate. Since March 2024, the Cloudflare Zaraz CMP is compliant with these guidelines, and Zaraz users in Europe can use Google’s advertising products without any restrictions.

Since the IAB TCF requirements can make the consent modal a little complex for users, we have made this compliance mode an opt-in feature. See the official documentation for information on how to enable it.

Google Consent Mode v2

Another new requirement from Google was the need to send “Consent Signals”. These signals are part of what is also known as “Consent Mode”, and later, Consent Mode v2. Together with each event sent to Google Analytics and Google Ads, they tell the Google servers about the consent status of the current visitor – did they agree to have their data used for personalized advertising? Did they accept cookies? These and other questions are answered by Consent Mode v2, telling the Google servers how to treat the data it receives.

Consent Mode v2 usually requires setting two values for each consent category – a default value and an updated one. The default value represents the consent status (granted or denied) a certain category (e.g. using cookies) has before the user has submitted their personal preferences. Usually, and especially within the EU, the default value would be `denied`. Once the user submits their preferences, Consent Mode v2 sends an additional “updated” value that represents the choice the user made.

Implementing Consent Mode v2 is quick and easy with Cloudflare Zaraz, although the specific implementation depends on your CMP. Examples, including integration with the Cloudflare Zaraz CMP, are available in our official documentation.

We believe that better standardization around online consent benefits everyone, and we are glad to be working on tools that respect users’ privacy and improve online user experience.

Bot Management

We also recently integrated better Bot Management support within Cloudflare Zaraz. You often want crawlers to be able to access your website, but you don’t want them to trigger your analytics and conversion pixels. Using the Bot Management feature in the Cloudflare Zaraz Settings page allows you to fine tune which requests will make it to Cloudflare Zaraz and which ones will be skipped. Since Zaraz pricing is based on the total number of Zaraz Events, this can also be useful if you want more control over your Cloudflare Zaraz costs, ensuring you will not be paying for events triggered by bots. Like all other Cloudflare Zaraz features, these new features are also available to users on all plans, including the free plan. For us, it is part of making sure that everyone can benefit from a faster, safer, and more private way to manage third parties online. If you haven’t started using Cloudflare Zaraz already, now is a great time. Go to the Cloudflare dashboard and set it up in just a few clicks.

Building secure websites: a guide to Cloudflare Pages and Turnstile Plugin

2024-03-07 Sally Lee

Post Syndicated from Sally Lee original https://blog.cloudflare.com/guide-to-cloudflare-pages-and-turnstile-plugin

Balancing developer velocity and security against bots is a constant challenge. Deploying your changes as quickly and easily as possible is essential to stay ahead of your (or your customers’) needs and wants. Ensuring your website is safe from malicious bots — without degrading user experience with alien hieroglyphics to decipher just to prove that you are a human — is no small feat. With Pages and Turnstile, we’ll walk you through just how easy it is to have the best of both worlds!

Cloudflare Pages offer a seamless platform for deploying and scaling your websites with ease. You can get started right away with configuring your websites with a quick integration using your git provider, and get set up with unlimited requests, bandwidth, collaborators, and projects.

Cloudflare Turnstile is Cloudflare’s CAPTCHA alternative solution where your users don’t ever have to solve another puzzle to get to your website, no more stop lights and fire hydrants. You can protect your site without having to put your users through an annoying user experience. If you are already using another CAPTCHA service, we have made it easy for you to migrate over to Turnstile with minimal effort needed. Check out the Turnstile documentation to get started.

Alright, what are we building?

In this tutorial, we’ll walk you through integrating Cloudflare Pages with Turnstile to secure your website against bots. You’ll learn how to deploy Pages, embed the Turnstile widget, validate the token on the server side, and monitor Turnstile analytics. Let’s build upon this tutorial from Cloudflare’s developer docs, which outlines how to create an HTML form with Pages and Functions. We’ll also show you how to secure it by integrating with Turnstile, complete with client-side rendering and server-side validation, using the Turnstile Pages Plugin!

Step 1: Deploy your Pages

On the Cloudflare Dashboard, select your account and go to Workers & Pages to create a new Pages application with your git provider. Choose the repository where you cloned the tutorial project or any other repository that you want to use for this walkthrough.

The Build settings for this project is simple:

Framework preset: None
Build command: npm install @cloudflare/pages-plugin-turnstile
Build output directory: public

Once you select “Save and Deploy”, all the magic happens under the hood and voilà! The form is already deployed.

Step 2: Embed Turnstile widget

Now, let’s navigate to Turnstile and add the newly created Pages site.

Here are the widget configuration options:

Domain: All you need to do is add the domain for the Pages application. In this example, it’s “pages-turnstile-demo.pages.dev”. For each deployment, Pages generates a deployment specific preview subdomain. Turnstile covers all subdomains automatically, so your Turnstile widget will work as expected even in your previews. This is covered more extensively in our Turnstile domain management documentation.
Widget Mode: There are three types of widget modes you can choose from.
Managed: This is the recommended option where Cloudflare will decide when further validation through the checkbox interaction is required to confirm whether the user is a human or a bot. This is the mode we will be using in this tutorial.

Non-interactive: This mode does not require the user to interact and check the box of the widget. It is a non-intrusive mode where the widget is still visible to users but requires no added step in the user experience.

Invisible: Invisible mode is where the widget is not visible at all to users and runs in the background of your website.
Pre-Clearance setting: With a clearance cookie issued by the Turnstile widget, you can configure your website to verify every single request or once within a session. To learn more about implementing pre-clearance, check out this blog post.

Once you create your widget, you will be given a sitekey and a secret key. The sitekey is public and used to invoke the Turnstile widget on your site. The secret key should be stored safely for security purposes.

Let’s embed the widget above the Submit button. Your index.html should look like this:

<!doctype html>
<html lang="en">
	<head>
		<meta charset="utf8">
		<title>Cloudflare Pages | Form Demo</title>
		<meta name="theme-color" content="#d86300">
		<meta name="mobile-web-app-capable" content="yes">
		<meta name="apple-mobile-web-app-capable" content="yes">
		<meta name="viewport" content="width=device-width,initial-scale=1">
		<link rel="icon" type="image/png" href="https://www.cloudflare.com/favicon-128.png">
		<link rel="stylesheet" href="/index.css">
		<script src="https://challenges.cloudflare.com/turnstile/v0/api.js?onload=_turnstileCb" defer></script>
	</head>
	<body>

		<main>
			<h1>Demo: Form Submission</h1>

			<blockquote>
				<p>This is a demonstration of Cloudflare Pages with Turnstile.</p>
				<p>Pages deployed a <code>/public</code> directory, containing a HTML document (this webpage) and a <code>/functions</code> directory, which contains the Cloudflare Workers code for the API endpoint this <code>&lt;form&gt;</code> references.</p>
				<p><b>NOTE:</b> On form submission, the API endpoint responds with a JSON representation of the data. There is no JavaScript running in this example.</p>
			</blockquote>

			<form method="POST" action="/api/submit">
				<div class="input">
					<label for="name">Full Name</label>
					<input id="name" name="name" type="text" />
				</div>

				<div class="input">
					<label for="email">Email Address</label>
					<input id="email" name="email" type="email" />
				</div>

				<div class="input">
					<label for="referers">How did you hear about us?</label>
					<select id="referers" name="referers">
						<option hidden disabled selected value></option>
						<option value="Facebook">Facebook</option>
						<option value="Twitter">Twitter</option>
						<option value="Google">Google</option>
						<option value="Bing">Bing</option>
						<option value="Friends">Friends</option>
					</select>
				</div>

				<div class="checklist">
					<label>What are your favorite movies?</label>
					<ul>
						<li>
							<input id="m1" type="checkbox" name="movies" value="Space Jam" />
							<label for="m1">Space Jam</label>
						</li>
						<li>
							<input id="m2" type="checkbox" name="movies" value="Little Rascals" />
							<label for="m2">Little Rascals</label>
						</li>
						<li>
							<input id="m3" type="checkbox" name="movies" value="Frozen" />
							<label for="m3">Frozen</label>
						</li>
						<li>
							<input id="m4" type="checkbox" name="movies" value="Home Alone" />
							<label for="m4">Home Alone</label>
						</li>
					</ul>
				</div>
				<div id="turnstile-widget" style="padding-top: 20px;"></div>
				<button type="submit">Submit</button>
			</form>
		</main>
	<script>
	// This function is called when the Turnstile script is loaded and ready to be used.
	// The function name matches the "onload=..." parameter.
	function _turnstileCb() {
	    console.debug('_turnstileCb called');

	    turnstile.render('#turnstile-widget', {
	      sitekey: '0xAAAAAAAAAXAAAAAAAAAAAA',
	      theme: 'light',
	    });
	}
	</script>
	</body>
</html>

You can embed the Turnstile widget implicitly or explicitly. In this tutorial, we will explicitly embed the widget by injecting the JavaScript tag and related code, then specifying the placement of the widget.

<script src="https://challenges.cloudflare.com/turnstile/v0/api.js?onload=_turnstileCb" defer></script>

<script>
	function _turnstileCb() {
	    console.debug('_turnstileCb called');

	    turnstile.render('#turnstile-widget', {
	      sitekey: '0xAAAAAAAAAXAAAAAAAAAAAA',
	      theme: 'light',
	    });
	}
</script>

Make sure that the div id you assign is the same as the id you specify in turnstile.render call. In this case, let’s use “turnstile-widget”. Once that’s done, you should see the widget show up on your site!

<div id="turnstile-widget" style="padding-top: 20px;"></div>

Step 3: Validate the token

Now that the Turnstile widget is rendered on the front end, let’s validate it on the server side and check out the Turnstile outcome. We need to make a call to the /siteverify API with the token in the submit function under ./functions/api/submit.js.

First, grab the token issued from Turnstile under cf-turnstile-response. Then, call the /siteverify API to ensure that the token is valid. In this tutorial, we’ll attach the Turnstile outcome to the response to verify everything is working well. You can decide on the expected behavior and where to direct the user based on the /siteverify response.

/**
 * POST /api/submit
 */

import turnstilePlugin from "@cloudflare/pages-plugin-turnstile";

// This is a demo secret key. In prod, we recommend you store
// your secret key(s) safely. 
const SECRET_KEY = '0x4AAAAAAASh4E5cwHGsTTePnwcPbnFru6Y';

export const onRequestPost = [
    turnstilePlugin({
    	secret: SECRET_KEY,
    }),
    (async (context) => {
    	// Request has been validated as coming from a human
    	const formData = await context.request.formData()

    	var tmp, outcome = {};
	for (let [key, value] of formData) {
		tmp = outcome[key];
		if (tmp === undefined) {
			outcome[key] = value;
		} else {
			outcome[key] = [].concat(tmp, value);
		}
	}

	// Attach Turnstile outcome to the response
	outcome["turnstile_outcome"] = context.data.turnstile;

	let pretty = JSON.stringify(outcome, null, 2);

      	return new Response(pretty, {
      		headers: {
      			'Content-Type': 'application/json;charset=utf-8'
      		}
      	});
    })
];

Since Turnstile accurately decided that the visitor was not a bot, the response for “success” is “true” and “interactive” is “false”. The “interactive” being “false” means that the checkbox was automatically checked by Cloudflare as the visitor was determined to be human. The user was seamlessly allowed access to the website without having to perform any additional actions. If the visitor looks suspicious, Turnstile will become interactive, requiring the visitor to actually click the checkbox to verify that they are not a bot. We used the managed mode in this tutorial but depending on your application logic, you can choose the widget mode that works best for you.

{
  "name": "Sally Lee",
  "email": "[email protected]",
  "referers": "Facebook",
  "movies": "Space Jam",
  "cf-turnstile-response": "0._OHpi7JVN7Xz4abJHo9xnK9JNlxKljOp51vKTjoOi6NR4ru_4MLWgmxt1rf75VxRO4_aesvBvYj8bgGxPyEttR1K2qbUdOiONJUd5HzgYEaD_x8fPYVU6uZPUCdWpM4FTFcxPAnqhTGBVdYshMEycXCVBqqLVdwSvY7Me-VJoge7QOStLOtGgQ9FaY4NVQK782mpPfgVujriDAEl4s5HSuVXmoladQlhQEK21KkWtA1B6603wQjlLkog9WqQc0_3QMiBZzZVnFsvh_NLDtOXykOFK2cba1mLLcADIZyhAho0mtmVD6YJFPd-q9iQFRCMmT2Sz00IToXz8cXBGYluKtxjJrq7uXsRrI5pUUThKgGKoHCGTd_ufuLDjDCUE367h5DhJkeMD9UsvQgr1MhH3TPUKP9coLVQxFY89X9t8RAhnzCLNeCRvj2g-GNVs4-MUYPomd9NOcEmSpklYwCgLQ.jyBeKkV_MS2YkK0ZRjUkMg.6845886eb30b58f15de056eeca6afab8110e3123aeb1c0d1abef21c4dd4a54a1",
  "turnstile_outcome": {
    "success": true,
    "error-codes": [],
    "challenge_ts": "2024-02-28T22:52:30.009Z",
    "hostname": "pages-turnstile-demo.pages.dev",
    "action": "",
    "cdata": "",
    "metadata": {
      "interactive": false
    }
  }
}

Wrapping up

Now that we’ve set up Turnstile, we can head to Turnstile analytics in the Cloudflare Dashboard to monitor the solve rate and widget traffic. Visitor Solve Rate indicates the percentage of visitors who successfully completed the Turnstile widget. A sudden drop in the Visitor Solve Rate could indicate an increase in bot traffic, as bots may fail to complete the challenge presented by the widget. API Solve Rate measures the percentage of visitors who successfully validated their token against the /siteverify API. Similar to the Visitor Solve Rate, a significant drop in the API Solve Rate may indicate an increase in bot activity, as bots may fail to validate their tokens. Widget Traffic provides insights into the nature of the traffic hitting your website. A high number of challenges requiring interaction may suggest that bots are attempting to access your site, while a high number of unsolved challenges could indicate that the Turnstile widget is effectively blocking suspicious traffic.

And that’s it! We’ve walked you through how to easily secure your Pages with Turnstile. Pages and Turnstile are currently available for free for every Cloudflare user to get started right away. If you are looking for a seamless and speedy developer experience to get a secure website up and running, protected by Turnstile, head over to the Cloudflare Dashboard today!

Integrating Turnstile with the Cloudflare WAF to challenge fetch requests

2023-12-18 Adam Martinetti http://blog.cloudflare.com/author/adam-martinetti/

Post Syndicated from Adam Martinetti http://blog.cloudflare.com/author/adam-martinetti/ original https://blog.cloudflare.com/integrating-turnstile-with-the-cloudflare-waf-to-challenge-fetch-requests

Two months ago, we made Cloudflare Turnstile generally available — giving website owners everywhere an easy way to fend off bots, without ever issuing a CAPTCHA. Turnstile allows any website owner to embed a frustration-free Cloudflare challenge on their website with a simple code snippet, making it easy to help ensure that only human traffic makes it through. In addition to protecting a website’s frontend, Turnstile also empowers web administrators to harden browser-initiated (AJAX) API calls running under the hood. These APIs are commonly used by dynamic single-page web apps, like those created with React, Angular, Vue.js.

Today, we’re excited to announce that we have integrated Turnstile with the Cloudflare Web Application Firewall (WAF). This means that web admins can add the Turnstile code snippet to their websites, and then configure the Cloudflare WAF to manage these requests. This is completely customizable using WAF Rules; for instance, you can allow a user authenticated by Turnstile to interact with all of an application’s API endpoints without facing any further challenges, or you can configure certain sensitive endpoints, like Login, to always issue a challenge.

Challenging fetch requests in the Cloudflare WAF

Millions of websites protected by Cloudflare’s WAF leverage our JS Challenge, Managed Challenge, and Interactive Challenge to stop bots while letting humans through. For each of these challenges, Cloudflare intercepts the matching request and responds with an HTML page rendered by the browser, where the user completes a basic task to demonstrate that they’re human. When a user successfully completes a challenge, they receive a cf_clearance cookie, which tells Cloudflare that a user has successfully passed a challenge, the type of challenge, and when it was completed. A clearance cookie can’t be shared between users, and is only valid for the time set by the Cloudflare customer in their Security Settings dashboard.

This process works well, except when a browser receives a challenge on a fetch request and the browser has not previously passed a challenge. On a fetch request, or an XML HTTP Request (XHR), the browser expects to get back simple text (in JSON or XML formats) and cannot render the HTML necessary to run a challenge.

As an example, let’s imagine a pizzeria owner who built an online ordering form in React with a payment page that submits data to an API endpoint that processes payments. When a user views the web form to add their credit card details they can pass a Managed Challenge, but when the user submits their credit card details by making a fetch request, the browser won’t execute the code necessary for a challenge to run. The pizzeria owner’s only option for handling suspicious (but potentially legitimate) requests is to block them, which runs the risk of false positives that could cause the restaurant to lose a sale.

This is where Turnstile can help. Turnstile allows anyone on the Internet to embed a Cloudflare challenge anywhere on their website. Before today, the output of Turnstile was only a one-time use token. To enable customers to issue challenges for these fetch requests, Turnstile can now issue a clearance cookie for the domain that it’s embedded on. Customers can issue their challenge within the HTML page before a fetch request, pre-clearing the visitor to interact with the Payment API.

Turnstile Pre-Clearance mode

Returning to our pizzeria example, the three big advantages of using Pre-Clearance to integrate Turnstile with the Cloudflare WAF are:

Improved user experience: Turnstile’s embedded challenge can run in the background while the visitor is entering their payment details.
Blocking more requests at the edge: Because Turnstile now issues a clearance cookie for the domain that it’s embedded on, our pizzeria owner can use a Custom Rule to issue a Managed Challenge for every request to the payment API. This ensures that automated attacks attempting to target the payment API directly are stopped by Cloudflare before they can reach the API.
(Optional) Securing the action and the user: No backend code changes are necessary to get the benefit of Pre-Clearance. However, further Turnstile integration will increase security for the integrated API. The pizzeria owner can adjust their payment form to validate the received Turnstile token, ensuring that every payment attempt is individually validated by Turnstile to protect their payment endpoint from session hijacking.

A Turnstile widget with Pre-Clearance enabled will still issue turnstile tokens, which gives customers the flexibility to decide if an endpoint is critical enough to require a security check on every request to it, or just once a session. Clearance cookies issued by a Turnstile widget are automatically applied to the Cloudflare zone the Turnstile widget is embedded on, with no configuration necessary. The clearance time the token is valid for is still controlled by the zone specific “Challenge Passage” time.

Implementing Turnstile with Pre-Clearance

Let’s make this concrete by walking through a basic implementation. Before we start, we’ve set up a simple demo application where we emulate a frontend talking to a backend on a /your-api endpoint.

To this end, we have the following code:

<!DOCTYPE html>
<html lang="en">
<head>
   <title>Turnstile Pre-Clearance Demo </title>
</head>
<body>
  <main class="pre-clearance-demo">
    <h2>Pre-clearance Demo</h2>
    <button id="fetchBtn">Fetch Data</button>
    <div id="response"></div>
</main>


<script>
  const button = document.getElementById('fetchBtn');
  const responseDiv = document.getElementById('response');
  button.addEventListener('click', async () => {
  try {
    let result = await fetch('/your-api');
    if (result.ok) {
      let data = await result.json();
      responseDiv.textContent = JSON.stringify(data);
    } else {
      responseDiv.textContent = 'Error fetching data';
    }
  } catch (error) {
    responseDiv.textContent = 'Network error';
  }
});
</script>

We’ve created a button. Upon clicking, Cloudflare makes a fetch() request to the /your-api endpoint, showing the result in the response container.

Now let’s consider that we have a Cloudflare WAF rule set up that protects the /your-api endpoint with a Managed Challenge.

Due to this rule, the app that we just wrote is going to fail for the reason described earlier (the browser is expecting a JSON response, but instead receives the challenge page as HTML).

If we inspect the Network Tab, we can see that the request to /your-api has been given a 403 response.

Upon inspection, the Cf-Mitigated header shows that the response was challenged by Cloudflare’s firewall, as the visitor has not solved a challenge before.

To address this problem in our app, we set up a Turnstile Widget in Pre-Clearance mode for the Turnstile sitekey that we want to use.

In our application, we override the fetch() function to invoke Turnstile once a Cf-Mitigated response has been received.

<script>
turnstileLoad = function () {
  // Save a reference to the original fetch function
  const originalFetch = window.fetch;

  // A simple modal to contain Cloudflare Turnstile
  const overlay = document.createElement('div');
  overlay.style.position = 'fixed';
  overlay.style.top = '0';
  overlay.style.left = '0';
  overlay.style.right = '0';
  overlay.style.bottom = '0';
  overlay.style.backgroundColor = 'rgba(0, 0, 0, 0.7)';
  overlay.style.border = '1px solid grey';
  overlay.style.zIndex = '10000';
  overlay.style.display = 'none';
  overlay.innerHTML =       '<p style="color: white; text-align: center; margin-top: 50vh;">One more step before you proceed...</p><div style=”display: flex; flex-wrap: nowrap; align-items: center; justify-content: center;” id="turnstile_widget"></div>';
  document.body.appendChild(overlay);

  // Override the native fetch function
  window.fetch = async function (...args) {
      let response = await originalFetch(...args);

      //If the original request was challenged...
      if (response.headers.has('cf-mitigated') && response.headers.get('cf-mitigated') === 'challenge') {
          //The request has been challenged...
          overlay.style.display = 'block';

          await new Promise((resolve, reject) => {
              turnstile.render('#turnstile_widget', {
                  'sitekey': ‘YOUR_TURNSTILE_SITEKEY',
                  'error-callback': function (e) {
                      overlay.style.display = 'none';
                      reject(e);
                  },
                  'callback': function (token, preClearanceObtained) {
                      if (preClearanceObtained) {
                          //The visitor successfully solved the challenge on the page. 
                          overlay.style.display = 'none';
                          resolve();
                      } else {
                          reject(e);
                      }
                  },
              });
          });

          // Replay the original fetch request, this time it will have the cf_clearance Cookie
          response = await originalFetch(...args);
      }
      return response;
  };
};
</script>
<script src="https://challenges.cloudflare.com/turnstile/v0/api.js?onload=turnstileLoad" async defer></script>

There is a lot going on in the snippet above: First, we create a hidden overlay element and override the browser’s fetch() function. The fetch() function is changed to introspect the Cf-Mitigated header for ‘challenge’. If a challenge is issued, the initial result will be unsuccessful; instead, a Turnstile overlay (with Pre-Clearance enabled) will appear in our web application. Once the Turnstile challenge has been completed we will retry the previous request after Turnstile has obtained the cf_clearance cookie to get through the Cloudflare WAF.

Upon solving the Turnstile widget, the overlay disappears, and the requested API result is shown successfully:

Pre-Clearance is available to all Cloudflare customers

Every Cloudflare user with a free plan or above can use Turnstile in managed mode free for an unlimited number of requests. If you’re a Cloudflare user looking to improve your security and user experience for your critical API endpoints, head over to our dashboard and create a Turnstile widget with Pre-Clearance today.

Cloudflare is free of CAPTCHAs; Turnstile is free for everyone

2023-09-29 Benedikt Wolters

Post Syndicated from Benedikt Wolters original http://blog.cloudflare.com/turnstile-ga/

Cloudflare is free of CAPTCHAs; Turnstile is free for everyone

For years, we’ve written that CAPTCHAs drive us crazy. Humans give up on CAPTCHA puzzles approximately 15% of the time and, maddeningly, CAPTCHAs are significantly easier for bots to solve than they are for humans. We’ve spent the past three and a half years working to build a better experience for humans that’s just as effective at stopping bots. As of this month, we’ve finished replacing every CAPTCHA issued by Cloudflare with Turnstile, our new CAPTCHA replacement (pictured below). Cloudflare will never issue another visual puzzle to anyone, for any reason.

Now that we’ve eliminated CAPTCHAs at Cloudflare, we want to make it easy for anyone to do the same, even if they don’t use other Cloudflare services. We’ve decoupled Turnstile from our platform so that any website operator on any platform can use it just by adding a few lines of code. We’re thrilled to announce that Turnstile is now generally available, and Turnstile’s ‘Managed’ mode is now completely free to everyone for unlimited use.

Easy on humans, hard on bots, private for everyone

There’s a lot that goes into Turnstile’s simple checkbox to ensure that it’s easy for everyone, preserves user privacy, and does its job stopping bots. Part of making challenges better for everyone means that everyone gets the same great experience, no matter what browser you’re using. Because we do not employ a visual puzzle, users with low vision or blindness get the same easy to use challenge flow as everyone else. It was particularly important for us to avoid falling back to audio CAPTCHAs to offer an experience accessible to everyone. Audio CAPTCHAs are often much worse than even visual CAPTCHAs for humans to solve, with only 31.2% of audio challenges resulting in a three-person agreement on what the correct solution actually is. The prevalence of free speech-to-text services has made it easy for bots to solve audio CAPTCHAs as well, with a recent study showing bots can accurately solve audio CAPTCHAs in over 85% of attempts.

We also created Turnstile to be privacy focused. Turnstile meets ePrivacy Directive, GDPR and CCPA compliance requirements, as well as the strict requirements of our own privacy commitments. In addition, Cloudflare's FedRAMP Moderate authorized package, "Cloudflare for Government" now includes Turnstile. We don’t rely on tracking user data, like what other websites someone has visited, to determine if a user is a human or robot. Our business is protecting websites, not selling ads, so operators can deploy Turnstile knowing that their users’ data is safe.

With all of our emphasis on how easy it is to pass a Turnstile challenge, you would be right to ask how it can stop a bot. If a bot can find all images with crosswalks in grainy photos faster than we can, surely it can check a box as well. Bots definitely can check a box, and they can even mimic the erratic path of human mouse movement while doing so. For Turnstile, the actual act of checking a box isn’t important, it’s the background data we’re analyzing while the box is checked that matters. We find and stop bots by running a series of in-browser tests, checking browser characteristics, native browser APIs, and asking the browser to pass lightweight tests (ex: proof-of-work tests, proof-of-space tests) to prove that it’s an actual browser. The current deployment of Turnstile checks billions of visitors every day, and we are able to identify browser abnormalities that bots exhibit while attempting to pass those tests.

For over one year, we used our Managed Challenge to rotate between CAPTCHAs and our own Turnstile challenge to compare our effectiveness. We found that even without asking users for any interactivity at all, Turnstile was just as effective as a CAPTCHA. Once we were sure that the results were effective at coping with the response from bot makers, we replaced the CAPTCHA challenge with our own checkbox solution. We present this extra test when we see potentially suspicious signals, and it helps us provide an even greater layer of security.

Turnstile is great for fighting fraud

Like all sites that offer services for free, Cloudflare sees our fair share of automated account signups, which can include “new account fraud,” where bad actors automate the creation of many different accounts to abuse our platform. To help combat this abuse, we’ve rolled out Turnstile’s invisible mode to protect our own signup page. This month, we’ve blocked over 1 million automated signup attempts using Turnstile, without a reported false positive or any change in our self-service billings that rely on this signup flow.

Lessons from the Turnstile beta

Over the past twelve months, we’ve been grateful to see how many people are eager to try, then rely on, and integrate Turnstile into their web applications. It’s been rewarding to see the developer community embrace Turnstile as well. We list some of the community created Turnstile integrations here, including integrations with WordPress, Angular, Vue, and a Cloudflare recommended React library. We’ve listened to customer feedback, and added support for 17 new languages, new callbacks, and new error codes.

76,000+ users have signed up, but our biggest single test by far was the Eurovision final vote. Turnstile runs on challenge pages on over 25 million Cloudflare websites. Usually, that makes Cloudflare the far and away biggest Turnstile consumer, until the final Eurovision vote. During that one hour, challenge traffic from the Eurovision voting site outpaced the use of challenge pages on those 25 million sites combined! Turnstile handled the enormous spike in traffic without a hitch.

While a lot went well during the Turnstile beta, we also encountered some opportunities for us to learn. We were initially resistant to disclosing why a Turnstile challenge failed. After all, if bad actors know what we’re looking for, it becomes easier for bots to fool our challenges until we introduce new detections. However, during the Turnstile beta, we saw a few scenarios where legitimate users could not pass a challenge. These scenarios made it clear to us that we need to be transparent about why a challenge failed to help aid any individual who might modify their browser in a way that causes them to get caught by Turnstile. We now publish detailed client-side error codes to surface the reason why a challenge has failed. Two scenarios came up on several occasions that we didn’t expect:

First, we saw that desktop computers at least 10 years old frequently had expired motherboard batteries, and computers with bad motherboard batteries very often keep inaccurate time. This is because without the motherboard battery, a desktop computer’s clock will stop operating when the computer is off. Turnstile checks your computer’s system time to detect when a website operator has accidentally configured a challenge page to be cached, as caching a challenge page will cause it to become impassable. Unfortunately, this same check was unintentionally catching humans who just needed to update the time. When we see this issue, we now surface a clear error message to the end user to update their system time. We’d prefer to never have to surface an error in the first place, so we’re working to develop new ways to check for cached content that won’t impact real people.

Second, we find that a few privacy-focused users often ask their browsers to go beyond standard practices to preserve their anonymity. This includes changing their user-agent (something bots will do to evade detection as well), and preventing third-party scripts from executing entirely. Issues caused by this behavior can now be displayed clearly in a Turnstile widget, so those users can immediately understand the issue and make a conscientious choice about whether they want to allow their browser to pass a challenge.

Although we have some of the most sensitive, thoroughly built monitoring systems at Cloudflare, we did not catch either of these issues on our own. We needed to talk to users affected by the issue to help us understand what the problem was. Going forward, we want to make sure we always have that direct line of communication open. We’re rolling out a new feedback form in the Turnstile widget, to ensure any future corner cases are addressed quickly and with urgency.

Turnstile: GA and Free for Everyone

Announcing Turnstile’s General Availability means that Turnstile is now completely production ready, available for free for unlimited use via our visible widget in Managed mode. Turnstile Enterprise includes SaaS platform support and a visible mode without the Cloudflare logo. Self-serve customers can expect a pay-as-you-go option for advanced features to be available in early 2024. Users can continue to access Turnstile’s advanced features below our 1 million siteverify request limit, as has been the case during the beta. If you’ve been waiting to try Turnstile, head over to our signup page and create an account!

Application Security Report: Q2 2023

2023-08-21 Michael Tremante

Post Syndicated from Michael Tremante original http://blog.cloudflare.com/application-security-report-q2-2023/

Application Security Report: Q2 2023

Cloudflare has a unique vantage point on the Internet. From this position, we are able to see, explore, and identify trends that would otherwise go unnoticed. In this report we are doing just that and sharing our insights into Internet-wide application security trends.

This report is the third edition of our Application Security Report. The first one was published in March 2022, with the second published earlier this year in March, and this is the first to be published on a quarterly basis.

Since the last report, our network is bigger and faster: we are now processing an average of 46 million HTTP requests/second and 63 million at peak. We consistently handle approximately 25 million DNS queries per second. That's around 2.1 trillion DNS queries per day, and 65 trillion queries a month. This is the sum of authoritative and resolver requests served by our infrastructure. Summing up both HTTP and DNS requests, we get to see a lot of malicious traffic. Focusing on HTTP requests only, in Q2 2023 Cloudflare blocked an average of 112 billion cyber threats each day, and this is the data that powers this report.

But as usual, before we dive in, we need to define our terms.

Definitions

Throughout this report, we will refer to the following terms:

Mitigated traffic: any eyeball HTTP* request that had a “terminating” action applied to it by the Cloudflare platform. These include the following actions: BLOCK, CHALLENGE, JS_CHALLENGE and MANAGED_CHALLENGE. This does not include requests that had the following actions applied: LOG, SKIP, ALLOW. In contrast to last year, we now exclude requests that had CONNECTION_CLOSE and FORCE_CONNECTION_CLOSE actions applied by our DDoS mitigation system, as these technically only slow down connection initiation. They also accounted for a relatively small percentage of requests. Additionally, we improved our calculation regarding the CHALLENGE type actions to ensure that only unsolved challenges are counted as mitigated. A detailed description of actions can be found in our developer documentation.
Bot traffic/automated traffic: any HTTP* request identified by Cloudflare’s Bot Management system as being generated by a bot. This includes requests with a bot score between 1 and 29 inclusive. This has not changed from last year’s report.
API traffic: any HTTP* request with a response content type of XML or JSON. Where the response content type is not available, such as for mitigated requests, the equivalent Accept content type (specified by the user agent) is used instead. In this latter case, API traffic won’t be fully accounted for, but it still provides a good representation for the purposes of gaining insights.

Unless otherwise stated, the time frame evaluated in this post is the 3 month period from April 2023 through June 2023 inclusive.

Finally, please note that the data is calculated based only on traffic observed across the Cloudflare network and does not necessarily represent overall HTTP traffic patterns across the Internet.

* When referring to HTTP traffic we mean both HTTP and HTTPS.

Global traffic insights

Mitigated daily traffic stable at 6%, spikes reach 8%

Although daily mitigated HTTP requests decreased by 2 percentage points to 6% on average from 2021 to 2022, days with larger than usual malicious activity can be clearly seen across the network. One clear example is shown in the graph below: towards the end of May 2023, a spike reaching nearly 8% can be seen. This is attributable to large DDoS events and other activity that does not follow standard daily or weekly cycles and is a constant reminder that large malicious events can still have a visible impact at a global level, even at Cloudflare scale.

75% of mitigated HTTP requests were outright BLOCKed. This is a 6 percentage point decrease compared to the previous report. The majority of other requests are mitigated with the various CHALLENGE type actions, with managed challenges leading with ~20% of this subset.

Shields up: customer configured rules now biggest contributor to mitigated traffic

In our previous report, our automated DDoS mitigation system accounted for, on average, more than 50% of mitigated traffic. Over the past two quarters, due to both increased WAF adoption, but most likely organizations better configuring and locking down their applications from unwanted traffic, we’ve seen a new trend emerge, with WAF mitigated traffic surpassing DDoS mitigation. Most of the increase has been driven by WAF Custom Rule BLOCKs rather than our WAF Managed Rules, indicating that these mitigations are generated by customer configured rules for business logic or related purposes. This can be clearly seen in the chart below.

Note that our WAF Managed Rules mitigations (yellow line) are negligible compared to overall WAF mitigated traffic also indicating that customers are adopting positive security models by allowing known good traffic as opposed to blocking only known bad traffic. Having said that, WAF Managed Rules mitigations reached as much as 1.5 billion/day during the quarter.

Our DDoS mitigation is, of course, volumetric and the amount of traffic matching our DDoS layer 7 rules should not be underestimated, especially given that we are observing a number of novel attacks and botnets being spun up across the web. You can read a deep dive on DDoS attack trends in our Q2 DDoS threat report.

Aggregating the source of mitigated traffic, the WAF now accounts for approximately 57% of all mitigations. Tabular format below with other sources for reference.

Source	Percentage %
WAF	57%
DDoS Mitigation	34%
IP Reputation	6%
Access Rules	2%
Other	1%

Application owners are increasingly relying on geo location blocks

Given the increase in mitigated traffic from customer defined WAF rules, we thought it would be interesting to dive one level deeper and better understand what customers are blocking and how they are doing it. We can do this by reviewing rule field usage across our WAF Custom Rules to identify common themes. Of course, the data needs to be interpreted correctly, as not all customers have access to all fields as that varies by contract and plan level, but we can still make some inferences based on field “categories”. By reviewing all ~7M WAF Custom Rules deployed across the network and focusing on main groupings only, we get the following field usage distribution:

Field	Used in percentage % of rules
Geolocation fields	40%
HTTP URI	31%
IP address	21%
Other HTTP fields (excluding URI)	34%
Bot Management fields	11%
IP reputation score	4%

Notably, 40% of all deployed WAF Custom Rules use geolocation-related fields to make decisions on how to treat traffic. This is a common technique used to implement business logic or to exclude geographies from which no traffic is expected and helps reduce attack surface areas. While these are coarse controls which are unlikely to stop a sophisticated attacker, they are still efficient at reducing the attack surface.

Another notable observation is the usage of Bot Management related fields in 11% of WAF Custom Rules. This number has been steadily increasing over time as more customers adopt machine learning-based classification strategies to protect their applications.

Old CVEs are still exploited en masse

Contributing ~32% of WAF Managed Rules mitigated traffic overall, HTTP Anomaly is still the most common attack category blocked by the WAF Managed Rules. SQLi moved up to second position, surpassing Directory Traversal with 12.7% and 9.9% respectively.

If we look at the start of April 2023, we notice the DoS category far exceeding the HTTP Anomaly category. Rules in the DoS category are WAF layer 7 HTTP signatures that are sufficiently specific to match (and block) single requests without looking at cross request behavior and that can be attributed to either specific botnets or payloads that cause denial of service (DoS). Normally, as is the case here, these requests are not part of “distributed” attacks, hence the lack of the first “D” for “distributed” in the category name.

Tabular format for reference (top 10 categories):

Source	Percentage %
HTTP Anomaly	32%
SQLi	13%
Directory Traversal	10%
File Inclusion	9%
DoS	9%
XSS	9%
Software Specific	7%
Broken Authentication	6%
Common Injection	3%
CVE	1%

Zooming in, and filtering on the DoS category only, we find that most of the mitigated traffic is attributable to one rule: 100031 / ce02fd… (old WAF and new WAF rule ID respectively). This rule, with a description of “Microsoft IIS – DoS, Anomaly:Header:Range – CVE:CVE-2015-1635” pertains to a CVE dating back to 2015 that affected a number of Microsoft Windows components resulting in remote code execution*. This is a good reminder that old CVEs, even those dating back more than 8 years, are still actively exploited to compromise machines that may be unpatched and still running vulnerable software.

* Due to rule categorisation, some CVE specific rules are still assigned to a broader category such as DoS in this example. Rules are assigned to a CVE category only when the attack payload does not clearly overlap with another more generic category.

Another interesting observation is the increase in Broken Authentication rule matches starting in June. This increase is also attributable to a single rule deployed across all our customers, including our FREE users: “WordPress – Broken Access Control, File Inclusion”. This rule is blocking attempts to access wp-config.php – the WordPress default configuration file which is normally found in the web server document root directory, but of course should never be accessed directly via HTTP.

On a similar note, CISA/CSA recently published a report highlighting the 2022 Top Routinely Exploited Vulnerabilities. We took this opportunity to explore how each CVE mentioned in CISA’s report was reflected in Cloudflare’s own data. The CISA/CSA discuss 12 vulnerabilities that malicious cyber actors routinely exploited in 2022. However, based on our analysis, two CVEs mentioned in the CISA report are responsible for the vast majority of attack traffic we have seen in the wild: Log4J and Atlassian Confluence Code Injection. Our data clearly suggests a major difference in exploit volume between the top two and the rest of the list. The following chart compares the attack volume (in logarithmic scale) of the top 6 vulnerabilities of the CISA list according to our logs.

Bot traffic insights

Cloudflare’s Bot Management continues to see significant investment as the addition of JavaScript Verified URLs for greater protection against browser-based bots, Detection IDs are now available in Custom Rules for additional configurability, and an improved UI for easier onboarding. For self-serve customers, we’ve added the ability to “Skip” Super Bot Fight Mode rules and support for WordPress Loopback requests, to better integrate with our customers’ applications and give them the protection they need.

Our confidence in the Bot Management classification output remains very high. If we plot the bot scores across the analyzed time frame, we find a very clear distribution, with most requests either being classified as definitely bot (score below 30) or definitely human (score greater than 80), with most requests actually scoring less than 2 or greater than 95. This equates, over the same time period, to 33% of traffic being classified as automated (generated by a bot). Over longer time periods we do see the overall bot traffic percentage stable at 29%, and this reflects the data shown on Cloudflare Radar.

On average, more than 10% of non-verified bot traffic is mitigated

Compared to the last report, non-verified bot HTTP traffic mitigation is currently on a downward trend (down 6 percentage points). However, the Bot Management field usage within WAF Custom Rules is non negligible, standing at 11%. This means that there are more than 700k WAF Custom Rules deployed on Cloudflare that are relying on bot signals to perform some action. The most common field used is cf.client.bot, an alias to cf.bot_management.verified_bot which is powered by our list of verified bots and allows customers to make a distinction between “good” bots and potentially “malicious” non-verified ones.

Enterprise customers have access to the more powerful cf.bot_management.score which provides direct access to the score computed on each request, the same score used to generate the bot score distribution graph in the prior section.

The above data is also validated by looking at what Cloudflare service is mitigating unverified bot traffic. Although our DDoS mitigation system is automatically blocking HTTP traffic across all customers, this only accounts for 13% of non-verified bot mitigations. On the other hand, WAF, and mostly customer defined rules, account for 77% of such mitigations, much higher than mitigations across all traffic (57%) discussed at the start of the report. Note that Bot Management is specifically called out but refers to our “default” one-click rules, which are counted separately from the bot fields used in WAF Custom Rules.

Tabular format for reference:

Source	Percentage %
WAF	77%
DDoS Mitigation	13%
IP reputation	5%
Access Rules	3%
Other	1%

API traffic insights

The growth of overall API traffic observed by Cloudflare is not slowing down. Compared to last quarter, we are now seeing 58% of total dynamic traffic be classified as API related. This is a 3 percentage point increase as compared to Q1.

Our investment in API Gateway is also following a similar growth trend. Over the last quarter we have released several new API security features.

First, we’ve made API Discovery easier to use with a new inbox view. API Discovery inventories your APIs to prevent shadow IT and zombie APIs, and now customers can easily filter to show only new endpoints found by API Discovery. Saving endpoints from API Discovery places them into our Endpoint Management system.

Next, we’ve added a brand new API security feature offered only at Cloudflare: the ability to control API access by client behavior. We call it Sequence Mitigation. Customers can now create positive or negative security models based on the order of API paths accessed by clients. You can now ensure that your application’s users are the only ones accessing your API instead of brute-force attempts that ignore normal application functionality. For example, in a banking application you can now enforce that access to the funds transfer endpoint can only be accessed after a user has also accessed the account balance check endpoint.

We’re excited to continue releasing API security and API management features for the remainder of 2023 and beyond.

65% of global API traffic is generated by browsers

The percentage of API traffic generated by browsers has remained very stable over the past quarter. With this statistic, we are referring to HTTP requests that are not serving HTML based content that will be directly rendered by the browser without some preprocessing, such as those more commonly known as AJAX calls which would normally serve JSON based responses.

HTTP Anomalies are the most common attack vector on API endpoints

Just like last quarter, HTTP Anomalies remain the most common mitigated attack vector on API traffic. SQLi injection attacks, however, are non negligible, contributing approximately 11% towards the total mitigated traffic, closely followed by XSS attacks, at around 9%.

Tabular format for reference (top 5):

Source	Percentage %
HTTP Anomaly	64%
SQLi	11%
XSS	9%
Software Specific	5%
Command Injection	4%

Looking forward

As we move our application security report to a quarterly cadence, we plan to deepen some of the insights and to provide additional data from some of our newer products such as Page Shield, allowing us to look beyond HTTP traffic, and explore the state of third party dependencies online.

Stay tuned and keep an eye on Cloudflare Radar for more frequent application security reports and insights.

How Cloudflare runs machine learning inference in microseconds

2023-06-19 Austin Hartzheim

Post Syndicated from Austin Hartzheim original http://blog.cloudflare.com/how-cloudflare-runs-ml-inference-in-microseconds/

How Cloudflare runs machine learning inference in microseconds

Cloudflare executes an array of security checks on servers spread across our global network. These checks are designed to block attacks and prevent malicious or unwanted traffic from reaching our customers’ servers. But every check carries a cost – some amount of computation, and therefore some amount of time must be spent evaluating every request we process. As we deploy new protections, the amount of time spent executing security checks increases.

Latency is a key metric on which CDNs are evaluated. Just as we optimize network latency by provisioning servers in close proximity to end users, we also optimize processing latency – which is the time spent processing a request before serving a response from cache or passing the request forward to the customers’ servers. Due to the scale of our network and the diversity of use-cases we serve, our edge software is subject to demanding specifications, both in terms of throughput and latency.

Cloudflare's bot management module is one suite of security checks which executes during the hot path of request processing. This module calculates a variety of bot signals and integrates directly with our front line servers, allowing us to customize behavior based on those signals. This module evaluates every request for heuristics and behaviors indicative of bot traffic, and scores every request with several machine learning models.

To reduce processing latency, we've undertaken a project to rewrite our bot management technology, porting it from Lua to Rust, and applying a number of performance optimizations. This post focuses on optimizations applied to the machine-learning detections within the bot management module, which account for approximately 15% of the latency added by bot detection. By switching away from a garbage collected language, removing memory allocations, and optimizing our parsers, we reduce the P50 latency of the bot management module by 79μs – a 20% reduction.

Engineering for zero allocations

Writing software without memory allocation poses several challenges. Indeed, high-level programming languages often trade memory management for productivity, abstracting away the details of memory management. But, in those details, are a number of algorithms to find contiguous regions of free memory, handle fragmentation, and call into the kernel to request new memory pages. Garbage collected languages incur additional costs throughout program execution to track when memory can be freed, plus pauses in program execution while the garbage collector executes. But, when performance is a critical requirement, languages should be evaluated for their ability to meet performance constraints.

Stack allocation

One of the simplest ways to reduce memory allocations is to work with fixed-size buffers. Fixed-sized buffers can be placed on the stack, which eliminates the need to invoke heap allocation logic; the compiler simply reserves space in the current stack frame to hold local variables. Alternatively, the buffers can be heap-allocated outside the hot path (for example, at application startup), incurring a one-time cost.

Arrays can be stack allocated:

let mut buf = [0u8; BUFFER_SIZE];

Vectors can be created outside the hot path:

let mut buf = Vec::with_capacity(BUFFER_SIZE);

To demonstrate the performance difference, let's compare two implementations of a case-insensitive string equality check. The first will allocate a buffer for each invocation to store the lower-case version of the string. The second will use a buffer that has already been allocated for this purpose.

Allocate a new buffer for each iteration:

fn case_insensitive_equality_buffer_with_capacity(s: &str, pat: &str) -> bool {
	let mut buf = String::with_capacity(s.len());
	buf.extend(s.chars().map(|c| c.to_ascii_lowercase()));
	buf == pat
}

Re-use a buffer for each iteration, avoiding allocations:

fn case_insensitive_equality_buffer_with_capacity(s: &str, pat: &str, buf: &mut String) -> bool {
	buf.clear();
	buf.extend(s.chars().map(|c| c.to_ascii_lowercase()));
	buf == pat
}

Benchmarking the two code snippets, the first executes in ~40ns per iteration, the second in ~25ns. Changing only the memory allocation behavior, the second implementation is ~38% faster.

Choice of algorithms

Another strategy to reduce the number of memory allocations is to choose algorithms that operate on the data in-place and store any necessary state on the stack.

Returning to our string comparison function from above, let's rewrite it operate completely on the stack, and without copying data into a separate buffer:

fn case_insensitive_equality_buffer_iter(s: &str, pat: &str) -> bool {
	s.chars().map(|c| c.to_ascii_lowercase()).eq(pat.chars())
}

In addition to being the shortest, this function is also the fastest. This function benchmarks at ~13ns/iter, which is just slightly slower than the 11ns used to execute eq_ignore_ascii_case from the standard library. And the standard library implementation similarly avoids buffer allocation through use of iterators.

Testing allocations

Automated testing of memory allocation on the critical path prevents accidental use of functions or libraries which allocate memory. dhat is a crate in the Rust ecosystem that supports such testing. By setting a new global allocator, dhat is able to count the number of allocations, as well as the number of bytes allocated on a given code path.

/// Assert that the hot path logic performs zero allocations.
#[test]
fn zero_allocations() {
	let _profiler = dhat::Profiler::builder().testing().build();

	// Execute hot-path logic here.

	// Assert that no allocations occurred.
	dhat::assert_eq!(stats.total_blocks, 0);
	dhat::assert_eq!(stats.total_bytes, 0);
}

It is important to note, dhat does have the limitation that it only detects allocations in Rust code. External libraries can still allocate memory without using the Rust allocator. FFI calls, such as those made to C, are one such place where memory allocations may slip past dhat's measurements.

Zero allocation decision trees

CatBoost is an open-source machine learning library used within the bot management module. The core logic of CatBoost is implemented in C++, and the library exposes bindings to a number of other languages – such as C and Rust. The Lua-based implementation of the bot management module relied on FFI calls to the C API to execute our models.

By removing memory allocations and implementing buffer re-use, we optimize the execution duration of the sample model included in the CatBoost repository by 10%. Our production models see gains up to 15%.

Optimize for single-document evaluation

By optimizing CatBoost to evaluate a single set of features at a time, we reduce memory allocations and reduce latency. The CatBoost API has several functions which are optimized for evaluating multiple documents at a time, but this API does not benefit our application where requests are evaluated in the order they are received, and delaying processing to coalesce batches is undesirable. To support evaluation of a variable number of documents, the CatBoost implementation allocates vectors and iterates over the input documents, writing them into the vectors.

TVector<TConstArrayRef<float>> floatFeaturesVec(docCount);
TVector<TConstArrayRef<int>> catFeaturesVec(docCount);
for (size_t i = 0; i < docCount; ++i) {
    if (floatFeaturesSize > 0) {
        floatFeaturesVec[i] = TConstArrayRef<float>(floatFeatures[i], floatFeaturesSize);
    }
    if (catFeaturesSize > 0) {
        catFeaturesVec[i] = TConstArrayRef<int>(catFeatures[i], catFeaturesSize);
    }
}
FULL_MODEL_PTR(modelHandle)->Calc(floatFeaturesVec, catFeaturesVec, TArrayRef<double>(result, resultSize));

To evaluate a single document, however, CatBoost only needs access to a reference to contiguous memory holding feature values. The above code can be replaced with the following:

TConstArrayRef<float> floatFeaturesArray = TConstArrayRef<float>(floatFeatures, floatFeaturesSize);
TConstArrayRef<int> catFeaturesArray = TConstArrayRef<int>(catFeatures, catFeaturesSize);
FULL_MODEL_PTR(modelHandle)->Calc(floatFeaturesArray, catFeaturesArray, TArrayRef<double>(result, resultSize));

Similar to the C++ implementation, the CatBoost Rust bindings also allocate vectors to support multi-document evaluation. For example, the bindings iterate over a vector of vectors, mapping it to a newly allocated vector of pointers:

let mut float_features_ptr = float_features
   .iter()
   .map(|x| x.as_ptr())
   .collect::<Vec<_>>();

But in the single-document case, we don't need the outer vector at all. We can simply pass the inner pointer value directly:

let float_features_ptr = float_features.as_ptr();

Reusing buffers

The previous API in the Rust bindings accepted owned Vecs as input. By taking ownership of a heap-allocated data structure, the function also takes responsibility for freeing the memory at the conclusion of its execution. This is undesirable as it forecloses the possibility of buffer reuse. Additionally, categorical features are passed as owned Strings, which prevents us from passing references to bytes in the original request. Instead, we must allocate a temporary String on the heap and copy bytes into it.

pub fn calc_model_prediction(
	&self,
	float_features: Vec<Vec<f32>>,
	cat_features: Vec<Vec<String>>,
) -> CatBoostResult<Vec<f64>> { ... }

Let's rewrite this function to take &[f32] and &[&str]:

pub fn calc_model_prediction_single(
	&self,
	float_features: &[f32],
	cat_features: &[&str],
) -> CatBoostResult<f64> { ... }

But, we also execute several models per request, and those models may use the same categorical features. Instead of calculating the hash for each separate model we execute, let's compute the hashes first and then pass them to each model that requires them:

pub fn calc_model_prediction_single_with_hashed_cat_features(
	&self,
	float_features: &[f32],
	hashed_cat_features: &[i32],
) -> CatBoostResult<f64> { ... }

Summary

By optimizing away unnecessary memory allocations in the bot management module, we reduced P50 latency from 388us to 309us (20%), and reduced P99 latency from 940us to 813us (14%). And, in many cases, the optimized code is shorter and easier to read than the unoptimized implementation.

These optimizations were targeted at model execution in the bot management module. To learn more about how we are porting bot management from Lua to Rust, check out this blog post.

Super Bot Fight Mode is now configurable!

2023-03-16 Adam Martinetti

Post Syndicated from Adam Martinetti original https://blog.cloudflare.com/configurable-super-bot-fight-mode/

Super Bot Fight Mode is now configurable!

Millions of customers around the world use Cloudflare to keep their applications safe by blocking bot traffic to their website. We block an average of 336 million requests per day for self-service customers using a service called Super Bot Fight Mode. It is a crucial part of how customers keep their websites online.

While most customers use Cloudflare’s Verified Bot directory to securely allow good, automated traffic, some customers also like to write their own localized integration scripts to crawl and update their website, or perform other necessary maintenance functions. Because these bots are only used on a single website, they don’t fit our verified bot criteria the way a Google or Bing crawler does. This makes Super Bot Fight Mode difficult to manage for these types of customers.

Super Bot Fight Mode: now configurable!

Previously, Super Bot Fight Mode ran as an independent service on our global network and other Cloudflare security services were unable to affect its configuration. To solve this, we’ve rewritten Super Bot Fight Mode behind the scenes. It’s now a new managed ruleset in the new WAF, just like the OWASP Core Ruleset or the Cloudflare Managed Ruleset. This doesn’t change the interface, but brings Super Bot Fight Mode closer to where customers are managing their other security exceptions.

As we speak, the WAF team is carefully migrating all self-serve customers from our old Firewall Rules system to a new system. This new system, called Custom Rules, simplifies the exception process in the rules you write with no other changes or loss of functionality. In the old system we had two separate actions, “allow” and “bypass”. In the new Custom Rules, there’s only one action called “skip”. Rules that “skip” traffic can skip the rest of your custom rules (just like an “allow” rule would) and other Cloudflare services. As Cloudflare customers are given the “Skip” action, you will be able to see the option available to “skip” Super Bot Fight Mode. Here’s an example:

While we spoke to customers about their use cases for skipping Super Bot Fight Mode, one use-case kept popping up that didn’t quite fit the rest: WordPress Loopback requests. As many people know, as part of WordPress’ self-diagnostic capabilities, a WordPress site will make automated requests back to itself over the Internet to confirm its reachability and functionality. These loopback diagnostics can come from dozens of different community developed plugins, each implementing loopback requests slightly differently. To help accommodate an ever-growing diversity in diagnostic tools used in WordPress, we have added a simple configuration option to securely allow these loop-back requests.

In the future, we will be integrating this feature with the Cloudflare WordPress plugin to make it even easier to use WordPress with Cloudflare.

What’s next?

Self-serve customers with Custom Rules can create “Skip” rules to create exceptions for Super Bot Fight Mode today. We are currently rolling out Custom Rules to all of our customers. If you do not see this option available now, you should expect to see it in the next several weeks. If the lack of flexibility has prevented you from using Super Bot Fight Mode in the past, please log into the Cloudflare dashboard and try it with these new skip rules!

While we’ve added flexibility to customers’ Super Bot Fight Mode deployments, we know that Free plan customers want the same level of customization that self-serve customers do. Now that our migration of Super Bot Fight Mode to the new WAF is complete, we plan to do the same for the original Bot Fight Mode to allow more free customers than ever before to join us in the fight against bots.

How Cloudflare and IBM partner to help build a better Internet

2023-03-16 David McClure

Post Syndicated from David McClure original https://blog.cloudflare.com/ibm-keyless-bots/

How Cloudflare and IBM partner to help build a better Internet

In this blog post, we wanted to highlight some ways that Cloudflare and IBM Cloud work together to help drive product innovation and deliver services that address the needs of our mutual customers. On our blog, we often discuss exciting new product developments and how we are solving real-world problems in our effort to make the internet better and many of our customers and partners play an important role.

IBM Cloud and Cloudflare have been working together since 2018 to integrate Cloudflare application security and performance products natively into IBM Cloud. IBM Cloud Internet Services (CIS) has customers across a wide range of industry verticals and geographic regions but they also have several specialist groups building unique service offerings.

The IBM Cloud team specializes in serving clients in highly regulated industries, aiming to ensure their resiliency, performance, security and compliance needs are met. One group that we’ve been working with recently is IBM Cloud for Financial Services. This group extends the capabilities of IBM Cloud to help serve the complex security and compliance needs of banks, financial institutions and fintech companies.

Bot Management

As malicious bot attacks get more sophisticated and manual mitigations become more onerous, a dynamic and adaptive solution is required for enterprises running Internet facing workloads. With Cloudflare Bot Management on IBM Cloud Internet Services, we aim to help IBM clients protect their Internet properties from targeted application abuse such as account takeover attacks, inventory hoarding, carding abuse and more. Bot Management will be available in the second quarter of 2023.

Threat actors specifically target financial services entities with Account Takeover Attacks, and this is where Cloudflare can help. As much as 71% of login requests we see come from bots (Source: Cloudflare Data) Cloudflare’s Bot Management is powered by a global machine learning model that analyses an average of 45 million HTTP requests a second to track botnets across our network. Cloudflare’s Bot Management solution has the potential to benefit all IBM CIS customers.

Supporting banks, financial institutions, and fintechs

IBM Cloud has been a leader when it comes to providing solutions for the financial services industry and has developed several key management solutions that are designed so clients only need to store their private keys in custom built devices.

The IBM CIS team wants to incorporate the right mix of security and performance, which necessitates the use of cloud-based DDoS, WAF, and Bot Management. Specifically, they wanted to incorporate the powerful security tools that were offered through IBM’s Enterprise-level Cloud Internet Services offerings. When using a cloud solution, it is necessary to proxy traffic which can create a potential challenge when it comes to managing private keys. While Cloudflare adopts strict controls to protect these keys, organizations in highly regulated industries may have security policies and compliance requirements that prevent them from sharing these private keys.

Enter Cloudflare’s Keyless SSL solution.

Cloudflare built Keyless SSL to allow customers to have total control over exactly where private keys are stored. With Keyless SSL and IBM’s key storage solutions, we aim to help enterprises benefit from the robust application protections available through Cloudflare’s WAF, including Cloudflare Bot Management, while still retaining control of their private keys.

“We aim to ensure our clients meet their resiliency, performance, security and compliance needs. The introduction of Keyless SSL and Bot Management security capabilities can further our collaborative accomplishments with Cloudflare and help enterprises, including those in regulated industries, to leverage cloud-native security and adaptive threat mitigation tools.”
— Zane Adam, Vice President, IBM Cloud.

“Through our collaboration with IBM Cloud Internet Services, we get to draw on the knowledge and experience of IBM teams, such as the IBM Cloud for Financial Services team, and combine it with our incredible ability to innovate, resulting in exciting new product and service offerings.”
— David McClure, Global Alliance Manager, Strategic Partnerships

If you want to learn more about how IBM leverages Cloudflare to protect their customers, visit: https://www.ibm.com/cloud/cloudflare

IBM experts are here to help you if you have any additional questions.

Announcing Cloudflare Fraud Detection

2023-03-15 Adam Martinetti

Post Syndicated from Adam Martinetti original https://blog.cloudflare.com/cloudflare-fraud-detection/

Announcing Cloudflare Fraud Detection

The world changed when the COVID-19 pandemic began. Everything moved online to a much greater degree: school, work, and, surprisingly, fraud. Although some degree of online fraud has existed for decades, the Federal Trade Commission reported consumers lost almost $8.8 billion in fraud in 2022 (an over 400% increase since 2019) and the continuation of a disturbing trend. People continue to spend more time alone than ever before, and that time alone makes them not just more targeted, but also more vulnerable to fraud. Companies are falling victim to these trends just as much as individuals: according to PWC’s Global Economic Crime and Fraud Survey, more than half of companies with at least $10 billion in revenue experienced some sort of digital fraud.

This is a familiar story in the world of bot attacks. Cloudflare Bot Management helps customers identify the automated tools behind online fraud, but it’s important to note that not all fraud is committed by bots. If the target is valuable enough, bad actors will contract out the exploitation of online applications to real people. Security teams need to look at more than just bots to better secure online applications and tackle modern, online fraud.

Today, we’re excited to announce Cloudflare Fraud Detection. Fraud Detection will give you precise, easy to use tools that can be deployed in seconds to any website on the Cloudflare network to help detect and categorize fraud. For every type of fraud we detect on your website, you will be able to choose the behavior that makes the most sense to you. While some customers will want to block fraudulent traffic at our edge, other customers may want to pass this information in headers to build integrations with their own app, or use our Cloudflare Workers platform to direct high risk users to load an alternate online experience with fewer capabilities.

The online fraud experience today

When we talk to organizations impacted by sophisticated, online fraud, the first thing we hear from frustrated security teams is that they know what they could do to stop fraud in a vacuum: they’ve proposed requiring email verification on signup, enforcing two-factor authentication for all logins, or blocking online purchases from anonymizing VPNs or countries they repeatedly see a disproportionately high number of charge-backs from. While all of these measures would undoubtedly reduce fraud, they would also make the user experience worse. The fear for every company is that a bad UX will mean slower adoption and less revenue, and that’s too steep a price to pay for most run-of-the-mill online fraud.

For those who’ve chosen to preserve that frictionless user experience and bear the cost of fraud, we see two big impacts: higher infrastructure costs and less efficient employees. Bad actors that abuse account creation endpoints or service availability endpoints often do so with floods of highly distributed HTTP requests, quickly moving through residential proxies to pass under IP based rate limiting rules. Without a way to identify fraudulent traffic with certainty, companies are forced to scale up their infrastructure to be able to serve new peaks in request traffic, even when they know the majority of this traffic is illegitimate. Engineering and Trust and Safety Teams suddenly have a whole new set of responsibilities: regularly banning IP addresses that will probably never be used again, routinely purging fraudulent data from over capacity databases, and even sometimes becoming de-facto fraud investigators. As a result, the organization incurs greater costs without any greater value to their customers.

Reduce modern fraud without hurting UX

Organizations have told us loud and clear that an effective fraud management solution needs to reliably stop bad actors before they can create fraudulent accounts, use stolen credit cards, or steal customer data all the while ensuring a frictionless user experience for real users. We are building novel and highly accurate detections, solving for the four common fraud types we hear the most demand for from businesses around the world:

Fake Account Creation: Bad actors signing up for many different accounts to gain access to promotional rewards, or more resources than a single user should have access to.
Account Takeover: Gaining unauthorized access to legitimate accounts, by means such as using stolen username and password combinations from other websites, guessing weak passwords, or abusing account recovery mechanisms.
Card Testing and Fraudulent Transactions: Testing the validity of stolen credit card details or using those same details to purchase goods or services.
Expediting: Obtaining limited availability goods or services by circumventing the normal user flow to complete orders more quickly than should be possible.

In order to trust your fraud management solution, organizations have to understand the decisions or predictions behind the detection of fraud. This is referred to as explainability. For example, it’s not enough to know a signup attempt was flagged as fraud. You need to know, for example, if a signup is fraudulent, exactly what field supplied by the user led us to think this was an issue, why it was an issue, and if it was part of a larger pattern. We will pass along this level of detail when we detect fraud so you can ensure we are only keeping the bad actors out.

Every business that deals with modern, online fraud has a different idea of what risks are acceptable, and a different preference for dealing with fraud once it’s been identified. To give customers maximum flexibility, we’re building Cloudflare’s fraud detection signals to be used individually, or combined with other Cloudflare security products in whichever way best fits each customer’s risk profile and use case, all while using the familiar Cloudflare Firewall Rules interface. Templated rules and suggestions will be available to provide guidance and help customers become familiar with the new features, but each customer will have the option of fully customizing how they want to protect each internet application. Customers can either block, rate-limit, or challenge requests at the edge, or send those signals upstream in request headers, to trigger custom in-application behavior.

Cloudflare provides application performance and security services to millions of sites, and we see 45 million HTTP requests per second on average. The massive diversity and volume of this traffic puts us in a unique position to analyze and defeat online fraud. Cloudflare Bot Management is already built to run our Machine Learning model that detects automated traffic on every request we see. To better tackle more challenging use cases like online fraud, we made our lightning fast Machine Learning even more performant. The typical Machine Learning model now executes in under 0.2 milliseconds, giving us the architecture we need to run multiple specific Machine Learning models in parallel without slowing down content delivery.

Stopping fake account creation and adding to Cloudflare’s defense in depth

The first problem our customers asked us to tackle is detecting fake account creation. Cloudflare is perfectly positioned to solve this because we see more account creation pages than anyone else. Using sampled fake account attack data from our customers, we started looking at signup submission data, and how threat intelligence curated by our Cloudforce One team might be helpful. We found that the data used in our Cloudflare One products was already able to identify 72% of fake accounts based on the signup details supplied by the bad actor, such as the email address or the domain they’re using in the attack. We are continuing to add more sources of threat intelligence data specific to fake accounts to get this number close to 100%. On top of these threat intelligence based rules, we are also training new machine learning models on this data as well, that will spot trends like popular fraud domains based on intelligence from the millions of domains we see across the Cloudflare network.

Making fraud inefficient by expediting detection

The second problem customers asked us to prioritize is expediting. As a reminder, expediting means visiting a succession of web pages faster than would be possible for a normal user, and sometimes skipping ahead in the order of web pages in order to efficiently exploit a resource.

For instance, let’s say that you have an Account Recovery page that is being spammed by a sophisticated group of bad actors, looking for vulnerable users they can steal reset tokens for. In this case, the fraudsters have access to a large number of valid email addresses and they’re testing which of these addresses may be used at your website. To prevent your account recovery process from being abused, we need to ensure that no single person can move through the account recovery process faster, or in a different order than a real person would.

In order to complete a valid password reset action on your site, you may know that a user should have made:

A GET request to render your login page
A POST request to the login page (at least one second after receiving the login page HTML)
A GET request to render the Account Recovery page (at least one second after receiving the POST response)
A POST request to the password reset page (at least one second after receiving the Account Recovery page HTML)
Taken a total time of less than 5 seconds to complete the process

To solve this, we will rely on encrypted data stored by the user in a token to help us determine if the user has visited all the necessary pages needed in a reasonable amount of time to be performing sensitive actions on your site. If your account recovery process is being abused, the encrypted token we supply acts as a VIP pass, allowing only authorized users to successfully complete the password recovery process. Without a pass indicating the user has gone through the normal recovery flow in the correct order and time, they are denied entry to complete a password recovery. By forcing the bad actor to behave the same as a legitimate user, we make their task of checking which of their compromised email addresses might be registered at your site an impossibly slow process, forcing them to move on to other targets.

These are just two of the first techniques we use to identify and block fraud. We are also building Account Takeover and Carding Abuse detections that we will be talking about in the future on this blog. As online fraud continues to evolve, we will continue to build new and unique detections, leveraging Cloudflare’s unique position to help keep the internet safe.

Cloudflare’s mission is to help build a better Internet, and that includes dealing with the evolution of modern online fraud. If you’re spending hours cleaning up after fraud, or are tired of paying to serve web traffic to bad actors, you can join in the Cloudflare Fraud Detection Early Access in the second half of 2023 by submitting your contact information here. Early Access customers can opt in to providing training data sets right away, making our models more effective for their use cases. You’ll also get test access to our newest models, and future fraud protection features as soon as they roll out.

More bots, more trees

2022-12-14 Adam Martinetti

Post Syndicated from Adam Martinetti original https://blog.cloudflare.com/more-bots-more-trees/

More bots, more trees

Once a year, we pull data from our Bot Fight Mode to determine the number of trees we can donate to our partners at One Tree Planted. It’s part of the commitment we made in 2019 to deter malicious bots online by redirecting them to a challenge page that requires them to perform computationally intensive, but meaningless tasks. While we use these tasks to drive up the bill for bot operators, we account for the carbon cost by planting trees.

This year when we pulled the numbers, we saw something exciting. While the number of bot detections has gone significantly up, the time bots spend in the Bot Fight Mode challenge page has gone way down. We’ve observed that bot operators are giving up quickly, and moving on to other, unprotected targets. Bot Fight Mode is getting smarter at detecting bots and more efficient at deterring bot operators, and that’s a win for Cloudflare and the environment.

What’s changed?

We’ve seen two changes this year in the Bot Fight Mode results. First, the time attackers spend in Bot Fight Mode challenges has reduced by 166%. Many bot operators are disconnecting almost immediately now from Cloudflare challenge pages. We expect this is because they’ve noticed the sharp cost increase associated with our CPU intensive challenge and given up. Even though we’re seeing individual bot operators give up quickly, Bot Fight Mode is busier than ever. We’re issuing six times more CPU intensive challenges per day compared to last year, thanks to a new detection system written using Cloudflare’s ruleset engine, detailed below.

How did we do this?

When Bot Fight Mode launched, we highlighted one of our core detection systems:

“Handwritten rules for simple bots that, however simple, get used day in, day out.”

Some of them are still very simple. We introduce new simple rules regularly when we detect new software libraries as they start to source a significant amount of traffic. However, we started to reach the limitations of this system. We knew there were sophisticated bots out there that we could identify easily, but they shared enough overlapping traits with good browser traffic that we couldn’t safely deploy new rules to block them safely without potentially impacting our customers’ good traffic as well.

To solve this problem, we built a new rules system written on the same highly performant Ruleset Engine that powers the new WAF, Transform Rules, and Cache Rules, rather than the old Gagarin heuristics engine that was fast but inflexible. This new framework gives us the flexibility we need to write highly complex rules to catch more elusive bots without the risk of interfering with legitimate traffic. The data gathered by these new detections are then labeled and used to train our Machine Learning engine, ensuring we will continue to catch these bots as their operators attempt to adapt.

What’s next?

We’ve heard from Bot Fight Mode customers that they need more flexibility. Website operators now expect a significant percentage of their legitimate traffic to come from automated sources, like service to service APIs. These customers are waiting to enable Bot Fight Mode until they can tell us what parts of their website it can run on safely. In 2023, we will give everyone the ability to write their own flexible Bot Fight Mode rules, so that every Cloudflare customer can join the fight against bots!

Update: Mangroves, Climate Change & economic development

We’re also pleased to report the second tree planting project from our 2021 bot activity is now complete! Earlier this year, Cloudflare contributed 25,000 trees to a restoration project at Victoria Park in Nova Scotia.

For our second project, we donated 10,000 trees to a much larger restoration project on the eastern shoreline of Kumirmari island in the Sundarbans of West Bengal, India. In total, the project included more than 415,000 trees along 7.74 hectares of land in areas that have been degraded or deforested. The types of trees planted included Bain, Avicennia officianalis, Kalo Bain, and eight others.

The Sundarbans are located on the delta of the Ganges, Brahmaptura, and Meghna rivers on the Bay of Bengal, and are home to one of the world’s largest mangrove forests. The forest is not only a UNESCO World Heritage site, but also home to 260 bird species as well as a number of threatened species like the Bengal tiger, the estuarine crocodile, and Indian python. According to One Tree Planted, the Sundarbans are currently under threat from rising sea levels, increasing salinity in the water and soil, cyclonic storms, and flooding.

The Intergovernmental Panel on Climate Change (IPCC) has found that mangroves are critical to mitigating greenhouse gas (GHG) emissions and protecting coastal communities from extreme weather events caused by climate change. The Sundarbans mangrove forest is one of the world’s largest carbon sinks (an area that absorbs more carbon than it emits). One study suggested that coastal mangrove forests sequester carbon at a rate of two to four times that of a mature tropical or subtropical forest region.

One of the most exciting parts of this project was its focus on hiring and empowering local women. According to One Tree Planted, 75 percent of those involved in the project were women, including 85 women employed to monitor and manage the planting site over a five-month period. Participants also received training in the seed collection process with the goal of helping local residents lead mangrove planting from start to finish in the future.

More bots stopped, more trees planted!

Thanks to every Cloudflare customer who’s enabled Bot Fight Mode so far. You’ve helped make the Internet a better place by stopping malicious bots, and you’ve helped make the planet a better place by reforesting the Earth on bot operators’ dime. The more domains that use Bot Fight Mode, the more trees we can plant, so sign up for Cloudflare and activate Bot Fight Mode today!

Announcing Turnstile, a user-friendly, privacy-preserving alternative to CAPTCHA

2022-09-28 Reid Tatoris

Post Syndicated from Reid Tatoris original https://blog.cloudflare.com/turnstile-private-captcha-alternative/

Announcing Turnstile, a user-friendly, privacy-preserving alternative to CAPTCHA

Today, we’re announcing the open beta of Turnstile, an invisible alternative to CAPTCHA. Anyone, anywhere on the Internet, who wants to replace CAPTCHA on their site will be able to call a simple API, without having to be a Cloudflare customer or sending traffic through the Cloudflare global network. Sign up here for free.

There is no point in rehashing the fact that CAPTCHA provides a terrible user experience. It’s been discussed in detail before on this blog, and countless times elsewhere. The creator of the CAPTCHA has even publicly lamented that he “unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles.” We hate it, you hate it, everyone hates it. Today we’re giving everyone a better option.

Turnstile is our smart CAPTCHA alternative. It automatically chooses from a rotating suite of non-intrusive browser challenges based on telemetry and client behavior exhibited during a session. We talked in an earlier post about how we’ve used our Managed Challenge system to reduce our use of CAPTCHA by 91%. Now anyone can take advantage of this same technology to stop using CAPTCHA on their own site.

UX isn’t the only big problem with CAPTCHA — so is privacy

While having to solve a CAPTCHA is a frustrating user experience, there is also a potential hidden tradeoff a website must make when using CAPTCHA. If you are a small site using CAPTCHA today, you essentially have one option: an 800 pound gorilla with 98% of the CAPTCHA market share. This tool is free to use, but in fact it has a privacy cost: you have to give your data to an ad sales company.

According to security researchers, one of the signals that Google uses to decide if you are malicious is whether you have a Google cookie in your browser, and if you have this cookie, Google will give you a higher score. Google says they don’t use this information for ad targeting, but at the end of the day, Google is an ad sales company. Meanwhile, at Cloudflare, we make money when customers choose us to protect their websites and make their services run better. It’s a simple, direct relationship that perfectly aligns our incentives.

Less data collection, more privacy, same security

In June, we announced an effort with Apple to use Private Access Tokens. Visitors using operating systems that support these tokens, including the upcoming versions of macOS or iOS, can now prove they’re human without completing a CAPTCHA or giving up personal data.

By collaborating with third parties like device manufacturers, who already have the data that would help us validate a device, we are able to abstract portions of the validation process, and confirm data without actually collecting, touching, or storing that data ourselves. Rather than interrogating a device directly, we ask the device vendor to do it for us.

Private Access Tokens are built directly into Turnstile. While Turnstile has to look at some session data (like headers, user agent, and browser characteristics) to validate users without challenging them, Private Access Tokens allow us to minimize data collection by asking Apple to validate the device for us. In addition, Turnstile never looks for cookies (like a login cookie), or uses cookies to collect or store information of any kind. Cloudflare has a long track record of investing in user privacy, which we will continue with Turnstile.

We are opening our CAPTCHA replacement to everyone

To improve the Internet for everyone, we decided to open up the technology that powers our Managed Challenge to everyone in beta as a standalone product called Turnstile.

Rather than try to unilaterally deprecate and replace CAPTCHA with a single alternative, we built a platform to test many alternatives and rotate new challenges in and out as they become more or less effective. With Turnstile, we adapt the actual challenge outcome to the individual visitor/browser. First we run a series of small non-interactive JavaScript challenges gathering more signals about the visitor/browser environment. Those challenges include proof-of-work, proof-of-space, probing for web APIs, and various other challenges for detecting browser-quirks and human behavior. As a result, we can fine-tune the difficulty of the challenge to the specific request.

Turnstile also includes machine learning models that detect common features of end visitors who were able to pass a challenge before. The computational hardness of those initial challenges may vary by visitor, but is targeted to run fast.

Swap out your existing CAPTCHA in a few minutes

You can take advantage of Turnstile and stop bothering your visitors with a CAPTCHA even without being on the Cloudflare network. While we make it as easy as possible to use our network, we don’t want this to be a barrier to improving privacy and user experience.

To switch from a CAPTCHA service, all you need to do is:

Create a Cloudflare account, navigate to the `Turnstile` tab on the navigation bar, and get a sitekey and secret key.
Copy our JavaScript from the dashboard and paste over your old CAPTCHA JavaScript.
Update the server-side integration by replacing the old siteverify URL with ours.

There is more detail on the process below, including options you can configure, but that’s really it. We’re excited about the simplicity of making a change.

Deployment options and analytics

To use Turnstile, first create an account and get your site and secret keys.

Then, copy and paste our HTML snippet:

<script src="https://challenges.cloudflare.com/turnstile/v0/api.js" async defer></script>

Once the script is embedded, you can use implicit rendering. Here, the HTML is scanned for elements that have a cf-turnstile class:

<form action="/login" method="POST">
  <div class="cf-turnstile" data-sitekey="yourSiteKey"></div>
  <input type="submit">
</form>

Once a challenge has been solved, a token is injected in your form, with the name cf-turnstile-response. This token can be used with our siteverify endpoint to validate a challenge response. A token can only be validated once, and a token cannot be redeemed twice. The validation can be done on the server side or even in the cloud, for example using a simple Workers fetch (see a demo here):

async function handleRequest() {
    // ... Receive token
    let formData = new FormData();
    formData.append('secret', turnstileISecretKey);
    formData.append('response', receivedToken);
 
    await fetch('https://challenges.cloudflare.com/turnstile/v0/siteverify',
        {
            body: formData,
            method: 'POST'
        });
    // ...
}

For more complex use cases, the challenge can be invoked explicitly via JavaScript:

<script>
    window.turnstileCallbackFunction = function () {
        const turnstileOptions = {
            sitekey: 'yourSitekey',
            callback: function(token) {
                console.log(`Challenge Success: ${token}`);
            }
        };
        turnstile.render('#container', turnstileOptions);
    };
</script>
<div id="container"></div>

You can also create what we call ‘Actions’. Custom labels that allow you to distinguish between different pages where you’re using Turnstile, like a login, checkout, or account creation page.

Once you’ve deployed Turnstile, you can go back to the dashboard and see analytics on where you have widgets deployed, how users are solving them, and view any defined actions.

Why are we giving this away for free?

While this is sometimes hard for people outside to believe, helping build a better Internet truly is our mission. This isn’t the first time we’ve built free tools that we think will make the Internet better, and it won’t be the last. It’s really important to us.

So whether or not you’re a Cloudflare customer today, if you’re using a CAPTCHA, try Turnstile for free, instead. You’ll make your users happier, and minimize the data you send to third parties.

Visit this page to sign up for the best invisible, privacy-first, CAPTCHA replacement and to retrieve your Turnstile beta sitekey.

Crawler Hints supports Microsoft’s IndexNow in helping users find new content

2022-08-12 Alex Krivit

Post Syndicated from Alex Krivit original https://blog.cloudflare.com/crawler-hints-supports-microsofts-indexnow-in-helping-users-find-new-content/

Crawler Hints supports Microsoft’s IndexNow in helping users find new content

The web is constantly changing. Whether it’s news or updates to your social feed, it’s a constant flow of information. As a user, that’s great. But have you ever stopped to think how search engines deal with all the change?

It turns out, they “index” the web on a regular basis — sending bots out, to constantly crawl webpages, looking for changes. Today, bot traffic accounts for about 30% of total traffic on the Internet, and given how foundational search is to using the Internet, it should come as no surprise that search engine bots make up a large proportion of that what might come as a surprise is how inefficient the model is, though: we estimate that over 50% of crawler traffic is wasted effort.

This has a huge impact. There’s all the additional capacity that owners of websites need to bake into their site to absorb the bots crawling all over it. There’s the transmission of the data. There’s the CPU cost of running the bots. And when you’re running at the scale of the Internet, all of this has a pretty big environmental footprint.

Part of the problem, though, is nobody had really stopped to ask: maybe there’s a better way?

Right now, the model for indexing websites is the same as it has been since the 1990s: a “pull” model, where the search engine sends a crawler out to a website after a predetermined amount of time. During Impact Week last year, we asked: what about flipping the model on its head? What about moving to a push model, where a website could simply ping a search engine to let it know an update had been made?

There are a heap of advantages to such a model. The website wins: it’s not dealing with unnecessary crawls. It also makes sure that as soon as there’s an update to its content, it’s reflected in the search engine — it doesn’t need to wait for the next crawl. The website owner wins because they don’t need to manage distinct search engine crawl submissions. The search engine wins, too: it saves money on crawl costs, and it can make sure it gets the latest content.

Of course, this needs work to be done on both sides of the equation. The websites need a mechanism to alert the search engines; and the search engines need a mechanism to receive the alert, so they know when to do the crawl.

Crawler Hints — Cloudflare’s Solution for Websites

Solving this problem is why we launched Crawler Hints. Cloudflare sits in a unique position on the Internet — we’re serving on average 36 million HTTP requests per second. That represents a lot of websites. It also means we’re uniquely positioned to help solve this problem: to help give crawlers hints about when they should recrawl if new content has been added or if content on a site has recently changed.

With Crawler Hints, we send signals to web indexers based on cache data and origin status codes to help them understand when content has likely changed or been added to a site. The aim is to increase the number of relevant crawls as well as drastically reduce the number of crawls that don’t find fresh content, saving bandwidth and compute for both indexers and sites alike, and improving the experience of using the search engines.

But, of course, that’s just half the equation.

IndexNow Protocol — the Search Engine Moves from Pull to Push

Websites alerting the search engine about changes is useless if the search engines aren’t listening — and they simply continue to crawl the way they always have. Of course, search engines are incredibly complicated, and changing the way they operate is no easy task.

The IndexNow Protocol is a standard developed by Microsoft, Seznam.cz and Yandex, and it represents a major shift in the way search engines operate. Using IndexNow, search engines have a mechanism by which they can receive signals from Crawler Hints. Once they have that signal, they can shift their crawlers from a pull model to a push model.

In a recent update, Microsoft has announced that millions of websites are now using IndexNow to signal to search engine crawlers when their content needs to be crawled and IndexNow was used to index/crawl about 7% of all new URLs clicked when someone is selecting from web search results.

On the Cloudflare side, since the release of Crawler Hints in October 2021, Crawler Hints has processed about six-hundred-billion signals to IndexNow.

That’s a lot of saved crawls.

How to enable Crawler Hints

By enabling Crawler Hints on your website, with the simple click of a button, Cloudflare will take care of signaling to these search engines when your content has changed via the IndexNow API. You don’t need to do anything else!

Crawler Hints is free to use and available to all Cloudflare customers. If you’d like to see how Crawler Hints can benefit how your website is indexed by the world’s biggest search engines, please feel free to opt-into the service by:

Sign in to your Cloudflare Account.
In the dashboard, navigate to the Cache tab.
Click on the Configuration section.
Locate the Crawler Hints and enable.

Upon enabling Crawler Hints, Cloudflare will share when content on your site has changed and needs to be re-crawled with search engines using the IndexNow protocol (this blog can help if you’re interested in finding out more about how the mechanism works).

What’s Next?

Going forward, because the benefits are so substantial for site owners, search operators, and the environment, we plan to start defaulting Crawler Hints on for all our customers. We’re also hopeful that Google, the world’s largest search engine and most wasteful user of Internet resources, will adopt IndexNow or a similar standard and lower the burden of search crawling on the planet.

When we think of helping to build a better Internet, this is exactly what comes to mind: creating and supporting standards that make it operate better, greener, faster. We’re really excited about the work to date, and will continue to work to improve the signaling to ensure the most valuable information is being sent to the search engines in a timely manner. This includes incorporating additional signals such as etags, last-modified headers, and content hash differences. Adding these signals will help further inform crawlers when they should reindex sites, and how often they need to return to a particular site to check if it’s been changed. This is only the beginning. We will continue testing more signals and working with industry partners so that we can help crawlers run efficiently with these hints.

And finally: if you’re on Cloudflare, and you’d like to be part of this revolution in how search engines operate on the web (it’s free!), simply follow the instructions in the section above.

35,000 new trees in Nova Scotia

2022-07-13 Patrick Day

Post Syndicated from Patrick Day original https://blog.cloudflare.com/35-000-new-trees-in-nova-scotia/

35,000 new trees in Nova Scotia

Cloudflare is proud to announce the first 35,000 trees from our commitment to help clean up bad bots (and the climate) have been planted.

Working with our partners at One Tree Planted (OTP), Cloudflare was able to support the restoration of 20 hectares of land at Victoria Park in Nova Scotia, Canada. The 130-year-old natural woodland park is located in the heart of Truro, NS, and includes over 3,000 acres of hiking and biking trails through natural gorges, rivers, and waterfalls, as well as an old-growth eastern hemlock forest.

The planting projects added red spruce, black spruce, eastern white pine, eastern larch, northern red oak, sugar maple, yellow birch, and jack pine to two areas of the park. The first area was a section of the park that recently lost a number of old conifers due to insect attacks. The second was an area previously used as a municipal dump, which has since been covered by a clay cap and topsoil.

Our tree commitment began far from the Canadian woodlands. In 2019, we launched an ambitious tool called Bot Fight Mode, which for the first time fought back against bots, targeting scrapers and other automated actors.

Our idea was simple: preoccupy bad bots with nonsense tasks, so they cannot attack real sites. Even better, make these tasks computationally expensive to engage with. This approach is effective, but it forces bad actors to consume more energy and likely emit more greenhouse gasses (GHG). So in addition to launching Bot Fight Mode, we also committed to supporting tree planting projects to account for any potential environmental impact.

What is Bot Fight Mode?

As soon as Bot Fight Mode is enabled, it immediately starts challenging bots that visit your site. It is available to all Cloudflare customers for free, regardless of plan.

When Bot Fight Mode identifies a bot, it issues a computationally expensive challenge to exhaust it (also called “tarpitting”). Our aim is to disincentivize attackers, so they have to find a new hobby altogether. When we tarpit a bot, we require a significant amount of compute time that will stall its progress and result in a hefty server bill. Sorry not sorry.

We do this because bots are leeches. They draw resources, slow down sites, and abuse online platforms. They also hack into accounts and steal personal data. Of course, we allowlist a small number of bots that are well-behaved, like Slack and Google. And Bot Fight Mode only acts on traffic from cloud and hosting providers (because that is where bots usually originate from).

Over 550,000 sites use Bot Fight Mode today! We believe this makes it the most widely deployed bot management solution in the world (though this is impossible to validate). Free customers can enable the tool from the dashboard and paid customers can use a special version, known as Super Bot Fight Mode.

How many trees? Let’s do the math 🚀

Now, the hard part: how can we translate bot challenges into a specific number of trees that should be planted? Fortunately, we can use a series of unit conversions, similar to those we use to calculate Cloudflare’s total GHG emissions.

We started with the following assumptions.

Table 1.

Measure	Quantity	Scaled	Source
Energy used by a standard server	1,760.3 kWh / year	To hours (0.2 kWh / hour)	Go Climate
Emissions factor	0.33852 kgCO2e / kWh	To grams (338.52 gCO2e / kWh)	Go Climate
CO2 absorbed by a mature tree	48 lbsCO2e / year	To kilograms (21 kgCO2e / year)	One Tree Planted

Next, we selected a high-traffic day to model the rate and duration of bot challenges on our network. On May 23, 2021, Bot Fight Mode issued 2,878,622 challenges, which lasted an average of 50 seconds each. In total, bots spent 39,981 hours engaging with our network defenses, or more than four years of challenges in a single day!

We then converted that time value into kilowatt-hours (kWh) of energy based on the rate of power consumed by our generic server listed in Table 1 above.

39,981 (hours) x .2 (kWh/hour) = 7,996 (kWh)

Once we knew the total amount of energy consumed by bad bot servers, we used an emissions factor (the amount of greenhouse gasses emitted per unit of energy consumed) to determine total emissions.

7,996 (kwh) x 338.52 (gCO2e/kwh) = 2,706,805 (gCO2e)

If you have made it this far, clearly you like to geek out like we do, so for the sake of completeness, the unit commonly used in emissions calculations is carbon dioxide equivalent (CO2e), which is a composite unit for all six GHGs listed in the Kyoto Protocol weighted by Global Warming Potential.

The last conversion we needed was from emissions to trees. Our partners at OTP found that a mature tree absorbs roughly 21 kgCO2e per year. Based on our total emissions that translates to roughly 47,000 trees per server, or 840 trees per CPU core. However, in our original post, we also noted that given the time it takes for a newly planted tree to reach maturity, we would multiply our donation by a factor of 25.

In the end, over the first two years of the program, we calculated that we would need approximately 42,000 trees to account for all the individual CPU cores engaged in Bot Fight Mode. For good measure, we rounded up to an even 50,000.

We are proud that most of these trees are already in the ground, and we look forward to providing an update when the final 15,000 are planted.

A piece of the puzzle

“Planting trees will benefit species diversity of the existing forest, animal habitat, greening of reclamation areas as well as community recreation areas, and visual benefits along popular hiking/biking trail networks.”
– Stephanie Clement, One Tree Planted, Project Manager North America

Reforestation is an important part of protecting healthy ecosystems and promoting biodiversity. Trees and forests are also a fundamental part of helping to slow the growth of global GHG emissions.

However, we recognize there is no single solution to the climate crisis. As part of our mission to help build a better, more sustainable Internet, Cloudflare is investing in renewable energy, tools that help our customers understand and mitigate their own carbon footprints on our network, and projects that will help offset or remove historical emissions associated with powering our network by 2025.

Want to be part of our bots & trees effort? Enable Bot Fight Mode today! It’s available on our free plan and takes only a few seconds. By the time we made our first donation to OTP in 2021, Bot Fight Mode had already spent more than 3,000 years distracting bots.

Help us defeat bad bots and improve our planet today!

—-
For more information on Victoria Park, please visit https://www.victoriaparktruro.ca
For more information on One Tree Planted, please visit https://onetreeplanted.org
For more information on sustainability at Cloudflare, please visit www.cloudflare.com/impact

Internet Explorer, we hardly knew ye

2022-06-29 David Belson

Post Syndicated from David Belson original https://blog.cloudflare.com/internet-explorer-retired/

Internet Explorer, we hardly knew ye

On May 19, 2021, a Microsoft blog post announced that “The future of Internet Explorer on Windows 10 is in Microsoft Edge” and that “the Internet Explorer 11 desktop application will be retired and go out of support on June 15, 2022, for certain versions of Windows 10.” According to an associated FAQ page, those “certain versions” include Windows 10 client SKUs and Windows 10 IoT. According to data from Statcounter, Windows 10 currently accounts for over 70% of desktop Windows market share on a global basis, so this “retirement” impacts a significant number of Windows systems around the world.

As the retirement date for Internet Explorer 11 has recently passed, we wanted to explore several related usage trends:

Is there a visible indication that use is declining in preparation for its retirement?
Where is Internet Explorer 11 still in the heaviest use?
How does the use of Internet Explorer 11 compare to previous versions?
How much Internet Explorer traffic is “likely human” vs. “likely automated”?
How do Internet Explorer usage patterns compare with those of Microsoft Edge, its replacement?

The long goodbye

Publicly released in January 2020, and automatically rolled out to Windows users starting in June 2020, Chromium-based Microsoft Edge has become the default browser for the Windows platform, intended to replace Internet Explorer. Given the two-year runway, and Microsoft’s May 2021 announcement, we would expect to see Internet Explorer traffic decline over time as users shift to Edge.

Looking at global request traffic to Cloudflare from Internet Explorer versions between January 1 and June 20, 2022, we see in the graph below that peak request volume for Internet Explorer 11 has declined by approximately one-third over that period. The clear weekly usage pattern suggests higher usage in the workplace than at home, and the nominal decline in traffic year-to-date suggests that businesses are not rushing to replace Internet Explorer with Microsoft Edge. However, we expect traffic from Internet Explorer 11 to drop more aggressively as Microsoft rolls out a two-phase plan to redirect users to Microsoft Edge, and then ultimately disable Internet Explorer. Having said that, we do not expect Internet Explorer 11 traffic to ever fully disappear for several reasons, including Microsoft Edge’s “IE Mode” representing itself as Internet Explorer 11, the ongoing usage of Internet Explorer 11 on Windows 8.1 and Windows 7 (which were out of scope for the retirement announcement), and automated (bot) traffic masquerading as Internet Explorer 11.

It is also apparent in the graph above that traffic from earlier versions of Internet Explorer has never fully disappeared. (In fact, we still see several million requests each day from clients purporting to be Internet Explorer 2, which was released in November 1995 — over a quarter-century ago.) After version 11, Internet Explorer 7, first released in October 2006 and last updated in May 2009, generates the next largest volume of requests. Traffic trends for this version have remained relatively consistent. Internet Explorer 9 was the next largest traffic generator through late May, when Internet Explorer 6 seemed to stage a comeback. (Internet Explorer 7 saw a slight bump in traffic at that time as well.)

Where is Internet Explorer 11 used?

Perhaps unsurprisingly, the United States has accounted for the largest volume of Internet Explorer 11 requests year-to-date. Similar to the global observation above, daily peak request traffic has declined by approximately one-third. With request volume approximately one-fourth that seen in the United States, Japan ostensibly has the next largest Internet Explorer 11 user base. (And published reports note that Internet Explorer’s retirement is likely to cause Japan headaches ‘for months’” because local businesses and government agencies didn’t take action in the months ahead of the event.)

However, unusual shifts in Brazil’s request volume, seen in the graph above, are particularly surprising. For several weeks in January, Internet Explorer 11 traffic from the country appears to quadruple, with the same behavior seen from early May through mid-June, as well as a significant spike in March. Classifying the request traffic by bot score, as shown in the graph below, makes it clear that the observed increases are the result of automated (bot) traffic presenting itself as coming from Internet Explorer 11.

Further, analyzing this traffic to see what percentage of requests were mitigated by Cloudflare’s Web Application Firewall, we find that the times when the mitigation percentage increased, as shown in the graph below, align very closely with the periods where we observed the higher levels of automated (bot) traffic. This suggests that the spikes in Internet Explorer 11 traffic coming from Brazil that were seen over the last six months were from a botnet presenting itself as that version of the browser.

Bot or not

Building on the Brazil analysis, breaking out the traffic for each version by associated bot score can help us better understand the residual traffic from long-deprecated versions of Internet Explorer shown above. For requests with a bot score that characterizes the traffic as “likely human”, the graph below shows clear weekly traffic patterns for versions 11 and 7, suggesting that the traffic is primarily driven by systems primarily in use on weekdays, likely by business users. For Internet Explorer 7, that traffic pattern becomes more evident starting in mid-February, after a significant decline in associated request volume.

Interestingly, that decline in “likely human” Internet Explorer 7 request volume aligns with an increase in “likely automated” (bot) request volume for that version, visible in the graph below. Given that the “likely human” traffic didn’t appear to migrate to another version of Internet Explorer, the shift may be related to improvements to the machine learning model that powers bot detection that were rolled out in the January/February time frame. It is also interesting to note that “likely automated” request volume for both Internet Explorer 11 and 7 has been extremely similar since mid-March. It is not immediately clear why this is the case.

We can also use this data to understand what percentage of the traffic from a given version of Internet Explorer is likely to be automated (coming from bots). The graph below highlights the ratios for Internet Explorer 11 and 7. For version 11, we can see that the percentage has grown from around 60% at the start of 2022 to around 80% in June. For version 7, it starts the year in the 40% range, and more than doubles to over 80% in February and remains consistent at that level.

However, when we look at firewall mitigated traffic percentages, we don’t see the same clear alignment of trends as was visible for Brazil, as discussed above. In addition, only a fraction of the “likely automated” traffic was mitigated, suggesting that the automated traffic is split between being generated by bots and other non-malicious tools, such as performance testing.

Internet Explorer versions 6 & 9 were also discussed above, with respect to driving the largest volume of requests. However, when we examine the “likely automated” request ratios for these two browsers, we find that most of their traffic appears to be bot-driven. Internet Explorer 6 started 2022 at around 80%, growing to 95% in June. In contrast, Internet Explorer 9 starts the year around 90%, drops to 60% at the end of January, and then gradually increases back to the 75-80% range.

As Internet Explorer 6’s “likely automated” traffic has increased, the fraction of it that was mitigated has increased as well. The small bumps visible in the graph above align with the larger spikes in the graph below, potentially due to brief bursts of bot activity. In contrast, mitigated Internet Explorer 9 traffic has remained relatively consistent, even as its automated request percentage dropped and then gradually increased.

For the oldest, long-deprecated versions of Internet Explorer, automated traffic frequently comprises more than 80% of request volume, reaching 100% on multiple days year-to-date. Mitigated traffic generally amounted to under 30% of request volume, although Internet Explorer 2 frequently increased to the 50% range, spiking as high as 90%.

Edging into the future

As Microsoft stated, “the future of Internet Explorer on Windows 10 is in Microsoft Edge.” Given that, we wanted to understand the usage patterns of Microsoft Edge. Similar to the analysis above, we looked at request volumes for the last ten versions of the browser year-to-date. The graph below clearly illustrates strong enterprise usage of edge, with weekday peaks, and lower traffic on the weekends. In addition, the four-week major release cycle cadence is clearly evident, with a long tail of usage extending across eight weeks due to enterprise customers who need an extended timeline to manage updates.

Having said that, in analyzing the split by bot score for these Edge versions, we note that only around 80% of requests are classified as “likely human” for about eight weeks after a given version is released, after which it gradually tapers to around 60%. The balance is classified as “likely automated”, suggesting that those who develop bots and other automated processes recognize the value in presenting their user agents as the latest version of Microsoft’s web browser. For Edge, there does not appear to be any meaningful correlation between firewall mitigated traffic percentages and “likely automated” traffic percentages or the traffic cycles visible in the graph above.

Conclusion

Analyzing traffic trends from deprecated versions of Internet Explorer brought to mind the “I’m not dead yet” scene from Monty Python and the Holy Grail with these older versions of the browser claiming to still be alive, at least from a traffic perspective. However, categorizing this traffic to better understand the associated bot/human split showed that the majority of Internet Explorer traffic seen by Cloudflare, including for Internet Explorer 11, is apparently not coming from actual browser clients installed on user systems, but rather from bots and other automated processes. For the automated traffic, analysis of firewall mitigation activity shows that the percentage likely coming from malicious bots varies by version.

As Microsoft executes its planned two-phase approach for actively moving users off of Internet Explorer, it will be interesting to see how both request volumes and bot/human splits for the browser change over time – check back later this year for an updated analysis.

The end of the road for Cloudflare CAPTCHAs

2022-04-01 Reid Tatoris

Post Syndicated from Reid Tatoris original https://blog.cloudflare.com/end-cloudflare-captcha/

The end of the road for Cloudflare CAPTCHAs

There is no point in rehashing the fact that CAPTCHA provides a terrible user experience. It’s been discussed in detail before on this blog, and countless times elsewhere. One of the creators of the CAPTCHA has publicly lamented that he “unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles.” We don’t like them, and you don’t like them.

So we decided we’re going to stop using CAPTCHAs. Using an iterative platform approach, we have already reduced the number of CAPTCHAs we choose to serve by 91% over the past year.

Before we talk about how we did it, and how you can help, let’s first start with a simple question.

Why in the world is CAPTCHA still used anyway?

If everyone agrees CAPTCHA is so bad, if there have been calls to get rid of it for 15 years, if the creator regrets creating it, why is it still widely used?

The frustrating truth is that CAPTCHA remains an effective tool for differentiating real human users from bots despite the existence of CAPTCHA-solving services. Of course, this comes with a huge trade off in terms of usability, but generally the alternatives to CAPTCHA are blocking or allowing traffic, which will inherently increase either false positives or false negatives. With a choice between increased errors and a poor user experience (CAPTCHA), many sites choose CAPTCHA.

CAPTCHAs are also a safe choice because so many other sites use them. They delegate abuse response to a third party, and remove the risk from the website with a simple integration. Using the most common solution will rarely get you into trouble. Plug, play, forget.

Lastly, CAPTCHA is useful because it has a long history of a known and stable baseline. We’ve tracked a metric called CAPTCHA (or Challenge) Solve Rate for many years. CAPTCHA solve rate is the number of CAPTCHAs solved, divided by the number of page loads. For our purposes both failing or not attempting to solve the CAPTCHA count as a failure, since in either case a user cannot access the content they want to. We find this metric to typically be stable for any particular website. That is, if the solve rate is 1%, it tends to remain at 1% over time. We also find that any change in solve rate – up or down – is a strong indicator of an attack in progress. Customers can monitor the solve rate and create alerts to notify them when it changes, then investigate what might be happening.

Many alternatives to CAPTCHA have been tried, including our own Cryptographic Attestation. However, to date, none have seen the amount of widespread adoption of CAPTCHAs. We believe attempting to replace CAPTCHA with a single alternative is the main reason why. When you replace CAPTCHA, you lose the stable history of the solve rate, and making decisions becomes more difficult. If you switch from deciphering text to picking images, you will get vastly different results. How do you know if those results are good or bad? So, we took a different approach.

Many solutions, not one

Rather than try to unilaterally deprecate and replace CAPTCHA with a single alternative, we built a platform to test many alternatives and see which had the best potential to replace CAPTCHA. We call this Cloudflare Managed Challenge.

Managed Challenge is a smarter solution than CAPTCHA. It defers the decision about whether to serve a visual puzzle to a later point in the flow after more information is available from the browser. Previously, a Cloudflare customer could only choose between either a CAPTCHA or JavaScript Challenge as the action of a security or firewall rule. Now, the Managed Challenge option will decide to show a visual puzzle or other means of proving humanness to visitors based on the client behavior exhibited during a challenge and based on the telemetry we receive from the visitor. A customer simply tells us, “I want you (Cloudflare) to take appropriate actions to challenge this type of traffic as you see necessary.“

With Managed Challenge, we adapt the actual challenge outcome to the individual visitor/browser. As a result, we can fine-tune the difficulty of the challenge itself and avoid showing visual puzzles to more than 90% of human requests, while at the same time presenting harder challenges to visitors that exhibit non-human behaviors.

When a visitor encounters a Managed Challenge, we first run a series of small non-interactive JavaScript challenges gathering more signals about the visitor/browser environment. This means we deploy in-browser detections and challenges at the time the request is made. Challenges are selected based on what characteristics the visitor emits and based on the initial information we have about the visitor. Those challenges include, but are not limited to, proof-of-work, proof-of-space, probing for web APIs, and various challenges for detecting browser-quirks and human behavior.

They also include machine learning models that detect common features of end visitors who were able to pass a CAPTCHA before. The computational hardness of those initial challenges may vary by visitor, but is targeted to run fast. Managed Challenge is also integrated into the Cloudflare Bot Management and Super Bot Fight Mode systems by consuming signals and data from the bot detections.

After our non-interactive challenges have been run, we evaluate the gathered signals. If by the combination of those signals we are confident that the visitor is likely human, no further action is taken, and the visitor is redirected to the destined page without any interaction required. However, in some cases, if the signal is weak, we present a visual puzzle to the visitor to prove their humanness. In the context of Managed Challenge, we’re also experimenting with other privacy-preserving means of attesting humanness, to continue reducing the portion of time that Managed Challenge uses a visual puzzle step.

We started testing Managed Challenge last year, and initially, we chose from a rotating subset of challenges, one of them being CAPTCHA. At the start, CAPTCHA was still used in the vast majority of cases. We compared the solve rate for the new challenge in question, with the existing, stable solve rate for CAPTCHA. We thus used CAPTCHA solve rate as a goal to work towards as we improved our CAPTCHA alternatives, getting better and better over time. The challenge platform allows our engineers to easily create, deploy, and test new types of challenges without impacting customers. When a challenge turns out to not be useful, we simply deprecate it. When it proves to be useful, we increase how often it is used. In order to preserve ground-truth, we also randomly choose a small subset of visitors to always solve a visual puzzle to validate our signals.

Managed Challenge performs better than CAPTCHA

The Challenge Platform now has the same stable solve rate as previously used CAPTCHAs.

Using an iterative platform approach, we have reduced the number of CAPTCHAs we serve by 91%. This is only the start. By the end of the year, we will reduce our use of CAPTCHA as a challenge to less than 1%. By skipping the visual puzzle step for almost all visitors, we are able to reduce the visitor time spent in a challenge from an average of 32 seconds to an average of just one second to run our non-interactive challenges. We also see churn improvements: our telemetry indicates that visitors with human properties are 31% less likely to abandon a Managed Challenge than on the traditional CAPTCHA action.

Today, the Managed Challenge platform rotates between many challenges. A Managed Challenge instance consists of many sub-challenges: some of them are established and effective, whereas others are new challenges we are experimenting with. All of them are much, much faster and easier for humans to complete than CAPTCHA, and almost always require no interaction from the visitor.

Managed Challenge replaces CAPTCHA for Cloudflare

We have now deployed Managed Challenge across the entire Cloudflare network. Any time we show a CAPTCHA to a visitor, it’s via the Managed Challenge platform, and only as a benchmark to confirm our other challenges are performing as well.

All Cloudflare customers can now choose Managed Challenge as a response option to any Firewall rule instead of CAPTCHA. We’ve also updated our dashboard to encourage all Cloudflare customers to make this choice.

You’ll notice that we changed the name of the CAPTCHA option to ‘Legacy CAPTCHA’. This more accurately describes what CAPTCHA is: an outdated tool that we don’t think people should use. As a result, the usage of CAPTCHA across the Cloudflare network has dropped significantly, and usage of managed challenge has increased dramatically.

As noted above, today CAPTCHA represents 9% of Managed Challenge solves (light blue), but that number will decrease to less than 1% by the end of the year. You’ll also see the gray bar above, which shows when our customers have chosen to show a CAPTCHA as a response to a Firewall rule triggering. We want that number to go to zero, but the good news is that 63% of customers now choose Managed Challenge rather than CAPTCHA when they create a Firewall rule with a challenge response action.

We expect this number to increase further over time.

If you’re using the Cloudflare WAF, log into the Dashboard today and look at all of your Firewall rules. If any of your rules are using “Legacy CAPTCHA” as a response, please change it now! Select the “Managed Challenge” response option instead. You’ll give your users a better experience, while maintaining the same level of protection you have today. If you’re not currently a Cloudflare customer, stay tuned for ways you can reduce your own use of CAPTCHA.

Cloudflare Observability

2022-03-18 Tanushree Sharma

Post Syndicated from Tanushree Sharma original https://blog.cloudflare.com/vision-for-observability/

Cloudflare Observability

Whether you’re a software engineer deploying a new feature, network engineer updating routes, or a security engineer configuring a new firewall rule: You need visibility to know if your system is behaving as intended — and if it’s not, to know how to fix it.

Cloudflare is committed to helping our customers get visibility into the services they have protected behind Cloudflare. Being a single pane of glass for all network activity has always been one of Cloudflare’s goals. Today, we’re outlining the future vision for Cloudflare observability.

What is observability?

Observability means gaining visibility into the internal state of a system. It’s used to give users the tools to figure out what’s happening, where it’s happening, and why.

At Cloudflare, we believe that observability has three core components: monitoring, analytics, and forensics. Monitoring measures the health of a system – it tells you when something is going wrong. Analytics give you the tools to visualize data to identify patterns and insights. Forensics helps you answer very specific questions about an event.

Observability becomes particularly important in the context of security to validate that any mitigating actions performed by our security products, such as Firewall or Bot Management, are not false positives. Was that request correctly classified as malicious? And if it wasn’t, which detection system classified it as such?

Cloudflare, additionally, has products to improve performance of applications and corporate networks and allow developers to write lightning fast code that runs on our global network. We want to be able to provide our customers with insights into every request, packet, and fetch that goes through Cloudflare’s network.

Monitoring and Notifying

Analytics are fantastic for summarizing data, but how do you know when to look at them? No one wants to sit on the dashboard clicking refresh over and over again just in case something looks off. That’s where notifications come in.

When we talk about something “looking off” on an analytics page, what we really mean is that there’s a significant change in your traffic or network which is reflected by spikes or drops in our analytics. Availability and performance directly affect end users, and our goal is to monitor and notify our customers as soon as we see things going wrong.

Today, we have many different types of notifications from Origin Error Rates, Security Events, and Advanced Security Events to Usage Based Billing and Health Checks. We’re continuously adding more notification types to have them correspond with our awesome analytics. As our analytics get more customizable, our notifications will as well.

There’s tons of different algorithms that can be used to detect spikes, including using burn rates and z-scores. We’re continuing to iterate on the algorithms that we use for detections to offer more variations, make them smarter, and make sure that our notifications are both accurate and not too noisy.

Analytics

So, you’ve received an alert from Cloudflare. What comes next?

Analytics can be used to get a birds eye view of traffic or focus on specific types of events by adding filters and time ranges. After you receive an alert, we want to show you exactly what’s been triggered through graphs, high level metrics, and top Ns on the Cloudflare dashboard.

Whether you’re a developer, security analyst, or network engineer, the Cloudflare dashboard should be the spot for you to see everything you need. We want to make the dashboard more customizable to serve the diverse use cases of our customers. Analyze data by specifying a timeframe and filter through dropdowns on the dashboard, or build your own metrics and graphs that work alongside the raw logs to give you a clear picture of what’s happening.

Focusing on security, we believe analytics are the best tool to build confidence before deploying security policies. Moving forward, we plan to layer all of our security related detection signals on top of HTTP analytics so you can use the dashboard to answer questions such as: if I were to block all requests that the WAF identifies as an XSS attack, what would I block?

Customers using our enterprise Bot Management may already be familiar with this experience, and as we improve it and build upon it further, all of our other security products will follow.

Analytics are a powerful tool to see high level patterns and identify anomalies that indicate that something unusual is happening. We’re working on new dashboards, customizations, and features that widen the use cases for our customers. Stay tuned!

Logs

Logs are used when you want to examine specific details about an event. They consist of a timestamp and fields that describe the event and are used to get visibility on a granular level when you need a play-by-play.

In each of our datasets, an event measures something different. For example, in HTTP request logs, an event is when an end user requests content from or sends content to a server. For Firewall logs, an event occurs when the Firewall takes an action on an HTTP request. There can be multiple Firewall events for each HTTP request.

Today, our customers access logs using Logpull, Logpush, or Instant Logs. Logpull and Logpush are great for customers that want to send their logs to third parties (like our Analytics Partners) to store, analyze, and correlate with other data sources. With Instant Logs, our customers can monitor and troubleshoot their traffic in real-time straight from the dashboard or CLI. We’re planning on building out more capabilities to dig into logs on Cloudflare. We’re hard at work on building log storage on R2 – but what’s next?

We’ve heard from customers that the activity log on the Firewall analytics dashboard is incredibly useful. We want to continue to bring the power of logs to the dashboard by adding the same functionality across our products. For customers that will store their logs on Cloudflare R2, this means that we can minimize the use of sampled data.

If you’re looking for something very specific, querying logs is also important, which is where forensics comes in. The goal is to let you investigate from high level analytics all the way down to individual logs lines that make them up. Given a unique identifier, such as the ray ID, you should be able to look up a single request, and then correlate it with all other related activity. Find out the client IP of that ray ID and from there, use cases are plentiful: what other requests from this IP are malicious? What paths did the client follow?

Tracing

Logs are really useful, but they don’t capture the context around a request. Traces show the end-to-end life cycle of a request from when a user requests a resource to each of the systems that are involved in its delivery. They’re another way of applying forensics to help you find something very specific.

These are used to differentiate each part of the application to identify where errors or bottlenecks are occurring. Let’s say that you have a Worker that performs a fetch event to your origin and a third party API. Analytics can show you average execution times and error rates for your Worker, but it doesn’t give you visibility into each of these operations.

Using wrangler dev and console.log statements are really helpful ways to test and debug your code. They bring some of the visibility that’s needed, but it can be tedious to instrument your code like this.

As a developer, you should have the tools to understand what’s going on in your applications so you can deliver the best experience to your end users. We can help you answer questions like: Where is my Worker execution failing? Which operation is causing a spike in latency in my application?

Putting it all together

Notifications, analytics, logs, and tracing each have their distinct use cases, but together, these are powerful tools to provide analysts and developers visibility. Looking forward, we’re excited to bring more and more of these capabilities on the Cloudflare dashboard.

We would love to hear from you as we build these features out. If you’re interested in sharing use cases and helping shape our roadmap, contact your account team!

Announcing Friendly Bots

2022-03-16 Ben Solomon

Post Syndicated from Ben Solomon original https://blog.cloudflare.com/friendly-bots/

Announcing Friendly Bots

When someone mentions bots on the Internet, what’s your first reaction?

It’s probably negative. Most of us conjure up memories of CAPTCHAs, stolen passwords, or some other pain caused by bad bots.

But the truth is, there are plenty of well-behaved bots on the Internet. These include Google’s search crawler and Stripe’s payment bot. At Cloudflare, we manually “verify” good bots, so they don’t get blocked. Our customers can choose to allowlist any bot that is verified. Unfortunately, new bots are popping up faster than we can verify them. So today we’re announcing a solution: Friendly Bots.

Let’s begin with some background.

How does a bot get verified?

We often find good bots via our public form. Anyone can submit a bot, but we prefer that bot operators complete the form to provide us with the information we need. We ask for some standard bits of information: your bot’s name, its public documentation, and its user agent (or regex). Then, we ask for information that will help us validate your bot. There are four common methods:

IP list
Send us a list of IP addresses used by your bot. This doesn’t have to be a static list — you can give us a dynamic page that changes — just provide us with the URL, and we’ll fetch updates every day. These IPs must be publicly documented and exclusive to your bot. If you provide a shared IP address (like one used by a proxy service), our systems will detect risk and refuse to cooperate. We want to avoid accidentally allowing other traffic.

rDNS
This one is fun. You’ve heard of DNS: the phone book of the Internet, which helps map domain names to IP addresses. rDNS works in the reverse, allowing us to take an IP address and deduce the domain name associated with it.

In other words: give us a hostname suffix, and in many cases we’ll be able to validate your bot’s identity!

User agent + ASN validation
In some cases, we can verify bots that consistently come from the same network (known as an “ASN”) with the same user agent. Note that we can’t always do this — traffic becomes easier to spoof — but we’re often confident enough to use this as a validation method.

Machine learning
This is the most flashy method. Cloudflare sees 32+ million requests every second, and we’ve been able to feed those requests into a model that can accurately profile good bots. If the previous validation methods don’t work for you, there’s a good chance we can use ML to spot your bot. But we need enough traffic (thousands of requests) to detect a usable pattern.

We usually approve Verified Bot requests within a few weeks, after taking some time to quality test and ensure everything is safe. But as mentioned before, we often have to reserve this process for trusted partners and larger bots, even though plenty of our users still need their bots allowlisted.

What if my bot isn’t a huge global service?

We keep our ears open (and our eyes on Twitter), so we know that folks want their own “personal” version of Verified Bots.

For example: let’s say you built your own monitoring service that crawls a few of your personal websites. It doesn’t make sense for us to verify this bot, because it doesn’t meet any of our criteria:

Serve the broader Internet.
Objectively demonstrate good behavior.
Comply with Internet standards like robots.txt.

It’s your bot (and to you, it might be good!), but our other users might feel differently. Imagine if someone else’s bot could waltz into your infrastructure at any time!

Here’s another case. Perhaps Cloudflare has labeled a particular proxy as automated, possibly because a mix of humans and bots use the proxy to access the Internet. You may want to allow this traffic on your site without affecting other Cloudflare customers.

Lastly, if you work at a startup, your company may run automated services that haven’t reached the scale we require. But you still need a way to allowlist these services.

Announcing Friendly Bots

The bots described above, especially common services, are not bad. They deserve to sit in a state between bad and verified. They’re friendly.

And we’ve come up with a really cool way to help you manage them.

Our new feature, Friendly Bots, allows you to instantly auto-validate any traffic with the help of IP lists, rDNS, and more.

Here’s how it works: in the Cloudflare dashboard, tell us about your bot. You can point us toward a public IP list, give us a hostname suffix, or even select other methods like machine learning. Cloudflare’s anycast network allows us to run all of these mechanisms at each one of our data centers. This means you’ll have performant, secure, and scalable bot verification.

Build a collection of Friendly Bots and share them between your sites, creating custom policies that allow, rate limit, or log this type of traffic. You may just want to keep tabs on a particular bot; that’s fine. The response options are flexible and directly integrate with our Workers platform.

In the past, we’ve struggled to verify bots that did not crawl the web at a large scale. Why? Our system relies on a cache of verified traffic, ensuring that certain IPs or other data have widely shown good behavior on the Internet. This means that bots were sometimes difficult to verify if they did not make thousands of requests to Cloudflare. With Friendly Bots, we’ve eliminated that requirement, introducing a new, dynamic cache that optimizes for fun-sized projects.

The downstream benefits

Friendly Bots will streamline your dashboard experience. But there are a few hidden, downstream benefits we want to highlight:

Easier verification
Admittedly, it’s challenging to keep up with all the good bots on the Internet. In order to verify a bot, we’ve relied on manual submissions that may come weeks, or even months after a good bot is created. Friendly Bots will change all of that. If we notice many of our customers allowlisting a particular bot — say, a certain IP address or hostname suffix, our systems will automatically queue that bot for verification. We can intelligently use your Friendly Bots to help the rest of Cloudflare’s customers.

Instant feedback
In the past, users have been confused by the verification process. Do I need to provide documentation for my IPs? What about my user agent: can it change over time? If any piece of the validation data was broken, it could take us weeks to identify and fix.

That’s no longer the case. With Friendly Bots, we perform validation almost instantly. So if something isn’t right — perhaps your rDNS validation uses the wrong hostname — you’ll know immediately because the bot won’t be allowlisted. No more waiting to hear from our support team.

Better sourcing
Previously, we required bot operators (e.g., Google) to submit verification data themselves. If there was a bot you wanted to verify, but did not own, you were out of luck.

Friendly Bots eliminates this dependency on bot operators. Anyone who can find identifying information can register a bot on their site.

No arbitration
If a scraper shows up to your site, is that a good thing? To some, yes, because it’s exposure. To others, no, because that scraper may take data. This is a question we’ve carefully considered with every Verified Bots submission to date.

Now: it’s your choice to make. Friendly Bots puts the control in your hands, allowing you to categorize bots at a domain level. We’ll continue to verify bots at a global level (when behavior is objectively good).

Cloudflare Radar

Here’s a fun bonus: in addition to today’s Friendly Bots announcement, we’re also making some changes to Cloudflare Radar.

Beginning immediately, you can see a list of many Verified Bots in Radar. This is exciting; we’ve never published a detailed list like this before.

All data is updated in real time. As we verify new bots, they will appear here in the Radar module.

We’re also beginning to add specific Verified Bots to our Logs product. You’ll see them as Bot Tags, so a request might include the string “pinterest” if it came from Pinterest’s bot.

What’s next?

Our team is excited to launch Friendly Bots soon. We anticipate the impact will radiate throughout Bot Management, reducing false positives, improving crawl-ability, and generally stabilizing sites.

If you have Bot Management and want to give this new feature a try, please tell your account team (and we’ll be sure to include you in the early access period). You can also continue to tell us about bots that should be verified.

How Ephemeral IDs work

Ephemeral IDs in action

What’s next for Turnstile?

AI bot activity today

How we find AI bots pretending to be real web browsers

Background

Residential IP proxies

ML model training

ML features for bot detection

Detecting residential proxies using network and behavioral signals

Detection results and case studies

Improving detection for bots from cloud providers

Check out ML v8

IAB TCF Compatibility

Google Consent Mode v2

Bot Management

Alright, what are we building?

Step 1: Deploy your Pages

Step 2: Embed Turnstile widget

Step 3: Validate the token

Wrapping up

Challenging fetch requests in the Cloudflare WAF

Turnstile Pre-Clearance mode

Implementing Turnstile with Pre-Clearance

Pre-Clearance is available to all Cloudflare customers

Easy on humans, hard on bots, private for everyone

Turnstile is great for fighting fraud

Lessons from the Turnstile beta

Turnstile: GA and Free for Everyone

Definitions

Global traffic insights

Mitigated daily traffic stable at 6%, spikes reach 8%

Shields up: customer configured rules now biggest contributor to mitigated traffic

Application owners are increasingly relying on geo location blocks

Old CVEs are still exploited en masse

Bot traffic insights

On average, more than 10% of non-verified bot traffic is mitigated

API traffic insights

58% of dynamic (non cacheable) traffic is API related

65% of global API traffic is generated by browsers

HTTP Anomalies are the most common attack vector on API endpoints

Looking forward

Engineering for zero allocations

Stack allocation

Choice of algorithms

Testing allocations

Zero allocation decision trees

Optimize for single-document evaluation

Reusing buffers

Summary

Super Bot Fight Mode: now configurable!

What’s next?

Bot Management

Supporting banks, financial institutions, and fintechs

The online fraud experience today

Reduce modern fraud without hurting UX

Stopping fake account creation and adding to Cloudflare’s defense in depth

Making fraud inefficient by expediting detection

Where do I sign up?

What’s changed?

How did we do this?

What’s next?

Update: Mangroves, Climate Change & economic development

More bots stopped, more trees planted!

UX isn’t the only big problem with CAPTCHA — so is privacy

Less data collection, more privacy, same security

We are opening our CAPTCHA replacement to everyone

Swap out your existing CAPTCHA in a few minutes

Deployment options and analytics

Why are we giving this away for free?

Crawler Hints — Cloudflare’s Solution for Websites

IndexNow Protocol — the Search Engine Moves from Pull to Push

How to enable Crawler Hints

What’s Next?

What is Bot Fight Mode?

How many trees? Let’s do the math 🚀

A piece of the puzzle

The long goodbye

Where is Internet Explorer 11 used?

Bot or not