Tag Archives: Bot Management

Monitoring machine learning models for bot detection

Post Syndicated from Daniel Means http://blog.cloudflare.com/author/daniel-means/ original https://blog.cloudflare.com/monitoring-machine-learning-models-for-bot-detection


Cloudflare’s Bot Management is used by organizations around the world to proactively detect and mitigate automated bot traffic. To do this, Cloudflare leverages machine learning models that help predict whether a particular HTTP request is coming from a bot or not, and further distinguishes between benign and malicious bots. Cloudflare serves over 55 million HTTP requests per second — so our machine learning models need to run at Cloudflare scale.

We are constantly making improvements to the models that power Bot Management to ensure they are incorporating the latest threat intelligence. This process of iteration is an important part of ensuring our customers stay a step ahead of malicious actors, and it requires a rigorous process for experimentation, deployment, and ongoing observation.

We recently shared an introduction to Cloudflare’s approach to MLOps, which provides a holistic overview of model training and deployment processes at Cloudflare. In this post, we will dig deeper into monitoring, and how we continuously evaluate the models that power Bot Management.

Why monitoring matters

Before bot detection models are released, we undergo an extensive model testing/validation process to ensure our detections perform as expected. Model performance is validated across a wide number of web traffic segments, by browser, HTTP protocol, and other dimensions to get a fine-grained view into how we expect the model to perform once deployed. If everything checks out, the model is gradually released into production, and we get a level up in our bot detections.

After models are deployed to production, it can be challenging to get visibility into performance on a granular level. Sure, we can look at outcomes-based metrics — like bot score distributions, or challenge solve rates. These are informative, but with any change in bot scoring or challenge solve rates, we’re still left asking, “Which segments of web traffic are most impacted by this change? Was that expected?”.

To train a model for the Internet is to train a model against a moving target. Anyone can train a model on static data and achieve great results — so long as the input does not change. Building a model that generalizes into the future, with new threats, browsers, and bots is a more difficult task. Machine learning monitoring is an important part of the story because it provides confidence that our models continue to generalize, using a rigorous and repeatable process.

In the days before machine learning monitoring, the team would analyze web traffic patterns and model scoring results to track the proportion of web requests scored as bot or human. This high-level metric is helpful for evaluating performance of the model in the aggregate, but didn’t provide granular detail into how the model was behaving with particular types of traffic. For a deeper analysis, we’d be left with the additional work of investigating performance on individual traffic segments like traffic from Chrome browser or clients using iOS.

With machine learning monitoring, we get insights into how the model behaves not just at a high level, but in a much more granular way — without having to do a lot of manual investigation. The monitoring closes the feedback loop by answering the critical question: “How are our bot detection models performing in production?” Monitoring gives us the same level of confidence derived from pre-deployment model validation/testing, except applied to all models in production.

The use cases for which monitoring has proven invaluable include:

  • Investigating bot score anomalies: If a customer reports machine learning scoring false positives/negatives, and we suspect broader issues across a subset of detections, monitoring can help zero-in on the answer. Engineers can find insights from our global monitoring dashboard, or focus on performance for a specific dataset.
  • Monitoring any model predictions or request field: The monitoring service is flexible and can add an observability layer over any request artifact stored in our web requests databases. If model predictions or outcomes of interest are stored with our request logs, then they can be monitored. We can work across engineering teams to enable monitoring for any outcome.
  • Deploying new models: We gradually deploy new model versions, eventually ramping up to running across Cloudflare’s global web traffic. Along the way, we have a series of checks before a new model can be deployed to the next release step. Monitoring allows us to compare the latest model with the previous version against granular traffic segments at each deployment stage — giving us confidence when proceeding forward with the rollout.

How does machine learning monitoring work?

The process begins with a ground-truth dataset — a set of traffic data known to be either human or bot-generated, labeled accordingly and accurately. If our model identifies a particular request as bot traffic, when our ground-truth label indicates it originated from a human, then we know the model has miscategorized the request, and vice versa. This kind of labeled data, where we flag traffic as being from a bot or a human, is what our model is trained on to learn to make detections in the first place.

Datasets gathered at training time allow us to evaluate the performance of a trained model for that snapshot in time. Since we want to continuously evaluate model performance in production, we need to likewise get real-time labeled data to compare against our bot score. We can generate a labeled dataset for this purpose when we’re certain that web requests come from a certain actor. For example, our heuristics engine is one source of high-confidence labeled data. Other sources of reliable, labeled data include customer feedback and attack pattern research.

We can directly compare our model’s bot scores on web requests against recently-labeled datasets to judge model performance. To ensure that we are making an apples-to-apples comparison as we evaluate the model’s score over time, consistency is paramount: the data itself will be different, but we want the methodology, conditions, and filters to remain the same between sampling windows. We have automated this process, allowing us to generate labeled datasets in real-time that give us an up-to-the-minute view of model performance.

Getting granular performance metrics

Let’s say we detect a sudden drop in accuracy on a given dataset labeled as bot traffic, meaning our detection is incorrectly scoring bots as human traffic. We would be keen to determine the exact subset of traffic responsible for the scoring miss. Is it coming from the latest Chrome browser or maybe a certain ASN?

To answer this, performance monitoring uses specializations, which are filters applied against our dataset that focus on a dimension of interest (e.g. browser type, ASN). With specializations on datasets, we get both an expectation on how traffic should have been scored, and insight into the exact dimension causing the miss.

Integrating monitoring into our bots machine learning platform

The monitoring system runs on a unified platform called Endeavor, which we built to handle all aspects of bots-related machine learning, including model training and validation, model interpretability, and delivering the most up-to-date information to our servers running bot detection. We can break down monitoring into a few tasks: rendering monitoring queries to fetch datasets, computing performance metrics, and storing metrics. Endeavour uses Airflow, a workflow execution engine, making it a great place to run our monitoring tasks on top of a kubernetes cluster and GPUs, with access to Postgres and ClickHouse databases.

Rendering monitoring queries

A monitoring query is simply a SQL query to our ClickHouse web request database asking “How does machine learning scoring look right now?”. The query gets more precise when we add in dataset and specialization conditions so that we can ask a more refined question “For this set of known (non-)automated traffic, how does machine learning scoring look along these dimensions of interest?”.

In our system, datasets for training and validation are determined using SQL queries, which are tailored to capture segments of request traffic, such as traffic flagged as bots by our heuristics engine. For model monitoring, we adapt these queries to measure performance metrics like accuracy and continuously update the time range to measure the latest model performance. For each dataset used in training and validation, we can generate a monitoring query that produces real-time insight into model performance.

Computing performance metrics

With a rendered monitoring query ready, we can go ahead and fetch bot score distributions from our web request database. The MetricsComputer takes in the bot score distributions as input and produces relevant performance metrics, like accuracy, over a configurable time interval.

We can evaluate model performance along any metric of interest. The MetricInterface is a Python interface that acts as a blueprint for performance metrics. Any newly added metric would only need to implement the interface’s compute_metric method, which defines how the MetricsComputer should perform the calculation.

Storing metrics

After each monitoring run, we store performance metrics by dataset, model version, and specialization value in the ml_performance ClickHouse table. Precomputing metrics enables long data retention periods, so we can review model performance by model versions or dimensions of interest over time. Importantly, newly added performance metrics can be backfilled as needed since the ml_performance table also stores the score distributions used to compute each metric.

Running tasks on GPUs

Metrics computation is load balanced across endeavour-worker instances running across GPUs. From a system perspective, the airflow-scheduler adds a monitoring task to a Redis Queue and Airflow Celery workers running on each GPU will pull tasks off the queue for processing. We benefit from having a production service constantly powered by GPUs, as opposed to only running ad hoc model training workloads. As a result, the monitoring service acts as a health-check that ensures various Endeavour components are functioning properly on GPUs. This helps ensure the GPUs are always updated and ready to run model training/validation tasks.

Machine learning monitoring in action

To better illustrate how Cloudflare uses machine learning monitoring, let’s explore some recent examples.

Improving accuracy of machine learning bot detection

When the monitoring system was first deployed, we quickly found an anomaly: our model wasn’t performing well on web traffic using HTTP/3. At the time, HTTP/3 usage was hardly seen across the web, and the primary model in production wasn’t trained on HTTP/3 traffic, leading to inaccurate bot scores. Fortunately, another bot detection layer, our heuristics engine, was still accurately finding bots using HTTP/3 — so our customers were still covered.

Still, this finding pointed to a key area of improvement for the next model iteration. And we did improve: the next model iteration was consistently able to distinguish between bot and human initiated HTTP/3 web requests with over 3.5x higher accuracy compared to the prior model version. As we enable more datasets and specializations, we can uncover specific browsers, OSs and other dimensions where performance can be improved during model training.

Early detection, quick intervention

Deploying machine learning at a global scale, running in data centers spread over 100 countries around the world, is challenging. Things don’t always go to plan.

A couple of years ago, we deployed an update to our machine learning powered bot detections, and it led to an increase in false positive bot detections — we were incorrectly flagging some legitimate traffic as bot traffic. Our monitoring system quickly showed a drop in performance on residential ASNs where we expect mostly non-automated traffic.

In the graph above, deployments are shown to three colo “tiers”, 1-3. Since software deployments start on tier 3 colocation centers and gradually move up to tier 1, the impact followed the same pattern.

At the same time, a software release was being deployed to our global network, but we didn’t know if it was the cause of the performance drop. We do staged deployments, updating the software in one batch of datacenters at a time before reaching global traffic. Our monitoring dashboards showed a drop in performance that followed this exact deployment pattern, and the release was starting to reach our biggest datacenters.

Monitoring dashboards clearly showed the pattern followed a software update. We reverted the change before the update made it to most of our datacenters and restored normal machine learning bot detection performance. Monitoring allows us to catch performance anomalies, dig into the root cause, and take action — fast.

Model deployment monitoring for all

We’ve seen a lot of value in being able to monitor and control our models and deployments, and realized that other people must be running into the same challenges as well. Over the next few months, we’ll be building out more advanced features for AI Gateway – our proxy that helps people observe and control their AI applications and models better. With AI Gateway, we can do all the same deployments, monitoring, and optimization strategies we have been doing for our Bot detection models in one unified control plane. We’re excited to use these new features internally, but even more excited to release these features to the public, so that anyone is able to deploy, test, monitor and improve the way they use AI or machine learning models.

Next up

Today, machine learning monitoring helps us investigate performance issues and monitor performance as we roll out new models — and we’re just getting started!

This year, we’re accelerating our machine learning model iterations for bot detection to deliver improved detections faster than ever. Monitoring will be key for enabling fast and safe deployments. We’re excited to add alerting based on model performance – so that we’re automatically notified should model performance ever drift outside our expected bounds.

Alongside our Workers AI launch, we recently deployed GPUs in 100+ cities, leveling up our compute resources at a global scale. This new infrastructure will unlock our model iteration process, allowing us to explore new, cutting-edge models with even more powerful bot detection capabilities. Running models on our GPUs will bring inference closer to users for better model performance and latency, and we’re excited to leverage our new GPU compute with our bot detection models as well.

Every request, every microsecond: scalable machine learning at Cloudflare

Post Syndicated from Alex Bocharov original http://blog.cloudflare.com/scalable-machine-learning-at-cloudflare/

Every request, every microsecond: scalable machine learning at Cloudflare

Every request, every microsecond: scalable machine learning at Cloudflare

In this post, we will take you through the advancements we've made in our machine learning capabilities. We'll describe the technical strategies that have enabled us to expand the number of machine learning features and models, all while substantially reducing the processing time for each HTTP request on our network. Let's begin.

Background

For a comprehensive understanding of our evolved approach, it's important to grasp the context within which our machine learning detections operate. Cloudflare, on average, serves over 46 million HTTP requests per second, surging to more than 63 million requests per second during peak times.

Machine learning detection plays a crucial role in ensuring the security and integrity of this vast network. In fact, it classifies the largest volume of requests among all our detection mechanisms, providing the final Bot Score decision for over 72% of all HTTP requests. Going beyond, we run several machine learning models in shadow mode for every HTTP request.

At the heart of our machine learning infrastructure lies our reliable ally, CatBoost. It enables ultra low-latency model inference and ensures high-quality predictions to detect novel threats such as stopping bots targeting our customers' mobile apps. However, it's worth noting that machine learning model inference is just one component of the overall latency equation. Other critical components include machine learning feature extraction and preparation. In our quest for optimal performance, we've continuously optimized each aspect contributing to the overall latency of our system.

Initially, our machine learning models relied on single-request features, such as presence or value of certain headers. However, given the ease of spoofing these attributes, we evolved our approach. We turned to inter-request features that leverage aggregated information across multiple dimensions of a request in a sliding time window. For example, we now consider factors like the number of unique user agents associated with certain request attributes.

The extraction and preparation of inter-request features were handled by Gagarin, a Go-based feature serving platform we developed. As a request arrived at Cloudflare, we extracted dimension keys from the request attributes. We then looked up the corresponding machine learning features in the multi-layered cache. If the desired machine learning features were not found in the cache, a memcached "get" request was made to Gagarin to fetch those. Then machine learning features were plugged into CatBoost models to produce detections, which were then surfaced to the customers via Firewall and Workers fields and internally through our logging pipeline to ClickHouse. This allowed our data scientists to run further experiments, producing more features and models.

Every request, every microsecond: scalable machine learning at Cloudflare
Previous system design for serving machine learning features over Unix socket using Gagarin.

Initially, Gagarin exhibited decent latency, with a median latency around 200 microseconds to serve all machine learning features for given keys. However, as our system evolved and we introduced more features and dimension keys, coupled with increased traffic, the cache hit ratio began to wane. The median latency had increased to 500 microseconds and during peak times, the latency worsened significantly, with the p99 latency soaring to roughly 10 milliseconds. Gagarin underwent extensive low-level tuning, optimization, profiling, and benchmarking. Despite these efforts, we encountered the limits of inter-process communication (IPC) using Unix Domain Socket (UDS), among other challenges, explored below.

Problem definition

In summary, the previous solution had its drawbacks, including:

  • High tail latency: during the peak time, a portion of requests experienced increased  latency caused by CPU contention on the Unix socket and Lua garbage collector.
  • Suboptimal resource utilization: CPU and RAM utilization was not optimized to the full potential, leaving less resources for other services running on the server.
  • Machine learning features availability: decreased due to memcached timeouts, which resulted in a higher likelihood of false positives or false negatives for a subset of the requests.
  • Scalability constraints: as we added more machine learning features, we approached the scalability limit of our infrastructure.

Equipped with a comprehensive understanding of the challenges and armed with quantifiable metrics, we ventured into the next phase: seeking a more efficient way to fetch and serve machine learning features.

Exploring solutions

In our quest for more efficient methods of fetching and serving machine learning features, we evaluated several alternatives. The key approaches included:

Further optimizing Gagarin: as we pushed our Go-based memcached server to its limits, we encountered a lower bound on latency reductions. This arose from IPC over UDS synchronization overhead and multiple data copies, the serialization/deserialization overheads, as well as the inherent latency of garbage collector and the performance of hashmap lookups in Go.

Considering Quicksilver: we contemplated using Quicksilver, but the volume and update frequency of machine learning features posed capacity concerns and potential negative impacts on other use cases. Moreover, it uses a Unix socket with the memcached protocol, reproducing the same limitations previously encountered.

Increasing multi-layered cache size: we investigated expanding cache size to accommodate tens of millions of dimension keys. However, the associated memory consumption, due to duplication of these keys and their machine learning features across worker threads, rendered this approach untenable.

Sharding the Unix socket: we considered sharding the Unix socket to alleviate contention and improve performance. Despite showing potential, this approach only partially solved the problem and introduced more system complexity.

Switching to RPC: we explored the option of using RPC for communication between our front line server and Gagarin. However, since RPC still requires some form of communication bus (such as TCP, UDP, or UDS), it would not significantly change the performance compared to the memcached protocol over UDS, which was already simple and minimalistic.

After considering these approaches, we shifted our focus towards investigating alternative Inter-Process Communication (IPC) mechanisms.

IPC mechanisms

Adopting a first principles design approach, we questioned: "What is the most efficient low-level method for data transfer between two processes provided by the operating system?" Our goal was to find a solution that would enable the direct serving of machine learning features from memory for corresponding HTTP requests. By eliminating the need to traverse the Unix socket, we aimed to reduce CPU contention, improve latency, and minimize data copying.

To identify the most efficient IPC mechanism, we evaluated various options available within the Linux ecosystem. We used ipc-bench, an open-source benchmarking tool specifically designed for this purpose, to measure the latencies of different IPC methods in our test environment. The measurements were based on sending one million 1,024-byte messages forth and back (i.e., ping pong) between two processes.

IPC method Avg duration, μs Avg throughput, msg/s
eventfd (bi-directional) 9.456 105,533
TCP sockets 8.74 114,143
Unix domain sockets 5.609 177,573
FIFOs (named pipes) 5.432 183,388
Pipe 4.733 210,369
Message Queue 4.396 226,421
Unix Signals 2.45 404,844
Shared Memory 0.598 1,616,014
Memory-Mapped Files 0.503 1,908,613

Based on our evaluation, we found that Unix sockets, while taking care of synchronization, were not the fastest IPC method available. The two fastest IPC mechanisms were shared memory and memory-mapped files. Both approaches offered similar performance, with the former using a specific tmpfs volume in /dev/shm and dedicated system calls, while the latter could be stored in any volume, including tmpfs or HDD/SDD.

Missing ingredients

In light of these findings, we decided to employ memory-mapped files as the IPC mechanism for serving machine learning features. This choice promised reduced latency, decreased CPU contention, and minimal data copying. However, it did not inherently offer data synchronization capabilities like Unix sockets. Unlike Unix sockets, memory-mapped files are simply files in a Linux volume that can be mapped into memory of the process. This sparked several critical questions:

  1. How could we efficiently fetch an array of hundreds of float features for given dimension keys when dealing with a file?
  2. How could we ensure safe, concurrent and frequent updates for tens of millions of keys?
  3. How could we avert the CPU contention previously encountered with Unix sockets?
  4. How could we effectively support the addition of more dimensions and features in the future?

To address these challenges we needed to further evolve this new approach by adding a few key ingredients to the recipe.

Augmenting the Idea

To realize our vision of memory-mapped files as a method for serving machine learning features, we needed to employ several key strategies, touching upon aspects like data synchronization, data structure, and deserialization.

Wait-free synchronization

When dealing with concurrent data, ensuring safe, concurrent, and frequent updates is paramount. Traditional locks are often not the most efficient solution, especially when dealing with high concurrency environments. Here's a rundown on three different synchronization techniques:

With-lock synchronization: a common approach using mechanisms like mutexes or spinlocks. It ensures only one thread can access the resource at a given time, but can suffer from contention, blocking, and priority inversion, just as evident with Unix sockets.

Lock-free synchronization: this non-blocking approach employs atomic operations to ensure at least one thread always progresses. It eliminates traditional locks but requires careful handling of edge cases and race conditions.

Wait-free synchronization: a more advanced technique that guarantees every thread makes progress and completes its operation without being blocked by other threads. It provides stronger progress guarantees compared to lock-free synchronization, ensuring that each thread completes its operation within a finite number of steps.

Disjoint Access Parallelism Starvation Freedom Finite Execution Time
With lock Every request, every microsecond: scalable machine learning at Cloudflare Every request, every microsecond: scalable machine learning at Cloudflare Every request, every microsecond: scalable machine learning at Cloudflare
Lock-free Every request, every microsecond: scalable machine learning at Cloudflare Every request, every microsecond: scalable machine learning at Cloudflare Every request, every microsecond: scalable machine learning at Cloudflare
Wait-free Every request, every microsecond: scalable machine learning at Cloudflare Every request, every microsecond: scalable machine learning at Cloudflare Every request, every microsecond: scalable machine learning at Cloudflare

Our wait-free data access pattern draws inspiration from Linux kernel's Read-Copy-Update (RCU) pattern and the Left-Right concurrency control technique. In our solution, we maintain two copies of the data in separate memory-mapped files. Write access to this data is managed by a single writer, with multiple readers able to access the data concurrently.

We store the synchronization state, which coordinates access to these data copies, in a third memory-mapped file, referred to as "state". This file contains an atomic 64-bit integer, which represents an InstanceVersion and a pair of additional atomic 32-bit variables, tracking the number of active readers for each data copy. The InstanceVersion consists of the currently active data file index (1 bit), the data size (39 bits, accommodating data sizes up to 549 GB), and a data checksum (24 bits).

Zero-copy deserialization

To efficiently store and fetch machine learning features, we needed to address the challenge of deserialization latency. Here, zero-copy deserialization provides an answer. This technique reduces the time and memory required to access and use data by directly referencing bytes in the serialized form.

We turned to rkyv, a zero-copy deserialization framework in Rust, to help us with this task. rkyv implements total zero-copy deserialization, meaning no data is copied during deserialization and no work is done to deserialize data. It achieves this by structuring its encoded representation to match the in-memory representation of the source type.

One of the key features of rkyv that our solution relies on is its ability to access HashMap data structures in a zero-copy fashion. This is a unique capability among Rust serialization libraries and one of the main reasons we chose rkyv for our implementation. It also has a vibrant Discord community, eager to offer best-practice advice and accommodate feature requests.

Every request, every microsecond: scalable machine learning at Cloudflare
Feature comparison: rkyv vs FlatBuffers and Cap'n Proto

Enter mmap-sync crate

Leveraging the benefits of memory-mapped files, wait-free synchronization and zero-copy deserialization, we've crafted a unique and powerful tool for managing high-performance, concurrent data access between processes. We've packaged these concepts into a Rust crate named mmap-sync, which we're thrilled to open-source for the wider community.

At the core of the mmap-sync package is a structure named Synchronizer. It offers an avenue to read and write any data expressible as a Rust struct. Users simply have to implement or derive a specific Rust trait surrounding struct definition – a task requiring just a single line of code. The Synchronizer presents an elegantly simple interface, equipped with "write" and "read" methods.

impl Synchronizer {
    /// Write a given `entity` into the next available memory mapped file.
    pub fn write<T>(&mut self, entity: &T, grace_duration: Duration) -> Result<(usize, bool), SynchronizerError> {
        …
    }

    /// Reads and returns `entity` struct from mapped memory wrapped in `ReadResult`
    pub fn read<T>(&mut self) -> Result<ReadResult<T>, SynchronizerError> {
        …
    }
}

/// FeaturesMetadata stores features along with their metadata
#[derive(Archive, Deserialize, Serialize, Debug, PartialEq)]
#[archive_attr(derive(CheckBytes))]
pub struct FeaturesMetadata {
    /// Features version
    pub version: u32,
    /// Features creation Unix timestamp
    pub created_at: u32,
    /// Features represented by vector of hash maps
    pub features: Vec<HashMap<u64, Vec<f32>>>,
}

A read operation through the Synchronizer performs zero-copy deserialization and returns a "guarded" Result encapsulating a reference to the Rust struct using RAII design pattern. This operation also increments the atomic counter of active readers using the struct. Once the Result is out of scope, the Synchronizer decrements the number of readers.

The synchronization mechanism used in mmap-sync is not only "lock-free" but also "wait-free". This ensures an upper bound on the number of steps an operation will take before it completes, thus providing a performance guarantee.

The data is stored in shared mapped memory, which allows the Synchronizer to “write” to it and “read” from it concurrently. This design makes mmap-sync a highly efficient and flexible tool for managing shared, concurrent data access.

Now, with an understanding of the underlying mechanics of mmap-sync, let's explore how this package plays a key role in the broader context of our Bot Management platform, particularly within the newly developed components: the bliss service and library.

System design overhaul

Transitioning from a Lua-based module that made memcached requests over Unix socket to Gagarin in Go to fetch machine learning features, our new design represents a significant evolution. This change pivots around the introduction of mmap-sync, our newly developed Rust package, laying the groundwork for a substantial performance upgrade. This development led to a comprehensive system redesign and introduced two new components that form the backbone of our Bots Liquidation Intelligent Security System – or BLISS, in short: the bliss service and the bliss library.

Every request, every microsecond: scalable machine learning at Cloudflare

Bliss service

The bliss service operates as a Rust-based, multi-threaded sidecar daemon. It has been designed for optimal batch processing of vast data quantities and extensive I/O operations. Among its key functions, it fetches, parses, and stores machine learning features and dimensions for effortless data access and manipulation. This has been made possible through the incorporation of the Tokio event-driven platform, which allows for efficient, non-blocking I/O operations.

Bliss library

Operating as a single-threaded dynamic library, the bliss library seamlessly integrates into each worker thread using the Foreign Function Interface (FFI) via a Lua module. Optimized for minimal resource usage and ultra-low latency, this lightweight library performs tasks without the need for heavy I/O operations. It efficiently serves machine learning features and generates corresponding detections.

In addition to leveraging the mmap-sync package for efficient machine learning feature access, our new design includes several other performance enhancements:

  • Allocations-free operation: bliss library re-uses pre-allocated data structures and performs no heap allocations, only low-cost stack allocations. To enforce our zero-allocation policy, we run integration tests using the dhat heap profiler.
  • SIMD optimizations: wherever possible, the bliss library employs vectorized CPU instructions. For instance, AVX2 and SSE4 instruction sets are used to expedite hex-decoding of certain request attributes, enhancing speed by tenfold.
  • Compiler tuning: We compile both the bliss service and library with the following flags for superior performance:

[profile.release]
codegen-units = 1
debug = true
lto = "fat"
opt-level = 3

  • Benchmarking & profiling: We use Criterion for benchmarking every major feature or component within bliss. Moreover, we are also able to use the Go pprof profiler on Criterion benchmarks to view flame graphs and more:

cargo bench -p integration -- --verbose --profile-time 100

go tool pprof -http=: ./target/criterion/process_benchmark/process/profile/profile.pb

This comprehensive overhaul of our system has not only streamlined our operations but also has been instrumental in enhancing the overall performance of our Bot Management platform. Stay tuned to witness the remarkable changes brought about by this new architecture in the next section.

Rollout results

Our system redesign has brought some truly "blissful" dividends. Above all, our commitment to a seamless user experience and the trust of our customers have guided our innovations. We ensured that the transition to the new design was seamless, maintaining full backward compatibility, with no customer-reported false positives or negatives encountered. This is a testament to the robustness of the new system.

As the old adage goes, the proof of the pudding is in the eating. This couldn't be truer when examining the dramatic latency improvements achieved by the redesign. Our overall processing latency for HTTP requests at Cloudflare improved by an average of 12.5% compared to the previous system.

This improvement is even more significant in the Bot Management module, where latency improved by an average of 55.93%.

Every request, every microsecond: scalable machine learning at Cloudflare
Bot Management module latency, in microseconds.

More specifically, our machine learning features fetch latency has improved by several orders of magnitude:

Latency metric Before (μs) After (μs) Change
p50 532 9 -98.30% or x59
p99 9510 18 -99.81% or x528
p999 16000 29 -99.82% or x551

To truly grasp this impact, consider this: with Cloudflare’s average rate of 46 million requests per second, a saving of 523 microseconds per request equates to saving over 24,000 days or 65 years of processing time every single day!

In addition to latency improvements, we also reaped other benefits from the rollout:

  • Enhanced feature availability: thanks to eliminating Unix socket timeouts, machine learning feature availability is now a robust 100%, resulting in fewer false positives and negatives in detections.
  • Improved resource utilization: our system overhaul liberated resources equivalent to thousands of CPU cores and hundreds of gigabytes of RAM – a substantial enhancement of our server fleet's efficiency.
  • Code cleanup: another positive spin-off has been in our Lua and Go code. Thousands of lines of less performant and less memory-safe code have been weeded out, reducing technical debt.
  • Upscaled machine learning capabilities: last but certainly not least, we've significantly expanded our machine learning features, dimensions, and models. This upgrade empowers our machine learning inference to handle hundreds of machine learning features and dozens of dimensions and models.

Conclusion

In the wake of our redesign, we've constructed a powerful and efficient system that truly embodies the essence of 'bliss'. Harnessing the advantages of memory-mapped files, wait-free synchronization, allocation-free operations, and zero-copy deserialization, we've established a robust infrastructure that maintains peak performance while achieving remarkable reductions in latency. As we navigate towards the future, we're committed to leveraging this platform to further improve our Security machine learning products and cultivate innovative features. Additionally, we're excited to share parts of this technology through an open-sourced Rust package mmap-sync.

As we leap into the future, we are building upon our platform's impressive capabilities, exploring new avenues to amplify the power of machine learning. We are deploying a new machine learning model built on BLISS with select customers. If you are a Bot Management subscriber and want to test the new model, please reach out to your account team.

Separately, we are on the lookout for more Cloudflare customers who want to run their own machine learning models at the edge today. If you’re a developer considering making the switch to Workers for your application, sign up for our Constellation AI closed beta. If you’re a Bot Management customer and looking to run an already trained, lightweight model at the edge, we would love to hear from you. Let's embark on this path to bliss together.

How Cloudflare runs machine learning inference in microseconds

Post Syndicated from Austin Hartzheim original http://blog.cloudflare.com/how-cloudflare-runs-ml-inference-in-microseconds/

How Cloudflare runs machine learning inference in microseconds

How Cloudflare runs machine learning inference in microseconds

Cloudflare executes an array of security checks on servers spread across our global network. These checks are designed to block attacks and prevent malicious or unwanted traffic from reaching our customers’ servers. But every check carries a cost – some amount of computation, and therefore some amount of time must be spent evaluating every request we process. As we deploy new protections, the amount of time spent executing security checks increases.

Latency is a key metric on which CDNs are evaluated. Just as we optimize network latency by provisioning servers in close proximity to end users, we also optimize processing latency – which is the time spent processing a request before serving a response from cache or passing the request forward to the customers’ servers. Due to the scale of our network and the diversity of use-cases we serve, our edge software is subject to demanding specifications, both in terms of throughput and latency.

Cloudflare's bot management module is one suite of security checks which executes during the hot path of request processing. This module calculates a variety of bot signals and integrates directly with our front line servers, allowing us to customize behavior based on those signals. This module evaluates every request for heuristics and behaviors indicative of bot traffic, and scores every request with several machine learning models.

To reduce processing latency, we've undertaken a project to rewrite our bot management technology, porting it from Lua to Rust, and applying a number of performance optimizations. This post focuses on optimizations applied to the machine-learning detections within the bot management module, which account for approximately 15% of the latency added by bot detection. By switching away from a garbage collected language, removing memory allocations, and optimizing our parsers, we reduce the P50 latency of the bot management module by 79μs – a 20% reduction.

Engineering for zero allocations

Writing software without memory allocation poses several challenges. Indeed, high-level programming languages often trade memory management for productivity, abstracting away the details of memory management. But, in those details, are a number of algorithms to find contiguous regions of free memory, handle fragmentation, and call into the kernel to request new memory pages. Garbage collected languages incur additional costs throughout program execution to track when memory can be freed, plus pauses in program execution while the garbage collector executes. But, when performance is a critical requirement, languages should be evaluated for their ability to meet performance constraints.

Stack allocation

One of the simplest ways to reduce memory allocations is to work with fixed-size buffers. Fixed-sized buffers can be placed on the stack, which eliminates the need to invoke heap allocation logic; the compiler simply reserves space in the current stack frame to hold local variables. Alternatively, the buffers can be heap-allocated outside the hot path (for example, at application startup), incurring a one-time cost.

Arrays can be stack allocated:

let mut buf = [0u8; BUFFER_SIZE];

Vectors can be created outside the hot path:

let mut buf = Vec::with_capacity(BUFFER_SIZE);

To demonstrate the performance difference, let's compare two implementations of a case-insensitive string equality check. The first will allocate a buffer for each invocation to store the lower-case version of the string. The second will use a buffer that has already been allocated for this purpose.

Allocate a new buffer for each iteration:

fn case_insensitive_equality_buffer_with_capacity(s: &str, pat: &str) -> bool {
	let mut buf = String::with_capacity(s.len());
	buf.extend(s.chars().map(|c| c.to_ascii_lowercase()));
	buf == pat
}

Re-use a buffer for each iteration, avoiding allocations:

fn case_insensitive_equality_buffer_with_capacity(s: &str, pat: &str, buf: &mut String) -> bool {
	buf.clear();
	buf.extend(s.chars().map(|c| c.to_ascii_lowercase()));
	buf == pat
}

Benchmarking the two code snippets, the first executes in ~40ns per iteration, the second in ~25ns. Changing only the memory allocation behavior, the second implementation is ~38% faster.

Choice of algorithms

Another strategy to reduce the number of memory allocations is to choose algorithms that operate on the data in-place and store any necessary state on the stack.

Returning to our string comparison function from above, let's rewrite it operate completely on the stack, and without copying data into a separate buffer:

fn case_insensitive_equality_buffer_iter(s: &str, pat: &str) -> bool {
	s.chars().map(|c| c.to_ascii_lowercase()).eq(pat.chars())
}

In addition to being the shortest, this function is also the fastest. This function benchmarks at ~13ns/iter, which is just slightly slower than the 11ns used to execute eq_ignore_ascii_case from the standard library. And the standard library implementation similarly avoids buffer allocation through use of iterators.

Testing allocations

Automated testing of memory allocation on the critical path prevents accidental use of functions or libraries which allocate memory. dhat is a crate in the Rust ecosystem that supports such testing. By setting a new global allocator, dhat is able to count the number of allocations, as well as the number of bytes allocated on a given code path.

/// Assert that the hot path logic performs zero allocations.
#[test]
fn zero_allocations() {
	let _profiler = dhat::Profiler::builder().testing().build();

	// Execute hot-path logic here.

	// Assert that no allocations occurred.
	dhat::assert_eq!(stats.total_blocks, 0);
	dhat::assert_eq!(stats.total_bytes, 0);
}

It is important to note, dhat does have the limitation that it only detects allocations in Rust code. External libraries can still allocate memory without using the Rust allocator. FFI calls, such as those made to C, are one such place where memory allocations may slip past dhat's measurements.

Zero allocation decision trees

CatBoost is an open-source machine learning library used within the bot management module. The core logic of CatBoost is implemented in C++, and the library exposes bindings to a number of other languages – such as C and Rust. The Lua-based implementation of the bot management module relied on FFI calls to the C API to execute our models.

By removing memory allocations and implementing buffer re-use, we optimize the execution duration of the sample model included in the CatBoost repository by 10%. Our production models see gains up to 15%.

Optimize for single-document evaluation

By optimizing CatBoost to evaluate a single set of features at a time, we reduce memory allocations and reduce latency. The CatBoost API has several functions which are optimized for evaluating multiple documents at a time, but this API does not benefit our application where requests are evaluated in the order they are received, and delaying processing to coalesce batches is undesirable. To support evaluation of a variable number of documents, the CatBoost implementation allocates vectors and iterates over the input documents, writing them into the vectors.

TVector<TConstArrayRef<float>> floatFeaturesVec(docCount);
TVector<TConstArrayRef<int>> catFeaturesVec(docCount);
for (size_t i = 0; i < docCount; ++i) {
    if (floatFeaturesSize > 0) {
        floatFeaturesVec[i] = TConstArrayRef<float>(floatFeatures[i], floatFeaturesSize);
    }
    if (catFeaturesSize > 0) {
        catFeaturesVec[i] = TConstArrayRef<int>(catFeatures[i], catFeaturesSize);
    }
}
FULL_MODEL_PTR(modelHandle)->Calc(floatFeaturesVec, catFeaturesVec, TArrayRef<double>(result, resultSize));

To evaluate a single document, however, CatBoost only needs access to a reference to contiguous memory holding feature values. The above code can be replaced with the following:

TConstArrayRef<float> floatFeaturesArray = TConstArrayRef<float>(floatFeatures, floatFeaturesSize);
TConstArrayRef<int> catFeaturesArray = TConstArrayRef<int>(catFeatures, catFeaturesSize);
FULL_MODEL_PTR(modelHandle)->Calc(floatFeaturesArray, catFeaturesArray, TArrayRef<double>(result, resultSize));

Similar to the C++ implementation, the CatBoost Rust bindings also allocate vectors to support multi-document evaluation. For example, the bindings iterate over a vector of vectors, mapping it to a newly allocated vector of pointers:

let mut float_features_ptr = float_features
   .iter()
   .map(|x| x.as_ptr())
   .collect::<Vec<_>>();

But in the single-document case, we don't need the outer vector at all. We can simply pass the inner pointer value directly:

let float_features_ptr = float_features.as_ptr();

Reusing buffers

The previous API in the Rust bindings accepted owned Vecs as input. By taking ownership of a heap-allocated data structure, the function also takes responsibility for freeing the memory at the conclusion of its execution. This is undesirable as it forecloses the possibility of buffer reuse. Additionally, categorical features are passed as owned Strings, which prevents us from passing references to bytes in the original request. Instead, we must allocate a temporary String on the heap and copy bytes into it.

pub fn calc_model_prediction(
	&self,
	float_features: Vec<Vec<f32>>,
	cat_features: Vec<Vec<String>>,
) -> CatBoostResult<Vec<f64>> { ... }

Let's rewrite this function to take &[f32] and &[&str]:

pub fn calc_model_prediction_single(
	&self,
	float_features: &[f32],
	cat_features: &[&str],
) -> CatBoostResult<f64> { ... }

But, we also execute several models per request, and those models may use the same categorical features. Instead of calculating the hash for each separate model we execute, let's compute the hashes first and then pass them to each model that requires them:

pub fn calc_model_prediction_single_with_hashed_cat_features(
	&self,
	float_features: &[f32],
	hashed_cat_features: &[i32],
) -> CatBoostResult<f64> { ... }

Summary

By optimizing away unnecessary memory allocations in the bot management module, we reduced P50 latency from 388us to 309us (20%), and reduced P99 latency from 940us to 813us (14%). And, in many cases, the optimized code is shorter and easier to read than the unoptimized implementation.

These optimizations were targeted at model execution in the bot management module. To learn more about how we are porting bot management from Lua to Rust, check out this blog post.

How Cloudflare and IBM partner to help build a better Internet

Post Syndicated from David McClure original https://blog.cloudflare.com/ibm-keyless-bots/

How Cloudflare and IBM partner to help build a better Internet

How Cloudflare and IBM partner to help build a better Internet

In this blog post, we wanted to highlight some ways that Cloudflare and IBM Cloud work together to help drive product innovation and deliver services that address the needs of our mutual customers. On our blog, we often discuss exciting new product developments and how we are solving real-world problems in our effort to make the internet better and many of our customers and partners play an important role.

IBM Cloud and Cloudflare have been working together since 2018 to integrate Cloudflare application security and performance products natively into IBM Cloud. IBM Cloud Internet Services (CIS) has customers across a wide range of industry verticals and geographic regions but they also have several specialist groups building unique service offerings.

The IBM Cloud team specializes in serving clients in highly regulated industries, aiming to ensure their resiliency, performance, security and compliance needs are met. One group that we’ve been working with recently is IBM Cloud for Financial Services. This group extends the capabilities of IBM Cloud to help serve the complex security and compliance needs of banks, financial institutions and fintech companies.

Bot Management

As malicious bot attacks get more sophisticated and manual mitigations become more onerous, a dynamic and adaptive solution is required for enterprises running Internet facing workloads. With Cloudflare Bot Management on IBM Cloud Internet Services, we aim to help IBM clients protect their Internet properties from targeted application abuse such as account takeover attacks, inventory hoarding, carding abuse and more. Bot Management will be available in the second quarter of 2023.

Threat actors specifically target financial services entities with Account Takeover Attacks, and this is where Cloudflare can help. As much as 71% of login requests we see come from bots (Source: Cloudflare Data) Cloudflare’s Bot Management is powered by a global machine learning model that analyses an average of 45 million HTTP requests a second to track botnets across our network. Cloudflare’s Bot Management solution has the potential to benefit all IBM CIS customers.

Supporting banks, financial institutions, and fintechs

IBM Cloud has been a leader when it comes to providing solutions for the financial services industry and has developed several key management solutions that are designed so clients only need to store their private keys in custom built devices.

The IBM CIS team wants to incorporate the right mix of security and performance, which necessitates the use of cloud-based DDoS, WAF, and Bot Management. Specifically, they wanted to incorporate the powerful security tools that were offered through IBM’s Enterprise-level Cloud Internet Services offerings. When using a cloud solution, it is necessary to proxy traffic which can create a potential challenge when it comes to managing private keys. While Cloudflare adopts strict controls to protect these keys, organizations in highly regulated industries may have security policies and compliance requirements that prevent them from sharing these private keys.

Enter Cloudflare’s Keyless SSL solution.

Cloudflare built Keyless SSL to allow customers to have total control over exactly where private keys are stored. With Keyless SSL and IBM’s key storage solutions, we aim to help enterprises benefit from the robust application protections available through Cloudflare’s WAF, including Cloudflare Bot Management, while still retaining control of their private keys.

“We aim to ensure our clients meet their resiliency, performance, security and compliance needs. The introduction of Keyless SSL and Bot Management security capabilities can further our collaborative accomplishments with Cloudflare and help enterprises, including those in regulated industries, to leverage cloud-native security and adaptive threat mitigation tools.”
Zane Adam, Vice President, IBM Cloud.

“Through our collaboration with IBM Cloud Internet Services, we get to draw on the knowledge and experience of IBM teams, such as the IBM Cloud for Financial Services team, and combine it with our incredible ability to innovate, resulting in exciting new product and service offerings.”
David McClure, Global Alliance Manager, Strategic Partnerships

If you want to learn more about how IBM leverages Cloudflare to protect their customers, visit: https://www.ibm.com/cloud/cloudflare

IBM experts are here to help you if you have any additional questions.

Announcing Cloudflare Fraud Detection

Post Syndicated from Adam Martinetti original https://blog.cloudflare.com/cloudflare-fraud-detection/

Announcing Cloudflare Fraud Detection

Announcing Cloudflare Fraud Detection

The world changed when the COVID-19 pandemic began. Everything moved online to a much greater degree: school, work, and, surprisingly, fraud. Although some degree of online fraud has existed for decades, the Federal Trade Commission reported consumers lost almost $8.8 billion in fraud in 2022 (an over 400% increase since 2019) and the continuation of a disturbing trend. People continue to spend more time alone than ever before, and that time alone makes them not just more targeted, but also more vulnerable to fraud. Companies are falling victim to these trends just as much as individuals: according to PWC’s Global Economic Crime and Fraud Survey, more than half of companies with at least $10 billion in revenue experienced some sort of digital fraud.

This is a familiar story in the world of bot attacks. Cloudflare Bot Management helps customers identify the automated tools behind online fraud, but it’s important to note that not all fraud is committed by bots. If the target is valuable enough, bad actors will contract out the exploitation of online applications to real people. Security teams need to look at more than just bots to better secure online applications and tackle modern, online fraud.

Today, we’re excited to announce Cloudflare Fraud Detection. Fraud Detection will give you precise, easy to use tools that can be deployed in seconds to any website on the Cloudflare network to help detect and categorize fraud. For every type of fraud we detect on your website, you will be able to choose the behavior that makes the most sense to you. While some customers will want to block fraudulent traffic at our edge, other customers may want to pass this information in headers to build integrations with their own app, or use our Cloudflare Workers platform to direct high risk users to load an alternate online experience with fewer capabilities.

The online fraud experience today

When we talk to organizations impacted by sophisticated, online fraud, the first thing we hear from frustrated security teams is that they know what they could do to stop fraud in a vacuum: they’ve proposed requiring email verification on signup, enforcing two-factor authentication for all logins, or blocking online purchases from anonymizing VPNs or countries they repeatedly see a disproportionately high number of charge-backs from. While all of these measures would undoubtedly reduce fraud, they would also make the user experience worse. The fear for every company is that a bad UX will mean slower adoption and less revenue, and that’s too steep a price to pay for most run-of-the-mill online fraud.

For those who’ve chosen to preserve that frictionless user experience and bear the cost of fraud, we see two big impacts: higher infrastructure costs and less efficient employees. Bad actors that abuse account creation endpoints or service availability endpoints often do so with floods of highly distributed HTTP requests, quickly moving through residential proxies to pass under IP based rate limiting rules. Without a way to identify fraudulent traffic with certainty, companies are forced to scale up their infrastructure to be able to serve new peaks in request traffic, even when they know the majority of this traffic is illegitimate. Engineering and Trust and Safety Teams suddenly have a whole new set of responsibilities: regularly banning IP addresses that will probably never be used again, routinely purging fraudulent data from over capacity databases, and even sometimes becoming de-facto fraud investigators. As a result, the organization incurs greater costs without any greater value to their customers.

Reduce modern fraud without hurting UX

Organizations have told us loud and clear that an effective fraud management solution needs to reliably stop bad actors before they can create fraudulent accounts, use stolen credit cards, or steal customer data all the while ensuring a frictionless user experience for real users. We are building novel and highly accurate detections, solving for the four common fraud types we hear the most demand for from businesses around the world:

  • Fake Account Creation: Bad actors signing up for many different accounts to gain access to promotional rewards, or more resources than a single user should have access to.
  • Account Takeover: Gaining unauthorized access to legitimate accounts, by means such as using stolen username and password combinations from other websites, guessing weak passwords, or abusing account recovery mechanisms.
  • Card Testing and Fraudulent Transactions: Testing the validity of stolen credit card details or using those same details to purchase goods or services.
  • Expediting: Obtaining limited availability goods or services by circumventing the normal user flow to complete orders more quickly than should be possible.

In order to trust your fraud management solution, organizations have to understand the decisions or predictions behind the detection of fraud. This is referred to as explainability. For example, it’s not enough to know a signup attempt was flagged as fraud. You need to know, for example, if a signup is fraudulent, exactly what field supplied by the user led us to think this was an issue, why it was an issue, and if it was part of a larger pattern. We will pass along this level of detail when we detect fraud so you can ensure we are only keeping the bad actors out.

Every business that deals with modern, online fraud has a different idea of what risks are acceptable, and a different preference for dealing with fraud once it’s been identified. To give customers maximum flexibility, we’re building Cloudflare’s fraud detection signals to be used individually, or combined with other Cloudflare security products in whichever way best fits each customer’s risk profile and use case, all while using the familiar Cloudflare Firewall Rules interface. Templated rules and suggestions will be available to provide guidance and help customers become familiar with the new features, but each customer will have the option of fully customizing how they want to protect each internet application. Customers can either block, rate-limit, or challenge requests at the edge, or send those signals upstream in request headers, to trigger custom in-application behavior.

Cloudflare provides application performance and security services to millions of sites, and we see 45 million HTTP requests per second on average. The massive diversity and volume of this traffic puts us in a unique position to analyze and defeat online fraud. Cloudflare Bot Management is already built to run our Machine Learning model that detects automated traffic on every request we see. To better tackle more challenging use cases like online fraud, we made our lightning fast Machine Learning even more performant. The typical Machine Learning model now executes in under 0.2 milliseconds, giving us the architecture we need to run multiple specific Machine Learning models in parallel without slowing down content delivery.

Stopping fake account creation and adding to Cloudflare’s defense in depth

Announcing Cloudflare Fraud Detection

The first problem our customers asked us to tackle is detecting fake account creation. Cloudflare is perfectly positioned to solve this because we see more account creation pages than anyone else. Using sampled fake account attack data from our customers, we started looking at signup submission data, and how threat intelligence curated by our Cloudforce One team might be helpful. We found that the data used in our Cloudflare One products was already able to identify 72% of fake accounts based on the signup details supplied by the bad actor, such as the email address or the domain they’re using in the attack. We are continuing to add more sources of threat intelligence data specific to fake accounts to get this number close to 100%. On top of these threat intelligence based rules, we are also training new machine learning models on this data as well, that will spot trends like popular fraud domains based on intelligence from the millions of domains we see across the Cloudflare network.

Making fraud inefficient by expediting detection

The second problem customers asked us to prioritize is expediting. As a reminder, expediting means visiting a succession of web pages faster than would be possible for a normal user, and sometimes skipping ahead in the order of web pages in order to efficiently exploit a resource.

For instance, let’s say that you have an Account Recovery page that is being spammed by a sophisticated group of bad actors, looking for vulnerable users they can steal reset tokens for. In this case, the fraudsters have access to a large number of valid email addresses and they’re testing which of these addresses may be used at your website. To prevent your account recovery process from being abused, we need to ensure that no single person can move through the account recovery process faster, or in a different order than a real person would.

In order to complete a valid password reset action on your site, you may know that a user should have made:

  • A GET request to render your login page
  • A POST request to the login page (at least one second after receiving the login page HTML)
  • A GET request to render the Account Recovery page (at least one second after receiving the POST response)
  • A POST request to the password reset page (at least one second after receiving the Account Recovery page HTML)
  • Taken a total time of less than 5 seconds to complete the process

To solve this, we will rely on encrypted data stored by the user in a token to help us determine if the user has visited all the necessary pages needed in a reasonable amount of time to be performing sensitive actions on your site. If your account recovery process is being abused, the encrypted token we supply acts as a VIP pass, allowing only authorized users to successfully complete the password recovery process. Without a pass indicating the user has gone through the normal recovery flow in the correct order and time, they are denied entry to complete a password recovery. By forcing the bad actor to behave the same as a legitimate user, we make their task of checking which of their compromised email addresses might be registered at your site an impossibly slow process, forcing them to move on to other targets.

Announcing Cloudflare Fraud Detection

These are just two of the first techniques we use to identify and block fraud. We are also building Account Takeover and Carding Abuse detections that we will be talking about in the future on this blog. As online fraud continues to evolve, we will continue to build new and unique detections, leveraging Cloudflare’s unique position to help keep the internet safe.

Where do I sign up?

Cloudflare’s mission is to help build a better Internet, and that includes dealing with the evolution of modern online fraud. If you’re spending hours cleaning up after fraud, or are tired of paying to serve web traffic to bad actors, you can join in the Cloudflare Fraud Detection Early Access in the second half of 2023 by submitting your contact information here. Early Access customers can opt in to providing training data sets right away, making our models more effective for their use cases. You’ll also get test access to our newest models, and future fraud protection features as soon as they roll out.

Announcing Cloudflare Fraud Detection

Detect and block advanced bot traffic

Post Syndicated from Etienne Munnich original https://aws.amazon.com/blogs/security/detect-and-block-advanced-bot-traffic/

Automated scripts, known as bots, can generate significant volumes of traffic to your mobile applications, websites, and APIs. Targeted bots take this a step further by targeting website content, such as product availability or pricing.

Traffic from targeted bots can result in a poor user experience by competing against legitimate user traffic for website access to high-demand inventory, increasing business risk through chargebacks from fraudulent transactions, and increasing infrastructure costs.

In 2021, AWS released AWS WAF Bot Control for Common Bots to help you detect and control common bots. In October 2022, AWS released a new feature—AWS Bot Control for Targeted Bots—that can help you detect and protect against bots that use advanced techniques to actively avoid detection.

In this post, I provide an overview of Bot Control for Targeted Bots and show you how to enable Bot Control to detect and block both common and targeted bots.

Overview of Bot Control for Targeted Bots

Bot Control for Targeted Bots provides sophisticated bot detection and mitigation by creating an intelligent baseline of traffic patterns. Bot Control for Targeted Bots uses browser fingerprinting techniques and client-side JavaScript interrogation methods to help protect your application from advanced bots that mimic human traffic patterns and actively try to evade detection.

Bot Control detects anomalies in usage patterns and provides new flexible mitigation options to isolate bad bots. These options include dynamic rate-limiting, challenge actions, and the ability to block based on labels and confidence scores.

With Bot Control for Targeted Bots, you can use bot protection rules to allow verified common bot traffic and, at the same time, to challenge unwanted advanced bot traffic. You can achieve both tasks from the same configuration page without making application or architectural changes. You can also configure fine-grained rule sets. For example, you can configure blocking actions for high-risk bots while allowing for exceptions for known IP ranges.

This release also introduces token domains, which is the ability to use the same AWS WAF web ACL across multiple domain names and Amazon CloudFront distributions to simplify client-side configuration. For example, you can use token domains to accept tokens that are generated by www.example.com for api.example.com and vice versa. In addition, you can now specify a resource path directly in the managed rule configuration, enabling you to only require a token for API calls, but not for cached, content-like images.

Bot Control for Targeted Bots sends metrics to Amazon CloudWatch to identify application access trends. The metrics include the percentage of human traffic compared to bot traffic and the count of requests for sensitive web pages such as login and checkout pages. Each rule in Bot Control produces a unique label so that you can review CloudWatch metrics and filter logs to understand traffic patterns. By using these mechanisms, you can identify, isolate, and remediate operational issues.

Walkthrough

In this walkthrough, I will show you how to set up Bot Control for Targeted Bots to help protect a CloudFront distribution.

You will set up an AWS WAF web ACL with an AWS Managed Rule for Bot Control for Targeted Bots. The rule detects bots and then decides the appropriate action:

  • Dynamically rate limit verified bots – Based on traffic history, Bot Control creates an intelligent baseline and then applies rate limits to abnormally high volumes.
  • Enable the challenge action – You have a new option, called challenge, along with the already supported options of count, allow, block, and CAPTCHA. The challenge option initiates a process of challenge interstitial, which means that Bot Control provides a challenge to the browser and creates a domain token when the challenge is resolved.

Set up Bot Control for Targeted Bots

In this section, I will show you how to set up Bot Control for Targeted Bots by creating a new web ACL or editing an existing one.

To set up Bot Control for Targeted Bots

  1. Open the AWS WAF console, and then do one of the following:
    • To create a new web ACL, choose Create a new web ACL.
    • To edit an existing web ACL, choose the name of the ACL.
  2. On the Rules tab, for the Add rules drop-down, select Add managed rule groups.
  3. Add a Bot Control rule set to the web ACL. Choose Edit to edit the rule.
  4. For Bot Control inspection level, select the inspection level for Bot Control. For this walkthrough, we chose Targeted to enable Bot Control for Targeted Bots.
    Figure 1: Bot Control – Select inspection level

    Figure 1: Bot Control – Select inspection level

  5. Review and select the actions to be taken on each category of bots detected, and then choose Save rule. In our example, we set allow, challenge, and count rules for the categories, as shown in Figure 2.
    Figure 2: Bot Control – Select actions for each category

    Figure 2: Bot Control – Select actions for each category

    You can select different actions for each category based on your application security needs:

    • Allow: Allows the request to be sent to a protected resource.
    • Block: Blocks the request, returning an HTTP 403 (Forbidden) response.
    • Count: Allows the request to be sent to the protected resource while counting detections. The count shows you bot activity that is occurring without blocking or challenging. When you turn on rules for the first time, this information can help you see what the detections are, before you change the actions.
    • CAPTCHA and Challenge: use CAPTCHA puzzles and silent challenges with tokens to track successful client responses.
  6. In this example you will configure a scope-down statement to apply Bot Control for a given URI path only.

    On the same page in the step above, you can add a scope-down statement to ensure you use and incur Targeted Bots charges for the requests where you need protections. There are more examples of how to use scope-down statements in our documentation.

    Select “Enable scope-down statement” and configure the rule to inspect the URI path as per figure 3.

    Figure 3: Bot Control – Add the scope-down statement

    Figure 3: Bot Control – Add the scope-down statement

  7. To add domain names to be protected, scroll to the bottom of the web ACL and choose Edit. In the Token domain listoptional section, enter the domain name or names to which the token verification applies. Tokens that are generated are valid for these domains.

Create the SDK link for the AWS WAF integration

In this section, I’ll show you how to find the AWS WAF SDK and add it to your application pages.

The token SDK manages the token authorization and includes the tokens in the requests that you send to your protected resources. By adding the SDK link to application pages, you can help ensure that the remote procedure calls by your client contain a valid token.

To add the SDK to your application pages

  1. In the AWS WAF console, in the left navigation pane, choose Application integration SDKs.
  2. Under JavaScript SDK, copy the provided code snippet. This code snippet allows for creation of the cryptographic token in the background when the application loads for the first time, providing a better customer experience.
  3. Add the code snippet to your pages. For example, paste the provided script code within the <head> section of the HTML.

When this integration is in place on your application’s pages, you can add AWS WAF rules in your web ACL to block requests that don’t contain a valid token. Replace the <Web ACL integration URL> with the provided integration URL from the AWS WAF console or copy the script tag from the console:

<script type="text/javascript" src="<Web ACL integration URL>/challenge.js” defer></script>

Figure 4 shows the SDK link for application pages.

Figure 4: Bot Control – Add SDK link to application pages

Figure 4: Bot Control – Add SDK link to application pages

Review metrics

Now that you’ve set up the web ACL and application, you can use the bot visualization dashboard to review bot traffic patterns. Bot rules emit metrics corresponding to their labels, helping you identify which rule within the AWS Managed Rule for Bot Control for Targeted Bots initiated an action. You can also use these labels and rule actions to filter AWS WAF logs so that you can further examine a request.

To view AWS WAF metrics for the distribution

  1. In the AWS WAF console, in the left navigation pane, select Web ACLs.
  2. Select the web ACL that Bot Control is enabled on and then choose the Bot Control tab to view the metrics.
Figure 5: Bot Control – Review web ACL metrics

Figure 5: Bot Control – Review web ACL metrics

Best practices

In this section, I describe best practices for your Bot Control setup.

Set priority ordering of AWS WAF rules to help lower costs

You can set the priority of rule groups in a web ACL such that the order of the rule matches requests more efficiently. AWS WAF will take the action associated to the first rule it matches. If the incoming traffic matches the more wider criteria (such as IPset rules at priority 1), the associated action is taken. That request is never analyzed by the Bot Control rule and hence do not incur the bot control request analysis fees. For example, the following list shows rules ranked in order from highest priority (1) to lowest priority (5):

  1. Use allow and deny lists – provide IP addresses to allow or deny
  2. AWS Managed Rule groups for IP reputation – block bots and other threats
  3. General rate limit – help prevent HTTP flood across the protected resource
  4. AWS WAF Bot Control rule group – scoped-down to exclude static content such as images
  5. Rate limit for login pages – scoped-down for specific URLs and HTTP POST methods

Figure 6 shows the prioritized rules in AWS WAF.

Figure 6: AWS WAF – Web ACL rule order

Figure 6: AWS WAF – Web ACL rule order

Use scope-down statements

You can use scope-down statements to limit the requests evaluated for a rule group. For example, a scope-down statement that excludes checking requests for static assets, such as images for a given URI and HTTP method (GET), can help reduce Bot Control costs.

Block requests without tokens

If a request has a token absent or is rejected, you can block that request. For example, you might want to block requests on login or payment processing pages. To block requests with a missing or rejected token, add a rule to run after the Bot Control rule to block requests matching the labels rejected and absent:

  • awswaf:managed:token:rejected – The request token is present but is either corrupt or has an expired challenge timestamp.
  • awswaf:managed:token:absent – The request doesn’t have a token.

Use SDK integration

After you add the token domains and the provided script to your application pages, you can add a rule to block requests that don’t have a token. Use of the SDK helps AWS WAF verify the client application with silent challenges and provide AWS token acquisition and management. The SDK provides the full functionality of both AWS WAF Bot Control and AWS WAF Fraud Control, reducing the need for multiple SDKs if either or both rule groups are used in the web ACL.

Create CloudWatch alarms

You can add CloudWatch alarms to help you assess whether there is activity outside of the norm for your application. For example, you can monitor for a high number of token-absent metrics for a given time period.

Configure a billing alarm

To help you track costs, you can configure a billing alarm that sends an alert when you have exceeded the threshold for your expected costs.

Pricing and availability

Bot Control for Targeted Bots is available today in AWS Regions where AWS WAF is available, excluding AWS GovCloud (US) and China Regions. For information on pricing, see AWS WAF Pricing.

Conclusion

In this post, you learned how to use Bot Control for Targeted Bots to add visibility into bot activity on your website or applications. With Bot Control for common and targeted bots, you can detect, challenge, and block unwanted bot activity. Because Bot Control is customizable, you can tailor how you address legitimate bots while protecting against bots that use advanced techniques to actively avoid detection. For more information and to get started today, see AWS WAF Bot Control.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Etienne Munnich

Etienne Munnich

Etienne works as an Edge Specialist Solutions Architect, assisting start ups and companies of all sizes across Asia Pacific to improve their web applications performance and security. Based in Sydney, Australia, he has fulfilled many roles in tech, which range from systems integrator to cloud engineer, project manager and now a solutions architect. Follow him on Twitter at @etiennemunnich

Cloudflare named a Leader in WAF by Forrester

Post Syndicated from Michael Tremante original https://blog.cloudflare.com/cloudflare-named-leader-waf-forrester-2022/

Cloudflare named a Leader in WAF by Forrester

Cloudflare named a Leader in WAF by Forrester

Forester has recognised Cloudflare as a Leader in The Forrester Wave™: Web Application Firewalls, Q3 2022 report. The report evaluated 12 Web Application Firewall (WAF) providers on 24 criteria across current offering, strategy and market presence.

You can register for a complimentary copy of the report here. The report helps security and risk professionals select the correct offering for their needs.

We believe this achievement, along with recent WAF developments, reinforces our commitment and continued investment in the Cloudflare Web Application Firewall (WAF), one of our core product offerings.

The WAF, along with our DDoS Mitigation and CDN services, has in fact been an offering since Cloudflare’s founding, and we could not think of a better time to receive this recognition: Birthday Week.

We’d also like to take this opportunity to thank Forrester.

Leading WAF in strategy

Cloudflare received the highest score of all assessed vendors in the strategy category. We also received the highest possible scores in 10 criteria, including:

  • Innovation
  • Management UI
  • Rule creation and modification
  • Log4Shell response
  • Incident investigation
  • Security operations feedback loops

According to Forrester, “Cloudflare Web Application Firewall shines in configuration and rule creation”, “Cloudflare stands out for its active online user community and its associated response time metrics”, and “Cloudflare is a top choice for those prioritizing usability and looking for a unified application security platform.”

Protecting web applications

The core value of any WAF is to keep web applications safe from external attacks by stopping any compromise attempt. Compromises can in fact lead to complete application take over and data exfiltration resulting in financial and reputational damage to the targeted organization.

The Log4Shell criterion in the Forrester Wave report is an excellent example of a real world use case to demonstrate this value.

Log4Shell was a high severity vulnerability discovered in December 2021 that affected the popular Apache Log4J software commonly used by applications to implement logging functionality. The vulnerability, when exploited, allows an attacker to perform remote code execution and consequently take over the target application.

Due to the popularity of this software component, many organizations worldwide were potentially at risk after the immediate public announcement of the vulnerability on December 9, 2021.

We believe that we scored the highest possible score in the Log4Shell criterion due to our fast response to the announcement, by ensuring that all customers using the Cloudflare WAF were protected against the exploit in less than 17 hours globally.

We did this by deploying new managed rules (virtual patching) that were made available to all customers. The rules were deployed with a block action ensuring exploit attempts never reached customer applications.

Additionally, our continuous public updates on the subject, including regarding internal processes, helped create clarity and understanding around the severity of the issue and remediation steps.

In the following weeks from the initial announcement, we updated WAF rules several times following discovery of multiple variations of the attack payloads.

The Cloudflare WAF ultimately “bought” valuable time for our customers to patch their back end systems before attackers may have been able to find and attempt compromise of vulnerable applications.

You can read about our response and our actions following the Log4Shell announcement in great detail on our blog.

Use the Cloudflare WAF today

Cloudflare WAF keeps organizations safer while they focus on improving their applications and APIs. We integrate leading application security capabilities into a single console to protect applications with our WAF while also securing APIs, stopping DDoS attacks, blocking unwanted bots, and monitoring for 3rd party JavaScript attacks.

To start using our Cloudflare WAF today, sign up for an account.

Cloudflare named a Leader by Gartner

Post Syndicated from Michael Tremante original https://blog.cloudflare.com/cloudflare-waap-named-leader-gartner-magic-quadrant-2022/

Cloudflare named a Leader by Gartner

Cloudflare named a Leader by Gartner

Gartner has recognised Cloudflare as a Leader in the 2022 “Gartner® Magic Quadrant™ for Web Application and API Protection (WAAP)” report that evaluated 11 vendors for their ‘ability to execute’ and ‘completeness of vision’.

You can register for a complimentary copy of the report here.

We believe this achievement highlights our continued commitment and investment in this space as we aim to provide better and more effective security solutions to our users and customers.

Keeping up with application security

With over 36 million HTTP requests per second being processed by the Cloudflare global network we get unprecedented visibility into network patterns and attack vectors. This scale allows us to effectively differentiate clean traffic from malicious, resulting in about 1 in every 10 HTTP requests proxied by Cloudflare being mitigated at the edge by our WAAP portfolio.

Visibility is not enough, and as new use cases and patterns emerge, we invest in research and new product development. For example, API traffic is increasing (55%+ of total traffic) and we don’t expect this trend to slow down. To help customers with these new workloads, our API Gateway builds upon our WAF to provide better visibility and mitigations for well-structured API traffic for which we’ve observed different attack profiles compared to standard web based applications.

We believe our continued investment in application security has helped us gain our position in this space, and we’d like to thank Gartner for the recognition.

Cloudflare WAAP

At Cloudflare, we have built several features that fall under the Web Application and API Protection (WAAP) umbrella.

DDoS protection & mitigation

Our network, which spans more than 275 cities in over 100 countries is the backbone of our platform, and is a core component that allows us to mitigate DDoS attacks of any size.

To help with this, our network is intentionally anycasted and advertises the same IP addresses from all locations, allowing us to “split” incoming traffic into manageable chunks that each location can handle with ease, and this is especially important when mitigating large volumetric Distributed Denial of Service (DDoS) attacks.

The system is designed to require little to no configuration while also being “always-on” ensuring attacks are mitigated instantly. Add to that some very smart software such as our new location aware mitigation, and DDoS attacks become a solved problem.

For customers with very specific traffic patterns, full configurability of our DDoS Managed Rules is just a click away.

Web Application Firewall

Our WAF is a core component of our application security and ensures hackers and vulnerability scanners have a hard time trying to find potential vulnerabilities in web applications.

This is very important when zero-day vulnerabilities become publicly available as we’ve seen bad actors attempt to leverage new vectors within hours of them becoming public. Log4J, and even more recently the Confluence CVE, are just two examples where we observed this behavior. That’s why our WAF is also backed by a team of security experts who constantly monitor and develop/improve signatures to ensure we “buy” precious time for our customers to harden and patch their backend systems when necessary. Additionally, and complementary to signatures, our WAF machine learning system classifies each request providing a much wider view in traffic patterns.

Our WAF comes packed with many advanced features such as leaked credential checks, advanced analytics and alerting and payload logging.

Bot Management

It is no secret that a large portion of web traffic is automated, and while not all automation is bad, some is unnecessary and may also be malicious.

Our Bot Management product works in parallel to our WAF and scores every request with the likelihood of it being generated by a bot, allowing you to easily filter unwanted traffic by deploying a WAF Custom Rule, all this backed by powerful analytics. We make this easy by also maintaining a list of verified bots that can be used to further improve a security policy.

In the event you want to block automated traffic, Cloudflare’s managed challenge ensures that only bots receive a hard time without impacting the experience of real users.

API Gateway

API traffic, by definition, is very well-structured relative to standard web pages consumed by browsers. At the same time, APIs tend to be closer abstractions to back end databases and services, resulting in increased attention from malicious actors and often go unnoticed even to internal security teams (shadow APIs).

API Gateway, that can be layered on top of our WAF, helps you both discover API endpoints served by your infrastructure, as well detect potential anomalies in traffic flows that may indicate compromise, both from a volumetric and sequential perspective.

The nature of APIs also allows API Gateway to much more easily provide a positive security model contrary to our WAF: only allow known good traffic and block everything else. Customers can leverage schema protection and mutual TLS authentication (mTLS) to achieve this with ease.

Page Shield

Attacks that leverage the browser environment directly can go unnoticed for some time, as they don’t necessarily require the back end application to be compromised. For example, if any third party JavaScript library used by a web application is performing malicious behavior, application administrators and users may be none the wiser while credit card details are being leaked to a third party endpoint controlled by an attacker. This is a common vector for Magecart, one of many client side security attacks.

Page Shield is solving client side security by providing active monitoring of third party libraries and alerting application owners whenever a third party asset shows malicious activity. It leverages both public standards such as content security policies (CSP) along with custom classifiers to ensure coverage.

Page Shield, just like our other WAAP products, is fully integrated on the Cloudflare platform and requires one single click to turn on.

Security Center

Cloudflare’s new Security Center is the home of the WAAP portfolio. A single place for security professionals to get a broad view across both network and infrastructure assets protected by Cloudflare.

Moving forward we plan for the Security Center to be the starting point for forensics and analysis, allowing you to also leverage Cloudflare threat intelligence when investigating incidents.

The Cloudflare advantage

Our WAAP portfolio is delivered from a single horizontal platform, allowing you to leverage all security features without additional deployments. Additionally, scaling, maintenance and updates are fully managed by Cloudflare allowing you to focus on delivering business value on your application.

This applies even beyond WAAP, as, although we started building products and services for web applications, our position in the network allows us to protect anything connected to the Internet, including teams, offices and internal facing applications. All from the same single platform. Our Zero Trust portfolio is now an integral part of our business and WAAP customers can start leveraging our secure access service edge (SASE) with just a few clicks.

If you are looking to consolidate your security posture, both from a management and budget perspective, application services teams can use the same platform that internal IT services teams use, to protect staff and internal networks.

Continuous innovation

We did not build our WAAP portfolio overnight, and over just the past year we’ve released more than five major WAAP portfolio security product releases. To showcase our speed of innovation, here is a selection of our top picks:

  • API Shield Schema Protection: traditional signature based WAF approaches (negative security model) don’t always work well with well-structured data such as API traffic. Given the fast growth in API traffic across the network we built a new incremental product that allows you to enforce API schemas directly at the edge using a positive security model: only let well-formed data through to your origin web servers;
  • API Abuse Detection: complementary to API Schema Protection, API Abuse Detection warns you whenever anomalies are detected on your API endpoints. These can be triggered by unusual traffic flows or patterns that don’t follow normal traffic activity;
  • Our new Web Application Firewall: built on top of our new Edge Rules Engine, the core Web Application Firewall received a complete overhaul, all the way from engine internals to the UI. Better performance both in terms of latency and efficacy at blocking malicious payloads, along with brand-new capabilities including but not limited to Exposed Credential Checks, account wide configurations and payload logging;
  • DDoS customizable Managed Rules: to provide additional configuration flexibility, we started exposing some of our internal DDoS mitigation managed rules for custom configurations to further reduce false positives and allow customers to increase thresholds / detections as required;
  • Security Center: Cloudflare view on infrastructure and network assets, along with alerts and notifications for miss configurations and potential security issues;
  • Page Shield: based on growing customer demand and the rise of attack vectors focusing on the end user browser environment, Page Shield helps you detect whenever malicious JavaScript may have made its way into your application’s code;
  • API Gateway: full API management, including routing directly from the Cloudflare edge, with API Security baked in, including encryption and mutual TLS authentication (mTLS);
  • Machine Learning WAF: complementary to our WAF Managed Rulesets, our new ML WAF engine, scores every single request from 1 (clean) to 99 (malicious) giving you additional visibility in both valid and non-valid malicious payloads increasing our ability to detect targeted attacks and scans towards your application;

Looking forward

Our roadmap is packed with both new application security features and improvements to existing systems. As we learn more about the Internet we find ourselves better equipped to keep your applications safe. Stay tuned for more.

Gartner, “Magic Quadrant for Web Application and API Protection”, Analyst(s): Jeremy D’Hoinne, Rajpreet Kaur, John Watts, Adam Hils, August 30, 2022.

Gartner and Magic Quadrant are registered trademarks of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation.

Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

35,000 new trees in Nova Scotia

Post Syndicated from Patrick Day original https://blog.cloudflare.com/35-000-new-trees-in-nova-scotia/

35,000 new trees in Nova Scotia

Cloudflare is proud to announce the first 35,000 trees from our commitment to help clean up bad bots (and the climate) have been planted.

35,000 new trees in Nova Scotia

Working with our partners at One Tree Planted (OTP), Cloudflare was able to support the restoration of 20 hectares of land at Victoria Park in Nova Scotia, Canada. The 130-year-old natural woodland park is located in the heart of Truro, NS, and includes over 3,000 acres of hiking and biking trails through natural gorges, rivers, and waterfalls, as well as an old-growth eastern hemlock forest.

The planting projects added red spruce, black spruce, eastern white pine, eastern larch, northern red oak, sugar maple, yellow birch, and jack pine to two areas of the park. The first area was a section of the park that recently lost a number of old conifers due to insect attacks. The second was an area previously used as a municipal dump, which has since been covered by a clay cap and topsoil.

35,000 new trees in Nova Scotia

Our tree commitment began far from the Canadian woodlands. In 2019, we launched an ambitious tool called Bot Fight Mode, which for the first time fought back against bots, targeting scrapers and other automated actors.

Our idea was simple: preoccupy bad bots with nonsense tasks, so they cannot attack real sites. Even better, make these tasks computationally expensive to engage with. This approach is effective, but it forces bad actors to consume more energy and likely emit more greenhouse gasses (GHG). So in addition to launching Bot Fight Mode, we also committed to supporting tree planting projects to account for any potential environmental impact.

What is Bot Fight Mode?

As soon as Bot Fight Mode is enabled, it immediately starts challenging bots that visit your site. It is available to all Cloudflare customers for free, regardless of plan.

35,000 new trees in Nova Scotia

When Bot Fight Mode identifies a bot, it issues a computationally expensive challenge to exhaust it (also called “tarpitting”). Our aim is to disincentivize attackers, so they have to find a new hobby altogether. When we tarpit a bot, we require a significant amount of compute time that will stall its progress and result in a hefty server bill. Sorry not sorry.

We do this because bots are leeches. They draw resources, slow down sites, and abuse online platforms. They also hack into accounts and steal personal data. Of course, we allowlist a small number of bots that are well-behaved, like Slack and Google. And Bot Fight Mode only acts on traffic from cloud and hosting providers (because that is where bots usually originate from).

Over 550,000 sites use Bot Fight Mode today! We believe this makes it the most widely deployed bot management solution in the world (though this is impossible to validate). Free customers can enable the tool from the dashboard and paid customers can use a special version, known as Super Bot Fight Mode.

How many trees? Let’s do the math 🚀

Now, the hard part: how can we translate bot challenges into a specific number of trees that should be planted? Fortunately, we can use a series of unit conversions, similar to those we use to calculate Cloudflare’s total GHG emissions.

We started with the following assumptions.

Table 1.

Measure Quantity Scaled Source
Energy used by a standard server 1,760.3 kWh / year To hours (0.2 kWh / hour) Go Climate
Emissions factor 0.33852 kgCO2e / kWh To grams (338.52 gCO2e / kWh) Go Climate
CO2 absorbed by a mature tree 48 lbsCO2e / year To kilograms (21 kgCO2e / year) One Tree Planted

Next, we selected a high-traffic day to model the rate and duration of bot challenges on our network. On May 23, 2021, Bot Fight Mode issued 2,878,622 challenges, which lasted an average of 50 seconds each. In total, bots spent 39,981 hours engaging with our network defenses, or more than four years of challenges in a single day!

We then converted that time value into kilowatt-hours (kWh) of energy based on the rate of power consumed by our generic server listed in Table 1 above.

39,981 (hours) x .2 (kWh/hour) = 7,996 (kWh)

Once we knew the total amount of energy consumed by bad bot servers, we used an emissions factor (the amount of greenhouse gasses emitted per unit of energy consumed) to determine total emissions.

7,996 (kwh) x 338.52 (gCO2e/kwh) = 2,706,805 (gCO2e)

If you have made it this far, clearly you like to geek out like we do, so for the sake of completeness, the unit commonly used in emissions calculations is carbon dioxide equivalent (CO2e), which is a composite unit for all six GHGs listed in the Kyoto Protocol weighted by Global Warming Potential.

The last conversion we needed was from emissions to trees. Our partners at OTP found that a mature tree absorbs roughly 21 kgCO2e per year. Based on our total emissions that translates to roughly 47,000 trees per server, or 840 trees per CPU core. However, in our original post, we also noted that given the time it takes for a newly planted tree to reach maturity, we would multiply our donation by a factor of 25.

In the end, over the first two years of the program, we calculated that we would need approximately 42,000 trees to account for all the individual CPU cores engaged in Bot Fight Mode. For good measure, we rounded up to an even 50,000.

We are proud that most of these trees are already in the ground, and we look forward to providing an update when the final 15,000 are planted.

A piece of the puzzle

“Planting trees will benefit species diversity of the existing forest, animal habitat, greening of reclamation areas as well as community recreation areas, and visual benefits along popular hiking/biking trail networks.”  
Stephanie Clement, One Tree Planted, Project Manager North America

Reforestation is an important part of protecting healthy ecosystems and promoting biodiversity. Trees and forests are also a fundamental part of helping to slow the growth of global GHG emissions.

However, we recognize there is no single solution to the climate crisis. As part of our mission to help build a better, more sustainable Internet, Cloudflare is investing in renewable energy, tools that help our customers understand and mitigate their own carbon footprints on our network, and projects that will help offset or remove historical emissions associated with powering our network by 2025.

Want to be part of our bots & trees effort? Enable Bot Fight Mode today! It’s available on our free plan and takes only a few seconds. By the time we made our first donation to OTP in 2021, Bot Fight Mode had already spent more than 3,000 years distracting bots.

Help us defeat bad bots and improve our planet today!

35,000 new trees in Nova Scotia

—-
For more information on Victoria Park, please visit https://www.victoriaparktruro.ca
For more information on One Tree Planted, please visit https://onetreeplanted.org
For more information on sustainability at Cloudflare, please visit www.cloudflare.com/impact

Envoy Media: using Cloudflare’s Bot Management & ML

Post Syndicated from Ryan Marlow (Guest Blogger) original https://blog.cloudflare.com/envoy-media-machine-learning-bot-management/

Envoy Media: using Cloudflare's Bot Management & ML

This is a guest post by Ryan Marlow, CTO, and Michael Taggart, Co-founder of Envoy Media Group.

Envoy Media: using Cloudflare's Bot Management & ML

My name is Ryan Marlow, and I’m the CTO of Envoy Media Group. I’m excited to share a story with you about Envoy, Cloudflare, and how we use Bot Management to monitor automated traffic.

Background

Envoy Media Group is a digital marketing and lead generation company. The aim of our work is simple: we use marketing to connect customers with financial services. For people who are experiencing a particular financial challenge, Envoy provides informative videos, money management tools, and other resources. Along the way, we bring customers through an online experience, so we can better understand their needs and educate them on their options. With that information, we check our database of highly vetted partners to see which programs may be useful, and then match them up with the best company to serve them.

As you can imagine, it’s important for us to responsibly match engaged customers to the right financial services. Envoy develops its own brands that guide customers throughout the process. We spend our own advertising dollars, work purely on a performance basis, and choose partners we know will do right by customers. Envoy serves as a trusted guide to help customers get their financial lives back on track.

A bit of technical detail

We often say that Envoy offers a “sophisticated online experience.” This is not your average lead generation engine. We’ve built our own multichannel marketing platform called Revstr, which handles content management, marketing automation, and business intelligence. Envoy has an in-house technology team that develops and maintains Revstr’s multi-million line PHP application, cloud computing services, and infrastructure. With Revstr’s systems, we are able to A/B test any combination of designs with any set of business rules. As a result, Envoy shows the right experience to the right customer every time, and even adapts to the responses of each individual.

Revstr tracks each aspect of the customer’s progress through our pages and forms. It also integrates with advertising platforms, client companies’ CRM systems, and third-party marketing tools. Revstr creates a 360° view of our performance and the customer’s experience. All this information goes into our proprietary data warehouse and is served to the business team. This warehouse also provides data — already cleaned, normalized, and labeled — to our machine learning pipeline. Where needed, we can perform quick and easy testing, training, and deployment of ML models. Both our business team and our marketing automation rely heavily on guidance from these reports and models. And that’s where Cloudflare comes into the picture…

Why we care about automated traffic

One of our key challenges is evaluating the quality of our traffic. Bad actors are always advancing and proliferating. Importantly: any fake traffic to our sites has a direct impact on our business, including wasted resources in advertising, UX optimization, and customer service.

In our world of digital marketing, especially in the industries we compete in, each click is costly and must be treated like a precious commodity. The saying goes that “he or she who is able to spend the most for a click wins.” We spend our own money on those clicks and are only paid when we deliver real leads who consistently convert to enrollments for our clients.

Any money spent on illegitimate clicks will hurt our bottom line — and bots tend to be the lead culprits. Bots hurt our standing in auctions and reduce our ability to buy. The media buyers on our business team are always watching statistics on the cost, quantity, engagement, and conversion rates of our ad traffic. They look for anomalies that might represent fraudulent clicks to ensure we trim out any wasted spend.

Cloudflare offers Bot Management, which spots the exact traffic we need to look out for.

How we use Cloudflare’s Bot Score to filter out bad traffic

We solved our problem by using Cloudflare. For each request that reaches Envoy, Cloudflare calculates a “bot score” that ranges from 1 (automated) to 99 (human). It’s generated using a number of sophisticated methods, including machine learning. Because Cloudflare sees traffic from millions of requests every second, they have an enormous training set to work with. The bot scores are incredibly accurate. By leveraging the bot score to evaluate the legitimacy of an ad click and switching experiences accordingly, we can make the most of every click we pay for.

Because of the high cost of a click, we cannot afford to completely block that click even if Cloudflare indicates it might be a bot. Instead, we ingest the bot score as a custom header and use it as an input to our rules engine. For example, we can put much longer, more qualifying forms in front of traffic that looks suspicious, and render more streamlined forms to higher scoring visitors. In an extreme case, we can even require suspect visitors to contact us by phone instead of completing an online form. This allows us to convert leads which may have a more dubious origin but still prove to be legitimate, while maintaining a pleasant experience for the best leads.

We also pull the bot score into our data warehouse and provide it to our marketing team. Over the long term, if they see that any ad campaign or traffic source has a low average bot score, we can reduce or eliminate spend on that traffic source, seek refunds from providers, and refocus our efforts on more profitable segments.

Using the Bot Score to predict conversion rate

Envoy also leverages the bot score by integrating it into our ML models. For most lead generation companies, it would be sufficient to track lead volume and profit margins. For Envoy, it’s part of our DNA to go beyond making a sale and really assess the lifetime value of those leads for our clients. So we turn to machine learning. We use ML to predict the lifetime value of new visitors based on known data about past leads, and then pass that signal on to our advertising vendors. By skewing the conversion value higher for leads with a better predicted lifetime value, we can influence those pay-per click (PPC) platforms’ own smart bidding algorithms to search for the best qualified leads.

One of the models we use in this process does a prediction of backend conversion rate — how likely a given lead is to become an enrollment for the client company. When we added the bot score and behavioral metrics to this model, we saw a significant increase in its accuracy. When we can better predict conversion rate, we get better leads. This accuracy boost is a force multiplier for our whole platform; it makes an impact not only in media management but also in form design, lead delivery integrations, and email automation.

Why is Bot Score so valuable for Envoy Media?

At Envoy, we take pride in being analytical and data driven. Here are some of the insights we found by combining Cloudflare’s Bot Score with our own internal data:

1. When we added bot score along with behavioral metrics to our conversion rate prediction ML model, its precision increased by 15%. Getting even a 1% improvement in such a carefully tuned model is difficult; a 15% improvement is a huge win.

2. Bot score is included in 76 different reports used by our media buying and UX optimization teams, and in 9 different ML models. It is now a standard component of all new UX reports.

3. Because bot score is so accurate and because bot score is now broadly available within our organization, it is driving organizational performance in ways that we didn’t expect. For example, here is a testimonial from our UX Optimization Team:

I use bot score in PPC search reports. Before I had access to the bot score our PPC reports were muddied with automated traffic – one day conversion rates are 11%, the next day they are at 5%. That is no way to run a business! I spent a lot of time investigating to understand and justify these differences – and many times there just wasn’t a satisfactory answer, and we had to throw the analysis out. Today I have access to bot score data, and it prevents data dilution and gives me a much higher degree of confidence in my analysis.

Thank You and More to Come!

Thanks to the Cloudflare team for giving us the opportunity to share our story. We’re constantly innovating and hope that we can share more of our developments with you in the future.

The Grinch Bot is Stealing Christmas!

Post Syndicated from Ben Solomon original https://blog.cloudflare.com/grinch-bot/

The Grinch Bot is Stealing Christmas!

The Grinch Bot is Stealing Christmas!

This week, a group of US lawmakers introduced the Stopping Grinch Bots Act — new legislation that could stop holiday hoarders on the Internet. This inspired us to put a spin on a Dr. Seuss classic:

Each person on the Internet liked Christmas a lot
But the Grinch Bot, built by the scalper did not!
The Grinch Bot hated Christmas! The whole Christmas season!
Now, please don’t ask why. No one quite knows the reason.

The Grinch Bot is Stealing Christmas!

Cloudflare stops billions of bad bots every day. As you might have guessed, we see all types of attacks, but none is more painful than a Grinch Bot attack. Join us as we take a closer look at this notorious holiday villain…

25 days seconds of Christmas

What is the Grinch Bot? Technically speaking, it’s just a program running on a computer, making automated requests that reach different websites. We’ve come to refer to these requests as “bots” on the Internet. Bots move quickly, leveraging the efficiency of computers to carry out tasks at scale. The Grinch Bot is a very special type that satisfies two conditions:

  1. It only pursues online inventory, attempting to purchase items before humans can complete their orders.
  2. It only operates during the holiday season.

Now, attackers use bots to perform these tasks all year long. But in these winter months, we like to use the term “Grinch Bot” as seasonal terminology.

The Grinch Bot strikes first around Black Friday. It knows that the best discounts come around Thanksgiving, and it loves to get a good deal. Exclusive items are always the first to go, so attackers use the Grinch Bot to cut every (virtual) line and checkpoint. Cloudflare detected nearly 1.5 trillion bot requests on Black Friday. That’s about half of all our traffic; but more on this in a bit.

The Grinch Bot is Stealing Christmas!

The Grinch Bot strikes again on Cyber Monday. As shoppers find gifts for their loved ones, bots are ten steps ahead — selecting “add to cart” automatically. Many bots have payment details ready (perhaps even stolen from your account!).

The Grinch Bot will buy 500 pairs of Lululemon joggers before you even get one. And it’ll do so in seconds.

Nearly 44% of traffic comes from bad bots

The Grinch Bot has friends working throughout the year, putting pressure on security teams and moving undetected. 43.8% of Internet traffic comes from these bots. When the holidays arrive, the Grinch Bot can ask its friends how to attack the largest sites. They have already been testing tactics for months.

The Grinch Bot is Stealing Christmas!

In response, many sites block individual IP addresses, groups of devices, or even entire countries. Other sites use Rate Limiting to reduce traffic volume. At Cloudflare, we’ve advocated not only for Rate Limiting, but also for a more sophisticated approach known as Bot Management, which dynamically identifies threats as they appear. Here’s a look at bot traffic before the holidays (1H 2021):

The Grinch Bot is Stealing Christmas!

When we looked at bot traffic on Black Friday, we found that it had surged to nearly 50%. Cloudflare Radar showed data close to 55% (if you want to include the good bots as well). Businesses tell us this is the most vulnerable time of the year for their sites.

Over 300 billion bots…

Bots are highly effective at scale. While humans can purchase one or two items within a few minutes, bots can purchase far more inventory with little effort.

During the year, Cloudflare observed over 300 billion bots try to “add to cart.” How did we find this? We ran our bot detection engines on every endpoint that contains the word “cart.” Keep in mind, most bots are stopped before they can even view item details. There are trillions of inventory hoarding bots that were caught earlier in their efforts by our Bot Management and security solutions.

Even worse, some bots want to steal your holiday funds. They skip the ecommerce sites and head right for your bank, where they test stolen credentials and try to break into your account. 71% of login traffic comes from bots:

The Grinch Bot is Stealing Christmas!

Bots operate at such an immense scale that they occasionally succeed. When this happens, they can break into accounts, retrieve your credit card information, and begin a holiday shopping spree.

Deck the halls with JS Challenges

We hate CAPTCHAs almost as much as we hate the Grinch Bot, so we built JS challenges as a lightweight, non-interactive alternative:

The Grinch Bot is Stealing Christmas!

Not surprisingly, we issue more JS Challenges when more bots reach our network. These challenges are traditionally a middle ground between taking no action and completely blocking requests. They offer a chance for suspicious looking requests to prove their legitimacy. Cloudflare issued over 35 billion JS Challenges over the shopping weekend.

Even more impressive, however, is the number of threats blocked around this time. On Black Friday, Cloudflare blocked over 150 billion threats:

The Grinch Bot is Stealing Christmas!

While we expected the Grinch Bot to make its move on Friday, we did not expect it to recede as it did on Cyber Monday. Bot traffic decreased as the shopping weekend continued. We like to think the Grinch Bot spent its time furiously trying to avoid blocks and JS Challenges, but eventually gave up.

Saving the Internet (and Christmas)

While large retailers can afford to purchase bot solutions, not every site is so fortunate. We decided to fix that.

Cloudflare’s Bot Fight Mode is a completely free tool that stops bots. You can activate it with one click, drawing on our advanced detection engines to protect your site. It’s easy:

The Grinch Bot is Stealing Christmas!

And Bot Fight Mode doesn’t just stop bots — it makes them pay. We unleash a tarpit challenge that preoccupies each bot with nonsense puzzles, ultimately handing bot operators a special gift: a massive server bill. We even plant trees to offset the carbon emissions of these expensive challenges. In fact, with so many bots stopped in the snow, there’s really just one thing left to say…

Every person on the Internet, the tall and the small, ⁣
Called out with joy that their shopping didn’t stall!
He hadn’t stopped Christmas from coming! It came!
Somehow or other, it came just the same!
And the Grinch Bot, with his grinch feet ice-cold in the snow, ⁣
Stood puzzling and puzzling. “How could it be so?”

Holistic web protection: industry recognition for a prolific 2020

Post Syndicated from Patrick R. Donahue original https://blog.cloudflare.com/cloudflare-named-the-innovation-leader-in-holistic-web-protection/

Holistic web protection: industry recognition for a prolific 2020

I love building products that solve real problems for our customers. These days I don’t get to do so as much directly with our Engineering teams. Instead, about half my time is spent with customers listening to and learning from their security challenges, while the other half of my time is spent with other Cloudflare Product Managers (PMs) helping them solve these customer challenges as simply and elegantly as possible. While I miss the deeply technical engineering discussions, I am proud to have the opportunity to look back every year on all that we’ve shipped across our application security teams.

Taking the time to reflect on what we’ve delivered also helps to reinforce my belief in the Cloudflare approach to shipping product: release early, stay close to customers for feedback, and iterate quickly to deliver incremental value. To borrow a term from the investment world, this approach brings the benefits of compounded returns to our customers: we put new products that solve real-world problems into their hands as quickly as possible, and then reinvest the proceeds of our shared learnings immediately back into the product.

It is these sustained investments that allow us to release a flurry of small improvements over the course of a year, and be recognized by leading industry analyst firms for the capabilities we’ve accumulated and distributed to our customers. Today we’re excited to announce that Frost & Sullivan has named Cloudflare the Innovation Leader in their Frost Radar™: Global Holistic Web Protection Market Report. Frost & Sullivan’s view that this market “will gradually absorb the markets formed around legacy and point solutions” is consistent with our view of the world, and we’re leading the way in “the consolidation of standalone WAF, DDoS mitigation, and Bot Risk Management solutions” they believe is “poised to happen before 2025”.

Holistic web protection: industry recognition for a prolific 2020
Image © 2020 Frost & Sullivan from Frost Radar™: Global Holistic Web Protection Market Report

We are honored to receive this recognition, based on the analysis of 10 providers’ competitive strengths and opportunities as assessed by Frost & Sullivan. The rest of this post explains some of the capabilities that we shipped in 2020 across our Web Application Firewall (WAF), Bot Management, and Distributed Denial-of-Service product lines—the scope of Frost & Sullivan’s report. Get a copy of the Frost & Sullivan Frost Radar report to see why Cloudflare was named the Innovation Leader here.

2020 Web Security Themes and Roundup

Before jumping into specific product and feature launches, I want to briefly explain how we think about building and delivering our web security capabilities. The most important “product” by far that’s been built at Cloudflare over the past 10 years is the massive global network that moves bits securely around the world, as close to the speed of light as possible. Building our features atop this network allows us to reject the legacy tradeoff of performance or security. And equipping customers with the ability to program and extend the network with Cloudflare Workers and Firewall Rules allows us to focus on quickly delivering useful security primitives such as functions, operators, and ML-trained data—then later packaging them up in streamlined user interfaces.

We talk internally about building up the “toolbox” of security controls so customers can express their desired security posture, and that’s how we think about many of the releases over the past year that are discussed below. We begin by providing the saw, hammer, and nails, and let expert builders construct whatever defenses they see fit. By watching how these tools are put to use and observing the results of billions of attempts to evade the erected defenses, we learn how to improve and package them together as a whole for those less inclined to build from components. Most recently we did this with API Shield, providing a guided template to create “positive security” models within Firewall Rules using existing primitives plus new data structures for strong authentication such as Cloudflare-managed client SSL/TLS certificates. Each new tool added to the toolbox increases the value of the existing tools. Each new web request—good or bad—improves the models that our threat intelligence and Bot Management capabilities depend upon.

Web application firewall (WAF) usability at scale

Holistic web protection: industry recognition for a prolific 2020

Last year we spoke with many customers about our plan to decouple configuration from the zone/domain model and allow rules to be set for arbitrary paths and groups of services across an account. In 4Q2020 we put this granular control in the hands of a few developers and some of our most sophisticated enterprise customers, and we’re currently collecting and incorporating feedback before defaulting the capabilities on for new customers.

Rules are great, especially with increased flexibility, but without data structures and request enrichment at the edge (such as the Bot Management techniques described below) they cannot act on anything beyond static properties of the request. In 3Q2020 we released our IP Lists capabilities and customers have been steadily uploading their home-grown and third-party subscription lists. These lists can be referenced anywhere in a customer’s account as named variables and then combined with all other attributes of the request, even Bot Management scores, e.g., http.request.uri.path contains “/login” and (not ip.src in $pingdom_probes and cf.bot_management.score < 30) is a Firewall Rule filter that blocks all bots except Pingdom from accessing the login endpoint.

Requests that are blocked or challenged need to find their way as quickly as possible to our customers’ SOCs for triage, investigation and, occasionally, incident response, so we upgraded our edge-logging framework in 2Q2020 to push real time security-specific logs directly to customer SIEMs. And in 4Q2020, we released the ability to encrypt sensitive payloads within these logs using customer-provided encryption keys and novel encryption algorithms termed “Hybrid Public Key Encryption” (HPKE), and a data localization suite to provide control over where our customers’ data is stored and protected.

Built predominantly in 4Q2020 and currently being tested in the Firewall Rules engine is a brand new implementation of our Rate Limiting engine. By moving this matching and enforcement logic from a standalone tool to a component within a performant, memory-safe, expressive engine built in Rust, we have increased the utility of existing functions. Additional examples of improving this library of capabilities include the work completed in 1Q2020 to add HMAC functions and regex-based HTTP header and body inspection to the engine.

Bots and machine learning (ML)

Holistic web protection: industry recognition for a prolific 2020

In addition to making edge data sets accessible for request evaluation, we continued to invest heavily within our Bot Management team to provide actionable data so that our customers could decide what (if any) automated traffic they wanted to allow to interact with their applications. Our highest priority for Bot research and development has always been efficacy, and last year was no different. A significant portion of our engineering effort was dedicated to our detection engines — both updating and iterating on existing systems or creating entirely new detection engines from scratch.

In 1Q2020 we completed a total rewrite of our Machine Learning engine, and are continually focused on improving the efficacy of our ML engines. To do this, we draw on one of our major competitive advantages: the massive amount of data flowing through Cloudflare’s network. The early 2020 upgrade to our ML model nearly doubled the number of features we use to evaluate and score requests. And to help customers better understand why requests are flagged as bots, we have recently complemented the bot likelihood score in our logs with attribution to the specific engine that generated the score.

Also in 1Q2020, we upgraded our behavioral analysis engine to incorporate more features and increase overall accuracy. This engine conducts histogram-based outlier scoring and is now fully deployed to nearly all Bot Management zones.

In 2Q2020, we developed a lightweight JavaScript element that further advanced our browser fingerprinting capabilities and aids in detection. Specifically, we now silently challenge browsers and detect if a browser is misrepresenting its User Agent. This technique will be incorporated into our ML models and combined with our heuristics engine for more accurate browser fingerprinting. This feature is entirely optional and can be enabled or disabled by customers through our UI and API. Customers with extremely performance sensitive zones or traffic types that are unsuitable for JavaScript (such as API or some mobile app traffic) can still be accurately scored by our Bot Management engine.

In addition to detection, we also spent (and will continue to spend) engineering effort on mitigation. Our entire JavaScript and CAPTCHA challenge platform was rewritten in the last year and deployed to our customer zones in a staged fashion in the second half of 2020. Our new platform is faster and more robust at detecting automated systems attempting to solve the challenges. More importantly, this platform allows us to further invest in new challenge types and modes as we enter 2021.

The biggest and most well received feature released in 2020 was our dedicated Bot Management analytics, released in 3Q2020. We now present informative graphs that double as diagnostic tools. Customers have found that analytics are far more than interesting charts and statistics: in the case of Bot Management, analytics are essential to spotting and subsequently eliminating false positives.

Last but definitely not least, we announced the deprecation of the __cfduid cookie in 4Q2020 which was used primarily to detect bots but caused confusion for some customers including questions about whether they needed to display a cookie banner because of what we do.

To get a sense of the Bot Attack trends we saw in the first half of 2020, take a read through this blog post. And if you’re curious about how our ML models and heuristic engines work to keep your properties safe, this deep dive by Alex Bocharov, Machine Learning Tech Lead on the Bots team, is an excellent guide.

API and IoT security and protection

Holistic web protection: industry recognition for a prolific 2020

At the beginning of 4Q2020, we released a product called API Shield that was purpose built to secure, protect, and accelerate API traffic — and will eventually provide much of the common functionality expected in traditional API Gateways. The UI for API Shield was built on top of Firewall Rules for maximum flexibility, and will serve as the jump-off point for configuring additional API security features we have planned this year.

As part of API Shield, every customer now gets a fully managed, domain-scoped private CA generated for each of their zones, and we plan to continue working closely with the SSL/TLS team to expand CA management options based on feedback. Since the release, we’ve seen great adoption from in particular IoT companies focused on locking down their APIs using short-lived client certificates distributed out to devices. Customers can also now upload OpenAPI schemas to be matched against incoming requests from these devices, with bad requests being dropped at the edge rather than passed on to origin infrastructure.

Another capability we released in 4Q2020 was support for gRPC-based API traffic. Since that release, customers have expressed significant interest in using Cloudflare as a secure API gateway between easy-to-use customer-facing JSON endpoints and internal-facing gRPC or GraphQL endpoints. Like most customer challenges at Cloudflare, early adopters are looking to solve these use cases initially with Cloudflare Workers, but we’re keeping an eye on whether there are aspects for which we’ll want to provide first-class feature support.

Distributed Denial-of-Service (DDoS) protections for web applications and APIs

Holistic web protection: industry recognition for a prolific 2020

The application-layer security of a web application or API is of minimal importance if the service itself is not available due to a persistent DDoS attack at L3-L7. While mitigating such attacks has long been one of Cloudflare’s strengths, attack methodologies evolve and we continued to invest heavily in 2020 to drop attacks more quickly, more efficiently, and more precisely; as a result, automatic mitigation techniques are applied immediately and most malicious traffic is blocked in less than 3 seconds.

Early in 2020 we responded to a persistent increase in smaller, more localized attacks by fine-tuning a system that can autonomously detect attacks on any server in any datacenter. In the month prior to us first posting about this tool, it mitigated almost 300,000 network-layer attacks, roughly 55 times greater than the tool we previously relied upon. This new tool, dubbed “dosd”, leverages Linux’s eXpress Data Path (XDP) and allows our system to quickly — and automatically — deploy rules eBPF rules that run on each packet received. We further enhanced our edge mitigation capabilities in 3Q2020 by developing and releasing a protection layer that can operate even in environments where we only see one side of the TCP flow. These network layer protections help protect our customers who leverage both Magic Transit to protect their IP ranges and our WAF to protect their applications and APIs.

To document and provide visibility into these attacks, we released a GraphQL-backed interface in 1Q2020 called Network Analytics. Network Analytics extends the visibility of attacks against our customers’ services from L7 to L3, and includes detailed attack logs containing data such as top source and destination IPs and ports, ASNs, data centers, countries, bit rates, protocol and TCP flag distributions. A litany of improvements made to this graphical rendering engine over the course of 2020 have benefitted all analytics tools using the same front-end. In 4Q2020, Network Analytics was extended to provide traffic and attack insights into Cloudflare Spectrum-protected applications, which are terminated at L4 (TCP/UDP).

Towards the end of 4Q2020, we released real-time DDoS attack alerting capable of sending emails or pages via PagerDuty to alert security teams of ongoing attacks and mitigations. This capability was released just in time to assist with the onslaught of ransomware attacks that Cloudflare helped detect and defend against. For additional context on unique attacks we fought off in 2020, consider reading about an acoustics inspired attack, a 754 million packet-per-second, or a roundup of attacks from 1Q2020, 2Q2020, or 3Q2020.

Wrapping up and looking towards 2021

2020 was a tough year around the world. Throughout what has also been, and continues to be, a period of heightened cyberattacks and breaches, we feel proud that our teams were able to release a steady flow of new and improved capabilities across several critical security product areas reviewed by Frost & Sullivan. These releases culminated in far greater protections for customers at the end of the year than the beginning, and a recognition for our sustained efforts.

We are pleased to have been named the Innovation Leader in their Frost Radar™: Global Holistic Web Protection Market Report, which “addresses organizations’ demand for consolidated, single pane of glass solutions, which not only reduce the security gaps of legacy products but also provide simplified management capabilities”.

As we look towards 2021 we plan to continue releasing early and often, listening to feedback from our customers, and delivering incremental value along the way. If you have ideas on what additional capabilities you’d like to use to protect your applications and networks, we’d love to hear them below in the comments.

Introducing Bot Analytics

Post Syndicated from Ben Solomon original https://blog.cloudflare.com/introducing-bot-analytics/

Introducing Bot Analytics

Introducing Bot Analytics

Bots — both good and bad — are everywhere on the Internet. Roughly 40% of Internet traffic is automated. Fortunately, Cloudflare offers a tool that can detect and block unwanted bots: we call it Bot Management. This is the most recent platform in our long history of detecting bots for our customers. In fact, Cloudflare has always offered some form of bot detection. Over the past two years, our team has focused on building advanced detection engines, innovating as bots become more sophisticated, and creating new features.

Today, we are releasing Bot Analytics to help you visualize your automated traffic.

Background

It’s worth including some background for those who are new to bots.

Many websites expect human behavior. When I shop online, I behave as anyone else would: I might search for a few items, read reviews when I find something interesting, and eventually complete an order. This is expected. It is a standard use of the Internet.

Introducing Bot Analytics

Unfortunately, without protection these sites can be ripe for exploitation. Those shoes I was looking at? They are limited edition sneakers that resell for five times the price. Sneaker hoarders clamor at the chance to buy a pair (or fifty). Or perhaps I just added a book to my cart: there are probably hundreds of online retailers that sell the same book, each one eager to offer the best price. These retailers desperately want to know what their competitors’ prices are.

You can see where this is going. While most humans make good use of the Internet, some use automated tools to perform abuse at scale. For example, attackers will deplete sneaker inventories by using automated bots to check out quickly. By the time humans click “add to cart,” bots have already paid for shipping. Humans hardly stand a chance. Similarly, online retailers keep track of their competitors with “price scraping” bots that collect pricing information. So when one retailer lowers a book price to $10, another retailer’s bot will respond by pricing at $9.99. This is how we end up with weird prices like $12.32 for toilet paper. Worst of all, malicious bots are incentivized to hide their identities. They’re hidden among us.

Introducing Bot Analytics

Not all bots are bad. Cloudflare maintains a list of verified good bots that we keep separated from the rest. Verified bots are usually transparent about who they are: DuckDuckGo, for example, publicly lists the IP addresses it uses for its search engine. This is a well-intentioned service that happens to be automated, so we verified it. We also verify bots for error monitoring and other tools.

Enter: Bot Analytics

Introducing Bot Analytics

As discussed earlier, we built a Bot Management platform that intelligently detects bots on the Internet, allowing our customers to block bad ones and allow good ones. If you’re curious about how our solution works, read here.

Beginning today, we are going to show you the bots that reach your website. You can see these bots with a new tool called Bot Analytics. It’s fast, accurate, and loaded with information. You can query data up to one month in the past with no noticeable lag. To accomplish this, we exposed the data with GraphQL and paired it with adaptive bitrate (ABR) technology to dynamically load content. If you already have Bot Management added to your Cloudflare account, Bot Analytics is included in your service. Open up your dashboard and let’s take a tour…

The Tour

First: where to go? Bot Analytics lives under the Firewall tab of the dashboard. Once you’re in the Firewall, go to “Overview” and click the second thumbnail on the left. Remember, Bot Management must be added to your account for full access to analytics.

Introducing Bot Analytics

It’s worth noting that Enterprise sites without Bot Management can see a snapshot of their bot traffic. This data is updated in real time and should help you determine if you have a bot problem. Generally speaking, if you have a double-digit percentage of automated traffic, you might be spending more on origin costs than you have to. More importantly, you might be losing revenue or sensitive information to inventory hoarding and credential stuffing.

“Requests by bot score” is the first section on the page. Here, we show traffic over time, but we split it vertically by the traffic type. Green segments represent verified bots, while shades of purple and blue show varying degrees of bot/human likelihood.

Introducing Bot Analytics

“Bot score distribution” is next. This shows similar data, but we display it horizontally without the notion of time. Use the slider below to filter on subsets of traffic and watch the rest of the page adapt.

Introducing Bot Analytics

We recommend that you use the slider to find your ideal bot threshold. In other words: what is the cutoff for suspicious traffic on your site? We generally consider traffic below 30 to be automated, but customers might choose to challenge traffic below 40 or block traffic below 10 (you can even do both!). You should set a threshold that is ambitious but not too aggressive. If your traffic looks like the example below, consider setting a threshold at a “drop off” point like 3 or 14. Why? Notice that the request density is very high near scores 1-2 and 12-13. Many of these requests will have similar characteristics, meaning that the scores immediately above them (3 and 14) offer some differentiating quality. These are the most promising places to segment your bot rules. Notably, not every graph is this pronounced.

Introducing Bot Analytics

“Bot score source” sits lower on the page. Here, you can examine the detection engines that are responsible for scoring your traffic. If you can’t remember the purpose of each engine, simply hover over the tooltip to view a brief description. Customers may wonder why some requests are flagged as “not computed.” This commonly occurs when Cloudflare has issued an error page on your behalf. Perhaps a visitor’s request was met with a gateway timeout (error 504), in which case Cloudflare responded with a branded error page. The error page would not have warranted a challenge or a block, so we did not spend time calculating a bot score. We published another blog post that provides an overview of the most common sources, including machine learning and heuristics.

Introducing Bot Analytics

“Top requests by source” is the final section of Bot Analytics. Although it’s not quite as colorful as the sections above, this section grounds Bot Analytics in highly specific data. You can filter or exclude request attributes, including IP addresses, user agents, and ASNs. In the next section, we’ll use this to spot a bot attack.

Let’s Spot A Bot Attack!

First, I’m going to use the “bot score source” tool to select the most obvious bot requests — those detected by our heuristics engine. This provides us with the following information, some of which has been redacted for privacy reasons:

Introducing Bot Analytics

I already suspect a correlation between a few of these attributes. First, the IP addresses all have very similar request counts. No human would access a site 22,000 times, and the uniformity across IPs 2-5 suggests foul play. Not surprisingly, the same pattern occurs for user agents on the right. User agents tell us about the browser and device associated with a particular request. When Bot Analytics shows this much uniformity and presents clear anomalies in country and ASN, I get suspicious (and you should too). I’m now going to filter on these anomalies to see if my instinct is right:

Introducing Bot Analytics

The trends hold true — to be sure, I briefly expanded the table and found nine separate IP addresses exhibiting the same behavior. This is likely an aggressive content scraper. Notably, it is not marked as a verified bot, so Bot Management issued the lowest possible score and flagged it as “automated.” At the top of Bot Analytics, I will narrow down the traffic and keep the time period at 24 hours:

Introducing Bot Analytics

The most severe attacks come and go. This traffic is clearly sustained, and my best guess is that someone is frequently scraping the homepage for content. This isn’t the most malicious of attacks, but content is still being taken. If I wanted to, I could set a firewall rule to target this bot score or any of the filters I used.

Try It Out

As a reminder, all Enterprise customers will be able to see a snapshot of their bot traffic. Even if you don’t have Bot Management for your site, visit the Firewall for some high-level insights that are updated in real time.

Introducing Bot Analytics

And for those of you with Bot Management — check out Bot Analytics! It’s live now, and we hope you’ll have fun using it. Keep your eyes open for new analytics features in the coming months.

Bot Attack trends for Jan-Jul 2020

Post Syndicated from Ricardo Pacheco original https://blog.cloudflare.com/bot-attack-trends-for-jan-jul-2020/

Bot Attack trends for Jan-Jul 2020

Bot Attack trends for Jan-Jul 2020

Now that we’re a long way through 2020, let’s take a look at automated traffic, which makes up almost 40% of total Internet traffic.

This blog post is a high-level overview of bot traffic on Cloudflare’s network. Cloudflare offers a comprehensive Bot Management tool for Enterprise customers, along with an effective free tool called Bot Fight Mode. Because of the tremendous amount of traffic that flows through our network each day, Cloudflare is in a unique position to analyze global bot trends.

In this post, we will cover the basics of bot traffic and distinguish between automated requests and other human requests (What Is A Bot?). Then, we’ll move on to a global overview of bot traffic around the world (A RoboBird’s Eye View, A Bot Day and Bots All Over The World), and dive into North American traffic (A Look into North American Traffic).  Lastly, we’ll finish with an overview of how the coronavirus pandemic affected global traffic, and we’ll take a deeper look at European traffic (Bots During COVID-19 In Europe).

On average, Cloudflare processes 18 million HTTP requests every second. This is a great opportunity to understand how bots shape the Internet, how much infrastructure is dedicated to these automated requests, and why our customers need a great bot management solution.

What Is A Bot?

Bot Attack trends for Jan-Jul 2020

Cloudflare groups traffic into four bot-related categories:

1. Verified
2. Definitely automated
3. Likely automated
4. Likely human

Our goal is to stop malicious and unwanted bots from harming our customers, while giving customers the opportunity to control how other automated traffic is managed.

We label each request that comes into Cloudflare with a “bot score” 1 through 99, where a lower score means that a request probably came from a bot. A higher score means that a request probably came from a human. This score is available in our Firewall, logs, and Workers, giving customers the flexibility to act on any score.

Cloudflare also maintains a challenge platform that customers can choose to deploy on suspected bots. You’ll recognize these as CAPTCHA challenges or JavaScript challenges. In fact, having the score available in Firewall Rules means that customers can take any action they choose. This platform can be used for mitigation, ensuring that unwanted traffic is stopped in its tracks.

To learn more about how Bot Management interacts with our firewall, check out our support page.

We track successes and failures during these challenges, which ultimately allows us to improve our detection systems. Assuming that our challenges are solvable by humans, effective detections should have low solve rates, given that they are usually presented to bots.

Bot Attack trends for Jan-Jul 2020

Verified bots are registered in an internal verified bot directory. These good bots power search engines and monitoring tools. Good bots enable our customers’ web pages to be found by search engines, for example.

For known non-verified bots (such as a scraper using a simple curl library), we keep a similar directory that is managed by our heuristics engine. If not otherwise verified, we consider requests caught by this engine to be definitely automated.

Our machine learning engine provides another way to identify potential bots. This engine identifies requests with a high probability of automation and marks them as likely automated. This detection mechanism benefits from models built on data from our global network.

If a request is not marked as automated, we mark it as likely human and pass along the bot score from our machine learning system.

We also have a behavioral analysis engine and a JavaScript detections engine. You can learn more about these systems by checking out Alex Bocharov’s previous post on Cloudflare Bot Management.

The two bot definitions for automated traffic are somewhat complementary. Requests caught by heuristic detections will not count towards machine learning detections. Requests that are reliably caught by our machine learning detections won’t need to be registered in our known heuristics bot directory. Because of this, we combine these two together when we discuss “automated traffic” in general.

A RoboBird’s Eye View

Data from this piece comes from information about Cloudflare’s customers, analyzed between January 15, 2020 and July 31, 2020.

First, let’s get a basic understanding of the traffic on our network.

Bot Attack trends for Jan-Jul 2020
Figure 1.1 Traffic type on Cloudflare’s network.

Figure 1.1 has a global breakdown regarding classification; 60.6% of traffic is likely human, 19.3% is likely automated, 18.1% is definitely automated and only 2.1% is from verified bots. In total, 39.5% of requests we score come from some kind of bot.

A Bot Day

Regular traffic fluctuates throughout the day. Do bots follow suit? Let’s check. Figure 2.1 represents traffic deviation from the average hourly traffic. An increase of 10% would mean that the hour is 10% busier than the average hour (measuring requests per hour). We include the total overall traffic in this chart to serve as a comparison to other types of traffic.

Bot Attack trends for Jan-Jul 2020
Figure 2.1 Hourly traffic as a deviation from the average hour.
Bot Attack trends for Jan-Jul 2020
Figure 2.2 Bot classification over an average day. 

We can clearly see a difference between human traffic and bot traffic. Human traffic varies heavily, but predictably, throughout the day. We can see a 15% decrease in human traffic early in the day, between midnight and 05:00 UTC, corresponding to the end of business hours in the Americas, and up to a 25% increase during business hours, 14:00 to 17:00 UTC, where traffic is highest. Conversely, bot traffic is more consistent. Slow hours still see a smaller drop than overall traffic, and busy hours are less busy. The difference between good and bad bots is also apparent: good bots are even more consistent, with small fluctuations in hourly traffic.

But why would this happen? A large portion of bots, good and bad, perform the same task across the Internet. Bad bots may be scraping websites or looking to infect unprotected machines, and they will do this with little intervention from human operators. Good bots could be doing some of these operations, but less frequently and in a more targeted fashion. A good bot scraping a website may be doing so to add it to a search engine, while a bad bot will do the same thing at a much higher rate, for other reasons.

A lot of bots follow business hours. For example, sneaker bots—focused on nabbing exclusive items from sneaker stores—will naturally be active when new products launch.

This difference in volume does not mean that our classifications are affected: our scores remain consistent throughout the day, as Figure 2.1 shows.

Bot Attack trends for Jan-Jul 2020
Figure 2.3 Daily traffic as a deviation from the average day. Grouped by day of week.
Bot Attack trends for Jan-Jul 2020
Figure 2.4 Bot classification over an average week.

We can also see that good bots don’t take weekends off. Weekdays and weekends have fairly marked differences for most traffic, but good bots keep a consistent schedule. Whereas a typical weekday is slightly above average, we can see a drop of about 4% in overall traffic. This does not fully apply to verified bots, which only see a small 1% drop in traffic.

Bots All Over The World

Now that we’ve taken a look at global traffic, let’s dig a little deeper.

Different regions have distinct traffic landscapes regarding automated traffic.

Bot Attack trends for Jan-Jul 2020
Figure 3.1 Traffic type by region.

Figure 3.1 breaks down traffic by region, letting us peek into where each type of traffic comes from. North America stands out as a major automated traffic source; over 50% of definitely automated traffic comes from there, and they also contribute almost 80% of all verified bot traffic. Europe makes up the second largest chunk of traffic, followed by Asia.

Bot Attack trends for Jan-Jul 2020
Figure 3.2 Traffic classification within each region.

Looking at regional breakdown of traffic in Figure 3.2, we can see just how much North American traffic is automated, well above the global average.

A Look into North American Traffic

As the largest source of automated traffic, North America deserves a closer look.

First, we’ll start with a breakdown of each country.

Bot Attack trends for Jan-Jul 2020
Figure 3.3 Percentage of traffic within North America.

Most of our requests in North America come from just three countries—the United States, Canada and Mexico. These account for 98% of all requests from North America, 97% of all requests from likely human sources and 100% of requests from verified bots. The United States alone accounts for 88% of total requests, 82% of requests from likely human sources, 96% of requests from definitely automated sources, 88% of requests from likely automated traffic sources and  98% of requests from verified bot.

However, this alone does not mean that the United States has an unusual amount of activity. These countries have a combined population of roughly 497 million people. The United States accounts for 66.5% of that, Mexico 25.9% and Canada 7.6%. With this context, we can see that the United States is overrepresented in terms of raw requests, but underrepresented in terms of how much of that traffic is likely to be human. Conversely, Canadian traffic is more likely to be human.

Let’s take another look at each country.

Bot Attack trends for Jan-Jul 2020
Figure 3.4 Percentage of traffic within each country.

Over half of the traffic from the United States is automated in some way, which is a clear departure from trends in Mexico and Canada.

American Bots

So far, we’ve seen how much the United States contributes to automated traffic. If we want to go deeper, a good place to start is by understanding how these bots get online. We can do this by examining the networks from which the traffic originates. Networks are identified by Autonomous System Numbers, or ASNs. These form the backbone of the Internet infrastructure.

Think of these as Internet Service Providers, but facing inward towards the network instead of outward towards end consumers. ISPs like Comcast and Verizon are examples of residential ASNs, where we expect mostly human traffic. Cloud providers such as Google and Amazon are also ASNs, but targeted towards cloud services. We expect most of these requests to be automated in some way.

Looking at traffic on the ASN level is important because we can identify cloud-based traffic, or traffic using residential proxies, among others.

Let’s take a look at which ASNs are associated with visitors in the United States. We’ll restrict ourselves to “eyeball” traffic, which is the term we use for requests coming from site visitors.

Bot Attack trends for Jan-Jul 2020
Figure 4.1 Top ASN in the United States.

From figure 4.1 we can clearly see the impact that cloud services have on traffic; 11.5% of all eyeball traffic comes from Amazon and Google.

Bot Attack trends for Jan-Jul 2020
Figure 4.2 Top ASN in the United States for verified bot traffic.

Verified bots operate in a different landscape, coming from cloud providers such as Amazon, Google, Microsoft, Advanced Hosting and Wowrack.

Bot Attack trends for Jan-Jul 2020
Figure 4.3 Top ASN in the United States for likely and definitely automated traffic.

Automated traffic has a variety of ASNs. Cloud providers such as Amazon, Google and Microsoft make up the 30% of automated traffic. Comcast also makes up a significant portion of traffic at 4.8%, indicating that some bots come from residential services.

Bots During COVID-19 In Europe

Lockdowns and limits on public events came as a consequence of the ongoing coronavirus pandemic. Many people have been working from home, and even those who do not have this option are using the Internet in new ways. Overall, this has meant that Cloudflare’s network has grown tremendously.

But how does this impact bot traffic? First let’s get an idea of how it impacted traffic in general. Countries were impacted by the virus at different times, so we expect to see differences, right?

Bot Attack trends for Jan-Jul 2020
Figure 5.1 Total traffic across all regions.

Figure 5.1 has just the traffic increase. Globally, we are seeing an average increase of 10%, while North America saw an increase of over 40% compared to the beginning of the year. Some regions did not change much, such as Africa and Asia, while others, such as Europe saw an increased period, but has since normalized to previous levels.

Let’s look at a few countries, so we can understand what this looks like.

Bot Attack trends for Jan-Jul 2020
Figure 5.2 Daily traffic evolution for Italy, the United Kingdom and Portugal, overlaid with Europe.

Figure 5.2 shows daily traffic relative to January 15, when data collection started. For comparison, we have overall European traffic, and three selected countries: Italy, the United Kingdom and Portugal. Italy was picked because it was one of the first countries in Europe to face the worst of the coronavirus and enact lockdown measures. The United Kingdom took another strategy, with an initial focus on herd immunity, and enacted measures later than the others. Portugal is somewhere in between, locking down later than Italy, in slightly different circumstances.

At the beginning of the year, traffic kept stable and fluctuations kept in line with the European average. As lockdown measures began, traffic increased. Italy was first out of these countries, rising a few weeks before the others, and keeping well above average. Eventually, all countries saw a growth in traffic, followed by a stabilization. Italy seems to have adjusted to a normal, with its growth in line with the European average. Portugal has also stabilized, but with busier weekdays. Conversely, the United Kingdom showed no signs of stopping, exceeding a growth of 40% compared to the beginning of the year.

Bot Attack trends for Jan-Jul 2020
Figure 5.3 Daily definitely automated traffic evolution for Italy, the United Kingdom and Portugal, overlaid with Europe.

Definitely automated traffic did not have that much of a pronounced variation. Italian traffic kept steady throughout, and Portugal had a rather large increase. The biggest one, however, was the United Kingdom, which tripled its initial count.

Bot Attack trends for Jan-Jul 2020
Figure 5.4 Verified bot traffic evolution for Italy, the United Kingdom and Portugal, overlaid with Europe. 

Verified bot traffic is steady, except in Italy, with a massive increase between March and May. What could be the cause of this? Are these a few zones, getting a massive number of requests?

Bot Attack trends for Jan-Jul 2020
Figure 5.5 Verified bot traffic in Italy for the top 10 000 zones, relative to January 15th 2020.

Well, no. If we only examine the top 10,000 zones (by total verified bot requests), we can still see a massive increase in traffic for other zones. So, what’s happening?

Let’s look at user agents. We can separate the top 10 user agents during the bump, and see how they evolve over time.

Bot Attack trends for Jan-Jul 2020
Figure 5.6 Verified bot traffic in Italy for the top 10 user agents, relative to January 15th 2020.

We can see that these 10 user agents are responsible for the majority of verified traffic coming from Italy.

Bot Attack trends for Jan-Jul 2020
Figure 5.7 Verified bot traffic in Italy for the top user agent, relative to January 15 2020.

In fact, most of this increase is from a single user agent. This instance of Google image proxy anonymizes image requests from Gmail, which explains its popularity.

Where does this increase come from? Did this bot suddenly appear and disappear?

Not quite. One thing to keep in mind when dealing with bots is that they cross borders easily. As a proxy service, this bot is making calls on behalf of the end user – people opening emails. These requests will originate from a data center, which can be anywhere in the world. To see this in action, let’s take a look at traffic for this bot in a few select countries.

Bot Attack trends for Jan-Jul 2020
Figure 5.8. Countries of origin for GoogleImageProxy.

We can see that the global average barely budges. It appears that Google may be moving image proxy traffic between data centers and during the period we observed above that traffic was coming from Italy.

Summary

With Cloudflare’s global reach, we’re in a position to understand how bots behave.

The first half of 2020 saw a massive increase in web traffic of around 35% since the beginning of the year, driven by the ongoing coronavirus pandemic, and some bots have taken advantage of it.

We explained how bot management works for our customers, and how we distinguish between likely automated and human traffic.

We showed an overview of how much of our global traffic is automated, and how bots change their behavior throughout the day and the week. Notably, 39.4% of all traffic Cloudflare processes comes from a suspected automated source.

A regional overview of automated traffic lets us know which regions were the source of traffic from likely automated agents. North America, Europe and Asia were the primary sources of traffic, and also of automated traffic in particular.

We then focused on North America, where the majority of automated traffic originates. The United States alone accounted for the majority of requests, over half of which come from automated sources.

To explore this further, we briefly dived into ASN traffic in the United States, so we could see where these requests were coming from. ASNs like Comcast and AT&T were the top ASNs for overall traffic, but unsurprisingly, data centers like Google and Amazon AWS were the main drivers of automated traffic.

Finally, we examined how the coronavirus has impacted traffic in Europe, with a deeper dive on Italian traffic. This led to some interesting insights on verified bot traffic, which saw a massive increase in Italy for a few months.

This post is a small peek into bot management at Cloudflare. In the future, we hope to expand this series of blog posts on bot management, exposing even more insights about bots on the Internet.

How a Customer’s Trust in Cloudflare Led to a Big Win against Bots

Post Syndicated from Kate Fleming original https://blog.cloudflare.com/how-a-customers-trust-in-cloudflare-led-to-a-big-win-against-bots/

How a Customer's Trust in Cloudflare Led to a Big Win against Bots

Key Points

  • Anyone with public-facing web properties is likely to have bot traffic on their website.
  • One type of bot that commonly targets eCommerce and online portals is a ‘scraper bot’.
  • Some scraper bots are good (such as those used by search engines to assess your website’s content to inform search results, or price comparison sites to help inform consumer decisions), however many are malicious, and will work to scrape not only images but also pricing data from your site for use by a competitor.
  • Many Bot Management providers will need to divert your traffic to a dedicated data centre to analyse your traffic and ‘scrub’ it clean from malicious bot traffic before sending it on to your site. While effective this will almost certainly add latency to the traffic’s ‘journey’ resulting in degraded user experience. Look for a technology partner with an expansive network who can scan your traffic in real time as it passes through any data centre on their network.
  • Good and bad scraper bots behave in largely the same way, making it difficult for bot protection systems to differentiate between the two. A common challenge with Bot Management solutions is that they can return a high number of false positives (legitimate bot or customer traffic blocked as though it were malicious). This can result in legitimate customers being challenged to various and repeated authentication challenges, or in extreme cases, blocked altogether. Look for a technology partner who can consistently return low rates of false positives on your traffic.

The Success Story

How a Customer's Trust in Cloudflare Led to a Big Win against Bots

I often joke that the key to understanding what that role of Customer Success is all about is to say: repeat the sentence again slowly… it’s there, in the name. Customer Success is, at its core, about making customers successful. And this is what makes our day, and makes us happy. Allow me to share a short story on how Cloudflare made a successful customer even more so with our Bot Management solution.

Once upon a time, an online property portal, let’s call them Property Portal, came to Cloudflare for our DDoS and WAF solution. We worked well together. The customer liked our ease of use and we delivered on our promise to provide performance and security to them. As Property Portal’s brand and digital footprint grew, so did the instances of malicious bot traffic, in particular ‘scraper bots’.

They say that imitation is the sincerest form of flattery. But when that imitation turns into someone else profiting off your IP, the shine starts to wear off, and that ‘imitation’ starts to become something more akin to outright theft. This is a challenge common to market leaders in the eCommerce and online portal space, where market leading organizations who pride themselves on presenting a solid portfolio of quality product offerings often find that competing sites seem to be not only replicating their content but matching or undercutting their prices in near real time.

Cloudflare wasn’t working with Property Portal when they first started facing these challenges, and as such, Property Portal engaged another party – at that time, one of the market leaders in the space – to provide a Bot Management solution for them.

At first pass, this seemed to solve the problem, however it wasn’t long before additional challenges became apparent:

  • performance was impacted slightly as this solution required traffic to be re-routed to a scrubbing centre to be ‘scrubbed’ of requests from bad bots before coming to their site;
  • a small percentage of malicious bots were still getting through and scraping valuable content of their website, and most damagingly;
  • Property Portal discovered that they were seeing a significant number of ‘false positives’, resulting in rising frustration for legitimate visitors (buyers, sellers and renters) repeatedly being asked to complete challenges in order to validate that they were human and not bots as they tried to navigate through the site.

During this time, Cloudflare released and matured our own Bot Management offering. Being aware that Property Portal weren’t seeing success from their existing solution, the account team began discussing the value of consolidating their bot solution with their DDoS and WAF offering from Cloudflare. We were given very clear success criteria which in technical terms, translated to the following:

  1. Don’t mess anything up, deprecate our user experience, or make us change our domains if we switch to you;
  2. Stop more of the bad traffic.;
  3. And most importantly, let more of the good users in, and stop challenging them as much as our current provider does (reduce the number of false positives).

We passed with flying colours. In addition, Property Portal was happy that they were able to consolidate additional services under one vendor.

For our side, Cloudflare now has the privilege of knowing that we are helping to improve the experience for many of Property Portal’s end customers, while at the same time working to protect their IP and hard work by keeping the ‘imitators’ and their scraper bots at a safe distance.

Happy Customer = Happy Customer Success Managers & Account Teams. Day made.

Would you like to know more?

Does any of the above feel familiar to you? Do your competitors have the uncanny ability to present near identical inventory, images or pricing to yours just as soon as you publish changes to your site? Or, are you keen to learn more in order to stay ahead of the bot armies?

Well, you’ve come to the right place, friend:

  • If you’re the self-serve type then take a look at our learning centre here and here, or our product pages. (And yes, we have a (good) chat bot there waiting to help you).
  • If ‘tuning in and geeking out’ is your preferred method of learning, then tune into the next episode of ‘Customers + Success’ on Cloudflare TV where I’ll be interviewing some of the people involved in this case, and hearing more about the challenges that this customer faced first-hand. The segment will air at 4PM PST October 21st / 7AM SGT October 22nd / 10AM AEST October 22nd, and will be appearing on the CFTV schedule in the next week.
  • Alternatively, if you consume your knowledge in old-fashioned human style feel free to contact us here. Someone from our team will be in touch to get you the answers you are looking for.