SIMD | Noise

Post Syndicated from Zak Cutner original https://blog.cloudflare.com/building-even-faster-interpreters-in-rust/

Building even faster interpreters in Rust

At Cloudflare, we’re constantly working on improving the performance of our edge — and that was exactly what my internship this summer entailed. I’m excited to share some improvements we’ve made to our popular Firewall Rules product over the past few months.

Firewall Rules lets customers filter the traffic hitting their site. It’s built using our engine, Wirefilter, which takes powerful boolean expressions written by customers and matches incoming requests against them. Customers can then choose how to respond to traffic which matches these rules. We will discuss some in-depth optimizations we have recently made to Wirefilter, so you may wish to get familiar with how it works if you haven’t already.

Minimizing CPU usage

As a new member of the Firewall team, I quickly learned that performance is important — even in our security products. We look for opportunities to make our customers’ Internet properties faster where it’s safe to do so, maximizing both security and performance.

Our engine is already heavily used, powering all of Firewall Rules. But we have bigger plans. More and more products like our Web Application Firewall (WAF) will be running behind our Wirefilter-based engine, and it will become responsible for eating up a sizable chunk of our total CPU usage before long.

How to measure performance?

Measuring performance is a notoriously tricky task, and as you can probably imagine trying to do this in a highly distributed environment (aka Cloudflare’s edge) does not help. We’ve been surprised in the past by optimizations that look good on paper, but, when tested out in production, just don’t seem to do much.

Our solution? Performance measurement as a service — an isolated and reproducible benchmark for our Firewall engine and a framework for engineers to easily request runs and view results. It’s worth noting that we took a lot of inspiration from the fantastic Rust Compiler benchmarks to build this.

What to measure?

Our next challenge was to find some meaningful performance metrics. Some experimentation quickly uncovered that time was far too volatile a measure for meaningful comparisons, so we turned to hardware counters [2]. It’s not hard to find tools to measure these (perf and VTune are two such examples), although they (mostly) don’t allow control over which parts of the program are recorded. In our case, we wished to individually record measurements for different stages of filter processing — parsing, compilation, analysis, and execution.

Once again we took inspiration from the Rust compiler, and its self-profiling options, using the perf_event_open API to record counters from inside our binary. We then output something like the following, which our framework can easily ingest and store for later visualization.

Whilst we mainly focussed on metrics relating to CPU usage, we also use a combination of getrusage and clear_refs to find the maximum resident set size (RSS). This is useful to understand the memory impact of particular algorithms in addition to CPU.

But the challenge was not over. Cloudflare’s standard CI agents use virtualization and sandboxing for security and convenience, but this makes accessing hardware counters virtually impossible. Running our benchmarks on a dedicated machine gave us access to these counters, and ensured more reproducible results.

Speeding up the speed test

Our benchmarks were designed from the outset to take an important place in our development process. For instance, we now perform a full benchmark run before releasing each new version to detect performance regressions.

But with our benchmarks in place, it quickly became clear that we had a problem. Our benchmarks simply weren’t fast enough — at least if we wanted to complete them in less than a few hours! The problem was we have a very large number of filters. Since our engine would never usually execute requests against this many filters at once it was proving incredibly costly. We came up with a few tricks to cut this down…

Deduplication. It turns out that only around a third of filters are structurally unique (something that is easy to check as Wirefilter can helpfully serialize to JSON). We managed to cut down a great deal of time by ignoring duplicate filters in our benchmarks.
Sampling. Still, we had too many filters and random sampling presented an easy solution. A more subtle challenge was to make sure that the random sample was always the same to maintain reproducibility.
Partitioning. We worried that deduplication and sampling would cause us to miss important cases that are useful to optimize. By first partitioning filters by Wirefilter language feature, we can ensure we’re getting a good range of filters. It also helpfully gives us more detail about where specifically the impact of a performance change is.

Most of these are trade-offs, but very necessary ones which allow us to run continual benchmarks without development speed grinding to a halt. At the time of writing, we’ve managed to get a benchmark run down to around 20 minutes using these ideas.

Optimizing our engine

With a benchmarking framework in place, we were ready to begin testing optimizations. But how do you optimize an interpreter like Wirefilter? Just-in-time (JIT) compilation, selective inlining and replication were some ideas floating around in the word of interpreters that seemed attractive. After all, we previously wrote about the cost of dynamic dispatch in Wirefilter. All of these techniques aim to reduce that effect.

However, running some real filters through a profiler tells a different story. Most execution time, around 65%, is spent not resolving dynamic dispatch calls but instead performing operations like comparison and searches. Filters currently in production tend to be pretty light on functions, but throw in a few more of these and even less time would be spent on dynamic dispatch. We suspect that even a fair chunk of the remaining 35% is actually spent reading the memory of request fields.

Function	CPU time
`matches` operator	0.6%
`in` operator	1.1%
`eq` operator	11.8%
`contains` operator	51.5%
Everything else	35.0%

Breakdown of CPU time while executing a typical production filter.

An adventure in substring searching

By now, you shouldn’t be surprised that the contains operator was one of the first in line for optimization. If you’ve ever written a Firewall Rule, you’re probably already familiar with what it does — it checks whether a substring is present in the field you are matching against. For example, the following expression would match when the host is “example.com” or “www.example.net”, but not when it is “cloudflare.com”. In string searching algorithms, this is commonly referred to as finding a ‘needle’ (“example”) within a ‘haystack’ (“example.com”).

http.host contains “example”

How does this work under the hood? Ordinarily, we may have used Rust’s `String::contains` function but Wirefilter also allows raw byte expressions that don’t necessarily conform to UTF-8.

http.host contains 65:78:61:6d:70:6c:65

We therefore used the memmem crate which performs a two-way substring search algorithm on raw bytes.

Sounds good, right? It was, and it was working pretty well, although we’d noticed that rewriting `contains` filters using regular expressions could bizarrely often make them faster.

http.host matches “example”

Regular expressions are great, but since they’re far more powerful than the `contains` operator, they shouldn’t be faster than a specialized algorithm in simple cases like this one.

Something was definitely up. It turns out that Rust’s regex library comes equipped with a whole host of specialized matchers for what it deems to be simple expressions like this. The obvious question was whether we could therefore simply use the regex library. Interestingly, you may not have realized that the popular ripgrep tool does just that when searching for fixed-string patterns.

However, our use case is a little different. Since we’re building an interpreter (and we’re using dynamic dispatch in any case), we would prefer to dispatch to a specialized case for `contains` expressions, rather than matching on some enum deep within the regex crate when the filter is executed. What’s more, there are some pretty cool things being done to perform substring searching that leverages SIMD instruction sets. So we wired up our engine to some previous work by Wojciech Muła and the results were fantastic.

Benchmark	Improvement
Expressions using `contains` operator	72.3%
‘Simple’ expressions	0.0%
All expressions	31.6%

Improvements in instruction count using Wojciech Muła’s sse4-strstr library over the memmem crate with Wirefilter.

I encourage you to read more on “Algorithm 1”, which we used, but it works something like this (I’ve changed the order a little to help make it clearer). It’s worth reading up on SIMD instructions if you’re unfamiliar with them — they’re the essence behind what makes this algorithm fast.

We fill one SIMD register with the first byte of the needle being searched for, simply repeated over and over.
We load as much of our haystack as we can into another SIMD register and perform a bitwise equality operation with our previous register.
Now, any position in the resultant register that is 0 cannot be the start of the match since it doesn’t start with the same byte of the needle.
We now repeat this process with the last byte of the needle, offsetting the haystack, to rule out any positions that don’t end with the same byte as the needle.
Bitwise ANDing these two results together, we (hopefully) have now drastically reduced our potential matches.
Each of the remaining potential matches can be checked manually using a memcmp operation. If we find a match, then we’re done.
If not, we continue with the next part of our haystack and repeat until we’ve checked the entire thing.

When it goes wrong

You may be wondering what happens if our haystack doesn’t fit neatly into registers. In the original algorithm, nothing. It simply continues reading into the oblivion after the end of the haystack until the last register is full, and uses a bitmask to ignore potential false-positives from this additional region of memory.

As we mentioned, security is our priority when it comes to optimizations, so we could never deploy something with this kind of behaviour. We ended up porting Muła’s library to Rust (we’ve also open-sourced the crate!) and performed an overlapping registers modification found in ARM’s blog.

It’s best illustrated by example — notice the difference between how we would fill registers on an imaginary SIMD instruction-set with 4-byte registers.

Before modification

After modification

In our case, repeating some bytes within two different registers will never change the final outcome, so this modification is allowed as-is. However, in reality, we found it was better to use a bitmask to exclude repeated parts of the final register and minimize the number of memcmp calls.

What if the haystack is too small to even fill a single register? In this case, we can’t use our overlapping trick since there’s nothing to overlap with. Our solution is straightforward: while we were primarily targeting AVX2, which can store 32-bytes in a lane, we can easily move down to another instruction set with smaller registers that the haystack can fit into. In reality, we don’t currently go any smaller than SSE2. Beyond this, we instead use an implementation of the Rabin-Karp searching algorithm which appears to perform well.

Instruction set	Register size
AVX2	32 bytes
SSE2	16 bytes
SWAR (u64)	8 bytes
SWAR (u32)	4 bytes
…	…

Register sizes in different SIMD instruction sets [3]. We did not consider AVX512 since support for this is not widespread enough.

Is it always fast?

Choosing the first and last bytes of the needle to rule out potential matches is a great idea. It means that when it does come to performing a memcmp, we can ignore these, as we know they already match. Unfortunately, as Muła points out, this also makes the algorithm susceptible to a worst-case attack in some instances.

Let’s give an expression that a customer might write to illustrate this.

http.request.uri.path contains “/wp-admin/”

If we try to search for this within a very long sequence of ‘/’s, we will find a potential match in every position and make lots of calls to memcmp — essentially performing a slow bruteforce substring search.

Clearly we need to choose different bytes from the needle. But which ones should we choose? For each choice, an adversary can always find a slightly different, but equally troublesome, worst case. We instead use randomness to throw off our would-be adversary, picking the first byte of the needle as before, but then choosing another random byte to use.

Our new version is unsurprisingly slower than Muła’s, yet it still exhibits a great improvement over both the memmem and regex crates. Performance, but without sacrificing safety.

Benchmark	Improvement
	sse4-strstr (original)	sliceslice (our version)
Expressions using `contains` operator	72.3%	49.1%
‘Simple’ expressions	0.0%	0.1%
All expressions	31.6%	24.0%

Improvements in instruction count of using sse4-strstr and sliceslice over the memmem crate with Wirefilter.

What’s next?

This is only a small taste of the performance work we’ve been doing, and we have much more yet to come. Nevertheless, none of this would have been possible without the support of my manager Richard and my mentor Elie, who contributed a lot of these ideas. I’ve learned so much over the past few months, but most of all that Cloudflare is an amazing place to be an intern!

[1] Since our benchmarks are not run within a production environment, results in this post do not represent traffic on our edge.

[2] We found instruction counts to be a particularly stable measure, and CPU cycles a particularly unstable one.

[3] Note that SWAR is technically not an instruction set, but instead uses regular registers like vector registers.

Noise

Tag Archives: SIMD