All posts by Marek Majkowski

Computing Euclidean distance on 144 dimensions

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/computing-euclidean-distance-on-144-dimensions/


Late last year I read a blog post about our CSAM image scanning tool. I remember thinking: this is so cool! Image processing is always hard, and deploying a real image identification system at Cloudflare is no small achievement!

Some time later, I was chatting with Kornel: “We have all the pieces in the image processing pipeline, but we are struggling with the performance of one component.” Scaling to Cloudflare needs ain’t easy!

The problem was in the speed of the matching algorithm itself. Let me elaborate. As John explained in his blog post, the image matching algorithm creates a fuzzy hash from a processed image. The hash is exactly 144 bytes long. For example, it might look like this:

00e308346a494a188e1043333147267a 653a16b94c33417c12b433095c318012
5612442030d14a4ce82c623f4e224733 1dd84436734e4a5d6e25332e507a8218
6e3b89174e30372d

The hash is designed to be used in a fuzzy matching algorithm that can find “nearby”, related images. The specific algorithm is well defined, but making it fast is left to the programmer — and at Cloudflare we need the matching to be done super fast. We want to match thousands of hashes per second, of images passing through our network, against a database of millions of known images. To make this work, we need to seriously optimize the matching algorithm.

Naive quadratic algorithm

The first algorithm that comes to mind has O(K*N) complexity: for each query, go through every hash in the database. In a naive implementation, this creates a lot of work. But how much work exactly?

First, we need to explain how fuzzy matching works.

Given a query hash, the fuzzy match is the “closest” hash in a database. This requires us to define a distance. We treat each hash as a vector containing 144 numbers, identifying a point in a 144-dimensional space. Given two such points, we can calculate the distance using the standard Euclidean formula.

For our particular problem, though, we are interested in the “closest” match in a database only if the distance is lower than some predefined threshold. Otherwise, when the distance is large,  we can assume the images aren’t similar. This is the expected result — most of our queries will not have a related image in the database.

The Euclidean distance equation used by the algorithm is standard:

dist(q, h) = sqrt( (q1-h1)^2 + (q2-h2)^2 + ... + (q144-h144)^2 )

To calculate the distance between two 144-byte hashes, we take each byte, calculate the delta, square it, sum it to an accumulator, do a square root, and ta-dah! We have the distance!

Here’s how to count the squared distance in C:

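A minimal sketch of such a function, assuming the hashes are plain arrays of 144 uint8_t values, could look like this:

#include <stdint.h>

/* Squared Euclidean distance between two 144-byte fuzzy hashes. */
static uint32_t distance_sq(const uint8_t a[144], const uint8_t b[144])
{
    uint32_t sum = 0;
    for (int i = 0; i < 144; i++) {
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        sum += (uint32_t)(d * d);   /* max total is 144 * 255^2, well within uint32 */
    }
    return sum;
}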

This function returns the squared distance. We avoid computing the actual distance to save us from running the square root function – it’s slow. Inside the code, for performance and simplicity, we’ll mostly operate on the squared value. We don’t need the actual distance value, we just need to find the vector with the smallest one. In our case it doesn’t matter whether we compare distances or squared distances – the ordering is the same!

As you can see, fuzzy matching is basically a standard problem of finding the closest point in a multi-dimensional space. Surely this has been solved in the past — but let’s not jump ahead.

While this code might be simple, we expect it to be rather slow. Finding the smallest hash distance in a database of, say, 1M entries, would require going over all records, and would need at least:

  1. 144 * 1M subtractions
  2. 144 * 1M multiplications
  3. 144 * 1M additions

And more. This alone adds up to 432 million operations! How does it look in practice? To illustrate this blog post we prepared a full test suite. The large database of known hashes can be well emulated by random data. The query hashes can’t be random and must be slightly more sophisticated, otherwise the exercise wouldn’t be that interesting. We generated the test hashes by byte-swapping actual data from the database — this allows us to precisely control the distance between test hashes and database hashes. Take a look at the scripts for details. Here’s our first run of the first, naive, algorithm:

$ make naive
< test-vector.txt ./mmdist-naive > test-vector.tmp
Total: 85261.833ms, 1536 items, avg 55.509ms per query, 18.015 qps

We matched 1,536 test hashes against a database of 1 million random vectors in 85 seconds. It took 55ms of CPU time on average to find the closest neighbour. This is rather slow for our needs.

SIMD for help

An obvious improvement is to use more complex SIMD instructions. SIMD is a way to instruct the CPU to process multiple data points using one instruction. This is a perfect strategy when dealing with vector problems — as is the case for our task.

We settled on using AVX2, with 256 bit vectors. We did this for a simple reason — newer AVX versions are not supported by our AMD CPUs. Additionally, in the past, we were not thrilled by the AVX-512 frequency scaling.

Using AVX2 is easier said than done. There is no single instruction to count Euclidean distance between two uint8 vectors! The fastest way of counting the full distance of two 144-byte vectors with AVX2 we could find is authored by Vlad:

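The idea, explained below, can be sketched roughly like this (not Vlad’s exact code; it assumes 144-byte uint8_t hashes):

#include <stdint.h>
#include <immintrin.h>

/* Rough AVX2 sketch of the squared distance of two 144-byte vectors,
 * processing 16 bytes per iteration. */
static uint32_t distance_sq_avx2(const uint8_t *a, const uint8_t *b)
{
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < 144; i += 16) {
        /* load 16 bytes and widen uint8 -> int16 */
        __m256i va = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        __m256i d  = _mm256_sub_epi16(va, vb);
        /* square the int16 deltas and add adjacent pairs into int32 lanes */
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(d, d));
    }
    /* extract the eight int32 partial sums into the final sum */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t)_mm_cvtsi128_si32(s);
}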

It’s actually simpler than it looks: load 16 bytes, convert the vector from uint8 to int16, subtract the vectors, accumulate intermediate sums as int32, repeat. At the end, we need four more instructions to extract the partial sums into the final sum. This AVX2 code improves the performance around 3x:

$ make naive-avx2 
Total: 25911.126ms, 1536 items, avg 16.869ms per query, 59.280 qps

We measured 17ms per item, which is still below our expectations. Unfortunately, we can’t push it much further without major changes. The problem is that this code is limited by memory bandwidth. The measurements come from my Intel i7-5557U CPU, which has the max theoretical memory bandwidth of just 25GB/s. The database of 1 million entries takes 137MiB, so it takes at least 5ms to feed the database to my CPU. With this naive algorithm we won’t be able to go below that.

Vantage Point Tree algorithm

Since the naive brute force approach failed, we tried using more sophisticated algorithms. My colleague Kornel Lesiński implemented a super cool Vantage Point algorithm. After a few ups and downs, optimizations and rewrites, we gave up. Our problem turned out to be unusually hard for this kind of algorithm.

We observed “the curse of dimensionality”. Space partitioning algorithms don’t work well in problems with large dimensionality — and in our case, we have an enormous number of 144 dimensions. K-D trees are doomed. Locality-sensitive hashing is also doomed. It’s a bizarre situation in which the space is unimaginably vast, but everything is close together. The volume of the space is a 347-digit-long number, but the maximum distance between points is just 3060 – sqrt(255*255*144).

Space partitioning algorithms are fast, because they gradually narrow the search space as they get closer to finding the closest point. But in our case, the common query is never close to any point in the set, so the search space can’t be narrowed to a meaningful degree.

A VP-tree was a promising candidate, because it operates only on distances, subdividing space into near and far partitions, like a binary tree. When it has a close match, it can be very fast, and doesn’t need to visit more than O(log(N)) nodes. For non-matches, its speed drops dramatically. The algorithm ends up visiting nearly half of the nodes in the tree. Everything is close together in 144 dimensions! Even though the algorithm avoided visiting more than half of the nodes in the tree, the cost of visiting remaining nodes was higher, so the search ended up being slower overall.

Smarter brute force?

This experience got us thinking. Since space partitioning algorithms can’t narrow down the search, and still need to go over a very large number of items, maybe we should focus on going over all the hashes, extremely quickly. We must be smarter about memory bandwidth though — it was the limiting factor in the naive brute force approach before.

Perhaps we don’t need to fetch all the data from memory.

Short distance

The breakthrough came from the realization that we don’t need to count the full distance between hashes. Instead, we can compute only a subset of dimensions, say 32 out of the total of 144. If this distance is already large, then there is no need to compute the full one! Computing more points is not going to reduce the Euclidean distance.

The proposed algorithm works as follows:

1. Take the query hash and extract a 32-byte short hash from it

2. Go over all the 1 million 32-byte short hashes from the database. They must be densely packed in the memory to allow the CPU to perform good prefetching and avoid reading data we won’t need.

3. If the distance of the 32-byte short hash is greater than or equal to the best score so far, move on

4. Otherwise, investigate the hash thoroughly and compute the full distance (see the sketch below).
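
Assuming the short hash is simply the first 32 bytes of each hash, and that short and full hashes are kept in densely packed arrays, a scalar sketch of these four steps might look like this:

#include <stdint.h>
#include <stddef.h>

/* Squared distance over the first n bytes only. */
static uint32_t partial_distance_sq(const uint8_t *a, const uint8_t *b, int n)
{
    uint32_t sum = 0;
    for (int i = 0; i < n; i++) {
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        sum += (uint32_t)(d * d);
    }
    return sum;
}

/* Returns the index of the closest hash, or (size_t)-1 if nothing beats 'best'. */
static size_t find_closest(const uint8_t *query,
                           const uint8_t *short_db,  /* count * 32 bytes, packed */
                           const uint8_t *full_db,   /* count * 144 bytes */
                           size_t count, uint32_t best)
{
    size_t best_idx = (size_t)-1;
    for (size_t i = 0; i < count; i++) {
        /* steps 2-3: cheap 32-byte check against the best score so far */
        if (partial_distance_sq(query, &short_db[i * 32], 32) >= best)
            continue;
        /* step 4: promising candidate, compute the full 144-byte distance */
        uint32_t d = partial_distance_sq(query, &full_db[i * 144], 144);
        if (d < best) {
            best = d;
            best_idx = i;
        }
    }
    return best_idx;
}

Initialising ‘best’ to a squared distance threshold, rather than to infinity, is what turns this into the thresholded variation described below.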

Even though this algorithm needs to do less arithmetic and memory work, it’s not faster than the previous naive one. See make short-avx2. The problem is: we still need to compute a full distance for hashes that are promising, and there are quite a lot of them. Computing the full distance for promising hashes adds enough work, both in ALU and memory latency, to offset the gains of this algorithm.

There is one detail of our particular application of the image matching problem that will help us a lot moving forward. As we described earlier, the problem is less about finding the closest neighbour and more about proving that the neighbour with a reasonable distance doesn’t exist. Remember — in practice, we don’t expect to find many matches! We expect almost every image we feed into the algorithm to be unrelated to image hashes stored in the database.

It’s sufficient for our algorithm to prove that no neighbour exists within a predefined distance threshold. Let’s assume we are not interested in hashes more distant than, say, 220, which squared is 48,400. This makes our short-distance algorithm variation work much better:

$ make short-avx2-threshold
Total: 4994.435ms, 1536 items, avg 3.252ms per query, 307.542 qps

Origin distance variation


At some point, John noted that the threshold allows additional optimization. We can order the hashes by their distance from some origin point. Given a query hash which has origin distance of A, we can inspect only hashes which are distant between |A-threshold| and |A+threshold| from the origin. This is pretty much how each level of Vantage Point Tree works, just simplified. This optimization — ordering items in the database by their distance from origin point — is relatively simple and can help save us a bit of work.

While great on paper, this method doesn’t introduce much gain in practice, as the vectors are not grouped in clusters — they are pretty much random! For the threshold values we are interested in, the origin distance algorithm variation gives us ~20% speed boost, which is okay but not breathtaking. This change might bring more benefits if we ever decide to reduce the threshold value, so it might be worth doing for production implementation. However, it doesn’t work well with query batching.

Transposing data for better AVX

But we’re not done with AVX optimizations! The usual problem with AVX is that the instructions don’t normally fit a specific problem. Some serious mind twisting is required to adapt the right instruction to the problem, or to reverse the problem so that a specific instruction can be used. AVX2 doesn’t have useful “horizontal” uint16 subtract, multiply and add operations. For example, _mm_hadd_epi16 exists, but it’s slow and cumbersome.

Instead, we can twist the problem to make use of the fast uint16 operations that are available. For example we can use:

  1. _mm256_sub_epi16
  2. _mm256_mullo_epi16
  3. and _mm256_add_epi16.

The add would overflow in our case, but fortunately there is add-saturate _mm256_adds_epu16.

The saturated add is great and saves us conversion to uint32. It just adds a small limitation: the threshold passed to the program (i.e., the max squared distance) must fit into uint16. However, this is fine for us.

To effectively use these instructions we need to transpose the data in the database. Instead of storing hashes in rows, we can store them in columns:


So instead of:

  1. [a1, a2, a3],
  2. [b1, b2, b3],
  3. [c1, c2, c3],

We can lay it out in memory transposed:

  1. [a1, b1, c1],
  2. [a2, b2, c2],
  3. [a3, b3, c3],

Now we can load the first bytes of 16 hashes using one memory operation. In the next step, we can subtract the first byte of the querying hash using a single instruction, and so on. The algorithm stays exactly the same as defined above; we just make the data easier to load and easier to process for AVX.

The hot loop code even looks relatively pretty:

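A rough sketch of such a transposed inner loop, using the instructions listed above, could look like this (the layout and names are illustrative, not the actual hot loop):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

/* db is transposed: db[d * n_hashes + j] holds byte d of hash j.
 * This scores 16 consecutive hashes (starting at j) against one query. */
static void score16(const uint8_t *db, size_t n_hashes, size_t n_dims,
                    const uint8_t *query, size_t j, uint16_t out[16])
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t d = 0; d < n_dims; d++) {
        /* byte d of 16 different hashes, loaded with a single instruction */
        __m128i row = _mm_loadu_si128((const __m128i *)&db[d * n_hashes + j]);
        __m256i r16 = _mm256_cvtepu8_epi16(row);
        __m256i q16 = _mm256_set1_epi16(query[d]);
        __m256i diff = _mm256_sub_epi16(r16, q16);
        __m256i sq   = _mm256_mullo_epi16(diff, diff);
        /* saturating add pins scores above 65535, which is fine because the
         * squared threshold must fit into uint16 anyway */
        acc = _mm256_adds_epu16(acc, sq);
    }
    _mm256_storeu_si256((__m256i *)out, acc);
}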

With the well-tuned batch size and short distance size parameters we can see the performance of this algorithm:

$ make short-inv-avx2
Total: 1118.669ms, 1536 items, avg 0.728ms per query, 1373.062 qps

Whoa! This is pretty awesome. We started from 55ms per query, and we finished with just 0.73ms. There are further micro-optimizations possible, like memory prefetching or using huge pages to reduce page faults, but they have diminishing returns at this point.

Roofline model from Denis Bakhvalov’s book

If you are interested in architectural tuning such as this, take a look at the new performance book by Denis Bakhvalov. It discusses roofline model analysis, which is pretty much what we did here.

Do take a look at our code and tell us if we missed some optimization!

Summary

What an optimization journey! We jumped between memory and ALU bottlenecked code. We discussed more sophisticated algorithms, but in the end, a brute force algorithm — although tuned — gave us the best results.

To get even better numbers, I experimented with Nvidia GPU using CUDA. The CUDA intrinsics like vabsdiff4 and dp4a fit the problem perfectly. The V100 gave us some amazing numbers, but I wasn’t fully satisfied with it. Considering how many AMD Ryzen cores with AVX2 we can get for the cost of a single server-grade GPU, we leaned towards general purpose computing for this particular problem.

This is a great example of the type of complexities we deal with every day. Making even the best technologies work “at Cloudflare scale” requires thinking outside the box. Sometimes we rewrite the solution dozens of times before we find the optimal one. And sometimes we settle on a brute-force algorithm, just very very optimized.

The computation of hashes and image matching are challenging problems that require running very CPU intensive operations. The CPU we have available on the edge is scarce and workloads like this are incredibly expensive. Even with the optimization work talked about in this blog post, running the CSAM scanner at scale is a challenge and has required a huge engineering effort. And we’re not done! We need to solve more hard problems before we’re satisfied. If you want to help, consider applying!

Why is there a “V” in SIGSEGV Segmentation Fault?

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/why-is-there-a-v-in-sigsegv-segmentation-fault/


Another long night. I was working on my perfect, bug-free program in C, when the predictable thing happened:

$ clang skynet.c -o skynet
$ ./skynet
Segmentation fault (core dumped)

Oh, well… Maybe I’ll be more lucky taking over the world another night. But then it struck me. My program received a SIGSEGV signal and crashed with "Segmentation Fault" message. Where does the "V" come from?

Did I read it wrong? Was there a "Segmentation Vault"? Or did Linux authors make a mistake? Shouldn’t the signal be named SIGSEGF?

I asked my colleagues and David Wragg quickly told me that the signal name stands for "Segmentation Violation". I guess that makes sense. A long, long time ago, computers used to have memory segmentation. Each memory segment had a defined length – called the Segment Limit. Accessing data over this limit caused a processor fault. This error code got re-used by newer systems that used paging. I think the Intel manuals call this error "Invalid Page Fault". When it’s triggered it gets reported to the userspace as a SIGSEGV signal. End of story.

Or is it?

Martin Levy pointed me to the ancient Version 6 UNIX documentation on "signal", from around 1978.


Look carefully. There is no SIGSEGV signal! Signal number 11 is called SIGSEG!

It seems that userspace parts of the UNIX tree (i.e. /usr/include/signal.h) switched to SIGSEGV fairly early on. But the kernel internals continued to use the name SIGSEG for much longer.

Looking deeper David found that PDP11 trap vector used wording "segmentation violation". This shows up in Research V4 Edition in the UNIX history repo, but it doesn’t mean it was introduced in V4 – it’s just because V4 is the first version with code still available.

This trap was converted into SIGSEG signal in trap.c file.

The file /usr/include/signal.h appears in the tree for Research V7, with the name SIGSEGV. But the kernel still called it SIGSEG at the time.

It seems the kernel side was renamed to SIGSEGV in BSD-4.

Here you go. Originally the signal was called SIGSEG. It was subsequently renamed to SIGSEGV in userspace, and a bit later – around 1980 – on the kernel side as well. Apparently there are still no Segmentation Vaults to be found on UNIX systems.

As for my original crash, I fixed it – of course – by catching the signal and jumping over the offending instruction. On Linux it is totally possible to catch and handle SIGSEGV. With that fix, my code will never again crash. For sure.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

/* Jump over the faulting instruction: advance the instruction pointer
   into the nop sled placed right after the NULL dereference. */
static void sighandler(int signo, siginfo_t *si, void* v_context)
{
    ucontext_t *context = v_context;
    context->uc_mcontext.gregs[REG_RIP] += 10;
}

int *totally_null_pointer = NULL;

int main() {
    struct sigaction psa;
    memset(&psa, 0, sizeof(psa));  /* don't leave sa_mask and sa_flags uninitialized */
    psa.sa_flags = SA_SIGINFO;     /* we use the three-argument sa_sigaction handler */
    psa.sa_sigaction = sighandler;
    sigaction(SIGSEGV, &psa, NULL);

    printf("Before NULL pointer dereference\n");
    *totally_null_pointer = 1;
    __asm__ __volatile__("nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;");
    printf("After NULL pointer. Still here!\n");

    return 0;
}

Conntrack tales – one thousand and one flows

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/conntrack-tales-one-thousand-and-one-flows/


At Cloudflare we develop new products at a great pace. Their needs often challenge the architectural assumptions we made in the past. For example, years ago we decided to avoid using Linux’s "conntrack" – stateful firewall facility. This brought great benefits – it simplified our iptables firewall setup, sped up the system a bit and made the inbound packet path easier to understand.

But eventually our needs changed. One of our new products had a reasonable need for it. But we weren’t confident – can we just enable conntrack and move on? How does it actually work? I volunteered to help the team understand the dark corners of the "conntrack" subsystem.

What is conntrack?

"Conntrack" is a part of Linux network stack, specifically part of the firewall subsystem. To put that into perspective: early firewalls were entirely stateless. They could express only basic logic, like: allow SYN packets to port 80 and 443, and block everything else.

The stateless design gave some basic network security, but was quickly deemed insufficient. You see, there are certain things that can’t be expressed in a stateless way. The canonical example is assessment of ACK packets – it’s impossible to say if an ACK packet is legitimate or part of a port scanning attempt, without tracking the connection state.

To fill such gaps all the operating systems implemented connection tracking inside their firewalls. This tracking is usually implemented as a big table, with at least 6 columns: protocol (usually TCP or UDP), source IP, source port, destination IP, destination port and connection state. On Linux this subsystem is called "conntrack" and is often enabled by default. On my laptop the table can be inspected with the "conntrack -L" command.


The obvious question is how large this state tracking table can be. This setting is under "/proc/sys/net/nf_conntrack_max":

$ cat /proc/sys/net/nf_conntrack_max
262144

This is a global setting, but the limit applies per-container. On my system each container, or "network namespace", can have up to 256K conntrack entries.

What exactly happens when the number of concurrent connections exceeds the conntrack limit?

Testing conntrack is hard

In the past, testing conntrack was hard – it required a complex hardware or VM setup. Fortunately, these days we can use modern "user namespace" facilities which do permission magic, allowing an unprivileged user to feel like root. Using the tool "unshare" it’s possible to create an isolated environment where we can precisely control the packets going through and experiment with iptables and conntrack without threatening the health of our host system. With appropriate parameters it’s possible to create and manage a networking namespace, including access to namespaced iptables and conntrack, from an unprivileged user.

This script is the heart of our test:

# Enable tun interface
ip tuntap add name tun0 mode tun
ip link set tun0 up
ip addr add 192.0.2.1 peer 192.0.2.2 dev tun0
ip route add 0.0.0.0/0 via 192.0.2.2 dev tun0

# Refer to conntrack at least once to ensure it's enabled
iptables -t raw -A PREROUTING -j CT
# Create a counter in mangle table
iptables -t mangle -A PREROUTING
# Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP

tcpdump -ni any -B 16384 -ttt &
...
./venv/bin/python3 send_syn.py

conntrack -L
# Show iptables counters
iptables -nvx -t raw -L PREROUTING
iptables -nvx -t mangle -L PREROUTING

This bash script is shortened for readability. See the full version here. The accompanying "send_syn.py" is just sending 10 SYN packets over "tun0" interface. Here is the source but allow me to paste it here – showing off "scapy" is always fun:

tun = TunTapInterface("tun0", mode_tun=True)
tun.open()

for i in range(10000,10000+10):
    ip=IP(src="198.18.0.2", dst="192.0.2.1")
    tcp=TCP(sport=i, dport=80, flags="S")
    send(ip/tcp, verbose=False, inter=0.01, socket=tun)

The bash script above contains a couple of gems. Let’s walk through them.

First, please note that we can’t just inject packets into the loopback interface using SOCK_RAW sockets. The Linux networking stack is a complex beast. The semantics of sending packets over a SOCK_RAW socket are different from delivering a packet over a real interface. We’ll discuss this later, but for now, to avoid triggering unexpected behaviour, we will deliver packets over a tun/tap device which better emulates a real interface.

Then we need to make sure the conntrack is active in the network namespace we wish to use for testing. Traditionally, just loading the kernel module would have done that, but in the brave new world of containers and network namespaces, a method had to be found to allow conntrack to be active in some and inactive in other containers. Hence this is tied to usage – rules referencing conntrack must exist in the namespace’s iptables for conntrack to be active inside the container.

As a side note, containers triggering host to load kernel modules is an interesting subject.

After the "-t raw -A PREROUTING" rule, which we added "-t mangle -A PREROUTING" rule, but notice – it doesn’t have any action! This syntax is allowed by iptables and it is pretty useful to get iptables to report rule counters. We’ll need these counters soon. A careful reader might suggest looking at "policy" counters in iptables to achieve our goal. Sadly, "policy" counters (increased for each packet entering a chain), work only if there is at least one rule inside it.

The rest of the steps are self-explanatory. We set up "tcpdump" in the background and send 10 SYN packets to 192.0.2.1:80 using the "scapy" Python library. Then we print the conntrack table and iptables counters.

Let’s see this script in action. Remember to run it under a networking namespace as fake root, with "unshare -Ur -n".


This is all nice. First we see a "tcpdump" listing showing 10 SYN packets. Then we see the conntrack table state, showing 10 created flows. Finally, we see iptables counters in two rules we created, each showing 10 packets processed.

Can conntrack table fill up?

Given that the conntrack table is size constrained, what exactly happens when it fills up? Let’s check it out. First, we need to drop the conntrack size. As mentioned it’s controlled by a global toggle – it’s necessary to tune it on the host side. Let’s reduce the table size to 7 entries, and repeat our test:


This is getting interesting. We still see the 10 inbound SYN packets. We still see that the "-t raw PREROUTING" table received 10 packets, but this is where similarities end. The "-t mangle PREROUTING" table saw only 7 packets. Where did the three missing SYN packets go?

It turns out they went where all the dead packets go. They were hard dropped. Conntrack on overfill does exactly that. It even complains in "dmesg" with the well-known "nf_conntrack: table full, dropping packet" message.


This is confirmed by our iptables counters. Let’s review the famous iptables diagram:

image by Jan Engelhardt CC BY-SA 3.0

As we can see, the "-t raw PREROUTING" happens before conntrack, while "-t mangle PREROUTING" is just after it. This is why we see 10 and 7 packets reported by our iptables counters.

Let me emphasize the gravity of our discovery. We showed three completely valid SYN packets being implicitly dropped by "conntrack". There is no explicit "-j DROP" iptables rule. There is no configuration to be toggled. Just the fact of using "conntrack" means that, when it’s full, packets creating new flows will be dropped. No questions asked.

This is the dark side of using conntrack. If you use it, you absolutely must make sure it doesn’t get filled.

We could end our investigation here, but there are a couple of interesting caveats.

Strict vs loose

Conntrack supports a "strict" and "loose" mode, as configured by "nf_conntrack_tcp_loose" toggle.

$ cat /proc/sys/net/netfilter/nf_conntrack_tcp_loose
1

By default, it’s set to "loose" which means that stray ACK packets for unseen TCP flows will create new flow entries in the table. We can generalize: "conntrack" will implicitly drop all the packets that create new flow, whether that’s SYN or just stray ACK.

What happens when we clear the "nf_conntrack_tcp_loose=0" setting? This is a subject for another blog post, but suffice to say – mess. First, this setting is not settable in the network namespace scope – although it should be. To test it you need to be in the root network namespace. Then, due to twisted logic the ACK will be dropped on a full conntrack table, even though in this case it doesn’t create a flow. If the table is not full, the ACK packet will pass through it, having "-ctstate INVALID" from "mangle" table forward.

When is a conntrack entry not created?

There are important situations when a conntrack entry is not created. For example, we could replace these lines in our script:

# Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP

With those:

# Make sure inbound SYN packets don't go to networking stack
iptables -A INPUT -j DROP

Naively we could think dropping SYN packets past the conntrack layer would not interfere with the created flows. This is not correct. In spite of these SYN packets having been seen by conntrack, no flow state is created for them. Packets hitting "-j DROP" will not create new conntrack flows. Pretty magical, isn’t it?

Full conntrack causes EPERM

Recently we hit a case when a "sendto()" syscall on UDP socket from one of our applications was erroring with EPERM. This is pretty weird, and not documented in the man page. My colleague had no doubts:


I’ll save you the gruesome details, but indeed, the full conntrack table will do that to your new UDP flows – you will get EPERM. Beware. Funnily enough, it’s possible to get EPERM if an outbound packet is dropped on OUTPUT firewall in other ways. For example:

marek:~$ sudo iptables -I OUTPUT -p udp --dport 53 --dst 192.0.2.8 -j DROP
marek:~$ strace -e trace=write nc -vu 192.0.2.8 53
write(3, "X", 1)                        = -1 EPERM (Operation not permitted)
+++ exited with 1 +++

If you ever receive EPERM from "sendto()", you might want to treat it as a transient error, if you suspect a filled conntrack problem, or permanent error if you blame iptables configuration.

This is also why we can’t send our SYN packets directly using SOCK_RAW sockets in our test. Let’s see what happens on conntrack overfill with standard "hping3" tool:

$ hping3 -S -i u10000 -c 10 --spoof 192.18.0.2 192.0.2.1 -p 80 -I lo
HPING 192.0.2.1 (lo 192.0.2.1): S set, 40 headers + 0 data bytes
[send_ip] sendto: Operation not permitted

"send()" even on a SOCK_RAW socket fails with EPERM when conntrack table is full.

Full conntrack can happen on a SYN flood

There is one more caveat. During a SYN flood, the conntrack entries will totally be created for the spoofed flows. Take a look at the second test case we prepared, this time correctly listening on port 80, and sending SYN+ACK:


We can see 7 SYN+ACK’s flying out of the port 80 listening socket. The final three SYN’s go nowhere as they are dropped by conntrack.

This has important implications. If you use conntrack on publicly accessible ports, then during a SYN flood, mitigation technologies like SYN Cookies won’t help. You are still at risk of running out of conntrack space and therefore affecting legitimate connections.

For this reason, as a rule of thumb, consider avoiding conntrack on inbound connections (-j NOTRACK). Alternatively, apply some reasonable rate limits at the iptables layer, doing "-j DROP". This will work well and won’t create new flows, as we discussed above. The best method, though, would be to trigger SYN Cookies from a layer before conntrack, like XDP. But this is a subject for another time.

Summary

Over the years Linux conntrack has gone through many changes and has improved a lot. While performance used to be a major concern, these days it’s considered to be very fast. Dark corners remain. Correctly applying conntrack is tricky.

In this blog post we showed how it’s possible to test parts of conntrack with "unshare" and a series of scripts. We showed the behaviour when the conntrack table gets filled – packets might implicitly be dropped. Finally, we mentioned the curious case of SYN floods where incorrectly applied conntrack may cause harm.

Stay tuned for more horror stories as we dig deeper and deeper into the Linux networking stack guts.

When Bloom filters don’t bloom

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/when-bloom-filters-dont-bloom/


I’ve known about Bloom filters (named after Burton Bloom) since university, but I haven’t had an opportunity to use them in anger. Last month this changed – I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.

While doing research about IP spoofing, I needed to examine whether the source IP addresses extracted from packets reaching our servers were legitimate, depending on the geographical location of our data centers. For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian datacenter. This problem might sound simple, but in the ever-evolving landscape of the internet this is far from easy. Suffice it to say I ended up with many large text files with data like this:


Each line reads as: the IP 192.0.2.1 was recorded reaching Cloudflare data center number 107 with a legitimate request. This data came from many sources, including our active and passive probes, logs of certain domains we own (like cloudflare.com), public sources (like BGP table), etc. The same line would usually be repeated across multiple files.

I ended up with a gigantic collection of data of this kind. At some point I counted 1 billion lines across all the harvested sources. I usually write bash scripts to pre-process the inputs, but at this scale this approach wasn’t working. For example, removing duplicates from this tiny file of a meager 600MiB and 40M lines took… about an eternity, more than two minutes.


Enough to say that deduplicating lines using the usual bash commands like ‘sort’ in various configurations (see ‘–parallel’, ‘–buffer-size’ and ‘–unique’) was not optimal for such a large data set.

Bloom filters to the rescue


Image by David Eppstein Public Domain

Then I had a brainwave – it’s not necessary to sort the lines! I just need to remove duplicated lines – using some kind of "set" data structure should be much faster. Furthermore, I roughly know the cardinality of the input file (number of unique lines), and I can live with some data points being lost – using a probabilistic data structure is fine!

Bloom-filters are a perfect fit!

While you should go and read Wikipedia on Bloom Filters, here is how I look at this data structure.

How would you implement a "set"? Given a perfect hash function, and infinite memory, we could just create an infinite bit array and set a bit number ‘hash(item)’ for each item we encounter. This would give us a perfect "set" data structure. Right? Trivial. Sadly, hash functions have collisions and infinite memory doesn’t exist, so we have to compromise in our reality. But we can calculate and manage the probability of collisions. For example, imagine we have a good hash function and 128GiB of memory, which is 2^40 bits. The probability that the second item added to the bit array collides with the first is 1 in 1099511627776 (that is, 1 in 2^40). The probability of collision when adding more items worsens as we fill up the bit array.

Furthermore, we could use more than one hash function, and end up with a denser bit array. This is exactly what Bloom filters optimize for. A Bloom filter is a bunch of math on top of the four variables:

  • ‘n’ – The number of input elements (cardinality)
  • ‘m’ – Memory used by the bit-array
  • ‘k’ – Number of hash functions counted for each input
  • ‘p’ – Probability of a false positive match

Given the ‘n’ input cardinality and the ‘p’ desired probability of false positive, the Bloom filter math returns the ‘m’ memory required and ‘k’ number of hash functions needed.
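
For reference, the standard formulas behind that math (for optimally chosen parameters) are:

m = -n * ln(p) / (ln 2)^2
k = (m / n) * ln 2

For example, n = 40M and p = 1/10,000 works out to roughly 7.7 * 10^8 bits, which rounded up to a power of two gives 2^30 bits, or 128MiB, the filter size that shows up later in this post.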

Check out this excellent visualization by Thomas Hurst showing how parameters influence each other:

mmuniq-bloom

Guided by this intuition, I set out on a journey to add a new tool to my toolbox – ‘mmuniq-bloom’, a probabilistic tool that, given input on STDIN, returns only unique lines on STDOUT, hopefully much faster than ‘sort’ + ‘uniq’ combo!

Here it is:

For simplicity and speed I designed ‘mmuniq-bloom’ with a couple of assumptions. First, unless otherwise instructed, it uses 8 hash functions k=8. This seems to be a close to optimal number for the data sizes I’m working with, and the hash function can quickly output 8 decent hashes. Then we align ‘m’, number of bits in the bit array, to be a power of two. This is to avoid the pricey % modulo operation, which compiles down to slow assembly ‘div’. With power-of-two sizes we can just do bitwise AND. (For a fun read, see how compilers can optimize some divisions by using multiplication by a magic constant.)
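
A toy sketch of that core idea (this is not the real mmuniq-bloom, just an illustration of the power-of-two trick) might be:

#include <stdint.h>

/* m_bits must be a power of two, so the modulo becomes a bitwise AND. */
struct bloom {
    uint64_t *bits;
    uint64_t mask;   /* m_bits - 1 */
};

/* hashes[] holds k independent hash values derived for one input line.
 * Returns 1 if every bit was already set (probably a duplicate line),
 * 0 if at least one bit was new (definitely a new line). */
static int bloom_add(struct bloom *b, const uint64_t hashes[], int k)
{
    int seen = 1;
    for (int i = 0; i < k; i++) {
        uint64_t bit = hashes[i] & b->mask;      /* cheap "modulo" */
        uint64_t word = bit / 64, off = bit % 64;
        if (!(b->bits[word] & (1ULL << off))) {
            seen = 0;
            b->bits[word] |= 1ULL << off;
        }
    }
    return seen;
}

A tool like this would then print an input line only when bloom_add() returns 0.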

We can now run it against the same data file we used before:


Oh, this is so much better! 12 seconds is much more manageable than 2 minutes before. But hold on… The program is using an optimized data structure, relatively limited memory footprint, optimized line-parsing and good output buffering… 12 seconds is still eternity compared to ‘wc -l’ tool:


What is going on? I understand that counting lines by ‘wc’ is easier than figuring out unique lines, but is it really worth the 26x difference? Where does all the CPU in ‘mmuniq-bloom’ go?

It must be my hash function. ‘wc’ doesn’t need to spend all this CPU performing all this strange math for each of the 40M lines on input. I’m using a pretty non-trivial ‘siphash24’ hash function, so it surely burns the CPU, right? Let’s check by running the code computing hash function but not doing any Bloom filter operations:


This is strange. Counting the hash function indeed costs about 2s, but the program took 12s in the previous run. The Bloom filter alone takes 10 seconds? How is that possible? It’s such a simple data structure…

A secret weapon – a profiler

It was time to use a proper tool for the task – let’s fire up a profiler and see where the CPU goes. First, let’s fire an ‘strace’ to confirm we are not running any unexpected syscalls:


Everything looks good. The 10 calls to ‘mmap’ each taking 4ms (3971 us) is intriguing, but it’s fine. We pre-populate memory up front with ‘MAP_POPULATE’ to save on page faults later.

What is the next step? Of course Linux’s ‘perf’!


Then we can see the results:


Right, so we indeed burn 87.2% of cycles in our hot code. Let’s see where exactly. Doing ‘perf annotate process_line --source’ quickly shows something I didn’t expect.


You can see 26.90% of CPU burned in the ‘mov’, but that’s not all of it! The compiler correctly inlined the function, and unrolled the loop 8-fold. Summed up that ‘mov’ or ‘uint64_t v = *p’ line adds up to a great majority of cycles!


Clearly ‘perf’ must be mistaken, how can such a simple line cost so much? We can repeat the benchmark with any other profiler and it will show us the same problem. For example, I like using ‘google-perftools’ with kcachegrind since they emit eye-candy charts. The rendered chart pointed at the very same memory-dereferencing ‘mov’.

Allow me to summarise what we found so far.

The generic ‘wc’ tool takes 0.45s CPU time to process 600MiB file. Our optimized ‘mmuniq-bloom’ tool takes 12 seconds. CPU is burned on one ‘mov’ instruction, dereferencing memory….


Image by Jose Nicdao CC BY/2.0

Oh! How could I have forgotten? Random memory access is slow! It’s very, very, very slow!

According to the rule of thumb "latency numbers every programmer should know about", one RAM fetch is about 100ns. Let’s do the math: 40 million lines, 8 hashes counted for each line. Since our Bloom filter is 128MiB, on our older hardware it doesn’t fit into L3 cache! The hashes are uniformly distributed across the large memory range – each hash generates a memory miss. Adding it together that’s…

40M lines * 8 hashes per line * 100 ns per memory fetch = 32 seconds

That suggests 32 seconds burned just on memory fetches. The real program is faster, taking only 12s. This is because, although the Bloom filter data does not completely fit into L3 cache, it still gets some benefit from caching. It’s easy to see with ‘perf stat -d’:


Right, so we should have had at least 320M LLC-load-misses, but we had only 280M. This still doesn’t explain why the program was running only 12 seconds. But it doesn’t really matter. What matters is that the number of cache misses is a real problem and we can only fix it by reducing the number of memory accesses. Let’s try tuning Bloom filter to use only one hash function:


Ouch! That really hurt! The Bloom filter required 64 GiB of memory to get our desired false positive probability ratio of 1-error-per-10k-lines. This is terrible!

Also, it doesn’t seem like we improved much. It took the OS 22 seconds to prepare memory for us, but we still burned 11 seconds in userspace. I guess this time any benefits from hitting memory less often were offset by lower cache-hit probability due to drastically increased memory size. In previous runs we required only 128MiB for the Bloom filter!

Dumping Bloom filters altogether

This is getting ridiculous. To get the same false positive guarantees we either must use many hashes in Bloom filter (like 8) and therefore many memory operations, or we can have 1 hash function, but enormous memory requirements.

We aren’t really constrained by available memory, instead we want to optimize for reduced memory accesses. All we need is a data structure that requires at most 1 memory miss per item, and use less than 64 Gigs of RAM…

While we could think of more sophisticated data structures like Cuckoo filter, maybe we can be simpler. How about a good old simple hash table with linear probing?

Image by Vadims Podāns

Welcome mmuniq-hash

Here you can find a tweaked version of mmuniq-bloom, but using a hash table:

Instead of storing bits as for the Bloom-filter, we are now storing 64-bit hashes from the ‘siphash24’ function. This gives us much stronger probability guarantees, with probability of false positives much better than one error in 10k lines.

Let’s do the math. Adding a new item to a hash table containing, say, 40M entries has a ‘40M/2^64’ chance of hitting a hash collision. This is about one in 461 billion – a reasonably low probability. But we are not adding one item to a pre-filled set! Instead we are adding 40M lines to an initially empty set. As per the birthday paradox this has much higher chances of hitting a collision at some point. A decent approximation is ‘~n^2/2m’, which in our case is ‘~(40M^2)/(2*(2^64))’. This is a chance of one in 23000. In other words, assuming we are using a good hash function, one in every 23 thousand random sets of 40M items will have a hash collision. This practical chance of hitting a collision is non-negligible, but it’s still better than a Bloom filter and totally acceptable for my use case.

The hash table code runs faster, has better memory access patterns and better false positive probability than the Bloom filter approach.


Don’t be scared about the "hash conflicts" line, it just indicates how full the hash table was. We are using linear probing, so when a bucket is already used, we just pick up the next empty bucket. In our case we had to skip over 0.7 buckets on average to find an empty slot in the table. This is fine and, since we iterate over the buckets in linear order, we can expect the memory to be nicely prefetched.
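
A toy sketch of that insertion logic (not the real mmuniq-hash; the names and layout are illustrative) might be:

#include <stdint.h>

/* A set of 64-bit hash values with linear probing. The bucket array has a
 * power-of-two size comfortably larger than the input cardinality, mask is
 * that size minus one, and 0 marks an empty bucket. */
static int hashset_add(uint64_t *buckets, uint64_t mask, uint64_t h)
{
    if (h == 0)
        h = 1;                     /* reserve 0 for "empty" */
    uint64_t i = h & mask;
    while (buckets[i] != 0) {
        if (buckets[i] == h)
            return 1;              /* already seen: probable duplicate */
        i = (i + 1) & mask;        /* linear probing: try the next bucket */
    }
    buckets[i] = h;
    return 0;                      /* first time we see this hash */
}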

From the previous exercise we know our hash function takes about 2 seconds of this. Therefore, it’s fair to say 40M memory hits take around 4 seconds.

Lessons learned

Modern CPUs are really good at sequential memory access when it’s possible to predict memory fetch patterns (see Cache prefetching). Random memory access on the other hand is very costly.

Advanced data structures are very interesting, but beware. Modern computers require cache-optimized algorithms. When working with large datasets that don’t fit in L3, prefer optimizing for a reduced number of loads over optimizing the amount of memory used.

I guess it’s fair to say that Bloom filters are great, as long as they fit into the L3 cache. The moment this assumption is broken, they are terrible. This is not news, Bloom filters optimize for memory usage, not for memory access. For example, see the Cuckoo Filters paper.

Another thing is the ever-lasting discussion about hash functions. Frankly – in most cases it doesn’t matter. The cost of counting even complex hash functions like ‘siphash24’ is small compared to the cost of random memory access. In our case simplifying the hash function will bring only small benefits. The CPU time is simply spent somewhere else – waiting for memory!

One colleague often says: "You can assume modern CPUs are infinitely fast. They run at infinite speed until they hit the memory wall".

Finally, don’t follow my mistakes – everyone should start profiling with ‘perf stat -d’ and look at the "Instructions per cycle" (IPC) counter. If it’s below 1, it generally means the program is stuck on waiting for memory. Values above 2 would be great, it would mean the workload is mostly CPU-bound. Sadly, I’m yet to see high values in the workloads I’m dealing with…

Improved mmuniq

With the help of my colleagues I’ve prepared a further improved version of the ‘mmuniq’ hash table based tool. See the code:

It is able to dynamically resize the hash table, to support inputs of unknown cardinality. Then, by using batching, it can effectively use the ‘prefetch’ CPU hint, speeding up the program by 35-40%. Beware, sprinkling the code with ‘prefetch’ rarely works. Instead, I specifically changed the flow of algorithms to take advantage of this instruction. With all the improvements I got the run time down to 2.1 seconds:


The end

Writing this basic tool which tries to be faster than ‘sort | uniq’ combo revealed some hidden gems of modern computing. With a bit of work we were able to speed it up from more than two minutes to 2 seconds. During this journey we learned about random memory access latency, and the power of cache friendly data structures. Fancy data structures are exciting, but in practice reducing random memory loads often brings better results.

When TCP sockets refuse to die

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/


While working on our Spectrum server, we noticed something weird: the TCP sockets which we thought should have been closed were lingering around. We realized we don’t really understand when TCP sockets are supposed to time out!


Image by Sergiodc2 CC BY SA 3.0

In our code, we wanted to make sure we don’t hold connections to dead hosts. In our early code we naively thought enabling TCP keepalives would be enough… but it isn’t. It turns out a fairly modern TCP_USER_TIMEOUT socket option is equally as important. Furthermore it interacts with TCP keepalives in subtle ways. Many people are confused by this.

In this blog post, we’ll try to show how these options work. We’ll show how a TCP socket can timeout during various stages of its lifetime, and how TCP keepalives and user timeout influence that. To better illustrate the internals of TCP connections, we’ll mix the outputs of the tcpdump and the ss -o commands. This nicely shows the transmitted packets and the changing parameters of the TCP connections.

SYN-SENT

Let’s start from the simplest case – what happens when one attempts to establish a connection to a server which discards inbound SYN packets?

The scripts used here are available on our Github.

$ sudo ./test-syn-sent.py
# all packets dropped
00:00.000 IP host.2 > host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,940ms,0)

00:01.028 IP host.2 > host.1: Flags [S] # first retry
00:03.044 IP host.2 > host.1: Flags [S] # second retry
00:07.236 IP host.2 > host.1: Flags [S] # third retry
00:15.427 IP host.2 > host.1: Flags [S] # fourth retry
00:31.560 IP host.2 > host.1: Flags [S] # fifth retry
01:04.324 IP host.2 > host.1: Flags [S] # sixth retry
02:10.000 connect ETIMEDOUT

Ok, this was easy. After the connect() syscall, the operating system sends a SYN packet. Since it didn’t get any response the OS will by default retry sending it 6 times. This can be tweaked by the sysctl:

$ sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6

It’s possible to overwrite this setting per-socket with the TCP_SYNCNT setsockopt:

int syncnt = 6;
setsockopt(sd, IPPROTO_TCP, TCP_SYNCNT, &syncnt, sizeof(syncnt));

The retries are staggered at 1s, 3s, 7s, 15s, 31s, 63s marks (the inter-retry time starts at 2s and then doubles each time). By default the whole process takes 130 seconds, until the kernel gives up with the ETIMEDOUT errno. At this moment in the lifetime of a connection, SO_KEEPALIVE settings are ignored, but TCP_USER_TIMEOUT is not. For example, setting it to 5000ms, will cause the following interaction:

$ sudo ./test-syn-sent.py 5000
# all packets dropped
00:00.000 IP host.2 > host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,996ms,0)

00:01.016 IP host.2 > host.1: Flags [S] # first retry
00:03.032 IP host.2 > host.1: Flags [S] # second retry
00:05.016 IP host.2 > host.1: Flags [S] # what is this?
00:05.024 IP host.2 > host.1: Flags [S] # what is this?
00:05.036 IP host.2 > host.1: Flags [S] # what is this?
00:05.044 IP host.2 > host.1: Flags [S] # what is this?
00:05.050 connect ETIMEDOUT

Even though we set the user timeout to 5s, we still saw six SYN retries on the wire. This behaviour is probably a bug (as tested on a 5.2 kernel): we would expect only two retries to be sent – at the 1s and 3s marks – and the socket to expire at the 5s mark. Instead, besides those two retries, we saw a further four retransmitted SYN packets bunched up around the 5s mark – which makes no sense. Anyhow, we learned a thing – the TCP_USER_TIMEOUT does affect the behaviour of connect().

SYN-RECV

SYN-RECV sockets are usually hidden from the application. They live as mini-sockets on the SYN queue. We wrote about the SYN and Accept queues in the past. Sometimes, when SYN cookies are enabled, the sockets may skip the SYN-RECV state altogether.

In SYN-RECV state, the socket will retry sending SYN+ACK 5 times as controlled by:

$ sysctl net.ipv4.tcp_synack_retries
net.ipv4.tcp_synack_retries = 5

Here is how it looks on the wire:

$ sudo ./test-syn-recv.py
00:00.000 IP host.2 > host.1: Flags [S]
# all subsequent packets dropped
00:00.000 IP host.1 > host.2: Flags [S.] # initial SYN+ACK

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,996ms,0)

00:01.033 IP host.1 > host.2: Flags [S.] # first retry
00:03.045 IP host.1 > host.2: Flags [S.] # second retry
00:07.301 IP host.1 > host.2: Flags [S.] # third retry
00:15.493 IP host.1 > host.2: Flags [S.] # fourth retry
00:31.621 IP host.1 > host.2: Flags [S.] # fifth retry
01:04.610 SYN-RECV disappears

With default settings, the SYN+ACK is re-transmitted at 1s, 3s, 7s, 15s, 31s marks, and the SYN-RECV socket disappears at the 64s mark.

Neither SO_KEEPALIVE nor TCP_USER_TIMEOUT affect the lifetime of SYN-RECV sockets.

Final handshake ACK

After receiving the second packet in the TCP handshake – the SYN+ACK – the client socket moves to an ESTABLISHED state. The server socket remains in SYN-RECV until it receives the final ACK packet.

Losing this ACK doesn’t change anything – the server socket will just take a bit longer to move from SYN-RECV to ESTAB. Here is how it looks:

00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.] # initial ACK, dropped

State    Recv-Q Send-Q Local:Port  Peer:Port
SYN-RECV 0      0      host:1      host:2 timer:(on,1sec,0)
ESTAB    0      0      host:2      host:1

00:01.014 IP host.1 > host.2: Flags [S.]
00:01.014 IP host.2 > host.1: Flags [.]  # retried ACK, dropped

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,1.012ms,1)
ESTAB    0      0      host:2     host:1

As you can see, SYN-RECV has the "on" timer, the same as in the example before. We might argue this final ACK doesn’t really carry much weight. This thinking led to the development of the TCP_DEFER_ACCEPT feature – it basically causes the third ACK to be silently dropped. With this flag set the socket remains in SYN-RECV state until it receives the first packet with actual data:

$ sudo ./test-syn-ack.py
00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.] # delivered, but the socket stays as SYN-RECV

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,7.192ms,0)
ESTAB    0      0      host:2     host:1

00:08.020 IP host.2 > host.1: Flags [P.], length 11  # payload moves the socket to ESTAB

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 11     0      host:1     host:2
ESTAB 0      0      host:2     host:1

The server socket remained in the SYN-RECV state even after receiving the final TCP-handshake ACK. It has a funny "on" timer, with the counter stuck at 0 retries. It is converted to ESTAB – and moved from the SYN to the accept queue – after the client sends a data packet or after the TCP_DEFER_ACCEPT timer expires. Basically, with DEFER ACCEPT the SYN-RECV mini-socket discards the data-less inbound ACK.

Idle ESTAB is forever

Let’s move on and discuss a fully-established socket connected to an unhealthy (dead) peer. After completion of the handshake, the sockets on both sides move to the ESTABLISHED state, like:

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:2     host:1
ESTAB 0      0      host:1     host:2

These sockets have no running timer by default – they will remain in that state forever, even if the communication is broken. The TCP stack will notice problems only when one side attempts to send something. This raises a question – what to do if you don’t plan on sending any data over a connection? How do you make sure an idle connection is healthy, without sending any data over it?

This is where TCP keepalives come in. Let’s see it in action – in this example we used the following toggles:

  • SO_KEEPALIVE = 1 – Let’s enable keepalives.
  • TCP_KEEPIDLE = 5 – Send first keepalive probe after 5 seconds of idleness.
  • TCP_KEEPINTVL = 3 – Send subsequent keepalive probes after 3 seconds.
  • TCP_KEEPCNT = 3 – Time out after three failed probes.
$ sudo ./test-idle.py
00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.]

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      0      host:2     host:1  timer:(keepalive,2.992ms,0)

# all subsequent packets dropped
00:05.083 IP host.2 > host.1: Flags [.], ack 1 # first keepalive probe
00:08.155 IP host.2 > host.1: Flags [.], ack 1 # second keepalive probe
00:11.231 IP host.2 > host.1: Flags [.], ack 1 # third keepalive probe
00:14.299 IP host.2 > host.1: Flags [R.], seq 1, ack 1

Indeed! We can clearly see the first probe sent at the 5s mark, two remaining probes 3s apart – exactly as we specified. After a total of three sent probes, and a further three seconds of delay, the connection dies with ETIMEDOUT, and finally the RST is transmitted.

For keepalives to work, the send buffer must be empty. You can notice the keepalive timer active in the "timer:(keepalive)" line.
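
In C these toggles map directly onto setsockopt() calls, roughly like this (a sketch with error handling omitted; sd is an already created TCP socket):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void enable_keepalive(int sd)
{
    int yes = 1, idle = 5, intvl = 3, cnt = 3;
    setsockopt(sd, SOL_SOCKET,  SO_KEEPALIVE,  &yes,   sizeof(yes));
    setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));  /* first probe after 5s idle */
    setsockopt(sd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)); /* 3s between probes */
    setsockopt(sd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));   /* give up after 3 failed probes */
}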

Keepalives with TCP_USER_TIMEOUT are confusing

We mentioned the TCP_USER_TIMEOUT option before. It sets the maximum amount of time that transmitted data may remain unacknowledged before the kernel forcefully closes the connection. On its own, it doesn’t do much in the case of idle connections. The sockets will remain ESTABLISHED even if the connectivity is dropped. However, this socket option does change the semantics of TCP keepalives. The tcp(7) manpage is somewhat confusing:

Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will override keepalive to determine when to close a connection due to keepalive failure.

The original commit message has slightly more detail:

To understand the semantics, we need to look at the kernel code in linux/net/ipv4/tcp_timer.c:693:

if ((icsk->icsk_user_timeout != 0 &&
    elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout) &&
    icsk->icsk_probes_out > 0) ||

For the user timeout to have any effect, icsk_probes_out must not be zero. The check for the user timeout is done only after the first probe has gone out. Let's check it out. Our connection settings:

  • TCP_USER_TIMEOUT = 5*1000 – 5 seconds
  • SO_KEEPALIVE = 1 – enable keepalives
  • TCP_KEEPIDLE = 1 – send first probe quickly – 1 second idle
  • TCP_KEEPINTVL = 11 – subsequent probes every 11 seconds
  • TCP_KEEPCNT = 3 – send three probes before timing out
00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.]

# all subsequent packets dropped
00:01.001 IP host.2 > host.1: Flags [.], ack 1 # first probe
00:12.233 IP host.2 > host.1: Flags [R.] # timer for second probe fired, socket aborted due to TCP_USER_TIMEOUT

So what happened? The connection sent the first keepalive probe at the 1s mark. Seeing no response, the TCP stack woke up 11 seconds later to send a second probe. This time though, it executed the USER_TIMEOUT code path, which decided to terminate the connection immediately.

What if we bump TCP_USER_TIMEOUT to larger values, say between the second and third probe? Then, the connection will be closed on the third probe timer. With TCP_USER_TIMEOUT set to 12.5s:

00:01.022 IP host.2 > host.1: Flags [.] # first probe
00:12.094 IP host.2 > host.1: Flags [.] # second probe
00:23.102 IP host.2 > host.1: Flags [R.] # timer for third probe fired, socket aborted due to TCP_USER_TIMEOUT

We’ve shown how TCP_USER_TIMEOUT interacts with keepalives for small and medium values. The last case is when TCP_USER_TIMEOUT is extraordinarily large. Say we set it to 30s:

00:01.027 IP host.2 > host.1: Flags [.], ack 1 # first probe
00:12.195 IP host.2 > host.1: Flags [.], ack 1 # second probe
00:23.207 IP host.2 > host.1: Flags [.], ack 1 # third probe
00:34.211 IP host.2 > host.1: Flags [.], ack 1 # fourth probe! But TCP_KEEPCNT was only 3!
00:45.219 IP host.2 > host.1: Flags [.], ack 1 # fifth probe!
00:56.227 IP host.2 > host.1: Flags [.], ack 1 # sixth probe!
01:07.235 IP host.2 > host.1: Flags [R.], seq 1 # TCP_USER_TIMEOUT aborts conn on 7th probe timer

We saw six keepalive probes on the wire! With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:

TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT
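
For the toggles used in this section (TCP_KEEPIDLE = 1, TCP_KEEPINTVL = 11, TCP_KEEPCNT = 3) that evaluates to 1 + 11 × 3 = 34 seconds. The option itself takes a value in milliseconds and is set like any other socket option. A minimal sketch – the 33000 ms here is just an illustrative "slightly smaller" value:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* 33000 ms is just below KEEPIDLE + KEEPINTVL * KEEPCNT = 34 s
 * for the settings used in this section. */
void set_user_timeout(int fd)
{
    unsigned int timeout_ms = 33000;
    setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &timeout_ms, sizeof(timeout_ms));
}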

Busy ESTAB socket is not forever

Thus far we have discussed the case where the connection is idle. Different rules apply when the connection has unacknowledged data in a send buffer.

Let’s prepare another experiment – after the three-way handshake, let’s set up a firewall to drop all packets. Then, let’s do a send on one end to have some dropped packets in-flight. An experiment shows the sending socket dies after ~16 minutes:

00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.]

# All subsequent packets dropped
00:00.206 IP host.2 > host.1: Flags [P.], length 11 # first data packet
00:00.412 IP host.2 > host.1: Flags [P.], length 11 # early retransmit, doesn't count
00:00.620 IP host.2 > host.1: Flags [P.], length 11 # 1st retry
00:01.048 IP host.2 > host.1: Flags [P.], length 11 # 2nd retry
00:01.880 IP host.2 > host.1: Flags [P.], length 11 # 3rd retry

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      11     host:2     host:1    timer:(on,1.304ms,3)

00:03.543 IP host.2 > host.1: Flags [P.], length 11 # 4th
00:07.000 IP host.2 > host.1: Flags [P.], length 11 # 5th
00:13.656 IP host.2 > host.1: Flags [P.], length 11 # 6th
00:26.968 IP host.2 > host.1: Flags [P.], length 11 # 7th
00:54.616 IP host.2 > host.1: Flags [P.], length 11 # 8th
01:47.868 IP host.2 > host.1: Flags [P.], length 11 # 9th
03:34.360 IP host.2 > host.1: Flags [P.], length 11 # 10th
05:35.192 IP host.2 > host.1: Flags [P.], length 11 # 11th
07:36.024 IP host.2 > host.1: Flags [P.], length 11 # 12th
09:36.855 IP host.2 > host.1: Flags [P.], length 11 # 13th
11:37.692 IP host.2 > host.1: Flags [P.], length 11 # 14th
13:38.524 IP host.2 > host.1: Flags [P.], length 11 # 15th
15:39.500 connection ETIMEDOUT

The data packet is retransmitted 15 times, as controlled by:

$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15

From the ip-sysctl.txt documentation:


The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout.

The connection indeed died at ~940 seconds. Notice the socket has the "on" timer running. It doesn’t matter at all if we set SO_KEEPALIVE – when the "on" timer is running, keepalives are not engaged.
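
As a back-of-the-envelope check (assuming the default TCP_RTO_MIN of 200 ms and TCP_RTO_MAX of 120 s): the retransmission timeout doubles from 200 ms until it hits the 120 s cap, so the exponential part contributes 0.2 × (1 + 2 + … + 512) = 204.6 seconds, and six more capped intervals add 720 seconds – exactly the documented 924.6 seconds.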

TCP_USER_TIMEOUT keeps on working though. The connection will be aborted as soon as the specified user-timeout amount of time has passed since the last received packet. With the user timeout set, the tcp_retries2 value is ignored.

Zero window ESTAB is… forever?

There is one final case worth mentioning. If the sender has plenty of data, and the receiver is slow, then TCP flow control kicks in. At some point the receiver will ask the sender to stop transmitting new data. This is a slightly different condition than the one described above.

In this case, with flow control engaged, there is no in-flight or unacknowledged data. Instead the receiver throttles the sender with a "zero window" notification. Then the sender periodically checks if the condition is still valid with "window probes". In this experiment we reduced the receive buffer size for simplicity. Here’s how it looks on the wire:

00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.], win 1152
00:00.000 IP host.2 > host.1: Flags [.]

00:00.202 IP host.2 > host.1: Flags [.], length 576 # first data packet
00:00.202 IP host.1 > host.2: Flags [.], ack 577, win 576
00:00.202 IP host.2 > host.1: Flags [P.], length 576 # second data packet
00:00.244 IP host.1 > host.2: Flags [.], ack 1153, win 0 # throttle it! zero-window

00:00.456 IP host.2 > host.1: Flags [.], ack 1 # zero-window probe
00:00.456 IP host.1 > host.2: Flags [.], ack 1153, win 0 # nope, still zero-window

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 1152   0      host:1     host:2
ESTAB 0      129920 host:2     host:1  timer:(persist,048ms,0)

The packet capture shows a couple of things. First, we can see two packets with data, each 576 bytes long. Both were immediately acknowledged. The second ACK carried a "win 0" notification: the sender was told to stop sending data.

But the sender is eager to send more! The last two packets show the first "window probe": the sender will periodically send payload-less "ack" packets to check if the window size has changed. As long as the receiver keeps on answering, the sender will keep on sending such probes forever.

The socket information shows three important things:

  • The read buffer of the reader is filled – thus the "zero window" throttling is expected.
  • The write buffer of the sender is filled – we have more data to send.
  • The sender has a "persist" timer running, counting the time until the next "window probe".

In this blog post we are interested in timeouts – what will happen if the window probes are lost? Will the sender notice?

By default the window probe is retried 15 times – adhering to the usual tcp_retries2 setting.

The TCP timer is in the "persist" state, so TCP keepalives will not be running. The SO_KEEPALIVE settings don't make any difference when window probing is engaged.

As expected, the TCP_USER_TIMEOUT toggle keeps on working. A slight difference is that, similarly to the keepalive case, it's engaged only when the retransmission timer fires. During such an event, if more than user-timeout seconds have passed since the last good packet, the connection will be aborted.

Note about using application timeouts

In the past we have shared an interesting war story:

Our HTTP server gave up on the connection after an application-managed timeout fired. This was a bug – a slow connection might have been correctly, if slowly, draining the send buffer, but the application server didn't notice that.

We abruptly dropped slow downloads, even though this wasn’t our intention. We just wanted to make sure the client connection was still healthy. It would be better to use TCP_USER_TIMEOUT than rely on application-managed timeouts.

But this is not sufficient. We also wanted to guard against a situation where a client stream is valid, but is stuck and doesn’t drain the connection. The only way to achieve this is to periodically check the amount of unsent data in the send buffer, and see if it shrinks at a desired pace.

For typical applications sending data to the Internet, I would recommend:

  1. Enable TCP keepalives. This is needed to keep some data flowing in the idle-connection case.

  2. Set TCP_USER_TIMEOUT to TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT.

  3. Be careful when using application-managed timeouts. To detect TCP failures use TCP keepalives and user-timeout. If you want to spare resources and make sure sockets don’t stay alive for too long, consider periodically checking if the socket is draining at the desired pace. You can use ioctl(TIOCOUTQ) for that, but it counts both data buffered (notsent) on the socket and in-flight (unacknowledged) bytes. A better way is to use TCP_INFO tcpi_notsent_bytes parameter, which reports only the former counter.

An example of checking the draining pace:

while True:
    notsent1 = get_tcp_info(c).tcpi_notsent_bytes
    notsent1_ts = time.time()
    ...
    poll.poll(POLL_PERIOD)
    ...
    notsent2 = get_tcp_info(c).tcpi_notsent_bytes
    notsent2_ts = time.time()
    pace_in_bytes_per_second = (notsent1 - notsent2) / (notsent2_ts - notsent1_ts)
    if pace_in_bytes_per_second > 12000:
        pass  # pace is above effective rate of 96Kbps, ok!
    else:
        pass  # socket is too slow...
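
The get_tcp_info() helper above is left abstract. In C, the same counter can be read with getsockopt(TCP_INFO) – a minimal sketch (the tcpi_notsent_bytes field requires a reasonably recent kernel):

#include <linux/tcp.h>      /* struct tcp_info with tcpi_notsent_bytes */
#include <netinet/in.h>     /* IPPROTO_TCP */
#include <stdint.h>
#include <sys/socket.h>

/* Bytes queued in the send buffer that the kernel has not yet put
 * on the wire. Returns 0 on error. */
uint32_t notsent_bytes(int fd)
{
    struct tcp_info info = {0};
    socklen_t len = sizeof(info);
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0)
        return 0;
    return info.tcpi_notsent_bytes;
}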

There are ways to further improve this logic. We could use TCP_NOTSENT_LOWAT, although it’s generally only useful for situations where the send buffer is relatively empty. Then we could use the SO_TIMESTAMPING interface for notifications about when data gets delivered. Finally, if we are done sending the data to the socket, it’s possible to just call close() and defer handling of the socket to the operating system. Such a socket will be stuck in FIN-WAIT-1 or LAST-ACK state until it correctly drains.

Summary

In this post we discussed five cases where the TCP connection may notice the other party going away:

  • SYN-SENT: The duration of this state can be controlled by TCP_SYNCNT or tcp_syn_retries.
  • SYN-RECV: It’s usually hidden from the application. It is tuned by tcp_synack_retries.
  • Idling ESTABLISHED connection: will never notice any issues on its own. A solution is to use TCP keepalives.
  • Busy ESTABLISHED connection: adheres to the tcp_retries2 setting, and ignores TCP keepalives.
  • Zero-window ESTABLISHED connection: adheres to the tcp_retries2 setting, and ignores TCP keepalives.

Especially the last two ESTABLISHED cases can be customized with TCP_USER_TIMEOUT, but this setting also affects other situations. Generally speaking, it can be thought of as a hint to the kernel to abort the connection after so-many seconds since the last good packet. This is a dangerous setting though, and if used in conjunction with TCP keepalives should be set to a value slightly lower than TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT. Otherwise it will affect, and potentially cancel out, the TCP_KEEPCNT value.

In this post we presented scripts showing the effects of timeout-related socket options under various network conditions. Interleaving the tcpdump packet capture with the output of ss -o is a great way of understanding the networking stack. We were able to create reproducible test cases showing the "on", "keepalive" and "persist" timers in action. This is a very useful framework for further experimentation.

Finally, it’s surprisingly hard to tune a TCP connection to be confident that the remote host is actually up. During our debugging we found that looking at the send buffer size and currently active TCP timer can be very helpful in understanding whether the socket is actually healthy. The bug in our Spectrum application turned out to be a wrong TCP_USER_TIMEOUT setting – without it sockets with large send buffers were lingering around for way longer than we intended.

The scripts used in this article can be found on our Github.

Figuring this out has been a collaboration across three Cloudflare offices. Thanks to Hiren Panchasara from San Jose, Warren Nelson from Austin and Jakub Sitnicki from Warsaw. Fancy joining the team? Apply here!

A gentle introduction to Linux Kernel fuzzing

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/a-gentle-introduction-to-linux-kernel-fuzzing/

A gentle introduction to Linux Kernel fuzzing

For some time I’ve wanted to play with coverage-guided fuzzing. Fuzzing is a powerful testing technique where an automated program feeds semi-random inputs to a tested program. The intention is to find such inputs that trigger bugs. Fuzzing is especially useful in finding memory corruption bugs in C or C++ programs.

A gentle introduction to Linux Kernel fuzzing

Image by Patrick Shannon CC BY 2.0

Normally it’s recommended to pick a well known, but little explored, library that is heavy on parsing. Historically things like libjpeg, libpng and libyaml were perfect targets. Nowadays it’s harder to find a good target – everything seems to have been fuzzed to death already. That’s a good thing! I guess the software is getting better! Instead of choosing a userspace target I decided to have a go at the Linux Kernel netlink machinery.

Netlink is an internal Linux facility used by tools like "ss", "ip", "netstat". It’s used for low level networking tasks – configuring network interfaces, IP addresses, routing tables and such. It’s a good target: it’s an obscure part of the kernel, and it’s relatively easy to automatically craft valid messages. Most importantly, we can learn a lot about Linux internals in the process. Bugs in netlink aren’t going to have security impact though – netlink sockets usually require privileged access anyway.

In this post we’ll run AFL fuzzer, driving our netlink shim program against a custom Linux kernel. All of this running inside KVM virtualization.

This blog post is a tutorial. With easy-to-follow instructions, you should be able to quickly replicate the results. All you need is a machine running Linux and 20 minutes.

Prior work

The technique we are going to use is formally called "coverage-guided fuzzing". There’s a lot of prior literature:

Many people have fuzzed the Linux Kernel in the past. Most importantly:

  • syzkaller (aka syzbot) by Dmitry Vyukov, is a very powerful CI-style continuously running kernel fuzzer, which found hundreds of issues already. It’s an awesome machine – it will even report the bugs automatically!
  • Trinity fuzzer

We’ll use the AFL, everyone’s favorite fuzzer. AFL was written by Michał Zalewski. It’s well known for its ease of use, speed and very good mutation logic. It’s a perfect choice for people starting their journey into fuzzing!

If you want to read more about AFL, the documentation is in a couple of files:

Coverage-guided fuzzing

Coverage-guided fuzzing works on the principle of a feedback loop:

  • the fuzzer picks the most promising test case
  • the fuzzer mutates the test into a large number of new test cases
  • the target code runs the mutated test cases, and reports back code coverage
  • the fuzzer computes a score from the reported coverage, and uses it to prioritize the interesting mutated tests and remove the redundant ones

For example, let’s say the input test is "hello". The fuzzer may mutate it into a number of tests, for example: "hEllo" (bit flip), "hXello" (byte insertion), "hllo" (byte deletion). If any of these tests yields interesting code coverage, it will be prioritized and used as a base for the next generation of tests.

The specifics of how mutations are done, and how to efficiently compare code coverage reports of thousands of program runs, are the fuzzer’s secret sauce. Read AFL’s technical whitepaper for the nitty-gritty details.

The code coverage reported back from the binary is very important. It allows the fuzzer to order the test cases, and identify the most promising ones. Without code coverage the fuzzer is blind.

Normally, when using AFL, we are required to instrument the target code so that coverage is reported in an AFL-compatible way. But we want to fuzz the kernel! We can’t just recompile it with "afl-gcc"! Instead we’ll use a trick. We’ll prepare a binary that will trick AFL into thinking it was compiled with its tooling. This binary will report back the code coverage extracted from the kernel.

Kernel code coverage

The kernel has at least two built-in coverage mechanisms – GCOV and KCOV:

KCOV was designed with fuzzing in mind, so we’ll use this.

Using KCOV is pretty easy. We must compile the Linux kernel with the right setting. First, enable the KCOV kernel config option:

cd linux
./scripts/config \
    -e KCOV \
    -d KCOV_INSTRUMENT_ALL

KCOV is capable of recording code coverage for the whole kernel. That can be enabled with the KCOV_INSTRUMENT_ALL option. This has disadvantages though – it would slow down the parts of the kernel we don’t want to profile, and would introduce noise into our measurements (reduce "stability"). For starters, let’s disable KCOV_INSTRUMENT_ALL and enable KCOV selectively on the code we actually want to profile. Today we focus on the netlink machinery, so let’s enable KCOV on the whole "net" directory tree:

find net -name Makefile | xargs -L1 -I {} bash -c 'echo "KCOV_INSTRUMENT := y" >> {}'

In a perfect world we would enable KCOV only for a couple of files we are really interested in. But netlink handling is peppered all over the network stack code, and we don’t have time for fine-tuning it today.

With KCOV in place, it’s worth adding the "kernel hacking" toggles that will increase the likelihood of reporting memory corruption bugs. See the README for the list of Syzkaller-suggested options – most importantly KASAN.

With that set we can compile our KCOV and KASAN enabled kernel. Oh, one more thing. We are going to run the kernel under KVM virtualization. We’re going to use "virtme", so we need a couple of toggles:

./scripts/config \
    -e VIRTIO -e VIRTIO_PCI -e NET_9P -e NET_9P_VIRTIO -e 9P_FS \
    -e VIRTIO_NET -e VIRTIO_CONSOLE  -e DEVTMPFS ...

(see the README for full list)

How to use KCOV

KCOV is super easy to use. First, note the code coverage is recorded in a per-process data structure. This means you have to enable and disable KCOV within a userspace process, and it’s impossible to record coverage for non-task things, like interrupt handling. This is totally fine for our needs.

KCOV reports data in a ring buffer. Setting it up is pretty simple, see our code. Then you can enable and disable it with a trivial ioctl:

ioctl(kcov_fd, KCOV_ENABLE, KCOV_TRACE_PC);
/* profiled code */
ioctl(kcov_fd, KCOV_DISABLE, 0);

After such a sequence the ring buffer will contain a list of %rip values of all taken basic blocks for KCOV-enabled kernel code. To read the buffer just run:

n = __atomic_load_n(&kcov_ring[0], __ATOMIC_RELAXED);
for (i = 0; i < n; i++) {
    printf("0x%lx\n", kcov_ring[i + 1]);
}

With tools like addr2line it’s possible to resolve the %rip to the specific line of code. We won’t need it though – the raw %rip values are sufficient for us.
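
For completeness, here is a rough sketch of the setup step we glossed over, following the kernel’s KCOV documentation (the COVER_SIZE value is an arbitrary choice for the number of recorded entries):

#include <fcntl.h>
#include <linux/kcov.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define COVER_SIZE (64 * 1024)

/* Open the KCOV device and mmap the shared ring buffer. */
int kcov_fd = open("/sys/kernel/debug/kcov", O_RDWR);
ioctl(kcov_fd, KCOV_INIT_TRACE, COVER_SIZE);
unsigned long *kcov_ring = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                                PROT_READ | PROT_WRITE, MAP_SHARED, kcov_fd, 0);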

Feeding KCOV into AFL

The next step in our journey is to learn how to trick AFL. Remember, AFL needs a specially-crafted executable, but we want to feed in the kernel code coverage. First we need to understand how AFL works.

AFL sets up an array of 64K 8-bit numbers. This memory region is called "shared_mem" or "trace_bits" and is shared with the traced program. Every byte in the array can be thought of as a hit counter for a particular (branch_src, branch_dst) pair in the instrumented code.

It’s important to notice that AFL prefers random branch labels, rather than reusing the %rip value to identify the basic blocks. This is to increase entropy – we want our hit counters in the array to be uniformly distributed. The algorithm AFL uses is:

cur_location = <COMPILE_TIME_RANDOM>;
shared_mem[cur_location ^ prev_location]++; 
prev_location = cur_location >> 1;

In our case with KCOV we don’t have compile-time-random values for each branch. Instead we’ll use a hash function to generate a uniform 16 bit number from %rip recorded by KCOV. This is how to feed a KCOV report into the AFL "shared_mem" array:

n = __atomic_load_n(&kcov_ring[0], __ATOMIC_RELAXED);
uint16_t prev_location = 0;
for (i = 0; i < n; i++) {
        uint16_t cur_location = hash_function(kcov_ring[i + 1]);
        shared_mem[cur_location ^ prev_location]++;
        prev_location = cur_location >> 1;
}
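
The hash_function() above is left unspecified – any mixer that spreads the recorded %rip values roughly uniformly over 16 bits will do. One possible choice is a simple Fibonacci (multiplicative) hash:

#include <stdint.h>

/* Map a 64-bit %rip value onto a 16-bit bucket index. The constant is
 * the usual 64-bit golden-ratio multiplier used for Fibonacci hashing. */
static uint16_t hash_function(uint64_t rip)
{
    return (uint16_t)((rip * 0x9E3779B97F4A7C15ULL) >> 48);
}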

Reading test data from AFL

Finally, we need to actually write the test code hammering the kernel netlink interface! First we need to read input data from AFL. By default AFL sends a test case to stdin:

/* read AFL test data */
char buf[512*1024];
int buf_len = read(0, buf, sizeof(buf));

Then we need to send this buffer into a netlink socket. But we know nothing about how netlink works! Okay, let’s use the first 5 bytes of input as the netlink protocol and group id fields. This will allow AFL to figure out and guess the correct values of these fields. The code testing netlink (simplified):

netlink_fd = socket(AF_NETLINK, SOCK_RAW | SOCK_NONBLOCK, buf[0]);

struct sockaddr_nl sa = {
        .nl_family = AF_NETLINK,
        .nl_groups = (buf[1] <<24) | (buf[2]<<16) | (buf[3]<<8) | buf[4],
};

bind(netlink_fd, (struct sockaddr *) &sa, sizeof(sa));

struct iovec iov = { &buf[5], buf_len - 5 };
struct sockaddr_nl sax = {
      .nl_family = AF_NETLINK,
};

struct msghdr msg = { &sax, sizeof(sax), &iov, 1, NULL, 0, 0 };
r = sendmsg(netlink_fd, &msg, 0);
if (r != -1) {
      /* sendmsg succeeded! great I guess... */
}

That’s basically it! For speed, we will wrap this in a short loop that mimics the AFL "fork server" logic. I’ll skip the explanation here, see our code for details. The resulting code of our AFL-to-KCOV shim looks like:

forksrv_welcome()
while True:
    forksrv_cycle()
    test_data = afl_read_input()
    kcov_enable()
    /* netlink magic */
    kcov_disable()
    /* fill in shared_map with tuples recorded by kcov */
    if new_crash_in_dmesg:
         forksrv_status(1)
    else:
         forksrv_status(0)

See full source code.

How to run the custom kernel

We’re missing one important piece – how to actually run the custom kernel we’ve built. There are three options:

"native": You can totally boot the built kernel on your server and fuzz it natively. This is the fastest technique, but pretty problematic. If the fuzzing succeeds in finding a bug you will crash the machine, potentially losing the test data. Cutting the branches we sit on should be avoided.

"uml": We could configure the kernel to run as User Mode Linux. Running a UML kernel requires no privileges. The kernel just runs a user space process. UML is pretty cool, but sadly, i doesn’t support KASAN, therefore the chances of finding a memory corruption bug are reduced. Finally, UML is a pretty magic special environment – bugs found in UML may not be relevant on real environments. Interestingly, UML is used by Android network_tests framework.

"kvm": we can use kvm to run our custom kernel in a virtualized environment. This is what we’ll do.

One of the simplest ways to run a custom kernel in a KVM environment is to use "virtme" scripts. With them we can avoid having to create a dedicated disk image or partition, and just share the host file system. This is how we can run our code:

virtme-run \
    --kimg bzImage \
    --rw --pwd --memory 512M \
    --script-sh "<what to run inside kvm>" 

But hold on. We forgot about preparing input corpus data for our fuzzer!

Building the input corpus

Every fuzzer takes carefully crafted test cases as input, to bootstrap the first mutations. The test cases should be short, and cover as large a part of the code as possible. Sadly – I know nothing about netlink. How about we don’t prepare the input corpus…

Instead we can ask AFL to "figure out" what inputs make sense. This is what Michał did back in 2014 with JPEGs and it worked for him. With this in mind, here is our input corpus:

mkdir inp
echo "hello world" > inp/01.txt

Instructions on how to compile and run the whole thing are in the README.md on our GitHub. It boils down to:

virtme-run \
    --kimg bzImage \
    --rw --pwd --memory 512M \
    --script-sh "./afl-fuzz -i inp -o out -- fuzznetlink" 

With this running you will see the familiar AFL status screen:

A gentle introduction to Linux Kernel fuzzing

Further notes

That’s it. Now you have a custom hardened kernel, running a basic coverage-guided fuzzer. All inside KVM.

Was it worth the effort? Even with this basic fuzzer, and no input corpus, after a day or two the fuzzer found an interesting code path: NEIGH: BUG, double timer add, state is 8. With a more specialized fuzzer, some work on improving the "stability" metric and a decent input corpus, we could expect even better results.

If you want to learn more about what netlink sockets actually do, see a blog post by my colleague Jakub Sitnicki Multipath Routing in Linux – part 1. Then there is a good chapter about it in Linux Kernel Networking book by Rami Rosen.

In this blog post we haven’t mentioned:

  • details of AFL shared_memory setup
  • implementation of AFL persistent mode
  • how to create a network namespace to isolate the effects of weird netlink commands, and improve the "stability" AFL score
  • technique on how to read dmesg (/dev/kmsg) to find kernel crashes
  • idea to run AFL outside of KVM, for speed and stability – currently the tests aren’t stable after a crash is found

But we achieved our goal – we set up a basic, yet still useful fuzzer against a kernel. Most importantly: the same machinery can be reused to fuzz other parts of Linux subsystems – from file systems to bpf verifier.

I also learned a hard lesson: tuning fuzzers is a full time job. Proper fuzzing is definitely not as simple as starting it up and idly waiting for crashes. There is always something to improve, tune, and re-implement. A quote at the beginning of the mentioned presentation by Mateusz Jurczyk resonated with me:

"Fuzzing is easy to learn but hard to master."

Happy bug hunting!

Cloudflare architecture and how BPF eats the world

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/

Cloudflare architecture and how BPF eats the world

Recently at Netdev 0x13, the Conference on Linux Networking in Prague, I gave a short talk titled “Linux at Cloudflare”. The talk ended up being mostly about BPF. It seems, no matter the question – BPF is the answer.

Here is a transcript of a slightly adjusted version of that talk.


Cloudflare architecture and how BPF eats the world

At Cloudflare we run Linux on our servers. We operate two categories of data centers: large “Core” data centers, processing logs, analyzing attacks, computing analytics, and the “Edge” server fleet, delivering customer content from 180 locations across the world.

In this talk, we will focus on the “Edge” servers. It’s here where we use the newest Linux features, optimize for performance and care deeply about DoS resilience.


Cloudflare architecture and how BPF eats the world

Our edge service is special due to our network configuration – we are extensively using anycast routing. Anycast means that the same set of IP addresses are announced by all our data centers.

This design has great advantages. First, it guarantees the optimal speed for end users. No matter where you are located, you will always reach the closest data center. Then, anycast helps us to spread out DoS traffic. During attacks each of the locations receives a small fraction of the total traffic, making it easier to ingest and filter out unwanted traffic.


Cloudflare architecture and how BPF eats the world

Anycast allows us to keep the networking setup uniform across all edge data centers. We applied the same design inside our data centers – our software stack is uniform across the edge servers. All software pieces are running on all the servers.

In principle, every machine can handle every task – and we run many diverse and demanding tasks. We have a full HTTP stack, the magical Cloudflare Workers, two sets of DNS servers – authoritative and resolver, and many other publicly facing applications like Spectrum and Warp.

Even though every server has all the software running, requests typically cross many machines on their journey through the stack. For example, an HTTP request might be handled by a different machine during each of the 5 stages of the processing.


Cloudflare architecture and how BPF eats the world

Let me walk you through the early stages of inbound packet processing:

(1) First, the packets hit our router. The router does ECMP, and forwards packets onto our Linux servers. We use ECMP to spread each target IP across many, at least 16, machines. This is used as a rudimentary load balancing technique.

(2) On the servers we ingest packets with XDP eBPF. In XDP we perform two stages. First, we run volumetric DoS mitigations, dropping packets belonging to very large layer 3 attacks.

(3) Then, still in XDP, we perform layer 4 load balancing. All the non-attack packets are redirected across the machines. This is used to work around the ECMP problems, gives us fine-granularity load balancing and allows us to gracefully take servers out of service.

(4) Following the redirection the packets reach a designated machine. At this point they are ingested by the normal Linux networking stack, go through the usual iptables firewall, and are dispatched to an appropriate network socket.

(5) Finally packets are received by an application. For example HTTP connections are handled by a “protocol” server, responsible for performing TLS encryption and processing HTTP, HTTP/2 and QUIC protocols.

It’s in these early phases of request processing where we use the coolest new Linux features. We can group useful modern functionalities into three categories:

  • DoS handling
  • Load balancing
  • Socket dispatch


Cloudflare architecture and how BPF eats the world

Let’s discuss DoS handling in more detail. As mentioned earlier, the first step after ECMP routing is Linux’s XDP stack where, among other things, we run DoS mitigations.

Historically our mitigations for volumetric attacks were expressed in classic BPF and iptables-style grammar. Recently we adapted them to execute in the XDP eBPF context, which turned out to be surprisingly hard. Read on about our adventures:

During this project we encountered a number of eBPF/XDP limitations. One of them was the lack of concurrency primitives. It was very hard to implement things like race-free token buckets. Later we found that Facebook engineer Julia Kartseva had the same issues. In February this problem was addressed with the introduction of the bpf_spin_lock helper.


Cloudflare architecture and how BPF eats the world

While our modern volumetric DoS defenses are done in the XDP layer, we still rely on iptables for application layer 7 mitigations. Here, the higher level firewall features are useful: connlimit, hashlimits and ipsets. We also use the xt_bpf iptables module to run cBPF in iptables to match on packet payloads. We talked about this in the past:


Cloudflare architecture and how BPF eats the world

After XDP and iptables, we have one final kernel side DoS defense layer.

Consider a situation when our UDP mitigations fail. In such a case we might be left with a flood of packets hitting our application’s UDP socket. This might overflow the socket buffer, causing packet loss. This is problematic – both good and bad packets will be dropped indiscriminately. For applications like DNS it’s catastrophic. In the past, to reduce the harm, we ran one UDP socket per IP address. An unmitigated flood was bad, but at least it didn’t affect the traffic to other server IP addresses.

Nowadays that architecture is no longer suitable. We are running more than 30,000 DNS IPs, and running that number of UDP sockets is not optimal. Our modern solution is to run a single UDP socket with a complex eBPF socket filter on it – using the SO_ATTACH_BPF socket option. We talked about running eBPF on network sockets in past blog posts:

The mentioned eBPF program rate limits the packets. It keeps the state – packet counts – in an eBPF map. We can be sure that a single flooded IP won’t affect other traffic. This works well, though during work on this project we found a rather worrying bug in the eBPF verifier:

I guess running eBPF on a UDP socket is not a common thing to do.
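
For reference, attaching an already-loaded socket-filter program boils down to a single setsockopt() call. A minimal sketch, assuming prog_fd was obtained earlier from a bpf(BPF_PROG_LOAD, …) call with BPF_PROG_TYPE_SOCKET_FILTER – the filter logic itself is omitted:

#include <netinet/in.h>
#include <sys/socket.h>

/* 'prog_fd' is the fd of a loaded socket-filter eBPF program;
 * loading it is out of scope for this sketch. */
int attach_filter_to_udp_socket(int prog_fd)
{
    int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (udp_fd < 0)
        return -1;
    if (setsockopt(udp_fd, SOL_SOCKET, SO_ATTACH_BPF,
                   &prog_fd, sizeof(prog_fd)) != 0)
        return -1;
    return udp_fd;
}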


Cloudflare architecture and how BPF eats the world

Apart from DoS handling, in XDP we also run a layer 4 load balancer. This is a new project, and we haven’t talked much about it yet. Without getting into many details: in certain situations we need to perform a socket lookup from XDP.

The problem is relatively simple – our code needs to look up the “socket” kernel structure for a 5-tuple extracted from a packet. This is generally easy – there is a bpf_sk_lookup helper available for this. Unsurprisingly, there were some complications. One problem was the inability to verify if a received ACK packet was a valid part of a three-way handshake when SYN-cookies are enabled. My colleague Lorenz Bauer is working on adding support for this corner case.


Cloudflare architecture and how BPF eats the world

After DoS and the load balancing layers, the packets are passed onto the usual Linux TCP / UDP stack. Here we do a socket dispatch – for example packets going to port 53 are passed onto a socket belonging to our DNS server.

We do our best to use vanilla Linux features, but things get complex when you use thousands of IP addresses on the servers.

Convincing Linux to route packets correctly is relatively easy with the "AnyIP" trick. Ensuring packets are dispatched to the right application is another matter. Unfortunately, standard Linux socket dispatch logic is not flexible enough for our needs. For popular ports like TCP/80 we want to share the port between multiple applications, each handling it on a different IP range. Linux doesn’t support this out of the box. You can call bind() either on a specific IP address or on all IPs (with 0.0.0.0).


Cloudflare architecture and how BPF eats the world

In order to fix this, we developed a custom kernel patch which adds a SO_BINDTOPREFIX socket option. As the name suggests – it allows us to call bind() on a selected IP prefix. This solves the problem of multiple applications sharing popular ports like 53 or 80.

Then we run into another problem. For our Spectrum product we need to listen on all 65535 ports. Running so many listen sockets is not a good idea (see our old war story blog), so we had to find another way. After some experiments we learned to utilize an obscure iptables module – TPROXY – for this purpose. Read about it here:

This setup is working, but we don’t like the extra firewall rules. We are working on solving this problem correctly – actually extending the socket dispatch logic. You guessed it – we want to extend socket dispatch logic by utilizing eBPF. Expect some patches from us.


Cloudflare architecture and how BPF eats the world

Then there is a way to use eBPF to improve applications. Recently we got excited about doing TCP splicing with SOCKMAP:

This technique has a great potential for improving tail latency across many pieces of our software stack. The current SOCKMAP implementation is not quite ready for prime time yet, but the potential is vast.

Similarly, the new TCP-BPF aka BPF_SOCK_OPS hooks provide a great way of inspecting performance parameters of TCP flows. This functionality is super useful for our performance team.


Cloudflare architecture and how BPF eats the world

Some Linux features didn’t age well and we need to work around them. For example, we are hitting limitations of networking metrics. Don’t get me wrong – the networking metrics are awesome, but sadly they are not granular enough. Things like TcpExtListenDrops and TcpExtListenOverflows are reported as global counters, while we need to know them on a per-application basis.

Our solution is to use eBPF probes to extract the numbers directly from the kernel. My colleague Ivan Babrou wrote a Prometheus metrics exporter called “ebpf_exporter” to facilitate this. Read on:

With “ebpf_exporter” we can generate all manner of detailed metrics. It is very powerful and saved us on many occasions.


Cloudflare architecture and how BPF eats the world

In this talk we discussed six layers of BPF running on our edge servers:

  • Volumetric DoS mitigations are running on XDP eBPF
  • Iptables xt_bpf cBPF for application-layer attacks
  • SO_ATTACH_BPF for rate limits on UDP sockets
  • Load balancer, running on XDP
  • eBPFs running application helpers like SOCKMAP for TCP socket splicing, and TCP-BPF for TCP measurements
  • “ebpf_exporter” for granular metrics

And we’re just getting started! Soon we will be doing more with eBPF based socket dispatch, eBPF running on the Linux TC (Traffic Control) layer and more integration with cgroup eBPF hooks. Then, our SRE team maintains an ever-growing list of BCC scripts useful for debugging.

It feels like Linux has stopped developing new APIs, and all the new features are implemented as eBPF hooks and helpers. This is fine and it has strong advantages. It’s easier and safer to upgrade an eBPF program than to recompile a kernel module. Some things like TCP-BPF, exposing high-volume performance tracing data, would probably be impossible without eBPF.

Some say “software is eating the world”, I would say that: “BPF is eating the software”.