All posts by Marek Majkowski

Virtual networking 101: Bridging the gap to understanding TAP

Post Syndicated from Marek Majkowski original http://blog.cloudflare.com/virtual-networking-101-understanding-tap/

It's a never-ending effort to improve the performance of our infrastructure. As part of that quest, we wanted to squeeze as much network oomph as possible from our virtual machines. Internally for some projects we use Firecracker, which is a KVM-based virtual machine manager (VMM) that runs light-weight “Micro-VM”s. Each Firecracker instance uses a tap device to communicate with the host system. Not knowing much about tap, I had to up my game. However, it wasn't easy — the documentation is messy and spread across the Internet.

Here are the notes that I wish someone had passed me when I started out on this journey!

A tap device is a virtual network interface that looks like an Ethernet network card. Instead of having real wires plugged into it, it exposes a nice handy file descriptor to an application willing to send/receive packets. Historically tap devices were mostly used to implement VPN clients. The machine would route traffic towards a tap interface, and a VPN client application would pick the packets up and process them accordingly. For example this is what our Cloudflare WARP Linux client does. Here's how it looks on my laptop:

$ ip link list
...
18: CloudflareWARP: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1280 qdisc mq state UNKNOWN mode DEFAULT group default qlen 500
	link/none

$ ip tuntap list
CloudflareWARP: tun multi_queue

More recently tap devices started to be used by virtual machines to enable networking. The VMM (like Qemu, Firecracker, or gVisor) would open the application side of a tap and pass all the packets to the guest VM. The tap network interface would be left for the host kernel to deal with. Typically, a host would behave like a router and firewall, forwarding or NATing all the packets. This design is somewhat surprising – it's almost reversing the original use case for tap. In the VPN days tap was a traffic destination. With a VM behind it, tap looks like a traffic source.

A Linux tap device is a mean creature. It looks trivial — a virtual network interface, with a file descriptor behind it. However, it's surprisingly hard to get it to perform well. The Linux networking stack is optimized for packets handled by a physical network card, not a userspace application. However, over the years the Linux tap interface grew in features and nowadays, it's possible to get good performance out of it. Later I'll explain how to use the Linux tap API in a modern way.

[Header illustration. Source: DALL-E]

To tun or to tap?

The interface is called "the universal tun/tap" in the kernel. The "tun" variant, accessible via the IFF_TUN flag, looks like a point-to-point link. There are no L2 Ethernet headers. Since most modern networks are Ethernet, this is a bit less intuitive to set up for a novice user. Most importantly, projects like Firecracker and gVisor do expect L2 headers.

"Tap", with the IFF_TAP flag, is the one which has Ethernet headers, and has been getting all the attention lately. If you are like me and always forget which one is which, you can use this  AI-generated rhyme (check out WorkersAI/LLama) to help to remember:

Tap is like a switch,
Ethernet headers it'll hitch.
Tun is like a tunnel,
VPN connections it'll funnel.
Ethernet headers it won't hold,
Tap uses, tun does not, we're told.

Listing devices

Tun/tap devices are natively supported by iproute2 tooling. Typically, one creates a device with ip tuntap add and lists it with ip tuntap list:

$ sudo ip tuntap add mode tap user marek group marek name tap0
$ ip tuntap list
tap0: tap persist user 1000 group 1000

Alternatively, it's possible to look for the /sys/devices/virtual/net/<ifr_name>/tun_flags files.

Tap device setup

To open or create a new device, you first need to open /dev/net/tun which is called a "clone device":

    /* First, whatever you do, the device /dev/net/tun must be
     * opened read/write. That device is also called the clone
     * device, because it's used as a starting point for the
     * creation of any tun/tap virtual interface. */
    char *clone_dev_name = "/dev/net/tun";
    int tap_fd = open(clone_dev_name, O_RDWR | O_CLOEXEC);
    if (tap_fd < 0) {
        error(-1, errno, "open(%s)", clone_dev_name);
    }

With the clone device file descriptor we can now instantiate a specific tap device by name:

    struct ifreq ifr = {};
    strncpy(ifr.ifr_name, tap_name, IFNAMSIZ);
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
    int r = ioctl(tap_fd, TUNSETIFF, &ifr);
    if (r != 0) {
        error(-1, errno, "ioctl(TUNSETIFF)");
    }

If ifr_name is empty or names a device that doesn't exist, a new tap device is created. Otherwise, an existing device is opened. When opening existing devices, flags like IFF_MULTI_QUEUE must match the way the device was created, or EINVAL is returned. It's a good idea to retry opening the device with a flipped multi queue setting on an EINVAL error.
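
For example, a minimal sketch of that retry (reusing tap_fd and tap_name from the snippets above):

    struct ifreq ifr = {};
    strncpy(ifr.ifr_name, tap_name, IFNAMSIZ);
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
    int r = ioctl(tap_fd, TUNSETIFF, &ifr);
    if (r != 0 && errno == EINVAL) {
        /* The device may already exist with the opposite multi queue
         * setting; flip the bit and try again. */
        ifr.ifr_flags ^= IFF_MULTI_QUEUE;
        r = ioctl(tap_fd, TUNSETIFF, &ifr);
    }
    if (r != 0) {
        error(-1, errno, "ioctl(TUNSETIFF)");
    }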

The ifr_flags can have the following bits set:

IFF_TAP / IFF_TUN

Already discussed.

IFF_NO_CARRIER

Holding an open tap device file descriptor sets the Ethernet interface CARRIER flag up. In some cases it might be desired to delay that until a TUNSETCARRIER call.

IFF_NO_PI

Historically each packet on tap had a "struct tun_pi" 4 byte prefix. There are now better alternatives and this option disables this prefix.

IFF_TUN_EXCL

Ensures a new device is created. Returns EBUSY if the device already exists.

IFF_VNET_HDR

Prepend "struct virtio_net_hdr" before the RX and TX packets, should be followed by setsockopt(TUNSETVNETHDRSZ).

IFF_MULTI_QUEUE

Use multi queue tap, see below.

IFF_NAPI / IFF_NAPI_FRAGS

See below.

You almost always want IFF_TAP, IFF_NO_PI, IFF_VNET_HDR flags and perhaps sometimes IFF_MULTI_QUEUE.

The curious IFF_NAPI

Judging by the original patchset introducing IFF_NAPI and IFF_NAPI_FRAGS, these flags were introduced to increase code coverage of syzkaller. However, later work indicates there were performance benefits when doing XDP on tap. IFF_NAPI enables a dedicated NAPI instance for packets written from an application into a tap. Besides allowing XDP, it also allows packets to be batched and GRO-ed. Otherwise, a backlog NAPI is used.

A note on buffer sizes

Internally, a tap device is just a pair of packet queues. It's exposed as a network interface towards the host, and a file descriptor, a character device, towards the application. The queue in the direction of application (tap TX queue) is of size txqueuelen packets, controlled by an interface parameter:

$ ip link set dev tap0 txqueuelen 1000
$ ip -s link show dev tap0
26: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ... qlen 1000
    RX:  bytes packets errors dropped missed   mcast
             0       0      0       0      0       0
    TX:  bytes packets errors dropped carrier collsns
           266       3      0      66       0       0

In "ip link" statistics the column "TX dropped" indicates the tap application was too slow and the queue space exhausted.

In the other direction – the interface RX queue, from the application towards the host – the queue size limit is measured in bytes and controlled by the TUNSETSNDBUF ioctl. The qemu comment discusses this setting, however it's not easy to cause this queue to overflow. See this discussion for details.
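
If that limit ever needs changing, it's a simple ioctl taking a byte count (the default, as noted in the ioctl list below, is INT_MAX, so it rarely needs touching):

    int sndbuf = 4 * 1024 * 1024;
    if (ioctl(tap_fd, TUNSETSNDBUF, &sndbuf) != 0) {
        error(-1, errno, "ioctl(TUNSETSNDBUF)");
    }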

vnethdr size

After the device is opened, a typical scenario is to set up VNET_HDR size and offloads. Typically the VNETHDRSZ should be set to 12:

    int len = 12;
    r = ioctl(tap_fd, TUNSETVNETHDRSZ, &len);
    if (r != 0) {
        error(-1, errno, "ioctl(TUNSETVNETHDRSZ)");
    }

Sensible values are {10, 12, 20}, which are derived from virtio spec. 12 bytes makes room for the following header (little endian):

struct virtio_net_hdr_v1 {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM  1    /* Use csum_start, csum_offset */
#define VIRTIO_NET_HDR_F_DATA_VALID  2    /* Csum is valid */
    u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE      0    /* Not a GSO frame */
#define VIRTIO_NET_HDR_GSO_TCPV4     1    /* GSO frame, IPv4 TCP (TSO) */
#define VIRTIO_NET_HDR_GSO_UDP       3    /* GSO frame, IPv4 UDP (UFO) */
#define VIRTIO_NET_HDR_GSO_TCPV6     4    /* GSO frame, IPv6 TCP */
#define VIRTIO_NET_HDR_GSO_UDP_L4    5    /* GSO frame, IPv4 & IPv6 UDP (USO) */
#define VIRTIO_NET_HDR_GSO_ECN       0x80 /* TCP has ECN set */
    u8 gso_type;
    u16 hdr_len;     /* Ethernet + IP + tcp/udp hdrs */
    u16 gso_size;    /* Bytes to append to hdr_len per frame */
    u16 csum_start;
    u16 csum_offset;
    u16 num_buffers;
};
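
To make this concrete, here's a rough sketch of reading one frame and peeking at the header. It assumes the device was opened with IFF_VNET_HDR and TUNSETVNETHDRSZ set to 12 as above, and uses struct virtio_net_hdr_v1 from <linux/virtio_net.h>:

    uint8_t buf[65536 + 128];    /* room for a full GSO "packet train" */
    ssize_t n = read(tap_fd, buf, sizeof(buf));
    if (n < (ssize_t)sizeof(struct virtio_net_hdr_v1)) {
        error(-1, errno, "read(tap)");
    }

    struct virtio_net_hdr_v1 *vh = (struct virtio_net_hdr_v1 *)buf;
    uint8_t *frame = buf + sizeof(*vh);      /* the Ethernet frame itself */
    size_t frame_len = n - sizeof(*vh);

    if (vh->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
        /* A packet train: re-segment using vh->gso_size if needed. */
    }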

offloads

To enable offloads use the ioctl:

    unsigned off_flags = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;
    int r = ioctl(tap_fd, TUNSETOFFLOAD, off_flags);
    if (r != 0) {
        error(-1, errno, "ioctl(TUNSETOFFLOAD)");
    }

Here are the allowed bits. Each one declares that the userspace application can receive packets with the corresponding feature:

TUN_F_CSUM

L4 packet checksum offload

TUN_F_TSO4

TCP Segmentation Offload – TSO for IPv4 packets

TUN_F_TSO6

TSO for IPv6 packets

TUN_F_TSO_ECN

TSO with ECN bits

TUN_F_UFO

UDP Fragmentation offload – UFO packets. Deprecated

TUN_F_USO4

UDP Segmentation offload – USO for IPv4 packets

TUN_F_USO6

USO for IPv6 packets

Generally, offloads are extra packet features the tap application can deal with. Details of the offloads used by the sender are set on each packet in the vnethdr prefix.

Checksum offload TUN_F_CSUM

[Figure: Structure of a typical UDP packet received over tap.]

Let's start with the checksumming offload. The TUN_F_CSUM offload saves the kernel some work by pushing the checksum processing down the path. Applications which set that flag are indicating they can handle checksum validation. For example, with this offload a UDP IPv4 packet will have:

  • vnethdr flags will have VIRTIO_NET_HDR_F_NEEDS_CSUM set
  • hdr_len would be 42 (14+20+8)
  • csum_start 34 (14+20)
  • and csum_offset 6 (UDP header checksum is 6 bytes into L4)

This is illustrated above.

Supporting checksum offload is needed for further offloads.
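
If the application needs a fully valid checksum (for example before handing the frame to something that will verify it), it can finish the job itself: sum the bytes from csum_start to the end of the packet and store the folded result at csum_start + csum_offset. A rough sketch, not taken from any particular project, assuming a little-endian host:

    static uint16_t csum_fold(uint32_t sum)
    {
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    /* 'pkt' points at the Ethernet frame (after the vnet header), 'len' is
     * its length. The partial pseudo-header checksum that the sender left at
     * csum_start + csum_offset is included in the sum, which is exactly what
     * makes the final fold come out right. */
    static void complete_csum(uint8_t *pkt, size_t len,
                              const struct virtio_net_hdr_v1 *vh)
    {
        if (!(vh->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM))
            return;
        uint8_t *p = pkt + vh->csum_start;
        size_t n = len - vh->csum_start;
        uint32_t sum = 0;
        for (size_t i = 0; i + 1 < n; i += 2)
            sum += ((uint32_t)p[i] << 8) | p[i + 1];
        if (n & 1)
            sum += (uint32_t)p[n - 1] << 8;
        uint16_t folded = csum_fold(sum);
        p[vh->csum_offset] = folded >> 8;
        p[vh->csum_offset + 1] = folded & 0xff;
    }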

TUN_F_CSUM is a must

Consider this code:

from socket import socket, AF_INET, SOCK_DGRAM, SOL_UDP
UDP_SEGMENT = 103                      # from <linux/udp.h>

s = socket(AF_INET, SOCK_DGRAM)
s.setsockopt(SOL_UDP, UDP_SEGMENT, 1400)
s.sendto(b"x", ("10.0.0.2", 5201))     # Would you expect EIO ?

This simple code produces a packet. When directed at a tap device, this code will surprisingly yield an EIO "Input/output error". This weird behavior happens if the tap is opened without TUN_F_CSUM and the application is sending GSO / UDP_SEGMENT frames. Tough luck. It might be considered a kernel bug, and we're thinking about fixing that. However, in the meantime everyone using tap should just set the TUN_F_CSUM bit.

Segmentation offloads

We wrote about UDP_SEGMENT in the past. In short: on Linux an application can handle many packets with a single send/recv, as long as they have identical length.

[Figure: With UDP_SEGMENT a single send() can transfer multiple packets.]

Tap devices support offloading which exposes that very functionality. With the TUN_F_TSO4 and TUN_F_TSO6 flags the tap application signals it can deal with long packet trains. Note that with these features the application must be ready to receive much larger buffers – up to 65507 bytes for IPv4 and 65527 for IPv6.

TSO4/TSO6 flags are enabling long packet trains for TCP and have been supported for a long time. More recently TUN_F_USO4 and TUN_F_USO6 bits were introduced for UDP. When any of these offloads are used, the gso_type contains the relevant offload type and gso_size holds a segment size within the GRO packet train.

TUN_F_UFO is a UDP fragmentation offload which is deprecated.

By setting TUNSETOFFLOAD, the application is telling the kernel which offloads it's able to handle on the read() side of a tap device. If the ioctl(TUNSETOFFLOAD) succeeds, the application can assume the kernel supports the same offloads for packets in the other direction.
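
For example, a sketch that asks for TSO plus USO and falls back when the kernel is too old to know the USO bits (older kernels reject unknown flags with EINVAL), using tap_fd as before:

    unsigned off_flags = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;
    if (ioctl(tap_fd, TUNSETOFFLOAD, off_flags | TUN_F_USO4 | TUN_F_USO6) != 0) {
        /* No USO support; settle for the TSO-only set. */
        if (ioctl(tap_fd, TUNSETOFFLOAD, off_flags) != 0) {
            error(-1, errno, "ioctl(TUNSETOFFLOAD)");
        }
    }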

Bug in rx-udp-gro-forwarding – TUN_F_USO4

When working with tap and offloads it's useful to inspect ethtool:

$ ethtool -k tap0 | egrep -v fixed
tx-checksumming: on
    tx-checksum-ip-generic: on
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-udp-segmentation: on
rx-gro-list: off
rx-udp-gro-forwarding: off

With ethtool we can see the enabled offloads and disable them as needed.

While toying with UDP Segmentation Offload (USO) I've noticed that when packet trains from tap are forwarded to a real network interface, sometimes they seem badly packetized. See the netdev discussion, and the proposed fix. In any case – beware of this bug, and maybe consider doing "ethtool -K tap0 rx-udp-gro-forwarding off".

Miscellaneous ioctls

TUNGETFEATURES

Returns a bitmask of the IFF_* constants that the kernel supports. Typically used to detect host support for IFF_VNET_HDR, IFF_NAPI and IFF_MULTI_QUEUE.

TUNSETIFF

Takes "struct ifreq", sets up a tap device, fills in the name if empty.

TUNGETIFF

Returns a "struct ifreq" containing the device's current name and flags.

TUNSETPERSIST

Sets TUN_PERSIST flag, if you want the device to remain in the system after the tap_fd is closed.

TUNSETOWNER, TUNSETGROUP

Set uid and gid that can own the device.

TUNSETLINK

Set the Ethernet link type for the device. The device must be down. See ARPHRD_* constants. For tap it defaults to ARPHRD_ETHER.

TUNSETOFFLOAD

As documented above.

TUNGETSNDBUF, TUNSETSNDBUF

Get/set send buffer. The default is INT_MAX.

TUNGETVNETHDRSZ, TUNSETVNETHDRSZ

Already discussed.

TUNSETIFINDEX

Set interface index (ifindex), useful in checkpoint-restore.

TUNSETCARRIER

Set the carrier state of an interface, as discussed earlier, useful with IFF_NO_CARRIER.

TUNGETDEVNETNS

Return an fd of a net namespace that the interface belongs to.

TUNSETTXFILTER

Takes "struct tun_filter" which limits the dst mac addresses that can be delivered to the application.

TUNATTACHFILTER, TUNDETACHFILTER, TUNGETFILTER

Attach/detach/get classic BPF filter for packets going to application. Takes "struct sock_fprog".

TUNSETFILTEREBPF

Set an eBPF filter on a tap device. This is independent of the classic BPF above.

TUNSETQUEUE

Used to set IFF_DETACH_QUEUE and IFF_ATTACH_QUEUE for multiqueue.

TUNSETSTEERINGEBPF

Set an eBPF program for selecting a specific tap queue, in the direction towards the application. This is useful if you want to ensure some traffic is sticky to a specific application thread. The eBPF program takes "struct __sk_buff" and returns an int; the return value, truncated to a u16 and taken modulo the number of queues, selects the queue.

Single queue speed

Tap devices are quite weird — they aren't network sockets, nor true files. Their semantics are closest to pipes, and unfortunately the API reflects that. To receive or send a packet from a tap device, the application must do a read() or write() syscall, one packet at a time.

One might think that some sort of syscall batching would help. Sockets have sendmmsg()/recvmmsg(), but that doesn't work on tap file descriptors. The typical alternatives enabling batching are the old io_submit AIO interface, or modern io_uring, which added tap support quite recently. However, it turns out syscall batching doesn't really offer that much of an improvement, maybe in the range of 10%.

The Linux kernel is just not capable of forwarding millions of packets per second for a single flow or on a single CPU. The best possible solution is to scale vertically for elephant flows with TSO/USO (packet trains) offloads, and scale horizontally for multiple concurrent flows with multi queue.

[Chart: single-core packet rate of a sample "echo" tap application, with and without offloads]

In this chart you can see how dramatic the performance gain of offloads is. Without them, a sample "echo" tap application can process between 320 and 500 thousand packets per second on a single core, with an MTU of 1500. When the offloads are enabled it jumps to 2.7Mpps, while keeping the number of received "packet trains" to just 56 thousand per second. Of course not every traffic pattern can fully utilize GRO/GSO. However, to get decent performance from tap, and from Linux in general, offloads are absolutely critical.

Multi queue considerations

Multi queue is useful when the tap application is handling multiple concurrent flows and needs to utilize more than one CPU.

To get a file descriptor of a tap queue, just add the IFF_MULTI_QUEUE flag when opening the tap. It's possible to detach/reattach a queue with TUNSETQUEUE and IFF_DETACH_QUEUE/IFF_ATTACH_QUEUE, but I'm unsure when this is useful.
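
A rough sketch of opening four queues of the same device (reusing the names from the earlier snippets):

    int queue_fd[4];
    for (int i = 0; i < 4; i++) {
        queue_fd[i] = open("/dev/net/tun", O_RDWR | O_CLOEXEC);
        if (queue_fd[i] < 0) {
            error(-1, errno, "open(/dev/net/tun)");
        }
        struct ifreq ifr = {};
        strncpy(ifr.ifr_name, "tap0", IFNAMSIZ);
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_MULTI_QUEUE;
        if (ioctl(queue_fd[i], TUNSETIFF, &ifr) != 0) {
            error(-1, errno, "ioctl(TUNSETIFF)");
        }
    }
    /* Each queue_fd can now be served by a different thread or CPU. */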

When a multi queue tap is created, it spreads the load across multiple tap queues, each one having a unique file descriptor. Beware of the algorithm selecting the queue though: it might bite you back.

By default, the Linux tap driver records a symmetric flow hash of any handled flow in a flow table. It records which queue the application used to transmit packets of that flow. Then, on the receiving side it follows that selection and sends subsequent packets of the flow to that specific queue. For example, if your userspace application is sending some TCP flow over queue #2, then the packets going into the application which are part of that flow will go to queue #2. This is generally a sensible design as long as the sender always selects one specific queue. If the sender changes the TX queue, new packets will immediately shift and packets within one flow might be seen as reordered. Additionally, this queue selection design does not take CPU locality into account and might have minor negative effects on performance for very high throughput applications.

It's possible to override the flow hash based queue selection by using tc multiq qdisc and skbedit queue_mapping filter:

tc qdisc add dev tap0 root handle 1: multiq
tc filter add dev tap0 parent 1: protocol ip prio 1 u32 \
        match ip dst 192.168.0.3 \
        action skbedit queue_mapping 0

tc is fragile and thus it's not a solution I would recommend. A better way is to customize the queue selection algorithm with a TUNSETSTEERINGEBPF eBPF program. In that case, the flow tracking code is not employed anymore. By smartly using such a steering eBPF program, it's possible to keep the flow processing local to one CPU — useful for best performance.

Summary

Now you know everything I wish I had known when I was setting out on this journey!

To get the best performance, I recommend:

  • enable vnethdr
  • enable offloads (TSO and USO)
  • consider spreading the load across multiple queues and CPUs with multi queue
  • consider syscall batching for additional gain of maybe 10%, perhaps try io_uring
  • consider customizing the steering algorithm

The day my ping took countermeasures

Post Syndicated from Marek Majkowski original http://blog.cloudflare.com/the-day-my-ping-took-countermeasures/

Once my holidays had passed, I found myself reluctantly reemerging into the world of the living. I powered on a corporate laptop, scared to check on my email inbox. However, before turning on the browser, obviously, I had to run a ping. Debugging the network is a mandatory first step after a boot, right? As expected, the network was perfectly healthy but what caught me off guard was this message:

[Screenshot: ping printing "Warning: time of day goes back (...us), taking countermeasures"]

I was not expecting ping to take countermeasures that early on in a day. Gosh, I wasn't expecting any countermeasures that Monday!

Once I got over the initial confusion, I took a deep breath and collected my thoughts. You don't have to be Sherlock Holmes to figure out what has happened. I'm really fast – I started ping before the system NTP daemon synchronized the time. In my case, the computer clock was rolled backward, confusing ping.

While this doesn't happen too often, a computer clock can be freely adjusted either forward or backward. However, it's pretty rare for a regular network utility, like ping, to try to manage a situation like this. It's even less common to call it "taking countermeasures". I would totally expect ping to just print a nonsensical time value and move on without hesitation.

Ping developers clearly put some thought into that. I wondered how far they went. Did they handle clock changes in both directions? Are the bad measurements excluded from the final statistics? How do they test the software?

I can't just walk past ping "taking countermeasures" on me. Now I have to understand what ping did and why.

Understanding ping

An investigation like this starts with a quick glance at the source code:

 *			P I N G . C
 *
 * Using the InterNet Control Message Protocol (ICMP) "ECHO" facility,
 * measure round-trip-delays and packet loss across network paths.
 *
 * Author -
 *	Mike Muuss
 *	U. S. Army Ballistic Research Laboratory
 *	December, 1983

Ping goes back a long way. It was originally written by Mike Muuss while at the U. S. Army Ballistic Research Laboratory, in 1983, before I was born. The code we're looking for is under iputils/ping/ping_common.c gather_statistics() function:

[Screenshot: the gather_statistics() code in iputils/ping/ping_common.c]

The code is straightforward: the message in question is printed when the measured RTT is negative. In this case ping resets the latency measurement to zero. Here you are: "taking countermeasures" is nothing more than just marking an erroneous measurement as if it was 0ms.

But what precisely does ping measure? Is it the wall clock? The man page comes to the rescue. Ping has two modes.

The "old", -U mode, in which it uses the wall clock. This mode is less accurate (has more jitter). It calls gettimeofday before sending and after receiving the packet.

The "new", default, mode in which it uses "network time". It calls gettimeofday before sending, and gets the receive timestamp from a more accurate SO_TIMESTAMP CMSG. More on this later.

Tracing gettimeofday is hard

Let's start with a good old strace:

$ strace -e trace=gettimeofday,time,clock_gettime -f ping -n -c1 1.1 >/dev/null
... nil ...

It doesn't show any calls to gettimeofday. What is going on?

On modern Linux some syscalls are not true syscalls. Instead of jumping to the kernel space, which is slow, they remain in userspace and go to a special code page provided by the host kernel. This code page is called vdso. It's visible as a .so library to the program:

$ ldd `which ping` | grep vds
    linux-vdso.so.1 (0x00007ffff47f9000)

Calls to the vdso region are not syscalls, they remain in userspace and are super fast, but classic strace can't see them. For debugging it would be nice to turn off vdso and fall back to classic slow syscalls. It's easier said than done.

There is no way to prevent loading of the vdso. However there are two ways to convince a loaded program not to use it.

The first technique is about fooling glibc into thinking the vdso is not loaded. This case must be handled for compatibility with ancient Linux. When bootstrapping in a freshly run process, glibc inspects the Auxiliary Vector provided by ELF loader. One of the parameters has the location of the vdso pointer, the man page gives this example:

void *vdso = (uintptr_t) getauxval(AT_SYSINFO_EHDR);

A technique proposed on Stack Overflow works like that: let's hook on a program before execve() exits and overwrite the Auxiliary Vector AT_SYSINFO_EHDR parameter. Here's the novdso.c code. However, the linked code doesn't quite work for me (one too many kill(SIGSTOP)), and has one bigger, fundamental flaw. To hook on execve() it uses ptrace(), and therefore doesn't work under our strace!

$ strace -f ./novdso ping 1.1 -c1 -n
...
[pid 69316] ptrace(PTRACE_TRACEME)  	= -1 EPERM (Operation not permitted)

While this technique of rewriting AT_SYSINFO_EHDR is pretty cool, it won't work for us. (I wonder if there is another way of doing that, but without ptrace. Maybe with some BPF? But that is another story.)

A second technique is to use LD_PRELOAD and preload a trivial library overloading the functions in question, and forcing them to go to slow real syscalls. This works fine:

$ cat vdso_override.c
#include <sys/syscall.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

int gettimeofday(struct timeval *restrict tv, void *restrict tz) {
    return syscall(__NR_gettimeofday, (long)tv, (long)tz, 0, 0, 0, 0);
}

time_t time(time_t *tloc) {
    return syscall(__NR_time, (long)tloc, 0, 0, 0, 0, 0);
}

int clock_gettime(clockid_t clockid, struct timespec *tp) {
    return syscall(__NR_clock_gettime, (long)clockid, (long)tp, 0, 0, 0, 0);
}

To load it:

$ gcc -Wall -Wextra -fpic -shared -o vdso_override.so vdso_override.c

$ LD_PRELOAD=./vdso_override.so \
       strace -e trace=gettimeofday,clock_gettime,time \
       date

clock_gettime(CLOCK_REALTIME, {tv_sec=1688656245 ...}) = 0
Thu Jul  6 05:10:45 PM CEST 2023
+++ exited with 0 +++

Hurray! We can see the clock_gettime call in strace output. Surely we'll also see gettimeofday from our ping, right?

Not so fast, it still doesn't quite work:

$ LD_PRELOAD=./vdso_override.so \
     strace -c -e trace=gettimeofday,time,clock_gettime -f \
     ping -n -c1 1.1 >/dev/null
... nil ...

To suid or not to suid

I forgot that ping might need special permissions to read and write raw packets. Historically it had a suid bit set, which granted the program an elevated user identity. However LD_PRELOAD doesn't work with suid. When a program is being loaded, the dynamic linker checks if it has the suid bit set, and if so, it ignores the LD_PRELOAD and LD_LIBRARY_PATH settings.

However, does ping need suid? Nowadays it's totally possible to send and receive ICMP Echo messages without any extra privileges, like this:

from socket import *
import struct

sd = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP)
sd.connect(('1.1', 0))

sd.send(struct.pack("!BBHHH10s", 8, 0, 0, 0, 1234, b'payload'))
data = sd.recv(1024)
print('type=%d code=%d csum=0x%x id=%d seq=%d payload=%s' % struct.unpack_from("!BBHHH10s", data))

Now you know how to write "ping" in eight lines of Python. This Linux API is known as ping socket. It generally works on modern Linux, however it requires a correct sysctl, which is typically enabled:

$ sysctl net.ipv4.ping_group_range
net.ipv4.ping_group_range = 0    2147483647

The ping socket is not as mature as UDP or TCP sockets. The "ICMP ID" field is used to dispatch an ICMP Echo Response to an appropriate socket, but when using bind() this property is settable by the user without any checks. A malicious user can deliberately cause an "ICMP ID" conflict.

But we're not here to discuss Linux networking API's. We're here to discuss the ping utility and indeed, it's using the ping sockets:

$ strace -e trace=socket -f ping 1.1 -nUc 1
socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) = 3
socket(AF_INET6, SOCK_DGRAM, IPPROTO_ICMPV6) = 4

Ping sockets are rootless, and ping, at least on my laptop, is not a suid program:

$ ls -lah `which ping`
-rwxr-xr-x 1 root root 75K Feb  5  2022 /usr/bin/ping

So why doesn't the LD_PRELOAD trick work? It turns out the ping binary holds a CAP_NET_RAW capability. Similarly to suid, this prevents the library preloading machinery from working:

$ getcap `which ping`
/usr/bin/ping cap_net_raw=ep

I think this capability is enabled only to handle the case of a misconfigured net.ipv4.ping_group_range sysctl. For me ping works perfectly fine without this capability.

Rootless is perfectly fine

Let's remove the CAP_NET_RAW and try out LD_PRELOAD hack again:

$ cp `which ping` .

$ LD_PRELOAD=./vdso_override.so strace -f ./ping -n -c1 1.1
...
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
gettimeofday({tv_sec= ... ) = 0
sendto(3, ...)
setitimer(ITIMER_REAL, {it_value={tv_sec=10}}, NULL) = 0
recvmsg(3, { ... cmsg_level=SOL_SOCKET, 
                 cmsg_type=SO_TIMESTAMP_OLD, 
                 cmsg_data={tv_sec=...}}, )

We finally made it! Without -U, in the "network timestamp" mode, ping:

  • Sets SO_TIMESTAMP flag on a socket.
  • Calls gettimeofday before sending the packet.
  • When fetching a packet, gets the timestamp from the CMSG.
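
Here's a minimal sketch of that receive path. This is not ping's actual code; it assumes fd is an already-connected ICMP socket like the one created in the Python example:

struct timeval sent, received = { 0 };
setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &(int){1}, sizeof(int));
gettimeofday(&sent, NULL);            /* ...then send the ICMP echo... */

char pkt[1500], cbuf[CMSG_SPACE(sizeof(struct timeval))];
struct iovec iov = { .iov_base = pkt, .iov_len = sizeof(pkt) };
struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                      .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
if (recvmsg(fd, &msg, 0) < 0)
    error(-1, errno, "recvmsg");

for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
    if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMP)
        memcpy(&received, CMSG_DATA(c), sizeof(received));
}
/* RTT = received - sent; if it comes out negative, the clock moved back. */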

Fault injection – fooling ping

With strace up and running we can finally do something interesting. You see, strace has a little known fault injection feature, named "tampering" in the manual:

[Screenshot: the fault injection ("tampering") section of the strace manual]

With a couple of command line parameters we can overwrite the result of the gettimeofday call. I want to set it forward to confuse ping into thinking the SO_TIMESTAMP time is in the past:

LD_PRELOAD=./vdso_override.so \
    strace -o /dev/null -e trace=gettimeofday \
            -e inject=gettimeofday:poke_exit=@arg1=ff:when=1 -f \
    ./ping -c 1 -n 1.1.1.1

PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
./ping: Warning: time of day goes back (-59995290us), taking countermeasures
./ping: Warning: time of day goes back (-59995104us), taking countermeasures
64 bytes from 1.1.1.1: icmp_seq=1 ttl=60 time=0.000 ms

--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.000/0.000/0.000/0.000 ms

It worked! We can now generate the "taking countermeasures" message reliably!

While we can cheat on the gettimeofday result, with strace it's impossible to overwrite the CMSG timestamp. Perhaps it might be possible to adjust the CMSG timestamp with Linux time namespaces, but I don't think it'll work. As far as I understand, time namespaces are not taken into account by the network stack. A program using SO_TIMESTAMP is doomed to compare it against the system clock, which might be rolled backwards.

Fool me once, fool me twice

At this point we could conclude our investigation. We're now able to reliably trigger the "taking countermeasures" message using strace fault injection.

There is one more thing though. When sending ICMP Echo Request messages, does ping remember the send timestamp in some kind of hash table? That might be wasteful considering a long-running ping sending thousands of packets.

Ping is smart, and instead puts the timestamp in the ICMP Echo Request packet payload!

Here's how the full algorithm works:

  1. Ping sets the SO_TIMESTAMP_OLD socket option to receive timestamps.
  2. It looks at the wall clock with gettimeofday.
  3. It puts the current timestamp in the first bytes of the ICMP payload.
  4. After receiving the ICMP Echo Reply packet, it inspects the two timestamps: the send timestamp from the payload and the receive timestamp from CMSG.
  5. It calculates the RTT delta.

This is pretty neat! With this algorithm, ping doesn't need to remember much, and can have an unlimited number of packets in flight! (For completeness, ping maintains a small fixed-size bitmap to account for the DUP! packets).

What if we set a packet length to be less than 16 bytes? Let's see:

$ ping 1.1 -c2 -s0
PING 1.1 (1.0.0.1) 0(28) bytes of data.
8 bytes from 1.0.0.1: icmp_seq=1 ttl=60
8 bytes from 1.0.0.1: icmp_seq=2 ttl=60
--- 1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms

In such a case ping just skips the RTT from the output. Smart!

Right… this opens two completely new subjects. While ping was written back when everyone was friendly, today’s Internet can have rogue actors. What if we spoofed responses to confuse ping? Can we cut the payload to prevent ping from producing an RTT, or spoof the timestamp to fool the RTT measurements?

Both things work! The truncated case will look like this to the sender:

$ ping 139.162.188.91
PING 139.162.188.91 (139.162.188.91) 56(84) bytes of data.
8 bytes from 139.162.188.91: icmp_seq=1 ttl=53 (truncated)

The second case, of an overwritten timestamp, is even cooler. We can move timestamp forwards causing ping to show our favorite "taking countermeasures" message:

$ ping 139.162.188.91  -c 2 -n
PING 139.162.188.91 (139.162.188.91) 56(84) bytes of data.
./ping: Warning: time of day goes back (-1677721599919015us), taking countermeasures
./ping: Warning: time of day goes back (-1677721599918907us), taking countermeasures
64 bytes from 139.162.188.91: icmp_seq=1 ttl=53 time=0.000 ms
./ping: Warning: time of day goes back (-1677721599905149us), taking countermeasures
64 bytes from 139.162.188.91: icmp_seq=2 ttl=53 time=0.000 ms

--- 139.162.188.91 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.000/0.000/0.000/0.000 ms

Alternatively we can move the time in the packet backwards causing ping to show nonsensical RTT values:

$ ./ping 139.162.188.91  -c 2 -n
PING 139.162.188.91 (139.162.188.91) 56(84) bytes of data.
64 bytes from 139.162.188.91: icmp_seq=1 ttl=53 time=1677721600430 ms
64 bytes from 139.162.188.91: icmp_seq=2 ttl=53 time=1677721600084 ms

--- 139.162.188.91 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 1677721600084.349/1677721600257.351/1677721600430.354/-9223372036854775.-808 ms

We proved that the "countermeasures" work only when time moves in one direction. In the other direction, ping is simply fooled.

Here's a rough scapy snippet that generates an ICMP Echo Response fooling ping:

# iptables -I INPUT -i eth0 -p icmp --icmp-type=8 -j DROP
import scapy.all as scapy
import struct

iface = "eth0"   # interface to sniff and reply on

def custom_action(echo_req):
    try:
        payload = bytes(echo_req[scapy.ICMP].payload)
        if len(payload) >= 8:
            # Shift the embedded send timestamp to confuse the sender.
            ts, tu = struct.unpack_from("<II", payload)
            payload = struct.pack("<II", (ts - 0x64000000) & 0xffffffff, tu) \
                      + payload[8:]

        echo_reply = scapy.IP(
            dst=echo_req[scapy.IP].src,
            src=echo_req[scapy.IP].dst,
        ) / scapy.ICMP(type=0, code=0,
                       id=echo_req[scapy.ICMP].id,
                       seq=echo_req[scapy.ICMP].seq,
        ) / payload
        scapy.send(echo_reply, iface=iface)
    except Exception:
        pass

scapy.sniff(filter="icmp and icmp[0] = 8", iface=iface, prn=custom_action)

Leap second

In practice, how often does time change on a computer? The NTP daemon adjusts the clock all the time to account for any drift. However, these are very small changes. Apart from initial clock synchronization after boot or sleep wakeup, big clock shifts shouldn't really happen.

There are exceptions as usual. Systems that operate in virtual environments or have unreliable Internet connections often experience their clocks getting out of sync.

One notable case that affects all computers is a coordinated clock adjustment called a leap second. It causes the clock to move backwards, which is particularly troublesome. An issue with handling leap second caused our engineers a headache in late 2016.

Leap seconds often cause issues, so the current consensus is to deprecate them by 2035. However, according to Wikipedia, the solution seems to be to just kick the can down the road:

A suggested possible future measure would be to let the discrepancy increase to a full minute, which would take 50 to 100 years, and then have the last minute of the day taking two minutes in a "kind of smear" with no discontinuity.

In any case, there hasn't been a leap second since 2016, there might be some in the future, but there likely won't be any after 2035. Many environments already use a leap second smear to avoid the problem of clock jumping back.

In most cases, it might be completely fine to ignore the clock changes. When possible, to count time durations use CLOCK_MONOTONIC, which is bulletproof.
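
For example, a minimal sketch of measuring a duration that is immune to wall clock adjustments:

struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
/* ... the work being timed ... */
clock_gettime(CLOCK_MONOTONIC, &end);
double elapsed_ms = (end.tv_sec - start.tv_sec) * 1e3 +
                    (end.tv_nsec - start.tv_nsec) / 1e6;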

We haven't mentioned daylight savings clock adjustments here because, from a computer perspective, they are not real clock changes! Most often programmers deal with the operating system clock, which is typically set to the UTC timezone. The DST offset is taken into account only when pretty printing the date on screen. The underlying software operates on integer values. Let's consider an example of two timestamps which, in my Warsaw timezone, fall into two different DST offsets. While it may look like the clock rolled back, this is just a user interface illusion. The integer timestamps are sequential:

$ date --date=@$[1698541199+0]
Sun Oct 29 02:59:59 AM CEST 2023

$ date --date=@$[1698541199+1]
Sun Oct 29 02:00:00 AM CET 2023

Lessons

Arguably, the clock jumping backwards is a rare occurrence. It's very hard to test for such cases, and I was surprised to find that ping made such an attempt. To avoid the problem, ping might use CLOCK_MONOTONIC to measure the latency – its developers already use this time source in another place.

Unfortunately this won't quite work here. Ping needs to compare the send timestamp to the receive timestamp from the SO_TIMESTAMP CMSG, which uses the non-monotonic system clock. Linux APIs are sometimes limited, and dealing with time is hard. For the time being, clock adjustments will continue to confuse ping.

In any case, now we know what to do when ping is "taking countermeasures"! Pull down your periscope and check the NTP daemon status!

Cloudflare servers don’t own IPs anymore – so how do they connect to the Internet?

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/

A lot of Cloudflare’s technology is well documented. For example, how we handle traffic between the eyeballs (clients) and our servers has been discussed many times on this blog: “A brief primer on anycast (2011)”, “Load Balancing without Load Balancers (2013)“, “Path MTU discovery in practice (2015)“,  “Cloudflare’s edge load balancer (2020)“, “How we fixed the BSD socket API (2022)“.

However, we have rarely talked about the second part of our networking setup — how our servers fetch content from the Internet. In this blog we’re going to fill that gap. We’ll discuss how we manage the Cloudflare IP addresses used to retrieve data from the Internet, how our egress network design has evolved, and how we optimized it for the best use of available IP space.

Brace yourself. We have a lot to cover.

Terminology first!

Each Cloudflare server deals with many kinds of networking traffic, but two rough categories stand out:

  • Internet sourced traffic – Inbound connections initiated by eyeball to our servers. In the context of this blog post we’ll call these “ingress connections”.
  • Cloudflare sourced traffic – Outgoing connections initiated by our servers to other hosts on the Internet. For brevity, we’ll call these “egress connections”.

The egress part, while rarely discussed on this blog, is critical for our operation. Our servers must initiate outgoing connections to get their jobs done! Like:

  • In our CDN product, before the content is cached, it’s fetched from the origin servers. See “Pingora, the proxy that connects Cloudflare to the Internet (2022)“, Argo and Tiered Cache.
  • For the Spectrum product, each ingress TCP connection results in one egress connection.
  • Workers often run multiple subrequests to construct an HTTP response. Some of them might be querying servers to the Internet.
  • We also operate client-facing forward proxy products – like WARP and Teams. These proxies deal with eyeball connections destined to the Internet. Our servers need to establish connections to the Internet on behalf of our users.

And so on.

Anycast on ingress, unicast on egress

Our ingress network architecture is very different from the egress one. On ingress, the connections sourced from the Internet are handled exclusively by our anycast IP ranges. Anycast is a technology where each of our data centers “announces” and can handle the same IP ranges. With many destinations possible, how does the Internet know where to route the packets? Well, the eyeball packets are routed towards the closest data center based on Internet BGP metrics; often this is also geographically the closest one. Usually, the BGP routes don’t change much, and each eyeball IP can be expected to be routed to a single data center.

However, while anycast works well in the ingress direction, it can’t operate on egress. Establishing an outgoing connection from an anycast IP won’t work. Consider the response packet. It’s likely to be routed back to a wrong place – a data center geographically closest to the sender, not necessarily the source data center!

For this reason, until recently, we established outgoing connections in a straightforward and conventional way: each server was given its own unicast IP address. “Unicast IP” means there is only one server using that address in the world. Return packets will work just fine and get back exactly to the right server identified by the unicast IP.

Segmenting traffic based on egress IP

Originally connections sourced by Cloudflare were mostly HTTP fetches going to origin servers on the Internet. As our product line grew, so did the variety of traffic. The most notable example is our WARP app. For WARP, our servers operate a forward proxy, and handle the traffic sourced by end-user devices. It’s done without the same degree of intermediation as in our CDN product. This creates a problem. Third party servers on the Internet — like the origin servers — must be able to distinguish between connections coming from Cloudflare services and our WARP users. Such traffic segmentation is traditionally done by using different IP ranges for different traffic types (although recently we introduced more robust techniques like Authenticated Origin Pulls).

To work around the trusted vs untrusted traffic pool differentiation problem, we added an untrusted WARP IP address to each of our servers:

Country tagged egress IP addresses

It quickly became apparent that trusted vs untrusted weren’t the only tags needed. For WARP service we also need country tags. For example, United Kingdom based WARP users expect the bbc.com website to just work. However, the BBC restricts many of its services to people just in the UK.

It does this by geofencing — using a database mapping public IP addresses to countries, and allowing only the UK ones. Geofencing is widespread on today’s Internet. To avoid geofencing issues, we need to choose specific egress addresses tagged with an appropriate country, depending on WARP user location. Like many other parties on the Internet, we tag our egress IP space with country codes and publish it as a geofeed (like this one). Notice, the published geofeed is just data. The fact that an IP is tagged as say UK does not mean it is served from the UK, it just means the operator wants it to be geolocated to the UK. Like many things on the Internet, it is based on trust.

Notice, at this point we have three independent geographical tags:

  • the country tag of the WARP user – the eyeball connecting IP
  • the location of the data center the eyeball connected to
  • the country tag of the egressing IP

For best service, we want to choose the egressing IP so that its country tag matches the country from the eyeball IP. But egressing from a specific country tagged IP is challenging: our data centers serve users from all over the world, potentially from many countries! Remember: due to anycast we don’t directly control the ingress routing. Internet geography doesn’t always match physical geography. For example our London data center receives traffic not only from users in the United Kingdom, but also from Ireland, and Saudi Arabia. As a result, our servers in London need many WARP egress addresses associated with many countries:

Can you see where this is going? The problem space just explodes! Instead of having one or two egress IP addresses for each server, now we require dozens, and IPv4 addresses aren’t cheap. With this design, we need many addresses per server, and we operate thousands of servers. This architecture becomes very expensive.

Is anycast a problem?

Let me recap: with anycast ingress we don’t control which data center the user is routed to. Therefore, each of our data centers must be able to egress from an address with any conceivable tag. Inside the data center we also don’t control which server the connection is routed to. There are potentially many tags, many data centers, and many servers inside a data center.

Maybe the problem is the ingress architecture? Perhaps it’s better to use a traditional networking design where a specific eyeball is routed with DNS to a specific data center, or even a server?

That’s one way of thinking, but we decided against it. We like our anycast on ingress. It brings us many advantages:

  • Performance: with anycast, by definition, the eyeball is routed to the closest (by BGP metrics) data center. This is usually the fastest data center for a given user.
  • Automatic failover: if one of our data centers becomes unavailable, the traffic will be instantly, automatically re-routed to the next best place.
  • DDoS resilience: during a denial of service attack or a traffic spike, the load is automatically balanced across many data centers, significantly reducing the impact.
  • Uniform software: The functionality of every data center and of every server inside a data center is identical. We use the same software stack on all the servers around the world. Each machine can perform any action, for any product. This enables easy debugging and good scalability.

For these reasons we’d like to keep the anycast on ingress. We decided to solve the issue of egress address cardinality in some other way.

Solving a million dollar problem

Out of the thousands of servers we operate, every single one should be able to use an egress IP with any of the possible tags. It’s easiest to explain our solution by first showing two extreme designs.

Each server owns all the needed IPs: each server has all the specialized egress IPs with the needed tags.

One server owns the needed IP: a specialized egress IP with a specific tag lives in one place, other servers forward traffic to it.

Both options have pros and cons:

Specialized IP on every server:

  • Super expensive $$$ – every server needs many IP addresses.
  • Egress always local – fast.
  • Excellent reliability – every server is independent.

Specialized IP on one server:

  • Cheap $ – only one specialized IP needed for a tag.
  • Egress almost always forwarded – slow.
  • Poor reliability – introduces chokepoints.

There’s a third way

We’ve been thinking hard about this problem. Frankly, the first extreme option of having every needed IP available locally on every Cloudflare server is not totally unworkable. This is, roughly, what we were able to pull off for IPv6. With IPv6, access to the large needed IP space is not a problem.

However, in IPv4 neither option is acceptable. The first offers fast and reliable egress, but requires great cost — the IPv4 addresses needed are expensive. The second option uses the smallest possible IP space, so it’s cheap, but compromises on performance and reliability.

The solution we devised is a compromise between the extremes. The rough idea is to change the assignment unit. Instead of assigning one /32 IPv4 address for each server, we devised a method of assigning a /32 IP per data center, and then sharing it among physical servers.

Specialized IP on every server:

  • Super expensive $$$.
  • Egress always local – fast.
  • Excellent reliability – every server is independent.

Specialized IP per data center:

  • Reasonably priced $$.
  • Egress always local – fast.
  • Excellent reliability – every server is independent.

Specialized IP on one server:

  • Cheap $.
  • Egress almost always forwarded – slow.
  • Poor reliability – many choke points.

Sharing an IP inside data center

The idea of sharing an IP among servers is not new. Traditionally this can be achieved by Source-NAT on a router. Sadly, the sheer number of egress IP’s we need and the size of our operation, prevents us from relying on stateful firewall / NAT at the router level. We also dislike shared state, so we’re not fans of distributed NAT installations.

What we chose instead, is splitting an egress IP across servers by a port range. For a given egress IP, each server owns a small portion of available source ports – a port slice.

When return packets arrive from the Internet, we have to route them back to the correct machine. For this task we’ve customized “Unimog” – our L4 XDP-based load balancer – (“Unimog, Cloudflare’s load balancer (2020)“) and it’s working flawlessly.

With a port slice of say 2,048 ports, we can share one IP among 31 servers. However, there is always a possibility of running out of ports. To address this, we’ve worked hard to be able to reuse the egress ports efficiently. See the “How to stop running out of ports (2022)“, “How to share IPv4 addresses (2022)” and our Cloudflare.TV segment.

This is pretty much it. Each server is aware which IP addresses and port slices it owns. For inbound routing Unimog inspects the ports and dispatches the packets to appropriate machines.
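
To give a feel for what a port slice means for a single egress connection, here's a rough, hypothetical sketch. This is not our actual implementation; the addresses and slice bounds are made up for illustration. The idea is simply that the server binds each outgoing connection to the shared IP and to a source port inside its own slice:

/* Hypothetical sketch: egress from a shared IP while staying inside this
 * server's port slice (here: ports 2048-4095). */
struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(443) };
inet_pton(AF_INET, "203.0.113.10", &dst.sin_addr);   /* some origin server */

static unsigned next;
struct sockaddr_in src = { .sin_family = AF_INET };
inet_pton(AF_INET, "198.51.100.1", &src.sin_addr);   /* shared egress IP */
src.sin_port = htons(2048 + next++ % 2048);          /* a port in our slice */

int fd = socket(AF_INET, SOCK_STREAM, 0);
if (bind(fd, (struct sockaddr *)&src, sizeof(src)) != 0)
    error(-1, errno, "bind");   /* EADDRINUSE: try the next port in the slice */
connect(fd, (struct sockaddr *)&dst, sizeof(dst));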

Sharing a subnet between data centers

This is not the end of the story though: we haven’t discussed how we can route a single /32 address into a data center. Traditionally, in the public Internet, it’s only possible to route subnets with a granularity of /24 or 256 IP addresses. In our case this would lead to a great waste of IP space.

To solve this problem and improve the utilization of our IP space, we deployed our egress ranges as… anycast! With that in place, we customized Unimog and taught it to forward the packets over our backbone network to the right data center. Unimog maintains a database like this:

198.51.100.1 - forward to LHR
198.51.100.2 - forward to CDG
198.51.100.3 - forward to MAN
...

With this design, it doesn’t matter to which data center return packets are delivered. Unimog can always fix it and forward the data to the right place. Basically, while at the BGP layer we are using anycast, due to our design, semantically an IP identifies a datacenter and an IP and port range identify a specific machine. It behaves almost like a unicast.

We call this technology stack “soft-unicast” and it feels magical. It’s like we did unicast in software over anycast in the BGP layer.

Soft-unicast is indistinguishable from magic

With this setup we can achieve significant benefits:

  • We are able to share a /32 egress IP amongst many servers.
  • We can spread a single subnet across many data centers, and change it easily on the fly. This allows us to fully use our egress IPv4 ranges.
  • We can group similar IP addresses together. For example, all the IP addresses tagged with the “UK” tag might form a single continuous range. This reduces the size of the published geofeed.
  • It’s easy for us to onboard new egress IP ranges, like customer IP’s. This is useful for some of our products, like Cloudflare Zero Trust.

All this is done at sensible cost, at no loss to performance and reliability:

  • Typically, the user is able to egress directly from the closest datacenter, providing the best possible performance.
  • Depending on the actual needs we can allocate or release the IP addresses. This gives us flexibility with the IP cost management, we don’t need to overspend upfront.
  • Since we operate multiple egress IP addresses in different locations, the reliability is not compromised.

The true location of our IP addresses is: “the cloud”

While soft-unicast allows us to gain great efficiency, we’ve hit some issues. Sometimes we get a question “Where does this IP physically exist?”. But it doesn’t have an answer! Our egress IPs don’t exist physically anywhere. From a BGP standpoint our egress ranges are anycast, so they live everywhere. Logically each address is used in one data center at a time, but we can move it around on demand.

Content Delivery Networks misdirect users

As another example of problems, here’s one issue we’ve hit with third party CDNs. As we mentioned before, there are three country tags in our pipeline:

  • The country tag of the IP eyeball is connecting from.
  • The location of our data center.
  • The country tag of the IP addresses we chose for the egress connections.

The fact that our egress address is tagged as “UK” doesn’t always mean it actually is being used in the UK. We’ve had cases when a UK-tagged WARP user, due to the maintenance of our LHR data center, was routed to Paris. A popular CDN performed a reverse-lookup of our egress IP, found it tagged as “UK”, and directed the user to a London CDN server. This is generally OK… but we actually egressed from Paris at the time. This user ended up routing packets from their home in the UK, to Paris, and back to the UK. This is bad for performance.

We address this issue by performing DNS requests in the egressing data center. For DNS we use IP addresses tagged with the location of the data center, not the intended geolocation for the user. This generally fixes the problem, but sadly, there are still some exceptions.

The future is here

Our 2021 experiments with Addressing Agility proved we have plenty of opportunity to innovate with the addressing of the ingress. Soft-unicast shows us we can achieve great flexibility and density on the egress side.

With each new product, the number of tags we need on the egress grows – from traffic trustworthiness, product category to geolocation. As the pool of usable IPv4 addresses shrinks, we can be sure there will be more innovation in the space. Soft-unicast is our solution, but for sure it’s not our last development.

For now though, it seems like we’re moving away from traditional unicast. Our egress IP’s really don’t exist in a fixed place anymore, and some of our servers don’t even own a true unicast IP nowadays.

When the window is not fully open, your TCP stack is doing more than you think

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/

Over the years I’ve been lurking around the Linux kernel and have investigated the TCP code many times. But when recently we were working on Optimizing TCP for high WAN throughput while preserving low latency, I realized I have gaps in my knowledge about how Linux manages TCP receive buffers and windows. As I dug deeper I found the subject complex and certainly non-obvious.

In this blog post I’ll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection. Specifically, looking for answers to seemingly trivial questions:

  • How much data can be stored in the TCP receive buffer? (it’s not what you think)
  • How fast can it be filled? (it’s not what you think either!)

Our exploration focuses on the receiving side of the TCP connection. We’ll try to understand how to tune it for the best speed, without wasting precious memory.

A case of a rapid upload

To best illustrate the receive side buffer management we need pretty charts! But to grasp all the numbers, we need a bit of theory.

We’ll draw charts from a receive side of a TCP flow, running a pretty straightforward scenario:

  • The client opens a TCP connection.
  • The client does send(), and pushes as much data as possible.
  • The server doesn’t recv() any data. We expect all the data to stay and wait in the receive queue.
  • We fix the SO_RCVBUF for better illustration.

Simplified pseudocode might look like (full code if you dare):

import socket

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
# Fix the receive buffer on the listening socket; accepted sockets inherit it.
sd.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 32*1024)
sd.bind(('127.0.0.3', 1234))
sd.listen(32)

cd = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
cd.connect(('127.0.0.3', 1234))

ssd, _ = sd.accept()   # ssd is the receiving socket; we never recv() on it

while True:
    cd.send(b'a'*128*1024)

We’re interested in basic questions:

  • How much data can fit in the server’s receive buffer? It turns out it’s not exactly the same as the default read buffer size on Linux; we’ll get there.
  • Assuming infinite bandwidth, what is the minimum time – measured in RTTs – for the client to fill the receive buffer?

A bit of theory

Let’s start by establishing some common nomenclature. I’ll follow the wording used by the ss Linux tool from the iproute2 package.

First, there is the buffer budget limit. ss manpage calls it skmem_rb, in the kernel it’s named sk_rcvbuf. This value is most often controlled by the Linux autotune mechanism using the net.ipv4.tcp_rmem setting:

$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 131072 6291456

Alternatively it can be manually set with setsockopt(SO_RCVBUF) on a socket. Note that the kernel doubles the value given to this setsockopt. For example SO_RCVBUF=16384 will result in skmem_rb=32768. The max value allowed for this setsockopt is limited to a meager 208KiB by default:

$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992
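
This doubling is easy to verify from userspace (a quick sketch; the printed value assumes Linux’s documented behaviour):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16384)
# The kernel stores double the requested value:
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))  # prints 32768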

The aforementioned blog post discusses why manual buffer size management is problematic – relying on autotuning is generally preferable.

Here’s a diagram showing how skmem_rb budget is being divided:

[Diagram: how the skmem_rb budget is divided – recv-q, metadata, advertised window and the remaining “free” space]

In any given moment, we can think of the budget as being divided into four parts:

  • Recv-q: part of the buffer budget occupied by actual application bytes awaiting read().
  • Another part is consumed by metadata handling – the cost of struct sk_buff and such.
  • Those two parts together are reported by ss as skmem_r – kernel name is sk_rmem_alloc.
  • What remains is “free”, that is: it’s not actively used yet.
  • However, a portion of this “free” region is an advertised window – it may become occupied with application data soon.
  • The remainder will be used for future metadata handling, or might be divided into the advertised window further in the future.

The upper limit for the window is configured by tcp_adv_win_scale setting. By default, the window is set to at most 50% of the “free” space. The value can be clamped further by the TCP_WINDOW_CLAMP option or an internal rcv_ssthresh variable.

How much data can a server receive?

Our first question was “How much data can a server receive?”. A naive reader might think it’s simple: if the server has a receive buffer set to say 64KiB, then the client will surely be able to deliver 64KiB of data!

But this is totally not how it works. To illustrate this, allow me to temporarily set sysctl tcp_adv_win_scale=0. This is not a default and, as we’ll learn, it’s the wrong thing to do. With this setting the server will indeed set 100% of the receive buffer as an advertised window.

Here’s our setup:

  • The client tries to send as fast as possible.
  • Since we are interested in the receiving side, we can cheat a bit and speed up the sender arbitrarily. The client has transmission congestion control disabled: we set initcwnd=10000 as the route option.
  • The server has a fixed skmem_rb set at 64KiB.
  • The server has tcp_adv_win_scale=0.
[Chart: buffer sizes on the receive path for each ingress packet – skmem_rb, recv-q, skmem_r, rcv_win and rcv_ssthresh]

There are so many things here! Let’s try to digest it. First, the X axis is an ingress packet number (we saw about 65). The Y axis shows the buffer sizes as seen on the receive path for every packet.

  • First, the purple line is a buffer size limit in bytes – skmem_rb. In our experiment we called setsockopt(SO_RCVBUF)=32K, and skmem_rb is double that value. Notice that by setting SO_RCVBUF we disabled the Linux autotune mechanism.
  • The green recv-q line is how many application bytes are available in the receive socket. This grows linearly with each received packet.
  • Then there is the blue skmem_r, the used data + metadata cost in the receive socket. It grows just like recv-q, but a bit faster, since it accounts for the cost of the metadata the kernel needs to deal with.
  • The orange rcv_win is an advertised window. We start with 64KiB (100% of skmem_rb) and go down as the data arrives.
  • Finally, the dotted line shows rcv_ssthresh, which is not important yet, we’ll get there.

Running over the budget is bad

It’s super important to notice that we finished with skmem_r higher than skmem_rb! This is rather unexpected, and undesired. The whole point of the skmem_rb memory budget is, well, not to exceed it. Here’s how ss shows it:

$ ss -m
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port   
tcp    ESTAB  62464   0       127.0.0.3:1234      127.0.0.2:1235
     skmem:(r73984,rb65536,...)

As you can see, skmem_rb is 65536 and skmem_r is 73984, which is 8448 bytes over! When this happens we have an even bigger issue on our hands. At around the 62nd packet we have an advertised window of 3072 bytes, but while packets are being sent, the receiver is unable to process them! This is easily verifiable by inspecting an nstat TcpExtTCPRcvQDrop counter:

$ nstat -az TcpExtTCPRcvQDrop
TcpExtTCPRcvQDrop    13    0.0

In our run 13 packets were dropped. This counter tracks the number of packets dropped due to either system-wide or per-socket memory pressure – we know we hit the latter. In our case, soon after the socket memory limit was crossed, new packets were prevented from being enqueued to the socket. This happened even though the TCP advertised window was still open.

This results in an interesting situation. The receiver’s window is open which might indicate it has resources to handle the data. But that’s not always the case, like in our example when it runs out of the memory budget.

The sender will think it hit network congestion and lost packets, and will run the usual retry mechanisms including exponential backoff. This behavior can be seen as desired or undesired, depending on how you look at it. On one hand no data will be lost: the sender can eventually deliver all the bytes reliably. On the other hand the exponential backoff logic might stall the sender for a long time, causing a noticeable delay.

The root of the problem is straightforward – the Linux skmem_rb sets a memory budget for both the data and the metadata which reside on the socket. In a pessimistic case each packet might incur the cost of a struct sk_buff + struct skb_shared_info – 576 bytes on my system – on top of the actual payload size, plus memory waste due to network card buffer alignment:

[Diagram: per-packet memory cost – struct sk_buff + struct skb_shared_info overhead on top of the payload, plus buffer alignment waste]

We now understand that Linux can’t just advertise 100% of the memory budget as an advertised window. Some budget must be reserved for metadata and such. The upper limit of window size is expressed as a fraction of the “free” socket budget. It is controlled by tcp_adv_win_scale, with the following values:

[Table: upper limit of the advertised window for tcp_adv_win_scale values – 0 → 100%, 1 → 50%, 2 → 75%, 3 → 87.5% of the “free” space]

By default, Linux sets the advertised window at most at 50% of the remaining buffer space.

Even with 50% of space “reserved” for metadata, the kernel is very smart and tries hard to reduce the metadata memory footprint. It has two mechanisms for this:

  • TCP Coalesce – on the happy path, Linux is able to throw away the struct sk_buff. It can do so by just linking the data to the previously enqueued packet. You can think of it as extending the last packet on the socket.
  • TCP Collapse – when the memory budget is hit, Linux runs “collapse” code. Collapse rewrites and defragments the receive buffer from many small skb’s into a few very long segments – therefore reducing the metadata cost.

Here’s an extension to our previous chart showing these mechanisms in action:

[Chart: the previous run annotated with TCP coalesce events (pink), TCP collapse (red circle) and packet drops (red “X”)]

TCP Coalesce is a very effective measure and works behind the scenes at all times. In the bottom chart, the packets where the coalesce was engaged are shown with a pink line. You can see – the skmem_r bumps (blue line) are clearly correlated with a lack of coalesce (pink line)! The nstat TcpExtTCPRcvCoalesce counter might be helpful in debugging coalesce issues.

The TCP Collapse is a bigger gun. Mike wrote about it extensively, and I wrote a blog post years ago, when the latency of TCP collapse hit us hard. In the chart above, the collapse is shown as a red circle. We clearly see it being engaged after the socket memory budget is reached – from packet number 63. The nstat TcpExtTCPRcvCollapsed counter is relevant here. This value growing is a bad sign and might indicate bad latency spikes – especially when dealing with larger buffers. Normally collapse is supposed to be run very sporadically. A prominent kernel developer describes this pessimistic situation:

This also means tcp advertises a too optimistic window for a given allocated rcvspace: When receiving frames, sk_rmem_alloc can hit sk_rcvbuf limit and we call tcp_collapse() too often, especially when application is slow to drain its receive queue […] This is a major latency source.

If the memory budget remains exhausted after the collapse, Linux will drop ingress packets. In our chart it’s marked as a red “X”. The nstat TcpExtTCPRcvQDrop counter shows the count of dropped packets.

rcv_ssthresh predicts the metadata cost

Perhaps counter-intuitively, the memory cost of a packet can be much larger than the amount of actual application data contained in it. It depends on a number of things:

  • Network card: some network cards always allocate a full page (4096, or even 16KiB) per packet, no matter how small or large the payload.
  • Payload size: shorter packets will have a worse metadata-to-content ratio, since the struct sk_buff is comparatively larger.
  • Whether XDP is being used.
  • L2 header size: things like ethernet, vlan tags, and tunneling can add up.
  • Cache line size: many kernel structs are cache line aligned. On systems with larger cache lines, they will use more memory (see P4 or S390X architectures).

The first two factors are the most important. Here’s a run when the sender was specially configured to make the metadata cost bad and the coalesce ineffective (the details of the setup are messy):

[Chart: a run with inflated per-packet metadata cost – TCP collapse is triggered multiple times]

You can see the kernel hitting TCP collapse multiple times, which is totally undesired. Each time collapse runs, the kernel is likely to rewrite the full receive buffer. This whole kernel machinery, from reserving some space for metadata with tcp_adv_win_scale, via using coalesce to reduce the memory cost of each packet, up to the rcv_ssthresh limit, exists to avoid this very case of hitting collapse too often.

The kernel machinery most often works fine, and TCP collapse is rare in practice. However, we noticed that’s not the case for certain types of traffic. One example is websocket traffic with loads of tiny packets and a slow reader. One kernel comment talks about such a case:

* The scheme does not work when sender sends good segments opening
* window and then starts to feed us spaghetti. But it should work
* in common situations. Otherwise, we have to rely on queue collapsing.

Notice that the rcv_ssthresh line dropped down on the TCP collapse. This variable is an internal limit to the advertised window. By dropping it the kernel effectively says: hold on, I mispredicted the packet cost, next time I’m given an opportunity I’m going to open a smaller window. Kernel will advertise a smaller window and be more careful – all of this dance is done to avoid the collapse.

Normal run – continuously updated window

Finally, here’s a chart from a normal run of a connection. Here, we use the default tcp_adv_win_wcale=1 (50%):

[Chart: a normal run with tcp_adv_win_scale=1 – rcv_win sliding as data arrives and is acknowledged]

Early in the connection you can see rcv_win being continuously updated with each received packet. This makes sense: while the rcv_ssthresh and tcp_adv_win_scale restrict the advertised window to never exceed 32KiB, the window is sliding nicely as long as there is enough space. At packet 18 the receiver stops updating the window and waits a bit. At packet 32 the receiver decides there still is some space and updates the window again, and so on. At the end of the flow the socket has 56KiB of data. This 56KiB of data was received over a sliding window reaching at most 32KiB.

The saw blade pattern of rcv_win is a result of delayed ACKs. The “acked” bytes are shown as a red dashed line. Since ACKs might be delayed, the receiver waits a bit before updating the window. If you want a smooth line, you can use the quickack 1 per-route parameter, but this is not recommended since it will result in many small ACK packets flying over the wire.

In normal connection we expect the majority of packets to be coalesced and the collapse/drop code paths never to be hit.

Large receive windows – rcv_ssthresh

For large bandwidth transfers over big latency links – big BDP case – it’s beneficial to have a very wide advertised window. However, Linux takes a while to fully open large receive windows:

[Chart: a 2MiB-budget run – the advertised window starts at 64KiB and ramps up slowly, clamped by rcv_ssthresh]

In this run, the skmem_rb is set to 2MiB. As opposed to previous runs, the buffer budget is large and the receive window doesn’t start at 50% of the skmem_rb! Instead it starts from 64KiB and grows linearly. It takes a while for Linux to ramp up the receive window to full size – ~800KiB in this case. The window is clamped by rcv_ssthresh. This variable starts at 64KiB and then grows by two full-MSS packets for each packet which has a “good” ratio of total size (truesize) to payload size.

Eric Dumazet writes about this behavior:

Stack is conservative about RWIN increase, it wants to receive packets to have an idea of the skb->len/skb->truesize ratio to convert a memory budget to  RWIN.
Some drivers have to allocate 16K buffers (or even 32K buffers) just to hold one segment (of less than 1500 bytes of payload), while others are able to pack memory more efficiently.

This slow window opening behavior is hardcoded and not configurable in the vanilla kernel. We prepared a kernel patch that allows starting with a higher rcv_ssthresh based on the per-route option initrwnd:

$ ip route change local 127.0.0.0/8 dev lo initrwnd 1000

With the patch and the route change deployed, this is how the buffers look:

[Chart: the same run with the initrwnd patch applied – the advertised window jumps to 1MiB right after the handshake]

The advertised window is limited to 64KiB during the TCP handshake, but with our kernel patch enabled it’s quickly bumped up to 1MiB in the first ACK packet afterwards. In both runs it took ~1800 packets to fill the receive buffer, but they took different amounts of time. In the first run the sender could push only 64KiB onto the wire in the second RTT. In the second run it could immediately push a full 1MiB of data.

This trick of aggressive window opening is not really necessary for most users. It’s only helpful when:

  • You have high-bandwidth TCP transfers over big-latency links.
  • The metadata + buffer alignment cost of your NIC is sensible and predictable.
  • Immediately after the flow starts your application is ready to send a lot of data.
  • The sender has configured large initcwnd.
  • You care about shaving off every possible RTT.

On our systems we do have such flows, but arguably it might not be a common scenario. In the real world most of your TCP connections go to the nearest CDN point of presence, which is very close.

Getting it all together

In this blog post, we discussed a seemingly simple case of a TCP sender filling up the receive socket. We tried to address two questions: with our isolated setup, how much data can be sent, and how quickly?

With the default settings of net.ipv4.tcp_rmem, Linux initially sets a memory budget of 128KiB for the receive data and metadata. On my system, given full-sized packets, it’s able to eventually accept around 113KiB of application data.

Then, we showed that the receive window is not fully opened immediately. Linux keeps the receive window small, as it tries to predict the metadata cost and avoid overshooting the memory budget, therefore hitting TCP collapse. By default, with the net.ipv4.tcp_adv_win_scale=1, the upper limit for the advertised window is 50% of “free” memory. rcv_ssthresh starts up with 64KiB and grows linearly up to that limit.

On my system it took five window updates – six RTTs in total – to fill the 128KiB receive buffer. In the first batch the sender sent ~64KiB of data (remember we hacked the initcwnd limit), and then the sender topped it up with smaller and smaller batches until the receive window fully closed.

I hope this blog post is helpful and explains well the relationship between the buffer size and advertised window on Linux. Also, it describes the often misunderstood rcv_ssthresh which limits the advertised window in order to manage the memory budget and predict the unpredictable cost of metadata.

In case you wonder, similar mechanisms are in play in QUIC. The QUIC/H3 libraries, though, are still pretty young and don’t have so many complex and mysterious toggles… yet.

As always, the code and instructions on how to reproduce the charts are available at our GitHub.

How to stop running out of ephemeral ports and start to love long-lived connections

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/

Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way.

It’s particularly interesting when basic things used everywhere fail. Recently we’ve reached such a breaking point in a ubiquitous part of Linux networking: establishing a network connection using the connect() system call.

Since we are not doing anything special, just establishing TCP and UDP connections, how could anything go wrong? Here’s one example: we noticed alerts from a misbehaving server, logged in to check it out and saw:

marek@:~# ssh 127.0.0.1
ssh: connect to host 127.0.0.1 port 22: Cannot assign requested address

You can imagine the face of my colleague who saw that. SSH to localhost refuses to work, while she was already using SSH to connect to that server! On another occasion:

marek@:~# dig cloudflare.com @1.1.1.1
dig: isc_socket_bind: address in use

This time a basic DNS query failed with a weird networking error. Failing DNS is a bad sign!

In both cases the problem was Linux running out of ephemeral ports. When this happens it’s unable to establish any outgoing connections. This is a pretty serious failure. It’s usually transient and if you don’t know what to look for it might be hard to debug.

The root cause lies deeper though. We can often ignore limits on the number of outgoing connections. But we encountered cases where we hit limits on the number of concurrent outgoing connections during normal operation.

In this blog post I’ll explain why we had these issues, how we worked around them, and present an userspace code implementing an improved variant of connect() syscall.

Outgoing connections on Linux part 1 – TCP

Let’s start with a bit of historical background.

Long-lived connections

Back in 2014 Cloudflare announced support for WebSockets. We wrote two articles about it:

If you skim these blogs, you’ll notice we were totally fine with the WebSocket protocol, framing and operation. What worried us was our capacity to handle large numbers of concurrent outgoing connections towards the origin servers. Since WebSockets are long-lived, allowing them through our servers might greatly increase the concurrent connection count. And this did turn out to be a problem. It was possible to hit a ceiling for a total number of outgoing connections imposed by the Linux networking stack.

In a pessimistic case, each Linux connection consumes a local port (ephemeral port), and therefore the total connection count is limited by the size of the ephemeral port range.

Basics – how port allocation works

When establishing an outbound connection a typical user needs the destination address and port. For example, DNS might resolve cloudflare.com to the ‘104.1.1.229’ IPv4 address. A simple Python program can establish a connection to it with the following code:

cd = socket.socket(AF_INET, SOCK_STREAM)
cd.connect(('104.1.1.229', 80))

The operating system’s job is to figure out how to reach that destination, selecting an appropriate source address and source port to form the full 4-tuple for the connection:

[Diagram: the full connection 4-tuple – source IP, source port, destination IP, destination port]

The operating system chooses the source IP based on the routing configuration. On Linux we can see which source IP will be chosen with ip route get:

$ ip route get 104.1.1.229
104.1.1.229 via 192.168.1.1 dev eth0 src 192.168.1.8 uid 1000
	cache

The src parameter in the result shows the discovered source IP address that should be used when going towards that specific target.

The source port, on the other hand, is chosen from the local port range configured for outgoing connections, also known as the ephemeral port range. On Linux this is controlled by the following sysctls:

$ sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_reserved_ports
net.ipv4.ip_local_port_range = 32768    60999
net.ipv4.ip_local_reserved_ports =

The ip_local_port_range sets the low and high (inclusive) port range to be used for outgoing connections. The ip_local_reserved_ports is used to skip specific ports if the operator needs to reserve them for services.

Vanilla TCP is a happy case

The default ephemeral port range contains more than 28,000 ports (60999+1-32768=28232). Does that mean we can have at most 28,000 outgoing connections? That’s the core question of this blog post!

In TCP the connection is identified by a full 4-tuple, for example:

full 4-tuple 192.168.1.8 32768 104.1.1.229 80

In principle, it is possible to reuse the source IP and port, and share them against another destination. For example, there could be two simultaneous outgoing connections with these 4-tuples:

full 4-tuple #A 192.168.1.8 32768 104.1.1.229 80
full 4-tuple #B 192.168.1.8 32768 151.101.1.57 80

This “source two-tuple” sharing can happen in practice when establishing connections using the vanilla TCP code:

sd = socket.socket(SOCK_STREAM)
sd.connect( (remote_ip, remote_port) )

But slightly different code can prevent this sharing, as we’ll discuss.

In the rest of this blog post, we’ll summarise the behaviour of code fragments that make outgoing connections showing:

  • The technique’s description
  • The typical `errno` value in the case of port exhaustion
  • And whether the kernel is able to reuse the {source IP, source port}-tuple against another destination

The last column is the most important since it shows if there is a low limit of total concurrent connections. As we’re going to see later, the limit is present more often than we’d expect.

technique description | errno on port exhaustion | possible src 2-tuple reuse
connect(dst_IP, dst_port) | EADDRNOTAVAIL | yes (good!)

In the case of generic TCP, things work as intended. Towards a single destination it’s possible to have as many connections as an ephemeral range allows. When the range is exhausted (against a single destination), we’ll see EADDRNOTAVAIL error. The system also is able to correctly reuse local two-tuple {source IP, source port} for ESTABLISHED sockets against other destinations. This is expected and desired.

Manually selecting source IP address

Let’s go back to the Cloudflare server setup. Cloudflare operates many services, to name just two: CDN (caching HTTP reverse proxy) and WARP.

For Cloudflare, it’s important that we don’t mix traffic types among our outgoing IPs. Origin servers on the Internet might want to differentiate traffic based on our product. The simplest example is CDN: it’s appropriate for an origin server to firewall off non-CDN inbound connections. Allowing Cloudflare cache pulls is totally fine, but allowing WARP connections which contain untrusted user traffic might lead to problems.

To achieve such outgoing IP separation, each of our applications must be explicit about which source IPs to use. They can’t leave it up to the operating system; the automatically-chosen source could be wrong. While it’s technically possible to configure routing policy rules in Linux to express such requirements, we decided not to do that and keep Linux routing configuration as simple as possible.

Instead, before calling connect(), our applications select the source IP with the bind() syscall. A trick we call “bind-before-connect”:

sd = socket.socket(SOCK_STREAM)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )

technique description | errno on port exhaustion | possible src 2-tuple reuse
bind(src_IP, 0) + connect(dst_IP, dst_port) | EADDRINUSE | no (bad!)

This code looks rather innocent, but it hides a considerable drawback. When calling bind(), the kernel attempts to find an unused local two-tuple. Due to BSD API shortcomings, the operating system can’t know what we plan to do with the socket. It’s totally possible we want to listen() on it, in which case sharing the source IP/port with a connected socket will be a disaster! That’s why the source two-tuple selected when calling bind() must be unique.

Due to this API limitation, in this technique the source two-tuple can’t be reused. Each connection effectively “locks” a source port, so the number of connections is constrained by the size of the ephemeral port range. Notice: one source port is used up for each connection, no matter how many destinations we have. This is bad, and is exactly the problem we were dealing with back in 2014 in the WebSockets articles mentioned above.

Fortunately, it’s fixable.

IP_BIND_ADDRESS_NO_PORT

Back in 2014 we fixed the problem by setting the SO_REUSEADDR socket option and manually retrying bind()+ connect() a couple of times on error. This worked ok, but later in 2015 Linux introduced a proper fix: the IP_BIND_ADDRESS_NO_PORT socket option. This option tells the kernel to delay reserving the source port:

sd = socket.socket(SOCK_STREAM)
sd.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )

technique description | errno on port exhaustion | possible src 2-tuple reuse
IP_BIND_ADDRESS_NO_PORT + bind(src_IP, 0) + connect(dst_IP, dst_port) | EADDRNOTAVAIL | yes (good!)

This gets us back to the desired behavior. On modern Linux, when doing bind-before-connect for TCP, you should set IP_BIND_ADDRESS_NO_PORT.

Explicitly selecting a source port

Sometimes an application needs to select a specific source port. For example: the operator wants to control the full 4-tuple in order to debug ECMP routing issues.

Recently a colleague wanted to run a cURL command for debugging, and he needed the source port to be fixed. cURL provides the --local-port option to do this¹:

$ curl --local-port 9999 -4svo /dev/null https://cloudflare.com/cdn-cgi/trace
*   Trying 104.1.1.229:443...

In other situations source port numbers should be controlled, as they can be used as an input to a routing mechanism.

But setting the source port manually is not easy. We’re back to square one in our hackery since IP_BIND_ADDRESS_NO_PORT is not an appropriate tool when calling bind() with a specific source port value. To get the scheme working again and be able to share source 2-tuple, we need to turn to SO_REUSEADDR:

sd = socket.socket(SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( (src_IP, src_port) )
sd.connect( (dst_IP, dst_port) )

Our summary table:

technique description | errno on port exhaustion | possible src 2-tuple reuse
SO_REUSEADDR + bind(src_IP, src_port) + connect(dst_IP, dst_port) | EADDRNOTAVAIL | yes (good!)

Here, the user takes responsibility for handling conflicts, when an ESTABLISHED socket sharing the 4-tuple already exists. In such a case connect will fail with EADDRNOTAVAIL and the application should retry with another acceptable source port number.
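
A retry loop along these lines could work (a rough sketch; the helper name and the list of candidate ports are made up for illustration):

import errno
import random
import socket

def connect_with_fixed_ports(src_ip, src_ports, dst_addr):
    # Hypothetical helper: try candidate source ports until one of them
    # doesn't conflict with an existing 4-tuple towards dst_addr.
    for src_port in random.sample(src_ports, len(src_ports)):
        sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sd.bind((src_ip, src_port))
        try:
            sd.connect(dst_addr)
            return sd
        except OSError as e:
            sd.close()
            if e.errno != errno.EADDRNOTAVAIL:
                raise
    raise OSError(errno.EADDRNOTAVAIL, "no acceptable source port left")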

Userspace connectx implementation

With these tricks, we can implement a common function and call it connectx. It will do what bind()+connect() should, but won’t have the unfortunate ephemeral port range limitation. In other words, created sockets are able to share local two-tuples as long as they are going to distinct destinations:

def connectx((source_IP, source_port), (destination_IP, destination_port)):

We have three use cases this API should support:

user specified | technique
{_, _, dst_IP, dst_port} | vanilla connect()
{src_IP, _, dst_IP, dst_port} | IP_BIND_ADDRESS_NO_PORT
{src_IP, src_port, dst_IP, dst_port} | SO_REUSEADDR

The name we chose isn’t an accident. MacOS (specifically the underlying Darwin OS) has exactly that function implemented as a connectx() system call (implementation):

[Screenshot: the connectx() system call definition from the Darwin sources]

It’s more powerful than our connectx code, since it supports TCP Fast Open.

Should we, Linux users, be envious? For TCP, it’s possible to get the right kernel behaviour with the appropriate setsockopt/bind/connect dance, so a kernel syscall is not quite needed.

But for UDP things turn out to be much more complicated and a dedicated syscall might be a good idea.

Outgoing connections on Linux – part 2 – UDP

In the previous section we listed three use cases for outgoing connections that should be supported by the operating system:

  • Vanilla egress: operating system chooses the outgoing IP and port
  • Source IP selection: user selects outgoing IP but the OS chooses port
  • Full 4-tuple: user selects full 4-tuple for the connection

We demonstrated how to implement all three cases on Linux for TCP, without hitting connection count limits due to source port exhaustion.

It’s time to extend our implementation to UDP. This is going to be harder.

For UDP, Linux maintains one hash table that is keyed on local IP and port, which can hold duplicate entries. Multiple UDP connected sockets can not only share a 2-tuple but also a 4-tuple! It’s totally possible to have two distinct, connected sockets having exactly the same 4-tuple. This feature was created for multicast sockets. The implementation was then carried over to unicast connections, but it is confusing. With conflicting sockets on unicast addresses, only one of them will receive any traffic. A newer connected socket will “overshadow” the older one. It’s surprisingly hard to detect such a situation. To get UDP connectx() right, we will need to work around this “overshadowing” problem.

Vanilla UDP is limited

It might come as a surprise to many, but by default, the total count for outbound UDP connections is limited by the ephemeral port range size. Usually, with Linux you can’t have more than ~28,000 connected UDP sockets, even if they point to multiple destinations.

Ok, let’s start with the simplest and most common way of establishing outgoing UDP connections:

sd = socket.socket(SOCK_DGRAM)
sd.connect( (dst_IP, dst_port) )

technique description | errno on port exhaustion | possible src 2-tuple reuse | risk of overshadowing
connect(dst_IP, dst_port) | EAGAIN | no (bad!) | no

The simplest case is not a happy one. The total number of concurrent outgoing UDP connections on Linux is limited by the ephemeral port range size. On our multi-tenant servers, with potentially long-lived gaming and H3/QUIC flows containing WebSockets, this is too limiting.
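
You can see the limit for yourself with a loop like this (a sketch: it assumes the file descriptor limit has been raised above the ephemeral port range, e.g. ulimit -n 65535, a default route exists, and 192.0.2.1:53 is just an example destination):

import socket

socks = []
try:
    while True:
        sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sd.connect(('192.0.2.1', 53))  # connected UDP, no data is sent
        socks.append(sd)
except OSError as e:
    # With the default ephemeral port range this fires after ~28k sockets.
    print('failed after', len(socks), 'sockets:', e)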

On TCP we were able to slap on a setsockopt and move on. No such easy workaround is available for UDP.

For UDP, without SO_REUSEADDR, Linux avoids sharing local 2-tuples among UDP sockets. During connect() it tries to find a 2-tuple that is not used yet. As a side note: there is no fundamental reason why it looks for a unique 2-tuple, as opposed to a unique 4-tuple, during connect(). This suboptimal behavior might be fixable.

SO_REUSEADDR is hard

To allow local two-tuple reuse we need the SO_REUSEADDR socket option. Sadly, this would also allow established sockets to share a 4-tuple, with the newer socket overshadowing the older one.

sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.connect( (dst_IP, dst_port) )

technique description | errno on port exhaustion | possible src 2-tuple reuse | risk of overshadowing
SO_REUSEADDR + connect(dst_IP, dst_port) | EAGAIN | yes | yes (bad!)

In other words, we can’t just set SO_REUSEADDR and move on, since we might hit a local 2-tuple that is already used in a connection against the same destination. We might already have an identical 4-tuple connected socket underneath. Most importantly, during such a conflict we won’t be notified by any error. This is unacceptably bad.

Detecting socket conflicts with eBPF

We thought a good solution might be to write an eBPF program to detect such conflicts. The idea was to hook onto the connect() syscall. Linux cgroups allow the BPF_CGROUP_INET4_CONNECT hook. The eBPF program is called every time a process under a given cgroup runs the connect() syscall. This is pretty cool, and we thought it would allow us to verify if there is a 4-tuple conflict before moving the socket from the UNCONNECTED to the CONNECTED state.

Here is how to load and attach our eBPF program:

bpftool prog load ebpf.o /sys/fs/bpf/prog_connect4  type cgroup/connect4
bpftool cgroup attach /sys/fs/cgroup/unified/user.slice connect4 pinned /sys/fs/bpf/prog_connect4

With such a hook in place, we greatly reduce the probability of overshadowing:

technique description | errno on port exhaustion | possible src 2-tuple reuse | risk of overshadowing
INET4_CONNECT hook + SO_REUSEADDR + connect(dst_IP, dst_port) | manual port discovery, EPERM on conflict | yes | yes, but small

However, this solution is limited. First, it doesn’t work for sockets with an automatically assigned source IP or source port; it only works when a user manually creates a 4-tuple connection from userspace. Then there is a second issue: a typical race condition. We don’t grab any lock, so it’s technically possible a conflicting socket will be created on another CPU in the time between our eBPF conflict check and the finish of the real connect() syscall machinery. In short, this lockless eBPF approach is better than nothing, but fundamentally racy.

Socket traversal – SOCK_DIAG ss way

There is another way to verify if a conflicting socket already exists: we can check for connected sockets in userspace. It’s possible to do it quite effectively, without any privileges, with the SOCK_DIAG_BY_FAMILY feature of the netlink interface. This is the same technique the ss tool uses to print out the sockets available on the system.

The netlink code is not even all that complicated. Take a look at the code. Inside the kernel, it goes quickly into a fast __udp_lookup() routine. This is great – we can avoid iterating over all sockets on the system.

With that function handy, we can draft our UDP code:

sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.bind( src_addr )
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError(...)
sd.connect( dst_addr )

This code has the same race condition issue as the connect inet eBPF hook before. But it’s a good starting point. We need some locking to avoid the race condition. Perhaps it’s possible to do it in the userspace.

SO_REUSEADDR as a lock

Here comes a breakthrough: we can use SO_REUSEADDR as a locking mechanism. Consider this:

sd = socket.socket(SOCK_DGRAM)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( src_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 0)
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError()
sd.connect( dst_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

The idea here is:

  • We need REUSEADDR around bind, otherwise it wouldn’t be possible to reuse a local port. It’s technically possible to clear REUSEADDR after bind. Doing this technically makes the kernel socket state inconsistent, but it doesn’t hurt anything in practice.
  • By clearing REUSEADDR, we’re locking new sockets from using that source port. At this stage we can check if we have ownership of the 4-tuple we want. Even if multiple sockets enter this critical section, only one, the newest, can win this verification. This is a cooperative algorithm, so we assume all tenants try to behave.
  • At this point, if the verification succeeds, we can perform connect() and have a guarantee that the 4-tuple won’t be reused by another socket at any point in the process.

This is rather convoluted and hacky, but it satisfies our requirements:

technique description | errno on port exhaustion | possible src 2-tuple reuse | risk of overshadowing
REUSEADDR as a lock | EAGAIN | yes | no

Sadly, this scheme only works when we know the full 4-tuple up front, so we can’t rely on the kernel’s automatic source IP or port assignment.

Faking source IP and port discovery

In the case when the user calls connect() and specifies only the target 2-tuple – destination IP and port – the kernel needs to fill in the missing bits: the source IP and source port. Unfortunately the described algorithm expects the full 4-tuple to be known in advance.

One solution is to implement source IP and port discovery in userspace. This turns out to be not that hard. For example, here’s a snippet of our code:

def _get_udp_port(family, src_addr, dst_addr):
    if ephemeral_lo == None:
        _read_ephemeral()
    lo, hi = ephemeral_lo, ephemeral_hi
    start = random.randint(lo, hi)
    ...

Putting it all together

Combining the manual source IP, port discovery and the REUSEADDR locking dance, we get a decent userspace implementation of connectx() for UDP.

We have covered all three use cases this API should support:

user specified | comments
{_, _, dst_IP, dst_port} | manual source IP and source port discovery
{src_IP, _, dst_IP, dst_port} | manual source port discovery
{src_IP, src_port, dst_IP, dst_port} | just our “REUSEADDR as lock” technique

Take a look at the full code.

Summary

This post described a problem we hit in production: running out of ephemeral ports. This was partially caused by our servers running numerous concurrent connections, but also because we used the Linux sockets API in a way that prevented source port reuse. It meant that we were limited to ~28,000 concurrent connections per protocol, which is not enough for us.

We explained how to allow source port reuse and avoid the ephemeral-port-range limit. We showed a userspace connectx() function, which is a better way of creating outgoing TCP and UDP connections on Linux.

Our UDP code is more complex. It is based on little-known low-level features, assumes cooperation between tenants, and relies on undocumented behaviour of the Linux operating system. Using REUSEADDR as a locking mechanism is rather unheard of.

The connectx() functionality is valuable, and should be added to Linux one way or another. It’s not trivial to get all its use cases right. Hopefully, this blog post shows how to achieve this in the best way given the operating system API constraints.

___

¹ On a side note, on the second cURL run it fails due to TIME-WAIT sockets: “bind failed with errno 98: Address already in use”.

One option is to wait for the TIME_WAIT socket to die, or to work around it with the time-wait sockets kill script. Killing time-wait sockets is generally a bad idea: it violates the protocol, is usually unneeded, and sometimes doesn’t work anyway. But hey, in some extreme cases it’s good to know what’s possible. Just saying.

Everything you ever wanted to know about UDP sockets but were afraid to ask, part 1

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/

[Image: snippet from an internal presentation about UDP inner workings in Spectrum. Who said UDP is simple!]

Historically Cloudflare’s core competency was operating an HTTP reverse proxy. We’ve spent significant effort optimizing traditional HTTP/1.1 and HTTP/2 servers running on top of TCP. Recently though, we started operating big scale stateful UDP services.

Stateful UDP has been gaining popularity for a number of reasons:

  • QUIC, the new transport protocol that powers HTTP/3, is based on UDP. We see its adoption accelerating.
  • We operate WARP — our WireGuard protocol based tunneling service — which uses UDP under the hood.
  • We have a lot of generic UDP traffic going through our Spectrum service.

Although UDP is simple in principle, there is a lot of domain knowledge needed to run things at scale. In this blog post we’ll cover the basics: all you need to know about UDP servers to get started.

Connected vs unconnected

How do you “accept” connections on a UDP server? If you are using unconnected sockets, you generally don’t.

But let’s start with the basics. UDP sockets can be “connected” (or “established”) or “unconnected”. Connected sockets have a full 4-tuple associated {source ip, source port, destination ip, destination port}, unconnected sockets have 2-tuple {bind ip, bind port}.

Traditionally the connected sockets were mostly used for outgoing flows, while unconnected for inbound “server” side connections.

UDP client

As we’ll learn today, these can be mixed. It is possible to use connected sockets for ingress handling, and unconnected for egress. To illustrate the latter, consider these two snippets. They do the same thing — send a packet to the DNS resolver. First snippet is using a connected socket:

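Something along these lines (a sketch: the resolver address 1.1.1.1 and the query payload are just placeholders):

import socket

sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sd.connect(('1.1.1.1', 53))        # connected: the 4-tuple is fixed here
sd.send(b'...dns query...')
data = sd.recv(512)                # only datagrams from 1.1.1.1:53 arrive here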

Second, using an unconnected one:

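And the unconnected equivalent (again a sketch with the same placeholder destination):

import socket

sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sd.sendto(b'...dns query...', ('1.1.1.1', 53))
data, addr = sd.recvfrom(512)
# With an unconnected socket anyone can send us a datagram on this port,
# so we have to verify addr ourselves.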

Which one is better? In the second case, when receiving, the programmer should verify the source IP of the packet. Otherwise, the program can get confused by some random inbound internet junk — like port scanning. It is tempting to reuse the socket descriptor and query another DNS server afterwards, but this would be a bad idea, particularly when dealing with DNS. For security, DNS assumes the client source port is unpredictable and short-lived.

Generally speaking for outbound traffic it’s preferable to use connected UDP sockets.

Connected sockets can save route lookup on each packet by employing a clever optimization — Linux can save a route lookup result on a connection struct. Depending on the specifics of the setup this might save some CPU cycles.

For completeness, it is possible to roll a new source port and reuse a socket descriptor with an obscure trick called “dissolving of the socket association”. It can be done with connect(AF_UNSPEC), but this is rather advanced Linux magic.

UDP server

Traditionally on the server side UDP requires unconnected sockets. Using them requires a bit of finesse. To illustrate this, let’s write a UDP echo server. In practice, you probably shouldn’t write such a server, due to the risk of becoming a DoS reflection vector. Among other protections, like rate limiting, UDP services should always respond with a strictly smaller amount of data than was sent in the initial packet. But let’s not digress; the naive UDP echo server might look like:

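A minimal sketch, binding to ::1 on port 1234 and reading at most 2048 bytes per datagram:

import socket

sd = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
sd.bind(('::1', 1234))

while True:
    data, addr = sd.recvfrom(2048)
    sd.sendto(data, addr)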

This code raises a few questions:

  • Received packets can be longer than 2048 bytes. This can happen over loopback, when using jumbo frames, or with the help of IP fragmentation.
  • It’s totally possible for the received packet to have an empty payload.
  • What about inbound ICMP errors?

These problems are specific to UDP, they don’t happen in the TCP world. TCP can transparently deal with MTU / fragmentation and ICMP errors. Depending on the specific protocol, a UDP service might need to be more complex and pay extra care to such corner cases.

Sourcing packets from a wildcard socket

There is a bigger problem with this code. It only works correctly when binding to a specific IP address, like ::1 or 127.0.0.1. It won’t always work when we bind to a wildcard. The issue lies in the sendto() line — we didn’t explicitly set the outbound IP address! Linux doesn’t know where we’d like to source the packet from, and it will choose a default egress IP address. It might not be the IP the client sent the packet to. For example, let’s say we added the ::2 address to the loopback interface and sent a packet to it, with the src IP set to a valid ::1:

marek@mrprec:~$ sudo tcpdump -ni lo port 1234 -t
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
IP6 ::1.41879 > ::2.1234: UDP, length 2
IP6 ::1.1234 > ::1.41879: UDP, length 2

Here we can see the packet correctly flying from ::1 to ::2, to our server. But then when the server responds, it sources the response from ::1 IP which in this case is wrong.

On the server side, when binding to a wildcard:

  • we might receive packets destined to a number of different IP addresses
  • we must be very careful when responding and use the appropriate source IP address

BSD Sockets API doesn’t make it easy to understand where the received packet was destined to. On Linux and BSD it is possible to request useful CMSG metadata with IP_RECVPKTINO and IPV6_RECVPKTINFO.

An improved server loop might look like:

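Roughly, the IPv6 version could look like this (a sketch; the numeric fallbacks for the IPV6_* constants assume the Linux values):

import socket

# Fall back to the Linux values if the socket module doesn't expose these.
IPV6_RECVPKTINFO = getattr(socket, 'IPV6_RECVPKTINFO', 49)
IPV6_PKTINFO = getattr(socket, 'IPV6_PKTINFO', 50)

sd = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
sd.setsockopt(socket.IPPROTO_IPV6, IPV6_RECVPKTINFO, 1)
sd.bind(('::', 1234))

while True:
    data, ancdata, flags, addr = sd.recvmsg(2048, socket.CMSG_SPACE(20))
    anc = []
    for level, ctype, cdata in ancdata:
        if level == socket.IPPROTO_IPV6 and ctype == IPV6_PKTINFO:
            # struct in6_pktinfo: 16-byte destination address + 4-byte ifindex
            anc = [(socket.IPPROTO_IPV6, IPV6_PKTINFO, cdata)]
    # Echo the payload back, explicitly sourcing the reply from the address
    # the client originally sent the packet to.
    sd.sendmsg([data], anc, 0, addr)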

The recvmsg and sendmsg syscalls, as opposed to recvfrom/sendto, allow the programmer to request and set extra CMSG metadata, which is very handy when dealing with UDP.

The IPV6_PKTINFO CMSG contains this data structure:

[Definition of struct in6_pktinfo from the kernel headers: a 16-byte IPv6 address (ipi6_addr) followed by a 4-byte interface index (ipi6_ifindex)]

We can find here the IP address and interface number of the packet target. Notice, there’s no place for a port number.

Graceful server restart

Many traditional UDP protocols, like DNS, are request-response based. Since there is no state associated with a higher level “connection”, the server can restart, to upgrade or change configuration, without any problems. Ideally, sockets should be managed with the usual systemd socket activation to avoid the short time window where the socket is down.

Modern protocols are often connection-based. For such servers, on restart, it’s beneficial to keep the old connections directed to the old server process, while the new server instance is available for handling the new connections. The old connections will eventually die off, and the old server process will be able to terminate. This is a common and easy practice in the TCP world where each connection has its own file descriptor. The old server process stops accept()-ing new connections and just waits for the old connections to gradually go away. NGINX has a good documentation on the subject.

Sadly, in UDP you can’t accept() new connections. Doing graceful server restarts for UDP is surprisingly hard.

Established-over-unconnected technique

For some services we are using a technique which we call “established-over-unconnected”. This comes from a realization that on Linux it’s possible to create a connected socket *over* an unconnected one. Consider this code:

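A simplified sketch of the idea (the local address is hardcoded to ::1 here, and the race handling discussed below is omitted):

import socket

sd = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind(('::', 1234))                # the unconnected "listening" socket

data, addr = sd.recvfrom(2048)       # first packet from a new client

cd = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
cd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# A real server would take the local address from IPV6_PKTINFO, as above.
cd.bind(('::1', 1234))               # same local port, over the unconnected socket
cd.connect(addr)                     # now fully connected to that one client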

Does this look hacky? Well, it should. What we do here is:

  • We start an unconnected UDP socket.
  • We wait for a client to come in.
  • As soon as we receive the first packet from the client, we immediately create a new, fully connected socket *over* the unconnected socket. It shares the same local port and local IP.

This is how it might look in ss:

marek@mrprec:~$ ss -panu sport = :1234 or dport = :1234 | cat
State     Recv-Q    Send-Q       Local Address:Port        Peer Address:Port    Process                                                                         
ESTAB     0         0                    [::1]:1234               [::1]:44592    python3
UNCONN    0         0                        *:1234                   *:*        python3
ESTAB     0         0                    [::1]:44592              [::1]:1234     nc

Here you can see the two sockets managed in our python test server. Notice the established socket is sharing the unconnected socket port.

This trick is basically reproducing the accept() behaviour in UDP, where each ingress connection gets its own dedicated socket descriptor.

While this trick is nice, it’s not without drawbacks — it’s racy in two places. First, it’s possible that the client will send more than one packet to the unconnected socket before the connected socket is created. The application code should work around it: if a packet received on the server socket belongs to an already existing connected flow, it should be handed over to the right place. Then, during the creation of the connected socket, in the short window after bind() and before connect(), we might receive unexpected packets belonging to the unconnected socket! We don’t want these packets here. It is necessary to filter the source IP/port when receiving early packets on the connected socket.

Is this approach worth the extra complexity? It depends on the use case. For a relatively small number of long-lived flows, it might be OK. For a high number of short-lived flows (especially DNS or NTP) it’s overkill.

Keeping old flows stable during service restarts is particularly hard in UDP. The established-over-unconnected technique is just one of the simpler ways of handling it. We’ll leave another technique, based on SO_REUSEPORT ebpf, for a future blog post.

Summary

In this blog post we started by highlighting connected and unconnected UDP sockets. Then we discussed why binding UDP servers to a wildcard is hard, and how IP_PKTINFO CMSG can help to solve it. We discussed the UDP graceful restart problem, and hinted on an established-over-unconnected technique.

Socket type | Created with | Appropriate syscalls
established | connect() | recv()/send()
established | bind() + connect() | recvfrom()/send(); watch out for the race after bind(), verify the source of the packet
unconnected | bind(specific IP) | recvfrom()/sendto()
unconnected | bind(wildcard) | recvmsg()/sendmsg() with IP_PKTINFO CMSG

Stay tuned, in future blog posts we might go even deeper into the curious world of production UDP servers.

Computing Euclidean distance on 144 dimensions

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/computing-euclidean-distance-on-144-dimensions/

Late last year I read a blog post about our CSAM image scanning tool. I remember thinking: this is so cool! Image processing is always hard, and deploying a real image identification system at Cloudflare is no small achievement!

Some time later, I was chatting with Kornel: “We have all the pieces in the image processing pipeline, but we are struggling with the performance of one component.” Scaling to Cloudflare needs ain’t easy!

The problem was in the speed of the matching algorithm itself. Let me elaborate. As John explained in his blog post, the image matching algorithm creates a fuzzy hash from a processed image. The hash is exactly 144 bytes long. For example, it might look like this:

00e308346a494a188e1043333147267a 653a16b94c33417c12b433095c318012
5612442030d14a4ce82c623f4e224733 1dd84436734e4a5d6e25332e507a8218
6e3b89174e30372d

The hash is designed to be used in a fuzzy matching algorithm that can find “nearby”, related images. The specific algorithm is well defined, but making it fast is left to the programmer — and at Cloudflare we need the matching to be done super fast. We want to match thousands of hashes per second, of images passing through our network, against a database of millions of known images. To make this work, we need to seriously optimize the matching algorithm.

Naive quadratic algorithm

The first algorithm that comes to mind has O(K*N) complexity: for each query, go through every hash in the database. A naive implementation creates a lot of work. But how much work exactly?

First, we need to explain how fuzzy matching works.

Given a query hash, the fuzzy match is the “closest” hash in a database. This requires us to define a distance. We treat each hash as a vector containing 144 numbers, identifying a point in a 144-dimensional space. Given two such points, we can calculate the distance using the standard Euclidean formula.

For our particular problem, though, we are interested in the “closest” match in a database only if the distance is lower than some predefined threshold. Otherwise, when the distance is large,  we can assume the images aren’t similar. This is the expected result — most of our queries will not have a related image in the database.

The Euclidean distance equation used by the algorithm is standard:

d(a, b) = sqrt( (a_1 - b_1)^2 + (a_2 - b_2)^2 + … + (a_144 - b_144)^2 )

To calculate the distance between two 144-byte hashes, we take each byte, calculate the delta, square it, sum it to an accumulator, do a square root, and ta-dah! We have the distance!

Here’s how to count the squared distance in C:

[Code: the squared-distance function in C – a loop over the 144 bytes, accumulating (a[i] - b[i])²]
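
For illustration, the equivalent computation in Python is just a few lines (a sketch, not the optimized C):

def distance_sq(a, b):
    # a and b are 144-byte hashes
    return sum((x - y) ** 2 for x, y in zip(a, b))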

This function returns the squared distance. We avoid computing the actual distance to save us from running the square root function – it’s slow. Inside the code, for performance and simplicity, we’ll mostly operate on the squared value. We don’t need the actual distance value, we just need to find the vector with the smallest one. In our case it doesn’t matter whether we compare distances or squared distances!

As you can see, fuzzy matching is basically a standard problem of finding the closest point in a multi-dimensional space. Surely this has been solved in the past — but let’s not jump ahead.

While this code might be simple, we expect it to be rather slow. Finding the smallest hash distance in a database of, say, 1M entries, would require going over all records, and would need at least:

  1. 144 * 1M subtractions
  2. 144 * 1M multiplications
  3. 144 * 1M additions

And more. This alone adds up to 432 million operations! How does it look in practice? To illustrate this blog post we prepared a full test suite. The large database of known hashes can be well emulated by random data. The query hashes can’t be random and must be slightly more sophisticated, otherwise the exercise wouldn’t be that interesting. We generated the test cases smartly, by byte-swapping actual data from the database — this allows us to precisely control the distance between test hashes and database hashes. Take a look at the scripts for details. Here’s our first run of the first, naive, algorithm:

$ make naive
< test-vector.txt ./mmdist-naive > test-vector.tmp
Total: 85261.833ms, 1536 items, avg 55.509ms per query, 18.015 qps

We matched 1,536 test hashes against a database of 1 million random vectors in 85 seconds. It took 55ms of CPU time on average to find the closest neighbour. This is rather slow for our needs.

SIMD for help

An obvious improvement is to use more complex SIMD instructions. SIMD is a way to instruct the CPU to process multiple data points using one instruction. This is a perfect strategy when dealing with vector problems — as is the case for our task.

We settled on using AVX2, with 256 bit vectors. We did this for a simple reason — newer AVX versions are not supported by our AMD CPUs. Additionally, in the past, we were not thrilled by the AVX-512 frequency scaling.

Using AVX2 is easier said than done. There is no single instruction to count Euclidean distance between two uint8 vectors! The fastest way of counting the full distance of two 144-byte vectors with AVX2 we could find is authored by Vlad:

[Code: the AVX2 inner loop – load 16 bytes, widen uint8 to int16, subtract, multiply-accumulate into int32 partial sums]

It’s actually simpler than it looks: load 16 bytes, convert vector from uint8 to int16, subtract the vector, store intermediate sums as int32, repeat. At the end, we need to do complex 4 instructions to extract the partial sums into the final sum. This AVX2 code improves the performance around 3x:

$ make naive-avx2 
Total: 25911.126ms, 1536 items, avg 16.869ms per query, 59.280 qps

We measured 17ms per item, which still falls short of our expectations. Unfortunately, we can’t push it much further without major changes. The problem is that this code is limited by memory bandwidth. The measurements come from my Intel i7-5557U CPU, which has a maximum theoretical memory bandwidth of just 25GB/s. The database of 1 million entries takes 137MiB, so it takes at least 5ms just to feed the database to my CPU. With this naive algorithm we won’t be able to go below that.

Vantage Point Tree algorithm

Since the naive brute force approach failed, we tried using more sophisticated algorithms. My colleague Kornel Lesiński implemented a super cool Vantage Point algorithm. After a few ups and downs, optimizations and rewrites, we gave up. Our problem turned out to be unusually hard for this kind of algorithm.

We observed “the curse of dimensionality”. Space partitioning algorithms don’t work well in problems with large dimensionality — and in our case, we have an enormous number of 144 dimensions. K-D trees are doomed. Locality-sensitive hashing is also doomed. It’s a bizarre situation in which the space is unimaginably vast, but everything is close together. The volume of the space is a 347-digit-long number, yet the maximum distance between points is just 3060 (that’s sqrt(255*255*144)).

Space partitioning algorithms are fast, because they gradually narrow the search space as they get closer to finding the closest point. But in our case, the common query is never close to any point in the set, so the search space can’t be narrowed to a meaningful degree.

A VP-tree was a promising candidate, because it operates only on distances, subdividing space into near and far partitions, like a binary tree. When it has a close match, it can be very fast, and doesn’t need to visit more than O(log(N)) nodes. For non-matches, its speed drops dramatically: the algorithm ends up visiting nearly half of the nodes in the tree, because everything is close together in 144 dimensions. And even though the algorithm skipped the other half of the nodes, the remaining nodes were more expensive to visit, so the search ended up being slower overall.

Smarter brute force?

This experience got us thinking. Since space partitioning algorithms can’t narrow down the search, and still need to go over a very large number of items, maybe we should focus on going over all the hashes, extremely quickly. We must be smarter about memory bandwidth though — it was the limiting factor in the naive brute force approach before.

Perhaps we don’t need to fetch all the data from memory.

Short distance

The breakthrough came from the realization that we don’t need to count the full distance between hashes. Instead, we can compute the distance over only a subset of dimensions, say 32 out of the total of 144. If this partial distance is already large, then there is no need to compute the full one! Adding more dimensions can only increase the Euclidean distance, never reduce it.

The proposed algorithm works as follows (a scalar sketch in C follows the list):

1. Take the query hash and extract a 32-byte short hash from it

2. Go over all the 1 million 32-byte short hashes from the database. They must be densely packed in the memory to allow the CPU to perform good prefetching and avoid reading data we won’t need.

3. If the distance of the 32-byte short hash is greater than or equal to the best score so far, move on

4. Otherwise, investigate the hash thoroughly and compute the full distance.
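
A scalar sketch of that loop, assuming the 32-byte short hashes are packed contiguously and separately from the full hashes (all names here are illustrative, the real code lives in the linked test suite):

#include <stddef.h>
#include <stdint.h>

#define FULL_LEN  144
#define SHORT_LEN 32

/* Squared distance over `len` bytes, as in the earlier sketch. */
static uint32_t distance_sq_n(const uint8_t *a, const uint8_t *b, int len)
{
    uint32_t sum = 0;
    for (int i = 0; i < len; i++) {
        int delta = (int)a[i] - (int)b[i];
        sum += (uint32_t)(delta * delta);
    }
    return sum;
}

/* Returns the index of the closest hash, or (size_t)-1 if none was found. */
static size_t find_closest(const uint8_t *query,   /* FULL_LEN bytes        */
                           const uint8_t *shorts,  /* n * SHORT_LEN, packed */
                           const uint8_t *fulls,   /* n * FULL_LEN          */
                           size_t n)
{
    uint32_t best = UINT32_MAX;   /* best squared distance so far */
    size_t best_idx = (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        /* Steps 2-3: cheap partial distance over the first SHORT_LEN bytes. */
        uint32_t short_d = distance_sq_n(query, shorts + i * SHORT_LEN,
                                         SHORT_LEN);
        if (short_d >= best)
            continue;             /* can't beat the best score, move on */
        /* Step 4: promising candidate, compute the full distance. */
        uint32_t full_d = distance_sq_n(query, fulls + i * FULL_LEN, FULL_LEN);
        if (full_d < best) {
            best = full_d;
            best_idx = i;
        }
    }
    return best_idx;
}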

Even though this algorithm needs to do less arithmetic and memory work, it’s not faster than the previous naive one. See make short-avx2. The problem is: we still need to compute a full distance for hashes that are promising, and there are quite a lot of them. Computing the full distance for promising hashes adds enough work, both in ALU and memory latency, to offset the gains of this algorithm.

There is one detail of our particular application of the image matching problem that will help us a lot moving forward. As we described earlier, the problem is less about finding the closest neighbour and more about proving that no neighbour within a reasonable distance exists. Remember — in practice, we don’t expect to find many matches! We expect almost every image we feed into the algorithm to be unrelated to the image hashes stored in the database.

It’s sufficient for our algorithm to prove that no neighbour exists within a predefined distance threshold. Let’s assume we are not interested in hashes more distant than, say, 220, which squared is 48,400. This makes our short-distance algorithm variation work much better:

$ make short-avx2-threshold
Total: 4994.435ms, 1536 items, avg 3.252ms per query, 307.542 qps
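
In the scalar sketch above, adopting such a threshold boils down to changing the starting best score (same illustrative names as before):

/* Reject anything farther than 220 right away: 220^2 = 48400.
 * If best_idx is still (size_t)-1 at the end, there is no match
 * within the threshold. */
uint32_t best = 220u * 220u;   /* 48400, instead of UINT32_MAX */

With a fixed bound, the cheap short-distance check can discard most candidates immediately, which is what makes this variation so much faster.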

Origin distance variation


At some point, John noted that the threshold allows an additional optimization. We can order the hashes by their distance from some origin point. Given a query hash with an origin distance of A, we only need to inspect hashes whose origin distance falls between A-threshold and A+threshold; by the triangle inequality, anything outside that band can’t be within the threshold of the query. This is pretty much how each level of a Vantage Point Tree works, just simplified. This optimization — ordering items in the database by their distance from an origin point — is relatively simple and can save us a bit of work.

While great on paper, this method doesn’t introduce much gain in practice, as the vectors are not grouped in clusters — they are pretty much random! For the threshold values we are interested in, the origin distance algorithm variation gives us ~20% speed boost, which is okay but not breathtaking. This change might bring more benefits if we ever decide to reduce the threshold value, so it might be worth doing for production implementation. However, it doesn’t work well with query batching.
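
One way the band lookup could work, assuming the database was pre-sorted by origin distance and we keep those distances in a parallel sorted array (names are illustrative):

#include <stddef.h>

/* First index whose value is >= v, in an ascending float array. */
static size_t lower_bound_f(const float *arr, size_t n, float v)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] < v) lo = mid + 1; else hi = mid;
    }
    return lo;
}

/* Only hashes whose origin distance lies in [a - threshold, a + threshold]
 * can possibly be within `threshold` of a query with origin distance `a`
 * (triangle inequality); everything else can be skipped. */
static void candidate_range(const float *origin_dist, size_t n,
                            float a, float threshold,
                            size_t *start, size_t *end)
{
    *start = lower_bound_f(origin_dist, n, a - threshold);
    *end   = lower_bound_f(origin_dist, n, a + threshold);
}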

Transposing data for better AVX

But we’re not done with AVX optimizations! The usual problem with AVX is that the instructions don’t normally fit a specific problem. Some serious mind twisting is required to adapt the right instruction to the problem, or to reverse the problem so that a specific instruction can be used. AVX2 doesn’t have useful “horizontal” uint16 subtract, multiply and add operations. For example, _mm_hadd_epi16 exists, but it’s slow and cumbersome.

Instead, we can twist the problem to make use of the fast uint16 operations that are available. For example we can use:

  1. _mm256_sub_epi16
  2. _mm256_mullo_epi16
  3. and _mm256_add_epi16 (plain addition works the same for signed and unsigned lanes).

The plain add would overflow in our case, but fortunately there is a saturating variant, _mm256_adds_epu16.

The saturated add is great and saves us conversion to uint32. It just adds a small limitation: the threshold passed to the program (i.e., the max squared distance) must fit into uint16. However, this is fine for us.

To effectively use these instructions we need to transpose the data in the database. Instead of storing hashes in rows, we can store them in columns:

Computing Euclidean distance on 144 dimensions

So instead of:

  1. [a1, a2, a3],
  2. [b1, b2, b3],
  3. [c1, c2, c3],

We can lay it out in memory transposed:

  1. [a1, b1, c1],
  2. [a2, b2, c2],
  3. [a3, b3, c3],
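
Building that layout is a one-off preprocessing pass over the database. A sketch, with illustrative names (SHORT_LEN is the length of the short hashes we scan, 32 in our case):

#include <stddef.h>
#include <stdint.h>

#define SHORT_LEN 32

/* Transpose n short hashes from row-major (hash after hash) into
 * column-major (dimension after dimension), so that byte d of hashes
 * j..j+15 ends up as 16 consecutive bytes in memory. */
static void transpose_shorts(const uint8_t *rows,  /* n * SHORT_LEN */
                             uint8_t *cols,        /* SHORT_LEN * n */
                             size_t n)
{
    for (size_t j = 0; j < n; j++)
        for (size_t d = 0; d < SHORT_LEN; d++)
            cols[d * n + j] = rows[j * SHORT_LEN + d];
}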

Now we can load the first bytes of 16 consecutive hashes with one memory operation. In the next step, we can subtract the first byte of the query hash from all of them using a single instruction, and so on. The algorithm stays exactly the same as defined above; we just make the data easier to load and easier to process for AVX.

The hot loop code even looks relatively pretty:

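The exact code was published as an image, so here is only a sketch of the same idea on the transposed layout, with illustrative names and a batch of 16 hashes per call:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define SHORT_LEN 32   /* length of the short hashes, as before */

/* Short squared distances of one query against hashes j..j+15 at once,
 * using the transposed (column-major) layout. The 16 running sums live
 * in a single AVX2 register as uint16 lanes; the saturating add is why
 * the squared threshold has to fit into uint16. */
static void short_distances_16(const uint8_t *cols, size_t n, size_t j,
                               const uint8_t *query, uint16_t out[16])
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t d = 0; d < SHORT_LEN; d++) {
        /* 16 consecutive bytes: dimension d of hashes j..j+15 */
        __m128i col8 = _mm_loadu_si128((const __m128i *)(cols + d * n + j));
        __m256i col  = _mm256_cvtepu8_epi16(col8);     /* widen to 16 bits */
        __m256i q    = _mm256_set1_epi16(query[d]);    /* broadcast query  */
        __m256i diff = _mm256_sub_epi16(col, q);       /* deltas           */
        __m256i sq   = _mm256_mullo_epi16(diff, diff); /* squares <= 65025 */
        acc = _mm256_adds_epu16(acc, sq);              /* saturating sum   */
    }
    _mm256_storeu_si256((__m256i *)out, acc);
}

Each output lane holds the (possibly saturated) short squared distance of one hash; only the lanes that come in under the squared threshold get the full-distance treatment.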

With well-tuned batch size and short-distance length parameters, here is the performance of this algorithm:

$ make short-inv-avx2
Total: 1118.669ms, 1536 items, avg 0.728ms per query, 1373.062 qps

Whoa! This is pretty awesome. We started from 55ms per query, and we finished with just 0.73ms. There are further micro-optimizations possible, like memory prefetching or using huge pages to reduce page faults, but they have diminishing returns at this point.

Roofline model from Denis Bakhvalov’s book

If you are interested in architectural tuning such as this, take a look at the new performance book by Denis Bakhvalov. It discusses roofline model analysis, which is pretty much what we did here.

Do take a look at our code and tell us if we missed some optimization!

Summary

What an optimization journey! We bounced between memory-bound and ALU-bound code. We discussed more sophisticated algorithms, but in the end, a brute force algorithm — although tuned — gave us the best results.

To get even better numbers, I experimented with an Nvidia GPU using CUDA. The CUDA intrinsics like vabsdiff4 and dp4a fit the problem perfectly. The V100 gave us some amazing numbers, but I wasn’t fully satisfied with it. Considering how many AMD Ryzen cores with AVX2 we can get for the cost of a single server-grade GPU, we leaned towards general purpose computing for this particular problem.

This is a great example of the type of complexities we deal with every day. Making even the best technologies work “at Cloudflare scale” requires thinking outside the box. Sometimes we rewrite the solution dozens of times before we find the optimal one. And sometimes we settle on a brute-force algorithm, just very very optimized.

The computation of hashes and image matching are challenging problems that require very CPU-intensive operations. The CPU time we have available on the edge is scarce, and workloads like this are incredibly expensive. Even with the optimization work described in this blog post, running the CSAM scanner at scale is a challenge and has required a huge engineering effort. And we’re not done! We need to solve more hard problems before we’re satisfied. If you want to help, consider applying!