Tag Archives: TCP

A Socket API that works across JavaScript runtimes — announcing a WinterCG spec and Node.js implementation of connect()

Post Syndicated from Dominik Picheta original http://blog.cloudflare.com/socket-api-works-javascript-runtimes-wintercg-polyfill-connect/

A Socket API that works across JavaScript runtimes — announcing a WinterCG spec and Node.js implementation of connect()

A Socket API that works across JavaScript runtimes — announcing a WinterCG spec and Node.js implementation of connect()

Earlier this year, we announced a new API for creating outbound TCP socketsconnect(). From day one, we’ve been working with the Web-interoperable Runtimes Community Group (WinterCG) community to chart a course toward making this API a standard, available across all runtimes and platforms — including Node.js.

Today, we’re sharing that we’ve reached a new milestone in the path to making this API available across runtimes — engineers from Cloudflare and Vercel have published a draft specification of the connect() sockets API for review by the community, along with a Node.js compatible implementation of the connect() API that developers can start using today.

This implementation helps both application developers and maintainers of libraries and frameworks:

  1. Maintainers of existing libraries that use the node:net and node:tls APIs can use it to more easily add support for runtimes where node:net and node:tls are not available.
  2. JavaScript frameworks can use it to make connect() available in local development, making it easier for application developers to target runtimes that provide connect().

Why create a new standard? Why connect()?

As we described when we first announced connect(), to-date there has not been a standard API across JavaScript runtimes for creating and working with TCP or UDP sockets. This makes it harder for maintainers of open-source libraries to ensure compatibility across runtimes, and ultimately creates friction for application developers who have to navigate which libraries work on which platforms.

While Node.js provides the node:net and node:tls APIs, these APIs were designed over 10 years ago in the very early days of the Node.js project and remain callback-based. As a result, they can be hard to work with, and expose configuration in ways that don’t fit serverless platforms or web browsers.

The connect() API fills this gap by incorporating the best parts of existing socket APIs and prior proposed standards, based on feedback from the JavaScript community — including contributors to Node.js. Libraries like pg (node-postgres on Github) are already using the connect() API.

The connect() specification

At time of writing, the draft specification of the Sockets API defines the following API:

dictionary SocketAddress {
  DOMString hostname;
  unsigned short port;
};

typedef (DOMString or SocketAddress) AnySocketAddress;

enum SecureTransportKind { "off", "on", "starttls" };

[Exposed=*]
dictionary SocketOptions {
  SecureTransportKind secureTransport = "off";
  boolean allowHalfOpen = false;
};

[Exposed=*]
interface Connect {
  Socket connect(AnySocketAddress address, optional SocketOptions opts);
};

interface Socket {
  readonly attribute ReadableStream readable;
  readonly attribute WritableStream writable;

  readonly attribute Promise<undefined> closed;
  Promise<undefined> close();

  Socket startTls();
};

The proposed API is Promise-based and reuses existing standards whenever possible. For example, ReadableStream and WritableStream are used for the read and write ends of the socket. This makes it easy to pipe data from a TCP socket to any other library or existing code that accepts a ReadableStream as input, or to write to a TCP socket via a WritableStream.

The entrypoint of the API is the connect() function, which takes a string containing both the hostname and port separated by a colon, or an object with discrete hostname and port fields. It returns a Socket object which represents a socket connection. An instance of this object exposes attributes and methods for working with the connection.

A connection can be established in plain-text or TLS mode, as well as a special “starttls” mode which allows the socket to be easily upgraded to TLS after some period of plain-text data transfer, by calling the startTls() method on the Socket object. No need to create a new socket or switch to using a separate set of APIs once the socket is upgraded to use TLS.

For example, to upgrade a socket using the startTLS pattern, you might do something like this:

import { connect } from "@arrowood.dev/socket"

const options = { secureTransport: "starttls" };
const socket = connect("address:port", options);
const secureSocket = socket.startTls();
// The socket is immediately writable
// Relies on web standard WritableStream
const writer = secureSocket.writable.getWriter();
const encoder = new TextEncoder();
const encoded = encoder.encode("hello");
await writer.write(encoded);

Equivalent code using the node:net and node:tls APIs:

import net from 'node:net'
import tls from 'node:tls'

const socket = new net.Socket(HOST, PORT);
socket.once('connect', () => {
  const options = { socket };
  const secureSocket = tls.connect(options, () => {
    // The socket can only be written to once the
    // connection is established.
    // Polymorphic API, uses Node.js streams
    secureSocket.write('hello');
  }
})

Use the Node.js implementation of connect() in your library

To make it easier for open-source library maintainers to adopt the connect() API, we’ve published an implementation of connect() in Node.js that allows you to publish your library such that it works across JavaScript runtimes, without having to maintain any runtime-specific code.

To get started, install it as a dependency:

npm install --save @arrowood.dev/socket

And import it in your library or application:

import { connect } from "@arrowood.dev/socket"

What’s next for connect()?

The wintercg/proposal-sockets-api is published as a draft, and the next step is to solicit and incorporate feedback. We’d love your feedback, particularly if you maintain an open-source library or make direct use of the node:net or node:tls APIs.

Once feedback has been incorporated, engineers from Cloudflare, Vercel and beyond will be continuing to work towards contributing an implementation of the API directly to Node.js as a built-in API.

Unbounded memory usage by TCP for receive buffers, and how we fixed it

Post Syndicated from Mike Freemon original http://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

Unbounded memory usage by TCP for receive buffers, and how we fixed it

Unbounded memory usage by TCP for receive buffers, and how we fixed it

At Cloudflare, we are constantly monitoring and optimizing the performance and resource utilization of our systems. Recently, we noticed that some of our TCP sessions were allocating more memory than expected.

The Linux kernel allows TCP sessions that match certain characteristics to ignore memory allocation limits set by autotuning and allocate excessive amounts of memory, all the way up to net.ipv4.tcp_rmem max (the per-session limit). On Cloudflare’s production network, there are often many such TCP sessions on a server, causing the total amount of allocated TCP memory to reach net.ipv4.tcp_mem thresholds (the server-wide limit). When that happens, the kernel imposes memory use constraints on all TCP sessions, not just the ones causing the problem. Those constraints have a negative impact on throughput and latency for the user. Internally within the kernel, the problematic sessions trigger TCP collapse processing, “OFO” pruning (dropping of packets already received and sitting in the out-of-order queue), and the dropping of newly arriving packets.

This blog post describes in detail the root cause of the problem and shows the test results of a solution.

TCP receive buffers are excessively big for some sessions

Our journey began when we started noticing a lot of TCP sessions on some servers with large amounts of memory allocated for receive buffers.  Receive buffers are used by Linux to hold packets that have arrived from the network but have not yet been read by the local process.

Digging into the details, we observed that most of those TCP sessions had a latency (RTT) of roughly 20ms. RTT is the round trip time between the endpoints, measured in milliseconds. At that latency, standard BDP calculations tell us that a window size of 2.5 MB can accommodate up to 1 Gbps of throughput. We then counted the number of TCP sessions with an upper memory limit set by autotuning (skmem_rb) greater than 5 MB, which is double our calculated window size. The relationship between the window size and skmem_rb is described in more detail here.  There were 558 such TCP sessions on one of our servers. Most of those sessions looked similar to this:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

The key fields to focus on above are:

  • recvq – the user payload bytes in the receive queue (waiting to be read by the local userspace process)
  • skmem “r” field – the actual amount of kernel memory allocated for the receive buffer (this is the same as the kernel variable sk_rmem_alloc)
  • skmem “rb” field – the limit for “r” (this is the same as the kernel variable sk_rcvbuf)
  • l7read – the user payload bytes read by the local userspace process

Note the value of 256MiB for skmem_r and skmem_rb. That is the red flag that something is very wrong, because those values match the system-wide maximum value set by sysctl net.ipv4.tcp_rmem.  Linux autotuning should not permit the buffers to grow that large for these sessions.

Memory limits are not being honored for some TCP sessions

TCP autotuning sets the maximum amount of memory that a session can use. More information about Linux autotuning can be found at Optimizing TCP for high WAN throughput while preserving low latency.

Here is a graph of one of the problematic sessions, showing skmem_r (allocated memory) and skmem_rb (the limit for “r”) over time:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

This graph is showing us that the limit being set by autotuning is being ignored, because every time skmem_r exceeds skmem_rb, skmem_rb is simply being raised to match it. So something is wrong with how skmem_rb is being handled. This explains the high memory usage. The question now is why.

The reproducer

At this point, we had only observed this problem in our production environment. Because we couldn’t predict which TCP sessions would fall into this dysfunctional state, and because we wanted to see the session information for these dysfunctional sessions from the beginning of those sessions, we needed to collect a lot of TCP session data for all TCP sessions. This is challenging in a production environment running at the scale of Cloudflare’s network. We needed to be able to reproduce this in a controlled lab environment. To that end, we gathered more details about what distinguishes these problematic TCP sessions from others, and ran a large number of experiments in our lab environment to reproduce the problem.

After a lot of attempts, we finally got it.

We were left with some pretty dirty lab machines by the time we got to this point, meaning that a lot of settings had been changed. We didn’t believe that all of them were related to the problem, but we didn’t know which ones were and which were not. So we went through a further series of tests to get us to a minimal set up to reproduce the problem. It turned out that a number of factors that we originally thought were important (such as latency) were not important.

The minimal set up turned out to be surprisingly simple:

  1. At the sending host, run a TCP program with an infinite loop, sending 1500B packets, with a 1 ms delay between each send.
  2. At the receiving host, run a TCP program with an infinite loop, reading 1B at a time, with a 1 ms delay between each read.

That’s it. Run these programs and watch your receive queue grow unbounded until it hits net.ipv4.tcp_rmem max.

tcp_server_sender.py

import time
import socket
import errno

daemon_port = 2425
payload = b'a' * 1448

listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listen_sock.bind(('0.0.0.0', daemon_port))

# listen backlog
listen_sock.listen(32)
listen_sock.setblocking(True)

while True:
    mysock, _ = listen_sock.accept()
    mysock.setblocking(True)
    
    # do forever (until client disconnects)
    while True:
        try:
            mysock.send(payload)
            time.sleep(0.001)
        except Exception as e:
            print(e)
            mysock.close()
            break

tcp_client_receiver.py

import socket
import time

def do_read(bytes_to_read):
    total_bytes_read = 0
    while True:
        bytes_read = client_sock.recv(bytes_to_read)
        total_bytes_read += len(bytes_read)
        if total_bytes_read >= bytes_to_read:
            break

server_ip = “192.168.2.139”
server_port = 2425

client_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client_sock.connect((server_ip, server_port))
client_sock.setblocking(True)

while True:
    do_read(1)
    time.sleep(0.001)

Reproducing the problem

First, we ran the above programs with these settings:

  • Kernel 6.1.14 vanilla
  • net.ipv4.tcp_rmem max = 256 MiB (window scale factor 13, or 8192 bytes)
  • net.ipv4.tcp_adv_win_scale = -2

Here is what this TCP session is doing:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

At second 189 of the run, we see these packets being exchanged:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

This is a significant failure because the memory limits are being ignored, and memory usage is unbounded until net.ipv4.tcp_rmem max is reached.

When net.ipv4.tcp_rmem max is reached:

  • The kernel drops incoming packets.
  • A ZeroWindow is never sent.  A ZeroWindow is a packet sent by the receiver to the sender telling the sender to stop sending packets.  This is normal and expected behavior when the receiver buffers are full.
  • The sender retransmits, with exponential backoff.
  • Eventually (~15 minutes, depending on system settings) the session times out and the connection is broken (“Errno 110 Connection timed out”).

Note that there is a range of packet sizes that can be sent, and a range of intervals which can be used for the delays, to cause this abnormal condition. This first reproduction is intentionally defined to grow the receive buffer quickly. These rates and delays do not reflect exactly what we see in production.

A closer look at real traffic in production

The prior section describes what is happening in our lab systems. Is that consistent with what we see in our production streams? Let’s take a look, now that we know more about what we are looking for.

We did find similar TCP sessions on our production network, which provided confirmation. But we also found this one, which, although it looks a little different, is actually the same root cause:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

During this TCP session, the rate at which the userspace process is reading from the socket (the L7read rate line) after second 411 is zero. That is, L7 stops reading entirely at that point.

Notice that the bottom two graphs have a log scale on their y-axis to show that throughput and window size are never zero, even after L7 stops reading.

Here is the pattern of packet exchange that repeats itself during the erroneous “growth phase” after L7 stopped reading at the 411 second mark:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

This variation of the problem is addressed below in the section called “Reader never reads”.

Getting to the root cause

sk_rcvbuf is being increased inappropriately. Somewhere. Let’s review the code to narrow down the possibilities.

sk_rcvbuf only gets updated in three places (that are relevant to this issue):

Actually, we are not calling tcp_set_rcvlowat, which eliminates that one. Next we used bpftrace scripts to figure out if it’s in tcp_clamp_window or tcp_rcv_space_adjust.   After bpftracing, the answer is: It’s tcp_clamp_window.

Summarizing what we know so far,
part I

tcp_try_rmem_schedule is being called as usual.

Unbounded memory usage by TCP for receive buffers, and how we fixed it

Sometimes rmem_alloc > sk_rcvbuf. When that happens, prune is called, which calls tcp_clamp_window. tcp_clamp_window increases sk_rcvbuf to match rmem_alloc. That is unexpected.

The key question is: Why is rmem_alloc > sk_rcvbuf?

Why is rmem_alloc > sk_rcvbuf?

More kernel code review ensued, reviewing all the places where rmem_alloc is increased, and looking to see where rmem_alloc could be exceeding sk_rcvbuf. After more bpftracing, watching netstats, etc., the answer is: TCP coalescing.

TCP coalescing

Coalescing is where the kernel will combine packets as they are being received.

Note that this is not Generic Receive Offload (GRO).  This is specific to TCP for packets on the INPUT path. Coalesce is a L4 feature that appends user payload from an incoming packet to an already existing packet, if possible. This saves memory (header space).

tcp_rcv_established calls tcp_queue_rcv, which calls tcp_try_coalesce. If the incoming packet can be coalesced, then it will be, and rmem_alloc is raised to reflect that. Here’s the important part: rmem_alloc can and does go above sk_rcvbuf because of the logic in that routine.

Summarizing what we know so far,
part II

  1. Data packets are being received
  2. tcp_rcv_established will coalesce, raising rmem_alloc above sk_rcvbuf
  3. tcp_try_rmem_schedule -> tcp_prune_queue -> tcp_clamp_window will raise sk_rcvbuf to match rmem_alloc
  4. The kernel then increases the window size based upon the new sk_rcvbuf value

In step 2, in order for rmem_alloc to exceed sk_rcvbuf, it has to be near sk_rcvbuf in the first place. We use tcp_adv_win_scale of -2, which means the window size will be 25% of the available buffer size, so we would not expect rmem_alloc to even be close to sk_rcvbuf. In our tests, the truesize ratio is not close to 4, so something unexpected is happening.

Why is rmem_alloc even close to sk_rcvbuf?

Why is rmem_alloc close to sk_rcvbuf?

Sending a ZeroWindow (a packet advertising a window size of zero) is how a TCP receiver tells a TCP sender to stop sending when the receive window is full. This is the mechanism that should keep rmem_alloc well below sk_rcvbuf.

During our tests, we happened to notice that the SNMP metric TCPWantZeroWindowAdv was increasing. The receiver was not sending ZeroWindows when it should have been.  So our attention fell on the window calculation logic, and we arrived at the root cause of all of our problems.

The root cause

The problem has to do with how the receive window size is calculated. This is the value in the TCP header that the receiver sends to the sender. Together with the ACK value, it communicates to the sender what the right edge of the window is.

The way TCP’s sliding window works is described in Stevens, “TCP/IP Illustrated, Volume 1”, section 20.3.  Visually, the receive window looks like this:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

In the early days of the Internet, wide-area communications links offered low bandwidths (relative to today), so the 16 bits in the TCP header was more than enough to express the size of the receive window needed to achieve optimal throughput. Then the future happened, and now those 16-bit window values are scaled based upon a multiplier set during the TCP 3-way handshake.

The window scaling factor allows us to reach high throughputs on modern networks, but it also introduced an issue that we must now discuss.

The granularity of the receive window size that can be set in the TCP header is larger than the granularity of the actual changes we sometimes want to make to the size of the receive window.

When window scaling is in effect, every time the receiver ACKs some data, the receiver has to move the right edge of the window either left or right. The only exception would be if the amount of ACKed data is exactly a multiple of the window scale factor, and the receive window size specified in the ACK packet was reduced by the same multiple. This is rare.

So the right edge has to move. Most of the time, the receive window size does not change and the right edge moves to the right in lockstep with the ACK (the left edge), which always moves to the right.

The receiver can decide to increase the size of the receive window, based on its normal criteria, and that’s fine. It just means the right edge moves farther to the right. No problems.

But what happens when we approach a window full condition? Keeping the right edge unchanged is not an option.  We are forced to make a decision. Our choices are:

  • Move the right edge to the right
  • Move the right edge to the left

But if we have arrived at the upper limit, then moving the right edge to the right requires us to ignore the upper limit. This is equivalent to not having a limit. This is what Linux does today, and is the source of the problems described in this post.

This occurs for any window scaling factor greater than one. This means everyone.

A sidebar on terminology

The window size specified in the TCP header is the receive window size. It is sent from the receiver to the sender. The ACK number plus the window size defines the range of sequence numbers that the sender may send. It is also called the advertised window, or the offered window.

There are three terms related to TCP window management that are important to understand:

  • Closing the window. This is when the left edge of the window moves to the right. This occurs every time an ACK of a data packet arrives at the sender.
  • Opening the window. This is when the right edge of the window moves to the right.
  • Shrinking the window. This is when the right edge of the window moves to the left.

Opening and shrinking is not the same thing as the receive window size in the TCP header getting larger or smaller. The right edge is defined as the ACK number plus the receive window size. Shrinking only occurs when that right edge moves to the left (i.e. gets reduced).

RFC 7323 describes window retraction. Retracting the window is the same as shrinking the window.

Discussion Regarding Solutions

There are only three options to consider:

  1. Let the window grow
  2. Drop incoming packets
  3. Shrink the window

Let the window grow

Letting the window grow is the same as ignoring the memory limits set by autotuning. It results in allocating excessive amounts of memory for no reason. This is really just kicking the can down the road until allocated memory reaches net.ipv4.tcp_rmem max, when we are forced to choose from among one of the other two options.

Drop incoming packets

Dropping incoming packets will cause the sender to retransmit the dropped packets, with exponential backoff, until an eventual timeout (depending on the client read rate), which breaks the connection.  ZeroWindows are never sent.  This wastes bandwidth and processing resources by retransmitting packets we know will not be successfully delivered to L7 at the receiver.  This is functionally incorrect for a window full situation.

Shrink the window

Shrinking the window involves moving the right edge of the window to the left when approaching a window full condition.  A ZeroWindow is sent when the window is full.  There is no wasted memory, no wasted bandwidth, and no broken connections.

The current situation is that we are letting the window grow (option #1), and when net.ipv4.tcp_rmem max is reached, we are dropping packets (option #2).

We need to stop doing option #1.  We could either drop packets (option #2) when sk_rcvbuf is reached.  This avoids excessive memory usage, but is still functionally incorrect for a window full situation.  Or we could shrink the window (option #3).

Shrinking the window

It turns out that this issue has already been addressed in the RFC’s.

RFC 7323 says:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

There are two elements here that are important.

  • “there are instances when a retracted window can be offered”
  • “Implementations MUST ensure that they handle a shrinking window”

Appendix F of that RFC describes our situation, adding:

  • “This is a general problem and can happen any time the sender does a write, which is smaller than the window scale factor.”

Kernel patch

The Linux kernel patch we wrote to enable TCP window shrinking can be found here.  This patch will also be submitted upstream.

Rerunning the test above with kernel patch

Here is the test we showed above, but this time using the kernel patch:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

Here is the pattern of packet exchanges that repeat when using the kernel patch:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

We see that the memory limit is being honored, ZeroWindows are being sent, there are no retransmissions, and no disconnects after 15 minutes. This is the desired result.

Test results using a TCP window scaling factor of 8

The window scaling factor of 8 and tcp_adv_win_scale of 1 is commonly seen on the public Internet, so let’s test that.

  • kernel 6.1.14 vanilla
  • tcp_rmem max = 8 MiB (window scale factor 8, or 256 bytes)
  • tcp_adv_win_scale = 1

Without the kernel patch

Unbounded memory usage by TCP for receive buffers, and how we fixed it

At the ~2100 second mark, we see the same problems we saw earlier when using wscale 13.

With the kernel patch

Unbounded memory usage by TCP for receive buffers, and how we fixed it

The kernel patch is working as expected.

Test results using an oscillating reader

This is a test run where the reader alternates every 240 seconds between reading slow and reading fast.  Slow is 1B every 1 ms and fast is 3300B every 1 ms.

  • kernel 6.1.14 vanilla
  • net.ipv4.tcp_rmem max = 256 MiB (window scale factor 13, or 8192 bytes)
  • tcp_adv_win_scale = -2

Without the kernel patch

Unbounded memory usage by TCP for receive buffers, and how we fixed it

With the kernel patch

Unbounded memory usage by TCP for receive buffers, and how we fixed it

The kernel patch is working as expected.

NB. We do see the increase of skmem_rb at the 720 second mark, but it only goes to ~20MB and does not grow unbounded. Whether or not 20MB is the most ideal value for this TCP session is an interesting question, but that is a topic for a different blog post.

Reader never reads

Here’s a good one. Say a reader never reads from the socket. How much TCP receive buffer memory would we expect that reader to consume? One might assume the answer is that the reader would read a few packets, store the payload in the receive queue, then pause the flow of packets until the userspace program starts reading.  The actual answer is that the reader will read packets until the receive queue grows to the size of net.ipv4.tcp_rmem max.  This is incorrect behavior, to say the very least.

For this test, the sender sends 4 bytes every 1 ms.  The reader, literally, never reads from the socket. Not once.

  • kernel 6.1.14 vanilla
  • net.ipv4.tcp_rmem max = 8 MiB (window scale factor 8, or 256 bytes)
  • net.ipv4.tcp_adv_win_scale = -2

Without the kernel patch:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

With the kernel patch:

Unbounded memory usage by TCP for receive buffers, and how we fixed it

Using the kernel patch produces the expected behavior.

Results from the Cloudflare production network

We deployed this patch to the Cloudflare production network, and can see the effects in aggregate when running at scale.

Packet Drop Rates

This first graph shows RcvPruned, which shows how many incoming packets per second were dropped due to memory constraints.

Unbounded memory usage by TCP for receive buffers, and how we fixed it

The patch was enabled on most servers on 05/01 at 22:00, eliminating those drops.

TCPRcvCollapsed

Recall that TCPRcvCollapsed is the number of packets per second that are merged together in the queue in order to reduce memory usage (by eliminating header metadata).  This occurs when memory limits are reached.

Unbounded memory usage by TCP for receive buffers, and how we fixed it

The patch was enabled on most servers on 05/01 at 22:00. These graphs show the results from one of our data centers. The upper graph shows that the patch has eliminated all collapse processing. The lower graph shows the amount of time spent in collapse processing (each line in the lower graph is a single server). This is important because it can impact Cloudflare’s responsiveness in processing HTTP requests.  The result of the patch is that all latency due to TCP collapse processing has been eliminated.

Memory

Because the memory limits set by autotuning are now being enforced, the total amount of memory allocated is reduced.

Unbounded memory usage by TCP for receive buffers, and how we fixed it

In this graph, the green line shows the total amount of memory allocated for TCP buffers in one of our data centers.  This is with the patch enabled.  The purple line is the same total, but from exactly 7 days prior to the time indicated on the x axis, before the patch was enabled.  Using this approach to visualization, it is clear to see the memory saved with the patch enabled.

ZeroWindows

TCPWantZeroWindowAdv is the number of times per second that the window calculation based on available buffer memory produced a result that should have resulted in a ZeroWindow being sent to the sender, but was not.  In other words, this is how often the receive buffer was increased beyond the limit set by autotuning.

After a receiver has sent a Zero Window to the sender, the receiver is not expecting to get any additional data from the sender. Should additional data packets arrive at the receiver during the period when the window size is zero, those packets are dropped and the metric TCPZeroWindowDrop is incremented.  These dropped packets are usually just due to the timing of these events, i.e. the Zero Window packet in one direction and some data packets flowing in the other direction passed by each other on the network.

Unbounded memory usage by TCP for receive buffers, and how we fixed it

The patch was enabled on most servers on 05/01 at 22:00, although it was enabled for a subset of servers on 04/26 and 04/28.

The upper graph tells us that ZeroWindows are indeed being sent when they need to be based on the available memory at the receiver.  This is what the lack of “Wants” starting on 05/01 is telling us.

The lower graph reports the packets that are dropped because the session is in a ZeroWindow state. These are ok to drop, because the session is in a ZeroWindow state. These drops do not have a negative impact, for the same reason (it’s in a ZeroWindow state).

All of these results are as expected.

Importantly, we have also not found any peer TCP stacks that are non-RFC compliant (i.e. that are not able to accept a shrinking window).

Summary

In this blog post, we described when and why TCP memory limits are not being honored in the Linux kernel, and introduced a patch that fixes it. All in a day’s work at Cloudflare, where we are helping build a better Internet.

Announcing connect() — a new API for creating TCP sockets from Cloudflare Workers

Post Syndicated from Brendan Irvine-Broque original http://blog.cloudflare.com/workers-tcp-socket-api-connect-databases/

Announcing connect() — a new API for creating TCP sockets from Cloudflare Workers

Announcing connect() — a new API for creating TCP sockets from Cloudflare Workers

Today, we are excited to announce a new API in Cloudflare Workers for creating outbound TCP sockets, making it possible to connect directly to any TCP-based service from Workers.

Standard protocols including SSH, MQTT, SMTP, FTP, and IRC are all built on top of TCP. Most importantly, nearly all applications need to connect to databases, and most databases speak TCP. And while Cloudflare D1 works seamlessly on Workers, and some hosted database providers allow connections over HTTP or WebSockets, the vast majority of databases, both relational (SQL) and document-oriented (NoSQL), require clients to connect by opening a direct TCP “socket”, an ongoing two-way connection that is used to send queries and receive data. Now, Workers provides an API for this, the first of many steps to come in allowing you to use any database or infrastructure you choose when building full-stack applications on Workers.

Database drivers, the client code used to connect to databases and execute queries, are already using this new API. pg, the most widely used JavaScript database driver for PostgreSQL, works on Cloudflare Workers today, with more database drivers to come.

The TCP Socket API is available today to everyone. Get started by reading the TCP Socket API docs, or connect directly to any PostgreSQL database from your Worker by following this guide.

First — what is a TCP Socket?

TCP (Transmission Control Protocol) is a foundational networking protocol of the Internet. It is the underlying protocol that is used to make HTTP requests (prior to HTTP/3, which uses QUIC), to send email over SMTP, to query databases using database–specific protocols like MySQL, and many other application-layer protocols.

A TCP socket is a programming interface that represents a two-way communication connection between two applications that have both agreed to “speak” over TCP. One application (ex: a Cloudflare Worker) initiates an outbound TCP connection to another (ex: a database server) that is listening for inbound TCP connections. Connections are established by negotiating a three-way handshake, and after the handshake is complete, data can be sent bi-directionally.

Announcing connect() — a new API for creating TCP sockets from Cloudflare Workers

A socket is the programming interface for a single TCP connection — it has both a readable and writable “stream” of data, allowing applications to read and write data on an ongoing basis, as long as the connection remains open.

connect() — A simpler socket API

With Workers, we aim to support standard APIs that are supported across browsers and non-browser environments wherever possible, so that as many NPM packages as possible work on Workers without changes, and package authors don’t have to write runtime-specific code. But for TCP sockets, we faced a challenge — there was no clear shared standard across runtimes. Node.js provides the net and tls APIs, but Deno implements a different API — Deno.connect. And web browsers do not provide a raw TCP socket API, though a WICG proposal does exist, and it is different from both Node.js and Deno.

We also considered how a TCP socket API could be designed to maximize performance and ergonomics in a serverless environment. Most networking APIs were designed well before serverless emerged, with the assumption that the developer’s application is also the server, responsible for directly handling configuring TLS options and credentials.

With this backdrop, we reached out to the community, with a focus on maintainers of database drivers, ORMs and other libraries that create outbound TCP connections. Using this feedback, we’ve tried to incorporate the best elements of existing APIs and proposals, and intend to contribute back to future standards, as part of the Web-interoperable Runtimes Community Group (WinterCG).

The API we landed on is a simple function, connect(), imported from the new cloudflare:sockets module, that returns an instance of a Socket. Here’s a simple example showing it used to connect to a Gopher server. Gopher was one of the Internet’s early protocols that relied on TCP/IP, and still works today:

import { connect } from 'cloudflare:sockets';

export default {
  async fetch(req: Request) {
    const gopherAddr = "gopher.floodgap.com:70";
    const url = new URL(req.url);

    try {
      const socket = connect(gopherAddr);

      const writer = socket.writable.getWriter()
      const encoder = new TextEncoder();
      const encoded = encoder.encode(url.pathname + "\r\n");
      await writer.write(encoded);

      return new Response(socket.readable, { headers: { "Content-Type": "text/plain" } });
    } catch (error) {
      return new Response("Socket connection failed: " + error, { status: 500 });
    }
  }
};

We think this API design has many benefits that can be realized not just on Cloudflare, but in any serverless environment that adopts this design:

connect(address: SocketAddress | string, options?: SocketOptions): Socket

declare interface Socket {
  get readable(): ReadableStream;
  get writable(): WritableStream;
  get closed(): Promise<void>;
  close(): Promise<void>;
  startTls(): Socket;
}

declare interface SocketOptions {
  secureTransport?: string;
  allowHalfOpen: boolean;
}

declare interface SocketAddress {
  hostname: string;
  port: number;
}

Opportunistic TLS (StartTLS), without separate APIs

Opportunistic TLS, a pattern of creating an initial insecure connection, and then upgrading it to a secure one that uses TLS, remains common, particularly with database drivers. In Node.js, you must use the net API to create the initial connection, and then use the tls API to create a new, upgraded connection. In Deno, you pass the original socket to Deno.startTls(), which creates a new, upgraded connection.

Drawing on a previous W3C proposal for a TCP Socket API, we’ve simplified this by providing one API, that allows TLS to be enabled, allowed, or used when creating a socket, and exposes a simple method, startTls(), for upgrading a socket to use TLS.

// Create a new socket without TLS. secureTransport defaults to "off" if not specified.
const socket = connect("address:port", { secureTransport: "off" })

// Create a new socket, then upgrade it to use TLS.
// Once startTls() is called, only the newly created socket can be used.
const socket = connect("address:port", { secureTransport: "starttls" })
const secureSocket = socket.startTls();

// Create a new socket with TLS
const socket = connect("address:port", { secureTransport: "use" })

TLS configuration — a concern of host infrastructure, not application code

Existing APIs for creating TCP sockets treat TLS as a library that you interact with in your application code. The tls.createSecureContext() API from Node.js has a plethora of advanced configuration options that are mostly environment specific. If you use custom certificates when connecting to a particular service, you likely use a different set of credentials and options in production, staging and development. Managing direct file paths to credentials across environments and swapping out .env files in production build steps are common pain points.

Host infrastructure is best positioned to manage this on your behalf, and similar to Workers support for making subrequests using mTLS, TLS configuration and credentials for the socket API will be managed via Wrangler, and a connect() function provided via a capability binding. Currently, custom TLS credentials and configuration are not supported, but are coming soon.

Start writing data immediately, before the TLS handshake finishes

Because the connect() API synchronously returns a new socket, one can start writing to the socket immediately, without waiting for the TCP handshake to first complete. This means that once the handshake completes, data is already available to send immediately, and host platforms can make use of pipelining to optimize performance.

connect() API + DB drivers = Connect directly to databases

Many serverless databases already work on Workers, allowing clients to connect over HTTP or over WebSockets. But most databases don’t “speak” HTTP, including databases hosted on most cloud providers.

Databases each have their own “wire protocol”, and open-source database “drivers” that speak this protocol, sending and receiving data over a TCP socket. Developers rely on these drivers in their own code, as do database ORMs. Our goal is to make sure that you can use the same drivers and ORMs you might use in other runtimes and on other platforms on Workers.

Try it now — connect to PostgreSQL from Workers

We’ve worked with the maintainers of pg, one of the most popular database drivers in the JavaScript ecosystem, used by ORMs including Sequelize and knex.js, to add support for connect().

You can try this right now. First, create a new Worker and install pg:

wrangler init
npm install --save pg

As of this writing, you’ll need to enable the node_compat option in wrangler.toml:

wrangler.toml

name = "my-worker"
main = "src/index.ts"
compatibility_date = "2023-05-15"
node_compat = true

In just 20 lines of TypeScript, you can create a connection to a Postgres database, execute a query, return results in the response, and close the connection:

index.ts

import { Client } from "pg";

export interface Env {
  DB: string;
}

export default {
  async fetch(
    request: Request,
    env: Env,
    ctx: ExecutionContext
  ): Promise<Response> {
    const client = new Client(env.DB);
    await client.connect();
    const result = await client.query({
      text: "SELECT * from customers",
    });
    console.log(JSON.stringify(result.rows));
    const resp = Response.json(result.rows);
    // Close the database connection, but don't block returning the response
    ctx.waitUntil(client.end());
    return resp;
  },
};

To test this in local development, use the --experimental-local flag (instead of –local), which uses the open-source Workers runtime, ensuring that what you see locally mirrors behavior in production:

wrangler dev --experimental-local

What’s next for connecting to databases from Workers?

This is only the beginning. We’re aiming for the two popular MySQL drivers, mysql and mysql2, to work on Workers soon, with more to follow. If you work on a database driver or ORM, we’d love to help make your library work on Workers.

If you’ve worked more closely with database scaling and performance, you might have noticed that in the example above, a new connection is created for every request. This is one of the biggest current challenges of connecting to databases from serverless functions, across all platforms. With typical client connection pooling, you maintain a local pool of database connections that remain open. This approach of storing a reference to a connection or connection pool in global scope will not work, and is a poor fit for serverless. Managing individual pools of client connections on a per-isolate basis creates other headaches — when and how should connections be terminated? How can you limit the total number of concurrent connections across many isolates and locations?

Instead, we’re already working on simpler approaches to connection pooling for the most popular databases. We see a path to a future where you don’t have to think about or manage client connection pooling on your own. We’re also working on a brand new approach to making your database reads lightning fast.

What’s next for sockets on Workers?

Supporting outbound TCP connections is only one half of the story — we plan to support inbound TCP and UDP connections, as well as new emerging application protocols based on QUIC, so that you can build applications beyond HTTP with Socket Workers.

Earlier today we also announced Smart Placement, which improves performance by placing any Worker that makes multiple HTTP requests to an origin run as close as possible to reduce round-trip time. We’re working on making this work with Workers that open TCP connections, so that if your Worker connects to a database in Virginia and makes many queries over a TCP connection, each query is lightning fast and comes from the nearest location on Cloudflare’s global network.

We also plan to support custom certificates and other TLS configuration options in the coming months — tell us what is a must-have in order to connect to the services you need to connect to from Workers.

Get started, and share your feedback

The TCP Socket API is available today to everyone. Get started by reading the TCP Socket API docs, or connect directly to any PostgreSQL database from your Worker by following this guide.

We want to hear your feedback, what you’d like to see next, and more about what you’re building. Join the Cloudflare Developers Discord.

When the window is not fully open, your TCP stack is doing more than you think

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/

When the window is not fully open, your TCP stack is doing more than you think

Over the years I’ve been lurking around the Linux kernel and have investigated the TCP code many times. But when recently we were working on Optimizing TCP for high WAN throughput while preserving low latency, I realized I have gaps in my knowledge about how Linux manages TCP receive buffers and windows. As I dug deeper I found the subject complex and certainly non-obvious.

In this blog post I’ll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection. Specifically, looking for answers to seemingly trivial questions:

  • How much data can be stored in the TCP receive buffer? (it’s not what you think)
  • How fast can it be filled? (it’s not what you think either!)

Our exploration focuses on the receiving side of the TCP connection. We’ll try to understand how to tune it for the best speed, without wasting precious memory.

A case of a rapid upload

To best illustrate the receive side buffer management we need pretty charts! But to grasp all the numbers, we need a bit of theory.

We’ll draw charts from a receive side of a TCP flow, running a pretty straightforward scenario:

  • The client opens a TCP connection.
  • The client does send(), and pushes as much data as possible.
  • The server doesn’t recv() any data. We expect all the data to stay and wait in the receive queue.
  • We fix the SO_RCVBUF for better illustration.

Simplified pseudocode might look like (full code if you dare):

sd = socket.socket(AF_INET, SOCK_STREAM, 0)
sd.bind(('127.0.0.3', 1234))
sd.listen(32)

cd = socket.socket(AF_INET, SOCK_STREAM, 0)
cd.setsockopt(SOL_SOCKET, SO_RCVBUF, 32*1024)
cd.connect(('127.0.0.3', 1234))

ssd, _ = sd.accept()

while true:
    cd.send(b'a'*128*1024)

We’re interested in basic questions:

  • How much data can fit in the server’s receive buffer? It turns out it’s not exactly the same as the default read buffer size on Linux; we’ll get there.
  • Assuming infinite bandwidth, what is the minimal time  – measured in RTT – for the client to fill the receive buffer?

A bit of theory

Let’s start by establishing some common nomenclature. I’ll follow the wording used by the ss Linux tool from the iproute2 package.

First, there is the buffer budget limit. ss manpage calls it skmem_rb, in the kernel it’s named sk_rcvbuf. This value is most often controlled by the Linux autotune mechanism using the net.ipv4.tcp_rmem setting:

$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 131072 6291456

Alternatively it can be manually set with setsockopt(SO_RCVBUF) on a socket. Note that the kernel doubles the value given to this setsockopt. For example SO_RCVBUF=16384 will result in skmem_rb=32768. The max value allowed to this setsockopt is limited to meager 208KiB by default:

$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992

The aforementioned blog post discusses why manual buffer size management is problematic – relying on autotuning is generally preferable.

Here’s a diagram showing how skmem_rb budget is being divided:

When the window is not fully open, your TCP stack is doing more than you think

In any given moment, we can think of the budget as being divided into four parts:

  • Recv-q: part of the buffer budget occupied by actual application bytes awaiting read().
  • Another part of is consumed by metadata handling – the cost of struct sk_buff and such.
  • Those two parts together are reported by ss as skmem_r – kernel name is sk_rmem_alloc.
  • What remains is “free”, that is: it’s not actively used yet.
  • However, a portion of this “free” region is an advertised window – it may become occupied with application data soon.
  • The remainder will be used for future metadata handling, or might be divided into the advertised window further in the future.

The upper limit for the window is configured by tcp_adv_win_scale setting. By default, the window is set to at most 50% of the “free” space. The value can be clamped further by the TCP_WINDOW_CLAMP option or an internal rcv_ssthresh variable.

How much data can a server receive?

Our first question was “How much data can a server receive?”. A naive reader might think it’s simple: if the server has a receive buffer set to say 64KiB, then the client will surely be able to deliver 64KiB of data!

But this is totally not how it works. To illustrate this, allow me to temporarily set sysctl tcp_adv_win_scale=0. This is not a default and, as we’ll learn, it’s the wrong thing to do. With this setting the server will indeed set 100% of the receive buffer as an advertised window.

Here’s our setup:

  • The client tries to send as fast as possible.
  • Since we are interested in the receiving side, we can cheat a bit and speed up the sender arbitrarily. The client has transmission congestion control disabled: we set initcwnd=10000 as the route option.
  • The server has a fixed skmem_rb set at 64KiB.
  • The server has tcp_adv_win_scale=0.
When the window is not fully open, your TCP stack is doing more than you think

There are so many things here! Let’s try to digest it. First, the X axis is an ingress packet number (we saw about 65). The Y axis shows the buffer sizes as seen on the receive path for every packet.

  • First, the purple line is a buffer size limit in bytes – skmem_rb. In our experiment we called setsockopt(SO_RCVBUF)=32K and skmem_rb is double that value. Notice, by calling SO_RCVBUF we disabled the Linux autotune mechanism.
  • Green recv-q line is how many application bytes are available in the receive socket. This grows linearly with each received packet.
  • Then there is the blue skmem_r, the used data + metadata cost in the receive socket. It grows just like recv-q but a bit faster, since it accounts for the cost of the metadata kernel needs to deal with.
  • The orange rcv_win is an advertised window. We start with 64KiB (100% of skmem_rb) and go down as the data arrives.
  • Finally, the dotted line shows rcv_ssthresh, which is not important yet, we’ll get there.

Running over the budget is bad

It’s super important to notice that we finished with skmem_r higher than skmem_rb! This is rather unexpected, and undesired. The whole point of the skmem_rb memory budget is, well, not to exceed it. Here’s how ss shows it:

$ ss -m
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port   
tcp    ESTAB  62464   0       127.0.0.3:1234      127.0.0.2:1235
     skmem:(r73984,rb65536,...)

As you can see, skmem_rb is 65536 and skmem_r is 73984, which is 8448 bytes over! When this happens we have an even bigger issue on our hands. At around the 62nd packet we have an advertised window of 3072 bytes, but while packets are being sent, the receiver is unable to process them! This is easily verifiable by inspecting an nstat TcpExtTCPRcvQDrop counter:

$ nstat -az TcpExtTCPRcvQDrop
TcpExtTCPRcvQDrop    13    0.0

In our run 13 packets were dropped. This variable counts a number of packets dropped due to either system-wide or per-socket memory pressure – we know we hit the latter. In our case, soon after the socket memory limit was crossed, new packets were prevented from being enqueued to the socket. This happened even though the TCP advertised window was still open.

This results in an interesting situation. The receiver’s window is open which might indicate it has resources to handle the data. But that’s not always the case, like in our example when it runs out of the memory budget.

The sender will think it hit a network congestion packet loss and will run the usual retry mechanisms including exponential backoff. This behavior can be looked at as desired or undesired, depending on how you look at it. On one hand no data will be lost, the sender can eventually deliver all the bytes reliably. On the other hand the exponential backoff logic might stall the sender for a long time, causing a noticeable delay.

The root of the problem is straightforward – Linux kernel skmem_rb sets a memory budget for both the data and metadata which reside on the socket. In a pessimistic case each packet might incur a cost of a struct sk_buff + struct skb_shared_info, which on my system is 576 bytes, above the actual payload size, plus memory waste due to network card buffer alignment:

When the window is not fully open, your TCP stack is doing more than you think

We now understand that Linux can’t just advertise 100% of the memory budget as an advertised window. Some budget must be reserved for metadata and such. The upper limit of window size is expressed as a fraction of the “free” socket budget. It is controlled by tcp_adv_win_scale, with the following values:

When the window is not fully open, your TCP stack is doing more than you think

By default, Linux sets the advertised window at most at 50% of the remaining buffer space.

Even with 50% of space “reserved” for metadata, the kernel is very smart and tries hard to reduce the metadata memory footprint. It has two mechanisms for this:

  • TCP Coalesce – on the happy path, Linux is able to throw away struct sk_buff. It can do so, by just linking the data to the previously enqueued packet. You can think about it as if it was extending the last packet on the socket.
  • TCP Collapse – when the memory budget is hit, Linux runs “collapse” code. Collapse rewrites and defragments the receive buffer from many small skb’s into a few very long segments – therefore reducing the metadata cost.

Here’s an extension to our previous chart showing these mechanisms in action:

When the window is not fully open, your TCP stack is doing more than you think

TCP Coalesce is a very effective measure and works behind the scenes at all times. In the bottom chart, the packets where the coalesce was engaged are shown with a pink line. You can see – the skmem_r bumps (blue line) are clearly correlated with a lack of coalesce (pink line)! The nstat TcpExtTCPRcvCoalesce counter might be helpful in debugging coalesce issues.

The TCP Collapse is a bigger gun. Mike wrote about it extensively, and I wrote a blog post years ago, when the latency of TCP collapse hit us hard. In the chart above, the collapse is shown as a red circle. We clearly see it being engaged after the socket memory budget is reached – from packet number 63. The nstat TcpExtTCPRcvCollapsed counter is relevant here. This value growing is a bad sign and might indicate bad latency spikes – especially when dealing with larger buffers. Normally collapse is supposed to be run very sporadically. A prominent kernel developer describes this pessimistic situation:

This also means tcp advertises a too optimistic window for a given allocated rcvspace: When receiving frames, sk_rmem_alloc can hit sk_rcvbuf limit and we call tcp_collapse() too often, especially when application is slow to drain its receive queue […] This is a major latency source.

If the memory budget remains exhausted after the collapse, Linux will drop ingress packets. In our chart it’s marked as a red “X”. The nstat TcpExtTCPRcvQDrop counter shows the count of dropped packets.

rcv_ssthresh predicts the metadata cost

Perhaps counter-intuitively, the memory cost of a packet can be much larger than the amount of actual application data contained in it. It depends on number of things:

  • Network card: some network cards always allocate a full page (4096, or even 16KiB) per packet, no matter how small or large the payload.
  • Payload size: shorter packets, will have worse metadata to content ratio since struct skb will be comparably larger.
  • Whether XDP is being used.
  • L2 header size: things like ethernet, vlan tags, and tunneling can add up.
  • Cache line size: many kernel structs are cache line aligned. On systems with larger cache lines, they will use more memory (see P4 or S390X architectures).

The first two factors are the most important. Here’s a run when the sender was specially configured to make the metadata cost bad and the coalesce ineffective (the details of the setup are messy):

When the window is not fully open, your TCP stack is doing more than you think

You can see the kernel hitting TCP collapse multiple times, which is totally undesired. Each time a collapse kernel is likely to rewrite the full receive buffer. This whole kernel machinery, from reserving some space for metadata with tcp_adv_win_scale, via using coalesce to reduce the memory cost of each packet, up to the rcv_ssthresh limit, exists to avoid this very case of hitting collapse too often.

The kernel machinery most often works fine, and TCP collapse is rare in practice. However, we noticed that’s not the case for certain types of traffic. One example is websocket traffic with loads of tiny packets and a slow reader. One kernel comment talks about such a case:

* The scheme does not work when sender sends good segments opening
* window and then starts to feed us spaghetti. But it should work
* in common situations. Otherwise, we have to rely on queue collapsing.

Notice that the rcv_ssthresh line dropped down on the TCP collapse. This variable is an internal limit to the advertised window. By dropping it the kernel effectively says: hold on, I mispredicted the packet cost, next time I’m given an opportunity I’m going to open a smaller window. Kernel will advertise a smaller window and be more careful – all of this dance is done to avoid the collapse.

Normal run – continuously updated window

Finally, here’s a chart from a normal run of a connection. Here, we use the default tcp_adv_win_wcale=1 (50%):

When the window is not fully open, your TCP stack is doing more than you think

Early in the connection you can see rcv_win being continuously updated with each received packet. This makes sense: while the rcv_ssthresh and tcp_adv_win_scale restrict the advertised window to never exceed 32KiB, the window is sliding nicely as long as there is enough space. At packet 18 the receiver stops updating the window and waits a bit. At packet 32 the receiver decides there still is some space and updates the window again, and so on. At the end of the flow the socket has 56KiB of data. This 56KiB of data was received over a sliding window reaching at most 32KiB .

The saw blade pattern of rcv_win is enabled by delayed ACK (aka QUICKACK). You can see the “acked” bytes in red dashed line. Since the ACK’s might be delayed, the receiver waits a bit before updating the window. If you want a smooth line, you can use quickack 1 per-route parameter, but this is not recommended since it will result in many small ACK packets flying over the wire.

In normal connection we expect the majority of packets to be coalesced and the collapse/drop code paths never to be hit.

Large receive windows – rcv_ssthresh

For large bandwidth transfers over big latency links – big BDP case – it’s beneficial to have a very wide advertised window. However, Linux takes a while to fully open large receive windows:

When the window is not fully open, your TCP stack is doing more than you think

In this run, the skmem_rb is set to 2MiB. As opposed to previous runs, the buffer budget is large and the receive window doesn’t start with 50% of the skmem_rb! Instead it starts from 64KiB and grows linearly. It takes a while for Linux to ramp up the receive window to full size – ~800KiB in this case. The window is clamped by rcv_ssthresh. This variable starts at 64KiB and then grows at a rate of two full-MSS packets per each packet which has a “good” ratio of total size (truesize) to payload size.

Eric Dumazet writes about this behavior:

Stack is conservative about RWIN increase, it wants to receive packets to have an idea of the skb->len/skb->truesize ratio to convert a memory budget to  RWIN.
Some drivers have to allocate 16K buffers (or even 32K buffers) just to hold one segment (of less than 1500 bytes of payload), while others are able to pack memory more efficiently.

This behavior of slow window opening is fixed, and not configurable in vanilla kernel. We prepared a kernel patch that allows to start up with higher rcv_ssthresh based on per-route option initrwnd:

$ ip route change local 127.0.0.0/8 dev lo initrwnd 1000

With the patch and the route change deployed, this is how the buffers look:

When the window is not fully open, your TCP stack is doing more than you think

The advertised window is limited to 64KiB during the TCP handshake, but with our kernel patch enabled it’s quickly bumped up to 1MiB in the first ACK packet afterwards. In both runs it took ~1800 packets to fill the receive buffer, however it took different time. In the first run the sender could push only 64KiB onto the wire in the second RTT. In the second run it could immediately push full 1MiB of data.

This trick of aggressive window opening is not really necessary for most users. It’s only helpful when:

  • You have high-bandwidth TCP transfers over big-latency links.
  • The metadata + buffer alignment cost of your NIC is sensible and predictable.
  • Immediately after the flow starts your application is ready to send a lot of data.
  • The sender has configured large initcwnd.
  • You care about shaving off every possible RTT.

On our systems we do have such flows, but arguably it might not be a common scenario. In the real world most of your TCP connections go to the nearest CDN point of presence, which is very close.

Getting it all together

In this blog post, we discussed a seemingly simple case of a TCP sender filling up the receive socket. We tried to address two questions: with our isolated setup, how much data can be sent, and how quickly?

With the default settings of net.ipv4.tcp_rmem, Linux initially sets a memory budget of 128KiB for the receive data and metadata. On my system, given full-sized packets, it’s able to eventually accept around 113KiB of application data.

Then, we showed that the receive window is not fully opened immediately. Linux keeps the receive window small, as it tries to predict the metadata cost and avoid overshooting the memory budget, therefore hitting TCP collapse. By default, with the net.ipv4.tcp_adv_win_scale=1, the upper limit for the advertised window is 50% of “free” memory. rcv_ssthresh starts up with 64KiB and grows linearly up to that limit.

On my system it took five window updates – six RTTs in total – to fill the 128KiB receive buffer. In the first batch the sender sent ~64KiB of data (remember we hacked the initcwnd limit), and then the sender topped it up with smaller and smaller batches until the receive window fully closed.

I hope this blog post is helpful and explains well the relationship between the buffer size and advertised window on Linux. Also, it describes the often misunderstood rcv_ssthresh which limits the advertised window in order to manage the memory budget and predict the unpredictable cost of metadata.

In case you wonder, similar mechanisms are in play in QUIC. The QUIC/H3 libraries though are still pretty young and don’t have so many complex and mysterious toggles…. yet.

As always, the code and instructions on how to reproduce the charts are available at our GitHub.

A July 4 technical reading list

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/july-4-2022-reading-list/

A July 4 technical reading list

A July 4 technical reading list

Here’s a short list of recent technical blog posts to give you something to read today.

Internet Explorer, we hardly knew ye

Microsoft has announced the end-of-life for the venerable Internet Explorer browser. Here we take a look at the demise of IE and the rise of the Edge browser. And we investigate how many bots on the Internet continue to impersonate Internet Explorer versions that have long since been replaced.

Live-patching security vulnerabilities inside the Linux kernel with eBPF Linux Security Module

Looking for something with a lot of technical detail? Look no further than this blog about live-patching the Linux kernel using eBPF. Code, Makefiles and more within!

Hertzbleed explained

Feeling mathematical? Or just need a dose of CPU-level antics? Look no further than this deep explainer about how CPU frequency scaling leads to a nasty side channel affecting cryptographic algorithms.

Early Hints update: How Cloudflare, Google, and Shopify are working together to build a faster Internet for everyone

The HTTP standard for Early Hints shows a lot of promise. How much? In this blog post, we dig into data about Early Hints in the real world and show how much faster the web is with it.

Private Access Tokens: eliminating CAPTCHAs on iPhones and Macs with open standards

Dislike CAPTCHAs? Yes, us too. As part of our program to eliminate captures there’s a new standard: Private Access Tokens. This blog shows how they work and how they can be used to prove you’re human without saying who you are.

Optimizing TCP for high WAN throughput while preserving low latency

Network nerd? Yeah, me too. Here’s a very in depth look at how we tune TCP parameters for low latency and high throughput.

Optimizing TCP for high WAN throughput while preserving low latency

Post Syndicated from Mike Freemon original https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/

Optimizing TCP for high WAN throughput while preserving low latency

Optimizing TCP for high WAN throughput while preserving low latency

Here at Cloudflare we’re constantly working on improving our service. Our engineers are looking at hundreds of parameters of our traffic, making sure that we get better all the time.

One of the core numbers we keep a close eye on is HTTP request latency, which is important for many of our products. We regard latency spikes as bugs to be fixed. One example is the 2017 story of “Why does one NGINX worker take all the load?”, where we optimized our TCP Accept queues to improve overall latency of TCP sockets waiting for accept().

Performance tuning is a holistic endeavor, and we monitor and continuously improve a range of other performance metrics as well, including throughput. Sometimes, tradeoffs have to be made. Such a case occurred in 2015, when a latency spike was discovered in our processing of HTTP requests. The solution at the time was to set tcp_rmem to 4 MiB, which minimizes the amount of time the kernel spends on TCP collapse processing. It was this collapse processing that was causing the latency spikes. Later in this post we discuss TCP collapse processing in more detail.

The tradeoff is that using a low value for tcp_rmem limits TCP throughput over high latency links. The following graph shows the maximum throughput as a function of network latency for a window size of 2 MiB. Note that the 2 MiB corresponds to a tcp_rmem value of 4 MiB due to the tcp_adv_win_scale setting in effect at the time.

Optimizing TCP for high WAN throughput while preserving low latency

For the Cloudflare products then in existence, this was not a major problem, as connections terminate and content is served from nearby servers due to our BGP anycast routing.

Since then, we have added new products, such as Magic WAN, WARP, Spectrum, Gateway, and others. These represent new types of use cases and traffic flows.

For example, imagine you’re a typical Magic WAN customer. You have connected all of your worldwide offices together using the Cloudflare global network. While Time to First Byte still matters, Magic WAN office-to-office traffic also needs good throughput. For example, a lot of traffic over these corporate connections will be file sharing using protocols such as SMB. These are elephant flows over long fat networks. Throughput is the metric every eyeball watches as they are downloading files.

We need to continue to provide world-class low latency while simultaneously providing high throughput over high-latency connections.

Before we begin, let’s introduce the players in our game.

TCP receive window is the maximum number of unacknowledged user payload bytes the sender should transmit (bytes-in-flight) at any point in time. The size of the receive window can and does go up and down during the course of a TCP session. It is a mechanism whereby the receiver can tell the sender to stop sending if the sent packets cannot be successfully received because the receive buffers are full. It is this receive window that often limits throughput over high-latency networks.

net.ipv4.tcp_adv_win_scale is a (non-intuitive) number used to account for the overhead needed by Linux to process packets. The receive window is specified in terms of user payload bytes. Linux needs additional memory beyond that to track other data associated with packets it is processing.

The value of the receive window changes during the lifetime of a TCP session, depending on a number of factors. The maximum value that the receive window can be is limited by the amount of free memory available in the receive buffer, according to this table:

tcp_adv_win_scale TCP window size
4 15/16 * available memory in receive buffer
3 ⅞ * available memory in receive buffer
2 ¾ * available memory in receive buffer
1 ½ * available memory in receive buffer
0 available memory in receive buffer
-1 ½ * available memory in receive buffer
-2 ¼ * available memory in receive buffer
-3 ⅛ * available memory in receive buffer

We can intuitively (and correctly) understand that the amount of available memory in the receive buffer is the difference between the used memory and the maximum limit. But what is the maximum size a receive buffer can be? The answer is sk_rcvbuf.

sk_rcvbuf is a per-socket field that specifies the maximum amount of memory that a receive buffer can allocate. This can be set programmatically with the socket option SO_RCVBUF. This can sometimes be useful to do, for localhost TCP sessions, for example, but in general the use of SO_RCVBUF is not recommended.

So how is sk_rcvbuf set? The most appropriate value for that depends on the latency of the TCP session and other factors. This makes it difficult for L7 applications to know how to set these values correctly, as they will be different for every TCP session. The solution to this problem is Linux autotuning.

Linux autotuning

Linux autotuning is logic in the Linux kernel that adjusts the buffer size limits and the receive window based on actual packet processing. It takes into consideration a number of things including TCP session RTT, L7 read rates, and the amount of available host memory.

Autotuning can sometimes seem mysterious, but it is actually fairly straightforward.

The central idea is that Linux can track the rate at which the local application is reading data off of the receive queue. It also knows the session RTT. Because Linux knows these things, it can automatically increase the buffers and receive window until it reaches the point at which the application layer or network bottleneck links are the constraint on throughput (and not host buffer settings). At the same time, autotuning prevents slow local readers from having excessively large receive queues. The way autotuning does that is by limiting the receive window and its corresponding receive buffer to an appropriate size for each socket.

The values set by autotuning can be seen via the Linux “ss” command from the iproute package (e.g. “ss -tmi”).  The relevant output fields from that command are:

Recv-Q is the number of user payload bytes not yet read by the local application.

rcv_ssthresh is the window clamp, a.k.a. the maximum receive window size. This value is not known to the sender. The sender receives only the current window size, via the TCP header field. A closely-related field in the kernel, tp->window_clamp, is the maximum window size allowable based on the amount of available memory. rcv_sshthresh is the receiver-side slow-start threshold value.

skmem_r is the actual amount of memory that is allocated, which includes not only user payload (Recv-Q) but also additional memory needed by Linux to process the packet (packet metadata). This is known within the kernel as sk_rmem_alloc.

Note that there are other buffers associated with a socket, so skmem_r does not represent the total memory that a socket might have allocated. Those other buffers are not involved in the issues presented in this post.

skmem_rb is the maximum amount of memory that could be allocated by the socket for the receive buffer. This is higher than rcv_ssthresh to account for memory needed for packet processing that is not packet data. Autotuning can increase this value (up to tcp_rmem max) based on how fast the L7 application is able to read data from the socket and the RTT of the session. This is known within the kernel as sk_rcvbuf.

rcv_space is the high water mark of the rate of the local application reading from the receive buffer during any RTT. This is used internally within the kernel to adjust sk_rcvbuf.

Earlier we mentioned a setting called tcp_rmem. net.ipv4.tcp_rmem consists of three values, but in this document we are always referring to the third value (except where noted). It is a global setting that specifies the maximum amount of memory that any TCP receive buffer can allocate, i.e. the maximum permissible value that autotuning can use for sk_rcvbuf. This is essentially just a failsafe for autotuning, and under normal circumstances should play only a minor role in TCP memory management.

It’s worth mentioning that receive buffer memory is not preallocated. Memory is allocated based on actual packets arriving and sitting in the receive queue. It’s also important to realize that filling up a receive queue is not one of the criteria that autotuning uses to increase sk_rcvbuf. Indeed, preventing this type of excessive buffering (bufferbloat) is one of the benefits of autotuning.

What’s the problem?

The problem is that we must have a large TCP receive window for high BDP sessions. This is directly at odds with the latency spike problem mentioned above.

Something has to give. The laws of physics (speed of light in glass, etc.) dictate that we must use large window sizes. There is no way to get around that. So we are forced to solve the latency spikes differently.

A brief recap of the latency spike problem

Sometimes a TCP session will fill up its receive buffers. When that happens, the Linux kernel will attempt to reduce the amount of memory the receive queue is using by performing what amounts to a “defragmentation” of memory. This is called collapsing the queue. Collapsing the queue takes time, which is what drives up HTTP request latency.

We do not want to spend time collapsing TCP queues.

Why do receive queues fill up to the point where they hit the maximum memory limit? The usual situation is when the local application starts out reading data from the receive queue at one rate (triggering autotuning to raise the max receive window), followed by the local application slowing down its reading from the receive queue. This is valid behavior, and we need to handle it correctly.

Selecting sysctl values

Before exploring solutions, let’s first decide what we need as the maximum TCP window size.

As we have seen above in the discussion about BDP, the window size is determined based upon the RTT and desired throughput of the connection.

Because Linux autotuning will adjust correctly for sessions with lower RTTs and bottleneck links with lower throughput, all we need to be concerned about are the maximums.

For latency, we have chosen 300 ms as the maximum expected latency, as that is the measured latency between our Zurich and Sydney facilities. It seems reasonable enough as a worst-case latency under normal circumstances.

For throughput, although we have very fast and modern hardware on the Cloudflare global network, we don’t expect a single TCP session to saturate the hardware. We have arbitrarily chosen 3500 mbps as the highest supported throughput for our highest latency TCP sessions.

The calculation for those numbers results in a BDP of 131MB, which we round to the more aesthetic value of 128 MiB.

Recall that allocation of TCP memory includes metadata overhead in addition to packet data. The ratio of actual amount of memory allocated to user payload size varies, depending on NIC driver settings, packet size, and other factors. For full-sized packets on some of our hardware, we have measured average allocations up to 3 times the packet data size. In order to reduce the frequency of TCP collapse on our servers, we set tcp_adv_win_scale to -2. From the table above, we know that the max window size will be ¼ of the max buffer space.

We end up with the following sysctl values:

net.ipv4.tcp_rmem = 8192 262144 536870912
net.ipv4.tcp_wmem = 4096 16384 536870912
net.ipv4.tcp_adv_win_scale = -2

A tcp_rmem of 512MiB and tcp_adv_win_scale of -2 results in a maximum window size that autotuning can set of 128 MiB, our desired value.

Disabling TCP collapse

Patient: Doctor, it hurts when we collapse the TCP receive queue.

Doctor: Then don’t do that!

Generally speaking, when a packet arrives at a buffer when the buffer is full, the packet gets dropped. In the case of these receive buffers, Linux tries to “save the packet” when the buffer is full by collapsing the receive queue. Frequently this is successful, but it is not guaranteed to be, and it takes time.

There are no problems created by immediately just dropping the packet instead of trying to save it. The receive queue is full anyway, so the local receiver application still has data to read. The sender’s congestion control will notice the drop and/or ZeroWindow and will respond appropriately. Everything will continue working as designed.

At present, there is no setting provided by Linux to disable the TCP collapse. We developed an in-house patch to the kernel to disable the TCP collapse logic.

Kernel patch – Attempt #1

The kernel patch for our first attempt was straightforward. At the top of tcp_try_rmem_schedule(), if the memory allocation fails, we simply return (after pred_flag = 0 and tcp_sack_reset()), thus completely skipping the tcp_collapse and related logic.

It didn’t work.

Although we eliminated the latency spikes while using large buffer limits, we did not observe the throughput we expected.

One of the realizations we made as we investigated the situation was that standard network benchmarking tools such as iperf3 and similar do not expose the problem we are trying to solve. iperf3 does not fill the receive queue. Linux autotuning does not open the TCP window large enough. Autotuning is working perfectly for our well-behaved benchmarking program.

We need application-layer software that is slightly less well-behaved, one that exercises the autotuning logic under test. So we wrote one.

A new benchmarking tool

Anomalies were seen during our “Attempt #1” that negatively impacted throughput. The anomalies were seen only under certain specific conditions, and we realized we needed a better benchmarking tool to detect and measure the performance impact of those anomalies.

This tool has turned into an invaluable resource during the development of this patch and raised confidence in our solution.

It consists of two Python programs. The reader opens a TCP session to the daemon, at which point the daemon starts sending user payload as fast as it can, and never stops sending.

The reader, on the other hand, starts and stops reading in a way to open up the TCP receive window wide open and then repeatedly causes the buffers to fill up completely. More specifically, the reader implemented this logic:

  1. reads as fast as it can, for five seconds
    • this is called fast mode
    • opens up the window
  2. calculates 5% of the high watermark of the bytes reader during any previous one second
  3. for each second of the next 15 seconds:
    • this is called slow mode
    • reads that 5% number of bytes, then stops reading
    • sleeps for the remainder of that particular second
    • most of the second consists of no reading at all
  4. steps 1-3 are repeated in a loop three times, so the entire run is 60 seconds

This has the effect of highlighting any issues in the handling of packets when the buffers repeatedly hit the limit.

Revisiting default Linux behavior

Taking a step back, let’s look at the default Linux behavior. The following is kernel v5.15.16.

Optimizing TCP for high WAN throughput while preserving low latency

The Linux kernel is effective at freeing up space in order to make room for incoming packets when the receive buffer memory limit is hit. As documented previously, the cost for saving these packets (i.e. not dropping them) is latency.

However, the latency spikes, in milliseconds, for tcp_try_rmem_schedule(), are:

tcp_rmem 170 MiB, tcp_adv_win_scale +2 (170p2):

@ms:
[0]       27093 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[1]           0 |
[2, 4)        0 |
[4, 8)        0 |
[8, 16)       0 |
[16, 32)      0 |
[32, 64)     16 |

tcp_rmem 146 MiB, tcp_adv_win_scale +3 (146p3):

@ms:
(..., 16)  25984 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[16, 20)       0 |
[20, 24)       0 |
[24, 28)       0 |
[28, 32)       0 |
[32, 36)       0 |
[36, 40)       0 |
[40, 44)       1 |
[44, 48)       6 |
[48, 52)       6 |
[52, 56)       3 |

tcp_rmem 137 MiB, tcp_adv_win_scale +4 (137p4):

@ms:
(..., 16)  37222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[16, 20)       0 |
[20, 24)       0 |
[24, 28)       0 |
[28, 32)       0 |
[32, 36)       0 |
[36, 40)       1 |
[40, 44)       8 |
[44, 48)       2 |

These are the latency spikes we cannot have on the Cloudflare global network.

Kernel patch – Attempt #2

So the “something” that was not working in Attempt #1 was that the receive queue memory limit was hit early on as the flow was just ramping up (when the values for sk_rmem_alloc and sk_rcvbuf were small, ~800KB). This occurred at about the two second mark for 137p4 test (about 2.25 seconds for 170p2).

In hindsight, we should have noticed that tcp_prune_queue() actually raises sk_rcvbuf when it can. So we modified the patch in response to that, added a guard to allow the collapse to execute when sk_rmem_alloc is less than the threshold value.

net.ipv4.tcp_collapse_max_bytes = 6291456

The next section discusses how we arrived at this value for tcp_collapse_max_bytes.

The patch is available here.

The results with the new patch are as follows:

oscil – 300ms tests

Optimizing TCP for high WAN throughput while preserving low latency

oscil – 20ms tests

Optimizing TCP for high WAN throughput while preserving low latency

oscil – 0ms tests

Optimizing TCP for high WAN throughput while preserving low latency

iperf3 – 300 ms tests

Optimizing TCP for high WAN throughput while preserving low latency

iperf3 – 20 ms tests

Optimizing TCP for high WAN throughput while preserving low latency

iperf3 – 0ms tests

Optimizing TCP for high WAN throughput while preserving low latency

All tests are successful.

Setting tcp_collapse_max_bytes

In order to determine this setting, we need to understand what the biggest queue we can collapse without incurring unacceptable latency.

Optimizing TCP for high WAN throughput while preserving low latency
Optimizing TCP for high WAN throughput while preserving low latency

Using 6 MiB should result in a maximum latency of no more than 2 ms.

Cloudflare production network results

Current production settings (“Old”)

net.ipv4.tcp_rmem = 8192 2097152 16777216
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_adv_win_scale = -2
net.ipv4.tcp_collapse_max_bytes = 0
net.ipv4.tcp_notsent_lowat = 4294967295

tcp_collapse_max_bytes of 0 means that the custom feature is disabled and that the vanilla kernel logic is used for TCP collapse processing.

New settings under test (“New”)

net.ipv4.tcp_rmem = 8192 262144 536870912
net.ipv4.tcp_wmem = 4096 16384 536870912
net.ipv4.tcp_adv_win_scale = -2
net.ipv4.tcp_collapse_max_bytes = 6291456
net.ipv4.tcp_notsent_lowat = 131072

The tcp_notsent_lowat setting is discussed in the last section of this post.

The middle value of tcp_rmem was changed as a result of separate work that found that Linux autotuning was setting receive buffers too high for localhost sessions. This updated setting reduces TCP memory usage for those sessions, but does not change anything about the type of TCP sessions that is the focus of this post.

For the following benchmarks, we used non-Cloudflare host machines in Iowa, US, and Melbourne, Australia performing data transfers to the Cloudflare data center in Marseille, France. In Marseille, we have some hosts configured with the existing production settings, and others with the system settings described in this post. Software used is perf3 version 3.9, kernel 5.15.32.

Throughput results

Optimizing TCP for high WAN throughput while preserving low latency

RTT (ms) Throughput with Current Settings (mbps) Throughput with New Settings (mbps) Increase Factor
Iowa to Marseille 121 276 6600 24x
Melbourne to Marseille 282 120 3800 32x

Iowa-Marseille throughput

Optimizing TCP for high WAN throughput while preserving low latency

Iowa-Marseille receive window and bytes-in-flight

Optimizing TCP for high WAN throughput while preserving low latency

Melbourne-Marseille throughput

Optimizing TCP for high WAN throughput while preserving low latency

Melbourne-Marseille receive window and bytes-in-flight

Optimizing TCP for high WAN throughput while preserving low latency

Even with the new settings in place, the Melbourne to Marseille performance is limited by the receive window on the Cloudflare host. This means that further adjustments to these settings yield even higher throughput.

Latency results

The Y-axis on these charts are the 99th percentile time for TCP collapse in seconds.

Cloudflare hosts in Marseille running the current production settings

Optimizing TCP for high WAN throughput while preserving low latency

Cloudflare hosts in Marseille running the new settings

Optimizing TCP for high WAN throughput while preserving low latency

The takeaway in looking at these graphs is that maximum TCP collapse time for the new settings is no worse than with the current production settings. This is the desired result.

Send Buffers

What we have shown so far is that the receiver side seems to be working well, but what about the sender side?

As part of this work, we are setting tcp_wmem max to 512 MiB. For oscillating reader flows, this can cause the send buffer to become quite large. This represents bufferbloat and wasted kernel memory, both things that nobody likes or wants.

Fortunately, there is already a solution: tcp_notsent_lowat. This setting limits the size of unsent bytes in the write queue. More details can be found at https://lwn.net/Articles/560082.

The results are significant:

Optimizing TCP for high WAN throughput while preserving low latency

The RTT for these tests was 466ms. Throughput is not negatively affected. Throughput is at full wire speed in all cases (1 Gbps). Memory usage is as reported by /proc/net/sockstat, TCP mem.

Our web servers already set tcp_notsent_lowat to 131072 for its sockets. All other senders are using 4 GiB, the default value. We are changing the sysctl so that 131072 is in effect for all senders running on the server.

Conclusion

The goal of this work is to open the throughput floodgates for high BDP connections while simultaneously ensuring very low HTTP request latency.

We have accomplished that goal.

A Primer on Proxies

Post Syndicated from Lucas Pardue original https://blog.cloudflare.com/a-primer-on-proxies/

A Primer on Proxies

A Primer on Proxies

Traffic proxying, the act of encapsulating one flow of data inside another, is a valuable privacy tool for establishing boundaries on the Internet. Encapsulation has an overhead, Cloudflare and our Internet peers strive to avoid turning it into a performance cost. MASQUE is the latest collaboration effort to design efficient proxy protocols based on IETF standards. We’re already running these at scale in production; see our recent blog post about Cloudflare’s role in iCloud Private Relay for an example.

In this blog post series, we’ll dive into proxy protocols.

To begin, let’s start with a simple question: what is proxying? In this case, we are focused on forward proxying — a client establishes an end-to-end tunnel to a target server via a proxy server. This contrasts with the Cloudflare CDN, which operates as a reverse proxy that terminates client connections and then takes responsibility for actions such as caching, security including WAF, load balancing, etc. With forward proxying, the details about the tunnel, such as how it is established and used, whether or not it provides confidentiality via authenticated encryption, and so on, vary by proxy protocol. Before going into specifics, let’s start with one of the most common tunnels used on the Internet: TCP.

Transport basics: TCP provides a reliable byte stream

The TCP transport protocol is a rich topic. For the purposes of this post, we will focus on one aspect: TCP provides a readable and writable, reliable, and ordered byte stream. Some protocols like HTTP and TLS require a reliable transport underneath them and TCP’s single byte stream is an ideal fit. The application layer reads or writes to this byte stream, but the details about how TCP sends this data “on the wire” are typically abstracted away.

Large application objects are written into a stream, then they are split into many small packets and they are sent in order to the network. At the receiver, packets are read from the network and combined back into an identical stream. Networks are not perfect and packets can be lost or reordered. TCP is clever at dealing with this and not worrying the application with details. It just works. A way to visualize this is to imagine a magic paper shredder that can both shred documents and convert shredded papers back to whole documents. Then imagine you and your friend bought a pair of these and decided that it would be fun to send each other shreds.

The one problem with TCP is that when a lost packet is detected at a receiver, the sender needs to retransmit it. This takes time to happen and can mean that the byte stream reconstruction gets delayed. This is known as TCP head-of-line blocking. Applications regularly use TCP via a socket API that abstracts away protocol details; they often can’t tell if there are delays because the other end is slow at sending or if the network is slowing things down via packet loss.

A Primer on Proxies

Proxy Protocols

Proxying TCP is immensely useful for many applications, including, though certainly not limited to HTTPS, SSH, and RDP. In fact, Oblivious DoH, which is a proxy protocol for DNS messages, could very well be implemented using a TCP proxy, though there are reasons why this may not be desirable. Today, there are a number of different options for proxying TCP end-to-end, including:

  • SOCKS, which runs in cleartext and requires an expensive connection establishment step.
  • Transparent TCP proxies, commonly referred to as performance enhancing proxies (PEPs), which must be on path and offer no additional transport security, and, definitionally, are limited to TCP protocols.
  • Layer 4 proxies such as Cloudflare Spectrum, which might rely on side carriage metadata via something like the PROXY protocol.
  • HTTP CONNECT, which transforms HTTPS connections into opaque byte streams.

While SOCKS and PEPs are viable options for some use cases, when choosing which proxy protocol to build future systems upon, it made most sense to choose a reusable and general-purpose protocol that provides well-defined and standard abstractions. As such, the IETF chose to focus on using HTTP as a substrate via the CONNECT method.

The concept of using HTTP as a substrate for proxying is not new. Indeed, HTTP/1.1 and HTTP/2 have supported proxying TCP-based protocols for a long time. In the following sections of this post, we’ll explain in detail how CONNECT works across different versions of HTTP, including HTTP/1.1, HTTP/2, and the recently standardized HTTP/3.

HTTP/1.1 and CONNECT

In HTTP/1.1, the CONNECT method can be used to establish an end-to-end TCP tunnel to a target server via a proxy server. This is commonly applied to use cases where there is a benefit of protecting the traffic between the client and the proxy, or where the proxy can provide access control at network boundaries. For example, a Web browser can be configured to issue all of its HTTP requests via an HTTP proxy.

A client sends a CONNECT request to the proxy server, which requests that it opens a TCP connection to the target server and desired port. It looks something like this:

CONNECT target.example.com:80 HTTP/1.1
Host: target.example.com

If the proxy succeeds in opening a TCP connection to the target, it responds with a 2xx range status code. If there is some kind of problem, an error status in the 5xx range can be returned. Once a tunnel is established there are two independent TCP connections; one on either side of the proxy. If a flow needs to stop, you can simply terminate them.

HTTP CONNECT proxies forward data between the client and the target server. The TCP packets themselves are not tunneled, only the data on the logical byte stream. Although the proxy is supposed to forward data and not process it, if the data is plaintext there would be nothing to stop it. In practice, CONNECT is often used to create an end-to-end TLS connection where only the client and target server have access to the protected content; the proxy sees only TLS records and can’t read their content because it doesn’t have access to the keys.

A Primer on Proxies

Finally, it’s worth noting that after a successful CONNECT request, the HTTP connection (and the TCP connection underpinning it) has been converted into a tunnel. There is no more possibility of issuing other HTTP messages, to the proxy itself, on the connection.

HTTP/2 and CONNECT

HTTP/2 adds logical streams above the TCP layer in order to support concurrent requests and responses on a single connection. Streams are also reliable and ordered byte streams, operating on top of TCP. Returning to our magic shredder analogy: imagine you wanted to send a book. Shredding each page one after another and rebuilding the book one page at a time is slow, but handling multiple pages at the same time might be faster. HTTP/2 streams allow us to do that. But, as we all know, trying to put too much into a shredder can sometimes cause it to jam.

A Primer on Proxies

In HTTP/2, each request and response is sent on a different stream. To support this, HTTP/2 defines frames that contain the stream identifier that they are associated with. Requests and responses are composed of HEADERS and DATA frames which contain HTTP header fields and HTTP content, respectively. Frames can be large. When they are sent on the wire they might span multiple TLS records or TCP segments. Side note: the HTTP WG has been working on a new revision of the document that defines HTTP semantics that are common to all HTTP versions. The terms message, header fields, and content all come from this description.

HTTP/2 concurrency allows applications to read and write multiple objects at different rates, which can improve HTTP application performance, such as web browsing. HTTP/1.1 traditionally dealt with this concurrency by opening multiple TCP connections in parallel and striping requests across these connections. In contrast, HTTP/2 multiplexes frames belonging to different streams onto the single byte stream provided by one TCP connection. Reusing a single connection has benefits, but it still leaves HTTP/2 at risk of TCP head-of-line blocking. For more details, refer to Perf Planet blog.

HTTP/2 also supports the CONNECT method. In contrast to HTTP/1.1, CONNECT requests do not take over an entire HTTP/2 connection. Instead, they convert a single stream into an end-to-end tunnel. It looks something like this:

:method = CONNECT
:authority = target.example.com:443

If the proxy succeeds in opening a TCP connection, it responds with a 2xx (Successful) status code. After this, the client sends DATA frames to the proxy, and the content of these frames are put into TCP packets sent to the target. In the return direction, the proxy reads from the TCP byte stream and populates DATA frames. If a tunnel needs to stop, you can simply terminate the stream; there is no need to terminate the HTTP/2 connection.

By using HTTP/2, a client can create multiple CONNECT tunnels in a single connection. This can help reduce resource usage (saving the global count of TCP connections) and allows related tunnels to be logically grouped together, ensuring that they “share fate” when either client or proxy need to gracefully close. On the proxy-to-server side there are still multiple independent TCP connections.

A Primer on Proxies

One challenge of multiplexing tunnels on concurrent streams is how to effectively prioritize them. We’ve talked in the past about prioritization for web pages, but the story is a bit different for CONNECT. We’ve been thinking about this and captured considerations in the new Extensible Priorities draft.

QUIC, HTTP/3 and CONNECT

QUIC is a new secure and multiplexed transport protocol from the IETF. QUIC version 1 was published as RFC 9000 in May 2021 and, the next day, we enabled it for all Cloudflare customers.

QUIC is composed of several foundational features. You can think of these like individual puzzle pieces that interlink to form a transport service. This service needs one more piece, an application mapping, to bring it all together.

A Primer on Proxies

Similar to HTTP/2, QUIC version 1 provides reliable and ordered streams. But QUIC streams live at the transport layer and they are the only type of QUIC primitive that can carry application data. QUIC has no opinion on how streams get used. Applications that wish to use QUIC must define that themselves.

QUIC streams can be long (up to 2^62 – 1 bytes). Stream data is sent on the wire in the form of STREAM frames. All QUIC frames must fit completely inside a QUIC packet. QUIC packets must fit entirely in a UDP datagram; fragmentation is prohibited. These requirements mean that a long stream is serialized to a series of QUIC packets sized roughly to the path MTU (Maximum Transmission Unit). STREAM frames provide reliability via QUIC loss detection and recovery. Frames are acknowledged by the receiver and if the sender detects a loss (via missing acknowledgments), QUIC will retransmit the lost data. In contrast, TCP retransmits packets. This difference is an important feature of QUIC, letting implementations decide how to repacketize and reschedule lost data.

When multiplexing streams, different packets can contain STREAM frames belonging to different stream identifiers. This creates independence between streams and helps avoid the head-of-line blocking caused by packet loss that we see in TCP. If a UDP packet containing data for one stream is lost, other streams can continue to make progress without being blocked by retransmission of the lost stream.

To use our magic shredder analogy one more time: we’re sending a book again, but this time we parallelise our task by using independent shredders. We need to logically associate them together so that the receiver knows the pages and shreds are all for the same book, but otherwise they can progress with less chance of jamming.

A Primer on Proxies

HTTP/3 is an example of an application mapping that describes how streams are used to exchange: HTTP settings, QPACK state, and request and response messages. HTTP/3 still defines its own frames like HEADERS and DATA, but it is overall simpler than HTTP/2 because QUIC deals with the hard stuff. Since HTTP/3 just sees a logical byte stream, its frames can be arbitrarily sized. The QUIC layer handles segmenting HTTP/3 frames over STREAM frames for sending in packets. HTTP/3 also supports the CONNECT method. It functions identically to CONNECT in HTTP/2, each request stream converting to an end-to-end tunnel.

HTTP packetization comparison

We’ve talked about HTTP/1.1, HTTP/2 and HTTP/3. The diagram below is a convenient way to summarize how HTTP requests and responses get serialized for transmission over a secure transport. The main difference is that with TLS, protected records are split across several TCP segments. While with QUIC there is no record layer, each packet has its own protection.

A Primer on Proxies

Limitations and looking ahead

HTTP CONNECT is a simple and elegant protocol that has a tremendous number of application use cases, especially for privacy-enhancing technology. In particular, applications can use it to proxy DNS-over-HTTPS similar to what’s been done for Oblivious DoH, or more generic HTTPS traffic (based on HTTP/1.1 or HTTP/2), and many more.

However, what about non-TCP traffic? Recall that HTTP/3 is an application mapping for QUIC, and therefore runs over UDP as well. What if we wanted to proxy QUIC? What if we wanted to proxy entire IP datagrams, similar to VPN technologies like IPsec or WireGuard? This is where MASQUE comes in. In the next post, we’ll discuss how the MASQUE Working Group is standardizing technologies to enable proxying for datagram-based protocols like UDP and IP.

How to stop running out of ephemeral ports and start to love long-lived connections

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/

How to stop running out of ephemeral ports and start to love long-lived connections

Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way.

It’s particularly interesting when basic things used everywhere fail. Recently we’ve reached such a breaking point in a ubiquitous part of Linux networking: establishing a network connection using the connect() system call.

Since we are not doing anything special, just establishing TCP and UDP connections, how could anything go wrong? Here’s one example: we noticed alerts from a misbehaving server, logged in to check it out and saw:

marek@:~# ssh 127.0.0.1
ssh: connect to host 127.0.0.1 port 22: Cannot assign requested address

You can imagine the face of my colleague who saw that. SSH to localhost refuses to work, while she was already using SSH to connect to that server! On another occasion:

marek@:~# dig cloudflare.com @1.1.1.1
dig: isc_socket_bind: address in use

This time a basic DNS query failed with a weird networking error. Failing DNS is a bad sign!

In both cases the problem was Linux running out of ephemeral ports. When this happens it’s unable to establish any outgoing connections. This is a pretty serious failure. It’s usually transient and if you don’t know what to look for it might be hard to debug.

The root cause lies deeper though. We can often ignore limits on the number of outgoing connections. But we encountered cases where we hit limits on the number of concurrent outgoing connections during normal operation.

In this blog post I’ll explain why we had these issues, how we worked around them, and present an userspace code implementing an improved variant of connect() syscall.

Outgoing connections on Linux part 1 – TCP

Let’s start with a bit of historical background.

Long-lived connections

Back in 2014 Cloudflare announced support for WebSockets. We wrote two articles about it:

If you skim these blogs, you’ll notice we were totally fine with the WebSocket protocol, framing and operation. What worried us was our capacity to handle large numbers of concurrent outgoing connections towards the origin servers. Since WebSockets are long-lived, allowing them through our servers might greatly increase the concurrent connection count. And this did turn out to be a problem. It was possible to hit a ceiling for a total number of outgoing connections imposed by the Linux networking stack.

In a pessimistic case, each Linux connection consumes a local port (ephemeral port), and therefore the total connection count is limited by the size of the ephemeral port range.

Basics – how port allocation works

When establishing an outbound connection a typical user needs the destination address and port. For example, DNS might resolve cloudflare.com to the ‘104.1.1.229’ IPv4 address. A simple Python program can establish a connection to it with the following code:

cd = socket.socket(AF_INET, SOCK_STREAM)
cd.connect(('104.1.1.229', 80))

The operating system’s job is to figure out how to reach that destination, selecting an appropriate source address and source port to form the full 4-tuple for the connection:

How to stop running out of ephemeral ports and start to love long-lived connections

The operating system chooses the source IP based on the routing configuration. On Linux we can see which source IP will be chosen with ip route get:

$ ip route get 104.1.1.229
104.1.1.229 via 192.168.1.1 dev eth0 src 192.168.1.8 uid 1000
	cache

The src parameter in the result shows the discovered source IP address that should be used when going towards that specific target.

The source port, on the other hand, is chosen from the local port range configured for outgoing connections, also known as the ephemeral port range. On Linux this is controlled by the following sysctls:

$ sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_reserved_ports
net.ipv4.ip_local_port_range = 32768    60999
net.ipv4.ip_local_reserved_ports =

The ip_local_port_range sets the low and high (inclusive) port range to be used for outgoing connections. The ip_local_reserved_ports is used to skip specific ports if the operator needs to reserve them for services.

Vanilla TCP is a happy case

The default ephemeral port range contains more than 28,000 ports (60999+1-32768=28232). Does that mean we can have at most 28,000 outgoing connections? That’s the core question of this blog post!

In TCP the connection is identified by a full 4-tuple, for example:

full 4-tuple 192.168.1.8 32768 104.1.1.229 80

In principle, it is possible to reuse the source IP and port, and share them against another destination. For example, there could be two simultaneous outgoing connections with these 4-tuples:

full 4-tuple #A 192.168.1.8 32768 104.1.1.229 80
full 4-tuple #B 192.168.1.8 32768 151.101.1.57 80

This “source two-tuple” sharing can happen in practice when establishing connections using the vanilla TCP code:

sd = socket.socket(SOCK_STREAM)
sd.connect( (remote_ip, remote_port) )

But slightly different code can prevent this sharing, as we’ll discuss.

In the rest of this blog post, we’ll summarise the behaviour of code fragments that make outgoing connections showing:

  • The technique’s description
  • The typical `errno` value in the case of port exhaustion
  • And whether the kernel is able to reuse the {source IP, source port}-tuple against another destination

The last column is the most important since it shows if there is a low limit of total concurrent connections. As we’re going to see later, the limit is present more often than we’d expect.

technique description errno on port exhaustion possible src 2-tuple reuse
connect(dst_IP, dst_port) EADDRNOTAVAIL yes (good!)

In the case of generic TCP, things work as intended. Towards a single destination it’s possible to have as many connections as an ephemeral range allows. When the range is exhausted (against a single destination), we’ll see EADDRNOTAVAIL error. The system also is able to correctly reuse local two-tuple {source IP, source port} for ESTABLISHED sockets against other destinations. This is expected and desired.

Manually selecting source IP address

Let’s go back to the Cloudflare server setup. Cloudflare operates many services, to name just two: CDN (caching HTTP reverse proxy) and WARP.

For Cloudflare, it’s important that we don’t mix traffic types among our outgoing IPs. Origin servers on the Internet might want to differentiate traffic based on our product. The simplest example is CDN: it’s appropriate for an origin server to firewall off non-CDN inbound connections. Allowing Cloudflare cache pulls is totally fine, but allowing WARP connections which contain untrusted user traffic might lead to problems.

To achieve such outgoing IP separation, each of our applications must be explicit about which source IPs to use. They can’t leave it up to the operating system; the automatically-chosen source could be wrong. While it’s technically possible to configure routing policy rules in Linux to express such requirements, we decided not to do that and keep Linux routing configuration as simple as possible.

Instead, before calling connect(), our applications select the source IP with the bind() syscall. A trick we call “bind-before-connect”:

sd = socket.socket(SOCK_STREAM)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )

technique description errno on port exhaustion possible src 2-tuple reuse
bind(src_IP, 0)
connect(dst_IP, dst_port)
EADDRINUSE no (bad!)

This code looks rather innocent, but it hides a considerable drawback. When calling bind(), the kernel attempts to find an unused local two-tuple. Due to BSD API shortcomings, the operating system can’t know what we plan to do with the socket. It’s totally possible we want to listen() on it, in which case sharing the source IP/port with a connected socket will be a disaster! That’s why the source two-tuple selected when calling bind() must be unique.

Due to this API limitation, in this technique the source two-tuple can’t be reused. Each connection effectively “locks” a source port, so the number of connections is constrained by the size of the ephemeral port range. Notice: one source port is used up for each connection, no matter how many destinations we have. This is bad, and is exactly the problem we were dealing with back in 2014 in the WebSockets articles mentioned above.

Fortunately, it’s fixable.

IP_BIND_ADDRESS_NO_PORT

Back in 2014 we fixed the problem by setting the SO_REUSEADDR socket option and manually retrying bind()+ connect() a couple of times on error. This worked ok, but later in 2015 Linux introduced a proper fix: the IP_BIND_ADDRESS_NO_PORT socket option. This option tells the kernel to delay reserving the source port:

sd = socket.socket(SOCK_STREAM)
sd.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )

technique description errno on port exhaustion possible src 2-tuple reuse
IP_BIND_ADDRESS_NO_PORT
bind(src_IP, 0)

connect(dst_IP, dst_port)
EADDRNOTAVAIL yes (good!)

This gets us back to the desired behavior. On modern Linux, when doing bind-before-connect for TCP, you should set IP_BIND_ADDRESS_NO_PORT.

Explicitly selecting a source port

Sometimes an application needs to select a specific source port. For example: the operator wants to control full 4-tuple in order to debug ECMP routing issues.

Recently a colleague wanted to run a cURL command for debugging, and he needed the source port to be fixed. cURL provides the --local-port option to do this¹ :

$ curl --local-port 9999 -4svo /dev/null https://cloudflare.com/cdn-cgi/trace
*   Trying 104.1.1.229:443...

In other situations source port numbers should be controlled, as they can be used as an input to a routing mechanism.

But setting the source port manually is not easy. We’re back to square one in our hackery since IP_BIND_ADDRESS_NO_PORT is not an appropriate tool when calling bind() with a specific source port value. To get the scheme working again and be able to share source 2-tuple, we need to turn to SO_REUSEADDR:

sd = socket.socket(SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( (src_IP, src_port) )
sd.connect( (dst_IP, dst_port) )

Our summary table:

technique description errno on port exhaustion possible src 2-tuple reuse
SO_REUSEADDR
bind(src_IP, src_port)

connect(dst_IP, dst_port)
EADDRNOTAVAIL yes (good!)

Here, the user takes responsibility for handling conflicts, when an ESTABLISHED socket sharing the 4-tuple already exists. In such a case connect will fail with EADDRNOTAVAIL and the application should retry with another acceptable source port number.

Userspace connectx implementation

With these tricks, we can implement a common function and call it connectx. It will do what bind()+connect() should, but won’t have the unfortunate ephemeral port range limitation. In other words, created sockets are able to share local two-tuples as long as they are going to distinct destinations:

def connectx((source_IP, source_port), (destination_IP, destination_port)):

We have three use cases this API should support:

user specified technique
{_, _, dst_IP, dst_port} vanilla connect()
{src_IP, _, dst_IP, dst_port} IP_BIND_ADDRESS_NO_PORT
{src_IP, src_port, dst_IP, dst_port} SO_REUSEADDR

The name we chose isn’t an accident. MacOS (specifically the underlying Darwin OS) has exactly that function implemented as a connectx() system call (implementation):

How to stop running out of ephemeral ports and start to love long-lived connections

It’s more powerful than our connectx code, since it supports TCP Fast Open.

Should we, Linux users, be envious? For TCP, it’s possible to get the right kernel behaviour with the appropriate setsockopt/bind/connect dance, so a kernel syscall is not quite needed.

But for UDP things turn out to be much more complicated and a dedicated syscall might be a good idea.

Outgoing connections on Linux – part 2 – UDP

In the previous section we listed three use cases for outgoing connections that should be supported by the operating system:

  • Vanilla egress: operating system chooses the outgoing IP and port
  • Source IP selection: user selects outgoing IP but the OS chooses port
  • Full 4-tuple: user selects full 4-tuple for the connection

We demonstrated how to implement all three cases on Linux for TCP, without hitting connection count limits due to source port exhaustion.

It’s time to extend our implementation to UDP. This is going to be harder.

For UDP, Linux maintains one hash table that is keyed on local IP and port, which can hold duplicate entries. Multiple UDP connected sockets can not only share a 2-tuple but also a 4-tuple! It’s totally possible to have two distinct, connected sockets having exactly the same 4-tuple. This feature was created for multicast sockets. The implementation was then carried over to unicast connections, but it is confusing. With conflicting sockets on unicast addresses, only one of them will receive any traffic. A newer connected socket will “overshadow” the older one. It’s surprisingly hard to detect such a situation. To get UDP connectx() right, we will need to work around this “overshadowing” problem.

Vanilla UDP is limited

It might come as a surprise to many, but by default, the total count for outbound UDP connections is limited by the ephemeral port range size. Usually, with Linux you can’t have more than ~28,000 connected UDP sockets, even if they point to multiple destinations.

Ok, let’s start with the simplest and most common way of establishing outgoing UDP connections:

sd = socket.socket(SOCK_DGRAM)
sd.connect( (dst_IP, dst_port) )

technique description errno on port exhaustion possible src 2-tuple reuse risk of overshadowing
connect(dst_IP, dst_port) EAGAIN no (bad!) no

The simplest case is not a happy one. The total number of concurrent outgoing UDP connections on Linux is limited by the ephemeral port range size. On our multi-tenant servers, with potentially long-lived gaming and H3/QUIC flows containing WebSockets, this is too limiting.

On TCP we were able to slap on a setsockopt and move on. No such easy workaround is available for UDP.

For UDP, without REUSEADDR, Linux avoids sharing local 2-tuples among UDP sockets. During connect() it tries to find a 2-tuple that is not used yet. As a side note: there is no fundamental reason that it looks for a unique 2-tuple as opposed to a unique 4-tuple during ‘connect()’. This suboptimal behavior might be fixable.

SO_REUSEADDR is hard

To allow local two-tuple reuse we need the SO_REUSEADDR socket option. Sadly, this would also allow established sockets to share a 4-tuple, with the newer socket overshadowing the older one.

sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.connect( (dst_IP, dst_port) )

technique description errno on port exhaustion possible src 2-tuple reuse risk of overshadowing
SO_REUSEADDR
connect(dst_IP, dst_port)
EAGAIN yes yes (bad!)

In other words, we can’t just set SO_REUSEADDR and move on, since we might hit a local 2-tuple that is already used in a connection against the same destination. We might already have an identical 4-tuple connected socket underneath. Most importantly, during such a conflict we won’t be notified by any error. This is unacceptably bad.

Detecting socket conflicts with eBPF

We thought a good solution might be to write an eBPF program to detect such conflicts. The idea was to put a code on the connect() syscall. Linux cgroups allow the BPF_CGROUP_INET4_CONNECT hook. The eBPF is called every time a process under a given cgroup runs the connect() syscall. This is pretty cool, and we thought it would allow us to verify if there is a 4-tuple conflict before moving the socket from UNCONNECTED to CONNECTED states.

Here is how to load and attach our eBPF

bpftool prog load ebpf.o /sys/fs/bpf/prog_connect4  type cgroup/connect4
bpftool cgroup attach /sys/fs/cgroup/unified/user.slice connect4 pinned /sys/fs/bpf/prog_connect4

With such a code, we’ll greatly reduce the probability of overshadowing:

technique description errno on port exhaustion possible src 2-tuple reuse risk of overshadowing
INET4_CONNECT hook
SO_REUSEADDR
connect(dst_IP, dst_port)
manual port discovery, EPERM on conflict yes yes, but small

However, this solution is limited. First, it doesn’t work for sockets with an automatically assigned source IP or source port, it only works when a user manually creates a 4-tuple connection from userspace. Then there is a second issue: a typical race condition. We don’t grab any lock, so it’s technically possible a conflicting socket will be created on another CPU in the time between our eBPF conflict check and the finish of the real connect() syscall machinery. In short, this lockless eBPF approach is better than nothing, but fundamentally racy.

Socket traversal – SOCK_DIAG ss way

There is another way to verify if a conflicting socket already exists: we can check for connected sockets in userspace. It’s possible to do it without any privileges quite effectively with the SOCK_DIAG_BY_FAMILY feature of netlink interface. This is the same technique the ss tool uses to print out sockets available on the system.

The netlink code is not even all that complicated. Take a look at the code. Inside the kernel, it goes quickly into a fast __udp_lookup() routine. This is great – we can avoid iterating over all sockets on the system.

With that function handy, we can draft our UDP code:

sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.bind( src_addr )
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError(...)
sd.connect( dst_addr )

This code has the same race condition issue as the connect inet eBPF hook before. But it’s a good starting point. We need some locking to avoid the race condition. Perhaps it’s possible to do it in the userspace.

SO_REUSEADDR as a lock

Here comes a breakthrough: we can use SO_REUSEADDR as a locking mechanism. Consider this:

sd = socket.socket(SOCK_DGRAM)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( src_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 0)
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError()
sd.connect( dst_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

The idea here is:

  • We need REUSEADDR around bind, otherwise it wouldn’t be possible to reuse a local port. It’s technically possible to clear REUSEADDR after bind. Doing this technically makes the kernel socket state inconsistent, but it doesn’t hurt anything in practice.
  • By clearing REUSEADDR, we’re locking new sockets from using that source port. At this stage we can check if we have ownership of the 4-tuple we want. Even if multiple sockets enter this critical section, only one, the newest, can win this verification. This is a cooperative algorithm, so we assume all tenants try to behave.
  • At this point, if the verification succeeds, we can perform connect() and have a guarantee that the 4-tuple won’t be reused by another socket at any point in the process.

This is rather convoluted and hacky, but it satisfies our requirements:

technique description errno on port exhaustion possible src 2-tuple reuse risk of overshadowing
REUSEADDR as a lock EAGAIN yes no

Sadly, this schema only works when we know the full 4-tuple, so we can’t rely on kernel automatic source IP or port assignments.

Faking source IP and port discovery

In the case when the user calls ‘connect’ and specifies only target 2-tuple – destination IP and port, the kernel needs to fill in the missing bits – the source IP and source port. Unfortunately the described algorithm expects the full 4-tuple to be known in advance.

One solution is to implement source IP and port discovery in userspace. This turns out to be not that hard. For example, here’s a snippet of our code:

def _get_udp_port(family, src_addr, dst_addr):
    if ephemeral_lo == None:
        _read_ephemeral()
    lo, hi = ephemeral_lo, ephemeral_hi
    start = random.randint(lo, hi)
    ...

Putting it all together

Combining the manual source IP, port discovery and the REUSEADDR locking dance, we get a decent userspace implementation of connectx() for UDP.

We have covered all three use cases this API should support:

user specified comments
{_, _, dst_IP, dst_port} manual source IP and source port discovery
{src_IP, _, dst_IP, dst_port} manual source port discovery
{src_IP, src_port, dst_IP, dst_port} just our “REUSEADDR as lock” technique

Take a look at the full code.

Summary

This post described a problem we hit in production: running out of ephemeral ports. This was partially caused by our servers running numerous concurrent connections, but also because we used the Linux sockets API in a way that prevented source port reuse. It meant that we were limited to ~28,000 concurrent connections per protocol, which is not enough for us.

We explained how to allow source port reuse and prevent having this ephemeral-port-range limit imposed. We showed an userspace connectx() function, which is a better way of creating outgoing TCP and UDP connections on Linux.

Our UDP code is more complex, based on little known low-level features, assumes cooperation between tenants and undocumented behaviour of the Linux operating system. Using REUSEADDR as a locking mechanism is rather unheard of.

The connectx() functionality is valuable, and should be added to Linux one way or another. It’s not trivial to get all its use cases right. Hopefully, this blog post shows how to achieve this in the best way given the operating system API constraints.

___

¹ On a side note, on the second cURL run it fails due to TIME-WAIT sockets: “bind failed with errno 98: Address already in use”.

One option is to wait for the TIME_WAIT socket to die, or work around this with the time-wait sockets kill script. Killing time-wait sockets is generally a bad idea, violating protocol, unneeded and sometimes doesn’t work. But hey, in some extreme cases it’s good to know what’s possible. Just saying.

Announcing Argo for Spectrum

Post Syndicated from Achiel van der Mandele original https://blog.cloudflare.com/argo-spectrum/

Announcing Argo for Spectrum

Announcing Argo for Spectrum

Today we’re excited to announce the general availability of Argo for Spectrum, a way to turbo-charge any TCP based application. With Argo for Spectrum, you can reduce latency, packet loss and improve connectivity for any TCP application, including common protocols like Minecraft, Remote Desktop Protocol and SFTP.

The Internet — more than just a browser

When people think of the Internet, many of us think about using a browser to view websites. Of course, it’s so much more! We often use other ways to connect to each other and to the resources we need for work. For example, you may interact with servers for work using SSH File Transfer Protocol (SFTP), git or Remote Desktop software. At home, you might play a video game on the Internet with friends.

To help people that protect these services against DDoS attacks, Spectrum launched in 2018 and extends Cloudflare’s DDoS protection to any TCP or UDP based protocol. Customers use it for a wide variety of use cases, including to protect video streaming (RTMP), gaming and internal IT systems. Spectrum also supports common VoIP protocols such as SIP and RTP, which have recently seen an increase in DDoS ransomware attacks. A lot of these applications are also highly sensitive to performance issues. No one likes waiting for a file to upload or dealing with a lagging video game.

Latency and throughput are the two metrics people generally discuss when talking about network performance. Latency refers to the amount of time a piece of data (a packet) takes to traverse between two systems. Throughput refers to the amount of bits you can actually send per second. This blog will discuss how these two interplay and how we improve them with Argo for Spectrum.

Argo to the rescue

There are a number of factors that cause poor performance between two points on the Internet, including network congestion, the distance between the two points, and packet loss. This is a problem many of our customers have, even on web applications. To help, we launched Argo Smart Routing in 2017, a way to reduce latency (or time to first byte, to be precise) for any HTTP request that goes to an origin.

That’s great for folks who run websites, but what if you’re working on an application that doesn’t speak HTTP? Up until now people had limited options for improving performance for these applications. That changes today with the general availability of Argo for Spectrum. Argo for Spectrum offers the same benefits as Argo Smart Routing for any TCP-based protocol.

Argo for Spectrum takes the same smarts from our network traffic and applies it to Spectrum. At time of writing, Cloudflare sits in front of approximately 20% of the Alexa top 10 million websites. That means that we see, in near real-time, which networks are congested, which are slow and which are dropping packets. We use that data and take action by provisioning faster routes, which sends packets through the Internet faster than normal routing. Argo for Spectrum works the exact same way, using the same intelligence and routing plane but extending it to any TCP based application.

Performance

But what does this mean for real application performance? To find out, we ran a set of benchmarks on Catchpoint. Catchpoint is a service that allows you to set up performance monitoring from all over the world. Tests are repeated at intervals and aggregate results are reported. We wanted to use a third party such as Catchpoint to get objective results (as opposed to running themselves).

For our test case, we used a file server in the Netherlands as our origin. We provisioned various tests on Catchpoint to measure file transfer performance from various places in the world: Rabat, Tokyo, Los Angeles and Lima.

Announcing Argo for Spectrum
Throughput of a 10MB file. Higher is better.

Depending on location, transfers saw increases of up to 108% (for locations such as Tokyo) and 85% on average. Why is it so much faster? The answer is bandwidth delay product. In layman’s terms, bandwidth delay product means that the higher the latency, the lower the throughput. This is because with transmission protocols such as TCP, we need to wait for the other party to acknowledge that they received data before we can send more.

As an analogy, let’s assume we’re operating a water cleaning facility. We send unprocessed water through a pipe to a cleaning facility, but we’re not sure how much capacity the facility has! To test, we send an amount of water through the pipe. Once the water has arrived, the facility will call us up and say, “we can easily handle this amount of water at a time, please send more.” If the pipe is short, the feedback loop is quick: the water will arrive, and we’ll immediately be able to send more without having to wait. If we have a very, very long pipe, we have to stop sending water for a while before we get confirmation that the water has arrived and there’s enough capacity.

The same happens with TCP: we send an amount of data to the wire and wait to get confirmation that it arrived. If the latency is high it reduces the throughput because we’re constantly waiting for confirmation. If latency is low we can throttle throughput at a high rate. With Spectrum and Argo, we help in two ways: the first is that Spectrum terminates the TCP connection close to the user, meaning that latency for that link is low. The second is that Argo reduces the latency between our edge and the origin. In concert, they create a set of low-latency connections, resulting in a low overall bandwidth delay product between users in origin. The result is a much higher throughput than you would otherwise get.

Argo for Spectrum supports any TCP based protocol. This includes commonly used protocols like SFTP, git (over SSH), RDP and SMTP, but also media streaming and gaming protocols such as RTMP and Minecraft. Setting up Argo for Spectrum is easy. When creating a Spectrum application, just hit the “Argo Smart Routing” toggle. Any traffic will automatically be smart routed.

Announcing Argo for Spectrum

Argo for Spectrum covers much more than just these applications: we support any TCP-based protocol. If you’re interested, reach out to your account team today to see what we can do for you.