linux – Noise

So long, and thanks for all the fish: how to escape the Linux networking stack

Chris Branch — Wed, 29 Oct 2025 13:00:00 +0000

Many products at Cloudflare aren’t possible without pushing the limits of network hardware and software to deliver improved performance, increased efficiency, or novel capabilities such as soft-unicast, our method for sharing IP subnets across data centers. Happily, most people do not need to know the intricacies of how your operating system handles network and Internet access in general. Yes, even most people within Cloudflare. But sometimes we try to push well beyond the design intentions of Linux’s networking stack. This is a story about one of those attempts.

How to build your own VPN, or: the history of WARP

Chris Branch — Wed, 29 Oct 2025 13:00:00 +0000

WARP’s initial implementation resembled a VPN that allows Internet access through it. Here’s how we built it – and how you can, too.

A deep dive into BPF LPM trie performance and optimization

Matt Fleming — Tue, 21 Oct 2025 13:00:00 +0000

This post explores the performance of BPF LPM tries, a critical data structure used for IP matching.

Safe in the sandbox: security hardening for Cloudflare Workers

Erik Corry — Thu, 25 Sep 2025 14:00:00 +0000

We are further hardening Cloudflare Workers with the latest software and hardware features. We use defense-in-depth, including V8 sandboxes and the CPU's memory protection keys to keep your data safe.

New Linux Vulnerabilities

Bruce Schneier — Tue, 03 Jun 2025 11:07:32 +0000

They’re interesting:

Tracked as CVE-2025-5054 and CVE-2025-4598, both vulnerabilities are race condition bugs that could enable a local attacker to obtain access to access sensitive information. Tools like Apport and systemd-coredump are designed to handle crash reporting and core dumps in Linux systems.

[…]

“This means that if a local attacker manages to induce a crash in a privileged process and quickly replaces it with another one with the same process ID that resides inside a mount and pid namespace, apport will attempt to forward the core dump (which might contain sensitive information belonging to the original, privileged process) into the namespace.”...

QUIC restarts, slow problems: udpgrm to the rescue

Marek Majkowski — Wed, 07 May 2025 13:00:00 +0000

udpgrm is a lightweight daemon for graceful restarts of UDP servers. It leverages SO_REUSEPORT and eBPF to route new and existing flows to the correct server instance.

New Linux Rootkit

Bruce Schneier — Thu, 24 Apr 2025 19:35:15 +0000

Interesting:

The company has released a working rootkit called “Curing” that uses io_uring, a feature built into the Linux kernel, to stealthily perform malicious activities without being caught by many of the detection solutions currently on the market.

At the heart of the issue is the heavy reliance on monitoring system calls, which has become the go-to method for many cybersecurity vendors. The problem? Attackers can completely sidestep these monitored calls by leaning on io_uring instead. This clever method could let bad actors quietly make network connections or tamper with files without triggering the usual alarms...

How Netflix Accurately Attributes eBPF Flow Logs

Netflix Technology Blog — Tue, 08 Apr 2025 17:50:58 +0000

By Cheng Xie, Bryan Shultz, and Christine XuIn a previous blog post, we described how Netflix uses eBPF to capture TCP flow logs at scale for enhanced network insights. In this post, we delve deeper into how Netflix solved a core problem: accurately at...

A steam locomotive from 1993 broke my yarn test

Yew Leong — Wed, 02 Apr 2025 13:00:00 +0000

Yarn tests fail consistently at the 27-second mark. The usual suspects are swiftly eliminated. A deep dive is taken to comb through traces, only to be derailed into an unexpected crash investigation.

Searching for the cause of hung tasks in the Linux kernel

Oxana Kharitonova — Fri, 14 Feb 2025 14:00:00 +0000

The Linux kernel can produce a hung task warning. Searching the Internet and the kernel docs, you can find a brief explanation that the process is stuck in the uninterruptible state.

Multi-Path TCP: revolutionizing connectivity, one path at a time

Marek Majkowski — Fri, 03 Jan 2025 14:00:00 +0000

Multi-Path TCP (MPTCP) leverages multiple network interfaces, like Wi-Fi and cellular, to provide seamless mobility for more reliable connectivity. While promising, MPTCP is still in its early stages,

Perfectl Malware

Bruce Schneier — Mon, 14 Oct 2024 11:06:27 +0000

Perfectl in an impressive piece of malware:

The malware has been circulating since at least 2021. It gets installed by exploiting more than 20,000 common misconfigurations, a capability that may make millions of machines connected to the Internet potential targets, researchers from Aqua Security said. It can also exploit CVE-2023-33246, a vulnerability with a severity rating of 10 out of 10 that was patched last year in Apache RocketMQ, a messaging and streaming platform that’s found on many Linux machines.

The researchers are calling the malware Perfctl, the name of a malicious component that surreptitiously mines cryptocurrency. The unknown developers of the malware gave the process a name that combines the perf Linux monitoring tool and ctl, an abbreviation commonly used with command line tools. A signature characteristic of Perfctl is its use of process and file names that are identical or similar to those commonly found in Linux environments. The naming convention is one of the many ways the malware attempts to escape notice of infected users...

Noisy Neighbor Detection with eBPF

Netflix Technology Blog — Tue, 10 Sep 2024 18:00:21 +0000

By Jose Fernandez, Sebastien Dabdoub, Jason Koch, Artem Tkachuk

The Compute and Performance Engineering teams at Netflix regularly investigate performance issues in our multi-tenant environment. The first step is determining whether the problem originates from the application or the underlying infrastructure. One issue that often complicates this process is the "noisy neighbor" problem. On Titus, our multi-tenant compute platform, a "noisy neighbor" refers to a container or system service that heavily utilizes the server's resources, causing performance degradation in adjacent containers. We usually focus on CPU utilization because it is our workload's most frequent source of noisy neighbor issues.

Detecting the effects of noisy neighbors is complex. Traditional performance analysis tools such as perf can introduce significant overhead, risking further performance degradation. Additionally, these tools are typically deployed after the fact, which is too late for effective investigation. Another challenge is that debugging noisy neighbor issues requires significant low-level expertise and specialized tooling. In this blog post, we'll reveal how we leveraged eBPF to achieve continuous, low-overhead instrumentation of the Linux scheduler, enabling effective self-serve monitoring of noisy neighbor issues. Learn how Linux kernel instrumentation can improve your infrastructure observability with deeper insights and enhanced monitoring.

Continuous Instrumentation of the Linux Scheduler

To ensure the reliability of our workloads that depend on low latency responses, we instrumented the run queue latency for each container, which measures the time processes spend in the scheduling queue before being dispatched to the CPU. Extended waiting in this queue can be a telltale of performance issues, especially when containers are not utilizing their total CPU allocation. Continuous instrumentation is critical to catching such matters as they emerge, and eBPF, with its hooks into the Linux scheduler with minimal overhead, enabled us to monitor run queue latency efficiently.

To emit a run queue latency metric, we leveraged three eBPF hooks: sched_wakeup, sched_wakeup_new, and sched_switch.

The sched_wakeup and sched_wakeup_new hooks are invoked when a process changes state from 'sleeping' to 'runnable.' They let us identify when a process is ready to run and is waiting for CPU time. During this event, we generate a timestamp and store it in an eBPF hash map using the process ID as the key.

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u64));
} runq_lat SEC(".maps");

SEC("tp_btf/sched_wakeup")
int tp_sched_wakeup(u64 *ctx)
{
    struct task_struct *task = (void *)ctx[0];
    u32 pid = task->pid;
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&runq_lat, &pid, &ts, BPF_NOEXIST);
    return 0;
}

Conversely, the sched_switch hook is triggered when the CPU switches between processes. This hook provides pointers to the process currently utilizing the CPU and the process about to take over. We use the upcoming task's process ID (PID) to fetch the timestamp from the eBPF map. This timestamp represents when the process entered the queue, which we had previously stored. We then calculate the run queue latency by simply subtracting the timestamps.

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
    struct task_struct *prev = (struct task_struct *)ctx[1];
    struct task_struct *next = (struct task_struct *)ctx[2];
    u32 prev_pid = prev->pid;
    u32 next_pid = next->pid;
 
    // fetch timestamp of when the next task was enqueued
    u64 *tsp = bpf_map_lookup_elem(&runq_lat, &next_pid);
    if (tsp == NULL) {
        return 0; // missed enqueue
    }

    // calculate runq latency before deleting the stored timestamp
    u64 now = bpf_ktime_get_ns();
    u64 runq_lat = now - *tsp;

    // delete pid from enqueued map
    bpf_map_delete_elem(&runq_lat, &next_pid);
    ....

One of the advantages of eBPF is its ability to provide pointers to the actual kernel data structures representing processes or threads, also known as tasks in kernel terminology. This feature enables access to a wealth of information stored about a process. We required the process's cgroup ID to associate it with a container for our specific use case. However, the cgroup information in the struct is safeguarded by an RCU (Read Copy Update) lock.

To safely access this RCU-protected information, we can leverage kfuncs in eBPF. kfuncs are kernel functions that can be called from eBPF programs. There are kfuncs available to lock and unlock RCU read-side critical sections. These functions ensure that our eBPF program remains safe and efficient while retrieving the cgroup ID from the task struct.

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

u64 get_task_cgroup_id(struct task_struct *task)
{
    struct css_set *cgroups;
    u64 cgroup_id;
    bpf_rcu_read_lock();
    cgroups = task->cgroups;
    cgroup_id = cgroups->dfl_cgrp->kn->id;
    bpf_rcu_read_unlock();
    return cgroup_id;
}

Having the data ready, we must package it and send it to userspace. For this purpose, we chose the eBPF ring buffer. It is efficient, high-performing, and user-friendly. It can handle variable-length data records and allows data reading without necessitating extra memory copying or syscalls. However, the sheer amount of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data effectively.

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u64));
    __uint(value_size, sizeof(u64));
} cgroup_id_to_last_event_ts SEC(".maps");

struct runq_event {
    u64 prev_cgroup_id;
    u64 cgroup_id;
    u64 runq_lat;
    u64 ts;
};

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
    // ....
    // The previous code
    // ....
 
    u64 prev_cgroup_id = get_task_cgroup_id(prev);
    u64 cgroup_id = get_task_cgroup_id(next);
 
    // per-cgroup-id-per-CPU rate-limiting 
    // to balance observability with performance overhead
    u64 *last_ts = 
        bpf_map_lookup_elem(&cgroup_id_to_last_event_ts, &cgroup_id);
    u64 last_ts_val = last_ts == NULL ? 0 : *last_ts;

    // check the rate limit for the cgroup_id in consideration
    // before doing more work
    if (now - last_ts_val < RATE_LIMIT_NS) {
        // Rate limit exceeded, drop the event
        return 0;
    }

    struct runq_event *event;
    event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
  
    if (event) {
        event->prev_cgroup_id = prev_cgroup_id;
        event->cgroup_id = cgroup_id;
        event->runq_lat = runq_lat;
        event->ts = now;
        bpf_ringbuf_submit(event, 0);
        // Update the last event timestamp for the current cgroup_id
        bpf_map_update_elem(&cgroup_id_to_last_event_ts, &cgroup_id,
            &now, BPF_ANY);

    }

    return 0;
}

Our userspace application, developed in Go, processes events from the ring buffer to emit metrics to our metrics backend, Atlas. Each event includes a run queue latency sample with a cgroup ID, which we associate with running containers on the host. We categorize it as a system service if no such association is found. When a cgroup ID correlates with a container, we emit a percentile timer Atlas metric (runq.latency) for that container. We also increment a counter metric (sched.switch.out) to monitor preemptions occurring for the container's processes. Access to the prev_cgroup_id of the preempted process allows us to tag the metric with the cause of the preemption, whether it's due to a process within the same container (or cgroup), a process in another container, or a system service.

It's important to highlight that both the runq.latency metric and the sched.switch.out metrics are needed to determine if a container is affected by noisy neighbors, which is the goal we aim to achieve — relying solely on the runq.latency metric can lead to misconceptions. For example, if a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it's actually because the container is hitting its CPU request limits. However, simultaneous spikes in both metrics, mainly when the cause is a different container or system process, clearly indicate a noisy neighbor issue.

A Noisy Neighbor Story

Below is the runq.latency metric for a server running a single container with ample CPU overhead. The 99th percentile averages 83.4µs (microseconds), serving as our baseline. Although there are some spikes reaching 400µs, the latency remains within acceptable parameters.

container1’s 99th percentile runq.latency averages 83µs (microseconds), with spikes up to 400µs, without adjacent containers. This serves as our baseline for a container not contending for CPU on a host.

At 10:35, launching container2, which fully utilized all CPUs on the host, caused a significant 131-millisecond spike (131,000 microseconds) in container1's P99 run queue latency. This spike would be noticeable in the userspace application if it were serving HTTP traffic. If userspace app owners reported an unexplained latency spike, we could quickly identify the noisy neighbor issue through run queue latency metrics.

Launching container2 at 10:35, which maxes out all CPUs on the host, caused a 131-millisecond spike in container1’s P99 run queue latency due to increased preemptions by system processes. This indicates a noisy neighbor issue, where system services compete for CPU time with containers.

The sched.switch.out metric indicates that the spike was due to increased preemptions by system processes, highlighting a noisy neighbor issue where system services compete with containers for CPU time. Our metrics show that the noisy neighbors were actually system processes, likely triggered by container2 consuming all available CPU capacity.

Optimizing eBPF Code

We developed an open-source eBPF process monitor called bpftop to measure the overhead of eBPF code in this hot kernel path. Our estimates suggest that the instrumentation adds less than 600 nanoseconds to each sched_* hook. We conducted a performance analysis on a Java service running in a container, and the instrumentation did not introduce significant overhead. The performance variance with the run queue profiling code active versus inactive was not measurable in milliseconds.

During our research on how eBPF statistics are measured in the kernel, we identified an opportunity to improve its calculation. We submitted this patch, which was included in the Linux kernel 6.10 release.

Through trial and error and using bpftop, we identified several optimizations that helped maintain low overhead for this code:

We found that BPF_MAP_TYPE_HASH was the most performant for storing enqueued timestamps. Using BPF_MAP_TYPE_TASK_STORAGE resulted in nearly a twofold performance decline. BPF_MAP_TYPE_PERCPU_HASH was slightly less performant than BPF_MAP_TYPE_HASH, which was unexpected and requires further investigation.
The BPF_CORE_READ helper adds 20–30 nanoseconds per invocation. In the case of raw tracepoints, specifically those that are "BTF-enabled" (tp_btf/*), it is safe and more efficient to access the task struct members directly. Andrii Nakryiko recommends this approach in this blog post.
BPF_MAP_TYPE_LRU_HASH maps are 40–50 nanoseconds slower per operation than regular hash maps. Due to space concerns from PID churn, we initially used them for enqueued timestamps. We have since increased the map size, mitigating this risk.
The sched_switch, sched_wakeup, and sched_wakeup_new are all triggered for kernel tasks, which are identifiable by their PID of 0. We found monitoring these tasks unnecessary, so we implemented several early exit conditions and conditional logic to prevent executing costly operations, such as accessing BPF maps, when dealing with a kernel task. Notably, kernel tasks operate through the scheduler queue like any regular process.

Conclusion

Our findings highlight the value of low-overhead continuous instrumentation of the Linux kernel with eBPF. We have integrated these metrics into customer dashboards, enabling actionable insights and guiding multitenancy performance discussions. We can also now use these metrics to refine CPU isolation strategies to minimize the impact of noisy neighbors. Additionally, thanks to these metrics, we've gained deeper insights into the Linux scheduler.

This project has also deepened our understanding of eBPF technology and underscored the importance of tools like bpftop for optimizing eBPF code. As eBPF adoption increases, we foresee more infrastructure observability and business logic shifting to it. One promising project in this space is sched_ext, potentially revolutionizing how scheduling decisions are made and tailored to specific workload needs.

Noisy Neighbor Detection with eBPF was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Investigation of a Cross-regional Network Performance Issue

Netflix Technology Blog — Mon, 05 Aug 2024 22:18:00 +0000

Hechao Li, Roger CruzCloud Networking TopologyNetflix operates a highly efficient cloud computing infrastructure that supports a wide array of applications essential for our SVOD (Subscription Video on Demand), live streaming and gaming services. Utili...

Backdoor in XZ Utils That Almost Happened

Bruce Schneier — Thu, 11 Apr 2024 11:01:51 +0000

Last week, the Internet dodged a major nation-state attack that would have had catastrophic cybersecurity repercussions worldwide. It’s a catastrophe that didn’t happen, so it won’t get much attention—but it should. There’s an important moral to the story of the attack and its discovery: The security of the global Internet depends on countless obscure pieces of software written and maintained by even more obscure unpaid, distractible, and sometimes vulnerable volunteers. It’s an untenable situation, and one that is being exploited by malicious actors. Yet precious little is being done to remedy it...

XZ Utils Backdoor

Bruce Schneier — Tue, 02 Apr 2024 18:50:50 +0000

The cybersecurity world got really lucky last week. An intentionally placed backdoor in XZ Utils, an open-source compression utility, was pretty much accidentally discovered by a Microsoft engineer—weeks before it would have been incorporated into both Debian and Red Hat Linux. From ArsTehnica:

Malicious code added to XZ Utils versions 5.6.0 and 5.6.1 modified the way the software functions. The backdoor manipulated sshd, the executable file used to make remote SSH connections. Anyone in possession of a predetermined encryption key could stash any code of their choice in an SSH login certificate, upload it, and execute it on the backdoored device. No one has actually seen code uploaded, so it’s not known what code the attacker planned to run. In theory, the code could allow for just about anything, including stealing encryption keys or installing malware...

Linux kernel security tunables everyone should consider adopting

Ignat Korchagin — Wed, 06 Mar 2024 14:00:43 +0000

This post illustrates some of the Linux Kernel features, which are helping us to keep our production systems more secure. We will deep dive into how they work and why you may consider enabling them as well

Announcing bpftop: Streamlining eBPF performance optimization

Netflix Technology Blog — Mon, 26 Feb 2024 16:43:30 +0000

By Jose FernandezToday, we are thrilled to announce the release of bpftop, a command-line tool designed to streamline the performance optimization and monitoring of eBPF applications. As Netflix increasingly adopts eBPF [1, 2], applying the same rigor ...

connect() – why are you so slow?

Frederick Lawler http://blog.cloudflare.com/author/frederick/ — Thu, 08 Feb 2024 14:00:27 +0000

This is our story of what we learned about the connect() implementation for TCP in Linux. Both its strong and weak points. How connect() latency changes under pressure, and how to open connection so that the syscall latency is deterministic and time-bound

New Windows/Linux Firmware Attack

Bruce Schneier — Tue, 12 Dec 2023 12:01:18 +0000

Interesting attack based on malicious pre-OS logo images:

LogoFAIL is a constellation of two dozen newly discovered vulnerabilities that have lurked for years, if not decades, in Unified Extensible Firmware Interfaces responsible for booting modern devices that run Windows or Linux….

The vulnerabilities are the subject of a coordinated mass disclosure released Wednesday. The participating companies comprise nearly the entirety of the x64 and ARM CPU ecosystem, starting with UEFI suppliers AMI, Insyde, and Phoenix (sometimes still called IBVs or independent BIOS vendors); device manufacturers such as Lenovo, Dell, and HP; and the makers of the CPUs that go inside the devices, usually Intel, AMD or designers of ARM CPUs…...