Tag Archives: security

No Humans Involved: Mitigating a 754 Million PPS DDoS Attack Automatically

Post Syndicated from Omer Yoachimik original https://blog.cloudflare.com/no-humans-involved-mitigating-a-754-million-pps-ddos-attack-automatically/

On June 21, Cloudflare automatically mitigated a highly volumetric DDoS attack that peaked at 754 million packets per second. The attack was part of an organized four-day campaign starting on June 18 and ending on June 21: attack traffic was sent from over 316,000 IP addresses towards a single Cloudflare IP address that was mostly used for websites on our Free plan. No downtime or service degradation was reported during the attack, and no charges accrued to customers due to our unmetered mitigation guarantee.

The attack was detected and handled automatically by Gatebot, our global DDoS detection and mitigation system, without any manual intervention by our teams. Notably, because our automated systems were able to mitigate the attack without issue, no alerts or pages were sent to our on-call teams, and no humans were involved at all.

Attack Snapshot – Peaking at 754 Mpps. The two different colors in the graph represent two separate systems dropping packets. 

During those four days, the attack used a combination of three attack vectors over the TCP protocol: SYN floods, ACK floods and SYN-ACK floods. The campaign was sustained for multiple hours at rates of 400-600 million packets per second and peaked multiple times above 700 million packets per second, with a top peak of 754 million packets per second. Despite the high and sustained packet rates, our edge continued serving our customers during the attack without impacting performance at all.

The Three Types of DDoS: Bits, Packets & Requests

Attacks with high bits per second rates aim to saturate the Internet link by sending more bandwidth per second than the link can handle. Mitigating a bit-intensive flood is similar to a dam blocking gushing water in a canal with limited capacity, allowing just a portion through.

Bit Intensive DDoS Attacks as a Gushing River Blocked By Gatebot

In such cases, the Internet service provider may block or throttle the traffic above the allowance resulting in denial of service for legitimate users that are trying to connect to the website but are blocked by the service provider. In other cases, the link is simply saturated and everything behind that connection is offline.

Swarm of Mosquitoes as a Packet Intensive DDoS Attack

However, in this DDoS campaign the attack peaked at a mere 250 Gbps (I say “mere”, but ¼ Tbps is enough to knock pretty much anything offline if it isn’t behind some DDoS mitigation service), so it does not seem that the attacker intended to saturate our Internet links, perhaps because they know that our global capacity exceeds 37 Tbps. Instead, it appears the attacker attempted (and failed) to overwhelm our routers and data center appliances with high packet rates reaching 754 million packets per second. As opposed to water rushing towards a dam, a flood of packets can be thought of as a swarm of millions of mosquitoes that you need to zap one by one.

Zapping Mosquitoes with Gatebot

Depending on the ‘weakest link’ in a data center, a packet intensive DDoS attack may impact the routers, switches, web servers, firewalls, DDoS mitigation devices or any other appliance that is used in-line. Typically, a high packet rate may cause the memory buffer to overflow, leaving the router unable to process additional packets. This is because there’s a small fixed CPU cost to handling each packet, so if you can send a lot of small packets you can block an Internet connection not by filling it but by overwhelming the hardware that handles the connection.

Another form of DDoS attack is one with a high HTTP request per second rate. An HTTP request intensive DDoS attack aims to overwhelm a web server’s resources with more HTTP requests per second than the server can handle. The goal of a DDoS attack with a high request per second rate is to max out the CPU and memory utilization of the server in order to crash it or prevent it from being able to respond to legitimate requests. Request intensive DDoS attacks allow the attacker to generate much less bandwidth, as opposed to bit intensive attacks, and still cause a denial of service.

Automated DDoS Detection & Mitigation

So how did we handle 754 million packets per second? First, Cloudflare’s network uses BGP Anycast to spread attack traffic globally across our fleet of data centers. Second, we built our own DDoS protection systems, Gatebot and dosd, which drop packets inside the Linux kernel for maximum efficiency in order to handle massive floods of packets. And third, we built our own L4 load balancer, Unimog, which uses our appliances’ health and various other metrics to load-balance traffic intelligently within a data center.
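
As an illustration of what “dropping packets inside the Linux kernel” can look like in practice, here is a minimal sketch of an XDP/eBPF filter. This is our own simplified example, not Gatebot, dosd or any Cloudflare production code, and the “drop bare SYNs” rule stands in for a real attack signature.

drop_syn.c (compile with clang -O2 -target bpf -c drop_syn.c -o drop_syn.o):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int drop_syn_flood(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* parse the Ethernet header; pass anything we cannot parse */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    /* parse the IPv4 header */
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    /* parse the TCP header (ihl is the IP header length in 32-bit words) */
    struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    /* drop bare SYNs: a crude stand-in for a real attack signature */
    if (tcp->syn && !tcp->ack)
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";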

In 2017, we published a blog introducing Gatebot, one of our two DDoS protection systems. The blog was titled Meet Gatebot – a bot that allows us to sleep, and that’s exactly what happened during this attack. Attack traffic was spread out globally by our Anycast network, Gatebot detected and mitigated the attack automatically without human intervention, and traffic inside each data center was load-balanced intelligently to avoid overwhelming any one machine. As promised in the blog title, the attack peak did in fact occur while our London team was asleep.

So how does Gatebot work? Gatebot asynchronously samples traffic from every one of our data centers in over 200 locations around the world. It also monitors our customers’ origin server health. It then analyzes the samples to identify patterns and traffic anomalies that can indicate attacks. Once an attack is detected, Gatebot sends mitigation instructions to the edge data centers.

To complement Gatebot, last year we released a new system codenamed dosd (denial of service daemon) which runs in every one of our data centers around the world in over 200 cities. Similarly to Gatebot, dosd detects and mitigates attacks autonomously but in the scope of a single server or data center. You can read more about dosd in our recent blog.

The DDoS Landscape

While in recent months we’ve observed a decrease in the size and duration of DDoS attacks, highly volumetric and globally distributed DDoS attacks such as this one still persist. Regardless of the size, type or sophistication of the attack, Cloudflare offers unmetered DDoS protection to all customers and plan levels—including the Free plans.

Sandboxing in Linux with zero lines of code

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/sandboxing-in-linux-with-zero-lines-of-code/

Modern Linux operating systems provide many tools to run code more securely. There are namespaces (the basic building blocks for containers), Linux Security Modules, Integrity Measurement Architecture etc.

In this post we will review Linux seccomp and learn how to sandbox any application (even a proprietary one) without writing a single line of code.

Tux by Iwan Gabovitch, GPL
Sandbox, Simplified Pixabay License

Linux system calls

System calls (syscalls) are a well-defined interface between userspace applications and the operating system (OS) kernel. On modern operating systems most applications provide only application-specific logic as code. Applications do not, and most of the time cannot, directly access low-level hardware or networking when they need to store data or send something over the wire. Instead, they use system calls to ask the OS kernel to do specific hardware and networking tasks on their behalf.

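As a small illustration (not from the original post), the program below performs the same “write to standard output” operation twice: once through the libc wrapper and once through the generic raw syscall interface. Both paths end up in the kernel’s write system call.

hello_syscall.c:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    /* the usual way: the libc wrapper around the write system call */
    const char msg[] = "hello via libc\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);

    /* the same operation through the generic syscall wrapper */
    const char raw[] = "hello via syscall(2)\n";
    syscall(SYS_write, STDOUT_FILENO, raw, sizeof(raw) - 1);

    return 0;
}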

Apart from providing a generic, high-level way for applications to interact with the low-level hardware, the system call architecture allows the OS kernel to manage available resources between applications as well as enforce policies, like application permissions, network access control lists, etc.

Linux seccomp

Linux seccomp is yet another syscall on Linux, but it is a bit special, because it influences how the OS kernel will behave when the application uses other system calls. By default, the OS kernel has almost no insight into userspace application logic, so it provides all the possible services it can. But not all applications require all services. Consider an application which converts image formats: it needs the ability to read and write data from disk, but in its simplest form probably does not need any network access. Using seccomp, an application can declare its intentions in advance to the Linux kernel. For this particular case it can notify the kernel that it will be using the read and write system calls, but never the send and recv system calls (because its intent is to work with local files and never with the network). It’s like establishing a contract between the application and the OS kernel.

But what happens if the application later breaks the contract and tries to use one of the system calls it promised not to use? The kernel will “penalise” the application, usually by immediately terminating it. Linux seccomp also allows less restrictive actions for the kernel to take:

  • instead of terminating the whole application, the kernel can be requested to terminate only the thread that issued the prohibited system call
  • the kernel may just send a SIGSYS signal to the calling thread
  • the seccomp policy can specify an error code, which the kernel will then return to the calling application instead of executing the prohibited system call
  • if the violating process is under ptrace (for example executing under a debugger), the kernel can notify the tracer (the debugger) that a prohibited system call is about to happen and let the debugger decide what to do
  • the kernel may be instructed to allow and execute the system call, but log the attempt: this is useful when we want to verify that our seccomp policy is not too tight, without the risk of terminating the application and potentially creating an outage

Although there is a lot of flexibility in defining the potential penalty for the application, from a security perspective it is usually best to stick with the complete application termination upon seccomp policy violation. The reason for that will be described later in the examples in the post.

So why would an application take the risk of being abruptly terminated and declare its intentions beforehand, if it could just stay “silent” and let the OS kernel allow it to use any system call by default? Of course, for a normally behaving application it makes no sense, but it turns out this feature is quite effective at protecting against rogue applications and arbitrary code execution exploits.

Imagine our image format converter is written in some unsafe language and an attacker was able to take control of the application by making it process a malformed image. The attacker might then try to steal some sensitive information from the machine running our converter and send it to themselves over the network. By default, the OS kernel will most likely allow it and a data leak will happen. But if our image converter “confined” (or sandboxed) itself beforehand to only read and write local data, the kernel will terminate the application when the latter tries to leak the data over the network, thus preventing the leak and locking the attacker out of our system!

Integrating seccomp into the application

To see how seccomp can be used in practice, let’s consider a toy example program

myos.c:

#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname name;

    if (uname(&name)) {
        perror("uname failed: ");
        return 1;
    }

    printf("My OS is %s!\n", name.sysname);
    return 0;
}

This is a simplified version of the uname command line tool, which just prints your operating system name. Like its full-featured counterpart, it uses the uname system call to actually get the name of the current operating system from the kernel. Let’s see it in action:

$ gcc -o myos myos.c
$ ./myos
My OS is Linux!

Great! We’re on Linux, so we can further experiment with seccomp (it is a Linux-only feature). Notice that we’re properly handling the error code after invoking the uname system call. However, according to the man page it can only fail when the passed-in buffer pointer is invalid, and in that case the error number will be set to EFAULT, which translates to “bad address”. In our case, the “struct utsname” structure is being allocated on the stack, so our pointer will always be valid. In other words, in normal circumstances the uname system call should never fail in this particular program.

To illustrate seccomp capabilities we will add a “sandbox” function to our program before the main logic

myos_raw_seccomp.c:

#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
#include <sys/ptrace.h>
#include <sys/prctl.h>

#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>
#include <sys/utsname.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

static void sandbox(void)
{
    struct sock_filter filter[] = {
        /* seccomp(2) says we should always check the arch */
        /* as syscalls may have different numbers on different architectures */
        /* see https://fedora.juszkiewicz.com.pl/syscalls.html */
        /* for simplicity we only allow x86_64 */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))),
        /* if not x86_64, tell the kernel to kill the process */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 4),
        /* get the actual syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
        /* if "uname", tell the kernel to return EPERM, otherwise just allow */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_uname, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog prog = {
        .len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* see seccomp(2) on why this is needed */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("PR_SET_NO_NEW_PRIVS failed");
        exit(1);
    };

    /* glibc does not have a wrapper for seccomp(2) */
    /* invoke it via the generic syscall wrapper */
    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
        perror("seccomp failed");
        exit(1);
    };
}

int main(void)
{
    struct utsname name;

    sandbox();

    if (uname(&name)) {
        perror("uname failed");
        return 1;
    }

    printf("My OS is %s!\n", name.sysname);
    return 0;
}

To sandbox itself the application defines a BPF program, which implements the desired sandboxing policy. Then the application passes this program to the kernel via the seccomp system call. The kernel does some validation checks to ensure the BPF program is OK and then runs this program on every system call the application makes. The results of the execution of the program are used by the kernel to determine if the current call complies with the desired policy. In other words, the BPF program is the “contract” between the application and the kernel.

In our toy example above, the BPF program simply checks which system call is about to be invoked. If the application is trying to use the uname system call, we tell the kernel to just return an EPERM (which stands for “operation not permitted”) error code. We also tell the kernel to allow any other system call. Let’s see if it works now:

$ gcc -o myos myos_raw_seccomp.c
$ ./myos
uname failed: Operation not permitted

uname failed now with the EPERM error code, and EPERM is not even described as a potential failure code in the uname manpage! So we now know that this happened because we “told” the kernel to prohibit us from using the uname syscall and to return EPERM instead. We can double check this by replacing EPERM with some other error code which is totally inappropriate for this context, for example ENETDOWN (“network is down”). Why would we need the network to be up just to get the name of the currently executing OS? Yet, recompiling and rerunning the program we get:

$ gcc -o myos myos_raw_seccomp.c
$ ./myos
uname failed: Network is down

We can also verify the other part of our “contract” works as expected. We told the kernel to allow any other system call, remember? In our program, when uname fails, we convert the error code to a human readable message and print it on the screen with the perror function. To print on the screen perror uses the write system call under the hood and since we can actually see the printed error message, we know that the kernel allowed our program to make the write system call in the first place.

seccomp with libseccomp

While it is possible to use seccomp directly, as in the examples above, BPF programs are cumbersome to write by hand and hard to debug, review and update later. That’s why it is usually a good idea to use a more high-level library, which abstracts away most of the low-level details. Luckily such a library exists: it is called libseccomp and is even recommended by the seccomp man page.

Let’s rewrite our program’s sandbox() function to use this library instead:

myos_libseccomp.c:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>
#include <seccomp.h>
#include <err.h>

static void sandbox(void)
{
    /* allow all syscalls by default */
    scmp_filter_ctx seccomp_ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (!seccomp_ctx)
        err(1, "seccomp_init failed");

    /* kill the process, if it tries to use "uname" syscall */
    if (seccomp_rule_add_exact(seccomp_ctx, SCMP_ACT_KILL, seccomp_syscall_resolve_name("uname"), 0)) {
        perror("seccomp_rule_add_exact failed");
        exit(1);
    }

    /* apply the composed filter */
    if (seccomp_load(seccomp_ctx)) {
        perror("seccomp_load failed");
        exit(1);
    }

    /* release allocated context */
    seccomp_release(seccomp_ctx);
}

int main(void)
{
    struct utsname name;

    sandbox();

    if (uname(&name)) {
        perror("uname failed: ");
        return 1;
    }

    printf("My OS is %s!\n", name.sysname);
    return 0;
}

Our sandbox() function not only became shorter and much more readable, it also lets us reference syscalls in our rules by name rather than by internal number, and we no longer have to deal with other quirks, like setting the PR_SET_NO_NEW_PRIVS bit or handling different system architectures.

It is worth noting we have modified our seccomp policy a bit. In the raw seccomp example above we instructed the kernel to return an error code when the application tries to execute a prohibited syscall. This is good for demonstration purposes, but in most cases a stricter action is required. Just returning an error code and allowing the application to continue gives potentially malicious code a chance to bypass the policy. There are many syscalls in Linux and some of them do the same or similar things. For example, we might want to prohibit the application from reading data from disk, so we deny the read syscall in our policy and tell the kernel to return an error code instead. However, if the application does get exploited, the exploit code/logic might look like this:

…
if (-1 == read(fd, buf, count)) {
    /* hm… read failed, but what about pread? */
    if (-1 == pread(fd, buf, count, offset)) {
        /* what about readv? */ ...
    }
    /* bypassed the prohibited read(2) syscall */
}
…

Wait, what?! There is more than one read system call? Yes, there are read, pread and readv, as well as more obscure ones like io_submit and io_uring_enter. Of course, it is our fault for providing an incomplete seccomp policy which does not block all possible read syscalls. But if we had at least instructed the kernel to terminate the process immediately upon violation of the first plain read, the malicious code above would not have the chance to get clever and try other options.

Given the above, in the libseccomp example we now have a stricter policy, which tells the kernel to terminate the process upon a policy violation. Let’s see if it works:

$ gcc -o myos myos_libseccomp.c -lseccomp
$ ./myos
Bad system call

Notice that we need to link against libseccomp when compiling the application. Also, when we run the application, we don’t see the uname failed: Operation not permitted error output anymore, because we don’t give the application the ability to even print a failure message. Instead, we see a Bad system call message from the shell, which tells us that the application was terminated with a SIGSYS signal. Great!

zero code seccomp

The previous examples worked fine, but both of them have one disadvantage: we actually needed to modify the source code to embed our desired seccomp policy into the application. This is because the seccomp syscall affects the calling process and its children, but there is no interface to inject the policy from “outside”. It is expected that developers will sandbox their code themselves as part of the application logic, but in practice this rarely happens. When developers start a new project, most of the time the focus is on primary functionality, and security features are usually either postponed or omitted altogether. Also, most real-world software is written using some high-level programming language and/or framework, where the developers do not deal with system calls directly and probably are not even aware which system calls are being used by their code.

On the other hand we have system operators, sysadmins, SREs and other folks who run the above code in production. They are more incentivized to keep production systems secure, and thus would probably want to sandbox the services as much as possible. But most of the time they don’t have access to the source code. So there are mismatched expectations: developers have the ability to sandbox their code, but are usually not incentivized to do so, while operators have the incentive to sandbox the code, but don’t have the ability.

This is where “zero code seccomp” might help, where an external operator can inject the desired sandbox policy into any process without needing to modify any source code. Systemd is one of the popular implementations of a “zero code seccomp” approach. Systemd-managed services can have a SystemCallFilter= directive defined in their unit files listing all the system calls the managed service is allowed to make. As an example, let’s go back to our toy application without any sandboxing code embedded:

$ gcc -o myos myos.c
$ ./myos
My OS is Linux!

Now we can run the same code with systemd, but prohibit the application from using uname without changing or recompiling any code (we’re using systemd-run to create an ephemeral systemd service unit for us):

$ systemd-run --user --pty --same-dir --wait --collect --service-type=exec --property="SystemCallFilter=~uname" ./myos
Running as unit: run-u0.service
Press ^] three times within 1s to disconnect TTY.
Finished with result: signal
Main processes terminated with: code=killed/status=SYS
Service runtime: 6ms

We don’t see the normal My OS is Linux! output anymore, and systemd conveniently tells us that the managed process was terminated with a SIGSYS signal. We can even go further and use another directive, SystemCallErrorNumber=, to configure our seccomp policy not to terminate the application, but to return an error code instead, as in our raw seccomp example earlier:

$ systemd-run --user --pty --same-dir --wait --collect --service-type=exec --property="SystemCallFilter=~uname" --property="SystemCallErrorNumber=ENETDOWN" ./myos
Running as unit: run-u2.service
Press ^] three times within 1s to disconnect TTY.
uname failed: Network is down
Finished with result: exit-code
Main processes terminated with: code=exited/status=1
Service runtime: 6ms

systemd small print

Great! We can now inject almost any seccomp policy into any process without the need to write any code or recompile the application. However, there is an interesting statement in the systemd documentation:

…Note that the execve, exit, exit_group, getrlimit, rt_sigreturn, sigreturn system calls and the system calls for querying time and sleeping are implicitly whitelisted and do not need to be listed explicitly…

Some system calls are implicitly allowed and we don’t have to list them. This is mostly related to the way systemd manages processes and injects the seccomp policy. We established earlier that a seccomp policy applies to the current process and its children. So, to inject the policy, systemd forks itself, calls seccomp in the forked process and then execs the forked process into the target application. That’s why always allowing the execve system call is necessary in the first place, because otherwise systemd cannot do its job as a service manager.
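
To make the fork-seccomp-exec pattern more concrete, here is a hedged sketch of a tiny launcher built on libseccomp. This is our own illustration, not systemd source code and not our toolkit: it installs a kill-on-violation rule for a single syscall and then executes the target program. Because the filter is installed before execve, execve itself must remain allowed for the launch to work at all.

launcher.c (link with -lseccomp):

#include <stdio.h>
#include <unistd.h>
#include <seccomp.h>

/* usage: ./launcher <denied-syscall> <program> [args...] */
int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <denied-syscall> <program> [args...]\n", argv[0]);
        return 1;
    }

    /* allow all syscalls by default, kill the process on the denied one */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (!ctx)
        return 1;

    if (seccomp_rule_add(ctx, SCMP_ACT_KILL,
                         seccomp_syscall_resolve_name(argv[1]), 0) ||
        seccomp_load(ctx)) {
        perror("failed to install seccomp filter");
        seccomp_release(ctx);
        return 1;
    }
    seccomp_release(ctx);

    /* the filter is inherited across execve, so the target runs sandboxed */
    execvp(argv[2], &argv[2]);
    perror("execvp failed");
    return 1;
}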

But what if we want to explicitly prohibit some of these system calls? If we continue with the execve as an example, that can actually be a dangerous system call most applications would want to prohibit. Seccomp is an effective tool to protect the code from arbitrary code execution exploits, remember? If a malicious actor takes over our code, most likely the first thing they will try is to get a shell (or replace our code with any other application which is easier to control) by directing our code to call execve with the desired binary. So, if our code does not need execve for its main functionality, it would be a good idea to prohibit it. Unfortunately, it is not possible with the systemd SystemCallFilter= approach…

Introducing Cloudflare sandbox

We really liked the “zero code seccomp” approach of systemd’s SystemCallFilter= directive, but were not satisfied with its limitations. We decided to take it one step further and make it possible to prohibit any system call in any process externally, without touching its source code, and came up with the Cloudflare sandbox. It’s a simple standalone toolkit consisting of a shared library and an executable. The shared library is intended for dynamically linked applications and the executable is for statically linked applications.

sandboxing dynamically linked executables

For dynamically linked executables it is possible to inject custom code into the process by utilizing the LD_PRELOAD environment variable. The libsandbox.so shared library from our toolkit also contains a so-called initialization routine, which should be executed before the main logic. This is how we make the target application sandbox itself:

  • LD_PRELOAD tells the dynamic loader to load our libsandbox.so as part of the application, when it starts
  • the runtime executes the initialization routine from the libsandbox.so before most of the main logic
  • our initialization routine configures the sandbox policy described in special environment variables
  • by the time the main application logic begins executing, the target process already has the configured seccomp policy enforced
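
As a rough illustration of the constructor trick, here is a hedged sketch of a minimal preloadable library; this is our own example, not the actual libsandbox.so implementation. It reads a single syscall name from the SECCOMP_SYSCALL_DENY environment variable used in the examples below, whereas the real toolkit also handles colon-separated lists and an allowlist variable (SECCOMP_SYSCALL_ALLOW).

minisandbox.c (build with gcc -shared -fPIC -o minisandbox.so minisandbox.c -lseccomp):

#include <stdio.h>
#include <stdlib.h>
#include <seccomp.h>

/* runs before the program's main() because of the constructor attribute */
__attribute__((constructor))
static void install_filter(void)
{
    const char *deny = getenv("SECCOMP_SYSCALL_DENY");
    if (!deny)
        return;

    /* allow everything by default, kill the process on the denied syscall */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (!ctx)
        return;

    int nr = seccomp_syscall_resolve_name(deny);
    if (nr != __NR_SCMP_ERROR &&
        !seccomp_rule_add(ctx, SCMP_ACT_KILL, nr, 0) &&
        !seccomp_load(ctx))
        fprintf(stderr, "adding %s to the process seccomp filter\n", deny);

    seccomp_release(ctx);
}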

Let’s see how it works with our myos toy tool. First, we need to make sure it is actually a dynamically linked application:

$ ldd ./myos
	linux-vdso.so.1 (0x00007ffd8e1e3000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f339ddfb000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f339dfcf000)

Yes, it is. Now, let’s prohibit it from using the uname system call with our toolkit:

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_DENY=uname ./myos
adding uname to the process seccomp filter
Bad system call

Yet again, we’ve managed to inject our desired seccomp policy into the myos application without modifying or recompiling it. The advantage of this approach is that it doesn’t have the shortcomings of systemd’s SystemCallFilter=, so we can block any system call (luckily Bash is a dynamically linked application as well):

$ /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
I will try to execve something...
Doing arbitrary code execution!!!
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_DENY=execve /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
adding execve to the process seccomp filter
I will try to execve something...
Bad system call

The only problem here is that we may accidentally forget to LD_PRELOAD our libsandbox.so library and potentially run unprotected. Also, as described in the man page, LD_PRELOAD has some limitations. We can overcome all these problems by making libsandbox.so a permanent part of our target application:

$ patchelf --add-needed /usr/lib/x86_64-linux-gnu/libsandbox.so ./myos
$ ldd ./myos
	linux-vdso.so.1 (0x00007fff835ae000)
	/usr/lib/x86_64-linux-gnu/libsandbox.so (0x00007fc4f55f2000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc4f5425000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fc4f5647000)

Again, we didn’t need access to the source code here; instead we patched the compiled binary. Now we can just configure our seccomp policy as before, without the need for LD_PRELOAD:

$ ./myos
My OS is Linux!
$ SECCOMP_SYSCALL_DENY=uname ./myos
adding uname to the process seccomp filter
Bad system call

sandboxing statically linked executables

The above method is quite convenient and easy, but it doesn’t work for statically linked executables:

$ gcc -static -o myos myos.c
$ ldd ./myos
	not a dynamic executable
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_DENY=uname ./myos
My OS is Linux!

This is because there is no dynamic loader involved in starting a statically linked executable, so LD_PRELOAD has no effect. For this case our toolkit contains a special application launcher, which will inject the seccomp rules similarly to the way systemd does it:

$ sandboxify ./myos
My OS is Linux!
$ SECCOMP_SYSCALL_DENY=uname sandboxify ./myos
adding uname to the process seccomp filter

Note that we don’t see the Bad system call shell message anymore, because our target executable is being started by the launcher instead of the shell directly. Unlike systemd however, we can use this launcher to block dangerous system calls, like execve, as well:

$ sandboxify /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
I will try to execve something...
Doing arbitrary code execution!!!
$ SECCOMP_SYSCALL_DENY=execve sandboxify /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
adding execve to the process seccomp filter
I will try to execve something...

sandboxify vs libsandbox.so

From the examples above you may notice that it is possible to use sandboxify with dynamically linked executables as well, so why even bother with libsandbox.so? The difference becomes visible when we move from the “denylist” policy used in most examples in this post to the preferred “allowlist” policy, where we explicitly allow only the system calls we need and prohibit everything else.

Let’s convert our toy application back into the dynamically-linked one and try to come up with the minimal list of allowed system calls it needs to function properly:

$ gcc -o myos myos.c
$ ldd ./myos
	linux-vdso.so.1 (0x00007ffe027f6000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4f1410a000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4f142de000)
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_ALLOW=exit_group:fstat:uname:write ./myos
adding exit_group to the process seccomp filter
adding fstat to the process seccomp filter
adding uname to the process seccomp filter
adding write to the process seccomp filter
My OS is Linux!

So we need to allow 4 system calls: exit_group:fstat:uname:write. This is the tightest “sandbox”, which still doesn’t break the application. If we remove any system call from this list, the application will terminate with the Bad system call message (try it yourself!).

If we use the same allowlist, but with the sandboxify launcher, things do not work anymore:

$ SECCOMP_SYSCALL_ALLOW=exit_group:fstat:uname:write sandboxify ./myos
adding exit_group to the process seccomp filter
adding fstat to the process seccomp filter
adding uname to the process seccomp filter
adding write to the process seccomp filter

The reason is that sandboxify and libsandbox.so inject seccomp rules at different stages of the process lifecycle. Consider, at a very high level, how a process starts up.

In a nutshell, every process has two runtime stages: the “runtime init” and the “main logic”. The main logic is basically the code located in the program’s main() function, plus whatever else the application developers put there. But the process usually needs to do some work before the code in the main() function can execute; we call this work the “runtime init”. Developers do not write this code directly; most of the time it is automatically generated by the compiler toolchain used to compile the source code.

To do its job, the “runtime init” stage uses a lot of different system calls, but most of them are not needed later in the “main logic” stage. If we’re using the “allowlist” approach for our sandboxing, it does not make sense to allow these system calls for the whole duration of the program if they are only used once at program init. This is where the difference between libsandbox.so and sandboxify comes from: libsandbox.so enforces the seccomp rules after the “runtime init” stage has already executed, so we don’t have to allow most system calls from that stage. sandboxify, on the other hand, enforces the policy before the “runtime init” stage, so we have to allow all the system calls from both stages, which usually results in a bigger allowlist and thus a wider attack surface.
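
A practical tip from our side (not from the original post): one way to discover which system calls a binary needs, in both stages, is to run it under strace and summarize the calls, then turn that list into the allowlist:

$ strace -f -c ./myos

The -c flag prints a per-syscall summary when the program exits, and -f follows any child processes the program creates.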

Going back to our toy myos example, here is the minimal list of all the system calls we need to allow to make the application work under our sandbox:

$ SECCOMP_SYSCALL_ALLOW=access:arch_prctl:brk:close:exit_group:fstat:mmap:mprotect:munmap:openat:read:uname:write sandboxify ./myos
adding access to the process seccomp filter
adding arch_prctl to the process seccomp filter
adding brk to the process seccomp filter
adding close to the process seccomp filter
adding exit_group to the process seccomp filter
adding fstat to the process seccomp filter
adding mmap to the process seccomp filter
adding mprotect to the process seccomp filter
adding munmap to the process seccomp filter
adding openat to the process seccomp filter
adding read to the process seccomp filter
adding uname to the process seccomp filter
adding write to the process seccomp filter
My OS is Linux!

That’s 13 syscalls vs 4 syscalls if we use the libsandbox.so approach!

Conclusions

In this post we discussed how to easily sandbox applications on Linux without the need to write any additional code. We introduced the Cloudflare sandbox toolkit and discussed the different approaches we take at sandboxing dynamically linked applications vs statically linked applications.

Having safer code online helps to build a better Internet, and we would be happy if you find our sandbox toolkit useful. We’re looking forward to feedback, improvements and other contributions!

CVE-2020-5902: Helping to protect against the F5 TMUI RCE vulnerability

Post Syndicated from Michael Tremante original https://blog.cloudflare.com/cve-2020-5902-helping-to-protect-against-the-f5-tmui-rce-vulnerability/

Cloudflare has deployed a new managed rule protecting customers against a remote code execution vulnerability that has been found in F5 BIG-IP’s web-based Traffic Management User Interface (TMUI). Any customer who has access to the Cloudflare Web Application Firewall (WAF) is automatically protected by the new rule (100315) that has a default action of BLOCK.

Initial testing on our network has shown that attackers started probing and trying to exploit this vulnerability starting on July 3.

F5 has published detailed instructions on how to patch affected devices, how to detect whether attempts have been made to exploit the vulnerability on a device, and how to add a custom mitigation. If you have an F5 device, read their detailed mitigations before reading the rest of this blog post.

The most popular probe URL appears to be /tmui/login.jsp/..;/tmui/locallb/workspace/fileRead.jsp followed by /tmui/login.jsp/..;/tmui/util/getTabSet.jsp, /tmui/login.jsp/..;/tmui/system/user/authproperties.jsp and /tmui/login.jsp/..;/tmui/locallb/workspace/tmshCmd.jsp. All contain the critical pattern ..; which is at the heart of the vulnerability.

On July 3 we saw O(1k) probes ramping to O(1m) yesterday. This is because simple test patterns have been added to scanning tools and small test programs made available by security researchers.

The Vulnerability

The vulnerability was disclosed by the vendor on July 1 and allows both authenticated and unauthenticated users to perform remote code execution (RCE).

Remote code execution is a type of code injection which gives the attacker the ability to run arbitrary code on the target application, allowing them, in scenarios such as this one, to gain privileged access and perform a full system takeover.

The vulnerability affects the administration interface only (the management dashboard), not the underlying data plane provided by the application.

How to Mitigate

If updating the application is not possible, the attack can be mitigated by blocking all requests that match the following regular expression in the URL:

.*\.\.;.*

The above regular expression matches two literal dot characters followed by a semicolon, anywhere within the URL.
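
As a quick local sanity check of the pattern (our own suggestion, not part of F5’s guidance), you can confirm it matches the probe URLs listed above with grep:

$ echo '/tmui/login.jsp/..;/tmui/locallb/workspace/fileRead.jsp' | grep -E '.*\.\.;.*'
/tmui/login.jsp/..;/tmui/locallb/workspace/fileRead.jsp

grep echoes the line because it matches; a URL without the ..; sequence produces no output.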

Customers who are using the Cloudflare WAF and have their F5 BIG-IP TMUI interface proxied behind Cloudflare are already automatically protected from this vulnerability by rule 100315. If you wish to turn off the rule or change the default action:

  1. Head over to the Cloudflare Firewall, then click on Managed Rules and follow the advanced link under the Cloudflare Managed Rule set,
  2. Search for rule ID: 100315,
  3. Select any appropriate action or disable the rule.

How to test HTTP/3 and QUIC with Firefox Nightly

Post Syndicated from Lucas Pardue original https://blog.cloudflare.com/how-to-test-http-3-and-quic-with-firefox-nightly/

HTTP/3 is the third major version of the Hypertext Transfer Protocol, which takes the bold step of moving away from TCP to the new transport protocol QUIC in order to provide performance and security improvements.

During Cloudflare’s Birthday Week 2019, we were delighted to announce that we had enabled QUIC and HTTP/3 support on the Cloudflare edge network. This was joined by support from Google Chrome and Mozilla Firefox, two of the leading browser vendors and partners in our effort to make the web faster and more reliable for all. A big part of developing new standards is interoperability, which typically means different people analysing, implementing and testing a written specification in order to prove that it is precise, unambiguous, and actually implementable.

At the time of our announcement, Chrome Canary had experimental HTTP/3 support and we were eagerly awaiting a release of Firefox Nightly. Now that Firefox supports HTTP/3 we thought we’d share some instructions to help you enable and test it yourselves.

How do I enable HTTP/3 for my domain?

Simply go to the Cloudflare dashboard and manually flip the switch in the “Network” tab.

Using Firefox Nightly as an HTTP/3 client

Firefox Nightly has experimental support for HTTP/3. In our experience things are pretty good, but you might hit some teething issues, so bear that in mind if you decide to enable and experiment with HTTP/3. If you’re happy with that responsibility, you’ll first need to download and install the latest Firefox Nightly build. Then open Firefox and enable HTTP/3 by visiting “about:config” and setting “network.http.http3.enabled” to true. There are some other parameters that can be tweaked but the defaults should suffice.

about:config can be filtered by using a search term like “http3”.

Once HTTP/3 is enabled, you can visit your site to test it out. A straightforward way to check whether HTTP/3 was negotiated is to check the Developer Tools “Protocol” column in the “Network” tab (on Windows and Linux the Developer Tools keyboard shortcut is Ctrl+Shift+I, on macOS it’s Command+Option+I). This “Protocol” column might not be visible at first, so to enable it right-click one of the column headers and check “Protocol”.

Then reload the page and you should see that “HTTP/3” is reported.

The aforementioned teething issues might cause HTTP/3 not to show up initially. When you enable HTTP/3 on a zone, we add a header field such as alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400 to all responses for that zone. Clients see this as an advertisement to try HTTP/3 and will take up the offer on the next request. To make this happen you can reload the page, but make sure that you bypass the local browser cache (via the “Disable Cache” checkbox, or the Shift-F5 key combo), or else you’ll just see the protocol used to fetch the resource the first time around. Finally, Firefox provides the “about:networking” page, which lists visited zones and the HTTP version that was used to load them; for example, this very blog.

about:networking contains a table of all visited zones and the connection properties.
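
A quick way to confirm the advertisement is being sent (our own suggestion; replace example.com with a hostname on your zone) is to inspect the response headers with curl:

$ curl -sI https://example.com/ | grep -i alt-svc

If HTTP/3 is enabled for the zone, this prints an alt-svc header similar to the one shown above.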

Sometimes browsers can stick to an existing HTTP connection and will refuse to start an HTTP/3 connection. This is hard to detect by eye, so sometimes the best option is to close the browser completely and reopen it. Finally, we’ve also seen some interactions with Service Workers that make it appear that a resource was fetched from the network using HTTP/1.1, when in fact it was fetched from the local Service Worker cache. In such cases, if you’re keen to see HTTP/3 in action, you’ll need to deregister the Service Worker. If you’re in doubt about what is happening on the network, it is often useful to verify things independently, for example by capturing a packet trace and dissecting it with Wireshark.

What’s next?

The QUIC Working Group recently announced a “Working Group Last Call”, which marks an important milestone in the continued maturity of the standards. From the announcement:

After more than three and a half years and substantial discussion, all 845 of the design issues raised against the QUIC protocol drafts have gained consensus or have a proposed resolution. In that time the protocol has been considerably transformed; it has become more secure, much more widely implemented, and has been shown to be interoperable. Both the Chairs and the Editors feel that it is ready to proceed in standardisation.

The coming months will see the specifications settle and we anticipate that implementations will continue to improve their QUIC and HTTP/3 support, eventually enabling it in their stable channels. We’re pleased to continue working with industry partners such as Mozilla to help build a better Internet together.

In the meantime, you might want to check out our guides to testing with other implementations such as Chrome Canary or curl. As compatibility becomes proven, implementations will shift towards optimizing their performance; you can read about Cloudflare’s efforts on comparing HTTP/3 to HTTP/2 and the work we’ve done to improve performance by adding support for CUBIC and HyStart++ to our congestion control module.

Setting up two-factor authentication on your Raspberry Pi

Post Syndicated from Alasdair Allan original https://www.raspberrypi.org/blog/setting-up-two-factor-authentication-on-your-raspberry-pi/

Enabling two-factor authentication (2FA) to boost security for your important accounts is becoming a lot more common these days. However, you might be surprised to learn that you can do the same with your Raspberry Pi. You can enable 2FA on your Raspberry Pi, and afterwards you’ll be challenged for a verification code when you access it remotely via Secure Shell (SSH).

Accessing your Raspberry Pi via SSH

A lot of people use a Raspberry Pi at home as a file or media server. This has become rather common with the launch of Raspberry Pi 4, which has both USB 3 and Gigabit Ethernet. However, when you’re setting up this sort of server you often want to run it “headless”: without a monitor, keyboard, or mouse. This is especially true if you intend to tuck your Raspberry Pi away behind your television, or somewhere else out of the way. In any case, it means that you are going to need to enable Secure Shell (SSH) for remote access.

However, it’s also pretty common to set up your server so that you can access your files when you’re away from home, making your Raspberry Pi accessible from the Internet.

Most of us aren’t going to be out of the house much for a while yet, but if you’re taking the time right now to build a file server, you might want to think about adding some extra security. Especially if you intend to make the server accessible from the Internet, you probably want to enable two-factor authentication (2FA) using Time-based One-Time Password (TOTP).

What is two-factor authentication?

Two-factor authentication is an extra layer of protection. As well as a password, “something you know,” you’ll need another piece of information to log in. This second factor will be based either on “something you have,” like a smart phone, or on “something you are,” like biometric information.

We’re going to go ahead and set up “something you have,” and use your smart phone as the second factor to protect your Raspberry Pi.

Updating the operating system

The first thing you should do is make sure your Raspberry Pi is up to date with the latest version of Raspbian. If you’re running a relatively recent version of the operating system you can do that from the command line:

$ sudo apt-get update
$ sudo apt-get full-upgrade

If you’re pulling your Raspberry Pi out of a drawer for the first time in a while, though, you might want to go as far as to install a new copy of Raspbian using the new Raspberry Pi Imager, so you know you’re working from a good image.

Enabling Secure Shell

The Raspbian operating system has the SSH server disabled on boot. However, since we’re intending to run the board without a monitor or keyboard, we need to enable it if we want to be able to SSH into our Raspberry Pi.

The easiest way to enable SSH is from the desktop. Go to the Raspbian menu and select “Preferences > Raspberry Pi Configuration”. Next, select the “Interfaces” tab and click on the radio button to enable SSH, then hit “OK.”

You can also enable it from the command line using systemctl:

$ sudo systemctl enable ssh
$ sudo systemctl start ssh

Alternatively, you can enable SSH using raspi-config, or, if you’re installing the operating system for the first time, you can enable SSH as you burn your SD Card.

Enabling challenge-response

Next, we need to tell the SSH daemon to enable “challenge-response” passwords. Go ahead and open the SSH config file:

$ sudo nano /etc/ssh/sshd_config

Enable challenge response by changing ChallengeResponseAuthentication from the default no to yes.

Editing /etc/ssh/sshd_config.
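
After the change, the relevant line in the file should read:

ChallengeResponseAuthentication yes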

Then restart the SSH daemon:

$ sudo systemctl restart ssh

It’s a good idea to open up a terminal on your laptop and make sure you can still SSH into your Raspberry Pi at this point, although you won’t be prompted for a 2FA code quite yet. It’s sensible to check that everything still works at this stage.

Installing two-factor authentication

The first thing you need to do is download an app to your phone that will generate the TOTP. One of the most commonly used is Google Authenticator. It’s available for Android, iOS, and Blackberry, and there is even an open source version of the app available on GitHub.

Google Authenticator in the App Store.

So go ahead and install Google Authenticator, or another 2FA app like Authy, on your phone. Afterwards, install the Google Authenticator PAM module on your Raspberry Pi:

$ sudo apt install libpam-google-authenticator

Now that we have 2FA installed on both our phone and our Raspberry Pi, we’re ready to get things configured.

Configuring two-factor authentication

You should now run Google Authenticator from the command line — without using sudo — on your Raspberry Pi in order to generate a QR code:

$ google-authenticator

Afterwards you’re probably going to have to resize the Terminal window so that the QR code is rendered correctly. Unfortunately, it’s just slightly wider than the standard 80 characters across.

The QR code generated by google-authenticator. Don’t worry, this isn’t the QR code for my key; I generated one just for this post that I didn’t use.

Don’t move forward quite yet! Before you do anything else you should copy the emergency codes and put them somewhere safe.

These codes will let you access your Raspberry Pi — and turn off 2FA — if you lose your phone. Without them, you won’t be able to SSH into your Raspberry Pi if you lose or break the device you’re using to authenticate.

Next, before we continue with Google Authenticator on the Raspberry Pi, open the Google Authenticator app on your phone and tap the plus sign (+) at the top right, then tap on “Scan barcode.”

Your phone will ask you whether you want to allow the app access to your camera; you should say “Yes.” The camera view will open. Position the barcode squarely in the green box on the screen.

Scanning the QR code with the Google Authenticator app.

As soon as your phone app recognises the QR code it will add your new account, and it will start generating TOTP codes automatically.

The TOTP in Google Authenticator app.

Your phone will generate a new one-time password every thirty seconds. However, this code isn’t going to be all that useful until we finish what we were doing on your Raspberry Pi. Switch back to your terminal window and answer “Y” when asked whether Google Authenticator should update your .google_authenticator file.

Then answer “Y” to disallowing multiple uses of the same authentication token, “N” to increasing the time skew window, and “Y” to rate limiting in order to protect against brute-force attacks.

You’re done here. Now all we have to do is enable 2FA.

Enabling two-factor authentication

We’re going to use Linux Pluggable Authentication Modules (PAM), which provides dynamic authentication support for applications and services, to add 2FA to SSH on Raspberry Pi.

Now we need to configure PAM to add 2FA:

$ sudo nano /etc/pam.d/sshd

Add auth required pam_google_authenticator.so to the top of the file. You can do this either above or below the line that says @include common-auth.

Editing /etc/pam.d/sshd.

As I prefer to be prompted for my verification code after entering my password, I’ve added this line after the @include line. If you want to be prompted for the code before entering your password you should add it before the @include line.
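
With the line added after @include common-auth, that part of the file looks like this:

@include common-auth
auth required pam_google_authenticator.so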

Now restart the SSH daemon:

$ sudo systemctl restart ssh

Next, open up a terminal window on your laptop and try to SSH into your Raspberry Pi.

Wrapping things up

If everything has gone to plan, when you SSH into the Raspberry Pi, you should be prompted for a TOTP after being prompted for your password.

SSH’ing into my Raspberry Pi.

You should go ahead and open Google Authenticator on your phone, and enter the six-digit code when prompted. Then you should be logged into your Raspberry Pi as normal.

You’ll now need your phone, and a TOTP, every time you ssh into, or scp to and from, your Raspberry Pi. But because of that, you’ve just given a huge boost to the security of your device.

Now that you have the Google Authenticator app on your phone, you should probably start enabling 2FA for your important services and sites, like Google, Twitter, Amazon, and others, since most bigger sites, and many smaller ones, now support two-factor authentication.

The post Setting up two-factor authentication on your Raspberry Pi appeared first on Raspberry Pi.

Enhancing site security with new Lightsail firewall features

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/enhancing-site-security-with-new-lightsail-firewall-features/

This post is contributed by Mike Coleman, AWS Senior Developer Advocate – Lightsail

Amazon Lightsail provides an easy way to get started with AWS for many customers. The service balances ease of use, security, and flexibility. The Lightsail firewall now offers additional features to help customers secure their Lightsail instances. This update offers three new capabilities:

  • The ability to specify source IP addresses for firewall rules
  • Explicitly allowing or disallowing remote access to instances via Lightsail’s web-based console
  • Support for PING

This blog explores each of these new features in detail, starting with source IP addresses.

Before this update, any open ports in the Lightsail firewall were open to the internet. In many cases, this is a reasonable approach. For example, for new WordPress servers, you likely need broad public access.

However, in some cases you want to restrict access to an instance. If you are staging a new website and it’s not ready for publication, you may want to limit access. One way to ensure that only certain people can visit the site is to only allow certain IP addresses to connect.

Another common use case is limiting remote access to an instance. With the new changes to the Lightsail firewall, you can now limit SSH or RDP access by source IP address. Additionally, you can enable or disable remote access via Lightsail’s built-in web client.

Access can be restricted from one or more IP addresses (for example, the IP address for your home computer) or a continuous range of IP addresses (such as the address range for your corporate network).

Next, I review how you configure these options to restrict remote access via SSH to a single source IP address.

Finding your IP address

Most computers do not have an internet routable IP address assigned. Internet routable IP addresses are scarce and usually assigned to your internet gateway device. The devices on the network are assigned private IP addresses. To communicate between the private IP network and the internet, the network router typically uses network address translation (NAT).

This tutorial assumes you are using NAT. This means the IP address used to restrict SSH access is the internet routable address of your network gateway device (usually your wireless router). Consequently, this allows access from any device on the network behind that IP address.

There are many ways to find your internet routable IP address. You can log into your network gateway device and find it there (consult your device’s user manual for more details). Alternatively, use one of several public services to determine your IP address – search online for “what is my IP” to list several options.

Restricting SSH access to a single IP address

  1. Start by creating a new Lightsail instance – you can select any blueprint.
  2. Once the instance state shows Running, choose the name of the instance to open the Instance details page.
  3. Choose Networking from the menu.
    networking tab
  4. Scroll down to find the current firewall settings. Under Allow connections from, it lists Any IP address for all of the applications. To change this, choose the edit icon for the SSH rule.
    IP address
  5. Check the box next to Restrict to IP address and enter your internet routable IP address under Source IP or range.
    restrict IP address
    Note: The next section shows how to restrict access from Lightsail’s browser-based SSH client. For now, leave the Allow Lightsail browser SSH box checked.
  6. Choose Save.
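
If you would rather script this change than click through the console, the rule can also be set with the AWS SDK. The sketch below uses boto3's Lightsail client; the cidrs and cidrListAliases fields reflect my reading of the updated API, and the instance name and address are placeholders, so verify the details against the current documentation:

import boto3

lightsail = boto3.client("lightsail")

lightsail.put_instance_public_ports(
    instanceName="firewall-test-instance",          # placeholder instance name
    portInfos=[{
        "fromPort": 22,
        "toPort": 22,
        "protocol": "tcp",
        "cidrs": ["203.0.113.10/32"],               # only this address may open SSH
        "cidrListAliases": ["lightsail-connect"],   # keep the browser-based SSH client allowed
    }],
)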

Now, SSH into your Lightsail instance from your local machine. You can learn more about how to connect to your Lightsail instance using SSH from our documentation.

You should be able to connect to your instance successfully. Next, verify that other addresses are blocked by restricting the rule to a different IP address and attempting to connect again:

  1. Edit the SSH firewall rule again, following the instructions above. This time, under IP or IP Range, enter 192.168.2.150.
  2. Choose Save.

Attempt to connect to your instance once more. The connection fails because your IP address does not match an IP address in the range.

Restricting access from the Lightsail browser-based SSH client

The browser-based SSH client makes it easy to access instances without needing to manage SSH keys locally. However, there may be cases where you must disable browser-based access.

To do this:

  1. Navigate to the firewall rules for the instance you created earlier.
  2. Choose the edit icon for the SSH rule.
    edit IP address
  3. Uncheck the box next to Allow Lightsail browser SSH. Choose Save.
  4. From the menu, choose Connect, then choose Connect using SSH. The browser window opens, but you are not connected to the instance.

PING Support

There is now support for PING, a command line utility used to check if a computer is reachable over the network. PING sends a packet to a remote computer, which sends a simple response back. Before this release, you could not PING Lightsail instances.

To activate this feature, add a firewall rule:

  1. Navigate to the networking page for your instance.
    add rule to firewall
  2. Under the firewall section, choose +Add rule.
  3. From the application list, choose PING (ICMP). Choose Save.
  4. From a terminal window on your local machine, send a ping command to your Lightsail instance’s IP address. You can find the IP address from the Connect tab of the instance details page or from the instance card on the Lightsail home page.
ping -c 5 192.168.2.143

You see a response similar to:

ping -c 5 192.168.2.143
PING 192.168.2.143 (192.168.2.143): 56 data bytes
64 bytes from 192.168.2.143: icmp_seq=0 ttl=54 time=19.383 ms
64 bytes from 192.168.2.143: icmp_seq=1 ttl=54 time=16.821 ms
64 bytes from 192.168.2.143: icmp_seq=2 ttl=54 time=16.363 ms
64 bytes from 192.168.2.143: icmp_seq=3 ttl=54 time=27.335 ms
64 bytes from 192.168.2.143: icmp_seq=4 ttl=54 time=19.429 ms

--- 192.168.2.143 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 16.363/19.866/27.335/3.943 ms

Conclusion

In this blog I covered how you can increase the security of your Lightsail instances by taking advantage of three new features: source IP restrictions, limiting access to the Lightsail browser-based SSH and RDP clients, and the addition of PING (ICMP) as an application type. These new features provide an extra level of flexibility and security when deploying applications on Lightsail.

To learn more about the Lightsail firewall, see the documentation. Additionally, there are Getting Started tutorials for Lightsail, including launching a LAMP stack application or .NET application.

Cloudflare Bot Management: machine learning and more

Post Syndicated from Alex Bocharov original https://blog.cloudflare.com/cloudflare-bot-management-machine-learning-and-more/

Cloudflare Bot Management: machine learning and more

Introduction

Cloudflare Bot Management: machine learning and more

Building the Cloudflare Bot Management platform is an exhilarating experience. It blends Distributed Systems, Web Development, Machine Learning, Security and Research (and every discipline in between), all while fighting ever-adaptive and motivated adversaries.

This is the ongoing story of Bot Management at Cloudflare and also an introduction to a series of blog posts about the detection mechanisms powering it. I’ll start with several definitions from the Bot Management world, then introduce the product and technical requirements, leading to an overview of the platform we’ve built. Finally, I’ll share details about the detection mechanisms powering our platform.

Let’s start with Bot Management’s nomenclature.

Some Definitions

Bot – an autonomous program on a network that can interact with computer systems or users, imitating or replacing a human user’s behavior, performing repetitive tasks much faster than human users could.

Good bots – bots which are useful to businesses they interact with, e.g. search engine bots like Googlebot, Bingbot or bots that operate on social media platforms like Facebook Bot.

Bad bots – bots which are designed to perform malicious actions, ultimately hurting businesses, e.g. credential stuffing bots, third-party scraping bots, spam bots and sneakerbots.

Cloudflare Bot Management: machine learning and more

Bot Management – blocking undesired or malicious Internet bot traffic while still allowing useful bots to access web properties by detecting bot activity, discerning between desirable and undesirable bot behavior, and identifying the sources of the undesirable activity.

WAF (Web Application Firewall) – a security system that monitors and controls HTTP traffic to and from a web application based on a set of security rules.

Gathering requirements

Cloudflare has been stopping malicious bots from accessing websites or misusing APIs from the very beginning, at the same time helping the climate by offsetting the carbon costs from the bots. Over time it became clear that we needed a dedicated platform which would unite different bot fighting techniques and streamline the customer experience. In designing this new platform, we tried to fulfill the following key requirements.

  • Complete, not complex – customers can turn on/off Bot Management with a single click of a button, to protect their websites, mobile applications, or APIs.
  • Trustworthy – customers want to know whether they can trust that a website visitor is who they claim to be, and to have an indicator of how certain that trust level is.
  • Flexible – customers should be able to define what subset of the traffic Bot Management mitigations should be applied to, e.g. only login URLs, pricing pages or sitewide.
  • Accurate – Bot Management detections should have a very small error rate, e.g. none or very few human visitors should ever be mistakenly identified as bots.
  • Recoverable – in case a wrong prediction is made, human visitors should still be able to access websites, and good bots should still be let through.

Moreover, the goal for the new Bot Management product was to make it work well for the following use cases:

Cloudflare Bot Management: machine learning and more

Technical requirements

In addition to the product requirements above, we engineers had a list of must-haves for the new Bot Management platform. The most critical were:

  • Scalability – the platform should be able to calculate a score on every request, even at over 10 million requests per second.
  • Low latency – detections must be performed extremely quickly, not slowing down request processing by more than 100 microseconds, and not requiring additional hardware.
  • Configurability – it should be possible to configure what detections are applied on what traffic, including on per domain/data center/server level.
  • Modifiability – the platform should be easily extensible with more detection mechanisms, different mitigation actions, richer analytics and logs.
  • Security – no sensitive information from one customer should be used to build models that protect another customer.
  • Explainability & debuggability – we should be able to explain and tune predictions in an intuitive way.

Equipped with these requirements, back in 2018, our small team of engineers got to work to design and build the next generation of Cloudflare Bot Management.

Meet the Score

“Simplicity is the ultimate sophistication.”
– Leonardo Da Vinci

Cloudflare operates on a vast scale. At the time of this writing, this means covering 26M+ Internet properties, processing on average 11M requests per second (with peaks over 14M), and examining more than 250 request attributes from different protocol levels. The key question is how to harness the power of such “gargantuan” data to protect all of our customers from modern day cyberthreats in a simple, reliable and explainable way?

Bot management is hard. Some bots are much harder to detect, requiring us to look at multiple dimensions of request attributes over a long time, while sometimes a single request attribute is enough to give a bot away. More signals may help, but are they generalizable?

When we classify traffic, should customers decide what to do with it or are there decisions we can make on behalf of the customer? What concept could possibly address all these uncertainty problems and also help us to deliver on the requirements from above?

As you might’ve guessed from the section title, we came up with the concept of the Trusted Score, or simply The Score – one thing to rule them all – a value between 0 and 100 indicating the likelihood that a request originated from a human (high score) vs. an automated program (low score).

Cloudflare Bot Management: machine learning and more
“One Ring to rule them all” by idreamlikecrazy, used under CC BY / Desaturated from original

Okay, let’s imagine that we are able to assign such a score to every incoming HTTP/HTTPS request: what are we, or the customer, supposed to do with it? Maybe it’s enough to provide such a score in the logs. Customers could then analyze them on their end, find the most frequent IPs with the lowest scores, and then use the Cloudflare Firewall to block those IPs. Although useful, such a process would be manual, prone to error and, most importantly, it could not be done in real time to protect the customer’s Internet property.

Fortunately, around the same time we started working on this system, our colleagues from the Firewall team had just announced Firewall Rules. This new capability provided customers the ability to control requests in a flexible and intuitive way, inspired by the widely known Wireshark® language. Firewall rules supported a variety of request fields, and we thought – why not have the score be one of these fields? Customers could then write granular rules to block very specific attack types. That’s how the cf.bot_management.score field was born.

Having a score in the heart of Cloudflare Bot Management addressed multiple product and technical requirements with one strike – it’s simple, flexible, configurable, and it provides customers with telemetry about bots on a per request basis. Customers can adjust the score threshold in firewall rules, depending on their sensitivity to false positives/negatives. Additionally, this intuitive score allows us to extend our detection capabilities under the hood without customers needing to adjust any configuration.

So how can we produce this score and how hard is it? Let’s explore it in the following section.

Architecture overview

What is powering the Bot Management score? The short answer is a set of microservices. Building this platform we tried to re-use as many pipelines, databases and components as we could; however, many services had to be built from scratch. Let’s have a look at the overall architecture (this overly simplified version contains only the Bot Management related services):

Cloudflare Bot Management: machine learning and more

Core Bot Management services

In a nutshell, our systems process data received from the edge data centers, then produce and store the data required for our bot detection mechanisms, using the following technologies:

  • Databases & data stores – Kafka, ClickHouse, Postgres, Redis, Ceph.
  • Programming languages – Go, Rust, Python, Java, Bash.
  • Configuration & schema management – Salt, Quicksilver, Cap’n Proto.
  • Containerization – Docker, Kubernetes, Helm, Mesos/Marathon.

Each of these services is built with resilience, performance, observability and security in mind.

Edge Bot Management module

All bot detection mechanisms are applied on every request in real-time during the request processing stage in the Bot Management module running on every machine at Cloudflare’s edge locations. When a request comes in we extract and transform the required request attributes and feed them to our detection mechanisms. The Bot Management module produces the following output:

Firewall fields – the module exposes the following Bot Management fields to the Firewall:
  • cf.bot_management.score – an integer between 0 and 100 indicating the likelihood that a request originated from an automated program (low score) or a human (high score).
  • cf.bot_management.verified_bot – a boolean indicating whether the request comes from a Cloudflare whitelisted bot.
  • cf.bot_management.static_resource – a boolean indicating whether the request matches file extensions for many types of static resources.

Cookies – most notably it produces cf_bm, which helps manage incoming traffic that matches criteria associated with bots.

JS challenges – for some of our detections and customers we inject invisible JavaScript challenges, providing us with more signals for bot detection.

Detection logs – we log details about each applied detection, the features used and flags through our data pipelines to ClickHouse; some of these are used for analytics and customer logs, while others are used to debug and improve our models.

Once the Bot Management module has produced the required fields, the Firewall takes over the actual bot mitigation.

Firewall integration

The Cloudflare Firewall’s intuitive dashboard enables users to build powerful rules through easy clicks and also provides Terraform integration. Every request to the firewall is inspected against the rule engine. Suspicious requests can be blocked, challenged or logged as per the needs of the user while legitimate requests are routed to the destination, based on the score produced by the Bot Management module and the configured threshold.

Cloudflare Bot Management: machine learning and more

Firewall rules provide the following bot mitigation actions:

  • Log – records matching requests in the Cloudflare Logs provided to customers.
  • Bypass – allows customers to dynamically disable Cloudflare security features for a request.
  • Allow – matching requests are exempt from challenge and block actions triggered by other Firewall Rules.
  • Challenge (Captcha) – useful for ensuring that the visitor accessing the site is human, and not automated.
  • JS Challenge – useful for ensuring that bots and spam cannot access the requested resource; browsers, however, are free to satisfy the challenge automatically.
  • Block – matching requests are denied access to the site.

Our Firewall Analytics tool, powered by ClickHouse and the GraphQL API, enables customers to quickly identify and investigate security threats using an intuitive interface. In addition to analytics, we provide detailed logs on all bot-related activity using either the Logpull API and/or LogPush, which provides an easy way to get your logs to your cloud storage.

Cloudflare Workers integration

If a customer wants more flexibility in how to handle requests based on the score (for example, injecting new or modifying existing HTML page content, serving incorrect data to bots, or stalling certain requests), Cloudflare Workers provide a way to do that. For example, using this small code snippet, we can pass the score back to the origin server for more advanced real-time analysis or mitigation:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})
 
async function handleRequest(request) {
  request = new Request(request);
 
  request.headers.set("Cf-Bot-Score", request.cf.bot_management.score)
 
  return fetch(request);
}
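
On the origin side, a minimal sketch of acting on that header might look like the following. Flask is used purely for illustration, the header name comes from the Worker above, and the threshold is an example rather than a recommendation:

from flask import Flask, request, abort

app = Flask(__name__)

@app.route("/")
def index():
    # The Worker above forwards the Bot Management score in the Cf-Bot-Score header.
    score = int(request.headers.get("Cf-Bot-Score", "100"))
    if score < 30:        # illustrative threshold; tune to your tolerance for false positives
        abort(403)        # or serve decoy content, rate limit, log, etc.
    return "Hello, human!"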

Now let’s have a look into how a single score is produced using multiple detection mechanisms.

Detection mechanisms

Cloudflare Bot Management: machine learning and more

The Cloudflare Bot Management platform currently uses five complementary detection mechanisms, producing their own scores, which we combine to form the single score going to the Firewall. Most of the detection mechanisms are applied on every request, while some are enabled on a per customer basis to better fit their needs.

Cloudflare Bot Management: machine learning and more
Cloudflare Bot Management: machine learning and more

Having a score on every request for every customer has the following benefits:

  • Ease of onboarding – even before we enable Bot Management in active mode, we’re able to tell how well it’s going to work for the specific customer, including providing historical trends about bot activity.
  • Feedback loop – availability of the score on every request along with all features has tremendous value for continuous improvement of our detection mechanisms.
  • Ensures scaling – if we can compute a score for every request and every customer, it means that every Internet property behind Cloudflare is a potential Bot Management customer.
  • Global bot insights – Cloudflare is sitting in front of more than 26M+ Internet properties, which allows us to understand and react to the tectonic shifts happening in security and threat intelligence over time.

Globally, more than a third of the Internet traffic visible to Cloudflare comes from bad bots, while Bot Management customers see an even higher ratio of bad bots, at ~43%!

Cloudflare Bot Management: machine learning and more
Cloudflare Bot Management: machine learning and more

Let’s dive into specific detection mechanisms in chronological order of their integration with Cloudflare Bot Management.

Machine learning

The majority of decisions about the score are made using our machine learning models. These were also the first detection mechanisms to produce a score and to on-board customers back in 2018. The successful application of machine learning requires data high in Quantity, Diversity, and Quality, and thanks to both free and paid customers, Cloudflare has all three, enabling continuous learning and improvement of our models for all of our customers.

At the core of the machine learning detection mechanism is CatBoost  – a high-performance open source library for gradient boosting on decision trees. The choice of CatBoost was driven by the library’s outstanding capabilities:

  • Categorical features support – allowing us to train on even very high cardinality features.
  • Superior accuracy – allowing us to reduce overfitting by using a novel gradient-boosting scheme.
  • Inference speed – in our case it takes less than 50 microseconds to apply any of our models, making sure request processing stays extremely fast.
  • C and Rust API – most of our business logic on the edge is written using Lua, more specifically LuaJIT, so having a compatible FFI interface to be able to apply models is fantastic.

There are multiple CatBoost models running on Cloudflare’s edge in shadow mode on every request on every machine. One of the models runs in active mode and influences the final score going to the Firewall. All ML detection results and features are logged and recorded in ClickHouse for further analysis, model improvement, analytics and customer facing logs. We feed both categorical and numerical features into our models; these are extracted from request attributes, as well as from inter-request features built using those attributes, calculated and delivered by the Gagarin inter-request features platform.
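
To make the idea concrete, here is a toy sketch of gradient boosting over mixed categorical and numerical request features with the open source catboost package. The feature names, values and labels are made up for illustration and bear no relation to our actual models:

from catboost import CatBoostClassifier, Pool

# Illustrative request features only (user agent, protocol, requests per minute).
rows = [
    ["Mozilla/5.0 (Windows NT 10.0)", "HTTP/2", 12],
    ["Mozilla/5.0 (Macintosh)",       "HTTP/2", 7],
    ["curl/7.68.0",                   "HTTP/1.1", 850],
    ["python-requests/2.23.0",        "HTTP/1.1", 1200],
]
labels = [1, 1, 0, 0]        # 1 = human, 0 = automated
cat_idx = [0, 1]             # the first two columns are categorical

model = CatBoostClassifier(iterations=100, depth=4, verbose=False)
model.fit(Pool(rows, labels, cat_features=cat_idx))

candidate = Pool([["Go-http-client/1.1", "HTTP/1.1", 640]], cat_features=cat_idx)
score = int(model.predict_proba(candidate)[0][1] * 100)
print(score)                 # closer to 0 => likely automated, closer to 100 => likely human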

We’re able to deploy new ML models in a matter of seconds using an extremely reliable and performant Quicksilver configuration database. The same mechanism can be used to configure which version of an ML model should be run in active mode for a specific customer.

A deep dive into our machine learning detection mechanism deserves a blog post of its own, and it will cover how we train and validate our models on trillions of requests using GPUs, how model feature delivery and extraction works, and how we explain and debug model predictions both internally and externally.

Heuristics engine

Not all problems in the world are best solved with machine learning. We can tweak the ML models in various ways, but in certain cases they will likely underperform basic heuristics. Often the problems machine learning is trying to solve are not entirely new. When building the Bot Management solution it became apparent that sometimes a single attribute of the request could give a bot away. This means that we can create a set of simple rules capturing bots in a straightforward way, while also ensuring the lowest false positive rate.

The heuristics engine was the second detection mechanism integrated into the Cloudflare Bot Management platform in 2019, and it is also applied on every request. We have multiple heuristic types and hundreds of specific rules based on certain attributes of the request, some of which are very hard to spoof. When a request matches any of the heuristics, we assign the lowest possible score of 1.
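
As a purely illustrative example (not one of our actual rules), such a heuristic can be as small as a couple of conditions on request headers:

def heuristic_score(headers):
    """Return the lowest score (1) when a request trips an obvious giveaway."""
    ua = headers.get("User-Agent", "")
    # Example giveaway: a client claiming to be a modern browser but sending
    # no Accept-Language header at all.
    if ua.startswith("Mozilla/") and "Accept-Language" not in headers:
        return 1
    return None  # no heuristic matched; defer to the other detection mechanisms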

The engine has the following properties:

  • Speed – if ML model inference takes less than 50 microseconds per model, hundreds of heuristics can be applied in just under 20 microseconds!
  • Deployability – the heuristics engine allows us to add a new heuristic in a matter of seconds using Quicksilver, and it will be applied on every request.
  • Vast coverage – using a set of simple heuristics allows us to classify ~15% of global traffic and ~30% of Bot Management customers’ traffic as bots. Not too bad for a few if conditions, right?
  • Lowest false positives – because we’re very sure and conservative on the heuristics we add, this detection mechanism has the lowest FP rate among all detection mechanisms.
  • Labels for ML – because of this high certainty, we use requests classified with heuristics to train our ML models, which can then generalize the behavior learnt from heuristics and improve detection accuracy.

So heuristics gave us a lift when combined with machine learning, and they encoded a lot of our intuition about bots, which helped advance the Cloudflare Bot Management platform and allowed us to onboard more customers.

Behavioral analysis

Machine learning and heuristics detections provide tremendous value, but both require human input on the labels (basically a teacher to distinguish between right and wrong). While our supervised ML models can generalize well enough, even on novel threats similar to those we taught them on, we decided to go further. What if there were an approach which doesn’t require a teacher, but rather can learn to distinguish bad behavior from normal behavior?

Enter the behavioral analysis detection mechanism, initially developed in 2018 and integrated with the Bot Management platform in 2019. This is an unsupervised machine learning approach, which has the following properties:

  • Fitting specific customer needs – it’s automatically enabled for all Bot Management customers, calculating and analyzing normal visitor behavior over an extended period of time.
  • Detects bots never seen before – as it doesn’t use known bot labels, it can detect bots and anomalies from the normal behavior on specific customer’s website.
  • Harder to evade – anomalous behavior is often a direct result of the bot’s specific goal.

Please stay tuned for a more detailed blog about behavioral analysis models and the platform powering this incredible detection mechanism, protecting many of our customers from unseen attacks.

Verified bots

So far we’ve discussed how to detect bad bots and humans. What about good bots, some of which are extremely useful for the customer website? Is there a need for a dedicated detection mechanism, or is there something we could use from the previously described mechanisms? While the majority of good bot requests (e.g. Googlebot, Bingbot, LinkedInbot) already have a low score produced by other detection mechanisms, we also need a way to avoid accidental blocks of useful bots. That’s how the Firewall field cf.bot_management.verified_bot came into existence in 2019, allowing customers to decide for themselves whether they want to let all of the good bots through or restrict access to certain parts of the website.

The actual platform calculating the Verified Bot flag deserves a detailed blog of its own, but in a nutshell it has the following properties:

  • Validator based approach – we support multiple validation mechanisms, each of them allowing us to reliably confirm good bot identity by clustering a set of IPs.
  • Reverse DNS validator – performs a reverse DNS check to determine whether or not a bot’s IP address matches its alleged hostname.
  • ASN Block validator – similar to rDNS check, but performed on ASN block.
  • Downloader validator – collects good bot IPs from either text files or HTML pages hosted on bot owner sites.
  • Machine learning validator – uses an unsupervised learning algorithm, clustering good bot IPs which are not possible to validate through other means.
  • Bots Directory – a database with UI that stores and manages bots that pass through the Cloudflare network.
Cloudflare Bot Management: machine learning and more
Bots directory UI sample

Using multiple validation methods listed above, the Verified Bots detection mechanism identifies hundreds of unique good bot identities, belonging to different companies and categories.
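
As an illustration of the reverse DNS validator idea above, a forward-confirmed reverse DNS check can be sketched in a few lines; the hostname suffixes below are examples, not our actual allow list:

import socket

def passes_rdns_check(ip, allowed_suffixes=(".googlebot.com", ".google.com")):
    # Reverse-resolve the IP, check the claimed hostname, then forward-confirm
    # that the hostname really resolves back to the same IP.
    try:
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith(allowed_suffixes):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.error:
        return False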

JS fingerprinting

When it comes to Bot Management detection quality it’s all about the signal quality and quantity. All previously described detections use request attributes sent over the network and analyzed on the server side using different techniques. Are there more signals available, which can be extracted from the client to improve our detections?

As a matter of fact there are plenty, as every browser has unique implementation quirks. Every web browser’s graphics output, such as canvas rendering, depends on multiple layers, including hardware (GPU) and software (drivers, operating system rendering). This highly unique output allows precise differentiation between different browser/device types. Moreover, this is achievable without sacrificing website visitor privacy: it’s not a supercookie, and it cannot be used to track and identify individual users, but only to confirm that a request’s user agent matches the other telemetry gathered through the browser canvas API.

This detection mechanism is implemented as a challenge-response system, with the challenge injected into the webpage at Cloudflare’s edge. The challenge is then rendered in the background using the provided graphic instructions, and the result is sent back to Cloudflare for validation and further action, such as producing the score. There is a lot going on behind the scenes to make sure we get reliable results without sacrificing users’ privacy, while remaining tamper resistant to replay attacks. The system is currently in private beta and being evaluated for its effectiveness, and we already see very promising results. Stay tuned for this new detection mechanism becoming widely available and the blog post on how we’ve built it.

This concludes an overview of the five detection mechanisms we’ve built so far. It’s time to sum it all up!

Summary

Cloudflare has the unique ability to collect data from trillions of requests flowing through its network every week. With this data, Cloudflare is able to identify likely bot activity with Machine Learning, Heuristics, Behavioral Analysis, and other detection mechanisms. Cloudflare Bot Management integrates seamlessly with other Cloudflare products, such as WAF  and Workers.

Cloudflare Bot Management: machine learning and more

All this could not be possible without hard work across multiple teams! First of all thanks to everybody on the Bots Team for their tremendous efforts to make this platform come to life. Other Cloudflare teams, most notably: Firewall, Data, Solutions Engineering, Performance, SRE, helped us a lot to design, build and support this incredible platform.

Cloudflare Bot Management: machine learning and more
Bots team during Austin team summit 2019 hunting bots with axes 🙂

Lastly, there are more blogs from the Bots series coming soon, diving into internals of our detection mechanisms, so stay tuned for more exciting stories about Cloudflare Bot Management!

How Netflix brings safer and faster streaming experience to the living room on crowded networks…

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/how-netflix-brings-safer-and-faster-streaming-experience-to-the-living-room-on-crowded-networks-78b8de7f758c

How Netflix brings safer and faster streaming experience to the living room on crowded networks using TLS 1.3

By Sekwon Choi

At Netflix, we are obsessed with the best streaming experiences. We want playback to start instantly and to never stop unexpectedly in any network environment. We are also committed to protecting users’ privacy and service security without sacrificing any part of the playback experience.

To achieve that, we are efficiently using ABR (adaptive bitrate streaming) for a better playback experience, DRM (Digital Rights Management) to protect our service, and TLS (Transport Layer Security) to protect customer privacy and to create a safer streaming experience.

Netflix on consumer electronics devices such as TVs, set-top boxes and streaming sticks was until recently using TLS 1.2 for streaming traffic. Now we support TLS 1.3 for safer and faster experiences.

What is TLS?

For two parties to communicate securely, a secure channel is necessary. This needs to have the following three properties.

  • Authentication: Identity of the communicating party is verified.
  • Confidentiality: Data sent over the channel is only visible to the endpoints.
  • Integrity: Data sent over the channel cannot be modified by attackers without detection.

The TLS protocol is designed to provide a secure channel between two peers by providing tools and methods to achieve the above properties.

TLS 1.3

TLS 1.3 is the latest version of the Transport Layer Security protocol. It is simpler, more secure and more efficient than its predecessor.

Perfect Forward Secrecy

One thing we believe is very important at Netflix is providing PFS (Perfect Forward Secrecy).

PFS is a feature of the key exchange algorithm that assures that session keys will not be compromised, even if the server’s private key is compromised. By generating new keys for each session, PFS protects past sessions against the future compromise of secret keys.

TLS 1.2 supports key exchange algorithms with PFS, but it also allows key exchange algorithms that do not support PFS. Even with the previous version of TLS 1.2, Netflix has always selected a key exchange algorithm that provides PFS such as ECDHE (Elliptic Curve Diffie Hellman Ephemeral). TLS 1.3, however, enforces this concept even more by removing all the key exchange algorithms that do not provide PFS, such as static RSA.

Authenticated Encryption

For encryption, TLS 1.3 removes all weak ciphers and uses only Authenticated Encryption with Associated Data (AEAD). This assures the confidentiality, integrity, and authenticity of the data. We use AES Galois/Counter Mode, as it also provides good performance and high throughput.

Secure Handshake

While the above changes are important, the most important change in TLS 1.3 is perhaps its redesign of the handshake protocol.

The TLS 1.2 handshake was not designed to protect the integrity of the entire handshake. It protected only the part of the handshake after the cipher suite negotiation and this opened up the possibility of downgrade attacks which may allow the attackers to force the use of insecure cipher suites.

With TLS 1.3, the server signs the entire handshake including the cipher suite negotiation and thus prevents the attacker from downgrading the cipher suite.

Also in TLS 1.2, extensions were sent in the clear in the ServerHello. Now with TLS 1.3, even extensions are encrypted and all handshake messages after ServerHello are now encrypted.

Reduced Handshake

TLS 1.2 supports numerous key exchange algorithms, cipher suites and digital signatures, including weak and vulnerable ones. Therefore, it requires more messages to perform a handshake and two network round trips.

In contrast, the handshake in TLS 1.3 now requires only one round trip, with a simplified design and with all weak and vulnerable algorithms removed.

In addition, it has a new feature called 0-RTT, or TLS early data, for the resumed handshake. This allows an application to include application data with its initial handshake message, instead of having to wait until the handshake completes.

At Netflix, by the efficient resumption of the TLS session and careful use of 0-RTT for the streaming data, we can reduce the play delay.
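
As a quick way to check what a given server negotiates, Python's ssl module reports the protocol version and cipher of a completed handshake; the hostname below is only an example, and seeing TLS 1.3 also requires OpenSSL 1.1.1 or newer on the client:

import socket, ssl

hostname = "www.example.com"   # illustrative endpoint
ctx = ssl.create_default_context()

with socket.create_connection((hostname, 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
        print(tls.version())   # e.g. 'TLSv1.3' when both ends support it
        print(tls.cipher())    # the negotiated AEAD suite, e.g. TLS_AES_256_GCM_SHA384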

A/B Testing Result

We were pretty confident that TLS 1.3 would bring us better security from the analysis of its protocol composition, but we did not know how it would perform in the context of streaming.

Since TLS 1.3’s performance-related feature is the 0-RTT mode with the resumed handshake, our hypothesis was that TLS 1.3 would reduce play delay, as we are no longer required to wait for the handshake to finish and can instead issue the HTTP request for media data and receive the HTTP response earlier.

To see the actual performance of TLS 1.3 in the field, we performed an experiment with:

  • User accounts: half-million user accounts per cell.
  • Device type: mid-performance device with Quad ARM core @ 1.7GHz.
  • Control cell: TLS 1.2
  • Treatment cell: TLS 1.3

Play Delay

Play Delay is defined by how long it takes for playback to start. Below are the results of the play delay measured in the experiment. The results imply that on slower or congested networks, which can be represented by the quantiles of at least 0.75, TLS 1.3 achieves the largest gains, with improvements across all network conditions.

Below is the time series median play delay graph for this mid-performance device in the field. It also shows that playback starts earlier with TLS 1.3.

Media Rebuffer

At Netflix, we define a media rebuffer as a non-network originated rebuffer. It typically occurs when media data is not processed quickly enough by the device due to the high load on the CPU. Comparing the control cell with TLS 1.2, the experiment cell with TLS 1.3 showed about a 7.4% improvement in media rebuffers. This result implies that using TLS 1.3 with 0-RTT is more efficient and can reduce the CPU load.

Conclusion

From the security analysis, we are confident that TLS 1.3 improves communication security over TLS 1.2. From the field test, we are confident that TLS 1.3 provides us a better streaming experience.

At the time of writing this article, the Internet is experiencing higher than usual traffic and congestion. We believe saving even small amounts of data and round trips can be meaningful and even better if it also provides a more secure and efficient streaming experience.

Therefore, we have started deploying TLS 1.3 on newer consumer electronics devices and we are expecting even more devices to be deployed with TLS 1.3 capability in the near future.


How Netflix brings safer and faster streaming experience to the living room on crowded networks… was originally published in the Netflix TechBlog on Medium.

Is BGP Safe Yet? No. But we are tracking it carefully

Post Syndicated from Louis Poinsignon original https://blog.cloudflare.com/is-bgp-safe-yet-rpki-routing-security-initiative/

Is BGP Safe Yet? No. But we are tracking it carefully

BGP leaks and hijacks have been accepted as an unavoidable part of the Internet for far too long. We relied on protection at the upper layers like TLS and DNSSEC to ensure untampered delivery of packets, but a hijacked route often results in an unreachable IP address, which in turn results in an Internet outage.

The Internet is too vital to allow this known problem to continue any longer. It’s time networks prevented leaks and hijacks from having any impact. It’s time to make BGP safe. No more excuses.

Border Gateway Protocol (BGP), the protocol used to exchange routes, has existed and evolved since the 1980s, and security features have been added to it over the years. The most notable security addition is Resource Public Key Infrastructure (RPKI), a security framework for routing. It has been the subject of a few blog posts following our deployment in mid-2018.

Today, the industry considers RPKI mature enough for widespread use, with a sufficient ecosystem of software and tools, including tools we’ve written and open sourced. We have fully deployed Origin Validation on all our BGP sessions with our peers and signed our prefixes.

However, the Internet can only be safe if the major network operators deploy RPKI. Those networks have the ability to spread a leak or hijack far and wide and it’s vital that they take a part in stamping out the scourge of BGP problems whether inadvertent or deliberate.

Many networks, like AT&T and Telia, pioneered global deployments of RPKI in 2019. They were successfully followed by Cogent and NTT in 2020. Hundreds of networks of all sizes have done a tremendous job over the last few years, but there is still work to be done.

If we observe the customer-cones of the networks that have deployed RPKI, we see around 50% of the Internet is more protected against route leaks. That’s great, but it’s nothing like enough.

Is BGP Safe Yet? No. But we are tracking it carefully

Today, we are releasing isBGPSafeYet.com, a website to track deployments and filtering of invalid routes by the major networks.

We are hoping this will help the community, and we will crowdsource the information on the website. The source code is available on GitHub; we welcome suggestions and contributions.

We expect this initiative will make RPKI more accessible to everyone and ultimately will reduce the impact of route leaks. Share the message with your Internet Service Providers (ISPs), hosting providers, and transit networks to build a safer Internet.

Additionally, to monitor and test deployments, we decided to announce two bad prefixes from our 200+ data centers and via the 233+ Internet Exchange Points (IXPs) we are connected to:

  • 103.21.244.0/24
  • 2606:4700:7000::/48

Both these prefixes should be considered invalid and should not be routed by your provider if RPKI is implemented within their network. This makes it easy to demonstrate how far a bad route can go, and test whether RPKI is working in the real world.

Is BGP Safe Yet? No. But we are tracking it carefully
A Route Origin Authorization for 103.21.244.0/24 on rpki.cloudflare.com

In the test you can run on isBGPSafeYet.com, your browser will attempt to fetch two pages: the first one, valid.rpki.cloudflare.com, is behind an RPKI-valid prefix, and the second one, invalid.rpki.cloudflare.com, is behind an RPKI-invalid prefix.

The test has two outcomes:

  • If both pages were correctly fetched, your ISP accepted the invalid route. It does not implement RPKI.
  • If only valid.rpki.cloudflare.com was fetched, your ISP implements RPKI. You will be less sensitive to route leaks.
Is BGP Safe Yet? No. But we are tracking it carefully
a simple test of RPKI invalid reachability
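
The same check the website performs in the browser can be approximated from the command line. Here is a rough sketch with Python's standard library, using the two test hostnames above (timeouts and error handling kept minimal):

import urllib.request

def reachable(url, timeout=5):
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except Exception:
        return False

valid = reachable("https://valid.rpki.cloudflare.com")
invalid = reachable("https://invalid.rpki.cloudflare.com")

if valid and not invalid:
    print("Your ISP appears to drop RPKI-invalid routes.")
elif valid and invalid:
    print("Your ISP accepted the invalid route; it does not implement RPKI filtering.")
else:
    print("Could not reach the valid test page; result inconclusive.")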

We will be performing tests using those prefixes to check for propagation. Traceroutes and probing helped us in the past by creating visualizations of deployment.

A simple indicator is the number of networks sending the accepted route to their peers and collectors:

Is BGP Safe Yet? No. But we are tracking it carefully
Routing status from online route collection tool RIPE Stat

In December 2019, we released a Hilbert curve map of the IPv4 address space. Every pixel represents a /20 prefix. If a dot is yellow, the prefix responded only to the probe from an RPKI-valid IP space. If it is blue, the prefix responded to probes from both RPKI-valid and RPKI-invalid IP space.

To summarize, the yellow areas are IP space behind networks that drop RPKI invalid prefixes. The Internet isn’t safe until the blue becomes yellow.

Is BGP Safe Yet? No. But we are tracking it carefully
Hilbert Curve Map of IP address space behind networks filtering RPKI invalid prefixes

Last but not least, we would like to thank every network that has already deployed RPKI and every developer that contributed to validator-software code bases. The last two years have shown that the Internet can become safer, and we are looking forward to the day when we can call route leaks and hijacks an incident of the past.

Time-Based One-Time Passwords for Phone Support

Post Syndicated from Junade Ali original https://blog.cloudflare.com/time-based-one-time-passwords-for-phone-support/

Time-Based One-Time Passwords for Phone Support

Time-Based One-Time Passwords for Phone Support

As part of Cloudflare’s support offering, we provide phone support to Enterprise customers who are experiencing critical business issues.

For account security, specific account settings and sensitive details are not discussed via phone. From today, we are providing Enterprise customers with the ability to configure phone authentication, allowing greater support to be offered over the phone without the need to perform validation through support tickets.

After providing your email address to a Cloudflare Support representative, you can now provide a token generated from the Cloudflare dashboard or via a 2FA app like Google Authenticator. So, a customer is able to prove over the phone that they are who they say they are.

Configuring Phone Authentication

If you are an existing Enterprise customer interested in phone support, please contact your Customer Success Manager for eligibility information and set-up. If you are interested in our Enterprise offering, please get in contact via our Enterprise plan page.

If you already have phone support eligibility, you can generate single-use tokens from the Cloudflare dashboard or configure an authenticator app to do the same remotely.

On the support page, you will see a card called “Emergency Phone Support Hotline – Authentication”. From here you can generate a Single-Use Token for authenticating a single call or configure an Authenticator App to generate tokens from a 2FA app.

Time-Based One-Time Passwords for Phone Support

For more detailed instructions, please see the “Emergency Phone” section of the Contacting Cloudflare Support article on the Cloudflare Knowledge Base.

How it Works

A standardised approach for generating TOTPs (Time-Based One-Time Passwords) is described in RFC 6238 – this is the approach that is often used for setting up Two Factor Authentication on websites.

When configuring a TOTP authenticator app, you are usually asked to scan a QR code or input a long alphanumeric string. This is a randomly generated secret that is shared between your local authenticator app and the web service where you are configuring TOTP. After TOTP is configured, this secret is stored by both the web server and your local device.

TOTP password generation relies on two key inputs: the shared secret and the number of seconds since the Unix epoch (Unix time). The timestamp is integer-divided by a validity period (often 30 seconds) and this value is put into a cryptographic hash function alongside the secret to generate an output. The hexadecimal output is then truncated to provide the decimal digits which are shown to the user. The Avalanche Effect means that whenever the inputs to the hash function change even slightly (e.g. the timestamp increments), a completely different hash output is generated.
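
For illustration, that computation fits in a few lines of Python using only the standard library; SHA-1, 30-second steps and 6 digits are the common defaults described in RFC 6238:

import hmac, hashlib, struct, time

def totp(secret, at=None, time_step=30, digits=6):
    # Counter = seconds since the Unix epoch, integer-divided by the step length.
    counter = int(time.time() if at is None else at) // time_step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                      # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)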

This approach is fairly widely used and is available in a number of libraries, depending on your preferred programming language. However, as our phone validation functionality offers both authenticator app support and generation of a single-use token from the dashboard (where no shared secret exists), some deviation was required.

We generate a single use token by creating a hash of an internal user ID combined with a Cloudflare-internal secret, which in turn is used to generate RFC 6238 compliant time-based one-time passwords. Similarly, this service can generate random passwords for any user without needing to store additional secrets. This is then surfaced to the user every 30 seconds via a JavaScript request without exposing the secret used to generate the token.

Time-Based One-Time Passwords for Phone Support

One question you may be asking yourself after all of this is: why don’t we simply use the 2FA mechanism which users use to log in for phone validation too? Firstly, we don’t want to accustom users to providing their 2FA tokens to anyone else (they should purely be used for logging in). Secondly, as you may have noticed, we recently began supporting WebAuthn keys for logging in; as these are physical tokens used for website authentication, they aren’t suited to usage on a mobile device.

To improve user experience during a phone call, we also validate tokens in the previous time step in the event it has expired by the time the user has read it out (indeed, RFC 6238 provides that “at most one time step is allowed as the network delay”). This means a token can be valid for up to one minute.
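
Extending the earlier totp() sketch, accepting the previous time step is just a small wrapper; again this is illustrative, not our production code:

def verify(secret, submitted, time_step=30):
    now = time.time()
    # Check the current step and the one before it, so a token read out near
    # the end of its 30-second window is still honoured.
    return any(
        hmac.compare_digest(totp(secret, at=now - drift), submitted)
        for drift in (0, time_step)
    )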

The APIs powering this service are then wrapped with API gateways that offer audit logging both for customer actions and actions completed by staff members. This provides a clear audit trail for customer authentication.

Future Work

Authentication is a critical component to securing customer support interactions. Authentication tooling must develop alongside support contact channels: from web forms behind logins, to JWT tokens for validating live chat sessions, and now TOTP phone authentication. This is complemented by technical support engineers who manage risk by routing certain issues into traditional support tickets and referring some cases to named customer success managers for approval.

We are constantly advancing our support experience; for example, we plan to further improve our Enterprise Phone Support by giving users the ability to request a callback from a support agent within our dashboard. As always, right here on our blog we’ll keep you up-to-date with improvements in our service.

Offer of Assistance to Governments During COVID-19

Post Syndicated from Jocelyn Woolbright original https://blog.cloudflare.com/covid-19-government-assistance/

Offer of Assistance to Governments During COVID-19

Offer of Assistance to Governments During COVID-19

As the COVID-19 emergency continues to affect countries and territories around the world, the Internet has been a key factor in providing information to the public. As businesses, organizations and government agencies adjust to this new normal, we recognize the strain that this pandemic has put on the groups working to assist in virus mitigation and provide accurate information to the general public on the state of the pandemic.

At Cloudflare, this means ensuring that these entities have the necessary tools and resources available to them in these extenuating circumstances. On March 13, we announced that our Cloudflare for Teams products will be free until September 1, 2020, to ensure Cloudflare users and prospective users have the tools they need to support secure and efficient remote work. Additionally, we have removed usage caps for existing Cloudflare for Teams users and are also providing onboarding sessions so these groups can continue business in this new normal.

As a company, we believe we can do more and have been thinking about ways we can support organizations and businesses that are at the forefront of the pandemic such as health officials and those providing relief to the public. Many organizations have reached out to us with COVID-19 related initiatives including the creation of symptom tracking websites, medical resource donations, and websites focused on providing updates on COVID-19 cases in specific regions.

During this time, we have seen an increase in applications for Project Galileo, an initiative we started in 2014 to provide free services to organizations on the Internet including humanitarian organizations, media sites and voices of political dissent. Project Galileo was started to ensure these groups stay online, as they are repeatedly targeted due to the work they do. Since March 16, we have seen a 40% increase in applications for the project of organizations related to COVID-19 relief efforts and information. We are happy to assist other organizations that have started initiatives such as these with ensuring the accessibility and resilience of their web infrastructure and internal team.

Offer of Assistance to Governments During COVID-19

Risks faced to Government Agencies Web Infrastructure due to COVID-19 pandemic

As COVID-19 has disrupted our lives, the Internet has allowed many aspects of our life to adapt and carry on. From health care, to academia, to sales, a working Internet infrastructure is essential for business continuity and the dissemination of information. At Cloudflare, we’ve witnessed the effects of this transition to online interaction. In the last two months, we have seen both a massive increase in Internet traffic and a shift in the type of content users access online. Government agencies have seen a 100% increase in traffic to their websites during the pandemic.

Offer of Assistance to Governments During COVID-19

This unexpected shift in traffic patterns can come with a cost. Essential websites that provide crucial information and updates on this pandemic may not have configured their systems to handle the massive surges in traffic they are currently seeing. Government agencies providing essential health information to citizens on the COVID-19 pandemic have temporarily gone offline due to increased traffic. We’ve also seen public service announcement sites and local government sites providing unemployment resources that were unable to serve their traffic. In New Jersey, New York and Ohio, websites that provide unemployment benefits and health insurance options for people who have recently been laid off have crashed due to large amounts of traffic and unprecedented demand.

Offer of Assistance to Governments During COVID-19
To help process claims for unemployment benefits, New Jersey’s Department of Labor & Workforce Development has created a schedule for applicants.

During the spread of COVID-19, government agencies have also experienced cyberattacks.

The Australian government’s digital platform for providing welfare services for Australian citizens, known as Mygov, was slow and inaccessible for a short period of time. Although a DDoS attack was suspected, the problems were actually the result of 95,000 legitimate requests to access unemployment benefits, as the country recently doubled these benefits to help those impacted by the pandemic.

COVID-19 Government Package

Cloudflare has helped improve the security and performance of many vulnerable entities on the Internet with Project Galileo and ensured the security of government election agencies with the Athenian Project. Our services are designed not only to prevent malicious actors from disrupting a website, but also to absorb large influxes of legitimate traffic. In light of recent events, we want to help state and local government agencies stay online and provide essential information to the public without worrying that their site can be taken down by malicious or unexpected spikes in traffic.

Therefore, we are excited to provide a free package of services to state and local governments worldwide until September 1, 2020, to ensure they have the tools needed to secure their web infrastructure and internal teams.

This package of free services includes the following features:

  • Cloudflare Business Level services: Includes unmetered mitigation of DDoS attacks, web application firewall (WAF) with up to 25 custom rulesets, and ability to upload custom SSL certificates.
  • Rate limiting: Rate Limiting allows users to rate limit, shape or block traffic based on the rate of requests per client IP address, cookie, authentication token, or other attributes of the request.
  • Cloudflare for Teams: A suite of tools to help ensure that those working from home can maintain business continuity.
    • Access: To ensure the security of internal teams, Cloudflare Access allows organizations to secure, authenticate, and monitor user access to any domain, application, or path on Cloudflare, without using a VPN.
    • Gateway: Uses DNS filtering to help protect users from phishing scams or malware sites at multiple locations.

To apply for our COVID-19 government assistance initiative, please visit our website at https://www.cloudflare.com/governmentagency/.

We are also making this offer available for Cloudflare channel partners around the world to help support government agencies in their respective countries during this challenging time for the global community.  If you are a partner and would like information on how to provide Cloudflare for Teams, a Business Plan and Rate Limiting at no charge, please contact your Cloudflare Partner Representative or email [email protected].

What’s Next

The news of COVID-19 has transformed every part of our lives. During this difficult time, the Internet has allowed us to stay connected with friends, family, and provide resources to those in need. At Cloudflare, we are committed to helping businesses, organizations and government agencies stay online to ensure that everyone has access to authoritative information.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

Post Syndicated from Omer Yoachimik original https://blog.cloudflare.com/rolling-with-the-punches-shifting-attack-tactics-dropping-packets-faster-cheaper-at-the-edge/

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

On Cloudflare’s 8th birthday in 2017, we announced free unmetered DDoS Protection as part of all of our plans, regardless of whether you’re an independent blogger using WordPress on Cloudflare’s Free plan or part of a large enterprise operating global network infrastructures. Our DDoS protection covers attack vectors on Layers 3-7, whether highly distributed and volumetric (rate-intensive) or small and sneaky. We protect over 26 million Internet properties, and at this scale, identifying small and sneaky DDoS attacks can be challenging, especially at L7. In this post, we discuss this challenge along with trends that we’ve seen, interesting DDoS attacks, and how we’ve responded to them so that you don’t have to worry.

When analyzing attacks on the Cloudflare network, we’ve seen a steady decline in the proportion of L3/L4 DDoS attacks that exceed a rate of 30 Gbps in recent months. From September 2019 to March 2020, attacks peaking over 30 Gbps decreased by 82%, and in March 2020, more than 95% of all network-layer DDoS attacks peaked below 30 Gbps. Over the same time period, the average size of a DDoS attack has also steadily decreased by 53%, to just 11.88 Gbps. Yet, very large attacks have not disappeared: we’re still seeing attacks with intensive rates peaking at 330 Gbps on average and up to 400 million packets per second. Some of our customers are being targeted with as many as 890 DDoS attacks in a single day and 1,750 DDoS attacks in a month.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

As the average rate of these L3/L4 attacks has decreased, they have become more localized and less geographically distributed. Increasingly, we’re seeing attacks hit just one or two of our data centers, which means that these hyper-localized attacks were launched in the catchment of the data center; otherwise, our Anycast network would have spread the attack surface across our global fleet of data centers. Counterintuitively, these hyper-localized floods can be more difficult to detect on a global scale as the attack samples get diluted when aggregated from all of our data centers in the core. Therefore we’ve had to change our tactics and systems to roll with the change in attacker behavior.

Keeping things interesting in the penthouse floor of the OSI Model, over the same time period we’ve also observed some of the most rate-intensive and highly distributed L7 HTTP DDoS attacks we’ve ever seen. These attacks have pushed our engineering teams to invent even more efficient and intelligent ways to defend our network and our customers at scale. Let’s take a look at some of these trends and attacks.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge
Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

Centrally Analyzed, Edge Enforced DDoS Mitigations

Before we released dosd late last year, the primary automated system responsible for protecting Cloudflare and our customers against distributed rate-intensive attacks was Gatebot. Gatebot works by ingesting samples of flow data from routers and samples of HTTP requests from servers. It then analyzes these samples for anomalies, and when attacks are detected, pushes mitigation instructions automatically to the edge.

Gatebot requires a lot of computational power to analyze these samples, and correlate them across all the data centers, so it runs centrally in our “core” data centers, rather than at the edge. It does a terrific job at mitigating large attacks, and on average stops over 4,000 L3/L4 DDoS attacks every month.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

Edge Analyzed, Edge Enforced Mitigations

The persistent increase we’ve observed in smaller, more localized attacks was one of the main factors that drove us to develop a new, complementary system to Gatebot. We call this new system our denial of service daemon, or “dosd”, and this past month alone it mitigated 281,746 L3/4 DDoS attacks. This figure is roughly 6 times greater than what Gatebot dropped over the same period, thanks to dosd’s ability to detect smaller network attacks that would previously have flown under the radar (or taken longer to mitigate).

To complement the computationally heavy, centralized deployments of Gatebot, dosd was architected as a decentralized system that runs on every single server in every one of our data centers. Each instance detects and mitigates attacks independently of the other instances, without relying on any sort of centralized data center. As a result, the system is much faster than Gatebot, and can detect and mitigate attacks within 0-3 seconds (and less than 10 seconds on average). The speed of dosd enables it to generate real-time rules to quickly protect our customers at the data center. Then Gatebot, which samples traffic globally, can determine a mitigation that applies to all data centers if needed. In such a case, Gatebot will push rules to the data centers which will take priority over dosd’s rules.
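To make the idea concrete, here is a minimal sketch of the kind of local, per-server detection loop a decentralized system like dosd might run. The thresholds, packet signatures, and rule format are illustrative assumptions, not Cloudflare's actual implementation.

```python
from collections import Counter

# Illustrative thresholds only; not dosd's real values.
PACKET_RATE_THRESHOLD = 500_000   # packets/sec per (protocol, dst_port, flags) bucket
WINDOW_SECONDS = 1

def detect_floods(packet_samples):
    """Group sampled packets into coarse signatures and flag any bucket whose
    estimated rate exceeds the threshold. Each sample is a
    (protocol, dst_port, tcp_flags) tuple taken within the last window."""
    buckets = Counter(packet_samples)
    rules = []
    for signature, count in buckets.items():
        estimated_rate = count / WINDOW_SECONDS
        if estimated_rate > PACKET_RATE_THRESHOLD:
            # In a real system this would become an ephemeral drop rule
            # installed locally on this server or data center only.
            rules.append({"match": signature, "action": "drop", "ttl_seconds": 60})
    return rules

# Example: a synthetic SYN flood dominating the sampling window.
samples = [("tcp", 443, "SYN")] * 600_000 + [("tcp", 443, "ACK")] * 1_000
print(detect_floods(samples))
```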

dosd is also a leaner piece of software, consumes less memory and CPU, and significantly improves the resiliency of our network by removing the need to communicate with our core data centers to mitigate attacks. dosd detects and mitigates attacks using a similar logic to Gatebot’s methods, but in the scope of a single server, across a subset of servers in the same data center, or even across the entire data center.

Our automated Gatebot system is also tasked with mitigating L7 HTTP floods using request attributes as anomaly indicators. Mitigations can come in the form of actions such as JavaScript challenges, CAPTCHAs, Rate Limits (429), or Blocks (403) which are served back to the client as an error or challenge page. This form of mitigation at L7 allows the request to pass through TCP and TLS to the HTTP web server. During very rate-intensive attacks our servers can waste a lot of CPU and bandwidth as seen in the attack examples below.

Example #1 – Highly Distributed DDoS Attack Targeting A Customer Website

In July 2019, Cloudflare mitigated an HTTP DDoS attack that peaked at 1.4M requests per second. While this isn’t the most rate-intensive attack that we’ve seen, what is interesting is that the attack originated from almost 1.1M unique IP addresses. These were actual clients with the ability to complete a TCP and HTTPS handshake; they were not spoofed IP addresses. As it turns out, responding (rather than dropping at the network level) to over a million clients at a max rate of 1.4M requests per second can be quite costly.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

Example #2 – Rate-Intensive DDoS Attack Targeting A Customer Website

The second attack took place in September 2019. We mitigated an HTTP DDoS attack that peaked and persisted just below 5M requests per second for a little over an hour. What’s interesting is the sustained capability of the attacker to reach those rates from only 371K unique IPs (also not spoofed).

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

These attacks highlighted what needed to be optimized. They drove us to improve our L7 mitigations even further and to significantly reduce the cost of mitigating an attack.

Using IP Jails to Reduce the Cost of Mitigation

With the goal of reducing the computational cost to Cloudflare of mitigating rate-intensive attacks, we recently rolled out a new Gatebot capability called IP Jails. IP Jails excels at efficiently mitigating extremely rate-intensive and distributed HTTP DDoS attacks. It is triggered when an attack exceeds a certain request rate and then pushes the mitigation from the application layer (L7 in the OSI model) to the transport layer (L4). Therefore, instead of responding with an error or challenge page from the proxy, we simply drop the connection for that IP. Mitigating at L4 is more computationally efficient: it reduces our CPU and memory consumption in addition to saving bandwidth. It allows us to keep mitigating the largest of attacks without sacrificing performance.
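As a rough illustration of the idea (not Cloudflare's actual code), the sketch below shows how a per-IP request-rate threshold could flip an offending IP from L7 handling to an L4 drop list; the limit and jail duration are made-up values.

```python
import time
from collections import defaultdict

# Illustrative values only; not Cloudflare's real IP Jails thresholds.
REQUESTS_PER_SECOND_LIMIT = 1000
JAIL_SECONDS = 300

request_counts = defaultdict(int)   # per-IP request count in the current window
l4_drop_set = {}                    # ip -> expiry time; consulted before any L7 work

def on_http_request(ip, now):
    """Return the action for this request: serve it normally, or (once the IP
    is 'jailed') drop the connection at L4 before TLS or proxying happens."""
    if ip in l4_drop_set and l4_drop_set[ip] > now:
        return "drop_at_l4"          # no TLS handshake, no proxy CPU, no response bytes
    request_counts[ip] += 1
    if request_counts[ip] > REQUESTS_PER_SECOND_LIMIT:
        l4_drop_set[ip] = now + JAIL_SECONDS
        return "drop_at_l4"
    return "serve"

now = time.time()
for _ in range(1500):
    action = on_http_request("203.0.113.7", now)
print(action)  # "drop_at_l4" once the per-IP limit is exceeded
```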

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

IP Jails in action

In the first graph below, you can see an HTTP flood peaking just below 8M rps before the IPs are ‘jailed’ for misbehaving. In the second graph, you can see that same attack being dropped as packets at L4.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge
Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

The flood requests generated over 130 Gbps in responses. IP Jails slashed it by a factor of 10.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

Similarly, you can see a spike in the attack mitigation CPU usage which then drops back to normal after IP Jails kicks in.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

Using Origin Errors to Catch Low-Rate Attacks

We see one or two of these rate-intensive attacks every month. But the vast majority of attacks we observe are mostly of a lower request rate, trying to sneak under the radar. To tackle these low-rate attacks better, last month we completed the rollout of a new capability that synchronizes Gatebot’s detection sensitivity with our customers’ origin server health. Gatebot uses the origin’s error response codes as an additional adaptive feedback signal.

When we take a step back and think about what a DDoS attack actually is, we usually think of a malicious actor who targets traffic at a specific website or IP address with the intent to degrade performance or cause an outage. However, malicious attackers are not the only threat to your application’s availability.

As the migration of functionality to the edge increases, the cloud becomes smarter and more powerful, which often allows administrators to scale down their origin servers and infrastructure, leaving the origin server weaker and under-configured. There are many cases where an origin was taken down by small floods of traffic that were not malicious at all. These floods may be generated by an overly excited good bot or even faulty client applications calling home too frequently. Fixing a home-sick client application or strengthening a server can be a lengthy and costly process, during which the origin remains susceptible. Consequently, if a website is taken offline, no matter the reason, the end-users still experience it as if it were an attack.

Therefore this new capability not only protects our customers against DDoS attacks, but also protects the origin against all kinds of unwanted floods. It is designed to protect every one of our customers; big or small. It’s available on all of our plans including the Free plan.

When an origin responds to Cloudflare with an increasing rate of errors from the 500 range (Internal Server Error), Gatebot automatically kicks in and analyzes traffic to reduce or eliminate the impact on the origin even faster than before. The current error rate is also compared to the average error rate to minimize false positives. Once an attack is detected, dynamically generated, ephemeral mitigation rules are propagated to Cloudflare’s edge data centers to mitigate the flood. Mitigation rules may use a block action (403), rate-limit (429), or even a challenge, based on the fingerprint logic and confidence.
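The core of that feedback signal can be sketched in a few lines: compare the current 5xx rate against a rolling baseline and only escalate when both the absolute level and the deviation are significant. The thresholds below are invented for illustration and are not Gatebot's real parameters.

```python
from statistics import mean

# Illustrative origin-health signal: escalate only when the current rate of
# 5xx responses is both high in absolute terms and well above the baseline.
ABSOLUTE_5XX_RATE = 0.20      # at least 20% of responses are 5xx
BASELINE_MULTIPLIER = 4       # and at least 4x the recent baseline

def origin_under_stress(recent_5xx_rates, current_5xx_rate):
    baseline = mean(recent_5xx_rates) if recent_5xx_rates else 0.0
    if current_5xx_rate < ABSOLUTE_5XX_RATE:
        return False
    return current_5xx_rate > max(baseline * BASELINE_MULTIPLIER, ABSOLUTE_5XX_RATE)

# Normal day: ~1% errors. A sudden jump to 35% looks like an origin flood.
print(origin_under_stress([0.01, 0.02, 0.01, 0.015], 0.35))  # True
print(origin_under_stress([0.01, 0.02, 0.01, 0.015], 0.03))  # False
```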

In March 2020, we mitigated 812 HTTP DDoS attacks on average every day, and approximately 20,000 HTTP DDoS attacks in total.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

Don’t Take Our Word For It, See For Yourself

Whether it’s Gatebot or dosd that mitigated L3/4 DDoS attacks, you can see both types of attack events for yourself in our new Network Analytics dashboard.

Rolling With The Punches: Shifting Attack Tactics & Dropping Packets Faster & Cheaper At The Edge

Today this dashboard provides Magic Transit & BYOIP customers real-time visibility into L3/4 traffic and DDoS attacks, and in the future we plan to expand access to customers of our other products.

Visibility into L7 DDoS attacks is available to our WAF/CDN customers that have access to the Firewall Analytics dashboard.

Unmetered DDoS Protection For All

Whether you’re part of a large global enterprise, or use Cloudflare for your personal site on the Free plan, we want to make sure that you’re protected and also have the visibility that you need.

DDoS Protection is included as part of every Cloudflare service; from Magic Transit at L3, through Spectrum at L4, to the WAF/CDN service at L7. Our mission is to help build a better Internet – and this means a safer, faster, and more reliable Internet. For everyone.

If you’re a Cloudflare customer of any plan (Free, Pro, Business or Enterprise), these new protections are now enabled by default at no additional charge.

How To Use 1.1.1.1 w/ WARP App And Cloudflare Gateway To Protect Your Phone From Security Threats

Post Syndicated from Irtefa original https://blog.cloudflare.com/how-to-use-1-1-1-1-w-warp-app-and-cloudflare-gateway-to-protect-your-phone-from-security-threats/

How To Use 1.1.1.1 w/ WARP App And Cloudflare Gateway To Protect Your Phone From Security Threats

Cloudflare Gateway protects users and devices from security threats. You can now use Gateway inside the 1.1.1.1 w/ WARP app to secure your phone from malware, phishing and other security threats.

The 1.1.1.1 w/ WARP app has secured millions of mobile Internet connections. When installed, 1.1.1.1 w/ WARP encrypts the traffic leaving your device, giving you a more private browsing experience.

Starting today, you can get even more out of your 1.1.1.1 w/ WARP. By adding Cloudflare Gateway’s secure DNS filtering to the app, you can add a layer of security and block malicious domains flagged as phishing, command and control, or spam. This protection isn’t dependent on what network you’re connected to – it follows you everywhere you go.

You can read more about how Cloudflare Gateway builds on our 1.1.1.1 resolver to secure Internet connections in our announcement. Ready to get started bringing that security to your mobile device? Follow the steps below.

Download the 1.1.1.1 w/ WARP mobile app

If you don’t have the latest version of the 1.1.1.1 w/ WARP app go to the Apple App Store or Google Play Store to download the latest version.

Sign up for Cloudflare Gateway

Sign up for Cloudflare Gateway by visiting the Cloudflare for Teams dashboard. You can use Cloudflare Gateway for free, all you need is a Cloudflare account to get started.

Get the unique ID for your DNS over HTTPS hostname

On your Cloudflare Gateway dashboard go to ‘Locations’.

How To Use 1.1.1.1 w/ WARP App And Cloudflare Gateway To Protect Your Phone From Security Threats

Click on the location listed on the locations page to expand the location item.

How To Use 1.1.1.1 w/ WARP App And Cloudflare Gateway To Protect Your Phone From Security Threats

Copy the unique 10 character subdomain from the DNS over HTTPS endpoint. This unique ID is case sensitive. Either note it down on paper or keep this window open on your computer, because you will need it when you set up Gateway inside your 1.1.1.1 w/ WARP app.
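If you want to sanity-check the endpoint before touching the app, you can send it a DNS over HTTPS query directly. The sketch below uses the third-party dnspython and requests libraries; the hostname is a placeholder, so substitute the exact DNS over HTTPS endpoint shown on your Locations page.

```python
import dns.message      # pip install dnspython
import requests         # pip install requests

# Placeholder: use the exact DNS over HTTPS endpoint from your Locations page,
# which includes your unique 10 character subdomain.
GATEWAY_DOH_URL = "https://YOUR-UNIQUE-ID.cloudflare-gateway.com/dns-query"

def doh_query(name, rrtype="A"):
    """Send a wire-format DNS query over HTTPS and print the answer."""
    query = dns.message.make_query(name, rrtype)
    response = requests.post(
        GATEWAY_DOH_URL,
        data=query.to_wire(),
        headers={"Content-Type": "application/dns-message"},
        timeout=5,
    )
    response.raise_for_status()
    print(dns.message.from_wire(response.content))

doh_query("example.com")
```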

Enabling Cloudflare Gateway for 1.1.1.1 w/ WARP app

After you open the 1.1.1.1 w/ WARP app, click on the menu button on the top right corner:

How To Use 1.1.1.1 w/ WARP App And Cloudflare Gateway To Protect Your Phone From Security Threats

Click on ‘Advanced’ which is located under the ‘Account’ button.

How To Use 1.1.1.1 w/ WARP App And Cloudflare Gateway To Protect Your Phone From Security Threats

Click on ‘Connection options’ which is located at the bottom of the screen right above ‘Diagnostics’.

How To Use 1.1.1.1 w/ WARP App And Cloudflare Gateway To Protect Your Phone From Security Threats

Click on ‘DNS Settings’. This will take you to the screen where you can configure Gateway for your 1.1.1.1 mobile app.

How To Use 1.1.1.1 w/ WARP App And Cloudflare Gateway To Protect Your Phone From Security Threats

When you are on this screen on your phone, you will need to enter the unique subdomain of the location you created for your mobile phone. This is the unique ID I asked you to note down in the previous section.

How To Use 1.1.1.1 w/ WARP App And Cloudflare Gateway To Protect Your Phone From Security Threats

Enter the subdomain inside the field GATEWAY UNIQUE ID.

If 1.1.1.1 DNS, WARP or WARP+ was already enabled, the 1.1.1.1 w/ WARP app should be using Gateway now.

If you are using Android you can read about the setup instructions here.

If you are trying to enable Gateway for your corporate mobile devices using an MDM, you can read the setup instructions here.

Now that you have Gateway setup inside your 1.1.1.1 w/ WARP app, it will enforce security policies that are tied to the location and analytics will show up on your dashboard.

What’s next

We announced last week the 1.1.1.1 w/ WARP beta for Windows and macOS. If you are interested in using Cloudflare Gateway on macOS or Windows you can sign up for the beta here and we will reach out to you as soon as they are available.

Our team will continue to enhance Cloudflare Gateway. If you want to secure corporate devices, data centers or offices from security threats, get started today by visiting the Cloudflare for Teams dashboard.

Introducing 1.1.1.1 for Families

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/introducing-1-1-1-1-for-families/

Introducing 1.1.1.1 for Families

Two years ago today we announced 1.1.1.1, a secure, fast, privacy-first DNS resolver free for anyone to use. In those two years, 1.1.1.1 has grown beyond our wildest imagination. Today, we process more than 200 billion DNS requests per day making us the second largest public DNS resolver in the world behind only Google.

Introducing 1.1.1.1 for Families

Yesterday, we announced the results of the 1.1.1.1 privacy examination. Cloudflare’s business has never involved selling user data or targeted advertising, so it was easy for us to commit to strong privacy protections for 1.1.1.1. We’ve also led the way supporting encrypted DNS technologies including DNS over TLS and DNS over HTTPS. It is long past time to stop transmitting DNS in plaintext and we’re excited that we see more and more encrypted DNS traffic every day.

1.1.1.1 for Families

Introducing 1.1.1.1 for Families

Since launching 1.1.1.1, the number one request we have received is to provide a version of the product that automatically filters out bad sites. While 1.1.1.1 can safeguard user privacy and optimize efficiency, it is designed for direct, fast DNS resolution, not for blocking or filtering content. The requests we’ve received largely come from home users who want to ensure that they have a measure of protection from security threats and can keep adult content from being accessed by their kids. Today, we’re happy to answer those requests.

Introducing 1.1.1.1 for Families — the easiest way to add a layer of protection to your home network and protect it from malware and adult content. 1.1.1.1 for Families leverages Cloudflare’s global network to ensure that it is fast and secure around the world. And it includes the same strong privacy guarantees that we committed to when we launched 1.1.1.1 two years ago. And, just like 1.1.1.1, we’re providing it for free and it’s for any home anywhere in the world.

Two Flavors: 1.1.1.2 (No Malware) & 1.1.1.3 (No Malware or Adult Content)

Introducing 1.1.1.1 for Families

1.1.1.1 for Families is easy to set up and install, requiring just changing two numbers in the settings of your home devices or network router: your primary DNS and your secondary DNS. Setting up 1.1.1.1 for Families usually takes less than a minute and we’ve provided instructions for common devices and routers through the installation guide.

1.1.1.1 for Families has two default options: one that blocks malware and the other that blocks malware and adult content. You choose which setting you want depending on which IP address you configure.

Malware Blocking Only
Primary DNS: 1.1.1.2
Secondary DNS: 1.0.0.2

Malware and Adult Content
Primary DNS: 1.1.1.3
Secondary DNS: 1.0.0.3
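Once the resolver addresses are in place, a quick way to confirm which flavor is answering is to query the same hostname against each address; blocked domains typically resolve to 0.0.0.0 on 1.1.1.2 and 1.1.1.3. The sketch below uses the third-party dnspython library, and the test domain is hypothetical; use a known test hostname from Cloudflare's documentation instead.

```python
import dns.resolver   # pip install dnspython

def resolve_with(nameserver, name):
    """Resolve `name` using only the given resolver IP and return the A records."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    try:
        return [rr.to_text() for rr in resolver.resolve(name, "A")]
    except dns.resolver.NXDOMAIN:
        return ["NXDOMAIN"]

# Hypothetical test domain; replace with a documented test hostname.
TEST_NAME = "malware.test.example"

for server in ("1.1.1.1", "1.1.1.2", "1.1.1.3"):
    print(server, resolve_with(server, TEST_NAME))
```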

Additional Configuration

Introducing 1.1.1.1 for Families

In the coming months, we will provide the ability to define additional configuration settings for 1.1.1.1 for Families. This will include options to create specific whitelists and blacklists of certain sites. You will be able to set the times of the day when categories, such as social media, are blocked and get reports on your household’s Internet usage.

1.1.1.1 for Families is built on top of the same site categorization and filtering technology that powers Cloudflare’s Gateway product. With the success of Gateway, we wanted to provide an easy-to-use service that can help any home network be fast, reliable, secure, and protected from potentially harmful content.

Not A Joke

Most of Cloudflare’s business involves selling services to businesses. However, we’ve made it a tradition every April 1 to launch a new consumer product that leverages our network to bring more speed, reliability, and security to every Internet user. While we make money selling to businesses, the products we launch at this time of the year are close to our hearts because of the broad impact they have for every Internet user.

Introducing 1.1.1.1 for Families

This year, while many of us are confined to our homes, protecting our communities from COVID-19, and relying on our home networks more than ever, it seemed especially important to launch 1.1.1.1 for Families. We hope during these troubled times it will help provide a bit of peace of mind for households everywhere.

Announcing the Beta for WARP for macOS and Windows

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/announcing-the-beta-for-warp-for-macos-and-windows/

Announcing the Beta for WARP for macOS and Windows

Announcing the Beta for WARP for macOS and Windows

Last April 1 we announced WARP — an option within the 1.1.1.1 iOS and Android app to secure and speed up Internet connections. Today, millions of users have secured their mobile Internet connections with WARP.

While WARP started as an option within the 1.1.1.1 app, it’s really a technology that can benefit any device connected to the Internet. In fact, one of the most common requests we’ve gotten over the last year is support for WARP for macOS and Windows. Today we’re announcing exactly that: the start of the WARP beta for macOS and Windows.

What’s The Same: Fast, Secure, and Free

We always wanted to build a WARP client for macOS and Windows. We started with mobile because it was the hardest challenge. And it turned out to be a lot harder than we anticipated. While we announced the beta of 1.1.1.1 with WARP on April 1, 2019, it took us until late September before we were able to open it up to general availability. We don’t expect the wait for macOS and Windows WARP to be nearly as long.

The WARP client for macOS and Windows relies on the same fast, efficient WireGuard protocol to secure Internet connections and keep them safe from being spied on by your ISP. Also, just like WARP on the 1.1.1.1 mobile app, the basic service will be free on macOS and Windows.

Announcing the Beta for WARP for macOS and Windows

WARP+ Gets You There Faster

We plan to add WARP+ support in the coming months to allow you to leverage Cloudflare’s Argo network for even faster Internet performance. We will provide a plan option for existing WARP+ subscribers to add additional devices at a discount. In the meantime, existing WARP+ users will be among the first to be invited to try WARP for macOS and Windows. If you are a WARP+ subscriber, check your 1.1.1.1 app over the coming weeks for a link to an invitation to try the new WARP for macOS and Windows clients.

If you’re not a WARP+ subscriber, you can add yourself to the waitlist by signing up on the page linked below. We’ll email as soon as it’s ready for you to try.

https://one.one.one.one

Linux Support

We haven’t forgotten about Linux. About 10% of Cloudflare’s employees run Linux on their desktops. As soon as we get the macOS and Windows clients out we’ll turn our attention to building a WARP client for Linux.

Thank you to everyone who helped us make WARP fast, efficient, and reliable on mobile. It’s incredible how far it’s come over the last year. If you tried it early in the beta last year but aren’t using it now, I encourage you to give it another try. We’re looking forward to bringing WARP speed and security to even more devices.

Cloudflare now supports security keys with Web Authentication (WebAuthn)!

Post Syndicated from Anita Tenjarla original https://blog.cloudflare.com/cloudflare-now-supports-security-keys-with-web-authentication-webauthn/

Cloudflare now supports security keys with Web Authentication (WebAuthn)!

Cloudflare now supports security keys with Web Authentication (WebAuthn)!

We’re excited to announce that Cloudflare now supports security keys as a two factor authentication (2FA) method for all users. Cloudflare customers now have the ability to use security keys on WebAuthn-supported browsers to log into their user accounts. We strongly suggest users configure multiple security keys and 2FA methods on their account in order to access their apps from various devices and browsers. If you want to get started with security keys, visit your account’s 2FA settings.

Cloudflare now supports security keys with Web Authentication (WebAuthn)!

What is WebAuthn?

WebAuthn is a standardized protocol for authentication online using public key cryptography. It is part of the FIDO2 Project and is backwards compatible with FIDO U2F. Depending on your device and browser, you can use hardware security keys (like YubiKeys) or built-in biometric support (like Apple Touch ID) to authenticate to your Cloudflare user account as a second factor. WebAuthn support is rapidly increasing among browsers and devices, and we’re proud to join the growing list of services that offer this feature.

To use WebAuthn, a user registers their security key, or “authenticator”, to a supporting application, or “relying party” (in this case Cloudflare). The authenticator then generates and securely stores a public/private keypair on the device. The keypair is scoped to a specific domain and user account. The authenticator then sends the public key to the relying party, who stores it. A user may have multiple authenticators registered with the same relying party. In fact, it’s strongly encouraged for a user to do so in case an authenticator is lost or broken.

When a user logs into their account, the relying party will issue a randomly generated byte sequence called a “challenge”. The authenticator will prompt the user for “interaction” in the form of a tap, touch or PIN before signing the challenge with the stored private key and sending it back to the relying party. The relying party evaluates the signed challenge against the public key(s) it has stored associated with the user, and if the math adds up the user is authenticated! To learn more about how WebAuthn works, take a look at the official documentation.
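Stripped of the CBOR encoding, attestation formats, and origin checks that the full WebAuthn protocol adds, the cryptographic core of that exchange is just a challenge signed with the authenticator's private key and verified with the stored public key. Here is a simplified sketch using the Python cryptography library; it illustrates the idea and is not a WebAuthn implementation.

```python
import os
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Registration: the authenticator creates a keypair scoped to this relying
# party and sends only the public key to the server.
authenticator_private_key = ec.generate_private_key(ec.SECP256R1())
server_stored_public_key = authenticator_private_key.public_key()

# Login: the relying party issues a random challenge...
challenge = os.urandom(32)

# ...the authenticator signs it after user interaction (tap, touch, or PIN)...
signature = authenticator_private_key.sign(challenge, ec.ECDSA(hashes.SHA256()))

# ...and the relying party verifies the signature with the stored public key.
try:
    server_stored_public_key.verify(signature, challenge, ec.ECDSA(hashes.SHA256()))
    print("user authenticated")
except InvalidSignature:
    print("authentication failed")
```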

Cloudflare now supports security keys with Web Authentication (WebAuthn)!

How is WebAuthn different from other 2FA methods?

There’s a lot of hype about WebAuthn, and rightfully so. But there are some common misconceptions about how WebAuthn actually works, so I wanted to take some time to explain why it’s so effective against various credential-based attacks.

First, WebAuthn relies on a “physical thing you have” rather than an app or a phone number, which makes it a lot harder for a remote attacker to impersonate a victim. This assumption prevents common exploits like SIM swapping, which is an attack used to bypass SMS-based verification. In contrast, an attacker physically (and cryptographically) cannot “impersonate” a hardware security key unless they have physical access to a victim’s unlocked device.

WebAuthn is also simpler and quicker to use compared to mobile app-based 2FA methods. Users often complain about the amount of time it takes to reach for their phone, open an app, and copy over an expiring passcode every time they want to log into an account. By contrast, security keys require a simple touch or tap on a piece of hardware that’s often attached to a device.

But where WebAuthn really shines is its particular resistance to phishing attacks. Phishing often requires an attacker to construct a believable fake replica of a target site. For example, an attacker could try to register cloudfare[.]com (notice the typo!) and construct a site that looks similar to the genuine cloudflare[.]com. The attacker might then try to trick a victim into logging into the fake site and disclosing their credentials. Even if the victim has mobile app TOTP authentication enabled, a sophisticated attacker can still proxy requests from the fake site to the genuine site and successfully authenticate as the victim. This is the assumption behind powerful man-in-the-middle tools like evilginx.

WebAuthn prevents users from falling victim to common phishing and man-in-the-middle attacks because it takes the domain name into consideration when creating user credentials. When an authenticator creates the public/private keypair, it is specifically scoped to a particular account and domain. So let’s say a user with WebAuthn configured navigates to the phishy cloudfare[.]com site. When the phishy site prompts the authenticator to sign its challenge, the authenticator will attempt to find credentials for that phishy site’s domain and, upon failing to find any, will error and prevent the user from logging in. This is why hardware security keys are among the most secure authentication methods in existence today according to research by Google.

Cloudflare now supports security keys with Web Authentication (WebAuthn)!

WebAuthn also has very strict privacy guarantees.  If a user authenticates with a biometric key (like Apple TouchID or Windows Hello), the relying party never receives any of that biometric data. The communication between authenticator and client browser is completely separate from the communication between client browser and relying party. WebAuthn also urges relying parties to not disclose user-identifiable information (like email addresses) during registration or authentication. This helps prevent replay or user enumeration attacks. And because credentials are strictly scoped to a particular relying party and domain, a malicious relying party won’t be able to gain information about other relying parties an authenticator has created credentials for in order to track a user’s various accounts.

Finally, WebAuthn is great for relying parties because they don’t have to store anything additionally sensitive about a user. The relying party simply stores a user’s public key. An attacker who gains access to the public key can’t do much with it because they won’t know the associated private key. This is markedly less risky than TOTP, where a relying party must use proper hygiene to store a TOTP secret seed from which all subsequent time-based user passcodes are generated.

Security isn’t always intuitive

Sometimes in the security industry we have the tendency to fixate on new and sophisticated attacks. But often it’s the same old “simple” problems that have the highest impact. Two factor authentication is a textbook case where the security industry largely believes a concept is trivial, but the average user still finds it confusing or annoying. WebAuthn addresses this problem because it’s quicker and more secure for the end user compared to other authentication methods. We think the trend towards security key adoption will continue to grow, and we’re looking forward to doing our part to help the effort.

Note: If you log in to your Cloudflare user account with Single Sign-On (SSO), you will not have the option to use two factor authentication (2FA). This is because your SSO provider manages your 2FA methods. To learn more about Cloudflare’s 2FA offerings, please visit our support center.

Amazon Detective – Rapid Security Investigation and Analysis

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-detective-rapid-security-investigation-and-analysis/

Almost five years ago, I blogged about a solution that automatically analyzes AWS CloudTrail data to generate alerts upon sensitive API usage. It was a simple and basic solution for security analysis and automation. But demanding AWS customers have multiple AWS accounts and collect data from multiple sources, and simple searches based on regular expressions are not enough to conduct in-depth analysis of suspected security-related events. Today, when a security issue is detected, such as compromised credentials or unauthorized access to a resource, security analysts cross-analyze several data logs to understand the root cause of the issue and its impact on the environment. In-depth analysis often requires scripting and ETL to connect the dots between data generated by multiple siloed systems. It requires skilled data engineers to answer basic questions such as “is this normal?”. Analysts use Security Information and Event Management (SIEM) tools, third-party libraries, and data visualization tools to validate, compare, and correlate data to reach their conclusions. To further complicate matters, new AWS accounts and new applications are constantly introduced, forcing analysts to constantly reestablish baselines of normal behavior, and to understand new patterns of activities every time they evaluate a new security issue.

Amazon Detective is a fully managed service that empowers users to automate the heavy lifting involved in processing large quantities of AWS log data to determine the cause and impact of a security issue. Once enabled, Detective automatically begins distilling and organizing data from AWS Guard Duty, AWS CloudTrail, and Amazon Virtual Private Cloud Flow Logs into a graph model that summarizes the resource behaviors and interactions observed across your entire AWS environment.

At re:Invent 2019, we announced a preview of Amazon Detective. Today, it is our pleasure to announce its availability for all AWS customers.

Amazon Detective uses machine learning models to produce graphical representations of your account behavior and helps you to answer questions such as “is this an unusual API call for this role?” or “is this spike in traffic from this instance expected?”. You do not need to write code, to configure or to tune your own queries.

To get started with Amazon Detective, I open the AWS Management Console, I type “detective” in the search bar and I select Amazon Detective from the provided results to launch the service. I enable the service and I let the console guide me to configure “member” accounts to monitor and the “master” account in which to aggregate the data. After this one-time setup, Amazon Detective immediately starts analyzing AWS telemetry data and, within a few minutes, I have access to a set of visual interfaces that summarize my AWS resources and their associated behaviors such as logins, API calls, and network traffic. I search for a finding or resource from the Amazon Detective Search bar and, after a short while, I am able to visualize the baseline and current value for a set of metrics.

I select the resource type and ID and start to browse the various graphs.

I can also investigate an AWS Guard Duty finding by using the native integrations within the Guard Duty and AWS Security Hub consoles. I click the “Investigate” link from any finding from AWS Guard Duty and jump directly into an Amazon Detective console that provides related details, context, and guidance to investigate and to respond to the issue. In the example below, Guard Duty reports an unauthorized access that I decide to investigate:

Amazon Detective console opens:

I scroll down the page to check the graph of failed API calls. I click a bar in the graph to get the details, such as the IP addresses where the calls originated:

Once I know the source IP addresses, I click New behavior: AWS role and observe where these calls originated from to compare with the automatically discovered baseline.

Amazon Detective works across your AWS accounts: it is a multi-account solution that aggregates data and findings from up to 1000 AWS accounts into a single security-owned “master” account, making it easy to view behavioral patterns and connections across your entire AWS environment.

There are no agents, sensors, or additional software to deploy in order to use the service. Amazon Detective retrieves, aggregates and analyzes data from AWS Guard Duty, AWS CloudTrail and Amazon Virtual Private Cloud Flow Logs. Amazon Detective collects existing logs directly from AWS without touching your infrastructure, thereby not causing any impact to cost or performance.

Amazon Detective can be administered via the AWS Management Console or via the Amazon Detective management APIs. The management APIs enable you to build Amazon Detective into your standard account registration, enablement, and deployment processes.
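For example, enablement can be scripted with the AWS SDK. The sketch below uses boto3 to create a behavior graph in the master account and invite a member account; the region, account ID, and email address are placeholders.

```python
import boto3  # pip install boto3

# Placeholders: pick your own region, account IDs, and contact addresses.
detective = boto3.client("detective", region_name="us-east-1")

# Enable Detective in this region by creating a behavior graph
# in the master (administrator) account.
graph_arn = detective.create_graph()["GraphArn"]

# Invite member accounts whose telemetry should feed the graph.
detective.create_members(
    GraphArn=graph_arn,
    Message="Please join our Detective behavior graph",
    Accounts=[
        {"AccountId": "111122223333", "EmailAddress": "[email protected]"},
    ],
)

# List the behavior graphs administered by this account.
print(detective.list_graphs()["GraphList"])
```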

Amazon Detective is a regional service. I activate the service in every AWS Region in which I want to analyze findings. All data are processed in the AWS Region where they are generated. Amazon Detective maintains data analytics and log summaries in the behavior graph for a 1-year rolling period from the date of log ingestion. This allows for visual analysis and deep dives over a large data set for a long period of time. When I disable the service, all data is expunged to ensure no data remains.

There are no additional charges or upfront commitments required to use Amazon Detective. We charge per GB of data ingested from AWS CloudTrail, Amazon Virtual Private Cloud Flow Logs, and AWS Guard Duty findings. Amazon Detective offers a 30-day free trial. As usual, check the pricing page for the details.

Amazon Detective is available in all commercial AWS Regions, except China. You can start to use it today.

— seb

Using Cloudflare to secure your cardholder data environment

Post Syndicated from Jacob Zollinger original https://blog.cloudflare.com/using-cloudflare-to-secure-your-cardholder-data-environment/

Using Cloudflare to secure your cardholder data environment

Using Cloudflare to secure your cardholder data environment

As part of our ongoing compliance efforts Cloudflare’s PCI scope is reviewed quarterly and after any significant changes to ensure all in-scope systems are operating in accordance with the PCI DSS. This review also allows us to periodically review each product we offer as a PCI validated service provider and identify where there might be opportunities to provide greater value to our customers.

With our customers in mind, we completed our latest assessment and have increased our PCI certified product offering!

Building trust in our products is one critical component that allows Cloudflare’s mission of “Building a Better Internet” to succeed. We reaffirm our dedication to building trust in our products by obtaining industry standard security compliance certifications and complying with regulations.

Cloudflare is a Level 1 Merchant, the highest level, and also provides services to organizations to help secure their cardholder data environment. Maintaining PCI DSS compliance is important for Cloudflare because (1) we must ensure that our transmission and processing of cardholder data is secure for our own customers, (2) that our customers know they can trust Cloudflare’s products to transmit cardholder data securely, and (3) that anyone who interacts with Cloudflare’s services know that their information is transmitted securely.

The PCI standard applies to any company or organization that accepts credit cards, debit cards, or even prepaid cards for payment. The purpose of this compliance standard is to help protect financial institutions and customers from having their payment card information compromised. Each major payment card brand has merchants sorted into different tiers based on the number of transactions made per year, and each tier requires varying requirements to satisfy their compliance obligations. Annually, Cloudflare undergoes an assessment by a Qualified Security Assessor. This assessor conducts a thorough review of Cloudflare’s technical environment and validates that Cloudflare’s controls related to securing the transmission, processing, and storage of cardholder data meet the requirements in the PCI Data Security Standard (PCI DSS).

Cloudflare has been PCI compliant since 2014 as both a merchant and as a service provider, but this year we have expanded our Service Provider scope to include more products that will help our customers become more secure and meet their own compliance obligations.

How can Cloudflare Help You?

In addition to our WAF, we are proud to announce that Cloudflare’s Content Delivery Network, Cloudflare Access, and the Cloudflare Time Service are also certified under our latest Attestation of Compliance!

Our Attestation of Compliance is applicable for all Pro, Business, and Enterprise accounts. This designation can be used to simplify your PCI audits and remove the pressure on you to manage these services or appliances locally.

If you use our WAF, enable the OWASP ruleset, and tune rules for your environment, you will meet the need to protect web-facing applications and satisfy PCI requirement 6.6.

As detailed by several recent blog posts, Cloudflare Access is changing the game and your relationship with your corporate VPN. Many organizations rely on VPNs and other segmentation tools to reduce the scope of their cardholder data environment. Cloudflare Access provides another means of segmentation by using Cloudflare’s global network as a VPN service to access internal resources. Additionally, these sessions can be configured to time out after 15 minutes of inactivity to help customers meet requirement 8.1.8!

There are several large providers of time services that most organizations use. However, in 2019 Cloudflare announced our time.cloudflare.com NTP service. Our time service leverages our CDN and our global network to provide an advantage in latency and accuracy. Our 200 locations around the world all use anycast to route your packets to our closest server. All of our servers are synchronized with stratum 1 time service providers, and then offer NTP to the general public, similar to how other public NTP providers function. Accurate time services are critical to maintaining accurate audit logging and being able to respond to incidents. By changing your time source to time.cloudflare.com, we can help you meet requirement 10.4.3.
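Checking a host against the service is straightforward. A small sketch using the third-party ntplib package is below; the acceptable offset you enforce is your own policy decision.

```python
import ntplib                      # pip install ntplib
from datetime import datetime, timezone

# Query Cloudflare's NTP service and compare it with the local clock.
# Offsets larger than a fraction of a second can break log correlation.
client = ntplib.NTPClient()
response = client.request("time.cloudflare.com", version=3)

server_time = datetime.fromtimestamp(response.tx_time, tz=timezone.utc)
print(f"server time : {server_time.isoformat()}")
print(f"clock offset: {response.offset:.6f} seconds")
print(f"round trip  : {response.delay:.6f} seconds")
```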

Finally, Cloudflare has given our customers the opportunity to configure higher levels of TLS. Currently, you can configure connections between the client and your origin server to use up to TLS 1.3, which exceeds the requirement referenced in requirement 4.1 to use TLS 1.1 or higher!

We use our own products to secure our cardholder data environment and hope that our customers will find these product additions as beneficial and easy to implement as we have.

Learn more about Compliance at Cloudflare

Cloudflare is committed to helping our customers earn their users’ trust by ensuring our products are secure. The Security team is committed to adhering to security compliance certifications and regulations that maintain the security, confidentiality, and availability of company and client information.

In order to help our customers keep track of the latest certifications, Cloudflare continually updates our Compliance certification page – www.cloudflare.com/compliance. Today, you can view our status on all compliance certifications and download our SOC 3 report.

Six years of the GitHub Security Bug Bounty program

Post Syndicated from Brian Anglin original https://github.blog/2020-03-25-six-years-of-the-github-security-bug-bounty-program/

Last month GitHub reached some big milestones for our Security Bug Bounty program. As of February 2020, it’s been six years since we started accepting submissions. Over the years we’ve been able to invest in the bug bounty community through live events, private bug bounties, feature previews, and of course through cash bounties.

We’re excited to announce that we recently passed $1,000,000 in total payments to researchers since we moved our program to HackerOne in 2016. We paid out over half of our total awards in the last year alone, reaching almost $590,000 in total bounty rewards across our programs. We’ve also been able to provide quick response times to an increasingly large number of submissions—maintaining an average response time of 17 hours. This is all while seeing a 40 percent increase in submissions since last year. We’re sharing some highlights from the past year along with our upcoming plans for the future.

2019 highlights

Cool bugs

One of my favorite parts of working on the bug bounty program is getting to see the amazing submissions we get from the community. Many of the best submissions show an understanding of GitHub and our technology that rivals that of our own engineering teams. We’ve offered very competitive bounties so we can attract those talented individuals and provide them an incentive to spend time digging deep into our codebase. The community in 2019 did not disappoint.

OAuth flow bypass using cross-site HEAD requests

@not-an-aardvark has a lot of great submissions to our program, but this one was particularly impactful. He wrote a great post about it in detail that I’ll quickly recap.

GitHub provides a few ways for integrators to interact with our ecosystem. One of the ways integrators can use GitHub is via OAuth applications which allow the application to take actions on behalf of a GitHub user. Before allowing access to a user’s data, an OAuth application must redirect the user to GitHub.com, allowing them to review the requested permissions and explicitly authorize the application. @not-an-aardvark found a way to bypass our controls to authorize OAuth applications without any user interaction. Let’s get into how this happened.

When we process state changing requests on GitHub.com, such as authorizing an OAuth application, we rely on Ruby on Rails’ Cross Site Request Forgery (CSRF) protection. We inject a special token into the DOM of every form element that we validate when receiving POST requests. The OAuth application authorization flow uses POST requests which require a valid CSRF token. However, the OAuth controller incorrectly allowed both POST and HEAD requests to trigger the authorization logic. We skip CSRF validation when processing HEAD requests since they’re not typically state changing. This allowed a malicious site to automatically authorize an OAuth application without any user interaction.
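To illustrate the bug class (GitHub's actual code is Ruby on Rails; this is only a simplified analogue), the problem boils down to a state-changing handler being reachable through a method that the CSRF protection treats as safe.

```python
# Framework-agnostic sketch of the bug class. CSRF checks are skipped for
# "safe" methods such as HEAD, so a handler must never let those methods
# reach state-changing logic.
CSRF_PROTECTED_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def handle_oauth_authorize(method, has_valid_csrf_token):
    if method in CSRF_PROTECTED_METHODS and not has_valid_csrf_token:
        return "403 invalid CSRF token"

    # Buggy version: authorization ran for HEAD as well as POST, so a
    # cross-site HEAD request (which needs no CSRF token) could silently
    # authorize an OAuth application.
    if method in {"POST", "HEAD"}:
        return "application authorized"

    return "405 method not allowed"

# The fix is to authorize only on POST.
print(handle_oauth_authorize("HEAD", has_valid_csrf_token=False))  # authorized: the bug
print(handle_oauth_authorize("POST", has_valid_csrf_token=False))  # 403
```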

Due to the severity of the vulnerability, we needed to patch it as quickly as possible. We worked closely with the engineering team and shipped a fix to GitHub users within three hours of receiving the submission. We also conducted a full investigation with SIRT engineers and confirmed that this vulnerability wasn’t exploited in the wild. Additionally, we rolled out patches for GitHub Enterprise Server for all supported versions. We rewarded @not-an-aardvark with $25,000 for the severity of the vulnerability and their detailed writeup in their submission.

This bug demonstrates the important role that researchers play in our overall security. By identifying this issue via our bug bounty program, we were able to protect our users by patching the issue and validating that it wasn’t previously exploited.

GitHub.com remote code execution through command injection

@ajxchapman achieved remote code execution in GitHub.com by triggering command injection in our Mercurial import feature. The import logic didn’t correctly sanitize branch names which allowed a maliciously crafted branch name to execute code on our servers. Since the import feature is quite complicated, we’ve traditionally run the import code in a sandbox on dedicated servers isolated from our production network. This isolation limited the impact of the vulnerability, and we were able to quickly release a fix for GitHub.com and backported the fix for GitHub Enterprise Server customers. We also audited the import logic for similar issues and confirmed from our logging systems that this wasn’t exploited in the wild.
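The underlying bug class is easy to reproduce. The sketch below is not the actual import code (the repository URL and validation rule are illustrative), but it shows how interpolating an untrusted branch name into a shell command differs from passing it as a discrete argument.

```python
import re
import subprocess

branch = 'main"; touch /tmp/pwned; echo "'   # attacker-controlled branch name

# Vulnerable pattern: interpolating untrusted input into a shell command lets
# a crafted branch name run arbitrary commands on the import server.
#   subprocess.run(f'hg pull -r "{branch}" https://example.org/repo', shell=True)

# Safer pattern: validate the name against an allowlist, then pass arguments
# as a list so the branch name is never interpreted by a shell.
def safe_pull(branch_name: str) -> None:
    if not re.fullmatch(r"[A-Za-z0-9._/-]+", branch_name):
        raise ValueError("invalid branch name")
    subprocess.run(["hg", "pull", "-r", branch_name, "https://example.org/repo"],
                   check=True)

try:
    safe_pull(branch)
except ValueError as err:
    print(err)   # the crafted name is rejected before any command runs
```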

What makes this bug particularly interesting is the root cause: it was ultimately caused by an outdated dependency. The bug existed in a dependency that handles code imports and was previously fixed upstream. However, we failed to keep up with the latest version and were ultimately vulnerable to this issue. This issue highlights how critical dependency management is to the overall success of a security program. GitHub continues to invest in dependency management tooling to keep us and our customers secure. Find more of Alex’s work on his personal blog.

Expanded scope

GitHub released many new features in 2019 that were added to our Security Bug Bounty scope:

  • Pull reminders added functionality to help keep engineers informed of new pull requests that need attention. We included the solution into our core application and existing Slack integration.
  • Automated security updates (formerly Dependabot) added a better way to track vulnerabilities in dependencies since it automatically opens new pull requests updating the version of a dependency when it finds a new security fix.
  • GitHub for mobile is GitHub’s first presence in the App Store. This brought new requirements of our API and new security concerns in our application. We’re delivering the same security and functionality that’s available on GitHub.com.
  • GitHub Actions is one of GitHub’s biggest releases since pull requests, and it introduced whole classes of new security corner cases. Through close collaboration with our engineering partners, we’ve provided users the ability to run their code right on GitHub.com.
  • Semmle’s LGTM tool was a significant addition to our suite of security tools, like Dependabot and the Maintainer Security Advisories. LGTM allows our users to scan for potential security issues in their code on every pull request.

We’ve had several valuable submissions that influenced the development of these products significantly. We paid out over $20,000 in bounties for vulnerabilities affecting the products in this expanded scope, and we’re excited to continue expanding our Bug Bounty scope as GitHub grows.

H1-702

In August 2019, we returned to Las Vegas to participate in our second H1-702 event. This event invited the top hackers from HackerOne’s platform to join us along with two other companies for three nights of live hacking. We were excited to participate and wanted to give researchers every incentive to dig deep into our application. We even added a bunch of bonuses on top of our base payouts, including bonuses for Best Proof of Concept, Longest Exploit Chain, and RCE. We also set up a CTF on GitHub.com to direct researchers to some of our newest attack surfaces. Lastly, we hid flags in a Maintainer Security Advisory and GitHub Package Registry with bonuses for every flag. We received positive feedback from some of our researchers about our CTF and will continue to include a CTF component in future events.

Overall, we paid out over $155,000 to researchers in one night, with half those rewards for high or critical severity issues. We can’t express how important live-hacking events, like H1-702, are to our bug bounty program. We look forward to more live-hacking events in the future and other new and innovative ways to engage the community.

Private bug bounty

Beyond the wide scope of our public program, we conducted an invite-only program where we preview features to researchers before they’re launched to everyone. These private programs allow us to work closely with a small group, and give us the opportunity to find bugs before they can affect the majority of our users. We’ve paid out just over $37,000 via our private program this year, and many of these findings were fixed before new features reached a significant number of our customers.

Actions CI/CD

Following the success of our first private bug bounty targeting GitHub Actions, we wanted to re-run the private program to target the most recent iteration of our GitHub Actions product. We used what we learned in our first bug bounty to secure the product against similar issues. The community accepted the challenge and found novel bugs in our second iteration.

Automated security updates (formerly Dependabot)

Just like any combination of two complex systems, the acquisition of Dependabot presented a unique challenge for our security team in integrating these two separate architectures. We used the private bug bounty to supplement our own security review of these new services. The findings from the private bug bounty program greatly informed how we integrated Dependabot with GitHub.com. We were also able to surface a few issues before rolling it out.

Pull reminders

Like Dependabot, pull reminders required the same care and attention to ensure a secure transition from an integration to a first-party GitHub product. Pull reminders also added more complexity through its connection to Slack. Our own Slack integration provided a foundation for this feature, but there was significant re-architecture and development to tie these two features together. Again, we turned to our bug bounty community to test our pull reminder integration before releasing the feature widely.

2020 initiatives

We have a lot of plans for 2020 and want to highlight some of our upcoming changes.

Security Lab bounty program

We launched the GitHub Security Lab bounty program to incentivize researchers to help us secure all open source software. The new program rewards community members who write CodeQL queries that detect entire vulnerability classes so that the rest of the community can run those queries against their own projects. This results in removing vulnerabilities at scale.

Making a contribution to this program not only influences the global state of software security, but also prevents similar vulnerabilities from being released in the future. This is an exciting twist on our traditional bug bounty program, and we’re excited to see researchers using our new CodeQL tooling. To date, we have received 20 submissions and awarded almost $21,000, with hundreds of vulnerabilities fixed across the OSS ecosystem as a direct result.

CVEs and disclosure

This year, we’re assigning CVEs to bounty submissions which affect GitHub Enterprise Server. This is a big step forward in consistently communicating the state of our software to our customers, but also provides another accolade for our researchers who identify vulnerabilities in GitHub Enterprise Server.

Get involved

Are you excited by the new additions to our program? Get involved! Visit the GitHub Security Bug Bounty page for details of our scope, rules, and rewards. We can’t wait to make GitHub better for everyone with the help of your submissions.

Learn more about the GitHub Security Bug Bounty

The post Six years of the GitHub Security Bug Bounty program appeared first on The GitHub Blog.

Speeding up Linux disk encryption

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Speeding up Linux disk encryption

Data encryption at rest is a must-have for any modern Internet company. Many companies, however, don’t encrypt their disks, because they fear the potential performance penalty caused by encryption overhead.

Encrypting data at rest is vital for Cloudflare with more than 200 data centres across the world. In this post, we will investigate the performance of disk encryption on Linux and explain how we made it at least two times faster for ourselves and our customers!

Encrypting data at rest

When it comes to encrypting data at rest there are several ways it can be implemented on a modern operating system (OS). Available techniques are tightly coupled with a typical OS storage stack. A simplified version of the storage stack and encryption solutions can be found on the diagram below:

Speeding up Linux disk encryption

On the top of the stack are applications, which read and write data in files (or streams). The file system in the OS kernel keeps track of which blocks of the underlying block device belong to which files and translates these file reads and writes into block reads and writes; the hardware specifics of the underlying storage device, however, are abstracted away from the filesystem. Finally, the block subsystem actually passes the block reads and writes to the underlying hardware using appropriate device drivers.

The concept of the storage stack is actually similar to the well-known network OSI model, where each layer has a more high-level view of the information and the implementation details of the lower layers are abstracted away from the upper layers. And, similar to the OSI model, one can apply encryption at different layers (think about TLS vs IPsec or a VPN).

For data at rest we can apply encryption either at the block layers (either in hardware or in software) or at the file level (either directly in applications or in the filesystem).

Block vs file encryption

Generally, the higher in the stack we apply encryption, the more flexibility we have. With application level encryption the application maintainers can apply any encryption code they please to any particular data they need. The downside of this approach is that they actually have to implement it themselves and encryption in general is not very developer-friendly: one has to know the ins and outs of a specific cryptographic algorithm, properly generate keys, nonces, IVs etc. Additionally, application level encryption does not leverage OS-level caching and Linux page cache in particular: each time the application needs to use the data, it has to either decrypt it again, wasting CPU cycles, or implement its own decrypted “cache”, which introduces more complexity to the code.
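Even a minimal, correct application-level scheme already surfaces those details. The sketch below uses AES-GCM from the Python cryptography library purely to illustrate the key and nonce management an application would have to own; it is not a recommendation of a specific design.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Application-level encryption sketch: the application itself must manage the
# key, generate a unique nonce per record, and store the nonce with the
# ciphertext; block-layer encryption hides all of this from developers.
key = AESGCM.generate_key(bit_length=256)    # must be stored and protected somewhere
aesgcm = AESGCM(key)

def encrypt_record(plaintext: bytes) -> bytes:
    nonce = os.urandom(12)                   # never reuse a nonce with the same key
    return nonce + aesgcm.encrypt(nonce, plaintext, None)

def decrypt_record(blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return aesgcm.decrypt(nonce, ciphertext, None)

blob = encrypt_record(b"customer record")
print(decrypt_record(blob))
```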

File system level encryption makes data encryption transparent to applications, because the file system itself encrypts the data before passing it to the block subsystem, so files are encrypted regardless of whether the application has crypto support or not. Also, file systems can be configured to encrypt only a particular directory or have different keys for different files. This flexibility, however, comes at the cost of a more complex configuration. File system encryption is also considered less secure than block device encryption as only the contents of the files are encrypted. Files also have associated metadata, like file size, the number of files, the directory tree layout etc., which are still visible to a potential adversary.

Encryption down at the block layer (often referred to as disk encryption or full disk encryption) also makes data encryption transparent to applications and even whole file systems. Unlike file system level encryption it encrypts all data on the disk, including file metadata and even free space. It is less flexible though – one can only encrypt the whole disk with a single key, so there is no per-directory, per-file or per-user configuration. From the crypto perspective, not all cryptographic algorithms can be used, as the block layer no longer has a high-level overview of the data and needs to process each block independently. Most common algorithms require some sort of block chaining to be secure, so they are not applicable to disk encryption. Instead, special modes were developed just for this specific use-case.

So which layer to choose? As always, it depends… Application and file system level encryption are usually the preferred choice for client systems because of the flexibility. For example, each user on a multi-user desktop may want to encrypt their home directory with a key they own and leave some shared directories unencrypted. On server systems managed by SaaS/PaaS/IaaS companies (including Cloudflare), on the contrary, the preferred choice is configuration simplicity and security – with full disk encryption enabled any data from any application is automatically encrypted with no exceptions or overrides. We believe that all data needs to be protected without sorting it into "important" vs "not important" buckets, so the selective flexibility the upper layers provide is not needed.

Hardware vs software disk encryption

When encrypting data at the block layer it is possible to do it directly in the storage hardware, if the hardware supports it. Doing so usually gives better read/write performance and consumes fewer resources on the host. However, since most hardware firmware is proprietary, it does not receive as much attention and review from the security community. In the past this led to flaws in some implementations of hardware disk encryption, which rendered the whole security model useless. Microsoft, for example, has since started to prefer software-based disk encryption.

We didn't want to put our data and our customers' data at risk by using potentially insecure solutions, and we strongly believe in open source. That's why we rely only on software disk encryption in the Linux kernel, which is open and has been audited by many security professionals across the world.

Linux disk encryption performance

We aim not only to save bandwidth costs for our customers, but also to deliver content to Internet users as fast as possible.

At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is a supposedly-public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.

Device mapper and dm-crypt

Linux implements transparent disk encryption via the dm-crypt module, and dm-crypt itself is part of the device mapper kernel framework. In a nutshell, the device mapper allows pre- and post-processing of IO requests as they travel between the file system and the underlying block device.
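To make that "pre- and post-processing" role more concrete, below is a heavily simplified, illustrative sketch of a pass-through device mapper target written in C. It is not dm-crypt's actual code: the callbacks follow the upstream target_type API, but exact signatures vary between kernel versions, sector remapping is ignored and error handling is mostly omitted.

#include <linux/device-mapper.h>
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/slab.h>

struct passthrough_ctx {
	struct dm_dev *dev;	/* the underlying block device we forward IO to */
};

/* Constructor: called when the target is set up via "dmsetup create/reload" */
static int passthrough_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	struct passthrough_ctx *ctx;

	if (argc != 1) {
		ti->error = "Expected exactly one argument: <underlying device>";
		return -EINVAL;
	}

	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
	if (!ctx)
		return -ENOMEM;

	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), &ctx->dev)) {
		kfree(ctx);
		ti->error = "Device lookup failed";
		return -EINVAL;
	}

	ti->private = ctx;
	return 0;
}

static void passthrough_dtr(struct dm_target *ti)
{
	struct passthrough_ctx *ctx = ti->private;

	dm_put_device(ti, ctx->dev);
	kfree(ctx);
}

/* Called for every bio: this is the hook where a target can pre-process IO.
 * dm-crypt, for example, arranges encryption of writes and decryption of reads here. */
static int passthrough_map(struct dm_target *ti, struct bio *bio)
{
	struct passthrough_ctx *ctx = ti->private;

	bio_set_dev(bio, ctx->dev->bdev);	/* redirect the bio to the underlying device */
	return DM_MAPIO_REMAPPED;		/* tell device mapper to resubmit it down the stack */
}

static struct target_type passthrough_target = {
	.name    = "passthrough",
	.version = {1, 0, 0},
	.module  = THIS_MODULE,
	.ctr     = passthrough_ctr,
	.dtr     = passthrough_dtr,
	.map     = passthrough_map,
};

static int __init passthrough_init(void)
{
	return dm_register_target(&passthrough_target);
}

static void __exit passthrough_exit(void)
{
	dm_unregister_target(&passthrough_target);
}

module_init(passthrough_init);
module_exit(passthrough_exit);
MODULE_LICENSE("GPL");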

dm-crypt in particular encrypts "write" IO requests before sending them further down the stack to the actual block device and decrypts "read" IO requests before sending them up to the file system driver. Simple and easy! Or is it?

Benchmarking setup

For the record, the numbers in this post were obtained by running specified commands on an idle Cloudflare G9 server out of production. However, the setup should be easily reproducible on any modern x86 laptop.

Generally, benchmarking anything around a storage stack is hard because of the noise introduced by the storage hardware itself. Not all disks are created equal, so for the purpose of this post we will use the fastest disks available out there – that is, no disks at all.

Instead, Linux has an option to emulate a disk directly in RAM. Since RAM is much faster than any persistent storage, it should introduce little bias in our results.

The following command creates a 4GB ramdisk:

$ sudo modprobe brd rd_nr=1 rd_size=4194304
$ ls /dev/ram0

Now we can set up a dm-crypt instance on top of it thus enabling encryption for the disk. First, we need to generate the disk encryption key, "format" the disk and specify a password to unlock the newly generated key.

$ fallocate -l 2M crypthdr.img
$ sudo cryptsetup luksFormat /dev/ram0 --header crypthdr.img

WARNING!
========
This will overwrite data on crypthdr.img irrevocably.

Are you sure? (Type uppercase yes): YES
Enter passphrase:
Verify passphrase:

Those who are familiar with LUKS/dm-crypt might have noticed we used a LUKS detached header here. Normally, LUKS stores the password-encrypted disk encryption key on the same disk as the data, but since we want to compare read/write performance between encrypted and unencrypted devices, we might accidentally overwrite the encrypted key during our benchmarking later. Keeping the encrypted key in a separate file avoids this problem for the purposes of this post.

Now, we can actually "unlock" the encrypted device for our testing:

$ sudo cryptsetup open --header crypthdr.img /dev/ram0 encrypted-ram0
Enter passphrase for /dev/ram0:
$ ls /dev/mapper/encrypted-ram0
/dev/mapper/encrypted-ram0

At this point we can now compare the performance of encrypted vs unencrypted ramdisk: if we read/write data to /dev/ram0, it will be stored in plaintext. Likewise, if we read/write data to /dev/mapper/encrypted-ram0, it will be decrypted/encrypted on the way by dm-crypt and stored in ciphertext.

It's worth noting that we're not creating any file system on top of our block devices to avoid biasing results with file system overhead.

Measuring throughput

When it comes to storage testing/benchmarking, the Flexible I/O tester (fio) is the usual go-to solution. Let's simulate a simple sequential read/write load with a 4K block size on the ramdisk without encryption:

$ sudo fio --filename=/dev/ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=plain
plain: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=21013MB, aggrb=1126.5MB/s, minb=1126.5MB/s, maxb=1126.5MB/s, mint=18655msec, maxt=18655msec
  WRITE: io=21023MB, aggrb=1126.1MB/s, minb=1126.1MB/s, maxb=1126.1MB/s, mint=18655msec, maxt=18655msec

Disk stats (read/write):
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

The above command will run for a long time, so we just stop it after a while. As we can see from the stats, we're able to read and write at roughly the same throughput of around 1126 MB/s. Let's repeat the test with the encrypted ramdisk:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=1693.7MB, aggrb=150874KB/s, minb=150874KB/s, maxb=150874KB/s, mint=11491msec, maxt=11491msec
  WRITE: io=1696.4MB, aggrb=151170KB/s, minb=151170KB/s, maxb=151170KB/s, mint=11491msec, maxt=11491msec

Whoa, that’s a drop! We only get ~147 MB/s now, which is more than 7 times slower! And this is on a totally idle machine!

Maybe, crypto is just slow

The first thing we considered was to ensure we were using the fastest crypto available. cryptsetup allows us to benchmark all the available crypto implementations on the system to select the best one:

$ sudo cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1340890 iterations per second for 256-bit key
PBKDF2-sha256    1539759 iterations per second for 256-bit key
PBKDF2-sha512    1205259 iterations per second for 256-bit key
PBKDF2-ripemd160  967321 iterations per second for 256-bit key
PBKDF2-whirlpool  720175 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   969.7 MiB/s  3110.0 MiB/s
 serpent-cbc   128b           N/A           N/A
 twofish-cbc   128b           N/A           N/A
     aes-cbc   256b   756.1 MiB/s  2474.7 MiB/s
 serpent-cbc   256b           N/A           N/A
 twofish-cbc   256b           N/A           N/A
     aes-xts   256b  1823.1 MiB/s  1900.3 MiB/s
 serpent-xts   256b           N/A           N/A
 twofish-xts   256b           N/A           N/A
     aes-xts   512b  1724.4 MiB/s  1765.8 MiB/s
 serpent-xts   512b           N/A           N/A
 twofish-xts   512b           N/A           N/A

It seems aes-xts with a 256-bit data encryption key is the fastest here. But which one are we actually using for our encrypted ramdisk?

$ sudo dmsetup table /dev/mapper/encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0

We do use aes-xts with a 256-bit data encryption key (count all the zeroes conveniently masked by the dmsetup tool – if you want to see the actual bytes, add the --showkeys option to the above command). The numbers do not add up, however: cryptsetup benchmark tells us above not to rely on the results, as "Tests are approximate using memory only (no storage IO)", but that is exactly how we've set up our experiment using the ramdisk. Even in a somewhat pessimistic case (assuming we're reading all the data and then encrypting/decrypting it sequentially with no parallelism), a back-of-the-envelope calculation says we should be getting around (1126 * 1823) / (1126 + 1823) ≈ 696 MB/s, which is still quite far from the actual 147 * 2 = 294 MB/s (total for reads and writes).
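To spell out that back-of-the-envelope model: if every byte first has to be moved at the ramdisk speed R ≈ 1126 MB/s and then encrypted at the cipher speed C ≈ 1823 MB/s, strictly one step after the other with no overlap, the combined throughput T is bounded by:

T = 1 / (1/R + 1/C) = (R * C) / (R + C) = (1126 * 1823) / (1126 + 1823) ≈ 696 MB/s

Anything significantly below that bound points to overhead beyond the raw crypto work itself.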

dm-crypt performance flags

While reading the cryptsetup man page we noticed that it has two options prefixed with --perf-, which are probably related to performance tuning. The first one is --perf-same_cpu_crypt with a rather cryptic description:

Perform encryption using the same cpu that IO was submitted on.  The default is to use an unbound workqueue so that encryption work is automatically balanced between available CPUs.  This option is only relevant for open action.

So we enable the option:

$ sudo cryptsetup close encrypted-ram0
$ sudo cryptsetup open --header crypthdr.img --perf-same_cpu_crypt /dev/ram0 encrypted-ram0

Note: according to the latest man page there is also a cryptsetup refresh command, which can be used to enable these options live without having to "close" and "re-open" the encrypted device. Our version of cryptsetup, however, didn't support it yet.

Let's verify that the option has really been enabled:

$ sudo dmsetup table encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 same_cpu_crypt

Yes, we can now see same_cpu_crypt in the output, which is what we wanted. Let’s rerun the benchmark:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=1596.6MB, aggrb=139811KB/s, minb=139811KB/s, maxb=139811KB/s, mint=11693msec, maxt=11693msec
  WRITE: io=1600.9MB, aggrb=140192KB/s, minb=140192KB/s, maxb=140192KB/s, mint=11693msec, maxt=11693msec

Hmm, now it is ~136 MB/s, which is slightly worse than before, so no good. What about the second option, --perf-submit_from_crypt_cpus:

Disable offloading writes to a separate thread after encryption.  There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly.  The default is to offload write bios to the same thread.  This option is only relevant for open action.

Maybe we are in one of those "some situations" here, so let's try it out:

$ sudo cryptsetup close encrypted-ram0
$ sudo cryptsetup open --header crypthdr.img --perf-submit_from_crypt_cpus /dev/ram0 encrypted-ram0
Enter passphrase for /dev/ram0:
$ sudo dmsetup table encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 submit_from_crypt_cpus

And now the benchmark:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=2066.6MB, aggrb=169835KB/s, minb=169835KB/s, maxb=169835KB/s, mint=12457msec, maxt=12457msec
  WRITE: io=2067.7MB, aggrb=169965KB/s, minb=169965KB/s, maxb=169965KB/s, mint=12457msec, maxt=12457msec

~166 MB/s, which is a bit better, but still not good…

Asking the community

Being desperate, we decided to seek support from the Internet and posted our findings to the dm-crypt mailing list, but the response we got was not very encouraging:

If the numbers disturb you, then this is from lack of understanding on your side. You are probably unaware that encryption is a heavy-weight operation…

We decided to do some scientific research on this topic by typing "is encryption expensive" into Google Search, and one of the top results that actually contains meaningful measurements is… our own post about the cost of encryption, but in the context of TLS! This is a fascinating read on its own, but the gist is: modern crypto on modern hardware is very cheap, even at Cloudflare scale (doing millions of encrypted HTTP requests per second). In fact, it is so cheap that Cloudflare was the first provider to offer free SSL/TLS for everyone.

Digging into the source code

When trying to use the custom dm-crypt options described above, we were curious why they exist in the first place and what that "offloading" is all about. Originally we expected dm-crypt to be a simple "proxy", which just encrypts/decrypts data as it flows through the stack. It turns out dm-crypt does more than just encrypt memory buffers; a (simplified) IO traversal path diagram is presented below:

Speeding up Linux disk encryption

When the file system issues a write request, dm-crypt does not process it immediately – instead it puts it into a workqueue named "kcryptd". In a nutshell, a kernel workqueue just schedules some work (encryption in this case) to be performed at some later time, when it is more convenient. When "the time" comes, dm-crypt sends the request to the Linux Crypto API for actual encryption. However, the modern Linux Crypto API is asynchronous as well, so depending on which particular implementation your system uses, the request will most likely not be processed immediately, but queued again for a "later time". When the Linux Crypto API finally does the encryption, dm-crypt may try to sort pending write requests by putting each request into a red-black tree. Then a separate kernel thread, again at "some time later", actually takes all the IO requests from the tree and sends them down the stack.

Now for read requests: this time we need to get the encrypted data from the hardware first, but dm-crypt does not just ask the driver for the data – it queues the request into a different workqueue named "kcryptd_io". At some point later, when we actually have the encrypted data, we schedule it for decryption using the now familiar "kcryptd" workqueue. "kcryptd" will send the request to the Linux Crypto API, which may decrypt the data asynchronously as well.
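To illustrate the pattern in code, here is a minimal sketch of the "queue now, process later" idiom for the write path. The helper names are invented for illustration – this is not dm-crypt's real code, only the shape of the deferral it performs:

#include <linux/workqueue.h>
#include <linux/bio.h>
#include <linux/slab.h>

/* Created elsewhere with alloc_workqueue(); stands in for the real "kcryptd". */
static struct workqueue_struct *kcryptd_wq;

struct crypt_io {
	struct work_struct work;
	struct bio *bio;
};

/* Invented stand-ins for the real crypto and submission code. */
static void encrypt_bio(struct bio *bio) { }
static void hand_off_to_write_thread(struct bio *bio) { }

/* Runs later, on a workqueue thread. */
static void kcryptd_crypt_work(struct work_struct *work)
{
	struct crypt_io *io = container_of(work, struct crypt_io, work);

	encrypt_bio(io->bio);                /* may be deferred again inside an async Crypto API driver */
	hand_off_to_write_thread(io->bio);   /* a separate thread then sorts and submits the bio */
	kfree(io);
}

/* Called from the IO path: the write bio is queued, not encrypted immediately. */
static void crypt_queue_write(struct bio *bio)
{
	struct crypt_io *io = kmalloc(sizeof(*io), GFP_NOIO);

	if (!io)
		return; /* real code handles allocation failure properly */

	io->bio = bio;
	INIT_WORK(&io->work, kcryptd_crypt_work);
	queue_work(kcryptd_wq, &io->work);
}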

To be fair, the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering if all this extra queueing could cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is:

A significant amount of tail latency is due to queueing effects

So, why are all these queues there and can we remove them?

Git archeology

No one writes more complex code just for fun, especially for the OS kernel. So all these queues must have been put there for a reason. Luckily, the Linux kernel source is managed by git, so we can try to retrace the changes and the decisions around them.

The "kcryptd" workqueue was in the source since the beginning of the available history with the following comment:

Needed because it would be very unwise to do decryption in an interrupt context, so bios returning from read requests get queued here.

So it was for reads only, but even then – why do we care if it is interrupt context or not, if the Linux Crypto API will likely use a dedicated thread/queue for encryption anyway? Well, back in 2005 the Crypto API was not asynchronous, so this made perfect sense.

In 2006 dm-crypt started to use the "kcryptd" workqueue not only for encryption, but also for submitting IO requests:

This patch is designed to help dm-crypt comply with the new constraints imposed by the following patch in -mm: md-dm-reduce-stack-usage-with-stacked-block-devices.patch

It seems the goal here was not to add more concurrency, but rather to reduce kernel stack usage, which again makes sense, as the kernel stack is shared across all the code in a call chain and is quite a limited resource. It is worth noting, however, that the Linux kernel stack was expanded in 2014 for x86 platforms, so this might not be a problem anymore.

A first version of the "kcryptd_io" workqueue was added in 2007 with the intent to avoid:

starvation caused by many requests waiting for memory allocation…

The request processing was bottlenecking on a single workqueue here, so the solution was to add another one. Makes sense.

We are definitely not the first ones experiencing performance degradation because of extensive queueing: in 2011 a change was introduced to conditionally revert some of the queueing for read requests:

If there is enough memory, code can directly submit bio instead queuing this operation in a separate thread.

Unfortunately, at that time Linux kernel commit messages were not as verbose as today, so there is no performance data available.

In 2015 dm-crypt started to sort writes in a separate "dmcrypt_write" thread before sending them down the stack:

On a multiprocessor machine, encryption requests finish in a different order than they were submitted. Consequently, write requests would be submitted in a different order and it could cause severe performance degradation.

It does make sense, as sequential disk access used to be much faster than random access, and dm-crypt was breaking that pattern. But this mostly applies to spinning disks, which were still dominant in 2015. It may not be as important with modern fast SSDs (including NVMe SSDs).

Another part of the commit message is worth mentioning:

…in particular it enables IO schedulers like CFQ to sort more effectively…

It mentions the performance benefits for the CFQ IO scheduler, but Linux schedulers have improved since then, to the point that the CFQ scheduler was removed from the kernel in 2018.

The same patchset replaces the sorting list with a red-black tree:

In theory the sorting should be performed by the underlying disk scheduler, however, in practice the disk scheduler only accepts and sorts a finite number of requests. To allow the sorting of all requests, dm-crypt needs to implement its own sorting.

The overhead associated with rbtree-based sorting is considered negligible so it is not used conditionally.

All of that makes sense, but it would be nice to have some backing data.

Interestingly, in the same patchset we see the introduction of our familiar "submit_from_crypt_cpus" option:

There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly

Overall, we can see that every change was reasonable and needed; however, things have changed since then:

  • hardware became faster and smarter
  • Linux resource allocation was revisited
  • coupled Linux subsystems were rearchitected

And many of the design choices above may not be applicable to modern Linux.

The "clean-up"

Based on the research above we decided to try to remove all the extra queueing and asynchronous behaviour and revert dm-crypt to its original purpose: simply encrypt/decrypt IO requests as they pass through. But for the sake of stability and further benchmarking we ended up not removing the actual code, but rather adding yet another dm-crypt option which, if enabled, bypasses all the queues/threads. The flag allows us to switch between the current and new behaviour at runtime under full production load, so we can easily revert our changes should we see any side-effects. The resulting patch can be found in the Cloudflare GitHub Linux repository.
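Conceptually, the change boils down to a runtime branch like the one sketched below. This is illustrative only, with invented names and a minimal stand-in for dm-crypt's internal state – not the actual patch – but it captures the idea: when the new flag is set, the bio is encrypted and submitted right in the calling context instead of being handed off to the workqueues.

#include <linux/bio.h>
#include <linux/bitops.h>

/* Invented, minimal stand-in for dm-crypt's per-device state. */
struct crypt_cfg {
	unsigned long flags;
};
#define CRYPT_FORCE_INLINE 0	/* bit set when the new "force_inline" table option is given */

/* Invented stand-ins for the real encryption, submission and queueing helpers. */
static void encrypt_bio_inline(struct crypt_cfg *cc, struct bio *bio) { }
static void submit_downstream(struct bio *bio) { }
static void queue_crypt_work(struct crypt_cfg *cc, struct bio *bio) { }

static void crypt_handle_write(struct crypt_cfg *cc, struct bio *bio)
{
	if (test_bit(CRYPT_FORCE_INLINE, &cc->flags)) {
		encrypt_bio_inline(cc, bio);	/* encrypt synchronously, in the caller's context */
		submit_downstream(bio);		/* and send the bio straight down the stack */
	} else {
		queue_crypt_work(cc, bio);	/* default behaviour: defer to the "kcryptd" workqueue */
	}
}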

Synchronous Linux Crypto API

From the diagram above we remember that not all queueing is implemented in dm-crypt. The modern Linux Crypto API may also be asynchronous, and for the sake of this experiment we want to eliminate queueing there as well. What does "may be" mean, though? The OS may contain different implementations of the same algorithm (for example, hardware-accelerated AES-NI on x86 platforms and generic C-code AES implementations). By default the system chooses the "best" one based on the configured algorithm priority. dm-crypt allows overriding this behaviour and requesting a particular cipher implementation using the capi: prefix. However, there is one problem. Let us actually check the available AES-XTS (this is our disk encryption cipher, remember?) implementations on our system:

$ grep -A 11 'xts(aes)' /proc/crypto
name         : xts(aes)
driver       : xts(ecb(aes-generic))
module       : kernel
priority     : 100
refcnt       : 7
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : __xts(aes)
driver       : cryptd(__xts-aes-aesni)
module       : cryptd
priority     : 451
refcnt       : 1
selftest     : passed
internal     : yes
type         : skcipher
async        : yes
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : xts(aes)
driver       : xts-aes-aesni
module       : aesni_intel
priority     : 401
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : yes
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : __xts(aes)
driver       : __xts-aes-aesni
module       : aesni_intel
priority     : 401
refcnt       : 7
selftest     : passed
internal     : yes
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64

We want to explicitly select a synchronous cipher from the above list to avoid queueing effects in threads, but the only two synchronous ones are xts(ecb(aes-generic)) (the generic C implementation) and __xts-aes-aesni (the x86 hardware-accelerated implementation). We definitely want the latter, as it is much faster (we're aiming for performance here), but it is suspiciously marked as internal (see internal: yes). If we check the source code:

Mark a cipher as a service implementation only usable by another cipher and never by a normal user of the kernel crypto API

So this cipher is meant to be used only by other wrapper code in the Crypto API and not outside it. In practice this means that the caller of the Crypto API needs to explicitly specify this flag when requesting a particular cipher implementation, but dm-crypt does not do it, because by design it is not part of the Linux Crypto API, but rather an "external" user. We already patch the dm-crypt module, so we could as well just add the relevant flag. However, there is another problem with AES-NI in particular: the x86 FPU. "Floating point" you say? Why do we need floating point math to do symmetric encryption, which should only be about bit shifts and XOR operations? We don't need the math, but the AES-NI instructions use some of the CPU registers that are dedicated to the FPU. Unfortunately the Linux kernel does not always preserve these registers in interrupt context, for performance reasons (saving/restoring the FPU state is expensive). But dm-crypt may execute code in interrupt context, so we risk corrupting some other process's data, and we're back to the "it would be very unwise to do decryption in an interrupt context" statement in the original code.

Our solution to address the above was to create another, somewhat "smart", Crypto API module. This module is synchronous and does not roll its own crypto, but is just a "router" of encryption requests (a simplified sketch follows the list below):

  • if we can use the FPU (and thus AES-NI) in the current execution context, we just forward the encryption request to the faster, "internal" __xts-aes-aesni implementation (and we can use it here, because now we are part of the Crypto API)
  • otherwise, we just forward the encryption request to the slower, generic C-based xts(ecb(aes-generic)) implementation
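A minimal sketch of that routing decision might look like the code below. This is illustrative only, not the actual xtsproxy module: the real thing also has to register the algorithm, handle key setup for both underlying transforms and size its requests appropriately, all of which is omitted here.

#include <crypto/skcipher.h>
#include <asm/fpu/api.h>

/* Invented context: holds handles to the two underlying implementations,
 * allocated at setup time with crypto_alloc_skcipher() (the AES-NI one has
 * to be requested with the CRYPTO_ALG_INTERNAL flag, as discussed above). */
struct xtsproxy_ctx {
	struct crypto_skcipher *aesni_tfm;	/* __xts-aes-aesni */
	struct crypto_skcipher *generic_tfm;	/* xts(ecb(aes-generic)) */
};

static int xtsproxy_encrypt(struct skcipher_request *req)
{
	struct crypto_skcipher *proxy = crypto_skcipher_reqtfm(req);
	struct xtsproxy_ctx *ctx = crypto_skcipher_ctx(proxy);

	/* If the FPU (and thus AES-NI) is usable in the current execution context,
	 * route the request to the fast internal implementation; otherwise fall
	 * back to the generic C code. Either way the call stays synchronous.
	 * The proxy's request size is assumed to cover whichever underlying
	 * implementation needs more space (set up at init time). */
	if (irq_fpu_usable())
		skcipher_request_set_tfm(req, ctx->aesni_tfm);
	else
		skcipher_request_set_tfm(req, ctx->generic_tfm);

	return crypto_skcipher_encrypt(req);
}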

Using the whole lot

Let’s walk through the process of using it all together. The first step is to grab the patches and recompile the kernel (or just compile dm-crypt and our xtsproxy modules).

Next, let’s restart our IO workload in a separate terminal, so we can make sure we can reconfigure the kernel at runtime under load:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...

In the main terminal make sure our new Crypto API module is loaded and available:

$ sudo modprobe xtsproxy
$ grep -A 11 'xtsproxy' /proc/crypto
driver       : xts-aes-xtsproxy
module       : xtsproxy
priority     : 0
refcnt       : 0
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
ivsize       : 16
chunksize    : 16

Reconfigure the encrypted disk to use our newly loaded module and enable our patched dm-crypt flag (we have to use the low-level dmsetup tool, as cryptsetup obviously is not aware of our modifications):

$ sudo dmsetup table encrypted-ram0 --showkeys | sed 's/aes-xts-plain64/capi:xts-aes-xtsproxy-plain64/' | sed 's/$/ 1 force_inline/' | sudo dmsetup reload encrypted-ram0

We just "loaded" the new configuration, but for it to take effect, we need to suspend/resume the encrypted device:

$ sudo dmsetup suspend encrypted-ram0 && sudo dmsetup resume encrypted-ram0

And now observe the result. We may go back to the other terminal running the fio job and look at the output, but to make things nicer, here’s a snapshot of the observed read/write throughput in Grafana:

Speeding up Linux disk encryption
Speeding up Linux disk encryption

Wow, we have more than doubled the throughput! With a total throughput of ~640 MB/s we're now much closer to the expected ~696 MB/s from above. What about the IO latency? (The await statistic from the iostat reporting tool):

Speeding up Linux disk encryption

The latency has been cut in half as well!

To production

So far we have been using a synthetic setup with some parts of the full production stack missing, like file systems, real hardware and most importantly, production workload. To ensure we’re not optimising imaginary things, here is a snapshot of the production impact these changes bring to the caching part of our stack:

Speeding up Linux disk encryption

This graph represents a three-way comparison of the worst-case response times (99th percentile) for a cache hit in one of our servers. The green line is from a server with unencrypted disks, which we will use as a baseline. The red line is from a server with encrypted disks with the default Linux disk encryption implementation, and the blue line is from a server with encrypted disks and our optimisations enabled. As we can see, the default Linux disk encryption implementation has a significant impact on our cache latency in worst-case scenarios, whereas the patched implementation is indistinguishable from not using encryption at all. In other words, the improved encryption implementation does not have any impact at all on our cache response speed, so we basically get it for free! That's a win!

We’re just getting started

This post shows how an architecture review can double the performance of a system. Also we reconfirmed that modern cryptography is not expensive and there is usually no excuse not to protect your data.

We are going to submit this work for inclusion in the main kernel source tree, but most likely not in its current form. Although the results look encouraging we have to remember that Linux is a highly portable operating system: it runs on powerful servers as well as small resource constrained IoT devices and on many other CPU architectures as well. The current version of the patches just optimises disk encryption for a particular workload on a particular architecture, but Linux needs a solution which runs smoothly everywhere.

That said, if you think your case is similar and you want to take advantage of the performance improvements now, you may grab the patches and hopefully provide feedback. The runtime flag makes it easy to toggle the functionality on the fly, and a simple A/B test may be performed to see if it benefits any particular case or setup. These patches have been running across our wide network of more than 200 data centres on five generations of hardware, so they can be reasonably considered stable. Enjoy both performance and security from Cloudflare for all!

Speeding up Linux disk encryption