# How to execute an object file: Part 1

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/how-to-execute-an-object-file-part-1/

## Calling a simple function without linking

When we write software using a high-level compiled programming language, there are usually a number of steps involved in transforming our source code into the final executable binary:

First, our source files are compiled by a compiler translating the high-level programming language into machine code. The output of the compiler is a number of object files. If the project contains multiple source files, we usually get as many object files. The next step is the linker: since the code in different object files may reference each other, the linker is responsible for assembling all these object files into one big program and binding these references together. The output of the linker is usually our target executable, so only one file.

However, at this point, our executable might still be incomplete. These days, most executables on Linux are dynamically linked: the executable itself does not have all the code it needs to run a program. Instead it expects to "borrow" part of the code at runtime from shared libraries for some of its functionality:

This process is called runtime linking: when our executable is being started, the operating system will invoke the dynamic loader, which should find all the needed libraries, copy/map their code into our target process address space, and resolve all the dependencies our code has on them.

One interesting thing to note about this overall process is that we get the executable machine code directly from step 1 (compiling the source code), but if any of the later steps fail, we still can’t execute our program. So, in this series of blog posts we will investigate if it is possible to execute machine code directly from object files skipping all the later steps.

#### Why would we want to execute an object file?

There may be many reasons. Perhaps we’re writing an open-source replacement for a proprietary Linux driver or an application, and want to compare if the behaviour of some code is the same. Or we have a piece of a rare, obscure program and we can’t link to it, because it was compiled with a rare, obscure compiler. Maybe we have a source file, but cannot create a full featured executable, because of the missing build time or runtime dependencies. Malware analysis, code from a different operating system etc – all these scenarios may put us in a position, where either linking is not possible or the runtime environment is not suitable.

### A simple toy object file

For the purposes of this article, let’s create a simple toy object file, so we can use it in our experiments:

obj.c:

int add5(int num)
{
return num + 5;
}

{
return num + 10;
}


Our source file contains only 2 functions, add5 and add10, which adds 5 or 10 respectively to the only input parameter. It’s a small but fully functional piece of code, and we can easily compile it into an object file:

$gcc -c obj.c$ ls
obj.c  obj.o


Now we will try to import the add5 and add10 functions from the object file and execute them. When we talk about executing an object file, we mean using an object file as some sort of a library. As we learned above, when we have an executable that utilises external shared libraries, the dynamic loader loads these libraries into the process address space for us. With object files, however, we have to do this manually, because ultimately we can’t execute machine code that doesn’t reside in the operating system’s RAM. So, to execute object files we still need some kind of a wrapper program:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

{
/* load obj.o into memory */
}

static void parse_obj(void)
{
/* parse an object file and find add5 and add10 functions */
}

static void execute_funcs(void)
{
}

int main(void)
{
parse_obj();
execute_funcs();

return 0;
}


Above is a self-contained object loader program with some functions as placeholders. We will be implementing these functions (and adding more) in the course of this post.

First, as we established already, we need to load our object file into the process address space. We could just read the whole file into a buffer, but that would not be very efficient. Real-world object files might be big, but as we will see later, we don’t need all of the object’s file contents. So it is better to mmap the file instead: this way the operating system will lazily read the parts from the file we need at the time we need them. Let’s implement the load_obj function:

...
/* for open(2), fstat(2) */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* for close(2), fstat(2) */
#include <unistd.h>

/* for mmap(2) */
#include <sys/mman.h>

/* parsing ELF files */
#include <elf.h>

/* for errno */
#include <errno.h>

typedef union {
const Elf64_Ehdr *hdr;
const uint8_t *base;
} objhdr;

static objhdr obj;

{
struct stat sb;

int fd = open("obj.o", O_RDONLY);
if (fd <= 0) {
perror("Cannot open obj.o");
exit(errno);
}

/* we need obj.o size for mmap(2) */
if (fstat(fd, &sb)) {
perror("Failed to get obj.o info");
exit(errno);
}

/* mmap obj.o into memory */
obj.base = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (obj.base == MAP_FAILED) {
perror("Maping obj.o failed");
exit(errno);
}
close(fd);
}
...


If we don’t encounter any errors, after load_obj executes we should get the memory address, which points to the beginning of our obj.o in the obj global variable. It is worth noting we have created a special union type for the obj variable: we will be parsing obj.o later (and peeking ahead – object files are actually ELF files), so will be referring to the address both as Elf64_Ehdr (ELF header structure in C) and a byte pointer (parsing ELF files involves calculations of byte offsets from the beginning of the file).

### A peek inside an object file

To use some code from an object file, we need to find it first. As I’ve leaked above, object files are actually ELF files (the same format as Linux executables and shared libraries) and luckily they’re easy to parse on Linux with the help of the standard elf.h header, which includes many useful definitions related to the ELF file structure. But we actually need to know what we’re looking for, so a high-level understanding of an ELF file is needed.

#### ELF segments and sections

Segments (also known as program headers) and sections are probably the main parts of an ELF file and usually a starting point of any ELF tutorial. However, there is often some confusion between the two. Different sections contain different types of ELF data: executable code (which we are most interested in in this post), constant data, global variables etc. Segments, on the other hand, do not contain any data themselves – they just describe to the operating system how to properly load sections into RAM for the executable to work correctly. Some tutorials say "a segment may include 0 or more sections", which is not entirely accurate: segments do not contain sections, rather they just indicate to the OS where in memory a particular section should be loaded and what is the access pattern for this memory (read, write or execute):

Furthermore, object files do not contain any segments at all: an object file is not meant to be directly loaded by the OS. Instead, it is assumed it will be linked with some other code, so ELF segments are usually generated by the linker, not the compiler. We can check this by using the readelf command:

$readelf --segments obj.o There are no program headers in this file.  #### Object file sections The same readelf command can be used to get all the sections from our object file: $ readelf --sections obj.o
There are 11 section headers, starting at offset 0x268:

Size              EntSize          Flags  Link  Info  Align
[ 0]                   NULL             0000000000000000  00000000
0000000000000000  0000000000000000           0     0     0
[ 1] .text             PROGBITS         0000000000000000  00000040
000000000000001e  0000000000000000  AX       0     0     1
[ 2] .data             PROGBITS         0000000000000000  0000005e
0000000000000000  0000000000000000  WA       0     0     1
[ 3] .bss              NOBITS           0000000000000000  0000005e
0000000000000000  0000000000000000  WA       0     0     1
[ 4] .comment          PROGBITS         0000000000000000  0000005e
000000000000001d  0000000000000001  MS       0     0     1
[ 5] .note.GNU-stack   PROGBITS         0000000000000000  0000007b
0000000000000000  0000000000000000           0     0     1
[ 6] .eh_frame         PROGBITS         0000000000000000  00000080
0000000000000058  0000000000000000   A       0     0     8
[ 7] .rela.eh_frame    RELA             0000000000000000  000001e0
0000000000000030  0000000000000018   I       8     6     8
[ 8] .symtab           SYMTAB           0000000000000000  000000d8
00000000000000f0  0000000000000018           9     8     8
[ 9] .strtab           STRTAB           0000000000000000  000001c8
0000000000000012  0000000000000000           0     0     1
[10] .shstrtab         STRTAB           0000000000000000  00000210
0000000000000054  0000000000000000           0     0     1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
l (large), p (processor specific)


There are different tutorials online describing the most popular ELF sections in detail. Another great reference is the Linux manpages project. It is handy because it describes both sections’ purpose as well as C structure definitions from elf.h, which makes it a one-stop shop for parsing ELF files. However, for completeness, below is a short description of the most popular sections one may encounter in an ELF file:

• .text: this section contains the executable code (the actual machine code, which was created by the compiler from our source code). This section is the primary area of interest for this post as it should contain the add5 and add10 functions we want to use.
• .data and .bss: these sections contain global and static local variables. The difference is: .data has variables with an initial value (defined like int foo = 5;) and .bss just reserves space for variables with no initial value (defined like int bar;).
• .rodata: this section contains constant data (mostly strings or byte arrays). For example, if we use a string literal in the code (for example, for printf or some error message), it will be stored here. Note, that .rodata is missing from the output above as we didn’t use any string literals or constant byte arrays in obj.c.
• .symtab: this section contains information about the symbols in the object file: functions, global variables, constants etc. It may also contain information about external symbols the object file needs, like needed functions from the external libraries.
• .strtab and .shstrtab: contain packed strings for the ELF file. Note, that these are not the strings we may define in our source code (those go to the .rodata section). These are the strings describing the names of other ELF structures, like symbols from .symtab or even section names from the table above. ELF binary format aims to make its structures compact and of a fixed size, so all strings are stored in one place and the respective data structures just reference them as an offset in either .shstrtab or .strtab sections instead of storing the full string locally.

#### The .symtab section

At this point, we know that the code we want to import and execute is located in the obj.o‘s .text section. But we have two functions, add5 and add10, remember? At this level the .text section is just a byte blob – how do we know where each of these functions is located? This is where the .symtab (the "symbol table") comes in handy. It is so important that it has its own dedicated parameter in readelf:

readelf --symbols obj.o Symbol table '.symtab' contains 10 entries: Num: Value Size Type Bind Vis Ndx Name 0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND 1: 0000000000000000 0 FILE LOCAL DEFAULT ABS obj.c 2: 0000000000000000 0 SECTION LOCAL DEFAULT 1 3: 0000000000000000 0 SECTION LOCAL DEFAULT 2 4: 0000000000000000 0 SECTION LOCAL DEFAULT 3 5: 0000000000000000 0 SECTION LOCAL DEFAULT 5 6: 0000000000000000 0 SECTION LOCAL DEFAULT 6 7: 0000000000000000 0 SECTION LOCAL DEFAULT 4 8: 0000000000000000 15 FUNC GLOBAL DEFAULT 1 add5 9: 000000000000000f 15 FUNC GLOBAL DEFAULT 1 add10  Let’s ignore the other entries for now and just focus on the last two lines, because they conveniently have add5 and add10 as their symbol names. And indeed, this is the info about our functions. Apart from the names, the symbol table provides us with some additional metadata: • The Ndx column tells us the index of the section, where the symbol is located. We can cross-check it with the section table above and confirm that indeed these functions are located in .text (section with the index 1). • Type being set to FUNC confirms that these are indeed functions. • Size tells us the size of each function, but this information is not very useful in our context. The same goes for Bind and Vis. • Probably the most useful piece of information is Value. The name is misleading, because it is actually an offset from the start of the containing section in this context. That is, the add5 function starts just from the beginning of .text and add10 is located from 15th byte and onwards. So now we have all the pieces on how to parse an ELF file and find the functions we need. ### Finding and executing a function from an object file Given what we have learned so far, let’s define a plan on how to proceed to import and execute a function from an object file: 1. Find the ELF sections table and .shstrtab section (we need .shstrtab later to lookup sections in the section table by name). 2. Find the .symtab and .strtab sections (we need .strtab to lookup symbols by name in .symtab). 3. Find the .text section and copy it into RAM with executable permissions. 4. Find add5 and add10 function offsets from the .symtab. 5. Execute add5 and add10 functions. Let’s start by adding some more global variables and implementing the parse_obj function: loader.c: ... /* sections table */ static const Elf64_Shdr *sections; static const char *shstrtab = NULL; /* symbols table */ static const Elf64_Sym *symbols; /* number of entries in the symbols table */ static int num_symbols; static const char *strtab = NULL; ... static void parse_obj(void) { /* the sections table offset is encoded in the ELF header */ sections = (const Elf64_Shdr *)(obj.base + obj.hdr->e_shoff); /* the index of .shstrtab in the sections table is encoded in the ELF header * so we can find it without actually using a name lookup */ shstrtab = (const char *)(obj.base + sections[obj.hdr->e_shstrndx].sh_offset); ... } ...  Now that we have references to both the sections table and the .shstrtab section, we can lookup other sections by their name. Let’s create a helper function for that: loader.c: ... static const Elf64_Shdr *lookup_section(const char *name) { size_t name_len = strlen(name); /* number of entries in the sections table is encoded in the ELF header */ for (Elf64_Half i = 0; i < obj.hdr->e_shnum; i++) { /* sections table entry does not contain the string name of the section * instead, the sh_name parameter is an offset in the .shstrtab * section, which points to a string name */ const char *section_name = shstrtab + sections[i].sh_name; size_t section_name_len = strlen(section_name); if (name_len == section_name_len && !strcmp(name, section_name)) { /* we ignore sections with 0 size */ if (sections[i].sh_size) return sections + i; } } return NULL; } ...  Using our new helper function, we can now find the .symtab and .strtab sections: loader.c: ... static void parse_obj(void) { ... /* find the .symtab entry in the sections table */ const Elf64_Shdr *symtab_hdr = lookup_section(".symtab"); if (!symtab_hdr) { fputs("Failed to find .symtab\n", stderr); exit(ENOEXEC); } /* the symbols table */ symbols = (const Elf64_Sym *)(obj.base + symtab_hdr->sh_offset); /* number of entries in the symbols table = table size / entry size */ num_symbols = symtab_hdr->sh_size / symtab_hdr->sh_entsize; const Elf64_Shdr *strtab_hdr = lookup_section(".strtab"); if (!strtab_hdr) { fputs("Failed to find .strtab\n", stderr); exit(ENOEXEC); } strtab = (const char *)(obj.base + strtab_hdr->sh_offset); ... } ...  Next, let’s focus on the .text section. We noted earlier in our plan that it is not enough to just locate the .text section in the object file, like we did with other sections. We would need to copy it over to a different location in RAM with executable permissions. There are several reasons for that, but these are the main ones: • Many CPU architectures either don’t allow execution of the machine code, which is unaligned in memory (4 kilobytes for x86 systems), or they execute it with a performance penalty. However, the .text section in an ELF file is not guaranteed to be positioned at a page aligned offset, because the on-disk version of the ELF file aims to be compact rather than convenient. • We may need to modify some bytes in the .text section to perform relocations (we don’t need to do it in this case, but will be dealing with relocations in future posts). If, for example, we forget to use the MAP_PRIVATE flag, when mapping the ELF file, our modifications may propagate to the underlying file and corrupt it. • Finally, different sections, which are needed at runtime, like .text, .data, .bss and .rodata, require different memory permission bits: the .text section memory needs to be both readable and executable, but not writable (it is considered a bad security practice to have memory both writable and executable). The .data and .bss sections need to be readable and writable to support global variables, but not executable. The .rodata section should be readonly, because its purpose is to hold constant data. To support this, each section must be allocated on a page boundary as we can only set memory permission bits on whole pages and not custom ranges. Therefore, we need to create new, page aligned memory ranges for these sections and copy the data there. To create a page aligned copy of the .text section, first we actually need to know the page size. Many programs usually just hardcode the page size to 4096 (4 kilobytes), but we shouldn’t rely on that. While it’s accurate for most x86 systems, other CPU architectures, like arm64, might have a different page size. So hard coding a page size may make our program non-portable. Let’s find the page size and store it in another global variable: loader.c: ... static uint64_t page_size; static inline uint64_t page_align(uint64_t n) { return (n + (page_size - 1)) & ~(page_size - 1); } ... static void parse_obj(void) { ... /* get system page size */ page_size = sysconf(_SC_PAGESIZE); ... } ...  Notice, we have also added a convenience function page_align, which will round up the passed in number to the next page aligned boundary. Next, back to the .text section. As a reminder, we need to: 1. Find the .text section metadata in the sections table. 2. Allocate a chunk of memory to hold the .text section copy. 3. Actually copy the .text section to the newly allocated memory. 4. Make the .text section executable, so we can later call functions from it. Here is the implementation of the above steps: loader.c: ... /* runtime base address of the imported code */ static uint8_t *text_runtime_base; ... static void parse_obj(void) { ... /* find the .text entry in the sections table */ const Elf64_Shdr *text_hdr = lookup_section(".text"); if (!text_hdr) { fputs("Failed to find .text\n", stderr); exit(ENOEXEC); } /* allocate memory for .text copy rounding it up to whole pages */ text_runtime_base = mmap(NULL, page_align(text_hdr->sh_size), PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (text_runtime_base == MAP_FAILED) { perror("Failed to allocate memory for .text"); exit(errno); } /* copy the contents of .text section from the ELF file */ memcpy(text_runtime_base, obj.base + text_hdr->sh_offset, text_hdr->sh_size); /* make the .text copy readonly and executable */ if (mprotect(text_runtime_base, page_align(text_hdr->sh_size), PROT_READ | PROT_EXEC)) { perror("Failed to make .text executable"); exit(errno); } } ...  Now we have all the pieces we need to locate the address of a function. Let’s write a helper for it: loader.c: ... static void *lookup_function(const char *name) { size_t name_len = strlen(name); /* loop through all the symbols in the symbol table */ for (int i = 0; i < num_symbols; i++) { /* consider only function symbols */ if (ELF64_ST_TYPE(symbols[i].st_info) == STT_FUNC) { /* symbol table entry does not contain the string name of the symbol * instead, the st_name parameter is an offset in the .strtab * section, which points to a string name */ const char *function_name = strtab + symbols[i].st_name; size_t function_name_len = strlen(function_name); if (name_len == function_name_len && !strcmp(name, function_name)) { /* st_value is an offset in bytes of the function from the * beginning of the .text section */ return text_runtime_base + symbols[i].st_value; } } } return NULL; } ...  And finally we can implement the execute_funcs function to import and execute code from an object file: loader.c: ... static void execute_funcs(void) { /* pointers to imported add5 and add10 functions */ int (*add5)(int); int (*add10)(int); add5 = lookup_function("add5"); if (!add5) { fputs("Failed to find add5 function\n", stderr); exit(ENOENT); } puts("Executing add5..."); printf("add5(%d) = %d\n", 42, add5(42)); add10 = lookup_function("add10"); if (!add10) { fputs("Failed to find add10 function\n", stderr); exit(ENOENT); } puts("Executing add10..."); printf("add10(%d) = %d\n", 42, add10(42)); } ...  Let’s compile our loader and make sure it works as expected:  gcc -o loader loader.c
$./loader Executing add5... add5(42) = 47 Executing add10... add10(42) = 52  Voila! We have successfully imported code from obj.o and executed it. Of course, the example above is simplified: the code in the object file is self-contained, does not reference any global variables or constants, and does not have any external dependencies. In future posts we will look into more complex code and how to handle such cases. #### Security considerations Processing external inputs, like parsing an ELF file from the disk above, should be handled with care. The code from loader.c omits a lot of bounds checking and additional ELF integrity checks, when parsing the object file. The code is simplified for the purposes of this post, but most likely not production ready, as it can probably be exploited by specifically crafted malicious inputs. Use it only for educational purposes! The complete source code from this post can be found here. # A quirk in the SUNBURST DGA algorithm Post Syndicated from Nick Blazier original https://blog.cloudflare.com/a-quirk-in-the-sunburst-dga-algorithm/ On Wednesday, December 16, the RedDrip Team from QiAnXin Technology released their discoveries (tweet, github) regarding the random subdomains associated with the SUNBURST malware which was present in the SolarWinds Orion compromise. In studying queries performed by the malware, Cloudflare has uncovered additional details about how the Domain Generation Algorithm (DGA) encodes data and exfiltrates the compromised hostname to the command and control servers. ### Background The RedDrip team discovered that the DNS queries are created by combining the previously reverse-engineered unique guid (based on hashing of hostname and MAC address) with a payload that is a custom base 32 encoding of the hostname. The article they published includes screenshots of decompiled or reimplemented C# functions that are included in the compromised DLL. This background primer summarizes their work so far (which is published in Chinese). RedDrip discovered that the DGA subdomain portion of the query is split into three parts: <encoded_guid> + <byte> + <encoded_hostname> An example malicious domain is: 7cbtailjomqle1pjvr2d32i2voe60ce2.appsync-api.us-east-1.avsvmcloud.com Where the domain is split into the three parts as Encoded guid (15 chars) byte Encoded hostname 7cbtailjomqle1p j vr2d32i2voe60ce2 The work from the RedDrip Team focused on the encoded hostname portion of the string, we have made additional insights related to the encoded hostname and encoded guid portions. At a high level the encoded hostnames take one of two encoding schemes. If all of the characters in the hostname are contained in the set of domain name-safe characters "0123456789abcdefghijklmnopqrstuvwxyz-_." then the OrionImprovementBusinessLayer.CryptoHelper.Base64Decode algorithm, explained in the article, is used. If there are characters outside of that set in the hostname, then the OrionImprovementBusinessLayer.CryptoHelper.Base64Encode is used instead and ‘00’ is prepended to the encoding. This allows us to simply check if the first two characters of the encoded hostname are ‘00’ and know how the hostname is encoded. These function names within the compromised DLL are meant to resemble the names of legitimate functions, but in fact perform the message encoding for the malware. The DLL function Base64Decode is meant to resemble the legitimate function name base64decode, but its purpose is actually to perform the encoding of the query (which is a variant of base32 encoding). The RedDrip Team has posted Python code for encoding and decoding the queries, including identifying random characters inserted into the queries at regular character intervals. One potential issue we encountered with their implementation is the inclusion of a check clause looking for a ‘0’ character in the encoded hostname (line 138 of the decoding script). This line causes the decoding algorithm to ignore any encoded hostnames that do not contain a ‘0’. We believe this was included because ‘0’ is the encoded value of a ‘0’, ‘.’, ‘-’ or ‘_’. Since fully qualified hostnames are comprised of multiple parts separated by ‘.’s, e.g. ‘example.com’, it makes sense to be expecting a ‘.’ in the unencoded hostname and therefore only consider encoded hostnames containing a ‘0’. However, this causes the decoder to ignore many of the recorded DGA domains. As we explain below, we believe that long domains are split across multiple queries where the second half is much shorter and unlikely to include a ‘.’. For example ‘www2.example.c’ takes up one message, meaning that in order to transmit the entire domain ‘www2.example.c’ a second message containing just ‘om’ would also need to be sent. This second message does not contain a ‘.’ so its encoded form does not contain a ‘0’ and it is ignored in the RedDrip team’s implementation. ### The quirk: hostnames are split across multiple queries A list of observed queries performed by the malware was published publicly on GitHub. Applying the decoding script to this set of queries, we see some queries appear to be truncated, such as grupobazar.loca, but also some decoded hostnames are curiously short or incomplete, such as “com”, “.com”, or a single letter, such as “m”, or “l”. When the hostname does not fit into the available payload section of the encoded query, it is split up across multiple queries. Queries are matched up by matching the GUID section after applying a byte-by-byte exclusive-or (xor). ### Analysis of first 15 characters Noticing that long domains are split across multiple requests led us to believe that the first 16 characters encoded information to associate multipart messages. This would allow the receiver on the other end to correctly re-assemble the messages and get the entire domain. The RedDrip team identified the first 15 bytes as a GUID, we focused on those bytes and will refer to them subsequently as the header. We found the following queries that we believed to be matches without knowing yet the correct pairings between message 1 and message 2 (payload has been altered): Part 1 – Both decode to “www2.example.c” r1q6arhpujcf6jb6qqqb0trmuhd1r0ee.appsync-api.us-west-2.avsvmcloud.com r8stkst71ebqgj66qqqb0trmuhd1r0ee.appsync-api.us-west-2.avsvmcloud.com Part 2 – Both decode to “om” 0oni12r13ficnkqb2h.appsync-api.us-west-2.avsvmcloud.com ulfmcf44qd58t9e82h.appsync-api.us-west-2.avsvmcloud.com This gives us a final combined payload of www2.example.com This example gave us two sets of messages where we were confident the second part was associated with the first part, and allowed us to find the following relationship where message1 is the header of the first message and message2 is the header of the second: Base32Decode(message1) XOR KEY = Base32Decode(message2) The KEY is a single character. That character is xor’d with each byte of the Base32Decoded first header to produce the Base32Decoded second header. We do not currently know how to infer what character is used as the key, but we can still match messages together without that information. Since A XOR B = C where we know A and C but not B, we can instead use A XOR C = B. This means that in order to pair messages together we simply need to look for messages where XOR’ing them together results in a repeating character (the key). Base32Decode(message1) XOR Base32Decode(message2) = KEY Looking at the examples above this becomes Message 1 Message 2 Header r1q6arhpujcf6jb 0oni12r13ficnkq Base32Decode (binary) 101101000100110110111111011 010010000000011001010111111 01111000101001110100000101 110110010010000011010010000 001000110110110100111100100 00100011111111000000000100 We’ve truncated the results slightly, but below shows the two binary representations and the third line shows the result of the XOR. 101101000100110110111111011010010000000011001010111111011110001010011101 110110010010000011010010000001000110110110100111100100001000111111110000 011011010110110101101101011011010110110101101101011011010110110101101101 We can see the XOR result is the repeating sequence ‘01101101’meaning the original key was 0x6D or ‘m’. We provide the following python code as an implementation for matching paired messages (Note: the decoding functions are those provided by the RedDrip team): # string1 is the first 15 characters of the first message # string2 is the first 15 characters of the second message def is_match(string1, string2): encoded1 = Base32Decode(string1) encoded2 = Base32Decode(string2) xor_result = [chr(ord(a) ^ ord(b)) for a,b in zip(encoded1, encoded2)] match_char = xor_result[0] for character in xor_result[0:9]: if character != match_char: return False, None return True, "0x{:02X}".format(ord(match_char))  The following are additional headers which based on the payload content Cloudflare is confident are pairs (the payload has been redacted because it contains hostname information that is not yet publicly available): Example 1: vrffaikp47gnsd4a aob0ceh5l8cr6mco xorkey: 0x4E Example 2: vrffaikp47gnsd4a aob0ceh5l8cr6mco xorkey: 0x54 Example 3: vvu7884g0o86pr4a 6gpt7s654cfn4h6h xorkey: 0x2B We hypothesize that the xorkey can be derived from the header bytes and/or padding byte of the two messages, though we have not yet determined the relationship. # ASICs at the Edge Post Syndicated from Tom Strickx original https://blog.cloudflare.com/asics-at-the-edge/ At Cloudflare we pride ourselves in our global network that spans more than 200 cities in over 100 countries. To handle all the traffic passing through our network, there are multiple technologies at play. So let’s have a look at one of the cornerstones that makes all of this work… ASICs. No, not the running shoes. ## What’s an ASIC? ASIC stands for Application Specific Integrated Circuit. The name already says it, it’s a chip with a very narrow use case, geared towards a single application. This is in stark contrast to a CPU (Central Processing Unit), or even a GPU (Graphics Processing Unit). A CPU is designed and built for general purpose computation, and does a lot of things reasonably well. A GPU is more geared towards graphics (it’s in the name), but in the last 15 years, there’s been a drastic shift towards GPGPU (General Purpose GPU), in which technologies such as CUDA or OpenCL allow you to use the highly parallel nature of the GPU to do general purpose computing. A good example of GPU use is video encoding, or more recently, computer vision, used in applications such as self-driving cars. Unlike CPUs or GPUs, ASICs are built with a single function in mind. Great examples are the Google Tensor Processing Units (TPU), used to accelerate machine learning functions[1], or for orbital maneuvering[2], in which specific orbital maneuvers are encoded, like the Hohmann Transfer, used to move rockets (and their payloads) to a new orbit at a different altitude. And they are also heavily used in the networking industry. Technically, the use case in the network industry should be called an ASSP (Application Specific Standard Product), but network engineers are simple people, so we prefer to call it an ASIC. ## Why an ASIC ASICs have the major benefit of being hyper-efficient. The more complex hardware is, the more it will need cooling and power. As ASICs only contain the hardware components needed for their function, their overall size can be reduced, and so are their power requirements. This has a positive impact on the overall physical size of the network (devices don’t need to be as bulky to provide sufficient cooling), and helps reduce the power consumption of a data center. Reducing hardware complexity also reduces the failure rate of the manufacturing process, and allows for easier production. The downside is that you need to embed a lot of your features in hardware, and once a new technology or specification comes around, any chips made without that technology baked in, won’t be able to support it (VXLAN for example). For network equipment, this works perfectly. Overall, the networking industry is slow-moving, and considerable time is taken before new technologies make it to the market (as can be seen with IPv6, MPLS implementations, xDSL availability, …). This means the chips don’t need to evolve on a yearly basis, and can instead be created on a much slower cycle, with bigger leaps in technology. For example, it took Broadcom two years to go from Tomahawk 3 to Tomahawk 4, but in that process they doubled the throughput. The benefits listed earlier are super helpful for network equipment, as they allow for considerable throughput in a small form factor. ## Building an ASIC As with chips of any kind, building an ASIC is a long-term process. Just like with CPUs, if there’s a defect in the hardware design, you have to start from scratch, and scrap the entire build line. As such, the development lifecycle is incredibly long. It starts with prototyping in an FPGA (Field Programmable Gate Array), in which chip designers can program their required functionality and confirm compatibility. All of this is done in a HDL (Hardware Description Language), such as Verilog. Once the prototyping stage is over, they move to baking the new packet processing pipeline into the chip at a foundry. After that, no more changes can be made to the chip, as it’s literally baked into the hardware (unlike an FPGA, which can be reprogrammed). Further difficulty is added by the fact that there are a very small number of hardware companies that will buy ASICs in bulk to build equipment with; as such the unit cost can increase drastically. All of this means that the iteration cycle of an ASIC tends to be on the slower side of things (compared to the yearly refreshes in the Intel Process-Architecture-Optimization model for example), and will usually be smaller incremental updates: For example, increases in port-speeds are incremental (1G → 10G → 25G → 40G → 100G → 400G → 800G → …), and are tied into upgrades to the SerDes (Serialiser/Deserialiser) part of the chip. New protocol support is a lot harder, and might require multiple development cycles before it shows up in a chip. ## What ASICs do The ASICs in our network equipment are responsible for the switching and routing of packets, as well as being the first layer of defense (in the form of a stateless firewall). Due to the sheer nature of how fast packets get switched, fast memory access is a primary concern. Most ASICs will use a special sort of memory, called TCAM (Ternary Content-Addressable Memory). This memory will be used to store all sorts of lookup tables. These may be forwarding tables (where does this packet go), ACL (Access Control List) tables (is this packet allowed), or CoS (Class of Service) tables (which priority should be given to this packet) CAM, and its more advanced sibling, TCAM, are fascinating kinds of memory, as they operate fundamentally different than traditional Random Access Memory (RAM). While you have to use a memory address to access data in RAM, with CAM and TCAM you can directly refer to the content you are looking for. It is a physical implementation of a key-value store. In CAM you use the exact binary representation of a word, in a network application, that word is likely going to be an IP address, so 11001011.00000000.01110001.00000000 for example (203.0.113.0). While this is definitely useful, networks operate a big collection of IP addresses, and storing each individually would require significant memory. To remedy this memory requirement, TCAM can store three states, instead of the binary two. This third state, sometimes called ‘ignore’ state, allows for the storage of multiple sequential data words as a single entry. In networking, these sequential data words are IP prefixes. So for the previous example, if we wanted to store the collection of that IP address, and the 254 IPs following it, in TCAM it would as follows: 11001011.00000000.01110001.XXXXXXXX (203.0.113.0/24). This storage method means we can ask questions of the ASIC such as “where should I send packets with the destination IP address of 203.0.113.19”, to which the ASIC can have a reply ready in a single clock cycle, as it does not need to run through all memory, but instead can directly reference the key. This reply will usually be a reference to a memory address in traditional RAM, where more data can be stored, such as output port, or firewall requirements for the packet. To dig a bit deeper into what ASICs do in network equipment, let’s briefly go over some fundamentals. Networking can be split into two primary components: routing and switching. Switching allows you to directly interconnect multiple devices, so they can talk with each other across the network. It’s what allows your phone to connect to your TV to play a new family video. Routing is the next level up. It’s the mechanism that interconnects all these switched networks into a network of networks, and eventually, the Internet. So routers are the devices responsible for steering traffic through this complex maze of networks, so it gets to its destination safely, and hopefully, as fast as possible. On the Internet, routers will usually use a routing protocol called BGP (Border Gateway Protocol) to exchange reachability information for a prefix (a collection of IP addresses), also called NLRI (Network Layer Reachability Information). As with navigating the roads, there are multiple ways to get from point A to point B on the Internet. To make sure the router makes the right decision, it will store all of the reachability information in the RIB (Routing Information Base). That way, if anything changes with one route, the router still has other options immediately available. With this information, a BGP daemon can calculate the ideal path to take for any given destination from its own point-of-view. This Cisco documentation explains the decision process the daemon goes through to calculate that ideal path. Once we have this ideal path for a given destination, we should store this information, as it would be very inefficient to calculate this every single time we need to go there. The storage database is called the FIB (Forwarding Information Base). The FIB will be a subset of the RIB, as it will only ever contain the best path for a destination at any given time, while the RIB keeps all the available paths, even the non-ideal ones. With these individual components, routers can make packets go from point A to point B in a blink of an eye. Here’ are some of the more specific functions our ASICs need to perform: 1. FIB install: Once the router has calculated its FIB, it’s important the router can access this as quickly as possible. To do so, the ASIC will install (write) this calculated FIB into the TCAM, so any lookups can happen as quickly as possible. 2. Packet forwarding lookups: as we need to know where to send a received packet, we look up this information in TCAM, which is, as we mentioned, incredibly fast. 3. Stateless Firewall: while a router routes packets between destinations, you also want to ensure that certain packets don’t reach a destination at all. This can be done using either a stateless or stateful firewall. “State” in this case refers to TCP state, so the router would need to understand if a connection is new, or already established. As maintaining state is a complex issue, which requires storing tables, and can quickly consume a lot of memory, most routers will only operate a stateless firewall. Instead, stateful firewalls often have their own appliances. At Cloudflare, we’ve opted to move maintaining state to our compute nodes, as that severely reduces the state-table (one router for all state vs X metals for all state combined). A stateless firewall makes use of the TCAM again to store rules on what to do with specific types of packets. For example, one of the rules we employ at our edge is DENY-BOGON-RANGES , in which we discard traffic sourced from RFC1918 space (and other unroutable space). As this makes use of TCAM, it can all be done at line rate (the maximum speed of the interface). 4. Advanced features, such as GRE encapsulation: modern networking isn’t just packet switching and packet routing anymore, and more advanced features are needed. One of these is encapsulation. With packet encapsulation, a system will put a data packet into another data packet. Using this technique, it’s possible to build a network on top of an existing network (an overlay). Overlays can be used to build a virtual backbone for example, in which multiple locations can be virtually connected through the Internet. While you can encapsulate packets on a CPU (we do this for Magic Transit), there are considerable challenges in doing so in software. As such, the ASIC can have built-in functionality to encapsulate a packet in a multitude of protocols, such as GRE. You may not want encapsulated packets to have to take a second trip through your entire pipeline, as this adds latency, so these shortcuts can also be built into the chip. 5. MPLS, EVPN, VXLAN, SDWAN, SDN, …: I ran out of buzzwords to enumerate here, but while MPLS isn’t new (the first RFC was created in 2001), it’s a rather advanced requirement, just as the others listed, which means not all ASIC vendors will implement this for all their chips due to the increased complexity. ## Vendor Landscape At Cloudflare, we interact with both hardware and software vendors on a daily basis while operating our global network. As we’re talking about ASICs today, we’ll explore the hardware landscape, but some hardware vendors also have their own NOS (Network Operating System). There’s a vast selection of hardware out there, all with different features and pricing. It can become incredibly hard to see the wood for the trees, so we’ll focus on 4 important distinguishing factors: Throughput (how many bits can the ASIC push through), buffer size (how many bits can the ASIC store in memory in case of resource contention), programmability (how easy is it for a third party programmer like Cloudflare to interact directly with the ASIC), feature set (how many advanced things outside of routing/switching can the ASIC do). The landscape is so varied because different companies have different requirements. A company like Cloudflare has different expectations for its network hardware than your typical corner shop. Even within our own network we’ll have different requirements for the different layers that make up our network. ### Broadcom The elephant in the networking room (or is it the jumbo frame in the switch?) is Broadcom. Broadcom is a semiconductor company, with their primary revenue in the wired infrastructure segment (over 50% of revenue[3]). While they’ve been around since 1991, they’ve become an unstoppable force in the last 10 years, in part due to their reliance on Apple (25% of revenue). As a semiconductor manufacturer, their market dominance is primarily achieved by acquiring other companies. A great example is the acquisition of Dune Networks, which has become an excellent revenue generator as the StrataDNX series of ASIC (Arad, QumranMX, Jericho). As such, they have become the biggest ASIC vendor by far, and own 59% of the entire Ethernet Integrated Circuits market[4]. As such, they supply a lot of merchant silicon to Cisco, Juniper, Arista and others. Up until recently, if you wanted to use the Broadcom SDK to accelerate your packet forwarding, you have to sign so many NDAs you might get a hand cramp, which makes programming them a lot trickier. This changed recently when Broadcom open-sourced their SDK. Let’s have a quick look at some of their products. #### Tomahawk The Tomahawk line of ASICs are the bread-and-butter for the enterprise market. They’re cheap and incredibly fast. The first generation of Tomahawk chips did 3.2Tbps linerate, with low-latency switching. The latest generation of this chip (Tomahawk 4) does 25.6Tbps in a 7nm transistor footprint[5]). As you can’t have a cheap, fast, and full feature set for a single package, this means you lose out on features. In this case, you’re missing most of the more advanced networking technologies such as VXLAN, and you have no buffer to speak of. As an example of a different vendor using this silicon, you can have a look at the Juniper QFX5200 switching platform. #### StrataDNX (Arad, QumranMX, Jericho) These chipsets came through the acquisition of Dune Networks, and are a collection of high-bandwidth, deep buffer (large amount of memory available to store (buffer) packets) chips, allowing them to be deployed in versatile environments, including the Cloudflare edge. The Arista DCS-7280SR that we run in some of our edge locations as edge routers run on the Jericho chipset. Since then, the chips have evolved, and with Jericho2, Broadcom now have a 10Tbps deep buffer chip[6]. With their fabric chip (this links multiple ASICs together), you can build switches with 48x400G ports[7] without much effort. Cisco built their NCS5500 line of routers using the QumranMX[8]. #### Trident This ASIC is an upgrade from the Tomahawk chipset, with a complex and extensive feature set, while maintaining high throughput rates. The latest Trident4 does 12.8Tbps at incredibly low latencies[9], making it an incredibly flexible platform. It unfortunately has no buffer space to speak of, which limits its scope for Cloudflare, as we need the buffer space to be able to switch between the different port speeds we have on our edge routers. The Arista 7050X and 7300X are built on top of this. ### Intel Intel is well known in the network industry for building stable and high-performance 10G NICs (Network Interface Controller). They’re not known for ASICs. They made an initial attempt with their acquisition of Fulcrum[10], which built the FM6000[11] series of ASIC, but nothing of note was really built with them. Intel decided to try again in 2019 with their acquisition of Barefoot. This small manufacturer is responsible for the Barefoot Tofino ASIC, which may well be a fundamental paradigm shift in the network industry. #### Barefoot Tofino The Tofino[12] is built using a PISA (Protocol Independent Switch Architecture), and using P4 (Programming Protocol-Independent Packet Processors)[13], you can program the data-plane (packet forwarding) as you see fit. It’s a drastic move away from the traditional method of networking, in which direct programming of the ASIC isn’t easily possible, and definitely not through a standard programming language. As an added benefit, P4 also allows you to perform a formal verification of your forwarding program, and be sure that it will do what you expect it to. Caveat: OpenFlow tried this, but unfortunately never really got much traction. [14] There are multiple variations of the Tofino 1 available, but the top-end ASIC has a 6.5Tbps linerate capacity. As the ASIC is programmable, its featureset is as rich as you’d want it to be. Unfortunately, the chip does not come with a lot of buffer memory, so we can’t deploy these as edge devices (yet). Both Arista (7170 Series[15]) and Cisco (Nexus 34180YC and 3464C series[16]) have built equipment with the Tofino chip inside. ### Mellanox As some of you may know, Mellanox is the vendor that recently got acquired by Nvidia, which also provides our 25G NICs in our compute nodes. Besides NICs, Mellanox has a well-established line of ASICs, mostly for switching. #### Spectrum The latest iteration of this ASIC, Spectrum 3 offers 12.8Tbps switching capacity, with an extensive featureset, including Deep Packet Inspection and NAT. This chip allows for building dense high-speed port devices, going up to 25.6Tbps[17]. Buffering wise, there’s none to really speak of (64MB). Mellanox also builds their own hardware platforms. Unlike the other vendors below, they aren’t shipped with the Mellanox Operating System, instead, they offer you a variety of choices to run on top, including Cumulus Linux (which was also acquired by Nvidia 🤔). As mentioned, while we use their NIC technology extensively, we currently don’t have any Mellanox ASIC silicon in our network. ### Juniper Juniper is a network hardware supplier, and currently the biggest supplier of network equipment for Cloudflare. As previously mentioned in the Broadcom section, Juniper buys some of their silicon from Broadcom, but they also have a significant lineup of home-grown silicon, which can be split into 2 families: Trio and Express. #### Express The Express family is the switching-skewed family, where bandwidth is a priority, while still maintaining a broad range of feature capabilities. These chips live in the same application landscape as the Broadcom StrataDNX chips. ##### Paradise (Q5) The Q5 is the new generation of the Juniper switching ASIC[18]. While by itself it doesn’t boast high linerates (500Gbps), when combined into a chassis with a fabric chip (Clos network in this case), they can produce switches (or line cards) with up to 12Tbps of throughput capacity[19]. In addition to allowing for high-throughput, dense network appliances, the chip also comes with a staggering amount of buffer space (4GB per ASIC), provided by external HMC (Hybrid Memory Cube). In this HMC, they’ve also decided to put the FIB, MAC and other tables (so no TCAM). The Q5 chip is used in their QFX1000 lineup of switches, which include the QFX10002-36Q, QFX10002-60C, QFX10002-72Q and QFX10008, all of which are deployed in our datacenters, as either edge routers or core aggregation switches. ##### ExpressPlus (ZX) The ExpressPlus is the more feature-rich and faster evolution of the Paradise chip. It offers double the bandwidth per chip (1Tbps) and is built into a combined Clos-fabric reaching 6Tbps in a 2U form-factor (PTX10002). It also has an increased logical scale, which comes with bigger buffers, larger FIB storage, and more ACL space. The ExpressPlus drives some of the PTX line of IP routers, together with its newest sibling, Triton. ##### Triton (BT) Triton is the latest generation of ASIC in the Express family, with 3.6Tbps of capacity per chip, making way for some truly bandwidth-dense hardware. Both Triton and ExpressPlus are 400GE capable. #### Trio The Trio family of chips are primarily used in the feature-heavy MX routing platform, and is currently at its 5th generation. A Juniper MPC4E-3D-32XGE line card ##### Trio Eagle (Trio 4.0) (EA) The Trio Eagle is the previous generation of the Trio Penta, and can be found on the MPC7E line cards for example. It’s a feature-rich ASIC, with a 400Gbps forwarding capacity, and significant buffer and TCAM capacity (as is to be expected from a routing platform ASIC) ##### Trio Penta (Trio 5.0) (ZT) Penta is the new generation routing chip, which is built for the MX platform routers. On top of being a very beefy chip, capable of 500Gbps per ASIC, allowing Juniper to build line cards of up to 4Tbps of capacity, the chip also has a lot of baked in features, offering advanced hardware offloading for for example MACSec, or Layer 3 IPsec. The Penta chip is packaged on the MPC10E and MPC11E line card, which can be installed in multiple variations of the MX chassis routers (MX480 included). ### Cisco Last but not least, there’s Cisco. As the saying goes “nobody ever got fired for buying Cisco”, they’re the biggest vendor of network solutions around. Just like Juniper, they have a mixed product fleet of merchant silicon, as well as home-grown. While we used to operate Cisco routers as edge routers (Cisco ASR 9000), this is no longer the case. We do still use them heavily for our ToR (Top-of-Rack) switching needs, utilizing both their Nexus 5000 series and Nexus 9000 series switches. #### Bigsur Bigsur is custom silicon developed for the Nexus 6000 line of switches (confusingly, the switches themselves are called Cisco Nexus 5672UP and Cisco Nexus 6001). In our specific model, the Cisco Nexus 5672UP, there’s 7 of them interconnected, providing 10G and 40G connectivity. Unfortunately Cisco is a lot more tight-lipped about their ASIC capabilities, so I can’t go as deep as I did with the Juniper chips. Feature-wise, there’s not a lot we require from them in our edge network. They’re simple Layer 2 forwarding switches, with no added requirements. Buffer wise, they use a system called Virtual Output Queueing, just like the Juniper Express chip. Unlike the Juniper silicon, the Bigsur ASIC doesn’t come with a lot of TCAM or buffer space. #### Tahoe The Tahoe is the Cisco ASIC found in the Cisco 9300-EX switches, also known as the LSE (Leaf Spine Engine). It offers higher-density port configurations compared to the Bigsur (1.6Tbps)[20]. Overall, this ASIC is a maturation of the Bigsur silicon, offering more advanced features such as advanced VXLAN+EVPN fabrics, greater port flexibility (10G, 25G, 40G and 100G), and increased buffer sizes (40MB). We use this ASIC extensively in both our edge data centers as well as in our core data centers. ## Conclusion A lot of different factors come into play when making the decision to purchase the next generation of Cloudflare network equipment. This post only scratches the surface of technical considerations to be made, and doesn’t come near any other factors, such as ecosystem contributions, openness, interoperability, or pricing. None of this would’ve been possible without the contributions from other network engineers—this post was written on the shoulders of giants. In particular, thanks to the excellent work by Jim Warner at UCSC, the engrossing book on the new MX platforms, written by David Roy (Day One: Inside the MX 5G), as well as the best book on the Juniper QFX lineup: Juniper QFX10000 Series by Douglas Richard Hanks Jr, and to finish it off, the Summary of Network ASICs post by Justin Pietsch. # Diving into /proc/[pid]/mem Post Syndicated from Lennart Espe original https://blog.cloudflare.com/diving-into-proc-pid-mem/ A few months ago, after reading about Cloudflare doubling its intern class size, I quickly dusted off my CV and applied for an internship. Long story short: now, a couple of months later, I found myself staring into Linux kernel code and adding a pretty cool feature to gVisor, a Linux container runtime. My internship was under the Emerging Technologies and Incubation group on a project involving gVisor. A co-worker contacted my team about not being able to read the debug symbols of stack traces inside the sandbox. For example, when the isolated process crashed, this is what we saw in the logs: *** Check failure stack trace: *** @ 0x7ff5f69e50bd (unknown) @ 0x7ff5f69e9c9c (unknown) @ 0x7ff5f69e4dbd (unknown) @ 0x7ff5f69e55a9 (unknown) @ 0x5564b27912da (unknown) @ 0x7ff5f650ecca (unknown) @ 0x5564b27910fa (unknown)  Obviously, this wasn’t very useful. I eagerly volunteered to fix this stack unwinding code – how hard could it be? After some debugging, we found that the logging library used in the project opened /proc/self/mem to look for ELF headers at the start of each memory-mapped region. This was necessary to calculate an offset to find the correct addresses for debug symbols. It turns out this mechanism is rather common. The stack unwinding code is often run in weird contexts – like a SIGSEGV handler – so it would not be appropriate to dig over real memory addresses back and forth to read the ELF. This could trigger another SIGSEGV. And SIGSEGV inside a SIGSEGV handler means either termination via the default handler for a segfault or recursing into the same handler again and again (if one sets SA_NODEFER) leading to a stack overflow. However, inside gVisor, each call of open() on /proc/self/mem resulted in ENOENT, because the entire /proc/self/mem file was missing. In order to provide a robust sandbox, gVisor has to carefully reimplement the Linux kernel interfaces. This particular /proc file was simply unimplemented in the virtual file system of Sentry, one of gVisor’s sandboxing components. Marek asked the devs on the project chat and got confirmation – they would be happy to accept a patch implementing this file. The easy way out would have been to make a small, local patch to the unwinder behavior, yet I found myself diving into the Linux kernel trying to figure how the mem file worked in an attempt to implement it in Sentry’s VFS. ## What does /proc/[pid]/mem do? The file itself is quite powerful, because it allows raw access to the virtual address space of a process. According to manpages, the documented file operations are open(), read() and lseek(). Typical use cases are debugging tasks or dumping process memory. ## Opening the file When a process wants to open the file, the kernel does the file permissions check, looks up the associated operations for mem and invokes a method called proc_mem_open. It retrieves the associated task and calls a method named mm_access. /* * Grab a reference to a task's mm, if it is not already going away * and ptrace_may_access with the mode parameter passed to it * succeeds. */  Seems relatively straightforward, right? The special thing about mm_access is that it verifies the permissions the current task has regarding the task to which the memory belongs. If the current task and target task do not share the same memory manager, the kernel invokes a method named __ptrace_may_access. /* * May we inspect the given task? * This check is used both for attaching with ptrace * and for allowing access to sensitive information in /proc. * * ptrace_attach denies several cases that /proc allows * because setting up the necessary parent/child relationship * or halting the specified task is impossible. * */  According to the manpages, a process which would like to read from an unrelated /proc/[pid]/mem file should have access mode PTRACE_MODE_ATTACH_FSCREDS. This check does not verify that a process is attached via PTRACE_ATTACH, but rather if it has the permission to attach with the specified credentials mode. ## Access checks After skimming through the function, you will see that a process is allowed access if the current task belongs to the same thread group as the target task, or denied access (depending on whether PTRACE_MODE_FSCREDS or PTRACE_MODE_REALCREDS is set, we will use either the file-system UID / GID, which is typically the same as the effective UID/GID, or the real UID / GID) if none of the following conditions are met: • the current task’s credentials (UID, GID) match up with the credentials (real, effective and saved set-UID/GID) of the target process • the current task has CAP_SYS_PTRACE inside the user namespace of the target process In the next check, access is denied if the current task has neither CAP_SYS_PTRACE inside the user namespace of the target task, nor the target’s dumpable attribute is set to SUID_DUMP_USER. The dumpable attribute is typically required to allow producing core dumps. After these three checks, we also go through the commoncap Linux Security Module (and other LSMs) to verify our access mode is fine. LSMs you may know are SELinux and AppArmor. The commoncap LSM performs the checks on the basis of effective or permitted process capabilities (depending on the mode being FSCREDS or REALCREDS), allowing access if • the capabilities of the current task are a superset of the capabilities of the target task, or • the current task has CAP_SYS_PTRACE in the target task’s user namespace In conclusion, one has access (with only commoncap LSM checks active) if: • the current task is in the same task group as the target task, or • the current task has CAP_SYS_PTRACE in the target task’s user namespace, or • the credentials of the current and target task match up in the given credentials mode, the target task is dumpable, they run in the same user namespace and the target task’s capabilities are a subset of the current task’s capabilities I highly recommend reading through the ptrace manpages to dig deeper into the different modes, options and checks. ## Reading from the file Since all the access checks occur when opening the file, reading from it is quite straightforward. When one invokes read() on a mem file, it calls up mem_rw (which actually can do both reading and writing). To avoid using lots of memory, mem_rw performs the copy in a loop and buffers the data in an intermediate page. mem_rw has a hidden superpower, that is, it uses FOLL_FORCE to avoid permission checks on user-owned pages (handling pages marked as non-readable/non-writable readable and writable). mem_rw has other specialties, such as its error handling. Some interesting cases are: • if the target task has exited after opening the file descriptor, performing read() will always succeed with reading 0 bytes • if the initial copy from the target task’s memory to the intermediate page fails, it does not always return an error but only if no data has been read You can also perform lseek on the file excluding SEEK_END. ## How it works in gVisor Luckily, gVisor already implemented ptrace_may_access as kernel.task.CanTrace, so one can avoid reimplementing all the ptrace access logic. However, the implementation in gVisor is less complicated due to the lack of support for PTRACE_MODE_FSCREDS (which is still an open issue). When a new file descriptor is open()ed, the GetFile method of the virtual Inode is invoked, therefore this is where the access check naturally happens. After a successful access check, the method returns a fs.File. The fs.File implements all the file operations you would expect such as Read() and Write(). gVisor also provides tons of primitives for quickly building a working file structure so that one does not have to reimplement a generic lseek() for example. In case a task invokes a Read() call onto the fs.File, the Read method retrieves the memory manager of the file’s Task. Accessing the task’s memory manager is a breeze with comfortable CopyIn and CopyOut methods, with interfaces similar to io.Writer and io.Reader. After implementing all of this, we finally got a useful stack trace. *** Check failure stack trace: *** @ 0x7f190c9e70bd google::LogMessage::Fail() @ 0x7f190c9ebc9c google::LogMessage::SendToLog() @ 0x7f190c9e6dbd google::LogMessage::Flush() @ 0x7f190c9e75a9 google::LogMessageFatal::~LogMessageFatal() @ 0x55d6f718c2da main @ 0x7f190c510cca __libc_start_main @ 0x55d6f718c0fa _start  ## Conclusion A comprehensive victory! The /proc/<pid>/mem file is an important mechanism that gives insight into contents of process memory. It is essential to stack unwinders to do their work in case of complicated and unforeseeable failures. Because the process memory contains highly-sensitive information, data access to the file is determined by a complex set of poorly documented rules. With a bit of effort, you can emulate /proc/[PID]/mem inside gVisor’s sandbox, where the process only has access to the subset of procfs that has been implemented by the gVisor authors and, as a result, you can have access to an easily readable stack trace in case of a crash. Now I can’t wait to get the PR merged into gVisor. # What is Cloudflare One? Post Syndicated from Rustam Lalkaka original https://blog.cloudflare.com/cloudflare-one/ Running a secure enterprise network is really difficult. Employees spread all over the world work from home. Applications are run from data centers, hosted in public cloud, and delivered as services. Persistent and motivated attackers exploit any vulnerability. Enterprises used to build networks that resembled a castle-and-moat. The walls and moat kept attackers out and data in. Team members entered over a drawbridge and tended to stay inside the walls. Trust folks on the inside of the castle to do the right thing, and deploy whatever you need in the relative tranquility of your secure network perimeter. The Internet, SaaS, and “the cloud” threw a wrench in that plan. Today, more of the workloads in a modern enterprise run outside the castle than inside. So why are enterprises still spending money building more complicated and more ineffective moats? Today, we’re excited to share Cloudflare One™, our vision to tackle the intractable job of corporate security and networking. Cloudflare One combines networking products that enable employees to do their best work, no matter where they are, with consistent security controls deployed globally. Starting today, you can begin replacing traffic backhauls to security appliances with Cloudflare WARP and Gateway to filter outbound Internet traffic. For your office networks, we plan to bring next-generation firewall capabilities to Magic Transit with Magic Firewall to let you get rid of your top-of-shelf firewall appliances. With multiple on-ramps to the Internet through Cloudflare, and the elimination of backhauled traffic, we plan to make it simple and cost-effective to manage that routing compared to MPLS and SD-WAN models. Cloudflare Magic WAN will provide a control plane for how your traffic routes through our network. You can use Cloudflare One today to replace the other function of your VPN: putting users on a private network for access control. Cloudflare Access delivers Zero Trust controls that can replace private network security models. Later this week, we’ll announce how you can extend Access to any application – including SaaS applications. We’ll also preview our browser isolation technology to keep the endpoints that connect to those applications safe from malware. Finally, the products in Cloudflare One focus on giving your team the logs and tools to both understand and then remediate issues. As part of our Gateway filtering launch this week we’re including logs that provide visibility into the traffic leaving your organization. We’ll be sharing how those logs get smarter later this week with a new Intrusion Detection System that detects and stops intrusion attempts. Many of those components are available today, some new features are arriving this week, and other pieces will be launching soon. All together, we’re excited to share this vision and for the future of the corporate network. ## Problems in enterprise networking and security The demands placed on a corporate network have changed dramatically. IT has gone from a back-office function to mission critical. In parallel with networks becoming more integral, users spread out from offices to work from home. Applications left the datacenter and are now being run out of multiple clouds or are being delivered by vendors directly over the Internet. ### Direct network paths became hairpin turns Employees sitting inside of an office could connect over a private network to applications running in a datacenter nearby. When team members left the office, they could use a VPN to sneak back onto the network from outside the walls. Branch offices hopped on that same network over expensive MPLS links. When applications left the data center and users left their offices, organizations responded by trying to force that scattered world into the same castle-and-moat model. Companies purchased more VPN licenses and replaced MPLS links with difficult SD-WAN deployments. Networks became more complex in an attempt to mimic an older model of networking when in reality the Internet had become the new corporate network. ### Defense-in-depth splintered Attackers looking to compromise corporate networks have a multitude of tools at their disposal, and may execute surgical malware strikes, throw a volumetric kitchen sink at your network, or any number of things in between. Traditionally, defense against each class of attack was provided by a separate, specialized piece of hardware running in a datacenter. Security controls used to be relatively easy when every user and every application sat in the same place. When employees left offices and workloads left data centers, the same security controls struggled to follow. Companies deployed a patchwork of point solutions, attempting to rebuild their topside firewall appliances across hybrid and dynamic environments. ### High-visibility required high-effort The move to a patchwork model sacrificed more than just defense-in-depth — companies lost visibility into what was happening in their networks and applications. We hear from customers that this capture and standardization of logs has become one of their biggest hurdles. They purchased expensive data ingestion, analysis, storage, and analytics tools. Enterprises now rely on multiple point solutions that one of the biggest hurdles is the capture and standardization of logs. Increasing regulatory and compliance pressures place more emphasis on data retention and analysis. Splintered security solutions become a data management nightmare. ### Fixing issues relied on best guesses Without visibility into this new networking model, security teams had to guess at what could go wrong. Organizations who wanted to adopt an “assume breach” model struggled to determine what kind of breach could even occur, so they threw every possible solution at the problem. We talk to enterprises who purchase new scanning and filtering services, delivered in virtual appliances, for problems they are unsure they have. These teams attempt to remediate every possible event manually, because they lack visibility, rather than targeting specific events and adapting the security model. ## How does Cloudflare One fit? Over the last several years, we’ve been assembling the components of Cloudflare One. We launched individual products to target some of these problems one-at-a-time. We’re excited to share our vision for how they all fit together in Cloudflare One. ### Flexible data planes Cloudflare launched as a reverse proxy. Customers put their Internet-facing properties on our network and their audience connected to those specific destinations through our network. Cloudflare One represents years of launches that allow our network to process any type of traffic flowing in either the “reverse” or “forward” direction. In 2019, we launched Cloudflare WARP — a mobile application that kept Internet-bound traffic private with an encrypted connection to our network while also making it faster and more reliable. We’re now packaging that same technology into an enterprise version launching this week to connect roaming employees to Cloudflare Gateway. Your data centers and offices should have the same advantage. We launched Magic Transit last year to secure your networks from IP-layer attacks. Our initial focus with Magic Transit has been delivering best-in-class DDoS mitigation to on-prem networks. DDoS attacks are a persistent thorn in network operators’ sides, and Magic Transit effectively diffuses their sting without forcing performance compromises. That rock-solid DDoS mitigation is the perfect platform on which to build higher level security functions that apply to the same traffic already flowing across our network. Earlier this year, we expanded that model when we launched Cloudflare Network Interconnect (CNI) to allow our customers to interconnect branch offices and data centers directly with Cloudflare. As part of Cloudflare One, we’ll apply outbound filtering to that same connection. Cloudflare One should not just help your team move to the Internet as a corporate network, it should be faster than the Internet. Our network is carrier-agnostic, exceptionally well-connected and peered, and delivers the same set of services globally. In each of these on-ramps, we’re adding smarter routing based on our Argo Smart Routing technology, which has been shown to reduce latency by 30% or more in the real-world. Security + Performance, because they’re better together. ### A single, unified control plane When users connect to the Internet from branch offices and devices, they skip the firewall appliances that used to live in headquarters altogether. To keep pace, enterprises need a way to secure traffic that no longer lives entirely within their own network. Cloudflare One applies standard security controls to all traffic – regardless of how that connection starts or where in the network stack it lives. Cloudflare Access starts by introducing identity into Cloudflare’s network. Teams apply filters based on identity and context to both inbound and outbound connections. Every login, request, and response proxies through Cloudflare’s network regardless of the location of the server or user. The scale of our network and its distribution can filter and log enterprise traffic without compromising performance. Cloudflare Gateway keeps connections to the rest of the Internet safe. Gateway inspects traffic leaving devices and networks for threats and data loss events that hide inside of connections at the application layer. Launching soon, Gateway will bring that same level of control lower in the stack to the transport layer. You should have the same level of control over how your networks send traffic. We’re excited to announce Magic Firewall, a next-generation firewall for all traffic leaving your offices and data centers. With Gateway and Magic Firewall, you can build a rule once and run it everywhere, or tailor rules to specific use cases in a single control plane. We know some attacks can’t be filtered because they launch before filters can be built to stop them. Cloudflare Browser, our isolated browser technology gives your team a bulletproof pane of glass from threats that can evade known filters. Later this week, we’ll invite customers to sign up to join the beta to browse the Internet on Cloudflare’s edge without the risk of code leaping out of the browser to infect an endpoint. Finally, the PKI infrastructure that secures your network should be modern and simpler to manage. We heard from customers who described certificate management as one of the core problems of moving to a better model of security. Cloudflare works with, not against, modern encryption standards like TLS 1.3. Cloudflare made it easy to add encryption to your sites on the Internet with one click. We’re going to bring that ease-of-management to the network functions you run on Cloudflare One. ### One place to get your logs, one location for all of your security analysis Cloudflare’s network serves 18 million HTTP requests per second on average. We’ve built logging pipelines that make it possible for some of the largest Internet properties in the world to capture and analyze their logs at scale. Cloudflare One builds on that same capability. Cloudflare Access and Gateway capture every request, inbound or outbound, without any server-side code changes or advanced client-side configuration. Your team can export those logs to the SIEM provider of your choice with our Cloudflare Logpush service – the same pipeline that exports HTTP request events at scale for public sites. Magic Transit expands that logging capability to entire networks and offices to ensure you never lose visibility from any location. We’re going beyond just logging events. Available today for your websites, Cloudflare Web Analytics converts logs into insights. We plan to keep expanding that visibility into how your network operates, as well. Just as Cloudflare has replaced the “band-aid boxes” that performed disparate network functions and unified them into a cohesive, adaptable edge, we intend to do the same for the fragmented, hard to use, and expensive security analytics ecosystem. More to come on this soon.Smarter, faster remediation Data and analytics should surface events that a team can remediate. Log systems that lead to one-click fixes can be powerful tools, but we want to make that remediation automatic. Launching into a closed preview later this week, Cloudflare Intrusion Detection System (IDS) will proactively scan your network for anomalous events and recommend actions or, better yet, take actions for you to remediate problems. We plan to bring that same proactive scanning and remediation approach to Cloudflare Access and Cloudflare Gateway. ## Run your network on our globally scaled network Over 25 million Internet properties rely on Cloudflare’s network to reach their audiences. More than 10% of all websites connect through our reverse proxy, including 16% of the Fortune 1000. Cloudflare accelerates traffic for huge chunks of the Internet by delivering services from datacenters around the world. We deliver Cloudflare One from those same data centers. And critically, every datacenter we operate delivers the same set of services, whether that is Cloudflare Access, WARP, Magic Transit, or our WAF. As an example, when your employees connect through Cloudflare WARP to one of our data centers, there is a real chance they never have to leave our network or that data center to reach the site or data they need. As a result, their entire Internet experience becomes extraordinarily fast, no matter where they are in the world. We expect that performance bonus to become even more meaningful as browsing moves to Cloudflare’s edge with Cloudflare Browser. The isolated browsers running in Cloudflare’s data centers can request content that sits just centimeters away. Even further, as more web properties rely on Cloudflare Workers to power their applications, entire workflows can stay inside of a data center within 100 ms of your employees. ## What’s next? While many of these features are available today, we’re going to be launching several new features over the next several days as part of Cloudflare’s Zero Trust week. Stay tuned for announcements each day this week that add new pieces to the Cloudflare One featureset. # Unimog – Cloudflare’s edge load balancer Post Syndicated from David Wragg original https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/ As the scale of Cloudflare’s edge network has grown, we sometimes reach the limits of parts of our architecture. About two years ago we realized that our existing solution for spreading load within our data centers could no longer meet our needs. We embarked on a project to deploy a Layer 4 Load Balancer, internally called Unimog, to improve the reliability and operational efficiency of our edge network. Unimog has now been deployed in production for over a year. This post explains the problems Unimog solves and how it works. Unimog builds on techniques used in other Layer 4 Load Balancers, but there are many details of its implementation that are tailored to the needs of our edge network. ### The role of Unimog in our edge network Cloudflare operates an anycast network, meaning that our data centers in 200+ cities around the world serve the same IP addresses. For example, our own cloudflare.com website uses Cloudflare services, and one of its IP addresses is 104.17.175.85. All of our data centers will accept connections to that address and respond to HTTP requests. By the magic of Internet routing, when you visit cloudflare.com and your browser connects to 104.17.175.85, your connection will usually go to the closest (and therefore fastest) data center. Inside those data centers are many servers. The number of servers in each varies greatly (the biggest data centers have a hundred times more servers than the smallest ones). The servers run the application services that implement our products (our caching, DNS, WAF, DDoS mitigation, Spectrum, WARP, etc). Within a single data center, any of the servers can handle a connection for any of our services on any of our anycast IP addresses. This uniformity keeps things simple and avoids bottlenecks. But if any server within a data center can handle any connection, when a connection arrives from a browser or some other client, what controls which server it goes to? That’s the job of Unimog. There are two main reasons why we need this control. The first is that we regularly move servers in and out of operation, and servers should only receive connections when they are in operation. For example, we sometimes remove a server from operation in order to perform maintenance on it. And sometimes servers are automatically removed from operation because health checks indicate that they are not functioning correctly. The second reason concerns the management of the load on the servers (by load we mean the amount of computing work each one needs to do). If the load on a server exceeds the capacity of its hardware resources, then the quality of service to users will suffer. The performance experienced by users degrades as a server approaches saturation, and if a server becomes sufficiently overloaded, users may see errors. We also want to prevent servers being underloaded, which would reduce the value we get from our investment in hardware. So Unimog ensures that the load is spread across the servers in a data center. This general idea is called load balancing (balancing because the work has to be done somewhere, and so for the load on one server to go down, the load on some other server must go up). Note that in this post, we’ll discuss how Cloudflare balances the load on its own servers in edge data centers. But load balancing is a requirement that occurs in many places in distributed computing systems. Cloudflare also has a Layer 7 Load Balancing product to allow our customers to balance load across their servers. And Cloudflare uses load balancing in other places internally. Deploying Unimog led to a big improvement in our ability to balance the load on our servers in our edge data centers. Here’s a chart for one data center, showing the difference due to Unimog. Each line shows the processor utilization of an individual server (the colour of the lines indicates server model). The load on the servers varies during the day with the activity of users close to this data center. The white line marks the point when we enabled Unimog. You can see that after that point, the load on the servers became much more uniform. We saw similar results when we deployed Unimog to our other data centers. ## How Unimog compares to other load balancers There are a variety of techniques for load balancing. Unimog belongs to a category called Layer 4 Load Balancers (L4LBs). L4LBs direct packets on the network by inspecting information up to layer 4 of the OSI network model, which distinguishes them from the more common Layer 7 Load Balancers. The advantage of L4LBs is their efficiency. They direct packets without processing the payload of those packets, so they avoid the overheads associated with higher level protocols. For any load balancer, it’s important that the resources consumed by the load balancer are low compared to the resources devoted to useful work. At Cloudflare, we already pay close attention to the efficient implementation of our services, and that sets a high bar for the load balancer that we put in front of those services. The downside of L4LBs is that they can only control which connections go to which servers. They cannot modify the data going over the connection, which prevents them from participating in higher-level protocols like TLS, HTTP, etc. (in contrast, Layer 7 Load Balancers act as proxies, so they can modify data on the connection and participate in those higher-level protocols). L4LBs are not new. They are mostly used at companies which have scaling needs that would be hard to meet with L7LBs alone. Google has published about Maglev, Facebook open-sourced Katran, and Github has open-sourced their GLB. Unimog is the L4LB that Cloudflare has built to meet the needs of our edge network. It shares features with other L4LBs, and it is particularly strongly influenced by GLB. But there are some requirements that were not well-served by existing L4LBs, leading us to build our own: • Unimog is designed to run on the same general-purpose servers that provide application services, rather than requiring a separate tier of servers dedicated to load balancing. • It performs dynamic load balancing: measurements of server load are used to adjust the number of connections going to each server, in order to accurately balance load. • It supports long-lived connections that remain established for days. • Virtual IP addresses are managed as ranges (Cloudflare serves hundreds of thousands of IPv4 addresses on behalf of our customers, so it is impractical to configure these individually). • Unimog is tightly integrated with our existing DDoS mitigation system, and the implementation relies on the same XDP technology in the Linux kernel. The rest of this post describes these features and the design and implementation choices that follow from them in more detail. For Unimog to balance load, it’s not enough to send the same (or approximately the same) number of connections to each server, because the performance of our servers varies. We regularly update our server hardware, and we’re now on our 10th generation. Once we deploy a server, we keep it in service for as long as it is cost effective, and the lifetime of a server can be several years. It’s not unusual for a single data center to contain a mix of server models, due to expansion and upgrades over time. Processor performance has increased significantly across our server generations. So within a single data center, we need to send different numbers of connections to different servers to utilize the same percentage of their capacity. It’s also not enough to give each server a fixed share of connections based on static estimates of their capacity. Not all connections consume the same amount of CPU. And there are other activities running on our servers and consuming CPU that are not directly driven by connections from clients. So in order to accurately balance load across servers, Unimog does dynamic load balancing: it takes regular measurements of the load on each of our servers, and uses a control loop that increases or decreases the number of connections going to each server so that their loads converge to an appropriate value. ## Refresher: TCP connections The relationship between TCP packets and connections is central to the operation of Unimog, so we’ll briefly describe that relationship. (Unimog supports UDP as well as TCP, but for clarity most of this post will focus on the TCP support. We explain how UDP support differs towards the end.) Here is the outline of a TCP packet: The TCP connection that this packet belongs to is identified by the four labelled header fields, which span the IPv4/IPv6 (i.e. layer 3) and TCP (i.e. layer 4) headers: the source and destination addresses, and the source and destination ports. Collectively, these four fields are known as the 4-tuple. When we say the Unimog sends a connection to a server, we mean that all the packets with the 4-tuple identifying that connection are sent to that server. A TCP connection is established via a three-way handshake between the client and the server handling that connection. Once a connection has been established, it is crucial that all the incoming packets for that connection go to that same server. If a TCP packet belonging to the connection is sent to a different server, it will signal the fact that it doesn’t know about the connection to the client with a TCP RST (reset) packet. Upon receiving this notification, the client terminates the connection, probably resulting in the user seeing an error. So a misdirected packet is much worse than a dropped packet. As usual, we consider the network to be unreliable, and it’s fine for occasional packets to be dropped. But even a single misdirected packet can lead to a broken connection. Cloudflare handles a wide variety of connections on behalf of our customers. Many of these connections carry HTTP, and are typically short lived. But some HTTP connections are used for websockets, and can remain established for hours or days. Our Spectrum product supports arbitrary TCP connections. TCP connections can be terminated or stall for many reasons, and ideally all applications that use long-lived connections would be able to reconnect transparently, and applications would be designed to support such reconnections. But not all applications and protocols meet this ideal, so we strive to maintain long-lived connections. Unimog can maintain connections that last for many days. ### Forwarding packets The previous section described that the function of Unimog is to steer connections to servers. We’ll now explain how this is implemented. To start with, let’s consider how one of our data centers might look without Unimog or any other load balancer. Here’s a conceptual view: Packets arrive from the Internet, and pass through the router, which forwards them on to servers (in reality there is usually additional network infrastructure between the router and the servers, but it doesn’t play a significant role here so we’ll ignore it). But is such a simple arrangement possible? Can the router spread traffic over servers without some kind of load balancer in between? Routers have a feature called ECMP (equal cost multipath) routing. Its original purpose is to allow traffic to be spread across multiple paths between two locations, but it is commonly repurposed to spread traffic across multiple servers within a data center. In fact, Cloudflare relied on ECMP alone to spread load across servers before we deployed Unimog. ECMP uses a hashing scheme to ensure that packets on a given connection use the same path (Unimog also employs a hashing scheme, so we’ll discuss how this can work in further detail below) . But ECMP is vulnerable to changes in the set of active servers, such as when servers go in and out of service. These changes cause rehashing events, which break connections to all the servers in an ECMP group. Also, routers impose limits on the sizes of ECMP groups, which means that a single ECMP group cannot cover all the servers in our larger edge data centers. Finally, ECMP does not allow us to do dynamic load balancing by adjusting the share of connections going to each server. These drawbacks mean that ECMP alone is not an effective approach. Ideally, to overcome the drawbacks of ECMP, we could program the router with the appropriate logic to direct connections to servers in the way we want. But although programmable network data planes have been a hot research topic in recent years, commodity routers are still essentially fixed-function devices. We can work around the limitations of routers by having the router send the packets to some load balancing servers, and then programming those load balancers to forward packets as we want. If the load balancers all act on packets in a consistent way, then it doesn’t matter which load balancer gets which packets from the router (so we can use ECMP to spread packets across the load balancers). That suggests an arrangement like this: And indeed L4LBs are often deployed like this. Instead, Unimog makes every server into a load balancer. The router can send any packet to any server, and that initial server will forward the packet to the right server for that connection: We have two reasons to favour this arrangement: First, in our edge network, we avoid specialised roles for servers. We run the same software stack on the servers in our edge network, providing all of our product features, whether DDoS attack prevention, website performance features, Cloudflare Workers, WARP, etc. This uniformity is key to the efficient operation of our edge network: we don’t have to manage how many load balancers we have within each of our data centers, because all of our servers act as load balancers. The second reason relates to stopping attacks. Cloudflare’s edge network is the target of incessant attacks. Some of these attacks are volumetric – large packet floods which attempt to overwhelm the ability of our data centers to process network traffic from the Internet, and so impact our ability to service legitimate traffic. To successfully mitigate such attacks, it’s important to filter out attack packets as early as possible, minimising the resources they consume. This means that our attack mitigation system needs to occur before the forwarding done by Unimog. That mitigation system is called l4drop, and we’ve written about it before. l4drop and Unimog are closely integrated. Because l4drop runs on all of our servers, and because l4drop comes before Unimog, it’s natural for Unimog to run on all of our servers too. ### XDP and xdpd Unimog implements packet forwarding using a Linux kernel facility called XDP. XDP allows a program to be attached to a network interface, and the program gets run for every packet that arrives, before it is processed by the kernel’s main network stack. The XDP program returns an action code to tell the kernel what to do with the packet: • PASS: Pass the packet on to the kernel’s network stack for normal processing. • DROP: Drop the packet. This is the basis for l4drop. • TX: Transmit the packet back out of the network interface. The XDP program can modify the packet data before transmission. This action is the basis for Unimog forwarding. XDP programs run within the kernel, making this an efficient approach even at high packet rates. XDP programs are expressed as eBPF bytecode, and run within an in-kernel virtual machine. Upon loading an XDP program, the kernel compiles its eBPF code into machine code. The kernel also verifies the program to check that it does not compromise security or stability. eBPF is not only used in the context of XDP: many recent Linux kernel innovations employ eBPF, as it provides a convenient and efficient way to extend the behaviour of the kernel. XDP is much more convenient than alternative approaches to packet-level processing, particularly in our context where the servers involved also have many other tasks. We have continued to enhance Unimog since its initial deployment. Our deployment model for new versions of our Unimog XDP code is essentially the same as for userspace services, and we are able to deploy new versions on a weekly basis if needed. Also, established techniques for optimizing the performance of the Linux network stack provide good performance for XDP. There are two main alternatives for efficient packet-level processing: • Kernel-bypass networking (such as DPDK), where a program in userspace manages a network interface (or some part of one) directly without the involvement of the kernel. This approach works best when servers can be dedicated to a network function (due to the need to dedicate processor or network interface hardware resources, and awkward integration with the normal kernel network stack; see our old post about this). But we avoid putting servers in specialised roles. (Github’s open-source GLB uses DPDK, and this is one of the main factors that made GLB unsuitable for us.) • Kernel modules, where code is added to the kernel to perform the necessary network functions. The Linux IPVS (IP Virtual Server) subsystem falls into this category. But developing, testing, and deploying kernel modules is cumbersome compared to XDP. The following diagram shows an overview of our use of XDP. Both l4drop and Unimog are implemented by an XDP program. l4drop matches attack packets, and uses the DROP action to discard them. Unimog forwards packets, using the TX action to resend them. Packets that are not dropped or forwarded pass through to the normal Linux network stack. To support our elaborate use of XPD, we have developed the xdpd daemon which performs the necessary supervisory and support functions for our XDP programs. Rather than a single XDP program, we have a chain of XDP programs that must be run for each packet (l4drop, Unimog, and others we have not covered here). One of the responsibilities of xdpd is to prepare these programs, and to make the appropriate system calls to load them and assemble the full chain. Our XDP programs come from two sources. Some are developed in a conventional way: engineers write C code, our build system compiles it (with clang) to eBPF ELF files, and our release system deploys those files to our servers. Our Unimog XDP code works like this. In contrast, the l4drop XDP code is dynamically generated by xdpd based on information it receives from attack detection systems. xdpd has many other duties to support our use of XDP: • XDP programs can be supplied with data using data structures called maps. xdpd populates the maps needed by our programs, based on information received from control planes. • Programs (for instance, our Unimog XDP program) may depend upon configuration values which are fixed while the program runs, but do not have universal values known at the time their C code was compiled. It would be possible to supply these values to the program via maps, but that would be inefficient (retrieving a value from a map requires a call to a helper function). So instead, xdpd will fix up the eBPF program to insert these constants before it is loaded. • Cloudflare carefully monitors the behaviour of all our software systems, and this includes our XDP programs: They emit metrics (via another use of maps), which xdpd exposes to our metrics and alerting system (prometheus). • When we deploy a new version of xdpd, it gracefully upgrades in such a way that there is no interruption to the operation of Unimog or l4drop. Although the XDP programs are written in C, xdpd itself is written in Go. Much of its code is specific to Cloudflare. But in the course of developing xdpd, we have collaborated with Cilium to develop https://github.com/cilium/ebpf, an open source Go library that provides the operations needed by xdpd for manipulating and loading eBPF programs and related objects. We’re also collaborating with the Linux eBPF community to share our experience, and extend the core eBPF technology in ways that make features of xdpd obsolete. In evaluating the performance of Unimog, our main concern is efficiency: that is, the resources consumed for load balancing relative to the resources used for customer-visible services. Our measurements show that Unimog costs less than 1% of the processor utilization, compared to a scenario where no load balancing is in use. Other L4LBs, intended to be used with servers dedicated to load balancing, may place more emphasis on maximum throughput of packets. Nonetheless, our experience with Unimog and XDP in general indicates that the throughput is more than adequate for our needs, even during large volumetric attacks. Unimog is not the first L4LB to use XDP. In 2018, Facebook open sourced Katran, their XDP-based L4LB data plane. We considered the possibility of reusing code from Katran. But it would not have been worthwhile: the core C code needed to implement an XDP-based L4LB is relatively modest (about 1000 lines of C, both for Unimog and Katran). Furthermore, we had requirements that were not met by Katran, and we also needed to integrate with existing components and systems at Cloudflare (particularly l4drop). So very little of the code could have been reused as-is. ## Encapsulation As discussed as the start of this post, clients make connections to one of our edge data centers with a destination IP address that can be served by any one of our servers. These addresses that do not correspond to a specific server are known as virtual IPs (VIPs). When our Unimog XDP program forwards a packet destined to a VIP, it must replace that VIP address with the direct IP (DIP) of the appropriate server for the connection, so that when the packet is retransmitted it will reach that server. But it is not sufficient to overwrite the VIP in the packet headers with the DIP, as that would hide the original destination address from the server handling the connection (the original destination address is often needed to correctly handle the connection). Instead, the packet must be encapsulated: Another set of packet headers is prepended to the packet, so that the original packet becomes the payload in this new packet. The DIP is then used as the destination address in the outer headers, but the addressing information in the headers of the original packet is preserved. The encapsulated packet is then retransmitted. Once it reaches the target server, it must be decapsulated: the outer headers are stripped off to yield the original packet as if it had arrived directly. Encapsulation is a general concept in computer networking, and is used in a variety of contexts. The headers to be added to the packet by encapsulation are defined by an encapsulation format. Many different encapsulation formats have been defined within the industry, tailored to the requirements in specific contexts. Unimog uses a format called GUE (Generic UDP Encapsulation), in order to allow us to re-use the glb-redirect component from github’s GLB (glb-redirect is discussed below). GUE is a relatively simple encapsulation format. It encapsulates within a UDP packet, placing a GUE-specific header between the outer IP/UDP headers and the payload packet to allow extension data to be carried (and we’ll see how Unimog takes advantage of this): When an encapsulated packet arrives at a server, the encapsulation process must be reversed. This step is called decapsulation. The headers that were added during the encapsulation process are removed, leaving the original packet to be processed by the network stack as if it had arrived directly from the client. An issue that can arise with encapsulation is hitting limits on the maximum packet size, because the encapsulation process makes packets larger. The de-facto maximum packet size on the Internet is 1500 bytes, and not coincidentally this is also the maximum packet size on ethernet networks. For Unimog, encapsulating a 1500-byte packet results in a 1536-byte packet. To allow for these enlarged encapsulated packets, we have enabled jumbo frames on the networks inside our data centers, so that the 1500-byte limit only applies to packets headed out to the Internet. ## Forwarding logic So far, we have described the technology used to implement the Unimog load balancer, but not how our Unimog XDP program selects the DIP address when forwarding a packet. This section describes the basic scheme. But as we’ll see, there is a problem, so then we’ll describe how this scheme is elaborated to solve that problem. In outline, our Unimog XDP program processes each packet in the following way: 1. Determine whether the packet is destined for a VIP address. Not all of the packets arriving at a server are for VIP addresses. Other packets are passed through for normal handling by the kernel’s network stack. (xdpd obtains the VIP address ranges from the Unimog control plane.) 2. Determine the DIP for the server handling the packet’s connection. 3. Encapsulate the packet, and retransmit it to the DIP. In step 2, note that all the load balancers must act consistently – when forwarding packets, they must all agree about which connections go to which servers. The rate of new connections arriving at a data center is large, so it’s not practical for load balancers to agree by communicating information about connections amongst themselves. Instead L4LBs adopt designs which allow the load balancers to reach consistent forwarding decisions independently. To do this, they rely on hashing schemes: Take the 4-tuple identifying the packet’s connection, put it through a hash function to obtain a key (the hash function ensures that these key values are uniformly distributed), then perform some kind of lookup into a data structure to turn the key into the DIP for the target server. Unimog uses such a scheme, with a data structure that is simple compared to some other L4LBs. We call this data structure the forwarding table, and it consists of an array where each entry contains a DIP specifying the server target server for the relevant packets (we call these entries buckets). The forwarding table is generated by the Unimog control plane and broadcast to the load balancers (more on this below), so that it has the same contents on all load balancers. To look up a packet’s key in the forwarding table, the low N bits from the key are used as the index for a bucket (the forwarding table is always a power-of-2 in size): Note that this approach does not provide per-connection control – each bucket typically applies to many connections. All load balancers in a data center use the same forwarding table, so they all forward packets in a consistent manner. This means it doesn’t matter which packets are sent by the router to which servers, and so ECMP re-hashes are a non-issue. And because the forwarding table is immutable and simple in structure, lookups are fast. Although the above description only discusses a single forwarding table, Unimog supports multiple forwarding tables, each one associated with a trafficset – the traffic destined for a particular service. Ranges of VIP addresses are associated with a trafficset. Each trafficset has its own configuration settings and forwarding tables. This gives us the flexibility to differentiate how Unimog behaves for different services. Precise load balancing requires the ability to make fine adjustments to the number of connections arriving at each server. So we make the number of buckets in the forwarding table more than 100 times the number of servers. Our data centers can contain hundreds of servers, and so it is normal for a Unimog forwarding table to have tens of thousands of buckets. The DIP for a given server is repeated across many buckets in the forwarding table, and by increasing or decreasing the number of buckets that refer to a server, we can control the share of connections going to that server. Not all buckets will correspond to exactly the same number of connections at a given point in time (the properties of the hash function make this a statistical matter). But experience with Unimog has demonstrated that the relationship between the number of buckets and resulting server load is sufficiently strong to allow for good load balancing. But as mentioned, there is a problem with this scheme as presented so far. Updating a forwarding table, and changing the DIPs in some buckets, would break connections that hash to those buckets (because packets on those connections would get forwarded to a different server after the update). But one of the requirements for Unimog is to allow us to change which servers get new connections without impacting the existing connections. For example, sometimes we want to drain the connections to a server, maintaining the existing connections to that server but not forwarding new connections to it, in the expectation that many of the existing connections will terminate of their own accord. The next section explains how we fix this scheme to allow such changes. ## Maintaining established connections To make changes to the forwarding table without breaking established connections, Unimog adopts the “daisy chaining” technique described in the paper Stateless Datacenter Load-balancing with Beamer. To understand how the Beamer technique works, let’s look at what can go wrong when a forwarding table changes: imagine the forwarding table is updated so that a bucket which contained the DIP of server A now refers to server B. A packet that would formerly have been sent to A by the load balancers is now sent to B. If that packet initiates a new connection (it’s a TCP SYN packet), there’s no problem – server B will continue the three-way handshake to complete the new connection. On the other hand, if the packet belongs to a connection established before the change, then the TCP implementation of server B has no matching TCP socket, and so sends a RST back to the client, breaking the connection. This explanation hints at a solution: the problem occurs when server B receives a forwarded packet that does not match a TCP socket. If we could change its behaviour in this case to forward the packet a second time to the DIP of server A, that would allow the connection to server A to be preserved. For this to work, server B needs to know the DIP for the bucket before the change. To accomplish this, we extend the forwarding table so that each bucket has two slots, each containing the DIP for a server. The first slot contains the current DIP, which is used by the load balancer to forward packets as discussed (and here we refer to this forwarding as the first hop). The second slot preserves the previous DIP (if any), in order to allow the packet to be forwarded again on a second hop when necessary. For example, imagine we have a forwarding table that refers to servers A, B, and C, and then it is updated to stop new connections going to server A, but maintaining established connections to server A. This is achieved by replacing server A’s DIP in the first slot of any buckets where it appears, but preserving it in the second slot: In addition to extending the forwarding table, this approach requires a component on each server to forward packets on the second hop when necessary. This diagram shows where this redirector fits into the path a packet can take: The redirector follows some simple logic to decide whether to process a packet locally on the first-hop server or to forward it on the second-hop server: • If the packet is a SYN packet, initiating a new connection, then it is always processed by the first-hop server. This ensures that new connections go to the first-hop server. • For other packets, the redirector checks whether the packet belongs to a connection with a corresponding TCP socket on the first-hop server. If so, it is processed by that server. • Otherwise, the packet has no corresponding TCP socket on the first-hop server. So it is forwarded on to the second-hop server to be processed there (in the expectation that it belongs to some connection established on the second-hop server that we wish to maintain). In that last step, the redirector needs to know the DIP for the second hop. To avoid the need for the redirector to do forwarding table lookups, the second-hop DIP is placed into the encapsulated packet by the Unimog XDP program (which already does a forwarding table lookup, so it has easy access to this value). This second-hop DIP is carried in a GUE extension header, so that it is readily available to the redirector if it needs to forward the packet again. This second hop, when necessary, does have a cost. But in our data centers, the fraction of forwarded packets that take the second hop is usually less than 1% (despite the significance of long-lived connections in our context). The result is that the practical overhead of the second hops is modest. When we initially deployed Unimog, we adopted the glb-redirect iptables module from github’s GLB to serve as the redirector component. In fact, some implementation choices in Unimog, such as the use of GUE, were made in order to facilitate this re-use. glb-redirect worked well for us initially, but subsequently we wanted to enhance the redirector logic. glb-redirect is a custom Linux kernel module, and developing and deploying changes to kernel modules is more difficult for us than for eBPF-based components such as our XDP programs. This is not merely due to Cloudflare having invested more engineering effort in software infrastructure for eBPF; it also results from the more explicit boundary between the kernel and eBPF programs (for example, we are able to run the same eBPF programs on a range of kernel versions without recompilation). We wanted to achieve the same ease of development for the redirector as for our XDP programs. To that end, we decided to write an eBPF replacement for glb-redirect. While the redirector could be implemented within XDP, like our load balancer, practical concerns led us to implement it as a TC classifier program instead (TC is the traffic control subsystem within the Linux network stack). A downside to XDP is that the packet contents prior to processing by the XDP program are not visible using conventional tools such as tcpdump, complicating debugging. TC classifiers do not have this downside, and in the context of the redirector, which passes most packets through, the performance advantages of XDP would not be significant. The result is cls-redirect, a redirector implemented as a TC classifier program. We have contributed our cls-redirect code as part of the Linux kernel test suite. In addition to implementing the redirector logic, cls-redirect also implements decapsulation, removing the need to separately configure GUE tunnel endpoints for this purpose. There are some features suggested in the Beamer paper that Unimog does not implement: • Beamer embeds generation numbers in the encapsulated packets to address a potential corner case where a ECMP rehash event occurs at the same time as a forwarding table update is propagating from the control plane to the load balancers. Given the combination of circumstances required for a connection to be impacted by this issue, we believe that in our context the number of affected connections is negligible, and so the added complexity of the generation numbers is not worthwhile. • In the Beamer paper, the concept of daisy-chaining encompasses third hops etc. to preserve connections across a series of changes to a bucket. Unimog only uses two hops (the first and second hops above), so in general it can only preserve connections across a single update to a bucket. But our experience is that even with only two hops, a careful strategy for updating the forwarding tables permits connection lifetimes of days. To elaborate on this second point: when the control plane is updating the forwarding table, it often has some choice in which buckets to change, depending on the event that led to the update. For example, if a server is being brought into service, then some buckets must be assigned to it (by placing the DIP for the new server in the first slot of the bucket). But there is a choice about which buckets. A strategy of choosing the least-recently modified buckets will tend to minimise the impact to connections. Furthermore, when updating the forwarding table to adjust the balance of load between servers, Unimog often uses a novel trick: due to the redirector logic, exchanging the first-hop and second-hop DIPs for a bucket only affects which server receives new connections for that bucket, and never impacts any established connections. Unimog is able to achieve load balancing in our edge data centers largely through forwarding table changes of this type. ## Control plane So far, we have discussed the Unimog data plane – the part that processes network packets. But much of the development effort on Unimog has been devoted to the control plane – the part that generates the forwarding tables used by the data plane. In order to correctly maintain the forwarding tables, the control plane consumes information from multiple sources: • Server information: Unimog needs to know the set of servers present in a data center, some key information about each one (such as their DIP addresses), and their operational status. It also needs signals about transitional states, such as when a server is being withdrawn from service, in order to gracefully drain connections (preventing the server from receiving new connections, while maintaining its established connections). • Health: Unimog should only send connections to servers that are able to correctly handle those connections, otherwise those servers should be removed from the forwarding tables. To ensure this, it needs health information at the node level (indicating that a server is available) and at the service level (indicating that a service is functioning normally on a server). • Load: in order to balance load, Unimog needs information about the resource utilization on each server. • IP address information: Cloudflare serves hundreds of thousands of IPv4 addresses, and these are something that we have to treat as a dynamic resource rather than something statically configured. The control plane is implemented by a process called the conductor. In each of our edge data centers, there is one active conductor, but there are also standby instances that will take over if the active instance goes away. We use Hashicorp’s Consul in a number of ways in the Unimog control plane (we have an independent Consul server cluster in each data center): • Consul provides a key-value store, with support for blocking queries so that changes to values can be received promptly. We use this to propagate the forwarding tables and VIP address information from the conductor to xdpd on the servers. • Consul provides server- and service-level health checks. We use this as the source of health information for Unimog. • The conductor stores its state in the Consul KV store, and uses Consul’s distributed locks to ensure that only one conductor instance is active. The conductor obtains server load information from Prometheus, which we already use for metrics throughout our systems. It balances the load across the servers using a control loop, periodically adjusting the forwarding tables to send more connections to underloaded servers and less connections to overloaded servers. The load for a server is defined by a Prometheus metric expression which measures processor utilization (with some intricacies to better handle characteristics of our workloads). The determination of whether a server is underloaded or overloaded is based on comparison with the average value of the load metric, and the adjustments made to the forwarding table are proportional to the deviation from the average. So the result of the feedback loop is that the load metric for all servers converges on the average. Finally, the conductor queries internal Cloudflare APIs to obtain the necessary information on servers and addresses. Unimog is a critical system: incorrect, poorly adjusted or stale forwarding tables could cause incoming network traffic to a data center to be dropped, or servers to be overloaded, to the point that a data center would have to be removed from service. To maintain a high quality of service and minimise the overhead of managing our many edge data centers, we have to be able to upgrade all components. So to the greatest extent possible, all components are able to tolerate brief absences of the other components without any impact to service. In some cases this is possible through careful design. In other cases, it requires explicit handling. For example, we have found that Consul can temporarily report inaccurate health information for a server and its services when the Consul agent on that server is restarted (for example, in order to upgrade Consul). So we implemented the necessary logic in the conductor to detect and disregard these transient health changes. Unimog also forms a complex system with feedback loops: The conductor reacts to its observations of behaviour of the servers, and the servers react to the control information they receive from the conductor. This can lead to behaviours of the overall system that are hard to anticipate or test for. For instance, not long after we deployed Unimog we encountered surprising behaviour when data centers became overloaded. This is of course a scenario that we strive to avoid, and we have automated systems to remove traffic from overloaded data centers if it does. But if a data center became sufficiently overloaded, then health information from its servers would indicate that many servers were degraded to the point that Unimog would stop sending new connections to those servers. Under normal circumstances, this is the correct reaction to a degraded server. But if enough servers become degraded, diverting new connections to other servers would mean those servers became degraded, while the original servers were able to recover. So it was possible for a data center that became temporarily overloaded to get stuck in a state where servers oscillated between healthy and degraded, even after the level of demand on the data center had returned to normal. To correct this issue, the conductor now has logic to distinguish between isolated degraded servers and such data center-wide problems. We have continued to improve Unimog in response to operational experience, ensuring that it behaves in a predictable manner over a wide range of conditions. ## UDP Support So far, we have described Unimog’s support for directing TCP connections. But Unimog also supports UDP traffic. UDP does not have explicit connections between clients and servers, so how it works depends upon how the UDP application exchanges packets between the client and server. There are a few cases of interest: ### Request-response UDP applications Some applications, such as DNS, use a simple request-response pattern: the client sends a request packet to the server, and expects a response packet in return. Here, there is nothing corresponding to a connection (the client only sends a single packet, so there is no requirement to make sure that multiple packets arrive at the same server). But Unimog can still provide value by spreading the requests across our servers. To cater to this case, Unimog operates as described in previous sections, hashing the 4-tuple from the packet headers (the source and destination IP addresses and ports). But the Beamer daisy-chaining technique that allows connections to be maintained does not apply here, and so the buckets in the forwarding table only have a single slot. ### UDP applications with flows Some UDP applications have long-lived flows of packets between the client and server. Like TCP connections, these flows are identified by the 4-tuple. It is necessary that such flows go to the same server (even when Cloudflare is just passing a flow through to the origin server, it is convenient for detecting and mitigating certain kinds of attack to have that flow pass through a single server within one of Cloudflare’s data centers). It’s possible to treat these flows by hashing the 4-tuple, skipping the Beamer daisy-chaining technique as for request-response applications. But then adding servers will cause some flows to change servers (this would effectively be a form of consistent hashing). For UDP applications, we can’t say in general what impact this has, as we can for TCP connections. But it’s possible that it causes some disruption, so it would be nice to avoid this. So Unimog adapts the daisy-chaining technique to apply it to UDP flows. The outline remains similar to that for TCP: the same redirector component on each server decides whether to send a packet on a second hop. But UDP does not have anything corresponding to TCP’s SYN packet that indicates a new connection. So for UDP, the part that depends on SYNs is removed, and the logic applied for each packet becomes: • The redirector checks whether the packet belongs to a connection with a corresponding UDP socket on the first-hop server. If so, it is processed by that server. • Otherwise, the packet has no corresponding TCP socket on the first-hop server. So it is forwarded on to the second-hop server to be processed there (in the expectation that it belongs to some flow established on the second-hop server that we wish to maintain). Although the change compared to the TCP logic is not large, it has the effect of switching the roles of the first- and second-hop servers: For UDP, new flows go to the second-hop server. The Unimog control plane has to take account of this when it updates a forwarding table. When it introduces a server into a bucket, that server should receive new connections or flows. For a TCP trafficset, this means it becomes the first-hop server. For UDP trafficset, it must become the second-hop server. This difference between handling of TCP and UDP also leads to higher overheads for UDP. In the case of TCP, as new connections are formed and old connections terminate over time, fewer packets will require the second hop, and so the overhead tends to diminish. But with UDP, new connections always involve the second hop. This is why we differentiate the two cases, taking advantage of SYN packets in the TCP case. The UDP logic also places a requirement on services. The redirector must be able to match packets to the corresponding sockets on a server according to their 4-tuple. This is not a problem in the TCP case, because all TCP connections are represented by connected sockets in the BSD sockets API (these sockets are obtained from an accept system call, so that they have a local address and a peer address, determining the 4-tuple). But for UDP, unconnected sockets (lacking a declared peer address) can be used to send and receive packets. So some UDP services only use unconnected sockets. For the redirector logic above to work, services must create connected UDP sockets in order to expose their flows to the redirector. ### UDP applications with sessions Some UDP-based protocols have explicit sessions, with a session identifier in each packet. Session identifiers allow sessions to persist even if the 4-tuple changes. This happens in mobility scenarios – for example, if a mobile device passes from a WiFi to a cellular network, causing its IP address to change. An example of a UDP-based protocol with session identifiers is QUIC (which calls them connection IDs). Our Unimog XDP program allows a flow dissector to be configured for different trafficsets. The flow dissector is the part of the code that is responsible for taking a packet and extracting the value that identifies the flow or connection (this value is then hashed and used for the lookup into the forwarding table). For TCP and UDP, there are default flow dissectors that extract the 4-tuple. But specialised flow dissectors can be added to handle UDP-based protocols. We have used this functionality in our WARP product. We extended the Wireguard protocol used by WARP in a backwards-compatible way to include a session identifier, and added a flow dissector to Unimog to exploit it. There are more details in our post on the technical challenges of WARP. ## Conclusion Unimog has been deployed to all of Cloudflare’s edge data centers for over a year, and it has become essential to our operations. Throughout that time, we have continued to enhance Unimog (many of the features described here were not present when it was first deployed). So the ease of developing and deploying changes, due to XDP and xdpd, has been a significant benefit. Today we continue to extend it, to support more services, and to help us manage our traffic and the load on our servers in more contexts. # Cloudflare Bot Management: machine learning and more Post Syndicated from Alex Bocharov original https://blog.cloudflare.com/cloudflare-bot-management-machine-learning-and-more/ ### Introduction Building Cloudflare Bot Management platform is an exhilarating experience. It blends Distributed Systems, Web Development, Machine Learning, Security and Research (and every discipline in between) while fighting ever-adaptive and motivated adversaries at the same time. This is the ongoing story of Bot Management at Cloudflare and also an introduction to a series of blog posts about the detection mechanisms powering it. I’ll start with several definitions from the Bot Management world, then introduce the product and technical requirements, leading to an overview of the platform we’ve built. Finally, I’ll share details about the detection mechanisms powering our platform. Let’s start with Bot Management’s nomenclature. ### Some Definitions Bot – an autonomous program on a network that can interact with computer systems or users, imitating or replacing a human user’s behavior, performing repetitive tasks much faster than human users could. Good bots – bots which are useful to businesses they interact with, e.g. search engine bots like Googlebot, Bingbot or bots that operate on social media platforms like Facebook Bot. Bad bots – bots which are designed to perform malicious actions, ultimately hurting businesses, e.g. credential stuffing bots, third-party scraping bots, spam bots and sneakerbots. Bot Management – blocking undesired or malicious Internet bot traffic while still allowing useful bots to access web properties by detecting bot activity, discerning between desirable and undesirable bot behavior, and identifying the sources of the undesirable activity. WAF – a security system that monitors and controls network traffic based on a set of security rules. ### Gathering requirements Cloudflare has been stopping malicious bots from accessing websites or misusing APIs from the very beginning, at the same time helping the climate by offsetting the carbon costs from the bots. Over time it became clear that we needed a dedicated platform which would unite different bot fighting techniques and streamline the customer experience. In designing this new platform, we tried to fulfill the following key requirements. • Complete, not complex – customers can turn on/off Bot Management with a single click of a button, to protect their websites, mobile applications, or APIs. • Trustworthy – customers want to know whether they can trust the website visitor is who they say they are and provide a certainty indicator for that trust level. • Flexible – customers should be able to define what subset of the traffic Bot Management mitigations should be applied to, e.g. only login URLs, pricing pages or sitewide. • Accurate – Bot Management detections should have a very small error, e.g. none or very few human visitors ever should be mistakenly identified as bots. • Recoverable – in case a wrong prediction was made, human visitors still should be able to access websites as well as good bots being let through. Moreover, the goal for new Bot Management product was to make it work well on the following use cases: ### Technical requirements Additionally to the product requirements above, we engineers had a list of must-haves for the new Bot Management platform. The most critical were: • Scalability – the platform should be able to calculate a score on every request, even at over 10 million requests per second. • Low latency – detections must be performed extremely quickly, not slowing down request processing by more than 100 microseconds, and not requiring additional hardware. • Configurability – it should be possible to configure what detections are applied on what traffic, including on per domain/data center/server level. • Modifiability – the platform should be easily extensible with more detection mechanisms, different mitigation actions, richer analytics and logs. • Security – no sensitive information from one customer should be used to build models that protect another customer. • Explainability & debuggability – we should be able to explain and tune predictions in an intuitive way. Equipped with these requirements, back in 2018, our small team of engineers got to work to design and build the next generation of Cloudflare Bot Management. ### Meet the Score “Simplicity is the ultimate sophistication.” – Leonardo Da Vinci Cloudflare operates on a vast scale. At the time of this writing, this means covering 26M+ Internet properties, processing on average 11M requests per second (with peaks over 14M), and examining more than 250 request attributes from different protocol levels. The key question is how to harness the power of such “gargantuan” data to protect all of our customers from modern day cyberthreats in a simple, reliable and explainable way? Bot management is hard. Some bots are much harder to detect and require looking at multiple dimensions of request attributes over a long time, and sometimes a single request attribute could give them away. More signals may help, but are they generalizable? When we classify traffic, should customers decide what to do with it or are there decisions we can make on behalf of the customer? What concept could possibly address all these uncertainty problems and also help us to deliver on the requirements from above? As you might’ve guessed from the section title, we came up with the concept of Trusted Score or simply The Scoreone thing to rule them all – indicating the likelihood between 0 and 100 whether a request originated from a human (high score) vs. an automated program (low score). Okay, let’s imagine that we are able to assign such a score on every incoming HTTP/HTTPS request, what are we or the customer supposed to do with it? Maybe it’s enough to provide such a score in the logs. Customers could then analyze them on their end, find the most frequent IPs with the lowest scores, and then use the Cloudflare Firewall to block those IPs. Although useful, such a process would be manual, prone to error and most importantly cannot be done in real time to protect the customer’s Internet property. Fortunately, around the same time we started worked on this system , our colleagues from the Firewall team had just announced Firewall Rules. This new capability provided customers the ability to control requests in a flexible and intuitive way, inspired by the widely known Wireshark® language. Firewall rules supported a variety of request fields, and we thought – why not have the score be one of these fields? Customers could then write granular rules to block very specific attack types. That’s how the cf.bot_management.score field was born. Having a score in the heart of Cloudflare Bot Management addressed multiple product and technical requirements with one strike – it’s simple, flexible, configurable, and it provides customers with telemetry about bots on a per request basis. Customers can adjust the score threshold in firewall rules, depending on their sensitivity to false positives/negatives. Additionally, this intuitive score allows us to extend our detection capabilities under the hood without customers needing to adjust any configuration. So how can we produce this score and how hard is it? Let’s explore it in the following section. ### Architecture overview What is powering the Bot Management score? The short answer is a set of microservices. Building this platform we tried to re-use as many pipelines, databases and components as we could, however many services had to be built from scratch. Let’s have a look at overall architecture (this overly simplified version contains Bot Management related services): ### Core Bot Management services In a nutshell our systems process data received from the edge data centers, produce and store data required for bot detection mechanisms using the following technologies: • Databases & data storesKafka, ClickHouse, Postgres, Redis, Ceph. • Programming languages – Go, Rust, Python, Java, Bash. • Configuration & schema management – Salt, Quicksilver, Cap’n Proto. • Containerization – Docker, Kubernetes, Helm, Mesos/Marathon. Each of these services is built with resilience, performance, observability and security in mind. ### Edge Bot Management module All bot detection mechanisms are applied on every request in real-time during the request processing stage in the Bot Management module running on every machine at Cloudflare’s edge locations. When a request comes in we extract and transform the required request attributes and feed them to our detection mechanisms. The Bot Management module produces the following output: Firewall fieldsBot Management fields cf.bot_management.score – an integer indicating the likelihood between 0 and 100 whether a request originated from an automated program (low score) to a human (high score). cf.bot_management.verified_bot – a boolean indicating whether such request comes from a Cloudflare whitelisted bot. cf.bot_management.static_resource – a boolean indicating whether request matches file extensions for many types of static resources. Cookies – most notably it produces cf_bm, which helps manage incoming traffic that matches criteria associated with bots. JS challenges – for some of our detections and customers we inject into invisible JavaScript challenges, providing us with more signals for bot detection. Detection logs – we log through our data pipelines to ClickHouse details about each applied detection, used features and flags, some of which are used for analytics and customer logs, while others are used to debug and improve our models. Once the Bot Management module has produced the required fields, the Firewall takes over the actual bot mitigation. ### Firewall integration The Cloudflare Firewall’s intuitive dashboard enables users to build powerful rules through easy clicks and also provides Terraform integration. Every request to the firewall is inspected against the rule engine. Suspicious requests can be blocked, challenged or logged as per the needs of the user while legitimate requests are routed to the destination, based on the score produced by the Bot Management module and the configured threshold. Firewall rules provide the following bot mitigation actions: • Log – records matching requests in the Cloudflare Logs provided to customers. • Bypass – allows customers to dynamically disable Cloudflare security features for a request. • Allow – matching requests are exempt from challenge and block actions triggered by other Firewall Rules content. • Challenge (Captcha) – useful for ensuring that the visitor accessing the site is human, and not automated. • JS Challenge – useful for ensuring that bots and spam cannot access the requested resource; browsers, however, are free to satisfy the challenge automatically. • Block – matching requests are denied access to the site. Our Firewall Analytics tool, powered by ClickHouse and GraphQL API, enables customers to quickly identify and investigate security threats using an intuitive interface. In addition to analytics, we provide detailed logs on all bots-related activity using either the Logpull API and/or LogPush, which provides the easy way to get your logs to your cloud storage. ### Cloudflare Workers integration In case a customer wants more flexibility on what to do with the requests based on the score, e.g. they might want to inject new, or change existing, HTML page content, or serve incorrect data to the bots, or stall certain requests, Cloudflare Workers provide an option to do that. For example, using this small code-snippet, we can pass the score back to the origin server for more advanced real-time analysis or mitigation: addEventListener('fetch', event => { event.respondWith(handleRequest(event.request)) }) async function handleRequest(request) { request = new Request(request); request.headers.set("Cf-Bot-Score", request.cf.bot_management.score) return fetch(request); }  Now let’s have a look into how a single score is produced using multiple detection mechanisms. ### Detection mechanisms The Cloudflare Bot Management platform currently uses five complementary detection mechanisms, producing their own scores, which we combine to form the single score going to the Firewall. Most of the detection mechanisms are applied on every request, while some are enabled on a per customer basis to better fit their needs. Having a score on every request for every customer has the following benefits: • Ease of onboarding – even before we enable Bot Management in active mode, we’re able to tell how well it’s going to work for the specific customer, including providing historical trends about bot activity. • Feedback loop – availability of the score on every request along with all features has tremendous value for continuous improvement of our detection mechanisms. • Ensures scaling – if we can compute for score every request and customer, it means that every Internet property behind Cloudflare is a potential Bot Management customer. • Global bot insights – Cloudflare is sitting in front of more than 26M+ Internet properties, which allows us to understand and react to the tectonic shifts happening in security and threat intelligence over time. Overall globally, more than third of the Internet traffic visible to Cloudflare is coming from bad bots, while Bot Management customers have the ratio of bad bots even higher at ~43%! Let’s dive into specific detection mechanisms in chronological order of their integration with Cloudflare Bot Management. ### Machine learning The majority of decisions about the score are made using our machine learning models. These were also the first detection mechanisms to produce a score and to on-board customers back in 2018. The successful application of machine learning requires data high in Quantity, Diversity, and Quality, and thanks to both free and paid customers, Cloudflare has all three, enabling continuous learning and improvement of our models for all of our customers. At the core of the machine learning detection mechanism is CatBoost – a high-performance open source library for gradient boosting on decision trees. The choice of CatBoost was driven by the library’s outstanding capabilities: • Categorical features support – allowing us to train on even very high cardinality features. • Superior accuracy – allowing us to reduce overfitting by using a novel gradient-boosting scheme. • Inference speed – in our case it takes less than 50 microseconds to apply any of our models, making sure request processing stays extremely fast. • C and Rust API – most of our business logic on the edge is written using Lua, more specifically LuaJIT, so having a compatible FFI interface to be able to apply models is fantastic. There are multiple CatBoost models run on Cloudflare’s Edge in the shadow mode on every request on every machine. One of the models is run in active mode, which influences the final score going to Firewall. All ML detection results and features are logged and recorded in ClickHouse for further analysis, model improvement, analytics and customer facing logs. We feed both categorical and numerical features into our models, extracted from request attributes and inter-request features built using those attributes, calculated and delivered by the Gagarin inter-requests features platform. We’re able to deploy new ML models in a matter of seconds using an extremely reliable and performant Quicksilver configuration database. The same mechanism can be used to configure which version of an ML model should be run in active mode for a specific customer. A deep dive into our machine learning detection mechanism deserves a blog post of its own and it will cover how do we train and validate our models on trillions of requests using GPUs, how model feature delivery and extraction works, and how we explain and debug model predictions both internally and externally. ### Heuristics engine Not all problems in the world are the best solved with machine learning. We can tweak the ML models in various ways, but in certain cases they will likely underperform basic heuristics. Often the problems machine learning is trying to solve are not entirely new. When building the Bot Management solution it became apparent that sometimes a single attribute of the request could give a bot away. This means that we can create a bunch of simple rules capturing bots in a straightforward way, while also ensuring lowest false positives. The heuristics engine was the second detection mechanism integrated into the Cloudflare Bot Management platform in 2019 and it’s also applied on every request. We have multiple heuristic types and hundreds of specific rules based on certain attributes of the request, some of which are very hard to spoof. When any of the requests matches any of the heuristics – we assign the lowest possible score of 1. The engine has the following properties: • Speed – if ML model inference takes less than 50 microseconds per model, hundreds of heuristics can be applied just under 20 microseconds! • Deployability – the heuristics engine allows us to add new heuristic in a matter of seconds using Quicksilver, and it will be applied on every request. • Vast coverage – using a set of simple heuristics allows us to classify ~15% of global traffic and ~30% of Bot Management customers’ traffic as bots. Not too bad for a few if conditions, right? • Lowest false positives – because we’re very sure and conservative on the heuristics we add, this detection mechanism has the lowest FP rate among all detection mechanisms. • Labels for ML – because of the high certainty we use requests classified with heuristics to train our ML models, which then can generalize behavior learnt from from heuristics and improve detections accuracy. So heuristics gave us a lift when tweaked with machine learning and they contained a lot of the intuition about the bots, which helped to advance the Cloudflare Bot Management platform and allowed us to onboard more customers. ### Behavioral analysis Machine learning and heuristics detections provide tremendous value, but both of them require human input on the labels, or basically a teacher to distinguish between right and wrong. While our supervised ML models can generalize well enough even on novel threats similar to what we taught them on, we decided to go further. What if there was an approach which doesn’t require a teacher, but rather can learn to distinguish bad behavior from the normal behavior? Enter the behavioral analysis detection mechanism, initially developed in 2018 and integrated with the Bot Management platform in 2019. This is an unsupervised machine learning approach, which has the following properties: • Fitting specific customer needs – it’s automatically enabled for all Bot Management customers, calculating and analyzing normal visitor behavior over an extended period of time. • Detects bots never seen before – as it doesn’t use known bot labels, it can detect bots and anomalies from the normal behavior on specific customer’s website. • Harder to evade – anomalous behavior is often a direct result of the bot’s specific goal. Please stay tuned for a more detailed blog about behavioral analysis models and the platform powering this incredible detection mechanism, protecting many of our customers from unseen attacks. ### Verified bots So far we’ve discussed how to detect bad bots and humans. What about good bots, some of which are extremely useful for the customer website? Is there a need for a dedicated detection mechanism or is there something we could use from previously described detection mechanisms? While the majority of good bot requests (e.g. Googlebot, Bingbot, LinkedInbot) already have low score produced by other detection mechanisms, we also need a way to avoid accidental blocks of useful bots. That’s how the Firewall field cf.bot_management.verified_bot came into existence in 2019, allowing customers to decide for themselves whether they want to let all of the good bots through or restrict access to certain parts of the website. The actual platform calculating Verified Bot flag deserves a detailed blog on its own, but in the nutshell it has the following properties: • Validator based approach – we support multiple validation mechanisms, each of them allowing us to reliably confirm good bot identity by clustering a set of IPs. • Reverse DNS validator – performs a reverse DNS check to determine whether or not a bots IP address matches its alleged hostname. • ASN Block validator – similar to rDNS check, but performed on ASN block. • Downloader validator – collects good bot IPs from either text files or HTML pages hosted on bot owner sites. • Machine learning validator – uses an unsupervised learning algorithm, clustering good bot IPs which are not possible to validate through other means. • Bots Directory – a database with UI that stores and manages bots that pass through the Cloudflare network. Using multiple validation methods listed above, the Verified Bots detection mechanism identifies hundreds of unique good bot identities, belonging to different companies and categories. ### JS fingerprinting When it comes to Bot Management detection quality it’s all about the signal quality and quantity. All previously described detections use request attributes sent over the network and analyzed on the server side using different techniques. Are there more signals available, which can be extracted from the client to improve our detections? As a matter of fact there are plenty, as every browser has unique implementation quirks. Every web browser graphics output such as canvas depends on multiple layers such as hardware (GPU) and software (drivers, operating system rendering). This highly unique output allows precise differentiation between different browser/device types. Moreover, this is achievable without sacrificing website visitor privacy as it’s not a supercookie, and it cannot be used to track and identify individual users, but only to confirm that request’s user agent matches other telemetry gathered through browser canvas API. This detection mechanism is implemented as a challenge-response system with challenge injected into the webpage on Cloudflare’s edge. The challenge is then rendered in the background using provided graphic instructions and the result sent back to Cloudflare for validation and further action such as producing the score. There is a lot going on behind the scenes to make sure we get reliable results without sacrificing users’ privacy while being tamper resistant to replay attacks. The system is currently in private beta and being evaluated for its effectiveness and we already see very promising results. Stay tuned for this new detection mechanism becoming widely available and the blog on how we’ve built it. This concludes an overview of the five detection mechanisms we’ve built so far. It’s time to sum it all up! ### Summary Cloudflare has the unique ability to collect data from trillions of requests flowing through its network every week. With this data, Cloudflare is able to identify likely bot activity with Machine Learning, Heuristics, Behavioral Analysis, and other detection mechanisms. Cloudflare Bot Management integrates seamlessly with other Cloudflare products, such as WAF and Workers. All this could not be possible without hard work across multiple teams! First of all thanks to everybody on the Bots Team for their tremendous efforts to make this platform come to life. Other Cloudflare teams, most notably: Firewall, Data, Solutions Engineering, Performance, SRE, helped us a lot to design, build and support this incredible platform. Lastly, there are more blogs from the Bots series coming soon, diving into internals of our detection mechanisms, so stay tuned for more exciting stories about Cloudflare Bot Management! # The History of the URL Post Syndicated from Zack Bloom original https://blog.cloudflare.com/the-history-of-the-url/ On the 11th of January 1982 twenty-two computer scientists met to discuss an issue with ‘computer mail’ (now known as email). Attendees included the guy who would create Sun Microsystems, the guy who made Zork, the NTP guy, and the guy who convinced the government to pay for Unix. The problem was simple: there were 455 hosts on the ARPANET and the situation was getting out of control. This issue was occuring now because the ARPANET was on the verge of switching from its original NCP protocol, to the TCP/IP protocol which powers what we now call the Internet. With that switch suddenly there would be a multitude of interconnected networks (an ‘Inter… net’) requiring a more ‘hierarchical’ domain system where ARPANET could resolve its own domains while the other networks resolved theirs. Other networks at the time had great names like “COMSAT”, “CHAOSNET”, “UCLNET” and “INTELPOSTNET” and were maintained by groups of universities and companies all around the US who wanted to be able to communicate, and could afford to lease 56k lines from the phone company and buy the requisite PDP-11s to handle routing. In the original ARPANET design, a central Network Information Center (NIC) was responsible for maintaining a file listing every host on the network. The file was known as the HOSTS.TXT file, similar to the /etc/hosts file on a Linux or OS X system today. Every network change would require the NIC to FTP (a protocol invented in 1971) to every host on the network, a significant load on their infrastructure. Having a single file list every host on the Internet would, of course, not scale indefinitely. The priority was email, however, as it was the predominant addressing challenge of the day. Their ultimate conclusion was to create a hierarchical system in which you could query an external system for just the domain or set of domains you needed. In their words: “The conclusion in this area was that the current ‘[email protected]’ mailbox identifier should be extended to ‘[email protected]’ where ‘domain’ could be a hierarchy of domains.” And the domain was born. It’s important to dispel any illusion that these decisions were made with prescience for the future the domain name would have. In fact, their elected solution was primarily decided because it was the “one causing least difficulty for existing systems.” For example, one proposal was for email addresses to be of the form <user>.<host>@<domain>. If email usernames of the day hadn’t already had ‘.’ characters you might be emailing me at ‘[email protected]’ today. ### UUCP and the Bang Path It has been said that the principal function of an operating system is to define a number of different names for the same object, so that it can busy itself keeping track of the relationship between all of the different names. Network protocols seem to have somewhat the same characteristic. — David D. Clark, 1982 Another failed proposal involved separating domain components with the exclamation mark (!). For example, to connect to the ISIA host on ARPANET, you would connect to !ARPA!ISIA. You could then query for hosts using wildcards, so !ARPA!* would return to you every ARPANET host. This method of addressing wasn’t a crazy divergence from the standard, it was an attempt to maintain it. The system of exclamation separated hosts dates to a data transfer tool called UUCP created in 1976. If you’re reading this on an OS X or Linux computer, uucp is likely still installed and available at the terminal. ARPANET was introduced in 1969, and quickly became a powerful communication tool… among the handful of universities and government institutions which had access to it. The Internet as we know it wouldn’t become publically available outside of research institutions until 1991, twenty one years later. But that didn’t mean computer users weren’t communicating. In the era before the Internet, the general method of communication between computers was with a direct point-to-point dial up connection. For example, if you wanted to send me a file, you would have your modem call my modem, and we would transfer the file. To craft this into a network of sorts, UUCP was born. In this system, each computer has a file which lists the hosts its aware of, their phone number, and a username and password on that host. You then craft a ‘path’, from your current machine to your destination, through hosts which each know how to connect to the next: sw-hosts!digital-lobby!zack This address would form not just a method of sending me files or connecting with my computer directly, but also would be my email address. In this era before ‘mail servers’, if my computer was off you weren’t sending me an email. While use of ARPANET was restricted to top-tier universities, UUCP created a bootleg Internet for the rest of us. It formed the basis for both Usenet and the BBS system. ### DNS Ultimately, the DNS system we still use today would be proposed in 1983. If you run a DNS query today, for example using the dig tool, you’ll likely see a response which looks like this: ;; ANSWER SECTION: google.com. 299 IN A 172.217.4.206  This is informing us that google.com is reachable at 172.217.4.206. As you might know, the A is informing us that this is an ‘address’ record, mapping a domain to an IPv4 address. The 299 is the ‘time to live’, letting us know how many more seconds this value will be valid for, before it should be queried again. But what does the IN mean? IN stands for ‘Internet’. Like so much of this, the field dates back to an era when there were several competing computer networks which needed to interoperate. Other potential values were CH for the CHAOSNET or HS for Hesiod which was the name service of the Athena system. CHAOSNET is long dead, but a much evolved version of Athena is still used by students at MIT to this day. You can find the list of DNS classes on the IANA website, but it’s no surprise only one potential value is in common use today. ### TLDs It is extremely unlikely that any other TLDs will be created. — John Postel, 1994 Once it was decided that domain names should be arranged hierarchically, it became necessary to decide what sits at the root of that hierarchy. That root is traditionally signified with a single ‘.’. In fact, ending all of your domain names with a ‘.’ is semantically correct, and will absolutely work in your web browser: google.com. The first TLD was .arpa. It allowed users to address their old traditional ARPANET hostnames during the transition. For example, if my machine was previously registered as hfnet, my new address would be hfnet.arpa. That was only temporary, during the transition, server administrators had a very important choice to make: which of the five TLDs would they assume? “.com”, “.gov”, “.org”, “.edu” or “.mil”. When we say DNS is hierarchical, what we mean is there is a set of root DNS servers which are responsible for, for example, turning .com into the .com nameservers, who will in turn answer how to get to google.com. The root DNS zone of the internet is composed of thirteen DNS server clusters. There are only 13 server clusters, because that’s all we can fit in a single UDP packet. Historically, DNS has operated through UDP packets, meaning the response to a request can never be more than 512 bytes. ; This file holds the information on root name servers needed to ; initialize cache of Internet domain name servers ; (e.g. reference this file in the "cache . " ; configuration file of BIND domain name servers). ; ; This file is made available by InterNIC ; under anonymous FTP as ; file /domain/named.cache ; on server FTP.INTERNIC.NET ; -OR- RS.INTERNIC.NET ; ; last update: March 23, 2016 ; related version of root zone: 2016032301 ; ; formerly NS.INTERNIC.NET ; . 3600000 NS A.ROOT-SERVERS.NET. A.ROOT-SERVERS.NET. 3600000 A 198.41.0.4 A.ROOT-SERVERS.NET. 3600000 AAAA 2001:503:ba3e::2:30 ; ; FORMERLY NS1.ISI.EDU ; . 3600000 NS B.ROOT-SERVERS.NET. B.ROOT-SERVERS.NET. 3600000 A 192.228.79.201 B.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:84::b ; ; FORMERLY C.PSI.NET ; . 3600000 NS C.ROOT-SERVERS.NET. C.ROOT-SERVERS.NET. 3600000 A 192.33.4.12 C.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:2::c ; ; FORMERLY TERP.UMD.EDU ; . 3600000 NS D.ROOT-SERVERS.NET. D.ROOT-SERVERS.NET. 3600000 A 199.7.91.13 D.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:2d::d ; ; FORMERLY NS.NASA.GOV ; . 3600000 NS E.ROOT-SERVERS.NET. E.ROOT-SERVERS.NET. 3600000 A 192.203.230.10 ; ; FORMERLY NS.ISC.ORG ; . 3600000 NS F.ROOT-SERVERS.NET. F.ROOT-SERVERS.NET. 3600000 A 192.5.5.241 F.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:2f::f ; ; FORMERLY NS.NIC.DDN.MIL ; . 3600000 NS G.ROOT-SERVERS.NET. G.ROOT-SERVERS.NET. 3600000 A 192.112.36.4 ; ; FORMERLY AOS.ARL.ARMY.MIL ; . 3600000 NS H.ROOT-SERVERS.NET. H.ROOT-SERVERS.NET. 3600000 A 198.97.190.53 H.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:1::53 ; ; FORMERLY NIC.NORDU.NET ; . 3600000 NS I.ROOT-SERVERS.NET. I.ROOT-SERVERS.NET. 3600000 A 192.36.148.17 I.ROOT-SERVERS.NET. 3600000 AAAA 2001:7fe::53 ; ; OPERATED BY VERISIGN, INC. ; . 3600000 NS J.ROOT-SERVERS.NET. J.ROOT-SERVERS.NET. 3600000 A 192.58.128.30 J.ROOT-SERVERS.NET. 3600000 AAAA 2001:503:c27::2:30 ; ; OPERATED BY RIPE NCC ; . 3600000 NS K.ROOT-SERVERS.NET. K.ROOT-SERVERS.NET. 3600000 A 193.0.14.129 K.ROOT-SERVERS.NET. 3600000 AAAA 2001:7fd::1 ; ; OPERATED BY ICANN ; . 3600000 NS L.ROOT-SERVERS.NET. L.ROOT-SERVERS.NET. 3600000 A 199.7.83.42 L.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:9f::42 ; ; OPERATED BY WIDE ; . 3600000 NS M.ROOT-SERVERS.NET. M.ROOT-SERVERS.NET. 3600000 A 202.12.27.33 M.ROOT-SERVERS.NET. 3600000 AAAA 2001:dc3::35 ; End of file  Root DNS servers operate in safes, inside locked cages. A clock sits on the safe to ensure the camera feed hasn’t been looped. Particularily given how slow DNSSEC implementation has been, an attack on one of those servers could allow an attacker to redirect all of the Internet traffic for a portion of Internet users. This, of course, makes for the most fantastic heist movie to have never been made. Unsurprisingly, the nameservers for top-level TLDs don’t actually change all that often. 98% of the requests root DNS servers receive are in error, most often because of broken and toy clients which don’t properly cache their results. This became such a problem that several root DNS operators had to spin up special servers just to return ‘go away’ to all the people asking for reverse DNS lookups on their local IP addresses. The TLD nameservers are administered by different companies and governments all around the world (Verisign manages .com). When you purchase a .com domain, about$0.18 goes to the ICANN, and 7.85 goes to Verisign. ### Punycode It is rare in this world that the silly name us developers think up for a new project makes it into the final, public, product. We might name the company database Delaware (because that’s where all the companies are registered), but you can be sure by the time it hits production it will be CompanyMetadataDatastore. But rarely, when all the stars align and the boss is on vacation, one slips through the cracks. Punycode is the system we use to encode unicode into domain names. The problem it is solving is simple, how do you write 比薩.com when the entire internet system was built around using the ASCII alphabet whose most foreign character is the tilde? It’s not a simple matter of switching domains to use unicode. The original documents which govern domains specify they are to be encoded in ASCII. Every piece of internet hardware from the last fourty years, including the Cisco and Juniper routers used to deliver this page to you make that assumption. The web itself was never ASCII-only. It was actually originally concieved to speak ISO 8859-1 which includes all of the ASCII characters, but adds an additional set of special characters like ¼ and letters with special marks like ä. It does not, however, contain any non-Latin characters. This restriction on HTML was ultimately removed in 2007 and that same year Unicode became the most popular character set on the web. But domains were still confined to ASCII. As you might guess, Punycode was not the first proposal to solve this problem. You most likely have heard of UTF-8, which is a popular way of encoding Unicode into bytes (the 8 is for the eight bits in a byte). In the year 2000 several members of the Internet Engineering Task Force came up with UTF-5. The idea was to encode Unicode into five bit chunks. You could then map each five bits into a character allowed (A-V & 0-9) in domain names. So if I had a website for Japanese language learning, my site 日本語.com would become the cryptic M5E5M72COA9E.com. This encoding method has several disadvantages. For one, A-V and 0-9 are used in the output encoding, meaning if you wanted to actually include one of those characters in your doman, it had to be encoded like everything else. This made for some very long domains, which is a serious problem when each segment of a domain is restricted to 63 characters. A domain in the Myanmar language would be restricted to no more than 15 characters. The proposal does make the very interesting suggestion of using UTF-5 to allow Unicode to be transmitted by Morse code and telegram though. There was also the question of how to let clients know that this domain was encoded so they could display them in the appropriate Unicode characters, rather than showing M5E5M72COA9E.com in my address bar. There were several suggestions, one of which was to use an unused bit in the DNS response. It was the “last unused bit in the header”, and the DNS folks were “very hesitant to give it up” however. Another suggestion was to start every domain using this encoding method with ra--. At the time (mid-April 2000), there were no domains which happened to start with those particular characters. If I know anything about the Internet, someone registered an ra-- domain out of spite immediately after the proposal was published. The ultimate conclusion, reached in 2003, was to adopt a format called Punycode which included a form of delta compression which could dramatically shorten encoded domain names. Delta compression is a particularily good idea because the odds are all of the characters in your domain are in the same general area within Unicode. For example, two characters in Farsi are going to be much closer together than a Farsi character and another in Hindi. To give an example of how this works, if we take the nonsense phrase: يذؽ In an uncompressed format, that would be stored as the three characters [1610, 1584, 1597] (based on their Unicode code points). To compress this we first sort it numerically (keeping track of where the original characters were): [1584, 1597, 1610]. Then we can store the lowest value (1584), and the delta between that value and the next character (13), and again for the following character (23), which is significantly less to transmit and store. Punycode then (very) efficiently encodes those integers into characters allowed in domain names, and inserts an xn-- at the beginning to let consumers know this is an encoded domain. You’ll notice that all the Unicode characters end up together at the end of the domain. They don’t just encode their value, they also encode where they should be inserted into the ASCII portion of the domain. To provide an example, the website 熱狗sales.com becomes xn--sales-r65lm0e.com. Anytime you type a Unicode-based domain name into your browser’s address bar, it is encoded in this way. This transformation could be transparent, but that introduces a major security problem. All sorts of Unicode characters print identically to existing ASCII characters. For example, you likely can’t see the difference between Cyrillic small letter a (“а”) and Latin small letter a (“a”). If I register Cyrillic аmazon.com (xn--mazon-3ve.com), and manage to trick you into visiting it, it’s gonna be hard to know you’re on the wrong site. For that reason, when you visit 🍕💩.ws, your browser somewhat lamely shows you xn--vi8hiv.ws in the address bar. ### Protocol The first portion of the URL is the protocol which should be used to access it. The most common protocol is http, which is the simple document transfer protocol Tim Berners-Lee invented specifically to power the web. It was not the only option. Some people believed we should just use Gopher. Rather than being general-purpose, Gopher is specifically designed to send structured data similar to how a file tree is structured. For example, if you request the /Cars endpoint, it might return: 1Chevy Camaro /Archives/cars/cc gopher.cars.com 70 iThe Camero is a classic fake (NULL) 0 iAmerican Muscle car fake (NULL) 0 1Ferrari 451 /Factbook/ferrari/451 gopher.ferrari.net 70  which identifies two cars, along with some metadata about them and where you can connect to for more information. The understanding was your client would parse this information into a usable form which linked the entries with the destination pages. The first popular protocol was FTP, which was created in 1971, as a way of listing and downloading files on remote computers. Gopher was a logical extension of this, in that it provided a similar listing, but included facilities for also reading the metadata about entries. This meant it could be used for more liberal purposes like a news feed or a simple database. It did not have, however, the freedom and simplicity which characterizes HTTP and HTML. HTTP is a very simple protocol, particularily when compared to alternatives like FTP or even the HTTP/3 protocol which is rising in popularity today. First off, HTTP is entirely text based, rather than being composed of bespoke binary incantations (which would have made it significantly more efficient). Tim Berners-Lee correctly intuited that using a text-based format would make it easier for generations of programmers to develop and debug HTTP-based applications. HTTP also makes almost no assumptions about what you’re transmitting. Despite the fact that it was invented expliticly to accompany the HTML language, it allows you to specify that your content is of any type (using the MIME Content-Type, which was a new invention at the time). The protocol itself is rather simple: A request: GET /index.html HTTP/1.1 Host: www.example.com  Might respond: HTTP/1.1 200 OK Date: Mon, 23 May 2005 22:38:34 GMT Content-Type: text/html; charset=UTF-8 Content-Encoding: UTF-8 Content-Length: 138 Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux) ETag: "3f80f-1b6-3e1cb03b" Accept-Ranges: bytes Connection: close <html> <head> <title>An Example Page</title> </head> <body> Hello World, this is a very simple HTML document. </body> </html>  To put this in context, you can think of the networking system the Internet uses as starting with IP, the Internet Protocol. IP is responsible for getting a small packet of data (around 1500 bytes) from one computer to another. On top of that we have TCP, which is responsible for taking larger blocks of data like entire documents and files and sending them via many IP packets reliably. On top of that, we then implement a protocol like HTTP or FTP, which specifies what format should be used to make the data we send via TCP (or UDP, etc.) understandable and meaningful. In other words, TCP/IP sends a whole bunch of bytes to another computer, the protocol says what those bytes should be and what they mean. You can make your own protocol if you like, assemblying the bytes in your TCP messages however you like. The only requirement is that whoever you are talking to speaks the same language. For this reason, it’s common to standardize these protocols. There are, of course, many less important protocols to play with. For example there is a Quote of The Day protocol (port 17), and a Random Characters protocol (port 19). They may seem silly today, but they also showcase just how important that a general-purpose document transmission format like HTTP was. ### Port The timeline of Gopher and HTTP can be evidenced by their default port numbers. Gopher is 70, HTTP 80. The HTTP port was assigned (likely by Jon Postel at the IANA) at the request of Tim Berners-Lee sometime between 1990 and 1992. This concept, of registering ‘port numbers’ predates even the Internet. In the original NCP protocol which powered the ARPANET remote addresses were identified by 40 bits. The first 32 identified the remote host, similar to how an IP address works today. The last eight were known as the AEN (it stood for “Another Eight-bit Number”), and were used by the remote machine in the way we use a port number, to separate messages destined for different processes. In other words, the address specifies which machine the message should go to, and the AEN (or port number) tells that remote machine which application should get the message. They quickly requested that users register these ‘socket numbers’ to limit potential collisions. When port numbers were expanded to 16 bits by TCP/IP, that registration process was continued. While protocols have a default port, it makes sense to allow ports to also be specified manually to allow for local development and the hosting of multiple services on the same machine. That same logic was the basis for prefixing websites with www.. At the time, it was unlikely anyone was getting access to the root of their domain, just for hosting an ‘experimental’ website. But if you give users the hostname of your specific machine (dx3.cern.ch), you’re in trouble when you need to replace that machine. By using a common subdomain (www.cern.ch) you can change what it points to as needed. ### The Bit In-between As you probably know, the URL syntax places a double slash (//) between the protocol and the rest of the URL: http://cloudflare.com  That double slash was inherited from the Apollo computer system which was one of the first networked workstations. The Apollo team had a similar problem to Tim Berners-Lee: they needed a way to separate a path from the machine that path is on. Their solution was to create a special path format: //computername/file/path/as/usual  And TBL copied that scheme. Incidentally, he now regrets that decision, wishing the domain (in this case example.com) was the first portion of the path: http:com/example/foo/bar/baz  URLs were never intended to be what they’ve become: an arcane way for a user to identify a site on the Web. Unfortunately, we’ve never been able to standardize URNs, which would give us a more useful naming system. Arguing that the current URL system is sufficient is like praising the DOS command line, and stating that most people should simply learn to use command line syntax. The reason we have windowing systems is to make computers easier to use, and more widely used. The same thinking should lead us to a superior way of locating specific sites on the Web. — Dale Dougherty 1996 There are several different ways to understand the ‘Internet’. One is as a system of computers connected using a computer network. That version of the Internet came into being in 1969 with the creation of the ARPANET. Mail, files and chat all moved over that network before the creation of HTTP, HTML, or the ‘web browser’. In 1992 Tim Berners-Lee created three things, giving birth to what we consider the Internet. The HTTP protocol, HTML, and the URL. His goal was to bring ‘Hypertext’ to life. Hypertext at its simplest is the ability to create documents which link to one another. At the time it was viewed more as a science fiction panacea, to be complimented by Hypermedia, and any other word you could add ‘Hyper’ in front of. The key requirement of Hypertext was the ability to link from one document to another. In TBL’s time though, these documents were hosted in a multitude of formats and accessed through protocols like Gopher and FTP. He needed a consistent way to refer to a file which encoded its protocol, its host on the Internet, and where it existed on that host. At the original World-Wide Web presentation in March of 1992 TBL described it as a ‘Universal Document Identifier’ (UDI). Many different formats were considered for this identifier: protocol: aftp host: xxx.yyy.edu path: /pub/doc/README PR=aftp; H=xx.yy.edu; PA=/pub/doc/README; PR:aftp/xx.yy.edu/pub/doc/README /aftp/xx.yy.edu/pub/doc/README  This document also explains why spaces must be encoded in URLs (%20): The use of white space characters has been avoided in UDIs: spaces > are not legal characters. This was done because of the frequent > introduction of extraneous white space when lines are wrapped by > systems such as mail, or sheer necessity of narrow column width, and > because of the inter-conversion of various forms of white space > which occurs during character code conversion and the transfer of > text between applications. What’s most important to understand is that the URL was fundamentally just an abbreviated way of refering to the combination of scheme, domain, port, credentials and path which previously had to be understood contextually for each different communication system. It was first officially defined in an RFC published in 1994. scheme:[//[user:[email protected]]host[:port]][/]path[?query][#fragment]  This system made it possible to refer to different systems from within Hypertext, but now that virtually all content is hosted over HTTP, may not be as necessary anymore. As early as 1996 browsers were already inserting the http:// and www. for users automatically (rendering any advertisement which still contains them truly ridiculous). ### Path I do not think the question is whether people can learn the meaning of the URL, I just find it it morally abhorrent to force grandma or grandpa to understand what, in the end, are UNIX file system conventions. — Israel del Rio 1996 The slash separated path component of a URL should be familiar to any user of any computer built in the last fifty years. The hierarchal filesystem itself was introduced by the MULTICS system. Its creator, in turn, attributes it to a two hour conversation with Albert Einstein he had in 1952. MULTICS used the greater than symbol (>) to separated file path components. For example: >usr>bin>local>awk  That was perfectly logical, but unfortunately the Unix folks decided to use > to represent redirection, delegating path separation to the forward slash (/). ### Snapchat the Supreme Court Wrong. We are I now see clearly *disagreeing*. You and I. As a person I reserve the right to use different criteria for different purposes. I want to be able to give names to generic works, AND to particular translations AND to particular versions. I want a richer world than you propose. I don’t want to be constrained by your two-level system of “documents” and “variants”. — Tim Berners-Lee 1993 One half of the URLs referenced by US Supreme Court opinions point to pages which no longer exist. If you were reading an academic paper in 2011, written in 2001, you have better than even odds that any given URL won’t be valid. There was a fervent belief in 1993 that the URL would die, in favor of the ‘URN’. The Uniform Resource Name is a permanent reference to a given piece of content which, unlike a URL, will never change or break. Tim Berners-Lee first described the “urgent need” for them as early as 1991. The simplest way to craft a URN might be to simply use a cryptographic hash of the contents of the page, for example: urn:791f0de3cfffc6ec7a0aacda2b147839. This method doesn’t meet the criteria of the web community though, as it wasn’t really possible to figure out who to ask to turn that hash into a piece of real content. It also didn’t account for the format changes which often happen to files (compressed vs uncompressed for example) which nevertheless represent the same content. In 1996 Keith Shafer and several others proposed a solution to the problem of broken URLs. The link to this solution is now broken. Roy Fielding posted an implementation suggestion in July of 1995. That link is now broken. I was able to find these pages through Google, which has functionally made page titles the URN of today. The URN format was ultimately finalized in 1997, and has essentially never been used since. The implementation is itself interesting. Each URN is composed of two components, an authority who can resolve a given type of URN, and the specific ID of this document in whichever format the authority understands. For example, urn:isbn:0131103628 will identify a book, forming a permanent link which can (hopefully) be turned into a set of URLs by your local isbn resolver. Given the power of search engines, it’s possible the best URN format today would be a simple way for files to point to their former URLs. We could allow the search engines to index this information, and link us as appropriate: <!-- On http://zack.is/history --> <link rel="past-url" href="http://zackbloom.com/history.html"> <link rel="past-url" href="http://zack.is/history.html">  ### Query Params The “application/x-www-form-urlencoded” format is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices. If you’ve used the web for any period of time, you are familiar with query parameters. They follow the path portion of the URL, and encode options like ?name=zack&state=mi. It may seem odd to you that queries use the ampersand character (&) which is the same character used in HTML to encode special characters. In fact, if you’ve used HTML for any period of time, you likely have had to encode ampersands in URLs, turning http://host/?x=1&y=2 into http://host/?x=1&amp;y=2 or http://host?x=1&#38;y=2 (that particular confusion has always existed). You may have also noticed that cookies follow a similar, but different format: x=1;y=2 which doesn’t actually conflict with HTML character encoding at all. This idea was not lost on the W3C, who encouraged implementers to support ; as well as & in query parameters as early as 1995. Originally, this section of the URL was strictly used for searching ‘indexes’. The Web was originally created (and its funding was based on it creating) a method of collaboration for high energy physicists. This is not to say Tim Berners-Lee didn’t know he was really creating a general-purpose communication tool. He didn’t add support for tables for years, which is probably something physicists would have needed. In any case, these ‘physicists’ needed a way of encoding and linking to information, and a way of searching that information. To provide that, Tim Berners-Lee created the <ISINDEX> tag. If <ISINDEX> appeared on a page, it would inform the browser that this is a page which can be searched. The browser should show a search field, and allow the user to send a query to the server. That query was formatted as keywords separated by plus characters (+): http://cernvm/FIND/?sgml+cms  In fantastic Internet fashion, this tag was quickly abused to do all manner of things including providing an input to calculate square roots. It was quickly proposed that perhaps this was too specific, and we really needed a general purpose <input> tag. That particular proposal actually uses plus signs to separate the components of what otherwise looks like a modern GET query: http://somehost.somewhere/some/path?x=xxxx+y=yyyy+z=zzzz  This was far from universally acclaimed. Some believed we needed a way of saying that the content on the other side of links should be searchable: <a HREF="wais://quake.think.com/INFO" INDEX=1>search</a>  Tim Berners-Lee thought we should have a way of defining strongly-typed queries: <ISINDEX TYPE="iana:/www/classes/query/personalinfo">  I can be somewhat confident in saying, in retrospect, I am glad the more generic solution won out. The real work on <INPUT> began in January of 1993 based on an older SGML type. It was (perhaps unfortunately), decided that <SELECT> inputs needed a separate, richer, structure: <select name=FIELDNAME type=CHOICETYPE [value=VALUE] [help=HELPUDI]> <choice>item 1 <choice>item 2 <choice>item 3 </select>  If you’re curious, reusing <li>, rather than introducing the <option> element was absolutely considered. There were, of course, alternative form proposals. One included some variable substituion evocative of what Angular might do today: <ENTRYBLANK TYPE=int LENGTH=length DEFAULT=default VAR=lval>Prompt</ENTRYBLANK> <QUESTION TYPE=float DEFAULT=default VAR=lval>Prompt</QUESTION> <CHOICE DEFAULT=default VAR=lval> <ALTERNATIVE VAL=value1>Prompt1 ... <ALTERNATIVE VAL=valuen>Promptn </CHOICE>  In this example the inputs are checked against the type specified in type, and the VAR values are available on the page for use in string substitution in URLs, à la: http://cloudflare.com/apps/appId


Additional proposals actually used @, rather than =, to separate query components:

[email protected][email protected](value&value)


It was Marc Andreessen who suggested our current method based on what he had already implemented in Mosaic:

name=value&name=value&name=value


Just two months later Mosaic would add support for method=POST forms, and ‘modern’ HTML forms were born.

Of course, it was also Marc Andreessen’s company Netscape who would create the cookie format (using a different separator). Their proposal was itself painfully shortsighted, led to the attempt to introduce a Set-Cookie2 header, and introduced fundamental structural issues we still deal with at Cloudflare to this day.

### Fragments

The portion of the URL following the ‘#’ is known as the fragment. Fragments were a part of URLs since their initial specification, used to link to a specific location on the page being loaded. For example, if I have an anchor on my site:

<a name="bio"></a>


http://zack.is/#bio


This concept was gradually extended to any element (rather than just anchors), and moved to the id attribute rather than name:

<h1 id="bio">Bio</h1>


Tim Berners-Lee decided to use this character based on its connection to addresses in the United States (despite the fact that he’s British by birth). In his words:

In a snail mail address in the US at least, it is common
to use the number sign for an apartment number or suite
number within a building. So 12 Acacia Av #12 means “The
building at 12 Acacia Av, and then within that the unit
known numbered 12”. It seemed to be a natural character
for the task. Now, http://www.example.com/foo#bar means
“Within resource http://www.example.com/foo, the
particular view of it known as bar”.

It turns out that the original Hypertext system, created by Douglas Englebart, also used the ‘#’ character for the same purpose. This may be coincidental or it could be a case of accidental “idea borrowing”.

Fragments are explicitly not included in HTTP requests, meaning they only live inside the browser. This concept proved very valuable when it came time to implement client-side navigation (before pushState was introduced). Fragments were also very valuable when it came time to think about how we can store state in URLs without actually sending it to the server. What could that mean? Let’s explore:

### Molehills and Mountains

There is a whole standard, as yukky as SGML, on Electronic data Intercahnge [sic], meaning forms and form submission. I know no more except it looks like fortran backwards with no spaces.

— Tim Berners-Lee 1993

There is a popular perception that the internet standards bodies didn’t do much from the finalization of HTTP 1.1 and HTML 4.01 in 2002 to when HTML 5 really got on track. This period is also known (only by me) as the Dark Age of XHTML. The truth is though, the standardization folks were fantastically busy. They were just doing things which ultimately didn’t prove all that valuable.

One such effort was the Semantic Web. The dream was to create a Resource Description Framework (editorial note: run away from any team which seeks to create a framework), which would allow metadata about content to be universally expressed. For example, rather than creating a nice web page about my Corvette Stingray, I could make an RDF document describing its size, color, and the number of speeding tickets I had gotten while driving it.

This is, of course, in no way a bad idea. But the format was XML based, and there was a big chicken-and-egg problem between having the entire world documented, and having the browsers do anything useful with that documentation.

It did however provide a powerful environment for philosophical argument. One of the best such arguments lasted at least ten years, and was known by the masterful codename ‘httpRange-14’.

httpRange-14 sought to answer the fundamental question of what a URL is. Does a URL always refer to a document, or can it refer to anything? Can I have a URL which points to my car?

They didn’t attempt to answer that question in any satisfying manner. Instead they focused on how and when we can use 303 redirects to point users from links which aren’t documents to ones which are, and when we can use URL fragments (the bit after the ‘#’) to point users to linked data.

To the pragmatic mind of today, this might seem like a silly question. To many of us, you can use a URL for whatever you manage to use it for, and people will use your thing or they won’t. But the Semantic Web cares for nothing more than semantics, so it was on.

This particular topic was discussed on July 1st 2002, July 15th 2002, July 22nd 2002, July 29th 2002, September 16th 2002, and at least 20 other occasions through 2005. It was resolved by the great ‘httpRange-14 resolution’ of 2005, then reopened by complaints in 2007 and 2011 and a call for new solutions in 2012. The question was heavily discussed by the pedantic web group, which is very aptly named. The one thing which didn’t happen is all that much semantic data getting put on the web behind any sort of URL.

### Auth

As you may know, you can include a username and password in URLs:

http://zack:[email protected]


The browser then encodes this authentication data into Base64, and sends it as a header:

Authentication: Basic emFjazpzaGhoaGho


The only reason for the Base64 encoding is to allow characters which might not be valid in a header, it provides no obscurity to the username and password values.

Particularily over the pre-SSL internet, this was very problematic. Anyone who could snoop on your connection could easily see your password. Many alternatives were proposed including Kerberos which is a widely used security protocol both then and now.

As with so many of these examples though, the simple basic auth proposal was easiest for browser manufacturers (Mosaic) to implement. This made it the first, and ultimately the only, solution until developers were given the tools to build their own authentication systems.

### The Web Application

In the world of web applications, it can be a little odd to think of the basis for the web being the hyperlink. It is a method of linking one document to another, which was gradually augmented with styling, code execution, sessions, authentication, and ultimately became the social shared computing experience so many 70s researchers were trying (and failing) to create. Ultimately, the conclusion is just as true for any project or startup today as it was then: all that matters is adoption. If you can get people to use it, however slipshod it might be, they will help you craft it into what they need. The corollary is, of course, no one is using it, it doesn’t matter how technically sound it might be. There are countless tools which millions of hours of work went into which precisely no one uses today.

This was adapted from a post which originally appeared on the Eager blog. In 2016 Eager become Cloudflare Apps.

# When Bloom filters don’t bloom

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

I’ve known about Bloom filters (named after Burton Bloom) since university, but I haven’t had an opportunity to use them in anger. Last month this changed – I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.

While doing research about IP spoofing, I needed to examine whether the source IP addresses extracted from packets reaching our servers were legitimate, depending on the geographical location of our data centers. For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian datacenter. This problem might sound simple, but in the ever-evolving landscape of the internet this is far from easy. Suffice it to say I ended up with many large text files with data like this:

This reads as: the IP 192.0.2.1 was recorded reaching Cloudflare data center number 107 with a legitimate request. This data came from many sources, including our active and passive probes, logs of certain domains we own (like cloudflare.com), public sources (like BGP table), etc. The same line would usually be repeated across multiple files.

I ended up with a gigantic collection of data of this kind. At some point I counted 1 billion lines across all the harvested sources. I usually write bash scripts to pre-process the inputs, but at this scale this approach wasn’t working. For example, removing duplicates from this tiny file of a meager 600MiB and 40M lines, took… about an eternity:

Enough to say that deduplicating lines using the usual bash commands like ‘sort’ in various configurations (see ‘–parallel’, ‘–buffer-size’ and ‘–unique’) was not optimal for such a large data set.

## Bloom filters to the rescue

Image by David Eppstein Public Domain

Then I had a brainwave – it’s not necessary to sort the lines! I just need to remove duplicated lines – using some kind of "set" data structure should be much faster. Furthermore, I roughly know the cardinality of the input file (number of unique lines), and I can live with some data points being lost – using a probabilistic data structure is fine!

Bloom-filters are a perfect fit!

While you should go and read Wikipedia on Bloom Filters, here is how I look at this data structure.

How would you implement a "set"? Given a perfect hash function, and infinite memory, we could just create an infinite bit array and set a bit number ‘hash(item)’ for each item we encounter. This would give us a perfect "set" data structure. Right? Trivial. Sadly, hash functions have collisions and infinite memory doesn’t exist, so we have to compromise in our reality. But we can calculate and manage the probability of collisions. For example, imagine we have a good hash function, and 128GiB of memory. We can calculate the probability of the second item added to the bit array colliding would be 1 in 1099511627776. The probability of collision when adding more items worsens as we fill up the bit array.

Furthermore, we could use more than one hash function, and end up with a denser bit array. This is exactly what Bloom filters optimize for. A Bloom filter is a bunch of math on top of the four variables:

• ‘n’ – The number of input elements (cardinality)
• ‘m’ – Memory used by the bit-array
• ‘k’ – Number of hash functions counted for each input
• ‘p’ – Probability of a false positive match

Given the ‘n’ input cardinality and the ‘p’ desired probability of false positive, the Bloom filter math returns the ‘m’ memory required and ‘k’ number of hash functions needed.

Check out this excellent visualization by Thomas Hurst showing how parameters influence each other:

## mmuniq-bloom

Guided by this intuition, I set out on a journey to add a new tool to my toolbox – ‘mmuniq-bloom’, a probabilistic tool that, given input on STDIN, returns only unique lines on STDOUT, hopefully much faster than ‘sort’ + ‘uniq’ combo!

Here it is:

For simplicity and speed I designed ‘mmuniq-bloom’ with a couple of assumptions. First, unless otherwise instructed, it uses 8 hash functions k=8. This seems to be a close to optimal number for the data sizes I’m working with, and the hash function can quickly output 8 decent hashes. Then we align ‘m’, number of bits in the bit array, to be a power of two. This is to avoid the pricey % modulo operation, which compiles down to slow assembly ‘div’. With power-of-two sizes we can just do bitwise AND. (For a fun read, see how compilers can optimize some divisions by using multiplication by a magic constant.)

We can now run it against the same data file we used before:

Oh, this is so much better! 12 seconds is much more manageable than 2 minutes before. But hold on… The program is using an optimized data structure, relatively limited memory footprint, optimized line-parsing and good output buffering… 12 seconds is still eternity compared to ‘wc -l’ tool:

What is going on? I understand that counting lines by ‘wc’ is easier than figuring out unique lines, but is it really worth the 26x difference? Where does all the CPU in ‘mmuniq-bloom’ go?

It must be my hash function. ‘wc’ doesn’t need to spend all this CPU performing all this strange math for each of the 40M lines on input. I’m using a pretty non-trivial ‘siphash24’ hash function, so it surely burns the CPU, right? Let’s check by running the code computing hash function but not doing any Bloom filter operations:

This is strange. Counting the hash function indeed costs about 2s, but the program took 12s in the previous run. The Bloom filter alone takes 10 seconds? How is that possible? It’s such a simple data structure…

## A secret weapon – a profiler

It was time to use a proper tool for the task – let’s fire up a profiler and see where the CPU goes. First, let’s fire an ‘strace’ to confirm we are not running any unexpected syscalls:

Everything looks good. The 10 calls to ‘mmap’ each taking 4ms (3971 us) is intriguing, but it’s fine. We pre-populate memory up front with ‘MAP_POPULATE’ to save on page faults later.

What is the next step? Of course Linux’s ‘perf’!

Then we can see the results:

Right, so we indeed burn 87.2% of cycles in our hot code. Let’s see where exactly. Doing ‘perf annotate process_line –source’ quickly shows something I didn’t expect.

You can see 26.90% of CPU burned in the ‘mov’, but that’s not all of it! The compiler correctly inlined the function, and unrolled the loop 8-fold. Summed up that ‘mov’ or ‘uint64_t v = *p’ line adds up to a great majority of cycles!

Clearly ‘perf’ must be mistaken, how can such a simple line cost so much? We can repeat the benchmark with any other profiler and it will show us the same problem. For example, I like using ‘google-perftools’ with kcachegrind since they emit eye-candy charts:

The rendered result looks like this:

Allow me to summarise what we found so far.

The generic ‘wc’ tool takes 0.45s CPU time to process 600MiB file. Our optimized ‘mmuniq-bloom’ tool takes 12 seconds. CPU is burned on one ‘mov’ instruction, dereferencing memory….

Image by Jose Nicdao CC BY/2.0

Oh! I how could I have forgotten. Random memory access is slow! It’s very, very, very slow!

According to the rule of thumb "latency numbers every programmer should know about", one RAM fetch is about 100ns. Let’s do the math: 40 million lines, 8 hashes counted for each line. Since our Bloom filter is 128MiB, on our older hardware it doesn’t fit into L3 cache! The hashes are uniformly distributed across the large memory range – each hash generates a memory miss. Adding it together that’s…

That suggests 32 seconds burned just on memory fetches. The real program is faster, taking only 12s. This is because, although the Bloom filter data does not completely fit into L3 cache, it still gets some benefit from caching. It’s easy to see with ‘perf stat -d’:

Right, so we should have had at least 320M LLC-load-misses, but we had only 280M. This still doesn’t explain why the program was running only 12 seconds. But it doesn’t really matter. What matters is that the number of cache misses is a real problem and we can only fix it by reducing the number of memory accesses. Let’s try tuning Bloom filter to use only one hash function:

Ouch! That really hurt! The Bloom filter required 64 GiB of memory to get our desired false positive probability ratio of 1-error-per-10k-lines. This is terrible!

Also, it doesn’t seem like we improved much. It took the OS 22 seconds to prepare memory for us, but we still burned 11 seconds in userspace. I guess this time any benefits from hitting memory less often were offset by lower cache-hit probability due to drastically increased memory size. In previous runs we required only 128MiB for the Bloom filter!

## Dumping Bloom filters altogether

This is getting ridiculous. To get the same false positive guarantees we either must use many hashes in Bloom filter (like 8) and therefore many memory operations, or we can have 1 hash function, but enormous memory requirements.

We aren’t really constrained by available memory, instead we want to optimize for reduced memory accesses. All we need is a data structure that requires at most 1 memory miss per item, and use less than 64 Gigs of RAM…

While we could think of more sophisticated data structures like Cuckoo filter, maybe we can be simpler. How about a good old simple hash table with linear probing?

## Welcome mmuniq-hash

Here you can find a tweaked version of mmuniq-bloom, but using hash table:

Instead of storing bits as for the Bloom-filter, we are now storing 64-bit hashes from the ‘siphash24’ function. This gives us much stronger probability guarantees, with probability of false positives much better than one error in 10k lines.

Let’s do the math. Adding a new item to a hash table containing, say 40M, entries has ’40M/2^64′ chances of hitting a hash collision. This is about one in 461 billion – a reasonably low probability. But we are not adding one item to a pre-filled set! Instead we are adding 40M lines to the initially empty set. As per birthday paradox this has much higher chances of hitting a collision at some point. A decent approximation is ‘~n^2/2m’, which in our case is ‘~(40M^2)/(2*(2^64))’. This is a chance of one in 23000. In other words, assuming we are using good hash function, every one in 23 thousand random sets of 40M items, will have a hash collision. This practical chance of hitting a collision is non-negligible, but it’s still better than a Bloom filter and totally acceptable for my use case.

The hash table code runs faster, has better memory access patterns and better false positive probability than the Bloom filter approach.

Don’t be scared about the "hash conflicts" line, it just indicates how full the hash table was. We are using linear probing, so when a bucket is already used, we just pick up the next empty bucket. In our case we had to skip over 0.7 buckets on average to find an empty slot in the table. This is fine and, since we iterate over the buckets in linear order, we can expect the memory to be nicely prefetched.

From the previous exercise we know our hash function takes about 2 seconds of this. Therefore, it’s fair to say 40M memory hits take around 4 seconds.

## Lessons learned

Modern CPUs are really good at sequential memory access when it’s possible to predict memory fetch patterns (see Cache prefetching). Random memory access on the other hand is very costly.

Advanced data structures are very interesting, but beware. Modern computers require cache-optimized algorithms. When working with large datasets, not fitting L3, prefer optimizing for reduced number loads, over optimizing the amount of memory used.

I guess it’s fair to say that Bloom filters are great, as long as they fit into the L3 cache. The moment this assumption is broken, they are terrible. This is not news, Bloom filters optimize for memory usage, not for memory access. For example, see the Cuckoo Filters paper.

Another thing is the ever-lasting discussion about hash functions. Frankly – in most cases it doesn’t matter. The cost of counting even complex hash functions like ‘siphash24’ is small compared to the cost of random memory access. In our case simplifying the hash function will bring only small benefits. The CPU time is simply spent somewhere else – waiting for memory!

One colleague often says: "You can assume modern CPUs are infinitely fast. They run at infinite speed until they hit the memory wall".

Finally, don’t follow my mistakes – everyone should start profiling with ‘perf stat -d’ and look at the "Instructions per cycle" (IPC) counter. If it’s below 1, it generally means the program is stuck on waiting for memory. Values above 2 would be great, it would mean the workload is mostly CPU-bound. Sadly, I’m yet to see high values in the workloads I’m dealing with…

## Improved mmuniq

With the help of my colleagues I’ve prepared a further improved version of the ‘mmuniq’ hash table based tool. See the code:

It is able to dynamically resize the hash table, to support inputs of unknown cardinality. Then, by using batching, it can effectively use the ‘prefetch’ CPU hint, speeding up the program by 35-40%. Beware, sprinkling the code with ‘prefetch’ rarely works. Instead, I specifically changed the flow of algorithms to take advantage of this instruction. With all the improvements I got the run time down to 2.1 seconds:

## The end

Writing this basic tool which tries to be faster than ‘sort | uniq’ combo revealed some hidden gems of modern computing. With a bit of work we were able to speed it up from more than two minutes to 2 seconds. During this journey we learned about random memory access latency, and the power of cache friendly data structures. Fancy data structures are exciting, but in practice reducing random memory loads often brings better results.

Post Syndicated from Sam Mason de Caires original https://blog.cloudflare.com/thinking-about-color/

#### Color is my day-long obsession, joy and torment – Claude Monet

Over the last two years we’ve tried to improve our usage of color at Cloudflare. There were a number of forcing functions that made this work a priority. As a small team of designers and engineers we had inherited a bunch of design work that was a mix of values built by multiple teams. As a result it was difficult and unnecessarily time consuming to add new colors when building new components.

We also wanted to improve our accessibility. While we were doing pretty well, we had room for improvement, largely around how we used green. As our UI is increasingly centered around visualizations of large data sets we wanted to push the boundaries of making our analytics as visually accessible as possible.

Cloudflare had also undergone a rebrand around 2016. While our marketing site had rolled out an updated set of visuals, our product ui as well as a number of existing web properties were still using various versions of our old palette.

Our product palette wasn’t well balanced by itself. Many colors had been chosen one or two at a time. You can see how we chose blueberry, ice, and water at a different point in time than marine and thunder.

Lacking visual cohesion within our own product, we definitely weren’t providing a cohesive visual experience between our marketing site and our product. The transition from the nice blues and purples to our green CTAs wasn’t the streamlined experience we wanted to afford our users.

## Reworking our Palette

Our first step was to audit what we already had. Cloudflare has been around long enough to have more than one website. Beyond cloudflare.com we have dozens of web properties that are publicly accessible. From our community forums, support docs, blog, status page, to numerous micro-sites.

All-in-all we have dozens of front-end codebases that each represent one more chance to introduce entropy to our visual language. So we were curious to answer the question – what colors were we currently using? Were there consistent patterns we could document for further reuse? Could we build a living style guide that didn’t cover just one site, but all of them?

Our curiosity got the best of us and we went about exploring ways we could visualize our design language across all of our sites.

### A time machine for color

As we first started to identify the scale of our color problems, we tried to think outside the box on how we might explore the problem space. After an initial brainstorming session we combined the Internet Archive’s Wayback Machine with the Css Stats API to build an audit tool that shows how our various websites visual properties change over time. We can dynamically select which sites we want to compare and scrub through time to see changes.

Below is a visualization of palettes from 9 different websites changing over a period of 6 years. Above the palettes is a component that spits out common colors, across all of these sites. The only two common colors across all properties (appearing for only a brief flash) were #ffffff (white) and transparent. Over time we haven’t been very consistent with ourselves.

If we drill in to look at our marketing site compared to our dashboard app – it looks like the video below. We see a bit more overlap at first and then a significant divergence at the 16 second mark when our product palette grew significantly. At the 22 second mark you can see the marketing palette completely change as a result of the rebrand while our product palette stays the same. As time goes on you can see us becoming more and more inconsistent across the two code bases.

As a product team we had some catching up to do improve our usage of color and to align ourselves with the company brand. The good news was, there was no where to go but up.

This style of historical audit gives us a visual indication with real data. We can visualize for stakeholders how consistent and similar our usage of color is across products and if we are getting better or worse over time. Having this type of feedback loop was invaluable for us – as auditing this manually is incredibly time consuming so it often doesn’t get done. Hopefully in the future as it’s standard to track various performance metrics over time at a company it will be standard to be able to visualize your current levels of design entropy.

### Picking colors

After our initial audit revealed there wasn’t a lot of consistency across sites, we went to work to try and construct a color palette that could potentially be used for sites the product team owned. It was time to get our hands dirty and start “picking colors.”

Hindsight of course is always 20/20. We didn’t start out on day one trying to generate scales based on our brand palette. No, our first bright idea, was to generate the entire palette from a single color.

Our logo is made up of two oranges. Both of these seemed like prime candidates to generate a palette from.

We played around with a number of algorithms that took a single color and created a palette. From the initial color we generated an array scales for each hue. Initial attempts found us applying the exact same curves for luminosity to each hue, but as visual perception of hues is so different, this resulted in wildly different contrasts at each step of the scale.

Below are a few of our initial attempts at palette generation. Jeeyoung Jung did a brilliant writeup around designing palettes last year.

We can see the intensity of the colors change across hue in peaks, with yellow and green being the most dominant. One of the downsides of this, is when you are rapidly iterating through theming options, the inconsistent relationships between steps across hues can make it time consuming or impossible to keep visual harmony in your interface.

The video below is another way to visualize this phenomenon. The dividing line in the color picker indicates which part of the palette will be accessible with black and white. Notice how drastically the line changes around green and yellow. And then look back at the charts above.

After fiddling with a few different generative algorithms (we made a lot of ugly palettes…) we decided to try a more manual approach. We pursued creating custom curves for each hue in an effort to keep the contrast scales as optically balanced as possible.

Generating different color palettes makes you confront a basic question. How do you tell if a palette is good? Are some palettes better than others? In an effort to answer this question we constructed various feedback loops to help us evaluate palettes as quickly as possible. We tried a few methods to stress test a palette. At first we attempted to grab the “nearest color” for a bunch of our common UI colors. This wasn’t always helpful as sometimes you actually want the step above or below the closest existing color. But it was helpful to visualize for a few reasons.

Sometime during our exploration in this space, we stumbled across this tweet thread about building a palette for pixel art. There are a lot of places web and product designers can draw inspiration from game designers.

Here we see a similar concept where a number of different palettes are applied to the same component. This view shows us two things, the different ways a single palette can be applied to a sphere, and also the different aesthetics across color palettes.

It’s almost surprising that the default way to construct a color palette for apps and sites isn’t to build it while previewing its application against the most common UI patterns. As designers, there are a lot of consistent uses of color we could have baselines for. Many enterprise apps are centered around white background with blue as the primary color with mixtures of grays to add depth around cards and page sections. Red is often used for destructive actions like deleting some type of record. Gray for secondary actions. Maybe it’s an outline button with the primary color for secondary actions. Either way – the margins between the patterns aren’t that large in the grand scheme of things.

Consider the use case of designing UI while the palette or usage of color hasn’t been established. Given a single palette, you might want to experiment with applying that palette in a variety of ways that will output a wide variety of aesthetics. Alternatively you may need to test out several different palettes. These are two different modes of exploration that can be extremely time consuming to work through . It can be non-trivial to keep an in progress design synced with several different options for color application, even with the best use of layer comps or symbols.

How do we visualize the various ways a palette will look when applied to an interface? Here are examples of how palettes are shown on a palette list for pixel artists.

One method of visualization is to define a common set of primitive ui elements and show each one of them with a single set of colors applied. In isolation this can be helpful. This mode would make it easy to vet a single combination of colors and which ui elements it might be best applied to.

Alternatively we might want to see a composed interface with the closest colors from the palette applied. Consider a set of buttons that includes red, green, blue, and gray button styles. Seeing all of these together can help us visualize the relative nature of these buttons side by side. Given a baseline palette for common UI, we could swap to a new palette and replace each color with the “closest” color. This isn’t always a full-proof solution as there are many edge cases to cover. e.g. what happens when replacing a palette of 134 colors with a palette of 24 colors? Even still, this could allow us to quickly take a stab at automating how existing interfaces would change their appearance given a change to the underlying system. Whether locally or against a live site, this mode of working would allow for designers to view a color in multiple contets to truly asses its quality.

After moving on from the idea of generating a palette from a single color, we attempted to use our logo colors as well as our primary brand colors to drive the construction of modular scales. Our goal was to create a palette that would improve contrast for accessibility, stay true to our visual brand, work predictably for developers, work for data visualizations, and provide the ability to design visually balanced and attractive interfaces. No sweat.

While we knew going in we might not use every step in every hue, we wanted full coverage across the spectrum so that each hue had a consistent optical difference between each step. We also had no idea which steps across which hues we were going to need just yet. As they would just be variables in a theme file it didn’t add any significant code footprint to expose the full generated palette either.

One of the more difficult parts, was deciding on a number of steps for the scales. This would allow us to edit the palette in the future to a variety of aesthetics and swap the palette out at the theme level without needing to update anything else.

In the future if when we did need to augment the available colors, we could edit the entire palette instead of adding a one-off addition as we had found this was a difficult way to work over time. In addition to our primary brand colors we also explored adding scales for yellow / gold, violet, teal as well as a gray scale.

The first interface we built for this work was to output all of the scales vertically, with their contrast scores with both white and black on the right hand side. To aid scannability we bolded the values that were above the 4.5 threshold. As we edited the curves, we could see how the contrast ratios were affected at each step. Below you can see an early starting point before the scales were balanced. Red has 6 accessible combos with white, while yellow only has 1. We initially explored having the gray scale be larger than the others.

As both screen luminosity and ambient light can affect perception of color we developed on two monitors, one set to maximum and one set to minimum brightness levels. We also replicated the color scales with a grayscale filter immediately below to help illustrate visual contrast between steps AND across hues. Bouncing back and forth between the grayscale and saturated version of the scale serves as a great baseline reference. We found that going beyond 10 steps made it difficult to keep enough contrast between each step to keep them distinguishable from one another.

Taking a page from our game design friends – as we were balancing the scales and exploring how many steps we wanted in the scales, we were also stress testing the generated colors against various analytics components from our component library.

Our slightly random collection of grays had been a particular pain point as they appeared muddy in a number of places within our interface. For our new palette we used the slightest hint of blue to keep our grays consistent and just a bit off from being purely neutral.

With a palette consisting of 90 colors, the amount of combinations and permutations that can be applied to data visualizations is vast and can result in a wide variety of aesthetic directions. The same palette applied to both line and bar charts with different data sets can look substantially different, enough that they might not be distinguishable as being the exact same palette. Working with some of our engineering counterparts, we built a pipeline that would put up the same components rendered against different data sets, to simulate the various shapes and sizes the graph elements would appear in. This allowed us to rapidly test the appearance of different palettes. This workflow gave us amazing insights into how a palette would look in our interface. No matter how many hours we spent staring at a palette, we couldn’t get an accurate sense of how the colors would look when composed within an interface.

We experimented with a number of ideas on visualizing different sizes and shapes of colors and how they affected our perception of how much a color was changing element to element. In the first frame it is most difficult to tell the values at 2% and 6% apart given the size and shape of the elements.

We’ve begun to package up some of this work into a web app others can use to create or import a palette and preview multiple depths of accessible combinations against a set of UI elements.

The goal is to make it easier for anyone to work seamlessly with color and build beautiful interfaces with accessible color contrasts.

In an effort to make sure everything we are building will be visually accessible – we built a react component that will preview how a design would look if you were colorblind. The component overlays SVG filters to simulate alternate ways someone can perceive color.

While this is previewing an analytics component, really any component or page can be previewed with this method.

import React from "react"

const filters = [
'achromatopsia',
'protanomaly',
'protanopia',
'deuteranomaly',
'deuteranopia',
'tritanomaly',
'tritanopia',
'achromatomaly',
]

const ColorBlindFilter = ({itemPadding, itemWidth, ...props }) => {
return (
<div  {...props}>
{filters.map((filter, i) => (
<div
style={{filter: 'url(/filters.svg#'+filter+')'}}
width={itemWidth}
key={i+filter}
>
{props.children}
</div>
))}
</div>
)
}

ColorBlindFilter.defaultProps = {
display: 'flex',
justifyContent: 'space-around',
flexWrap: 'wrap',
width: 1,
itemWidth: 1/4
}

export default ColorBlindFilter


After quite a few iterations, we had finally come up with a color palette. Each scale was optically aligned with our brand colors. The 5th step in each scale is the closest to the original brand color, but adjusted slightly so it’s accessible with both black and white.

Lyft’s writeup “Re-approaching color” and Jeeyoung Jung’s “Designing Systematic Colors are some of the best write-ups on how to work with color at scale you can find.

## Color migrations

Getting a team of people to agree on a new color palette is a journey in and of itself. By the time you get everyone to consensus it’s tempting to just collapse into a heap and never think about colors ever again. Unfortunately the work doesn’t stop at this point. Now that we’ve picked our palette, it’s time to get it implemented so this bike shed is painted once and for all.

If you are porting an old legacy part of your app to be updated to the new style guide like we were, even the best color documentation can fall short in helping someone make the necessary changes.

We found it was more common than expected that engineers and designers wanted to know what the new version of a color they were familiar with was. During the transition between palettes we had an interface people could input any color and get the closest color within our palette.

There are times when migrating colors, the closest color isn’t actually what you want. Given the scenario where your brand color has changed from blue to purple, you might want to be porting all blues to the closest purple within the palette, not the closest blues which might still exist in your palette. To help visualize migrations as well as get suggestions on how to consolidate values within the old scale, we of course built a little tool. Here we can define those translations and import a color palette from a URL import. As we still have. a number of web properties to update to our palette, this simple tool has continued to prove useful.

We wanted to be as gentle as possible in transitioning to the new palette in usage. While the developers found string names for colors brittle and unpredictable, it was still a more familiar system for some than this new one. We first just added in our new palette to the existing theme for usage moving forward. Then we started to port colors for existing components and pages.

For our colleagues, we wrote out desired translations and offered warnings in the console that a color was deprecated, with a reference to the new theme value to use.

While we had a few bugs along the way, the team was supportive and helped us fix bugs almost as quickly as we could find them.

We’re still in the process of updating our web properties with our new palette, largely prioritizing accessibility first while trying to create a more consistent visual brand as a nice by-product of the work. A small example of this is our system status page. In the first image, the blue links in the header, the green status bar, and the about copy, were all inaccessible against their backgrounds.

A lot of the changes have been subtle. Most notably the green we use in the dashboard is a lot more inline with our brand colors than before. In addition we’ve also been able to add visual balance by not just using straight black text on background colors. Here we added one of the darker steps from the corresponding scale, to give it a bit more visual balance.

While we aren’t perfect yet, we’re making progress towards more visual cohesion across our marketing materials and products.

## Next steps

Trying to keep dozens of sites all using the same palette in a consistent manner across time is a task that you can never complete. It’s an ongoing maintenance problem. Engineers familiar with the color system leave, new engineers join and need to learn how the system works. People still launch sites using a different palette that doesn’t meet accessibility standards. Our work continues to be cut out for us. As they say, a garden doesn’t tend itself.

If we do ever revisit our brand colors, we’re excited to have infrastructure in place to update our apps and several of our satellite sites with significantly less effort than our first time around.

## Resources

Some of our favorite materials and resources we found while exploring this problem space.

# A History of HTML Parsing at Cloudflare: Part 2

Post Syndicated from Andrew Galloni original https://blog.cloudflare.com/html-parsing-2/

The second blog post in the series on HTML rewriters picks up the story in 2017 after the launch of the Cloudflare edge compute platform Cloudflare Workers. It became clear that the developers using workers wanted the same HTML rewriting capabilities that we used internally, but accessible via a JavaScript API.

This blog post describes the building of a streaming HTML rewriter/parser with a CSS-selector based API in Rust. It is used as the back-end for the Cloudflare Workers HTMLRewriter. We have open-sourced the library (LOL HTML) as it can also be used as a stand-alone HTML rewriting/parsing library.

The major change compared to LazyHTML, the previous rewriter, is the dual-parser architecture required to overcome the additional performance overhead of wrapping/unwrapping each token when propagating tokens to the workers runtime. The remainder of the post describes a CSS selector matching engine inspired by a Virtual Machine approach to regular expression matching.

## v2 : Give it to everyone and make it faster

In 2017, Cloudflare introduced an edge compute platform – Cloudflare Workers. It was no surprise that customers quickly required the same HTML rewriting capabilities that we were using internally. Our team was impressed with the platform and decided to migrate some of our features to Workers. The goal was to improve our developer experience working with modern JavaScript rather than statically linked NGINX modules implemented in C with a Lua API.

It is possible to rewrite HTML in Workers, though for that you needed a third party JavaScript package (such as Cheerio). These packages are not designed for HTML rewriting on the edge due to the latency, speed and memory considerations described in the previous post.

JavaScript is really fast but it still can’t always produce performance comparable to native code for some tasks – parsing being one of those. Customers typically needed to buffer the whole content of the page to do the rewriting resulting in considerable output latency and memory consumption that often exceeded the memory limits enforced by the Workers runtime.

We started to think about how we could reuse the technology in Workers. LazyHTML was a perfect fit in terms of parsing performance, but it had two issues:

1. API ergonomics: LazyHTML produces a stream of HTML tokens. This is sufficient for our internal needs. However, for an average user, it is not as convenient as the jQuery-like API of Cheerio.
2. Performance: Even though LazyHTML is tremendously fast, integration with the Workers runtime adds even more limitations. LazyHTML operates as a simple parse-modify-serialize pipeline, which means that it produces tokens for the whole content of the page. All of these tokens then have to be propagated to the Workers runtime and wrapped inside a JavaScript object and then unwrapped and fed back to LazyHTML for serialization. This is an extremely expensive operation which would nullify the performance benefit of LazyHTML.

### LOL HTML

We needed something new, designed with Workers requirements in mind, using a language with the native speed and safety guarantees (it’s incredibly easy to shoot yourself in the foot doing parsing). Rust was the obvious choice as it provides the native speed and the best guarantee of memory safety which minimises the attack surface of untrusted input. Wherever possible the Low Output Latency HTML rewriter (LOL HTML) uses all the previous optimizations developed for LazyHTML such as tag name hashing.

#### Dual-parser architecture

Most developers are familiar and prefer to use CSS selector-based APIs (as in Cheerio, jQuery or DOM itself) for HTML mutation tasks. We decided to base our API on CSS selectors as well. Although this meant additional implementation complexity, the decision created even more opportunities for parsing optimizations.

As selectors define the scope of the content that should be rewritten, we realised we can skip the content that is not in this scope and not produce tokens for it. This not only significantly speeds up the parsing itself, but also avoids the performance burden of the back and forth interactions with the JavaScript VM. As ever the best optimization is not to do something.

Considering the tasks required, LOL HTML’s parser consists of two internal parsers:

• Lexer – a regular full parser, that produces output for all types of content that it encounters;
• Tag scanner – looks for start and end tags and skips parsing the rest of the content. The tag scanner parses only the tag name and feeds it to the selector matcher. The matcher will switch parser to the lexer if there was a match or additional information about the tag (such as attributes) are required for matching.

The parser switches back to the tag scanner as soon as input leaves the scope of all selector matches. The tag scanner may also sometimes switch the parser to the Lexer – if it requires additional tag information for the parsing feedback simulation.

Having two different parser implementations for the same grammar will increase development costs and is error-prone due to implementation inconsistencies. We minimize these risks by implementing a small Rust macro-based DSL which is similar in spirit to Ragel. The DSL program describes Nondeterministic finite automaton states and actions associated with each state transition and matched input byte.

An example of a DSL state definition:

tag_name_state {
whitespace => ( finish_tag_name?; --> before_attribute_name_state )
b'/'       => ( finish_tag_name?; --> self_closing_start_tag_state )
b'>'       => ( finish_tag_name?; emit_tag?; --> data_state )
eof        => ( emit_raw_without_token_and_eof?; )
_          => ( update_tag_name_hash; )
}

The DSL program gets expanded by the Rust compiler into not quite as beautiful, but extremely efficient Rust code.

We no longer need to reimplement the code that drives the parsing process for each of our parsers. All we need to do is to define different action implementations for each. In the case of the tag scanner, the majority of these actions are a no-op, so the Rust compiler does the NFA optimization job for us: it optimizes away state branches with no-op actions and even whole states if all of the branches have no-op actions. Now that’s cool.

#### Byte slice processing optimisations

Moving to a memory-safe language provided new challenges. Rust has great memory safety mechanisms, however sometimes they have a runtime performance cost.

The task of the parser is to scan through the input and find the boundaries of lexical units of the language – tokens and their internal parts. For example, an HTML start tag token consists of multiple parts: a byte slice of input that represents the tag name and multiple pairs of input slices that represent attributes and values:

struct StartTagToken<'i> {
name: &'i [u8],
attributes: Vec<(&'i [u8], &'i [u8])>,
self_closing: bool
}

As Rust uses bound checks on memory access, construction of a token might be a relatively expensive operation. We need to be capable of constructing thousands of them in a fraction of second, so every CPU instruction counts.

Following the principle of doing as little as possible to improve performance we use a “token outline” representation of tokens: instead of having memory slices for token parts we use numeric ranges which are lazily transformed into a byte slice when required.

struct StartTagTokenOutline {
name: Range<usize>,
attributes: Vec<(Range<usize>, Range<usize>)>,
self_closing: bool
}

As you might have noticed, with this approach we are no longer bound to the lifetime of the input chunk which turns out to be very useful. If a start tag is spread across multiple input chunks we can easily update the token that is currently in construction, as new chunks of input arrive by just adjusting integer indices. This allows us to avoid constructing a new token with slices from the new input memory region (it could be the input chunk itself or the internal parser’s buffer).

This time we can’t get away with avoiding the conversion of input character encoding; we expose a user-facing API that operates on JavaScript strings and input HTML can be of any encoding. Luckily, as we can still parse without decoding and only encode and decode within token bounds by a request (though we still can’t do that for UTF-16 encoding).

So, when a user requests an element’s tag name in the API, internally it is still represented as a byte slice in the character encoding of the input, but when provided to the user it gets dynamically decoded. The opposite process happens when a user sets a new tag name.

For selector matching we can still operate on the original encoding representation – because we know the input encoding ahead of time we preemptively convert values in a selector to the page’s character encoding, so comparisons can be done without decoding fields of each token.

As you can see, the new parser architecture along with all these optimizations produced great performance results:

LOL HTML’s tag scanner is typically twice as fast as LazyHTML and the lexer has comparable performance, outperforming LazyHTML on bigger inputs. Both are a few times faster than the tokenizer from html5ever – another parser implemented in Rust used in the Mozilla’s Servo browser engine.

#### CSS selector matching VM

With an impressively fast parser on our hands we had only one thing missing – the CSS selector matcher. Initially we thought we could just use Servo’s CSS selector matching engine for this purpose. After a couple of days of experimentation it turned out that it is not quite suitable for our task.

It did not work well with our dual parser architecture. We first need to to match just a tag name from the tag scanner, and then, if we fail, query the lexer for the attributes. The selectors library wasn’t designed with this architecture in mind so we needed ugly hacks to bail out from matching in case of insufficient information. It was inefficient as we needed to start matching again after the bailout doing twice the work. There were other problems, such as the integration of lazy character decoding and integration of tag name comparison using tag name hashes.

##### Matching direction

The main problem encountered was the need to backtrack all the open elements for matching. Browsers match selectors from right to left and traverse all ancestors of an element. This StackOverflow has a good explanation of why they do it this way. We would need to store information about all open elements and their attributes – something that we can’t do while operating with tight memory constraints. This matching approach would be inefficient for our case – unlike browsers, we expect to have just a few selectors and a lot of elements. In this case it is much more efficient to match selectors from left to right.

And this is when we had a revelation. Consider the following CSS selector:

body > div.foo  img[alt] > div.foo ul

It can be split into individual components attributed to a particular element with hierarchical combinators in between:

body > div.foo img[alt] > div.foo  ul
---    ------- --------   -------  --

Each component is easy to match having a start tag token – it’s just a matter of comparison of token fields with values in the component. Let’s dive into abstract thinking and imagine that each such component is a character in the infinite alphabet of all possible components:

Selector component Character
body a
div.foo b
img[alt] c
ul d

Let’s rewrite our selector with selector components replaced by our imaginary characters:

a > b c > b d

Does this remind you of something?

A   > combinator can be considered a child element, or “immediately followed by”.

The   (space) is a descendant element can be thought of as there might be zero or more elements in between.

There is a very well known abstraction to express these relations – regular expressions. The selector replacing combinators can be replaced with a regular expression syntax:

ab.*cb.*d

We transformed our CSS selector into a regular expression that can be executed on the sequence of start tag tokens. Note that not all CSS selectors can be converted to such a regular grammar and the input on which we match has some specifics, which we’ll discuss later. However, it was a good starting point: it allowed us to express a significant subset of selectors.

##### Implementing a Virtual Machine

Next, we started looking at non-backtracking algorithms for regular expressions. The virtual machine approach seemed suitable for our task as it was possible to have a non-backtracking implementation that was flexible enough to work around differences between real regular expression matching on strings and our abstraction.

VM-based regular expression matching is implemented as one of the engines in many regular expression libraries such as regexp2 and Rust’s regex. The basic idea is that instead of building an NFA or DFA for a regular expression it is instead converted into DSL assembly language with instructions later executed by the virtual machine – regular expressions are treated as programs that accept strings for matching.

Since the VM program is just a representation of NFA with ε-transitions it can exist in multiple states simultaneously during the execution, or, in other words, spawns multiple threads. The regular expression matches if one or more states succeed.

For example, consider the following VM instructions:

• expect c – waits for next input character, aborts the thread if doesn’t equal to the instruction’s operand;
• thread L1, L2 – spawns threads for labels L1 and L2, effectively splitting the execution;
• match – succeed the thread with a match;

For example, using this instructions set regular expression “ab*c” can be translated into:

    expect a
L2: expect b
jmp L1
L3: expect c
match

Let’s try to translate the regular expression ab.*cb.*d from the selector we saw earlier:

    expect a
expect b
L2: expect [any]
jmp L1
L3: expect c
expect b
L5: expect [any]
jmp L4
L6: expect d
match

That looks complex! Though this assembly language is designed for regular expressions in general, and regular expressions can be much more complex than our case. For us the only kind of repetition that matters is “.*”. So, instead of expressing it with multiple instructions we can use just one called hereditary_jmp:

    expect a
expect b
hereditary_jmp L1
L1: expect c
expect b
hereditary_jmp L2
L2: expect d
match

The instruction tells VM to memoize instruction’s label operand and unconditionally spawn a thread with a jump to this label on each input character.

There is one significant distinction between the string input of regular expressions and the input provided to our VM. The input can shrink!

A regular string is just a contiguous sequence of characters, whereas we operate on a sequence of open elements. As new tokens arrive this sequence can grow as well as shrink. Assume we represent <div> as ‘a’ character in our imaginary language, so having <div><div><div> input we can represent it as aaa, if the next token in the input is </div> then our “string” shrinks to aa.

You might think at this point that our abstraction doesn’t work and we should try something else. What we have as an input for our machine is a stack of open elements and we needed a stack-like structure to store our hereditrary_jmp instruction labels that VM had seen so far. So, why not store it on the open element stack? If we store the next instruction pointer on each of stack items on which the expect instruction was successfully executed, we’ll have a full snapshot of the VM state, so we can easily roll back to it if our stack shrinks.

With this implementation we don’t need to store anything except a tag name on the stack, and, considering that we can use the tag name hashing algorithm, it is just a 64-bit integer per open element. As an additional small optimization, to avoid traversing of the whole stack in search of active hereditary jumps on each new input we store an index of the first ancestor with a hereditary jump on each stack item.

For example, having the following selector “body > div span” we’ll have the following VM program (let’s get rid of labels and just use instruction indices instead):

0| expect <body>
1| expect <div>
2| hereditary_jmp 3
3| expect <span>
4| match

Having an input “<body><div><div><a>” we’ll have the following stack:

Now, if the next token is a start tag <span> the VM will first try to execute the selectors program from the beginning and will fail on the first instruction. However, it will also look for any active hereditary jumps on the stack. We have one which jumps to the instructions at index 3. After jumping to this instruction the VM successfully produces a match. If we get yet another <span> start tag later it will much as well following the same steps which is exactly what we expect for the descendant selector.

If we then receive a sequence of “</span></span></div></a></div>” end tags our stack will contain only one item:

which instructs VM to jump to instruction at index 1, effectively rolling back to matching the div component of the selector.

We mentioned earlier that we can bail out from the matching process if we only have a tag name from the tag scanner and we need to obtain more information by running the lexer? With a VM approach it is as easy as stopping the execution of the current instruction and resuming it later when we get the required information.

##### Duplicate selectors

As we need a separate program for each selector we need to match, how can we stop the same simple components doing the same job? The AST for our selector matching program is a radix tree-like structure whose edge labels are simple selector components and nodes are hierarchical combinators.
For example for the following selectors:

body > div > link[rel]
body > span
body > span a

we’ll get the following AST:

If selectors have common prefixes we can match them just once for all these selectors. In the compilation process, we flatten this structure into a vector of instructions.

##### [not] JIT-compilation

For performance reasons compiled instructions are macro-instructions – they incorporate multiple basic VM instruction calls. This way the VM can execute only one macro instruction per input token. Each of the macro instructions compiled using the so-called “[not] JIT-compilation” (the same approach to the compilation is used in our other Rust project – wirefilter).

Internally the macro instruction contains expect and following jmp, hereditary_jmp and match basic instructions. In that sense macro-instructions resemble microcode making it easy to suspend execution of a macro instruction if we need to request attributes information from the lexer.

## What’s next

It is obviously not the end of the road, but hopefully, we’ve got a bit closer to it. There are still multiple bits of functionality that need to be implemented and certainly, there is a space for more optimizations.

If you are interested in the topic don’t hesitate to join us in development of LazyHTML and LOL HTML at GitHub and, of course, we are always happy to see people passionate about technology here at Cloudflare, so don’t hesitate to contact us if you are too :).

# Workers Sites: Extending the Workers platform with our own serverless building blocks

Post Syndicated from Ashley Williams original https://blog.cloudflare.com/extending-the-workers-platform/

As of today, with the Wrangler CLI, you can now deploy entire websites directly to Cloudflare Workers and Workers KV. If you can statically generate the assets for your site, think create-react-app, Jekyll, or even the WP2Static plugin, you can deploy it to our entire global network, which spans 194 cities in more than 90 countries.

While you could deploy an entire site directly to Workers before, it wasn’t the easiest process. So, the Workers Developer Experience Team came up with a solution to make deploying static assets a significantly better experience.

Using our Workers command-line tool Wrangler, we’ve made it possible to deploy any static site to Workers in three easy steps: run wrangler init --site, configure the newly created wrangler.toml file with your account and project details, and then publish it to Cloudflare’s edge with wrangler publish. If you want to explore how this works, check out our new Workers Sites tutorial for create-react-app, where we cover how this new functionality allows you to deploy without needing to write any additional code!

While in hindsight the path we took to get to this point might not seem the most straightforward, it really highlights the flexibility of the entire Workers platform to easily support use cases that we didn’t originally envision. With this in mind, I’ll walk you through the implementation and thinking we did to get to this point. I’ll also talk a bit about how the flexibility of the Workers platform has us excited, both for the ethos it represents, and the future it enables.

So, what went into building Workers Sites?

## “Filesystem?! Where we’re going, we don’t need a filesystem!”

The Workers platform is built on v8 isolates, which, while awesome, lack a filesystem. If you’ve ever deployed a static site via FTP, uploaded it to object storage, or used a computer, you’d probably agree that filesystems are important. For many use cases, like building an API or routing, you don’t need a filesystem, but as the vision for Workers grew and our audience grew with it, it became clear to us that this was a limitation we needed to address for new features.

## Welcome to the simulation

Without a filesystem, we decided to simulate one on top of Workers KV! Workers KV provides access to a secure key-value store that runs across Cloudflare’s Edge alongside Workers.

When running wrangler preview or wrangler publish, we check your wrangler.toml for the site key. The site key points to a bucket, which represents the KV namespace we’ll use to represent your static assets. We then upload each of your assets, where the path relative to the entry directory is the key, and the blob of the file is the value.

When a request from a user comes in, the Worker reads the request’s URI and looks up the asset that matches the segment requested. For example, if a user fetches “my-site.com/about.html”, the Worker looks up the “about.html” key in KV and returns the blob. Behind the scenes, we’ll also detect the mime-type of the requested asset and return the response with the correct content-type headers.

For folks who are used to building static sites or sites with a static asset serving component, this could feel deeply overengineered. Others may argue that, indeed, this is just how filesystems are built! The interesting thing for us is that we had to build one, there wasn’t just one there waiting for us.

It was great that we could put this together with Workers KV, but we still had a problem…

## Cache rules everything around me

Workers KV is a database, and so it’s set up for both read and write operations. However, it’s primarily tuned for read-heavy workloads on entries that don’t generally have a long life span. This works well for applications where data is accessed frequently and often updated. But, for static websites, assets are generally written once, and then they are never (or infrequently) written to again. Static site content should be cached for a very long time, if not forever (long live Space Jam). This means we need to cache data much longer than KV is used to.

To fix this, on publish or preview, Wrangler walks the entry-point directory you’ve declared in your wrangler.toml and creates an asset manifest: a map of your filenames to a hash of their content. We use this asset manifest to map requests for a particular filename, say index.html, to the content hash of the most recently uploaded static asset.

You may be familiar with the concept of an asset manifest from using tools like create-react-app. Asset manifests help maintain asset fingerprints for caching in the browser. We took this idea and implemented it in Workers Sites, so that we can leverage the edge cache as well!

This now allows us to, after first read per location, cache the static assets in the Cloudflare cache so that the assets can be stored on the edge indefinitely. This reduces reads to KV to almost nothing; we want to use KV for durability purposes, but we want to use a longer caching strategy for performance. Let’s dive in to exactly what this looks like:

## How it works

When a new asset is created, Wrangler publish will push the new asset to KV as well as an asset manifest to the edge alongside your Worker.

When someone first accesses your page, the Cloudflare location closest to them will run your Worker. The Worker script will determine the content hash of the asset they’ve requested by looking up that asset in the asset manifest. It will use the filename and content hash as the key to fetch the asset’s contents from KV. At this time it will also insert the asset’s contents into Cloudflare’s edge cache, again keyed by filename and content hash. It will then respond to the request with the asset.

On subsequent requests, the Worker script will look up the content hash in the asset manifest, and check the cache to see if the asset is there. Since this is a subsequent request, it will find your asset in the cache on the edge and return a response containing the asset without having to fetch the asset contents from KV.

So what happens when you update your “index.html”- or any of your static assets? The process is very similar to what happens on the upload of a new asset. You’ll run wrangler publish with your new asset on your local machine. Wrangler will walk your asset directory and upload them to KV. At the same time, it will create a new asset manifest containing the filename and a content hash representing the new contents of the asset. When a request comes into your Worker, your Worker will look into the asset manifest and retrieve the new content hash for that asset. The Worker will look in the cache now for the new hash! It will then fetch the new asset from KV, populate the cache, and return the new file to your end user.

Edge caching happens per location across 194 cities around the world, ensuring that the most frequently accessed content on your page is cached in a location closest to those requesting content, reducing latency. All of this happens in *addition* to the browser cache, which means that your assets are nearly always incredibly close to end users!

By being on the edge, a Worker is in a unique position to be able to cache not only static assets like JS, CSS and images, but also HTML assets! Traditional static site solutions utilize your site’s HTML an entry point to the static site generator’s asset manifest. With this method of caching your HTML, it would be impossible to bust that cache because there is no other entry point to manage your assets’ fingerprints other than the HTML itself. However, in a Worker the entry point is your *Worker*! We can then leverage our wrangler asset-manifest to look up and fetch the accurate and cacheable HTML, while at the same time cache bust on content hash.

## Making the possible imaginable

“What we have is a crisis of imagination. Albert Einstein said that you cannot solve a problem with the same mind-set that created it.” – Peter Buffett

When building a brand new developer platform, there’s often a vast number of possible applications. However, the sheer number of possibilities often make each one difficult to imagine. That’s why we think the most important part of any platform is its flexibility to adapt to previously unimagined use cases. And, we don’t mean that just for us. It’s important that everyone has the ability to customize the platform to new and interesting use cases!

At face value, the work we did to implement this feature might seem like another solution for a previously solved problem. However, it’s a great example of how a group of dedicated developers can improve the platform experience for others.

We hope that by paving a way to include static assets in a Worker, developers can use the extra cognitive space to conceive of even more new ways to use Workers that may have been hard to imagine before.

Workers Sites isn’t the end goal, but a stepping stone to continue to think critically about what it means to build a Web Application. We’re excited to give developers the space to explore how simple static applications can grow and evolve, when combined with the dynamic power of edge computing.

Go forth and build something awesome!

Have you built something interesting with Workers? Let us know @CloudflareDev!

# The Technical Challenges of Building Cloudflare WARP

Post Syndicated from Zack Bloom original https://blog.cloudflare.com/warp-technical-challenges/

If you have seen our other post you know that we released WARP to the last members of our waiting list today. With WARP our goal was to secure and improve the connection between your mobile devices and the Internet. Along the way we ran into problems with phone and operating system versions, diverse networks, and our own infrastructure, all while working to meet the pent up demand of a waiting list nearly two million people long.

To understand all these problems and how we solved them we first need to give you some background on how the Cloudflare network works:

## How Our Network Works

The Cloudflare network is composed of data centers located in 194 cities and in more than 90 countries. Every Cloudflare data center is composed of many servers that receive a continual flood of requests and has to distribute those requests between the servers that handle them. We use a set of routers to perform that operation:

Our routers listen on Anycast IP addresses which are advertised over the public Internet. If you have a site on Cloudflare, your site is available via two of these addresses. In this case, I am doing a DNS query for “workers.dev”, a site which is powered by Cloudflare:

➜ dig workers.dev

;; QUESTION SECTION:
;workers.dev.      IN  A

workers.dev.    161  IN  A  198.41.215.162
workers.dev.    161  IN  A  198.41.214.162

;; SERVER: 1.1.1.1#53(1.1.1.1)


workers.dev is available at two addresses 198.41.215.162 and 198.41.214.162 (along with two IPv6 addresses available via the AAAA DNS query). Those two addresses are advertised from every one of our data centers around the world. When someone connects to any Internet property on Cloudflare, each networking device their packets pass through will choose the shortest path to the nearest Cloudflare data center from their computer or phone.

Once the packets hit our data center, we send them to one of the many servers which operate there. Traditionally, one might use a load balancer to do that type of traffic distribution across multiple machines. Unfortunately putting a set of load balancers capable of handling our volume of traffic in every data center would be exceptionally expensive, and wouldn’t scale as easily as our servers do. Instead, we use devices built for operating on exceptional volumes of traffic: network routers.

Once a packet hits our data center it is processed by a router. That router sends the traffic to one of a set of servers responsible for handling that address using a routing strategy called ECMP (Equal-Cost Multi-Path). ECMP refers to the situation where the router doesn’t have a clear ‘winner’ between multiple routes, it has multiple good next hops, all to the same ultimate destination. In our case we hack that concept a bit, rather than using ECMP to balance across multiple intermediary links, we make the intermediary link addresses the final destination of our traffic: our servers.

Here’s the configuration of a Juniper-brand router of the type which might be in one of our data centers, and which is configured to balance traffic across three destinations:

[email protected]# show routing-options

static {
route 172.16.1.0/24 next-hop [ 172.16.2.1 172.16.2.2 172.16.2.3 ];
}
forwarding-table {
}


Since the ‘next-hop’ is our server, traffic will be split across multiple machines very efficiently.

## TCP, IP, and ECMP

IP is responsible for sending packets of data from addresses like 93.184.216.34 to 208.80.153.224 (or [2606:2800:220:1:248:1893:25c8:1946] to [2620:0:860:ed1a::1] in the case of IPv6) across the Internet. It’s the “Internet Protocol”.

TCP (Transmission Control Protocol) operates on top of a protocol like IP which can send a packet from one place to another, and makes data transmission reliable and useful for more than one process at a time. It is responsible for taking the unreliable and misordered packets that might arrive over a protocol like IP and delivering them reliably, in the correct order. It also introduces the concept of a ‘port’, a number from 1-65535 which help route traffic on a computer or phone to a specific service (such as the web or email). Each TCP connection has a source and destination port which is included in the header TCP adds to the beginning of each packet. Without the idea of ports it would not be easy to figure out which messages were destined for which program. For example, both Google Chrome and Mail might wish to send messages over your WiFi connection at the same time, so they will each use their own port.

Here’s an example of making a request for https://cloudflare.com/ at 198.41.215.162, on the default port for HTTPS: 443. My computer has randomly assigned me the port 51602 which it will listen on it for a response, which will (hopefully) receive the contents of the site:

Internet Protocol Version 4, Src: 19.5.7.21, Dst: 198.41.215.162
Protocol: TCP (6)
Source: 19.5.7.21
Destination: 198.41.215.162
Transmission Control Protocol, Src Port: 51602, Dst Port: 443, Seq: 0, Len: 0
Source Port: 51602
Destination Port: 443



Looking at the same request from the Cloudflare side will be a mirror image, a request from my public IP address originating at my source port, destined for port 443 (I’m ignoring NAT for the moment, more on that later):

Internet Protocol Version 4, Src: 198.41.215.16, Dst: 19.5.7.21
Protocol: TCP (6)
Source: 198.41.215.162
Destination: 19.5.7.21
Transmission Control Protocol, Src Port: 443, Dst Port: 51602, Seq: 0, Len: 0
Source Port: 443
Destination Port: 51602


We can now return to ECMP! It could be theoretically possible to use ECMP to balance packets between servers randomly, but you would almost never want to do that. A message over the Internet is generally composed of multiple TCP packets. If each packet were sent to a different server it would be impossible to reconstruct the original message in any one place and act on it. Even beyond that, it would be terrible for performance: we rely on being able to maintain long-lived TCP and TLS sessions which require a persistent connection to a single server. To provide that persistence, our routers don’t balance traffic randomly, they use a combination of four values: the source address, the source port, the destination address, and the destination port. Traffic with the same combination of those four values will always make it to the same server. In the case of my example above, all of my messages destined to cloudflare.com will make it to a single server which can reconstruct the TCP packets into my request and return packets in a response.

## Enter WARP

For a conventional request it is very important that our ECMP routing sends all of your packets to the same server for the duration of your request. Over the web a request commonly lasts less than ten seconds and the system works well. Unfortunately we quickly ran into issues with WARP.

WARP uses a session key negotiated with public-key encryption to secure packets. For a successful connection, both sides must negotiate a connection which is then only valid for that particular client and the specific server they are talking to. This negotiation takes time and has to be completed any time a client talks to a new server. Even worse, if packets get sent which expect one server, and end up at another, they can’t be decrypted, breaking the connection. Detecting those failed packets and restarting the connection from scratch takes so much time that our alpha testers experienced it as a complete loss of their Internet connection. As you can imagine, testers don’t leave WARP on very long when it prevents them from using the Internet.

WARP was experiencing so many failures because devices were switching servers much more often than we expected. If you recall, our ECMP router configuration uses a combination of (Source IP, Source Port, Destination IP, Destination Port) to match a packet to a server. Destination IP doesn’t generally change, WARP clients are always connecting to the same Anycast addresses. Similarly, Destination Port doesn’t change, we always listen on the same port for WARP traffic. The other two values, Source IP and Source Port, were changing much more frequently than we had planned.

One source of these changes was expected. WARP runs on cell phones, and cell phones commonly switch from Cellular to Wi-Fi connections. When you make that switch you suddenly go from communicating over the Internet via your cellular carrier’s (like AT&T or Verizon) IP address space to that of the Internet Service Provider your Wi-Fi connection uses (like Comcast or Google Fiber). It’s essentially impossible that your IP address won’t change when you move between connections.

The port changes occurred even more frequently than could be explained by network switches however. For an understanding of why we need to introduce one more component of Internet lore: Network Address Translation.

## NAT

An IPv4 address is composed of 32 bits (often written as four eight-bit numbers). If you exclude the reserved addresses which can’t be used, you are left with 3,706,452,992 possible addresses. This number has remained constant since IPv4 was deployed on the ARPANET in 1983, even as the number of devices has exploded (although it might go up a bit soon if the 0.0.0.0/8 becomes available). This data is based on Gartner Research predictions and estimates:

IPv6 is the definitive solution to this problem. It expands the length of an address from 32 to 128 bits, with 125 available in a valid Internet address at the moment (all public IPv6 addresses have the first three bits set to 001, the remaining 87.5% of the IPv6 address space is not considered necessary yet). 2^125 is an impossibly large number and would be more than enough for every device on Earth to have its own address. Unfortunately, 21 years after it was published, IPv6 still remains unsupported on many networks. Much of the Internet still relies on IPv4, and as seen above, there aren’t enough IPv4 addresses for every device to have their own.

To solve this problem many devices are commonly put behind a single Internet-addressable IP address. A router is used to do Network Address Translation; to take messages which arrive on that single public IP and forward them to the appropriate device on their local network. In effect it’s as if everyone in your apartment building had the same street address, and the postal worker was responsible for sorting out what mail was meant for which person.

When a response arrives destined for the port it has allocated to you, it matches it to your connection and again rewrites it, this time replacing the destination address with your address on the local network, and the destination port with the original source port you specified. It has transparently allowed all the devices on your network to act as if they were one big computer with a single Internet-connected IP address.

This process works very well for the duration of a common request over the Internet. Your router only has so much space however, so it will helpfully delete old port assignments, freeing up space for new ones. It generally waits for the connection to not have any messages for thirty seconds or more before deleting an assignment, making it unlikely a response will arrive which it can no longer direct to the appropriate source. Unfortunately, WARP sessions need to last much longer than thirty seconds.

When you next send a message after your NAT session has expired, you are given a new source port. That new port causes your ECMP mapping (based on source IP, source port, destination IP, destination port) to change, causing us to route your requests to a new machine within the Cloudflare data center your messages are arriving at. This breaks your WARP session, and your Internet connection.

We experimented extensively with methods of keeping your NAT session fresh by periodically sending keep-alive messages which would prevent routers and mobile carriers from evicting mappings. Unfortunately waking the radio of your device every thirty seconds has unfortunate consequences for your battery life, and it was not entirely successful at preventing port and address changes. We needed a way to always map sessions to the same machine, even as their source port (and even source address) changed.

Fortunately, we had a solution which came from elsewhere at Cloudflare. We don’t use dedicated load balancers, but we do have many of the same problems load balancers solve. We have long needed to map traffic to Cloudflare servers with more control than ECMP allows alone. Rather than deploying an entire tier of load balancers, we use every server in our network as a load balancer, forwarding packets first to an arbitrary machine and then relying on that machine to forward the packet to the appropriate host. This consumes minimal resources and allows us to scale our load balancing infrastructure with each new machine we add. We have a lot more to share on how this infrastructure works and what makes it unique, subscribe to this blog to be notified when that post is released.

To make our load balancing technique work though we needed a way to identify which client a WARP packet was associated with before it could be decrypted. To understand how we did that it’s helpful to understand how WARP encrypts your messages. The industry standard way of connecting a device to a remote network is a VPN. VPNs use a protocol like IPsec to allow your device to send messages securely to a remote network. Unfortunately, VPNs are generally rather disliked. They slow down connections, eat battery life, and their complexity makes them frequently the source of security vulnerabilities. Users of corporate networks which mandate VPNs often hate them, and the idea that we would convince millions of consumers to install one voluntarily seemed ridiculous.

After considering and testing several more modern options, we landed on WireGuard®. WireGuard is a modern, high performance, and most importantly, simple, protocol created by Jason Donenfeld to solve the same problem. Its original code-base is less than 1% the size of a popular IPsec implementation, making it easy for us to understand and secure. We chose Rust as the language most likely to give us the performance and safety we needed and implemented WireGuard while optimizing the code heavily to run quickly on the platforms we were targeting. Then we open sourced the project.

WireGuard changes two very relevant things about the traffic you send over the Internet. The first is it uses UDP not TCP. The second is it uses a session key negotiated with public-key encryption to secure the contents of that UDP packet.

TCP is the conventional protocol used for loading a website over the Internet. It combines the ability to address ports (which we talked about previously) with reliable delivery and flow control. Reliable delivery ensures that if a message is dropped, TCP will eventually resend the missing data. Flow control gives TCP the tools it needs to handle many clients all sharing the same link who exceed its capacity. UDP is a much simpler protocol which trades these capabilities for simplicity, it makes a best-effort attempt to send a message, and if the message is missing or there is too much data for the links, messages are simply never heard of again.

UDP’s lack of reliability would normally be a problem while browsing the Internet, but we are not simply sending UDP, we are sending a complete TCP packet _inside_ our UDP packets.

Inside the payload encrypted by WireGuard we have a complete TCP header which contains all the information necessary to ensure reliable delivery. We then wrap it with WireGuard’s encryption and use UDP to (less-than-reliably) send it over the Internet. Should it be dropped TCP will do its job just as if a network link lost the message and resend it. If we instead wrapped our inner, encrypted, TCP session in another TCP packet as some other protocols do we would dramatically increase the number of network messages required, destroying performance.

The second interesting component of WireGuard relevant to our discussion is public-key encryption. WireGuard allows you to secure each message you send such that only the specific destination you are sending it to can decrypt it. That is a powerful way of ensuring your security as you browse the Internet, but it means it is impossible to read anything inside the encrypted payload until the message has reached the server which is responsible for your session.

Returning to our load balancing issue, you can see that only three things are accessible to us before we can decrypt the message: The IP Header, the UDP Header, and the WireGuard header. Neither the IP Header or UDP Header include the information we need, as we have already failed with the four pieces of information they contain (source IP, source port, destination IP, destination port). That leaves the WireGuard header as the one location where we can find an identifier which can be used to keep track of who the client was before decrypting the message. Unfortunately, there isn’t one. This is the format of the message used to initiate a connection:

sender looks temptingly like a client id, but it’s randomly assigned every handshake. Handshakes have to be performed every two minutes to rotate keys making them insufficiently persistent. We could have forked the protocol to add any number of additional fields, but it is important to us to remain wire-compatible with other WireGuard clients. Fortunately, WireGuard has a three byte block in its header which is not currently used by other clients. We decided to put our identifier in this region and still support messages from other WireGuard clients (albeit with less reliable routing than we can offer). If this reserved section is used for other purposes we can ignore those bits or work with the WireGuard team to extend the protocol in another suitable way.

When we begin a WireGuard session we include our clientid field which is provided by our authentication server which has to be communicated with to begin a WARP session:

Data messages similarly include the same field:

It’s important to note that the clientid is only 24 bits long. That means there are less possible clientid values than the current number of users waiting to use WARP. This suits us well as we don’t need or want the ability to track individual WARP users. clientid is only necessary for load balancing, once it serves its purpose we get it expunged from our systems as quickly as we can.

The load balancing system now uses a hash of the clientid to identify which machine a packet should be routed to, meaning  WARP messages always arrive at the same machine even as you change networks or move from Wi-Fi to cellular, and the problem was eliminated.

## Client Software

Cloudflare has never developed client software before. We take pride in selling a service anyone can use without needing to buy hardware or provision infrastructure. To make WARP work, however, we needed to deploy our code onto one of the most ubiquitous hardware platforms on Earth: smartphones.

While developing software on mobile devices has gotten steadily easier over the past decade, unfortunately developing low-level networking software remains rather difficult. To consider one example: we began the project using the latest iOS connection API called Network, introduced in iOS 12. Apple strongly recommends the use of Network, in their words “Your customers are going to appreciate how much better your connections, how much more reliable your connections are established, and they’ll appreciate the longer battery life from the better performance.”

The Network framework provides a pleasantly high-level API which, as they say, integrates well with the native performance features built into iOS. Creating a UDP connection (connection is a bit of a misnomer, there are no connections in UDP, just packets) is as simple as:

self.connection = NWConnection(host: hostUDP, port: portUDP, using: .udp)

And sending a message can be as easy as:

self.connection?.send(content: content)

Unfortunately, at a certain point code actually gets deployed, and bug reports begin flowing in. The first issue was the simplicity of the API made it impossible for us to process more than a single UDP packet at a time. We commonly use packets of up to 1500 bytes, running a speed test on my Google Fiber connection currently results in a speed of 370 Mbps, or almost thirty-one thousand packets per second. Attempting to process each packet individually was slowing down connections by as much as 40%. According to Apple, the best solution to get the performance we needed was to fallback to the older NWUDPSession API, introduced in iOS 9.

## IPv6

If we compare the code required to create a NWUDPSession to the example above you will notice that we suddenly care which protocol, IPv4 or IPv6, we are using:

let v4Session = NWUDPSession(upgradeFor: self.ipv4Session)


In fact, NWUDPSession does not handle many of the more tricky elements of creating connections over the Internet. For example, the Network framework will automatically determine whether a connection should be made over IPv4 or 6:

NWUDPSession does not do this for you, so we began creating our own logic to determine which type of connection should be used. Once we began to experiment, it quickly became clear that they are not created equal. It’s fairly common for a route to the same destination to have very different performance based on whether you use its IPv4 or IPv6 address. Often this is because there are simply fewer IPv4 addresses which have been around for longer, making it possible for those routes to be better optimized by the Internet’s infrastructure.

Every Cloudflare product has to support IPv6 as a rule. In 2016, we enabled IPv6 for over 98% of our network, over four million sites, and made a pretty big dent in IPv6 adoption on the web:

We couldn’t release WARP without IPv6 support. We needed to ensure that we were always using the fastest possible connection while still supporting both protocols with equal measure. To solve that we turned to a technology we have used with DNS for years: Happy Eyeballs. As codified in RFC 6555 Happy Eyeballs is the idea that you should try to look for both an IPv4 and IPv6 address when doing a DNS lookup. Whichever returns first, wins. That way you can allow IPv6 websites to load quickly even in a world which does not fully support it.

As an example, I am loading the website http://zack.is/. My web browser makes a DNS request for both the IPv4 address (an “A” record) and the IPv6 address (an “AAAA” record) at the same time:

Internet Protocol Version 4, Src: 192.168.7.21, Dst: 1.1.1.1
User Datagram Protocol, Src Port: 47447, Dst Port: 53
Domain Name System (query)
Queries
zack.is: type A, class IN

Internet Protocol Version 4, Src: 192.168.7.21, Dst: 1.1.1.1
User Datagram Protocol, Src Port: 49946, Dst Port: 53
Domain Name System (query)
Queries
zack.is: type AAAA, class IN

In this case the response to the A query returned more quickly, and the connection is begun using that protocol:

Internet Protocol Version 4, Src: 1.1.1.1, Dst: 192.168.7.21
User Datagram Protocol, Src Port: 53, Dst Port: 47447
Domain Name System (response)
Queries
zack.is: type A, class IN
zack.is: type A, class IN, addr 104.24.101.191

Internet Protocol Version 4, Src: 192.168.7.21, Dst: 104.24.101.191
Transmission Control Protocol, Src Port: 55244, Dst Port: 80, Seq: 0, Len: 0
Source Port: 55244
Destination Port: 80
Flags: 0x002 (SYN)


We don’t need to do DNS queries to make WARP connections, we know the IP addresses of our data centers already, but we do want to know which of the IPv4 and IPv6 addresses will lead to a faster route over the Internet. To accomplish that we perform the same technique but at the network level: we send a packet over each protocol and use the protocol which returns first for subsequent messages. With some error handling and logging removed for brevity, it appears as:

let raceFinished = Atomic<Bool>(false)

let happyEyeballsRacer: (NWUDPSession, NWUDPSession, String) -> Void = {
(session, otherSession, name) in
// Session is the session the racer runs for, otherSession is a session we race against

let handleMessage: ([Data]) -> Void = { datagrams in
// This handler will be executed twice, once for the winner, again for the loser.

if raceFinished.swap(true) {
// This racer lost
}

// The winner becomes the current session
self.wireguardServerUDPSession = session

}

handleMessage(datagrams)
}, maxDatagrams: 1)

if !raceFinished.value {
// Send a handshake message
session.writeDatagram(onViable())
}
}


This technique successfully allows us to support IPv6 addressing. In fact, every device which uses WARP instantly supports IPv6 addressing even on networks which don’t have support. Using WARP takes the 34% of Comcast’s network which doesn’t support IPv6 or the 69% of Charter’s network which doesn’t (as of 2018), and allows those users to communicate to IPv6 servers successfully.

This test shows my phone’s IPv6 support before and after enabling WARP:

## Dying Connections

Nothing is simple however, with iOS 12.2 NWUDPSession began to trigger errors which killed connections. These errors were only identified with a code ‘55’. After some research it appears 55 has referred to the same error since the early foundations of the FreeBSD operating system OS X was originally built upon. In FreeBSD it’s commonly referred to as ENOBUFS, and it’s returned when the operating system does not have sufficient BUFfer Space to handle the operation being completed. For example, looking at the source of a FreeBSD today, you see this code in its IPv6 implementation:

In this example, if enough memory cannot be allocated to accommodate the size of an IPv6 and ICMP6 header, the error ENOBUFS (which is mapped to the number 55) will be returned. Unfortunately, Apple’s take on FreeBSD is not open source however: how, when, and why they might be returning the error is a mystery. This error has been experienced by other UDP-based projects, but a resolution is not forthcoming.

What is clear is once an error 55 begins occurring, the connection is no longer usable. To handle this case we need to reconnect, but doing the same Happy Eyeballs mechanic we do on initial connection is both unnecessary (as we were already talking over the fastest connection), and will consume valuable time. Instead we add a second connection method which is only used to recreate an already working session:

/**
Create a new UDP connection to the server using a Happy Eyeballs like heuristic.

This function should be called when first establishing a connection to the edge server.

It will initiate a new connection over IPv4 and IPv6 in parallel, keeping the connection that receives the first response.
*/

func connect(onViable: @escaping () -> Data, onReply: @escaping () -> Void, onFailure: @escaping () -> Void, onDisconnect: @escaping () -> Void)

/**
Recreate the current connections.

This function should be called as a response to error code 55, when a quick connection is required.

Unlike happyEyeballs, this function will use viability as its only success criteria.
*/

func reconnect(onViable: @escaping () -> Void, onFailure: @escaping () -> Void, onDisconnect: @escaping () -> Void)


Using reconnect we are able to recreate sessions broken by code 55 errors, but it still adds a latency hit which is not ideal. As with all client software development on a closed-source platform however, we are dependent on the platform to identify and fix platform-level bugs.

Truthfully, this is just one of a long list of platform-specific bugs we ran into building WARP. We hope to continue working with device vendors to get them fixed. There are an unimaginable number of device and connection combinations, and each connection doesn’t just exist at one moment in time, they are always changing, entering and leaving broken states almost faster than we can track. Even now, getting WARP to work on every device and connection on Earth is not a solved problem, we still get daily bug reports which we work to triage and resolve.

## WARP+

WARP is meant to be a place where we can apply optimizations which make the Internet better. We have a lot of experience making websites more performant, WARP is our opportunity to experiment with doing the same for all Internet traffic.

At Cloudflare we have a product called Argo. Argo makes websites’ time to first byte more than 30% faster on average by continually monitoring thousands of routes over the Internet between our data centers. That data builds a database which maps every IP address range with the fastest possible route to every destination. When a packet arrives it first reaches the closest data center to the client, then that data center uses data from our tests to discover the route which will get the packet to its destination with the lowest possible latency. You can think of it like a traffic-aware GPS for the Internet.

Argo has historically only operated on HTTP packets. HTTP is the protocol which powers the web, sending messages which load websites on top of TCP and IP. For example, if I load http://zack.is/, an HTTP message is sent inside a TCP packet:

Internet Protocol Version 4, Src: 192.168.7.21, Dst: 104.24.101.191
Transmission Control Protocol, Src Port: 55244, Dst Port: 80
Source Port: 55244
Destination Port: 80
Hypertext Transfer Protocol
GET / HTTP/1.1\r\n
Host: zack.is\r\n
Connection: keep-alive\r\n
Accept-Encoding: gzip, deflate\r\n
Accept-Language: en-US,en;q=0.9\r\n
\r\n


The modern and secure web presents a problem for us however: When I make the same request over HTTPS (https://zack.is) rather than just HTTP (http://zack.is), I see a very different result over the wire:

Internet Protocol Version 4, Src: 192.168.7.21, Dst: 104.25.151.102
Transmission Control Protocol, Src Port: 55983, Dst Port: 443
Source Port: 55983
Destination Port: 443
Transport Layer Security
Transport Layer Security
TLSv1.2 Record Layer: Application Data Protocol: http-over-tls


My request has been encrypted! It’s no longer possible for WARP (or anyone but the destination) to tell what is in the payload. It might be HTTP, but it also might be any other protocol. If my site is one of the twenty-million which use Cloudflare already, we can decrypt the traffic and accelerate it (along with a long list of other optimizations). But for encrypted traffic destined for another source existing HTTP-only Argo technology was not going to work.

Fortunately we now have a good amount of experience working with non-HTTP traffic through our Spectrum and Magic Transit products. To solve our problem the Argo team turned to the CONNECT protocol.

As we now know, when a WARP request is made it first communicates over the WireGuard protocol to a server running in one of our 194 data centers around the world. Once the WireGuard message has been decrypted, we examine the destination IP address to see if it is an HTTP request destined for a Cloudflare-powered site, or a request destined elsewhere. If it’s destined for us it enters our standard HTTP serving path; often we can reply to the request directly from our cache in the very same data center.

If it’s not destined for a Cloudflare-powered site we instead forward the packet to a proxy process which runs on each machine. This proxy is responsible for loading the fastest path from our Argo database and beginning an HTTP session with a machine in the data center this traffic should be forwarded to. It uses the CONNECT command to both transmit metadata (as headers) and turn the HTTP session into a connection which can transmit the raw bytes of the payload:

CONNECT 8.54.232.11:5564 HTTP/1.1\r\n
Exit-Tcp-Keepalive-Duration: 15\r\n
Application: warp\r\n
\r\n
<data to send to origin>


Once the message arrives at the destination data center it is either forwarded to another data center (if that is best for performance), or directed directly to the origin which is awaiting the traffic.

Smart routing is just the beginning of WARP+; We have a long list of projects and plans which are all aimed at making your Internet faster, and couldn’t be more thrilled to finally have a platform to test them with.

## Our Mission

Today, after well over a year of development, WARP is available to you and to your friends and family. For us though, this is just the beginning. With the ability to improve full network connection for all traffic, we unlock a whole new world of optimizations and security improvements which were simply impossible before. We couldn’t be more excited to experiment, play, and eventually release, all sorts of new WARP and WARP+ features.

Cloudflare’s mission is to help build a better Internet. If we are willing to experiment and solve hard technical problems together we believe we can help make the future of the Internet better than the Internet of today, and we are all grateful to play a part in that. Thank you for trusting us with your Internet connection.

WARP was built by Vlad Krasnov, Chris Branch, Dane Knecht, Naga Tripirineni, Andrew Plunk, Adam Schwartz, Irtefa, and intern Michelle Chen with support from members of our Austin, San Francisco, Champaign, London, Warsaw, and Lisbon offices.

# Inside the Web Browser’s Performance API

Post Syndicated from Young Park original https://blog.cloudflare.com/browser-performance-api/

Building a beautiful, feature-rich website is easier than ever before. Not long ago, you’d have to fire up a text editor and hand-craft a lot of HTML, CSS, and JavaScript. Today, you can use WYSIWYG tools and third-party libraries that make development much simpler. The flip side of this is that it can be hard to see everything that’s going into your website — and the performance can suffer.

The good news is that modern web browsers expose lots of performance data that can help you understand how your web page performs. With the launch of Browser Insights today, we can analyze the performance from the perspective of the web browser and what the end user actually experiences. In this post, we’ll dive into how we think about performance and utilize the timing APIs in the web browser.

## How web browsers measure performance

In the old days, the only way for a developer to profile performance was to intercept requests and measure the time from the beginning of the page load until the end of the load event.

Today, we can use Web APIs that are supported by modern browsers. This is part of the web standard called the Performance API. The Performance API consists of a collection of individual APIs that include:

• Navigation Timing (for timing information relates to the page and navigation)
• Resource Timing (for timing data regarding the loading of an application’s resources)
• Paint Timing (that provides timing information about paint operation during the page construction)

In this blog post, we will primarily focus on the Navigation Timing API.

## Inside the Performance API

To see what’s collected with the Performance API, you can open the Developer Tools in Chrome browser and type ‘performance’ in the console tab (or type in performance.timing to get direct access to the PerformanceTiming attribute).

Try expanding the Performance object by clicking on the arrow before the label and again expand the ‘timing’. This is called PerformanceTiming, which includes all the timings that relate to the current page load as UNIX epoch timestamp (milliseconds). The timing attributes shown are not in the order of the load. So for better understanding, let’s look at the illustration provided by the W3C.

As we can see from the diagram shown, each element (represented as a box above) in the order from left-to-right, represents the navigation flow of the page load. Each element has an attribute from the starting point to the end (and some have multiple attributes!) so that we can measure the elapsed time for each element. For example, to get the Request Time, you could type in the command like shown below in the console which appears to be 60 milliseconds.

## How Cloudflare uses performance data

Once your website is proxied through Cloudflare and Browser Insights is enabled, we write and inject a JavaScript beacon into the web page. Our beacon collects metrics from the Performance API to send to our edge, where it can be used to understand where your website is slowing down or having any network problems. The reported data is shown in the Cloudflare dashboard on the Speed Page showing as averages of each timing metric:

The metrics we surface are:

• DNS (domainLookupEnd – domainLookupStart): How long the DNS query takes. This could appear as zero if the connection is reused or the content was stored in the local cache (memory or disk).
• TCP (connectEnd – connectStart): How long it takes to establish a TCP connection the server. If HTTPS, this process includes TLS negotiation time.
• Request (responseStart – requestStart): The time elapsed between making an HTTP request and receiving the first byte of the response.
• Response (responseEnd – responseStart): The time elapsed between the first byte and the last byte of the response received. You can think of this as a resource download time.
• Processing (domComplete – domLoading): How long it took to render the page. If this number is big, you can optimize your document architecture, resource size, or configure settings under Speed page such as Auto Minify the source code. This document process can be drill down more with domInteractive, domContentLoadedEventStart, domContentLoadedEventEnd, and domComplete. We plan to provide more detailed analytics on this later on.
• Load Event (loadEventEnd – loadEventStart): When the browser finishes loading its document and resources, it triggers a load event. This duration may be helpful to you if you have additional functions or any logic for the load event.
• Total Time: Sum of each timing metrics shown on the graph.

If you are seeing any spikes or unusual form of a line in the stacked line chart, you could start investigating on each element to see what is causing the problem.

For more about how to use Browser Insights, see our announcement blog post.

## What’s next

In this blog post, we’ve focused on the Navigation Timing API, because it’s at the heart of our first version of Browser Insights. In the near future, we plan to incorporate metrics from some of the other APIs. For example, we can break down some of the longer timings by looking at individual resource loads, and pointing out what’s taking longer. In addition to that, we plan to track JavaScript errors, provide a way to measure A/B performance, set up monitoring/alerting based on the metrics, and so on. So stay tuned!

# Details of the Cloudflare outage on July 2, 2019

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/

Almost nine years ago, Cloudflare was a tiny company and I was a customer not an employee. Cloudflare had launched a month earlier and one day alerting told me that my little site, jgc.org, didn’t seem to have working DNS any more. Cloudflare had pushed out a change to its use of Protocol Buffers and it had broken DNS.

I wrote to Matthew Prince directly with an email titled “Where’s my dns?” and he replied with a long, detailed, technical response (you can read the full email exchange here) to which I replied:

From: John Graham-Cumming
Date: Thu, Oct 7, 2010 at 9:14 AM
Subject: Re: Where's my dns?
To: Matthew Prince

Awesome report, thanks. I'll make sure to call you if there's a
problem.  At some point it would probably be good to write this up as
a blog post when you have all the technical details because I think
people really appreciate openness and honesty about these things.
Especially if you couple it with charts showing your post launch
traffic increase.

I have pretty robust monitoring of my sites so I get an SMS when
anything fails.  Monitoring shows I was down from 13:03:07 to
14:04:12.  Tests are made every five minutes.

It was a blip that I'm sure you'll get past.  But are you sure you
don't need someone in Europe? :-)


To which he replied:

From: Matthew Prince
Date: Thu, Oct 7, 2010 at 9:57 AM
Subject: Re: Where's my dns?
To: John Graham-Cumming

Thanks. We've written back to everyone who wrote in. I'm headed in to
the office now and we'll put something on the blog or pin an official
post to the top of our bulletin board system. I agree 100%
transparency is best.


And so, today, as an employee of a much, much larger Cloudflare I get to be the one who writes, transparently about a mistake we made, its impact and what we are doing about it.

### The events of July 2

On July 2, we deployed a new rule in our WAF Managed Rules that caused CPUs to become exhausted on every CPU core that handles HTTP/HTTPS traffic on the Cloudflare network worldwide. We are constantly improving WAF Managed Rules to respond to new vulnerabilities and threats. In May, for example, we used the speed with which we can update the WAF to push a rule to protect against a serious SharePoint vulnerability. Being able to deploy rules quickly and globally is a critical feature of our WAF.

Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving. This brought down Cloudflare’s core proxying, CDN and WAF functionality. The following graph shows CPUs dedicated to serving HTTP/HTTPS traffic spiking to nearly 100% usage across the servers in our network.

This resulted in our customers (and their customers) seeing a 502 error page when visiting any Cloudflare domain. The 502 errors were generated by the front line Cloudflare web servers that still had CPU cores available but were unable to reach the processes that serve HTTP/HTTPS traffic.

We know how much this hurt our customers. We’re ashamed it happened. It also had a negative impact on our own operations while we were dealing with the incident.

It must have been incredibly stressful, frustrating and frightening if you were one of our customers. It was even more upsetting because we haven’t had a global outage for six years.

The CPU exhaustion was caused by a single WAF rule that contained a poorly written regular expression that ended up creating excessive backtracking. The regular expression that was at the heart of the outage is (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Although the regular expression itself is of interest to many people (and is discussed more below), the real story of how the Cloudflare service went down for 27 minutes is much more complex than “a regular expression went bad”. We’ve taken the time to write out the series of events that lead to the outage and kept us from responding quickly. And, if you want to know more about regular expression backtracking and what to do about it, then you’ll find it in an appendix at the end of this post.

### What happened

Let’s begin with the sequence of events. All times in this blog are UTC.

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. This generated a Change Request ticket. We use Jira to manage these tickets and a screenshot is below.

Three minutes later the first PagerDuty page went out indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF (we have hundreds of such tests) from outside Cloudflare to ensure that it is working correctly. This was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors and then many reports from our points-of-presence (PoPs) in cities worldwide indicating there was CPU exhaustion.

Some of these alerts hit my watch and I jumped out of the meeting I was in and was on my way back to my desk when a leader in our Solutions Engineering group told me we had lost 80% of our traffic. I ran over to SRE where the team was debugging the situation. In the initial moments of the outage there was speculation it was an attack of some type we’d never seen before.

Cloudflare’s SRE team is distributed around the world, with continuous, around-the-clock coverage. Alerts like these, the vast majority of which are noting very specific issues of limited scopes in localized areas, are monitored in internal dashboards and addressed many times every day. This pattern of pages and alerts, however, indicated that something gravely serious had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.

The London engineering team was at that moment in our main event space listening to an internal tech talk. The talk was interrupted and everyone assembled in a large conference room and others dialed-in. This wasn’t a normal problem that SRE could handle alone, it needed every relevant team online at once.

At 14:00 the WAF was identified as the component causing the problem and an attack dismissed as a possibility. The Performance Team pulled live CPU data from a machine that clearly showed the WAF was responsible. Another team member used strace to confirm. Another team saw error logs indicating the WAF was in trouble. At 14:02 the entire team looked at me when it was proposed that we use a ‘global kill’, a mechanism built into Cloudflare to disable a single component worldwide.

But getting to the global WAF kill was another story. Things stood in our way. We use our own products and with our Access service down we couldn’t authenticate to our internal control panel (and once we were back we’d discover that some members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently).

And we couldn’t get to other internal services like Jira or the build system. To get to them we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). Eventually, a team member executed the global WAF kill at 14:07 and by 14:09 traffic levels and CPU were back to expected levels worldwide. The rest of Cloudflare’s protection mechanisms continued to operate.

Then we moved on to restoring the WAF functionality. Because of the sensitivity of the situation we performed both negative tests (asking ourselves “was it really that particular change that caused the problem?”) and positive tests (verifying the rollback worked) in a single city using a subset of traffic after removing our paying customers’ traffic from that location.

At 14:52 we were 100% satisfied that we understood the cause and had a fix in place and the WAF was re-enabled globally.

### How Cloudflare operates

Cloudflare has a team of engineers who work on our WAF Managed Rules product; they are constantly working to improve detection rates, lower false positives, and respond rapidly to new threats as they emerge. In the last 60 days, 476 change requests have been handled for the WAF Managed Rules (averaging one every 3 hours).

This particular change was to be deployed in “simulate” mode where real customer traffic passes through the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and measure its false positive and false negative rate. But even in the simulate mode the rules actually need to execute and in this case the rule contained a regular expression that consumed excessive CPU.

As can be seen from the Change Request above there’s a deployment plan, a rollback plan and a link to the internal Standard Operating Procedure (SOP) for this type of deployment. The SOP for a rule change specifically allows it to be pushed globally. This is very different from all the software we release at Cloudflare where the SOP first pushes software to an internal dogfooding network point of presence (PoP) (which our employees pass through), then to a small number of customers in an isolated location, followed by a push to a large number of customers and finally to the world.

The process for a software release looks like this: We use git internally via BitBucket. Engineers working on changes push code which is built by TeamCity and when the build passes, reviewers are assigned. Once a pull request is approved the code is built and the test suite runs (again).

If the build and tests pass then a Change Request Jira is generated and the change has to be approved by the relevant manager or technical lead. Once approved deployment to what we call the “animal PoPs” occurs: DOG, PIG, and the Canaries.

The DOG PoP is a Cloudflare PoP (just like any of our cities worldwide) but it is used only by Cloudflare employees. This dogfooding PoP enables us to catch problems early before any customer traffic has touched the code. And it frequently does.

If the DOG test passes successfully code goes to PIG (as in “Guinea Pig”). This is a Cloudflare PoP where a small subset of customer traffic from non-paying customers passes through the new code.

If that is successful the code moves to the Canaries. We have three Canary PoPs spread across the world and run paying and non-paying customer traffic running through them on the new code as a final check for errors.

Once successful in Canary the code is allowed to go live. The entire DOG, PIG, Canary, Global process can take hours or days to complete, depending on the type of code change. The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a release is pushed to all our customers globally. But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats.

### WAF Threats

In the last few years we have seen a dramatic increase in vulnerabilities in common applications. This has happened due to the increased availability of software testing tools, like fuzzing for example (we just posted a new blog on fuzzing here).

What is commonly seen is a Proof of Concept (PoC) is created and often published on Github quickly, so that teams running and maintaining applications can test to make sure they have adequate protections. Because of this, it’s imperative that Cloudflare are able to react as quickly as possible to new attacks to give our customers a chance to patch their software.

A great example of how Cloudflare proactively provided this protection was through the deployment of our protections against the SharePoint vulnerability in May (blog here). Within a short space of time from publicised announcements, we saw a huge spike in attempts to exploit our customer’s Sharepoint installations. Our team continuously monitors for new threats and writes rules to mitigate them on behalf of our customers.

The specific rule that caused last Tuesday’s outage was targeting Cross-site scripting (XSS) attacks. These too have increased dramatically in recent years.

The standard procedure for a WAF Managed Rules change indicates that Continuous Integration (CI) tests must pass prior to a global deploy. That happened normally last Tuesday and the rules were deployed. At 13:31 an engineer on the team had merged a Pull Request containing the change after it was approved.

At 13:37 TeamCity built the rules and ran the tests, giving it the green light. The WAF test suite tests that the core functionality of the WAF works and consists of a large collection of unit tests for individual matching functions. After the unit tests run the individual WAF rules are tested by executing a huge collection of HTTP requests against the WAF. These HTTP requests are designed to test requests that should be blocked by the WAF (to make sure it catches attacks) and those that should be let through (to make sure it isn’t over-blocking and creating false positives). What it didn’t do was test for runaway CPU utilization by the WAF and examining the log files from previous WAF builds shows that no increase in test suite run time was observed with the rule that would ultimately cause CPU exhaustion on our edge.

With the tests passing, TeamCity automatically began deploying the change at 13:42.

### Quicksilver

Because WAF rules are required to address emergent threats they are deployed using our Quicksilver distributed key-value (KV) store that can push changes globally in seconds. This technology is used by all our customers when making configuration changes in our dashboard or via the API and is the backbone of our service’s ability to respond to changes very, very rapidly.

We haven’t really talked about Quicksilver much. We previously used Kyoto Tycoon as a globally distributed key-value store, but we ran into operational issues with it and wrote our own KV store that is replicated across our more than 180 cities. Quicksilver is how we push changes to customer configuration, update WAF rules, and distribute JavaScript code written by customers using Cloudflare Workers.

From clicking a button in the dashboard or making an API call to change configuration to that change coming into effect takes seconds, globally. Customers have come to love this high speed configurability. And with Workers they expect near instant, global software deployment. On average Quicksilver distributes about 350 changes per second.

And Quicksilver is very fast.  On average we hit a p99 of 2.29s for a change to be distributed to every machine worldwide. Usually, this speed is a great thing. It means that when you enable a feature or purge your cache you know that it’ll be live globally nearly instantly. When you push code with Cloudflare Workers it’s pushed out a the same speed. This is part of the promise of Cloudflare fast updates when you need them.

However, in this case, that speed meant that a change to the rules went global in seconds. You may notice that the WAF code uses Lua. Cloudflare makes use of Lua extensively in production and details of the Lua in the WAF have been discussed before. The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression. More on that and what we’re doing about it below.

Everything that occurred up to the point the rules were deployed was done “correctly”: a pull request was raised, it was approved, CI/CD built the code and tested it, a change request was submitted with an SOP detailing rollout and rollback, and the rollout was executed.

### What went wrong

As noted, we deploy dozens of new rules to the WAF every week, and we have numerous systems in place to prevent any negative impact of that deployment. So when things do go wrong, it’s generally the unlikely convergence of multiple causes. Getting to a single root cause, while satisfying, may obscure the reality. Here are the multiple vulnerabilities that converged to get to the point where Cloudflare’s service for HTTP/HTTPS went offline.

1. An engineer wrote a regular expression that could easily backtrack enormously.
2. A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
3. The regular expression engine being used didn’t have complexity guarantees.
4. The test suite didn’t have a way of identifying excessive CPU consumption.
5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
6. The rollback plan required running the complete WAF build twice taking too long.
7. The first alert for the global traffic drop took too long to fire.
8. We didn’t update our status page quickly enough.
9. We had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on.
11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.

### What’s happened since last Tuesday

Firstly, we stopped all release work on the WAF completely and are doing the following:

1. Re-introduce the excessive CPU usage protection that got removed. (Done)
2. Manually inspecting all 3,868 rules in the WAF Managed Rules to find and correct any other instances of possible excessive backtracking. (Inspection complete)
3. Introduce performance profiling for all rules to the test suite. (ETA:  July 19)
4. Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)
5. Changing the SOP to do staged rollouts of rules in the same manner used for other software at Cloudflare while retaining the ability to do emergency global deployment for active attacks.
6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare’s edge.
7. Automating update of the Cloudflare Status page.

In the longer term we are moving away from the Lua WAF that I wrote years ago. We are porting the WAF to use the new firewall engine. This will make the WAF both faster and add yet another layer of protection.

### Conclusion

This was an upsetting outage for our customers and for the team. We responded quickly to correct the situation and are correcting the process deficiencies that allowed the outage to occur and going deeper to protect against any further possible problems with the way we use regular expressions by replacing the underlying technology used.

We are ashamed of the outage and sorry for the impact on our customers. We believe the changes we’ve made mean such an outage will never recur.

### Appendix: About Regular Expression Backtracking

To fully understand how (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))  caused CPU exhaustion you need to understand a little about how a standard regular expression engine works. The critical part is .*(?:.*=.*). The (?: and matching ) are a non-capturing group (i.e. the expression inside the parentheses is grouped together as a single expression).

For the purposes of the discussion of why this pattern causes CPU exhaustion we can safely ignore it and treat the pattern as .*.*=.*. When reduced to this, the pattern obviously looks unnecessarily complex; but what’s important is any “real-world” expression (like the complex ones in our WAF rules) that ask the engine to “match anything followed by anything” can lead to catastrophic backtracking. Here’s why.

In a regular expression, . means match a single character, .* means match zero or more characters greedily (i.e. match as much as possible) so .*.*=.* means match zero or more characters, then match zero or more characters, then find a literal = sign, then match zero or more characters.

Consider the test string x=x. This will match the expression .*.*=.*. The .*.* before the equal can match the first x (one of the .* matches the x, the other matches zero characters). The .* after the = matches the final x.

It takes 23 steps for this match to happen. The first .* in .*.*=.* acts greedily and matches the entire x=x string. The engine moves on to consider the next .*. There are no more characters left to match so the second .* matches zero characters (that’s allowed). Then the engine moves on to the =. As there are no characters left to match (the first .* having consumed all of x=x) the match fails.

At this point the regular expression engine backtracks. It returns to the first .* and matches it against x= (instead of x=x) and then moves onto the second .*. That .* matches the second x and now there are no more characters left to match. So when the engine tries to match the = in .*.*=.* the match fails. The engine backtracks again.

This time it backtracks so that the first .* is still matching x= but the second .* no longer matches x; it matches zero characters. The engine then moves on to try to find the literal = in the .*.*=.* pattern but it fails (because it was already matched against the first .*). The engine backtracks again.

This time the first .* matches just the first x. But the second .* acts greedily and matches =x. You can see what’s coming. When it tries to match the literal = it fails and backtracks again.

The first .* still matches just the first x. Now the second .* matches just =. But, you guessed it, the engine can’t match the literal = because the second .* matched it. So the engine backtracks again. Remember, this is all to match a three character string.

Finally, with the first .* matching just the first x, the second .* matching zero characters the engine is able to match the literal = in the expression with the = in the string. It moves on and the final .* matches the final x.

23 steps to match x=x. Here’s a short video of that using the Perl Regexp::Debugger showing the steps and backtracking as they occur.

That’s a lot of work but what happens if the string is changed from x=x to x=xx? This time is takes 33 steps to match. And if the input is x=xxx it takes 45. That’s not linear. Here’s a chart showing matching from x=x to x=xxxxxxxxxxxxxxxxxxxx (20 x’s after the =). With 20 x’s after the = the engine takes 555 steps to match! (Worse, if the x= was missing, so the string was just 20 x’s, the engine would take 4,067 steps to find the pattern doesn’t match).

This video shows all the backtracking necessary to match x=xxxxxxxxxxxxxxxxxxxx:

That’s bad because as the input size goes up the match time goes up super-linearly. But things could have been even worse with a slightly different regular expression. Suppose it had been .*.*=.*; (i.e. there’s a literal semicolon at the end of the pattern). This could easily have been written to try to match an expression like foo=bar;.

This time the backtracking would have been catastrophic. To match x=x takes 90 steps instead of 23. And the number of steps grows very quickly. Matching x= followed by 20 x’s takes 5,353 steps. Here’s the corresponding chart. Look carefully at the Y-axis values compared the previous chart.

To complete the picture here are all 5,353 steps of failing to match x=xxxxxxxxxxxxxxxxxxxx against .*.*=.*;

Using lazy rather than greedy matches helps control the amount of backtracking that occurs in this case. If the original expression is changed to .*?.*?=.*? then matching x=x takes 11 steps (instead of 23) and so does matching x=xxxxxxxxxxxxxxxxxxxx. That’s because the ? after the .* instructs the engine to match the smallest number of characters first before moving on.

But laziness isn’t the total solution to this backtracking behaviour. Changing the catastrophic example .*.*=.*; to .*?.*?=.*?; doesn’t change its run time at all. x=x still takes 555 steps and x= followed by 20 x’s still takes 5,353 steps.

The only real solution, short of fully re-writing the pattern to be more specific, is to move away from a regular expression engine with this backtracking mechanism. Which we are doing within the next few weeks.

The solution to this problem has been known since 1968 when Ken Thompson wrote a paper titled “Programming Techniques: Regular expression search algorithm”. The paper describes a mechanism for converting a regular expression into an NFA (non-deterministic finite automata) and then following the state transitions in the NFA using an algorithm that executes in time linear in the size of the string being matched against.

Thompson’s paper doesn’t actually talk about NFA but the linear time algorithm is clearly explained and an ALGOL-60 program that generates assembly language code for the IBM 7094 is presented. The implementation may be arcane but the idea it presents is not.

Here’s what the .*.*=.* regular expression would look like when diagrammed in a similar manner to the pictures in Thompson’s paper.

Figure 0 has five states starting at 0. There are three loops which begin with the states 1, 2 and 3. These three loops correspond to the three .* in the regular expression. The three lozenges with dots in them match a single character. The lozenge with an = sign in it matches the literal = sign. State 4 is the ending state, if reached then the regular expression has matched.

To see how such a state diagram can be used to match the regular expression .*.*=.* we’ll examine matching the string x=x. The program starts in state 0 as shown in Figure 1.

The key to making this algorithm work is that the state machine is in multiple states at the same time. The NFA will take every transition it can, simultaneously.

Even before it reads any input, it immediately transitions to both states 1 and 2 as shown in Figure 2.

Looking at Figure 2 we can see what happened when it considers  first x in x=x. The x can match the top dot by transitioning from state 1 and back to state 1. Or the x can match the dot below it by transitioning from state 2 and back to state 2.

So after matching the first x in x=x the states are still 1 and 2. It’s not possible to reach state 3 or 4 because a literal = sign is needed.

Next the algorithm considers the = in x=x. Much like the x before it, it can be matched by either of the top two loops transitioning from state 1 to state 1 or state 2 to state 2, but additionally the literal = can be matched and the algorithm can transition state 2 to state 3 (and immediately state 4). That’s illustrated in Figure 3.

Next the algorithm reaches the final x in x=x. From states 1 and 2 the same transitions are possible back to states 1 and 2. From state 3 the x can match the dot on the right and transition back to state 3.

At that point every character of x=x has been considered and because state 4 has been reached the regular expression matches that string. Each character was processed once so the algorithm was linear in the length of the input string. And no backtracking was needed.

It might also be obvious that once state 4 was reached (after x= was matched) the regular expression had matched and the algorithm could terminate without considering the final x at all.

This algorithm is linear in the size of its input.

# 2019年7月2日に発生したCloudflareの停止に関する詳細

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019-jp/

9年ほど前のCloudflareは小さな会社で、当時私は一顧客であり従業員ではありませんでした。Cloudflareがひと月前に設立されたという時期のある日、私は自分の小さなサイト、jgc.orgのDNSが動作していないという警告を受け取りました。そしてCloudflareはProtocol Buffersの使用に変更を加えた上でDNSを切断したのです。

From: John Graham-Cumming

To: Matthew Prince

ご報告ありがとうございました。何か問題があれば
ご連絡します。 技術詳細に関する全容が判明したら、

SMSを受け取れます。 監視結果では13:03:07から14:04:12までダウンしていたことが
わかりました。 テストは5分おきに実行されています。



これに対するMatthewの返信は以下の通りです。
From: Matthew Prince

To: John Graham-Cumming

ありがとうございます。Cloudflareではいただいたメールすべてに対して返信しております。私は現在
オフィスに向かっており、ブログへの投稿またはCloudflareの掲示板システムのトップに



### 7月2日の件について

7月2日、CloudflareはWAFマネージドルールに新規ルールを追加したのですが、これが世界中のCloudflareネットワーク上にあるHTTP/HTTPSトラフィックを扱う各CPUコアのCPU枯渇を引き起こしました。Cloudflareでは新たな脆弱性や脅威に対応するため、継続的にWAFマネージドルールを改善しています。たとえば5月には、WAFの更新速度を活用して深刻なSharePointの脆弱性に対する保護を行うためのルールを追加しました。迅速かつグローバルにルールをリリースできることはCloudflareのWAFにとって重要な機能です。

しかし残念ながら、先週の火曜日に行った更新に莫大なバックトラックを行いHTTP/HTTPS配信用のCPUを枯渇させるような正規表現が含まれてしまい、これによりCloudflareのコアプロキシ、CDN、WAF機能のダウンに繋がる結果となりました。次のグラフはHTTP/HTTPSトラフィックの配信を専門に行うCPUがCloudflareネットワーク内のサーバー全体で100%に近い使用量まで急上昇したことを示しています。

この結果、Cloudflareのお客様（およびお客様の顧客の方々）に対し、Cloudflareのドメイン訪問時に502エラーが表示されることとなりました。この502エラーはフロントのCloudflare Webサーバーに利用可能なCPUコアがあるにも関わらずHTTP/HTTPSトラフィックを配信するプロセスに到達できないことにより発生したものです。

Cloudflareは本件がお客様に与えた損害について認識しており、誠に忸怩たる思いでおります。本インシデントの対応中ではありますが、Cloudflareの運営自体にも悪影響が及んでおります。

また、お客様におかれましては、多大なストレス、不満、不安を感じられたことと存じます。6年間グローバルな停止がなかったこともあり、動揺はことさら大きいものでした。

CPUが枯渇した原因は、過剰にバックトラッキングを発生させる不完全な正規表現を記載した1つのWAFルールによるものでした。停止の核心となった正規表現は次の通りです。(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

### 発生内容

まず本件の流れをご説明します。本記事内に記載する時間は全て協定世界時（UTC）表記です。

13時42分、ファイアウォールチームに所属する1名のエンジニアが自動プロセスでXSSを検出するためのルールに対する小さな変更をリリースしました。そして、これに対する変更申請チケットが作成されました。Cloudflareではこのようなチケットの管理にJiraを使用しておりますが、以下はそのスクリーンショットです。

3分後、1つ目のPagerDutyページがWAFの異常を表示して停止しました。これはCloudflare外からWAFの機能を確認する模擬テストで（このようなテストは数百とあります）、正常動作を確認するためのものでした。そしてすぐにCloudflareサービスのエンドツーエンドテストの失敗、グローバルなトラフィック低下アラート、502エラーの蔓延がページに表示され、世界各都市のPoint of Presence（PoP）からCPU枯渇に関する報告を多数受けました。

これらのアラートの一部を受け取った私が会議を飛び出して自分のデスクに戻ると、ソリューションエンジニアグループのリーダーにCloudflareのトラフィックのうち80％がロストしているという報告を受けました。そこで私は事態に対するデバッグを行っているSREへ向かいました。停止の初期段階では、これまでにない種類の攻撃なのではないかという推測がありました。

CloudflareのSREチームは世界中に配置されており、24時間体制で継続的に対応を行っています。このようなアラート（アラートの大部分が特定の地域の制限された範囲における非常に具体的な問題に言及しているようなもの）は内部のダッシュボードで監視されており、毎日幾度となく対応が行われています。しかしながらこのパターンのページやアラートは非常に深刻な何かが発生しているということを示していたため、SREはすぐにP0インシデントを宣言してエンジニアリーダーおよびシステムエンジニアリングへエスカレーションを行いました。

ロンドンのエンジニアリングチームはその時Cloudflareのメインイベントスペースで内部のTechTalkを聞いているところだったのですが、それを中断して全員が大会議室に集まり他の社員も電話接続しました。これはSREが単独で処理できるような通常の問題ではなく、各関連チームがオンラインで一同に会す必要があったのです。

14時00分、WAFが問題の原因コンポーネントであることを特定し、攻撃が原因である可能性は却下されました。パフォーマンスチームはマシンから稼働中のCPUデータを取得し、WAFが原因であることを明示しました。他のチームのメンバーがstraceで確認を行い、また別のチームはWAFが問題を起こしているという記載があるエラーログを発見しました。14時02分、私は全チームに対して「global kill」を行う提案をしました。これはCloudflareに組み込まれた仕組みで、世界中の単一コンポーネントを無効とするものです。

しかしWAFに対するglobal killの実行も簡単にはいきませんでした。また問題が現れたのです。Cloudflareでは自社製品を使用しているため、Accessサービスがダウンすると内部のコントロールパネルで認証することができないのです（復旧後、内部コントロールパネルをあまり使用していないメンバーはセキュリティ機能により資格情報が無効になったためアクセスできなくなっていることがわかりました）。

さらにJiraやビルドシステムのような他の内部サービスも利用できなくなりました。利用できるようにするにはあまり使っていないバイパスの仕組みを使う必要がありました（これも本件の後で検討すべき項目です）。最終的にチームメンバーがWAFのglobal killを14時07分に実行し、14時09分までに世界中のトラフィックレベルおよびCPUが想定状態にまで戻りました。その他のCloudflareの保護の仕組みは継続して運用できています。

その後我々はWAF機能の復旧に取り掛かりました。微妙な状況だったこともあり、Cloudflareの有料プランのお客様のトラフィックを退避した後で一部のトラフィックを使って1つの都市内で異常系テスト（「この変更が本当に本件の原因なのか」を確認するもの）も正常系テスト（ロールバックの動作検証）も両方実施しました。

14時52分、原因の把握および適切な箇所への修正を行ったということに100%の確信を持てたため、WAFをグローバルに再度有効化しました。

### Cloudflareの運営方法

CloudflareにはWAFマネージドルール製品を担当するエンジニアチームがあり、検出率の改善、偽陽性の低下、新たな脅威への迅速な対応に継続的に取り組んでいます。直近60日では、WAFマネージドルールに対する476件の変更申請を処理しました（平均すると3時間ごとに1件のペースです）。

このような変更は「シミュレート」モードにリリースされます。このモードでは実際のカスタマートラフィックに対してルールが実行されますが何もブロックされません。Cloudflareではこのシミュレートモードを使ってルールの有効性をテストし、偽陽性および偽陰性の比率を測定しています。しかし、シミュレートモードでもルールを実際に実行する必要があり、今回の場合はそのルール内に過度にCPUを消費する正規表現が記載されていました。

ソフトウェアのリリース手順は次の通りです。Cloudflareでは内部的にBitBucket経由のgitを使用しています。変更を行ったエンジニアがTeamCityでビルドしたコードをプッシュし、ビルドがパスするとレビュアーが割り当てられます。プルリクエストが承認されるとコードがビルドされ、テストスイートが（再度）実行されます。

ビルドとテストが通ったらJiraの変更申請が作成され、関連する管理者または技術リーダーが変更承認を行います。承認されると「アニマルPoP」と呼ばれる場所へのリリースが行われます。アニマルPoPにはDOG、PIG、カナリアがあります。

DOG PoPは（世界の他の場所と同様）CloudflareのPoPですが、Cloudflareの社員のみが利用するものです。この試験運用版のPoPではお客様のトラフィックがコードに接触する前に問題を早期発見することができ、実際頻繁に検出されています。

DOGテストが正常に完了するとコードはPIG（「実験」目的）に移動します。PIGは無料プランのうちごく一部のお客様のトラフィックが新規コードを通過するようになっているCloudflareのPoPです。

ここでも正常であれば、コードは「カナリア」へ移動します。Cloudflareには世界中に3つのカナリアPoPがあり、有料／無料プランのお客様のトラフィックを新規コード上で実行してエラーの最終チェックを行っています。

カナリアで正常に動作すると、コードの公開ができるようになります。DOG、PIG、カナリア、グローバル手順の完了には、コード変更の種類にもよりますが数時間から数日ほどかかります。Cloudflareのネットワークやお客様が多様であるおかげで、Cloudflareではリリース内容を世界中の全てのお客様に公開する前に徹底的にコードをテストすることができるのです。しかし、設計上WAFにはこの手順を採用していません。それは脅威に迅速に対応する必要があるからです。

### WAFの脅威

5月にSharePointの脆弱性に対する保護を展開した件はCloudflareが事前にこのような保護を提供できた好例です（ブログはこちら）。発表の公表から間もなく、Cloudflareはお客様のSharePointインストールを悪用しようとする動きが急増したことを確認しました。Cloudflareチームは日々新たな脅威を監視し、お客様のために脅威を軽減するためのルールを記載しています。

WAFマネージドルールの変更における標準的な手順には、グローバルリリース前に継続的インテグレーション（CI）テストに合格しなければならないことが記載されています。これは先週の火曜日の際にも通常通り実施され、ルールがリリースされました。13時31分、チームのエンジニアが承認済みの変更を含むプルリクエストをマージしました。

13時37分、TeamCityがルールをビルドしてテストを実行し、合格を示す緑色を表示しました。WAFテストスイートはWAFの主な機能が動作することをテストするもので、個別のマッチング機能に対する多数の単体テストで構成されています。単体テストにて個別のWAFを実行した後、WAFに対する大規模なHTTPリクエストを実行してルールをテストします。こういったHTTPリクエストはWAFでブロックすべきリクエストのテスト（攻撃を検出できることの確認）やブロックしてはいけないリクエストのテスト（必要以上にブロックしないことや偽陽性を作り出していないことの確認）向けに設計されたものです。WAFテストスイートが実施しなかったのはCPU使用量の急増テストであり、結果的に今回CPU枯渇の原因となったルールが含まれている以前のWAFビルドのログファイルにはテストスイートの実行時間に増加は見られませんでした。

そしてテストが合格し、TeamCityが自動的に13時42分時点の変更をリリースし始めました。

### Quicksilver

WAFルールは新たな脅威に対応する必要があるため、数秒で世界中に変更を適用することのできるCloudflareの分散型Key-Value Store（KVS）、Quicksilverを使用してリリースしています。この技術はCloudflareのダッシュボード内やAPI経由での設定変更時にCloudflareの全てのお客様が使用しているもので、Cloudflareが変更に対して非常に迅速に処理できる理由でもあります。

Quicksilverについてはこれまであまり言及したことがありませんでした。以前CloudflareではKyoto Tycoonを分散型Key-Value Storeとしてグローバルに採用しておりましたが、運用上の問題が発生したため独自KVSを構築して180以上の都市に複製していました。Quicksilverはお客様の設定に変更を加えたり、WAFルールを更新したり、Cloudflare Workersを使用して書いたJavaScriptコードを配信したりするための手段です。

ダッシュボードのボタンをクリックまたはAPI呼び出しを行うことで、変更内容は数秒で世界中に適用されます。お客様にはこの高速に実施できる設定を気に入って頂いておりました。Workersを利用するとほぼ瞬時にグローバルなソフトウェアリリースが行えます。平均的ではQuicksilverは1秒あたりおよそ350件の変更を配信します。

さらに、Quicksilverは非常に高速です。 平均では2.29秒で世界中のマシンへ1つの変更を配信することができます。通常、このスピードは素晴らしいことです。要するに、機能を有効にしたりキャッシュをパージしたりする際、世界中に一瞬で稼働させられるのです。Cloudflare Workersでコードをプッシュすると、同じ速度でプッシュすることができます。これは必要なときに高速で更新ができるという、Cloudflareのお約束の1つです。

しかしながら今回はこのスピードがあることでルール変更が世界中に数秒で適用されたということを意味します。また、WAFコードにLuaを採用していることにお気づきの方もいるかもしれません。Cloudflareの製品には広くLuaを採用しておりますが、WAFのLuaに関する詳細は以前ご説明した通りです。WAFのLuaでは内部的にPCREを利用しているのですが、このPCREがマッチングにバックトラッキングを採用しており正規表現の暴走から保護する手段がありません。これに関する詳細や対策を以下に説明します。

ルールがリリースされた時点までは全てが「正しく」実行されていました。プルリクエストがあがって承認され、CI/CDがコードをビルドしてテストを行い、SOPのロールアウトとロールバックを詳述したSOPと共に変更申請が提出され、ロールアウトが実行されました。

### 問題点

1. エンジニアが簡単に膨大なバックトラックをしてしまう正規表現を記述しました。
2. 数週間前に実施したWAFのリファクタリングにより、正規表現によるCPUの過度な消費を防ぐための保護が誤って削除されていました。このリファクタリングはWAFによるCPU消費を抑えるためのものでした。
3. 使用していた正規表現エンジンには複雑性の保証がありませんでした。
4. テストスイートに過度なCPU消費を特定する手段がありませんでした。
5. 段階的ロールアウトをせずに世界中の本番環境へ緊急性のないルール変更を展開できるようなSOPになっていました。
6. WAFの完全なビルドを2回実行するという時間のかかりすぎることがロールバック計画で要求されていました。
7. グローバルなトラフィックドロップに対する初めのアラートの発火が遅すぎました。
8. Cloudflareのステータスページをすぐに更新できませんでした。
9. 停止およびバイパス手順に不慣れだったため、Cloudflare内からのCloudflare独自システムへのアクセスが困難でした。
10. セキュリティ上の理由により資格情報がタイムアウトしたため、SREが一部のシステムにアクセスできませんでした。
11. Cloudflareのエッジを経由するお客様は、CloudflareのダッシュボードやAPIにアクセスできませんでした。

### 火曜日以降の動向

まず、WAF上で動作する全リリースを停止して次のことを行いました。

1. 過度のCPU利用を行う保護を取り除いた上での再導入（完了）
2. WAFマネージドルールにある全3,868件のルールを手動にて調査し、過度のバックトラッキングが発生する可能性があるその他のインスタンスを検出、修正（検査は完了）
3. 全ルールに対するパフォーマンスプロファイリングをテストスイートに導入（完了予定： 7月19日）
4. 正規表現エンジンをre2またはRustに切り替え（どちらもランタイム保証搭載）（完了予定：7月31日）
5. 進行中の攻撃に対する緊急かつグローバルなリリースを実施できる点を保持しつつ、Cloudflareの他のソフトウェアと同じ方法でルールを段階的ロールアウトするようSOPを変更
6. CloudflareのダッシュボードやAPIをCloudflareのエッジから外すための緊急機能を設置
7. Cloudflareステータスページへの更新の自動化

### 付録：正規表現のバックトラッキングについて

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) がどのようにCPU枯渇を引き起こすのかを完全に理解するには、正規表現エンジンの動作を少々理解しておく必要があります。重要なのは.*(?:.*=.*)の部分です。(?:と)は非キャプチャグループです（つまり、カッコ内の表現は1つの表現としてグルーピングされています）。

ここではCPU枯渇の原因となったパターンを説明するため、これを無視して.*.*=.*というパターンを見ていきます。ここまでシンプルにすると、このパターンが不要に複雑であることがわかります。しかし、重要なのは「全てに続く全てにマッチする」ものをエンジンに問い合わせた「実際の」表現（CloudflareのWAFルールに記載された複雑な表現のようなもの）により壊滅的なバックトラッキングを引き起こしたという点です。こちらがその理由です。

テスト文字列x=xについて考察してみましょう。これは.*.*=.*にマッチする文字列です。イコールの前にある.*.*が1つ目のxにマッチします（.*のうちの1つがxにマッチしもう一方が0文字にマッチするため）。そして=の後にある.*は最後のxにマッチします。

このマッチングに至るまでには23の手順があります。.*.*=.*にある1つ目の.*が貪欲に（greedy）動作してx=xという文字列全体にマッチします。エンジンは次の.*の考慮に移ります。マッチする文字はもうないので、2つ目の.*は0文字にマッチします（こういう場合もあります）。それからエンジンは=部分に移行します。もうマッチングすべき文字が残っていないので（はじめの.*部分でx=xの全てにマッチしているので）マッチングは失敗します。

ここで正規表現エンジンがバックトラックします。エンジンは1つ目の.*に戻り（x=xではなく）x=とマッチし、それから2つ目の.*に移ります。.*が2つ目のxにマッチするので残りの文字はありません。そこでエンジンが=を.*.*=.*とマッチさせようとするとそのマッチングは失敗します。エンジンはまたバックトラックします。

1つ目の.*は1つ目のxとマッチします。そして今回は2つ目の.*が=とのみマッチします。しかし、ご想像どおりエンジンは=にマッチしません。2つ目の.*で既にマッチしているからです。そこでまたバックトラックを行います。ここで思い出していただきたいのですが、これは全て3文字の文字列のマッチングにかかる手順なのです。

これがx=xにマッチするまでの23の手順です。こちらはPerlのRegexp::Debuggerを使って発生したバックトラッキングの手順を説明した短いビデオです。

これでも作業量が多いのですが、もし文字列がx=xからx=xxに変わったらどうなるでしょうか？この場合のマッチング手順は33です。さらに、入力がx=xxxとなると手順は45になります。直線的な増え方ではありません。ここにx=xからx=xxxxxxxxxxxxxxxxxxxx（=の後のxが20個）までのマッチングを示したグラフがあります。=の後のxが20になると、エンジンのマッチングには555もの手順がかかります。（さらに悪いことに、x=の部分がなく文字列が20個のxだけになった場合、マッチしないパターンを探す手順は4,067になります。）

このビデオではx=xxxxxxxxxxxxxxxxxxxxのマッチングに必要なバックトラッキングを示しています。

この場合のバックトラッキングは最悪です。x=xのマッチには23ではなく90手順もかかります。手順の増加は非常に劇的です。x=の後に20個のxがある場合のマッチングにかかる手順は5,353にも及びます。こちらがそのグラフです。Y軸の値を前回のグラフと比べてみてください。

こちらの画像ではx=xxxxxxxxxxxxxxxxxxxxを.*.*=.*;にマッチさせようとして失敗するまでの全5,353手順を表示しています。

GreedyマッチではなくLazyマッチを用いると、この場合のバックトラッキング数を制限することができます。元の表現を.*?.*?=.*?に変更するとx=xのマッチングは（23手順から）11手順になり、x=xxxxxxxxxxxxxxxxxxxxの場合も同様となります。これは.*の後にある?がエンジンに、移動する前に最小文字数とマッチするよう指示するためです。

しかし、Lazyマッチがこのバックトラッキング行為に対する完全な解決策ではありません。.*.*=.*;という最悪の例を.*?.*?=.*?;に変えても実行回数は全く変わりません。x=xの所要手順は555で、x=の後に20個のxが続く場合の手順数も5,353のままです。

（パターンを完全に書き直してより具体的に記述する以外で）唯一真の解決策となるのが、正規表現エンジンをバックトラッキングの仕組みから退避することです。これは今後数週間で取り組んでいきます。

この問題に対する解決策は1968年のKen Thompson氏による「Programming Techniques:Regular expression search algorithm（プログラミングアルゴリズム：正規表現の検索アルゴリズム）」という論文で知られているものです。この論文では正規表現をNFA（非決定性有限オートマトン）に変換し、その後照合する文字列のサイズで時間線形的に実行するアルゴリズムを用いてNFAの状態遷移をするメカニズムについて説明しています。

この論文で実際にNFAに関する記述があるわけではありませんが、線形時間アルゴリズムに関しては明確に説明されており、IBM 7094用のアセンブリ言語のコードを生成するALGOL60プログラムが提示されています。その実装は難解なものですが、考え方はさほどではありません。

これは.*.*=.*という正規表現をThompson氏の論文の図と同じ形式で図式化したものです。

このような状態図を使って.*.*=.*という正規表現のマッチングを行う方法を確認するため、ここではx=xを検証していきます。プログラムは図1の状態0から開始します。

このアルゴリズムの動作のキーとなるのが状態マシンは同時に複数の状態になるという点です。NFAはそれぞれの遷移を同時に行います。

x=xの1つ目のxにマッチした後でも状態は1と2のままです。状態3や4に到達できないのは、リテラル=が必要になるためです。

x=xの全文字が考察された時点で状態4に到達するため、この正規表現は文字列にマッチします。各文字が一度処理されるためこのアルゴリズムは入力文字列の長さの点で線形です。さらに、バックトラッキングも必要ありませんでした。

（x=にマッチした後で）一度状態4に到達したらその正規表現はマッチしアルゴリズムは最後のxを全く考察することなく中止になります。

このアルゴリズムは入力サイズの点において線形です。

タグ事後検討,停止,Deep Dive

# Détails de la panne Cloudflare du 2 juillet 2019

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019-fr/

Il y a près de neuf ans, Cloudflare était une toute petite entreprise dont j’étais le client, et non l’employé. Cloudflare était sorti depuis un mois et un jour, une notification m’alerte que mon petit site,  jgc.org, semblait ne plus disposer d’un DNS fonctionnel. Cloudflare avait effectué une modification dans l’utilisation de Protocol Buffers qui avait endommagé le DNS.

J’ai contacté directement Matthew Prince avec un e-mail intitulé « Où est mon DNS ? » et il m’a envoyé une longue réponse technique et détaillée (vous pouvez lire tous nos échanges d’e-mails ici) à laquelle j’ai répondu :

De: John Graham-Cumming
Date: Jeudi 7 octobre 2010 à 09:14
Objet: Re: Où est mon DNS?
À: Matthew Prince

Superbe rapport, merci. Je veillerai à vous appeler s’il y a un
problème.  Il serait peut-être judicieux, à un certain moment, d’écrire tout cela dans un article de blog, lorsque vous aurez tous les détails techniques, car je pense que les gens apprécient beaucoup la franchise et l’honnêteté sur ce genre de choses. Surtout si vous y ajoutez les tableaux qui montrent l’augmentation du trafic suite à votre lancement.

Je dispose d’un système robuste de surveillance de mes sites qui m’envoie un SMS en cas de problème.  La surveillance montre une panne de 13:03:07 à 14:04:12.  Des tests sont réalisés toutes les cinq minutes.

Un accident de parcours dont vous vous relèverez certainement.  Mais ne pensez-vous pas qu’il vous faudrait quelqu’un en Europe? :-)


Ce à quoi il a répondu :

De: Matthew Prince
Date: Jeudi 7 octobre 2010 à 09:57
Objet: Re: Où est mon DNS?
À: John Graham-Cumming

Merci. Nous avons répondu à tous ceux qui nous ont contacté. Je suis en route vers le bureau et nous allons publier un article sur le blog ou épingler un post officiel en haut de notre panneau d’affichage. Je suis parfaitement d’accord, la transparence est nécessaire.


Aujourd’hui, en tant qu’employé d’un Cloudflare bien plus grand, je vous écris de manière transparente à propos d’une erreur que nous avons commise, de son impact et de ce que nous faisons pour régler le problème.

### Les événements du 2 juillet

Le 2 juillet, nous avons déployé une nouvelle règle dans nos règles gérées du pare-feu applicatif Web (WAF) qui ont engendré un surmenage des processeurs sur chaque cœur de processeur traitant le trafic HTTP/HTTPS sur le réseau Cloudflare à travers le monde. Nous améliorons constamment les règles gérées du WAF pour répondre aux nouvelles menaces et vulnérabilités. En mai, par exemple, nous avons utilisé la vitesse avec laquelle nous pouvons mettre à jour le WAF pour appliquer une règle et nous protéger d’une vulnérabilité SharePoint importante. Être capable de déployer rapidement et globalement des règles est une fonctionnalité essentielle de notre WAF.

Malheureusement, la mise à jour de mardi dernier contenait une expression régulière qui nous a fait reculer énormément et qui a épuisé les processeurs utilisés pour la distribution HTTP/HTTPS. Les fonctionnalités essentielles de mise en proxy, du CDN et du WAF de Cloudflare sont tombées en panne. Le graphique suivant présente les processeurs dédiés à la distribution du trafic HTTP/HTTPS atteindre presque 100 % d’utilisation sur les serveurs de notre réseau.

En conséquence, nos clients (et leurs clients) voyaient une page d’erreur 502 lorsqu’ils visitaient n’importe quel domaine Cloudflare. Les erreurs 502 étaient générées par les serveurs Web frontaux Cloudflare dont les cœurs de processeurs étaient encore disponibles mais incapables d’atteindre les processus qui distribuent le trafic HTTP/HTTPS.

Nous réalisons à quel points nos clients ont été affectés. Nous sommes terriblement gênés qu’une telle chose se soit produite. L’impact a été négatif également pour nos propres activités lorsque nous avons traité l’incident.

Cela a du être particulièrement stressant, frustrant et angoissant si vous étiez l’un de nos clients. Le plus regrettable est que nous n’avions pas eu de panne globale depuis six ans.

Le surmenage des processeurs fut causé par une seule règle WAF qui contenait une expression régulière mal écrite et qui a engendré un retour en arrière excessif. L’expression régulière au cœur de la panne est (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Bien que l’expression régulière soit intéressante pour de nombreuses personnes (et évoquée plus en détails ci-dessous), comprendre ce qui a causé la panne du service Cloudflare pendant 27 minutes est bien plus complexe « qu’une expression régulière qui a mal tourné ». Nous avons pris le temps de noter les séries d’événements qui ont engendré la panne et nous ont empêché de réagir rapidement. Et si vous souhaitez en savoir plus sur le retour en arrière d’une expression régulière et que faire dans ce cas, veuillez consulter l’annexe à la fin de cet article.

### Ce qui s’est produit

Commençons par l’ordre des événements. Toutes les heures de ce blog sont en UTC.

À 13:42, un ingénieur de l’équipe pare-feu a déployé une légère modification aux règles de détection XSS via un processus automatique. Cela a généré un ticket de requête de modification. Nous utilisons Jira pour gérer ces tickets (capture d’écran ci-dessous).

Trois minutes plus tard, la première page PagerDuty indiquait une erreur avec le pare-feu applicatif Web (WAF). C’était un test synthétique qui vérifie les fonctionnalités du WAF (nous disposons de centaines de tests de ce type) en dehors de Cloudflare pour garantir qu’il fonctionne correctement. Ce test fut rapidement suivi par des pages indiquant de nombreux échecs d’autres tests de bout-en-bout des services Cloudflare, une alerte de perte de trafic globale, des erreurs 502 généralisées, puis par de nombreux rapports de nos points de présence (PoP) dans des villes à travers le monde indiquant un surmenage des processeurs.

Préoccupé par ces alertes, j’ai quitté la réunion à laquelle j’assistais et en me dirigeant vers mon bureau, un responsable de notre groupe Ingénieur Solutions m’a dit que nous avions perdu 80 % de notre trafic. J’ai couru au département SRE où l’équipe était en train de déboguer la situation. Au tout début de la panne, certains pensaient à un type d’attaque que nous n’avions jamais connu auparavant.

L’équipe SRE de Cloudflare est répartie dans le monde entier pour assurer une couverture continue. Les alertes de ce type, dont la majorité identifient des problèmes très spécifiques sur des cadres limités et dans des secteurs précis, sont surveillées dans des tableaux de bord internes et traitées de nombreuses fois chaque jour. Cependant, ce modèle de pages et d’alertes indiquait que quelque chose de grave s’était produit, et le département SRE a immédiatement déclaré un incident P0 et transmis à la direction de l’ingénierie et à l’ingénierie des systèmes.

À ce moment, l’équipe d’ingénieurs de Londres assistait à une conférence technique interne dans notre espace principal d’événements. La discussion fut interrompue et tout le monde s’est rassemblé dans une grande salle de conférence ; les autres se sont connectés à distance. Ce n’était pas un problème normal que le département SRE pouvait régler seul ; nous avions besoin que toutes les équipes appropriées soient en ligne au même moment.

À 14:00, nous avons identifié que le composant à l’origine du problème était le WAF, et nous avons écarté la possibilité d’une attaque. L’équipe Performances a extrait les données en direct du processeur d’une machine qui montrait clairement que le WAF était responsable. Un autre membre de l’équipe a utilisé Strace pour confirmer. Une autre équipe a observé des journaux d’erreur indiquant que le WAF présentait un problème. À 14:02, toute l’équipe m’a regardé : la proposition était d’utiliser un « global kill », un mécanisme intégré à Cloudflare pour désactiver un composant unique partout dans le monde.

Mais utiliser un « global kill » pour le WAF, c’était une toute autre histoire. Nous avons rencontré plusieurs obstacles. Nous utilisons nos propres produits et avec la panne de notre service Access, nous ne pouvions plus authentifier notre panneau de contrôle interne (une fois de retour en ligne, nous avons découvert que certains membres de l’équipe avaient perdu leur accès à cause d’une fonctionnalité de sécurité qui désactive les identifiants s’ils n’utilisent pas régulièrement le panneau de contrôle interne).

Et nous ne pouvions pas atteindre les autres services internes comme Jira ou le moteur de production. Pour les atteindre, nous avons du employer un mécanisme de contournement très rarement utilisé (un autre point à creuser après l’événement). Finalement, un membre de l’équipe a exécuté le « global kill » du WAF à 14:07 ; à 14:09, les niveaux du trafic et des processeurs sont revenus à la normale à travers le monde. Le reste des mécanismes de protection de Cloudflare a continué de fonctionner.

Nous sommes ensuite passés à la restauration de la fonctionnalité WAF. Le côté sensible de la situation nous a poussé à réaliser des tests négatifs (nous demander « est-ce réellement ce changement particulier qui a causé le problème ? ») et des tests positifs (vérifier le fonctionnement du retour) dans une ville unique à l’aide d’un sous-ensemble de trafic après avoir retiré de cet emplacement le trafic de nos clients d’offres payantes.

À 14:52, nous étions certains à 100 % d’avoir compris la cause, résolu le problème et que le WAF était réactivé partout dans le monde.

### Fonctionnement de Cloudflare

Cloudflare dispose d’une équipe d’ingénieurs dédiée à notre solution de règles gérées du WAF ; ils travaillent constamment pour améliorer les taux de détection, réduire les faux positifs et répondre rapidement aux nouvelles menaces dès qu’elles apparaissent. Dans les 60 derniers jours, 476 requêtes de modifications ont été traitées pour les règles gérées du WAF (en moyenne une toutes les 3 heures).

Cette modification spécifique fut déployée en mode « simulation » où le trafic client passe au travers de la règle mais rien n’est bloqué. Nous utilisons ce mode pour tester l’efficacité d’une règle et mesurer son taux de faux positifs et de faux négatifs. Cependant, même en mode simulation, les règles doivent être exécutées, et dans cette situation, la règle contenait une expression régulière qui consommait trop de processeur.

On peut observer dans la requête de modification ci-dessus un plan de déploiement, un plan de retour et un lien vers la procédure opérationnelle standard (POS) interne pour ce type de déploiement. La POS pour une modification de règle lui permet d’être appliquée à l’échelle mondiale. Ce processus est très différent de tous les logiciels que nous publions chez Cloudflare, où la POS applique d’abord le logiciel au réseau interne de dogfooding d’un point de présence (PoP) (au travers duquel passent nos employés), puis à quelques clients sur un emplacement isolé, puis à un plus grand nombre de clients et finalement au monde entier.

Le processus pour le lancement d’un logiciel ressemble à cela : Nous utilisons Git en interne via BitBucket. Les ingénieurs qui travaillent sur les modifications intègrent du code créé par TeamCity ; les examinateurs sont désignés une fois la construction validée. Lorsqu’une requête d’extraction est approuvée, le code est créé et la série de tests exécutée (à nouveau).

Si la construction et les tests sont validés, une requête de modification Jira est générée et la modification doit être approuvée par le responsable approprié ou la direction technique. Une fois approuvée, le déploiement de ce que l’on appelle les « PDP animaux » survient : DOG, PIG et les Canaries

Le PoP DOG est un PoP Cloudflare (au même titre que toutes nos villes à travers le monde) mais qui n’est utilisé que par les employés Cloudflare. Ce PoP de dogfooding nous permet d’identifier les problèmes avant que le trafic client n’atteigne le code. Et il le fait régulièrement.

Si le test DOG réussit, le code est transféré vers PIG (en référence au « cobaye » ou « cochon d’Inde »). C’est un PoP Cloudflare sur lequel un petit sous-ensemble de trafic client provenant de clients de l’offre gratuite passe au travers du nouveau code.

Si le test réussit, le code est transféré vers les Canaries. Nous disposons de trois PoP Canaries répartis dans le monde entier ; nous exécutons du trafic client des offres payantes et gratuite qui traverse ces points sur le nouveau code afin de vérifier une dernière fois qu’il n’y a pas d’erreur.

Une fois validé aux Canaries, le code peut être déployé. L’intégralité du processus DOG, PIG, Canari, Global peut prendre plusieurs heures ou plusieurs jours en fonction du type de modification de code. La diversité du réseau et des clients de Cloudflare nous permet de tester rigoureusement le code avant d’appliquer un lancement à tous nos clients partout dans le monde. Cependant, le WAF n’est pas conçu pour utiliser ce processus car il doit répondre rapidement aux menaces.

### Menaces WAF

Ces derniers années, nous avons observé une augmentation considérable des vulnérabilités dans les applications communes. C’est une conséquence directe de la disponibilité augmentée des outils de test de logiciels, comme par exemple le fuzzing (nous venons de publier un nouveau blog sur le fuzzing ici).

Le plus souvent, une preuve de concept (PoC, Proof of Concept) est créée et publiée rapidement sur Github, afin que les équipes en charge de l’exécution et de la maintenance des applications puissent effectuer des tests et vérifier qu’ils disposent des protections adéquates. Il est donc essentiel que Cloudflare soit capable de réagir aussi vite que possible aux nouvelles attaques pour offrir à nos clients une chance de corriger leur logiciel.

Le déploiement de nos protections contre la vulnérabilité SharePoint au mois de mai illustre parfaitement la façon dont Cloudflare est capable d’offrir une protection de manière proactive (blog ici). Peu de temps après la médiatisation de nos annonces, nous avons observé un énorme pic de tentatives visant à exploiter les installations Sharepoint de notre client. Notre équipe surveille constamment les nouvelles menaces et rédige des règles pour les atténuer pour le compte de nos clients.

La règle spécifique qui a engendré la panne de mardi dernier ciblait les attaques Cross-Site Scripting (XSS). Ce type d’attaque a aussi considérablement augmenté ces dernières années.

La procédure standard pour une modification des règles gérées du WAF indique que les tests d’intégration continue (CI) doivent être validés avant un déploiement à l’échelle mondiale. Cela s’est passé normalement mardi dernier et les règles ont été déployées. À 13:31, un ingénieur de l’équipe fusionnait une requête d’extraction contenant la modification déjà approuvée.

À 13:37, TeamCity a construit les règles et exécuté les tests puis donné son feu vert. La série de tests WAF vérifie que les fonctionnalités principales du WAF fonctionnent ; c’est un vaste ensemble de tests individuels ayant chacun une fonction associée. Une fois les tests individuels réalisés, les règles individuelles du WAF sont testées en exécutant une longue série de requêtes HTTP sur le WAF. Ces requêtes HTTP sont conçues pour tester les requêtes qui doivent être bloquées par le WAF (pour s’assurer qu’il identifie les attaques) et celles qui doivent être transmises (pour s’assurer qu’il ne bloque pas trop et ne crée pas de faux positifs). Il n’a pas vérifié si le WAF utilisait excessivement le processeur, et en examinant les journaux des précédentes constructions du WAF, on peut voir qu’aucune augmentation n’est observée pendant l’exécution de la série de tests avec la règle qui aurait engendré le surmenage du processeur sur notre périphérie.

Après le succès des tests, TeamCity a commencé à déployer automatiquement la modification à 13:42.

### Quicksilver

Les règles du WAF doivent répondre à de nouvelles menaces ; elles sont donc déployées à l’aide de notre base de données clé-valeur Quicksilver, capable d’appliquer des modifications à l’échelle mondiale en quelques secondes. Cette technologie est utilisée par tous nos clients pour réaliser des changements de configuration dans notre tableau de bord ou via l’API. C’est la base de notre service : répondre très rapidement aux modifications.

Nous n’avons pas vraiment abordé Quicksilver. Auparavant, nous utilisions Kyoto Tycoon comme base de données clé-valeur distribuées globalement, mais nous avons rencontré des problèmes opérationnels et rédigé notre propre base de données clé-valeur que nous avons reproduite dans plus de 180 villes. Quicksilver nous permet d’appliquer des modifications de configuration client, de mettre à jour les règles du WAF et de distribuer du code JavaScript rédigé par nos clients à l’aide de Cloudflare Workers.

Cliquer sur un bouton dans le tableau de bord, passer un appel d’API pour modifier la configuration et cette modification prendra effet en quelques secondes à l’échelle mondiale. Le clients adorent cette configurabilité haut débit. Avec Workers, ils peuvent profiter d’un déploiement logiciel global quasi instantané. En moyenne, Quicksilver distribue environ 350 modifications par seconde.

Et Quicksilver est très rapide.  En moyenne, nous atteignons un p99 de 2,29 s pour distribuer une modification sur toutes nos machines à travers le monde. En général, cette vitesse est un avantage. Lorsque vous activez une fonctionnalité ou purgez votre cache, vous êtes sûr que ce sera fait à l’échelle mondiale presque instantanément. Lorsque vous intégrez du code avec Cloudflare Workers, celui-ci est appliqué à la même vitesse. La promesse de Cloudflare est de vous offrir des mises à jour rapides quand vous en avez besoin.

Cependant, dans cette situation, cette vitesse a donc appliqué la modification des règles à l’échelle mondiale en quelques secondes. Vous avez peut-être remarqué que le code du WAF utilise Lua. Cloudflare utilise beaucoup Lua en production, et des informations sur Lua dans le WAF ont été évoquées auparavant. Le WAF Lua utilise PCRE en interne, le retour en arrière pour la correspondance, et ne dispose pas de mécanisme pour se protéger d’une expression étendue. Ci-dessous, plus d’informations sur nos actions.

Tout ce qui s’est passé jusqu’au déploiement des règles a été fait « correctement » : une requête d’extraction a été lancée, puis approuvée, CI/CD a créé le code et l’a testé, une requête de modification a été envoyée avec une procédure opérationnelle standard précisant le déploiement et le retour en arrière, puis le déploiement a été exécuté.

### Les causes du problème

Nous déployons des douzaines de nouvelles règles sur le WAF chaque semaine, et nous avons de nombreux systèmes en place pour empêcher tout impact négatif sur ce déploiement. Quand quelque chose tourne mal, c’est donc généralement la convergence improbable de causes multiples. Trouver une cause racine unique est satisfaisant mais peut obscurcir la réalité. Voici les vulnérabilités multiples qui ont convergé pour engendrer la mise hors ligne du service HTTP/HTTPS de Cloudflare.

• Un ingénieur a rédigé une expression régulière qui pouvait facilement causer un énorme retour en arrière.
• Une protection qui aurait pu empêcher l’utilisation excessive du processeur par une expression régulière a été retirée par erreur lors d’une refonte du WAF quelques semaines plus tôt (refonte dont l’objectif était de limiter l’utilisation du processeur par le WAF).
• Le moteur d’expression régulière utilisé n’avait pas de garanties de complexité.
• La série de tests ne présentait pas de moyen d’identifier une consommation excessive du processeur.
• La procédure opérationnelle standard a permis la mise en production globale d’une modification de règle non-urgente sans déploiement progressif.
• Le plan de retour en arrière nécessitait la double exécution de l’intégralité du WAF, ce qui aurait pris trop de temps.
• La première alerte de chute du trafic global a mis trop de temps à se déclencher.
• Nous n’avons pas mis à jour suffisamment rapidement notre page d’état.
• Nous avons eu des problèmes pour accéder à nos propres systèmes en raison de la panne et la procédure de contournement n’était pas bien préparée.
• Les ingénieurs fiabilité avaient perdu l’accès à certains systèmes car leurs identifiants ont expiré pour des raisons de sécurité.
• Nos clients ne pouvaient pas accéder au tableau de bord ou à l’API Cloudflare car ils passent par la périphérie Cloudflare.

### Ce qui s’est passé depuis mardi dernier

Tout d’abord, nous avons interrompu la totalité du travail de publication sur le WAF et réalisons les actions suivantes :

• Réintroduction de la protection en cas d’utilisation excessive du processeur qui avait été retirée. (Terminé)
• Inspection manuelle de toutes les 3 868 règles des règles gérées du WAF pour trouver et corriger tout autre cas d’éventuel retour en arrière excessif. (Inspection terminée)
• Introduction de profil de performances pour toutes les règles sur la série de tests. (ETA :  19 juillet)
• Passage au moteur d’expression régulière re2 ou Rust qui disposent de garanties de temps d’exécution. (ETA : 31 juillet)
• Modifier la procédure opérationnelle standard afin d’effectuer des déploiements progressifs de règles de la même manière que pour les autres logiciels chez Cloudflare tout en conservant la capacité de réaliser des déploiements globaux d’urgence pour les attaques actives.
• Mettre en place une capacité d’urgence pour retirer le tableau de bord et l’API Cloudflare de la périphérie de Cloudflare.
• Mettre à jour automatiquement la page d’état Cloudflare.

À long terme, nous nous éloignons du WAF Lua que j’ai rédigé il y a plusieurs années. Nous déplaçons le WAF pour qu’il utilise le nouveau moteur pare-feu. Cela rendra le WAF plus rapide et offrira une couche de protection supplémentaire.

### Conclusion

Ce fut une panne regrettable pour nos clients et pour l’équipe. Nous avons réagi rapidement pour régler la situation et nous réparons les défauts de processus qui ont permis à cette panne de se produire. Afin de nous protéger contre d’éventuels problèmes ultérieurs avec notre utilisation des expressions régulières, nous allons remplacer notre technologie fondamentale.

Nous sommes très gênés par cette panne et désolés de l’impact sur nos clients. Nous sommes convaincus que les changements réalisés nous permettront de ne plus jamais subir une telle panne.

### Annexe : À propos du retour en arrière d’expression régulière

Pour comprendre comment (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))  a causé le surmenage du processeur, vous devez vous familiariser avec le fonctionnement d’un moteur d’expression régulière standard. La partie critique est .*(?:.*=.*). Le(?: et les ) correspondantes sont un groupe de non-capture (l’expression entre parenthèses est regroupée dans une seule expression).

Pour identifier la raison pour laquelle ce modèle engendre le surmenage du processeur, nous pouvons l’ignorer et traiter le modèle .*.*=.*. En le réduisant, le modèle paraît excessivement complexe, mais ce qui est important, c’est que toute expression du « monde réel » (comme les expressions complexes au sein de nos règles WAF) qui demande au moteur de « faire correspondre quelque chose suivi de quelque chose » peut engendrer un retour en arrière catastrophique. Explication.

Dans une expression régulière, . signifie correspondre à un caractère unique, .* signifie correspondre à zéro ou plus de caractères abondamment (c’est à dire faire correspondre autant que possible). .*.*=.* signifie donc correspondre à zéro ou plus de caractères, puis à nouveau zéro ou plus de caractères, puis trouver un signe littéral =, puis correspondre à zéro ou plus de caractères.

Prenons la chaîne test x=x. Cela correspondra à l’expression .*.*=.*. Les .*.* précédant le signe = peuvent correspondre au premier x (l’un des .* correspond au x, l’autre correspond à zéro caractère). Le .* après le signe = correspond au x final.

Il faut 23 étapes pour atteindre cette correspondance. Le premier .* de .*.*=.* agit abondamment et correspond à la chaîne complète x=x. Le moteur avance pour examiner le .* suivant. Il ne reste aucun caractère à faire correspondre, le deuxième .* correspond donc à zéro caractère (c’est autorisé). Le moteur passe ensuite au =. La correspondance échoue car il n’y a plus de caractère à faire correspondre (le premier .* ayant consommé l’intégralité de x=x).

À ce moment, le moteur d’expression régulière retourne en arrière. Il retourne au premier  .* et le fait correspondre à x= (au lieu de x=x) et passe ensuite au deuxième .*. Ce .* correspond au deuxième x et il n’y a maintenant plus aucun caractère à associer. Quand le moteur essaie d’associer le = dans .*.*=.*, la correspondance échoue. Le moteur retourne de nouveau en arrière.

Avec ce retour en arrière, le premier .* correspond toujours à x= mais le deuxième .* ne correspond plus à x ; il correspond à zéro caractère. Le moteur avance ensuite pour essayer de trouver le signe littéral = dans le modèle .*.*=.* mais échoue (car il correspond déjà au premier .*). Le moteur retourne de nouveau en arrière.

Cette fois-ci, le premier .* correspond seulement au premier x. Mais le deuxième .* agit abondamment et correspond à =x. Vous imaginez la suite. Lorsqu’il essaie d’associer le signe littéral =, il échoue et retourne de nouveau en arrière.

Le premier .* correspond toujours seulement au premier x. Désormais, le deuxième .* correspond seulement à =. Comme vous l’avez deviné, le moteur ne peut pas associer le signe littéral = car il correspond déjà au deuxième .*. Le moteur retourne donc de nouveau en arrière. Tout cela pour associer une chaîne de trois caractères.

Enfin, avec le premier .* correspondant uniquement au premier x et le deuxième .* correspondant à zéro caractère, le moteur est capable d’associer le signe littéral = dans l’expression avec le = contenu dans la chaîne. Il avance et le dernier .* correspond au dernier x.

23 étapes pour faire correspondre x=x. Voici une courte vidéo de l’utilisation de Perl Regexp::Debugger présentant les étapes et le retour en arrière.

Cela représente beaucoup de travail, mais que se passe-t-il si la chaîne passe de x=x à x=xx ? La correspondance prend cette fois 33 étapes. Et si l’entrée est x=xxx, 45 étapes. Ce n’est pas linéaire. Voici un graphique présentant la correspondance de x=x à x=xxxxxxxxxxxxxxxxxxxx (20 x après le =). Avec 20 x après le =, le moteur requiert 555 étapes pour associer ! (Pire encore, si le x= était manquant et que la chaîne ne contenait que 20 x, le moteur nécessiterait 4 067 étapes pour réaliser que le modèle ne correspond pas).

Cette vidéo montre tout le retour en arrière nécessaire pour correspondre à  x=xxxxxxxxxxxxxxxxxxxx:

Ce n’est pas bon signe car à mesure que la taille de l’entrée augmente, le temps de correspondance augmente de manière ultra linéaire. Mais cela aurait pu être encore pire avec une expression régulière légèrement différente. Prenons l’exemple .*.*=.*; (avec un point-virgule littéral à la fin du modèle). Nous aurions facilement pu rédiger cela pour tenter d’associer une expression comme foo=bar;.

Le retour en arrière aurait été catastrophique. Associer x=x prend 90 étapes au lieu de 23. Et le nombre d’étapes augmente très rapidement. Associer x= suivi de 20 x prend 5 353 étapes. Voici le graphique correspondant : Observez attentivement les valeurs de l’axe Y par rapport au graphique précédent.

Pour terminer, voici toutes les 5 353 étapes de l’échec de la correspondance de x=xxxxxxxxxxxxxxxxxxxx avec .*.*=.*;

L’utilisation de correspondances fainéantes plutôt que gourmandes (lazy / greedy) permet de contrôler l’étendue du retour en arrière qui survient dans ce cas précis. Si l’expression originale est modifiée en .*?.*?=.*?, la correspondance de x=x prend 11 étapes (au lieu de 23), tout comme la correspondance de x=xxxxxxxxxxxxxxxxxxxx. C’est parce que le ? après le .*ordonne au moteur d’associer le plus petit nombre de caractères avant de commencer à avancer.

Mais la fainéantise n’est pas la solution totale à ce comportement de retour en arrière. Modifier l’exemple catastrophique .*.*=.*; en .*?.*?=.*?; n’affecte pas du tout son temps d’exécution. x=x prend toujours 555 étapes et x= suivi de 20 x prend toujours 5 353 étapes.

La seule vraie solution, mise à part la réécriture complète du modèle pour être plus précis, c’est de s’éloigner du moteur d’expression régulière avec ce mécanisme de retour en arrière. Ce que nous faisons au cours des prochaines semaines.

La solution à ce problème existe depuis 1968, lorsque Ken Thompson écrivait un article intitulé « Techniques de programmation : Algorithme de recherche d’expression régulière ». L’article décrit un mécanisme pour convertir une expression régulière en AFN (automate fini non-déterministe) et suivre les transitions d’état dans l’AFN à l’aide d’un algorithme exécuté en un temps linéaire de la taille de la chaîne que l’on souhaite faire correspondre.

L’article de Thompson n’évoque pas l’AFN mais l’algorithme de temps linéaire est clairement expliqué. Il présente également un programme ALGOL-60 qui génère du code de langage assembleur pour l’IBM 7094. La réalisation paraît ésotérique mais l’idée présentée ne l’est pas.

Voici à quoi ressemblerait l’expression régulière .*.*=.* si elle était schématisée d’une manière similaire aux images de l’article de Thompson.

Le schéma 0 contient 5 états commençant à 0. Il y a trois boucles qui commencent avec les états 1, 2 et 3. Ces trois boucles correspondent aux trois .* dans l’expression régulière. Les trois pastilles avec des points à l’intérieur correspondent à un caractère unique. La pastille avec le signe = à l’intérieur correspond au signe littéral =. L’état 4 est l’état final : s’il est atteint, l’expression régulière a trouvé une correspondance.

Pour voir comment un tel schéma d’état peut être utilisé pour associer l’expression régulière .*.*=.*, nous allons examiner la correspondance de la chaîne x=x. Le programme commence à l’état 0 comme indiqué dans le schéma 1.

La clé pour faire fonctionner cet algorithme est que la machine d’état soit dans plusieurs états au même moment. L’AFN prendra autant de transitions que possible simultanément.

Avant de lire une entrée, il passe immédiatement aux deux états 1 et 2 comme indiqué dans le schéma 2.

En observant le schéma 2, on peut voir ce qui s’est passé lorsqu’il considère le premier x dans x=x. Le x peut correspondre au point supérieur en faisant une transition de l’état 1 et en retournant à l’état 1. Ou le x peut correspondre au point inférieur en faisant une transition de l’état 2 et en retournant à l’état 2.

Après avoir associé le premier x dans x=x, les états restent 1 et 2. On ne peut pas atteindre l’état 3 ou 4 car il faut un signe littéral =.

L’algorithme considère ensuite le = dans x=x. Comme le x avant lui, il peut être associé par l’une des deux boucles supérieures faisant une transition de l’état 1 à l’état 1 ou de l’état 2 à l’état 2. En outre, le signe littéral = peut être associé et l’algorithme peut faire une transition de l’état 2 à l’état 3 (et directement à l’état 4). C’est illustré dans le schéma 3.

Ensuite, l’algorithme atteint le x final dans x=x. Depuis les états 1 et 2, les mêmes transitions sont possibles vers les états 1 et 2. Depuis l’état 3, le x peut correspondre au point sur la droite et retourner à l’état 3.

À ce niveau, chaque caractère de x=x a été considéré, et puisque l’état 4 a été atteint, l’expression régulière correspond à cette chaîne. Chaque caractère a été traité une fois, l’algorithme était donc linéaire avec la longueur de la chaîne saisie. Aucun retour en arrière nécessaire.

Cela peut paraître évident qu’une fois l’état 4 atteint (après la correspondance de x=), l’expression régulière correspond et l’algorithme peut terminer sans prendre en compte le x final.

Cet algorithme est linéaire avec la taille de son entrée.

# Cloudflare outage caused by bad software deploy (updated)

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-outage/

This is a short placeholder blog and will be replaced with a full post-mortem and disclosure of what happened today.

For about 30 minutes today, visitors to Cloudflare sites received 502 errors caused by a massive spike in CPU utilization on our network. This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.

This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred. Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.

### Update at 2009 UTC:

Starting at 1342 UTC today we experienced a global outage across our network that resulted in visitors to Cloudflare-proxied domains being shown 502 errors (“Bad Gateway”). The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules.

The intent of these new rules was to improve the blocking of inline JavaScript that is used in attacks. These rules were being deployed in a simulated mode where issues are identified and logged by the new rule but no customer traffic is actually blocked so that we can measure false positive rates and ensure that the new rules do not cause problems when they are deployed into full production.

Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%.

This chart shows CPU spiking in one of our PoPs:

We were seeing an unprecedented CPU exhaustion event, which was novel for us as we had not experienced global CPU exhaustion before.

We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage.

At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.

We then went on to review the offending pull request, roll back the specific rules, test the change to ensure that we were 100% certain that we had the correct fix, and re-enabled the WAF Managed Rulesets at 1452 UTC.

We recognize that an incident like this is very painful for our customers. Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.

# The deep-dive into how Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Monday

### A recap on what happened Monday

On Monday we wrote about a painful Internet wide route leak. We wrote that this should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. That blog entry came out around 19:58 UTC, just over seven hours after the route leak finished (which will we see below was around 12:39 UTC). Today we will dive into the archived routing data and analyze it. The format of the code below is meant to use simple shell commands so that any reader can follow along and, more importantly, do their own investigations on the routing tables.

This was a very public BGP route leak event. It was both reported online via many news outlets and the event’s BGP data was reported via social media as it was happening. Andree Toonk tweeted a quick list of 2,400 ASNs that were affected.

This blog contains a large number of acronyms and those are explained at the end of the blog.

### Using RIPE NCC archived data

The RIPE NCC operates a very useful archive of BGP routing. It runs collectors globally and provides an API for querying the data. More can be seen at https://stat.ripe.net/. In the world of BGP all routing is public (within the ability of anyone collecting data to have enough collections points). The archived data is very valuable for research and that’s what we will do in this blog. The site can create some very useful data visualizations.

### Dumping the RIPEstat data for this event

Presently, the RIPEstat data gets ingested around eight to twelve hours after real-time. It’s not meant to be a real-time service. The data can be queried in many ways, including a full web interface and an API. We are using the API to extract the data in a JSON format.

We are going to focus only on the Cloudflare routes that were leaked. Many other ASNs were leaked (see the tweet above); however, we want to deal with a finite data set and focus on what happened to Cloudflare’s routing. All the commands below can be run with ease on many systems. Both the scripts and the raw data file are now available on GitHub. The following was done on MacBook Pro running macOS Mojave.

First we collect 24 hours of route announcements and AS-PATH data that RIPEstat sees coming from AS13335 (Cloudflare).

$# Collect 24 hours of data - more than enough$ ASN="AS13335"
$START="2019-06-24T00:00:00"$ END="2019-06-25T00:00:00"
$ARGS="resource=${ASN}&starttime=${START}&endtime=${END}"
$URL="https://stat.ripe.net/data/bgp-updates/data.json?${ARGS}"
$# Fetch the data from RIPEstat$ curl -sS "${URL}" | jq . > 13335-routes.json$ ls -l 13335-routes.json
-rw-r--r--  1 martin  staff  339363899 Jun 25 08:47 13335-routes.json
$ That’s 340MB of data – which seems like a lot, but it contains plenty of white space and plenty of data we just don’t need. Our second task is to reduce this raw data down to just the required data – that’s timestamps, actual routes, and AS-PATH. The third item will be very useful. Note we are using jq, which can be installed on macOS with the brew package manager. $ # Extract just the times, routes, and AS-PATH
$jq -rc '.data.updates[]|.timestamp,.attrs.target_prefix,.attrs.path' < 13335-routes.json | paste - - - > 13335-listing-a.txt$ wc -l 13335-listing-a.txt
691318 13335-listing-a.txt
$ We are down to just below seven hundred thousand routing events, however, that’s not a leak, that’s everything that includes Cloudflare’s ASN (the number 13335 above). For that we need to go back to Monday’s blog and realize it was AS396531 (Allegheny Technologies) that showed up with 701 (Verizon) in the leak. Now we reduce the data further: $ # Extract the route leak 701,396531
$# AS701 is Verizon and AS396531 is Allegheny Technologies$ egrep '701,396531' < 13335-listing-a.txt > 13335-listing-b.txt
$wc -l 13335-listing-b.txt 204568 13335-listing-b.txt$


At 204 thousand data points, we are looking better. It’s still a lot of data because BGP can be very chatty if topology is changing. A route leak will cause exactly that. Now let’s see how many routes were affected:

$# Extract the actual routes affected by the route leak$ cut -f2 < 13335-listing-b.txt | sort -V -u > 13335-listing-c.txt
$wc -l 13335-listing-c.txt 101 13335-listing-c.txt$


It’s a much smaller number. We now have a listing of at least 101 routes that were leaked via Verizon. This may not be the full list because route collectors like RIPEstat don’t have direct feeds from Verizon, so this data is a blended view with Verizon’s path and other paths. We can see that if we look at the AS-PATH in the above files. Please note that I had a typo in this script when this blog was first published and only 20 routes showed up because the -n vs -V option was used on sort. Now the list is correct with 101 affected routes. Please see this short article from stackoverflow to see the issue.

Here’s a partial listing of affected routes.

$cat 13335-listing-c.txt 8.39.214.0/24 8.42.245.0/24 8.44.58.0/24 ... 104.16.80.0/21 104.17.168.0/21 104.18.32.0/21 104.19.168.0/21 104.20.64.0/21 104.22.8.0/21 104.23.128.0/21 104.24.112.0/21 104.25.144.0/21 104.26.0.0/21 104.27.160.0/21 104.28.16.0/21 104.31.0.0/21 141.101.120.0/23 162.159.224.0/21 172.68.60.0/22 172.69.116.0/22 ...$


This is an interesting list, as some of these routes do not originate from Cloudflare’s network, however, they show up with AS13335 (our ASN) as the originator. For example, the 104.26.0.0/21 route is not announced from our network, but we do announce 104.26.0.0/20 (which covers that route). More importantly, we have an IRR (Internet Routing Registries) route object plus an RPKI ROA for that block. Here’s the IRR object:

route:          104.26.0.0/20
origin:         AS13335
source:         ARIN


And here’s the RPKI ROA. This ROA has Max Length set to 20, so no smaller route should be accepted.

Prefix:       104.26.0.0/20
Max Length:   /20
ASN:          13335
Trust Anchor: ARIN
Validity:     Thu, 02 Aug 2018 04:00:00 GMT - Sat, 31 Jul 2027 04:00:00 GMT
Emitted:      Thu, 02 Aug 2018 21:45:37 GMT
Parent Key:   a6e7a6b44019cf4e388766d940677599d0c492dc
Path:         rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-...


The Max Length field in an ROA says what the minimum size of an acceptable announcement is. The fact that this is a /20 route with a /20 Max Length says that a /21 (or /22 or /23 or /24) within this IP space isn’t allowed. Looking further at the route list above we get the following listing:

Route Seen            Cloudflare IRR & ROA    ROA Max Length
104.16.80.0/21    ->  104.16.80.0/20          /20
104.17.168.0/21   ->  104.17.160.0/20         /20
104.18.32.0/21    ->  104.18.32.0/20          /20
104.19.168.0/21   ->  104.19.160.0/20         /20
104.20.64.0/21    ->  104.20.64.0/20          /20
104.22.8.0/21     ->  104.22.0.0/20           /20
104.23.128.0/21   ->  104.23.128.0/20         /20
104.24.112.0/21   ->  104.24.112.0/20         /20
104.25.144.0/21   ->  104.25.144.0/20         /20
104.26.0.0/21     ->  104.26.0.0/20           /20
104.27.160.0/21   ->  104.27.160.0/20         /20
104.28.16.0/21    ->  104.28.16.0/20          /20
104.31.0.0/21     ->  104.31.0.0/20           /20


So how did all these /21’s show up? That’s where we dive into the world of BGP route optimization systems and their propensity to synthesize routes that should not exist. If those routes leak (and it’s very clear after this week that they can), all hell breaks loose. That can be compounded when not one, but two ISPs allow invalid routes to be propagated outside their autonomous network. We will explore the AS-PATH further down this blog.

More than 20 years ago, RFC1997 added the concept of communities to BGP. Communities are a way of tagging or grouping route advertisements. Communities are often used to label routes so that specific handling policies can be applied. RFC1997 includes a small number of universal well-known communities. One of these is the NO_EXPORT community, which has the following specification:

    All routes received carrying a communities attribute
containing this value MUST NOT be advertised outside a BGP
confederation boundary (a stand-alone autonomous system that
is not part of a confederation should be considered a
confederation itself).


The use of the NO_EXPORT community is very common within BGP enabled networks and is a community tag that would have helped alleviate this route leak immensely.

How BGP route optimization systems work (or don’t work in this case) can be a subject for a whole other blog entry.

### Timing of the route leak

As we saved away the timestamps in the JSON file and in the text files, we can confirm the time for every route in the route leak by looking at the first and the last timestamp of a route in the data. We saved data from 00:00:00 UTC until 00:00:00 the next day, so we know we have covered the period of the route leak. We write a script that checks the first and last entry for every route and report the information sorted by start time:

$# Extract the timing of the route leak$ while read cidr
do
echo $cidr fgrep$cidr < 13335-listing-b.txt | head -1 | cut -f1
fgrep $cidr < 13335-listing-b.txt | tail -1 | cut -f1 done < 13335-listing-c.txt |\ paste - - - | sort -k2,3 | column -t | sed -e 's/2019-06-24T//g' ... 104.25.144.0/21 10:34:25 12:38:54 104.22.8.0/21 10:34:27 12:29:39 104.20.64.0/21 10:34:27 12:30:00 104.23.128.0/21 10:34:27 12:30:34 141.101.120.0/23 10:34:27 12:30:39 162.159.224.0/21 10:34:27 12:30:39 104.18.32.0/21 10:34:29 12:30:34 104.24.112.0/21 10:34:29 12:30:34 104.27.160.0/21 10:34:29 12:30:34 104.28.16.0/21 10:34:29 12:30:34 104.31.0.0/21 10:34:29 12:30:34 8.39.214.0/24 10:34:31 12:19:24 104.26.0.0/21 10:34:36 12:29:53 172.68.60.0/22 10:34:38 12:19:24 172.69.116.0/22 10:34:38 12:19:24 8.44.58.0/24 10:34:38 12:19:24 8.42.245.0/24 11:52:49 11:53:19 104.17.168.0/21 12:00:13 12:29:34 104.16.80.0/21 12:00:13 12:30:00 104.19.168.0/21 12:09:39 12:29:34 ...$


Now we know the times. The route leak started at 10:34:25 UTC (just before lunchtime London time) on 2019-06-24 and ended at 12:38:54 UTC. That’s a hair over two hours. Here’s that same time data in a graphical form showing the near-instant start of the event and the duration of each route leaked:

We can also go back to RIPEstat and look at the activity graph for Cloudflare’s AS13335 network:

Clearly between 10:30 UTC and 12:40 UTC there’s a lot of route activity – far more than normal.

Note that as we mentioned above, RIPEstat doesn’t get a full view of Verizon’s network routing and hence some of the propagated routes won’t show up.

### Drilling down on the AS-PATH part of the data

Having the routes is useful, but now we want to look at the paths of these leaked routes to see which ASNs are involved. We knew the offending ASNs during the route leak on Monday. Now we want to dig deeper using the archived data. This allows us to see the extent and reach of this route leak.

$# Extract the AS-PATH of the route leak$ # Use the list of routes to extract the full AS-PATH
$# Merge the results together to show an amalgamation of paths.$ # We know (luckily) the last few ASNs in the AS-PATH are consistent
$cut -f3 < 13335-listing-b.txt | tr -d '[]' |\ awk '{ n=split($0, a, ",");
printf "%50s\n",
a[n-5] "_" a[n-4] "_" a[n-3] "_" a[n-2] "_" a[n-1] "_" a[n];
}' | sort -u
174_701_396531_33154_3356_13335
2497_701_396531_33154_174_13335
577_701_396531_33154_3356_13335
6939_701_396531_33154_174_13335
1239_701_396531_33154_3356_13335
1273_701_396531_33154_3356_13335
1280_701_396531_33154_3356_13335
2497_701_396531_33154_3356_13335
2516_701_396531_33154_3356_13335
3320_701_396531_33154_3356_13335
3491_701_396531_33154_3356_13335
4134_701_396531_33154_3356_13335
4637_701_396531_33154_3356_13335
6453_701_396531_33154_3356_13335
6461_701_396531_33154_3356_13335
6762_701_396531_33154_3356_13335
6830_701_396531_33154_3356_13335
6939_701_396531_33154_3356_13335
7738_701_396531_33154_3356_13335
12956_701_396531_33154_3356_13335
17639_701_396531_33154_3356_13335
23148_701_396531_33154_3356_13335
$ This script clearly shows the AS-PATH of the leaked routes. It’s very consistent. Reading from the back of the line to the front, we have 13335 (Cloudflare), 3356 or 174 (Level3/CenturyLink or Cogent – both tier 1 transit providers for Cloudflare). So far, so good. Then we have 33154 (DQE Communications) and 396531 (Allegheny Technologies Inc) which is still technically not a leak, but trending that way. The reason why we can state this is still technically not a leak is because we don’t know the relationship between those two ASs. It’s possible they have a mutual transit agreement between them. That’s up to them. Back to the AS-PATH’s. Still reading leftwards, we see 701 (Verizon), which is very very bad and clear evidence of a leak. It’s a leak for two reasons. First, this matches the path when a transit is leaking a non-customer route from a customer. This Verizon customer does not have 13335 (Cloudflare) listed as a customer. Second, the route contains within its path a tier 1 ASN. This is the point where a route leak should have been absolutely squashed by filtering on the customer BGP session. Beyond this point there be dragons. And dragons there be! Everything above is about how Verizon filtered (or didn’t filter) its customer. What follows the 701 (i.e the number to the left of it) is the peers or customers of Verizon that have accepted these leaked routes. They are mainly other tier 1 networks of Verizon in this list: 174 (Cogent), 1239 (Sprint), 1273 (Vodafone), 3320 (DTAG), 3491 (PCCW), 6461 (Zayo), 6762 (Telecom Italia), etc. What’s missing from that list are three networks worthy of mentioning – 1299 (Telia), 2914 (NTT), and 7018 (AT&T). All three implement a very simple AS-PATH filter which saved the day for their network. They do not allow one tier 1 ISP to send them a route which has another tier 1 further down the path. That’s because when that happens, it’s officially a leak as each tier 1 is fully connected to all other tier 1’s (which is part of the definition of a tier 1 network). The topology of the Internet’s global BGP routing tables simply states that if you see another tier 1 in the path, then it’s a bad route and it should be filtered away. Additionally we know that 7018 (AT&T) operates a network which drops RPKI invalids. Because Cloudflare routes are RPKI signed, this also means that AT&T would have dropped these routes when it receives them from Verizon. This shows a clear win for RPKI (and for AT&T when you see their bandwidth graph below)! That all said, keep in mind we are still talking about routes that Cloudflare didn’t announce. They all came from the route optimizer. ### What should 701 Verizon network accept from their customer 396531? This is a great question to ask. Normally we would look at the IRR (Internet Routing Registries) to see what policy an ASN wants for it’s routes. $ whois -h whois.radb.net AS396531 ; the Verizon customer
%  No entries found for the selected source(s).
$whois -h whois.radb.net AS33154 ; the downstream of that customer % No entries found for the selected source(s).$


That’s enough to say that we should not be seeing this ASN anywhere on the Internet, however, we should go further into checking this. As we know the ASN of the network, we can search for any routes that are listed for that ASN. We find one:

$whois -h whois.radb.net ' -i origin AS396531' | egrep '^route|^origin|^mnt-by|^source' route: 192.92.159.0/24 origin: AS396531 mnt-by: MNT-DCNSL source: ARIN$


More importantly, now we have a maintainer (the owner of the routing registry entries). We can see what else is there for that network and we are specifically looking for this:

$whois -h whois.radb.net ' -i mnt-by -T as-set MNT-DCNSL' | egrep '^as-set|^members|^mnt-by|^source' as-set: AS-DQECUST members: AS4130, AS5050, AS11199, AS11360, AS12017, AS14088, AS14162, AS14740, AS15327, AS16821, AS18891, AS19749, AS20326, AS21764, AS26059, AS26257, AS26461, AS27223, AS30168, AS32634, AS33039, AS33154, AS33345, AS33358, AS33504, AS33726, AS40549, AS40794, AS54552, AS54559, AS54822, AS393456, AS395440, AS396531, AS15204, AS54119, AS62984, AS13659, AS54934, AS18572, AS397284 mnt-by: MNT-DCNSL source: ARIN$


This object is important. It lists all the downstream ASNs that this network is expected to announce to the world. It does not contain Cloudflare’s ASN (or any of the leaked ASNs). Clearly this as-set was not used for any BGP filtering.

Just for completeness the same exercise can be done for the other ASN (the downstream of the customer of Verizon). In this case, we just searched for the maintainer object (as there are plenty of route and route6 objects listed).

$whois -h whois.radb.net ' -i origin AS33154' | egrep '^mnt-by' | sort -u mnt-by: MNT-DCNSL mnt-by: MAINT-AS3257 mnt-by: MAINT-AS5050$


None of these maintainers are directly related to 33154 (DQE Communications). They have been created by other parties and hence they become a dead-end in that search.

It’s worth doing a secondary search to see if any as-set object exists with 33154 or 396531 included. We turned to the most excellent IRR Explorer website run by NLNOG. It provides deep insight into the routing registry data. We did a simple search for 33154 using http://irrexplorer.nlnog.net/search/33154 and we found these as-set objects.

It’s interesting to see this ASN listed in other as-set’s but none are important or related to Monday’s route-leak. Next we looked at 396531

This shows that there’s nowhere else we need to check. AS-DQECUST is the as-set macro that controls (or should control) filtering for any transit provider of their network.

The summary of all the investigation is a solid statement that no Cloudflare routes or ASNs are listed anywhere within the routing registry data for the customer of Verizon. As there were 2,300 ASNs listed in the tweet above, we can conclusively state no filtering was in place and hence this route leak went on its way unabated.

### IPv6? Where is the IPv6 route leak?

In what could be considered the only plus from Monday’s route leak, we can confirm that there was no route leak within IPv6 space. Why?

It turns out that 396531 (Allegheny Technologies Inc) is a network without IPv6 enabled. Normally you would hear Cloudflare chastise anyone that’s yet to enable IPv6, however, in this case we are quite happy that one of the two protocol families survived. IPv6 was stable during this route leak, which now can be called an IPv4-only route leak.

Yet that’s not really the whole story. Let’s look at the percentage of traffic Cloudflare sends Verizon that’s IPv6 (vs IPv4). Normally the IPv4/IPv6 percentage holds steady.

This uptick in IPv6 traffic could be the direct result of Happy Eyeballs on mobile handsets picking a working IPv6 path into Cloudflare vs a non-working IPv4 path. Happy Eyeballs is meant to protect against IPv6 failure, however in this case it’s doing a wonderful job in protecting from an IPv4 failure. Yet we have to be careful with this graph because after further thought and investigation the percentage only increased because IPv4 reduces. Sometimes graphs can be misinterpreted, yet Happy Eyeballs still did a good job even as end users were being affected.

Happy Eyeballs, described in RFC8305, is a mechanism where a client (lets say a mobile device) tries to connect to a website both on IPv4 and IPv6 simultaneously. IPv6 is sometimes given a head-start. The theory is that, should a failure exist on one of the paths (sadly IPv6 is the norm), then IPv4 will save the day. Monday was a day of opposites for Verizon.

In fact, enabling IPv6 for mobile users is the one area where we can praise the Verizon network this week (or at least the Verizon mobile network), unlike the residential Verizon networks where IPv6 is almost non-existent.

### Using bandwidth graphs to confirm routing leaks and stability.

The red line is 24 June 2019 (00:00 UTC to 00:00 UTC the next day). The gray lines are previous days for comparison. This graph includes both Verizon fixed-line services like FiOS along with mobile.

The AT&T graph is quite different.

There’s no perturbation. This, along with some direct confirmation, shows that 7018 (AT&T) was not affected. This is an important point.

Going back and looking at a third tier 1 network, we can see 6762 (Telecom Italia) affected by this route leak and yet Cloudflare has a direct interconnect with them.

We will be asking Telecom Italia to improve their route filtering as we now have this data.

### Future work that could have helped on Monday

The IETF is doing work in the area of BGP path protection within the Secure Inter-Domain Routing Operations Working Group (sidrops) area. The charter of this IETF group is:

The SIDR Operations Working Group (sidrops) develops guidelines for the operation of SIDR-aware networks, and provides operational guidance on how to deploy and operate SIDR technologies in existing and new networks.

One new effort from this group should be called out to show how important the issue of route leaks like today’s event is. The draft document from Alexander Azimov et al named draft-ietf-sidrops-aspa-profile (ASPA stands for Autonomous System Provider Authorization) extends the RPKI data structures to handle BGP path information. This in ongoing work and Cloudflare and other companies are clearly interested in seeing it progress further.

However, as we said in Monday’s blog and something we should reiterate again and again: Cloudflare encourages all network operators to deploy RPKI now!

### Acronyms used in the blog

• API – Application Programming Interface
• AS-PATH – The list of ASNs that a routes has traversed so far
• ASN – Autonomous System Number – A unique number assigned for each network on the Internet
• BGP – Border Gateway Protocol (version 4) – the core routing protocol for the Internet
• IETF – Internet Engineering Task Force – an open standards organization
• IPv4 – Internet Protocol version 4
• IPv6 – Internet Protocol version 6
• IRR – Internet Routing Registries – a database of Internet route objects
• ISP – Internet Service Provider
• JSON – JavaScript Object Notation – a lightweight data-interchange format
• RIPE NCC – Réseaux IP Européens Network Coordination Centre – a regional Internet registry
• ROA – Route Origin Authorization – a cryptographically signed attestation of a BGP route announcement
• RPKI – Resource Public Key Infrastructure – a public key infrastructure framework for routing information
• Tier 1 – A network that has no default route and peers with all other tier 1’s
• UTC – Coordinated Universal Time – a time standard for clocks and time
• "there be dragons" – a mistype, as it was meant to be "here be dragons" which means dangerous or unexplored territories

# The Quantum Menace

Post Syndicated from Armando Faz-Hernández original https://blog.cloudflare.com/the-quantum-menace/

Over the last few decades, the word ‘quantum’ has become increasingly popular. It is common to find articles, reports, and many people interested in quantum mechanics and the new capabilities and improvements it brings to the scientific community. This topic not only concerns physics, since the development of quantum mechanics impacts on several other fields such as chemistry, economics, artificial intelligence, operations research, and undoubtedly, cryptography.

This post begins a trio of blogs describing the impact of quantum computing on cryptography, and how to use stronger algorithms resistant to the power of quantum computing.

• This post introduces quantum computing and describes the main aspects of this new computing model and its devastating impact on security standards; it summarizes some approaches to securing information using quantum-resistant algorithms.
• Due to the relevance of this matter, we present our experiments on a large-scale deployment of quantum-resistant algorithms.
• Our third post introduces CIRCL, open-source Go library featuring optimized implementations of quantum-resistant algorithms and elliptic curve-based primitives.

All of this is part of Cloudflare’s Crypto Week 2019, now fasten your seatbelt and get ready to make a quantum leap.

### What is Quantum Computing?

Back in 1981, Richard Feynman raised the question about what kind of computers can be used to simulate physics. Although some physical systems can be simulated in a classical computer, the amount of resources used by such a computer can grow exponentially. Then, he conjectured the existence of a computer model that behaves under quantum mechanics rules, which opened a field of research now called quantum computing. To understand the basics of quantum computing, it is necessary to recall how classical computers work, and from that shine a spotlight on the differences between these computational models.

In 1936, Alan Turing and Emil Post independently described models that gave rise to the foundation of the computing model known as the Post-Turing machine, which describes how computers work and allowed further determination of limits for solving problems.

In this model, the units of information are bits, which store one of two possible values, usually denoted by 0 and 1. A computing machine contains a set of bits and performs operations that modify the values of the bits, also known as the machine’s state. Thus, a machine with N bits can be in one of 2ᴺ possible states. With this in mind, the Post-Turing computing model can be abstractly described as a machine of states, in which running a program is translated as machine transitions along the set of states.

A paper David Deutsch published in 1985 describes a computing model that extends the capabilities of a Turing machine based on the theory of quantum mechanics. This computing model introduces several advantages over the Turing model for processing large volumes of information. It also presents unique properties that deviate from the way we understand classical computing. Most of these properties come from the nature of quantum mechanics. We’re going to dive into these details before approaching the concept of quantum computing.

### Superposition

One of the most exciting properties of quantum computing that provides an advantage over the classical computing model is superposition. In physics, superposition is the ability to produce valid states from the addition or superposition of several other states that are part of a system.

Applying these concepts to computing information, it means that there is a system in which it is possible to generate a machine state that represents a (weighted) sum of the states 0 and 1; in this case, the term weighted means that the state can keep track of “the quantity of” 0 and 1 present in the state. In the classical computation model, one bit can only store either the state of 0 or 1, not both; even using two bits, they cannot represent the weighted sum of these states. Hence, to make a distinction from the basic states, quantum computing uses the concept of a quantum bit (qubit) — a unit of information to denote the superposition of two states. This is a cornerstone concept of quantum computing as it provides a way of tracking more than a single state per unit of information, making it a powerful tool for processing information.

So, a qubit represents the sum of two parts: the 0 or 1 state plus the amount each 0/1 state contributes to produce the state of the qubit.

In mathematical notation, qubit $$| \Psi \rangle$$ is an explicit sum indicating that a qubit represents the superposition of the states 0 and 1. This is the Dirac notation used to describe the value of a qubit $$| \Psi \rangle = A | 0 \rangle +B | 1 \rangle$$, where, A and B are complex numbers known as the amplitude of the states 0 and 1, respectively. The value of the basic states is represented by qubits as $$| 0 \rangle = 1 | 0 \rangle + 0 | 1 \rangle$$  and $$| 1 \rangle = 0 | 0 \rangle + 1 | 1 \rangle$$, respectively. The right side of the term contains the abbreviated notation for these special states.

### Measurement

In a classical computer, the values 0 and 1 are implemented as digital signals. Measuring the current of the signal automatically reveals the status of a bit. This means that at any moment the value of the bit can be observed or measured.

The state of a qubit is maintained in a physically closed system, meaning that the properties of the system, such as superposition, require no interaction with the environment; otherwise any interaction, like performing a measurement, can cause interference on the state of a qubit.

Measuring a qubit is a probabilistic experiment. The result is a bit of information that depends on the state of the qubit. The bit, obtained by measuring $$| \Psi \rangle = A | 0 \rangle +B | 1 \rangle$$, will be equal to 0 with probability $$|A|^2$$,  and equal to 1 with probability $$|B|^2$$, where $$|x|$$ represents the absolute value of $$x$$.

From Statistics, we know that the sum of probabilities of all possible events is always equal to 1, so it must hold that $$|A|^2 +|B|^2 =1$$. This last equation motivates to represent qubits as the points of a circle of radius one, and more generally, as the points on the surface of a sphere of radius one, which is known as the Bloch Sphere.

Let’s break it down: If you measure a qubit you also destroy the superposition of the qubit, resulting in a superposition state collapse, where it assumes one of the basics states, providing your final result.

Another way to think about superposition and measurement is through the coin tossing experiment.

Toss a coin in the air and you give people a random choice between two options: heads or tails. Now, don’t focus on the randomness of the experiment, instead note that while the coin is rotating in the air, participants are uncertain which side will face up when the coin lands. Conversely, once the coin stops with a random side facing up, participants are 100% certain of the status.

How does it relate? Qubits are similar to the participants. When a qubit is in a superposition of states, it is tracking the probability of heads or tails, which is the participants’ uncertainty quotient while the coin is in the air. However, once you start to measure the qubit to retrieve its value, the superposition vanishes, and a classical bit value sticks: heads or tails. Measurement is that moment when the coin is static with only one side facing up.

A fair coin is a coin that is not biased. Each side (assume 0=heads and 1=tails) of a fair coin has the same probability of sticking after a measurement is performed. The qubit $$\tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{1}{\sqrt{2}}|1\rangle$$ describes the probabilities of tossing a fair coin. Note that squaring either of the amplitudes results in ½, indicating that there is a 50% chance either heads or tails sticks.

It would be interesting to be able to charge a fair coin at will while it is in the air. Although this is the magic of a professional illusionist, this task, in fact, can be achieved by performing operations over qubits. So, get ready to become the next quantum magician!

### Quantum Gates

A logic gate represents a Boolean function operating over a set of inputs (on the left) and producing an output (on the right). A logic circuit is a set of connected logic gates, a convenient way to represent bit operations.

Other gates are AND, OR, XOR, and NAND, and more. A set of gates is universal if it can generate other gates. For example, NOR and NAND gates are universal since any circuit can be constructed using only these gates.

Quantum computing also admits a description using circuits. Quantum gates operate over qubits, modifying the superposition of the states. For example, there is a quantum gate analogous to the NOT gate, the X gate.

The X quantum gate interchanges the amplitudes of the states of the input qubit.

The Z quantum gate flips the sign’s amplitude of state 1:

Another quantum gate is the Hadamard gate, which generates an equiprobable superposition of the basic states.

Using our coin tossing analogy, the Hadamard gate has the action of tossing a fair coin to the air. In quantum circuits, a triangle represents measuring a qubit, and the resulting bit is indicated by a double-wire.

Other gates, such as the CNOT gate, Pauli’s gates, Toffoli gate, Deutsch gate, are slightly more advanced. Quirk, the open-source playground, is a fun sandbox where you can construct quantum circuits using all of these gates.

### Reversibility

An operation is reversible if there exists another operation that rolls back the output state to the initial state. For instance, a NOT gate is reversible since applying a second NOT gate recovers the initial input.

In contrast, AND, OR, NAND gates are not reversible. This means that some classical computations cannot be reversed by a classic circuit that uses only the output bits. However, if you insert additional bits of information, the operation can be reversed.

Quantum computing mainly focuses on reversible computations, because there’s always a way to construct a reversible circuit to perform an irreversible computation. The reversible version of a circuit could require the use of ancillary qubits as auxiliary (but not temporary) variables.

Due to the nature of composed systems, it could be possible that these ancillas (extra qubits) correlate to qubits of the main computation. This correlation makes it infeasible to reuse ancillas since any modification could have the side-effect on the operation of a reversible circuit. This is like memory assigned to a process by the operating system: the process cannot use memory from other processes or it could cause memory corruption, and processes cannot release their assigned memory to other processes. You could use garbage collection mechanisms for ancillas, but performing reversible computations increases your qubit budget.

### Composed Systems

In quantum mechanics, a single qubit can be described as a single closed system: a system that has no interaction with the environment nor other qubits. Letting qubits interact with others leads to a composed system where more states are represented. The state of a 2-qubit composite system is denoted as $$A_0|00\rangle+A_1|01\rangle+A_2|10\rangle+A_3|11\rangle$$, where, $$A_i$$ values correspond to the amplitudes of the four basic states 00, 01, 10, and 11. This qubit $$\tfrac{1}{2}|00\rangle+\tfrac{1}{2}|01\rangle+\tfrac{1}{2}|10\rangle+\tfrac{1}{2}|11\rangle$$ represents the superposition of these basic states, both having the same probability obtained after measuring the two qubits.

In the classical case, the state of N bits represents only one of 2ᴺ possible states, whereas a composed state of N qubits represents all the 2ᴺ states but in superposition. This is one big difference between these computing models as it carries two important properties: entanglement and quantum parallelism.

### Entanglement

According to the theory behind quantum mechanics, some composed states can be described through the description of its constituents. However, there are composed states where no description is possible, known as entangled states.

The entanglement phenomenon was pointed out by Einstein, Podolsky, and Rosen in the so-called EPR paradox. Suppose there is a composed system of two entangled qubits, in which by performing a measurement in one qubit causes interference in the measurement of the second. This interference occurs even when qubits are separated by a long distance, which means that some information transfer happens faster than the speed of light. This is how quantum entanglement conflicts with the theory of relativity, where information cannot travel faster than the speed of light. The EPR paradox motivated further investigation for deriving new interpretations about quantum mechanics and aiming to resolve the paradox.

Quantum entanglement can help to transfer information at a distance by following a communication protocol. The following protocol examples rely on the fact that Alice and Bob separately possess one of two entangled qubits:

• The superdense coding protocol allows Alice to communicate a 2-bit message $$m_0,m_1$$ to Bob using a quantum communication channel, for example, using fiber optics to transmit photons. All Alice has to do is operate on her qubit according to the value of the message and send the resulting qubit to Bob. Once Bob receives the qubit, he measures both qubits, noting that the collapsed 2-bit state corresponds to Alice’s message.

• The quantum teleportation protocol allows Alice to transmit a qubit to Bob without using a quantum communication channel. Alice measures the qubit to send Bob and her entangled qubit resulting in two bits. Alice sends these bits to Bob, who operates on his entangled qubit according to the bits received and notes that the result state matches the original state of Alice’s qubit.

### Quantum Parallelism

Composed systems of qubits allow representation of more information per composed state. Note that operating on a composed state of N qubits is equivalent to operating over a set of 2ᴺ states in superposition. This procedure is quantum parallelism. In this setting, operating over a large volume of information gives the intuition of performing operations in parallel, like in the parallel computing paradigm; one big caveat is that superposition is not equivalent to parallelism.

Remember that a composed state is a superposition of several states so, a computation that takes a composed state of inputs will result in a composed state of outputs. The main divergence between classical and quantum parallelism is that quantum parallelism can obtain only one of the processed outputs. Observe that a measurement in the output of a composed state causes that the qubits collapse to only one of the outputs, making it unattainable to calculate all computed values.

Although quantum parallelism does not match precisely with the traditional notion of parallel computing, you can still leverage this computational power to get related information.

Deutsch-Jozsa Problem: Assume $$F$$ is a function that takes as input N bits, outputs one bit, and is either constant (always outputs the same value for all inputs) or balanced (outputs 0 for half of the inputs and 1 for the other half). The problem is to determine if $$F$$ is constant or balanced.

The quantum algorithm that solves the Deutsch-Jozsa problem uses quantum parallelism. First, N qubits are initialized in a superposition of 2ᴺ states. Then, in a single shot, it evaluates $$F$$ for all of these states.

The result of applying $$F$$ appears (in the exponent) of the amplitude of the all-zero state. Note that only when $$F$$ is constant is this amplitude, either +1 or -1. If the result of measuring the N qubit is an all-zeros bitstring, then there is a 100% certainty that $$F$$ is constant. Any other result indicates that $$F$$ is balanced.

A deterministic classical algorithm solves this problem using $$2^{N-1}+1$$ evaluations of $$F$$ in the worst case. Meanwhile, the quantum algorithm requires only one evaluation. The Deutsch-Jozsa problem exemplifies the exponential advantage of a quantum algorithm over classical algorithms.

## Quantum Computers

The theory of quantum computing is supported by investigations in the field of quantum mechanics. However, constructing a quantum machine requires a physical system that allows representing qubits and manipulation of states in a reliable and precise way.

The DiVincenzo Criteria require that a physical implementation of a quantum computer must:

1. Be scalable and have well-defined qubits.
2. Be able to initialize qubits to a state.
3. Have long decoherence times to apply quantum error-correcting codes. Decoherence of a qubit happens when the qubit interacts with the environment, for example, when a measurement is performed.
4. Use a universal set of quantum gates.
5. Be able to measure single qubits without modifying others.

Quantum computer physical implementations face huge engineering obstacles to satisfy these requirements. The most important challenge is to guarantee low error rates during computation and measurement. Lowering these rates require techniques for error correction, which add a significant number of qubits specialized on this task. For this reason, the number of qubits of a quantum computer should not be regarded as for classical systems. In a classical computer, the bits of a computer are all effective for performing a calculation, whereas the number of qubits is the sum of the effective qubits (those used to make calculations) plus the ancillas (used for reversible computations) plus the error correction qubits.

Current implementations of quantum computers partially satisfy the DiVincenzo criteria. Quantum adiabatic computers fit in this category since they do not operate using quantum gates. For this reason, they are not considered to be universal quantum computers.

A recurrent problem in optimization is to find the global minimum of an objective function. For example, a route-traffic control system can be modeled as a function that reduces the cost of routing to a minimum. Simulated annealing is a heuristic procedure that provides a good solution to these types of problems. Simulated annealing finds the solution state by slowly introducing changes (the adiabatic process) on the variables that govern the system.

Quantum annealing is the analogous quantum version of simulated annealing. A qubit is initialized into a superposition of states representing all possible solutions to the problem. Here is used the Hamiltonian operator, which is the sum of vectors of potential and kinetic energies of the system. Hence, the objective function is encoded using this operator describing the evolution of the system in correspondence with time. Then, if the system is allowed to evolve very slowly, it will eventually land on a final state representing the optimal value of the objective function.

Currently, there exist adiabatic computers in the market, such as the D-Wave and IBM Q systems, featuring hundreds of qubits; however, their capabilities are somewhat limited to some problems that can be modeled as optimization problems. The limits of adiabatic computers were studied by van Dam et al, showing that despite solving local searching problems and even some instances of the max-SAT problem, there exists harder searching problems this computing model cannot efficiently solve.

### Nuclear Magnetic Resonance

Nuclear Magnetic Resonance (NMR) is a physical phenomena that can be used to represent qubits. The spin of atomic nucleus of molecules is perturbed by an oscillating magnetic field. A 2001 report describes successful implementation of Shor’s algorithm in a 7-qubit NMR quantum computer. An iconic result since this computer was able to factor the number 15.

### Superconducting Quantum Computers

One way to physically construct qubits is based on superconductors, materials that conduct electric current with zero resistance when exposed to temperatures close to absolute zero.

The Josephson effect, in which current flows across the junction of two superconductors separated by a non-superconducting material, is used to physically implement a superposition of states.

When a magnetic flux is applied to this junction, the current flows continuously in one direction. But, depending on the quantity of magnetic flux applied, the current can also flow in the opposite direction. There exists a quantum superposition of currents going both clockwise and counterclockwise leading to a physical implementation of a qubit called flux qubit. The complete device is known as the Superconducting Quantum Interference Device (SQUID) and can be easily coupled scaling the number of qubits. Thus, SQUIDs are like the transistors of a quantum computer.

Examples of superconducting computers are:

• D-wave’s adiabatic computers process quantum annealing for solving diverse optimization problems.
• Google’s 72-qubit computer was recently announced and also several engineering issues such as achieving lower temperatures.
• IBM’s IBM-Q Tokyo, a 20-qubit adiabatic computer, and IBM Q Experience, a cloud-based system for exploring quantum circuits.

IBM Q System

## The Imminent Threat of Quantum Algorithms

The quantum zoo website tracks problems that can be solved using quantum algorithms. As of mid-2018, more than 60 problems appear on this list, targeting diverse applications in the area of number theory, approximation, simulation, and searching. As terrific as it sounds, some easily-solvable problems by quantum computing are surrounding the security of information.

### Grover’s Algorithm

Tales of a quantum detective (fragment). A couple of detectives have the mission of finding one culprit in a group of suspects that always respond to this question honestly: “are you guilty?”.
The detective C follows a classic interrogative method and interviews every person one at a time, until finding the first one that confesses.
The detective Q proceeds in a different way, First gather all suspects in a completely dark room, and after that, the detective Q asks them — are you guilty? — A steady sound comes from the room saying “No!” while at the same time, a single voice mixed in the air responds “Yes!.” Since everybody is submerged in darkness, the detective cannot see the culprit. However, detective Q knows that, as long as the interrogation advances, the culprit will feel desperate and start to speak louder and louder, and so, he continues asking the same question. Suddenly, detective Q turns on the lights, enters into the room, and captures the culprit. How did he do it?

The task of the detective can be modeled as a searching problem. Given a Boolean function $$f$$ that takes N bits and produces one bit, to find the unique input $$x$$ such that $$f(x)=1$$.

A classical algorithm (detective C) finds $$x$$ using $$2^N-1$$ function evaluations in the worst case. However, the quantum algorithm devised by Grover, corresponding to detective Q, searches quadratically faster using around $$2^{N/2}$$ function evaluations.

The key intuition of Grover’s algorithm is increasing the amplitude of the state that represents the solution while maintaining the other states in a lower amplitude. In this way, a system of N qubits, which is a superposition of 2ᴺ possible inputs, can be continuously updated using this intuition until the solution state has an amplitude closer to 1. Hence, after updating the qubits many times, there will be a high probability to measure the solution state.

Initially, a superposition of 2ᴺ states (horizontal axis) is set, each state has an amplitude (vertical axis) close to 0. The qubits are updated so that the amplitude of the solution state increases more than the amplitude of other states. By repeating the update step, the amplitude of the solution state gets closer to 1, which boosts the probability of collapsing to the solution state after measuring.

Grover’s Algorithm (pseudo-code):

1. Prepare an N qubit $$|x\rangle$$ as a uniform superposition of 2ᴺ states.
2. Update the qubits by performing this core operation. $$|x\rangle \mapsto (-1)^{f(x)} |x\rangle$$ The result of $$f(x)$$ only flips the amplitude of the searched state.
3. Negate the N qubit over the average of the amplitudes.
4. Repeat Step 2 and 3 for $$(\tfrac{\pi}{4}) 2^{ N/2}$$ times.
5. Measure the qubit and return the bits obtained.

Alternatively, the second step can be better understood as a conditional statement:

IF f(x) = 1 THEN
Negate the amplitude of the solution state.
ELSE
/* nothing */
ENDIF


Grover’s algorithm considers function $$f$$ a black box, so with slight modifications, the algorithm can also be used to find collisions on the function. This implies that Grover’s algorithm can find a collision using an asymptotically less number of operations than using a brute-force algorithm.

The power of Grover’s algorithm can be turned against cryptographic hash functions. For instance, a quantum computer running Grover’s algorithm could find a collision on SHA256 performing only 2¹²⁸ evaluations of a reversible circuit of SHA256. The natural protection for hash functions is to increase the output size to double. More generally, most of symmetric key encryption algorithms will survive to the power of Grover’s algorithm by doubling the size of keys.

The scenario for public-key algorithms is devastating in face of Peter Shor’s algorithm.

### Shor’s Algorithm

Multiplying integers is an easy task to accomplish, however, finding the factors that compose an integer is difficult. The integer factorization problem is to decompose a given integer number into its prime factors. For example, 42 has three factors 2, 3, and 7 since $$2\times 3\times 7 = 42$$. As the numbers get bigger, integer factorization becomes more difficult to solve, and the hardest instances of integer factorization are when the factors are only two different large primes. Thus, given an integer number $$N$$, to find primes $$p$$ and $$q$$ such that $$N = p \times q$$, is known as integer splitting.

Factoring integers is like cutting wood, and the specific task of splitting integers is analogous to using an axe for splitting the log in two parts. There exist many different tools (algorithms) for accomplishing each task.

For integer factorization, trial division, the Rho method, the elliptic curve method are common algorithms. Fermat’s method, the quadratic- and rational-sieve, leads to the (general) number field sieve (NFS) algorithm for integer splitting. The latter relies on finding a congruence of squares, that is, splitting $$N$$ as a product of squares such that $$N = x^2 – y^2 = (x+y)\times(x-y)$$ The complexity of NFS is mainly attached to the number of pairs $$(x, y)$$ that must be examined before getting a pair that factors $$N$$. The NFS algorithm has subexponential complexity on the size of $$N$$, meaning that the time required for splitting an integer increases significantly as the size of $$N$$ grows. For large integers, the problem becomes intractable for classical computers.

### The Axe of Thor Shor

The many different guesses of the NFS algorithm are analogous to hitting the log using a dulled axe; after subexponential many tries, the log is cut by half. However, using a sharper axe allows you to split the log faster. This sharpened axe is the quantum algorithm proposed by Shor in 1994.

Let $$x$$ be an integer less than $$N$$ and of the order $$k$$. Then, if $$k$$ is even, there exists an integer $$q$$ so $$qN$$ can be factored as follows.

This approach has some issues. For example, the factorization could correspond to $$q$$ not $$N$$ and the order of $$x$$ is unknown, and here is where Shor’s algorithm enters the picture, finding the order of $$x$$.

The internals of Shor’s algorithm rely on encoding the order $$k$$ into a periodic function, so that its period can be obtained using the quantum version of the Fourier transform (QFT). The order of $$x$$ can be found using a polynomial number quantum evaluations of Shor’s algorithm. Therefore, splitting integers using this quantum approach has polynomial complexity on the size of $$N$$.

Shor’s algorithm carries strong implications on the security of the RSA encryption scheme because its security relies on integer factorization. A large-enough quantum computer can efficiently break RSA for current instances.

Alternatively, one may recur to elliptic curves, used in cryptographic protocols like ECDSA or ECDH. Moreover, all TLS ciphersuites use a combination of elliptic curve groups, large prime groups, and RSA and DSA signatures. Unfortunately, these algorithms all succumb to Shor’s algorithm. It only takes a few modifications for Shor’s algorithm to solve the discrete logarithm problem on finite groups. This sounds like a catastrophic story where all of our encrypted data and privacy are no longer secure with the advent of a quantum computer, and in some sense this is true.

On one hand, it is a fact that the quantum computers constructed as of 2019 are not large enough to run, for instance, Shor’s algorithm for the RSA key sizes used in standard protocols. For example, a 2018 report shows experiments on the factorization of a 19-bit number using 94 qubits, they also estimate that 147456 qubits are needed for factoring a 768-bit number. Hence, there numbers indicates that we are still far from breaking RSA.

What if we increment RSA key sizes to be resistant to quantum algorithms, just like for symmetric algorithms?

Bernstein et al. estimated that RSA public keys should be as large as 1 terabyte to maintain secure RSA even in the presence of quantum factoring algorithms. So, for public-key algorithms, increasing the size of keys does not help.

A recent investigation by Gidney and Ekerá shows improvements that accelerate the evaluation of quantum factorization. In their report, the cost of factoring 2048-bit integers is estimated to take a few hours using a quantum machine of 20 million qubits, which is far from any current development. Something worth noting is that the number of qubits needed is two orders of magnitude smaller than the estimated numbers given in previous works developed in this decade. Under these estimates, current encryption algorithms will remain secure several more years; however, consider the following not-so-unrealistic situation.

Information currently encrypted with for example, RSA, can be easily decrypted with a quantum computer in the future. Now, suppose that someone records encrypted information and stores them until a quantum computer is able to decrypt ciphertexts. Although this could be as far as 20 years from now, the forward-secrecy principle is violated. A 20-year gap to the future is sometimes difficult to imagine. So, let’s think backwards, what would happen if all you did on the Internet at the end of the 1990s can be revealed 20 years later — today. How does this impact the security of your personal information? What if the ciphertexts were company secrets or business deals? In 1999, most of us were concerned about the effects of the Y2K problem, now we’re facing Y2Q (years to quantum): the advent of quantum computers.

### Post-Quantum Cryptography

Although the current capacity of the physical implementation of quantum computers is far from a real threat to secure communications, a transition to use stronger problems to protect information has already started. This wave emerged as post-quantum cryptography (PQC). The core idea of PQC is finding algorithms difficult enough that no quantum (and classical) algorithm can solve them.

A recurrent question is: How does it look like a problem that even a quantum computer can not solve?

These so-called quantum-resistant algorithms rely on different hard mathematical assumptions; some of them as old as RSA, others more recently proposed. For example, McEliece cryptosystem, formulated in the late 70s, relies on the hardness of decoding a linear code (in the sense of coding theory). The practical use of this cryptosystem didn’t become widespread, since with the passing of time, other cryptosystems superseded in efficiency. Fortunately, McEliece cryptosystem remains immune to Shor’s algorithm, gaining it relevance in the post-quantum era.

Post-quantum cryptography presents alternatives:

As of 2017, the NIST started an evaluation process that tracks possible alternatives for next-generation secure algorithms. From a practical perspective, all candidates present different trade-offs in implementation and usage. The time and space requirements are diverse; at this moment, it’s too early to define which will succeed RSA and elliptic curves. An initial round collected 70 algorithms for deploying key encapsulation mechanisms and digital signatures. As of early 2019, 28 of these survive and are currently in the analysis, investigation, and experimentation phase.

Cloudflare’s mission is to help build a better Internet. As a proactive action, our cryptography team is preparing experiments on the deployment of post-quantum algorithms at Cloudflare scale. Watch our blog post for more details.