Tag Archives: programming

How to execute an object file: Part 3

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/how-to-execute-an-object-file-part-3/

Dealing with external libraries


In part 2 of our series we learned how to process relocations in object files in order to properly wire up internal dependencies in the code. In this post we will look into what happens if the code has external dependencies — that is, if it tries to call functions from external libraries. As before, we will be building upon the code from part 2. Let’s add another function to our toy object file:

obj.c:

#include <stdio.h>
 
...
 
void say_hello(void)
{
    puts("Hello, world!");
}

In the above scenario our say_hello function now depends on the puts function from the C standard library. To try it out we also need to modify our loader to import the new function and execute it:

loader.c:

...
 
static void execute_funcs(void)
{
    /* pointers to imported functions */
    int (*add5)(int);
    int (*add10)(int);
    const char *(*get_hello)(void);
    int (*get_var)(void);
    void (*set_var)(int num);
    void (*say_hello)(void);
 
...
 
    say_hello = lookup_function("say_hello");
    if (!say_hello) {
        fputs("Failed to find say_hello function\n", stderr);
        exit(ENOENT);
    }
 
    puts("Executing say_hello...");
    say_hello();
}
...

Let’s run it:

$ gcc -c obj.c
$ gcc -o loader loader.c
$ ./loader
No runtime base address for section

Seems something went wrong when the loader tried to process relocations, so let’s check the relocations table:

$ readelf --relocs obj.o
 
Relocation section '.rela.text' at offset 0x3c8 contains 7 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000020  000a00000004 R_X86_64_PLT32    0000000000000000 add5 - 4
00000000002d  000a00000004 R_X86_64_PLT32    0000000000000000 add5 - 4
00000000003a  000500000002 R_X86_64_PC32     0000000000000000 .rodata - 4
000000000046  000300000002 R_X86_64_PC32     0000000000000000 .data - 4
000000000058  000300000002 R_X86_64_PC32     0000000000000000 .data - 4
000000000066  000500000002 R_X86_64_PC32     0000000000000000 .rodata - 4
00000000006b  001100000004 R_X86_64_PLT32    0000000000000000 puts - 4
...

The compiler generated a relocation for the puts invocation. The relocation type is R_X86_64_PLT32 and our loader already knows how to process these, so the problem is elsewhere. The above entry shows that the relocation references entry 17 (0x11 in hex) in the symbol table, so let’s check that:

$ readelf --symbols obj.o
 
Symbol table '.symtab' contains 18 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS obj.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4
     5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5
     6: 0000000000000000     4 OBJECT  LOCAL  DEFAULT    3 var
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    7
     8: 0000000000000000     0 SECTION LOCAL  DEFAULT    8
     9: 0000000000000000     0 SECTION LOCAL  DEFAULT    6
    10: 0000000000000000    15 FUNC    GLOBAL DEFAULT    1 add5
    11: 000000000000000f    36 FUNC    GLOBAL DEFAULT    1 add10
    12: 0000000000000033    13 FUNC    GLOBAL DEFAULT    1 get_hello
    13: 0000000000000040    12 FUNC    GLOBAL DEFAULT    1 get_var
    14: 000000000000004c    19 FUNC    GLOBAL DEFAULT    1 set_var
    15: 000000000000005f    19 FUNC    GLOBAL DEFAULT    1 say_hello
    16: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND _GLOBAL_OFFSET_TABLE_
    17: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND puts

Oh! The section index for the puts function is UND (essentially 0 in the code), which makes total sense: unlike previous symbols, puts is an external dependency, and it is not implemented in our obj.o file. Therefore, it can’t be a part of any section within obj.o.
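
As an aside, that entry number comes straight from the Info column of the relocation: the upper 32 bits hold the symbol table index and the lower 32 bits hold the relocation type. Below is a small standalone snippet (a sketch for illustration only, with the value from the readelf output above hardcoded) showing how the elf.h macros split it:

#include <elf.h>
#include <stdio.h>

int main(void)
{
    /* the Info column of the puts relocation from the readelf output above */
    Elf64_Xword info = 0x001100000004ULL;

    /* upper 32 bits: symbol table index, lower 32 bits: relocation type */
    printf("symbol index: %lu, type: %lu\n",
           (unsigned long)ELF64_R_SYM(info),   /* 17 (0x11) -> puts   */
           (unsigned long)ELF64_R_TYPE(info)); /* 4 -> R_X86_64_PLT32 */
    return 0;
}
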
So how do we resolve this relocation? We need to somehow point the code to jump to a puts implementation. Our loader actually already has access to the C library puts function (because it is written in C and we’ve used puts in the loader code itself already), but technically it doesn’t have to be the C library puts, just some puts implementation. For completeness, let’s implement our own custom puts function in the loader, which is just a decorator around the C library puts:

loader.c:

...
 
/* external dependencies for obj.o */
static int my_puts(const char *s)
{
    puts("my_puts executed");
    return puts(s);
}
...

Now that we have a puts implementation (and thus its runtime address) we should just write logic in the loader to resolve the relocation by instructing the code to jump to the correct function. However, there is one complication: in part 2 of our series, when we processed relocations for constants and global variables, we learned we’re mostly dealing with 32-bit relative relocations and that the code or data we’re referencing needs to be no more than 2147483647 (0x7fffffff in hex) bytes away from the relocation itself. R_X86_64_PLT32 is also a 32-bit relative relocation, so it has the same requirements, but unfortunately we can’t reuse the trick from part 2 as our my_puts function is part of the loader itself and we don’t have control over where in the address space the operating system places the loader code.
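
To make that constraint concrete, here is a minimal sketch of the check a direct 32-bit patch would have to pass (the helper name is made up for illustration; real code would also account for the addend):

#include <stdint.h>

/* could a 32-bit relative operand at patch_site reach target? */
static int fits_in_rel32(const uint8_t *patch_site, const uint8_t *target)
{
    intptr_t diff = target - patch_site;
    return diff >= INT32_MIN && diff <= INT32_MAX;
}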

Luckily, we don’t have to come up with any new solutions and can just borrow the approach used in shared libraries.

Exploring PLT/GOT

Real world ELF executables and shared libraries have the same problem: often executables have dependencies on shared libraries and shared libraries have dependencies on other shared libraries. And all of the different pieces of a complete runtime program may be mapped to random ranges in the process address space. When a shared library or an ELF executable is linked together, the linker enumerates all the external references and creates two or more additional sections (for a refresher on ELF sections check out the part 1 of our series) in the ELF file. The two mandatory ones are the Procedure Linkage Table (PLT) and the Global Offset Table (GOT).

We will not deep-dive into specifics of the standard PLT/GOT implementation as there are many other great resources online, but in a nutshell PLT/GOT is just a jumptable for external code. At the linking stage the linker resolves all external 32-bit relative relocations with respect to a locally generated PLT/GOT table. It can do that, because this table would become part of the final ELF file itself, so it will be "close" to the main code, when the file is mapped into memory at runtime. Later, at runtime the dynamic loader populates PLT/GOT tables for every loaded ELF file (both the executable and the shared libraries) with the runtime addresses of all the dependencies. Eventually, when the program code calls some external library function, the CPU "jumps" through the local PLT/GOT table to the final code:

[Diagram: the program code jumps through its local PLT/GOT entry to reach the external library code]

Why do we need two ELF sections to implement one jumptable you may ask? Well, because real world PLT/GOT is a bit more complex than described above. Turns out resolving all external references at runtime may significantly slow down program startup time, so symbol resolution is implemented via a "lazy approach": a reference is resolved by the dynamic loader only when the code actually tries to call a particular function. If the main application code never calls a library function, that reference will never be resolved.
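
Before we build our own byte-level version, here is a toy model of the same idea in plain C (only an illustration of the concept, not the real PLT/GOT ABI; the names are made up): an external call goes through a local stub, which jumps via a pointer that the loader fills in at runtime.

#include <stdio.h>

/* "GOT" slot: a pointer the (dynamic) loader fills in at load time */
static int (*got_puts)(const char *);

/* "PLT" stub: it lives next to the calling code, so a 32-bit relative call
 * can always reach it; it just forwards through the GOT slot */
static int plt_puts(const char *s)
{
    return got_puts(s);
}

int main(void)
{
    got_puts = puts;                /* what the dynamic loader would do */
    plt_puts("hello through a toy PLT/GOT");
    return 0;
}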

Implementing a simplified PLT/GOT

For learning and demonstration purposes, though, we will not reimplement a full-blown PLT/GOT with lazy resolution, but rather a simple jumptable that resolves external references when the object file is loaded and parsed. First of all we need to know the size of the table: for ELF executables and shared libraries the linker counts the external references at link stage and creates appropriately sized PLT and GOT sections. Because we are dealing with a raw object file, we have to do another pass over the .rela.text section ourselves and count all the relocations that point to an entry in the symbol table with an undefined section index (or 0 in code). Let’s add a function for this and store the number of external references in a global variable:

loader.c:

...
 
/* number of external symbols in the symbol table */
static int num_ext_symbols = 0;
...
static void count_external_symbols(void)
{
    const Elf64_Shdr *rela_text_hdr = lookup_section(".rela.text");
    if (!rela_text_hdr) {
        fputs("Failed to find .rela.text\n", stderr);
        exit(ENOEXEC);
    }
 
    int num_relocations = rela_text_hdr->sh_size / rela_text_hdr->sh_entsize;
    const Elf64_Rela *relocations = (Elf64_Rela *)(obj.base + rela_text_hdr->sh_offset);
 
    for (int i = 0; i < num_relocations; i++) {
        int symbol_idx = ELF64_R_SYM(relocations[i].r_info);
 
        /* if there is no section associated with a symbol, it is probably
         * an external reference */
        if (symbols[symbol_idx].st_shndx == SHN_UNDEF)
            num_ext_symbols++;
    }
}
...

This function is very similar to our do_text_relocations function. Only instead of actually performing relocations it just counts the number of external symbol references.

Next we need to decide the actual size in bytes for our jumptable. num_ext_symbols holds the number of external symbol references in the object file, but how many bytes per symbol should we allocate? To figure this out we need to design our jumptable format. As we established above, in its simple form our jumptable should be just a collection of unconditional CPU jump instructions — one for each external symbol. Unfortunately, the x64 architecture does not provide a jump instruction that takes a full 64-bit address as a direct operand. Instead, the jump address needs to be stored in memory somewhere "close" — that is, within a 32-bit offset — and that offset is the actual operand. So, for each external symbol we need to store the jump address (64 bits or 8 bytes on a 64-bit system) and the actual jump instruction with an offset operand (6 bytes on x64). We can represent an entry in our jumptable with the following C structure:

loader.c:

...
 
struct ext_jump {
    /* address to jump to */
    uint8_t *addr;
    /* unconditional x64 JMP instruction */
    /* should always be {0xff, 0x25, 0xf2, 0xff, 0xff, 0xff} */
    /* so it would jump to an address stored at addr above */
    uint8_t instr[6];
};
 
struct ext_jump *jumptable;
...

We’ve also added a global variable to store the base address of the jumptable, which will be allocated later. Notice that with the above approach the actual jump instruction will always be constant for every external symbol. Since we allocate a dedicated entry for each external symbol with this structure, the addr member will always be at the same offset from the end of the jump instruction in instr: -14 bytes, or 0xfffffff2 as a 32-bit two’s complement operand. So instr will always be {0xff, 0x25, 0xf2, 0xff, 0xff, 0xff}: 0xff and 0x25 are the encoding of the x64 indirect jump instruction and its modifier, and 0xfffffff2 is the operand offset in little-endian format.
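
As a quick sanity check of that arithmetic, here is a standalone sketch (it repeats the structure definition so it compiles on its own, and assumes a 64-bit x86-64 build) that asserts the layout at compile time: addr sits at offset 0, instr ends at offset 8 + 6 = 14, so the RIP-relative operand is 0 - 14 = -14.

#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

struct ext_jump {
    uint8_t *addr;     /* 8 bytes on x64 */
    uint8_t instr[6];  /* jmp *-14(%rip) */
};

static_assert(offsetof(struct ext_jump, addr) == 0, "addr must come first");
static_assert(offsetof(struct ext_jump, instr) == 8, "instr must directly follow the 8-byte addr");

int main(void) { return 0; }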

Now that we have defined the entry format for our jumptable, we can allocate and populate it when parsing the object file. First of all, let’s not forget to call our new count_external_symbols function from the parse_obj to populate num_ext_symbols (it has to be done before we allocate the jumptable):

loader.c:

...
 
static void parse_obj(void)
{
...
 
    count_external_symbols();
 
    /* allocate memory for `.text`, `.data` and `.rodata` copies rounding up each section to whole pages */
    text_runtime_base = mmap(NULL, page_align(text_hdr->sh_size)...
...
}

Next we need to allocate memory for the jumptable and store the pointer in the jumptable global variable for later use. Just a reminder that in order to resolve 32-bit relocations from the .text section to this table, it has to be "close" in memory to the main code. So we need to allocate it in the same mmap call as the rest of the object sections. Since we defined the table’s entry format in struct ext_jump and have num_ext_symbols, the size of the table would simply be sizeof(struct ext_jump) * num_ext_symbols:

loader.c:

...
 
static void parse_obj(void)
{
...
 
    count_external_symbols();
 
    /* allocate memory for `.text`, `.data` and `.rodata` copies and the jumptable for external symbols, rounding up each section to whole pages */
    text_runtime_base = mmap(NULL, page_align(text_hdr->sh_size) + \
                                   page_align(data_hdr->sh_size) + \
                                   page_align(rodata_hdr->sh_size) + \
                                   page_align(sizeof(struct ext_jump) * num_ext_symbols),
                                   PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (text_runtime_base == MAP_FAILED) {
        perror("Failed to allocate memory");
        exit(errno);
    }
 
...
    rodata_runtime_base = data_runtime_base + page_align(data_hdr->sh_size);
    /* jumptable will come after .rodata */
    jumptable = (struct ext_jump *)(rodata_runtime_base + page_align(rodata_hdr->sh_size));
 
...
}
...

Finally, because the CPU will actually be executing the jump instructions from our instr fields from the jumptable, we need to mark this memory readonly and executable (after do_text_relocations earlier in this function has completed):

loader.c:

...
 
static void parse_obj(void)
{
...
 
    do_text_relocations();
 
...
 
    /* make the jumptable readonly and executable */
    if (mprotect(jumptable, page_align(sizeof(struct ext_jump) * num_ext_symbols), PROT_READ | PROT_EXEC)) {
        perror("Failed to make the jumptable executable");
        exit(errno);
    }
}
...

At this stage we have our jumptable allocated and usable — all that is left to do is to populate it properly. We’ll do this by improving the do_text_relocations implementation to handle the case of external symbols. The No runtime base address for section error from the beginning of this post is actually caused by this line in do_text_relocations:

loader.c:

...
 
static void do_text_relocations(void)
{
...
    for (int i = 0; i < num_relocations; i++) {
...
        /* symbol, with respect to which the relocation is performed */
        uint8_t *symbol_address = section_runtime_base(&sections[symbols[symbol_idx].st_shndx]) + symbols[symbol_idx].st_value;
...
}
...

Currently we try to determine the runtime symbol address for the relocation by looking up the symbol’s section runtime address and adding the symbol’s offset. But we have established above that external symbols do not have an associated section, so their handling needs to be a special case. Let’s update the implementation to reflect this:

loader.c:

...
 
static void do_text_relocations(void)
{
...
    for (int i = 0; i < num_relocations; i++) {
...
        /* symbol, with respect to which the relocation is performed */
        uint8_t *symbol_address;
        
        /* if this is an external symbol */
        if (symbols[symbol_idx].st_shndx == SHN_UNDEF) {
            static int curr_jmp_idx = 0;
 
            /* get external symbol/function address by name */
            jumptable[curr_jmp_idx].addr = lookup_ext_function(strtab +  symbols[symbol_idx].st_name);
 
            /* x64 unconditional JMP with address stored at -14 bytes offset */
            /* will use the address stored in addr above */
            jumptable[curr_jmp_idx].instr[0] = 0xff;
            jumptable[curr_jmp_idx].instr[1] = 0x25;
            jumptable[curr_jmp_idx].instr[2] = 0xf2;
            jumptable[curr_jmp_idx].instr[3] = 0xff;
            jumptable[curr_jmp_idx].instr[4] = 0xff;
            jumptable[curr_jmp_idx].instr[5] = 0xff;
 
            /* resolve the relocation with respect to this unconditional JMP */
            symbol_address = (uint8_t *)(&jumptable[curr_jmp_idx].instr);
 
            curr_jmp_idx++;
        } else {
            symbol_address = section_runtime_base(&sections[symbols[symbol_idx].st_shndx]) + symbols[symbol_idx].st_value;
        }
...
}
...

If a relocation symbol does not have an associated section, we consider it external and call a helper function to look up the symbol’s runtime address by its name. We store this address in the next available jumptable entry, populate the x64 jump instruction with our fixed operand and store the address of the instruction in the symbol_address variable. Later, the existing code in do_text_relocations will resolve the .text relocation with respect to the address in symbol_address in the same way it does for local symbols in part 2 of our series.
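
For reference, the code that actually patches the instruction bytes is unchanged from part 2. Roughly (a sketch from memory of that earlier code, not reproduced verbatim in this post), for R_X86_64_PC32 and R_X86_64_PLT32 relocations it computes S + A - P:

        /* patch_offset (P) is where the 32-bit operand lives in our .text copy */
        uint8_t *patch_offset = text_runtime_base + relocations[i].r_offset;

        switch (ELF64_R_TYPE(relocations[i].r_info))
        {
        case R_X86_64_PC32:
        case R_X86_64_PLT32:
            /* S + A - P: symbol address plus addend, relative to the patch site */
            *((uint32_t *)patch_offset) = symbol_address + relocations[i].r_addend - patch_offset;
            break;
        }

So for an external symbol the 32-bit operand now points at our jumptable entry instead of a (possibly unreachable) function in the loader.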

The only missing bit here now is the implementation of the newly introduced lookup_ext_function helper. Real world loaders may have complicated logic on how to find and resolve symbols in memory at runtime. But for the purposes of this article we’ll provide a simple naive implementation, which can only resolve the puts function:

loader.c:

...
 
static void *lookup_ext_function(const char *name)
{
    size_t name_len = strlen(name);
 
    if (name_len == strlen("puts") && !strcmp(name, "puts"))
        return my_puts;
 
    fprintf(stderr, "No address for function %s\n", name);
    exit(ENOENT);
}
...
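
As a side note, a more general loader could ask the dynamic linker for the address instead of hardcoding known names. Below is a hedged sketch using dlsym(3) (RTLD_DEFAULT requires defining _GNU_SOURCE before any includes when building with glibc, and older systems need -ldl at link time); for this post we will stick with the hardcoded my_puts version above:

/* must come before any #include in loader.c */
#define _GNU_SOURCE   /* for RTLD_DEFAULT */
#include <dlfcn.h>

static void *lookup_ext_function_dl(const char *name)
{
    /* search the loader itself and the libraries it is linked against */
    void *addr = dlsym(RTLD_DEFAULT, name);
    if (!addr) {
        fprintf(stderr, "No address for function %s\n", name);
        exit(ENOENT);
    }
    return addr;
}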

Notice though that because we control the loader logic we are free to implement resolution as we please. In the above case we actually "divert" the object file to use our own "custom" my_puts function instead of the C library one. Let’s recompile the loader and see if it works:

$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42
Executing say_hello...
my_puts executed
Hello, world!

Hooray! We not only fixed our loader to handle external references in object files — we have also learned how to "hook" any such external function call and divert the code to a custom implementation, which might be useful in some cases, like malware research.

As in the previous posts, the complete source code from this post is available on GitHub.

How to execute an object file: Part 1

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/how-to-execute-an-object-file-part-1/

Calling a simple function without linking


When we write software using a high-level compiled programming language, there are usually a number of steps involved in transforming our source code into the final executable binary:

[Diagram: source files are compiled into object files, which the linker then combines into a single executable]

First, our source files are compiled by a compiler translating the high-level programming language into machine code. The output of the compiler is a number of object files. If the project contains multiple source files, we usually get as many object files. The next step is the linker: since the code in different object files may reference each other, the linker is responsible for assembling all these object files into one big program and binding these references together. The output of the linker is usually our target executable, so only one file.

However, at this point, our executable might still be incomplete. These days, most executables on Linux are dynamically linked: the executable itself does not have all the code it needs to run a program. Instead it expects to "borrow" part of the code at runtime from shared libraries for some of its functionality:

[Diagram: at runtime the dynamic loader maps the required shared libraries into the process address space alongside the executable]

This process is called runtime linking: when our executable is being started, the operating system will invoke the dynamic loader, which should find all the needed libraries, copy/map their code into our target process address space, and resolve all the dependencies our code has on them.

One interesting thing to note about this overall process is that we get the executable machine code directly from step 1 (compiling the source code), but if any of the later steps fail, we still can’t execute our program. So, in this series of blog posts we will investigate if it is possible to execute machine code directly from object files skipping all the later steps.

Why would we want to execute an object file?

There may be many reasons. Perhaps we’re writing an open-source replacement for a proprietary Linux driver or application, and want to check whether the behaviour of some code is the same. Or we have a piece of a rare, obscure program and we can’t link to it, because it was compiled with a rare, obscure compiler. Maybe we have a source file, but cannot create a full-featured executable because of missing build-time or runtime dependencies. Malware analysis, code from a different operating system etc. – all these scenarios may put us in a position where either linking is not possible or the runtime environment is not suitable.

A simple toy object file

For the purposes of this article, let’s create a simple toy object file, so we can use it in our experiments:

obj.c:

int add5(int num)
{
    return num + 5;
}

int add10(int num)
{
    return num + 10;
}

Our source file contains only two functions, add5 and add10, which add 5 or 10 respectively to their only input parameter. It’s a small but fully functional piece of code, and we can easily compile it into an object file:

$ gcc -c obj.c 
$ ls
obj.c  obj.o

Loading an object file into the process memory

Now we will try to import the add5 and add10 functions from the object file and execute them. When we talk about executing an object file, we mean using an object file as some sort of a library. As we learned above, when we have an executable that utilises external shared libraries, the dynamic loader loads these libraries into the process address space for us. With object files, however, we have to do this manually, because ultimately we can’t execute machine code that doesn’t reside in the operating system’s RAM. So, to execute object files we still need some kind of a wrapper program:

loader.c:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void load_obj(void)
{
    /* load obj.o into memory */
}

static void parse_obj(void)
{
    /* parse an object file and find add5 and add10 functions */
}

static void execute_funcs(void)
{
    /* execute add5 and add10 with some inputs */
}

int main(void)
{
    load_obj();
    parse_obj();
    execute_funcs();

    return 0;
}

Above is a self-contained object loader program with some functions as placeholders. We will be implementing these functions (and adding more) in the course of this post.

First, as we established already, we need to load our object file into the process address space. We could just read the whole file into a buffer, but that would not be very efficient. Real-world object files might be big, but as we will see later, we don’t need all of the object file’s contents. So it is better to mmap the file instead: this way the operating system will lazily read the parts of the file we need at the time we need them. Let’s implement the load_obj function:

loader.c:

...
/* for open(2), fstat(2) */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* for close(2), fstat(2) */
#include <unistd.h>

/* for mmap(2) */
#include <sys/mman.h>

/* parsing ELF files */
#include <elf.h>

/* for errno */
#include <errno.h>

typedef union {
    const Elf64_Ehdr *hdr;
    const uint8_t *base;
} objhdr;

/* obj.o memory address */
static objhdr obj;

static void load_obj(void)
{
    struct stat sb;

    int fd = open("obj.o", O_RDONLY);
    if (fd == -1) {
        perror("Cannot open obj.o");
        exit(errno);
    }

    /* we need obj.o size for mmap(2) */
    if (fstat(fd, &sb)) {
        perror("Failed to get obj.o info");
        exit(errno);
    }

    /* mmap obj.o into memory */
    obj.base = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (obj.base == MAP_FAILED) {
        perror("Maping obj.o failed");
        exit(errno);
    }
    close(fd);
}
...

If we don’t encounter any errors, after load_obj executes the obj global variable should hold the memory address pointing to the beginning of our obj.o. It is worth noting we have created a special union type for the obj variable: we will be parsing obj.o later (and peeking ahead – object files are actually ELF files), so we will be referring to the address both as an Elf64_Ehdr (the ELF header structure in C) and as a byte pointer (parsing ELF files involves calculating byte offsets from the beginning of the file).

A peek inside an object file

To use some code from an object file, we need to find it first. As I’ve leaked above, object files are actually ELF files (the same format as Linux executables and shared libraries) and luckily they’re easy to parse on Linux with the help of the standard elf.h header, which includes many useful definitions related to the ELF file structure. But we actually need to know what we’re looking for, so a high-level understanding of an ELF file is needed.

ELF segments and sections

Segments (also known as program headers) and sections are probably the main parts of an ELF file and usually a starting point of any ELF tutorial. However, there is often some confusion between the two. Different sections contain different types of ELF data: executable code (which we are most interested in in this post), constant data, global variables etc. Segments, on the other hand, do not contain any data themselves – they just describe to the operating system how to properly load sections into RAM for the executable to work correctly. Some tutorials say "a segment may include 0 or more sections", which is not entirely accurate: segments do not contain sections, rather they just indicate to the OS where in memory a particular section should be loaded and what is the access pattern for this memory (read, write or execute):

[Diagram: ELF segments describe where in memory sections should be loaded and with which access permissions]

Furthermore, object files do not contain any segments at all: an object file is not meant to be directly loaded by the OS. Instead, it is assumed it will be linked with some other code, so ELF segments are usually generated by the linker, not the compiler. We can check this by using the readelf command:

$ readelf --segments obj.o

There are no program headers in this file.

Object file sections

The same readelf command can be used to get all the sections from our object file:

$ readelf --sections obj.o
There are 11 section headers, starting at offset 0x268:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       000000000000001e  0000000000000000  AX       0     0     1
  [ 2] .data             PROGBITS         0000000000000000  0000005e
       0000000000000000  0000000000000000  WA       0     0     1
  [ 3] .bss              NOBITS           0000000000000000  0000005e
       0000000000000000  0000000000000000  WA       0     0     1
  [ 4] .comment          PROGBITS         0000000000000000  0000005e
       000000000000001d  0000000000000001  MS       0     0     1
  [ 5] .note.GNU-stack   PROGBITS         0000000000000000  0000007b
       0000000000000000  0000000000000000           0     0     1
  [ 6] .eh_frame         PROGBITS         0000000000000000  00000080
       0000000000000058  0000000000000000   A       0     0     8
  [ 7] .rela.eh_frame    RELA             0000000000000000  000001e0
       0000000000000030  0000000000000018   I       8     6     8
  [ 8] .symtab           SYMTAB           0000000000000000  000000d8
       00000000000000f0  0000000000000018           9     8     8
  [ 9] .strtab           STRTAB           0000000000000000  000001c8
       0000000000000012  0000000000000000           0     0     1
  [10] .shstrtab         STRTAB           0000000000000000  00000210
       0000000000000054  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

There are different tutorials online describing the most popular ELF sections in detail. Another great reference is the Linux manpages project. It is handy because it describes both sections’ purpose as well as C structure definitions from elf.h, which makes it a one-stop shop for parsing ELF files. However, for completeness, below is a short description of the most popular sections one may encounter in an ELF file:

  • .text: this section contains the executable code (the actual machine code, which was created by the compiler from our source code). This section is the primary area of interest for this post as it should contain the add5 and add10 functions we want to use.
  • .data and .bss: these sections contain global and static local variables. The difference is: .data has variables with an initial value (defined like int foo = 5;) and .bss just reserves space for variables with no initial value (defined like int bar;).
  • .rodata: this section contains constant data (mostly strings or byte arrays). For example, if we use a string literal in the code (for example, for printf or some error message), it will be stored here. Note, that .rodata is missing from the output above as we didn’t use any string literals or constant byte arrays in obj.c.
  • .symtab: this section contains information about the symbols in the object file: functions, global variables, constants etc. It may also contain information about external symbols the object file needs, like needed functions from the external libraries.
  • .strtab and .shstrtab: contain packed strings for the ELF file. Note that these are not the strings we may define in our source code (those go to the .rodata section). These are the strings describing the names of other ELF structures, like symbols from .symtab or even the section names from the table above. The ELF binary format aims to make its structures compact and of a fixed size, so all strings are stored in one place and the respective data structures just reference them as an offset into either the .shstrtab or .strtab section instead of storing the full string locally (see the small illustration right after this list).
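
Here is a tiny standalone illustration of that offset-into-a-string-table scheme (a toy model for this post, not real ELF parsing): one packed blob of NUL-terminated strings, and the "structures" store only numeric offsets into it, which is exactly how sh_name and st_name work.

#include <stdio.h>

/* one packed blob of NUL-terminated names, like .shstrtab or .strtab */
static const char strtab_blob[] = "\0.text\0.data\0add5\0add10";

int main(void)
{
    /* each "structure" stores only an offset into the blob */
    unsigned name_offsets[] = { 1, 7, 13, 18 };

    for (unsigned i = 0; i < sizeof(name_offsets) / sizeof(name_offsets[0]); i++)
        puts(strtab_blob + name_offsets[i]);

    return 0;
}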

The .symtab section

At this point, we know that the code we want to import and execute is located in the obj.o‘s .text section. But we have two functions, add5 and add10, remember? At this level the .text section is just a byte blob – how do we know where each of these functions is located? This is where the .symtab (the "symbol table") comes in handy. It is so important that it has its own dedicated parameter in readelf:

$ readelf --symbols obj.o

Symbol table '.symtab' contains 10 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS obj.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    2
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
     5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5
     6: 0000000000000000     0 SECTION LOCAL  DEFAULT    6
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    4
     8: 0000000000000000    15 FUNC    GLOBAL DEFAULT    1 add5
     9: 000000000000000f    15 FUNC    GLOBAL DEFAULT    1 add10

Let’s ignore the other entries for now and just focus on the last two lines, because they conveniently have add5 and add10 as their symbol names. And indeed, this is the info about our functions. Apart from the names, the symbol table provides us with some additional metadata:

  • The Ndx column tells us the index of the section, where the symbol is located. We can cross-check it with the section table above and confirm that indeed these functions are located in .text (section with the index 1).
  • Type being set to FUNC confirms that these are indeed functions.
  • Size tells us the size of each function, but this information is not very useful in our context. The same goes for Bind and Vis.
  • Probably the most useful piece of information is Value. The name is misleading, because in this context it is actually an offset from the start of the containing section. That is, the add5 function starts right at the beginning of .text, and add10 starts at byte 15 (0xf) of .text.

So now we have all the pieces on how to parse an ELF file and find the functions we need.

Finding and executing a function from an object file

Given what we have learned so far, let’s define a plan on how to proceed to import and execute a function from an object file:

  1. Find the ELF sections table and .shstrtab section (we need .shstrtab later to lookup sections in the section table by name).
  2. Find the .symtab and .strtab sections (we need .strtab to lookup symbols by name in .symtab).
  3. Find the .text section and copy it into RAM with executable permissions.
  4. Find add5 and add10 function offsets from the .symtab.
  5. Execute add5 and add10 functions.

Let’s start by adding some more global variables and implementing the parse_obj function:

loader.c:

...

/* sections table */
static const Elf64_Shdr *sections;
static const char *shstrtab = NULL;

/* symbols table */
static const Elf64_Sym *symbols;
/* number of entries in the symbols table */
static int num_symbols;
static const char *strtab = NULL;

...

static void parse_obj(void)
{
    /* the sections table offset is encoded in the ELF header */
    sections = (const Elf64_Shdr *)(obj.base + obj.hdr->e_shoff);
    /* the index of `.shstrtab` in the sections table is encoded in the ELF header
     * so we can find it without actually using a name lookup
     */
    shstrtab = (const char *)(obj.base + sections[obj.hdr->e_shstrndx].sh_offset);

...
}

...

Now that we have references to both the sections table and the .shstrtab section, we can lookup other sections by their name. Let’s create a helper function for that:

loader.c:

...

static const Elf64_Shdr *lookup_section(const char *name)
{
    size_t name_len = strlen(name);

    /* number of entries in the sections table is encoded in the ELF header */
    for (Elf64_Half i = 0; i < obj.hdr->e_shnum; i++) {
        /* sections table entry does not contain the string name of the section
         * instead, the `sh_name` parameter is an offset in the `.shstrtab`
         * section, which points to a string name
         */
        const char *section_name = shstrtab + sections[i].sh_name;
        size_t section_name_len = strlen(section_name);

        if (name_len == section_name_len && !strcmp(name, section_name)) {
            /* we ignore sections with 0 size */
            if (sections[i].sh_size)
                return sections + i;
        }
    }

    return NULL;
}

...

Using our new helper function, we can now find the .symtab and .strtab sections:

loader.c:

...

static void parse_obj(void)
{
...

    /* find the `.symtab` entry in the sections table */
    const Elf64_Shdr *symtab_hdr = lookup_section(".symtab");
    if (!symtab_hdr) {
        fputs("Failed to find .symtab\n", stderr);
        exit(ENOEXEC);
    }

    /* the symbols table */
    symbols = (const Elf64_Sym *)(obj.base + symtab_hdr->sh_offset);
    /* number of entries in the symbols table = table size / entry size */
    num_symbols = symtab_hdr->sh_size / symtab_hdr->sh_entsize;

    const Elf64_Shdr *strtab_hdr = lookup_section(".strtab");
    if (!strtab_hdr) {
        fputs("Failed to find .strtab\n", stderr);
        exit(ENOEXEC);
    }

    strtab = (const char *)(obj.base + strtab_hdr->sh_offset);
    
...
}

...

Next, let’s focus on the .text section. We noted earlier in our plan that it is not enough to just locate the .text section in the object file, like we did with other sections. We would need to copy it over to a different location in RAM with executable permissions. There are several reasons for that, but these are the main ones:

  • Many CPU architectures either don’t allow execution of machine code that is unaligned in memory (the alignment unit being 4 kilobytes on x86 systems), or they execute it with a performance penalty. However, the .text section in an ELF file is not guaranteed to be positioned at a page aligned offset, because the on-disk version of the ELF file aims to be compact rather than convenient.
  • We may need to modify some bytes in the .text section to perform relocations (we don’t need to do it in this case, but will be dealing with relocations in future posts). If, for example, we forget to use the MAP_PRIVATE flag, when mapping the ELF file, our modifications may propagate to the underlying file and corrupt it.
  • Finally, different sections, which are needed at runtime, like .text, .data, .bss and .rodata, require different memory permission bits: the .text section memory needs to be both readable and executable, but not writable (it is considered a bad security practice to have memory both writable and executable). The .data and .bss sections need to be readable and writable to support global variables, but not executable. The .rodata section should be readonly, because its purpose is to hold constant data. To support this, each section must be allocated on a page boundary as we can only set memory permission bits on whole pages and not custom ranges. Therefore, we need to create new, page aligned memory ranges for these sections and copy the data there.

To create a page aligned copy of the .text section, first we actually need to know the page size. Many programs usually just hardcode the page size to 4096 (4 kilobytes), but we shouldn’t rely on that. While it’s accurate for most x86 systems, other CPU architectures, like arm64, might have a different page size. So hard coding a page size may make our program non-portable. Let’s find the page size and store it in another global variable:

loader.c:

...

static uint64_t page_size;

static inline uint64_t page_align(uint64_t n)
{
    return (n + (page_size - 1)) & ~(page_size - 1);
}

...

static void parse_obj(void)
{
...

    /* get system page size */
    page_size = sysconf(_SC_PAGESIZE);

...
}

...

Notice that we have also added a convenience function page_align, which rounds the passed-in number up to the next page boundary (with 4096-byte pages, both page_align(1) and page_align(4096) return 4096, while page_align(4097) returns 8192). Next, back to the .text section. As a reminder, we need to:

  1. Find the .text section metadata in the sections table.
  2. Allocate a chunk of memory to hold the .text section copy.
  3. Actually copy the .text section to the newly allocated memory.
  4. Make the .text section executable, so we can later call functions from it.

Here is the implementation of the above steps:

loader.c:

...

/* runtime base address of the imported code */
static uint8_t *text_runtime_base;

...

static void parse_obj(void)
{
...

    /* find the `.text` entry in the sections table */
    const Elf64_Shdr *text_hdr = lookup_section(".text");
    if (!text_hdr) {
        fputs("Failed to find .text\n", stderr);
        exit(ENOEXEC);
    }

    /* allocate memory for `.text` copy rounding it up to whole pages */
    text_runtime_base = mmap(NULL, page_align(text_hdr->sh_size), PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (text_runtime_base == MAP_FAILED) {
        perror("Failed to allocate memory for .text");
        exit(errno);
    }

    /* copy the contents of `.text` section from the ELF file */
    memcpy(text_runtime_base, obj.base + text_hdr->sh_offset, text_hdr->sh_size);

    /* make the `.text` copy readonly and executable */
    if (mprotect(text_runtime_base, page_align(text_hdr->sh_size), PROT_READ | PROT_EXEC)) {
        perror("Failed to make .text executable");
        exit(errno);
    }
}

...

Now we have all the pieces we need to locate the address of a function. Let’s write a helper for it:

loader.c:

...

static void *lookup_function(const char *name)
{
    size_t name_len = strlen(name);

    /* loop through all the symbols in the symbol table */
    for (int i = 0; i < num_symbols; i++) {
        /* consider only function symbols */
        if (ELF64_ST_TYPE(symbols[i].st_info) == STT_FUNC) {
            /* symbol table entry does not contain the string name of the symbol
             * instead, the `st_name` parameter is an offset in the `.strtab`
             * section, which points to a string name
             */
            const char *function_name = strtab + symbols[i].st_name;
            size_t function_name_len = strlen(function_name);

            if (name_len == function_name_len && !strcmp(name, function_name)) {
                /* st_value is an offset in bytes of the function from the
                 * beginning of the `.text` section
                 */
                return text_runtime_base + symbols[i].st_value;
            }
        }
    }

    return NULL;
}

...

And finally we can implement the execute_funcs function to import and execute code from an object file:

loader.c:

...

static void execute_funcs(void)
{
    /* pointers to imported add5 and add10 functions */
    int (*add5)(int);
    int (*add10)(int);

    add5 = lookup_function("add5");
    if (!add5) {
        fputs("Failed to find add5 function\n", stderr);
        exit(ENOENT);
    }

    puts("Executing add5...");
    printf("add5(%d) = %d\n", 42, add5(42));

    add10 = lookup_function("add10");
    if (!add10) {
        fputs("Failed to find add10 function\n", stderr);
        exit(ENOENT);
    }

    puts("Executing add10...");
    printf("add10(%d) = %d\n", 42, add10(42));
}

...

Let’s compile our loader and make sure it works as expected:

$ gcc -o loader loader.c 
$ ./loader 
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52

Voila! We have successfully imported code from obj.o and executed it. Of course, the example above is simplified: the code in the object file is self-contained, does not reference any global variables or constants, and does not have any external dependencies. In future posts we will look into more complex code and how to handle such cases.

Security considerations

Processing external inputs, like parsing the ELF file from disk above, should be handled with care. The code in loader.c omits a lot of bounds checking and additional ELF integrity checks when parsing the object file. It is simplified for the purposes of this post and is most likely not production ready, as it can probably be exploited by specially crafted malicious inputs. Use it only for educational purposes!
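
To give a flavour of the checks a more careful loader would add, here is a minimal, non-exhaustive sketch (it assumes the obj union from above and that the file size obtained from fstat(2) is kept around in a variable, here called obj_size, which the code in this post does not actually do):

    /* verify the ELF magic bytes before trusting any header field */
    if (obj_size < sizeof(Elf64_Ehdr) ||
        memcmp(obj.hdr->e_ident, ELFMAG, SELFMAG) != 0) {
        fputs("obj.o is not a valid ELF file\n", stderr);
        exit(ENOEXEC);
    }

    /* make sure the section header table actually lies within the file */
    if (obj.hdr->e_shoff > obj_size ||
        obj.hdr->e_shnum > (obj_size - obj.hdr->e_shoff) / sizeof(Elf64_Shdr)) {
        fputs("Corrupt section header table in obj.o\n", stderr);
        exit(ENOEXEC);
    }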

The complete source code from this post can be found here.

Go is not an easy language

Post Syndicated from arp242.net original https://www.arp242.net/go-easy.html

Go is not an easy programming language. It is simple in many ways: the syntax
is simple, most of the semantics are simple. But a language is more than just
syntax; it’s about doing useful stuff. And doing useful stuff is not always
easy in Go.

Turns out that combining all those simple features in a way to do something
useful can be tricky. How do you remove an item from an array in Ruby?
list.delete_at(i). And remove entries by value? list.delete(value). Pretty
easy, yeah?

In Go it’s … less easy; to remove the index i you need to do:

list = append(list[:i], list[i+1:]...)

And to remove the value v you’ll need to use a loop:

n := 0
for _, l := range list {
    if l != v {
        list[n] = l
        n++
    }
}
list = list[:n]

Is this unacceptably hard? Not really; I think most programmers can figure out
what the above does even without prior Go experience. But it’s not exactly
easy either. I’m usually lazy and copy these kinds of things from the Slice
Tricks
page because I want to focus on actually solving the problem at
hand, rather than plumbing like this.

It’s also easy to get it (subtly) wrong or suboptimal, especially for less
experienced programmers. For example compare the above to copying to a new array
and copying to a new pre-allocated array (make([]string, 0, len(list))):

InPlace             116 ns/op      0 B/op   0 allocs/op
NewArrayPreAlloc    525 ns/op    896 B/op   1 allocs/op
NewArray           1529 ns/op   2040 B/op   8 allocs/op

While 1529ns is still plenty fast enough for many use cases and isn’t something
to excessively worry about, there are plenty of cases where these things do
matter and having the guarantee to always use the best possible algorithm with
list.delete(value) has some value.


Goroutines are another good example. “Look how easy it is to start a goroutine!
Just add go and you’re done!” Well, yes; you’re done until you have five
million of those running at the same time and then you’re left wondering where
all your memory went, and it’s not hard to “leak” goroutines by accident either.

There are a number of patterns to limit the number of goroutines, and none of
them are exactly easy. A simple example might be something like:

var (
	jobs    = 20                 // Run 20 jobs in total.
	running = make(chan bool, 3) // Limit concurrent jobs to 3.
	done    = make(chan bool)    // Signal that all jobs are done.
)

for i := 1; i <= jobs; i++ {
	running <- true // Fill running; this will block and wait if it's already full.

	// Start a job.
	go func(i int) {
		defer func() {
			<-running      // Drain running so new jobs can be added.
			if i == jobs { // Last job, signal that we're done.
				done <- true
			}
		}()

		// "do work"
		time.Sleep(1 * time.Second)
		fmt.Println(i)
	}(i)
}

<-done // Wait until all jobs are done.
fmt.Println("done")

There’s a reason I annotated this with some comments: for people not intimately
familiar with Go this may take some effort to understand. This also won’t ensure
that the numbers are printed in order (which may or may not be a requirement).

Go’s concurrency primitives may be simple and easy to use, but combining them to
solve common real-world scenarios is a lot less simple. The original version of
the above example was actually incorrect.


In Simple Made Easy Rich Hickey argues that we shouldn’t confuse “simple”
with “it’s easy to write”: just because you can do something useful in one or
two lines doesn’t mean the underlying concepts – and therefore the entire
program – are “simple” as in “simple to understand”.

I feel there is some wisdom in this; in most cases we shouldn’t sacrifice
“simple” for “easy”, but that doesn’t mean we can’t think at all about how to
make things easier. Just because concepts are simple doesn’t mean they’re easy
to use, can’t be misused, or can’t be used in ways that lead to (subtle) bugs.
Pushing Hickey’s argument to the extreme we’d end up with something like
Brainfuck and that would of course be silly.

Ideally a language should reduce the cognitive load required to reason about its
behaviour; there are many ways to increase this cognitive load: complex
intertwined language features is one of them, and getting “distracted” by
implementing fairly basic things from those simple concepts is another: it’s
another block of code I need to reason about. While I’m not overly concerned
about code formatting or syntax choices, I do think it can matter to reduce this
cognitive load when reading code.

The lack of generics probably plays some part here; implementing a slices
package which does these kind of things in a generic way is hard right now.
Generics makes this possible and also makes things more complex (more language
features are used), but they also make things easier and, arguably, less complex
on other fronts.[1]


Are these insurmountable problems? No. I still use (and like) Go after all. But
I also don’t think that Go is a language that you “could pick up in ~5-10
minutes”, which was the comment that prompted this post; a sentiment I’ve seen
expressed many times.

As a corollary to all of the above; learning the language isn’t just about
learning the syntax to write your ifs and fors; it’s about learning a way of
thinking. I’ve seen many people coming from Python or C♯ try to shoehorn
concepts or patterns from those languages in Go. Common ones include using
struct embedding as inheritance, panics as exceptions, “pseudo-dynamic
programming” with interface{}, and so forth. It rarely ends well, if ever.

I did this as well when I was writing my first Go program; it’s only natural.
And when I started as a Ruby programmer I tried to write Python code in Ruby
(this works a bit better as the languages are more similar, but there are
still plenty of odd things you can do, such as using for loops).

This is why I don’t like it when people get redirected to the Tour of Go to
“learn the language”, as it just teaches basic syntax and little more. It’s nice
as a little, well, tour to get a bit of a feel of the language and see how it
roughly works and what it can roughly do, but it’s ill-suited to actually learn
the language.

Footnotes

  1. Contrary to popular belief the Go team was never “against” generics;
    I’ve seen many comments to the effect of “the Go team doesn’t think
    generics are useful”, but this was never the case. 

Bitmasks for nicer APIs

Post Syndicated from arp242.net original https://www.arp242.net/bitmask.html

Bitmasks are one of those things where the basic idea is simple to understand:
it’s just 0s and 1s being toggled on and off. But actually “having it click”
to the point where it’s easy to work with can be a bit trickier. At least, it is
(or rather, was) for me 😅

With a bitmask you hide (or “mask”) certain bits of a number, which can be
useful for various things as we’ll see later on. There are two reasons one might
use bitmasks: for efficiency or for nicer APIs. Efficiency is rarely an issue
except for some embedded or specialized use cases, but everyone likes nice APIs,
so this is about that.


A while ago I added colouring support to my little zli library. Adding
colours to your terminal is not very hard as such, just print an escape code:

fmt.Println("\x1b[34mRed text!\x1b[0m")

But a library makes this a bit easier. There’s already a bunch of libraries out
there for Go specifically, the most popular being Fatih Arslan’s color:

color.New(color.FgRed).Add(color.Bold).Add(color.BgCyan).Println("bold red")

This is stored as:

type (
    Attribute int
    Color     struct { params  []Attribute }
)

I wanted a simple way to add some colouring, which looks a bit nicer than the
method chain in the color library, and eventually figured out you don’t need a
[]int to store all the different attributes but that a single uint64 will do
as well:

zli.Colorf("bold red", zli.Red | zli.Bold | zli.Cyan.Bg())

// Or alternatively, use Color.String():
fmt.Printf("%sbold red%s\n", zli.Red|zli.Bold|zli.Cyan.Bg(), zli.Reset)

Which in my eyes looks a bit nicer than Fatih’s library, and also makes it
easier to add 256 and true colour support.

All of the below can be used in any language by the way, and little of this is
specific to Go. You will need Go 1.13 or newer for the binary literals to work.


Here’s how zli stores all of this in a uint64:

                                   fg true, 256, 16 color mode ─┬──┐
                                bg true, 256, 16 color mode ─┬─┐│  │
                                                             │ ││  │┌── parsing error
 ┌───── bg color ────────────┐ ┌───── fg color ────────────┐ │ ││  ││┌─ term attr
 v                           v v                           v v vv  vvv         v
 0000_0000 0000_0000 0000_0000 0000_0000 0000_0000 0000_0000 0000_0000 0000_0000
 ^         ^         ^         ^         ^         ^         ^         ^
64        56        48        40        32        24        16         8

I’ll go over it in detail later, but in short (from right to left):

  • The first 9 bits are flags for the basic terminal attributes such as bold,
    italic, etc.

  • The next bit is to signal a parsing error for true colour codes (e.g. #123123).

  • There are 3 flags for the foreground and background colour each to signal that
    a colour should be applied, and how it should be interpreted (there are 3
    different ways to set the colour: 16-colour, 256-colour, and 24-bit “true
    colour”, which use different escape codes).

  • The colours for the foreground and background are stored separately, because
    you can apply both a foreground and background. These are 24-bit numbers.

  • A value of 0 is reset.

With this, you can make any combination of the common text attributes, the above
example:

zli.Colorf("bold red", zli.Red | zli.Bold | zli.Cyan.Bg())

Would be the following in binary layout:

                                              fg 16 color mode ────┐
                                           bg 16 color mode ───┐   │
                                                               │   │        bold
                bg color ─┬──┐                fg color ─┬──┐   │   │           │
                          v  v                          v  v   v   v           v
 0000_0000 0000_0000 0000_0110 0000_0000 0000_0000 0000_0001 0010_0100 0000_0001
 ^         ^         ^         ^         ^         ^         ^         ^
64        56        48        40        32        24        16         8


We need to go through several steps to actually do something meaningful with
this. First, we want to get all the flag values (the first 24 bits); a “flag” is
a bit being set to true (1) or false (0).

const (
    Bold         = 0b0_0000_0001
    Faint        = 0b0_0000_0010
    Italic       = 0b0_0000_0100
    Underline    = 0b0_0000_1000
    BlinkSlow    = 0b0_0001_0000
    BlinkRapid   = 0b0_0010_0000
    ReverseVideo = 0b0_0100_0000
    Concealed    = 0b0_1000_0000
    CrossedOut   = 0b1_0000_0000
)

func applyColor(c uint64) {
    if c & Bold != 0 {
        // Write escape code for bold
    }
    if c & Faint != 0 {
        // Write escape code for faint
    }
    // etc.
}

& is the bitwise AND operator. It works just as the more familiar && except
that it operates on every individual bit where 0 is false and 1 is true.
The end result will be 1 if both bits are “true” (1). An example with just
four bits:

0011 & 0101 = 0001

This can be thought of as four separate operations (from left to right):

0 AND 0 = 0      both false
0 AND 1 = 0      first value is false, so the end result is false
1 AND 0 = 0      second value is false
1 AND 1 = 1      both true

So what the if c & Bold != 0 check does is test whether the "bold bit" is set:

Only bold set:
0 0000 0001 & 0 0000 0001 = 0 0000 0001

Underline bit set:
0 0000 1000 & 0 0000 0001 = 0 0000 0000      0 since there are no cases of "1 AND 1"

Bold and underline bits set:
0 0000 1001 & 0 0000 0001 = 0 0000 0001      Only "bold AND bold" is "1 AND 1"

As you can see, c & Bold != 0 could also be written as c & Bold == Bold.


The colours themselves are stored as a regular number like any other, except
that they’re “offset” a number of bits. To get the actual number value we need
to clear all the bits we don’t care about, and shift it all to the right:

const (
    colorOffsetFg   = 16

    colorMode16Fg   = 0b0000_0100_0000_0000
    colorMode256Fg  = 0b0000_1000_0000_0000
    colorModeTrueFg = 0b0001_0000_0000_0000

    maskFg          = 0b00000000_00000000_00000000_11111111_11111111_11111111_00000000_00000000
)

func getColor(c uint64) {
    if c & colorMode16Fg != 0  {
        cc := (c & maskFg) >> colorOffsetFg
        // ..write escape code for this color..
    }
}

First we check if the “16 colour mode” flag is set using the same method as the
terminal attributes, and then we AND it with maskFg to clear all the bits we
don’t care about:

                                   fg true, 256, 16 color mode ─┬──┐
                                bg true, 256, 16 color mode ─┬─┐│  │
                                                             │ ││  │┌── parsing error
 ┌───── bg color ────────────┐ ┌───── fg color ────────────┐ │ ││  ││┌─ term attr
 v                           v v                           v v vv  vvv         v
 0000_0000 0000_0000 0000_0110 0000_0000 0000_0000 0000_0001 0010_0100 0000_1001
AND maskFg
 0000_0000_0000_0000_0000_0000_1111_1111_1111_1111_1111_1111_0000_0000_0000_0000
=
 0000_0000 0000_0000 0000_0000 0000_0000 0000_0000 0000_0001 0000_0000 0000_0000
 ^         ^         ^         ^         ^         ^         ^         ^
64        56        48        40        32        24        16         8

After the AND operation we’re left with just the 24 bits we care about, and
everything else is set to 0. To get a normal number from this we need to shift
the bits to the right with >>:

1010 >> 1 = 0101    All bits shifted one position to the right.
1010 >> 2 = 0010    Shift two, note that one bit gets discarded.

Instead of >> 16 you can also divide by 65536 (2^16): (c & maskFg) / 65536.
The end result is the same, but bit shifts are much easier to reason about in
this context.
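
A quick runnable check of that equivalence (my own throwaway sketch, not part
of the library):

package main

import "fmt"

func main() {
    const maskFg = 0b11111111_11111111_11111111 << 16 // foreground colour bits

    // Colour value 42 in the foreground slot, plus some unrelated flag bits.
    c := uint64(42)<<16 | 0b0100_0000_0001

    fmt.Println((c&maskFg)>>16, (c&maskFg)/65536) // both print 42
}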

We repeat this for the background colour (except that we shift everything 40
bits to the right). The background is actually a bit easier since we don’t need
to AND anything to clear bits, as all the bits to the right will just be
discarded:

cc := c >> ColorOffsetBg

For 256 and “true” 24-bit colours we do the same, except that we need to send
different escape codes for them, which is a detail that doesn’t really matter
for this explainer about bitmasks.


To set the background colour we use the Bg() function to transform a
foreground colour to a background one. This avoids having to define BgCyan
constants like Fatih’s library does, and makes working with 256 and true colour
easier.

const (
    colorMode16Fg   = 0b0000_0100_0000_0000
    colorMode16Bg   = 0b0010_0000_0000_0000

    maskFg          = 0b00000000_00000000_00000000_11111111_11111111_11111111_00000000_00000000
)

func Bg(c uint64) uint64 {
    if c & colorMode16Fg != 0 {
        c = c ^ colorMode16Fg | colorMode16Bg
    }
    return (c &^ maskFg) | (c & maskFg << 24)
}

First we check if the foreground colour flag is set; if it is, we move that
bit to the corresponding background flag.

| is the OR operator; this works like || except on individual bits, like in
the above example for &. Note that unlike || there is no short-circuiting:
both sides are always evaluated, and if either of the two bits is 1 the end
result will be 1:

0 OR 0 = 0      both false
0 OR 1 = 1      second value is true, so end result is true
1 OR 0 = 1      first value is true
1 OR 1 = 1      both true

0011 | 0101 = 0111

^ is the “exclusive or”, or XOR, operator. It’s similar to OR except that it
only outputs 1 if exactly one value is 1, and not if both are:

0 XOR 0 = 0      both false
0 XOR 1 = 1      second value is true, so end result is true
1 XOR 0 = 1      first value is true
1 XOR 1 = 0      both true, so result is 0

0011 ^ 0101 = 0110

Putting both together, c ^ colorMode16Fg clears the foreground flag and
| colorMode16Bg sets the background flag.

The last line moves the bits from the foreground colour to the background
colour:

return (c &^ maskFg) | (c & maskFg << 24)

&^ is “AND NOT”: these are two operations: first it inverts the right-hand
side (“NOT”) and then ANDs the result. So in our example the maskFg value is
inverted:

 0000_0000_0000_0000_0000_0000_1111_1111_1111_1111_1111_1111_0000_0000_0000_0000
NOT
 1111_1111_1111_1111_1111_1111_0000_0000_0000_0000_0000_0000_1111_1111_1111_1111

We then use this inverted maskFg value to clear the foreground colour,
leaving everything else intact:

 1111_1111_1111_1111_1111_1111_0000_0000_0000_0000_0000_0000_1111_1111_1111_1111
AND
 0000_0000 0000_0000 0000_0110 0000_0000 0000_0000 0000_0001 0010_0100 0000_1001
=
 0000_0000 0000_0000 0000_0110 0000_0000 0000_0000 0000_0000 0010_0100 0000_1001
 ^         ^         ^         ^         ^         ^         ^         ^
64        56        48        40        32        24        16         8

C and most other languages don’t have this operator and have ~ for NOT (which
Go doesn’t have), so the above would be (c & ~maskFg) in most other languages.

Finally, we set the background colour by clearing all bits that are not part of
the foreground colour, shifting them to the correct place, and ORing this to get
the final result.


I skipped a number of implementation details in the above example for clarity,
especially for people not familiar with Go. The full code is of course
available. Putting all of this together gives a fairly nice API IMHO in about
200 lines of code which mostly avoids boilerplateism.

I only showed the 16-bit colours in the examples, in reality most of this is
duplicated for 256 and true colours as well. It’s all the same logic, just with
different values. I also skipped over the details of terminal colour codes, as
this article isn’t really about that.

In many of the above examples I used binary literals for the constants, and this
seemed the best way to communicate how it all works for this article. This isn’t
necessarily the best or easiest way to write things in actual code, especially
not for such large numbers. In the actual code it looks like:

const (
    ColorOffsetFg = 16
    ColorOffsetBg = 40
)

const (
    maskFg Color = (256*256*256 - 1) << ColorOffsetFg
    maskBg Color = maskFg << (ColorOffsetBg - ColorOffsetFg)
)

// Basic terminal attributes.
const (
    Reset Color = 0
    Bold  Color = 1 << (iota - 1)
    Faint
    // ...
)

Figuring out how this works is left as an exercise for the reader 🙂

Another thing that might be useful is a little helper function to print a number
as binary; it helps visualise things if you’re confused:

func bin(c uint64) {
    reBin := regexp.MustCompile(`([01])([01])([01])([01])([01])([01])([01])([01])`)
    reverse := func(s string) string {
        runes := []rune(s)
        for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
            runes[i], runes[j] = runes[j], runes[i]
        }
        return string(runes)
    }
    fmt.Printf("%[2]s → %[1]d\n", c,
        reverse(reBin.ReplaceAllString(reverse(fmt.Sprintf("%064b", c)),
            `$1$2$3${4}_$5$6$7$8 `)))
}

I put a slightly more advanced version of this at
zgo.at/zstd/zfmt.Binary.

You can also write a little wrapper to make things a bit easier:

type Bitflag64 uint64

func (f Bitflag64) Has(flag Bitflag64) bool { return f&flag != 0 }
func (f *Bitflag64) Set(flag Bitflag64)     { *f = *f | flag }
func (f *Bitflag64) Clear(flag Bitflag64)   { *f = *f &^ flag }
func (f *Bitflag64) Toggle(flag Bitflag64)  { *f = *f ^ flag }
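
A small usage sketch (package clause and import added so it runs as-is,
assuming the Bitflag64 type defined above is in scope):

package main

import "fmt"

func main() {
    var f Bitflag64
    f.Set(0b0001 | 0b0100)   // set two flags at once
    f.Toggle(0b0010)         // flip a third one on
    f.Clear(0b0001)          // and clear one again

    fmt.Println(f.Has(0b0100), f.Has(0b0001)) // true false
}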

If you need more than 64 bits then not all is lost; you can use
type thingy [2]uint64.


Here’s an example where I did it wrong:

type APITokenPermissions struct {
    Count      bool 
    Export     bool 
    SiteRead   bool 
    SiteCreate bool 
    SiteUpdate bool 
}

This records the permissions for an API token the user creates. Looks nice, but
how do you check that only Count is set?

if p.Count && !p.Export && !p.SiteRead && !p.SiteCreate && !p.SiteUpdate { .. }

Ugh; not very nice, and neither is checking if multiple permissions are set:

if perm.Export && perm.SiteRead && perm.SiteCreate && perm.SiteUpdate { .. }

Had I stored it as a bitmask instead, it would have been easier:

if perm == Count { .. }

const permSomething = Export | SiteRead | SiteCreate | SiteUpdate
if perm & permSomething == permSomething { .. }

No one likes functions with this kind of signature either:

f(false, false, true)
f(true, false, true)

But with a bitmask things can look a lot nicer:

const (
    AddWarpdrive   = 0b0001
    AddTractorBeam = 0b0010
    AddPhasers     = 0b0100
)

f(AddPhasers)
f(AddWarpdrive | AddPhasers)

Stupid light software

Post Syndicated from arp242.net original https://www.arp242.net/stupid-light.html

The ultralight hiking community is – as you may gather from the name – very
focused on ultralight equipment and minimalism. Turns out that saving a bit of
weight ten times actually adds up to a significant weight savings, making hikes
– especially longer ones of several days or weeks – a lot more comfortable.

There’s also the concept of stupid light: when you save weight to the
point of stupidity. You won’t be comfortable, you’ll miss stuff you need, your
equipment will be too fragile.

In software, I try to avoid dependencies, needless features, and complexity to
keep things reasonably lightweight. Software is already hard to start with, and
the more of it you have the harder it gets. But you need to be careful not to
make it stupid light.

It’s a good idea to avoid a database if you don’t need one; often flat text
files or storing data in memory works just as well. But at the same time
databases do offer some advantages: it’s structured and it deals with file
locking and atomicity. A younger me would avoid databases at all costs and in
hindsight that was just stupid light in some cases. You don’t need to
immediately jump to PostgreSQL or MariaDB either, and there are many
intermediate solutions, SQLite being the best known, but SQLite can also be
stupid light
in some use cases.

Including a huge library may be overkill for what you need from it; you can
perhaps just copy that one function out of there, or reimplement your own if
it’s simple enough. But this is only a good idea if you can do it well and
ensure it’s actually correct (are you sure all edge cases are handled
correctly?). Otherwise it just becomes stupid light.

I’ve seen several people write their own translation services. All of them were
lighter than gettext. And they were also completely terrible and stupid light.

Adding features or API interfaces can come with significant costs in maintenance
and complexity. But if you’re sacrificing UX and people need to work around the
lack of features then your app or API just becomes stupid light.

It’s all about a certain amount of balance. Lightweight is good, bloated is bad,
and stupid light is just as bad as bloated, or perhaps even worse since bloated
software usually at least allowed you to accomplish the task whereas stupid
light may prevent you from doing so.


I won’t list any examples here as I don’t really want to call out people’s work
as “stupid”, especially if they’re hobby projects people work on in their spare
time. I can think of a few examples, but does adding them really add any value?
I’m not so sure that it does. Arguably “stupid light” isn’t really the best
wording here – the original usage in hiking context is mostly a self-deprecating
one – and a different one without “stupid” would be better, but I couldn’t
really think of anything better 🤷 And it does have a nice ring to it.

Stupid light isn’t something you can measure and define exactly, just like you
can’t measure and exactly define “bloat”. It depends on a lot of factors. But
just as it’s worth thinking about “do we really need this?” to avoid bloat, it’s
also worth thinking about “can we really do without this?” to avoid stupid
light.

An API is a user interface

Post Syndicated from arp242.net original https://www.arp242.net/api-ux.html

An API is a user interface for programmers and is essentially no different from
a graphical user interface, command-line user interface, or any other interface
a human (“user”) is expected to work with. Whenever you create a publicly
callable function you’re creating a user interface. Programmers are users, too.

This applies for any API: libX11, libpng, Ruby on Rails (good UX is a major
factor for Rails’ success), a REST API, etc.

A library consists of two parts: implementation and exposed API. The
implementation is all about doing stuff and interacting with the computer,
whereas the exposed API is about giving a human access to this, preferably in a
convenient way that makes it easy to understand, and making it hard to get
things wrong.

This may sound rather obvious, but in my experience this often seems forgotten.
The world is full of badly documented clunky APIs that give confusing errors (or
no errors!) to prove it.

Whenever I design a public package, module, or class I tend to start by writing
a few basic usage examples and documenting it. This first draft won’t be perfect
and while writing the implementation I keep updating the examples and
documentation to iterate on what works and axe what doesn’t. This is kind of
like TDD, except that it “tests” the UX rather than the implementation. Call it
Example Driven Development if you will.

This is similar to sketching a basic mock UI for a GUI and avoids “oh, we need
to be able to do that too” half-way through building your UI, leading to awkward
clunky UI elements added willy-nilly as an afterthought.

In code reviews the first questions I usually have are things like “is this API
easy to use?”, “Is it consistent?”, “can we extend it in the future so it won’t
be ugly?”, “is it documented, and is the documentation comprehensible?”.
Sometimes I’ll even go as far as trying to write a simple example to see if
there are any problems and if it “feels” right. Only if this part is settled do
I move on to reviewing the correctness of the actual implementation.


I’m not going to list specific examples or tips here; it really depends on the
environment, intended audience (kernel programmers are not Rails programmers),
and most of all: what you’re doing.

Sometimes a single function with five parameters would be bad UX, whereas in
other cases it might be a good option, if all five really are mandatory for
example, or if you use Python and have named parameters. In other cases, it
makes more sense to have five functions that each accept a single parameter.
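
As a (hypothetical) Go illustration of that trade-off: Go doesn’t have named
parameters, but an options struct gets you most of the way there.

package main

import "fmt"

// One function, many positional flags: every call site has to remember the
// order and meaning of each argument.
func notifyA(user string, email, sms, push, digest bool) {
    fmt.Println(user, email, sms, push, digest)
}

// The same thing with an options struct, so call sites are self-describing.
type NotifyOpts struct {
    Email, SMS, Push, Digest bool
}

func notifyB(user string, o NotifyOpts) {
    fmt.Println(user, o)
}

func main() {
    notifyA("alice", true, false, true, false) // which flag is which?
    notifyB("alice", NotifyOpts{Email: true, Push: true})
}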

There usually isn’t “one right way”. If everyone started treating APIs as user
interfaces instead of “oh, it’s just for developers, they will figure it out”
then we’ll be 90% there.

That being said, the most useful general piece of advice I know of is John
Ousterhout’s concept of deep modules: modules that provide large functionality
with simple interfaces. Depth of module is a nice overview which goes into
some more detail about this, and I won’t repeat it here.

Diving into /proc/[pid]/mem

Post Syndicated from Lennart Espe original https://blog.cloudflare.com/diving-into-proc-pid-mem/

Diving into /proc/[pid]/mem

Diving into /proc/[pid]/mem

A few months ago, after reading about Cloudflare doubling its intern class size, I quickly dusted off my CV and applied for an internship. Long story short: now, a couple of months later, I found myself staring into Linux kernel code and adding a pretty cool feature to gVisor, a Linux container runtime.

My internship was under the Emerging Technologies and Incubation group on a project involving gVisor. A co-worker contacted my team about not being able to read the debug symbols of stack traces inside the sandbox. For example, when the isolated process crashed, this is what we saw in the logs:

*** Check failure stack trace: ***
    @     0x7ff5f69e50bd  (unknown)
    @     0x7ff5f69e9c9c  (unknown)
    @     0x7ff5f69e4dbd  (unknown)
    @     0x7ff5f69e55a9  (unknown)
    @     0x5564b27912da  (unknown)
    @     0x7ff5f650ecca  (unknown)
    @     0x5564b27910fa  (unknown)

Obviously, this wasn’t very useful. I eagerly volunteered to fix this stack unwinding code – how hard could it be?

After some debugging, we found that the logging library used in the project opened /proc/self/mem to look for ELF headers at the start of each memory-mapped region. This was necessary to calculate an offset to find the correct addresses for debug symbols.

It turns out this mechanism is rather common. The stack unwinding code is often run in weird contexts – like a SIGSEGV handler – so it would not be appropriate to dig over real memory addresses back and forth to read the ELF. This could trigger another SIGSEGV. And SIGSEGV inside a SIGSEGV handler means either termination via the default handler for a segfault or recursing into the same handler again and again (if one sets SA_NODEFER) leading to a stack overflow.

However, inside gVisor, each call of open() on /proc/self/mem resulted in ENOENT, because the entire /proc/self/mem file was missing. In order to provide a robust sandbox, gVisor has to carefully reimplement the Linux kernel interfaces. This particular /proc file was simply unimplemented in the virtual file system of Sentry, one of gVisor’s sandboxing components.
Marek asked the devs on the project chat and got confirmation – they would be happy to accept a patch implementing this file.
Diving into /proc/[pid]/mem

The easy way out would have been to make a small, local patch to the unwinder behavior, yet I found myself diving into the Linux kernel trying to figure how the mem file worked in an attempt to implement it in Sentry’s VFS.

What does /proc/[pid]/mem do?

The file itself is quite powerful, because it allows raw access to the virtual address space of a process. According to manpages, the documented file operations are open(), read() and lseek(). Typical use cases are debugging tasks or dumping process memory.
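
As a small illustration (my own sketch, not from the original post), a process can read its own memory through this file with nothing more than open() and a positioned read. In Go:

package main

import (
    "fmt"
    "os"
    "runtime"
    "unsafe"
)

func main() {
    data := []byte("hello from memory")

    f, err := os.Open("/proc/self/mem")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // ReadAt performs a pread() at the virtual address of our own buffer.
    addr := uintptr(unsafe.Pointer(&data[0]))
    buf := make([]byte, len(data))
    if _, err := f.ReadAt(buf, int64(addr)); err != nil {
        panic(err)
    }
    runtime.KeepAlive(data)

    fmt.Printf("read back: %s\n", buf) // read back: hello from memory
}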

Opening the file

When a process wants to open the file, the kernel does the file permissions check, looks up the associated operations for mem and invokes a method called proc_mem_open. It retrieves the associated task and calls a method named mm_access.

/*
 * Grab a reference to a task's mm, if it is not already going away
 * and ptrace_may_access with the mode parameter passed to it
 * succeeds.
 */

Seems relatively straightforward, right? The special thing about mm_access is that it verifies the permissions the current task has regarding the task to which the memory belongs. If the current task and target task do not share the same memory manager, the kernel invokes a method named __ptrace_may_access.

/*
 * May we inspect the given task?
 * This check is used both for attaching with ptrace
 * and for allowing access to sensitive information in /proc.
 *
 * ptrace_attach denies several cases that /proc allows
 * because setting up the necessary parent/child relationship
 * or halting the specified task is impossible.
 *
 */

According to the manpages, a process which would like to read from an unrelated /proc/[pid]/mem file should have access mode PTRACE_MODE_ATTACH_FSCREDS. This check does not verify that a process is attached via PTRACE_ATTACH, but rather if it has the permission to attach with the specified credentials mode.

Access checks

After skimming through the function, you will see that a process is allowed access if the current task belongs to the same thread group as the target task. Otherwise, access is denied unless one of the following conditions is met (depending on whether PTRACE_MODE_FSCREDS or PTRACE_MODE_REALCREDS is set, we will use either the file-system UID / GID, which is typically the same as the effective UID/GID, or the real UID / GID):

  • the current task’s credentials (UID, GID) match up with the credentials (real, effective and saved set-UID/GID) of the target process
  • the current task has CAP_SYS_PTRACE inside the user namespace of the target process

In the next check, access is denied if the current task does not have CAP_SYS_PTRACE inside the user namespace of the target task and the target’s dumpable attribute is not set to SUID_DUMP_USER. The dumpable attribute is typically required to allow producing core dumps.

After these three checks, we also go through the commoncap Linux Security Module (and other LSMs) to verify our access mode is fine. LSMs you may know are SELinux and AppArmor. The commoncap LSM performs the checks on the basis of effective or permitted process capabilities (depending on the mode being FSCREDS or REALCREDS), allowing access if

  • the capabilities of the current task are a superset of the capabilities of the target task, or
  • the current task has CAP_SYS_PTRACE in the target task’s user namespace

In conclusion, one has access (with only commoncap LSM checks active) if:

  • the current task is in the same task group as the target task, or
  • the current task has CAP_SYS_PTRACE in the target task’s user namespace, or
  • the credentials of the current and target task match up in the given credentials mode, the target task is dumpable, they run in the same user namespace and the target task’s capabilities are a subset of the current task’s capabilities

I highly recommend reading through the ptrace manpages to dig deeper into the different modes, options and checks.

Reading from the file

Since all the access checks occur when opening the file, reading from it is quite straightforward. When one invokes read() on a mem file, it calls up mem_rw (which actually can do both reading and writing).

To avoid using lots of memory, mem_rw performs the copy in a loop and buffers the data in an intermediate page. mem_rw has a hidden superpower: it uses FOLL_FORCE to avoid permission checks on user-owned pages (treating pages marked as non-readable/non-writable as readable and writable).

mem_rw has other specialties, such as its error handling. Some interesting cases are:

  • if the target task has exited after opening the file descriptor, performing read() will always succeed with reading 0 bytes
  • if the initial copy from the target task’s memory to the intermediate page fails, it does not always return an error but only if no data has been read

You can also perform lseek on the file excluding SEEK_END.

How it works in gVisor

Luckily, gVisor already implemented ptrace_may_access as kernel.task.CanTrace, so one can avoid reimplementing all the ptrace access logic. However, the implementation in gVisor is less complicated due to the lack of support for PTRACE_MODE_FSCREDS (which is still an open issue).

When a new file descriptor is open()ed, the GetFile method of the virtual Inode is invoked, so this is where the access check naturally happens. After a successful access check, the method returns a fs.File. The fs.File implements all the file operations you would expect, such as Read() and Write(). gVisor also provides tons of primitives for quickly building a working file structure so that one does not have to reimplement a generic lseek(), for example.

In case a task invokes a Read() call onto the fs.File, the Read method retrieves the memory manager of the file’s Task.
Accessing the task’s memory manager is a breeze with comfortable CopyIn and CopyOut methods, with interfaces similar to io.Writer and io.Reader.

After implementing all of this, we finally got a useful stack trace.

*** Check failure stack trace: ***
    @     0x7f190c9e70bd  google::LogMessage::Fail()
    @     0x7f190c9ebc9c  google::LogMessage::SendToLog()
    @     0x7f190c9e6dbd  google::LogMessage::Flush()
    @     0x7f190c9e75a9  google::LogMessageFatal::~LogMessageFatal()
    @     0x55d6f718c2da  main
    @     0x7f190c510cca  __libc_start_main
    @     0x55d6f718c0fa  _start

Conclusion

A comprehensive victory! The /proc/<pid>/mem file is an important mechanism that gives insight into contents of process memory. It is essential to stack unwinders to do their work in case of complicated and unforeseeable failures. Because the process memory contains highly-sensitive information, data access to the file is determined by a complex set of poorly documented rules. With a bit of effort, you can emulate /proc/[PID]/mem inside gVisor’s sandbox, where the process only has access to the subset of procfs that has been implemented by the gVisor authors and, as a result, you can have access to an easily readable stack trace in case of a crash.

Now I can’t wait to get the PR merged into gVisor.

Raking the floods: How to protect UDP services from DoS attacks with eBPF

Post Syndicated from Jonas Otten original https://blog.cloudflare.com/building-rakelimit/

Raking the floods: How to protect UDP services from DoS attacks with eBPF

Raking the floods: How to protect UDP services from DoS attacks with eBPF

Cloudflare’s globally distributed network is not just designed to protect HTTP services but any kind of TCP or UDP traffic that passes through our edge. To this end, we’ve built a number of sophisticated DDoS mitigation systems, such as Gatebot, which analyze world-wide traffic patterns. However, we’ve always employed defense-in-depth: in addition to global protection systems we also use off-the shelf mechanisms such as TCP SYN-cookies, which protect individual servers locally from the very common SYN-flood. But there’s a catch: such a mechanism does not exist for UDP. UDP is a connectionless protocol and does not have similar context around packets, especially considering that Cloudflare powers services such as Spectrum which are agnostic to the upper layer protocol (DNS, NTP, …), so my 2020 intern class project was to come up with a different approach.

Protecting UDP services

First of all, let’s discuss what it actually means to provide protection to UDP services. We want to ensure that an attacker cannot drown out legitimate traffic. To achieve this we want to identify floods and limit them while leaving legitimate traffic untouched.

The idea to mitigate such attacks is straightforward: first identify a group of packets that is related to an attack, and then apply a rate limit on this group. Such groups are determined based on the attributes available to us in the packet, such as addresses and ports.

We do not want to completely drop the flood of traffic, as legitimate traffic may still be part of it. We only want to drop as much traffic as necessary to comply with our set rate limit. Completely ignoring a set of packets just because it is slightly above the rate limit is not an option, as it may contain legitimate traffic.

This ensures both that our service stays responsive but also that legitimate packets experience as little impact as possible.

While rate limiting is a somewhat straightforward procedure, determining groups is a bit harder, for a number of reasons.

Finding needles in the haystack

The problem in determining groups in packets is that we have barely any context. We consider four attributes useful as attack signatures: the source address and port, as well as the destination address and port. While that already is not a lot, it gets worse: the source address and port may not even be accurate. Packets can be spoofed, in which case an attacker hides their own address. That means only keeping a rate per source address may not provide much value, as it could simply be spoofed.

But there is another problem: keeping one rate per address does not scale. When bringing IPv6 into the equation and its whopping address space it becomes clear it’s not going to work.

To solve these issues we turned to the academic world and found what we were looking for: the problem of Heavy Hitters. Heavy Hitters are elements of a data stream that appear frequently, and can be expressed relative to the overall elements of the stream. We can define, for example, that an element is considered to be a Heavy Hitter if its frequency exceeds, say, 10% of the overall count. To do so we could naively maintain a counter per element, but due to space limitations this will not scale. Instead, probabilistic algorithms such as a CountMin sketch or the SpaceSaving algorithm can be used. These provide an estimated count instead of a precise one, but are capable of doing this with constant memory requirements, and in our case we will just save rates into the CountMin sketch instead of counts. So no matter how many unique elements we have to track, the memory consumption is the same.
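
To make that concrete, here is a toy CountMin sketch in Go (my own sketch, storing plain counts; rakelimit stores rates instead, but the update and estimate logic follow the same pattern):

package main

import (
    "fmt"
    "hash/fnv"
)

// countMin keeps one row of counters per hash function; an element's
// estimate is the minimum of its counters across all rows.
type countMin struct {
    rows [4][1 << 12]uint64
}

func (c *countMin) index(row int, key string) uint64 {
    h := fnv.New64a()
    h.Write([]byte{byte(row)}) // derive a different hash per row
    h.Write([]byte(key))
    return h.Sum64() % uint64(len(c.rows[row]))
}

// Add increments the counters for key in every row.
func (c *countMin) Add(key string) {
    for row := range c.rows {
        c.rows[row][c.index(row, key)]++
    }
}

// Estimate returns an upper bound on how often key was added.
func (c *countMin) Estimate(key string) uint64 {
    min := ^uint64(0)
    for row := range c.rows {
        if v := c.rows[row][c.index(row, key)]; v < min {
            min = v
        }
    }
    return min
}

func main() {
    var cm countMin
    for i := 0; i < 100; i++ {
        cm.Add("198.51.100.7:53")
    }
    cm.Add("192.0.2.1:123")
    fmt.Println(cm.Estimate("198.51.100.7:53"), cm.Estimate("192.0.2.1:123")) // ~100 ~1
}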

We now have a way of finding the needle in the haystack, and it does have constant memory requirements, solving our problem. However, reality isn’t that simple. What if an attack is not just originating from a single port but many? Or what if a reflection attack is hitting our service, resulting in random source addresses but a single source port? Maybe a full /24 subnet is sending us a flood? We can not just keep a rate per combination we see, as it would ignore all these patterns.

Grouping the groups: How to organize packets

Luckily the academic world has us covered again, with the concept of Hierarchical Heavy Hitters. It extends the Heavy Hitter concept by using the underlying hierarchy in the elements of the stream. For example, an IP address can be naturally grouped into several subnets:

Raking the floods: How to protect UDP services from DoS attacks with eBPF

In this case we defined that we consider the fully-specified address, the /24 subnet and the /0 wildcard. We start at the left with the fully specified address, and each step walking towards the top we consider less information from it. We call these less-specific addresses generalisations, and measure how specific a generalisation is by assigning a level. In our example, the address 192.0.2.123 is at level 0, while 192.0.2.0/24 is at level 1, etc.
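
In Go terms, computing these generalisations is just masking the address; a quick sketch using the standard library (the real implementation lives in BPF, this is only to illustrate the levels):

package main

import (
    "fmt"
    "net/netip"
)

func main() {
    addr := netip.MustParseAddr("192.0.2.123")

    level0 := netip.PrefixFrom(addr, 32) // fully specified address
    level1, _ := addr.Prefix(24)         // /24 generalisation
    level2, _ := addr.Prefix(0)          // /0 wildcard

    fmt.Println(level0, level1, level2) // 192.0.2.123/32 192.0.2.0/24 0.0.0.0/0
}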

If we want to create a structure which can hold this information for every packet, it could look like this:

Raking the floods: How to protect UDP services from DoS attacks with eBPF

We maintain a CountMin-sketch per subnet and then apply Heavy Hitters. When a new packet arrives and we need to determine if it is allowed to pass we simply check the rates of the corresponding elements in every node. If no rate exceeds the rate limit that we set, e.g. 25 packets per second (pps), it is allowed to pass.

The structure could now keep track of a single attribute, but we would waste a lot of context around packets! So instead of letting it go to waste, we use the two-dimensional approach for addresses proposed in the paper Hierarchical Heavy Hitters with SpaceSaving algorithm, and extend it further to also incorporate ports into our structure. Ports do not have a natural hierarchy such as addresses, so they can only be in two states: either specified (e.g. 8080) or wildcard.

Now our structure looks like this:

Raking the floods: How to protect UDP services from DoS attacks with eBPF

Now let’s talk about the algorithm we use to traverse the structure and determine if a packet should be allowed to pass. The paper Hierarchical Heavy Hitters with SpaceSaving algorithm provides two methods that can be used on the data structure: one that updates elements and increases their counters, and one that provides all elements that currently are Heavy Hitters. This is actually not necessary for our use-case, as we are only interested if the element, or packet, we are looking at right now would be a Heavy Hitter to decide if it can pass or not.

Secondly, our goal is to prevent any Heavy Hitters from passing, thus leaving the structure with no Heavy Hitters whatsoever. This is a great property, as it allows us to simplify the algorithm substantially, and it looks like this:

Raking the floods: How to protect UDP services from DoS attacks with eBPF

As you may notice, we update every node of a level and maintain the maximum rate we see. After each level we calculate a probability that determines if a packet should be passed to the next level, based on the maximum rate we saw on that level and a set rate limit. Each node essentially filters the traffic for the following, less specific level.

I actually left out a small detail: a packet is not dropped if any rate exceeds the limit, but instead is kept with the probability rate limit/maximum rate seen. The reason is that if we just drop all packets if the rates exceed the limit, we would drop the whole traffic, not just a subset to make it comply with our set rate limit.

Since we now still update more specific nodes even if a node reaches a rate limit, the rate limit will converge towards the underlying pattern of the attack as much as possible. That means other traffic will be impacted as minimally as possible, and that with no manual intervention whatsoever!
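
Sketched in Go (the real decision is made in BPF and the rates come from the sketch, but the rule itself boils down to a few lines):

package main

import (
    "fmt"
    "math/rand"
)

// keep decides whether a packet passes, given the highest estimated rate seen
// on a level and the configured limit (both in packets per second). Keeping
// packets with probability limit/maxRate trims the group down to roughly the
// limit instead of dropping it entirely.
func keep(maxRate, limit float64) bool {
    if maxRate <= limit {
        return true
    }
    return rand.Float64() < limit/maxRate
}

func main() {
    kept := 0
    for i := 0; i < 100000; i++ {
        if keep(250, 25) { // a 250 pps group limited to 25 pps
            kept++
        }
    }
    fmt.Println(kept) // roughly 10% of the packets survive
}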

BPF to the rescue: building a Go library

As we want to use this algorithm to mitigate floods, we need to spend as little computation and overhead as possible before we decide if a packet should be dropped or not. As so often, we looked into the BPF toolbox and found what we need: Socketfilters. As our colleague Marek put it: “It seems, no matter the question – BPF is the answer.”.

Socketfilters are pieces of code that can be attached to a single socket and get executed before a packet will be passed from kernel to userspace. This is ideal for a number of reasons. First, when the kernel runs the socket filter code, it gives it all the information from the packet we need, and other mitigations such as firewalls have been executed. Second the code is executed per socket, so every application can activate it as needed, and also set appropriate rate limits. It may even use different rate limits for different sockets. The third reason is privileges: we do not need to be root to attach the code to a socket. We can execute code in the kernel as a normal user!

BPF also has a number of limitations which have been already covered on this blog in the past, so we will focus on one that’s specific to our project: floating-point numbers.

To calculate rates we need floating-point numbers to provide an accurate estimate. BPF, and the whole kernel for that matter, does not support these. Instead we implemented a fixed-point representation, which uses a part of the available bits for the fractional part of a rational number and the remaining bits for the integer part. This allows us to represent floats within a certain range, but there is a catch when doing arithmetic: while subtraction and addition of two fixed-points work well, multiplication and division requires double the number of bits to ensure there will not be any loss in precision. As we use 64 bits for our fixed-point values, there is no larger data type available to ensure this does not happen. Instead of calculating the result with exact precision, we convert one of the arguments into an integer. That results in the loss of the fractional part, but as we deal with large rates that does not pose any issue, and helps us to work around the bit limitation as intermediate results fit into the available 64 bits. Whenever fixed-point arithmetic is necessary the precision of intermediate results has to be carefully considered.
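
Here is a rough illustration of that trade-off (written in Go for readability; the real implementation runs as BPF, but the arithmetic is the same):

package main

import "fmt"

// fixed is a 32.32 fixed-point number: the upper 32 bits hold the integer
// part, the lower 32 bits the fraction.
type fixed uint64

const fracBits = 32

func fromFloat(f float64) fixed { return fixed(f * (1 << fracBits)) }
func (f fixed) Float() float64  { return float64(f) / (1 << fracBits) }

// mul drops the fractional part of one argument before multiplying, so the
// intermediate result still fits into 64 bits -- trading a little precision
// for not needing a 128-bit type.
func mul(a, b fixed) fixed { return a * (b >> fracBits) }

func main() {
    a := fromFloat(1234.5)
    b := fromFloat(2.25)
    fmt.Println(mul(a, b).Float()) // 2469 (the .25 of b is lost)
}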

There are many more details to the implementation, but instead of covering every single detail in this blog post lets just look at the code.

We open sourced rakelimit over on Github at cloudflare/rakelimit! It is a full-blown Go library that can be enabled on any UDP socket, and is easy to configure.

The development is still in early stages and this is a first prototype, but we are excited to continue and push the development with the community! And if you still can’t get enough, look at our talk from this year’s Linux Plumbers Conference.

Mathematics and programming: exploring the links

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/research-seminar-mathematics-programming-links/

“In my vision, the child programs the computer and, in doing so, both acquires a sense of mastery over a piece of the most modern and powerful technology and establishes an intimate contact with some of the deepest ideas from science, from mathematics, and from the art of intellectual model building.” – Seymour Papert, Mindstorms: Children, Computers, And Powerful Ideas, 1980

We owe much of what we have learned about children learning to program to Seymour Papert (1928–2016), who not only was a great mathematician and computer scientist, but also an inspirational educationalist. He developed the theoretical approach to learning we now know as constructionism, which purports that learning takes place through building artefacts that have meaning and can be shared with others. Papert, together with others, developed the Logo programming language in 1967 to help children develop concepts in both mathematics and in programming. He believed that programming could give children tangible and concrete experiences to support their acquisition of mathematical concepts. Educational programming languages such as Logo were widely used in both primary and secondary education settings during the 1980s and 90s. Thus for many years the links between mathematics and programming have been evident, and we were very fortunate to be able to explore this topic with our research seminar guest speaker, Professor Dame Celia Hoyles of University College London.

Dame Celia Hoyles

Professor Dame Celia Hoyles

Dame Celia Hoyles is a huge celebrity in the world of mathematical education and programming. As well as authoring literally hundreds of academic papers on mathematics education, including on Logo programming, she has received a number of prestigious awards and honours, and has served as the Chief Advisor to the UK government on mathematics in school. For all these reasons, we were delighted to hear her present at a Raspberry Pi Foundation computing education research seminar.

Mathematics is a subject we all need to understand the basics of — it underpins much of our other learning and empowers us in daily life. Yet some mathematical concepts can seem abstract and teachers have struggled over the years to help children to understand them. Since programming includes the design, building, and debugging of artefacts, it is a great approach for making such abstract concepts come to life. It also enables the development of both computational and mathematical thinking, as Celia described in her talk.

Learning mathematics through Scratch programming

Celia and a team* at University College London developed a curriculum initiative called ScratchMaths to teach carefully selected mathematical concepts through programming (funded by the Education Endowment Foundation in 2014–2018). ScratchMaths is for use in upper primary school (age 9–11) over a two-year period.

In the first year, pupils take three computational thinking modules, and in the second year, they move to three more mathematical thinking modules. All the ScratchMaths materials were designed around a pedagogical framework called the 5Es: explore, envisage, explain, exchange, and bridge. This enables teachers to understand the structure and sequencing of the materials as they use them in the classroom:

  • Explore: Investigate, try things out yourself, debug in reaction to feedback
  • Envisage: Have a goal in mind, predict outcome of program before trying
  • Explain: Explain what you have done, articulate reasons behind your approach to others
  • Exchange: Collaborate & share, try to see a problem from another’s perspective as well as defend your own approach and compare with others
  • bridgE: Make explicit links to the mathematics curriculum

Teachers in the ScratchMaths project participated in professional development (two days per module) to enable them to understand the materials and the pedagogical approach.

At the end of the project, external evaluators measured the children’s learning and found a statistically significant increase in computational thinking skills after the first year, but no difference between an intervention group and a control group in the mathematical thinking outcomes in the second year (as measured by the national mathematics tests at that age).

Celia discussed a number of reasons for these findings. She also drew out the positive perspective that children in the trial learned two subjects at the same time without any detriment to their learning of mathematics. Covering two subjects and drawing the links between them without detriment to the core learning is potentially a benefit to schools who need to fit many subjects into their teaching day.

Much more information about the programme and the materials, which are freely available for use, can be found on the ScratchMaths project’s website, and you can also read a research paper describing the project.

As at all our research seminars, participants had many questions for our speaker. Although the project was designed for primary education, where it’s more common to learn subjects together across the curriculum, several questions revolved around the project’s suitability for secondary school. It’s interesting to reflect on how a programme like ScratchMaths might work at secondary level.

Should computing be taught in conjunction or separately?

Teaching programming through mathematics, or vice versa, is established practice in some countries. One example comes from Sweden, where computing and programming is taught across different subject areas, including mathematics: “through teaching pupils should be given opportunities to develop knowledge in using digital tools and programming to explore problems and mathematical concepts, make calculations and to present and interpret data”. In England, conversely, we have a discrete computing curriculum, and an educational system that separates subjects out so that it is often difficult for children to see overlap and contiguity. However, having the focus on computing as a discrete subject gives enormous benefits too, as Celia outlined at the beginning of her talk, and it opens up the potential to give children an in-depth understanding of the whole subject area over their school careers. In an ideal world, perhaps we would teach programming in conjunction with a range of subjects, thus providing the concrete realisation of abstract concepts, while also having discrete computing and computer science in the curriculum.

Woman teacher and female students at a computer

In our current context of a global pandemic, we are continually seeing the importance of computing applications, for example computer modelling and simulation used in the analysis of data. This talk highlighted the importance of learning computing per se, as well as the mathematics one can learn through integrating these two subjects.

Celia is a member of the National Centre of Computing Education (NCCE) Academic Board, made up of academics and experts who support the teaching and learning elements of the NCCE, and we enjoy our continued work with her in this capacity. Through the NCCE, the Raspberry Pi Foundation is reaching thousands of children and educators with free computing resources, online courses, and advanced-level computer science materials. Our networks of Code Clubs and CoderDojos also give children the space and freedom to experiment and play with programming and digital making in a way that is concordant with a constructionist approach.

Next up in our seminar series

If you missed the seminar, you can find Celia’s presentation slides and a recording of her talk on our research seminars page.

In our next seminar on Tuesday 16 June at 17:00–18:00 BST / 12:00–13:00 EDT / 9:00–10:00 PDT / 18:00–19:00 CEST, we’ll welcome Jane Waite, Teaching Fellow at Queen Mary University of London. Jane will be sharing insights about Semantic Waves and unplugged computing. To join the seminar, simply sign up with your name and email address and we’ll email you the link and instructions. If you attended Celia’s seminar, the link remains the same.

 

*The ScratchMaths team are :

  • Professor Dame Celia Hoyles (Mathematics) & Professor Richard Noss (Mathematics) UCL Knowledge Lab
  • Professor Ivan Kalas, (Computing) Comenius University, Bratislava, Slovakia
  • Dr Laura Benton (Computing) & Piers Saunders, (Mathematics) UCL Knowledge Lab
  • Professor Dave Pratt (Mathematics) UCL Institute of Education

The post Mathematics and programming: exploring the links appeared first on Raspberry Pi.

When Bloom filters don’t bloom

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

When Bloom filters don't bloom

When Bloom filters don't bloom

I’ve known about Bloom filters (named after Burton Bloom) since university, but I haven’t had an opportunity to use them in anger. Last month this changed – I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.

While doing research about IP spoofing, I needed to examine whether the source IP addresses extracted from packets reaching our servers were legitimate, depending on the geographical location of our data centers. For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian datacenter. This problem might sound simple, but in the ever-evolving landscape of the internet this is far from easy. Suffice it to say I ended up with many large text files with data like this:

When Bloom filters don't bloom

This reads as: the IP 192.0.2.1 was recorded reaching Cloudflare data center number 107 with a legitimate request. This data came from many sources, including our active and passive probes, logs of certain domains we own (like cloudflare.com), public sources (like BGP table), etc. The same line would usually be repeated across multiple files.

I ended up with a gigantic collection of data of this kind. At some point I counted 1 billion lines across all the harvested sources. I usually write bash scripts to pre-process the inputs, but at this scale this approach wasn’t working. For example, removing duplicates from this tiny file of a meager 600MiB and 40M lines, took… about an eternity:

When Bloom filters don't bloom

Enough to say that deduplicating lines using the usual bash commands like ‘sort’ in various configurations (see ‘–parallel’, ‘–buffer-size’ and ‘–unique’) was not optimal for such a large data set.

Bloom filters to the rescue

When Bloom filters don't bloom

Image by David Eppstein Public Domain

Then I had a brainwave – it’s not necessary to sort the lines! I just need to remove duplicated lines – using some kind of "set" data structure should be much faster. Furthermore, I roughly know the cardinality of the input file (number of unique lines), and I can live with some data points being lost – using a probabilistic data structure is fine!

Bloom-filters are a perfect fit!

While you should go and read Wikipedia on Bloom Filters, here is how I look at this data structure.

How would you implement a "set"? Given a perfect hash function, and infinite memory, we could just create an infinite bit array and set a bit number ‘hash(item)’ for each item we encounter. This would give us a perfect "set" data structure. Right? Trivial. Sadly, hash functions have collisions and infinite memory doesn’t exist, so we have to compromise in our reality. But we can calculate and manage the probability of collisions. For example, imagine we have a good hash function, and 128GiB of memory. We can calculate that the probability of the second item added to the bit array colliding with the first is 1 in 1099511627776. The probability of collision when adding more items worsens as we fill up the bit array.

Furthermore, we could use more than one hash function, and end up with a denser bit array. This is exactly what Bloom filters optimize for. A Bloom filter is a bunch of math on top of the four variables:

  • ‘n’ – The number of input elements (cardinality)
  • ‘m’ – Memory used by the bit-array
  • ‘k’ – Number of hash functions counted for each input
  • ‘p’ – Probability of a false positive match

Given the ‘n’ input cardinality and the ‘p’ desired probability of false positive, the Bloom filter math returns the ‘m’ memory required and ‘k’ number of hash functions needed.
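
For reference, the standard formulas (not spelled out in the post) fit in a few lines of Go:

package main

import (
    "fmt"
    "math"
)

// bloomParams returns the number of bits m and hash functions k needed for n
// elements and a target false-positive probability p, using the textbook
// formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2.
func bloomParams(n uint64, p float64) (m uint64, k int) {
    mf := -float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)
    kf := mf / float64(n) * math.Ln2
    return uint64(math.Ceil(mf)), int(math.Round(kf))
}

func main() {
    m, k := bloomParams(40_000_000, 0.0001) // 40M lines, 1 error per 10k
    fmt.Printf("m = %d bits (~%d MiB), k = %d\n", m, m/8/1024/1024, k)
}

(Fixing k and rounding m up to a power of two, as done further down, just picks a different point on the same trade-off curve.)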

Check out this excellent visualization by Thomas Hurst showing how parameters influence each other:

mmuniq-bloom

Guided by this intuition, I set out on a journey to add a new tool to my toolbox – ‘mmuniq-bloom’, a probabilistic tool that, given input on STDIN, returns only unique lines on STDOUT, hopefully much faster than ‘sort’ + ‘uniq’ combo!

Here it is:

For simplicity and speed I designed ‘mmuniq-bloom’ with a couple of assumptions. First, unless otherwise instructed, it uses 8 hash functions k=8. This seems to be a close to optimal number for the data sizes I’m working with, and the hash function can quickly output 8 decent hashes. Then we align ‘m’, number of bits in the bit array, to be a power of two. This is to avoid the pricey % modulo operation, which compiles down to slow assembly ‘div’. With power-of-two sizes we can just do bitwise AND. (For a fun read, see how compilers can optimize some divisions by using multiplication by a magic constant.)
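
The power-of-two trick in isolation looks like this (a tiny standalone Go sketch, just to show the arithmetic):

package main

import "fmt"

func main() {
    const m = 1 << 30 // number of bits in the array, a power of two
    h := uint64(0xdeadbeefcafebabe)

    // When m is a power of two, h % m equals h & (m-1), and the AND avoids
    // the slow div instruction that the modulo would compile to.
    fmt.Println(h%m, h&(m-1), h%m == h&(m-1))
}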

We can now run it against the same data file we used before:

When Bloom filters don't bloom

Oh, this is so much better! 12 seconds is much more manageable than 2 minutes before. But hold on… The program is using an optimized data structure, relatively limited memory footprint, optimized line-parsing and good output buffering… 12 seconds is still eternity compared to ‘wc -l’ tool:

When Bloom filters don't bloom

What is going on? I understand that counting lines by ‘wc’ is easier than figuring out unique lines, but is it really worth the 26x difference? Where does all the CPU in ‘mmuniq-bloom’ go?

It must be my hash function. ‘wc’ doesn’t need to spend all this CPU performing all this strange math for each of the 40M lines on input. I’m using a pretty non-trivial ‘siphash24’ hash function, so it surely burns the CPU, right? Let’s check by running the code computing hash function but not doing any Bloom filter operations:

When Bloom filters don't bloom

This is strange. Counting the hash function indeed costs about 2s, but the program took 12s in the previous run. The Bloom filter alone takes 10 seconds? How is that possible? It’s such a simple data structure…

A secret weapon – a profiler

It was time to use a proper tool for the task – let’s fire up a profiler and see where the CPU goes. First, let’s fire an ‘strace’ to confirm we are not running any unexpected syscalls:

When Bloom filters don't bloom

Everything looks good. The 10 calls to ‘mmap’ each taking 4ms (3971 us) is intriguing, but it’s fine. We pre-populate memory up front with ‘MAP_POPULATE’ to save on page faults later.

What is the next step? Of course Linux’s ‘perf’!

When Bloom filters don't bloom

Then we can see the results:

When Bloom filters don't bloom

Right, so we indeed burn 87.2% of cycles in our hot code. Let’s see where exactly. Doing ‘perf annotate process_line –source’ quickly shows something I didn’t expect.

When Bloom filters don't bloom

You can see 26.90% of CPU burned in the ‘mov’, but that’s not all of it! The compiler correctly inlined the function, and unrolled the loop 8-fold. Summed up that ‘mov’ or ‘uint64_t v = *p’ line adds up to a great majority of cycles!

When Bloom filters don't bloom

Clearly ‘perf’ must be mistaken, how can such a simple line cost so much? We can repeat the benchmark with any other profiler and it will show us the same problem. For example, I like using ‘google-perftools’ with kcachegrind since they emit eye-candy charts:

When Bloom filters don't bloom

The rendered result looks like this:

When Bloom filters don't bloom

Allow me to summarise what we found so far.

The generic ‘wc’ tool takes 0.45s CPU time to process 600MiB file. Our optimized ‘mmuniq-bloom’ tool takes 12 seconds. CPU is burned on one ‘mov’ instruction, dereferencing memory….

When Bloom filters don't bloom

Image by Jose Nicdao CC BY/2.0

Oh! How could I have forgotten. Random memory access is slow! It’s very, very, very slow!

According to the rule of thumb "latency numbers every programmer should know about", one RAM fetch is about 100ns. Let’s do the math: 40 million lines, 8 hashes counted for each line. Since our Bloom filter is 128MiB, on our older hardware it doesn’t fit into L3 cache! The hashes are uniformly distributed across the large memory range – each hash generates a memory miss. Adding it together that’s…

When Bloom filters don't bloom
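
(Roughly: 40M lines * 8 hashes per line * 100 ns per cache miss = 32 * 10^9 ns, i.e. about 32 seconds spent just waiting for memory.)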

That suggests 32 seconds burned just on memory fetches. The real program is faster, taking only 12s. This is because, although the Bloom filter data does not completely fit into L3 cache, it still gets some benefit from caching. It’s easy to see with ‘perf stat -d’:

When Bloom filters don't bloom

Right, so we should have had at least 320M LLC-load-misses, but we had only 280M. This still doesn’t explain why the program was running only 12 seconds. But it doesn’t really matter. What matters is that the number of cache misses is a real problem and we can only fix it by reducing the number of memory accesses. Let’s try tuning Bloom filter to use only one hash function:

When Bloom filters don't bloom

Ouch! That really hurt! The Bloom filter required 64 GiB of memory to get our desired false positive probability ratio of 1-error-per-10k-lines. This is terrible!

Also, it doesn’t seem like we improved much. It took the OS 22 seconds to prepare memory for us, but we still burned 11 seconds in userspace. I guess this time any benefits from hitting memory less often were offset by lower cache-hit probability due to drastically increased memory size. In previous runs we required only 128MiB for the Bloom filter!

Dumping Bloom filters altogether

This is getting ridiculous. To get the same false positive guarantees we either must use many hashes in Bloom filter (like 8) and therefore many memory operations, or we can have 1 hash function, but enormous memory requirements.

We aren’t really constrained by available memory, instead we want to optimize for reduced memory accesses. All we need is a data structure that requires at most 1 memory miss per item, and use less than 64 Gigs of RAM…

While we could think of more sophisticated data structures like Cuckoo filter, maybe we can be simpler. How about a good old simple hash table with linear probing?

When Bloom filters don't bloom
Image by Vadims Podāns

Welcome mmuniq-hash

Here you can find a tweaked version of mmuniq-bloom, but using a hash table:

Instead of storing bits as for the Bloom-filter, we are now storing 64-bit hashes from the ‘siphash24’ function. This gives us much stronger probability guarantees, with probability of false positives much better than one error in 10k lines.

Let’s do the math. Adding a new item to a hash table containing, say 40M, entries has ’40M/2^64′ chances of hitting a hash collision. This is about one in 461 billion – a reasonably low probability. But we are not adding one item to a pre-filled set! Instead we are adding 40M lines to the initially empty set. As per birthday paradox this has much higher chances of hitting a collision at some point. A decent approximation is ‘~n^2/2m’, which in our case is ‘~(40M^2)/(2*(2^64))’. This is a chance of one in 23000. In other words, assuming we are using good hash function, every one in 23 thousand random sets of 40M items, will have a hash collision. This practical chance of hitting a collision is non-negligible, but it’s still better than a Bloom filter and totally acceptable for my use case.
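
(Concretely: (40M^2)/(2*(2^64)) ≈ 1.6*10^15 / 3.7*10^19 ≈ 4.3*10^-5, or about one in 23,000.)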

The hash table code runs faster, has better memory access patterns and better false positive probability than the Bloom filter approach.

When Bloom filters don't bloom

Don’t be scared about the "hash conflicts" line, it just indicates how full the hash table was. We are using linear probing, so when a bucket is already used, we just pick up the next empty bucket. In our case we had to skip over 0.7 buckets on average to find an empty slot in the table. This is fine and, since we iterate over the buckets in linear order, we can expect the memory to be nicely prefetched.
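
The core of that approach fits in a few lines. Here is my own simplified Go rendition of the idea (the real mmuniq-hash stores 64-bit siphash24 values; this sketch uses FNV from the standard library and a fixed-size table for brevity):

package main

import (
    "bufio"
    "hash/fnv"
    "os"
)

func main() {
    const size = 1 << 26          // power of two, comfortably above the input cardinality
    table := make([]uint64, size) // 0 marks an empty slot

    in := bufio.NewScanner(os.Stdin)
    in.Buffer(make([]byte, 1024*1024), 1024*1024)
    out := bufio.NewWriter(os.Stdout)
    defer out.Flush()

    for in.Scan() {
        h := fnv.New64a()
        h.Write(in.Bytes())
        v := h.Sum64()
        if v == 0 {
            v = 1 // keep 0 reserved as the empty marker
        }

        // Linear probing: walk forward until we find this hash or a free slot.
        i := v & (size - 1)
        for table[i] != 0 && table[i] != v {
            i = (i + 1) & (size - 1)
        }
        if table[i] == v {
            continue // duplicate line (or a rare 64-bit hash collision)
        }
        table[i] = v
        out.Write(in.Bytes())
        out.WriteByte('\n')
    }
}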

From the previous exercise we know our hash function takes about 2 seconds of this. Therefore, it’s fair to say 40M memory hits take around 4 seconds.

Lessons learned

Modern CPUs are really good at sequential memory access when it’s possible to predict memory fetch patterns (see Cache prefetching). Random memory access on the other hand is very costly.

Advanced data structures are very interesting, but beware. Modern computers require cache-optimized algorithms. When working with large datasets that don’t fit in L3, prefer optimizing for a reduced number of loads over optimizing the amount of memory used.

I guess it’s fair to say that Bloom filters are great, as long as they fit into the L3 cache. The moment this assumption is broken, they are terrible. This is not news, Bloom filters optimize for memory usage, not for memory access. For example, see the Cuckoo Filters paper.

Another thing is the ever-lasting discussion about hash functions. Frankly – in most cases it doesn’t matter. The cost of counting even complex hash functions like ‘siphash24’ is small compared to the cost of random memory access. In our case simplifying the hash function will bring only small benefits. The CPU time is simply spent somewhere else – waiting for memory!

One colleague often says: "You can assume modern CPUs are infinitely fast. They run at infinite speed until they hit the memory wall".

Finally, don’t follow my mistakes – everyone should start profiling with ‘perf stat -d’ and look at the "Instructions per cycle" (IPC) counter. If it’s below 1, it generally means the program is stuck on waiting for memory. Values above 2 would be great, it would mean the workload is mostly CPU-bound. Sadly, I’m yet to see high values in the workloads I’m dealing with…

Improved mmuniq

With the help of my colleagues I've prepared a further improved version of the 'mmuniq' hash-table-based tool. See the code:

It is able to dynamically resize the hash table, to support inputs of unknown cardinality. Then, by using batching, it can effectively use the 'prefetch' CPU hint, speeding up the program by 35-40%. Beware: sprinkling code with 'prefetch' rarely works; instead, I specifically changed the flow of the algorithm to take advantage of this instruction. With all the improvements, I got the run time down to 2.1 seconds:

When Bloom filters don't bloom

The end

Writing this basic tool, which tries to be faster than the 'sort | uniq' combo, revealed some hidden gems of modern computing. With a bit of work we were able to speed it up from more than two minutes to 2 seconds. During this journey we learned about random memory access latency and the power of cache-friendly data structures. Fancy data structures are exciting, but in practice reducing random memory loads often brings better results.

Where Is This Coming From?

Post Syndicated from Bozho original https://techblog.bozho.net/where-is-this-coming-from/

In enterprise software, the number one question you have to answer as a developer almost every day is “Where is this coming from?”. When trying to fix bugs, when developing new features, when refactoring. You have to be able to trace the code flow and figure out where a certain value is coming from.

And the bigger the codebase is, the more complicated it is to figure out where something (some value or combination of values) is coming from. In theory it’s either from the user interface or from the database, but we all know it’s always more complicated. I learned that very early in my career, when I had to navigate a huge telecom codebase in order to implement features and fix bugs that amounted to a few dozen lines of code in total.

Answering the question means navigating the code easily, debugging and tracing changes to the values passed around. And while that seems obvious, it isn’t so obvious in the grand scheme of things.

Frameworks, architectures, languages, coding styles and IDEs that obscure the answer to the question “where is this coming from?” make things much worse – for the individual developer and for the project in general. Let me give a few examples.

Scala, for which I have mixed feelings, gives you a lot of cool features. And some awful ones, like implicits. An implicit is something like a global variable, except there are nested implicit scopes. When you need one of those global variables, you just add the “implicit” keyword and you get the value from the innermost scope available that matches the type of the parameter you want to set. And in larger projects it’s not trivial to chase down where that implicit value has been set. It can take hours of debugging to figure out why something has a particular value, only to find out that some unrelated part of the code has touched the relevant implicits. That makes it really hard to trace where stuff is coming from, and is therefore bad for enterprise codebases, at least for me.

Another Scala feature is partially applied functions. You have a function foo(a, b, c) (that’s not the correct syntax, of course). You know one parameter at some point, and the other two parameters only at a later point. So you can apply the function partially and pass the resulting partially applied function on to the next function, and so on, until you have the other arguments available. For example, you can do bar(foo(a)), which means that inside bar(..) you can call foo(b, c). Of course, at that point, the question “where did the value of a come from?” is harder to answer. The feature is really cool if used properly (I’ve used it, and was proud of it), but it should be limited to smaller parts of the codebase. If you start tossing partially applied functions all over the place, it becomes a mess. And unfortunately, I’ve seen that as well.

Enough about Scala. The microservices architecture (which I also have mixed feelings about) likewise complicates a developer’s ability to trace what’s happening. If for a given request you invoke 3-4 external systems, which both return and manipulate data, it becomes much harder to debug your application. Instead of putting a breakpoint or doing a call hierarchy, you have to track the parameters of each interaction with each microservice. It’s news to nobody that microservices are harder to debug, but I just wanted to put that in the context of answering the “where is this coming from” question.

Dynamic typing is another example. I’ve included that as part of my arguments why I prefer static typing. Java IDEs have “Call hierarchy”. Which is the single most useful IDE functionality for large enterprise software (for me even more important than the refactoring functionality). You really can trace every bit of possible code flow, not only in your codebase, but also in your dependencies, which often hide the important details (chances are, you’ll be putting breakpoints and inspecting 3rd party code rather often). Dynamic typing doesn’t give you the ability to do that properly. doSomething called on an unknown-at-compile-time type can be any method with that name. And tracing where stuff is coming from becomes much harder.

Code generation is something that I’ve always avoided. It takes input from text files (in whatever language they are) and generates code, turning the question “where is this coming from” to “why has this been generated that way”.

Message queues and async programming in general – message passing obscures the source and destination of a given piece of data, and a message queue adds complexity to the communication between modules. With microservices you at least have API calls; with queues, you have multiple abstractions between the sender and recipient (exchanges, topics, queues). And that’s a general drawback of asynchronous programming – there’s something in between the program flow that does “async magic” and spits something out on the other end, but is it transformed, is it delayed, is it lost and retried, is it still waiting?

By all these examples I’m not saying you should not use message queues, code generation, dynamic languages, microservices or Scala (though for some of them I’d really advise against it). All of these things have their strengths, and they have been chosen exactly for those strengths. A message queue was probably chosen because you want to really decouple producer and consumer. Scala was chosen for its expressiveness. Microservices were chosen because a monolith had become really hard to manage with multiple teams and multiple languages.

But we should try to minimize the “damage” of not being able to easily trace the program flow and not being able to quickly answer “where is this coming from”. Impose a “no implicits” rule on your Scala codebase. Use code generation for simpler components (e.g. DTOs for protobuf). Use message queues with predictable message/queue/topic/exchange names and some slightly verbose debug logging. Make sure your microservices have corresponding SDKs with consistent naming, and that they can be run locally without additional effort to ease debugging.

It is expected that the bigger and more complex a project is, the harder it will be to trace where stuff is going. But do try to make it as easy as possible, even if it costs a little extra effort in the design and coding phase. You’ll be designing and writing that feature for a week. And you (and others) will be supporting and expanding it for the next 10 years.

The post Where Is This Coming From? appeared first on Bozho's tech blog.

Let’s Annotate Our Methods With The Features They Implement

Post Syndicated from Bozho original https://techblog.bozho.net/lets-annotate-our-methods-with-the-features-they-implement/

Writing software consists of very little actual “writing”, and much more thinking, designing, reading, “digging”, analyzing, debugging, refactoring, aligning and meeting others.

The reading and digging part is where you try to understand what has been implemented before, why it has been implemented, and how it works. In larger projects it becomes increasingly hard to find out what is happening and why – there are so many classes involved, and so many methods that participate in implementing a particular feature.

That’s probably because there is a mismatch between the programming units (classes, methods) and the business logic units (features). Product owners want a “password reset” feature, and they don’t care if it’s done using framework configuration, custom code split in three classes, or one monolithic controller method that does that job.

This mismatch is partially addressed by so-called BDD (behaviour-driven development), as business people can define scenarios in a formalized language (although they rarely do; it’s still up to the QAs or developers to write the tests). But having your tests organized around features and behaviours doesn’t mean the code is, and BDD doesn’t help in making your way through the codebase in search of why and how something is implemented.

Another issue is linking a piece of code to the issue tracking system. Source control conventions and hooks allow for setting the issue tracker number as part of the commit, and then, when browsing the code, you can annotate the file and see the issue number. However, due to the many changes, even a very strict team will end up with methods that are related to multiple issues, and you can’t easily tell which is the relevant one.

Yet another issue with the lack of a “feature” unit in programming languages is that you can’t trivially reuse existing projects to start a new one. We’ve all been there – you have a similar project and you want a skeleton to get things running faster. And while there are many tools to help with that (Spring Boot, Spring Roo, and other scaffolding utilities), they can rarely deliver what you need – you always have to tweak something, delete something, or customize some configuration, as defaults are almost never practical.

And I have a simple proposal that will help with the issues above. As with any complex problem, simple ideas don’t solve everything, but are at least a step forward.

The proposal is in the title – let’s annotate our methods with the features they implement. Let’s have @Feature(name = "Forgotten password", issueTrackerCode="PROJ-123"). A method can implement multiple features, but that is generally discouraged by best practices (e.g. the single responsibility principle). The granularity of “feature” is something that has to be determined by each team and is the tricky part – sometimes an epic describes a feature, sometimes individual stories or even subtasks do. A definition of a feature should be agreed upon and every new team member should be told what to do and how to interpret it.
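The post's example uses Java-style annotation syntax. Purely as an illustration of the idea (the names and fields below are made up, not an existing library), here is the same concept sketched as a Python decorator that simply records feature metadata on a function:

def feature(name: str, issue_tracker_code: str = ""):
    # Attach feature metadata to a function or method (illustration only).
    def wrap(fn):
        fn.__feature__ = {"name": name, "issue": issue_tracker_code}
        return fn
    return wrap

@feature(name="Forgotten password", issue_tracker_code="PROJ-123")
def reset_password(user_id: int) -> None:
    ...  # the actual password-reset logic would live here

A tool could then walk the imported modules, collect the functions that carry this metadata, and group them per feature to produce the kinds of reports described below.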

There is of course a lot of complexity, e.g. for generic methods like DAO methods, utility methods, or methods that are reused in too many places. But they also represent features, it’s just that these features are horizontal. “Data access layer” is a feature – a more technical one indeed, but it counts, and maybe deserves a story in the issue tracker.

Your features can actually be listed in one or several enums, grouped by type – business, horizontal, performance, etc. That way you can even compose features – e.g. account creation contains the logic itself, database access, a security layer.

How does such a proposal help?

  • Raises awareness of the single responsibility of methods and of the need for code to be readable
  • Provides a rationale for the existence of each method. Even if a proper comment is missing, the annotation will put a method (or a class) in context
  • Helps navigating code and fixing issues (if you can see all places where a feature is implemented, you are more likely to spot an issue)
  • Allows tools to analyze your features – amount, complexity, how chaotic a feature is spread across the code base, test coverage per feature, etc.
  • Allows tools to use existing projects for scaffolding for new ones – you specify the features you want to have, and they are automatically copied

At this point I’m supposed to give a link to a GitHub project for a feature annotation library. But it doesn’t make sense to have a single-annotation project; it could easily be part of Guava or something similar, or be created manually in each project. The complex part – the tools that will do the scanning and analysis – deserves separate projects, but unfortunately I don’t have time to write one.

But even without the tools, the concept of annotating methods with their high-level features is, I think, a useful one. Instead of trying to deduce why this method is here and which requirements it implements (and whether all the necessary tests were written at the time), such an annotation can come in handy.

The post Let’s Annotate Our Methods With The Features They Implement appeared first on Bozho's tech blog.

Re-Architecting the Video Gatekeeper

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/re-architecting-the-video-gatekeeper-f7b0ac2f6b00?source=rss----2615bd06b42e---4

By Drew Koszewnik

This is the story about how the Content Setup Engineering team used Hollow, a Netflix OSS technology, to re-architect and simplify an essential component in our content pipeline — delivering a large amount of business value in the process.

The Context

Each movie and show on the Netflix service is carefully curated to ensure an optimal viewing experience. The team responsible for this curation is Title Operations. Title Operations will confirm, among other things:

  • We are in compliance with the contracts — date ranges and places where we can show a video are set up correctly for each title
  • Videos with captions, subtitles, and secondary audio “dub” assets are sourced, translated, and made available to the right populations around the world
  • Title name and synopsis are available and translated
  • The appropriate maturity ratings are available for each country

When a title meets all of the minimum above requirements, then it is allowed to go live on the service. Gatekeeper is the system at Netflix responsible for evaluating the “liveness” of videos and assets on the site. A title doesn’t become visible to members until Gatekeeper approves it — and if it can’t validate the setup, then it will assist Title Operations by pointing out what’s missing from the baseline customer experience.

Gatekeeper accomplishes its prescribed task by aggregating data from multiple upstream systems, applying some business logic, then producing an output detailing the status of each video in each country.

The Tech

Hollow, an OSS technology we released a few years ago, has been best described as a total high-density near cache:

  • Total: The entire dataset is cached on each node — there is no eviction policy, and there are no cache misses.
  • High-Density: encoding, bit-packing, and deduplication techniques are employed to optimize the memory footprint of the dataset.
  • Near: the cache exists in RAM on any instance which requires access to the dataset.

One exciting thing about the total nature of this technology — because we don’t have to worry about swapping records in-and-out of memory, we can make assumptions and do some precomputation of the in-memory representation of the dataset which would not otherwise be possible. The net result is, for many datasets, vastly more efficient use of RAM. Whereas with a traditional partial-cache solution you may wonder whether you can get away with caching only 5% of the dataset, or if you need to reserve enough space for 10% in order to get an acceptable hit/miss ratio — with the same amount of memory Hollow may be able to cache 100% of your dataset and achieve a 100% hit rate.

And obviously, if you get a 100% hit rate, you eliminate all I/O required to access your data — and can achieve orders of magnitude more efficient data access, which opens up many possibilities.

The Status-Quo

Until very recently, Gatekeeper was a completely event-driven system. When a change for a video occurred in any one of its upstream systems, that system would send an event to Gatekeeper. Gatekeeper would react to that event by reaching into each of its upstream services, gathering the necessary input data to evaluate the liveness of the video and its associated assets. It would then produce a single-record output detailing the status of that single video.

Old Gatekeeper Architecture

This model had several problems associated with it:

  • This process was completely I/O bound and put a lot of load on upstream systems.
  • Consequently, these events would queue up throughout the day and cause processing delays, which meant that titles might not actually go live on time.
  • Worse, events would occasionally get missed, meaning titles wouldn’t go live at all until someone from Title Operations realized there was a problem.

The mitigation for these issues was to “sweep” the catalog so videos matching specific criteria (e.g., scheduled to launch next week) would get events automatically injected into the processing queue. Unfortunately, this mitigation added many more events into the queue, which exacerbated the problem.

Clearly, a change in direction was necessary.

The Idea

We decided to employ a total high-density near cache (i.e., Hollow) to eliminate our I/O bottlenecks. For each of our upstream systems, we would create a Hollow dataset which encompasses all of the data necessary for Gatekeeper to perform its evaluation. Each upstream system would now be responsible for keeping its cache updated.

New Gatekeeper Architecture

With this model, liveness evaluation is conceptually separated from the data retrieval from upstream systems. Instead of reacting to events, Gatekeeper would continuously process liveness for all assets in all videos across all countries in a repeating cycle. The cycle iterates over every video available at Netflix, calculating liveness details for each of them. At the end of each cycle, it produces a complete output (also a Hollow dataset) representing the liveness status details of all videos in all countries.

We expected that this continuous processing model was possible because a complete removal of our I/O bottlenecks would mean that we should be able to operate orders of magnitude more efficiently. We also expected that by moving to this model, we would realize many positive effects for the business.

  • A definitive solution for the excess load on upstream systems generated by Gatekeeper
  • A complete elimination of liveness processing delays and missed go-live dates.
  • A reduction in the time the Content Setup Engineering team spends on performance-related issues.
  • Improved debuggability and visibility into liveness processing.

The Problem

Hollow can also be thought of like a time machine. As a dataset changes over time, it communicates those changes to consumers by breaking the timeline down into a series of discrete data states. Each data state represents a snapshot of the entire dataset at a specific moment in time.

Hollow is like a time machine

Usually, consumers of a Hollow dataset are loading the latest data state and keeping their cache updated as new states are produced. However, they may instead point to a prior state — which will revert their view of the entire dataset to a point in the past.

The traditional method of producing data states is to maintain a single producer which runs a repeating cycle. During that cycle, the producer iterates over all records from the source of truth. As it iterates, it adds each record to the Hollow library. Hollow then calculates the differences between the data added during this cycle and the data added during the last cycle, then publishes the state to a location known to consumers.

Traditional Hollow usage

The problem with this total-source-of-truth iteration model is that it can take a long time. In the case of some of our upstream systems, this could take hours. This data-propagation latency was unacceptable — we can’t wait hours for liveness processing if, for example, Title Operations adds a rating to a movie that needs to go live imminently.

The Improvement

What we needed was a faster time machine — one which could produce states with a more frequent cadence, so that changes could be more quickly realized by consumers.

Incremental Hollow is like a faster time machine

To achieve this, we created an incremental Hollow infrastructure for Netflix, leveraging work which had been done in the Hollow library earlier, and pioneered in production usage by the Streaming Platform Team at Target (and is now a public non-beta API).

With this infrastructure, each time a change is detected in a source application, the updated record is encoded and emitted to a Kafka topic. A new component that is not part of the source application, the Hollow Incremental Producer service, performs a repeating cycle at a predefined cadence. During each cycle, it reads all messages which have been added to the topic since the last cycle and mutates the Hollow state engine to reflect the new state of the updated records.

If a message from the Kafka topic contains the exact same data as already reflected in the Hollow dataset, no action is taken.

Hollow Incremental Producer Service

To mitigate issues arising from missed events, we implement a sweep mechanism that periodically iterates over an entire source dataset. As it iterates, it emits the content of each record to the Kafka topic. In this way, any updates which may have been missed will eventually be reflected in the Hollow dataset. Additionally, because this is not the primary mechanism by which updates are propagated to the Hollow dataset, this does not have to be run as quickly or frequently as a cycle must iterate the source in traditional Hollow usage.

The Hollow Incremental Producer is capable of reading a great many messages from the Kafka topic and mutating its Hollow state internally very quickly — so we can configure its cycle times to be very short (we are currently defaulting this to 30 seconds).

This is how we built a faster time machine. Now, if Title Operations adds a maturity rating to a movie, within 30 seconds, that data is available in the corresponding Hollow dataset.

The Tangible Result

With the data propagation latency issue solved, we were able to re-implement the Gatekeeper system to eliminate all I/O boundaries. With the prior implementation of Gatekeeper, re-evaluating all assets for all videos in all countries would have been unthinkable — it would tie up the entire content pipeline for more than a week (and we would then still be behind by a week since nothing else could be processed in the meantime). Now we re-evaluate everything in about 30 seconds — and we do that every minute.

There is no such thing as a missed or delayed liveness evaluation any longer, and the disablement of the prior Gatekeeper system reduced the load on our upstream systems — in some cases by up to 80%.

Load reduction on one upstream system

In addition to these performance benefits, we also get a resiliency benefit. In the prior Gatekeeper system, if one of the upstream services went down, we were unable to evaluate liveness at all, because we were unable to retrieve any data from that system. In the new implementation, if one of the upstream systems goes down it stops publishing updates, but we can still gate using the (now stale) data from its corresponding dataset while all the others keep making progress. So, for example, if the translated synopsis system goes down, we can still bring a movie on-site in a region if it was held back for, and then receives, the correct subtitles.

The Intangible Result

Perhaps even more beneficial than the performance gains has been the improvement in our development velocity in this system. We can now develop, validate, and release changes in minutes which might have before taken days or weeks — and we can do so with significantly increased release quality.

The time-machine aspect of Hollow means that every deterministic process which uses Hollow exclusively as input data is 100% reproducible. For Gatekeeper, this means that an exact replay of what happened at time X can be accomplished by reverting all of our input states to time X, then re-evaluating everything again.

We use this fact to iterate quickly on changes to the Gatekeeper business logic. We maintain a PREPROD Gatekeeper instance which “follows” our PROD Gatekeeper instance. PREPROD is also continuously evaluating liveness for the entire catalog, but publishing its output to a different Hollow dataset. At the beginning of each cycle, the PREPROD environment will gather the latest produced state from PROD, and set each of its input datasets to the exact same versions which were used to produce the PROD output.

The PREPROD Gatekeeper instance “follows” the PROD instance

When we want to make a change to the Gatekeeper business logic, we do so and then publish it to our PREPROD cluster. The subsequent output state from PREPROD can be diffed with its corresponding output state from PROD to view the precise effect that the logic change will cause. In this way, at a glance, we can validate that our changes have precisely the intended effect, and zero unintended consequences.

A Hollow diff shows exactly what changes

This, coupled with some iteration on the deployment process, has resulted in the ability for our team to code, validate, and deploy impactful changes to Gatekeeper in literally minutes — at least an order of magnitude faster than in the prior system — and we can do so with a higher level of safety than was possible in the previous architecture.

Conclusion

This new implementation of the Gatekeeper system opens up opportunities to capture additional business value, which we plan to pursue over the coming quarters. Additionally, this is a pattern that can be replicated to other systems within the Content Engineering space and elsewhere at Netflix — already a couple of follow-up projects have been launched to formalize and capitalize on the benefits of this n-hollow-input, one-hollow-output architecture.

Content Setup Engineering is an exciting space right now, especially as we scale up our pipeline to produce more content with each passing quarter. We have many opportunities to solve real problems and provide massive value to the business — and to do so with a deep focus on computer science, using and often pioneering leading-edge technologies. If this kind of work sounds appealing to you, reach out to Ivan to get the ball rolling.


Re-Architecting the Video Gatekeeper was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join Cloudflare & Moz at our next meetup, Serverless in Seattle!

Post Syndicated from Giuliana DeAngelis original https://blog.cloudflare.com/join-cloudflare-moz-at-our-next-meetup-serverless-in-seattle/

Join Cloudflare & Moz at our next meetup, Serverless in Seattle!
Photo by oakie / Unsplash

Join Cloudflare & Moz at our next meetup, Serverless in Seattle!

Cloudflare is organizing a meetup in Seattle on Tuesday, June 25th and we hope you can join. We’ll be bringing together members of the developers community and Cloudflare users for an evening of discussion about serverless compute and the infinite number of use cases for deploying code at the edge.

To kick things off, our guest speaker Devin Ellis will share how Moz uses Cloudflare Workers to reduce time to first byte 30-70% by caching dynamic content at the edge. Kirk Schwenkler, Solutions Engineering Lead at Cloudflare, will facilitate this discussion and share his perspective on how to grow and secure businesses at scale.

Next up, Developer Advocate Kristian Freeman will take you through a live demo of Workers and highlight new features of the platform. This will be an interactive session where you can try out Workers for free and develop your own applications using our new command-line tool.

Food and drinks will be served til close so grab your laptop and a friend and come on by!

View Event Details & Register Here

Agenda:

  • 5:00 pm Doors open, food and drinks
  • 5:30 pm Customer use case by Devin and Kirk
  • 6:00 pm Workers deep dive with Kristian
  • 6:30 – 8:30 pm Networking, food and drinks

Inside the Entropy

Post Syndicated from Alex Davidson original https://blog.cloudflare.com/inside-the-entropy/

Inside the Entropy

Inside the Entropy

Randomness, randomness everywhere;
Nor any verifiable entropy.

Generating random outcomes is an essential part of everyday life; from lottery drawings and constructing competitions, to performing deep cryptographic computations. To use randomness, we must have some way to ‘sample’ it. This requires interpreting some natural phenomenon (such as a fair dice roll) as an event that generates some random output. From a computing perspective, we interpret random outputs as bytes that we can then use in algorithms (such as drawing a lottery) to achieve the functionality that we want.

The sampling of randomness securely and efficiently is a critical component of all modern computing systems. For example, nearly all public-key cryptography relies on the fact that algorithms can be seeded with bytes generated from genuinely random outcomes.

In scientific experiments, random sampling of results is necessary to ensure that data collection measurements are not skewed. Until now, generating random outputs in a way that lets us verify that they are indeed random has been very difficult, typically involving a variety of statistical measurements.

Inside the Entropy

During Crypto week, Cloudflare is releasing a new public randomness beacon as part of the launch of the League of Entropy. The League of Entropy is a network of beacons that produces distributed, publicly verifiable random outputs for use in applications where the nature of the randomness must be publicly audited. The underlying cryptographic architecture is based on the drand project.

Verifiable randomness is essential for ensuring trust in various institutional decision-making processes such as elections and lotteries. There are also cryptographic applications that require verifiable randomness. In the land of decentralized consensus mechanisms, the DFINITY approach uses random seeds to decide the outcome of leadership elections. In this setting, it is essential that the randomness is publicly verifiable so that the outcome of the leadership election is trustworthy. Such a situation arises more generally in Sortitions: an election where leaders are selected as a random individual (or subset of individuals) from a larger set.

In this blog post, we will give a technical overview behind the cryptography used in the distributed randomness beacon, and how it can be used to generate publicly verifiable randomness. We believe that distributed randomness beacons have a huge amount of utility in realizing the Internet of the Future; where we will be able to rely on distributed, decentralized solutions to problems of a global-scale.

Randomness & entropy

A source of randomness is measured in terms of the amount of entropy it provides. Think about the entropy provided by a random output as a score to indicate how “random” the output actually is. The notion of information entropy was concretised by the famous scientist Claude Shannon in his paper A Mathematical Theory of Communication, and is sometimes known as Shannon Entropy.

A common way to think about random outputs is as a sequence of bits derived from some random outcome. For the sake of argument, consider a fair 8-sided dice roll with sides marked 0-7. The outputs of the dice can be written as the bit-strings 000,001,010,...,111. Since the dice is fair, any of these outputs is equally likely. This means that each of the bits is equally likely to be 0 or 1. Consequently, interpreting the output of the dice roll as a random output derives randomness with 3 bits of entropy.

More generally, if a perfect source of randomness guarantees strings with n bits of entropy, then it generates bit-strings where each bit is equally likely to be 0 or 1. This allows us to predict the value of any bit with maximum probability 1/2. If the outputs are sampled from such a perfect source, we consider them uniformly distributed. If we sample the outputs from a source where one bit is predictable with higher probability, then the string has n-1 bits of entropy. To go back to the dice analogy, rolling a 6-sided dice provides less than 3 bits of entropy, because the possible outputs are 000,001,010,011,100,101, and so the two higher-order bits are more likely to be set to 0 than to 1.
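The entropy figures in the dice examples can be computed directly from Shannon's formula, H = -sum(p * log2(p)); a quick sketch in Python:

import math

def shannon_entropy(probabilities):
    # H = -sum(p * log2(p)) over all outcomes with non-zero probability
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair 8-sided dice: 8 equally likely outcomes -> exactly 3 bits of entropy.
print(shannon_entropy([1 / 8] * 8))   # 3.0

# A fair 6-sided dice: 6 equally likely outcomes -> about 2.585 bits,
# less than the 3 bits its 3-bit encoding could in principle carry.
print(shannon_entropy([1 / 6] * 6))   # 2.584962500721156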

It is possible to mix entropy sources using specifically designed mixing functions to obtain something with even greater entropy. The maximum resulting entropy is the sum of the entropy provided by each of the sources used as input.

Inside the Entropy

Sampling randomness

To sample randomness, let’s first identify the appropriate sources. There are many natural phenomena that one can use:

  • atmospheric noise;
  • radioactive decay;
  • turbulent motion; like that generated in Cloudflare’s wall of lava lamps(!).

Inside the Entropy

Unfortunately, these phenomena require very specific measuring tools, which are prohibitively expensive to install in mainstream consumer electronics. As such, most personal computing devices use external usage characteristics to seed specific generator functions that output randomness as and when the system requires it. These characteristics include keyboard typing patterns and speed, and mouse movements – since such usage patterns depend on the human user, they are assumed to provide sufficient entropy as a randomness source. An example of a random number generator that gathers entropy from these kinds of events is the Linux kernel's /dev/random device (by contrast, the RDRAND instruction found on modern CPUs draws from an on-chip hardware entropy source).

Naturally, it is difficult to tell whether a system is actually returning random outputs by only inspecting the outputs. There are statistical tests that detect whether a series of outputs is not uniformly distributed, but these tests cannot ensure that they are unpredictable. This means that it is hard to detect if a given system has had its randomness generation compromised.

Distributed randomness

It’s clear we need alternative methods for sampling randomness so that we can provide guarantees that trusted mechanisms, such as elections and lotteries, take place in secure tamper-resistant environments. The drand project was started by researchers at EPFL to address this problem. The drand charter is to provide an easily configurable randomness beacon running at geographically distributed locations around the world. The intention is for each of these beacons to generate portions of randomness that can be combined into a single random string that is publicly verifiable.

This functionality is achieved using threshold cryptography. Threshold cryptography seeks to derive solutions for standard cryptographic problems by combining information from multiple distributed entities. The notion of the threshold means that if there are n entities, then any t of the entities can combine to construct some cryptographic object (like a ciphertext, or a digital signature). These threshold systems are characterised by a setup phase, where each entity learns a share of data. They will later use this share of data to create a combined cryptographic object with a subset of the other entities.

Threshold randomness

In the case of a distributed randomness protocol, there are n randomness beacons that broadcast random values sampled from their initial data share, and the current state of the system. This data share is created during a trusted setup phase, and also takes in some internal random value that is generated by the beacon itself.

When a user needs randomness, they send requests to some number t of beacons, where t < n, and combine these values using a specific procedure. The result is a random value that can be verified and used for public auditing mechanisms.

Inside the Entropy

Consider what happens if some proportion c/n of the randomness beacons are corrupted at any one time. The nature of a threshold cryptographic system is that, as long as c < t, then the end result still remains random.

Inside the Entropy

If c reaches t, then the random values produced by the system become predictable and the notion of randomness is lost. In summary, the distributed randomness procedure provides verifiably random outputs with sufficient entropy only when c < t.

By distributing the beacons independent of each other and in geographically disparate locations, the probability that t locations can be corrupted at any one time is extremely low. The minimum choice of t is equal to n/2.

How does it actually work?

What we described above sounds a bit like magic™. Even if c = t-1, we can still ensure that the output is indeed random and unpredictable! To make it clearer how this works, let’s dive a bit deeper into the underlying cryptography.

Two core components of drand are: a distributed key generation (DKG) procedure, and a threshold signature scheme. These core components are used in setup and randomness generation procedures, respectively. In just a bit, we’ll outline how drand uses these components (without navigating too deeply into the onerous mathematics).

Distributed key generation

At a high-level, the DKG procedure creates a distributed secret key that is formed of n different key pairs (vk_i, sk_i), each one being held by the entity i in the system. These key pairs will eventually be used to instantiate a (t,n)-threshold signature scheme (we will discuss this more later). In essence, t of the entities will be able to combine to construct a valid signature on any message.

To think about how this might work, consider a distributed key generation scheme that creates n distributed keys that are going to be represented by pizzas. Each pizza is split into n slices and one slice from each is secretly passed to one of the participants. Each entity receives one slice from each of the different pizzas (n in total) and combines these slices to form their own pizza. Each combined pizza is unique and secret for each entity, representing their own key pair.

Inside the Entropy

Inside the Entropy

Mathematical intuition

Mathematically speaking, and rather than thinking about pizzas, we can describe the underlying phenomenon by reconstructing lines or curves on a graph. We can take two coordinates on an (x,y) plane and immediately (and uniquely) define a line with the equation y = ax+b. For example, the points (2,3) and (4,7) define a line with gradient (7-3)/(4-2) = 2, so a=2. You can then derive the coefficient b = -1 by substituting either of the coordinates into the equation y = 2x + b. By uniquely, we mean that only the line y = 2x - 1 passes through the two chosen coordinates; no other choice of a or b fits.
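A two-line check of that arithmetic in Python (just recomputing the slope and intercept from the two points in the example):

(x1, y1), (x2, y2) = (2, 3), (4, 7)
a = (y2 - y1) / (x2 - x1)   # (7 - 3) / (4 - 2) = 2.0
b = y1 - a * x1             # 3 - 2*2 = -1.0
print(a, b)                 # 2.0 -1.0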

Inside the Entropy

The curve y = ax+b has degree 1, where the degree of the equation refers to the highest power of the unknown variable appearing in it. That might seem like mathematical jargon, but the equation above contains only one term, ax, that depends on the unknown variable x. In this term, the exponent (or power) of x is 1, and so the degree of the entire equation is also 1.

Likewise, by taking three coordinate pairs in the same plane, we uniquely define a quadratic curve, with an equation of the form y = ax^2 + bx + c and the coefficients a, b, c uniquely determined by the chosen coordinates. The process is a bit more involved than in the case above, but it essentially starts in the same way, using three coordinate pairs (x_1, y_1), (x_2, y_2) and (x_3, y_3).

Inside the Entropy

By a quadratic curve, we mean a curve of degree 2. We can see that this curve has degree 2 because it contains two terms ax^2 and bx that depend on x. The highest order term is the ax^2 term with an exponent of 2, so this curve has degree 2 (ignore the term bx which has a smaller power).

What we are ultimately trying to show is that this approach scales to curves of degree n (of the form y = a_n x^n + … + a_1 x + a_0). So, if we take n+1 coordinates on the (x,y) plane, then we can uniquely reconstruct a curve of this form. Such degree-n equations are also known as polynomials of degree n.

In order to generalise the approach to arbitrary degrees, we need some kind of formula. This formula should take n+1 pairs of coordinates and return a polynomial of degree n. Fortunately, such a formula exists without us having to derive it ourselves: it is known as the Lagrange interpolation polynomial. Using the formula in the link, we can reconstruct any degree-n polynomial from n+1 unique pairs of coordinates.

Going back to pizzas temporarily, it will become clear in the next section how this Lagrange interpolation procedure essentially describes the dissemination of one slice (corresponding to (x,y) coordinates) taken from a single pizza (the entire n-1 degree polynomial) among n participants. Running this procedure n times in parallel allows each entity to construct their entire pizza (or the eventual key pair).

Back to key generation

Intuitively, in the DKG procedure we want to distribute n key pairs among n participants. This effectively means running n parallel instances of a t-out-of-n Shamir Secret Sharing scheme. This secret sharing scheme is built entirely upon the polynomial interpolation technique that we described above.

In a single instance, we take the secret key to be the first coefficient of a polynomial of degree t-1 and the public key is a published value that depends on this secret key, but does not reveal the actual coefficient. Think of RSA, where we have a number N = pq for secret large prime numbers p,q, where N is public but does not reveal the actual factorisation. Notice that if the polynomial is reconstructed using the interpolation technique above, then we immediately learn the secret key, because the first coefficient will be made explicit.

Each secret sharing scheme publishes shares, where each share is a different evaluation of the polynomial (dependent on the entity i receiving the key share). These evaluations are essentially coordinates on the (x,y) plane.
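As a toy illustration of "shares are evaluations of a polynomial", here is a sketch in Python of the textbook Shamir construction over a small prime field. It is not drand's implementation (drand works with elliptic-curve groups and a DKG rather than a single dealer), and the field size and parameters are chosen purely for readability:

import random

P = 2**31 - 1   # a small prime field, for illustration only

def make_shares(secret: int, t: int, n: int):
    # Hide `secret` as the constant term of a random degree t-1 polynomial
    # and hand out n evaluations of it: the shares (x_i, f(x_i)).
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term (the secret).
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = make_shares(secret=123456789, t=3, n=5)
print(reconstruct(shares[:3]))    # any 3 of the 5 shares recover 123456789
print(reconstruct(shares[1:4]))   # ...regardless of which 3 are chosen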

Inside the Entropy

By running n parallel instances of the secret sharing scheme, each entity receives n shares and then combines all of these to form their overall key pair (vk_i, sk_i).

The DKG procedure uses n parallel secret sharing procedures along with Pedersen commitments to distribute the key pairs. We explain in the next section how this procedure is part of the procedure for provisioning randomness beacons.

In summary, it is important to remember that each party in the DKG protocol generates a random secret key from the n shares that they receive, and they compute the corresponding public key from this. We will now explain how each entity uses this key pair to perform the cryptographic procedure that is used by the drand protocol.

Threshold signature scheme

Remember: a standard signature scheme considers a key-pair (vk,sk), where vk is a public verification key and sk is a private signing key. So, messages m signed with sk can be verified with vk. The security of the scheme ensures that it is difficult for anybody who does not hold sk to compute a valid signature for any message m.

A threshold signature scheme allows a set of users holding distributed key-pairs (vk_i,sk_i) to compute intermediate signatures u_i on a given message m.

Given knowledge of some number t of intermediate signatures u_i, a valid signature u on the message m can be reconstructed under the combined secret key sk. The public key vk can also be inferred using knowledge of the public keys vk_i, and then this public key can be used to verify u.

Again, think back to reconstructing the degree t-1 curves on graphs from t known coordinates. In this case, the coordinates correspond to the intermediate signatures u_i, and the signature u corresponds to the entire curve. For the actual signature schemes, the mathematics is much more involved than in the DKG procedure, but the principle is the same.

Inside the Entropy

drand protocol

The n beacons that will take part in the drand project are identified. In the trusted setup phase, the DKG protocol from above is run, and each beacon effectively creates a key pair (vk_i, sk_i) for a threshold signature scheme. In other words, this key pair will be able to generate intermediate signatures that can be combined to create an entire signature for the system.

Inside the Entropy

For each round (occurring once a minute, for example), the beacons agree on a signature u evaluated over a message containing the previous round’s signature and the current round’s number. This signature u is the result of combining the intermediate signatures u_i over the same message. Each intermediate signature u_i is created by each of the beacons using their secret sk_i.

Once this aggregation completes, each beacon displays the signature for the current round, along with the previous signature and round number. This allows any client to publicly verify the signature over this data and confirm that the beacons aggregated honestly. This provides a chain of verifiable signatures, extending back to the first round of output. In addition, there are threshold signature schemes whose output signatures are indistinguishable from random sequences of bytes. Therefore, these signatures can be used directly as verifiable randomness for the applications we discussed previously.

What does drand use?

To instantiate the required threshold signature scheme, drand uses the (t,n)-BLS signature scheme of Boneh, Lynn and Shacham. In particular, we can instantiate this scheme in the elliptic curve setting using Barreto-Naehrig curves. Moreover, the BLS signature scheme outputs sufficiently large signatures that are randomly distributed, giving them enough entropy to be sources of randomness. Specifically, the signatures are randomly distributed over 64 bytes.

BLS signatures use a specific form of mathematical operation known as a cryptographic pairing. Pairings can be computed in certain elliptic curves, including the Barreto-Naehrig curve configurations. A detailed description of pairing operations is beyond the scope of this blog post; though it is important to remember that these operations are integral to how BLS signatures work.

Concretely speaking, all drand cryptographic operations are carried out using a library built on top of Cloudflare’s implementation of the bn256 curve. The Pedersen DKG protocol follows the design of Gennaro et al.

How does it work?

The randomness beacons are synchronised in rounds. At each round, a beacon produces a new signature u_i using its private key sk_i on the previous signature generated and the round ID. These signatures are usually broadcast on the URL drand.<host>.com/api/public. These signatures can be verified using the keys vk_i and over the same data that is signed. By signing the previous signature and the current round identifier, this establishes a chain of trust for the randomness beacon that can be traced back to the original signature value.

The randomness can be retrieved by combining the signatures from each of the beacons using the threshold property of the scheme. This reconstruction of the signature u from each intermediate signature u_i is done internally by the League of Entropy nodes. Each beacon broadcasts the entire signature u, that can be accessed over the HTTP endpoint above.

Cloudflare’s drand beacon

As we mentioned at the start of this blog post, Cloudflare has launched our distributed randomness beacon. This beacon is part of a network of beacons from different institutions around the globe that form the League of Entropy.

The Cloudflare beacon uses LavaRand as its internal source of randomness for the DKG. Other League of Entropy drand beacons have their own sources of randomness.

Give me randomness!

The Cloudflare randomness beacon allows you to retrieve the latest random value from the League of Entropy using a simple HTTP request:

curl https://drand.cloudflare.com/api/public

The response is a JSON blob of the form:

{
    "round": 7,
    "previous": <hex-encoded-previous-signature>,
    "randomness": {
        "gid": 21,
        "point": <hex-encoded-new-signature>
    }
}

where randomness.point is the signature u aggregated among the entire set of beacons.

The signature is computed over a message comprising the previous round’s signature, previous, and the current round number, round, under the aggregated secret key of the system. This signature can be verified using the overall public key vk of the Cloudflare beacon, which can be learned using another HTTP request:

curl https://drand.cloudflare.com/api/public

There are eight collaborators in the League of Entropy. You can learn the current round of randomness (or the system’s public key) by querying these beacons on the HTTP endpoints listed above.
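A minimal sketch of consuming the beacon programmatically, using the same endpoint as the curl command above (the third-party requests library is assumed, and this only fetches the value; it does not verify the signature chain):

import requests

resp = requests.get("https://drand.cloudflare.com/api/public", timeout=10)
resp.raise_for_status()
data = resp.json()

print("round:     ", data["round"])
print("randomness:", data["randomness"]["point"])   # hex-encoded aggregated signature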

Randomness & the future

Cloudflare will continue to take an active role in the drand project, both as a contributor and by running a randomness beacon with the League of Entropy. The League of Entropy is a worldwide joint effort of individuals and academic institutions. We at Cloudflare believe it can help us realize the mission of helping Build a Better Internet. For more information on Cloudflare’s participation in the League of Entropy, visit https://cloudflare.com/drand or read Dina’s blog post.

Cloudflare would like to thank all of their collaborators in the League of Entropy; from EPFL, UChile, Kudelski Security and Protocol Labs. This work would not have been possible without the work of those who contributed to the open-source drand project. We would also like to thank and appreciate the work of Gabbi Fisher, Brendan McMillion, and Mahrud Sayrafi in launching the Cloudflare randomness beacon.

Building a To-Do List with Workers and KV

Post Syndicated from Kristian Freeman original https://blog.cloudflare.com/building-a-to-do-list-with-workers-and-kv/

Building a To-Do List with Workers and KV

Building a To-Do List with Workers and KV

In this tutorial, we’ll build a todo list application in HTML, CSS and JavaScript, with a twist: all the data should be stored inside of the newly-launched Workers KV, and the application itself should be served directly from Cloudflare’s edge network, using Cloudflare Workers.

To start, let’s break this project down into a couple different discrete steps. In particular, it can help to focus on the constraint of working with Workers KV, as handling data is generally the most complex part of building an application:

  1. Build a todos data structure
  2. Write the todos into Workers KV
  3. Retrieve the todos from Workers KV
  4. Return an HTML page to the client, including the todos (if they exist)
  5. Allow creation of new todos in the UI
  6. Allow completion of todos in the UI
  7. Handle todo updates

This task order is pretty convenient, because it’s almost perfectly split into two parts: first, understanding the Cloudflare/API-level things we need to know about Workers and KV, and second, actually building up a user interface to work with the data.

Understanding Workers

In terms of implementation, a great deal of this project is centered around KV – although that may be the case, it’s useful to break down what Workers are exactly.

Service Workers are background scripts that run in your browser, alongside your application. Cloudflare Workers are the same concept, but super-powered: your Worker scripts run on Cloudflare’s edge network, in-between your application and the client’s browser. This opens up a huge amount of opportunity for interesting integrations, especially considering the network’s massive scale around the world. Here’s some of the use-cases that I think are the most interesting:

  1. Custom security/filter rules to block bad actors before they ever reach the origin
  2. Replacing/augmenting your website’s content based on the request content (i.e. user agents and other headers)
  3. Caching requests to improve performance, or using Cloudflare KV to optimize high-read tasks in your application
  4. Building an application directly on the edge, removing the dependence on origin servers entirely

For this project, we’ll lean heavily towards the latter end of that list, building an application that clients communicate with, served on Cloudflare’s edge network. This means that it’ll be globally available, with low-latency, while still allowing the ease-of-use in building applications directly in JavaScript.

Setting up a canvas

To start, I wanted to approach this project from the bare minimum: no frameworks, JS utilities, or anything like that. In particular, I was most interested in writing a project from scratch and serving it directly from the edge. Normally, I would deploy a site to something like GitHub Pages, but avoiding the need for an origin server altogether seems like a really powerful (and performant) idea – let’s try it!

I also considered using TodoMVC as the blueprint for building the functionality for the application, but even the Vanilla JS version is a pretty impressive amount of code, including a number of Node packages – it wasn’t exactly a concise chunk of code to just dump into the Worker itself.

Instead, I decided to approach the beginnings of this project by building a simple, blank HTML page, and including it inside of the Worker. To start, we’ll sketch something out locally, like this:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <title>Todos</title>
  </head>
  <body>
    <h1>Todos</h1>
  </body>
</html>

Hold on to this code – we’ll add it later, inside of the Workers script. For the purposes of the tutorial, I’ll be serving up this project at todo.kristianfreeman.com. My personal website was already hosted on Cloudflare, and since I’ll be serving this project from a subdomain of it, it was time to create my first Worker.

Creating a worker

Inside of my Cloudflare account, I hopped into the Workers tab and launched the Workers editor.

This is one of my favorite features of the editor – working with your actual website, understanding how the worker will interface with your existing project.

Building a To-Do List with Workers and KV

The process of writing a Worker should be familiar to anyone who’s used the fetch library before. In short, the default code for a Worker hooks into the fetch event, passing the request of that event into a custom function, handleRequest:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

Within handleRequest, we make the actual request, using fetch, and return the response to the client. In short, we have a place to intercept the response body, but by default, we let it pass-through:

async function handleRequest(request) {
  console.log('Got request', request)
  const response = await fetch(request)
  console.log('Got response', response)
  return response
}

So, given this, where do we begin actually doing stuff with our worker?

Unlike the default code given to you in the Workers interface, we want to skip fetching the incoming request: instead, we’ll construct a new Response, and serve it directly from the edge:

async function handleRequest(request) {
  const response = new Response("Hello!")
  return response
}

Given the very small bit of functionality we’ve added to the worker, let’s deploy it. Moving into the “Routes” tab of the Worker editor, I added the route https://todo.kristianfreeman.com/* and attached it to the cloudflare-worker-todos script.

Once attached, I deployed the worker, and voila! Visiting todo.kristianfreeman.com in-browser gives me my simple “Hello!” response back.

Writing data to KV

The next step is to populate our todo list with actual data. To do this, we’ll make use of Cloudflare’s Workers KV – it’s a simple key-value store that you can access inside of your Worker script to read (and write, although it’s less common) data.

To get started with KV, we need to set up a “namespace”. All of our cached data will be stored inside that namespace, and given just a bit of configuration, we can access that namespace inside the script with a predefined variable.

I’ll create a new namespace called KRISTIAN_TODOS, and in the Worker editor, I’ll expose the namespace by binding it to the variable KRISTIAN_TODOS.

Given the presence of KRISTIAN_TODOS in my script, it’s time to understand the KV API. At the time of writing, a KV namespace has three primary methods you can use to interface with your cache: get, put, and delete. Pretty straightforward!
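
For reference, here’s roughly what those calls look like from inside a Worker – each returns a Promise, so they’re awaited (KRISTIAN_TODOS is the namespace binding described above):

// Store a string value under a key.
await KRISTIAN_TODOS.put("data", JSON.stringify({ todos: [] }))

// Read it back – get returns null if the key doesn't exist.
const raw = await KRISTIAN_TODOS.get("data")

// Remove the key entirely.
await KRISTIAN_TODOS.delete("data")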

Let’s start storing data by defining an initial set of data, which we’ll put inside of the cache using the put method. I’ve opted to define an object, defaultData, instead of a simple array of todos: we may want to store metadata and other information inside of this cache object later on. Given that data object, I’ll use JSON.stringify to put a simple string into the cache:

async function handleRequest(request) {
  // ...previous code
  
  const defaultData = { 
    todos: [
      {
        id: 1,
        name: 'Finish the Cloudflare Workers blog post',
        completed: false
      }
    ] 
  }
  KRISTIAN_TODOS.put("data", JSON.stringify(defaultData))
}

The Workers KV data store is eventually consistent: writing to the cache means that it will become available eventually, but it’s possible to attempt to read a value back from the cache immediately after writing it, only to find that the cache hasn’t been updated yet.

Given the presence of data in the cache, and the assumption that our cache is eventually consistent, we should adjust this code slightly: first, we should actually read from the cache, parse the value back out, and use it as the data source if it exists. If it doesn’t, we’ll refer to defaultData, setting it as the data source for now (remember, it should be set in the future… eventually), while also setting it in the cache for future use. After breaking out the code into a few functions for simplicity, the result looks like this:

const defaultData = { 
  todos: [
    {
      id: 1,
      name: 'Finish the Cloudflare Workers blog post',
      completed: false
    }
  ] 
}

const setCache = data => KRISTIAN_TODOS.put("data", data)
const getCache = () => KRISTIAN_TODOS.get("data")

async function getTodos(request) {
  // ... previous code
  
  let data;
  const cache = await getCache()
  if (!cache) {
    await setCache(JSON.stringify(defaultData))
    data = defaultData
  } else {
    data = JSON.parse(cache)
  }
}

Rendering data from KV

Given the presence of data in our code, which is the cached data object for our application, we should actually take this data and make it available on screen.

In our Workers script, we’ll make a new variable, html, and use it to build up a static HTML template that we can serve to the client. In handleRequest, we can construct a new Response (with a Content-Type header of text/html), and serve it to the client:

const html = `
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <title>Todos</title>
  </head>
  <body>
    <h1>Todos</h1>
  </body>
</html>
`

async function handleRequest(request) {
  const response = new Response(html, {
    headers: { 'Content-Type': 'text/html' }
  })
  return response
}

We have a static HTML site being rendered, and now we can begin populating it with data! In the body, we’ll add a ul tag with an id of todos:

<body>
  <h1>Todos</h1>
  <ul id="todos"></ul>
</body>

Given that body, we can also add a script after the body that takes a todos array, loops through it, and for each todo in the array, creates a li element and appends it to the todos list:

<script>
  window.todos = [];
  var todoContainer = document.querySelector("#todos");
  window.todos.forEach(todo => {
    var el = document.createElement("li");
    el.innerText = todo.name;
    todoContainer.appendChild(el);
  });
</script>

Our static page can take in window.todos, and render HTML based on it, but we haven’t actually passed in any data from KV. To do this, we’ll need to make a couple changes.

First, our html variable will change to a function. The function will take in an argument, todos, which will populate the window.todos variable in the above code sample:

const html = todos => `
<!doctype html>
<html>
  <!-- ... -->
  <script>
    window.todos = ${todos || []}
    var todoContainer = document.querySelector("#todos");
    // ...
  </script>
</html>
`

In handleRequest, we can use the retrieved KV data to call the html function, and generate a Response based on it:

async function handleRequest(request) {
  let data;
  
  // Set data using cache or defaultData from previous section...
  
  const body = html(JSON.stringify(data.todos))
  const response = new Response(body, {
    headers: { 'Content-Type': 'text/html' }
  })
  return response
}

The finished product looks something like this:

[screenshot: the rendered todo list]

Adding todos from the UI

At this point, we’ve built a Cloudflare Worker that takes data from Cloudflare KV and renders a static page based on it. That static page reads the data, and generates a todo list based on that data. Of course, the piece we’re missing is creating todos, from inside the UI. We know that we can add todos using the KV API – we could simply update the cache by saying KRISTIAN_TODOS.put("data", newData), but how do we update it from inside the UI?

It’s worth noting here that Cloudflare’s Workers documentation suggests that any writes to your KV namespace happen via their API – that is, at its simplest form, a cURL statement:

curl "<https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/storage/kv/namespaces/$NAMESPACE_ID/values/first-key>" \
  -X PUT \
  -H "X-Auth-Email: $CLOUDFLARE_EMAIL" \
  -H "X-Auth-Key: $CLOUDFLARE_AUTH_KEY" \
  --data 'My first value!'

We’ll implement something similar by handling a second route in our worker, designed to watch for PUT requests to /. When a body is received at that URL, the worker will send the new todo data to our KV store.

I’ll add this new functionality to my worker, and in handleRequest, if the request method is a PUT, it will take the request body and update the cache:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

const putInCache = body => {
  const accountId = "$accountId"
  const namespaceId = "$namespaceId"
  const key = "data"
  return fetch(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/storage/kv/namespaces/${namespaceId}/values/${key}`,
    { 
      method: "PUT",
      body,
      headers: { 
        'X-Auth-Email': '$accountEmail',
        'X-Auth-Key': "$authKey"
      }
    }
  )
}

async function updateTodos(request) {
  const body = await request.text()
  try {
    JSON.parse(body)
    await putInCache(body)
    return new Response(body, { status: 200 })
  } catch (err) {
    return new Response(err, { status: 500 })
  }
}

async function handleRequest(request) {
  if (request.method === "PUT") {
    return updateTodos(request);
  } else {
    // Defined in previous code block
    return getTodos(request);
  }
}

The script is pretty straightforward – we check that the request is a PUT, and wrap the remainder of the code in a try/catch block. First, we parse the body of the request coming in, ensuring that it is JSON, before we update the cache with the new data, and return it to the user. If anything goes wrong, we simply return a 500. If the route is hit with an HTTP method other than PUT – that is, GET, DELETE, or anything else – we fall back to getTodos and serve the rendered todo list.

With this script, we can now add some “dynamic” functionality to our HTML page to actually hit this route.

First, we’ll create an input for our todo “name”, and a button for “submitting” the todo.

<div>
  <input type="text" name="name" placeholder="A new todo"></input>
  <button id="create">Create</button>
</div>

Given that input and button, we can add a corresponding JavaScript function to watch for clicks on the button – once the button is clicked, the browser will PUT to / and submit the todo.

var createTodo = function() {
  var input = document.querySelector("input[name=name]");
  if (input.value.length) {
    fetch("/", { 
      method: 'PUT', 
      body: JSON.stringify({ todos: todos }) 
    });
  }
};

document.querySelector("#create")
  .addEventListener('click', createTodo);

This code updates the cache, but what about our local UI? Remember that the KV cache is eventually consistent – even if we were to update our worker to read from the cache and return it, we have no guarantees it’ll actually be up-to-date. Instead, let’s just update the list of todos locally, by taking our original code for rendering the todo list, making it a re-usable function called populateTodos, and calling it when the page loads and when the cache request has finished:

var populateTodos = function() {
  var todoContainer = document.querySelector("#todos");
  todoContainer.innerHTML = null;
  window.todos.forEach(todo => {
    var el = document.createElement("li");
    el.innerText = todo.name;
    todoContainer.appendChild(el);
  });
};

populateTodos();

var createTodo = function() {
  var input = document.querySelector("input[name=name]");
  if (input.value.length) {
    todos = [].concat(todos, { 
      id: todos.length + 1, 
      name: input.value,
      completed: false,
    });
    fetch("/", { 
      method: 'PUT', 
      body: JSON.stringify({ todos: todos }) 
    });
    populateTodos();
    input.value = "";
  }
};

document.querySelector("#create")
  .addEventListener('click', createTodo);

With the client-side code in place, deploying the new Worker should put all these pieces together. The result is an actual dynamic todo list!

Updating todos from the UI

For the final piece of our (very) basic todo list, we need to be able to update todos – specifically, marking them as completed.

Luckily, a great deal of the infrastructure for this work is already in place. We can currently update the todo list data in our cache, as evidenced by our createTodo function. Performing updates on a todo, in fact, is much more of a client-side task than a Worker-side one!

To start, let’s update the client-side code for generating a todo. Instead of a ul-based list, we’ll migrate the todo container and the todos themselves to divs:

<!-- <ul id="todos"></ul> becomes... -->
<div id="todos"></div>

The populateTodos function can be updated to generate a div for each todo. In addition, we’ll move the name of the todo into a child element of that div:

var populateTodos = function() {
  var todoContainer = document.querySelector("#todos");
  todoContainer.innerHTML = null;
  window.todos.forEach(todo => {
    var el = document.createElement("div");
    var name = document.createElement("span");
    name.innerText = todo.name;
    el.appendChild(name);
    todoContainer.appendChild(el);
  });
}

So far, we’ve designed the client-side part of this code to take an array of todos in, and given that array, render out a list of simple HTML elements. There are a couple of things we’ve been carrying along without quite having a use for yet: specifically, the IDs on each todo, and the completed value. Luckily, these work well together to support actually updating todos in the UI.

To start, it would be useful to signify the ID of each todo in the HTML. By doing this, we can then refer to the element later, in order to correspond it to the todo in the JavaScript part of our code. Data attributes, and the corresponding dataset method in JavaScript, are a perfect way to implement this. When we generate our div element for each todo, we can simply attach a data attribute called todo to each div:

window.todos.forEach(todo => {
  var el = document.createElement("div");
  el.dataset.todo = todo.id
  // ... more setup

  todoContainer.appendChild(el);
});

Inside our HTML, each div for a todo now has an attached data attribute, which looks like:

<div data-todo="1"></div>
<div data-todo="2"></div>

Now we can generate a checkbox for each todo element. This checkbox will default to unchecked for new todos, of course, but we can mark it as checked as the element is rendered in the window:

window.todos.forEach(todo => {
  var el = document.createElement("div");
  el.dataset.todo = todo.id
  
  var name = document.createElement("span");
  name.innerText = todo.name;
  
  var checkbox = document.createElement("input")
  checkbox.type = "checkbox"
  checkbox.checked = todo.completed ? 1 : 0;

  el.appendChild(checkbox);
  el.appendChild(name);
  todoContainer.appendChild(el);
})

The checkbox is set up to correctly reflect the value of completed on each todo, but it doesn’t yet update when we actually check the box! To do this, we’ll add an event listener on the click event, calling completeTodo. Inside the function, we’ll inspect the checkbox element, finding its parent (the todo div), and using the todo data attribute on it to find the corresponding todo in our data. Given that todo, we can toggle the value of completed, update our data, and re-render the UI:

var completeTodo = function(evt) {
  var checkbox = evt.target;
  var todoElement = checkbox.parentNode;
  
  var newTodoSet = [].concat(window.todos)
  var todo = newTodoSet.find(t => 
    t.id == todoElement.dataset.todo
  );
  todo.completed = !todo.completed;
  window.todos = newTodoSet;
  updateTodos()
}
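
The updateTodos helper called above hasn’t been defined yet – it appears in the final script below, but it’s just a small function that PUTs the current list back to the Worker and then re-renders:

var updateTodos = function() {
  fetch("/", {
    method: 'PUT',
    body: JSON.stringify({ todos: window.todos })
  });
  populateTodos();
};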

The final result of our code is a system that simply checks the todos variable, updates our Cloudflare KV cache with that value, and then does a straightforward re-render of the UI based on the data it has locally.

Conclusions and next steps

With this, we’ve created a pretty remarkable project: an almost entirely static HTML/JS application, transparently powered by Cloudflare KV and Workers, served at the edge. There are a number of additions that could be made to this application, whether that’s a better design (I’ll leave this as an exercise for readers – you can see my version at todo.kristianfreeman.com), security, speed, or something else.

One interesting and fairly trivial addition is implementing per-user caching. Of course, right now, the cache key is simply “data”: anyone visiting the site will share a todo list with any other user. Because we have the request information inside of our worker, it’s easy to make this data user-specific. For instance, we can implement per-user caching by generating the cache key based on the requesting IP:

const ip = request.headers.get("CF-Connecting-IP")
const cacheKey = `data-${ip}`;
const getCache = key => KRISTIAN_TODOS.get(key)
getCache(cacheKey)

One more deploy of our Workers project, and we have a full todo list application, with per-user functionality, served at the edge!

The final version of our Workers script looks like this:

const html = todos => `
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <title>Todos</title>
    <link href="https://cdn.jsdelivr.net/npm/tailwindcss/dist/tailwind.min.css" rel="stylesheet"></link>
  </head>

  <body class="bg-blue-100">
    <div class="w-full h-full flex content-center justify-center mt-8">
      <div class="bg-white shadow-md rounded px-8 pt-6 py-8 mb-4">
        <h1 class="block text-grey-800 text-md font-bold mb-2">Todos</h1>
        <div class="flex">
          <input class="shadow appearance-none border rounded w-full py-2 px-3 text-grey-800 leading-tight focus:outline-none focus:shadow-outline" type="text" name="name" placeholder="A new todo"></input>
          <button class="bg-blue-500 hover:bg-blue-800 text-white font-bold ml-2 py-2 px-4 rounded focus:outline-none focus:shadow-outline" id="create" type="submit">Create</button>
        </div>
        <div class="mt-4" id="todos"></div>
      </div>
    </div>
  </body>

  <script>
    window.todos = ${todos || []}

    var updateTodos = function() {
      fetch("/", { method: 'PUT', body: JSON.stringify({ todos: window.todos }) })
      populateTodos()
    }

    var completeTodo = function(evt) {
      var checkbox = evt.target
      var todoElement = checkbox.parentNode
      var newTodoSet = [].concat(window.todos)
      var todo = newTodoSet.find(t => t.id == todoElement.dataset.todo)
      todo.completed = !todo.completed
      window.todos = newTodoSet
      updateTodos()
    }

    var populateTodos = function() {
      var todoContainer = document.querySelector("#todos")
      todoContainer.innerHTML = null

      window.todos.forEach(todo => {
        var el = document.createElement("div")
        el.className = "border-t py-4"
        el.dataset.todo = todo.id

        var name = document.createElement("span")
        name.className = todo.completed ? "line-through" : ""
        name.innerText = todo.name

        var checkbox = document.createElement("input")
        checkbox.className = "mx-4"
        checkbox.type = "checkbox"
        checkbox.checked = todo.completed ? 1 : 0
        checkbox.addEventListener('click', completeTodo)

        el.appendChild(checkbox)
        el.appendChild(name)
        todoContainer.appendChild(el)
      })
    }

    populateTodos()

    var createTodo = function() {
      var input = document.querySelector("input[name=name]")
      if (input.value.length) {
        window.todos = [].concat(todos, { id: window.todos.length + 1, name: input.value, completed: false })
        input.value = ""
        updateTodos()
      }
    }

    document.querySelector("#create").addEventListener('click', createTodo)
  </script>
</html>
`

const defaultData = { todos: [] }

const setCache = (key, data) => KRISTIAN_TODOS.put(key, data)
const getCache = key => KRISTIAN_TODOS.get(key)

async function getTodos(request) {
  const ip = request.headers.get('CF-Connecting-IP')
  const cacheKey = `data-${ip}`
  let data
  const cache = await getCache(cacheKey)
  if (!cache) {
    await setCache(cacheKey, JSON.stringify(defaultData))
    data = defaultData
  } else {
    data = JSON.parse(cache)
  }
  const body = html(JSON.stringify(data.todos || []))
  return new Response(body, {
    headers: { 'Content-Type': 'text/html' },
  })
}

const putInCache = (cacheKey, body) => {
  const accountId = '$accountId'
  const namespaceId = '$namespaceId'
  return fetch(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/storage/kv/namespaces/${namespaceId}/values/${cacheKey}`,
    {
      method: 'PUT',
      body,
      headers: {
        'X-Auth-Email': '$cloudflareEmail',
        'X-Auth-Key': '$cloudflareApiKey',
      },
    },
  )
}

async function updateTodos(request) {
  const body = await request.text()
  const ip = request.headers.get('CF-Connecting-IP')
  const cacheKey = `data-${ip}`
  try {
    JSON.parse(body)
    await putInCache(cacheKey, body)
    return new Response(body, { status: 200 })
  } catch (err) {
    return new Response(err, { status: 500 })
  }
}

async function handleRequest(request) {
  if (request.method === 'PUT') {
    return updateTodos(request)
  } else {
    return getTodos(request)
  }
}

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

You can find the source code for this project, as well as a README with deployment instructions, on GitHub.

Get ready to write — Workers KV is now in GA!

Post Syndicated from Ashcon Partovi original https://blog.cloudflare.com/workers-kv-is-ga/

Today, we’re excited to announce Workers KV is entering general availability and is ready for production use!

What is Workers KV?

Workers KV is a highly distributed, eventually consistent, key-value store that spans Cloudflare’s global edge. It allows you to store billions of key-value pairs and read them with ultra-low latency anywhere in the world. Now you can build entire applications with the performance of a CDN static cache.

Why did we build it?

Workers is a platform that lets you run JavaScript on Cloudflare’s global edge of 175+ data centers. With only a few lines of code, you can route HTTP requests, modify responses, or even create new responses without an origin server.

// A Worker that handles a single redirect,
// such a humble beginning...
addEventListener("fetch", event => {
  event.respondWith(handleOneRedirect(event.request))
})

async function handleOneRedirect(request) {
  let url = new URL(request.url)
  let device = request.headers.get("CF-Device-Type")
  // If the device is mobile, add a prefix to the hostname.
  // (eg. example.com becomes mobile.example.com)
  if (device === "mobile") {
    url.hostname = "mobile." + url.hostname
    return Response.redirect(url, 302)
  }
  // Otherwise, send request to the original hostname.
  return await fetch(request)
}

Customers quickly came to us with use cases that required a way to store persistent data. Following our example above, it’s easy to handle a single redirect, but what if you want to handle billions of them? You would have to hard-code them into your Workers script, fit it all in under 1 MB, and re-deploy it every time you wanted to make a change — yikes! That’s why we built Workers KV.

// A Worker that can handle billions of redirects,
// now that's more like it!
addEventListener("fetch", event => {
  event.respondWith(handleBillionsOfRedirects(event.request))
})

async function handleBillionsOfRedirects(request) {
  let prefix = "/redirect"
  let url = new URL(request.url)
  // Check if the URL is a special redirect.
  // (eg. example.com/redirect/<random-hash>)
  if (url.pathname.startsWith(prefix)) {
    // REDIRECTS is a custom variable that you define,
    // it binds to a Workers KV "namespace." (aka. a storage bucket)
    let redirect = await REDIRECTS.get(url.pathname.replace(prefix, ""))
    if (redirect) {
      url.pathname = redirect
      return Response.redirect(url, 302)
    }
  }
  // Otherwise, send request to the original path.
  return await fetch(request)
}

With only a few changes from our previous example, we scaled from one redirect to billions – that’s just a taste of what you can build with Workers KV.

How does it work?

Distributed data stores are often modeled using the CAP Theorem, which states that distributed systems can only pick between 2 out of the 3 following guarantees:

  • Consistency – is my data the same everywhere?
  • Availability – is my data accessible all the time?
  • Partition tolerance – is my data stored in multiple locations?

Diagram of the choices and tradeoffs of the CAP Theorem.

Workers KV chooses to guarantee Availability and Partition tolerance. This combination is known as eventual consistency, which presents Workers KV with two unique competitive advantages:

  • Reads are ultra fast (median of 12 ms) since it’s powered by our caching technology.
  • Data is available across 175+ edge data centers and resilient to regional outages.

There are, however, tradeoffs to eventual consistency. If two clients write different values to the same key at the same time, the last client to write eventually “wins” and its value becomes globally consistent. This also means that if a client writes to a key and that same client reads that same key, the values may be inconsistent for a short amount of time.

To help visualize this scenario, here’s a real-life example amongst three friends:

  • Suppose Matthew, Michelle, and Lee are planning their weekly lunch.
  • Matthew decides they’re going out for sushi.
  • Matthew tells Michelle their sushi plans, Michelle agrees.
  • Lee, not knowing the plans, tells Michelle they’re actually having pizza.

An hour later, Michelle and Lee are waiting at the pizza parlor while Matthew is sitting alone at the sushi restaurant — what went wrong? We can chalk this up to eventual consistency, because after waiting for a few minutes, Matthew looks at his updated calendar and eventually finds the new truth: they’re going out for pizza instead.

While it may take minutes in real-life, Workers KV is much faster. It can achieve global consistency in less than 60 seconds. Additionally, when a Worker writes to a key, then immediately reads that same key, it can expect the values to be consistent if both operations came from the same location.
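
To put that in code form, here’s a hypothetical sketch from inside a Worker – LUNCH_PLANS is an assumed namespace binding used only for this illustration:

async function updateLunchPlan() {
  // Lee's Worker, running in one data center, overwrites the plan.
  await LUNCH_PLANS.put("weekly-lunch", "pizza")

  // A read from that same location right away returns "pizza"...
  const plan = await LUNCH_PLANS.get("weekly-lunch")

  // ...but Matthew, whose request is served by a different data center,
  // may still read "sushi" for a short window, until the write propagates
  // globally (typically in under 60 seconds).
  return new Response(plan)
}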

When should I use it?

Now that you understand the benefits and tradeoffs of using eventual consistency, how do you determine if it’s the right storage solution for your application? Simply put, if you want global availability with ultra-fast reads, Workers KV is right for you.

However, if your application is frequently writing to the same key, there is an additional consideration. We call it “the Matthew question”: Are you okay with the Matthews of the world occasionally going to the wrong restaurant?

You can imagine use cases (like our redirect Worker example) where this doesn’t make any material difference. But if you decide to keep track of a user’s bank account balance, you would not want the possibility of two balances existing at once, since they could purchase something with money they’ve already spent.

What can I build with it?

Here are a few examples of applications that have been built with KV:

  • Mass redirects – handle billions of HTTP redirects.
  • User authentication – validate user requests to your API.
  • Translation keys – dynamically localize your web pages.
  • Configuration data – manage who can access your origin.
  • Step functions – sync state data between multiple API functions.
  • Edge file store – host large amounts of small files.

We’ve highlighted several of those use cases in our previous blog post. We also have some more in-depth code walkthroughs, including a recently published blog post on how to build an online To-do list with Workers KV.

What’s new since beta?

By far, our most common request was to make it easier to write data to Workers KV. That’s why we’re releasing three new ways to make that experience even better:

1. Bulk Writes

If you want to import your existing data into Workers KV, you don’t want to go through the hassle of sending an HTTP request for every key-value pair. That’s why we added a bulk endpoint to the Cloudflare API. Now you can upload up to 10,000 pairs (up to 100 MB of data) in a single PUT request.

curl "https://api.cloudflare.com/client/v4/accounts/ \
     $ACCOUNT_ID/storage/kv/namespaces/$NAMESPACE_ID/bulk" \
  -X PUT \
  -H "X-Auth-Key: $CLOUDFLARE_AUTH_KEY" \
  -H "X-Auth-Email: $CLOUDFLARE_AUTH_EMAIL" \
  -d '[
    {"key": "built_by",    value: "kyle, alex, charlie, andrew, and brett"},
    {"key": "reviewed_by", value: "joaquin"},
    {"key": "approved_by", value: "steve"}
  ]'

Let’s walk through an example use case: you want to off-load your website translation to Workers. Since you’re reading translation keys frequently and only occasionally updating them, this application works well with the eventual consistency model of Workers KV.

In this example, we hook into Crowdin, a popular platform to manage translation data. This Worker responds to a /translate endpoint, downloads all your translation keys, and bulk writes them to Workers KV so you can read it later on our edge:

addEventListener("fetch", event => {
  if (event.request.url.pathname === "/translate") {
    event.respondWith(uploadTranslations())
  }
})

async function uploadTranslations() {
  // Ask crowdin for all of our translations.
  var response = await fetch(
    "https://api.crowdin.com/api/project" +
    "/:ci_project_id/download/all.zip?key=:ci_secret_key")
  // If crowdin is responding, parse the response into
  // a single json with all of our translations.
  if (response.ok) {
    var translations = await zipToJson(response)
    return await bulkWrite(translations)
  }
  // Return the errored response from crowdin.
  return response
}

async function bulkWrite(keyValuePairs) {
  return fetch(
    "https://api.cloudflare.com/client/v4/accounts" +
    "/:cf_account_id/storage/kv/namespaces/:cf_namespace_id/bulk",
    {
      method: "PUT",
      headers: {
        "Content-Type": "application/json",
        "X-Auth-Key": ":cf_auth_key",
        "X-Auth-Email": ":cf_email"
      },
      body: JSON.stringify(keyValuePairs)
    }
  )
}

async function zipToJson(response) {
  // ... omitted for brevity ...
  // (eg. https://stuk.github.io/jszip)
  return [
    {key: "hello.EN", value: "Hello World"},
    {key: "hello.ES", value: "Hola Mundo"}
  ]
}

Now, when you want to translate a page, all you have to do is read from Workers KV:

async function translate(keys, lang) {
  // You bind your translations namespace to the TRANSLATIONS variable.
  return Promise.all(keys.map(key => TRANSLATIONS.get(key + "." + lang)))
}

2. Expiring Keys

By default, key-value pairs stored in Workers KV last forever. However, sometimes you want your data to auto-delete after a certain amount of time. That’s why we’re introducing the expiration and expirationTtl options for write operations.

// Key expires 60 seconds from now.
NAMESPACE.put("myKey", "myValue", {expirationTtl: 60})

// Key expires if the UNIX epoch is in the past.
NAMESPACE.put("myKey", "myValue", {expiration: 1247788800})

# You can also set keys to expire from the Cloudflare API.
curl "https://api.cloudflare.com/client/v4/accounts/ \
     $ACCOUNT_ID/storage/kv/namespaces/$NAMESPACE_ID/ \
     values/$KEY?expiration_ttl=$EXPIRATION_IN_SECONDS"
  -X PUT \
  -H "X-Auth-Key: $CLOUDFLARE_AUTH_KEY" \
  -H "X-Auth-Email: $CLOUDFLARE_AUTH_EMAIL" \
  -d "$VALUE"

Let’s say you want to block users that have been flagged as inappropriate from your website, but only for a week. With an expiring key, you can set the expire time and not have to worry about deleting it later.

In this example, we assume users and IP addresses are one and the same. If your application has authentication, you could use access tokens as the key identifier.

addEventListener("fetch", event => {
  var url = new URL(event.request.url)
  // An internal API that blocks a new user IP.
  // (eg. example.com/block/1.2.3.4)
  if (url.pathname.startsWith("/block")) {
    var ip = url.pathname.split("/").pop()
    event.respondWith(blockIp(ip))
  } else {
    // Other requests check if the IP is blocked.
    event.respondWith(handleRequest(event.request))
  }
})

async function blockIp(ip) {
  // Values are allowed to be empty in KV,
  // we don't need to store any extra information anyway.
  await BLOCKED.put(ip, "", {expirationTtl: 60*60*24*7})
  return new Response("ok")
}

async function handleRequest(request) {
  var ip = request.headers.get("CF-Connecting-IP")
  if (ip) {
    var blocked = await BLOCKED.get(ip)
    // If we detect an IP and it's blocked, respond with a 403 error.
    if (blocked) {
      return new Response("You are blocked!", { status: 403 })
    }
  }
  // Otherwise, passthrough the original request.
  return fetch(request)
}

3. Larger Values

We’ve increased our size limit on values from 64 kB to 2 MB. This is quite useful if you need to store buffer-based or file data in Workers KV.

Consider this scenario: you want to let your users upload their favorite GIF to their profile without having to store these GIFs as binaries in your database or managing another cloud storage bucket.

Workers KV is a great fit for this use case! You can create a Workers KV namespace for your users’ GIFs that is fast and reliable wherever your customers are located.

In this example, users upload a link to their favorite GIF, then a Worker downloads it and stores it to Workers KV.

addEventListener("fetch", event => {
  var url = new URL(event.request.url)
  var arg = url.pathname.split("/").pop()
  // User sends a URI encoded link to the GIF they wish to upload.
  // (eg. example.com/api/upload_gif/<encoded-uri>)
  if (url.pathname.startsWith("/api/upload_gif")) {
    event.respondWith(uploadGif(arg))
    // Profile contains link to view the GIF.
    // (eg. example.com/api/view_gif/<username>)
  } else if (url.pathname.startsWith("/api/view_gif")) {
    event.respondWith(getGif(arg))
  }
})

async function uploadGif(url) {
  // Fetch the GIF from the Internet.
  var gif = await fetch(decodeURIComponent(url))
  var buffer = await gif.arrayBuffer()
  // Upload the GIF as a buffer to Workers KV.
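  // NOTE: assumes a user object (e.g. from an earlier authentication step) is in scope here.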
  await GIFS.put(user.name, buffer)
  return gif
}

async function getGif(username) {
  var gif = await GIFS.get(username, "arrayBuffer")
  // If the user has set one, respond with the GIF.
  if (gif) {
    return new Response(gif, {headers: {"Content-Type": "image/gif"}})
  } else {
    return new Response("User has no GIF!", { status: 404 })
  }
}

Lastly, we want to thank all of our beta customers. It was your valuable feedback that led us to develop these changes to Workers KV. Make sure to stay in touch with us, we’re always looking ahead for what’s next and we love hearing from you!

Pricing

We’re also ready to announce our GA pricing. If you’re one of our Enterprise customers, your pricing obviously remains unchanged.

  • $0.50 / GB of data stored, 1 GB included
  • $0.50 / million reads, 10 million included
  • $5 / million write, list, and delete operations, 1 million included

During the beta period, we learned customers don’t want to just read values at our edge; they want to write values from our edge too. Since there is high demand for these edge operations, which are more costly, we have started charging for non-read operations per month.

Limits

As mentioned earlier, we increased our value size limit from 64 kB to 2 MB. We’ve also removed our cap on the number of keys per namespace — it’s now unlimited. Here are our GA limits:

  • Up to 20 namespaces per account, each with unlimited keys
  • Keys of up to 512 bytes and values of up to 2 MB
  • Unlimited writes per second for different keys
  • One write per second for the same key
  • Unlimited reads per second per key

Try it out now!

Now open to all customers, you can start using Workers KV today from your Cloudflare dashboard under the Workers tab. You can also look at our updated documentation.

We’re really excited to see what you all can build with Workers KV!

Cloudflare architecture and how BPF eats the world

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/

Recently at Netdev 0x13, the Conference on Linux Networking in Prague, I gave a short talk titled “Linux at Cloudflare”. The talk ended up being mostly about BPF. It seems, no matter the question – BPF is the answer.

Here is a transcript of a slightly adjusted version of that talk.


At Cloudflare we run Linux on our servers. We operate two categories of data centers: large “Core” data centers, processing logs, analyzing attacks, computing analytics, and the “Edge” server fleet, delivering customer content from 180 locations across the world.

In this talk, we will focus on the “Edge” servers. It’s here where we use the newest Linux features, optimize for performance and care deeply about DoS resilience.


Our edge service is special due to our network configuration – we are extensively using anycast routing. Anycast means that the same set of IP addresses are announced by all our data centers.

This design has great advantages. First, it guarantees the optimal speed for end users. No matter where you are located, you will always reach the closest data center. Then, anycast helps us to spread out DoS traffic. During attacks each of the locations receives a small fraction of the total traffic, making it easier to ingest and filter out unwanted traffic.


Anycast allows us to keep the networking setup uniform across all edge data centers. We applied the same design inside our data centers – our software stack is uniform across the edge servers. All software pieces are running on all the servers.

In principle, every machine can handle every task – and we run many diverse and demanding tasks. We have a full HTTP stack, the magical Cloudflare Workers, two sets of DNS servers – authoritative and resolver, and many other publicly facing applications like Spectrum and Warp.

Even though every server has all the software running, requests typically cross many machines on their journey through the stack. For example, an HTTP request might be handled by a different machine during each of the 5 stages of the processing.


Let me walk you through the early stages of inbound packet processing:

(1) First, the packets hit our router. The router does ECMP, and forwards packets onto our Linux servers. We use ECMP to spread each target IP across many, at least 16, machines. This is used as a rudimentary load balancing technique.

(2) On the servers we ingest packets with XDP eBPF. In XDP we perform two stages. First, we run volumetric DoS mitigations, dropping packets belonging to very large layer 3 attacks.

(3) Then, still in XDP, we perform layer 4 load balancing. All the non-attack packets are redirected across the machines. This is used to work around the ECMP problems, gives us fine-granularity load balancing and allows us to gracefully take servers out of service.

(4) Following the redirection the packets reach a designated machine. At this point they are ingested by the normal Linux networking stack, go through the usual iptables firewall, and are dispatched to an appropriate network socket.

(5) Finally packets are received by an application. For example HTTP connections are handled by a “protocol” server, responsible for performing TLS encryption and processing HTTP, HTTP/2 and QUIC protocols.

It’s in these early phases of request processing where we use the coolest new Linux features. We can group useful modern functionalities into three categories:

  • DoS handling
  • Load balancing
  • Socket dispatch


Let’s discuss DoS handling in more detail. As mentioned earlier, the first step after ECMP routing is Linux’s XDP stack where, among other things, we run DoS mitigations.

Historically our mitigations for volumetric attacks were expressed in classic BPF and iptables-style grammar. Recently we adapted them to execute in the XDP eBPF context, which turned out to be surprisingly hard. Read on about our adventures:

During this project we encountered a number of eBPF/XDP limitations. One of them was the lack of concurrency primitives. It was very hard to implement things like race-free token buckets. Later we found that Facebook engineer Julia Kartseva had the same issues. In February this problem was addressed with the introduction of the bpf_spin_lock helper.


While our modern volumetric DoS defenses are done in the XDP layer, we still rely on iptables for application layer 7 mitigations. Here, higher-level firewall features are useful: connlimit, hashlimits and ipsets. We also use the xt_bpf iptables module to run cBPF in iptables to match on packet payloads. We talked about this in the past:


After XDP and iptables, we have one final kernel side DoS defense layer.

Consider a situation where our UDP mitigations fail. In such a case we might be left with a flood of packets hitting our application UDP socket. This might overflow the socket, causing packet loss. This is problematic – both good and bad packets will be dropped indiscriminately. For applications like DNS it’s catastrophic. In the past, to reduce the harm, we ran one UDP socket per IP address. An unmitigated flood was bad, but at least it didn’t affect the traffic to other server IP addresses.

Nowadays that architecture is no longer suitable. We are running more than 30,000 DNS IPs, and running that number of UDP sockets is not optimal. Our modern solution is to run a single UDP socket with a complex eBPF socket filter on it – using the SO_ATTACH_BPF socket option. We talked about running eBPF on network sockets in past blog posts:

The mentioned eBPF rate limits the packets. It keeps the state – packet counts – in an eBPF map. We can be sure that a single flooded IP won’t affect other traffic. This works well, though during work on this project we found a rather worrying bug in the eBPF verifier:

I guess running eBPF on a UDP socket is not a common thing to do.


Apart from the DoS, in XDP we also run a layer 4 load balancer layer. This is a new project, and we haven’t talked much about it yet. Without getting into many details: in certain situations we need to perform a socket lookup from XDP.

The problem is relatively simple – our code needs to look up the “socket” kernel structure for a 5-tuple extracted from a packet. This is generally easy – there is a bpf_sk_lookup helper available for this. Unsurprisingly, there were some complications. One problem was the inability to verify if a received ACK packet was a valid part of a three-way handshake when SYN-cookies are enabled. My colleague Lorenz Bauer is working on adding support for this corner case.


After DoS and the load balancing layers, the packets are passed onto the usual Linux TCP / UDP stack. Here we do a socket dispatch – for example packets going to port 53 are passed onto a socket belonging to our DNS server.

We do our best to use vanilla Linux features, but things get complex when you use thousands of IP addresses on the servers.

Convincing Linux to route packets correctly is relatively easy with the “AnyIP” trick. Ensuring packets are dispatched to the right application is another matter. Unfortunately, standard Linux socket dispatch logic is not flexible enough for our needs. For popular ports like TCP/80 we want to share the port between multiple applications, each handling it on a different IP range. Linux doesn’t support this out of the box. You can call bind() either on a specific IP address or on all IPs (with 0.0.0.0).


In order to fix this, we developed a custom kernel patch which adds a SO_BINDTOPREFIX socket option. As the name suggests – it allows us to call bind() on a selected IP prefix. This solves the problem of multiple applications sharing popular ports like 53 or 80.

Then we run into another problem. For our Spectrum product we need to listen on all 65535 ports. Running so many listen sockets is not a good idea (see our old war story blog), so we had to find another way. After some experiments we learned to utilize an obscure iptables module – TPROXY – for this purpose. Read about it here:

This setup is working, but we don’t like the extra firewall rules. We are working on solving this problem correctly – actually extending the socket dispatch logic. You guessed it – we want to extend socket dispatch logic by utilizing eBPF. Expect some patches from us.


Then there is a way to use eBPF to improve applications. Recently we got excited about doing TCP splicing with SOCKMAP:

This technique has a great potential for improving tail latency across many pieces of our software stack. The current SOCKMAP implementation is not quite ready for prime time yet, but the potential is vast.

Similarly, the new TCP-BPF aka BPF_SOCK_OPS hooks provide a great way of inspecting performance parameters of TCP flows. This functionality is super useful for our performance team.


Some Linux features didn’t age well and we need to work around them. For example, we are hitting limitations of networking metrics. Don’t get me wrong – the networking metrics are awesome, but sadly they are not granular enough. Things like TcpExtListenDrops and TcpExtListenOverflows are reported as global counters, while we need to know it on a per-application basis.

Our solution is to use eBPF probes to extract the numbers directly from the kernel. My colleague Ivan Babrou wrote a Prometheus metrics exporter called “ebpf_exporter” to facilitate this. Read on:

With “ebpf_exporter” we can generate all manner of detailed metrics. It is very powerful and saved us on many occasions.


In this talk we discussed 6 layers of BPFs running on our edge servers:

  • Volumetric DoS mitigations are running on XDP eBPF
  • Iptables xt_bpf cBPF for application-layer attacks
  • SO_ATTACH_BPF for rate limits on UDP sockets
  • Load balancer, running on XDP
  • eBPFs running application helpers like SOCKMAP for TCP socket splicing, and TCP-BPF for TCP measurements
  • “ebpf_exporter” for granular metrics

And we’re just getting started! Soon we will be doing more with eBPF-based socket dispatch, eBPF running on the Linux TC (Traffic Control) layer and more integration with cgroup eBPF hooks. Then, our SRE team is maintaining an ever-growing list of BCC scripts useful for debugging.

It feels like Linux has stopped developing new APIs, and all the new features are implemented as eBPF hooks and helpers. This is fine and it has strong advantages. It’s easier and safer to upgrade an eBPF program than to recompile a kernel module. Some things like TCP-BPF, exposing high-volume performance tracing data, would probably be impossible without eBPF.

Some say “software is eating the world”, I would say that: “BPF is eating the software”.

eBPF can’t count?!

Post Syndicated from Jakub Sitnicki original https://blog.cloudflare.com/ebpf-cant-count/

Grant mechanical calculating machine, public domain image

It is unlikely we can tell you anything new about the extended Berkeley Packet Filter, eBPF for short, if you’ve read all the great man pages, docs, guides, and some of our blogs out there.

But we can tell you a war story, and who doesn’t like those? This one is about how eBPF lost its ability to count for a while1.

They say in our Austin, Texas office that all good stories start with "y’all ain’t gonna believe this… tale." This one though, starts with a post to Linux netdev mailing list from Marek Majkowski after what I heard was a long night:

[screenshot of Marek’s post to the netdev mailing list]

Marek’s findings were quite shocking – if you subtract two 64-bit timestamps in eBPF, the result is garbage. But only when running as an unprivileged user. From root all works fine. Huh.

If you’ve seen Marek’s presentation from the Netdev 0x13 conference, you know that we are using BPF socket filters as one of the defenses against simple, volumetric DoS attacks. So potentially getting your packet count wrong could be a Bad Thing™, and affect legitimate traffic.

Let’s try to reproduce this bug with a simplified eBPF socket filter that subtracts two 64-bit unsigned integers passed to it from user-space through a BPF map. The input for our BPF program comes from a BPF array map, so that the values we operate on are not known at build time. This allows for easy experimentation and prevents the compiler from optimizing out the operations.

Starting small, eBPF, what is 2 – 1? View the code on our GitHub.

$ ./run-bpf 2 1
arg0                    2 0x0000000000000002
arg1                    1 0x0000000000000001
diff                    1 0x0000000000000001

OK, eBPF, what is 2^32 – 1?

$ ./run-bpf $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff 18446744073709551615 0xffffffffffffffff

Wrong! But if we ask nicely with sudo:

$ sudo ./run-bpf $[2**32] 1
[sudo] password for jkbs:
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff

Who is messing with my eBPF?

When computers stop subtracting, you know something big is up. We called for reinforcements.

Our colleague Arthur Fabre quickly noticed something is off when you examine the eBPF code loaded into the kernel. It turns out the kernel doesn’t actually run the eBPF it’s supplied – it sometimes rewrites it first.

Any sane programmer would expect 64-bit subtraction to be expressed as a single eBPF instruction

$ llvm-objdump -S -no-show-raw-insn -section=socket1 bpf/filter.o
…
      20:       1f 76 00 00 00 00 00 00         r6 -= r7
…

However, that’s not what the kernel actually runs. Apparently after the rewrite the subtraction becomes a complex, multi-step operation.

To see what the kernel is actually running we can use little known bpftool utility. First, we need to load our BPF

$ ./run-bpf --stop-after-load 2 1
[2]+  Stopped                 ./run-bpf 2 1

Then list all BPF programs loaded into the kernel with bpftool prog list

$ sudo bpftool prog list
…
5951: socket_filter  name filter_alu64  tag 11186be60c0d0c0f  gpl
        loaded_at 2019-04-05T13:01:24+0200  uid 1000
        xlated 424B  jited 262B  memlock 4096B  map_ids 28786

The most recently loaded socket_filter must be our program (filter_alu64). Now we know its id is 5951 and we can list its bytecode with

$ sudo bpftool prog dump xlated id 5951
…
  33: (79) r7 = *(u64 *)(r0 +0)
  34: (b4) (u32) r11 = (u32) -1
  35: (1f) r11 -= r6
  36: (4f) r11 |= r6
  37: (87) r11 = -r11
  38: (c7) r11 s>>= 63
  39: (5f) r6 &= r11
  40: (1f) r6 -= r7
  41: (7b) *(u64 *)(r10 -16) = r6
…

bpftool can also display the JITed code with: bpftool prog dump jited id 5951.

As you see, subtraction is replaced with a series of opcodes. That is unless you are root. When running from root all is good

$ sudo ./run-bpf --stop-after-load 0 0
[1]+  Stopped                 sudo ./run-bpf --stop-after-load 0 0
$ sudo bpftool prog list | grep socket_filter
659: socket_filter  name filter_alu64  tag 9e7ffb08218476f3  gpl
$ sudo bpftool prog dump xlated id 659
…
  31: (79) r7 = *(u64 *)(r0 +0)
  32: (1f) r6 -= r7
  33: (7b) *(u64 *)(r10 -16) = r6
…

If you’ve spent any time using eBPF, you must have experienced first hand the dreaded eBPF verifier. It’s a merciless judge of all eBPF code that will reject any programs that it deems not worthy of running in kernel-space.

What perhaps nobody has told you before, and what might come as a surprise, is that the very same verifier will actually also rewrite and patch up your eBPF code as needed to make it safe.

The problems with subtraction were introduced by an inconspicuous security fix to the verifier. The patch in question first landed in Linux 5.0 and was backported to the 4.20.6 stable and 4.19.19 LTS kernels. The over-2,000-word commit message doesn’t spare you any details on the attack vector it targets.

The mitigation stems from CVE-2019-7308 vulnerability discovered by Jann Horn at Project Zero, which exploits pointer arithmetic, i.e. adding a scalar value to a pointer, to trigger speculative memory loads from out-of-bounds addresses. Such speculative loads change the CPU cache state and can be used to mount a Spectre variant 1 attack.

To mitigate it the eBPF verifier rewrites any arithmetic operations on pointer values in such a way the result is always a memory location within bounds. The patch demonstrates how arithmetic operations on pointers get rewritten and we can spot a familiar pattern there

[excerpt from the patch showing how pointer arithmetic gets rewritten]

Wait a minute… What pointer arithmetic? We are just trying to subtract two scalar values. How come the mitigation kicks in?

It shouldn’t. It’s a bug. The eBPF verifier keeps track of what kind of values the ALU is operating on, and in this corner case the state was ignored.

Why running BPF as root is fine, you ask? If your program has CAP_SYS_ADMIN privileges, side-channel mitigations don’t apply. As root you already have access to kernel address space, so nothing new can leak through BPF.

After our report, the fix quickly landed in the v5.0 kernel and was backported to stable kernels 4.20.15 and 4.19.28. Kudos to Daniel Borkmann for getting the fix out fast. However, kernel upgrades are hard and in the meantime we were left with code running in production that was not doing what it was supposed to.

32-bit ALU to the rescue

As one of the eBPF maintainers has pointed out, 32-bit arithmetic operations are not affected by the verifier bug. This opens a door for a work-around.

eBPF registers, r0..r10, are 64-bits wide, but you can also access just the lower 32 bits, which are exposed as subregisters w0..w10. You can operate on the 32-bit subregisters using the BPF ALU32 instruction subset. LLVM 7+ can generate eBPF code that uses this instruction subset. Of course, you need to ask it nicely with the trivial -Xclang -target-feature -Xclang +alu32 toggle:

$ cat sub32.c
#include "common.h"

u32 sub32(u32 x, u32 y)
{
        return x - y;
}
$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub32.c
$ llvm-objdump -S -no-show-raw-insn sub32.o
…
sub32:
       0:       bc 10 00 00 00 00 00 00         w0 = w1
       1:       1c 20 00 00 00 00 00 00         w0 -= w2
       2:       95 00 00 00 00 00 00 00         exit

The 0x1c opcode of the instruction #1, which can be broken down as BPF_ALU | BPF_X | BPF_SUB (read more in the kernel docs), is the 32-bit subtraction between registers we are looking for, as opposed to regular 64-bit subtract operation 0x1f = BPF_ALU64 | BPF_X | BPF_SUB, which will get rewritten.

Armed with this knowledge we can borrow a page from bignum arithmetic and subtract 64-bit numbers using just 32-bit ops:

u64 sub64(u64 x, u64 y)
{
        u32 xh, xl, yh, yl;
        u32 hi, lo;

        xl = x;
        yl = y;
        lo = xl - yl;

        xh = x >> 32;
        yh = y >> 32;
        hi = xh - yh - (lo > xl); /* underflow? */

        return ((u64)hi << 32) | (u64)lo;
}

This code compiles as expected on normal architectures, like x86-64 or ARM64, but BPF Clang target plays by its own rules:

$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub64.c -o - \
  | llvm-objdump -S -
…  
      13:       1f 40 00 00 00 00 00 00         r0 -= r4
      14:       1f 30 00 00 00 00 00 00         r0 -= r3
      15:       1f 21 00 00 00 00 00 00         r1 -= r2
      16:       67 00 00 00 20 00 00 00         r0 <<= 32
      17:       67 01 00 00 20 00 00 00         r1 <<= 32
      18:       77 01 00 00 20 00 00 00         r1 >>= 32
      19:       4f 10 00 00 00 00 00 00         r0 |= r1
      20:       95 00 00 00 00 00 00 00         exit

Apparently the compiler decided it was better to operate on 64-bit registers and discard the upper 32 bits. Thus we weren’t able to get rid of the problematic 0x1f opcode. Annoying, back to square one.

Surely a bit of IR will do?

The problem was in Clang frontend – compiling C to IR. We know that BPF "assembly" backend for LLVM can generate bytecode that uses ALU32 instructions. Maybe if we tweak the Clang compiler’s output just a little we can achieve what we want. This means we have to get our hands dirty with the LLVM Intermediate Representation (IR).

If you haven’t heard of LLVM IR before, now is a good time to do some reading2. In short, LLVM IR is what Clang produces and what the LLVM BPF backend consumes.

Time to write IR by hand! Here’s a hand-tweaked IR variant of our sub64() function:

define dso_local i64 @sub64_ir(i64, i64) local_unnamed_addr #0 {
  %3 = trunc i64 %0 to i32      ; xl = (u32) x;
  %4 = trunc i64 %1 to i32      ; yl = (u32) y;
  %5 = sub i32 %3, %4           ; lo = xl - yl;
  %6 = zext i32 %5 to i64
  %7 = lshr i64 %0, 32          ; tmp1 = x >> 32;
  %8 = lshr i64 %1, 32          ; tmp2 = y >> 32;
  %9 = trunc i64 %7 to i32      ; xh = (u32) tmp1;
  %10 = trunc i64 %8 to i32     ; yh = (u32) tmp2;
  %11 = sub i32 %9, %10         ; hi = xh - yh
  %12 = icmp ult i32 %3, %5     ; tmp3 = xl < lo
  %13 = zext i1 %12 to i32
  %14 = sub i32 %11, %13        ; hi -= tmp3
  %15 = zext i32 %14 to i64
  %16 = shl i64 %15, 32         ; tmp2 = hi << 32
  %17 = or i64 %16, %6          ; res = tmp2 | (u64)lo
  ret i64 %17
}

It may not be pretty, but it does produce the desired BPF code when compiled3. You will likely find the LLVM IR reference helpful when deciphering it.

And voilà! Our first working solution that produces correct results:

$ ./run-bpf -filter ir $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff

Actually using this hand-written IR function from C is tricky. See our code on GitHub.
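Roughly, the shape is that the C side only carries a declaration, and the definition comes from the hand-written .ll module, linked in separately (for instance at the bitcode level). A minimal sketch, assuming common.h provides u64 as before; diff_u64() is just a hypothetical wrapper, and the real glue lives in the repository:

/* sketch only: sub64_ir() is defined in the hand-written IR module
 * and linked with the C code separately, e.g. at the bitcode level */
#include "common.h"

u64 sub64_ir(u64 x, u64 y);

static inline u64 diff_u64(u64 x, u64 y)
{
        return sub64_ir(x, y);
}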

[Image: public domain image by Sergei Frolov]

The final trick

Hand-written IR does the job. The downside is that linking IR modules with your C modules is hard. Fortunately, there is a better way: you can persuade Clang to stick to 32-bit ALU ops in the IR it generates.

We’ve already seen the problem. To recap: if we ask Clang to subtract 32-bit integers, it will operate on 64-bit values and throw away the top 32 bits. Putting C, IR, and eBPF side by side helps visualize this:

[Figure: the 32-bit subtraction shown in C, LLVM IR, and eBPF bytecode side by side]

The trick to get around it is to declare the 32-bit variable that holds the result as volatile. You might already know the volatile keyword if you’ve written Unix signal handlers. In essence, it tells the compiler that the value of the variable may change under its feet, so it must not reorder or eliminate loads (reads) from it. Likewise, stores (writes) to it might have side effects, so the compiler may not reorder them or skip writing the value to memory.

Using volatile makes Clang emit special loads and/or stores at the IR level, which at the eBPF level translate into reading/writing the value from memory (the stack) on every access. While this doesn’t sound related to the problem at hand, it has a surprising side effect:

[Figure: the same comparison with the volatile variable, in C, LLVM IR, and eBPF bytecode]

With volatile access the compiler doesn’t promote the subtraction to 64 bits! Don’t ask me why, although I would love to hear an explanation. For now, consider this a hack, and one that does not come for free: there is the overhead of going through the stack on each read/write.
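To make the trick concrete, here is a sketch of the subtraction with the 32-bit intermediates declared volatile; it approximates the variant in the figure above, and the sub64_volatile name is picked just for illustration:

u64 sub64_volatile(u64 x, u64 y)
{
        u32 xh, xl, yh, yl;
        /* declaring the 32-bit results volatile keeps the subtractions
         * 32-bit, at the cost of a round-trip through the stack */
        volatile u32 hi, lo;

        xl = x;
        yl = y;
        lo = xl - yl;

        xh = x >> 32;
        yh = y >> 32;
        hi = xh - yh - (lo > xl); /* underflow? */

        return ((u64)hi << 32) | (u64)lo;
}

Every access to hi and lo now goes through the stack, which is exactly where the overhead mentioned above comes from.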

However, if we play our cards right, we just might reduce it a little. We don’t actually need the volatile load or store to happen; we just want the side effect. So instead of declaring the value as volatile, which implies that both reads and writes are volatile, let’s try to make only the writes volatile with the help of a macro:

/* Emits a "store volatile" in LLVM IR */
#define ST_V(rhs, lhs) (*(volatile typeof(rhs) *) &(rhs) = (lhs))

If this macro looks strangely familiar, it’s because it does the same thing as the WRITE_ONCE() macro in the Linux kernel. Applying it to our example:

[Figure: the example rewritten with the ST_V() macro]
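In code, the idea is to route only the stores through ST_V(), roughly like this; it is a sketch rather than an exact transcription of the figure, and sub64_st_v is a name picked for illustration:

u64 sub64_st_v(u64 x, u64 y)
{
        u32 xh, xl, yh, yl;
        u32 hi, lo;

        xl = x;
        yl = y;
        ST_V(lo, xl - yl);              /* volatile store, regular loads */

        xh = x >> 32;
        yh = y >> 32;
        ST_V(hi, xh - yh - (lo > xl));  /* underflow? */

        return ((u64)hi << 32) | (u64)lo;
}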

That’s another hacky but working solution. Pick your poison.

[Image: CC BY-SA 3.0 image by ANKAWÜ]

So there you have it: from C, to IR, and back to C, to hack around a bug in the eBPF verifier and be able to subtract 64-bit integers again. Usually you won’t have to dive into LLVM IR or assembly to make use of eBPF, but it does help to know a little about them when things don’t work as expected.

Did I mention that 64-bit addition is also broken? Have fun fixing it!


1 Okay, it was more like 3 months until the bug was discovered and fixed.

2 Some even think that it is better than assembly.

3 How do we know? The litmus test is to look for statements matching r[0-9] [-+]= r[0-9] in BPF assembly.