Tag Archives: linux

Raking the floods: How to protect UDP services from DoS attacks with eBPF

Post Syndicated from Jonas Otten original https://blog.cloudflare.com/building-rakelimit/

Raking the floods: How to protect UDP services from DoS attacks with eBPF

Raking the floods: How to protect UDP services from DoS attacks with eBPF

Cloudflare’s globally distributed network is not just designed to protect HTTP services but any kind of TCP or UDP traffic that passes through our edge. To this end, we’ve built a number of sophisticated DDoS mitigation systems, such as Gatebot, which analyze world-wide traffic patterns. However, we’ve always employed defense-in-depth: in addition to global protection systems we also use off-the shelf mechanisms such as TCP SYN-cookies, which protect individual servers locally from the very common SYN-flood. But there’s a catch: such a mechanism does not exist for UDP. UDP is a connectionless protocol and does not have similar context around packets, especially considering that Cloudflare powers services such as Spectrum which are agnostic to the upper layer protocol (DNS, NTP, …), so my 2020 intern class project was to come up with a different approach.

Protecting UDP services

First of all, let’s discuss what it actually means to provide protection to UDP services. We want to ensure that an attacker cannot drown out legitimate traffic. To achieve this we want to identify floods and limit them while leaving legitimate traffic untouched.

The idea to mitigate such attacks is straight forward: first identify a group of packets that is related to an attack, and then apply a rate limit on this group. Such groups are determined based on the attributes available to us in the packet, such as addresses and ports.

We do not want to completely drop the flood of traffic, as legitimate traffic may still be part of it. We only want to drop as much traffic as necessary to comply with our set rate limit. Completely ignoring a set of packets just because it is slightly above the rate limit is not an option, as it may contain legitimate traffic.

This ensures both that our service stays responsive but also that legitimate packets experience as little impact as possible.

While rate limiting is a somewhat straightforward procedure, determining groups is a bit harder, for a number of reasons.

Finding needles in the haystack

The problem in determining groups in packets is that we have barely any context. We consider four things as useful attributes as attack signatures: the source address and port as well as the destination address and port. While that already is not a lot, it gets worse: the source address and port may not even be accurate. Packets can be spoofed, in which case an attacker hides their own address. That means only keeping a rate per source address may not provide much value, as it could simply be spoofed.

But there is another problem: keeping one rate per address does not scale. When bringing IPv6 into the equation and its whopping address space it becomes clear it’s not going to work.

To solve these issues we turned to the academic world and found what we were looking for, the problem of Heavy Hitters. Heavy Hitters are elements of a datastream that appear frequently, and can be expressed relative to the overall elements of the stream. We can define for example that an element is considered to be a Heavy Hitter if its frequency exceeds, say, 10% of the overall count. To do so we naively could suggest to simply maintain a counter per element, but due to the space limitations this will not scale. Instead probabilistic algorithms such as a CountMin sketch or the SpaceSaving algorithm can be used. These provide an estimated count instead of a precise one, but are capable of doing this with constant memory requirements, and in our case we will just save rates into the CountMin sketch instead of counts. So no matter how many unique elements we have to track, the memory consumption is the same.

We now have a way of finding the needle in the haystack, and it does have constant memory requirements, solving our problem. However, reality isn’t that simple. What if an attack is not just originating from a single port but many? Or what if a reflection attack is hitting our service, resulting in random source addresses but a single source port? Maybe a full /24 subnet is sending us a flood? We can not just keep a rate per combination we see, as it would ignore all these patterns.

Grouping the groups: How to organize packets

Luckily the academic world has us covered again, with the concept of Hierarchical Heavy Hitters. It extends the Heavy Hitter concept by using the underlying hierarchy in the elements of the stream. For example, an IP address can be naturally grouped into several subnets:

Raking the floods: How to protect UDP services from DoS attacks with eBPF

In this case we defined that we consider the fully-specified address, the /24 subnet and the /0 wildcard. We start at the left with the fully specified address, and each step walking towards the top we consider less information from it. We call these less-specific addresses generalisations, and measure how specific a generalisation is by assigning a level. In our example, the address 192.0.2.123 is at level 0, while 192.0.2.0/24 is at level 1, etc.

If we want to create a structure which can hold this information for every packet, it could look like this:

Raking the floods: How to protect UDP services from DoS attacks with eBPF

We maintain a CountMin-sketch per subnet and then apply Heavy Hitters. When a new packet arrives and we need to determine if it is allowed to pass we simply check the rates of the corresponding elements in every node. If no rate exceeds the rate limit that we set, e.g. 25 packets per second (pps), it is allowed to pass.

The structure could now keep track of a single attribute, but we would waste a lot of context around packets! So instead of letting it go to waste, we use the two-dimensional approach for addresses proposed in the paper Hierarchical Heavy Hitters with SpaceSaving algorithm, and extend it further to also incorporate ports into our structure. Ports do not have a natural hierarchy such as addresses, so they can only be in two states: either specified (e.g. 8080) or wildcard.

Now our structure looks like this:

Raking the floods: How to protect UDP services from DoS attacks with eBPF

Now let’s talk about the algorithm we use to traverse the structure and determine if a packet should be allowed to pass. The paper Hierarchical Heavy Hitters with SpaceSaving algorithm provides two methods that can be used on the data structure: one that updates elements and increases their counters, and one that provides all elements that currently are Heavy Hitters. This is actually not necessary for our use-case, as we are only interested if the element, or packet, we are looking at right now would be a Heavy Hitter to decide if it can pass or not.

Secondly, our goal is to prevent any Heavy Hitters from passing, thus leaving the structure with no Heavy Hitters whatsoever. This is a great property, as it allows us to simplify the algorithm substantially, and it looks like this:

Raking the floods: How to protect UDP services from DoS attacks with eBPF

As you may notice, we update every node of a level and maintain the maximum rate we see. After each level we calculate a probability that determines if a packet should be passed to the next level, based on the maximum rate we saw on that level and a set rate limit. Each node essentially filters the traffic for the following, less specific level.

I actually left out a small detail: a packet is not dropped if any rate exceeds the limit, but instead is kept with the probability rate limit/maximum rate seen. The reason is that if we just drop all packets if the rates exceed the limit, we would drop the whole traffic, not just a subset to make it comply with our set rate limit.

Since we now still update more specific nodes even if a node reaches a rate limit, the rate limit will converge towards the underlying pattern of the attack as much as possible. That means other traffic will be impacted as minimally as possible, and that with no manual intervention whatsoever!

BPF to the rescue: building a Go library

As we want to use this algorithm to mitigate floods, we need to spend as little computation and overhead as possible before we decide if a packet should be dropped or not. As so often, we looked into the BPF toolbox and found what we need: Socketfilters. As our colleague Marek put it: “It seems, no matter the question – BPF is the answer.”.

Socketfilters are pieces of code that can be attached to a single socket and get executed before a packet will be passed from kernel to userspace. This is ideal for a number of reasons. First, when the kernel runs the socket filter code, it gives it all the information from the packet we need, and other mitigations such as firewalls have been executed. Second the code is executed per socket, so every application can activate it as needed, and also set appropriate rate limits. It may even use different rate limits for different sockets. The third reason is privileges: we do not need to be root to attach the code to a socket. We can execute code in the kernel as a normal user!

BPF also has a number of limitations which have been already covered on this blog in the past, so we will focus on one that’s specific to our project: floating-point numbers.

To calculate rates we need floating-point numbers to provide an accurate estimate. BPF, and the whole kernel for that matter, does not support these. Instead we implemented a fixed-point representation, which uses a part of the available bits for the fractional part of a rational number and the remaining bits for the integer part. This allows us to represent floats within a certain range, but there is a catch when doing arithmetic: while subtraction and addition of two fixed-points work well, multiplication and division requires double the number of bits to ensure there will not be any loss in precision. As we use 64 bits for our fixed-point values, there is no larger data type available to ensure this does not happen. Instead of calculating the result with exact precision, we convert one of the arguments into an integer. That results in the loss of the fractional part, but as we deal with large rates that does not pose any issue, and helps us to work around the bit limitation as intermediate results fit into the available 64 bits. Whenever fixed-point arithmetic is necessary the precision of intermediate results has to be carefully considered.

There are many more details to the implementation, but instead of covering every single detail in this blog post lets just look at the code.

We open sourced rakelimit over on Github at cloudflare/rakelimit! It is a full-blown Go library that can be enabled on any UDP socket, and is easy to configure.

The development is still in early stages and this is a first prototype, but we are excited to continue and push the development with the community! And if you still can’t get enough, look at our talk from this year’s Linux Plumbers Conference.

It’s a brand-new NODE Mini Server!

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/its-a-brand-new-node-mini-server/

NODE has long been working to create open-source resources to help more people harness the decentralised internet, and their easily 3D-printed designs are perfect to optimise your Raspberry Pi.

NODE wanted to take advantage of the faster processor and up to 8GB RAM on Raspberry Pi 4 when it came out last year. Now that our tiny computer is more than capable of being used as as a general Linux desktop system, the NODE Mini Server version 3 has been born.

As for previous versions of NODE’s Mini Server, one of their main goals for this new iteration was to package Raspberry Pi in a way which makes it a little easier to use as a regular mini server or computer. In other words, it’s put inside a neat little box with all the ports accessible on one side.

Black is incredibly slimming

Slimmer and simpler

The latest design is simplified compared to previous versions. Everything lives in a 92mm × 92mm enclosure that isn’t much thicker than Raspberry Pi itself.

The slimmed-down new case comprises a single 3D-printed piece and a top cover made from a custom-designed printed circuit board (PCB) that has four brass-threaded inserts soldered into the corners, giving you a simple way to screw everything together.

The custom PCB cover

What are the new features?

Another goal for version 3 NODE’s Mini Server was to include as much modularity as possible. That’s why this new mini server requires no modifications to the Raspberry Pi itself, thanks to a range of custom-designed adapter boards. How to take advantage of all these new features is explained at this point in NODE’s YouTube video.

Ooh, shiny and new and new and shiny

Just like for previous versions, all the files and a list of the components you need to create your own Mini Server are available for free on the NODE website.

Leave comments on NODE’s YouTube video if you’d like to create and sell your own Mini Server kits or pre-made servers. NODE is totally open to showcasing any add-ons or extras you come up with yourself.

Looking ahead, making the Mini Server stackable and improving fan circulation is next on NODE’s agenda.

The post It’s a brand-new NODE Mini Server! appeared first on Raspberry Pi.

Migrating Subversion repositories to AWS CodeCommit

Post Syndicated from Iftikhar khan original https://aws.amazon.com/blogs/devops/migrating-subversion-repositories-aws-codecommit/

In this post, we walk you through migrating Subversion (SVN) repositories to AWS CodeCommit. But before diving into the migration, we do a brief review of SVN and Git based systems such as CodeCommit.

About SVN

SVN is an open-source version control system. Founded in 2000 by CollabNet, Inc., it was originally designed to be a better Concurrent Versions System (CVS), and is being developed as a project of the Apache Software Foundation. SVN is the third implementation of a revision control system: Revision Control System (RCS), then CVS, and finally SVN.

SVN is the leader in centralized version control. Systems such as CVS and SVN have a single remote server of versioned data with individual users operating locally against copies of that data’s version history. Developers commit their changes directly to that central server repository.

All the files and commit history information are stored in a central server, but working on a single central server means more chances of having a single point of failure. SVN offers few offline access features; a developer has to connect to the SVN server to make a commit that makes commits slower. The single point of failure, security, maintenance, and scaling SVN infrastructure are the major concerns for any organization.

About DVCS

Distributed Version Control Systems (DVCSs) address the concerns and challenges of SVN. In a DVCS (such as Git or Mercurial), you don’t just check out the latest snapshot of the files; rather, you fully mirror the repository, including its full history. If any server dies, and these systems are collaborating via that server, you can copy any of the client repositories back up to the server to restore it. Every clone is a full backup of all the data.

DVCs such as Git are built with speed, non-linear development, simplicity, and efficiency in mind. It works very efficiently with large projects, which is one of the biggest factors why customers find it popular.

A significant reason to migrate to Git is branching and merging. Creating a branch is very lightweight, which allows you to work faster and merge easily.

About CodeCommit

CodeCommit is a version control system that is fully managed by AWS. CodeCommit can host secure and highly scalable private Git repositories, which eliminates the need to operate your source control system and scale its infrastructure. You can use it to securely store anything, from source code to binaries. CodeCommit features like collaboration, encryption, and easy access control make it a great choice. It works seamlessly with most existing Git tools and provides free private repositories.

Understanding the repository structure of SVN and Git

SVNs have a tree model with one branch where the revisions are stored, whereas Git uses a graph structure and each commit is a node that knows its parent. When comparing the two, consider the following features:

  • Trunk – An SVN trunk is like a primary branch in a Git repository, and contains tested and stable code.
  • Branches – For SVN, branches are treated as separate entities with its own history. You can merge revisions between branches, but they’re different entities. Because of its centralized nature, all branches are remote. In Git, branches are very cheap; it’s a pointer for a particular commit on the tree. It can be local or be pushed to a remote repository for collaboration.
  • Tags – A tag is just another folder in the main repository in SVN and remains static. In Git, a tag is a static pointer to a specific commit.
  • Commits – To commit in SVN, you need access to the main repository and it creates a new revision in the remote repository. On Git, the commit happens locally, so you don’t need to have access to the remote. You can commit the work locally and then push all the commits at one time.

So far, we have covered how SVN is different from Git-based version control systems and illustrated the layout of SVN repositories. Now it’s time to look at how to migrate SVN repositories to CodeCommit.

Planning for migration

Planning is always a good thing. Before starting your migration, consider the following:

  • Identify SVN branches to migrate.
  • Come up with a branching strategy for CodeCommit and document how you can map SVN branches.
  • Prepare build, test scripts, and test cases for system testing.

If the size of the SVN repository is big enough, consider running all migration commands on the SVN server. This saves time because it eliminates network bottlenecks.

Migrating the SVN repository to CodeCommit

When you’re done with the planning aspects, it’s time to start migrating your code.

Prerequisites

You must have the AWS Command Line Interface (AWS CLI) with an active account and Git installed on the machine that you’re planning to use for migration.

Listing all SVN users for an SVN repository
SVN uses a user name for each commit, whereas Git stores the real name and email address. In this step, we map SVN users to their corresponding Git names and email.

To list all the SVN users, run the following PowerShell command from the root of your local SVN checkout:

svn.exe log --quiet | ? { $_ -notlike '-*' } | % { "{0} = {0} <{0}>" -f ($_ -split ' \| ')[1] } | Select-Object -Unique | Out-File 'authors-transform.txt'

On a Linux based machine, run the following command from the root of your local SVN checkout:

svn log -q | awk -F '|' '/^r/ {sub("^ ", "", $2); sub(" $", "", $2); print $2" = "$2" <"$2">"}' | sort -u > authors-transform.txt

The authors-transform.txt file content looks like the following code:

ikhan = ikhan <ikhan>
foobar= foobar <foobar>
abob = abob <abob>

After you transform the SVN user to a Git user, it should look like the following code:

ikhan = ifti khan <[email protected]>
fbar = foo bar <[email protected]>
abob = aaron bob <[email protected]>

Importing SVN contents to a Git repository

The next step in the migration from SVN to Git is to import the contents of the SVN repository into a new Git repository. We do this with the git svn utility, which is included with most Git distributions. The conversion process can take a significant amount of time for larger repositories.

The git svn clone command transforms the trunk, branches, and tags in your SVN repository into a new Git repository. The command depends on the structure of the SVN.

git svn clone may not be available in all installations; you might consider using an AWS Cloud9 environment or using a temporary Amazon Elastic Compute Cloud (Amazon EC2) instance.

If your SVN layout is standard, use the following command:

git svn clone --stdlayout --authors-file=authors.txt  <svn-repo>/<project> <temp-dir/project>

If your SVN layout isn’t standard, you need to map the trunk, branches, and tags folder in the command as parameters:

git svn clone <svn-repo>/<project> --prefix=svn/ --no-metadata --trunk=<trunk-dir> --branches=<branches-dir>  --tags==<tags-dir>  --authors-file "authors-transform.txt" <temp-dir/project>

Creating a bare Git repository and pushing the local repository

In this step, we create a blank repository and match the default branch with the SVN’s trunk name.

To create the .gitignore file, enter the following code:

cd <temp-dir/project>
git svn show-ignore > .gitignore
git add .gitignore
git commit -m 'Adding .gitignore.'

To create the bare Git repository, enter the following code:

git init --bare <git-project-dir>\local-bare.git
cd <git-project-dir>\local-bare.git
git symbolic-ref HEAD refs/heads/trunk

To update the local bare Git repository, enter the following code:

cd <temp-dir/project>
git remote add bare <git-project-dir\local-bare.git>
git config remote.bare.push 'refs/remotes/*:refs/heads/*'
git push bare

You can also add tags:

cd <git-project-dir\local-bare.git>

For Windows, enter the following code:

git for-each-ref --format='%(refname)' refs/heads/tags | % { $_.Replace('refs/heads/tags/','') } | % { git tag $_ "refs/heads/tags/$_"; git branch -D "tags/$_" }

For Linux, enter the following code:

for t in $(git for-each-ref --format='%(refname:short)' refs/remotes/tags); do git tag ${t/tags\//} $t &amp;&amp; git branch -D -r $t; done

You can also add branches:

cd <git-project-dir\local-bare.git>

For Windows, enter the following code:

git for-each-ref --format='%(refname)' refs/remotes | % { $_.Replace('refs/remotes/','') } | % { git branch "$_" "refs/remotes/$_"; git branch -r -d "$_"; }

For Linux, enter the following code:

for b in $(git for-each-ref --format='%(refname:short)' refs/remotes); do git branch $b refs/remotes/$b && git branch -D -r $b; done

As a final touch-up, enter the following code:

cd <git-project-dir\local-bare.git>
git branch -m trunk master

Creating a CodeCommit repository

You can now create a CodeCommit repository with the following code (make sure that the AWS CLI is configured with your preferred Region and credentials):

aws configure
aws codecommit create-repository --repository-name MySVNRepo --repository-description "SVN Migration repository" --tags Team=Migration

You get the following output:

{
    "repositoryMetadata": {
        "repositoryName": "MySVNRepo",
        "cloneUrlSsh": "ssh://ssh://git-codecommit.us-east-2.amazonaws.com/v1/repos/MySVNRepo",
        "lastModifiedDate": 1446071622.494,
        "repositoryDescription": "SVN Migration repository",
        "cloneUrlHttp": "https://git-codecommit.us-east-2.amazonaws.com/v1/repos/MySVNRepo",
        "creationDate": 1446071622.494,
        "repositoryId": "f7579e13-b83e-4027-aaef-650c0EXAMPLE",
        "Arn": "arn:aws:codecommit:us-east-2:111111111111:MySVNRepo",
        "accountId": "111111111111"
    }
}

Pushing the code to CodeCommit

To push your code to the new CodeCommit repository, enter the following code:

cd <git-project-dir\local-bare.git>
git remote add origin https://git-codecommit.us-east-2.amazonaws.com/v1/repos/MySVNRepo
git add *

git push origin --all
git push origin --tags (Optional if tags are mapped)

Troubleshooting

When migrating SVN repositories, you might encounter a few SVN errors, which are displayed as code on the console. For more information, see Subversion client errors caused by inappropriate repository URL.

For more information about the git-svn utility, see the git-svn documentation.

Conclusion

In this post, we described the straightforward process of using the git-svn utility to migrate SVN repositories to Git or Git-based systems like CodeCommit. After you migrate an SVN repository to CodeCommit, you can use any Git-based client and start using CodeCommit as your primary version control system without worrying about securing and scaling its infrastructure.

How to enable X11 forwarding from Red Hat Enterprise Linux (RHEL), Amazon Linux, SUSE Linux, Ubuntu server to support GUI-based installations from Amazon EC2

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/how-to-enable-x11-forwarding-from-red-hat-enterprise-linux-rhel-amazon-linux-suse-linux-ubuntu-server-to-support-gui-based-installations-from-amazon-ec2/

This post was written by Sivasamy Subramaniam, AWS Database Consultant.

In this post, I discuss enabling X11 forwarding from Red Hat Enterprise Linux (RHEL), Amazon Linux, SUSE Linux, Ubuntu servers running on Amazon EC2. This is helpful for system and database administrators, and application teams that want to perform software installations on Amazon EC2 using GUI method. This blog provides detailed steps around SSH and x11 tools, various network and operating system (OS) level settings, and best practices to achieve the X11 forwarding on Amazon EC2 when installing databases like Oracle using GUI.

There are several techniques to connect Amazon EC2 instances to manage OS level configurations. Typically, you use SSH clients (such as PuTTY or SSH client) to establish the connection from the Windows OS-based bastion or jump servers to connect with Amazon EC2 instances running linux-based OS. Most commonly, database administrators use a common Database Management, bastion host, or jump servers to connect database servers. They do this instead of directly using their laptops connecting to the database servers. They can install all the needed tools in one server to perform database administrative or support activities. During the application installation or configuration, you might need to install software such as an Oracle database or a third-party database using GUI methods. This blog talks about steps that must be done in order to forward the X11 screen to your highly secure Windows OS-based bastion hosts. You can consider using NICE DCV as an alternative option for running GUI-based applications. Please refer to the prior link for more details and steps to enable NICE DCV.

Prerequisites

To complete this walkthrough the following is required:

  • Ensure that you have a bastion host running on Amazon EC2 with Windows OS for this blog. This OS must have access to the EC2 machines running Linux such as RHEL, Amazon Linux, SUSE Linux, and Ubuntu servers. If not, please configure a bastion host using Windows operating system with needed SSH access via port 22 to EC2 instance running linux-based operating systems. You can use any OS-based systems as a bastion host as long you have corresponding client tools installed or X11 supported by that OS.
  • I recommend having bastion hosts in the same Availability Zone or Region as the EC2 Linux hosts that you plan to connect and forward X11 to. This is to avoid any high latency in X11 forwarding during your application installations.
  • Install tools such as PuTTY and Xming on the Windows-based bastion host from which you want to SSH to Linux EC2 host and X11 forwarding.
  • In order to securely configure or install PuTTY, refer to the section Configuring ssh-agent on Windows in the blog post Securely Connect to Linux Instances Running in a Private Amazon VPC.
  • You may need sudo permission to run X11 forwarding commands as a root user in order to complete the setup.

Solution

Connect to your EC2 instance using SSH client, and perform following setup as needed.

Step 1: Install required X11 packages

Install X11 packages with following command based on your operating system release and version:

Installing xclock or xterm packages are optional as this is installed in this post to test the X11 forwarding using xclock or xterm commands.

 

Amazon Linux 2:

To install X11 related packages:

$ sudo yum install xorg-x11-xauth

To install X11  testing tools:

$ sudo yum install xclock xterm

 

Red Hat Enterprise Linux 8:

To install X11 related packages:

$ sudo yum install xorg-x11-xauth

To install X11 testing tools:

$ sudo yum install xterm

Note: The xorg-x11-apps package has been provided in the CodeReady Linux Builder Repository for RHEL8. So, I skipped installing this package, which has xclock and I used only xterm to test the X11 forwarding.

 

SUSE Linux Enterprise Server 15 SP1:

To install X11 related packages:

$ sudo zypper install xauth

To install X11 testing tools:

$ sudo zypper install xclock

 

Ubuntu Server 18:

To install X11 related packages and tools:

$ sudo apt install x11-apps

Step 2: configure X11 forwarding

To enable X11 Forwarding, change the “X11Forwarding” parameter using vi editor to “yes” in the /etc/ssh/sshd_config file if either commented out or set to no.

$ sudo vi /etc/ssh/sshd_config

 

To Verify X11Forwarding parameter:

$ sudo cat /etc/ssh/sshd_config |grep -i X11Forwarding

You should see similar output as the following:

X11Forwarding yes

To restart ssh service if you changed the value in /etc/ssh/sshd_config:

 

Amazon Linux 2, RHEL 8 and SUSE Linux OS:

$ sudo service sshd restart

 

Ubuntu Servers:

$ sudo service ssh restart

 

Step 3: Configure putty and Xming to perform X11 forwarding connect and verify X11 forwarding

Log in to your Windows bastion host. Then, open a fresh PuTTY session, and use a private key or password-based authentication per your organization setup. Then, test the xclock or xterm command to see x11 forwarding in action.

  • Click the xming utility you installed on Windows bastion host and have it running.

click on xming icon

  • Select Session from the Category pane on left. Set Host Name as your private IP, port 22, and Connection Type as SSH. Please note that you use the Private IP of EC2 instance later when you connect inside from the VPC/network.

putty configuration details

  • Go to Connection, and click Then, set Auto-login username as ec2-user, Ubuntu (Ubuntu OS), or whichever user you are allowed to logging in as.
  • Go to Connection, select SSH, and then click Then, click on Browse to select the private key generated earlier If you are using key based authentication.
  • Go to Connection, select SSH, and then click on Then, select enable X11 forwarding.
  • Set X display location as localhost:0.0

putty configuration screenshot 2

  • Go back to Session and click on Save after creating a session name in Saved session.

 

Now that you set up PuTTY, xming, and configured the x11 settings, you can click on load button and then Open button. This opens up a new SSH terminal with x11 forwarding enabled. Now, I move on to the testing X11 forwarding.

Test the X11 from the use you logged in:

Example:

$ xauth list

$ export DISPLAY=localhost:10.0

$ xclock or xterm

You should see the sample output and xclock or xterm window opened similar to the following image. This means your x11 forwarding setup working as expected, and you can start using GUI-based application installation or configuration by running the installer or configuration tools.

sample output xclock

Step 4: Configure the EC2 Linux session to forward X11 if you are switching to different user after login to run GUI-based installation / commands

In this example: ec2-user is the user logged in with SSH and then switched to oracle user.

From the Logged User to identify the xauth details:

$ xauth list

$ env|grep DISPLAY

$ xauth list | grep unix`echo $DISPLAY | cut -c10-12` > /tmp/xauth

Switch to the user where you want to run GUI-based installation or tools:

$ sudo su - oracle

$ xauth add `cat /tmp/xauth`

$ xauth list

$ env|grep DISPLAY

$ export DISPLAY=localhost:10.0

$ xclock

You should see the sample output and xclock or xterm window opened similar to the following image. This means your x11 forwarding setup is working as expected even after switched to different user. You can start using GUI-based application such as running the installer or configuration tools.

  sample output success

Conclusion

In this blog, I demonstrated how to configure Amazon EC2 instances running on various linux-based operating systems to forward X11 to the Windows OS-based bastion host. This is helpful to any application installation that requires GUI-based installation methods. This is also helpful to any bastion hosts that provide highly secure and low latency environments to perform SSH related operations including GUI-based installations as this does not require any additional network configuration other than opening the port 22 for standard SSH authentication. Please try this tutorial for yourself, and leave any comments following!

 

 

Sandboxing in Linux with zero lines of code

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/sandboxing-in-linux-with-zero-lines-of-code/

Sandboxing in Linux with zero lines of code

Modern Linux operating systems provide many tools to run code more securely. There are namespaces (the basic building blocks for containers), Linux Security Modules, Integrity Measurement Architecture etc.

In this post we will review Linux seccomp and learn how to sandbox any (even a proprietary) application without writing a single line of code.

Sandboxing in Linux with zero lines of code

Tux by Iwan Gabovitch, GPL
Sandbox, Simplified Pixabay License

Linux system calls

System calls (syscalls) is a well-defined interface between userspace applications and the operating system (OS) kernel. On modern operating systems most applications provide only application-specific logic as code. Applications do not, and most of the time cannot, directly access low-level hardware or networking, when they need to store data or send something over the wire. Instead they use system calls to ask the OS kernel to do specific hardware and networking tasks on their behalf:

Sandboxing in Linux with zero lines of code

Apart from providing a generic high level way for applications to interact with the low level hardware, the system call architecture allows the OS kernel to manage available resources between applications as well as enforce policies, like application permissions, networking access control lists etc.

Linux seccomp

Linux seccomp is yet another syscall on Linux, but it is a bit special, because it influences how the OS kernel will behave when the application uses other system calls. By default, the OS kernel has almost no insight into userspace application logic, so it provides all the possible services it can. But not all applications require all services. Consider an application which converts image formats: it needs the ability to read and write data from disk, but in its simplest form probably does not need any network access. Using seccomp an application can declare its intentions in advance to the Linux kernel. For this particular case it can notify the kernel that it will be using the read and write system calls, but never the send and recv system calls (because its intent is to work with local files and never with the network). It’s like establishing a contract between the application and the OS kernel:

Sandboxing in Linux with zero lines of code

But what happens if the application later breaks the contract and tries to use one of the system calls it promised not to use? The kernel will “penalise” the application, usually by immediately terminating it. Linux seccomp also allows less restrictive actions for the kernel to take:

  • instead of terminating the whole application, the kernel can be requested to terminate only the thread, which issued the prohibited system call
  • the kernel may just send a SIGSYS signal to the calling thread
  • the seccomp policy can specify an error code, which the kernel will then return to the calling application instead of executing the prohibited system call
  • if the violating process is under ptrace (for example executing under a debugger), the kernel can notify the tracer (the debugger) that a prohibited system call is about to happen and let the debugger decide what to do
  • the kernel may be instructed to allow and execute the system call, but log the attempt: this is useful, when we want to verify that our seccomp policy is not too tight without the risk of terminating the application and potentially creating an outage

Although there is a lot of flexibility in defining the potential penalty for the application, from a security perspective it is usually best to stick with the complete application termination upon seccomp policy violation. The reason for that will be described later in the examples in the post.

So why would the application take the risk of being abruptly terminated and declare its intentions beforehand, if it can just be “silent” and the OS kernel will allow it to use any system call by default? Of course, for a normal behaving application it makes no sense, but it turns out this feature is quite effective to protect from rogue applications and arbitrary code execution exploits.

Imagine our image format converter is written in some unsafe language and an attacker was able to take control of the application by making it process some malformed image. What the attacker might do is to try to steal some sensitive information from the machine running our converter and send it to themselves via the network. By default, the OS kernel will most likely allow it and a data leak will happen. But if our image converter “confined” (or sandboxed) itself beforehand to only read and write local data the kernel will terminate the application when the latter tries to leak the data over the network thus preventing the leak and locking out the attacker from our system!

Integrating seccomp into the application

To see how seccomp can be used in practice, let’s consider a toy example program

myos.c:

#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname name;

    if (uname(&name)) {
        perror("uname failed: ");
        return 1;
    }

    printf("My OS is %s!\n", name.sysname);
    return 0;
}

This is a simplified version of the uname command line tool, which just prints your operating system name. Like its full-featured counterpart, it uses the uname system call to actually get the name of the current operating system from the kernel. Let’s see it action:

$ gcc -o myos myos.c
$ ./myos
My OS is Linux!

Great! We’re on Linux, so can further experiment with seccomp (it is a Linux-only feature). Notice that we’re properly handling the error code after invoking the uname system call. However, according to the man page it can only fail, when the passed in buffer pointer is invalid. And in this case the set error number will be “EINVAL”, which translates to invalid parameter. In our case, the “struct utsname” structure is being allocated on the stack, so our pointer will always be valid. In other words, in normal circumstances the uname system call should never fail in this particular program.

To illustrate seccomp capabilities we will add a “sandbox” function to our program before the main logic

myos_raw_seccomp.c:

#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
#include <sys/ptrace.h>
#include <sys/prctl.h>

#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>
#include <sys/utsname.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

static void sandbox(void)
{
    struct sock_filter filter[] = {
        /* seccomp(2) says we should always check the arch */
        /* as syscalls may have different numbers on different architectures */
        /* see https://fedora.juszkiewicz.com.pl/syscalls.html */
        /* for simplicity we only allow x86_64 */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))),
        /* if not x86_64, tell the kernel to kill the process */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 4),
        /* get the actual syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
        /* if "uname", tell the kernel to return EPERM, otherwise just allow */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_uname, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog prog = {
        .len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* see seccomp(2) on why this is needed */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("PR_SET_NO_NEW_PRIVS failed");
        exit(1);
    };

    /* glibc does not have a wrapper for seccomp(2) */
    /* invoke it via the generic syscall wrapper */
    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
        perror("seccomp failed");
        exit(1);
    };
}

int main(void)
{
    struct utsname name;

    sandbox();

    if (uname(&name)) {
        perror("uname failed");
        return 1;
    }

    printf("My OS is %s!\n", name.sysname);
    return 0;
}

To sandbox itself the application defines a BPF program, which implements the desired sandboxing policy. Then the application passes this program to the kernel via the seccomp system call. The kernel does some validation checks to ensure the BPF program is OK and then runs this program on every system call the application makes. The results of the execution of the program is used by the kernel to determine if the current call complies with the desired policy. In other words the BPF program is the “contract” between the application and the kernel.

In our toy example above, the BPF program simply checks which system call is about to be invoked. If the application is trying to use the uname system call we tell the kernel to just return a EPERM (which stands for “operation not permitted”) error code. We also tell the kernel to allow any other system call. Let’s see if it works now:

$ gcc -o myos myos_raw_seccomp.c
$ ./myos
uname failed: Operation not permitted

uname failed now with the EPERM error code and EPERM is not even described as a potential failure code in the uname manpage! So we know now that this happened because we “told” the kernel to prohibit us using the uname syscall and to return EPERM instead. We can double check this by replacing EPERM with some other error code, which is totally inappropriate for this context, for example ENETDOWN (“network is down”). Why would we need the network to be up to just get the currently executing OS? Yet, recompiling and rerunning the program we get:

$ gcc -o myos myos_raw_seccomp.c
$ ./myos
uname failed: Network is down

We can also verify the other part of our “contract” works as expected. We told the kernel to allow any other system call, remember? In our program, when uname fails, we convert the error code to a human readable message and print it on the screen with the perror function. To print on the screen perror uses the write system call under the hood and since we can actually see the printed error message, we know that the kernel allowed our program to make the write system call in the first place.

seccomp with libseccomp

While it is possible to use seccomp directly, as in the examples above, BPF programs are cumbersome to write by hand and hard to debug, review and update later. That’s why it is usually a good idea to use a more high-level library, which abstracts away most of the low-level details. Luckily such a library exists: it is called libseccomp and is even recommended by the seccomp man page.

Let’s rewrite our program’s sandbox() function to use this library instead:

myos_libseccomp.c:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>
#include <seccomp.h>
#include <err.h>

static void sandbox(void)
{
    /* allow all syscalls by default */
    scmp_filter_ctx seccomp_ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (!seccomp_ctx)
        err(1, "seccomp_init failed");

    /* kill the process, if it tries to use "uname" syscall */
    if (seccomp_rule_add_exact(seccomp_ctx, SCMP_ACT_KILL, seccomp_syscall_resolve_name("uname"), 0)) {
        perror("seccomp_rule_add_exact failed");
        exit(1);
    }

    /* apply the composed filter */
    if (seccomp_load(seccomp_ctx)) {
        perror("seccomp_load failed");
        exit(1);
    }

    /* release allocated context */
    seccomp_release(seccomp_ctx);
}

int main(void)
{
    struct utsname name;

    sandbox();

    if (uname(&name)) {
        perror("uname failed: ");
        return 1;
    }

    printf("My OS is %s!\n", name.sysname);
    return 0;
}

Our sandbox() function not only became shorter and much more readable, but also provided the ability to reference syscalls in our rules by names and not internal numbers as well as not having to deal with other quirks, like setting PR_SET_NO_NEW_PRIVS bit and dealing with system architectures.

It is worth noting we have modified our seccomp policy a bit. In the raw seccomp example above we instructed the kernel to return an error code when the application tries to execute a prohibited syscall. This is good for demonstration purposes, but in most cases a stricter action is required. Just returning an error code and allowing the application to continue gives the potentially malicious code a chance to bypass the policy. There are many syscalls in Linux and some of them do the same or similar things. For example, we might want to prohibit the application to read data from disk, so we deny the read syscall in our policy and tell the kernel to return an error code instead. However, if the application does get exploited, the exploit code/logic might look like below:

…
if (-1 == read(fd, buf, count)) {
    /* hm… read failed, but what about pread? */
    if (-1 == pread(fd, buf, count, offset) {
        /* what about readv? */ ...
    }
    /* bypassed the prohibited read(2) syscall */
}
…

Wait what?! There is more than one read system call? Yes, there are read, pread, readv as well as more obscure ones, like io_submit and io_uring_enter. Of course, it is our fault for providing incomplete seccomp policy, which does not block all possible read syscalls. But if at least we had instructed the kernel to terminate the process immediately upon violation of the first plain read, the malicious code above would not have the chance to be clever and try other options.

Given the above in the libseccomp example we have a stricter policy now, which tells the kernel to terminate the process upon the policy violation. Let’s see if it works:

$ gcc -o myos myos_libseccomp.c -lseccomp
$ ./myos
Bad system call

Notice that we need to link against libseccomp when compiling the application. Also, when we run the application, we don’t see the uname failed: Operation not permitted error output anymore, because we don’t give the application the ability to even print a failure message. Instead, we see a Bad system call message from the shell, which tells us that the application was terminated with a SIGSYS signal. Great!

zero code seccomp

The previous examples worked fine, but both of them have one disadvantage: we actually needed to modify the source code to embed our desired seccomp policy into the application. This is because seccomp syscall affects the calling process and its children, but there is no interface to inject the policy from “outside”. It is expected that developers will sandbox their code themselves as part of the application logic, but in practice this rarely happens. When developers are starting a new project, most of the time the focus is on primary functionality and security features are usually either postponed or omitted altogether. Also, most real-world software is usually written using some high-level programming language and/or a framework, where the developers do not deal with the system calls directly and probably are even unaware which system calls are being used by their code.

On the other hand we have system operators, sysadmins, SRE and other folks, who run the above code in production. They are more incentivized to keep production systems secure, thus would probably want to sandbox the services as much as possible. But most of the time they don’t have access to the source code. So there are mismatched expectations: developers have the ability to sandbox their code, but are usually not incentivized to do so and operators have the incentive to sandbox the code, but don’t have the ability.

This is where “zero code seccomp” might help, where an external operator can inject the desired sandbox policy into any process without needing to modify any source code. Systemd is one of the popular implementations of a “zero code seccomp” approach. Systemd-managed services can have a SystemCallFilter= directive defined in their unit files listing all the system calls the managed service is allowed to make. As an example, let’s go back to our toy application without any sandboxing code embedded:

$ gcc -o myos myos.c
$ ./myos
My OS is Linux!

Now we can run the same code with systemd, but prohibit the application for using uname without changing or recompiling any code (we’re using systemd-run to create an ephemeral systemd service unit for us):

$ systemd-run --user --pty --same-dir --wait --collect --service-type=exec --property="SystemCallFilter=~uname" ./myos
Running as unit: run-u0.service
Press ^] three times within 1s to disconnect TTY.
Finished with result: signal
Main processes terminated with: code=killed/status=SYS
Service runtime: 6ms

We don’t see the normal My OS is Linux! output anymore and systemd conveniently tells us that the managed process was terminated with a SIGSYS signal. We can even go further and use another directive SystemCallErrorNumber= to configure our seccomp policy not to terminate the application, but return an error code instead as in our first seccomp raw example:

$ systemd-run --user --pty --same-dir --wait --collect --service-type=exec --property="SystemCallFilter=~uname" --property="SystemCallErrorNumber=ENETDOWN" ./myos
Running as unit: run-u2.service
Press ^] three times within 1s to disconnect TTY.
uname failed: Network is down
Finished with result: exit-code
Main processes terminated with: code=exited/status=1
Service runtime: 6ms

systemd small print

Great! We can now inject almost any seccomp policy into any process without the need to write any code or recompile the application. However, there is an interesting statement in the systemd documentation:

…Note that the execve, exit, exit_group, getrlimit, rt_sigreturn, sigreturn system calls and the system calls for querying time and sleeping are implicitly whitelisted and do not need to be listed explicitly…

Some system calls are implicitly allowed and we don’t have to list them. This is mostly related to the way how systemd manages processes and injects the seccomp policy. We established earlier that seccomp policy applies to the current process and its children. So, to inject the policy, systemd forks itself, calls seccomp in the forked process and then execs the forked process into the target application. That’s why always allowing the execve system call is necessary in the first place, because otherwise systemd cannot do its job as a service manager.

But what if we want to explicitly prohibit some of these system calls? If we continue with the execve as an example, that can actually be a dangerous system call most applications would want to prohibit. Seccomp is an effective tool to protect the code from arbitrary code execution exploits, remember? If a malicious actor takes over our code, most likely the first thing they will try is to get a shell (or replace our code with any other application which is easier to control) by directing our code to call execve with the desired binary. So, if our code does not need execve for its main functionality, it would be a good idea to prohibit it. Unfortunately, it is not possible with the systemd SystemCallFilter= approach…

Introducing Cloudflare sandbox

We really liked the “zero code seccomp” approach with systemd SystemCallFilter= directive, but were not satisfied with its limitations. We decided to take it one step further and make it possible to prohibit any system call in any process externally without touching its source code, so came up with the Cloudflare sandbox. It’s a simple standalone toolkit consisting of a shared library and an executable. The shared library is supposed to be used with dynamically linked applications and the executable is for statically linked applications.

sandboxing dynamically linked executables

For dynamically linked executables it is possible to inject custom code into the process by utilizing the LD_PRELOAD environment variable. The libsandbox.so shared library from our toolkit also contains a so-called initialization routine, which should be executed before the main logic. This is how we make the target application sandbox itself:

  • LD_PRELOAD tells the dynamic loader to load our libsandbox.so as part of the application, when it starts
  • the runtime executes the initialization routine from the libsandbox.so before most of the main logic
  • our initialization routine configures the sandbox policy described in special environment variables
  • by the time the main application logic begin executing, the target process has the configured seccomp policy enforced

Let’s see how it works with our myos toy tool. First, we need to make sure it is actually a dynamically linked application:

$ ldd ./myos
	linux-vdso.so.1 (0x00007ffd8e1e3000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f339ddfb000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f339dfcf000)

Yes, it is . Now, let’s prohibit it from using the uname system call with our toolkit:

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_DENY=uname ./myos
adding uname to the process seccomp filter
Bad system call

Yet again, we’ve managed to inject our desired seccomp policy into the myos application without modifying or recompiling it. The advantage of this approach is that it doesn’t have the shortcomings of the systemd’s SystemCallFilter= and we can block any system call (luckily Bash is a dynamically linked application as well):

$ /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
I will try to execve something...
Doing arbitrary code execution!!!
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_DENY=execve /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
adding execve to the process seccomp filter
I will try to execve something...
Bad system call

The only problem here is that we may accidentally forget to LD_PRELOAD our libsandbox.so library and potentially run unprotected. Also, as described in the man page, LD_PRELOAD has some limitations. We can overcome all these problems by making libsandbox.so a permanent part of our target application:

$ patchelf --add-needed /usr/lib/x86_64-linux-gnu/libsandbox.so ./myos
$ ldd ./myos
	linux-vdso.so.1 (0x00007fff835ae000)
	/usr/lib/x86_64-linux-gnu/libsandbox.so (0x00007fc4f55f2000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc4f5425000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fc4f5647000)

Again, we didn’t need access to the source code here, but patched the compiled binary instead. Now we can just configure our seccomp policy as before without the need of LD_PRELOAD:

$ ./myos
My OS is Linux!
$ SECCOMP_SYSCALL_DENY=uname ./myos
adding uname to the process seccomp filter
Bad system call

sandboxing statically linked executables

The above method is quite convenient and easy, but it doesn’t work for statically linked executables:

$ gcc -static -o myos myos.c
$ ldd ./myos
	not a dynamic executable
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_DENY=uname ./myos
My OS is Linux!

This is because there is no dynamic loader involved in starting a statically linked executable, so LD_PRELOAD has no effect. For this case our toolkit contains a special application launcher, which will inject the seccomp rules similarly to the way systemd does it:

$ sandboxify ./myos
My OS is Linux!
$ SECCOMP_SYSCALL_DENY=uname sandboxify ./myos
adding uname to the process seccomp filter

Note that we don’t see the Bad system call shell message anymore, because our target executable is being started by the launcher instead of the shell directly. Unlike systemd however, we can use this launcher to block dangerous system calls, like execve, as well:

$ sandboxify /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
I will try to execve something...
Doing arbitrary code execution!!!
SECCOMP_SYSCALL_DENY=execve sandboxify /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
adding execve to the process seccomp filter
I will try to execve something...

sandboxify vs libsandbox.so

From the examples above you may notice that it is possible to use sandboxify with dynamically linked executables as well, so why even bother with libsandbox.so? The difference becomes visible, when we start using not the “denylist” policy as in most examples in this post, but rather the preferred “allowlist” policy, where we explicitly allow only the system calls we need, but prohibit everything else.

Let’s convert our toy application back into the dynamically-linked one and try to come up with the minimal list of allowed system calls it needs to function properly:

$ gcc -o myos myos.c
$ ldd ./myos
	linux-vdso.so.1 (0x00007ffe027f6000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4f1410a000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4f142de000)
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_ALLOW=exit_group:fstat:uname:write ./myos
adding exit_group to the process seccomp filter
adding fstat to the process seccomp filter
adding uname to the process seccomp filter
adding write to the process seccomp filter
My OS is Linux

So we need to allow 4 system calls: exit_group:fstat:uname:write. This is the tightest “sandbox”, which still doesn’t break the application. If we remove any system call from this list, the application will terminate with the Bad system call message (try it yourself!).

If we use the same allowlist, but with the sandboxify launcher, things do not work anymore:

$ SECCOMP_SYSCALL_ALLOW=exit_group:fstat:uname:write sandboxify ./myos
adding exit_group to the process seccomp filter
adding fstat to the process seccomp filter
adding uname to the process seccomp filter
adding write to the process seccomp filter

The reason is sandboxify and libsandbox.so inject seccomp rules at different stages of the process lifecycle. Consider the following very high level diagram of a process startup:

Sandboxing in Linux with zero lines of code

In a nutshell, every process has two runtime stages: “runtime init” and the “main logic”. The main logic is basically the code, which is located in the program main() function and other code put there by the application developers. But the process usually needs to do some work before the code from the main() function is able to execute – we call this work the “runtime init” on the diagram above. Developers do not write this code directly, but most of the time this code is automatically generated by the compiler toolchain, which is used to compile the source code.

To do its job, the “runtime init” stage uses a lot of different system calls, but most of them are not needed later at the “main logic” stage. If we’re using the “allowlist” approach for our sandboxing, it does not make sense to allow these system calls for the whole duration of the program, if they are only used once on program init. This is where the difference between libsandbox.so and sandboxify comes from: libsandbox.so enforces the seccomp rules usually after the “runtime init” stage has already executed, so we don’t have to allow most system calls from that stage. sandboxify on the other hand enforces the policy before the “runtime init” stage, so we have to allow all the system calls from both stages, which usually results in a bigger allowlist, thus wider attack surface.

Going back to our toy myos example, here is the minimal list of all the system calls we need to allow to make the application work under our sandbox:

$ SECCOMP_SYSCALL_ALLOW=access:arch_prctl:brk:close:exit_group:fstat:mmap:mprotect:munmap:openat:read:uname:write sandboxify ./myos
adding access to the process seccomp filter
adding arch_prctl to the process seccomp filter
adding brk to the process seccomp filter
adding close to the process seccomp filter
adding exit_group to the process seccomp filter
adding fstat to the process seccomp filter
adding mmap to the process seccomp filter
adding mprotect to the process seccomp filter
adding munmap to the process seccomp filter
adding openat to the process seccomp filter
adding read to the process seccomp filter
adding uname to the process seccomp filter
adding write to the process seccomp filter
My OS is Linux!

It is 13 syscalls vs 4 syscalls, if we’re using the libsandbox.so approach!

Conclusions

In this post we discussed how to easily sandbox applications on Linux without the need to write any additional code. We introduced the Cloudflare sandbox toolkit and discussed the different approaches we take at sandboxing dynamically linked applications vs statically linked applications.

Having safer code online helps to build a Better Internet and we would be happy if you find our sandbox toolkit useful. Looking forward to the feedback, improvements and other contributions!

Learning with Raspberry Pi — robotics, a Master’s degree, and beyond

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/learning-with-raspberry-pi-robotics-a-masters-degree-and-beyond/

Meet Callum Fawcett, who shares his journey from tinkering with the first Raspberry Pi while he was at school, to a Master’s degree in computer science and a real-life job in programming. We also get to see some of the awesome projects he’s made along the way.

I first decided to get a Raspberry Pi at the age of 14. I had already started programming a little bit before and found that I really enjoyed the language Python. At the time the first Raspberry Pi came out, my History teacher told us about them and how they would be a great device to use to learn programming. I decided to ask for one to help me learn more. I didn’t really know what I would use it for or how it would even work, but after a little bit of help at the start, I quickly began making small programs in Python. I remember some of my first programs being very simple dictionary-type programs in which I would match English words to German to help with my German homework.

Learning Linux, C++, and Python

Most of my learning was done through two sources. I learnt Linux and how the terminal worked using online resources such as Stack Overflow. I would have a problem that I needed to solve, look up solutions online, and try out commands that I found. This was perhaps the hardest part of learning how to use a Raspberry Pi, as it was something I had never done before, but it really helped me in later years when I would use Linux more than Windows. For learning programming, I preferred to use books. I had a book for C++ and a book for Python that I would work through. These were game-based books, so many of the fun projects that I did were simple text-based games where you typed in responses to questions.

A family robotics project

The first robot Callum made using a Raspberry Pi

By far the coolest project I did with the Raspberry Pi was to build a small robot (shown above). This was a joint project between myself and my dad. He sorted out the electronics and I programmed the robot. It was a great opportunity to learn about robotics and refine my programming skills. By the end, the robot was capable of moving around by itself, driving into objects, and then reversing and trying a new direction. It was almost like an unintelligent Roomba that couldn’t hoover, but I spent many hours improving small bits and pieces to make it as easy to use as possible. My one wish that I never managed to achieve with my robot was allowing it to map out its surroundings. This was a very ambitious project at the time, since I was still quite inexperienced in programming. The biggest problem with this was calibrating the robot’s turning circle, which was never consistent so it was very hard to have the robot know where in the room it was.

Sense HAT maze game

Another fun project that I worked on used the Sense HAT developed for the Astro Pi computers for use on the International Space Station. Using this, I was able to make a memory maze game (shown below), in which a player is shown a maze for several seconds and then has to navigate that maze from memory by shaking the device. This was my first introduction to using more interactive types of input, and this eventually led to my final-year project, which used these interesting interactions to develop another way of teaching.

Learning programming without formal lessons

I have now just finished my Master’s degree in computer science at the University of Bristol. Before going to university, I had no experience of being taught programming in a formal environment. It was not a taught subject at my secondary school or sixth form. I wanted to get more people at my school interested in this area of study though, which I did by running a coding club for people. I would help others debug their code and discuss interesting problems with them. The reason that I chose to study computer science is largely because of my experiences with Raspberry Pi and other programming I did in my own time during my teenage years. I likely would have studied history if it weren’t for the programming I had done by myself making robots and other games.

Raspberry Pi has continued to play a part in my degree and extra-curricular activities; I used them in two large projects during my time at university and used a similar device in my final project. My robot experience also helped me to enter my university’s ‘Robot Wars’ competition which, though we never won, was a lot of fun.

A tool for learning and a device for industry

Having a Raspberry Pi is always useful during a hackathon, because it’s such a versatile component. Tech like Raspberry Pi will always be useful for beginners to learn the basics of programming and electronics, but these computers are also becoming more and more useful for people with more experience to make fun and useful projects. I could see tech like Raspberry Pi being used in the future to help quickly prototype many types of electronic devices and, as they become more powerful, even being used as an affordable way of controlling many types of robots, which will become more common in the future.

Our guest blogger Callum

Now I am going on to work on programming robot control systems at Ocado Technology. My experiences of robot building during my years before university played a large part in this decision. Already, robots are becoming a huge part of society, and I think they are only going to become more prominent in the future. Automation through robots and artificial intelligence will become one of the most important tools for humanity during the 21st century, and I look forward to being a part of that process. If it weren’t for learning through Raspberry Pi, I certainly wouldn’t be in this position.

Cheers for your story, Callum! Has tinkering with our tiny computer inspired your educational or professional choices? Let us know in the comments below. 

The post Learning with Raspberry Pi — robotics, a Master’s degree, and beyond appeared first on Raspberry Pi.

Attack Against PC Thunderbolt Port

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2020/05/attack_against_2.html

The attack requires physical access to the computer, but it’s pretty devastating:

On Thunderbolt-enabled Windows or Linux PCs manufactured before 2019, his technique can bypass the login screen of a sleeping or locked computer — and even its hard disk encryption — to gain full access to the computer’s data. And while his attack in many cases requires opening a target laptop’s case with a screwdriver, it leaves no trace of intrusion and can be pulled off in just a few minutes. That opens a new avenue to what the security industry calls an “evil maid attack,” the threat of any hacker who can get alone time with a computer in, say, a hotel room. Ruytenberg says there’s no easy software fix, only disabling the Thunderbolt port altogether.

“All the evil maid needs to do is unscrew the backplate, attach a device momentarily, reprogram the firmware, reattach the backplate, and the evil maid gets full access to the laptop,” says Ruytenberg, who plans to present his Thunderspy research at the Black Hat security conference this summer­or the virtual conference that may replace it. “All of this can be done in under five minutes.”

Lots of details in the article above, and in the attack website. (We know it’s a modern hack, because it comes with its own website and logo.)

Intel responds.

EDITED TO ADD (5/14): More.

Conntrack tales – one thousand and one flows

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/conntrack-tales-one-thousand-and-one-flows/

Conntrack tales - one thousand and one flows

At Cloudflare we develop new products at a great pace. Their needs often challenge the architectural assumptions we made in the past. For example, years ago we decided to avoid using Linux’s "conntrack" – stateful firewall facility. This brought great benefits – it simplified our iptables firewall setup, sped up the system a bit and made the inbound packet path easier to understand.

But eventually our needs changed. One of our new products had a reasonable need for it. But we weren’t confident – can we just enable conntrack and move on? How does it actually work? I volunteered to help the team understand the dark corners of the "conntrack" subsystem.

What is conntrack?

"Conntrack" is a part of Linux network stack, specifically part of the firewall subsystem. To put that into perspective: early firewalls were entirely stateless. They could express only basic logic, like: allow SYN packets to port 80 and 443, and block everything else.

The stateless design gave some basic network security, but was quickly deemed insufficient. You see, there are certain things that can’t be expressed in a stateless way. The canonical example is assessment of ACK packets – it’s impossible to say if an ACK packet is legitimate or part of a port scanning attempt, without tracking the connection state.

To fill such gaps all the operating systems implemented connection tracking inside their firewalls. This tracking is usually implemented as a big table, with at least 6 columns: protocol (usually TCP or UDP), source IP, source port, destination IP, destination port and connection state. On Linux this subsystem is called "conntrack" and is often enabled by default. Here’s how the table looks on my laptop inspected with "conntrack -L" command:

Conntrack tales - one thousand and one flows

The obvious question is how large this state tracking table can be. This setting is under "/proc/sys/net/nf_conntrack_max":

$ cat /proc/sys/net/nf_conntrack_max
262144

This is a global setting, but the limit is per per-container. On my system each container, or "network namespace", can have up to 256K conntrack entries.

What exactly happens when the number of concurrent connections exceeds the conntrack limit?

Testing conntrack is hard

In past testing conntrack was hard – it required complex hardware or vm setup. Fortunately, these days we can use modern "user namespace" facilities which do permission magic, allowing an unprivileged user to feel like root. Using the tool "unshare" it’s possible to create an isolated environment where we can precisely control the packets going through and experiment with iptables and conntrack without threatening the health of our host system. With appropriate parameters it’s possible to create and manage a networking namespace, including access to namespaced iptables and conntrack, from an unprivileged user.

This script is the heart of our test:

# Enable tun interface
ip tuntap add name tun0 mode tun
ip link set tun0 up
ip addr add 192.0.2.1 peer 192.0.2.2 dev tun0
ip route add 0.0.0.0/0 via 192.0.2.2 dev tun0

# Refer to conntrack at least once to ensure it's enabled
iptables -t raw -A PREROUTING -j CT
# Create a counter in mangle table
iptables -t mangle -A PREROUTING
# Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP

tcpdump -ni any -B 16384 -ttt &
...
./venv/bin/python3 send_syn.py

conntrack -L
# Show iptables counters
iptables -nvx -t raw -L PREROUTING
iptables -nvx -t mangle -L PREROUTING

This bash script is shortened for readability. See the full version here. The accompanying "send_syn.py" is just sending 10 SYN packets over "tun0" interface. Here is the source but allow me to paste it here – showing off "scapy" is always fun:

tun = TunTapInterface("tun0", mode_tun=True)
tun.open()

for i in range(10000,10000+10):
    ip=IP(src="198.18.0.2", dst="192.0.2.1")
    tcp=TCP(sport=i, dport=80, flags="S")
    send(ip/tcp, verbose=False, inter=0.01, socket=tun)

The bash script above contains a couple of gems. Let’s walk through them.

First, please note that we can’t just inject packets into the loopback interface using SOCK_RAW sockets. The Linux networking stack is a complex beast. The semantics of sending packets over a SOCK_RAW are different then delivering a packet over a real interface. We’ll discuss this later, but for now, to avoid triggering unexpected behaviour, we will deliver packets over a tun/tap device which better emulates a real interface.

Then we need to make sure the conntrack is active in the network namespace we wish to use for testing. Traditionally, just loading the kernel module would have done that, but in the brave new world of containers and network namespaces, a method had to be found to allow conntrack to be active in some and inactive in other containers. Hence this is tied to usage – rules referencing conntrack must exist in the namespace’s iptables for conntrack to be active inside the container.

As a side note, containers triggering host to load kernel modules is an interesting subject.

After the "-t raw -A PREROUTING" rule, which we added "-t mangle -A PREROUTING" rule, but notice – it doesn’t have any action! This syntax is allowed by iptables and it is pretty useful to get iptables to report rule counters. We’ll need these counters soon. A careful reader might suggest looking at "policy" counters in iptables to achieve our goal. Sadly, "policy" counters (increased for each packet entering a chain), work only if there is at least one rule inside it.

The rest of the steps are self-explanatory. We set up "tcpdump" in the background, send 10 SYN packets to 127.0.0.1:80 using the "scapy" Python library. Ten we print the conntrack table and iptables counters.

Let’s run this script in action. Remember to run it under networking namespace as fake root with "unshare -Ur -n":

Conntrack tales - one thousand and one flows

This is all nice. First we see a "tcpdump" listing showing 10 SYN packets. Then we see the conntrack table state, showing 10 created flows. Finally, we see iptables counters in two rules we created, each showing 10 packets processed.

Can conntrack table fill up?

Given that the conntrack table is size constrained, what exactly happens when it fills up? Let’s check it out. First, we need to drop the conntrack size. As mentioned it’s controlled by a global toggle – it’s necessary to tune it on the host side. Let’s reduce the table size to 7 entries, and repeat our test:

Conntrack tales - one thousand and one flows

This is getting interesting. We still see the 10 inbound SYN packets. We still see that the "-t raw PREROUTING" table received 10 packets, but this is where similarities end. The "-t mangle PREROUTING" table saw only 7 packets. Where did the three missing SYN packets go?

It turns out they went where all the dead packets go. They were hard dropped. Conntrack on overfill does exactly that. It even complains in the "dmesg":

Conntrack tales - one thousand and one flows

This is confirmed by our iptables counters. Let’s review the famous iptables diagram:

Conntrack tales - one thousand and one flows
image by Jan Engelhardt CC BY-SA 3.0

As we can see, the "-t raw PREROUTING" happens before conntrack, while "-t mangle PREROUTING" is just after it. This is why we see 10 and 7 packets reported by our iptables counters.

Let me emphasize the gravity of our discovery. We showed three completely valid SYN packets being implicitly dropped by "conntrack". There is no explicit "-j DROP" iptables rule. There is no configuration to be toggled. Just the fact of using "conntrack" means that, when it’s full, packets creating new flows will be dropped. No questions asked.

This is the dark side of using conntrack. If you use it, you absolutely must make sure it doesn’t get filled.

We could end our investigation here, but there are a couple of interesting caveats.

Strict vs loose

Conntrack supports a "strict" and "loose" mode, as configured by "nf_conntrack_tcp_loose" toggle.

$ cat /proc/sys/net/netfilter/nf_conntrack_tcp_loose
1

By default, it’s set to "loose" which means that stray ACK packets for unseen TCP flows will create new flow entries in the table. We can generalize: "conntrack" will implicitly drop all the packets that create new flow, whether that’s SYN or just stray ACK.

What happens when we clear the "nf_conntrack_tcp_loose=0" setting? This is a subject for another blog post, but suffice to say – mess. First, this setting is not settable in the network namespace scope – although it should be. To test it you need to be in the root network namespace. Then, due to twisted logic the ACK will be dropped on a full conntrack table, even though in this case it doesn’t create a flow. If the table is not full, the ACK packet will pass through it, having "-ctstate INVALID" from "mangle" table forward.

When a conntrack entry is not created?

There are important situations when conntrack entry is not created. For example, we could replace these line in our script:

# Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP

With those:

# Make sure inbound SYN packets don't go to networking stack
iptables -A INPUT -j DROP

Naively we could think dropping SYN packets past the conntrack layer would not interfere with the created flows. This is not correct. In spite of these SYN packets having been seen by conntrack, no flow state is created for them. Packets hitting "-j DROP" will not create new conntrack flows. Pretty magical, isn’t it?

Full Conntrack causes with EPERM

Recently we hit a case when a "sendto()" syscall on UDP socket from one of our applications was erroring with EPERM. This is pretty weird, and not documented in the man page. My colleague had no doubts:

Conntrack tales - one thousand and one flows

I’ll save you the gruesome details, but indeed, the full conntrack table will do that to your new UDP flows – you will get EPERM. Beware. Funnily enough, it’s possible to get EPERM if an outbound packet is dropped on OUTPUT firewall in other ways. For example:

marek:~$ sudo iptables -I OUTPUT -p udp --dport 53 --dst 192.0.2.8 -j DROP
marek:~$ strace -e trace=write nc -vu 192.0.2.8 53
write(3, "X", 1)                        = -1 EPERM (Operation not permitted)
+++ exited with 1 +++

If you ever receive EPERM from "sendto()", you might want to treat it as a transient error, if you suspect a filled conntrack problem, or permanent error if you blame iptables configuration.

This is also why we can’t send our SYN packets directly using SOCK_RAW sockets in our test. Let’s see what happens on conntrack overfill with standard "hping3" tool:

$ hping3 -S -i u10000 -c 10 --spoof 192.18.0.2 192.0.2.1 -p 80 -I lo
HPING 192.0.2.1 (lo 192.0.2.1): S set, 40 headers + 0 data bytes
[send_ip] sendto: Operation not permitted

"send()" even on a SOCK_RAW socket fails with EPERM when conntrack table is full.

Full conntrack can happen on a SYN flood

There is one more caveat. During a SYN flood, the conntrack entries will totally be created for the spoofed flows. Take a look at second test case we prepared, this time correctly listening on port 80, and sending SYN+ACK:

Conntrack tales - one thousand and one flows

We can see 7 SYN+ACK’s flying out of the port 80 listening socket. The final three SYN’s go nowhere as they are dropped by conntrack.

This has important implications. If you use conntrack on publicly accessible ports, during SYN flood mitigation technologies like SYN Cookies won’t help. You are still at risk of running out of conntrack space and therefore affecting legitimate connections.

For this reason, as a rule of thumb consider avoiding conntrack on inbound connections (-j NOTRACK). Alternatively having some reasonable rate limits on iptables layer, doing "-j DROP". This will work well and won’t create new flows, as we discussed above. The best method though, would be to trigger SYN Cookies from a layer before conntrack, like XDP. But this is a subject for another time.

Summary

Over the years Linux conntrack has gone through many changes and has improved a lot. While performance used to be a major concern, these days it’s considered to be very fast. Dark corners remain. Correctly applying conntrack is tricky.

In this blog post we showed how it’s possible to test parts of conntrack with "unshare" and a series of scripts. We showed the behaviour when the conntrack table gets filled – packets might implicitly be dropped. Finally, we mentioned the curious case of SYN floods where incorrectly applied conntrack may cause harm.

Stay tuned for more horror stories as we dig deeper and deeper into the Linux networking stack guts.

Speeding up Linux disk encryption

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Speeding up Linux disk encryption

Data encryption at rest is a must-have for any modern Internet company. Many companies, however, don’t encrypt their disks, because they fear the potential performance penalty caused by encryption overhead.

Encrypting data at rest is vital for Cloudflare with more than 200 data centres across the world. In this post, we will investigate the performance of disk encryption on Linux and explain how we made it at least two times faster for ourselves and our customers!

Encrypting data at rest

When it comes to encrypting data at rest there are several ways it can be implemented on a modern operating system (OS). Available techniques are tightly coupled with a typical OS storage stack. A simplified version of the storage stack and encryption solutions can be found on the diagram below:

Speeding up Linux disk encryption

On the top of the stack are applications, which read and write data in files (or streams). The file system in the OS kernel keeps track of which blocks of the underlying block device belong to which files and translates these file reads and writes into block reads and writes, however the hardware specifics of the underlying storage device is abstracted away from the filesystem. Finally, the block subsystem actually passes the block reads and writes to the underlying hardware using appropriate device drivers.

The concept of the storage stack is actually similar to the well-known network OSI model, where each layer has a more high-level view of the information and the implementation details of the lower layers are abstracted away from the upper layers. And, similar to the OSI model, one can apply encryption at different layers (think about TLS vs IPsec or a VPN).

For data at rest we can apply encryption either at the block layers (either in hardware or in software) or at the file level (either directly in applications or in the filesystem).

Block vs file encryption

Generally, the higher in the stack we apply encryption, the more flexibility we have. With application level encryption the application maintainers can apply any encryption code they please to any particular data they need. The downside of this approach is they actually have to implement it themselves and encryption in general is not very developer-friendly: one has to know the ins and outs of a specific cryptographic algorithm, properly generate keys, nonces, IVs etc. Additionally, application level encryption does not leverage OS-level caching and Linux page cache in particular: each time the application needs to use the data, it has to either decrypt it again, wasting CPU cycles, or implement its own decrypted “cache”, which introduces more complexity to the code.

File system level encryption makes data encryption transparent to applications, because the file system itself encrypts the data before passing it to the block subsystem, so files are encrypted regardless if the application has crypto support or not. Also, file systems can be configured to encrypt only a particular directory or have different keys for different files. This flexibility, however, comes at a cost of a more complex configuration. File system encryption is also considered less secure than block device encryption as only the contents of the files are encrypted. Files also have associated metadata, like file size, the number of files, the directory tree layout etc., which are still visible to a potential adversary.

Encryption down at the block layer (often referred to as disk encryption or full disk encryption) also makes data encryption transparent to applications and even whole file systems. Unlike file system level encryption it encrypts all data on the disk including file metadata and even free space. It is less flexible though – one can only encrypt the whole disk with a single key, so there is no per-directory, per-file or per-user configuration. From the crypto perspective, not all cryptographic algorithms can be used as the block layer doesn’t have a high-level overview of the data anymore, so it needs to process each block independently. Most common algorithms require some sort of block chaining to be secure, so are not applicable to disk encryption. Instead, special modes were developed just for this specific use-case.

So which layer to choose? As always, it depends… Application and file system level encryption are usually the preferred choice for client systems because of the flexibility. For example, each user on a multi-user desktop may want to encrypt their home directory with a key they own and leave some shared directories unencrypted. On the contrary, on server systems, managed by SaaS/PaaS/IaaS companies (including Cloudflare) the preferred choice is configuration simplicity and security – with full disk encryption enabled any data from any application is automatically encrypted with no exceptions or overrides. We believe that all data needs to be protected without sorting it into "important" vs "not important" buckets, so the selective flexibility the upper layers provide is not needed.

Hardware vs software disk encryption

When encrypting data at the block layer it is possible to do it directly in the storage hardware, if the hardware supports it. Doing so usually gives better read/write performance and consumes less resources from the host. However, since most hardware firmware is proprietary, it does not receive as much attention and review from the security community. In the past this led to flaws in some implementations of hardware disk encryption, which render the whole security model useless. Microsoft, for example, started to prefer software-based disk encryption since then.

We didn’t want to put our data and our customers’ data to the risk of using potentially insecure solutions and we strongly believe in open-source. That’s why we rely only on software disk encryption in the Linux kernel, which is open and has been audited by many security professionals across the world.

Linux disk encryption performance

We aim not only to save bandwidth costs for our customers, but to deliver content to Internet users as fast as possible.

At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed-to-be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.

Device mapper and dm-crypt

Linux implements transparent disk encryption via a dm-crypt module and dm-crypt itself is part of device mapper kernel framework. In a nutshell, the device mapper allows pre/post-process IO requests as they travel between the file system and the underlying block device.

dm-crypt in particular encrypts "write" IO requests before sending them further down the stack to the actual block device and decrypts "read" IO requests before sending them up to the file system driver. Simple and easy! Or is it?

Benchmarking setup

For the record, the numbers in this post were obtained by running specified commands on an idle Cloudflare G9 server out of production. However, the setup should be easily reproducible on any modern x86 laptop.

Generally, benchmarking anything around a storage stack is hard because of the noise introduced by the storage hardware itself. Not all disks are created equal, so for the purpose of this post we will use the fastest disks available out there – that is no disks.

Instead Linux has an option to emulate a disk directly in RAM. Since RAM is much faster than any persistent storage, it should introduce little bias in our results.

The following command creates a 4GB ramdisk:

$ sudo modprobe brd rd_nr=1 rd_size=4194304
$ ls /dev/ram0

Now we can set up a dm-crypt instance on top of it thus enabling encryption for the disk. First, we need to generate the disk encryption key, "format" the disk and specify a password to unlock the newly generated key.

$ fallocate -l 2M crypthdr.img
$ sudo cryptsetup luksFormat /dev/ram0 --header crypthdr.img

WARNING!
========
This will overwrite data on crypthdr.img irrevocably.

Are you sure? (Type uppercase yes): YES
Enter passphrase:
Verify passphrase:

Those who are familiar with LUKS/dm-crypt might have noticed we used a LUKS detached header here. Normally, LUKS stores the password-encrypted disk encryption key on the same disk as the data, but since we want to compare read/write performance between encrypted and unencrypted devices, we might accidentally overwrite the encrypted key during our benchmarking later. Keeping the encrypted key in a separate file avoids this problem for the purposes of this post.

Now, we can actually "unlock" the encrypted device for our testing:

$ sudo cryptsetup open --header crypthdr.img /dev/ram0 encrypted-ram0
Enter passphrase for /dev/ram0:
$ ls /dev/mapper/encrypted-ram0
/dev/mapper/encrypted-ram0

At this point we can now compare the performance of encrypted vs unencrypted ramdisk: if we read/write data to /dev/ram0, it will be stored in plaintext. Likewise, if we read/write data to /dev/mapper/encrypted-ram0, it will be decrypted/encrypted on the way by dm-crypt and stored in ciphertext.

It’s worth noting that we’re not creating any file system on top of our block devices to avoid biasing results with a file system overhead.

Measuring throughput

When it comes to storage testing/benchmarking Flexible I/O tester is the usual go-to solution. Let’s simulate simple sequential read/write load with 4K block size on the ramdisk without encryption:

$ sudo fio --filename=/dev/ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=plain
plain: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=21013MB, aggrb=1126.5MB/s, minb=1126.5MB/s, maxb=1126.5MB/s, mint=18655msec, maxt=18655msec
  WRITE: io=21023MB, aggrb=1126.1MB/s, minb=1126.1MB/s, maxb=1126.1MB/s, mint=18655msec, maxt=18655msec

Disk stats (read/write):
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

The above command will run for a long time, so we just stop it after a while. As we can see from the stats, we’re able to read and write roughly with the same throughput around 1126 MB/s. Let’s repeat the test with the encrypted ramdisk:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=1693.7MB, aggrb=150874KB/s, minb=150874KB/s, maxb=150874KB/s, mint=11491msec, maxt=11491msec
  WRITE: io=1696.4MB, aggrb=151170KB/s, minb=151170KB/s, maxb=151170KB/s, mint=11491msec, maxt=11491msec

Whoa, that’s a drop! We only get ~147 MB/s now, which is more than 7 times slower! And this is on a totally idle machine!

Maybe, crypto is just slow

The first thing we considered is to ensure we use the fastest crypto. cryptsetup allows us to benchmark all the available crypto implementations on the system to select the best one:

$ sudo cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1340890 iterations per second for 256-bit key
PBKDF2-sha256    1539759 iterations per second for 256-bit key
PBKDF2-sha512    1205259 iterations per second for 256-bit key
PBKDF2-ripemd160  967321 iterations per second for 256-bit key
PBKDF2-whirlpool  720175 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   969.7 MiB/s  3110.0 MiB/s
 serpent-cbc   128b           N/A           N/A
 twofish-cbc   128b           N/A           N/A
     aes-cbc   256b   756.1 MiB/s  2474.7 MiB/s
 serpent-cbc   256b           N/A           N/A
 twofish-cbc   256b           N/A           N/A
     aes-xts   256b  1823.1 MiB/s  1900.3 MiB/s
 serpent-xts   256b           N/A           N/A
 twofish-xts   256b           N/A           N/A
     aes-xts   512b  1724.4 MiB/s  1765.8 MiB/s
 serpent-xts   512b           N/A           N/A
 twofish-xts   512b           N/A           N/A

It seems aes-xts with a 256-bit data encryption key is the fastest here. But which one are we actually using for our encrypted ramdisk?

$ sudo dmsetup table /dev/mapper/encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0

We do use aes-xts with a 256-bit data encryption key (count all the zeroes conveniently masked by dmsetup tool – if you want to see the actual bytes, add the --showkeys option to the above command). The numbers do not add up however: cryptsetup benchmark tells us above not to rely on the results, as "Tests are approximate using memory only (no storage IO)", but that is exactly how we’ve set up our experiment using the ramdisk. In a somewhat worse case (assuming we’re reading all the data and then encrypting/decrypting it sequentially with no parallelism) doing back-of-the-envelope calculation we should be getting around (1126 * 1823) / (1126 + 1823) =~696 MB/s, which is still quite far from the actual 147 * 2 = 294 MB/s (total for reads and writes).

dm-crypt performance flags

While reading the cryptsetup man page we noticed that it has two options prefixed with --perf-, which are probably related to performance tuning. The first one is --perf-same_cpu_crypt with a rather cryptic description:

Perform encryption using the same cpu that IO was submitted on.  The default is to use an unbound workqueue so that encryption work is automatically balanced between available CPUs.  This option is only relevant for open action.

So we enable the option

$ sudo cryptsetup close encrypted-ram0
$ sudo cryptsetup open --header crypthdr.img --perf-same_cpu_crypt /dev/ram0 encrypted-ram0

Note: according to the latest man page there is also a cryptsetup refresh command, which can be used to enable these options live without having to "close" and "re-open" the encrypted device. Our cryptsetup however didn’t support it yet.

Verifying if the option has been really enabled:

$ sudo dmsetup table encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 same_cpu_crypt

Yes, we can now see same_cpu_crypt in the output, which is what we wanted. Let’s rerun the benchmark:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=1596.6MB, aggrb=139811KB/s, minb=139811KB/s, maxb=139811KB/s, mint=11693msec, maxt=11693msec
  WRITE: io=1600.9MB, aggrb=140192KB/s, minb=140192KB/s, maxb=140192KB/s, mint=11693msec, maxt=11693msec

Hmm, now it is ~136 MB/s which is slightly worse than before, so no good. What about the second option --perf-submit_from_crypt_cpus:

Disable offloading writes to a separate thread after encryption.  There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly.  The default is to offload write bios to the same thread.  This option is only relevant for open action.

Maybe, we are in the "some situation" here, so let’s try it out:

$ sudo cryptsetup close encrypted-ram0
$ sudo cryptsetup open --header crypthdr.img --perf-submit_from_crypt_cpus /dev/ram0 encrypted-ram0
Enter passphrase for /dev/ram0:
$ sudo dmsetup table encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 submit_from_crypt_cpus

And now the benchmark:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
   READ: io=2066.6MB, aggrb=169835KB/s, minb=169835KB/s, maxb=169835KB/s, mint=12457msec, maxt=12457msec
  WRITE: io=2067.7MB, aggrb=169965KB/s, minb=169965KB/s, maxb=169965KB/s, mint=12457msec, maxt=12457msec

~166 MB/s, which is a bit better, but still not good…

Asking the community

Being desperate we decided to seek support from the Internet and posted our findings to the dm-crypt mailing list, but the response we got was not very encouraging:

If the numbers disturb you, then this is from lack of understanding on your side. You are probably unaware that encryption is a heavy-weight operation…

We decided to make a scientific research on this topic by typing "is encryption expensive" into Google Search and one of the top results, which actually contains meaningful measurements, is… our own post about cost of encryption, but in the context of TLS! This is a fascinating read on its own, but the gist is: modern crypto on modern hardware is very cheap even at Cloudflare scale (doing millions of encrypted HTTP requests per second). In fact, it is so cheap that Cloudflare was the first provider to offer free SSL/TLS for everyone.

Digging into the source code

When trying to use the custom dm-crypt options described above we were curious why they exist in the first place and what is that "offloading" all about. Originally we expected dm-crypt to be a simple "proxy", which just encrypts/decrypts data as it flows through the stack. Turns out dm-crypt does more than just encrypting memory buffers and a (simplified) IO traverse path diagram is presented below:

Speeding up Linux disk encryption

When the file system issues a write request, dm-crypt does not process it immediately – instead it puts it into a workqueue named "kcryptd". In a nutshell, a kernel workqueue just schedules some work (encryption in this case) to be performed at some later time, when it is more convenient. When "the time" comes, dm-crypt sends the request to Linux Crypto API for actual encryption. However, modern Linux Crypto API is asynchronous as well, so depending on which particular implementation your system will use, most likely it will not be processed immediately, but queued again for "later time". When Linux Crypto API will finally do the encryption, dm-crypt may try to sort pending write requests by putting each request into a red-black tree. Then a separate kernel thread again at "some time later" actually takes all IO requests in the tree and sends them down the stack.

Now for read requests: this time we need to get the encrypted data first from the hardware, but dm-crypt does not just ask for the driver for the data, but queues the request into a different workqueue named "kcryptd_io". At some point later, when we actually have the encrypted data, we schedule it for decryption using the now familiar "kcryptd" workqueue. "kcryptd" will send the request to Linux Crypto API, which may decrypt the data asynchronously as well.

To be fair the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering if all this extra queueing can cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is:

A significant amount of tail latency is due to queueing effects

So, why are all these queues there and can we remove them?

Git archeology

No-one writes more complex code just for fun, especially for the OS kernel. So all these queues must have been put there for a reason. Luckily, the Linux kernel source is managed by git, so we can try to retrace the changes and the decisions around them.

The "kcryptd" workqueue was in the source since the beginning of the available history with the following comment:

Needed because it would be very unwise to do decryption in an interrupt context, so bios returning from read requests get queued here.

So it was for reads only, but even then – why do we care if it is interrupt context or not, if Linux Crypto API will likely use a dedicated thread/queue for encryption anyway? Well, back in 2005 Crypto API was not asynchronous, so this made perfect sense.

In 2006 dm-crypt started to use the "kcryptd" workqueue not only for encryption, but for submitting IO requests:

This patch is designed to help dm-crypt comply with the new constraints imposed by the following patch in -mm: md-dm-reduce-stack-usage-with-stacked-block-devices.patch

It seems the goal here was not to add more concurrency, but rather reduce kernel stack usage, which makes sense again as the kernel has a common stack across all the code, so it is a quite limited resource. It is worth noting, however, that the Linux kernel stack has been expanded in 2014 for x86 platforms, so this might not be a problem anymore.

A first version of "kcryptd_io" workqueue was added in 2007 with the intent to avoid:

starvation caused by many requests waiting for memory allocation…

The request processing was bottlenecking on a single workqueue here, so the solution was to add another one. Makes sense.

We are definitely not the first ones experiencing performance degradation because of extensive queueing: in 2011 a change was introduced to conditionally revert some of the queueing for read requests:

If there is enough memory, code can directly submit bio instead queuing this operation in a separate thread.

Unfortunately, at that time Linux kernel commit messages were not as verbose as today, so there is no performance data available.

In 2015 dm-crypt started to sort writes in a separate "dmcrypt_write" thread before sending them down the stack:

On a multiprocessor machine, encryption requests finish in a different order than they were submitted. Consequently, write requests would be submitted in a different order and it could cause severe performance degradation.

It does make sense as sequential disk access used to be much faster than the random one and dm-crypt was breaking the pattern. But this mostly applies to spinning disks, which were still dominant in 2015. It may not be as important with modern fast SSDs (including NVME SSDs).

Another part of the commit message is worth mentioning:

…in particular it enables IO schedulers like CFQ to sort more effectively…

It mentions the performance benefits for the CFQ IO scheduler, but Linux schedulers have improved since then to the point that CFQ scheduler has been removed from the kernel in 2018.

The same patchset replaces the sorting list with a red-black tree:

In theory the sorting should be performed by the underlying disk scheduler, however, in practice the disk scheduler only accepts and sorts a finite number of requests. To allow the sorting of all requests, dm-crypt needs to implement its own sorting.

The overhead associated with rbtree-based sorting is considered negligible so it is not used conditionally.

All that make sense, but it would be nice to have some backing data.

Interestingly, in the same patchset we see the introduction of our familiar "submit_from_crypt_cpus" option:

There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly

Overall, we can see that every change was reasonable and needed, however things have changed since then:

  • hardware became faster and smarter
  • Linux resource allocation was revisited
  • coupled Linux subsystems were rearchitected

And many of the design choices above may not be applicable to modern Linux.

The "clean-up"

Based on the research above we decided to try to remove all the extra queueing and asynchronous behaviour and revert dm-crypt to its original purpose: simply encrypt/decrypt IO requests as they pass through. But for the sake of stability and further benchmarking we ended up not removing the actual code, but rather adding yet another dm-crypt option, which bypasses all the queues/threads, if enabled. The flag allows us to switch between the current and new behaviour at runtime under full production load, so we can easily revert our changes should we see any side-effects. The resulting patch can be found on the Cloudflare GitHub Linux repository.

Synchronous Linux Crypto API

From the diagram above we remember that not all queueing is implemented in dm-crypt. Modern Linux Crypto API may also be asynchronous and for the sake of this experiment we want to eliminate queues there as well. What does "may be" mean, though? The OS may contain different implementations of the same algorithm (for example, hardware-accelerated AES-NI on x86 platforms and generic C-code AES implementations). By default the system chooses the "best" one based on the configured algorithm priority. dm-crypt allows overriding this behaviour and request a particular cipher implementation using the capi: prefix. However, there is one problem. Let us actually check the available AES-XTS (this is our disk encryption cipher, remember?) implementations on our system:

$ grep -A 11 'xts(aes)' /proc/crypto
name         : xts(aes)
driver       : xts(ecb(aes-generic))
module       : kernel
priority     : 100
refcnt       : 7
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : __xts(aes)
driver       : cryptd(__xts-aes-aesni)
module       : cryptd
priority     : 451
refcnt       : 1
selftest     : passed
internal     : yes
type         : skcipher
async        : yes
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : xts(aes)
driver       : xts-aes-aesni
module       : aesni_intel
priority     : 401
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : yes
blocksize    : 16
min keysize  : 32
max keysize  : 64
--
name         : __xts(aes)
driver       : __xts-aes-aesni
module       : aesni_intel
priority     : 401
refcnt       : 7
selftest     : passed
internal     : yes
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64

We want to explicitly select a synchronous cipher from the above list to avoid queueing effects in threads, but the only two supported are xts(ecb(aes-generic)) (the generic C implementation) and __xts-aes-aesni (the x86 hardware-accelerated implementation). We definitely want the latter as it is much faster (we’re aiming for performance here), but it is suspiciously marked as internal (see internal: yes). If we check the source code:

Mark a cipher as a service implementation only usable by another cipher and never by a normal user of the kernel crypto API

So this cipher is meant to be used only by other wrapper code in the Crypto API and not outside it. In practice this means, that the caller of the Crypto API needs to explicitly specify this flag, when requesting a particular cipher implementation, but dm-crypt does not do it, because by design it is not part of the Linux Crypto API, rather an "external" user. We already patch the dm-crypt module, so we could as well just add the relevant flag. However, there is another problem with AES-NI in particular: x86 FPU. "Floating point" you say? Why do we need floating point math to do symmetric encryption which should only be about bit shifts and XOR operations? We don’t need the math, but AES-NI instructions use some of the CPU registers, which are dedicated to the FPU. Unfortunately the Linux kernel does not always preserve these registers in interrupt context for performance reasons (saving/restoring FPU is expensive). But dm-crypt may execute code in interrupt context, so we risk corrupting some other process data and we go back to "it would be very unwise to do decryption in an interrupt context" statement in the original code.

Our solution to address the above was to create another somewhat "smart" Crypto API module. This module is synchronous and does not roll its own crypto, but is just a "router" of encryption requests:

  • if we can use the FPU (and thus AES-NI) in the current execution context, we just forward the encryption request to the faster, "internal" __xts-aes-aesni implementation (and we can use it here, because now we are part of the Crypto API)
  • otherwise, we just forward the encryption request to the slower, generic C-based xts(ecb(aes-generic)) implementation

Using the whole lot

Let’s walk through the process of using it all together. The first step is to grab the patches and recompile the kernel (or just compile dm-crypt and our xtsproxy modules).

Next, let’s restart our IO workload in a separate terminal, so we can make sure we can reconfigure the kernel at runtime under load:

$ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...

In the main terminal make sure our new Crypto API module is loaded and available:

$ sudo modprobe xtsproxy
$ grep -A 11 'xtsproxy' /proc/crypto
driver       : xts-aes-xtsproxy
module       : xtsproxy
priority     : 0
refcnt       : 0
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
ivsize       : 16
chunksize    : 16

Reconfigure the encrypted disk to use our newly loaded module and enable our patched dm-crypt flag (we have to use low-level dmsetup tool and cryptsetup obviously is not aware of our modifications):

$ sudo dmsetup table encrypted-ram0 --showkeys | sed 's/aes-xts-plain64/capi:xts-aes-xtsproxy-plain64/' | sed 's/$/ 1 force_inline/' | sudo dmsetup reload encrypted-ram0

We just "loaded" the new configuration, but for it to take effect, we need to suspend/resume the encrypted device:

$ sudo dmsetup suspend encrypted-ram0 && sudo dmsetup resume encrypted-ram0

And now observe the result. We may go back to the other terminal running the fio job and look at the output, but to make things nicer, here’s a snapshot of the observed read/write throughput in Grafana:

Speeding up Linux disk encryption
Speeding up Linux disk encryption

Wow, we have more than doubled the throughput! With the total throughput of ~640 MB/s we’re now much closer to the expected ~696 MB/s from above. What about the IO latency? (The await statistic from the iostat reporting tool):

Speeding up Linux disk encryption

The latency has been cut in half as well!

To production

So far we have been using a synthetic setup with some parts of the full production stack missing, like file systems, real hardware and most importantly, production workload. To ensure we’re not optimising imaginary things, here is a snapshot of the production impact these changes bring to the caching part of our stack:

Speeding up Linux disk encryption

This graph represents a three-way comparison of the worst-case response times (99th percentile) for a cache hit in one of our servers. The green line is from a server with unencrypted disks, which we will use as baseline. The red line is from a server with encrypted disks with the default Linux disk encryption implementation and the blue line is from a server with encrypted disks and our optimisations enabled. As we can see the default Linux disk encryption implementation has a significant impact on our cache latency in worst case scenarios, whereas the patched implementation is indistinguishable from not using encryption at all. In other words the improved encryption implementation does not have any impact at all on our cache response speed, so we basically get it for free! That’s a win!

We’re just getting started

This post shows how an architecture review can double the performance of a system. Also we reconfirmed that modern cryptography is not expensive and there is usually no excuse not to protect your data.

We are going to submit this work for inclusion in the main kernel source tree, but most likely not in its current form. Although the results look encouraging we have to remember that Linux is a highly portable operating system: it runs on powerful servers as well as small resource constrained IoT devices and on many other CPU architectures as well. The current version of the patches just optimises disk encryption for a particular workload on a particular architecture, but Linux needs a solution which runs smoothly everywhere.

That said, if you think your case is similar and you want to take advantage of the performance improvements now, you may grab the patches and hopefully provide feedback. The runtime flag makes it easy to toggle the functionality on the fly and a simple A/B test may be performed to see if it benefits any particular case or setup. These patches have been running across our wide network of more than 200 data centres on five generations of hardware, so can be reasonably considered stable. Enjoy both performance and security from Cloudflare for all!

Speeding up Linux disk encryption

Keepalives considered harmful

Post Syndicated from Ivan Babrou original https://blog.cloudflare.com/keepalives-considered-harmful/

Keepalives considered harmful

This may sound like a weird title, but hear me out. You’d think keepalives would always be helpful, but turns out reality isn’t always what you expect it to be. It really helps if you read Why does one NGINX worker take all the load? first. This post is an adaptation of a rather old post on Cloudflare’s internal blog, so not all details are exactly as they are in production today but the lessons are still valid.

This is a story about how we were seeing some complaints about sporadic latency spikes, made some unconventional changes, and were able to slash the 99.9th latency percentile by 4x!

Request flow on Cloudflare edge

I’m going to focus only on two parts of our edge stack: FL and SSL.

  • FL accepts plain HTTP connections and does the main request logic, including our WAF
  • SSL terminates SSL and passes connections to FL over local Unix socket:

Here’s a diagram:

Keepalives considered harmful

These days we route all traffic through SSL for simplicity, but in the grand scheme of things it’s not going to matter much.

Each of these processes is not itself a single process, but rather a master process and a collection of workers that do actual processing. Another diagram for you to make this more visual:

Keepalives considered harmful

Keepalives

Requests come over connections that are reused for performance reasons, which is sometimes referred to as “keepalive”. It’s generally expensive for a client to open a new TCP connection and our servers keep some memory pools associated with connections that need to be recycled. In fact, one of the range of mitigations we use for attack traffic is disabling keepalives for abusive clients, forcing them to reopen connections, which slows them down considerably.

To illustrate the usefulness of keepalives, here’s me requesting http://example.com/ from curl over the same connection:

$ curl -s -w "Time to connect: %{time_connect} Time to first byte: %{time_starttransfer}\n" -o /dev/null http://example.com/ -o /dev/null http://example.com/

Time to connect: 0.012108 Time to first byte: 0.018724
Time to connect: 0.000019 Time to first byte: 0.007391

The first request took 18.7ms and out of them 12.1ms were used to establish a new connection, which is not even a TLS one. When I sent another request over the same connection, I didn’t need to pay extra and it took just 7.3ms to service my request. That’s a big win, which gets even bigger if you need to establish a brand new connection (especially if it’s not TLSv1.3 or QUIC). DNS was also cached in the example above, but it may be another negative factor for domains with low TTL.

Keepalives tend to be used extensively, because they seem beneficial. Generally keepalives are enabled by default for this exact reason.

Taking all the load

Due to how Linux works (see the first link in this blog post), most of the load goes to a few workers out of the pool. When that worker is not able to accept() a pending connection because it’s busy processing requests, that connection spills to another worker. The process cascades until some worker is ready to pick up.

This leaves us with a ladder of load spread between workers:

               CPU%                                       Runtime
nobody    4254 51.2  0.8 5655600 1093848 ?     R    Aug23 3938:34 nginx: worker process
nobody    4257 47.9  0.8 5615848 1071612 ?     S    Aug23 3682:05 nginx: worker process
nobody    4253 43.8  0.8 5594124 1069424 ?     R    Aug23 3368:27 nginx: worker process
nobody    4255 39.4  0.8 5573888 1070272 ?     S    Aug23 3030:01 nginx: worker process
nobody    4256 36.2  0.7 5556700 1052560 ?     R    Aug23 2784:23 nginx: worker process
nobody    4251 33.1  0.8 5563276 1063700 ?     S    Aug23 2545:07 nginx: worker process
nobody    4252 29.2  0.8 5561232 1058748 ?     S    Aug23 2245:59 nginx: worker process
nobody    4248 26.7  0.8 5554652 1057288 ?     S    Aug23 2056:19 nginx: worker process
nobody    4249 24.5  0.7 5537276 1043568 ?     S    Aug23 1883:18 nginx: worker process
nobody    4245 22.5  0.7 5552340 1048592 ?     S    Aug23 1736:37 nginx: worker process
nobody    4250 20.7  0.7 5533728 1038676 ?     R    Aug23 1598:16 nginx: worker process
nobody    4247 19.6  0.7 5547548 1044480 ?     S    Aug23 1507:27 nginx: worker process
nobody    4246 18.4  0.7 5538104 1043452 ?     S    Aug23 1421:23 nginx: worker process
nobody    4244 17.5  0.7 5530480 1035264 ?     S    Aug23 1345:39 nginx: worker process
nobody    4243 16.6  0.7 5529232 1024268 ?     S    Aug23 1281:55 nginx: worker process
nobody    4242 16.6  0.7 5537956 1038408 ?     R    Aug23 1278:40 nginx: worker process

The third column is instant CPU%, the time after the date is total on-CPU time. If you look at the the same processes and count their open sockets, you’ll see this (using the same order of processes as above):

4254 2357
4257 1833
4253 2180
4255 1609
4256 1565
4251 1519
4252 1175
4248 1065
4249 1056
4245 1201
4250 886
4247 908
4246 968
4244 1180
4243 867
4242 884

More load corresponds to more open sockets. More open sockets generate more load to serve these connections. It’s a vicious circle.

The twist

Now that we have all these workers holding onto this connections, requests over these connections are also in a way pinned to workers:

Keepalives considered harmful

In FL we’re doing some things that are somewhat compute intensive, which means that some workers can be busy for a somewhat prolonged period of time, while other workers will be sitting idle, increasing latency for requests. Ideally we want to always hand over a request to a worker that’s idle, because that maximizes our chances of not being blocked, even if it takes a few event loop iterations to serve a request (meaning that we may still block down the road).

One part of dealing with the issue is offloading some of the compute intensive parts of the request processing into a thread pool, but that was something that hadn’t happened at that point. We had to do something else in the meantime.

Clients have no way of knowing that their worker is busy and they should probably ask another worker to serve the connection. For clients over a real network this doesn’t even make sense, since by the time their request comes to NGINX up to tens of milliseconds later, the situation will probably be different.

But our clients are not over a long haul network! They are local and they are SSL connecting over a Unix socket. It’s not exactly that expensive to reopen a new connection for each request and what it gives is the ability to pass a fully formed request that’s buffered in SSL into FL:

Keepalives considered harmful

Two key points here:

  • The request is always picked up by an idle worker
  • The request is readily available for potentially compute intensive processing

The former is the most important part.

Validating the hypothesis

To validate this assumption, I wrote the following program:

package main
 
import (
    "flag"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "time"
)
 
func main() {
    u := flag.String("url", "", "url to request")
    p := flag.Duration("pause", 0, "pause between requests")
    t := flag.Duration("threshold", 0, "threshold for reporting")
    c := flag.Int("count", 10000, "number of requests to send")
    n := flag.Int("close", 1, "close connection after every that many requests")
 
    flag.Parse()
 
    if *u == "" {
        flag.PrintDefaults()
        os.Exit(1)
    }
 
    client := http.Client{}
 
    for i := 0; i < *c; i++ {
        started := time.Now()
 
        request, err := http.NewRequest("GET", *u, nil)
        if err != nil {
            log.Fatalf("Error constructing request: %s", err)
        }
 
        if i%*n == 0 {
            request.Header.Set("Connection", "Close")
        }
 
        response, err := client.Do(request)
        if err != nil {
            log.Fatalf("Error performing request: %s", err)
        }
 
        _, err = ioutil.ReadAll(response.Body)
        if err != nil {
            log.Fatalf("Error reading request body: %s", err)
        }
 
        response.Body.Close()
 
        elapsed := time.Since(started)
        if elapsed > *t {
            log.Printf("Request %d took %dms", i, int(elapsed.Seconds()*1000))
        }
 
        time.Sleep(*p)
    }
}

The program connects to a requested URL and recycles the connection after X requests have completed. We also pause for a short time between requests.

If we close a connection after every request:

$ go run /tmp/main.go -url http://test.domain/cdn-cgi/trace -count 10000 -pause 5ms -threshold 20ms -close 1
2018/08/24 23:42:34 Request 453 took 32ms
2018/08/24 23:42:38 Request 1044 took 24ms
2018/08/24 23:43:00 Request 4106 took 83ms
2018/08/24 23:43:12 Request 5778 took 27ms
2018/08/24 23:43:16 Request 6292 took 27ms
2018/08/24 23:43:20 Request 6856 took 21ms
2018/08/24 23:43:32 Request 8578 took 45ms
2018/08/24 23:43:42 Request 9938 took 22ms

We request an endpoint that’s served in FL by Lua, so seeing any slow requests is unfortunate. There’s an element of luck in this game and our program sees no cooperation from eyeballs or SSL, so it’s somewhat expected.

Now, if we start closing the connection only after every other request, the situation immediately gets a lot worse:

$ go run /tmp/main.go -url http://teste1.cfperf.net/cdn-cgi/trace -count 10000 -pause 5ms -threshold 20ms -close 2
2018/08/24 23:43:51 Request 162 took 22ms
2018/08/24 23:43:51 Request 220 took 21ms
2018/08/24 23:43:53 Request 452 took 23ms
2018/08/24 23:43:54 Request 540 took 41ms
2018/08/24 23:43:54 Request 614 took 23ms
2018/08/24 23:43:56 Request 900 took 40ms
2018/08/24 23:44:02 Request 1705 took 21ms
2018/08/24 23:44:03 Request 1850 took 27ms
2018/08/24 23:44:03 Request 1878 took 36ms
2018/08/24 23:44:08 Request 2470 took 21ms
2018/08/24 23:44:11 Request 2926 took 22ms
2018/08/24 23:44:14 Request 3350 took 37ms
2018/08/24 23:44:14 Request 3404 took 21ms
2018/08/24 23:44:16 Request 3598 took 32ms
2018/08/24 23:44:16 Request 3606 took 22ms
2018/08/24 23:44:19 Request 4026 took 33ms
2018/08/24 23:44:20 Request 4250 took 74ms
2018/08/24 23:44:22 Request 4483 took 20ms
2018/08/24 23:44:23 Request 4572 took 21ms
2018/08/24 23:44:23 Request 4644 took 23ms
2018/08/24 23:44:24 Request 4758 took 63ms
2018/08/24 23:44:25 Request 4808 took 39ms
2018/08/24 23:44:30 Request 5496 took 28ms
2018/08/24 23:44:31 Request 5736 took 88ms
2018/08/24 23:44:32 Request 5845 took 43ms
2018/08/24 23:44:33 Request 5988 took 52ms
2018/08/24 23:44:34 Request 6042 took 26ms
2018/08/24 23:44:34 Request 6049 took 23ms
2018/08/24 23:44:40 Request 6872 took 86ms
2018/08/24 23:44:40 Request 6940 took 23ms
2018/08/24 23:44:40 Request 6964 took 23ms
2018/08/24 23:44:44 Request 7532 took 32ms
2018/08/24 23:44:49 Request 8224 took 22ms
2018/08/24 23:44:49 Request 8234 took 29ms
2018/08/24 23:44:51 Request 8536 took 24ms
2018/08/24 23:44:55 Request 9028 took 22ms
2018/08/24 23:44:55 Request 9050 took 23ms
2018/08/24 23:44:55 Request 9092 took 26ms
2018/08/24 23:44:57 Request 9330 took 25ms
2018/08/24 23:45:01 Request 9962 took 48ms

If we close after every 5 requests, the number of slow responses almost doubles. This is counter-intuitive, keepalives are supposed to help with latency, not make it worse!

Trying this in the wild

To see how this works out in the real world, we disabled keepalives between SSL and FL in one location, forcing SSL to send every request over a separate connection (remember: cheap local Unix socket). Here’s how our probes toward that location reacted to this:

Keepalives considered harmful

Here’s cumulative wait time between SSL and FL:

Keepalives considered harmful

And finally 99.9th percentile of the same measurement:

Keepalives considered harmful

This is a huge win.

Another telling graph is comparing our average “edge processing time” (which includes WAF) in a test location to a global value:

Keepalives considered harmful

We reduced unnecessary wait time due to an imbalance without increasing the CPU load, which directly translates into improved user experience and lower resource consumption for us.

The downsides

There has to be a downside from this, right? The problem we introduced is that CPU imbalance between individual CPU cores went up:

Keepalives considered harmful

Overall CPU usage did not change, just the distribution of it. We already know of a way of dealing with this: SO_REUSEPORT. Either that or EPOLLROUNDROBIN, which doesn’t have some of the drawbacks of SO_REUSEPORT (which does not work for a Unix socket, for example), but requires a patched kernel. If we combine both disabled keepalives and EPOLLROUNDROBIN changes, we can see CPUs allocated to FL converge in their utilization nicely:

Keepalives considered harmful

We’ve tried different combinations and having both EPOLLROUNDROBIN with disabled keepalives worked best. Having just one of them is not as beneficial to lower latency.

Conclusions

We disabled keepalives between SSL and FL running on the same box and this greatly improved our tail latency caused by requests landing on non-optimal FL workers. This was an unexpected fix, but it worked and we are able to explain it.

This doesn’t mean that you should go and disable keepalives everywhere. Generally keepalives are great and should stay enabled, but in our case the latency of local connection establishment is much lower than the delay we can get from landing on a busy worker.

In reality this means that we can run our machines hotter and not see latency rise as much as it did before. Imagine moving the CPU cap from 50% to 80% with no effect on latency. The numbers are arbitrary, but the idea holds. Running hotter allows for fewer machines able to serve the same amount of traffic, reducing our overall footprint. 🌎

USB Cable Kill Switch for Laptops

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2020/01/usb_cable_kill_.html

BusKill is designed to wipe your laptop (Linux only) if it is snatched from you in a public place:

The idea is to connect the BusKill cable to your Linux laptop on one end, and to your belt, on the other end. When someone yanks your laptop from your lap or table, the USB cable disconnects from the laptop and triggers a udev script [1, 2, 3] that executes a series of preset operations.

These can be something as simple as activating your screensaver or shutting down your device (forcing the thief to bypass your laptop’s authentication mechanism before accessing any data), but the script can also be configured to wipe the device or delete certain folders (to prevent thieves from retrieving any sensitive data or accessing secure business backends).

Clever idea, but I — and my guess is most people — would be much more likely to stand up from the table, forgetting that the cable was attached, and yanking it out. My problem with pretty much all systems like this is the likelihood of false alarms.

Slashdot article.

EDITED TO ADD (1/14): There are Bluetooth devices that will automatically encrypt a laptop when the device isn’t in proximity. That’s a much better interface than a cable.

It’s crowded in here!

Post Syndicated from Jakub Sitnicki original https://blog.cloudflare.com/its-crowded-in-here/

It's crowded in here!

We recently gave a presentation on Programming socket lookup with BPF at the Linux Plumbers Conference 2019 in Lisbon, Portugal. This blog post is a recap of the problem statement and proposed solution we presented.

It's crowded in here!
CC0 Public Domain, PxHere

Our edge servers are crowded. We run more than a dozen public facing services, leaving aside the all internal ones that do the work behind the scenes.

Quick Quiz #1: How many can you name? We blogged about them! Jump to answer.

These services are exposed on more than a million Anycast public IPv4 addresses partitioned into 100+ network prefixes.

To keep things uniform every Cloudflare edge server runs all services and responds to every Anycast address. This allows us to make efficient use of the hardware by load-balancing traffic between all machines. We have shared the details of Cloudflare edge architecture on the blog before.

It's crowded in here!

Granted not all services work on all the addresses but rather on a subset of them, covering one or several network prefixes.

So how do you set up your network services to listen on hundreds of IP addresses without driving the network stack over the edge?

Cloudflare engineers have had to ask themselves this question more than once over the years, and the answer has changed as our edge evolved. This evolution forced us to look for creative ways to work with the Berkeley sockets API, a POSIX standard for assigning a network address and a port number to your application. It has been quite a journey, and we are not done yet.

When life is simple – one address, one socket

It's crowded in here!

The simplest kind of association between an (IP address, port number) and a service that we can imagine is one-to-one. A server responds to client requests on a single address, on a well known port. To set it up the application has to open one socket for each transport protocol (be it TCP or UDP) it wants to support. A network server like our authoritative DNS would open up two sockets (one for UDP, one for TCP):

(192.0.2.1, 53/tcp) ⇨ ("auth-dns", pid=1001, fd=3)
(192.0.2.1, 53/udp) ⇨ ("auth-dns", pid=1001, fd=4)

To take it to Cloudflare scale, the service is likely to have to receive on at least a /20 network prefix, which is a range of IPs with 4096 addresses in it.

It's crowded in here!

This translates to opening 4096 sockets for each transport protocol. Something that is not likely to go unnoticed when looking at ss tool output.

$ sudo ss -ulpn 'sport = 53'
State  Recv-Q Send-Q  Local Address:Port Peer Address:Port
…
UNCONN 0      0           192.0.2.40:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11076))
UNCONN 0      0           192.0.2.39:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11075))
UNCONN 0      0           192.0.2.38:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11074))
UNCONN 0      0           192.0.2.37:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11073))
UNCONN 0      0           192.0.2.36:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11072))
UNCONN 0      0           192.0.2.31:53        0.0.0.0:*    users:(("auth-dns",pid=77556,fd=11071))
…

It's crowded in here!
CC BY 2.0, Luca Nebuloni, Flickr

The approach, while naive, has an advantage: when an IP from the range gets attacked with a UDP flood, the receive queues of sockets bound to the remaining IP addresses are not affected.

Life can be easier – all addresses, one socket

It's crowded in here!

It seems rather silly to create so many sockets for one service to receive traffic on a range of addresses. Not only that, the more listening sockets there are, the longer the chains in the socket lookup hash table. We have learned the hard way that going in this direction can hurt packet processing latency.

The sockets API comes with a big hammer that can make our life easier – the INADDR_ANY aka 0.0.0.0 wildcard address. With INADDR_ANY we can make a single socket receive on all addresses assigned to our host, specifying just the port.

s = socket(AF_INET, SOCK_STREAM, 0)
s.bind(('0.0.0.0', 12345))
s.listen(16)

Quick Quiz #2: Is there another way to bind a socket to all local addresses? Jump to answer.

In other words, compared to the naive “one address, one socket” approach, INADDR_ANY allows us to have a single catch-all listening socket for the whole IP range on which we accept incoming connections.

In Linux this is possible thanks to a two-phase listening socket lookup, where it falls back to search for an INADDR_ANY socket if a more specific match has not been found.

It's crowded in here!

Another upside of binding to 0.0.0.0 is that our application doesn’t need to be aware of what addresses we have assigned to our host. We are also free to assign or remove the addresses after binding the listening socket. No need to reconfigure the service when its listening IP range changes.

On the other hand if our service should be listening on just A.B.C.0/20 prefix, binding to all local addresses is more than we need. We might unintentionally expose an otherwise internal-only service to external traffic without a proper firewall or a socket filter in place.

Then there is the security angle. Since we now only have one socket, attacks attempting to flood any of the IPs assigned to our host on our service’s port, will hit the catch-all socket and its receive queue. While in such circumstances the Linux TCP stack has your back, UDP needs special care or legitimate traffic might drown in the flood of dropped packets.

Possibly the biggest downside, though, is that a service listening on the wildcard INADDR_ANY address claims the port number exclusively for itself. Binding over the wildcard-listening socket with a specific IP and port fails miserably due to the address already being taken (EADDRINUSE).

bind(3, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

Unless your service is UDP-only, setting the SO_REUSEADDR socket option, will not help you overcome this restriction. The only way out is to turn to SO_REUSEPORT, normally used to construct a load-balancing socket group. And that is only if you are lucky enough to run the port-conflicting services as the same user (UID). That is a story for another post.

Quick Quiz #3: Does setting the SO_REUSEADDR socket option have any effect at all when there is bind conflict? Jump to answer.

Life gets real – one port, two services

As it happens, at the Cloudflare edge we do host services that share the same port number but otherwise respond to requests on non-overlapping IP ranges. A prominent example of such port-sharing is our 1.1.1.1 recursive DNS resolver running side-by-side with the authoritative DNS service that we offer to all customers.

Sadly the sockets API doesn’t allow us to express a setup in which two services share a port and accept requests on disjoint IP ranges.

However, as Linux development history shows, any networking API limitation can be overcome by introducing a new socket option, with sixty-something options available (and counting!).

Enter SO_BINDTOPREFIX.

It's crowded in here!

Back in 2016 we proposed an extension to the Linux network stack. It allowed services to constrain a wildcard-bound socket to an IP range belonging to a network prefix.

# Service 1, 127.0.0.0/20, 1234/tcp
net1, plen1 = '127.0.0.0', 20
bindprefix1 = struct.pack('BBBBBxxx', *inet_aton(net1), plen1)

s1 = socket(AF_INET, SOCK_STREAM, 0)
s1.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix1)
s1.bind(('0.0.0.0', 1234))
s1.listen(1)

# Service 2, 127.0.16.0/20, 1234/tcp
net2, plen2 = '127.0.16.0', 20
bindprefix2 = struct.pack('BBBBBxxx', *inet_aton(net2), plen2)

s2 = socket(AF_INET, SOCK_STREAM, 0)
s2.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix2)
s2.bind(('0.0.0.0', 1234))
s2.listen(1)

This mechanism has served us well since then. Unfortunately, it didn’t get accepted upstream due to being too specific to our use-case. Having no better alternative we ended up maintaining patches in our kernel to this day.

Life gets complicated – all ports, one service

Just when we thought we had things figured out, we were faced with a new challenge. How to build a service that accepts connections on any of the 65,535 ports? The ultimate reverse proxy, if you will, code named Spectrum.

The bind syscall offers very little flexibility when it comes to mapping a socket to a port number. You can either specify the number you want or let the network stack pick an unused one for you. There is no counterpart of INADDR_ANY, a wildcard value to select all ports (INPORT_ANY?).

To achieve what we wanted, we had to turn to TPROXY, a Netfilter / iptables extension designed for intercepting remote-destined traffic on the forward path. However, we use it to steer local-destined packets, that is ones targeted to our host, to a catch-all-ports socket.

iptables -t mangle -I PREROUTING \
         -d 192.0.2.0/24 -p tcp \
         -j TPROXY --on-ip=127.0.0.1 --on-port=1234

It's crowded in here!

TPROXY-based setup comes at a price. For starters, your service needs elevated privileges to create a special catch-all socket (see the IP_TRANSPARENT socket option). Then you also have to understand and consider the subtle interactions between TPROXY and the receive path for your traffic profile, for example:

  • does connection tracking register the flows redirected with TPROXY?
  • is listening socket contention during a SYN flood when using TPROXY a concern?
  • do other parts of the network stack, like XDP programs, need to know about TPROXY redirecting packets?

These are some of the questions we needed to answer, and after running it in production for a while now, we have a good idea of what the consequences of using TPROXY are.

That said, it would not come as a shock, if tomorrow we’d discovered something new about TPROXY. Due to its complexity we’ve always considered using it to steer local-destined traffic a hack, a use-case outside of its intended application. No matter how well understood, a hack remains a hack.

Can BPF make life easier?

Despite its complex nature TPROXY shows us something important. No matter what IP or port the listening socket is bound to, with a bit of support from the network stack, we can steer any connection to it. As long the application is ready to handle this situation, things work.

Quick Quiz #4: Are there really no problems with accepting any connection on any socket? Jump to answer.

This is a really powerful concept. With a bunch of TPROXY rules, we can configure any mapping between (address, port) tuples and listening sockets.

💡 Idea #1: A local-destined connection can be accepted by any listening socket.

We didn’t tell you the whole story before. When we published SO_BINDTOPREFIX patches, they did not just get rejected. As sometimes happens by posting the wrong answer, we got the right answer to our problem

❝BPF is absolutely the way to go here, as it allows for whatever user specified tweaks, like a list of destination subnetwork, or/and a list of source network, or the date/time of the day, or port knocking without netfilter, or … you name it.❞

💡 Idea #2: How we pick a listening socket can be tweaked with BPF.

Combine the two ideas together and we arrive at an exciting concept. Let’s run BPF code to match an incoming packet with a listening socket, ignoring the address the socket is bound to. 🤯

Here’s an example to illustrate it.

It's crowded in here!

All packets coming on 1.1.1.0/24 prefix, port 53 are steered to socket sk:2, while traffic targeted at 3.3.3.3, on any port number lands in socket sk:4.

Welcome BPF inet_lookup

It's crowded in here!

To make this concept a reality we are proposing a new mechanism to program the socket lookup with BPF. What is socket lookup? It’s a stage on the receive path where the transport layer searches for a socket to dispatch the packet to. The last possible moment to steer packets before they land in the selected socket receive queue. In there we attach a new type of BPF program called inet_lookup.

It's crowded in here!

If you recall, socket lookup in the Linux TCP stack is a two phase process. First the kernel will try to find an established (connected) socket matching the packet 4-tuple. If there isn’t one, it will continue by looking for a listening socket using just the packet 2-tuple as key.

Our proposed extension allows users to program the second phase, the listening socket lookup. If present, a BPF program is allowed to choose a listening socket and terminate the lookup. Our program is also free to ignore the packet, in which case the kernel will continue to look for a listening socket as usual.

It's crowded in here!

How does this new type of BPF program operate? On input, as context, it gets handed a subset of information extracted from packet headers, including the packet 4-tuple. Based on the input the program accesses a BPF map containing references to listening sockets, and selects one to yield as the socket lookup result.

If we take a look at the corresponding BPF code, the program structure resembles a firewall rule. We have some match statements followed by an action.

It's crowded in here!

You may notice that we don’t access the BPF map with sockets directly. Instead we follow an established pattern in BPF called “map based redirection”, where a dedicated BPF helper accesses the map and carries out any steps necessary to redirect the packet.

We’ve skipped over one thing. Where does the BPF map of sockets come from? We create it ourselves and populate it with sockets. This is most easily done if your service uses systemd socket activation. systemd will let you associate more than one service unit with a socket unit, and both of the services will receive a file descriptor for the same socket. From there it’s just a matter of inserting the socket into the BPF map.

It's crowded in here!

Demo time!

This is not just a concept. We have already published a first working set of patches for the kernel together with ancillary user-space tooling to configure the socket lookup to your needs.

If you would like to see it in action, you are in luck. We’ve put together a demo that shows just how easily you can bind a network service to (i) a single port, (ii) all ports, or (iii) a network prefix. On-the-fly, without having to restart the service! There is a port scan running to prove it.

You can also bind to all-addresses-all-ports (0.0.0.0/0) because why not? Take that INADDR_ANY. All thanks to BPF superpowers.

It's crowded in here!

Summary

We have gone over how the way we bind services to network addresses on the Cloudflare edge has evolved over time. Each approach has its pros and cons, summarized below. We are currently working on a new BPF-based mechanism for binding services to addresses, which is intended to address the shortcomings of existing solutions.

bind to one address and port

👍 flood traffic on one address hits one socket, doesn’t affect the rest
👎 as many sockets as listening addresses, doesn’t scale

bind to all addresses with INADDR_ANY

👍 just one socket for all addresses, the kernel thanks you
👍 application doesn’t need to know about listening addresses
👎 flood scenario requires custom protection, at least for UDP
👎 port sharing is tricky or impossible

bind to a network prefix with SO_BINDTOPREFIX

👍 two services can share a port if their IP ranges are non-overlapping
👎 custom kernel API extension that never went upstream

bind to all port with TPROXY

👍 enables redirecting all ports to a listening socket and more
👎 meant for intercepting forwarded traffic early on the ingress path
👎 has subtle interactions with the network stack
👎 requires privileges from the application

bind to anything you want with BPF inet_lookup

👍 allows for the same flexibility as with TPROXY or SO_BINDTOPREFIX
👍 services don’t need extra capabilities, meant for local traffic only
👎 needs cooperation from services or PID 1 to build a socket map


Getting to this point has been a team effort. A special thank you to Lorenz Bauer and Marek Majkowski who have contributed in an essential way to the BPF inet_lookup implementation. The SO_BINDTOPREFIX patches were authored by Gilberto Bertin.

Fancy joining the team? Apply here!

Quiz Answers

Quiz 1

Q: How many Cloudflare services can you name?

  1. HTTP CDN (tcp/80)
  2. HTTPS CDN (tcp/443, udp/443)
  3. authoritative DNS (udp/53)
  4. recursive DNS (udp/53, 853)
  5. NTP with NTS (udp/1234)
  6. Roughtime time service (udp/2002)
  7. IPFS Gateway (tcp/443)
  8. Ethereum Gateway (tcp/443)
  9. Spectrum proxy (tcp/any, udp/any)
  10. WARP (udp)

Go back

Quiz 2

Q: Is there another way to bind a socket to all local addresses?

Yes, there is – by not bind()’ing it at all. Calling listen() on an unbound socket is equivalent to binding it to INADDR_ANY and letting the kernel pick a free port.

$ strace -e socket,bind,listen nc -l
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
listen(3, 1)                            = 0
^Z
[1]+  Stopped                 strace -e socket,bind,listen nc -l
$ ss -4tlnp
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
LISTEN     0      1            *:42669      

Go back

Quiz 3

Q: Does setting the SO_REUSEADDR socket option have any effect at all when there is bind conflict?

Yes. If two processes are racing to bind and listen on the same TCP port, on an overlapping IP, setting SO_REUSEADDR changes which syscall will report an error (EADDRINUSE). Without SO_REUSEADDR it will always be the second bind. With SO_REUSEADDR set there is a window of opportunity for a second bind to succeed but the subsequent listen to fail.

Go back

Quiz 4

Q: Are there really no problems with accepting any connection on any socket?

If the connection is destined for an address assigned to our host, i.e. a local address, there are no problems. However, for remote-destined connections, sending return traffic from a non-local address (i.e., one not present on any interface) will not get past the Linux network stack. The IP_TRANSPARENT socket option bypasses this protection mechanism known as source address check to lift this restriction.

Go back

New – Trigger a Kernel Panic to Diagnose Unresponsive EC2 Instances

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-trigger-a-kernel-panic-to-diagnose-unresponsive-ec2-instances/

When I was working on systems deployed in on-premises data centers, it sometimes happened I had to debug an unresponsive server. It usually involved asking someone to physically press a non-maskable interrupt (NMI) button on the frozen server or to send a signal to a command controller over a serial interface (yes, serial, such as in RS-232).This command triggered the system to dump the state of the frozen kernel to a file for further analysis. Such a file is usually called a core dump or a crash dump. The crash dump includes an image of the memory of the crashed process, the system registers, program counter, and other information useful in determining the root cause of the freeze.

Today, we are announcing a new Amazon Elastic Compute Cloud (EC2) API allowing you to remotely trigger the generation of a kernel panic on EC2 instances. The EC2:SendDiagnosticInterrupt API sends a diagnostic interrupt, similar to pressing a NMI button on a physical machine, to a running EC2 instance. It causes the instance’s hypervisor to send a non-maskable interrupt (NMI) to the operating system. The behaviour of your operating system when a NMI interrupt is received depends on its configuration. Typically, it involves entering into kernel panic. The kernel panic behaviour also depends on the operating system configuration, it might trigger the generation of the crash dump data file, obtain a backtrace, load a replacement kernel or restart the system.

You can control who in your organisation is authorized to use that API through IAM Policies, I will give an example below.

Cloud and System Engineers, or specialists in kernel diagnosis and debugging, find in the crash dump invaluable information to analyse the causes of a kernel freeze. Tools like WinDbg (on Windows) and crash (on Linux) can be used to inspect the dump.

Using Diagnostic Interrupt
Using this API is a three step process. First you need to configure the behavior of your OS when it receives the interrupt.

By default, our Windows Server AMIs have memory dump already turned on. Automatic restart after the memory dump has been saved is also selected. The default location for the memory dump file is %SystemRoot% which is equivalent to C:\Windows.

You can access these options by going to :
Start > Control Panel > System > Advanced System Settings > Startup and Recovery

On Amazon Linux 2, you need to install and configurekdump & kexec. This is a one-time setup.

$ sudo yum install kexec-tools

Then edit the file /etc/default/grub to allocate the amount of memory to be reserved for the crash kernel. In this example, we reserve 160M by adding crashkernel=160M. The amount of memory to allocate depends on your instance’s memory size. The general recommendation is to test kdump to see if the allocated memory is sufficient. The kernel doc has the full syntax of the crashkernel kernel parameter.

GRUB_CMDLINE_LINUX_DEFAULT="crashkernel=160M console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0 nvme_core.io_timeout=4294967295 rd.emergency=poweroff rd.shell=0"

And rebuild the grub configuration:

$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg

Finally edit /etc/sysctl.conf and add a line : kernel.unknown_nmi_panic=1. This tells the kernel to trigger a kernel panic upon receiving the interrupt.

You are now ready to reboot your instance. Be sure to include these commands in your user data script or in your AMI to automatically configure this on all your instances. Once the instance is rebooted, verify that kdump is correctly started.

$ systemctl status kdump.service
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Fri 2019-07-05 15:09:04 UTC; 3h 13min ago
  Process: 2494 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 2494 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/kdump.service

Jul 05 15:09:02 ip-172-31-15-244.ec2.internal systemd[1]: Starting Crash recovery kernel arming...
Jul 05 15:09:04 ip-172-31-15-244.ec2.internal kdumpctl[2494]: kexec: loaded kdump kernel
Jul 05 15:09:04 ip-172-31-15-244.ec2.internal kdumpctl[2494]: Starting kdump: [OK]
Jul 05 15:09:04 ip-172-31-15-244.ec2.internal systemd[1]: Started Crash recovery kernel arming.

Our documentation contains the instructions for other operating systems.

Once this one-time configuration is done, you’re ready for the second step, to trigger the API. You can do this from any machine where the AWS CLI or SDK is configured. For example :

$ aws ec2 send-diagnostic-interrupt --region us-east-1 --instance-id <value>

There is no return value from the CLI, this is expected. If you have a terminal session open on that instance, it disconnects. Your instance reboots. You reconnect to your instance, you find the crash dump in /var/crash.

The third and last step is to analyse the content of the crash dump. On Linux systems, you need to install the crash utility and the debugging symbols for your version of the kernel. Note that the kernel version should be the same that was captured by kdump. To find out which kernel you are currently running, use the uname -r command.

$ sudo yum install crash
$ sudo debuginfo-install kernel
$ sudo crash /usr/lib/debug/lib/modules/4.14.128-112.105.amzn2.x86_64/vmlinux /var/crash/127.0.0.1-2019-07-05-15\:08\:43/vmcore

crash 7.2.6-1.amzn2.0.1
... output suppressed for brevity ...

      KERNEL: /usr/lib/debug/lib/modules/4.14.128-112.105.amzn2.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2019-07-05-15:08:43/vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Fri Jul  5 15:08:38 2019
      UPTIME: 00:07:23
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 104
    NODENAME: ip-172-31-15-244.ec2.internal
     RELEASE: 4.14.128-112.105.amzn2.x86_64
     VERSION: #1 SMP Wed Jun 19 16:53:40 UTC 2019
     MACHINE: x86_64  (2500 Mhz)
      MEMORY: 7.9 GB
       PANIC: "Kernel panic - not syncing: NMI: Not continuing"
         PID: 0
     COMMAND: "swapper/0"
        TASK: ffffffff82013480  (1 of 2)  [THREAD_INFO: ffffffff82013480]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

Collecting kernel crash dumps is often the only way to collect kernel debugging information, be sure to test this procedure frequently, in particular after updating your operating system or when you will create new AMIs.

Control Who Is Authorized to Send Diagnostic Interrupt
You can control who in your organisation is authorized to send the Diagnostic Interrupt, and to which instances, through IAM policies with resource-level permissions, like in the example below.

{
   "Version": "2012-10-17",
   "Statement": [
      {
      "Effect": "Allow",
      "Action": "ec2:SendDiagnosticInterrupt",
      "Resource": "arn:aws:ec2:region:account-id:instance/instance-id"
      }
   ]
}

Pricing
There are no additional charges for using this feature. However, as your instance continues to be in a ‘running’ state after it receives the diagnostic interrupt, instance billing will continue as usual.

Availability
You can send Diagnostic Interrupts to all EC2 instances powered by the AWS Nitro System, except A1 (Arm-based). This is C5, C5d, C5n, i3.metal, I3en, M5, M5a, M5ad, M5d, p3dn.24xlarge, R5, R5a, R5ad, R5d, T3, T3a, and Z1d as I write this.

The Diagnostic Interrupt API is now available in all public AWS Regions and GovCloud (US), you can start to use it today.

— seb

A Tale of Two (APT) Transports

Post Syndicated from Ryan Djurovich original https://blog.cloudflare.com/apt-transports/

A Tale of Two (APT) Transports

Securing access to your APT repositories is critical. At Cloudflare, like in most organizations, we used a legacy VPN to lock down who could reach our internal software repositories. However, a network perimeter model lacks a number of features that we consider critical to a team’s security.

As a company, we’ve been moving our internal infrastructure to our own zero-trust platform, Cloudflare Access. Access added SaaS-like convenience to the on-premise tools we managed. We started with web applications and then moved resources we need to reach over SSH behind the Access gateway, for example Git or user-SSH access. However, we still needed to handle how services communicate with our internal APT repository.

We recently open sourced a new APT transport which allows customers to protect their private APT repositories using Cloudflare Access. In this post, we’ll outline the history of APT tooling, APT transports and introduce our new APT transport for Cloudflare Access.

A brief history of APT

Advanced Package Tool, or APT, simplifies the installation and removal of software on Debian and related Linux distributions. Originally released in 1998, APT was to Debian what the App Store was to modern smartphones – a decade ahead of its time!

APT sits atop the lower-level dpkg tool, which is used to install, query, and remove .deb packages – the primary software packaging format in Debian and related Linux distributions such as Ubuntu. With dpkg, packaging and managing software installed on your system became easier – but it didn’t solve for problems around distribution of packages, such as via the Internet or local media; at the time of inception, it was commonplace to install packages from a CD-ROM.

APT introduced the concept of repositories – a mechanism for storing and indexing a collection of .deb packages. APT supports connecting to multiple repositories for finding packages and automatically resolving package dependencies. The way APT connects to said repositories is via a “transport” – a mechanism for communicating between the APT client and its repository source (more on this later).

APT over the Internet

Prior to version 1.5, APT did not include support for HTTPS – if you wanted to install a package over the Internet, your connection was not encrypted. This reduces privacy – an attacker snooping traffic could determine specific package version your system is installing. It also exposes you to man-in-the-middle attacks where an attacker could, for example, exploit a remote code execution vulnerability. Just 6 months ago, we saw an example of the latter with CVE-2019-3462.

Enter the APT HTTPS transport – an optional transport you can install to add support for connecting to repositories over HTTPS. Once installed, users need to configure their APT sources.list with repositories using HTTPS.

The challenge here, of course, is that the most common way to install this transport is via APT and HTTP – a classic bootstrapping problem! An alternative here is to download the .deb package via curl and install it via dpkg. You’ll find the links to apt-transport-https binaries for Stretch here – once you have the URL path for your system architecture, you can download it from the deb.debian.org mirror-redirector over HTTPS, e.g. for amd64 (a.k.a. x86_64):

curl -o apt-transport-https.deb -L https://deb.debian.org/debian/pool/main/a/apt/apt-transport-https_1.4.9_amd64.deb 
HASH=c8c4366d1912ff8223615891397a78b44f313b0a2f15a970a82abe48460490cb && echo "$HASH  apt-transport-https.deb" | sha256sum -c
sudo dpkg -i apt-transport-https.deb

To confirm which APT transports are installed on your system, you can list each “method binary” that is installed:

ls /usr/lib/apt/methods

With apt-transport-https installed you should now see ‘https’ in that list.

The state of APT & HTTPS on Debian

You may be wondering how relevant this APT HTTPS transport is today. Given the prevalence of HTTPS on the web today, I was surprised when I found out exactly how relevant it is.

Up until a couple of weeks ago, Debian Stretch (9.x) was the current stable release; 9.0 was first released in June 2017 – and the latest version (9.9) includes apt 1.4.9 by default – meaning that securing your APT communication for Debian Stretch requires installing the optional apt-transport-https package.

Thankfully, on July 6 of this year, Debian released the latest version – Buster – which currently includes apt 1.8.2 with HTTPS support built-in by default, negating the need for installing the apt-transport-https package – and removing the bootstrapping challenge of installing HTTPS support via HTTPS!

BYO HTTPS APT Repository

A powerful feature of APT is the ability to run your own repository. You can mirror a public repository to improve performance or protect against an outage. And if you’re producing your own software packages, you can run your own repository to simplify distribution and installation of your software for your users.

If you have your own APT repository and you’re looking to secure it with HTTPS we’ve offered free Universal SSL since 2014 and last year introduced a way to require it site-wide automatically with one click. You’ll get the benefits of DDoS attack protection, a Global CDN with Caching, and Analytics.

But what if you’re looking for more than just HTTPS for your APT repository? For companies operating private APT repositories, authentication of your APT repository may be a challenge. This is where our new, custom APT transport comes in.

Building custom transports

The system design of APT is powerful in that it supports extensibility via Transport executables, but how does this mechanism work?

When APT attempts to connect to a repository, it finds the executable which matches the “scheme” from the repository URL (e.g. “https://” prefix on a repository results in the “https” executable being called).

APT then uses the common Linux standard streams: stdin, stdout, and stderr. It communicates via stdin/stdout using a set of plain-text Messages, which follow IETF RFC #822 (the same format that .deb “Package” files use).

Examples of input message include “600 URI Acquire”, and examples of output messages include “200 URI Start” and “201 URI Done”:

A Tale of Two (APT) Transports

If you’re interested in building your own transport, check out the APT method interface spec for more implementation details.

APT meets Access

Cloudflare prioritizes dogfooding our own products early and often. The Access product has given our internal DevTools team a chance to work closely with the product team as we build features that help solve use cases across our organization. We’ve deployed new features internally, gathered feedback, improved them, and then released them to our customers. For example, we’ve been able to iterate on tools for Access like the Atlassian SSO plugin and the SSH feature, as collaborative efforts between DevTools and the Access team.

Our DevTools team wanted to take the same dogfooding approach to protect our internal APT repository with Access. We knew this would require a custom APT transport to support generating the required tokens and passing the correct headers in HTTPS requests to our internal APT repository server. We decided to build and test our own transport that both generated the necessary tokens and passed the correct headers to allow us to place our repository behind Access.

After months of internal use, we’re excited to announce that we have recently open-sourced our custom APT transport, so our customers can also secure their APT repositories by enabling authentication via Cloudflare Access.

By protecting your APT repository with Cloudflare Access, you can support authenticating users via Single-Sign On (SSO) providers, defining comprehensive access-control policies, and monitoring access and change logs.

Our APT transport leverages another Open Source tool we provide, cloudflared, which enables users to connect to your Cloudflare-protected domain securely.

Securing your APT Repository

To use our APT transport, you’ll need an APT repository that’s protected by Cloudflare Access. Our instructions (below) for using our transport will use apt.example.com as a hostname.

To use our APT transport with your own web-based APT repository, refer to our Setting Up Access guide.

APT Transport Installation

To install from source, both tools require Go – once you install Go, you can install `cloudflared` and our APT transport with four commands:

go get github.com/cloudflare/cloudflared/cmd/cloudflared
sudo cp ${GOPATH:-~/go}/bin/cloudflared /usr/local/bin/cloudflared
go get github.com/cloudflare/apt-transport-cloudflared/cmd/cfd
sudo cp ${GOPATH:-~/go}/bin/cfd /usr/lib/apt/methods/cfd

The above commands should place the cloudflared executable in /usr/local/bin (which should be on your PATH), and the APT transport binary in the required /usr/lib/apt/methods directory.

To confirm cloudflared is on your path, run:

which cloudflared

The above command should return /usr/local/bin/cloudflared

Now that the custom transport is installed, to start using it simply configure an APT source with the cfd:// rather than https:// e.g:

$ cat /etc/apt/sources.list.d/example.list 
deb [arch=amd64] cfd://apt.example.com/v2/stretch stable common

Next time you do `apt-get update` and `apt-get install`, a browser window will open asking you to log-in over Cloudflare Access, and your package will be retrieved using the token returned by `cloudflared`.

Fetching a GPG Key over Access

Usually, private APT repositories will use SecureApt and have their own GPG public key that users must install to verify the integrity of data retrieved from that repository.

Users can also leverage cloudflared for securely downloading and installing those keys, e.g:

cloudflared access login https://apt.example.com
cloudflared access curl https://apt.example.com/public.gpg | sudo apt-key add -

The first command will open your web browser allowing you to authenticate for your domain. The second command wraps curl to download the GPG key, and hands it off to `apt-key add`.

Cloudflare Access on “headless” servers

If you’re looking to deploy APT repositories protected by Cloudflare Access to non-user-facing machines (a.k.a. “headless” servers), opening a browser does not work. The good news is since February, Cloudflare Access supports service tokens – and we’ve built support for them into our APT transport from day one.

If you’d like to use service tokens with our APT transport, it’s as simple as placing the token in a file in the correct path; because the machine already has a token, there is also no dependency on `cloudflared` for authentication. You can find details on how to set-up a service token in the APT transport README.

What’s next?

As demonstrated, you can get started using our APT transport today – we’d love to hear your feedback on this!

This work came out of an internal dogfooding effort, and we’re currently experimenting with additional packaging formats and tooling. If you’re interested in seeing support for another format or tool, please reach out.

Predictive CPU isolation of containers at Netflix

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/predictive-cpu-isolation-of-containers-at-netflix-91f014d856c7?source=rss----2615bd06b42e---4

By Benoit Rostykus, Gabriel Hartmann

Noisy Neighbors

We’ve all had noisy neighbors at one point in our life. Whether it’s at a cafe or through a wall of an apartment, it is always disruptive. The need for good manners in shared spaces turns out to be important not just for people, but for your Docker containers too.

When you’re running in the cloud your containers are in a shared space; in particular they share the CPU’s memory hierarchy of the host instance.

Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. However, the key insight here is that these caches are partially shared among the CPUs, which means that perfect performance isolation of co-hosted containers is not possible. If the container running on the core next to your container suddenly decides to fetch a lot of data from the RAM, it will inevitably result in more cache misses for you (and hence a potential performance degradation).

Linux to the rescue?

Traditionally it has been the responsibility of the operating system’s task scheduler to mitigate this performance isolation problem. In Linux, the current mainstream solution is CFS (Completely Fair Scheduler). Its goal is to assign running processes to time slices of the CPU in a “fair” way.

CFS is widely used and therefore well tested and Linux machines around the world run with reasonable performance. So why mess with it? As it turns out, for the large majority of Netflix use cases, its performance is far from optimal. Titus is Netflix’s container platform. Every month, we run millions of containers on thousands of machines on Titus, serving hundreds of internal applications and customers. These applications range from critical low-latency services powering our customer-facing video streaming service, to batch jobs for encoding or machine learning. Maintaining performance isolation between these different applications is critical to ensuring a good experience for internal and external customers.

We were able to meaningfully improve both the predictability and performance of these containers by taking some of the CPU isolation responsibility away from the operating system and moving towards a data driven solution involving combinatorial optimization and machine learning.

The idea

CFS operates by very frequently (every few microseconds) applying a set of heuristics which encapsulate a general concept of best practices around CPU hardware use.

Instead, what if we reduced the frequency of interventions (to every few seconds) but made better data-driven decisions regarding the allocation of processes to compute resources in order to minimize collocation noise?

One traditional way of mitigating CFS performance issues is for application owners to manually cooperate through the use of core pinning or nice values. However, we can automatically make better global decisions by detecting collocation opportunities based on actual usage information. For example if we predict that container A is going to become very CPU intensive soon, then maybe we should run it on a different NUMA socket than container B which is very latency-sensitive. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.

Optimizing placements through combinatorial optimization

What the OS task scheduler is doing is essentially solving a resource allocation problem: I have X threads to run but only Y CPUs available, how do I allocate the threads to the CPUs to give the illusion of concurrency?

As an illustrative example, let’s consider a toy instance of 16 hyperthreads. It has 8 physical hyperthreaded cores, split on 2 NUMA sockets. Each hyperthread shares its L1 and L2 caches with its neighbor, and shares its L3 cache with the 7 other hyperthreads on the socket:

If we want to run container A on 4 threads and container B on 2 threads on this instance, we can look at what “bad” and “good” placement decisions look like:

The first placement is intuitively bad because we potentially create collocation noise between A and B on the first 2 cores through their L1/L2 caches, and on the socket through the L3 cache while leaving a whole socket empty. The second placement looks better as each CPU is given its own L1/L2 caches, and we make better use of the two L3 caches available.

Resource allocation problems can be efficiently solved through a branch of mathematics called combinatorial optimization, used for example for airline scheduling or logistics problems.

We formulate the problem as a Mixed Integer Program (MIP). Given a set of K containers each requesting a specific number of CPUs on an instance possessing d threads, the goal is to find a binary assignment matrix M of size (d, K) such that each container gets the number of CPUs it requested. The loss function and constraints contain various terms expressing a priori good placement decisions such as:

  • avoid spreading a container across multiple NUMA sockets (to avoid potentially slow cross-sockets memory accesses or page migrations)
  • don’t use hyper-threads unless you need to (to reduce L1/L2 thrashing)
  • try to even out pressure on the L3 caches (based on potential measurements of the container’s hardware usage)
  • don’t shuffle things too much between placement decisions

Given the low-latency and low-compute requirements of the system (we certainly don’t want to spend too many CPU cycles figuring out how containers should use CPU cycles!), can we actually make this work in practice?

Implementation

We decided to implement the strategy through Linux cgroups since they are fully supported by CFS, by modifying each container’s cpuset cgroup based on the desired mapping of containers to hyper-threads. In this way a user-space process defines a “fence” within which CFS operates for each container. In effect we remove the impact of CFS heuristics on performance isolation while retaining its core scheduling capabilities.

This user-space process is a Titus subsystem called titus-isolate which works as follows. On each instance, we define three events that trigger a placement optimization:

  • add: A new container was allocated by the Titus scheduler to this instance and needs to be run
  • remove: A running container just finished
  • rebalance: CPU usage may have changed in the containers so we should reevaluate our placement decisions

We periodically enqueue rebalance events when no other event has recently triggered a placement decision.

Every time a placement event is triggered, titus-isolate queries a remote optimization service (running as a Titus service, hence also isolating itself… turtles all the way down) which solves the container-to-threads placement problem.

This service then queries a local GBRT model (retrained every couple of hours on weeks of data collected from the whole Titus platform) predicting the P95 CPU usage of each container in the coming 10 minutes (conditional quantile regression). The model contains both contextual features (metadata associated with the container: who launched it, image, memory and network configuration, app name…) as well as time-series features extracted from the last hour of historical CPU usage of the container collected regularly by the host from the kernel CPU accounting controller.

The predictions are then fed into a MIP which is solved on the fly. We’re using cvxpy as a nice generic symbolic front-end to represent the problem which can then be fed into various open-source or proprietary MIP solver backends. Since MIPs are NP-hard, some care needs to be taken. We impose a hard time budget to the solver to drive the branch-and-cut strategy into a low-latency regime, with guardrails around the MIP gap to control overall quality of the solution found.

The service then returns the placement decision to the host, which executes it by modifying the cpusets of the containers.

For example, at any moment in time, an r4.16xlarge with 64 logical CPUs might look like this (the color scale represents CPU usage):

Results

The first version of the system led to surprisingly good results. We reduced overall runtime of batch jobs by multiple percent on average while most importantly reducing job runtime variance (a reasonable proxy for isolation), as illustrated below. Here we see a real-world batch job runtime distribution with and without improved isolation:

Notice how we mostly made the problem of long-running outliers disappear. The right-tail of unlucky noisy-neighbors runs is now gone.

For services, the gains were even more impressive. One specific Titus middleware service serving the Netflix streaming service saw a capacity reduction of 13% (a decrease of more than 1000 containers) needed at peak traffic to serve the same load with the required P99 latency SLA! We also noticed a sharp reduction of the CPU usage on the machines, since far less time was spent by the kernel in cache invalidation logic. Our containers are now more predictable, faster and the machine is less used! It’s not often that you can have your cake and eat it too.

Next Steps

We are excited with the strides made so far in this area. We are working on multiple fronts to extend the solution presented here.

We want to extend the system to support CPU oversubscription. Most of our users have challenges knowing how to properly size the numbers of CPUs their app needs. And in fact, this number varies during the lifetime of their containers. Since we already predict future CPU usage of the containers, we want to automatically detect and reclaim unused resources. For example, one could decide to auto-assign a specific container to a shared cgroup of underutilized CPUs, to better improve overall isolation and machine utilization, if we can detect the sensitivity threshold of our users along the various axes of the following graph.

We also want to leverage kernel PMC events to more directly optimize for minimal cache noise. One possible avenue is to use the Intel based bare metal instances recently introduced by Amazon that allow deep access to performance analysis tools. We could then feed this information directly into the optimization engine to move towards a more supervised learning approach. This would require a proper continuous randomization of the placements to collect unbiased counterfactuals, so we could build some sort of interference model (“what would be the performance of container A in the next minute, if I were to colocate one of its threads on the same core as container B, knowing that there’s also C running on the same socket right now?”).

Conclusion

If any of this piques your interest, reach out to us! We’re looking for ML engineers to help us push the boundary of containers performance and “machine learning for systems” and systems engineers for our core infrastructure and compute platform.


Predictive CPU isolation of containers at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Cloudflare Repositories FTW

Post Syndicated from Guest Author original https://blog.cloudflare.com/cloudflare-repositories-ftw/

Cloudflare Repositories FTW

This is a guest post by Jim “Elwood” O’Gorman, one of the maintainers of Kali Linux. Kali Linux is a Debian based GNU/Linux distribution popular amongst the security research communities.

Cloudflare Repositories FTW

Kali Linux turned six years old this year!

In this time, Kali has established itself as the de-facto standard open source penetration testing platform. On a quarterly basis, we release updated ISOs for multiple platforms, pre-configured virtual machines, Kali Docker, WSL, Azure, AWS images, tons of ARM devices, Kali NetHunter, and on and on and on. This has lead to Kali being trusted and relied on to always being there for both security professionals and enthusiasts alike.

But that popularity has always led to one complication: How to get Kali to people?

With so many different downloads plus the apt repository, we have to move a lot of data. To accomplish this, we have always relied on our network of first- and third-party mirrors.

The way this works is, we run a master server that pushes out to a number of mirrors. We then pay to host a number of servers that are geographically dispersed and use them as our first-party mirrors. Then, a number of third parties donate storage and bandwidth to operate third-party mirrors, ensuring that we have even more systems that are geographically close to you. When you go to download, you hit a redirector that will send you to a mirror that is close to you, ideally allowing you to download your files quickly.

This solution has always been pretty decent, however it has some drawbacks. First, our network of first-party mirrors is expensive. Second, some mirrors are not as good as others. Nothing is worse than trying to download Kali and getting sent to a slow mirror, where your download might drag on for hours. Third, we always always need more mirrors as Kali continues to grow in popularity.

This situation led to us encountering Cloudflare thanks to some extremely generous outreach


I will be honest, we are a bunch of security nerds, so we were a bit skeptical at first. We have some pretty unique needs, we use a lot of bandwidth, syncing an apt repository to a CDN is no small task, and well, we are paranoid. We have an average of 1,000,000 downloads a month on just our ISO images. Add in our apt repos and you are talking some serious, serious traffic. So how much help could we really expect from Cloudflare anyway? Were we really going to be able to put this to use, or would this just be a nice fancy front end to our website and nothing else?

On the other hand, it was a chance to use something new and shiny, and it is an expensive product, so of course we dove right in to play with it.

Initially we had some sync issues. A package repository is a mix of static data (binary and source packages) and dynamic data (package lists are updated every 6 hours). To make things worse, the cryptographic sealing of the metadata means that we need atomic updates of all the metadata (the signed top-level ‘Release’ file contains checksums of all the binary and source package lists).

The default behavior of a CDN is not appropriate for this purpose as it caches all files for a certain amount of time after they have been fetched for the first time. This means that you could have different versions of various metadata files in the cache, resulting in invalid checksums errors returned by apt-get. So we had to implement a few tweaks to make it work and reap the full benefits of Cloudflare’s CDN network.

First we added an “Expires” HTTP header to disable expiration of all files that will never change. Then we added another HTTP header to tag all metadata files so that we could manually purge those files from the CDN cache through an API call that we integrated at the end of the repository update procedure on our backend server.

With nginx in our backend, the configuration looks like this:

location /kali/dists/ {
    add_header Cache-Tag metadata,dists;
}
location /kali/project/trace/ {
    add_header Cache-Tag metadata,trace;
    expires 1h;
}
location /kali/pool/ {
    add_header Cache-Tag pool;
    location ~ \.(deb|udeb|dsc|changes|xz|gz|bz2)$ {
        expires max;
    }
}

The API call is a simple shell script launched by a hook of the repository mirroring script:

#!/bin/sh
curl -sS -X POST "https://api.cloudflare.com/client/v4/zones/xxxxxxxxxxx/purge_cache" \
    -H "Content-Type:application/json" \
    -H "X-Auth-Key:XXXXXXXXXXXXX" \
    -H "X-Auth-Email:[email protected]" \
    --data '{"tags":["metadata"]}'

With this simple yet powerful feature, we ensure that the CDN cache always contains consistent versions of the metadata files. Going further, we might want to configure Prefetching so that Cloudflare downloads all the package lists as soon as a user downloads the top-level ‘Release’ file.

In short, we were using this system in a way that was never intended, but it worked! This really reduced the load on our backend, as a single server could feed the entire CDN. Putting the files geographically close to users, allowing the classic apt dist-upgrade to occur much, much faster than ever before.

A huge benefit, and was not really a lot of work to set up. Sevki Hasirci was there with us the entire time as we worked through this process, ensuring any questions we had were answered promptly. A great win.

However, there was just one problem.

Looking at our logs, while the apt repo was working perfectly, our image distribution was not so great. None of those images were getting cached, and our origin server was dying.

Talking with Sevki, it turns out there were limits to how large of a file Cloudflare would cache. He upped our limit to the system capacity, but that still was not enough for how large some of our images are. At this point, we just assumed that was that–we could use this solution for the repo but for our image distribution it would not help. However, Sevki told us to wait a bit. He had a surprise in the works for us.

After some development time, Cloudflare pushed out an update to address our issue, allowing us to cache very large files. With that in place, everything just worked with no additional tweaking. Even items like partial downloads for users using download accelerators worked just fine. Amazing!

To show an example of what this translated into, let’s look at some graphs. Once the very large file support was added and we started to push out our images through Cloudflare, you could see that there is not a real increase in requests:

Cloudflare Repositories FTW

However, looking at Bandwidth there is a clear increase:

Cloudflare Repositories FTW

After it had been implemented for a while, we see a clear pattern.

Cloudflare Repositories FTW

Cloudflare Repositories FTW

This pushed us from around 80 TB a week when we had just the repo, to now around 430TB a month when its repo and images. As you can imagine, that’s an amazing bandwidth savings for an open source project such as ours.

Performance is great, and with a cache hit rate of over 97% (amazingly high considering how often and frequently files in our repo changes), we could not be happier.

So what’s next? That’s the question we are asking ourselves. This solution has worked so well, we are looking at other ways to leverage it, and there are a lot of options. One thing is for sure, we are not done with this.

Thanks to Cloudflare, Sevki, Justin, and Matthew for helping us along this path. It is fair to say this is the single largest contribution to Kali that we have received outside of the support by Offensive Security.

Support we received from Cloudflare was amazing. The Kali project and community thanks you immensely every time they update their distribution or download an image.

Cloudflare architecture and how BPF eats the world

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/

Cloudflare architecture and how BPF eats the world

Recently at Netdev 0x13, the Conference on Linux Networking in Prague, I gave a short talk titled “Linux at Cloudflare”. The talk ended up being mostly about BPF. It seems, no matter the question – BPF is the answer.

Here is a transcript of a slightly adjusted version of that talk.


Cloudflare architecture and how BPF eats the world

At Cloudflare we run Linux on our servers. We operate two categories of data centers: large “Core” data centers, processing logs, analyzing attacks, computing analytics, and the “Edge” server fleet, delivering customer content from 180 locations across the world.

In this talk, we will focus on the “Edge” servers. It’s here where we use the newest Linux features, optimize for performance and care deeply about DoS resilience.


Cloudflare architecture and how BPF eats the world

Our edge service is special due to our network configuration – we are extensively using anycast routing. Anycast means that the same set of IP addresses are announced by all our data centers.

This design has great advantages. First, it guarantees the optimal speed for end users. No matter where you are located, you will always reach the closest data center. Then, anycast helps us to spread out DoS traffic. During attacks each of the locations receives a small fraction of the total traffic, making it easier to ingest and filter out unwanted traffic.


Cloudflare architecture and how BPF eats the world

Anycast allows us to keep the networking setup uniform across all edge data centers. We applied the same design inside our data centers – our software stack is uniform across the edge servers. All software pieces are running on all the servers.

In principle, every machine can handle every task – and we run many diverse and demanding tasks. We have a full HTTP stack, the magical Cloudflare Workers, two sets of DNS servers – authoritative and resolver, and many other publicly facing applications like Spectrum and Warp.

Even though every server has all the software running, requests typically cross many machines on their journey through the stack. For example, an HTTP request might be handled by a different machine during each of the 5 stages of the processing.


Cloudflare architecture and how BPF eats the world

Let me walk you through the early stages of inbound packet processing:

(1) First, the packets hit our router. The router does ECMP, and forwards packets onto our Linux servers. We use ECMP to spread each target IP across many, at least 16, machines. This is used as a rudimentary load balancing technique.

(2) On the servers we ingest packets with XDP eBPF. In XDP we perform two stages. First, we run volumetric DoS mitigations, dropping packets belonging to very large layer 3 attacks.

(3) Then, still in XDP, we perform layer 4 load balancing. All the non-attack packets are redirected across the machines. This is used to work around the ECMP problems, gives us fine-granularity load balancing and allows us to gracefully take servers out of service.

(4) Following the redirection the packets reach a designated machine. At this point they are ingested by the normal Linux networking stack, go through the usual iptables firewall, and are dispatched to an appropriate network socket.

(5) Finally packets are received by an application. For example HTTP connections are handled by a “protocol” server, responsible for performing TLS encryption and processing HTTP, HTTP/2 and QUIC protocols.

It’s in these early phases of request processing where we use the coolest new Linux features. We can group useful modern functionalities into three categories:

  • DoS handling
  • Load balancing
  • Socket dispatch


Cloudflare architecture and how BPF eats the world

Let’s discuss DoS handling in more detail. As mentioned earlier, the first step after ECMP routing is Linux’s XDP stack where, among other things, we run DoS mitigations.

Historically our mitigations for volumetric attacks were expressed in classic BPF and iptables-style grammar. Recently we adapted them to execute in the XDP eBPF context, which turned out to be surprisingly hard. Read on about our adventures:

During this project we encountered a number of eBPF/XDP limitations. One of them was the lack of concurrency primitives. It was very hard to implement things like race-free token buckets. Later we found that Facebook engineer Julia Kartseva had the same issues. In February this problem has been addressed with the introduction of bpf_spin_lock helper.


Cloudflare architecture and how BPF eats the world

While our modern volumetric DoS defenses are done in XDP layer, we still rely on iptables for application layer 7 mitigations. Here, a higher level firewall’s features are useful: connlimit, hashlimits and ipsets. We also use the xt_bpf iptables module to run cBPF in iptables to match on packet payloads. We talked about this in the past:


Cloudflare architecture and how BPF eats the world

After XDP and iptables, we have one final kernel side DoS defense layer.

Consider a situation when our UDP mitigations fail. In such case we might be left with a flood of packets hitting our application UDP socket. This might overflow the socket causing packet loss. This is problematic – both good and bad packets will be dropped indiscriminately. For applications like DNS it’s catastrophic. In the past to reduce the harm, we ran one UDP socket per IP address. An unmitigated flood was bad, but at least it didn’t affect the traffic to other server IP addresses.

Nowadays that architecture is no longer suitable. We are running more than 30,000 DNS IP’s and running that number of UDP sockets is not optimal. Our modern solution is to run a single UDP socket with a complex eBPF socket filter on it – using the SO_ATTACH_BPF socket option. We talked about running eBPF on network sockets in past blog posts:

The mentioned eBPF rate limits the packets. It keeps the state – packet counts – in an eBPF map. We can be sure that a single flooded IP won’t affect other traffic. This works well, though during work on this project we found a rather worrying bug in the eBPF verifier:

I guess running eBPF on a UDP socket is not a common thing to do.


Cloudflare architecture and how BPF eats the world

Apart from the DoS, in XDP we also run a layer 4 load balancer layer. This is a new project, and we haven’t talked much about it yet. Without getting into many details: in certain situations we need to perform a socket lookup from XDP.

The problem is relatively simple – our code needs to look up the “socket” kernel structure for a 5-tuple extracted from a packet. This is generally easy – there is a bpf_sk_lookup helper available for this. Unsurprisingly, there were some complications. One problem was the inability to verify if a received ACK packet was a valid part of a three-way handshake when SYN-cookies are enabled. My colleague Lorenz Bauer is working on adding support for this corner case.


Cloudflare architecture and how BPF eats the world

After DoS and the load balancing layers, the packets are passed onto the usual Linux TCP / UDP stack. Here we do a socket dispatch – for example packets going to port 53 are passed onto a socket belonging to our DNS server.

We do our best to use vanilla Linux features, but things get complex when you use thousands of IP addresses on the servers.

Convincing Linux to route packets correctly is relatively easy with the “AnyIP” trick. Ensuring packets are dispatched to the right application is another matter. Unfortunately, standard Linux socket dispatch logic is not flexible enough for our needs. For popular ports like TCP/80 we want to share the port between multiple applications, each handling it on a different IP range. Linux doesn’t support this out of the box. You can call bind() either on a specific IP address or all IP’s (with 0.0.0.0).


Cloudflare architecture and how BPF eats the world

In order to fix this, we developed a custom kernel patch which adds a SO_BINDTOPREFIX socket option. As the name suggests – it allows us to call bind() on a selected IP prefix. This solves the problem of multiple applications sharing popular ports like 53 or 80.

Then we run into another problem. For our Spectrum product we need to listen on all 65535 ports. Running so many listen sockets is not a good idea (see our old war story blog), so we had to find another way. After some experiments we learned to utilize an obscure iptables module – TPROXY – for this purpose. Read about it here:

This setup is working, but we don’t like the extra firewall rules. We are working on solving this problem correctly – actually extending the socket dispatch logic. You guessed it – we want to extend socket dispatch logic by utilizing eBPF. Expect some patches from us.


Cloudflare architecture and how BPF eats the world

Then there is a way to use eBPF to improve applications. Recently we got excited about doing TCP splicing with SOCKMAP:

This technique has a great potential for improving tail latency across many pieces of our software stack. The current SOCKMAP implementation is not quite ready for prime time yet, but the potential is vast.

Similarly, the new TCP-BPF aka BPF_SOCK_OPS hooks provide a great way of inspecting performance parameters of TCP flows. This functionality is super useful for our performance team.


Cloudflare architecture and how BPF eats the world

Some Linux features didn’t age well and we need to work around them. For example, we are hitting limitations of networking metrics. Don’t get me wrong – the networking metrics are awesome, but sadly they are not granular enough. Things like TcpExtListenDrops and TcpExtListenOverflows are reported as global counters, while we need to know it on a per-application basis.

Our solution is to use eBPF probes to extract the numbers directly from the kernel. My colleague Ivan Babrou wrote a Prometheus metrics exporter called “ebpf_exporter” to facilitate this. Read on:

With “ebpf_exporter” we can generate all manner of detailed metrics. It is very powerful and saved us on many occasions.


Cloudflare architecture and how BPF eats the world

In this talk we discussed 6 layers of BPFs running on our edge servers:

  • Volumetric DoS mitigations are running on XDP eBPF
  • Iptables xt_bpf cBPF for application-layer attacks
  • SO_ATTACH_BPF for rate limits on UDP sockets
  • Load balancer, running on XDP
  • eBPFs running application helpers like SOCKMAP for TCP socket splicing, and TCP-BPF for TCP measurements
  • “ebpf_exporter” for granular metrics

And we’re just getting started! Soon we will be doing more with eBPF based socket dispatch, eBPF running on Linux TC (Traffic Control) layer and more integration with cgroup eBPF hooks. Then, our SRE team is maintaining ever-growing list of BCC scripts useful for debugging.

It feels like Linux stopped developing new API’s and all the new features are implemented as eBPF hooks and helpers. This is fine and it has strong advantages. It’s easier and safer to upgrade eBPF program than having to recompile a kernel module. Some things like TCP-BPF, exposing high-volume performance tracing data, would probably be impossible without eBPF.

Some say “software is eating the world”, I would say that: “BPF is eating the software”.

eBPF can’t count?!

Post Syndicated from Jakub Sitnicki original https://blog.cloudflare.com/ebpf-cant-count/

eBPF can't count?!
Grant mechanical calculating machine, public domain image

eBPF can't count?!

It is unlikely we can tell you anything new about the extended Berkeley Packet Filter, eBPF for short, if you’ve read all the great man pages, docs, guides, and some of our blogs out there.

But we can tell you a war story, and who doesn’t like those? This one is about how eBPF lost its ability to count for a while1.

They say in our Austin, Texas office that all good stories start with "y’all ain’t gonna believe this… tale." This one though, starts with a post to Linux netdev mailing list from Marek Majkowski after what I heard was a long night:

eBPF can't count?!

Marek’s findings were quite shocking – if you subtract two 64-bit timestamps in eBPF, the result is garbage. But only when running as an unprivileged user. From rool all works fine. Huh.

If you’ve seen Marek’s presentation from the Netdev 0x13 conference, you know that we are using BPF socket filters as one of the defenses against simple, volumetric DoS attacks. So potentially getting your packet count wrong could be a Bad Thing™, and affect legitimate traffic.

Let’s try to reproduce this bug with a simplified eBPF socket filter that subtracts two 64-bit unsigned integers passed to it from user-space though a BPF map. The input for our BPF program comes from a BPF array map, so that the values we operate on are not known at build time. This allows for easy experimentation and prevents the compiler from optimizing out the operations.

Starting small, eBPF, what is 2 – 1? View the code on our GitHub.

$ ./run-bpf 2 1
arg0                    2 0x0000000000000002
arg1                    1 0x0000000000000001
diff                    1 0x0000000000000001

OK, eBPF, what is 2^32 – 1?

$ ./run-bpf $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff 18446744073709551615 0xffffffffffffffff

Wrong! But if we ask nicely with sudo:

$ sudo ./run-bpf $[2**32] 1
[sudo] password for jkbs:
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff

Who is messing with my eBPF?

When computers stop subtracting, you know something big is up. We called for reinforcements.

Our colleague Arthur Fabre quickly noticed something is off when you examine the eBPF code loaded into the kernel. It turns out kernel doesn’t actually run the eBPF it’s supplied – it sometimes rewrites it first.

Any sane programmer would expect 64-bit subtraction to be expressed as a single eBPF instruction

$ llvm-objdump -S -no-show-raw-insn -section=socket1 bpf/filter.o
…
      20:       1f 76 00 00 00 00 00 00         r6 -= r7
…

However, that’s not what the kernel actually runs. Apparently after the rewrite the subtraction becomes a complex, multi-step operation.

To see what the kernel is actually running we can use little known bpftool utility. First, we need to load our BPF

$ ./run-bpf --stop-after-load 2 1
[2]+  Stopped                 ./run-bpf 2 1

Then list all BPF programs loaded into the kernel with bpftool prog list

$ sudo bpftool prog list
…
5951: socket_filter  name filter_alu64  tag 11186be60c0d0c0f  gpl
        loaded_at 2019-04-05T13:01:24+0200  uid 1000
        xlated 424B  jited 262B  memlock 4096B  map_ids 28786

The most recently loaded socket_filter must be our program (filter_alu64). Now we now know its id is 5951 and we can list its bytecode with

$ sudo bpftool prog dump xlated id 5951
…
  33: (79) r7 = *(u64 *)(r0 +0)
  34: (b4) (u32) r11 = (u32) -1
  35: (1f) r11 -= r6
  36: (4f) r11 |= r6
  37: (87) r11 = -r11
  38: (c7) r11 s>>= 63
  39: (5f) r6 &= r11
  40: (1f) r6 -= r7
  41: (7b) *(u64 *)(r10 -16) = r6
…

bpftool can also display the JITed code with: bpftool prog dump jited id 5951.

As you see, subtraction is replaced with a series of opcodes. That is unless you are root. When running from root all is good

$ sudo ./run-bpf --stop-after-load 0 0
[1]+  Stopped                 sudo ./run-bpf --stop-after-load 0 0
$ sudo bpftool prog list | grep socket_filter
659: socket_filter  name filter_alu64  tag 9e7ffb08218476f3  gpl
$ sudo bpftool prog dump xlated id 659
…
  31: (79) r7 = *(u64 *)(r0 +0)
  32: (1f) r6 -= r7
  33: (7b) *(u64 *)(r10 -16) = r6
…

If you’ve spent any time using eBPF, you must have experienced first hand the dreaded eBPF verifier. It’s a merciless judge of all eBPF code that will reject any programs that it deems not worthy of running in kernel-space.

What perhaps nobody has told you before, and what might come as a surprise, is that the very same verifier will actually also rewrite and patch up your eBPF code as needed to make it safe.

The problems with subtraction were introduced by an inconspicuous security fix to the verifier. The patch in question first landed in Linux 5.0 and was backported to 4.20.6 stable and 4.19.19 LTS kernel. The over 2000 words long commit message doesn’t spare you any details on the attack vector it targets.

The mitigation stems from CVE-2019-7308 vulnerability discovered by Jann Horn at Project Zero, which exploits pointer arithmetic, i.e. adding a scalar value to a pointer, to trigger speculative memory loads from out-of-bounds addresses. Such speculative loads change the CPU cache state and can be used to mount a Spectre variant 1 attack.

To mitigate it the eBPF verifier rewrites any arithmetic operations on pointer values in such a way the result is always a memory location within bounds. The patch demonstrates how arithmetic operations on pointers get rewritten and we can spot a familiar pattern there

eBPF can't count?!

Wait a minute… What pointer arithmetic? We are just trying to subtract two scalar values. How come the mitigation kicks in?

It shouldn’t. It’s a bug. The eBPF verifier keeps track of what kind of values the ALU is operating on, and in this corner case the state was ignored.

Why running BPF as root is fine, you ask? If your program has CAP_SYS_ADMIN privileges, side-channel mitigations don’t apply. As root you already have access to kernel address space, so nothing new can leak through BPF.

After our report, the fix has quickly landed in v5.0 kernel and got backported to stable kernels 4.20.15 and 4.19.28. Kudos to Daniel Borkmann for getting the fix out fast. However, kernel upgrades are hard and in the meantime we were left with code running in production that was not doing what it was supposed to.

32-bit ALU to the rescue

As one of the eBPF maintainers has pointed out, 32-bit arithmetic operations are not affected by the verifier bug. This opens a door for a work-around.

eBPF registers, r0..r10, are 64-bits wide, but you can also access just the lower 32 bits, which are exposed as subregisters w0..w10. You can operate on the 32-bit subregisters using BPF ALU32 instruction subset. LLVM 7+ can generate eBPF code that uses this instruction subset. Of course, you need to you ask it nicely with trivial -Xclang -target-feature -Xclang +alu32 toggle:

$ cat sub32.c
#include "common.h"

u32 sub32(u32 x, u32 y)
{
        return x - y;
}
$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub32.c
$ llvm-objdump -S -no-show-raw-insn sub32.o
…
sub32:
       0:       bc 10 00 00 00 00 00 00         w0 = w1
       1:       1c 20 00 00 00 00 00 00         w0 -= w2
       2:       95 00 00 00 00 00 00 00         exit

The 0x1c opcode of the instruction #1, which can be broken down as BPF_ALU | BPF_X | BPF_SUB (read more in the kernel docs), is the 32-bit subtraction between registers we are looking for, as opposed to regular 64-bit subtract operation 0x1f = BPF_ALU64 | BPF_X | BPF_SUB, which will get rewritten.

Armed with this knowledge we can borrow a page from bignum arithmetic and subtract 64-bit numbers using just 32-bit ops:

u64 sub64(u64 x, u64 y)
{
        u32 xh, xl, yh, yl;
        u32 hi, lo;

        xl = x;
        yl = y;
        lo = xl - yl;

        xh = x >> 32;
        yh = y >> 32;
        hi = xh - yh - (lo > xl); /* underflow? */

        return ((u64)hi << 32) | (u64)lo;
}

This code compiles as expected on normal architectures, like x86-64 or ARM64, but BPF Clang target plays by its own rules:

$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub64.c -o - \
  | llvm-objdump -S -
…  
      13:       1f 40 00 00 00 00 00 00         r0 -= r4
      14:       1f 30 00 00 00 00 00 00         r0 -= r3
      15:       1f 21 00 00 00 00 00 00         r1 -= r2
      16:       67 00 00 00 20 00 00 00         r0 <<= 32
      17:       67 01 00 00 20 00 00 00         r1 <<= 32
      18:       77 01 00 00 20 00 00 00         r1 >>= 32
      19:       4f 10 00 00 00 00 00 00         r0 |= r1
      20:       95 00 00 00 00 00 00 00         exit

Apparently the compiler decided it was better to operate on 64-bit registers and discard the upper 32 bits. Thus we weren’t able to get rid of the problematic 0x1f opcode. Annoying, back to square one.

Surely a bit of IR will do?

The problem was in Clang frontend – compiling C to IR. We know that BPF "assembly" backend for LLVM can generate bytecode that uses ALU32 instructions. Maybe if we tweak the Clang compiler’s output just a little we can achieve what we want. This means we have to get our hands dirty with the LLVM Intermediate Representation (IR).

If you haven’t heard of LLVM IR before, now is a good time to do some reading2. In short the LLVM IR is what Clang produces and LLVM BPF backend consumes.

Time to write IR by hand! Here’s a hand-tweaked IR variant of our sub64() function:

define dso_local i64 @sub64_ir(i64, i64) local_unnamed_addr #0 {
  %3 = trunc i64 %0 to i32      ; xl = (u32) x;
  %4 = trunc i64 %1 to i32      ; yl = (u32) y;
  %5 = sub i32 %3, %4           ; lo = xl - yl;
  %6 = zext i32 %5 to i64
  %7 = lshr i64 %0, 32          ; tmp1 = x >> 32;
  %8 = lshr i64 %1, 32          ; tmp2 = y >> 32;
  %9 = trunc i64 %7 to i32      ; xh = (u32) tmp1;
  %10 = trunc i64 %8 to i32     ; yh = (u32) tmp2;
  %11 = sub i32 %9, %10         ; hi = xh - yh
  %12 = icmp ult i32 %3, %5     ; tmp3 = xl < lo
  %13 = zext i1 %12 to i32
  %14 = sub i32 %11, %13        ; hi -= tmp3
  %15 = zext i32 %14 to i64
  %16 = shl i64 %15, 32         ; tmp2 = hi << 32
  %17 = or i64 %16, %6          ; res = tmp2 | (u64)lo
  ret i64 %17
}

It may not be pretty but it does produce desired BPF code when compiled3. You will likely find the LLVM IR reference helpful when deciphering it.

And voila! First working solution that produces correct results:

$ ./run-bpf -filter ir $[2**32] 1
arg0           4294967296 0x0000000100000000
arg1                    1 0x0000000000000001
diff           4294967295 0x00000000ffffffff

Actually using this hand-written IR function from C is tricky. See our code on GitHub.

eBPF can't count?!
public domain image by Sergei Frolov

The final trick

Hand-written IR does the job. The downside is that linking IR modules to your C modules is hard. Fortunately there is a better way. You can persuade Clang to stick to 32-bit ALU ops in generated IR.

We’ve already seen the problem. To recap, if we ask Clang to subtract 32-bit integers, it will operate on 64-bit values and throw away the top 32-bits. Putting C, IR, and eBPF side-by-side helps visualize this:

eBPF can't count?!

The trick to get around it is to declare the 32-bit variable that holds the result as volatile. You might already know the volatile keyword if you’ve written Unix signal handlers. It basically tells the compiler that the value of the variable may change under its feet so it should refrain from reorganizing loads (reads) from it, as well as that stores (writes) to it might have side-effects so changing the order or eliminating them, by skipping writing it to the memory, is not allowed either.

Using volatile makes Clang emit special loads and/or stores at the IR level, which then on eBPF level translates to writing/reading the value from memory (stack) on every access. While this sounds not related to the problem at hand, there is a surprising side-effect to it:

eBPF can't count?!

With volatile access compiler doesn’t promote the subtraction to 64 bits! Don’t ask me why, although I would love to hear an explanation. For now, consider this a hack. One that does not come for free – there is the overhead of going through the stack on each read/write.

However, if we play our cards right we just might reduce it a little. We don’t actually need the volatile load or store to happen, we just want the side effect. So instead of declaring the value as volatile, which implies that both reads and writes are volatile, let’s try to make only the writes volatile with a help of a macro:

/* Emits a "store volatile" in LLVM IR */
#define ST_V(rhs, lhs) (*(volatile typeof(rhs) *) &(rhs) = (lhs))

If this macro looks strangely familiar, it’s because it does the same thing as WRITE_ONCE() macro in the Linux kernel. Applying it to our example:

eBPF can't count?!

That’s another hacky but working solution. Pick your poison.

eBPF can't count?!
CC BY-SA 3.0 image by ANKAWÜ

So there you have it – from C, to IR, and back to C to hack around a bug in eBPF verifier and be able to subtract 64-bit integers again. Usually you won’t have to dive into LLVM IR or assembly to make use of eBPF. But it does help to know a little about it when things don’t work as expected.

Did I mention that 64-bit addition is also broken? Have fun fixing it!


1 Okay, it was more like 3 months time until the bug was discovered and fixed.

2 Some even think that it is better than assembly.

3 How do we know? The litmus test is to look for statements matching r[0-9] [-+]= r[0-9] in BPF assembly.

xdpcap: XDP Packet Capture

Post Syndicated from Arthur Fabre original https://blog.cloudflare.com/xdpcap/

xdpcap: XDP Packet Capture

Our servers process a lot of network packets, be it legitimate traffic or large denial of service attacks. To do so efficiently, we’ve embraced eXpress Data Path (XDP), a Linux kernel technology that provides a high performance mechanism for low level packet processing. We’re using it to drop DoS attack packets with L4Drop, and also in our new layer 4 load balancer. But there’s a downside to XDP: because it processes packets before the normal Linux network stack sees them, packets redirected or dropped are invisible to regular debugging tools such as tcpdump.

To address this, we built a tcpdump replacement for XDP, xdpcap. We are open sourcing this tool: the code and documentation are available on GitHub.

xdpcap uses our classic BPF (cBPF) to eBPF or C compiler, cbpfc, which we are also open sourcing: the code and documentation are available on GitHub.

xdpcap: XDP Packet Capture
CC BY 4.0 image by Christoph Müller

Tcpdump provides an easy way to dump specific packets of interest. For example, to capture all IPv4 DNS packets, one could:

$ tcpdump ip and udp port 53

xdpcap reuses the same syntax! xdpcap can write packets to a pcap file:

$ xdpcap /path/to/hook capture.pcap "ip and udp port 53"
XDPAborted: 0/0   XDPDrop: 0/0   XDPPass: 254/0   XDPTx: 0/0   (received/matched packets)
XDPAborted: 0/0   XDPDrop: 0/0   XDPPass: 995/1   XDPTx: 0/0   (received/matched packets)

Or write the pcap to stdout, and decode the packets with tcpdump:

$ xdpcap /path/to/hook - "ip and udp port 53" | sudo tcpdump -r -
reading from file -, link-type EN10MB (Ethernet)
16:18:37.911670 IP 1.1.1.1 > 1.2.3.4.21563: 26445$ 1/0/1 A 93.184.216.34 (56)

The remainder of this post explains how we built xdpcap, including how /path/to/hook/ is used to attach to XDP programs.

tcpdump

To replicate tcpdump, we first need to understand its inner workings. Marek Majkowski has previously written a detailed post on the subject. Tcpdump exposes a high level filter language, pcap-filter, to specify which packets are of interest. Reusing our earlier example, the following filter expression captures all IPv4 UDP packets to or from port 53, likely DNS traffic:

ip and udp port 53

Internally, tcpdump uses libpcap to compile the filter to classic BPF (cBPF). cBPF is a simple bytecode language to represent programs that inspect the contents of a packet. A program returns non-zero to indicate that a packet matched the filter, and zero otherwise. The virtual machine that executes cBPF programs is very simple, featuring only two registers, a and x. There is no way of checking the length of the input packet[1]; instead any out of bounds packet access will terminate the cBPF program, returning 0 (no match). The full set of opcodes are listed in the Linux documentation. Returning to our example filter, ip and udp port 53 compiles to the following cBPF program, expressed as an annotated flowchart:

Example cBPF filter flowchart

Tcpdump attaches the generated cBPF filter to a raw packet socket using a setsockopt system call with SO_ATTACH_FILTER. The kernel runs the filter on every packet destined for the socket, but only delivers matching packets. Tcpdump displays the delivered packets, or writes them to a pcap capture file for later analysis.

xdpcap

In the context of XDP, our tcpdump replacement should:

  • Accept filters in the same filter language as tcpdump
  • Dynamically instrument XDP programs of interest
  • Expose matching packets to userspace

XDP

XDP uses an extended version of the cBPF instruction set, eBPF, to allow arbitrary programs to run for each packet received by a network card, potentially modifying the packets. A stringent kernel verifier statically analyzes eBPF programs, ensuring that memory bounds are checked for every packet load.

eBPF programs can return:

  • XDP_DROP: Drop the packet
  • XDP_TX: Transmit the packet back out the network interface
  • XDP_PASS: Pass the packet up the network stack

eBPF introduces several new features, notably helper function calls, enabling programs to call functions exposed by the kernel. This includes retrieving or setting values in maps, key-value data structures that can also be accessed from userspace.

Filter

A key feature of tcpdump is the ability to efficiently pick out packets of interest; packets are filtered before reaching userspace. To achieve this in XDP, the desired filter must be converted to eBPF.

cBPF is already used in our XDP based DoS mitigation pipeline: cBPF filters are first converted to C by cbpfc, and the result compiled with Clang to eBPF. Reusing this mechanism allows us to fully support libpcap filter expressions:

Pipeline to convert pcap-filter expressions to eBPF via C using cbpfc

To remove the Clang runtime dependency, our cBPF compiler, cbpfc, was extended to directly generate eBPF:

Pipeline to convert pcap-filter expressions directly to eBPF using cbpfc

Converted to eBPF using cbpfc, ip and udp port 53 yields:

Example cBPF filter converted to eBPF with cbpfc flowchart

The emitted eBPF requires a prologue, which is responsible for loading a pointer to the beginning, and end, of the input packet into registers r6 and r7 respectively[2].

The generated code follows a very similar structure to the original cBPF filter, but with:

  • Bswap instructions to convert big endian packet data to little endian.
  • Guards to check the length of the packet before we load data from it. These are required by the kernel verifier.

The epilogue can use the result of the filter to perform different actions on the input packet.

As mentioned earlier, we’re open sourcing cbpfc; the code and documentation are available on GitHub. It can be used to compile cBPF to C, or directly to eBPF, and the generated code is accepted by the kernel verifier.

Instrument

Tcpdump can start and stop capturing packets at any time, without requiring coordination from applications. This rules out modifying existing XDP programs to directly run the generated eBPF filter; the programs would have to be modified each time xdpcap is run. Instead, programs should expose a hook that can be used by xdpcap to attach filters at runtime.

xdpcap’s hook support is built around eBPF tail-calls. XDP programs can yield control to other programs using the tail-call helper. Control is never handed back to the calling program, the return code of the subsequent program is used. For example, consider two XDP programs, foo and bar, with foo attached to the network interface. Foo can tail-call into bar:

Flow of XDP program foo tail-calling into program bar

The program to tail-call into is configured at runtime, using a special eBPF program array map. eBPF programs tail-call into a specific index of the map, the value of which is set by userspace. From our example above, foo’s tail-call map holds a single entry:

indexprogram
0bar

A tail-call into an empty index will not do anything, XDP programs always need to return an action themselves after a tail-call should it fail. Once again, this is enforced by the kernel verifier. In the case of program foo:

int foo(struct xdp_md *ctx) {
    // tail-call into index 0 - program bar
    tail_call(ctx, &map, 0);

    // tail-call failed, pass the packet
    return XDP_PASS;
}

To leverage this as a hook point, the instrumented programs are modified to always tail-call, using a map that is exposed to xdpcap by pinning it to a bpffs. To attach a filter, xdpcap can set it in the map. If no filter is attached, the instrumented program returns the correct action itself.

With a filter attached to program foo, we have:

Flow of XDP program foo tail-calling into an xdpcap filter

The filter must return the original action taken by the instrumented program to ensure the packet is processed correctly. To achieve this, xdpcap generates one filter program per possible XDP action, each one hardcoded to return that specific action. All the programs are set in the map:

indexprogram
0 (XDP_ABORTED)filter XDP_ABORTED
1 (XDP_DROP)filter XDP_DROP
2 (XDP_PASS)filter XDP_PASS
3 (XDP_TX)filter XDP_TX

By tail-calling into the correct index, the instrumented program determines the final action:

Flow of XDP program foo tail-calling into xdpcap filters, one for each action

xdpcap provides a helper function that attempts a tail-call for the given action. Should it fail, the action is returned instead:

enum xdp_action xdpcap_exit(struct xdp_md *ctx, enum xdp_action action) {
    // tail-call into the filter using the action as an index
    tail_call((void *)ctx, &xdpcap_hook, action);

    // tail-call failed, return the action
    return action;
}

This allows an XDP program to simply:

int foo(struct xdp_md *ctx) {
    return xdpcap_exit(ctx, XDP_PASS);
}

Expose

Matching packets, as well as the original action taken for them, need to be exposed to userspace. Once again, such a mechanism is already part of our XDP based DoS mitigation pipeline!

Another eBPF helper, perf_event_output, allows an XDP program to generate a perf event containing, amongst some metadata, the packet. As xdpcap generates one filter per XDP action, the filter program can include the action taken in the metadata. A userspace program can create a perf event ring buffer to receive events into, obtaining both the action and the packet.


  1. This is true of the original cBPF, but Linux implements a number of extensions, one of which allows the length of the input packet to be retrieved. ↩︎

  2. This example uses registers r6 and r7, but cbpfc can be configured to use any registers. ↩︎

Trace Event, Chrome and More Profile Formats on FlameScope

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/trace-event-chrome-and-more-profile-formats-on-flamescope-5dfe9df5dfa9?source=rss----2615bd06b42e---4

FlameScope sub-second heatmap visualization.

Less than a year ago, FlameScope was released as a proof of concept for a new profile visualization. Since then, it helped us, and many other users, to easily find and fix performance issues, and allowed us to see patterns that we had never noticed before in our profiles.

As a tool, FlameScope was limited. It only supported the profile format generated by Linux perf, which at the time, was the profiler of choice internally at Netflix.

Immediately after launch, we received multiple requests to support other profile formats. Users looking to use the FlameScope visualization, with their own profilers and tools. Our goal was never to support hundreds of profile formats, especially for tools we don’t use internally, but we always knew that supporting a few “generic” formats would be useful, both for us, and the community.

After receiving multiple requests from users and investigating a few profile formats, we opted to support the Trace Event Format. It is well documented. It is flexible. Multiple tools already use it, and it is the format used by Trace-Viewer, which is the javascript frontend for Chrome’s about:tracing and Android’s systrace tools.

The complete documentation for the format can be found here, but in a nutshell, it consists of an ordered list of events. For now, FlameScope only supports Duration and Complete event types. According to the documentation:

Duration events provide a way to mark a duration of work on a given thread. The duration events are specified by the B and E phase types. The B event must come before the corresponding E event. You can nest the B and E events. This allows you to capture function calling behavior on a thread.

Each complete event logically combines a pair of duration (B and E) events. The complete events are designated by the X phase type.

There is an extra parameter dur to specify the tracing clock duration of complete events in microseconds. All other parameters are the same as in duration events.

Here’s an example:

[
{
    "name": "Asub",
    "cat": "PERF",
    "ph": "B",
    "pid": 22630,
    "tid": 22630,
    "ts": 829
  }, {
    "name": "Asub",
    "cat": "PERF",
    "ph": "E",
    "pid": 22630,
    "tid": 22630,
    "ts": 833
  }
]

As you can imagine, this format works really well for tracing profilers, where the beginning and end of work units are recorded. For sampling based profilers, like perf, the format is not ideal. We could create a Complete event for every sample, with stacks, but even being more efficient than the output generated by perf, there is still a lot of overhead, especially from repeated stacks. Another option would be to analyze the whole profile and create begin and end events every time we enter or exit a stack frame, but that adds complexity to converters.

Since we also work with sampling profilers frequently, we needed a simpler format. In the past, we worked with profiles in the v8 profiler format, which is very similar to Chrome’s old JavaScript CPU profiler format and newer ProfileType event format. We already had all the code needed to generate both heatmap and partial flame graphs, so we decided to use it as base for a new format, which for lack of a more creative name, we called nflxprofile. Different from the v8 profiler format, it uses a map instead of a list to store the nodes, includes extra information about the profile, and takes advantage Protocol Buffers to serialize the data instead of JSON. The .proto file looks like this:

syntax = “proto2”;
package nflxprofile;
message Profile {
  required double start_time = 1;
  required double end_time = 2;
  repeated uint32 samples = 3 [packed=true];
  repeated double time_deltas = 4 [packed=true];
  message Node {
    required string function_name = 1;
    required uint32 hit_count = 2;
    repeated uint32 children = 3;
    optional string libtype = 4;
  }
  map<uint32, Node> nodes = 5;
}

It can be found on FlameScope’s repository too, and be used to generate code for the multiple programming languages supported by Protocol Buffers.

Netflix has been using the new format internally in its cloud profiling tool, and the improvement is noticeable. The significant reduction in file size, from raw perf output to nflxprofile, allows for faster download time from external storage. The reduction will depend on sampling duration and how homogeneous the workload is (similar stacks), but generally, the output is orders of magnitude smaller. Time spent on parsing and deserialization is also reduced significantly. No more regular expressions!

Since the new format is so similar to what Chrome generates, we decided to include it too! It has been challenging to keep up with the constant changes in DevTools, from CpuProfile, to Profile and now ProfileChunk events, but the format is supported as of now. If you want to try it out, check out the Get Started With Analyzing Runtime Performance post, record and save the profile to FlameScope’s profile directory, and open it!

We also had to make minor adjustments to the user interface, more specifically the file list, to support the new profile formats. Now, instead of a simple list, you will get a dropdown menu next to each file that allows you to select the correct profile type.

New profile selection dropdown.

We might consider adding support for more formats in the future, or accept pull requests that add support for something new, but in general, profile converters are the simplest solution. If you created a converter for a known profile format, we are happy to link to it on FlameScope’s documentation!

FlameScope was developed by Martin Spier, Brendan Gregg and the Netflix performance engineering team. Blog post by Martin Spier.


Trace Event, Chrome and More Profile Formats on FlameScope was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.