All posts by Chris Howells

Deploying firmware at Cloudflare-scale: updating thousands of servers in more than 285 cities

Post Syndicated from Chris Howells original https://blog.cloudflare.com/deploying-firmware-at-cloudflare-scale-how-we-update-thousands-of-servers-in-more-than-285-cities/

As a security company, it’s critical that we have good processes for dealing with security issues. We regularly release software to our servers, even on a daily basis, including new features, bug fixes and, as required, security patches. But just as critical is the software embedded in the server hardware, known as firmware. Primarily of interest are the BIOS and the Baseboard Management Controller (BMC), but many other components, such as Network Interface Cards (NICs), also have firmware.

As the world becomes more digital, software which needs updating is appearing in more and more devices. As well as my computer, over the last year, I have waited patiently while firmware has updated in my TV, vacuum cleaner, lawn mower and light bulbs. It can be a cumbersome process, including obtaining the firmware, deploying it to the device which needs updating, navigating menus and other commands to initiate the update, and then waiting several minutes for the update to complete.

Firmware updates can be annoying even if you only have a couple of devices. We have more than a few devices at Cloudflare. We have a huge number of servers of varying kinds, from varying vendors, spread over 285 cities worldwide. We need to be able to rapidly deploy various types of firmware updates to all of them, reliably, and automatically, without any kind of manual intervention.

In this blog post I will outline the methods that we use to automate firmware deployment to our entire fleet. We have been using these methods for several years now, and have deployed firmware entirely automatically, without any intervention from our SRE team.

Background

A key component of our ability to deploy firmware at scale is iPXE, an open source boot loader. iPXE is the glue which operates between the server and the operating system, and is responsible for loading the operating system after the server has completed its Power On Self Test (POST). It is very flexible and contains a scripting language. With iPXE, we can write boot scripts which query the firmware version, continue booting if the correct firmware version is deployed, or, if not, boot into a flashing environment to flash the correct firmware.

We only deploy new firmware when our systems are out of production, so we need a method to coordinate deployment on out of production systems only. The simplest time to do this is when they are rebooting, because by definition they are out of production then. We reboot our entire fleet every month, and have the ability to schedule reboots more urgently if required to deal with a security issue. Regularly rebooting our fleet has many advantages: we can deploy the latest Linux kernel and base operating system, and verify that nothing in our operating system and configuration management environment breaks on a fresh boot.

Our entire fleet operates in UEFI mode. UEFI is a modern replacement for the BIOS and offers more features and more security, such as Secure Boot. A full description of UEFI is outside the scope of this article, but essentially it provides a minimal environment and shell capable of executing binaries. Secure Boot ensures that the binaries are signed with keys embedded in the system, to prevent a bad actor from tampering with our software.

How we update the BIOS

We are able to update the BIOS without booting any operating system, purely by taking advantage of features offered by iPXE and the UEFI shell. This requires a flashing binary written for the UEFI environment.

Upon boot, iPXE is started. Through iPXE’s built-in variable ${smbios/0.5.0} it is possible to query the current BIOS version, compare it to the latest version, and trigger a flash only if there is a mismatch. iPXE then downloads the files required for the firmware update to a ramdisk.

The following is an example of a very basic iPXE script which performs such an action:

# Check whether the BIOS version is 2.03
iseq ${smbios/0.5.0} 2.03 || goto biosupdate
echo Nothing to do for {{ model }}
exit 0

:biosupdate
echo Trying to update BIOS/UEFI...
echo Current: ${smbios/0.5.0}
echo New: 2.03

imgfetch ${boot_prefix}/tools/x64/shell.efi || goto unexpected_error
imgfetch startup.nsh || goto unexpected_error

imgfetch AfuEfix64.efi || goto unexpected_error
imgfetch bios-2.03.bin || goto unexpected_error

imgexec shell.efi || goto unexpected_error

Meanwhile, startup.nsh contains the binary to run and the command line arguments to effect the flash (the flags select which firmware regions to program, and /REBOOT instructs the utility to reboot once flashing completes):

startup.nsh:

%homefilesystem%\AfuEfix64.efi %homefilesystem%\bios-2.03.bin /P /B /K /N /X /RLC:E /REBOOT

After rebooting, the machine will boot using its new BIOS firmware, version 2.03. Since ${smbios/0.5.0} now contains 2.03, the machine continues to boot and enter production.

Other firmware updates such as BMC, network cards and more

Unfortunately, the number of vendors that support firmware updates with UEFI flashing binaries is limited. There are a large number of other updates that we need to perform, such as BMC and NIC firmware.

Consequently, we need another way to flash these components. Thankfully, these vendors invariably support flashing from Linux, so we can perform flashing from a minimal Linux environment. Since vendor firmware updates are typically closed source utilities, and vendors are often highly secretive about firmware flashing, we ensure that the flashing environment does not present an attack surface by leaving the network unconfigured. If it’s not on the network, it can’t be attacked and exploited.

Not being on the network means that we need to inject files into the boot process when the machine boots. We accomplish this with an initial ramdisk (initrd), and iPXE makes it easy to add additional initrds to the boot.

Creating an initrd is as simple as creating an archive of the files with cpio, using the newc archive format.

Let’s imagine we are going to flash Broadcom NIC firmware. We’ll use the bnxtnvm firmware update utility, the firmware image firmware.pkg, and a shell script called flash to automate the task.

The files are laid out in the file system like this:

cd broadcom
find .
./opt/preflight
./opt/preflight/scripts
./opt/preflight/scripts/flash
./opt/broadcom
./opt/broadcom/firmware.pkg
./opt/broadcom/bnxtnvm

Now we compress all of these files into an image called broadcom.img.

find . | cpio --quiet -H newc -o | gzip -9 -n > ../broadcom.img

This is the first step completed; we have the firmware packaged up into an initrd.

Since it’s challenging to read, say, the firmware version of the NIC from the EFI shell, we store firmware versions as UEFI variables. These can be written from Linux via efivarfs, the UEFI variable file system, and then read by iPXE on boot.

An example of writing an EFI variable from Linux looks like this:

declare -r fw_path='/sys/firmware/efi/efivars/broadcom-fw-9ca25c23-368a-4c21-943f-7d91f2b76008'
# The first four bytes of an efivarfs file encode the variable's attributes;
# 0x00000007 = NON_VOLATILE | BOOTSERVICE_ACCESS | RUNTIME_ACCESS, little-endian
declare -r efi_header='\x07\x00\x00\x00'
declare -r version='1.05'

/bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars

# Files on efivarfs are immutable by default, so remove the immutable flag so that we can write to it: https://docs.kernel.org/filesystems/efivarfs.html
if [ -f "${fw_path}" ] ; then
    /usr/bin/chattr -i "${fw_path}"
fi

echo -n -e "${efi_header}${version}" >| "$fw_path"
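
For completeness, reading the variable back from Linux is the mirror image: efivarfs returns the same four byte attribute header first, so it must be skipped (a minimal sketch, assuming the same fw_path as above):

# Skip the 4-byte attribute header to recover the stored version string
tail -c +5 "${fw_path}"    # prints: 1.05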

Then we can write an iPXE configuration file to load the flashing kernel, userland and flashing utilities.

set cf/guid 9ca25c23-368a-4c21-943f-7d91f2b76008

iseq ${efivar/broadcom-fw-${cf/guid}} 1.05 && echo Not flashing broadcom firmware, version already at 1.05 || goto update
exit

:update
echo Starting broadcom firmware update
kernel ${boot_prefix}/vmlinuz initrd=baseimg.img initrd=linux-initramfs-modules.img initrd=broadcom.img
initrd ${boot_prefix}/baseimg.img
initrd ${boot_prefix}/linux-initramfs-modules.img
initrd ${boot_prefix}/firmware/broadcom.img

Flashing scripts are deposited into /opt/preflight/scripts and we use systemd to execute them with run-parts on boot:

/etc/systemd/system/preflight.service:

[Unit]
Description=Pre-salt checks and simple configurations on boot
Before=salt-highstate.service
After=network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/run-parts --verbose /opt/preflight/scripts

[Install]
WantedBy=multi-user.target
RequiredBy=salt-highstate.service

An example flashing script in /opt/preflight/scripts might look like:

#!/bin/bash

trap 'catch $? $LINENO' ERR
catch(){
    #error handling goes here
    echo "Error $1 occured on line $2"
}

declare -r fw_path='/sys/firmware/efi/efivars/broadcom-fw-9ca25c23-368a-4c21-943f-7d91f2b76008'
declare -r efi_header='\x07\x00\x00\x00'
declare -r version='1.05'

# Only attempt a flash if a Broadcom NIC is present
if lspci | grep -q Broadcom; then
    echo "Broadcom firmware flashing starting"
    if [ ! -f "$fw_path" ] ; then
        chmod +x /opt/broadcom/bnxtnvm
        declare -r interface=$(/opt/broadcom/bnxtnvm listdev | grep "Device Interface Name" | awk -F ": " '{print $2}')
        /opt/broadcom/bnxtnvm -dev=${interface} -force -y install /opt/broadcom/firmware.pkg
        declare -r status=$?
        declare -r currentversion=$(/opt/broadcom/bnxtnvm -dev=${interface} device_info | grep "Package version on NVM" | awk -F ": " '{print $2}')
        # Compare the version reported by the NIC with the version we expect
        if [ "$status" -eq 0 ] && [ "$currentversion" = "$version" ]; then
            echo "Broadcom firmware $version flashed successfully"
            /bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars
            echo -n -e "${efi_header}${version}" >| "$fw_path"
            echo "Created $fw_path"
        else
            echo "Failed to flash Broadcom firmware $version"
            /opt/broadcom/bnxtnvm -dev=${interface} device_info
        fi
    else
        echo "Broadcom firmware up-to-date"
    fi
else
    echo "No Broadcom NIC installed"
    /bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars
    if [ -f "${fw_path}" ] ; then
        /usr/bin/chattr -i "${fw_path}"
    fi
    echo -n -e "${efi_header}${version}" >| "$fw_path"
    echo "Created $fw_path"
fi

if [ -f "${fw_path}" ]; then
    echo "rebooting in 60 seconds"
    sleep 60
    /sbin/reboot
fi

Conclusion

Whether you manage just your laptop or desktop computer, or a fleet of servers, it’s important to keep the firmware updated to ensure that the availability, performance and security of your devices are maintained.

If you have a few devices and would benefit from automating the deployment process, we hope that we have inspired you to have a go, making use of basic open source tools such as the iPXE boot loader and some scripting.

Finally, thanks to my colleague Ignat Korchagin, who did a large amount of the original work on the UEFI BIOS firmware flashing infrastructure.

The EPYC journey continues to Rome in Cloudflare’s 11th generation Edge Server

Post Syndicated from Chris Howells original https://blog.cloudflare.com/the-epyc-journey-continues-to-rome-in-cloudflares-11th-generation-edge-server/

When I was interviewing to join Cloudflare in 2014 as a member of the SRE team, we had just introduced our generation 4 server, and I was excited about the prospects. Since then, Cloudflare, the industry and I have all changed dramatically. The best thing about working for a rapidly growing company like Cloudflare is that as the company grows, new roles open up to enable career development. And so, having left the SRE team last year, I joined the recently formed hardware engineering team, a team that simply didn’t exist in 2014.

We aim to introduce a new server platform to our edge network every 12 to 18 months or so, to ensure that we keep up with the latest industry technologies and developments. We announced the generation 9 server in October 2018 and we announced the generation 10 server in February 2020. We consider this length of cycle optimal: short enough to stay nimble and take advantage of the latest technologies, but long enough to offset the time taken by our hardware engineers to test and validate the entire platform. When we are shipping servers to over 200 cities around the world with a variety of regulatory standards, it’s essential to get things right the first time.

We continually work with our silicon vendors to receive product roadmaps and stay on top of the latest technologies. Since mid-2020, the hardware engineering team at Cloudflare has been working on our generation 11 server.

Requests per watt is one of our defining metrics when testing new hardware, and we use it to identify how much more efficient a new hardware generation is than the previous one. We continually strive to reduce our operational costs, and power consumption reduction is one of the most important parts of this. It’s good for the planet, and we can fit more servers into a rack, reducing our physical footprint.
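
As a simple illustration, with made-up numbers: a server handling 10,000 requests per second while drawing 500W delivers 20 requests per second per watt; a successor handling 13,000 requests per second at the same 500W delivers 26, a 30% efficiency gain which compounds across every rack we deploy.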

The design of these generation 11 x86 servers has proceeded in parallel with our efforts to design next-generation edge servers using the Ampere Altra ARM architecture. You can read more about our tests in a blog post by my colleague Sung, and we will document our work on Arm at the edge in a subsequent blog post.

We evaluated Intel’s latest generation of “Ice Lake” Xeon processors. Although Intel’s chips were able to compete with AMD in terms of raw performance, the power consumption was several hundred watts higher per server – that’s enormous. This meant that Intel’s Performance per Watt was unattractive.

We previously described how we had deployed AMD EPYC 7642 processors in our generation 10 server. The 7642 has 48 cores and is based on AMD’s 2nd generation EPYC architecture, code-named Rome. For our generation 11 server, we evaluated 48, 56 and 64 core samples based on AMD’s 3rd generation EPYC architecture, code-named Milan. Comparing the two 48 core processors directly, we were interested to find a performance boost of several percent in the 3rd generation architecture. We therefore had high hopes for the 56 core and 64 core chips.

So, based on the samples we received from our vendors and our subsequent testing, hardware from AMD and Ampere made the shortlist for our generation 11 server. On this occasion, we decided that Intel did not meet our requirements. However, it’s healthy that Intel and AMD compete and innovate in the x86 space and we look forward to seeing how Intel’s next generation shapes up.

Testing and validation process

Before we go on to talk about the hardware, I’d like to say a few words about the process we used to test and validate our generation 11 servers.

As we elected to proceed with AMD chips, we were able to use our generation 10 servers as our Engineering Validation Test platform, with the only changes being the new silicon and updated firmware. We were able to perform these upgrades ourselves in our hardware validation lab.

Cloudflare’s network is built with cheap commodity hardware, and we source the hardware from multiple vendors, known as ODMs (Original Design Manufacturers), who build the servers to our specifications.

When you are working with bleeding edge silicon and experimental firmware, not everything is plain sailing. We worked with one of our ODMs to eliminate an issue which was causing the Linux kernel to panic on boot. Once resolved, we used a variety of synthetic benchmarking tools to verify the performance including cf_benchmark, as well as an internal tool which applies a synthetic load to our entire software stack.

Once we were satisfied, we ordered Design Validation Test (DVT) samples, which were manufactured by our ODMs with the new silicon. We continued to test these and iron out the inevitable issues that arise when you are developing custom hardware. To ensure that performance matched our expectations, we used synthetic benchmarking to test the new silicon. We also began testing the samples in our production environment, gradually introducing customer traffic to them as confidence grew.

Once the issues were resolved, we ordered the Product Validation Test samples, which were again manufactured by our ODMs, taking into account the feedback obtained in the DVT phase. As these are intended to be production grade, we work with the broader Cloudflare teams to deploy these units like a mass production order.

CPU

Previously: AMD EPYC 7642 48-Core Processor
Now: AMD EPYC 7713 64-Core Processor

                   AMD EPYC 7642   AMD EPYC 7643   AMD EPYC 7663   AMD EPYC 7713
Status             Incumbent       Candidate       Candidate       Candidate
Core Count         48              48              56              64
Thread Count       96              96              112             128
Base Clock         2.3GHz          2.3GHz          2.0GHz          2.0GHz
Max Boost Clock    3.3GHz          3.6GHz          3.5GHz          3.675GHz
Total L3 Cache     256MB           256MB           256MB           256MB
Default TDP        225W            225W            240W            225W
Configurable TDP   240W            240W            240W            240W

In the above table, TDP refers to Thermal Design Power, a measure of the heat dissipated. All of the above processors have a configurable TDP – assuming the cooling solution is capable – giving more performance at the expense of increased power consumption. We tested all processors configured at their highest supported TDP.

The 64 core processors have 33% more cores than the 48 core processors, so you might hypothesize a corresponding 33% increase in performance, although our benchmarks saw slightly more modest gains. This is because the 64 core processors have lower base clock frequencies in order to fit within the same 225W power envelope.
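
As a rough sanity check, ignoring IPC improvements and boost behaviour: 64 cores × 2.0GHz gives roughly 128 core-GHz of aggregate base compute, versus 48 cores × 2.3GHz ≈ 110 core-GHz, only about 16% more. Real gains land between that floor and the 33% core-count ceiling, since the processors spend much of their time above base clock.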

In production testing, we found that the 64 core EPYC 7713 gave us around a 29% performance boost over the incumbent, whilst having similar power consumption and thermal properties.

Memory

Previously: 256GB DDR4-2933
Now: 384GB DDR4-3200

Having made a decision about the processor, the next step was to determine the optimal amount of memory for our workload. We ran a series of experiments with our chosen EPYC 7713 processor and 256GB, 384GB and 512GB memory configurations. We started off by running synthetic benchmarks with tools such as STREAM to ensure that none of the configurations performed unexpectedly poorly and to generate a baseline understanding of the performance.
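
For illustration, a STREAM run looks something like the following; the array size and thread count here are illustrative rather than our exact test parameters (stream.c is available from https://www.cs.virginia.edu/stream/):

# Size the array well beyond the 256MB L3 cache so we measure DRAM, not cache
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=400000000 stream.c -o stream
OMP_NUM_THREADS=64 ./stream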

After the synthetic benchmarks, we proceeded to test the various configurations with production workloads to empirically determine the optimal quantity. We use Prometheus and Grafana to gather and display a rich set of metrics from all of our servers so that we can monitor and spot trends, and we re-used the same infrastructure for our performance analysis.

As well as measuring available memory, previous experience has shown us that one of the best ways to ensure that we have enough memory is to observe request latency and disk IO performance. If there is insufficient memory, we expect request latency to increase, along with disk IO volume and latency. The reason is that our core HTTP server uses memory to cache web assets; with insufficient memory, assets are evicted from the cache prematurely and more assets are fetched from disk instead of memory, degrading performance.
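
As a sketch of the kind of signals we watch, using standard node_exporter metrics (the Prometheus hostname here is hypothetical):

# Fraction of memory available; sustained low values suggest cache pressure
q1='node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes'
# NVMe read throughput; a rise can indicate assets being fetched from disk
q2='rate(node_disk_read_bytes_total{device=~"nvme.*"}[5m])'
curl -sG 'http://prometheus.internal.example/api/v1/query' --data-urlencode "query=${q1}"
curl -sG 'http://prometheus.internal.example/api/v1/query' --data-urlencode "query=${q2}"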

Like most things in life, it’s a balancing act. We want enough memory to take advantage of the fact that serving web assets directly from memory is much faster than even the best NVMe disks. We also want to future-proof our platform to enable new features such as those we recently announced during Security Week and Developer Week. However, we don’t want to spend unnecessarily on excess memory that will never be used. We found that the 512GB configuration did not provide a performance boost sufficient to justify the extra cost, and settled on the 384GB configuration.

We also tested the performance impact of switching from DDR4-2933 to DDR4-3200 memory. We found that it provided a performance boost of several percent and the pricing has improved to the point where it is cost beneficial to make the change.

Disk

Previously: 3x Samsung PM983 x 960GB
Now: 2x Samsung PM9A3 x 1.92TB

We validated samples by studying the manufacturers’ data sheets and testing with fio to ensure that the results obtained in our test environment were in line with the published specifications. We also developed an automation framework to help compare different drive models using fio. The framework restores the drives close to factory settings, preconditions the drives, performs the sequential and random tests in our environment, and analyzes the results to evaluate bandwidth and latency. Since our SSD samples arrived at our test center in different months, the automated framework sped up evaluation by reducing the time spent testing and analyzing results.
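
To give a flavour of the runs the framework automates, a comparison might look like this; the device path, block sizes and runtimes are illustrative, not our production settings:

# Assumption: /dev/nvme0n1 is the drive under test, with no filesystem on it
# Precondition with a full sequential write so results reflect steady state
fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=128k --ioengine=libaio --iodepth=32 --direct=1 --size=100%

# Sequential read bandwidth
fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=128k --ioengine=libaio --iodepth=32 --direct=1 --runtime=300 --time_based --output-format=json > seqread.json

# Random 4k reads for IOPS and latency
fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --numjobs=4 --direct=1 --runtime=300 --time_based --group_reporting --output-format=json > randread.json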

For generation 11 we decided to move from the original 3x 1TB configuration to a 2x 2TB configuration, giving us roughly an extra terabyte of storage. This also means we benefit from the higher performance of a 2TB drive and save around 6W of power, since there is one less SSD.

After analyzing the performance, latency and endurance of various 2TB drives, we chose Samsung’s PM9A3 SSDs as our generation 11 drives. The results we obtained, shown below, were consistent with the manufacturer’s claims.

Sequential performance: [charts comparing sequential read and write bandwidth for the incumbent PM983 and the PM9A3]

Random performance: [charts comparing random read and write performance for the incumbent PM983 and the PM9A3]

Compared to our previous generation drives, we saw a 1.5x – 2x improvement in read and write bandwidths. The higher figures for the PM9A3 can be attributed to it being a PCIe 4.0 drive with a more intelligent SSD controller and an upgraded NAND architecture.

Network

Previously: Mellanox ConnectX-4 dual-port 25G
Now: Mellanox ConnectX-4 dual-port 25G

There is no change on the network front; the Mellanox ConnectX-4 is a solid performer which continues to meet our needs. We investigated higher speed Ethernet, but we do not currently see this as beneficial. Cloudflare’s network is built on cheap commodity hardware and the highly distributed nature of Cloudflare’s network means we don’t have discrete DDoS scrubbing centres. All points of presence operate as scrubbing centres. This means that we distribute the load across our entire network and do not need to employ higher speed and more expensive Ethernet devices.

Open source firmware

Transparency, security and integrity are absolutely critical to us at Cloudflare. Last year, we described how we had deployed Platform Secure Boot to create trust that we were running the software that we thought we were.

Now, we are pleased to announce that we are deploying open source firmware to our servers using OpenBMC. With access to the source code, we have been able to configure BMC features such as the fan PID controller, have BIOS POST codes recorded and accessible, and manage networking ports and devices. Prior to OpenBMC, requesting these features from our vendors led to varying results and misunderstandings of the scope and capabilities of the BMC. Working with the BMC source code directly gives us the flexibility to implement features ourselves, or to understand why the BMC is incapable of running our desired software.

Whilst our current BMC is an industry standard, we feel that OpenBMC better suits our needs and gives us advantages, such as allowing us to deal with upstream security issues without a dependency on our vendors. Opportunities on the security front include integrating our desired authentication modules, choosing specific software packages, staying up to date with the latest Linux kernel, and controlling a variety of attack vectors. Because we have kernel lockdown enabled, flashing tooling is difficult to use in our environment. With access to the source code of the flashing tools, we can understand what the tools need access to and assess whether this meets our standard of security.

Summary

The jump between our generation 9 and generation 10 servers was enormous. To summarise, we changed from a dual-socket Intel platform to a single-socket AMD platform, we upgraded the SATA SSDs to NVMe storage devices, and physically the multi-node chassis changed to a 1U form factor.

At the start of the generation 11 project, we weren’t sure whether we would be making such radical changes again. However, after thorough testing of the latest chips and a review of how well the generation 10 server has performed in production for over a year, our generation 11 server built upon the solid foundations of generation 10 and ended up as a refinement rather than a total revamp. Despite this, and bearing in mind that performance varies by time of day and geography, we are pleased that generation 11 is capable of serving approximately 29% more requests than generation 10 without an increase in power consumption.

Thanks to Denny Mathew and Ryan Chow for their work on benchmarking and OpenBMC, respectively.

If you are interested in working with bleeding edge hardware, open source server firmware, solving interesting problems, helping to improve our performance, and are interested in helping us work on our generation 12 server platform (amongst many other things!), we’re hiring.