We often put together experiments that measure hardware performance to improve our understanding and provide insights to our hardware partners. We recently wanted to know more about Hyper-Threading and Turbo Boost. The last time we assessed these two technologies was when we were still deploying the Intel Xeons (Skylake/Purley), but beginning with our Gen X servers we switched over to the AMD EPYC (Zen 2/Rome). This blog is about our latest attempt at quantifying the performance impact of Hyper-Threading and Turbo Boost on our AMD-based servers running our software stack.
Intel briefly introduced Hyper-Threading with NetBurst (Northwood) back in 2002, then reintroduced Hyper-Threading six years later with Nehalem along with Turbo Boost. AMD presented their own implementation of these technologies with Zen in 2017, but AMD’s version of Turbo Boost actually dates back to AMD K10 (Thuban), in 2010, when it used to be called Turbo Core. Since Zen, Hyper-Threading and Turbo Boost are known as simultaneous multithreading (SMT) and Core Performance Boost (CPB), respectively. The underlying implementation of Hyper-Threading and Turbo Boost differs between the two vendors, but the high-level concept remains the same.
Hyper-Threading or simultaneous multithreading creates a second hardware thread within a processor’s core, also known as a logical core, by duplicating various parts of the core to support the context of a second application thread. The two hardware threads execute simultaneously within the core, across their dedicated and remaining shared resources. If neither hardware threads contend over a particular shared resource, then the throughput can be drastically increased.
Turbo Boost or Core Performance Boost opportunistically allows the processor to operate beyond its rated base frequency as long as the processor operates within guidelines set by Intel or AMD. Generally speaking, the higher the frequency, the faster the processor finishes a task.
Similar to Intel’s Hyper-Threading, AMD implemented 2-way simultaneous multithreading. The AMD EPYC 7642 has 48 cores, and with simultaneous multithreading enabled it can simultaneously execute 96 hardware threads. Core Performance Boost allows the AMD EPYC 7642 to operate anywhere between 2.3 to 3.3 GHz, depending on the workload and limitations imposed on the processor. With Core Performance Boost disabled, the processor will operate at 2.3 GHz, the rated base frequency on the AMD EPYC 7642. We took our usual simulated traffic pattern of 10 KiB cached assets over HTTPS, provided by our performance team, to generate a sustained workload that saturated the processor to 100% CPU utilization.
After establishing a baseline with simultaneous multithreading and Core Performance Boost disabled, we started enabling one feature at a time. When we enabled Core Performance Boost, the processor operated near its peak turbo frequency, hovering between 3.2 to 3.3 GHz which is more than 39% higher than the base frequency. Higher operating frequency directly translated into 40% additional requests per second. We then disabled Core Performance Boost and enabled simultaneous multithreading. Similar to Core Performance Boost, simultaneous multithreading alone improved requests per second by 43%. Lastly, by enabling both features, we observed an 86% improvement in requests per second.
Latencies were generally lowered by either or both Core Performance Boost and simultaneous multithreading. While Core Performance Boost consistently maintained a lower latency than the baseline, simultaneous multithreading gradually took longer to process a request as it reached tail latencies. Though not depicted in the figure below, when we examined beyond p9999 or 99.99th percentile, simultaneous multithreading, even with the help of Core Performance Boost, exponentially increased in latency by more than 150% over the baseline, presumably due to the two hardware threads contending over a shared resource within the core.
Moving into production, since our traffic fluctuates throughout the day, we took four identical Gen X servers and measured in parallel during peak hours. The only changes we made to the servers were enabling and disabling simultaneous multithreading and Core Performance Boost to create a comprehensive test matrix. We conducted the experiment in two different regions to identify any anomalies and mismatching trends. All trends were alike.
Before diving into the results, we should preface that the baseline server operated at a higher CPU utilization than others. Every generation, our servers deliver a noticeable improvement in performance. So our load balancer, named Unimog, sends a different number of connections to the target server based on its generation to balance out the CPU utilization. When we disabled simultaneous multithreading and Core Performance Boost, the baseline server’s performance degraded to the point where Unimog encountered a “guard rail” or the lower limit on the requests sent to the server, and so its CPU utilization rose instead. Given that the baseline server operated at a higher CPU utilization, the baseline server processed more requests per second to meet the minimum performance threshold.
Due to the skewed baseline, when core performance boost was enabled, we only observed 7% additional requests per second. Next, simultaneous multithreading alone improved requests per second by 41%. Lastly, with both features enabled, we saw an 86% improvement in requests per second.
Though we lack concrete baseline data, we can normalize requests per second by CPU utilization to approximate the improvement for each scenario. Once normalized, the estimated improvement in requests per second from core performance boost and simultaneous multithreading were 36% and 80%, respectively. With both features enabled, requests per second improved by 136%.
Latency was not as interesting since the baseline server operated at a higher CPU utilization, and in turn, it produced a higher tail latency than we would have otherwise expected. All other servers maintained a lower latency due to their lower CPU utilization in conjunction with Core Performance Boost, simultaneous multithreading, or both.
At this point, our experiment did not go as we had planned. Our baseline is skewed, and we only got half useful answers. However, we find experimenting to be important because we usually end up finding other helpful insights as well.
Let’s add power data. Since our baseline server was operating at a higher CPU utilization, we knew it was serving more requests and therefore, consumed more power than it needed to. Enabling Core Performance Boost allowed the processor to run up to its peak turbo frequency, increasing power consumption by 35% over the skewed baseline. More interestingly, enabling simultaneous multithreading increased power consumption by only 7%. Combining Core Performance Boost with simultaneous multithreading resulted in 58% increase in power consumption.
AMD’s implementation of simultaneous multithreading appears to be power efficient as it achieves 41% additional requests per second while consuming only 7% more power compared to the skewed baseline. For completeness, using the data we have, we bridged performance and power together to obtain performance per watt to summarize power efficiency. We divided the non-normalized requests per second by power consumption to produce the requests per watt figure below. Our Gen X servers attained the best performance per watt by enabling just simultaneous multithreading.
In our assessment of AMD’s implementation of Hyper-Threading and Turbo Boost, the original experiment we designed to measure requests per second and latency did not pan out as expected. As soon as we entered production, our baseline measurement was skewed due to the imbalance in CPU utilization and only partially reproduced our lab results.
We added power to the experiment and found other meaningful insights. By analyzing the performance and power characteristics of simultaneous multithreading and Core Performance Boost, we concluded that simultaneous multithreading could be a power-efficient mechanism to attain additional requests per second. Drawbacks of simultaneous multithreading include long tail latency that is currently curtailed by enabling Core Performance Boost. While the higher frequency enabled by Core Performance Boost provides latency reduction and more requests per second, we are more mindful that the increase in power consumption is quite significant.
Do you want to help shape the Cloudflare network? This blog was a glimpse of the work we do at Cloudflare. Come join us and help complete the feedback loop for our developers and hardware partners.
When I was interviewing to join Cloudflare’s in 2014 as a member of the SRE team, we had just introduced our generation 4 server, and I was excited about the prospects. Since then, Cloudflare, the industry and I have all changed dramatically. The best thing about working for a rapidly growing company like Cloudflare is that as the company grows, new roles open up to enable career development. And so, having left the SRE team last year, I joined the recently formed hardware engineering team, a team that simply didn’t exist in 2014.
We aim to introduce a new server platform to our edge network every 12 to 18 months or so, to ensure that we keep up with the latest industry technologies and developments. We announced the generation 9 server in October 2018 and we announced the generation 10 server in February 2020. We consider this length of cycle optimal: short enough to stay nimble and take advantage of the latest technologies, but long enough to offset the time taken by our hardware engineers to test and validate the entire platform. When we are shipping servers to over 200 cities around the world with a variety of regulatory standards, it’s essential to get things right the first time.
We continually work with our silicon vendors to receive product roadmaps and stay on top of the latest technologies. Since mid-2020, the hardware engineering team at Cloudflare has been working on our generation 11 server.
Requests per Watt is one of our defining characteristics when testing new hardware and we use it to identify how much more efficient a new hardware generation is than the previous generation. We continually strive to reduce our operational costs and power consumption reduction is one of the most important parts of this. It’s good for the planet and we can fit more servers into a rack, reducing our physical footprint.
The design of these Generation 11 x86 servers has been in parallel with our efforts to design next-generation edge servers using the Ampere Altra ARM architecture. You can read more about our tests in a blog post by my colleague Sung and will document our work on Arm at the edge in a subsequent blog post.
We evaluated Intel’s latest generation of “Ice Lake” Xeon processors. Although Intel’s chips were able to compete with AMD in terms of raw performance, the power consumption was several hundred watts higher per server – that’s enormous. This meant that Intel’s Performance per Watt was unattractive.
We previously described how we had deployed AMD EPYC 7642’s processors in our generation 10 server. This has 48 cores and is based on AMD’s 2nd generation EPYC architecture, code named Rome. For our generation 11 server, we evaluated 48, 56 and 64 core samples based on AMD’s 3rd generation EPYC architecture, code named Milan. We were interested to find that comparing the two 48 core processors directly, we saw a performance boost of several percent in the 3rd generation EPYC architecture. We therefore had high hopes for the 56 core and 64 core chips.
So, based on the samples we received from our vendors and our subsequent testing, hardware from AMD and Ampere made the shortlist for our generation 11 server. On this occasion, we decided that Intel did not meet our requirements. However, it’s healthy that Intel and AMD compete and innovate in the x86 space and we look forward to seeing how Intel’s next generation shapes up.
Testing and validation process
Before we go on to talk about the hardware, I’d like to say a few words about the testing process we went through to test out generation 11 servers.
As we elected to proceed with AMD chips, we were able to use our generation 10 servers as our Engineering Validation Test platform, with the only changes being the new silicon and updated firmware. We were able to perform these upgrades ourselves in our hardware validation lab.
Cloudflare’s network is built with cheap commodity hardware and we source the hardware from multiple vendors, known as ODMs (Original Design Manufacturer) who build the servers to our specifications.
When you are working with bleeding edge silicon and experimental firmware, not everything is plain sailing. We worked with one of our ODMs to eliminate an issue which was causing the Linux kernel to panic on boot. Once resolved, we used a variety of synthetic benchmarking tools to verify the performance including cf_benchmark, as well as an internal tool which applies a synthetic load to our entire software stack.
Once we were satisfied, we ordered Design Validation Test samples, which were manufactured by our ODMs with the new silicon. We continued to test these and iron out the inevitable issues that arise when you are developing custom hardware. To ensure that performance matched our expectations, we used synthetic benchmarking to test the new silicon. We also began testing it in our production environment by gradually introducing customer traffic to them as confidence grew.
Once the issues were resolved, we ordered the Product Validation Test samples, which were again manufactured by our ODMs, taking into account the feedback obtained in the DVT phase. As these are intended to be production grade, we work with the broader Cloudflare teams to deploy these units like a mass production order.
In the above chart, TDP refers to Thermal Design Power, a measure of the heat dissipated. All of the above processors have a configurable TDP – assuming the cooling solution is capable – giving more performance at the expense of increased power consumption. We tested all processors configured at their highest supported TDP.
The 64 core processors have 33% more cores than the 48 core processors so you might hypothesize that we would see a corresponding 33% increase in performance, although our benchmarks saw slightly more modest gains. This can be explained because the 64 core processors have lower base clock frequencies to fit within the same 225W power envelope.
In production testing, we found that the 64 core EPYC 7713 gave us around a 29% performance boost over the incumbent, whilst having similar power consumption and thermal properties.
Previously: 256GB DDR4-2933 Now: 384GB DDR4-3200
Having made a decision about the processor, the next step was to determine the optimal amount of memory for our workload. We ran a series of experiments with our chosen EPYC 7713 processor and 256GB, 384GB and 512MB memory configurations. We started off by running synthetic benchmarks with tools such as STREAM to ensure that none of the configurations performed unexpectedly poorly and to generate a baseline understanding of the performance.
After the synthetic benchmarks, we proceeded to test the various configurations with production workloads to empirically determine the optimal quantity. We use Prometheus and Grafana to gather and display a rich set of metrics from all of our servers so that we can monitor and spot trends, and we re-used the same infrastructure for our performance analysis.
As well as measuring available memory, previous experience has shown us that one of the best ways to ensure that we have enough memory is to observe request latency and disk IO performance. If there is insufficient memory, we expect to see request latency and disk IO volume and latency to increase. The reason for this is that our core HTTP server uses memory to cache web assets and if there is insufficient memory the assets will be ejected from memory prematurely and more assets will be fetched from disk instead of memory, degrading performance.
Like most things in life, it’s a balancing act. We want enough memory to take advantage of the fact that serving web assets directly from memory is much faster than even the best NVMe disks. We also want to future proof our platform to enable the new features such as the ones that we recently announced in security week and developer week. However, we don’t want to spend unnecessarily on excess memory that will never be used. We found that the 512GB configuration did not provide a performance boost to justify the extra cost and settled on the 384GB configuration.
We also tested the performance impact of switching from DDR4-2933 to DDR4-3200 memory. We found that it provided a performance boost of several percent and the pricing has improved to the point where it is cost beneficial to make the change.
Previously: 3x Samsung PM983 x 960GB Now: 2x Samsung PM9A3 x 1.92TB
We validated samples by studying the manufacturer’s data sheets and testing using fio to ensure that the results being obtained in our test environment were in line with the published specifications. We also developed an automation framework to help compare different drive models using fio. The framework helps us to restore the drives close to factory settings, precondition the drives, perform the sequential and random tests in our environment, and analyze the data results to evaluate the bandwidth and latency results. Since our SSD samples were arriving in our test center at different months, having an automated framework helped in dealing with speedy evaluations by reducing our time spent testing and doing analysis.
For Gen 11 we decided to move to a 2x 2TB configuration from the original 3x 1TB configuration giving us an extra 1 TB of storage. This also meant we could use the higher performance of a 2TB drive and save around 6W of power since there is one less SSD.
After analyzing the performances of various 2TB drives, their latencies and endurances, we chose Samsung’s PM9A3 SSDs as our Gen11 drives. The results we obtained below were consistent with the manufacturer’s claims.
Compared to our previous generation drives, we could see a 1.5x – 2x improvement in read and write bandwidths. The higher values for the PM9A3 can be attributed to the fact that these are PCIe 4.0 drives, have more intelligent SSD controllers and an upgraded NAND architecture.
There is no change on the network front; the Mellanox ConnectX-4 is a solid performer which continues to meet our needs. We investigated higher speed Ethernet, but we do not currently see this as beneficial. Cloudflare’s network is built on cheap commodity hardware and the highly distributed nature of Cloudflare’s network means we don’t have discrete DDoS scrubbing centres. All points of presence operate as scrubbing centres. This means that we distribute the load across our entire network and do not need to employ higher speed and more expensive Ethernet devices.
Open source firmware
Transparency, security and integrity is absolutely critical to us at Cloudflare. Last year, we described how we had deployed Platform Secure Boot to create trust that we were running the software that we thought we were.
Now, we are pleased to announce that we are deploying open source firmware to our servers using OpenBMC. With access to the source code, we have been able to configure BMC features such as the fan PID controller, having BIOS POST codes recorded and accessible, and managing networking ports and devices. Prior to OpenBMC, requesting these features from our vendors led to varying results and misunderstandings of the scope and capabilities of the BMC. After working with the BMC source code much more directly, we have the flexibility to to work on features ourselves to our liking, or understand why the BMC is incapable of running our desired software.
Whilst our current BMC is an industry standard, we feel that OpenBMC better suits our needs and gives us advantages such as allowing us to deal with upstream security issues without a dependency on our vendors. Some opportunities with security include integration of desired authentication modules, usage of specific software packages, staying up to date with the latest Linux kernel, and controlling a variety of attack vectors. Because we have a kernel lockdown implemented, flashing tooling is difficult to use in our environment. With access to source code of the flashing tools, we have an understanding of what the tools need access to, and assess whether or not this meets our standard of security.
The jump between our generation 9 and generation 10 servers was enormous. To summarise, we changed from a dual-socket Intel platform to a single socket AMD platform. We upgraded the SATA SSDs to NVMe storage devices, and physically the multi-node chassis changed to a 1U form factor.
At the start of the generation 11 project we weren’t sure if we would be making such radical changes again. However, after a thorough testing of the latest chips and a review of how well the generation 10 server has performed in production for over a year, our generation 11 server built upon the solid foundations of generation 10 and ended up as a refinement rather than total revamp. Despite this, and bearing in mind that performance varies by time of day and geography, we are pleased that generation 11 is capable of serving approximately 29% more requests than generation 10 without an increase in power consumption.
Thanks to Denny Mathew and Ryan Chow’s work on benchmarking and OpenBMC, respectively.
If you are interested in working with bleeding edge hardware, open source server firmware, solving interesting problems, helping to improve our performance, and are interested in helping us work on our generation 12 server platform (amongst many other things!), we’re hiring.
The collective thoughts of the interwebz
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.