Tag Archives: AMD

The newest Top500 list is out, and the former #1 supercomputer, Frontier, has been dethroned. In this list, the Intel-powered Aurora supercomputer passed 1EF, but El Capitan rose to take the #1 spot. This is a big win for HPE and AMD, which delivered a system with over 2 exaflops of FP64 performance. El […]

The post El Capitan Towers Above the Top500 in a Big HPE and AMD Win appeared first on ServeTheHome.

AMD Ryzen 7 9800X3D Launch Zen 5 with 3D V-Cache is Great

2024-11-07 John Lee

Post Syndicated from John Lee original https://www.servethehome.com/amd-ryzen-7-9800x3d-launch-zen-5-with-3d-v-cache-asrock-gskill-micron-crucial-great/

The AMD Ryzen 7 9800X3D is AMD’s newest 3D V-Cache part, pairing a massive 96MB of L3 cache with an 8-core processor for extra performance

The post AMD Ryzen 7 9800X3D Launch Zen 5 with 3D V-Cache is Great appeared first on ServeTheHome.

AMD Pensando Salina 400 DPU Spotted

2024-11-05 Rohit Kumar

Post Syndicated from Rohit Kumar original https://www.servethehome.com/amd-pensando-salina-400-dpu-arm-neoverse/

We spotted the AMD Pensando Salina 400 DPU, a 400GbE generation DPU with 16x Arm Neoverse N1 cores, up to 128GB of DDR5, and a P4 pipeline

The post AMD Pensando Salina 400 DPU Spotted appeared first on ServeTheHome.

Beelink SER9 AMD Ryzen AI 9 HX 370 Mini PC Review

2024-10-22 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/beelink-ser9-amd-ryzen-ai-9-hx-370-mini-pc-review/

In our Beelink SER9 review, we see how this AMD Ryzen AI 9 HX 370 mini PC fares as the fastest AMD mini PC out there with caveats of course

The post Beelink SER9 AMD Ryzen AI 9 HX 370 Mini PC Review appeared first on ServeTheHome.

Meta Brings AMD EPYC Turin to Yosemite v4

2024-10-22 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/meta-brings-amd-epyc-turin-to-yosemite-v4/

It looks like the Meta Yosemite v4 platform will have an AMD EPYC 9005 Turin module that is CXL enabled for 2025 deployment

The post Meta Brings AMD EPYC Turin to Yosemite v4 appeared first on ServeTheHome.

Hell Freezes Over as AMD and Intel Come Together for x86

2024-10-15 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/hell-freezes-over-amd-and-intel-come-together-ocp/

Hell freezes over as AMD and Intel come together at OCP Summit 2024 to jointly drive the future x86 ISA through an industry advisory board

The post Hell Freezes Over as AMD and Intel Come Together for x86 appeared first on ServeTheHome.

Meta Announces AMD Instinct MI300X for AI Inference and NVIDIA GB200 Catalina

2024-10-15 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/meta-announces-amd-mi300x-for-ai-inference-marvell-fbnic-cisco-arista-broadcom/

Meta outlined its AI platforms at OCP Summit 2024, including GPUs from AMD and NVIDIA and networking from Marvell, Broadcom, Cisco and Arista

The post Meta Announces AMD Instinct MI300X for AI Inference and NVIDIA GB200 Catalina appeared first on ServeTheHome.

Analysis of the EPYC 145% performance gain in Cloudflare Gen 12 servers

2024-10-15 JQ Lau

Post Syndicated from JQ Lau original https://blog.cloudflare.com/analysis-of-the-epyc-145-performance-gain-in-cloudflare-gen-12-servers

Cloudflare’s network spans more than 330 cities in over 120 countries, serving over 60 million HTTP requests per second and 39 million DNS queries per second on average. These numbers will continue to grow, and at an accelerating pace, as will Cloudflare’s infrastructure to support them. While we can continue to scale out by deploying more servers, it is also paramount for us to develop and deploy more performant and more efficient servers.

At the heart of each server is the processor (central processing unit, or CPU). Even though many aspects of a server rack can be redesigned to improve the cost to serve a request, CPU remains the biggest lever, as it is typically the primary compute resource in a server, and the primary enabler of new technologies.

Cloudflare’s 12th Generation server with AMD EPYC 9684X (codenamed Genoa-X) is 145% more performant and 63% more efficient than our Gen 11 server. These are big numbers, but where do the performance gains come from? Cloudflare’s hardware system engineering team did a sensitivity analysis on three variants of 4th generation AMD EPYC processors to understand the contributing factors.

For the 4th generation AMD EPYC Processors, AMD offers three architectural variants:

mainstream classic Zen 4 cores, codenamed Genoa
efficiency optimized dense Zen 4c cores, codenamed Bergamo
cache optimized Zen 4 cores with 3D V-cache, codenamed Genoa-X

Figure 1 (from left to right): AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), AMD EPYC 9684X (Genoa-X)

Key features common across the 4th Generation AMD EPYC processors:

Up to 12x Core Complex Dies (CCDs)
Each core has a private 1MB L2 cache
The CCDs connect to memory, I/O, and each other through an I/O die
Configurable Thermal Design Power (cTDP) up to 400W
Support up to 12 channels of DDR5-4800 1DPC
Support up to 128 lanes PCIe Gen 5

Classic Zen 4 Cores (Genoa):

Each Core Complex (CCX) has 8x Zen 4 Cores (16x Threads)
Each CCX has a shared 32 MB L3 cache (4 MB/core)
Each CCD has 1x CCX

Dense Zen 4c Cores (Bergamo):

Each CCX has 8x Zen 4c Cores (16x Threads)
Each CCX has a shared 16 MB L3 cache (2 MB/core)
Each CCD has 2x CCX

Classic Zen 4 Cores with 3D V-cache (Genoa-X):

Each CCX has 8x Zen 4 Cores (16x Threads)
Each CCX has a shared 96MB L3 cache (12 MB/core)
Each CCD has 1x CCX

For more information on 4th generation AMD EPYC Processors architecture, see: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf

The following table is a summary of the specification of the AMD EPYC 7713 CPU in our Gen 11 server against the three CPU candidates, one from each variant of the 4th generation AMD EPYC Processors architecture:

CPU Model	AMD EPYC 7713	AMD EPYC 9654	AMD EPYC 9754	AMD EPYC 9684X
Series	Milan	Genoa	Bergamo	Genoa-X
# of CPU Cores	64	96	128	96
# of Threads	128	192	256	192
Base Clock	2.0 GHz	2.4 GHz	2.25 GHz	2.4 GHz
All Core Boost Clock	~2.7 GHz*	3.55 GHz	3.1 GHz	3.42 GHz
Total L3 Cache	256 MB	384 MB	256 MB	1152 MB
L3 cache per core	4 MB / core	4 MB / core	2 MB / core	12 MB / core
Maximum configurable TDP	240W	400W	400W	400W

* AMD EPYC 7713 all core boost clock is based on Cloudflare production data, not the official specification from AMD
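The L3 totals in the table can be reproduced from the per-CCX figures listed earlier. The sketch below is only an arithmetic check; it assumes every CCX is fully populated with 8 active cores, and the dictionary layout and function name are made up for illustration.

```python
# Per-CCX figures as listed in the text; core counts from the table.
# Assumption: every CCX has all 8 cores enabled.
CPUS = {
    "Genoa 9654":    {"cores": 96,  "cores_per_ccx": 8, "l3_per_ccx_mb": 32},
    "Bergamo 9754":  {"cores": 128, "cores_per_ccx": 8, "l3_per_ccx_mb": 16},
    "Genoa-X 9684X": {"cores": 96,  "cores_per_ccx": 8, "l3_per_ccx_mb": 96},
}


def l3_summary(spec):
    """Return (total L3 in MB, L3 per core in MB) for one CPU spec."""
    ccx_count = spec["cores"] // spec["cores_per_ccx"]
    total_mb = ccx_count * spec["l3_per_ccx_mb"]
    return total_mb, total_mb / spec["cores"]


for name, spec in CPUS.items():
    total, per_core = l3_summary(spec)
    print(f"{name}: {total} MB total L3, {per_core:g} MB per core")
```

The results match the "Total L3 Cache" and "L3 cache per core" rows for the three 4th generation candidates.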

cf_benchmark

Readers may remember that Cloudflare introduced cf_benchmark when we evaluated Qualcomm’s ARM chips, using it as our first pass benchmark to shortlist AMD’s Rome CPU for our Gen 10 servers and to evaluate our chosen ARM CPU Ampere Altra Max against AWS Graviton 2. Likewise, we ran cf_benchmark against the three candidate CPUs for our 12th Gen servers: AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), and AMD EPYC 9684X (Genoa-X). The majority of cf_benchmark workloads are compute bound, and given more cores or higher CPU frequency, they score better. The graph and the table below show the benchmark performance comparison of the three CPU candidates with Genoa 9654 as the baseline, where > 1.00x indicates better performance.

	Genoa 9654 (baseline)	Bergamo 9754	Genoa-X 9684X
openssl_pki	1.00x	1.16x	1.01x
openssl_aead	1.00x	1.20x	1.01x
luajit	1.00x	0.86x	1.00x
brotli	1.00x	1.11x	0.98x
gzip	1.00x	0.87x	1.01x
go	1.00x	1.09x	1.00x

Bergamo 9754 with 128 cores scores better in openssl_pki, openssl_aead, brotli, and go benchmark suites, and performs less favorably in luajit and gzip benchmark suites. Genoa-X 9684X (with significantly more L3 cache) doesn’t offer a significant boost in performance for these compute-bound benchmarks.
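One way to condense the table into a single number per CPU is a geometric mean of the per-suite ratios. This is not something the original analysis reports; it is a common summarizing convention used here as an illustration, with the scores copied from the table above.

```python
from statistics import geometric_mean

# Ratios vs. the Genoa 9654 baseline, in table order:
# openssl_pki, openssl_aead, luajit, brotli, gzip, go
SCORES = {
    "Bergamo 9754":  [1.16, 1.20, 0.86, 1.11, 0.87, 1.09],
    "Genoa-X 9684X": [1.01, 1.01, 1.00, 0.98, 1.01, 1.00],
}


def summarize(scores):
    """Geometric mean of the benchmark ratios for each CPU."""
    return {cpu: geometric_mean(vals) for cpu, vals in scores.items()}


for cpu, gm in summarize(SCORES).items():
    print(f"{cpu}: geometric mean {gm:.3f}x vs Genoa 9654")
```

Under this summary Bergamo comes out a few percent ahead of the baseline and Genoa-X is essentially level with it, which is consistent with the observation that the extra L3 cache does little for these compute-bound suites.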

These benchmarks are representative of some of the common workloads Cloudflare runs, and are useful in identifying software scaling issues, system configuration bottlenecks, and the impact of CPU design choices on workload-specific performance. However, the benchmark suite is not an exhaustive list of all workloads Cloudflare runs in production, and in reality, the workloads included in the benchmark suites are almost certainly not the exclusive workload running on the CPU. In short, though benchmark results can be informative, they do not represent a good indication of production performance when a mix of these workloads run on the same processor.

Performance simulation

To get an early indication of production performance, Cloudflare has an internal performance simulation tool that exercises our software stack to fetch a fixed asset repeatedly. The simulation tool can be configured to fetch a specified fixed-size asset and configured to include or exclude services like WAF or Workers in the request path. Below, we show the simulated performance between the three CPUs for an asset size of 10 KB, where >1.00x indicates better performance.

	Milan 7713	Genoa 9654	Bergamo 9754	Genoa-X 9684X
Lab simulation performance multiplier	1.00x	2.20x	1.95x	2.75x

Based on these results, Bergamo 9754, which has the highest core count, but smallest L3 cache per core, is least performant among the three candidates, followed by Genoa 9654. The Genoa-X 9684X with the largest L3 cache per core is the most performant. This data suggests that our software stack is very sensitive to L3 cache size, in addition to core count and CPU frequency. This is interesting and worth a deep dive into a sensitivity analysis of our workload against a few (high level) CPU design points, especially core scaling, frequency scaling, and L2/L3 cache sizes scaling.

Sensitivity analysis

Core sensitivity

Number of cores is the headline specification that practically everyone talks about, and one of the easiest improvements CPU vendors can make to increase performance per socket. The AMD Genoa 9654 has 96 cores, 50% more than the 64 cores available on the AMD Milan 7713 CPUs that we used in our Gen 11 servers. Is more always better? Does Cloudflare’s primary workload scale with core count and effectively utilize all available cores?

The figure and table below show the result of a core scaling experiment performed on an AMD Genoa 9654 configured with 96, 80, 64, and 48 cores, done by disabling two more CCDs (8 cores/CCD) at each step. The result is great: Cloudflare’s simulated primary workload scales linearly with core count on AMD Genoa CPUs.

Core count	Core increase	Performance increase
48	1.00x	1.00x
64	1.33x	1.39x
80	1.67x	1.71x
96	2.00x	2.05x
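A quick way to read the table is to divide the performance increase by the core increase at each step; a value near 1.0 means linear scaling. The sketch below does that with the measured values above. The function name and the choice of 48 cores as the reference point are illustrative.

```python
# Measured performance relative to the 48-core configuration (from the table).
MEASURED = {48: 1.00, 64: 1.39, 80: 1.71, 96: 2.05}


def scaling_efficiency(measured, baseline_cores=48):
    """Performance gain divided by core-count gain for each configuration."""
    base_perf = measured[baseline_cores]
    return {
        cores: (perf / base_perf) / (cores / baseline_cores)
        for cores, perf in measured.items()
    }


for cores, eff in scaling_efficiency(MEASURED).items():
    print(f"{cores} cores: scaling efficiency {eff:.3f}")
```

Every configuration lands at or slightly above 1.0, so within this range the workload shows no sign of a core-count bottleneck.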

TDP sensitivity

Thermal Design Power (TDP) is the maximum amount of heat generated by a CPU that the cooling system is designed to dissipate, but it more commonly refers to the power consumption of the processor under maximum theoretical load. The AMD Genoa 9654’s default TDP is 360W, but it can be configured up to 400W. Is more always better? Does Cloudflare continue to see meaningful performance improvement up to 400W, or does performance stagnate at some point?

The chart below shows the result of sweeping the TDP of the AMD Genoa 9654 (in power determinism mode) from 240W to 400W. (Note: x-axis step size is not linear).

Cloudflare’s simulated primary workload continues to see incremental performance improvements up to the maximum configurable 400W, albeit at a less favorable perf/watt ratio.

Looking at TDP sensitivity data is a quick and easy way to identify if performance stagnates at some power point, but what does power sensitivity actually measure? There are several factors contributing to CPU power consumption, but let’s focus on one of the primary factors: dynamic power consumption. Dynamic power consumption is approximately CV²f, where C is the switched load capacitance, V is the regulated voltage, and f is the frequency. In modern processors like the AMD Genoa 9654, the CPU dynamically scales its voltage along with frequency, so theoretically, CPU dynamic power is loosely proportional to f³. In other words, measuring TDP sensitivity is measuring the frequency sensitivity of a workload. Does the data agree? Yes!

cTDP	All core boost frequency (GHz)	Perf (rps) / baseline
240	2.47	0.78x
280	2.75	0.87x
320	2.93	0.93x
340	3.13	0.97x
360	3.3	1.00x
380	3.4	1.03x
390	3.465	1.04x
400	3.55	1.05x
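To put a number on how performance tracks frequency in this sweep, the sketch below takes the first and last rows at or above 3 GHz and computes the performance change per 0.1 GHz. It is a two-point estimate over the table values, not a fitted model, and the helper name is illustrative.

```python
# (cTDP in W, all core boost frequency in GHz, perf relative to the 360 W baseline)
SWEEP = [
    (240, 2.47, 0.78),
    (280, 2.75, 0.87),
    (320, 2.93, 0.93),
    (340, 3.13, 0.97),
    (360, 3.30, 1.00),
    (380, 3.40, 1.03),
    (390, 3.465, 1.04),
    (400, 3.55, 1.05),
]


def perf_gain_per_100mhz(rows, min_freq_ghz):
    """Two-point slope of relative performance per 0.1 GHz above min_freq_ghz."""
    points = [(freq, perf) for _, freq, perf in rows if freq >= min_freq_ghz]
    (f_lo, p_lo), (f_hi, p_hi) = points[0], points[-1]
    return (p_hi - p_lo) / (f_hi - f_lo) * 0.1


gain = perf_gain_per_100mhz(SWEEP, min_freq_ghz=3.0)
print(f"about {gain:.1%} of baseline performance per 0.1 GHz above 3 GHz")
```

The estimate comes out close to 2% per 0.1 GHz, so the TDP sweep behaves like a frequency sweep in this range.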

Frequency sensitivity

Instead of relying on an indirect measure through the TDP, let’s measure frequency sensitivity directly by sweeping the maximum boost frequency.

Above 3 GHz, the data shows that Cloudflare’s primary workload sees roughly 2% incremental improvement for every 0.1 GHz increase in all core average frequency. We hit the 400W power cap at 3.545 GHz. This is notably higher than the typical all core boost frequency of Cloudflare Gen 11 servers with AMD Milan 7713, which is about 2.7 GHz in production and 2.4 GHz in our performance simulation. That is amazing!

L3 cache size sensitivity

What about L3 cache size sensitivity? L3 cache size is one of the primary design choices and major differences between the trio of Genoa, Bergamo, and Genoa-X. Genoa 9654 has 4 MB L3/core, Bergamo 9754 has 2 MB L3/core, and Genoa-X has 12 MB L3/core. L3 cache is the last and largest “memory” bank on-chip before having to access memory on DIMMs outside the chip that would take significantly more CPU cycles.

We ran an experiment on the Genoa 9654 to check how performance scales with L3 cache size. L3 cache size per core is reduced through MSR writes (this could also be done using Intel RDT), and L3 cache per core is increased by disabling physical cores in a CCD, which reduces the number of cores sharing the fixed-size 32 MB L3 cache per CCD and so effectively grows the L3 cache per core. Below is the result of the experiment, where >1.00x indicates better performance:

L3 cache size increase vs baseline 4MB per core	0.25x	0.5x	0.75x	1x	1.14x	1.33x	1.60x	2.00x
rps/core / baseline	0.67x	0.78x	0.89x	1.00x	1.08x	1.15x	1.25x	1.31x
L3 cache miss rate per CCD	56.04%	39.15%	30.37%	23.55%	22.39%	19.73%	16.94%	14.28%
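The scaled-up columns in the table (1.14x, 1.33x, 1.60x, 2.00x) are consistent with leaving 7, 6, 5, and 4 cores active in each 8-core CCD. The mapping of columns to active-core counts is inferred from the arithmetic, not stated in the text, and the names below are illustrative.

```python
L3_PER_CCD_MB = 32   # fixed shared L3 per CCD on Genoa 9654
CORES_PER_CCD = 8    # physical cores per CCD


def l3_per_core_mb(active_cores):
    """L3 available per core when only active_cores share the CCD's cache."""
    if not 1 <= active_cores <= CORES_PER_CCD:
        raise ValueError("active_cores must be between 1 and 8")
    return L3_PER_CCD_MB / active_cores


def ratio_vs_baseline(active_cores):
    """L3 per core relative to the all-cores-enabled baseline (4 MB/core)."""
    return l3_per_core_mb(active_cores) / l3_per_core_mb(CORES_PER_CCD)


for n in (8, 7, 6, 5, 4):
    print(f"{n} active cores/CCD: {l3_per_core_mb(n):.2f} MB/core, "
          f"{ratio_vs_baseline(n):.2f}x baseline")
```

Because this method also removes cores that would otherwise compete for the cache, it changes two things at once: cache capacity per core and the number of cores sharing it.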

Even though the expectation was that the impact of a different L3 cache size gets diminished by the faster DDR5 and larger memory bandwidth, Cloudflare’s simulated primary workload is quite sensitive to L3 cache size. The L3 cache miss rate dropped from 56% with only 1 MB L3 per core, to 14.28% with 8 MB L3/core. Changing the L3 cache size by 25% affects the performance by approximately 11%, and we continue to see performance increase to 2x L3 cache size, though the performance increase starts to diminish when we get to 2x L3 cache per core.

Do we see the same behavior when comparing Genoa 9654, Bergamo 9754 and Genoa-X 9684X? We ran an experiment comparing the impact of L3 cache size, controlling for core count and all core boost frequency, and we also saw significant deltas. Halving the L3 cache size from 4 MB/core to 2 MB/core reduces performance by 24%, roughly matching the experiment above. However, increasing the cache 3x from 4 MB/core to 12 MB/core only increases performance by 25%, less than the previous experiment would suggest. This is likely because part of the gain in the experiment above came from reduced cache contention: that setup grew L3 per core by disabling cores, leaving fewer cores to share each cache. Nevertheless, these are significant deltas!

L3/core	2MB/core	4MB/core	12MB/core
Perf (rps) / baseline	0.76x	1x	1.25x

Putting it all together

The table below summarizes how each factor from the sensitivity analysis above contributes to the overall performance gain. An additional 6% to 14% of performance improvement is not accounted for by these factors; it comes from other factors like larger L2 cache, higher memory bandwidth, and miscellaneous CPU architecture changes that improve IPC.

	Milan 7713	Genoa 9654	Bergamo 9754	Genoa-X 9684X
Lab simulation performance multiplier	1x	2.2x	1.95x	2.75x
Performance multiplier due to Core scaling	1x	1.5x	2x	1.5x
Performance multiplier due to Frequency scaling (Note: Milan 7713 all core frequency is ~2.4GHz when running simulated workload at 100% CPU utilization)	1x	1.32x	1.21x	1.29x
Performance multiplier due to L3 cache size scaling	1x	1x	0.76x	1.25x
Performance multiplier due to other factors like larger L2 cache, higher memory bandwidth, miscellaneous CPU architecture changes that improve IPC	1x	1.11x	1.06x	1.14x
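The rows of the table appear to combine multiplicatively. The sketch below multiplies the four per-factor rows and compares the product with the lab simulation multiplier; treating the factors as independent and multiplicative is an assumption made here for illustration.

```python
from math import prod

# Per-factor multipliers vs. Milan 7713, copied from the table above.
FACTORS = {
    "Genoa 9654":    {"core": 1.5, "frequency": 1.32, "l3": 1.00, "other": 1.11},
    "Bergamo 9754":  {"core": 2.0, "frequency": 1.21, "l3": 0.76, "other": 1.06},
    "Genoa-X 9684X": {"core": 1.5, "frequency": 1.29, "l3": 1.25, "other": 1.14},
}

# Lab simulation performance multipliers from the same table.
REPORTED = {"Genoa 9654": 2.20, "Bergamo 9754": 1.95, "Genoa-X 9684X": 2.75}


def combined_multiplier(factors):
    """Product of the individual factor multipliers."""
    return prod(factors.values())


for cpu, factors in FACTORS.items():
    print(f"{cpu}: product {combined_multiplier(factors):.2f}x, "
          f"reported {REPORTED[cpu]:.2f}x")
```

The products land within about 0.01 of the reported multipliers, with the small gaps explained by rounding in the table.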

Performance evaluation in production

How do these CPU candidates perform with real-world traffic and an actual production workload mix? The table below summarizes the performance of the three CPUs in lab simulation and in production. Genoa-X 9684X continues to outperform in production.

In addition, the Gen 12 server equipped with Genoa-X offered outstanding performance while consuming only 1.5x the power per system of our Gen 11 server with Milan 7713. In other words, we see a 63% increase in performance per watt. Genoa-X 9684X provides the best TCO improvement among the 3 options, and was ultimately chosen as the CPU for our Gen 12 server.

	Milan 7713	Genoa 9654	Bergamo 9754	Genoa-X 9684X
Lab simulation performance multiplier	1x	2.2x	1.95x	2.75x
Production performance multiplier	1x	2x	2.15x	2.45x
Production performance per watt multiplier	1x	1.33x	1.38x	1.63x
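The last two rows are linked by system power: performance per watt is the performance multiplier divided by the power multiplier. The sketch below backs out the implied power multiplier for each CPU from the table values. Only the 1.5x figure for Genoa-X is stated in the text; the other two are derived here and inherit the table's rounding.

```python
# (production performance multiplier, production performance per watt multiplier)
PRODUCTION = {
    "Genoa 9654":    (2.00, 1.33),
    "Bergamo 9754":  (2.15, 1.38),
    "Genoa-X 9684X": (2.45, 1.63),
}


def implied_power_multiplier(perf, perf_per_watt):
    """System power relative to Gen 11 implied by the two multipliers."""
    return perf / perf_per_watt


for cpu, (perf, ppw) in PRODUCTION.items():
    print(f"{cpu}: about {implied_power_multiplier(perf, ppw):.2f}x Gen 11 system power")
```

For Genoa-X, 2.45x performance at 1.5x power gives 2.45 / 1.5, or roughly 1.63x performance per watt, which is the 63% efficiency gain quoted above.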

The Gen 12 server with AMD Genoa-X 9684X is the most powerful and the most power efficient server Cloudflare has built to date. It serves as the underlying platform for all the incredible services that Cloudflare offers to our customers globally, and will help power the growth of Cloudflare infrastructure for the next several years with improved cost structure.

Hardware engineers at Cloudflare work closely with our infrastructure engineering partners and externally with our vendors to design and develop world-class servers to best serve our customers.

Come join us at Cloudflare to help build a better Internet!

Cisco 8102-DPU 12.8T Switch with AMD Pensando

2024-10-13 Rohit Kumar

Post Syndicated from Rohit Kumar original https://www.servethehome.com/cisco-8102-dpu-12-8t-switch-with-amd-pensando/

We saw the Cisco 8102-DPU, a 12.8T switch that can house up to eight AMD Pensando DPUs for 1.6T of programmable service acceleration

The post Cisco 8102-DPU 12.8T Switch with AMD Pensando appeared first on ServeTheHome.

AMD Instinct MI325X Launched and the MI355X is Coming

2024-10-12 Cliff Robinson

Post Syndicated from Cliff Robinson original https://www.servethehome.com/amd-instinct-mi325x-launched-and-the-mi355x-is-coming/

This week saw the launch of the AMD Instinct MI325X, a 256GB HBM3E accelerator that is an update to the MI300X. AMD also talked 2025 MI355X

The post AMD Instinct MI325X Launched and the MI355X is Coming appeared first on ServeTheHome.

ASRock Rack TURIN2D48G 48 DIMM Motherboard Launched

2024-10-12 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/asrock-rack-turin2d48g-48-dimm-motherboard-launched/

The ASRock Rack TURIN2D48G-2L+ motherboard supports up to two 500W AMD EPYC Turin CPUs with up to 768 threads and a whopping 48 DDR5 DIMMs

The post ASRock Rack TURIN2D48G 48 DIMM Motherboard Launched appeared first on ServeTheHome.

AMD EPYC 9005 Turin Turns Transcendent Performance with 768 Threads Per Server

2024-10-10 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/amd-epyc-9005-turin-turns-transcendent-performance-solidigm-broadcom/

With up to 768 threads in a dual socket server, the AMD EPYC 9005 “Turin” generation offers transcendent performance for servers

The post AMD EPYC 9005 Turin Turns Transcendent Performance with 768 Threads Per Server appeared first on ServeTheHome.

The 4th Gen AMD EPYC LEGO Model You Have Dreamed Of

2024-10-06 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/the-4th-gen-amd-epyc-lego-model-you-have-dreamed-of/

Ever dream of combining big server processors and LEGO bricks? That dream is here with a 4th Gen AMD EPYC model built out of LEGO bricks

The post The 4th Gen AMD EPYC LEGO Model You Have Dreamed Of appeared first on ServeTheHome.

AMD EPYC Embedded 8004 Series Launches with a New 70W SKU

2024-10-01 Cliff Robinson

Post Syndicated from Cliff Robinson original https://www.servethehome.com/amd-epyc-embedded-8004-series-launches-with-a-new-70w-sku/

The new AMD EPYC Embedded 8004 series includes a new 12-core SKU with cTDP as low as 70W making for a really neat embedded part

The post AMD EPYC Embedded 8004 Series Launches with a New 70W SKU appeared first on ServeTheHome.

Cloudflare’s 12th Generation servers — 145% more performant and 63% more efficient

2024-09-25 JQ Lau

Post Syndicated from JQ Lau original https://blog.cloudflare.com/gen-12-servers

Cloudflare is thrilled to announce the general deployment of our next generation of servers — Gen 12 powered by AMD EPYC 9684X (code name “Genoa-X”) processors. This next generation focuses on delivering exceptional performance across all Cloudflare services, enhanced support for AI/ML workloads, significant strides in power efficiency, and improved security features.

Here are some key performance indicators and feature improvements that this generation delivers as compared to the prior generation:

Beginning with performance, with close engineering collaboration between Cloudflare and AMD on optimization, Gen 12 servers can serve more than twice as many requests per second (RPS) as Gen 11 servers, resulting in lower Cloudflare infrastructure build-out costs.

Next, our power efficiency has improved significantly, by more than 60% in RPS per watt as compared to the prior generation. As Cloudflare continues to expand our infrastructure footprint, the improved efficiency helps reduce Cloudflare’s operational expenditure and carbon footprint as a percentage of our fleet size.

Third, in response to the growing demand for AI capabilities, we’ve updated the thermal-mechanical design of our Gen 12 server to support more powerful GPUs. This aligns with the Workers AI objective to support larger large language models and increase throughput for smaller models. This enhancement underscores our ongoing commitment to advancing AI inference capabilities.

Fourth, to underscore our security-first position as a company, we’ve integrated hardware root of trust (HRoT) capabilities to ensure the integrity of boot firmware and baseboard management controller firmware. Continuing to embrace open standards, the baseboard management and security controller (Data Center Secure Control Module or OCP DC-SCM) that we’ve designed into our systems is modular and vendor-agnostic, enabling a unified OpenBMC image, quicker prototyping, and reuse.

Finally, given the increasing importance of supply assurance and reliability in infrastructure deployments, our approach includes a robust multi-vendor strategy to mitigate supply chain risks, ensuring continuity and resiliency of our infrastructure deployment.

Cloudflare is dedicated to constantly improving our server fleet, empowering businesses worldwide with enhanced performance, efficiency, and security.

Gen 12 Servers

Let’s take a closer look at our Gen 12 server. The server is powered by a 4th generation AMD EPYC Processor, paired with 384 GB of DDR5 RAM, 16 TB of NVMe storage, a dual-port 25 GbE NIC, and two 800 watt power supply units.

Generation	Gen 12 Compute	Previous Gen 11 Compute
Form Factor	2U1N – Single socket	1U1N – Single socket
Processor	AMD EPYC 9684X Genoa-X 96-Core Processor	AMD EPYC 7713 Milan 64-Core Processor
Memory	384GB of DDR5-4800 x12 memory channel	384GB of DDR4-3200 x8 memory channel
Storage	2x E1.S NVMe: Samsung PM9A3 7.68TB / Micron 7450 Pro 7.68TB	2x M.2 NVMe: Samsung PM9A3 1.92TB
Network	Dual 25 GbE OCP 3.0: Intel Ethernet Network Adapter E810-XXVDA2 / NVIDIA Mellanox ConnectX-6 Lx	Dual 25 GbE OCP 2.0: Mellanox ConnectX-4 dual-port 25G
System Management	DC-SCM 2.0 ASPEED AST2600 (BMC) + AST1060 (HRoT)	ASPEED AST2500 (BMC)
Power Supply	800W – Titanium Grade	650W – Titanium Grade

Cloudflare Gen 12 server

CPU

During the design phase, we conducted an extensive survey of the CPU landscape. These options offer valuable choices as we consider how to shape the future of Cloudflare’s server technology to match the needs of our customers. We evaluated many candidates in the lab, and short-listed three standout CPU candidates from the 4th generation AMD EPYC Processor lineup: Genoa 9654, Bergamo 9754, and Genoa-X 9684X for production evaluation. The table below summarizes the differences in specifications of the short-listed candidates for Gen 12 servers against the AMD EPYC 7713 used in our Gen 11 servers. Notably, all three candidates offer significant increase in core count and marked increase in all core boost clock frequency.

CPU Model	AMD EPYC 7713	AMD EPYC 9654	AMD EPYC 9754	AMD EPYC 9684X
Series	Milan	Genoa	Bergamo	Genoa-X
# of CPU Cores	64	96	128	96
# of Threads	128	192	256	192
Base Clock	2.0 GHz	2.4 GHz	2.25 GHz	2.4 GHz
Max Boost Clock	3.67 GHz	3.7 GHz	3.1 GHz	3.7 GHz
All Core Boost Clock	2.7 GHz *	3.55 GHz	3.1 GHz	3.42 GHz
Total L3 Cache	256 MB	384 MB	256 MB	1152 MB
L3 cache per core	4MB / core	4MB / core	2MB / core	12MB / core
Maximum configurable TDP	240W	400W	400W	400W

* Note: AMD EPYC 7713 all core boost clock frequency of 2.7 GHz is not an official specification of the CPU but is based on data collected from the Cloudflare production fleet.

During production evaluation, the configurations of all three CPUs were optimized to the best of our knowledge, including setting thermal design power (TDP) to 400W for maximum performance. The servers are set up to run the same processes and services as any other server we have in production, which makes for a great side-by-side comparison.

	Milan 7713	Genoa 9654	Bergamo 9754	Genoa-X 9684X
Production performance (request per second) multiplier	1x	2x	2.15x	2.45x
Production efficiency (request per second per watt) multiplier	1x	1.33x	1.38x	1.63x

AMD EPYC Genoa-X in Cloudflare Gen 12 server

Each of these CPUs outperforms the previous generation of processors by at least 2x. AMD EPYC 9684X Genoa-X with 3D V-cache technology gave us the greatest performance improvement, at 2.45x, when compared against our Gen 11 servers with AMD EPYC 7713 Milan.

Comparing the performance between Genoa-X 9684X and Genoa 9654, we see a ~22.5% performance delta. The primary difference between the two CPUs is the amount of L3 cache available on the CPU. Genoa-X 9684X has 1152 MB of L3 cache, three times the 384 MB on Genoa 9654. Cloudflare workloads benefit from having more low level cache accessible, which avoids the much larger latency penalty associated with fetching data from memory.

The Genoa-X 9684X CPU delivered ~22.5% better performance than Genoa 9654 while consuming the same 400W. The 3x larger L3 cache does consume additional power, but the cost is only about 3% of the highest achievable all core boost frequency on Genoa-X 9684X, a favorable trade-off for Cloudflare workloads.

More importantly, Genoa-X 9684X CPU delivered 145% performance improvement with only 50% system power increase, offering a 63% power efficiency improvement that will help drive down operational expenditure tremendously. It is important to note that even though a big portion of the power efficiency is due to the CPU, it needs to be paired with optimal thermal-mechanical design to realize the full benefit. Earlier last year, we made the thermal-mechanical design choice to double the height of the server chassis to optimize rack density and cooling efficiency across our global data centers. We estimated that moving from 1U to 2U would reduce fan power by 150W, which would decrease system power from 750 watts to 600 watts. Guess what? We were right — a Gen 12 server consumes 600 watts per system at a typical ambient temperature of 25°C.

While high performance often comes at a higher price, fortunately the AMD EPYC 9684X offers an excellent balance between cost and capability. A server designed with this CPU provides top-tier performance without necessitating a huge financial outlay, resulting in a good Total Cost of Ownership improvement for Cloudflare.

Memory

AMD Genoa-X CPU supports twelve memory channels of DDR5 RAM up to 4800 mega transfers per second (MT/s) and per socket Memory Bandwidth of 460.8 GB/s. The twelve channels are fully utilized with 32 GB ECC 2Rx8 DDR5 RDIMM with one DIMM per channel configuration for a combined total memory capacity of 384 GB.
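The 460.8 GB/s figure follows from channel count, transfer rate, and a 64-bit (8-byte) data path per channel. The sketch below reproduces it, along with the 384 GB capacity, and applies the same formula to the Gen 11 configuration (8 channels of DDR4-3200) for comparison. These are theoretical peaks, not measured bandwidth, and the function name is illustrative.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


gen12 = peak_bandwidth_gbs(channels=12, mega_transfers_per_s=4800)  # DDR5-4800
gen11 = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)   # DDR4-3200
capacity_gb = 12 * 32  # one 32 GB RDIMM per channel

print(f"Gen 12 peak: {gen12:.1f} GB/s, Gen 11 peak: {gen11:.1f} GB/s")
print(f"Gen 12 capacity: {capacity_gb} GB")
```

On paper, the move to twelve DDR5-4800 channels gives 2.25x the peak memory bandwidth of the Gen 11 platform.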

Choosing the optimal memory capacity is a balancing act, as maintaining an optimal memory-to-core ratio is important to make sure neither CPU capacity nor memory capacity is wasted. Some may remember that our Gen 11 servers with 64 core AMD EPYC 7713 CPUs are also configured with 384 GB of memory, which is 6 GB per core. So why did we choose to configure our Gen 12 servers with 384 GB of memory when the core count is growing to 96 cores? Great question! A lot of memory optimization work has happened since we introduced Gen 11, including some that we blogged about, like Bot Management code optimization and our transition to highly efficient Pingora. In addition, each service has a memory allocation that is sized for optimal performance. The per-service memory allocation is programmed and monitored using Linux control group resource management features. When sizing memory capacity for Gen 12, we consulted with the team that monitors resource allocation and surveyed memory utilization metrics collected from our fleet. The result of the analysis is that the optimal memory-to-core ratio is 4 GB per CPU core, or 384 GB total memory capacity. This configuration is validated in production. We chose dual rank memory modules over single rank memory modules because they have higher memory throughput, which improves server performance (read more about memory module organization and its effect on memory bandwidth).
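As an illustration of what monitoring a per-service allocation through control groups can look like, the sketch below reads the cgroup v2 interface files `memory.current` and `memory.max`. The text does not say which cgroup version or tooling Cloudflare uses, so cgroup v2 is an assumption here, and the function names are hypothetical.

```python
from pathlib import Path


def parse_limit(text):
    """Parse the content of cgroup v2 memory.max; 'max' means no limit."""
    value = text.strip()
    return None if value == "max" else int(value)


def memory_utilization(current_bytes, limit_bytes):
    """Fraction of the limit in use, or None when the cgroup is unlimited."""
    if limit_bytes is None:
        return None
    return current_bytes / limit_bytes


def read_cgroup_memory(cgroup_dir):
    """Return (current_bytes, limit_bytes) for one cgroup v2 directory."""
    base = Path(cgroup_dir)
    current = int((base / "memory.current").read_text().strip())
    limit = parse_limit((base / "memory.max").read_text())
    return current, limit
```

On a cgroup v2 host this could be called as `read_cgroup_memory("/sys/fs/cgroup/system.slice/example.service")`, where the service name is a placeholder. Collecting the resulting utilization across a fleet is one way to arrive at a memory-to-core ratio like the 4 GB per core described above.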

The table below shows the result of running the Intel Memory Latency Checker (MLC) tool to measure peak memory bandwidth for the system and to compare memory throughput between 12 channels of dual rank (2Rx8) 32 GB DIMMs and 12 channels of single rank (1Rx4) 32 GB DIMMs. Dual rank DIMMs have slightly higher (1.8%) read memory bandwidth, but noticeably higher write bandwidth. As the write ratio increased from 25% to 50%, the dual rank advantage grew by about 10 percentage points, from 7.7% to 17.8%.

Benchmark	Dual rank advantage over single rank
Intel MLC ALL Reads	101.8%
Intel MLC 3:1 Reads-Writes	107.7%
Intel MLC 2:1 Reads-Writes	112.9%
Intel MLC 1:1 Reads-Writes	117.8%
Intel MLC Stream-triad like	108.6%

The table below shows the result of running the AMD STREAM benchmark to measure sustainable main memory bandwidth in MB/s and the corresponding computation rate for simple vector kernels. In all 4 types of vector kernels, dual rank DIMMs provide a noticeable advantage over single rank DIMMs.

Benchmark	Dual rank advantage over single rank
Stream Copy	115.44%
Stream Scale	111.22%
Stream Add	109.06%
Stream Triad	107.70%

Storage

Cloudflare’s Gen X server and Gen 11 server support M.2 form factor drives. We liked the M.2 form factor mainly because it was compact. The M.2 specification was introduced in 2012, but today, the connector system is dated and the industry has concerns about its ability to maintain signal integrity with the high speed signal specified by PCIe 5.0 and PCIe 6.0 specifications. The 8.25W thermal limit of the M.2 form factor also limits the number of flash dies that can be fitted, which limits the maximum supported capacity per drive. To address these concerns, the industry has introduced the E1.S specification and is transitioning from the M.2 form factor to the E1.S form factor.

In Gen 12, we are making the change to the EDSFF E1 form factor, more specifically the E1.S 15mm. E1.S 15mm, though still in a compact form factor, provides more space to fit more flash dies for larger capacity support. The form factor also has better cooling design to support more than 25W of sustained power.

While the AMD Genoa-X CPU supports 128 PCIe 5.0 lanes, we continue to use NVMe devices on PCIe 4.0 x4 links, as PCIe 4.0 throughput is sufficient to meet drive bandwidth requirements and keeps server design costs in check. The server is equipped with two 8 TB NVMe drives for a total of 16 TB of available storage. We opted for two 8 TB drives instead of four 4 TB drives because the dual 8 TB configuration already provides sufficient I/O bandwidth for all Cloudflare workloads that run on each server.

Sequential read (MB/s)	6,700
Sequential write (MB/s)	4,000
Random read (IOPS)	1,000,000
Random write (IOPS)	200,000
Endurance (DWPD)	1
PCIe 4.0 x4 link throughput (MB/s)	7,880

Table: Storage device performance specification
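
The link throughput figure in the table can be reproduced from the PCIe 4.0 signaling rate (16 GT/s per lane) and its 128b/130b line encoding. The sketch below works through that arithmetic and compares the result with the drive's sequential figures. The function name is illustrative, and the calculation ignores packet and protocol overhead, so real-world throughput is somewhat lower.

```python
def pcie_throughput_mb_per_s(gigatransfers_per_s: float, lanes: int,
                             payload_bits: int = 128, encoded_bits: int = 130) -> float:
    """Raw link throughput in MB/s after line encoding, before protocol overhead."""
    bits_per_s = gigatransfers_per_s * 1e9 * lanes * payload_bits / encoded_bits
    return bits_per_s / 8 / 1e6


if __name__ == "__main__":
    link = pcie_throughput_mb_per_s(16.0, lanes=4)  # PCIe 4.0 x4
    print(f"PCIe 4.0 x4: {link:.0f} MB/s")
    # Drive figures taken from the specification table above.
    for label, drive_mb_s in {"sequential read": 6700, "sequential write": 4000}.items():
        print(f"{label}: {drive_mb_s / link:.0%} of link throughput")
```

The result is about 7,877 MB/s, which the table rounds to 7,880 MB/s. The drive's 6,700 MB/s sequential read rate sits below that ceiling, which is consistent with the statement that PCIe 4.0 is sufficient for these drives.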

Network

Cloudflare servers and top-of-rack (ToR) network equipment operate at 25 GbE speeds. In Gen 12, we utilized a DC-MHS motherboard-inspired design, and upgraded from an OCP 2.0 form factor to an OCP 3.0 form factor, which provides tool-less serviceability of the NIC. The OCP 3.0 form factor also occupies less space in the 2U server compared to PCIe-attached NICs, which improves airflow and frees up space for other application-specific PCIe cards, such as GPUs.

Cloudflare has been using the Mellanox CX4-Lx EN dual port 25 GbE NIC since our Gen 9 servers in 2018. Even though the NIC has served us well over the years, we were single-sourced. During the pandemic, we faced supply constraints and extremely long lead times. The team scrambled to qualify the Broadcom M225P dual port 25 GbE NIC as our second-source NIC in 2022, ensuring we could continue to turn up servers to serve customer demand. With the lessons learned from single-sourcing the Gen 11 NIC, we are now dual-sourcing and have chosen the Intel Ethernet Network Adapter E810 and the NVIDIA Mellanox ConnectX-6 Lx to support Gen 12. These two NICs are compliant with the OCP 3.0 specification and offer more MSI-X queues, which can then be mapped to the increased core count on the AMD EPYC 9684X. The Intel Ethernet Network Adapter comes with an additional advantage: it offers full Generic Segmentation Offload (GSO) support, including for VLAN-tagged encapsulated traffic, whereas many vendors today either support only partial GSO or do not support it at all. With full GSO support, the kernel spends noticeably less time in softirq segmenting packets, and servers with Intel E810 NICs process approximately 2% more requests per second.
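
The softirq cost mentioned above comes from splitting large send buffers into wire-sized packets in software when the NIC cannot do it. As a purely conceptual illustration, the splitting step looks roughly like the sketch below. This is not kernel or driver code, and the function name, the 64 KiB buffer, and the 1448-byte segment size are assumptions chosen for the example.

```python
def segment_payload(payload: bytes, mss: int) -> list[bytes]:
    """Split one large send buffer into segments of at most mss bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


if __name__ == "__main__":
    buffer = bytes(64 * 1024)  # one large buffer handed down by the stack (illustrative)
    segments = segment_payload(buffer, mss=1448)  # illustrative segment size
    print(f"{len(segments)} segments, last one {len(segments[-1])} bytes")
```

Every segment also needs its own headers and checksums, so doing this work per packet on the CPU adds up at high request rates. Offloading it to the NIC is what frees the kernel time described above.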

Improved security with DC-SCM: Project Argus

Figure: DC-SCM in Gen 12 server (Project Argus)

Gen 12 servers are integrated with Project Argus, one of the industry’s first implementations of Data Center Secure Control Module 2.0 (DC-SCM 2.0). DC-SCM 2.0 decouples server management and security functions from the motherboard. The baseboard management controller (BMC), hardware root of trust (HRoT), trusted platform module (TPM), and dual BMC/BIOS flash chips are all installed on the DC-SCM.

On our Gen X and Gen 11 servers, Cloudflare moved our secure boot trust anchor from the system Basic Input/Output System (BIOS) or Unified Extensible Firmware Interface (UEFI) firmware to hardware-rooted boot integrity: AMD’s implementation of Platform Secure Boot (PSB) or Ampere’s implementation of Single Domain Secure Boot. These solutions helped secure Cloudflare infrastructure from BIOS/UEFI firmware attacks. However, we were still vulnerable to out-of-band attacks that compromise the BMC firmware. The BMC is a microcontroller that provides out-of-band monitoring and management capabilities for the system. When the BMC is compromised, attackers can, for example, read processor console logs accessible to the BMC and control server power states. On Gen 12, the HRoT on the DC-SCM serves as the trust store for cryptographic keys and is responsible for authenticating the BIOS/UEFI firmware (independent of CPU vendor) and the BMC firmware as part of the secure boot process.

In addition, the DC-SCM carries additional flash storage devices that hold backup copies of the BIOS/UEFI firmware and BMC firmware. These allow rapid recovery when corrupted or malicious firmware is programmed, and they provide resilience against flash chip failure due to aging.
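
The authenticate-then-recover flow described above can be sketched in a few lines. The example below is a deliberately simplified illustration and not the Project Argus implementation: a real hardware root of trust verifies cryptographic signatures against keys in its trust store, whereas this sketch substitutes a set of trusted SHA-256 digests. All names (`is_trusted`, `select_boot_image`, the firmware byte strings) are hypothetical.

```python
import hashlib
from typing import Optional


def is_trusted(image: bytes, trusted_digests: set[str]) -> bool:
    """Return True if the image's SHA-256 digest is in the trust store."""
    return hashlib.sha256(image).hexdigest() in trusted_digests


def select_boot_image(primary: bytes, backup: bytes,
                      trusted_digests: set[str]) -> Optional[bytes]:
    """Prefer the primary flash image; fall back to the backup copy if it fails the check."""
    for image in (primary, backup):
        if is_trusted(image, trusted_digests):
            return image
    return None  # neither copy can be authenticated, so refuse to boot


if __name__ == "__main__":
    good = b"bmc-firmware-v1"               # stand-in for a known-good image
    tampered = b"bmc-firmware-v1-tampered"  # stand-in for corrupted or malicious firmware
    trust_store = {hashlib.sha256(good).hexdigest()}

    chosen = select_boot_image(primary=tampered, backup=good, trusted_digests=trust_store)
    print("booting from backup" if chosen == good else "boot refused")
```

Keeping a second, independently stored copy is what turns a failed check from an outage into a recovery: the primary image is rejected and the backup is used instead.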

These updates make our Gen 12 server more secure and more resilient to firmware attacks.

Power

A Gen 12 server consumes 600 watts at a typical ambient temperature of 25°C. As mentioned in the CPU section above, even though this is a 50% increase from the 400 watts consumed by the Gen 11 server, it is a relatively small price to pay for a 145% increase in performance. We’ve paired the server with dual 800W common redundant power supplies (CRPS) with 80 PLUS Titanium grade efficiency. Both power supply units (PSUs) operate actively and share power and current. The units are hot-pluggable, allowing the server to operate with redundancy and maximize uptime.
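
The trade-off in the paragraph above can be expressed as performance per watt. The sketch below uses only the figures quoted there (145% more performance, 400 W rising to 600 W); the function name is illustrative.

```python
def relative_perf_per_watt(perf_increase: float, power_before_w: float,
                           power_after_w: float) -> float:
    """Ratio of new to old performance per watt, given a fractional performance increase."""
    perf_ratio = 1.0 + perf_increase          # 145% increase -> 2.45x performance
    power_ratio = power_after_w / power_before_w
    return perf_ratio / power_ratio


if __name__ == "__main__":
    ratio = relative_perf_per_watt(perf_increase=1.45,
                                   power_before_w=400, power_after_w=600)
    print(f"Gen 12 delivers {ratio:.2f}x the performance per watt of Gen 11")
```

With 2.45 times the performance at 1.5 times the power, each watt does roughly 1.63 times as much work, an improvement of about 63%.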

80 PLUS is a PSU efficiency certification program. A Titanium-grade PSU is 2% more efficient than a Platinum-grade PSU across the typical operating load range of 25% to 50%. 2% may not sound like a lot, but given the size of Cloudflare’s fleet, with servers deployed worldwide, a 2% saving over the lifetime of all Gen 12 deployments is a reduction of more than 7 GWh, equivalent to the carbon sequestered by more than 3,400 acres of U.S. forests in one year. This upgrade also means our Gen 12 server complies with EU Lot9 requirements and can be deployed in the EU region.

80 PLUS certification	10% load	20% load	50% load	100% load
80 PLUS Platinum	–	92%	94%	90%
80 PLUS Titanium	90%	94%	96%	91%
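
As a rough check on where a Gen 12 server sits on this table, the sketch below estimates the load on each PSU and the per-server input power difference between the two grades. It rests on assumptions that are not stated in the article: that the 600 W figure is the load delivered by the PSUs, that the two units share it equally, and that the 50% load efficiencies (94% and 96%) are a fair approximation at the actual operating point. Function names are illustrative.

```python
def per_psu_load_fraction(server_load_w: float, psu_rating_w: float,
                          psu_count: int = 2) -> float:
    """Fraction of its rating each PSU carries when the load is shared equally."""
    return server_load_w / psu_count / psu_rating_w


def input_power_w(output_w: float, efficiency: float) -> float:
    """Power drawn from the wall to deliver output_w at the given efficiency."""
    return output_w / efficiency


if __name__ == "__main__":
    load = per_psu_load_fraction(server_load_w=600, psu_rating_w=800)
    print(f"Each PSU runs at {load:.1%} of its rating")

    platinum = input_power_w(600, 0.94)  # 50% load figure from the table above
    titanium = input_power_w(600, 0.96)  # 50% load figure from the table above
    print(f"Estimated saving per server: {platinum - titanium:.1f} W")
```

Under these assumptions each PSU runs at 37.5% of its rating, inside the 25% to 50% range discussed above, and the Titanium units save roughly 13 W per server. The fleet-wide figure quoted in the article depends on fleet size and service lifetime, which this sketch does not attempt to model.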

Drop-in GPU support

Demand for machine learning and AI workloads exploded in 2023, and Cloudflare introduced Workers AI to serve the needs of our customers. Cloudflare retrofitted or deployed GPUs worldwide in a portion of our Gen 11 server fleet to support the growth of Workers AI. Our Gen 12 server is also designed to accommodate the addition of more powerful GPUs. This gives Cloudflare the flexibility to support Workers AI in all regions of the world, and to strategically place GPUs in regions to reduce inference latency for our customers. With this design, the server can run Cloudflare’s full software stack. During times when GPUs see lower utilization, the server continues to serve general web requests and remains productive.

The motherboard’s electrical design supports up to two PCIe add-in cards, and the power distribution board is sized to support an additional 400W of power. The mechanics are sized to support either a single FHFL (full height, full length) double-width GPU PCIe card or two FHFL single-width GPU PCIe cards. The thermal solution, including the component placement, fans, and air duct design, is sized to support adding GPUs with TDP up to 400W.
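
The constraints in the paragraph above (card count, slot width, and power) can be captured as a simple configuration check. The sketch below is one reading of those limits, treating 400 W as a combined budget for all added GPUs. That interpretation, the constant and function names, and the example cards are all assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_CARDS = 2             # up to two PCIe add-in cards
MAX_SLOT_WIDTHS = 2       # one double-width card or two single-width cards
GPU_POWER_BUDGET_W = 400  # additional power the distribution board is sized for


@dataclass(frozen=True)
class GpuCard:
    name: str
    slot_width: int  # 1 = single width, 2 = double width
    tdp_w: float


def fits_gpu_budget(cards: list[GpuCard]) -> bool:
    """Check a proposed GPU set against the card-count, width, and power limits."""
    return (
        len(cards) <= MAX_CARDS
        and sum(card.slot_width for card in cards) <= MAX_SLOT_WIDTHS
        and sum(card.tdp_w for card in cards) <= GPU_POWER_BUDGET_W
    )


if __name__ == "__main__":
    # Hypothetical cards, not real product names.
    one_large = [GpuCard("double-width-400w", slot_width=2, tdp_w=400)]
    two_small = [GpuCard("single-width-a", 1, 150), GpuCard("single-width-b", 1, 150)]
    too_wide = [GpuCard("double-width-300w", 2, 300), GpuCard("single-width-c", 1, 75)]
    for label, cards in [("one large", one_large), ("two small", two_small),
                         ("too wide", too_wide)]:
        print(f"{label}: {'fits' if fits_gpu_budget(cards) else 'does not fit'}")
```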

Looking to the future

Gen 12 servers are currently deployed and live in multiple Cloudflare data centers worldwide, and already process millions of requests per second. Cloudflare’s EPYC journey has not ended: the 5th-gen AMD EPYC CPUs (code name “Turin”) are already available for testing, and we are very excited to start the architecture planning and design discussion for the Gen 13 server. Come join us at Cloudflare to help build a better Internet!

HPE ProLiant DL145 Server Launched Edge AMD EPYC 8004 Server

2024-09-18 Cliff Robinson

Post Syndicated from Cliff Robinson original https://www.servethehome.com/hpe-proliant-dl145-server-launched-edge-amd-epyc-8004-server/

The HPE ProLiant DL145 Gen11 is a 2U edge server that sports the AMD EPYC 8004 Siena processor line for efficient computing

The post HPE ProLiant DL145 Server Launched Edge AMD EPYC 8004 Server appeared first on ServeTheHome.