Tag Archives: AMD

Performance benefits of new Amazon EC2 R8a memory-optimized instances

Post Syndicated from Tyler Jones original https://aws.amazon.com/blogs/compute/performance-benefits-of-new-amazon-ec2-r8a-memory-optimized-instances/

Recently we announced the availability of Amazon Elastic Compute Cloud (Amazon EC2) R8a instances, the latest addition to the AMD memory-optimized instance family. These instances are powered by the 5th Generation AMD EPYC (codename Turin) processors with a maximum frequency of 4.5 GHz. In this post I take these instances for a spin and benchmark MySQL later on, but first I discuss the top things you should know about these instances.

Notable characteristics of R8a instances

Each vCPU on an R8a instance corresponds to a physical CPU core (something we started on 7th generation AMD instances). This means that there is no simultaneous multi-threading (SMT). Each vCPU mapped to a dedicated physical core, which means that you get more predictable and consistent performance because there’s no resource sharing or potential interference between threads, which is particularly crucial for performance-sensitive workloads where consistent latency is essential. When evaluating and adopting R8a instances, make sure that you’re re-evaluating your thresholds for CPU usage. You can likely squeeze more out of each instance’s CPU without impacting any of your workload’s SLA metrics.

R8a instances feature sizes of up to 192 vCPU with 1,536 GiB RAM. The following table shows the detailed specs:

Instance size vCPU Memory (GiB) Instance storage Network bandwidth (Gbps) EBS bandwidth (Gbps)
r8a.medium 1 8 EBS Only Up to 12.5 Up to 10
r8a.large 2 16 EBS Only Up to 12.5 Up to 10
r8a.xlarge 4 32 EBS Only Up to 12.5 Up to 10
r8a.2xlarge 8 64 EBS Only Up to 15 Up to 10
r8a.4xlarge 16 128 EBS Only Up to 15 Up to 10
r8a.8xlarge 32 256 EBS Only 15 10
r8a.12xlarge 48 384 EBS Only 22.5 15
r8a.16xlarge 64 512 EBS Only 30 20
r8a.24xlarge 96 768 EBS Only 40 30
r8a.48xlarge 192 1536 EBS Only 75 60
r8a.metal-24xl 96 768 EBS Only 40 30
r8a.metal-48xl 192 1536 EBS Only 75 60

Testing MySQL performance using HammerDB

R8a instances are a great choice for MySQL databases, so I thought that would be a great place to showcase some of these instances capabilities. To test MySQL, I used a series of scripts written by my colleagues to track MySQL performance across software versions and different EC2 instances. These scripts are stored in the repro-collection repository, which is an open source, extensible framework for performance testing that addresses real-world workloads rather than micro-benchmarks. It is built to provide a performance measurement reference usable across multiple organizations, and it’s currently centered on MySQL and actively used in discussions with Linux Kernel developers and maintainers. Furthermore, it helps track any performance impacts created by code changes to MySQL. The scripts contained in this repository set up a MySQL database to be tested, and a load generator running the HammerDB benchmark.

For this benchmark I used an r6a.24xlarge instance for the load generator, and an r6a.xlarge, r7a.xlarge, and r8a.xlarge instances for the MySQL database server all deployed in the same AWS Availability Zone (AZ). I chose a single AZ setup to minimize any latency variability from crossing multiple AZs. This is not meant to be a production-like setup, and I highly recommend using multiple AZs for production workloads. Each MySQL instance was tested separately using the same HammerDB load generator. Each test was run three times, and the results were averaged across the three runs. A diagram of the architecture is shown in the following figure:

Performance testing architecture showing r6a/r7a/r8a instance types with HammerDB load generator executing 9 test runs

HammerDB overall results

R8a instances show great results in the HammerDB benchmark for MySQL databases. For HammerDB’s overall score category, R8a instances outscored R7a instances by 55% and outscored R6a instances by 74%.

Performance comparison chart showing r6a, r7a, and r8a instance scores

HammerDB transactions per minute test

R8a instances also showed a notable improvement in this category. When compared to previous generation R7a instances, R8a out performed R7a by 32%. When compared to R6a instances, R8a outperformed by 63%.

 Performance comparison showing r6a (91,105), r7a (112,686), and r8a (148,478) transactions per minute

HammerDB P99 latency results

R8a instances showed improvement in P99 latency results, showing the efficiency gains driven by the new 5th Generation AMD EPYC CPUs and higher memory bandwidth. R8a shows an 14% latency reduction when compared to R7a, and a 25% latency reduction when compared to R6a.

P99 latency comparison showing decrease from 39.93ms (r6a) to 30.02ms (r8a) across instance generations

Conclusion

Built on the AWS Nitro System using sixth generation Nitro Cards, R8a instances are ideal for high performance, memory-intensive workloads, such as SQL and NoSQL databases, as demonstrated by the bench-marking shown in this post, as well as distributed web scale in-memory caches, in-memory databases, real-time big data analytics, and Electronic Design Automation (EDA) applications. R8a instances offer 12 sizes, including 2 bare metal sizes. Amazon EC2 R8a instances are SAP-certified, and providing 38% more SAPS when compared to R7a instances. If you’re still running 6th generation R6a instances, then I highly encourage you to migrate to the 8th generation instances to use their clear price performance benefits. Staying on modern infrastructure is a great way to drive down costs and provide more features for your customers, and there are clear gains to be had based on the testing shown in this post.

Start optimizing your high performance memory intensive workloads today by migrating to R8a instances. Visit the Amazon EC2 R8a instances page to learn more and get started on your upgrades to use the increased price performance of R8a instances today!

HPE Shows off AMD EPYC Venice and SP7 Supercomputing Node at SC25

Post Syndicated from Cliff Robinson original https://www.servethehome.com/hpe-shows-off-amd-epyc-venice-and-sp7-supercomputing-node-at-sc25/

HPE showed off its next-generation AMD EPYC “Venice” socket SP7 supercomputing platform at SC25 along with its Slingshot 400 interconnect

The post HPE Shows off AMD EPYC Venice and SP7 Supercomputing Node at SC25 appeared first on ServeTheHome.

HPE Launches New AMD EPYC Venice Instinct MI400 and NVIDIA Vera Rubin Compute Blades

Post Syndicated from Cliff Robinson original https://www.servethehome.com/hpe-launches-new-amd-venice-instinct-mi400-and-nvidia-vera-rubin-compute-arm/

HPE has three new blades based on AMD EPYC Venice, MI430X, and NVIDIA Vera Rubin, along with Slingshot 400 for 2027 HPC

The post HPE Launches New AMD EPYC Venice Instinct MI400 and NVIDIA Vera Rubin Compute Blades appeared first on ServeTheHome.

MLPerf Training v5.1 NVIDIA Dominates while AMD Has a Strong Showing and Cisco Silicon

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/mlperf-training-v5-1-nvidia-dominates-while-amd-has-a-strong-showing-and-cisco-silicon/

MLPerf Training v5.1 is out, with NVIDIA dominating. AMD showed up and performed well. Cisco Silicon One and MangoBoost DPUs made cameos

The post MLPerf Training v5.1 NVIDIA Dominates while AMD Has a Strong Showing and Cisco Silicon appeared first on ServeTheHome.

Gigabyte B343-C40-AAJ1 Review 10-Node AMD EPYC 4005 Goes High-Density

Post Syndicated from Eric Smith original https://www.servethehome.com/gigabyte-b343-c40-aaj1-review-10-node-amd-epyc-4005-goes-high-density/

We review the Gigabyte B343-C40-AAJ1, a 3U 10-node AMD EPYC 4005 and Ryzen 9000 server that seeks to drive up density and drive out costs

The post Gigabyte B343-C40-AAJ1 Review 10-Node AMD EPYC 4005 Goes High-Density appeared first on ServeTheHome.

AMD Helios MI450 Rack at OCP Summit with a Different Version from Meta

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/amd-helios-mi450-rack-at-ocp-summit-with-a-different-version-from-meta/

At OCP Summit 2025, we saw the AMD Helios rack along with Meta’s Helios rack which was very different swapping power, networking, and more

The post AMD Helios MI450 Rack at OCP Summit with a Different Version from Meta appeared first on ServeTheHome.

Beelink GTR9 Pro Review AMD Ryzen AI Max 395 System with 128GB and dual 10GbE

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/beelink-gtr9-pro-review-amd-ryzen-ai-max-395-system-with-128gb-and-dual-10gbe/

In our Beelink GTR9 Pro review, we see why this AMD Ryzen AI Max+ 395 system is fast packed with an Apple-like design and great features

The post Beelink GTR9 Pro Review AMD Ryzen AI Max 395 System with 128GB and dual 10GbE appeared first on ServeTheHome.

Tuning guide for AMD Amazon EC2 instances

Post Syndicated from Suyash Nadkarni original https://aws.amazon.com/blogs/compute/tuning-guide-for-amd-amazon-ec2-instances/

As organizations migrate more mission-critical workloads to the cloud, optimizing for price-performance becomes a key consideration. Amazon Elastic Compute Cloud (Amazon EC2) instances powered by AMD EPYC processors deliver high core density, large memory bandwidth, and hardware-enabled security features, making them a strong option for a wide range of compute, memory, and I/O-intensive workloads. In this post, we explain how to choose the right AMD-based Amazon EC2 instance types and describe tuning techniques that can help users improve workload efficiency. Whether you’re running simulations, large-scale analytics, or inference workloads, this post provides practical guidance for optimizing AMD-powered Amazon EC2 instance.

Amazon EC2 offers AMD-based instances built on multiple generations of AMD EPYC processors. This post focuses on optimization strategies for the 3rd and 4th generation families, which provide enhanced capabilities for compute and memory-intensive workloads.

  • 3rd generation (M6a, R6a, C6a, Hpc6a): Balances compute, memory, and storage—well-suited for analytics, web servers, and high-performance computing.
  • 4th generation (M7a, R7a, C7a, Hpc7a): Deliver up to 50% better performance over earlier AMD generations These instances introduce AVX-512 support, DDR5 memory, and Simultaneous Multithreading (SMT) turned off, SMT is a technology that allows a single physical core to run multiple threads concurrently; with SMT disabled, each virtual CPU (vCPU) maps directly to a physical core, which can improve workload isolation and consistency.

Choosing the right AMD EPYC powered Amazon EC2 instance type

Selecting the right AMD EPYC powered Amazon EC2 instance type starts with understanding how your application uses compute, memory, storage, and networking resources. Each instance family is optimized for specific workload characteristics.

Compute-intensive workloads

These workloads involve large-scale calculations, simulations, or encoding tasks, and they often need high CPU throughput and advanced instruction set support.

Recommended instances: C7a, Hpc7a, C6a, Hpc6a
Use cases: Scientific computing, financial modelling, media transcoding, encryption, machine learning (ML) inference

Big data and analytics

Applications that process and analyze large datasets benefit from high memory bandwidth and a balanced compute-to-memory ratio.

Recommended instances: R7a, M7a, R6a, M6a
Use cases: Stream processing, real-time analytics, business intelligence tools, distributed caching

Database workloads

Database workloads typically need consistent memory performance and high I/O throughput for read/write operations.

Recommended instances: R7a, M7a, R6a, M6a
Use cases: Relational databases (MySQL, PostgreSQL), NoSQL databases (MongoDB, Cassandra), in-memory databases (Redis)

Web and application servers

These applications handle variable request loads and benefit from balanced compute, memory, and network performance.

Recommended instances: C7a, M7a, C6a, M6a
Use cases: Web servers, content management systems, e-commerce platforms, API endpoints

AI/ML on CPU

ML tasks that do not need GPUs—such as inference or preprocessing—can run efficiently on CPU-based instances.

Recommended instances: M7a, R7a, C7a
Use cases: Model inference, natural language processing, computer vision, recommendation engines

High Performance Computing (HPC)

These workloads need high core counts, memory bandwidth, and low-latency networking for tightly coupled computations.

Recommended instances: Hpc7a, Hpc6a, R7a, M7a
Use cases: Computational fluid dynamics, genomics, seismic analysis, engineering simulations

Aligning your instance type with the needs of your workload helps provide predictable performance and cost efficiency. Services such as Amazon EC2 Auto Scaling and AWS Compute Optimizer can assist with ongoing instance selection and scaling decisions.

Optimizing AMD EPYC powered Amazon EC2 instances

Amazon EC2 instances powered by 4th generation AMD EPYC processors use a modular chiplet architecture, as shown in the following figure. Each processor includes multiple Core Complex Dies (CCDs), and each CCD contains one or more core complexes (CCXs). A CCX groups up to eight physical cores, with each core having 1 MB of dedicated L2 cache and all eight cores sharing a 32 MB L3 cache. These CCDs are connected to a central I/O die, which manages memory and interconnects across the chip.

Figure 1: Layout of the ‘Zen 4’ CPU die with 8 cores per die

Figure 1: Layout of the ‘Zen 4’ CPU die with 8 cores per die

The modular architecture of 4th generation AMD EPYC processors enables Amazon EC2 instances such as m7a.24xlarge and m7a.48xlarge to support high core counts-up to 96 physical cores per socket. For example:

  • m7a.24xlarge provides 96 physical cores from a single socket.
  • m7a.48xlarge spans two sockets, offering 192 physical cores.

Understanding how Amazon EC2 instance sizes map to physical processor layouts can help you optimize for performance and cache locality. Workloads that involve shared memory access or thread synchronization, such as high-performance computing or in-memory databases, can benefit from selecting instance sizes that minimize cross-socket communication and make efficient use of shared L3 cache, as shown in the following figure.

Figure 2: Layout of the ‘EPYC Chiplet’ CPU

Figure 2: Layout of the ‘EPYC Chiplet’ CPU

Amazon EC2 instances powered by 4th generation AMD EPYC processors operate with SMT turned off. In this configuration, each vCPU maps directly to a physical core, eliminating resource sharing such as execution units and cache between sibling threads. This design can reduce intra-core interference and help provide more consistent performance under certain workloads. Users can isolate threads at the core level and observe lower variability and more stable throughput for workloads, such as high-performance computing, ML inference, and transactional databases.

CPU optimizations

Tools such as htop can help identify CPU usage patterns, system load averages, and per-process resource consumption. CPU usage should be evaluated in the context of your workload and performance requirements. If usage consistently reaches 100%, then it may indicate that the workload is CPU-bound and not optimally balanced. Before modifying the instance size, enabling Auto Scaling, or switching instance families, evaluations must be conducted for the tuning opportunities that could improve performance without changing infrastructure. Load averages that regularly exceed the number of vCPUs can also signal compute saturation and may warrant further optimization.

L3 cache usage

The L3 cache is a shared, high-speed memory layer used by a group of CPU cores. On AMD-based Amazon EC2, cores are organized into L3 cache slices, each shared by a subset of cores on the same socket. Threads scheduled within the same slice can access shared data more efficiently, reducing memory latency. On 4th generation AMD instances such as m7a.2xlarge or r7a.2xlarge, all vCPUs typically map to cores within a single L3 slice, which ensures consistent cache locality. For larger sizes (for example m7a.8xlarge and above), thread pinning—assigning threads to specific physical cores—can help maintain this locality. Thread pinning can reduce performance variability in workloads with shared-memory access patterns.

You can pin threads using the taskset command:

taskset -c 0-3 ./your_application

This example pins your application to CPU cores 0 through 3. To determine which cores share the same L3 cache region, use tools such as lscpu or lstopo to inspect the system’s CPU topology. Grouping related threads on cores that share an L3 cache can improve performance consistency for workloads with frequent shared-memory access.

Docker container optimization

In containerized environments running on AMD-based Amazon EC2 instances, tuning CPU-related settings can improve workload consistency and efficiency—particularly for compute-intensive or latency-sensitive applications. Although default configurations work for many general-purpose scenarios, certain workloads may benefit from more explicit control over how CPU resources are allocated. By default, container runtimes such as Docker allow the operating system to schedule containers across any available CPU cores. This flexible scheduling can lead to variability in performance when containers move across cores that don’t share cache. To reduce this variability and improve cache efficiency, containers can be pinned to specific cores using the --cpuset-cpus flag.

docker run --cpuset-cpus="1,3" my-container

This setting restricts the container to use only the specified cores. In this example, cores 1 and 3 are used for demonstration. The actual core selection should be based on CPU topology to make sure of cache-efficient scheduling. Pinning containers to cores that share L3 cache can reduce scheduling overhead and improve consistency for workloads with shared-memory access patterns.

CPU frequency governor settings

Some operating systems adjust CPU frequency dynamically to save power. This is typically controlled by a setting called the CPU frequency governor. Although this behavior is efficient for general-purpose workloads, it may introduce latency or performance variability in compute-sensitive environments. For workloads that need consistently high CPU performance—such as high-throughput data processing, simulations, or real-time applications—we recommend setting the CPU governor to performance mode. This makes sure that the CPU runs at its maximum frequency under load, avoiding time spent ramping up from lower power states.

You can apply this setting on bare metal instances or Amazon EC2 Dedicated Hosts using the following command:

sudo cpupower frequency-set -g performance

Before applying, consider benchmarking workload performance with other CPU frequency governors (such as ondemand or schedutil) to make sure that the performance setting provides measurable benefits without unnecessary energy trade-offs.

Use architecture-specific compiler flags

When compiling performance-sensitive C or C++ applications, architecture-specific flags such as -march=znverX can unlock AMD EPYC–specific optimizations, including improved vectorization and floating-point performance. Although this is beneficial for compute-heavy workloads, it may reduce portability across architectures. To balance performance and flexibility, consider implementing runtime feature detection and dispatching an approach used by many optimized libraries to adapt behavior based on the underlying CPU.

Before using these flags, verify that your compiler version supports them and make sure that the target EC2 instance architecture matches the specified flag. For example, a binary compiled with -march=znver4 may fail with an illegal instruction error (SIGILL) if run on earlier-generation instances such as M5a.The following table outlines the appropriate flags and minimum supported compiler versions for each AMD EPYC generation:

AMD EPYC Generation -march Flag Minimum GCC Version Minimum LLVM/Clang Version
4th generation (for example M7a) znver4 GCC 12 Clang 15
3rd generation (for example M6a) znver3 GCC 11 Clang 13
2nd generation (for example M5a) znver2 GCC 9 Clang 11

The following flags are supported for GCC 11+ or LLVM Clang 13+:

# 4th Gen EPYC (M7a, R7a, C7a, Hpc7a)
-march=znver4

# 3rd Gen EPYC (M6a, R6a, C6a)
-march=znver3

# 2nd Gen EPYC (M5a, R5a, C5a)
-march=znver2

When to enable AVX-512 and VNNI instructions

4th generation AMD EPYC powered Amazon EC2 instances support advanced single instruction, multiple data (SIMD) instruction sets such as AVX2, AVX-512, and VNNI. These can improve throughput for vector-heavy workloads such as ML inference, image processing, or scientific simulations. However, these flags are generation-specific—attempting to run binaries compiled with AVX-512 on unsupported instances (for example 2nd generation M5a) may result in runtime errors such as illegal instruction (SIGILL).

When compiling C or C++ code:

gcc -mavx2 -mavx512f -O2 your_program.c -o your_program

To better understand which optimizations are applied, use the following:

-ftree-vectorizer-verbose=2 -fopt-info-vec-missed

This helps identify loops that benefit from vectorization and those that don’t. Only enable these optimizations if your workload benefits and you’ve validated compatibility with the instance generation in use. Avoid applying AVX flags indiscriminately, because it may reduce portability and increase binary complexity.

AMD Optimizing CPU Libraries

The AMD Optimizing CPU Libraries (AOCL) provide performance-tuned math libraries specifically designed for AMD EPYC processors. These libraries include optimized implementations of commonly used functions in scientific computing, engineering, and ML workloads. You can link your applications against AOCL to use processor-specific optimizations without rewriting your code. AOCL includes libraries for vector and scalar math, random number generation, FFT, BLAS, and LAPACK, among others.

Setting up AOCL

  • Set the AOCL_ROOT environment variable to point to the installation directory:
    export AOCL_ROOT=/path/to/aocl

  • Compile your application with the appropriate include and library paths:
    gcc -I$AOCL_ROOT/include -L$AOCL_ROOT/lib -lamdlibm -lm your_program.c -o your_program

  • Vector and scalar math optimization: you can enable more vectorized or scalar math tuning flags for specific workloads:
    # Vector math optimization
    gcc -lamdlibm -fveclib=AMDLIBM -lm your_program.c -o your_program
    		
    # Faster scalar math
    gcc -lamdlibm -fsclrlib=AMDLIBM -lamdlibmfast -lm your_program.c -o your_program

  • AOCL runtime profiling: AOCL supports runtime profiling, which helps developers identify which mathematical operations dominate execution time. To enable profiling, run the following:
    export AOCL_PROFILE=1
    ./your_program

After running this, a report file named aocl_profile_report.txt is generated. It provides a function-level breakdown of call counts, execution time, and thread usage. Developers can use this to focus optimization efforts on high-impact operations.

Conclusion

This post explored how to select AMD-based Amazon EC2 instance types that align with specific workload characteristics, and how to apply tuning techniques focused on CPU usage, thread placement, cache efficiency, and math library optimization. These approaches are especially relevant for compute-bound or latency-sensitive workloads where consistent performance is critical.

Ready to get started? Sign in to the AWS Management Console and launch AMD EPYC powered Amazon EC2 instances to begin optimizing your workloads today.

Picking Servers CPUs for Databases in 2025 is Still Complex

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/picking-servers-cpus-for-databases-in-2025-is-still-complex-amd-oracle-microsoft/

Picking CPUs for databases is still a topic of great complexity in 2025 with CPU vendors making different chips catering to database licenses

The post Picking Servers CPUs for Databases in 2025 is Still Complex appeared first on ServeTheHome.

MiTAC G8825Z5 AMD Instinct MI325X 8-GPU Server Review

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/mitac-g8825z5-amd-instinct-mi325x-8-gpu-server-review/

In our MiTAC G8825Z5 review, we see how this 8-GPU AMD Instinct MI325X server performs and how it was made in a very neat fashion

The post MiTAC G8825Z5 AMD Instinct MI325X 8-GPU Server Review appeared first on ServeTheHome.

AMD Dives Deep on CDNA 4 Architecture and MI350 Accelerator at Hot Chips 2025

Post Syndicated from Ryan Smith original https://www.servethehome.com/amd-dives-deep-on-cdna-4-architecture-and-mi350-accelerator-at-hot-chips-2025/

The second big machine learning accelerator talk of the afternoon belongs to AMD. The company’s chip architects are at this year’s show to tell the audience all about the CDNA 4 architecture, which is powering AMD’s new MI350 family of accelerators. Like it’s MI300 predecessor, AMD is using 3D die stacking to build up a […]

The post AMD Dives Deep on CDNA 4 Architecture and MI350 Accelerator at Hot Chips 2025 appeared first on ServeTheHome.