Tag Archives: hardware

Hardware Vulnerability in Apple’s M-Series Chips

2024-03-28 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/03/hardware-vulnerability-in-apples-m-series-chips.html

It’s yet another hardware side-channel attack:

The threat resides in the chips’ data memory-dependent prefetcher, a hardware optimization that predicts the memory addresses of data that running code is likely to access in the near future. By loading the contents into the CPU cache before it’s actually needed, the DMP, as the feature is abbreviated, reduces latency between the main memory and the CPU, a common bottleneck in modern computing. DMPs are a relatively new phenomenon found only in M-series chips and Intel’s 13th-generation Raptor Lake microarchitecture, although older forms of prefetchers have been common for years.

[…]

The breakthrough of the new research is that it exposes a previously overlooked behavior of DMPs in Apple silicon: Sometimes they confuse memory content, such as key material, with the pointer value that is used to load other data. As a result, the DMP often reads the data and attempts to treat it as an address to perform memory access. This “dereferencing” of “pointers”—meaning the reading of data and leaking it through a side channel—is a flagrant violation of the constant-time paradigm.

[…]

The attack, which the researchers have named GoFetch, uses an application that doesn’t require root access, only the same user privileges needed by most third-party applications installed on a macOS system. M-series chips are divided into what are known as clusters. The M1, for example, has two clusters: one containing four efficiency cores and the other four performance cores. As long as the GoFetch app and the targeted cryptography app are running on the same performance cluster—even when on separate cores within that cluster—GoFetch can mine enough secrets to leak a secret key.

The attack works against both classical encryption algorithms and a newer generation of encryption that has been hardened to withstand anticipated attacks from quantum computers. The GoFetch app requires less than an hour to extract a 2048-bit RSA key and a little over two hours to extract a 2048-bit Diffie-Hellman key. The attack takes 54 minutes to extract the material required to assemble a Kyber-512 key and about 10 hours for a Dilithium-2 key, not counting offline time needed to process the raw data.

The GoFetch app connects to the targeted app and feeds it inputs that it signs or decrypts. As its doing this, it extracts the app secret key that it uses to perform these cryptographic operations. This mechanism means the targeted app need not perform any cryptographic operations on its own during the collection period.

Note that exploiting the vulnerability requires running a malicious app on the target computer. So it could be worse. On the other hand, like many of these hardware side-channel attacks, it’s not possible to patch.

Slashdot thread.

Autonomous hardware diagnostics and recovery at scale

2024-03-25 Jet Mariscal

Post Syndicated from Jet Mariscal original https://blog.cloudflare.com/autonomous-hardware-diagnostics-and-recovery-at-scale

Cloudflare’s global network spans more than 310 cities in over 120 countries. That means thousands of servers geographically spread across different data centers, running services that protect and accelerate our customer’s Internet applications. Operating hardware at such a scale means that hardware can break anywhere and at any time. In such cases, our systems are engineered such that these failures cause little to no impact. However, detecting and managing server failure at scale requires automation. This blog aims to provide insights into the difficulties involved in handling broken servers and how we were able to simplify the process through automation.

Challenges dealing with broken servers

When a server is found to have faulty hardware and needs to be removed from production, it is considered broken and its state is set to Repair in the internal database where server status is tracked. In the past, our Data Center Operations team were essentially left to troubleshoot and diagnose broken servers on their own. They had to go through laborious tasks like performing queries to locate and repair servers, conducting diagnostics, reviewing results, evaluating if a server can be restored to production, and creating the necessary tickets for re-enabling servers and executing operations to put them back in production. Such effort can take hours for a single server alone, and can easily consume an engineer’s entire day.

As you can see, addressing server repairs was a labor-intensive process performed manually, Additionally, a lot of these servers remained powered on within the racks, wasting energy. With our fleet expanding rapidly, the attention of Data Center Operations is primarily devoted to supporting this growth, leaving less time to handle servers in need of repair.

It was clear that our infrastructure was growing too fast for us to be able to handle repairs and recovery, so we had to find a better way to handle these sorts of inefficiencies in our operations. This would allow our engineers to focus on the growth of our footprint while not abandoning repair and recovery – after all, these are still huge CapEx investments and wasted capacity that otherwise would have been fully utilized.

Using automation as an autonomous system

As members of the Infrastructure Software Systems and Automation team at Cloudflare, we primarily work on building tools and automation that help reduce excess work in order to ease the pressure on our operations teams, increase productivity, and enable people to execute operations with the highest efficiency.

Our team continuously strives to challenge our existing processes and systems, finding ways we can evolve them and make significant improvements – one of which is to build not just a typical automated system but an autonomous one. Building autonomous automations means creating systems that can operate independently, without the need for constant human intervention or oversight – a perfect example of this is Phoenix.

Introducing Phoenix

Phoenix is an autonomous diagnostics and recovery automation that runs at regular intervals to discover Cloudflare data centers with servers that are broken, performing diagnostics on detection, recovering those that pass diagnostics by re-provisioning, and ultimately re-enabling those that have successfully been re-provisioned in the safest and most unobtrusive way possible – all without requiring any human intervention! Should a server fail at any point in the process, Phoenix will take care of updating relevant tickets, even pinpointing the cause of the failure, and reverting the state of the server accordingly when needed – again, all without any human intervention!

The image below illustrates the whole process:

To better understand exactly how Phoenix works, let’s dive into some details about its core functionality.

Discovery

Discovery runs at a regular interval of 30 minutes, selecting a maximum of two Cloudflare data centers that have broken or repair state servers in its fleet, which are all configurable depending on business and operational needs, against which it can immediately execute diagnostics. At this rate, Phoenix is able to discover and operate on all broken servers in the fleet in about 3 days. On each run, it also detects data centers that may have broken servers already queued for recovery, and takes care of ensuring that the Recovery phase is executed immediately.

Diagnostics

Diagnostics takes care of running various tests across the broken servers of a selected data center in a single run, verifying viability of the hardware components, and identifying the candidates for recovery.

A diagnostic operation includes running the following:

Out-of-Band connectivity check
This check determines the reachability of a device via out-of-band network. We employ IPMI (Intelligent Platform Management Interface) to ensure proper physical connectivity and accessibility of devices. This allows for effective monitoring and management of hardware components, enhancing overall system reliability and performance. Only devices that pass this check can progress to the Node Acceptance Testing phase.
Node Acceptance Tests
We leverage an existing internally-built tool called INAT (Integrated Node Acceptance Testing) that runs various tests suites/cases (Hardware Validation, Performance, etc.).
For every server that needs to be diagnosed, Phoenix will send relevant system instructions to have it boot into a custom Linux boot image, internally called INAT-image. Built into this image are the various tests that need to run when the server boots up, publishing the results to an internal resource in both human-readable (HTML) and machine-readable (JSON) formats, with the latter consumed and interpreted by Phoenix. Upon completion of the boot diagnostics, the server is powered off again to ensure it is not wasting energy.

Our node acceptance tests encompass a range of evaluations, including but not limited to benchmark testing, CPU/Memory/Storage checks, drive wiping, and various other assessments. Look out for an upcoming in-depth blog post covering INAT.

A summarized diagnostics result is immediately added to the tracking ticket, including pinpointing the exact cause of a failure.

Recovery

Recovery executes what we call an expansion operation, which in its first phase will provision the servers that pass diagnostics. The second phase is to re-enable the successfully provisioned servers back to production, where only those that have been re-enabled successfully will start receiving production traffic again.

Once the diagnostics are passed and the broken servers move on towards the first phase of recovery, we change their statuses from Repair to Pending Provision. If the servers don’t fully recover, for example, because there are server configuration errors or issues enabling services, Phoenix assesses the situation. In such cases, it returns those servers to the Repair state for additional evaluation. Additionally, if the diagnostics indicate that the servers need any faulty components replaced, then Phoenix notifies our Data Center operation team for manual repairs as required, ensuring that the server is not repeatedly selected until the required part replacement is completed. This ensures any necessary human intervention can be applied promptly, making the server ready for Phoenix to rediscover in its next iteration.

An autonomous recovery operation requires infusing intelligence into the automated system so that we can fully trust that it’s able to execute an expansion operation in the safest way possible and handle situations on its own without any human interventions. To do this, we’ve made sure Phoenix is automation-aware – this means that it knows when there are other automations executing certain operations such as expansions, and will only execute an expansion when there are no ongoing provisioning operations in the target data center. This ability to execute only when it’s safe to do so is to ensure that the recovery operation will not interfere with any other ongoing operations in the data center. We’ve also adjusted its tolerance with faulty hardware – this means it’s able to gracefully deal with misbehaving servers by letting these quickly drop out of the recovery candidate list upon misbehavior that prevents blocking the operation.

Visibility

While our autonomous system, Phoenix, seamlessly handles operations without human intervention, it doesn’t mean we sacrifice visibility. Transparency is a key feature of Phoenix. It meticulously logs every operation, from executing tasks to providing progress updates, and shares this information in communication channels like chat rooms and Jira tickets. This ensures a clear understanding of what Phoenix is doing at all times.

Tracking of actions taken by automation as well as the state transitions of a server keeps us in the loop and gives us a better understanding of what these actions were and when they were executed, essentially giving us valuable insights that will help us improve not only the system but our processes as well. Having this operational data allows us to generate dashboards that let various teams monitor automation activities and measure their success. We are able to generate dashboards to guide business decisions and even answer common operational questions related to repair and recovery.

Balancing automation and empathy: Error Budgets

When we launched Phoenix, we were well aware that not every broken server can be re-enabled and successfully returned to production, and more importantly, there’s no 100% guarantee that a recovered server will be as stable as the ones with no repair history – there’s a risk that these servers could fail and end up back in Repair status again.

Although there’s no guarantee that these recovered servers won’t fail again, causing additional work for SRE’s due to the monitoring alerts that get triggered, what we can guarantee is that Phoenix immediately stops recoveries without any human intervention if a certain number of failures for a server are reached in a given time window – this is where we applied the concept of an Error Budget.

The Error Budget is the amount of error that automation can accumulate over a certain period of time before our SRE’s start being unhappy due to the excessive server failures or unreliability of the system. It is empathy embedded in automation.

In the figure above, the y-axis represents the error budget. In this context, the error budget applies to the number of recovered servers that failed and were moved back to Repair state again. The x-axis represents the time unit allocated to the error budget – in this case, 24 hours. To ensure that Phoenix is strict enough in mitigating possible issues, we divide the time unit into three consecutive buckets of the same duration – representing the three “follow the sun” SRE shifts in a day. With this, Phoenix can only execute recoveries if the number of server failures is no more than 2. Additionally, Phoenix will also have to compensate succeeding time buckets by deducting the error budget of any excess failures in a given time bucket.

Phoenix will immediately stop recoveries if it exhausts its error budget prematurely. In this context, prematurely means before the end of the time unit for which the error budget was granted. Regardless of the error budget depletion rate within a time unit, the error budget is fully replenished at the beginning of each time unit, meaning the budget resets every day.

The Error Budget has helped us define and manage our tolerance for hardware failures without causing significant harm to the system or too much noise for SREs, and gave us opportunities to improve our diagnostics system. It provides a common incentive that allows both the Infrastructure Engineering and SRE teams to focus on finding the right balance between innovation and reliability.

Where we go from here

With Phoenix, we’ve not only witnessed the significant and far-reaching potential of having an autonomous automated system in our infrastructure, we’re actually reaping its benefits as well. It provides a win-win situation by successfully recovering hardware and ensuring that broken devices are powered off, thus preventing them from consuming unnecessary power while being idle in our racks. This not only reduces energy wastage but also contributes to sustainability efforts and cost savings. Automated processes that operate independently have not only freed our colleagues on various Infrastructure teams from doing mundane and repetitive tasks, allowing them to focus more on areas where they can use their skill sets for more interesting and productive work, but have also led us to evolving our old processes for handling hardware failures and repairs, making us much more efficient than ever.

Autonomous automation is a reality that is now beginning to shape the future of how we are building better and smarter systems here at Cloudflare, and we will continue to invest engineering time for these initiatives.

A huge thank you to Elvin Tan for his awesome work on INAT, and to Graeme, Darrel and David for INAT’s continuous improvements.

Redefining fleet management at Cloudflare

2024-03-19 Ryan De Lap

Post Syndicated from Ryan De Lap original https://blog.cloudflare.com/redefining-fleet-management-at-cloudflare

As Cloudflare continues to grow, we are constantly provisioning new servers, data centers, and hardware all over the globe. With this increase in scale it became necessary to re-evaluate our approach to node and datacenter tooling. In this blog post, we explore an in-house infrastructure system we’ve built, called Zinc, to step up to the task. This system, built in Rust, has become an essential part of system engineering, platform management, and provisioning at Cloudflare, while providing user-friendly engineering tools and automations for Cloudflare employees to leverage.

The nature of Zinc is a rather simple system, providing first class data models for logical and physical infrastructure assets here at Cloudflare. Items such as servers (nodes), network devices, and data centers are all members of Zinc, modeled in a strongly-typed system. With these models, Zinc enables powerful APIs, integrations, and interfaces for efficient fleet management on top of this data. Tasks such as assigning workloads to nodes, scheduling any type of data center maintenance, querying data about our fleet, or even managing the repair cycle of faulty nodes are greatly simplified through Zinc and its integrations with other Cloudflare systems.

By providing Cloudflare engineers with a native web interface and command line tooling for interacting with Zinc’s data, a central pane of a glass has been created, where the ability to expand, build, and monitor our fleet has never been easier.

Humble Beginnings

Several years ago, workload management and server provisioning was a tedious process. For our control plane data centers, we would define the workload for every node in massive source-controlled YAML files, sometimes as long as 80,000 lines. Each entry was a node, its name, its rack, and roles to be read by our configuration management software for assignment.

compute5545:
    rack: 219
    clickhouse:
      cluster: dns
    comment: |-
      Updated by <user>

As time went on, this became extremely cumbersome for engineers to manage and assign workloads for servers. Engineers would often have to update multiple files, updating every entry to assign and change workload data by hand. While this may seem like a slight inconvenience at first, when provisioning new hardware or changing workload configuration data, engineers would have to update hundreds of lines of YAML. Additionally, this data was not readily accessible to other systems and automation to read and modify. It became clear that this pattern could not scale, and a stronger framework would need to be created to manage this information.

First, we aimed to tackle this problem by making nodes and their workloads — which we call roles — first class data structures. Workload and node information were collected and stored in this new system called Zinc, and our configuration management system Salt began to read this information not from the YAML files, but a new RESTful API. We also added several features to Zinc to administer and manage node data:

Workload Management – Zinc assumed the role of the source of truth for node workloads, also taking charge of metadata management for roles. Attributes like a node’s associated cluster or its designated kernel version are now managed through Zinc, eliminating the need for lengthy configuration files scattered across our repositories.
Least-Privilege User Accounts – Leveraging Cloudflare Access, every Cloudflare employee who uses Zinc has an individual account, with scoped permissions for their job role. This prevents potentially compromised or prying users from viewing sensitive asset information, and makes modifications to production systems impossible without approval.
Change Request and Approval System – Zinc implements a change request system, similar to pull requests, so nodes and their associated workloads require approval from the team that manages the workload. For example, if a Cloudflare engineer wanted to provision and assign new Kubernetes nodes, this action would require approval by the Kubernetes team before being applied.
Node Reservations – It can become necessary for Cloudflare engineers to reserve specific hardware for testing and future workload capacity. Zinc provides this functionality as a first-class operation, providing a clear view into what a node is being used for, even when not in production. A common pattern to see is spare hardware for roles like Postgres or Clickhouse reserved and ready to take over if other nodes need to be taken out of production.
Node Metadata – Zinc collects a variety of node asset data through other subsystems at Cloudflare, unifying it all under a single pane of glass. Hardware information such as CPU, memory, generation, chassis, power, and networking configuration are all members of Zinc’s APIs and interfaces.

These were the initial features Zinc offered to Cloudflare’s SRE teams, but over time needs grew to expand the scope and start handling a variety of asset and operational data. Zinc has since started representing and managing more infrastructure related to network devices and datacenter management.

System Blueprint

At Cloudflare, two critical systems, Zinc and Netbox, play complementary roles in managing infrastructure. Zinc specializes in handling the logical infrastructure and operational configuration, while Netbox focuses on physical infrastructure. For those unfamiliar with Netbox, it plays the important role of acting as our Datacenter Inventory Management System (DCIM). Details such as hardware specifications, serial numbers, cable diagrams, and rack layouts are all stored in Netbox. These elements are the building blocks of the infrastructure that Zinc imports and relies on to create higher level abstractions, useful for a variety of systems on Cloudflare to depend on, without having to know the nitty-gritty specifics of our datacenter information.

Supercharged Automations

Growing pains were inevitable given the sheer pace of Cloudflare’s growth. Processes around server provisioning, maintenance windows, repairs, and diagnostics reporting were reaching their limits. Luckily, the data available through Zinc made it a natural home for new and improved workflow automations aimed at removing toil across various touch points throughout a server’s lifecycle.

Repairs
Hardware failures are common at our scale. Issues such as disk failure, motherboard problems, or CPU voltage errors are just a few of the many common failures seen in production. While Cloudflare’s infrastructure is very fault tolerant, we want to quickly return hardware to production after a failure to increase capacity and optimize infrastructure costs. Prior to Zinc, engineers would have to manually collect and file tickets with information related to the hardware failure, tediously filling ticket details manually in order for data center technicians to service it. With Zinc, however, the process of collecting this data and generating repair tickets is entirely automated. As we continue developing Zinc, we will be able to manage this process all the way down to the individual hardware component level and enhance existing automation and diagnostic integrations, further optimizing the repair process. With just a few clicks (or driven by other automation), an accurate service ticket can be filed, enabling data center technicians to make repairs and get servers back into production as fast as possible.

Diagnostic Reports

Zinc integrates directly with a diagnostic service we use to identify hardware issues on our fleet, known as INAT (Integrated Node Acceptance Tests). Zinc leverages this system to run acceptance tests before, during, and after server provisioning. It also can be executed in an ad hoc manner to determine the health of a machine. With INAT, we are able to save engineers time and quickly get results back to aid in the debugging process. Once these diagnostics are complete, the Zinc interface provides a report that can be used to determine the health of a server and if any actions need to be taken.

Maintenance Windows

If you’ve ever wondered how the maintenance page on Cloudflare Status is populated, Zinc is the place of origin. As we are constantly doing hardware and network upgrades, it’s important for Cloudflare to have a centralized view of what maintenance is happening, where it is happening, and the scale of all the systems and services that are or can be impacted by the maintenance. During maintenance, there are a variety of automated systems that ensure that Cloudflare sees no loss in quality of service, no matter where in the world the maintenance is happening. Zinc orchestrates and tracks these maintenance windows, and sends alerts to teams and Cloudflare customers when a disruptive, or even potentially disruptive, maintenance is scheduled in a region.

Reboots
Zinc provides an integration with a core system responsible for scheduling node reboots. When a node needs to be rebooted, such as to apply a new firmware upgrade or new Kernel version, there are systems at Cloudflare that schedule and safely manage this functionality. For example, it would be unsafe to reboot a production Clickhouse node with no prior warning, so these systems ensure traffic is properly routed away from this node prior to its reboot. Zinc provides an integration in its Web UI and CLI with this reboot management system to make the process of queuing and executing reboots much easier, as well as providing a place where we can add orchestration logic that leverages Zinc’s operational management capabilities for server reboots.

Engineer Productivity

One of the most valuable parts of Zinc is that it provides engineers the ability to quickly perform complex queries and apply changes to our assets in production. At the API layer, we ensure that any access or changes to our infrastructure are properly scoped and authenticated. From there, Zinc provides two interfaces for employees managing our fleet: a command-line interface built in Rust, and a web application built in React, both of which are built on the same Zinc API that can be directly called from scripts or other systems as integrations are built out to automate more of the management of our infrastructure.

Here are some common examples of the CLI tooling our engineers use:

# List all datacenters at Cloudflare 
$ zinc site get --all
# Set a node's status to disabled, removing it from production. 
$ zinc node update status compute5545 disabled
# Querying all nodes in a specific rack, that are Kubernetes nodes. 
$ zinc node get -f "rack:A413" -f “role:kubernetes” 
# Putting a node failure into a repair state with debug information on how to fix. 
$ zinc repair node create –name 36com360 --repair-type motherboard –remove-from-prod –comments “Diagnostic determined bad motherboard”

While these are simple queries, Zinc also provides its own query syntax to get more detailed information using its own query structure. Here we see an example of looking for Kubernetes workers that are a part of our pdx cluster, while ignoring storage and rook nodes.

zinc node get -f 'role:kubernetes.cluster=pdx&kubernetes.worker=true&!kubernetes.storage&!kubernetes.rook'
Node Name    Node Type  Node Status  Colo ID  Colo Name  Rack  Rack Unit  Roles
compute2712  compute     V           348      pdx05      B103  39         kubernetes
compute1995  compute     V           349      pdx06      A104  7          kubernetes
compute1192  compute     V           36       pdx01      A203  10         kubernetes
…
Total records: 1337

Web UI
Despite being an internal tool, we felt it necessary to ensure that the UX of Zinc was intuitive and crisp. As it stands, hundreds of engineers at Cloudflare rely on Zinc’s web interface, so we found it essential to provide a fast, easy to use design. Built in React as a single page application, we aim to optimize for ease of use wherever possible. Querying items such as assets in repair, nodes in a specific city or country, or even CPU model are all first-class searchable items in our UI.

As mentioned previously, Zinc also provides a user-friendly Change Request interface, similar to Git Pull Requests, which shows what asset data is changing and who is making the change, and ensures the change is approved by designated staff prior to being applied in production.

Looking Ahead

Zinc represents a significant advancement in infrastructure management at Cloudflare. With our fleet growing faster than ever, especially with our new expansions to deliver GPUs on the Edge and Cloudflare’s R2, Zinc has stepped up to the plate, tackling the challenges of growth and providing invaluable support to our engineering teams. We hope this has been an insightful view of how Cloudflare is building to grow and scale well into the future.

There are still many wins to be had when it comes to infrastructure tooling here at Cloudflare. In the long term, Zinc will continue to be the backbone of infrastructure and asset data, with deeper automations and integrations to save our engineers time and toil and reduce manual errors as we continue to expand.

If managing and operating a fleet of servers as large as Cloudflare sounds like an exciting challenge to you, we’re hiring!

Introducing the Project Argus Datacenter-ready Secure Control Module design specification

2023-10-16 Xiaomin Shen

Post Syndicated from Xiaomin Shen original http://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/

Introducing the Project Argus Datacenter-ready Secure Control Module design specification

Historically, data center servers have used motherboards that included all key components on a single circuit board. The DC-SCM (Datacenter-ready Secure Control Module) decouples server management and security functions from a traditional server motherboard, enabling development of server management and security solutions independent of server architecture. It also provides opportunities for reducing server printed circuit board (PCB) material cost, and allows unified firmware images to be developed.

Today, Cloudflare is announcing that it has partnered with Lenovo to design a DC-SCM for our next-generation servers. The design specification has been published to the OCP (Open Compute Project) contribution database under the name Project Argus.

A brief introduction to baseboard management controllers

A baseboard management controller (BMC) is a specialized processor that can be found in virtually every server product. It allows remote access to the server through a network connection, and provides a rich set of server management features. Some of the commonly used BMC features include server power management, device discovery, sensor monitoring, remote firmware update, system event logging, and error reporting.

In a typical server design, the BMC resides on the server motherboard, along with other key components such as the processor, memory, CPLD and so on. This was the norm for generations of server products, but that has changed in recent years as motherboards are increasingly optimized for high-speed signal bandwidth, and servers need to support specialized security requirements. This has made it necessary to decouple the BMC and its related components from the server motherboard, and move them to a smaller common form factor module known as the Datacenter Secure Control Module (DC-SCM).

Figure 1 is a picture of a motherboard used on Cloudflare’s previous generation of edge servers. The BMC and its related circuit components are placed on the same printed circuit board as the host CPU.

For Cloudflare’s next generation of edge servers, we are partnering with Lenovo to create a DC-SCM based design. On the left-hand side of Figure 2 is the printed circuit board assembly (PCBA) for the Host Processor Module (HPM). It hosts the CPU, the memory slots, and other components required for the operation and features of the server design. But the BMC and its related circuits have been relocated to a separate PCBA, which is the DC-SCM.

Benefits of DC-SCM based server design

PCB cost reduction

As of today, DDR5 memory runs at 6400MT/s (mega transfers per second). In the future DDR5 speed may even increase to 7200MT/s or 8800MT/s. Meanwhile, PCIe Gen5 is running at 32 GT/s (giga transfers per second), doubling the speed rate of PCIe Gen4. Both DDR5 and PCIE Gen5 are key interfaces for the processors used on our next-generation servers.

The increasing rates of high-speed IO signals and memory buses are pushing the next generation of server motherboard designs to transition from low-loss to ultra-low loss dielectric printed circuit board (PCB) materials, and higher layer counts in the PCB. At the same time, the speed of BMC and its related circuitry are not progressing so quickly. For example, the physical layer interface of ASPEED AST2600 BMC is only at PCIe Gen2 (5 GT/s).

Ultra-low loss dielectric PCB material and higher PCB layer count are both driving factors for higher PCB cost. Another driving factor of PCB cost is the size of the PCB. In a traditional server motherboard design, the size of the server motherboard is larger, since the BMC and its related circuits are placed on the same PCB as the host CPU.

By decoupling the BMC and its related circuitry from the host processor module (HPM), we can reduce the size of the relatively more expensive PCB for the HPM. BMC and its related circuitry can be placed on relatively cheaper PCB, with reduced layer count and lossier PCB dielectric materials. For example, in the design of Cloudflare’s next generation of servers, the server motherboard PCB needs to be 14 or more layers, whereas the BMC and its related components can be easily routed with 8 or 10 layers of PCB. In addition, the dielectric material used on DC-SCM PCB is low-loss dielectric — another cost saver compared to ultra-low loss dielectric materials used on HPM PCB.

Modularized design enables flexibility

DC-SCM modularizes server management and security components into a common add-in card form factor, enabling developers to remove customer specific solutions from the more complex components, such as motherboards, to the DC-SCM. This provides flexibility for developers to offer multiple customer-specific solutions, without the need to redesign multiple motherboards for each solution.

Developers are able to reuse the DC-SCM from a previous generation of server design, if the management and security requirements remain the same. This reduces the overall cost of upgrading to a new generation of servers, and has the potential to reduce e-waste when a server is decommissioned.

Likewise, management and security solution upgrades within a server generation can be carried out separately by modifying or replacing the DC-SCM. The more complex components on the HPM do not need to be redesigned. From a data center perspective, it speeds up the upgrade of management and security hardware across multiple server platforms.

Unified interoperable OpenBMC firmware development

Data center secure control interface (DC-SCI) is a standardized hardware interface between DC-SCM and the Host Processor Module (HPM). It provides a basis for electrical interoperability between different DC-SCM and host processor module (HPM) designs.

This interoperability makes it possible to have a unified firmware image across multiple DC-SCM designs, concentrating development resources on a single firmware rather than an array of them. The publicly-accessible OpenBMC repository provides a perfect platform for firmware developers of different companies to collaborate and develop such unified OpenBMC images. Instead of maintaining a separate BMC firmware image for each platform, we now use a single image that can be applied across multiple server platforms. The device tree specific to each respective server is automatically loaded based on device product information.

Using a unified OpenBMC image significantly simplifies the process of releasing BMC firmware to multiple server platforms. Firmware updates and changes are propagated to all supported platforms in a single firmware release.

Project Argus

The DC-SCM specifications have been driven by the Open Compute Project (OCP) Foundation hardware management workstream, as a way to standardize server management, security, and control features.

Cloudflare has partnered with Lenovo on what we call Project Augus, Cloudflare’s first DC-SCM implementation that fully adheres to the DC-SCM 2.0 specification. In the DC-SCM 2.0 specifications, a few design items are left open for implementers to decide on the most suitable architectural choices. With the goal of improving interoperability of Cloudflare DC-SCM designs across server vendors and server designs, Project Argus includes documentation on implementation details and design decisions on form factor, mechanical locking mechanism, faceplate design, DC-SCI pin out, BMC chip, BMC pinout, Hardware Root of Trust (HWRoT), HWRoT pinout, and minimum bootable device tree.

At the heart of the Project Argus DC-SCM is the ASPEED AST2600 BMC System on Chip (SoC), which when loaded with a compatible OpenBMC firmware, provides a rich set of common features necessary for remote server management. ASPEED AST1060 is used on Project Argus DC-SCM as the HWRoT solution, providing secure firmware authentication, firmware recovery, and firmware update capability. Project Argus DC-SCM 2.0 uses Lattice MachXO3D CPLD with secure boot and dual boot ability as the DC-SCM CPLD to support a variety of IO interfaces including LTPI, SGPIO, UART and GPIOs.

The mechanical form factor of Project Argus DC-SCM 2.0 is the horizontal External Form Factor (EFF).

Cloudflare and Lenovo have contributed Project Argus Design Specification and reference design files to the OCP contribution database. Below is a detailed list of our contribution:

SPI, I2C/I3C, UART, LTPI/SGPIO block diagrams
DC-SCM PCB stackup
DC-SCM Board placements (TOP and BOTTOM layers)
DC-SCM schematic PDF file
DC-SCI pin definition PDF file
Power sequence PDF file
DC-SCM bill of materials Excel spreadsheet
Minimum bootable device tree requirements
Mechanical Drawings PDF files, including card assembly drawing and interlock rail drawing

The security foundation for our Gen 12 hardware

Cloudflare has been innovating around server design for many years, delivering increased performance per watt and reduced carbon footprints. We are excited to integrate Project Argus DC-SCM 2.0 into our next-generation, Cloudflare Gen 12 servers. Stay tuned for more exciting updates on Cloudflare Gen 12 hardware design!

How Cloudflare is staying ahead of the AMD Zen vulnerability known as “Zenbleed”

2023-07-25 Derek Chamorro

Post Syndicated from Derek Chamorro original http://blog.cloudflare.com/zenbleed-vulnerability/

How Cloudflare is staying ahead of the AMD Zen vulnerability known as “Zenbleed”

Google Project Zero revealed a new flaw in AMD's Zen 2 processors in a blog post today. The 'Zenbleed' flaw affects the entire Zen 2 product stack, from AMD's EPYC data center processors to the Ryzen 3000 CPUs, and can be exploited to steal sensitive data stored in the CPU, including encryption keys and login credentials. The attack can even be carried out remotely through JavaScript on a website, meaning that the attacker need not have physical access to the computer or server.

Cloudflare’s network includes servers using AMD’s Zen line of CPUs. We have patched our entire fleet of potentially impacted servers with AMD’s microcode to mitigate this potential vulnerability. While our network is now protected from this vulnerability, we will continue to monitor for any signs of attempted exploitation of the vulnerability and will report on any attempts we discover in the wild. To better understand the Zenbleed vulnerability, read on.

Background

Understanding how a CPU executes programs is crucial to comprehending the attack's workings. The CPU works with an arithmetic processing unit called the ALU. The ALU is used to perform mathematical tasks. Operations like addition, multiplication, and floating-point calculations fall under this category. The CPU's clock signal controls the application-specific digital circuitry that the ALU uses to carry out these functions.

For data to reach the ALU, it has to pass through a series of storage systems. These include secondary memory, primary memory, cache memory, and CPU registers. Since the registers of the CPU are the target of this attack, we will go into a little more depth. Depending on the design of the computer, the CPU registers can store either 32 or 64 bits of information. The ALU can access the data in these registers and complete the operation.

As the demands on CPUs have increased, there has been a need for faster ways to perform calculations. Advanced Vector Extensions (or AVX) were developed to speed up the processing of large data sets by applications. AVX are extensions to the x86 instruction set architecture, which are relevant to x86-based CPUs from Intel and AMD. With the help of compatible software and the extra instruction set, compatible processors could handle more complex tasks. The primary motivation for developing this instruction set was to speed up operations associated with data compression, image processing, and cryptographic computations.

The vector data used by AVX instructions is stored in 16 YMM registers, each of which is 256 bits in size. The Y-register in the XMM register set is where the 128-bit values are stored, hence the name. Instructions from the arithmetic, logic, and trigonometry families of the AVX standard all make use of the YMM registers. They can also be used to keep masks, data that is used to filter out certain vector components.

Vectorized operations can be executed with great efficiency using the YMM registers. Applications that process large amounts of data stand to gain significantly from them, but they are increasingly the focus of malicious activity.

The attack

Speculative execution attacks have previously been used to compromise CPU registers. These are an attack variant that takes advantage of the speculative execution capabilities of modern CPUs. Computer processors use a method called speculative execution to speed up processing times. A CPU will execute an instruction speculatively if it has no way of knowing whether or not it will be executed. If it turns out that the CPU was unable to carry out the instruction, it will simply discard the data.

Because of their potential use for storing private information, AVX registers are especially susceptible to these kinds of attacks. Cryptographic keys and passwords, for instance, could be accessed by an attacker via a speculative execution attack on the AVX registers.

As mentioned above, Project Zero discovered a vulnerability in AMD's Zen 2-architecture-based CPUs, wherein data from another process and/or thread could be stored in the YMM registers, a 256-bit series of extended registers, potentially allowing an attacker access to sensitive information. This vulnerability is caused by a register not being written to 0 correctly under specific microarchitectural circumstances. Although this error is associated with speculative execution, it is not a side channel vulnerability.

This attack works by manipulating register files to force a mispredicted command. First, there is a trigger to XMM Register Merge Optimization2, which ironically is a hardware mitigation that can be used to protect against speculative execution attacks, followed by a register remapping (a technique used in computer processor design to resolve name conflicts between physical registers and logical registers) and then a mispredicted instruction call to vzeroupper, an instruction that is used to zero the upper half of the YMM and ZMM registers.

Since the register file is shared by all the processes running on the same physical core, this exploit can be used to eavesdrop on even the most fundamental system operations by monitoring the data being transferred between the CPU and the rest of the computer.

Fixing the bleed

Because of the exact timing for this to successfully execute, this vulnerability, CVE-2023-20593, is classified with a CVSS score of 6.5 (Medium). AMD's mitigation is implemented via the MSR register, which turns off a floating point optimization that otherwise would have allowed a move operation.

The following microcode updates have applied to our entire server fleet that contain potentially affected AMD Zen processors. We have seen no evidence of the bug being exploited and were able to patch our entire network within hours of the vulnerability’s disclosure. We will continue to monitor traffic across our network for any attempts to exploit the bug and report on our findings.

DDR4 memory organization and how it affects memory bandwidth

2023-04-19 Xiaomin Shen

Post Syndicated from Xiaomin Shen original https://blog.cloudflare.com/ddr4-memory-organization-and-how-it-affects-memory-bandwidth/

DDR4 memory organization and how it affects memory bandwidth

When shopping for DDR4 memory modules, we typically look at the memory density and memory speed. For example a 32GB DDR4-2666 memory module has 32GB of memory density, and the data rate transfer speed is 2666 mega transfers per second (MT/s).

If we take a closer look at the selection of DDR4 memories, we will then notice that there are several other parameters to choose from. One of them is rank x organization, for example 1Rx8, 2Rx4, 2Rx8 and so on. What are these and does memory module rank and organization have an effect on DDR4 module performance?

In this blog, we will study the concepts of memory rank and organization, and how memory rank and organization affect the memory bandwidth performance by reviewing some benchmarking test results.

Memory rank

Memory rank is a term that is used to describe how many sets of DRAM chips, or devices, exist on a memory module. A set of DDR4 DRAM chips is always 64-bit wide, or 72-bit wide if ECC is supported. Within a memory rank, all chips share the address, command and control signals.

The concept of memory rank is very similar to memory bank. Memory rank is a term used to describe memory modules, which are small printed circuit boards with memory chips and other electronic components on them; and memory bank is a term used to describe memory integrated circuit chips, which are the building blocks of the memory modules.

A single-rank (1R) memory module contains one set of DRAM chips. Each set of DRAM chips is 64-bits wide, or 72-bits wide if Error Correction Code (ECC) is supported.

A dual-rank (2R) memory module is similar to having two single-rank memory modules. It contains two sets of DRAM chips, therefore doubling the capacity of a single-rank module. The two ranks are selected one at a time through a chip select signal, therefore only one rank is accessible at a time.

Likewise, a quad-rank (4R) memory module contains four sets of DRAM chips. It is similar to having two dual-rank memories on one module, and it provides the greatest capacity. There are two chip select signals needed to access one of the four ranks. Again, only one rank is accessible at a time.

Figure 1 is a simplified view of the DQ signal flow on a dual-rank memory module. There are two identical sets of memory chips: set 1 and set 2. The 64-bit data I/O signals of each memory set are connected to a data I/O module. A single bit chip select (CS_n) signal controls which set of memory chips is accessed and the data I/O signals of the selected set will be connected to the DQ pins of the memory module.

Dual-rank and quad-rank memory modules double or quadruple the memory capacity on a module, within the existing memory technology. Even though only one rank can be accessed at a time, the other ranks are not sitting idle. Multi-rank memory modules use a process called rank interleaving, where the ranks that are not accessed go through their refresh cycles in parallel. This pipelined process reduces memory response time, as soon as the previous rank completes data transmission, the next rank can start its transmission.

On the other hand, there is some I/O latency penalty with multi-rank memory modules, since memory controllers need additional clock cycles to move from one rank to another. The overall latency performance difference between single rank and multi-rank memories depend heavily on the type of application.

In addition, because there are less memory chips on each module, single-rank modules produce less heat and are less likely to fail.

Memory depth and width

The capacity of each memory chip, or device, is defined as memory depth x memory width. Memory width refers to the width of the data bus, i.e. the number of DQ lines of each memory chip.

The width of memory chips are standard, they are either x4, x8 or x16. From here, we can calculate how many memory chips are in a 64-bit wide single rank memory. For example, with x4 memory chips, we will need 16 pieces (64 ÷ 4 = 16); and with x8 memory chips, we will only need 8 of them.

Let’s look at the following two high-level block diagrams of 1Gbx8 and 2Gbx4 memory chips. The total memory capacity for both of them is 8Gb. Figure 2 describes the 1Gb x8 configuration, and Figure 3 describes the 2Gbx4 configuration. With DDR4, both x4 and x8 devices have 4 groups of 4 banks. x16 devices have 2 groups of 4 banks.

We can think of each memory chip as a library. Within that library, there are four bank groups, which are the four floors of the library. On each floor, there are four shelves, each shelf is similar to one of the banks. And we can locate each one of the memory cells by its row and column addresses, just like the library book call numbers. Within each bank, the row address MUX activates a line in the memory array through the Row address latch and decoder, based on the given row address. This line is also called the word line. When a word line is activated, the data on the word line is loaded on to the sense amplifiers. Subsequently, the column decoder accesses the data on the sense amplifier based on the given column address.

The capacity, or density of a memory chip is calculated as:

Memory Depth = Number of Rows * Number of Columns * Number of Banks

Total Memory Capacity = Memory Depth * Memory Width

In the example of a 1Gbx8 device as shown in Figure 2 above:

Number of Row Address Bits = 16

Total Number of Rows = 2 ^ 16 = 65536

Number of Column Address Bits = 10

Total Number of Columns = 2 ^ 10 = 1024

And the calculation goes:

Memory Depth = 65536 Rows * 1024 Columns * 16 Banks = 1Gb

Total Memory Capacity = 1Gb * 8 = 8Gb

Figure 3 describes the function block diagram of a 2 Gb x 4 device.

Number of Row Address Bits = 17

Total Number of Rows = 2 ^ 17 = 131072

Number of Column Address Bits = 10

Total Number of Columns = 2 ^ 10 = 1024

And the calculation goes:

Memory Depth = 131072 * 1024 * 16 = 2Gb

Total Memory Capacity = 2Gb* 4 = 8Gb

Memory module capacity

Memory rank and memory width determine how many memory devices are needed on each memory module.

A 64-bit DDR4 module with ECC support has a total of 72 bits for the data bus. Of the 72 bits, 8 bits are used for ECC. It would require a total of 18 x4 memory devices for a single rank module. Each memory device would supply 4 bits, and the total number of bits with 18 devices is 72 bits. For a dual rank module, we would need to double the amount of memory devices to 36.

If each x4 memory device has a memory capacity of 8Gb, a single rank module with 16 + 2 (ECC) devices would have 16GB module capacity.

8Gb * 16 = 128Gb = 16GB

And a dual rank ECC module with 36 8Gb (2Gb x 4) devices would have 32GB module capacity.

8Gb * 32 = 256Gb = 32GB

If the memory devices are x8, a 64-bit DDR4 module with ECC support would require a total of 9 x8 memory devices for a single rank module, and 18 x8 memory devices for a dual rank memory module. If each of these x8 memory devices has a memory capacity of 8Gb, a single rank module would have 8GB module capacity.

8Gb * 8 = 64Gb = 8GB

A dual rank ECC module with 18 8Gb (1Gb x 8) devices would have 16GB module capacity.

8Gb * 16 = 128Gb = 16GB

Notice that within the same memory device technology, for example 8Gb in our example, higher memory module capacity is achieved through using x4 device width, or dual-rank, or even quad-rank.

ACTIVATE timing and DRAM page sizes

Memory device width, whether it is x4, x8 or x16, also has an effect on memory timing parameters such as tFAW.

tFAW refers to Four Active Window time. It specifies a timing window within which four ACTIVATE commands can be issued. An ACTIVATE command is issued to open a row within a bank. In the block diagrams above we can see that each bank has its own set of sense amplifiers, thus one row can remain active per bank. A memory controller can issue four back-to-back ACTIVATE commands, but once the fourth ACTIVATE is done, the fifth ACTIVATE cannot be issued until the tFAW window expires.

The table below lists out the tFAW window lengths assigned to various DDR4 speeds and page sizes. Notice that under the same DDR4 speed, the bigger the page size, the longer the tFAW window is. For example, DDR4-1600 has a tFAW window of 20ns with 1/2KB page size. This means that within a bank, once a command to open a first row is issued, the controller must wait for 20ns, or 16 clock cycles (CK) before a command to open a fifth row can be issued.

The JEDEC memory standard specification for DDR4 tFAW timing varies by page sizes: 1/2KB, 1KB and 2KB.

	Symbol	DDR4-1600	DDR4-1866	DDR4-2133	DDR4-2400
Four ACTIVATE windows for 1/2KB page size (minimum)	tFAW (1/2KB)	greater of 16CK or 20ns	greater of 16CK or 17ns	greater of 16CK or 15ns	greater of 16CK or 13ns
Four ACTIVATE windows for 1KB page size (minimum)	tFAW (1KB)	greater of 20CK or 25ns	greater of 20CK or 23ns	greater of 20CK or 21ns	greater of 20CK or 21ns
Four ACTIVATE windows for 2KB page size (minimum)	tFAW (2KB)	greater of 28CK or 35ns	greater of 28CK or 30ns	greater of 28CK or 30ns	greater of 28CK or 30ns

What is the relationship between page sizes and memory device width? Since we briefly compared two 8Gb memory devices earlier, it makes sense to take another look at those two in terms of page sizes.

Page size is the number of bits loaded into the sense amplifiers when a row is activated. Therefore page size is directly related to the number of bits per row, or number of columns per row.

Page Size = Number of Columns * Memory Device Width = 1024 * Memory Device Width

The table below shows the page sizes for each device width:

Device Width	Page Size (Kb)	Page Size (KB)
x4	4 Kb	1/2 KB
x8	8 Kb	1 KB
x16	16 Kb	2 KB

Among the three device widths, x4 devices have the shortest tFAW timing limit, and x16 devices have the longest tFAW timing limit. The difference in tFAW specification has a negative timing performance impact on devices with higher device width.

An experiment with 2Rx4 and 2Rx8 DDR4 modules

To quantify the impact on memory performance from different memory device widths, an experiment has been conducted on our Gen11 servers with AMD EPYC 7713 Rome CPU. The Rome CPU has 64 cores, supports 8 memory channels.

Our production Gen11 servers are configured with 1 DIMM populated in each memory channel. In order to achieve 6GB/core memory per core ratio, the total memory for the Gen11 system is 64 core * 6 GB/core = 384 GB. This is achieved by installing four pieces of 32GB 2Rx8 and four pieces of 64GB 2Rx4 memory modules.

To compare the bandwidth performance difference between 2Rx4 and 2Rx8 DDR4 modules, two test cases are needed. One with all 2Rx4 DDR4 modules, and another one with 2Rx8 DDR4 modules. Each test case populates eight pieces of 32GB 32Mbps DDR4 RDIMM memories in each memory channel (1DPC). As shown in the table below, the difference between the set up of the two test cases is: case A tests 2Rx4 memory modules, and case B tests 2Rx8 memory modules.

Test case	Number of DIMMs	Memory vendor	Part number	Memory size	Memory speed	Memory organization
A	8	Samsung	M393A4G43BB4-CWE	32GB	3200 MT/s	2Rx8
B	8	Samsung	M393A4K40EB3-CWECQ	32GB	3200 MT/s	2Rx4

Memory Latency Checker results

Memory Latency Checker is an Intel developed synthetic benchmarking tool. It measures memory latency and bandwidth, and how they vary with workloads of different read/write ratios, as well as stream triad. Stream triad is a memory benchmark workload that contains three operations: it first multiples a large 1D array with a scalar, then adds it to a second array, and assigns it to a third array.

	2Rx8 32GB bandwidth (MB/s)	2Rx4 32GB bandwidth (MB/s)	Percentage difference
All reads	173,287	173,650	0.21%
3:1 reads-writes	154,593	156,343	1.13%
2:1 reads-writes	151,660	155,289	2.39%
1:1 reads-writes	146,895	151,199	2.93%
Stream-triad like	156,273	158,710	1.56%

The bandwidth performance difference in the All reads test case is not very significant, only 0.21%.

As the amount of writes increase, from 25% (3:1 reads-writes) to 50% (1:1 reads-writes), the bandwidth performance differences between test case A and test case B increase from 1.13% to 2.93%.

LMBench Results

LMBench is another synthetic benchmarking tool often used to study bandwidth performances of memory. Our LMBench bw_mem tests results are comparable to the results obtained from the MLC benchmark test.

	2Rx8 32GB bandwidth (MB/s)	2Rx4 32GB bandwidth (MB/s)	Percentage difference
Read	170,285	173,897	2.12%
Write	73,179	76,019	3.88%
Read then write	72,804	74,926	2.91%
Copy	50,332	51,776	2.87%

The biggest bandwidth performance difference is with Write workload. The 2Rx4 test case has 3.88% higher write bandwidth than the 2Rx8 test case.

Summary

Memory organization and memory width has a slight effect on memory bandwidth performance. The difference is most obvious in write-heavy workloads than read-heavy workloads. But even in write-heavy workloads, the difference is less than 4% according to our benchmark tests.

Memory modules with x4 width require twice the number of memory devices on the memory module, as compared to memory modules with x8 width of the same capacity. More memory devices would consume more power. According to Micron’s measurement data, 2Rx8 32GB memory modules using 16Gb devices consume 31% less power than 2Rx4 32GB memory modules using 8Gb devices. The substantial power saving of using x8 memory modules may outweigh the slight bandwidth performance impact.

Our Gen11 servers are configured with a mix of 2Rx4 and 2Rx8 DDR4 modules. For our future generations, we may consider using 2Rx8 memory where possible, in order to reduce overall system power consumption, with minimal impact to bandwidth performance.

References

https://www.micron.com/-/media/client/global/documents/products/data-sheet/dram/ddr4/8gb_ddr4_sdram.pdf

https://www.systemverilog.io/ddr4-basics#:~:text=DRAM%20Page%20Size&text=Since%20the%20column%20address%20is,it%20is%202KB%20per%20page

Deploying firmware at Cloudflare-scale: updating thousands of servers in more than 285 cities

2023-03-10 Chris Howells

Post Syndicated from Chris Howells original https://blog.cloudflare.com/deploying-firmware-at-cloudflare-scale-how-we-update-thousands-of-servers-in-more-than-285-cities/

Deploying firmware at Cloudflare-scale: updating thousands of servers in more than 285 cities

As a security company, it’s critical that we have good processes for dealing with security issues. We regularly release software to our servers – on a daily basis even – which includes new features, bug fixes, and as required, security patches. But just as critical is the software which is embedded into the server hardware, known as firmware. Primarily of interest is the BIOS and Baseboard Management Controller (BMC), but many other components also have firmware such as Network Interface Cards (NICs).

As the world becomes more digital, software which needs updating is appearing in more and more devices. As well as my computer, over the last year, I have waited patiently while firmware has updated in my TV, vacuum cleaner, lawn mower and light bulbs. It can be a cumbersome process, including obtaining the firmware, deploying it to the device which needs updating, navigating menus and other commands to initiate the update, and then waiting several minutes for the update to complete.

Firmware updates can be annoying even if you only have a couple of devices. We have more than a few devices at Cloudflare. We have a huge number of servers of varying kinds, from varying vendors, spread over 285 cities worldwide. We need to be able to rapidly deploy various types of firmware updates to all of them, reliably, and automatically, without any kind of manual intervention.

In this blog post I will outline the methods that we use to automate firmware deployment to our entire fleet. We have been using this method for several years now, and have deployed firmware without interrupting our SRE team, entirely automatically.

Background

A key component of our ability to deploy firmware at scale is the iPXE, an open source boot loader. iPXE is the glue which operates between the server and operating system, and is responsible for loading the operating system after the server has completed the Power On Self Test (POST). It is very flexible and contains a scripting language. With iPXE, we can write boot scripts which query the firmware version, continue booting if the correct firmware version is deployed, or if not, boot into a flashing environment to flash the correct firmware.

We only deploy new firmware when our systems are out of production, so we need a method to coordinate deployment only on out of production systems. The simplest way to do this is when they are rebooting, because by definition they are out of production then. We reboot our entire fleet every month, and have the ability to schedule reboots more urgently if required to deal with a security issue. Regularly rebooting our fleets has many advantages. We can deploy the latest Linux kernel, base operating system, and ensure that we do not have any breaking changes in our operating system and configuration management environment that breaks on fresh boot.

Our entire fleet operates in UEFI mode. UEFI is a modern replacement for the BIOS and offers more features and more security, such as Secure Boot. A full description of all of these changes is outside the scope of this article, but essentially UEFI provides a minimal environment and shell capable of executing binaries. Secure Boot ensures that the binaries are signed with keys embedded in the system, to prevent a bad actor from tampering with our software.

How we update the BIOS

We are able to update the BIOS without booting any operating system, purely by taking advantage of features offered by iPXE and the UEFI shell. This requires a flashing binary written for the UEFI environment.

Upon boot, iPXE is started. Through iPXE’s built-in variable ${smbios/0.5.0} it is possible to query the current BIOS version, and compare it to the latest version, and trigger a flash only if there is a mis-match. iPXE then downloads the files required for the firmware update to a ramdisk.

The following is an example of a very basic iPXE script which performs such an action:

# Check whether the BIOS version is 2.03
iseq ${smbios/0.5.0} 2.03 || goto biosupdate
echo Nothing to do for {{ model }}
exit 0

:biosupdate
echo Trying to update BIOS/UEFI...
echo Current: ${smbios/0.5.0}
echo New: 2.03

imgfetch ${boot_prefix}/tools/x64/shell.efi || goto unexpected_error
imgfetch startup.nsh || goto unexpected_error

imgfetch AfuEfix64.efi || goto unexpected_error
imgfetch bios-2.03.bin || goto unexpected_error

imgexec shell.efi || goto unexpected_error

Meanwhile, startup.nsh contains the binary to run and command line arguments to effect the flash:

startup.nsh:

%homefilesystem%\AfuEfix64.efi %homefilesystem%\bios-2.03.bin /P /B /K /N /X /RLC:E /REBOOT

After rebooting, the machine will boot using its new BIOS firmware, version 2.03. Since ${smbios/0.5.0} now contains 2.03, the machine continues to boot and enter production.

Other firmware updates such as BMC, network cards and more

Unfortunately, the number of vendors that support firmware updates with UEFI flashing binaries is limited. There are a large number of other updates that we need to perform such as BMC and NIC.

Consequently, we need another way to flash these binaries. Thankfully, these vendors invariably support flashing from Linux. Consequently we can perform flashing from a minimal Linux environment. Since vendor firmware updates are typically closed source utilities and vendors are often highly secretive about firmware flashing, we can ensure that the flashing environment does not provide an attackable surface by ensuring that the network is not configured. If it’s not on the network, it can’t be attacked and exploited.

Not being on the network means that we need to inject files into the boot process when the machine boots. We can accomplish this with an initial ramdisk (initrd), and iPXE makes it easy to add additional initrd to the boot.

Creating an initrd is as simple as creating an archive of the files using cpio using the newc archive format.

Let’s imagine we are going to flash Broadcom NIC firmware. We’ll use the bnxtnvm firmware update utility, the firmware image firmware.pkg, and a shell script called flash to automate the task.

The files are laid out in the file system like this:

cd broadcom
find .
./opt/preflight
./opt/preflight/scripts
./opt/preflight/scripts/flash
./opt/broadcom
./opt/broadcom/firmware.pkg
./opt/broadcom/bnxtnvm

Now we compress all of these files into an image called broadcom.img.

find . | cpio --quiet -H newc -o | gzip -9 -n > ../broadcom.img

This is the first step completed; we have the firmware packaged up into an initrd.

Since it’s challenging to read, say, the firmware version of the NIC, from the EFI shell, we store firmware versions as UEFI variables. These can be written from Linux via efivars, the UEFI variable file system, and then read by iPXE on boot.

An example of writing an EFI variable from Linux looks like this:

declare -r fw_path='/sys/firmware/efi/efivars/broadcom-fw-9ca25c23-368a-4c21-943f-7d91f2b76008'
declare -r efi_header='\x07\x00\x00\x00'
declare -r version='1.05'

/bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars

# Files on efivarfs are immutable by default, so remove the immutable flag so that we can write to it: https://docs.kernel.org/filesystems/efivarfs.html
if [ -f "${fw_path}" ] ; then
    /usr/bin/chattr -i "${fw_path}"
fi

echo -n -e "${efi_header}${version}" >| "$fw_path"

Then we can write an iPXE configuration file to load the flashing kernel, userland and flashing utilities.

set cf/guid 9ca25c23-368a-4c21-943f-7d91f2b76008

iseq ${efivar/broadcom-fw-${cf/guid}} 1.05 && echo Not flashing broadcom firmware, version already at 1.05 || goto update
exit

:update
echo Starting broadcom firmware update
kernel ${boot_prefix}/vmlinuz initrd=baseimg.img initrd=linux-initramfs-modules.img initrd=broadcom.img
initrd ${boot_prefix}/baseimg.img
initrd ${boot_prefix}/linux-initramfs-modules.img
initrd ${boot_prefix}/firmware/broadcom.img

Flashing scripts are deposited into /opt/preflight/scripts and we use systemd to execute them with run-parts on boot:

/etc/systemd/system/preflight.service:

[Unit]
Description=Pre-salt checks and simple configurations on boot
Before=salt-highstate.service
After=network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/run-parts --verbose /opt/preflight/scripts

[Install]
WantedBy=multi-user.target
RequiredBy=salt-highstate.service

An example flashing script in /opt/preflight/scripts might look like:

#!/bin/bash

trap 'catch $? $LINENO' ERR
catch(){
    #error handling goes here
    echo "Error $1 occured on line $2"
}

declare -r fw_path='/sys/firmware/efi/efivars/broadcom-fw-9ca25c23-368a-4c21-943f-7d91f2b76008'
declare -r efi_header='\x07\x00\x00\x00'
declare -r version='1.05'

lspci | grep -q Broadcom
if [ $? -eq 0 ]; then
    echo "Broadcom firmware flashing starting"
    if [ ! -f "$fw_path" ] ; then
        chmod +x /opt/broadcom/bnxtnvm
        declare -r interface=$(/opt/broadcom/bnxtnvm listdev | grep "Device Interface Name" | awk -F ": " '{print $2}')
        /opt/broadcom/bnxtnvm -dev=${interface} -force -y install /opt/broadcom/BCM957414M4142C.pkg
        declare -r status=$?
        declare -r currentversion=$(/opt/broadcom/bnxtnvm -dev=${interface} device_info | grep "Package version on NVM" | awk -F ": " '{print $2}')
        declare -r expectedversion=$(echo $version | awk '{print $2}')
        if [ $status -eq 0 -a "$currentversion" = "$expectedversion" ]; then
            echo "Broadcom firmware $version flashed successfully"
            /bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars
            echo -n -e "${efi_header}${version}" >| "$fw_path"
            echo "Created $fw_path"
        else
            echo "Failed to flash Broadcom firmware $version"
            /opt/broadcom/bnxtnvm -dev=${interface} device_info
        fi
    else
        echo "Broadcom firmware up-to-date"
    fi
else
    echo "No Broadcom NIC installed"
    /bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars
    if [ -f "${fw_path}" ] ; then
        /usr/bin/chattr -i "${fw_path}"
    fi
    echo -n -e "${efi_header}${version}" >| "$fw_path"
    echo "Created $fw_path"
fi

if [ -f "${fw_path}" ]; then
    echo "rebooting in 60 seconds"
    sleep 60
    /sbin/reboot
fi

Conclusion

Whether you manage just your laptop or desktop computer, or a fleet of servers, it’s important to keep the firmware updated to ensure that the availability, performance and security of the devices is maintained.

If you have a few devices and would benefit from automating the deployment process, we hope that we have inspired you to have a go by making use of some basic open source tools such as the iPXE boot loader and some scripting.

Final thanks to my colleague Ignat Korchagin who did a large amount of the original work on the UEFI BIOS firmware flashing infrastructure.

Armed to Boot: an enhancement to Arm’s Secure Boot chain

2023-01-25 Derek Chamorro

Post Syndicated from Derek Chamorro original https://blog.cloudflare.com/armed-to-boot/

Armed to Boot: an enhancement to Arm's Secure Boot chain

Over the last few years, there has been a rise in the number of attacks that affect how a computer boots. Most modern computers use a specification called Unified Extensible Firmware Interface (UEFI) that defines a software interface between an operating system (e.g. Windows) and platform firmware (e.g. disk drives, video cards). There are security mechanisms built into UEFI that ensure that platform firmware can be cryptographically validated and boot securely through an application called a bootloader. This firmware is stored in non-volatile SPI flash memory on the motherboard, so it persists on the system even if the operating system is reinstalled and drives are replaced.

This creates a ‘trust anchor’ used to validate each stage of the boot process, but, unfortunately, this trust anchor is also a target for attack. In these UEFI attacks, malicious actions are loaded onto a compromised device early in the boot process. This means that malware can change configuration data, establish persistence by ‘implanting’ itself, and can bypass security measures that are only loaded at the operating system stage. So, while UEFI-anchored secure boot protects the bootloader from bootloader attacks, it does not protect the UEFI firmware itself.

Because of this growing trend of attacks, we began the process of cryptographically signing our UEFI firmware as a mitigation step. While our existing solution is platform specific to our x86 AMD server fleet, we did not have a similar solution to UEFI firmware signing for Arm. To determine what was missing, we had to take a deep dive into the Arm secure boot process.

Read on to learn about the world of Arm Trusted Firmware Secure Boot.

Arm Trusted Firmware Secure Boot

Arm defines a trusted boot process through an architecture called Trusted Board Boot Requirements (TBBR), or Arm Trusted Firmware (ATF) Secure Boot. TBBR works by authenticating a series of cryptographically signed binary images each containing a different stage or element in the system boot process to be loaded and executed. Every bootloader (BL) stage accomplishes a different stage in the initialization process:

BL1

BL1 defines the boot path (is this a cold boot or warm boot), initializes the architectures (exception vectors, CPU initialization, and control register setup), and initializes the platform (enables watchdog processes, MMU, and DDR initialization).

BL2

BL2 prepares initialization of the Arm Trusted Firmware (ATF), the stack responsible for setting up the secure boot process. After ATF setup, the console is initialized, memory is mapped for the MMU, and message buffers are set for the next bootloader.

BL3

The BL3 stage has multiple parts, the first being initialization of runtime services that are used in detecting system topology. After initialization, there is a handoff between the ATF ‘secure world’ boot stage to the ‘normal world’ boot stage that includes setup of UEFI firmware. Context is set up to ensure that no secure state information finds its way into the normal world execution state.

Each image is authenticated by a public key, which is stored in a signed certificate and can be traced back to a root key stored on the SoC in one time programmable (OTP) memory or ROM.

TBBR was originally designed for cell phones. This established a reference architecture on how to build a “Chain of Trust” from the first ROM executed (BL1) to the handoff to “normal world” firmware (BL3). While this creates a validated firmware signing chain, it has caveats:

SoC manufacturers are heavily involved in the secure boot chain, while the customer has little involvement.
A unique SoC SKU is required per customer. With one customer this could be easy, but most manufacturers have thousands of SKUs
The SoC manufacturer is primarily responsible for end-to-end signing and maintenance of the PKI chain. This adds complexity to the process requiring USB key fobs for signing.
Doesn’t scale outside the manufacturer.

What this tells us is what was built for cell phones doesn’t scale for servers.

If we were involved 100% in the manufacturing process, then this wouldn’t be as much of an issue, but we are a customer and consumer. As a customer, we have a lot of control of our server and block design, so we looked at design partners that would take some of the concepts we were able to implement with AMD Platform Secure Boot and refine them to fit Arm CPUs.

Amping it up

We partnered with Ampere and tested their Altra Max single socket rack server CPU (code named Mystique) that provides high performance with incredible power efficiency per core, much of what we were looking for in reducing power consumption. These are only a small subset of specs, but Ampere backported various features into the Altra Max notably, speculative attack mitigations that include Meltdown and Spectre (variants 1 and 2) from the Armv8.5 instruction set architecture, giving Altra the “+” designation in their ISA.

Ampere does implement a signed boot process similar to the ATF signing process mentioned above, but with some slight variations. We’ll explain it a bit to help set context for the modifications that we made.

Ampere Secure Boot

The diagram above shows the Arm processor boot sequence as implemented by Ampere. System Control Processors (SCP) are comprised of the System Management Processor (SMpro) and the Power Management Processor (PMpro). The SMpro is responsible for features such as secure boot and bmc communication while the PMpro is responsible for power features such as Dynamic Frequency Scaling and on-die thermal monitoring.

At power-on-reset, the SCP runs the system management bootloader from ROM and loads the SMpro firmware. After initialization, the SMpro spawns the power management stack on the PMpro and ATF threads. The ATF BL2 and BL31 bring up processor resources such as DRAM, and PCIe. After this, control is passed to BL33 BIOS.

Authentication flow

At power on, the SMpro firmware reads Ampere’s public key (ROTPK) from the SMpro key certificate in SCP EEPROM, computes a hash and compares this to Ampere’s public key hash stored in eFuse. Once authenticated, Ampere’s public key is used to decrypt key and content certificates for SMpro, PMpro, and ATF firmware, which are launched in the order described above.

The SMpro public key will be used to authenticate the SMpro and PMpro images and ATF keys which in turn will authenticate ATF images. This cascading set of authentication that originates with the Ampere root key and stored in chip called an electronic fuse, or eFuse. An eFuse can be programmed only once, setting the content to be read-only and can not be tampered with nor modified.

This is the original hardware root of trust used for signing system, secure world firmware. When we looked at this, after referencing the signing process we had with AMD PSB and knowing there was a large enough one-time-programmable (OTP) region within the SoC, we thought: why can’t we insert our key hash in here?

Single Domain Secure Boot

Single Domain Secure Boot takes the same authentication flow and adds a hash of the customer public key (Cloudflare firmware signing key in this case) to the eFuse domain. This enables the verification of UEFI firmware by a hardware root of trust. This process is performed in the already validated ATF firmware by BL2. Our public key (dbb) is read from UEFI secure variable storage, a hash is computed and compared to the public key hash stored in eFuse. If they match, the validated public key is used to decrypt the BL33 content certificate, validating and launching the BIOS, and remaining boot items. This is the key feature added by SDSB. It validates the entire software boot chain with a single eFuse root of trust on the processor.

Building blocks

With a basic understanding of how Single Domain Secure Boot works, the next logical question is “How does it get implemented?”. We ensure that all UEFI firmware is signed at build time, but this process can be better understood if broken down into steps.

Ampere, our original device manufacturer (ODM), and we play a role in execution of SDSB. First, we generate certificates for a public-private key pair using our internal, secure PKI. The public key side is provided to the ODM as dbb.auth and dbu.auth in UEFI secure variable format. Ampere provides a reference Software Release Package (SRP) including the baseboard management controller, system control processor, UEFI, and complex programmable logic device (CPLD) firmware to the ODM, who customizes it for their platform. The ODM generates a board file describing the hardware configuration, and also customizes the UEFI to enroll dbb and dbu to secure variable storage on first boot.

Once this is done, we generate a UEFI.slim file using the ODM’s UEFI ROM image, Arm Trusted Firmware (ATF) and Board File. (Note: This differs from AMD PSB insofar as the entire image and ATF files are signed; with AMD PSB, only the first block of boot code is signed.) The entire .SLIM file is signed with our private key, producing a signature hash in the file. This can only be authenticated by the correct public key. Finally, the ODM packages the UEFI into .HPM format compatible with their platform BMC.

In parallel, we provide the debug fuse selection and hash of our DER-formatted public key. Ampere uses this information to create a special version of the SCP firmware known as Security Provisioning (SECPROV) .slim format. This firmware is run one time only, to program the debug fuse settings and public key hash into the SoC eFuses. Ampere delivers the SECPROV .slim file to the ODM, who packages it into a .hpm file compatible with the BMC firmware update tooling.

Fusing the keys

During system manufacturing, firmware is pre-programmed into storage ICs before placement on the motherboard. Note that the SCP EEPROM contains the SECPROV image, not standard SCP firmware. After a system is first powered on, an IPMI command is sent to the BMC which releases the Ampere processor from reset. This allows SECPROV firmware to run, burning the SoC eFuse with our public key hash and debug fuse settings.

Final manufacturing flow

Once our public key has been provisioned, manufacturing proceeds by re-programming the SCP EEPROM with its regular firmware. Once the system powers on, ATF detects there are no keys present in secure variable storage and allows UEFI firmware to boot, regardless of signature. Since this is the first UEFI boot, it programs our public key into secure variable storage and reboots. ATF is validated by Ampere’s public key hash as usual. Since our public key is present in dbb, it is validated against our public key hash in eFuse and allows UEFI to boot.

Validation

The first part of validation requires observing successful destruction of the eFuses. This imprints our public key hash into a dedicated, immutable memory region, not allowing the hash to be overwritten. Upon automatic or manual issue of an IPMI OEM command to the BMC, the BMC observes a signal from the SECPROV firmware, denoting eFuse programming completion. This can be probed with BMC commands.

When the eFuses have been blown, validation continues by observing the boot chain of the other firmware. Corruption of the SCP, ATF, or UEFI firmware obstructs boot flow and boot authentication and will cause the machine to fail booting to the OS. Once firmware is in place, happy path validation begins with booting the machine.

Upon first boot, firmware boots in the following order: BMC, SCP, ATF, and UEFI. The BMC, SCP, and ATF firmware can be observed via their respective serial consoles. The UEFI will automatically enroll the dbb and dbu files to the secure variable storage and trigger a reset of the system.

After observing the reset, the machine should successfully boot to the OS if the feature is executed correctly. For further validation, we can use the UEFI shell environment to extract the dbb file and compare the hash against the hash submitted to Ampere. After successfully validating the keys, we flash an unsigned UEFI image. An unsigned UEFI image causes authentication failure at bootloader stage BL3-2. The ATF firmware undergoes a boot loop as a result. Similar results will occur for a UEFI image signed with incorrect keys.

Updated authentication flow

On all subsequent boot cycles, the ATF will read secure variable dbb (our public key), compute a hash of the key, and compare it to the read-only Cloudflare public key hash in eFuse. If the computed and eFuse hashes match, our public key variable can be trusted and is used to authenticate the signed UEFI. After this, the system boots to the OS.

Let’s boot!

We were unable to get a machine without the feature enabled to demonstrate the set-up of the feature since we have the eFuse set at build time, but we can demonstrate what it looks like to go between an unsigned BIOS and a signed BIOS. What we would have observed with the set-up of the feature is a custom BMC command to instruct the SCP to burn the ROTPK into the SOC’s OTP fuses. From there, we would observe feedback to the BMC detailing whether burning the fuses was successful. Upon booting the UEFI image for the first time, the UEFI will write the dbb and dbu into secure storage.

As you can see, after flashing the unsigned BIOS, the machine fails to boot.

Despite the lack of visibility in failure to boot, there are a few things going on underneath the hood. The SCP (System Control Processor) still boots.

The SCP image holds a key certificate with Ampere’s generated ROTPK and the SCP key hash. SCP will calculate the ROTPK hash and compare it against the burned OTP fuses. In the failure case, where the hash does not match, you will observe a failure as you saw earlier. If successful, the SCP firmware will proceed to boot the PMpro and SMpro. Both the PMpro and SMpro firmware will be verified and proceed with the ATF authentication flow.
The conclusion of the SCP authentication is the passing of the BL1 key to the first stage bootloader via the SCP HOB(hand-off-block) to proceed with the standard three stage bootloader ATF authentication mentioned previously.
At BL2, the dbb is read out of the secure variable storage and used to authenticate the BL33 certificate and complete the boot process by booting the BL33 UEFI image.

Still more to do

In recent years, management interfaces on servers, like the BMC, have been the target of cyber attacks including ransomware, implants, and disruptive operations. Access to the BMC can be local or remote. With remote vectors open, there is potential for malware to be installed on the BMC via network interfaces. With compromised software on the BMC, malware or spyware could maintain persistence on the server. An attacker might be able to update the BMC directly using flashing tools such as flashrom or socflash without the same level of firmware resilience established at the UEFI level.

The future state involves using host CPU-agnostic infrastructure to enable a cryptographically secure host prior to boot time. We will look to incorporate a modular approach that has been proposed by the Open Compute Project’s Data Center Secure Control

Module Specification (DC-SCM) 2.0 specification. This will allow us to standardize our Root of Trust, sign our BMC, and assign physically unclonable function (PUF) based identity keys to components and peripherals to limit the use of OTP fusing. OTP fusing creates a problem with trying to “e-cycle” or reuse machines as you cannot truly remove a machine identity.

A more sustainable end-of-life for your legacy hardware appliances with Cloudflare and Iron Mountain

2022-12-14 May Ma

Post Syndicated from May Ma original https://blog.cloudflare.com/sustainable-end-of-life-hardware/

A more sustainable end-of-life for your legacy hardware appliances with Cloudflare and Iron Mountain

Today, as part of Cloudflare’s Impact Week, we’re excited to announce an opportunity for Cloudflare customers to make it easier to decommission and dispose of their used hardware appliances sustainably. We’re partnering with Iron Mountain to offer preferred pricing and discounts for Cloudflare customers that recycle or remarket legacy hardware through its service.

Replacing legacy hardware with Cloudflare’s network

Cloudflare’s products enable customers to replace legacy hardware appliances with our global network. Connecting to our network enables access to firewall (including WAF and Network Firewalls, Intrusion Detection Systems, etc), DDoS mitigation, VPN replacement, WAN optimization, and other networking and security functions that were traditionally delivered in physical hardware. These are served from our network and delivered as a service. This creates a myriad of benefits for customers including stronger security, better performance, lower operational overhead, and none of the headaches of traditional hardware like capacity planning, maintenance, or upgrade cycles. It’s also better for the Earth: our multi-tenant SaaS approach means more efficiency and a lower carbon footprint to deliver those functions.

But what happens with all that hardware you no longer need to maintain after switching to Cloudflare?

The life of a hardware box

The life of a hardware box begins on the factory line at the manufacturer. These are then packaged, shipped and installed at the destination infrastructure where they provide processing power to run front-end products or services, and routing network traffic. Occasionally, if the hardware fails to operate, or its performance declines over time, it will get fixed or will be returned for replacement under the warranty.

When none of these options work, the hardware box is considered end-of-life and it “dies”. This hardware must be decommissioned by being disconnected from the network, and then physically removed from the data center for disposal.

The useful lifespan of hardware depends on the availability of newer generations of processors which help realize critical efficiency improvements around cost, performance, and power. In general, the industry standard of hardware decommissioning timeline is between three and six years after installation. There are additional benefits to refreshing these physical assets at the lower end of the hardware lifespan spectrum, keeping your infrastructure at optimal performance.

In the instance where the hardware still works, but is replaced by newer technologies, it would be such a waste to discard this gear. Instead, there could be recoverable value in this outdated hardware. And simply tossing unwanted hardware into the trash indiscriminately, which will eventually become part of the landfill, causes devastating consequences as these electronic devices contain hazardous materials like lithium, palladium, lead, copper and cobalt or mercury, and those could contaminate the environment. Below, we explain sustainable alternatives and cost-beneficial practices one can pursue to dispose of your infrastructure hardware.

Option 1: Remarket / Reuse

For hardware that still works, the most sustainable route is to sanitize it of data, refurbish, and resell it in the second-hand market at a depreciated cost. Some IT asset disposition firms would also repurpose used hardware to maximize its market value. For example, harvesting components from a device to build part of another product and selling that at a higher price. For working parts that have very little resale value, companies can also consider reusing them to build a spare parts inventory for replacing failed parts later in the data centers.

The benefits of remarket and reuse are many. It helps maximize a hardware’s return of investment by including any reclaimed value at end-of-life stage, offering financial benefits to the business. And it reduces discarded electronics, or e-waste and their harmful efforts on our environment, helping socially responsible organizations build a more sustainable business. Lastly, it provides alternatives to individuals and organizations that cannot afford to buy new IT equipment.

Option 2: Recycle

For used hardware that is not able to be remarketed, it is recommended to engage an asset disposition firm to professionally strip it of any valuable and recyclable materials, such as precious metal and plastic, before putting it up for physical destruction. Similar to remarketing, recycling also reduces environmental impact, and cuts down the amount of raw materials needed to manufacture new products.

A key factor in hardware recycling is a secure chain of custody. Meaning, a supplier has the right certification, preferably its own fleet and secure facilities to properly and securely process the equipment.

Option 3: Destroy

From a sustainable point of view, this route should only be used as a last resort. When hardware does not operate as it is intended to, and has no remarketed nor recycled value, an asset disposition supplier would remove all the asset tags and information from it in preparation for a physical destruction. Depending on disposal policies, some companies would choose to sanitize and destroy all the data bearing hardware, such as SSD or HDD, for security reasons.

To further maximize recycling value and reduce e-waste, it is recommended to keep security policy up to date on discarded IT equipment and explore the option of reusing working devices after a professional data wiping as much as possible.

At Cloudflare, we follow an industry-standard capital depreciation timeline, which culminates in recycling actions through the engagement of IT asset disposition partners including Iron Mountain. Through these partnerships, besides data bearing hardware which follows the security policy to be sanitized and destroyed, approximately 99% of the rest decommissioned IT equipment from Cloudflare is sold or recycled.

Partnering with Iron Mountain to make sustainable goals more accessible

Hardware discomission can be a burden on a business, from operational strain to complex processes, a lack of streamlined execution to the risk of a data breach. Our experience shows that partnering with an established firm like Iron Mountain who is specialized in IT asset disposition would help kick-start one’s hardware recycling journey.

Iron Mountain has more than two decades of experience working with Hyperscale technology and data centers. A market leader in decommissioning, data security and remarketing capabilities. It has a wide footprint of facilities to support their customers’ sustainability goals globally.

Today, Iron Mountain has generated more than US$1.5 billion through value recovery and has been continually developing new ways to sell mass volumes of technology for their best use. Other than their end-to-end decommission offering, there are two additional value adding services that Iron Mountain provides to their customers that we find valuable. They offer a quarterly survey report which presents insights in the used market, and a sustainability report that measures the environmental impact based on total hardware processed with their customers.

Get started today

Get started today with Iron Mountain on your hardware recycling journey and sign up from here. After receiving the completed contact form, Iron Mountain will consult with you on the best solution possible. It has multiple programs to support including revenue share, fair market value, and guaranteed destruction with proper recycling. For example, when it comes to reselling used IT equipment, Iron Mountain would propose an appropriate revenue split, namely how much percentage of sold value will be shared with the customer, based on business needs. Iron Mountain’s secure chain of custody with added solutions such as redeployment, equipment retrieval programs, and onsite destruction can ensure it can tailor the solution that works best for your company’s security and environmental needs.

And in collaboration with Cloudflare, Iron Mountain offers additional two percent on your revenue share of the remarketed items and a five percent discount on the standard fees for other IT asset disposition services if you are new to Iron Mountain and choose to use these services via the link in this blog.

How we’re making Cloudflare’s infrastructure more sustainable

2022-12-14 Rebecca Weekly

Post Syndicated from Rebecca Weekly original https://blog.cloudflare.com/extending-the-life-of-hardware/

How we’re making Cloudflare’s infrastructure more sustainable

Whether you are building a global network or buying groceries, some rules of sustainable living remain the same: be thoughtful about what you get, make the most out of what you have, and try to upcycle your waste rather than throwing it away. These rules are central to Cloudflare — we take helping build a better Internet seriously, and we define this as not just having the most secure, reliable, and performant network — but also the most sustainable one.

With incredible growth of the Internet, and the increased usage of Cloudflare’s network, even linear improvements to sustainability in our hardware today will result in exponential gains in the future. We want to use this post to outline how we think about the sustainability impact of the hardware in our network, and what we’re doing to continually mitigate that impact.

Sustainability in the realm of servers

The total carbon footprint of a server is approximately 6 tons of Carbon Dioxide equivalent (CO2eq) when used in the US. There are four parts to the carbon footprint of any computing device:

The embodied emissions: source materials and production
Packing and shipping
Use of the product
End of life.

The emissions from the actual operations and use of a server account for the vast majority of the total life-cycle impact. The secondary impact is embodied emissions (which is the carbon footprint from the creation of the device in the first place), which is about 10% overall.

Use of Product Emissions

It’s difficult to reduce the total emissions for the operation of servers. If there’s a workload that needs computing power, the server will complete the workload and use the energy required to complete it. What we can do, however, is consistently seek to improve the amount of computing output per kilo of CO2 emissions — and the way we do that is to consistently upgrade our hardware to the most power-efficient designs. As we switch from one generation of server to the next, we often see very large increases in computing output, at the same level of power consumption. In this regard, given energy is a large cost for our business, our incentives of reducing our environmental impact are naturally aligned to our business model.

Embodied Emissions

The other large category of emissions — the embodied emissions — are a domain where we actually have a lot more control than the use of the product. Reminder from before: the embodied carbon means the sources of emissions generated outside of equipments’ operation. How can we reduce the embodied emissions involved in running a fleet of servers? Turns out, there are a few ways: modular design, relying on open vs proprietary standards to enable reuse, and recycling.

Modular Design

The first big opportunity is through modular system design. Modular systems are a great way of reducing embodied carbon, as they result in fewer new components and allow for parts that don’t have efficiency upgrades to be leveraged longer. Modular server design is essentially decomposing functions of the motherboard onto sub-boards so that the server owner can selectively upgrade the components that are required for their use cases.

How much of an impact can modular design have? Well, if 30% of the server is delivering meaningful efficiency gains (usually CPU and memory, sometimes I/O), we may really need to upgrade those in order to meet efficiency goals, but creating an additional 70% overhead in embodied carbon (i.e. the rest of the server, which often is made up of components that do not get more efficient) is not logical. Modular design allows us to upgrade the components that will improve the operational efficiency of our data centers, but amortize carbon in the “glue logic” components over the longer time periods for which they can continue to function.

Previously, many systems providers drove ridiculous and useless changes in the peripherals (custom I/Os, outputs that may not be needed for a specific use case such as VGA for crash carts we might not use given remote operations, etc.), which would force a new motherboard design for every new CPU socket design. By standardizing those interfaces across vendors, we can now only source the components we need, and reuse a larger percentage of systems ourselves. This trend also helps with reliability (sub-boards are more well tested), and supply assurance (since standardized subcomponent boards can be sourced from more vendors), something all of us in the industry have had top-of-mind given global supply challenges of the past few years.

Standards-based Hardware to Encourage Re-use

But even with modularity, components need to go somewhere after they’ve been deprecated — and historically, this place has been a landfill. There is demand for second-hand servers, but many have been parts of closed systems with proprietary firmware and BIOS, so repurposing them has been costly or impossible to integrate into new systems. The economics of a circular economy are such that service fees for closed firmware and BIOS support as well as proprietary interconnects or ones that are not standardized can make reuse prohibitively expensive. How do you solve this? Well, if servers can be supported using open source firmware and BIOS, you dramatically reduce the cost of reusing the parts — so that another provider can support the new customer.

Recycling

Beyond that, though, there are parts failures, or parts that are simply no longer economical to be run, even in the second hand market. Metal recycling can always be done, and some manufacturers are starting to invest in programs there, although the energy investment for extracting the usable elements sometimes doesn’t make sense. There is innovation in this domain, Zhan, et al. (2020) developed an environmentally friendly and efficient hydrothermal-buffering technique for the recycling of GaAs-based ICs, achieving gallium and arsenic recovery rates of 99.9 and 95.5% respectively. Adoption is still limited — most manufacturers are discussing water recycling and renewable energy vs. full-fledged recycling of metals — but we’re closely monitoring the space to take advantage of any further innovation that happens.

What Cloudflare is Doing To Reduce Our Server Impact

It is great to talk about these concepts, but we are doing this work today. I’d describe them as being under two main banners: taking steps to reduce embodied emissions through modular and open standards design, and also using the most power-efficient solutions for our workloads.

Gen 12: Walking the Talk

Our next generation of servers, Gen 12, will be coming soon. We’re emphasizing modular-driven design, as well as a focus on open standards, to enable reuse of the components inside the servers.

A modular-driven design

Historically, every generation of server here at Cloudflare has required a massive redesign. An upgrade to a new CPU required a new motherboard, power supply, chassis, memory DIMMs, and BMC. This, in turn, might mean new fans, storage, network cards, and even cables. However, many of these components are not changing drastically from generation to generation: these components are built using older manufacturing processes, and leverage interconnection protocols that do not require the latest speeds.

To help illustrate this, let’s look at our Gen 11 server today: a single socket server is ~450W of power, with the CPU and associated memory taking about 320W of that (potentially 360W at peak load). All the other components on that system (mentioned above) are ~100W of operational power (mostly dominated by fans, which is why so many companies are exploring alternative cooling designs), so they are not where the optimization efforts or newer ICs will greatly improve the system’s efficiency. So, instead of rebuilding all those pieces from scratch for every new server and generating that much more embodied carbon, we are reusing them as often as possible.

By disaggregating components that require changes for efficiency reasons from other system-level functions (storage, fans, BMCs, programmable logic devices, etc.), we are able to maximize reuse of electronic components across generations. Building systems modularly like this significantly reduces our embodied carbon footprint over time. Consider how much waste would be eliminated if you were able to upgrade your car’s engine to improve its efficiency without changing the rest of the parts that are working well, like the frame, seats, and windows. That’s what modular design is enabling in data centers like ours across the world.

A Push for Open Standards, Too

We, as an industry, have to work together to accelerate interoperability across interfaces, standards, and vendors if we want to achieve true modularity and our goal of a 70% reduction in e-waste. We have begun this effort by leveraging standard add-in-card form factors (OCP 2.0 and 3.0 NICs, Datacenter Secure Control Module for our security and management modules, etc.) and our next server design is leveraging Datacenter Modular Hardware System, an open-source design specification that allows for modular subcomponents to be connected across common buses (regardless of the system manufacturer). This technique allows us to maintain these components over multiple generations without having to incur more carbon debt on parts that don’t change as often as CPUs and memory.

In order to enable a more comprehensive circular economy, Cloudflare has made extensive and increasing use of open-source solutions, like OpenBMC, a requirement for all of our vendors, and we work to ensure fixes are upstreamed to the community. Open system firmware allows for greater security through auditability, but the most important factor for sustainability is that a new party can assume responsibility and support for that server, which allows systems that might otherwise have to be destroyed to be reused. This ensures that (other than data-bearing assets, which are destroyed based on our security policy) 99% of hardware used by Cloudflare is repurposed, reducing the number of new servers that need to be built to fulfill global capacity demand. Further details about the specifics of how that happens – and how you can join our vision of reducing e-waste – you can find in this blog post.

Using the most power-efficient solutions for our workloads

The other big way we can push for sustainability (in our hardware) while responding to our exponential increase in demand without wastefully throwing more servers at the problem is simple in concept, and difficult in practice: testing and deploying more power-efficient architectures and tuning them for our workloads. This means not only evaluating the efficiency of our next generation of servers and networking gear, but also reducing hardware and energy waste in our fleet.

Currently, in production, we see that Gen 11 servers can handle about 25% more requests than Gen 10 servers for the same amount of energy. This is about what we expected when we were testing in mid-2021, and is exciting to see given that we continue to launch new products and services we couldn’t test at that time.

System power efficiency is not as simple a concept as it used to be for us. Historically, the key metric for assessing efficiency has been requests per second per watt. This metric allowed for multi-generational performance comparisons when qualifying new generations of servers, but it was really designed with our historical core product suite in mind.

We want – and, as a matter of scaling, require – our global network to be an increasingly intelligent threat detection mechanism, and also a highly performant development platform for our customers. As anyone who’s looked at a benchmark when shopping for a new computer knows, fast performance in one domain (traditional benchmarks such as SpecInt_Rate, STREAM, etc.) does not necessarily mean fast performance in another (e.g. AI inference, video processing, bulk object storage). The validation testing process for our next generation of server needs to take all of these workloads and their relative prevalence into account — not just requests. The deep partnership between hardware and software that Cloudflare can have is enabling optimization opportunities that other companies running third party code cannot pursue. I often say this is one of our superpowers, and this is the opportunity that makes me most excited about my job every day.

The other way we can be both sustainable and efficient is by leveraging domain-specific accelerators. Accelerators are a wide field, and we’ve seen incredible opportunities with application-level ones (see our recent announcement on AV1 hardware acceleration for Cloudflare Stream) as well as infrastructure accelerators (sometimes referred to as Smart NICs). That said, adding new silicon to our fleet is only adding to the problem if it isn’t as efficient as the thing it’s replacing, and a node-level performance analysis often misses the complexity of deployment in a fleet as distributed as ours, so we’re moving quickly but cautiously.

Moving Forward: Industry Standard Reporting

We’re pushing by ourselves as hard as we can, but there are certain areas where the industry as a whole needs to step up.

In particular: there is a woeful lack of standards about emissions reporting for server component manufacturing and operation, so we are engaging with standards bodies like the Open Compute Project to help define sustainability metrics for the industry at large. This post explains how we are increasing our efficiency and decreasing our carbon footprint generationally, but there should be a clear methodology that we can use to ensure that you know what kind of businesses you are supporting.

The Greenhouse Gas (GHG) Protocol initiative is doing a great job developing internationally accepted GHG accounting and reporting standards for business and to promote their broad adoption. They define scope 1 emissions to be the “direct carbon accounting of a reporting company’s operations” which is somewhat easy to calculate, and quantify scope 3 emissions as “the indirect value chain emissions.” To have standardized metrics across the entire life cycle of generating equipment, we need the carbon footprint of the subcomponents’ manufacturing process, supply chains, transportation, and even the construction methods used in building our data centers.

Ensuring embodied carbon is measured consistently across vendors is a necessity for building industry-standard, defensible metrics.

Helping to build a better, greener, Internet

The carbon impact of the cloud has a meaningful impact on the Earth–by some accounts, the ICT footprint will be 21% of global energy demand by 2030. We’re absolutely committed to keeping Cloudflare’s footprint on the planet as small as possible. If you’ve made it this far through, and you’re interested in contributing to building the most global, efficient, and sustainable network on the Internet — the Hardware Systems Engineering team is hiring. Come join us.

Successful Hack of Time-Triggered Ethernet

2022-11-18 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/11/successful-hack-of-time-triggered-ethernet.html

Time-triggered Ethernet (TTE) is used in spacecraft, basically to use the same hardware to process traffic with different timing and criticality. Researchers have defeated it:

On Tuesday, researchers published findings that, for the first time, break TTE’s isolation guarantees. The result is PCspooF, an attack that allows a single non-critical device connected to a single plane to disrupt synchronization and communication between TTE devices on all planes. The attack works by exploiting a vulnerability in the TTE protocol. The work was completed by researchers at the University of Michigan, the University of Pennsylvania, and NASA’s Johnson Space Center.

“Our evaluation shows that successful attacks are possible in seconds and that each successful attack can cause TTE devices to lose synchronization for up to a second and drop tens of TT messages—both of which can result in the failure of critical systems like aircraft or automobiles,” the researchers wrote. “We also show that, in a simulated spaceflight mission, PCspooF causes uncontrolled maneuvers that threaten safety and mission success.”

Much more detail in the article—and the research paper.

Let’s Architect! Architecting with custom chips and accelerators

2022-09-21 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-custom-chips-and-accelerators/

It’s hard to imagine a world without computer chips. They are at the heart of the devices that we use to work and play every day. Currently, Amazon Web Services (AWS) is offering customers the next generation of computer chip, with lower cost, higher performance, and a reduced carbon footprint.

This edition of Let’s Architect! focuses on custom computer chips, accelerators, and technologies developed by AWS, such as AWS Nitro System, custom-designed Arm-based AWS Graviton processors that support data-intensive workloads, as well as AWS Trainium, and AWS Inferentia chips optimized for machine learning training and inference.

In this post, we discuss these new AWS technologies, their main characteristics, and how to take advantage of them in your architecture.

Deliver high performance ML inference with AWS Inferentia

As Deep Learning models become increasingly large and complex, the training cost for these models increases, as well as the inference time for serving.

With AWS Inferentia, machine learning practitioners can deploy complex neural-network models that are built and trained on popular frameworks, such as Tensorflow, PyTorch, and MXNet on AWS Inferentia-based Amazon EC2 Inf1 instances.

This video introduces you to the main concepts of AWS Inferentia, a service designed to reduce both cost and latency for inference. To speed up inference, AWS Inferentia: selects and shares a model across multiple chips, places pieces inside the on-chip cache, then streams the data via pipeline for low-latency predictions.

Presenters discuss through the structure of the chip, software considerations, as well as anecdotes from the Amazon Alexa team, who uses AWS Inferentia to serve predictions. If you want to learn more about high throughput coupled with low latency, explore Achieve 12x higher throughput and lowest latency for PyTorch Natural Language Processing applications out-of-the-box on AWS Inferentia on the AWS Machine Learning Blog.

AWS Inferentia shares a model across different chips to speed up inference

AWS Lambda Functions Powered by AWS Graviton2 Processor – Run Your Functions on Arm and Get Up to 34% Better Price Performance

AWS Lambda is a serverless, event-driven compute service that enables code to run from virtually any type of application or backend service, without provisioning or managing servers. Lambda uses a high-availability compute infrastructure and performs all of the administration of the compute resources, including server- and operating-system maintenance, capacity-provisioning, and automatic scaling and logging.

AWS Graviton processors are designed to deliver the best price and performance for cloud workloads. AWS Graviton3 processors are the latest in the AWS Graviton processor family and provide up to: 25% increased compute performance, two-times higher floating-point performance, and two-times faster cryptographic workload performance compared with AWS Graviton2 processors. This means you can migrate AWS Lambda functions to Graviton in minutes, plus get as much as 19% improved performance at approximately 20% lower cost (compared with x86).

Comparison between x86 and Arm/Graviton2 results for the AWS Lambda function computing prime numbers (click to enlarge)

Powering next-gen Amazon EC2: Deep dive on the Nitro System

The AWS Nitro System is a collection of building-block technologies that includes AWS-built hardware offload and security components. It is powering the next generation of Amazon EC2 instances, with a broadening selection of compute, storage, memory, and networking options.

In this session, dive deep into the Nitro System, reviewing its design and architecture, exploring new innovations to the Nitro platform, and understanding how it allows for fasting innovation and increased security while reducing costs.

Traditionally, hypervisors protect the physical hardware and bios; virtualize the CPU, storage, networking; and provide a rich set of management capabilities. With the AWS Nitro System, AWS breaks apart those functions and offloads them to dedicated hardware and software.

AWS Nitro System separates functions and offloads them to dedicated hardware and software, in place of a traditional hypervisor

How Amazon migrated a large ecommerce platform to AWS Graviton

In this re:Invent 2021 session, we learn about the benefits Amazon’s ecommerce Datapath platform has realized with AWS Graviton.

With a range of 25%-40% performance gains across 53,000 Amazon EC2 instances worldwide for Prime Day 2021, the Datapath team is lowering their internal costs with AWS Graviton’s improved price performance. Explore the software updates that were required to achieve this and the testing approach used to optimize and validate the deployments. Finally, learn about the Datapath team’s migration approach that was used for their production deployment.

AWS Graviton2: core components

See you next time!

Thanks for exploring custom computer chips, accelerators, and technologies developed by AWS. Join us in a couple of weeks when we talk more about architectures and the daily challenges faced while working with distributed systems.

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Levels of Assurance for DoD Microelectronics

2022-08-29 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/08/levels-of-assurance-for-dod-microelectronics.html

The NSA has has published criteria for evaluating levels of assurance required for DoD microelectronics.

The introductory report in a DoD microelectronics series outlines the process for determining levels of hardware assurance for systems and custom microelectronic components, which include application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) and other devices containing reprogrammable digital logic.

The levels of hardware assurance are determined by the national impact caused by failure or subversion of the top-level system and the criticality of the component to that top-level system. The guidance helps programs acquire a better understanding of their system and components so that they can effectively mitigate against threats.

The report was published last month, but I only just noticed it.

Security and Cheap Complexity

2022-08-26 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/08/security-and-cheap-complexity.html

I’ve been saying that complexity is the worst enemy of security for a long time now. (Here’s me in 1999.) And it’s been true for a long time.

In 2018, Thomas Dullien of Google’s Project Zero talked about “cheap complexity.” Andrew Appel summarizes:

The anomaly of cheap complexity. For most of human history, a more complex device was more expensive to build than a simpler device. This is not the case in modern computing. It is often more cost-effective to take a very complicated device, and make it simulate simplicity, than to make a simpler device. This is because of economies of scale: complex general-purpose CPUs are cheap. On the other hand, custom-designed, simpler, application-specific devices, which could in principle be much more secure, are very expensive.

This is driven by two fundamental principles in computing: Universal computation, meaning that any computer can simulate any other; and Moore’s law, predicting that each year the number of transistors on a chip will grow exponentially. ARM Cortex-M0 CPUs cost pennies, though they are more powerful than some supercomputers of the 20th century.

The same is true in the software layers. A (huge and complex) general-purpose operating system is free, but a simpler, custom-designed, perhaps more secure OS would be very expensive to build. Or as Dullien asks, “How did this research code someone wrote in two weeks 20 years ago end up in a billion devices?”

This is correct. Today, it’s easier to build complex systems than it is to build simple ones. As recently as twenty years ago, if you wanted to build a refrigerator you would create custom refrigerator controller hardware and embedded software. Today, you just grab some standard microcontroller off the shelf and write a software application for it. And that microcontroller already comes with an IP stack, a microphone, a video port, Bluetooth, and a whole lot more. And since those features are there, engineers use them.

M1 Chip Vulnerability

2022-06-15 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/06/m1-chip-vulnerability.html

This is a new vulnerability against Apple’s M1 chip. Researchers say that it is unpatchable.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory, however, have created a novel hardware attack, which combines memory corruption and speculative execution attacks to sidestep the security feature. The attack shows that pointer authentication can be defeated without leaving a trace, and as it utilizes a hardware mechanism, no software patch can fix it.

The attack, appropriately called “Pacman,” works by “guessing” a pointer authentication code (PAC), a cryptographic signature that confirms that an app hasn’t been maliciously altered. This is done using speculative execution—a technique used by modern computer processors to speed up performance by speculatively guessing various lines of computation—to leak PAC verification results, while a hardware side-channel reveals whether or not the guess was correct.

What’s more, since there are only so many possible values for the PAC, the researchers found that it’s possible to try them all to find the right one.

It’s not obvious how to exploit this vulnerability in the wild, so I’m unsure how important this is. Also, I don’t know if it also applies to Apple’s new M2 chip.

Research paper. Another news article.

Cloudflare’s approach to handling BMC vulnerabilities

2022-05-26 Derek Chamorro

Post Syndicated from Derek Chamorro original https://blog.cloudflare.com/bmc-vuln/

Cloudflare’s approach to handling BMC vulnerabilities

In recent years, management interfaces on servers like a Baseboard Management Controller (BMC) have been the target of cyber attacks including ransomware, implants, and disruptive operations. Common BMC vulnerabilities like Pantsdown and USBAnywhere, combined with infrequent firmware updates, have left servers vulnerable.

We were recently informed from a trusted vendor of new, critical vulnerabilities in popular BMC software that we use in our fleet. Below is a summary of what was discovered, how we mitigated the impact, and how we look to prevent these types of vulnerabilities from having an impact on Cloudflare and our customers.

Background

A baseboard management controller is a small, specialized processor used for remote monitoring and management of a host system. This processor has multiple connections to the host system, giving it the ability to monitor hardware, update BIOS firmware, power cycle the host, and many more things.

Access to the BMC can be local or, in some cases, remote. With remote vectors open, there is potential for malware to be installed on the BMC from the local host via PCI Express or the Low Pin Count (LPC) interface. With compromised software on the BMC, malware or spyware could maintain persistence on the server.

According to the National Vulnerability Database, the two BMC chips (ASPEED AST2400 and AST2500) have implemented Advanced High-Performance Bus (AHB) bridges, which allow arbitrary read and write access to the physical address space of the BMC from the host. This means that malware running on the server can also access the RAM of the BMC.

These BMC vulnerabilities are sufficient to enable ransomware propagation, server bricking, and data theft.

Impacted versions

Numerous vulnerabilities were found to affect the QuantaGrid D52B cloud server due to vulnerable software found in the BMC. These vulnerabilities are associated with specific interfaces that are exposed on AST2400 and AST2500 and explained in CVE-2019-6260. The vulnerable interfaces in question are:

iLPC2AHB bridge Pt I
iLPC2AHB bridge Pt II
PCIe VGA P2A bridge
DMA from/to arbitrary BMC memory via X-DMA
UART-based SoC Debug interface
LPC2AHB bridge
PCIe BMC P2A bridge
Watchdog setup

An attacker might be able to update the BMC directly using SoCFlash through inband LPC or BMC debug universal async receiver-transmitter (UART) serial console. While this might be thought of as a usual path in case of total corruption, this is actually an abuse within SoCFlash by using any open interface for flashing.

Mitigations and response

Updated firmware

We reached out to one of our manufacturers, Quanta, to validate that existing firmware within a subset of systems was in fact patched against these vulnerabilities. While some versions of our firmware were not vulnerable, others were. A patch was released, tested, and deployed on the affected BMCs within our fleet.

Cloudflare Security and Infrastructure teams also proactively worked with additional manufacturers to validate their own BMC patches were not explicitly vulnerable to these firmware vulnerabilities and interfaces.

Reduced exposure of BMC remote interfaces

It is a standard practice within our data centers to implement network segmentation to separate different planes of traffic. Our out-of-band networks are not exposed to the outside world and only accessible within their respective data centers. Access to any management network goes through a defense in depth approach, restricting connectivity to jumphosts and authentication/authorization through our zero trust Cloudflare One service.

Reduced exposure of BMC local interfaces

Applications within a host are limited in what can call out to the BMC. This is done to restrict what can be done from the host to the BMC and allow for secure in-band updating and userspace logging and monitoring.

Do not use default passwords

This sounds like common knowledge for most companies, but we still follow a standard process of changing not just the default username and passwords that come with BMC software, but disabling the default accounts to prevent them from ever being used. Any static accounts follow a regular password rotation.

BMC logging and auditing

We log all activity by default on our BMCs. Logs that are captured include the following:

Authentication (Successful, Unsuccessful)
Authorization (user/service)
Interfaces (SOL, CLI, UI)
System status (Power on/off, reboots)
System changes (firmware updates, flashing methods)

We were able to validate that there was no malicious activity detected.

What’s next for the BMC

Cloudflare regularly works with several original design manufacturers (ODMs) to produce the highest performing, efficient, and secure computing systems according to our own specifications. The standard processors used for our baseboard management controller often ship with proprietary firmware which is less transparent and more cumbersome to maintain for us and our ODMs. We believe in improving on every component of the systems we operate in over 270 cities around the world.

OpenBMC

We are moving forward with OpenBMC, an open-source firmware for our supported baseboard management controllers. Based on the Yocto Project, a toolchain for Linux on embedded systems, OpenBMC will enable us to specify, build, and configure our own firmware based on the latest Linux kernel featureset per our specification, similar to the physical hardware and ODMs.

OpenBMC firmware will enable:

Latest stable and patched Linux kernel
Internally-managed TLS certificates for secure, trusted communication across our isolated management network
Fine-grained credentials management
Faster response time for patching and critical updates

While many of these features are community-driven, vulnerabilities like Pantsdown are patched quickly.

Extending secure boot

You may have read about our recent work securing the boot process with a hardware root-of-trust, but the BMC has its own boot process that often starts as soon as the system gets power. Newer versions of the BMC chips we use, as well as leveraging cutting edge security co-processors, will allow us to extend our secure boot capabilities prior to loading our UEFI firmware by validating cryptographic signatures on our BMC/OpenBMC firmware. By extending our security boot chain to the very first device that has power to our systems, we greatly reduce the impact of malicious implants that can be used to take down a server.

Conclusion

While this vulnerability ended up being one we could quickly resolve through firmware updates with Quanta and quick action by our teams to validate and patch our fleet, we are continuing to innovate through OpenBMC, and secure root of trust to ensure that our fleet is as secure as possible. We are grateful to our partners for their quick action and are always glad to report any risks and our mitigations to ensure that you can trust how seriously we take your security.

Testing Faraday Cages

2021-12-03 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/12/testing-faraday-cages.html

Matt Blaze tested a variety of Faraday cages for phones, both commercial and homemade.

The bottom line:

A quick and likely reliable “go/no go test” can be done with an Apple AirTag and an iPhone: drop the AirTag in the bag under test, and see if the phone can locate it and activate its alarm (beware of caching in the FindMy app when doing this).

This test won’t tell you the exact attenuation level, of course, but it will tell you if the attenuation is sufficient for most practical purposes. It can also detect whether an otherwise good bag has been damaged and compromised.

At least in the frequency ranges I tested, two commercial Faraday pouches (the EDEC OffGrid and Mission Darkness Window pouches) yielded excellent performance sufficient to provide assurance of signal isolation under most real-world circumstances. None of the makeshift solutions consistently did nearly as well, although aluminum foil can, under ideal circumstances (that are difficult to replicate) sometimes provide comparable levels of attenuation.

New Rowhammer Technique

2021-11-19 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/11/new-rowhammer-technique.html

Rowhammer is an attack technique involving accessing — that’s “hammering” — rows of bits in memory, millions of times per second, with the intent of causing bits in neighboring rows to flip. This is a side-channel attack, and the result can be all sorts of mayhem.

Well, there is a new enhancement:

All previous Rowhammer attacks have hammered rows with uniform patterns, such as single-sided, double-sided, or n-sided. In all three cases, these “aggressor” rows — meaning those that cause bitflips in nearby “victim” rows — are accessed the same number of times.

Research published on Monday presented a new Rowhammer technique. It uses non-uniform patterns that access two or more aggressor rows with different frequencies. The result: all 40 of the randomly selected DIMMs in a test pool experienced bitflips, up from 13 out of 42 chips tested in previous work from the same researchers.

[…]

The non-uniform patterns work against Target Row Refresh. Abbreviated as TRR, the mitigation works differently from vendor to vendor but generally tracks the number of times a row is accessed and recharges neighboring victim rows when there are signs of abuse. The neutering of this defense puts further pressure on chipmakers to mitigate a class of attacks that many people thought more recent types of memory chips were resistant to.

Measuring Hyper-Threading and Turbo Boost

2021-10-05 Sung Park

Post Syndicated from Sung Park original https://blog.cloudflare.com/measuring-hyper-threading-and-turbo-boost/

Measuring Hyper-Threading and Turbo Boost

We often put together experiments that measure hardware performance to improve our understanding and provide insights to our hardware partners. We recently wanted to know more about Hyper-Threading and Turbo Boost. The last time we assessed these two technologies was when we were still deploying the Intel Xeons (Skylake/Purley), but beginning with our Gen X servers we switched over to the AMD EPYC (Zen 2/Rome). This blog is about our latest attempt at quantifying the performance impact of Hyper-Threading and Turbo Boost on our AMD-based servers running our software stack.

Intel briefly introduced Hyper-Threading with NetBurst (Northwood) back in 2002, then reintroduced Hyper-Threading six years later with Nehalem along with Turbo Boost. AMD presented their own implementation of these technologies with Zen in 2017, but AMD’s version of Turbo Boost actually dates back to AMD K10 (Thuban), in 2010, when it used to be called Turbo Core. Since Zen, Hyper-Threading and Turbo Boost are known as simultaneous multithreading (SMT) and Core Performance Boost (CPB), respectively. The underlying implementation of Hyper-Threading and Turbo Boost differs between the two vendors, but the high-level concept remains the same.

Hyper-Threading or simultaneous multithreading creates a second hardware thread within a processor’s core, also known as a logical core, by duplicating various parts of the core to support the context of a second application thread. The two hardware threads execute simultaneously within the core, across their dedicated and remaining shared resources. If neither hardware threads contend over a particular shared resource, then the throughput can be drastically increased.

Turbo Boost or Core Performance Boost opportunistically allows the processor to operate beyond its rated base frequency as long as the processor operates within guidelines set by Intel or AMD. Generally speaking, the higher the frequency, the faster the processor finishes a task.

Simulated Environment

CPU Specification

Our Gen X or 10th generation servers are powered by the AMD EPYC 7642, based on the Zen 2 microarchitecture. The vast majority of the Zen 2-based processors along with its successor Zen 3 that our Gen 11 servers are based on, supports simultaneous multithreading and Core Performance Boost.

Similar to Intel’s Hyper-Threading, AMD implemented 2-way simultaneous multithreading. The AMD EPYC 7642 has 48 cores, and with simultaneous multithreading enabled it can simultaneously execute 96 hardware threads. Core Performance Boost allows the AMD EPYC 7642 to operate anywhere between 2.3 to 3.3 GHz, depending on the workload and limitations imposed on the processor. With Core Performance Boost disabled, the processor will operate at 2.3 GHz, the rated base frequency on the AMD EPYC 7642. We took our usual simulated traffic pattern of 10 KiB cached assets over HTTPS, provided by our performance team, to generate a sustained workload that saturated the processor to 100% CPU utilization.

Results

After establishing a baseline with simultaneous multithreading and Core Performance Boost disabled, we started enabling one feature at a time. When we enabled Core Performance Boost, the processor operated near its peak turbo frequency, hovering between 3.2 to 3.3 GHz which is more than 39% higher than the base frequency. Higher operating frequency directly translated into 40% additional requests per second. We then disabled Core Performance Boost and enabled simultaneous multithreading. Similar to Core Performance Boost, simultaneous multithreading alone improved requests per second by 43%. Lastly, by enabling both features, we observed an 86% improvement in requests per second.

Latencies were generally lowered by either or both Core Performance Boost and simultaneous multithreading. While Core Performance Boost consistently maintained a lower latency than the baseline, simultaneous multithreading gradually took longer to process a request as it reached tail latencies. Though not depicted in the figure below, when we examined beyond p9999 or 99.99th percentile, simultaneous multithreading, even with the help of Core Performance Boost, exponentially increased in latency by more than 150% over the baseline, presumably due to the two hardware threads contending over a shared resource within the core.

Production Environment

Moving into production, since our traffic fluctuates throughout the day, we took four identical Gen X servers and measured in parallel during peak hours. The only changes we made to the servers were enabling and disabling simultaneous multithreading and Core Performance Boost to create a comprehensive test matrix. We conducted the experiment in two different regions to identify any anomalies and mismatching trends. All trends were alike.

Before diving into the results, we should preface that the baseline server operated at a higher CPU utilization than others. Every generation, our servers deliver a noticeable improvement in performance. So our load balancer, named Unimog, sends a different number of connections to the target server based on its generation to balance out the CPU utilization. When we disabled simultaneous multithreading and Core Performance Boost, the baseline server’s performance degraded to the point where Unimog encountered a “guard rail” or the lower limit on the requests sent to the server, and so its CPU utilization rose instead. Given that the baseline server operated at a higher CPU utilization, the baseline server processed more requests per second to meet the minimum performance threshold.

Results

Due to the skewed baseline, when core performance boost was enabled, we only observed 7% additional requests per second. Next, simultaneous multithreading alone improved requests per second by 41%. Lastly, with both features enabled, we saw an 86% improvement in requests per second.

Though we lack concrete baseline data, we can normalize requests per second by CPU utilization to approximate the improvement for each scenario. Once normalized, the estimated improvement in requests per second from core performance boost and simultaneous multithreading were 36% and 80%, respectively. With both features enabled, requests per second improved by 136%.

Latency was not as interesting since the baseline server operated at a higher CPU utilization, and in turn, it produced a higher tail latency than we would have otherwise expected. All other servers maintained a lower latency due to their lower CPU utilization in conjunction with Core Performance Boost, simultaneous multithreading, or both.

At this point, our experiment did not go as we had planned. Our baseline is skewed, and we only got half useful answers. However, we find experimenting to be important because we usually end up finding other helpful insights as well.

Let’s add power data. Since our baseline server was operating at a higher CPU utilization, we knew it was serving more requests and therefore, consumed more power than it needed to. Enabling Core Performance Boost allowed the processor to run up to its peak turbo frequency, increasing power consumption by 35% over the skewed baseline. More interestingly, enabling simultaneous multithreading increased power consumption by only 7%. Combining Core Performance Boost with simultaneous multithreading resulted in 58% increase in power consumption.

AMD’s implementation of simultaneous multithreading appears to be power efficient as it achieves 41% additional requests per second while consuming only 7% more power compared to the skewed baseline. For completeness, using the data we have, we bridged performance and power together to obtain performance per watt to summarize power efficiency. We divided the non-normalized requests per second by power consumption to produce the requests per watt figure below. Our Gen X servers attained the best performance per watt by enabling just simultaneous multithreading.

Conclusion

In our assessment of AMD’s implementation of Hyper-Threading and Turbo Boost, the original experiment we designed to measure requests per second and latency did not pan out as expected. As soon as we entered production, our baseline measurement was skewed due to the imbalance in CPU utilization and only partially reproduced our lab results.

We added power to the experiment and found other meaningful insights. By analyzing the performance and power characteristics of simultaneous multithreading and Core Performance Boost, we concluded that simultaneous multithreading could be a power-efficient mechanism to attain additional requests per second. Drawbacks of simultaneous multithreading include long tail latency that is currently curtailed by enabling Core Performance Boost. While the higher frequency enabled by Core Performance Boost provides latency reduction and more requests per second, we are more mindful that the increase in power consumption is quite significant.

Do you want to help shape the Cloudflare network? This blog was a glimpse of the work we do at Cloudflare. Come join us and help complete the feedback loop for our developers and hardware partners.

Router Security

2021-02-19 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/02/router-security.html

This report is six months old, and I don’t know anything about the organization that produced it, but it has some alarming data about router security.

Conclusion: Our analysis showed that Linux is the most used OS running on more than 90% of the devices. However, many routers are powered by very old versions of Linux. Most devices are still powered with a 2.6 Linux kernel, which is no longer maintained for many years. This leads to a high number of critical and high severity CVEs affecting these devices.

Since Linux is the most used OS, exploit mitigation techniques could be enabled very easily. Anyhow, they are used quite rarely by most vendors except the NX feature.

A published private key provides no security at all. Nonetheless, all but one vendor spread several private keys in almost all firmware images.

Mirai used hard-coded login credentials to infect thousands of embedded devices in the last years. However, hard-coded credentials can be found in many of the devices and some of them are well known or at least easy crackable.

However, we can tell for sure that the vendors prioritize security differently. AVM does better job than the other vendors regarding most aspects. ASUS and Netgear do a better job in some aspects than D-Link, Linksys, TP-Link and Zyxel.

Additionally, our evaluation showed that large scale automated security analysis of embedded devices is possible today utilizing just open source software. To sum it up, our analysis shows that there is no router without flaws and there is no vendor who does a perfect job regarding all security aspects. Much more effort is needed to make home routers as secure as current desktop of server systems.

One comment on the report:

One-third ship with Linux kernel version 2.6.36 was released in October 2010. You can walk into a store today and buy a brand new router powered by software that’s almost 10 years out of date! This outdated version of the Linux kernel has 233 known security vulnerabilities registered in the Common Vulnerability and Exposures (CVE) database. The average router contains 26 critically-rated security vulnerabilities, according to the study.

We know the reasons for this. Most routers are designed offshore, by third parties, and then private labeled and sold by the vendors you’ve heard of. Engineering teams come together, design and build the router, and then disperse. There’s often no one around to write patches, and most of the time router firmware isn’t even patchable. The way to update your home router is to throw it away and buy a new one.

And this paper demonstrates that even the new ones aren’t likely to be secure.

Challenges dealing with broken servers

Using automation as an autonomous system

Introducing Phoenix

Discovery

Diagnostics

Recovery

Visibility

Balancing automation and empathy: Error Budgets

Where we go from here

Humble Beginnings

System Blueprint

Supercharged Automations

Engineer Productivity

Looking Ahead

A brief introduction to baseboard management controllers

Benefits of DC-SCM based server design

PCB cost reduction

Modularized design enables flexibility

Unified interoperable OpenBMC firmware development

Project Argus

The security foundation for our Gen 12 hardware

Background

The attack

Fixing the bleed

Memory rank

Memory depth and width

Memory module capacity

ACTIVATE timing and DRAM page sizes

An experiment with 2Rx4 and 2Rx8 DDR4 modules

Memory Latency Checker results

LMBench Results

Summary

References

Background

How we update the BIOS

Other firmware updates such as BMC, network cards and more

Conclusion

Arm Trusted Firmware Secure Boot

BL1

BL2

BL3

Amping it up

Ampere Secure Boot

Authentication flow

Single Domain Secure Boot

Building blocks

Fusing the keys

Final manufacturing flow

Validation

Updated authentication flow

Let’s boot!

Still more to do

Replacing legacy hardware with Cloudflare’s network

The life of a hardware box

Option 1: Remarket / Reuse

Option 2: Recycle

Option 3: Destroy

Partnering with Iron Mountain to make sustainable goals more accessible

Get started today

Sustainability in the realm of servers

Use of Product Emissions

Embodied Emissions

Modular Design

Standards-based Hardware to Encourage Re-use

Recycling

What Cloudflare is Doing To Reduce Our Server Impact

Gen 12: Walking the Talk

A modular-driven design

A Push for Open Standards, Too

Using the most power-efficient solutions for our workloads

Moving Forward: Industry Standard Reporting

Helping to build a better, greener, Internet

See you next time!

Other posts in this series

Looking for more architecture content?

Background

Impacted versions

Mitigations and response

Updated firmware

Reduced exposure of BMC remote interfaces