Tag Archives: hard drive stats

Backblaze Drive Stats for Q3 2025

2025-11-11 Drive Stats Team

Post Syndicated from Drive Stats Team original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2025/

An illustration of chart bars with the words Backblaze S3 2025 Drive Stats overlaid

Every quarter, Drive Stats gives us the numbers. This quarter, it gave us a crisis of meaning. What does it really mean for a hard drive to fail? Is it the moment the lights go out, or the moment we decide they have? Philosophers might call that an ontological gray area. We just call it Q3.

As of June 30, 2025, we had 332,915 drives under management. Of that total, there were 3,970 boot drives and 328,348 data drives. Let’s dig into our stats, then talk about the meaning of failure.

This quarter, we have more to talk about (Stats-wise)

Drive Stats was the beginning. Want to see more of the full picture? Check out the Stats Lab webinar, bringing together content from all of our Stats articles. We’re going to chat about all things Backblaze (and beyond)—by the numbers.

Drive Stats: The digest version

Q3 2025 hard drive failure rates

During Q3 2025, we were tracking 328,348 storage drives. Here are the numbers:

Backblaze Hard Drive Failure Rates for Q3 2025

Reporting period July 1, 2025–September 30, 2025 inclusive
Drive models with drive count > 100 as of July 1, 2025 and drive days > 10,000 in Q3 2025

Notes and observations

The failure rate has increased: The failure rate has changed, and by quite a bit. As a reminder, last quarter’s AFR was 1.36% compared with this quarter’s 1.55%. (Interestingly, the 2024 yearly AFR was 1.57%.)
That new drive energy: Say hello to the 24TB Toshiba MG11ACA24TE, joining the drive pool with 2,400 drives and 24,148 drive days. That means that we’ve hit the thresholds for the quarterly stats, but not the lifetime.
The zero failure club: It was a big month for the zero failure club, with four drives making the cut:
- Seagate HMS5C4040BLE640 (4TB)
- Seagate ST8000NM000A (8TB)
- Toshiba MG09ACA16TE (16TB)
- Toshiba MG11ACA24TE (24TB)—and yes, that’s the new drive.

For those of you tracking the stats closely, you’ll notice that the Seagate ST8000NM000A (8TB) is a frequent flier on this list. The last time it had a failure was in Q3 2024—and it was just a single failure for the whole quarter!

The highest AFRs were really high: The high end was so high that this month, it inspired us to run an outlier analysis using the standard quartile analysis (Tukey method). Based on that information, any drive with a quarterly AFR higher than 5.88% is an outlier, and there are three:
- Seagate ST10000NM0086 (10TB): 7.97%
- Seagate ST14000NM0138 (14TB): 6.86%
- Toshiba MG08ACA16TEY (16TB): 16.95%

What’s going on there? Great question, and we’ll get into that after the lifetime failure rates.

Lifetime hard drive failure rates

To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q2 2025 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had drives grouped into 27 models remaining for analysis as shown in the table below.

Backblaze Hard Drive Failure Rates for Q2 2025

Reporting period ending September 30, 2025
Drive models > 500 drives and > 100,000 lifetime drive days

Notes and observations

That lifetime AFR is pretty consistent, isn’t it? The lifetime AFR is 1.31%. Last quarter we reported that it was 1.30%, and the quarter before that, it was 1.31%.
The 4TB average age hasn’t shifted: As we’ve reported on previously, the 4TB drives are being decommissioned over time. Now, we’re down to just a handful left—just 11 of the ALE models and 187 of the BLE models. But, because their lifetime populations are so comparatively large, the additional drive days aren’t enough to move the needle on the average age in months. So, no ghosts in the machine here, and decommissioning is proceeding as planned.
Steady uptick in higher capacity drives: Of the 20TB+ drives that meet our lifetime data parameters, we’ve added 7,936 since last quarter. And, don’t forget that our newest entrée to the cohort, the Toshiba MG11ACA24TE (24TB), hasn’t made its way to this table yet—that adds an additional 2,400 drive models. All together, the 20TB+ club represents 67,939 drives, or about 21% of the drive pool.

Defining a failure—from a technical perspective

A question that’s come up a few times when we’re hosting a webinar or chatting in the comments section is how we define a failure. While it may seem intuitive, it’s actually something of a meaty conundrum, and something we haven’t addressed since the early days of this series. Tracking down the answer to this question touches internal drive fleet monitoring tools (via SMART stats), the actual Drive Stats collection program, and our data engineering layer. I’ll dig into each of these in detail, then we’ll take a look at the outliers for this quarter.

SMART stats reporting

We use Smartmontools to collect the SMART attributes of drives, and another monitoring tool called drive sentinel to flag read/write errors that exceed a certain threshold as well as some other anomalies.

The main indicator we use for determining if a drive should be replaced is when it responds to reads with uncorrectable medium errors. When a drive reads the data from the disk, but the data fails its integrity check, the drive will try to reconstruct the data using internal error correction codes. If it is unable to reconstruct the data, it notifies the host by reporting it as an uncorrectable error and marks that part of the disk as pending reallocation, which shows up in SMART under an attribute like Current_Pending_Sector.

On Storage Pods that control drives through SATA links, the drive sentinel will count the number of these uncorrectable errors a drive reports and if it exceeds a threshold, access to the drive will be removed. This is important in the classic Backblaze Storage Pods where five drives share a single SATA link and errors by one drive will affect all drives on the link.

On Dell and SMCI pods that use a SAS topology to connect drives, drive sentinel doesn’t remove access to drives because the errors are reported differently; but, that’s also not as critical since SAS minimizes the impact that a problem disk can have on others.

The Drive Stats program

We’ve talked about the custom program we use to collect Drive Stats in the past, and here’s a quick recap:

The podstats generator runs on every Storage Pod, what we call any host that holds customer data, every few minutes. It’s a C++ program that collects SMART stats and a few other attributes, then converts them into an .xml file (“podstats”). Those are then pushed to a central host in each datacenter and bundled. Once the data leaves these central hosts, it has entered the domain of what we will call Drive Stats.

For this program, the logic is relatively simple: A failure in Drive Stats occurs when a drive vanishes out of the reporting population. It is considered “failed” until it shows up again. Drives are tracked by serial number and we report daily logs on a per-drive basis, so truly, we can get pretty granular here.

The data engineering layer

To recap, we’ve collected our SMART stats and compiled them with the podstats program. Now we’ve got all the information, and data intelligence needs to add the context. A drive may go offline for a day or so (not return a response to those tools that collect daily logs of SMART stats), but it could be something as simple as a loose cable. So, time-wise, if a drive reappears after one day or 30, at what point in that period of time do we classify it as an official failure?

Previously, we manually cross-referenced data center work tickets, but these days, we’ve automated that process. On the backend, it’s a SQL query, but in human speak, this is what it comes down to:

If a drive logs data on the last day of the selection period (which in this case is a quarter) then it has not failed.
There are three human-curated tables that the query cross references. If a drive serial number appears on one of them, it tells us whether there’s a failure or not (depending on the table’s function).
If the drive serial number is the primary serial number in a drive replacement Jira ticket then it has failed. (Jira is where we track our data center work tickets.)
If the drive serial number is the target serial number in a clone Jira ticket or a (temp) replacement ticket, then it has not failed.

Basically, when we go to write the Drive Stats reports at the end of the quarter, if a drive has either appeared in one of our various work trackers or hasn’t re-entered the population, then it’s considered failed.

In rare instances, that can mean that we have so-called “cosmetic” failures when we have some work we’re doing on a drive model that lasts more than that quarterly collection period. And, spoiler, we have one of those instances that showed up in the data this month—our outlier Toshiba drive with the 16.9% failure rate. We’ll dig in in just a minute; but first, some context.

Connecting drive failure to overall picture of the drive pool

As we mentioned above, certain drives in the pool had such high swings in AFR that we ended up running an outlier analysis using the quartile method. (It’s also worth mentioning that a cluster analysis could potentially be a better fit, but we can save that for another day.) Based on that analysis, anything that has above a 5.88% failure rate is an outlier.

The primary motivation was inspired by an attempt to visualize the relationship between the age in months of a drive versus this quarter’s AFRs.

And yes, we’re fully aware that that’s a… super unreadable scatter plot. Removing the labels, this is a bit better:

We’re interested, really, in the shape of the relationship. If we posit that the older drives get, the higher their failure rates, you’d expect a larger concentration in the top right quadrant. But, our data follows a much more interesting pattern than that, with most of our data points concentrated in the lowest regions of the graph regardless of age—something you’d expect from a set of data that reflects a bunch of smart folks actively working towards the goal of a healthy drive population. And yet, we have some data points that break the mold.

As is pretty intuitive to my business intelligence folks in the audience, the process of identifying outliers is actionable data as well. Just like all press is good press; in our world, more data is more better. So, let’s take a closer look at those outliers. As a reminder, that’s these three drive models:

Seagate ST10000NM0086 (10TB): 7.97%
Seagate ST14000NM0138 (14TB): 6.86%
Toshiba MG08ACA16TEY (16TB): 16.95%

Seagate ST10000NM0086 (10TB)

This drive has some pretty explainable factors for the high failure rate. It’s well over seven years old (92.35 months). And, since it only has 1,018 drive models in operation, single failures hold a lot of weight compared with the average drive count per model—which comes in at 10,952 if you use the mean of this quarterly data and 6,177 if you use the median.

And, you can see that borne out in the trend in the last year of data:

Seagate ST14000NM0138 (14TB)

This drive is nearing five years in age (56.57 months) and, again, has a lower drive count at 1,286. More importantly, this particular drive model has had historically high failure rates. In parallel with above, here’s the last year of quarterly failure rates:

Toshiba MG08ACA16TEY (16TB)

Finally, our Toshiba model is the most interesting of all. It’s less than four years old (44.61 months), and has 5,145 drives in the pool. And, this quarter is clearly a change from its normal, decent, AFRs.

When we see deviations like this one, it’s usually an indication that there’s something afoot.

Never fear, Drive Stats fans; this was a known quantity before we went on this journey. This past quarter, working with Toshiba, we deployed some firmware updates they provided to optimize performance on these drives. Because we needed to pull drives to achieve this in some cases, we had an abnormal number of “failed” drives in this population.

What that means for this drive is that it’s actually not a bad drive model; and, given the ways we and Toshiba have worked together on a fix, we should see failure rates normalizing in the near future. And, this also goes back to our conversation of defining a failure—in this case, while the drives “failed,” the failure wasn’t mechanical and was based on something that we’ll be able to fix without replacing the drives. In short, don’t sweat the spike and pay attention to the long arc of performance on this population. We expect to see those drives happy and spinning for years to come (and with better performance, too).

The Hard Drive dataset (and beyond)

Thank you, as always, for making it through ~2,500 or so words to examine the fun side of data. Here’s our standard fine print:

You cite Backblaze as the source if you use the data;
You accept that you are solely responsible for how you use the data, and;
You do not sell this data itself to anyone; it is free.

If you’re a new Drive Stats fan, consider signing up for the newsletter. If you’re not ready for that kind of commitment, sound off in the comments section below or reach out directly to us to let us know what you’re working on. Happy investigating!

The post Backblaze Drive Stats for Q3 2025 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Are Hard Drives Getting Better? Let’s Revisit the Bathtub Curve

2025-10-15 Drive Stats Team

Post Syndicated from Drive Stats Team original https://www.backblaze.com/blog/are-hard-drives-getting-better-lets-revisit-the-bathtub-curve/

A decorative image showing stylized hard drives.

If you’ve hung around Backblaze for a while (and especially if you’re a Drive Stats fan), you may have heard us talking about the bathtub curve. In Drive Failure Over Time: The Bathtub Curve Is Leaking, we challenged one of reliability engineering’s oldest ideas—the notion that drive failures trace a predictable U-shaped curve over time.

But, the data didn’t agree. Our fleet showed dips, spikes, and plateaus that refused to behave. Now, after 13 years of continuous data, the picture is clearer—and stranger.

The bathtub curve isn’t just leaking, and the shape of reliability might look more like an ankle-high wall at the entrance to a walk-in shower. The neat story of early failures, calm middle age, and gentle decline no longer fits the world our drives inhabit. Drives are getting better—or, more precisely, the Drive Stats dataset says that our drives are performing better in data center environments.

So, let’s talk about what our current “bathtub curve” looks like, and how it compares to earlier generations of the analysis.

The TL;DR: Hard drives are getting better, and lasting longer.

The intro: Let’s talk bathtub curve

If you’ve spent any time around hardware reliability, you’ve seen it: a smooth U-shaped line called the bathtub curve. It promises order in the chaos of failure—a story where devices begin life with a burst of defects, settle into steady performance, and finally wear out in predictable decline. And, this is what it looks like:

For decades, it’s been engineering shorthand for how things die. But as our dataset has grown—more than a decade of drive telemetry and millions of drive-days—the data is clear: Our real drive population is more complicated.

What the bathtub curve looked like then

The first time we ran this analysis was in 2013, and when we updated the article in 2021, we shared this chart:

It shows the annualized failure rate (AFR) of the full drive pool over time (in years) at two different look-back points—2013 and 2021. At that time, you could already see that the bathtub curve was starting to, as the venerable Andy Klein put it, “leak.” The 2013 data looks the closest to a true bathtub curve, while the 2021 data shows fewer early failures and a lower failure rate for more years. We also see the average longevity of drives goes up by about two years before spiking into the failure zone.

Numbers can both define and obscure reality

Now, there are some very interesting factors that come into play when comparing hard drive reliability over time. For example, our usual caveats about how we use drives vs. how consumers use drives, how our workloads have changed over time, etc. More importantly, though, because we’re comparing averages, it’s easy to lose track of the context around our dataset—how many hard drives are we talking about in 2013 vs. 2021?

When we did this analysis in 2013, Backblaze had been open for six years, but we’d only been publishing the Drive Stats dataset since 2013. So, arriving at presenting a look-back at the data (i.e., this is how many drives failed when they were between zero and one years old) was a bit of a math problem compared to our usual data reporting. We were talking about drives that entered the drive pool in 2007, and those were ones we hadn’t shared complete daily logs about, even if the drive was still in service in 2013 (which, as you can tell from the data, was unlikely). We achieved that by looking at failures vs. logged on hours, and when we re-created the analysis recently, we used this SQL query:

CREATE VIEW introduction_dates AS
    -- Calculate the introduction date of drives that were already in service on 2013-04-10
    SELECT serial_number, date(date_add('hour', -1 * smart_9_raw, TIMESTAMP '2013-04-10 00:00:00')) AS introduced
    FROM drivestats
    WHERE date = DATE '2013-04-10'
    UNION
    -- Use the minimum date for drives that entered service after after 2013-04-10
    SELECT serial_number, MIN(date) as introduced
        FROM drivestats
        WHERE serial_number NOT IN (
            SELECT serial_number
            FROM drivestats 
            WHERE date = DATE '2013-04-10'
        )
        GROUP BY serial_number;

SELECT
    date_diff('day', d2.introduced, d1.date) / 91 AS age_in_quarters,
    100 * 365 * (cast(SUM(d1.failure) AS DOUBLE) / COUNT(*)) AS afr
FROM drivestats AS d1
INNER JOIN introduction_dates AS d2
ON d1.serial_number = d2.serial_number
GROUP BY 1
ORDER BY 1;

Our drive pool looked a lot different in 2013 as well. Not only was it smaller (~35,000 drives and over 100PB of data were live as of September 2014), but it also was made up of “consumer” drives. While we didn’t see much of a difference between the two when we actually tested them in the environment, we did a lot of drive farming in those days, a process that included actually “shelling” the drives and removing them from their housings—which means that our drive pool had a lot more potential to get some bumps along the way. Hard drives are pretty resilient and we were careful, but it’s worth noting.

By the time we were doing this analysis in 2021, we had a lot more data and a lot more storage drives—206,928 or so. Between 2013 and 2021, we had added capacity to our Sacramento data center; expanded our data center regions with locations in Phoenix and Amsterdam, with more on the way in 2022; we launched Backblaze B2 Cloud Storage; and, we went public.

All those things are cool from a historical perspective, but the more impactful thing to pay attention to is that any time you have less data (read: a smaller number of total drives), each individual data point has more impact on the whole. In the bathtub curve, you naturally reduce the number of drives as they get older—every drive has a day one, but not every drive has a day 1,461 (or, in lay people’s terms: four years, one day). With fewer drives, more spikes. So, if you start off with more drives, your numbers are likely to be more steady—unless there’s a real problem, or you’re entering your true drive pool failure zone.

And, since we’ve transitioned to buying more drives, and decommissioning drives in a different way—well, that all affects what the end result is. More on our drive hygiene habits later; for now, let’s get into our current data.

What the bathtub curve looks like now

Without further ado, let’s look at the failure rates in our current Backblaze drive pool:

That’s a pretty solid deviation in both age of drive failure and the high point of AFR from the last two times we’ve run the analyses. When we ran our 2025 numbers (at the close of Q2 2025), we reported on 317,230 drives. Take that as an approximate raw number given the normal drive exclusions in each Drive Stats report, but it gets you in the ballpark.

For consistency’s sake, here’s 2013:

And here’s 2021:

What’s missing, and a bit difficult to visualize, is the scale on both the x axis (time in years) and the y axis (annualized failure rate expressed in percentage). Let’s put all three on the same chart:

Note that both the 2013 data and the 2021 data have high failure percentage peaks at some point near the end of their drive lifetimes. In 2013, it was 13.73% at about 3 years, 3 months (and 13.30% at 3 years, 9 months). In 2021, it’s 14.24%, with that peak hitting at 7 years, 9 months.

Now, compare that with the 2025 data: Our peak is 4.25% at 10 years, 3 months (woah). Not only is that a significant improvement in drive longevity, it’s also the first time we’ve seen the peak drive failure rate at the hairy end of the drive curve. And, it’s about a third of each of the other failure peaks.

Meanwhile, we see that the drive failure rates on the front end of the curve are also incredibly low—when a drive is between zero and one years old, we barely crack 1.30% AFR. For reference, the most recent quarterly AFR is 1.36%.

Still, if we take a look at the trendlines, we can see that the 2021 and the 2025 data isn’t too far off, shape-wise. That is, we see a pretty even failure rate through the significant majority of the drives’ lives, then a fairly steep spike once we get into drive failure territory.

What does that mean? Well, drives are getting better, and lasting longer. And, given that our trendlines are about the same shape from 2021 to 2025, we should likely check back in when 2029 rolls around to see if our failure peak has pushed out even further.

Hey, what about that data contextualization you did above?

Good point—there are significant things that have changed about our dataset that may be affecting our numbers. We’ve already tackled the consumer vs. enterprise drive debate, and while we don’t have updated testing on that front, there are other things about buying drives at scale that may have an effect on the data.

For instance, because we buy drives in bulk, that means that a big chunk of drives enter our data pool at the same time. Given that we, over the years, have really only seen model-by-model variation, this means that if you get a lemon of a drive and you’ve added a lot of them, you may have a chunk of drives failing all at once.

Also, we have a different process for decommissioning drives these days. There are lots of things that go into that strategy, but you can simplify it all to risk management and our ability to grow our storage footprint over time. From a practical perspective, that means sometimes there are drives that are still performing well that we decide to take out of service anyway—and that means they get taken out of the fleet without ever having failed. Since our analyses above are based on annualized failure rate vs. age of drive, you can see a big drop in drive population without the expected failure rate spike.

Finally, we have different standards for new drives. Some of them just have to do with the industry at large—drives are getting bigger, and storage patterns are changing. But, compared with 2013, when a natural disaster forced us to innovate in unexpected ways, we’ve got more flexibility to consider our purchases, and to do so in a way that’s specific to our environment.

Was the bathtub curve just wrong?

The issue isn’t that the bathtub curve is wrong—it’s that it’s incomplete. It treats time as the only dimension of reliability, ignoring workload, manufacturing variation, firmware updates, and operational churn. And, it rests on a set of assumptions:

Devices are identical and operate under the same conditions.
Failures happen independently, driven mostly by time.
The environment stays constant across a product’s life.

The good news: When it comes to data centers, most of these are as true as they can be in a real-world environment. Data centers environments attempt to be as consistent as possible to be able to reduce power consumption, and to be able to properly anticipate and plan data workloads. Basically, consistency = a happy data center.

That said, conditions can’t ever be perfect. Our numbers have always and will always reflect both good planning and the unforeseen aspects of reality. Understanding whether drives are “good” or “bad” is always a conversation between what you theorize (in this case, the bathtub curve) and what happens (the Drive Stats dataset).

What’s next?

Why does all this talk of numbers matter? Well, as we’ve expanded our drive pool over time, in some ways, we’ve increased confidence in the results we’re seeing, both on day one and day 1,461. Even if we had the exact same drives models and drive pool make up (by percentage) from 2013 that we did in 2021, having more of them would give us better results. But, now we have a greater diversity of drives and more of them.

That doesn’t mean we’re the be-all, end-all of drive reliability, but it does give us some more footing to slice and dice the data and bring it back to you. As always, you can find the full Drive Stats dataset on our website, which means you can repeat this experiment, or use the data in any way you can imagine. Stay tuned for our quarterly reports and more articles from the Drive Stats extended universe—and feel free to sign up for the Drive Stats newsletter if you want to stay up-to-date.

The post Are Hard Drives Getting Better? Let’s Revisit the Bathtub Curve appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q2 2025

2025-08-05 Drive Stats Team

Post Syndicated from Drive Stats Team original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2025/

With hundreds of thousands of hard drives spinning 24/7, our data centers are less like peaceful white-noise oases and more like a a series of obstacle courses—if said obstacle courses were about managing over four exabytes of customer data from archival backups to streaming media to AI training datasets. Sure, they’re obstacle courses we all (and I’m including you, users of the internet) collectively create, but it’s no less of a balancing act to find the contestants (erm, hard drives) that can go the distance.

And we, dear readers, get to watch it all. Welcome to Drive Stats: where failure is inevitable, survival is fascinating, and every quarter brings a new leaderboard.

As of June 30, 2025, we had 321,201 drives under management. Of that total, there were 3,971 boot drives and 317,230 data drives. Stay tuned as we take our standard peek into quarterly and lifetime failure rates, and do a deep dive into the 20TB+ club.

As always, we’ll see you in the comments section. This month, you’ll also get three (count ‘em, three!) opportunities to talk to us in person as well—virtually at our Drive Stats LinkedIn Live on August 5 (today), or twice in Las Vegas at DefCon on August 7 and 8.

Sign up for the Drive Stats LinkedIn Live

Ready to dive deeper into the data? Tune in today at 10:00 a.m. PT, to query the Drive Stats team, Stephanie Doyle and Pat Patterson. We’ll see you there!

Drive Stats by the numbers: The digest version

An infographic summarizing key data points in this report, including drive count, drive failures, drive days, drive population by manufacturer, and a summary of the quarterly, annual, and lifetime AFRs.

Q2 2025 hard drive failure rates

For those that are new to the Drive Stats report, it’s worth mentioning that we have certain criteria that we use to select drives for consideration each quarter. We’ll discuss those in the next section, but for now, let’s talk about the data. The table below shows the failure rates for Q2 2025.

Backblaze Hard Drive Failure Rates for Q2 2025

Reporting period April 1, 2025–June 30, 2025 inclusive
Drive models with drive count > 100 as of June 30, 2025 and drive days > 10,000 in Q2 2025

Notes and observations

The annual failure rate is lower this quarter. We had some major fluctuations last quarter. Quoting ourselves from May 2025:

The quarterly failure rate is slightly higher. The quarterly failure rate went up from 1.35% to 1.42%. As with the zero-failure club, our higher-end outlier AFRs show some of the usual suspects:

We’re now back down to 1.36%. What’s changed?

Big swings in our higher-end failure rates: Well, some of the drives with higher failure rates have come down quite a bit. Notably, that includes the 12TB Seagate model ST12000NM0007, which was at a whopping 9.47% failure rate last quarter—down this quarter to only 3.58%. With its drive count holding more or less steady (1,038 in Q1 and 1,014 in Q2), that means a real change in failure rates. Note that this drive was at 8.72% in Q4 2024, so it’s worth keeping an eye on whether this is a fluke or a new pattern. Other significant drops include the 12TB HGST model HUH721212ALN604 (Q1: 4.97%; Q2: 3.39%) and the 14TB Seagate model ST14000NM0138 (Q1: 6.82%, Q2: 4.37%).
One new drive model on the way in: Welcome to the party, Toshiba MG09ACA16TE (16TB).
Zero failures for the quarter: Rising to the top, we’ve got only two this time around:
- Seagate ST8000NM000A (8TB)
- Seagate ST16000NM002J (16TB)

That 8TB Seagate is really shining, given this is its third quarter running with zero failures.

Bonus: One failure drives: Since we only have two 0 failures (and that just seems a little lackluster, doesn’t it?), it’s also worth mentioning the drives with only one failure this quarter:
- HGST HMS5C4040BLE640 (4TB)
- Seagate ST12000NM000J (12TB)
- Seagate ST14000NM000J (14TB)
- Toshiba MG09ACA16TE (16TB)

Drive model criteria

We noted earlier we removed 495 drives from consideration when we produced the table above covering Q2 2025. There are two primary reasons we did not consider these drive models.

Testing. These are drives of a given model that we monitor and collect Drive Stats data on, but are not considered production drives at this time. For example, drives undergoing certification testing to determine if they are performant enough for our environment are not included in our Drive Stats calculations.
Insufficient data points. When we calculate the annualized failure rate for a drive model for a given period of time (quarterly, annual, or lifetime), we want to ensure we have enough data to reliably do so. Therefore we have defined criteria for a drive model to be included in the tables and charts for the specified period of time. Models that do not meet these criteria are not included in the tables and charts for the period in question.

A table that outlines the drive inclusion parameters for each type of Backblaze Drive Stats report.

Regardless of whether or not a given drive model is included in the charts and tables, all of the data for all of the drives we use is included in our Drive Stats dataset which you can download by visiting our Drive Stats page.

As with the Q2 quarterly results, we will apply these criteria to the lifetime charts that follow in this report.

Lifetime hard drive failure rates

To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q2 2025 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 393,907 drives grouped into 27 models remaining for analysis as shown in the table below.

Backblaze Hard Drive Failure Rates for Q2 2025

Reporting period ending June 30, 2025
Drive models > 500 drives and > 100,000 lifetime drive days

A table showing the lifetime Backblaze Drive Stats.

Notes and observations

Again, the lifetime AFR holds steady, dropping from Q1 2025’s 1.31% to 1.30%.

Now you see me: This quarter’s table also gives us an interesting snapshot that has to do with our drive exclusions as the 4TB HGST model HMS5C4040ALE640 is on the way out. It meets our lifetime drive criteria, so it is included in this second table, but it didn’t make the cut for the quarterly table because it had too few drives running by the end of the quarter. Usually you see the opposite, where drive models show up in the quarterly requirements but not the lifetime. This quarter, four models meet that standard (Seagate model numbers ST8000NM000A, ST14000NM000J, ST16000NM002J, and Toshiba MG09ACA16TE).
Smaller drives getting older: Perhaps an unsurprising trend—Backblaze’s smaller capacity drives are getting older. We have a total of 13 drive models with 12TB or less, with a collective 1.54% failure rate. See the table below:

Backblaze drives with ≤12TB capacity

A image showing drives that are less than or equal to 12TB, including color coding to indicate drive age.

Of those models, eight are five years old or older (shown in purple). An additional two models are four years or older (that’s your orange). Taking just these 10 models—drives reaching their supposed golden years—we have a collective AFR of 1.42%.

Notably, that AFR is due to some well-performing low-failure outliers, including both of the 4TB Seagate models (0.57% and 0.40%), the 12TB HGST model HUH721212ALE600 (0.56%), and the 12TB Seagate model ST12000NM001G (0.99%).

That said, it’s also perhaps more impressive that when we say “eight are five years and older,” of those eight drive models, five are six or more years old. Their collective AFR is 1.33%.

Drive models that are less than or equal to 12TB and that are 6 or more years old.

This begs the age-old question: Is age just a number? Or, are we just seeing several exceptional drive models? In any event—an interesting drive population to keep an eye on, as it represents 156,724 of our 393,907 (~40%) of the lifetime drive pool.

Zoom in: The 20TB+ club

We’ve been taking quick peeks at the 20TB+ drives in the last few reports, but it’s high time we dig in a bit deeper. Right now, our cohort of 20TB+ drives that meet the lifetime criteria consists of three drives, the 20TB Toshiba model MG10ACA20TE, 22TB WDC model WUH722222ALE6L4, and 24TB Seagate model ST24000NM002H. Quite neatly, that also gives us one per manufacturer, lending itself to something of a head-to-head comparison—though, of course, with the variability we see on a per-drive basis within the same manufacturer, we won’t over-index on lending it too much significance.

Let’s take a look at each.

20TB Toshiba MG10ACA20TE

The Toshiba has actually been in our drive pool for 22 months, but until just under a year ago, there were only two drives. For the purposes of significance, then, we’ll exclude significantly low numbers of drives—thankfully, each model has something of a natural fall-off point where they go from single-digit drive numbers to hundreds.

For the Toshiba, that gives us the following data:

A table showing the AFRs for the 20TB Toshiba based on their age.

Converted to a graph, we end up with the following:

A graph showing the failure rates based on age for the 20TB Toshiba drives.

On this graph, the blue line represents the AFR and the red line represents the drive count. Drive count can be a bit tricky since our x-axis is age, and we start with age=0, which means that the drive count (from our perspective) goes from larger to smaller. That is, as drives get older, there are fewer of them by count—you have your initial purchase cohort, then you add drives over time. You can read this as the first data point representing drives that are between 0–1 month old, the next data point as 1–2 months old, etc.

We set it up this way because we wanted to be able to directly compare the failure rates of the drives based on their ages. Those familiar with our bathtub curve analysis may recognize our methodology here—we’re just zooming in on specific drives and drive capacities.

22TB WDC WUH722222ALE6L4

Now let’s take a look at the WDC model. We have usable data for about 21 months of its drive life:

A table showing the AFRs for the 22TB WDC drives based on their age.

Which gives us the following visualization:

A graph showing the AFRs for the 22TB WDC drives based on their age.

Interestingly, we see a lot less variability in the span of time where we have a direct comparison. That said, the WDC model also had a minimum of double the drive count if we’re looking at a similar time period—so, at their youngest (0 months old) the Toshiba had 14,407 drives vs. WDC’s 37,363; and, at 11 months Toshiba had 1,034 drives vs. WDC’s 13,965.

While AFRs do get us mostly on an even playing field as far as being able to make a 1:1 comparison, it’s important to remember that in smaller drive pools, a single failure can be amplified by quite a bit.

24TB Seagate ST24000NM002H

Our youngest drive model, the 24TB Seagate ST24000NM002H, has just half a year of data.

A table showing the AFRs for the 24TB Seagate drives based on their age.

That gives us the following visualization:

A graph showing the AFRs for the 24TB Seagate drives based on their age.

Compared with our other two drive models, the 24TB Seagate definitely has the highest failure rates. This could be explained, in part, by it being a young drive—is it in the leading edge of a traditional bathtub curve? So, certainly something to track over time to see if it will settle out as it gets older.

All together now: Comparing each 20TB+ drives

We designed this view to be directly comparable at points in time, so, here’s your graph that puts each drive on the same time scale:

A graph comparing the AFRs for the 20TB+ drives based on their age.

What’s our takeaway here? Well, in both drive count and length of time in the pool, it’s a little early to create definitive trends for the Seagate and the Toshiba. Certainly we can see that the Seagate is, early on, showing higher failure rates. Meanwhile, the 20TB Toshiba has had a bit of a variable year one. But again, with significantly variable drive pools between all models, we’re not quite comparing apples to apples. (We chose not to plot drive count on this chart—it gets messy quickly.) Add to that: the Seagate in particular is potentially at the beginning of the “bathtub” curve, we may see it change over time.

On the other hand, the 22TB WDC model has shown up quite a bit below our current average AFR for the drive pool of all drive sizes and ages, and it’s the model with the most data. But, how does that compare to other models as they come online?

Comparison: 20TB+ as a pool vs. 14-16TB as a pool

When we were considering whether this information would be a useful slice of the data, our biggest question was how to contextualize it versus drives. It’s perhaps a tad imperfect, but we landed on combining the 14–16TB drives as a pool, largely because they have a significant amount of data points and were our last set of drives onboarded, which means that they’re more or less the last generation of hardware.

The other thing to note is that once we combined the 20TB drives into a pool, some of the data we filtered out on a per-drive basis got added back in. So, at the 21 month mark, where the Toshiba model only had one drive, we added that single drive to the 399 that our WDC model brought to the table and calculated AFR across the pool (giving us 400 drives to work with).

Here’s the numbers for the 20TB+ drive pool:

A table combining the AFRs for the 20TB+ drives based on their age.

That gives us a pretty neat graph, actually:

A graph combining the AFRs for the 20TB+ drives based on their age.

Now, let’s compare to the 14–16TB drives of the same age. We have significant data for this population for nearly seven years, but in the interest of saving you three pages of scrolling, I’ll give you the table for the data that directly correlates to the 21 months of data we have for the 20TB+ drives.

A table combining the AFRs for the 14-16TB drives based on their age.

This is the line chart for that range of time:

A graph combining the AFRs for the 14-16TB drives based on their age, showing only the time period from 0-21 months.

Comparing age of drive to age of drive, it would seem that our 20TB are right on target, and perhaps doing a bit better than expected. But, that definitely isn’t a perfect comparison given that the 14–16TB drives have a steadier and larger drive count. So, let’s look at the chart with the full, nearly seven year time period:

A graph combining the AFRs for the 14-16TB drives based on their age, showing only the full age of the drive pool, 81 months.

This view starts to show us some spiky patterns as the 14–16TB drives get older, of course exacerbated by drive counts reducing over time.

So what’s it all mean?

It’s clear from the data that we need to give the 20TB+ drives time to mature, and that as we (depending on our buying behavior, of course) add more drives, we might see some interesting changes in the data.

As for the 14–16TB pool, they’re following relatively expected patterns of wearing out in the five-plus year range—but what does that mean when you compare to what we observed in our current lifetime stats, where we see our 12TB and smaller drive pool performing so well?

Without taking a closer look at the 14–16TB drives, it’s hard to say that they don’t have similar outlier tendencies to what the 12TB and smaller pool does, just pulling the failure rates upward. Even a casual glance at our current lifetime table’s 14–16TB drives bears that out (four years and older highlighted in orange, as our corollary above):

A image showing drives that are 14-16TB drives, including color coding to indicate drive age.

That data isn’t inclusive of all of the 14–16TB drives we’ve ever had, though—just those currently in operation. So, as always, there’s more investigation to be done.

The Hard Drive Stats data

The complete dataset used to create the tables and charts in this report is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data itself to anyone; it is free.

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q2 2025 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q1 2025

2025-05-13 Drive Stats Team

Post Syndicated from Drive Stats Team original https://www.backblaze.com/blog/backblaze-drive-stats-for-q1-2025/

A decorative image showing the title Backblaze Q1 2025 Drive Stats.

Welcome to the first Drive Stats of 2025. In case you missed it, the 2024 Drive Stats report was the last for long-time Drive Stats guru, Andy Klein, who is happily retired—off putting the “green” in greener pastures by working on his golf game. We–being Backblaze staff writer Stephanie Doyle and Chief Technical Evangelist Pat Patterson–are picking up where Andy left off, bringing you the metrics and analysis you know and love. Now, on to the numbers!

As of March 31, 2025, we had 312,831 drives under management. Of that total, there were 3,970 boot drives and 308,861 data drives. We’ll review their annualized failure rates (AFRs) as of Q1 2025, and we’ll dig into the average age of drive failure by model, drive size, and more. Along the way, we’ll share our observations and insights on the data presented and, this time around, we’ve got some exciting updates to share about how we produce Drive Stats. (Stay tuned, fellow Snowflake fans.)

As always, we look forward to your thoughts—we’ll see you in the comments section.

Sign up for the Drive Stats LinkedIn Live

Ready to dive deeper into the data? Tune in Thursday, May 15, 2025 at 10:00 a.m. PT, to query the new Drive Stats team, Stephanie Doyle and Pat Patterson. Feel free to drop us a line with any questions you want us to answer.

Q1 2025 hard drive failure rates

As mentioned above, at the end of Q1 2025, we were running 312,831 drives. During the quarter as a whole, however, we were monitoring a total of 318,426 drives; this count includes those that were taken out of service during the quarter, either because they failed or were only used temporarily.

We’ll discuss the criteria we used in the next section of this report. Removing these drives leaves us with 317,833 hard drives to analyze. The table below shows the annualized failure rates (AFR) for Q1 2025 for this collection of drives.

Backblaze Hard Drive Failure Rates for Q1 2025

Reporting period January 1, 2025–March 31, 2025 inclusive
Drive models with drive count > 100 as of March 31, 2025 and drive days > 10,000 in Q1 2025.

Notes and observations

The 4TB drives are hanging on and finishing strong. Good news: We have another quarter’s worth of data on our beloved 4TB drives (though the planned migration is well underway). True to their history, the 4TB drives showed wonderfully low failure rates, with yet another quarter of zero failures from model HMS5C4040ALE640 and 0.34% AFR from model HMS5C4040BLE640.
Keeping an eye on the 20TB+ pool. The 24TB Seagate (model ST24000NM002H) no longer has a perfect record, with eight failures for the quarter. Still, the drives put up a respectable 1.00% AFR. Meanwhile, the 20TB+ drives as a pool are averaging a 0.72% AFR, coming in lower than the overall failure rates—always a promising sign.
Zero failures for the quarter. Four drives get a gold star for zero failures this quarter:
- The 4TB HGST (model HMS5C4040ALE640)
- The Seagate 8TB (model ST8000NM000A)
- Seagate 12TB (model ST12000NM000J)
- Seagate 14TB (model ST14000NM000J)

Three out of the four also had zero failures last quarter, all but the Seagate 12TB.

The quarterly failure rate is slightly higher. The quarterly failure rate went up from 1.35% to 1.42%. As with the zero-failure club, our higher-end outlier AFRs show some of the usual suspects:
- Seagate 10TB (model ST10000NM0086). Q4 2024: 5.72%. Q1 2025: 4.72%.
- HGST 12TB (model HUH721212ALN604). Q4 2024: 5.15%. Q1 2025: 4.97%.
- Seagate 12TB (model ST12000NM0007). Q4 2024: 8.72%. Q1 2025: 9.47%.
- Seagate 14TB (model ST14000NM0138). Q4 2024: 5.95%. Q1 2025: 6.82%.

Drive model criteria

We noted earlier we removed 593 drives from consideration when we produced the table above covering Q4 2024. There are two primary reasons we did not consider these drive models.

Testing. These are drives of a given model that we monitor and collect Drive Stats data on, but are not considered production drives at this time. For example, drives undergoing certification testing to determine if they are performant enough for our environment are not included in our Drive Stats calculations.
Insufficient data points. When we calculate the annualized failure rate for a drive model for a given period of time (quarterly, annual, or lifetime), we want to ensure we have enough data to reliably do so. Therefore we have defined criteria for a drive model to be included in the tables and charts for the specified period of time. Models that do not meet these criteria are not included in the tables and charts for the period in question.

As with the Q4 quarterly results, we will apply these criteria to the annual and lifetime charts that follow in this report.

Lifetime hard drive failure rates

As of the end of Q1 2025, we were tracking 312,831 data hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q1 2025 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 312,493 drives grouped into 26 models remaining for analysis as shown in the table below.

Backblaze Lifetime Hard Drive Failure Rates

Reporting period ending March 31, 2025 inclusive
Drive models with > 500 drives and > 100,000 lifetime drive days

Notes and observations

The lifetime AFR remains steady, despite some drives having significant change. We see virtually no change in our overall lifetime AFR, which we last tracked at 1.31% in the 2024 Year-End Drive Stats Report. But, with some drive models showing significant change in year-over-year AFR, it’s worth digging in a little deeper.

Statistically significant improved AFRs:

Both the 12TB and the 14TB had the same number of failures (or nearly so). Meanwhile, the Toshiba 20TB and WDC 22TB had more failures, but added a significant number of drives to the fleet. Both of these activities increase the number of drive days we tracked for the model’s drive pool, so these results are unsurprising.

Statistically significant worsened AFRs:

Meanwhile, we have a few things happening for the significantly worsened AFRs. The WDC drive models are all top performers from a failure perspective, even a change from .45 to .48 shows up in the numbers.
That leaves us with two HGST 12TB drives. Both come in above the average failure rate, at 1.45% (model: HUH721212ALE604) and 2.06% (model: HUH721212ALN604). We can give HUH721212ALE604 a pass—with the drive pool showing an average age of 67.1 months, or about five and a half years, it’s firmly on track with the expected pattern defined by the bathtub curve.
Where does that leave us with model HUH721212ALE604? We’ll keep an eye on it. Given that its AFR rate isn’t too far off from the total AFR of the Backblaze drive fleet, it’s not hugely concerning unless we see the rate of change continue.

What’s new with Drive Stats?

In taking on this report, our main focus was to ensure continuity with our decades-old dataset. That said, we also saw some opportunities to streamline the process of data collection, a continuation of the work that David Winings talked about in Overload to Overhaul: How We Upgraded the Drive Stats Data and Drive Stats Data Deep Dive: The Architecture. All of these things set us up for not just an easier time generating this report, but some bigger plans in the future. (We won’t tip our hand yet—but stay tuned.)

Drive Stats gets a Snowflake upgrade

When we first started tracking Drive Stats way back in 2013, data collection was very ad hoc. For the first few years, when Brian Beach was at the helm, we published stats once a year. When Andy took over in 2015, he moved to publishing quarterly data (starting in 2016). As the dataset grew, and Andy’s collection of lightweight desktop apps started to run out of steam, it became apparent that we needed to upgrade to more capable analytical tooling. For a variety of operational reasons, Andy was gamely running SQL queries against CSV data imported into a MySQL instance running on his laptop—and having to do a ton of manual data cleanup to boot. (Pun obviously intended.)

This year, with the help of our colleagues on the database engineering team (shoutout to Tom Roden—thanks so much!), we were able to get the Drive Stats data included in the Backblaze Snowflake instance. Gone are the days of us bugging folks for exports that take hours to process! We can run lightweight queries against a cached, structured table.

We started from Andy’s SQL queries and tweaked them a bit to match the logic and nomenclature of Snowflake fields. Once we had that worked out, the first thing we did was validate our methodology by running the Q4 Drive Stats numbers and comparing them to Andy’s—success.

It helps that Pat has experimented with our Drive Stats dataset in Trino and other analytical tools like Apache Iceberg, so it’s certainly not the first time he’s considered methodology and tooling for this problem. Going forward, we may further refine the process, but for now, the migration to Snowflake saved us a ton of time and manual data cleanup.

The Hard Drive Stats data

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q1 2025 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for 2024

2025-02-11 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-2024/

A decorative image with the title 2024 Year End Drive Stats.

As of December 31, 2024, we had 305,180 drives under management. Of that number, there were 4,060 boot drives and 301,120 data drives. This report will focus on those data drives as we review the Q4 2024 annualized failure rates (AFR), the 2024 failure rates, and the lifetime failure rates for the drive models in service as of the end of 2024. Along the way, we’ll share our observations and insights on the data presented, and, as always, we look forward to you doing the same in the comments section at the end of the post.

Sign up for the Drive Stats webinar

Tune in to ask those questions you’ve had spinning ‘round your head like so many drives, and meet the new Drive Stats team—Stephanie Doyle and David Johnson of Backblaze Blog fame. Yes, you heard that right: It’s my last Drive Stats before I head off to retirement (but more on that later in the report). Read on, and sign up, for analysis and insights from the 2024 report.

Q4 2024 hard drive failure rates

As of the end of 2024, Backblaze was monitoring 301,120 hard drives used to store data. For our evaluation, we removed from consideration 487 drives, as they did not meet the criteria to be included. We’ll discuss the criteria we used in the next section of this report. Removing these drives leaves us with 300,633 hard drives to analyze. The table below shows the annualized failure rates for Q4 2024 for this collection of drives.

Notes and observations

24TB drives are here. Seagate 24TB drives (model: ST24000NM002H) arrived in early December. The 1,200 drives filled one Backblaze Vault with no failed drives through the end of Q4. The 24TB Seagate drives join the 20TB Toshiba and 22TB WDC drive models in the 20-plus capacity club as we continue to dramatically increase storage capacity while optimizing existing storage server space.
Zero failures for the quarter. Five drive models had zero failures for the quarter starting with the 24TB Seagate drive model noted above. The others are the 4TB HGST (model: HMS5C4040ALE640), the 8TB Seagate (model: ST8000NM000A), the 14TB Seagate (model: ST14000NM000J), and the 16TB Seagate (model: ST16000NM002J). All of the zeroes come with the caveat of having a relatively small number of drives and drive days, but zero failures in a quarter is always a good thing.
The 4TB drives are nearly extinct. The 4TB drive count decreased by another 1,774 drives in Q4. (I discussed exactly how we migrate them in more detail if you want to dig in.) The remaining ~4,000 drives should be gone by the end of Q1 2025. They will be replaced by the incoming 20TB, 22TB, and 24TB drives. It should be noted that out of the 4TB drives in operation in Q4, only one failed, so those 20-plus TB drives have a lot to live up to from a failure perspective.
The quarterly failure rate is down. The AFR for Q4 dropped from 1.89% in Q3 to 1.35% in Q4. While all drive sizes delivered some improvement from Q3 to Q4, one of the primary drivers is the addition of over 14,000 new 20-plus TB drives. As a group, these drives delivered an AFR of 0.77% for the quarter.

Drive model criteria

We noted earlier we removed 487 drives from consideration when we produced the table above covering Q4 2024. There are two primary reasons we did not consider these drive models.

Testing. These are drives of a given model that we monitor and collect Drive Stats data on, but are not considered production drives at this time. For example, drives undergoing certification testing to determine if they are performant enough for our environment are not included in our Drive Stats calculations.
Insufficient data points. When we calculate the annualized failure rate for a drive model for a given period of time (quarterly, annual, or lifetime), we want to ensure we have enough data to reliably do so. Therefore we have defined criteria for a drive model to be included in the tables and charts for the specified period of time. Models that do not meet these criteria are not included in the tables and charts for the period in question.

Period	Drive Count	Drive Days
Quarterly	> 100	> 10,000
Annual	> 250	> 50,000
Lifetime	> 500	>100,000

As with the Q4 quarterly results, we will apply these criteria to the annual and lifetime charts that follow in this report.

2024 annual hard drive failure rates

As of the end of 2024, Backblaze was monitoring 301,120 hard drives used to store data. We removed nine drive models consisting of 2,012 drives from consideration as they did not meet the annual criteria we have defined. This leaves us with 298,954 drives divided across 27 different drive models. The table below shows the AFRs for 2024 for this collection of drives.

Notes and observations

No zeros for the year. There were no qualifying drive models with zero failures in 2024. That said, the 16TB Seagate (model: ST16000NM002J) got close by recording just one drive failure back in Q3, giving the drive an AFR of 0.22% for 2024.
Busy data center techs. During 2024, our data center techs installed 53,337 drives. If we assume there are 2,080 work hours a year (52 weeks times 40 hours), that math is 53,337/2,080, and that means our intrepid DC techs installed 26 drives per hour. Busy, busy, busy!
The 24TB Seagate drives? While there were 1,200 new 24TB Seagate drives added in 2024, they were installed in early December and did not accumulate enough drive days to make the cut for the annual, or lifetime, tables. Including the 24TB Seagate drive, there were three models that missed out on being included in the 2024 annual tables, these drive models are listed below.

MFG	Model	Drive Count	Drive Days	2024 AFR
Seagate	ST8000NM000A	247	22,684	0.84%
Seagate	ST14000NM000J	232	19,696	1.32%
Seagate	ST24000NM002H	1,200	18,000	0.00%

As a reminder, a drive model needs to have over 250 drives by the end of Q4 and accumulate at least 50,000 drive days during 2024 to be included in the annual tables.

Comparing Drive Stats for 2022, 2023, and 2024

The table below compares the annual failure rates by drive model for each of the last three years. The table includes just those drive models which met the annual criteria as of the end of 2024. The data for each year is inclusive of that year only for the operational drive models present at the end of each year. The table is sorted by drive size and then AFR.

Notes and observations

The annual AFR is down. The 2024 AFR for all drives listed was 1.57%, this is down from 1.70% in 2023. We expect the overall failure rates to continue to fall in 2025, but we will be watching the following for indicators.
- The failure rates of the 8TB and 12TB drive models. All of the models will exceed their five years of service. In general, the failure rate will noticeably increase as the drives exceed five years of service. And, while there are outliers like the current HGST 4TB drives, you can’t assume that will happen.
- The failure rates of the 14TB and 16TB drive models. These models are approaching middle age—three to five years in operation. This is where, according to the bathtub curve, their failure rates could gradually increase—but not as severely as when they exceed five years.
- The failure rates for the 20TB, 22TB, and 24TB drives models. These drives will enter the flat portion of the bathtub curve, that is where their failure rate should be the lowest.

Annualized failure rates vs. drive size

Now, we can dig into the numbers to see what else we can learn. We’ll start by looking at the quarterly annualized failure rate by drive size over the last three years.

Let’s take a look at the different drive sizes and how they affect the overall annualized failure rate over time.

Minimal impact. The 4TB (blue line) drives and 10TB (gold line) drives have had little impact over the last year on the overall failure rate as each finished the year with a relatively small number of drives. Still, the wild ride delivered by the 10TB drives keeps our DC techs on their toes.

Older drives. The 8TB (gray line) drives and 12TB (purple line) drives range in age from five to eight years and as such their overall failure rates should be increasing over time. The 12TB drives are following that pattern moving up from about 1% AFR back in 2021 to just about 3% in 2024. The failure rates of the 8TB drives, while erratic from quarter-to-quarter, have a nearly flat trendline over the same period.

Workhorse drives. The 14TB (green line) and 16TB (azure* line) drives comprise 57% of the drives in service and on average they range in age from two to four years. They are in the prime of their working lives. As such, they should have low and stable failure rates, and as you can see, they do.

* Maybe azure isn’t quite right, but robin’s egg blue seemed a bit pretentious.

New drives on the block. The 22TB (orange line) drives are in their early days as we continue to add more drives on a regular basis. Once the drive population settles down, we’ll have a better sense of the AFR direction. Still, the early results are solid with a lifetime AFR of 1.06%.

Annualized failure rates vs. manufacturer

One of the more popular ways we can look at this data is by the drive manufacturer as we’ve done below.

To complete the picture, the chart below uses the same data, but displays just the linear trendlines for each of the manufacturers over the same three-year period.

HGST. While the HGST trendline is not pretty, it doesn’t tell the entire story. Looking at the first chart, until Q4 2023, the HGST drives were at or below the average for all of the drives, that is all manufacturers. At that point, HGST has exceeded the average, and then some. The table below contains results for just the HGST drives for 2024. We’ve sorted them, high to low, by the 2024 AFR.

As you can see, there are two 12TB drive models driving the high AFR for the HGST drives. The HUH721212ALN604 model began showing signs of an increased quarterly AFR in Q1 2023 and the HUH721212ALE604 model followed suit in Q3 2024. Without these drive models, the 2024 AFR for HGST drive would be 0.55%.

Seagate. The quarterly AFR trendline decreased for the Seagate drives from 2022 through 2024. While the decrease was slight, from 2.25% to 2.0%, Seagate was the only manufacturer to do so. The decrease appears, at least in part, to be due to the removal of the Seagate 4TB drives during that period.

Toshiba. Over the 2022 to 2024 period, the quarterly AFR for the Toshiba drive models varied within a fairly narrow range between 0.80% and 1.52%, with most quarters hovering slightly around 1.2%. Most importantly, none of the individual drive models were outliers, as the highest quarterly AFR for any Toshiba drive model was 1.58%. We like consistency.

WDC. While WDC drive models delivered a similar level of consistency as the Toshiba models, they did so with a lower AFR each quarter. From 2022 through 2024, the range of quarterly AFR values for the WDC models was 0.0% to 0.85%. The 0.0% AFR was in Q1 2022 when none of the 12,207 WDC drives in operation failed during that quarter.

Lifetime hard drive stats

As of the end of 2024, Backblaze was monitoring 301,120 hard drives used to store data. Applying our drive criteria noted above for the lifetime period, we removed 11 drive models consisting of 2,736 drives from consideration as they did not meet the lifetime criteria we defined. This leaves us with 298,230 drives divided across 25 different drive models. The table below shows the lifetime AFRs for this collection of drives.

The current lifetime AFR for all of the drives is 1.31%. This is down from 1.46% in 2023. The drop is primarily due to the completion of the migration of the 4TB Seagate drives in 2024, which left us with only two of these drives still in operation as of the end of 2024. As a consequence, the 79 million drive days and over 5,600 drive failures racked up by the 4TB Seagate drives by the end of 2023 are not included in the data presented in the 2024 lifetime table above.

In the final table below, we’ve taken the lifetime table and sorted out the drive models that have a lifetime AFR of 1.50% or less by drive size.

A couple of caveats as you review the table.

There is enough data for each model to say the AFR values are solid. That said, everything could change tomorrow. In general, the hard drive failure rate follows the bathtub curve as the drives age—unless it doesn’t. Some drives refuse to fail as they age, like the 4TB HGST drives. Other drives are great, and then “hit the wall” and bend the failure curve upward, fast.
A drive model with a 1% annualized failure rate means that you can expect one drive out of 100 to fail in a year. If you’re a personal drive user, that one drive could be yours. If you have exactly one drive, your personal annualized failure rate is 100%. In other words, always have a backup, and don’t forget to test it.

Migration time

I have been authoring the various Drive Stats reports for the past ten years and this will be my last one. I am retiring, or perhaps in Drive Stats vernacular, it would be “migrating.” Either way, after 10 years in the U.S. Air Force and 30+ years in Silicon Valley Tech, it is time. Drive Stats will continue with Stephanie Doyle and David Johnson as the replacement drive models beginning with the Q1 2025 report. I wish them well.

I want to say thank you to each of you who have taken your time to peruse and engage with the Drive Stats reports and data over the last 10 years. And, thank you as well for the comments, questions, and discussions that raced and raged across the various communities that care about something as mundane and awesome as a hard drive. It has been quite the ride—thanks again.

The Hard Drive Stats data

The complete data set used to create the tables and charts in this report is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data itself to anyone; it is free.

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q3 2024

2024-11-12 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2024/

A decorative image that displays the words Q3 2024 Drive Stats.

As of the end of Q3 2024, Backblaze was monitoring 292,647 hard disk drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world. We removed from this analysis 4,100 boot drives, consisting of 3,344 SSDs and 756 HDDs. This leaves us with 288,547 hard drives under management to review for this report. We’ll review the annualized failure rates (AFRs) for Q3 2024 and the lifetime AFRs of the qualifying drive models. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard drive failure rates for Q3 2024

For our Q3 2024 quarterly analysis, we remove the following from consideration: drive models which did not have at least 100 drives in service at the end of the quarter, drive models which did not accumulate 10,000 or more drive days during the quarter, and individual drives which exceeded their manufacturer’s temperature specification during their lifetime. The removed pool totalled 471 drives, leaving us with 288,076 drives grouped into 29 drive models for our Q3 2024 analysis.

The table below lists the AFRs and related data for these drive models. The table is sorted ascending by drive size then ascending by AFR within drive size.

Notes and observations on the Q3 2024 Drive Stats

Upward AFR. The quarter-to-quarter AFR continues to creep up rising from 1.71% in Q2 2024 to 1.89% in Q3 2024. The rise can’t be attributed to the aging 4TB drives, as our CVT drive migration system continues to replace these drives. As a consequence, the AFR for the remaining 4TB drives was 0.26% in Q3. The primary culprit is the collection of 8TB drives, which are now on average over seven years old. As a group, the AFR for the 8TB drives rose to 3.04% in Q3 2024, up from 2.31% in Q2. The CVT team is gearing up to begin the migration of 8TB drives over the next few months.
Yet another golden oldie is gone. You may have noticed that the 4TB Seagate drives (model: ST4000DM000) are missing from the table. All of the Backblaze Vaults containing these drives have been migrated, and as a consequence there are only two of these drives remaining, not enough to make the quarterly chart. You can read more about their demise in our recent Halloween post.
A new drive in town. In Q3, the 20TB Toshiba drives (model: MG10ACA20TE) arrived in force, populating three complete Backblaze Vaults of 1,200 drives each. Over the last few months our drive qualification team put the 20TB drive model through its paces and, having passed the test, they are now on the list of drive models we can deploy.
One zero. For the second quarter in a row, the 14TB Seagate (model: ST16000NM00J) drive model had zero failures. With only 185 drives in service, there is a lot of potential variability in the future, but for the moment, they are settling in quite well.
The nine year club. There are no data drives with 10 or more years of service, but there are 39 drives that are nine years or older. They are all 4TB HGST drives (model: HMS5C4040ALE640) spread across 31 different Storage Pods, in five different Backblaze Vaults and two different data centers. Will any of those drives make it to 10 years? Probably not, given that four of the five vaults have started their CVT migrations and will be gone by the end of the year. And, while the fifth vault is not scheduled for migration yet, it is just a matter of time before all of the 4TB drives we are using will be gone.

Reactive and proactive drive failures

In the Drive Stats dataset schema, there is a field named failure, which displays either a 1 for failure or a 0 for not failed. Over the years in various posts, we have stated that for our purposes drive failure is either reactive or proactive. Furthermore, we have suggested that failed drives fall basically evenly into these two categories. We’d like to put some data behind that 50/50 number, but first let’s start by defining our two categories of drive failure, reactive and proactive.

Reactive: A reactive failure is when any of the following conditions occur: the drive crashes and refuses to boot or spin up, the drive won’t respond to system commands, or the drive won’t stay operational.
Proactive: A proactive failure is generally anything not a reactive failure, and typically is when one or more indicators such as SMART stats, FSCK (file system) checks, etc., signal that the drive is having difficulty and drive failure is highly probable. Typically a multitude of indicators are present in drives declared as proactive failures.

A drive that is removed and replaced as either a proactive or reactive failure is considered a drive failure in Drive Stats unless we learn otherwise. For example, a drive is experiencing communications errors and command timeouts and is scheduled for a proactive drive replacement. During the replacement process, the data center tech realizes the drive does not appear to be fully seated. After gently securing the drive, further testing reveals no issues and the drive is no longer considered failed. At that point, the Drive Stats dataset is updated accordingly.

As noted above, the Drive Stats dataset includes the failure status (0 or 1) but not the type of failure (proactive or reactive). That’s a project for the future. To get a breakdown of different types of drives failure we have to interrogate the data center maintenance ticketing system used by each data center to record any maintenance activities on Storage Pods and related equipment. Historically, the drive failure data was not readily accessible, but a recent software upgrade now allows us access to this data for the first time. So in the spirit of Drive Stats, we’d like to share our drive failure types with you.

Drive failure type stats

Q3 2024 will be our starting point for any drive failure type stats we publish going forward. For consistency, we will use the same drive models listed in the Drive Stats quarterly report, in this case Q3 2024. For this period, there were 1,361 drive failures across 29 drive models.

We actually have been using the data center maintenance data for several years as each quarter we validate the failed drives reported by the Drive Stats system with the maintenance records. Only validated failed drives are used for the Drive Stats reports we publish quarterly and in the data we publish on our Drive Stats webpage.

The recent upgrades to the data center maintenance ticketing system have not only made the drive failure validation process easier, we can now easily join together the two sources. This gives us the ability to look at the drive failure data across several different attributes as shown in the tables below. We’ll start with the number of failed drives in each category and go from there. This will form our baseline data.

Reactive vs. proactive drive failures for Q3 2024

Observation period	Reactive failures	Proactive failures	Total failures	Reactive %	Proactive%
Q3 2024 failed drives	640	721	1,361	47.0%	53.0%

Reactive vs. proactive drive failures for Q3 2024

Manufacturer	Reactive failures	Proactive failures	Total failures	Reactive %	Proactive %
HGST	194	177	371	52.3%	47.7%
Seagate	258	272	530	48.7%	51.3%
Toshiba	124	221	345	35.9%	64.1%
WDC	64	51	115	55.7%	44.3%

Reactive vs. proactive drive failures by Backblaze data center

Backblaze data center	Reactive failures	Proactive failures	Total failures	Reactive %	Proactive %
AMS	36	77	113	31.9%	68.1%
IAD	50	92	142	35.2%	64.8%
PHX	179	201	380	47.1%	52.9%
SAC 0	151	148	299	50.5%	49.5%
SAC 2	224	203	427	52.5%	47.5%

Reactive vs. proactive drive failures by server type

Server type	Reactive failures	Proactive failures	Total failures	Reactive %	Proactive %
5.0 red Storage Pod (45 drives)	4	2	6	66.7%	33.3%
6.0 red Storage Pod (60 drives)	433	349	782	55.4%	44.6%
6.1 red Storage Pod (60 drives)	70	107	177	39.5%	60.5%
Dell Server (26 drives)	22	61	83	26.5%	73.5%
Supermicro Server (60 drives)	111	202	313	35.5%	64.5%

Obviously, there are many things we could analyze here, but for the moment we just want to establish a baseline. Next, we’ll collect additional data to see how consistent and reliable our data is over time. We’ll let you know what we find.

Learning more about proactive failures

One item of interest to us is the different reasons that cause a drive to be designated as a proactive failure. Today we record the reasons for the proactive designation at the time the drive is flagged for replacement, but currently multiple reasons are allowed for a given drive. This makes determining the primary reason difficult to determine. Of course, there may be no such thing as a primary reason, as it is often a combination of factors causing the problem. That analysis could be interesting as well. Regardless of the exact reason, such drives are in bad shape and replacing degraded drives to protect the data they store is our first priority.

Lifetime hard drive failure rates

As of the end of Q3 2024, we were tracking 288,547 operational hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q3 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 286,892 drives grouped into 25 models remaining for analysis as shown in the table below.

Downward lifetime AFR

In Q2 2024, the lifetime AFR for the drives listed was 1.47%. In Q3, the lifetime AFR went down to 1.31%, a significant decrease from one quarter to the next for the lifetime AFR. This decrease is also contrary to the increasing quarterly AFR increase over the same period. At first blush, that doesn’t make much sense as an increasing quarter-to-quarter AFR should increase the lifetime AFR. There are two related factors which explain this seemingly contradictory data. Let’s take a look.

We’ll start with the table below which summarizes the differences between the Q2 and Q3 lifetime stats.

Period	Drive count	Drive days	Drive failures	Lifetime AFR
Q2 2024	283,065	469,219,469	18,949	1.47%
Q3 2024	286,892	398,476,931	14,308	1.31%

To create the dataset for the lifetime AFR tables two criteria are applied: first, at the end of a given quarter, the number of drives of a drive model must be greater than 500, and, second, the number of drive days must be greater than 100,000. The first criterion ensures that the drive models are relevant to the data presented; that is, we have a significant number of each of the included drive models. The second standard ensures that the drive models listed in the lifetime AFR table have a sufficient number of data points; that is, they have enough drive days to be significant.

As we can see in the table above, while the number of drives went up from Q2 to Q3, the number of drive days and the number of drive failures went down significantly. This is explained by comparing the drive models listed in the Q2 lifetime table versus the Q3 lifetime table. Let’s summarize.

Added: In Q3, we added the 20TB Toshiba drive model (MG10ACA20TE). In Q2, there were only two of these drives in service.
Removed: In Q3, we removed the 4TB Seagate drive model (ST4000DM000) as there were only two drives remaining as of the end of Q3, well below the criteria of 500 drives.

When we removed the 4TB Seagate drives we also removed 80,400,065 lifetime drive days and 5,789 lifetime drive failures from the Q3 lifetime AFR computations. If the 4TB Seagate drive model data (drive days and drive failures) was included in the Q3 Lifetime stats, the AFR would have been 1.50%.

Why not include the 4TB Seagate data? In other words, why have a drive count criteria at all? Shouldn’t we compute lifetime AFR using all of the drive models we have ever used which accumulated over 100,000 drive days in a lifetime? If we did things that way, the list of drive models used to compute the lifetime AFR would now include drive models we stopped using years ago and would include nearly 100 different drive models. As a result, a majority of the drive models used to compute the lifetime AFR would be outdated and the lifetime AFR table would contain rows of basically useless data that has no current or future value. In short, having drive count as one of the criteria in computing lifetime AFR keeps the table relevant and approachable.

The Hard Drive Stats data

It has now been over 11 years since we began recording, storing, and reporting the operational statistics of the HDDs and SSDs we use to store data at Backblaze. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored.

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.

You can download and use the Drive Stats dataset for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, 3) you may sell derivative works based on the data, but 4) you can not sell this data to anyone; it is free.

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q3 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Quoth the Drive Stats, Nevermore: An Elegy for Our Seagate 4TB Drives

2024-10-31 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/quoth-the-drive-stats-nevermore-an-elegy-for-our-seagate-4tb-drives/

A decorative image showing a gravestone with ravens around it.

Once upon a midnight dreary, as I typed another query

Seeking many a quaint and curious fact of hidden Drive Stats lore—

While I waited, time advancing, suddenly the stats came dancing

Lines of empty datasets; the database had nothing more

“Is that right?” I muttered, “The database had nothing more—

So those drives, I must explore.”

Ah, distinctly I remember, it was just past this September

I requested failure rates of Seagate drives with terabytes of four

Eagerly I typed the query, even though my eyes were bleary

The count of Seagate fours was eerie, eerie; there was nothing more.

The sad and certain count screamed like it never had before;

No Seagate drives with terabytes of four.

There are missing rows, I’m certain, and files waiting to explore.

The reality I kept dismissing, the Seagate data must be missing

With hours gone to data fishing, the facts shook me to the core;

The spinning life is over for our Seagate drives with terabytes of four—

Those Seagate drives are nevermore.

(My apologies to Edgar Allen Poe.)

Shortly, we will publish the Q3 2024 Backblaze Drive Stats report, and an old faithful will be missing from the tables, the 4TB Seagate drive model ST4000DM000. This drive model has graced our Drive Stats charts and tables since the very first Drive Stats report, and it would be a ghastly mistake if we let the drive slip into the afterlife unnoticed. So on this All Hallows’ Eve, it’s only fitting we say nevermore to these Seagate drives.

The first 45 of these Seagate 4TB drives were installed in a 45-drive Backblaze Storage Pod in May 2013. That was before 60-drive Storage Pods, Backblaze Vaults, and even Backblaze B2. Over the next two years, thousands of new Seagate 4TB drives were added each quarter, and by Q3 2016, there were 34,744 spinning souls in service. That represented more than 50% of all the drives in service at the time—a howling success that has not been duplicated by any other drive model.

Alas, that didn’t last as the first wave of 8TB drives arrived in mid-2016 and with that, no additional 4TB Seagate drives were procured. Over time, as 4TB Seagate drives met their maker, the count decreased, and when Storage Pods containing these drives started being phased out in 2018, the count dropped faster. The final nail in the coffin came when, in 2023, our CVT drive migration system became fixated on the replacement of all the remaining 4TB Seagate drives, and here we are.

As for those intrepid 45 original drives installed in May 2013, they were not around at the end. They were unceremoniously replaced in a Storage Pod upgrade back in 2017. A few were resurrected as drive replacements, but today they only exist in the spirit world, having died or been replaced by 2020. Still many other 4TB Seagate drives have lived long happy lives, with nearly 100 exceeding 100 months of service (8.4 years) before being sent to their final resting place by the CVT reaper.

And so it is time; we shall gather in a circle, cross our arms and hold hands and chant “our Seagate drives…with terabytes of four…are nevermore!”

The post Quoth the Drive Stats, Nevermore: An Elegy for Our Seagate 4TB Drives appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q2 2024

2024-08-06 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2024/

A decorative image with the headline Q2 2024 Drive Stats.

As of the end of Q2 2024, Backblaze was monitoring 288,665 hard drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world. We removed from this analysis 3,789 boot drives, consisting of 2,923 SSDs and 866 hard drives. This leaves us with 284,876 hard drives under management to review for this report. We’ll review the annualized failure rates (AFRs) for Q2 2024 and the lifetime AFRs of the qualifying drive models, and we’ll also check out drive age versus failure rates over time. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard drive failure rates for Q2 2024

For our Q2 2024 quarterly analysis, we remove from consideration: drive models which did have at least 100 drives in service at the end of the quarter, drive models which did not accumulate 10,000 or more drive days during the quarter, and individual drives which exceeded their manufacturer’s temperature specification during their lifetime. The removed pool totalled 490 drives, leaving us with 284,386 drives grouped into 29 drive models for our Q2 2024 analysis.

The table below lists the AFRs and related data for these drive models. The table is sorted large to small by drive size then by AFR within drive size.

Notes and observations on the Q2 2024 Drive Stats

Upward AFR: The AFR for Q2 2024 was 1.71%. That’s up from Q1 2024 at 1.41%, but down from one year ago (Q2 2023) at 2.28%. While the quarter over quarter increase was a bit surprising, quarterly fluctuations in AFR are expected. Sixteen drive models had an AFR of 1.71% or below while 13 drive models had an AFR above.
Two good zeroes: In Q2 2024, two drive models had zero failures, a 14TB Seagate (model: ST14000NM000J) and a 16TB Seagate (model: ST16000NM002J). Both have a relatively small number of drives and drive days for the quarter, so their success is somewhat muted, but the 16TB Seagate drive model has a very respectable 0.57% lifetime failure rate.
Another GOAT is gone: In Q1, we migrated the last of our 4TB Toshiba drives. In Q2, we migrated the last of our 6TB drives, including all of the Seagate 6TB drives which had reached an average age of nine years (108 months). This Seagate drive model closed out its career at Backblaze with an impressive 0.86% lifetime AFR.
Currently the 4TB Seagate (model: ST4000DM000) is our oldest data drive model in production at an average age of 99.5 months. The data on these drives is scheduled to be migrated over the next quarter or two using CVT, our in-house drive migration system. They’ll never reach nine years of service.
The 10-Year Club: With the 6TB Seagate drives being migrated as they hit 10 years of service, we wondered: What is the oldest data drive in service? The answer, a 4TB HGST drive (model: HMS5C4040ALE640) with 9 years, 11 months and 23 days service as of the end of Q2. Alas, the Backblaze Vault in which this drive resides is now being migrated as are many other drives with over nine years of service. We’ll see next quarter to see if any of them made it to the 10-Year Club before they are retired.
While there are no data drives with 10 years of service, there are 11 HDD boot drives that exceed the mark. In fact one, a 500GB WD drive (model: WD5000BPKT) has over 11 years of service. (Psst, don’t tell the CVT team.)
An HGST surprise: Over the years, the HGST drive models we have used performed very well. So, when the 12TB HGST (model: HUH721212ALN604) drive showed up with a 7.17% AFR for Q2, it’s news. Such uncharacteristic quarterly failure rates for this model actually go back about a year, although the 7.17% AFR is the largest quarterly value to date. As a result, the lifetime AFR has risen from 0.99% to 1.57% over the last year. While the lifetime AFR is not alarming, we are paying attention to this trend.

Lifetime hard drive failure rates

As of the end of Q2 2024, we were tracking 284,876 operational hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q2 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 283,065 drives grouped into 25 models remaining for analysis as shown in the table below.

Age, AFR, and snakes

One of the truisms in our business is that different drive models fail at different rates. Our goal is to develop a failure profile for a given drive model over time. Such a profile can help optimize our drive replacement and migration strategies, and ultimately maintains the durability of our cloud storage service.

For our cohort of data drives, we’ll look at the changes in the lifetime AFR over time for drive models with at least one million drive days as of the end of Q2 2024. This gives us 23 drive models to review. We’ll divide the drive models into two groups: those whose average age is five years (60 months) or less, and those whose average age is above 60 months. Why that cutoff? That’s the typical warranty period for enterprise class hard drives.

Let’s start by plotting the current lifetime AFR for the 14 drives models that have an average age of 60 months or less as shown in the chart below.

Let’s review the drive models by characterizing the four quadrants as follows:

Quadrant I: Drive models in this quadrant are performing well, and have a respectable AFR of less than 1.5%. Drive models to the right in this quadrant might require a little more attention over the coming months than those to the left.
Quadrant II: These drive models have failure rates above 1.5%, but are still reasonable at around 2% lifetime AFR. What is important is that AFR does not increase significantly over time.
Quadrant III: There are no drives currently in this quadrant, but if there were it would not be a cause for alarm. Why? Some drive models experience higher rates of failure early on, and then following the bathtub curve, their AFR drops as they get older.
Quadrant IV: These drive models are just starting out and are just beginning to establish their failure profile, which at the moment is good.

At a glance, the chart tells us that everything seems fine. The drives in Quadrant I are performing well, the two drives in Quadrant II could be better, but are still acceptable, and there are no surprises in the newer drive models to this point. Let’s see how things fair for the drive models which have an average age of over 60 months as in the chart below.

There are nine drive models which fit the average age criteria, including the Seagate 6TB drive (in yellow) whose drives were removed from service in Q2. As you can see the drive models are spread out across all four quadrants. As before, Quadrant I contains good drives, Quadrants II and III are drives we need to worry about, and Quadrant IV models look good so far.

If we were to stop here we could decide for example that the 4TB Seagate drives are first in line for the CVT migration process, but not so fast. All of these drive models have been around for at least five years and we have their failure rates over time. So, rather than rely on just a point in time, let’s look at their change in failure rates over time in the chart below.

The snake chart, as we’re calling it, shows the lifetime failure rate of each drive model over time. We started at 24 months to make the chart less messy. Regardless, the drive models sort themselves out into either Quadrant I or II once their average age passes 60 months. Let’s take a look at the drives in each of those quadrants.

Quadrant I: Five of the nine drive models are in Quadrant I as of Q2 2024. The two 4TB HGST drives (brown and purple lines) as well as the 6TB Seagate (red line) have nearly vertical lines indicating their failure rates have been consistent over time, especially after 60 months of service. Such demonstrated consistency over time is a failure profile we like to see.
The failure profile of the 8TB Seagate (blue line) and the 8TB HGST (gray line) are less consistent, with each increasing their failure rates as they have aged. In the case of the HGST drive, the lifetime AFR rose from about 0.5% to 1.0% over an 18 month period starting at 48 months before leveling out. The Seagate drive took about two years starting at 60 months to go from 1.0% to nearly 1.5% before leveling out.
Quadrant II: The remaining 4 drive models ended in this quadrant. Three of the models, the 8TB Seagate (yellow line), the 10TB Seagate (green line), and the 12TB HGST (teal line) have similar failure profiles. All three got to some point in their lifetime and their curve began bending to the right. In other words, their failure rates over time accelerated. While the 8TB Seagate (yellow) shows some signs of leveling off, all three models will be closely watched and replaced if this trend continues.
Also in Quadrant II is the 4TB Seagate drive (black line). This drive model is aggressively being migrated and is being replaced by 16TB and larger drives via the CVT process. As such, it is hard to tell if the nearly vertical failure profile is a function of the replacement process or the drive model failure rate leveling out over time. Either way, the migration of this drive model is expected to be complete in the next quarter or two.

A normal failure profile

If we had to pick one of the drive models to represent a normal failure profile, it would be the 8TB Seagate (blue line, model: ST800DM002). Why? The failure rate for the first 60 months was consistently around 1.0%, Seagate’s predicted AFR. After 60 months, the AFR increased as the drive aged as one would expect. You might have thought we’d choose the failure profile of one of the two 4TB HGST drive models (brown and purple lines). The “trouble” is their failure rates are well below any published AFR by any drive manufacturer. While that’s great for us, their annualized failure rates over time are sadly not normal.

Can AI help?

The idea of using AI/ML techniques to predict drive failure has been around for several years, but as a first step let’s see if predicting drive failure is even an AI-worthy problem. We recently conducted a webinar “Leveraging Your Cloud Storage Data in AL/ML Apps and Services” in which we outlined general criteria to be used in evaluating if AI/ML is needed to solve a given problem, in this case predicting drive failure. The most salient criteria which applies here is that AI is best used for a problem for which you can not consistently apply a set of rules to solve the problem.

A model is trained by taking the source data and applying an algorithm to iteratively combine and weigh multiple factors. The output is a model which can be used to answer questions about the model’s subject matter, in this case drive failure. For example, we train a model using the Drive Stats data for a given drive model for the last year. Then, we ask the model a question using drive Z’s daily SMART stats and related information. We use this data as input to the model, and while there is no exact match, the model will use inference to develop a response of the probability of drive failure for drive Z over time. As such, it would seem that drive failure prediction would be a good candidate for using AI.

What’s not clear is whether what is learned about one drive model can be applied to another drive model. One look at the snake chart above visualizes the issue as the failure profile for each drive model is different, sometimes radically different. For example, do you think you could train a model on the 4TB Seagate drives (black line) and use it to predict drive failures for either of the 4TB HGST drive models (purple and brown lines)? The answer may be yes, but it certainly doesn’t seem likely.

All that said, several research papers and studies have been published over the years attempting to determine whether or not AI/ML can be used to make drive failure predictions. We’ll be doing a review of these publications in the next couple of months and hopefully shed some light on the ability to use AI to accurately make drive failure predictions in a timely manner.

The Hard Drive Stats data

It has now been over 11 years since we began recording, storing, and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored.

The post Backblaze Drive Stats for Q2 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q1 2024

2024-05-02 Andy Klein

Post Syndicated from Andy Klein original https://backblaze.com/blog/backblaze-drive-stats-for-q1-2024/

A decorative image displaying the title Q1 2024 Drive Stats.

As of the end of Q1 2024, Backblaze was monitoring 283,851 hard drives and SSDs in our cloud storage servers located in our data centers around the world. We removed from this analysis 4,279 boot drives, consisting of 3,307 SSDs and 972 hard drives. This leaves us with 279,572 hard drives under management to examine for this report. We’ll review their annualized failure rates (AFRs) as of Q1 2024, and we’ll dig into the average age of drive failure by model, drive size, and more. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard Drive Failure Rates for Q1 2024

We analyzed the drive stats data of 279,572 hard drives. In this group we identified 275 individual drives which exceeded their manufacturer’s temperature specification at some point in their operational life. As such, these drives were removed from our AFR calculations.

The remaining 279,297 drives were divided into two groups. The primary group consists of the drive models which had at least 100 drives in operation as of the end of the quarter and accumulated over 10,000 drive days during the same quarter. This group consists of 278,656 drives grouped into 29 drive models. The secondary group contains the remaining 641 drives which did not meet the criteria noted. We will review the secondary group later in this post, but for the moment let’s focus on the primary group.

For Q1 2024, we analyzed 278,656 hard drives grouped into 29 drive models. The table below lists the AFRs of these drive models. The table is sorted by drive size then AFR, and grouped by drive size.

Notes and Observations on the Q1 2024 Drive Stats

Downward AFR: The AFR for Q1 2024 was 1.41%. That’s down from Q4 2023 at 1.53%, and also down from one year ago (Q1 2023) at 1.54%. The continuing process of replacing older 4TB drives is a primary driver of this decrease as the Q1 2024 AFR (1.36%) for the 4TB drive cohort is down from a high of 2.33% in Q2 2023.
A Few Good Zeroes: In Q1 2024, three drive models had zero failures:
- 16TB Seagate (model: ST16000NM002J)
  - Q1 2024 drive days: 42,133
  - Lifetime drive days: 216,019
  - Lifetime AFR: 0.68%
  - Lifetime confidence interval: 1.4%
- 8TB Seagate (model: ST8000NM000A)
  - Q1 2024 drive days: 19,684
  - Lifetime drive days: 106,759
  - Lifetime AFR: 0.00%
  - Lifetime confidence interval: 1.9%
- 6TB Seagate (model: ST6000DX000)
  - Q1 2024 drive days: 80,262
  - Lifetime drive days: 4,268,373
  - Lifetime AFR: 0.86%
  - Lifetime confidence interval: 0.3%

All three drives have a lifetime AFR of less than 1%, but in the case of the 8TB and 16TB drive models the confidence interval (95%) is still too high. While it is possible the two drives models will continue to perform well, we’d like to see the confidence interval below 1%, and preferably below 0.5%, before we can trust the lifetime AFR.

With a confidence interval of 0.3% the 6TB Seagate drives delivered another quarter of zero failures. At an average age of nine years, these drives continue to defy their age. They were purchased and installed at the same time back in 2015, and are members of the only 6TB Backblaze Vault still in operation.

The End of the Line: The 4TB Toshiba (model: MD04ABA400V) are not in the Q1 2024 Drive Stats tables. This was not an oversight. The last of these drives became a migration target early in Q1 and their data was securely transferred to pristine 16TB Toshiba drives. They rivaled the 6TB Seagate drives in age and AFR, but their number was up and it was time to go.

The Secondary Group

As noted previously, we divided the drive models into two groups, primary and secondary, with drive count (>100) and drive days (>10,000) being the metrics used to divide the groups. The secondary group has a total of 641 drives spread across 27 drive models. Below is a table of those drive models.

The secondary group is mostly made up of drive models which are replacement drives or migration candidates. Regardless, the lack of observations (drive days) over the observation period is too low to have any certainty about the calculated AFR.

From time to time, a secondary drive model will move into the primary group. For example, the 14TB Seagate (model: ST14000NM000J) will most likely have over 100 drives and 10,000 drive days in Q2. The reverse is also possible, especially as we continue to migrate our 4TB drive models.

Why Have a Secondary Group?

In practice we’ve always had two groups; we just didn’t name them. Previously, we would eliminate from the quarterly, annual, and lifetime AFR charts drive models which did not have at least 45 drives, then we upped that to 60 drives. This was okay, but we realized that we needed to also set a minimum number of drive days over the analysis period to improve our confidence in the AFRs we calculated. To that end, we have set the following thresholds for drive models to be in the primary group.

Review Period	Drive Count per Model	Drive Days per Model
Quarterly	>100 drives	>10,000 drive days
Annual	>250 drives	>50,000 drives days
Lifetime	>500 drives	>100,000 drive days

We will evaluate these metrics as we go along and change them if needed. The goal is to continue to provide AFRs that we are confident are an accurate reflection of the drives in our environment.

The Average Age of Drive Failure Redux

In Q1 2023 Drive Stats report, we took a look at the average age in which a drive fails. This review was inspired by the folks at Secure Data Recovery who calculated that based on their analysis of 2,007 failed drives, the average age at which they failed was 1,051 days or roughly 2 years and 10 months.

We applied the same approach to our 17,155 failed drives and were surprised when our average age of failure was only 2 years and 6 months. Then we realized that many of the drive models that were still in use were older (much older) than the average, and surely when some number of them failed, it would affect the average age of failure for a given drive model.

To account for this realization, we considered only those drive models that are no longer active in our production environment. We call this collection retired drive models as these are drives that can no longer age or fail. When we reviewed the average age of this retired group of drives, the average age of failure was 2 years and 7 months. Unexpected, yes, but we decided we needed more data before reaching any conclusions.

So, here we are a year later to see if the average age of drive failure we computed in Q1 2023 has changed. Let’s dig in.

As before we recorded the date, serial_number, model, drive_capacity, failure, and SMART 9 raw value for all of the failed drives we have in the Drive Stats dataset back to April 2013. The SMART 9 raw value gives us the number of hours the drive was in operation. Then we removed boot drives and drives with incomplete data, that is some of the values were missing or wildly inaccurate. This left us with 17,155 failed drives as of Q1 2023.

Over this past year, Q2 2023 through Q1 2024, we recorded an additional 4,406 failed drives. There were 173 drives which were either boot drives or had incomplete data, leaving us with 4,233 drives to add to the 17,155 failed drives previous, totalling 21,388 failed drives to evaluate.

When we compare Q1 2023 to Q1 2024 we get the table below.

The average age of failure for all of the Backblaze drive models (2 years and 10 months) matches the Secure Data Recovery baseline. The question is, does that validate their number? We say, not yet. Why? Two primary reasons.

First, we only have two data points, so we don’t have much of a trend, that is we don’t know if the alignment is real or just temporary. Second, the average age of failure of the active drive models (that is, those in production) is now already higher (2 years and 11 months) than the Secure Data baseline. If that trend were to continue, then when the active drive models retire, they will likely increase the average age of failure of the drive models that are not in production.

That said, we can compare the numbers by drive size and drive model from Q1 2023 to Q1 2024 to see if we can gain any additional insights. Let’s start with the average age by drive size in the table below.

The most salient observation is that for every drive size that had active drive models (green), the average age of failure increased from Q1 2023 to Q1 2024. Given that the overall average age of failure increased over the last year, it is reasonable to expect that some of the active drive size cohorts would increase. With that in mind, let’s take a look at the changes by drive model over the same period.

Starting with the retired drive models, there were three drive models totalling 196 drives which moved from active to retired from Q1 2023 to Q1 2024. Still, the average age of failure for the retired drive cohort remained at 2 years 7 months, so we’ll spare you from looking at a chart with 39 drive models where over 90% of the data didn’t change Q1 2023 to Q1 2024.

On the other hand, the active drive models are a little more interesting, as we can see below.

In all but the two drive models (highlighted), the average age of failure for each drive model went up. In other words, active drive models are, on average, older when they fail, than one year ago. Remember, we are testing the average age of the drive failures, not the average age of the drive.

At this point, let’s review. The Secure Data Recovery folks checked 2,007 failed drives and determined their average age of failure was 2 years and 10 months. We are testing that assertion. At the moment, the average age of failure for the retired drive models (those no longer in operation in our environment) is 2 years and 7 months. This is still less than the Secure Data number. But, the drive models still in operation are now hitting an average of 2 years and 10 months, suggesting that once these drive models are removed from service, the average age of failure for the retired drive models will increase.

Based on all of this, we think the average age of failure for our retired drive models will eventually exceed 2 years and 10 months. Further, we predict that the average age of failure will reach closer to 4 years for the retired drive models once our 4TB drive models are removed from service.

Annualized Failures Rates for Manufacturers

As we noted at the beginning of this report, the quarterly AFR for Q1 2024 was 1.41%. Each of the four manufacturers we track contributed to the overall AFR as shown in the chart below.

As you can see, the overall AFR for all drives peaked in Q3 2023 and is dropping. This is mostly due to the retirement of older 4TB drives that are further along the bathtub curve of drive failure. Interestingly, all of the remaining 4TB drives in use today are either Seagate or HGST models. Therefore, we expect the quarterly AFR will most likely continue to decrease for those two manufacturers as over the next year their 4TB drive models will be replaced.

Lifetime Hard Drive Failure Rates

As of the end of Q1 2024, we were tracking 279,572 operational hard drives. As noted earlier, we defined the minimum eligibility criteria of a drive model to be included in our analysis for quarterly, annual and lifetime reviews. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q1 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 277,910 drives grouped into 26 models remaining for analysis as shown in the tale below.

With three exceptions, the conference interval for each drive model is 0.5% or less at 95% certainty. For the three exceptions: the 10TB Seagate, the 14TB Seagate, and 14TB Toshiba models, the occurrence of drive failure from quarter to quarter was too variable over their lifetime. This volatility has a negative effect on the confidence interval.

The combination of a low lifetime AFR and a small confidence interval is helpful in identifying the drive models which work well in our environment. These days we are interested mostly in the larger drives as replacements, migration targets, or new installations. Using the table above, let’s see if we can identify our best 12, 14, and 16TB performers. We’ll skip reviewing the 22TB drives as we only have one model.

The drive models are grouped by drive size, then sorted by their Lifetime AFR. Let’s take a look at each of those groups.

12TB drive models: The three 12TB HGST models are great performers, but are hard to find new. Also, Western Digital, who purchased the HGST drive business a while back, has started using their own model numbers of these drives, so it can be confusing. If you do find an original HGST make sure it is new as from our perspective buying a refurbished drive is not the same as buying a new.
14TB drive models: The first three models look to be solid—the WDC (WUH721414ALE6L4), Toshiba (MG07ACA14TA), and Seagate (ST14000NM001G). The remaining two drive models have mediocre lifetime AFRs and undesirable confidence intervals.
16TB drive models: Lots of choice here, with all six drive models performing well to this point, although the WDC models are the best of the best to date.

The Hard Drive Stats Data

It has now been eleven years since we began recording, storing and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored.

You can download and use the Drive Stats dataset for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q1 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for 2023

2024-02-13 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-2023/

A decorative image displaying the words 2023 Year End Drive Stats

As of December 31, 2023, we had 274,622 drives under management. Of that number, there were 4,400 boot drives and 270,222 data drives. This report will focus on our data drives. We will review the hard drive failure rates for 2023, compare those rates to previous years, and present the lifetime failure statistics for all the hard drive models active in our data center as of the end of 2023. Along the way we share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

2023 Hard Drive Failure Rates

As of the end of 2023, Backblaze was monitoring 270,222 hard drives used to store data. For our evaluation, we removed 466 drives from consideration which we’ll discuss later on. This leaves us with 269,756 hard drives covering 35 drive models to analyze for this report. The table below shows the Annualized Failure Rates (AFRs) for 2023 for this collection of drives.

An chart displaying the failure rates of Backblaze hard drives.

Notes and Observations

One zero for the year: In 2023, only one drive model had zero failures, the 8TB Seagate (model: ST8000NM000A). In fact, that drive model has had zero failures in our environment since we started deploying it in Q3 2022. That “zero” does come with some caveats: We have only 204 drives in service and the drive has a limited number of drive days (52,876), but zero failures over 18 months is a nice start.

Failures for the year: There were 4,189 drives which failed in 2023. Doing a little math, over the last year on average, we replaced a failed drive every two hours and five minutes. If we limit hours worked to 40 per week, then we replaced a failed drive every 30 minutes.

More drive models: In 2023, we added six drive models to the list while retiring zero, giving us a total of 35 different models we are tracking.

Two of the models have been in our environment for a while but finally reached 60 drives in production by the end of 2023.

Toshiba 8TB, model HDWF180: 60 drives.
Seagate 18TB, model ST18000NM000J: 60 drives.

Four of the models were new to our production environment and have 60 or more drives in production by the end of 2023.

Seagate 12TB, model ST12000NM000J: 195 drives.
Seagate 14TB, model ST14000NM000J: 77 drives.
Seagate 14TB, model ST14000NM0018: 66 drives.
WDC 22TB, model WUH722222ALE6L4: 2,442 drives.

The drives for the three Seagate models are used to replace failed 12TB and 14TB drives. The 22TB WDC drives are a new model added primarily as two new Backblaze Vaults of 1,200 drives each.

Mixing and Matching Drive Models

There was a time when we purchased extra drives of a given model to have on hand so we could replace a failed drive with the same drive model. For example, if we needed 1,200 drives for a Backblaze Vault, we’d buy 1,300 to get 100 spares. Over time, we tested combinations of different drive models to ensure there was no impact on throughput and performance. This allowed us to purchase drives as needed, like the Seagate drives noted previously. This saved us the cost of buying drives just to have them hanging around for months or years waiting for the same drive model to fail.

Drives Not Included in This Review

We noted earlier there were 466 drives we removed from consideration in this review. These drives fall into three categories.

Testing: These are drives of a given model that we monitor and collect Drive Stats data on, but are in the process of being qualified as production drives. For example, in Q4 there were four 20TB Toshiba drives being evaluated.
Hot Drives: These are drives that were exposed to high temperatures while in operation. We have removed them from this review, but are following them separately to learn more about how well drives take the heat. We covered this topic in depth in our Q3 2023 Drive Stats Report.
Less than 60 drives: This is a holdover from when we used a single storage server of 60 drives to store a blob of data sent to us. Today we divide that same blob across 20 servers, i.e. a Backblaze Vault, dramatically improving the durability of the data. For 2024 we are going to review the 60 drive criteria and most likely replace this standard with a minimum number of drive days in a given period of time to be part of the review.

Regardless, in the Q4 2023 Drive Stats data you will find these 466 drives along with the data for the 269,756 drives used in the review.

Comparing Drive Stats for 2021, 2022, and 2023

The table below compares the AFR for each of the last three years. The table includes just those drive models which had over 200,000 drive days during 2023. The data for each year is inclusive of that year only for the operational drive models present at the end of each year. The table is sorted by drive size and then AFR.

A chart showing the failure rates of hard drives from 2021, 2022, and 2023.

Notes and Observations

What’s missing?: As noted, a drive model required 200,000 drive days or more in 2023 to make the list. Drives like the 22TB WDC model with 126,956 drive days and the 8TB Seagate with zero failures, but only 52,876 drive days didn’t qualify. Why 200,000? Each quarter we use 50,000 drive days as the minimum number to qualify as statistically relevant. It’s not a perfect metric, but it minimizes the volatility sometimes associated with drive models with a lower number of drive days.

The 2023 AFR was up: The AFR for all drives models listed was 1.70% in 2023. This compares to 1.37% in 2022 and 1.01% in 2021. Throughout 2023 we have seen the AFR rise as the average age of the drive fleet has increased. There are currently nine drive models with an average age of six years or more. The nine models make up nearly 20% of the drives in production. Since Q2, we have accelerated the migration from older drive models, typically 4TB in size, to new drive models, typically 16TB in size. This program will continue throughout 2024 and beyond.

Annualized Failure Rates vs. Drive Size

Now, let’s dig into the numbers to see what else we can learn. We’ll start by looking at the quarterly AFRs by drive size over the last three years.

A chart showing hard drive failure rates by drive size from 2021 to 2023.

To start, the AFR for 10TB drives (gold line) are obviously increasing, as are the 8TB drives (gray line) and the 12TB drives (purple line). Each of these groups finished at an AFR of 2% or higher in Q4 2023 while starting from an AFR of about 1% in Q2 2021. On the other hand, the AFR for the 4TB drives (blue line) rose initially, peaking in 2022 and has decreased since. The remaining three drive sizes—6TB, 14TB, and 16TB—have oscillated around 1% AFR for the entire period.

Zooming out, we can look at the change in AFR by drive size on an annual basis. If we compare the annual AFR results for 2022 to 2023, we get the table below. The results for each year are based only on the data from that year.

At first glance it may seem odd that the AFR for 4TB drives is going down. Especially given the average age of each of the 4TB drives models is over six years and getting older. The reason is likely related to our focus in 2023 on migrating from 4TB drives to 16TB drives. In general we migrate the oldest drives first, that is those more likely to fail in the near future. This process of culling out the oldest drives appears to mitigate the expected rise in failure rates as a drive ages.

But, not all drive models play along. The 6TB Seagate drives are over 8.6 years old on average and, for 2023, have the lowest AFR for any drive size group potentially making a mockery of the age-is-related-to-failure theory, at least over the last year. Let’s see if that holds true for the lifetime failure rate of our drives.

Lifetime Hard Drive Stats

We evaluated 269,756 drives across 35 drive models for our lifetime AFR review. The table below summarizes the lifetime drive stats data from April 2013 through the end of Q4 2023.

A chart showing lifetime annualized failure rates for 2023.

The current lifetime AFR for all of the drives is 1.46%. This is up from the end of last year (Q4 2022) which was 1.39%. This makes sense given the quarterly rise in AFR over 2023 as documented earlier. This is also the highest the lifetime AFR has been since Q1 2021 (1.49%).

The table above contains all of the drive models active as of 12/31/2023. To declutter the list, we can remove those models which don’t have enough data to be statistically relevant. This does not mean the AFR shown above is incorrect, it just means we’d like to have more data to be confident about the failure rates we are listing. To that end, the table below only includes those drive models which have two million drive days or more over their lifetime, this gives us a manageable list of 23 drive models to review.

A chart showing the 2023 annualized failure rates for drives with more than 2 million drive days in their lifetimes.

Using the table above we can compare the lifetime drive failure rates of different drive models. In the charts below, we group the drive models by manufacturer, and then plot the drive model AFR versus average age in months of each drive model. The relative size of each circle represents the number of drives in each cohort. The horizontal and vertical scales for each manufacturer chart are the same.

A chart showing annualized failure rates by average age and drive manufacturer.

Notes and Observations

Drive migration: When selecting drive models to migrate we could just replace the oldest drive models first. In this case, the 6TB Seagate drives. Given there are only 882 drives—that’s less than one Backblaze Vault—the impact on failure rates would be minimal. That aside, the chart makes it clear that we should continue to migrate our 4TB drives as we discussed in our recent post on which drives reside in which storage servers. As that post notes, there are other factors, such as server age, server size (45 vs. 60 drives), and server failure rates which help guide our decisions.

HGST: The chart on the left below shows the AFR trendline (second order polynomial) for all of our HGST models. It does not appear that drive failure consistently increases with age. The chart on the right shows the same data with the HGST 4TB drive models removed. The results are more in line with what we’d expect, that drive failure increased over time. While the 4TB drives perform great, they don’t appear to be the AFR benchmark for newer/larger drives.

One other potential factor not explored here, is that beginning with the 8TB drive models, helium was used inside the drives and the drives were sealed. Prior to that they were air-cooled and not sealed. So did switching to helium inside a drive affect the failure profile of the HGST drives? Interesting question, but with the data we have on hand, I’m not sure we can answer it—or that it matters much anymore as helium is here to stay.

Seagate: The chart on the left below shows the AFR trendline (second order polynomial) for our Seagate models. As with the HGST models, it does not appear that drive failure continues to increase with age. For the chart on the right, we removed the drive models that were greater than seven years old (average age).

Interestingly, the trendline for the two charts is basically the same up to the six year point. If we attempt to project past that for the 8TB and 12TB drives there is no clear direction. Muddying things up even more is the fact that the three models we removed because they are older than seven years are all consumer drive models, while the remaining drive models are all enterprise drive models. Will that make a difference in the failure rates of the enterprise drive model when they get to seven or eight or even nine years of service? Stay tuned.

Toshiba and WDC: As for the Toshia and WDC drive models, there is a little over three years worth of data and no discernible patterns have emerged. All of the drives from each of these manufacturers are performing well to date.

Drive Failure and Drive Migration

One thing we’ve seen above is that drive failure projections are typically drive model dependent. But we don’t migrate drive models as a group, instead, we migrate all of the drives in a storage server or Backblaze Vault. The drives in a given server or Vault may not be the same model. How we choose which servers and Vaults to migrate will be covered in a future post, but for now we’ll just say that drive failure isn’t everything.

The Hard Drive Stats Data

The complete data set used to create the tables and charts in this report is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data itself to anyone; it is free.

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The Drive Stats of Backblaze Storage Pods

2024-01-03 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/the-drive-stats-of-backblaze-storage-pods/

Since 2009, Backblaze has written extensively about the data storage servers we created and deployed which we call Backblaze Storage Pods. We not only wrote about our Storage Pods, we open sourced the design, published a parts list, and even provided instructions on how to build one. Many people did. Of the six storage pod versions we produced, four of them are still in operation in our data centers today. Over the last few years, we began using storage servers from Dell and, more recently, Supermicro, as they have proven to be economically and operationally viable in our environment.

Since 2013, we have also written extensively about our Drive Stats, sharing reports on the failure rates of the HDDs and SSDs in our legion of storage servers. We have examined the drive failure rates by manufacturer, size, age, and so on, but we have never analyzed the drive failure rates of the storage servers—until now. Let’s take a look at the Drive Stats for our fleet of storage servers and see what we can learn.

Storage Pods, Storage Servers, and Backblaze Vaults

Let’s start with a few definitions:

Storage Server: A storage server is our generic name for a server from any manufacturer which we use to store customer data. We use storage servers from Backblaze, Dell, and Supermicro.
Storage Pod: A Storage Pod is the name we gave to the storage servers Backblaze designed and had built for our data centers. The first Backblaze Storage Pod version was announced in September 2009. Subsequent versions are 2.0, 3.0, 4.0, 4.5, 5.0, 6.0, and 6.1. All but 6.1 were announced publicly.
Backblaze Vault: A Backblaze Vault is 20 storage servers grouped together for the purpose of data storage. Uploaded data arrives at a given storage server within a Backblaze Vault and is encoded into 20 parts with a given part being either a data blob or parity. Each of the 20 parts (shards) is then stored on one of the 20 storage servers.

As you review the charts and tables here are a few things to know about Backblaze Vaults.

There are currently six cohorts of storage servers in operation today: Supermicro, Dell, Backblaze 3.0, Backblaze 5.0, Backblaze 6.0, and Backblaze 6.1.
A given Vault will always be made up from one of the six cohorts of storage servers noted above. For example, Vault 1016 is made up of 20 Backblaze 5.0 Storage Pods and Vault 1176 is made of the 20 Supermicro servers.
A given Vault is made up of storage servers that contain the same number of drives as follows:
- Dell servers: 26 drives.
- Backblaze 3.0 and Backblaze 5.0 servers: 45 drives.
- Backblaze 6.0, Backblaze 6.1, and Supermicro servers: 60 drives.
All of the hard drives in a Backblaze Vault will be logically the same size; so, 16TB drives for example.

Drive Stats by Backblaze Vault Cohort

With the background out of the way, let’s get started. As of the end of Q3 2023, there were a total of 241 Backblaze Vaults divided into the six cohorts, as shown in the chart below. The chart includes the server cohort, the number of Vaults in the cohort, and the percentage that cohort is of the total number of Vaults.

A pie chart showing the types of Backblaze Vaults by percentage.

Vaults consisting of Backblaze servers still comprise 68% of the vaults in use today (shaded from orange to red), although that number is dropping as older Vaults are being replaced with newer server models, typically the Supermicro systems.

The table below shows the Drive Stats for the different Vault cohorts identified above for Q3 2023.

A chart showing the Drive Stats for Backblaze Vaults.

The Avg Age (months) column is the average age of the drives, not the average age of the Vaults. The two may seem to be related, that’s not entirely the case. It is true the Backblaze 3.0 Vaults were deployed first followed in order by the 5.0 and 6.0 Vaults, but that’s where things get messy. There was some overlap between the Dell and Backblaze 6.1 deployments as the Dell systems were deployed in our central Europe data center, while the 6.1 Vaults continued to be deployed in the U.S. In addition, some migrations from the Backblaze 3.0 Vaults were initially done to 6.1 Vaults while we were also deploying new drives in the Supermicro Vaults.

The AFR for each of the server versions does not seem to follow any pattern or correlation to the average age of the drives. This was unexpected because, in general, as drives pass about four years in age, they start to fail more often. This should mean that Vaults with older drives, especially those with drives whose average age is over four years (48 months), should have a higher failure rate. But, as we can see, the Backblaze 5.0 Vaults defy that expectation.

To see if we can determine what’s going on, let’s expand on the previous table and dig into the different drive sizes that are in each Vault cohort, as shown in the table below.

A table showing Drive Stats by server version and drive size.

Observations for Each Vault Cohort

Backblaze 3.0: Obviously these Vaults have the oldest drives and, given their AFR is nearly twice the average for all of the drives (1.53%), it would make sense to migrate off of these servers. Of course the 6TB drives seem to be the exception, but at some point they will most likely “hit the wall” and start failing.
Backblaze 5.0: There are two Backblaze 5.0 drive sizes (4TB and 8TB) and the AFR for each is well below the average AFR for all of the drives (1.53%). The average age of the two drive sizes is nearly seven years or more. When compared to the Backblaze 6.0 Vaults, it would seem that migrating the 5.0 Vaults could wait, but there is an operational consideration here. The Backblaze 5.0 Vaults each contain 45 drives, and from the perspective of data density per system, they should be migrated to 60 drive servers sooner rather than later to optimize data center rack space.
Backblaze 6.0: These Vaults as a group don’t seem to make any of the five different drive sizes happy. Only the AFR of the 4TB drives (1.42%) is just barely below the average AFR for all of the drives. The rest of the drive groups are well above the average.
Backblaze 6.1: The 6.1 servers are similar to the 6.0 servers, but with an upgraded CPU and faster NIC cards. Is that why their annualized failure rates are much lower than the 6.0 systems? Maybe, but the drives in the 6.1 systems are also much younger, about half the age of those in the 6.0 systems, so we don’t have the full picture yet.
Dell: The 14TB drives in the Dell Vaults seem to be a problem at a 5.46% AFR. Much of that is driven by two particular Dell vaults which have a high AFR, over 8% for Q3. This appears to be related to their location in the data center. All 40 of the Dell servers which make up these two Vaults were relocated to the top of 52U racks, and it appears that initially they did not like their new location. Recent data indicates they are doing much better, and we’ll publish that data soon. We’ll need to see what happens over the next few quarters. That said, if you remove these two Vaults from the Dell tally, the AFR is a respectable 0.99% for the remaining Vaults.
Supermicro: This server cohort is mostly 16TB drives which are doing very well with an AFR of 0.62%. The one 14TB Vault is worth our attention with an AFR of 1.95%, and the 22TB Vault is too new to do any analysis.

Drive Stats by Drive Size and Vault Cohort

Another way to look at the data is to take the previous table and re-sort it by drive size. Before we do that let’s establish the AFR for the different drive sizes aggregated over all Vaults.

A bar chart showing annualized failure rates for Backblaze Vaults by drive size.

As we can see in Q3 the 6TB and 22TB Vaults had zero failures (AFR = 0%). Also, the 10TB Vault is indeed only one Vault, so there are no other 10TB Vaults to compare it to. Given this, for readability, we will remove the 6TB, 10TB, and 22TB Vaults from the next table which compares how each drive size has fared in each of the six different Vault cohorts.

A table showing the annualized failure rates of servers by drive size and server version, not displaying the 6TB, 10TB, and 22TB Vaults.

Currently we are migrating the 4TB drive Vaults to larger Vaults, replacing them with drives of 16TB and above. The migrations are done using an in-house system which we’ll expand upon in a future post. The specific order of migrations is based on failure rates and durability of the existing 4TB Vaults with an eye towards removing the Backblaze 3.0 systems first as they are nearly 10 years old in some cases, and many of the non-drive replacement parts are no longer available. Whether we give away, destroy, or recycle the retired Backblaze 3.0 Storage Pods (sans drives) is still being debated.

For the 8TB drive Vaults, the Backblaze 5.0 Vaults are up first for migration when the time comes. Yes, their AFR is lower then the Backblaze 6.0 Vaults, but remember: the 5.0 Vaults are 45 drive units which are not as efficient storage density-wise versus the 60 drive systems.

Speaking of systems with less than 60 drives, the Dell servers are 26 drives. Those 26 drives are in a 2U chassis versus a 4U chassis for all of the other servers. The Dell servers are not quite as dense as the 60 drive units, but their 2U form factor gives us some flexibility in filling racks, especially when you add utility servers (1U or 2U) and networking gear to the mix. That’s one of the reasons the two Dell Vaults we noted earlier were moved to the top of the 52U racks. FYI, those two Vaults hold 14TB drives and are two of the four 14TB Dell Vaults making up the 5.46% AFR. The AFR for the Dell Vaults with 12TB and 16TB drives is 0.76% and 0.92% respectively. As noted earlier, we expect the AFR for 14TB Dell Vaults to drop over the coming months.

What Have We Learned?

Our goal today was to see what we can learn about the drive failure rates of the storage servers we use in our data centers. All of our storage servers are grouped in operational systems we call Backblaze Vaults. There are six different cohorts of storage servers with each vault being composed of the same type of storage server, hence there are six types of vaults.

As we dug into data, we found that the different cohorts of Vaults had different annualized failure rates. What we didn’t find was a correlation between the age of the drives used in the servers and the annualized failure rates of the different Vault cohorts. For example, the Backblaze 5.0 Vaults have a much lower AFR of 0.99% versus the Backblaze 6.0 Vault AFR at 2.14%—even though the drives in the 5.0 Vaults are nearly twice as old on average than the drives in the 6.0 Vaults.

This suggests that while our initial foray into the annualized failure rates of the different Vault cohorts is a good first step, there is more to do here.

Where Do We Go From Here?

In general, all of the Vaults in a given cohort were manufactured to the same specifications, used the same parts, and were assembled using the same processes. One obvious difference is that different drive models are used in each Vault cohort. For example, the 16TB vaults are composed of seven different drive models. Do some drive models work better in one Vault cohort versus another? Over the next couple of quarters we’ll dig into the data and let you know what we find. Hopefully it will add to our understanding of the annualized failures rates of the different Vault cohorts. Stay tuned.

The post The Drive Stats of Backblaze Storage Pods appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q3 2023

2023-11-14 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2023/

A decorative image showing the title Q3 2023 Drive Stats.

At the end of Q3 2023, Backblaze was monitoring 263,992 hard disk drives (HDDs) and solid state drives (SSDs) in our data centers around the world. Of that number, 4,459 are boot drives, with 3,242 being SSDs and 1,217 being HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2023 Drive Stats review.

That leaves us with 259,533 HDDs that we’ll focus on in this report. We’ll review the quarterly and lifetime failure rates of the data drives as of the end of Q3 2023. Along the way, we’ll share our observations and insights on the data presented, and, for the first time ever, we’ll reveal the drive failure rates broken down by data center.

Q3 2023 Hard Drive Failure Rates

At the end of Q3 2023, we were managing 259,533 hard drives used to store data. For our review, we removed 449 drives from consideration as they were used for testing purposes, or were drive models which did not have at least 60 drives. This leaves us with 259,084 hard drives grouped into 32 different models.

The table below reviews the annualized failure rate (AFR) for those drive models for the Q3 2023 time period.

A table showing the quarterly annualized failure rates of Backblaze hard drives.

Notes and Observations on the Q3 2023 Drive Stats

The 22TB drives are here: At the bottom of the list you’ll see the WDC 22TB drives (model: WUH722222ALE6L4). A Backblaze Vault of 1,200 drives (plus four) is now operational. The 1,200 drives were installed on September 29, so they only have one day of service each in this report, but zero failures so far.
The old get bolder: At the other end of the time-in-service spectrum are the 6TB Seagate drives (model: ST6000DX000) with an average of 101 months in operation. This cohort had zero failures in Q3 2023 with 883 drives and a lifetime AFR of 0.88%.
Zero failures: In Q3, six different drive models managed to have zero drive failures during the quarter. But only the 6TB Seagate, noted above, had over 50,000 drive days, our minimum standard for ensuring we have enough data to make the AFR plausible.
One failure: There were four drive models with one failure during Q3. After applying the 50,000 drive day metric, two drives stood out:
1. WDC 16TB (model: WUH721816ALE6L0) with a 0.15% AFR.
2. Toshiba 14TB (model: MG07ACA14TEY) with a 0.63% AFR.

The Quarterly AFR Drops

In Q3 2023, quarterly AFR for all drives was 1.47%. That was down from 2.2% in Q2 and also down from 1.65% a year ago. The quarterly AFR is based on just the data in that quarter, so it can often fluctuate from quarter to quarter.

In our Q2 2023 report, we suspected the 2.2% for the quarter was due to the overall aging of the drive fleet and in particular we pointed a finger at specific 8TB, 10TB, and 12TB drive models as potential culprits driving the increase. That prediction fell flat in Q3 as nearly two-thirds of drive models experienced a decreased AFR quarter over quarter from Q2 and any increases were minimal. This included our suspect 8TB, 10TB, and 12TB drive models.

It seems Q2 was an anomaly, but there was one big difference in Q3: we retired 4,585 aging 4TB drives. The average age of the retired drives was just over eight years, and while that was a good start, there’s another 28,963 4TB drives to go. To facilitate the continuous retirement of aging drives and make the data migration process easy and safe we use CVT, our awesome in-house data migration software which we’ll cover at another time.

A Hot Summer and the Drive Stats Data

As anyone should in our business, Backblaze continuously monitors our systems and drives. So, it was of little surprise to us when the folks at NASA confirmed the summer of 2023 as Earth’s hottest on record. The effects of this record-breaking summer showed up in our monitoring systems in the form of drive temperature alerts. A given drive in a storage server can heat up for many reasons: it is failing; a fan in the storage server has failed; other components are producing additional heat; the air flow is somehow restricted; and so on. Add in the fact that the ambient temperature within a data center often increases during the summer months, and you can get more temperature alerts.

In reviewing the temperature data for our drives in Q3, we noticed that a small number of drives exceeded the maximum manufacturer’s temperature for at least one day. The maximum temperature for most drives is 60°C, except for the 12TB, 14TB, and 16TB Toshiba drives which have a maximum temperature of 55°C. Of the 259,533 data drives in operation in Q3, there were 354 individual drives (0.0013%) that exceeded their maximum manufacturer temperature. Of those only two drives failed, leaving 352 drives which were still operational as of the end of Q3.

While temperature fluctuation is part of running data centers and temp alerts like these aren’t unheard of, our data center teams are looking into the root causes to ensure we’re prepared for the inevitability of increasingly hot summers to come.

Will the Temperature Alerts Affect Drive Stats?

The two drives which exceeded their maximum temperature and failed in Q3 have been removed from the Q3 AFR calculations. Both drives were 4TB Seagate drives (model: ST4000DM000). Given that the remaining 352 drives which exceeded their temperature maximum did not fail in Q3, we have left them in the Drive Stats calculations for Q3 as they did not increase the computed failure rates.

Beginning in Q4, we will remove the 352 drives from the regular Drive Stats AFR calculations and create a separate cohort of drives to track that we’ll name Hot Drives. This will allow us to track the drives which exceeded their maximum temperature and compare their failure rates to those drives which operated within the manufacturer’s specifications. While there are a limited number of drives in the Hot Drives cohort, it could give us some insight into whether drives being exposed to high temperatures could cause a drive to fail more often. This heightened level of monitoring will identify any increase in drive failures so that they can be detected and dealt with expeditiously.

New Drive Stats Data Fields in Q3

In Q2 2023, we introduced three new data fields that we started populating in the Drive Stats data we publish: vault_id, pod_id, and is_legacy_format. In Q3, we are adding three more fields into each drive records as follows:

datacenter: The Backblaze data center where the drive is installed, currently one of these values: ams5, iad1, phx1, sac0, and sac2.
cluster_id: The name of a given collection of storage servers logically grouped together to optimize system performance. Note: At this time the cluster_id is not always correct, we are working on fixing that.
pod_slot_num: The physical location of a drive within a storage server. The specific slot differs based on the storage server type and capacity: Backblaze (45 drives), Backblaze (60 drives), Dell (26 drives), or Supermicro (60 drives). We’ll dig into these differences in another post.

With these additions, the new schema beginning in Q3 2023 is:

date
serial_number
model
capacity_bytes
failure
datacenter (Q3)
cluster_id (Q3)
vault_id (Q2)
pod_id (Q2)
pod_slot_num (Q3)
is_legacy_format (Q2)
smart_1_normalized
smart_1_raw
The remaining SMART value pairs (as reported by each drive model)

Beginning in Q3, these data data fields have been added to the publicly available Drive Stats files that we publish each quarter.

Failure Rates by Data Center

Now that we have the data center for each drive we can compute the AFRs for the drives in each data center. Below you’ll find the AFR for each of five data centers for Q3 2023.

Notes and Observations

Null?: The drives which reported a null or blank value for their data center are grouped in four Backblaze vaults. David, the Senior Infrastructure Software Engineer for Drive Stats, described the process of how we gather all the parts of the Drive Stats data each day. The TL:DR is that vaults can be too busy to respond at the moment we ask, and since the data center field is nice-to-have data, we get a blank field. We can go back a day or two to find the data center value, which we will do in the future when we report this data.
sac0?: sac0 has the highest AFR of all of the data centers, but it also has the oldest drives—nearly twice as old, on average, versus the next closest in data center, sac2. As discussed previously, drive failures do seem to follow the “bathtub curve”, although recently we’ve seen the curve start out flatter. Regardless, as drive models age, they do generally fail more often. Another factor could be that sac0, and to a lesser extent sac2, has some of the oldest Storage Pods, including a handful of 45-drive units. We are in the process of using CVT to replace these older servers while migrating from 4TB to 16TB and larger drives.
iad1: The iad data center is the foundation of our eastern region and has been growing rapidly since coming online about a year ago. The growth is a combination of new data and customers using our cloud replication capability to automatically make a copy of their data in another region.
Q3 Data: This chart is for Q3 data only and includes all the data drives, including those with less than 60 drives per model. As we track this data over the coming quarters, we hope to get some insight into whether different data centers really have different drive failure rates, and, if so, why.

Lifetime Hard Drive Failure Rates

As of September 30, 2023, we were tracking 259,084 hard drives used to store customer data. For our lifetime analysis, we collect the number of drive days and the number of drive failures for each drive beginning from the time a drive was placed into production in one of our data centers. We group these drives by model, then sum up the drive days and failures for each model over their lifetime. That chart is below.

One of the most important columns on this chart is the confidence interval, which is the difference between the low and high AFR confidence levels calculated at 95%. The lower the value, the more certain we are of the AFR stated. We like a confidence interval to be 0.5% or less. When the confidence interval is higher, that is not necessarily bad, it just means we either need more data or the data is somewhat inconsistent.

The table below contains just those drive models which have a confidence interval of less than 0.5%. We have sorted the list by drive size and then by AFR.

A chart showing Backblaze hard drive annualized failure rates with a confidence interval of less than 0.5%.

The 4TB, 6TB, 8TB, and some of the 12TB drive models are no longer in production. The HGST 12TB models in particular can still be found, but they have been relabeled as Western Digital and given alternate model numbers. Whether they have materially changed internally is not known, at least to us.

One final note about the lifetime AFR data: you might have noticed the AFR for all of the drives hasn’t changed much from quarter to quarter. It has vacillated between 1.39% to 1.45% percent for the last two years. Basically, we have lots of drives with lots of time-in-service so it is hard to move the needle up or down. While the lifetime stats for individual drive models can be very useful, the lifetime AFR for all drives will probably get less and less interesting as we add more and more drives. Of course, a few hundred thousand drives that never fail could arrive, so we will continue to calculate and present the lifetime AFR.

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Stats Data webpage. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q3 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Overload to Overhaul: How We Upgraded Drive Stats Data

2023-10-05 David Winings

Post Syndicated from David Winings original https://www.backblaze.com/blog/overload-to-overhaul-how-we-upgraded-drive-stats-data/

A decorative image showing the words "overload to overhaul: how we upgraded Drive Stats data."

This year, we’re celebrating 10 years of Drive Stats. Coincidentally, we also made some upgrades to how we run our Drive Stats reports. We reported on how an attempt to migrate triggered a weeks-long recalculation of the dataset, leading us to map the architecture of the Drive Stats data.

This follow-up article focuses on the improvements we made after we fixed the existing bug (because hey, we were already in there), and then presents some of our ideas for future improvements. Remember that those are just ideas so far—they may not be live in a month (or ever?), but consider them good food for thought, and know that we’re paying attention so that we can pass this info along to the right people.

Now, onto the fun stuff.

Quick Refresh: Drive Stats Data Architecture

The podstats generator runs on every Storage Pod, what we call any host that holds customer data, every few minutes. It’s a C++ program that collects SMART stats and a few other attributes, then converts them into an .xml file (“podstats”). Those are then pushed to a central host in each datacenter and bundled. Once the data leaves these central hosts, it has entered the domain of what we will call Drive Stats.

Now let’s go into a little more detail: when you’re gathering stats about drives, you’re running a set of modules with dependencies to other modules, forming a data-dependency tree. Each time a module “runs”, it takes information, modifies it, and writes it to a disk. As you run each module, the data will be transformed sequentially. And, once a quarter, we run a special module that collects all the attributes for our Drive Stats reports, collecting data all the way down the tree.

Here’s a truncated diagram of the whole system, to give you an idea of what the logic looks like:

A diagram of the mapped logic of the Drive Stats modules. — An abbreviated logic map of Drive Stats modules.

As you move down through the module layers, the logic gets more and more specialized. When you run a module, the first thing the module does is check in with the previous module to make sure the data exists and is current. It caches the data to disk at every step, and fills out the logic tree step by step. So for example, drive_stats, being a “per-day” module, will write out a file such as /data/drive_stats/2023-01-01.json.gz when it finishes processing. This lets future modules read that file to avoid repeating work.

This work deduplication process saves us a lot of time overall—but it also turned out to be the root cause of our weeks-long process when we were migrating Drive Stats to our new host. We fixed that by implementing versions to each module.

While You’re There… Why Not Upgrade?

Once the dust from the bug fix had settled, we moved forward to try to modernize Drive Stats in general. Our daily report still ran quite slowly, on the order of several hours, and there was some low-hanging fruit to chase.

Waiting On You, `failures_with_stats`

First things first, we saved a log of a run of our daily reports in Jenkins. Then we wrote an analyzer to see which modules were taking a lot of time. failures_with_stats was our biggest offender, running for about two hours, while every other module took about 15 minutes.

An image showing runtimes for each module when running a Drive Stats report. — Not *quite* two hours.

Upon investigation, the time cost had to do with how the date_range module works. This takes us back to caching: our module checks if the file has been written already, and if it has, it uses the cached file. However, a date range is written to a single file. That is, Drive Stats will recognize “Monday to Wednesday” as distinct from “Monday to Thursday” and re-calculate the entire range. This is a problem for a workload that is essentially doing work for all of time, every day.

On top of this, the raw Drive Stats data, which is a dependency for failures_with_stats, would be gzipped onto a disk. When each new query triggered a request to recalculate all-time data, each dependency would pick up the podstats file from disk, decompress it, read it into memory, and do that for every day of all time. We were picking up and processing our biggest files every day, and time continued to make that cost larger.

Our solution was what I called the “Date Range Accumulator.” It works as follows:

If we have a date range like “all of time as of yesterday” (or any partial range with the same start), consider it as a starting point.
Make sure that the version numbers don’t consider our starting point to be too old.
Do the processing of today’s data on top of our starting point to create “all of time as of today.”

To do this, we read the directory of the date range accumulator, find the “latest” valid one, and use that to determine the delta (change) to our current date. Basically, the module says: “The last time I ran this was on data from the beginning of time to Thursday. It’s now Friday. I need to run the process for Friday, and then add that to the compiled all-time.” And, before it does that, it double checks the version number to avoid errors. (As we noted in our previous article, if it doesn’t see the correct version number, instead of inefficiently running all data, it just tells you there is a version number discrepancy.)

The code is also a bit finicky—there are lots of snags when it comes to things like defining exceptions, such as if we took a drive out of the fleet, but it wasn’t a true failure. The module also needed to be processable day by day to be usable with this technique.

Still, even with all the tweaks, it’s massively better from a runtime perspective for eligible candidates. Here’s our new failures_with_stats runtime:

An output of module runtime after the Drive Stats improvements were made. — Ahh, sweet victory.

Note that in this example, we’re running that 60-day report. The daily report is quite a bit quicker. But, at least the 60-day report is a fixed amount of time (as compared with the all-time dataset, which is continually growing).

Code Upgrade to Python 3

Next, we converted our code to Python 3. (Shout out to our intern, Anath, who did amazing work on this part of the project!) We didn’t make this improvement just to make it; no, we did this because I wanted faster JSON processors, and a lot of the more advanced ones did not work with Python 2. When we looked at the time each module took to process, most of that was spent serializing and deserializing JSON.

What Is JSON Parsing?

JSON is an open standard file format that uses human readable text to store and transmit data objects. Many modern programming languages include code to generate and parse JSON-format data. Here’s how you might describe a person named John, aged 30, from New York using JSON:

{ 
“firstName”: “John”, 
“age”: 30,
“State”: “New York”
}

You can express those attributes into a single line of code and define them as a native object:

x = { 'name':'John', 'age':30, 'city':'New York'}

“Parsing” is the process by which you take the JSON data and make it into an object that you can plug into another programming language. You’d write your script (program) in Python, it would parse (interpret) the JSON data, and then give you an answer. This is what that would look like:

import json

# some JSON:
x = '''
{ 
	"firstName": "John", 
	"age": 30,
	"State": "New York"
}
'''

# parse x:
y = json.loads(x)

# the result is a Python object:
print(y["name"])

If you run this script, you’ll get the output “John.” If you change print(y["name"]) to print(y["age"]), you’ll get the output “30.” Check out this website if you want to interact with the code for yourself. In practice, the JSON would be read from a database, or a web API, or a file on disk rather than defined as a “string” (or text) in the Python code. If you are converting a lot of this JSON, small improvements in efficiency can make a big difference in how a program performs.

And Implementing UltraJSON

Upgrading to Python 3 meant we could use UltraJSON. This was approximately 50% faster than the built-in Python JSON library we used previously.

We also looked at the XML parsing for the podstats files, since XML parsing is often a slow process. In this case, we actually found our existing tool is pretty fast (and since we wrote it 10 years ago, that’s pretty cool). Off-the-shelf XML parsers take quite a bit longer because they care about a lot of things we don’t have to: our tool is customized for our Drive Stats needs. It’s a well known adage that you should not parse XML with regular expressions, but if your files are, well, very regular, it can save a lot of time.

What Does the Future Hold?

Now that we’re working with a significantly faster processing time for our Drive Stats dataset, we’ve got some ideas about upgrades in the future. Some of these are easier to achieve than others. Here’s a sneak peek of some potential additions and changes in the future.

Data on Data

In keeping with our data-nerd ways, I got curious about how much the Drive Stats dataset is growing and if the trend is linear. We made this graph, which shows the baseline rolling average, and has a trend line that attempts to predict linearly.

A graph showing the rate at which the Drive Stats dataset has grown over time.

I envision this graph living somewhere on the Drive Stats page and being fully interactive. It’s just one graph, but this and similar tools available on our website would be 1) fun and 2) lead to some interesting insights for those who don’t dig in line by line.

What About Changing the Data Module?

The way our current module system works, everything gets processed in a tree approach, and they’re flat files. If we used something like SQLite or Parquet, we’d be able to process data in a more depth-first way, and that would mean that we could open a file for one module or data range, process everything, and not have to read the file again.

And, since one of the first things that our Drive Stats expert, Andy Klein, does with our .xml data is to convert it to SQLite, outputting it in a queryable form would save a lot of time.

We could also explore keeping the data as a less-smart filetype, but using something more compact than JSON, such as MessagePack.

Can We Improve Failure Tracking and Attribution?

One of the odd things about our Drive Stats datasets is that they don’t always and automatically agree with our internal data lake. Our Drive Stats outputs have some wonkiness that’s hard to replicate, and it’s mostly because of exceptions we build into the dataset. These exceptions aren’t when a drive fails, but rather when we’ve removed it from the fleet for some other reason, like if we were testing a drive or something along those lines. (You can see specific callouts in Drive Stats reports, if you’re interested.) It’s also where a lot of Andy’s manual work on Drive Stats data comes in each month: he’s often comparing the module’s output with data in our datacenter ticket tracker.

These tickets come from the awesome data techs working in our data centers. Each time a drive fails and they have to replace it, our techs add a reason for why it was removed from the fleet. While not all drive replacements are “failures”, adding a root cause to our Drive Stats dataset would give us more confidence in our failure reporting (and would save Andy comparing the two lists).

The Result: Faster Drive Stats and Future Fun

These two improvements (the date range accumulator and upgrading to Python 3) resulted in hours, and maybe even days, of work saved. Even from a troubleshooting point of view, we often wouldn’t know if the process was stuck, or if this was the normal amount of time the module should take to run. Now, if it takes more than about 15 minutes to run a report, you’re sure there’s a problem.

While the Drive Stats dataset can’t really be called “big data”, it provides a good, concrete example of scaling with your data. We’ve been collecting Drive Stats for just over 10 years now, and even though most of the code written way back when is inherently sound, small improvements that seem marginal become amplified as datasets grow.

Now that we’ve got better documentation of how everything works, it’s going to be easier to keep Drive Stats up-to-date with the best tools and run with future improvements. Let us know in the comments what you’d be interested in seeing.

The post Overload to Overhaul: How We Upgraded Drive Stats Data appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The SSD Edition: 2023 Drive Stats Mid-Year Review

2023-09-26 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/ssd-edition-2023-mid-year-drive-stats-review/

A decorative image displaying the title 2023 Mid-Year Report Drive Stats SSD Edition.

Welcome to the 2023 Mid-Year SSD Edition of the Backblaze Drive Stats review. This report is based on data from the solid state drives (SSDs) we use as storage server boot drives on our Backblaze Cloud Storage platform. In this environment, the drives do much more than boot the storage servers. They also store log files and temporary files produced by the storage server. Each day a boot drive will read, write, and delete files depending on the activity of the storage server itself.

We will review the quarterly and lifetime failure rates for these drives, and along the way we’ll offer observations and insights to the data presented. In addition, we’ll take a first look at the average age at which our SSDs fail, and examine how well SSD failure rates fit the ubiquitous bathtub curve.

Mid-Year SSD Results by Quarter

As of June 30, 2023, there were 3,144 SSDs in our storage servers. This compares to 2,558 SSDs we reported in our 2022 SSD annual report. We’ll start by presenting and discussing the quarterly data from each of the last two quarters (Q1 2022 and Q2 2023).

Notes and Observations

Data is by quarter: The data used in each table is specific to that quarter. That is, the number of drive failures and drive days are inclusive of the specified quarter, Q1 or Q2. The drive counts are as of the last day of each quarter.

Drives added: Since our last SSD report, ending in Q4 2022, we added 238 SSD drives to our collection. Of that total, the Crucial (model: CT250MX500SSD1) led the way with 110 new drives added, followed by 62 new WDC drives (model: WD Blue SA510 2.5) and 44 Seagate drives (model: ZA250NM1000).

Really high annualized failure rates (AFR): Some of the failure rates, that is AFR, seem crazy high. How could the Seagate model SSDSCKKB240GZR have an annualized failure rate over 800%? In that case, in Q1, we started with two drives and one failed shortly after being installed. Hence, the high AFR. In Q2, the remaining drive did not fail and the AFR was 0%. Which AFR is useful? In this case neither, we just don’t have enough data to get decent results. For any given drive model, we like to see at least 100 drives and 10,000 drive days in a given quarter as a minimum before we begin to consider the calculated AFR to be “reasonable.” We include all of the drive models for completeness, so keep an eye on drive count and drive days before you look at the AFR with a critical eye.

Quarterly Annualized Failures Rates Over Time

The data in any given quarter can be volatile with factors like drive age and the randomness of failures factoring in to skew the AFR up or down. For Q1, the AFR was 0.96% and, for Q2, the AFR was 1.05%. The chart below shows how these quarterly failure rates relate to previous quarters over the last three years.

As you can see, the AFR fluctuates between 0.36% and 1.72%, so what’s the value of quarterly rates? Well, they are useful as the proverbial canary in a coal mine. For example, the AFR in Q1 2021 (0.58%) jumped 1.51% in Q2 2021, then to 1.72% in Q3 2021. A subsequent investigation showed one drive model was the primary cause of the rise and that model was removed from service.

It happens from time to time that a given drive model is not compatible with our environment, and we will moderate or even remove that drive’s effect on the system as a whole. While not as critical as data drives in managing our system’s durability, we still need to keep boot drives in operation to collect the drive/server/vault data they capture each day.

How Backblaze Uses the Data Internally

As you’ve seen in our SSD and HDD Drive Stats reports, we produce quarterly, annual, and lifetime charts and tables based on the data we collect. What you don’t see is that every day we produce similar charts and tables for internal consumption. While typically we produce one chart for each drive model, in the example below we’ve combined several SSD models into one chart.

The “Recent” period we use internally is 60 days. This differs from our public facing reports which are quarterly. In either case, charts like the one above allow us to quickly see trends requiring further investigation. For example, in our chart above, the recent results of the Micron SSDs indicate a deeper dive into the data behind the charts might be necessary.

By collecting, storing, and constantly analyzing the Drive Stats data we can be proactive in maintaining our durability and availability goals. Without our Drive Stats data, we would be inclined to over-provision our systems as we would be blind to the randomness of drive failures which would directly impact those goals.

A First Look at More SSD Stats

Over the years in our quarterly Hard Drive Stats reports, we’ve examined additional metrics beyond quarterly and lifetime failure rates. Many of these metrics can be applied to SSDs as well. Below we’ll take a first look at two of these: the average age of failure for SSDs and how well SSD failures correspond to the bathtub curve. In both cases, the datasets are small, but are a good starting point as the number of SSDs we monitor continues to increase.

The Average Age of Failure for SSDs

Previously, we calculated the average age at which a hard drive in our system fails. In our initial calculations that turned out to be about two years and seven months. That was a good baseline, but further analysis was required as many of the drive models used in the calculations were still in service and hence some number of them could fail, potentially affecting the average.

We are going to apply the same calculations to our collection of failed SSDs and establish a baseline we can work from going forward. Our first step was to determine the SMART_9_RAW value (power-on-hours or POH) for the 63 failed SSD drives we have to date. That’s not a great dataset size, but it gave us a starting point. Once we collected that information, we computed that the average age of failure for our collection of failed SSDs is 14 months. Given that the average age of the entire fleet of our SSDs is just 25 months, what should we expect to happen as the average age of the SSDs still in operation increases? The table below looks at three drive models which have a reasonable amount of data.

		Good Drives		Failed Drives
MFG	Model	Count	Avg Age	Count	Avg Age
Crucial	CT250MX500SSD1	598	11 months	9	7 months
Seagate	ZA250CM10003	1,114	28 months	14	11 months
Seagate	ZA250CM10002	547	40 months	17	25 months

As we can see in the table, the average age of the failed drives increases as the average age of drives in operation (good drives) increases. In other words, it is reasonable to expect that the average age of SSD failures will increase as the entire fleet gets older.

Is There a Bathtub Curve for SSD Failures?

Previously we’ve graphed our hard drive failures over time to determine their fit to the classic bathtub curve used in reliability engineering. Below, we used our SSD data to determine how well our SSD failures fit the bathtub curve.

While the actual curve (blue line) produced by the SSD failures over each quarter is a bit “lumpy”, the trend line (second order polynomial) does have a definite bathtub curve look to it. The trend line is about a 70% match to the data, so we can’t be too confident of the curve at this point, but for the limited amount of data we have, it is surprising to see how the occurrences of SSD failures are on a path to conform to the tried-and-true bathtub curve.

SSD Lifetime Annualized Failure Rates

As of June 30, 2023, there were 3,144 SSDs in our storage servers. The table below is based on the lifetime data for the drive models which were active as of the end of Q2 2023.

Notes and Observations

Lifetime AFR: The lifetime data is cumulative from Q4 2018 through Q2 2023. For this period, the lifetime AFR for all of our SSDs was 0.90%. That was up slightly from 0.89% at the end of Q4 2022, but down from a year ago, Q2 2022, at 1.08%.

High failure rates?: As we noted with the quarterly stats, we like to have at least 100 drives and over 10,000 drive days to give us some level of confidence in the AFR numbers. If we apply that metric to our lifetime data, we get the following table.

Applying our modest criteria to the list eliminated those drive models with crazy high failure rates. This is not a statistics trick; we just removed those models which did not have enough data to make the calculated AFR reliable. It is possible the drive models we removed will continue to have high failure rates. It is also just as likely their failure rates will fall into a more normal range. If this technique seems a bit blunt to you, then confidence intervals may be what you are looking for.

Confidence intervals: In general, the more data you have and the more consistent that data is, the more confident you are in the predictions based on that data. We calculate confidence intervals at 95% certainty.

For SSDs, we like to see a confidence interval of 1.0% or less between the low and the high values before we are comfortable with the calculated AFR. If we apply this metric to our lifetime SSD data we get the following table.

This doesn’t mean the failure rates for the drive models with a confidence interval greater than 1.0% are wrong; it just means we’d like to get more data to be sure.

Regardless of the technique you use, both are meant to help clarify the data presented in the tables throughout this report.

The SSD Stats Data

The data collected and analyzed for this review is available on our Drive Stats Data page. You’ll find SSD and HDD data in the same files and you’ll have to use the model number to locate the drives you want, as there is no field to designate a drive as SSD or HDD. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone—it is free.

Good luck and let us know if you find anything interesting.

The post The SSD Edition: 2023 Drive Stats Mid-Year Review appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Drive Stats Data Deep Dive: The Architecture

2023-09-07 David Winings

Post Syndicated from David Winings original https://www.backblaze.com/blog/drive-stats-data-deep-dive-the-architecture/

A decorative image displaying the words Drive Stats Data Deep Dive: The Architecture.

This year, we’re celebrating 10 years of Drive Stats—that’s 10 years of collecting the data and sharing the reports with all of you. While there’s some internal debate about who first suggested publishing the failure rates of drives, we all agree that Drive Stats has had impact well beyond our expectations. As of today, Drive Stats is still one of the only public datasets about drive usage, has been cited 150+ times by Google Scholar, and always sparks lively conversation, whether it’s at a conference, in the comments section, or in one of the quarterly Backblaze Engineering Week presentations.

This article is based on a presentation I gave during Backblaze’s internal Engineering Week, and is the result of a deep dive into managing and improving the architecture of our Drive Stats datasets. So, without further ado, let’s dive down the Drive Stats rabbit hole together.

More to Come

This article is part of a series on the nuts and bolts of Drive Stats. Up next, we’ll highlight some improvements we’ve made to the Drive Stats code, and we’ll link to them here. Stay tuned!

A “Simple” Ask

When I started at Backblaze in 2020, one of the first things I was asked to do was to “clean up Drive Stats.” It had not not been ignored per se, which is to say that things still worked, but it took forever and the teams that had worked on it previously were engaged in other projects. While we were confident that we had good data, running a report took about two and a half hours, plus lots of manual labor put in by Andy Klein to scrub and validate drives in the dataset.

On top of all that, the host on which we stored the data kept running out of space. But, each time we tried to migrate the data, something went wrong. When I started a fresh attempt at moving our dataset between hosts for this project, then ran the report, it ran for weeks (literally).

Trying to diagnose the root cause of the issue was challenging due to the amount of history surrounding the codebase. There was some code documentation, but not a ton of practical knowledge. In short, I had my work cut out for me.

Drive Stats Data Architecture

Let’s start with the origin of the data. The podstats generator runs on every Backblaze Storage Pod, what we call any host that holds customer data, every few minutes. It’s a legacy C++ program that collects SMART stats and a few other attributes, then converts them into an .xml file (“podstats”). Those are then pushed to a central host in each data center and bundled. Once the data leaves these central hosts, it has entered the domain of what we will call Drive Stats. This is a program that knows how to populate various types of data, within arbitrary time bounds based on the underlying podstats .xml files. When we run our daily reports, the lowest level of data are the raw podstats. When we run a “standard” report, it looks for the last 60 days or so of podstats. If you’re missing any part of the data, Drive Stats will download the necessary podstats .xml files.

Now let’s go into a little more detail: when you’re gathering stats about drives, you’re running a set of modules with dependencies to other modules, forming a data dependency tree. Each time a module “runs”, it takes information, modifies it, and writes it to a disk. As you run each module, the data will be transformed sequentially. And, once a quarter, we run a special module that collects all the attributes for our Drive Stats reports, collecting data all the way down the tree.

There’s a registry that catalogs each module, what their dependencies are, and their function signatures. Each module knows how its own data should be aggregated, such as per day, per day per cluster, global, data range, and so on. The “module type” will determine how the data is eventually stored on disk. Here’s a truncated diagram of the whole system, to give you an idea of what the logic looks like:

Let’s take model_hack_table as an example. This is a global module, and it’s a reference table that includes drives that might be exceptions in the data center. (So, any of the reasons Andy might identify in a report for why a drive isn’t included in our data, including testing out a new drive and so on.)

The green drive_stats module takes in the json_podstats file, references the model names of exceptions in model_hack_table, then cross references that information against all the drives that we have, and finally assigns them the serial number, brand name, and model number. At that point, it can do things like get the drive count by data center.

Similarly, pod_drives looks up the host file in our Ansible configuration to find out which Pods we have in which data centers. It then does attributions with a reference table so we know how many drives are in each data center.

This work-deduplication process saves us a lot of time overall—but it also turned out to be the root cause of our weeks-long process when we were migrating Drive Stats to our new host.

Cache Invalidation Is Always Treacherous

We have to go into slightly more detail to understand what was happening. The dependency resolution process is as follows:

Before any module can run, it checks for a dependency.
For any dependency it finds, it checks modification times.
The module has to be at least as old as the dependency, and the dependency has to be at least as old as the target data. If one of those conditions isn’t met, the data is recalculated.
Any modules that get recalculated will trigger a rebuild of the whole branch of the logic tree.

When we moved the Drive Stats data and modules, I kept the modification time of the data (using rsync) because I knew in vague terms that Drive Stats used that for its caching. However, when Ansible copied the source code during the migration, it reset the modification time of the code for all source files. Since the freshly copied source files were younger than the dependencies, that meant the entire dataset was recalculating—and that represents terabytes of raw data dating back to 2013, which took weeks.

Note that Git doesn’t preserve mod times and it doesn’t save source files, which is part of the reason this problem exists. Because the data doesn’t exist at all in Git, there’s no way to clone-while-preserving-date. Any time you do a code update or deploy, you run the risk of this same weeks-long process being triggered. However, this code has been stable for so long, tweaks to it wouldn’t invalidate the underlying base modules, and things more or less worked fine.

To add to the complication, lots of modules weren’t in their own source files. Instead, they were grouped together by function. A drive_days module might also be with a drive_days_by_model, drive_days_by_brand, drive_days_by_size, and so on, meaning that changing any of these modules would invalidate all of the other ones in the same file.

This may sound straightforward, but with all the logical dependencies in the various Drive Stats modules, you’re looking at pretty complex code. This was a poorly understood legacy system, so the invalidation logic was implemented somewhat differently for each module type, and in slightly different terms, making it a very unappealing problem to resolve.

Now to Solve

The good news is that, once identified, the solution was fairly intuitive. We decided to set an explicit version for each module, and save it to disk with the files containing its data. In Linux, there is something called an “extended attribute,” which is a small bit of space the filesystem preserves for metadata about the stored file—perfect for our uses. We now write a JSON object containing all of the dependent versions for each module. Here it is:

A snapshot of the code written for the module versions. — To you, it’s just version code pinned in Linux’s extended attributes. To me, it’s beautiful.

Now we will have two sets of versions, one stored on the files written to disk, and another set in the source code itself. So whenever a module is attempting to resolve whether or not it is out of date, it can check the versions on disk and see if they are compatible with the versions in source code. Additionally, since we are using semantic versioning, this means that we can do non-invalidating minor version bumps and still know exactly which code wrote a given file. Nice!

The one downside is that you have to manually specify to preserve extended attributes when using many Unix tools such as rsync (otherwise the version numbers don’t get copied). We chose the new default behavior in the presence of missing extended attributes to be for the module to print a warning and assume it’s current. We had a bunch of warnings the first time the system ran, but we haven’t seen them since. This way if we move the dataset and forget to preserve all the versions, we won’t invalidate the entire dataset by accident—awesome!

Wrapping It All Up

One of the coolest parts about this exploration was finding how many parts of this process still worked, and worked well. The C++ went untouched; the XML parser is still the best tool for the job; the logic of the modules and caching protocols weren’t fundamentally changed and had some excellent benefits for the system at large. We’re lucky at Backblaze that we’ve had many talented people work on our code over the years. Cheers to institutional knowledge.

That’s even more impressive when you think of how Drive Stats started—it was a somewhat off-the-cuff request. “Wouldn’t it be nice if we could monitor what these different drives are doing?” Of course, we knew it would have a positive impact on how we could monitor, use, and buy drives internally, but sharing that information is really what showed us how powerful this information could be for the industry and our community. These days we monitor more than 240,000 drives and have over 21.1 million days of data.

This journey isn’t over, by the way—stay tuned for parts two and three where we talk about improvements we made and some future plans we have for Drive Stats data. As always, feel free to sound off in the comments.

The post Drive Stats Data Deep Dive: The Architecture appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q2 2023

2023-08-03 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2023/

A decorative image with title Q2 2023 Drive Stats.

At the end of Q2 2023, Backblaze was monitoring 245,757 hard drives and SSDs in our data centers around the world. Of that number, 4,460 are boot drives, with 3,144 being SSDs and 1,316 being HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2022 Drive Stats review.

Today, we’ll focus on the 241,297 data drives under management as we review their quarterly and lifetime failure rates as of the end of Q2 2023. Along the way, we’ll share our observations and insights on the data presented, tell you about some additional data fields we are now including and more.

Q2 2023 Hard Drive Failure Rates

At the end of Q2 2023, we were managing 241,297 hard drives used to store data. For our review, we removed 357 drives from consideration as they were used for testing purposes or drive models which did not have at least 60 drives. This leaves us with 240,940 hard drives grouped into 31 different models. The table below reviews the annualized failure rate (AFR) for those drive models for Q2 2023.

Notes and Observations on the Q2 2023 Drive Stats

Zero Failures: There were six drive models with zero failures in Q2 2023 as shown in the table below.

The table is sorted by the number of drive days each model accumulated during the quarter. In general a drive model should have at least 50,000 drive days in the quarter to be statistically relevant. The top three drives all meet that criteria, and having zero failures in a quarter is not surprising given the lifetime AFR for the three drives ranges from 0.13% to 0.45%. None of the bottom three drives has accumulated 50,000 drive days in the quarter, but the two Seagate drives are off to a good start. And, it is always good to see the 4TB Toshiba (model: MD04ABA400V), with eight plus years of service, post zero failures for the quarter.

The Oldest Drive? The drive model with the oldest average age is still the 6TB Seagate (model: ST6000DX000) at 98.3 months (8.2 years), with the oldest drive of this cohort being 104 months (8.7 years) old.
The oldest operational data drive in the fleet is a 4TB Seagate (model: ST4000DM000) at 105.2 months (8.8 years). That is quite impressive, especially in a data center environment, but the winner for the oldest operational drive in our fleet is actually a boot drive: a WDC 500GB drive (model: WD5000BPKT) with 122 months (10.2 years) of continuous service.
Upward AFR: The AFR for Q2 2023 was 2.28%, up from 1.54% in Q1 2023. While quarterly AFR numbers can be volatile, they can also be useful in identifying trends which need further investigation. In this case, the rise was expected as the age of our fleet continues to increase. But was that the real reason?
Digging in, we start with the annualized failure rates and average age of our drives grouped by drive size, as shown in the table below.

For our purpose, we’ll define a drive as old when it is five years old or more. Why? That’s the warranty period of the drives we are purchasing today. Of course, the 4TB and 6TB drives, and some of the 8TB drives, came with only two year warranties, but for consistency we’ll stick with five years as the point at which we label a drive as “old”.

Using our definition for old drives eliminates the 12TB, 14TB and 16TB drives. This leaves us with the chart below of the Quarterly AFR over the last three years for each cohort of older drives, the 4TB, 6TB, 8TB, and 10TB models.

Interestingly, the oldest drives, the 4TB and 6TB drives, are holding their own. Yes, there has been an increase over the last year or so, but given their age, they are doing well.

On the other hand, the 8TB and 10TB drives, with an average of five and six years of service respectively, require further attention. We’ll look at the lifetime data later on in this report to see if our conclusions are justified.

What’s New in the Drive Stats Data?

For the past 10 years, we’ve been capturing and storing the drive stats data and since 2015 we’ve open sourced the data files that we used to create the Drive Stats reports. From time to time, new SMART attribute pairs have been added to the schema as we install new drive models which report new sets of SMART attributes. This quarter we decided to capture and store some additional data fields about the drives and the environment they operate in, and we’ve added them to the publicly available Drive Stats files that we publish each quarter.

The New Data Fields

Beginning with the Q2 2023 Drive Stats data, there are three new data fields populated in each drive record.

Vault_id: All data drives are members of a Backblaze Vault. Each vault consists of either 900 or 1,200 hard drives divided evenly across 20 storage servers. The vault is a numeric value starting at 1,000.
Pod_id: There are 20 storage servers in each Backblaze Vault. The Pod_id is a numeric field with values from 0 to 19 assigned to one of the 20 storage servers.
Is_legacy_format: Currently 0, but will be useful over the coming quarters as more fields are added.

The new schema is as follows:

date
serial_number
model
capacity_bytes
failure
vault_id
pod_id
is_legacy_format
smart_1_normalized
smart_1_raw
Remaining SMART value pairs (as reported by each drive model)

Occasionally, our readers would ask if we had any additional information we could provide with regards to where a drive lived, and, more importantly, where it died. The newly-added data fields above are part of the internal drive data we collect each day, but they were not included in the Drive Stats data that we use to create the Drive Stats reports. With the help of David from our Infrastructure Software team, these fields will now be available in the Drive Stats data.

How Can We Use the Vault and Pod Information?

First a caveat: We have exactly one quarter’s worth of this new data. While it was tempting to create charts and tables, we want to see a couple of quarters worth of data to understand it better. Look for an initial analysis later on in the year.

That said, what this data gives us is the storage server and the vault of every drive. Working backwards, we should be able to ask questions like: “Are certain storage servers more prone to drive failure?” or, “Do certain drive models work better or worse in certain storage servers?” In addition, we hope to add data elements like storage server type and data center to the mix in order to provide additional insights into our multi-exabyte cloud storage platform.

Over the years, we have leveraged our Drive Stats data internally to improve our operational efficiency and durability. Providing these new data elements to everyone via our Drive Stats reports and data downloads is just the right thing to do.

There’s a New Drive in Town

If you do decide to download our Drive Stats data for Q2 2023, there’s a surprise inside—a new drive model. There are only four of these drives, so they’d be easy to miss, and they are not listed on any of the tables and charts we publish as they are considered “test” drives at the moment. But, if you are looking at the data, search for model “WDC WUH722222ALE6L4” and you’ll find our newly installed 22TB WDC drives. They went into testing in late Q2 and are being put through their paces as we speak. Stay tuned. (Psst, as of 7/28, none had failed.)

Lifetime Hard Drive Failure Rates

As of June 30, 2023, we were tracking 241,297 hard drives used to store customer data. For our lifetime analysis, we removed 357 drives that were only used for testing purposes or did not have at least 60 drives represented in the full dataset. This leaves us with 240,940 hard drives grouped into 31 different models to analyze for the lifetime table below.

Notes and Observations About the Lifetime Stats

The Lifetime AFR also rises. The lifetime annualized failure rate for all the drives listed above is 1.45%. That is an increase of 0.05% from the previous quarter of 1.40%. Earlier in this report by examining the Q2 2023 data, we identified the 8TB and 10TB drives as primary suspects in the increasing rate. Let’s see if we can confirm that by examining the change in the lifetime AFR rates of the different drives grouped by size.

The red line is our baseline as it is the difference from Q1 to Q2 (0.05%) of the lifetime AFR for all drives. Drives above the red line support the increase, drives below the line subtract from the increase. The primary drives (by size) which are “driving” the increased lifetime annualized failure rate are the 8TB and 10TB drives. This confirms what we found earlier. Given there are relatively few 10TB drives (1,124) versus 8TB drives (24,891), let’s dig deeper into the 8TB drives models.

The Lifetime AFR for all 8TB drives jumped from 1.42% in Q1 to 1.59% in Q2. An increase of 12%. There are six 8TB drive models in operation, but three of these models comprise 99.5% of the drive failures for the 8TB drive cohort, so we’ll focus on them. They are listed below.

For all three models, the increase of the lifetime annualized failure rate from Q1 to Q2 is 10% or more which is statistically similar to the 12% increase for all of the 8TB drive models. If you had to select one drive model to focus on for migration, any of the three would be a good candidate. But, the Seagate drives, model ST8000DM002, are on average nearly a year older than the other drive models in question.

Not quite a lifetime? The table above analyzes data for the period of April 20, 2013 through June 30, 2023, or 10 years, 2 months and 10 days. As noted earlier, the oldest drive we have is 10 years and 2 months old, give or take a day or two. It would seem we need to change our table header, but not quite yet. A drive that was installed anytime in Q2 2013 and is still operational today would report drive days as part of the lifetime data for that model. Once all the drives installed in Q2 2013 are gone, we can change the start date on our tables and charts accordingly.

A Word About Drive Failure

Are we worried about the increase in drive failure rates? Of course we’d like to see them lower, but the inescapable reality of the cloud storage business is that drives fail. Over the years, we have seen a wide range of failure rates across different manufacturers, drive models, and drive sizes. If you are not prepared for that, you will fail. As part of our preparation, we use our drive stats data as one of the many inputs into understanding our environment so we can adjust when and as we need.

So, are we worried about the increase in drive failure rates? No, but we are not arrogant either. We’ll continue to monitor our systems, take action where needed, and share what we can with you along the way.

The Hard Drive Stats Data

If you want the tables and charts used in this report, you can download the .zip file from Backblaze B2 Cloud Storage which contains an MS Excel spreadsheet with a tab for each of the tables or charts..

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q2 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q1 2023

2023-05-04

Post Syndicated from original https://www.backblaze.com/blog/backblaze-drive-stats-for-q1-2023/

A long time ago in a galaxy far, far away, we started collecting and storing Drive Stats data. More precisely it was 10 years ago, and the galaxy was just Northern California, although it has expanded since then (as galaxies are known to do). During the last 10 years, a lot has happened with the where, when, and how of our Drive Stats data, but regardless, the Q1 2023 drive stats data is ready, so let’s get started.

As of the end of Q1 2023, Backblaze was monitoring 241,678 hard drives (HDDs) and solid state drives (SSDs) in our data centers around the world. Of that number, 4,400 are boot drives, with 3,038 SSDs and 1,362 HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2022 Drive Stats review.

Today, we’ll focus on the 237,278 data drives under management as we review their quarterly and lifetime failure rates as of the end of Q1 2023. We also dig into the topic of average age of failed hard drives by drive size, model, and more. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Q1 2023 Hard Drive Failure Rates

Let’s start with reviewing our data for the Q1 2023 period. In that quarter, we tracked 237,278 hard drives used to store customer data. For our evaluation, we removed 385 drives from consideration as they were used for testing purposes or were drive models which did not have at least 60 drives. This leaves us with 236,893 hard drives grouped into 30 different models to analyze.

Notes and Observations on the Q1 2023 Drive Stats

Upward AFR: The annualized failure rate (AFR) for Q1 2023 was 1.54%, that’s up from Q4 2022 at 1.21% and from one year ago, Q1 2022, at 1.22%. Quarterly AFR numbers can be volatile, but can be useful in identifying a trend which needs further investigation. For example, three drives in Q1 2023 (listed below) more than doubled their individual AFR from Q4 2022 to Q1 2023. As a consequence, further review (or in some cases continued review) of these drives is warranted.

Zeroes and ones: The table below shows those drive models with either zero or one drive failure in Q1 2023.

When reviewing the table, any drive model with less than 50,000 drive days for the quarter does not have enough data to be statistically relevant for that period. That said, for two of the drive models listed, posting zero failures is not new. The 16TB Seagate (model: ST16000NM002J) had zero failures last quarter as well, and the 8TB Seagate (model: ST8000NM000A) has had zero failures since it was first installed in Q3 2022, a lifetime AFR of 0%.

A new, but not so new drive model: There is one new drive model in Q1 2023, the 8TB Toshiba (model: HDWF180). Actually, it is not new, it’s just that we now have 60 drives in production this quarter, so it makes the charts. This model has actually been in production since Q1 2022, starting with 18 drives and adding more drives over time. Why? This drive model is replacing some of the 187 failed 8TB drives this quarter. We have stockpiles of various sized drives we keep on hand for just this reason.

Q1 2023 Annualized Failures Rates by Drive Size and Manufacturer

The charts below summarize the Q1 2023 data first by Drive Size and then by manufacturer.

While we included all of the drive sizes we currently use, both the 6TB and 10TB drive sizes consist of one model for each and each has a limited number of drive days in the quarter: 79,651 for the 6TB drives and 105,443 for the 10TB drives. Each of the remaining drive sizes has at least 2.2 million drive days, making their quarterly annualized failure rates more reliable.

This chart combines all of the manufacturer’s drive models regardless of their age. In our case, many of the older drive models are from Seagate and that helps drive up their overall AFR. For example, 60% of the 4TB drives are from Seagate and are, on average, 89 months old, and over 95% of the 8TB drives in production are from Seagate and they are, on average, over 70 months old. As we’ve seen when we examined hard drive life expectancy using the Bathtub Curve, older drives have a tendency to fail more often.

That said, there are outliers out there like our intrepid fleet of 6TB Seagate drives which have an average age of 95.4 months and have a Q1 2023 AFR of 0.92% and a lifetime AFR of 0.89% as we’ll see later in this report.

The Average Age of Drive Failure

Recently the folks at Blocks & Files published an article outlining the average age of a hard drive when it failed. The article was based on the work of Timothy Burlee at Secure Data Recovery. To summarize, the article found that for the 2,007 failed hard drives analyzed, the average age at which they failed was 1,051 days, or two years and 10 months. We thought this was an interesting way to look at drive failure, and we wanted to know what we would find if we asked the same question of our Drive Stats data. They also determined the current pending sector count for each failed drive, but today we’ll focus on the average age of drive failure.

Getting Started

The article didn’t specify how they collected the amount of time a drive was operational before it failed but we’ll assume they used the SMART 9 raw value for power-on hours. Given that, our first task was to round up all of the failed drives in our dataset and record the power-on hours for each drive. That query produced a list of 18,605 drives which failed between April 10, 2013 and March 30, 2023, inclusive.

For each failed drive we recorded the date, serial_number, model, drive_capacity, failure, and SMART 9 raw value. A sample is below.

To start the data cleanup process, we first removed 1,355 failed boot drives from the dataset, leaving us with 17,250 data drives.

We then removed 95 drives for one of the following reasons:

The failed drive had no data recorded or a zero in the SMART 9 raw attribute.
The failed drive had out of bounds data in one or more fields. For example, the capacity_bytes field was negative or the model was corrupt, that is unknown or unintelligible.

In both of these cases, the drives in question were not in a good state when the data was collected and as such any other data collected could be unreliable.

We are left with 17,155 failed drives to analyze. When we compute the average age at which this cohort of drives failed we get 22,360 hours, which is 932 days, or just over two years and six months. This is reasonably close to the two years and 10 months from the Blocks & Files article, but before we confirm their numbers let’s dig into our results a bit more.

Average Age of Drive Failure by Model and Size

Our Drive Stats dataset contains drive failures for 72 drive models, and that number does not include boot drives. To make our table a bit more manageable we’ve limited the list to those drive models which have recorded 50 or more failures. The resulting list contains 30 models which we’ve sorted by average failure age:

As one would expect, there are drive models above and below our overall failure average age of two years and six months. One observation is that the average failure age of many of the smaller sized drive models (1TB, 1.5TB, 2TB, etc.) is higher than our overall average of two years and six months. Conversely, for many larger sized drive models (12TB, 14TB, etc.) the average failure age was below the average. Before we reach any conclusions, let’s see what happens if we review the average failure age by drive size as shown below.

This chart seems to confirm the general trend that the average failure age of smaller drive models is higher than larger drive models.

At this point you might start pondering whether technologies in larger drives such as the additional platters, increased areal density, or even the use of helium would impact the average failure age of these drives. But as the unflappable Admiral Ackbar would say:

“It’s a Trap”

The trap is that the dataset for the smaller sized drive models is, in our case, complete—there are no more 1TB, 1.5TB, 2TB, 3TB, or even 5TB drives in operation in our dataset. On the contrary, most of the larger sized drive models are still in operation and therefore they “haven’t finished failing yet.” In other words, as these larger drives continue to fail over the coming months and years, they could increase or decrease the average failure age of that drive model.

A New Hope

One way to move forward at this point is to limit our computations to only those drive models which are no longer in operation in our data centers. When we do this, we find we have 35 drive models consisting of 3,379 drives that have a failed average age of two years and seven months.

Trap or not, our results are consistent with the Blocks & Files article as their failed average age of two years and 10 months for their dataset. It will be interesting to see how this comparison holds up over time as more drive models in our dataset finish their Backblaze operational life.

The second way to look at drive failure is to view the problem from the life expectancy point of view instead. This approach takes a page from bioscience and utilizes Kaplan-Meier techniques to produce life expectancy (aka survival) curves for different cohorts, in our case hard drive models. We used such curves previously in our Hard Drive Life Expectancy and Bathtub Curve blog posts. This approach allows us to see the failure rate over time and helps answer questions such as, “If I bought a drive today, what are the chances it will survive x years?”

Let’s Recap

We have three different, but similar, values for average failure age of hard drives, and they are as follows:

Source	Failed Drive Count	Average Failed Age
Secure Data Recovery	2,007 failed drives	2 years, 10 months
Backblaze	17,155 failed drives (all models)	2 years, 6 months
Backblaze	3,379 failed drives (only drive models no longer in production)	2 years, 7 months

When we first saw the Secure Data Recovery average failed age we thought that two years and 10 months was too low. We were surprised by what our data told us, but a little math never hurt anyone. Given we are always adding additional failed drives to our dataset, and retiring drive models along the way, we will continue to track the average failed age of our drive models and report back if we find anything interesting.

Lifetime Hard Drive Failure Rates

As of March 31, 2023, we were tracking 237,278 hard drives. For our lifetime analysis, we removed 385 drives that were only used for testing purposes or did not have at least 60 drives. This leaves us with 236,893 hard drives grouped into 30 different models to analyze for the lifetime table below.

Notes and Observations About the Lifetime Stats

The lifetime AFR for all the drives listed above is 1.40%. That is a slight increase from the previous quarter of 1.39%. The lifetime AFR number for all of our hard drives seems to have settled around 1.40%, although each drive model has its own unique AFR value.

For the past 10 years we’ve been capturing and storing the Drive Stats data which is the source of the lifetime AFRs listed in the table above. But, why keep track of the data at all? Well, besides creating this report each quarter, we use the data internally to help run our business. While there are many other factors which go into the decisions we make, the Drive Stats data helps to surface potential issues sooner, allows us to take better informed drive related actions, and overall adds a layer of confidence in the drive-based decisions we make.

The Hard Drive Stats Data

The complete dataset used to create the information used in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

If you want the tables and charts used in this report, you can download the .zip file from Backblaze B2 Cloud Storage which contains an Excel file with a tab for each table or chart.

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q1 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

10 Stories from 10 Years of Drive Stats Data

2023-04-10

Post Syndicated from original https://www.backblaze.com/blog/10-stories-from-10-years-of-drive-stats-data/

On April 10, 2013, Backblaze saved our first daily hard drive snapshot file. We had decided to start saving these daily snapshots to improve our understanding of the burgeoning collection of hard drives we were using to store customer data. That was the beginning of the Backblaze Drive Stats reports that we know today.

Little did we know at the time that we’d be collecting the data for the next 10 years or writing various Drive Stats reports that are read by millions, but here we are.

I’ve been at Backblaze longer than Drive Stats and probably know the drive stats data and history better than most, so let’s spend the next few minutes getting beyond the quarterly and lifetime tables and charts and I’ll tell you some stories from behind the scenes of Drive Stats over the past 10 years.

1. The Drive Stats Light Bulb Moment

I have never been able to confirm whose idea it was to start saving the Drive Stats data. The two Brians—founder Brian Wilson, our CTO before he retired and engineer Brian Beach, our current CTO—take turns eating humble pie and giving each other credit for this grand experiment.

But, beyond the idea, one Brian or the other also had to make it happen. Someone had to write the Python scripts to capture and process the data, and then deploy these scripts across our fleet of shiny red Storage Pods and other storage servers, and finally someone also had to find a place to store all this newly captured data. My money’s on—to paraphrase Mr. Edison—founder Brian being the 1% that is inspiration, and engineer Brian being the 99% that is perspiration. The split could be 90/10 or even 80/20, but that’s how I think it went down.

2. The Experiment Begins

In April 2013, our Drive Stats data collection experiment began. We would collect and save basic drive information, including the SMART statistics for each drive, each day. The effort was more than a skunkworks project, but certainly not a full-fledged engineering project. Conducting such experiments has been part of our DNA since we started and we continue today, albeit with a little more planning and documentation. Still the basic process—try something, evaluate it, tweak it, and try again—still applies, and over the years, such experiments have led to the development of our Storage Pods and our Drive Farming efforts.

Our initial goal in collecting the Drive Stats data was to determine if it would help us better understand the failure rates of the hard drives we were using to store data. Questions that were top of mind included: Which drive models lasted longer? Which SMART attributes really foretold drive health? What is the failure rate of different models? And so on. The answers, we hoped, would help us make better purchasing and drive deployment decisions.

3. Where “Drive Days” Came From

To compute a failure rate of a given group of drives over a given time period, you might start with two pieces of data: the number of drives, and the number of drive failures over that period of time. So, if over the last year, you had 10 drives and one failed, you could say the 10% failure rate for the year. That works for static systems, but data centers are quite different. On a daily basis, drives enter and leave the system. There are new drives, failed drives, migrated drives, and so on. In other words, the number of drives is probably not consistent across a given time period. To address this issue, CTO Brian (current CTO Brian that is) worked with professors from UC Santa Cruz on the problem and the idea of Drive Days was born. A drive day is one drive in operation for one day, so one drive in operation for ten days is ten drive days.

To see this in action you start by defining the cohort of drives and the time period you want and then apply the following formula to get the Annualized Failure Rate (AFR).

AFR = ( Drive Failures / ( Drive Days / 365 ) )

This simple calculation allows you to compute an Annualized Failure Rate for any cohort of drives over any period of time and accounts for a variable number of drives over that period.

4. Wait! There’s No Beginning?

In testing out our elegantly simple AFR formula, we discovered a problem. Not with the formula, but with the data. We started collecting data on April 10, 2013, but many of the drives were present before then. If we wanted to compute the AFR of model XYZ for 2013, we could not count the number of drive days those drives had prior to April 10—there were none.

Never fear, SMART 9 raw value to the rescue. For the uninitiated, the SMART 9 raw value contains the number of power-on hours for a drive. A little math gets you the number of days—that is Drive Days—and you are ready to go. This little workaround was employed whenever we needed to work with drives that came into service before we started collecting data.

Why not use SMART 9 all of the time? A couple of reasons. First, sometimes the value gets corrupted. Especially when the drive is failing, it could be zero or a million or anywhere in between. Second, a new drive can have non-default SMART values. Perhaps it is just part of the burn in process or a test group at the manufacturer, or maybe the drive was a return that passed some qualification process.

Regardless, the starting value of SMART 9 wasn’t consistent across drives, so we just counted operational days in our environment and used SMART 9 as a substitute only when we couldn’t count those days. Using SMART 9 is moot now as these days there are no drives left in the current drive collection which were present prior to April 2013.

5. There’s Gold In That There Data

While the primary objective of collecting the data was to improve our operations, there was always another potential use lurking about—to write a blog post, or two, or 56. Yes, we’ve written 56 blog posts and counting based on our Drive Stats data. And no, we could have never imagined that would be the case when this all started back in 2013.

The very first Drive Stats-related blog post was written by Brian Beach (current CTO Brian, former engineer Brian) in November 2013 (we’ve updated it since then). The post had the audacious title of “How Long Do Disk Drives Last?” and a matching URL of “www.backblaze.com/blog/how-long-do-disk-drives-last/”. Besides our usual blog readers, search engines were falling all over themselves referring new readers to the site based on searches for variants of the title and the post became first page search material for multiple years. Alas, all Google things must come to an end, as the post disappeared into page two and then the oblivion beyond.

Buoyed by the success of the first post, Brian went on to write several additional posts over the next year or so based on the Drive Stats data.

December 2013: Enterprise Drives: Fact or Fiction?
January 2014: What Hard Drive Should I Buy?
May 2014: Hard Drive Temperature—Does It Matter?
September 2014: Hard Drive Reliability Update: September 2014
November 2014: Hard Drive SMART Stats
January 2015: What Is the Best Hard Drive?
February 2015: Reliability Data Set for 41,000 Hard Drives Now Open-Source

That’s an impressive body of work, but Brian is, by head and heart, an engineer, and writing blog posts meant he wasn’t writing code. So after his post to open source the Drive Stats data in February 2015, he passed the reins of this nascent franchise over to me.

6. What’s in a Name?

When writing about drive failure rates, Brian used the term “Hard Drive Reliability” in his posts. When I took over, beginning with the Q1 2015 report, we morphed the term slightly to “Hard Drive Reliability Stats.” That term lasted through 2015 and in Q1 2016 it was shortened to “Hard Drive Stats.” I’d like to tell you there was a great deal of contemplation and angst that went into the decision, but the truth is the title of the Q1 2016 post “One Billion Drive Hours and Counting: Q1 2016 Hard Drive Stats,” was really long and we left out the word reliability so it wouldn’t be any longer—something about title length, the URL, search terms, and so on. The abbreviated version stuck and to this day we publish “Hard Drive Stats” reports. That said, we often shorten the term even more to just “Drive Stats,” which is technically more correct given we have solid state drives (SSDs), not just hard disk drives (HDDs), in the dataset when we talk about boot drives.

7. Boot Drives

Beginning in Q4 2013, we began collecting and storing failure and SMART stats data from some of the boot drives that we use on our storage servers in the Drive Stats data set. Over the first half of 2014, additional boot drive models were configured to report their data and by Q3 2014, all boot drives were reporting. Now the Drive Stats dataset contained both data from the data drives and the boot drives of our storage servers. There was one problem: there was no field for drive source. In other words, to distinguish a data drive from a boot drive, you needed to use the drive model.

In Q4 2018, we began using SSDs as boot drives and began collecting and storing drive stats data from the SSDs as well. Guess what? There was no drive type field either, so SSD and HDD boot drives had to be distinguished by their model numbers. Our engineering folks are really busy on product and platform features and functionality, so we use some quick-and-dirty SQL on the post-processing side to add the missing information.

The boot drive data sat quietly in the Drive Stats dataset for the next few years until Q3 2021 when we asked the question “Are SSDs Really More Reliable Than Hard Drives?” That’s the first time the boot drive data was used. In this case, we compared the failure rates of SSDs and HDDs over time. As the number of boot drive SSDs increased, we started publishing a semi-annual report focused on just the failure rates for the SSD boot drives.

8. More Drives = More Data

On April 10, 2013, data was collected for 21,195 hard drives. The .csv data file for that day was 3.2MB. The numbers of drives and the amount of data has grown just a wee bit since then, as you can see in the following charts.

The current size of a daily Drive Stats .csv file is over 87MB. If you downloaded the entire Drive Stats dataset, you would need 113GB of storage available once you unzipped all the data files. If you are so inclined, you’ll find the data on our Drive Stats page. Once there, open the “Downloading the Raw HD Test Data” link to see a complete list of the files available.

9. Who Uses The Drive Stats Dataset?

Over the years, the Drive Stats dataset has been used in multiple ways for different reasons. Using Google Scholar, you can currently find 660 citations for the term “Backblaze hard drive stats” going back to 2014. This includes 18 review articles. Here are a couple of different ways the data has been used.

- - As a teaching tool: Several universities and similar groups have used the dataset as part of their computer science, data analytics, or statistics classes. The dataset is somewhat large, but it’s still manageable, and can be divided into yearly increments if needed. In addition, it is reasonably standardized, but not perfect, providing a good data cleansing challenge. The different drive models and variable number of drive counts allows students to practice data segmentation across the various statistical methods they are studying.
  - For artificial intelligence (AI) and machine learning: Over the years several studies have been conducted using AI and machine learning techniques applied to the Drive Stats data to determine if drive failure or drive health is predictable. We looked at one method from Interpretable on our blog, but there are several others. The results have varied, but the general conclusion is that while you can predict drive failure to some degree, the results seem to be limited to a given drive model.

10. Drive Stats Experiments at Backblaze

Of course, we also use the Drive Stats data internally at Backblaze to inform our operations and run our own experiments. Here are a couple examples:

- - Inside Backblaze: Part of the process in developing and productizing the Backblaze Storage Pod was the development of the software to manage the system itself. Almost from day one, we used certain SMART stats to help determine if a drive was not feeling well. In practice, other triggers such as ATA errors or FSCKs alerts, will often provide the first indicator of a problem. We then apply the historical and current SMART stats data that we have recorded and stored to complete the analysis. For example, we receive an ATA error on a given drive. There could be several non-drive reasons for such an error, but we can quickly determine that the drive has a history of increasing bad media and command timeouts values over time. Taken together, it could be time to replace that drive.
  - Trying new things: The Backblaze Evangelism team decided that SQL was too slow when accessing the Drive Stats data. They decided to see if they could use a combination of Parquet and Trino to make the process faster. Once they had done that, they went to work duplicating some of the standard queries we run each quarter in producing our Drive Stats Reports.

What Lies Ahead

First, thank you for reading and commenting on our various Drive Stats Reports over the years. You’ve made us better and we appreciate your comments—all of them. Not everyone likes the data or the reports, and that’s fine, but most people find the data interesting and occasionally useful. We publish the data as a service to the community at large, and we’re glad many people have found it helpful, especially when it can be used in teaching people how to test, challenge, and comprehend data—a very useful skill in navigating today’s noise versus knowledge environment.

We will continue to gather and publish the Drive Stats dataset each quarter for as long as it is practical and useful to our readers. That said, I can’t imagine we’ll be writing Drive Stats reports 10 years from now, but just in case, if anyone is interested in taking over, just let me know.

The post 10 Stories from 10 Years of Drive Stats Data appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The SSD Edition: 2022 Drive Stats Review

2023-03-09

Post Syndicated from original https://www.backblaze.com/blog/ssd-edition-2022-drive-stats-review/

A decorative image displaying the article title 2022 Annual Report Drive Stats SSD Edition.

Welcome to the 2022 SSD Edition of the Backblaze Drive Stats series. The SSD Edition focuses on the solid state drives (SSDs) we use as boot drives for the data storage servers in our cloud storage platform. This is opposed to our traditional Drive Stats reports which focus on our hard disk drives (HDDs) used to store customer data.

We started using SSDs as boot drives beginning in Q4 of 2018. Since that time, all new storage servers and any with failed HDD boot drives have had SSDs installed. Boot drives in our environment do much more than boot the storage servers. Each day they also read, write, and delete log files and temporary files produced by the storage server itself. The workload is similar across all the SSDs included in this report.

In this report, we look at the failure rates of the SSDs that we use in our storage servers for 2022, for the last 3 years, and for the lifetime of the SSDs. In addition, we take our first look at the temperature of our SSDs for 2022, and we compare SSD and HDD temperatures to see if SSDs really do run cooler.

Overview

As of December 31, 2022, there were 2,906 SSDs being used as boot drives in our storage servers. There were 13 different models in use, most of which are considered consumer grade SSDs, and we’ll touch on why we use consumer grade SSDs a little later. In this report, we’ll show the Annualized Failure Rate (AFR) for these drive models over various periods of time, making observations and providing caveats to help interpret the data presented.

The dataset on which this report is based is available for download on our Drive Stats Test Data webpage. The SSD data is combined with the HDD data in the same files. Unfortunately, the data itself does not distinguish between SSD and HDD drive types, so you have to use the model field to make that distinction. If you are just looking for SSD data, start with Q4 2018 and go forward.

2022 Annual SSD Failure Rates

As noted, at the end of 2022, there were 2,906 SSDs in operation in our storage servers. The table below shows data for 2022. Later on we’ll compare the 2022 data to previous years.

Observations and Caveats

For 2022, seven of the 13 drive models had no failures. Six of the seven models had a limited number of drive days—less than 10,000—meaning that there is not enough data to make a reliable projection about the failure rates of those drive models.
The Dell SSD (model: DELLBOSS VD) has zero failures for 2022 and has over 100,000 drive days for the year. The resulting AFR is excellent, but this is an M.2 SSD mounted on a PCIe card (half-length and half-height form factor) meant for server deployments, and as such it may not be generally available. By the way, BOSS stands for Boot Optimized Storage Solution.
Besides the Dell SSD, three other drive models have over 100,000 drive days for the year, so there is sufficient data to consider their failure rates. Of the three, the Seagate (model: ZA250CM10003, aka: Seagate BarraCuda 120 SSD ZA250CM10003) has the lowest AFR at 0.73%, with the Crucial (model: CT250MX500SSD1) coming in next with an AFR of 1.04% and finally, the Seagate (model: ZA250CM10002, aka: Seagate BarraCuda SSD ZA250CM10002) delivers an AFR of 1.98% for 2022.

Annual SSD Failure Rates for 2020, 2021, and 2022

The 2022 annual chart above presents data for events that occurred in just 2022. Below we compare the 2022 annual data to the 2020 and 2021 (respectively) annual data where the data for each year represents just the events which occurred during that period.

Observations and Caveats

As expected, the Crucial drives (model: CT250MX500SSD1) recovered nicely in 2022 after having a couple of early failures in 2021. We expect that trend to continue.
Four new models were introduced in 2022, although none have a sufficient number of drive days to discern any patterns even though none of the four models have experienced a failure as of the end of 2022.
Two of the 250GB Seagate drives have been around all three years, but they are going in different directions. The Seagate drive (model: ZA250CM10003) has delivered a sub-1% AFR over all three years. While the AFR for the Seagate drive (model: ZA250CM10002) slipped in 2022 to nearly 2%. Model ZA250CM10003 is the newer model of the two by about a year. There is little difference otherwise except the ZA250CM10003 uses less idle power, 116mW versus 185mW for the ZA250CM10002. It will be interesting to see how the younger model fares over the next year. Will it follow the trend of its older sibling and start failing more often, or will it chart its own course?

SSD Temperature and AFR: A First Look

Before we jump into the lifetime SSD failure rates, let’s talk about SSD SMART stats. Here at Backblaze, we’ve been wrestling with SSD SMART stats for several months now, and one thing we have found is there is not much consistency on the attributes, or even the naming, SSD manufacturers use to record their various SMART data. For example, terms like wear leveling, endurance, lifetime used, life used, LBAs written, LBAs read, and so on are used inconsistently between manufacturers, often using different SMART attributes, and sometimes they are not recorded at all.

One SMART attribute that does appear to be consistent (almost) is drive temperature. SMART 194 (raw value) records the internal temperature of the SSD in degrees Celsius. We say almost, because the Dell SSD (model: DELLBOSS VD) does not report raw or normalized values for SMART 194. The chart below shows the monthly average temperature for the remaining SSDs in service during 2022.

Observations and Caveats

There were an average of 67,724 observations per month, ranging from 57,015 in February to 77,174 in December. For 2022, the average temperature varied only one degree Celsius from the low of 34.4 degrees Celsius to the high of 35.4 degrees Celsius over the period.
For 2022, the average temperature was 34.9 degrees Celsius. The average temperature of the hard drives in the same storage servers over the same period was 29.1 degrees Celsius. This difference seems to fly in the face of conventional wisdom that says SSDs run cooler than HDDs. One possible reason is that, in all of our storage servers, the boot drives are further away from the cool aisle than the data drives. That is, the data drives get the cool air first. If you have any thoughts, let us know in the comments.
The temperature variation across all drives for 2022 ranged from 20 degrees Celsius (four observations) to 61 degrees Celsius (one observation). The chart below shows the observations for the SSD’s across that temperature range.

The shape of the curve should look familiar: it’s a bell curve. We’ve seen the same type of curve when plotting the temperature observations of the storage server hard drives. The SSD curve is for all operational SSD drives, except the Dell SSDs. We attempted to plot the same curve for the failed SSDs, but with only 25 failures in 2022, the curve was nonsense.

Lifetime SSD Failure Rates

The lifetime failure rates are based on data from the entire time the given drive model has been in service in our system. This data goes back as far as Q4 2018, although most of the drives were put in service in the last three years. The table below shows the lifetime AFR for all of the SSD drive models in service as of the end of 2022.

Observations and Caveats

The overall Lifetime AFR was 0.89% as of the end of 2022. This is lower than the Lifetime AFR 1.04% as of the end of 2021.
There are several very large confidence intervals. That is due to the limited amount of data (drive days) for those drive models. For example, there are only 104 drive days for the WDC model WD Blue SA510 2.5. As we accumulate more data, those confidence intervals should become more accurate.
We like to see a confidence interval of 1.0% or less for a given drive model. Only three drive models met this criteria:
- Dell model DELLBOSS VD: lifetime AFR–0.00%
- Seagate model ZA250CM10003: lifetime AFR–0.66%
- Seagate model ZA250CM10002: lifetime AFR–0.96%
The Dell SSD, as noted earlier in this report, is an M.2 SSD mounted on a PCIe card and may not be generally available. The two Seagate drives are consumer level SSDs. In our case, a less expensive consumer level SSD works for our needs as there is no customer data on a boot drive, just boot files as well as log and temporary files. More recently as we have purchased storage servers from Supermicro and Dell, they bundle all of the components together into a unit price per storage server. If that bundle includes enterprise class SSDs or an M.2 SSD on a PCIe card, that’s fine with us.

The SSD Stats Data

We acknowledge that 2,906 SSDs is a relatively small number of drives on which to perform our analysis, and while this number does lead to wider than desired confidence intervals, it’s a start. Of course we will continue to add SSD boot drives to the study group, which will improve the fidelity of the data presented. In the meantime, we expect our readers will apply their usual skeptical lens to the data presented and use it accordingly.

The complete dataset used to create the information used in this review is available on our Hard Drive Test Data page. As noted earlier you’ll find SSD and HDD data in the same files, and you’ll have to use the model number to distinguish one record from another. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

Good luck, and let us know if you find anything interesting.

The post The SSD Edition: 2022 Drive Stats Review appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for 2022

2023-01-31

Post Syndicated from original https://www.backblaze.com/blog/backblaze-drive-stats-for-2022/

As of December 31, 2022, we had 235,608 drives under management. Of that number, there were 4,299 boot drives and 231,309 data drives. This report will focus on our data drives. We’ll review the hard drive failure rates for 2022, compare those rates to previous years, and present the lifetime failure statistics for all the hard drive models active in our data center as of the end of 2022. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

2022 Hard Drive Failure Rates

At the end of 2022, Backblaze was monitoring 231,309 hard drives used to store data. For our evaluation, we removed 388 drives from consideration which were used for either testing purposes or drive models for which we did not have at least 60 drives. This leaves us with 230,921 hard drives to analyze for this report.

Observations and Notes

One Zero for the Year

In 2022, only one drive had zero failures, the 8TB Seagate (model: ST8000NM000A). That “zero” does come with some caveats: We have only 79 drives in service and the drive has a limited number of drive days—22,839. These drives are used as spares to replace 8TB drives that have failed.

What About the Old Guys?

The 6TB Seagate (model: ST6000DX000) drive is the oldest in our fleet with an average age of 92.5 months. In 2021, it had an annualized failure rate (AFR) of just 0.11%, but has slipped a bit to 0.68% for 2022. A very respectable number any time, but especially after nearly eight years of duty.
The 4TB Toshiba (model: MD04ABA400V) drives have an average age of 91.3 months. In 2021, this drive has an AFR of 2.04% and that has jumped to 3.13% for 2022, which included three drive failures. Given the limited number of drives and drive days for this model, if there were only two drive failures in 2022, the AFR would be 2.08%, or nearly the same as 2021.
Both of these drive models have a relatively small number of drive days, so confidence in the AFR numbers is debatable. That said, both drives have performed well over their lifespan.

New Models

In 2021, we added five new models while retiring zero, giving us a total of 29 different models we are tracking. Here are the five new models:

HUH728080ALE604–8TB
ST8000NM000A–8TB
ST16000NM002J–16TB
MG08ACA16TA–16TB
WUH721816ALE6L4–16TB

The two 8TB drive models are being used to replace failed 8TB drives. The three 16TB drive models are additive to the inventory.

Comparing Drive Stats for 2020, 2021, and 2022

The chart below compares the AFR for each of the last three years. The data for each year is inclusive of that year only and the operational drive models present at the end of each year.

Drive Failure Was Up in 2022

After a slight increase in AFR from 2020 to 2021, there was a more notable increase in AFR in 2022 from 1.01% in 2021 to 1.37%. What happened? In our Q2 2022 and Q3 2022 quarterly Drive Stats reports, we noted an increase in the overall AFR from the previous quarter and attributed it to the aging fleet of drives. But, is that really the case? Let’s take a look at some of the factors at play that could cause the rise in AFR for 2022. We’ll start with drive size.

Drive Size and Drive Failure

The chart below compares 2021 and 2022 AFR for our large drives (which we’ve defined as 12TB, 14TB, and 16TB drives) to our smaller drives (which we’ve defined as 4TB, 6TB , 8TB, and 10TB drives).

With the exception of the 16TB drives, every drive size had an increase in their AFR from 2021 to 2022. In the case of the small drives, the increase was pronounced, and at 2.12% is well above the 1.37% AFR for 2022 for all drives.

In addition, while the small drive cohort represents only 28.7% of the drive days in 2022, they account for 44.5% of the drive failures. Our smaller drives are failing more often, but they are also older, so let’s take a closer look at that.

Drive Age and Drive Failure

When examining the correlation of drive age to drive failure we should start with our previous look at the hard drive failure bathtub curve. There we concluded that drives generally fail more often as they age. To see if that matters here, we’ll start with the table below which shows the average age of each drive model of drives by size.

With the exception of the 8TB Seagate (model: ST8000NM000A), which we recently purchased as replacements for failed 8TB drives, the drives fall neatly into our two groups noted above—10TB and below and 12TB and up.

Now let’s group the individual drive models into cohorts defined by drive size. But before we do, we should remember that the 6TB and 10TB drive models have a relatively small number of drives and drive days in comparison to the remaining drive groups. In addition, the 6TB and 10TB drive cohorts consist of one drive model, while the other drive groups have at least four different drive models. Still, leaving them out seems incomplete, so we’ve included tables with and without the 6TB and 10TB drive cohorts.

Each table shows the relationship for each drive size, between the average age of the drives and their associated AFR. The chart on the right (V2) clearly shows that the older drives, when grouped by size, fail more often. This increase as a drive model ages follows the bathtub curve we spoke of earlier.

So, What Caused the Increase in Drive Failure and Does it Matter?

The aging of our fleet of hard drives does appear to be the most logical reason for the increased AFR in 2022. We could dig in further, but that is probably moot at this point. You see, we spent 2022 building out our presence in two new data centers, the Nautilus facility in Stockton, California and the CoreSite facility in Reston, Virginia. In 2023, our focus is expected to be on replacing our older drives with 16TB and larger hard drives. The 4TB drives and yes, even our O.G. 6TB Seagate drives could go. We’ll keep you posted.

Drive Failures by Manufacturer

We’ve looked at drive failure by drive age and drive size, so it’s only right to look at drive failure by manufacturer. Below we have plotted the quarterly AFR over the last three years by manufacturer.

Starting in Q1 of 2021 and continuing to the end of 2022, we can see that the overall rise in the overall AFR over that time seems to be driven by Seagate and, to a lesser degree, Toshiba, although HGST contributes heavily to the Q1 2022 rise. In the case of Seagate, this makes sense as most of our Seagate drives are significantly older than any of the other manufacturers’ drives.

Before you throw your Seagate and Toshiba drives in the trash, you might want to consider the lifecycle cost of a given hard drive model versus its failure rate. We looked at this in our Q3 2022 Drive Stats report, and outlined the trade-offs between drive cost and failure rates. For example, in general, Seagate drives are less expensive and their failure rates are typically higher in our environment. But, their failure rates are typically not high enough to make them less cost effective over their lifetime. You could make a good case that for us, many Seagate drive models are just as cost effective as more expensive drives. It helps that our B2 Cloud Storage platform is built with drive failure in mind, but we’ll admit that fewer drive failures is never a bad thing.

Lifetime Hard Drive Stats

The table below is the lifetime AFR of all the drive models in production as of December 31, 2022.

The current lifetime AFR is 1.39%, which is down from a year ago (1.40%) and also down from last quarter (1.41%). The lifetime AFR is less prone to rapid changes due to temporary fluctuations in drive failures and is a good indicator of a drive model’s AFR. But it takes a fair amount of observations (in our case, drive days) to be confident in that number. To that end, the table below shows only those drive models which have accumulated one million drive days or more in their lifetime. We’ve ordered the list by drive days.

Finally, we are going to open up a bit here and share the results of the 388 drives we removed from our analysis because they were test drives or drive models with 60 or fewer drives. These drives are divided amongst 20 different drive models and the table below lists those drive models which were operational in our data centers as of December 31, 2022. Big caveat here: these are just test drives and so on, so be gentle. We usually ignore them in the reports, so this is their chance to shine, or not. We look forward to seeing your comments.

There are many reasons why these drives got to this point in their Backblaze career, but we’ll save those stories for another time. At this point, we’re just sharing to be forthright about the data, but there are certainly tales to be told. Stay tuned.

Our Annual Drive Stats Webinar

Join me on Tuesday, February 7 at 10 a.m. PT to review the results of the 2022 report. You’ll get a look behind the scenes at the data and the process we use to create the annual report.

The Hard Drive Stats Data

The complete data set used to create the tables and charts in this report is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data itself to anyone; it is free.

If you just want the data used to create the tables and charts in this blog post you can download the ZIP file containing the CSV files for each chart.

Good luck and let us know if you find anything interesting.

Want More Insights?

Check out our take on Hard Drive Cost per Gigabyte and Hard Drive Life Expectancy.

Interested in the SSD Data?

Read the most recent SSD edition of our Drive Stats Report.

The post Backblaze Drive Stats for 2022 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Noise

Tag Archives: hard drive stats

The collective thoughts of the interwebz