All posts by Andy Klein

Backblaze Drive Stats for 2024

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-2024/

A decorative image with the title 2024 Year End Drive Stats.

As of December 31, 2024, we had 305,180 drives under management. Of that number, there were 4,060 boot drives and 301,120 data drives. This report will focus on those data drives as we review the Q4 2024 annualized failure rates (AFR), the 2024 failure rates, and the lifetime failure rates for the drive models in service as of the end of 2024. Along the way, we’ll share our observations and insights on the data presented, and, as always, we look forward to you doing the same in the comments section at the end of the post.

Sign up for the Drive Stats webinar

Tune in to ask those questions you’ve had spinning ‘round your head like so many drives, and meet the new Drive Stats team—Stephanie Doyle and David Johnson of Backblaze Blog fame. Yes, you heard that right: It’s my last Drive Stats before I head off to retirement (but more on that later in the report). Read on, and sign up, for analysis and insights from the 2024 report.

Sign Up ➔ 

Q4 2024 hard drive failure rates

As of the end of 2024, Backblaze was monitoring 301,120 hard drives used to store data. For our evaluation, we removed from consideration 487 drives, as they did not meet the criteria to be included. We’ll discuss the criteria we used in the next section of this report. Removing these drives leaves us with 300,633 hard drives to analyze. The table below shows the annualized failure rates for Q4 2024 for this collection of drives.

Notes and observations

  • 24TB drives are here. Seagate 24TB drives (model: ST24000NM002H) arrived in early December. The 1,200 drives filled one Backblaze Vault with no failed drives through the end of Q4. The 24TB Seagate drives join the 20TB Toshiba and 22TB WDC drive models in the 20-plus capacity club as we continue to dramatically increase storage capacity while optimizing existing storage server space.
  • Zero failures for the quarter. Five drive models had zero failures for the quarter starting with the 24TB Seagate drive model noted above. The others are the 4TB HGST (model: HMS5C4040ALE640), the 8TB Seagate (model: ST8000NM000A), the 14TB Seagate (model: ST14000NM000J), and the 16TB Seagate (model: ST16000NM002J). All of the zeroes come with the caveat of having a relatively small number of drives and drive days, but zero failures in a quarter is always a good thing.
  • The 4TB drives are nearly extinct. The 4TB drive count decreased by another 1,774 drives in Q4. (I discussed exactly how we migrate them in more detail if you want to dig in.) The remaining ~4,000 drives should be gone by the end of Q1 2025. They will be replaced by the incoming 20TB, 22TB, and 24TB drives. It should be noted that out of the 4TB drives in operation in Q4, only one failed, so those 20-plus TB drives have a lot to live up to from a failure perspective.
  • The quarterly failure rate is down. The AFR for Q4 dropped from 1.89% in Q3 to 1.35% in Q4. While all drive sizes delivered some improvement from Q3 to Q4, one of the primary drivers is the addition of over 14,000 new 20-plus TB drives. As a group, these drives delivered an AFR of 0.77% for the quarter.

Drive model criteria

We noted earlier we removed 487 drives from consideration when we produced the table above covering Q4 2024. There are two primary reasons we did not consider these drive models.

  • Testing. These are drives of a given model that we monitor and collect Drive Stats data on, but are not considered production drives at this time. For example, drives undergoing certification testing to determine if they are performant enough for our environment are not included in our Drive Stats calculations.
  • Insufficient data points. When we calculate the annualized failure rate for a drive model for a given period of time (quarterly, annual, or lifetime), we want to ensure we have enough data to reliably do so. Therefore we have defined criteria for a drive model to be included in the tables and charts for the specified period of time. Models that do not meet these criteria are not included in the tables and charts for the period in question.
Period Drive Count Drive Days
Quarterly > 100 > 10,000
Annual > 250 > 50,000
Lifetime > 500 >100,000

Regardless of whether or not a given drive model is included in the charts and tables, all of the data for all of the drives we use is included in our Drive Stats dataset which you can download by visiting our Drive Stats page.

As with the Q4 quarterly results, we will apply these criteria to the annual and lifetime charts that follow in this report.

2024 annual hard drive failure rates

As of the end of 2024, Backblaze was monitoring 301,120 hard drives used to store data. We removed nine drive models consisting of 2,012 drives from consideration as they did not meet the annual criteria we have defined. This leaves us with 298,954 drives divided across 27 different drive models. The table below shows the AFRs for 2024 for this collection of drives.

Notes and observations

  • No zeros for the year. There were no qualifying drive models with zero failures in 2024. That said, the 16TB Seagate (model: ST16000NM002J) got close by recording just one drive failure back in Q3, giving the drive an AFR of 0.22% for 2024. 
  • Busy data center techs. During 2024, our data center techs installed 53,337 drives. If we assume there are 2,080 work hours a year (52 weeks times 40 hours), that math is 53,337/2,080, and that means our intrepid DC techs installed 26 drives per hour. Busy, busy, busy! 
  • The 24TB Seagate drives? While there were 1,200 new 24TB Seagate drives added in 2024, they were installed in early December and did not accumulate enough drive days to make the cut for the annual, or lifetime, tables. Including the 24TB Seagate drive, there were three models that missed out on being included in the 2024 annual tables, these drive models are listed below.
MFG Model Drive Count Drive Days 2024 AFR
Seagate ST8000NM000A 247 22,684 0.84%
Seagate ST14000NM000J 232 19,696 1.32%
Seagate ST24000NM002H 1,200 18,000 0.00%

As a reminder, a drive model needs to have over 250 drives by the end of Q4 and accumulate at least 50,000 drive days during 2024 to be included in the annual tables.

Comparing Drive Stats for 2022, 2023, and 2024

The table below compares the annual failure rates by drive model for each of the last three years. The table includes just those drive models which met the annual criteria as of the end of 2024. The data for each year is inclusive of that year only for the operational drive models present at the end of each year. The table is sorted by drive size and then AFR.

Notes and observations

  • The annual AFR is down. The 2024 AFR for all drives listed was 1.57%, this is down from 1.70% in 2023.  We expect the overall failure rates to continue to fall in 2025, but we will be watching the following for indicators.
    • The failure rates of the 8TB and 12TB drive models. All of the models will exceed their five years of service. In general, the failure rate will noticeably increase as the drives exceed five years of service. And, while there are outliers like the current HGST 4TB drives, you can’t assume that will happen.
    • The failure rates of the 14TB and 16TB drive models. These models are approaching middle age—three to five years in operation. This is where, according to the bathtub curve, their failure rates could gradually increase—but not as severely as when they exceed five years. 
    • The failure rates for the 20TB, 22TB, and 24TB drives models. These drives will enter the flat portion of the bathtub curve, that is where their failure rate should be the lowest.

Annualized failure rates vs. drive size

Now, we can dig into the numbers to see what else we can learn. We’ll start by looking at the quarterly annualized failure rate by drive size over the last three years.

Let’s take a look at the different drive sizes and how they affect the overall annualized failure rate over time.

Minimal impact. The 4TB (blue line) drives and 10TB (gold line) drives have had little impact over the last year on the overall failure rate as each finished the year with a relatively small number of drives. Still, the wild ride delivered by the 10TB drives keeps our DC techs on their toes. 

Older drives. The 8TB (gray line) drives and 12TB (purple line) drives range in age from five to eight years and as such their overall failure rates should be increasing over time. The 12TB drives are following that pattern moving up from about 1% AFR back in 2021 to just about 3% in 2024. The failure rates of the 8TB drives, while erratic from quarter-to-quarter, have a nearly flat trendline over the same period.

Workhorse drives. The 14TB (green line) and 16TB (azure* line) drives comprise 57% of the drives in service and on average they range in age from two to four years. They are in the prime of their working lives. As such, they should have low and stable failure rates, and as you can see, they do.

*  Maybe azure isn’t quite right, but robin’s egg blue seemed a bit pretentious.

New drives on the block. The 22TB (orange line) drives are in their early days as we continue to add more drives on a regular basis. Once the drive population settles down, we’ll have a better sense of the AFR direction. Still, the early results are solid with a lifetime AFR of 1.06%.

Annualized failure rates vs. manufacturer

One of the more popular ways we can look at this data is by the drive manufacturer as we’ve done below.

To complete the picture, the chart below uses the same data, but displays just the linear trendlines for each of the manufacturers over the same three-year period.

HGST. While the HGST trendline is not pretty, it doesn’t tell the entire story. Looking at the first chart, until Q4 2023, the HGST drives were at or below the average for all of the drives, that is all manufacturers. At that point, HGST has exceeded the average, and then some. The table below contains results for just the HGST drives for 2024. We’ve sorted them, high to low, by the 2024 AFR.

As you can see, there are two 12TB drive models driving the high AFR for the HGST drives. The HUH721212ALN604 model began showing signs of an increased quarterly AFR in Q1 2023 and the HUH721212ALE604 model followed suit in Q3 2024. Without these drive models, the 2024 AFR for HGST drive would be 0.55%.

Seagate. The quarterly AFR trendline decreased for the Seagate drives from 2022 through 2024. While the decrease was slight, from 2.25% to 2.0%, Seagate was the only manufacturer to do so. The decrease appears, at least in part, to be due to the removal of the Seagate 4TB drives during that period. 

Toshiba. Over the 2022 to 2024 period, the quarterly AFR for the Toshiba drive models varied within a fairly narrow range between 0.80% and 1.52%, with most quarters hovering slightly around 1.2%. Most importantly, none of the individual drive models were outliers, as the highest quarterly AFR for any Toshiba drive model was 1.58%. We like consistency. 

WDC. While WDC drive models delivered a similar level of consistency as the Toshiba models, they did so with a lower AFR each quarter. From 2022 through 2024, the range of quarterly AFR values for the WDC models was 0.0% to 0.85%. The 0.0% AFR was in Q1 2022 when none of the 12,207 WDC drives in operation failed during that quarter.

Lifetime hard drive stats

As of the end of 2024, Backblaze was monitoring 301,120 hard drives used to store data. Applying our drive criteria noted above for the lifetime period, we removed 11 drive models consisting of 2,736 drives from consideration as they did not meet the lifetime criteria we defined. This leaves us with 298,230 drives divided across 25 different drive models. The table below shows the lifetime AFRs for this collection of drives.

The current lifetime AFR for all of the drives is 1.31%. This is down from 1.46% in 2023. The drop is primarily due to the completion of the migration of the 4TB Seagate drives in 2024, which left us with only two of these drives still in operation as of the end of 2024. As a consequence, the 79 million drive days and over 5,600 drive failures racked up by the 4TB Seagate drives by the end of 2023 are not included in the data presented in the 2024 lifetime table above.  

In the final table below, we’ve taken the lifetime table and sorted out the drive models that have a lifetime AFR of 1.50% or less by drive size.

A couple of caveats as you review the table.

  • There is enough data for each model to say the AFR values are solid. That said, everything could change tomorrow. In general, the hard drive failure rate follows the bathtub curve as the drives age—unless it doesn’t. Some drives refuse to fail as they age, like the 4TB HGST drives. Other drives are great, and then “hit the wall” and bend the failure curve upward, fast.
  • A drive model with a 1% annualized failure rate means that you can expect one drive out of 100 to fail in a year. If you’re a personal drive user, that one drive could be yours. If you have exactly one drive, your personal annualized failure rate is 100%. In other words, always have a backup, and don’t forget to test it.

Migration time

I have been authoring the various Drive Stats reports for the past ten years and this will be my last one. I am retiring, or perhaps in Drive Stats vernacular, it would be “migrating.” Either way, after 10 years in the U.S. Air Force and 30+ years in Silicon Valley Tech, it is time. Drive Stats will continue with Stephanie Doyle and David Johnson as the replacement drive models beginning with the Q1 2025 report. I wish them well.

I want to say thank you to each of you who have taken your time to peruse and engage with the Drive Stats reports and data over the last 10 years. And, thank you as well for the comments, questions, and discussions that raced and raged across the various communities that care about something as mundane and awesome as a hard drive. It has been quite the ride—thanks again.

The Hard Drive Stats data

The complete data set used to create the tables and charts in this report is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data itself to anyone; it is free.

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q3 2024

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2024/

A decorative image that displays the words Q3 2024 Drive Stats.

As of the end of Q3 2024, Backblaze was monitoring 292,647 hard disk drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world. We removed from this analysis 4,100 boot drives, consisting of 3,344 SSDs and 756 HDDs. This leaves us with 288,547 hard drives under management to review for this report. We’ll review the annualized failure rates (AFRs) for Q3 2024 and the lifetime AFRs of the qualifying drive models. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard drive failure rates for Q3 2024

For our Q3 2024 quarterly analysis, we remove the following from consideration: drive models which did not have at least 100 drives in service at the end of the quarter, drive models which did not accumulate 10,000 or more drive days during the quarter, and individual drives which exceeded their manufacturer’s temperature specification during their lifetime. The removed pool totalled 471 drives, leaving us with 288,076 drives grouped into 29 drive models for our Q3 2024 analysis. 

The table below lists the AFRs and related data for these drive models. The table is sorted ascending by drive size then ascending by AFR within drive size.

Notes and observations on the Q3 2024 Drive Stats

  • Upward AFR. The quarter-to-quarter AFR continues to creep up rising from 1.71% in Q2 2024 to 1.89% in Q3 2024. The rise can’t be attributed to the aging 4TB drives, as our CVT drive migration system continues to replace these drives. As a consequence, the AFR for the remaining 4TB drives was 0.26% in Q3. The primary culprit is the collection of 8TB drives, which are now on average over seven years old. As a group, the AFR for the 8TB drives rose to 3.04% in Q3 2024, up from 2.31% in Q2. The CVT team is gearing up to begin the migration of 8TB drives over the next few months.
  • Yet another golden oldie is gone. You may have noticed that the 4TB Seagate drives (model: ST4000DM000) are missing from the table. All of the Backblaze Vaults containing these drives have been migrated, and as a consequence there are only two of these drives remaining, not enough to make the quarterly chart. You can read more about their demise in our recent Halloween post. 
  • A new drive in town. In Q3, the 20TB Toshiba drives (model: MG10ACA20TE) arrived in force, populating three complete Backblaze Vaults of 1,200 drives each. Over the last few months our drive qualification team put the 20TB drive model through its paces and, having passed the test, they are now on the list of drive models we can deploy.
  • One zero. For the second quarter in a row, the 14TB Seagate (model: ST16000NM00J) drive model had zero failures. With only 185 drives in service, there is a lot of potential variability in the future, but for the moment, they are settling in quite well.
  • The nine year club. There are no data drives with 10 or more years of service, but there are 39 drives that are nine years or older. They are all 4TB HGST drives (model: HMS5C4040ALE640) spread across 31 different Storage Pods, in five different Backblaze Vaults and two different data centers. Will any of those drives make it to 10 years? Probably not, given that four of the five vaults have started their CVT migrations and will be gone by the end of the year. And, while the fifth vault is not scheduled for migration yet, it is just a matter of time before all of the 4TB drives we are using will be gone.

Reactive and proactive drive failures

In the Drive Stats dataset schema, there is a field named failure, which displays either a 1 for failure or a 0 for not failed. Over the years in various posts, we have stated that for our purposes drive failure is either reactive or proactive. Furthermore, we have suggested that failed drives fall basically evenly into these two categories. We’d like to put some data behind that 50/50 number, but first let’s start by defining our two categories of drive failure, reactive and proactive. 

  • Reactive: A reactive failure is when any of the following conditions occur: the drive crashes and refuses to boot or spin up, the drive won’t respond to system commands, or the drive won’t stay operational. 
  • Proactive: A proactive failure is generally anything not a reactive failure, and typically is when one or more indicators such as SMART stats, FSCK (file system) checks, etc., signal that the drive is having difficulty and drive failure is highly probable. Typically a multitude of indicators are present in drives declared as proactive failures.

A drive that is removed and replaced as either a proactive or reactive failure is considered a drive failure in Drive Stats unless we learn otherwise. For example, a drive is experiencing communications errors and command timeouts and is scheduled for a proactive drive replacement. During the replacement process, the data center tech realizes the drive does not appear to be fully seated. After gently securing the drive, further testing reveals no issues and the drive is no longer considered failed.  At that point, the Drive Stats dataset is updated accordingly.

As noted above, the Drive Stats dataset includes the failure status (0 or 1) but not the type of failure (proactive or reactive). That’s a project for the future. To get a breakdown of different types of drives failure we have to interrogate the data center maintenance ticketing system used by each data center to record any maintenance activities on Storage Pods and related equipment. Historically, the drive failure data was not readily accessible, but a recent software upgrade now allows us access to this data for the first time. So in the spirit of Drive Stats, we’d like to share our drive failure types with you. 

Drive failure type stats

Q3 2024 will be our starting point for any drive failure type stats we publish going forward. For consistency, we will use the same drive models listed in the Drive Stats quarterly report, in this case Q3 2024. For this period, there were 1,361 drive failures across 29 drive models. 

We actually have been using the data center maintenance data for several years as each quarter we validate the failed drives reported by the Drive Stats system with the maintenance records. Only validated failed drives are used for the Drive Stats reports we publish quarterly and in the data we publish on our Drive Stats webpage.

The recent upgrades to the data center maintenance ticketing system have not only made the drive failure validation process easier, we can now easily join together the two sources. This gives us the ability to look at the drive failure data across several different attributes as shown in the tables below. We’ll start with the number of failed drives in each category and go from there. This will form our baseline data.

Reactive vs. proactive drive failures for Q3 2024

Observation period Reactive failures Proactive failures Total failures Reactive % Proactive%
Q3 2024 failed drives 640 721 1,361 47.0% 53.0%

Reactive vs. proactive drive failures for Q3 2024

Manufacturer Reactive failures Proactive failures Total failures Reactive % Proactive %
HGST 194 177 371 52.3% 47.7%
Seagate 258 272 530 48.7% 51.3%
Toshiba 124 221 345 35.9% 64.1%
WDC 64 51 115 55.7% 44.3%

Reactive vs. proactive drive failures by Backblaze data center

Backblaze data center Reactive failures Proactive failures Total failures Reactive % Proactive %
AMS 36 77 113 31.9% 68.1%
IAD 50 92 142 35.2% 64.8%
PHX 179 201 380 47.1% 52.9%
SAC 0 151 148 299 50.5% 49.5%
SAC 2 224 203 427 52.5% 47.5%

Reactive vs. proactive drive failures by server type

Server type Reactive failures Proactive failures Total failures Reactive % Proactive %
5.0 red Storage Pod (45 drives) 4 2 6 66.7% 33.3%
6.0 red Storage Pod (60 drives) 433 349 782 55.4% 44.6%
6.1 red Storage Pod (60 drives) 70 107 177 39.5% 60.5%
Dell Server (26 drives) 22 61 83 26.5% 73.5%
Supermicro Server (60 drives) 111 202 313 35.5% 64.5%

Obviously, there are many things we could analyze here, but for the moment we just want to establish a baseline. Next, we’ll collect additional data to see how consistent and reliable our data is over time. We’ll let you know what we find.

Learning more about proactive failures

One item of interest to us is the different reasons that cause a drive to be designated as a proactive failure. Today we record the reasons for the proactive designation at the time the drive is flagged for replacement, but currently multiple reasons are allowed for a given drive. This makes determining the primary reason difficult to determine. Of course, there may be no such thing as a primary reason, as it is often a combination of factors causing the problem. That analysis could be interesting as well. Regardless of the exact reason, such drives are in bad shape and replacing degraded drives to protect the data they store is our first priority.

Lifetime hard drive failure rates

As of the end of Q3 2024, we were tracking 288,547 operational hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q3 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 286,892 drives grouped into 25 models remaining for analysis as shown in the table below.

Downward lifetime AFR

In Q2 2024, the lifetime AFR for the drives listed was 1.47%. In Q3, the lifetime AFR went down to 1.31%, a significant decrease from one quarter to the next for the lifetime AFR. This decrease is also contrary to the increasing quarterly AFR increase over the same period. At first blush, that doesn’t make much sense as an increasing quarter-to-quarter AFR should increase the lifetime AFR. There are two related factors which explain this seemingly contradictory data. Let’s take a look. 

We’ll start with the table below which summarizes the differences between the Q2 and Q3 lifetime stats.

Period Drive count Drive days Drive failures Lifetime AFR
Q2 2024 283,065 469,219,469 18,949 1.47%
Q3 2024 286,892 398,476,931 14,308 1.31%

To create the dataset for the lifetime AFR tables two criteria are applied: first, at the end of a given quarter, the number of drives of a drive model must be greater than 500, and, second, the number of drive days must be greater than 100,000. The first  criterion ensures that the drive models are relevant to the data presented; that is, we have a significant number of each of the included drive models. The second standard ensures that the drive models listed in the lifetime AFR table have a sufficient number of data points; that is, they have enough drive days to be significant. 

As we can see in the table above, while the number of drives went up from Q2 to Q3, the number of drive days and the number of drive failures went down significantly. This is explained by comparing the drive models listed in the Q2 lifetime table versus the Q3 lifetime table. Let’s summarize.

  • Added: In Q3, we added the 20TB Toshiba drive model (MG10ACA20TE). In Q2, there were only two of these drives in service.
  • Removed: In Q3, we removed the 4TB Seagate drive model (ST4000DM000) as there were only two drives remaining as of the end of Q3, well below the criteria of 500 drives.

When we removed the 4TB Seagate drives we also removed 80,400,065 lifetime drive days and 5,789 lifetime drive failures from the Q3 lifetime AFR computations. If the 4TB Seagate drive model data (drive days and drive failures) was included in the Q3 Lifetime stats, the AFR would have been 1.50%. 

Why not include the 4TB Seagate data? In other words, why have a drive count criteria at all? Shouldn’t we compute lifetime AFR using all of the drive models we have ever used which accumulated over 100,000 drive days in a lifetime? If we did things that way, the list of drive models used to compute the lifetime AFR would now include drive models we stopped using years ago and would include nearly 100 different drive models. As a result, a majority of the drive models used to compute the lifetime AFR would be outdated and the lifetime AFR table would contain rows of basically useless data that has no current or future value. In short, having drive count as one of the criteria in computing lifetime AFR keeps the table relevant and approachable.

The Hard Drive Stats data

It has now been over 11 years since we began recording, storing, and reporting the operational statistics of the HDDs and SSDs we use to store data at Backblaze. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored. 

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.

You can download and use the Drive Stats dataset for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, 3) you may sell derivative works based on the data, but 4) you can not sell this data to anyone; it is free.

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q3 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Quoth the Drive Stats, Nevermore: An Elegy for Our Seagate 4TB Drives

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/quoth-the-drive-stats-nevermore-an-elegy-for-our-seagate-4tb-drives/

A decorative image showing a gravestone with ravens around it.

Once upon a midnight dreary, as I typed another query

Seeking many a quaint and curious fact of hidden Drive Stats lore—

While I waited, time advancing, suddenly the stats came dancing

Lines of empty datasets; the database had nothing more

“Is that right?” I muttered, “The database had nothing more—

So those drives, I must explore.”

Ah, distinctly I remember, it was just past this September

I requested failure rates of Seagate drives with terabytes of four

Eagerly I typed the query, even though my eyes were bleary

The count of Seagate fours was eerie, eerie; there was nothing more.

The sad and certain count screamed like it never had before;

No Seagate drives with terabytes of four.

There are missing rows, I’m certain, and files waiting to explore.

The reality I kept dismissing, the Seagate data must be missing

With hours gone to data fishing, the facts shook me to the core;

The spinning life is over for our Seagate drives with terabytes of four—

Those Seagate drives are nevermore.

(My apologies to Edgar Allen Poe.)

Shortly, we will publish the Q3 2024 Backblaze Drive Stats report, and an old faithful will be missing from the tables, the 4TB Seagate drive model ST4000DM000. This drive model has graced our Drive Stats charts and tables since the very first Drive Stats report, and it would be a ghastly mistake if we let the drive slip into the afterlife unnoticed. So on this All Hallows’ Eve, it’s only fitting we say nevermore to these Seagate drives.

The first 45 of these Seagate 4TB drives were installed in a 45-drive Backblaze Storage Pod in May 2013. That was before 60-drive Storage Pods, Backblaze Vaults, and even Backblaze B2. Over the next two years, thousands of new Seagate 4TB drives were added each quarter, and by Q3 2016, there were 34,744 spinning souls in service. That represented more than 50% of all the drives in service at the time—a howling success that has not been duplicated by any other drive model.

Alas, that didn’t last as the first wave of 8TB drives arrived in mid-2016 and with that, no additional 4TB Seagate drives were procured. Over time, as 4TB Seagate drives met their maker, the count decreased, and when Storage Pods containing these drives started being phased out in 2018, the count dropped faster. The final nail in the coffin came when, in 2023, our CVT drive migration system became fixated on the replacement of all the remaining 4TB Seagate drives, and here we are.

As for those intrepid 45 original drives installed in May 2013, they were not around at the end. They were unceremoniously replaced in a Storage Pod upgrade back in 2017. A few were resurrected as drive replacements, but today they only exist in the spirit world, having died or been replaced by 2020. Still many other 4TB Seagate drives have lived long happy lives, with nearly 100 exceeding 100 months of service (8.4 years) before being sent to their final resting place by the CVT reaper.

And so it is time; we shall gather in a circle, cross our arms and hold hands and chant “our Seagate drives…with terabytes of four…are nevermore!”

The post Quoth the Drive Stats, Nevermore: An Elegy for Our Seagate 4TB Drives appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q2 2024

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2024/

A decorative image with the headline Q2 2024 Drive Stats.

As of the end of Q2 2024, Backblaze was monitoring 288,665 hard drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world. We removed from this analysis 3,789 boot drives, consisting of 2,923 SSDs and 866 hard drives. This leaves us with 284,876 hard drives under management to review for this report. We’ll review the annualized failure rates (AFRs) for Q2 2024 and the lifetime AFRs of the qualifying drive models, and we’ll also check out drive age versus failure rates over time. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard drive failure rates for Q2 2024

For our Q2 2024 quarterly analysis, we remove from consideration: drive models which did have at least 100 drives in service at the end of the quarter, drive models which did not accumulate 10,000 or more drive days during the quarter, and individual drives which exceeded their manufacturer’s temperature specification during their lifetime. The removed pool totalled 490 drives, leaving us with 284,386 drives grouped into 29 drive models for our Q2 2024 analysis. 

The table below lists the AFRs and related data for these drive models. The table is sorted large to small by drive size then by AFR within drive size.

Notes and observations on the Q2 2024 Drive Stats

  • Upward AFR: The AFR for Q2 2024 was 1.71%. That’s up from Q1 2024 at 1.41%, but down from one year ago (Q2 2023) at 2.28%. While the quarter over quarter increase was a bit surprising, quarterly fluctuations in AFR are expected. Sixteen drive models had an AFR of 1.71% or below while 13 drive models had an AFR above.
  • Two good zeroes: In Q2 2024, two drive models had zero failures, a 14TB Seagate (model: ST14000NM000J) and a 16TB Seagate (model: ST16000NM002J). Both have a relatively small number of drives and drive days for the quarter, so their success is somewhat muted, but the 16TB Seagate drive model has a very respectable 0.57% lifetime failure rate.
  • Another GOAT is gone: In Q1, we migrated the last of our 4TB Toshiba drives. In Q2, we migrated the last of our 6TB drives, including all of the Seagate 6TB drives which had reached an average age of nine years (108 months). This Seagate drive model closed out its career at Backblaze with an impressive 0.86% lifetime AFR.

    Currently the 4TB Seagate (model: ST4000DM000) is our oldest data drive model in production at an average age of 99.5 months. The data on these drives is scheduled to be migrated over the next quarter or two using CVT, our in-house drive migration system. They’ll never reach nine years of service. 

  • The 10-Year Club: With the 6TB Seagate drives being migrated as they hit 10 years of service, we wondered: What is the oldest data drive in service? The answer, a 4TB HGST drive (model: HMS5C4040ALE640) with 9 years, 11 months and 23 days service as of the end of Q2. Alas, the Backblaze Vault in which this drive resides is now being migrated as are many other drives with over nine years of service. We’ll see next quarter to see if any of them made it to the 10-Year Club before they are retired.

    While there are no data drives with 10 years of service, there are 11 HDD boot drives that exceed the mark. In fact one, a 500GB WD drive (model: WD5000BPKT) has over 11 years of service. (Psst, don’t tell the CVT team.)

  • An HGST surprise: Over the years, the HGST drive models we have used performed very well. So, when the 12TB HGST (model: HUH721212ALN604) drive showed up with a 7.17% AFR for Q2, it’s news. Such uncharacteristic quarterly failure rates for this model actually go back about a year, although the 7.17% AFR is the largest quarterly value to date. As a result, the lifetime AFR has risen from 0.99% to 1.57% over the last year. While the lifetime AFR is not alarming, we are paying attention to this trend.

Lifetime hard drive failure rates

As of the end of Q2 2024, we were tracking 284,876 operational hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q2 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 283,065 drives grouped into 25 models remaining for analysis as shown in the table below.

Age, AFR, and snakes

One of the truisms in our business is that different drive models fail at different rates. Our goal is to develop a failure profile for a given drive model over time. Such a profile can help optimize our drive replacement and migration strategies, and ultimately maintains the durability of our cloud storage service.

For our cohort of data drives, we’ll look at the changes in the lifetime AFR over time for drive models with at least one million drive days as of the end of Q2 2024. This gives us 23 drive models to review. We’ll divide the drive models into two groups: those whose average age is five years (60 months) or less, and those whose average age is above 60 months. Why that cutoff? That’s the typical warranty period for enterprise class hard drives. 

Let’s start by plotting the current lifetime AFR for the 14 drives models that have an average age of 60 months or less as shown in the chart below.

Let’s review the drive models by characterizing the four quadrants as follows:

  • Quadrant I: Drive models in this quadrant are performing well, and have a respectable AFR of less than 1.5%. Drive models to the right in this quadrant might require a little more attention over the coming months than those to the left.
  • Quadrant II: These drive models have failure rates above 1.5%, but are still reasonable at around 2% lifetime AFR. What is important is that AFR does not increase significantly over time.
  • Quadrant III: There are no drives currently in this quadrant, but if there were it would not be a cause for alarm. Why? Some drive models experience higher rates of failure early on, and then following the bathtub curve, their AFR drops as they get older. 
  • Quadrant IV: These drive models are just starting out and are just beginning to establish their failure profile, which at the moment is good.

At a glance, the chart tells us that everything seems fine. The drives in Quadrant I are performing well, the two drives in Quadrant II could be better, but are still acceptable, and there are no surprises in the newer drive models to this point. Let’s see how things fair for the drive models which have an average age of over 60 months as in the chart below.

There are nine drive models which fit the average age criteria, including the Seagate 6TB drive (in yellow) whose drives were removed from service in Q2. As you can see the drive models are spread out across all four quadrants. As before, Quadrant I contains good drives, Quadrants II and III are drives we need to worry about, and Quadrant IV models look good so far. 

If we were to stop here we could decide for example that the 4TB Seagate drives are first in line for the CVT migration process, but not so fast. All of these drive models have been around for at least five years and we have their failure rates over time. So, rather than rely on just a point in time, let’s look at their change in failure rates over time in the chart below.

The snake chart, as we’re calling it, shows the lifetime failure rate of each drive model over time. We started at 24 months to make the chart less messy. Regardless, the drive models sort themselves out into either Quadrant I or II once their average age passes 60 months. Let’s take a look at the drives in each of those quadrants.

  • Quadrant I: Five of the nine drive models are in Quadrant I as of Q2 2024. The two 4TB HGST drives (brown and purple lines) as well as the 6TB Seagate (red line) have nearly vertical lines indicating their failure rates have been consistent over time, especially after 60 months of service. Such demonstrated consistency over time is a failure profile we like to see. 

    The failure profile of the 8TB Seagate (blue line) and the 8TB HGST (gray line) are less consistent, with each increasing their failure rates as they have aged. In the case of the HGST drive, the lifetime AFR rose from about 0.5% to 1.0% over an 18 month period starting at 48 months before leveling out. The Seagate drive took about two years starting at 60 months to go from 1.0% to nearly 1.5% before leveling out.

  • Quadrant II: The remaining 4 drive models ended in this quadrant. Three of the models, the 8TB Seagate (yellow line), the 10TB Seagate (green line), and the 12TB HGST (teal line) have similar failure profiles. All three got to some point in their lifetime and their curve began bending to the right. In other words, their failure rates over time accelerated. While the 8TB Seagate (yellow) shows some signs of leveling off, all three models will be closely watched and replaced if this trend continues.

    Also in Quadrant II is the 4TB Seagate drive (black line). This drive model is aggressively being migrated and is being replaced by 16TB and larger drives via the CVT process. As such, it is hard to tell if the nearly vertical failure profile is a function of the replacement process or the drive model failure rate leveling out over time. Either way, the migration of this drive model is expected to be complete in the next quarter or two.

A normal failure profile

If we had to pick one of the drive models to represent a normal failure profile, it would be the 8TB Seagate (blue line, model: ST800DM002). Why? The failure rate for the first 60 months was consistently around 1.0%, Seagate’s predicted AFR. After 60 months, the AFR increased as the drive aged as one would expect. You might have thought we’d choose the failure profile of one of the two 4TB HGST drive models (brown and purple lines). The “trouble” is their failure rates are well below any published AFR by any drive manufacturer. While that’s great for us, their annualized failure rates over time are sadly not normal.

Can AI help?

The idea of using AI/ML techniques to predict drive failure has been around for several years, but as a first step let’s see if predicting drive failure is even an AI-worthy problem. We recently conducted a webinar “Leveraging Your Cloud Storage Data in AL/ML Apps and Services” in which we outlined general criteria to be used in evaluating if AI/ML is needed to solve a given problem, in this case predicting drive failure. The most salient criteria which applies here is that AI is best used for a problem for which you can not consistently apply a set of rules to solve the problem. 

A model is trained by taking the source data and applying an algorithm to iteratively combine and weigh multiple factors. The output is a model which can be used to answer questions about the model’s subject matter, in this case drive failure. For example, we train a model using the Drive Stats data for a given drive model for the last year. Then, we ask the model a question using drive Z’s daily SMART stats and related information. We use this data as input to the model, and while there is no exact match, the model will use inference to develop a response of the probability of drive failure for drive Z over time. As such, it would seem that drive failure prediction would be a good candidate for using AI.

What’s not clear is whether what is learned about one drive model can be applied to another drive model. One look at the snake chart above visualizes the issue as the failure profile for each drive model is different, sometimes radically different. For example, do you think you could train a model on the 4TB Seagate drives (black line) and use it to predict drive failures for either of the 4TB HGST drive models (purple and brown lines)? The answer may be yes, but it certainly doesn’t seem likely. 

All that said, several research papers and studies have been published over the years attempting to determine whether or not AI/ML can be used to make drive failure predictions. We’ll be doing a review of these publications in the next couple of months and hopefully shed some light on the ability to use AI to accurately make drive failure predictions in a timely manner.

The Hard Drive Stats data

It has now been over 11 years since we began recording, storing, and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored. 

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.

The post Backblaze Drive Stats for Q2 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How Backblaze Scales Our Storage Cloud

Post Syndicated from Andy Klein original https://backblaze.com/blog/how-backblaze-scales-our-storage-cloud/

A decorative image showing a larger cube being compressed into a smaller cube.

Increasing storage density is a fancy way of saying we are replacing one drive with another drive of a larger capacity; for example replacing a 4TB drive with a 16TB drive—same space, four times the storage. You’ve probably copied or cloned a drive or two over the years, so you understand the general process. Now imagine having 270,000 drives that over the next several years will need to be replaced, or migrated as is often the term used. That’s a lot of work. And when you finish—well actually you’ll never finish as the process is continuous for as long as you are in the cloud storage business. So, how does Backblaze manage this ABC (Always Be Copying) process? Let me introduce you to CVT Copy or CVT for short.

CVT Copy is our in-house purpose-built application used to perform drive migrations at scale. CVT stands for Cluster, Vault, Tome, which is engineering vernacular mercifully shortened to CVT. 

Before we jump in, let’s take a minute to define a few terms in the context of how we organize storage.

  • Drive: The basic unit of storage ranging in our case from 4TB to 22TB in size.
  • Storage Server: A collection of drives in a single server. We have servers of 26, 45, and 60 drives. All drives in a storage server are the same logical size.
  • Backblaze Vault: A logical collection of 20 Storage Pods or servers. Each storage server in a Vault will have the same number of drives.
  • Tome: A tome is a logical collection of 20 drives, with each drive being in one of the 20 storage servers in a given Vault. If the storage servers in a Vault have 60 drives each, then there will be 60 unique tomes in that Vault.
  • Cluster: A logical collection of Vaults, grouped together to share other resources such as networking equipment and utility servers.

Based on this, a Vault consisting of 20, 60-drive storage servers will have 1,200 drives, a Vault with 45-drive storage servers will have 900 drives, and a Vault with 26-drive servers will have 520 drives. A cluster can have any combination of Vault sizes.

A Quick Review on How Backblaze Stores Data

Data is uploaded to one of the 20 drives within a tome. The data is then divided into parts, called data shards. At this point, we use our own Reed-Solomon erasing coding algorithm to compute the parity shards for that data. The number of data shards plus the number of parity shards will equal 20, i.e. the number of drives in a tome. The data and parity shards are written to their assigned drives, one shard per drive. The ratios of data shards to parity shards we currently use are 17/3, 16/4, and 15/5 depending primarily on the size of the drives being used to store the data—the larger the drive, the higher the parity.

Using parity allows us to restore (i.e. read) a file using less than 20 drives. For example, when a tome is 17/3 (data/parity), we only need data from any 17 of the 20 drives in that tome to restore a file. This dramatically increases the durability of the files stored.

CVT Overview

For CVT, the basic unit of migration is a tome, with all of the tomes in a source Vault being copied simultaneously to a new destination Vault which is typically new hardware. For each tome, the data, in the form of files, is copied file-by-file from the source tome to the destination tome.

The CVT Process

An overview of the CVT process is below, followed by an explanation of each task noted.

Selection

Selecting a Vault to migrate involves considering several factors. We start by reviewing current drive failure rates and predicted drive failure rates over time. We also calculate and consider overall Vault durability; that is, our ability to safeguard data from loss. In addition, we need to consider operational needs. For example, we still have Vaults using 45-drive Storage Pods. Upgrading these to 60-drive storage servers increases drive density in the same rack space. These factors taken together determine the next Vault to migrate.

Currently we are migrating systems with 4TB drives, which means we are migrating up to 3.6 petabytes (PB) of data for a 900 drive Vault or 4.8PB of data for a 1,200 drive Vault. Actually, there are no limitations as to the size of the source system drives, so Vaults with 6TB, 8TB, and larger sized drives can be migrated using CVT with minimal setup and configuration changes.

Once we’ve identified a source Vault to migrate we need to identify the target or destination system. Currently, we are using destination vaults containing 16TB drives. There is no limitation as to the size of the drives of the destination Vault, so long as they are at least as large as those in the source Vault. You can migrate the data from any sized source Vault to any sized destination Vault as long as there is adequate room on the destination Vault.  

Setup

Once the source Vault and destination Vault are selected, the various Technical Operations and Data Center folks get to work on setting things up. If we are not using an existing destination Vault, then a new destination Vault is provisioned. This brings up one of the features of CVT: The migration can be to a new clean Vault or an existing Vault; that is, one with data from a previous migration on it. In the latter case, the new data is just added and does not replace any of the existing data. The chart below are examples of the different ways a destination Vault can be filled from one or more source Vaults.

In any of these scenarios, the free space can be used for another migration destination or for a Vault where new customer data can be written.

Kickoff

With the source and destination Vaults identified and setup, we are now ready to kick-off the CVT process. The first step is to put the source Vault in a read-only state and to disable file deletions on both the source and destination Vaults. It is possible that some older source Vaults may have already been placed in read-only state to reduce their workload. A Vault in a read-only state continues to perform other operations such as running shard integrity checks, reporting Drive Stats statistics and so on. 

CVT and Drive Stats

We record Drive Stats data from the drives in the source Vault until the migration is complete and verified. At that point we begin recording Drive Stats data from the drives in the destination Vault and stop recording Drive Stats data from the drives in the source Vault. The drives in the source Vault are not marked as failed.

Build the File List

This step and the next three steps (read files, write files, and validate) are done as a consecutive group of steps for each tome in a source Vault. For our purpose here, we’ll call this group of steps the tome migration process, although they really don’t have such a name in the internal documentation. The tome migration process is for a single tome, but, in general, all tomes in a Vault are migrated at the same time, although due to their unique contents, they most likely will complete at different times.

For each tome, the source file list is copied to a file transfer database and each entry is mapped to its new location in the destination tome. This process allows us to maintain the same upload path while copying the data as the customer used to initially upload their data. This ensures that from the customer point of view, nothing changes in how they work with their files even though we have migrated them from one Vault to another.

Read Files

For each tome, we use the file location database to read the files. One file at a time. We use the same code in this process that we use when a user requests their data from the Backblaze B2 Storage Cloud. As noted earlier, the data is sharded across multiple drives using the preset data/parity scheme, for example 17/3. That means, in this case we only need data from 17 of the drives to read the file.

When we read a file, one advantage we get by using our standard read process is a pristine copy of the file to migrate. While we regularly run shard integrity checks on the stored data to ensure a given shard of data is good, media degradation, cosmic rays and so on can affect data sitting on a hard drive. By using the standard read process, we get a completely clean version of each file to migrate.

Write Files

The restored file is sent to the destination vault, there is no intermediate location where the file resides. The transfer is done over an encrypted network connection typically within the same data center, preferably on the same network segment. If the transfer is done between data centers, it is done over an encrypted dark fiber connection. 

The file is then written to the destination tome. The write process is the same one used by our customers when they upload a file and given that process has successfully written hundreds of billions of files we didn’t need to invent anything new.

At this point, you could be thinking that’s a lot of work to copy each file one by one.  Why not copy and transfer block by block, for example? The answer lies in the flexibility we get by using the standard file-based read and write processes.

  • We can change the number of tomes. Let’s say we have 45 tomes in the source Vault and 60 tomes in the destination Vault. If we had copied blocks of data the destination Vault would have 15 empty tomes. This creates load balancing and other assorted performance problems when that destination Vault is opened up for new data writes at a later date. By using standard read and write calls for each file, all 60 of the destination Vault’s tomes fill up evenly, just like they do when we receive customer data.
  • We can change parity of the data. The source 4TB drive Vaults have a data/parity ratio of 17/3. By using our standard process to write the files, the data/parity ratio can be set to whatever ratio we want for the destination Vault. Currently, the data/parity ratio for the 16TB destination Vaults is set to 15/5. This ratio ensures that the durability of the destination Vault and therefore the recoverability of the files therein is maintained as a result of migrating the data to larger drives.
  • We can maximize parity economics. Increasing the number of parity drives in a tome from three to five decreases the number of data drives in that tome. That would seem to increase the cost of storage, but the opposite is true in this case. Here’s how:
    • Using 4TB drives for 16TB of data stored
      • Our average cost for a 4TB drive was $120 or $0.03 per GB.
      • Our cost of 16TB of storage, using 4TB drives, was $480 (4 x $120).
      • Using a 17/3 data/parity scheme means:
        • Data storage: We have 13.6TB of data storage at $0.03/GB ($30/TB) which costs us $408. 
        • Parity storage: We have 2.4TB of parity storage at $0.03/GB ($30/TB) which costs us $72.
    • Using 16TB drives for 16TB of data stored
      • Our average cost for a 16TB drive is $232 or $0.0145 per GB.
      • Our cost of 16TB of storage is $232.
      • Using a 15/5 data/parity scheme means:
        • Data storage: We have 12.0TB of data storage at $0.0145/GB ($14.5/TB) which costs us $174.
        • Data parity: We have 4.0TB of parity storage at $0.0145/GB ($14.5/TB) which costs us $58.
    • In summary, increasing the data/parity ratio to 15/5 for the 16TB drives is less expensive ($58) than the cost of parity when using our 4TB drives ($72) to provide the same 16TB of storage. The lower cost per TB of the 16TB drives allows us to increase the number of parity drives in a tome. Therefore, the cost of increasing the parity of the destination tome not only enhances data durability, it is economically sound.
    • Obviously a 16TB drive actually holds a bit less data due to formatting and overhead and four 4TB drives hold even less data. In other words, even with formatting and so on, the math still works out in favor of using the 16TB drives.

Validate Tome

The last step in migrating a tome is to validate the destination tome is the same as the source tome. This is done for each tome as they complete their copy process. If the source and destination tomes are not consistent, shard integrity check data can be reviewed to determine any errors and the system can retransfer individual files, up to and including the entire tome.

Redirect Reads

Once all of the tomes within the Vault have completed their individual migrations and have passed their validation checks, we are ready to redirect customer reads (download requests) to the destination Vault. This process is completely invisible to the customer as they will use the same file handle as before. This redirection or swap process can be done tome by tome, but is usually done once the entire destination Vault is ready.

Monitor

At this point all download requests are handled by the destination Vault. We monitor the operational status of the Vault, as well as any failed download requests. We also review inputs from customer support and sales support to see if there are any customer related issues.

Once we are satisfied that the destination Vault is handling customer requests, we will logically decommission the source Vault. Basically, that means while the source Vault continues to run, it is no longer externally reachable. If a significant problem were to arise with the new destination Vault, we can swap in the source Vault. At this point, both Vaults are read-only, so the swap would be straightforward. We have not had to do this in our production environment. 

Full Capability

Once we are satisfied there are no issues with the destination Vault, we can proceed one or two ways.

  • Another migration: We can prepare for the migration of another source Vault to this destination Vault. If this is the case, we return to the Selection step of the CVT process with the Vault once again being assigned as a destination Vault. 
  • Allow new data: We allow the destination Vault to accept new data from customers. Typically, the contents of multiple source Vaults have been migrated to the destination Vault before this is done. Once new customer writes have been allowed on a destination Vault, we won’t use it as a destination Vault again.

Decommission

After three months the source Vault is eligible to be physically decommissioned. That is, we turn it off, disconnect it from power and networking, and schedule it to be disassembled. This includes wiping the drives and recycling the remaining parts either internally or externally. In practice, we will wait to decommission at least two Vaults at once as it is more economical in dealing with our recycling partners.

Automation

You’re probably wondering how much of this process is automated or uses some type of orchestration to align and accomplish tasks. We currently have monitoring tools, dashboards, scripts, and such, but humans, real ones not AI generated, are in control. That said, we are working on orchestration of the setup and provisioning processes as well as upleveling the automation in the tome migration process. Over time, we expect the entire migration process to be automated, but only when we are sure it works—the “run fast, break things” approach is not appropriate when dealing with customer data.

Not for the Faint of Heart

The basic idea of copying the contents of a drive to another larger drive is straightforward and well understood. As you scale this process, complexity creeps in as you have to consider how the data is organized and stored while keeping it secure and available to the end user. 

If your organization manages your data in-house, the never-ending task of simultaneously migrating hundreds or perhaps thousands of drives falls to you or perhaps the contractor you hired to perform the task if you lack the experience or staffing. And this is just one of the tasks you are faced with in operating, maintaining, and upgrading your own storage infrastructure.

In addition to managing a storage infrastructure, there are the growing environmental concerns of data storage. The amount of data generated and stored each year continues to skyrocket and tools such as CVT allow us to scale and optimize our resources in a cost efficient, yet environmentally sensitive way. 

To do this, we start with data durability. Using our Drive Stats data and other information, we optimize the length of time a Vault should be in operation before the drives need to be replaced, that is before the drive failure rate impacts durability. We then consider data density, how much data can we pack into a given space. Migrating data from 4TB to 16TB drives, for example, not only increases data density, it uses less electricity per stored terabyte of data and reduces the amount of waste if, for example, we had continued to buy and use 4TB drives instead of upgrading to 16TB drives. 

In summary, CVT is more than just a fancy data migration tool: It is part of our overall infrastructure management program addressing scalability, durability, and environmental challenges faced by the ever-increasing amounts of data we are asked to store and protect each day.

Kudos

The CVT program is run by Bryan with the wonderfully descriptive title of Senior Manager, Operations Analysis and Capacity Management. He is assisted by folks from across the organization. They are, in no particular order, Bach, Lorelei (Lo), Madhu, Mitch, Ben, Rodney, Vicky, Ryan, David M., Sudhi, David W., Zoe, Mike, and unnamed others who pitch in as the process rolls along. Each person brings their own expertise to the process which Bryan coordinates. To date, the CVT Team has migrated 24 Vaults containing over 60PB of data—and that’s just the beginning.

The post How Backblaze Scales Our Storage Cloud appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q1 2024

Post Syndicated from Andy Klein original https://backblaze.com/blog/backblaze-drive-stats-for-q1-2024/

A decorative image displaying the title Q1 2024 Drive Stats.

As of the end of Q1 2024, Backblaze was monitoring 283,851 hard drives and SSDs in our cloud storage servers located in our data centers around the world. We removed from this analysis 4,279 boot drives, consisting of 3,307 SSDs and 972 hard drives. This leaves us with 279,572 hard drives under management to examine for this report. We’ll review their annualized failure rates (AFRs) as of Q1 2024, and we’ll dig into the average age of drive failure by model, drive size, and more. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard Drive Failure Rates for Q1 2024

We analyzed the drive stats data of 279,572 hard drives. In this group we identified 275 individual drives which exceeded their manufacturer’s temperature specification at some point in their operational life. As such, these drives were removed from our AFR calculations.

The remaining 279,297 drives were divided into two groups. The primary group consists of the drive models which had at least 100 drives in operation as of the end of the quarter and accumulated over 10,000 drive days during the same quarter. This group consists of 278,656 drives grouped into 29 drive models. The secondary group contains the remaining 641 drives which did not meet the criteria noted. We will review the secondary group later in this post, but for the moment let’s focus on the primary group.

For Q1 2024, we analyzed 278,656 hard drives grouped into 29 drive models. The table below lists the AFRs of these drive models. The table is sorted by drive size then AFR, and grouped by drive size.

Notes and Observations on the Q1 2024 Drive Stats

  • Downward AFR: The AFR for Q1 2024 was 1.41%. That’s down from Q4 2023 at 1.53%, and also down from one year ago (Q1 2023) at 1.54%. The continuing process of replacing older 4TB drives is a primary driver of this decrease as the Q1 2024 AFR (1.36%) for the 4TB drive cohort is down from a high of 2.33% in Q2 2023.
  • A Few Good Zeroes: In Q1 2024, three drive models had zero failures:
    • 16TB Seagate (model: ST16000NM002J)
      • Q1 2024 drive days: 42,133
      • Lifetime drive days: 216,019
      • Lifetime AFR: 0.68%
      • Lifetime confidence interval: 1.4%
    • 8TB Seagate (model: ST8000NM000A)
      • Q1 2024 drive days: 19,684
      • Lifetime drive days: 106,759
      • Lifetime AFR: 0.00%
      • Lifetime confidence interval: 1.9%
    • 6TB Seagate (model: ST6000DX000)
      • Q1 2024 drive days: 80,262 
      • Lifetime drive days: 4,268,373
      • Lifetime AFR: 0.86%
      • Lifetime confidence interval: 0.3%

All three drives have a lifetime AFR of less than 1%, but in the case of the 8TB and 16TB drive models the confidence interval (95%) is still too high. While it is possible the two drives models will continue to perform well, we’d like to see the confidence interval below 1%, and preferably below 0.5%, before we can trust the lifetime AFR.

With a confidence interval of 0.3% the 6TB Seagate drives delivered another quarter of zero failures. At an average age of nine years, these drives continue to defy their age. They were purchased and installed at the same time back in 2015, and are members of the only 6TB Backblaze Vault still in operation.

  • The End of the Line: The 4TB Toshiba (model: MD04ABA400V) are not in the Q1 2024 Drive Stats tables. This was not an oversight.  The last of these drives became a migration target early in Q1 and their data was securely transferred to pristine 16TB Toshiba drives. They rivaled the 6TB Seagate drives in age and AFR, but their number was up and it was time to go.

The Secondary Group

As noted previously, we divided the drive models into two groups, primary and secondary, with drive count (>100) and drive days (>10,000) being the metrics used to divide the groups. The secondary group has a total of 641 drives spread across 27 drive models. Below is a table of those drive models. 

The secondary group is mostly made up of drive models which are replacement drives or migration candidates. Regardless, the lack of observations (drive days) over the observation period is too low to have any certainty about the calculated AFR.

From time to time, a secondary drive model will move into the primary group. For example, the 14TB Seagate (model: ST14000NM000J) will most likely have over 100 drives and 10,000 drive days in Q2. The reverse is also possible, especially as we continue to migrate our 4TB drive models.

Why Have a Secondary Group?

In practice we’ve always had two groups; we just didn’t name them. Previously, we would eliminate from the quarterly, annual, and lifetime AFR charts drive models which did not have at least 45 drives, then we upped that to 60 drives. This was okay, but we realized that we needed to also set a minimum number of drive days over the analysis period to improve our confidence in the AFRs we calculated. To that end, we have set the following thresholds for drive models to be in the primary group.

Review Period Drive Count per Model Drive Days per Model
Quarterly >100 drives >10,000 drive days
Annual >250 drives >50,000 drives days
Lifetime >500 drives >100,000 drive days

We will evaluate these metrics as we go along and change them if needed. The goal is to continue to provide AFRs that we are confident are an accurate reflection of the drives in our environment.

The Average Age of Drive Failure Redux

In Q1 2023 Drive Stats report, we took a look at the average age in which a drive fails. This review was inspired by the folks at Secure Data Recovery who calculated that based on their analysis of 2,007 failed drives, the average age at which they failed was 1,051 days or roughly 2 years and 10 months. 

We applied the same approach to our 17,155 failed drives and were surprised when our average age of failure was only 2 years and 6 months. Then we realized that many of the drive models that were still in use were older (much older) than the average, and surely when some number of them failed, it would affect the average age of failure for a given drive model. 

To account for this realization, we considered only those drive models that are no longer active in our production environment. We call this collection retired drive models as these are drives that can no longer age or fail. When we reviewed the average age of this retired group of drives, the average age of failure was 2 years and 7 months. Unexpected, yes, but we decided we needed more data before reaching any conclusions.

So, here we are a year later to see if the average age of drive failure we computed in Q1 2023 has changed. Let’s dig in. 

As before we recorded the date, serial_number, model, drive_capacity, failure, and SMART 9 raw value for all of the failed drives we have in the Drive Stats dataset back to April 2013. The SMART 9 raw value gives us the number of hours the drive was in operation. Then we removed boot drives and drives with incomplete data, that is some of the values were missing or wildly inaccurate. This left us with 17,155 failed drives as of Q1 2023.

Over this past year, Q2 2023 through Q1 2024, we recorded an additional 4,406 failed drives. There were 173 drives which were either boot drives or had incomplete data, leaving us with 4,233 drives to add to the 17,155 failed drives previous, totalling 21,388 failed drives to evaluate.

When we compare Q1 2023 to Q1 2024 we get the table below.

The average age of failure for all of the Backblaze drive models (2 years and 10 months) matches the Secure Data Recovery baseline. The question is, does that validate their number? We say, not yet. Why? Two primary reasons. 

First, we only have two data points, so we don’t have much of a trend, that is we don’t know if the alignment is real or just temporary. Second, the average age of failure of the active drive models (that is, those in production) is now already higher (2 years and 11 months) than the Secure Data baseline. If that trend were to continue, then when the active drive models retire, they will likely increase the average age of failure of the drive models that are not in production.

That said, we can compare the numbers by drive size and drive model from Q1 2023 to Q1 2024 to see if we can gain any additional insights. Let’s start with the average age by drive size in the table below.

The most salient observation is that for every drive size that had active drive models (green), the average age of failure increased from Q1 2023 to Q1 2024. Given that the overall average age of failure increased over the last year, it is reasonable to expect that some of the active drive size cohorts would increase. With that in mind, let’s take a look at the changes by drive model over the same period. 

Starting with the retired drive models, there were three drive models totalling 196 drives which moved from active to retired from Q1 2023 to Q1 2024. Still, the average age of failure for the retired drive cohort remained at 2 years 7 months, so we’ll spare you from looking at a chart with 39 drive models where over 90% of the data didn’t change Q1 2023 to Q1 2024.

On the other hand, the active drive models are a little more interesting, as we can see below.

In all but the two drive models (highlighted), the average age of failure for each drive model went up. In other words, active drive models are, on average, older when they fail, than one year ago. Remember, we are testing the average age of the drive failures, not the average age of the drive. 

At this point, let’s review. The Secure Data Recovery folks checked 2,007 failed drives and determined their average age of failure was 2 years and 10 months. We are testing that assertion. At the moment, the average age of failure for the retired drive models (those no longer in operation in our environment) is 2 years and 7 months. This is still less than the Secure Data number. But, the drive models still in operation are now hitting an average of 2 years and 10 months, suggesting that once these drive models are removed from service, the average age of failure for the retired drive models will increase. 

Based on all of this, we think the average age of failure for our retired drive models will eventually exceed 2 years and 10 months. Further, we predict that the average age of failure will reach closer to 4 years for the retired drive models once our 4TB drive models are removed from service. 

Annualized Failures Rates for Manufacturers

As we noted at the beginning of this report, the quarterly AFR for Q1 2024 was 1.41%. Each of the four manufacturers we track contributed to the overall AFR as shown in the chart below.

As you can see, the overall AFR for all drives peaked in Q3 2023 and is dropping. This is mostly due to the retirement of older 4TB drives that are further along the bathtub curve of drive failure. Interestingly, all of the remaining 4TB drives in use today are either Seagate or HGST models. Therefore, we expect the quarterly AFR will most likely continue to decrease for those two manufacturers as over the next year their 4TB drive models will be replaced.

Lifetime Hard Drive Failure Rates

As of the end of Q1 2024, we were tracking 279,572 operational hard drives. As noted earlier, we defined the minimum eligibility criteria of a drive model to be included in our analysis for quarterly, annual and lifetime reviews. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q1 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 277,910 drives grouped into 26 models remaining for analysis as shown in the tale below.

With three exceptions, the conference interval for each drive model is 0.5% or less at 95% certainty. For the three exceptions: the 10TB Seagate, the 14TB Seagate, and 14TB Toshiba models, the occurrence of drive failure from quarter to quarter was too variable over their lifetime. This volatility has a negative effect on the confidence interval.

The combination of a low lifetime AFR and a small confidence interval is helpful in identifying the drive models which work well in our environment. These days we are interested mostly in the larger drives as replacements, migration targets, or new installations. Using the table above, let’s see if we can identify our best 12, 14, and 16TB performers. We’ll skip reviewing the 22TB drives as we only have one model.

The drive models are grouped by drive size, then sorted by their Lifetime AFR. Let’s take a look at each of those groups.

  • 12TB drive models: The three 12TB HGST models are great performers, but are hard to find new. Also, Western Digital, who purchased the HGST drive business a while back, has started using their own model numbers of these drives, so it can be confusing. If you do find an original HGST make sure it is new as from our perspective buying a refurbished drive is not the same as buying a new.
  • 14TB drive models: The first three models look to be solid—the WDC (WUH721414ALE6L4), Toshiba (MG07ACA14TA), and Seagate (ST14000NM001G). The remaining two drive models have mediocre lifetime AFRs and undesirable confidence intervals. 
  • 16TB drive models: Lots of choice here, with all six drive models performing well to this point, although the WDC models are the best of the best to date.

The Hard Drive Stats Data

It has now been eleven years since we began recording, storing and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored. 

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.

You can download and use the Drive Stats dataset for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q1 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for 2023

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-2023/

A decorative image displaying the words 2023 Year End Drive Stats

As of December 31, 2023, we had 274,622 drives under management. Of that number, there were 4,400 boot drives and 270,222 data drives. This report will focus on our data drives. We will review the hard drive failure rates for 2023, compare those rates to previous years, and present the lifetime failure statistics for all the hard drive models active in our data center as of the end of 2023. Along the way we share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

2023 Hard Drive Failure Rates

As of the end of 2023, Backblaze was monitoring 270,222 hard drives used to store data. For our evaluation, we removed 466 drives from consideration which we’ll discuss later on. This leaves us with 269,756 hard drives covering 35 drive models to analyze for this report. The table below shows the Annualized Failure Rates (AFRs) for 2023 for this collection of drives.

An chart displaying the failure rates of Backblaze hard drives.

Notes and Observations

One zero for the year: In 2023, only one drive model had zero failures, the 8TB Seagate (model: ST8000NM000A). In fact, that drive model has had zero failures in our environment since we started deploying it in Q3 2022. That “zero” does come with some caveats: We have only 204 drives in service and the drive has a limited number of drive days (52,876), but zero failures over 18 months is a nice start.

Failures for the year: There were 4,189 drives which failed in 2023. Doing a little math, over the last year on average, we replaced a failed drive every two hours and five minutes. If we limit hours worked to 40 per week, then we replaced a failed drive every 30 minutes.

More drive models: In 2023, we added six drive models to the list while retiring zero, giving us a total of 35 different models we are tracking. 

Two of the models have been in our environment for a while but finally reached 60 drives in production by the end of 2023.

  1. Toshiba 8TB, model HDWF180: 60 drives.
  2. Seagate 18TB, model ST18000NM000J: 60 drives.

Four of the models were new to our production environment and have 60 or more drives in production by the end of 2023.

  1. Seagate 12TB, model ST12000NM000J: 195 drives.
  2. Seagate 14TB, model ST14000NM000J: 77 drives.
  3. Seagate 14TB, model ST14000NM0018: 66 drives.
  4. WDC 22TB, model WUH722222ALE6L4: 2,442 drives.

The drives for the three Seagate models are used to replace failed 12TB and 14TB drives. The 22TB WDC drives are a new model added primarily as two new Backblaze Vaults of 1,200 drives each.

Mixing and Matching Drive Models

There was a time when we purchased extra drives of a given model to have on hand so we could replace a failed drive with the same drive model. For example, if we needed 1,200 drives for a Backblaze Vault, we’d buy 1,300 to get 100 spares. Over time, we tested combinations of different drive models to ensure there was no impact on throughput and performance. This allowed us to purchase drives as needed, like the Seagate drives noted previously. This saved us the cost of buying drives just to have them hanging around for months or years waiting for the same drive model to fail.

Drives Not Included in This Review

We noted earlier there were 466 drives we removed from consideration in this review. These drives fall into three categories.

  • Testing: These are drives of a given model that we monitor and collect Drive Stats data on, but are in the process of being qualified as production drives. For example, in Q4 there were four 20TB Toshiba drives being evaluated.
  • Hot Drives: These are drives that were exposed to high temperatures while in operation. We have removed them from this review, but are following them separately to learn more about how well drives take the heat. We covered this topic in depth in our Q3 2023 Drive Stats Report
  • Less than 60 drives: This is a holdover from when we used a single storage server of 60 drives to store a blob of data sent to us. Today we divide that same blob across 20 servers, i.e. a Backblaze Vault, dramatically improving the durability of the data. For 2024 we are going to review the 60 drive criteria and most likely replace this standard with a minimum number of drive days in a given period of time to be part of the review. 

Regardless, in the Q4 2023 Drive Stats data you will find these 466 drives along with the data for the 269,756 drives used in the review.

Comparing Drive Stats for 2021, 2022, and 2023

The table below compares the AFR for each of the last three years. The table includes just those drive models which had over 200,000 drive days during 2023. The data for each year is inclusive of that year only for the operational drive models present at the end of each year. The table is sorted by drive size and then AFR.

A chart showing the failure rates of hard drives from 2021, 2022, and 2023.

Notes and Observations

What’s missing?: As noted, a drive model required 200,000 drive days or more in 2023 to make the list. Drives like the 22TB WDC model with 126,956 drive days and the 8TB Seagate with zero failures, but only 52,876 drive days didn’t qualify. Why 200,000? Each quarter we use 50,000 drive days as the minimum number to qualify as statistically relevant. It’s not a perfect metric, but it minimizes the volatility sometimes associated with drive models with a lower number of drive days.

The 2023 AFR was up: The AFR for all drives models listed was 1.70% in 2023. This compares to 1.37% in 2022 and 1.01% in 2021. Throughout 2023 we have seen the AFR rise as the average age of the drive fleet has increased. There are currently nine drive models with an average age of six years or more. The nine models make up nearly 20% of the drives in production. Since Q2, we have accelerated the migration from older drive models, typically 4TB in size, to new drive models, typically 16TB in size. This program will continue throughout 2024 and beyond.

Annualized Failure Rates vs. Drive Size

Now, let’s dig into the numbers to see what else we can learn. We’ll start by looking at the quarterly AFRs by drive size over the last three years.

A chart showing hard drive failure rates by drive size from 2021 to 2023.

To start, the AFR for 10TB drives (gold line) are obviously increasing, as are the 8TB drives (gray line) and the 12TB drives (purple line). Each of these groups finished at an AFR of 2% or higher in Q4 2023 while starting from an AFR of about 1% in Q2 2021. On the other hand, the AFR for the 4TB drives (blue line) rose initially, peaking in 2022 and has decreased since. The remaining three drive sizes—6TB, 14TB, and 16TB—have oscillated around 1% AFR for the entire period. 

Zooming out, we can look at the change in AFR by drive size on an annual basis. If we compare the annual AFR results for 2022 to 2023, we get the table below. The results for each year are based only on the data from that year.

At first glance it may seem odd that the AFR for 4TB drives is going down. Especially given the average age of each of the 4TB drives models is over six years and getting older. The reason is likely related to our focus in 2023 on migrating from 4TB drives to 16TB drives. In general we migrate the oldest drives first, that is those more likely to fail in the near future. This process of culling out the oldest drives appears to mitigate the expected rise in failure rates as a drive ages. 

But, not all drive models play along. The 6TB Seagate drives are over 8.6 years old on average and, for 2023, have the lowest AFR for any drive size group potentially making a mockery of the age-is-related-to-failure theory, at least over the last year. Let’s see if that holds true for the lifetime failure rate of our drives.

Lifetime Hard Drive Stats

We evaluated 269,756 drives across 35 drive models for our lifetime AFR review. The table below summarizes the lifetime drive stats data from April 2013 through the end of Q4 2023. 

A chart showing lifetime annualized failure rates for 2023.

The current lifetime AFR for all of the drives is 1.46%. This is up from the end of last year (Q4 2022) which was 1.39%. This makes sense given the quarterly rise in AFR over 2023 as documented earlier. This is also the highest the lifetime AFR has been since Q1 2021 (1.49%). 

The table above contains all of the drive models active as of 12/31/2023. To declutter the list, we can remove those models which don’t have enough data to be statistically relevant. This does not mean the AFR shown above is incorrect, it just means we’d like to have more data to be confident about the failure rates we are listing. To that end, the table below only includes those drive models which have two million drive days or more over their lifetime, this gives us a manageable list of 23 drive models to review.

A chart showing the 2023 annualized failure rates for drives with more than 2 million drive days in their lifetimes.

Using the table above we can compare the lifetime drive failure rates of different drive models. In the charts below, we group the drive models by manufacturer, and then plot the drive model AFR versus average age in months of each drive model. The relative size of each circle represents the number of drives in each cohort. The horizontal and vertical scales for each manufacturer chart are the same.

A chart showing annualized failure rates by average age and drive manufacturer.

Notes and Observations

Drive migration: When selecting drive models to migrate we could just replace the oldest drive models first. In this case, the 6TB Seagate drives. Given there are only 882 drives—that’s less than one Backblaze Vault—the impact on failure rates would be minimal. That aside, the chart makes it clear that we should continue to migrate our 4TB drives as we discussed in our recent post on which drives reside in which storage servers. As that post notes, there are other factors, such as server age, server size (45 vs. 60 drives), and server failure rates which help guide our decisions. 

HGST: The chart on the left below shows the AFR trendline (second order polynomial) for all of our HGST models.  It does not appear that drive failure consistently increases with age. The chart on the right shows the same data with the HGST 4TB drive models removed. The results are more in line with what we’d expect, that drive failure increased over time. While the 4TB drives perform great, they don’t appear to be the AFR benchmark for newer/larger drives.

One other potential factor not explored here, is that beginning with the 8TB drive models, helium was used inside the drives and the drives were sealed. Prior to that they were air-cooled and not sealed. So did switching to helium inside a drive affect the failure profile of the HGST drives? Interesting question, but with the data we have on hand, I’m not sure we can answer it—or that it matters much anymore as helium is here to stay.

Seagate: The chart on the left below shows the AFR trendline (second order polynomial) for our Seagate models. As with the HGST models, it does not appear that drive failure continues to increase with age. For the chart on the right, we removed the drive models that were greater than seven years old (average age).

Interestingly, the trendline for the two charts is basically the same up to the six year point. If we attempt to project past that for the 8TB and 12TB drives there is no clear direction. Muddying things up even more is the fact that the three models we removed because they are older than seven years are all consumer drive models, while the remaining drive models are all enterprise drive models. Will that make a difference in the failure rates of the enterprise drive model when they get to seven or eight or even nine years of service? Stay tuned.

Toshiba and WDC: As for the Toshia and WDC drive models, there is a little over three years worth of data and no discernible patterns have emerged. All of the drives from each of these manufacturers are performing well to date.

Drive Failure and Drive Migration

One thing we’ve seen above is that drive failure projections are typically drive model dependent. But we don’t migrate drive models as a group, instead, we migrate all of the drives in a storage server or Backblaze Vault. The drives in a given server or Vault may not be the same model. How we choose which servers and Vaults to migrate will be covered in a future post, but for now we’ll just say that drive failure isn’t everything.

The Hard Drive Stats Data

The complete data set used to create the tables and charts in this report is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data itself to anyone; it is free.

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The Drive Stats of Backblaze Storage Pods

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/the-drive-stats-of-backblaze-storage-pods/

A decorative image showing the Backblaze logo on a cloud over a pattern representing a network.

Since 2009, Backblaze has written extensively about the data storage servers we created and deployed which we call Backblaze Storage Pods. We not only wrote about our Storage Pods, we open sourced the design, published a parts list, and even provided instructions on how to build one. Many people did. Of the six storage pod versions we produced, four of them are still in operation in our data centers today. Over the last few years, we began using storage servers from Dell and, more recently, Supermicro, as they have proven to be economically and operationally viable in our environment. 

Since 2013, we have also written extensively about our Drive Stats, sharing reports on the failure rates of the HDDs and SSDs in our legion of storage servers. We have examined the drive failure rates by manufacturer, size, age, and so on, but we have never analyzed the drive failure rates of the storage servers—until now. Let’s take a look at the Drive Stats for our fleet of storage servers and see what we can learn.

Storage Pods, Storage Servers, and Backblaze Vaults

Let’s start with a few definitions:

  • Storage Server: A storage server is our generic name for a server from any manufacturer which we use to store customer data. We use storage servers from Backblaze, Dell, and Supermicro.
  • Storage Pod: A Storage Pod is the name we gave to the storage servers Backblaze designed and had built for our data centers. The first Backblaze Storage Pod version was announced in September 2009. Subsequent versions are 2.0, 3.0, 4.0, 4.5, 5.0, 6.0, and 6.1. All but 6.1 were announced publicly. 
  • Backblaze Vault: A Backblaze Vault is 20 storage servers grouped together for the purpose of data storage. Uploaded data arrives at a given storage server within a Backblaze Vault and is encoded into 20 parts with a given part being either a data blob or parity. Each of the 20 parts (shards) is then stored on one of the 20 storage servers. 

As you review the charts and tables here are a few things to know about Backblaze Vaults.

  • There are currently six cohorts of storage servers in operation today: Supermicro, Dell, Backblaze 3.0, Backblaze 5.0, Backblaze 6.0, and Backblaze 6.1.
  • A given Vault will always be made up from one of the six cohorts of storage servers noted above. For example, Vault 1016 is made up of 20 Backblaze 5.0 Storage Pods and Vault 1176 is made of the 20 Supermicro servers. 
  • A given Vault is made up of storage servers that contain the same number of drives as follows:
    • Dell servers: 26 drives.
    • Backblaze 3.0 and Backblaze 5.0 servers: 45 drives.
    • Backblaze 6.0, Backblaze 6.1, and Supermicro servers: 60 drives.
  • All of the hard drives in a Backblaze Vault will be logically the same size; so, 16TB drives for example.

Drive Stats by Backblaze Vault Cohort

With the background out of the way, let’s get started. As of the end of Q3 2023, there were a total of 241 Backblaze Vaults divided into the six cohorts, as shown in the chart below. The chart includes the server cohort, the number of Vaults in the cohort, and the percentage that cohort is of the total number of Vaults.

A pie chart showing the types of Backblaze Vaults by percentage.

Vaults consisting of Backblaze servers still comprise 68% of the vaults in use today (shaded from orange to red), although that number is dropping as older Vaults are being replaced with newer server models, typically the Supermicro systems.

The table below shows the Drive Stats for the different Vault cohorts identified above for Q3 2023.

A chart showing the Drive Stats for Backblaze Vaults.

The Avg Age (months) column is the average age of the drives, not the average age of the Vaults. The two may seem to be related, that’s not entirely the case. It is true the Backblaze 3.0 Vaults were deployed first followed in order by the 5.0 and 6.0 Vaults, but that’s where things get messy. There was some overlap between the Dell and Backblaze 6.1 deployments as the Dell systems were deployed in our central Europe data center, while the 6.1 Vaults continued to be deployed in the U.S. In addition, some migrations from the Backblaze 3.0 Vaults were initially done to 6.1 Vaults while we were also deploying new drives in the Supermicro Vaults. 

The AFR for each of the server versions does not seem to follow any pattern or correlation to the average age of the drives. This was unexpected because, in general, as drives pass about four years in age, they start to fail more often. This should mean that Vaults with older drives, especially those with drives whose average age is over four years (48 months), should have a higher failure rate. But, as we can see, the Backblaze 5.0 Vaults defy that expectation. 

To see if we can determine what’s going on, let’s expand on the previous table and dig into the different drive sizes that are in each Vault cohort, as shown in the table below.

A table showing Drive Stats by server version and drive size.

Observations for Each Vault Cohort

  • Backblaze 3.0: Obviously these Vaults have the oldest drives and, given their AFR is nearly twice the average for all of the drives (1.53%), it would make sense to migrate off of these servers. Of course the 6TB drives seem to be the exception, but at some point they will most likely “hit the wall” and start failing.
  • Backblaze 5.0: There are two Backblaze 5.0 drive sizes (4TB and 8TB) and the AFR for each is well below the average AFR for all of the drives (1.53%). The average age of the two drive sizes is nearly seven years or more. When compared to the Backblaze 6.0 Vaults, it would seem that migrating the 5.0 Vaults could wait, but there is an operational consideration here. The Backblaze 5.0 Vaults each contain 45 drives, and from the perspective of data density per system, they should be migrated to 60 drive servers sooner rather than later to optimize data center rack space.
  • Backblaze 6.0: These Vaults as a group don’t seem to make any of the five different drive sizes happy. Only the AFR of the 4TB drives (1.42%) is just barely below the average AFR for all of the drives. The rest of the drive groups are well above the average.
  • Backblaze 6.1: The 6.1 servers are similar to the 6.0 servers, but with an upgraded CPU and faster NIC cards. Is that why their annualized failure rates are much lower than the 6.0 systems? Maybe, but the drives in the 6.1 systems are also much younger, about half the age of those in the 6.0 systems, so we don’t have the full picture yet.
  • Dell: The 14TB drives in the Dell Vaults seem to be a problem at a 5.46% AFR. Much of that is driven by two particular Dell vaults which have a high AFR, over 8% for Q3. This appears to be related to their location in the data center. All 40 of the Dell servers which make up these two Vaults were relocated to the top of 52U racks, and it appears that initially they did not like their new location. Recent data indicates they are doing much better, and we’ll publish that data soon. We’ll need to see what happens over the next few quarters. That said, if you remove these two Vaults from the Dell tally, the AFR is a respectable 0.99% for the remaining Vaults.
  • Supermicro: This server cohort is mostly 16TB drives which are doing very well with an AFR of 0.62%. The one 14TB Vault is worth our attention with an AFR of 1.95%, and the 22TB Vault is too new to do any analysis.

Drive Stats by Drive Size and Vault Cohort

Another way to look at the data is to take the previous table and re-sort it by drive size. Before we do that let’s establish the AFR for the different drive sizes aggregated over all Vaults.

A bar chart showing annualized failure rates for Backblaze Vaults by drive size.

As we can see in Q3 the 6TB and 22TB Vaults had zero failures (AFR = 0%). Also, the 10TB Vault is indeed only one Vault, so there are no other 10TB Vaults to compare it to. Given this, for readability, we will remove the 6TB, 10TB, and 22TB Vaults from the next table which compares how each drive size has fared in each of the six different Vault cohorts.

A table showing the annualized failure rates of servers by drive size and server version, not displaying the 6TB, 10TB, and 22TB Vaults.

Currently we are migrating the 4TB drive Vaults to larger Vaults, replacing them with drives of 16TB and above. The migrations are done using an in-house system which we’ll expand upon in a future post. The specific order of migrations is based on failure rates and durability of the existing 4TB Vaults with an eye towards removing the Backblaze 3.0 systems first as they are nearly 10 years old in some cases, and many of the non-drive replacement parts are no longer available. Whether we give away, destroy, or recycle the retired Backblaze 3.0 Storage Pods (sans drives) is still being debated.

For the 8TB drive Vaults, the Backblaze 5.0 Vaults are up first for migration when the time comes. Yes, their AFR is lower then the Backblaze 6.0 Vaults, but remember: the 5.0 Vaults are 45 drive units which are not as efficient storage density-wise versus the 60 drive systems. 

Speaking of systems with less than 60 drives, the Dell servers are 26 drives. Those 26 drives are in a 2U chassis versus a 4U chassis for all of the other servers. The Dell servers are not quite as dense as the 60 drive units, but their 2U form factor gives us some flexibility in filling racks, especially when you add utility servers (1U or 2U) and networking gear to the mix. That’s one of the reasons the two Dell Vaults we noted earlier were moved to the top of the 52U racks. FYI, those two Vaults hold 14TB drives and are two of the four 14TB Dell Vaults making up the 5.46% AFR. The AFR for the Dell Vaults with 12TB and 16TB drives is 0.76% and 0.92% respectively. As noted earlier, we expect the AFR for 14TB Dell Vaults to drop over the coming months.

What Have We Learned?

Our goal today was to see what we can learn about the drive failure rates of the storage servers we use in our data centers. All of our storage servers are grouped in operational systems we call Backblaze Vaults. There are six different cohorts of storage servers with each vault being composed of the same type of storage server, hence there are six types of vaults. 

As we dug into data, we found that the different cohorts of Vaults had different annualized failure rates. What we didn’t find was a correlation between the age of the drives used in the servers and the annualized failure rates of the different Vault cohorts. For example, the Backblaze 5.0 Vaults have a much lower AFR of 0.99%  versus the Backblaze 6.0 Vault AFR at 2.14%—even though the drives in the 5.0 Vaults are nearly twice as old on average than the drives in the 6.0 Vaults.

This suggests that while our initial foray into the annualized failure rates of the different Vault cohorts is a good first step, there is more to do here.

Where Do We Go From Here?

In general, all of the Vaults in a given cohort were manufactured to the same specifications, used the same parts, and were assembled using the same processes. One obvious difference is that different drive models are used in each Vault cohort. For example, the 16TB vaults are composed of seven different drive models. Do some drive models work better in one Vault cohort versus another? Over the next couple of quarters we’ll dig into the data and let you know what we find. Hopefully it will add to our understanding of the annualized failures rates of the different Vault cohorts. Stay tuned.

The post The Drive Stats of Backblaze Storage Pods appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q3 2023

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2023/

A decorative image showing the title Q3 2023 Drive Stats.

At the end of Q3 2023, Backblaze was monitoring 263,992 hard disk drives (HDDs) and solid state drives (SSDs) in our data centers around the world. Of that number, 4,459 are boot drives, with 3,242 being SSDs and 1,217 being HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2023 Drive Stats review.

That leaves us with 259,533 HDDs that we’ll focus on in this report. We’ll review the quarterly and lifetime failure rates of the data drives as of the end of Q3 2023. Along the way, we’ll share our observations and insights on the data presented, and, for the first time ever, we’ll reveal the drive failure rates broken down by data center.

Q3 2023 Hard Drive Failure Rates

At the end of Q3 2023, we were managing 259,533 hard drives used to store data. For our review, we removed 449 drives from consideration as they were used for testing purposes, or were drive models which did not have at least 60 drives. This leaves us with 259,084 hard drives grouped into 32 different models. 

The table below reviews the annualized failure rate (AFR) for those drive models for the Q3 2023 time period.

A table showing the quarterly annualized failure rates of Backblaze hard drives.

Notes and Observations on the Q3 2023 Drive Stats

  • The 22TB drives are here: At the bottom of the list you’ll see the WDC 22TB drives (model: WUH722222ALE6L4). A Backblaze Vault of 1,200 drives (plus four) is now operational. The 1,200 drives were installed on September 29, so they only have one day of service each in this report, but zero failures so far.
  • The old get bolder: At the other end of the time-in-service spectrum are the 6TB Seagate drives (model: ST6000DX000) with an average of 101 months in operation. This cohort had zero failures in Q3 2023 with 883 drives and a lifetime AFR of 0.88%.
  • Zero failures: In Q3, six different drive models managed to have zero drive failures during the quarter. But only the 6TB Seagate, noted above, had over 50,000 drive days, our minimum standard for ensuring we have enough data to make the AFR plausible.
  • One failure: There were four drive models with one failure during Q3. After applying the 50,000 drive day metric, two drives stood out:
    1. WDC 16TB (model: WUH721816ALE6L0) with a 0.15% AFR.
    2. Toshiba 14TB (model: MG07ACA14TEY) with a 0.63% AFR.

The Quarterly AFR Drops

In Q3 2023, quarterly AFR for all drives was 1.47%. That was down from 2.2% in Q2 and also down from 1.65% a year ago. The quarterly AFR is based on just the data in that quarter, so it can often fluctuate from quarter to quarter. 

In our Q2 2023 report, we suspected the 2.2% for the quarter was due to the overall aging of the drive fleet and in particular we pointed a finger at specific 8TB, 10TB, and 12TB drive models as potential culprits driving the increase. That prediction fell flat in Q3 as nearly two-thirds of drive models experienced a decreased AFR quarter over quarter from Q2 and any increases were minimal. This included our suspect 8TB, 10TB, and 12TB drive models. 

It seems Q2 was an anomaly, but there was one big difference in Q3: we retired 4,585 aging 4TB drives. The average age of the retired drives was just over eight years, and while that was a good start, there’s another 28,963 4TB drives to go. To facilitate the continuous retirement of aging drives and make the data migration process easy and safe we use CVT, our awesome in-house data migration software which we’ll cover at another time.

A Hot Summer and the Drive Stats Data

As anyone should in our business, Backblaze continuously monitors our systems and drives. So, it was of little surprise to us when the folks at NASA confirmed the summer of 2023 as Earth’s hottest on record. The effects of this record-breaking summer showed up in our monitoring systems in the form of drive temperature alerts. A given drive in a storage server can heat up for many reasons: it is failing; a fan in the storage server has failed; other components are producing additional heat; the air flow is somehow restricted; and so on. Add in the fact that the ambient temperature within a data center often increases during the summer months, and you can get more temperature alerts.

In reviewing the temperature data for our drives in Q3, we noticed that a small number of drives exceeded the maximum manufacturer’s temperature for at least one day. The maximum temperature for most drives is 60°C, except for the 12TB, 14TB, and 16TB Toshiba drives which have a maximum temperature of 55°C. Of the 259,533 data drives in operation in Q3, there were 354 individual drives (0.0013%) that exceeded their maximum manufacturer temperature. Of those only two drives failed, leaving 352 drives which were still operational as of the end of Q3.

While temperature fluctuation is part of running data centers and temp alerts like these aren’t unheard of, our data center teams are looking into the root causes to ensure we’re prepared for the inevitability of increasingly hot summers to come.

Will the Temperature Alerts Affect Drive Stats?

The two drives which exceeded their maximum temperature and failed in Q3 have been removed from the Q3 AFR calculations. Both drives were 4TB Seagate drives (model: ST4000DM000). Given that the remaining 352 drives which exceeded their temperature maximum did not fail in Q3, we have left them in the Drive Stats calculations for Q3 as they did not increase the computed failure rates.

Beginning in Q4, we will remove the 352 drives from the regular Drive Stats AFR calculations and create a separate cohort of drives to track that we’ll name Hot Drives. This will allow us to track the drives which exceeded their maximum temperature and compare their failure rates to those drives which operated within the manufacturer’s specifications. While there are a limited number of drives in the Hot Drives cohort, it could give us some insight into whether drives being exposed to high temperatures could cause a drive to fail more often. This heightened level of monitoring will identify any increase in drive failures so that they can be detected and dealt with expeditiously.

New Drive Stats Data Fields in Q3

In Q2 2023, we introduced three new data fields that we started populating in the Drive Stats data we publish: vault_id, pod_id, and is_legacy_format. In Q3, we are adding three more fields into each drive records as follows:

  • datacenter: The Backblaze data center where the drive is installed, currently one of these values: ams5, iad1, phx1, sac0, and sac2.
  • cluster_id: The name of a given collection of storage servers logically grouped together to optimize system performance. Note: At this time the cluster_id is not always correct, we are working on fixing that. 
  • pod_slot_num: The physical location of a drive within a storage server. The specific slot differs based on the storage server type and capacity: Backblaze (45 drives), Backblaze (60 drives), Dell (26 drives), or Supermicro (60 drives). We’ll dig into these differences in another post.

With these additions, the new schema beginning in Q3 2023 is:

  • date
  • serial_number
  • model
  • capacity_bytes
  • failure
  • datacenter (Q3)
  • cluster_id (Q3)
  • vault_id (Q2)
  • pod_id (Q2)
  • pod_slot_num (Q3)
  • is_legacy_format (Q2)
  • smart_1_normalized
  • smart_1_raw
  • The remaining SMART value pairs (as reported by each drive model)

Beginning in Q3, these data data fields have been added to the publicly available Drive Stats files that we publish each quarter. 

Failure Rates by Data Center

Now that we have the data center for each drive we can compute the AFRs for the drives in each data center. Below you’ll find the AFR for each of five data centers for Q3 2023.

A chart showing Backblaze annualized failure rates by data center.

Notes and Observations

  • Null?: The drives which reported a null or blank value for their data center are grouped in four Backblaze vaults. David, the Senior Infrastructure Software Engineer for Drive Stats, described the process of how we gather all the parts of the Drive Stats data each day. The TL:DR is that vaults can be too busy to respond at the moment we ask, and since the data center field is nice-to-have data, we get a blank field. We can go back a day or two to find the data center value, which we will do in the future when we report this data.
  • sac0?: sac0 has the highest AFR of all of the data centers, but it also has the oldest drives—nearly twice as old, on average, versus the next closest in data center, sac2. As discussed previously, drive failures do seem to follow the “bathtub curve”, although recently we’ve seen the curve start out flatter. Regardless, as drive models age, they do generally fail more often. Another factor could be that sac0, and to a lesser extent sac2, has some of the oldest Storage Pods, including a handful of 45-drive units. We are in the process of using CVT to replace these older servers while migrating from 4TB to 16TB and larger drives.
  • iad1: The iad data center is the foundation of our eastern region and has been growing rapidly since coming online about a year ago. The growth is a combination of new data and customers using our cloud replication capability to automatically make a copy of their data in another region.
  • Q3 Data: This chart is for Q3 data only and includes all the data drives, including those with less than 60 drives per model. As we track this data over the coming quarters, we hope to get some insight into whether different data centers really have different drive failure rates, and, if so, why.

Lifetime Hard Drive Failure Rates

As of September 30, 2023, we were tracking 259,084 hard drives used to store customer data. For our lifetime analysis, we collect the number of drive days and the number of drive failures for each drive beginning from the time a drive was placed into production in one of our data centers. We group these drives by model, then sum up the drive days and failures for each model over their lifetime. That chart is below. 

A chart showing Backblaze lifetime hard drive failure rates.

One of the most important columns on this chart is the confidence interval, which is the difference between the low and high AFR confidence levels calculated at 95%. The lower the value, the more certain we are of the AFR stated. We like a confidence interval to be 0.5% or less. When the confidence interval is higher, that is not necessarily bad, it just means we either need more data or the data is somewhat inconsistent. 

The table below contains just those drive models which have a confidence interval of less than 0.5%. We have sorted the list by drive size and then by AFR.

A chart showing Backblaze hard drive annualized failure rates with a confidence interval of less than 0.5%.

The 4TB, 6TB, 8TB, and some of the 12TB drive models are no longer in production. The HGST 12TB models in particular can still be found, but they have been relabeled as Western Digital and given alternate model numbers. Whether they have materially changed internally is not known, at least to us.

One final note about the lifetime AFR data: you might have noticed the AFR for all of the drives hasn’t changed much from quarter to quarter. It has vacillated between 1.39% to 1.45% percent for the last two years. Basically, we have lots of drives with lots of time-in-service so it is hard to move the needle up or down. While the lifetime stats for individual drive models can be very useful, the lifetime AFR for all drives will probably get less and less interesting as we add more and more drives. Of course, a few hundred thousand drives that never fail could arrive, so we will continue to calculate and present the lifetime AFR.

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Stats Data webpage. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free. 

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q3 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The SSD Edition: 2023 Drive Stats Mid-Year Review

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/ssd-edition-2023-mid-year-drive-stats-review/

A decorative image displaying the title 2023 Mid-Year Report Drive Stats SSD Edition.

Welcome to the 2023 Mid-Year SSD Edition of the Backblaze Drive Stats review. This report is based on data from the solid state drives (SSDs) we use as storage server boot drives on our Backblaze Cloud Storage platform. In this environment, the drives do much more than boot the storage servers. They also store log files and temporary files produced by the storage server. Each day a boot drive will read, write, and delete files depending on the activity of the storage server itself.

We will review the quarterly and lifetime failure rates for these drives, and along the way we’ll offer observations and insights to the data presented. In addition, we’ll take a first look at the average age at which our SSDs fail, and examine how well SSD failure rates fit the ubiquitous bathtub curve.

Mid-Year SSD Results by Quarter

As of June 30, 2023, there were 3,144 SSDs in our storage servers. This compares to 2,558 SSDs we reported in our 2022 SSD annual report. We’ll start by presenting and discussing the quarterly data from each of the last two quarters (Q1 2022 and Q2 2023).

Notes and Observations

Data is by quarter: The data used in each table is specific to that quarter. That is, the number of drive failures and drive days are inclusive of the specified quarter, Q1 or Q2. The drive counts are as of the last day of each quarter.

Drives added: Since our last SSD report, ending in Q4 2022, we added 238 SSD drives to our collection. Of that total, the Crucial (model: CT250MX500SSD1) led the way with 110 new drives added, followed by 62 new WDC drives (model: WD Blue SA510 2.5) and 44 Seagate drives (model: ZA250NM1000).

Really high annualized failure rates (AFR): Some of the failure rates, that is AFR, seem crazy high. How could the Seagate model SSDSCKKB240GZR have an annualized failure rate over 800%? In that case, in Q1, we started with two drives and one failed shortly after being installed. Hence, the high AFR. In Q2, the remaining drive did not fail and the AFR was 0%. Which AFR is useful? In this case neither, we just don’t have enough data to get decent results. For any given drive model, we like to see at least 100 drives and 10,000 drive days in a given quarter as a minimum before we begin to consider the calculated AFR to be “reasonable.” We include all of the drive models for completeness, so keep an eye on drive count and drive days before you look at the AFR with a critical eye.

Quarterly Annualized Failures Rates Over Time

The data in any given quarter can be volatile with factors like drive age and the randomness of failures factoring in to skew the AFR up or down. For Q1, the AFR was 0.96% and, for Q2, the AFR was 1.05%. The chart below shows how these quarterly failure rates relate to previous quarters over the last three years.

As you can see, the AFR fluctuates between 0.36% and 1.72%, so what’s the value of quarterly rates? Well, they are useful as the proverbial canary in a coal mine. For example, the AFR in Q1 2021 (0.58%) jumped 1.51% in Q2 2021, then to 1.72% in Q3 2021. A subsequent investigation showed one drive model was the primary cause of the rise and that model was removed from service. 

It happens from time to time that a given drive model is not compatible with our environment, and we will moderate or even remove that drive’s effect on the system as a whole. While not as critical as data drives in managing our system’s durability, we still need to keep boot drives in operation to collect the drive/server/vault data they capture each day. 

How Backblaze Uses the Data Internally

As you’ve seen in our SSD and HDD Drive Stats reports, we produce quarterly, annual, and lifetime charts and tables based on the data we collect. What you don’t see is that every day we produce similar charts and tables for internal consumption. While typically we produce one chart for each drive model, in the example below we’ve combined several SSD models into one chart. 

The “Recent” period we use internally is 60 days. This differs from our public facing reports which are quarterly. In either case, charts like the one above allow us to quickly see trends requiring further investigation. For example, in our chart above, the recent results of the Micron SSDs indicate a deeper dive into the data behind the charts might be necessary.

By collecting, storing, and constantly analyzing the Drive Stats data we can be proactive in maintaining our durability and availability goals. Without our Drive Stats data, we would be inclined to over-provision our systems as we would be blind to the randomness of drive failures which would directly impact those goals.

A First Look at More SSD Stats

Over the years in our quarterly Hard Drive Stats reports, we’ve examined additional metrics beyond quarterly and lifetime failure rates. Many of these metrics can be applied to SSDs as well. Below we’ll take a first look at two of these: the average age of failure for SSDs and how well SSD failures correspond to the bathtub curve. In both cases, the datasets are small, but are a good starting point as the number of SSDs we monitor continues to increase.

The Average Age of Failure for SSDs

Previously, we calculated the average age at which a hard drive in our system fails. In our initial calculations that turned out to be about two years and seven months. That was a good baseline, but further analysis was required as many of the drive models used in the calculations were still in service and hence some number of them could fail, potentially affecting the average.

We are going to apply the same calculations to our collection of failed SSDs and establish a baseline we can work from going forward. Our first step was to determine the SMART_9_RAW value (power-on-hours or POH) for the 63 failed SSD drives we have to date. That’s not a great dataset size, but it gave us a starting point. Once we collected that information, we computed that the average age of failure for our collection of failed SSDs is 14 months. Given that the average age of the entire fleet of our SSDs is just 25 months, what should we expect to happen as the average age of the SSDs still in operation increases? The table below looks at three drive models which have a reasonable amount of data.

    Good Drives Failed Drives
MFG Model Count Avg Age Count Avg Age
Crucial CT250MX500SSD1 598 11 months 9 7 months
Seagate ZA250CM10003 1,114 28 months 14 11 months
Seagate ZA250CM10002 547 40 months 17 25 months

As we can see in the table, the average age of the failed drives increases as the average age of drives in operation (good drives) increases. In other words, it is reasonable to expect that the average age of SSD failures will increase as the entire fleet gets older.

Is There a Bathtub Curve for SSD Failures?

Previously we’ve graphed our hard drive failures over time to determine their fit to the classic bathtub curve used in reliability engineering. Below, we used our SSD data to determine how well our SSD failures fit the bathtub curve.

While the actual curve (blue line) produced by the SSD failures over each quarter is a bit “lumpy”, the trend line (second order polynomial) does have a definite bathtub curve look to it. The trend line is about a 70% match to the data, so we can’t be too confident of the curve at this point, but for the limited amount of data we have, it is surprising to see how the occurrences of SSD failures are on a path to conform to the tried-and-true bathtub curve.

SSD Lifetime Annualized Failure Rates

As of June 30, 2023, there were 3,144 SSDs in our storage servers. The table below is based on the lifetime data for the drive models which were active as of the end of Q2 2023.

Notes and Observations

Lifetime AFR: The lifetime data is cumulative from Q4 2018 through Q2 2023. For this period, the lifetime AFR for all of our SSDs was 0.90%. That was up slightly from 0.89% at the end of Q4 2022, but down from a year ago, Q2 2022, at 1.08%.

High failure rates?: As we noted with the quarterly stats, we like to have at least 100 drives and over 10,000 drive days to give us some level of confidence in the AFR numbers. If we apply that metric to our lifetime data, we get the following table.

Applying our modest criteria to the list eliminated those drive models with crazy high failure rates. This is not a statistics trick; we just removed those models which did not have enough data to make the calculated AFR reliable. It is possible the drive models we removed will continue to have high failure rates. It is also just as likely their failure rates will fall into a more normal range. If this technique seems a bit blunt to you, then confidence intervals may be what you are looking for.

Confidence intervals: In general, the more data you have and the more consistent that data is, the more confident you are in the predictions based on that data. We calculate confidence intervals at 95% certainty. 

For SSDs, we like to see a confidence interval of 1.0% or less between the low and the high values before we are comfortable with the calculated AFR. If we apply this metric to our lifetime SSD data we get the following table.

This doesn’t mean the failure rates for the drive models with a confidence interval greater than 1.0% are wrong; it just means we’d like to get more data to be sure. 

Regardless of the technique you use, both are meant to help clarify the data presented in the tables throughout this report.

The SSD Stats Data

The data collected and analyzed for this review is available on our Drive Stats Data page. You’ll find SSD and HDD data in the same files and you’ll have to use the model number to locate the drives you want, as there is no field to designate a drive as SSD or HDD. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone—it is free.

Good luck and let us know if you find anything interesting.

The post The SSD Edition: 2023 Drive Stats Mid-Year Review appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

SSD 101: How to Upgrade Your Computer With an SSD

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/ssd-upgrade-guide/

A decorative image showing an a hard drive and a solid state drive.
Editor’s note: Since it was published in 2019, this post has been updated in 2021 and 2023 with the latest information to help you take advantage of SSDs.

Solid-state drives (SSDs) have become the norm for most laptops and desktops, replacing the older hard disk drives (HDDs) that had been in use for decades previously. If your computer still relies on an HDD, it might be time to consider upgrading to an SSD for improved performance.

Upgrading to an SSD can give your computer a significant speed and responsiveness boost, especially if your machine is more than a few years old. However, before taking the plunge, it’s essential to weigh practical considerations. Let’s take a closer look at SSDs and the factors you should consider.

What Is an SSD?

An SSD is a type of data storage device used in computers and other electronic devices. Unlike traditional HDDs, which use spinning disks and mechanical read/write heads to store and retrieve data, SSDs rely on NAND-based flash memory to store information. This flash memory is similar to the kind used in USB drives and memory cards, but it’s optimized for higher performance and reliability.

Refresher: What Is NAND?

NAND stands for “Not And.” It’s a type of logic gate used in digital circuits, specifically in memory and storage devices. In the context of NAND-based flash memory used in SSDs, the term NAND refers to the electronic structure of the memory cells that store data. The name NAND comes from its logical operation, which is the complement of the AND operation. NAND flash memory is a type of non-volatile storage, meaning it retains data even when the power is turned off, which makes it well-suited for use with things like SSDs and other data storage devices. That’s different from the regular RAM in your computer, which is reset when you turn off or restart the computer.

Compared to HDDs, SSDs are more shock resistant (due to their lack of moving parts) and are less likely to be affected by magnetic fields. They also offer faster data access times, quicker boot-up and application load times, and better overall responsiveness. 

A photo of the internal hardware of a 2.5"SSD. Captions indicate where the cache, controller, and memory are, and that it is shock resistant up to 1500g.

For more about the differences between HDDs and SSDs, check out Hard Disk Drive vs. Solid State Drive: What’s the Diff? or our two-part series, HDD vs. SSD: What Does the Future for Storage Hold?.

Why Upgrade to an SSD?

Because of their speed and efficiency, SSDs have become the preferred choice for many computing applications, ranging from laptops and desktops to servers and data centers. They are especially useful in situations where speed and reliability are crucial, such as in gaming, content creation, and tasks involving large data transfers. Despite typically offering less storage capacity compared to HDDs of similar cost, SSD performance benefits often outweigh the storage trade-off, making them a popular choice.

Depending on the task at hand, SSDs can be up to 10 times faster than their HDD counterparts. Replacing your hard drive with an SSD is one of the best things you can do to dramatically improve the performance of your older computer.

A photo of a Samsung 2.5" SSD.
Samsung 870 QVO SATA III 2.5″ SSD 1TB.

Without any moving parts, SSDs operate more quietly, more efficiently, and with fewer breakable things than hard drives that have spinning platters. Read and write speeds for SSDs are much better than hard drives, resulting in noticeably faster operations.

For you, that means less time waiting for stuff to happen. An SSD is worth looking into if you’re frequently seeing a spinning wheel cursor on your computer screen. Modern operating systems rely more on virtual memory management, utilizing temporary swap files that are written to the disk. A faster SSD minimizes the performance impact caused by this process.

If you have just one drive in your laptop or desktop, you could replace an HDD or small SSD with a 1TB SSD for less than $40. For those dealing with substantial amounts of data, concentrating on replacing the drive that houses your operating system and applications can yield a significant speed boost. Put your working data on additional internal or external hard drives, and you’re ready to tackle a mountain of photos, videos, or supersized databases. Just be sure to implement a backup plan to make sure you keep a copy of that data safe on additional local drives, network attached drives, or in the cloud.

Are There Any Reasons Not to Upgrade to an SSD?

If SSDs are so much better than hard drives, why aren’t all drives SSDs? The two biggest reasons are cost and capacity. SSDs are more expensive than hard drives. A 1TB SSD or HDD now cost about the same, $30–$50, with HDDs being slightly less, maybe around $25. 

That’s not much of a difference, but as drive capacity gets larger, the cost differential gets increasingly larger. For example, an 8TB HDD drive runs $120–$180, while 8TB SSDs start at around $350. In short, while upgrading the 1TB internal hard drive on your computer to an SSD is cost effective, the same may not be true for replacing larger capacity drives, like those used in external drives, unless the increased speed is worth the increased cost.

Whether your computer can use an SSD is another question. It all depends on the computer’s age and how it was designed. Let’s take a look at that question next.

How Do You Upgrade to an SSD?

Does your computer use a regular off-the-shelf SATA HDD? If so, you can upgrade it with an SSD. 

SSDs are compatible with both Macs and PCs. All current Mac laptops come with SSDs. Both iMacs and Mac Pros come with SSDs as well. Around 2010, Apple started moving to only SSD storage on most of its devices. That said, some Mac desktop computers continued to offer the option of both SSD and HDD storage until 2020, a setup they called a Fusion Drive

Note that as of November 2021, Apple does not offer any Macs with a Fusion Drive. Basically, if you bought your device before 2010 or you have a desktop computer from 2021 or earlier, there’s a chance you may be using an HDD.

Determine Your Disk Type in a Mac

To determine what kind of drive your Mac uses, click on the Apple menu and select About This Mac. 

Avoid the pitfall of selecting the Storage tab in the top menu. What you’ll find is that the default name of your drive is “Macintosh HD” which is confusing, given that they’re referring to the internal storage of the computer as a hard drive when (in most cases), your drive is an SSD. While you can find information about your drive on this screen, we prefer the method that provides maximum clarity. 

So, on the Overview screen, click System Report. Bonus: You’ll also see what type of processor you have and your macOS version (which will be useful later).

A screenshot of the about this Mac overview tab.

Once there, select the Storage tab, then the volume name you want to identify. You should see a line called Medium Type, which will tell you what kind of drive you have. 

A screenshot of the storage tab under the Mac System Report screen.

Determine Your Disk Type in a PC

To determine your disk type in a Windows PC, first open the Task Manager in Windows:

  1. Right-click the Start button and click Run. In the Run Command window, type dfrgui and click OK.
A screenshot of the run screen in a Windows computer.
  1. On the next screen, the type of drive will be listed under the Media Type column.
A screenshot of a Windows computer Optimize Drives window.

Can I Upgrade to a Better SSD?

Even if your computer already has an SSD, you may be able to upgrade it with a larger, faster SSD model. Besides SATA-based hard drive replacements, some later model PCs can be upgraded with M.2 SSDs, which look more like RAM chips than hard drives. 

Some Apple laptops made before 2016 that already shipped with SSDs can be upgraded with larger ones. However, you will need to upgrade to a Mac-specific SSD. Check Other World Computing and Transcend to find ones designed to work. Apple laptop models made after 2016 have SSDs soldered to the motherboard, so you’re stuck with what you have.

A photo of an M.2 SSD.
M.2 SSD.

How to Install an SSD

If you’re comfortable tinkering with your computer’s guts, upgrading it with an SSD is a pretty common do-it-yourself project. Many companies offer hassle-free plug-and-play SSD replacements. Check out Amazon or NewEgg and you’ll have an embarrassment of riches. The choice is yours: Samsung, SanDisk, Crucial, and Toshiba are all popular SSD makers. There are many others, too.

However, if computer hardware isn’t your forte, it might not be worth the effort to learn from scratch. SSD upgrades are such a common aftermarket improvement most independent computer repair and service specialists will take on the task if you’re willing to pay them. Some throw in a data transfer if you’re lucky, or a skilled negotiator. Ask your friends and colleagues for recommendations. You can also hit up services like Angi to find someone.

If you are DIY inclined, YouTube has tons of walkthroughs like this one for desktop PCs, this one for laptops, and this one aimed at Mac users.

A photo of an HDD/SSD ot 3.5" drive bay adapter.
HDD/SSD to 3.5″ drive bay adapter.

Many SSDs replace 2.5 inch HDDs. Those are the same drives you find in laptop computers and even small desktop models. Have a desktop computer that uses a 3.5 inch hard drive? You may need to use a 2.5 inch to 3.5 inch mounting adapter.

A Word on SSD Compatibility

Beyond the drive size, it’s a good idea to check to see if the SSD you want to buy is compatible with your laptop or desktop, especially if your system is older than a couple of years. Here are articles from Tom’s Hardware and ShareUs which can help with that.

How to Migrate to an SSD

Buying a replacement SSD is the first step. Moving your data onto the SSD is the next step. To achieve this, you need two essential components: cloning software and an external drive case, sled, or enclosure. These tools enable you to connect your SSD to your computer through its USB port or another data transfer interface.

Cloning software creates an exact replica of your internal hard drive’s data. Once this data is successfully migrated to the SSD, you can then insert the new drive into your computer. I prefer to clone a hard drive onto an SSD whenever possible. When executed correctly, a cloned SSD retains its bootable capabilities, providing a true plug-and-play experience. Just copying files between the two drives instead may not copy all the data you need to get the computer to boot with the new drive.

How to Clone a Hard Drive to an SSD

When you buy a new SSD or even a fresh hard drive, it’s unlikely that the operating system you need will be pre-installed. Cloning your existing hard drive fixes that. However, there are instances where this may not be feasible. For example, maybe you’ve installed the SSD in a computer that previously had a bad hard drive. If so, you can do what’s called a clean install and start fresh. Different operating system providers offer distinct guidelines for this procedure. Here’s a link to Microsoft’s clean install procedure, and Apple’s clean install instructions.

As we said at the outset, SSDs tend to come at a higher cost per gigabyte compared to traditional hard drives. You may not be able to afford as large an SSD as your current drive, so make sure your data will fit on your new drive. If it won’t, you might have to pare down first. Additionally, it’s wise to leave some room for expansion. The last thing you want to do is immediately max out your new, fast drive.

Now that you’ve successfully cloned your drive and integrated the SSD into your system, what do you do with the old drive? If it’s still functional, repurposing the external drive chassis utilized during migration is a practical option. It can continue to serve as a standalone external drive or become part of a disk array, such as a network attached storage (NAS) device. You can use it for local back up—something we strongly recommend doing—in addition to using cloud back up like Backblaze. Or, just use it for extra storage needs, like for your photos or music.

Make Sure to Back Up

SSD upgrades are commonplace, but that doesn’t mean things don’t go wrong that can stop you dead in your tracks. If your computer is working fine before the SSD upgrade, make sure you have a complete backup of your computer to restore from in the event something goes wrong.

More Questions About SSDs?

You might enjoy reading other posts in our SSD 101 series.

The post SSD 101: How to Upgrade Your Computer With an SSD appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q2 2023

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2023/

A decorative image with title Q2 2023 Drive Stats.

At the end of Q2 2023, Backblaze was monitoring 245,757 hard drives and SSDs in our data centers around the world. Of that number, 4,460 are boot drives, with 3,144 being SSDs and 1,316 being HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2022 Drive Stats review.

Today, we’ll focus on the 241,297 data drives under management as we review their quarterly and lifetime failure rates as of the end of Q2 2023. Along the way, we’ll share our observations and insights on the data presented, tell you about some additional data fields we are now including and more.

Q2 2023 Hard Drive Failure Rates

At the end of Q2 2023, we were managing 241,297 hard drives used to store data. For our review, we removed 357 drives from consideration as they were used for testing purposes or drive models which did not have at least 60 drives. This leaves us with 240,940 hard drives grouped into 31 different models. The table below reviews the annualized failure rate (AFR) for those drive models for Q2 2023.

Notes and Observations on the Q2 2023 Drive Stats

  • Zero Failures: There were six drive models with zero failures in Q2 2023 as shown in the table below.

The table is sorted by the number of drive days each model accumulated during the quarter. In general a drive model should have at least 50,000 drive days in the quarter to be statistically relevant. The top three drives all meet that criteria, and having zero failures in a quarter is not surprising given the lifetime AFR for the three drives ranges from 0.13% to 0.45%. None of the bottom three drives has accumulated 50,000 drive days in the quarter, but the two Seagate drives are off to a good start. And, it is always good to see the 4TB Toshiba (model: MD04ABA400V), with eight plus years of service, post zero failures for the quarter.

  • The Oldest Drive? The drive model with the oldest average age is still the 6TB Seagate (model: ST6000DX000) at 98.3 months (8.2 years), with the oldest drive of this cohort being 104 months (8.7 years) old.

    The oldest operational data drive in the fleet is a 4TB Seagate (model: ST4000DM000) at 105.2 months (8.8 years). That is quite impressive, especially in a data center environment, but the winner for the oldest operational drive in our fleet is actually a boot drive: a WDC 500GB drive (model: WD5000BPKT) with 122 months (10.2 years) of continuous service.

  • Upward AFR: The AFR for Q2 2023 was 2.28%, up from 1.54% in Q1 2023. While quarterly AFR numbers can be volatile, they can also be useful in identifying trends which need further investigation. In this case, the rise was expected as the age of our fleet continues to increase. But was that the real reason?

    Digging in, we start with the annualized failure rates and average age of our drives grouped by drive size, as shown in the table below.

For our purpose, we’ll define a drive as old when it is five years old or more. Why? That’s the warranty period of the drives we are purchasing today. Of course, the 4TB and 6TB drives, and some of the 8TB drives, came with only two year warranties, but for consistency we’ll stick with five years as the point at which we label a drive as “old”. 

Using our definition for old drives eliminates the 12TB, 14TB and 16TB drives. This leaves us with the chart below of the Quarterly AFR over the last three years for each cohort of older drives, the 4TB, 6TB, 8TB, and 10TB models.

Interestingly, the oldest drives, the 4TB and 6TB drives, are holding their own. Yes, there has been an increase over the last year or so, but given their age, they are doing well.

On the other hand, the 8TB and 10TB drives, with an average of five and six years of service respectively, require further attention. We’ll look at the lifetime data later on in this report to see if our conclusions are justified.

What’s New in the Drive Stats Data?

For the past 10 years, we’ve been capturing and storing the drive stats data and since 2015 we’ve open sourced the data files that we used to create the Drive Stats reports. From time to time, new SMART attribute pairs have been added to the schema as we install new drive models which report new sets of SMART attributes. This quarter we decided to capture and store some additional data fields about the drives and the environment they operate in, and we’ve added them to the publicly available Drive Stats files that we publish each quarter. 

The New Data Fields

Beginning with the Q2 2023 Drive Stats data, there are three new data fields populated in each drive record.

  1. Vault_id: All data drives are members of a Backblaze Vault. Each vault consists of either 900 or 1,200 hard drives divided evenly across 20 storage servers.  The vault is a numeric value starting at 1,000.
  2. Pod_id: There are 20 storage servers in each Backblaze Vault. The Pod_id is a numeric field with values from 0 to 19 assigned to one of the 20 storage servers.
  3. Is_legacy_format: Currently 0, but will be useful over the coming quarters as more fields are added.

The new schema is as follows:

  • date
  • serial_number
  • model
  • capacity_bytes
  • failure
  • vault_id
  • pod_id
  • is_legacy_format
  • smart_1_normalized
  • smart_1_raw
  • Remaining SMART value pairs (as reported by each drive model)

Occasionally, our readers would ask if we had any additional information we could provide with regards to where a drive lived, and, more importantly, where it died. The newly-added data fields above are part of the internal drive data we collect each day, but they were not included in the Drive Stats data that we use to create the Drive Stats reports. With the help of David from our Infrastructure Software team, these fields will now be available in the Drive Stats data.

How Can We Use the Vault and Pod Information?

First a caveat: We have exactly one quarter’s worth of this new data. While it was tempting to create charts and tables, we want to see a couple of quarters worth of data to understand it better. Look for an initial analysis later on in the year.

That said, what this data gives us is the storage server and the vault of every drive. Working backwards, we should be able to ask questions like: “Are certain storage servers more prone to drive failure?” or, “Do certain drive models work better or worse in certain storage servers?” In addition, we hope to add data elements like storage server type and data center to the mix in order to provide additional insights into our multi-exabyte cloud storage platform.

Over the years, we have leveraged our Drive Stats data internally to improve our operational efficiency and durability. Providing these new data elements to everyone via our Drive Stats reports and data downloads is just the right thing to do.

There’s a New Drive in Town

If you do decide to download our Drive Stats data for Q2 2023, there’s a surprise inside—a new drive model. There are only four of these drives, so they’d be easy to miss, and they are not listed on any of the tables and charts we publish as they are considered “test” drives at the moment. But, if you are looking at the data, search for model “WDC WUH722222ALE6L4” and you’ll find our newly installed 22TB WDC drives. They went into testing in late Q2 and are being put through their paces as we speak. Stay tuned. (Psst, as of 7/28, none had failed.)

Lifetime Hard Drive Failure Rates

As of June 30, 2023, we were tracking 241,297 hard drives used to store customer data. For our lifetime analysis, we removed 357 drives that were only used for testing purposes or did not have at least 60 drives represented in the full dataset. This leaves us with 240,940 hard drives grouped into 31 different models to analyze for the lifetime table below.

Notes and Observations About the Lifetime Stats

The Lifetime AFR also rises. The lifetime annualized failure rate for all the drives listed above is 1.45%. That is an increase of 0.05% from the previous quarter of 1.40%. Earlier in this report by examining the Q2 2023 data, we identified the 8TB and 10TB drives as primary suspects in the increasing rate. Let’s see if we can confirm that by examining the change in the lifetime AFR rates of the different drives grouped by size.

The red line is our baseline as it is the difference from Q1 to Q2 (0.05%) of the lifetime AFR for all drives. Drives above the red line support the increase, drives below the line subtract from the increase. The primary drives (by size) which are “driving” the increased lifetime annualized failure rate are the 8TB and 10TB drives. This confirms what we found earlier. Given there are relatively few 10TB drives (1,124) versus 8TB drives (24,891), let’s dig deeper into the 8TB drives models.

The Lifetime AFR for all 8TB drives jumped from 1.42% in Q1 to 1.59% in Q2.  An increase of 12%. There are six 8TB drive models in operation, but three of these models comprise 99.5% of the drive failures for the 8TB drive cohort, so we’ll focus on them. They are listed below.

For all three models, the increase of the lifetime annualized failure rate from Q1 to Q2 is 10% or more which is statistically similar to the 12% increase for all of the 8TB drive models. If you had to select one drive model to focus on for migration, any of the three would be a good candidate. But, the Seagate drives, model ST8000DM002, are on average nearly a year older than the other drive models in question.

  • Not quite a lifetime? The table above analyzes data for the period of April 20, 2013 through June 30, 2023, or 10 years, 2 months and 10 days. As noted earlier, the oldest drive we have is 10 years and 2 months old, give or take a day or two. It would seem we need to change our table header, but not quite yet. A drive that was installed anytime in Q2 2013 and is still operational today would report drive days as part of the lifetime data for that model. Once all the drives installed in Q2 2013 are gone, we can change the start date on our tables and charts accordingly.

A Word About Drive Failure

Are we worried about the increase in drive failure rates? Of course we’d like to see them lower, but the inescapable reality of the cloud storage business is that drives fail. Over the years, we have seen a wide range of failure rates across different manufacturers, drive models, and drive sizes. If you are not prepared for that, you will fail. As part of our preparation, we use our drive stats data as one of the many inputs into understanding our environment so we can adjust when and as we need.

So, are we worried about the increase in drive failure rates? No, but we are not arrogant either. We’ll continue to monitor our systems, take action where needed, and share what we can with you along the way. 

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Stats Data webpage. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

If you want the tables and charts used in this report, you can download the .zip file from Backblaze B2 Cloud Storage which contains an MS Excel spreadsheet with a tab for each of the tables or charts..

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q2 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.