Tag Archives: Featured

Storage Tech of the Future: Ceramics, DNA, and More

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/storage-tech-of-the-future-ceramics-dna-and-more/

A decorative image showing a data drive with a health monitor indicator running through and behind it.

Two announcements had the Backblaze #social Slack channel blowing up this week, both related to “Storage Technologies of the Future.” The first reported “Video of Ceramic Storage System Surfaces Online” like some kind of UFO sighting. The second, somewhat more restrained announcement heralded the release of DNA storage cards available to the general public. Yep, you heard that right—coming to a Best Buy near you. (Not really. You absolutely have to special order these babies, but they ARE going to be for sale.)

We talked about DNA storage way back in 2015. It’s been nine years, so we thought it was high time to revisit the tech and dig into ceramics as well. (Pun intended.) 

What Is DNA Storage?

The idea is elegant, really. What is DNA if not an organic, naturally occurring form of code?

DNA consists of four nucleotide bases: adenine (A), thymine (T), cytosine (C), and guanine (G). 

In DNA storage, information is encoded into sequences of these nucleotide bases. For example, A and C might represent 0, while T and G represent 1. This encoding allows digital data, such as text, images, or other types of information, to be translated into DNA sequences. Cool!
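To make the encoding concrete, here's a toy sketch of the bit-to-base scheme described above (real DNA storage codecs are far more sophisticated, adding error correction and avoiding problematic base runs; this just illustrates the mapping):

```python
def bits_to_dna(bits: str) -> str:
    """Encode a bit string as a DNA sequence (0 -> A, 1 -> T)."""
    return "".join("A" if b == "0" else "T" for b in bits)

def dna_to_bits(seq: str) -> str:
    """Decode a DNA sequence back to bits (A/C -> 0, T/G -> 1)."""
    return "".join("0" if base in "AC" else "1" for base in seq)

# The letter "B" as 8 bits, written as a strand and read back.
message = format(ord("B"), "08b")   # "01000010"
strand = bits_to_dna(message)
assert dna_to_bits(strand) == message
print(strand)  # ATAAAATA
```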

The appeal of DNA as a storage medium lies in its density and stability, as well as its ability to store vast amounts of information in a very compact space. It also boasts remarkable durability, with the ability to preserve information for thousands of years under suitable conditions. I mean, leave it to Mother Nature to put our silly little hard drives to shame.

Back in 2015, we shared that the storage density of DNA was about 2.2 petabytes per gram. In 2017, a study out of Columbia University and the New York Genome Center put it at an incredible 215 petabytes per gram. For comparison’s sake, a WDC 22TB drive (WDC WUH722222ALE6L4) that we currently use in our data centers is 1.5 pounds or 680 grams, which nets out at 0.032TB/gram or 0.000032PB/gram.
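The arithmetic behind that comparison is quick to check (the figures are the ones quoted above):

```python
# Back-of-the-envelope check of the density comparison.
dna_pb_per_gram = 215        # 2017 Columbia/NYGC estimate, PB/gram

drive_tb = 22                # WDC WUH722222ALE6L4 capacity
drive_grams = 680            # ~1.5 pounds

hdd_tb_per_gram = drive_tb / drive_grams
print(f"HDD: {hdd_tb_per_gram:.3f} TB/gram "
      f"({hdd_tb_per_gram / 1000:.6f} PB/gram)")

# DNA at 215 PB/gram works out to roughly 6.6 million times denser.
ratio = dna_pb_per_gram / (hdd_tb_per_gram / 1000)
print(f"DNA is ~{ratio:,.0f}x denser per gram")
```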

Another major advantage is its sustainability. Estimated global data center electricity consumption in 2022 was 240–340 TWh, or around 1–1.3% of global final electricity demand. Current data storage technology uses rare earth metals, which are environmentally damaging to mine. Drives take up space, and they also create e-waste at the end of their lifecycle. It’s a challenge anyone who works in the data storage industry thinks about a lot.

DNA storage, on the other hand, requires less energy. A 2023 study found that data writing can be achieved in the DNA movable-type storage system under normal operating temperatures ranging from about 60–113°F and can be stored at room temperature. DNA molecules are also biodegradable and can be broken down naturally. 

The DNA data-writing process is chemical-based, and actually not the most environmentally friendly, but the DNA storage cards developed by Biomemory use a proprietary biosourced writing process, which they call “a significant advancement over existing chemical or enzymatic synthesis technologies.” So, there might be some trade-offs, but we’ll know more as the technology evolves. 

What’s the Catch?

Density? Check. Durability? Wow, yeah. Sustainability? You got it. But DNA storage is still a long way from sitting on your desk, storing your duplicate selfies. First, and we said this back in 2015 too, DNA takes a long time to read and write—DNA synthesis writes at a few hundred bytes per second. An average iPhone photo would take several hours to write to DNA. And to read it, you have to sequence the DNA—a time-intensive process. Both of those processes require specialized scientific equipment.

It’s also still too expensive. In 2015, we found a study that put 83 kilobytes of DNA storage at £1000 (about $1,500 U.S. dollars). In 2021, MIT estimated it would cost about $1 trillion to store one petabyte of data on DNA. For comparison, it costs $6,000 per month to store one petabyte in Backblaze B2 Cloud Storage ($6/TB/month). You could store that petabyte for a little over 13 million years before you’d hit $1 trillion.
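That 13 million year figure falls straight out of the numbers above:

```python
# How long could $1 trillion keep a petabyte in Backblaze B2
# at the $6/TB/month rate quoted above?
b2_per_pb_month = 6 * 1000          # $6/TB/month x 1,000 TB = $6,000/PB/month
budget = 1_000_000_000_000          # MIT's ~$1T estimate for 1PB on DNA

months = budget / b2_per_pb_month
years = months / 12
print(f"{years:,.0f} years")        # ~13.9 million years
```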

Today, Biomemory’s DNA storage cards ring in at a cool €1000 (about $1,080 U.S. dollars). And they can hold a whopping one kilobyte of data, or the equivalent of a short email. So, yeah… it’s, ahh, gotten even more expensive for the commercial product.

The discrepancy between the MIT theoretical estimate and the cost of the Biomemory cards really speaks to the expense of bringing a technology like this to market. The theoretical cost per byte is a lot different than the operational cost, and the Biomemory cards are really meant to serve as proof of concept. All that said, as the technology improves, one can only hope that it becomes more cost-effective. Folks are experimenting with different encoding schemes to make writing and reading more efficient, which is one example of an advance that could start to tip the balance.

Finally, there’s just something a bit spooky about using synthetic DNA to store data. There’s a Black Mirror episode in there somewhere. Maybe one day we can upload kung fu skills directly into our brain domes and that would be cool, but for now, it’s still somewhat unsettling.

What Is Ceramic Storage?

Ceramic storage makes an old school approach new again, if you consider that the first stone tablets were kind of the precursor to today’s hard drives. Who’s up for storing some cuneiform?

Cerabyte, the company behind the “video that surfaced online,” is working on storage technology that uses ceramic and glass substrates in devices the size of a typical HDD that can store 10 petabytes of data. They use a glass base similar to Gorilla Glass by Corning topped with a layer of ceramic 300 micrometers thick that’s essentially etched with lasers. (Glass is used in many larger hard drives today, for what it’s worth. Hoya makes them, for example.) The startup debuted a fully operational prototype system using only commercial off-the-shelf equipment—pretty impressive. 

The prototype consists of a single read-write rack and several library racks. When you want to write data, it moves one of the cartridges from the library to the read-write rack where it is opened to expose and stage the ceramic substrate. Two million laser beamlets then punch nanoscale ones and zeros into the surface. Once the data is written, the read-write arm verifies it on the return motion to its original position. 

Cerabyte isn’t the only player in the game. Others like MDisc use similar technology. Currently, MDisc stores data on DVD-sized disks using a “rock-like” substrate. Several DVD player manufacturers have included the technology in players. 

Similar to DNA storage, ceramic storage boasts much higher density than current data storage tech—terabytes per square centimeter versus an HDD’s 0.02TB per square centimeter. Also like DNA storage, it’s more environmentally friendly. Ceramic and glass can be stored within a wide temperature range, from -460°F to 570°F, and it’s a natural material that will last millennia and eventually decompose. It’s also incredibly durable: Cerabyte claims it will last 5,000+ years, and with tons of clay pots still lying around from ancient times, that makes sense.

One advantage it has over DNA storage, though, is speed. One laser pulse writes up to 2,000,000 bits, so data can be written at GBps speeds.

What’s the Catch?

Ceramic storage checks the boxes for density, sustainability, and speed, but our biggest question is: who’s going to need that speed? Only a handful of applications, like AI, require it today. AI is certainly having a big moment, and it can only get bigger. So, presumably there’s a market, but only a small one that can justify the cost.

One other biggie, at least for a cloud storage provider like us, though not necessarily for consumers or other enterprise users: it’s a write-once model. Once it’s on there, it’s on there. 

Finally, much like DNA tech, it’s probably (?) still too expensive to make it feasible for most data center applications. Cerabyte hasn’t released pricing yet. According to Blocks & Files, “The cost roadmap is expected to offer cost structures below projections of current commercial storage technologies.” But it’s still a big question mark.

Our Hot Take

Both of these technologies are really cool. They definitely got our storage geek brains fired up. But until they become scalable, operationally feasible, and cost-effective, you won’t see them in production—they’re still far enough out that they sit on the fiction end of the science fiction to science fact spectrum. And there are a couple of roadblocks we see before they reach the ubiquity of your trusty hard drive.

The first is making both technologies operational, not just theoretical in a lab. We’ll know more about both Biomemory’s and Cerabyte’s technologies as they roll out these initial proof of concept cards and prototype machines. And both have plans, naturally, for scaling the technologies to the data center. Whether they can or not remains to be seen. Lots of technologies have come and gone, falling victim to the challenges of production, scaling, and cost. 

The second is the attendant infrastructure needs. Getting 100x speed is great, if the device is right next to you. But we’ll need similar leaps in physical networking infrastructure to transfer the data anywhere else. Until that catches up, the tech remains lab-bound. 

All that said, I still remember using floppy disks that held mere megabytes of data, and now you can put 20TB on a hard disk. So, I guess the question is, how long will it be before I can plug into the Matrix?

The post Storage Tech of the Future: Ceramics, DNA, and More appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Download Your Google Drive and Back Up Your Files

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/download-backup-google-drive/

A decorative image showing a Google Drive logo and a storage bar filling up with different types of files.

Editor’s Note

What better time for a reminder to back up your data than after a serious data loss event? If you are concerned about the safety of your Google Drive data after the reports of unexplained data loss by Google Drive users last week, then read on to learn how to download and back up your Google Drive.

More than one billion businesses and individuals use Google Drive according to, well, a quick search on Google. If most of those one billion people are like me, they save pretty much everything there. 

Whether the data is professional or personal, the end result is a lot of important files that aren’t necessarily backed up anywhere. Maybe your school is closing your account and you need to move all of your data somewhere else. Maybe your account gets attacked by cybercriminals. Or maybe Google goes down or loses your data. In order to protect your important Google Drive files, you need to understand how to go about downloading and backing up your account. 

In this post, you’ll learn some simple steps to achieve that, including how to download your Google Drive, how to back up your computer, and how to back up your Google Drive.

We’ve gathered a handful of guides to help you protect your content across many different platforms. We’re working on developing this list—please comment below if you’d like to see another platform covered.

How to Download Your Google Drive

Most people have multiple email accounts, so first it is important to make sure you are logged in to the correct Google Account before you start this process. 

Once you’re signed in, you will want to go to Google Drive: drive.google.com. From there, you can download individual files if you don’t have that many or do a bulk download.

To download individual files:

  1. Hold shift while you select all of your files.
  2. Right-click and select Download.

To do a bulk download:

  1. Go to your account at myaccount.google.com.
  2. Go to Data & privacy.
  3. Scroll down to the section of the page titled “Download or delete your data” and click “Download your data.” This allows you to download all of the data in your Google account (not just Google Drive) via Google Takeout.
A screenshot of Google Drive settings showing where to download your data.
  4. Select Google Drive (and whatever other services you might want to download data from).
A screenshot of Google Drive settings showing how to select which Google suite data you want to download.
  5. You then have a few options to select:
    1. Multiple formats: Here you can tell Google the formats of the files you want to download. For example, if you want to download documents as .docx files or as PDFs.
    2. Advanced settings: Here you can tell Google to download additional data, including previous versions and the names of your folders.
    3. All Drive data included: Here you can select all data, or deselect specific folders if you want to.
  6. Scroll down to the bottom and click on Next Step.
  7. You’ll be prompted to specify your delivery method. Select Send download link via email.
  8. You can then specify your frequency. You can select a single export or an export every two months for a year. For our purposes, you can select a single export. (We’ll talk about options for backing up your data more frequently later.)
  9. Specify the file type and the file size you want to export.
    1. You can choose to have these files sent as a .zip file or a .tgz (tar) file. The main difference between the two options is that a .zip file compresses every file independently in the archive, but a .tgz file compresses the archive as a whole.
    2. The file size tells Google when to split your data into a separate file. Depending on the size of your data, Google may send you multiple emails with different sizes of files.
A screenshot of Google Drive settings showing where to set the frequency and file types of data downloads.
  10. Click Create export.
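If you’re curious about the .zip vs. .tgz distinction in the export step, here’s a small sketch using Python’s standard library (the filenames are made up for illustration): zipfile compresses each member independently, while tarfile bundles everything into one stream that gzip then compresses as a whole.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    tmp = Path(tmpdir)
    # Two small "exported" files with lots of shared content.
    for name in ("doc1.txt", "doc2.txt"):
        (tmp / name).write_text("the same boilerplate text\n" * 500)

    # .zip: each file is deflated on its own.
    with zipfile.ZipFile(tmp / "takeout.zip", "w", zipfile.ZIP_DEFLATED) as z:
        for name in ("doc1.txt", "doc2.txt"):
            z.write(tmp / name, arcname=name)

    # .tgz: the tar archive is built first, then gzipped as one stream.
    with tarfile.open(tmp / "takeout.tgz", "w:gz") as t:
        for name in ("doc1.txt", "doc2.txt"):
            t.add(tmp / name, arcname=name)

    zip_size = (tmp / "takeout.zip").stat().st_size
    tgz_size = (tmp / "takeout.tgz").stat().st_size
    print(zip_size, tgz_size)  # sizes differ because of where compression is applied
```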

When most people think about downloading the data they store in Google Drive, they’re thinking about the documents, photos, and other larger files they work with, but (as Google Takeout makes clear) you have a lot more data stored with Google outside of Drive.

Here’s why you might choose to export everything: 

  • To have a copy of bookmarked websites. 
  • To have a copy of emails that may contain files you’ve lost over time. 
  • To have a copy of important voicemails from loved ones in Google’s Voice product that you want to keep forever. 

Also, when you download all of your data it is a good reminder of what information Google has of yours.

After you click Create export, you’ll get an email in a few minutes, hours, or a couple of days, depending on the size of your data, informing you that your Google data is ready to download.

How to Back Up Your Computer

You now have your Google Drive data out of the Google Cloud and on your computer. Next, you’ll want to make sure it’s backed up. Your computer can fail just like Google, so simply downloading it isn’t enough. Protecting your newly downloaded Google data with a good cloud backup strategy should be the next thing you do.

Make sure to have at least three copies of your data: two local (one on your computer and one on a different storage medium, like a hard drive) and one off-site, which these days means in the cloud.

Note that when we’re using the word “cloud” here, we specifically mean that you’re backing up to the cloud. Often using a “cloud drive” means that you’re syncing, and, as the current data loss snafu at Google shows, there’s a big difference between sync and backup.

How to Back Up Google Drive

Downloading your data once and backing it all up is a good step. But, you’re adding documents to Google Drive all the time, and downloading your data manually can get tedious if you want to make sure your work is consistently and reliably backed up. 

Of course, as we noted above, you can set your Google Drive bulk download to a regular cadence. You’d still have to manually download your data and add it to your computer’s local storage, then back it up using the same method you would for your computer data. If you’re using Backblaze Computer Backup, which automatically runs in the background on your computer, those files would be backed up once they entered your local storage.

Still, that means that you have the possibility of losing files if your cadence isn’t frequent enough, and if you forget to manually download and replace those files sent to you in email, then you might run into trouble. 

Alternatively, there are a few services that will back up your Google Drive data for you. With something like Movebot, you can set up your Google Drive to sync and back up to a cloud storage service like Backblaze B2. If you’re a little more tech savvy, you can also use rclone to do the same thing. 

These tools are a bit more complex than using your Backblaze Computer Backup account, but you can configure these tools to back up your Google Drive at a frequency that makes sense for you to make sure new data is getting backed up as you add it.

Do you have any techniques on how you download your data from Google Drive or other Google products? Share them in the comments section below!


How do I download individual files from Google?

You can simply select the files you want to download, right click, and select Download.

How do I download my entire Google Drive?

You can use Google Takeout to download your entire Google Drive as well as any data you have in other Google services. Go to your account, click on Data & privacy, and click on Download your data to get started.

How do I back up my Google data once I download it?

You can back up your Google data once you’ve downloaded it to your computer by using a trusted computer backup service. Make sure to follow a 3-2-1 backup strategy by keeping at least two backups in addition to your data in Google Drive: one local, on your desktop or on a hard drive, and one in the cloud.

How do I back up my Google Drive?

There are many backup software services available to help you back up your Google Drive data. With something like Movebot, you can set up your Google Drive to sync and back up to a cloud storage service like Backblaze B2. If you’re a little more tech savvy, you can also use rclone to do the same thing.

The post How to Download Your Google Drive and Back Up Your Files appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloud 101: Data Egress Fees Explained

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/cloud-101-data-egress-fees-explained/

A decorative article showing a server, a cloud, and arrows pointing up and down with a dollar sign.

You can imagine data egress fees like tolls on a highway—your data is cruising along trying to get to its destination, but it has to pay a fee for the privilege of continuing its journey. If you have a lot of data to move, or a lot of toll booths (different cloud services) to move it through, those fees can add up quickly. 

Data egress fees are charges you incur for moving data out of a cloud service. They can be a big part of your cloud bill depending on how you use the cloud. And, they’re frequently a reason behind surprise AWS bills. So, let’s take a closer look at egress, egress fees, and ways you can reduce or eliminate them, so that your data can travel the cloud superhighways at will. 

What Is Data Egress?

In computing generally, data egress refers to the transfer or movement of data out of a given location, typically from within a network or system to an external destination. When it comes to cloud computing, egress generally means whenever data leaves the boundaries of a cloud provider’s network. 

In the simplest terms, data egress is the outbound flow of data.

A photo of a stair case with a sign that says "out" and an arrow pointing up.
The fees, like these stairs, climb higher. Source.

Egress vs. Ingress?

While egress pertains to data exiting a system, ingress refers to data entering a system. When you download something, you’re egressing data from a service. When you upload something, you’re ingressing data to that service. 

Unsurprisingly, most cloud storage providers do not charge you to ingress data—they want you to store your data on their platform, so why would they? 

Egress vs. Download

You might hear egress referred to as download, and that’s not wrong, but there are some nuances. Egress applies not only to downloads, but also when you migrate data between cloud services, for example. So, egress includes downloads, but it’s not limited to them. 

In the context of cloud service providers, the distinction between egress and download may not always be explicitly stated, and the terminology used can vary between providers. It’s essential to refer to the specific terms and pricing details provided by the service or platform you are using to understand how they classify and charge for data transfers.

How Do Egress Fees Work?

Data egress fees are charges incurred when data is transferred out of a cloud provider’s environment. These fees are often associated with cloud computing services, where users pay not only for the resources they consume within the cloud (such as storage and compute) but also for the data that is transferred from the cloud to external destinations.

There are a number of scenarios where a cloud provider typically charges egress: 

  • When you’re migrating data from one cloud to another.
  • When you’re downloading data from a cloud to a local repository.
  • When you move data between regions or zones with certain cloud providers. 
  • When an application, end user, or content delivery network (CDN) requests data from your cloud storage bucket. 

The fees can vary depending on the amount of data transferred and the destination of the data. For example, transferring data between regions within the same cloud provider’s network might incur lower fees than transferring data to the internet or to a different cloud provider.

Data egress fees are an important consideration for organizations using cloud services, and they can impact the overall cost of hosting and managing data in the cloud. It’s important to be aware of the pricing details related to data egress in the cloud provider’s pricing documentation, as these fees can contribute significantly to the total cost of using cloud services.

Why Do Cloud Providers Charge Egress Fees?

Both ingressing and egressing data costs cloud providers money. They have to build the physical infrastructure to allow users to do that, including switches, routers, fiber cables, etc. They also have to have enough of that infrastructure on hand to meet customer demand, not to mention staff to deploy and maintain it. 

However, it’s telling that most cloud providers don’t charge ingress fees, only egress fees. It would be hard to entice people to use your service if you charged them extra for uploading their data. But, once cloud providers have your data, they want you to keep it there. Charging you to remove it is one way cloud providers like AWS, Google Cloud, and Microsoft Azure do that. 

What Are AWS’s Egress Fees?

AWS S3 gives customers 100GB of data transfer out to the internet free each month, with some caveats—that 100GB excludes data stored in China and GovCloud. After that, the published rates for U.S. regions for data transferred over the public internet are as follows as of the date of publication:

  • The first 10TB per month is $0.09 per GB.
  • The next 40TB per month is $0.085 per GB.
  • The next 100TB per month is $0.07 per GB.
  • Anything greater than 150TB per month is $0.05 per GB. 
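Those tiers stack, so a bill isn’t a single flat rate. Here’s a sketch of the tiered math using the published rates above (assuming 1TB = 1,000GB for simplicity; AWS bills per GB, and the 100GB free allowance comes off the top):

```python
# (tier size in GB, $ per GB) for U.S.-region internet egress.
TIERS = [
    (10_000, 0.09),        # first 10TB
    (40_000, 0.085),       # next 40TB
    (100_000, 0.07),       # next 100TB
    (float("inf"), 0.05),  # beyond 150TB
]

def egress_cost(gb: float, free_gb: float = 100) -> float:
    """Monthly egress cost in dollars for `gb` transferred out."""
    remaining = max(gb - free_gb, 0)
    cost = 0.0
    for tier_gb, rate in TIERS:
        chunk = min(remaining, tier_gb)
        cost += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return round(cost, 2)

print(egress_cost(1_000))    # 1TB out:  81.0
print(egress_cost(50_000))   # 50TB out: 4291.5
```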

But AWS also charges customers egress between certain services and regions, and it can get complicated quickly as the following diagram shows…


How Can I Reduce Egress Fees?

If you’re using cloud services, minimizing your egress fees is probably a high priority. Companies like the Duckbill Group (the creators of the diagram above) exist to help businesses manage their AWS bills. In fact, there’s a whole industry of consultants that focuses solely on reducing your AWS bills. 

Aside from hiring a consultant to help you spend less, there are a few simple ways to lower your egress fees:

  1. Use a content delivery network (CDN): If you’re hosting an application, using a CDN can lower your egress fees since a CDN will cache data on edge servers. That way, when a user sends a request for your data, it can pull it from the CDN server rather than your cloud storage provider where you would be charged egress. 
  2. Optimize data transfer protocols: Choose efficient data transfer protocols that minimize the amount of data transmitted. For example, consider using compression or delta encoding techniques to reduce the size of transferred files. Compressing data before transfer can reduce the volume of data sent over the network, leading to lower egress costs. However, the effectiveness of compression depends on the nature of the data.
  3. Utilize integrated cloud providers: Some cloud providers offer free data transfer with a range of other cloud partners. (Hint: that’s what we do here at Backblaze!)
  4. Be aware of tiering: It may sound enticing to opt for a cold(er) storage tier to save on storage, but some of those tiers come with much higher egress fees. 
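Tip #2 is easy to see in action. Here’s a quick sketch with Python’s gzip module using made-up log lines; repetitive text like this compresses dramatically, while already-compressed media barely shrinks at all:

```python
import gzip

# Highly repetitive data (access logs) compresses very well.
log_data = ("2024-01-15 GET /api/v1/files 200\n" * 10_000).encode()
compressed = gzip.compress(log_data)

ratio = len(compressed) / len(log_data)
print(f"{len(log_data)} -> {len(compressed)} bytes ({ratio:.1%} of original)")
```

Every byte you don’t send is a byte you aren’t charged egress for.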

How Does Backblaze Reduce Egress Fees?

There’s one more way you can drastically reduce egress, and we’ll just come right out and say it: Backblaze gives you free egress up to 3x the average monthly storage and unlimited free egress through a number of CDN and compute partners, including Fastly, Cloudflare, Bunny.net, and Vultr.

Why do we offer free egress? Supporting an open cloud environment is central to our mission, so we expanded free egress to all customers so they can move data when and where they prefer. Cloud providers like AWS and others charge high egress fees that make it expensive for customers to use multi-cloud infrastructures and therefore lock in customers to their services. These walled gardens hamper innovation and long-term growth.

Free Egress = A Better, Multi-Cloud World

The bottom line: the high egress fees charged by hyperscalers like AWS, Google, and Microsoft are a direct impediment to a multi-cloud future driven by customer choice and industry need. And, a multi-cloud future is something we believe in. So go forth and build the multi-cloud future of your dreams, and leave worries about high egress fees in the past. 

The post Cloud 101: Data Egress Fees Explained appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Digging Deeper Into Object Lock

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/digging-deeper-into-object-lock/

A decorative image showing data inside of a vault.

Using Object Lock for your data is a smart choice—you can protect your data from ransomware, meet compliance requirements, beef up your security policy, or preserve data for legal reasons. But, it’s not a simple on/off switch, and accidentally locking your data for 100 years is a mistake you definitely don’t want to make.

Today we’re taking a deeper dive into Object Lock and the related legal hold feature, examining the different levels of control that are available, explaining why developers might want to build Object Lock into their own applications, and showing exactly how to do that. While the code samples are aimed at our developer audience, anyone looking for a deeper understanding of Object Lock should be able to follow along.

I presented a webinar on this topic earlier this year that covers much the same ground as this blog post, so feel free to watch it instead of, or in addition to, reading this article. 

Check Out the Docs

For even more information on Object Lock, check out our Object Lock overview in our Technical Documentation Portal as well as these how-tos about how to enable Object Lock using the Backblaze web UI, Backblaze B2 Native API, and the Backblaze S3 Compatible API:

What Is Object Lock?

In the simplest explanation, Object Lock is a way to lock objects (aka files) stored in Backblaze B2 so that they are immutable—that is, they cannot be deleted or modified, for a given period of time, even by the user account that set the Object Lock rule. Backblaze B2’s implementation of Object Lock was originally known as File Lock, and you may encounter the older terminology in some documentation and articles. For consistency, I’ll use the term “object” in this blog post, but in this context it has exactly the same meaning as “file.”

Object Lock is a widely offered feature included with backup applications such as Veeam and MSP360, allowing organizations to ensure that their backups are not vulnerable to deliberate or accidental deletion or modification for some configurable retention period.

Ransomware mitigation is a common motivation for protecting data with Object Lock. Even if an attacker were to compromise an organization’s systems to the extent of accessing the application keys used to manage data in Backblaze B2, they would not be able to delete or change any locked data. Similarly, Object Lock guards against insider threats, where the attacker may try to abuse legitimate access to application credentials.

Object Lock is also used in industries that store sensitive or personally identifiable information (PII), such as banking, education, and healthcare. Because they work with such sensitive data, regulatory requirements dictate that data be retained for a given period of time, but data must also be deleted in particular circumstances.

For example, the General Data Protection Regulation (GDPR), an important component of the EU’s privacy laws and an international regulatory standard that drives best practices, may dictate that some data must be deleted when a customer closes their account. A related use case is where data must be preserved due to litigation, where the period for which data must be locked is not fixed and depends on the type of lawsuit at hand. 

To handle these requirements, Backblaze B2 offers two Object Lock modes—compliance and governance—as well as the legal hold feature. Let’s take a look at the differences between them.

Compliance Mode: Near-Absolute Immutability

When objects are locked in compliance mode, not only can they not be deleted or modified while the lock is in place, but the lock also cannot be removed during the specified retention period. It is not possible to remove or override the compliance lock to delete locked data until the lock expires, whether you’re attempting to do so via the Backblaze web UI or either of the S3 Compatible or B2 Native APIs. Similarly, Backblaze Support is unable to unlock or delete data locked under compliance mode in response to a support request, which is a safeguard designed to address social engineering attacks where an attacker impersonates a legitimate user.

What if you inadvertently lock many terabytes of data for several years? Are you on the hook for thousands of dollars of storage costs? Thankfully, no—you have one escape route, which is to close your Backblaze account. Closing the account is a multi-step process that requires access to both the account login credentials and two-factor verification (if it is configured) and results in the deletion of all data in that account, locked or unlocked. This is a drastic step, so we recommend that developers create one or more “burner” Backblaze accounts for use in developing and testing applications that use Object Lock, that can be closed if necessary without disrupting production systems.

There is one lock-related operation you can perform on compliance-locked objects: extending the retention period. In fact, you can keep extending the retention period on locked data any number of times, protecting that data from deletion until you let the compliance lock expire.
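Via the S3 Compatible API, extending the lock is just another retention call with a later date. Here is a sketch of the request parameters (the bucket and file names are placeholders, not values from this post):

```python
from datetime import datetime, timezone

# Hypothetical request parameters for extending a compliance lock.
# The new RetainUntilDate must be later than the current one, since
# compliance retention can be extended but never shortened.
extend_args = {
    'Bucket': 'my-bucket-name',
    'Key': 'my-file',
    'Retention': {
        'Mode': 'COMPLIANCE',
        'RetainUntilDate': datetime(2026, 1, 1, tzinfo=timezone.utc),
    },
}

# With a configured boto3 client you would then call:
# s3_client.put_object_retention(**extend_args)
```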

Governance Mode: Override Permitted

In our other Object Lock option, objects can be locked in governance mode for a given retention period. But, in contrast to compliance mode, the governance lock can be removed or overridden via an API call, if you have an application key with appropriate capabilities. Governance mode handles use cases that require retention of data for some fixed period of time, with exceptions for particular circumstances.

When I’m trying to remember the difference between compliance and governance mode, I think of the phrase, “Twenty seconds to comply!”, uttered by the ED-209 armed robot in the movie “RoboCop.” It turned out that there was no way to override ED-209’s programming, with dramatic, and fatal, consequences.

ED-209: as implacable as compliance mode.

Legal Hold: Flexible Preservation

While the compliance and governance retention modes lock objects for a given retention period, legal hold is more like a toggle switch: you can turn it on and off at any time, again with an application key with sufficient capabilities. As its name suggests, legal hold is ideal for situations where data must be preserved for an unpredictable period of time, such as while litigation is proceeding.
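In the S3 Compatible API, the toggle is the `put_object_legal_hold` operation, which takes a `Status` of `ON` or `OFF`. A minimal sketch of the request parameters (names are placeholders):

```python
# Hypothetical request parameters for toggling legal hold on and off.
hold_on_args = {
    'Bucket': 'my-bucket-name',
    'Key': 'my-file',
    'LegalHold': {'Status': 'ON'},
}
# The same request with the hold released.
hold_off_args = {**hold_on_args, 'LegalHold': {'Status': 'OFF'}}

# With a configured boto3 client:
# s3_client.put_object_legal_hold(**hold_on_args)   # enable the hold
# s3_client.put_object_legal_hold(**hold_off_args)  # release the hold
```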

The compliance and governance modes are mutually exclusive, which is to say that only one may be in operation at any time. Objects locked in governance mode can be switched to compliance mode, but, as you might expect from the above explanation, objects locked in compliance mode cannot be switched to governance mode until the compliance lock expires.

Legal hold, on the other hand, operates independently, and can be enabled and disabled regardless of whether an object is locked in compliance or governance mode.

How does this work? Consider an object that is locked in compliance or governance mode and has legal hold enabled:

  • If the legal hold is removed, the object remains locked until the retention period expires.
  • If the retention period expires, the object remains locked until the legal hold is removed.
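The two rules above boil down to a single predicate: an object version is deletable only when its retention period has expired and no legal hold is set. A toy model of that logic (illustrative only, not a Backblaze API):

```python
from datetime import datetime, timezone

def is_deletable(retain_until, legal_hold, now=None):
    """An object version can be deleted only when its retention period
    has expired AND no legal hold is in place (illustrative model)."""
    now = now or datetime.now(timezone.utc)
    retention_expired = retain_until is None or now >= retain_until
    return retention_expired and not legal_hold

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
expired = datetime(2024, 1, 1, tzinfo=timezone.utc)
future = datetime(2025, 1, 1, tzinfo=timezone.utc)

print(is_deletable(expired, legal_hold=False, now=now))  # True
print(is_deletable(expired, legal_hold=True, now=now))   # False: hold still on
print(is_deletable(future, legal_hold=False, now=now))   # False: lock not expired
```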

Object Lock and Versioning

By default, Backblaze B2 Buckets have versioning enabled, so as you upload successive objects with the same name, previous versions are preserved automatically. None of the Object Lock modes prevent you from uploading a new version of a locked object; the lock is specific to the object version to which it was applied.

You can also hide a locked object so it doesn’t appear in object listings. The hidden version is retained and can be revealed using the Backblaze web UI or an API call.

As you might expect, locked object versions are not subject to deletion by lifecycle rules—any attempt to delete a locked object version via a lifecycle rule will fail.

How to Use Object Lock in Applications

Now that you understand the two modes of Object Lock, plus legal hold, and how they all work with object versions, let’s look at how you can take advantage of this functionality in your applications. I’ll include code samples for Backblaze B2’s S3 Compatible API written in Python, using the AWS SDK, aka Boto3, in this blog post. You can find details on working with Backblaze B2’s Native API in the documentation.

Application Key Capabilities for Object Lock

Every application key you create for Backblaze B2 has an associated set of capabilities; each capability allows access to specific functionality in Backblaze B2. There are seven capabilities relevant to Object Lock and legal hold.

Two capabilities relate to bucket settings:

  1. readBucketRetentions 
  2. writeBucketRetentions

Three capabilities relate to object settings for retention: 

  1. readFileRetentions 
  2. writeFileRetentions 
  3. bypassGovernance

And, two are specific to legal hold: 

  1. readFileLegalHolds 
  2. writeFileLegalHolds 

The Backblaze B2 documentation contains full details of each capability and the API calls it relates to for both the S3 Compatible API and the B2 Native API.

When you create an application key via the web UI, it is assigned capabilities according to whether you allow it access to all buckets or just a single bucket, and whether you assign it read-write, read-only, or write-only access.

An application key created in the web UI with read-write access to all buckets will receive all of the above capabilities. A key with read-only access to all buckets will receive readBucketRetentions, readFileRetentions, and readFileLegalHolds. Finally, a key with write-only access to all buckets will receive bypassGovernance, writeBucketRetentions, writeFileRetentions, and writeFileLegalHolds.

In contrast, an application key created in the web UI restricted to a single bucket is not assigned any of the above capabilities. When an application using such a key uploads objects to its associated bucket, those objects receive the bucket's default retention mode and period, if they have been set. The application is not able to select a different retention mode or period when uploading an object, change the retention settings on an existing object, or bypass governance when deleting an object.

You may want to create application keys with more granular permissions when working with Object Lock and/or legal hold. For example, you may need an application restricted to a single bucket to be able to toggle legal hold for objects in that bucket. You can use the Backblaze B2 CLI to create an application key with this, or any other set of capabilities. This command, for example, creates a key with the default set of capabilities for read-write access to a single bucket, plus the ability to read and write the legal hold setting:

% b2 create-key --bucket my-bucket-name my-key-name listBuckets,readBuckets,listFiles,readFiles,shareFiles,writeFiles,deleteFiles,readBucketEncryption,writeBucketEncryption,readBucketReplications,writeBucketReplications,readFileLegalHolds,writeFileLegalHolds

Enabling Object Lock

You must enable Object Lock on a bucket before you can lock any objects therein; you can do this when you create the bucket, or at any time later, but you cannot disable Object Lock on a bucket once it has been enabled. Here’s how you create a bucket with Object Lock enabled:
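Using boto3 against the S3 Compatible API, this is a minimal sketch of the request; the bucket name is a placeholder, and `s3_client` is assumed to be a client configured with your Backblaze B2 endpoint and application key:

```python
# Request parameters for creating a bucket with Object Lock enabled;
# the flag cannot be turned off again once the bucket exists.
create_bucket_args = {
    'Bucket': 'my-bucket-name',
    'ObjectLockEnabledForBucket': True,
}

# With a configured boto3 client:
# s3_client.create_bucket(**create_bucket_args)
```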


Once a bucket has Object Lock enabled, you can configure a default retention mode and period for objects created in that bucket. Only compliance mode is configurable from the web UI, but you can set governance mode as the default via an API call, like this:

s3_client.put_object_lock_configuration(
    Bucket='my-bucket-name',
    ObjectLockConfiguration={
        'ObjectLockEnabled': 'Enabled',
        'Rule': {
            'DefaultRetention': {
                'Mode': 'GOVERNANCE',
                'Days': 7
            }
        }
    }
)
You cannot set legal hold as a default configuration for the bucket.

Locking Objects

Regardless of whether you set a default retention mode for the bucket, you can explicitly set a retention mode and period when you upload objects, or apply the same settings to existing objects, provided you use an application key with the appropriate writeFileRetentions or writeFileLegalHolds capability.

Both the S3 PutObject operation and Backblaze B2’s b2_upload_file include optional parameters for specifying retention mode and period, and/or legal hold. For example:

from datetime import datetime

s3_client.put_object(
    Bucket='my-bucket-name',
    Key='my-file',
    Body=open('/path/to/local/file', mode='rb'),
    ObjectLockMode='GOVERNANCE',  # or 'COMPLIANCE'
    ObjectLockRetainUntilDate=datetime(
        2023, 9, 7, hour=10, minute=30, second=0
    )
)

Both APIs implement additional operations to get and set retention settings and legal hold for existing objects. Here’s an example of how you apply a governance mode lock:

s3_client.put_object_retention(
    Bucket='my-bucket-name',
    Key='my-file',
    VersionId=version_id,  # optional
    Retention={
        'Mode': 'GOVERNANCE',  # Required, even if mode is not changed
        'RetainUntilDate': datetime(
            2023, 9, 5, hour=10, minute=30, second=0
        )
    }
)

The VersionId parameter is optional: the operation applies to the current object version if it is omitted.

You can also use the web UI to view, but not change, an object’s retention settings, and to toggle legal hold for an object:

A screenshot highlighting where to enable Object Lock via the Backblaze web UI.

Deleting Objects in Governance Mode

As mentioned above, a key difference between the compliance and governance modes is that it is possible to override governance mode to delete an object, given an application key with the bypassGovernance capability. To do so, you must identify the specific object version, and pass a flag to indicate that you are bypassing the governance retention restriction:

# Get object details, including version id of current version
object_info = s3_client.head_object(
    Bucket='my-bucket-name',
    Key='my-file'
)

# Delete the most recent object version, bypassing governance
s3_client.delete_object(
    Bucket='my-bucket-name',
    Key='my-file',
    VersionId=object_info['VersionId'],
    BypassGovernanceRetention=True
)

There is no way to delete an object in legal hold; the legal hold must be removed before the object can be deleted.

Protect Your Data With Object Lock and Legal Hold

Object Lock is a powerful feature, and with great power… you know the rest. Here are some of the questions you should ask when deciding whether to implement Object Lock in your applications:

  • What would be the impact of malicious or accidental deletion of your application’s data?
  • Should you lock all data according to a central policy, or allow users to decide whether to lock their data, and for how long?
  • If you are storing data on behalf of users, are there special circumstances where a lock must be overridden?
  • Which users should be permitted to set and remove a legal hold? Does it make sense to build this into the application rather than have an administrator use a tool such as the Backblaze B2 CLI to manage legal holds?

If you already have a Backblaze B2 account, you can start working with Object Lock today; otherwise, create an account to get started.

The post Digging Deeper Into Object Lock appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

What Is Hybrid Cloud?

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/confused-about-the-hybrid-cloud-youre-not-alone/

An illustration of clouds computers and servers.
Editor’s note: This post has been updated since it was originally published in 2017.

The term hybrid cloud has been around for a while—we originally published this explainer in 2017. But time hasn’t necessarily made things clearer. Maybe you hear folks talk about your company’s hybrid cloud approach, but what does that really mean? If you’re confused about the hybrid cloud, you’re not alone. 

Hybrid cloud is a computing approach that uses both private and public cloud resources with some kind of orchestration between them. The term has been applied to a wide variety of IT solutions, so it’s no wonder the concept breeds confusion. 

In this post, we’ll explain what a hybrid cloud is, how it can benefit your business, and how to choose a cloud storage provider for your hybrid cloud strategy.

What Is the Hybrid Cloud?

A hybrid cloud is an infrastructure approach that uses both private and public resources. Let’s first break down those key terms:

  • Public cloud: When you use a public cloud, you are storing your data in another company’s internet-accessible data center. A public cloud service allows anybody to sign up for an account, and share data center resources with other customers or tenants. Instead of worrying about the costs and complexity of operating an on-premises data center, a cloud storage user only needs to pay for the cloud storage they need.
  • Private cloud: In contrast, a private cloud is specifically designed for a single tenant. Think of a private cloud as a permanently reserved private dining room at a restaurant—no other customer can use that space. As a result, private cloud services can be more expensive than public clouds. Traditionally, private clouds lived on on-premises infrastructure, meaning they were built and maintained on company property. Now, private clouds can be maintained and managed on-premises by an organization or by a third party in a data center. The key defining factor is that the cloud is dedicated to a single tenant or organization.

Those terms are important to know to understand the hybrid cloud architecture approach. Hybrid clouds are defined by a combined management approach, which means there is some type of orchestration between the private and public environments that allows workloads and data to move between them in a flexible way as demands, needs, and costs change. This gives you flexibility when it comes to data deployment and usage.  

In other words, if you have some IT resources on-premises that you are replicating or sharing with an external vendor—congratulations, you have a hybrid cloud!

Hybrid cloud refers to a computing architecture that is made up of both private cloud resources and public cloud resources with some kind of orchestration between them.

Hybrid Cloud Examples

Here are a few examples of how a hybrid cloud can be used:

  1. As an active archive: You might establish a protocol that says all accounting files that have not been changed in the last year, for example, are automatically moved off-premises to a cloud storage archive to save costs and reduce the amount of storage needed on-site. You can still access the files; they are just no longer stored on your local systems. 
  2. To meet compliance requirements: Let’s say some of your data is subject to strict data privacy requirements, but other data you manage isn’t as closely protected. You could keep highly regulated data on premises in a private cloud and the rest of your data in a public cloud. 
  3. To scale capacity: If you’re in an industry that experiences seasonal or frequent spikes like retail or ecommerce, these spikes can be handled by a public cloud which provides the elasticity to deal with times when your data needs exceed your on-premises capacity.
  4. For digital transformation: A hybrid cloud lets you adopt cloud resources in a phased approach as you expand your cloud presence.

Hybrid Cloud vs. Multi-cloud: What’s the Diff?

You wouldn’t be the first person to think that the terms multi-cloud and hybrid cloud sound similar. Both approaches involve using multiple clouds. However, multi-cloud combines two or more clouds of the same type (e.g., two or more public clouds), while hybrid cloud combines a private cloud with a public cloud. One cloud approach is not necessarily better than the other—they simply serve different use cases. 

For example, let’s say you’ve already invested in significant on-premises IT infrastructure, but you want to take advantage of the scalability of the cloud. A hybrid cloud solution may be a good fit for you. 

Alternatively, a multi-cloud approach may work best for you if you are already in the cloud and want to mitigate the risk of a single cloud provider having outages or issues. 

Hybrid Cloud Benefits

A hybrid cloud approach allows you to take advantage of the best elements of both private and public clouds. The primary benefits are flexibility, scalability, and cost savings.

Benefit 1: Flexibility and Scalability

One of the top benefits of the hybrid cloud is its flexibility. Managing IT infrastructure on-premises can be time-consuming and expensive, and adding capacity requires advance planning, procurement, and upfront investment.

The public cloud is readily accessible and able to provide IT resources whenever needed on short notice. For example, the term “cloud bursting” refers to the on-demand and temporary use of the public cloud when demand exceeds resources available in the private cloud. A private cloud, on the other hand, provides the absolute fastest access speeds since it is generally located on-premises. (But cloud providers are catching up fast, for what it’s worth.) For data that is needed with the absolute lowest levels of latency, it may make sense for the organization to use a private cloud for current projects and store an active archive in a less expensive, public cloud.

Benefit 2: Cost Savings

Within the hybrid cloud framework, the public cloud segment offers cost-effective IT resources, eliminating the need for upfront capital expenses and associated labor costs. IT professionals gain the flexibility to optimize configurations, choose the most suitable service provider, and determine the optimal location for each workload. This strategic approach reduces costs by aligning resources with specific tasks. Furthermore, the ability to easily scale, redeploy, or downsize services enhances efficiency, curbing unnecessary expenses and contributing to overall cost savings.

Comparing Private vs. Hybrid Cloud Storage Costs

To understand the difference in storage costs between a purely on-premises solution and a hybrid cloud solution, we’ll present two scenarios. For each scenario, we’ll use data storage amounts of 100TB, 1PB, and 2PB. Each table uses the same format; the only thing we’ve changed is how the data is distributed: private (on-premises) or public (off-premises). We are using the costs for our own Backblaze B2 Cloud Storage in this example. The math can be adapted for any set of numbers you wish to use.

Scenario 1: 100% of data on-premises storage

                                            100TB     1,000TB    2,000TB
  Data stored on-premises (100%):           100TB     1,000TB    2,000TB
  On-premises cost, low ($12/TB/month):     $1,200    $12,000    $24,000
  On-premises cost, high ($20/TB/month):    $2,000    $20,000    $40,000

Scenario 2: 20% of data on-premises with 80% public cloud storage (Backblaze B2)

                                            100TB     1,000TB    2,000TB
  Data stored on-premises (20%):            20TB      200TB      400TB
  Data stored in the cloud (80%):           80TB      800TB      1,600TB
  On-premises cost, low ($12/TB/month):     $240      $2,400     $4,800
  On-premises cost, high ($20/TB/month):    $400      $4,000     $8,000
  Cloud cost, low ($6/TB/month):            $480      $4,800     $9,600
  Cloud cost, high ($20/TB/month):          $1,600    $16,000    $32,000
  Total monthly cost, low:                  $720      $7,200     $14,400
  Total monthly cost, high:                 $2,000    $20,000    $40,000

As you can see, using a hybrid cloud solution and storing 80% of the data in the cloud with a provider like Backblaze B2 can result in significant savings over storing only on-premises.
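The arithmetic behind these tables is simple enough to sketch (rates are in $/TB/month, and the sample prices are the ones used above):

```python
def hybrid_monthly_cost(total_tb, on_prem_fraction, on_prem_rate, cloud_rate):
    """Monthly storage cost for a given on-premises/cloud split."""
    on_prem_tb = total_tb * on_prem_fraction
    cloud_tb = total_tb - on_prem_tb
    return on_prem_tb * on_prem_rate + cloud_tb * cloud_rate

# Scenario 2, low end: 100TB split 20/80 at $12/TB on-premises, $6/TB cloud
print(hybrid_monthly_cost(100, 0.20, 12, 6))   # 720.0
# Scenario 1, low end: everything on-premises at $12/TB
print(hybrid_monthly_cost(100, 1.00, 12, 6))   # 1200.0
```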

Choosing a Cloud Storage Provider for Your Hybrid Cloud

Okay, so you understand the benefits of using a hybrid cloud approach, what next? Determining the right mix of cloud services may be intimidating because there are so many public cloud options available. Fortunately, there are a few decision factors you can use to simplify setting up your hybrid cloud solution. Here’s what to think about when choosing a public cloud storage provider:

  • Ease of use: Avoiding a steep learning curve can save you hours of work effort in managing your cloud deployments. By contrast, overly complicated pricing tiers or bells and whistles you don’t need can slow you down.
  • Data security controls: Compare how each cloud provider facilitates proper data controls. For example, take a look at features like authentication, Object Lock, and encryption.
  • Data egress fees: Some cloud providers charge additional fees for data egress (i.e., removing data from the cloud). These fees can make it more expensive to switch between providers. In addition to fees, check the data speeds offered by the provider.
  • Interoperability: Flexibility and interoperability are key reasons to use cloud services. Before signing up for a service, understand the provider’s integration ecosystem. A lack of needed integrations may place a greater burden on your team to keep the service running effectively.
  • Storage tiers: Some providers offer different storage tiers where you sacrifice access for lower costs. While the promise of inexpensive cold storage can be attractive, evaluate whether you can afford to wait hours or days to retrieve your data.
  • Pricing transparency: Pay careful attention to the cloud provider’s pricing model and tier options. Consider building a spreadsheet to compare a shortlist of cloud providers’ pricing models.

When Hybrid Cloud Might Not Always Be the Right Fit

The hybrid cloud may not always be the optimal solution, particularly for smaller organizations with limited IT budgets that might find a purely public cloud approach more cost-effective. The substantial setup and operational costs of private servers could be prohibitive.

A thorough understanding of workloads is crucial to effectively tailor the hybrid cloud, ensuring the right blend of private, public, and traditional IT resources for each application and maximizing the benefits of the hybrid cloud architecture.

So, Should You Go Hybrid?

Big picture, anything that helps you respond to IT demands quickly, easily, and affordably is a win. With a hybrid cloud, you can avoid some big up-front capital expenses for in-house IT infrastructure, making your CFO happy. Being able to quickly spin up IT resources as they’re needed will appeal to the CTO and VP of operations.

So, given all that, we’ve arrived at the bottom line: should you or your organization embrace hybrid cloud infrastructure? According to Flexera’s 2023 State of the Cloud report, 72% of enterprises utilize a hybrid cloud strategy. That indicates that the benefits of the hybrid cloud appeal to a broad range of companies.

If an organization approaches implementing a hybrid cloud solution with thoughtful planning and a structured approach, a hybrid cloud can deliver on-demand flexibility, empower legacy systems and applications with new capabilities, and become a catalyst for digital transformation. The result can be an elastic and responsive infrastructure that quickly adapts to the changing demands of the business.

As data management professionals increasingly recognize the advantages of the hybrid cloud, we can expect more and more of them to embrace it as an essential part of their IT strategy.

Tell Us What You’re Doing With the Hybrid Cloud

Are you currently embracing the hybrid cloud, or are you still uncertain or hanging back because you’re satisfied with how things are currently? We’d love to hear your comments below on how you’re approaching your cloud architecture decisions.

FAQs About Hybrid Cloud

What exactly is a hybrid cloud?

Hybrid cloud is a computing approach that uses both private and public cloud resources with some kind of orchestration between them.

What is the difference between hybrid and multi-cloud?

Multi-cloud combines two or more clouds of the same type (e.g., two or more public clouds), while hybrid cloud combines a private cloud with a public cloud. One cloud approach is not necessarily better than the other—they simply serve different use cases.

What is a hybrid cloud architecture?

Hybrid cloud architecture is any kind of IT architecture that combines both the public and private clouds. Many organizations use this term to describe specific software products that provide solutions which combine the two types of clouds.

What are hybrid clouds used for?

Organizations will often use hybrid clouds to create redundancy and scalability for their computing workload. A hybrid cloud is a great way for a company to have extra fallback options to continue offering services even when they have higher than usual levels of traffic, and it can also help companies scale up their services over time as they need to offer more options.

The post What Is Hybrid Cloud? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Holiday Gift Guide 2023

Post Syndicated from Yev original https://www.backblaze.com/blog/holiday-gift-guide-2023/

A decorative image showing exciting images falling out of a present.  The title reads Holiday Gift Guide.

The holidays are fast approaching and with them the many cyber sales that provide both inspiration and opportunity for gift giving on any budget. To help narrow the field, every year I ask my fellow Backblazers to submit the gifts that they are looking forward to both gifting and receiving. (Hopefully some of their loved ones read the blog?) And of course, I’ve sprinkled in a few of my favorites as well. Without further ado, here’s what we suggest looking into for your 2023 gift giving!

Health and Wellness

Oura Ring

A decorative image showing several of the Oura ring models.

This little thing is pretty neat. It helps you keep tabs on your health, tracking everything from sleep to stress levels. It lasts for a week on a single charge and is super easy on the eyes, so you’ll want to wear it all over the place.

Garmin InReach Mini

An image of a Garmin InReach Mini.

We have a lot of hikers, joggers, and runners at Backblaze and, as firm believers in thinking about your backup options before a disaster, the Garmin is an awesome-to-have trail buddy.

Drinks On Me (You?)

Yeti Cocktail Shaker

A product image of a Yeti cocktail shaker shown in red.

While a cocktail shaker is a pretty common household item, this one is sure to impress. Ask questions like, “Could my drink possibly get any colder and stay that way?” and “Can I customize my shaker with a sticker of my cat’s face?” And the Yeti’s answer is yes. Also, you know we love when a product comes in red. 

The Durand

A decorative image of a Durand removing corks from an old bottle of wine.

Wine anyone? If you or someone in your life is a big wino, older wines are a delicious treat, with a potentially fatal stumbling block: old, crumbly corks. The Durand corkscrew helps take them out with no breakage.

Coravin Timeless Three

An image showing a Coravin attached to a wine bottle pouring wine into a glass.

Another one for winos, the Coravin is an incredible wine system that uses tiny needles and argon gas to pour wine into your glass without having to actually open it. I can personally vouch for this one as a single human who has nice wine bottles and often wants a single glass once or twice a week.

Japanese Matcha Tea Set

A decorative images showing someone making matcha tea.

Tea time is a dreamy time and this matcha set allows you to make yourself a traditional cup. And if you need some matcha powder for it, this one comes highly recommended: Organic Ceremonial Grade Matcha Powder.

Jet Boil Camping Stove

A decorative image showing a JetBoil camper heater setup.

Tea and coffee at a campsite are a must-have, and if you’ve never tried a Jet Boil, this model is easy to use. Also helpful for those times where you lose power and need to make some hot water in a hurry.

Food’s Good

Sous Vide

A product image of a sous vide kitchen appliance.

Foodies know and love the sous vide method, a.k.a. low temperature, long time (LTLT). If you’re into cooking your food in a hot tub, you’ll be happy to know that this accessory has come down in price dramatically over time. We like this version of a kitchen appliance, but there is certainly a wide world of sous vide gadgets out there if you’re interested. 

Ooni Pizza Oven

A product image of an Ooni pizza oven.

Pizza night gets fancier with this pizza oven that can make you a Neapolitan style pizza in less than five minutes. You gotta love that efficiency. 

Goldbelly Iconic Meal Kits

An image of the Goldbelly website showing iconic meal kits.

Love fancy foods but can’t travel to get them? Goldbelly has become the go-to for nationwide delivery of local favorites, and they now do meal kits as well. We’re not going to say you should give up on your standard, probably nutritionally balanced Hello Freshes of the world, but we will say that these are a whole lot more, well, iconic.


A product image of a hydroponic garden.

Have your own mini-garden whether you’re in a house or an apartment. With just a little bit of counter space, a semi-green thumb, some patience, and water, you’ll never have herbs go bad in your fridge again. 

Games and Gaming

Steam Deck OLED, Lenovo Legion Go, & ROG Ally

Not since the times of the Game Boy Advance or maybe the Nintendo 3DS have handheld gaming systems seen such a rise in popularity. Along with the Nintendo Switch, these three handhelds bring the power of a computer to your fingertips on the go. While it’s not quite a gaming rig, it’s good enough for most airline flights, and hey…they’ll all play Baldur’s Gate 3. 

D&D Starter Set

It’s a great time to be a nerd. Critical Role, Dimension 20, The Adventure Zone, and many more role-playing games (RPGs) are super popular nowadays, and it’s high time you take part. Get the D&D Starter Set, some dice, and your soon-to-be best friends; create your character and get rolling.


Ororo Heated Vest

A product image of an Ororo heated vest.

Backblaze is based in California, but that doesn’t mean that we don’t know about weather. (What’s this wet stuff falling from the sky again?) That said, as a Midwesterner by heritage (dontcha know), I know something about staying warm. Heated clothes take the benefits of your favorite heated blanket and give them to you on the go. 


A product image of a selk'bag.

Camping? Walking? Freezing? How about a sleeping bag that you can walk in, eh?

Hats, Fanny Packs, & Bomber Jackets From Lower Park

A screenshot of the Lower Park website showing a lovely bomber jacket.

We’re all about being good community members, and this local (to us) company makes hats, fanny packs, and bomber jackets using environmentally friendly materials. They’re good products, in more ways than one.


Breathing Buddy

A product image showing how to meditate.

Studies have shown that meditation has measurable benefits for your mind and body. There are a plethora of tools out there to help you build good habits (see below), but this one is stinkin’ cute. Let this little guy help visually take you through a guided meditation. Bonus: it’s a great gift for kids, too.


The Calm app helps people stay mindful with everything from guided meditation to celebrity-read stories. We’re big fans of their social posts that just encourage you to take a 15 second break—it’s a positive interruption to the doomscroll effect, and a great way to preview some of the app’s content.

Watch and Listen


A product image showing several Skylight frames.

A twist on photo frames: you can send pictures to it and have all of your favorite memories staring back at you when you look over. Or, send photos to anyone, anywhere. Definitely some potential prank opportunities to be had; but it’s also a great way to keep in touch with far-flung family members. 

Sonos Surround Set With Beam

A product image of a Sonos surround kit.

Sonos surround systems are a great addition to homes. Multiple speakers can sync up to make sure that you’re never far away from rocking out to Weird Al, no matter where you are in the house.

Ikea FREKVENS (Sound Activated Lightbox)

An Ikea sound-activated lightbox.

Music’s always better with light shows and this lightbox from Ikea matches beats and keeps things groovy. Yet another reason to love Ikea!

Apple AirPods Max

An image of Apple AirPods Max.

For the audiophiles in your life, the AirPods Max are the over-the-ear variant of the traditional AirPod. They’re much harder to lose, giving you that impressive combo of sound and noise cancellation you’ve come to expect.

Pixel Buds Pro

A product image of Pixel buds.

To balance the scales for our Android lovers, here are Google’s in-ear buds. They have a lot of bells and whistles, including noise cancellation and built-in Google Assistant. Now when you talk to yourself, someone will answer. (That’s a good thing, right?)


A product image of a Lego typewriter kit.

LEGO is having a bit of a moment (at least in my family) and we have spent a lot of time building complicated models. For the adults in your life that love to tinker, we recommend some of these cool sets! 

LEGO Ideas Typewriter


LEGO Sanderson Sisters’ Cottage

Give the Gift of Backblaze

And, of course, we’d be remiss if we didn’t remind you that Backblaze Computer Backup makes a great gift. Help your family and friends experience the sweet, sweet peace of mind that comes from a good backup strategy and make sure they never lose a file again. Bonus: you don’t even have to go to the store to get it.

A decorative image showing a gift box with the words "Give Backblaze Backup" overlayed.

Go Forth and Gift!

We hope this guide sparked some ideas and simplified some choices. We’ll also be publishing our second-annual book guide in December if you’re struggling with something for the literary folks in your life. (There’s some good stuff in the first one too.) We love hearing about what folks are excited about, so feel free to give us some more good options in the comments below.

The post Holiday Gift Guide 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q3 2023

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2023/

A decorative image showing the title Q3 2023 Drive Stats.

At the end of Q3 2023, Backblaze was monitoring 263,992 hard disk drives (HDDs) and solid state drives (SSDs) in our data centers around the world. Of that number, 4,459 are boot drives, with 3,242 being SSDs and 1,217 being HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2023 Drive Stats review.

That leaves us with 259,533 HDDs that we’ll focus on in this report. We’ll review the quarterly and lifetime failure rates of the data drives as of the end of Q3 2023. Along the way, we’ll share our observations and insights on the data presented, and, for the first time ever, we’ll reveal the drive failure rates broken down by data center.

Q3 2023 Hard Drive Failure Rates

At the end of Q3 2023, we were managing 259,533 hard drives used to store data. For our review, we removed 449 drives from consideration because they were used for testing purposes or belonged to drive models with fewer than 60 drives in service. This leaves us with 259,084 hard drives grouped into 32 different models. 

The table below reviews the annualized failure rate (AFR) for those drive models for the Q3 2023 time period.

A table showing the quarterly annualized failure rates of Backblaze hard drives.

Notes and Observations on the Q3 2023 Drive Stats

  • The 22TB drives are here: At the bottom of the list you’ll see the WDC 22TB drives (model: WUH722222ALE6L4). A Backblaze Vault of 1,200 drives (plus four) is now operational. The 1,200 drives were installed on September 29, so they only have one day of service each in this report, but zero failures so far.
  • The old get bolder: At the other end of the time-in-service spectrum are the 6TB Seagate drives (model: ST6000DX000) with an average of 101 months in operation. This cohort had zero failures in Q3 2023 with 883 drives and a lifetime AFR of 0.88%.
  • Zero failures: In Q3, six different drive models managed to have zero drive failures during the quarter. But only the 6TB Seagate, noted above, had over 50,000 drive days, our minimum standard for ensuring we have enough data to make the AFR plausible.
  • One failure: There were four drive models with one failure during Q3. After applying the 50,000 drive day minimum, two drive models stood out:
    1. WDC 16TB (model: WUH721816ALE6L0) with a 0.15% AFR.
    2. Toshiba 14TB (model: MG07ACA14TEY) with a 0.63% AFR.

The Quarterly AFR Drops

In Q3 2023, the quarterly AFR for all drives was 1.47%. That was down from 2.2% in Q2 and also down from 1.65% a year ago. The quarterly AFR is based on just the data in that quarter, so it can often fluctuate from quarter to quarter. 
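For reference, the quarterly figures above can be reproduced from drive days and failure counts. Here’s a minimal sketch in Python, assuming Backblaze’s usual definition of AFR as failures per drive-year of service (the numbers below are made up for illustration):

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR as a percentage: failures divided by drive-years of service."""
    drive_years = drive_days / 365
    return failures / drive_years * 100

# Hypothetical quarter: 55 failures across 1.5 million drive days.
print(round(annualized_failure_rate(failures=55, drive_days=1_500_000), 2))  # → 1.34
```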

In our Q2 2023 report, we suspected the 2.2% for the quarter was due to the overall aging of the drive fleet and in particular we pointed a finger at specific 8TB, 10TB, and 12TB drive models as potential culprits driving the increase. That prediction fell flat in Q3 as nearly two-thirds of drive models experienced a decreased AFR quarter over quarter from Q2 and any increases were minimal. This included our suspect 8TB, 10TB, and 12TB drive models. 

It seems Q2 was an anomaly, but there was one big difference in Q3: we retired 4,585 aging 4TB drives. The average age of the retired drives was just over eight years, and while that was a good start, there are another 28,963 4TB drives to go. To facilitate the continuous retirement of aging drives and make the data migration process easy and safe, we use CVT, our awesome in-house data migration software, which we’ll cover at another time.

A Hot Summer and the Drive Stats Data

As anyone in our business should, Backblaze continuously monitors our systems and drives. So, it was of little surprise to us when the folks at NASA confirmed the summer of 2023 as Earth’s hottest on record. The effects of this record-breaking summer showed up in our monitoring systems in the form of drive temperature alerts. A given drive in a storage server can heat up for many reasons: it is failing; a fan in the storage server has failed; other components are producing additional heat; the air flow is somehow restricted; and so on. Add in the fact that the ambient temperature within a data center often increases during the summer months, and you can get more temperature alerts.

In reviewing the temperature data for our drives in Q3, we noticed that a small number of drives exceeded the maximum manufacturer’s temperature for at least one day. The maximum temperature for most drives is 60°C, except for the 12TB, 14TB, and 16TB Toshiba drives which have a maximum temperature of 55°C. Of the 259,533 data drives in operation in Q3, there were 354 individual drives (0.14% of the fleet) that exceeded their maximum manufacturer temperature. Of those, only two drives failed, leaving 352 drives which were still operational as of the end of Q3.
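As a rough illustration of how such a check might work against the published Drive Stats data (where SMART attribute 194 typically reports drive temperature in degrees Celsius), here’s a hedged Python sketch. The serial numbers and readings are invented, and a real check would need the full per-model temperature limits:

```python
# Per-model maximums from the post: most drives are rated to 60°C, while the
# 12TB, 14TB, and 16TB Toshiba models are rated to 55°C.
MAX_TEMP_C = {"MG07ACA14TEY": 55}
DEFAULT_MAX_C = 60

def hot_drives(rows):
    """Yield serial numbers whose SMART 194 raw value (temperature, °C)
    exceeds the manufacturer's maximum for that model."""
    for row in rows:
        temp = row.get("smart_194_raw")
        if not temp:
            continue  # some drives don't report a temperature
        limit = MAX_TEMP_C.get(row["model"], DEFAULT_MAX_C)
        if float(temp) > limit:
            yield row["serial_number"]

rows = [  # invented readings, purely for illustration
    {"serial_number": "A1", "model": "ST4000DM000", "smart_194_raw": "61"},
    {"serial_number": "B2", "model": "MG07ACA14TEY", "smart_194_raw": "56"},
    {"serial_number": "C3", "model": "ST4000DM000", "smart_194_raw": "40"},
]
print(list(hot_drives(rows)))  # → ['A1', 'B2']
```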

While temperature fluctuation is part of running data centers and temp alerts like these aren’t unheard of, our data center teams are looking into the root causes to ensure we’re prepared for the inevitability of increasingly hot summers to come.

Will the Temperature Alerts Affect Drive Stats?

The two drives which exceeded their maximum temperature and failed in Q3 have been removed from the Q3 AFR calculations. Both drives were 4TB Seagate drives (model: ST4000DM000). Given that the remaining 352 drives which exceeded their temperature maximum did not fail in Q3, we have left them in the Drive Stats calculations for Q3 as they did not increase the computed failure rates.

Beginning in Q4, we will remove the 352 drives from the regular Drive Stats AFR calculations and create a separate cohort of drives to track that we’ll name Hot Drives. This will allow us to track the drives which exceeded their maximum temperature and compare their failure rates to those drives which operated within the manufacturer’s specifications. While there are a limited number of drives in the Hot Drives cohort, it could give us some insight into whether exposure to high temperatures causes drives to fail more often. This heightened level of monitoring will also help us detect and deal with any increase in drive failures expeditiously.

New Drive Stats Data Fields in Q3

In Q2 2023, we introduced three new data fields that we started populating in the Drive Stats data we publish: vault_id, pod_id, and is_legacy_format. In Q3, we are adding three more fields to each drive record as follows:

  • datacenter: The Backblaze data center where the drive is installed, currently one of these values: ams5, iad1, phx1, sac0, and sac2.
  • cluster_id: The name of a given collection of storage servers logically grouped together to optimize system performance. Note: At this time the cluster_id is not always correct; we are working on fixing that. 
  • pod_slot_num: The physical location of a drive within a storage server. The specific slot differs based on the storage server type and capacity: Backblaze (45 drives), Backblaze (60 drives), Dell (26 drives), or Supermicro (60 drives). We’ll dig into these differences in another post.

With these additions, the new schema beginning in Q3 2023 is:

  • date
  • serial_number
  • model
  • capacity_bytes
  • failure
  • datacenter (Q3)
  • cluster_id (Q3)
  • vault_id (Q2)
  • pod_id (Q2)
  • pod_slot_num (Q3)
  • is_legacy_format (Q2)
  • smart_1_normalized
  • smart_1_raw
  • The remaining SMART value pairs (as reported by each drive model)

Beginning in Q3, these data fields have been added to the publicly available Drive Stats files that we publish each quarter. 
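To show how the new fields slot into the existing files, here’s a small sketch that tallies failures by data center from a CSV using the Q3 schema. The rows below are invented and the field list is abridged, but the column names match the schema above:

```python
import csv
import io

# Two invented sample rows using an abridged version of the Q3 2023 schema.
sample = io.StringIO(
    "date,serial_number,model,capacity_bytes,failure,datacenter,vault_id\n"
    "2023-09-30,S1,ST6000DX000,6001175126016,0,phx1,1200\n"
    "2023-09-30,S2,WUH721816ALE6L0,16000900661248,1,sac0,1042\n"
)

failures_by_dc = {}
for row in csv.DictReader(sample):
    dc = row["datacenter"] or "unknown"  # the field can occasionally be blank
    failures_by_dc[dc] = failures_by_dc.get(dc, 0) + int(row["failure"])

print(failures_by_dc)  # → {'phx1': 0, 'sac0': 1}
```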

Failure Rates by Data Center

Now that we have the data center for each drive we can compute the AFRs for the drives in each data center. Below you’ll find the AFR for each of five data centers for Q3 2023.

A chart showing Backblaze annualized failure rates by data center.

Notes and Observations

  • Null?: The drives which reported a null or blank value for their data center are grouped in four Backblaze vaults. David, the Senior Infrastructure Software Engineer for Drive Stats, described the process of how we gather all the parts of the Drive Stats data each day. The TL;DR is that vaults can be too busy to respond at the moment we ask, and since the data center field is nice-to-have data, we get a blank field. We can go back a day or two to find the data center value, which we will do in the future when we report this data.
  • sac0?: sac0 has the highest AFR of all of the data centers, but it also has the oldest drives—nearly twice as old, on average, as the next oldest data center, sac2. As discussed previously, drive failures do seem to follow the “bathtub curve”, although recently we’ve seen the curve start out flatter. Regardless, as drive models age, they do generally fail more often. Another factor could be that sac0, and to a lesser extent sac2, has some of the oldest Storage Pods, including a handful of 45-drive units. We are in the process of using CVT to replace these older servers while migrating from 4TB to 16TB and larger drives.
  • iad1: The iad data center is the foundation of our eastern region and has been growing rapidly since coming online about a year ago. The growth is a combination of new data and customers using our cloud replication capability to automatically make a copy of their data in another region.
  • Q3 Data: This chart is for Q3 data only and includes all the data drives, including those with less than 60 drives per model. As we track this data over the coming quarters, we hope to get some insight into whether different data centers really have different drive failure rates, and, if so, why.

Lifetime Hard Drive Failure Rates

As of September 30, 2023, we were tracking 259,084 hard drives used to store customer data. For our lifetime analysis, we collect the number of drive days and the number of drive failures for each drive beginning from the time a drive was placed into production in one of our data centers. We group these drives by model, then sum up the drive days and failures for each model over their lifetime. That chart is below. 

A chart showing Backblaze lifetime hard drive failure rates.

One of the most important columns on this chart is the confidence interval, which is the difference between the low and high AFR confidence levels calculated at 95%. The lower the value, the more certain we are of the AFR stated. We like a confidence interval to be 0.5% or less. When the confidence interval is higher, that is not necessarily bad; it just means we either need more data or the data is somewhat inconsistent. 
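As an illustration of how such an interval can be approximated, here’s a short Python sketch. This uses a common Poisson-based approximation, not necessarily the exact method Backblaze uses, and the drive counts are invented:

```python
import math

def afr_confidence_interval(failures: int, drive_days: int, z: float = 1.96):
    """Approximate 95% confidence interval for AFR (%), treating the
    failure count as Poisson-distributed with std dev sqrt(failures)."""
    drive_years = drive_days / 365
    afr = failures / drive_years * 100
    half_width = z * math.sqrt(failures) / drive_years * 100
    return max(afr - half_width, 0.0), afr + half_width

# Hypothetical drive model: 100 failures over 4 million drive days.
low, high = afr_confidence_interval(failures=100, drive_days=4_000_000)
print(round(high - low, 2))  # → 0.36, i.e., an interval comfortably under 0.5%
```

More drive days (and thus more observed failures) shrink the interval relative to the AFR, which is why the high-drive-day models give the most trustworthy numbers.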

The table below contains just those drive models which have a confidence interval of less than 0.5%. We have sorted the list by drive size and then by AFR.

A chart showing Backblaze hard drive annualized failure rates with a confidence interval of less than 0.5%.

The 4TB, 6TB, 8TB, and some of the 12TB drive models are no longer in production. The HGST 12TB models in particular can still be found, but they have been relabeled as Western Digital and given alternate model numbers. Whether they have materially changed internally is not known, at least to us.

One final note about the lifetime AFR data: you might have noticed the AFR for all of the drives hasn’t changed much from quarter to quarter. It has vacillated between 1.39% and 1.45% for the last two years. Basically, we have lots of drives with lots of time-in-service, so it is hard to move the needle up or down. While the lifetime stats for individual drive models can be very useful, the lifetime AFR for all drives will probably get less and less interesting as we add more and more drives. Of course, a few hundred thousand drives that never fail could arrive, so we will continue to calculate and present the lifetime AFR.

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Stats Data webpage. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free. 

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q3 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AI 101: Training vs. Inference

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-training-vs-inference/

A decorative image depicting a neural network identifying a cat.

What do Sherlock Holmes and ChatGPT have in common? Inference, my dear Watson!

“We approached the case, you remember, with an absolutely blank mind, which is always an advantage. We had formed no theories. We were simply there to observe and to draw inferences from our observations.”
—Sir Arthur Conan Doyle, The Adventures of the Cardboard Box

As we all continue to refine our thinking around artificial intelligence (AI), it’s useful to define terminology that describes the various stages of building and using AI algorithms—namely, the AI training stage and the AI inference stage. As we see in the quote above, these are not new concepts: they’re based on ideas and methodologies that have been around since before Sherlock Holmes’ time. 

If you’re using AI, building AI, or just curious about AI, it’s important to understand the difference between these two stages so you understand how data moves through an AI workflow. That’s what I’ll explain today.


The difference between these two terms can be summed up fairly simply: first you train an AI algorithm, then your algorithm uses that training to make inferences from data. To create a whimsical analogy, when an algorithm is training, you can think of it like Watson—still learning how to observe and draw conclusions through inference. Once it’s trained, it’s an inferring machine, a.k.a. Sherlock Holmes. 

Whimsy aside, let’s dig a little deeper into the tech behind AI training and AI inference, the differences between them, and why the distinction is important. 

Obligatory Neural Network Recap

Neural networks have emerged as the brainpower behind AI, and a basic understanding of how they work is foundational when it comes to understanding AI.  

Complex decisions, in theory, can be broken down into a series of yeses and nos, which means that they can be encoded in binary. Neural networks have the ability to combine enough of those smaller decisions, weigh how they affect each other, and then use that information to solve complex problems. And, because more complex decisions require more points of information to come to a final decision, they require more processing power. Neural networks are one of the most widely used approaches to AI and machine learning (ML). 

A diagram showing the inputs, hidden layers, and outputs of a neural network.
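To make the recap above concrete, here’s a toy single neuron in Python: weighted yes/no-style inputs are combined and squashed into a score between 0 and 1. The weights and inputs here are arbitrary, purely for illustration:

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: a weighted sum of inputs passed
    through a sigmoid, which squashes the result into (0, 1)."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))

# Combine two small yes/no signals into one decision score.
print(round(neuron([1.0, 0.0], weights=[2.0, -1.0], bias=-0.5), 3))  # → 0.818
```

A neural network is, loosely, many of these stacked in layers, with the output of one layer feeding the inputs of the next.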

What Is AI Training?: Understanding Hyperparameters and Parameters

In simple terms, training an AI algorithm is the process through which you take a base algorithm and then teach it how to make the correct decision. This process requires large amounts of data, and can include various degrees of human oversight. How much data you need has a relationship to the number of parameters you set for your algorithm as well as the complexity of a problem. 

We made this handy dandy diagram to show you how data moves through the training process:

A diagram showing how data moves through an AI training algorithm.
As you can see in this diagram, the end result is model data, which then gets saved in your data store for later use.

And hey—we’re leaving out a lot of nuance in that conversation because dataset size, parameter choice, etc. is a graduate-level topic on its own, and usually is considered proprietary information by the companies who are training an AI algorithm. It suffices to say that dataset size and number of parameters are both significant and have a relationship to each other, though it’s not a direct cause/effect relationship. And, both the number of parameters and the size of the dataset affect things like processing resources—but that conversation is outside of scope for this article (not to mention a hot topic in research). 

As with everything, your use case determines your execution. Some types of tasks actually see excellent results with smaller datasets and more parameters, whereas others require more data and fewer parameters. Bringing it back to the real world, here’s a very cool graph showing how many parameters different AI systems have. Note that they very helpfully identified what type of task each system is designed to solve:

So, let’s talk about what parameters are with an example. Back in our very first AI 101 post, we talked about ways to frame an algorithm in simple terms: 

Machine learning does not specify how much knowledge the bot you’re training starts with—any task can have more or fewer instructions. You could ask your friend to order dinner, or you could ask your friend to order you pasta from your favorite Italian place to be delivered at 7:30 p.m. 

Both of those tasks you just asked your friend to complete are algorithms. The first algorithm requires your friend to make more decisions to execute the task at hand to your satisfaction, and they’ll do that by relying on their past experience of ordering dinner with you—remembering your preferences about restaurants, dishes, cost, and so on. 

The factors that help your friend make a decision about dinner are called hyperparameters and parameters. Hyperparameters are those that frame the algorithm—they are set outside the training process, but can influence the training of the algorithm. In the example above, a hyperparameter would be how you structure your dinner feedback. Do you thumbs up or down each dish? Do you write a short review? You get the idea. 

Parameters are factors that the algorithm derives through training. In the example above, that’s what time you prefer to eat dinner, which restaurants you enjoy, and so on. 
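Here’s a tiny, hand-rolled training loop that makes the distinction concrete (a toy sketch, not any production training method): the learning rate is a hyperparameter we choose before training, while the weight w is a parameter the algorithm learns from data.

```python
# Fit y ≈ w * x with gradient descent on a tiny dataset where y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

learning_rate = 0.05  # hyperparameter: set by us, outside training
w = 0.0               # parameter: adjusted by the algorithm during training

for _ in range(200):  # the training loop
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad

print(round(w, 2))  # → 2.0, the relationship hidden in the data
```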

When you’ve trained a neural network, there will be heavier weights between various nodes. That’s a shorthand way of saying that an algorithm will prefer a path it knows is significant. If you want to really get nerdy with it, this article is well-researched, has a ton of math explainers for various training methods, and includes some fantastic visuals. For our purposes, here’s one way people visualize a “trained” algorithm: 

An image showing a neural network that has prioritized certain pathways after training.

The “dropout method” randomly deactivates a portion of the network’s nodes during training so the algorithm can’t over-rely on any single pathway. The relationships that consistently prove significant for the dataset it’s working on end up with heavier weights, while the others are de-prioritized (or sometimes even eliminated). 
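Here’s a minimal sketch of “inverted” dropout, the variant most frameworks implement (a toy illustration, not tied to any particular library): during training, each activation is randomly zeroed with some probability and the survivors are scaled up so the expected total stays the same; at inference, dropout is simply switched off.

```python
import random

def dropout(activations, rate=0.5, training=True, seed=None):
    """Inverted dropout: zero each activation with probability `rate` during
    training, scaling survivors by 1 / (1 - rate); pass through at inference."""
    if not training:
        return list(activations)  # inference: dropout is disabled
    rng = random.Random(seed)
    return [0.0 if rng.random() < rate else a / (1 - rate) for a in activations]

acts = [0.2, 0.9, 0.4, 0.7]
print(dropout(acts, training=False))  # → [0.2, 0.9, 0.4, 0.7]
```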

Once you have a trained algorithm, then you can use it with a reasonable degree of certainty that it will give you good results, and that leads us to inference. 

What Is AI Inference?

Once you’ve trained your algorithm, you can send it out in the world to do its job (and make yours easier). When you present a trained AI algorithm with a problem and it gives you an answer, that’s called inference. It’s using the way it was trained to draw conclusions or make predictions, depending on how it was built, and once an algorithm is in the “inference stage”, it’s no longer learning (usually). 

Here’s our diagram for how data might move through an inference process: 

A diagram showing how data moves through an inference workflow.
As you can see, if we want to add more training data to the algorithm, we’d do so as a separate input/output flow. The goal of this process is to output data.

As we’ve seen in the real world, this does not mean that the AI model will give you a perfect answer every time—depending on, for example, the AI’s access to new or current information, the complexity of training, etc., the algorithm might give you a better or worse (more or less accurate) answer. But, it will be able to look at a new piece of information or dataset, and return a response. 

When I asked our friend ChatGPT about this, it broke down inference and defined it in relation to the logical underpinnings of different types of algorithms. (As a mere human, how could I resist using inference while writing about inference?) Sure, ChatGPT may be merely mimicking human-like dialogue based on its ingestion of the whole internet, then giving me the most likely answer based on its specialized neural network that predicts the likelihood of the next word in response to my conversational query and without access to data after January 2022, but here’s what our AI friend had to say for itself: 

  • Machine Learning Inference: In machine learning, after a model has been trained on a dataset, it is deployed to make predictions or classifications on new, unseen data. During inference, the model takes the input data, processes it, and produces an output or a prediction based on the patterns and relationships it learned during training. For example, a trained machine learning model might be used to classify images, predict stock prices, or identify spam emails.
  • Deep Learning Inference: Deep learning, a subfield of machine learning, involves neural networks with multiple layers. In deep learning, inference means feeding an input through a trained deep neural network and obtaining an output. This is commonly used in applications like image recognition, natural language processing, and speech recognition. Inference in deep learning can be computationally intensive, and specialized hardware, such as GPUs and TPUs, are often used to accelerate the process.
  • Rule-Based Systems Inference: In rule-based AI systems, inference is the process of applying logical rules to make decisions or draw conclusions. These systems use a set of predefined rules and facts to reason and infer new information. For example, in an expert system for medical diagnosis, inference involves applying a set of rules to patient symptoms and medical knowledge to arrive at a diagnosis.
  • Probabilistic Inference: In probabilistic AI, inference is used to estimate probabilities and uncertainty. Bayesian networks and Markov decision processes are examples of probabilistic models that perform inference to compute probabilities of events or make decisions based on probabilities.

You’ll notice that each of these are saying basically the same thing: the AI algorithm applies its decision-making paradigm to a problem. 

Why Stop Learning During the Inference Stage?

In general, it’s important to keep these two stages—training and inference—of an AI algorithm separate for a few reasons: 

  • Efficiency: Training is typically a computationally intensive process, whereas inference is usually faster and less resource-intensive. Separating them allows for efficient use of computational resources.
  • Generalization: The model’s ability to generalize from training data to unseen data is a key feature. It should not learn from every new piece of data it encounters during inference to maintain this generalization ability.
  • Reproducibility: When using trained models in production or applications, it’s important to have consistency and reproducibility in the results. If models were allowed to learn during inference, it would introduce variability and unpredictability in their behavior.

There are some specialized AI algorithms that are designed to continue learning during the inference stage—your Netflix recommendation algorithm is a good example, as are self-driving cars and dynamic pricing models used to set airfares. On the other hand, the majority of problems we’re trying to solve with AI algorithms deliver better decisions by separating these two phases—think of things like image recognition, language translation, or medical diagnosis, for example.

Training vs. Inference (But, Really: Training Then Inference)

To recap: the AI training stage is when you feed data into your learning algorithm to produce a model, and the AI inference stage is when your algorithm uses that training to make inferences from data. Here’s a chart for quick reference: 

| Training | Inference |
|---|---|
| Feed training data into a learning algorithm. | Apply the model to inference data. |
| Produces a model comprising code and data. | Produces output data. |
| One time(ish). Retraining is sometimes necessary. | Often continuous. |

The difference may seem inconsequential at first glance, but defining these two stages helps to show the implications for AI adoption, particularly for businesses. That is, because inference is much less resource-intensive (and therefore less expensive) than training, it’s likely to be much easier for businesses to integrate already-trained AI algorithms with their existing systems. 

And, as always, we’re big believers in demystifying terminology for discussion purposes. Let us know what you think in the comments, and feel free to let us know what you’re interested in learning about next.

The post AI 101: Training vs. Inference appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Anything as a Service: All the “as a Service” Acronyms You Didn’t Know You Needed

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/anything-as-a-service-all-the-as-a-service-acronyms-you-didnt-know-you-needed/

A decorative image showing acronyms for different "as as service" acronyms.

Have you ever felt like you need a dictionary just to understand what tech-savvy folks are talking about? Well, you’re in luck, because we’re about to decode some of the most common jargon of the digital age, one acronym at a time. Welcome to the world of “as a Service” acronyms, where we take the humble alphabet and turn it into a digital buffet. 

So, whether you’re SaaS-savvy or PaaS-puzzled, or just someone desperately searching for a little HaaS (Humor as a Service …yeah, we made that one up), you’ve come to the right place. Let’s take a big slurp from this alphabet soup of tech terms.

The One That Started It All: SaaS

SaaS stands for software as a service, and it’s the founding member of the “as a service” nomenclature. (Though, very confusingly, there’s also Sales as a Service—it’s just not shortened to SaaS. Usually.)

Imagine your software as a pizza delivery service. You don’t need to buy all the ingredients, knead the dough, and bake it yourself. Instead, you simply order a slice, and it magically appears on your table (a.k.a. screen). SaaS products are like that, but instead of pizza they serve up everything from messaging to video conferencing to email marketing to …well, really you name it. Which brings us to…

The Kind of Ironic One: XaaS

XaaS stands for, variously, “everything” or “anything” as a service. No one is really sure about the term’s provenance, but it’s a fair guess to say it came into existence when, well, everything started to become a service, probably sometime around the mid-2010s. The thinking is: if it exists in the digital realm, you can probably get it “as a service.” 

The Hardware Related Ones: HaaS, IaaS, and PaaS

HaaS (Hardware as a Service): Instead of purchasing hardware yourself, like computers, servers, networking equipment, and other physical infrastructure components, with HaaS, you can lease or rent the equipment for a given period. It would be like renting a pizza kitchen to make your specialty pies specifically for your sister’s wedding or your grandma’s birthday.

IaaS (Infrastructure as a Service): Infrastructure as a service is kind of like hardware as a service, but it comes with some additional goodies thrown in. Instead of renting just the kitchen, you rent the whole restaurant—chairs, tables, and servers (no pun intended) included. IaaS delivers virtualized computing resources, like virtual machines, storage (that’s us!), and networking, over the internet.

PaaS (Platform as a Service): Think of PaaS as a step even further than IaaS—you’re not just renting a pizza restaurant, you’re renting a test kitchen where you can develop your award-winning pie. PaaS provides developers the ability to build, manage, and deploy applications with services like development frameworks, databases, and infrastructure management. It’s the ultimate DIY platform for tech enthusiasts.

The Bad One: RaaS

RaaS stands for Ransomware as a Service, and this is one “as a service” variant you don’t want to mess with. Basically, cybercriminals can purchase ransomware just as easily as you would purchase any app on the app store (it’s probably more complicated than that, but you get the general gist). This makes it easy for even the least savvy cybercriminal to get into the ransomware game. Not great. 

The Ones That Help With the Last One: BaaS and DRaaS

BaaS (Backup as a Service): Backup as a Service is a cloud-based data protection solution that allows individuals and organizations to back up their data to a remote cloud. (Hey! That’s us too!) Instead of managing on-premises backup infrastructure, users can securely store their data off-site, often on highly redundant and geographically distributed servers.

DRaaS (Disaster Recovery as a Service): DRaaS stands for disaster recovery as a service, and it’s the antidote to RaaS. Of course, you need good backups to begin with, but adding DRaaS allows businesses to ensure specific recovery time objectives (RTOs, FYI) so they can get back up and running in the event they’re attacked by ransomware or there’s a natural disaster at their primary storage location. DRaaS solutions used to be made almost exclusively with the large enterprise in mind, but today, it’s possible to architect a DRaaS solution for your business affordably and easily.

The Analytical One: DaaS

DaaS stands for data as a service, and it’s your data’s personal chauffeur. It fetches the information you need and serves it up on a silver platter. DaaS offers data on-demand, making structured data accessible to users over the internet. It simplifies data sharing and access, often in real-time, without the need for complex data management.

The Development-Focused Ones: CaaS, BaaS (again), and FaaS

CaaS (Containers as a Service): CaaS simplifies the deployment, scaling, and orchestration of containerized applications. It’s the tech version of a literal container ship. The individual containers “ship” individual pieces of software, and a CaaS tool helps carry all of those individual containers. Check out container management software Docker’s logo for a visualization:

It looks more like a whale carrying containers, which is far more adorable, in our opinion.

BaaS (Backend as a Service): It wouldn’t be the first time an acronym has two meanings. BaaS, in this context, provides a backend infrastructure for mobile and web app developers, offering services like databases, user authentication, and APIs. Imagine your own team of digital butlers tending to the back end of your apps. They handle all the behind-the-scenes stuff, so you can focus on making your app shine. 

FaaS (Function as a Service): FaaS is a serverless computing model where developers focus on writing and deploying individual functions or code snippets. These functions run in response to specific events, promoting scalability and efficiency in application development. It’s like having a team of tiny, code-savvy robots doing your bidding.

Go Forth and Abbreviate

Now that you’ve sampled all of the flavors the vast “as a service” world has to offer, we hope you’ve gained a clearer understanding of these sometimes confounding terms. So whether you’re a business professional navigating the cloud or just curious about the tech world, you can wield these acronyms with confidence. 

Did we miss any? I’m sure. Let us know in the comments.

The post Anything as a Service: All the “as a Service” Acronyms You Didn’t Know You Needed appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How We Achieved Upload Speeds Faster Than AWS S3

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/2023-performance-improvements/

An image of a city skyline with lines going up to a cloud.

You don’t always need the absolute fastest cloud storage—your performance requirements depend on your use case, business objectives, and security needs. But still, faster is usually better. And Backblaze just announced innovation on B2 Cloud Storage that delivers a lot more speed: most file uploads will now be up to 30% faster than AWS S3. 

Today, I’m diving into all of the details of this performance improvement, how we did it, and what it means for you.


The Results: Customers who rely on small file uploads (1MB or less) can expect to see 10–30% faster uploads on average based on our tests, all without any change to durability, availability, or pricing. 

What Does This Mean for You? 

All B2 Cloud Storage customers will benefit from these performance enhancements, especially those who use Backblaze B2 as a storage destination for data protection software. Small uploads of 1MB or less make up about 70% of all uploads to B2 Cloud Storage and are common for backup and archive workflows. Specific benefits of the performance upgrades include:

  • Secures data in offsite backups faster.
  • Frees up time for IT administrators to work on other projects.
  • Decreases congestion on network bandwidth.
  • Deduplicates data more efficiently.

Veeam® is dedicated to working alongside our partners to innovate and create a united front against cyber threats and attacks. The new performance improvements released by Backblaze for B2 Cloud Storage furthers our mission to provide radical resilience to our joint customers.

—Andreas Neufert, Vice President, Product Management, Alliances, Veeam

When Can I Expect Faster Uploads?

Today. The performance upgrades have been fully rolled out across Backblaze’s global data regions.

How We Did It

Prior to this work, when a customer uploaded a file to Backblaze B2, the data was written to multiple hard disk drives (HDDs). Those operations had to be completed before returning a response to the client. Now, we write the incoming data to the same HDDs and also, simultaneously, to a pool of solid state drives (SSDs) we call a “shard stash,” waiting only for the HDD writes to make it to the filesystems’ in-memory caches and the SSD writes to complete before returning a response. Once the writes to HDD are complete, we free up the space from the SSDs so it can be reused.

Since writing data to an SSD is much faster than writing to HDDs, the net result is faster uploads. 

That’s just a brief summary; if you’re interested in the technical details (as well as the results of some rigorous testing), read on!

The Path to Performance Upgrades

As you might recall from many Drive Stats blog posts and webinars, Backblaze stores all customer data on HDDs, affectionately termed ‘spinning rust’ by some. We’ve historically reserved SSDs for Storage Pod (storage server) boot drives. 

Until now. 

That’s right—SSDs have entered the data storage chat. To achieve these performance improvements, we combined the performance of SSDs with the cost efficiency of HDDs. First, I’ll dig into a bit of history to add some context to how we went about the upgrades.


IBM shipped the first hard drive way back in 1957, so it’s fair to say that the HDD is a mature technology. Drive capacity and data rates have steadily increased over the decades while cost per byte has fallen dramatically. That first hard drive, the IBM RAMAC 350, had a total capacity of 3.75MB, and cost $34,500. Adjusting for inflation, that’s about $375,000, equating to $100,000 per MB, or $100 billion per TB, in 2023 dollars.

A photograph of people pushing one of the first hard disk drives into a truck.
An early hard drive shipped by IBM. Source.

Today, the 16TB version of the Seagate Exos X16—an HDD widely deployed in the Backblaze B2 Storage Cloud—retails for around $260, or $16.25 per TB. If it had the same cost per byte as the IBM RAMAC 350, it would sell for $1.6 trillion—around the current GDP of Spain!

SSDs, by contrast, have only been around since 1991, when SanDisk’s 20MB drive shipped in IBM ThinkPad laptops for an OEM price of about $1,000. Let’s consider a modern SSD: the 3.2TB Micron 7450 MAX. Retailing at around $360, the Micron SSD is priced at $112.50 per TB, nearly seven times as much as the Seagate HDD.
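Those per-terabyte figures are simple division; here's a quick sanity check in Python using the street prices quoted above:

```python
def cost_per_tb(drive_cost_usd: float, capacity_tb: float) -> float:
    """Street price divided by capacity."""
    return drive_cost_usd / capacity_tb

hdd = cost_per_tb(260, 16)    # Seagate Exos X16
ssd = cost_per_tb(360, 3.2)   # Micron 7450 MAX

print(f"HDD: ${hdd:.2f}/TB, SSD: ${ssd:.2f}/TB, ratio: {ssd / hdd:.1f}x")
```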

So, HDDs easily beat SSDs in terms of storage cost, but what about performance? Here are the numbers from the manufacturers’ data sheets:

                                      Seagate Exos X16   Micron 7450 MAX
Model number                          ST16000NM001G      MTFDKCB3T2TFS
Capacity                              16TB               3.2TB
Drive cost                            $260               $360
Cost per TB                           $16.25             $112.50
Max sustained read rate (MB/s)        261                6,800
Max sustained write rate (MB/s)       261                5,300
Random read rate, 4kB blocks, IOPS    170/440*           1,000,000
Random write rate, 4kB blocks, IOPS   170/440*           390,000

Since HDD platters rotate at a constant rate, 7,200 RPM in this case, they can transfer more blocks per revolution at the outer edge of the disk than close to the middle—hence the two figures for the X16’s transfer rate.

The SSD is over 20 times as fast at sustained data transfer as the HDD, but look at the difference in random transfer rates! Even when the HDD is at its fastest, transferring blocks from the outer edge of the disk, the SSD is over 2,200 times faster at reading data and nearly 900 times faster at writing.

This massive difference is due to the fact that, when reading data from random locations on the disk, the platters have to complete an average of 0.5 revolutions between blocks. At 7,200 rotations per minute (RPM), that means that the HDD spends about 4.2ms just spinning to the next block before it can even transfer data. In contrast, the SSD’s data sheet quotes its latency as just 80µs (that’s 0.08ms) for reads and 15µs (0.015ms) for writes, between 52 and 280 times faster than the spinning disk.

Let’s consider a real-world operation, say, writing 64kB of data. Assuming the HDD can write that data to sequential disk sectors, it will spin for an average of 4.2ms, then spend 0.25ms writing the data to the disk, for a total of about 4.5ms. The SSD, in contrast, can write the data to any location with no mechanical delay at all, taking just 27µs (0.027ms) to do so. This (somewhat theoretical) 167x speed advantage is the basis for the performance improvement.
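That speed advantage is easy to reproduce from the data sheet figures above. Here's a sketch of the arithmetic (the exact ratio comes out slightly lower than the rounded 167x in the text):

```python
# Figures from the manufacturers' data sheets quoted above.
HDD_RPM = 7200
HDD_WRITE_MBPS = 261          # max sustained write rate
SSD_WRITE_MBPS = 5300
SSD_WRITE_LATENCY_S = 15e-6   # 15 µs

def hdd_write_time(size_bytes: int) -> float:
    # Average rotational delay: half a revolution before the
    # target sector passes under the head.
    seek = 0.5 * 60 / HDD_RPM
    transfer = size_bytes / (HDD_WRITE_MBPS * 1e6)
    return seek + transfer

def ssd_write_time(size_bytes: int) -> float:
    # No moving parts: just the quoted write latency plus transfer time.
    return SSD_WRITE_LATENCY_S + size_bytes / (SSD_WRITE_MBPS * 1e6)

size = 64 * 1024   # one 64kB shard
hdd_t, ssd_t = hdd_write_time(size), ssd_write_time(size)
print(f"HDD: {hdd_t * 1e3:.2f} ms, SSD: {ssd_t * 1e6:.0f} µs, "
      f"advantage: {hdd_t / ssd_t:.0f}x")
```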

Why did I choose a 64kB block? As we mentioned in a recent blog post focusing on cloud storage performance, in general, bigger files are better when it comes to the aggregate time required to upload a dataset. However, there may be other requirements that push for smaller files. Many backup applications split data into fixed size blocks for upload as files to cloud object storage. There is a trade-off in choosing the block size: larger blocks improve backup speed, but smaller blocks reduce the amount of storage required. In practice, backup blocks may be as small as 1MB or even 256kB. The 64kB blocks we used in the calculation above represent the shards that comprise a 1MB file.

The challenge facing our engineers was to take advantage of the speed of solid state storage to accelerate small file uploads without breaking the bank.

Improving Write Performance for Small Files

When a client application uploads a file to the Backblaze B2 Storage Cloud, a coordinator pod splits the file into 16 data shards, creates four additional parity shards, and writes the resulting 20 shards to 20 different HDDs, each in a different Pod.

Note: As HDD capacity increases, so does the time required to recover after a drive failure, so we periodically adjust the ratio between data shards and parity shards to maintain our eleven nines durability target. In the past, you’ve heard us talk about 17 + 3 as the ratio, but we also run 16 + 4, and our very newest vaults use a 15 + 5 scheme.
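The shard math is easy to sketch. The snippet below splits a 1MB upload into 16 data shards of 64kB each; the XOR parity here is a deliberately simplified stand-in, since the production system uses Reed-Solomon coding, which can rebuild the file from any 16 of the 20 shards:

```python
import functools

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_into_shards(payload: bytes, data_shards: int = 16,
                      parity_shards: int = 4):
    """Split a file into equal-sized data shards and add parity shards."""
    shard_size = -(-len(payload) // data_shards)          # ceiling division
    padded = payload.ljust(shard_size * data_shards, b"\0")
    shards = [padded[i * shard_size:(i + 1) * shard_size]
              for i in range(data_shards)]
    # Simplified parity: XOR every parity_shards-th data shard together.
    # (The real system computes Reed-Solomon parity instead.)
    parity = [functools.reduce(xor_bytes, shards[p::parity_shards])
              for p in range(parity_shards)]
    return shards, parity

upload = b"x" * (1024 * 1024)                 # a 1MB file
data, parity = split_into_shards(upload)
print(len(data), len(parity), len(data[0]))   # 16 4 65536
```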

Each Pod writes the incoming shard to its local filesystem; in practice, this means that the data is written to an in-memory cache and will be written to the physical disk at some point in the near future. Any requests for the file can be satisfied from the cache, but the data hasn’t actually been persistently stored yet.

We need to be absolutely certain that the shards have been written to disk before we return a “success” response to the client, so each Pod executes an fsync system call to transfer (“flush”) the shard data from system memory through the HDD’s write cache to the disk itself before returning its status to the coordinator. When the coordinator has received at least 19 successful responses, it returns a success response to the client. This ensures that, even if the entire data center was to lose power immediately after the upload, the data would be preserved.
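In Python terms (a sketch only; the real Pods are not written in Python), the gap between "in the filesystem cache" and "safely on disk" is exactly the fsync call:

```python
import os
import tempfile

def durable_write(path: str, shard: bytes) -> None:
    with open(path, "wb") as f:
        f.write(shard)        # lands in the filesystem's in-memory cache
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # block until the drive reports the data
                              # has reached persistent media
    # Only now is it safe to report success to the coordinator.

path = os.path.join(tempfile.gettempdir(), "shard_0")
durable_write(path, b"\0" * 65536)
```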

As we explained above, for small blocks of data, the vast majority of the time spent writing the data to disk is spent waiting for the drive platter to spin to the correct location. Writing shards to SSD could result in a significant performance gain for small files, but what about that 7x cost difference?

Our engineers came up with a way to have our cake and eat it too by harnessing the speed of SSDs without a massive increase in cost. Now, upon receiving a file of 1MB or less, the coordinator splits it into shards as before, then simultaneously sends the shards to a set of 20 Pods and a separate pool of servers, each populated with 10 of the Micron SSDs described above—a “shard stash.” The shard stash servers easily win the “flush the data to disk” race and return their status to the coordinator in just a few milliseconds. Meanwhile, each HDD Pod writes its shard to the filesystem, queues up a task to flush the shard data to the disk, and returns an acknowledgement to the coordinator.

Once the coordinator has received replies establishing that at least 19 of the 20 Pods have written their shards to the filesystem, and at least 19 of the 20 shards have been flushed to the SSDs, it returns its response to the client. Again, if power was to fail at this point, the data has already been safely written to solid state storage.

We don’t want to leave the data on the SSDs any longer than we have to, so, each Pod, once it’s finished flushing its shard to disk, signals to the shard stash that it can purge its copy of the shard.
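Here's a toy simulation of that dual-path write, with sleeps standing in for greatly exaggerated device latencies; for brevity, the filesystem-cache acknowledgements from the HDD Pods are elided and the coordinator waits only on the SSD quorum:

```python
import concurrent.futures
import time

# Exaggerated stand-in latencies so the effect is obvious in a demo.
def hdd_write(shard_id: int) -> int:
    time.sleep(0.25)    # wait for the platter to spin, then fsync
    return shard_id

def ssd_write(shard_id: int) -> int:
    time.sleep(0.001)   # no mechanical delay
    return shard_id

def upload(shards: int = 20, quorum: int = 19) -> float:
    """Return how long the client waits for its success response."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=2 * shards) as pool:
        hdd_futures = [pool.submit(hdd_write, i) for i in range(shards)]
        ssd_futures = [pool.submit(ssd_write, i) for i in range(shards)]
        # Ack the client once a quorum of SSD flushes has landed,
        # without waiting on the slow HDD writes.
        for done, _ in enumerate(concurrent.futures.as_completed(ssd_futures), 1):
            if done >= quorum:
                break
        acked_after = time.perf_counter() - start
        # In the background, each completed HDD write lets the shard
        # stash purge its now-redundant SSD copy.
        for future in concurrent.futures.as_completed(hdd_futures):
            future.result()   # in the real system: purge from the stash
    return acked_after

latency = upload()
print(f"client saw success after {latency * 1e3:.1f} ms")
```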

Real-World Performance Gains

As I mentioned above, that calculated 167x performance advantage of SSDs over HDDs is somewhat theoretical. In the real world, the time required to upload a file also depends on a number of other factors—proximity to the data center, network speed, and all of the software and hardware between the client application and the storage device, to name a few.

The first Backblaze region to receive the performance upgrade was U.S. East, located in Reston, Virginia. Over a 12-day period following the shard stash deployment there, the average time to upload a 256kB file was 118ms, while a 1MB file clocked in at 137ms. To replicate a typical customer environment, we ran the test application at our partner Vultr’s New Jersey data center, uploading data to Backblaze B2 across the public internet.

For comparison, we ran the same test against Amazon S3’s U.S. East (Northern Virginia) region, a.k.a. us-east-1, from the same machine in New Jersey. On average, uploading a 256kB file to S3 took 157ms, with a 1MB file taking 153ms.

So, comparing the Backblaze B2 U.S. East region to the Amazon S3 equivalent, we benchmarked the new, improved Backblaze B2 as 30% faster than S3 for 256kB files and 10% faster than S3 for 1MB files.

These low-level tests were confirmed when we timed Veeam Backup & Replication software backing up 1TB of virtual machines with 256kB block sizes. Backing the server up to Amazon S3 took three hours and 12 minutes; we measured the same backup to Backblaze B2 at just two hours and 15 minutes, 40% faster than S3.

Test Methodology

We wrote a simple Python test app using the AWS SDK for Python (Boto3). Each test run involved timing 100 file uploads using the S3 PutObject API, with a 10ms delay between each upload. (FYI, the delay is not included in the measured time.) The test app used a single HTTPS connection across the test run, following best practice for API usage. We’ve been running the test on a VM in Vultr’s New Jersey region every six hours for the past few weeks against both our U.S. East region and its AWS neighbor. Latency to the Backblaze B2 API endpoint averaged 5.7ms, to the Amazon S3 API endpoint 7.8ms, as measured across 100 ping requests.
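A stripped-down sketch of such a harness might look like the following; the `put_object` callable is injected so the same loop can drive any S3 compatible endpoint (here, a stub that just measures loop overhead, since real credentials and bucket names would be needed otherwise):

```python
import time

def time_uploads(put_object, payload: bytes, runs: int = 100,
                 delay: float = 0.01) -> float:
    """Return the mean upload time in seconds over `runs` PutObject calls,
    pausing `delay` seconds between calls (the pause is not timed)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        put_object(payload)
        timings.append(time.perf_counter() - start)
        time.sleep(delay)
    return sum(timings) / len(timings)

# Real use would pass something like:
#   s3 = boto3.client("s3", endpoint_url="https://<your B2 S3 endpoint>")
#   put_object = lambda body: s3.put_object(Bucket="test", Key="obj", Body=body)
mean = time_uploads(lambda body: len(body), b"\0" * 256 * 1024,
                    runs=5, delay=0.001)
print(f"mean upload time: {mean * 1e3:.3f} ms")
```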

What’s Next?

At the time of writing, shard stash servers have been deployed to all of our data centers, across all of our regions. In fact, you might even have noticed small files uploading faster already. It’s important to note that this particular optimization is just one of a series of performance improvements that we’ve implemented, with more to come. It’s safe to say that all of our Backblaze B2 customers will enjoy faster uploads and downloads, no matter their storage workload.

The post How We Achieved Upload Speeds Faster Than AWS S3 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Things That Used to Be Science Fiction (and Aren’t Anymore)

Post Syndicated from Yev original https://www.backblaze.com/blog/things-that-used-to-be-science-fiction-and-arent-anymore/

A decorative image showing a spaceship beaming up the Backblaze logo.

The year is 2023, and the human race has spread across the globe. Nuclear powered flying cars are everywhere, and the first colonies have landed on Mars! [Radio crackles.] 

Okay, so that isn’t exactly how it’s gone down, but in honor of Halloween, a day that celebrates the whimsy of all things being possible, let’s talk about things that used to be science fiction and aren’t anymore.

Artificial Intelligence (AI)

Have we gotten reader fatigue from this topic yet? (As technology people by nature, we’re deep in it.) The rise of generative AI over the past year or so has certainly brought this subject into the spotlight, so in some ways it seems “early” to judge all the ways that AI will change things. On the other hand, there are lots of tools and functions we’ve been using for a while that have been powered by AI algorithms, including AI assistants. 

Shout out to this content creator for a hilarious video.

At the risk of not doing this topic justice in this list, I’ll say that there’s plenty of reporting on—and plenty of potential for—AI now and in the future. 

UFOs

This year, the U.S. House Oversight Committee was conducting an investigation on unidentified flying objects (UFOs). While many UFOs turn out to be things like weather balloons and drones designed for home use, well, some apparently aren’t. Three military veterans, including a former intelligence officer, went on record saying that the government has a secret facility where it’s been reverse engineering highly advanced vehicles, and that the U.S. has recovered “non-human biologics” from these crash sites. (Whatever that means—but we all know what that means, right…) 

Here’s the video, if you want to see for yourself. 

Weirdly, the public response was… not much of one. (The last couple of years have been “a year”.) But, chalk this one up as confirmed. 

Space Stations

The list of sci-fi shows and books set on space stations is definitely too long to list item by item. Depending on your age (and we won’t ask you to tell us), you may think of Isaac Asimov’s Foundation series (the books), Star Wars, Zenon: Girl of the 21st Century (or maybe the Zequel?), Babylon 5, or The Expanse. 

Back in the real world, the International Space Station (ISS) has been in orbit since 1998 and runs all manner of scientific experiments. Notably, these experiments include the Alpha Magnetic Spectrometer (AMS-02), which is designed to detect things like dark matter and antimatter and solve the fundamental mysteries of the universe. (No big deal.) 

For those of us stuck on Earth (for now), you can keep up with the ISS in lots of ways. Check out this map that lets you track whether you can see it from your current location. (Wave the next time it floats over!) And, of course, there are some fun YouTube channels streaming the ISS. Here’s just one:  

Universal Translators

Okay, “universal translator” is the cool sci-fi name, but if you want the actual machine learning (ML) name, folks call it interlingual machine translation. Translation may seem straightforward at first glance, but, as this legendary Star Trek episode demonstrates, things are not always so simple. 

And sure, it’s easy to say that this is an unreasonable standard given that most human languages are known—but are they? Native language reclamation projects like those from the Cherokee and Oneida tribes demonstrate how easy it is to lose the nuance of a language without those who natively speak it. Advanced degrees in poetry translation, like this Master of Fine Arts from the University of Iowa (go Hawks!), help specialists grapple with and capture the nuance between smell, scent, odor, and stench across different languages. And, add to those challenges that translators also have to contend with the wide array of accents in each language. 

With that in mind, it’s pretty amazing that we now have translation devices that can be as small as earbuds. Most still require an internet connection, and some are more effective than others, but it’s safe to say we live in the future, folks. Case in point: I had a wine tasting in Tuscany a few months ago where we used Google Translate exclusively to speak with the winemaker and proprietor. 

iPads

“What?” you say. “iPads are so normal!” Sure, now you’re used to touch screens. But, let me present you with this image from a show that is definitely considered science fiction:

Shockingly, not an iPad.

Yes, folks, that’s Captain Jean Luc Picard from Star Trek: The Next Generation. And here’s a later one, from Star Trek: Deep Space Nine. 

These are plans for the arboretum, so Keiko is probably dropping some knowledge.

Star Trek wikis describe the details of a Personal Access Display Device, or PADD, including a breakdown of how they changed over time in the series. Uhura even had a “digital clipboard” in the original Star Trek series: 

We’d have to revisit the episode to see what this masterful side-eye is about.

And, just for the record, we’ll always have a soft spot in our heart for Chief O’Brien’s love of backups.

Robot Domestic Laborer

If you were ever a fan of this lovely lady—

Rosie the Robot, of course, longtime employee and friend of The Jetsons.

—then you’ll be happy to know that your robot caretaker(s) have arrived. Just as Rosie was often seen using a separate vacuum cleaner, they’re not all integrated into one charming package—yet. If you’re looking for the full suite of domestic help, you’ll have to get a few different products. 

First up, the increasingly popular (and, as time goes on, increasingly affordable) robot vacuum. There are tons of models, from the simple vacuum to the vacuum/mop. While they’re reportedly prone to some pretty hilarious (horrific?) accidents, having one or several of these disk-shaped appliances saves lots of folks lots of time. Bonus: just add cat, and you have adorable internet content in the comfort of your own home. 

Next up, the Snoo, marketed as a smart bassinet, will track everything baby, then use that data to help said baby sleep. Parents who can afford to buy or rent this item sing its praises, noting that you can swaddle the baby for safety and review the data collected to better care for your child. 

And, don’t forget to round out your household with this charming toilet cleaning robot.

Robot Bartenders

In this iconic scene from The Fifth Element, Luc Besson’s 1997 masterpiece, a drunken priest waxes poetic about a perfect being (spoiler: she’s a woman) to a robot bartender. “Do you know what I mean?” the priest asks. The robot shakes its head. “Do you want some more?”

Start at about 2:00 minutes.

These days, you can actually visit robot bartenders in Las Vegas or on Royal Caribbean cruise ships. Or, if you’re looking for a robot bartender that does more than serve up a great Sazerac, you can turn to Brillo, a robot bartender powered by AI who can also engage in complex dialogue. 

Please politely ignore that his face is the stuff of nightmares…it’s what’s on the inside (and in the glass) that counts.

And, if leaving your house sounds terrible, don’t worry: you can also get a specialized appliance for your home. 

It’s a Good Time to Be Cloud Storage

One thing that all these current (and future) tech developments have in common: you never see them trailing wires. That means (you guessed it!) that they’re definitely using a central data source delivered via wireless network, a.k.a. the cloud.

After you’ve done all the work to, say, study an alien life form or design and program the perfect cocktail, you definitely don’t want to do that work twice. And, do you see folks slowing down to schedule a backup? Definitely not. Easy, online, always updating backups are the way to go.

So, we’re not going to say Backblaze Computer Backup makes the list as a sci-fi idea that we’ve made real; we’re just saying that it’s probably one of those things that people leave off-stage, like characters brushing their teeth on a regular basis. And, past or future, we’re here to remind you that you should always back up your data.

Things We Still Want (Get On It, Scientists!) 

Everything we just listed is really cool and all, but let’s not forget that we are still waiting for some very important things. Help us out, scientists; we really need these: 

  • Flying cars
  • Faster than light space travel
  • Teleportation 
  • Matter replicators (3D printing isn’t quite there)

We feel compelled to add that, despite our jocular tone, the line between science and science fiction has always been something of a thin one. Studies have shown, and inventors like Motorola’s Martin Cooper have gone on record confirming, that real inventions often find their inspiration in the imaginative works of science fiction. 

So, that leaves us standing by for new developments! Let’s see what 2024 brings. Let us know in the comments section what cool tech in your life fits this brief.

The post Things That Used to Be Science Fiction (and Aren’t Anymore) appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

HYCU + Backblaze: Protecting Against Data Loss

Post Syndicated from Jennifer Newman original https://www.backblaze.com/blog/hycu-backblaze-protecting-against-data-loss/

A decorative image showing the Backblaze and HYCU logos.

Backblaze and HYCU, the fastest-growing leader in multi-cloud data protection as a service (DPaaS), are teaming up to provide businesses a complete backup solution for modern workloads with low-cost, scalable infrastructure—a must-have for modern cyber resilience needs.

Read on to learn more about the partnership, how you can benefit from affordable, agile data protection, and a little bit about a relevant ancient poetic art form.

HYCU + Backblaze: The Power of Collaboration

Within HYCU’s DPaaS platform, shared customers can now select Backblaze B2 Cloud Storage—an S3 compatible object storage platform that provides highly durable, instantly available hot storage—as a destination for their HYCU backups. 

With more applications in use across the modern data center, visibility and the ability to protect that mission-critical data has never been at more of a premium. Our collaboration with Backblaze now offers joint customers a cost-effective and scalable data protection solution combining the best in backup and recovery with Backblaze’s streamlined and secure cloud storage.

—Subbiah Sundaram, SVP Product, HYCU, Inc.

The Data Sprawl Problem

On average, businesses and organizations have upwards of 200 different sets of data or “data silos” spread across a growing number of applications, databases, and physical locations. This data sprawl isn’t just hard to manage, it opens up more opportunities for cybercriminals to inject ransomware and gain access to systems. 

HYCU gives customers the power to protect every byte while also managing all their business critical data in one place. Powered by the world’s first development platform for data protection, HYCU is the only DPaaS platform that can scale to protect all of your data—wherever it resides. Most importantly, it gives customers the ability to recover from disaster almost instantly, keeping them online and in business, with an average recovery time of 10 minutes. 

Backblaze and HYCU:

Keeping data safe for all

at one-fifth the cost.

By combining HYCU data protection with Backblaze B2 Storage Cloud, customers can see up to 80% lower costs in comparison to using providers like AWS for their storage, which means that combining the two can be a force multiplier for a businesses’ ability to fully protect their data and scale efficiently and reliably.

Data protection:

Once challenging, now easy—

HYCU and B2.

The partnership offers the following benefits:

  • Performance: With a 99.9% uptime service level agreement (SLA) and no cold delays or speed premiums, storing data in Backblaze B2 Cloud Storage means joint customers have instant access to their data whenever and wherever they need it. 
  • Affordability: Existing customers can reduce their total cost of ownership by switching backup tiers with interoperable S3 compatible storage, and institutions and businesses who may not have been able to afford hyperscaler-based solutions can now protect their data.
  • Compliance and Security: With Backblaze B2’s Object Lock feature, the partnership also offers an additional layer of security through data immutability, which protects backups from ransomware and satisfies evolving cyber insurance requirements.

These benefits can prove particularly useful for higher education institutions, schools, state and local governments, nonprofits, and others where maximizing tight budgets is always a priority.

What’s in a Name?

For the poetically minded among our readership (there must be a few of you, right?), you may have noticed a haiku or two above. And that’s not a coincidence.

The humble haiku inspired the name for HYCU. In true poetic fashion, the name serves more than one purpose—it’s also an acronym for “hyperconverged uptime,” making the fewest letters do the most, as they should.

Making Data Protection Easier

This partnership adds a powerful new data protection option for joint customers looking to affordably back up their data and establish a disaster recovery strategy. And, this is just the beginning. Stay tuned for more from this partnership, including integrations with HYCU’s other data protection offerings in the future. 

Interested in getting started? Learn more in our docs.

The post HYCU + Backblaze: Protecting Against Data Loss appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloud Storage Performance: The Metrics that Matter

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/cloud-storage-performance-the-metrics-that-matter/

A decorative image showing a cloud in the foreground and various mocked up graphs in the background.

Availability, time to first byte, throughput, durability—there are plenty of ways to measure “performance” when it comes to cloud storage. But, which measure is best and how should performance factor in when you’re choosing a cloud storage provider? Other than security and cost, performance is arguably the most important decision criterion, but it’s also the hardest dimension to pin down. It can be highly variable and depends on your own infrastructure, your workload, and all the network connections between your infrastructure and the cloud provider as well.

Today, I’m walking through how to think strategically about cloud storage performance, including which metrics matter and which may not be as important for you.

First, What’s Your Use Case?

The first thing to keep in mind is how you’re going to be using cloud storage. After all, performance requirements will vary from one use case to another. For instance, you may need greater performance in terms of latency if you’re using cloud storage to serve up software as a service (SaaS) content; however, if you’re using cloud storage to back up and archive data, throughput is probably more important for your purposes.

For something like application storage, you should also have other tools in your toolbox even when you are using hot, fast, public cloud storage, like the ability to cache content on edge servers, closer to end users, with a content delivery network (CDN).

Ultimately, you need to decide which cloud storage metrics are the most important to your organization. Performance is important, certainly, but security or cost may be weighted more heavily in your decision matrix.

A decorative image showing several icons representing different types of files on a grid over a cloud.

What Is Performant Cloud Storage?

Performance can be described using a number of different criteria, including:

  • Latency
  • Throughput
  • Availability
  • Durability

I’ll define each of these and talk a bit about what each means when you’re evaluating a given cloud storage provider and how they may affect upload and download speeds.

Latency

  • Latency is defined as the time between a client request and a server response. It quantifies the time it takes data to transfer across a network.  
  • Latency is primarily influenced by physical distance—the farther away the client is from the server, the longer it takes to complete the request. 
  • If you’re serving content to many geographically dispersed clients, you can use a CDN to reduce the latency they experience. 

Latency can be influenced by network congestion, security protocols on a network, or network infrastructure, but the primary cause is generally distance, as we noted above. 

Downstream latency is typically measured using time to first byte (TTFB). In the context of surfing the web, TTFB is the time between a page request and when the browser receives the first byte of information from the server. In other words, TTFB is measured by how long it takes between the start of the request and the start of the response, including DNS lookup and establishing the connection using a TCP handshake and TLS handshake if you’ve made the request over HTTPS.
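As a rough illustration, here's one way to approximate TTFB in Python with only the standard library. The timer starts before the request is sent (http.client connects lazily, so the TCP connect is included) and stops once the response status line arrives; the demo spins up a throwaway local server in place of a real endpoint:

```python
import http.client
import http.server
import socketserver
import threading
import time

def measure_ttfb(host: str, port: int, path: str = "/") -> float:
    """Time from starting the request until the response status line
    has been read: a close proxy for TTFB."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    start = time.perf_counter()
    conn.request("GET", path)
    response = conn.getresponse()   # returns once the first bytes arrive
    ttfb = time.perf_counter() - start
    response.read()
    conn.close()
    return ttfb

# Demo against a throwaway local server standing in for a real endpoint.
class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, *args):   # keep the demo output clean
        pass

server = socketserver.TCPServer(("127.0.0.1", 0), QuietHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
ttfb = measure_ttfb("127.0.0.1", server.server_address[1])
server.shutdown()
server.server_close()
print(f"TTFB: {ttfb * 1000:.2f} ms")
```

Against a loopback server the number is tiny; against a cloud endpoint across the public internet, physical distance dominates it, which is the point of the paragraph above.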

Let’s say you’re uploading data from California to a cloud storage data center in Sacramento. In that case, you’ll experience lower latency than if your business data is stored in, say, Ohio and has to make the cross-country trip. However, making the “right” decision about where to store your data isn’t quite as simple as that, and the complexity goes back to your use case. If you’re using cloud storage for off-site backup, you may want your data to be stored farther away from your organization to protect against natural disasters. In this case, performance is likely secondary to location—you only need fast enough performance to meet your backup schedule. 

Using a CDN to Improve Latency

If you’re using cloud storage to store active data, you can speed up performance by using a CDN. A CDN helps speed content delivery by caching content at the edge, meaning faster load times and reduced latency. 

Edge networks create “satellite servers” that are separate from your central data server, and CDNs leverage these to chart the fastest data delivery path to end users. 


  • Throughput is a measure of the amount of data passing through a system at a given time.
  • If you have spare bandwidth, you can use multi-threading to improve throughput. 
  • Cloud storage providers’ architecture influences throughput, as do their policies around slowdowns (i.e. throttling).

Throughput is often confused with bandwidth. The two concepts are closely related, but different. 

To explain them, it’s helpful to use a metaphor: Imagine a swimming pool. The amount of water in it is your file size. When you want to drain the pool, you need a pipe. Bandwidth is the size of the pipe, and throughput is the rate at which water moves through the pipe successfully. So, bandwidth affects your ultimate throughput. Throughput is also influenced by processing power, packet loss, and network topology, but bandwidth is the main factor. 

Using Multi-Threading to Improve Throughput

Assuming you have some bandwidth to spare, one of the best ways to improve throughput is to enable multi-threading. Threads are units of execution within processes. When you transmit files using a program across a network, they are being communicated by threads. Using more than one thread (multi-threading) to transmit files is, not surprisingly, better and faster than using just one (although a greater number of threads will require more processing power and memory). To return to our water pipe analogy, multi-threading is like having multiple water pumps (threads) running to that same pipe. Maybe with one pump, you can only fill 10% of your pipe. But you can keep adding pumps until you reach pipe capacity.
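As an illustration of the idea (not any particular backup tool's implementation), here's what multi-threaded part transfers might look like in Python; `transfer_one` is a hypothetical stand-in for whatever single-part upload or download call your tool or SDK provides:

```python
from concurrent.futures import ThreadPoolExecutor

def transfer_parts(parts, transfer_one, threads=4):
    """Run the single-part transfer function over all parts using a
    pool of worker threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(transfer_one, parts))
```

Each in-flight part is held in memory, so threads multiplied by part size needs to fit comfortably in RAM.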

When you’re using cloud storage with an integration like backup software or a network attached storage (NAS) device, the multi-threading setting is typically found in the integration’s settings. Many backup tools, like Veeam, are already set to multi-thread by default. Veeam automatically makes adjustments based on details like the number of individual backup jobs, or you can configure the number of threads manually. Other integrations, like Synology’s Cloud Sync, also give you granular control over threading so you can dial in your performance.  

A diagram showing single vs. multi-threaded processes.
Still confused about threads? Learn more in our deep dive, including what’s going on in this diagram.

That said, the gains from increasing the number of threads are limited by the available bandwidth, processing power, and memory. Finding the right setting can involve some trial and error, but the improvements can be substantial (as we discovered when we compared download speeds on different Python versions using single vs. multi-threading).

What About Throttling?

One question you’ll absolutely want to ask when you’re choosing a cloud storage provider is whether they throttle traffic. That means they deliberately slow down your connection for various reasons. Shameless plug here: Backblaze does not throttle, so customers are able to take advantage of all their bandwidth while uploading to B2 Cloud Storage. Amazon and many other public cloud services do throttle.

Upload Speed and Download Speed

Your ultimate upload and download speeds will be affected by throughput and latency. Again, it’s important to consider your use case when determining which performance measure is most important for you. Latency is important to application storage use cases where things like how fast a website loads can make or break a potential SaaS customer. With latency being primarily influenced by distance, it can be further optimized with the help of a CDN. Throughput is often the measurement that’s more important to backup and archive customers because it is indicative of the upload and download speeds an end user will experience, and it can be influenced by cloud storage provider practices, like throttling.   


  • Availability is the percentage of time a cloud service or a resource is functioning correctly.
  • Make sure the availability listed in the cloud provider’s service level agreement (SLA) matches your needs. 
  • Keep in mind the difference between hot and cold storage—cold storage services like Amazon Glacier offer slower retrieval and response times.

Also called uptime, this metric measures the percentage of time that a cloud service or resource is available and functioning correctly. It’s usually expressed as a percentage, with 99.9% (three nines) or 99.99% (four nines) availability being common targets for critical services. Availability is often backed by SLAs that define the uptime customers can expect and what happens if availability falls below that metric. 
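To make those targets concrete, you can convert an availability percentage into allowable downtime per year; a quick back-of-envelope calculation:

```python
def downtime_minutes_per_year(availability_pct):
    """Annual downtime permitted by an availability target, in minutes."""
    return (1 - availability_pct / 100) * 365 * 24 * 60

# 99.9% allows roughly 526 minutes (about 8.8 hours) of downtime a year;
# 99.99% allows roughly 53 minutes.
```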

You’ll also want to consider availability if you’re considering whether you want to store in cold storage versus hot storage. Cold storage is lower performing by design. It prioritizes durability and cost-effectiveness over availability. Services like Amazon Glacier and Google Coldline take this approach, offering slower retrieval and response times than their hot storage counterparts. While cost savings is typically a big factor when it comes to considering cold storage, keep in mind that if you do need to retrieve your data, it will take much longer (potentially days instead of seconds), and speeding that up at all is still going to cost you. You may end up paying more to get your data back faster, and you should also be aware of the exorbitant egress fees and minimum storage duration requirements for cold storage—unexpected costs that can easily add up. 

|                  | Cold                        | Hot                                 |
|------------------|-----------------------------|-------------------------------------|
| Access Speed     | Slow                        | Fast                                |
| Access Frequency | Seldom or Never             | Frequent                            |
| Data Volume      | Low                         | High                                |
| Storage Media    | Slower drives, LTO, offline | Faster drives, durable drives, SSDs |
| Cost             | Lower                       | Higher                              |


  • Durability is the ability of a storage system to consistently preserve data.
  • Durability is measured in “nines” or the probability that your data is retrievable after one year of storage. 
  • We designed the Backblaze B2 Storage Cloud for 11 nines of durability using erasure coding.

Data durability refers to the ability of a data storage system to reliably and consistently preserve data over time, even in the face of hardware failures, errors, or unforeseen issues. It is a measure of data’s long-term resilience and permanence. Highly durable data storage systems ensure that data remains intact and accessible, meeting reliability and availability expectations, making it a fundamental consideration for critical applications and data management.

We usually measure durability (more precisely, annual durability) in "nines": the number of nines in the probability, expressed as a percentage, that your data is retrievable after one year of storage. We know from our work on Drive Stats that an annual failure rate of 1% is typical for a hard drive. So, if you were to store your data on a single drive, its durability, the probability that it would not fail, would be 99%, or two nines.

The very simplest way of improving durability is to simply replicate data across multiple drives. If a file is lost, you still have the remaining copies. It’s also simple to calculate the durability with this approach. If you write each file to two drives, you lose data only if both drives fail. We calculate the probability of both drives failing by multiplying the probabilities of either drive failing, 0.01 x 0.01 = 0.0001, giving a durability of 99.99%, or four nines. While simple, this approach is costly—it incurs a 100% overhead in the amount of storage required to deliver four nines of durability.
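The arithmetic in that paragraph is simple enough to write down directly; a small sketch, assuming independent drive failures as above:

```python
def durability(annual_failure_rate, copies):
    """Probability your data survives the year when each of `copies`
    identical, independent drives fails at the given annual rate."""
    return 1 - annual_failure_rate ** copies

# One drive at a 1% annual failure rate: 0.99, or two nines.
# Two mirrored drives: 1 - 0.01 * 0.01 = 0.9999, or four nines.
```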

Erasure coding is a more sophisticated technique, improving durability with much less overhead than simple replication. An erasure code takes a “message,” such as a data file, and makes a longer message in a way that the original can be reconstructed from the longer message even if parts of the longer message have been lost. 

A decorative image showing the matrices that get multiplied to allow Reed-Solomon code to re-create files.
A representation of Reed-Solomon erasure coding, with some very cool Storage Pods in the background.

The durability calculation for this approach is much more complex than for replication, as it involves the time required to replace and rebuild failed drives as well as the probability that a drive will fail, but we calculated that we could take advantage of erasure coding in designing the Backblaze B2 Storage Cloud for 11 nines of durability with just 25% overhead in the amount of storage required. 

How does this work? Briefly, when we store a file, we split it into 16 equal-sized pieces, or shards. We then calculate four more shards, called parity shards, in such a way that the original file can be reconstructed from any 16 of the 20 shards. We then store the resulting 20 shards on 20 different drives, each in a separate Storage Pod (storage server).
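To get a feel for why 16-of-20 beats simple replication, here's a toy binomial model. It assumes independent shard losses and ignores repair entirely, so it badly understates real-world durability (the actual calculation accounts for rebuild times, as noted above), but the comparison is instructive:

```python
from math import comb

def unrecoverable_probability(n, k, p):
    """P(more than n - k of the n shards are lost), i.e., the file cannot
    be reconstructed. Assumes independent losses with probability p each
    and no repair, so this is a pessimistic toy model."""
    return sum(comb(n, f) * p**f * (1 - p)**(n - f)
               for f in range(n - k + 1, n + 1))

# Two-way replication is the n=2, k=1 case: 0.0001, matching four nines.
# A 20-shard, 16-data scheme at the same 1% loss rate is orders of
# magnitude safer, with 25% storage overhead instead of 100%.
```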

Note: As hard disk drive capacity increases, so does the time required to recover after a drive failure, so we periodically adjust the ratio between data shards and parity shards to maintain our eleven nines durability target. Consequently, our very newest vaults use a 15+5 scheme.

If a drive does fail, it can be replaced with a new drive, and its data rebuilt from the remaining good drives. We open sourced our implementation of Reed-Solomon erasure coding, so you can dive into the source code for more details.

Additional Factors Impacting Cloud Storage Performance

In addition to bandwidth and latency, there are a few additional factors that impact cloud storage performance, including:

  • The size of your files.
  • The number of files you upload or download.
  • Block (part) size.
  • The amount of available memory on your machine. 

Small files—that is, those less than 5GB—can be uploaded in a single API call. Larger files, from 5MB to 10TB, can be uploaded as “parts”, in multiple API calls. You’ll notice that there is quite an overlap here! For uploading files between 5MB and 5GB, is it better to upload them in a single API call, or split them into parts? What is the optimum part size? For backup applications, which typically split all data into equal-sized blocks, storing each block as a file, what is the optimum block size? As with many questions, the answer is that it depends.

Remember latency? Each API call incurs a more-or-less fixed overhead due to latency. For a 1GB file, assuming a single thread of execution, uploading all 1GB in a single API call will be faster than ten API calls each uploading a 100MB part, since those additional nine API calls each incur some latency overhead. So, bigger is better, right?

Not necessarily. Multi-threading, as mentioned above, affords us the opportunity to upload multiple parts simultaneously, which improves performance—but there are trade-offs. Typically, each part must be stored in memory as it is uploaded, so more threads means more memory consumption. If the number of threads multiplied by the part size exceeds available memory, then either the application will fail with an out of memory error, or data will be swapped to disk, reducing performance.
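A toy model makes these trade-offs concrete. This is a sketch under stated assumptions (fixed per-call latency, fully utilized bandwidth, no retries), not a description of any real client:

```python
from math import ceil

def upload_time(file_size, part_size, threads, latency, bandwidth, memory):
    """Estimated seconds to upload a file split into parts, where each
    round of concurrent API calls pays one latency overhead and every
    in-flight part occupies memory."""
    if threads * part_size > memory:
        raise MemoryError("threads x part size exceeds available memory")
    rounds = ceil(ceil(file_size / part_size) / threads)
    return rounds * latency + file_size / bandwidth

# 1GB file, 100MB parts, 10 threads, 50ms latency, 100MB/s bandwidth:
# one round of calls, about 10.05s. The same upload single-threaded pays
# the latency ten times (about 10.5s), and 30 threads at 100MB parts
# would need 3GB of memory.
```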

Downloading data offers even more flexibility, since applications can specify any portion of the file to download in each API call. Whether uploading or downloading, there is a maximum number of threads that will drive throughput to consume all of the available bandwidth. Exceeding this maximum will consume more memory, but provide no performance benefit. If you go back to our pipe analogy, you’ll have reached the maximum capacity of the pipe, so adding more pumps won’t make things move faster. 

So, what to do to get the best performance possible for your use case? Simple: customize your settings. 

Most backup and file transfer tools allow you to configure the number of threads and the amount of data to be transferred per API call, whether that’s block size or part size. If you are writing your own application, you should allow for these parameters to be configured. When it comes to deployment, some experimentation may be required to achieve maximum throughput given available memory.

How to Evaluate Cloud Performance

To sum up, the cloud is increasingly becoming a cornerstone of every company’s tech stack. Gartner predicts that by 2026, 75% of organizations will adopt a digital transformation model predicated on cloud as the fundamental underlying platform. So, cloud storage performance will likely be a consideration for your company in the next few years if it isn’t already.

It’s important to consider that cloud storage performance can be highly subjective and heavily influenced by things like use case considerations (i.e. backup and archive versus application storage, media workflow, or another), end user bandwidth and throughput, file size, block size, etc. Any evaluation of cloud performance should take these factors into account rather than simply relying on metrics in isolation. And, a holistic cloud strategy will likely have multiple operational schemas to optimize resources for different use cases.

Wait, Aren’t You, Backblaze, a Cloud Storage Company?

Why, yes. Thank you for noticing. We ARE a cloud storage company, and we OFTEN get questions about all of the topics above. In fact, that’s why we put this guide together—our customers and prospects are the best sources of content ideas we can think of. Circling back to the beginning, it bears repeating that performance is one factor to consider in addition to security and cost. (And, hey, we would be remiss not to mention that we’re also one-fifth the cost of AWS S3.) Ultimately, whether you choose Backblaze B2 Cloud Storage or not though, we hope the information is useful to you. Let us know if there’s anything we missed.

The post Cloud Storage Performance: The Metrics that Matter appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Load Balancing 2.0: What’s Changed After 7 Years?

Post Syndicated from nathaniel wagner original https://www.backblaze.com/blog/load-balancing-2-0-whats-changed-after-7-years/

A decorative image showing a file, a drive, and some servers.

What do two billion transactions a day look like? Well, the data may be invisible to the naked eye, but the math breaks down to just over 23,000 transactions every second. (Shout out to Kris Allen for burning into my memory that there are 86,400 seconds in a day and my handy calculator!) 

Part of my job as a Site Reliability Engineer (SRE) at Backblaze is making sure that greater than 99.9% of those transactions are what we consider “OK” (via status code), and part of the fun is digging for the needle in the haystack of over 7,000 production servers and over 250,000 spinning hard drives to try and understand how all of the different transactions interact with the pieces of infrastructure. 

In this blog post, I’m going to pick up where Principal SRE Elliott Sims left off in his 2016 article on load balancing. You’ll notice that the design principles we’ve employed are largely the same (cool!). So, I’ll review our stance on those principles, then talk about how the evolution of the B2 Cloud Storage platform—including the introduction of the S3 Compatible API—has changed our load balancing architecture. Read on for specifics. 

Editor’s Note

We know there are a ton of specialized networking terms flying around this article, and one of our primary goals is to make technical content accessible to all readers, regardless of their technical background. To that end, we’ve used footnotes to add some definitions and minimize the disruption to your reading experience.

What Is Load Balancing?

Load balancing is the process of distributing traffic across a network. It helps with resource utilization, prevents overloading any one server, and makes your system more reliable. Load balancers also monitor server health and redirect requests to the most suitable server.

With two billion requests per day to our servers, you can be assured that we use load balancers at Backblaze. Whenever anyone—a Backblaze Computer Backup or a B2 Cloud Storage customer—wants to upload or download data or modify or interact with their files, a load balancer will be there to direct traffic to the right server. Think of them as your trusty mail service, delivering your letters and packages to the correct destination—and using things like zip codes and addresses to interpret your request and get things to the right place.  

How Do We Do It?

We build our own load balancers using open-source tools. We use layer 4 load balancing with direct server response (DSR). Here are some of the resources that we call on to make that happen:  

  • Border Gateway Protocol (BGP), which we run in concert with the Linux kernel1. It's a standardized gateway protocol that exchanges routing and reachability information on the internet.   
  • keepalived, an open-source routing software. keepalived keeps track of all of our VIPs2 and IPs3 for each backend server. 
  • Hard disk drives (HDDs). We use the same drives that we use for our other API servers, which is definitely overkill for a load balancer, but it saves us the work of sourcing another type of device.  
  • A lot of hard work by a lot of really smart folks.

What We Mean by Layers

When we’re talking about layers in load balancing, it’s shorthand for how deep into the architecture your program needs to see. Here’s a great diagram that defines those layers: 

An image describing application layers.

DSR takes place at layer 4 and solves a problem inherent to the full proxy4 method: with a full proxy, the backend server never sees the original client's IP address, while DSR preserves it.

Why Do We Do It the Way We Do It?

Building our own load balancers, instead of buying an off-the-shelf solution, means that we have more control and insight, more cost-effective hardware, and more scalable architecture. In general, DSR is more complicated to set up and maintain, but this method also lets us handle lots of traffic with minimal hardware and supports our goal of keeping data encrypted, even within our own data center. 

What Hasn’t Changed

We’re still using a layer 4 DSR approach to load balancing, which we’ll explain below. For reference, other common methods of load balancing are layer 7, full proxy and layer 4, full proxy load balancing. 

First, I’ll explain how DSR works. DSR load balancing requires two things:

  1. A load balancer with the VIP address attached to an external NIC5 and ARPing6, so that the rest of the network knows it “owns” the IP.
  2. Two or more servers on the same layer 2 network that also have the VIP address attached to a NIC, either internal or external, but are not replying to ARP requests about that address. This means that no other servers on the network know that the VIP exists anywhere but on the load balancer.

A request packet will enter the network, and be routed to the load balancer. Once it arrives there, the load balancer leaves the source and destination IP addresses intact and instead modifies the destination MAC7 address to that of a server, then puts the packet back on the network. The network switch only understands MAC addresses, so it forwards the packet on to the correct server.

A diagram of how a packet moves through the network router and load balancer to reach the server, then respond to the original client request.

When the packet arrives at the server’s network interface, it checks to make sure the destination MAC address matches its own. The address matches, so the server accepts the packet. The server network interface then, separately, checks to see whether the destination IP address is one attached to it somehow. That’s a yes, even though the rest of the network doesn’t know it, so the server accepts the packet and passes it on to the application. The application then sends a response with the VIP as the source IP address and the client as the destination IP, so the request (and subsequent response) is routed directly to the client without passing back through the load balancer.
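The rewrite step described above can be sketched in a few lines; the packet fields here are illustrative, not a real network stack:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src_ip: str    # the client
    dst_ip: str    # the VIP
    src_mac: str   # last-hop router
    dst_mac: str   # the load balancer, on arrival

def dsr_forward(pkt, backend_mac):
    """Layer 4 DSR: both IP addresses are left intact; only the
    destination MAC is rewritten before the packet re-enters the network."""
    return replace(pkt, dst_mac=backend_mac)
```

Because the source IP still belongs to the client, the backend server can reply directly, which is the whole point of DSR.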

So, What’s Changed?

Lots of things. But, since we first wrote this article, we’ve expanded our offerings and platform. The biggest of these changes (as far as load balancing is concerned) is that we added the S3 Compatible API. 

We also serve a much more diverse set of clients, both in the size of files they have and their access patterns. File sizes affect how long it takes us to serve requests (larger files = more time to upload or download, which means an individual server is tied up for longer). Access patterns can vastly increase the amount of requests a server has to process on a regular, but not consistent basis (which means you might have times that your network is more or less idle, and you have to optimize appropriately). 

A definitely Photoshopped images showing a giraffe riding an elephant on a rope in the sky. The rope's anchor points disappear into clouds.
So, if we were to update this amazing image from the first article, we might have a tightrope walker with a balancing pole on top of the giraffe, plus some birds flying on a collision course with the elephant.

Where You Can See the Changes: ECMP, Volume of Data Per Month, and API Processing

DSR is how we send data to the customer—the server responds (sends data) directly to the request (client). This is the equivalent of going to the post office to mail something, but adding your home address as the return address (so that you don’t have to go back to the post office to get your reply).   

Given how our platform has evolved over the years, things might happen slightly differently. Let’s dig in to some of the details that affect how the load balancers make their decisions—what rules govern how they route traffic, and how different types of requests cause them to behave differently. We’ll look at:

  • Equal cost multipath routing (ECMP). 
  • Volume of data in petabytes (PBs) per month.
  • APIs and processing costs.


One thing that’s not explicitly outlined above is how the load balancer determines which server should respond to a request. At Backblaze, we use stateless load balancing, which means that the load balancer doesn’t take into account most information about the servers it routes to. We use a round robin approach—i.e. the load balancers choose between one of a few hosts, in order, each time they’re assigning a request. 

We also use Maglev, so the load balancers use consistent hashing and connection tracking. This means that we're minimizing the negative impact of unexpected failures on connection-oriented protocols. If a load balancer goes down, its server pool can be transferred to another, and it will make decisions in the same way, seamlessly picking up the load. When the initial load balancer comes back online, it already has a connection to its load balancer friend and can pick up where it left off. 
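Maglev's actual lookup-table construction is beyond a blog aside, but rendezvous (highest-random-weight) hashing, sketched below as a simpler stand-in, shows the same key property: any load balancer with the same server list makes the same choice for a given request, and removing a server only remaps the requests that were on it:

```python
import hashlib

def pick_server(key, servers):
    """Deterministically pick a server for a key: hash every (key, server)
    pair and take the highest score. No shared state is needed, so any
    load balancer holding the same server list agrees on the answer."""
    return max(servers,
               key=lambda s: hashlib.sha256(f"{key}@{s}".encode()).digest())
```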

The upside is that it’s super rare to see a disruption, and it essentially only happens when the load balancer and the neighbor host go down in a short period of time. The downside is that the load balancer decision is static. If you have “better” servers for one reason or another—they’re newer, for instance—they don’t take that information into account. On the other hand, we do have the ability to push more traffic to specific servers through ECMP weights if we need to, which means that we have good control over a diverse fleet of hardware. 

Volume of Data

Backblaze now has over three exabytes of storage under management. Based on the scalability of the network design, that doesn’t really make a huge amount of difference when you’re scaling your infrastructure properly. What can make a difference is how people store and access their data. 

Most of the things that make managing large datasets difficult from an architecture perspective can also be silly for a client. For example: querying files individually (creating lots of requests) instead of batching or creating a range request. (There may be a business reason to do that, but usually, it makes more sense to batch requests.)

On the other hand, some things that make sense for how clients need to store data require architecture realignment. One of those is just a sheer fluctuation of data by volume—if you’re adding and deleting large amounts of data (we’re talking hundreds of terabytes or more) on a “shorter” cycle (monthly or less), then there will be a measurable impact. And, with more data stored, you have the potential for more transactions.

Similarly, if you need to retrieve data often, but not regularly, there are potential performance impacts. Most of them are related to caching, and ironically, they can actually improve performance. The more you query the same “set” of servers for the same file, the more likely that each server in the group will have cached your data locally (which means they can serve it more quickly). 

And, as with most data centers, we store our long term data on hard disk drives (HDDs), whereas our API servers are on solid state drives (SSDs). There are positives and negatives to each type of drive, but the performance impact is that data at rest takes longer to retrieve, and data in the cache is on a faster SSD drive on the API server. 

On the other hand, the more servers the data center has, the lower the chance that any given server will have your data cached. And, of course, if you're replacing large volumes of old data with new on a shorter timeline, then you won't see the benefits. It sounds like an edge case, but security camera data is a great example. While those customers don't retrieve their data very frequently, they are constantly uploading and overwriting it, often to meet industry requirements about retention periods, which makes it challenging to allocate a finite amount of input/output operations per second (IOPS) across uploads, downloads, and deletes.  

That said, the built-in benefit of our system is that adding another load balancer is (relatively) cheap. If we’re experiencing a processing chokepoint for whatever reason—typically either a CPU bottleneck or from throughput on the NIC—we can add another load balancer, and just like that, we can start to see bits flying through the new load balancer and traffic being routed amongst more hosts, alleviating the choke points. 

APIs and Processing Costs

We mentioned above that one of the biggest changes to our platform was the addition of the S3 Compatible API. When all requests were made through the B2 Native API, the Backblaze CLI tool, or the web UI, the processing cost was relatively cheap. 

That’s because of the way our upload requests to the Backblaze Vaults are structured. When you make an upload request via the Native API, there are actually two transactions, one to get an upload URL, and the second to send a request to the Vault. And, all other types of requests (besides upload) have always had to be processed through our load balancers. Since the S3 Compatible API is a single request, we knew we would have to add more processing power and load balancers. (If you want to go back to 2018 and see some of the reasons why, here’s Brian Wilson on the subject—read with the caveat that our current Tech Doc on the subject outlines how we solve the complications he points out.) 

We’re still leveraging DSR to respond directly to the client, but we’ve significantly increased the amount of transactions that hit our load balancers, both because it has to take on more of the processing during transit and because, well, lots of folks like to use the S3 Compatible API and our customer base has grown by a quite a bit since 2018. 

And, just like above, we’ve set ourselves up for a relatively painless fix: we can add another load balancer to solve most problems. 

Do We Have More Complexity?

This is the million dollar question, solved for a dollar: how could we not? Since our first load balancing article, we've added features, complexity, and lots of customers. Load balancing algorithms are inherently complex, but we (mostly Elliott and other smart people) have taken a lot of time and consideration to not just build a system that will scale up to and past two billion transactions a day, but one that can be fairly "easily" explained and doesn't require a graduate degree to understand what is happening.  

But, we knew it was important early on, so we prioritized building a system where we could “just” add another load balancer. The thinking is more complicated at the outset, but the tradeoff is that it’s simple once you’ve designed the system. It would take a lot for us to outgrow the usefulness of this strategy—but hey, we might get there someday. When we do, we’ll write you another article. 


  1. Kernel: A kernel is the computer program at the core of a computer’s operating system. It has control of the system and does things like run processes, manage hardware devices like the hard drive, and handle interrupts, as well as memory and input/output (I/O) requests from software, translating them to instructions for the central processing unit (CPU). ↩
  2. Virtual IP address (VIP): An IP address that is not tied to one particular physical machine; it can be claimed by, and moved between, servers. ↩
  3. Internet protocol (IP) address: The global logical address of a device, used to identify devices on the internet. Can be changed.  ↩
  4. Proxy: In networking, a proxy is a server application that validates outside requests to a network. Think of them as a gatekeeper. There are several common types of proxies you interact with all the time—HTTPS requests on the internet, for example. ↩
  5. Network interface controller, or network interface card (NIC): This connects the computer to the network. ↩
  6. Address resolution protocol (ARP): The process by which a device or network identifies another device. There are four types. ↩
  7. Media access control (MAC): The local physical address of a device, used to identify devices on the same network. Hardcoded into the device. ↩

The post Load Balancing 2.0: What’s Changed After 7 Years? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

NAS Performance Guide: How To Optimize Your NAS

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/nas-performance-guide-how-to-optimize-your-nas/

A decorative images showing a 12 bay NAS device connecting to the cloud.

Upgrading to a network attached storage (NAS) device puts your data in the digital fast lane. If you’re using one, it’s likely because you want to keep your data close to you, ensuring quick access whenever it’s needed. NAS devices, acting as centralized storage systems connected to local networks, offer a convenient way to access data in just a few clicks. 

However, as the volume of data on the NAS increases, its performance can tank. You need to know how to keep your NAS operating at its best, especially with growing data demand. 

In this blog, you’ll learn about various factors that can affect NAS performance, as well as practical steps you can take to address these issues, ensuring optimal speed, reliability, and longevity for your NAS device.

Why NAS Performance Matters

NAS devices can function as extended hard disks, virtual file cabinets, or centralized local storage solutions, depending on individual needs. 

While NAS offers a convenient way to store data locally, storing the data alone isn’t enough. How quickly and reliably you can access your data can make all the difference if you want an efficient workflow. For example, imagine working on a critical project with your team and facing slow file transfers, or streaming a video on a Zoom call only for it to stutter or buffer continuously.

All these can be a direct result of NAS performance issues, and an increase in stored data can directly undermine the device’s performance. Therefore, ensuring optimal performance isn’t just a technical concern, it’s also a concern that directly affects user experience, productivity, and collaboration. 

So, let’s talk about what could potentially cause performance issues and how to enhance your NAS. 

Common NAS Performance Issues

NAS performance can be influenced by a variety of factors. Here are some of the most common factors that can impact the performance of a NAS device.

Hardware Limitations:

  • Insufficient RAM: Especially in tasks like media streaming or handling large files, having inadequate memory can slow down operations. 
  • Slow CPU: An underpowered processor can become a bottleneck when multiple users access the NAS at once or during collaboration with team members. 
  • Drive Speed and Type: Hard disk drives (HDDs) are generally slower than solid state drives (SSDs), and your NAS can have either type. If your NAS mainly serves as a hub for storing and sharing files, a conventional HDD should meet your requirements. However, if you need more speed and responsiveness, SSDs deliver the performance you’re after. 
  • Outdated Hardware: Older NAS models might not be equipped to handle modern data demands or the latest software.

Software Limitations:

  • Outdated Firmware/Software: Not updating to the latest firmware or software can lead to performance issues, or to missing out on optimization and security features.
  • Misconfigured Settings: Incorrect settings can impact performance. This includes improper RAID configuration or network settings. 
  • Background Processes: Certain background tasks, like indexing or backups, can also slow down the system when running.

Network Challenges: 

  • Bandwidth Limitations: A slow network connection, especially on a Wi-Fi network, can limit data transfer rates. 
  • Network Traffic: High traffic on the network can cause congestion, reducing the speed at which data can be accessed or transferred.

Disk Health and Configuration:

  • Disk Failures: A failing disk in the NAS can slow down performance and also poses a data loss risk.
  • Suboptimal RAID Configuration: Some RAID configurations prioritize redundancy over performance, which can slow data storage and access speeds. 

External Factors:

  • Simultaneous User Access: If multiple users are accessing, reading, or writing to the NAS simultaneously, it can strain the system, especially if the hardware isn’t optimized for that level of traffic. 
  • Inadequate Power Supply: Fluctuating or inadequate power can cause the NAS to malfunction or reduce its performance.
  • Operating Temperature: If the NAS is in a hot environment, it might overheat and impact the performance of the device.

Practical Solutions for Optimizing NAS Performance

Understanding the common performance issues with NAS devices is the first critical step. However, simply identifying these issues alone isn’t enough. It’s vital to understand practical ways to optimize your existing NAS setup so you can enhance its speed, efficiency, and reliability. Let’s explore how to optimize your NAS. 

Performance Enhancement 1: Upgrading Hardware

There are a few different things you can do on a hardware level to enhance NAS performance. First, adding more RAM can significantly improve performance, especially if multiple tasks or users are accessing the NAS simultaneously. 

You can also consider switching to SSDs. While they can be more expensive, SSDs offer faster read/write speeds than traditional HDDs, and they store data in flash memory, which means that they retain information even without power. 

Finally, you could upgrade the CPU. For NAS devices that support it, a more powerful CPU can better handle multiple simultaneous requests and complex tasks. 

Performance Enhancement 2: Optimizing Software Configuration

Remember to always keep your NAS operating system and software up-to-date to benefit from the latest performance optimizations and security patches. Schedule tasks like indexing, backups, or antivirus scans during off-peak hours to ensure they don’t impact user access during high-traffic times. You also need to make sure you’re using the right RAID configuration for your needs. RAID 5 or RAID 6, for example, can offer a good balance between redundancy and performance.
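To make that redundancy/capacity trade-off concrete, here’s a rough sketch of usable capacity for RAID 5 versus RAID 6 (a simplification for illustration; real arrays and vendor implementations vary):

```python
def raid_usable_capacity(num_drives: int, drive_tb: float, level: str) -> float:
    """Approximate usable capacity in TB for common parity RAID levels.

    RAID 5 spends one drive's worth of space on parity and survives one
    drive failure; RAID 6 spends two and survives two failures.
    """
    if level == "raid5":
        if num_drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (num_drives - 1) * drive_tb
    if level == "raid6":
        if num_drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (num_drives - 2) * drive_tb
    raise ValueError(f"unsupported level: {level}")

# A 4-bay NAS filled with 8TB drives:
print(raid_usable_capacity(4, 8, "raid5"))  # 24TB usable, tolerates 1 failure
print(raid_usable_capacity(4, 8, "raid6"))  # 16TB usable, tolerates 2 failures
```

RAID 6 trades usable space (and some write performance, because of the second parity calculation) for the ability to survive a second drive failure.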

Performance Enhancement 3: Network Enhancements

Consider moving to faster network protocols, like 10Gb Ethernet, or ensuring that your router and switches can handle high traffic. Wherever possible, use wired connections instead of Wi-Fi to connect to the NAS for more stable and faster data access and transfer. And, regularly review and adjust network settings for optimal performance. It also helps to limit simultaneous access where you can, managing peak loads by setting up access priorities.
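As a back-of-the-envelope illustration of why link speed matters, here’s a quick sketch (the 80% efficiency figure is an assumption to account for protocol overhead, not a measured value):

```python
def transfer_seconds(size_gb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the time to move size_gb gigabytes over a link rated at
    link_gbps gigabits per second, at an assumed real-world efficiency."""
    return size_gb * 8 / (link_gbps * efficiency)  # 8 bits per byte

# Moving a 50GB video project to the NAS:
print(f"1GbE:  ~{transfer_seconds(50, 1):.0f}s")   # roughly 8 minutes
print(f"10GbE: ~{transfer_seconds(50, 10):.0f}s")  # under a minute
```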

Performance Enhancement 4: Regular Maintenance

Use your NAS device’s built-in tools or third-party software to monitor the health of your disks and replace any that show signs of failure. And, keep the physical environment around your NAS device clean, cool, and well ventilated to prevent overheating. 

Leveraging the Cloud for NAS Optimization

After taking the necessary steps to optimize your NAS for improved performance and reliability, it’s worth considering leveraging the cloud to further enhance the performance. While NAS offers convenient local storage, it can sometimes fall short when it comes to scalability, accessibility from different locations, and seamless collaboration. Here’s where cloud storage comes into play. 

At its core, cloud storage is a service model in which data is maintained, managed, and backed up remotely, and made available to users over the internet. Instead of relying solely on local storage solutions such as NAS or a server, you utilize the vast infrastructure of data centers across the globe to store your data not just in one physical location, but across multiple secure and redundant environments. 

As an off-site storage solution for NAS, the cloud not only completes your 3-2-1 backup plan, but can also amplify its performance. Let’s take a look at how integrating cloud storage can help optimize your NAS.

  • Off-Loading and Archiving: One of the most straightforward approaches is to move infrequently accessed or archival data from the NAS to the cloud. This frees up space on the NAS, ensuring it runs smoothly, while optimizing the NAS by only keeping data that’s frequently accessed or essential. 
  • Caching: Some advanced NAS systems support cloud tiering, keeping frequently accessed data cached locally while the bulk of the dataset lives in the cloud. This means that the most commonly used data can be quickly retrieved, enhancing user experience and reducing the storage load on the NAS device. 
  • Redundancy and Disaster Recovery: Instead of duplicating data on multiple NAS devices for redundancy, which can be costly and still vulnerable to local disasters, the data can be backed up to the cloud. In case of NAS failure or catastrophic event, the data can be quickly restored from the cloud, ensuring minimal downtime. 
  • Remote Access and Collaboration: While NAS devices can offer remote access, integrating them with cloud storage can streamline this process, often offering a more user-friendly interface and better speeds. This is especially useful for collaborative environments where multiple users work together on files and projects. 
  • Scaling Without Hardware Constraints: As your data volume grows, expanding a NAS can involve purchasing additional drives or even new devices. With cloud integration, you can expand your storage capacity without these immediate hardware investments, eliminating or delaying the need for physical upgrades and extending the lifespan of your NAS. 

In essence, integrating cloud storage solutions with your NAS can create a comprehensive system that addresses the shortcomings of NAS devices, helping you create a hybrid setup that offers the best of both worlds: the speed and accessibility of local storage, and the flexibility and scalability of the cloud. 

Getting the Best From Your NAS

At its core, NAS offers an unparalleled convenience of localized storage. However, it’s not without challenges, especially when performance issues come into play. Addressing these challenges requires a blend of hardware optimization, software updates, and smart data management settings. 

But, it doesn’t have to stop at your local network. Cloud storage can be leveraged effectively to optimize your NAS. It doesn’t just act as a safety net by storing your NAS data off-site, it also makes collaboration easier with dispersed teams and further optimizes NAS performance. 

Now, it’s time to hear from you. Have you encountered any NAS performance issues? What measures have you taken to optimize your NAS? Share your experiences and insights in the comments below. 

The post NAS Performance Guide: How To Optimize Your NAS appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Manage Your Family’s Backups

Post Syndicated from Yev original https://www.backblaze.com/blog/groups-speeds-family-backup/

A decorative image showing faces on laptop screens.

When it comes to navigating the treacherous landscape of a household’s digital ecosystem, from smartphones and laptops to smart homes and millions of subscriptions, there often emerges a silent hero—the ever-humble, quietly toiling, underappreciated Family IT Manager. This unsung role, typically filled by a tech-savvy-est member of the family, takes on the responsibility of keeping everyone’s digital lives running smoothly. Maybe you know one of these valiant souls. Or maybe, just maybe, it’s you. 

As the Family IT Manager, having one more arrow in your quiver with which to slay the dreaded data loss dragon is always helpful. And that’s what Backblaze Groups is all about—making it easier for you to keep track of everyone’s data in one place. 

Today, we’re sharing some practical tips and tricks for using Groups to better manage your family’s backups.

Have You Checked Out v9.0?

Backblaze recently rolled out v9.0 to all Backblaze Computer Backup users. If you haven’t had a chance, you can read all about the latest version, including the new Restore App.

What Are Backblaze Groups?

Groups helps you manage the backups your family creates without having to log in and out of individual accounts. This makes it simple to keep track of everyone in one place. All the backup accounts are linked to the same credit card (they can Venmo you later), and you can even help someone else in your family create a backup or restore files easily with Groups. Need to help a family member with a computer emergency? Log in, access their most recent backup, and restore everything. Is your sibling unsure that you really added Backblaze to their computer? Log in, view their account, and get the screenshots to prove it to them (and everyone else). 

By the way, this would be a great time to give the new Restore App, included with Backblaze Computer Backup v9.0, a spin.

One point of clarification: You might see Backblaze Groups referred to as “Business Groups,” but you don’t have to be a business to use Groups. They work equally well for businesses and personal users, including Family IT Managers (and, truly, running family IT is kind of like running a business, isn’t it?).

Why Use Groups?

You can already manage multiple computers on a single Backblaze account. So why use Groups instead? Well, with Groups, each user has individual access to, and control of, their account. You—as Group administrator—manage billing and, as needed, data recovery. This is a safer, more secure method than sharing the same account credentials among several computers used by different people.

Have multiple households or groupings of folks in your life that you need to manage? You can have as many Groups as you like to help you keep track of everyone and everything, and each of those Groups can have separate billing. 

What Do I Need to Know About Setting Up Backblaze Groups for My Family?

The Groups feature streamlines the management of the accounts you need to monitor. As the Group administrator, you have total control over who’s included as part of your Group. You can send out email invitations, or alternatively, you can use a unique Group invitation link that allows anyone you share it with to easily join. 

A screenshot of a Backblaze account showing how to create a Group.
Here’s the visual of where you’d find everything in your account.

Being in a Group is entirely voluntary. Any member of a Group can leave any time they want, and Group administrators can also remove individuals from a Group at any time. 

If you dissolve your Group for some reason or if someone chooses to leave, the removed person can decide whether they want to keep using Backblaze by establishing their own payment method. Perfect for when it’s time to wean the kiddos off of your shared accounts—whether they like it or not.

One last note: while you can set up and administer more than one Group with separate billing, you can only be a member of one Group. 

Those are all the caveats, really. If you want to read more about the step-by-step instructions, check out our Help article about creating a Group.

Invite Members: The More the Merrier

Once you create a Group, you can invite members to join it. Copy the Group invite link Backblaze generates automatically for you. Give it to friends and family via email, chat, or any other means you’d like. 

A screenshot of a Backblaze account showing how to invite Group members.
We promise to send the emails. You may have to remind them to check their email.

When the person you’ve invited clicks on the link, they will be prompted to either create a Backblaze account (if they don’t have one) or log in to their existing account. After completing this step, they will be prompted to download Backblaze. If they are already using Backblaze, there is no need for a reinstallation; they will seamlessly become a part of your Group.

Once an existing user successfully joins your Group, they’ll be under your billing account. Their existing credit card will automatically receive a prorated refund for the remaining portion of their previous Backblaze license. There is no need to worry about re-uploading data—their backup remains securely stored in Backblaze.

Newcomers to Backblaze can download and install the client to initiate their initial backup process. As the Group administrator, you will have the capability to monitor their backup progress. Remember that the first backup of data may take some time, but after that, everything will run smoothly in the background. 

Go Forth and Conquer, Mighty IT Manager

We understand that being the go-to “tech person” for your family and friends can be challenging. We hope that Groups simplifies the process, making it easier for you to help keep your family’s data safe.

The post How to Manage Your Family’s Backups appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Overload to Overhaul: How We Upgraded Drive Stats Data

Post Syndicated from David Winings original https://www.backblaze.com/blog/overload-to-overhaul-how-we-upgraded-drive-stats-data/

A decorative image showing the words "overload to overhaul: how we upgraded Drive Stats data."

This year, we’re celebrating 10 years of Drive Stats. Coincidentally, we also made some upgrades to how we run our Drive Stats reports. We reported on how an attempt to migrate triggered a weeks-long recalculation of the dataset, leading us to map the architecture of the Drive Stats data. 

This follow-up article focuses on the improvements we made after we fixed the existing bug (because hey, we were already in there), and then presents some of our ideas for future improvements. Remember that those are just ideas so far—they may not be live in a month (or ever?), but consider them good food for thought, and know that we’re paying attention so that we can pass this info along to the right people.

Now, onto the fun stuff. 

Quick Refresh: Drive Stats Data Architecture

The podstats generator runs on every Storage Pod, what we call any host that holds customer data, every few minutes. It’s a C++ program that collects SMART stats and a few other attributes, then converts them into an .xml file (“podstats”). Those are then pushed to a central host in each datacenter and bundled. Once the data leaves these central hosts, it has entered the domain of what we will call Drive Stats.  

Now let’s go into a little more detail: when you’re gathering stats about drives, you’re running a set of modules with dependencies on other modules, forming a data-dependency tree. Each time a module “runs”, it takes information, modifies it, and writes it to disk. As you run each module, the data will be transformed sequentially. And, once a quarter, we run a special module that collects all the attributes for our Drive Stats reports, collecting data all the way down the tree. 

Here’s a truncated diagram of the whole system, to give you an idea of what the logic looks like:

A diagram of the mapped logic of the Drive Stats modules.
An abbreviated logic map of Drive Stats modules.

As you move down through the module layers, the logic gets more and more specialized. When you run a module, the first thing the module does is check in with the previous module to make sure the data exists and is current. It caches the data to disk at every step, and fills out the logic tree step by step. So for example, drive_stats, being a “per-day” module, will write out a file such as /data/drive_stats/2023-01-01.json.gz when it finishes processing. This lets future modules read that file to avoid repeating work.
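The cache-and-check pattern described above can be sketched in a few lines of Python (a hypothetical simplification, not Backblaze’s actual code; the file layout mirrors the /data/drive_stats example):

```python
import gzip
import json
import os
import tempfile

def run_module(name, day, compute, data_dir):
    """Run a per-day module, reusing its cached output when one exists."""
    path = os.path.join(data_dir, name, f"{day}.json.gz")
    if os.path.exists(path):  # cache hit: read the stored result
        with gzip.open(path, "rt") as f:
            return json.load(f)
    result = compute(day)  # cache miss: do the work...
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, "wt") as f:  # ...and persist it for future runs
        json.dump(result, f)
    return result

cache_dir = tempfile.mkdtemp()
calls = []
def count_drives(day):
    calls.append(day)  # track how often the expensive step actually runs
    return {"day": day, "drive_count": 241297}

first = run_module("drive_stats", "2023-01-01", count_drives, cache_dir)
second = run_module("drive_stats", "2023-01-01", count_drives, cache_dir)
print(len(calls))  # 1: the second call was served from the cache
```

The upside is exactly the deduplication described above; the downside, as the next section shows, is what happens when the cache key covers too large a range.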

This work deduplication process saves us a lot of time overall—but it also turned out to be the root cause of our weeks-long process when we were migrating Drive Stats to our new host. We fixed that by adding a version number to each module.  

While You’re There… Why Not Upgrade?

Once the dust from the bug fix had settled, we moved forward to try to modernize Drive Stats in general. Our daily report still ran quite slowly, on the order of several hours, and there was some low-hanging fruit to chase.

Waiting On You, failures_with_stats

First things first, we saved a log of a run of our daily reports in Jenkins. Then we wrote an analyzer to see which modules were taking a lot of time. failures_with_stats was our biggest offender, running for about two hours, while every other module took about 15 minutes.

An image showing runtimes for each module when running a Drive Stats report.
Not quite two hours.

Upon investigation, the time cost had to do with how the date_range module works. This takes us back to caching: our module checks if the file has been written already, and if it has, it uses the cached file. However, a date range is written to a single file. That is, Drive Stats will recognize “Monday to Wednesday” as distinct from “Monday to Thursday” and re-calculate the entire range. This is a problem for a workload that is essentially doing work for all of time, every day.  

On top of this, the raw Drive Stats data, which is a dependency for failures_with_stats, would be gzipped onto a disk. When each new query triggered a request to recalculate all-time data, each dependency would pick up the podstats file from disk, decompress it, read it into memory, and do that for every day of all time. We were picking up and processing our biggest files every day, and time continued to make that cost larger.

Our solution was what I called the “Date Range Accumulator.” It works as follows:

  • If we have a date range like “all of time as of yesterday” (or any partial range with the same start), consider it as a starting point.
  • Make sure that the version numbers don’t consider our starting point to be too old.
  • Do the processing of today’s data on top of our starting point to create “all of time as of today.”

To do this, we read the directory of the date range accumulator, find the “latest” valid one, and use that to determine the delta (change) to our current date. Basically, the module says: “The last time I ran this was on data from the beginning of time to Thursday. It’s now Friday. I need to run the process for Friday, and then add that to the compiled all-time.” And, before it does that, it double checks the version number to avoid errors. (As we noted in our previous article, if it doesn’t see the correct version number, instead of inefficiently running all data, it just tells you there is a version number discrepancy.) 
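In simplified Python, the accumulator logic might look something like this (hypothetical names and structures; the real module handles exceptions and versioning in more detail):

```python
from datetime import date, timedelta

def days_between(start, end):
    """Yield each day after `start` up to and including `end`."""
    d = start + timedelta(days=1)
    while d <= end:
        yield d
        d += timedelta(days=1)

def accumulate(prior_ranges, today, version, process_day):
    """Extend the newest valid "all of time" accumulation instead of
    recomputing every day from scratch.

    prior_ranges: {end_date: (version, accumulated_data)}
    process_day:  computes a single day's results
    """
    # Find the latest prior accumulation with a matching version number.
    starts = [d for d, (v, _) in prior_ranges.items() if v == version and d < today]
    if not starts:
        raise RuntimeError("no valid starting point; version discrepancy?")
    start = max(starts)
    accumulated = dict(prior_ranges[start][1])
    # Only the delta between the starting point and today gets processed.
    for day in days_between(start, today):
        accumulated[day] = process_day(day)
    return accumulated

# "The last time I ran this was on data through Thursday. It's now Friday."
thursday, friday = date(2023, 6, 1), date(2023, 6, 2)
prior = {thursday: (3, {thursday: "all-time data through Thursday"})}
out = accumulate(prior, friday, version=3, process_day=lambda d: f"stats for {d}")
print(sorted(out))  # only Friday was freshly processed
```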

The code is also a bit finicky—there are lots of snags when it comes to things like defining exceptions, such as if we took a drive out of the fleet, but it wasn’t a true failure. The module also needed to be processable day by day to be usable with this technique.

Still, even with all the tweaks, it’s massively better from a runtime perspective for eligible candidates. Here’s our new failures_with_stats runtime: 

An output of module runtime after the Drive Stats improvements were made.
Ahh, sweet victory.

Note that in this example, we’re running that 60-day report. The daily report is quite a bit quicker. But, at least the 60-day report is a fixed amount of time (as compared with the all-time dataset, which is continually growing). 

Code Upgrade to Python 3

Next, we converted our code to Python 3. (Shout out to our intern, Anath, who did amazing work on this part of the project!) We didn’t make this improvement just to make it; no, we did this because I wanted faster JSON processors, and a lot of the more advanced ones did not work with Python 2. When we looked at the time each module took to process, most of that was spent serializing and deserializing JSON.

What Is JSON Parsing?

JSON is an open standard file format that uses human readable text to store and transmit data objects. Many modern programming languages include code to generate and parse JSON-format data. Here’s how you might describe a person named John, aged 30, from New York using JSON: 

{
  "firstName": "John",
  "age": 30,
  "state": "New York"
}

You can express those attributes into a single line of code and define them as a native object:

x = {'firstName': 'John', 'age': 30, 'state': 'New York'}

“Parsing” is the process by which you take the JSON data and make it into an object that you can plug into another programming language. You’d write your script (program) in Python, it would parse (interpret) the JSON data, and then give you an answer. This is what that would look like: 

import json

# some JSON:
x = '''
{
  "firstName": "John",
  "age": 30,
  "state": "New York"
}
'''

# parse x:
y = json.loads(x)

# the result is a Python dictionary:
print(y["firstName"])
If you run this script, you’ll get the output “John.” If you change print(y["firstName"]) to print(y["age"]), you’ll get the output “30.” Check out this website if you want to interact with the code for yourself. In practice, the JSON would be read from a database, or a web API, or a file on disk rather than defined as a “string” (or text) in the Python code. If you are converting a lot of this JSON, small improvements in efficiency can make a big difference in how a program performs.

And Implementing UltraJSON

Upgrading to Python 3 meant we could use UltraJSON. This was approximately 50% faster than the built-in Python JSON library we used previously. 
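UltraJSON is a drop-in replacement for the standard library’s loads and dumps, which is what makes the swap cheap. A minimal sketch, with a fallback so it runs even where ujson isn’t installed:

```python
import json

try:
    import ujson as fast_json  # pip install ujson; C-accelerated drop-in
except ImportError:
    fast_json = json  # fall back to the standard library

record = {"serial": "ZA1234", "smart_5_raw": 0, "failure": 0}
decoded = fast_json.loads(fast_json.dumps(record))
print(decoded == record)  # True: same API, (potentially) much faster
```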

We also looked at the XML parsing for the podstats files, since XML parsing is often a slow process. In this case, we actually found our existing tool is pretty fast (and since we wrote it 10 years ago, that’s pretty cool). Off-the-shelf XML parsers take quite a bit longer because they care about a lot of things we don’t have to: our tool is customized for our Drive Stats needs. It’s a well-known adage that you should not parse XML with regular expressions, but if your files are, well, very regular, it can save a lot of time.
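To illustrate the idea (and only the idea: the real podstats schema is internal, so this record format is invented), pattern-matching very regular XML can be a one-liner:

```python
import re

# An invented, perfectly regular record format standing in for podstats:
podstats = """
<drive><serial>ZA1001</serial><smart_9_raw>41575</smart_9_raw></drive>
<drive><serial>ZA1002</serial><smart_9_raw>12003</smart_9_raw></drive>
"""

# Because every record is structured identically, one regex suffices;
# a general-purpose parser would do far more work for the same answer.
pattern = re.compile(r"<serial>(\w+)</serial><smart_9_raw>(\d+)</smart_9_raw>")
power_on_hours = {serial: int(raw) for serial, raw in pattern.findall(podstats)}
print(power_on_hours)  # {'ZA1001': 41575, 'ZA1002': 12003}
```

This only works because the generator and the parser are maintained together; the moment the format can vary, you want a real XML parser again.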

What Does the Future Hold?

Now that we’re working with a significantly faster processing time for our Drive Stats dataset, we’ve got some ideas about upgrades in the future. Some of these are easier to achieve than others. Here’s a sneak peek of some potential additions and changes in the future.

Data on Data

In keeping with our data-nerd ways, I got curious about how much the Drive Stats dataset is growing and if the trend is linear. We made this graph, which shows the baseline rolling average and includes a trend line that attempts a linear prediction.

A graph showing the rate at which the Drive Stats dataset has grown over time.

I envision this graph living somewhere on the Drive Stats page and being fully interactive. It’s just one graph, but this and similar tools available on our website would 1) be fun and 2) lead to some interesting insights for those who don’t dig in line by line. 

What About Changing the Data Module?

The way our current module system works, everything gets processed in a tree approach, and the outputs are flat files. If we used something like SQLite or Parquet, we’d be able to process data in a more depth-first way, and that would mean that we could open a file for one module or data range, process everything, and not have to read the file again. 

And, since one of the first things that our Drive Stats expert, Andy Klein, does with our .xml data is to convert it to SQLite, outputting it in a queryable form would save a lot of time. 
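As a sketch of what a queryable output could look like, using Python’s built-in sqlite3 (the column set and values here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in practice, a file like drive_stats.db
conn.execute(
    "CREATE TABLE drive_stats (date TEXT, serial TEXT, model TEXT, failure INTEGER)"
)
conn.executemany(
    "INSERT INTO drive_stats VALUES (?, ?, ?, ?)",
    [
        ("2023-01-01", "ZA1001", "WDC WUH722222ALE6L4", 0),
        ("2023-01-01", "ZA1002", "WDC WUH722222ALE6L4", 1),
    ],
)
# Downstream consumers run queries instead of re-reading flat files:
(failures,) = conn.execute(
    "SELECT COUNT(*) FROM drive_stats WHERE failure = 1"
).fetchone()
print(failures)  # 1
```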

We could also explore keeping the data as a less-smart filetype, but using something more compact than JSON, such as MessagePack.

Can We Improve Failure Tracking and Attribution?

One of the odd things about our Drive Stats datasets is that they don’t always and automatically agree with our internal data lake. Our Drive Stats outputs have some wonkiness that’s hard to replicate, and it’s mostly because of exceptions we build into the dataset. These exceptions aren’t when a drive fails, but rather when we’ve removed it from the fleet for some other reason, like if we were testing a drive or something along those lines. (You can see specific callouts in Drive Stats reports, if you’re interested.) It’s also where a lot of Andy’s manual work on Drive Stats data comes in each month: he’s often comparing the module’s output with data in our datacenter ticket tracker.

These tickets come from the awesome data techs working in our data centers. Each time a drive fails and they have to replace it, our techs add a reason for why it was removed from the fleet. While not all drive replacements are “failures”, adding a root cause to our Drive Stats dataset would give us more confidence in our failure reporting (and would save Andy comparing the two lists). 

The Result: Faster Drive Stats and Future Fun

These two improvements (the date range accumulator and upgrading to Python 3) resulted in hours, and maybe even days, of work saved. Even from a troubleshooting point of view, we often wouldn’t know if the process was stuck, or if this was the normal amount of time the module should take to run. Now, if it takes more than about 15 minutes to run a report, you’re sure there’s a problem. 

While the Drive Stats dataset can’t really be called “big data”, it provides a good, concrete example of scaling with your data. We’ve been collecting Drive Stats for just over 10 years now, and even though most of the code written way back when is inherently sound, small improvements that seem marginal become amplified as datasets grow. 

Now that we’ve got better documentation of how everything works, it’s going to be easier to keep Drive Stats up-to-date with the best tools and run with future improvements. Let us know in the comments what you’d be interested in seeing.

The post Overload to Overhaul: How We Upgraded Drive Stats Data appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AI 101: Do the Dollars Make Sense?

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-do-the-dollars-make-sense/

A decorative image showing a cloud reaching out with digital tentacles to stacks of dollar signs.

Welcome back to AI 101, a series dedicated to breaking down the realities of artificial intelligence (AI). Previously we’ve defined artificial intelligence, deep learning (DL), and machine learning (ML) and dove into the types of processors that make AI possible. Today we’ll talk about one of the biggest limitations of AI adoption—how much it costs. Experts have already flagged that the significant investment necessary for AI can cause antitrust concerns and that AI is driving up costs in data centers.

To that end, we’ll talk about: 

  • Factors that impact the cost of AI.
  • Some real numbers about the cost of AI components. 
  • The AI tech stack and some of the industry solutions that have been built to serve it.
  • And, uncertainty.

Defining AI: Complexity and Cost Implications

While ChatGPT, DALL-E, and the like may be the most buzz-worthy of recent advancements, AI has already been a part of our daily lives for several years now. In addition to generative AI models, examples include virtual assistants like Siri and Google Home, fraud detection algorithms in banks, facial recognition software, URL threat analysis services, and so on. 

That brings us to the first challenge when it comes to understanding the cost of AI: The type of AI you’re training—and how complex a problem you want it to solve—has a huge impact on the computing resources needed and the cost, both in the training and in the implementation phases. AI tasks are hungry in all ways: they need a lot of processing power, storage capacity, and specialized hardware. As you scale up or down in the complexity of the task you’re doing, there’s a huge range in the types of tools you need and their costs.   

To understand the cost of AI, several other factors come into play as well, including: 

  • Latency requirements: How fast does the AI need to make decisions? (e.g. that split second before a self-driving car slams on the brakes.)
  • Scope: Is the AI solving broad-based or limited questions? (e.g. the best way to organize this library vs. how many times the word “cat” appears in this article.)
  • Actual human labor: How much oversight does it need? (e.g. does a human identify the cat in cat photos, or does the AI algorithm identify them?)
  • Adding data: When, how, and in what quantity will new data need to be ingested to update information over time? 

This is by no means an exhaustive list, but it gives you an idea of the considerations that can affect the kind of AI you’re building and, thus, what it might cost.

The Big Three AI Cost Drivers: Hardware, Storage, and Processing Power

In simple terms, you can break down the cost of running an AI into a few main components: hardware, storage, and processing power. That’s a little bit simplistic, and you’ll see some of these lines blur and expand as we get into the details of each category. But, for our purposes today, this is a good place to start to understand how much it costs to ask a bot to create a squirrel holding a cool guitar.

An AI generated image of a squirrel holding a guitar. Both the squirrel and the guitar are warped in strange, but not immediately noticeable, ways.
Still not quite there on the guitar. Or the squirrel. How much could this really cost?

First Things First: Hardware Costs

Running an AI takes specialized processors that can handle complex processing queries. We’re early in the game when it comes to picking a “winner” for specialized processors, but these days, the most common processor is a graphical processing unit (GPU), with Nvidia’s hardware and platform as an industry favorite and front-runner. 

The most common “workhorse chip” of AI processing tasks, the Nvidia A100, starts at about $10,000 per chip, and a set of eight of the most advanced processing chips can cost about $300,000. When Elon Musk wanted to invest in his generative AI project, he reportedly bought 10,000 GPUs, which equates to an estimated value in the tens of millions of dollars. He’s gone on record as saying that AI chips can be harder to get than drugs.

Google offers folks the ability to rent their TPUs through the cloud starting at $1.20 per chip hour for on-demand service (less if you commit to a contract). Meanwhile, Intel released a sub-$100 USB stick with a full NPU that can plug into your personal laptop, and folks have created their own models at home with the help of open sourced developer toolkits. Here’s a guide to using them if you want to get in the game yourself. 
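To put the rental rate in perspective, a quick back-of-the-envelope calculation (the workload size is hypothetical; only the $1.20 per chip-hour on-demand rate comes from the text above):

```python
def rental_cost(chips: int, hours: float, rate_per_chip_hour: float = 1.20) -> float:
    """Estimate on-demand accelerator rental cost in dollars."""
    return chips * hours * rate_per_chip_hour

# A hypothetical week-long training run on 64 rented chips:
print(f"${rental_cost(64, 24 * 7):,.2f}")  # $12,902.40
```

Committed-use contracts lower the per-chip-hour rate, which is one reason the scale and duration of the workload drives the buy-versus-rent decision.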

Clearly, the spectrum for chips is vast—from under $100 to millions—and the landscape for chip producers is changing often, as is the strategy for monetizing those chips—which leads us to our next section. 

Using Third Parties: Specialized Problems = Specialized Service Providers

Building AI is a challenge with so many moving parts that, in a business use case, you eventually confront the question of whether it’s more efficient to outsource it. It’s true of storage, and it’s definitely true of AI processing. You can already see one way Google answered that question above: create a network populated by their TPUs, then sell access.   

Other companies specialize in broader or narrower parts of the AI creation and processing chain. To name a few diverse examples: Hugging Face, Inflection AI, CoreWeave, and Vultr. Their product offerings span a wide range, from open source communities like Hugging Face, which provides a menu of models, datasets, no-code tools, and (frankly) rad developer experiments, to bare metal servers like Vultr that expand your compute resources. How resources are offered also exists on a spectrum, from proprietary company resources (i.e., Nvidia's platform), to open source communities (looking at you, Hugging Face), to a mix of the two.

An AI generated comic showing various iterations of data storage superheroes.
A comic generated on Hugging Face’s AI Comic Factory.

This means that, whichever piece of the AI tech stack you’re considering, you have a high degree of flexibility when you’re deciding where and how much you want to customize and where and how to implement an out-of-the box solution. 

Ballparking an estimate of what any of that costs would be so dependent on the particular model you want to build and the third-party solutions you choose that it doesn't make sense to do so here. But suffice it to say that there's a pretty narrow field of folks who have the infrastructure capacity, the datasets, and the business need to create their own network. Usually it comes back to some combination of the following: whether you have existing infrastructure to leverage or are building from scratch, whether you're going to sell the solution to others, what control over research or datasets you have or want, how important privacy is and how you're incorporating it into your products, how fast you need the model to make decisions, and so on.

Welcome to the Spotlight, Storage

And, hey, with all that, let's not forget storage. At the most basic level of consideration, AI uses a ton of data. How much? Conventional wisdom says you need at least an order of magnitude more training examples than the model has parameters. That means you want 10 times more examples than parameters.

Parameters and Hyperparameters

The easiest way to think of parameters is to think of them as factors that control how an AI makes a decision. More parameters = more accuracy. And, just like our other AI terms, this one can be somewhat inconsistently applied. Here's what ChatGPT has to say for itself:

A screenshot of a conversation with ChatGPT where it tells us it has 175 billion parameters.

That 10x number is just the amount of data you store for the initial training model—clearly the thing learns and grows, because we’re talking about AI. 
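As a sketch of that rule of thumb (the 10x multiplier is just the heuristic above, and the parameter count is illustrative):

```python
def min_training_examples(num_parameters: int, multiplier: int = 10) -> int:
    """Rule of thumb: at least ~10x more training examples than model parameters."""
    return num_parameters * multiplier

# A GPT-3-scale model with 175 billion parameters, per the screenshot above
print(min_training_examples(175_000_000_000))  # 1750000000000 (1.75 trillion examples)
```

The heuristic is rough, but it explains why dataset size, and therefore storage, scales with model size.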

Preserving both your initial training algorithm and your datasets can be incredibly useful, too. As we talked about before, the more complex an AI, the higher the likelihood that your model will surprise you. And, as many folks have pointed out, deciding whether to leverage an already-trained model or to build your own doesn’t have to be an either/or—oftentimes the best option is to fine-tune an existing model to your narrower purpose. In both cases, having your original training model stored can help you roll back and identify the changes over time. 

The size of the dataset absolutely affects costs and processing times. The best example is that ChatGPT, everyone’s favorite model, has been rocking GPT-3 (or 3.5) instead of GPT-4 on the general public release because GPT-4, which works from a much larger, updated dataset than GPT-3, is too expensive to release to the wider public. It also returns results much more slowly than GPT-3.5, which means that our current love of instantaneous search results and image generation would need an adjustment. 

And all of that is true because GPT-4 was updated with more information (by volume), more up-to-date information, and the model was given more parameters to take into account for responses. So, it has to both access more data per query and use more complex reasoning to make decisions. That said, it also reportedly has much better results.

Storage and Cost

What are the real numbers to store, say, a primary copy of an AI dataset? Well, it's hard to estimate, but we can ballpark that, if you're training a large AI model, you're going to have at a minimum tens of gigabytes of data and, at a maximum, petabytes. OpenAI considers the size of its training database proprietary information, and we've found sources that cite that number as anywhere from 17GB to 570GB to 45TB of text data.

That’s not actually a ton of data, and, even taking the highest number, it would only cost $225 per month to store that data in Backblaze B2 (45TB * $5/TB/mo), for argument’s sake. But let’s say you’re training an AI on video to, say, make a robot vacuum that can navigate your room or recognize and identify human movement. Your training dataset could easily reach into petabyte scale (for reference, one petabyte would cost $5,000 per month in Backblaze B2). Some research shows that dataset size is trending up over time, though other folks point out that bigger is not always better.
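The arithmetic here is simple enough to sketch, using the Backblaze B2 rate quoted above (your dataset size will vary):

```python
def monthly_storage_cost(terabytes: float, price_per_tb_month: float = 5.0) -> float:
    """Flat-rate monthly storage cost, at the $5/TB/month figure from the text."""
    return terabytes * price_per_tb_month

print(monthly_storage_cost(45))     # 225.0  -> the 45TB text corpus
print(monthly_storage_cost(1_000))  # 5000.0 -> a one-petabyte video dataset
```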

On the other hand, if you’re the guy with the Intel Neural Compute stick we mentioned above and a Raspberry Pi, you’re talking the cost of the ~$100 AI processor, ~$50 for the Raspberry Pi, and any incidentals. You can choose to add external hard drives, network attached storage (NAS) devices, or even servers as you scale up.

Storage and Speed

Keep in mind that, in the above example, we're only considering the cost of storing the primary dataset, and that's not very accurate when thinking about how you'd be using your dataset. You'd also have to consider temporary storage for when you're actually training the AI, as your primary dataset is transformed by your AI algorithm. And you're nearly always splitting your primary dataset into discrete parts and feeding those to your AI algorithm in stages, so each of those subsets would also be stored separately. In addition to needing a lot of storage, where you physically locate that storage makes a huge difference to how quickly tasks can be accomplished. In many cases, the difference is a matter of seconds, but some tasks just can't handle that delay—think of self-driving cars. 

For huge data ingest periods such as training, you’re often talking about a compute process that’s assisted by powerful, and often specialized, supercomputers, with repeated passes over the same dataset. Having your data physically close to those supercomputers saves you huge amounts of time, which is pretty incredible when you consider that it breaks down to as little as milliseconds per task.

One way this problem is being solved is via caching, or creating temporary storage on the same chips (or motherboards) as the processor completing the task. Another solution is to keep the whole processing and storage cluster on-premises (at least while training), as you can see in the Microsoft-OpenAI setup or as you’ll often see in universities. And, unsurprisingly, you’ll also see edge computing solutions which endeavor to locate data physically close to the end user. 

While there can be benefits to on-premises or co-located storage, having a way to quickly add more storage (and release it if no longer needed), means cloud storage is a powerful tool for a holistic AI storage architecture—and can help control costs. 

And, as always, effective backup strategies require at least one off-site storage copy, and the easiest way to achieve that is via cloud storage. So, any way you slice it, you’re likely going to have cloud storage touch some part of your AI tech stack. 

What Hardware, Processing, and Storage Have in Common: You Have to Power Them

Here’s the short version: any time you add complex compute + large amounts of data, you’re talking about a ton of money and a ton of power to keep everything running. 

A disorganized set of power cords and switches plugged into what is decidedly too small of an outlet space.
Just flip the switch, and you have AI. Source.

Fortunately for us, other folks have done the work of figuring out how much this all costs. This excellent article from SemiAnalysis goes deep on the total cost of powering searches and running generative AI models. The Washington Post cites Dylan Patel (also of SemiAnalysis) as estimating that a single chat with ChatGPT could cost up to 1,000 times as much as a simple Google search. Those costs include everything we’ve talked about above—the capital expenditures, data storage, and processing. 

Consider this: Google spent several years putting off publicizing a frank accounting of their power usage. When they released numbers in 2011, they said that they used enough electricity to power 200,000 homes. And that was in 2011. There are widely varying claims for how much a single search costs, but even the most conservative say 0.03 Wh of energy. There are approximately 8.5 billion Google searches per day. (That's just an incremental cost, by the way—as in, how much a single search costs in extra resources on top of how much the system that powers it costs.) 
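Using those two figures, here's what the incremental search energy works out to per day (both inputs are the estimates cited above, not measured values):

```python
WH_PER_SEARCH = 0.03       # conservative incremental energy per search, from the text
SEARCHES_PER_DAY = 8.5e9   # approximate daily Google searches, from the text

daily_wh = WH_PER_SEARCH * SEARCHES_PER_DAY
daily_mwh = daily_wh / 1_000_000  # 1 MWh = 1,000,000 Wh
print(f"{daily_mwh:,.0f} MWh per day")  # 255 MWh per day, incremental cost alone
```

That's a substantial amount of power for the cheapest kind of query; generative AI queries multiply it considerably.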

Power is a huge cost in operating data centers, even when you’re only talking about pure storage. One of the biggest single expenses that affects power usage is cooling systems. With high-compute workloads, and particularly with GPUs, the amount of work the processor is doing generates a ton more heat—which means more money in cooling costs, and more power consumed. 

So, to Sum Up

When we’re talking about how much an AI costs, it’s not just about any single line item cost. If you decide to build and run your own models on-premises, you’re talking about huge capital expenditure and ongoing costs in data centers with high compute loads. If you want to build and train a model on your own USB stick and personal computer, that’s a different set of cost concerns. 

And, if you’re talking about querying a generative AI from the comfort of your own computer, you’re still using a comparatively high amount of power somewhere down the line. We may spread that power cost across our national and international infrastructures, but it’s important to remember that it’s coming from somewhere—and that the bill comes due, somewhere along the way. 

The post AI 101: Do the Dollars Make Sense? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The SSD Edition: 2023 Drive Stats Mid-Year Review

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/ssd-edition-2023-mid-year-drive-stats-review/

A decorative image displaying the title 2023 Mid-Year Report Drive Stats SSD Edition.

Welcome to the 2023 Mid-Year SSD Edition of the Backblaze Drive Stats review. This report is based on data from the solid state drives (SSDs) we use as storage server boot drives on our Backblaze Cloud Storage platform. In this environment, the drives do much more than boot the storage servers. They also store log files and temporary files produced by the storage server. Each day a boot drive will read, write, and delete files depending on the activity of the storage server itself.

We will review the quarterly and lifetime failure rates for these drives, and along the way we’ll offer observations and insights to the data presented. In addition, we’ll take a first look at the average age at which our SSDs fail, and examine how well SSD failure rates fit the ubiquitous bathtub curve.

Mid-Year SSD Results by Quarter

As of June 30, 2023, there were 3,144 SSDs in our storage servers. This compares to the 2,558 SSDs we reported in our 2022 SSD annual report. We'll start by presenting and discussing the quarterly data from each of the last two quarters (Q1 2023 and Q2 2023).

Notes and Observations

Data is by quarter: The data used in each table is specific to that quarter. That is, the number of drive failures and drive days are inclusive of the specified quarter, Q1 or Q2. The drive counts are as of the last day of each quarter.

Drives added: Since our last SSD report, ending in Q4 2022, we added 238 SSD drives to our collection. Of that total, the Crucial (model: CT250MX500SSD1) led the way with 110 new drives added, followed by 62 new WDC drives (model: WD Blue SA510 2.5) and 44 Seagate drives (model: ZA250NM1000).

Really high annualized failure rates (AFR): Some of the failure rates, that is, the AFRs, seem crazy high. How could the Seagate model SSDSCKKB240GZR have an annualized failure rate over 800%? In that case, in Q1, we started with two drives and one failed shortly after being installed. Hence, the high AFR. In Q2, the remaining drive did not fail and the AFR was 0%. Which AFR is useful? In this case, neither; we just don't have enough data to get decent results. For any given drive model, we like to see at least 100 drives and 10,000 drive days in a given quarter as a minimum before we begin to consider the calculated AFR to be "reasonable." We include all of the drive models for completeness, so keep an eye on drive count and drive days before you look at the AFR with a critical eye.
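For context, the AFR figures in these tables follow the formula Backblaze has described in past Drive Stats posts: failures per drive-year of operation, expressed as a percentage. A quick sketch shows how a tiny sample produces an 800%+ AFR (the two-drive numbers below are hypothetical):

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """AFR as used in Drive Stats: failures per drive-year of service, as a percent."""
    drive_years = drive_days / 365
    return failures / drive_years * 100

# Hypothetical: two drives in service for roughly three weeks each, one failure
print(round(annualized_failure_rate(1, 45), 1))  # 811.1 -- tiny samples give wild AFRs
```

With so few drive days in the denominator, a single early failure annualizes to an enormous rate, which is exactly why the 100-drive / 10,000-drive-day threshold matters.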

Quarterly Annualized Failure Rates Over Time

The data in any given quarter can be volatile with factors like drive age and the randomness of failures factoring in to skew the AFR up or down. For Q1, the AFR was 0.96% and, for Q2, the AFR was 1.05%. The chart below shows how these quarterly failure rates relate to previous quarters over the last three years.

As you can see, the AFR fluctuates between 0.36% and 1.72%, so what's the value of quarterly rates? Well, they are useful as the proverbial canary in a coal mine. For example, the AFR in Q1 2021 (0.58%) jumped to 1.51% in Q2 2021, then to 1.72% in Q3 2021. A subsequent investigation showed one drive model was the primary cause of the rise, and that model was removed from service. 

It happens from time to time that a given drive model is not compatible with our environment, and we will moderate or even remove that drive’s effect on the system as a whole. While not as critical as data drives in managing our system’s durability, we still need to keep boot drives in operation to collect the drive/server/vault data they capture each day. 

How Backblaze Uses the Data Internally

As you’ve seen in our SSD and HDD Drive Stats reports, we produce quarterly, annual, and lifetime charts and tables based on the data we collect. What you don’t see is that every day we produce similar charts and tables for internal consumption. While typically we produce one chart for each drive model, in the example below we’ve combined several SSD models into one chart. 

The “Recent” period we use internally is 60 days. This differs from our public facing reports which are quarterly. In either case, charts like the one above allow us to quickly see trends requiring further investigation. For example, in our chart above, the recent results of the Micron SSDs indicate a deeper dive into the data behind the charts might be necessary.

By collecting, storing, and constantly analyzing the Drive Stats data we can be proactive in maintaining our durability and availability goals. Without our Drive Stats data, we would be inclined to over-provision our systems as we would be blind to the randomness of drive failures which would directly impact those goals.

A First Look at More SSD Stats

Over the years in our quarterly Hard Drive Stats reports, we’ve examined additional metrics beyond quarterly and lifetime failure rates. Many of these metrics can be applied to SSDs as well. Below we’ll take a first look at two of these: the average age of failure for SSDs and how well SSD failures correspond to the bathtub curve. In both cases, the datasets are small, but are a good starting point as the number of SSDs we monitor continues to increase.

The Average Age of Failure for SSDs

Previously, we calculated the average age at which a hard drive in our system fails. In our initial calculations that turned out to be about two years and seven months. That was a good baseline, but further analysis was required as many of the drive models used in the calculations were still in service and hence some number of them could fail, potentially affecting the average.

We are going to apply the same calculations to our collection of failed SSDs and establish a baseline we can work from going forward. Our first step was to determine the SMART_9_RAW value (power-on-hours or POH) for the 63 failed SSD drives we have to date. That’s not a great dataset size, but it gave us a starting point. Once we collected that information, we computed that the average age of failure for our collection of failed SSDs is 14 months. Given that the average age of the entire fleet of our SSDs is just 25 months, what should we expect to happen as the average age of the SSDs still in operation increases? The table below looks at three drive models which have a reasonable amount of data.

MFG     Model            Good Drives: Count / Avg Age    Failed Drives: Count / Avg Age
Crucial CT250MX500SSD1   598 / 11 months                 9 / 7 months
Seagate ZA250CM10003     1,114 / 28 months               14 / 11 months
Seagate ZA250CM10002     547 / 40 months                 17 / 25 months

As we can see in the table, the average age of the failed drives increases as the average age of drives in operation (good drives) increases. In other words, it is reasonable to expect that the average age of SSD failures will increase as the entire fleet gets older.
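The underlying conversion is straightforward: SMART attribute 9 reports power-on hours, which can be translated into an approximate age in months. A small sketch (the POH values below are made up for illustration):

```python
HOURS_PER_MONTH = 730  # ~365.25 days * 24 hours / 12 months

def poh_to_months(power_on_hours: float) -> float:
    """Convert SMART attribute 9 (power-on hours) to an approximate age in months."""
    return power_on_hours / HOURS_PER_MONTH

# Hypothetical failed-drive POH readings; the mean is the "average age of failure"
failed_poh = [9_500, 10_200, 11_000]
avg_months = sum(poh_to_months(h) for h in failed_poh) / len(failed_poh)
print(round(avg_months, 1))  # 14.0
```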

Is There a Bathtub Curve for SSD Failures?

Previously we’ve graphed our hard drive failures over time to determine their fit to the classic bathtub curve used in reliability engineering. Below, we used our SSD data to determine how well our SSD failures fit the bathtub curve.

While the actual curve (blue line) produced by the SSD failures over each quarter is a bit “lumpy”, the trend line (second order polynomial) does have a definite bathtub curve look to it. The trend line is about a 70% match to the data, so we can’t be too confident of the curve at this point, but for the limited amount of data we have, it is surprising to see how the occurrences of SSD failures are on a path to conform to the tried-and-true bathtub curve.

SSD Lifetime Annualized Failure Rates

As of June 30, 2023, there were 3,144 SSDs in our storage servers. The table below is based on the lifetime data for the drive models which were active as of the end of Q2 2023.

Notes and Observations

Lifetime AFR: The lifetime data is cumulative from Q4 2018 through Q2 2023. For this period, the lifetime AFR for all of our SSDs was 0.90%. That was up slightly from 0.89% at the end of Q4 2022, but down from a year ago, Q2 2022, at 1.08%.

High failure rates?: As we noted with the quarterly stats, we like to have at least 100 drives and over 10,000 drive days to give us some level of confidence in the AFR numbers. If we apply that metric to our lifetime data, we get the following table.

Applying our modest criteria to the list eliminated those drive models with crazy high failure rates. This is not a statistics trick; we just removed those models which did not have enough data to make the calculated AFR reliable. It is possible the drive models we removed will continue to have high failure rates. It is also just as likely their failure rates will fall into a more normal range. If this technique seems a bit blunt to you, then confidence intervals may be what you are looking for.

Confidence intervals: In general, the more data you have and the more consistent that data is, the more confident you are in the predictions based on that data. We calculate confidence intervals at 95% certainty. 

For SSDs, we like to see a confidence interval of 1.0% or less between the low and the high values before we are comfortable with the calculated AFR. If we apply this metric to our lifetime SSD data we get the following table.

This doesn’t mean the failure rates for the drive models with a confidence interval greater than 1.0% are wrong; it just means we’d like to get more data to be sure. 
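To make the interval idea concrete, here's a back-of-the-envelope sketch using a normal approximation for failure counts. This illustrates the concept only; it is not necessarily the exact method used to produce the tables in this report:

```python
import math

def afr_confidence_interval(failures: int, drive_days: float, z: float = 1.96):
    """Approximate 95% interval for AFR via a normal approximation to the failure count."""
    drive_years = drive_days / 365
    afr = failures / drive_years * 100
    margin = z * math.sqrt(failures) / drive_years * 100
    return max(afr - margin, 0.0), afr + margin

# 10 failures over 1,000 drive-years: the AFR is 1.0%, but the interval is wide
low, high = afr_confidence_interval(10, 365_000)
print(round(low, 2), round(high, 2))  # 0.38 1.62
```

With the low and high values spanning more than 1.0%, a model like this hypothetical one would land in the "we'd like more data" bucket.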

Regardless of the technique you use, both are meant to help clarify the data presented in the tables throughout this report.

The SSD Stats Data

The data collected and analyzed for this review is available on our Drive Stats Data page. You’ll find SSD and HDD data in the same files and you’ll have to use the model number to locate the drives you want, as there is no field to designate a drive as SSD or HDD. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone—it is free.

Good luck and let us know if you find anything interesting.

The post The SSD Edition: 2023 Drive Stats Mid-Year Review appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Big Performance Improvements in Rclone 1.64.0, but Should You Upgrade?

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/big-performance-improvements-in-rclone-1-64-0-but-should-you-upgrade/

A decorative image showing a diagram about multithreading, as well as the Rclone and Backblaze logos.

Rclone is an open source, command line tool for file management, and it’s widely used to copy data between local storage and an array of cloud storage services, including Backblaze B2 Cloud Storage. Rclone has had a long association with Backblaze—support for Backblaze B2 was added back in January 2016, just two months before we opened Backblaze B2’s public beta, and five months before the official launch—and it’s become an indispensable tool for many Backblaze B2 customers. 

Rclone v1.64.0, released last week, includes a new implementation of multithreaded data transfers, promising much faster data transfer of large files between cloud storage services. 

Does it deliver? Should you upgrade? Read on to find out!

Multithreading to Boost File Transfer Performance

Something of a Swiss Army Knife for cloud storage, rclone can copy files, synchronize directories, and even mount remote storage as a local filesystem. Previous versions of rclone were able to take advantage of multithreading to accelerate the transfer of “large” files (by default at least 256MB), but the benefits were limited. 

When transferring files from a storage system to Backblaze B2, rclone would read chunks of the file into memory in a single reader thread, starting a set of multiple writer threads to simultaneously write those chunks to Backblaze B2. When the source storage was a local disk (the common case) as opposed to remote storage such as Backblaze B2, this worked really well—the operation of moving files from local disk to Backblaze B2 was quite fast. However, when the source was another remote storage—say, transferring from Amazon S3 to Backblaze B2, or even Backblaze B2 to Backblaze B2—data chunks were read into memory by that single reader thread at about the same rate as they could be written to the destination, meaning that all but one of the writer threads were idle.

What’s the Big Deal About Rclone v1.64.0?

Rclone v1.64.0 completely refactors multithreaded transfers. Now rclone starts a single set of threads, each of which both reads a chunk of data from the source service into memory, and then writes that chunk to the destination service, iterating through a subset of chunks until the transfer is complete. The threads transfer their chunks of data in parallel, and each transfer is independent of the others. This architecture is both simpler and much, much faster.
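In pseudocode terms, the new architecture looks something like the following Python sketch. Rclone itself is written in Go, and real transfers move chunks over the network; this just illustrates the idea of independent threads that each read and write their own chunks:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes per chunk; tiny on purpose, purely for illustration

def transfer_chunk(source: bytes, dest: bytearray, index: int) -> None:
    """One worker: read a chunk from the 'source service', write it to the 'destination'."""
    start = index * CHUNK_SIZE
    chunk = source[start:start + CHUNK_SIZE]  # read step
    dest[start:start + len(chunk)] = chunk    # write step

source = b"0123456789abcdef"
dest = bytearray(len(source))
num_chunks = (len(source) + CHUNK_SIZE - 1) // CHUNK_SIZE

# Each thread both reads and writes its own chunks, independently of the others
with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(num_chunks):
        pool.submit(transfer_chunk, source, dest, i)

print(bytes(dest) == source)  # True
```

Because no thread waits on a shared reader, every worker stays busy, which is where the speedup comes from.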

Show Me the Numbers!

How much faster? I spun up a virtual machine (VM) via our compute partner, Vultr, and downloaded both rclone v1.64.0 and the preceding version, v1.63.1. As a quick test, I used Rclone’s copyto command to copy 1GB and 10GB files from Amazon S3 to Backblaze B2, like this:

rclone --no-check-dest copyto s3remote:my-s3-bucket/1gigabyte-test-file b2remote:my-b2-bucket/1gigabyte-test-file

Note that I made no attempt to "tune" rclone for my environment by setting the chunk size or number of threads. I was interested in the out-of-the-box performance. I used the --no-check-dest flag so that rclone would overwrite the destination file each time, rather than detecting that the files were the same and skipping the copy.

I ran each copyto operation three times, then calculated the average time. Here are the results; all times are in seconds:

Rclone version 1GB 10GB
1.63.1 52.87 725.04
1.64.0 18.64 240.45

As you can see, the difference is significant! The new rclone transferred both files around three times faster than the previous version.
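Computing the speedup from the averages above:

```python
# Average copy times in seconds, from the table above
times = {
    "1.63.1": {"1GB": 52.87, "10GB": 725.04},
    "1.64.0": {"1GB": 18.64, "10GB": 240.45},
}

for size in ("1GB", "10GB"):
    speedup = times["1.63.1"][size] / times["1.64.0"][size]
    print(f"{size}: {speedup:.1f}x faster in v1.64.0")
```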

So, copying individual large files is much faster with the latest version of rclone. How about migrating a whole bucket containing a variety of file sizes from Amazon S3 to Backblaze B2, which is a more typical operation for a new Backblaze customer? I used rclone’s copy command to transfer the contents of an Amazon S3 bucket—2.8GB of data, comprising 35 files ranging in size from 990 bytes to 412MB—to a Backblaze B2 Bucket:

rclone --fast-list --no-check-dest copy s3remote:my-s3-bucket b2remote:my-b2-bucket

Much to my dismay, this command failed, returning errors related to the files being corrupted in transfer, for example:

2023/09/18 16:00:37 ERROR : tpcds-benchmark/catalog_sales/20221122_161347_00795_djagr_3a042953-d0a2-4b8d-8c4e-6a88df245253: corrupted on transfer: sizes differ 244695498 vs 0

Rclone was reporting that the transferred files in the destination bucket contained zero bytes, and deleting them to avoid the use of corrupt data.

After some investigation, I discovered that the files were actually being transferred successfully, but a bug in rclone 1.64.0 caused the app to incorrectly interpret some successful transfers as corrupted, and thus delete the transferred file from the destination. 

I was able to use the --ignore-size flag to work around the bug by disabling the file size check so I could continue with my testing:

rclone --fast-list --no-check-dest --ignore-size copy s3remote:my-s3-bucket b2remote:my-b2-bucket

A Word of Caution to Control Your Transaction Fees

Note the use of the --fast-list flag. By default, rclone’s method of reading the contents of cloud storage buckets minimizes memory usage at the expense of making a “list files” call for every subdirectory being processed. Backblaze B2’s list files API, b2_list_file_names, is a class C transaction, priced at $0.004 per 1,000 with 2,500 free per day. This doesn’t sound like a lot of money, but using rclone with large file hierarchies can generate a huge number of transactions. Backblaze B2 customers have either hit their configured caps or incurred significant transaction charges on their account when using rclone without the --fast-list flag.
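Here's what that pricing looks like in practice. The one-million-call figure below is a hypothetical deep directory hierarchy, not a measured workload:

```python
def class_c_cost(transactions: int,
                 free_per_day: int = 2_500,
                 price_per_thousand: float = 0.004) -> float:
    """Daily cost of Backblaze B2 class C calls (e.g., list files) at the rates in the text."""
    billable = max(transactions - free_per_day, 0)
    return billable / 1_000 * price_per_thousand

# One "list files" call per subdirectory across a 1,000,000-directory tree
print(round(class_c_cost(1_000_000), 2))  # 3.99 -> ~$3.99/day, every day the sync runs
```

A few dollars a day is easy to overlook until a scheduled sync runs it up over weeks or months, which is exactly the scenario --fast-list avoids.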

We recommend you always use --fast-list with rclone if at all possible. You can set an environment variable so you don't have to include the flag in every command:

export RCLONE_FAST_LIST=true
Again, I performed the copy operation three times, and averaged the results:

Rclone version 2.8GB tree
1.63.1 56.92
1.64.0 42.47

Since the bucket contains both large and small files, we see a lesser, but still significant, improvement in performance with rclone v1.64.0—it’s about 33% faster than the previous version with this set of files.

So, Should I Upgrade to the Latest Rclone?

As outlined above, rclone v1.64.0 contains a bug that can cause copy (and presumably also sync) operations to fail. If you want to upgrade to v1.64.0 now, you’ll have to use the --ignore-size workaround. If you don’t want to use the workaround, it’s probably best to hold off until rclone releases v1.64.1, when the bug fix will likely be deployed—I’ll come back and update this blog entry when I’ve tested it!

The post Big Performance Improvements in Rclone 1.64.0, but Should You Upgrade? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.