Tag Archives: B2Cloud

Leveraging Backblaze Drive Stats to Boost Backblaze B2 Cloud Storage Sales: A Guide for Reseller Partners

Post Syndicated from Mary Ellen Cavanagh original https://www.backblaze.com/blog/leveraging-backblaze-drive-stats-to-boost-backblaze-b2-cloud-storage-sales-a-guide-for-reseller-partners/

A decorative image showing a variety of images related to Backblaze and cloud storage.

If you’re a reseller partner, we know it’s hard to cut through the noise and get potential clients interested in the services you sell. It helps when you’re able to share relevant, useful, truly valuable information with them to build your brand and engage potential clients in prospective services. 

The Backblaze Drive Stats reports can be a powerful tool in your arsenal. They not only provide insights into drive reliability but also empower you to better position and sell Backblaze B2 Cloud Storage. So, let’s dig into what Drive Stats are and how you can use them to serve your clients.

What Are Drive Stats?

The Backblaze Drive Stats reports include a comprehensive set of data that Backblaze openly shares about the performance and reliability of the hard drives that we use in our data centers. The data we publish is excellent for building trust with customers—it’s unique in the industry, regularly covered in industry media, and used by everyone from IT admins to research institutions to inform their strategies. Use it to level up your understanding of hard drives in general—including how they affect cloud storage infrastructure—and to build trust with end users around Backblaze in particular.

How Can I Use Drive Stats as a Reseller?

Identifying and Addressing Customer Concerns

You probably encounter customer concerns regarding the potential risks associated with data storage—both on premises and in the cloud—all the time. With Drive Stats, you can speak to those concerns with hard data on drive failure rates. This data-driven approach empowers customers to make optimal operational decisions and positions you as a knowledgeable, trusted advisor in their cloud storage journey.

Tailoring Solutions to Customer Needs

Every business has unique data storage and backup requirements, often a combination of on-premises and cloud-based data storage. In crafting the proper storage solution for your clients, you are often confronted with cost versus reliability trade-offs. The Backblaze Drive Stats reports provide a dependable source of unbiased drive reliability statistics when local data storage is required. With the Drive Stats data at hand, you can apply your knowledge and experience to confidently propose and deliver a comprehensive, cost-effective data storage and backup solution at a fair price that meets your customers’ unique needs.

Educating Customers on Data Management Best Practices

Beyond selling a product, you play a vital role in educating your customers on best practices for data management. Backblaze Drive Stats provide you with valuable insights that can be shared with your clients to help them make informed decisions about their storage strategy. By educating customers on the factors that contribute to reliable and efficient data storage, you position yourselves as trusted advisors in the rapidly evolving world of cloud technology.

Drive Stats as Your Competitive Advantage

In the competitive landscape of cloud storage solutions, reseller partners can gain a strategic advantage by harnessing the power of Backblaze Drive Stats as an effective, valuable, and powerful piece of content. The stats not only enhance transparency and build trust with customers but also empower resellers to effectively address concerns, tailor solutions, and educate clients on data management best practices. By leveraging this valuable resource, resellers can position themselves as leaders in the market and drive the success of Backblaze B2 Cloud Storage.

The post Leveraging Backblaze Drive Stats to Boost Backblaze B2 Cloud Storage Sales: A Guide for Reseller Partners appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

New Open Source Tool for Consistency in Cassandra Migrations

Post Syndicated from Elliott Sims original https://www.backblaze.com/blog/new-open-source-tool-for-consistency-in-cassandra-migrations/

A decorative image showing the Cassandra logo with a function represented by two servers on either side of the logo.

Sometimes you find a problem because something breaks, and other times you find a problem the good way—by thinking it through before things break. This is a story about one of those bright, shining, lightbulb moments when you find a problem the good way.

On the Backblaze Site Reliability Engineering (SRE) team, we were thinking through an upcoming datacenter migration in Cassandra. We were running through all of the various types of queries we would have to do when we had the proverbial “aha” moment. We discovered an inconsistency in the way Cassandra handles lightweight transactions (LWTs).

If you’ve ever tried to do a datacenter migration in Cassandra and something got corrupted in the process but you couldn’t figure out why or how—this might be why. I’m going to walk through a short intro on Cassandra, how we use it, and the issue we ran into. Then, I’ll explain the workaround, which we open sourced. 

Get the Open Source Code

You can download the open source code from our Git repository. We’d love to know how you’re using it and how it’s working for you—let us know in the comments.

How We Use Cassandra

First, if you’re not a Cassandra dev, I should mention that when we say “datacenter migration” it means something slightly different in Cassandra than what it sounds like. It doesn’t mean a data center migration in the physical sense (although you can use datacenter migrations in Cassandra when you’re moving data from one physical data center to another). In the simplest terms, it involves moving data between two Cassandra or Cassandra-compatible database replica sets within a cluster.

And, if you’re not familiar with Cassandra at all, it’s an open-source, NoSQL, distributed database management system. It was created to handle large amounts of data across many commodity servers, so it fits our use case—lots of data, lots of servers. 

At Backblaze, we use Cassandra to index file names to locations for data stored in Backblaze B2, for example. Because it’s customer data and not just analytics, we care more about durability and consistency than many other applications of Cassandra do. We run with three replicas in a single datacenter and use the “batch” commit log mode, which requires writes to be committed to disk before they are acknowledged, rather than the default “periodic” mode.

Datacenter migrations are an important aspect of running Cassandra, especially on bare metal. We do a few datacenter migrations per year either for physical data moves, hardware refresh, or to change certain cluster layout parameters like tokens per host that are otherwise static. 

What Are LWTs and Why Do They Matter for Datacenter Migrations in Cassandra?

First of all, LWTs are neither lightweight nor transactions, but that’s neither here nor there. They are an important feature in Cassandra. Here’s why. 

Cassandra is great at scaling. In something like a replicated SQL cluster, you can add additional replicas for read throughput, but not writes. Cassandra scales writes (as well as reads) nearly linearly with the number of hosts—into the hundreds. Adding nodes is a fairly straightforward and “automagic” process as well, with no need to do something like manual token range splits. It also handles individual down nodes with little to no impact on queries. Unfortunately, these properties come with a trade-off: a complex and often nonintuitive consistency model that engineers and operators need to understand well.

In a distributed database like Cassandra, data is replicated across multiple nodes for durability and availability. 

Although databases generally allow multiple reads and writes to be submitted at once, they make it look to the outside world like all the operations are happening in order, one at a time. This property is known as serializability, and Cassandra is not serializable. Although it does have a “last write wins” system, there’s no transaction isolation and timestamps can be identical. 

It’s possible, for example, to have a row that has some columns from one write and other columns from another write. It’s safe if you’re only appending additional rows, but mutating existing rows safely requires careful design. Put another way, you can have two transactions with different data that, to the system, appear to have equal priority. 

How Do LWTs Solve This Problem?

As a solution for cases where stronger consistency is needed, Cassandra has a feature called “Lightweight Transactions” or LWTs. These are not really identical to traditional database transactions, but provide a sort of “compare and set” operation that also guarantees pending writes are completed before answering a read. This means if you’re trying to change a row’s value from “A” to “B”, a simultaneous attempt to change that row from “A” to “C” will return a failure. This is accomplished by doing a full—not at all lightweight—Paxos round complete with multiple round trips and slow expensive retries in the event of a conflict.
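To make the compare-and-set behavior concrete, here is a minimal sketch using the Python cassandra-driver. The keyspace, table, and column names are hypothetical examples, not our actual schema; the point is simply that the conditional update only applies if the row still holds the expected value.

from cassandra.cluster import Cluster

# Hypothetical keyspace/table; the conditional UPDATE below is the LWT.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("example_keyspace")

result = session.execute(
    "UPDATE settings SET value = 'B' WHERE id = %s IF value = 'A'",
    ("row-1",),
)

# For conditional statements, the driver exposes whether the write was applied.
if result.was_applied:
    print("Compare-and-set succeeded: A -> B")
else:
    print("Another writer changed the row first; the update was rejected")

cluster.shutdown()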

In Cassandra, the minimum consistency level for read and write operations is ONE, meaning that only a single replica needs to acknowledge the operation for it to be considered successful. This is fast, but in a situation where you have one down host, it could mean data loss, and later reads may or may not show the newest write depending on which replicas are involved and whether they’ve received the previous write. For better durability and consistency, Cassandra also provides various quorum levels that require a response from multiple replicas, as well as an ALL consistency that requires responses from every replica.

Cassandra Is My Type of Database

Curious to know more about consistency limitations and LWTs in Cassandra? Christopher Batey’s presentation at the 2016 Cassandra Summit does a good job of explaining the details.

The Problem We Found With LWTs During Datacenter Migrations

Usually we use one datacenter in Cassandra, but there are circumstances where we sometimes stand up a second datacenter in the cluster and migrate to it, then tear down the original. We typically do this either to change num_tokens, to move data when we’re refreshing hardware, or to physically move to another nearby data center.

The TL;DR

We reasoned through the interaction between LWTs/serial and datacenter migrations and found a hole—there’s no guarantee of LWT correctness during a topology change (that is, a change to the number of replicas) large enough to change the number of replicas needed to satisfy quorum. It turns out that combining LWTs and datacenter migrations can violate consistency guarantees in subtle ways without some specific steps and tools to work around it.

The Long Version

Let’s say you are standing up a new datacenter, and you need to copy an existing datacenter to it. So, you have two datacenters—datacenter A, the existing datacenter, and datacenter B, the new datacenter. For simplicity’s sake, let’s say datacenter A has three replicas you need to copy, and you’re using quorum writes to ensure consistency.

Refresher: What is Quorum-Based Consistency in Cassandra?

Quorum consistency in Cassandra is based on the concept that a specific number of replicas must participate in a read or write operation to ensure consistency and availability—a majority (n/2 +1) of the nodes must respond before considering the operation as successful. This ensures that the data is durably stored and available even if a minority of replicas are unavailable.
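As a quick sanity check on that majority math (the same numbers come up again below), here is the calculation in a couple of lines of Python:

# Quorum is a simple majority of replicas: floor(n/2) + 1.
def quorum(replicas: int) -> int:
    return replicas // 2 + 1

print(quorum(3))  # 2 -- a single datacenter with three replicas
print(quorum(6))  # 4 -- two datacenters with three replicas each (global quorum)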

You have different types of quorum you can choose from, and here’s how those defaults make a decision: 

  • Local quorum: Two out of the three replicas in the datacenter I’m talking to must respond in order to return success. I don’t care about the other datacenter.
  • Global quorum: Four out of the six total replicas must respond in order to return success, and it doesn’t matter which datacenter they come from.
  • Each quorum: Two out of the three replicas in each datacenter must respond in order to return success.

Most of these quorum types also have a serial equivalent for LWTs.

Type of Quorum    Serial          Regular
Local             LOCAL_SERIAL    LOCAL_QUORUM
Each              unsupported     EACH_QUORUM
Global            SERIAL          QUORUM

The problem you might run into, however, is that LWTs do not have an each_serial mode. They only have local and global. There’s no way to tell the LWT you want quorum in each datacenter. 

local_serial is good for performance, but transactions on different datacenters could overlap and be inconsistent. serial is more expensive, but normally guarantees correctness as long as all queries agree on cluster size. But what if a query straddles a topology change that changes quorum size? 

Let’s use global quorum to show how this plays out. If an LWT starts when RF=3, at least two hosts must process it. 

While it’s running, the topology changes to two datacenters (A and B) each with RF=3 (so six replicas total) with a quorum of four. There’s a chance that a query affecting the same partition could then run without overlapping nodes, which means consistency guarantees are not maintained for those queries. For that query, quorum is four out of six where those four could be the three replicas in datacenter B and the remaining replica in datacenter A. 

Those two queries are on the same partition, but they’re not overlapping any hosts, so they don’t know about each other. It violates the LWT guarantees.

The Solution to LWT Inconsistency

What we needed was a way to make sure that the definition of “quorum” didn’t change too much in the middle of a LWT running. Some change is okay, as long as old and new are guaranteed to overlap.

To account for this, you need to change the replication factor one level at a time, and make sure that no transactions that started before the previous topology change are still running before you make the next one. Three replicas with a quorum of two can only change to four replicas with a quorum of three. That way, at least one replica must overlap. The same holds when you go from four to five replicas or from five to six. This also applies when reducing the replication factor, such as when tearing down the old datacenter after everything has moved to the new one.
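The reason one step at a time is safe comes down to a pigeonhole argument: two quorums are guaranteed to share a replica whenever their sizes add up to more than the total number of replicas. A small sketch makes the safe and unsafe cases obvious:

# Two quorums must overlap whenever their sizes sum to more than the
# total replica count (pigeonhole principle).
def quorum(n: int) -> int:
    return n // 2 + 1

def overlap_guaranteed(old_total: int, new_total: int) -> bool:
    return quorum(old_total) + quorum(new_total) > new_total

for old, new in [(3, 4), (4, 5), (5, 6), (3, 6)]:
    print(f"{old} -> {new} replicas: overlap guaranteed = {overlap_guaranteed(old, new)}")

# 3 -> 4, 4 -> 5, and 5 -> 6 are all True; 3 -> 6 is False, which is the
# problem case described above.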

Then, you just need to make sure no LWT overlaps multiple changes. You could just wait long enough that they’ve timed out, but it’s better to be sure. This requires querying the internal-only system.paxos table on each host in the cluster between topology changes.

We built a tool that checks to see whether there are still transactions running from before we made a topology change. It reads system.paxos on each host, ignoring any rows with proposal_ballot=null, and records them. Then after a short delay, it re-reads system.paxos, ignoring any rows that weren’t present in the previous run, or any with proposal_ballot=null in either read, or any where in_progress_ballot has changed. Any remaining rows are potentially active transactions. 
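Here is a heavily simplified sketch of that two-pass scan, just to illustrate the idea. It is not the open source tool itself: it checks only a single host, hard codes the delay, and skips the empty-proposal parsing discussed below.

import time
from cassandra.cluster import Cluster

QUERY = "SELECT row_key, cf_id, proposal_ballot, in_progress_ballot FROM system.paxos"

def snapshot(session):
    # Record in-flight Paxos rows, ignoring anything with a null proposal_ballot.
    rows = {}
    for r in session.execute(QUERY):
        if r.proposal_ballot is not None:
            rows[(r.row_key, r.cf_id)] = r.in_progress_ballot
    return rows

# system.paxos is node-local, so the real tool repeats this for every host.
cluster = Cluster(["cassandra-host-1"])
session = cluster.connect()

first = snapshot(session)
time.sleep(30)  # short delay between reads
second = snapshot(session)

# A row is potentially an active LWT only if it appears in both passes
# with an unchanged in_progress_ballot.
pending = [key for key, ballot in second.items() if first.get(key) == ballot]
print(f"{len(pending)} potentially active LWTs on this host")

cluster.shutdown()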

This worked well the first few times that we used it, on 3.11.6. To our surprise, when we tried to migrate a cluster running 3.11.10 the tool reported hundreds of thousands of long-running LWTs. After a lot of digging, we found a small (but fortunately well-commented) performance optimization added as part of a correctness fix (CASSANDRA-12126), which means proposal_ballot does not get set to null if the proposal is empty/noop. To work around this, we had to actually parse the proposal field. Fortunately all we need is the is_empty flag in the third field, so there was no need to reimplement the full parsing code. It was a big impact on us from a seemingly small and innocuous change piggybacked onto a correctness fix, but that’s the risk of reading internal-only tables directly. 

We’ve used the tool several times now for migrations with good results, but it’s still relatively basic and limited. It requires running repeatedly until all transactions are complete, and sometimes manual intervention to deal with incomplete transactions. In some cases we’ve been able to force-commit a long-pending LWT by doing a SERIAL read of the partition affected, but in a couple of cases we actually ended up running across LWTs that still didn’t seem to complete. Fortunately in every case so far it was in a temporary table and a little work allowed us to confirm that we no longer needed the partition at all and could just delete it.

Most people who use Cassandra may never run across this problem, and most of those who do will likely never track down what caused the small mystery inconsistency around the time they did a datacenter migration. If you rely on LWTs and are doing a datacenter migration, we definitely recommend going through the extra steps to guarantee consistency until and unless Cassandra implements an EACH_SERIAL consistency level.

Using the Tool

If you want to use the tool for yourself to help maintain consistency through datacenter migrations, you can find it here. Drop a note in the comments to let us know how it’s working for you and if you think of any other ways around this problem—we’re all ears!

If You’ve Made It This Far

You might be interested in signing up for our Developer Newsletter where our resident Chief Technical Evangelist, Pat Patterson, shares the latest and greatest ways you can use B2 Cloud Storage in your applications.

The post New Open Source Tool for Consistency in Cassandra Migrations appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The Drive Stats of Backblaze Storage Pods

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/the-drive-stats-of-backblaze-storage-pods/

A decorative image showing the Backblaze logo on a cloud over a pattern representing a network.

Since 2009, Backblaze has written extensively about the data storage servers we created and deployed which we call Backblaze Storage Pods. We not only wrote about our Storage Pods, we open sourced the design, published a parts list, and even provided instructions on how to build one. Many people did. Of the six storage pod versions we produced, four of them are still in operation in our data centers today. Over the last few years, we began using storage servers from Dell and, more recently, Supermicro, as they have proven to be economically and operationally viable in our environment. 

Since 2013, we have also written extensively about our Drive Stats, sharing reports on the failure rates of the HDDs and SSDs in our legion of storage servers. We have examined the drive failure rates by manufacturer, size, age, and so on, but we have never analyzed the drive failure rates of the storage servers—until now. Let’s take a look at the Drive Stats for our fleet of storage servers and see what we can learn.

Storage Pods, Storage Servers, and Backblaze Vaults

Let’s start with a few definitions:

  • Storage Server: A storage server is our generic name for a server from any manufacturer which we use to store customer data. We use storage servers from Backblaze, Dell, and Supermicro.
  • Storage Pod: A Storage Pod is the name we gave to the storage servers Backblaze designed and had built for our data centers. The first Backblaze Storage Pod version was announced in September 2009. Subsequent versions are 2.0, 3.0, 4.0, 4.5, 5.0, 6.0, and 6.1. All but 6.1 were announced publicly. 
  • Backblaze Vault: A Backblaze Vault is 20 storage servers grouped together for the purpose of data storage. Uploaded data arrives at a given storage server within a Backblaze Vault and is encoded into 20 parts with a given part being either a data blob or parity. Each of the 20 parts (shards) is then stored on one of the 20 storage servers. 

As you review the charts and tables here are a few things to know about Backblaze Vaults.

  • There are currently six cohorts of storage servers in operation today: Supermicro, Dell, Backblaze 3.0, Backblaze 5.0, Backblaze 6.0, and Backblaze 6.1.
  • A given Vault will always be made up of one of the six cohorts of storage servers noted above. For example, Vault 1016 is made up of 20 Backblaze 5.0 Storage Pods and Vault 1176 is made up of 20 Supermicro servers. 
  • A given Vault is made up of storage servers that contain the same number of drives as follows:
    • Dell servers: 26 drives.
    • Backblaze 3.0 and Backblaze 5.0 servers: 45 drives.
    • Backblaze 6.0, Backblaze 6.1, and Supermicro servers: 60 drives.
  • All of the hard drives in a Backblaze Vault will be logically the same size; for example, all 16TB drives.

Drive Stats by Backblaze Vault Cohort

With the background out of the way, let’s get started. As of the end of Q3 2023, there were a total of 241 Backblaze Vaults divided into the six cohorts, as shown in the chart below. The chart includes the server cohort, the number of Vaults in the cohort, and the percentage that cohort is of the total number of Vaults.

A pie chart showing the types of Backblaze Vaults by percentage.

Vaults consisting of Backblaze servers still comprise 68% of the vaults in use today (shaded from orange to red), although that number is dropping as older Vaults are being replaced with newer server models, typically the Supermicro systems.

The table below shows the Drive Stats for the different Vault cohorts identified above for Q3 2023.

A chart showing the Drive Stats for Backblaze Vaults.

The Avg Age (months) column is the average age of the drives, not the average age of the Vaults. The two may seem to be related, but that’s not entirely the case. It is true the Backblaze 3.0 Vaults were deployed first followed in order by the 5.0 and 6.0 Vaults, but that’s where things get messy. There was some overlap between the Dell and Backblaze 6.1 deployments as the Dell systems were deployed in our central Europe data center, while the 6.1 Vaults continued to be deployed in the U.S. In addition, some migrations from the Backblaze 3.0 Vaults were initially done to 6.1 Vaults while we were also deploying new drives in the Supermicro Vaults. 

The AFR for each of the server versions does not seem to follow any pattern or correlation to the average age of the drives. This was unexpected because, in general, as drives pass about four years in age, they start to fail more often. This should mean that Vaults with older drives, especially those with drives whose average age is over four years (48 months), should have a higher failure rate. But, as we can see, the Backblaze 5.0 Vaults defy that expectation. 
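For reference, the AFR figures in these tables follow the usual Drive Stats calculation: drive failures divided by drive years (drive days divided by 365), expressed as a percentage. Here is a quick sketch with made-up numbers:

# Annualized failure rate: failures divided by drive years, as a percentage.
def afr(drive_days: int, failures: int) -> float:
    return failures / (drive_days / 365) * 100

# Hypothetical cohort: 1,000 drives observed for a 92-day quarter, 5 failures.
print(round(afr(1_000 * 92, 5), 2))  # roughly 1.98% AFR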

To see if we can determine what’s going on, let’s expand on the previous table and dig into the different drive sizes that are in each Vault cohort, as shown in the table below.

A table showing Drive Stats by server version and drive size.

Observations for Each Vault Cohort

  • Backblaze 3.0: Obviously these Vaults have the oldest drives and, given their AFR is nearly twice the average for all of the drives (1.53%), it would make sense to migrate off of these servers. Of course the 6TB drives seem to be the exception, but at some point they will most likely “hit the wall” and start failing.
  • Backblaze 5.0: There are two Backblaze 5.0 drive sizes (4TB and 8TB) and the AFR for each is well below the average AFR for all of the drives (1.53%). The average age of the two drive sizes is nearly seven years or more. When compared to the Backblaze 6.0 Vaults, it would seem that migrating the 5.0 Vaults could wait, but there is an operational consideration here. The Backblaze 5.0 Vaults each contain 45 drives, and from the perspective of data density per system, they should be migrated to 60 drive servers sooner rather than later to optimize data center rack space.
  • Backblaze 6.0: These Vaults as a group don’t seem to make any of the five different drive sizes happy. Only the AFR of the 4TB drives (1.42%) is just barely below the average AFR for all of the drives. The rest of the drive groups are well above the average.
  • Backblaze 6.1: The 6.1 servers are similar to the 6.0 servers, but with an upgraded CPU and faster NIC cards. Is that why their annualized failure rates are much lower than the 6.0 systems? Maybe, but the drives in the 6.1 systems are also much younger, about half the age of those in the 6.0 systems, so we don’t have the full picture yet.
  • Dell: The 14TB drives in the Dell Vaults seem to be a problem at a 5.46% AFR. Much of that is driven by two particular Dell vaults which have a high AFR, over 8% for Q3. This appears to be related to their location in the data center. All 40 of the Dell servers which make up these two Vaults were relocated to the top of 52U racks, and it appears that initially they did not like their new location. Recent data indicates they are doing much better, and we’ll publish that data soon. We’ll need to see what happens over the next few quarters. That said, if you remove these two Vaults from the Dell tally, the AFR is a respectable 0.99% for the remaining Vaults.
  • Supermicro: This server cohort is mostly 16TB drives which are doing very well with an AFR of 0.62%. The one 14TB Vault is worth our attention with an AFR of 1.95%, and the 22TB Vault is too new to do any analysis.

Drive Stats by Drive Size and Vault Cohort

Another way to look at the data is to take the previous table and re-sort it by drive size. Before we do that let’s establish the AFR for the different drive sizes aggregated over all Vaults.

A bar chart showing annualized failure rates for Backblaze Vaults by drive size.

As we can see in Q3 the 6TB and 22TB Vaults had zero failures (AFR = 0%). Also, the 10TB Vault is indeed only one Vault, so there are no other 10TB Vaults to compare it to. Given this, for readability, we will remove the 6TB, 10TB, and 22TB Vaults from the next table which compares how each drive size has fared in each of the six different Vault cohorts.

A table showing the annualized failure rates of servers by drive size and server version, not displaying the 6TB, 10TB, and 22TB Vaults.

Currently we are migrating the 4TB drive Vaults to larger Vaults, replacing them with drives of 16TB and above. The migrations are done using an in-house system which we’ll expand upon in a future post. The specific order of migrations is based on failure rates and durability of the existing 4TB Vaults with an eye towards removing the Backblaze 3.0 systems first as they are nearly 10 years old in some cases, and many of the non-drive replacement parts are no longer available. Whether we give away, destroy, or recycle the retired Backblaze 3.0 Storage Pods (sans drives) is still being debated.

For the 8TB drive Vaults, the Backblaze 5.0 Vaults are up first for migration when the time comes. Yes, their AFR is lower than that of the Backblaze 6.0 Vaults, but remember: the 5.0 Vaults are 45-drive units, which are not as efficient in storage density as the 60-drive systems. 

Speaking of systems with fewer than 60 drives, the Dell servers hold 26 drives each. Those 26 drives are in a 2U chassis versus a 4U chassis for all of the other servers. The Dell servers are not quite as dense as the 60-drive units, but their 2U form factor gives us some flexibility in filling racks, especially when you add utility servers (1U or 2U) and networking gear to the mix. That’s one of the reasons the two Dell Vaults we noted earlier were moved to the top of the 52U racks. FYI, those two Vaults hold 14TB drives and are two of the four 14TB Dell Vaults making up the 5.46% AFR. The AFR for the Dell Vaults with 12TB and 16TB drives is 0.76% and 0.92% respectively. As noted earlier, we expect the AFR for 14TB Dell Vaults to drop over the coming months.

What Have We Learned?

Our goal today was to see what we can learn about the drive failure rates of the storage servers we use in our data centers. All of our storage servers are grouped in operational systems we call Backblaze Vaults. There are six different cohorts of storage servers with each vault being composed of the same type of storage server, hence there are six types of vaults. 

As we dug into the data, we found that the different cohorts of Vaults had different annualized failure rates. What we didn’t find was a correlation between the age of the drives used in the servers and the annualized failure rates of the different Vault cohorts. For example, the Backblaze 5.0 Vaults have a much lower AFR of 0.99% versus the Backblaze 6.0 Vault AFR of 2.14%—even though the drives in the 5.0 Vaults are, on average, nearly twice as old as the drives in the 6.0 Vaults.

This suggests that while our initial foray into the annualized failure rates of the different Vault cohorts is a good first step, there is more to do here.

Where Do We Go From Here?

In general, all of the Vaults in a given cohort were manufactured to the same specifications, used the same parts, and were assembled using the same processes. One obvious difference is that different drive models are used in each Vault cohort. For example, the 16TB Vaults are composed of seven different drive models. Do some drive models work better in one Vault cohort versus another? Over the next couple of quarters we’ll dig into the data and let you know what we find. Hopefully it will add to our understanding of the annualized failure rates of the different Vault cohorts. Stay tuned.

The post The Drive Stats of Backblaze Storage Pods appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

NAS Life Expectancy: How to Extend NAS Longevity

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/nas-life-expectancy-how-to-extend-nas-longevity/

A decorative image showing a NAS device and a cloud in the background.

There’s only one truth you need to know about tech: at some point, it will fail. Hard drives die. You get the blue screen of death the day of the big presentation. You lose cell service right when your mom calls. (Or maybe you pretend to lose service right when your mom calls. We won’t judge.) Whatever it is, we’ve all been there. 

If you use network attached storage (NAS) for your business or at home, you’re probably well aware of this fact. The redundancy you get from a solid RAID configuration might be one of the reasons you invested in a NAS—to prepare for inevitable drive failures. NAS devices are a great investment for a business for their durability and reliability, but there are things you can do to extend the lifespan of your NAS and get even more out of your investment.

Extending the lifespan of a NAS system isn’t just about preventing hardware failures; it’s crucial for ensuring uninterrupted access to critical data while maximizing ROI on your IT investments. In this blog, you’ll learn about the key factors that influence NAS longevity and get real-world strategies to strengthen and optimize your NAS infrastructure. 

Understanding NAS Lifespan

Today’s NAS devices offer faster processing, enhanced performance, and significantly larger storage capabilities than ever before. These technological advancements have paved the way for efficient local data management using NAS in both professional and personal settings. 

A decorative image showing a NAS device, an office building, and the Backblaze logo on a cloud.

Despite these advancements, it’s important to acknowledge that NAS devices tend to have a finite lifespan. This limitation is due to several factors, including the physical wear and tear of hardware components, ever-evolving software requirements, and the constant advancement of technology which can render older systems less efficient or incompatible with new standards. 

Also, it’s crucial to differentiate between the life expectancy of a NAS device and that of the hard drives within the device. While the NAS itself may continue to function, the hard drives that are subjected to intense read/write operations often have a shorter lifespan. If you want to learn more about hard drive reliability and failure rates, refer to Backblaze’s Hard Drive Stats.

Key Factors Affecting NAS Longevity

From the quality of the hardware to the environment it operates in, multiple elements contribute to the lifespan of NAS. Let’s explore these key factors in detail:

1. Hardware Components: Quality and Durability

One of the key factors that most affects a NAS device’s longevity is the quality of the hardware itself. Components such as the processor, memory, and power supply also play crucial roles. High-quality hardware from reputable manufacturers tends to last longer and offer better performance and reliability, which contributes to the overall lifespan of the NAS.

2. Workload and Usage Intensity

The workload handled by the NAS is a significant determinant of its longevity. Devices that are constantly under heavy load, managing large data transfers, or running intensive applications will likely experience wear and tear more rapidly than those used for lighter tasks. 

3. Environmental Factors: Temperature, Humidity, and Corrosion

Operating a NAS in environments with high temperatures or humidity levels can lead to overheating and moisture-related damage. Additionally, locations with high levels of dust or corrosive elements can lead to physical deterioration of components.

4. Quality of Network Environment

The quality and stability of the network environment in which the NAS operates can also affect its lifespan. Frequent network issues or unstable connections can strain the NAS’s hardware and software, potentially leading to earlier failures. 

5. Support, Technological Advancements, and Compatibility

Ongoing support and compatibility with new technologies are also vital for the longevity of NAS systems. As technology evolves, older NAS devices may struggle with compatibility issues, rendering them less efficient or even obsolete.

Maintenance and Care for Enhanced Lifespan

Now that we understand the factors that affect NAS longevity, let’s explore how proactive maintenance and care are crucial to extending its lifespan. There are a number of things you can do to keep your NAS functioning properly for the long haul:

  • Regular Maintenance: Routine cleaning is vital for maintaining NAS efficiency. Dust accumulation can lead to overheating. Regularly cleaning the external vents and fans, and ensuring the device is in a well-ventilated area can prevent thermal issues and prolong the device’s life. 
  • Proactive Drive Replacements: Hard drives are among the most failure-prone components in a NAS. Implementing a regular schedule to check drive health and replacing unhealthy or borderline drives proactively can prevent data loss and reduce the workload on the NAS’s remaining drives, thus preserving its overall integrity. 
  • Updating Software and Patches: Keeping the NAS software and firmware up to date is essential for security and performance. Regular updates often include patches for vulnerabilities, performance enhancements, and new features that can improve the efficiency and longevity of the NAS. 
  • Monitoring NAS Health: Utilizing the tools and built-in functionalities to monitor the health and performance of your NAS also helps extend its lifespan. Many NAS systems come with software that can alert you to issues such as failing drives, high temperatures, or network problems. Keeping an eye on these metrics can help you address potential problems before they escalate. 
  • Environmental Considerations: The operating environment of a NAS plays a significant role in NAS longevity. Keeping your NAS in a stable environment with controlled temperature and humidity levels should be considered to extend its lifespan. 
  • Power Protection: Protect your NAS from power surges and outages using an uninterruptible power supply (UPS). This can not only prevent data loss but also help avoid any potential hardware damage caused by electrical issues. 
A product image of a CyberPower 900 AVR uninterruptible power supply tower.
The CyberPower 900 AVR is just one example of a UPS. Source.
  • Recognizing When to Replace the NAS: Look out for indicators that suggest it’s time to replace your NAS. These include the expiration of the manufacturer’s warranty, noticeable performance declines, increased frequency of repairs, or when the device no longer meets your evolving storage needs. Waiting until a complete failure can be more costly and disruptive. 

By adhering to these maintenance and care guidelines, you can significantly enhance the lifespan and reliability of your NAS, ensuring that it continues to serve as a robust and efficient data storage solution for your business or personal needs. 

Implementing Fault Tolerance and Off-Loading Data to Cloud

In addition to proactive maintenance and care, there are a few other strategies you can use to extend NAS lifespan such as implementing fault tolerance and adding cloud storage to your backup strategy to offload data from your NAS. Let’s explore them below. 

The Importance of Fault Tolerance and RAID Configurations in NAS

Fault tolerance refers to the ability of a NAS to continue operating correctly even if one or more of its hard drives fail. It’s critical for NAS systems to have fault tolerance in place, especially in business environments where data availability and integrity cannot be compromised. 

RAID (Redundant Array of Independent Disks) plays a pivotal role in achieving fault tolerance. It involves combining multiple hard drives into a single system to improve data redundancy and performance. By doing so, RAID protects against data loss due to single or multiple disk failures, depending on the RAID level implemented. 

An image showing a RAID 5 configuration.
RAID 5: Striping and parity distributed across disks.

RAID Configurations

Various RAID configurations offer different balances of data protection, storage efficiency, and performance. Common configurations include RAID 0 (striping), RAID 1 (mirroring), RAID 5 and RAID 6, each with its specific advantages and use cases. For example, RAID 1 is simple and offers data redundancy, while RAID 5 and 6 provide a good balance between storage capacity and data protection. To fully understand which RAID configuration suits your needs, explore our NAS RAID Levels blog which explains how to choose the right RAID level for your NAS data. 
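To give a feel for how parity-based levels like RAID 5 survive a failed disk, here is a toy sketch using XOR parity across three data blocks. Real arrays stripe data and rotate parity across all the disks, but the recovery principle is the same:

# Toy illustration of XOR parity, the idea behind RAID 5 recovery.
disk1 = bytes([0x0A, 0x17, 0x3C])
disk2 = bytes([0xF0, 0x42, 0x99])
disk3 = bytes([0x55, 0xAA, 0x01])

# The parity block is the XOR of the data blocks.
parity = bytes(a ^ b ^ c for a, b, c in zip(disk1, disk2, disk3))

# If disk2 fails, its contents can be rebuilt from the survivors plus parity.
rebuilt = bytes(a ^ c ^ p for a, c, p in zip(disk1, disk3, parity))
assert rebuilt == disk2
print("disk2 rebuilt:", rebuilt.hex())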

Off-Loading and Backing Up NAS Data to the Cloud

Off-loading data from your NAS to a cloud storage service can significantly reduce the workload on your NAS hardware, thus potentially extending its life. You can consider creating a cloud archive of old data and a system for off-loading data on a regular cadence or as projects close.  

The cloud also helps you establish a robust 3-2-1 backup strategy (three total copies of your data, two of which are local but on different mediums, and one copy off-site). This ensures data redundancy and offers enhanced data protection against local disasters, theft, or hardware failure. Many NAS devices, like Synology and QNAP, have built-in backup utilities that back up directly to cloud storage like Backblaze B2.

By integrating fault tolerance through RAID configurations and backing up data to the cloud, you can significantly enhance the data protection capabilities of your NAS. This approach not only ensures data safety but also contributes to the overall health and longevity of your NAS system.

Extending Your NAS Lifespan

Navigating the complexities of NAS systems can be a journey full of learning and adaptation. While NAS offers the unparalleled convenience of local storage, it’s essential to recognize that its longevity relies on more than just its initial setup. It requires a proactive approach with diligent maintenance while embracing technological advancements such as fault tolerance and RAID levels. When properly cared for, however, many users find NAS to be a long-lived piece of their tech stack.

But, it doesn’t have to stop there. The integration of cloud storage with NAS systems can significantly help reduce the strain on your local system, and safeguard your NAS data off-site while ensuring you have comprehensive 3-2-1 data protection in place. 

It’s time to hear from you. What strategies have you employed to extend the lifespan of your NAS device? Share your stories and tips in the comments below to help others in the NAS community.

The post NAS Life Expectancy: How to Extend NAS Longevity appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Network Stats

Post Syndicated from Brent Nowak original https://www.backblaze.com/blog/backblaze-network-stats/

A decorative image displaying the headline Welcome to Network Stats.

At the end of Q3 2023, Backblaze was monitoring 263,992 hard disk drives (HDDs) and solid state drives (SSDs) in our data centers around the world. We’ve reported on those drives for many years. But, all the data on those drives needs to somehow get from your devices to our storage servers. You might be wondering, “How does that happen?” and “Where does that happen?” Or, more likely, “How fast does that happen?”

These are all questions we want to start to answer.

Welcome to our new Backblaze Network Stats series, where we’ll explore the world of network connectivity and how we better serve our customers, partners, and the internet community at large. We hope to share our challenges, initiatives, and engineering perspective as we build the open cloud with our Partners.

In this first post, we will explore two issues: how we connect our network with the internet and the distribution of our peering traffic. As we expand this series, we hope to capture and share more metrics and insights.

Nice to Meet You; I’m Brent

Since this is the first time you’re hearing from me, I thought I should introduce myself. I’m a Senior Network Engineer here at Backblaze. The Network Engineering group is responsible for ensuring the reliability, capacity, and security of network traffic. 

My interest in computer networking began in my childhood, when I first persuaded my father to upgrade our analog modem to an ISDN line by providing a financial comparison of the time sunk into the large downloads I was conducting (nothing like using all the family dial-up time to download multi-megabyte SimCity 2000 and Doom customizations). Needless to say, I’m still interested in those same types of networking metrics, and that’s why I’m here sharing them with you at Backblaze.

First, Some Networking Basics

If you’ve ever heard folks joke about the internet being a series of tubes, well, it may be widely mocked, but it’s not entirely wrong. The internet as we know it is fundamentally a complex network of all the computers on the planet. Whenever we type an internet address into a web browser, we’re basically giving our computer the address of another computer (or server, etc.) to locate, and that computer will hopefully return the data it’s storing. 

Of course, it’s not a free-for-all. Internet protocols like TLS/SSL are the boundaries that set the rules for how computers communicate, and networks allow different levels of access to outsiders. Internet service providers (ISPs) are defined and regulated, and we’ll outline some of those roles and how Backblaze interacts with them below. But, all that communication between computers has to be powered by hardware, which is why, at one point, we actually had to solve the problem of sharks attacking the internet. Good news: since 2006, sharks have accounted for less than one percent of fiber optic cable attacks. 

Wireless internet has largely made this connectivity invisible to consumers, but the job of Wi-Fi is to broadcast a short-range network that connects you to this series of cables and “tubes.” That’s why when you’re transmitting or storing larger amounts of data, you typically get better speeds when you use a wired connection. (A good consumer example: setting up NAS devices works better with an ethernet cable.)

When you’re talking about storing and serving petabytes of data for a myriad of use cases, then you have to create and use different networks to connect to the internet effectively. Think of it like water: both a fire hose and your faucet are connected to utility lines, but they have to move different amounts of water, so they have different kinds of connections to the main utility.   

And, that brings us to peering, the different levels of internet service providers, and many, many more things that Backblaze Network Engineers deal with from both a hardware and a software perspective on a regular basis. 

What Is Peering?

Peering on the internet is akin to building direct express lanes between neighborhoods. Instead of all data (residents) relying on crowded highways (public internet), networks (neighborhoods) establish peering connections—dedicated pathways connecting them directly. This allows data to travel faster and more efficiently, reducing congestion and delays. Peering is like having exclusive lanes, streamlining communication between networks and enhancing the overall performance of the internet “transportation” system. 

We connect to various types of networks to help move your data. I’ll explain the different types below.

The Bit Exchange

Every day we move multiple petabytes of traffic between our internet connectivity points and our scalable data center fabric layer to be delivered to either our SSD caching layer (what we call a “shard stash”) or spinning hard drives for storage.

Our data centers are connected to the world in three different ways.

1. Direct Internet Access (DIA)

The most common way we reach everyone is via a DIA connection with a Tier 1 internet service provider. These connections give us access to long-haul, high-capacity fiber infrastructure that spans continents and oceans. Connecting to a Tier 1 ISP has the advantage of scale and reach, but this scale comes at a cost—we may not have the best path to our customers. 

If we draw out the hierarchy of networks that we have to traverse to reach you, it would look like a series of geographic levels (Global, Regional, and Local). The Tier 1 ISPs would be positioned at the top, leasing bandwidth on their networks to smaller Tier 2 and Tier 3 networks, which are closer to our customer’s home and office networks.

A chart showing an example of network and ISP reroutes between Backblaze and a customer.
How we get from B to C (Backblaze to customer).

Since our connections to the Tier 1 ISPs are based on leased bandwidth, we pay based on how much data we transfer. The bill grows the more we transfer. There are commitments and overage charges, and the relationship is more formal since a Tier 1 ISP is a for-profit company. Sometimes you just want unlimited bandwidth, and that’s where the role of the internet exchange (IX) helps us.

2. Internet Exchange (IX)

We always want to be as close to the client as possible and our next connectivity option allows us to join a community of peers that exchange traffic more locally. Peering with an IX means that network traffic doesn’t have to bubble up to a Tier 1 National ISP to eventually reach a regional network. If we are on an advantageous IX, we transfer data locally inside a data center or within the same data center campus, thus reducing latency and improving the overall experience.

Benefits of an IX, aka the “Unlimited Plan,” include:

  • Paying a flat rate per month to get a fiber connection to the IX equipment versus paying based on how much data we transfer.
  • No price negotiation based on bandwidth transfer rates.
  • No overage charges.
  • Connectivity to lower tiered networks that are closer to consumer and business networks.
  • Participation helps build a more egalitarian internet.

In short, we pay a small fee to help the IX remain financially stable, and then we can exchange as much or as little traffic as we want.

Our network connectivity standard is to connect to multiple Tier 1 ISPs and a localized IX at every location to give us the best of both solutions. Every time we have to traverse a network, we’re adding latency and increasing the total amount of time for a file to upload or download. Internet routing prefers the shortest path, so if we have a shorter (faster) way to reach you, we will talk over the IX versus the Tier 1 network.

A decorative image showing two possible paths to serve data from Backblaze to the customer.
Less is more—the fewer networks between us and you, the better.

3. Private Network Interconnect (PNI)

The most direct and lowest latency way for us to exchange traffic is with a PNI. This option is used for direct fiber connections within the same data center or metro region to some of our specific partners like Fastly and Cloudflare. Our edge routing equipment—that is, the appliances that allow us to connect our internal network to external networks—is connected directly to our partner’s edge routing equipment. To go back to our neighborhood analogy, this would be if you and your friend put a gate in the fences that connect your backyards. With a PNI, the logical routing distance between us and our partners is the best it can be. 

IX Participation

Personally, the internet exchange path is the most exciting for me as a network engineer. It harkens back to the days of the early internet (IX points began as Network Access Points and were a key component of Al Gore’s National Information Infrastructure (NII) plan way back in 1991), and the growth of an IX feels communal, as people are joining to help the greater whole. When we add our traffic to an IX as a new peer, it increases participation, further strengthening the advantage of contributing to the local IX and encouraging more organizations to join.

Backblaze Joins the Equinix Silicon Valley (SV1) Internet Exchange

Our San Jose data center is a major point of presence (PoP) (that is, a point where a network connects to the internet) for Backblaze, with the site connecting us in the Silicon Valley region to many major peering networks.

In November, we brought up connectivity to the Equinix IX peering exchange in San Jose, bringing us closer to 278 peering networks at the time of publishing. Many of the networks that participate on this IX are logically very close to our customers. The participants include some of the well-known ISPs that serve homes, offices, and businesses in the region, such as Comcast, Google Fiber, Sprint, and Verizon.

Now, for the Stats

As soon as we turned up the connection, 26% of the inbound traffic that was being sent to our DIA connections shifted to the local Equinix IX, as shown in the pie chart below.

Two side by side pie charts comparing traffic on the different types of network connections.
Before: 98% direct internet access (DIA); 2% private network interconnect (PNI). After: 72% DIA; 2% PNI; 26% internet exchange (IX).

The below graph shows our peering traffic load over the edge router and how immediately the traffic pattern changed as soon as we brought up the peer. Green indicates inbound traffic, while yellow shows outbound traffic. It’s always exciting to see a project go live with such an immediate reaction!

A graph showing networking uploads and downloads increasing as Backblaze brought networks up to peer.

To give you an idea of what we mean by better network proximity, let’s take a look at our improved connectivity to Google Fiber. Here’s a diagram of the three pathways that our edge routers see that show how to get to Google Fiber. With the new IX connection, we see a more advantageous path and pick that as our method to exchange traffic. We no longer have to send traffic to the Tier 1 providers and can use them as backup paths.

A graph showing possible network paths now that peering is enabled.
Taking faster local roads.

What Does This Mean for You?

We here at Backblaze are always trying to improve the performance and reliability of our storage platform while scaling up. We monitor our systems for inefficiencies, and improving the network components is one way that we can deliver a better experience. 

By joining the Equinix SV1 peering exchange, we reduce the number of network hops that we have to transit to communicate with you. And that reduces latency, speeding up your backup job uploads, enabling faster image downloads, and better supporting our Partners.

Cheers from the NetEng team! We’re excited to start this series and bring you more content as our solutions evolve and grow. Some of the coverage we hope to share in the future includes analyzing our proximity to our peers and Partners, how we can improve those connections further, and stats on the number of bits per second that we process in our data centers to ensure that we have not only each file, but also all of its related redundancy shards. So, stay tuned!

The post Backblaze Network Stats appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Run AI/ML Workloads on CoreWeave + Backblaze

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-run-ai-ml-workloads-on-coreweave-backblaze/

A decorative image showing the Backblaze and CoreWeave logos superimposed on clouds.

Backblaze compute partner CoreWeave is a specialized GPU cloud provider designed to power use cases such as AI/ML, graphics, and rendering up to 35x faster and for 80% less than generalized public clouds. Brandon Jacobs, an infrastructure architect at CoreWeave, joined us earlier this year for Backblaze Tech Day ‘23. Brandon and I co-presented a session explaining both how to backup CoreWeave Cloud storage volumes to Backblaze B2 Cloud Storage and how to load a model from Backblaze B2 into the CoreWeave Cloud inference stack.

Since we recently published an article covering the backup process, in this blog post I’ll focus on loading a large language model (LLM) directly from Backblaze B2 into CoreWeave Cloud.

Below is the session recording from Tech Day; feel free to watch it instead of, or in addition to, reading this article.

More About CoreWeave

In the Tech Day session, Brandon covered the two sides of CoreWeave Cloud: 

  1. Model training and fine tuning. 
  2. The inference service. 

To maximize performance, CoreWeave provides a fully-managed Kubernetes environment running on bare metal, with no hypervisors between your containers and the hardware.

CoreWeave provides a range of storage options: storage volumes that can be directly mounted into Kubernetes pods as block storage or a shared file system, running on solid state drives (SSDs) or hard disk drives (HDDs), as well as their own native S3 compatible object storage. Knowing that, you’re probably wondering, “Why bother with Backblaze B2, when CoreWeave has their own object storage?”

The answer echoes the first few words of this blog post—CoreWeave’s object storage is a specialized implementation, co-located with their GPU compute infrastructure, with high-bandwidth networking and caching. Backblaze B2, in contrast, is general purpose cloud object storage, and includes features such as Object Lock and lifecycle rules, that are not as relevant to CoreWeave’s object storage. There is also a price differential. Currently, at $6/TB/month, Backblaze B2 is one-fifth of the cost of CoreWeave’s object storage.

So, as Brandon and I explained in the session, CoreWeave’s native storage is a great choice for both the training and inference use cases, where you need the fastest possible access to data, while Backblaze B2 shines as longer term storage for training, model, and inference data as well as the destination for data output from the inference process. In addition, since Backblaze and CoreWeave are bandwidth partners, you can transfer data between our two clouds with no egress fees, freeing you from unpredictable data transfer costs.

Loading an LLM From Backblaze B2

To demonstrate how to load an archived model from Backblaze B2, I used CoreWeave’s GPT-2 sample. GPT-2 is an earlier version of the GPT-3.5 and GPT-4 LLMs used in ChatGPT. As such, it’s an accessible way to get started with LLMs, but, as you’ll see, it certainly doesn’t pass the Turing test!

This sample comprises two applications: a transformer and a predictor. The transformer implements a REST API, handling incoming prompt requests from client apps, encoding each prompt into a tensor, which the transformer passes to the predictor. The predictor applies the GPT-2 model to the input tensor, returning an output tensor to the transformer for decoding into text that is returned to the client app. The two applications have different hardware requirements—the predictor needs a GPU, while the transformer is satisfied with just a CPU, so they are configured as separate Kubernetes pods, and can be scaled up and down independently.

Since the GPT-2 sample includes instructions for loading data from Amazon S3, and Backblaze B2 features an S3 compatible API, it was a snap to modify the sample to load data from a Backblaze B2 Bucket. In fact, there was just a single line to change, in the s3-secret.yaml configuration file. The file is only 10 lines long, so here it is in its entirety:

apiVersion: v1
kind: Secret
metadata:
  name: s3-secret
  annotations:
     serving.kubeflow.org/s3-endpoint: s3.us-west-004.backblazeb2.com
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <my-backblaze-b2-application-key-id>
  AWS_SECRET_ACCESS_KEY: <my-backblaze-b2-application-key>

As you can see, all I had to do was set the serving.kubeflow.org/s3-endpoint metadata annotation to my Backblaze B2 Bucket’s endpoint and paste in an application key and its ID.

While that was the only Backblaze B2-specific edit, I did have to configure the bucket and path where my model was stored. Here’s an excerpt from gpt-s3-inferenceservice.yaml, which configures the inference service itself:

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: gpt-s3
  annotations:
    # Target concurrency of 4 active requests to each container
    autoscaling.knative.dev/target: "4"
    serving.kubeflow.org/gke-accelerator: Tesla_V100
spec:
  default:
    predictor:
      minReplicas: 0 # Allow scale to zero
      maxReplicas: 2 
      serviceAccountName: s3-sa # The B2 credentials are retrieved from the service account
      tensorflow:
        # B2 bucket and path where the model is stored
        storageUri: s3://<my-bucket>/model-storage/124M/
        runtimeVersion: "1.14.0-gpu"
        ...

Aside from the storageUri configuration, you can see how the predictor application’s pod is configured to scale between zero and two instances (“replicas” in Kubernetes terminology). The remainder of the file contains the transformer pod configuration, which allows it to scale between zero and one instance.

Running an LLM on CoreWeave Cloud

Spinning up the inference service involved a kubectl apply command for each configuration file and a short wait for the CoreWeave GPU cloud to bring up the compute and networking infrastructure. Once the predictor and transformer services were ready, I used curl to submit my first prompt to the transformer endpoint:

% curl -d '{"instances": ["That was easy"]}' http://gpt-s3-transformer-default.tenant-dead0a.knative.chi.coreweave.com/v1/models/gpt-s3:predict
{"predictions": ["That was easy for some people, it's just impossible for me,\" Davis said. \"I'm still trying to" ]}
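
If you’d rather call the endpoint from code than from the command line, here’s a minimal Python sketch using the requests library. It assumes the request and response shapes shown in the curl example above; the URL is the tenant-specific endpoint from that example, so you’d substitute your own.

import requests

# Tenant-specific transformer endpoint from the curl example above;
# substitute the URL for your own deployment.
ENDPOINT = "http://gpt-s3-transformer-default.tenant-dead0a.knative.chi.coreweave.com/v1/models/gpt-s3:predict"

def generate(prompt):
    # The transformer expects a JSON body containing a list of prompt strings
    response = requests.post(ENDPOINT, json={"instances": [prompt]}, timeout=60)
    response.raise_for_status()
    # The predictions list lines up one-to-one with the instances list
    return response.json()["predictions"][0]

print(generate("That was easy"))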

In the video, I repeated the exercise, feeding GPT-2’s response back into it as a prompt a few times to generate a few paragraphs of text. Here’s what it came up with:

“That was easy: If I had a friend who could take care of my dad for the rest of his life, I would’ve known. If I had a friend who could take care of my kid. He would’ve been better for him than if I had to rely on him for everything.

The problem is, no one is perfect. There are always more people to be around than we think. No one cares what anyone in those parts of Britain believes,

The other problem is that every decision the people we’re trying to help aren’t really theirs. If you have to choose what to do”

If you’ve used ChatGPT, you’ll recognize how far LLMs have come since GPT-2’s release in 2019!

Run Your Own Large Language Model

While CoreWeave’s GPT-2 sample is an excellent introduction to the world of LLMs, it’s a bit limited. If you’re looking to get deeper into generative AI, another sample, Fine-tune Large Language Models with CoreWeave Cloud, shows how to fine-tune a model from the more recent EleutherAI Pythia suite.

CoreWeave is a specialized GPU cloud designed to deliver best-in-class performance, up to 35x faster and 80% less expensive than generalized public clouds, which makes it a great choice for workloads such as AI, ML, rendering, and more. And, as you’ve seen in this blog post, it’s easy to integrate with Backblaze B2 Cloud Storage, with no data transfer costs. For more information, contact the CoreWeave team.

The post How to Run AI/ML Workloads on CoreWeave + Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Storage Tech of the Future: Ceramics, DNA, and More

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/storage-tech-of-the-future-ceramics-dna-and-more/

A decorative image showing a data drive with a health monitor indicator running through and behind it.

Two announcements had the Backblaze #social Slack channel blowing up this week, both related to “Storage Technologies of the Future.” The first reported “Video of Ceramic Storage System Surfaces Online” like some kind of UFO sighting. The second, somewhat more restrained announcement heralded the release of DNA storage cards available to the general public. Yep, you heard that right—coming to a Best Buy near you. (Not really. You absolutely have to special order these babies, but they ARE going to be for sale.)

We talked about DNA storage way back in 2015. It’s been nine years, so we thought it was high time to revisit the tech and dig into ceramics as well. (Pun intended.) 

What Is DNA Storage?

The idea is elegant, really. What is DNA if not an organic, naturally occurring form of code? 

DNA consists of four nucleotide bases: adenine (A), thymine (T), cytosine (C), and guanine (G). 

In DNA storage, information is encoded into sequences of these nucleotide bases. For example, A and C might represent 0, while T and G represent 1. This encoding allows digital data, such as text, images, or other types of information, to be translated into DNA sequences. Cool!
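
As a toy illustration of that mapping, here’s how you might translate bytes to and from a nucleotide string in Python. (This is only an illustration; real DNA storage systems use denser encodings and add error correction.)

import random

# Toy mapping from the text: A or C encodes 0, T or G encodes 1.
ZERO_BASES = "AC"
ONE_BASES = "TG"

def encode(data):
    """Encode bytes as a nucleotide string, one base per bit."""
    bases = []
    for byte in data:
        for i in range(7, -1, -1):  # most significant bit first
            bit = (byte >> i) & 1
            bases.append(random.choice(ONE_BASES if bit else ZERO_BASES))
    return "".join(bases)

def decode(sequence):
    """Recover the original bytes from the nucleotide string."""
    bits = ["1" if base in ONE_BASES else "0" for base in sequence]
    return bytes(int("".join(bits[i:i + 8]), 2) for i in range(0, len(bits), 8))

strand = encode(b"Hi")
print(strand)          # 16 bases for the two input bytes; exact letters vary
print(decode(strand))  # b'Hi'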

The appeal of DNA as a storage medium lies in its density and stability: it can store vast amounts of information in a very compact space. It also boasts remarkable durability, with the ability to preserve information for thousands of years under suitable conditions. I mean, leave it to Mother Nature to put our silly little hard drives to shame.

Back in 2015, we shared that the storage density of DNA was about 2.2 petabytes per gram. In 2017, a study out of Columbia University and the New York Genome Center put it at an incredible 215 petabytes per gram. For comparison’s sake, a WDC 22TB drive (WDC WUH722222ALE6L4) that we currently use in our data centers is 1.5 pounds or 680 grams, which nets out at 0.032TB/gram or 0.000032PB/gram.

Another major advantage is its sustainability. Estimated global data center electricity consumption in 2022 was 240–340 TWh, or around 1–1.3% of global final electricity demand. Current data storage technology uses rare earth metals, which are environmentally damaging to mine. Drives take up space, and they also create e-waste at the end of their lifecycle. It’s a challenge anyone who works in the data storage industry thinks about a lot. 

DNA storage, on the other hand, requires less energy. A 2023 study found that data can be written to a DNA movable-type storage system at normal operating temperatures, roughly 60–113°F, and that the result can be stored at room temperature. DNA molecules are also biodegradable and can be broken down naturally. 

The DNA data-writing process is chemical-based, and actually not the most environmentally friendly, but the DNA storage cards developed by Biomemory use a proprietary biosourced writing process, which they call “a significant advancement over existing chemical or enzymatic synthesis technologies.” So, there might be some trade-offs, but we’ll know more as the technology evolves. 

What’s the Catch?

Density? Check. Durability? Wow, yeah. Sustainability? You got it. But DNA storage is still a long way from sitting on your desk, storing your duplicate selfies. First, and we said this back in 2015 too, DNA takes a long time to read and write—DNA synthesis writes at a few hundred bytes per second. An average iPhone photo would take several hours to write to DNA. And to read it, you have to sequence the DNA—a time-intensive process. Both of those processes require specialized scientific equipment.

It’s also still too expensive. In 2015, we found a study that put 83 kilobytes of DNA storage at £1000 (about $1,500 U.S. dollars). In 2021, MIT estimated it would cost about $1 trillion to store one petabyte of data on DNA. For comparison, it costs $6,000 per month to store one petabyte in Backblaze B2 Cloud Storage ($6/TB/month). You could store that petabyte for a little over 13 million years before you’d hit $1 trillion.
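
To show our work on that figure, here’s the back-of-the-envelope arithmetic in Python, using the same $6/TB/month rate:

# How long could you store 1PB in Backblaze B2 before the bill
# reached MIT's $1 trillion estimate for DNA storage?
B2_RATE_PER_TB_MONTH = 6            # dollars
PETABYTE_IN_TB = 1_000
DNA_ESTIMATE = 1_000_000_000_000    # one trillion dollars

monthly_cost = B2_RATE_PER_TB_MONTH * PETABYTE_IN_TB  # $6,000 per month
months = DNA_ESTIMATE / monthly_cost                   # ~166.7 million months
print(f"${monthly_cost:,}/month for about {months / 12:,.1f} years")  # ~13.9 million years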

Today, Biomemory’s DNA storage cards ring in at a cool €1000 (about $1,080 U.S. dollars). And they can hold a whopping one kilobyte of data, or the equivalent of a short email. So, yeah… it’s, ahh, gotten even more expensive for the commercial product. 

The discrepancy between the MIT theoretical estimate and the cost of the Biomemory cards really speaks to the expense of bringing a technology like this to market. The theoretical cost per byte is a lot different than the operational cost, and the Biomemory cards are really meant to serve as proof of concept.  All that said, as the technology improves, one can only hope that it becomes more cost-effective in the future. Folks are experimenting with different encoding schemes to make writing and reading more efficient, as one example of an advance that could start to tip the balance.  

Finally, there’s just something a bit spooky about using synthetic DNA to store data. There’s a Black Mirror episode in there somewhere. Maybe one day we can upload kung fu skills directly into our brain domes and that would be cool, but for now, it’s still somewhat unsettling.

What Is Ceramic Storage?

Ceramic storage makes an old school approach new again, if you consider that the first stone tablets were kind of the precursor to today’s hard drives. Who’s up for storing some cuneiform?

Cerabyte, the company behind the “video that surfaced online,” is working on storage technology that uses ceramic and glass substrates in devices the size of a typical HDD that can store 10 petabytes of data. They use a glass base similar to Gorilla Glass by Corning topped with a layer of ceramic 300 micrometers thick that’s essentially etched with lasers. (Glass is used in many larger hard drives today, for what it’s worth. Hoya makes them, for example.) The startup debuted a fully operational prototype system using only commercial off-the-shelf equipment—pretty impressive. 

The prototype consists of a single read-write rack and several library racks. When you want to write data, the system moves one of the cartridges from the library to the read-write rack, where it is opened to expose and stage the ceramic substrate. Two million laser beamlets then punch nanoscale ones and zeros into the surface. Once the data is written, the read-write arm verifies it on the return motion to its original position. 

Cerabyte isn’t the only player in the game. Others like MDisc use similar technology. Currently, MDisc stores data on DVD-sized disks using a “rock-like” substrate. Several DVD player manufacturers have included the technology in players. 

Similar to DNA storage, ceramic storage boasts much higher density than current data storage tech—terabytes per square centimeter versus an HDD’s 0.02TB per square centimeter. Also like DNA storage, it’s more environmentally friendly. Ceramic and glass can be stored across a wide temperature range, from -460°F to 570°F, and it’s a natural material that will last millennia and eventually decompose. It’s also incredibly durable: Cerabyte claims it will last 5,000+ years, and with tons of clay pots still lying around from ancient times, that makes sense. 

One advantage it has over DNA storage, though, is speed. One laser pulse writes up to 2,000,000 bits, so data can be written at GBps speeds. 

What’s the Catch?

Ceramic also has density, sustainability, and speed to boot, but our biggest question is: who’s going to need that speed? There are only a handful of applications, like AI, that require that speed now. AI is certainly having a big moment, and it can only get bigger. So, presumably there’s a market, but only a small one that can justify the cost. 

One other biggie, at least for a cloud storage provider like us, though not necessarily for consumers or other enterprise users: it’s a write-once model. Once it’s on there, it’s on there. 

Finally, much like DNA tech, it’s probably (?) still too expensive to make it feasible for most data center applications. Cerabyte hasn’t released pricing yet. According to Blocks & Files, “The cost roadmap is expected to offer cost structures below projections of current commercial storage technologies.” But it’s still a big question mark.

Our Hot Take

Both of these technologies are really cool. They definitely got our storage geek brains fired up. But until they become scalable, operationally feasible, and cost-effective, you won’t see them in production—they’re still far enough out that they’re on the fiction end of the science fiction to science fact spectrum. And there are a couple roadblocks we see before they reach the ubiquity of your trusty hard drive. 

The first is making both technologies operational, not just theoretical in a lab. We’ll know more about both Biomemory’s and Cerabyte’s technologies as they roll out these initial proof of concept cards and prototype machines. And both have plans, naturally, for scaling the technologies to the data center. Whether they can or not remains to be seen. Lots of technologies have come and gone, falling victim to the challenges of production, scaling, and cost. 

The second is the attendant infrastructure needs. Getting 100x speed is great if the device is right next to you. But we’ll need similar leaps in physical networking infrastructure to transfer the data anywhere else. Until that catches up, the tech remains lab-bound. 

All that said, I still remember using floppy disks that held mere megabytes of data, and now you can put 20TB on a hard disk. So, I guess the question is, how long will it be before I can plug into the Matrix?

The post Storage Tech of the Future: Ceramics, DNA, and More appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloud 101: Data Egress Fees Explained

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/cloud-101-data-egress-fees-explained/

A decorative article showing a server, a cloud, and arrows pointing up and down with a dollar sign.

You can imagine data egress fees like tolls on a highway—your data is cruising along trying to get to its destination, but it has to pay a fee for the privilege of continuing its journey. If you have a lot of data to move, or a lot of toll booths (different cloud services) to move it through, those fees can add up quickly. 

Data egress fees are charges you incur for moving data out of a cloud service. They can be a big part of your cloud bill depending on how you use the cloud. And, they’re frequently a reason behind surprise AWS bills. So, let’s take a closer look at egress, egress fees, and ways you can reduce or eliminate them, so that your data can travel the cloud superhighways at will. 

What Is Data Egress?

In computing generally, data egress refers to the transfer or movement of data out of a given location, typically from within a network or system to an external destination. When it comes to cloud computing, egress generally means whenever data leaves the boundaries of a cloud provider’s network. 

In the simplest terms, data egress is the outbound flow of data.

A photo of a staircase with a sign that says "out" and an arrow pointing up.
The fees, like these stairs, climb higher. Source.

Egress vs. Ingress?

While egress pertains to data exiting a system, ingress refers to data entering a system. When you download something, you’re egressing data from a service. When you upload something, you’re ingressing data to that service. 

Unsurprisingly, most cloud storage providers do not charge you to ingress data—they want you to store your data on their platform, so why would they? 

Egress vs. Download

You might hear egress referred to as download, and that’s not wrong, but there are some nuances. Egress applies not only to downloads, but also when you migrate data between cloud services, for example. So, egress includes downloads, but it’s not limited to them. 

In the context of cloud service providers, the distinction between egress and download may not always be explicitly stated, and the terminology used can vary between providers. It’s essential to refer to the specific terms and pricing details provided by the service or platform you are using to understand how they classify and charge for data transfers.

How Do Egress Fees Work?

Data egress fees are charges incurred when data is transferred out of a cloud provider’s environment. These fees are often associated with cloud computing services, where users pay not only for the resources they consume within the cloud (such as storage and compute) but also for the data that is transferred from the cloud to external destinations.

There are a number of scenarios where a cloud provider typically charges egress: 

  • When you’re migrating data from one cloud to another.
  • When you’re downloading data from a cloud to a local repository.
  • When you move data between regions or zones with certain cloud providers. 
  • When an application, end user, or content delivery network (CDN) requests data from your cloud storage bucket. 

The fees can vary depending on the amount of data transferred and the destination of the data. For example, transferring data between regions within the same cloud provider’s network might incur lower fees than transferring data to the internet or to a different cloud provider.

Data egress fees are an important consideration for organizations using cloud services, and they can impact the overall cost of hosting and managing data in the cloud. It’s important to be aware of the pricing details related to data egress in the cloud provider’s pricing documentation, as these fees can contribute significantly to the total cost of using cloud services.

Why Do Cloud Providers Charge Egress Fees?

Both ingressing and egressing data costs cloud providers money. They have to build the physical infrastructure to allow users to do that, including switches, routers, fiber cables, etc. They also have to have enough of that infrastructure on hand to meet customer demand, not to mention staff to deploy and maintain it. 

However, it’s telling that most cloud providers don’t charge ingress fees, only egress fees. It would be hard to entice people to use your service if you charged them extra for uploading their data. But, once cloud providers have your data, they want you to keep it there. Charging you to remove it is one way cloud providers like AWS, Google Cloud, and Microsoft Azure do that. 

What Are AWS’s Egress Fees?

AWS S3 gives customers 100GB of data transfer out to the internet free each month, with some caveats—that 100GB excludes data stored in China and GovCloud. After that, the published rates for U.S. regions for data transferred over the public internet are as follows as of the date of publication (we’ll turn these tiers into a quick cost calculation after the list):

  • The first 10TB per month is $0.09 per GB.
  • The next 40TB per month is $0.085 per GB.
  • The next 100TB per month is $0.07 per GB.
  • Anything greater than 150TB per month is $0.05 per GB. 
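
Here’s a quick sketch of that tiered math in Python. It’s only an approximation: it treats 1TB as 1,024GB, ignores the free 100GB allowance, and uses the published U.S. region rates listed above, so your actual bill may differ.

# Published AWS S3 internet egress tiers for U.S. regions (rate per GB),
# expressed in GB with 1TB treated as 1,024GB for simplicity.
TIERS = [
    (10_240, 0.09),        # first 10TB
    (40_960, 0.085),       # next 40TB
    (102_400, 0.07),       # next 100TB
    (float("inf"), 0.05),  # anything over 150TB
]

def egress_cost(gb_transferred):
    cost, remaining = 0.0, gb_transferred
    for tier_size, rate in TIERS:
        in_tier = min(remaining, tier_size)
        cost += in_tier * rate
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost

# Example: egressing 50TB (51,200GB) in a month comes to roughly $4,403
print(f"${egress_cost(51_200):,.2f}")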

But AWS also charges customers egress between certain services and regions, and it can get complicated quickly as the following diagram shows…

Source.

How Can I Reduce Egress Fees?

If you’re using cloud services, minimizing your egress fees is probably a high priority. Companies like the Duckbill Group (the creators of the diagram above) exist to help businesses manage their AWS bills. In fact, there’s a whole industry of consultants that focuses solely on reducing your AWS bills. 

Aside from hiring a consultant to help you spend less, there are a few simple ways to lower your egress fees:

  1. Use a content delivery network (CDN): If you’re hosting an application, using a CDN can lower your egress fees since a CDN will cache data on edge servers. That way, when a user sends a request for your data, it can pull it from the CDN server rather than your cloud storage provider where you would be charged egress. 
  2. Optimize data transfer protocols: Choose efficient data transfer protocols that minimize the amount of data transmitted. For example, consider using compression or delta encoding techniques to reduce the size of transferred files. Compressing data before transfer can reduce the volume of data sent over the network, leading to lower egress costs. However, the effectiveness of compression depends on the nature of the data (a compress-before-upload sketch follows this list).
  3. Utilize integrated cloud providers: Some cloud providers offer free data transfer with a range of other cloud partners. (Hint: that’s what we do here at Backblaze!)
  4. Be aware of tiering: It may sound enticing to opt for a cold(er) storage tier to save on storage, but some of those tiers come with much higher egress fees. 
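
Here’s a minimal sketch of that compress-before-upload idea using Python, gzip, and boto3 against Backblaze B2’s S3 compatible endpoint. The bucket name, file names, and credentials are placeholders, and compression ratios (and therefore savings) depend entirely on your data.

import gzip
import shutil
import boto3

# Placeholders: substitute your own region endpoint, key ID, key, and bucket.
s3_client = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",
    aws_access_key_id="<application-key-id>",
    aws_secret_access_key="<application-key>",
)

# Compress the file locally so any later egress moves fewer bytes
with open("report.csv", "rb") as src, gzip.open("report.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

s3_client.upload_file("report.csv.gz", "my-bucket-name", "reports/report.csv.gz")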

How Does Backblaze Reduce Egress Fees?

There’s one more way you can drastically reduce egress, and we’ll just come right out and say it: Backblaze gives you free egress up to 3x the average monthly storage and unlimited free egress through a number of CDN and compute partners, including Fastly, Cloudflare, Bunny.net, and Vultr.

Why do we offer free egress? Supporting an open cloud environment is central to our mission, so we expanded free egress to all customers so they can move data when and where they prefer. Cloud providers like AWS and others charge high egress fees that make it expensive for customers to use multi-cloud infrastructures and therefore lock in customers to their services. These walled gardens hamper innovation and long-term growth.

Free Egress = A Better, Multi-Cloud World

The bottom line: the high egress fees charged by hyperscalers like AWS, Google, and Microsoft are a direct impediment to a multi-cloud future driven by customer choice and industry need. And, a multi-cloud future is something we believe in. So go forth and build the multi-cloud future of your dreams, and leave worries about high egress fees in the past. 

The post Cloud 101: Data Egress Fees Explained appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Digging Deeper Into Object Lock

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/digging-deeper-into-object-lock/

A decorative image showing data inside of a vault.

Using Object Lock for your data is a smart choice—you can protect your data from ransomware, meet compliance requirements, beef up your security policy, or preserve data for legal reasons. But, it’s not a simple on/off switch, and accidentally locking your data for 100 years is a mistake you definitely don’t want to make.

Today we’re taking a deeper dive into Object Lock and the related legal hold feature, examining the different levels of control that are available, explaining why developers might want to build Object Lock into their own applications, and showing exactly how to do that. While the code samples are aimed at our developer audience, anyone looking for a deeper understanding of Object Lock should be able to follow along.

I presented a webinar on this topic earlier this year that covers much the same ground as this blog post, so feel free to watch it instead of, or in addition to, reading this article. 

Check Out the Docs

For even more information on Object Lock, check out our Object Lock overview in our Technical Documentation Portal, as well as our how-to guides on enabling Object Lock using the Backblaze web UI, the Backblaze B2 Native API, and the Backblaze S3 Compatible API.

What Is Object Lock?

In the simplest explanation, Object Lock is a way to lock objects (aka files) stored in Backblaze B2 so that they are immutable—that is, they cannot be deleted or modified, for a given period of time, even by the user account that set the Object Lock rule. Backblaze B2’s implementation of Object Lock was originally known as File Lock, and you may encounter the older terminology in some documentation and articles. For consistency, I’ll use the term “object” in this blog post, but in this context it has exactly the same meaning as “file.”

Object Lock is widely supported by backup applications such as Veeam and MSP360, allowing organizations to ensure that their backups are not vulnerable to deliberate or accidental deletion or modification for some configurable retention period.

Ransomware mitigation is a common motivation for protecting data with Object Lock. Even if an attacker were to compromise an organization’s systems to the extent of accessing the application keys used to manage data in Backblaze B2, they would not be able to delete or change any locked data. Similarly, Object Lock guards against insider threats, where the attacker may try to abuse legitimate access to application credentials.

Object Lock is also used in industries that store sensitive or personally identifiable information (PII), such as banking, education, and healthcare. Because they work with such sensitive data, regulatory requirements dictate that data be retained for a given period of time, but data must also be deleted in particular circumstances. 

For example, the General Data Protection Regulation (GDPR), an important component of the EU’s privacy laws and an international regulatory standard that drives best practices, may dictate that some data must be deleted when a customer closes their account. A related use case is where data must be preserved due to litigation, where the period for which data must be locked is not fixed and depends on the type of lawsuit at hand. 

To handle these requirements, Backblaze B2 offers two Object Lock modes—compliance and governance—as well as the legal hold feature. Let’s take a look at the differences between them.

Compliance Mode: Near-Absolute Immutability

When objects are locked in compliance mode, not only can they not be deleted or modified while the lock is in place, but the lock also cannot be removed during the specified retention period. It is not possible to remove or override the compliance lock to delete locked data until the lock expires, whether you’re attempting to do so via the Backblaze web UI or either of the S3 Compatible or B2 Native APIs. Similarly, Backblaze Support is unable to unlock or delete data locked under compliance mode in response to a support request, which is a safeguard designed to address social engineering attacks where an attacker impersonates a legitimate user.

What if you inadvertently lock many terabytes of data for several years? Are you on the hook for thousands of dollars of storage costs? Thankfully, no—you have one escape route, which is to close your Backblaze account. Closing the account is a multi-step process that requires access to both the account login credentials and two-factor verification (if it is configured) and results in the deletion of all data in that account, locked or unlocked. This is a drastic step, so we recommend that developers create one or more “burner” Backblaze accounts for use in developing and testing applications that use Object Lock, that can be closed if necessary without disrupting production systems.

There is one lock-related operation you can perform on compliance-locked objects: extending the retention period. In fact, you can keep extending the retention period on locked data any number of times, protecting that data from deletion until you let the compliance lock expire.

Governance Mode: Override Permitted

In our other Object Lock option, objects can be locked in governance mode for a given retention period. But, in contrast to compliance mode, the governance lock can be removed or overridden via an API call, if you have an application key with appropriate capabilities. Governance mode handles use cases that require retention of data for some fixed period of time, with exceptions for particular circumstances.

When I’m trying to remember the difference between compliance and governance mode, I think of the phrase, “Twenty seconds to comply!”, uttered by the ED-209 armed robot in the movie “RoboCop.” It turned out that there was no way to override ED-209’s programming, with dramatic, and fatal, consequences.

ED-209: as implacable as compliance mode.

Legal Hold: Flexible Preservation

While the compliance and governance retention modes lock objects for a given retention period, legal hold is more like a toggle switch: you can turn it on and off at any time, again with an application key with sufficient capabilities. As its name suggests, legal hold is ideal for situations where data must be preserved for an unpredictable period of time, such as while litigation is proceeding.

The compliance and governance modes are mutually exclusive, which is to say that only one may be in operation at any time. Objects locked in governance mode can be switched to compliance mode, but, as you might expect from the above explanation, objects locked in compliance mode cannot be switched to governance mode until the compliance lock expires.

Legal hold, on the other hand, operates independently, and can be enabled and disabled regardless of whether an object is locked in compliance or governance mode.

How does this work? Consider an object that is locked in compliance or governance mode and has legal hold enabled:

  • If the legal hold is removed, the object remains locked until the retention period expires.
  • If the retention period expires, the object remains locked until the legal hold is removed.

Object Lock and Versioning

By default, Backblaze B2 Buckets have versioning enabled, so as you upload successive objects with the same name, previous versions are preserved automatically. None of the Object Lock modes prevent you from uploading a new version of a locked object; the lock is specific to the object version to which it was applied.

You can also hide a locked object so it doesn’t appear in object listings. The hidden version is retained and can be revealed using the Backblaze web UI or an API call.

As you might expect, locked object versions are not subject to deletion by lifecycle rules—any attempt to delete a locked object version via a lifecycle rule will fail.

How to Use Object Lock in Applications

Now that you understand the two modes of Object Lock, plus legal hold, and how they all work with object versions, let’s look at how you can take advantage of this functionality in your applications. I’ll include code samples for Backblaze B2’s S3 Compatible API written in Python, using the AWS SDK, aka Boto3, in this blog post. You can find details on working with Backblaze B2’s Native API in the documentation.

Application Key Capabilities for Object Lock

Every application key you create for Backblaze B2 has an associated set of capabilities; each capability allows access to a specific functionality in Backblaze B2. There are seven capabilities relevant to Object Lock and legal hold. 

Two capabilities relate to bucket settings:

  1. readBucketRetentions 
  2. writeBucketRetentions

Three capabilities relate to object settings for retention: 

  1. readFileRetentions 
  2. writeFileRetentions 
  3. bypassGovernance

And, two are specific to legal hold: 

  1. readFileLegalHolds 
  2. writeFileLegalHolds 

The Backblaze B2 documentation contains full details of each capability and the API calls it relates to for both the S3 Compatible API and the B2 Native API.

When you create an application key via the web UI, it is assigned capabilities according to whether you allow it access to all buckets or just a single bucket, and whether you assign it read-write, read-only, or write-only access.

An application key created in the web UI with read-write access to all buckets will receive all of the above capabilities. A key with read-only access to all buckets will receive readBucketRetentions, readFileRetentions, and readFileLegalHolds. Finally, a key with write-only access to all buckets will receive bypassGovernance, writeBucketRetentions, writeFileRetentions, and writeFileLegalHolds.

In contrast, an application key created in the web UI restricted to a single bucket is not assigned any of the above permissions. When an application using such a key uploads objects to its associated bucket, they receive the default retention mode and period for the bucket, if they have been set. The application is not able to select a different retention mode or period when uploading an object, change the retention settings on an existing object, or bypass governance when deleting an object.

You may want to create application keys with more granular permissions when working with Object Lock and/or legal hold. For example, you may need an application restricted to a single bucket to be able to toggle legal hold for objects in that bucket. You can use the Backblaze B2 CLI to create an application key with this, or any other set of capabilities. This command, for example, creates a key with the default set of capabilities for read-write access to a single bucket, plus the ability to read and write the legal hold setting:

% b2 create-key --bucket my-bucket-name my-key-name listBuckets,readBuckets,listFiles,readFiles,shareFiles,writeFiles,deleteFiles,readBucketEncryption,writeBucketEncryption,readBucketReplications,writeBucketReplications,readFileLegalHolds,writeFileLegalHolds

Enabling Object Lock

You must enable Object Lock on a bucket before you can lock any objects therein; you can do this when you create the bucket, or at any time later, but you cannot disable Object Lock on a bucket once it has been enabled. Here’s how you create a bucket with Object Lock enabled:

s3_client.create_bucket(
    Bucket='my-bucket-name',
    ObjectLockEnabledForBucket=True
)

Once a bucket’s settings have Object Lock enabled, you can configure a default retention mode and period for objects that are created in that bucket. Only compliance mode is configurable from the web UI, but you can set governance mode as the default via an API call, like this:

s3_client.put_object_lock_configuration(
    Bucket='my-bucket-name',
    ObjectLockConfiguration={
        'ObjectLockEnabled': 'Enabled',
        'Rule': {
            'DefaultRetention': {
                'Mode': 'GOVERNANCE',
                'Days': 7
            }
        }
    }
)

You cannot set legal hold as a default configuration for the bucket.
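
To double-check the defaults a bucket ended up with, you can read the configuration back. Here’s a minimal sketch, assuming the same boto3 s3_client as in the examples above:

response = s3_client.get_object_lock_configuration(Bucket='my-bucket-name')

# For the bucket configured above, this prints the default governance rule;
# a bucket with Object Lock enabled but no default rule returns no 'Rule' key.
print(response['ObjectLockConfiguration'])
# {'ObjectLockEnabled': 'Enabled', 'Rule': {'DefaultRetention': {'Mode': 'GOVERNANCE', 'Days': 7}}}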

Locking Objects

Regardless of whether you set a default retention mode for the bucket, you can explicitly set a retention mode and period when you upload objects, or apply the same settings to existing objects, provided you use an application key with the appropriate writeFileRetentions or writeFileLegalHolds capability.

Both the S3 PutObject operation and Backblaze B2’s b2_upload_file include optional parameters for specifying retention mode and period, and/or legal hold. For example:

s3_client.put_object(
    Body=open('/path/to/local/file', mode='rb'),
    Bucket='my-bucket-name',
    Key='my-object-name',
    ObjectLockMode='GOVERNANCE',
    ObjectLockRetainUntilDate=datetime(
        2023, 9, 7, hour=10, minute=30, second=0
    )
)

Both APIs implement additional operations to get and set retention settings and legal hold for existing objects. Here’s an example of how you apply a governance mode lock:

s3_client.put_object_retention(
    Bucket='my-bucket-name',
    Key='my-object-name',
    VersionId='some-version-id',
    Retention={
        'Mode': 'GOVERNANCE',  # Required, even if mode is not changed
        'RetainUntilDate': datetime(
            2023, 9, 5, hour=10, minute=30, second=0
        )
    }
)

The VersionId parameter is optional: the operation applies to the current object version if it is omitted.
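
You can also read an object’s retention settings back programmatically, which is handy for auditing. A minimal sketch, again assuming the same s3_client:

response = s3_client.get_object_retention(
    Bucket='my-bucket-name',
    Key='my-object-name'
    # VersionId='some-version-id'  # optional; defaults to the current version
)

# Prints the mode and retain-until date applied above, for example:
# {'Mode': 'GOVERNANCE', 'RetainUntilDate': datetime.datetime(2023, 9, 5, 10, 30)}
print(response['Retention'])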

You can also use the web UI to view, but not change, an object’s retention settings, and to toggle legal hold for an object:

A screenshot highlighting where to enable Object Lock via the Backblaze web UI.

Deleting Objects in Governance Mode

As mentioned above, a key difference between the compliance and governance modes is that it is possible to override governance mode to delete an object, given an application key with the bypassGovernance capability. To do so, you must identify the specific object version, and pass a flag to indicate that you are bypassing the governance retention restriction:

# Get object details, including version id of current version
object_info = s3_client.head_object(
    Bucket='my-bucket-name',
    Key='my-object-name'
)

# Delete the most recent object version, bypassing governance
s3_client.delete_object(
    Bucket='my-bucket-name',
    Key='my-object-name',
    VersionId=object_info['VersionId'],
    BypassGovernanceRetention=True
)

There is no way to delete an object in legal hold; the legal hold must be removed before the object can be deleted.
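
Checking and releasing a legal hold is a single call each, given an application key with the readFileLegalHolds and writeFileLegalHolds capabilities. Here’s a minimal sketch using the same s3_client:

# Check whether a legal hold is currently in place on the current version
status = s3_client.get_object_legal_hold(
    Bucket='my-bucket-name',
    Key='my-object-name'
)['LegalHold']['Status']  # 'ON' or 'OFF'

# Release the hold; the object can then be deleted once any retention expires
if status == 'ON':
    s3_client.put_object_legal_hold(
        Bucket='my-bucket-name',
        Key='my-object-name',
        LegalHold={'Status': 'OFF'}
    )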

Protect Your Data With Object Lock and Legal Hold

Object Lock is a powerful feature, and with great power… you know the rest. Here are some of the questions you should ask when deciding whether to implement Object Lock in your applications:

  • What would be the impact of malicious or accidental deletion of your application’s data?
  • Should you lock all data according to a central policy, or allow users to decide whether to lock their data, and for how long?
  • If you are storing data on behalf of users, are there special circumstances where a lock must be overridden?
  • Which users should be permitted to set and remove a legal hold? Does it make sense to build this into the application rather than have an administrator use a tool such as the Backblaze B2 CLI to manage legal holds?

If you already have a Backblaze B2 account, you can start working with Object Lock today; otherwise, create an account to get started.

The post Digging Deeper Into Object Lock appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

What Is Hybrid Cloud?

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/confused-about-the-hybrid-cloud-youre-not-alone/

An illustration of clouds computers and servers.
Editor’s note: This post has been updated since it was originally published in 2017.

The term hybrid cloud has been around for a while—we originally published this explainer in 2017. But time hasn’t necessarily made things clearer. Maybe you hear folks talk about your company’s hybrid cloud approach, but what does that really mean? If you’re confused about the hybrid cloud, you’re not alone. 

Hybrid cloud is a computing approach that uses both private and public cloud resources with some kind of orchestration between them. The term has been applied to a wide variety of IT solutions, so it’s no wonder the concept breeds confusion. 

In this post, we’ll explain what a hybrid cloud is, how it can benefit your business, and how to choose a cloud storage provider for your hybrid cloud strategy.

What Is the Hybrid Cloud?

A hybrid cloud is an infrastructure approach that uses both private and public resources. Let’s first break down those key terms:

  • Public cloud: When you use a public cloud, you are storing your data in another company’s internet-accessible data center. A public cloud service allows anybody to sign up for an account, and share data center resources with other customers or tenants. Instead of worrying about the costs and complexity of operating an on-premises data center, a cloud storage user only needs to pay for the cloud storage they need.
  • Private cloud: In contrast, a private cloud is specifically designed for a single tenant. Think of a private cloud as a permanently reserved private dining room at a restaurant—no other customer can use that space. As a result, private cloud services can be more expensive than public clouds. Traditionally, private clouds lived on on-premises infrastructure, meaning they were built and maintained on company property. Now, private clouds can be maintained and managed on-premises by an organization or by a third party in a data center. The key defining factor is that the cloud is dedicated to a single tenant or organization.

Those terms are important to know to understand the hybrid cloud architecture approach. Hybrid clouds are defined by a combined management approach, which means there is some type of orchestration between the private and public environments that allows workloads and data to move between them in a flexible way as demands, needs, and costs change. This gives you flexibility when it comes to data deployment and usage.  

In other words, if you have some IT resources on-premises that you are replicating or sharing with an external vendor—congratulations, you have a hybrid cloud!

Hybrid cloud refers to a computing architecture that is made up of both private cloud resources and public cloud resources with some kind of orchestration between them.

Hybrid Cloud Examples

Here are a few examples of how a hybrid cloud can be used:

  1. As an active archive: You might establish a protocol that says all accounting files that have not been changed in the last year, for example, are automatically moved off-premises to cloud storage archive to save cost and reduce the amount of storage needed on-site. You can still access the files; they are just no longer stored on your local systems. 
  2. To meet compliance requirements: Let’s say some of your data is subject to strict data privacy requirements, but other data you manage isn’t as closely protected. You could keep highly regulated data on premises in a private cloud and the rest of your data in a public cloud. 
  3. To scale capacity: If you’re in an industry that experiences seasonal or frequent spikes like retail or ecommerce, these spikes can be handled by a public cloud which provides the elasticity to deal with times when your data needs exceed your on-premises capacity.
  4. For digital transformation: A hybrid cloud lets you adopt cloud resources in a phased approach as you expand your cloud presence.

Hybrid Cloud vs. Multi-cloud: What’s the Diff?

You wouldn’t be the first person to think that the terms multi-cloud and hybrid cloud appear similar. Both of these approaches involve using multiple clouds. However, multi-cloud combines two or more clouds of the same type (e.g., two or more public clouds), while hybrid cloud combines a private cloud with a public cloud. One cloud approach is not necessarily better than the other—they simply serve different use cases. 

For example, let’s say you’ve already invested in significant on-premises IT infrastructure, but you want to take advantage of the scalability of the cloud. A hybrid cloud solution may be a good fit for you. 

Alternatively, a multi-cloud approach may work best for you if you are already in the cloud and want to mitigate the risk of a single cloud provider having outages or issues. 

Hybrid Cloud Benefits

A hybrid cloud approach allows you to take advantage of the best elements of both private and public clouds. The primary benefits are flexibility, scalability, and cost savings.

Benefit 1: Flexibility and Scalability

One of the top benefits of the hybrid cloud is its flexibility. Managing IT infrastructure on-premises can be time consuming and expensive, and adding capacity requires advance planning, procurement, and upfront investment.

The public cloud is readily accessible and able to provide IT resources whenever needed on short notice. For example, the term “cloud bursting” refers to the on-demand and temporary use of the public cloud when demand exceeds resources available in the private cloud. A private cloud, on the other hand, provides the absolute fastest access speeds since it is generally located on-premises. (But cloud providers are catching up fast, for what it’s worth.) For data that is needed with the absolute lowest levels of latency, it may make sense for the organization to use a private cloud for current projects and store an active archive in a less expensive, public cloud.

Benefit 2: Cost Savings

Within the hybrid cloud framework, the public cloud segment offers cost-effective IT resources, eliminating the need for upfront capital expenses and associated labor costs. IT professionals gain the flexibility to optimize configurations, choose the most suitable service provider, and determine the optimal location for each workload. This strategic approach reduces costs by aligning resources with specific tasks. Furthermore, the ability to easily scale, redeploy, or downsize services enhances efficiency, curbing unnecessary expenses and contributing to overall cost savings.

Comparing Private vs. Hybrid Cloud Storage Costs

To understand the difference in storage costs between a purely on-premises solution and a hybrid cloud solution, we’ll present two scenarios. For each scenario, we’ll use data storage amounts of 100TB, 1PB, and 2PB. Each table uses the same format; all we’ve done is change how the data is distributed: private (on-premises) or public (off-premises). We are using the costs for our own Backblaze B2 Cloud Storage in this example. The math can be adapted for any set of numbers you wish to use.

Scenario 1: 100% of data stored on-premises

  Data stored on-premises (100%)              100TB     1,000TB    2,000TB

  Monthly cost, low ($12/TB/month)            $1,200    $12,000    $24,000
  Monthly cost, high ($20/TB/month)           $2,000    $20,000    $40,000

Scenario 2: 20% of data on-premises, 80% in public cloud storage (Backblaze B2)

  Data stored on-premises (20%)               20TB      200TB      400TB
  Data stored in the cloud (80%)              80TB      800TB      1,600TB

  On-premises cost, low ($12/TB/month)        $240      $2,400     $4,800
  On-premises cost, high ($20/TB/month)       $400      $4,000     $8,000

  Public cloud cost, low ($6/TB/month, B2)    $480      $4,800     $9,600
  Public cloud cost, high ($20/TB/month)      $1,600    $16,000    $32,000

  Combined monthly cost, low                  $720      $7,200     $14,400
  Combined monthly cost, high                 $2,000    $20,000    $40,000

As you can see, using a hybrid cloud solution and storing 80% of the data in the cloud with a provider like Backblaze B2 can result in significant savings over storing only on-premises.
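
If you’d like to plug in your own numbers, here’s a small Python sketch of the same comparison. The rates match the tables above; swap in your own on-premises and cloud costs as needed.

def monthly_cost(total_tb, on_prem_share, on_prem_rate, cloud_rate=6):
    """Split storage between on-premises and public cloud and return the monthly bill."""
    on_prem_tb = total_tb * on_prem_share
    cloud_tb = total_tb - on_prem_tb
    return on_prem_tb * on_prem_rate + cloud_tb * cloud_rate

for total in (100, 1_000, 2_000):  # TB, matching the scenarios above
    all_on_prem = monthly_cost(total, on_prem_share=1.0, on_prem_rate=12)
    hybrid = monthly_cost(total, on_prem_share=0.2, on_prem_rate=12)
    print(f"{total:>5}TB  all on-prem ${all_on_prem:>9,.0f}  hybrid ${hybrid:>9,.0f}")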

Choosing a Cloud Storage Provider for Your Hybrid Cloud

Okay, so you understand the benefits of using a hybrid cloud approach, what next? Determining the right mix of cloud services may be intimidating because there are so many public cloud options available. Fortunately, there are a few decision factors you can use to simplify setting up your hybrid cloud solution. Here’s what to think about when choosing a public cloud storage provider:

  • Ease of use: Avoiding a steep learning curve can save you hours of work effort in managing your cloud deployments. By contrast, overly complicated pricing tiers or bells and whistles you don’t need can slow you down.
  • Data security controls: Compare how each cloud provider facilitates proper data controls. For example, take a look at features like authentication, Object Lock, and encryption.
  • Data egress fees: Some cloud providers charge additional fees for data egress (i.e., removing data from the cloud). These fees can make it more expensive to switch between providers. In addition to fees, check the data speeds offered by the provider.
  • Interoperability: Flexibility and interoperability are key reasons to use cloud services. Before signing up for a service, understand the provider’s integration ecosystem. A lack of needed integrations may place a greater burden on your team to keep the service running effectively.
  • Storage tiers: Some providers offer different storage tiers where you sacrifice access for lower costs. While the promise of inexpensive cold storage can be attractive, evaluate whether you can afford to wait hours or days to retrieve your data.
  • Pricing transparency: Pay careful attention to the cloud provider’s pricing model and tier options. Consider building a spreadsheet to compare a shortlist of cloud providers’ pricing models.

When Hybrid Cloud Might Not Always Be the Right Fit

The hybrid cloud may not always be the optimal solution, particularly for smaller organizations with limited IT budgets that might find a purely public cloud approach more cost-effective. The substantial setup and operational costs of private servers could be prohibitive.

A thorough understanding of workloads is crucial to effectively tailor the hybrid cloud, ensuring the right blend of private, public, and traditional IT resources for each application and maximizing the benefits of the hybrid cloud architecture.

So, Should You Go Hybrid?

Big picture, anything that helps you respond to IT demands quickly, easily, and affordably is a win. With a hybrid cloud, you can avoid some big up-front capital expenses for in-house IT infrastructure, making your CFO happy. Being able to quickly spin up IT resources as they’re needed will appeal to the CTO and VP of operations.

So, given all that, we’ve arrived at the bottom line and the question is, should you or your organization embrace hybrid cloud infrastructure? According to Flexera’s 2023 State of the Cloud report, 72% of enterprises utilize a hybrid cloud strategy. That indicates that the benefits of the hybrid cloud appeal to a broad range of companies.

If an organization approaches implementing a hybrid cloud solution with thoughtful planning and a structured approach, a hybrid cloud can deliver on-demand flexibility, empower legacy systems and applications with new capabilities, and become a catalyst for digital transformation. The result can be an elastic and responsive infrastructure that has the ability to quickly adapt to the changing demands of the business.

As data management professionals increasingly recognize the advantages of the hybrid cloud, we can expect more and more of them to embrace it as an essential part of their IT strategy.

Tell Us What You’re Doing With the Hybrid Cloud

Are you currently embracing the hybrid cloud, or are you still uncertain or hanging back because you’re satisfied with how things are currently? We’d love to hear your comments below on how you’re approaching your cloud architecture decisions.

FAQs About Hybrid Cloud

What exactly is a hybrid cloud?

Hybrid cloud is a computing approach that uses both private and public cloud resources with some kind of orchestration between them.

What is the difference between hybrid and multi-cloud?

Multi-cloud combines two or more clouds of the same type (e.g., two or more public clouds), while hybrid cloud combines a private cloud with a public cloud. One cloud approach is not necessarily better than the other—they simply serve different use cases.

What is a hybrid cloud architecture?

Hybrid cloud architecture is any kind of IT architecture that combines both the public and private clouds. Many organizations use this term to describe specific software products that provide solutions which combine the two types of clouds.

What are hybrid clouds used for?

Organizations will often use hybrid clouds to create redundancy and scalability for their computing workload. A hybrid cloud is a great way for a company to have extra fallback options to continue offering services even when they have higher than usual levels of traffic, and it can also help companies scale up their services over time as they need to offer more options.

The post What Is Hybrid Cloud? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q3 2023

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2023/

A decorative image showing the title Q3 2023 Drive Stats.

At the end of Q3 2023, Backblaze was monitoring 263,992 hard disk drives (HDDs) and solid state drives (SSDs) in our data centers around the world. Of that number, 4,459 are boot drives, with 3,242 being SSDs and 1,217 being HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2023 Drive Stats review.

That leaves us with 259,533 HDDs that we’ll focus on in this report. We’ll review the quarterly and lifetime failure rates of the data drives as of the end of Q3 2023. Along the way, we’ll share our observations and insights on the data presented, and, for the first time ever, we’ll reveal the drive failure rates broken down by data center.

Q3 2023 Hard Drive Failure Rates

At the end of Q3 2023, we were managing 259,533 hard drives used to store data. For our review, we removed 449 drives from consideration because they were used for testing purposes or belonged to drive models with fewer than 60 drives. This leaves us with 259,084 hard drives grouped into 32 different models. 

The table below reviews the annualized failure rate (AFR) for those drive models for the Q3 2023 time period.

A table showing the quarterly annualized failure rates of Backblaze hard drives.
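
As a quick refresher on how these figures are derived, AFR annualizes the failures observed over the total drive days in the period. A minimal sketch of the calculation, using made-up numbers:

def annualized_failure_rate(drive_days, failures):
    """AFR as a percentage: failures per drive-year of operation."""
    drive_years = drive_days / 365
    return 100 * failures / drive_years

# Hypothetical cohort: 1,000 drives running the full 92-day quarter, 2 failures
print(f"{annualized_failure_rate(1_000 * 92, 2):.2f}%")  # 0.79%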

Notes and Observations on the Q3 2023 Drive Stats

  • The 22TB drives are here: At the bottom of the list you’ll see the WDC 22TB drives (model: WUH722222ALE6L4). A Backblaze Vault of 1,200 drives (plus four) is now operational. The 1,200 drives were installed on September 29, so they only have one day of service each in this report, but zero failures so far.
  • The old get bolder: At the other end of the time-in-service spectrum are the 6TB Seagate drives (model: ST6000DX000) with an average of 101 months in operation. This cohort had zero failures in Q3 2023 with 883 drives and a lifetime AFR of 0.88%.
  • Zero failures: In Q3, six different drive models managed to have zero drive failures during the quarter. But only the 6TB Seagate, noted above, had over 50,000 drive days, our minimum standard for ensuring we have enough data to make the AFR plausible.
  • One failure: There were four drive models with one failure during Q3. After applying the 50,000 drive day metric, two drives stood out:
    1. WDC 16TB (model: WUH721816ALE6L0) with a 0.15% AFR.
    2. Toshiba 14TB (model: MG07ACA14TEY) with a 0.63% AFR.

The Quarterly AFR Drops

In Q3 2023, quarterly AFR for all drives was 1.47%. That was down from 2.2% in Q2 and also down from 1.65% a year ago. The quarterly AFR is based on just the data in that quarter, so it can often fluctuate from quarter to quarter. 

In our Q2 2023 report, we suspected the 2.2% for the quarter was due to the overall aging of the drive fleet and in particular we pointed a finger at specific 8TB, 10TB, and 12TB drive models as potential culprits driving the increase. That prediction fell flat in Q3 as nearly two-thirds of drive models experienced a decreased AFR quarter over quarter from Q2 and any increases were minimal. This included our suspect 8TB, 10TB, and 12TB drive models. 

It seems Q2 was an anomaly, but there was one big difference in Q3: we retired 4,585 aging 4TB drives. The average age of the retired drives was just over eight years, and while that was a good start, there’s another 28,963 4TB drives to go. To facilitate the continuous retirement of aging drives and make the data migration process easy and safe, we use CVT, our awesome in-house data migration software, which we’ll cover at another time.

A Hot Summer and the Drive Stats Data

As anyone should in our business, Backblaze continuously monitors our systems and drives. So, it was of little surprise to us when the folks at NASA confirmed the summer of 2023 as Earth’s hottest on record. The effects of this record-breaking summer showed up in our monitoring systems in the form of drive temperature alerts. A given drive in a storage server can heat up for many reasons: it is failing; a fan in the storage server has failed; other components are producing additional heat; the air flow is somehow restricted; and so on. Add in the fact that the ambient temperature within a data center often increases during the summer months, and you can get more temperature alerts.

In reviewing the temperature data for our drives in Q3, we noticed that a small number of drives exceeded the maximum manufacturer’s temperature for at least one day. The maximum temperature for most drives is 60°C, except for the 12TB, 14TB, and 16TB Toshiba drives, which have a maximum temperature of 55°C. Of the 259,533 data drives in operation in Q3, there were 354 individual drives (about 0.14%) that exceeded their maximum manufacturer temperature. Of those, only two drives failed, leaving 352 drives which were still operational as of the end of Q3.

While temperature fluctuation is part of running data centers and temp alerts like these aren’t unheard of, our data center teams are looking into the root causes to ensure we’re prepared for the inevitability of increasingly hot summers to come.

Will the Temperature Alerts Affect Drive Stats?

The two drives which exceeded their maximum temperature and failed in Q3 have been removed from the Q3 AFR calculations. Both drives were 4TB Seagate drives (model: ST4000DM000). Given that the remaining 352 drives which exceeded their temperature maximum did not fail in Q3, we have left them in the Drive Stats calculations for Q3 as they did not increase the computed failure rates.

Beginning in Q4, we will remove the 352 drives from the regular Drive Stats AFR calculations and create a separate cohort of drives to track that we’ll name Hot Drives. This will allow us to track the drives that exceeded their maximum temperature and compare their failure rates to those of drives that operated within the manufacturer’s specifications. While there are a limited number of drives in the Hot Drives cohort, it could give us some insight into whether exposure to high temperatures causes drives to fail more often. This heightened level of monitoring will also help us detect any increase in drive failures and deal with it expeditiously.

New Drive Stats Data Fields in Q3

In Q2 2023, we introduced three new data fields that we started populating in the Drive Stats data we publish: vault_id, pod_id, and is_legacy_format. In Q3, we are adding three more fields to each drive record as follows:

  • datacenter: The Backblaze data center where the drive is installed, currently one of these values: ams5, iad1, phx1, sac0, and sac2.
  • cluster_id: The name of a given collection of storage servers logically grouped together to optimize system performance. Note: At this time the cluster_id is not always correct; we are working on fixing that.
  • pod_slot_num: The physical location of a drive within a storage server. The specific slot differs based on the storage server type and capacity: Backblaze (45 drives), Backblaze (60 drives), Dell (26 drives), or Supermicro (60 drives). We’ll dig into these differences in another post.

With these additions, the new schema beginning in Q3 2023 is:

  • date
  • serial_number
  • model
  • capacity_bytes
  • failure
  • datacenter (Q3)
  • cluster_id (Q3)
  • vault_id (Q2)
  • pod_id (Q2)
  • pod_slot_num (Q3)
  • is_legacy_format (Q2)
  • smart_1_normalized
  • smart_1_raw
  • The remaining SMART value pairs (as reported by each drive model)

Beginning in Q3, these data fields have been added to the publicly available Drive Stats files that we publish each quarter.
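As a hypothetical example of putting the new fields to work, here’s a minimal pandas sketch that reads a quarter’s worth of the published daily CSV files and computes an annualized failure rate per data center. The directory name is a placeholder, and this is an illustration rather than our internal tooling.

```python
# A minimal sketch (not Backblaze's own tooling) of reading one quarter of the
# published daily Drive Stats CSVs and grouping by the new datacenter field.
import glob

import pandas as pd

# Each daily file has one row per drive; 'failure' is 1 on the day a drive fails.
frames = [
    pd.read_csv(path, usecols=["date", "model", "datacenter", "failure"])
    for path in glob.glob("drive_stats_q3_2023/*.csv")
]
drives = pd.concat(frames, ignore_index=True)

# One row per drive per day, so the row count per group is its drive days.
summary = drives.groupby("datacenter").agg(
    drive_days=("failure", "size"),
    failures=("failure", "sum"),
)

# Annualized failure rate: failures per drive year, expressed as a percentage.
summary["afr_pct"] = summary["failures"] / summary["drive_days"] * 365 * 100
print(summary.sort_values("afr_pct"))
```

Note that rows with a blank datacenter value (see the notes below) simply drop out of the grouping.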

Failure Rates by Data Center

Now that we have the data center for each drive we can compute the AFRs for the drives in each data center. Below you’ll find the AFR for each of five data centers for Q3 2023.

A chart showing Backblaze annualized failure rates by data center.

Notes and Observations

  • Null?: The drives which reported a null or blank value for their data center are grouped in four Backblaze vaults. David, the Senior Infrastructure Software Engineer for Drive Stats, described the process of how we gather all the parts of the Drive Stats data each day. The TL:DR is that vaults can be too busy to respond at the moment we ask, and since the data center field is nice-to-have data, we get a blank field. We can go back a day or two to find the data center value, which we will do in the future when we report this data.
  • sac0?: sac0 has the highest AFR of all of the data centers, but it also has the oldest drives—nearly twice as old, on average, versus the next closest data center, sac2. As discussed previously, drive failures do seem to follow the “bathtub curve”, although recently we’ve seen the curve start out flatter. Regardless, as drive models age, they do generally fail more often. Another factor could be that sac0, and to a lesser extent sac2, has some of the oldest Storage Pods, including a handful of 45-drive units. We are in the process of using CVT to replace these older servers while migrating from 4TB to 16TB and larger drives.
  • iad1: The iad data center is the foundation of our eastern region and has been growing rapidly since coming online about a year ago. The growth is a combination of new data and customers using our cloud replication capability to automatically make a copy of their data in another region.
  • Q3 Data: This chart is for Q3 data only and includes all the data drives, including those with less than 60 drives per model. As we track this data over the coming quarters, we hope to get some insight into whether different data centers really have different drive failure rates, and, if so, why.

Lifetime Hard Drive Failure Rates

As of September 30, 2023, we were tracking 259,084 hard drives used to store customer data. For our lifetime analysis, we collect the number of drive days and the number of drive failures for each drive beginning from the time a drive was placed into production in one of our data centers. We group these drives by model, then sum up the drive days and failures for each model over their lifetime. That chart is below. 

A chart showing Backblaze lifetime hard drive failure rates.

One of the most important columns on this chart is the confidence interval, which is the difference between the low and high AFR confidence levels calculated at 95%. The lower the value, the more certain we are of the AFR stated. We like a confidence interval to be 0.5% or less. When the confidence interval is higher, that is not necessarily bad; it just means we either need more data or the data is somewhat inconsistent.
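For readers who want to compute something similar from the public data, here’s a sketch of one common way to put a 95% confidence interval around a failure rate, treating failures as a Poisson process over drive days. This is an illustration of the general technique, not necessarily the exact method we use.

```python
# A sketch of a 95% confidence interval for an annualized failure rate using
# an exact Poisson interval over drive days. Illustrative only; not
# necessarily Backblaze's exact method.
from scipy.stats import chi2


def afr_confidence_interval(failures: int, drive_days: float, conf: float = 0.95):
    """Return (low, high) annualized failure rates as percentages."""
    alpha = 1.0 - conf
    low_count = 0.0 if failures == 0 else chi2.ppf(alpha / 2, 2 * failures) / 2
    high_count = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    scale = 365 * 100 / drive_days
    return low_count * scale, high_count * scale


# Example with made-up numbers: 10 failures over 500,000 drive days.
low, high = afr_confidence_interval(10, 500_000)
print(f"95% confidence interval: {low:.2f}%-{high:.2f}%")  # about 0.35%-1.34%
```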

The table below contains just those drive models which have a confidence interval of less than 0.5%. We have sorted the list by drive size and then by AFR.

A chart showing Backblaze hard drive annualized failure rates with a confidence interval of less than 0.5%.

The 4TB, 6TB, 8TB, and some of the 12TB drive models are no longer in production. The HGST 12TB models in particular can still be found, but they have been relabeled as Western Digital and given alternate model numbers. Whether they have materially changed internally is not known, at least to us.

One final note about the lifetime AFR data: you might have noticed the AFR for all of the drives hasn’t changed much from quarter to quarter. It has vacillated between 1.39% and 1.45% for the last two years. Basically, we have lots of drives with lots of time-in-service so it is hard to move the needle up or down. While the lifetime stats for individual drive models can be very useful, the lifetime AFR for all drives will probably get less and less interesting as we add more and more drives. Of course, a few hundred thousand drives that never fail could arrive, so we will continue to calculate and present the lifetime AFR.

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Stats Data webpage. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free. 

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q3 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AI 101: Training vs. Inference

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-training-vs-inference/

A decorative image depicting a neural network identifying a cat.

What do Sherlock Holmes and ChatGPT have in common? Inference, my dear Watson!

“We approached the case, you remember, with an absolutely blank mind, which is always an advantage. We had formed no theories. We were simply there to observe and to draw inferences from our observations.”
—Sir Arthur Conan Doyle, The Adventures of the Cardboard Box

As we all continue to refine our thinking around artificial intelligence (AI), it’s useful to define terminology that describes the various stages of building and using AI algorithms—namely, the AI training stage and the AI inference stage. As we see in the quote above, these are not new concepts: they’re based on ideas and methodologies that have been around since before Sherlock Holmes’ time. 

If you’re using AI, building AI, or just curious about AI, it’s important to understand the difference between these two stages so you understand how data moves through an AI workflow. That’s what I’ll explain today.

The TL:DR

The difference between these two terms can be summed up fairly simply: first you train an AI algorithm, then your algorithm uses that training to make inferences from data. To create a whimsical analogy, when an algorithm is training, you can think of it like Watson—still learning how to observe and draw conclusions through inference. Once it’s trained, it’s an inferring machine, a.k.a. Sherlock Holmes. 

Whimsy aside, let’s dig a little deeper into the tech behind AI training and AI inference, the differences between them, and why the distinction is important. 

Obligatory Neural Network Recap

Neural networks have emerged as the brainpower behind AI, and a basic understanding of how they work is foundational when it comes to understanding AI.  

Complex decisions, in theory, can be broken down into a series of yeses and nos, which means that they can be encoded in binary. Neural networks have the ability to combine enough of those smaller decisions, weigh how they affect each other, and then use that information to solve complex problems. And, because more complex decisions require more points of information to come to a final decision, they require more processing power. Neural networks are one of the most widely used approaches to AI and machine learning (ML). 

A diagram showing the inputs, hidden layers, and outputs of a neural network.
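To make the “weighted combination of small decisions” idea concrete, here’s a minimal NumPy sketch of a single forward pass through a tiny network. The weights here are random placeholders—a real network learns them during training.

```python
# A toy forward pass: three inputs feed four hidden units, which feed one
# output "decision." The weights are made up, not learned.
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)  # input-to-hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden-to-output weights

x = np.array([0.2, 0.7, 0.1])                  # one example input
hidden = sigmoid(x @ W1 + b1)                  # many small, weighted decisions
output = sigmoid(hidden @ W2 + b2)             # combined into a final score
print(f"Probability-like output: {output[0]:.3f}")
```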

What Is AI Training?: Understanding Hyperparameters and Parameters

In simple terms, training an AI algorithm is the process through which you take a base algorithm and then teach it how to make the correct decision. This process requires large amounts of data, and can include various degrees of human oversight. How much data you need has a relationship to the number of parameters you set for your algorithm as well as the complexity of a problem. 

We made this handy dandy diagram to show you how data moves through the training process:

A diagram showing how data moves through an AI training algorithm.
As you can see in this diagram, the end result is model data, which then gets saved in your data store for later use.

And hey—we’re leaving out a lot of nuance in that conversation because dataset size, parameter choice, and so on are graduate-level topics in their own right, and are usually considered proprietary information by the companies training an AI algorithm. It suffices to say that dataset size and number of parameters are both significant and have a relationship to each other, though it’s not a direct cause/effect relationship. And, both the number of parameters and the size of the dataset affect things like processing resources—but that conversation is outside of scope for this article (not to mention a hot topic in research). 

As with everything, your use case determines your execution. Some types of tasks actually see excellent results with smaller datasets and more parameters, whereas others require more data and fewer parameters. Bringing it back to the real world, here’s a very cool graph showing how many parameters different AI systems have. Note that they very helpfully identified what type of task each system is designed to solve:

So, let’s talk about what parameters are with an example. Back in our very first AI 101 post, we talked about ways to frame an algorithm in simple terms: 

Machine learning does not specify how much knowledge the bot you’re training starts with—any task can have more or fewer instructions. You could ask your friend to order dinner, or you could ask your friend to order you pasta from your favorite Italian place to be delivered at 7:30 p.m. 

Both of those tasks you just asked your friend to complete are algorithms. The first algorithm requires your friend to make more decisions to execute the task at hand to your satisfaction, and they’ll do that by relying on their past experience of ordering dinner with you—remembering your preferences about restaurants, dishes, cost, and so on. 

The factors that help your friend make a decision about dinner are called hyperparameters and parameters. Hyperparameters are those that frame the algorithm—they are set outside the training process, but can influence the training of the algorithm. In the example above, a hyperparameter would be how you structure your dinner feedback. Do you thumbs up or down each dish? Do you write a short review? You get the idea. 

Parameters are factors that the algorithm derives through training. In the example above, that’s what time you prefer to eat dinner, which restaurants you’ve enjoyed eating from, and so on. 
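To see the distinction in code, here’s a toy sketch that fits a line to made-up data. The learning rate and epoch count are hyperparameters we choose up front; the slope and intercept are parameters the algorithm derives through training.

```python
# Toy gradient descent: hyperparameters are set by us, parameters are learned.
import numpy as np

# Hyperparameters: chosen before training starts.
learning_rate = 0.02
epochs = 2000

# Made-up training data: y is roughly 2x + 1 plus a little noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=100)

# Parameters: start at arbitrary values and get learned from the data.
slope, intercept = 0.0, 0.0
for _ in range(epochs):
    error = slope * x + intercept - y
    slope -= learning_rate * 2 * np.mean(error * x)  # gradient step on MSE
    intercept -= learning_rate * 2 * np.mean(error)

print(f"Learned parameters: slope={slope:.2f}, intercept={intercept:.2f}")
# Should land close to slope=2, intercept=1.
```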

When you’ve trained a neural network, there will be heavier weights between various nodes. That’s shorthand for saying that an algorithm will prefer a path it knows is significant, and if you want to really get nerdy with it, this article is well-researched, has a ton of math explainers for various training methods, and includes some fantastic visuals. For our purposes, here’s one way people visualize a “trained” algorithm: 

An image showing a neural network that has prioritized certain pathways after training.
Source.

The “dropout method” randomly deactivates a portion of the network’s nodes during each training pass so the algorithm can’t lean too heavily on any single pathway. Over the course of training, the relationships the algorithm finds to be significant for the dataset it’s working on end up with heavier weights, while the others are de-prioritized (or sometimes even eliminated). 
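Here’s a simplified sketch of what a dropout mask looks like in code—purely illustrative, not any particular framework’s implementation:

```python
# Inverted dropout: during training, randomly zero out activations and scale
# the survivors; at inference time, pass activations through unchanged.
import numpy as np

rng = np.random.default_rng(0)
drop_prob = 0.5


def dropout(activations, training: bool):
    if not training:
        return activations  # inference: the full network is used
    mask = rng.random(activations.shape) >= drop_prob
    # Scale so the expected activation magnitude matches inference time.
    return activations * mask / (1.0 - drop_prob)


layer_output = np.array([0.8, 0.1, 0.5, 0.9])
print(dropout(layer_output, training=True))   # some units zeroed, others scaled
print(dropout(layer_output, training=False))  # unchanged at inference
```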

Once you have a trained algorithm, then you can use it with a reasonable degree of certainty that it will give you good results, and that leads us to inference. 

What Is AI Inference?

Once you’ve trained your algorithm, you can send it out in the world to do its job (and make yours easier). When you present a trained AI algorithm with a problem and it gives you an answer, that’s called inference. It’s using the way it was trained to draw conclusions or make predictions, depending on how it was built, and once an algorithm is in the “inference stage”, it’s no longer learning (usually). 

Here’s our diagram for how data might move through an inference process: 

A diagram showing how data moves through an inference workflow.
As you can see, if we want to add more training data to the algorithm, we’d do so as a separate input/output flow. The goal of this process is to output data.

As we’ve seen in the real world, this does not mean that the AI model will give you a perfect answer every time—depending on, for example, the AI’s access to new or current information, the complexity of training, etc., the algorithm might give you a better or worse (more or less accurate) answer. But, it will be able to look at a new piece of information or dataset, and return a response. 

When I asked our friend ChatGPT about this, it broke down inference and defined it in relation to the logical underpinnings of different types of algorithms. (As a mere human, how could I resist using inference while writing about inference?) Sure, ChatGPT may be merely mimicking human-like dialogue based on its ingestion of the whole internet, then giving me the most likely answer based on its specialized neural network that predicts the likelihood of the next word in response to my conversational query and without access to data after January 2022, but here’s what our AI friend had to say for itself: 

  • Machine Learning Inference: In machine learning, after a model has been trained on a dataset, it is deployed to make predictions or classifications on new, unseen data. During inference, the model takes the input data, processes it, and produces an output or a prediction based on the patterns and relationships it learned during training. For example, a trained machine learning model might be used to classify images, predict stock prices, or identify spam emails.
  • Deep Learning Inference: Deep learning, a subfield of machine learning, involves neural networks with multiple layers. In deep learning, inference means feeding an input through a trained deep neural network and obtaining an output. This is commonly used in applications like image recognition, natural language processing, and speech recognition. Inference in deep learning can be computationally intensive, and specialized hardware, such as GPUs and TPUs, are often used to accelerate the process.
  • Rule-Based Systems Inference: In rule-based AI systems, inference is the process of applying logical rules to make decisions or draw conclusions. These systems use a set of predefined rules and facts to reason and infer new information. For example, in an expert system for medical diagnosis, inference involves applying a set of rules to patient symptoms and medical knowledge to arrive at a diagnosis.
  • Probabilistic Inference: In probabilistic AI, inference is used to estimate probabilities and uncertainty. Bayesian networks and Markov decision processes are examples of probabilistic models that perform inference to compute probabilities of events or make decisions based on probabilities.

You’ll notice that each of these are saying basically the same thing: the AI algorithm applies its decision-making paradigm to a problem. 

Why Stop Learning During the Inference Stage?

In general, it’s important to keep these two stages—training and inference—of an AI algorithm separate for a few reasons: 

  • Efficiency: Training is typically a computationally intensive process, whereas inference is usually faster and less resource-intensive. Separating them allows for efficient use of computational resources.
  • Generalization: The model’s ability to generalize from training data to unseen data is a key feature. It should not learn from every new piece of data it encounters during inference to maintain this generalization ability.
  • Reproducibility: When using trained models in production or applications, it’s important to have consistency and reproducibility in the results. If models were allowed to learn during inference, it would introduce variability and unpredictability in their behavior.

There are some specialized AI algorithms that want to continue learning during the inference stage—your Netflix algorithm is a good example, as are self-driving cars and the dynamic pricing models used to set airfare. On the other hand, the majority of problems we’re trying to solve with AI algorithms deliver better decisions by separating these two phases—think of things like image recognition, language translation, or medical diagnosis, for example.

Training vs. Inference (But, Really: Training Then Inference)

To recap: the AI training stage is when you feed data into your learning algorithm to produce a model, and the AI inference stage is when your algorithm uses that training to make inferences from data. Here’s a chart for quick reference: 

Training                                            Inference
Feed training data into a learning algorithm.       Apply the model to inference data.
Produces a model comprising code and data.          Produces output data.
One time(ish). Retraining is sometimes necessary.   Often continuous.
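If you prefer to see the two stages side by side in code, here’s a minimal scikit-learn sketch of the train-once, infer-many-times pattern. The dataset, model type, and file name are just illustrative.

```python
# Training happens once and is relatively expensive; the saved model is then
# loaded and asked for predictions (inference) as often as needed, without
# any further learning.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Training stage: feed labeled data to a learning algorithm, produce a model.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "flower_model.joblib")   # persist the trained model

# Inference stage: load the frozen model and apply it to new, unseen data.
deployed = joblib.load("flower_model.joblib")
new_measurements = [[5.1, 3.5, 1.4, 0.2]]   # one new flower to classify
print(deployed.predict(new_measurements))   # the model infers a class label
```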

The difference may seem inconsequential at first glance, but defining these two stages helps to show the implications for AI adoption, particularly for businesses. That is, given that inference is much less resource intensive (and therefore less expensive) than training, it’s likely to be much easier for businesses to integrate already-trained AI algorithms with their existing systems. 

And, as always, we’re big believers in demystifying terminology for discussion purposes. Let us know what you think in the comments, and feel free to let us know what you’re interested in learning about next.

The post AI 101: Training vs. Inference appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Anything as a Service: All the “as a Service” Acronyms You Didn’t Know You Needed

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/anything-as-a-service-all-the-as-a-service-acronyms-you-didnt-know-you-needed/

A decorative image showing acronyms for different "as as service" acronyms.

Have you ever felt like you need a dictionary just to understand what tech-savvy folks are talking about? Well, you’re in luck, because we’re about to decode some of the most common jargon of the digital age, one acronym at a time. Welcome to the world of “as a Service” acronyms, where we take the humble alphabet and turn it into a digital buffet. 

So, whether you’re SaaS-savvy or PaaS-puzzled, or just someone desperately searching for a little HaaS (Humor as a Service …yeah, we made that one up), you’ve come to the right place. Let’s take a big slurp from this alphabet soup of tech terms.

The One That Started It All: SaaS

SaaS stands for software as a service, and it’s the founding member of the “as a service” nomenclature. (Though, very confusingly, there’s also Sales as a Service—it’s just not shortened to SaaS. Usually.)

Imagine your software as a pizza delivery service. You don’t need to buy all the ingredients, knead the dough, and bake it yourself. Instead, you simply order a slice, and it magically appears on your table (a.k.a. screen). SaaS products are like that, but instead of pizza they serve up everything from messaging to video conferencing to email marketing to …well, really you name it. Which brings us to…

The Kind of Ironic One: XaaS

XaaS stands for, variously, “everything” or “anything” as a service. No one is really sure about the term’s provenance, but it’s a fair guess to say it came into existence when, well, everything started to become a service, probably sometime around the mid-2010s. The thinking is: if it exists in the digital realm, you can probably get it “as a service.” 

The Hardware Related Ones: HaaS, IaaS, and PaaS

HaaS (Hardware as a Service): Instead of purchasing hardware yourself, like computers, servers, networking equipment, and other physical infrastructure components, with HaaS, you can lease or rent the equipment for a given period. It would be like renting a pizza kitchen to make your specialty pies specifically for your sister’s wedding or your grandma’s birthday.

IaaS (Infrastructure as a Service): Infrastructure as a service is kind of like hardware as a service, but it comes with some additional goodies thrown in. Instead of renting just the kitchen, you rent the whole restaurant—chairs, tables, and servers (no pun intended) included. IaaS delivers virtualized computing resources, like virtual machines, storage (that’s us!), and networking, over the internet.

PaaS (Platform as a Service): Think of PaaS as a step even further than IaaS—you’re not just renting a pizza restaurant, you’re renting a test kitchen where you can develop your award-winning pie. PaaS provides developers the ability to build, manage, and deploy applications with services like development frameworks, databases, and infrastructure management. It’s the ultimate DIY platform for tech enthusiasts.

The Bad One: RaaS

RaaS stands for Ransomware as a Service, and this is one “as a service” variant you don’t want to mess with. Basically, cybercriminals can purchase ransomware just as easily as you would purchase any app on the app store (it’s probably more complicated than that, but you get the general gist). This makes it easy for even the least savvy cybercriminal to get into the ransomware game. Not great. 

The Ones That Help With the Last One: BaaS and DRaaS

BaaS (Backup as a Service): Backup as a Service is a cloud-based data protection solution that allows individuals and organizations to back up their data to a remote cloud. (Hey! That’s us too!) Instead of managing on-premises backup infrastructure, users can securely store their data off-site, often on highly redundant and geographically distributed servers.

DRaaS (Disaster Recovery as a Service): DRaaS stands for disaster recovery as a service, and it’s the antidote to RaaS. Of course, you need good backups to begin with, but adding DRaaS allows businesses to ensure specific recovery time objectives (RTOs, FYI) so they can get back up and running in the event they’re attacked by ransomware or there’s a natural disaster at their primary storage location. DRaaS solutions used to be made almost exclusively with the large enterprise in mind, but today, it’s possible to architect a DRaaS solution for your business affordably and easily.

The Analytical One: DaaS

DaaS stands for data as a service, and it’s your data’s personal chauffeur. It fetches the information you need and serves it up on a silver platter. DaaS offers data on-demand, making structured data accessible to users over the internet. It simplifies data sharing and access, often in real-time, without the need for complex data management.

The Development-Focused Ones: CaaS, BaaS (again), and FaaS

CaaS (Containers as a Service): CaaS simplifies the deployment, scaling, and orchestration of containerized applications. It’s the tech version of a literal container ship. The individual containers “ship” individual pieces of software, and a CaaS tool helps carry all of those individual containers. Check out container management software Docker’s logo for a visualization:

It looks more like a whale carrying containers, which is far more adorable, in our opinion.

BaaS (Backend as a Service): It wouldn’t be the first time an acronym has two meanings. BaaS, in this context, provides a backend infrastructure for mobile and web app developers, offering services like databases, user authentication, and APIs. Imagine your own team of digital butlers tending to the back end of your apps. They handle all the behind-the-scenes stuff, so you can focus on making your app shine. 

FaaS (Function as a Service): FaaS is a serverless computing model where developers focus on writing and deploying individual functions or code snippets. These functions run in response to specific events, promoting scalability and efficiency in application development. It’s like having a team of tiny, code-savvy robots doing your bidding.

Go Forth and Abbreviate

Now that you’ve sampled all of the flavors the vast “as a service” world has to offer, we hope you’ve gained a clearer understanding of these sometimes confounding terms. So whether you’re a business professional navigating the cloud or just curious about the tech world, you can wield these acronyms with confidence. 

Did we miss any? I’m sure. Let us know in the comments.

The post Anything as a Service: All the “as a Service” Acronyms You Didn’t Know You Needed appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How We Achieved Upload Speeds Faster Than AWS S3

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/2023-performance-improvements/

An image of a city skyline with lines going up to a cloud.

You don’t always need the absolute fastest cloud storage—your performance requirements depend on your use case, business objectives, and security needs. But still, faster is usually better. And Backblaze just announced innovation on B2 Cloud Storage that delivers a lot more speed: most file uploads will now be up to 30% faster than AWS S3. 

Today, I’m diving into all of the details of this performance improvement, how we did it, and what it means for you.

The TL:DR

The Results: Customers who rely on small file uploads (1MB or less) can expect to see 10–30% faster uploads on average based on our tests, all without any change to durability, availability, or pricing. 

What Does This Mean for You? 

All B2 Cloud Storage customers will benefit from these performance enhancements, especially those who use Backblaze B2 as a storage destination for data protection software. Small uploads of 1MB or less make up about 70% of all uploads to B2 Cloud Storage and are common for backup and archive workflows. Specific benefits of the performance upgrades include:

  • Secures data in offsite backup faster.
  • Frees up time for IT administrators to work on other projects.
  • Decreases congestion on network bandwidth.
  • Deduplicates data more efficiently.

Veeam® is dedicated to working alongside our partners to innovate and create a united front against cyber threats and attacks. The new performance improvements released by Backblaze for B2 Cloud Storage furthers our mission to provide radical resilience to our joint customers.

—Andreas Neufert, Vice President, Product Management, Alliances, Veeam

When Can I Expect Faster Uploads?

Today. The performance upgrades have been fully rolled out across Backblaze’s global data regions.

How We Did It

Prior to this work, when a customer uploaded a file to Backblaze B2, the data was written to multiple hard disk drives (HDDs). Those operations had to be completed before returning a response to the client. Now, we write the incoming data to the same HDDs and also, simultaneously, to a pool of solid state drives (SSDs) we call a “shard stash,” waiting only for the HDD writes to make it to the filesystems’ in-memory caches and the SSD writes to complete before returning a response. Once the writes to HDD are complete, we free up the space from the SSDs so it can be reused.

Since writing data to an SSD is much faster than writing to HDDs, the net result is faster uploads. 

That’s just a brief summary; if you’re interested in the technical details (as well as the results of some rigorous testing), read on!

The Path to Performance Upgrades

As you might recall from many Drive Stats blog posts and webinars, Backblaze stores all customer data on HDDs, affectionately termed ‘spinning rust’ by some. We’ve historically reserved SSDs for Storage Pod (storage server) boot drives. 

Until now. 

That’s right—SSDs have entered the data storage chat. To achieve these performance improvements, we combined the performance of SSDs with the cost efficiency of HDDs. First, I’ll dig into a bit of history to add some context to how we went about the upgrades.

HDD vs. SSD

IBM shipped the first hard drive way back in 1957, so it’s fair to say that the HDD is a mature technology. Drive capacity and data rates have steadily increased over the decades while cost per byte has fallen dramatically. That first hard drive, the IBM RAMAC 350, had a total capacity of 3.75MB, and cost $34,500. Adjusting for inflation, that’s about $375,000, equating to $100,000 per MB, or $100 billion per TB, in 2023 dollars.

A photograph of people pushing one of the first hard disk drives into a truck.
An early hard drive shipped by IBM. Source.

Today, the 16TB version of the Seagate Exos X16—an HDD widely deployed in the Backblaze B2 Storage Cloud—retails for around $260, $16.25 per TB. If it had the same cost per byte as the IBM RAMAC 350, it would sell for $1.6 trillion—around the current GDP of China!

SSDs, by contrast, have only been around since 1991, when SanDisk’s 20MB drive shipped in IBM ThinkPad laptops for an OEM price of about $1,000. Let’s consider a modern SSD: the 3.2TB Micron 7450 MAX. Retailing at around $360, the Micron SSD is priced at $112.50 per TB, nearly seven times as much as the Seagate HDD.

So, HDDs easily beat SSDs in terms of storage cost, but what about performance? Here are the numbers from the manufacturers’ data sheets:

                                      Seagate Exos X16   Micron 7450 MAX
Model number                          ST16000NM001G      MTFDKCB3T2TFS
Capacity                              16TB               3.2TB
Drive cost                            $260               $360
Cost per TB                           $16.25             $112.50
Max sustained read rate (MB/s)        261                6,800
Max sustained write rate (MB/s)       261                5,300
Random read rate, 4kB blocks, IOPS    170/440*           1,000,000
Random write rate, 4kB blocks, IOPS   170/440*           390,000

Since HDD platters rotate at a constant rate, 7,200 RPM in this case, they can transfer more blocks per revolution at the outer edge of the disk than close to the middle—hence the two figures for the X16’s random read and write rates.

The SSD is over 20 times as fast at sustained data transfer as the HDD, but look at the difference in random transfer rates! Even when the HDD is at its fastest, transferring blocks from the outer edge of the disk, the SSD is over 2,200 times faster reading data and nearly 900 times faster for writes.

This massive difference is due to the fact that, when reading data from random locations on the disk, the platters have to complete an average of 0.5 revolutions between blocks. At 7,200 rotations per minute (RPM), that means that the HDD spends about 4.2ms just spinning to the next block before it can even transfer data. In contrast, the SSD’s data sheet quotes its latency as just 80µs (that’s 0.08ms) for reads and 15µs (0.015ms) for writes, between roughly 52 and 280 times faster than the spinning disk.

Let’s consider a real-world operation, say, writing 64kB of data. Assuming the HDD can write that data to sequential disk sectors, it will spin for an average of 4.2ms, then spend 0.25ms writing the data to the disk, for a total of 4.5ms. The SSD, in contrast, can write the data to any location instantaneously, taking just 27µs (0.027ms) to do so. This (somewhat theoretical) 167x speed advantage is the basis for the performance improvement.

Why did I choose a 64kB block? As we mentioned in a recent blog post focusing on cloud storage performance, in general, bigger files are better when it comes to the aggregate time required to upload a dataset. However, there may be other requirements that push for smaller files. Many backup applications split data into fixed size blocks for upload as files to cloud object storage. There is a trade-off in choosing the block size: larger blocks improve backup speed, but smaller blocks reduce the amount of storage required. In practice, backup blocks may be as small as 1MB or even 256kB. The 64kB blocks we used in the calculation above represent the shards that comprise a 1MB file.

The challenge facing our engineers was to take advantage of the speed of solid state storage to accelerate small file uploads without breaking the bank.

Improving Write Performance for Small Files

When a client application uploads a file to the Backblaze B2 Storage Cloud, a coordinator pod splits the file into 16 data shards, creates four additional parity shards, and writes the resulting 20 shards to 20 different HDDs, each in a different Pod.

Note: As HDD capacity increases, so does the time required to recover after a drive failure, so we periodically adjust the ratio between data shards and parity shards to maintain our eleven nines durability target. In the past, you’ve heard us talk about 17 + 3 as the ratio, but we also run 16 + 4, and our very newest vaults use a 15 + 5 scheme.

Each Pod writes the incoming shard to its local filesystem; in practice, this means that the data is written to an in-memory cache and will be written to the physical disk at some point in the near future. Any requests for the file can be satisfied from the cache, but the data hasn’t actually been persistently stored yet.

We need to be absolutely certain that the shards have been written to disk before we return a “success” response to the client, so each Pod executes an fsync system call to transfer (“flush”) the shard data from system memory through the HDD’s write cache to the disk itself before returning its status to the coordinator. When the coordinator has received at least 19 successful responses, it returns a success response to the client. This ensures that, even if the entire data center was to lose power immediately after the upload, the data would be preserved.

As we explained above, for small blocks of data, the vast majority of the time spent writing the data to disk is spent waiting for the drive platter to spin to the correct location. Writing shards to SSD could result in a significant performance gain for small files, but what about that 7x cost difference?

Our engineers came up with a way to have our cake and eat it too by harnessing the speed of SSDs without a massive increase in cost. Now, upon receiving a file of 1MB or less, the coordinator splits it into shards as before, then simultaneously sends the shards to a set of 20 Pods and a separate pool of servers, each populated with 10 of the Micron SSDs described above—a “shard stash.” The shard stash servers easily win the “flush the data to disk” race and return their status to the coordinator in just a few milliseconds. Meanwhile, each HDD Pod writes its shard to the filesystem, queues up a task to flush the shard data to the disk, and returns an acknowledgement to the coordinator.

Once the coordinator has received replies establishing that at least 19 of the 20 Pods have written their shards to the filesystem, and at least 19 of the 20 shards have been flushed to the SSDs, it returns its response to the client. Again, if power was to fail at this point, the data has already been safely written to solid state storage.

We don’t want to leave the data on the SSDs any longer than we have to, so each Pod, once it’s finished flushing its shard to disk, signals to the shard stash that it can purge its copy of the shard.
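To make the flow easier to picture, here’s a toy sketch of the coordination logic. It is emphatically not our production code—the timings are invented, and it’s simplified so that the acknowledgement waits only on the SSD quorum.

```python
# Toy model of the shard stash idea: write 20 shards to slow HDD pods and a
# fast SSD stash in parallel, acknowledge once 19 SSD writes are durable, and
# purge each SSD copy as its HDD counterpart lands. Timings are made up.
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

NUM_SHARDS = 20
QUORUM = 19


def write_to_ssd(shard_id: int) -> int:
    time.sleep(random.uniform(0.001, 0.005))  # SSD writes finish quickly
    return shard_id


def write_to_hdd(shard_id: int) -> int:
    time.sleep(random.uniform(0.02, 0.06))    # HDD writes and flushes are slower
    return shard_id


with ThreadPoolExecutor(max_workers=2 * NUM_SHARDS) as pool:
    ssd_futures = [pool.submit(write_to_ssd, s) for s in range(NUM_SHARDS)]
    hdd_futures = [pool.submit(write_to_hdd, s) for s in range(NUM_SHARDS)]

    # Wait for a quorum of durable SSD writes, then acknowledge the upload.
    for done, _ in zip(as_completed(ssd_futures), range(QUORUM)):
        done.result()
    print("upload acknowledged: data is safe on the shard stash")

    # As each HDD copy lands, the corresponding SSD copy can be purged.
    for future in as_completed(hdd_futures):
        print(f"shard {future.result()} on HDD; purging its SSD copy")
```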

Real-World Performance Gains

As I mentioned above, that calculated 167x performance advantage of SSDs over HDDs is somewhat theoretical. In the real world, the time required to upload a file also depends on a number of other factors—proximity to the data center, network speed, and all of the software and hardware between the client application and the storage device, to name a few.

The first Backblaze region to receive the performance upgrade was U.S. East, located in Reston, Virginia. Over a 12-day period following the shard stash deployment there, the average time to upload a 256kB file was 118ms, while a 1MB file clocked in at 137ms. To replicate a typical customer environment, we ran the test application at our partner Vultr’s New Jersey data center, uploading data to Backblaze B2 across the public internet.

For comparison, we ran the same test against Amazon S3’s U.S. East (Northern Virginia) region, a.k.a. us-east-1, from the same machine in New Jersey. On average, uploading a 256kB file to S3 took 157ms, with a 1MB file taking 153ms.

So, comparing the Backblaze B2 U.S. East region to the Amazon S3 equivalent, we benchmarked the new, improved Backblaze B2 as 30% faster than S3 for 256kB files and 10% faster than S3 for 1MB files.

These low-level tests were confirmed when we timed Veeam Backup & Replication software backing up 1TB of virtual machines with 256k block sizes. Backing the server up to Amazon S3 took three hours and 12 minutes; we measured the same backup to Backblaze B2 at just two hours and 15 minutes, 40% faster than S3.

Test Methodology

We wrote a simple Python test app using the AWS SDK for Python (Boto3). Each test run involved timing 100 file uploads using the S3 PutObject API, with a 10ms delay between each upload. (FYI, the delay is not included in the measured time.) The test app used a single HTTPS connection across the test run, following best practice for API usage. We’ve been running the test on a VM in Vultr’s New Jersey region every six hours for the past few weeks against both our U.S. East region and its AWS neighbor. Latency to the Backblaze B2 API endpoint averaged 5.7ms, to the Amazon S3 API endpoint 7.8ms, as measured across 100 ping requests.
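For the curious, here’s a stripped-down sketch of what such a timing test might look like. The endpoint, credentials, and bucket name are placeholders, and this isn’t the exact harness behind the published numbers.

```python
# Illustrative small-file upload timing test using Boto3. Boto3 reuses the
# underlying HTTPS connection across calls from a single client.
import time

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-east-005.backblazeb2.com",  # example endpoint
    aws_access_key_id="YOUR_KEY_ID",
    aws_secret_access_key="YOUR_APPLICATION_KEY",
)

payload = b"x" * 256 * 1024          # a 256kB object
bucket = "performance-test-bucket"   # placeholder bucket name
timings = []

for i in range(100):
    start = time.perf_counter()
    s3.put_object(Bucket=bucket, Key=f"test-object-{i}", Body=payload)
    timings.append(time.perf_counter() - start)
    time.sleep(0.01)                 # 10ms pause between uploads (not timed)

print(f"average upload time: {sum(timings) / len(timings) * 1000:.1f} ms")
```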

What’s Next?

At the time of writing, shard stash servers have been deployed to all of our data centers, across all of our regions. In fact, you might even have noticed small files uploading faster already. It’s important to note that this particular optimization is just one of a series of performance improvements that we’ve implemented, with more to come. It’s safe to say that all of our Backblaze B2 customers will enjoy faster uploads and downloads, no matter their storage workload.

The post How We Achieved Upload Speeds Faster Than AWS S3 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

HYCU + Backblaze: Protecting Against Data Loss

Post Syndicated from Jennifer Newman original https://www.backblaze.com/blog/hycu-backblaze-protecting-against-data-loss/

A decorative image showing the Backblaze and HYCU logos.

Backblaze and HYCU, the fastest-growing leader in multi-cloud data protection as a service (DPaaS), are teaming up to provide businesses a complete backup solution for modern workloads with low-cost scalable infrastructure—a must-have for modern cyber resilience needs.

Read on to learn more about the partnership, how you can benefit from affordable, agile data protection, and a little bit about a relevant ancient poetic art form.

HYCU + Backblaze: The Power of Collaboration

Within HYCU’s DPaaS platform, shared customers can now select Backblaze B2 Cloud Storage—an S3 compatible object storage platform that provides highly durable, instantly available hot storage—as a destination for their HYCU backups. 

With more applications in use across the modern data center, visibility and the ability to protect that mission-critical data has never been at more of a premium. Our collaboration with Backblaze now offers joint customers a cost-effective and scalable data protection solution combining the best in backup and recovery with Backblaze’s streamlined and secure cloud storage.

—Subbiah Sundaram, SVP Product, HYCU, Inc.

The Data Sprawl Problem

On average, businesses and organizations have upwards of 200 different sets of data or “data silos” spread across a growing number of applications, databases, and physical locations. This data sprawl isn’t just hard to manage; it opens up more opportunities for cybercriminals to inject ransomware and gain access to systems. 

HYCU gives customers the power to protect every byte while also managing all their business critical data in one place. Powered by the world’s first development platform for data protection, HYCU is the only DPaaS platform that can scale to protect all of your data—wherever it resides. Most importantly, it gives customers the ability to recover from disaster almost instantly, keeping them online and in business, with an average recovery time of 10 minutes. 

Backblaze and HYCU:

Keeping data safe for all

at one-fifth the cost.

By combining HYCU data protection with the Backblaze B2 Storage Cloud, customers can see up to 80% lower costs in comparison to using providers like AWS for their storage, which means that combining the two can be a force multiplier for a business’s ability to fully protect its data and scale efficiently and reliably.

Data protection:

Once challenging, now easy—

HYCU and B2.

The partnership offers the following benefits:

  • Performance: With a 99.9% uptime service level agreement (SLA) and no cold delays or speed premiums, storing data in Backblaze B2 Cloud Storage means joint customers have instant access to their data whenever and wherever they need it. 
  • Affordability: Existing customers can reduce their total cost of ownership by switching backup tiers with interoperable S3 compatible storage, and institutions and businesses who may not have been able to afford hyperscaler-based solutions can now protect their data.
  • Compliance and Security: With Backblaze B2’s Object Lock feature, the partnership also offers an additional layer of security through data immutability, which protects backups from ransomware and satisfies evolving cyber insurance requirements.

These benefits can prove particularly useful for higher education institutions, schools, state and local governments, nonprofits, and others where maximizing tight budgets is always a priority.

What’s in a Name?

For the poetically minded among our readership (there must be a few of you, right?), you may have noticed a haiku or two above. And that’s not a coincidence.

The humble haiku inspired the name for HYCU. In true poetic fashion, the name serves more than one purpose—it’s also an acronym for “hyperconverged uptime,” making the fewest letters do the most, as they should.

Making Data Protection Easier

This partnership adds a powerful new data protection option for joint customers looking to affordably back up their data and establish a disaster recovery strategy. And, this is just the beginning. Stay tuned for more from this partnership, including integrations with HYCU’s other data protection offerings in the future. 

Interested in getting started? Learn more in our docs.

The post HYCU + Backblaze: Protecting Against Data Loss appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloud Storage Performance: The Metrics that Matter

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/cloud-storage-performance-the-metrics-that-matter/

A decorative image showing a cloud in the foreground and various mocked up graphs in the background.

Availability, time to first byte, throughput, durability—there are plenty of ways to measure “performance” when it comes to cloud storage. But, which measure is best and how should performance factor in when you’re choosing a cloud storage provider? Other than security and cost, performance is arguably the most important decision criterion, but it’s also the hardest dimension to clarify. It can be highly variable and depends on your own infrastructure, your workload, and all the network connections between your infrastructure and the cloud provider as well.

Today, I’m walking through how to think strategically about cloud storage performance, including which metrics matter and which may not be as important for you.

First, What’s Your Use Case?

The first thing to keep in mind is how you’re going to be using cloud storage. After all, performance requirements will vary from one use case to another. For instance, you may need greater performance in terms of latency if you’re using cloud storage to serve up software as a service (SaaS) content; however, if you’re using cloud storage to back up and archive data, throughput is probably more important for your purposes.

For something like application storage, you should also have other tools in your toolbox even when you are using hot, fast, public cloud storage, like the ability to cache content on edge servers, closer to end users, with a content delivery network (CDN).

Ultimately, you need to decide which cloud storage metrics are the most important to your organization. Performance is important, certainly, but security or cost may be weighted more heavily in your decision matrix.

A decorative image showing several icons representing different types of files on a grid over a cloud.

What Is Performant Cloud Storage?

Performance can be described using a number of different criteria, including:

  • Latency
  • Throughput
  • Availability
  • Durability

I’ll define each of these and talk a bit about what each means when you’re evaluating a given cloud storage provider and how they may affect upload and download speeds.

Latency

  • Latency is defined as the time between a client request and a server response. It quantifies the time it takes data to transfer across a network.  
  • Latency is primarily influenced by physical distance—the farther away the client is from the server, the longer it takes to complete the request. 
  • If you’re serving content to many geographically dispersed clients, you can use a CDN to reduce the latency they experience. 

Latency can be influenced by network congestion, security protocols on a network, or network infrastructure, but the primary cause is generally distance, as we noted above. 

Downstream latency is typically measured using time to first byte (TTFB). In the context of surfing the web, TTFB is the time between a page request and when the browser receives the first byte of information from the server. In other words, TTFB is measured by how long it takes between the start of the request and the start of the response, including DNS lookup and establishing the connection using a TCP handshake and TLS handshake if you’ve made the request over HTTPS.
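If you’d like a rough feel for TTFB from your own machine, here’s a small Python sketch using the requests library. The URL is a placeholder, and this is a client-side approximation rather than a rigorous benchmark.

```python
# Approximate time to first byte: time from issuing the request until the
# first byte of the response body arrives (includes DNS, TCP, and TLS setup
# on a fresh connection).
import time

import requests

url = "https://www.example.com/"  # placeholder URL

start = time.perf_counter()
with requests.get(url, stream=True) as response:
    next(response.iter_content(chunk_size=1))  # wait for the first body byte
ttfb_ms = (time.perf_counter() - start) * 1000
print(f"approximate TTFB: {ttfb_ms:.0f} ms")
```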

Let’s say you’re uploading data from California to a cloud storage data center in Sacramento. In that case, you’ll experience lower latency than if your business data is stored in, say, Ohio and has to make the cross-country trip. However, making the “right” decision about where to store your data isn’t quite as simple as that, and the complexity goes back to your use case. If you’re using cloud storage for off-site backup, you may want your data to be stored farther away from your organization to protect against natural disasters. In this case, performance is likely secondary to location—you only need fast enough performance to meet your backup schedule. 

Using a CDN to Improve Latency

If you’re using cloud storage to store active data, you can speed up performance by using a CDN. A CDN helps speed content delivery by caching content at the edge, meaning faster load times and reduced latency. 

Edge networks create “satellite servers” that are separate from your central data server, and CDNs leverage these to chart the fastest data delivery path to end users. 

Throughput

  • Throughput is a measure of the amount of data passing through a system at a given time.
  • If you have spare bandwidth, you can use multi-threading to improve throughput. 
  • Cloud storage providers’ architecture influences throughput, as do their policies around slowdowns (i.e. throttling).

Throughput is often confused with bandwidth. The two concepts are closely related, but different. 

To explain them, it’s helpful to use a metaphor: Imagine a swimming pool. The amount of water in it is your file size. When you want to drain the pool, you need a pipe. Bandwidth is the size of the pipe, and throughput is the rate at which water moves through the pipe successfully. So, bandwidth affects your ultimate throughput. Throughput is also influenced by processing power, packet loss, and network topology, but bandwidth is the main factor. 

Using Multi-Threading to Improve Throughput

Assuming you have some bandwidth to spare, one of the best ways to improve throughput is to enable multi-threading. Threads are units of execution within processes. When you transmit files using a program across a network, they are being communicated by threads. Using more than one thread (multi-threading) to transmit files is, not surprisingly, better and faster than using just one (although a greater number of threads will require more processing power and memory). To return to our water pipe analogy, multi-threading is like having multiple water pumps (threads) running to that same pipe. Maybe with one pump, you can only fill 10% of your pipe. But you can keep adding pumps until you reach pipe capacity.

When you’re using cloud storage with an integration like backup software or a network attached storage (NAS) device, the multi-threading setting is typically found in the integration’s settings. Many backup tools, like Veeam, are already set to multi-thread by default. Veeam automatically makes adjustments based on details like the number of individual backup jobs, or you can configure the number of threads manually. Other integrations, like Synology’s Cloud Sync, also give you granular control over threading so you can dial in your performance.  

A diagram showing single vs. multi-threaded processes.
Still confused about threads? Learn more in our deep dive, including what’s going on in this diagram.

That said, the gains from increasing the number of threads are limited by the available bandwidth, processing power, and memory. Finding the right setting can involve some trial and error, but the improvements can be substantial (as we discovered when we compared download speeds on different Python versions using single vs. multi-threading).
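As a rough illustration, here’s what multi-threaded uploads can look like using Python’s standard library with the AWS SDK (Boto3). The bucket name, file paths, and thread count are placeholders you’d tune against your own bandwidth, processing power, and memory.

```python
# A hedged sketch of multi-threaded uploads: several worker threads push files
# in parallel instead of one at a time.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

s3 = boto3.client("s3")      # endpoint and credentials come from your config
bucket = "my-backup-bucket"  # placeholder bucket name
files = list(Path("backups").glob("*.dat"))


def upload(path: Path) -> str:
    s3.upload_file(str(path), bucket, path.name)
    return path.name


with ThreadPoolExecutor(max_workers=8) as pool:
    for name in pool.map(upload, files):
        print(f"uploaded {name}")
```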

What About Throttling?

One question you’ll absolutely want to ask when you’re choosing a cloud storage provider is whether they throttle traffic. That means they deliberately slow down your connection for various reasons. Shameless plug here: Backblaze does not throttle, so customers are able to take advantage of all their bandwidth while uploading to B2 Cloud Storage. Amazon and many other public cloud services do throttle.

Upload Speed and Download Speed

Your ultimate upload and download speeds will be affected by throughput and latency. Again, it’s important to consider your use case when determining which performance measure is most important for you. Latency is important to application storage use cases where things like how fast a website loads can make or break a potential SaaS customer. With latency being primarily influenced by distance, it can be further optimized with the help of a CDN. Throughput is often the measurement that’s more important to backup and archive customers because it is indicative of the upload and download speeds an end user will experience, and it can be influenced by cloud storage provider practices, like throttling.   

Availability

  • Availability is the percentage of time a cloud service or a resource is functioning correctly.
  • Make sure the availability listed in the cloud provider’s service level agreement (SLA) matches your needs. 
  • Keep in mind the difference between hot and cold storage—cold storage services like Amazon Glacier offer slower retrieval and response times.

Also called uptime, this metric measures the percentage of time that a cloud service or resource is available and functioning correctly. It’s usually expressed as a percentage, with 99.9% (three nines) or 99.99% (four nines) availability being common targets for critical services. Availability is often backed by SLAs that define the uptime customers can expect and what happens if availability falls below that metric. 

You’ll also want to consider availability if you’re considering whether you want to store in cold storage versus hot storage. Cold storage is lower performing by design. It prioritizes durability and cost-effectiveness over availability. Services like Amazon Glacier and Google Coldline take this approach, offering slower retrieval and response times than their hot storage counterparts. While cost savings is typically a big factor when it comes to considering cold storage, keep in mind that if you do need to retrieve your data, it will take much longer (potentially days instead of seconds), and speeding that up at all is still going to cost you. You may end up paying more to get your data back faster, and you should also be aware of the exorbitant egress fees and minimum storage duration requirements for cold storage—unexpected costs that can easily add up. 

                   Cold                          Hot
Access Speed       Slow                          Fast
Access Frequency   Seldom or Never               Frequent
Data Volume        Low                           High
Storage Media      Slower drives, LTO, offline   Faster drives, durable drives, SSDs
Cost               Lower                         Higher

Durability

  • Durability is the ability of a storage system to consistently preserve data.
  • Durability is measured in “nines” or the probability that your data is retrievable after one year of storage. 
  • We designed the Backblaze B2 Storage Cloud for 11 nines of durability using erasure coding.

Data durability refers to the ability of a data storage system to reliably and consistently preserve data over time, even in the face of hardware failures, errors, or unforeseen issues. It is a measure of data’s long-term resilience and permanence. Highly durable data storage systems ensure that data remains intact and accessible, meeting reliability and availability expectations, making it a fundamental consideration for critical applications and data management.

We usually measure durability—or, more precisely, annual durability—in “nines,” referring to the number of nines in the probability (expressed as a percentage) that your data is retrievable after one year of storage. We know from our work on Drive Stats that an annual failure rate of 1% is typical for a hard drive. So, if you were to store your data on a single drive, its durability, the probability that it would not fail, would be 99%, or two nines.

The very simplest way of improving durability is to simply replicate data across multiple drives. If a file is lost, you still have the remaining copies. It’s also simple to calculate the durability with this approach. If you write each file to two drives, you lose data only if both drives fail. We calculate the probability of both drives failing by multiplying the probabilities of either drive failing, 0.01 x 0.01 = 0.0001, giving a durability of 99.99%, or four nines. While simple, this approach is costly—it incurs a 100% overhead in the amount of storage required to deliver four nines of durability.
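Here’s that replication arithmetic as a quick sketch—a naive model that ignores rebuilds and correlated failures, but it shows how quickly the nines (and the storage overhead) stack up:

```python
# Naive replication durability: data is lost only if every copy fails, so
# durability is 1 - p**copies for an annual per-drive failure probability p.
def replication_durability(annual_failure_rate: float, copies: int) -> float:
    return 1 - annual_failure_rate ** copies


for copies in (1, 2, 3):
    durability = replication_durability(0.01, copies)
    overhead = (copies - 1) * 100
    print(f"{copies} copy/copies: {durability:.6%} durable, {overhead}% overhead")
# 1 -> 99.000000% (two nines), 2 -> 99.990000% (four nines),
# 3 -> 99.999900% (six nines) at 200% storage overhead.
```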

Erasure coding is a more sophisticated technique, improving durability with much less overhead than simple replication. An erasure code takes a “message,” such as a data file, and makes a longer message in a way that the original can be reconstructed from the longer message even if parts of the longer message have been lost. 

A decorative image showing the matrices that get multiplied to allow Reed-Solomon code to re-create files.
A representation of Reed-Solomon erasure coding, with some very cool Storage Pods in the background.

The durability calculation for this approach is much more complex than for replication, since it involves the time required to replace and rebuild failed drives as well as the probability that a drive will fail. Still, we calculated that, by taking advantage of erasure coding, we could design the Backblaze B2 Storage Cloud for 11 nines of durability with just 25% overhead in the amount of storage required. 

How does this work? Briefly, when we store a file, we split it into 16 equal-sized pieces, or shards. We then calculate four more shards, called parity shards, in such a way that the original file can be reconstructed from any 16 of the 20 shards. We then store the resulting 20 shards on 20 different drives, each in a separate Storage Pod (storage server).

Note: As hard disk drive capacity increases, so does the time required to recover after a drive failure, so we periodically adjust the ratio between data shards and parity shards to maintain our eleven nines durability target. Consequently, our very newest vaults use a 15+5 scheme.
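To get a feel for the math, here's a deliberately simplified sketch in Python. It's not our production durability model (the real one accounts for rebuild times, drive replacement, and correlated failures), but it shows the core idea: treat each shard as an independent drive with some probability of being lost during a given window, and data is lost only if more shards fail than you have parity shards. Plugging in one data shard and one parity shard reproduces the two-drive replication math above.

from math import comb

def durability(n_total=20, n_data=16, p_shard_loss=0.01):
    # Data survives unless more than (n_total - n_data) shards are lost.
    max_lost = n_total - n_data
    p_data_loss = sum(
        comb(n_total, k) * p_shard_loss**k * (1 - p_shard_loss)**(n_total - k)
        for k in range(max_lost + 1, n_total + 1)
    )
    return 1 - p_data_loss

# 16+4 scheme, charging each shard a 1% chance of loss over a full year
# (this ignores rebuilds, so it badly understates real-world durability):
print(f"{durability(20, 16, 0.01):.9f}")

# Two-drive replication is the 1-of-2 case and reproduces the 99.99% figure:
print(f"{durability(2, 1, 0.01):.4f}")

Because this toy model charges each shard a full year's worth of failure probability instead of just the days it takes to rebuild a failed drive, it lands well short of 11 nines; shrinking that window is where most of the extra durability comes from.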

If a drive does fail, it can be replaced with a new drive, and its data rebuilt from the remaining good drives. We open sourced our implementation of Reed-Solomon erasure coding, so you can dive into the source code for more details.

Additional Factors Impacting Cloud Storage Performance

In addition to bandwidth and latency, there are a few additional factors that impact cloud storage performance, including:

  • The size of your files.
  • The number of files you upload or download.
  • Block (part) size.
  • The amount of available memory on your machine. 

Small files—that is, those less than 5GB—can be uploaded in a single API call. Larger files, from 5MB to 10TB, can be uploaded as “parts”, in multiple API calls. You’ll notice that there is quite an overlap here! For uploading files between 5MB and 5GB, is it better to upload them in a single API call, or split them into parts? What is the optimum part size? For backup applications, which typically split all data into equal-sized blocks, storing each block as a file, what is the optimum block size? As with many questions, the answer is that it depends.

Remember latency? Each API call incurs a more-or-less fixed overhead due to latency. For a 1GB file, assuming a single thread of execution, uploading all 1GB in a single API call will be faster than ten API calls each uploading a 100MB part, since those additional nine API calls each incur some latency overhead. So, bigger is better, right?

Not necessarily. Multi-threading, as mentioned above, affords us the opportunity to upload multiple parts simultaneously, which improves performance—but there are trade-offs. Typically, each part must be stored in memory as it is uploaded, so more threads means more memory consumption. If the number of threads multiplied by the part size exceeds available memory, then either the application will fail with an out of memory error, or data will be swapped to disk, reducing performance.

Downloading data offers even more flexibility, since applications can specify any portion of the file to download in each API call. Whether uploading or downloading, there is a maximum number of threads that will drive throughput to consume all of the available bandwidth. Exceeding this maximum will consume more memory, but provide no performance benefit. If you go back to our pipe analogy, you’ll have reached the maximum capacity of the pipe, so adding more pumps won’t make things move faster. 

So, what to do to get the best performance possible for your use case? Simple: customize your settings. 

Most backup and file transfer tools allow you to configure the number of threads and the amount of data to be transferred per API call, whether that’s block size or part size. If you are writing your own application, you should allow for these parameters to be configured. When it comes to deployment, some experimentation may be required to achieve maximum throughput given available memory.
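As an illustration, here's a minimal sketch of what tuning those knobs can look like using boto3 against an S3 compatible endpoint. The endpoint URL, bucket name, and credentials are placeholders, and the part size and thread count shown are starting points to experiment with, not recommendations.

import boto3
from boto3.s3.transfer import TransferConfig

# Placeholder endpoint and credentials; substitute your own values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",
    aws_access_key_id="YOUR_KEY_ID",
    aws_secret_access_key="YOUR_APPLICATION_KEY",
)

# Tune part size and concurrency together; the best values depend on your
# available memory and bandwidth, so expect to experiment.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # switch to multipart above 100MB
    multipart_chunksize=100 * 1024 * 1024,  # 100MB parts
    max_concurrency=8,                      # eight upload threads
    use_threads=True,
)

s3.upload_file("backup.tar", "my-bucket", "backup.tar", Config=config)

Roughly speaking, the number of concurrent parts multiplied by the part size gives you a feel for the extra memory and buffering in flight, which is why the two settings need to be tuned together.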

How to Evaluate Cloud Performance

To sum up, the cloud is increasingly becoming a cornerstone of every company’s tech stack. Gartner predicts that by 2026, 75% of organizations will adopt a digital transformation model predicated on cloud as the fundamental underlying platform. So, cloud storage performance will likely be a consideration for your company in the next few years if it isn’t already.

It’s important to consider that cloud storage performance can be highly subjective and heavily influenced by things like use case considerations (i.e. backup and archive versus application storage, media workflow, or another), end user bandwidth and throughput, file size, block size, etc. Any evaluation of cloud performance should take these factors into account rather than simply relying on metrics in isolation. And, a holistic cloud strategy will likely have multiple operational schemas to optimize resources for different use cases.

Wait, Aren’t You, Backblaze, a Cloud Storage Company?

Why, yes. Thank you for noticing. We ARE a cloud storage company, and we OFTEN get questions about all of the topics above. In fact, that’s why we put this guide together—our customers and prospects are the best sources of content ideas we can think of. Circling back to the beginning, it bears repeating that performance is one factor to consider in addition to security and cost. (And, hey, we would be remiss not to mention that we’re also one-fifth the cost of AWS S3.) Ultimately, whether you choose Backblaze B2 Cloud Storage or not though, we hope the information is useful to you. Let us know if there’s anything we missed.

The post Cloud Storage Performance: The Metrics that Matter appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Load Balancing 2.0: What’s Changed After 7 Years?

Post Syndicated from nathaniel wagner original https://www.backblaze.com/blog/load-balancing-2-0-whats-changed-after-7-years/

A decorative image showing a file, a drive, and some servers.

What do two billion transactions a day look like? Well, the data may be invisible to the naked eye, but the math breaks down to just over 23,000 transactions every second. (Shout out to Kris Allen for burning into my memory that there are 86,400 seconds in a day and my handy calculator!) 

Part of my job as a Site Reliability Engineer (SRE) at Backblaze is making sure that greater than 99.9% of those transactions are what we consider “OK” (via status code), and part of the fun is digging for the needle in the haystack of over 7,000 production servers and over 250,000 spinning hard drives to try and understand how all of the different transactions interact with the pieces of infrastructure. 

In this blog post, I’m going to pick up where Principal SRE Elliott Sims left off in his 2016 article on load balancing. You’ll notice that the design principles we’ve employed are largely the same (cool!). So, I’ll review our stance on those principles, then talk about how the evolution of the B2 Cloud Storage platform—including the introduction of the S3 Compatible API—has changed our load balancing architecture. Read on for specifics. 

Editor’s Note

We know there are a ton of specialized networking terms flying around this article, and one of our primary goals is to make technical content accessible to all readers, regardless of their technical background. To that end, we’ve used footnotes to add some definitions and minimize the disruption to your reading experience.

What Is Load Balancing?

Load balancing is the process of distributing traffic across a network. It helps with resource utilization, prevents overloading any one server, and makes your system more reliable. Load balancers also monitor server health and redirect requests to the most suitable server.

With two billion requests per day to our servers, you can be assured that we use load balancers at Backblaze. Whenever anyone—a Backblaze Computer Backup or a B2 Cloud Storage customer—wants to upload or download data or modify or interact with their files, a load balancer will be there to direct traffic to the right server. Think of them as your trusty mail service, delivering your letters and packages to the correct destination—and using things like zip codes and addresses to interpret your request and get things to the right place.  

How Do We Do It?

We build our own load balancers using open-source tools. We use layer 4 load balancing with direct server response (DSR). Here are some of the resources that we call on to make that happen:  

  • Border Gateway Protocol (BGP), a standardized exterior gateway protocol that exchanges routing and reachability information on the internet. We use it together with the routing capabilities of the Linux kernel1.   
  • keepalived, open source routing software. keepalived keeps track of all of our VIPs2 and IPs3 for each backend server. 
  • Hard disk drives (HDDs). We use the same drives that we use for other API servers and whatnot, but that’s definitely overkill—we made that choice to save the work of sourcing another type of device.  
  • A lot of hard work by a lot of really smart folks.

What We Mean by Layers

When we’re talking about layers in load balancing, it’s shorthand for how deep into the architecture your program needs to see. Here’s a great diagram that defines those layers: 

An image describing application layers.
Source.

DSR takes place at layer 4, and it solves a problem inherent in the full proxy4 method: with DSR, the backend server can still see the original client's IP address.

Why Do We Do It the Way We Do It?

Building our own load balancers, instead of buying an off-the-shelf solution, means that we have more control and insight, more cost-effective hardware, and more scalable architecture. In general, DSR is more complicated to set up and maintain, but this method also lets us handle lots of traffic with minimal hardware and supports our goal of keeping data encrypted, even within our own data center. 

What Hasn’t Changed

We’re still using a layer 4 DSR approach to load balancing, which we’ll explain below. For reference, other common methods of load balancing are layer 7, full proxy and layer 4, full proxy load balancing. 

First, I’ll explain how DSR works. DSR load balancing requires two things:

  1. A load balancer with the VIP address attached to an external NIC5 and ARPing6, so that the rest of the network knows it “owns” the IP.
  2. Two or more servers on the same layer 2 network that also have the VIP address attached to a NIC, either internal or external, but are not replying to ARP requests about that address. This means that no other servers on the network know that the VIP exists anywhere but on the load balancer.

A request packet will enter the network, and be routed to the load balancer. Once it arrives there, the load balancer leaves the source and destination IP addresses intact and instead modifies the destination MAC7 address to that of a server, then puts the packet back on the network. The network switch only understands MAC addresses, so it forwards the packet on to the correct server.

A diagram of how a packet moves through the network router and load balancer to reach the server, then respond to the original client request.

When the packet arrives at the server's network interface, it checks to make sure the destination MAC address matches its own. The address matches, so the server accepts the packet. The server network interface then, separately, checks to see whether the destination IP address is one attached to it somehow. That's a yes, even though the rest of the network doesn't know it, so the server accepts the packet and passes it on to the application. The application then sends a response with the VIP as the source IP address and the client as the destination IP, so the response is routed directly back to the client without passing back through the load balancer.

So, What’s Changed?

Lots of things. But, since we first wrote this article, we’ve expanded our offerings and platform. The biggest of these changes (as far as load balancing is concerned) is that we added the S3 Compatible API. 

We also serve a much more diverse set of clients, both in the size of files they have and their access patterns. File sizes affect how long it takes us to serve requests (larger files = more time to upload or download, which means an individual server is tied up for longer). Access patterns can vastly increase the amount of requests a server has to process on a regular, but not consistent basis (which means you might have times that your network is more or less idle, and you have to optimize appropriately). 

A definitely Photoshopped images showing a giraffe riding an elephant on a rope in the sky. The rope's anchor points disappear into clouds.
So, if we were to update this amazing image from the first article, we might have a tightrope walker with a balancing pole on top of the giraffe, plus some birds flying on a collision course with the elephant.

Where You Can See the Changes: ECMP, Volume of Data Per Month, and API Processing

DSR is how we send data to the customer—the server responds (sends data) directly to the request (client). This is the equivalent of going to the post office to mail something, but adding your home address as the return address (so that you don’t have to go back to the post office to get your reply).   

Given how our platform has evolved over the years, things might happen slightly differently. Let’s dig in to some of the details that affect how the load balancers make their decisions—what rules govern how they route traffic, and how different types of requests cause them to behave differently. We’ll look at:

  • Equal cost multipath routing (ECMP). 
  • Volume of data in petabytes (PBs) per month.
  • APIs and processing costs.

ECMP

One thing that’s not explicitly outlined above is how the load balancer determines which server should respond to a request. At Backblaze, we use stateless load balancing, which means that the load balancer doesn’t take into account most information about the servers it routes to. We use a round robin approach—i.e. the load balancers choose between one of a few hosts, in order, each time they’re assigning a request. 

We also use Maglev, so the load balancers use consistent hashing and connection tracking. This means that we're minimizing the negative impact of unexpected failures on connection-oriented protocols. If a load balancer goes down, its server pool can be transferred to another, which will make decisions in the same way, seamlessly picking up the load. When the initial load balancer comes back online, it already has a connection to its load balancer friend and can pick up where it left off. 

The upside is that it’s super rare to see a disruption, and it essentially only happens when the load balancer and the neighbor host go down in a short period of time. The downside is that the load balancer decision is static. If you have “better” servers for one reason or another—they’re newer, for instance—they don’t take that information into account. On the other hand, we do have the ability to push more traffic to specific servers through ECMP weights if we need to, which means that we have good control over a diverse fleet of hardware. 
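To make the hashing idea concrete, here's a toy sketch of stateless, hash-based backend selection. It is not our actual implementation (Maglev's consistent hashing is specifically designed so that adding or removing a backend disturbs as few flows as possible, which plain modulo hashing doesn't guarantee), but it shows the property we rely on: any load balancer holding the same backend list maps the same flow to the same server.

import hashlib

# Hypothetical backend hostnames, purely for illustration.
BACKENDS = ["api-01", "api-02", "api-03", "api-04"]

def pick_backend(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    # Hash the flow's addresses and ports so the same flow always lands
    # on the same backend, no matter which load balancer does the math.
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(flow).digest()[:8], "big")
    return BACKENDS[digest % len(BACKENDS)]

print(pick_backend("203.0.113.7", 51515, "198.51.100.10", 443))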

Volume of Data

Backblaze now has over three exabytes of storage under management. Based on the scalability of the network design, that doesn’t really make a huge amount of difference when you’re scaling your infrastructure properly. What can make a difference is how people store and access their data. 

Most of the things that make managing large datasets difficult from an architecture perspective can also be silly for a client. For example: querying files individually (creating lots of requests) instead of batching or creating a range request. (There may be a business reason to do that, but usually, it makes more sense to batch requests.)

On the other hand, some things that make sense for how clients need to store data require architecture realignment. One of those is just a sheer fluctuation of data by volume—if you’re adding and deleting large amounts of data (we’re talking hundreds of terabytes or more) on a “shorter” cycle (monthly or less), then there will be a measurable impact. And, with more data stored, you have the potential for more transactions.

Similarly, if you need to retrieve data often, but not regularly, there are potential performance impacts. Most of them are related to caching, and ironically, they can actually improve performance. The more you query the same “set” of servers for the same file, the more likely that each server in the group will have cached your data locally (which means they can serve it more quickly). 

And, as with most data centers, we store our long term data on hard disk drives (HDDs), whereas our API servers are on solid state drives (SSDs). There are positives and negatives to each type of drive, but the performance impact is that data at rest takes longer to retrieve, and data in the cache is on a faster SSD drive on the API server. 

On the other hand, the more servers the data center has, the lower the chance that any given server can or will deliver cached data. And, of course, if you're replacing large volumes of old data with new on a shorter timeline, then you won't see the benefits. It sounds like an edge case, but industries like security camera data are a great example. While these customers don't retrieve their data very frequently, they are constantly uploading and overwriting it, often to meet industry requirements about retention periods. That churn makes it challenging to allocate a finite amount of input/output operations per second (IOPS) across uploads, downloads, and deletes.  

That said, the built-in benefit of our system is that adding another load balancer is (relatively) cheap. If we’re experiencing a processing chokepoint for whatever reason—typically either a CPU bottleneck or from throughput on the NIC—we can add another load balancer, and just like that, we can start to see bits flying through the new load balancer and traffic being routed amongst more hosts, alleviating the choke points. 

APIs and Processing Costs

We mentioned above that one of the biggest changes to our platform was the addition of the S3 Compatible API. When all requests were made through the B2 Native API, the Backblaze CLI tool, or the web UI, the processing cost was relatively cheap. 

That’s because of the way our upload requests to the Backblaze Vaults are structured. When you make an upload request via the Native API, there are actually two transactions, one to get an upload URL, and the second to send a request to the Vault. And, all other types of requests (besides upload) have always had to be processed through our load balancers. Since the S3 Compatible API is a single request, we knew we would have to add more processing power and load balancers. (If you want to go back to 2018 and see some of the reasons why, here’s Brian Wilson on the subject—read with the caveat that our current Tech Doc on the subject outlines how we solve the complications he points out.) 

We’re still leveraging DSR to respond directly to the client, but we’ve significantly increased the amount of transactions that hit our load balancers, both because it has to take on more of the processing during transit and because, well, lots of folks like to use the S3 Compatible API and our customer base has grown by a quite a bit since 2018. 

And, just like above, we’ve set ourselves up for a relatively painless fix: we can add another load balancer to solve most problems. 

Do We Have More Complexity?

This is the million dollar question, solved for a dollar: how could we not? Since our first load balancing article, we’ve added features, complexity, and lots of customers. Load balancing algorithms are inherently complex, but we (mostly Elliott and other smart people) have taken a lot of time and consideration to not just build a system that will scale up to and past two billion transactions a day, but one that can be fairly “easily” explained and doesn’t require a graduate degree to understand what is happening.  

But, we knew it was important early on, so we prioritized building a system where we could “just” add another load balancer. The thinking is more complicated at the outset, but the tradeoff is that it’s simple once you’ve designed the system. It would take a lot for us to outgrow the usefulness of this strategy—but hey, we might get there someday. When we do, we’ll write you another article. 

Footnotes

  1. Kernel: A kernel is the computer program at the core of a computer’s operating system. It has control of the system and does things like run processes, manage hardware devices like the hard drive, and handle interrupts, as well as memory and input/output (I/O) requests from software, translating them to instructions for the central processing unit (CPU). ↩
  2. Virtual IP address (VIP): An IP address that isn't tied to a single physical machine or network interface. ↩
  3. Internet protocol (IP) address: The global logical address of a device, used to identify it on the internet. Can be changed.  ↩
  4. Proxy: In networking, a proxy is a server application that validates outside requests to a network. Think of them as a gatekeeper. There are several common types of proxies you interact with all the time—HTTPS requests on the internet, for example. ↩
  5. Network interface controller, or network interface card (NIC): This connects the computer to the network. ↩
  6. Address resolution protocol (ARP): The protocol a device uses to map an IP address to a MAC address on the local network. There are four types. ↩
  7. Media access control (MAC): The local physical address of a device, used to identify devices on the same network. Hardcoded into the device. ↩

The post Load Balancing 2.0: What’s Changed After 7 Years? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

NAS Performance Guide: How To Optimize Your NAS

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/nas-performance-guide-how-to-optimize-your-nas/

A decorative images showing a 12 bay NAS device connecting to the cloud.

Upgrading to a network attached storage (NAS) device puts your data in the digital fast lane. If you’re using one, it’s likely because you want to keep your data close to you, ensuring quick access whenever it’s needed. NAS devices, acting as centralized storage systems connected to local networks, offer a convenient way to access data in just a few clicks. 

However, as the volume of data on the NAS increases, its performance can tank. You need to know how to keep your NAS operating at its best, especially with growing data demand. 

In this blog, you’ll learn about various factors that can affect NAS performance, as well as practical steps you can take to address these issues, ensuring optimal speed, reliability, and longevity for your NAS device.

Why NAS Performance Matters

NAS devices can function as extended hard disks, virtual file cabinets, or centralized local storage solutions, depending on individual needs. 

While NAS offers a convenient way to store data locally, storing the data alone isn’t enough. How quickly and reliably you can access your data can make all the difference if you want an efficient workflow. For example, imagine working on a critical project with your team and facing slow file transfers, or streaming a video on a Zoom call only for it to stutter or buffer continuously.

All these can be a direct result of NAS performance issues, and an increase in stored data can directly undermine the device’s performance. Therefore, ensuring optimal performance isn’t just a technical concern, it’s also a concern that directly affects user experience, productivity, and collaboration. 

So, let’s talk about what could potentially cause performance issues and how to enhance your NAS. 

Common NAS Performance Issues

NAS performance can be influenced by a variety of factors. Here are some of the most common factors that can impact the performance of a NAS device.

Hardware Limitations:

  • Insufficient RAM: Especially in tasks like media streaming or handling large files, having inadequate memory can slow down operations. 
  • Slow CPU: An underpowered processor can become a bottleneck when multiple users access the NAS at once or during collaboration with team members. 
  • Drive Speed and Type: Hard disk drives (HDDs) are generally slower than solid state drives (SSDs), and your NAS can have either type. If your NAS mainly serves as a hub for storing and sharing files, a conventional HDD should meet your requirements. However, if you need more speed, SSDs deliver the performance boost. 
  • Outdated Hardware: Older NAS models might not be equipped to handle modern data demands or the latest software.

Software Limitations:

  • Outdated Firmware/Software: Not updating to the latest firmware or software can lead to performance issues, or to missing out on optimization and security features.
  • Misconfigured Settings: Incorrect settings can impact performance. This includes improper RAID configuration or network settings. 
  • Background Processes: Certain background tasks, like indexing or backups, can also slow down the system when running.

Network Challenges: 

  • Bandwidth Limitations: A slow network connection, especially on a Wi-Fi network can limit data transfer rates. 
  • Network Traffic: High traffic on the network can cause congestion, reducing the speed at which data can be accessed or transferred.

Disk Health and Configuration:

  • Disk Failures: A failing disk in the NAS can slow down performance and also poses data loss risk.
  • Suboptimal RAID Configuration: Some RAID configurations prioritize redundancy more than performance, which can affect the data storage and access speeds. 

External Factors:

  • Simultaneous User Access: If multiple users are accessing, reading, or writing to the NAS simultaneously, it can strain the system, especially if the hardware isn't optimized for that level of traffic. 
  • Inadequate Power Supply: Fluctuating or inadequate power can cause the NAS to malfunction or reduce its performance.
  • Operating Temperature: Additionally, if the NAS is in a hot environment, it might overheat and impact the performance of the device.

Practical Solutions for Optimizing NAS Performance

Understanding the common performance issues with NAS devices is the first critical step. However, simply identifying these issues alone isn’t enough. It’s vital to understand practical ways to optimize your existing NAS setup so you can enhance its speed, efficiency, and reliability. Let’s explore how to optimize your NAS. 

Performance Enhancement 1: Upgrading Hardware

There are a few different things you can do on a hardware level to enhance NAS performance. First, adding more RAM can significantly improve performance, especially if multiple tasks or users are accessing the NAS simultaneously. 

You can also consider switching to SSDs. While they can be more expensive, SSDs offer faster read/write speeds than traditional HDDs because they store data in flash memory with no moving parts. 

Finally, you could upgrade the CPU. For NAS devices that support it, a more powerful CPU can better handle multiple simultaneous requests and complex tasks. 

Performance Enhancement 2: Optimizing Software Configuration

Remember to always keep your NAS operating system and software up-to-date to benefit from the latest performance optimizations and security patches. Schedule tasks like indexing, backups or antivirus scans during off-peak hours to ensure they don’t impact user access during high-traffic times. You also need to make sure you’re using the right RAID configuration for your needs. RAID 5 or RAID 6, for example, can offer a good balance between redundancy and performance.

Performance Enhancement 3: Network Enhancements

Consider moving to faster network protocols, like 10Gb ethernet, or ensuring that your router and switches can handle high traffic. Wherever possible, use wired connections instead of Wi-Fi to connect to the NAS for more stable and faster data access and transfer. And, regularly review and adjust network settings for optimal performance. If you can, it also helps to limit simultaneous access. If possible, manage peak loads by setting up access priorities.

Performance Enhancement 4: Regular Maintenance

Use your NAS device’s built-in tools or third-party software to monitor the health of your disks and replace any that show signs of failure. And, keep the physical environment around your NAS device clean, cool, and well ventilated to prevent overheating. 

Leveraging the Cloud for NAS Optimization

After taking the necessary steps to optimize your NAS for improved performance and reliability, it’s worth considering leveraging the cloud to further enhance the performance. While NAS offers convenient local storage, it can sometimes fall short when it comes to scalability, accessibility from different locations, and seamless collaboration. Here’s where cloud storage comes into play. 

At its core, cloud storage is a service model in which data is maintained, managed, and backed up remotely, and made available to users over the internet. Instead of relying solely on local storage solutions such as NAS or a server, you utilize the vast infrastructure of data centers across the globe to store your data not just in one physical location, but across multiple secure and redundant environments. 

As an off-site storage solution for NAS, the cloud not only completes your 3-2-1 backup plan, but can also amplify its performance. Let’s take a look at how integrating cloud storage can help optimize your NAS.

  • Off-Loading and Archiving: One of the most straightforward approaches is to move infrequently accessed or archival data from the NAS to the cloud. This frees up space on the NAS, ensuring it runs smoothly, while optimizing the NAS by only keeping data that’s frequently accessed or essential. 
  • Caching: Some advanced NAS systems can cache frequently accessed data in the cloud. This means that the most commonly used data can be quickly retrieved, enhancing user experience and reducing the load on the NAS device. 
  • Redundancy and Disaster Recovery: Instead of duplicating data on multiple NAS devices for redundancy, which can be costly and still vulnerable to local disasters, the data can be backed up to the cloud. In case of NAS failure or catastrophic event, the data can be quickly restored from the cloud, ensuring minimal downtime. 
  • Remote Access and Collaboration: While NAS devices can offer remote access, integrating them with cloud storage can streamline this process, often offering a more user-friendly interface and better speeds. This is especially useful for collaborative environments where multiple users work together on files and projects. 
  • Scaling Without Hardware Constraints: As your data volume grows, expanding a NAS can involve purchasing additional drives or even new devices. With cloud integration, you can expand your storage capacity without these immediate hardware investments, eliminating or delaying the need for physical upgrades and extending the lifespan of your NAS. 

In essence, integrating cloud storage solutions with your NAS can create a comprehensive system that addresses the shortcomings of NAS devices, helping you create a hybrid setup that offers the best of both worlds: the speed and accessibility of local storage, and the flexibility and scalability of the cloud. 

Getting the Best From Your NAS

At its core, NAS offers an unparalleled convenience of localized storage. However, it’s not without challenges, especially when performance issues come into play. Addressing these challenges requires a blend of hardware optimization, software updates, and smart data management settings. 

But, it doesn’t have to stop at your local network. Cloud storage can be leveraged effectively to optimize your NAS. It doesn’t just act as a safety net by storing your NAS data off-site, it also makes collaboration easier with dispersed teams and further optimizes NAS performance. 

Now, it’s time to hear from you. Have you encountered any NAS performance issues? What measures have you taken to optimize your NAS? Share your experiences and insights in the comments below. 

The post NAS Performance Guide: How To Optimize Your NAS appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Overload to Overhaul: How We Upgraded Drive Stats Data

Post Syndicated from David Winings original https://www.backblaze.com/blog/overload-to-overhaul-how-we-upgraded-drive-stats-data/

A decorative image showing the words "overload to overhaul: how we upgraded Drive Stats data."

This year, we’re celebrating 10 years of Drive Stats. Coincidentally, we also made some upgrades to how we run our Drive Stats reports. We reported on how an attempt to migrate triggered a weeks-long recalculation of the dataset, leading us to map the architecture of the Drive Stats data. 

This follow-up article focuses on the improvements we made after we fixed the existing bug (because hey, we were already in there), and then presents some of our ideas for future improvements. Remember that those are just ideas so far—they may not be live in a month (or ever?), but consider them good food for thought, and know that we’re paying attention so that we can pass this info along to the right people.

Now, onto the fun stuff. 

Quick Refresh: Drive Stats Data Architecture

The podstats generator runs on every Storage Pod, what we call any host that holds customer data, every few minutes. It’s a C++ program that collects SMART stats and a few other attributes, then converts them into an .xml file (“podstats”). Those are then pushed to a central host in each datacenter and bundled. Once the data leaves these central hosts, it has entered the domain of what we will call Drive Stats.  

Now let’s go into a little more detail: when you’re gathering stats about drives, you’re running a set of modules with dependencies to other modules, forming a data-dependency tree. Each time a module “runs”, it takes information, modifies it, and writes it to a disk. As you run each module, the data will be transformed sequentially. And, once a quarter, we run a special module that collects all the attributes for our Drive Stats reports, collecting data all the way down the tree. 

Here’s a truncated diagram of the whole system, to give you an idea of what the logic looks like:

A diagram of the mapped logic of the Drive Stats modules.
An abbreviated logic map of Drive Stats modules.

As you move down through the module layers, the logic gets more and more specialized. When you run a module, the first thing the module does is check in with the previous module to make sure the data exists and is current. It caches the data to disk at every step, and fills out the logic tree step by step. So for example, drive_stats, being a “per-day” module, will write out a file such as /data/drive_stats/2023-01-01.json.gz when it finishes processing. This lets future modules read that file to avoid repeating work.

This work deduplication process saves us a lot of time overall—but it also turned out to be the root cause of our weeks-long process when we were migrating Drive Stats to our new host. We fixed that by adding version numbers to each module.  
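For a sense of what that pattern looks like, here's a hypothetical, heavily simplified sketch (not our actual module code) of the check-the-cache-then-compute flow with a version stamp, so that results produced by an older version of a module get recomputed instead of silently reused.

import gzip
import json
import os

MODULE_VERSION = 3  # bump this when the module's logic changes

def run_module(day: str, compute_fn, cache_dir: str = "/data/drive_stats"):
    """Return the module's output for one day, reusing a cached result
    only if it exists and was produced by the current module version."""
    path = os.path.join(cache_dir, f"{day}.json.gz")
    if os.path.exists(path):
        with gzip.open(path, "rt") as f:
            cached = json.load(f)
        if cached.get("version") == MODULE_VERSION:
            return cached["data"]  # cache hit: skip the work entirely
    result = {"version": MODULE_VERSION, "data": compute_fn(day)}
    with gzip.open(path, "wt") as f:
        json.dump(result, f)  # cache the fresh result for future runs
    return result["data"]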

While You’re There… Why Not Upgrade?

Once the dust from the bug fix had settled, we moved forward to try to modernize Drive Stats in general. Our daily report still ran quite slowly, on the order of several hours, and there was some low-hanging fruit to chase.

Waiting On You, failures_with_stats

First things first, we saved a log of a run of our daily reports in Jenkins. Then we wrote an analyzer to see which modules were taking a lot of time. failures_with_stats was our biggest offender, running for about two hours, while every other module took about 15 minutes.

An image showing runtimes for each module when running a Drive Stats report.
Not quite two hours.

Upon investigation, the time cost had to do with how the date_range module works. This takes us back to caching: our module checks if the file has been written already, and if it has, it uses the cached file. However, a date range is written to a single file. That is, Drive Stats will recognize “Monday to Wednesday” as distinct from “Monday to Thursday” and re-calculate the entire range. This is a problem for a workload that is essentially doing work for all of time, every day.  

On top of this, the raw Drive Stats data, which is a dependency for failures_with_stats, would be gzipped onto a disk. When each new query triggered a request to recalculate all-time data, each dependency would pick up the podstats file from disk, decompress it, read it into memory, and do that for every day of all time. We were picking up and processing our biggest files every day, and time continued to make that cost larger.

Our solution was what I called the “Date Range Accumulator.” It works as follows:

  • If we have a date range like “all of time as of yesterday” (or any partial range with the same start), consider it as a starting point.
  • Make sure that the version numbers don’t consider our starting point to be too old.
  • Do the processing of today’s data on top of our starting point to create “all of time as of today.”

To do this, we read the directory of the date range accumulator, find the “latest” valid one, and use that to determine the delta (change) to our current date. Basically, the module says: “The last time I ran this was on data from the beginning of time to Thursday. It’s now Friday. I need to run the process for Friday, and then add that to the compiled all-time.” And, before it does that, it double checks the version number to avoid errors. (As we noted in our previous article, if it doesn’t see the correct version number, instead of inefficiently running all data, it just tells you there is a version number discrepancy.) 

The code is also a bit finicky—there are lots of snags when it comes to things like defining exceptions, such as if we took a drive out of the fleet, but it wasn’t a true failure. The module also needed to be processable day by day to be usable with this technique.
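As a rough illustration (again, a hypothetical simplification rather than our production code), the accumulator boils down to something like this: find the newest valid "all of time as of X" result, then process only the days between X and today on top of it.

import datetime as dt

def accumulate(previous_results, process_day, today, min_version):
    """previous_results: dict mapping an 'as of' datetime.date to a
    (version, accumulated list) pair. process_day computes one day's data.
    Returns the accumulated data covering all of time as of `today`."""
    valid = [d for d, (ver, _) in previous_results.items() if ver >= min_version]
    if not valid:
        # Mirror the behavior described above: report the discrepancy
        # rather than silently recomputing all of time.
        raise RuntimeError("version mismatch: no usable starting point")
    start = max(valid)                   # newest valid accumulation
    _, accumulated = previous_results[start]
    day = start + dt.timedelta(days=1)
    while day <= today:                  # only the delta gets processed
        accumulated = accumulated + [process_day(day)]
        day += dt.timedelta(days=1)
    return accumulated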

Still, even with all the tweaks, it’s massively better from a runtime perspective for eligible candidates. Here’s our new failures_with_stats runtime: 

An output of module runtime after the Drive Stats improvements were made.
Ahh, sweet victory.

Note that in this example, we’re running that 60-day report. The daily report is quite a bit quicker. But, at least the 60-day report is a fixed amount of time (as compared with the all-time dataset, which is continually growing). 

Code Upgrade to Python 3

Next, we converted our code to Python 3. (Shout out to our intern, Anath, who did amazing work on this part of the project!) We didn’t make this improvement just to make it; no, we did this because I wanted faster JSON processors, and a lot of the more advanced ones did not work with Python 2. When we looked at the time each module took to process, most of that was spent serializing and deserializing JSON.

What Is JSON Parsing?

JSON is an open standard file format that uses human readable text to store and transmit data objects. Many modern programming languages include code to generate and parse JSON-format data. Here’s how you might describe a person named John, aged 30, from New York using JSON: 

{ 
"name": "John", 
"age": 30,
"city": "New York"
}

You can express those attributes into a single line of code and define them as a native object:

x = { 'name':'John', 'age':30, 'city':'New York'}

“Parsing” is the process by which you take the JSON data and make it into an object that you can plug into another programming language. You’d write your script (program) in Python, it would parse (interpret) the JSON data, and then give you an answer. This is what that would look like: 

import json

# some JSON:
x = '''
{ 
	"name": "John", 
	"age": 30,
	"city": "New York"
}
'''

# parse x:
y = json.loads(x)

# the result is a Python object:
print(y["name"])

If you run this script, you’ll get the output “John.” If you change print(y["name"]) to print(y["age"]), you’ll get the output “30.” Check out this website if you want to interact with the code for yourself. In practice, the JSON would be read from a database, or a web API, or a file on disk rather than defined as a “string” (or text) in the Python code. If you are converting a lot of this JSON, small improvements in efficiency can make a big difference in how a program performs.

And Implementing UltraJSON

Upgrading to Python 3 meant we could use UltraJSON. This was approximately 50% faster than the built-in Python JSON library we used previously. 
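Because ujson mirrors the standard library's loads/dumps interface, the swap was close to a drop-in change. Here's a small, self-contained sketch (with made-up sample records) of the kind of before-and-after comparison you could run yourself:

import json
import time

import ujson  # pip install ujson

# Made-up sample records, standing in for a day's worth of drive data.
records = [{"serial": f"drive-{i}", "hours": i * 24} for i in range(100_000)]
payload = json.dumps(records)

# Time how long each library takes to parse the same payload.
for name, parser in (("json", json), ("ujson", ujson)):
    start = time.perf_counter()
    parser.loads(payload)
    print(f"{name}: {time.perf_counter() - start:.3f}s")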

We also looked at the XML parsing for the podstats files, since XML parsing is often a slow process. In this case, we actually found our existing tool is pretty fast (and since we wrote it 10 years ago, that’s pretty cool). Off-the-shelf XML parsers take quite a bit longer because they care about a lot of things we don’t have to: our tool is customized for our Drive Stats needs. It’s a well known adage that you should not parse XML with regular expressions, but if your files are, well, very regular, it can save a lot of time.
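For the curious, here's the flavor of that approach. The XML below is invented for illustration (the real podstats format is internal), but it shows why a regular expression can be enough when every record follows exactly the same layout:

import re

# Hypothetical, rigidly formatted podstats-style XML, for illustration only.
xml = """
<drive serial="ZA1234" model="ST12000NM0007" smart_5_raw="0" smart_9_raw="41511" />
<drive serial="ZA5678" model="ST12000NM0007" smart_5_raw="16" smart_9_raw="30012" />
"""

# Because every record has the same attributes in the same order, a single
# regular expression can pull out the fields without a full XML parser.
DRIVE_RE = re.compile(
    r'<drive serial="([^"]+)" model="([^"]+)" smart_5_raw="(\d+)" smart_9_raw="(\d+)" />'
)

for serial, model, smart_5, smart_9 in DRIVE_RE.findall(xml):
    print(serial, model, int(smart_5), int(smart_9))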

What Does the Future Hold?

Now that we’re working with a significantly faster processing time for our Drive Stats dataset, we’ve got some ideas about upgrades in the future. Some of these are easier to achieve than others. Here’s a sneak peek of some potential additions and changes in the future.

Data on Data

In keeping with our data-nerd ways, I got curious about how much the Drive Stats dataset is growing and if the trend is linear. We made this graph, which shows the baseline rolling average, and has a trend line that attempts to predict linearly.

A graph showing the rate at which the Drive Stats dataset has grown over time.

I envision this graph living somewhere on the Drive Stats page and being fully interactive. It’s just one graph, but this and similar tools available on our website would 1) be fun and 2) lead to some interesting insights for those who don’t dig in line by line. 

What About Changing the Data Module?

The way our current module system works, everything gets processed in a tree approach, and they’re flat files. If we used something like SQLite or Parquet, we’d be able to process data in a more depth-first way, and that would mean that we could open a file for one module or data range, process everything, and not have to read the file again. 

And, since one of the first things that our Drive Stats expert, Andy Klein, does with our .xml data is to convert it to SQLite, outputting it in a queryable form would save a lot of time. 

We could also explore keeping the data as a less-smart filetype, but using something more compact than JSON, such as MessagePack.

Can We Improve Failure Tracking and Attribution?

One of the odd things about our Drive Stats datasets is that they don’t always and automatically agree with our internal data lake. Our Drive Stats outputs have some wonkiness that’s hard to replicate, and it’s mostly because of exceptions we build into the dataset. These exceptions aren’t when a drive fails, but rather when we’ve removed it from the fleet for some other reason, like if we were testing a drive or something along those lines. (You can see specific callouts in Drive Stats reports, if you’re interested.) It’s also where a lot of Andy’s manual work on Drive Stats data comes in each month: he’s often comparing the module’s output with data in our datacenter ticket tracker.

These tickets come from the awesome data techs working in our data centers. Each time a drive fails and they have to replace it, our techs add a reason for why it was removed from the fleet. While not all drive replacements are “failures”, adding a root cause to our Drive Stats dataset would give us more confidence in our failure reporting (and would save Andy comparing the two lists). 

The Result: Faster Drive Stats and Future Fun

These two improvements (the date range accumulator and upgrading to Python 3) resulted in hours, and maybe even days, of work saved. Even from a troubleshooting point of view, we often wouldn’t know if the process was stuck, or if this was the normal amount of time the module should take to run. Now, if it takes more than about 15 minutes to run a report, you’re sure there’s a problem. 

While the Drive Stats dataset can’t really be called “big data”, it provides a good, concrete example of scaling with your data. We’ve been collecting Drive Stats for just over 10 years now, and even though most of the code written way back when is inherently sound, small improvements that seem marginal become amplified as datasets grow. 

Now that we’ve got better documentation of how everything works, it’s going to be easier to keep Drive Stats up-to-date with the best tools and run with future improvements. Let us know in the comments what you’d be interested in seeing.

The post Overload to Overhaul: How We Upgraded Drive Stats Data appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AI 101: Do the Dollars Make Sense?

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-do-the-dollars-make-sense/

A decorative image showing a cloud reaching out with digital tentacles to stacks of dollar signs.

Welcome back to AI 101, a series dedicated to breaking down the realities of artificial intelligence (AI). Previously we’ve defined artificial intelligence, deep learning (DL), and machine learning (ML) and dove into the types of processors that make AI possible. Today we’ll talk about one of the biggest limitations of AI adoption—how much it costs. Experts have already flagged that the significant investment necessary for AI can cause antitrust concerns and that AI is driving up costs in data centers. 

To that end, we’ll talk about: 

  • Factors that impact the cost of AI.
  • Some real numbers about the cost of AI components. 
  • The AI tech stack and some of the industry solutions that have been built to serve it.
  • And, uncertainty.

Defining AI: Complexity and Cost Implications

While ChatGPT, DALL-E, and the like may be the most buzz-worthy of recent advancements, AI has already been a part of our daily lives for several years now. In addition to generative AI models, examples include virtual assistants like Siri and Google Home, fraud detection algorithms in banks, facial recognition software, URL threat analysis services, and so on. 

That brings us to the first challenge when it comes to understanding the cost of AI: The type of AI you’re training—and how complex a problem you want it to solve—has a huge impact on the computing resources needed and the cost, both in the training and in the implementation phases. AI tasks are hungry in all ways: they need a lot of processing power, storage capacity, and specialized hardware. As you scale up or down in the complexity of the task you’re doing, there’s a huge range in the types of tools you need and their costs.   

To understand the cost of AI, several other factors come into play as well, including: 

  • Latency requirements: How fast does the AI need to make decisions? (e.g. that split second before a self-driving car slams on the brakes.)
  • Scope: Is the AI solving broad-based or limited questions? (e.g. the best way to organize this library vs. how many times is the word “cat” in this article.)
  • Actual human labor: How much oversight does it need? (e.g. does a human identify the cat in cat photos, or does the AI algorithm identify them?)
  • Adding data: When, how, and in what quantity will new data need to be ingested to update information over time? 

This is by no means an exhaustive list, but it gives you an idea of the considerations that can affect the kind of AI you’re building and, thus, what it might cost.

The Big Three AI Cost Drivers: Hardware, Storage, and Processing Power

In simple terms, you can break down the cost of running an AI to a few main components: hardware, storage, and processing power. That’s a little bit simplistic, and you’ll see some of these lines blur and expand as we get into the details of each category. But, for our purposes today, this is a good place to start to understand how much it costs to ask a bot to create a squirrel holding a cool guitar.

An AI generative image of a squirrel holding a guitar. Both the squirrel and the guitar and warped in strange, but not immediately noticeable ways.
Still not quite there on the guitar. Or the squirrel. How much could this really cost?

First Things First: Hardware Costs

Running an AI takes specialized processors that can handle complex processing queries. We’re early in the game when it comes to picking a “winner” for specialized processors, but these days, the most common processor is a graphical processing unit (GPU), with Nvidia’s hardware and platform as an industry favorite and front-runner. 

The most common “workhorse chip” of AI processing tasks, the Nvidia A100, starts at about $10,000 per chip, and a set of eight of the most advanced processing chips can cost about $300,000. When Elon Musk wanted to invest in his generative AI project, he reportedly bought 10,000 GPUs, which equates to an estimated value in the tens of millions of dollars. He’s gone on record as saying that AI chips can be harder to get than drugs. 

Google offers folks the ability to rent their TPUs through the cloud starting at $1.20 per chip hour for on-demand service (less if you commit to a contract). Meanwhile, Intel released a sub-$100 USB stick with a full NPU that can plug into your personal laptop, and folks have created their own models at home with the help of open sourced developer toolkits. Here’s a guide to using them if you want to get in the game yourself. 

Clearly, the spectrum for chips is vast—from under $100 to millions—and the landscape for chip producers is changing often, as is the strategy for monetizing those chips—which leads us to our next section. 

Using Third Parties: Specialized Problems = Specialized Service Providers

Building AI is a challenge with so many moving parts that, in a business use case, you eventually confront the question of whether it’s more efficient to outsource it. It’s true of storage, and it’s definitely true of AI processing. You can already see one way Google answered that question above: create a network populated by their TPUs, then sell access.   

Other companies specialize in broader or narrower parts of the AI creation and processing chain. To name a few diverse examples: Hugging Face, Inflection AI, CoreWeave, and Vultr. Their offerings span a wide range, from open source communities like Hugging Face, which provide a menu of models, datasets, no-code tools, and (frankly) rad developer experiments, to bare metal providers like Vultr that enhance your compute resources. How resources are offered also exists on a spectrum, including proprietary company resources (i.e. Nvidia’s platform), open source communities (looking at you, Hugging Face), or a mix of the two. 

An AI generated comic showing various iterations of data storage superheroes.
A comic generated on Hugging Face’s AI Comic Factory.

This means that, whichever piece of the AI tech stack you’re considering, you have a high degree of flexibility when you’re deciding where and how much you want to customize and where and how to implement an out-of-the box solution. 

Ballparking an estimate of what any of that costs would be so dependent on the particular model you want to build and the third-party solutions you choose that it doesn’t make sense to do so here. But, it suffices to say that there’s a pretty narrow field of folks who have the infrastructure capacity, the datasets, and the business need to create their own network. Usually it comes back to any combination of the following: whether you have existing infrastructure to leverage or are building from scratch, if you’re going to sell the solution to others, what control over research or dataset you have or want, how important privacy is and how you’re incorporating it into your products, how fast you need the model to make decisions, and so on. 

Welcome to the Spotlight, Storage

And, hey, with all that, let’s not forget storage. At the most basic level of consideration, AI uses a ton of data. How much? The going rule of thumb says you need at least an order of magnitude more training examples than the model has parameters. That means you want 10 times more examples than parameters. 

Parameters and Hyperparameters

The easiest way to think of parameters is to think of them as factors that control how an AI makes a decision. More parameters = more accuracy. And, just like our other AI terms, the term can be somewhat inconsistently applied. Here’s what ChatGPT has to say for itself:

A screenshot of a conversation with ChatGPT where it tells us it has 175 billion parameters.

That 10x number is just the amount of data you store for the initial training model—clearly the thing learns and grows, because we’re talking about AI. 

Preserving both your initial training algorithm and your datasets can be incredibly useful, too. As we talked about before, the more complex an AI, the higher the likelihood that your model will surprise you. And, as many folks have pointed out, deciding whether to leverage an already-trained model or to build your own doesn’t have to be an either/or—oftentimes the best option is to fine-tune an existing model to your narrower purpose. In both cases, having your original training model stored can help you roll back and identify the changes over time. 

The size of the dataset absolutely affects costs and processing times. The best example is that ChatGPT, everyone’s favorite model, has been rocking GPT-3 (or 3.5) instead of GPT-4 on the general public release because GPT-4, which works from a much larger, updated dataset than GPT-3, is too expensive to release to the wider public. It also returns results much more slowly than GPT-3.5, which means that our current love of instantaneous search results and image generation would need an adjustment. 

And all of that is true because GPT-4 was updated with more information (by volume), more up-to-date information, and the model was given more parameters to take into account for responses. So, it has to both access more data per query and use more complex reasoning to make decisions. That said, it also reportedly has much better results.

Storage and Cost

What are the real numbers to store, say, a primary copy of an AI dataset? Well, it’s hard to estimate, but we can ballpark that, if you’re training a large AI model, you’re going to have at a minimum tens of gigabytes of data and, at a maximum, petabytes. OpenAI considers the size of its training database proprietary information, and we’ve found sources that cite that number as anywhere from 17GB to 570GB to 45TB of text data. 

That’s not actually a ton of data, and, even taking the highest number, it would only cost $225 per month to store that data in Backblaze B2 (45TB * $5/TB/mo), for argument’s sake. But let’s say you’re training an AI on video to, say, make a robot vacuum that can navigate your room or recognize and identify human movement. Your training dataset could easily reach into petabyte scale (for reference, one petabyte would cost $5,000 per month in Backblaze B2). Some research shows that dataset size is trending up over time, though other folks point out that bigger is not always better.

On the other hand, if you’re the guy with the Intel Neural Compute stick we mentioned above and a Raspberry Pi, you’re talking the cost of the ~$100 AI processor, ~$50 for the Raspberry Pi, and any incidentals. You can choose to add external hard drives, network attached storage (NAS) devices, or even servers as you scale up.

Storage and Speed

Keep in mind that, in the above example, we’re only considering the cost of storing the primary dataset, and that’s not very accurate when thinking about how you’d be using your dataset. You’d also have to consider temporary storage for when you’re actually training the AI as your primary dataset is transformed by your AI algorithm, and nearly always you’re splitting your primary dataset into discrete parts and feeding those to your AI algorithm in stages—so each of those subsets would also be stored separately. And, in addition to needing a lot of storage, where you physically locate that storage makes a huge difference to how quickly tasks can be accomplished. In many cases, the difference is a matter of seconds, but there are some tasks that just can’t handle that delay—think of tasks like self-driving cars. 

For huge data ingest periods such as training, you’re often talking about a compute process assisted by powerful, often specialized, supercomputers making repeated passes over the same dataset. Having your data physically close to those supercomputers saves huge amounts of time in aggregate, which is pretty incredible when you consider that the savings break down to as little as milliseconds per task.

One way this problem is being solved is via caching, that is, creating temporary storage on the same chips (or motherboards) as the processor completing the task. Another solution is to keep the whole processing and storage cluster on-premises, at least while training, as in the Microsoft-OpenAI setup or as you’ll often see at universities. And, unsurprisingly, you’ll also see edge computing solutions that endeavor to locate data physically close to the end user. 
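The hardware caches described above live on the chip or motherboard, but the same principle shows up in software, too: keep whatever you've already fetched close at hand so repeated passes don't pay the slow-path penalty again. A minimal, illustrative sketch, with hypothetical chunk paths and cache size:

```python
# Minimal sketch of the caching principle at the software level: repeated
# passes over the same chunks hit a local in-memory copy instead of going back
# to slower (e.g., remote) storage. Chunk paths and cache size are illustrative.
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=256)  # keep up to 256 recently used chunks in memory
def load_chunk(path: str) -> bytes:
    # In a real pipeline this would likely be a network read from object
    # storage; a local file read stands in for the slow path here.
    return Path(path).read_bytes()

for epoch in range(3):  # repeated passes, as in training
    for chunk_path in ("chunks/part-000.bin", "chunks/part-001.bin"):
        data = load_chunk(chunk_path)  # epochs 2 and 3 hit the cache
```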

While there can be benefits to on-premises or co-located storage, the ability to quickly add more storage (and release it when it’s no longer needed) makes cloud storage a powerful tool in a holistic AI storage architecture, and it can help control costs. 

And, as always, effective backup strategies require at least one off-site storage copy, and the easiest way to achieve that is via cloud storage. So, any way you slice it, you’re likely going to have cloud storage touch some part of your AI tech stack. 

What Hardware, Processing, and Storage Have in Common: You Have to Power Them

Here’s the short version: any time you add complex compute + large amounts of data, you’re talking about a ton of money and a ton of power to keep everything running. 

Image: a disorganized tangle of power cords and switches plugged into decidedly too little outlet space. Caption: Just flip the switch, and you have AI.

Fortunately for us, other folks have done the work of figuring out how much this all costs. This excellent article from SemiAnalysis goes deep on the total cost of powering searches and running generative AI models. The Washington Post cites Dylan Patel (also of SemiAnalysis) as estimating that a single chat with ChatGPT could cost up to 1,000 times as much as a simple Google search. Those costs include everything we’ve talked about above—the capital expenditures, data storage, and processing. 

Consider this: Google spent several years putting off publicizing a frank accounting of its power usage. When the company finally released numbers in 2011, it said it used enough electricity to power 200,000 homes, and that was back in 2011. Claims about how much energy a single search consumes vary widely, but even the most conservative estimates say 0.03 Wh, and there are approximately 8.5 billion Google searches per day. (That’s just the incremental cost, by the way: how much extra a single search consumes on top of what it takes to keep the system that powers it running.) 
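Multiply those two figures together and the scale becomes clear. Here's the arithmetic, using only the numbers cited above.

```python
# Rough arithmetic on the incremental-energy figures cited above.
wh_per_search = 0.03       # watt-hours, the most conservative per-search estimate
searches_per_day = 8.5e9   # approximate daily Google searches

daily_mwh = wh_per_search * searches_per_day / 1e6
print(f"{daily_mwh:.0f} MWh per day, incremental cost only")  # ~255 MWh/day
```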

Power is a huge cost in operating data centers, even when you’re only talking about pure storage. One of the biggest single expenses affecting power usage is cooling. With high-compute workloads, and particularly with GPUs, the processors generate far more heat, which means more money spent on cooling and more power consumed. 

So, to Sum Up

When we’re talking about how much an AI costs, it’s not just about any single line item cost. If you decide to build and run your own models on-premises, you’re talking about huge capital expenditure and ongoing costs in data centers with high compute loads. If you want to build and train a model on your own USB stick and personal computer, that’s a different set of cost concerns. 

And, if you’re talking about querying a generative AI from the comfort of your own computer, you’re still using a comparatively high amount of power somewhere down the line. We may spread that power cost across our national and international infrastructures, but it’s important to remember that it’s coming from somewhere—and that the bill comes due, somewhere along the way. 

The post AI 101: Do the Dollars Make Sense? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.