
Load Balancing 2.0: What’s Changed After 7 Years?

Post Syndicated from nathaniel wagner original https://www.backblaze.com/blog/load-balancing-2-0-whats-changed-after-7-years/

A decorative image showing a file, a drive, and some servers.

What do two billion transactions a day look like? Well, the data may be invisible to the naked eye, but the math breaks down to just over 23,000 transactions every second. (Shout out to Kris Allen for burning into my memory that there are 86,400 seconds in a day and my handy calculator!) 

Part of my job as a Site Reliability Engineer (SRE) at Backblaze is making sure that greater than 99.9% of those transactions are what we consider “OK” (via status code), and part of the fun is digging for the needle in the haystack of over 7,000 production servers and over 250,000 spinning hard drives to try and understand how all of the different transactions interact with the pieces of infrastructure. 

In this blog post, I’m going to pick up where Principal SRE Elliott Sims left off in his 2016 article on load balancing. You’ll notice that the design principles we’ve employed are largely the same (cool!). So, I’ll review our stance on those principles, then talk about how the evolution of the B2 Cloud Storage platform—including the introduction of the S3 Compatible API—has changed our load balancing architecture. Read on for specifics. 

Editor’s Note

We know there are a ton of specialized networking terms flying around this article, and one of our primary goals is to make technical content accessible to all readers, regardless of their technical background. To that end, we’ve used footnotes to add some definitions and minimize the disruption to your reading experience.

What Is Load Balancing?

Load balancing is the process of distributing traffic across a network. It helps with resource utilization, prevents overloading any one server, and makes your system more reliable. Load balancers also monitor server health and redirect requests to the most suitable server.

With two billion requests per day to our servers, you can be assured that we use load balancers at Backblaze. Whenever anyone—a Backblaze Computer Backup or B2 Cloud Storage customer—wants to upload or download data, or modify or otherwise interact with their files, a load balancer will be there to direct traffic to the right server. Think of them as your trusty mail service, delivering your letters and packages to the correct destination—and using things like zip codes and addresses to interpret your request and get things to the right place.

How Do We Do It?

We build our own load balancers using open-source tools. We use layer 4 load balancing with direct server response (DSR). Here are some of the resources that we call on to make that happen:  

  • Border Gateway Protocol (BGP), which is part of the Linux kernel1. It’s a standardized exterior gateway protocol that exchanges routing and reachability information on the internet.
  • keepalived, open-source routing software. keepalived keeps track of all of our VIPs2 and IPs3 for each backend server.
  • Hard disk drives (HDDs). We use the same drives that we use for other API servers and whatnot, but that’s definitely overkill—we made that choice to save the work of sourcing another type of device.  
  • A lot of hard work by a lot of really smart folks.

What We Mean by Layers

When we’re talking about layers in load balancing, it’s shorthand for how deep into the architecture your program needs to see. Here’s a great diagram that defines those layers: 

An image describing application layers.
Source.

DSR takes place at layer 4, and it solves a problem posed by the full proxy4 method: with DSR, the back end server can still see the original client’s IP address.

Why Do We Do It the Way We Do It?

Building our own load balancers, instead of buying an off-the-shelf solution, means that we have more control and insight, more cost-effective hardware, and more scalable architecture. In general, DSR is more complicated to set up and maintain, but this method also lets us handle lots of traffic with minimal hardware and supports our goal of keeping data encrypted, even within our own data center. 

What Hasn’t Changed

We’re still using a layer 4 DSR approach to load balancing, which we’ll explain below. For reference, other common methods are layer 7 full proxy and layer 4 full proxy load balancing.

First, I’ll explain how DSR works. DSR load balancing requires two things:

  1. A load balancer with the VIP address attached to an external NIC5 and ARPing6, so that the rest of the network knows it “owns” the IP.
  2. Two or more servers on the same layer 2 network that also have the VIP address attached to a NIC, either internal or external, but are not replying to ARP requests about that address. This means that no other servers on the network know that the VIP exists anywhere but on the load balancer.

A request packet will enter the network, and be routed to the load balancer. Once it arrives there, the load balancer leaves the source and destination IP addresses intact and instead modifies the destination MAC7 address to that of a server, then puts the packet back on the network. The network switch only understands MAC addresses, so it forwards the packet on to the correct server.

A diagram of how a packet moves through the network router and load balancer to reach the server, then respond to the original client request.

When the packet arrives at the server’s network interface, it checks to make sure the destination MAC address matches its own. The address matches, so the server accepts the packet. The server network interface then, separately, checks to see whether the destination IP address is one attached to it somehow. That’s a yes, even though the rest of the network doesn’t know it, so the server accepts the packet and passes it on to the application. The application then sends a response with the VIP as the source IP address and the client as the destination IP, so the response is routed directly to the client without passing back through the load balancer.
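To make that concrete, here’s a minimal sketch—not our production tooling, and the VIP is hypothetical—of how a back end Linux server is typically prepared for DSR: attach the VIP to the loopback interface so the server will accept packets addressed to it, and suppress ARP so the server never advertises that it owns the address.

import subprocess

VIP = "203.0.113.10/32"  # hypothetical virtual IP, for illustration only

def run(cmd):
    # Run a shell command and raise an error if it fails.
    subprocess.run(cmd, shell=True, check=True)

# Attach the VIP to the loopback interface so the server accepts
# packets addressed to it without advertising the address.
run(f"ip addr add {VIP} dev lo")

# Suppress ARP replies and announcements for the VIP, so only the
# load balancer answers ARP requests for that address.
run("sysctl -w net.ipv4.conf.all.arp_ignore=1")
run("sysctl -w net.ipv4.conf.all.arp_announce=2")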

So, What’s Changed?

Lots of things. But, since we first wrote this article, we’ve expanded our offerings and platform. The biggest of these changes (as far as load balancing is concerned) is that we added the S3 Compatible API. 

We also serve a much more diverse set of clients, both in the size of files they have and their access patterns. File sizes affect how long it takes us to serve requests (larger files = more time to upload or download, which means an individual server is tied up for longer). Access patterns can vastly increase the number of requests a server has to process on a regular, but not consistent, basis (which means you might have times when your network is more or less idle, and you have to optimize appropriately).

A definitely Photoshopped images showing a giraffe riding an elephant on a rope in the sky. The rope's anchor points disappear into clouds.
So, if we were to update this amazing image from the first article, we might have a tightrope walker with a balancing pole on top of the giraffe, plus some birds flying on a collision course with the elephant.

Where You Can See the Changes: ECMP, Volume of Data Per Month, and API Processing

DSR is how we send data to the customer—the server responds (sends data) directly to the requesting client. This is the equivalent of going to the post office to mail something, but putting your home address as the return address (so that the reply comes straight to you instead of back through the post office).

Given how our platform has evolved over the years, things might happen slightly differently. Let’s dig in to some of the details that affect how the load balancers make their decisions—what rules govern how they route traffic, and how different types of requests cause them to behave differently. We’ll look at:

  • Equal cost multipath routing (ECMP). 
  • Volume of data in petabytes (PBs) per month.
  • APIs and processing costs.

ECMP

One thing that’s not explicitly outlined above is how the load balancer determines which server should respond to a request. At Backblaze, we use stateless load balancing, which means that the load balancer doesn’t take into account most information about the servers it routes to. We use a round robin approach—i.e., the load balancers cycle through a set of hosts, in order, each time they assign a request.

We also use Maglev, so the load balancers use consistent hashing and connection tracking. This means that we’re minimizing the negative impact of unexpected faults and failures for connection-oriented protocols. If a load balancer goes down, its server pool can be transferred to another, and it will make decisions in the same way, seamlessly picking up the load. When the initial load balancer comes back online, it already has a connection to its load balancer friend and can pick up where it left off.

The upside is that it’s super rare to see a disruption, and it essentially only happens when the load balancer and the neighbor host go down in a short period of time. The downside is that the load balancing decision is static. If you have “better” servers for one reason or another—they’re newer, for instance—the load balancers don’t take that information into account. On the other hand, we do have the ability to push more traffic to specific servers through ECMP weights if we need to, which means that we have good control over a diverse fleet of hardware.
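For illustration only, here’s a toy Python sketch of stateless, hash-based server selection with weights. It’s not Maglev’s actual algorithm—Maglev builds a permutation-based lookup table so that pool changes barely disturb existing mappings—but it shows why the same connection always lands on the same server, no matter which load balancer makes the decision.

import hashlib

# Hypothetical back end pool with ECMP-style weights; host-03 gets 2x traffic.
SERVERS = [("host-01", 1), ("host-02", 1), ("host-03", 2)]

def pick_server(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    # Expand the pool according to weight, then hash the connection's
    # 5-tuple. The same 5-tuple always maps to the same server, so any
    # load balancer holding the same table makes the same decision.
    weighted = [name for name, weight in SERVERS for _ in range(weight)]
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return weighted[digest % len(weighted)]

print(pick_server("198.51.100.7", 51515, "203.0.113.10", 443))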

Volume of Data

Backblaze now has over three exabytes of storage under management. Based on the scalability of the network design, that doesn’t really make a huge amount of difference when you’re scaling your infrastructure properly. What can make a difference is how people store and access their data. 

Most of the things that make managing large datasets difficult from an architecture perspective can also be silly for a client. For example: querying files individually (creating lots of requests) instead of batching or creating a range request. (There may be a business reason to do that, but usually, it makes more sense to batch requests.)

On the other hand, some things that make sense for how clients need to store data require architecture realignment. One of those is just a sheer fluctuation of data by volume—if you’re adding and deleting large amounts of data (we’re talking hundreds of terabytes or more) on a “shorter” cycle (monthly or less), then there will be a measurable impact. And, with more data stored, you have the potential for more transactions.

Similarly, if you need to retrieve data often, but not regularly, there are potential performance impacts. Most of them are related to caching, and ironically, they can actually improve performance. The more you query the same “set” of servers for the same file, the more likely that each server in the group will have cached your data locally (which means they can serve it more quickly). 

And, as with most data centers, we store our long term data on hard disk drives (HDDs), whereas our API servers are on solid state drives (SSDs). There are positives and negatives to each type of drive, but the performance impact is that data at rest takes longer to retrieve, while cached data lives on a faster SSD on the API server.

On the other hand, the more servers the data center has, the lower the chance that any given server can or will deliver cached data. And, of course, if you’re replacing large volumes of old data with new on a shorter timeline, then you won’t see the benefits. It sounds like an edge case, but industries like security camera storage are a great example. While those customers don’t retrieve their data very frequently, they are constantly uploading and overwriting it, often to meet industry requirements about retention periods—which makes it challenging to allocate a finite amount of input/output operations per second (IOPS) across uploads, downloads, and deletes.

That said, the built-in benefit of our system is that adding another load balancer is (relatively) cheap. If we’re experiencing a processing chokepoint for whatever reason—typically either a CPU bottleneck or throughput limits on the NIC—we can add another load balancer, and just like that, we can start to see bits flying through the new load balancer and traffic being routed amongst more hosts, alleviating the choke points.

APIs and Processing Costs

We mentioned above that one of the biggest changes to our platform was the addition of the S3 Compatible API. When all requests were made through the B2 Native API, the Backblaze CLI tool, or the web UI, the processing cost was relatively cheap. 

That’s because of the way our upload requests to the Backblaze Vaults are structured. When you make an upload request via the Native API, there are actually two transactions: one to get an upload URL (which is routed through a load balancer), and a second to send the data to the Vault directly. And, all other types of requests (besides upload) have always had to be processed through our load balancers. Since an S3 Compatible API upload is a single request that the load balancers have to handle, we knew we would have to add more processing power and load balancers. (If you want to go back to 2018 and see some of the reasons why, here’s Brian Wilson on the subject—read with the caveat that our current Tech Doc on the subject outlines how we solve the complications he points out.)
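Here’s a simplified sketch of that two-transaction flow using Python’s requests library. It leaves out required details like the X-Bz-Content-Sha1 header and error handling (see the B2 Native API docs for the full contract), but it shows which request touches a load balancer and which one doesn’t.

import requests

def upload_via_native_api(api_url, auth_token, bucket_id, file_name, data):
    # Transaction 1: ask for an upload URL. This request is routed
    # through the load balancers to an API server.
    resp = requests.post(
        f"{api_url}/b2api/v2/b2_get_upload_url",
        headers={"Authorization": auth_token},
        json={"bucketId": bucket_id},
    )
    upload = resp.json()

    # Transaction 2: send the data to the returned uploadUrl, which
    # points at the Vault directly and bypasses the load balancers.
    # (Simplified: a real upload also needs X-Bz-Content-Sha1, etc.)
    return requests.post(
        upload["uploadUrl"],
        headers={
            "Authorization": upload["authorizationToken"],
            "X-Bz-File-Name": file_name,
            "Content-Type": "b2/x-auto",
        },
        data=data,
    )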

We’re still leveraging DSR to respond directly to the client, but we’ve significantly increased the number of transactions that hit our load balancers—both because they have to take on more of the processing during transit and because, well, lots of folks like to use the S3 Compatible API, and our customer base has grown by quite a bit since 2018.

And, just like above, we’ve set ourselves up for a relatively painless fix: we can add another load balancer to solve most problems. 

Do We Have More Complexity?

This is the million dollar question, solved for a dollar: how could we not? Since our first load balancing article, we’ve added features, complexity, and lots of customers. Load balancing algorithms are inherently complex, but we (mostly Elliott and other smart people) have taken a lot of time and consideration to build a system that will not only scale up to and past two billion transactions a day, but that can also be fairly “easily” explained and doesn’t require a graduate degree to understand what is happening.

But, we knew it was important early on, so we prioritized building a system where we could “just” add another load balancer. The thinking is more complicated at the outset, but the tradeoff is that it’s simple once you’ve designed the system. It would take a lot for us to outgrow the usefulness of this strategy—but hey, we might get there someday. When we do, we’ll write you another article. 

Footnotes

  1. Kernel: A kernel is the computer program at the core of a computer’s operating system. It has control of the system and does things like run processes, manage hardware devices like the hard drive, and handle interrupts, as well as memory and input/output (I/O) requests from software, translating them to instructions for the central processing unit (CPU). ↩
  2. Virtual IP address (VIP): An IP address that isn’t tied to a single physical device or network interface. ↩
  3. Internet protocol (IP) address: The global logical address of a device, used to identify devices on the internet. Can be changed.  ↩
  4. Proxy: In networking, a proxy is a server application that validates outside requests to a network. Think of them as a gatekeeper. There are several common types of proxies you interact with all the time—HTTPS requests on the internet, for example. ↩
  5. Network interface controller, or network interface card (NIC): This connects the computer to the network. ↩
  6. Address resolution protocol (ARP): The protocol by which a device on a local network maps an IP address to another device’s MAC address. There are four types. ↩
  7. Media access control (MAC): The local physical address of a device, used to identify devices on the same network. Hardcoded into the device. ↩

The post Load Balancing 2.0: What’s Changed After 7 Years? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

NAS Performance Guide: How To Optimize Your NAS

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/nas-performance-guide-how-to-optimize-your-nas/

A decorative images showing a 12 bay NAS device connecting to the cloud.

Upgrading to a network attached storage (NAS) device puts your data in the digital fast lane. If you’re using one, it’s likely because you want to keep your data close to you, ensuring quick access whenever it’s needed. NAS devices, acting as centralized storage systems connected to local networks, offer a convenient way to access data in just a few clicks. 

However, as the volume of data on the NAS increases, its performance can tank. You need to know how to keep your NAS operating at its best, especially with growing data demand. 

In this blog, you’ll learn about various factors that can affect NAS performance, as well as practical steps you can take to address these issues, ensuring optimal speed, reliability, and longevity for your NAS device.

Why NAS Performance Matters

NAS devices can function as extended hard disks, virtual file cabinets, or centralized local storage solutions, depending on individual needs. 

While NAS offers a convenient way to store data locally, storing the data alone isn’t enough. How quickly and reliably you can access your data can make all the difference if you want an efficient workflow. For example, imagine working on a critical project with your team and facing slow file transfers, or streaming a video on a Zoom call only for it to stutter or buffer continuously.

All these can be a direct result of NAS performance issues, and an increase in stored data can directly undermine the device’s performance. Therefore, ensuring optimal performance isn’t just a technical concern, it’s also a concern that directly affects user experience, productivity, and collaboration. 

So, let’s talk about what could potentially cause performance issues and how to enhance your NAS. 

Common NAS Performance Issues

NAS performance can be influenced by a variety of factors. Here are some of the most common factors that can impact the performance of a NAS device.

Hardware Limitations:

  • Insufficient RAM: Especially in tasks like media streaming or handling large files, having inadequate memory can slow down operations. 
  • Slow CPU: An underpowered processor can become a bottleneck when multiple users access the NAS at once or during collaboration with team members. 
  • Drive Speed and Type: Hard disk drives (HDDs) are generally slower compared to solid state drives (SSDs), and your NAS can have either type. If your NAS mainly serves as a hub for storing and sharing files, a conventional HDD should meet your requirements. However, for those seeking enhanced speed and performance, SSDs deliver the performance you need. 
  • Outdated Hardware: Older NAS models might not be equipped to handle modern data demands or the latest software.

Software Limitations:

  • Outdated Firmware/Software: Not updating to the latest firmware or software can lead to performance issues, or to missing out on optimization and security features.
  • Misconfigured Settings: Incorrect settings can impact performance. This includes improper RAID configuration or network settings. 
  • Background Processes: Certain background tasks, like indexing or backups, can also slow down the system when running.

Network Challenges: 

  • Bandwidth Limitations: A slow network connection, especially over Wi-Fi, can limit data transfer rates.
  • Network Traffic: High traffic on the network can cause congestion, reducing the speed at which data can be accessed or transferred.

Disk Health and Configuration:

  • Disk Failures: A failing disk in the NAS can slow down performance and also poses a data loss risk.
  • Suboptimal RAID Configuration: Some RAID configurations prioritize redundancy over performance, which can affect data storage and access speeds.

External Factors:

  • Simultaneous User Access: If multiple users are accessing, reading, or writing to the NAS simultaneously, it can strain the system, especially if the hardware isn’t provisioned for that much concurrent traffic.
  • Inadequate Power Supply: Fluctuating or inadequate power can cause the NAS to malfunction or reduce its performance.
  • Operating Temperature: If the NAS is in a hot environment, it might overheat, which impacts the performance of the device.

Practical Solutions for Optimizing NAS Performance

Understanding the common performance issues with NAS devices is the first critical step. However, simply identifying these issues alone isn’t enough. It’s vital to understand practical ways to optimize your existing NAS setup so you can enhance its speed, efficiency, and reliability. Let’s explore how to optimize your NAS. 

Performance Enhancement 1: Upgrading Hardware

There are a few different things you can do on a hardware level to enhance NAS performance. First, adding more RAM can significantly improve performance, especially if multiple tasks or users are accessing the NAS simultaneously. 

You can also consider switching to SSDs. While they can be more expensive, SSDs offer faster read/write speeds than traditional HDDs, and they store data in flash memory, which means that they retain information even without power. 

Finally, you could upgrade the CPU. For NAS devices that support it, a more powerful CPU can better handle multiple simultaneous requests and complex tasks. 

Performance Enhancement 2: Optimizing Software Configuration

Remember to always keep your NAS operating system and software up to date to benefit from the latest performance optimizations and security patches. Schedule tasks like indexing, backups, or antivirus scans during off-peak hours to ensure they don’t impact user access during high-traffic times. You also need to make sure you’re using the right RAID configuration for your needs. RAID 5 or RAID 6, for example, can offer a good balance between redundancy and performance, as the quick comparison below shows.
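If it helps to see the RAID tradeoff in numbers, here’s a rough rule-of-thumb sketch in Python. Real usable capacity depends on your NAS vendor and file system, so treat these figures as approximations only.

def usable_capacity_tb(num_drives, drive_size_tb, raid_level):
    # Rule-of-thumb usable capacity; actual numbers vary by implementation.
    if raid_level == "RAID 0":    # striping, no redundancy
        return num_drives * drive_size_tb
    if raid_level == "RAID 5":    # one drive's worth of parity
        return (num_drives - 1) * drive_size_tb
    if raid_level == "RAID 6":    # two drives' worth of parity
        return (num_drives - 2) * drive_size_tb
    if raid_level == "RAID 10":   # mirrored pairs
        return num_drives // 2 * drive_size_tb
    raise ValueError(f"unsupported RAID level: {raid_level}")

# Example: a four-bay NAS with 8TB drives.
for level in ("RAID 0", "RAID 5", "RAID 6", "RAID 10"):
    print(f"{level}: {usable_capacity_tb(4, 8, level)}TB usable")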

Performance Enhancement 3: Network Enhancements

Consider moving to faster network protocols, like 10Gb ethernet, or ensuring that your router and switches can handle high traffic. Wherever possible, use wired connections instead of Wi-Fi to connect to the NAS for more stable and faster data access and transfer. And, regularly review and adjust network settings for optimal performance. If you can, it also helps to limit simultaneous access. If possible, manage peak loads by setting up access priorities.

Performance Enhancement 4: Regular Maintenance

Use your NAS device’s built-in tools or third-party software to monitor the health of your disks and replace any that show signs of failure. And, keep the physical environment around your NAS device clean, cool, and well ventilated to prevent overheating. 

Leveraging the Cloud for NAS Optimization

After taking the necessary steps to optimize your NAS for improved performance and reliability, it’s worth considering leveraging the cloud to further enhance the performance. While NAS offers convenient local storage, it can sometimes fall short when it comes to scalability, accessibility from different locations, and seamless collaboration. Here’s where cloud storage comes into play. 

At its core, cloud storage is a service model in which data is maintained, managed, and backed up remotely, and made available to users over the internet. Instead of relying solely on local storage solutions such as NAS or a server, you utilize the vast infrastructure of data centers across the globe to store your data not just in one physical location, but across multiple secure and redundant environments. 

As an off-site storage solution for NAS, the cloud not only completes your 3-2-1 backup plan, but can also amplify its performance. Let’s take a look at how integrating cloud storage can help optimize your NAS.

  • Off-Loading and Archiving: One of the most straightforward approaches is to move infrequently accessed or archival data from the NAS to the cloud. This frees up space on the NAS, ensuring it runs smoothly, while optimizing the NAS by only keeping data that’s frequently accessed or essential. 
  • Caching: Some advanced NAS systems can cache frequently accessed data in the cloud. This means that the most commonly used data can be quickly retrieved, enhancing user experience and reducing the load on the NAS device. 
  • Redundancy and Disaster Recovery: Instead of duplicating data on multiple NAS devices for redundancy, which can be costly and still vulnerable to local disasters, the data can be backed up to the cloud. In case of NAS failure or catastrophic event, the data can be quickly restored from the cloud, ensuring minimal downtime. 
  • Remote Access and Collaboration: While NAS devices can offer remote access, integrating them with cloud storage can streamline this process, often offering a more user-friendly interface and better speeds. This is especially useful for collaborative environments where multiple users work together on files and projects. 
  • Scaling Without Hardware Constraints: As your data volume grows, expanding a NAS can involve purchasing additional drives or even new devices. With cloud integration, you can expand your storage capacity without these immediate hardware investments, eliminating or delaying the need for physical upgrades and extending the lifespan of your NAS. 

In essence, integrating cloud storage solutions with your NAS can create a comprehensive system that addresses the shortcomings of NAS devices, helping you create a hybrid setup that offers the best of both worlds: the speed and accessibility of local storage, and the flexibility and scalability of the cloud. 

Getting the Best From Your NAS

At its core, NAS offers the unparalleled convenience of localized storage. However, it’s not without challenges, especially when performance issues come into play. Addressing these challenges requires a blend of hardware optimization, software updates, and smart data management.

But, it doesn’t have to stop at your local network. Cloud storage can be leveraged effectively to optimize your NAS. It doesn’t just act as a safety net by storing your NAS data off-site, it also makes collaboration easier with dispersed teams and further optimizes NAS performance. 

Now, it’s time to hear from you. Have you encountered any NAS performance issues? What measures have you taken to optimize your NAS? Share your experiences and insights in the comments below. 

The post NAS Performance Guide: How To Optimize Your NAS appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Overload to Overhaul: How We Upgraded Drive Stats Data

Post Syndicated from David Winings original https://www.backblaze.com/blog/overload-to-overhaul-how-we-upgraded-drive-stats-data/

A decorative image showing the words "overload to overhaul: how we upgraded Drive Stats data."

This year, we’re celebrating 10 years of Drive Stats. Coincidentally, we also made some upgrades to how we run our Drive Stats reports. We reported on how an attempt to migrate triggered a weeks-long recalculation of the dataset, leading us to map the architecture of the Drive Stats data. 

This follow-up article focuses on the improvements we made after we fixed the existing bug (because hey, we were already in there), and then presents some of our ideas for future improvements. Remember that those are just ideas so far—they may not be live in a month (or ever?), but consider them good food for thought, and know that we’re paying attention so that we can pass this info along to the right people.

Now, onto the fun stuff. 

Quick Refresh: Drive Stats Data Architecture

The podstats generator runs on every Storage Pod—what we call any host that holds customer data—every few minutes. It’s a C++ program that collects SMART stats and a few other attributes, then converts them into an .xml file (“podstats”). Those files are then pushed to a central host in each data center and bundled. Once the data leaves these central hosts, it has entered the domain of what we will call Drive Stats.

Now let’s go into a little more detail: when you’re gathering stats about drives, you’re running a set of modules with dependencies on other modules, forming a data-dependency tree. Each time a module “runs,” it takes information, modifies it, and writes it to disk. As you run each module, the data is transformed sequentially. And, once a quarter, we run a special module that collects all the attributes for our Drive Stats reports, collecting data all the way down the tree.

Here’s a truncated diagram of the whole system, to give you an idea of what the logic looks like:

A diagram of the mapped logic of the Drive Stats modules.
An abbreviated logic map of Drive Stats modules.

As you move down through the module layers, the logic gets more and more specialized. When you run a module, the first thing the module does is check in with the previous module to make sure the data exists and is current. It caches the data to disk at every step, and fills out the logic tree step by step. So for example, drive_stats, being a “per-day” module, will write out a file such as /data/drive_stats/2023-01-01.json.gz when it finishes processing. This lets future modules read that file to avoid repeating work.

This work deduplication process saves us a lot of time overall—but it also turned out to be the root cause of our weeks-long process when we were migrating Drive Stats to our new host. We fixed that by adding version numbers to each module.
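As a simplified sketch—the paths and version number here are hypothetical, not our actual module code—the check-the-cache-then-compute pattern looks something like this:

import gzip
import json
import os

MODULE_VERSION = 2  # hypothetical; bumped whenever the module's logic changes

def run_module(module, day, compute):
    # Return cached results for `day` if they exist and match the current
    # version; otherwise do the work and cache it for future runs.
    path = f"/data/{module}/{day}.json.gz"
    if os.path.exists(path):
        with gzip.open(path, "rt") as f:
            cached = json.load(f)
        if cached.get("version") == MODULE_VERSION:
            return cached["data"]  # cache hit: skip the work
    data = compute(day)  # cache miss or stale version: do the work
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, "wt") as f:
        json.dump({"version": MODULE_VERSION, "data": data}, f)
    return data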

While You’re There… Why Not Upgrade?

Once the dust from the bug fix had settled, we moved forward to try to modernize Drive Stats in general. Our daily report still ran quite slowly, on the order of several hours, and there was some low-hanging fruit to chase.

Waiting On You, failures_with_stats

First things first, we saved a log of a run of our daily reports in Jenkins. Then we wrote an analyzer to see which modules were taking a lot of time. failures_with_stats was our biggest offender, running for about two hours, while every other module took about 15 minutes.

An image showing runtimes for each module when running a Drive Stats report.
Not quite two hours.

Upon investigation, the time cost had to do with how the date_range module works. This takes us back to caching: our module checks if the file has been written already, and if it has, it uses the cached file. However, a date range is written to a single file. That is, Drive Stats will recognize “Monday to Wednesday” as distinct from “Monday to Thursday” and re-calculate the entire range. This is a problem for a workload that is essentially doing work for all of time, every day.  

On top of this, the raw Drive Stats data, which is a dependency for failures_with_stats, would be gzipped onto a disk. When each new query triggered a request to recalculate all-time data, each dependency would pick up the podstats file from disk, decompress it, read it into memory, and do that for every day of all time. We were picking up and processing our biggest files every day, and time continued to make that cost larger.

Our solution was what I called the “Date Range Accumulator.” It works as follows:

  • If we have a date range like “all of time as of yesterday” (or any partial range with the same start), consider it as a starting point.
  • Make sure that the version numbers don’t consider our starting point to be too old.
  • Do the processing of today’s data on top of our starting point to create “all of time as of today.”

To do this, we read the directory of the date range accumulator, find the “latest” valid one, and use that to determine the delta (change) to our current date. Basically, the module says: “The last time I ran this was on data from the beginning of time to Thursday. It’s now Friday. I need to run the process for Friday, and then add that to the compiled all-time.” And, before it does that, it double checks the version number to avoid errors. (As we noted in our previous article, if it doesn’t see the correct version number, instead of inefficiently running all data, it just tells you there is a version number discrepancy.) 

The code is also a bit finicky—there are lots of snags when it comes to things like defining exceptions, such as if we took a drive out of the fleet, but it wasn’t a true failure. The module also needed to be processable day by day to be usable with this technique.
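Here’s a stripped-down sketch of the accumulator idea in Python. The directory layout, version number, and per-day processing are all stand-ins—the real module has far more going on—but the control flow is the same: find the latest valid “all of time” result, check its version, and only process the missing days.

import datetime
import json
import os

ACCUM_DIR = "/data/failures_with_stats"  # hypothetical cache directory
CURRENT_VERSION = 3                      # hypothetical module version

def latest_starting_point(today):
    # Find the most recent cached "beginning of time through <date>" file.
    best = None
    for name in os.listdir(ACCUM_DIR):
        stem, ext = os.path.splitext(name)
        if ext != ".json":
            continue
        try:
            day = datetime.date.fromisoformat(stem)
        except ValueError:
            continue  # ignore files that aren't named after a date
        if day < today and (best is None or day > best):
            best = day
    return best

def process_day(day, state):
    # Fold one day of drive data into the accumulated state (stub).
    return state

def accumulate(today):
    start = latest_starting_point(today)
    if start is None:
        raise RuntimeError("no valid starting point; a full rebuild is needed")
    with open(os.path.join(ACCUM_DIR, f"{start}.json")) as f:
        state = json.load(f)
    if state.get("version") != CURRENT_VERSION:
        raise RuntimeError("version mismatch; not silently rebuilding all of time")
    day = start + datetime.timedelta(days=1)
    while day <= today:
        state = process_day(day, state)
        day += datetime.timedelta(days=1)
    with open(os.path.join(ACCUM_DIR, f"{today}.json"), "w") as f:
        json.dump(state, f)
    return state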

Still, even with all the tweaks, it’s massively better from a runtime perspective for eligible candidates. Here’s our new failures_with_stats runtime: 

An output of module runtime after the Drive Stats improvements were made.
Ahh, sweet victory.

Note that in this example, we’re running that 60-day report. The daily report is quite a bit quicker. But, at least the 60-day report is a fixed amount of time (as compared with the all-time dataset, which is continually growing). 

Code Upgrade to Python 3

Next, we converted our code to Python 3. (Shout out to our intern, Anath, who did amazing work on this part of the project!) We didn’t make this improvement just to make it; no, we did this because I wanted faster JSON processors, and a lot of the more advanced ones did not work with Python 2. When we looked at the time each module took to process, most of that was spent serializing and deserializing JSON.

What Is JSON Parsing?

JSON is an open standard file format that uses human readable text to store and transmit data objects. Many modern programming languages include code to generate and parse JSON-format data. Here’s how you might describe a person named John, aged 30, from New York using JSON: 

{
  "firstName": "John",
  "age": 30,
  "State": "New York"
}

You can express those attributes into a single line of code and define them as a native object:

x = { 'name':'John', 'age':30, 'city':'New York'}

“Parsing” is the process by which you take the JSON data and make it into an object that you can plug into another programming language. You’d write your script (program) in Python, it would parse (interpret) the JSON data, and then give you an answer. This is what that would look like: 

import json

# some JSON:
x = '''
{ 
	"firstName": "John", 
	"age": 30,
	"State": "New York"
}
'''

# parse x:
y = json.loads(x)

# the result is a Python object:
print(y["firstName"])

If you run this script, you’ll get the output “John.” If you change print(y["firstName"]) to print(y["age"]), you’ll get the output “30.” Check out this website if you want to interact with the code for yourself. In practice, the JSON would be read from a database, or a web API, or a file on disk rather than defined as a “string” (or text) in the Python code. If you are converting a lot of this JSON, small improvements in efficiency can make a big difference in how a program performs.

And Implementing UltraJSON

Upgrading to Python 3 meant we could use UltraJSON. This was approximately 50% faster than the built-in Python JSON library we used previously. 
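As a rough illustration—this isn’t our module code, and the speedup you see will vary with the shape of your data—UltraJSON is close to a drop-in replacement for the standard library:

import json
import time

import ujson  # pip install ujson

# Build a fake payload roughly shaped like per-drive records.
records = [{"serial": f"ZZ{i:07d}", "smart_9_raw": i * 24} for i in range(100_000)]
payload = json.dumps(records)

start = time.perf_counter()
json.loads(payload)
print("stdlib json:", time.perf_counter() - start)

start = time.perf_counter()
ujson.loads(payload)
print("ujson:      ", time.perf_counter() - start)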

We also looked at the XML parsing for the podstats files, since XML parsing is often a slow process. In this case, we actually found our existing tool is pretty fast (and since we wrote it 10 years ago, that’s pretty cool). Off-the-shelf XML parsers take quite a bit longer because they care about a lot of things we don’t have to: our tool is customized for our Drive Stats needs. It’s a well known adage that you should not parse XML with regular expressions, but if your files are, well, very regular, it can save a lot of time.

What Does the Future Hold?

Now that we’re working with a significantly faster processing time for our Drive Stats dataset, we’ve got some ideas about upgrades in the future. Some of these are easier to achieve than others. Here’s a sneak peek of some potential additions and changes in the future.

Data on Data

In keeping with our data-nerd ways, I got curious about how much the Drive Stats dataset is growing and whether the trend is linear. We made this graph, which shows the baseline rolling average along with a trend line that attempts a linear prediction.

A graph showing the rate at which the Drive Stats dataset has grown over time.

I envision this graph living somewhere on the Drive Stats page and being fully interactive. It’s just one graph, but this and similar tools available on our website would 1) be fun and 2) lead to some interesting insights for those who don’t dig in line by line.

What About Changing the Data Module?

The way our current module system works, everything gets processed in a tree approach, and they’re flat files. If we used something like SQLite or Parquet, we’d be able to process data in a more depth-first way, and that would mean that we could open a file for one module or data range, process everything, and not have to read the file again. 

And, since one of the first things that our Drive Stats expert, Andy Klein, does with our .xml data is to convert it to SQLite, outputting it in a queryable form would save a lot of time. 

We could also explore keeping the data as a less-smart filetype, but using something more compact than JSON, such as MessagePack.

Can We Improve Failure Tracking and Attribution?

One of the odd things about our Drive Stats datasets is that they don’t always and automatically agree with our internal data lake. Our Drive Stats outputs have some wonkiness that’s hard to replicate, and it’s mostly because of exceptions we build into the dataset. These exceptions aren’t when a drive fails, but rather when we’ve removed it from the fleet for some other reason, like if we were testing a drive or something along those lines. (You can see specific callouts in Drive Stats reports, if you’re interested.) It’s also where a lot of Andy’s manual work on Drive Stats data comes in each month: he’s often comparing the module’s output with data in our datacenter ticket tracker.

These tickets come from the awesome data techs working in our data centers. Each time a drive fails and they have to replace it, our techs add a reason for why it was removed from the fleet. While not all drive replacements are “failures”, adding a root cause to our Drive Stats dataset would give us more confidence in our failure reporting (and would save Andy comparing the two lists). 

The Result: Faster Drive Stats and Future Fun

These two improvements (the date range accumulator and upgrading to Python 3) resulted in hours, and maybe even days, of work saved. Even from a troubleshooting point of view, we often wouldn’t know if the process was stuck, or if this was the normal amount of time the module should take to run. Now, if it takes more than about 15 minutes to run a report, you’re sure there’s a problem. 

While the Drive Stats dataset can’t really be called “big data”, it provides a good, concrete example of scaling with your data. We’ve been collecting Drive Stats for just over 10 years now, and even though most of the code written way back when is inherently sound, small improvements that seem marginal become amplified as datasets grow. 

Now that we’ve got better documentation of how everything works, it’s going to be easier to keep Drive Stats up-to-date with the best tools and run with future improvements. Let us know in the comments what you’d be interested in seeing.

The post Overload to Overhaul: How We Upgraded Drive Stats Data appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AI 101: Do the Dollars Make Sense?

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-do-the-dollars-make-sense/

A decorative image showing a cloud reaching out with digital tentacles to stacks of dollar signs.

Welcome back to AI 101, a series dedicated to breaking down the realities of artificial intelligence (AI). Previously, we defined artificial intelligence, deep learning (DL), and machine learning (ML) and dove into the types of processors that make AI possible. Today we’ll talk about one of the biggest limitations of AI adoption—how much it costs. Experts have already flagged that the significant investment necessary for AI can cause antitrust concerns and that AI is driving up costs in data centers.

To that end, we’ll talk about: 

  • Factors that impact the cost of AI.
  • Some real numbers about the cost of AI components. 
  • The AI tech stack and some of the industry solutions that have been built to serve it.
  • And, uncertainty.

Defining AI: Complexity and Cost Implications

While ChatGPT, DALL-E, and the like may be the most buzz-worthy of recent advancements, AI has already been a part of our daily lives for several years now. In addition to generative AI models, examples include virtual assistants like Siri and Google Home, fraud detection algorithms in banks, facial recognition software, URL threat analysis services, and so on. 

That brings us to the first challenge when it comes to understanding the cost of AI: The type of AI you’re training—and how complex a problem you want it to solve—has a huge impact on the computing resources needed and the cost, both in the training and in the implementation phases. AI tasks are hungry in all ways: they need a lot of processing power, storage capacity, and specialized hardware. As you scale up or down in the complexity of the task you’re doing, there’s a huge range in the types of tools you need and their costs.   

To understand the cost of AI, several other factors come into play as well, including: 

  • Latency requirements: How fast does the AI need to make decisions? (e.g. that split second before a self-driving car slams on the brakes.)
  • Scope: Is the AI solving broad-based or limited questions? (e.g. the best way to organize this library vs. how many times is the word “cat” in this article.)
  • Actual human labor: How much oversight does it need? (e.g. does a human identify the cat in cat photos, or does the AI algorithm identify them?)
  • Adding data: When, how, and in what quantity will new data need to be ingested to update information over time?

This is by no means an exhaustive list, but it gives you an idea of the considerations that can affect the kind of AI you’re building and, thus, what it might cost.

The Big Three AI Cost Drivers: Hardware, Storage, and Processing Power

In simple terms, you can break down the cost of running an AI to a few main components: hardware, storage, and processing power. That’s a little bit simplistic, and you’ll see some of these lines blur and expand as we get into the details of each category. But, for our purposes today, this is a good place to start to understand how much it costs to ask a bot to create a squirrel holding a cool guitar.

An AI generative image of a squirrel holding a guitar. Both the squirrel and the guitar and warped in strange, but not immediately noticeable ways.
Still not quite there on the guitar. Or the squirrel. How much could this really cost?

First Things First: Hardware Costs

Running an AI takes specialized processors that can handle complex processing queries. We’re early in the game when it comes to picking a “winner” for specialized processors, but these days, the most common processor is a graphical processing unit (GPU), with Nvidia’s hardware and platform as an industry favorite and front-runner. 

The most common “workhorse chip” for AI processing tasks, the Nvidia A100, starts at about $10,000 per chip, and a set of eight of the most advanced processing chips can cost about $300,000. When Elon Musk wanted to invest in his generative AI project, he reportedly bought 10,000 GPUs, which equates to an estimated value in the tens of millions of dollars. He’s gone on record as saying that AI chips can be harder to get than drugs.

Google offers folks the ability to rent their TPUs through the cloud starting at $1.20 per chip hour for on-demand service (less if you commit to a contract). Meanwhile, Intel released a sub-$100 USB stick with a full NPU that can plug into your personal laptop, and folks have created their own models at home with the help of open sourced developer toolkits. Here’s a guide to using them if you want to get in the game yourself. 

Clearly, the spectrum for chips is vast—from under $100 to millions—and the landscape for chip producers is changing often, as is the strategy for monetizing those chips—which leads us to our next section. 

Using Third Parties: Specialized Problems = Specialized Service Providers

Building AI is a challenge with so many moving parts that, in a business use case, you eventually confront the question of whether it’s more efficient to outsource it. It’s true of storage, and it’s definitely true of AI processing. You can already see one way Google answered that question above: create a network populated by their TPUs, then sell access.   

Other companies specialize in broader or narrower parts of the AI creation and processing chain. Just to name a few, diverse companies: there’s Hugging Face, Inflection AI, CoreWeave, and Vultr. Those companies have a wide array of product offerings and resources, ranging from open source communities like Hugging Face, which provide a menu of models, datasets, no-code tools, and (frankly) rad developer experiments, to bare metal servers like Vultr, which enhance your compute resources. How resources are offered also exists on a spectrum, including proprietary company resources (i.e., Nvidia’s platform), open source communities (looking at you, Hugging Face), or a mix of the two.

An AI generated comic showing various iterations of data storage superheroes.
A comic generated on Hugging Face’s AI Comic Factory.

This means that, whichever piece of the AI tech stack you’re considering, you have a high degree of flexibility when you’re deciding where and how much you want to customize and where and how to implement an out-of-the box solution. 

Ballparking an estimate of what any of that costs would be so dependent on the particular model you want to build and the third-party solutions you choose that it doesn’t make sense to do so here. But, it suffices to say that there’s a pretty narrow field of folks who have the infrastructure capacity, the datasets, and the business need to create their own network. Usually it comes back to any combination of the following: whether you have existing infrastructure to leverage or are building from scratch, if you’re going to sell the solution to others, what control over research or dataset you have or want, how important privacy is and how you’re incorporating it into your products, how fast you need the model to make decisions, and so on. 

Welcome to the Spotlight, Storage

And, hey, with all that, let’s not forget storage. At the most basic level of consideration, AI uses a ton of data. How much? Conventional wisdom says you need at least an order of magnitude more training examples than the model has parameters. That means you want 10 times more examples than parameters.

Parameters and Hyperparameters

The easiest way to think of parameters is to think of them as factors that control how an AI makes a decision. More parameters = more accuracy. And, just like our other AI terms, the term can be somewhat inconsistently applied. Here’s what ChatGPT has to say for itself:

A screenshot of a conversation with ChatGPT where it tells us it has 175 billion parameters.

That 10x number is just the amount of data you store for the initial training model—clearly the thing learns and grows, because we’re talking about AI. 

Preserving both your initial training algorithm and your datasets can be incredibly useful, too. As we talked about before, the more complex an AI, the higher the likelihood that your model will surprise you. And, as many folks have pointed out, deciding whether to leverage an already-trained model or to build your own doesn’t have to be an either/or—oftentimes the best option is to fine-tune an existing model to your narrower purpose. In both cases, having your original training model stored can help you roll back and identify the changes over time. 

The size of the dataset absolutely affects costs and processing times. The best example is that ChatGPT, everyone’s favorite model, has been rocking GPT-3 (or 3.5) instead of GPT-4 on the general public release because GPT-4, which works from a much larger, updated dataset than GPT-3, is too expensive to release to the wider public. It also returns results much more slowly than GPT-3.5, which means that our current love of instantaneous search results and image generation would need an adjustment. 

And all of that is true because GPT-4 was updated with more information (by volume), more up-to-date information, and the model was given more parameters to take into account for responses. So, it has to both access more data per query and use more complex reasoning to make decisions. That said, it also reportedly has much better results.

Storage and Cost

What are the real numbers to store, say, a primary copy of an AI dataset? Well, it’s hard to estimate, but we can ballpark that, if you’re training a large AI model, you’re going to have at a minimum tens of gigabytes of data and, at a maximum, petabytes. OpenAI considers the size of its training database proprietary information, and we’ve found sources that cite that number as anywhere from 17GB to 570GB to 45TB of text data.

That’s not actually a ton of data, and, even taking the highest number, it would only cost $225 per month to store that data in Backblaze B2 (45TB * $5/TB/mo), for argument’s sake. But let’s say you’re training an AI on video to, say, make a robot vacuum that can navigate your room or recognize and identify human movement. Your training dataset could easily reach into petabyte scale (for reference, one petabyte would cost $5,000 per month in Backblaze B2). Some research shows that dataset size is trending up over time, though other folks point out that bigger is not always better.
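Here’s the quick arithmetic behind those numbers, using the $5/TB/month figure from above (the dataset sizes are just the ones cited in this post):

B2_PRICE_PER_TB_MONTH = 5.00  # $/TB/month, the rate used in the examples above

def monthly_storage_cost(terabytes):
    return terabytes * B2_PRICE_PER_TB_MONTH

# Dataset sizes cited above: 17GB, 570GB, 45TB, and a petabyte-scale video set.
for label, tb in [("17GB", 0.017), ("570GB", 0.57), ("45TB", 45), ("1PB", 1000)]:
    print(f"{label:>5}: ${monthly_storage_cost(tb):,.2f}/month")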

On the other hand, if you’re the guy with the Intel Neural Compute stick we mentioned above and a Raspberry Pi, you’re talking the cost of the ~$100 AI processor, ~$50 for the Raspberry Pi, and any incidentals. You can choose to add external hard drives, network attached storage (NAS) devices, or even servers as you scale up.

Storage and Speed

Keep in mind that, in the above example, we’re only considering the cost of storing the primary dataset, which isn’t the whole picture of how you’d actually use it. You’d also have to consider temporary storage while you’re actually training the AI, as your primary dataset is transformed by your AI algorithm. And, nearly always, you’re splitting your primary dataset into discrete parts and feeding those to your AI algorithm in stages—so each of those subsets would also be stored separately. In addition to needing a lot of storage, where you physically locate that storage makes a huge difference in how quickly tasks can be accomplished. In many cases, the difference is a matter of seconds, but there are some tasks that just can’t handle that delay—think of self-driving cars.

For huge data ingest periods such as training, you’re often talking about a compute process that’s assisted by powerful, and often specialized, supercomputers, with repeated passes over the same dataset. Having your data physically close to those supercomputers saves you huge amounts of time, which is pretty incredible when you consider that it breaks down to as little as milliseconds per task.

One way this problem is being solved is via caching, or creating temporary storage on the same chips (or motherboards) as the processor completing the task. Another solution is to keep the whole processing and storage cluster on-premises (at least while training), as you can see in the Microsoft-OpenAI setup or as you’ll often see in universities. And, unsurprisingly, you’ll also see edge computing solutions which endeavor to locate data physically close to the end user. 

While there can be benefits to on-premises or co-located storage, having a way to quickly add more storage (and release it if no longer needed), means cloud storage is a powerful tool for a holistic AI storage architecture—and can help control costs. 

And, as always, effective backup strategies require at least one off-site storage copy, and the easiest way to achieve that is via cloud storage. So, any way you slice it, you’re likely going to have cloud storage touch some part of your AI tech stack. 

What Hardware, Processing, and Storage Have in Common: You Have to Power Them

Here’s the short version: any time you add complex compute + large amounts of data, you’re talking about a ton of money and a ton of power to keep everything running. 

A disorganized set of power cords and switches plugged into what is decidedly too small of an outlet space.
Just flip the switch, and you have AI. Source.

Fortunately for us, other folks have done the work of figuring out how much this all costs. This excellent article from SemiAnalysis goes deep on the total cost of powering searches and running generative AI models. The Washington Post cites Dylan Patel (also of SemiAnalysis) as estimating that a single chat with ChatGPT could cost up to 1,000 times as much as a simple Google search. Those costs include everything we’ve talked about above—the capital expenditures, data storage, and processing. 

Consider this: Google spent several years putting off publicizing a frank accounting of their power usage. When they released numbers in 2011, they said that they use enough electricity to power 200,000 homes. And that was in 2011. There are widely varying claims for how much a single search costs, but even the most conservative estimates say 0.03 Wh of energy per search. There are approximately 8.5 billion Google searches per day. (That’s just an incremental cost, by the way—as in, how much a single search costs in extra resources on top of how much the system that powers it costs.)

Power is a huge cost in operating data centers, even when you’re only talking about pure storage. One of the biggest single expenses that affects power usage is cooling systems. With high-compute workloads, and particularly with GPUs, the amount of work the processor is doing generates a ton more heat—which means more money in cooling costs, and more power consumed. 

So, to Sum Up

When we’re talking about how much an AI costs, it’s not just about any single line item cost. If you decide to build and run your own models on-premises, you’re talking about huge capital expenditure and ongoing costs in data centers with high compute loads. If you want to build and train a model on your own USB stick and personal computer, that’s a different set of cost concerns. 

And, if you’re talking about querying a generative AI from the comfort of your own computer, you’re still using a comparatively high amount of power somewhere down the line. We may spread that power cost across our national and international infrastructures, but it’s important to remember that it’s coming from somewhere—and that the bill comes due, somewhere along the way. 

The post AI 101: Do the Dollars Make Sense? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The SSD Edition: 2023 Drive Stats Mid-Year Review

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/ssd-edition-2023-mid-year-drive-stats-review/

A decorative image displaying the title 2023 Mid-Year Report Drive Stats SSD Edition.

Welcome to the 2023 Mid-Year SSD Edition of the Backblaze Drive Stats review. This report is based on data from the solid state drives (SSDs) we use as storage server boot drives on our Backblaze Cloud Storage platform. In this environment, the drives do much more than boot the storage servers. They also store log files and temporary files produced by the storage server. Each day a boot drive will read, write, and delete files depending on the activity of the storage server itself.

We will review the quarterly and lifetime failure rates for these drives, and along the way we’ll offer observations and insights to the data presented. In addition, we’ll take a first look at the average age at which our SSDs fail, and examine how well SSD failure rates fit the ubiquitous bathtub curve.

Mid-Year SSD Results by Quarter

As of June 30, 2023, there were 3,144 SSDs in our storage servers. This compares to the 2,558 SSDs we reported in our 2022 SSD annual report. We’ll start by presenting and discussing the quarterly data from each of the last two quarters (Q1 2023 and Q2 2023).

Notes and Observations

Data is by quarter: The data used in each table is specific to that quarter. That is, the number of drive failures and drive days are inclusive of the specified quarter, Q1 or Q2. The drive counts are as of the last day of each quarter.

Drives added: Since our last SSD report, ending in Q4 2022, we added 238 SSDs to our collection. Of that total, Crucial (model: CT250MX500SSD1) led the way with 110 new drives added, followed by 62 new WDC drives (model: WD Blue SA510 2.5) and 44 Seagate drives (model: ZA250NM1000).

Really high annualized failure rates (AFR): Some of the failure rates, that is, the AFRs, seem crazy high. How could the Seagate model SSDSCKKB240GZR have an annualized failure rate over 800%? In that case, in Q1 we started with two drives and one failed shortly after being installed, hence the high AFR. In Q2, the remaining drive did not fail and the AFR was 0%. Which AFR is useful? In this case, neither; we just don't have enough data to get decent results. For any given drive model, we like to see at least 100 drives and 10,000 drive days in a given quarter before we consider the calculated AFR to be "reasonable." We include all of the drive models for completeness, so check the drive count and drive days before looking at any AFR with a critical eye.
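
For reference, AFR is calculated from failures and drive days, which is why small fleets produce such extreme numbers. Here's a minimal sketch of the math; the 45 drive days in the example are an illustrative guess, not the actual figure for that model:

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """AFR (%) = (failures / drive days) * 365 * 100."""
    return failures / drive_days * 365 * 100

# One failure over roughly 45 drive days yields the kind of eye-popping AFR described above.
print(f"{annualized_failure_rate(1, 45):.0f}%")       # ~811%
# The same single failure over 10,000 drive days looks far more reasonable.
print(f"{annualized_failure_rate(1, 10_000):.2f}%")   # ~3.65%
```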

Quarterly Annualized Failure Rates Over Time

The data in any given quarter can be volatile, with factors like drive age and the randomness of failures skewing the AFR up or down. For Q1, the AFR was 0.96% and, for Q2, the AFR was 1.05%. The chart below shows how these quarterly failure rates relate to previous quarters over the last three years.

As you can see, the AFR fluctuates between 0.36% and 1.72%, so what's the value of quarterly rates? Well, they are useful as the proverbial canary in a coal mine. For example, the AFR in Q1 2021 (0.58%) jumped to 1.51% in Q2 2021, then to 1.72% in Q3 2021. A subsequent investigation showed one drive model was the primary cause of the rise and that model was removed from service. 

It happens from time to time that a given drive model is not compatible with our environment, and we will moderate or even remove that drive's effect on the system as a whole. While boot drives are not as critical as data drives in managing our system's durability, we still need to keep them in operation to collect the drive/server/vault data they capture each day. 

How Backblaze Uses the Data Internally

As you’ve seen in our SSD and HDD Drive Stats reports, we produce quarterly, annual, and lifetime charts and tables based on the data we collect. What you don’t see is that every day we produce similar charts and tables for internal consumption. While typically we produce one chart for each drive model, in the example below we’ve combined several SSD models into one chart. 

The “Recent” period we use internally is 60 days. This differs from our public facing reports which are quarterly. In either case, charts like the one above allow us to quickly see trends requiring further investigation. For example, in our chart above, the recent results of the Micron SSDs indicate a deeper dive into the data behind the charts might be necessary.

By collecting, storing, and constantly analyzing the Drive Stats data, we can be proactive in maintaining our durability and availability goals. Without our Drive Stats data, we would be inclined to over-provision our systems, as we would be blind to the randomness of drive failures that directly impacts those goals.

A First Look at More SSD Stats

Over the years in our quarterly Hard Drive Stats reports, we’ve examined additional metrics beyond quarterly and lifetime failure rates. Many of these metrics can be applied to SSDs as well. Below we’ll take a first look at two of these: the average age of failure for SSDs and how well SSD failures correspond to the bathtub curve. In both cases, the datasets are small, but are a good starting point as the number of SSDs we monitor continues to increase.

The Average Age of Failure for SSDs

Previously, we calculated the average age at which a hard drive in our system fails. In our initial calculations that turned out to be about two years and seven months. That was a good baseline, but further analysis was required as many of the drive models used in the calculations were still in service and hence some number of them could fail, potentially affecting the average.

We are going to apply the same calculations to our collection of failed SSDs and establish a baseline we can work from going forward. Our first step was to determine the SMART_9_RAW value (power-on-hours or POH) for the 63 failed SSD drives we have to date. That’s not a great dataset size, but it gave us a starting point. Once we collected that information, we computed that the average age of failure for our collection of failed SSDs is 14 months. Given that the average age of the entire fleet of our SSDs is just 25 months, what should we expect to happen as the average age of the SSDs still in operation increases? The table below looks at three drive models which have a reasonable amount of data.

                             Good Drives            Failed Drives
MFG       Model              Count    Avg Age       Count    Avg Age
Crucial   CT250MX500SSD1       598    11 months         9    7 months
Seagate   ZA250CM10003       1,114    28 months        14    11 months
Seagate   ZA250CM10002         547    40 months        17    25 months

As we can see in the table, the average age of the failed drives increases as the average age of drives in operation (good drives) increases. In other words, it is reasonable to expect that the average age of SSD failures will increase as the entire fleet gets older.
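
As a rough sketch of how SMART attribute 9 turns into these ages, here's the kind of calculation involved. The power-on-hour values below are made up for illustration; the real inputs come from the Drive Stats dataset:

```python
# SMART_9_RAW reports power-on hours (POH) for each drive.
failed_drive_poh = [9_500, 12_200, 7_800, 11_000]   # illustrative values, not real data

def hours_to_months(hours: float) -> float:
    return hours / 24 / 30.44   # 30.44 = average days per month

avg_age_months = sum(hours_to_months(h) for h in failed_drive_poh) / len(failed_drive_poh)
print(f"Average age at failure: {avg_age_months:.1f} months")
```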

Is There a Bathtub Curve for SSD Failures?

Previously we’ve graphed our hard drive failures over time to determine their fit to the classic bathtub curve used in reliability engineering. Below, we used our SSD data to determine how well our SSD failures fit the bathtub curve.

While the actual curve (blue line) produced by the SSD failures over each quarter is a bit “lumpy”, the trend line (second order polynomial) does have a definite bathtub curve look to it. The trend line is about a 70% match to the data, so we can’t be too confident of the curve at this point, but for the limited amount of data we have, it is surprising to see how the occurrences of SSD failures are on a path to conform to the tried-and-true bathtub curve.

SSD Lifetime Annualized Failure Rates

As of June 30, 2023, there were 3,144 SSDs in our storage servers. The table below is based on the lifetime data for the drive models which were active as of the end of Q2 2023.

Notes and Observations

Lifetime AFR: The lifetime data is cumulative from Q4 2018 through Q2 2023. For this period, the lifetime AFR for all of our SSDs was 0.90%. That was up slightly from 0.89% at the end of Q4 2022, but down from a year ago, Q2 2022, at 1.08%.

High failure rates?: As we noted with the quarterly stats, we like to have at least 100 drives and over 10,000 drive days to give us some level of confidence in the AFR numbers. If we apply that metric to our lifetime data, we get the following table.

Applying our modest criteria to the list eliminated those drive models with crazy high failure rates. This is not a statistics trick; we just removed those models which did not have enough data to make the calculated AFR reliable. It is possible the drive models we removed will continue to have high failure rates. It is also just as likely their failure rates will fall into a more normal range. If this technique seems a bit blunt to you, then confidence intervals may be what you are looking for.

Confidence intervals: In general, the more data you have and the more consistent that data is, the more confident you are in the predictions based on that data. We calculate confidence intervals at 95% certainty. 

For SSDs, we like to see a confidence interval of 1.0% or less between the low and the high values before we are comfortable with the calculated AFR. If we apply this metric to our lifetime SSD data we get the following table.

This doesn’t mean the failure rates for the drive models with a confidence interval greater than 1.0% are wrong; it just means we’d like to get more data to be sure. 

Regardless of the technique you use, both are meant to help clarify the data presented in the tables throughout this report.
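
For readers who want to experiment with the data themselves, here's a hedged sketch of one common way to put a 95% confidence interval around an AFR, modeling failures as a Poisson process. It illustrates the idea; it isn't necessarily the exact method behind the tables in this report:

```python
from scipy.stats import chi2

def afr_confidence_interval(failures: int, drive_days: float, conf: float = 0.95):
    """Exact Poisson interval on the failure count, converted to AFR percentages."""
    alpha = 1 - conf
    lower = chi2.ppf(alpha / 2, 2 * failures) / 2 if failures > 0 else 0.0
    upper = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    to_afr = lambda f: f / drive_days * 365 * 100
    return to_afr(lower), to_afr(upper)

# Illustrative numbers: 14 failures over 500,000 drive days.
low, high = afr_confidence_interval(failures=14, drive_days=500_000)
print(f"AFR 95% confidence interval: {low:.2f}% to {high:.2f}%")
```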

The SSD Stats Data

The data collected and analyzed for this review is available on our Drive Stats Data page. You’ll find SSD and HDD data in the same files and you’ll have to use the model number to locate the drives you want, as there is no field to designate a drive as SSD or HDD. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone—it is free.

Good luck and let us know if you find anything interesting.

The post The SSD Edition: 2023 Drive Stats Mid-Year Review appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Big Performance Improvements in Rclone 1.64.0, but Should You Upgrade?

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/big-performance-improvements-in-rclone-1-64-0-but-should-you-upgrade/

A decorative image showing a diagram about multithreading, as well as the Rclone and Backblaze logos.

Rclone is an open source, command line tool for file management, and it’s widely used to copy data between local storage and an array of cloud storage services, including Backblaze B2 Cloud Storage. Rclone has had a long association with Backblaze—support for Backblaze B2 was added back in January 2016, just two months before we opened Backblaze B2’s public beta, and five months before the official launch—and it’s become an indispensable tool for many Backblaze B2 customers. 

Rclone v1.64.0, released last week, includes a new implementation of multithreaded data transfers, promising much faster data transfer of large files between cloud storage services. 

Does it deliver? Should you upgrade? Read on to find out!

Multithreading to Boost File Transfer Performance

Something of a Swiss Army Knife for cloud storage, rclone can copy files, synchronize directories, and even mount remote storage as a local filesystem. Previous versions of rclone were able to take advantage of multithreading to accelerate the transfer of “large” files (by default at least 256MB), but the benefits were limited. 

When transferring files from a storage system to Backblaze B2, rclone would read chunks of the file into memory in a single reader thread, starting a set of multiple writer threads to simultaneously write those chunks to Backblaze B2. When the source storage was a local disk (the common case) as opposed to remote storage such as Backblaze B2, this worked really well—the operation of moving files from local disk to Backblaze B2 was quite fast. However, when the source was another remote storage—say, transferring from Amazon S3 to Backblaze B2, or even Backblaze B2 to Backblaze B2—data chunks were read into memory by that single reader thread at about the same rate as they could be written to the destination, meaning that all but one of the writer threads were idle.

What’s the Big Deal About Rclone v1.64.0?

Rclone v1.64.0 completely refactors multithreaded transfers. Now rclone starts a single set of threads, each of which both reads a chunk of data from the source service into memory, and then writes that chunk to the destination service, iterating through a subset of chunks until the transfer is complete. The threads transfer their chunks of data in parallel, and each transfer is independent of the others. This architecture is both simpler and much, much faster.
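
To make the difference concrete, here's a simplified, hypothetical sketch of the new approach in Python. The source and dest objects and their read_chunk/write_chunk methods are stand-ins for whatever the storage backends actually do; this is not rclone's Go code:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, purely illustrative

def transfer_chunk(source, dest, offset, length):
    data = source.read_chunk(offset, length)   # each worker reads its own chunk...
    dest.write_chunk(offset, data)             # ...and writes it, independently of the others

def multithreaded_copy(source, dest, file_size, threads=4):
    offsets = range(0, file_size, CHUNK_SIZE)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(transfer_chunk, source, dest, off, min(CHUNK_SIZE, file_size - off))
            for off in offsets
        ]
        for future in futures:
            future.result()   # surface any per-chunk errors
```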

Show Me the Numbers!

How much faster? I spun up a virtual machine (VM) via our compute partner, Vultr, and downloaded both rclone v1.64.0 and the preceding version, v1.63.1. As a quick test, I used Rclone’s copyto command to copy 1GB and 10GB files from Amazon S3 to Backblaze B2, like this:

rclone --no-check-dest copyto s3remote:my-s3-bucket/1gigabyte-test-file b2remote:my-b2-bucket/1gigabyte-test-file

Note that I made no attempt to “tune” rclone for my environment by setting the chunk size or number of threads. I was interested in the out-of-the-box performance. I used the --no-check-dest flag so that rclone would overwrite the destination file each time, rather than detecting that the files were the same and skipping the copy.

I ran each copyto operation three times, then calculated the average time. Here are the results; all times are in seconds:

Rclone version    1GB file    10GB file
1.63.1            52.87       725.04
1.64.0            18.64       240.45

As you can see, the difference is significant! The new rclone transferred both files around three times faster than the previous version.

So, copying individual large files is much faster with the latest version of rclone. How about migrating a whole bucket containing a variety of file sizes from Amazon S3 to Backblaze B2, which is a more typical operation for a new Backblaze customer? I used rclone’s copy command to transfer the contents of an Amazon S3 bucket—2.8GB of data, comprising 35 files ranging in size from 990 bytes to 412MB—to a Backblaze B2 Bucket:

rclone --fast-list --no-check-dest copy s3remote:my-s3-bucket b2remote:my-b2-bucket

Much to my dismay, this command failed, returning errors related to the files being corrupted in transfer, for example:

2023/09/18 16:00:37 ERROR : tpcds-benchmark/catalog_sales/20221122_161347_00795_djagr_3a042953-d0a2-4b8d-8c4e-6a88df245253: corrupted on transfer: sizes differ 244695498 vs 0

Rclone was reporting that the transferred files in the destination bucket contained zero bytes, and deleting them to avoid the use of corrupt data.

After some investigation, I discovered that the files were actually being transferred successfully, but a bug in rclone 1.64.0 caused the app to incorrectly interpret some successful transfers as corrupted, and thus delete the transferred file from the destination. 

I was able to use the --ignore-size flag to work around the bug by disabling the file size check so I could continue with my testing:

rclone --fast-list --no-check-dest --ignore-size copy s3remote:my-s3-bucket b2remote:my-b2-bucket

A Word of Caution to Control Your Transaction Fees

Note the use of the --fast-list flag. By default, rclone’s method of reading the contents of cloud storage buckets minimizes memory usage at the expense of making a “list files” call for every subdirectory being processed. Backblaze B2’s list files API, b2_list_file_names, is a class C transaction, priced at $0.004 per 1,000 with 2,500 free per day. This doesn’t sound like a lot of money, but using rclone with large file hierarchies can generate a huge number of transactions. Backblaze B2 customers have either hit their configured caps or incurred significant transaction charges on their account when using rclone without the --fast-list flag.
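
To get a feel for how quickly those Class C calls can add up without --fast-list, here's a rough, back-of-the-envelope sketch. Everything except the pricing quoted above is a made-up assumption:

```python
subdirectories = 250_000          # a deep media file hierarchy (illustrative)
passes_per_day = 24               # e.g., an hourly sync job (illustrative)
free_class_c_per_day = 2_500
price_per_1000 = 0.004            # dollars per 1,000 Class C transactions

calls_per_day = subdirectories * passes_per_day
billable = max(0, calls_per_day - free_class_c_per_day)
print(f"~${billable / 1000 * price_per_1000:.2f} per day in Class C transactions")
```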

We recommend you always use --fast-list with rclone if at all possible. You can set an environment variable so you don’t have to include the flag in every command:

export RCLONE_FAST_LIST=1

Again, I performed the copy operation three times, and averaged the results:

Rclone version    2.8GB tree (seconds)
1.63.1            56.92
1.64.0            42.47

Since the bucket contains both large and small files, we see a lesser, but still significant, improvement in performance with rclone v1.64.0—it’s about 33% faster than the previous version with this set of files.

So, Should I Upgrade to the Latest Rclone?

As outlined above, rclone v1.64.0 contains a bug that can cause copy (and presumably also sync) operations to fail. If you want to upgrade to v1.64.0 now, you’ll have to use the --ignore-size workaround. If you don’t want to use the workaround, it’s probably best to hold off until rclone releases v1.64.1, when the bug fix will likely be deployed—I’ll come back and update this blog entry when I’ve tested it!

The post Big Performance Improvements in Rclone 1.64.0, but Should You Upgrade? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Drive Stats Data Deep Dive: The Architecture

Post Syndicated from David Winings original https://www.backblaze.com/blog/drive-stats-data-deep-dive-the-architecture/

A decorative image displaying the words Drive Stats Data Deep Dive: The Architecture.

This year, we’re celebrating 10 years of Drive Stats—that’s 10 years of collecting the data and sharing the reports with all of you. While there’s some internal debate about who first suggested publishing the failure rates of drives, we all agree that Drive Stats has had impact well beyond our expectations. As of today, Drive Stats is still one of the only public datasets about drive usage, has been cited 150+ times by Google Scholar, and always sparks lively conversation, whether it’s at a conference, in the comments section, or in one of the quarterly Backblaze Engineering Week presentations. 

This article is based on a presentation I gave during Backblaze’s internal Engineering Week, and is the result of a deep dive into managing and improving the architecture of our Drive Stats datasets. So, without further ado, let’s dive down the Drive Stats rabbit hole together. 

More to Come

This article is part of a series on the nuts and bolts of Drive Stats. Up next, we’ll highlight some improvements we’ve made to the Drive Stats code, and we’ll link to them here. Stay tuned!

A “Simple” Ask

When I started at Backblaze in 2020, one of the first things I was asked to do was to “clean up Drive Stats.” It hadn't been ignored, per se, which is to say that things still worked, but it took forever and the teams that had worked on it previously were engaged in other projects. While we were confident that we had good data, running a report took about two and a half hours, plus lots of manual labor put in by Andy Klein to scrub and validate drives in the dataset. 

On top of all that, the host on which we stored the data kept running out of space. But, each time we tried to migrate the data, something went wrong. When I started a fresh attempt at moving our dataset between hosts for this project, then ran the report, it ran for weeks (literally). 

Trying to diagnose the root cause of the issue was challenging due to the amount of history surrounding the codebase. There was some code documentation, but not a ton of practical knowledge. In short, I had my work cut out for me. 

Drive Stats Data Architecture

Let’s start with the origin of the data. Every few minutes, the podstats generator runs on every Backblaze Storage Pod (what we call any host that holds customer data). It’s a legacy C++ program that collects SMART stats and a few other attributes, then converts them into an .xml file (“podstats”). Those are then pushed to a central host in each data center and bundled. Once the data leaves these central hosts, it has entered the domain of what we will call Drive Stats. This is a program that knows how to populate various types of data within arbitrary time bounds, based on the underlying podstats .xml files. When we run our daily reports, the lowest level of data is the raw podstats. When we run a “standard” report, it looks for the last 60 days or so of podstats. If you’re missing any part of the data, Drive Stats will download the necessary podstats .xml files. 

Now let’s go into a little more detail: when you’re gathering stats about drives, you’re running a set of modules with dependencies on other modules, forming a data dependency tree. Each time a module “runs,” it takes information, modifies it, and writes it to disk. As you run each module, the data is transformed sequentially. And, once a quarter, we run a special module that collects all the attributes for our Drive Stats reports, gathering data all the way down the tree. 

There’s a registry that catalogs each module, its dependencies, and its function signature. Each module knows how its own data should be aggregated, such as per day, per day per cluster, global, date range, and so on. The “module type” determines how the data is eventually stored on disk. Here’s a truncated diagram of the whole system, to give you an idea of what the logic looks like: 

A diagram of the mapped logic of the Drive Stats modules.

Let’s take model_hack_table as an example. This is a global module, and it’s a reference table that includes drives that might be exceptions in the data center. (So, any of the reasons Andy might identify in a report for why a drive isn’t included in our data, including testing out a new drive and so on.) 

The green drive_stats module takes in the json_podstats file, references the model names of exceptions in model_hack_table, then cross references that information against all the drives that we have, and finally assigns them the serial number, brand name, and model number. At that point, it can do things like get the drive count by data center. 

Similarly, pod_drives looks up the host file in our Ansible configuration to find out which Pods we have in which data centers. It then does attributions with a reference table so we know how many drives are in each data center. 

As you move down through the module layers, the logic gets more and more specialized. When you run a module, the first thing the module does is check in with the previous module to make sure the data exists and is current. It caches the data to disk at every step, and fills out the logic tree step by step. So for example, drive_stats, being a “per-day” module, will write out a file such as /data/drive_stats/2023-01-01.json.gz when it finishes processing. This lets future modules read that file to avoid repeating work.
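
As a minimal, hypothetical sketch of that per-day caching pattern (the paths and function names here are illustrative, not the actual Drive Stats code):

```python
import gzip
import json
import os

def cached_daily_result(day: str, compute):
    path = f"/data/drive_stats/{day}.json.gz"
    if os.path.exists(path):
        with gzip.open(path, "rt") as f:
            return json.load(f)          # reuse the previous run's work
    result = compute(day)                # otherwise, run the module for that day
    with gzip.open(path, "wt") as f:
        json.dump(result, f)
    return result
```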

This work-deduplication process saves us a lot of time overall—but it also turned out to be the root cause of our weeks-long process when we were migrating Drive Stats to our new host. 

Cache Invalidation Is Always Treacherous

We have to go into slightly more detail to understand what was happening. The dependency resolution process is as follows:

  1. Before any module can run, it checks for a dependency. 
  2. For any dependency it finds, it checks modification times. 
  3. The module has to be at least as old as the dependency, and the dependency has to be at least as old as the target data. If one of those conditions isn’t met, the data is recalculated. 
  4. Any modules that get recalculated will trigger a rebuild of the whole branch of the logic tree. 

When we moved the Drive Stats data and modules, I kept the modification time of the data (using rsync) because I knew in vague terms that Drive Stats used that for its caching. However, when Ansible copied the source code during the migration, it reset the modification time of the code for all source files. Since the freshly copied source files were younger than the dependencies, the entire dataset had to be recalculated—and that represents terabytes of raw data dating back to 2013, which took weeks.
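
Here's a simplified sketch of the kind of modification-time check described above (hypothetical function and path names; the real logic varies by module type). The key point is that freshly copied source files look "younger" than everything else, which forces a rebuild:

```python
import os

def is_stale(module_source: str, dependency_data: str, target_data: str) -> bool:
    if not os.path.exists(target_data):
        return True
    module_mtime = os.path.getmtime(module_source)
    dependency_mtime = os.path.getmtime(dependency_data)
    target_mtime = os.path.getmtime(target_data)
    # A valid cache requires: module <= dependency <= target (oldest to newest).
    # A source file copied with a fresh mtime breaks the first comparison,
    # so the target data gets recalculated.
    return not (module_mtime <= dependency_mtime <= target_mtime)
```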

Note that Git doesn’t preserve mod times, and the raw data files aren’t stored in Git at all, which is part of the reason this problem exists. Because the data doesn’t exist in Git, there’s no way to clone while preserving dates. Any time you do a code update or deploy, you run the risk of this same weeks-long process being triggered. However, this code has been stable for so long that tweaks to it wouldn’t invalidate the underlying base modules, and things more or less worked fine.

To add to the complication, lots of modules weren’t in their own source files. Instead, they were grouped together by function. A drive_days module might also be with a drive_days_by_model, drive_days_by_brand, drive_days_by_size, and so on, meaning that changing any of these modules would invalidate all of the other ones in the same file. 

This may sound straightforward, but with all the logical dependencies in the various Drive Stats modules, you’re looking at pretty complex code. This was a poorly understood legacy system, so the invalidation logic was implemented somewhat differently for each module type, and in slightly different terms, making it a very unappealing problem to resolve.

Now to Solve

The good news is that, once identified, the solution was fairly intuitive. We decided to set an explicit version for each module, and save it to disk with the files containing its data. In Linux, there is something called an “extended attribute,” which is a small bit of space the filesystem preserves for metadata about the stored file—perfect for our uses. We now write a JSON object containing all of the dependent versions for each module. Here it is: 

A snapshot of the code written for the module versions.
To you, it’s just version code pinned in Linux’s extended attributes. To me, it’s beautiful.

Now we will have two sets of versions, one stored on the files written to disk, and another set in the source code itself. So whenever a module is attempting to resolve whether or not it is out of date, it can check the versions on disk and see if they are compatible with the versions in source code. Additionally, since we are using semantic versioning, this means that we can do non-invalidating minor version bumps and still know exactly which code wrote a given file. Nice!

The one downside is that you have to explicitly tell many Unix tools, such as rsync, to preserve extended attributes (otherwise the version numbers don’t get copied). We made the new default behavior, when extended attributes are missing, for the module to print a warning and assume it’s current. We had a bunch of warnings the first time the system ran, but we haven’t seen them since. This way, if we move the dataset and forget to preserve all the versions, we won’t invalidate the entire dataset by accident—awesome! 
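
Here's a minimal sketch of the version-pinning idea using Linux extended attributes. The attribute name and version layout are illustrative, not the actual Drive Stats implementation:

```python
import json
import os

XATTR_NAME = "user.drive_stats_versions"   # user-namespace attribute (Linux only)

def write_versions(path: str, versions: dict) -> None:
    os.setxattr(path, XATTR_NAME, json.dumps(versions).encode())

def read_versions(path: str):
    try:
        return json.loads(os.getxattr(path, XATTR_NAME))
    except OSError:
        return None   # attribute missing: print a warning and assume the data is current

# Example usage with hypothetical module versions:
# write_versions("/data/drive_stats/2023-01-01.json.gz",
#                {"drive_stats": "2.1.0", "model_hack_table": "1.0.3"})
```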

Wrapping It All Up

One of the coolest parts about this exploration was finding how many parts of this process still worked, and worked well. The C++ went untouched; the XML parser is still the best tool for the job; the logic of the modules and caching protocols weren’t fundamentally changed and had some excellent benefits for the system at large. We’re lucky at Backblaze that we’ve had many talented people work on our code over the years. Cheers to institutional knowledge.

That’s even more impressive when you think of how Drive Stats started—it was a somewhat off-the-cuff request. “Wouldn’t it be nice if we could monitor what these different drives are doing?” Of course, we knew it would have a positive impact on how we could monitor, use, and buy drives internally, but sharing that information is really what showed us how powerful this information could be for the industry and our community. These days we monitor more than 240,000 drives and have over 21.1 million days of data. 

This journey isn’t over, by the way—stay tuned for parts two and three where we talk about improvements we made and some future plans we have for Drive Stats data. As always, feel free to sound off in the comments. 

The post Drive Stats Data Deep Dive: The Architecture appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

APIs for Media and Film: What You Need to Know

Post Syndicated from James Flores original https://www.backblaze.com/blog/apis-for-media-and-film-what-you-need-to-know/

A decorative image showing a drive dissolving into the cloud with the clouds connected by digital lines.

Over the years, the film industry has witnessed constant transformation, from the introduction of sound and color to the digital revolution, 4K, and ultra high definition (UHD). However, a groundbreaking change is now underway, as cloud technology merges with media and entertainment (M&E) workflows, reshaping the way content is created, stored, and shared.

What’s helping to drive this transformation? APIs, or application programming interfaces. For any post facility, indie filmmaker/creator, or media team, understanding what APIs are is the first step in using them to embrace the flexibility, efficiency, and speed of the cloud.

Check Out Our New Technical Documentation Portal

When you’re working on a media project, you need to be able to find instructions about the tools you’re using quickly. And, it helps if those instructions are easy to use, easy to understand, and easy to share. Our Technical Documentation Portal has been completely overhauled to deliver on-demand content in a user-friendly way so you can find the information you need. Check out the API overview page to get you started, then dig into the implementation with the full documentation for our S3 Compatible, Backblaze, and Partner APIs.

From Tape to Digital: A Digital File Revolution

The journey towards the cloud transformation in the M&E industry started with the shift from traditional tape and film to digital formats. This revolutionary transition converted traditional media into digital entities, moving them from workstations to servers, shuttle drives, and shared storage systems. Simultaneously, the proliferation of email and cloud-hosted applications like Gmail, Dropbox, and Office 365 laid the groundwork for a cloud-centric future.

Seamless Collaboration With API-Driven Tools

As time went on, applications began communicating effortlessly with one another, facilitating tasks such as creating calendar invites in Gmail through Zoom and starting Zoom meetings with a command in Slack. These integrations were made possible by APIs that allow applications to interact and share data effectively.

What Are APIs?

APIs are sets of rules and protocols that enable different software applications to communicate and interact with each other, allowing you to access specific functionalities or data from one application to be used in another. APIs facilitate seamless integration between diverse systems, enhancing workflows and promoting interoperability.

Most of us in the film industry are familiar with a GUI, a graphical user interface. It’s how we use applications day in and day out—literally the screens on our programs and computers. But a lot of the tasks we execute via a GUI (like saving files, reading files, and moving files) really are pieces of executable code hidden from us behind a nice button. Think of APIs as another method to execute those same pieces of code, but with code. Code executing code. (Let’s not get into the Skynet complex, and this isn’t AI either.)
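
For a concrete (and hedged) example of code executing code, here's how a few lines of Python using the boto3 library could list the files in a Backblaze B2 bucket through the S3 Compatible API. The endpoint URL, key ID, application key, and bucket name are placeholders you'd replace with your own:

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",  # your bucket's S3 endpoint
    aws_access_key_id="YOUR_KEY_ID",
    aws_secret_access_key="YOUR_APPLICATION_KEY",
)

response = s3.list_objects_v2(Bucket="my-b2-bucket")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])   # the same listing a GUI file browser would show
```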

Grinding the Gears: A Metaphor for APIs

An easy way to think about APIs is to picture them as gears. Each application has a gear. To get two applications talking, we simply align their gears, allowing their APIs to establish communication.

A diagram that shows two gears. One is labeled API 1 and the other is labeled API 2. There are arrows going back and forth between them.

Once communications are established, you can start to do some cool stuff. For example, you can migrate a Frame.io archive to a Backblaze B2 Bucket. Or you could use the iconik API to move a file you want to edit into your LucidLink filespace, then remove it as soon as you finish your project. 

A chart showing iconik with workflow lines going out to Backblaze and LucidLink.

The MovieLabs 2030 Vision and Cloud Integration

As the industry embraced cloud technology, the need for standardization became apparent. Organizations like the Institute of Electrical and Electronics Engineers (IEEE) and the Society of Motion Picture and Television Engineers (SMPTE) worked diligently to establish technical unity across vendors and technologies. However, implementation of these standards lacked persistence. To address this void, the Motion Picture Association (MPA) established MovieLabs, an organization dedicated to researching, testing, and developing new guidelines, processes, and tooling to drive improvements in how content is created. One such set of guidelines is the MovieLabs 2030 Vision.

Core Principles of the MovieLabs 2030 Vision

The MovieLabs 2030 Vision outlines 10 aspirational core principles for the film industry to accomplish by 2030. These core principles place a high importance on cloud technology and interoperability. Interoperability boils down to the ability to use various tools while having them share resources—which is where APIs come in. APIs are what make tools interoperable and able to share resources; it's a key piece of functionality, and it's how many cloud tools work together today.  

A list of the MovieLabs 2030 Vision's 10 core principles to upgrade technology in the film industry.
MovieLab’s 2030 Vision aspirational principles.

The Future Is Here: Cloud Technology at Its Peak

Cloud technology grants us instant access to digital documents and the ability to carry our entire lives in our pockets. With the right tools, our data is securely synced, backed up, and accessible across devices, from smartphones to laptops and even TVs.

Although cloud technology has revolutionized various industries, the media and entertainment sector lagged behind, relying on cumbersome shuttle drives and expensive file systems for our massive files. The COVID pandemic, however, acted as a catalyst for change, pushing the industry to seriously consider the benefits of cloud integration.

Breaking Down Silos With APIs

In a post-pandemic world, many popular media and entertainment applications are built in the cloud, the same as other software as a service (SaaS) applications like Zoom, Slack, or Outlook. Which is great! But many of these tools are designed to operate best in their own ecosystem, meaning once the files are in their systems, it's not easy to take them out. This may sound familiar if you are an iPhone user faced with migrating to an Android or vice versa. (But who would do that? 😀)

With each of these applications working in its own ecosystem, each comes with its own dedicated storage and usage costs, which can vary greatly across tools. As a result, many productions end up with projects and project files locked in different environments, creating storage silos—the opposite of centralized interoperability. 

An image showing projects in two silos. Projects 2, 4, and 6 are in Tool A, and Projects 1, 3, and 5 are in Tool B.

APIs not only foster interoperability in cloud-based business applications, but also give filmmaking cloud tools like Frame.io, iconik, and Backblaze the ability to send, retrieve, update, and delete data (the POST, GET, PUT, and DELETE commands) from other programs, enabling more dynamic and advanced workflows, such as sending files to colorists or reviewing edits for picture lock.
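
As a purely generic illustration of those four verbs (the endpoint and payloads below are hypothetical; each tool documents its own API and authentication):

```python
import requests

BASE = "https://api.example.com/v1/assets"   # hypothetical endpoint, not a real service
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

requests.post(BASE, headers=HEADERS, json={"name": "scene_01.mov"})            # register a new asset
requests.get(f"{BASE}/1234", headers=HEADERS)                                  # fetch the asset's metadata
requests.put(f"{BASE}/1234", headers=HEADERS, json={"status": "approved"})     # update it
requests.delete(f"{BASE}/1234", headers=HEADERS)                               # remove it
```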

Customized Workflows and Automation

APIs offer the flexibility to tailor workflows to specific needs, whether within a single company or for vendor-specific processes. The automation possibilities are virtually limitless, facilitating seamless integration between cloud tools and storage solutions.

The Road Ahead for Media and Entertainment

The MovieLabs 2030 Vision offers a glimpse into a future defined by cloud tools and automation. The principal takeaway: cloud technology with open and extensible storage exists and is available today. 

So for any post facility, indie filmmaker/creator, or media team still driving around shuttle drives while James Cameron is shooting Avatar in New Zealand and editing it in Santa Monica, the future is here and within reach. You can get started today with all the power and flexibility of the cloud without the Avatar budget.

The post APIs for Media and Film: What You Need to Know appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze + Qencode: Video Transcoding Made Simple

Post Syndicated from Elton Carneiro original https://www.backblaze.com/blog/backblaze-qencode-video-transcoding-made-simple/

A decorative image that reads Backblaze plus Qencode with accompanying logos.

If you do any kind of video streaming, encoding and storing your data is one of your main challenges. Encoding videos in various formats and resolutions for different devices and platforms can be a resource-intensive task, and setting up and maintaining on-premises encoding infrastructure can be expensive.

Today, we’re excited to announce an expanded partnership with Qencode, a media services platform that enables users to build powerful video solutions, including solutions to the challenges of transcoding, live streaming, and media storage. The expanded partnership embeds the Backblaze Partner API within the Qencode platform, making it frictionless for users to add cloud storage to their media production workflows. 

What Is Qencode?

Qencode is a media services platform founded in 2017 that assists with digital video transformation. The Qencode API provides developers within the over-the-top (OTT), broadcasting, and media & entertainment sectors with scalable and robust APIs for:

  • Video transcoding
  • Live streaming
  • Content delivery
  • Media storage
  • Artificial intelligence

Qencode + Backblaze

Recognizing the growing demand for integrated and efficient cloud storage within media production, Qencode and Backblaze built an alliance which creates a new paradigm for cutting-edge video APIs fortified by a reliable and efficient cloud storage solution. This integration empowers flexible workflows consisting of uploading, transcoding, storing, and delivering video content for media and OTT companies of all sizes. By integrating the platforms, this partnership provides top-tier features while simplifying the complexities and reducing the risks often associated with innovation.

We want to set new standards for value in an industry that is fragmented and complex. By merging Qencode’s advanced video processing capabilities with Backblaze’s reliable cloud storage, we’re addressing a critical industry need for seamless integration and efficiency. Integrating Backblaze’s Partner API takes our platform to the next level, providing users with a single, streamlined interface for all their video and media needs.

Murad Mordukhay, CEO of Qencode

Qencode + Backblaze Use Cases

The easy-to-use interface and affordability make Qencode an ideal choice for businesses who need video processing at scale without compromising spend or flexibility. Qencode enables businesses of all sizes to customize and control a complete end-to-end solution, from sign-on to billing, which includes seamless access to Backblaze storage through the Qencode software as a service (SaaS) platform. 

Simplifying the User Experience

Expanding this partnership with Qencode takes our API technology a step further in making cloud storage more accessible to businesses whose mission is to simplify user experience. We are excited to work with a specialist like Qencode to bring a simple and low cost storage solution to businesses who need it the most.

The post Backblaze + Qencode: Video Transcoding Made Simple appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Product and Pricing Updates

Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/2023-product-announcement/

A decorative image showing the Backblaze logo on a cloud. A title reads Product Updates and Upgrades

Over the coming months, Backblaze will make big updates and upgrades to both our products—B2 Cloud Storage and Computer Backup. Considering the volume of new stuff on the horizon, I’m dropping into the blog today to explain what’s happening, when, and why for our customers as well as any others who are considering adopting our services. Here’s what’s new.

B2 Cloud Storage Updates

Price, Egress, and Product Upgrades

Meeting and exceeding customers’ needs for building applications, protecting data, supporting media workflows, and more is the top priority for B2 Cloud Storage. To further these efforts, we’ll be implementing the following updates:

Price Changes

Storage Price: Effective October 3, 2023, we are increasing the monthly pay-as-you-go storage rate from $5/TB to $6/TB. The price of B2 Reserve will not change.

Free Egress: Also effective October 3, we’re making egress free (i.e. free download of data) for all B2 Cloud Storage customers—both pay-as-you-go and B2 Reserve—up to three times the amount of data you store with us, with any additional egress priced at just $0.01/GB. Because supporting an open cloud environment is central to our mission, expanding free egress to all customers so they can move data when and where they prefer is a key next step.

Backblaze B2 Upgrades

From Object Lock for ransomware protection, to Cloud Replication for redundancy, to more data centers to support data location needs, Backblaze has consistently improved B2 Cloud Storage. Stay tuned for more this fall, when we’ll announce upload performance upgrades, expanded integrations, and more partnerships.

Things That Aren’t Changing

Storage pricing on committed contracts, B2 Reserve pricing, and unlimited free egress between Backblaze B2 and many leading content delivery network (CDN) and compute partners are all not changing. 

Why the Changes for B2 Cloud Storage?

1. Continuing to provide the best cloud storage.

I am excited that B2 Cloud Storage continues to be the best high-quality and low-cost alternative to traditional cloud providers like AWS for businesses of all sizes. After seven years in service with no price increases, the bar was very high for considering any change to our pricing. We invest in making Backblaze B2 a better cloud storage provider every day. A price increase enables us to continue doing so into the future.

2. Advancing the freedom of customers’ data.

We’ve heard from customers that one of the greatest benefits of B2 Cloud Storage is freedom—freedom from complexity, runaway bills, and data lock-in. We wanted to double down on these benefits and further empower our customers to leverage the open cloud to use their data how and where they wish. Making egress free supports all these benefits for our customers.

Backblaze Computer Backup

Price, Version History, Version 9.0, and Admin Upgrades

To expand our ability to provide astonishingly easy computer backup that is as reliable as it is trustworthy and affordable, we’re instituting the following updates to Backblaze Computer Backup and sharing some upcoming product upgrades:

  • Computer Backup Pricing: Effective October 3, new purchases and renewals will be $9/month, $99/year, and $189 for two-year subscription plans, and Forever Version History pricing will be $0.006/GB.
  • Free One Year Extended Version History: Also effective October 3, all Computer Backup licenses may add One Year Extended Version History, previously a $2 per month expense, for free. Being able to recover deleted or altered files up to a year later saves Computer Backup users from huge headaches, and now this benefit is available to all subscribers. Starting October 3, log in to your account and select One Year of Extended Version History for free. 
  • Version 9.0: In September, the release of Version 9.0 will go live. Among some improvements to performance and usability, this release includes a highly requested new local restore experience for end users. We’ll share all the details with you in September when Version 9.0 goes live.
  • Groups Administration Upgrades: In addition to Version 9.0, we’ve got an exciting roadmap of upgrades to our Groups functionality aimed at serving our growing and evolving customer base. For those who need to manage everything from two to two thousand workstations, we’re excited to offer more peace of mind and control with expanded tools built for the enterprise at a price still ahead of the competition.

Why the Change for Computer Backup?

Since launching Computer Backup in 2008, we’ve stayed committed to a product that backs up all your data automatically to the cloud for a flat rate. Over the following 15 years, the average amount of data stored per user has grown tremendously, and our investments to build out our storage cloud to support that growth have increased to keep pace. 

At the same time, we’ve continued to invest in improving the product—as we have been recently with the upcoming release of Version 9.0, in our active development of new Group administration features, and in the free addition of optional One Year Extended Version history for all users. And, we still have more to do to ensure our product consistently lives up to its promise. 

To continue offering unlimited backup, innovating, and adding value to the best computer backup service, we need to align our pricing with our costs.

Thank You

We understand how valuable your data is to your business and your life, and the trust you place in Backblaze every day is not lost on me. We are deeply committed to our mission of making storing, using, and protecting that data astonishingly easy, and the updates I’ve shared today are a big step forward in ensuring we can do so for the long haul. So, in closing, I’ll say thank you for entrusting us with your precious data—we’re honored to serve you. 

FAQ: B2 Cloud Storage

Am I affected by this B2 Cloud Storage pricing update?

Maybe. This update applies to B2 Cloud Storage pay-as-you-go customers—those who pay variable monthly amounts based on their actual consumption of the service—who have not entered into committed contracts for one or more years.

When will I, as an existing B2 Cloud Storage pay-as-you-go customer, see this update in my monthly bill?

The updated pricing is effective October 3, 2023, so you will see it applied starting from this date to bills sent after this date.

How does Backblaze measure monthly average storage and free egress?

Backblaze measures pay-as-you-go customers’ usage in byte hours. The monthly storage average is based on the byte hours. As of October 3, 2023, monthly egress up to three times your average is free; any monthly egress above this 3x average is priced at $0.01 per GB.
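
As a hedged, illustrative sketch of that math (the storage and egress figures below are examples, not anyone's actual bill):

```python
HOURS_IN_MONTH = 720                 # simplified 30-day month
byte_hours = 7.2e15                  # e.g., 10TB (1e13 bytes) stored for all 720 hours

avg_bytes_stored = byte_hours / HOURS_IN_MONTH
avg_tb_stored = avg_bytes_stored / 1e12        # 10TB average for the month
free_egress_tb = 3 * avg_tb_stored             # egress up to 3x the average is free

monthly_egress_tb = 35
overage_gb = max(0.0, monthly_egress_tb - free_egress_tb) * 1000
print(f"Free egress allowance: {free_egress_tb:.0f}TB; billable overage: {overage_gb:.0f}GB at $0.01/GB")
```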

Will Backblaze continue to offer unlimited free egress to CDN and compute partners?

Yes. This change has no impact on the unlimited free egress that Backblaze offers through leading CDN and compute partners including Fastly, Cloudflare, CacheFly, bunny.net, and Vultr.

How can I switch from pay-as-you-go B2 Cloud Storage to a B2 Reserve annual capacity bundle plan?

B2 Reserve bundles start at 20TB. You can explore B2 Reserve with our Sales Team here to discuss making a switch.

Is Backblaze still much more affordable than other cloud providers like AWS?

Yes. Backblaze remains highly affordable compared to other cloud storage providers. The service also remains roughly one-fifth the cost of AWS S3 for the combination of hot storage and egress, with the exact difference varying based on usage. For example, if you store 10TB in the U.S. West and also egress 10% of it in a month, your pricing from Backblaze and AWS is as follows:

Backblaze B2: Storage $6/TB + Egress $0/GB = $60

AWS S3: Storage $26/TB + Egress $0.09/GB = Storage $260 + Egress $90 = $350

In this instance, Backblaze is 17% or about one-fifth the cost of AWS S3.

What sort of improvements do you plan alongside the increase in pricing?

Beyond including free egress for all customers, we have a number of other upgrades and improvements in the pipeline. We’ll be announcing them in the coming months, but they include improvements to the upload experience, features to expand use cases for application storage customers, new integrations, and more partnerships.

Is Backblaze making any other updates to B2 Cloud Storage pricing, such as adding a minimum storage duration fee?

No. This is the extent of the update effective October 3, 2023. We also continue to believe that minimum storage duration fees as levied by some vendors run counter to the interests of many customers.

When was your last price increase?

This is the only price increase we have had since we launched B2 Cloud Storage in 2015.

FAQ: Computer Backup

What are the new prices?

Monthly licenses will be $9, yearly licenses will be $99, and two-year licenses will be $189. One Year Extended Version History will be available for free to those who wish to enable it. The $2 per month charge for Forever Version History will be removed, and files that were changed, modified, or deleted over a year ago will be charged at the incremental rate of $0.006/GB/month.

When are prices changing?

October 3, 2023 at 00:00 UTC is when the price increase will go into effect for new purchases and renewals. Existing contracts and licenses will be honored for their duration, and any prorated purchases after that time will be prorated at the new rate.

How does Extended Version History work?

Extended Version History allows you to “go back in time” further to retrieve earlier versions of your data. By default that setting is set to 30 days. With this update, you can choose to keep versions up to one year old for free.

What is a version?

When an individual file is changed, updated, edited, or deleted, without the file name changing, a new version is created.

When will the One Year Extended Version History option be included with my license?

On October 3, 2023, we’ll be removing the charge for selecting One Year Extended Version History. Any changes made to that setting ahead of that date will result in a prorated charge to the payment method on file.

I do not have One Year Extended Version History. Do I need to do anything to get it?

Yes. We will not be changing anyone’s settings on their behalf, so please see below for instructions on how to change your version history settings to one year. Note: making changes to this setting before October 3 will result in a prorated charge, as noted above.

How do I add One Year Extended Version History to my account or to my Group’s backups?

For individual Backblaze users: simply log in to your Backblaze account and navigate to the Overview page. From there you’ll see a list of all your computers and their selected Version History. To make a change, press the Update button next to the computer you wish to add One Year Extended Version History for.

For Group admins: simply log in to your Backblaze account and navigate to the Groups Management page. From there, you’ll see a list of all of the Groups you manage and their selected Version History. To make a change, press the Update button next to the Group you wish to enable One Year Extended Version History for, and all computers within it will be enabled.

Can I still use Forever Version History?

Yes. Forever Version History is still available. The prior $2 per month charge will be removed, and only files changed, deleted, or modified over a year ago will be charged at the incremental $0.006/GB/month.

I already have One Year Extended Version History on my account. Will my price go up?

It depends on your payment plan. If you are on a monthly plan with One Year Extended Version History, you will not see an increase. However, anyone on a yearly plan will see an increase from $94 to $99, and for two-year licenses, your price will increase from $178 to $189.

The post Backblaze Product and Pricing Updates appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Welcome Chris Opat, Senior Vice President of Cloud Operations

Post Syndicated from Patrick Thomas original https://www.backblaze.com/blog/welcome-chris-opat-senior-vice-president-of-cloud-operations/

An image of Chris Opat, Senior Vice President of Cloud Operations at Backblaze. Text reads "Chris Opat, Senior Vice President of Cloud Operations."

Backblaze is happy to announce that Chris Opat has joined our team as senior vice president of cloud operations. Chris will oversee the strategy and operations of the Backblaze global cloud storage platform.

What Chris Brings to Backblaze

Chris expands the company’s leadership bench, bringing impressive cloud and infrastructure knowledge and more than 25 years of industry experience. 

Previously, Chris served as senior vice president leading platform engineering and operations at StackPath, a specialized provider in edge technology and content delivery. He also held leadership roles at CyrusOne, CompuCom, Cloudreach, and Bear Stearns/JPMorgan. Chris earned his Bachelor of Science degree in television and digital media production from Ithaca College.

Backblaze CEO, Gleb Budman, shared that Chris is a forward-thinking cloud leader with a proven track record of leading teams that are clever and bold in solving problems and creating best-in-class experiences for customers. His expertise and approach will be pivotal as more customers move to an open cloud ecosystem and will help advance Backblaze’s cloud strategy as we continue to grow.

Chris’ Role as SVP of Cloud Operations

As SVP of Cloud Operations, Chris oversees cloud strategy, platform engineering, and technology infrastructure, enabling Backblaze to further scale capacity and improve performance to meet larger-sized customers’ needs, as we continue to see success in moving up-market.

Chris says of his new role at Backblaze:

Backblaze’s vision and mission resonate with me. I’m proud to be joining a company that is supporting customers and advocating for an open cloud ecosystem. I’m looking forward to working with the amazing team at Backblaze as we continue to scale with our customers and accelerate growth.

The post Welcome Chris Opat, Senior Vice President of Cloud Operations appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

What’s the Diff: Hot and Cold Data Storage

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/whats-the-diff-hot-and-cold-data-storage/

A decorative image showing two thermometers overlaying pictures of servers. The one on the left says "cold" and the one on the right says "hot".

This post was originally published in 2017 and updated in 2019 and 2023 to share the latest information on cloud storage tiering.

Temperature, specifically a range from cold to hot, is a common way to describe different levels of data storage. It’s possible these terms originated based on where data was historically stored. Hot data was stored close to the heat of the spinning drives and CPUs. Cold data was stored on drives or tape away from the warmer data center, likely tucked away on a shelf somewhere. 

Today, they’re used to describe how easily you can access your data. Hot storage is for data you need fast or access frequently. Cold storage is typically used for data you rarely need. The terms are used by most data storage providers to describe their tiered storage plans. However, there are no industry standard definitions for what hot and cold mean, which makes comparing services across different storage providers challenging. 

It’s a common misconception that hot storage means expensive storage and that cold storage means slower, less expensive storage. Today, we’ll explain why these terms may no longer be serving you when it comes to anticipating storage cost and performance.

Defining Hot Storage

Hot storage serves as the go-to destination for frequently accessed and mission-critical data that demands swift retrieval. Think of it as the fast lane of data storage, tailored for scenarios where time is of the essence. Industries relying on real-time data processing and rapid response times, such as video editing, web content, and application development, find hot storage to be indispensable.

To achieve the necessary rapid data access, hot storage is often housed in hybrid or tiered storage environments. The hotter the service, the more it embraces cutting-edge technologies, including the latest drives, fastest transport protocols, and geographical proximity to clients or multiple regions. However, the resource-intensive nature of hot storage warrants a premium, and leading cloud data storage providers like Microsoft’s Azure Hot Blobs and AWS S3 reflect this reality.

Data stored in the hottest tier might use solid-state drives (SSDs), which are optimized for lower latency and higher transactional rates compared to traditional hard drives. In other cases, hard disk drives (HDDs) are more suitable because of their durability in environments where drives are subjected to intensive read/write cycles.

Regardless of the storage medium, hot data workloads necessitate fast and consistent response times, making them ideal for tasks like capturing telemetry data, messaging, and data transformation.

Defining Cold Storage

On the opposite end of the data storage spectrum lies cold storage, catering to information accessed infrequently and without the urgency of hot data. Cold storage houses data that might remain dormant for extended periods, months, years, decades, or maybe forever. Practical examples might include old projects or records mandated for financial, legal, HR, or other business record-keeping requirements.

Cold cloud storage systems prioritize durability and cost-effectiveness over real-time data manipulation capabilities. Services like Amazon Glacier and Google Coldline take this approach, offering slower retrieval and response times than their hot storage counterparts. Lower performing and less expensive storage environments, both on-premises and in the cloud, commonly host cold data. 

Linear Tape Open (LTO, or tape) has historically been a popular storage medium for cold data. To access data from LTO, the tapes must be physically retrieved from storage racks and mounted in a tape reading machine, making it one of the slowest, and therefore coldest, ways to store data.

While cold cloud storage systems generally boast lower overall costs than warm or hot storage, they may incur higher per-operation expenses. Accessing data from cold storage demands patience and thoughtful planning, as the response times are intentionally sluggish.

With the landscape of data storage continually evolving, the definition of cold storage has also expanded. In modern contexts, cold storage might describe completely offline data storage, wherein information resides outside the cloud and remains disconnected from any network. This isolation, also described as air gapped, is crucial for safeguarding sensitive data. However, today, data can be virtually air-gapped using technology like Object Lock.

Traditional Views of Cold and Hot Data Storage

  • Access Speed: Cold is slow; hot is fast.
  • Access Frequency: Cold is seldom or never; hot is frequent.
  • Data Volume: Cold is low; hot is high.
  • Storage Media: Cold uses slower drives, LTO, or offline media; hot uses faster drives, durable drives, and SSDs.
  • Cost: Cold is lower; hot is higher.

What Is Hot Cloud Storage?

Today, there are new players in data storage who, through innovation and efficiency, are able to offer cloud storage at the cost of cold storage but with the performance and availability of hot storage.

The concept of organizing data by temperature has long been employed by diversified cloud providers like Amazon, Microsoft, and Google to describe their tiered storage services and set pricing accordingly. But, today, in a cloud landscape defined by the open, multi-cloud internet, customers have come to realize the value and benefits they can get from moving away from those diversified providers. 

A wave of independent cloud providers are disrupting the traditional notions of cloud storage temperatures, offering cloud storage that’s as cost-effective as cold storage, yet delivering the speed and availability associated with hot storage. If you’re familiar with Backblaze B2 Cloud Storage, you know where we’re going with this. 

Backblaze B2 falls into this category. We compete on price with LTO and other traditionally cold storage services, but B2 can be used for applications that are usually reserved for hot storage, such as media management, workflow collaboration, websites, and data retrieval.

The newfound efficiency of this model has prompted customers to rethink their storage strategies, opting to migrate entirely away from cumbersome cold storage and archival systems.

What Temperature Is Your Cloud Storage?

When it comes to choosing the right storage temperature for your cloud data, organizations must carefully consider their unique needs. Ensuring that storage costs align with actual requirements is key to maintaining a healthy bottom line. The ongoing evolution of cloud storage services, driven by efficiency, technology, and innovation, further amplifies the need for tailored storage solutions.

Still have questions that aren’t answered here? Join the discussion in the comments.

The post What’s the Diff: Hot and Cold Data Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Seven Reasons Your Backup Strategy Might Be Failing You

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/seven-reasons-your-backup-strategy-might-be-failing-you/

A decorative image showing a cloud with a backup symbol, then three circles with 3, 2, and 1. There are question marks behind the cloud.

Are you confident that your backup strategy has you covered? If not, it’s time to confront the reality that your backup strategy might not be as strong as you think. And even if you’re feeling great about it, it can never hurt to poke holes in your strategy to see where you need to shore up your defenses.

Whether you’re a small business owner wearing many hats (including the responsibility for backing up your company’s data) or a seasoned IT professional, you know that protecting your data is a top priority. The industry standard is the 3-2-1 backup strategy, which states you should have three copies of your data on two different kinds of media with at least one copy off-site or in the cloud. But a lot has changed since that standard was introduced. 

In this post, we’ll identify several ways your 3-2-1 strategy (and your backups in general) could fail. These are common mistakes that even professional IT teams can make. While 3-2-1 is a great place to start, especially if you’re not currently following that approach, it can now be considered table stakes. 

For larger businesses or any business wanting to fail-proof its backups, read on to learn how you can plug the gaps in your 3-2-1 strategy and better secure your data from ransomware and other disasters.

Join the Webinar

There’s more to learn about how to shore up your data protection strategy. Join Backblaze on Thursday, August 10 at 10 a.m. PT/noon CT/5 p.m. UTC for a 30-minute webinar on “10 Common Data Protection Mistakes.”

Sign Up ➔ 

Let’s start with a quick review of the 3-2-1 strategy.

The 3-2-1 Backup Strategy

A 3-2-1 strategy means having at least three total copies of your data, two of which are local but on different media, and at least one off-site copy or in the cloud. For instance, a business may keep a local copy of its data on a server at the main office, a second copy of its data on a NAS device in the same location, and a third copy of its data in the public cloud, such as Backblaze B2 Cloud Storage. Hence, there are three copies of its data with two local copies on different media (the server and NAS) and one copy stored off-site in the cloud.

A diagram showing a 3-2-1 backup strategy, in which there are three copies of data, in two different locations, with one location off-site.

The 3-2-1 rule originated in 2005 when Peter Krogh, a photographer, writer, and consultant, introduced it in his book, “The DAM Book: Digital Asset Management for Photographers.” As this rule was developed almost 20 years ago, you can imagine that it may be outdated in some regards. Consider that 2005 was the year YouTube was founded. Let’s face it, a lot has changed since 2005, and today the 3-2-1 strategy is just the starting point. In fact, even if you’re faithfully following the 3-2-1 rule, there may still be some gaps in your data protection strategy.

While backups to external hard drives, tape, and other recordable media (CDs, DVDs, and SD cards) were common two decades ago, those modalities are now considered legacy storage. The public cloud was a relatively new innovation in 2005, so, at first, 3-2-1 did not even consider the possibilities of cloud storage. 

Arguably, the entire concept of “media” in 3-2-1 (as in having two local copies of your data on two different kinds of media) may not make sense in today’s modern IT environment. And, while an on-premises copy of your data typically offers the fastest Recovery Time Objective (RTO), having two local copies of your data will not protect against the multitude of potential natural disasters like fire, floods, tornados, and earthquakes. 

The "2" part of the 3-2-1 equation may make sense for consumers and sole proprietors (e.g., photographers, graphic designers, etc.) whose biggest risk is hardware failure and for whom having a second copy of data on a NAS device or external hard drive is an easy solution, but enterprises have more complex infrastructures.

Enterprises may be better served by having more than one off-site copy, in case of an on-premises data disaster. This can be easily automated with a cloud replication tool which allows you to store your data in different regions. (Backblaze offers Cloud Replication for this purpose.) Replicating your data across regions provides geographical separation from your production environment and added redundancy. The bottom line is that 3-2-1 is a good starting point for configuring your backup strategy, but it should not be taken as a one-size-fits-all approach.

The 3-2-1-1-0 Strategy

Some companies in the data protection space, like Veeam, have updated 3-2-1 with the 3-2-1-1-0 approach. This particular definition stipulates that you:

  • Maintain at least three copies of business data.
  • Store data on at least two different types of storage media.
  • Keep one copy of the backups in an off-site location.
  • Keep one copy of the media offline or air gapped.
  • Ensure all recoverability solutions have zero errors.
A diagram showing the 3-2-1-1-0 backup strategy.

The 3-2-1-1-0 approach addresses two important weaknesses of 3-2-1. First, 3-2-1 doesn’t address the prevalence of ransomware. Even if you follow 3-2-1 with fidelity, your data could still be vulnerable to a ransomware attack. The 3-2-1-1-0 rule covers this by requiring one copy to be offline or air gapped. With Object Lock, your data can be made immutable, which is considered a virtual air gap, thus fulfilling the 3-2-1-1-0 rule. 

Second, 3-2-1 does not consider disaster recovery (DR) needs. While backups are one part of your disaster recovery plan, your DR plan needs to consider many more factors. The “0” in 3-2-1-1-0 captures an important aspect of DR planning, which is that you must test your backups and ensure you can recover from them without error. Ultimately, you should architect your backup strategy to support your DR plan and the potential need for a recovery, rather than trying to abide by any particular backup rule.

Additional Gaps in Your Backup Strategy

As you can tell by now, there are many shades of gray when it comes to 3-2-1, and these varying interpretations can create areas of weakness in a business’ data protection plan. Review your own plan for the following seven common mistakes and close the gaps in your strategy by implementing the suggested best practices.

1. Using Sync Functionality Instead of Backing Up

You may be following 3-2-1, but if copies of your data are stored on a sync service like Google Drive, Dropbox, or OneDrive, you’re not fully protected. Syncing your data does not allow you to recover from previous versions with the level of granularity that a backup offers.

Best Practice: Instead, ensure you have three copies of your data protected by true backup functionality.

2. Counting Production Data as a Backup

Some interpret 3-2-1 to mean that production data counts as one of the three copies of data or as one of the two different media types.

Best Practice: It’s open to interpretation, but you may want to consider having three copies of data in addition to your production data for the best protection.

3. Using a Storage Appliance That’s Vulnerable to Ransomware

Many on-premises storage systems now support immutability, so it’s a good time to reevaluate your local storage. 

Best Practice: New features in popular backup software like Veeam even enable NAS devices to be protected from ransomware. Learn more about Veeam support for NAS immutability and how to orchestrate end-to-end immutability for impenetrable backups.

4. Not Backing Up Your SaaS Data

It's a mistake to think your Microsoft 365, Google Workspace, and other software as a service (SaaS) data is protected because it's already hosted in the cloud. SaaS providers operate under a "shared responsibility model," meaning they may not back up your data as often as you'd like or provide an effective means of recovery.

Best Practice: Be sure to back up your SaaS data to the cloud to ensure complete coverage of the 3-2-1 rule. 

5. Relying On Off-Site Legacy Storage

It's always a good idea to have at least one copy of your data on-site for the fastest RTO. But if you're relying on legacy storage, like tape, to fulfill the off-site requirement of the 3-2-1 strategy, you probably know how expensive and time-consuming it can be. And sometimes that expense and time suck mean your off-site backups are not updated as often as they should be, which leads to mistakes.

Best Practice: Replace your off-site storage with cloud storage to modernize your architecture and prevent gaps in your backups. Backblaze B2 is one-fifth of the cost of AWS, so it’s easily affordable to migrate off tape and other legacy storage systems.

6. No Plan for Affected Infrastructure

Faithfully following 3-2-1 will get you nowhere if you don’t have the infrastructure to restore your backups. If your infrastructure is destroyed or disrupted, you need a way to ensure business continuity in the face of data disaster.

Best Practice: Be sure your disaster recovery plan outlines how you will access your DR documentation and implement the plan even if your environment is down. Using a tool like Cloud Instant Business Recovery (Cloud IBR), which offers an on-demand, automated solution that allows Veeam users to stand up bare metal servers in the cloud, allows you to immediately begin recovering data while rebuilding infrastructure.

7. Keeping Your Off-Site Copy Down the Street

The 3-2-1 policy states that one copy of your data be kept off-site, and some companies maintain a DR site for that exact purpose. However, if your DR facility is in the same local area as your main office, you have a big gap in your data protection strategy. 

Best Practice: Ideally, you should have an off-site copy of your data stored in a public cloud data center far from your data production site, to protect against regional natural disasters.

Telco Adopts Cloud for Geographic Separation

AcenTek's existing storage scheme covered the 3-2-1 basics, but its off-site copy was no further away than its own data center. In the case of a large natural disaster, that one off-site copy could be vulnerable to destruction, leaving the company without a path to recovery. With Backblaze B2, AcenTek has an additional layer of resilience for its backup data by storing it in a secure, immutable cloud storage platform across the country from its headquarters in Minnesota.

Read the Full Story ➔ 

Modernize Your Backup Strategy

The 3-2-1 strategy is a great starting point for small businesses that need to develop a backup plan, but larger mid-market and enterprise organizations must think about business continuity more holistically. 

Backblaze B2 Cloud Storage makes it easy to modernize your backup strategy by sending data backups and archives straight to the cloud—without the expense and complexity of many public cloud services.

At one-fifth of the price of AWS, Backblaze B2 is an affordable, time-saving alternative to the hyperscalers, LTO, and traditional DR sites. The intricacies of operations, data management, and potential risks demand a more advanced approach to ensure uninterrupted operations. By leveraging cloud storage, you can create a robust, cost-effective, and flexible backup strategy that you can easily customize to your business needs. Get started today or contact Sales for more information on Backblaze B2 Reserve, Backblaze's all-inclusive capacity-based pricing that includes premium support and no egress fees.

Interested in learning more about backup, business continuity, and disaster recovery best practices? Check out the free Backblaze resources below.

The post Seven Reasons Your Backup Strategy Might Be Failing You appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q2 2023

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2023/

A decorative image with title Q2 2023 Drive Stats.

At the end of Q2 2023, Backblaze was monitoring 245,757 hard drives and SSDs in our data centers around the world. Of that number, 4,460 are boot drives, with 3,144 being SSDs and 1,316 being HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2022 Drive Stats review.

Today, we’ll focus on the 241,297 data drives under management as we review their quarterly and lifetime failure rates as of the end of Q2 2023. Along the way, we’ll share our observations and insights on the data presented, tell you about some additional data fields we are now including and more.

Q2 2023 Hard Drive Failure Rates

At the end of Q2 2023, we were managing 241,297 hard drives used to store data. For our review, we removed 357 drives from consideration because they were used for testing purposes or belonged to drive models that did not have at least 60 drives. This leaves us with 240,940 hard drives grouped into 31 different models. The table below reviews the annualized failure rate (AFR) for those drive models for Q2 2023.

Notes and Observations on the Q2 2023 Drive Stats

  • Zero Failures: There were six drive models with zero failures in Q2 2023 as shown in the table below.

The table is sorted by the number of drive days each model accumulated during the quarter. In general, a drive model should have at least 50,000 drive days in the quarter to be statistically relevant. The top three drives all meet that criterion, and having zero failures in a quarter is not surprising given that the lifetime AFR for the three drives ranges from 0.13% to 0.45%. None of the bottom three drives has accumulated 50,000 drive days in the quarter, but the two Seagate drives are off to a good start. And it is always good to see the 4TB Toshiba (model: MD04ABA400V), with eight-plus years of service, post zero failures for the quarter.

  • The Oldest Drive? The drive model with the oldest average age is still the 6TB Seagate (model: ST6000DX000) at 98.3 months (8.2 years), with the oldest drive of this cohort being 104 months (8.7 years) old.

    The oldest operational data drive in the fleet is a 4TB Seagate (model: ST4000DM000) at 105.2 months (8.8 years). That is quite impressive, especially in a data center environment, but the winner for the oldest operational drive in our fleet is actually a boot drive: a WDC 500GB drive (model: WD5000BPKT) with 122 months (10.2 years) of continuous service.

  • Upward AFR: The AFR for Q2 2023 was 2.28%, up from 1.54% in Q1 2023. While quarterly AFR numbers can be volatile, they can also be useful in identifying trends which need further investigation. In this case, the rise was expected as the age of our fleet continues to increase. But was that the real reason?

    Digging in, we start with the annualized failure rates and average age of our drives grouped by drive size, as shown in the table below.

For our purpose, we'll define a drive as old when it is five years old or more. Why? That's the warranty period of the drives we are purchasing today. Of course, the 4TB and 6TB drives, and some of the 8TB drives, came with only two-year warranties, but for consistency we'll stick with five years as the point at which we label a drive as "old". 

Using our definition for old drives eliminates the 12TB, 14TB and 16TB drives. This leaves us with the chart below of the Quarterly AFR over the last three years for each cohort of older drives, the 4TB, 6TB, 8TB, and 10TB models.

Interestingly, the oldest drives, the 4TB and 6TB drives, are holding their own. Yes, there has been an increase over the last year or so, but given their age, they are doing well.

On the other hand, the 8TB and 10TB drives, with an average of five and six years of service respectively, require further attention. We'll look at the lifetime data later on in this report to see if our conclusions are justified. (For a refresher on how an AFR is calculated from failures and drive days, see the sketch below.)
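A quick note on the math behind these AFR figures: here's a minimal Python sketch, assuming the standard drive days formula (failures divided by drive days expressed in drive years).

    def annualized_failure_rate(failures, drive_days):
        # AFR (%) = failures / (drive_days / 365) * 100
        return failures / (drive_days / 365) * 100

    # Example: 1,000 drives running a full 91-day quarter with five failures.
    print(round(annualized_failure_rate(5, 1000 * 91), 2))  # ~2.01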

What’s New in the Drive Stats Data?

For the past 10 years, we’ve been capturing and storing the drive stats data and since 2015 we’ve open sourced the data files that we used to create the Drive Stats reports. From time to time, new SMART attribute pairs have been added to the schema as we install new drive models which report new sets of SMART attributes. This quarter we decided to capture and store some additional data fields about the drives and the environment they operate in, and we’ve added them to the publicly available Drive Stats files that we publish each quarter. 

The New Data Fields

Beginning with the Q2 2023 Drive Stats data, there are three new data fields populated in each drive record.

  1. Vault_id: All data drives are members of a Backblaze Vault. Each vault consists of either 900 or 1,200 hard drives divided evenly across 20 storage servers. The vault_id is a numeric value starting at 1,000.
  2. Pod_id: There are 20 storage servers in each Backblaze Vault. The Pod_id is a numeric field with values from 0 to 19 assigned to one of the 20 storage servers.
  3. Is_legacy_format: Currently 0, but will be useful over the coming quarters as more fields are added.

The new schema is as follows:

  • date
  • serial_number
  • model
  • capacity_bytes
  • failure
  • vault_id
  • pod_id
  • is_legacy_format
  • smart_1_normalized
  • smart_1_raw
  • Remaining SMART value pairs (as reported by each drive model)

Occasionally, our readers would ask if we had any additional information we could provide with regards to where a drive lived, and, more importantly, where it died. The newly-added data fields above are part of the internal drive data we collect each day, but they were not included in the Drive Stats data that we use to create the Drive Stats reports. With the help of David from our Infrastructure Software team, these fields will now be available in the Drive Stats data.

How Can We Use the Vault and Pod Information?

First a caveat: We have exactly one quarter’s worth of this new data. While it was tempting to create charts and tables, we want to see a couple of quarters worth of data to understand it better. Look for an initial analysis later on in the year.

That said, what this data gives us is the storage server and the vault of every drive. Working backwards, we should be able to ask questions like: “Are certain storage servers more prone to drive failure?” or, “Do certain drive models work better or worse in certain storage servers?” In addition, we hope to add data elements like storage server type and data center to the mix in order to provide additional insights into our multi-exabyte cloud storage platform.
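As a starting point, here's a minimal sketch of the kind of query the new fields make possible, written in Python with pandas. The filename is hypothetical, but the column names match the schema listed above.

    import pandas as pd

    # Hypothetical daily file from the Drive Stats download.
    df = pd.read_csv("2023-06-30.csv")

    # Failures recorded per vault and storage server (pod) on that day.
    by_location = (
        df.groupby(["vault_id", "pod_id"])["failure"]
          .sum()
          .sort_values(ascending=False)
    )
    print(by_location.head(10))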

Over the years, we have leveraged our Drive Stats data internally to improve our operational efficiency and durability. Providing these new data elements to everyone via our Drive Stats reports and data downloads is just the right thing to do.

There’s a New Drive in Town

If you do decide to download our Drive Stats data for Q2 2023, there’s a surprise inside—a new drive model. There are only four of these drives, so they’d be easy to miss, and they are not listed on any of the tables and charts we publish as they are considered “test” drives at the moment. But, if you are looking at the data, search for model “WDC WUH722222ALE6L4” and you’ll find our newly installed 22TB WDC drives. They went into testing in late Q2 and are being put through their paces as we speak. Stay tuned. (Psst, as of 7/28, none had failed.)

Lifetime Hard Drive Failure Rates

As of June 30, 2023, we were tracking 241,297 hard drives used to store customer data. For our lifetime analysis, we removed 357 drives that were only used for testing purposes or did not have at least 60 drives represented in the full dataset. This leaves us with 240,940 hard drives grouped into 31 different models to analyze for the lifetime table below.

Notes and Observations About the Lifetime Stats

The Lifetime AFR also rises. The lifetime annualized failure rate for all the drives listed above is 1.45%. That is an increase of 0.05 percentage points from the previous quarter's 1.40%. Earlier in this report, by examining the Q2 2023 data, we identified the 8TB and 10TB drives as primary suspects in the increasing rate. Let's see if we can confirm that by examining the change in the lifetime AFR rates of the different drives grouped by size.

The red line is our baseline as it is the difference from Q1 to Q2 (0.05%) of the lifetime AFR for all drives. Drives above the red line support the increase, drives below the line subtract from the increase. The primary drives (by size) which are “driving” the increased lifetime annualized failure rate are the 8TB and 10TB drives. This confirms what we found earlier. Given there are relatively few 10TB drives (1,124) versus 8TB drives (24,891), let’s dig deeper into the 8TB drives models.

The lifetime AFR for all 8TB drives jumped from 1.42% in Q1 to 1.59% in Q2, an increase of 12%. There are six 8TB drive models in operation, but three of these models comprise 99.5% of the drive failures for the 8TB drive cohort, so we'll focus on them. They are listed below.

For all three models, the increase of the lifetime annualized failure rate from Q1 to Q2 is 10% or more, which is statistically similar to the 12% increase for all of the 8TB drive models. If you had to select one drive model to focus on for migration, any of the three would be a good candidate. But the Seagate drives, model ST8000DM002, are on average nearly a year older than the other drive models in question.

  • Not quite a lifetime? The table above analyzes data for the period of April 20, 2013 through June 30, 2023, or 10 years, 2 months and 10 days. As noted earlier, the oldest drive we have is 10 years and 2 months old, give or take a day or two. It would seem we need to change our table header, but not quite yet. A drive that was installed anytime in Q2 2013 and is still operational today would report drive days as part of the lifetime data for that model. Once all the drives installed in Q2 2013 are gone, we can change the start date on our tables and charts accordingly.

A Word About Drive Failure

Are we worried about the increase in drive failure rates? Of course we’d like to see them lower, but the inescapable reality of the cloud storage business is that drives fail. Over the years, we have seen a wide range of failure rates across different manufacturers, drive models, and drive sizes. If you are not prepared for that, you will fail. As part of our preparation, we use our drive stats data as one of the many inputs into understanding our environment so we can adjust when and as we need.

So, are we worried about the increase in drive failure rates? No, but we are not arrogant either. We’ll continue to monitor our systems, take action where needed, and share what we can with you along the way. 

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Stats Data webpage. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

If you want the tables and charts used in this report, you can download the .zip file from Backblaze B2 Cloud Storage which contains an MS Excel spreadsheet with a tab for each of the tables or charts.

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q2 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AI 101: GPU vs. TPU vs. NPU

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-gpu-vs-tpu-vs-npu/

Word bubbles that say "What's the Diff: GPU, TPU, NPU."
This article is part of an ongoing content arc about artificial intelligence (AI). The first article in the series is AI 101: How Cognitive Science and Computer Processors Create Artificial Intelligence. Stay tuned for the rest of the series, and feel free to suggest other articles you’d like to see on this content in the comments.

It’s no secret that artificial intelligence (AI) is driving innovation, particularly when it comes to processing data at scale. Machine learning (ML) and deep learning (DL) algorithms, designed to solve complex problems and self-learn over time, are exploding the possibilities of what computers are capable of. 

As the problems we ask computers to solve get more complex, there’s also an unavoidable, explosive growth in the number of processes they run. This growth has led to the rise of specialized processors and a whole host of new acronyms.

Joining the ranks of central processing units (CPUs), which you may already be familiar with, are neural processing units (NPUs), graphics processing units (GPUs), and tensor processing units (TPUs). 

So, let’s dig in to understand how some of these specialized processors work, and how they’re different from each other. If you’re still with me after that, stick around for an IT history lesson.  I’ll get into some of the more technical concepts about the combination of hardware and software developments in the last 100 or so years.

Central Processing Unit (CPU): The OG

Think of the CPU as the general of your computer. There are two main parts of a CPU: an arithmetic-logic unit (ALU) and a control unit. An ALU allows arithmetic (add, subtract, etc.) and logic (AND, OR, NOT, etc.) operations to be carried out. The control unit directs the ALU, memory, and IO functions, telling them how to respond to the program that's just been read from memory. 

The best way to track what the CPU does is to think of it as an input/output flow. The CPU will take the request (input), access the memory of the computer for instructions on how to perform that task, delegate the execution to either its own ALUs or another specialized processor, take all that data back into its control unit, then take a single, unified action (output). 
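If it helps to make that flow concrete, here's a toy sketch in Python of the two families of work an ALU performs, arithmetic and logic, on a pair of inputs. (This is an illustration of the concept, not how any real CPU is programmed.)

    # A toy "ALU" in Python: arithmetic operations and logic (bitwise) operations
    # carried out on two inputs.
    a, b = 12, 10
    print(a + b, a - b)      # arithmetic: 22 2
    print(a & b, a | b, ~a)  # logic (AND, OR, NOT): 8 14 -13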

For a visual, this is the circuitry map for an ALU from 1970:

Circuitry map for an ALU from 1970.
From our good friends at Texas Instruments: the combinational logic circuitry of the 74181 integrated circuit, an early four-bit ALU. Image source.

But, more importantly, here’s a logic map about what a CPU does: 

Logic map of what a CPU does.
Image source.

CPUs have gotten more powerful over the years as we’ve moved from single-core processors to multicore processors. Basically, there are several ALUs executing tasks that are being managed by the CPU’s control unit, and they perform tasks in parallel. That means that it works well in combination with specialized AI processors like GPUs. 

The Rise of Specialized Processors

When a computer is given a task, the first thing the processor has to do is communicate with the memory, including program memory (ROM)—designed for more fixed tasks like startup—and data memory (RAM)—designed for things that change more often like loading applications, editing a document, and browsing the internet. The thing that allows these elements to talk is called the bus, and it can only access one of the two types of memory at one time.  

In the past, processors ran more slowly than memory access, but that’s changed as processors have gotten more sophisticated. Now, when CPUs are asked to do a bunch of processes on large amounts of data, the CPU ends up waiting for memory access because of traffic on the bus. In addition to slower processing, it also uses a ton of energy. Folks in computing call this the Von Neumann bottleneck, and as compute tasks like those for AI have become more complex, we’ve had to work out ways to solve this problem.

One option is to create chips that are optimized to specific tasks. Specialized chips are designed to solve the processing difficulties machine learning algorithms present to CPUs. In the race to create the best AI processor, big players like Google, IBM, Microsoft, and Nvidia have solved this with specialized processors that can execute more logical queries (and thus more complex logic). They achieve this in a few different ways. So, let’s talk about what that looks like: What are GPUs, TPUs, and NPUs?

Graphics Processing Unit (GPU)

GPUs started out as specialized graphics processors and are often conflated with graphics cards (which have a bit more hardware to them). GPUs were designed to support massive amounts of parallel processing, and they work in tandem with CPUs, either fully integrated on the main motherboard, or, for heavier loads, on their own dedicated piece of hardware. They also use a ton of energy and thus generate heat. 

GPUs have long been used in gaming, and it wasn't until the 2000s that folks started using them for general computing—thanks to Nvidia. Nvidia certainly designs chips, of course, but they also introduced a proprietary platform called CUDA that allows programmers to have direct access to a GPU's virtual instruction set and parallel computational elements. This means that you can write compute kernels (functions that run in parallel across clusters of GPU processing cores, ideally suited to specific tasks) without taxing the rest of your resources. Here's a great diagram that shows the workflow:

Processing flow on CUDA
Image source.

This made GPUs wildly applicable for machine learning tasks, and they benefited from the fact that they leveraged existing, well-known processes. What we mean by that is: oftentimes when you’re researching solutions, the solution that wins is not always the “best” one based on pure execution. If you’re introducing something that has to (for example) fundamentally change consumer behavior, or that requires everyone to relearn a skill, you’re going to have resistance to adoption. So, GPUs playing nice with existing systems, programming languages, etc. aided wide adoption. They’re not quite plug-and-play, but you get the gist. 
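To make that parallelism concrete, here's a small sketch in Python using PyTorch, one of several frameworks that sit on top of CUDA (our choice here purely for illustration). It runs the same matrix multiply on the CPU and then, if one is available, on a CUDA-capable GPU.

    import torch

    a = torch.rand(2048, 2048)
    b = torch.rand(2048, 2048)

    c_cpu = a @ b                      # runs on the CPU's cores

    if torch.cuda.is_available():      # requires a CUDA-capable GPU and driver
        a_gpu, b_gpu = a.to("cuda"), b.to("cuda")
        c_gpu = a_gpu @ b_gpu          # same math, spread across thousands of GPU threads
        torch.cuda.synchronize()       # GPU work is asynchronous; wait for it to finish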

As time has gone on, there are now also open source platforms that support GPUs that are supported by heavy-hitting industry players (including Nvidia). The largest of these is OpenCL. And, folks have added tensor cores, which this article does a fabulous job of explaining.

Tensor Processing Unit (TPU) 

Great news: the TL;DR of this acronym boils down to: it's Google's proprietary AI processor. They started using them in their own data centers in 2015, released them to the public in 2016, and there are some commercially available models. They run on ASICs (hard-etched chips I'll talk more about later) and Google's TensorFlow software. 

Compared with GPUs, they’re specifically designed to have slightly lower precision, which makes sense given that this makes them more flexible to different types of workloads. I think Google themselves sum it up best:

If it’s raining outside, you probably don’t need to know exactly how many droplets of water are falling per second—you just wonder whether it’s raining lightly or heavily. Similarly, neural network predictions often don’t require the precision of floating point calculations with 32-bit or even 16-bit numbers. With some effort, you may be able to use 8-bit integers to calculate a neural network prediction and still maintain the appropriate level of accuracy.

Google Cloud Blog

GPUs, on the other hand, were originally designed for graphics processing and rendering, which relies on each point’s relationship to each other to create a readable image—if you have less accuracy in those points, you amplify that in their vectors, and then you end up with Playstation 2 Spyro instead of Playstation 4 Spyro.
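Here's a rough sketch of that trade-off in Python with NumPy: squeezing 32-bit floats into 8-bit integers with a single scale factor, giving up a little precision for much smaller, faster math. (This is a simplified illustration of quantization in general, not how any particular TPU implements it.)

    import numpy as np

    weights = np.array([0.9137, -0.4421, 0.0763, 0.5558], dtype=np.float32)

    # One scale factor maps the float range onto signed 8-bit integers (-127..127).
    scale = np.abs(weights).max() / 127.0
    quantized = np.round(weights / scale).astype(np.int8)

    # Recovering the values shows a small rounding error: the "light rain vs.
    # heavy rain" level of precision the quote above describes.
    dequantized = quantized.astype(np.float32) * scale
    print(quantized)     # e.g. [127 -61 11 77]
    print(dequantized)   # close to, but not exactly, the original weights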

Another important design choice that deviates from CPUs and GPUs is that TPUs are designed around a systolic array. Systolic arrays create a network of processors that are each computing a partial task, then sending it along to the next node until you reach the end of the line. Each node is usually fixed and identical, but the program that runs between them is programmable. It’s called a data processing unit (DPU).  
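As a toy, one-dimensional analogy in Python, each step below plays the role of a node that does one partial multiply-accumulate and hands its running total to the next node in the chain. (Real systolic arrays are two-dimensional grids of hardware, so treat this strictly as an illustration of the idea.)

    # A dot product computed "node by node": each (weight, input) pair acts as one
    # node that adds its partial product to the result passed along the chain.
    weights = [0.2, 0.5, -0.1, 0.8]
    inputs = [1.0, 2.0, 3.0, 4.0]

    accumulator = 0.0
    for w, x in zip(weights, inputs):
        accumulator += w * x
    print(accumulator)   # 4.1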

Neural Processing Unit (NPU)

“NPU” is sometimes used as the category name for all specialized AI processors, but it’s more often specifically applied to those designed for mobile devices. Just for confusion’s sake, note that Samsung also refers to its proprietary chipsets as NPU. 

NPUs contain all the necessary information to complete AI processing, and they run on a principle of synaptic weight. Synaptic weight is a term adapted from biology that describes the strength of connection between two neurons. Simply put, in our bodies, if two neurons find themselves sharing information more often, the connection between them becomes literally stronger, making it easier for energy to pass between them. At the end of the day, that makes it easier for you to do something. (Wow, the science behind habit forming makes a lot more sense now.) Many neural networks mimic this. 

When we say AI algorithms learn, this is one of the ways—they track likely possibilities over time, and give more weight to that connected node. The impact is huge when it comes to power consumption. Parallel processing runs tasks side by side, but isn't great at accounting for the completion of tasks, especially as your architecture scales and processing units become more physically separate.

Quick Refresh: Neural Networks and Decision Making in Computers

As we discuss in AI 101, when you’re thinking about the process of making a decision, what you see is that you’re actually making many decisions in a series, and often the things you’re considering before you reach your final decision affect the eventual outcome. Since computers are designed on a strict binary, they’re not “naturally” suited to contextualizing information in order to make better decisions. Neural networks are the solution. They’re based on matrix math, and they look like this: 

An image showing how a neural network is mapped.
Image source.

Basically, you’re asking a computer to have each potential decision check in with all the other possibilities, to weigh the outcome, and to learn from their own experience and sensory information. That all translates to more calculations being run at one time. 
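Here's a minimal sketch of that matrix math in Python with NumPy: a toy two-layer network in which every output weighs everything that came before it. (The layer sizes and random weights are made up purely for illustration.)

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.random(3)           # three inputs (the "sensory information")
    w1 = rng.random((3, 4))     # "synaptic weights" into four hidden nodes
    w2 = rng.random((4, 2))     # "synaptic weights" into two output nodes

    hidden = np.maximum(0, x @ w1)   # matrix multiply plus a simple activation
    output = hidden @ w2             # each output weighs all connected possibilities
    print(output)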

Recapping the Key Differences

That was a lot. Here’s a summary: 

  1. Functionality: GPUs were developed for graphics rendering, while TPUs and NPUs are purpose-built for AI/ML workloads. 
  2. Parallelism: GPUs are made for parallel processing, ideal for training complex neural networks. TPUs take this specialization further, focusing on tensor operations to achieve higher speeds and energy efficiencies. 
  3. Customization: TPUs and NPUs are more specialized and customized for AI tasks, while GPUs offer a more general-purpose approach suitable for various compute workloads.
  4. Use Cases: GPUs are commonly used in data centers and workstations for AI research and training. TPUs are extensively utilized in Google’s cloud infrastructure, and NPUs are prevalent in AI-enabled devices like smartphones and Internet of Things (IoT) gadgets.
  5. Availability: GPUs are widely available from various manufacturers and accessible to researchers, developers, and hobbyists. TPUs are exclusive to Google Cloud services, and NPUs are integrated into specific devices.

Do the Differences Matter?

The definitions of the different processors start to sound pretty similar after a while. A multicore processor combines multiple ALUs under a central control unit. A GPU combines more ALUs under a specialized processor. A TPU combines multiple compute nodes under a DPU, which is analogous to a CPU. 

At the end of the day, there’s some nuance about the different design choices between processors, but their impact is truly seen at scale versus at the consumer level. Specialized processors can handle larger datasets more efficiently, which translates to faster processing using less electrical power (though our net power usage may go up as we use AI tools more). 

It's also important to note that these are new and changing terms in a new and changing landscape. Google's TPU was announced in 2015, just eight years ago. I can't count the number of conversations I've had that end in a hyperbolic impression of what AI is going to do for/to the world, and that's largely because people think that there's no limit to what it is. 

But, the innovations that make AI possible were created by real people. (Though, maybe AIs will start coding themselves, who knows.) And, chips that power AI are real things—a piece of silicon that comes from the ground and is processed in a lab. Wrapping our heads around what those physical realities are, what challenges we had to overcome, and how they were solved, can help us understand how we can use these tools more effectively—and do more cool stuff in the future.

Bonus Content: A Bit of a History of the Hardware

Which brings me to our history lesson. In order to more deeply understand our topic today, you have to know a little bit about how computers are physically built. The most fundamental language of computers is binary code, represented as a series of 0s and 1s. Those values correspond to whether a circuit is open or closed, respectively. When a circuit is open, no current flows. When it's closed, current does flow. Transistors regulate current flow, generate electrical signals, and act as a switch or gate. You can connect lots of transistors with circuitry to create an integrated circuit chip.   

The combination of open and closed patterns of transistors can be read by your computer. As you add more transistors, you’re able to express more and more numbers in binary code. You can see how this influences the basic foundations of computing in how we measure bits and bytes. Eight transistors store one byte of data: two possibilities for each of the eight transistors, and then every possible combination of those possibilities (2^8) = 256 possible combinations of open/closed gates (bits), so 8 bits = one byte, which can represent any number between 0 and 255.
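You can check that math for yourself in a couple of lines of Python:

    print(2 ** 8)               # 256 -- distinct values one byte (8 bits) can hold
    print(format(201, "08b"))   # '11001001' -- the number 201 as a pattern of eight gates
    print(int("11001001", 2))   # 201 -- and back again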

Diagram of how transistors combine to create logic.
Transistors combining to create logic. You need a bunch of these to run a program. Image source.

Improvements in reducing transistor size and increasing transistor density on a single chip have led to improvements in capacity, speed, and power consumption, largely due to our ability to purify semiconductor materials, leverage more sophisticated tools like chemical etching, and improve clean room technology. That all started with the integrated circuit chip. 

Integrated circuit chips were invented around 1958, fueled by the discoveries of a few different people who solved different challenges nearly simultaneously. Jack Kilby of Texas Instruments created a hybrid integrated circuit measuring about 7/16” by 1/16” (11.1 mm by 1.6 mm). Robert Noyce (eventual co-founder of Intel) went on to create the first monolithic integrated circuit chip (so, all circuits held on the same chip) and it was around the same size. Here’s a blown-up version of it, held by Noyce:

Image of Robert Noyce.
Image source.

Note those first chips only held about 60 transistors. Current chips can have billions of transistors etched onto the same microchip, and are even smaller. Here's an example of what an integrated circuit looks like when it's exposed:

A microchip when it's exposed.
Image source.

And, for reference, that’s about this big:

Size comparison of a chip.
Image source.

And, that, folks, is one of the reasons you can now have a whole computer in your pocket in the guise of a smartphone. As you can imagine, something the size of a modern laptop or rack-mounted server can combine more of these elements more effectively. Hence, the rise of AI.

One More Acronym: What Are FPGAs?

So far, I’ve described fixed, physical points on a chip, but chip performance is also affected by software. Software represents the logic and instructions for how all these things work together. So, when you create a chip, you have two options: you either know what software you’re going to run and create a customized chip that supports that, or you get a chip that acts like a blank slate and can be reprogrammed based on what you need. 

The first method produces application-specific integrated circuits (ASICs). However, just like any proprietary build in manufacturing, you need to build them at scale for them to be profitable, and they're slower to produce. Both CPUs and GPUs typically run on hard-etched chips like this. 

Reprogrammable chips are known as field-programmable gate arrays (FPGAs). They're flexible and come with a variety of standard interfaces for developers. That means they're incredibly valuable for AI applications, and particularly deep learning algorithms—as things rapidly advance, FPGAs can be continuously reprogrammed with multiple functions on the same chip, which lets developers test, iterate, and deliver them to market quickly. This flexibility is most notable in that you can also reprogram things like the input/output (IO) interface, so you can reduce latency and overcome bottlenecks. For that reason, folks will often compare the efficacy of the whole class of ASIC-based processors (CPUs, GPUs, NPUs, TPUs) to FPGAs, which, of course, has also led to hybrid solutions. 

Summing It All Up: Chip Technology is Rad

Improvements in materials science and microchip construction laid the foundation for providing the processing capacity required by AI, and big players in the industry (Nvidia, Intel, Google, Microsoft, etc.) have leveraged those chips to create specialized processors. 

Simultaneously, software has allowed many processing cores to be networked in order to control and distribute processing loads for increased speeds. All that has led us to the rise in specialized chips that enable the massive demands of AI. 

Hopefully you have a better understanding of the different chipsets out there, how they work, and the difference between them. Still have questions? Let us know in the comments.

The post AI 101: GPU vs. TPU vs. NPU appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Secure Your SaaS Tools: Back Up Microsoft 365 to the Cloud

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/secure-your-saas-tools-back-up-microsoft-365-to-the-cloud/

A decorative image showing a computer backing up programs to a cloud with a Microsoft logo on one side, and on the other side, data to a cloud with the Backblaze logo.

Have you ever had that nagging feeling that you are forgetting something important? It’s like when you were back in school and sat down to take a test, only to realize you studied the wrong material. Worrying about your business data can feel like that. Are you fully protected? Are you doing all you can to ensure your data is backed up, safe, and easily restorable?

If you aren’t backing up your Microsoft 365 data, you could be leaving yourself unprepared and exposed. It’s a common misconception that data stored in software as a service (SaaS) products like Microsoft 365 is already backed up because it’s in a cloud application. But, anyone who’s tried to restore an entire company’s Microsoft 365 instance can tell you that’s not the case. 

In this post, you’ll get a better understanding of how your Microsoft 365 data is stored and how to back it up so you can reliably and quickly restore it should you ever need to. 

What Is Microsoft 365?

More than one million companies worldwide use Microsoft 365 (formerly Office 365). Microsoft 365 is a cloud-based productivity platform that includes a suite of popular applications like Outlook, Teams, Word, Excel, PowerPoint, Access, OneDrive, Publisher, SharePoint, and others.

Chances are that if you’re using Microsoft 365, you use it daily for all your business operations and rely heavily on the information stored within the cloud. But have you ever checked out the backup policies in Microsoft 365? 

If you are not backing up your Microsoft 365 data, you have a gap in your backup strategy which may put your business at risk. If you suffer a malware or ransomware attack, natural disaster, or even accidental deletion by an employee, you could lose that data. In addition, it may cost you a lot of time and money trying to restore from Microsoft after a data emergency.

Why You Need to Back Up M365

You might assume that, because it’s in the cloud, your SaaS data is backed up automatically for you. In reality, SaaS companies and products like Microsoft 365 operate on a shared responsibility model, meaning they back up the data and infrastructure to maintain uptime, not to help you in the event you need to restore. Practically speaking, that means that they may not back up your data as often as you would like or archive it for as long as you need. Microsoft does not concern itself with fully protecting your files. Most importantly, they may not offer a timely recovery option if you lose the data, which is critical to getting your business back online in the event of an outage. 

The bottom line is that Microsoft’s top priority is to keep its own services running. They replicate data and have redundancy safeguards in place to ensure you can access your data through the platform reliably, but they do not assume responsibility for their users’ data. 

All this to say, you are ultimately responsible for backing up your data and files in Microsoft 365.

M365 Native Backup Tools

But wait—what about Microsoft 365’s native backup tools? If you are relying on native backup support for your crucial business data, let’s talk about why that may not be the best way to make sure your data is protected.

Retention Period and Storage Costs

First, there are default settings within Microsoft 365 that dictate how long items are retained in the Recycle Bin and Deleted Items folders. You can tweak those settings for a longer retention period, but there is also a storage limit, so you might run out of space quickly. To keep your data longer, you must upgrade your license type and purchase additional storage, which could quickly become costly. Additionally, if an employee accidentally or purposefully deletes items from the trash bin, the item may be gone forever.

Replication Is Not a Backup

Microsoft replicates data as part of its responsibility, but this doesn't help you meet the requirements of a solid 3-2-1 strategy, where there are three copies of your data, one of which is off-site. So Microsoft doesn't fully protect you and doesn't support compliance standards that call for immutability. When Microsoft replicates data, they're only making a second copy, and that copy is designed to be in sync with your production data. This means that if an item gets corrupted and then replicated, the archive version is also corrupted, and you could lose crucial data. You can't bank on M365's replication to protect you.

Sync Is Not a Backup

Similarly, syncing is not backup protection and could end up hurting you. Syncing is designed to have a single copy of a file always up-to-date with changes you or other users have made on different devices. For example, if you use OneDrive as your cloud backup service, the bad news is that OneDrive will sync corrupted files overwriting your healthy ones. Essentially, if a file is deleted or infected, it will be infected or deleted on all synchronized devices. In contrast, a true backup allows you to restore from a specific point in time and provides access to previous versions of data, which can be useful in case of a ransomware attack or deletion.

Back Up Frequency and Control

Lastly, one of the biggest drawbacks of relying on Microsoft’s built-in backup tools is that you lack the ability to dial in your backup system the way you may want or need. There are several rules to follow in order to be able to recover or restore files in Microsoft 365. For instance, it’s strongly recommended that you save your documents in the cloud, both for syncing purposes and to enable things like Version History. But, if you delete an online-only file, it doesn’t go to your Recycle Bin, which means there’s no way to recover it. 

And, there are limits to the maximum numbers of versions saved when using Version History, the period of time a file is recoverable for, and so on. Some of the recovery periods even change depending on file type. For example, you can’t restore email after 30 days, but if you have an enterprise-level account, other file types are stored in your Recycle Bin or trash for up to 93 days.   

Backups may not be created as often as you like, and the recovery process isn’t quick or easy. For example, Microsoft backs up your data every 12 hours and retains it for 14 days. If you need to restore files, you must contact Microsoft Support, and they will perform a “full restore,” overwriting everything, not just the specific information you need. The recovery process probably won’t meet your recovery time objective (RTO) requirements. 

Compliance and Cyber Insurance

Many people want more control over their backups than what Microsoft offers, especially for mission-critical business data. In addition to having clarity and control over the backup and recovery process, data storage and backups are often an essential element in supporting compliance needs, particularly if your business stores personally identifiable information (PII). Different industries and regions will have different standards that need to be enforced, so it's always a good idea to have your legal or compliance team involved in the conversation.  

Similarly, with the increasing frequency of ransomware attacks, many businesses are adding cyber insurance. Cyber insurance provides protection for a variety of things, including legal fees, expenditures related to breaches, court-ordered judgments, and forensic post-breach review expenses. As a result, they often have stipulations about how and when you're backing up to mitigate the fallout of business downtime. 

Backing Up M365 With a Third Party Tool to the Cloud

Instead of the native Microsoft 365 backup tool, you could use one of the many popular backup applications that provide Microsoft 365 backup support. Options include:

Note that some of these applications include Microsoft 365 protection with their standard license, but it’s an optional add-on module with others. Be sure to check licensing and pricing before choosing an option.  

One thing to keep in mind with these tools: if you store on-premises, the backup data they generate can be vulnerable to local disasters like fire or earthquakes and to cyberattacks. For example, if you keep backups on network attached storage (NAS) that doesn't tier to the cloud, then your data would not be fully protected.

Backing your data up to the cloud puts a copy off-site and geographically distant from your production data, so it’s better protected from things like natural disasters. When you’re choosing a cloud storage provider, make sure you check out where they store their data—if their data center is just down the road, then you’ll want to pick a different region. 

Backblaze B2 + Microsoft 365

Backblaze B2 Cloud Storage is reliable, affordable, and secure backup cloud storage, and it integrates seamlessly with the third party applications listed above for backing up Microsoft 365. Some of the benefits of using Backblaze B2 include:

Check out our Help Center for Quick-Start Guides from partners like Veeam and MSP360.

Start backing up your Microsoft 365 data to Backblaze B2 today.

Protect Your M365 Data for Peace of Mind

Whether you are a business professional or an IT director, your goal is to protect your company data. Backing up your Microsoft 365 data to the cloud satisfies your RTO goals and better protects you against various threats. 

Relying on Microsoft 365 native tools is inefficient and slow, which means you could blow your RTO targets. Backing up to the cloud allows you to meet retention requirements, ensuring that you retain the data you need for as long as required without destroying your operational budget.

Your business-critical data is too important to trust to a native backup tool that doesn’t meet your needs. In the event of a catastrophic situation, you need complete control and quick access to all your files from a specific point in time. Backing your Microsoft 365 data up to the cloud gives you more control, more freedom, and better protection. 

The post Secure Your SaaS Tools: Back Up Microsoft 365 to the Cloud appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

What’s the Diff: VMs vs. Containers

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/vm-vs-containers/

A decorative image comparing VMs and containers.
This post was originally published in 2018 and updated in 2021. We’re sharing an update to this post to provide the latest information on VMs and containers.

Both virtual machines (VMs) and containers help you optimize computer hardware and software resources via virtualization. 

Containers have been around for a while, but their broad adoption over the past few years has fundamentally changed IT practices. On the other hand, VMs have enjoyed enduring popularity, maintaining their presence across data centers of various scales.

As you think about how to run services and build applications in the cloud, these virtualization techniques can help you do so faster and more efficiently.  Today, we’re digging into how they work, how they compare to each other, and how to use them to drive your organization’s digital transformation.

First, the Basics: Some Definitions

What Is Virtualization?

Virtualization is the process of creating a virtual version or representation of computing resources like servers, storage devices, operating systems (OS), or networks that are abstracted from the physical computing hardware. This abstraction enables greater flexibility, scalability, and agility in managing and deploying computing resources. You can create multiple virtual computers from the hardware and software components of a single machine. You can think of it as essentially a computer-generated computer.

What Is a Hypervisor?

The software that enables the creation and management of virtual computing environments is called a hypervisor. It’s a lightweight software or firmware layer that sits between the physical hardware and the virtualized environments and allows multiple operating systems to run concurrently on a single physical machine. The hypervisor abstracts and partitions the underlying hardware resources, such as central processing units (CPUs), memory, storage, and networking, and allocates them to the virtual environments.  You can think of the hypervisor as the middleman that pulls resources from the raw materials of your infrastructure and directs them to the various computing instances.

There are two types of hypervisors: 

  1. Type 1, bare-metal hypervisors, run directly on the hardware. 
  2. Type 2 hypervisors operate within a host operating system. 

Hypervisors are fundamental to virtualization technology, enabling efficient utilization and management of computing resources.

VMs and Containers

What Are VMs?

The computer-generated computers that virtualization makes possible are known as virtual machines (VMs)—separate virtual computers running on one set of hardware or a pool of hardware. Each virtual machine acts as an isolated and self-contained environment, complete with its own virtual hardware components, including CPU, memory, storage, and network interfaces. The hypervisor allocates and manages resources, ensuring each VM has its fair share and preventing interference between them.

Each VM requires its own OS. Thus each VM can host a different OS, enabling diverse software environments and applications to exist without conflict on the same machine. VMs provide a level of isolation, ensuring that failures or issues within one VM do not impact others on the same hardware. They also enable efficient testing and development environments, as developers can create VM snapshots to capture specific system states for experimentation or rollbacks. VMs also offer the ability to easily migrate or clone instances, making it convenient to scale resources or create backups.

Since the advent of affordable virtualization technology and cloud computing services, IT departments large and small have embraced VMs as a way to lower costs and increase efficiencies.

A diagram of how virtual machines interact with and are stored on a server.

VMs, however, can take up a lot of system resources. Each VM runs not just a full copy of an OS, but a virtual copy of all the hardware that the operating system needs to run. It’s why VMs are sometimes associated with the term “monolithic”—they’re single, all-in-one units commonly used to run applications built as single, large files. (The nickname, “monolithic,” will make a bit more sense after you learn more about containers below.) This quickly adds up to a lot of RAM and CPU cycles. They’re still economical compared to running separate actual computers, but for some use cases, particularly applications, it can be overkill, which led to the development of containers.

Benefits of VMs

  • All OS resources available to apps.
  • Well-established functionality.
  • Robust management tools.
  • Well-known security tools and controls.
  • The ability to run different OS on one physical machine.
  • Cost savings compared to running separate, physical machines.

Popular VM Providers

What Are Containers?

With containers, instead of virtualizing an entire computer like a VM, just the OS is virtualized.

Containers sit on top of a physical server and its host OS—typically Linux or Windows. Each container shares the host OS kernel and, usually, the binaries and libraries, too, resulting in more efficient resource utilization. (See below for definitions if you’re not familiar with these terms.) Shared components are read-only.

Why are they more efficient? Sharing OS resources, such as libraries, significantly reduces the need to reproduce the operating system code—a server can run multiple workloads with a single operating system installation. That makes containers lightweight and portable—they are only megabytes in size and take just seconds to start. What this means in practice is you can put two to three times as many applications on a single server with containers than you can with a VM. Compared to containers, VMs take minutes to start and are an order of magnitude larger than an equivalent container, measured in gigabytes versus megabytes.

Container technology has existed for a long time, but the launch of Docker in 2013 made containers essentially the industry standard for application and software development. Technologies like Docker and Kubernetes are used to create isolated environments for applications, and containers solve the problem of environment inconsistency—the old "works on my machine" problem often encountered in software development and deployment.

Developers generally write code locally, say on their laptop, then deploy that code on a server. Any differences between those environments—software versions, permissions, database access, etc.—lead to bugs. With containers, developers can create a portable, packaged unit that contains all of the dependencies needed for that unit to run in any environment, whether it's local, development, testing, or production. This portability is one of containers' key advantages.
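To make that concrete, here's a minimal sketch using the Docker SDK for Python (the SDK choice and the photo-service image name are our own assumptions for illustration; any container tooling would do). The same packaged image runs identically on a laptop, a CI runner, or a production server:

import docker

# Connect to the local Docker daemon (assumes Docker is installed and running).
client = docker.from_env()

# Build an image from a Dockerfile in the current directory. The Dockerfile
# (not shown) pins the app's dependencies, so the resulting image behaves the
# same way in every environment.
image, build_logs = client.images.build(path=".", tag="photo-service:dev")

# Run the packaged unit as a container, mapping port 8080 on the host.
container = client.containers.run(
    "photo-service:dev",
    detach=True,
    ports={"8080/tcp": 8080},
)

print(container.short_id, container.status)

# Tear down when you're done experimenting.
container.stop()
container.remove()

Because everything the service needs ships inside the image, "works on my machine" becomes "works on any machine with a container runtime."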

Containers also offer scalability, as multiple instances of a containerized application can be deployed and managed in parallel, allowing for efficient resource allocation and responsiveness to changing demand.

Microservices architectures for application development evolved out of this container boom. With containers, applications could be broken down into their smallest component parts or “services” that serve a single purpose, and those services could be developed and deployed independently of each other instead of in one monolithic unit. 

For example, let’s say you have an app that allows customers to buy anything in the world. You might have a search bar, a shopping cart, a buy button, etc. Each of those “services” can exist in their own container, so that if, say, the search bar fails due to high load, it doesn’t bring the whole thing down. And that’s how you get your Prime Day deals today.

A diagram for how containers interact with and are stored on a server.

More Definitions: Binaries, Libraries, and Kernels

Binaries: In general, binaries are non-text files made up of ones and zeros that tell a processor how to execute a program.

Libraries: Libraries are sets of prewritten code that a program can use to do either common or specialized things. They allow developers to avoid rewriting the same code over and over.

Kernels: Kernels are the ringleaders of the OS. They’re the core programming at the center that controls all other parts of the operating system.

Container Tools

Linux Containers (LXC): Commonly known as LXC, these are the original Linux container technology. LXC is a Linux operating system-level virtualization method for running multiple isolated Linux systems on a single host.

Docker: Originally conceived as an initiative to develop LXC containers for individual applications, Docker revolutionized the container landscape by introducing significant enhancements to improve their portability and versatility. Gradually evolving into an independent container runtime environment, Docker emerged as a prominent Linux utility, enabling the seamless creation, transportation, and execution of containers with remarkable efficiency.

Kubernetes: Kubernetes, though not container software in its essence, serves as a vital container orchestrator. In the realm of cloud-native architecture and microservices, where applications deploy numerous containers, from hundreds to thousands or even more, Kubernetes plays a crucial role in automating the comprehensive management of those containers. While Kubernetes relies on complementary tools like Docker to function seamlessly, it's such a big name in the container space that it wouldn't be a container post without mentioning it.

Benefits of Containers

  • Reduced IT management resources.
  • Faster spin ups.
  • Smaller size means one physical machine can host many containers.
  • Reduced and simplified security updates.
  • Less code to transfer, migrate, and upload when moving workloads.

What’s the Diff: VMs vs. Containers

The virtual machine versus container debate gets at the heart of the debate between traditional IT architecture and contemporary DevOps practices.

VMs have been, and continue to be, tremendously popular and useful, but sadly for them, they now carry the term “monolithic” with them wherever they go like a 25-ton Stonehenge around the neck. Containers, meanwhile, pushed the old gods aside, bedecked in the glittering mantle of “microservices.” Cute.

To offer another quirky tech metaphor, VMs are to containers what glamping is to ultralight backpacking. Both equip you with everything you need to survive in the wilds of virtualization. Both are portable, but containers will get you farther, faster, if that’s your goal. And while VMs bring everything and the kitchen sink, containers leave the toothbrush at home to cut weight. To make a more direct comparison, we’ve consolidated the differences into a handy table:

VMs                                        Containers
Heavyweight.                               Lightweight.
Limited performance.                       Native performance.
Each VM runs in its own OS.                All containers share the host OS.
Hardware-level virtualization.             OS virtualization.
Startup time in minutes.                   Startup time in milliseconds.
Allocates required memory.                 Requires less memory space.
Fully isolated and hence more secure.      Process-level isolation, possibly less secure.

Uses for VMs vs. Uses for Containers

Both containers and VMs have benefits and drawbacks, and the ultimate decision will depend on your specific needs.

When it comes to selecting the appropriate technology for your workloads, virtual machines (VMs) excel in situations where applications demand complete access to the operating system’s resources and functionality. When you need to run multiple applications on servers, or have a wide variety of operating systems to manage, VMs are your best choice. If you have an existing monolithic application that you don’t plan to or need to refactor into microservices, VMs will continue to serve your use case well.

Containers are a better choice when your biggest priority is maximizing the number of applications or services running on a minimal number of servers and when you need maximum portability. If you are developing a new app and you want to use a microservices architecture for scalability and portability, containers are the way to go. Containers shine when it comes to cloud-native application development based on a microservices architecture.

You can also run containers on a virtual machine, making the question less of an either/or and more of an exercise in understanding which technology makes the most sense for your workloads.

In a nutshell:

  • VMs help companies make the most of their infrastructure resources by expanding the number of machines you can squeeze out of a finite amount of hardware and software.
  • Containers help companies make the most of the development resources by enabling microservices and DevOps practices.

Are You Using VMs, Containers, or Both?

If you are using VMs or containers, we’d love to hear from you about what you’re using and how you’re using them. Drop a note in the comments.

The post What’s the Diff: VMs vs. Containers appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Use Cloud Replication to Automate Environments

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-use-cloud-replication-to-automate-environments/

A decorative image showing a workflow from a computer, to a checklist, to a server stack.

A little over a year ago, we announced general availability of Backblaze Cloud Replication, the ability to automatically copy data across buckets, accounts, or regions. There are several ways to use this service, but today we’re focusing on how to use Cloud Replication to replicate data between environments like testing, staging, and production when developing applications. 

First we’ll talk about why you might want to replicate environments and how to go about it. Then, we’ll get into the details: there are some nuances that might not be obvious when you set out to use Cloud Replication in this way, and we’ll talk about those so that you can replicate successfully.

Other Ways to Use Cloud Replication

In addition to replicating between environments, there are two main reasons you might want to use Cloud Replication:

  • Data Redundancy: Replicating data for security, compliance, and continuity purposes.
  • Data Proximity: Bringing data closer to distant teams or customers for faster access.

Maintaining a redundant copy of your data sounds, well, redundant, but it is the most common use case for cloud replication. It supports disaster recovery as part of a broad cyber resilience framework, reduces the risk of downtime, and helps you comply with regulations.

The second reason (replicating data to bring it geographically closer to end users) has the goal of improving performance and user experience. We looked at this use case in detail in the webinar Low Latency Multi-Region Content Delivery with Fastly and Backblaze.

Four Levels of Testing: Unit, Integration, System, and Acceptance

An image of the character, "The Most Interesting Man in the World", with the title "I don't always test my code, but when I do, I do it in production."
Friendly reminder to both drink and code responsibly (and probably not at the same time).

The Most Interesting Man in the World may test his code in production, but most of us prefer to lead a somewhat less “interesting” life. If you work in software development, you are likely well aware of the various types of testing, but it’s useful to review them to see how different tests might interact with data in cloud object storage.

Let’s consider a photo storage service that stores images in a Backblaze B2 Bucket. There are several real-world Backblaze customers that do exactly this, including Can Stock Photo and CloudSpot, but we’ll just imagine some of the features that any photo storage service might provide that its developers would need to write tests for.

Unit Tests

Unit tests test the smallest components of a system. For example, our photo storage service will contain code to manipulate images in a B2 Bucket, so its developers will write unit tests to verify that each low-level operation completes successfully. A test for thumbnail creation, for example, might do the following:

  1. Directly upload a test image to the bucket.
  2. Run the "Create Thumbnail" function against the test image.
  3. Verify that the resulting thumbnail image has indeed been created in the expected location in the bucket with the expected dimensions.
  4. Delete both the test and thumbnail images.

A large application might have hundreds, or even thousands, of unit tests, and it’s not unusual for development teams to set up automation to run the entire test suite against every change to the system to help guard against bugs being introduced during the development process.

Typically, unit tests require a blank slate to work against, with test code creating and deleting files as illustrated above. In this scenario, the test automation might create a bucket, run the test suite, then delete the bucket, ensuring a consistent environment for each test run.
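As a rough sketch of that blank-slate pattern in Python with the B2 SDK (the create_thumbnail function, the fixture path, and the bucket naming are hypothetical; your application code and test framework will differ):

import os
import unittest
import uuid

from b2sdk.v2 import B2Api, InMemoryAccountInfo

# Hypothetical application code under test.
from photo_service.images import create_thumbnail


class ThumbnailTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.b2_api = B2Api(InMemoryAccountInfo())
        cls.b2_api.authorize_account(
            "production",
            os.environ["B2_APPLICATION_KEY_ID"],
            os.environ["B2_APPLICATION_KEY"],
        )
        # Create a scratch bucket so every run starts from a blank slate.
        cls.bucket = cls.b2_api.create_bucket(
            f"test-thumbnails-{uuid.uuid4().hex[:8]}", "allPrivate"
        )

    def test_create_thumbnail(self):
        # 1. Directly upload a test image to the bucket.
        with open("tests/fixtures/cat.jpg", "rb") as f:
            self.bucket.upload_bytes(f.read(), "images/cat.jpg")

        # 2. Run the "Create Thumbnail" function against the test image.
        create_thumbnail(self.bucket, "images/cat.jpg")

        # 3. Verify the thumbnail landed in the expected location.
        #    (get_file_info_by_name raises if the file is missing, failing the test;
        #    a real test would also check the thumbnail's dimensions.)
        info = self.bucket.get_file_info_by_name("thumbnails/cat.jpg")
        self.assertTrue(info.file_name.endswith("cat.jpg"))

    @classmethod
    def tearDownClass(cls):
        # 4. Clean up: delete every file version, then the scratch bucket.
        for file_version, _ in cls.bucket.ls(latest_only=False, recursive=True):
            cls.b2_api.delete_file_version(file_version.id_, file_version.file_name)
        cls.b2_api.delete_bucket(cls.bucket)


if __name__ == "__main__":
    unittest.main()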

Integration Tests

Integration tests bring together multiple components to test that they interact correctly. In our photo storage example, an integration test might combine image upload, thumbnail creation, and artificial intelligence (AI) object detection—all of the functions executed when a user adds an image to the photo storage service. In this case, the test code would do the following:

  1. Run the "Add Image" procedure against a test image of a specific subject, such as a cat.
  2. Verify that the test and thumbnail images are present in the expected location in the bucket, the thumbnail image has the expected dimensions, and an entry has been created in the image index with the “cat” tag.
  3. Delete the test and thumbnail images, and remove the image’s entry from the index.

Again, integration tests operate against an empty bucket, since they test particular groups of functions in isolation, and require a consistent, known environment.

System Tests

The next level of testing, system testing, verifies that the system as a whole operates as expected. System testing can be performed manually by a QA engineer following a test script, but is more likely to be automated, with test software taking the place of the user. For example, the Selenium suite of open-source test tools can simulate a user interacting with a web browser. A system test for our photo storage service might operate as follows:

  1. Open the photo storage service web page.
  2. Click the upload button.
  3. In the resulting file selection dialog, provide a name for the image, navigate to the location of the test image, select it, and click the submit button.
  4. Wait as the image is uploaded and processed.
  5. When the page is updated, verify that it shows that the image was uploaded with the provided name.
  6. Click the image to go to its details.
  7. Verify that the image metadata is as expected. For example, the file size and object tag match the test image and its subject.

When we test the system at this level, we usually want to verify that it operates correctly against real-world data, rather than a synthetic test environment. Although we can generate “dummy data” to simulate the scale of a real-world system, real-world data is where we find the wrinkles and edge cases that tend to result in unexpected system behavior. For example, a German-speaking user might name an image “Schloss Schönburg.” Does the system behave correctly with non-ASCII characters such as ö in image names? Would the developers think to add such names to their dummy data?

A picture of Schönburg Castle in the Rhine Valley at sunset.
Non-ASCII characters: our excuse to give you your daily dose of serotonin. Source.
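For a taste of what the browser-driving portion might look like, here's a fragment using Selenium's Python bindings. The URL and element IDs are invented for the sake of the example, and a real suite would add more assertions around the image metadata:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    # 1. Open the photo storage service web page (hypothetical URL).
    driver.get("https://photos.example.com")

    # 2-3. Name the image and attach the test file. Selenium can't drive the
    # native file dialog, so we set the file input element directly.
    driver.find_element(By.ID, "image-name").send_keys("Schloss Schönburg")
    driver.find_element(By.ID, "file-input").send_keys(
        "/tests/fixtures/Schloss Schönburg.jpg"
    )
    driver.find_element(By.ID, "submit").click()

    # 4-5. Wait for upload and processing, then confirm the provided name
    # appears in the gallery, non-ASCII characters and all.
    WebDriverWait(driver, 60).until(
        EC.text_to_be_present_in_element((By.ID, "gallery"), "Schloss Schönburg")
    )
finally:
    driver.quit()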

Acceptance Tests

The final testing level, acceptance testing, again involves the system as a whole. But, where system testing verifies that the software produces correct results without crashing, acceptance testing focuses on whether the software works for the user. Beta testing, where end-users attempt to work with the system, is a form of acceptance testing. Here, real-world data is essential to verify that the system is ready for release.

How Does Cloud Replication Fit Into Testing Environments?

Of course, we can’t just use the actual production environment for system and acceptance testing, since there may be bugs that destroy data. This is where Cloud Replication comes in: we can create a replica of the production environment, complete with its quirks and edge cases, against which we can run tests with no risk of destroying real production data. The term staging environment is often used in connection with acceptance testing, with test(ing) environments used with unit, integration, and system testing.

Caution: Be Aware of PII!

Before we move on to look at how you can put replication into practice, it’s worth mentioning that it’s essential to determine whether you should be replicating the data at all, and what safeguards you should place on replicated data—and to do that, you’ll need to consider whether or not it is or contains personally identifiable information (PII).

The National Institute of Standards and Technology (NIST) document SP 800-122 provides guidelines for identifying and protecting PII. In our example photo storage site, if the images include photographs of people that may be used to identify them, then that data may be considered PII.

In most cases, you can still replicate the data to a test or staging environment as necessary for business purposes, but you must protect it at the same level that it is protected in the production environment. Keep in mind that there are different requirements for data protection in different industries and different countries or regions, so make sure to check in with your legal or compliance team to ensure everything is up to standard.

In some circumstances, it may be preferable to use dummy data, rather than replicating real-world data. For example, if the photo storage site was used to store classified images related to national security, we would likely assemble a dummy set of images rather than replicating production data.

How Does Backblaze Cloud Replication Work?

To replicate data in Backblaze B2, you must create a replication rule via either the web console or the B2 Native API. The replication rule specifies the source and destination buckets for replication and, optionally, advanced replication configuration. The source and destination buckets can be located in the same account, different accounts in the same region, or even different accounts in different regions; replication works just the same in all cases. While standard Backblaze B2 Cloud Storage rates apply to replicated data storage, note that Backblaze does not charge service or egress fees for replication, even between regions.

It’s easier to create replication rules in the web console, but the API allows access to two advanced features not currently accessible from the web console: 

  1. Setting a prefix to constrain the set of files to be replicated. 
  2. Excluding existing files from the replication rule. 

Don’t worry: this blog post provides a detailed explanation of how to create replication rules via both methods.

Once you’ve created the replication rule, files will begin to replicate at midnight UTC, and it can take several hours for the initial replication if you have a large quantity of data. Files uploaded after the initial replication rule is active are automatically replicated within a few seconds, depending on file size. You can check whether a given file has been replicated either in the web console or via the b2-get-file-info API call. Here’s an example using curl at the command line:

 % curl -s -H "Authorization: ${authorizationToken}" \
    -d "{\"fileId\":  \"${fileId}\"}" \
    "${apiUrl}/b2api/v2/b2_get_file_info" | jq .
{
  "accountId": "15f935cf4dcb",
  "action": "upload",
  "bucketId": "11d5cf096385dc5f841d0c1b",
  ...
  "replicationStatus": "pending",
  ...
}

In the example response, the replicationStatus field is pending; once the file has been replicated, it will change to completed.

Here’s a short Python script that uses the B2 Python SDK to retrieve replication status for all files in a bucket, printing the names of any files with pending status:

import argparse
import os

from dotenv import load_dotenv

from b2sdk.v2 import B2Api, InMemoryAccountInfo
from b2sdk.replication.types import ReplicationStatus

# Load credentials from .env file into environment
load_dotenv()

# Read bucket name from the command line
parser = argparse.ArgumentParser(description='Show files with "pending" replication status')
parser.add_argument('bucket', type=str, help='a bucket name')
args = parser.parse_args()

# Create B2 API client and authenticate with key and ID from environment
b2_api = B2Api(InMemoryAccountInfo())
b2_api.authorize_account("production", os.environ["B2_APPLICATION_KEY_ID"], os.environ["B2_APPLICATION_KEY"])

# Get the bucket object
bucket = b2_api.get_bucket_by_name(args.bucket)

# List all files in the bucket, printing names of files that are pending replication
for file_version, folder_name in bucket.ls(recursive=True):
    if file_version.replication_status == ReplicationStatus.PENDING:
        print(file_version.file_name)

Note: Backblaze B2’s S3-compatible API (just like Amazon S3 itself) does not include replication status when listing bucket contents—so for this purpose, it’s much more efficient to use the B2 Native API, as used by the B2 Python SDK.

You can pause and resume replication rules, again via the web console or the API. No files are replicated while a rule is paused. After you resume replication, newly uploaded files are replicated as before. Assuming that the replication rule does not exclude existing files, any files that were uploaded while the rule was paused will be replicated in the next midnight-UTC replication job.

How to Replicate Production Data for Testing

The first question is: does your system and acceptance testing strategy require read-write access to the replicated data, or is read-only access sufficient?

Read-Only Access Testing

If read-only access suffices, it might be tempting to create a read-only application key to test against the production environment, but be aware that testing and production make different demands on data. When we run a set of tests against a dataset, we usually don’t want the data to change during the test. That is: the production environment is a moving target, and we don’t want the changes that are normal in production to interfere with our tests. Creating a replica gives you a snapshot of real-world data against which you can run a series of tests and get consistent results.

It’s straightforward to create a read-only replica of a bucket: you just create a replication rule to replicate the data to a destination bucket, allow replication to complete, then pause replication. Now you can run system or acceptance tests against a static replica of your production data.

To later bring the replica up to date, simply resume replication and wait for the nightly replication job to complete. You can run the script shown in the previous section to verify that all files in the source bucket have been replicated.

Read-Write Access Testing

Alternatively, if, as is usually the case, your tests will create, update, and/or delete files in the replica bucket, there is a bit more work to do. Since testing intends to change the dataset you’ve replicated, there is no easy way to bring the source and destination buckets back into sync—changes may have happened in both buckets while your replication rule was paused. 

In this case, you must delete the replication rule, replicated files, and the replica bucket, then create a new destination bucket and rule. You can reuse the destination bucket name if you wish since, internally, replication status is tracked via the bucket ID.
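Here's a rough sketch of that cleanup with the B2 Python SDK. Delete the replication rule first (in the web console or via the API, not shown here), then clear out and remove the replica bucket; the bucket name below is just an example:

import os

from b2sdk.v2 import B2Api, InMemoryAccountInfo

b2_api = B2Api(InMemoryAccountInfo())
b2_api.authorize_account("production", os.environ["B2_APPLICATION_KEY_ID"], os.environ["B2_APPLICATION_KEY"])

# The replica (destination) bucket used by the now-deleted replication rule.
bucket = b2_api.get_bucket_by_name("my-replica-bucket")

# Delete every version of every replicated file...
for file_version, _ in bucket.ls(latest_only=False, recursive=True):
    b2_api.delete_file_version(file_version.id_, file_version.file_name)

# ...then delete the empty bucket itself. Recreate the destination bucket and
# a new replication rule to start the next round of testing from a clean slate.
b2_api.delete_bucket(bucket)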

Always Test Your Code in an Environment Other Than Production

In short, we all want to lead interesting lives—but let’s introduce risk in a controlled way, by testing code in the proper environments. Cloud Replication lets you achieve that end while remaining nimble, which means you get to spend more time creating interesting tests to improve your product and less time trying to figure out why your data transformed in unexpected ways.  

Now you have everything you need to create test and staging environments for applications that use Backblaze B2 Cloud Object Storage. If you don’t already have a Backblaze B2 account, sign up here to receive 10GB of storage, free, to try it out.

The post How to Use Cloud Replication to Automate Environments appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Free Your Premiere Pro Workflows With Backblaze Cloud Storage

Post Syndicated from James Flores original https://www.backblaze.com/blog/free-your-premiere-pro-workflows-with-backblaze-cloud-storage/

A decorative image showing a mockup of Premiere Pro's user interface and the Backblaze storage cloud.

Projects and technologies come and go, and with each new tool comes a new set of workflow changes. But changing the way you move media around can be tough. Maybe you've always done things a certain way, and using a new tool feels like too much of a learning curve, especially when you're pressed for time. But the way you've always done things isn't always the best, easiest, or fastest way. Sometimes you need to change the status quo to level up your media operations. 

As a freelance editor, I worked on a recent project that presented some media storage challenges that demanded new approaches, challenges you might also be facing. I solved them with the cloud—but not an all-in-one cloud. My solution was a mix of cloud tools, including Adobe Premiere Pro, which gives me customization and flexibility—the best of all worlds in media workflows. 

Right Opportunity at the Right Time

Last year I had the opportunity to serve as a digital imaging technician (DIT) on the set of an indie film titled “Vengeance” produced by Falcon Pictures. The role of a DIT can vary. In many instances you’re simply a data wrangler making backups of the data being shot. In others, you work in the color space of the project creating color corrected dailies on set. For “Vengeance”, I was mostly data wrangling. 

“Vengeance” was an 11-day shoot in the mountains of Northern California near Bass Lake. While the rest of the crew spent their days hiking around with equipment, I was stationed back at home base with my DIT cart. With a lot of free time, I found myself logging data as it came in. Logging clip names soon turned into organizing bins and prepping the project for editing. And, while I was not the editor on the project, I was happy to help edit while I was on set. 

The Challenge

A few months after my work as DIT ended, it became clear that “Vengeance” needed a boost in post-production. The editing was a bit stuck—they had no assistant editor to complete logging and to sound sync all the footage. So, I was asked to help out. The only problem: I needed to be able to share my work with another editor who lived 45 miles away.

A screenshot of an indie film, Vengeance, being edited in Adobe Premiere Pro.
Editing “Vengeance” in Adobe Premiere Pro.

Evaluating the World of Workflows and Cloud Tools

So we began to evaluate a few different solutions. It was clear that Adobe Premiere Pro would be used, but data storage was still a big question. We debated a few methods for sharing media:

  1. The traditional route: Sharing a studio. With the other editor 45 miles away, commuting and scheduling time with each other was going to be cumbersome. 
  2. Email: We could email project files back and forth as we worked, but how would we keep track of versioning? Project bloat was a big concern. 
  3. Sharing a shuttle drive. Or what I’m calling “Sneakernet 2.0.” This is a popular method, but far from efficient. 
  4. Google Drive or Dropbox: Another popular option, but also one that comes with costs and service limitations like rate limiting. 

None of these options were great, so we went back to the drawing board. 

The Solution: A Hybrid Workflow Designed for Our Needs

To come to a final decision for this workflow, we made a list of our needs: 

  • The ability to share a Premiere Pro project file for updates. 
  • The ability to share media for the project. 
  • No exchanging external hard drives. 
  • No driving (a car).  
  • Changes need to be real time.

Based on those needs, here’s where we landed.

Sharing Project Files

Adobe recently released a new update to its Team Projects feature within Premiere Pro. Team Projects allows you to host a Premiere Pro project in the Adobe cloud and share it with other Adobe Creative Cloud users. This gave us the flexibility to share a single project and push updates in real time, which means no emailing of project files, versioning issues, or bloated files. That left the issue of the media: how do we share media? 

Sharing Media Files

You may think that it would be obvious to share files in the Adobe Creative Cloud, where you get 100GB free. And while 100GB may be enough storage for .psd and .ai files, 100GB is nothing for video, especially when we are talking about RED (.r3d) files, which start off as approximately 4GB chunks and can quickly add up to terabytes of footage. 

So we put everything in a Backblaze B2 Bucket. All the .r3d source files went directly from my Synology network attached storage (NAS) into a Backblaze B2 Bucket using the Synology Cloud Sync tool. In addition to the source files, I used Adobe Media Encoder to generate proxy files of all the .r3d files. This folder of proxy files also synced with Backblaze automatically. 

Making Changes in Real Time

What was great about this solution is that all of the uploading is done automatically via a seamless Backblaze + Synology integration, and the Premiere Pro Team Project had a slew of publish functions perfect for real-time updates. And because the project files and proxies are stored in the cloud, I could get to them from several computers. I spent time at my desktop PC logging and syncing footage, but was also able to move to my couch and do the same from my MacBook Pro. I never had to move hard drives around, copy projects files, or worry about version control.

The other editor was able to connect to my Backblaze B2 Bucket using Cyberduck, a cloud storage browser for Mac. Using Cyberduck, he was able to pull down all the proxy files I created and share any files that he created. So, we were synced for the entire duration of the project. 

Once the technology was configured, I was able to finish logging for "Vengeance", sync all the sound, build out stringouts and assemblies, and even assemble a rough cut of every scene for the entire movie, giving the post-production process the boost it needed.

A diagram showing how editors use Backblaze B2 Cloud Storage with Adobe Premiere Pro.

The Power of Centralized Storage for Media Workflows

Technology is constantly evolving, and, in most circumstances, technology makes how we work a lot easier. For years, filmmakers have worked on projects by physically moving source material, whether it was on film reels, tapes, or hard drives. The cloud changed all that. 

The key to getting "Vengeance" through post-production was our centralized approach to file management. The files already existed in Backblaze; we simply brought Premiere Pro to the data rather than moving a huge number of files to Premiere Pro via the Creative Cloud. 

The mix of technologies let us create a customized flow that worked for us. Creative Cloud had the benefit of providing a project-sharing mechanism, and Backblaze provided a method of sharing media (via Synology and Cyberduck) regardless of the tooling each editor had. 

Once we hit picture lock, the centralized files will serve as a distribution point for VFX, color, and sound, making turnover a breeze. It can even be used as a distribution hub—check out how American Public Television uses Backblaze to distribute their finished assets. 

Centralizing in the cloud not only made it easy for me to work from home, it also allowed us to collaborate on a project with ease, eliminating the overhead of driving, shuttle drive delivery (Sneakernet 2.0), and version control. The best part? A workflow like this is affordable for any size production and can be set up in minutes. 

Have you recently moved to a cloud workflow? Let us know what you’re using and how it went in the comments. 

The post Free Your Premiere Pro Workflows With Backblaze Cloud Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Things Might Look a Little Different Around Here: Technical Documentation Gets an Upgrade

Post Syndicated from Alison McClelland original https://www.backblaze.com/blog/things-might-look-a-little-different-around-here-technical-documentation-gets-an-upgrade/

A decorative image of a computer displaying the title Introducing the New Backblaze B2 Cloud Storage Documentation Portal.

When you’re working hard on an IT or development project, you need to be able to find instructions about the tools you’re using quickly. And, it helps if those instructions are easy to use, easy to understand, and easy to share. 

On the Technical Publications team, we spend a lot of time thinking about how to make our docs just that—easy. 

Today, the fruits of a lot of thinking and reorganizing and refining are paying off. The new Backblaze technical documentation portal is live.

Explore the Portal ➔ 

What’s New in the Tech Docs Portal?

The documentation portal has been completely overhauled to deliver on-demand content with a modern look and feel. Whether you’re a developer, web user, or someone who wants to understand how our products and services work, our portal is designed to be user-friendly, with a clean and intuitive interface that makes it easy to navigate and find the information you need.

Here are some highlights of what you can look forward to:

  • New and updated articles right on the landing page—so you’re always the first to know about important content changes.
  • A powerful search engine to help you find topics quickly.
  • A more logical navigation menu that organizes content into sections for easy browsing.
  • Information about all of the Backblaze B2 features and services in the About section.

You can get started using the Backblaze UI quickly to create application keys, create buckets, manage your files, and more. If you’re programmatically managing your data, we’ve included resources such as SDKs, developer quick-start guides, and step-by-step integration guides. 

Perhaps the most exciting enhancement is our API documentation. This resource provides endpoints, parameters, and responses for all three of our APIs: S3-Compatible, B2 Native, and Partner API.   
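As a quick taste of what the S3-Compatible API documentation covers, here's a minimal sketch using boto3. The endpoint below is an example; use the endpoint for your bucket's region and supply your own application key via environment variables:

import os

import boto3

# Backblaze B2's S3-compatible endpoint follows the pattern
# https://s3.<region>.backblazeb2.com; the region shown here is an example.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",
    aws_access_key_id=os.environ["B2_APPLICATION_KEY_ID"],
    aws_secret_access_key=os.environ["B2_APPLICATION_KEY"],
)

# List your buckets, just as you would against Amazon S3.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])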

For Fun: A Brief History of Technical Documentation

As our team put our heads together to think about how to announce the new portal, we went down some internet rabbit holes on the history of technical documentation. Technical documentation was recognized as a profession around the start of World War II, when technical documents became a necessity for military purposes. (Note: This was also the era when a "computer" referred to a job for a person, meaning "one who computes.") But the first technical content in the Western world can be traced back to 1650 B.C.—the Rhind Papyrus describes some of the mathematical knowledge and methods of the Egyptians. And the title of first Technical Writer? That goes to none other than poet Geoffrey Chaucer of Canterbury Tales fame for his lesser-known work "A Treatise on the Astrolabe"—a tool that measures angles to calculate time and determine latitude.

A photograph of an astrolabe.
An astrolabe, or, as the Smithsonian calls it, “the original smartphone.” Image source.

After that history lesson, we ourselves waxed a bit poetic about the “old days” when we wrote long manuals in word processing software that were meant to be printed, compiled long indexes for user guides using desktop publishing tools, and wrote more XML code in structured authoring programs than actual content. These days we use what-you-see-is-what-you-get (WYSIWYG) editors in cloud-based content management systems which make producing content much easier and quicker—and none of us are dreaming in HTML anymore. 

<section><p>Or maybe we are.</p></section>

Overall, the history of documentation in the tech industry reflects the changing needs of users and the progression of technology. It evolved from technical manuals for experts to user-centric, accessible resources for audiences of all levels of technical proficiency.

The Future of Backblaze Technical Documentation Portal

In the coming months, you’ll see even more Backblaze B2 Cloud Storage content including many third-party integration guides. Backblaze Computer Backup documentation will also find a home here in this new portal so that you’ll have a one-stop-shop for all of your Backblaze technical and help documentation needs. 

We are committed to providing the best possible customer-focused documentation experience. Explore the portal to see how our documentation can make using Backblaze even easier!

The post Things Might Look a Little Different Around Here: Technical Documentation Gets an Upgrade appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.