Tag Archives: Featured-Cloud Storage

Do More with Backblaze B2: A Tour of the Backblaze GitHub Repositories

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/do-more-with-backblaze-b2-a-tour-of-the-backblaze-github-repositories/

A decorative image showing a computer with the GitHub logo and the Backblaze logo superimposed on files.

If you work with Backblaze B2, you’re probably already aware of resources such as the Backblaze B2 Python SDK and the Backblaze B2 Command Line Tool, but did you know that there is also a Terraform Provider for Backblaze B2, an SDK for Java, and a whole slew of open source samples showing how to integrate with Backblaze B2 from web browsers, serverless platforms, and more? Today, I’ll take you on a quick tour of our open source SDKs, tools, and sample code, pointing out some interesting sights along the way.

Why open source?

We’ve long been believers in open source code here at Backblaze, open sourcing our implementation of Reed-Solomon erasure coding back in 2015, and, even before then, sharing our Storage Pod designs and, of course, Drive Stats, the statistics and insights based on our observations of the hard drives we operate in our data centers, including the raw metrics we collect from many thousands of hard drives, every day.

While the Storage Pod designs and Drive Stats live here on the Backblaze website, we make our open source code available via two GitHub organizations:

Let’s take a closer look.

Official Backblaze SDKs and tools

You can use any of AWS’ range of SDKs, plus the AWS Command Line Interface (CLI), to access Backblaze B2 via its S3 Compatible API; just remember to configure the endpoint URL as well as the access key ID and secret access key.
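As a minimal sketch of that configuration in Python, assuming the third-party boto3 package is installed: the helper names below are illustrative, and the endpoint URL in the usage note is only a placeholder in the usual s3.<region>.backblazeb2.com form, so substitute the endpoint shown for your own bucket.

```python
def b2_s3_client_kwargs(endpoint_url, key_id, application_key):
    """Build keyword arguments for boto3.client() aimed at an S3-compatible endpoint."""
    if not endpoint_url.startswith("https://"):
        raise ValueError("endpoint_url should be an https:// URL")
    return {
        "service_name": "s3",
        "endpoint_url": endpoint_url,              # the setting that is easy to forget
        "aws_access_key_id": key_id,               # application key ID
        "aws_secret_access_key": application_key,  # application key
    }


def make_b2_s3_client(endpoint_url, key_id, application_key):
    """Create the client; boto3 is imported lazily so the helper above has no dependencies."""
    import boto3  # third-party package, assumed to be installed

    return boto3.client(**b2_s3_client_kwargs(endpoint_url, key_id, application_key))
```

For example, make_b2_s3_client("https://s3.us-west-004.backblazeb2.com", key_id, application_key) would return a client whose list_buckets() call goes to Backblaze B2 rather than to AWS.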

Not every Backblaze B2 operation is accessible via the S3 Compatible API—for example, application key management—so we also support a range of open source SDKs for accessing Backblaze B2’s Native API from a variety of programming languages:

  • The Backblaze B2 Python SDK: This SDK provides access to the basic operations of the Native API, such as list_buckets() and download_file_by_id(), as well as a powerful Synchronizer class that implements high performance, multi-threaded file copying between Backblaze B2 and local file storage.
  • The Backblaze B2 Java SDK: Although it doesn’t include anything quite as sophisticated as the Python Synchronizer, the Java SDK does implement high-level functionality such as uploadLargeFile(), which encapsulates all of the mechanics of a multi-threaded file upload in a single method call. We also use it internally at Backblaze in our production environment. 
  • blazer, an open source Backblaze B2 SDK for Go (aka golang): We adopted blazer from its original author, Toby Burress, when he was no longer able to maintain it. We’ve made a few improvements since taking it on, and we’re looking at doing more with it.

The Backblaze GitHub organization also contains a pair of tools built on the Python SDK:

The remaining repositories contain utilities and other code that we have published over the years, including our open source Reed-Solomon erasure coding implementation and a utility we wrote to support migrating a live Cassandra cluster from one data center to another.

Backblaze sample and demo code

Our https://github.com/backblaze-b2-samples organization contains, at the time of writing, 34 repositories, demonstrating how to use Backblaze B2 in a wide variety of situations. We’ve covered a few of them in past blog posts:

As you explore the https://github.com/backblaze-b2-samples organization, you’ll also find repositories that have not yet been covered here on the Backblaze blog:

  • B2listen allows you to forward Backblaze B2 Event Notifications to a service listening on a local URL. B2listen uses Cloudflare’s free Quick Tunnels feature to proxy traffic from an internet-accessible URL to a local endpoint.
  • B2 Browser Upload shows you how to upload files directly to Backblaze B2 from JavaScript code running in the browser, with sample code for both the Backblaze B2 Native and S3-compatible APIs.
  • The Backblaze B2 Zip Files Example implements a simple Python web app, using the Flask web application framework and the flask-executor task queue, that can compress a set of files located in Backblaze B2 into an archive, also stored in Backblaze B2, without using any local storage.

We’ll write more about these, and other, as yet unreleased, open source projects, over the coming weeks and months, but, if you’d like us to prioritize any of the above three repositories, or any of our other projects, let us know in the comments!

The post Do More with Backblaze B2: A Tour of the Backblaze GitHub Repositories appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Three Surprising Factors that Affect Cloud Performance

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/three-surprising-factors-that-affect-cloud-performance/

A decorative image showing a cloud and data graphs.

When you think about cloud performance, metrics like latency and throughput are probably the first things that come to mind. We covered those metrics pretty extensively here and here. So, today, I’m walking through some factors that affect cloud performance that may not get talked about as often, including:

  • The size of your files.
  • The number of parts you upload or download.
  • Block (part) size.

These factors may not be “surprising” per se, especially if you remember the pain of trying to download The Matrix over dial-up. But they are all things that you should consider (and that you have more control over) when thinking about cloud performance overall.

Let’s dig in.

1. The size of your files

This one is pretty obvious. Larger files take longer because they require more data to be transferred. If you have a 10Mbps upload connection, a 1GB file will take approximately 800 seconds (13 minutes and 20 seconds) to upload, whereas a 100MB file will take about 80 seconds (a minute and 20 seconds). Most enterprise-grade internet connections offer higher upload speeds, but 10Mbps makes the math approachable for the sake of argument.  
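The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: it assumes decimal units (1GB = 1,000,000,000 bytes, 1Mbps = 1,000,000 bits per second) and ignores protocol overhead and latency, and the function name is our own.

```python
GB = 1_000_000_000
MB = 1_000_000


def upload_seconds(size_bytes, link_mbps):
    """Ideal transfer time: total bits divided by link speed in bits per second."""
    return (size_bytes * 8) / (link_mbps * 1_000_000)


print(upload_seconds(1 * GB, 10))    # 800.0 seconds, or 13 minutes and 20 seconds
print(upload_seconds(100 * MB, 10))  # 80.0 seconds
```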

Small files—that is, those less than 5GB—can be uploaded in a single API call. (Note: this can vary based on cloud storage provider and configuration.) Larger files up to 10TB can be uploaded as “parts” in multiple API calls. Each part has to be a minimum of 5MB and a maximum of 5GB. 

You’ll notice that there is quite an overlap here! For uploading files between 5MB and 5GB, is it better to upload them in a single API call, or split them into parts? What is the optimum part size? For backup applications, which typically split all data into equally sized blocks, storing each block as a file, what is the optimum block size? As with many questions, the answer is: it depends.
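To make the mechanics of parts concrete, here is a rough sketch of how an application might split a file into byte ranges for a multipart upload. The limits are the 5MB and 5GB figures quoted above, in decimal units; the function name is hypothetical, and the sketch assumes (as is common with S3-style multipart uploads) that only the final part may be smaller than the minimum.

```python
MB = 1_000_000
GB = 1_000_000_000
MIN_PART = 5 * MB  # minimum part size quoted above
MAX_PART = 5 * GB  # maximum part size quoted above


def plan_parts(file_size, part_size):
    """Return a list of (offset, length) byte ranges covering the whole file."""
    if not MIN_PART <= part_size <= MAX_PART:
        raise ValueError("part_size must be between 5MB and 5GB")
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    parts = []
    offset = 0
    while offset < file_size:
        length = min(part_size, file_size - offset)  # the last part may be shorter
        parts.append((offset, length))
        offset += length
    return parts
```

Calling plan_parts(1 * GB, 100 * MB) yields ten ranges of 100MB each, so the same 1GB file can be sent as one call or as ten, depending on the part size you choose.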

2. The number of parts you upload or download

Each API call incurs a more-or-less fixed overhead due to latency. For a 1GB file, assuming a single thread of execution, uploading all 1GB in a single API call will be faster than 10 API calls each uploading a 100MB part, since those additional nine API calls each incur some latency overhead. So, bigger is better, right?

3. Block (part) size

Not necessarily, and that brings us to part size. The comparison above assumed a single thread of execution. Multi-threading lets us upload multiple parts simultaneously, which improves performance, but there are trade-offs. Typically, each part must be stored in memory as it is uploaded, so more threads means more memory consumption. If the number of threads multiplied by the part size exceeds available memory, then either the application will fail with an out of memory error, or data will be swapped to disk, reducing performance.

Downloading data offers even more flexibility, since applications can specify any portion of the file to download in each API call. Whether uploading or downloading, there is a maximum number of threads that will drive throughput to consume all of the available bandwidth. Exceeding this maximum will consume more memory, but provide no performance benefit. 
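The trade-off between call overhead, threads, and memory can be explored with a toy model. Everything here is an assumption made for illustration: each API call pays a fixed overhead, calls running concurrently pay that overhead at the same time, the network link is the only bottleneck, and each in-flight part is buffered fully in memory. Real tools will behave differently, so treat the numbers as a way to reason, not as predictions.

```python
MB = 1_000_000
GB = 1_000_000_000


def estimate_upload(file_size, part_size, threads, overhead_s, bandwidth_bytes_per_s):
    """Return (parts, estimated_seconds, peak_buffer_bytes) under the toy model."""
    parts = -(-file_size // part_size)  # ceiling division
    rounds = -(-parts // threads)       # batches of concurrent API calls
    seconds = rounds * overhead_s + file_size / bandwidth_bytes_per_s
    peak_buffer = min(threads, parts) * min(part_size, file_size)
    return parts, seconds, peak_buffer


def max_threads_for_memory(available_bytes, part_size):
    """Largest thread count whose part buffers still fit in available memory."""
    return available_bytes // part_size


link = 1_250_000  # 10Mbps expressed in bytes per second
print(estimate_upload(1 * GB, 1 * GB, 1, 0.25, link))     # one call: (1, 800.25, 1000000000)
print(estimate_upload(1 * GB, 100 * MB, 1, 0.25, link))   # ten calls, one thread: (10, 802.5, 100000000)
print(estimate_upload(1 * GB, 100 * MB, 10, 0.25, link))  # ten calls, ten threads: (10, 800.25, 1000000000)
```

In this model, ten threads recover the time lost to per-call overhead, but only by holding ten parts in memory at once, which is the trade-off described above.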

So, what to do to get the best performance possible for your use case? 

Simple: Customize your settings

Most backup and file transfer tools allow you to configure the number of threads and the amount of data to be transferred per API call, whether that’s block size or part size. If you are writing your own application, you should allow for these parameters to be configured. When it comes to deployment, some experimentation may be required to achieve maximum throughput given available memory.

The big takeaway: When it comes to cloud performance, the metrics you need to care about and the performance you actually need are highly dependent on your use case, your own infrastructure, your workload, and all the network connections between your infrastructure and the cloud provider as well. So, when you’re deciding how to store and use your data, it’s worth taking some extra time to consider the above factors for optimum performance. 

The post Three Surprising Factors that Affect Cloud Performance appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Five Tips for Creating a Predictable Cloud Storage Budget

Post Syndicated from David Johnson original https://www.backblaze.com/blog/calculate-cost-cloud-storage/

A decorative image showing buildings, data, and icons indicating cost.

Editor’s Note

This post has been updated since it was originally published.

With spending on public cloud services expected to double by 2028, many businesses are looking for ways to cut cloud costs—or at least gain predictability in their spend. Forecasting cloud storage costs should be straightforward once you know what to look for.

Here are five tips you can use when doing your due diligence on the cloud storage vendors you are considering. The goal is to create a cloud storage forecast that you can rely on each and every month.

Tip 1: Navigate tiered pricing structures carefully

Many cloud providers still use tiered pricing structures, which can be misleading if not carefully understood. For example:

AWS S3 Storage Pricing Example

For this post, we’re comparing with hypothetical data stored in AWS S3’s U.S. East Region (N. Virginia) using pricing available at the time of publishing. Note that many factors may affect your final price, including selecting a different region, choosing a different storage tier, etc.

  • First 50 TB/month = $0.023 per GB
  • Next 450 TB/month = $0.022 per GB
  • Over 500 TB/month = $0.021 per GB

In order to receive lower pricing, you have to reach a specific amount of data stored. But the lower rate only applies to data above the threshold for that tier. In other words, you don’t get a discount on the cumulative amount: each slice of your data is billed at the rate of the tier it falls into.

The mistake sometimes made is estimating your entire storage cost based on the level for the total data stored. For example, if you had 600TB of storage, you could wrongly calculate as follows:

600,000GB x $0.021 = $12,600/month

When, in fact, you should do the following:

(50,000GB x $0.023) + (450,000GB x $0.022) + (100,000GB x $0.021) = $13,150/month

That was just for storage. Make sure you consider the tiered pricing tables for data retrieval, and API transactions as well.
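The correct calculation above generalizes to any volume. The sketch below hard-codes the three example tiers from this post, uses 1TB = 1,000GB as the arithmetic above does, and relies on Decimal to avoid floating-point rounding; the function and variable names are our own.

```python
from decimal import Decimal

# (tier size in GB, price per GB); None marks the open-ended top tier
TIERS = [
    (50_000, Decimal("0.023")),
    (450_000, Decimal("0.022")),
    (None, Decimal("0.021")),
]


def tiered_storage_cost(total_gb, tiers=TIERS):
    """Charge each slice of data at the rate of the tier it falls into."""
    remaining = total_gb
    cost = Decimal("0")
    for size, rate in tiers:
        in_tier = remaining if size is None else min(remaining, size)
        cost += in_tier * rate
        remaining -= in_tier
        if remaining == 0:
            break
    return cost


print(tiered_storage_cost(600_000))  # 13150.000, the tier-by-tier figure
print(600_000 * Decimal("0.021"))    # 12600.000, the mistaken flat-rate figure
```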

Tip 2: Don’t choose the wrong storage class

Many cloud providers, especially hyperscalers, now offer a wider array of storage classes than ever before. The idea is that you can trade service capabilities for lower costs. If you don’t need immediate access to your files or don’t want data replication or 11 nines of durability, you can choose to downgrade your service and gain cost savings. The biggest problem with this method is that you have to know what you are going to do with your data to pick the right service—as well as correctly anticipate future business needs—because mistakes can get very expensive. For example:

  • You choose a low cost, cold storage tier that takes hours or days to restore your data. What can go wrong? You need some files back immediately (if, for example, your backups are corrupted by ransomware) and you end up paying 10-20 times the cost to expedite your restore.
  • You choose one storage class and decide you want to upload some data to a compute-based application or to another region—features not part of your current service. The good news? You can usually move the data. The bad news? Even if you’re transferring within the same cloud storage company’s infrastructure, you’re often charged a transfer fee to move the data because you didn’t choose the right storage class when you started. These fees often eradicate any “savings” you had gotten from the lower priced tier.

Basically, if your needs change as they pertain to the data you have stored, you will pay more than you expect to get your data where you need it to be.

Tip 3: Don’t pay for deleted (or modified) files

Some cloud storage companies have a minimum amount of time you are charged for storage for each file uploaded. Typically this minimum period is between 30 and 90 days. You are charged even if you delete the file before the minimum period. For example (assuming a 90 day minimum period), if you upload a file today and delete the file tomorrow, you still have to pay for storing that deleted file for the next 88 days.

This “feature” often extends to files deleted due to versioning. If you set your system to keep three versions of each file, with older versions automatically deleted, you end up paying for those deleted versions for the full minimum duration.

In a typical backup workflow, let’s say you are using a cloud storage service to store your files and your backup program is set to a 30 day retention. That means you will be perpetually paying for an additional 60 days worth of storage (for files that were pruned at 30 days). In other words, you would be paying for a 90 day retention period even though you only have 30 days worth of backups.
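A short sketch shows how a minimum storage duration inflates the billed volume in that backup scenario. The 100GB-per-day upload rate is a made-up figure for illustration, and the function names are hypothetical.

```python
def billable_days(days_stored, minimum_days=90):
    """Days you are charged for a file, given a minimum storage duration."""
    return max(days_stored, minimum_days)


def steady_state_gb(daily_upload_gb, retention_days, minimum_days=0):
    """Volume billed once the backup rotation has been running long enough."""
    return daily_upload_gb * billable_days(retention_days, minimum_days)


kept = steady_state_gb(100, 30)        # data you actually retain
billed = steady_state_gb(100, 30, 90)  # data you are billed for
print(kept, billed, billed - kept)     # 3000 9000 6000
```

With a 30 day retention and a 90 day minimum, two thirds of the billed volume in this example is data that has already been pruned.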

Tip 4: Beware of hidden minimums

As the cloud storage market has matured, pricing models have become more complicated. To create an accurate budget, it’s crucial to understand all potential cost components, including some that might not be immediately obvious. Here are two key areas to examine:

  1. Minimum monthly charges: Some providers charge a set fee regardless of how little you store. For instance, you might pay for 1TB even if you only use 100GB.
  2. Minimum file sizes: Some services round up small files to a minimum billable size, often 128KB. While this might seem insignificant, it can add up quickly if you have millions of small files.
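Both minimums are easy to model. The sketch below assumes 128KB means 128 × 1,024 bytes and uses the 1TB floor from the example above; the helper names and the file sizes in the usage lines are illustrative only.

```python
KB = 1024  # assumption: binary kilobytes


def billable_bytes(file_sizes, minimum_object_size=128 * KB):
    """Round every small file up to the minimum billable size before summing."""
    return sum(max(size, minimum_object_size) for size in file_sizes)


def billable_volume_gb(stored_gb, minimum_gb=1000):
    """Apply a minimum monthly volume, for example 1TB even if you store less."""
    return max(stored_gb, minimum_gb)


small_files = [4 * KB] * 1000
print(sum(small_files))             # 4096000 bytes actually stored
print(billable_bytes(small_files))  # 131072000 bytes billed, 32 times more
print(billable_volume_gb(100))      # 1000
```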

Tip 5: Be suspicious of the fine print

Misdirection is the art of getting you to focus on one thing so you don’t focus on other things going on. Practiced by magicians and some cloud storage companies, the idea is to get you to focus on certain features and capabilities without delving below the surface into the fine print. (And sometimes the prices this technique generates feel like someone has pulled a rabbit out of a hat, to your company’s detriment.)

Read the fine print, and as you scroll through the multi-page pricing tables and linked pages of rules that shape how you can use a given cloud storage service, stop and ask, “What are they trying to hide?” If you find phrases like “We reserve the right to limit your egress traffic,” “New users get free usage tier for 12 months,” or “Provisioned requests should be used when you need a guarantee that your retrieval capacity will be available when you need it,” take heed.

And, even if it seems like you can turn the tables and use things like free credits in the short term, remember that you’ll want to have a plan for your long-term infrastructure when those credits run out as well. 

How to build a predictable cloud storage budget

As organizations increasingly rely on cloud storage for everything from day-to-day operations to long-term data archiving, the ability to accurately forecast and control these costs can significantly impact overall IT budgets and business planning.

The first place to start is data storage as it’s generally the easiest for a company to calculate. For a given month, you can calculate your data volume as follows:

Data stored = current data + new data – deleted data

Take that total and multiply it by the monthly storage rate and you’ll get your monthly storage costs.

Things can get more complicated if your business regularly uploads and downloads data. The data stored at the end of the month should get you at least in the ballpark. But, creating a predictable cloud storage budget requires a holistic understanding of your data needs, usage patterns, and the pricing structures of your chosen provider. It’s not just about estimating how much data you’ll store, but also how you’ll interact with that data over time. Will you be frequently accessing and modifying files, or primarily using the storage for long-term archiving? Are there seasonal fluctuations in your data storage or retrieval patterns? These factors can all influence your overall costs, and we’ll walk through a scenario to show that next.

Let’s do the math

To illustrate how to calculate your cloud storage costs, let’s work through an example using current Backblaze B2 pricing. We’ll focus on a single month for a growing business that is backing up business data to the cloud and verifying their backups have zero errors during recovery:

  • Initial storage at the beginning of the month: 100TB
  • New data added during the month: 10TB
  • Data deleted during the month: 5TB
  • Downloads during the month (egress): 75TB

Backblaze has built a cloud storage calculator that computes costs for all of the major cloud storage providers. Using this calculator, we find that Amazon S3 would cost $2,675 to store this data for a month, while Backblaze B2 would charge just $630.

Using those numbers for storage and assuming you download 75TB a month for backup validation testing, you get a total monthly cost of $8,725 for Amazon S3; Backblaze B2 would be $630 a month. 

The additional cost you see from AWS S3 is from download costs, also known as egress fees, and they can certainly take a toll on your budget. Backblaze offers free egress up to three times the amount you have stored so you can move data when and where you prefer.

The chart below provides the breakdown of the expected cost.

            Backblaze B2    Amazon S3
Storage     $630            $2,675
Egress      Free*           $6,050
Totals:     $630            $8,725

*Up to 3x of average monthly data stored, then $0.01/GB for additional egress.
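The Backblaze B2 column can be reproduced in a few lines. Treat this as a sketch under stated assumptions: the $6 per TB storage rate is inferred from the example ($630 for 105TB), the free egress allowance and the $0.01/GB overage come from the footnote above, 1TB is taken as 1,000GB, and the end-of-month volume stands in for the average monthly data stored. The function names are our own, and the Amazon S3 figures come from the calculator and are not recomputed here.

```python
from decimal import Decimal


def data_stored_tb(current_tb, new_tb, deleted_tb):
    """Data stored = current data + new data - deleted data."""
    return current_tb + new_tb - deleted_tb


def b2_monthly_cost(stored_tb, egress_tb,
                    storage_rate_per_tb=Decimal("6"),    # inferred from the example
                    free_egress_multiple=3,              # from the table footnote
                    egress_rate_per_gb=Decimal("0.01")):  # from the table footnote
    """Return (storage_cost, egress_cost) in dollars."""
    storage = stored_tb * storage_rate_per_tb
    billable_egress_tb = max(0, egress_tb - free_egress_multiple * stored_tb)
    egress = billable_egress_tb * 1000 * egress_rate_per_gb
    return storage, egress


stored = data_stored_tb(100, 10, 5)  # 105
print(b2_monthly_cost(stored, 75))   # storage 630, egress 0: 75TB is within the 315TB allowance
print(b2_monthly_cost(stored, 400))  # storage 630, egress 850.00 for the 85TB above the allowance
```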

Of course each month you will add and delete storage, so you’ll have to account for that in your forecast. And, as we mentioned above, there may also be other fees like minimum storage duration fees or API transaction fees. Using the cloud storage calculator noted above, you can get a reasonable estimate of your total cost over the budget forecasting period.

Finally, you can use the Backblaze B2 storage calculator to address potential use cases that are outside of your normal operations, such as if you delete a large project from your storage or you need to download a large amount of data. Running the calculator for these types of actions lets you obtain a solid estimate for their effect on your budget before they happen and lets you plan accordingly.

Understanding cloud storage pricing gives you options

Creating a predictable cloud storage forecast is key to taking full advantage of all of the value in cloud storage. Organizations like Austin City Limits, Amplify, and Runbiz were able to move to the cloud because they could reliably predict their cloud storage cost with Backblaze B2. You don’t have to let pricing tiers, hidden costs, and fine print stop you. Backblaze makes predicting your cloud storage costs easy.

The post Five Tips for Creating a Predictable Cloud Storage Budget appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Network Stats: Ingress Trends and What They Tell Us About Backup Behaviors

Post Syndicated from Brent Nowak original https://www.backblaze.com/blog/backblaze-network-stats-ingress-trends-and-what-they-tell-us-about-backup-behaviors/

An image with a background pattern of trend lines and the words "Network Stats Ingress Trends and what they tell us"

Every day, thousands of Backblaze customers create and update files. These changes make their way into our system to be securely stored. Sometimes they are sent to us immediately, while other times the differentials are batched up into a job that runs at a scheduled time. 

In this post, I’m sampling three points in our network where we take in a lot of ingress traffic off of the internet, and we’re going to explore some of the trends that we see. 

Reading the ingress tea leaves

So, why do we care about ingress trends? In short, it helps us with capacity planning, and it also tells us a lot about how people use cloud storage. We often think of planning in longer terms—weeks, months, or years. Here I wanted to focus on some of the patterns that we see during a shorter period; for example, a single day or a significant date, like the end of the calendar month. There are some interesting patterns we see in our client behavior that keep us on our toes when we are performing capacity planning.

We currently have two product offerings that have different usage and traffic patterns:

  • Backblaze B2 Cloud Storage: Ingress and egress, high variance in traffic levels throughout the day, hour, and at the start of month. 
  • Backblaze Computer Backup: Heavy ingress, with a small variance in traffic levels during the business day or weekday vs. weekend.

Since humans are using our system, we see very human quirks in our traffic profiles. For example, we humans like round numbers! We notice that a lot of backup jobs kick off at midnight local or UTC, or fire off at the top of the hour, or trigger on the first of the month. This means we see spikes of network traffic during these periods. Additionally, a lot of new content gets created during the day and then queued up to be uploaded to us in an overnight backup job.

Scope and terms

Today we’re going to look at ingress traffic, which means we’re monitoring uploads from both Backblaze Computer Backup and Backblaze B2 into our environment. We’ll save downloads, traffic coming out of Backblaze, for analysis in future posts.

One common term that you’ll see on our graphs is the 95th percentile. The 95th percentile is the value that 95% of all measurements fall under and only 5% exceed. This is a very typical method to use for monitoring, billing, and trend analysis in the telecom industry. It maps to a standard bell curve, and tells you that you’re capturing the vast majority of usage for planning purposes.

A chart displaying a bell curve and percentiles
A standard bell curve. Source.

In one of our monitoring systems, we are sampling and recording the utilization on our network links and computing a 95th percentile over a five minute period.
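For readers who want to compute the same statistic on their own link utilization samples, here is a minimal sketch using the nearest-rank method. Monitoring systems differ in the exact percentile definition and sampling interval they use, so this illustrates the idea and is not a description of our tooling; the function name and sample values are made up.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Smallest sample such that at least pct percent of samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Twenty utilization samples: steady traffic plus one short spike.
samples = [10] * 19 + [1000]
print(percentile_nearest_rank(samples, 95))   # 10, the spike falls in the top 5%
print(percentile_nearest_rank(samples, 100))  # 1000
```

The single spike does not move the 95th percentile, which is why the measure is popular for planning around sustained load rather than brief peaks.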

With these items defined, let’s get into the data with some charts!

Sample 1: One-month trend

In this first sample, we see that the majority of our daily traffic falls within a nice range. What stands out here is the clock tick over from February to March, where we see a spike of ingress traffic that is outside the expected daily range.

A chart displaying a sample of ingress trends over one month.

Taking that same dataset, let’s take a closer look at the end of the month and zoom in on the calendar change into March.

Adding a vertical red line on 00:00 UTC where the month changes over, we see that there must be a lot of automated jobs that kick in right at the clock changeover into the new month.

A chart showing ingress trends over 7 days.

Sample 2: Top of the hour

Taking a look at another traffic sample from another point in our network, we see very distinct traffic patterns on the top of almost every hour.

A chart showing ingress trends over 24 hours

Sample 3: Pacific Time Zone working hours

Here’s a sample of traffic in our US-West region. During the business day on the West Coast, we see a lull in traffic, with a pickup after the business day is done. This makes sense to us as there are jobs that back up daily content that start to send traffic to us overnight.

A chart showing ingress trends over three days.

What does this mean for you?

It’s very interesting to see the impact of humans in our network traffic and the patterns that emerge. Generally we humans create and modify things during the day, and we like to back them up overnight for safekeeping. And we also like round numbers: people tend to send data at the top of the hour, midnight, or end of the month.

All of these elements are very important in how we, at Backblaze, capacity plan and balance traffic over transit links. We do a lot of work to make sure that no matter what time of day or day of the month, you can reliably get your data into Backblaze.

But, you might also look at this data and take away a meaningful conclusion: Much like choosing to go to the grocery store at 10:30 a.m. on a Tuesday versus fighting the after-work rush at 6:00 p.m., scheduling jobs on the 15, 30, or 45 minute mark or mid-month instead of at the end of the month would mean you’re up against less traffic, which is never a bad thing (and it also smooths out our ingress, which we wouldn’t be mad about either).
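If you manage many scheduled jobs, one way to act on that suggestion is to derive each job’s start minute from its name, so jobs land on the 15, 30, or 45 minute marks instead of all firing at the top of the hour. The sketch below is one possible approach, not a Backblaze recommendation: the job name, the backup command, and the 2 a.m. default are placeholders, and the output uses the standard five-field cron format.

```python
import hashlib

OFF_PEAK_MINUTES = (15, 30, 45)  # the quarter-hour marks suggested above


def staggered_minute(job_name, candidates=OFF_PEAK_MINUTES):
    """Pick a stable, non-zero minute for a job based on a hash of its name."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return candidates[int.from_bytes(digest[:4], "big") % len(candidates)]


def cron_line(job_name, command, hour=2, day_of_month="*"):
    """Build a crontab entry: minute hour day-of-month month day-of-week command."""
    return f"{staggered_minute(job_name)} {hour} {day_of_month} * * {command}"


print(cron_line("nightly-backup", "/usr/local/bin/backup.sh"))
print(cron_line("monthly-archive", "/usr/local/bin/archive.sh", day_of_month=15))
```

Because the minute is derived from the job name, it stays the same from run to run, while different jobs tend to spread across the three marks.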

At the end of the day, however you choose to schedule your jobs works for us. We’re just glad we’re able to store and protect our customers’ data reliably and affordably, and we’re happy to pass along any tips and tricks for a better, less congested, backup experience as well.

Thanks for reading, and stay tuned for more graphs and commentary on how we strive to build a reliable, scalable, and forward-looking network to serve our customers’ needs.

The post Backblaze Network Stats: Ingress Trends and What They Tell Us About Backup Behaviors appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Command Like a Pro with New Backblaze B2 CLI Enhancements

Post Syndicated from Bala Krishna Gangisetty original https://www.backblaze.com/blog/command-like-a-pro-with-new-backblaze-b2-cli-enhancements/

An image of a computer monitor with the words B2 Command Line Interface Tool Version 4.1.0

The tools you use impact your efficiency, productivity, and the quality of your work. That’s true whether you’re a carpenter looking for the best saw blades, a chef choosing high-quality knives, or a developer or programmer investing in top-notch software. The B2 Command Line Interface (CLI) is one tool that you can use to interact with B2 Cloud Storage, and some recent improvements make it a more powerful, intuitive part of your arsenal. 

It’s been a while since our last blog about the Backblaze B2 Command Line Tool (B2 CLI for short). Today, we’re sharing more details on the key enhancements and new features as part of the B2 CLI version 4.1.0.

Let’s dive into the highlights of these changes and explore how they can elevate your B2 CLI experience.

User experience enhancements

1. A new nested command structure

Gone are the days of sifting through a long list of commands to find what you need. The B2 command structure has been revamped to be more intuitive and organized. With the new nested command structure, related commands are logically grouped together. The new structure looks like b2 <resource>, which makes it easier for you to locate and use the functionality you require. Whether you’re managing files, buckets, keys, or accounts, commands are now categorized in a way that aligns with their functions. The result is a clearer, more concise user experience.

An image listing the usage tags for the Backblaze B2 CLI
New command structure.

2. Streamlined ls and rm commands

Why use two when one will do? The b2 ls and b2 rm commands can now accept a single cohesive string, a B2 URI (e.g., b2://bucketName/path), instead of two separate positional arguments, giving you enhanced consistency and usability. This simplifies the command syntax and reduces the potential for errors by eliminating the chance of misplacing or mistyping one of the separate arguments. It also ensures that the bucket and file path are always correctly associated with each other, which helps to avoid the common mistakes that can occur with multiple arguments.

In addition, some commands, such as b2 file large parts, accept a B2 ID URI (e.g. b2id://4_zf1f51fb…), which specifies a file by its unique identifier (a.k.a. Fguid).

Some redundant commands have also been deprecated with the introduction of B2 URIs and B2 ID URIs. For example, the functionality of download-file-by-name and download-file-by-id is now available through the b2 file download b2://bucketname/path and b2 file download b2id://fileid commands, respectively.
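If your own scripts need to accept the same two URI forms, a small parser along these lines may help. It is an illustrative sketch and not the CLI’s own implementation; the B2Target type and the function name are hypothetical.

```python
from typing import NamedTuple, Optional


class B2Target(NamedTuple):
    kind: str               # "name" for b2:// URIs, "id" for b2id:// URIs
    bucket: Optional[str]
    path: Optional[str]
    file_id: Optional[str]


def parse_b2_uri(uri):
    """Split b2://bucket/path or b2id://fileid into its components."""
    scheme, separator, rest = uri.partition("://")
    if not separator or not rest:
        raise ValueError(f"not a B2 URI: {uri!r}")
    if scheme == "b2":
        bucket, _, path = rest.partition("/")
        if not bucket:
            raise ValueError(f"missing bucket name: {uri!r}")
        return B2Target("name", bucket, path, None)
    if scheme == "b2id":
        return B2Target("id", None, None, rest)
    raise ValueError(f"unsupported scheme: {scheme!r}")


print(parse_b2_uri("b2://bucketName/path"))
print(parse_b2_uri("b2id://fileid"))
```

Keeping the bucket and path in one string means a script passes a single value around, which is the same benefit the CLI change is after.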

3. Enhanced credential management

To enhance security and performance, the CLI will no longer persist credentials on disk if they are passed through B2_* environment variables (that is, B2_APPLICATION_KEY_ID and B2_APPLICATION_KEY). This reduces the risk of unauthorized access to your credentials and improves the overall security of your environment.

At the same time, it’s important that security is balanced with performance. To address this, you can still persist your credentials to the local cache and keep using it for better performance. You opt in explicitly by running the b2 account authorize command.

By eliminating the automatic persistence of credentials from environment variables and providing a clear method to manage local caching, you now have a balanced approach that keeps your data secure while ensuring efficient CLI operations.
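Scripts of your own can follow the same pattern: read the credentials from the environment at run time, keep them in memory, and never write them to disk. A minimal sketch, assuming the upper-case variable names B2_APPLICATION_KEY_ID and B2_APPLICATION_KEY; the function name is our own.

```python
import os


def credentials_from_env(environ=None):
    """Return (key_id, application_key) from the environment, or None if either is unset."""
    env = os.environ if environ is None else environ
    key_id = env.get("B2_APPLICATION_KEY_ID")
    application_key = env.get("B2_APPLICATION_KEY")
    if not key_id or not application_key:
        return None
    return key_id, application_key


if credentials_from_env() is None:
    print("B2 credentials are not set in the environment")
```

Passing a dictionary as the optional argument makes the function easy to exercise in tests without touching the real environment.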

4. Transition to kebab-case flags

Previously, CLI flags mixed camelCase and kebab-case styles, so users had to remember which style each option used as well as its name. Kebab-case, where words are separated by hyphens (e.g., --my-flag), offers a clearer and more straightforward way to read and interpret flags. We’ve transitioned all CLI flags to --kebab-case. This style not only enhances readability, making it easier to understand complex commands at a glance, but also makes flags easy to remember. It’s particularly beneficial when flags are composed of multiple words, as it reduces visual clutter and makes the flag names more accessible.
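If you maintain wrapper scripts that still build flags in camelCase, a small helper can normalize them. This is a sketch of the naming convention only: the flag names are hypothetical, and it does not check whether a given flag actually exists in the CLI.

```python
import re


def to_kebab_flag(flag):
    """Convert a camelCase flag such as --myFlag to kebab-case (--my-flag)."""
    prefix = "--" if flag.startswith("--") else ""
    name = flag[len(prefix):]
    # Insert a hyphen wherever a lower-case letter or digit is followed by an upper-case letter.
    kebab = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", name).lower()
    return prefix + kebab


print(to_kebab_flag("--myFlag"))   # --my-flag
print(to_kebab_flag("--my-flag"))  # --my-flag (already kebab-case, unchanged)
```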

5. Simplified listing with ls

Ever wondered how to list all your buckets in one go? Now, you can call b2 ls without any arguments to do this. Whether you’re managing multiple buckets or just need a quick overview of your entire bucket inventory, the ability to list all buckets with a single command saves you time and effort. The enhancement to the b2 ls command is all about making your life easier. (As an aside, it’s also the quickest way to check that Backblaze B2 is correctly configured and you’re using the right set of credentials.)

6. Handy aliases for common flags

Why go the long way when you can take shortcuts? You can now use -r as an alias for the --recursive argument and -q for the --quiet argument. These shortcuts make your command-line interactions quicker and more efficient, so you can get things done with fewer keystrokes.

7. Global quiet mode

The --quiet option is now available for all commands, allowing you to suppress all messages printed to stdout and stderr. This is particularly useful for scripting and automation, where you want to minimize output.

8. Autocomplete

This enhancement for the B2 CLI means that you no longer have to remember and type out lengthy command arguments or options manually. As you start typing a command, the CLI will provide you with suggestions for completing the command, options, and arguments based on the context of your input. This can save you significant time and help you avoid typos or incorrect entries.

New features to boost your productivity

In addition to the CLI enhancements, we’ve also recently announced a few new features and capabilities for Backblaze B2, including:

  • Event Notifications: Event Notifications helps you automate workflows and integrate Backblaze B2 with other tools and systems. You can now manage Event Notification rules through b2 bucket notification-rule commands directly from the CLI. The feature is available in public preview. If you’re interested, check out the announcement and sign up here.  
  • Unhide files with ease: Previously, reversing the hiding of a file could be cumbersome or require multiple steps. Now, restoring hidden files with the b2 file unhide command is as simple as it sounds. You only need to specify the file you want to unhide, and the command will handle the rest. This ensures that you can quickly and accurately restore file visibility without unnecessary complications. The command is useful whether you’ve hidden previous backup files and need to access them again, are reorganizing your storage or adjusting file visibility for different users, or have unintentionally hidden files and need to make them visible for auditing or review.
  • Custom file upload timestamps: You can now enable custom file upload timestamps on your account, enabling you to preserve original upload times for your files. This feature is ideal for maintaining accurate records for compliance and reporting, and it gives you greater control over the file metadata. If you’d like to enable the feature, please reach out to Backblaze Support.

In addition to the above highlights, we’ve implemented crucial fixes to improve the stability and reliability of the CLI. We’ve also made several improvements to our documentation, ensuring you have the guidance you need right at your fingertips.

Start using the new features today

The easier we can make your CLI experience, the easier your job becomes and the more you can get out of Backblaze B2. Install or upgrade the B2 CLI today to take advantage of all the new features.

As always, we value your feedback. If you have any thoughts or experiences to share as you start using the new enhancements and features, please let us know in the comments or submit feedback via our Product Portal. Your input is crucial in helping us continue to improve and innovate.

Happy coding, and enjoy the new B2 CLI offerings!

The post Command Like a Pro with New Backblaze B2 CLI Enhancements appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How to Build Your Own LLM with Backblaze B2 + Jupyter Notebook

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-build-your-own-llm-with-backblaze-b2-jupyter-notebook/

A decorative image showing a computer, a cloud, and a building.

Last month, Backblaze Principal Storage Cloud Storyteller, Andy Klein, and I presented a webinar, Leveraging your Cloud Storage Data in AI/ML Apps and Services, in which we explored the various ways AI/ML applications use and store data. In addition to sharing insights from our customers who leverage Backblaze B2 Cloud Object Storage for their AI/ML needs, we also demonstrated a simple AI use case: a retrieval-augmented generation (RAG) chatbot answering questions based on custom context downloaded from a private Backblaze B2 Bucket. After the webinar, I posted the demo source code to a GitHub repository: https://github.com/backblaze-b2-samples/ai-rag-examples

In this blog post, I’ll recap a little of the webinar, and explain how you can use the demo source code as a basis for your own experiments with RAG and large language models (LLMs).

But why, Pat?

You might be asking yourself, why build a chatbot when multiple online services, such as ChatGPT, are just a click away? It’s a good question. The main motivation is privacy. Suppose you want your chatbot to use confidential data in generating its responses. By running an LLM on your own machine, whether on-premises or on a virtual machine (VM) under your control in a public cloud provider’s infrastructure, you eliminate the risks of that data surfacing in response to a question from a user outside your organization.

In the webinar, I showed two Python apps running on a GPU-equipped VM at Vultr, one of Backblaze’s compute partners. The first app used the GPT4All Python SDK to create a very simple conversational chatbot running a local instance of a large language model (LLM), which it used in answering general questions. Here’s an example from the webinar:

Ask me a question: What were the causes of the First World War?

The causes of the First World War are complex and varied, but some key factors include nationalism, imperialism, militarism, alliances between nations, and the assassination of Archduke Franz Ferdinand. Nationalism led to tensions between ethnic groups and their respective governments, while imperialism saw countries competing for territories and resources around the world. Militarism also played a role as many European powers built up large armies and navies in preparation for potential conflicts. The complex web of alliances between nations meant that any conflict could quickly escalate into a larger war, and the assassination of Archduke Franz Ferdinand by a Serbian nationalist was the spark that ignited the powder keg of tensions already present in Europe at the time.

Now, I’m not a history scholar, but that looks like a pretty good answer to me! (History scholars, you are welcome to correct me.)

The second app used the LangChain framework to implement a more elaborate chatbot, again running on my own machine at Vultr, that used PDF data downloaded from a private bucket in Backblaze B2 as context for answering questions. As much as I love our webinar attendees, I didn’t want to share genuinely confidential data with them, so I used our Backblaze B2 Cloud Storage documentation as context. The chatbot was configured to use that context, and only that context, in answering questions. From the webinar:

Ask me a question about Backblaze B2: What's the difference between the master application key and a standard application key?

The master application key provides complete access to your account with all capabilities, access to all buckets, and has no file prefix restrictions or expiration. On the other hand, a standard application key is limited to the level of access that a user needs and can be specific to a bucket.

Ask me a question about Backblaze B2: What were the causes of the First World War?

The exact cause of the First World War is not mentioned in these documents.

The chatbot provides a comprehensive, accurate answer to the question on Backblaze application keys, but doesn’t answer the question on the causes of the First World War, since it was configured to use only the supplied context in generating its response.

During the webinar’s question-and-answer session, an attendee posed an excellent question: “Can you ask [the chatbot] follow-up questions where it can use previous discussions to build a proper answer based on content?” I responded, “Yes, absolutely; I’ll extend the demo to do exactly that before I post it to GitHub.” What follows are instructions for building a simple RAG chatbot, and then extending it to include message history.

Building a simple RAG chatbot

After the webinar, I rewrote both demo apps as Jupyter notebooks, which allowed me to add commentary to the code. I’ll provide you with edited highlights here, but you can find all of the details in the RAG demo notebook.

The first section of the notebook focuses on downloading PDF data from the private Backblaze B2 Bucket into a vector database, a storage mechanism particularly well suited for use with RAG. This process involves retrieving each PDF, splitting it into uniformly sized segments, and loading the segments into the database. The database stores each segment as a vector with many dimensions—we’re talking hundreds, or even thousands. The vector database can then vectorize a new piece of text—say a question from a user—and very quickly retrieve a list of matching segments.
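To make the split, vectorize, and retrieve steps concrete, here is a minimal pure-Python sketch. It is not the notebook's code: the function names are hypothetical, and a simple bag-of-words count stands in for the learned embedding model a real vector database would use.

```python
import math
from collections import Counter


def split_into_segments(text: str, words_per_segment: int = 50) -> list[str]:
    """Split text into segments of a fixed number of words (the last may be shorter)."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


def vectorize(text: str) -> Counter:
    """Toy 'embedding': word counts. Real systems use a learned embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, segments: list[str], k: int = 2) -> list[str]:
    """Return the k segments most similar to the question."""
    query = vectorize(question)
    ranked = sorted(
        segments,
        key=lambda seg: cosine_similarity(query, vectorize(seg)),
        reverse=True,
    )
    return ranked[:k]
```

A production vector database replaces the linear scan in retrieve with an index built for fast nearest-neighbor search, which is what makes retrieval quick even across many thousands of segments.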

Since this process can take significant time—about four minutes on my MacBook Pro M1 for the 225 PDF files I used, totaling 58MB of data—the notebook also shows you how to archive the resulting vector data to Backblaze B2 for safekeeping and retrieve it when running the chatbot later.

The vector database provides a “retriever” interface that takes a string as input, performs a similarity search on the vectors in the database, and outputs a list of matching documents. Given the vector database, it’s easy to obtain its retriever:

retriever = vectorstore.as_retriever()

The prompt template I used in the webinar provides the basic instructions for the LLM: use this context to answer the user’s question, and don’t go making things up!

prompt_template = """Use the following pieces of context to answer the question at the end. 
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
    {context}
    
    Question: {question}
    Helpful Answer:"""

from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

The RAG demo app creates a local instance of an LLM, using GPT4All with Nous Hermes 2 Mistral DPO, a fast chat-based model. Here’s an abbreviated version of the code:

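# LangChain's GPT4All wrapper (import path assumed; see the notebook for the exact import)
from langchain_community.llms import GPT4All

model = GPT4All(
    model='Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf',
    max_tokens=4096,
    device='gpu'
)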

LangChain, as its name suggests, allows you to combine these components into a chain that can accept the user’s question and generate a response.

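from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

chain = (
    # The retriever supplies the context; the question passes through unchanged
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)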

As mentioned above, the retriever takes the user’s question as input and returns a list of matching documents. The user’s question is also passed through the first step, and, in the second step, the prompt template combines the context with the user’s question to form the input to the LLM. If we were to peek inside the chain as it was processing the question about application keys, the prompt’s output would look something like this:

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

<Text of first matching document>

<Text of second matching document>

Question: What's the difference between the master application key and a standard application key?

Helpful Answer:

This is the basis of RAG: building an LLM prompt that contains the information required to generate an answer, then using the LLM to distill that prompt into an answer. The final step of the chain transforms the data structure emitted by the LLM into a simple string for display.
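The prompt-assembly step can be sketched without LangChain at all. This is an illustration only: build_prompt is a hypothetical helper, and it joins plain strings, whereas the real chain passes the retriever's document objects to the prompt template.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_prompt(documents: list[str], question: str) -> str:
    """Combine the retrieved document texts and the user's question into one prompt."""
    context = "\n\n".join(documents)
    return PROMPT_TEMPLATE.format(context=context, question=question)


if __name__ == "__main__":
    print(build_prompt(
        ["<Text of first matching document>", "<Text of second matching document>"],
        "What's the difference between the master application key "
        "and a standard application key?",
    ))
```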

Now that we have a chain, we can ask it a question. Again, abbreviated from the sample code:

question = 'What is the difference between the master application key and a standard application key?'
answer = chain.invoke(question)

Adding message history to the simple RAG chatbot

The first step of extending the chatbot is to give the LLM new instructions, similar to its previous prompt template, but including the message history:

prompt_template = """Use the following pieces of context and the message history to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
Context: {context}
    
History: {history}
    
Question: {question}

Helpful Answer:"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question", "history"]
)

The chain must be modified slightly to accommodate the message history:

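from operator import itemgetter

chain = (
    {
        # The input is now a dict, so pull out the question before retrieving
        "context": (
            itemgetter("question")
            | retriever
        ),
        "question": itemgetter("question"),
        "history": itemgetter("history")
    }
    | prompt
    | model
    | StrOutputParser()
)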

Now, we define a very simple in-memory message store that uses a session_id parameter to manage multiple simultaneous conversations:

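from langchain_core.chat_history import BaseChatMessageHistory, InMemoryChatMessageHistory

# Maps each session_id to its own in-memory message history
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    # Create the history on first use, then reuse it for later turns
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]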

LangChain provides a wrapper, RunnableWithMessageHistory, that combines the message store with the above chain to create a new chain with message history capability:

from langchain_core.runnables.history import RunnableWithMessageHistory

with_message_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
)
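Conceptually, the wrapper looks up the session's history, supplies it to the chain alongside the question, and records the new question and answer afterwards. The plain-Python sketch below imitates that flow without LangChain. The names histories, ask, and answer_fn are hypothetical, and the real wrapper's internals differ.

```python
from collections import defaultdict
from typing import Callable

# session_id -> list of (role, text) pairs; stands in for the message store
histories: dict[str, list[tuple[str, str]]] = defaultdict(list)


def ask(session_id: str, question: str,
        answer_fn: Callable[[str, str], str]) -> str:
    """Answer a question, giving answer_fn the prior turns for this session."""
    history = histories[session_id]
    history_text = "\n".join(f"{role}: {text}" for role, text in history)
    answer = answer_fn(question, history_text)
    # Record this turn so that the next question can build on it
    history.append(("Human", question))
    history.append(("AI", answer))
    return answer
```

Because each session_id has its own list, two conversations running at the same time never see each other's turns.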

Now we can feed a series of related questions into the new chain:

questions = [
    'What is the difference between the master application key and a standard application key?',
    'Which one would I use to work with a single bucket?',
    'Can you tell me anything more about this topic?'
]

for question in questions:
    print(f'\n{question}\n')
    answer = with_message_history.invoke(
        {"question": question},
        config={"configurable": {"session_id": "abc123"}},
    )
    print(f'{answer}\n')

I have to admit, I was pleasantly surprised by the results:

What is the difference between the master application key and a standard application key?

A master application key grants broad access privileges, while a standard application key is limited to the level of access that a user needs.

Which one would I use to work with a single bucket?

You would use a standard application key to work with a single bucket as it has limited access and only grants permissions needed for specific tasks, unlike the master application key which provides broad access privileges.

Can you tell me anything more about this topic?

Sure! The master application key is typically used by developers during development or testing phases to grant full access to all resources in a Backblaze B2 account, while the standard application key provides limited permissions and should be used for production environments where security is paramount.

Processing this series of questions on my MacBook Pro M1 with no GPU acceleration took three minutes and 25 seconds, and just 52 seconds with its 16-core GPU. For comparison, I spun up a VM at Ori, another Backblaze partner offering GPU VM instances, with an Nvidia L4 Tensor Core GPU and 24GB of VRAM. The only code change required was to set the LLM device to ‘cuda’ to select the Nvidia GPU. The Ori VM answered those same questions in just 18 seconds.
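For a sense of scale, the three timings can be turned into speedup factors relative to the CPU-only run. The short calculation below uses only the figures quoted above; the labels are mine.

```python
# Elapsed time for the three-question series, in seconds
timings_seconds = {
    "MacBook Pro M1, CPU only": 3 * 60 + 25,  # three minutes and 25 seconds
    "MacBook Pro M1, 16-core GPU": 52,
    "Ori VM, Nvidia L4": 18,
}

baseline = timings_seconds["MacBook Pro M1, CPU only"]
speedups = {name: baseline / secs for name, secs in timings_seconds.items()}

for name, secs in timings_seconds.items():
    print(f"{name}: {secs}s ({speedups[name]:.1f}x)")
```

That works out to roughly a 4x speedup from the M1's GPU and a little over 11x from the L4.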

An image of an Nvidia L4 Tensor Core GPU
The Nvidia L4 Tensor Core GPU: not much to look at, but crazy-fast AI inference!

Go forth and experiment

One of the reasons I refactored the demo apps was that notebooks allow an interactive, experimental approach. You can run the code in a cell, make a change, then re-run it to see the outcome. The RAG demo repository includes instructions for running the notebooks, and both the GPT4All and LangChain SDKs can run LLMs on machines with or without a GPU. Use the code as a starting point for your own exploration of AI, and let us know how you get on in the comments!

The post How to Build Your Own LLM with Backblaze B2 + Jupyter Notebook appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q2 2024

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2024/

A decorative image with the headline Q2 2024 Drive Stats.

As of the end of Q2 2024, Backblaze was monitoring 288,665 hard drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world. We removed from this analysis 3,789 boot drives, consisting of 2,923 SSDs and 866 hard drives. This leaves us with 284,876 hard drives under management to review for this report. We’ll review the annualized failure rates (AFRs) for Q2 2024 and the lifetime AFRs of the qualifying drive models, and we’ll also check out drive age versus failure rates over time. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard drive failure rates for Q2 2024

For our Q2 2024 quarterly analysis, we remove from consideration: drive models which did not have at least 100 drives in service at the end of the quarter, drive models which did not accumulate 10,000 or more drive days during the quarter, and individual drives which exceeded their manufacturer’s temperature specification during their lifetime. The removed pool totalled 490 drives, leaving us with 284,386 drives grouped into 29 drive models for our Q2 2024 analysis. 
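As a rough sketch of the arithmetic behind these reports, the snippet below applies the two model-level criteria and computes an annualized failure rate as failures divided by drive years (drive days / 365), expressed as a percentage. The class, the function names, and the sample figures are hypothetical, and the per-drive temperature exclusion is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    model: str
    drive_count: int   # drives in service at the end of the quarter
    drive_days: int    # drive days accumulated during the quarter
    failures: int      # failures during the quarter


def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR as a percentage: failures per drive year, times 100."""
    return failures / (drive_days / 365) * 100


def qualifies(stats: ModelStats, min_drives: int = 100,
              min_drive_days: int = 10_000) -> bool:
    """Apply the quarterly drive-count and drive-day thresholds."""
    return stats.drive_count >= min_drives and stats.drive_days >= min_drive_days


# Made-up sample data, for illustration only
samples = [
    ModelStats("EXAMPLE-A", drive_count=1_200, drive_days=109_500, failures=5),
    ModelStats("EXAMPLE-B", drive_count=60, drive_days=5_400, failures=1),
]

for stats in samples:
    if qualifies(stats):
        afr = annualized_failure_rate(stats.failures, stats.drive_days)
        print(f"{stats.model}: AFR {afr:.2f}%")
```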

The table below lists the AFRs and related data for these drive models. The table is sorted large to small by drive size then by AFR within drive size.

Notes and observations on the Q2 2024 Drive Stats

  • Upward AFR: The AFR for Q2 2024 was 1.71%. That’s up from Q1 2024 at 1.41%, but down from one year ago (Q2 2023) at 2.28%. While the quarter over quarter increase was a bit surprising, quarterly fluctuations in AFR are expected. Sixteen drive models had an AFR of 1.71% or below while 13 drive models had an AFR above.
  • Two good zeroes: In Q2 2024, two drive models had zero failures, a 14TB Seagate (model: ST14000NM000J) and a 16TB Seagate (model: ST16000NM002J). Both have a relatively small number of drives and drive days for the quarter, so their success is somewhat muted, but the 16TB Seagate drive model has a very respectable 0.57% lifetime failure rate.
  • Another GOAT is gone: In Q1, we migrated the last of our 4TB Toshiba drives. In Q2, we migrated the last of our 6TB drives, including all of the Seagate 6TB drives which had reached an average age of nine years (108 months). This Seagate drive model closed out its career at Backblaze with an impressive 0.86% lifetime AFR.

    Currently the 4TB Seagate (model: ST4000DM000) is our oldest data drive model in production at an average age of 99.5 months. The data on these drives is scheduled to be migrated over the next quarter or two using CVT, our in-house drive migration system. They’ll never reach nine years of service. 

  • The 10-Year Club: With the 6TB Seagate drives being migrated as they hit 10 years of service, we wondered: What is the oldest data drive in service? The answer: a 4TB HGST drive (model: HMS5C4040ALE640) with 9 years, 11 months, and 23 days of service as of the end of Q2. Alas, the Backblaze Vault in which this drive resides is now being migrated, as are many other drives with over nine years of service. We’ll check back next quarter to see if any of them made it to the 10-Year Club before they are retired.

    While there are no data drives with 10 years of service, there are 11 HDD boot drives that exceed the mark. In fact one, a 500GB WD drive (model: WD5000BPKT) has over 11 years of service. (Psst, don’t tell the CVT team.)

  • An HGST surprise: Over the years, the HGST drive models we have used performed very well. So, when the 12TB HGST (model: HUH721212ALN604) drive showed up with a 7.17% AFR for Q2, it’s news. Such uncharacteristic quarterly failure rates for this model actually go back about a year, although the 7.17% AFR is the largest quarterly value to date. As a result, the lifetime AFR has risen from 0.99% to 1.57% over the last year. While the lifetime AFR is not alarming, we are paying attention to this trend.

Lifetime hard drive failure rates

As of the end of Q2 2024, we were tracking 284,876 operational hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q2 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 283,065 drives grouped into 25 models remaining for analysis as shown in the table below.

Age, AFR, and snakes

One of the truisms in our business is that different drive models fail at different rates. Our goal is to develop a failure profile for a given drive model over time. Such a profile can help optimize our drive replacement and migration strategies, and ultimately maintains the durability of our cloud storage service.

For our cohort of data drives, we’ll look at the changes in the lifetime AFR over time for drive models with at least one million drive days as of the end of Q2 2024. This gives us 23 drive models to review. We’ll divide the drive models into two groups: those whose average age is five years (60 months) or less, and those whose average age is above 60 months. Why that cutoff? That’s the typical warranty period for enterprise class hard drives. 

Let’s start by plotting the current lifetime AFR for the 14 drive models that have an average age of 60 months or less as shown in the chart below.

Let’s review the drive models by characterizing the four quadrants as follows:

  • Quadrant I: Drive models in this quadrant are performing well, and have a respectable AFR of less than 1.5%. Drive models to the right in this quadrant might require a little more attention over the coming months than those to the left.
  • Quadrant II: These drive models have failure rates above 1.5%, but are still reasonable at around 2% lifetime AFR. What is important is that AFR does not increase significantly over time.
  • Quadrant III: There are no drives currently in this quadrant, but if there were it would not be a cause for alarm. Why? Some drive models experience higher rates of failure early on, and then following the bathtub curve, their AFR drops as they get older. 
  • Quadrant IV: These drive models are just starting out and are just beginning to establish their failure profile, which at the moment is good.

At a glance, the chart tells us that everything seems fine. The drives in Quadrant I are performing well, the two drives in Quadrant II could be better, but are still acceptable, and there are no surprises in the newer drive models to this point. Let’s see how things fare for the drive models which have an average age of over 60 months as in the chart below.

There are nine drive models which fit the average age criteria, including the Seagate 6TB drive (in yellow) whose drives were removed from service in Q2. As you can see the drive models are spread out across all four quadrants. As before, Quadrant I contains good drives, Quadrants II and III are drives we need to worry about, and Quadrant IV models look good so far. 

If we were to stop here we could decide for example that the 4TB Seagate drives are first in line for the CVT migration process, but not so fast. All of these drive models have been around for at least five years and we have their failure rates over time. So, rather than rely on just a point in time, let’s look at their change in failure rates over time in the chart below.

The snake chart, as we’re calling it, shows the lifetime failure rate of each drive model over time. We started at 24 months to make the chart less messy. Regardless, the drive models sort themselves out into either Quadrant I or II once their average age passes 60 months. Let’s take a look at the drives in each of those quadrants.

  • Quadrant I: Five of the nine drive models are in Quadrant I as of Q2 2024. The two 4TB HGST drives (brown and purple lines) as well as the 6TB Seagate (red line) have nearly vertical lines indicating their failure rates have been consistent over time, especially after 60 months of service. Such demonstrated consistency over time is a failure profile we like to see. 

    The failure profiles of the 8TB Seagate (blue line) and the 8TB HGST (gray line) are less consistent, with each model’s failure rate increasing as it aged. In the case of the HGST drive, the lifetime AFR rose from about 0.5% to 1.0% over an 18-month period starting at 48 months before leveling out. The Seagate drive took about two years starting at 60 months to go from 1.0% to nearly 1.5% before leveling out.

  • Quadrant II: The remaining 4 drive models ended in this quadrant. Three of the models, the 8TB Seagate (yellow line), the 10TB Seagate (green line), and the 12TB HGST (teal line) have similar failure profiles. All three got to some point in their lifetime and their curve began bending to the right. In other words, their failure rates over time accelerated. While the 8TB Seagate (yellow) shows some signs of leveling off, all three models will be closely watched and replaced if this trend continues.

    Also in Quadrant II is the 4TB Seagate drive (black line). This drive model is aggressively being migrated and is being replaced by 16TB and larger drives via the CVT process. As such, it is hard to tell if the nearly vertical failure profile is a function of the replacement process or the drive model failure rate leveling out over time. Either way, the migration of this drive model is expected to be complete in the next quarter or two.
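One way to produce the points along a snake-chart line is to recompute the lifetime AFR at the end of each quarter from the running totals of drive days and failures. The sketch below assumes that the lifetime AFR is cumulative failures over cumulative drive years. The function name and the sample figures are hypothetical.

```python
from itertools import accumulate


def lifetime_afr_series(quarterly: list[tuple[int, int]]) -> list[float]:
    """Lifetime AFR (%) after each quarter.

    quarterly holds (drive_days, failures) per quarter, oldest first,
    and is assumed to have positive drive days in every quarter.
    """
    cumulative_days = accumulate(days for days, _ in quarterly)
    cumulative_failures = accumulate(failures for _, failures in quarterly)
    return [
        failures / (days / 365) * 100
        for days, failures in zip(cumulative_days, cumulative_failures)
    ]


# Made-up data: failures rise each quarter, so the lifetime AFR drifts upward
sample = [(91_250, 2), (91_250, 3), (91_250, 5)]
print([round(afr, 2) for afr in lifetime_afr_series(sample)])
```

Because each point folds in all earlier history, a model's line moves slowly. A visible bend therefore indicates a sustained change in failures rather than one bad quarter.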

A normal failure profile

If we had to pick one of the drive models to represent a normal failure profile, it would be the 8TB Seagate (blue line, model: ST8000DM002). Why? The failure rate for the first 60 months was consistently around 1.0%, Seagate’s predicted AFR. After 60 months, the AFR increased as the drive aged as one would expect. You might have thought we’d choose the failure profile of one of the two 4TB HGST drive models (brown and purple lines). The “trouble” is their failure rates are well below any published AFR by any drive manufacturer. While that’s great for us, their annualized failure rates over time are sadly not normal.

Can AI help?

The idea of using AI/ML techniques to predict drive failure has been around for several years, but as a first step let’s see if predicting drive failure is even an AI-worthy problem. We recently conducted a webinar “Leveraging Your Cloud Storage Data in AI/ML Apps and Services” in which we outlined general criteria to be used in evaluating if AI/ML is needed to solve a given problem, in this case predicting drive failure. The most salient criterion that applies here is that AI is best used for a problem for which you cannot consistently apply a set of rules to solve the problem. 

A model is trained by taking the source data and applying an algorithm to iteratively combine and weigh multiple factors. The output is a model which can be used to answer questions about the model’s subject matter, in this case drive failure. For example, we train a model using the Drive Stats data for a given drive model for the last year. Then, we ask the model a question using drive Z’s daily SMART stats and related information. We use this data as input to the model, and while there is no exact match, the model will use inference to develop a response of the probability of drive failure for drive Z over time. As such, it would seem that drive failure prediction would be a good candidate for using AI.

What’s not clear is whether what is learned about one drive model can be applied to another drive model. One look at the snake chart above visualizes the issue as the failure profile for each drive model is different, sometimes radically different. For example, do you think you could train a model on the 4TB Seagate drives (black line) and use it to predict drive failures for either of the 4TB HGST drive models (purple and brown lines)? The answer may be yes, but it certainly doesn’t seem likely. 

All that said, several research papers and studies have been published over the years attempting to determine whether or not AI/ML can be used to make drive failure predictions. We’ll be doing a review of these publications in the next couple of months and hopefully shed some light on the ability to use AI to accurately make drive failure predictions in a timely manner.

The Hard Drive Stats data

It has now been over 11 years since we began recording, storing, and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored. 

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.

The post Backblaze Drive Stats for Q2 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Proper Address: IPv4 vs. IPv6

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/proper-address-ipv4-vs-ipv6/

A decorative image showing a cloud over performance graphs and charts.

Ah, the 1980s. It brought us such classics as Ghostbusters, The Princess Bride, Tina Turner’s triumphant comeback, Pac-Man, and the original Apple Macintosh. Also, it gave us the birth of the internet, in which we figured out how to make all our computers one giant, powerful network held together initially by internet protocols (IPs) and, eventually, by a mutual love of cat videos.

Now, each of our devices that connects to the internet requires a way to find and send information back and forth, which means it needs an IP address. Most folks don’t type IP addresses into their search bar though. We use domain names (for example, www.backblaze.com). Which IP addresses correspond to which domain names is stored in a hierarchical and distributed database system known as the domain name system (DNS), which is also an internet protocol. 

Today, let’s talk about IP addresses: What are IPv4 and IPv6, why is IPv6 necessary, and what impact will it have on networking?

Let’s set the scene

Any time you’re sending and receiving data, be it a letter in the mail, dialing a phone number, or loading a website, you’ve got to have an identifiable address to reach the proper person and/or device. What all of these types of addresses have in common is that as our population has exploded, we’ve had to re-work how addresses work in order to include more possible data locations. U.S. zip codes were established in 1963. Area codes were established in 1947, and a great expansion was necessary only three(ish) decades later, and that plan was implemented starting in the late 1980s and ending in the mid ’90s.

IP addresses, meanwhile, have been operating on the first and only protocol we introduced in the 1980s, called IPv4. Not only has the world population almost doubled since then, but there has also been a nonlinear explosion in internet-connected devices per person. When IP addresses were first invented, it was unfathomable that most folks would be walking around with a computer in their pocket, remotely checking who’s ringing their doorbells while adjusting their thermostat in anticipation of returning home. All of those internet-connected devices use an IP address, in one way or another. 

So, it’s no surprise that we’re now seeing an adoption of a new IP address standard. In keeping with tradition, the versions aren’t sequential: Right now we’re jumping from IPv4 to IPv6. (What happened to IPv5? It was skipped, sort of.)

What is IPv4?

IPv4 is an internet protocol that assigns addresses to devices. It uses a 32-bit address, represented by four numbers (octets), each between 0 and 255, separated by dots (e.g., 192.168.1.100), and uses decimal notation. 
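To see how four octets pack into a single 32-bit value, here is a small sketch using Python's standard ipaddress module. The dotted_to_int helper is a hypothetical name for illustration.

```python
import ipaddress


def dotted_to_int(address: str) -> int:
    """Pack the four octets of a dotted-decimal IPv4 address into one integer."""
    value = 0
    for octet in address.split("."):
        value = (value << 8) | int(octet)  # shift left 8 bits, add the next octet
    return value


example = "192.168.1.100"
packed = dotted_to_int(example)

# The standard library arrives at the same number
assert packed == int(ipaddress.IPv4Address(example))

print(packed)            # the address as a single integer
print(f"{packed:032b}")  # the same value as 32 bits
```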

Remember that each bit represents one of two possible values, a 0 or a 1. So, for a 32-bit value, there are 2^32 possible addresses, or 4,294,967,296 IP addresses total. Several IPv4 address blocks were also reserved for private networks and multicast addresses, about 286 million total. Between the two reserved blocks of addresses, that’s about 7% of the total addresses in existence.
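The numbers above can be checked with Python's ipaddress module. This sketch assumes the reserved blocks in question are the three RFC 1918 private ranges plus the 224.0.0.0/4 multicast range, since their sizes add up to roughly 286 million.

```python
import ipaddress

total_addresses = 2 ** 32

# Assumed reserved blocks: RFC 1918 private ranges and the multicast range
private_blocks = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]
multicast_blocks = ["224.0.0.0/4"]

reserved = sum(
    ipaddress.ip_network(block).num_addresses
    for block in private_blocks + multicast_blocks
)
reserved_percent = reserved / total_addresses * 100

print(f"Total IPv4 addresses: {total_addresses:,}")
print(f"Private + multicast:  {reserved:,} ({reserved_percent:.2f}%)")
```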

What is IPv6?

IPv6 uses a 128-bit address, represented by a longer string of numbers and letters (e.g., 2001:0db8:85a3:0000:0000:8a2e:0370:7334) in hexadecimal code, aka hex code. If you’ve ever designed a MySpace page (hi, Tom!) or a webpage, you’re likely familiar with the hex codes used to identify precise colors.

Doing the math as we did above, there are 2^128 possible IPv6 addresses, which is 340 undecillion. (That’s the 11th order of magnitude if you’re going, million, billion, trillion, and so on.) And, just like IPv4, there are some reserved addresses, but they represent such a comparatively smaller number of total available addresses that it’s not even worth calculating a percentage. 
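For a sense of scale, the same arithmetic can be run in Python. This is a small sketch using the standard `ipaddress` module; it also shows the shorthand form of the example address, since IPv6 notation lets you drop leading zeros and collapse a run of all-zero groups.

```python
import ipaddress

TOTAL_IPV6 = 2 ** 128
print(f"{TOTAL_IPV6:,}")    # 340,282,366,920,938,463,463,374,607,431,768,211,456
print(f"{TOTAL_IPV6:.1e}")  # 3.4e+38

# How many complete IPv4 address spaces fit inside the IPv6 space.
ipv4_spaces = TOTAL_IPV6 // 2 ** 32  # equals 2 ** 96

addr = ipaddress.IPv6Address("2001:0db8:85a3:0000:0000:8a2e:0370:7334")
print(addr.compressed)  # 2001:db8:85a3::8a2e:370:7334
print(addr.exploded)    # 2001:0db8:85a3:0000:0000:8a2e:0370:7334
```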

Woah, how have we been surviving in the meantime?

We mentioned above that we’ve known we’re running out of IP addresses for a while. But, important detail: There was evidence of the problem as early as 1981, and mitigation efforts were enacted by 1992. Before we get into what mitigation strategies have been used over the years, a bit of a refinement of the above information: IP addresses consist of two main parts, one that identifies the network (or, sometimes, the subnet) and one that identifies the host, or the destination on that network. (That’s true of both IPv4 and IPv6.)

Classful networking

In the original iteration of IPv4, the bits that identified the subnet were fixed, and that meant a lot of wasted space. In 1981, we implemented classful networking. Instead of keeping a fixed number of bits to identify a network, the three most significant bits identified the size of the network prefix, and that sent you to different classes. That meant that existing addresses didn’t have to change. Here’s a handy table:

Class | Most significant bits | Network prefix size (bits) | Host identifier size (bits) | Address range | Maximum number of networks | Maximum number of hosts per network
A | 0 | 8 | 24 | 0.0.0.0–127.255.255.255 | 128 | 16,777,216
B | 10 | 16 | 16 | 128.0.0.0–191.255.255.255 | 16,384 | 65,536
C | 110 | 24 | 8 | 192.0.0.0–223.255.255.255 | 2,097,152 | 256
D (multicast) | 1110 | n/a | n/a | 224.0.0.0–239.255.255.255 | n/a | n/a
E (reserved) | 1111 | n/a | n/a | 240.0.0.0–255.255.255.255 | n/a | n/a
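The class lookup in the table can be sketched in a few lines of Python. The function name `ipv4_class` is ours, and the sketch only inspects the leading bits of the first octet, which is all classful addressing needed.

```python
def ipv4_class(address: str) -> str:
    """Return the historical address class (A-E) from the leading bits."""
    first_octet = int(address.split(".")[0])
    if not 0 <= first_octet <= 255:
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    if first_octet >> 7 == 0b0:     # leading bit 0
        return "A"
    if first_octet >> 6 == 0b10:    # leading bits 10
        return "B"
    if first_octet >> 5 == 0b110:   # leading bits 110
        return "C"
    if first_octet >> 4 == 0b1110:  # leading bits 1110
        return "D"
    return "E"                      # leading bits 1111


# Network prefix sizes for the three classes that carried host addresses.
PREFIX_BITS = {"A": 8, "B": 16, "C": 24}

for sample in ("10.0.0.1", "172.16.0.1", "192.168.1.100"):
    cls = ipv4_class(sample)
    hosts = 2 ** (32 - PREFIX_BITS[cls])
    print(sample, cls, f"{hosts:,}")
```

Because the class is baked into the address itself, a router never needed a separate record of how large each network was. That is also what made the scheme so rigid.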

All that sounds a bit like gobbledygook. An analogy: You live in a city that wants to improve mail delivery, so it’s introduced the option to choose from a small, medium, or large mailbox. The sizes are actually pretty disproportionate—the small is about the size of a toaster, whereas the medium is the size of a kitchen trash can. (And large is the size of your car. Who gets that much mail?) No matter which size mailbox you (or your neighbor) choose, your physical address didn’t change when this system was implemented. You usually get more mail than the toaster would accommodate, but never even come close to filling your trash can-sized mailbox. So, that extra space just sits empty and unused, never fulfilling its mail volume potential.

Note that classful networking is now largely defunct, replaced by…  

Classless inter-domain routing (CIDR)

The biggest issue of the above system was its inflexibility. Adding classes gave us more flexibility than the original design, but you were still restricted to 8, 16, or 24 bits to identify the network. That means you can end up with a lot of unused IP addresses, as indicated by our above analogy. Here’s the math behind why: 

The number of addresses available on a network shrinks as you use more bits to define it: every bit added to the network prefix halves the number of hosts. So, in a 32-bit address, if you use 24 bits to define the network, you have 8 bits left over to define the host. That’s our Class C network, which contained 2^8 (256) IP addresses, not enough for most use cases. And, the next size up, Class B, represented 2^16 IP addresses (65,536 total), which most organizations could not use efficiently. After DNS became the norm, it became clear that classful networking wasn’t scalable, and thus CIDR rose to prominence.

CIDR is based on variable-length subnet masking (VLSM), which lets each network be divided into subnetworks of various power-of-two sizes. This method optimizes the allocation of IPv4 addresses by allowing for more flexible address blocks. 
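Here is a short sketch of what that flexibility looks like, again using Python's `ipaddress` module. The 10.20.0.0/22 block is an arbitrary example we picked; a /22 prefix sits between the old Class C and Class B sizes, which classful networking could not express.

```python
import ipaddress

block = ipaddress.ip_network("10.20.0.0/22")
print(block.num_addresses)  # 1024
print(block.netmask)        # 255.255.252.0

# Split the block evenly into four /24 networks...
even_split = list(block.subnets(new_prefix=24))
print([str(net) for net in even_split])

# ...or unevenly (the variable-length part): one /23 and two /24s.
first_half, second_half = block.subnets(prefixlen_diff=1)
uneven_split = [first_half, *second_half.subnets(prefixlen_diff=1)]
print([str(net) for net in uneven_split])

# Membership is a simple prefix match.
print(ipaddress.ip_address("10.20.2.77") in block)  # True
```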

Using our analogy, instead of assigning mailbox size based on household size, you might just have a system in which folks walk up to the post office and find their name on a list associated with a mailbox. If someone has more or less mail that month, then they can be assigned the properly sized mailbox. 

Network address translation (NAT)

NAT allows multiple devices to share a single public IPv4 address by modifying the IP header when it’s in transit. This is super useful when you’re talking about private networks—you can assign a single IP address to multiple devices. For example, if you have several internet of things (IoT) devices in your home, they can all appear to the public network as one IP address, and your local network can figure out what traffic goes where. It also makes it so that if a network moves, the host doesn’t necessarily have to be assigned a new IP address, such as if an internet provider like Cox decides to stop doing business in your region, and Spectrum takes over their IP address allocation—though likely they’d just change your public IP address in that specific scenario.

In our mail analogy, NAT is like those group mailboxes you see in rural areas, apartment buildings, or in neighborhoods. Everyone in the same location gets their mail delivered to the same physical address, and your box number is used to further identify your house within the group mailbox. 
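To see how the "box number" part works, here is a toy model of a NAT translation table in Python. It is a simplified sketch, not how any particular router implements NAT: the `ToyNat` class, the starting port of 40000, and the example addresses are all invented for illustration.

```python
class ToyNat:
    """Maps (private IP, private port) pairs onto ports of one public IP."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._next_port = first_port
        self._outbound = {}  # (private_ip, private_port) -> public_port
        self._inbound = {}   # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip: str, private_port: int):
        """Rewrite the source of an outgoing packet to the shared public address."""
        key = (private_ip, private_port)
        if key not in self._outbound:
            self._outbound[key] = self._next_port
            self._inbound[self._next_port] = key
            self._next_port += 1
        return self.public_ip, self._outbound[key]

    def translate_inbound(self, public_port: int):
        """Find which private device a reply on this public port belongs to."""
        return self._inbound.get(public_port)


nat = ToyNat("203.0.113.5")
print(nat.translate_outbound("192.168.1.10", 51000))  # ('203.0.113.5', 40000)
print(nat.translate_outbound("192.168.1.11", 51000))  # ('203.0.113.5', 40001)
print(nat.translate_inbound(40001))                   # ('192.168.1.11', 51000)
```

Both devices appear to the outside world as 203.0.113.5; only the port number tells the router which one should get the reply.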

The secondary market of IP addresses

If we can learn anything from the above workarounds, flexibility and possibility are key. So, it’s unsurprising to know that a secondary market has cropped up, introducing things like address recycling, address trading, and address leasing. IPv6 will solve the scarcity issue—but what else can it do?

What are the benefits of IPv6?

So far we’ve talked about the primary benefit of IPv6: more IP addresses, which we clearly need. But, there are other benefits as well. Here’s a summary:

Improved Efficiency

  • Simpler header: The IPv6 header is simpler than IPv4’s, leading to faster packet processing and reduced overhead.
  • Efficient routing: IPv6’s design allows for more efficient routing, potentially reducing latency and improving network performance. Arguably, most folks won’t see a huge performance improvement unless they reconfigure their own network architecture, but the possibility is there. 
  • Autoconfiguration: IPv6 supports automatic configuration of network interfaces, simplifying setup and reducing administrative overhead.

Enhanced Security

  • Built-in security features: IPv6 offers built-in security mechanisms like IPsec, potentially providing better protection against attacks. In practice, IPsec often isn’t used, as most encryption is handled by Transport Layer Security (TLS), which sits above the IP layer.

Quality of Service (QoS)

  • Improved QoS: IPv6 provides better support for QoS, allowing for prioritization of different types of traffic, ensuring a better user experience for applications like video conferencing and online gaming.

Other Benefits

  • Reduced reliance on NAT: IPv6 reduces the need for NAT, simplifying network configurations and improving end-to-end connectivity.
  • Support for new services: IPv6 is better suited for emerging technologies and applications that require a large number of addresses and advanced features.

What’s next? Will we run out again?

Given the number of addresses for IPv4 vs. IPv6 (4.3 billion vs. 340 undecillion, respectively), you can understand how we might have needed to shore up our IPv4 addresses. Honestly, if you assume one device per person, we already outnumber IPv4 addresses—in fact, we outnumbered IP addresses in the 1970s, before IPv4 was even invented! You shouldn’t assume one device per person, by the way. Many countries with widespread broadband access have several devices per person: in the U.S., Consumer Affairs was reporting 21 per U.S. household in 2023, and the average U.S. household for that same year was 2.51 people. Globally, that same source reports 3.6 internet-connected devices per person.
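As a back-of-the-envelope check on those figures, here is a tiny Python sketch. The inputs are the numbers quoted above; the "people covered" figure is our own rough derivation and assumes every device needs its own public IPv4 address, which NAT means it usually doesn't.

```python
IPV4_ADDRESSES = 2 ** 32
DEVICES_PER_US_HOUSEHOLD = 21
PEOPLE_PER_US_HOUSEHOLD = 2.51
GLOBAL_DEVICES_PER_PERSON = 3.6

# Devices per person in the U.S., from the household figures.
us_devices_per_person = DEVICES_PER_US_HOUSEHOLD / PEOPLE_PER_US_HOUSEHOLD
print(f"{us_devices_per_person:.1f}")  # 8.4

# How many people IPv4 could serve at the global devices-per-person rate.
people_covered = IPV4_ADDRESSES / GLOBAL_DEVICES_PER_PERSON
print(f"{people_covered / 1e9:.2f} billion")  # 1.19 billion
```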

Changes like this can certainly be disruptive, but the good news on that front is that most devices will be dual-stacked for quite a while. That means that you’ll have both versions of an IP address, and this change can roll out organically (so to speak). In the end, we’ll have a better-performing internet, ready to grow with us for the foreseeable future.

The post Proper Address: IPv4 vs. IPv6 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Container Orchestration: Managing Applications at Scale

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/container-orchestration-managing-applications-at-scale/

A decorative image showing containers stacked in a pattern.

The use of containers for software deployment has emerged as a powerful method for packaging applications and their dependencies into single, portable units. Containers enable developers to create, deploy, and run applications consistently across various environments. However, as containerized applications grow in scale and complexity, efficiently deploying, managing, and terminating containers can become a challenging task.

The growing need for streamlined container management has led to the rise of container orchestration—an automated approach to deploying, scaling, and managing containerized applications. Because it simplifies the management of large-scale, dynamic container environments, container orchestration has become a crucial component in modern application development and deployment. 

In this blog post, we’ll explore what container orchestration is, how it works, its benefits, and the leading tools that make it possible. Whether you are new to using containers or looking to optimize your existing strategy, this guide will provide insights that you can leverage for more efficient and scalable application deployment. 

What are containers?

Before containers, developers often faced the “it works on my machine” problem, where an application would run perfectly on a developer’s computer but fail in other environments due to differences in operating systems (OS), dependencies, or configuration. 

Containers solve this problem by packaging applications with all their dependencies into single, portable units, improving consistency across different environments. This greatly reduces the compatibility issues and simplifies the deployment process. 

As lightweight software packages, containers include everything needed to run an application such as code, runtime environment, system tools, libraries, binaries, settings, and so on. They run on top of the host OS, sharing the same OS kernel, and can run anywhere—on a laptop, server, in the cloud, etc. On top of that, containers remain isolated from each other, and they are more lightweight and efficient than virtual machines (VMs), which require a full OS for each instance. Check out our article to learn more about the difference between containers and VMs.

Containers provide consistent environments, higher resource efficiency, faster startup times, and portability. They differ from VMs in that they share the host OS kernel. While VMs virtualize hardware for strong isolation, containers isolate at the process level. By solving the longstanding issues of environment consistency and resource efficiency, containers have become an essential tool in modern application development. 

What is container orchestration?

As container adoption has grown, developers have encountered new challenges that highlight the need for container orchestration. While containers simplify application deployment by ensuring consistency across environments, managing containers at scale introduces complexities that manual processes can’t handle efficiently, such as:

  1. Scalability: In a production environment, applications often require hundreds or thousands of containers running simultaneously. Manually managing such a large number of containers becomes impractical and error-prone. 
  2. Resource management: Efficiently utilizing resources across multiple containers is critical. Manual resource allocation leads to underutilization or overloading of hardware, negatively impacting performance and cost-effectiveness. 
  3. Container failure management: In dynamic environments, containers can fail or become unresponsive. Developers need a way to create a self-healing environment, in which failed containers are automatically detected, then recover without manual intervention to ensure high availability and reliability. 
  4. Rolling updates: Deploying updates to applications without downtime and the ability to quickly roll back in case of issues are crucial for maintaining service continuity. Manual updates can be risky and cumbersome. 

Container orchestration automates the deployment, scaling, and management of containers, addressing the complexities that arise in large-scale, dynamic application environments. It ensures that applications run smoothly and efficiently, enabling developers to focus on building features rather than managing infrastructure. Container orchestration tools provide various features such as automated scheduling, self-healing, load balancing, and resource optimization to deploy and manage applications more effectively to ensure reliability, performance, and scalability. 
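The core idea behind self-healing and scaling is a control loop that keeps comparing what should be running with what is running. Here is a deliberately tiny Python sketch of that loop. The `ToyOrchestrator` class and its container names are hypothetical, and real orchestrators are far more involved.

```python
class ToyOrchestrator:
    """Keeps the number of running containers equal to a desired count."""

    def __init__(self, desired_replicas: int):
        self.desired_replicas = desired_replicas
        self.containers = {}  # name -> "running" or "failed"
        self._next_id = 1

    def _start_container(self) -> str:
        name = f"app-{self._next_id}"
        self._next_id += 1
        self.containers[name] = "running"
        return name

    def reconcile(self) -> list:
        """One pass of the control loop; returns the actions it took."""
        actions = []
        # Self-healing: drop containers that have failed.
        for name, status in list(self.containers.items()):
            if status == "failed":
                del self.containers[name]
                actions.append(f"removed {name}")
        # Scale up until the desired count is met.
        while len(self.containers) < self.desired_replicas:
            actions.append(f"started {self._start_container()}")
        # Scale down, stopping the newest containers first.
        while len(self.containers) > self.desired_replicas:
            name = list(self.containers)[-1]
            del self.containers[name]
            actions.append(f"stopped {name}")
        return actions


cluster = ToyOrchestrator(desired_replicas=3)
print(cluster.reconcile())  # ['started app-1', 'started app-2', 'started app-3']
cluster.containers["app-2"] = "failed"
print(cluster.reconcile())  # ['removed app-2', 'started app-4']
```

The operator only declares the desired state (three replicas); the loop works out which actions are needed to get there.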

What are the benefits of container orchestration?

Container orchestration offers many different advantages that streamline the deployment and management of containerized applications. We’ve touched on a few of them, but here’s a concise list: 

  • Improved resource utilization: Orchestration tools can efficiently pack containers onto hosts, maximizing hardware usage. 
  • Enhanced scalability: Easily scale applications up or down to meet changing demands. 
  • Increased reliability: Automatic health checks and container replacement ensure high availability. 
  • Simplified management: Centralized control and automation reduce the complexity of managing large-scale containerized applications. 
  • Faster deployments: Orchestrators enable rapid and consistent deployments across different environments. 
  • Cost efficiency: Better resource utilization and automation lead to cost savings.

How does container orchestration work?

Now that we understand what container orchestration is, let’s take a look at how container orchestration works using the example of Kubernetes, one of the most popular container orchestration platforms. 

In the above diagram, we see an example of container orchestration in action. The system is divided into two main sections: the control plane and the worker nodes. 

Control plane

The control plane is the brain of the container orchestration system. It manages the entire system, ensuring that the desired state of the applications is maintained. Key components of the control plane include:

  • Configuration store (etcd): A distributed key-value store that holds all the cluster data, such as the configuration and state information. Think of it as a central database for the cluster. 
  • API server: The front-end of the control plane, exposing the orchestration API. It handles all the communication within the cluster and with external clients. 
  • Scheduler: Assigns workloads to nodes based on resource availability and scheduling policies, ensuring efficient resource utilization. 
  • Controller manager: Runs various controllers that handle routine tasks to maintain the cluster’s desired state. 
  • Cloud controller manager: Interacts with cloud provider APIs to manage cloud-specific resources, integrating the cluster with cloud infrastructure.
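Of the components above, the scheduler is the easiest to picture in code. The following Python sketch is a simplified stand-in, not the Kubernetes scheduling algorithm: it filters out nodes that lack capacity, then picks the one that would have the most resources left. The `Node` and `Workload` types and the scoring rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_free: float     # free CPU cores
    mem_free_gb: float  # free memory in GB


@dataclass
class Workload:
    name: str
    cpu: float
    mem_gb: float


def schedule(workload: Workload, nodes: List[Node]) -> Optional[str]:
    """Place a workload on a node and return the node name, or None if nothing fits."""
    # Filter: keep only nodes with enough free CPU and memory.
    candidates = [
        n for n in nodes
        if n.cpu_free >= workload.cpu and n.mem_free_gb >= workload.mem_gb
    ]
    if not candidates:
        return None
    # Score: prefer the node with the most headroom after placement.
    best = max(
        candidates,
        key=lambda n: (n.cpu_free - workload.cpu, n.mem_free_gb - workload.mem_gb),
    )
    best.cpu_free -= workload.cpu
    best.mem_free_gb -= workload.mem_gb
    return best.name


nodes = [Node("node-a", 2.0, 4.0), Node("node-b", 4.0, 8.0)]
print(schedule(Workload("api", 1.0, 2.0), nodes))    # node-b
print(schedule(Workload("db", 3.0, 6.0), nodes))     # node-b
print(schedule(Workload("cache", 2.5, 1.0), nodes))  # None
```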

Worker nodes

Worker nodes are the machines, either virtual machines or bare metal servers, where application workloads run. Each worker node has the following components:

  • Node agent (kubelet): An agent that ensures the containers are running as expected. It communicates with the control plane to receive instructions and report back on the status of the nodes. 
  • Network proxy (kube-proxy): Maintains network rules on each node, facilitating communication between containers and services within the cluster. 

Within the worker nodes, pods are the smallest deployable units. Each pod can contain one or more containers that run the application and its dependencies. The diagram shows multiple pods within the worker nodes, indicating how applications are deployed and managed. 

The cloud provider API directs how the orchestration system dynamically interacts with cloud infrastructure to provision resources as needed, making it a flexible and powerful tool for managing containerized applications across various environments. 

Popular container orchestration tools

Several container orchestration tools have emerged as the leaders in the industry, each offering unique features and capabilities. Here are some of the most popular tools:

Kubernetes

Kubernetes, often referred to as K8s, is an open-source container orchestration platform initially developed by Google. It has become the industry standard for managing containerized applications at scale. K8s is ideal for handling complex, multi-container applications, making it suitable for large-scale microservices architectures and multi-cloud deployments. Its strong community support and flexibility with various container runtimes contribute to its widespread adoption.

Docker Swarm

Docker Swarm is Docker’s native container orchestration tool, providing a simpler alternative to Kubernetes. It integrates seamlessly with Docker containers, making it a natural choice for teams already using Docker. Known for its ease of setup and use, Docker Swarm allows quick scaling of services with straightforward commands, making it ideal for small to medium-sized applications and rapid development cycles. 

Amazon Elastic Container Service (ECS)

Amazon ECS (Elastic Container Service) is a fully managed container orchestration service provided by AWS, designed to simplify running containerized applications. ECS integrates deeply with AWS services for networking, security, and monitoring. ECS leverages the extensive range of AWS services, making it a straightforward orchestration solution for enterprises using AWS infrastructure.

Red Hat OpenShift

Red Hat OpenShift is an enterprise-grade Kubernetes container orchestration platform that extends Kubernetes with additional tools for developers and operations, integrated security, and lifecycle management. OpenShift supports multiple cloud and on-premise environments, providing a consistent foundation for building and scaling containerized applications.

Google Kubernetes Engine (GKE)

Google Kubernetes Engine (GKE) is a managed Kubernetes service offered by Google Cloud Platform (GCP). It provides a scalable environment for deploying, managing, and scaling containerized applications using Kubernetes. GKE simplifies cluster management with automated upgrades, monitoring, and scalability features. Its deep integration with GCP services and Google’s expertise in running Kubernetes at scale make GKE an attractive option for complex application architectures.

Embracing the future of application deployment

Container orchestration has undoubtedly revolutionized the way we deploy, manage, and scale applications in today’s complex and dynamic software environments. By automating critical tasks such as scheduling, scaling, load balancing, and health monitoring, container orchestration enables organizations to achieve greater efficiency, reliability, and scalability in their application deployments. 

The choice of orchestration platform should be carefully considered based on your specific needs, team expertise, and long-term goals. Orchestration is not just a technical solution but a strategic enabler, providing you with significant advantages in your development and operational workflows.

The post Container Orchestration: Managing Applications at Scale appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How to Back Up Your QNAP NAS to the Cloud

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/qnap-nas-backup-to-cloud/

A decorative image with the title sync with QNAP.

Your QNAP network attached storage (NAS) device helps your business centralize storage capacity, support collaboration, and access files 24/7 from anywhere. If you were relying on individual hard drives or another ad hoc storage solution before, it definitely helps you uplevel your data management practices.

One of the great features of a QNAP NAS device is Hybrid Backup Sync (HBS), its onboard backup utility that allows you to easily store a copy of your data to your NAS and other destinations. You can set regular, automated backups to protect against data loss due to hardware failures or accidental deletion. But, keeping a copy of your data on your NAS alone doesn’t constitute a true backup strategy. For that, you need to follow the 3-2-1 backup rule with at least one copy stored off-site.

This post explains how to set up a 3-2-1 backup strategy with your QNAP NAS. We’ll share the benefits of storing your backups in the cloud, discuss different options for backing up your QNAP NAS, and provide some practical examples of what you can do by combining cloud storage and your NAS.


QNAP NAS and a 3-2-1 backup strategy

Following the 3-2-1 strategy means having three copies of your data, two of which are stored locally but on different media (aka devices), and one stored off-site. 

Your QNAP NAS is your first step towards completing the 3-2-1 strategy. By using it to store data locally, you have two copies on-site. Backing up your QNAP NAS to the cloud completes the 3-2-1 strategy by serving as your off-site storage. 
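If it helps to see the rule as a checklist, here is a small Python sketch that tests a set of copies against 3-2-1. The `DataCopy` type and the example copies are hypothetical; the check just counts copies, distinct media, and off-site locations.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataCopy:
    location: str  # where the copy lives
    medium: str    # what kind of device or service holds it
    offsite: bool  # True if it is stored away from the primary site


def satisfies_3_2_1(copies: List[DataCopy]) -> bool:
    """Three copies, on at least two different media, with at least one off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


workstation = DataCopy("office workstation", "internal SSD", offsite=False)
nas = DataCopy("office NAS", "NAS drives", offsite=False)
cloud = DataCopy("cloud bucket", "cloud object storage", offsite=True)

print(satisfies_3_2_1([workstation, nas]))         # False
print(satisfies_3_2_1([workstation, nas, cloud]))  # True
```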

A diagram showing the 3-2-1 backup strategy, which has three copies of data, on two different types of media, with one stored in an off-site location.

You could maintain an off-site copy on another physical device like another NAS, an external drive, or a file server, but keep in mind, backing up to an external destination other than the cloud will require you to physically separate the backup copy—that is, send your drive via mail or drive it elsewhere in order to ensure geographic separation. Backing up your QNAP NAS to the cloud means you achieve a 3-2-1 strategy without going out of your way to physically separate the copies, and it allows you to easily store data in different regions for greater data resilience and disaster recovery.

The additional benefits of backing up your QNAP NAS to the cloud

Backing up your QNAP NAS to the cloud gives you a number of additional benefits, including:

  • Disaster recovery: Without an off-site backup, your on-site data, including data on your individual workstations and your NAS, is susceptible to data loss. Natural disasters could wipe out your machines, your NAS, and any other backups you might store locally. Cloud backups safeguard your data from physical disasters that could destroy both your NAS and local copies.
  • Ransomware protection: While QNAP has on-board utilities that allow you to revert to a previous backup, your NAS is still connected to your network and susceptible to ransomware. Cloud backups, especially those configured with Object Lock, provide a layer of security against ransomware attacks that can encrypt or delete data stored on your network-connected NAS. 
  • Protection against hardware failure: Because your NAS is likely set up in a RAID configuration, one drive failure might not affect your data. But, while one drive is down, your data is at a higher risk. If another drive were to fail, you could lose data. Keeping an off-site backup in cloud storage helps you avoid this fate.
  • Accessibility: With your data in the cloud, your backups are accessible from anywhere. If you’re away from your desk or office and you need to retrieve a file, you can simply log in to your cloud account and copy that file down.
  • Security: Cloud vendors typically protect customer data by encrypting it as it travels to its final destination and/or when it is at rest on the vendors’ storage servers. Encryption protocols differ between cloud vendors, so make sure to understand them as you’re evaluating cloud providers, especially if you have specific security requirements.
  • Automation: Your QNAP NAS comes with a built-in backup utility so you can set your cloud backup schedule in advance and avoid human error (like forgetting to back up) in the future.
  • Scalability: As your data grows, your cloud backups grow with it. With cloud storage, there’s no need to invest in or maintain additional hardware to ensure your data is properly backed up.

How to protect your business data with QNAP

QNAP offers a number of different tools and functionality to help you back up business devices and systems to your NAS, including:

  1. Qsync: Qsync is an on-board backup utility on QNAP devices that allows you to sync computer files to your QNAP NAS. This allows you to back up workstations to your NAS, creating a second, local copy of that data. QNAP NAS also supports Time Machine for Macs. 
  2. NetBack PC Agent: A utility specifically for backing up Windows PCs and servers.
  3. Hyper Data Protector: Use Hyper Data Protector to back up multiple VMware and Hyper-V virtual machines (VMs).
  4. File server backup: QNAP devices support multiple protocols, including rsync, FTP, and CIFS for backing up different file servers.
  5. Boxafe: Use Boxafe to back up Google Workspace and Microsoft 365 business account data to your NAS.
  6. Snapshot feature: Takes point-in-time copies of data for protection and recovery.
  7. MARS: Use QNAP’s MARS service to back up Google Photos and WordPress databases and files to your NAS. 

How to back up your QNAP to the cloud

Once you’ve created a copy of your business data on your QNAP NAS, you can then use QNAP Hybrid Backup Sync to back it up to the cloud. Hybrid Backup Sync supports multi-version backups and allows you to customize retention settings for version management. QNAP’s QuDedup feature deduplicates data, helping you manage your storage footprint. The utility also allows you to manage Time Machine backups for Mac devices.
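To illustrate why deduplication shrinks a backup, here is a generic Python sketch of chunk-level deduplication. This is not QuDedup's actual algorithm, which isn't described here. It assumes fixed-size chunks identified by their SHA-256 hash, with an unrealistically small chunk size so the example stays readable.

```python
import hashlib


def dedup_store(files: dict, chunk_size: int = 4):
    """Split files into chunks and store each unique chunk only once."""
    store = {}      # chunk hash -> chunk bytes
    manifests = {}  # file name -> ordered list of chunk hashes
    for name, data in files.items():
        digests = []
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
            digests.append(digest)
        manifests[name] = digests
    return store, manifests


def restore(name: str, store: dict, manifests: dict) -> bytes:
    """Rebuild a file by concatenating its chunks in order."""
    return b"".join(store[digest] for digest in manifests[name])


files = {"a.txt": b"AAAABBBBCCCC", "b.txt": b"AAAABBBBDDDD"}
store, manifests = dedup_store(files)
total_chunks = sum(len(digests) for digests in manifests.values())
print(total_chunks, len(store))  # 6 4
print(restore("b.txt", store, manifests) == files["b.txt"])  # True
```

The two files share their first two chunks, so six chunks of data need only four chunks of storage.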

A product photo of a QNAP NAS.

What can you do with cloud storage and QNAP Hybrid Backup Sync?

The QNAP Hybrid Backup Sync app provides you with a lot of options. You can synchronize in the cloud as little or as much as you want. Here are some practical examples of what you can do with Hybrid Backup Sync and cloud storage working together.

1. Sync the entire contents of your QNAP to the cloud

The QNAP NAS has excellent fault tolerance—it can continue operating even when individual drive units fail—but nothing in life is foolproof. It pays to be prepared in the event of a catastrophe. Now that you know about the 3-2-1 backup strategy, you know how important it is to make sure that you have a copy of your files in the cloud.

2. Sync your most important media files

Using your QNAP to store marketing assets like video and photos? You’ve invested untold amounts of time, money, and effort into producing those media files, so make sure they’re safely and securely synced to the cloud with Hybrid Backup Sync.

3. Back up Time Machine and other local backups

Apple’s Time Machine software provides Mac users with reliable local backup, and many Backblaze customers rely on it to provide that crucial first step in making sure their data is secure. QNAP enables the NAS to act as a network-based Time Machine backup. Those Time Machine files can be synced to the cloud, so you can make sure to have Time Machine files to restore from in the event of a critical failure.

If you use Windows or Linux, you can configure the QNAP NAS as the destination for your Windows or Linux local data backup. That, in turn, can be synced to the cloud from the NAS.

Ready to give it a try?

Hybrid Backup Sync allows you to choose from any number of cloud storage providers as a backup destination, and Backblaze B2 Cloud Storage is one of them. Check out our videos on how to use Hybrid Backup Sync to back up or sync your data to B2 in under 15 minutes.

If you haven’t given cloud storage a try yet, you can get started now and make sure your NAS is synced or backed up securely to the cloud.

The post How to Back Up Your QNAP NAS to the Cloud appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

AI 101: Why RAG Is All the RAGe

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-why-rag-is-all-the-rage/

A decorative image showing an AI chip connecting icons representing different files.

At the risk of being called the stick in the mud of the tech world, we here at Backblaze have often bemoaned our industry’s love of making up new acronyms. The most recent culprit, hailing from the fast-moving artificial intelligence/machine learning (AI/ML) space, is truly memorable: RAG, aka retrieval-augmented generation. For the record, its creator has apologized for inflicting it upon the world.

Given how useful it is, we’re willing to forgive. (I’m sure he was holding his breath for that news.) Today, our AI 101 series is back to talk about what RAG is—and the big problem it solves. 

Read more AI 101

This article is part of a series that attempts to understand the evolving world of AI/ML. Check out our previous articles for more context:

Let’s start with large language models (LLMs)

LLMs are the most recognizable expression of AI in our current zeitgeist. (Arguably, you could append that with “that we’re all paying attention to,” given that ML algorithms have been behind many tools for decades now.) LLMs underpin tools like ChatGPT, Google Gemini, and Claude, as well as things like service-oriented chatbots, natural language processing tasks, and so on. They’re trained on vast amounts of data with algorithmic guardrails known as parameters and hyperparameters guiding their training. Once trained, we query them through a process known as inference.

Fabulous! The possibilities are endless. However, one of the biggest challenges we’ve experienced (and laughed about on the internet) is that LLMs can return inaccurate results, while sounding very, very reasonable. Additionally, LLMs don’t know what they don’t know. Their answers can only be as good as the data they draw from—so, if their training dataset is outdated or contains a systematic bias, it will impact your results. As AI tools have become more widely adopted, we’ve seen LLM inaccuracies range from “funny and widely mocked” to “oh, that’s actually serious.”

Enter retrieval-augmented generation (Fine! RAG)

RAG is a solution to these problems. Instead of relying on only an LLM’s dataset, RAG queries external sources before returning a response. It’s more complicated than “let me google that for you,” as the process takes that external data, turns it into a vector database, and then balances external data with an LLM’s “general knowledge” generated response and skill at responding to conversational queries.

This has several advantages. Users now have sources they can cite, and recent information is taken into account. From a development perspective, it means that you don’t have to re-train a model as frequently. And, it can be implemented in as few as five lines of code. 
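To give a feel for the retrieval step, here is a toy Python sketch. It is heavily simplified: real systems use learned embeddings and a vector database, whereas this sketch assumes plain word counts as "vectors" and cosine similarity to rank sources. The sample documents, function names, and prompt wording are all hypothetical, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list, k: int = 1) -> str:
    """Augment the user's question with the retrieved sources."""
    sources = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only the sources below.\nSources:\n{sources}\nQuestion: {query}"


help_articles = [
    "To reset a password, open Settings and choose Security.",
    "Backups run nightly at 2 a.m. and keep 30 versions of each file.",
    "Invoices are emailed on the first day of each month.",
]
question = "How many versions of a file do backups keep?"
print(build_prompt(question, help_articles))
```

The assembled prompt is what would be sent to the LLM, so the answer can draw on, and cite, the retrieved source.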

One important nuance is that when you’re building RAG into your product, you can set its sources. For industries like medicine and law, that means you can point them towards industry journals and trusted sources, outweighing the often misquoted or mis-cited examples you might see in a general database. 

Another example: For a technical documentation portal, you can take an LLM, trained on general information and the nuts and bolts of conversational querying, and direct it to rely on your organization’s help articles as its most important sources. Your organization controls the authoritative data, and how often/when changes are made. Users can trust that they’re getting the most recent security patches and correct code. And, you can do so quickly, easily, and—most importantly—cost-effectively. 

RAG doesn’t mean foolproof AI

RAG is a great, straightforward method for keeping LLM tools updated with current, high-quality information and giving users more transparency around where their answers are coming from. However, as we mentioned above, AI is only ever as good as the data it uses. Keep in mind, that’s a deceptively simple thing to say. It’s an entire, specialized job to validate datasets, and that expertise is built into the research and monitoring that happens while training an LLM. 

RAG gives a new source of data a privileged position—you’re saying “this data is more authoritative than that data” and, since the LLM doesn’t have anything in its general database, it may not have a counter argument. If you’re not paying attention to your RAG data source standards, and doing so on an ongoing basis, it’s possible, and even likely, that data bias, low quality data, etc. could creep into your model. 

Think of it this way: If you’re pointing to a new feature in your tech docs and there’s an error, that impact is magnified because an LLM will give more weight to the RAG data. At least in that case, you’re the one who controls the source data. In our other examples of legal or medical AI tools pointing to journal updates, things can get, well, more complicated. If (when) you’re setting up an AI that uses RAG, it’s imperative to make sure you’re also setting yourself up with reliable sources that are regularly updated. 

But, given its impact, and how low of a lift it is to integrate into existing products, we can see why RAG is all the RAGe—and, as always, we look forward to more to come in the AI landscape. For now, we can already see the impact it’s having on the market, with SaaS companies and startups alike exploring the possibilities.

The post AI 101: Why RAG Is All the RAGe appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

NAS vs. Cloud Storage: Which Remote Storage Option Is Best?

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/nas-vs-cloud-storage-which-solution-fits-your-business-needs/

A decorative image showing a cloud and a NAS device.

If you’re leading IT strategy for a growing enterprise and still weighing network attached storage (NAS) and cloud storage, you’re not alone. And you’re not behind. Even the most seasoned infrastructure pros find themselves re-evaluating their stack as data volumes explode and budgets tighten. Both offer unique benefits, but with overlapping features, it’s easy to see why the choice can be confusing. 

Are you looking for greater control with physical access, as in a local NAS setup? Or is off-site backup, flexibility, and scalability through a cloud service provider more aligned with your needs? With plenty of discussions and debates outlining the pros and cons of one or the other, it can be difficult to determine the best storage solution for your specific needs. 

This guide walks through clear, actionable insights into NAS and cloud storage, addressing your most pressing questions about storage costs, dedicated machines, data sharing, and performance. Whether the focus is cost, scalability, security, or accessibility, this guide will help identify the ideal storage solution for your business.

What is NAS?

NAS, or network attached storage, is a file-level storage system designed specifically to provide centralized and shared disk storage for users on a local area network (LAN). 

Essentially, NAS is a purpose-built computer that operates its own dedicated operating system (OS). It contains one or more storage devices that are configured to create a single shared volume. These storage devices are arranged in a RAID configuration to ensure data redundancy and performance. 

These configurations make NAS ideal for file sharing, data backups, and accessing large files within an organization, making it a cost-effective solution for enterprises that need local storage with physical access.

Many NAS devices, such as Synology NAS or QNAP NAS, come with built-in software for additional functionalities like file syncing, data backups, and offsite backup options to integrate with cloud services.

How does NAS work?

NAS provides access to files using standard network file sharing protocols such as Network File System (NFS) and Server Message Block (SMB). By connecting directly to the local network, NAS allows users to easily store, access, and collaborate on files without overburdening other servers within the network. This separation of file-serving responsibilities helps optimize overall network performance, particularly for high-traffic environments. 

NAS systems are generally managed through a web-based utility accessible over the network, offering an intuitive interface for configuration and maintenance. This interface allows administrators to handle tasks such as user permissions, storage allocation, and data redundancy settings—making it simpler to secure and organize shared files across the network.

Advantages of NAS

NAS offers several advantages, including fast data access, simplified administration, and many others. Here’s a breakdown: 

  • Cost effective: NAS devices typically involve an upfront purchase cost that includes access to applications from the NAS provider, like Synology Hyper Backup or QNAP Hybrid Backup Sync. This greatly reduces ongoing subscription fees, though you may incur costs if you want to expand your storage capacity with high-capacity storage drives or increase performance with upgrades like more powerful processors. 
  • Data control and security: NAS systems offer extensive control over data storage and security protocols. By default, NAS systems are accessible only on the local network and only to user accounts that you control and manage.
  • Performance: NAS provides high-speed access to data over a local network, ensuring quick file retrieval and sharing. NAS devices generally work as fast as the local network allows.
  • Scalable storage: Many NAS systems allow additional drives to be added, providing flexible storage expansion, albeit with the cost of additional drives or device upgrades. Modern NAS devices offer large storage capacities and advanced features for virtualization and application hosting.
  • Data redundancy: When equipped with RAID configurations, NAS provides redundancy, ensuring data remains accessible even if one or more hard drives fail.
  • Better data management tools: Features such as fully automated backups, deduplication, compression, and encryption enhance data storage efficiency and security. NAS systems also support sync workflows for team collaboration, directory services for user and group management, and services like photo or media management.
  • Compatibility: NAS systems are designed to support different OS environments and are compatible with Windows, Mac, and Linux operating systems. They offer seamless cross-platform access.
  • Remote access options: While primarily local, most NAS devices offer secure remote access through VPN or encrypted connections, allowing authorized users to access files from outside the office network when needed.

Limitations of NAS

While NAS offers numerous advantages for centralized file storage, there are some notable limitations to consider:

  • Initial setup and maintenance: The configuration process can be complex at enterprise scale, and ongoing maintenance may demand external IT support, adding to operational costs.
  • Remote access vulnerabilities: NAS systems can be accessed remotely over the internet, creating a private cloud or hybrid cloud solution. While this makes the device more useful, it also poses security risks, just like anything connected to the internet. Bad actors can exploit vulnerabilities and gain remote access to the device. To minimize risk, businesses must ensure proper security configurations, use encrypted connections, regularly update firmware, and restrict access to trusted IPs.
  • Scalability constraints: Although NAS systems allow for storage expansion, they are still limited by the physical capacity of the hardware. Adding storage often involves purchasing high-capacity drives, which can be costly, and for larger expansions, migrating to more powerful NAS devices might be necessary.
  • Data vulnerability: Data stored on a NAS is susceptible to various threats, including hardware failures, natural disasters, theft, and cyber attacks such as ransomware. While RAID configurations offer some level of data redundancy, they do not protect against all forms of data loss. Regular backups and additional security measures are essential to mitigate these risks.
  • Performance overheads: As more users and devices access the NAS, network bandwidth and device performance can become bottlenecks. High demand may reduce access speeds, impact data throughput, and reduce efficiency, especially in larger organizations with extensive data needs.
  • Data recovery challenges: If a NAS drive fails or becomes corrupted, data recovery processes may be complex and require specialized services, which can be costly and time-intensive.

What is cloud storage?

Cloud storage is a model of data storage where data is stored on servers located in off-site locations and accessed via the internet. This setup enables users to store, retrieve, and manage data without requiring local storage infrastructure. There are two main types of cloud: public and private. 

  • Public cloud storage: Hyperscale providers like AWS, Google Cloud, and Azure and specialized cloud providers like Backblaze maintain servers and are responsible for hosting, managing, and securing data. The public cloud is cost-effective and offers scalable storage for multiple users and businesses.
  • Private cloud storage: Typically managed in-house or by a dedicated third-party provider, private cloud storage is reserved for a single organization. For example, a university may maintain data centers for its community. Private clouds offer enhanced control and security, though they often require more complex management.

What’s the diff: Public vs. private cloud

Public cloud storage services are provided by third-party vendors over the public internet, making them accessible to anyone who wants to purchase or lease storage capacity. These services are designed to offer scalability and reliability, often on a pay-as-you-go basis.

Private cloud storage is dedicated to a single organization, which typically uses its own servers and data centers to store data within its own network. It can be hosted on-premises or by a third-party provider, but it’s always behind the organization’s firewall. This model is ideal for businesses that require more control over their data and have stringent security and compliance requirements.

Advantages of public cloud

One of the key benefits of public cloud storage is that it eliminates the need for businesses to buy, manage, and operate their own data center infrastructure. This shift allows companies to move from a capital expenditure (CapEx) model to an operational expenditure (OpEx) model, paying only for the storage they need when they need it. 

Additionally, cloud storage is elastic, enabling businesses to scale their storage capacity up or down more efficiently and strategically than through tactical hardware investments.

Advantages of private cloud

Private cloud storage allows for customized control and security measures, as organizations have full authority over their data environment. This setup can be highly beneficial for industries with strict data regulations, like finance and healthcare, as it enables better compliance with data privacy laws. 

Additionally, private clouds provide reliable performance since resources are not shared with external users, reducing latency issues and enabling faster data access for internal teams.

Types of cloud storage architecture 

In addition to the elasticity and scalability benefits of cloud storage, you can also combine on-premises storage and different types of public or private cloud storage to uniquely support your business needs. The primary models of cloud storage are:

  • Hybrid cloud storage: A hybrid model combines both public and private cloud storage. This allows an organization to decide which data it wants to store in which cloud. Sensitive data and data that must meet strict compliance requirements may be stored in a private cloud or on-premises while less sensitive data is stored in the public cloud. You could also use hybrid cloud to leverage on-premises storage for performance-sensitive tasks, such as using NAS to edit large media files locally, which are later synced to the cloud. 
  • Multi-cloud storage: A multi-cloud model involves using two or more public cloud storage services from different service providers. This model helps businesses leverage the best features of each cloud service while enhancing data availability and redundancy. For example, some companies use multiple cloud providers to host mirrored copies of their active production data. If one of their public clouds suffers an outage, they have mechanisms in place to direct their applications or websites to fail over to a second public cloud.

This flexibility in cloud storage architecture allows businesses to balance performance, cost, and security—ensuring critical data is stored securely while remaining accessible and resilient across multiple environments.

How does cloud storage work?

Cloud storage works by allowing users to upload data, such as files, documents, videos, or images to remote servers via the internet. 

Public cloud storage providers like Amazon, Google, Microsoft, and Backblaze maintain servers in large data centers. The uploaded data can be accessed and managed through web interfaces or APIs, making it highly accessible and flexible. 

Cloud storage offers numerous benefits that can greatly enhance business operations, such as storage space scalability, flexible data sharing options, and built-in data protection through regular backups and client-side encryption. However, there are also a few considerations like data security and storage costs to keep in mind. Next, we’ll look at the advantages and some of the key limitations of cloud-based storage solutions.

Advantages of cloud storage

Cloud storage enables businesses to scale with ease, reduce IT burdens, and access data remotely—offering a reliable, cost-efficient way to manage critical information. Here are some of the advantages of cloud storage:

  • Off-site protection: Cloud storage provides convenient off-site protection for data, ensuring that in the event of a physical disaster (such as fire or flood), data remains safe and accessible from any location. This supports data redundancy and business continuity. 
  • Enhanced security: Leading cloud providers invest heavily in advanced security measures—including encryption, multi-factor authentication, Object Lock for immutability, and regular security audits—to protect stored data from unauthorized access and breaches.
  • Scalability: Cloud storage services offer virtually unlimited storage capacity. Businesses can easily scale their storage needs up or down based on demand without needing to invest in physical hardware. 
  • Accessibility: Data stored in the cloud can be accessed from anywhere with an internet connection, facilitating remote work and data sharing across teams and locations. 
  • Lower maintenance: Cloud providers handle all hardware maintenance, software updates, and security patches, reducing the burden of managing storage systems on business IT teams. 
  • Cost efficiency: Many cloud storage solutions operate on a pay-as-you-go model, allowing businesses to pay only for the storage they use, which can be more cost-effective than local NAS or investing in on-premises hardware.

Limitations of cloud storage

While cloud storage offers flexibility and scalability, it also has limitations, such as ongoing costs and internet dependence, that businesses should evaluate carefully. 

  • Ongoing costs: Unlike on-premises storage solutions such as NAS, cloud storage operates on a subscription-based pricing model. When evaluating cloud storage, businesses should consider the total cost of ownership, including ongoing fees, and weigh these against the benefits of cloud storage. 
  • Dependence on the internet: Cloud storage relies on a stable internet connection for access and data transfer. Any disruptions in internet connectivity can hinder access to critical files and services, potentially impacting business operations. Ensuring reliable internet service and having contingency plans are crucial for minimizing downtime.

NAS vs cloud storage: A side-by-side comparison

The following table provides a side-by-side comparison of NAS and cloud storage, highlighting key aspects such as cost, scalability, security, and performance. This comparison will help you determine which storage solution best aligns with your business requirements and operational workflows.

Aspect | NAS | Cloud Storage
Storage model | File-level storage within a local network | Data stored on remote servers accessed via the internet
Performance | High-speed access over a local network; optimal for on-premises work | Dependent on internet speed and latency; suitable for global access and remote teams
Scalability | Limited by physical hardware capacity; requires purchasing new devices for expansion | Virtually unlimited scalability, allowing storage to expand without additional hardware
Cost | Upfront hardware purchase, ongoing investment to expand capacity | Subscription-based, pay-as-you-go model, often with no upfront hardware investment
Maintenance | Requires in-house IT maintenance, updates, and troubleshooting | Maintenance handled by cloud provider, reducing IT burden
Security | Controlled in-house, local network security; ideal for highly sensitive data | Enhanced by provider with encryption, multi-factor authentication, and security
Data redundancy | RAID configurations for local redundancy | Built-in data redundancy and disaster recovery options
Accessibility | Limited to local network access or VPN for remote connections | Accessible from anywhere with an internet connection, supporting remote work and collaboration
Compliance | Greater control for compliance in regulated industries; depends on in-house protocols | Many providers offer compliance with standards like GDPR, HIPAA, and SOC 2, ideal for regulated industries

Hybrid cloud: The best of both worlds

A hybrid cloud solution combines the strengths of both NAS and cloud storage. While NAS offers a centralized location to store and access files, the data stored on the NAS is still vulnerable to data disasters such as floods, fires, or hardware failures. 

By integrating cloud storage with NAS, you create an off-site backup of your NAS data that securely protects your critical data from virtually any data threat. This approach not only mitigates the risk associated with physical damage to your on-premises NAS equipment but also offers the scalability, flexibility and remote accessibility benefits of cloud storage. 

Additionally, this helps you implement 3-2-1 backup protection, where three copies of your data are stored on two different storage media (NAS and cloud) with one copy stored off-site in the cloud, protecting against ransomware, hardware failures, natural disasters, and other data threats.
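
As a rough illustration, the 3-2-1 rule can be expressed as a simple check over a list of backup copies. This is only a sketch: the Copy record, its fields, and the sample plan are hypothetical and do not correspond to any NAS or cloud vendor's tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Copy:
    """One copy of the data (hypothetical record used only for this sketch)."""
    location: str   # human-readable label, e.g. "office NAS"
    media: str      # kind of storage medium, e.g. "nas" or "cloud"
    offsite: bool   # True if the copy lives away from the primary site

def satisfies_3_2_1(copies):
    """True if there are at least 3 copies, on at least 2 media, with at least 1 off-site."""
    media_types = {c.media for c in copies}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite

# Example plan: a working copy, a NAS copy, and a cloud copy
plan = [
    Copy("workstation", "local disk", offsite=False),
    Copy("office NAS", "nas", offsite=False),
    Copy("cloud bucket", "cloud", offsite=True),
]

if __name__ == "__main__":
    print(satisfies_3_2_1(plan))
```

Dropping the cloud copy from the plan makes the check fail on two counts: only two copies remain, and none of them is off-site.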

NAS vs. cloud: Which is best for your business?

Choosing between NAS and cloud storage for your business largely depends on your specific use cases and operational needs. NAS provides fast local access, control, and cost efficiency for businesses with stable storage needs and on-premises operations. In contrast, cloud storage offers unparalleled scalability, remote access, and maintenance-free operation, making it ideal for organizations with dynamic storage needs and remote workforces. 

However, many businesses find that a combination of both, known as a hybrid cloud solution, offers the best of both worlds by combining the control of NAS with the scalability of cloud storage. 

Ultimately, the right choice will depend on a thorough evaluation of your business needs and operational workflows. By understanding the strengths and limitations of both NAS and cloud storage, you can make an informed decision that ensures your data is secure, accessible, and available when you need it.

FAQs about NAS and cloud storage

Is cloud storage better than NAS?

The answer depends on your specific business needs. Cloud storage offers scalability, remote access, and minimal maintenance requirements. NAS, on the other hand, provides fast local access and higher control over data management and security settings. Each solution has its strengths, and the best choice will depend on your priorities regarding data security, access, and cost.

Can I use a NAS as a cloud?

Yes, many modern NAS devices come with built-in features that allow them to function similarly to cloud storage, or to connect to a cloud storage provider of your choice. These NAS systems can be accessed remotely over the internet, creating a private cloud or hybrid cloud solution. However, it requires proper configuration, secure settings and a reliable internet connection to ensure seamless remote access.

Why use NAS instead of a server?

NAS devices are purpose-built for storage, offering simplicity, ease of management, and lower costs compared to traditional servers. While servers are multifunctional and can handle a variety of tasks, they are more complex to set up and maintain. NAS provides a straightforward solution for file sharing, backups, and media streaming without the need for extensive IT infrastructure. This makes NAS an excellent choice for small to medium-sized businesses that primarily need a dedicated storage solution.

Can NAS work without the internet?

Yes, NAS devices are designed to operate within a local area network (LAN) and do not require an internet connection for local access and file sharing. Users can store, access, and collaborate on files within local networks without internet access. However, for remote access or to leverage additional features such as cloud backups, an internet connection is necessary.

The post NAS vs. Cloud Storage: Which Remote Storage Option Is Best? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Live Read: The Game Changer for Live Media Cloud Workflows

Post Syndicated from Elton Carneiro original https://backblaze.com/blog/announcing-b2-live-read/

A decorative image with the title Live Read.

Every sports fan knows that when something incredible happens on the field/ice/court, we want to see the replay right now. But many of us don’t know the impressive efforts that live media teams undertake to deliver clips in real time to all of us on whatever viewing platform we might prefer. Today, Backblaze is excited to make the work of live media production (and the end results) a lot easier with our latest innovation.

Announcing Backblaze B2 Live Read

Backblaze B2 Live Read is a patent-pending service that gives media production teams working on live events the ability to access, edit, and transform media content while it is being uploaded into Backblaze B2 Cloud Storage. This means that teams can start working on content far faster than they could before, without having to drastically change their workflows and tools, massively speeding up their time to engagement and revenue. 

This is a game changer for live media teams, who are passionate about bringing content to their audience as soon as possible. It means they don’t need to worry as screen resolutions continue to expand, ranging from 4K to 8K and beyond. It also reduces the need for having production teams on-site to minimize latency, which could be extremely costly depending on the venue. 

Previously, producers had to wait hours or days before they could access uploaded data, or they had to rely on cost-prohibitive and complicated options that often required on-premises storage. That’s no longer necessary. This innovation will make it faster and less expensive to:

  • Create near real-time highlight clips for news segments, in-app replays, and much more.
  • Tap into talent where they are versus trying to find local talent to produce events.
  • Promote content for on-demand sales within minutes of presentations at live events.
  • Distribute teasers for buzz on social media before talent has even left the venue.

“For our customers, turnaround time is essential, and Live Read promises to speed up workflows and operations for producers across the industry. We’re incredibly excited to offer this innovative feature to boost performance and accelerate our customers’ business engagements.”

Richard Andes, VP, Product Management, Telestream

Coming soon inside your favorite tools

We designed Live Read to be easily accessible directly via the Backblaze S3 Compatible API and/or seamlessly within the user interface of launch partners including Telestream, Glookast, and Mimir. These platforms, along with CineDeck, Alteon, Hedge, Hiscale, MoovIT, and many others to come, will be enabling Live Read within their platforms soon.

If you want to use Live Read, you can join our private preview.  

How does it work?

Previously, media teams were forced to either wait for uploads to complete or use on-premises storage. Now, Live Read uniquely supports accessing parts of each growing file or growing object as it is uploaded so there’s no need to wait for the full file upload to complete. And, when the full upload is complete, it’s accessible like any other file in a Backblaze B2 Cloud Storage Bucket, with no middleware or proprietary software needed. 

Here’s a short video showing both how Live Read works on a conceptual level, as well as a live demo showing how one app can upload video data to Backblaze B2 using Live Read while a second app reads the uploaded video data:

For those of you who want to dig deeper into the code samples you saw in the video, here is some example code that uses the AWS SDK for Python, Boto3, to start uploading data with Live Read. If you’re familiar with Amazon S3, you’ll recognize that this is a standard multipart upload apart from the add_custom_header handler function and the call to register it with Boto3’s event system:

import boto3

def add_custom_header(params, **_kwargs):
    """
    Add the Live Read custom headers to the outgoing request.
    See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
    """
    params['headers']['x-backblaze-live-read-enabled'] = 'true'

client = boto3.client('s3')
client.meta.events.register('before-call.s3.CreateMultipartUpload', add_custom_header)

response = client.create_multipart_upload(Bucket='my-video-files', Key='liveread.mp4')

upload_id = response['UploadId']

# Now upload data as usual with repeated calls to client.upload_part()

As it processes the call to create_multipart_upload(), Boto3 calls the add_custom_header() handler function, which adds a custom HTTP header, x-backblaze-live-read-enabled, with the value true, to the S3 API request. The custom HTTP header signals to Backblaze B2 that this is a Live Read upload. As with standard multipart uploads, the data is uploaded in parts between 5MB and 5GB in size. To facilitate reading data efficiently, all parts except the last one must have the same size.
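
The equal-part-size rule can be sketched as a small generator that slices a stream into fixed-size parts, with only the last part allowed to be shorter. The helper name iter_parts and the two size constants are hypothetical; the constants read "5MB" and "5GB" as decimal units, which is an assumption to verify against the service limits.

```python
import io

# Assumption: "5MB" and "5GB" are interpreted as decimal megabytes and gigabytes
MIN_PART_SIZE = 5 * 1000**2
MAX_PART_SIZE = 5 * 1000**3

def iter_parts(stream, part_size):
    """Yield (part_number, data); every part is exactly part_size bytes except possibly the last."""
    if not MIN_PART_SIZE <= part_size <= MAX_PART_SIZE:
        raise ValueError("part_size is outside the allowed range")
    part_number = 1
    while True:
        buffer = bytearray()
        # A single read() may return fewer bytes than requested, so keep filling
        while len(buffer) < part_size:
            chunk = stream.read(part_size - len(buffer))
            if not chunk:
                break
            buffer.extend(chunk)
        if not buffer:
            return
        yield part_number, bytes(buffer)
        if len(buffer) < part_size:
            return  # a short part means the stream ended, so this was the last part
        part_number += 1

if __name__ == "__main__":
    source = io.BytesIO(b"\0" * 12_000_000)
    for number, data in iter_parts(source, 5_000_000):
        # In a real upload, each part would go to client.upload_part(
        #     Bucket=..., Key=..., UploadId=upload_id, PartNumber=number, Body=data)
        print(number, len(data))
```

Part numbers start at 1, matching the PartNumber parameter of upload_part(). The ETag returned for each part would then be collected and passed to complete_multipart_upload() once the stream ends.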

Since this is a Live Read upload, as soon as a part is uploaded, it is accessible for downloading.

An app that downloads the file needs to send the same custom HTTP header when it retrieves data. For example:

import boto3

def add_custom_header(params, **_kwargs):
    """
    Add the Live Read custom headers to the outgoing request.
    See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
    """
    params['headers']['x-backblaze-live-read-enabled'] = 'true'

client = boto3.client('s3')
client.meta.events.register('before-call.s3.GetObject', add_custom_header)

# Read the first 1 KiB of the file
response = client.get_object(
    Bucket='my-video-files',
    Key='liveread.mp4',
    Range='bytes=0-1023'
)

Note that you must supply either Range or PartNumber to specify a portion of the file when you download data using Live Read. If you request a range or part that does not exist, then Backblaze B2 responds with a 416 Range Not Satisfiable error, just as you might expect. On receiving this error, an app reading the file might repeatedly retry the request, waiting for a short interval after each unsuccessful request.
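
The retry behavior described above might look something like the following sketch. It assumes the error surfaces as a botocore ClientError, whose response dictionary carries the HTTP status code under ResponseMetadata. The helper names, attempt limit, and delay are hypothetical choices, not values recommended by Backblaze.

```python
import time

def is_range_not_ready(exc):
    """True if exc looks like a botocore ClientError carrying HTTP status 416."""
    response = getattr(exc, "response", None) or {}
    return response.get("ResponseMetadata", {}).get("HTTPStatusCode") == 416

def read_when_available(fetch, max_attempts=10, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only while the range is not yet available."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            # Re-raise anything that is not a 416, and give up after the final attempt
            if not is_range_not_ready(exc) or attempt == max_attempts:
                raise
            sleep(delay_seconds)

# Hypothetical usage with the client from the example above:
# response = read_when_available(
#     lambda: client.get_object(
#         Bucket='my-video-files', Key='liveread.mp4', Range='bytes=0-1023'))
```

Passing the request as a callable keeps the retry loop independent of Boto3, so the same helper works for Range and PartNumber reads alike.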

The source code for the applications is available as open source at https://github.com/backblaze-b2-samples/live-read-demo/.

How much does it cost?

Live Read upload capacity is offered in $15/TB increments—and the capacity is only consumed when an upload is marked for Live Read. Standard uploads are free, as usual. After uploading is complete, the data stored in Backblaze B2 is billed as normal. From a cost perspective, this represents significant savings versus the workflows that production teams must currently follow to achieve anything close to the functionality delivered by Live Read.

And it’s not just for live media

Beyond media, the Live Read API can support breakthroughs across development and IT workloads. For example, organizations maintaining large data logs or surveillance footage backups have often had to parse them into hundreds or thousands of small files each day in order to have quick access when needed. With Live Read, they can now move to far more manageable single files per day or hour while preserving the ability to access parts immediately after they are written.

What’s next

For those interested in Live Read, you can sign up for the private preview here. We’ll continue to report as we add more integrations and we’ll share stories as customers succeed with the new feature. Until then, feel free to ask any question you have in the comments below. 

Want to see more?

Join Pat Patterson, Chief Technical Evangelist, and Elton Carneiro, Senior Director of Partnerships, on January 26, 2024 at 10:00 a.m. PT to learn more in real time. Can’t make it live? Sign up anyway and we’ll send a recording straight to your inbox.

Join the Webinar 

The post Backblaze Live Read: The Game Changer for Live Media Cloud Workflows appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Video Surveillance Data Storage: Cloud vs. On-Prem vs. Hybrid

Post Syndicated from Tonya Comer original https://backblaze.com/blog/video-surveillance-data-storage-cloud-vs-on-prem-vs-hybrid/

A decorative image showing several video surveillance cameras connected to a cloud with the Backblaze logo on it.

Depending on your industry, you may need to install and run video surveillance. And once you have footage, you might be required to store it for a set period of days, months, or even years. This leads to the question: Where are you supposed to keep it all?

Not all storage systems are created equal, so it’s important to weigh the benefits and drawbacks of each option before making a decision. In some cases, government and industry regulations will require you to use a certain type of storage system. Ultimately, you will benefit from knowing how the system functions, what risks are involved, and how to select a technology provider.

This article will help you consider the pros and cons of on-premises, cloud, and hybrid storage systems. As you read, keep in mind that the amount of storage you need for your enterprise will depend on the number of cameras you have, the quality of the video footage, the length of time you are required to retain the footage, and various other factors. 
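
As a back-of-the-envelope illustration of how those factors combine, the sketch below multiplies camera count, per-camera bitrate, and retention time. The function name, the constant-bitrate assumption, and the example figures are all hypothetical; real footage sizes vary with codec, resolution, and motion-triggered recording.

```python
def estimate_storage_tb(cameras, bitrate_mbps, retention_days, hours_per_day=24.0):
    """Estimate footage storage in decimal terabytes, assuming a constant bitrate per camera."""
    seconds = retention_days * hours_per_day * 3600
    total_bits = cameras * bitrate_mbps * 1_000_000 * seconds
    total_bytes = total_bits / 8
    return total_bytes / 1e12

if __name__ == "__main__":
    # Hypothetical site: 16 cameras at 4 Mbps each, recording around the clock for 30 days
    print(f"{estimate_storage_tb(16, 4, 30):.1f} TB")
```

For the hypothetical site in the example, this works out to roughly 20.7 TB for a 30-day retention window, before accounting for any backup copies.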

First Things First: Your Backup Strategy

No matter how or where you store your video surveillance footage, the most important thing you should do is establish a backup strategy that follows the 3-2-1 backup approach. That means you should have three copies of your data on two different media with one stored off-site. In this post, we’ll weigh the pros and cons of whether you keep that off-site copy stored at an off-site location like, say, an Iron Mountain storage facility, a remote office, or data center, or whether you keep that off-site copy in the cloud. 

You might think we’re biased as a cloud provider. Of course, we’d love it if you choose to keep your backups with Backblaze! But the main thing we want to emphasize is that you should have a backup plan for your video surveillance footage (or any data, really!) whether it includes Backblaze or not. And, because you have to store one of those copies off-site, it’s miles easier (pun intended) to store in the cloud than to physically drive or mail hard drives to a secondary location.  

What Is On-Premises Storage?

Storing video footage on-premises means your data is stored on physical media—that is, servers, network attached storage (NAS), storage area network (SAN), LTO tape (linear tape open), etc.—in a physical location on your premises. We’ll talk about two forms of on-premises storage as they pertain to video footage: NAS and SAN.

Are NAS Devices Good for Storing Video Footage?

NAS devices have a large data storage capacity that provides file-based data storage services to other devices on a network. Usually, they also have a client or web portal interface, as well as services like QNAP’s Hybrid Backup Sync or Synology’s Hyper Backup to help manage your files.

A photo of a Synology NAS.

One of the benefits of NAS is that it’s easy to set up and use, and you can upgrade internal drives over time. The main drawback when it comes to storing video surveillance footage is that its storage capacity is limited. Even if you buy a bigger device than you need right now, eventually you’ll run out of space and need to buy more, especially if you’re storing large amounts of video surveillance footage.

Is a SAN Good for Storing Video Footage?

On the other end of the spectrum, SANs are engineered for high-performance and mission-critical applications. They function by connecting multiple storage devices, such as disk arrays or tape libraries, to a dedicated network that is separate from the main local area network (LAN).

SANs offer high-speed data access, which is critical for handling large video streams from multiple cameras, and allow for seamless scalability. As video surveillance systems grow, SANs can accommodate additional cameras and storage without disrupting ongoing operations. They also provide enhanced data security by isolating block-level storage within the operating system layer, to protect against failures and unauthorized access. On the downside, managing SANs can be complex, necessitating skilled administrators familiar with SAN architecture. Additionally, implementing SANs incurs upfront expenses for hardware, software, and expertise, while their reliance on centralized controllers means that a controller failure risks impacting multiple cameras.

What Is Cloud Storage?

Cloud storage enables you to securely store data and files in an off-site location. You can access this data through the public internet.

When you transfer data off-site for storage, the cloud storage provider (CSP) hosts, secures, manages, and maintains the servers and associated infrastructure, ensuring that you have seamless access to your data whenever you need it.

What Are the Benefits of Cloud Storage for Video Surveillance Footage?

  1. Scalability: Cloud storage services allow you to dynamically adjust capacity as your video surveillance data volumes fluctuate. 
  2. Avoid capital expenses (CapEx): By leveraging cloud storage for video surveillance, your organization benefits from paying for storage technology and capacity as a service, rather than incurring the capital expenses associated with constructing and upkeeping in-house storage networks. As data volumes grow over time, your costs may increase, but there’s no need to overprovision storage networks in anticipation of future data expansion. 
  3. Security: Cloud surveillance systems enhance data security: unique user accounts and data encryption ensure that only authorized personnel can access the footage. This controlled access minimizes the risk of unauthorized viewing or tampering.
  4. Accessibility: Cloud storage relies on an internet or network connection so authorized users can access surveillance footage remotely from anywhere using smart devices or web browsers. Whether you’re at the office, traveling, or even at home, you can review camera feeds without being physically present on-site. Keep in mind if the connection is lost or disrupted, access to video footage becomes challenging. This dependency can impact real-time monitoring and retrieval of critical data.

Just like our other storage strategies, there are drawbacks to cloud storage. For example, it relies on a stable internet connection. Video surveillance files are large, even when you apply compression techniques, which means that they take time and proper network connections to upload. So, if your internet connection goes down, it takes longer to get data properly stored or backed up than it would with other file types. That means you may not have real-time access to your data, or (in the worst cases) that you potentially risk file corruption if you don’t have a robust enough local storage infrastructure. 

Similarly, businesses should evaluate the privacy and data ownership concerns. Storing video footage in the cloud means entrusting sensitive data to a third-party service provider. Make sure that your CSP meets or exceeds all regulatory or compliance requirements, like SOC 2 or ISO 27001, before you store data on their platforms. 

All things considered, cloud storage offers scalability, ease of access, fine-tuned file control, and minimal maintenance, which are essential when dealing with the complexities of storing video surveillance footage.

Direct-to-Cloud Video Surveillance

Some companies choose to transfer video surveillance off-site to the cloud for backup purposes, while others push video footage directly to the cloud as a primary storage location, especially as there are several camera models and video surveillance solutions that are designed to easily push footage directly to cloud storage. When you’re choosing video surveillance hardware, it’s worth looking into whether they have this functionality, and if so, how much control you have over setting your storage destination to optimize costs. 

And, if you’re using cloud storage as the primary storage for video footage, a multi-cloud setup can be used to ensure the primary copy in the cloud is backed up. A multi-cloud setup involves using multiple cloud service providers simultaneously. So, if your video surveillance platform stores footage in its own cloud, you can still set up a workflow that backs up to a different CSP. For backup and archive purposes, organizations can distribute their data across different clouds to enhance reliability, reduce risk, create geographic diversity in storage locations for disaster recovery purposes, and comply with data retention policies. This approach ensures data availability even if one cloud provider experiences issues.

What Is Hybrid Cloud Storage?

Hybrid cloud storage combines elements from both public clouds and private clouds (typically on-premises systems). It’s essentially a unified management approach where an integrated infrastructure enables seamless movement of workloads and data between the private and public clouds.

Using a hybrid cloud for video surveillance makes sense for lots of use cases, including backup and archive. Let’s talk about how. 

Backup: To deploy a hybrid approach for a video surveillance backup use case, you’d store all of your video surveillance footage in your on-premises systems, then store your backups in the cloud. Many NAS devices, for example, come with on-board backup utilities that allow you to store backups of your video surveillance footage directly in the cloud. You could also use third-party backup software to automatically back up your systems to the cloud. This hybrid approach gives you fast access to your footage via your on-premises storage, while protecting it with cloud backups.

Archive: To deploy a hybrid approach for a video surveillance archive use case, you’d store recent live recordings of your video surveillance footage on-premises. After a recurring cutoff date—whether in days or months—you then move old footage to a public cloud. This hybrid system allows you to access recent footage quickly while archiving older footage, particularly if you have retention requirements for compliance or cyber insurance purposes. If done right, this system can help your company comply with both short- and long-term industry requirements.
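The recurring cutoff can be sketched as a small selection step. This is a minimal sketch, assuming footage is stored as ordinary files whose last-modified time reflects when they were recorded; the function name and the directory layout are hypothetical, and the upload itself is left out because it depends on the tool you use.

```python
import time
from pathlib import Path
from typing import List, Optional


def footage_to_archive(
    root: Path, cutoff_days: int, now: Optional[float] = None
) -> List[Path]:
    """Return files under `root` last modified more than `cutoff_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - cutoff_days * 86_400  # 86,400 seconds per day
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime < cutoff
    )
```

Each returned path would then be uploaded to your cloud archive, for example through an S3-compatible client or your NAS vendor's sync utility, and removed locally only after the upload has been verified.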

For a more in-depth look at hybrid cloud storage, check out our blog on hybrid cloud.

Is Hybrid Cloud Good for Video Surveillance Footage?

Leveraging hybrid cloud storage provides a dual advantage for video surveillance: swift local access to your video surveillance footage while simultaneously safeguarding it through off-site backups or off-loading it through a cloud archive. This strategic approach allows you to harness the strengths of both public and private clouds. Moreover, it offers enhanced scalability and flexibility compared to traditional on-premises solutions.

However, it’s essential to note that implementing a private cloud system can be cost-intensive. It necessitates budgeting for hardware acquisitions and replacements over time. Additionally, you’ll likely need to allocate resources for dedicated staff to maintain servers and backup strategies.

The Verdict: Which Type of Storage Is Best for Video Surveillance?

Choosing the right video surveillance storage solution is a critical decision for any organization. On-premises, cloud, and hybrid cloud each have their merits and drawbacks. While on-premises solutions offer large data storage capacity that is easy to set up and use, they require significant infrastructure investment. Cloud storage provides data accessibility and scales seamlessly while optimizing cost-effectiveness. Hybrid cloud provides both rapid local access to your video surveillance footage and secure off-site backups. 

Ultimately, the choice depends on your specific needs, budget, and long-term strategy. Consider the trade-offs carefully to ensure seamless and reliable video storage for your surveillance system.

The post Video Surveillance Data Storage: Cloud vs. On-Prem vs. Hybrid appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How Backblaze Scales Our Storage Cloud

Post Syndicated from Andy Klein original https://backblaze.com/blog/how-backblaze-scales-our-storage-cloud/

A decorative image showing a larger cube being compressed into a smaller cube.

Increasing storage density is a fancy way of saying we are replacing one drive with another drive of a larger capacity; for example replacing a 4TB drive with a 16TB drive—same space, four times the storage. You’ve probably copied or cloned a drive or two over the years, so you understand the general process. Now imagine having 270,000 drives that over the next several years will need to be replaced, or migrated as is often the term used. That’s a lot of work. And when you finish—well actually you’ll never finish as the process is continuous for as long as you are in the cloud storage business. So, how does Backblaze manage this ABC (Always Be Copying) process? Let me introduce you to CVT Copy or CVT for short.

CVT Copy is our in-house purpose-built application used to perform drive migrations at scale. CVT stands for Cluster, Vault, Tome, which is engineering vernacular mercifully shortened to CVT. 

Before we jump in, let’s take a minute to define a few terms in the context of how we organize storage.

  • Drive: The basic unit of storage ranging in our case from 4TB to 22TB in size.
  • Storage Server: A collection of drives in a single server. We have servers of 26, 45, and 60 drives. All drives in a storage server are the same logical size.
  • Backblaze Vault: A logical collection of 20 Storage Pods or servers. Each storage server in a Vault will have the same number of drives.
  • Tome: A tome is a logical collection of 20 drives, with each drive being in one of the 20 storage servers in a given Vault. If the storage servers in a Vault have 60 drives each, then there will be 60 unique tomes in that Vault.
  • Cluster: A logical collection of Vaults, grouped together to share other resources such as networking equipment and utility servers.

Based on this, a Vault consisting of 20, 60-drive storage servers will have 1,200 drives, a Vault with 45-drive storage servers will have 900 drives, and a Vault with 26-drive servers will have 520 drives. A cluster can have any combination of Vault sizes.
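The arithmetic behind those counts can be written out as a short sketch. The function and key names are illustrative only; the constant of 20 storage servers per Vault comes from the definitions above.

```python
SERVERS_PER_VAULT = 20  # a Vault is a logical collection of 20 storage servers


def vault_layout(drives_per_server: int) -> dict:
    """Return tome and drive counts for a Vault of identical storage servers."""
    return {
        # One tome per drive position in a server.
        "tomes": drives_per_server,
        # Each tome takes one drive from each of the 20 servers.
        "drives_per_tome": SERVERS_PER_VAULT,
        "total_drives": SERVERS_PER_VAULT * drives_per_server,
    }


if __name__ == "__main__":
    for size in (26, 45, 60):
        print(size, vault_layout(size))
```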

A Quick Review on How Backblaze Stores Data

Data is uploaded to one of the 20 drives within a tome. The data is then divided into parts, called data shards. At this point, we use our own Reed-Solomon erasure coding algorithm to compute the parity shards for that data. The number of data shards plus the number of parity shards will equal 20, i.e. the number of drives in a tome. The data and parity shards are written to their assigned drives, one shard per drive. The ratios of data shards to parity shards we currently use are 17/3, 16/4, and 15/5 depending primarily on the size of the drives being used to store the data: the larger the drive, the higher the parity.

Using parity allows us to restore (i.e. read) a file using fewer than 20 drives. For example, when a tome is 17/3 (data/parity), we only need data from any 17 of the 20 drives in that tome to restore a file. This dramatically increases the durability of the files stored.
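To make the trade-off concrete, the sketch below summarizes each of the three schemes mentioned above. It does not implement Reed-Solomon coding itself; it only captures the counting rule described here (any set of drives equal in number to the data shards is enough to restore a file). The function names are illustrative.

```python
from fractions import Fraction

TOME_DRIVES = 20  # data shards + parity shards always equals 20


def scheme_summary(data_shards: int, parity_shards: int) -> dict:
    """Summarize a data/parity scheme for a 20-drive tome."""
    if data_shards + parity_shards != TOME_DRIVES:
        raise ValueError("data and parity shards must total 20")
    return {
        # Drives in the tome that can be unavailable while files stay readable.
        "drives_that_can_be_unavailable": parity_shards,
        # Share of the tome's raw capacity that holds data rather than parity.
        "usable_fraction": Fraction(data_shards, TOME_DRIVES),
    }


def can_restore(available_drives: int, data_shards: int) -> bool:
    """A file is readable when at least `data_shards` drives respond."""
    return available_drives >= data_shards


if __name__ == "__main__":
    for data, parity in ((17, 3), (16, 4), (15, 5)):
        print(f"{data}/{parity}", scheme_summary(data, parity))
```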

CVT Overview

For CVT, the basic unit of migration is a tome, with all of the tomes in a source Vault being copied simultaneously to a new destination Vault which is typically new hardware. For each tome, the data, in the form of files, is copied file-by-file from the source tome to the destination tome.

The CVT Process

An overview of the CVT process is below, followed by an explanation of each task noted.

Selection

Selecting a Vault to migrate involves considering several factors. We start by reviewing current drive failure rates and predicted drive failure rates over time. We also calculate and consider overall Vault durability; that is, our ability to safeguard data from loss. In addition, we need to consider operational needs. For example, we still have Vaults using 45-drive Storage Pods. Upgrading these to 60-drive storage servers increases drive density in the same rack space. These factors taken together determine the next Vault to migrate.

Currently we are migrating systems with 4TB drives, which means we are migrating up to 3.6 petabytes (PB) of data for a 900 drive Vault or 4.8PB of data for a 1,200 drive Vault. Actually, there are no limitations as to the size of the source system drives, so Vaults with 6TB, 8TB, and larger sized drives can be migrated using CVT with minimal setup and configuration changes.

Once we’ve identified a source Vault to migrate we need to identify the target or destination system. Currently, we are using destination Vaults containing 16TB drives. There is no limitation as to the size of the drives of the destination Vault, so long as they are at least as large as those in the source Vault. You can migrate the data from any sized source Vault to any sized destination Vault as long as there is adequate room on the destination Vault.
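A back-of-the-envelope check for "adequate room" might look like the sketch below. It is an illustration only: it ignores formatting overhead, assumes the source Vault is completely full, and uses the 17/3 and 15/5 ratios that this post gives for the 4TB source and 16TB destination Vaults. The function names are hypothetical.

```python
def raw_capacity_pb(drives: int, drive_tb: float) -> float:
    """Raw Vault capacity in decimal petabytes (1 PB = 1,000 TB)."""
    return drives * drive_tb / 1000


def data_capacity_pb(
    drives: int, drive_tb: float, data_shards: int, parity_shards: int
) -> float:
    """Capacity left for data shards once parity shards are set aside."""
    total = data_shards + parity_shards
    return raw_capacity_pb(drives, drive_tb) * data_shards / total


def has_adequate_room(stored_pb: float, destination_free_pb: float) -> bool:
    """True when the data to migrate fits in the destination's free space."""
    return stored_pb <= destination_free_pb


if __name__ == "__main__":
    source = data_capacity_pb(900, 4, 17, 3)          # full 45-drive-server Vault
    destination = data_capacity_pb(1200, 16, 15, 5)   # empty 60-drive-server Vault
    print(f"source data: {source:.2f} PB, destination room: {destination:.2f} PB")
    print("fits:", has_adequate_room(source, destination))
```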

Setup

Once the source Vault and destination Vault are selected, the various Technical Operations and Data Center folks get to work on setting things up. If we are not using an existing destination Vault, then a new destination Vault is provisioned. This brings up one of the features of CVT: The migration can be to a new clean Vault or an existing Vault; that is, one with data from a previous migration on it. In the latter case, the new data is just added and does not replace any of the existing data. The chart below shows examples of the different ways a destination Vault can be filled from one or more source Vaults.

In any of these scenarios, the free space can be used for another migration destination or for a Vault where new customer data can be written.

Kickoff

With the source and destination Vaults identified and set up, we are now ready to kick off the CVT process. The first step is to put the source Vault in a read-only state and to disable file deletions on both the source and destination Vaults. It is possible that some older source Vaults may have already been placed in read-only state to reduce their workload. A Vault in a read-only state continues to perform other operations such as running shard integrity checks, reporting Drive Stats statistics and so on.

CVT and Drive Stats

We record Drive Stats data from the drives in the source Vault until the migration is complete and verified. At that point we begin recording Drive Stats data from the drives in the destination Vault and stop recording Drive Stats data from the drives in the source Vault. The drives in the source Vault are not marked as failed.

Build the File List

This step and the next three steps (read files, write files, and validate) are done as a consecutive group of steps for each tome in a source Vault. For our purpose here, we’ll call this group of steps the tome migration process, although they really don’t have such a name in the internal documentation. The tome migration process is for a single tome, but, in general, all tomes in a Vault are migrated at the same time, although due to their unique contents, they most likely will complete at different times.

For each tome, the source file list is copied to a file transfer database and each entry is mapped to its new location in the destination tome. This process allows us to maintain the same upload path while copying the data as the customer used to initially upload their data. This ensures that from the customer point of view, nothing changes in how they work with their files even though we have migrated them from one Vault to another.

Read Files

For each tome, we use the file location database to read the files, one file at a time. We use the same code in this process that we use when a user requests their data from the Backblaze B2 Storage Cloud. As noted earlier, the data is sharded across multiple drives using the preset data/parity scheme, for example 17/3. That means, in this case we only need data from 17 of the drives to read the file.

When we read a file, one advantage we get by using our standard read process is a pristine copy of the file to migrate. While we regularly run shard integrity checks on the stored data to ensure a given shard of data is good, media degradation, cosmic rays and so on can affect data sitting on a hard drive. By using the standard read process, we get a completely clean version of each file to migrate.

Write Files

The restored file is sent to the destination Vault; there is no intermediate location where the file resides. The transfer is done over an encrypted network connection typically within the same data center, preferably on the same network segment. If the transfer is done between data centers, it is done over an encrypted dark fiber connection.

The file is then written to the destination tome. The write process is the same one used by our customers when they upload a file, and given that process has successfully written hundreds of billions of files, we didn’t need to invent anything new.

At this point, you could be thinking that’s a lot of work to copy each file one by one.  Why not copy and transfer block by block, for example? The answer lies in the flexibility we get by using the standard file-based read and write processes.

  • We can change the number of tomes. Let’s say we have 45 tomes in the source Vault and 60 tomes in the destination Vault. If we had copied blocks of data the destination Vault would have 15 empty tomes. This creates load balancing and other assorted performance problems when that destination Vault is opened up for new data writes at a later date. By using standard read and write calls for each file, all 60 of the destination Vault’s tomes fill up evenly, just like they do when we receive customer data.
  • We can change parity of the data. The source 4TB drive Vaults have a data/parity ratio of 17/3. By using our standard process to write the files, the data/parity ratio can be set to whatever ratio we want for the destination Vault. Currently, the data/parity ratio for the 16TB destination Vaults is set to 15/5. This ratio ensures that the durability of the destination Vault and therefore the recoverability of the files therein is maintained as a result of migrating the data to larger drives.
  • We can maximize parity economics. Increasing the number of parity drives in a tome from three to five decreases the number of data drives in that tome. That would seem to increase the cost of storage, but the opposite is true in this case. Here’s how:
    • Using 4TB drives for 16TB of data stored
      • Our average cost for a 4TB drive was $120 or $0.03 per GB.
      • Our cost of 16TB of storage, using 4TB drives, was $480 (4 x $120).
      • Using a 17/3 data/parity scheme means:
        • Data storage: We have 13.6TB of data storage at $0.03/GB ($30/TB) which costs us $408. 
        • Parity storage: We have 2.4TB of parity storage at $0.03/GB ($30/TB) which costs us $72.
    • Using 16TB drives for 16TB of data stored
      • Our average cost for a 16TB drive is $232 or $0.0145 per GB.
      • Our cost of 16TB of storage is $232.
      • Using a 15/5 data/parity scheme means:
        • Data storage: We have 12.0TB of data storage at $0.0145/GB ($14.5/TB) which costs us $174.
        • Parity storage: We have 4.0TB of parity storage at $0.0145/GB ($14.5/TB) which costs us $58.
    • In summary, moving to a 15/5 data/parity scheme on the 16TB drives makes parity less expensive ($58) than it was on our 4TB drives at 17/3 ($72) for the same 16TB of storage. The lower cost per TB of the 16TB drives allows us to increase the number of parity drives in a tome. Therefore, increasing the parity of the destination tome not only enhances data durability, it is also economically sound.
    • Obviously a 16TB drive actually holds a bit less data due to formatting and overhead and four 4TB drives hold even less data. In other words, even with formatting and so on, the math still works out in favor of using the 16TB drives.
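The parity cost comparison above can be reproduced with a few lines of arithmetic. This sketch uses only the drive prices and ratios quoted in the list; the function name and dictionary keys are illustrative, and formatting overhead is ignored, as in the list itself.

```python
def cost_split(
    total_tb: float,
    cost_per_drive: float,
    drive_tb: float,
    data_shards: int,
    parity_shards: int,
) -> dict:
    """Split the cost of `total_tb` of raw storage into data and parity portions."""
    cost_per_tb = cost_per_drive / drive_tb
    shards = data_shards + parity_shards
    data_tb = total_tb * data_shards / shards
    parity_tb = total_tb * parity_shards / shards
    return {
        "cost_per_tb": round(cost_per_tb, 2),
        "data_tb": round(data_tb, 2),
        "parity_tb": round(parity_tb, 2),
        "data_cost": round(data_tb * cost_per_tb, 2),
        "parity_cost": round(parity_tb * cost_per_tb, 2),
    }


if __name__ == "__main__":
    # 16TB of raw storage on 4TB drives at $120 each, 17/3 scheme.
    print("4TB drives: ", cost_split(16, 120, 4, 17, 3))
    # 16TB of raw storage on one 16TB drive at $232, 15/5 scheme.
    print("16TB drives:", cost_split(16, 232, 16, 15, 5))
```

Even though the share of capacity given to parity rises from 15% to 25%, the parity cost drops because the cost per TB is roughly halved.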

Validate Tome

The last step in migrating a tome is to validate the destination tome is the same as the source tome. This is done for each tome as they complete their copy process. If the source and destination tomes are not consistent, shard integrity check data can be reviewed to determine any errors and the system can retransfer individual files, up to and including the entire tome.
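The comparison step can be pictured with the sketch below. It is not how CVT is implemented, which this post does not detail; it assumes, purely for illustration, that each tome can be summarized as a mapping of file name to checksum, and it uses SHA-1 as a stand-in digest.

```python
import hashlib
from typing import Dict, List


def checksum(data: bytes) -> str:
    """Hex digest used to compare file contents (SHA-1 chosen for illustration)."""
    return hashlib.sha1(data).hexdigest()


def files_to_retransfer(
    source: Dict[str, str], destination: Dict[str, str]
) -> List[str]:
    """Return names of files missing from, or different in, the destination."""
    return sorted(
        name for name, digest in source.items() if destination.get(name) != digest
    )


if __name__ == "__main__":
    src = {"a.bin": checksum(b"alpha"), "b.bin": checksum(b"beta")}
    dst = {"a.bin": checksum(b"alpha")}  # b.bin has not arrived yet
    print(files_to_retransfer(src, dst))
```

An empty result means the destination matches the source; a non-empty result lists the individual files to copy again, which in the worst case is the whole tome.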

Redirect Reads

Once all of the tomes within the Vault have completed their individual migrations and have passed their validation checks, we are ready to redirect customer reads (download requests) to the destination Vault. This process is completely invisible to the customer as they will use the same file handle as before. This redirection or swap process can be done tome by tome, but is usually done once the entire destination Vault is ready.

Monitor

At this point all download requests are handled by the destination Vault. We monitor the operational status of the Vault, as well as any failed download requests. We also review inputs from customer support and sales support to see if there are any customer related issues.

Once we are satisfied that the destination Vault is handling customer requests, we will logically decommission the source Vault. Basically, that means while the source Vault continues to run, it is no longer externally reachable. If a significant problem were to arise with the new destination Vault, we can swap in the source Vault. At this point, both Vaults are read-only, so the swap would be straightforward. We have not had to do this in our production environment. 

Full Capability

Once we are satisfied there are no issues with the destination Vault, we can proceed in one of two ways.

  • Another migration: We can prepare for the migration of another source Vault to this destination Vault. If this is the case, we return to the Selection step of the CVT process with the Vault once again being assigned as a destination Vault. 
  • Allow new data: We allow the destination Vault to accept new data from customers. Typically, the contents of multiple source Vaults have been migrated to the destination Vault before this is done. Once new customer writes have been allowed on a destination Vault, we won’t use it as a destination Vault again.

Decommission

After three months the source Vault is eligible to be physically decommissioned. That is, we turn it off, disconnect it from power and networking, and schedule it to be disassembled. This includes wiping the drives and recycling the remaining parts either internally or externally. In practice, we will wait to decommission at least two Vaults at once as it is more economical in dealing with our recycling partners.

Automation

You’re probably wondering how much of this process is automated or uses some type of orchestration to align and accomplish tasks. We currently have monitoring tools, dashboards, scripts, and such, but humans, real ones not AI generated, are in control. That said, we are working on orchestration of the setup and provisioning processes as well as upleveling the automation in the tome migration process. Over time, we expect the entire migration process to be automated, but only when we are sure it works—the “run fast, break things” approach is not appropriate when dealing with customer data.

Not for the Faint of Heart

The basic idea of copying the contents of a drive to another larger drive is straightforward and well understood. As you scale this process, complexity creeps in as you have to consider how the data is organized and stored while keeping it secure and available to the end user. 

If your organization manages your data in-house, the never-ending task of simultaneously migrating hundreds or perhaps thousands of drives falls to you or perhaps the contractor you hired to perform the task if you lack the experience or staffing. And this is just one of the tasks you are faced with in operating, maintaining, and upgrading your own storage infrastructure.

In addition to managing a storage infrastructure, there are the growing environmental concerns of data storage. The amount of data generated and stored each year continues to skyrocket and tools such as CVT allow us to scale and optimize our resources in a cost efficient, yet environmentally sensitive way. 

To do this, we start with data durability. Using our Drive Stats data and other information, we optimize the length of time a Vault should be in operation before the drives need to be replaced, that is, before the drive failure rate impacts durability. We then consider data density: how much data we can pack into a given space. Migrating data from 4TB to 16TB drives, for example, not only increases data density, it also uses less electricity per stored terabyte of data and produces less waste than if we had continued to buy and use 4TB drives instead of upgrading to 16TB drives.

In summary, CVT is more than just a fancy data migration tool: It is part of our overall infrastructure management program addressing the scalability, durability, and environmental challenges posed by the ever-increasing amounts of data we are asked to store and protect each day.

Kudos

The CVT program is run by Bryan with the wonderfully descriptive title of Senior Manager, Operations Analysis and Capacity Management. He is assisted by folks from across the organization. They are, in no particular order, Bach, Lorelei (Lo), Madhu, Mitch, Ben, Rodney, Vicky, Ryan, David M., Sudhi, David W., Zoe, Mike, and unnamed others who pitch in as the process rolls along. Each person brings their own expertise to the process which Bryan coordinates. To date, the CVT Team has migrated 24 Vaults containing over 60PB of data—and that’s just the beginning.

The post How Backblaze Scales Our Storage Cloud appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Plugs In to Internet2

Post Syndicated from Brent Nowak original https://www.backblaze.com/blog/backblaze-plugs-into-internet2/

A decorative image showing the Internet2 logo.

Who doesn’t love a sequel? From Star Wars to the Godfather, some of the best moments in storytelling have been part twos. (Let’s not talk about some of those part threes though.) And, if you were to write a sequel to The Internet, you couldn’t look for a better second chapter than a mission to support the technical and networking needs of leading academic and research organizations.  

Well, Internet2 is not actually a sequel, and it’s not a new version of the internet we all use every day. It’s an organization dedicated to delivering technical solutions and dedicated, high speed connectivity to institutions—ranging from the Smithsonian to Harvard and 330 other colleges, universities, regional research and education networks, nonprofit and government organizations, and more—who are working to solve today’s most pressing issues.

And today, Backblaze joined the Internet2 community to help further their mission. Here’s what that means:

  • First and foremost, the Backblaze Storage Cloud now connects to Internet2’s network as part of the Internet2 Peer Exchange (I2PX) program. This means that members of Internet2 can now move data into and out of Backblaze’s US-West and US-East regions at incredibly high speeds.
  • Second, Backblaze also completed the Internet2 Cloud Scorecard to offer research and educational institutions relevant details about Backblaze’s security, compliance, and technology specifications, making it easier to assess and procure our solutions.

Hundreds of institutions in the higher education and research space already use Backblaze for storing and using their data and protecting their endpoints. However, many others require data transmission via Internet2 for new cloud solutions. For these folks, Backblaze’s participation in Internet2’s community and I2PX program provides secure data storage with less latency and a lower cost for their data needs.

What type of data are we talking about? Think genetic sequencing records, billions of vector data points to help model and forecast weather events, or images of particle collisions at the subatomic level! 

The Backblaze team is incredibly excited to take this step forward in serving the different use cases that Internet2 supports. And of course, in addition to being a part of the Internet2 community, we’re always excited to add more high-quality peering relationships to our wider network (and to share some stats about it, too).

How big is the Internet2 network? Take a look below.

Now, let’s dig into how Internet2 creates high speed data transfer pathways, and how it will impact traffic here at Backblaze.

Our Connection

The diagram below gives you an idea of what the data path looks like for someone on the left with direct connectivity to Internet2 or access via a regional provider reaching the Backblaze US-West or US-East regions.

The entities on the left could exist locally in California or as far away as the U.S. East Coast. From any source location, the traffic will traverse the Internet2 network and then enter our network at our common peering points in San Jose, CA and Reston, VA.

Turning Up The Peering Session

Below is a chart of ingress traffic that was once reaching us over the public internet and is now taking the preferred path over Internet2. As soon as we established peering we started to receive a few gigabits per second of traffic, with large spikes occurring overnight.

Whenever we add a new service or peer, the flow of information in our network changes. This latest addition creates more interesting traffic patterns for our Network Engineering team to profile, monitor, and capacity plan for.

An Example of How that Speed Is Used: Moving Scientific Data

If you’re a scientist in Texas and want to send your 50TB research set quickly and reliably to a partner in California, you might only have a commercial connection to the internet. This could be a 1Gbps or smaller connection, and even that could have monthly data transfer limits, which is not ideal. Our 50TB example dataset would take over 4.6 days to transfer and use 100% of the available bandwidth if we were limited to 1Gbps (assuming perfect conditions and no latency).

The Internet2 network is built with capacity in mind. With backbone links up to 400Gbps, our example dataset would transfer in 16.7 minutes. Now, there are other limitations that will prevent you from reaching that rate (hard drive read speed, local Internet2 connection speed, and distance/latency factors), but this example gives you an idea of how much faster the Internet2 network can be than the vanilla commercial connections that might be available to a local university, college, or other research institution.
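The two transfer times quoted above follow from dividing the dataset size in bits by the link rate. Here is a minimal sketch of that arithmetic, assuming decimal units (1TB = 10^12 bytes, 1Gbps = 10^9 bits per second) and an ideal link with no protocol overhead; the function name is ours, chosen for illustration.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: total bits divided by link rate, ignoring overhead."""
    return size_bytes * 8 / link_bits_per_second


DATASET_BYTES = 50e12  # 50TB, assuming decimal terabytes

days_at_1g = transfer_seconds(DATASET_BYTES, 1e9) / 86_400
minutes_at_400g = transfer_seconds(DATASET_BYTES, 400e9) / 60

print(f"1Gbps:   {days_at_1g:.2f} days")          # 1Gbps:   4.63 days
print(f"400Gbps: {minutes_at_400g:.1f} minutes")  # 400Gbps: 16.7 minutes
```

Real transfers will be slower for the reasons listed above, so treat these figures as upper bounds on throughput rather than predictions.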

Conclusion

We’re very excited to be joining the Internet2 community and network, supporting industry best practices and enabling better connectivity to our storage platform. Hopefully, the next scientific breakthrough is sitting encrypted on our hard drives, and we can be part of the many, many people, tools, and organizations who helped it on its way from research to reality.  

For more information about Backblaze and Internet2, you can read our press release or check out the Internet2 member directory.  

The post Backblaze Plugs In to Internet2 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

AI Video Understanding in Your Apps with Twelve Labs and Backblaze

Post Syndicated from Pat Patterson original https://backblaze.com/blog/ai-video-understanding-in-your-apps-with-twelve-labs-and-backblaze/

A decorative header depicting several screens with video editing tasks and a cloud with the Backblaze logo on it.

Over the past few years, since long before the recent large language model (LLM) revolution, we’ve benefited from the ability of AI models not only to transcribe audio to text, but also to automatically tag video files according to their content. Media asset management (MAM) software, such as Backlight iconik and Axle.ai (both Backblaze Partners, by the way), allows media professionals to quickly locate footage by searching for combinations of tags. For example, searching for “red car” will return not only a list of video files containing red cars, but also the timecodes pinpointing the appearance of the red car in each clip.

San Francisco startup Twelve Labs has created a video understanding platform that allows any developer to build this kind of functionality, and more, into their app via a straightforward RESTful API. 

In preparation for our webinar with Twelve Labs last month, I created a web app to show how to integrate Twelve Labs with Backblaze B2 for storing video. The complete sample app is available as open source at GitHub; in this blog post, I’ll provide a brief description of the Twelve Labs platform, explain how presigned URLs allow temporary access to files in a private bucket, and then share the key elements of the sample app. If you just want a high level understanding of the integration, read on, and feel free to skip the technical details!

The Twelve Labs Video Understanding Platform

The core of the Twelve Labs platform is a foundation model that operates across the visual, audio, and text modes of video content, allowing multimodal video understanding. When you submit a video using the Twelve Labs Task API, the platform generates a compact numerical representation of the video content, termed an embedding, that identifies entities, actions, patterns, movements, objects, scenes, other elements of the video, and their interrelationships. The embedding contains everything the Twelve Labs platform needs to do its work—after the initial scan, the platform no longer needs access to the original video content. As each video is scanned into the platform, its embedding is added to an index, so this scanning process is often referred to as indexing.

As part of the indexing process, the platform extracts a standard set of data from each video: a thumbnail image, a transcript of any spoken content, any text that appears on screen, and a list of brand logos, all annotated with timecodes locating them on the video’s timeline, and all accessible via the Twelve Labs Index API.

You can have the platform create a title and summary, and even prompt the model to describe the video, via Twelve Labs’ Generate API. For example, I indexed an eight-minute video that explains how to back up a Synology NAS to Backblaze B2, then prompted the Generate API, “What are the two Synology applications mentioned in the video?” This was the first sentence of the resulting text:

The two Synology applications mentioned throughout the video are “Synology Hyper Backup” and “Synology Cloud Sync.”

The remainder of the response is a brief summary of the two applications and how they differ; here’s the full text. Although it does have that “AI flavor” as you read it, it’s clear and accurate. I must admit, I was quite impressed!

You can define a taxonomy for your videos via the Classify API. Submit a one- or two-level classification schema and a set of video IDs, and the platform will assign each video to a category.

Rounding out this quick tour of the Twelve Labs platform, the Search API, as its name suggests, allows you to search the indexed videos. As well as a search query, you must specify a set of content sources: any combination of visual, conversation, text in video, or logos. Each search result includes timecodes for its start and end.

Now that you understand the basic capabilities of the Twelve Labs platform, let’s look at how you can integrate it with Backblaze B2.

Allowing Temporary Access to Files in a Private Backblaze B2 Bucket

A key feature of the sample app is that it uploads videos to a private Backblaze B2 Bucket, where they are only accessible to authorized users. Twelve Labs’ API allows you to submit a video for indexing by POSTing a JSON payload including the video’s URL to its Task API. This is straightforward for video files in a public bucket, but how do we allow the Twelve Labs platform to read files from a private bucket?

One way would be to create an application key with capabilities to read files from the private bucket and share it with the Twelve Labs platform. The main drawback to this approach is that the platform currently lacks the ability to sign requests for files from a private bucket.

Since Twelve Labs only needs to read the video file when we submit it for indexing, we can send it a presigned URL for the video file. As well as the usual Backblaze B2 endpoint, bucket name, and object key (path and filename), a presigned URL includes query parameters containing data such as the time when the URL was created, its validity period in seconds, an application key ID (or access key ID, in S3 terminology), and a signature created with the corresponding application key (secret access key). Here’s an example, with line breaks added for clarity:

https://s3.us-west-004.backblazeb2.com/mybucket/image.jpeg
?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=00415f935c00000000aa%2F20240423%2Fus-west-004%2Fs3%2Faws4_request
&X-Amz-Date=20240423T222652Z
&X-Amz-Expires=3600
&X-Amz-SignedHeaders=host
&X-Amz-Signature=23ade1...3ca1eb

This URL was created at 22:26:52 UTC on 04/23/2024, and was valid for one hour (3600 seconds). The signature is 64 hex characters. Changing any part of the URL, for example, the X-Amz-Date parameter, invalidates the signature, resulting in an HTTP 403 Forbidden error when you try to use it, with a corresponding message in the response payload:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Error>
    <Code>SignatureDoesNotMatch</Code>
    <Message>Signature validation failed</Message>
</Error>

Attempting to use the presigned URL after it expires yields HTTP 401 Unauthorized with a message such as:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Error>
    <Code>UnauthorizedAccess</Code>
    <Message>Request has expired given timestamp: '20240423T222652Z' and expiration: 3600</Message>
</Error>

You can create presigned URLs with any of the AWS SDKs or the AWS CLI. For example, with the CLI:

% aws s3 presign s3://mybucket/image.jpeg --expires-in 600 
https://s3.us-west-004.backblazeb2.com/mybucket/image.jpeg?X-Amz...

Presigned URLs are useful whenever you want to provide temporary access to a file in a private bucket without having to share an application key for a client app to sign the request itself. The sample app also uses them when rendering HTML web pages. For example, all of the thumbnail images are retrieved by the user’s browser via presigned URLs.

Note that presigned URLs are a feature of Backblaze B2’s S3 Compatible API. Creating a presigned URL is an offline operation and does not consume any API calls. We recommend you use presigned URLs rather than the b2_get_download_authorization B2 Native API operation, since the latter is a class C API call.

Inside the Backblaze B2 + Twelve Labs Media Asset Management Example

The sample app is written in Python, using JavaScript for its front end, the Django web framework for its backend, the Huey task queue for managing long-running tasks, and the Twelve Labs Python SDK to interact with the Twelve Labs platform. A simple web UI allows the user to upload videos to the private bucket, browse uploaded videos, submit them for indexing, view the resulting transcription, logos, etc., and search the indexed videos.

Most of the application code is concerned with rendering the web UI; very little code is required to interact with Twelve Labs.

Configuration

The Django settings.py file defines a constant for the Twelve Labs index ID and creates an SDK client object using the Twelve Labs API key. Note that the app reads the index ID and API key from environment variables, rather than including the values in the source code. Externalizing the index ID as an environment variable allows more flexibility in deployment while, of course, you should never include secrets such as passwords or API keys in source code!

TWELVE_LABS_INDEX_ID = os.environ['TWELVE_LABS_INDEX_ID']
TWELVE_LABS_CLIENT = TwelveLabs(api_key=os.environ['TWELVE_LABS_API_KEY'])

Startup

When the web application starts, it validates the index ID and API key by retrieving details of the index. This is the relevant code, in apps.py:

index = TWELVE_LABS_CLIENT.index.retrieve(TWELVE_LABS_INDEX_ID)

If this API call fails, then the app prints a suitable diagnostic message identifying the issue.

Indexing

When a web application needs to perform an action that takes more than a few seconds to complete (for example, indexing a set of videos), it typically starts a background task to do the work and returns an appropriate response to the user. The sample app follows this pattern: when the user selects one or more videos and hits the Index button, the web app starts a Huey task, do_video_indexing(), passing the IDs of the selected videos, and returns the IDs to the JavaScript front end. The front end can then show that the indexing tasks have started, and poll for their current status.

Here’s the code, in tasks.py, for submitting the videos for indexing.

# Create a task for each video we want to index
for video_task in video_tasks:
    task = TWELVE_LABS_CLIENT.task.create(
        TWELVE_LABS_INDEX_ID,
        url=default_storage.url(video_task['video']),
        disable_video_stream=True
    )
    print(f'Created task: {task}')
    video_task['task_id'] = task.id

Notice the call to default_storage.url(). This function, implemented by the django-storages library, takes as its argument the path to the video file, returning the presigned URL. The default expiry period is one hour.

Once the videos have been submitted, do_video_indexing() polls for the status of each indexing task until all are complete. Most of the code is concerned with minimizing the number of calls to the API, and saving status to the app’s database; getting the status of a task is simple:

task = TWELVE_LABS_CLIENT.task.retrieve(video_task['task_id'])

The task object’s status attribute is a string with a value such as validating, indexing, or ready. When the task reaches the ready status, the task object also includes a video_id attribute, uniquely identifying the video within the Twelve Labs platform. At this point, do_video_indexing() calls a helper function that retrieves the thumbnail, transcript, text, and logos and stores them in Backblaze B2.
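The polling loop itself is not shown in this post, so here is a minimal sketch of the general pattern rather than the sample app’s code. It assumes a get_status callable standing in for the task.retrieve() call above, and it treats a failed status as terminal, which is our assumption; the only status values named above are validating, indexing, and ready.

```python
import time


def poll_until_ready(task_ids, get_status, interval_seconds=5.0, sleep=time.sleep):
    """Poll each pending task until it reports 'ready' or 'failed'.

    get_status is any callable mapping a task ID to a status string.
    Returns a dict of task ID -> final status.
    """
    final = {}
    pending = list(task_ids)
    while pending:
        still_pending = []
        for task_id in pending:
            status = get_status(task_id)
            if status in ("ready", "failed"):  # 'failed' is an assumed terminal status
                final[task_id] = status
            else:
                still_pending.append(task_id)
        pending = still_pending
        if pending:
            sleep(interval_seconds)  # wait before the next round of status calls
    return final


# Hypothetical stand-in for the SDK: each task reports a scripted sequence of statuses
scripted = {
    "task-a": iter(["validating", "indexing", "ready"]),
    "task-b": iter(["indexing", "ready"]),
}
result = poll_until_ready(
    scripted,
    lambda task_id: next(scripted[task_id]),
    sleep=lambda seconds: None,  # skip real sleeping in this demonstration
)
print(result)
```

Only tasks that are still pending are queried on each pass, which is one simple way to keep the number of API calls down.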

Retrieving Video Data

Here’s the call to retrieve the thumbnail:

thumbnail_url = TWELVE_LABS_CLIENT.index.video.thumbnail(TWELVE_LABS_INDEX_ID, video.video_id)

The helper function creates a path for the thumbnail file from the video ID and the file extension in the returned URL, and saves the thumbnail to Backblaze B2:

default_storage.save(thumbnail_path, urlopen(thumbnail_url))

Again, django-storages is doing the heavy lifting. We use urlopen(), from the urllib.request module, to open the thumbnail URL, providing default_storage.save() with a file-like object from which it can read the thumbnail data.

The calls to retrieve transcript, text, and logo data have a slightly different form, for example:

video_data = TWELVE_LABS_CLIENT.index.video.transcription(TWELVE_LABS_INDEX_ID, video.video_id)

Each call returns a list of VideoValue objects, each VideoValue object comprising a start and end timecode (in seconds) and a value specific to the type of data; for example, a fragment of the transcription. We serialize each list to JSON and save it as a file in Backblaze B2.
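As a rough illustration of the serialization step, the sketch below converts a list of timecoded values to JSON. The VideoValue dataclass here is a hypothetical stand-in with the three fields described above, not the SDK’s own class, and the transcript text is made up.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VideoValue:
    """Hypothetical stand-in: start and end timecodes in seconds, plus a value."""
    start: float
    end: float
    value: str


def to_json(values: list[VideoValue]) -> str:
    """Serialize a list of timecoded values to a JSON array of objects."""
    return json.dumps([asdict(v) for v in values], indent=2)


# Made-up transcript fragments for demonstration
transcript = [
    VideoValue(0.0, 4.2, "Welcome to the video."),
    VideoValue(4.2, 9.8, "Let's get started."),
]
payload = to_json(transcript)
print(payload)
```

The resulting string could then be wrapped in a file-like object and passed to a storage call such as the default_storage.save() shown earlier.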

When the user navigates to the detail page for a video, JavaScript reads each dataset from Backblaze B2 and renders it into the page, allowing the user to easily navigate to any of the data items.

Searching the Index

When the user enters a query and hits the search button, the backend calls the Twelve Labs Search API, passing the query text, and requesting results for all four sources of information. We set group_by to video since we want to show the results by video, and set the confidence threshold to medium to improve the relevance of the results. From VideoSearchView in views.py:

results = TWELVE_LABS_CLIENT.search.query(
    TWELVE_LABS_INDEX_ID,
    query,
    ["visual", "conversation", "text_in_video", "logo"],
    group_by="video",
    threshold="medium"
)

By default, the query() call returns a page of 10 results in results.data, so we loop through the pages using next(results) to fetch pages of search results as necessary. Each individual search result includes start and end timecodes, confidence, and the type of match (visual, conversation, text, or logo).
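The paging loop can be sketched as follows. FakeSearchResult is a hypothetical stand-in that mimics the behavior described above (first page in .data, later pages via next()); it also assumes that next() raises StopIteration once the pages run out, which you should confirm against the SDK documentation before relying on it.

```python
class FakeSearchResult:
    """Hypothetical stand-in for a paged search result object."""

    def __init__(self, pages):
        self.data = pages[0] if pages else []  # first page is available immediately
        self._rest = iter(pages[1:])           # remaining pages are fetched on demand

    def __iter__(self):
        return self

    def __next__(self):
        # Raises StopIteration when there are no more pages (assumed behavior)
        return next(self._rest)


def collect_all(search_result):
    """Gather the first page plus every following page into one list."""
    items = list(search_result.data)
    while True:
        try:
            items.extend(next(search_result))
        except StopIteration:
            return items


pages = [list(range(1, 11)), list(range(11, 21)), [21, 22, 23]]
all_items = collect_all(FakeSearchResult(pages))
print(len(all_items))  # 23
```

In a web UI you would more likely fetch pages lazily as the user scrolls, rather than collecting everything up front.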

In the web UI, the user can click through to the results for a given video, then click an individual search result to view the matching video clip.

Getting Started with Backblaze B2 and Twelve Labs

Backblaze B2 Cloud Storage is a great choice for storing video to index with Twelve Labs; free egress each month for up to three times the amount of data you’re storing means that you can submit your entire video library to the Twelve Labs platform without worrying about data transfer charges, and unlimited free egress to our CDN partners reduces the costs of distributing video content to end users.

Click here to create a Backblaze B2 account, if you don’t already have one. Your first 10GB of storage is free, no credit card required. If you’re an enterprise that wants to run a larger proof of concept, you can always reach out to our Sales Team. You don’t need to write any code to upload video files or create presigned URLs, and you can use the Backblaze web UI to upload files up to 500MB, or any of a wide variety of tools to upload files up to 10TB, including the AWS CLI, rclone and Cyberduck. Select S3 as the protocol to be able to create presigned URLs.

Similarly, click here to sign up for Twelve Labs’ Free plan. With it, you can index up to 600 minutes of video, again, no credit card required. Python and Node.js developers can use one of the Twelve Labs SDKs, while the Twelve Labs API documentation includes code examples for a wide range of other programming languages.

The post AI Video Understanding in Your Apps with Twelve Labs and Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q1 2024

Post Syndicated from Andy Klein original https://backblaze.com/blog/backblaze-drive-stats-for-q1-2024/

A decorative image displaying the title Q1 2024 Drive Stats.

As of the end of Q1 2024, Backblaze was monitoring 283,851 hard drives and SSDs in our cloud storage servers located in our data centers around the world. We removed from this analysis 4,279 boot drives, consisting of 3,307 SSDs and 972 hard drives. This leaves us with 279,572 hard drives under management to examine for this report. We’ll review their annualized failure rates (AFRs) as of Q1 2024, and we’ll dig into the average age of drive failure by model, drive size, and more. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard Drive Failure Rates for Q1 2024

We analyzed the drive stats data of 279,572 hard drives. In this group we identified 275 individual drives which exceeded their manufacturer’s temperature specification at some point in their operational life. As such, these drives were removed from our AFR calculations.

The remaining 279,297 drives were divided into two groups. The primary group consists of the drive models which had at least 100 drives in operation as of the end of the quarter and accumulated over 10,000 drive days during the same quarter. This group consists of 278,656 drives grouped into 29 drive models. The secondary group contains the remaining 641 drives which did not meet the criteria noted. We will review the secondary group later in this post, but for the moment let’s focus on the primary group.

For Q1 2024, we analyzed 278,656 hard drives grouped into 29 drive models. The table below lists the AFRs of these drive models. The table is sorted by drive size then AFR, and grouped by drive size.

Notes and Observations on the Q1 2024 Drive Stats

  • Downward AFR: The AFR for Q1 2024 was 1.41%. That’s down from Q4 2023 at 1.53%, and also down from one year ago (Q1 2023) at 1.54%. The continuing process of replacing older 4TB drives is a primary driver of this decrease as the Q1 2024 AFR (1.36%) for the 4TB drive cohort is down from a high of 2.33% in Q2 2023.
  • A Few Good Zeroes: In Q1 2024, three drive models had zero failures:
    • 16TB Seagate (model: ST16000NM002J)
      • Q1 2024 drive days: 42,133
      • Lifetime drive days: 216,019
      • Lifetime AFR: 0.68%
      • Lifetime confidence interval: 1.4%
    • 8TB Seagate (model: ST8000NM000A)
      • Q1 2024 drive days: 19,684
      • Lifetime drive days: 106,759
      • Lifetime AFR: 0.00%
      • Lifetime confidence interval: 1.9%
    • 6TB Seagate (model: ST6000DX000)
      • Q1 2024 drive days: 80,262 
      • Lifetime drive days: 4,268,373
      • Lifetime AFR: 0.86%
      • Lifetime confidence interval: 0.3%

All three drive models have a lifetime AFR of less than 1%, but in the case of the 8TB and 16TB drive models the confidence interval (95%) is still too high. While it is possible the two drive models will continue to perform well, we’d like to see the confidence interval below 1%, and preferably below 0.5%, before we can trust the lifetime AFR.

With a confidence interval of 0.3% the 6TB Seagate drives delivered another quarter of zero failures. At an average age of nine years, these drives continue to defy their age. They were purchased and installed at the same time back in 2015, and are members of the only 6TB Backblaze Vault still in operation.

  • The End of the Line: The 4TB Toshiba drives (model: MD04ABA400V) are not in the Q1 2024 Drive Stats tables. This was not an oversight. The last of these drives became a migration target early in Q1 and their data was securely transferred to pristine 16TB Toshiba drives. They rivaled the 6TB Seagate drives in age and AFR, but their number was up and it was time to go.
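For readers who want to reproduce the percentages in this section, here is a minimal sketch of an AFR calculation. It assumes the usual Drive Stats definition, failures divided by drive years (drive days / 365), expressed as a percentage. The Q1 figures for the 16TB Seagate come from the list above; the lifetime failure count of four is our assumption, chosen because it reproduces the published 0.68%.

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR as a percentage: failures per drive year, times 100."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


# 16TB Seagate (ST16000NM002J): zero failures over 42,133 drive days in Q1 2024
q1_afr = annualized_failure_rate(0, 42_133)

# Lifetime: 216,019 drive days; four failures is an assumed count, not a reported one
lifetime_afr = annualized_failure_rate(4, 216_019)

print(f"Q1 AFR: {q1_afr:.2f}%, lifetime AFR: {lifetime_afr:.2f}%")
# Q1 AFR: 0.00%, lifetime AFR: 0.68%
```

With so few drive days, a single additional failure moves the result noticeably, which is why the confidence interval matters as much as the AFR itself.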

The Secondary Group

As noted previously, we divided the drive models into two groups, primary and secondary, with drive count (>100) and drive days (>10,000) being the metrics used to divide the groups. The secondary group has a total of 641 drives spread across 27 drive models. Below is a table of those drive models. 

The secondary group is mostly made up of drive models which are replacement drives or migration candidates. Regardless, the number of observations (drive days) over the observation period is too low to have any certainty about the calculated AFR.

From time to time, a secondary drive model will move into the primary group. For example, the 14TB Seagate (model: ST14000NM000J) will most likely have over 100 drives and 10,000 drive days in Q2. The reverse is also possible, especially as we continue to migrate our 4TB drive models.

Why Have a Secondary Group?

In practice we’ve always had two groups; we just didn’t name them. Previously, we would eliminate from the quarterly, annual, and lifetime AFR charts drive models which did not have at least 45 drives, then we upped that to 60 drives. This was okay, but we realized that we needed to also set a minimum number of drive days over the analysis period to improve our confidence in the AFRs we calculated. To that end, we have set the following thresholds for drive models to be in the primary group.

Review Period | Drive Count per Model | Drive Days per Model
Quarterly     | >100 drives           | >10,000 drive days
Annual        | >250 drives           | >50,000 drive days
Lifetime      | >500 drives           | >100,000 drive days

We will evaluate these metrics as we go along and change them if needed. The goal is to continue to provide AFRs that we are confident are an accurate reflection of the drives in our environment.
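The thresholds above translate directly into a small classification rule. The sketch below is illustrative only: the function and dictionary names are ours, the drive counts in the usage lines are made up, and it uses strict greater-than comparisons as the thresholds are written.

```python
# (minimum drive count, minimum drive days) per review period; both must be exceeded
THRESHOLDS = {
    "quarterly": (100, 10_000),
    "annual": (250, 50_000),
    "lifetime": (500, 100_000),
}


def group_for(period: str, drive_count: int, drive_days: int) -> str:
    """Return 'primary' if a drive model exceeds both thresholds, else 'secondary'."""
    min_drives, min_days = THRESHOLDS[period]
    if drive_count > min_drives and drive_days > min_days:
        return "primary"
    return "secondary"


# Made-up drive counts for demonstration
print(group_for("quarterly", 460, 42_133))  # primary
print(group_for("quarterly", 90, 42_133))   # secondary: too few drives
print(group_for("lifetime", 600, 90_000))   # secondary: too few drive days
```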

The Average Age of Drive Failure Redux

In the Q1 2023 Drive Stats report, we took a look at the average age at which a drive fails. This review was inspired by the folks at Secure Data Recovery who calculated that, based on their analysis of 2,007 failed drives, the average age at which they failed was 1,051 days, or roughly 2 years and 10 months.

We applied the same approach to our 17,155 failed drives and were surprised when our average age of failure was only 2 years and 6 months. Then we realized that many of the drive models that were still in use were older (much older) than the average, and surely when some number of them failed, it would affect the average age of failure for a given drive model. 

To account for this realization, we considered only those drive models that are no longer active in our production environment. We call this collection retired drive models as these are drives that can no longer age or fail. When we reviewed the average age of this retired group of drives, the average age of failure was 2 years and 7 months. Unexpected, yes, but we decided we needed more data before reaching any conclusions.

So, here we are a year later to see if the average age of drive failure we computed in Q1 2023 has changed. Let’s dig in. 

As before, we recorded the date, serial_number, model, drive_capacity, failure, and SMART 9 raw value for all of the failed drives we have in the Drive Stats dataset back to April 2013. The SMART 9 raw value gives us the number of hours the drive was in operation. Then we removed boot drives and drives with incomplete data, that is, drives for which some of the values were missing or wildly inaccurate. This left us with 17,155 failed drives as of Q1 2023.

Over this past year, Q2 2023 through Q1 2024, we recorded an additional 4,406 failed drives. There were 173 drives which were either boot drives or had incomplete data, leaving us with 4,233 drives to add to the previous 17,155 failed drives, for a total of 21,388 failed drives to evaluate.
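Since the SMART 9 raw value is a count of power-on hours, turning it into an age in years and months is simple arithmetic. The sketch below assumes a 365-day year and twelve equal months of 730 hours each, truncating to whole months; the helper names are ours, and the exact conventions used for the tables in this report may differ.

```python
HOURS_PER_YEAR = 365 * 24                 # 8,760
HOURS_PER_MONTH = HOURS_PER_YEAR // 12    # 730


def hours_to_age(power_on_hours: float) -> tuple[int, int]:
    """Convert power-on hours to (whole years, whole months)."""
    years, rest = divmod(int(power_on_hours), HOURS_PER_YEAR)
    return years, rest // HOURS_PER_MONTH


def average_age_of_failure(smart_9_hours: list[float]) -> tuple[int, int]:
    """Average the power-on hours of failed drives, then convert to an age."""
    return hours_to_age(sum(smart_9_hours) / len(smart_9_hours))


# The Secure Data Recovery figure of 1,051 days, expressed in hours
print(hours_to_age(1_051 * 24))  # (2, 10)
```

Applied to 1,051 days, this gives 2 years and 10 months, in line with the figure quoted earlier.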

When we compare Q1 2023 to Q1 2024 we get the table below.

The average age of failure for all of the Backblaze drive models (2 years and 10 months) matches the Secure Data Recovery baseline. The question is, does that validate their number? We say, not yet. Why? Two primary reasons. 

First, we only have two data points, so we don’t have much of a trend; that is, we don’t know if the alignment is real or just temporary. Second, the average age of failure of the active drive models (that is, those in production) is now already higher (2 years and 11 months) than the Secure Data baseline. If that trend were to continue, then when the active drive models retire, they will likely increase the average age of failure of the drive models that are not in production.

That said, we can compare the numbers by drive size and drive model from Q1 2023 to Q1 2024 to see if we can gain any additional insights. Let’s start with the average age by drive size in the table below.

The most salient observation is that for every drive size that had active drive models (green), the average age of failure increased from Q1 2023 to Q1 2024. Given that the overall average age of failure increased over the last year, it is reasonable to expect that some of the active drive size cohorts would increase. With that in mind, let’s take a look at the changes by drive model over the same period. 

Starting with the retired drive models, there were three drive models totalling 196 drives which moved from active to retired from Q1 2023 to Q1 2024. Still, the average age of failure for the retired drive cohort remained at 2 years 7 months, so we’ll spare you from looking at a chart with 39 drive models where over 90% of the data didn’t change from Q1 2023 to Q1 2024.

On the other hand, the active drive models are a little more interesting, as we can see below.

In all but two drive models (highlighted), the average age of failure for each drive model went up. In other words, active drive models are, on average, older when they fail than they were one year ago. Remember, we are testing the average age of the drive failures, not the average age of the drive.

At this point, let’s review. The Secure Data Recovery folks checked 2,007 failed drives and determined their average age of failure was 2 years and 10 months. We are testing that assertion. At the moment, the average age of failure for the retired drive models (those no longer in operation in our environment) is 2 years and 7 months. This is still less than the Secure Data number. But, the drive models still in operation are now hitting an average of 2 years and 10 months, suggesting that once these drive models are removed from service, the average age of failure for the retired drive models will increase. 

Based on all of this, we think the average age of failure for our retired drive models will eventually exceed 2 years and 10 months. Further, we predict that the average age of failure will reach closer to 4 years for the retired drive models once our 4TB drive models are removed from service. 

Annualized Failure Rates for Manufacturers

As we noted at the beginning of this report, the quarterly AFR for Q1 2024 was 1.41%. Each of the four manufacturers we track contributed to the overall AFR as shown in the chart below.

As you can see, the overall AFR for all drives peaked in Q3 2023 and is dropping. This is mostly due to the retirement of older 4TB drives that are further along the bathtub curve of drive failure. Interestingly, all of the remaining 4TB drives in use today are either Seagate or HGST models. Therefore, we expect the quarterly AFR will most likely continue to decrease for those two manufacturers as over the next year their 4TB drive models will be replaced.

Lifetime Hard Drive Failure Rates

As of the end of Q1 2024, we were tracking 279,572 operational hard drives. As noted earlier, we defined the minimum eligibility criteria of a drive model to be included in our analysis for quarterly, annual and lifetime reviews. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q1 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 277,910 drives grouped into 26 models remaining for analysis as shown in the table below.

With three exceptions, the confidence interval for each drive model is 0.5% or less at 95% certainty. For the three exceptions (the 10TB Seagate, the 14TB Seagate, and the 14TB Toshiba models), the occurrence of drive failure from quarter to quarter was too variable over their lifetime. This volatility has a negative effect on the confidence interval.

The combination of a low lifetime AFR and a small confidence interval is helpful in identifying the drive models which work well in our environment. These days we are interested mostly in the larger drives as replacements, migration targets, or new installations. Using the table above, let’s see if we can identify our best 12, 14, and 16TB performers. We’ll skip reviewing the 22TB drives as we only have one model.

The drive models are grouped by drive size, then sorted by their Lifetime AFR. Let’s take a look at each of those groups.

  • 12TB drive models: The three 12TB HGST models are great performers, but are hard to find new. Also, Western Digital, who purchased the HGST drive business a while back, has started using their own model numbers for these drives, so it can be confusing. If you do find an original HGST, make sure it is new as, from our perspective, buying a refurbished drive is not the same as buying a new one.
  • 14TB drive models: The first three models look to be solid—the WDC (WUH721414ALE6L4), Toshiba (MG07ACA14TA), and Seagate (ST14000NM001G). The remaining two drive models have mediocre lifetime AFRs and undesirable confidence intervals. 
  • 16TB drive models: Lots of choice here, with all six drive models performing well to this point, although the WDC models are the best of the best to date.

The Hard Drive Stats Data

It has now been eleven years since we began recording, storing and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored. 

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.

You can download and use the Drive Stats dataset for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q1 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How to Back Up Your Synology NAS to the Cloud | Backblaze

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/synology-cloud-backup-guide/

A decorative image showing a NAS device.
Editor’s note: This post has been updated since it was last published in 2021.

Synology network attached storage (NAS) devices are great for businesses. They enable easy collaboration, speed up restores, make your files accessible 24/7, and give you a level of data protection you probably didn’t have before. Essentially, a NAS device acts as a private cloud, offering centralized access and storage for everything from large files to ongoing projects.

That’s why it’s important to back up your Synology DiskStation to the cloud. While NAS offers a layer of redundancy on-premises if you happen to lose files, it doesn’t fully protect you from things like a natural disaster, a ransomware attack that infiltrates your backups, or multiple hard drive failures. Cloud backups are important for data redundancy and future data recovery, giving you easy access and fast restores.

To keep your data truly safe, the 3-2-1 backup strategy is the industry baseline. Using a 3-2-1 strategy with your NAS means you keep three copies of your data on two different media (like NAS and cloud storage), with one stored off-site. Backing your DiskStation up to the cloud is a great way to achieve that key off-site element. This setup protects against various risks, and ensures your data is available for recovery.
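The 3-2-1 rule is simple enough to express as a quick check. The sketch below is purely illustrative: the `BackupCopy` class and `meets_3_2_1` function are hypothetical names, not part of any Synology or Backblaze tool, and the media labels are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str      # e.g. "workstation", "nas", "cloud"
    offsite: bool


def meets_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 media, with >= 1 off-site."""
    media = {c.medium for c in copies}
    return (
        len(copies) >= 3
        and len(media) >= 2
        and any(c.offsite for c in copies)
    )


local_only = [
    BackupCopy("laptop", "workstation", offsite=False),
    BackupCopy("DiskStation", "nas", offsite=False),
]
with_cloud = local_only + [BackupCopy("cloud bucket", "cloud", offsite=True)]

print(meets_3_2_1(local_only))   # False: two copies, none off-site
print(meets_3_2_1(with_cloud))   # True
```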

In this post, we’ll explain how to implement a 3-2-1 backup strategy for your Synology NAS, the benefits of backing up to cloud storage, options for backing up your DiskStation, and some practical examples of what you can do by pairing your NAS with cloud storage.

Synology NAS and a 3-2-1 Backup Strategy

The 3-2-1 backup strategy is simple and time-tested. If you are using your Synology NAS to connect and back up computers on your network, that’s the first step—you have two local copies of your data on different mediums. You’d accomplish this by creating a multi-version local copy. 

While this setup might seem sufficient, your data is still at risk from NAS device failure. It remains co-located with your primary data, making it vulnerable to disasters or theft. To fully protect your data, you need a third, off-site backup copy.

3-2-1 backup strategy diagram

For your third copy, you could back up your Synology to an external destination: another Synology NAS, a file server, or a USB device. Each has pros and cons, and we’ll talk through them for argument’s sake.

  • Back up to another Synology NAS: If you recently upgraded to a new device, you could store the third copy of your data on your old DiskStation. You get to put the old one to use, and you know it’s compatible. 
  • Back up to a file server: Backing your Synology NAS up to a file server is also an option, but it will take up more storage space for caching than backing up to another DiskStation. 
  • Back up to a USB device: Backing up to a USB device has some limited advantages—the format of your data is readable, so you can plug the USB in anywhere and access your data. However, USB backup won’t back up applications or system files, and it’s a manual rather than an automated process.

With any of these options, you’ll need to physically move your backup device—the old Synology, file server, or USB-connected device—to another location, ideally more than a few miles away, to truly achieve a 3-2-1 backup. 

However, backing up your Synology NAS DiskStation to the cloud means you achieve a 3-2-1 strategy without the need to physically separate your backup copies, giving you both convenience and robust data redundancy.

The Benefits of Backing Up Your Synology DiskStation to the Cloud

In addition to avoiding the lift of a physical move, backing up your Synology NAS to the cloud offers a number of other benefits, including:

  • Avoiding data loss: A cloud backup protects against physical disasters, such as floods, hurricanes, and fires, that could compromise your NAS and data on individual workstations. Because the NAS is always connected to your machines, it’s also at risk of infection from ransomware attacks. And finally, the hard drives in your NAS can fail. Because your NAS is likely set up in a RAID configuration, one drive failure might not affect your data. But, while one drive is down, your data is at a higher risk. If another drive were to fail, you could lose data. Having an off-site backup in cloud storage significantly reduces this risk.
  • Accessibility: With your data in the cloud, your backups are accessible from anywhere. If you’re away from your desk or office and you need to retrieve a file, you can simply log in to your cloud instance and retrieve it remotely.
  • Security: Cloud vendors typically protect customer data by encrypting it as it travels to its final destination and/or when it’s at rest on the vendors’ storage servers. Encryption protocols differ between cloud vendors, so make sure to understand them as you’re evaluating cloud providers, especially if your organization has specific security requirements.
  • Automation: Your Synology NAS comes with built-in backup utilities, so you can configure a backup schedule for automated cloud backups. This saves time and helps keep your backed-up data up to date.
  • Scalability: As your data grows, your cloud backups grow with it. With cloud storage, there’s no need to invest in or maintain additional hardware to ensure your data is properly backed up.
  • Rapid Data Recovery: Cloud storage often offers shorter recovery times than traditional methods, particularly if your NAS device fails or data needs to be restored urgently. Cloud storage solutions can streamline data retrieval, allowing quick access to backed-up files and minimizing downtime.
  • Multi-Cloud Options: Many cloud providers support multi-cloud setups, allowing you to back up your Synology NAS to multiple cloud destinations. This added redundancy can be a valuable safeguard against any single provider outages, helping to ensure continuous data availability.
  • File Versioning: Some cloud storage services support file versioning, which is the ability to keep previous versions of files. This is particularly useful if files are accidentally modified or deleted. It can help you restore earlier versions without losing valuable information.

Options for Backing Up Your Synology NAS

Synology offers various backup utilities and methods to protect your data, each suited to different backup needs and environments.

1. Hyper Backup

Hyper Backup is Synology’s built-in backup utility for backing up to any number of external destinations, including public clouds. It enables you to back up not just data stored on your NAS, but also applications and system configurations.

It offers incremental backups to help you manage your storage footprint. After your initial backup, using incremental backups means only files that have been changed will be updated.

It also offers cross-file deduplication to help you further manage your storage footprint. Hyper Backup allows you to back up to external devices as well as cloud services.
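To make the idea of an incremental backup concrete, here is a minimal Python sketch that decides which files need uploading by comparing content hashes against a manifest saved from the previous run. This is an illustration of the general technique only, not how Hyper Backup works internally; `plan_incremental` and the in-memory `files` dictionary are hypothetical stand-ins for a real file tree.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content hash used to detect whether a file changed."""
    return hashlib.sha256(data).hexdigest()


def plan_incremental(files: dict, manifest: dict):
    """Return (paths to upload, updated manifest).

    files:    {path: file contents as bytes}
    manifest: {path: hash recorded by the previous backup run}
    """
    new_manifest = {path: fingerprint(data) for path, data in files.items()}
    to_upload = sorted(
        path for path, digest in new_manifest.items()
        if manifest.get(path) != digest      # new file or changed contents
    )
    return to_upload, new_manifest


files = {"a.txt": b"hello", "b.txt": b"world"}
first, manifest = plan_incremental(files, {})        # initial full backup

files["b.txt"] = b"world!"                           # modify one file
files["c.txt"] = b"new"                              # add another
second, manifest = plan_incremental(files, manifest)

print(first)    # ['a.txt', 'b.txt']
print(second)   # ['b.txt', 'c.txt']
```

After the first run, only the changed and new files are selected, which is what keeps the storage footprint and upload time down.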

2. Cloud Sync

In addition to Hyper Backup, Synology also offers Cloud Sync, which is useful for those who need real-time collaboration and file syncing capabilities. Keep in mind that sync is not the same as backup: Cloud Sync does not support application and system configuration file backups, and it only keeps the current version of your files. If someone accidentally deletes a file, it’s gone. If you’re not sure whether you’re looking for backup or sync, you can read about the differences between them in this post.

3. Snapshot Replication

If your Synology model supports the Btrfs file system, Snapshot Replication is a bit faster than Hyper Backup for both backups and restores. Snapshot Replication allows you to back up to the same Synology NAS or another Synology NAS, but not to the cloud.

4. USB Copy

USB Copy only copies your data, not applications or system configuration files. It does not support cross-file deduplication, so you might end up with duplicate copies of your files. Additionally, this method is manual, and will require you to be responsible for regular backups as opposed to automating them with Hyper Backup or Snapshot Replication.

Synology NAS box

What You Can Do With Cloud Sync, Hyper Backup, and Cloud Storage

Using Hyper Backup and Cloud Sync together gives you total control over what gets backed up to cloud storage—you can synchronize in the cloud as little or as much as you want. This flexible approach allows you to customize your backup plan and protect your Synology NAS data based on priority and needs.

Here are some practical examples of what you can do with Cloud Sync, Hyper Backup, and cloud storage working together.

1. Sync or Back Up the Entire Contents of Your DiskStation to the Cloud

The DiskStation has excellent fault-tolerance—it can continue operating even when individual drive units fail. However, for comprehensive protection, syncing and backing up the entire DiskStation to cloud storage ensures that your data remains secure during a disaster or system failure.

2. Sync or Back Up Your Most Important Media Files

If you’re storing essential media files—like videos, music, and photos—on your DiskStation, Cloud Sync or Hyper Backup can ensure these valuable files are safely stored in the cloud. Synology NAS offers data redundancy on-premises, but cloud storage provides an additional off-site backup layer for further protection.

3. Back Up Time Machine

For Mac operations, Synology allows the DiskStation to serve as a network-based Time Machine backup. With Hyper Backup, you can synchronize Time Machine files to the cloud so that in the event of a critical failure, your Time Machine backups are securely stored off-site, ready for a seamless restoration.

Ready to Give It a Try?

Hyper Backup allows you to choose from any number of cloud storage providers as a backup destination, and Backblaze B2 Cloud Storage is one of them. If you haven’t given cloud storage a try yet, you can get started now, and make sure your NAS is synced or backed up securely to the cloud.

FAQs About Synology NAS

How do I back up my Synology NAS to the cloud?

Hyper Backup is Synology’s built-in backup utility for backing up to any number of external destinations, including public clouds. It enables you to back up not just data stored on your NAS, but also applications and system configurations. Additionally, it offers cross-file deduplication to help you further manage your storage footprint and avoid duplicates.

What’s the best way to back up my Synology NAS?

Synology offers a lot of options for backing up your device, including to local volumes, external devices, other Synology systems, rsync servers, or public cloud services like Backblaze B2. The best way to back up your Synology NAS depends on many different factors, but the most important thing to remember is that you should follow a 3-2-1 backup strategy. That means keeping three copies of your data on two different media (i.e. devices) with one off-site. Backing up to the cloud is a great option for data redundancy and long-term protection when handling your off-site backups.

Can I schedule automatic cloud backups from my Synology NAS?

Yes, with Hyper Backup, you can set up automatic backups to many public clouds, including Backblaze B2. It offers incremental backups to help you manage your storage footprint. After your initial backup, using incremental backups means only files that have been changed will be updated.

Which cloud storage providers are compatible with Synology NAS for backup?

Synology is compatible with many public cloud providers, including Backblaze B2, Microsoft Azure, Google Cloud Platform, Amazon S3, and Synology C2 Storage.

How much cloud storage space do I need for my Synology NAS backup?

The amount of cloud storage space needed for your Synology NAS backup depends on factors like the total data size, frequency of backups, and retention policies. Calculate your NAS data size, estimate growth, and choose a cloud plan accordingly. Hyper Backup provides storage estimates, helping you select the right amount of cloud storage space for secure, scalable data backups.
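For a rough back-of-the-envelope figure, the factors above can be combined in a few lines of Python. This is a deliberately simple sketch under stated assumptions: one full copy plus retained daily changes, a constant daily change rate, and no savings from compression or deduplication. The function name, parameters, and example numbers are all hypothetical.

```python
def estimate_backup_gb(current_gb, daily_change_rate, retention_days,
                       annual_growth_rate=0.0, years=1):
    """Rough estimate: one full copy plus retained incremental changes."""
    # Project the data set forward by the expected growth.
    projected = current_gb * (1 + annual_growth_rate) ** years
    # Each retained day keeps roughly one day's worth of changed data.
    incrementals = projected * daily_change_rate * retention_days
    return projected + incrementals


# Example: 2,000 GB today, 1% of data changing per day,
# 30 days of retained versions, 20% growth over the next year.
total = estimate_backup_gb(2000, 0.01, 30, annual_growth_rate=0.20)
print(f"Estimated cloud storage needed: {total:.0f} GB")
```

Treat the result as an upper-bound planning number; real usage is usually lower once deduplication and compression are applied.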

The post How to Back Up Your Synology NAS to the Cloud | Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze and Parablu Team Up to Elevate Security For Microsoft 365 Users

Post Syndicated from Anna Hobbs-Maddox original https://backblaze.com/blog/backblaze-and-parablu-team-up-to-elevate-security-for-microsoft-365-users/

A decorative image showing the Backblaze and Parablu logos.

Microsoft 365 (M365) is used by more than one million companies worldwide. If you’re one of them, you know how important it is to your business. And, like anything that’s important to your business, it’s important to back it up. 

Today, backing up M365 to off-site storage just got easier and more affordable thanks to a new Backblaze partnership with Parablu. Now, you can back up your Microsoft 365 data to Backblaze, ensuring it’s backed up inside the Azure ecosystem, outside of it, or both, and adding another layer of protection to your backup and recovery playbook.

What Parablu Does

Parablu specializes in data security and resiliency solutions catered to digital enterprises. Their advanced solutions ensure comprehensive protection for enterprise data while offering complete visibility into all data movement through user-friendly, centrally-managed dashboards. Their product BluVault for M365 elevates data security across Exchange, SharePoint, OneDrive, and Teams.

With Parablu, you can seamlessly control every aspect of your Microsoft 365 data, gain immediate protection against threats with advanced anomaly detection and swift recovery mechanisms for ransomware attacks, streamline administration with intuitive and efficient controls, reduce network congestion, and ensure secure data transmission with robust encryption protocols.

Why Back Up Microsoft 365 to Backblaze?

By integrating Backblaze as a storage tier outside of Azure for tools like M365, OneDrive, or SharePoint, Parablu is providing its customers with cloud storage that’s easy to use, highly affordable at one-fifth the cost of legacy providers, secured with immutable backups, and high-performing with industry-leading small file uploads.

Key benefits for Backblaze + Parablu customers include:

  • Avoiding a Single Point of Failure: Many businesses that use M365 also back up their instance with the same service. However, backup best practices include keeping a backup copy of your data geographically and virtually separate from your production copy. While backing up your M365 data with Microsoft Azure is a great thing to do, it’s wise to keep a backup copy outside of that ecosystem as well. If Microsoft were to experience a failure, you’d still be able to recover your critical business data. 
  • Protecting Data With Immutability: When you protect your M365 data with immutability via Object Lock, you ensure no one can alter or delete that data until a given date. When you set the lock, you can specify the length of time an object should be locked. Any attempts to manipulate, copy, encrypt, change, or delete the file will fail during that time.
  • Faster Small File Uploads: Small file uploads are common for backup and archive workflows, especially when it comes to backing up the kind of data in M365—email, Word documents, simple Excel spreadsheets, etc. With Backblaze, users can expect to see significantly faster upload speeds for smaller files without any change to durability, availability, or pricing. The faster data upload bolsters security and enhances data protection by securing data with off-site backups faster, limiting the time that the data is vulnerable.
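To illustrate the Object Lock idea from the list above, here is a minimal Python sketch that builds the retention parameters for an S3-style upload. It is not how Parablu implements its integration. `ObjectLockMode` and `ObjectLockRetainUntilDate` are the S3 PutObject parameter names; the bucket name, key, and helper function are hypothetical, and the sketch assumes the target bucket has Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone


def object_lock_params(bucket, key, retain_days, now=None):
    """Build S3-style PutObject parameters that lock an object for N days."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


# Hypothetical bucket and key; a fixed "now" keeps the example repeatable.
params = object_lock_params(
    "example-bucket", "backups/mailbox.zip", retain_days=30,
    now=datetime(2024, 6, 1, tzinfo=timezone.utc),
)
print(params["ObjectLockRetainUntilDate"].isoformat())
# prints 2024-07-01T00:00:00+00:00

# Assumption: with an S3 client such as boto3 these would be passed as
# client.put_object(Body=data, **params) against an S3-compatible endpoint.
```

Until the retain-until date passes, requests to delete or overwrite that object version are rejected by the storage service.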

Partnering with Backblaze offers our customers a secure, cost-efficient storage alternative. We’ve witnessed a growing demand for secure, fast, and affordable storage that complements public cloud storage and we look forward to continued innovation with Backblaze.

—Randy De Meno, Chief Strategy Officer/Chief Technology Officer, Parablu

How Backblaze Integrates With Parablu

The Backblaze + Parablu partnership integrates the M365 backup power of Parablu with affordable cloud storage from Backblaze, helping you protect your M365 environment with enhanced security, compliance, and performance. The joint solution is available for customers today.

Interested in getting started? Learn more in our docs or contact Sales.

The post Backblaze and Parablu Team Up to Elevate Security For Microsoft 365 Users appeared first on Backblaze Blog | Cloud Storage & Cloud Backup