Tag Archives: Featured

Backblaze Rides the Nautilus Data Center Wave

Post Syndicated from original https://www.backblaze.com/blog/backblaze-rides-the-nautilus-data-center-wave/

On the outside and on the inside, our newest data center (DC) is more than a little different: there are no cooling towers chugging on the roof, no chillers, or coolants at all. No, we’re not doing a drive stats experiment on how well data centers run at 54° Celsius. This data center, owned and developed by Nautilus Data Technologies, is nice and cool inside. Built with a unique mix of proven maritime and industrial water-cooling technologies that use river water to cool everything inside—racks, servers, switches, and people—this new DC is innovative, environmentally awesome, secure, fascinating, and other such words and phrases, all rolled into one. And it just happens to be located on a barge on a river in California.

It’s a unique setup, one that might raise a few eyebrows. It certainly did for us. But once our team dug in, we didn’t just find room for another exabyte of data, we found an extremely resilient data center that supports our durability requirements and decreases our environmental impact of providing you cloud storage. You can do a deep dive into the Nautilus technology on their website, but of course I needed to make my own visit to look into this shiny new tech on my own. What follows is an overview of what I learned: how the technology works and why we decided to make the Nautilus data center part of our cloud storage platform.


Source

Nautilus Data Center Overview

In the Port of Stockton in California, an odd looking barge is moored next to the shore of the San Joaquin River. If you were able to get close enough, you might notice the massive mooring poles the barge is attached to. And if you were a student of such things, you might recognize these mooring poles as having the same rating as the mooring poles whose attached boats and barges survived hurricane Katrina. The barge isn’t going anywhere.

Above deck are the data center halls. Once inside, it feels like, well, a data center—almost. The power distribution units (PDUs) and other power-related equipment hum quietly and racks of servers and networking gear are lined up across the floor, but there are no hot and cold aisles, and no air conditioning grates or ductwork either. Instead the ceiling is lined with an orderly arrangement of pipes carrying water that’s been cooled by the river outside.

Upriver from the data center, water is collected from the river and filtered before running through the heat exchanger that cools water circulating in a closed loop inside the data center. River water never enters the data center hall.

The technology used to collect and filter the water has been used for decades in power plants, submarines, aircraft carriers, and so on. The entire collection system is marine wildlife-friendly and certified by multiple federal and state agencies, commissions, and water boards, including the California Department of Fish and Wildlife. One of the reasons Nautilus chose the Port of Stockton was the truism that, if you can get something certified for operation in the state of California, then you can typically get it certified pretty much anywhere.


Source

Inside the data center, at specific intervals, water supply and return lines run down to the rear door on each rack. The server fans expel hot air through the rear door and the water inside the door removes the heat to deliver cool air into the room. We use ColdLogik Rear Door Coolers to perform the heat exchange process. The closed loop water system is under vacuum—meaning that it’s leak proof, so water will never touch the servers. A nice bit of innovation by the Nautilus designers and engineers.

Downriver from the data center, the water is discharged. The water can be up to 4° Fahrenheit warmer than when it started upriver. As we mentioned before, the various federal and state authorities worked with Nautilus engineers to select a discharge location which was marine wildlife-friendly. Within seconds of being discharged the water is back to river temperature and continues its journey to the Sacramento Delta. The water spends less than 15 seconds end-to-end in the system which operates with no additional water, uses no chemicals, and adds zero pollutants to the river.

Why Nautilus

For Backblaze, the process of choosing a data center location is a bit more rigorous than throwing a dart at a map and putting some servers there. Our due diligence checklist is long and thorough, taking into consideration redundancy, capacity, scalability, cost, network providers, power providers, stability of the data center owner, and so on. The Nautilus facility passed all of our tests and will enable us to store over an exabyte of data on-site to optimize our operational scalability. In addition, the Nautilus relationship brings us a few additional benefits not traditionally heard of when talking about data centers.

Innovation

Storage Pods, Drive Farming, Drive Stats, and even Backblaze B2 Cloud Storage are all innovations in their own way as they changed market dynamics or defined a different way to do things. They all have in common the trait of putting together proven ideas and technologies in a new way that adds value to the marketplace. In this case, Nautilus marries proven maritime and industrial water cooling and distribution technologies with a new approach to data center infrastructure. The result is an innovative way to use a precious resource to help meet the ever-increasing demand for data storage. This is the kind of engineering and innovation we admire and respect.

Environmental Awesomeness

We can appreciate the environmental impact of the Nautilus data center from two perspectives. The first is obvious: taking a precious resource, river water, and using it to not only lower the carbon footprint of the data center (Nautilus projects by up to 30%), but to also do so without permanently affecting the resource and ecosystem. That’s awesome. The world has been harnessing the power of Mother Nature for thousands of years, yet doing so responsibly has not always been top-of-mind in the process. In the case of Nautilus, the environmental impact is at the top of their list.

The second reason this is awesome is that Nautilus chose to do this in California, coming face-to-face with probably the most stringent environmental requirements in the United States. Almost anywhere else would have been easier, but if you are looking to show your environmental credibility and commitment, then California is the place to start. We commend them for their effort.

Unique Security

Like any well-run data center site, Nautilus has a multitude of industry standard security practices in place: a 24x7x365 security staff, cameras, biometric access, and so on. But the security doesn’t stop there. Being a data center on a barge also means that divers regularly inspect the underwater systems and the barge itself for maintenance and security purposes. In addition, by nature of being a data center on a barge in the Port of Stockton, the data center has additional security: the port itself is protected by the U.S. Department of Homeland Security (DHS) and the waterways are patrolled by the U.S. Coast Guard. This enhanced collection of protective resources is unique for data centers in the U.S., except possibly the kind of data centers we are not supposed to know anything about.

The Manatee in the River

Let’s get to the elephant in the room here: is there risk in putting a data center on a barge in a river? Yes—but no more so than putting one in a desert, or near any body of water, or near a forest, or in an abandoned mine, or near a mountain, or in a city. You get the idea: they all have some level of risk. We’d argue that this new data center—with its decreased reliance on energy and air conditioning and its protection by DHS, among other positives—is quite a bit more reliable than most places the world stores its data. As always, though, we continue to encourage folks to have their data in multiple places.

Still, putting a data center on a river is novel. We’re sure some people will make jokes, and probably pretty funny ones—we’re happy to laugh at our own expense. (It’s certainly happened before.) We are also sure some competitors will use this as part of their sales and marketing—FUD (fear, uncertainty and doubt) as it is called behind your back. We don’t play that game, and, as with our past innovations, we’re used to people sniping a bit when we move out ahead on technology. As always, we encourage you to dig in, get the facts, and be comfortable with the choice you make. Here at Backblaze, we won’t sell you up the river, but we may put your data there.

The post Backblaze Rides the Nautilus Data Center Wave appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Lights, Camera, Custom Action: Integrating Frame.io with Backblaze B2

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/lights-camera-custom-action-integrating-frame-io-with-backblaze-b2/

At Backblaze, we love hearing from our customers about their unique and varied storage needs. Our media and entertainment customers have some of the most interesting use cases and often tell us about their workflow needs moving assets at every stage of the process, from camera to post-production and everywhere in between.

The desire to have more flexibility controlling data movement in their media management systems is a consistent theme. In the interest of helping customers with not just storing their data, but using their data, today we are publishing a new open-source custom integration we have created for Frame.io. Read on to learn more about how to use Frame.io to streamline your media workflows.

What is Frame.io?

Frame.io, an Adobe company, has built a cloud-based media asset management (MAM) platform allowing creative professionals to collaborate at every step of the video production process. For example, videographers can upload footage from the set after each take; editors can work with proxy files transcoded by Frame.io to speed the editing process; and production staff can share sound reports, camera logs, and files like Color Decision Lists.

The Backblaze B2 Custom Action for Frame.io

Creative professionals who use Frame.io know that it can be a powerful tool for content collaboration. Many of those customers also leverage Backblaze B2 for long-term archive, and often already have large asset inventories in Backblaze B2 as well.

What our Backblaze B2 Custom Action for Frame.io does is quite simple: it allows you to quickly move data between Backblaze B2 and Frame.io. Media professionals can use the action to export selected assets or whole projects from Frame.io to B2 Cloud Storage, and then later import exported assets and projects from B2 Cloud Storage back to Frame.io.

How to Use the Backblaze B2 Custom Action for Frame.io

Let’s take a quick look at how to use the custom action:

As you can see, after enabling the Custom Action, a new option appears in the asset context dropdown. Once you select the action, you are presented with a dialog to select Import or Export of data:

After selecting Export, you can choose whether you want just the single selected asset, or the entire project sent to Backblaze B2.

Once you make a selection, that’s it! The custom action handles the movement for you behind the scenes. The export is a point-in-time snapshot of the data from Frame.io—which remains as it was—to Backblaze B2.

The Custom Action creates a new exports folder in your B2 bucket, and then uploads the asset(s) to the folder. If you opt to upload the entire Project, it will be structured the same way it is organized in Frame.io.

How to Get Started With Backblaze B2 and Frame.io

To get started using the Custom Action described above, you will need:

  • A Frame.io account.
  • Access to a compute resource to run the custom action code.
  • A Backblaze B2 account.

If you don’t have a Backblaze B2 account yet, you can sign up here and get 10GB free, or contact us here to run a proof of concept with more than 10GB.

What’s Next?

We’ve written previously about similar open-sourced custom integrations for other tools, and by releasing this one we are continuing in that same spirit. If you are interested in learning more about this integration, you can jump straight to the source code on GitHub.

Watch this space for a follow-up post diving into more of the technical details. We’ll discuss how we secured the solution, made it deployable anywhere (including to options with free bandwidth), and how you can customize it to your needs.

We would love to hear your feedback on this integration, and also any other integrations you would like to see from Backblaze. Feel free to reach out to us in the comments below or through our social channels. We’re particularly active on Twitter and Reddit—let’s chat!

The post Lights, Camera, Custom Action: Integrating Frame.io with Backblaze B2 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Five Misconceptions About Moving From Tape to Cloud

Post Syndicated from Jeremy Milk original https://www.backblaze.com/blog/five-misconceptions-about-moving-from-tape-to-cloud/

There are a lot of pros and cons that go along with using the old, reliable LTO system for your backups. And while the medium still has many adherents, there is a growing movement of folks looking to move beyond magnetic tape, a form of storage technology that has been around since 1928. Technically, it’s the same age as sliced bread.

Those working in IT already know the benefits of migrating from LTO to cloud storage, which include everything from nearly universal ease of access to reduced maintenance, but those who hold the company’s pursestrings might still need convincing. Some organizations delay making a move because of misconceptions about the cost, inconvenience, risk, and security, but they may not have all the details. Let’s explore five top misconceptions about migrating from tape so you can help your team make an informed decision.

Misconception #1 – Total Cost of Ownership is Higher in the Cloud

The first misconception is that moving from a tape-based backup solution to cloud storage is expensive. Using our LTO vs. B2 Cloud Storage calculator, you can enter the amount of existing data you have, the amount of data you add yearly, and your daily incremental data to determine the actual cost savings.

For example, say you have 50TB of existing data, you add 5TB more every year, and your daily incremental backup data is 500GB. If that were the case, you could expect to pay almost 75% less backing up with cloud storage versus tape. The calculator also includes details about the assumptions we used in the computations so you can adjust accordingly. These assumptions include the LTO Backup Model, Data Compression Ratio and Data Retention Policy, as well as a handful of others you can dig into on your own if you’d like to fine tune the math.

Misconception #2 – Migration Costs are Impossible to Manage

We have shown how much more affordable it is to store on the cloud vs. on tape, but what about the costs of moving all of your data? Everyone with a frequently accessed data archive and especially those serving data to end users live in fear of large egress fees. Understandably the idea of paying egress fees for ALL of their data at once can be paralyzing. There is one service available today that pays for your data migration—egress fees, transfer costs, administration, all of it.

The new Universal Data Migration (UDM) service covers data migration fees for customers in US, Canada, Europe storing more than 10TB—including any legacy provider egress fees. The service offers a suite of tools and resources to make moving your data over to cloud storage a breeze, including high speed processes for reading tape media (reel cassettes and cartridges) and transferring directly to Backblaze B2 via a high-speed data connection. This all comes with full solution engineer support throughout the process and beyond. Data is transferred quickly and securely within days, not weeks.

Short story: Even if it might feel like it some days, your data does not have to be held hostage by egress expenses. Migration can be the opposite of a “killer”–it can open your budget for other investments and free your teams to access the data they need whenever they need it.

Misconception #3 – Cloud Storage Is a Security Risk

A topic on everyone’s minds these days is security. It’s reasonable to worry about risks when transitioning from tapes stored on-premises or off-site to the cloud. You can see the tapes on site; they’re disconnected from the internet and locked in a storage space on your property. When it comes to cybercriminals accessing data, you’re breathing easy. Statistics on data breaches and ransomware show that businesses of every size are at risk when it comes to cyberattacks, so this is an understandable stance. But when you look at the big picture, the cloud can offer greater peace of mind across a wide range of risks:

  • Cut Risk by Tiering Data Off Site: Cybercrime is certainly a huge threat, so it’s wise to keep it front of mind in your planning. There are a number of other risk factors that deserve equal consideration, however. Whether you live in an area prone to natural disasters, are headquartered in an older building, or just have bad luck, getting a copy of your data offsite is essential to ensuring you can recover from most disasters.
  • Apply Object Lock for Virtual Air Gapping: Air gaps used to be the big divider between cloud and tape on the security front. But setting immutability through Object Lock means you can set a virtual air gap on all of your cloud data. This functionality is available through Veeam, MSP 360, and a number of other leading backup management software providers. You don’t have to rely on tape to attain object lock.
  • Boost Security without Burdening IT: Cloud storage providers’ full time job is maintaining the durability of the data they hold—they focus 24/7 on maintenance and upkeep so you don’t have to worry about whether your hardware and software are up to date and properly maintained. No need to sweat security updates, patches, or dealing with alerts. That’s your provider’s problem.

Misconception #4 – It’s All or Nothing with Data Migration

For certain industries, regulations require that certain data sets stay on-site. In the past, managing some data on-site and some in the cloud was just too much of a hassle. But hybrid services have come a long way toward making the process smoother and more efficient.

For all of your data that doesn’t have to stay on-site, you could start using cloud storage for daily incremental backups today, while keeping your tape system in place for older archived data. Not only would this save you time not worrying about as many tapes, but you can also restore the cloud-based files instantly if you need to.

Using software from StarWind VTL or Archiware P5, you can start backing up to the cloud instantly and make the job of migrating more manageable.

The Hybrid Approach

If you’re not able to go in on the all-in cloud approach right away, you may want to continue to keep some archived data on tape and move over any current data that is more critical. A hybrid system gives you options and allows you to make the transition on your schedule.

Some of the ways companies execute the hybrid model are:

  • Date Hybrid: Pick a cut-off date; everything after that date is stored in cloud storage and everything before stays on tape.
  • Classic Hybrid: Full backups remain on tape and incremental data is stored in the cloud.
  • Type Hybrid: You might store different data types on tape and other types in the cloud. For example, perhaps you store employee files on tape and customer data in cloud storage.

Regardless of how you choose to break it up, the hybrid model makes it faster and easier to migrate.

Misconception #5 – The Costs Outweigh the Benefits

If you’re going to go through the process of migrating your data from LTO to the cloud—even though we’ve shown it to be fairly painless—you want to make sure there’s an upside, right?

Let’s start with the simple ease of access. With tape storage, the nature of physical media means that access is limited by its nature. You have to be on premises to locate the data you need (no small feat if you have a catalog of tapes to sort through).

By putting all that data in the cloud, you enable instant access to anyone in your organization with the right provisions. This shifts hours of burden from your IT department, helping the organization get more out of the resources and infrastructure they already have.

Bonus Pro-Tip: Use a “Cheat Sheet” or Checklist to Convince Your CFO or COO

When you pitch the idea of migrating from tape to cloud storage to your CFO or COO, you can allay their fears by presenting them with a cheat sheet or checklist that proactively addresses any concerns they might have.

Some things to include in your cheat sheet are basically what we’ve outlined above: First, that cloud storage is not more expensive than tape; it actually saves you money. Second, using a hybrid model, you can move your data over conveniently on your own time. There is no cost to you to migrate your data using our UDM service, and your data is fully protected against loss and secured by Object Lock to keep it safe and sound in the cloud.

Migration Success Stories

Check out these tape migration success stories to help you decide if this solution is right for you.

Kings County, CA

Kings County, California, experienced a natural disaster destroying their tapes and tape drive, prompting an $80,000 price tag to continue backing up critical county data like HIPAA records and legal information. John Devlin, CIO of Kings County, decided it was time for a change. His plan was to move away from capital expenditures (tapes and tape drives) to operating expenses like cloud storage and backup software. After much debate, Kings County decided on Veeam Software paired with Backblaze B2 Cloud Storage for its backup solution, and it’s been smooth sailing ever since!

Austin City Limits

Austin City Limits is a public TV program that has stored more than 4,000 hours of priceless live music performances on tape. As those tapes were rapidly beginning to deteriorate, the company opted to transfer recordings to Backblaze B2 Cloud Storage for immediate and ongoing archiving with real-time, hot access. Utilizing a Backblaze Fireball rapid data ingest tool, they were able to securely back up hours of footage without tying up bandwidth. Thanks to their quick actions, irreplaceable performances from Johnny Cash, Stevie Ray Vaughan and The Foo Fighters are now preserved for posterity.

In Summary

So, we’ve covered that moving your backups to a storage cloud can save your organization time and money, is a fairly painless process to undertake, doesn’t present a higher security risk, and creates important geo-redundancies that represent best practices. Hopefully, we’ve helped clear up those misconceptions and we’ve helped you decide whether migrating from tape to cloud storage makes sense for your business.

The post Five Misconceptions About Moving From Tape to Cloud appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Packing Up and Backing Up: It’s Time for School

Post Syndicated from Josephine Quock original https://www.backblaze.com/blog/packing-up-and-backing-up-its-time-for-school/

Note from the Editor: When students head off to college, they take their data and computers with them. We wanted to dig into what students and other young tech users are doing to save and protect their data, so we invited our intern, Josephine, to develop some resources for you before she heads off to college.

Apart from discovering the outside world and learning more about who you are as a person, the university experience is, at its core, about the pursuit of knowledge. In years past, students would store that knowledge in spiral-bound notebooks or scattered sheets of loose leaf, but today it’s all digital. While there’s no doubt that computers have changed everything from how we research to how assignments are turned in, technology comes with its own set of challenges. I wondered: how much are students taught about how to keep their data safe? My experience at Backblaze has taught me a lot about the importance of backing up your data, and I want to share that with other college students.

Lesson One: Syncing is Not Backing Up

If students do think about how to preserve their data, most are content to let services like iCloud or OneDrive handle it. What they may not realize (and I didn’t) is that these sync services are completely different from a true backup. Before interning at Backblaze, I didn’t even think about backing up, let alone understand the importance of it or the best strategies to back up my data. I couldn’t tell you the difference between backup and sync services. Because my photos and contacts were synced on my iPhone, iPad, and even my MacBook, I assumed all this data was saved and in good hands. However, working at Backblaze taught me the importance of backing up to prevent losing my data.

For example, if I were to delete a photo on my iPhone, syncing means that it would also delete on my other devices. Backing up gives you an additional copy of the data that doesn’t follow the same rules as your syncing services. If I accidentally delete something or get a virus, a backup with version history means that I could restore an earlier version of my files or library. This is why ensuring the safety of your data is proactive, as backing up can provide security in the event of hardware failure, cyber attack, natural disaster, or even a careless mistake.

Lesson Two: You Might Need Your Old Data

Let me provide an example that we college students have experienced at least once throughout our time in university: the frustration and annoyance of endlessly searching through the deepest corners of our devices to find an important document, only to not find it in the end.

A few months ago, I was given the option to select a prompt for an essay of my choosing from a couple of short stories. “The Tell-Tale Heart” by Edgar Allen Poe immediately caught my eye because I had read it and written a paper on it in high school. I enjoyed the story in high school and liked the idea of comparing my new paper with my old to see how much I’d grown as a writer. I browsed and inspected every possible drive on multiple email accounts—no luck.

Looking back, I wished that I had saved all the files and emails from middle and high school. But since I didn’t, I wasn’t able to review what I had written and couldn’t see the progress and improvement in my writing. Had I backed up all the data of my high school years, I could have read my essay on “The Tell-Tale Heart” and expanded on my original ideas. I was warned many times that all my documents from my account would be deleted. I knew this information and yet I still didn’t do anything about it.

Lesson Three: There Are Easy Answers

As a college student, I need a backup service that’s easy to use and affordable. Since Backblaze first started offering consumer backup, they’ve prioritized both of those things. According to an internal Backblaze’s Customer Survey in 2022, users love using the service because of the price, unlimited data backup, and most importantly ease of use. Backups are automatic, and 45% of Backblaze users spend less than $100 per year on all their backup services (including Backblaze).

Now that I will be going into my fourth year in college, I will have other important documents beyond my school work since I am going to be applying for more internships and jobs when I graduate. My resume, job applications, school projects, and essays are all things that I can’t risk losing. Additionally, lots of my photos are saved to social media accounts, and I want to save them somewhere besides my phone or on those platforms.

Lesson Learned!

After college, I plan on purchasing a new computer and will need to back everything up on my current computer to transfer over to my new laptop. Not only will that make sure that I don’t lose data when I change devices, but (rather than buying an external drive that I need to keep track of) I can use my Backblaze account. I will be using multiple devices to access my single Backblaze account. Once I have my new laptop, I just have to log in from my new device and restore all my files. I won’t have to worry about losing an external hard drive or making sure the files on a hard drive are updated.

With school back in session and the end of my university experience approaching, my social media, school work, and job search exist on all my devices. I can’t imagine losing any of it. After interning at Backblaze and hearing so many disaster stories, I will definitely be backing up my data, and I recommend that other students do too. That’s why services like Backblaze are really helpful.

The post Packing Up and Backing Up: It’s Time for School appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Media Workflowing to Europe: IBC 2022 in Amsterdam Preview

Post Syndicated from Jeremy Milk original https://www.backblaze.com/blog/media-workflowing-to-europe-ibc-2022-in-amsterdam-preview/

You can send media in milliseconds to just about every corner of the earth with an origin store at your favorite cloud storage company and a snappy content delivery network (CDN). Sadly, delivering people to Europe is a touch more complicated and time intensive. Nevertheless, the Backblaze team is saddling up planes, trains, and automobiles to bring the latest on media workflows to the attendees of IBC 2022. Whether you’re there in person or virtually, we’ll be discussing and demo-ing all the newest Backblaze B2 Cloud Storage solutions that will ensure your data can travel with ease—no mass transit needed—everywhere you need it to be.

Learn More LIVE in Amsterdam

If you’re attending the IBC 2022 conference in Amsterdam, join us at stand 7.B06 to learn about integrating B2 Cloud Storage into your workflow. Stop by anytime or you can schedule a meeting here. We’d love to see you.

IBC 2022 Preview: What’s New for Backblaze B2 Media Workflow Solutions

Our stand will have all the usual goodness: partners, friendly faces, spots to take a load off and talk about making your data work harder, and, of course, some next-level SWAG. Let’s get into what you can expect.

New Pricing Models and Migration Tools

Our team is on hand to talk you through two new offerings that have been generating a lot of excitement among teams across media organizations:

  • Backblaze B2 Reserve: You can now purchase the Backblaze service many know and love in capacity-based bundles through resellers. If your team needs 100% budget predictability and would like waived transaction fees and premium support included as well, you should check out this new pricing model. Check it out here.
  • Universal Data Migration: This IBC 2022 Best of Show nominee makes it easy and FREE to move data into Backblaze from legacy cloud, on-premises, and LTO/tape origins. If your current data storage is holding your team or your budget back, we’ll pay to free your media and move it to B2 Cloud Storage. Learn more here.

Six Flavors of Media Workflow Deep Dives

We’ve gathered materials and expertise to discuss or demo our six most asked about workflow improvements. We’re happy to talk about many other tools and improvements, but here are the six areas we expect to talk about the most:

  1. Moving more (or all) media production to the cloud. Ensuring everyone—clients, collaborators, employers, everyone—has easy real-time access to content is essential for the inevitable geographical distribution of modern media workflows.
  2. Reducing costs. Cloud workflows don’t need to come with costly gotchas, minimum retention penalties, and/or high costs when you actually want to use your content. We’ll explain how the right partners will unlock your budget so you can save on cloud services and spend more on creative projects.
  3. Streamlining delivery. Pairing cloud storage with the right CDN is essential to make sure your media is consumable and monetizable at the edge. From streaming services to ecommerce outlets to legacy media outlets, we’ve helped every type of media organization do more with their content.
  4. Freeing storage. Empty your expensive on-prem storage and stop adding HDs and tapes to the pile by moving finished projects to always-hot cloud storage. This doesn’t just free up space and money: Instantly accessible archives means you can work with and monetize older content with little friction in your creative process.
  5. Safeguarding content. All those tapes or HDs on a shelf, in the closet, or wherever you keep them are hard to manage and harder to access and use. Parking everything safely and securely in the cloud means all that data is centrally accessible, protected, and available for more use.
  6. Backing up (better!). Yes, we’ve got roots in backup going back >15 years—so when it comes to making sure your precious media is protected with easy access for speedy recovery, we’ve got a few thoughts (and solutions).

Partners, Partners, and More Partners…

“The more we get together, the happier we’ll be,” might as well be the theme lyric of cloud workflows. Combining best of breed platforms unlocks better value and functionality, and offers you the ability to build your cloud stack exactly how you need it for your business. We’ve got a massive ecosystem of integration partners to bring to bear on your challenges, and we’re happy to share our IBC 2022 stand with two incredible members of that group: media management and collaboration company iconik and the cloud NAS platform LucidLink.

We’ll be demoing a brand new, free Backblaze B2 Storage Plug-in for iconik which enables users of Backblaze, iconik, and LucidLink to move files between services in just a click–we’d love to walk you through it.

Hoping We Can Help You Soon

Whether it’s in person at IBC 2022 or virtually when it works for you, we’d love to walk you through any of the solutions we can serve for hardworking media teams. If you will be in Amsterdam, schedule a meeting to ensure you’ll get the right expert on our team, then stick around for the swag and good times. If you’re not making the trip, please reach out to to us here where we can share all of the same information.

The post Media Workflowing to Europe: IBC 2022 in Amsterdam Preview appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

ELEMENTS Media Platform Adds Backblaze B2 as Native Cloud Option

Post Syndicated from Jennifer Newman original https://www.backblaze.com/blog/elements-media-platform-adds-backblaze-b2-as-native-cloud-option/

Cloud workflows are rapidly becoming a driver of every modern media team’s day-to-day creative output. Whether it’s enabling remote shoots, distributed teams, or leveraging budgets more effectively, the cloud can deliver a ton of value to any team. But workflows, by their nature, are complex—and plenty of legacy cloud solutions only add to that complexity with tangled pricing models and limits to egress and collaboration.

ELEMENTS has been simplifying media workflows for more than a decade. Cloud storage has always been part of the ELEMENTS DNA, but they’ll be presenting revolutionary platform updates for cloud workflows in 2023 and beyond at IBC 2022. Part of this new focus on cloud solutions is their addition of an easy, transparent cloud storage option to their platform in Backblaze B2 Cloud Storage. Nimble post-production teams are always on the lookout for more straightforward and easy-to-understand cloud plans with transparent costs—this is a market that Backblaze serves more effectively than the other legacy providers accessible through the platform.

Learn More About This Solution Live in Amsterdam

If you’re attending the 2022 IBC Conference in Amsterdam, join us at stand 7.B06 to learn about integrating B2 Cloud Storage into your workflow. You can schedule a meeting here.

The Backblaze + ELEMENTS Integration

The ELEMENTS platform makes it simple to upload and download files straight from on-premises storage while also offering smart and fully customisable archiving workflows, cloud-based media asset management (MAM), and a number of other tools and features that remove the borders between cloud and on-premises storage. Once connected, ELEMENTS enables users to search, edit, or automate changes to media assets. This extends to team collaboration and setting rights to folders and data across the connected networks. ELEMENTS has provided an intuitive interface making this an easy-to-use solution that is designed for the M&E industry.

Connecting your Backblaze account with ELEMENTS is easy. Simply navigate to the System > Integrations menu and enter your Backblaze login credentials. After this, Backblaze B2 Buckets of the connected account can be mounted as a volume on ELEMENTS.

If you’d like to run a proof of concept with Backblaze, the first 10 GB is free and setting up a Backblaze account only takes a few clicks. Or you can contact sales for more information.

If you’re already a Backblaze B2 customer and would like to check out the ELEMENTS platform, contact ELEMENTS directly here.

Simplifying Your Workflow AND Your Budget

Backblaze focuses on end-to-end ease, including how it works in your budget. Businesses can select a pay-as-you go option or work with a reseller to access capacity plans.

B2 Cloud Storage – This is a general cloud plan for applications, backups and almost all of your business needs. The pricing is simple: $5 per TB per month + $0.01 per GB download fee. As with all plans, the files are located on one storage tier and can always be easily accessed.

B2 Reserve – This is the sweet spot for most media use cases. B2 Reserve is a cloud package starting from 20TB per month. This plan comes at a slightly higher cost than the standard B2 Cloud Storage plan but is free from egress fees up to the amount of storage purchased per month. B2 Reserve will quickly work in your favor if you plan on accessing your files regularly. NOTE: B2 Reserve is only available through resellers.

Top Benefits for Teams Using Backblaze and ELEMENTS Together

The ELEMENTS platform offers a set of robust tools that unlock time and budgets for creative teams to do more. We’ll underline how these different features can work with Backblaze B2.

Automation Engine

The ELEMENTS Automation Engine allows users to create workflows with any number of steps. This tool has a growing list of templates, two of which are Archive and Restore automations. These can be used to archive footage to Backblaze and delete it from the on-premises storage while keeping a lightweight preview proxy. If you need the original footage after previewing the proxy, triggering the Restore automation is all you need to do. The hi-res footage will automatically be downloaded from the Backblaze B2 bucket and placed onto the original location.

A huge benefit of using cloud storage through the ELEMENTS platform is that the individual users do not need to have cloud accounts or direct cloud access. Users will only be able to use the cloud features through the preset automation jobs and according to their permissions.

Media Library

Cloud technologies open up a number of new possibilities within the Media Library, our powerful, browser-based media asset management (MAM) platform.

For example, if your post-production facility has a locally-deployed media library which is running on your ELEMENTS storage and is connected to your Backblaze account, users can playback all of your footage at any time, no matter where it is stored—on-premises, in the cloud, or even in your LTO archive.

The Media Library adds a layer of functionality to the cloud and allows you to easily build a true cloud archive—one that can be accessed from anywhere, in which footage can easily be previewed and just as easily restored with a click of a button.

File Manager

The File Manager is a functionality of the ELEMENTS Web UI that allows you to browse and manage content on your storage on-premises and, very soon, in the cloud. It provides you with a clear overview of all your files, no matter how many file systems and cloud buckets you have. File Managers’ support for cloud storage means users will be able to manage all of their files in one place, without having to navigate through a host of different cloud providers’ interfaces.

ELEMENTS Client

The ELEMENTS Client is an intuitive connection manager that allows admins to decide who gets to mount what—providing a secure gatekeeper to your footage.
The latest function, coming soon to the ELEMENTS Client, will allow users to mount cloud workspaces. This means that users will be able to access the contents of the Backblaze B2 Bucket as if it were a local drive. With optional access logging, users will have the ability to access the cloud-stored content while admins can maintain a high level of security.

Bringing Independent Cloud Storage to the ELEMENTS Platform

Offering B2 Cloud Storage as a native option within the ELEMENTS platform brings a whole new type of cloud offering to Elements’ users. We’re eager to see how creatives use an easier, more affordable, independent option in their workflows.

Learn More About this Solution Live in Amsterdam

If you’re attending the 2022 IBC Conference in Amsterdam, join us at stand 7.B06 to learn about integrating B2 Cloud Storage into your workflow. You can schedule a meeting here.

The post ELEMENTS Media Platform Adds Backblaze B2 as Native Cloud Option appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Announcing Backblaze B2 Object Lock for MSP360: Enhanced Ransomware Protection

Post Syndicated from Jennifer Newman original https://www.backblaze.com/blog/announcing-backblaze-b2-object-lock-for-msp360/

The potential threat of ransomware is something every modern business lives with. It’s a frightening prospect, but it’s a manageable risk with technology that’s readily available on many major platforms. Today, we’re adding one more tool to that list: we’re excited to announce that our long-time partners at MSP360 have now made Backblaze B2 Object Lock functionality available to customers that use B2 Cloud Storage as cloud tier for their backup data.

How Backblaze B2 Object Lock Works with MSP360 Data

Backups are the last line of defense against cyberattacks and accidental data loss. From ransomware attacks to hacking unattended devices, there are plenty of attack vectors available to cybercriminals, not to mention the very real risk of human error. But, when activated by an IT admin in MSP360 Managed Backup 6.0, Backblaze B2 Object Lock provides an additional layer of security to a business’ backups by blocking deletion or modification by anyone (including admins) during a user-defined retention period. Object Lock puts the data in an immutable state—it’s accessible and usable, but it can’t be changed or trashed. For anyone worried about attacks on their last line of defense, this is a huge relief. It’s also increasingly a requirement, as it’s becoming more common to request immutability as proof of compliance for many industries with strict standards.

“Our customers have clients operating in a range of IT environments. With whatever we do, we want to keep that in mind and ensure we provide our customers with options. Offering Backblaze B2 Object Lock to our customers provides them another tool in the fight against ransomware, arguably still cybersecurity’s biggest challenge.”—Brian Helwig, CEO, MSP360

How to Use Object Lock with MSP360 Today

If you’re already using Backblaze B2 as a cloud storage tier for MSP360 and you’re running the latest version, you can choose to enable Object Lock when you create a new bucket. If you’re interested in checking out the joint solution now that Object Lock is enabled, you can learn more here.

“MSP360 has been a long-time partner of Backblaze and continues to impress us with their commitment to delivering a well-rounded platform to customers. We’re very happy to extend Backblaze B2 Object Lock to MSP360’s customers to meet their security, disaster recovery, and cloud storage needs.”—Nilay Patel, Vice President, Sales & Partnerships, Backblaze

Want to Learn More About Object Lock?

Protecting data is one of our favorite things, so, appropriately, we’ve written about the value of Object Lock quite a bit. You can learn more about the basics in this general guide to Object Lock. If you’re interested in how this feature will integrate into your existing security policy, you can read about adding Object Lock to your IT security policy here.

And if you want to hear more from the experts on the subject, register for our webinar, Cybersecurity and the Public Cloud: Cloud Backup Best Practices on September 21. The webinar features John Hill, Cybersecurity Expert; Troy Liljedahl, Director of Solutions Engineering at Backblaze; and David Gugick, VP Product Management at MSP360, and you’ll learn about a few of the common security concerns facing you today as you back up data into the public cloud. If you can’t join us live, the webinar will be available on demand on the Backblaze BrightTALK channel.

We hope these guides can be useful for you, but drop a comment if there’s anything else we can cover.

The post Announcing Backblaze B2 Object Lock for MSP360: Enhanced Ransomware Protection appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Back Up Veeam to the Cloud

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/how-to-back-up-veeam-to-the-cloud/

Backups are your best defense against ransomware and other types of data loss. Thankfully it is quick and easy to back up all your Veeam data to Backblaze B2 Cloud Storage within minutes—and we have the videos to prove it!

What is Veeam?

Veeam is well-respected backup and disaster recovery software that works across many platforms and hardware/software configurations. Founded in 2006, Veeam Software is a U.S.-based company that operates in over 180 countries and has 400,000 customers—many of them Fortune 500 companies.

The Veeam and Backblaze B2 Cloud Storage Integration

Backblaze has partnered with Veeam to deliver the most reliable, affordable, and secure data protection and cloud storage target for your data. Veeam Backup & Replication provides modern data protection for your cloud, virtual, and physical workloads to solve your challenges around backup, recovery, archive, disaster recovery, and ransomware.

With a transparent pricing model that is a fraction of the competitors’ cost, Backblaze B2 Cloud Storage helps you plan your budget effectively and store more than four times the restore points you could otherwise. With Backblaze B2 as your cloud tier storage destination in Veeam, you can store your data for $5/TB per month with no minimum retention requirement, tiers, or hidden fees.

Additionally, Backblaze is certified as Veeam Ready—Object and Veeam Ready—Object with Immutability. Immutability is an important part of protecting backups from threats such as ransomware or stolen credentials because it allows you to protect objects from being changed, deleted, manipulated, copied, or encrypted for a specified, user-defined time period. Even better, Backblaze does not charge an extra fee for the use of the object lock feature.

How Does Veeam Work with Backblaze B2 Cloud Storage?

Backblaze is a proud partner of Veeam and is fully compatible with Veeam Cloud Tier. Using Backblaze B2’s S3-compatible API, you can set B2 Cloud Storage as your Cloud Tier in Veeam’s Scale-Out Backup Repository.

In Veeam v11 and earlier versions, you must first establish the Performance Tier, or Local Repository, before you can set up the Cloud Tier.

If you’ve been using Veeam, you probably already know how to add a local storage repository to Veeam. However, if you are one of our B2 users who are exploring this partnership for the first time, we have a video to guide you through the process. Watch as Greg Hamer, Senior Developer Evangelist, demonstrates how to set up the Local Repository in just a few minutes. If your Local Repository is already configured, then you’re ready to proceed to cloud backup!

Steps to Back Up Your Data with Veeam and Cloud Storage

To make things easy, we have created a video about How to Back Up Veeam to the Cloud. In the video, Greg demonstrates how you can securely store your Veeam data in just 20 minutes.

If you’re not a visual learner, you can easily back up all of your Veeam data to Backblaze’s B2 Cloud Storage using the five easy steps below.

Step 1: Create a Backblaze Account

First, you need to set up a Backblaze account. If you already have a Backblaze account, you’re all set and can move on to step two. Otherwise, visit Backblaze’s Veeam page and click the Start Now button to create one.

The Start Now button will take you to a simple sign-up form where you only have to enter your email address and a password. Don’t worry about setting up billing just yet. You have 10GB of free space to test drive B2 Cloud Storage before you have to set up any billing information.

Once you successfully create a new account, you will create a bucket to store your data in, then collect and save some information from your Backblaze dashboard to use later.

Step 2: Create a Backblaze B2 Bucket and Set Up an Application Key

A “bucket” is a container that holds your files uploaded by your Veeam software to Backblaze B2 Cloud Storage. When configuring your bucket, you will give it a unique name, choose whether it’s private or public (most customers choose private buckets), and turn on Object Lock to secure your files and make them immutable. (This is an important security step you won’t want to miss.)

Each bucket is associated with a name and an S3 Endpoint. You should jot down this Endpoint to use later in Veeam to connect with Backblaze.

Before you exit the Backblaze console, you will set up an Application Key that allows Veeam to connect to and access your storage bucket securely. You give the Application Key a name and make some additional choices to finish setting it up. Finally, you will jot down some details for the Application Key, such as keyID, keyName, and applicationKey, which is essentially a passcode for the key. Be sure to write these down immediately after creating the key, or you won’t be able to access it in plaintext again.

Step 3: Add Backblaze B2 Cloud Storage as a Cloud Tier Repository in Veeam

Switching over to the Veeam console, you will log into your software and create a Cloud Tier repository to interface with Backblaze B2 Cloud Storage.

Before you do that, however, you need to have a local repository created. The tutorial assumes that you have one already and have been using Veeam to backup locally.

To set up your cloud tier, you will follow a few simple steps:

  1. Choose your object storage type.
  2. Give it a name.
  3. Enter your Backblaze S3 Endpoint value.

You will also be prompted to enter your credentials, which is the Application Key information you’ve already set up when you created your Backblaze B2 Bucket. Before exiting that area, Veeam will test the connection to ensure it can reach your Bucket. The final stage in this step allows you to turn on Object Lock to keep your backup files safe.

Step 4: Create the Scale-Out Backup Repository in Veeam

Still working within the Veeam console, you will also set up a scale-out repository to handle backup data load. During this step, you will name your Veeam Scale-Out repository, choose a few options, select the Cloud Tier repository you just created in step three, and ensure that your files are backed up immediately.

Step 5: Create a Backup Job in Veeam

The final stage of our backup tutorial walks you through the process of setting up a backup job. You will continue working in Veeam to create a new backup job using cloud storage. In the video we show you a Virtual Machine backup, but you can create several other types of backup jobs as needed. You can then name your backup job, add the files you want to backup, and choose where you want to save them (in this case, the Scale-Out repository we just created).

You also have options to optimize storage and schedule your backup job to run as often as you like. Then, you can test it immediately to see how it goes.

We hope this video guide and brief explanation were useful in helping you get the most out of both Veeam and Backblaze. If you have thoughts for topics on future videos, sound off in the comments. And be sure to subscribe to our YouTube channel for more great content!

The post How to Back Up Veeam to the Cloud appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Keeping Passwords Secret and Your Data Safe

Post Syndicated from Laura Debney original https://www.backblaze.com/blog/keeping-passwords-secret-and-your-data-safe/


Even if you’ve heard about the 3.27 billion email and password combinations made public on an English language hacker forum in 2020, you may not have heard the worst of it.

Anyone could buy the list for $2 a download.

You’d think that was the scariest part of what became known as the Compilation of Many Breaches (COMB) leak–but it’s not.

The scariest and most preventable part of this breach is that people reuse their passwords and cybercriminals know it.

You may have read Backblaze’s post about credential stuffing attacks. Briefly, it’s a brute force attack using credentials from a list like COMB to unlock an account, most likely with a weak and common password like 123456 or “passwort” (if you’re German).

Is there a better reason than COMB to stop reusing passwords for multiple devices, apps and websites? That alone should do it, but if you need more reasons, they are legion.

There’s no denying how hard it can be to remember 12 to 15 pieces of information in length. After all, our telephone numbers are only seven digits long by design–which happens to be the length of any sequence of numbers we humans can easily recall.

While we each have responsibility for implementing password best practices, the COMB attacks show us that personal password protection isn’t enough. Cloud service providers (CSP) are also responsible for protecting the data entrusted to them. Backblaze uses a sophisticated security approach to protect access.

Regardless, user verification is just more effective with strong passwords, so here are a few tips on keeping your password secret and your data safe.

What Is a Password?

First and foremost, a password is a secret authenticator, according to the National Institute of Standards and Technology (NIST). They have a lot to say about the strength of passwords by complexity, length, and manner of creation. We’ll get into details a little later.

The string of letters, numbers, and symbols used in a password can’t be easy to guess or forget. This is tougher to achieve than it sounds, and typically users choose short, memorable passwords for convenience.

If you want to eliminate the possibility of being the next victim of credential stuffing, here are some things you shouldn’t do:

  1. Use the same password for more than one online account or website.
  2. Recycle or rotate passwords.
  3. Store passwords where anyone can access them (e.g. on a piece of paper or as an autofill setting in your browser).

The reality is that hackers can (and have) released accurate email and password combinations, and the best way to render that old information utterly useless is never to use those passwords again. (Also, stop adding a “2” at the end of “Password1.” You’re not fooling anyone.)

How Does a Strong Password Help Keep Your Data Safe?

You should know that verifying a digital identity with an email and a password isn’t as straightforward as presenting your photo ID at the airport. And, password authentication isn’t as safe as it once was, thanks to cybercrimes resulting in lists like COMB. In light of the security risks hackers pose, new authentication guidelines were created to help CSPs ensure the authenticity of a user.

The Three Authenticators

There are three types of authenticators.

    1. Something you know (e.g. password).
    2. Something you have (e.g. a cellphone).
    3. Something you are (e.g. biometric data).

CSPs or verifiers can employ different combinations of authenticators to achieve different levels of assurance that will ultimately help to reduce successful cybersecurity attacks. The different combinations of authenticators create an authentication assurance level (AAL). For example, when you log in, a CSP might use a combination of your password and a code generated by an authenticator app.

A strong password is essential to each authentication level. Each AAL meets the recommended privacy controls and relies on something that only you know and is difficult to estimate. Cyber criminals only need to guess one reused email and password combination to make a costly mess of things for you or your business.

Strong Passwords Defined

This may sound too simplistic, but the NIST (SP 800-63-3 Appendix A) qualifies a password’s strength by its length. In the case of brute force attacks, shorter passwords are too easy to uncover before rate limitings results in a lockout. Passwords of sufficient length reduce the success of credential stuffing or denial-of-service attacks. The NIST recommends that CSPs allow passwords or password phrases of almost any length so long as it doesn’t demand excessive time to disguise with a salting of random letters, numbers, and hashing algorithm.

The next thing to consider when creating a strong password is complexity. That said, the NIST recommends that password complexity not impede memorability, which would defeat the purpose of using a password to authenticate something you know. Too complex of a password leads people to writing down passwords or storing them in unsafe places rather than forgetting them. This vulnerability has to be addressed when CSPs provide instructions for creating passwords for users.

Unfortunately, analysis of breached data reveals that combining complexity and length isn’t a foolproof deterrent. However, a sufficiently long password is harder to guess, and an adequately complex one will improve the masking efforts like salting and hashing.

One more way to ensure you’re using a strong password is to use a tool to randomly generate one based on a set of standards, like length, type of character, readability, etc.. Randomly produced passwords are harder to brute force attack or guess. While there are a few different places you can find a random password generator, we love password managers like BitWarden, 1Password, and LastPass, which generate, organize, and secure passwords.

Remember that breached data can provide insights into what an old password might be for the same account or similar type because cybercriminals know that we are creatures of habit. Not only that, but some facts are immutable. A great example is when you always select the same challenge question. If that data has been breached, it’s likely known to cybercriminals; also, your mother’s maiden name is not going to change. As another layer, you can use randomly generated answers to security questions the same way you use randomly generated passwords, meaning those answers won’t be able to be reused (or easily gleaned from dumb Facebook quizzes).

Next, let’s get into how you can keep hackers from guessing your strong passwords.

How to Create a Strong Password

To review, strong passwords are long, complex, and secret. These days you can take advantage of a password generator and save it to your password manager. However, there are times you need to come up with a strong password.

Consider these two steps for making a password strong:

Step one: Use a memorable phrase that’s 12 to 15 characters long (e.g. she sells seashells).

Step two: Lightly salt your version with some random characters (e.g. sHe sellz seasHells).

A few ideas for memorable phrases are to use a song lyric, a poetic verse or a line from a movie.

Pro Tip: When you lightly salt your memorable phrase, try not to use @ for the letter ‘a’ or the number zero for the letter ‘o’.

Avoid being predictable. Also, avoid the temptation to use sensitive information like your child’s birthdate or your first and last name or 12345678. Trust me, using this type of information is uber common worldwide.

Another way to check that you’re using a unique password is by culling breached data records. According to Troy Hunt’s pwned.com site, the password Qwerty was used 71,219 times before I typed it into Have I Been Pwned Password API. As Hunt points out, the NIST recommends that CSPs compare user-generated passwords with unacceptable ones. A blocklist should have passwords from previous breaches and predictable options that include the service name, like using the password ‘G000gle’ for your Gmail account.

What Else Can You Do to Protect Against Credential Stuffing Attacks?

In the battle against brute force attacks from hackers that can compute ridiculous numbers of hashes without rate limiting, users play a critical role in protecting your data with strong passwords.

Here are a few other ways to keep your data safer:

The good news is that even as cybercriminals get more ingenious, new and innovative tools have been created to make personal data security easier than ever. And, as data nerds ourselves, Backblaze takes your cybersecurity seriously. Check out some of the ways we secure your data here.

The post Keeping Passwords Secret and Your Data Safe appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Storing and Querying Analytical Data in Backblaze B2

Post Syndicated from Greg Hamer original https://www.backblaze.com/blog/storing-and-querying-analytical-data-in-backblaze-b2/

Note: This blog is the result of a collaborative effort of the Backblaze Evangelism team members Andy Klein, Pat Patterson and Greg Hamer.

Have You Ever Used Backblaze B2 Cloud Storage for Your Data Analytics?

Backblaze customers find that Backblaze B2 Cloud Storage is optimal for a wide variety of use cases. However, one application that many teams might not yet have tried is using Backblaze B2 for data analytics. You may find that having a highly reliable pre-provisioned storage option like Backblaze B2 Cloud Storage for your data lakes can be a useful and very cost-effective alternative for your data analytic workloads.

This article is an introductory primer on getting started using Backblaze B2 for data analytics that uses our Drive Stats as the example of the data being analyzed. For readers new to data lakes, this article can help you get your own data lake up and going on Backblaze B2 Cloud Storage.

As you probably know, a commonly used technology for data analytics is SQL (Structured Query Language). Most people know SQL from databases. However, SQL can be used against collections of files stored outside of databases, now commonly referred to as data lakes. We will focus here on several options using SQL for analyzing Drive Stats data stored on Backblaze B2 Cloud Storage.

It should be noted that data lakes most frequently prove optimal for read-only or append-only datasets. Whereas databases often remain optimal for “hot” data with active insert, update and delete of individual rows, and especially updates of individual column values on individual rows.

We can only scratch the surface of storing, querying, and analyzing tabular data in a single blog post. So for this introductory article, we will:

  • Briefly explain the Drive Stats data.
  • Introduce open-source Trino as one option for executing SQL against the Drive Stats data.
  • Query Drive Stats data both in raw CSV format versus enhanced performance after transforming the data into the open-source Apache Parquet format.

The sections below take a step-by-step approach including details on the performance improvements realized when implementing recommended data engineering options. We start with a demonstration of analysis of raw data. Then progress through “data engineering” that transforms the data into formats that are optimal for accelerating repeated queries of the dataset. We conclude by highlighting our hosted, consolidated, complete Drive Stats dataset.

As mentioned earlier, this blog post is intended only as an introductory primer. In future blog posts, we will detail additional best practices and other common issues and opportunities with data analysis using Backblaze B2.

Backblaze Hard Drive Data and Stats (aka Drive Stats)

Drive Stats is an open-source data set of the daily metrics on the hard drives in Backblaze’s cloud storage infrastructure that Backblaze has open-sourced starting with April 2013. Currently, Drive Stats comprises nearly 300 million records, occupying over 90GB of disk space in raw comma-separated values (CSV) format, rising by over 200,000 records, or about 75MB of CSV data, per day. Drive Stats is an append-only dataset effectively logging daily statistics that once written are never updated or deleted.

The Drive Stats dataset is not quite “big data,” where datasets range from a few dozen terabytes to many zettabytes, but enough that physical data architecture starts to have a significant effect in both the amount of space that the data occupies and how the data can be accessed.

At the end of each quarter, Backblaze creates a CSV file for each day of data, ZIP those 90 or so files together, and make the compressed file available for download from a Backblaze B2 Bucket. While it’s easy to download and decompress a single file containing three months of data, this data architecture is not very flexible. With a little data engineering, though, it’s possible to make analytical data, such as the Drive Stats data set, available for modern data analysis tools to directly access from cloud storage, unlocking new opportunities for data analysis and data science.

Later, for comparison, we include a brief demonstration of performance of the data lake versus a traditional relational database. Architecturally, a difference between a data lake and a database is that databases integrate together both the query engine and the data storage. When data is either inserted or loaded into a database, the database has optimized internal storage structures it uses. Alternatively, with a data lake, the query engine and the data storage are separate. What we highlight below are basics for optimizing data storage in a data lake to enable the query engine to deliver the fastest query response times.

As with all data analysis, it is helpful to understand details of what the data represents. Before showing results, let’s take a deeper dive into the nature of the Drive Stats data. (For readers interested in first reviewing outcomes and improved query performance results, please skip ahead to the later sections “Compressed CSV” and “Enter Apache Parquet.”)

Navigating the Drive Stats Data

At Backblaze we collect a Drive Stats record from each hard drive, each day, containing the following data:

  • date: the date of collection.
  • serial_number: the unique serial number of the drive.
  • model: the manufacturer’s model number of the drive.
  • capacity_bytes: the drive’s capacity, in bytes.
  • failure: 1 if this was the last day that the drive was operational before failing, 0 if all is well.
  • A collection of SMART attributes. The number of attributes collected has risen over time; currently we store 87 SMART attributes in each record, each one in both raw and normalized form, with field names of the form smart_n_normalized and smart_n_raw, where n is between 1 and 255.

In total, each record currently comprises 179 fields of data describing the state of an individual hard drive on a given day (the number of SMART attributes collected has risen over time).

Comma-Separated Values, a Lingua Franca for Tabular Data

A CSV file is a delimited text file that, as its name implies, uses a comma to separate values. Typically, the first line of a CSV file is a header containing the field names for the data, separated by commas. The remaining lines in the file hold the data: one line per record, with each line containing the field values, again separated by commas.

Here’s a subset of the Drive Stats data represented as CSV. We’ve omitted most of the SMART attributes to make the records more manageable.

date,serial_number,model,capacity_bytes,failure,
smart_1_normalized,smart_1_raw
2022-01-01,ZLW18P9K,ST14000NM001G,14000519643136,0,73,20467240
2022-01-01,ZLW0EGC7,ST12000NM001G,12000138625024,0,84,228715872
2022-01-01,ZA1FLE1P,ST8000NM0055,8001563222016,0,82,157857120
2022-01-01,ZA16NQJR,ST8000NM0055,8001563222016,0,84,234265456
2022-01-01,1050A084F97G,TOSHIBA MG07ACA14TA,14000519643136,0,100,0

Currently, we create a CSV file for each day’s data, comprising a record for each drive that was operational at the beginning of that day. The CSV files are each named with the appropriate date in year-month-day order, for example, 2022-06-28.csv. As mentioned above, we make each quarter’s data available as a ZIP file containing the CSV files.

At the beginning of the last Drive Stats quarter, Jan 1, 2022, we were spinning over 200,000 hard drives, so each daily file contained over 200,000 lines and occupied nearly 75MB of disk space. The ZIP file containing the Drive Stats data for the first quarter of 2022 compressed 90 files totaling 6.63GB of CSV data to a single 1.06GB file made available for download here.

Big Data Analytics in the Cloud with Trino

Zipped CSV files allow users to easily download, inspect, and analyze the data locally, but a new generation of tools allows us to explore and query data in situ on Backblaze B2 and other cloud storage platforms. One example is the open-source Trino query engine (formerly known as Presto SQL). Trino can natively query data in Backblaze B2, Cassandra, MySQL, and many other data sources without copying that data into its own dedicated store.

A powerful capability of Trino is that it is a distributed query engine and offers what is sometimes referred to as massively parallel processing (MPP). Thus, adding more nodes in your Trino compute cluster consistently delivers dramatically shorter query execution times. Faster results are always desirable. We achieved the results we report below running Trino on only a single node.

Note: If you are unfamiliar with Trino, the open-source project was previously known as Presto and leverages the Hadoop ecosystem.

In preparing this blog post, our team used Brian Olsen’s excellent Hive connector over MinIO file storage tutorial as a starting point for integrating Trino with Backblaze B2. The tutorial environment includes a preconfigured Docker Compose environment comprising the Trino Docker image and other required services for working with data in Backblaze B2. We brought up the environment in Docker Desktop; alternately on ThinkPads and MacBook Pros.

As a first step, we downloaded the data set for the most recent quarter, unzipped it to our local disks, and then finally reuploaded the unzipped CSV into Backblaze B2 buckets. As mentioned above, the uncompressed CSV data occupies 6.63GB of storage, so we confined our initial explorations to just a single day’s data: over 200,000 records, occupying 72.8MB.

A Word About Apache Hive

Trino accesses analytical data in Backblaze B2 and other cloud storage platforms via its Hive connector. Quoting from the Trino documentation:

The Hive connector allows querying data stored in an Apache Hive data warehouse. Hive is a combination of three components:

  • Data files in varying formats, that are typically stored in the Hadoop Distributed File System (HDFS) or in object storage systems such as Amazon S3.
  • Metadata about how the data files are mapped to schemas and tables. This metadata is stored in a database, such as MySQL, and is accessed via the Hive metastore service.
  • A query language called HiveQL. This query language is executed on a distributed computing framework such as MapReduce or Tez.

Trino only uses the first two components: the data and the metadata. It does not use HiveQL or any part of Hive’s execution environment.

The Hive connector tutorial includes Docker images for the Hive metastore service (HMS) and MariaDB, so it’s a convenient way to explore this functionality with Backblaze B2.

Configuring Trino for Backblaze B2

The tutorial uses MinIO, an open-source implementation of the Amazon S3 API, so it was straightforward to adapt the sample MinIO configuration to Backblaze B2’s S3 Compatible API by just replacing the endpoint and credentials. Here’s the b2.properties file we created:

connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.path-style-access=true
hive.s3.endpoint=https://s3.us-west-004.backblazeb2.com
hive.s3.aws-access-key=
hive.s3.aws-secret-key=
hive.non-managed-table-writes-enabled=true
hive.s3select-pushdown.enabled=false
hive.storage-format=CSV
hive.allow-drop-table=true

Similarly, we edited the Hive configuration files, again replacing the MinIO configuration with the corresponding Backblaze B2 values. Here’s a sample core-site.xml:

<?xml version="1.0"?>
<configuration>

    <property>
        <name>fs.defaultFS</name>
        <value>s3a://b2-trino-getting-started</value>
    </property>


    <!-- B2 properties -->
    <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>true</value>
    </property>

    <property>
        <name>fs.s3a.endpoint</name>
        <value>https://s3.us-west-004.backblazeb2.com</value>
    </property>

    <property>
        <name>fs.s3a.access.key</name>
        <value><my b2 application key id></value>
    </property>

    <property>
        <name>fs.s3a.secret.key</name>
        <value><my b2 application key id></value>
    </property>

    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>

    <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>

</configuration>

We made a similar set of edits to metastore-site.xml and restarted the Docker instances so our changes took effect.

Uncompressed CSV

Our first test validated creating a table and running a query on a single-day CSV data set. Hive tables are configured with the directory containing the actual data files, so we uploaded 2020-01-01.csv from a local disk to data_20220101_csv/2020-01-01.csv in a Backblaze B2 bucket, opened the Trino CLI, and created a schema and a table:

CREATE SCHEMA b2.ds
WITH (location = 's3a://b2-trino-getting-started/');

USE b2.ds;

CREATE TABLE jan1_csv (
    date VARCHAR,
    serial_number VARCHAR,
    model VARCHAR,
    capacity_bytes VARCHAR,
    failure VARCHAR,
    smart_1_normalized VARCHAR,
    smart_1_raw VARCHAR,
    ...
    smart_255_normalized VARCHAR,
    smart_255_raw VARCHAR)
WITH (format = 'CSV',
    skip_header_line_count = 1,
    external_location = '
s3a://b2-trino-getting-started/data_20220101_csv');

Unfortunately, the Trino Hive connector only supports the VARCHAR data type when accessing CSV data, but, as we’ll see in a moment, we can use the CAST function in queries to convert character data to numeric and other types.

Now to run some queries! A good test is to check if all the data is there:

trino:ds> SELECT COUNT(*) FROM jan1_csv;
 _col0  
--------
 206954 
(1 row)

Query 20220629_162533_00024_qy4c6, FINISHED, 1 node
Splits: 8 total, 8 done (100.00%)
8.23 [207K rows, 69.4MB] [25.1K rows/s, 8.43MB/s]
Note: If you’re wondering about the discrepancy between the size of the CSV file–72.8MB–and the amount of data read by Trino–69.4MB–it’s accounted for in the different usage of the ‘MB’ abbreviation. For instance Mac interprets MB as a megabyte, 1,000,000 bytes, while Trino is reporting mebibytes, 1,048,576 bytes. Strictly speaking, Trino should use the abbreviation MiB. Pat opened an issue for this (with a goal of fixing it and submitting a pull request to the Trino project).

Now let’s see how many drives failed that day, grouped by the drive model:

trino:ds> SELECT model, COUNT(*) as failures 
       -> FROM jan1_csv 
       -> WHERE failure = 1 
       -> GROUP BY model 
       -> ORDER BY failures DESC;
       model        | failures 
--------------------+----------
 TOSHIBA MQ01ABF050 |        1 
 ST4000DM005        |        1 
 ST8000NM0055       |        1 
(3 rows)

Query 20220629_162609_00025_qy4c6, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
8.23 [207K rows, 69.4MB] [25.1K rows/s, 8.43MB/s]

Notice that the query execution time is identical between the two queries. This makes sense–the time taken to run the query is dominated by the time required to download the data from Backblaze B2.

Finally, we can use the CAST function with SUM and ROUND to see how many exabytes of storage we were spinning on that day:

trino:ds> SELECT ROUND(SUM(CAST(capacity_bytes AS bigint))/1e+18, 2) FROM jan1_csv;
 _col0 
-------
  2.25 
(1 row)

Query 20220629_172703_00047_qy4c6, FINISHED, 1 node
Splits: 12 total, 12 done (100.00%)
7.83 [207K rows, 69.4MB] [26.4K rows/s, 8.86MB/s]

Although this performance may seem too long running, please note that this is against raw data. What we are highlighting here with Drive Stats data can also be used for querying data in log files. As new records are written on this append-only dataset they immediately appear as new rows in the query. This is very powerful for both real-time and near real-time analysis, and faster performance is easily achieved by scaling out the Trino cluster. Remember, Trino is a distributed query engine. For this demonstration, we have limited Trino to running on just a single node.

Compressed CSV

This is pretty neat, but not exactly fast. Extrapolating, we might expect it to take about 12 minutes to run a query against a whole quarter of Drive Stats data.

Can we improve performance? Absolutely–we simply need to reduce the amount of data that needs to be downloaded for each query!

Commonplace in the world of data analytics are data pipelines, often known as ETL for Extract, Transform, and Load. Where data is repeatedly queried, it is often advantageous to “transform” data from the raw form that it originates in into some format more optimized for the repeated queries that follow through the next stages of that data’s life cycle.

For our next test we will perform an elementary transformation of the data using a lossless compression of the CSV data with Hive’s preferred gzip format, resulting in an 11.7 MB file, 2020-01-01.csv.gz. After uploading the compressed file to data_20220101_csv_gz/2020-01-01.csv.gz, we created a second table, copying the schema from the first:

CREATE TABLE jan1_csv_gz (
	LIKE jan1_csv
)
WITH (FORMAT = 'CSV',
    EXTERNAL_LOCATION = 's3a://b2-trino-getting-started/data_20220101_csv_gz');

Trying the failure count query:

trino:ds> SELECT model, COUNT(*) as failures 
       -> FROM jan1_csv_gz 
       -> WHERE failure = 1 
       -> GROUP BY model 
       -> ORDER BY failures DESC;
       model        | failures 
--------------------+----------
 TOSHIBA MQ01ABF050 |        1 
 ST8000NM0055       |        1 
 ST4000DM005        |        1 
(3 rows)

Query 20220629_162713_00027_qy4c6, FINISHED, 1 node
Splits: 15 total, 15 done (100.00%)
2.71 [207K rows, 11.1MB] [76.4K rows/s, 4.1MB/s]

As you might expect, given that Trino has to download less than ⅙ as much data as previously, the query time fell dramatically–from just over 8 seconds to under 3 seconds. Can we do even better than this?

Enter Apache Parquet

The issue with running this kind of analytical query is that it often results in a “full table scan”–Trino has to read the model and failure fields from every record to execute the query. The row-oriented layout of CSV data means that Trino ends up reading the entire file. We can get around this by using a file format designed specifically for analytical workloads.

While CSV files comprise a line of text for each record, Parquet is a column-oriented, binary file format, storing the binary values for each column contiguously. Here’s a simple visualization of the difference between row and column orientation:

Table representation:

Row orientation:

Column Orientation:


Parquet also implements run-length encoding and other compression techniques. Where a series of records have the same value for a given field the Parquet file need only store the value and the number of repetitions:

The result is a compact file format well suited for analytical queries.

There are many tools to manipulate tabular data from one format to another. In this case, we wrote a very simple Python script that used the pyarrow library to do the job:

import pyarrow.csv as csv
import pyarrow.parquet as parquet

filename = '2022-01-01.csv'

parquet.write_table(csv.read_csv(filename), 
filename.replace('.csv', '.parquet'))

The resulting Parquet file occupies 12.8MB–only 1.1MB more than the gzip file. Again, we uploaded the resulting file and created a table in Trino.

CREATE TABLE jan1_parquet (
    date DATE,
    serial_number VARCHAR,
    model VARCHAR,
    capacity_bytes BIGINT,
    failure TINYINT,
    smart_1_normalized BIGINT,
    smart_1_raw BIGINT,
    ...
    smart_255_normalized BIGINT,
    smart_255_raw BIGINT)
WITH (FORMAT = 'PARQUET',
    EXTERNAL_LOCATION = 
's3a://b2-trino-getting-started/data_20220101_parquet);

Note that the conversion to Parquet automatically formatted the data using appropriate types, which we used in the table definition.

Let’s run a query and see how Parquet fares against compressed CSV:

trino:ds> SELECT model, COUNT(*) as failures 
       -> FROM jan1_parquet 
       -> WHERE failure = 1 
       -> GROUP BY model 
       -> ORDER BY failures DESC;
       model        | failures 
--------------------+----------
 TOSHIBA MQ01ABF050 |        1 
 ST4000DM005        |        1 
 ST8000NM0055       |        1 
(3 rows)

Query 20220629_163018_00031_qy4c6, FINISHED, 1 node
Splits: 15 total, 15 done (100.00%)
0.78 [207K rows, 334KB] [265K rows/s, 427KB/s]

The test query is executed in well under a second! Looking at the last line of output, we can see that the same number of rows were read, but only 334KB of data was retrieved. Trino was able to retrieve just the two columns it needed, out of the 179 columns in the file, to run the query.

Similar analytical queries execute just as efficiently. Calculating the total amount of storage in exabytes:

trino:ds> SELECT ROUND(SUM(capacity_bytes)/1e+18, 2) FROM jan1_parquet;
 _col0 
-------
  2.25 
(1 row)

Query 20220629_163058_00033_qy4c6, FINISHED, 1 node
Splits: 10 total, 10 done (100.00%)
0.83 [207K rows, 156KB] [251K rows/s, 189KB/s]

What was the capacity of the largest drive in terabytes?

trino:ds> SELECT max(capacity_bytes)/1e+12 FROM jan1_parquet;
      _col0      
-----------------
 18.000207937536 
(1 row)

Query 20220629_163139_00034_qy4c6, FINISHED, 1 node
Splits: 10 total, 10 done (100.00%)
0.80 [207K rows, 156KB] [259K rows/s, 195KB/s]

Parquet’s columnar layout excels with analytical workloads, but if we try a query more suited to an operational database, Trino has to read the entire file, as we would expect:

trino:ds> SELECT * FROM jan1_parquet WHERE serial_number = 'ZLW18P9K';
    date    | serial_number |     model     | capacity_bytes | failure
------------+---------------+---------------+----------------+--------
 2022-01-01 | ZLW18P9K      | ST14000NM001G | 14000519643136 |       0
(1 row)

Query 20220629_163206_00035_qy4c6, FINISHED, 1 node
Splits: 5 total, 5 done (100.00%)
2.05 [207K rows, 12.2MB] [101K rows/s, 5.95MB/s]

Scaling Up

After validating our Trino configuration with just a single day’s data, our next step up was to create a Parquet file containing an entire quarter. The file weighed in at 1.0GB, a little smaller than the zipped CSV.

Here’s the failed drives query for the entire quarter, limited to the top 10 results:

trino:ds> SELECT model, COUNT(*) as failures 
       -> FROM q1_2022_parquet 
       -> WHERE failure = 1 
       -> GROUP BY model 
       -> ORDER BY failures DESC 
       -> LIMIT 10;
        model         | failures 
----------------------+----------
 ST4000DM000          |      117 
 TOSHIBA MG07ACA14TA  |       88 
 ST8000NM0055         |       86 
 ST12000NM0008        |       73 
 ST8000DM002          |       38 
 ST16000NM001G        |       24 
 ST14000NM001G        |       24 
 HGST HMS5C4040ALE640 |       21 
 HGST HUH721212ALE604 |       21 
 ST12000NM001G        |       20 
(10 rows)

Query 20220629_183338_00050_qy4c6, FINISHED, 1 node
Splits: 43 total, 43 done (100.00%)
3.38 [18.8M rows, 15.8MB] [5.58M rows/s, 4.68MB/s]

Of course, those are absolute failure numbers; they don’t take account of how many of each drive model are in use. We can construct a more complex query that tells us the percentages of failed drives, by model:

trino:ds> SELECT drives.model AS model, drives.drives AS drives, 
       ->   failures.failures AS failures, 
       ->   ROUND((CAST(failures AS double)/drives)*100, 6) AS percentage
       -> FROM
       -> (
       ->   SELECT model, COUNT(*) as drives 
       ->   FROM q1_2022_parquet 
       ->   GROUP BY model
       -> ) AS drives
       -> RIGHT JOIN
       -> (
       ->   SELECT model, COUNT(*) as failures 
       ->   FROM q1_2022_parquet 
       ->   WHERE failure = 1 
       ->   GROUP BY model
       -> ) AS failures
       -> ON drives.model = failures.model
       -> ORDER BY percentage DESC
       -> LIMIT 10;
        model         | drives | failures | percentage 
----------------------+--------+----------+------------
 ST12000NM0117        |    873 |        1 |   0.114548 
 ST10000NM001G        |   1028 |        1 |   0.097276 
 HGST HUH728080ALE604 |   4504 |        3 |   0.066607 
 TOSHIBA MQ01ABF050M  |  26231 |       13 |    0.04956 
 TOSHIBA MQ01ABF050   |  24765 |       12 |   0.048455 
 ST4000DM005          |   3331 |        1 |   0.030021 
 WDC WDS250G2B0A      |   3338 |        1 |   0.029958 
 ST500LM012 HN        |  37447 |       11 |   0.029375 
 ST12000NM0007        | 118349 |       19 |   0.016054 
 ST14000NM0138        | 144333 |       17 |   0.011778 
(10 rows)

Query 20220629_191755_00010_tfuuz, FINISHED, 1 node
Splits: 82 total, 82 done (100.00%)
8.70 [37.7M rows, 31.6MB] [4.33M rows/s, 3.63MB/s]

This query took twice as long as the last one! Again, data transfer time is the limiting factor–Trino downloads the data for each subquery. A real-world deployment would take advantage of the Hive Connector’s storage caching feature to avoid repeatedly retrieving the same data.

Picking the Right Tool for the Job

You might be wondering how a relational database would stack up against the Trino/Parquet/Backblaze B2 combination. As a quick test, we installed PostgreSQL 14 on a MacBook Pro, loaded the same quarter’s data into a table, and ran the same set of queries:

Count Rows

sql_stmt=# \timing
Timing is on.
sql_stmt=# SELECT COUNT(*) FROM q1_2022;

  count   
----------
 18845260
(1 row)

Time: 1579.532 ms (00:01.580)

Absolute Number of Failures

sql_stmt=# SELECT model, COUNT(*) as failures                                                                                                          FROM q1_2022                                                                                                                                             WHERE failure = 't'                                                                                                                                      GROUP BY model                                                                                                                                           ORDER BY failures DESC                                                                                                                                   LIMIT 10;

        model         | failures 
----------------------+----------
 ST4000DM000          |      117
 TOSHIBA MG07ACA14TA  |       88
 ST8000NM0055         |       86
 ST12000NM0008        |       73
 ST8000DM002          |       38
 ST14000NM001G        |       24
 ST16000NM001G        |       24
 HGST HMS5C4040ALE640 |       21
 HGST HUH721212ALE604 |       21
 ST12000NM001G        |       20
(10 rows)

Time: 2052.019 ms (00:02.052)

Relative Number of Failures

sql_stmt=# SELECT drives.model AS model, drives.drives AS drives,                                                                                      failures.failures,                                                                                                                                       ROUND((CAST(failures AS numeric)/drives)*100, 6) AS percentage                                                                                           FROM                                                                                                                                                     (                                                                                                                                                        SELECT model, COUNT(*) as drives                                                                                                                         FROM q1_2022                                                                                                                                             GROUP BY model                                                                                                                                           ) AS drives                                                                                                                                              RIGHT JOIN                                                                                                                                               (                                                                                                                                                        SELECT model, COUNT(*) as failures                                                                                                                       FROM q1_2022                                                                                                                                             WHERE failure = 't'                                                                                                                                      GROUP BY model                                                                                                                                           ) AS failures                                                                                                                                            ON drives.model = failures.model                                                                                                                         ORDER BY percentage DESC                                                                                                                                 LIMIT 10;
        model         | drives | failures | percentage 
----------------------+--------+----------+------------
 ST12000NM0117        |    873 |        1 |   0.114548
 ST10000NM001G        |   1028 |        1 |   0.097276
 HGST HUH728080ALE604 |   4504 |        3 |   0.066607
 TOSHIBA MQ01ABF050M  |  26231 |       13 |   0.049560
 TOSHIBA MQ01ABF050   |  24765 |       12 |   0.048455
 ST4000DM005          |   3331 |        1 |   0.030021
 WDC WDS250G2B0A      |   3338 |        1 |   0.029958
 ST500LM012 HN        |  37447 |       11 |   0.029375
 ST12000NM0007        | 118349 |       19 |   0.016054
 ST14000NM0138        | 144333 |       17 |   0.011778
(10 rows)

Time: 3831.924 ms (00:03.832)

Retrieve a Single Record by Serial Number and Date

Modifying the query, since we have an entire quarter’s data:

sql_stmt=# SELECT * FROM q1_2022 WHERE serial_number = 'ZLW18P9K' AND date = '2022-01-01';
    date    | serial_number |     model     | capacity_bytes | failure
------------+---------------+---------------+----------------+-------- 
 2022-01-01 | ZLW18P9K      | ST14000NM001G | 14000519643136 | f       (1 row)

Time: 1690.091 ms (00:01.690)

For comparison, we tried to run the same query against the quarter’s data in Parquet format, but Trino crashed with an out of memory error after 58 seconds. Clearly some tuning of the default configuration is required!

Bringing the numbers together for the quarterly data sets. All times are in seconds.

PostgreSQL is faster for most operations, but not by much, especially considering that its data is on the local SSD, rather than Backblaze B2!

It’s worth mentioning that there are yet more tuning optimizations that we have not demonstrated in this exercise. For instance, the Trino Hive connector supports storage caching. Implementing a cache yields further performance gains by avoiding repeatedly retrieving the same data from Backblaze B2. Further, Trino is a distributed query engine. Trino’s architecture is horizontally scalable. This means that Trino can also deliver shorter query run times by adding more nodes in your Trino compute cluster. We have limited all timings in this demonstration to Trino running on just a single node.

Partitioning Your Data Lake

Our final exercise was to create a single Drive Stats dataset containing all nine years of Drive Stats data. As stated above, at the time of writing the full Drive Stats dataset comprises nearly 300 million records, occupying over 90GB of disk space when in raw CSV format, rising by over 200,000 records per day, or about 75MB of CSV data.

As the dataset grows in size, an additional data engineering best practice is to include partitions.

In the introduction we mentioned that databases use optimized internal storage structures. Foremost among these are indexes. Data lakes have limited support for indexes. Data lakes do, however, support partitions. Data lake partitions are functionally similar to what databases alternately refer to as either a primary key index or index-organized tables. Regardless of the name, they effectively achieve faster data retrieval by having the data itself physically sorted. Since Drive Stats is append-only, when sorting on a date field, new records are appended to the dataset.

Having the data physically sorted greatly aids retrieval in cases that are known as range queries. To achieve fastest retrieval on a given query, it is important to only retrieve data that resolves true on the predicate in the WHERE clause. In the case of Drive Stats, for a query on only a single month or several consecutive months we get the fastest time to the result if we can read only the data for these months. Without partitioning Trino would need to do a full table scan, resulting in slower response due to the overhead of reading records for which the WHERE clause logic resolves to false. Organizing the Drive Stats data into partitions enables Trino to efficiently skip records that resolve the WHERE clause to false. Thus with partitions, many queries are far more efficient and incur the read cost only of those records whose WHERE clause logic resolves to true.

Our final transformation required a tweak to the Python script to iterate over all of the Drive Stats CSV files, writing Parquet files partitioned by year and month, so the files have prefixes of the form.

/drivestats/year={year}/month={month}/

For example:

/drivestats/year=2021/month=12/

The number of SMART attributes reported can change from one day to the next, and a single Parquet file can have only one schema, so there are one or more files with each prefix, named

{year}-{month}-{index}.parquet

For example:

2021-12-1.parquet

Again, we uploaded the resulting files and created a table in Trino.

CREATE TABLE drivestats (
    serial_number VARCHAR,
    model VARCHAR,
    capacity_bytes BIGINT,
    failure TINYINT,
    smart_1_normalized BIGINT,
    smart_1_raw BIGINT,
    ...
    smart_255_normalized BIGINT,
    smart_255_raw BIGINT,
    day SMALLINT,
    year SMALLINT,
    month SMALLINT
)
WITH (format = 'PARQUET',
 PARTITIONED_BY = ARRAY['year', 'month'],
      EXTERNAL_LOCATION = 's3a://b2-trino-getting-started/drivestats-parquet');

Note that the conversion to Parquet automatically formatted the data using appropriate types, which we used in the table definition.

This command tells Trino to scan for partition files.

CALL system.sync_partition_metadata('ds', 'drivestats', 'FULL');

Let’s run a query and see the performance against the full Drive Stats dataset in Parquet format, partitioned by month:

trino:ds> SELECT COUNT(*) FROM drivestats;
   _col0   
-----------
296413574 
(1 row)

Query 20220707_182743_00055_tshdf, FINISHED, 1 node
Splits: 412 total, 412 done (100.00%)
15.84 [296M rows, 5.63MB] [18.7M rows/s, 364KB/s]

It takes 16 seconds to count the total number of records, reading only 5.6MB of the 15.3GB total data.

Next, let’s run a query against just one month’s data:

trino:ds> SELECT COUNT(*) FROM drivestats WHERE year = 2022 AND month = 1;
  _col0  
---------
 6415842 
(1 row)

Query 20220707_184801_00059_tshdf, FINISHED, 1 node
Splits: 16 total, 16 done (100.00%)
0.85 [6.42M rows, 56KB] [7.54M rows/s, 65.7KB/s]

Counting the records for a given month takes less than a second, retrieving just 56KB of data–partitioning is working!

Now we have the entire Drive Stats data set loaded into Backblaze B2 in an efficient format and layout for running queries. Our next blog post will look at some of the queries we’ve run to clean up the data set and gain insight into nine years of hard drive metrics.

Conclusion

We hope that this article inspires you to try using Backblaze for your data analytics workloads if you’re not already doing so, and that it also serves as a useful primer to help you set up your own data lake using Backblaze B2 Cloud Storage. Our Drive Stats data is just one example of the type of data set that can be used for data analytics on Backblaze B2.

Hopefully, you too will find that Backblaze B2 Cloud Storage can be a useful, powerful, and very cost effective option for your data lake workloads.

If you’d like to get started working with analytical data in Backblaze B2, sign up here for 10 GB storage, free of charge, and get to work. If you’re already storing and querying analytical data in Backblaze B2, please let us know in the comments what tools you’re using and how it’s working out for you!

If you already work with Trino (or other data lake analytic engines), and would like connection credentials for our partitioned, Parquet, complete Drive Stats data set that is now hosted on Backblaze B2 Cloud Storage, please contact us at [email protected].
Future blog posts focused on Drive Stats and analytics will be using this complete Drive Stats dataset.

Similarly, please let us know if you would like to run a proof of concept hosting your own data in a Backblaze B2 data lake and would like the assistance of the Backblaze Developer Evangelism team.

And lastly, if you think this article may be of interest to your colleagues, we’d very much appreciate you sharing it with them.

The post Storing and Querying Analytical Data in Backblaze B2 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q2 2022

Post Syndicated from original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2022/

As of the end of Q2 2022, Backblaze was monitoring 219,444 hard drives and SSDs in our data centers around the world. Of that number, 4,020 are boot drives, with 2,558 being SSDs, and 1,462 being HDDs. Later this quarter, we’ll review our SSD collection. Today, we’ll focus on the 215,424 data drives under management as we review their quarterly and lifetime failure rates as of the end of Q2 2022. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Lifetime Hard Drive Failure Rates

This report, we’ll change things up a bit and start with the lifetime failure rates. We’ll cover the Q2 data later on in this post. As of June 30, 2022, Backblaze was monitoring 215,424 hard drives used to store data. For our evaluation, we removed 413 drives from consideration as they were used for testing purposes or drive models which did not have at least 60 drives. This leaves us with 215,011 hard drives grouped into 27 different models to analyze for the lifetime report.

Notes and Observations About the Lifetime Stats

The lifetime annualized failure rate for all the drives listed above is 1.39%. That is the same as last quarter and down from 1.45% one year ago (6/30/2021).

A quick glance down the annualized failure rate (AFR) column identifies the three drives with the highest failure rates:

  • The 8TB HGST (model: HUH728080ALE604) at 6.26%.
  • The Seagate 14TB (model: ST14000NM0138) at 4.86%.
  • The Toshiba 16TB (model: MG08ACA16TA at 3.57%.

What’s common between these three models? The sample size, in our case drive days, is too small, and in these three cases leads to a wide range between the low and high confidence interval values. The wider the gap, the less confident we are about the AFR in the first place.

In the table above, we list all of the models for completeness, but it does make the chart more complex. We like to make things easy, so let’s remove those drive models that have wide confidence intervals and only include drive models that are generally available. We’ll set our parameters as follows: a 95% confidence interval gap of 0.5% or less, a minimum drive days value of one million to ensure we have a large enough sample size, and drive models that are 8TB or more in size. The simplified chart is below.

To summarize, in our environment, we are 95% confident that the AFR listed for each drive model is between the low and high confidence interval values.

Computing the Annualized Failure Rate

We use the term annualized failure rate, or AFR, throughout our Drive Stats reports. Let’s spend a minute to explain how we calculate the AFR value and why we do it the way we do. The formula for a given cohort of drives is:

AFR = ( drive_failures / ( drive_days / 365 )) * 100

Let’s define the terms used:

  • Cohort of drives: The selected set of drives (typically by model) for a given period of time (quarter, annual, lifetime).
  • AFR: Annualized failure rate, which is applied to the selected cohort of drives.
  • drive_failures: The number of failed drives for the selected cohort of drives.
  • drive_days: The number of days all of the drives in the selected cohort are operational during the defined period of time of the cohort (i.e., quarter, annual, lifetime).

For example, for the 16TB Seagate drive in the table above, we have calculated there were 117 drive failures and 4,117,553 drive days over the lifetime of this particular cohort of drives. The AFR is calculated as follows:

AFR = ( 117 / ( 4,117,553 / 365 )) * 100 = 1.04%

Why Don’t We Use Drive Count?

Our environment is very dynamic when it comes to drives entering and leaving the system; a 12TB HGST drive fails and is replaced by a 12TB Seagate, a new Backblaze Vault is added and 1,200 new 14TB Toshiba drives are added, a Backblaze Vault of 4TB drives is retired, and so on. Using drive count is problematic because it assumes a stable number of drives in the cohort over the observation period. Yes, we will concede that with enough math you can make this work, but rather than going back to college, we keep it simple and use drive days as it accounts for the potential change in the number of drives during the observation period and apportions each drive’s contribution accordingly.

For completeness, let’s calculate the AFR for the 16TB Seagate drive using a drive count-based formula given there were 16,860 drives and 117 failures.

Drive Count AFR = ( 117 / 16,860 ) * 100 = 0.69%

While the drive count AFR is much lower, the assumption that all 16,860 drives were present the entire observation period (lifetime) is wrong. Over the last quarter, we added 3,601 new drives, and over the last year, we added 12,003 new drives. Yet, all of these were counted as if they were installed on day one. In other words, using drive count AFR in our case would misrepresent drive failure rates in our environment.

How We Determine Drive Failure

Today, we classify drive failure into two categories: reactive and proactive. Reactive failures are where the drive has failed and won’t or can’t communicate with our system. Proactive failures are where failure is imminent based on errors the drive is reporting which are confirmed by examining the SMART stats of the drive. In this case, the drive is removed before it completely fails.

Over the last few years, data scientists have used the SMART stats data we’ve collected to see if they can predict drive failure using various statistical methodologies, and more recently, artificial intelligence and machine learning techniques. The ability to accurately predict drive failure, with minimal false positives, will optimize our operational capabilities as we scale our storage platform.

SMART Stats

SMART stands for Self-monitoring, Analysis, and Reporting Technology and is a monitoring system included in hard drives that reports on various attributes of the state of a given drive. Each day, Backblaze records and stores the SMART stats that are reported by the hard drives we have in our data centers. Check out this post to learn more about SMART stats and how we use them.

Q2 2022 Hard Drive Failure Rates

For the Q2 2022 quarterly report, we tracked 215,011 hard drives broken down by drive model into 27 different cohorts using only data from Q2. The table below lists the data for each of these drive models.

Notes and Observations on the Q2 2022 Stats

Breaking news, the OG stumbles: The 6TB Seagate drives (model: ST6000DX000) finally had a failure this quarter—actually, two failures. Given this is the oldest drive model in our fleet with an average age of 86.7 months of service, a failure or two is expected. Still, this was the first failure by this drive model since Q3 of last year. At some point in the future we can expect these drives will be cycled out, but with their lifetime AFR at just 0.87%, they are not first in line.

Another zero for the next OG: The next oldest drive cohort in our collection, the 4TB Toshiba drives (model: MD04ABA400V) at 85.3 months, had zero failures for Q2. The last failure was recorded a year ago in Q2 2021. Their lifetime AFR is just 0.79%, although their lifetime confidence interval gap is 1.3%, which as we’ve seen means we are lacking enough data to be truly confident of the AFR number. Still, at one failure per year, they could last another 97 years—probably not.

More zeroes for Q2: Three other drives had zero failures this quarter: the 8TB HGST (model: HUH728080ALE604), the 14TB Toshiba (model: MG07ACA14TEY), and the 16TB Toshiba (model: MG08ACA16TA). As with the 4TB Toshiba noted above, these drives have very wide confidence interval gaps driven by a limited number of data points. For example, the 16TB Toshiba had the most drive days—32,064—of any of these drive models. We would need to have at least 500,000 drive days in a quarter to get to a 95% confidence interval. Still, it is entirely possible that any or all of these drives will continue to post great numbers over the coming quarters, we’re just not 95% confident yet.

Running on fumes: The 4TB Seagate drives (model: ST4000DM000) are starting to show their age, 80.3 months on average. Their quarterly failure rate has increased each of the last four quarters to 3.42% this quarter. We have deployed our drive cloning program for these drives as part of our data durability program, and over the next several months, these drives will be cycled out. They have served us well, but it appears they are tired after nearly seven years of constant spinning.

The AFR increases, again: In Q2, the AFR increased to 1.46% for all drives models combined. This is up from 1.22% in Q1 2022 and up from 1.01% a year ago in Q2 2021. The aging 4TB Seagate drives are part of the increase, but the failure rates of both the Toshiba and HGST drives have increased as well over the last year. This appears to be related to the aging of the entire drive fleet and we would expect this number to go down as older drives are retired over the next year.

Four Thousand Storage Servers

In the opening paragraph, we noted there were 4,020 boot drives. What may not be obvious is that this equates to 4,020 storage servers. These are 4U servers with 45 or 60 drives in each with drives ranging in size from 4TB to 16TB. The smallest is 180TB of raw storage space (45 * 4TB drives) and the largest is 960TB of raw storage (60 * 16TB drives). These servers are a mix of Backblaze Storage Pods and third-party storage servers. It’s been a while since our last Storage Pod update, so look for something in late Q3 or early Q4.

Drive Stats at DEFCON

If you will be at DEFCON 30 in Las Vegas, I will be speaking live at the Data Duplication Village (DDV) at 1 p.m. on Friday, August 12th. The all-volunteer DDV is located in the lower level of the executive conference center of the Flamingo hotel. We’ll be talking about Drive Stats, SSDs, drive life expectancy, SMART stats, and more. I hope to see you there.

Never Miss the Drive Stats Report

Sign up for the Drive Stats Insiders newsletter and be the first to get Drive Stats data every quarter as well as the new Drive Stats SSD edition.

➔ Sign Up

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

If you want the tables and charts used in this report, you can download the .zip file from Backblaze B2 Cloud Storage which contains the .jpg and/or .xlsx files as applicable.
Good luck and let us know if you find anything interesting.

Want More Drive Stats Insights?

Check out our 2021 Year-end Drive Stats Report.

Interested in the SSD Data?

Read our first SSD-based Drive Stats Report.

The post Backblaze Drive Stats for Q2 2022 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Get a Clear Picture of Your Data Spread With BackBlaze and DataIntell

Post Syndicated from Jennifer Newman original https://www.backblaze.com/blog/get-a-clear-picture-of-your-data-spread-with-backblaze-and-dataintell/

Do you know where your data is? It’s a question more and more businesses have to ask themselves, and if you don’t have a definitive answer, you’re not alone. The average company manages over 100TB of data. By 2025, it’s estimated that 463 exabytes of data will be created each day globally. That’s a massive amount of data to keep tabs on.

But understanding where your data lives is just one part of the equation. Your next question is probably, “How much is it costing me?” A new partnership between Backblaze and DataIntell can help you get answers to both questions.

What Is DataIntell?

DataIntell is an application designed to help you better understand your data and
storage utilization. This analytic tool helps identify old and unused files and gives better
insights into data changes, file duplication, and used space over time. It is designed to
help you manage large amounts of data growth. It provides detailed, user friendly, and accurate analytics of your data use, storage, and cost, allowing you to optimize your storage and monitor its usage no matter where it lives—on-premises or in the cloud.

How Does Backblaze Integrate With DataIntell?

Together, DataIntell and Backblaze provide you with the best of both worlds. DataIntell allows you to identify and understand the costs and security of your data today, while Backblaze provides you with a simple, scalable, and reliable cloud storage option for the future.

“DataIntell offers a unique storage analysis and data management software which facilitates decision making while reducing costs and increasing efficiency, either for on-prem, cloud, or archives. With Backblaze and DataIntell, organizations can now manage their data growth and optimize their storage cost with these two simple and easy-to-use solutions.
—Olivier Rivard, President/CTO, DataIntell

How Does This Partnership Benefit Joint Customers?

This partnership delivers value to joint customers in three key areas:

  • It allows you to make the most of your data wherever it lives, at speed, and with a 99.9% uptime SLA—no cold delays or speed premiums.
  • You can easily migrate on-premises data and data stored on tape to scalable, affordable cloud storage.
  • You can stretch your budget (further) with S3-compatible storage predictably priced at a fraction of the cost of other cloud providers.

“Unlike legacy providers, Backblaze offers always-hot storage in one tier, so there’s no juggling between tiers to stay within budget. By partnering with DataIntell, we can offer a cost-effective solution to joint customers looking to simplify their storage spend and data management efforts.”
—Nilay Patel, Vice President of Sales and Partnerships, Backblaze

Getting Started With Backblaze B2 and DataIntell

Are you looking for more insight into your data landscape? Contact our Sales team today to get started.

The post Get a Clear Picture of Your Data Spread With BackBlaze and DataIntell appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

What Is Extended Version History?

Post Syndicated from original https://www.backblaze.com/blog/what-is-extended-version-history/

Our recent Backup Awareness Survey showed that 61% of Americans who own a computer and back it up are not very confident that all of their data is being backed up. That just goes to show how complicated some backup solutions are.

And what good is a backup service if it’s hard to get your data back when you need it?

The Backblaze Computer Backup client is designed to stay out of the way and back up your data, while making restoring that data a walk in the park. One of the popular ways of enhancing our backup service is a feature called Extended Version History. But, we’ve found that some people still don’t quite understand what it does. With that in mind, I wanted to write an overview of the feature, how it works, and why it’s useful for anyone that uses our Computer Backup service, whether it’s for personal use or for their company or family group.

Extended Version History Explained

First, we need to define two key terms, “retention” and “version,” to help explain Extended Version History.

What Is Retention?

In simple terms, retention is how long something in your backup is kept backed up.

What Is a Version?

It seems simple enough, but it’s worth explaining what we mean by a “version.” Without getting in the weeds, whenever a file is added or created on your computer, that is a version. Whenever you change a file on your computer, whether you add more lines to a spreadsheet or edit your recent vacation photos, those changes also create another version of the file.

When you understand what retention is and what a version is, it’s easy to understand Extended Version History. It’s a feature that allows you to set a retention timeframe that specifies how long all the older versions—the version history—of your files should be kept as part of your backup.

How Long Is Backblaze’s Retention?

The standard Backblaze Computer Backup service comes with 30 days of Version History for the files that are backed up. This means that you can go back in time (using our roll back time feature) and access older versions of files for 30 days from the date they were last changed or deleted. After that 30-day mark, the version of the file that’s 30 days old will leave your backup, but any newer files will remain.

Note: If you last changed or added a file more than 30 days ago, but have not made any changes to it, it will remain in your backup as long as it remains on your computer (or is unchanged). If it gets removed or changed, that’s when the 30-day retention period starts.

What Does Extended Version History Do?

With Extended Version History, you can increase that 30-day period to one year or even forever. This essentially increases the duration for which you can roll back time when going to access your data.

A Very Simple Example (With Babies!)

As a new uncle, I have babies on the brain. Let’s say that a baby was born on July 1st and our family creates a spreadsheet to chart the growth of the baby. Every single day, we add a row to the spreadsheet to add in the baby’s weight, height, and maybe a cute note. The previous rows don’t get deleted, and so the spreadsheet grows by one row every single day.
On July 30th, our spreadsheet will have 30 rows (one per day). If that spreadsheet was being backed up by Backblaze, I could go back in time to July 1st and get a copy of that spreadsheet from the very first day, with just a single row of baby information. However, if I tried to do that on July 31, that original version would be gone, but I could go back and get a copy from July 2nd, the version with the first two rows of baby data. If I tried to go back on, say, August 30th, I could get a copy from August 1, which would have all of July’s rows of baby data.

To illustrate the point, here’s me as a Soviet baby.

With Extended Version History, using that same example, I have more time (a year or forever) to go back and retrieve that original copy of the spreadsheet created on July 1 with just the first row.

Why would you want the spreadsheet with just one row? Who knows, it’s an example!

Why Extended Version History? Because Mistakes Happen!

Our Backup Awareness Month survey found that 67% of respondents have reported accidentally deleting a file. 44% reported losing data, or access to data, because a shared or synced drive or folder was deleted. Having Extended Version History turned on for your Backblaze backup helps avoid data loss because of accidental deletions.

You may not always realize right away (or within 30 days) that you deleted a file accidentally. Or you may not regularly check that shared drive until it’s too late and your older versions are gone. With Extended Version History, you can go back in time up to a year later or forever later and get those files back.

How to Get Extended Version History?

I encourage everyone I know to enable Extended Version History as soon as they install Backblaze on their computer.
Step 1: Click “Upgrade.”

Step 2: Select how long you want to keep files—one year or forever.

One thing to keep in mind is that simply turning on Extended Version History won’t automatically extend the “life” of your files retroactively. For example, if you open your account on July 1 and enable Extended Version History on July 28, only the versions from July 28 onward will have Extended Version History, not the versions created between July 1 and July 28. Once enabled, any new or changed files will have their retention rate increased, which is why doing so when you first install Backblaze is the best policy.

Consider Extended Version History as getting additional “mistake insurance” for your data. If something happens, or you lose access to shared files and that goes unnoticed, we’ll have your back!

The post What Is Extended Version History? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Hard Drive Life Expectancy

Post Syndicated from original https://www.backblaze.com/blog/hard-drive-life-expectancy/

For the last several years, we have written about drive failure, or more specifically, the annualized failure rates for the hard drives and SSDs we use for our cloud storage platform. In this post, we’ll look at drive failure from a different angle: life expectancy.

By looking at life expectancy, we can answer the question, “How long is the drive I am buying today expected to last?” This line of thinking matches the way we buy many things. For example, knowing that a washing machine has an annualized failure rate of 4% is academically interesting, but what we really want to know is, “How long can I expect the washing machine to last before I need to replace it?”

Using the Drive Stats data we’ve collected since 2013, we have selected 10 drive models that have a sufficient number of both drives and drive days to produce Kaplan-Meier life expectancy curves we can use to easily visualize their life expectancy. Using these life expectancy curves we’ll compare drive models in cohorts of 4TB, 8TB, 12TB, and 14TB to see what we can find.

What Is a Kaplan-Meier Curve?

Kaplan-Meier curves are most often used in biological sciences to forecast life expectancy by measuring the fraction of subjects living for a certain amount of time after receiving treatment. That said, the application of the technique to other fields is not unusual.

Comparing 4TB Drives

The two 4TB drive models we selected for comparison had the most 4TB drives in operation as of March 31, 2022. The Drive Stats for each drive model as of March 31, 2022 is shown below, followed by the Kaplan-Meier curve for each drive.

MFG Model Drives in Operation Lifetime Drive Failures Lifetime Drive Days Lifetime AFR
HGST HMS5C4040BLE640 12,728 343 30,025,871 0.40%
Seagate ST4000DM000 18,495 4,581 68,104,520 2.45%


What Is the Graph Telling Us?

  1. If you purchased an HGST drive at time zero, there is a 97% chance that drive would still be operational after six years (72 months).
  2. If you purchased a Seagate drive at time zero, there is an 81% chance that drive would still be operational after six years.

Case closed—we were stupid to buy any Seagate 4TB drives, right? Not so fast, there are other factors at work here: cost, availability, time, and maintenance, to name a few. For example, suppose I told you that the HGST drive you wanted was 1.2 to 1.5 times as expensive as the Seagate drive. In addition, the Seagate drive was readily available while the HGST drive was harder to get, and finally, at the time of purchase there was over an 80% chance that the Seagate drive would still be alive after six years. How does that change your perception?

In the case of buying one or two drives, you may find a single factor like, “how much do you have to spend” is the only thing that matters. In our case, these factors are intertwined. We explain some of the thinking behind our decision-making in our “How Backblaze Buys Hard Drives” post.

Was It Worth the Savings?

In the simple case, if the time and effort we spent replacing the failed Seagate drives was more than the savings, we failed. So, let’s do a little back-of-the-envelope math to see how we landed.

We replaced a little over 4,200 more Seagate drives over a six year period than HGST drives. That is 700 drives a year or about two Seagate drives per day we had to replace. That’s 30-40 minutes a day someone spent doing that task spread across multiple data centers. Yes, it’s work, but hardly something you would need to hire a person specifically to do.

Why Buy HGST Drives at All?

Fair question. At the time we were purchasing these Seagate and HGST drive models back in 2013 through 2015, there were no life expectancy curves and Drive Stats was just starting. We had anecdotal information that the HGST drives were better, but little else. In short, sometimes, the pricing and availability of the HGST was good enough so we bought them.

Comparing 8TB Drives

The two 8TB drives we’ve chosen to compare using life expectancy curves have done battle before. The 8TB Seagate model: ST8000DM002 is classified as a consumer drive, while the 8TB Seagate model: ST8000NM0055 is classified as an enterprise drive. Their lifetime annualized failure rates tell an interesting story. All data is as of March 31, 2022.

Type Model Drives in Operation Lifetime Drive Failures Lifetime Drive Days Lifetime AFR
Consumer ST8000DM002 9,678 628 19,815,919 1.13%
Enterprise ST8000NM0055 14,323 915 24,999,738 1.35%

Let’s take a look at the life expectancy curves and see what else we can learn.

Observations

  • If you purchased either drive, the life expectancy is nearly the same for early on, but starts to separate at about two years and the difference increases over the next three years.
  • For the consumer model (ST8000DM002) you would expect nearly 95% of the drives to survive five years.
  • For the enterprise model (ST8000NM0055) you would expect 93.6% of the drives to survive five years.

These results seem at odds with the warranties for each model. Consumer drives typically have two-year warranties, while enterprise drives typically have five-year warranties. Yet at five years, the consumer drives, in this case, are more likely to survive, and the trend starts at two years—the end of the typical consumer drive warranty period. It’s almost like we got the data backwards. We didn’t.

Even with this odd difference, both drives performed well. If you wanted to buy an 8TB drive and the salesperson said there would be a 93.6% chance the drive would last five years, well, that’s pretty good. Regardless of the failure rate or life expectancy, there are other reasons to purchase an enterprise class drive, including the ability to tune the drive, tweak the firmware, or get a replacement via the warranty for three more years versus the consumer drive. All are good reasons and may be worth the premium you will pay for an enterprise class drive, but in this case at least, long live the consumer drive.

A Word About Drive Warranties

One of the advantages we get for buying drives in bulk from a manufacturer or one of their top tier resellers is that they will honor the warranty period ascribed to the drive. When you are buying from a retailer (typically an online retailer, but not always), you may find the warranty terms and conditions to be less straightforward. Here are three common situations:

  • The retailer purchases the drive or takes the drive on consignment from the manufacturer/distributor/reseller/etc., and that event triggers the start of the manufacturer warranty. When you buy the drive six months later, the warranty is no longer “X” years, but “X” years minus six months.
  • The retailer replaces the warranty with their own time period. While this is usually done for refurbished drives, we have seen this done by online retailers for new drives as well. In one case we saw, the original five-year warranty period was reduced to one year.
  • The retailer is only a storefront while the actual seller is different. At that point, determining the warranty period and who services the drive can be, shall we say, challenging. Of course, you can always buy the add-on warranty that’s offered—it’s always nice to pay for something that was supposed to be included.

As a drive model gets older, these types of shenanigans are more likely to happen. For example, a given drive model gathers dust awaiting shipment while new models are coming to market at competitive prices. The multiple players on the path from a drive’s manufacture to its eventual sale are looking for ways to “move” these aging drives along that path. One option is to lower or eliminate the warranty period to help reduce the cost of the drive. The warranty becomes a casualty of the supply chain and you, as the last buyer, are left with the results.

Comparing 12TB Drives

If you are serious about storing copious amounts of data, you’re probably looking at 12TB drives and higher. Your Plex media server or eight-bay NAS system demands nothing less. To that end, we selected three 12TB models for which we have at least two years worth of data to base our life expectancy curves upon. The Drive Stats data for these three drives is as of March 31, 2022.

MFR Model Drives in Operation Lifetime Drive Failures Lifetime Drive Days Lifetime AFR
HGST HUH721212ALN604 10,813 148 11,813,149 0.48%
Seagate ST12000NM001G 12,269 104 6,166,144 0.63%
Seagate ST12000NM0008 20,139 449 14,802,577 1.12%


Observations and Thoughts

For any of the three models, at least 98% of the drives are expected to survive two years. I suspect that most of us would take that bet. While none of us wants to own the one or two drives out of 100 that will fail in that two years period, we know there are no 100% guarantees when it comes to hard drives.

That brings us to asking: What is the cost of each drive, and would that affect the buying decision? As we’ve noted previously, we buy in bulk and the price we pay is probably not reflective of the price you may pay in the consumer market. To that end, below are the current prices, via the Amazon website, for the three drive models. We’ve assumed that these are new drives and they have the same warranty coverage of five years.

  • HUH721212ALN604 – $413
  • ST12000NM001G – $249
  • ST12000NM0008 – $319

The Seagate model: ST12000NM001G and the HGST model: HUH721212ALN604 have about the same life expectancy after two years, but their price is significantly different. Which one do you buy today? If you are expecting the drive to last (i.e., survive) two years, you would select the Seagate drive and save yourself $164, plus tax. Some of you will disagree, and given we know nothing beyond the two-year point for the Seagate drive, you may be right. Time will tell.

One thing that may be perplexing here is why the Seagate model: ST12000NM0008 is more expensive than the Seagate model: ST12000NM001G even though the ST12000NM008 fails more often and has a lower life expectancy after two years? The reason is simple: Drive pricing is basically driven by supply and demand. We suspect that annualized failure rates and life expectancy curves are not part of the pricing math done by the various companies (manufacturers/distributors/resellers/etc.) along the supply chain.

By the way, if you purchase the 12TB HGST drive, it may say Western Digital (WDC) on the label. For the first couple of years when these drives were produced, they had HGST on the label, but that changed somewhere in the last couple of years. In either case, both “versions” report as HGST drives and have the same model number, HUH721212ALN604. The new Western Digital label is part of the continuing rebranding effort being done by WDC to update the HGST assets they purchased a few years back.

Comparing 14TB Drives

We will finish up our look at hard drive life expectancy curves with three models from our collection of 14TB drives. While the data ranges from 14 to 41 months depending on the drive model, this is the one cohort where we have comparable data on drives from all three of the major manufacturers: Seagate, Toshiba, and WDC. The Drive Stats data is below, followed by the life expectancy curves for the same models.

MFR Model Drives in Operation Lifetime Drive Failures Lifetime Drive Days Lifetime AFR
Toshiba MG07ACA14TA 38,210 454 19,834,886 0.83%
Seagate ST14000NM001G 10,734 123 4,474,417 1.00%
WDC WUH721414ALE6L4 8,268 35 3,941,427 0.33%


Observations and Thoughts

All three drives have a life expectancy of 99% or more after one year. Previously, we examined the bathtub curve for drive failure and made the observation that the early mortality rate for hard drives, those that failed during their first year in operation, was now nearly the same as the random failure rate. That seems to be the case for this collection of drives as the observed early mortality effect is nominal.

When considering the bathtub curve, the Toshiba model seems to be an outlier beginning at 22 months. At that point, the downward curvature in the line suggests an accelerating failure rate when the failure rate should be steady, as seen below.

The projected life expectancy curve line is derived by extending the random failure rate from the first 22 months. That said, 97% of the Toshiba drives survived for three years while the projected number was 98%, or simply put, the failure rate was one drive per hundred more over a three-year period.

Interested in More Drive Stats Insights?

Physical disk drives remain essential elements of business and personal tech. That’s why Backblaze publishes performance data and analysis on 200,000+ HDDs: to offer useful insights into how different drive models stack up in our data center. As SSDs increasingly become the norm in many computers and servers, Backblaze is now also sharing data for the thousands of SSDs we use as boot drives.

Methodology

The raw data comes from the Backblaze Drive Stats data and is based on the raw value of SMART attribute 9 (power on hours) for a defined cohort of drives. After removing outliers, we basically compared the number of drives which failed after a specific number of months versus the number of drives which managed to survive that many months. The math is absolutely more complex than that and I want to thank Dr. Charles Zaiontz, Ph.D. for providing an excellent tutorial on Kaplan-Meier curves and, more specifically, how to use Microsoft Excel to do the math.

Refresher: What Are SMART Stats?

SMART stands for Self-monitoring, Analysis, and Reporting Technology and is a monitoring system included in hard drives that reports on various attributes of the state of a given drive. Each day, Backblaze records the SMART stats that are reported by the hard drives we have in our data centers. Check out this post to learn more about SMART stats and how we use them.

Standing on the Shoulders

Using our Drive Stats data in combination with Kaplan-Meier curves has been done previously in various forms by others including Ross Lazarus, Simon Erni, and Tom Baldwin. We thank them for their collective efforts and for providing us with the inspiration to produce the current curves that enabled the comparisons we did in this post.

The post Hard Drive Life Expectancy appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Server Backup 101: Disaster Recovery Planning

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/server-backup-101-disaster-recovery-planning/

In any business, time is money. What may shock you is how much money that time is actually worth. According to Gartner, the average cost of one hour of downtime for a business is roughly $300,000. That’s $5,600 a minute. Multiply that out by the amount of time it takes to recover from data theft, sabotage, or a natural disaster, and you could easily be looking at millions of dollars in lost revenue. That is, unless you’ve planned ahead with an effective disaster recovery plan.

Even one hour of lost time due to a cyberattack or natural disaster could adversely affect your business operations. Read on to learn how to develop an effective disaster recovery plan so you can quickly rebound no matter what happens, including:

  • Knowing what a disaster recovery plan is and why you need it.
  • Developing an effective strategy.
  • Identifying key roles.
  • Prioritizing business operations and objectives.
  • Deploying backups.

What Is a Disaster Recovery Plan?

A disaster recovery plan is made up of resources and processes that a business can use to restore apps, data, digital assets, equipment, and network operations in the event of any unplanned disruption.

Events such as natural disasters (floods, fires, earthquakes, etc.), theft, and cybercrime often interrupt business operations or restrict access to data. The goal of a disaster recovery plan is to get back up and running as quickly and smoothly as possible.

Some companies will choose to write their own disaster recovery plans, while others may contract with a managed service provider (MSP) specializing in disaster recovery as a service (DRaaS). Either way, crafting a disaster recovery plan that covers you for any contingency is crucial.

Why Do You Need a Disaster Recovery Plan?

A disaster recovery plan is not just a good idea, it is an essential component of your business. Cybercrime is on the rise, targeting small and medium-sized businesses just as often as large corporations. According to Cybersecurity Magazine, 43% of recent data breaches affected small and medium-sized businesses. Additionally, you could be cut off from your data by power outages, hardware failure, data corruption, and natural occurrences that restrict IT workflows. So, why do you need a disaster recovery plan? A few key benefits rise to the top:

  • Your disaster recovery plan will ensure business continuity in the case of a disaster. Imagine the confidence of knowing that no matter what happens, your business is prepared and can continue operations seamlessly.
  • An effective disaster recovery plan will help you get back up and running faster and more efficiently.
  • The plan also helps to communicate to your entire team, from top to bottom, what to do in the event of an emergency.

Writing a Disaster Recovery Plan: What Should Your Disaster Recovery Plan Include?

A solid disaster recovery plan should include five main elements, which we’ll detail below:

  1. An effective strategy.
  2. Key team members who can carry out the plan.
  3. Clear objectives and priorities.
  4. Solid backups.
  5. Testing protocols.

An Effective Strategy

One of the most critical aspects of your disaster recovery plan should be your strategy. Typically, the details of a disaster recovery plan include steps for prevention, preparation, mitigation, and recovery. Think about both the big picture and fine details when putting together the pieces.

Disaster Recovery Planning Case Study: Santa Cruz Skateboards

Santa Cruz Skateboards safeguarded decades worth of data with a disaster recovery plan and backups to prevent loss from the threat of tsunamis on the California coast. Read more about how they did it.

Some tips for creating an effective strategy include:

  • Identify possible disasters. Consider the types of disasters your business may encounter and design your plan around those. Every business is susceptible to cybercrime, which should be a significant component of your plan. If your business is located in a disaster prone location, let that dictate your plan objectives.
  • Plan for “minor” disasters. A “major” disaster like an earthquake could take out the entire office and on-premises infrastructure, but “minor” disasters can also be disruptive. Good employees make mistakes and delete things, and bad employees sometimes make worse mistakes. A disaster recovery plan protects you from those “minor” disasters as well.
  • Create multiple disaster recovery plans. You may need to create different versions of your disaster recovery plan based on specific scenarios and the severity of the disaster. For example, you may need a plan that responds to a cyberattack and restores data quickly, while another plan may deal with hardware destruction and replacement rather than data restoration.
  • Plan from your recovery backward. Think about what you need to accomplish with your disaster recovery and plan your backup routine to support it. Then, after your plan is written, go back and ensure that your backup routine follows the plan initiatives and accomplishes the goals in an acceptable time frame.
  • Develop KPIs. Include critical key performance indicators (KPIs) in the plan, such as a recovery time objective (RTO) and recovery point objective (RPO). RTO refers to how quickly you intend to restore your systems after a disaster, and RPO is the maximum amount of data loss you can safely incur.

Establish the Key Team Members and Their Roles and Hierarchy

Another crucial component of your disaster recovery plan is identifying key team members to carry out the instructions. You must clearly define roles and hierarchy for effectiveness. Consider the following when building your disaster recovery team:

  • Communicate roles and hierarchy. Ensure that each team member knows their role in the plan and understands where they land in the hierarchy. Build in redundancy if a major player is unavailable.
  • Develop a master contact list. Create a master list with updated contact information for each team member and update it regularly as things change. Be sure the list includes everyone’s cell phone and landline numbers (if applicable) and emergency contacts for each person. Don’t assume you will have working internet and consider alternative ways to reach critical team members in the middle of the night.
  • Plan on how to manage your team. Think about how you will stay organized and manage your team to function 24/7 until you resolve the disaster.

Prioritize Business Operations and Objectives

Another important aspect of your disaster recovery plan is prioritizing business operations and objectives and crafting your plan around those.

Identify the most critical aspects of the business that need to be restored first. Then, focus on those and leave the less essential things until later. Understand that it is not feasible to restore everything at once. Instead, you must prioritize the most critical business areas and get those up and running and then, other, less crucial parts of the system. Detail these priorities in your plan so that no one wastes time on nonessential operations.

Know How to Deploy Your Backups

Backups should be a routine function for your organization, and you should know them inside and out. Be sure to familiarize yourself with every aspect of the backup process, including where data is stored, how recent it is, and how to restore it at a moment’s notice.

Having a reliable backup plan could save your business. You don’t want to waste precious time figuring out where the latest backup is, where it’s stored (whether that’s locally or on the cloud), or how to access it. Off-site cloud storage is a safe, reliable way to store and retrieve your data, especially in the event of a disaster.

Practice restoring your backups regularly to test their viability. Document the process for restoring in case you are unavailable and someone else has to take over. Data restoration should be a central part of your disaster recovery plan. Remember, backups are not your entire disaster recovery plan but only a piece of the overall system.

Foolproof Your Plan With Disaster Recovery Testing

The best-laid plans don’t always work out. Therefore, it’s essential that you foolproof your disaster recovery plan by testing it regularly (once a year, or every six months, whatever works for you). You don’t have to experience a real catastrophe; you can simulate what a disaster would look like and run through the entire process to ensure everything works as expected. Some disaster recovery testing best practices include:

  • Planning for the worst-case scenario. Think about things like access to a car, how you will get to the office, and how you will access your backups if they are stored online and you don’t have internet? Prepare by having multiple alternate plans (A, B, C, etc.). Remember, disasters come in all shapes and sizes so, be prepared to think outside the box. When the COVID-19 pandemic started, businesses had to scramble to adjust. Prepare for anything, even minor disruptions or cut-offs from resources you rely on.
  • Securing resources in advance. If you need resources to make it work, such as budgetary funds, software, hardware, or services, get those approved now so you’re not stuck provisioning necessary resources in the middle of a disaster.
  • Regularly reviewing and updating your disaster recovery plan as things change. Team members come and go, so schedule routine updates every three to six months to ensure that everything is up to date and viable.
  • Distributing copies of your disaster recovery plan. All staff members, including executives, should have a copy of your plan, and you should clearly communicate how it works and what everyone’s responsibility is.
  • Conducting post mortems after training and simulations (or a real disaster) to determine what works and what doesn’t. Make changes to your plan accordingly.

Don’t wait until a disaster occurs before writing your disaster recovery plan. A disaster recovery plan is an ever-evolving process you must maintain as the business changes and grows so you can face anything that the future brings.

Disaster Recovery, Done.

Ready to check disaster recovery off your list? Check out our Instant Recovery in Any Cloud solution that you can use as part of your disaster recovery plan. You can run a single command to instantly see your servers, data, firewalls, and network storage. Get back up and running as soon as possible with minimal disruption and expense to your business.

The post Server Backup 101: Disaster Recovery Planning appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Level Up Your Backup Game With Backblaze and Veritas

Post Syndicated from Jennifer Newman original https://www.backblaze.com/blog/level-up-your-backup-game-with-backblaze-and-veritas/

If you’re using Veritas’ Backup Exec, you’re already ahead of the game when it comes to your data backup and recovery approach. Now, you can power up your backup strength by adding Backblaze B2 Cloud Storage as a destination for your Backup Exec data.

The joint solution offers easy, affordable, S3-compatible object storage to customers who use Backup Exec to streamline their data backup and recovery approach. Read on to learn more about this partnership and how it can benefit your business.

Backblaze and Veritas: In Person at VMware Explore 2022

Want to learn more? Stop by booth 1501 at VMware Explore 2022, August 29–September 1. You can visit with our technical experts, see demos, and learn more about how Backblaze B2 Cloud Storage seamlessly integrates with Veritas Backup Exec. Or, schedule a meeting using this link to talk about solutions tailored to your business needs.

What Is Veritas Backup Exec?

The Veritas Backup Exec service helps businesses protect almost any data on any storage device—tape, servers, or in the cloud. By unifying backup management in one panel, it creates a simple, customizable approach to managing backups and orchestrating recovery that can benefit everyone from the local bakery, to a regional school district, to large government units.

How Does Backblaze Integrate With Backup Exec?

Customers have regularly asked for Backblaze to be accepted into the Veritas Technology Partner Program. Being able to seamlessly integrate Veritas with Backblaze B2 is a big win for these customers who’ve long been looking for an easy, affordable solution that helps them manage infrastructure costs.

Beginning today, any business or institution using Backup Exec can configure their data to back up to B2 Cloud Storage. IT teams can rest easy, knowing that remote offices, Linux and Unix workloads, virtual or Microsoft workloads, and more, are protected and immediately available if they need them—all configurable in a few clicks and at one-fifth the price of traditional storage vendors, with no data retention minimums or other lock-in fees.

“The Backup Exec service empowers small and medium-sized businesses to do more with less, and to minimize their daily lift around data management and protection. This aligns perfectly with our mission to make storing and using data astonishingly easy—especially for IT teams that already have a ton on their plate.”
—Nilay Patel, VP of Sales & Partnerships, Backblaze

How Does This Partnership Benefit Joint Customers?

The partnership delivers in three key value areas:

  • Simplification: Backup Exec provides a simple, unified user interface that removes complexity from data protection. Backblaze B2 is configurable in a few easy steps.
  • Hybrid-cloud adoption: The capital and personnel expense of purchasing and managing on-prem data storage can be challenging. Veritas and Backblaze put cloud adoption within reach—for backups, workloads, infrastructure, recovery, or everything together.
  • Cost reduction: Backup Exec makes configuring backups easy, which means that businesses can fine-tune their storage bill in line with their budgets. With B2 Cloud Storage priced at one-fifth of traditional cloud vendors—the combination offers a powerful tool to businesses that want to optimize their spend.

“Our goal with Backup Exec is to provide a simple, powerful solution that frees business owners from concerns about data loss. Backblaze is a natural partner in this effort and we’re excited to work with them to bring the value of cloud backup to a larger group of businesses and institutions.”
—Jason Von Eberstein, Senior Principal Product Manager of Veritas

Bonus Points: Veritas + Carahsoft + Backblaze = Public Sector Success

Backblaze recently announced a partnership with Carahsoft to help public sector CIOs optimize their cloud spend. Carahsoft is the Master Government Aggregator™ for the IT industry. The partnership—which was enabled by our recent launch of a capacity-based pricing bundle, Backblaze B2 Reserve—solves both the budgeting and procurement challenges public sector CIOs are facing. And Carahsoft is a strategic distributor for Veritas, so public sector customers can take advantage of the joint solution as well.

About Veritas

Veritas enables organizations to harness the power of their information, with solutions designed to serve the world’s largest and most complex heterogeneous environments. Veritas’s industry-leading solutions cover all platforms with backup and recovery, business continuity, software-defined storage, and information governance.

Getting Started With Backblaze B2 and Veritas

Ready to take your backup game to the next level? Click here to get started today and check out our Knowledge Base article for detailed instructions.

The post Level Up Your Backup Game With Backblaze and Veritas appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Connect Your QNAP NAS to Backblaze B2 Cloud Storage

Post Syndicated from Troy Liljedahl original https://www.backblaze.com/blog/guide-qnap-backup-b2-cloud-storage/

Network attached storage (NAS) devices are a popular solution for data storage, sharing files for remote collaboration purposes, syncing files that are part of a workflow, and more. QNAP, one of the leading NAS manufacturers, makes it incredibly easy to backup and/or sync your business or personal data for these purposes with the inclusion of its application, Hybrid Backup Sync (HBS). HBS consolidates backup, restoration, and synchronization functions into a single application.

Protecting your data with a NAS is a great first step, but you shouldn’t stop there. NAS devices are still vulnerable to any kind of on-premises disaster like fires, floods, and tornados. They’re also not safe from ransomware attacks that might hit your network. To truly protect your data, it’s important to back up or sync to an off-site cloud storage destination like Backblaze B2 Cloud Storage. Backblaze B2 offers a geographically distanced location for your data for $5/TB per month, and you can also embed it into your NAS-based workflows to streamline access across multiple locations.

Read on for more information on whether you should use backup or sync for your purposes and how to connect your QNAP NAS to Backblaze B2 step-by-step. We’ve even provided videos that show you just how easy it is—it typically takes less than 15 minutes!

➔ Download Our Complete NAS Guide

Should I Back Up or Sync?

It’s easy to confuse backup and sync. They’re essentially both making a copy of your data, but they have different use cases. It’s important to understand the difference so you’re getting the right protection and accessibility for your data.

Check out the table below. You’ll see that backup is best for being able to recover from a data disaster, including the ability to access previous versions of data. However, if you’re just looking for a mirror copy of your data, sync functionality is all you need. Sync is also useful as part of remote workflows: you can sync your data between your QNAP and Backblaze B2, and then remote workers can pull down the most up-to-date files from the B2 cloud.

A table comparing Backup vs. Sync

A table comparing Backup vs. Sync.

Because Hybrid Backup Sync provides both functions in one application, you should first identify which feature you truly need. The setup process is similar, but you will need to take different steps to configure backup vs. sync in HBS.

How to Set Up Your Backblaze B2 Account

Now that you’ve determined whether you want to back up or sync your data, it’s time to create your Backblaze B2 Cloud Storage account to securely protect your on-premises data.

If you already have a B2 Cloud Storage account, feel free to skip ahead. Otherwise, you can sign up for an account and get started with 10GB of free storage to test it out.

Ready to get started? You can follow along with the directions in this blog or take a look at our video guides. Greg Hamer, Senior Technical Evangelist, demonstrates how to get your data into B2 Cloud Storage in under 15 minutes using HBS for either backup or sync.

Video: Back Up QNAP to Backblaze B2 Cloud Storage with QNAP Hybrid Backup Sync

Video: Sync QNAP to Backblaze B2 Cloud Storage with QNAP Hybrid Backup Sync

How to Set Up a Bucket, Application Key ID, and Application Key

Once you’ve signed up for a Backblaze B2 Account, you’ll need to create a bucket, Application Key ID, and Application Key. This may sound like a lot, but all you need are a few clicks, a couple names, and less than a minute!

  1. On the Buckets page of your account, click the Create a Bucket button.
  2. An screenshot of the B2 Cloud Storage Buckets page.

  3. Give your bucket a name and enable encryption for added security.
  4. An image showing the Create a Bucket page with security features to be enabled.

  5. Click the Create a Bucket button and you should see your new bucket on the Buckets page.
  6. An image showing a successfully created bucket.

  7. Navigate to the App Keys page of your account and click Add a New Application Key.
  8. Name your Application Key and click the Create New Key button. Make sure that your key has both read and write permissions (the default option).
  9. Your Application Key ID and Application Key will appear on your App Keys page. Important: Make sure to copy these somewhere secure as the Application Key will not appear again!

How to Set Up QNAP’s Hybrid Backup Sync to Work With B2 Cloud Storage

To set up your QNAP with Backblaze B2 sync support, you’ll need access to your B2 Cloud Storage account. You’ll also need your B2 Cloud Storage account ID, Application Key, and bucket name—all of which are available after you log in to your Backblaze account. Finally, you’ll need the Hybrid Backup Sync application installed in QTS. You’ll need QTS 4.3.3 or later and Hybrid Backup Sync v2.1.170615 or later.

To configure a backup or sync job, simply follow the rest of the steps in this integration guide or reference the videos posted above. Once you follow the rest of the configuration steps, you’ll have a set-it-and-forget-it solution in place.

What Can You Do With Backblaze B2 and QNAP Hybrid Backup Sync?

With QNAP’s Hybrid Backup Sync software, you can easily back up and sync data to the cloud. Here’s some more information on what you can do to make the most of your setup.

Hybrid Backup Sync 3.0

QNAP and Backblaze B2 users can take advantage of Hybrid Backup Sync, as explained above. Hybrid Backup Sync is a powerful tool that provides true backup capability with features like version control, client-side encryption, and block-level deduplication. QNAP’s operating system, QTS, continues to deliver innovation and add thrilling new features. The ability to preview backed up files using the QuDedup Extract Tool, a feature first released in QTS 4.4.1, allowed QNAP users to save on bandwidth costs.

You can download the latest QTS update here and Hybrid Backup Sync is available in the App Center on your QNAP device.

Hybrid Mount and VJBOD Cloud

The Hybrid Mount and VJBOD Cloud apps allow QNAP users to designate a drive in their system to function as a cache while accessing B2 Cloud Storage. This allows users to interact with Backblaze B2 just like you would a folder on your QNAP device while using Backblaze B2 as an active storage location.

Hybrid Mount and VJBOD Cloud are both included in the QTS 4.4.1 versions and higher, and function as a storage gateway on a file-based or block-based level, respectively. Hybrid Mount enables Backblaze B2 to be used as a file server and is ideal for online collaboration and file-level data analysis. VJBOD Cloud is ideal for a large number of small files or singular massively large files (think databases!) since it’s able to update and change files on a block-level basis. Both apps offer the ability to connect to B2 Cloud Storage via popular protocols to fit any environment, including server message block (SMB), Apple Filing Protocol (AFP), network file sharing (NFS), file transfer protocol (FTP), and WebDAV.

QuDedup

QuDedup introduces client-side deduplication to the QNAP ecosystem. This helps users at all levels save on space on their NAS by avoiding redundant copies in storage. Backblaze B2 users have something to look forward to as well since these savings carry over to cloud storage via the HBS 3.0 update.

Why Backblaze B2?

QNAP continues to innovate and unlock the potential of B2 Cloud Storage in the NAS ecosystem. If you haven’t given B2 Cloud Storage a try yet, now is the time. You can get started with Backblaze B2 and your QNAP NAS right now, and make sure your NAS is synced securely and automatically to the cloud.

The post How to Connect Your QNAP NAS to Backblaze B2 Cloud Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Connect Your Synology NAS to Backblaze B2 Cloud Storage

Post Syndicated from Troy Liljedahl original https://www.backblaze.com/blog/how-to-connect-your-synology-nas-to-backblaze-b2-cloud-storage/

You’ve added a Synology Network Attached Storage (NAS) Device to your tech stack, but you may be wondering how to protect your files from ransomware, natural disasters, or accidental deletion. Saving your data to cloud storage can help protect you from the painful consequences of data loss. But now you may be wondering whether to backup or sync your data to the cloud. The answer to that question will largely depend on your own individual needs. If you are looking to keep an additional copy of your regularly changing data at an off-premise location to maintain the 3-2-1 backup strategy, then backing up your data to the cloud is the way to go.

If, however, you need your files in a place where everyone in the organization can access them at any moment, where edits to any files can be shown across all devices in real time or you need up-to-the-minute versions of your files off-site, then syncing your files to the cloud will be sufficient.

Your Synology NAS has applications for either backup—Hyper Backup—or sync—Cloud Sync—which we will explain in greater detail below. Understanding the distinction between the two functions is an important part of setting your tech stack up for success. And setting your tech stack up to connect to Backblaze B2 Cloud Storage, gives you greater security, accessibility and off-site peace of mind at a fifth of the cost of other cloud providers.

Read on to learn the differences between backup and sync, how they work with your Synology NAS, and how to connect your NAS to Backblaze B2.

Backup vs. Sync

As mentioned before, understanding the difference between backup and sync is a crucial step in determining how you will pair your NAS with an offsite cloud storage solution like Backblaze B2. As such, it may help you to have a full understanding of the difference between the two.

A backup lets you create copies of files and other digital assets, which are then sent from a NAS to another device or an off-site storage location such as a public cloud. Allowing for either incremental or full backups of the contents of your NAS on a customized schedule, this method allows you to retain a copy of the most recent version of a file, while also being able to retain previous versions. This can also be an effective strategy to combat malware or ransomware, as you can simply delete infected files and restore from a clean backup. In addition, maintaining storage off-site protects your data from any natural disasters that might befall your immediate vicinity.

By contrast, a sync strategy consists of one or more devices working in unison, updating files in the same way across each device and/or a cloud storage location. The benefits of syncing files come from the ability to instantly see updates on files and provide easy access to changes in files to people across your organization. If you connect your NAS to Backblaze B2, you can easily access and download files anywhere you are through native applications or another Backblaze partner integration like Veeam, Iconik, and Cyberduck. The drawback of syncing is that it does not offer effective protections against accidental deletions, unauthorized access or malware.

There are essentially two different ways to sync your files: one-way or two-way. In a one-way sync, when a file from Location A changes, the same file at Location B is updated; however, if something on the file changes in Location B, the file in Location A will not be updated. On the other hand, in a two-way sync, regardless of where the file changes, the other location will automatically update to mirror the other. And in most cases, this means the entire file will be re-uploaded.

It is not uncommon for an organization to use both backup and sync strategies simultaneously, relying on one over the other as needs change. Thankfully, Synology has two relevant proprietary applications that serve the various needs of backing up and syncing data which can be seen in the table below.  Whether you plan to utilize the backup and sync features Synology offers via Hyper Backup and Cloud Sync, securing your files to the cloud will help you create an effective 3-2-1 Backup Strategy, protecting your digital assets. Now we’ll take a closer look at how you can connect your Synology NAS to Backblaze B2 Cloud Storage.

Setting Up Your B2 Cloud Storage Account

Regardless of whether you use Hyper Backup or Cloud Sync, you can get set up in minutes with B2 Cloud Storage. You can follow along with the directions in this blog or take a look at our video guides. Pat Patterson, Chief Technical Evangelist, demonstrates how to get your data into B2 Cloud Storage in under 10 minutes using either Hyper Backup or Cloud Sync.

Here’s a video tutorial for Hyper Backup:

And here’s one for Cloud Sync:

The first step is to create a Backblaze B2 Cloud Storage account so your data has a location to be securely stored. You can sign up for an account and get started with 10GB of storage for free.

We’ll continue to show the steps after you’ve signed up for a Backblaze B2 Account in order to access your new bucket, Application Key ID, and Application Key. This will only take a few clicks, a couple names, and less than a minute.

  1. On the Buckets page of your account, click the Create a Bucket button.
  2. Give your bucket a name and enable encryption for added security.
  3. Click the Create a Bucket button and you should see your new Bucket on the Buckets page.
  4. Navigate to the App Keys page of your account and click the Add a New Application Key button.
  5. Name your Application Key and click the Create New Key button—make sure that your key has both Read and Write permissions (the default option).
  6. Your Application Key ID and Application Key will appear on your App Keys page. Make sure to copy these somewhere secure as the Application Key will not appear again!

Backing Up or Syncing Your Synology to Backblaze B2

By now you have created the location for your data to be either backed up or synced to and obtained your Application Key.

If you want to backup your data, then follow this integration guide or the video mentioned above that takes you step-by-step on how you can use Hyper Backup to backup your data from your Synology to B2 Cloud Storage.

If syncing your data is what you need, then follow this integration guide or the video mentioned above that takes you through how you can use Cloud Sync to sync your data from your Synology to B2 Cloud Storage.

Once you have built the connection between your Synology to B2 Cloud Storage either through Hyper Backup or Cloud Sync (or both!), you can begin backing up or syncing your data for greater protection and accessibility no matter the location.

Summary

Creating and implementing an effective backup strategy, sync strategy or hybrid of the two can be an effective way to protect your data. A thorough understanding of the benefits, drawbacks and strategies involved, and the ways your Synology NAS can utilize both Hyper Backup and Cloud Sync, will hopefully get you on your way to securing your data.

At a fifth of the price of competitors, with setup that takes less than 10 minutes, Backblaze B2 Cloud Storage is a great complement to your Synology NAS.

The post How to Connect Your Synology NAS to Backblaze B2 Cloud Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

What Are Software Containers?

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/what-are-containers/

A decorative image showing shipping containers in a lightbox.

You’re probably familiar with containers if you’re even remotely involved in software development or systems administration. In their 2023 survey, the Cloud Native Computing Foundation found that over 90% organizations use containers in production. Additionally, more than 90% of organizations that rely on cloud native practices for most or all of their application development and deployment also depend on containers. 

But, whether orchestrating containers is a regular part of your day-to-day life, or you are just trying to understand what an operating system kernel is, it helps to have an understanding of some container basics.

Today, we’re explaining what containers are, how they’re used, and how cloud storage fits into the container picture—all in one neat and tidy containerized blog post package. (And, yes, the kernel is important, so we’ll get to that, too).

What are containers?

Containers are packaged units of software that contain all of the dependencies (e.g. binaries, libraries, programming language versions, etc.) they need to run no matter where they live—on a laptop, in the cloud, or in an on-premises data center. That’s a fairly technical definition, so you might be wondering, “OK, but what are they really?”

The generally accepted definition of the term applies almost exactly to what the technology does.

A container, generally = a receptacle for holding goods; a portable compartment in which freight is placed (as on a train or ship) for convenience of movement.

A container in software development = a figurative “receptacle” for holding software. The second part of the definition applies even better—shipping containers are often used as a metaphor to describe what containers do. In shipping, instead of stacking goods in a jumbled pile, goods are packed into standard-sized containers that fit on whatever is hauling them—a ship, a train, or a trailer.

Likewise, instead of “shipping” an unwieldy mess of code, including the required operating system, containers package software into lightweight units that share the same operating system (OS) kernel and can run anywhere—on a laptop, on a server, in the cloud, etc.

What’s an OS kernel?

As promised, here’s where the OS kernel becomes important. The kernel is the core programming at the center of the OS that controls all other parts of the OS. The term makes sense if you consider the definition of “kernel” as “the central or essential part” as in “a kernel of truth.” (It also begs the question, “Why didn’t they just call it a colonel?” especially because it’s in charge of so many things… But that’s neither here nor there.) And now you know what an OS kernel does.

Compared to older virtualization technology, namely virtual machines which are measured in gigabytes, containers are only megabytes in size. That means you can run quite a few of them on a given computer or server much like you can stack many containers onto a ship.

Indeed, the founders of Docker, the software that sparked widespread container adoption, looked to the port of Oakland, California for inspiration. Former Docker CEO, Ben Golub, explained in an interview with InfoWorld, “We could see all the container ships coming into the port of Oakland, and we were talking about the value of the container in the world of shipping. The fact it was easier to ship a car from one side of the world than to take an app from one server to another, that seemed like a problem ripe for solving.” In fact, it’s right there in their logo.

And that is exactly what containers, mainly via Docker’s popularity, did—they solved the problem of environment inconsistency for developers. Before containers became widely used, moving software between environments meant things broke, a lot. If a developer wrote an app on their laptop, then moved it into a testing environment on a server, for example, everything had to be the same—same versions of the programming language, same permissions, same database access, etc. If not, you had a very sad app.

Virtualization 101

Containers work their magic by way of virtualization. Virtualization is the process of creating a simulated computing environment that’s abstracted from the physical computing hardware—essentially a computer-generated computer, also referred to as a software-defined computer.

The first virtualization technology to really take off was the virtual machine (VM). A VM sits atop a hypervisor—a lightweight software layer that allows multiple operating systems to run in tandem on the same hardware. VMs allow developers and system administrators to make the most of computing hardware. Before VMs, each application had to run on its own server, and it probably didn’t use the server’s full capacity. After VMs, you could use the same server to run multiple applications, increasing efficiency and lowering costs.

Containers vs. virtual machines

While VMs increase hardware efficiency, each VM requires its own OS and a virtualized copy of the underlying hardware. Because of this, VMs can take up a lot of system resources, and they’re slow to start up.

Containers, on the other hand, do not virtualize the hardware. Instead, they share the host operating system’s kernel, making them much smaller and faster than VMs. Want to know more? Check out our deep dive into the differences between VMs and containers.

chart of how Containers stack on a server

The benefits of containers

Containers allow developers and system administrators to develop, test, and deploy software and applications faster and more efficiently than older virtualization technologies like VMs. The benefits of containers include:

  1. Portability: Containers include all of the dependencies they need to run in any environment, provided that environment includes the appropriate OS. This reduces the errors and bugs that arise when moving applications between different environments, increasing portability.
  2. Size: Containers share OS resources and don’t include their own OS image, making them lightweight—megabytes compared to VMs’ gigabytes. As such, one machine or server can support many containers.
  3. Speed: Again, because they share OS resources and don’t include their own OS image, containers can be spun up in seconds compared to VMs which can take minutes to spin up.
  4. Resource efficiency: Similar to VMs, containers allow developers to make the best use of hardware and software resources.
  5. Isolation: Also similar to VMs, with containers, different applications or even component parts of a singular application can be isolated such that issues like excessive load or bugs on one don’t impact others.

Container use cases

Containers are nothing if not versatile, so they can be used for a wide variety of use cases. However, there are a few instances where containers are especially useful:

  1. Enabling microservices architectures: Before containers, applications were typically built as all-in-one units or “monoliths.” With their portability and small size, containers changed that, ushering in the era of microservices architecture. Applications could be broken down into their component “services,” and each of those services could be built in its own container and run independently of the other parts of the application. For example, the code for your application’s search bar can be built separately from the code for your application’s shopping cart, then loosely coupled to work as one application.
  2. Supporting modern development practices: Containers and the microservices architectures they enable paved the way for modern software development practices. With the ability to split applications into their component parts, each part could be developed, tested, and deployed independently. Thus, developers can build and deploy applications using modern development approaches like DevOps, continuous integration/continuous deployment (CI/CD), and agile development.
  3. Facilitating hybrid cloud and multi-cloud approaches: Because of their portability, containers enable developers to utilize hybrid cloud and/or multi-cloud approaches. Containers allow applications to move easily between environments—from on-premises to the cloud or between different clouds.
  4. Accelerating cloud migration or cloud-native development: Existing applications can be refactored using containers to make them easier to migrate to modern cloud environments. Containers also enable cloud-native development and deployment.

The role of software containers in AI application development

In addition to enabling microservices architectures and supporting modern development practices, containers play a role in AI application development. Their ability to provide consistent, reproducible environments makes them ideal for AI, where managing complex dependencies and ensuring uniform performance across different platforms are essential. 

AI projects often rely on specific versions of libraries, drivers, and runtimes, which can lead to compatibility issues and errors. Containers solve this problem by encapsulating all necessary dependencies, libraries, and runtime environments to provide a consistent and reproducible platform for AI development. This encapsulation ensures that AI models and applications run the same way, regardless of the underlying infrastructure and provides consistency from development through production. 

The portability of containers also offers advantages for deploying AI workloads across diverse environments. They can be easily moved between local development machines, on-premises servers, and cloud platforms without requiring code or configuration changes. This flexibility supports easy scalability of AI applications to meet changing demands—such as increased user loads or the need for more intensive data processing. 

Additionally, containers enable organizations to leverage the most cost effective and powerful computing resources available, whether it’s local hardware for testing and development or cloud-based GPU clusters for training large-scale models. This ability moves workloads efficiently across different environments and also supports hybrid and multi-cloud strategies to provide organizations with greater agility, while reducing costs and avoiding vendor lock-in.

Container tools

The two most widely recognized container tools are Docker and Kubernetes. They’re not the only options out there, but in their 2023 developer survey, Stack Overflow found that nearly 52% out of 90,000+ respondents use Docker and 19% use Kubernetes. But what do they do?

1. What is Docker?

Container technology had been around for a while in the form of Linux containers or LXC, but the widespread adoption of containers happened only in the past decade with the introduction of Docker.

Docker was launched in 2013 as a project to build single-application LXC containers, introducing several changes to LXC that make containers more portable and flexible to use. It later morphed into its own container runtime environment. At a high level, Docker is a Linux utility that can efficiently create, ship, and run containers.

Docker introduced more standardization to containers than previous technologies and focused on developers, specifically, making it the de facto standard in the developer world for application development.

2. What is Kubernetes?

As containerization took off, many early adopters found themselves facing a new problem: how to manage a whole bunch of containers. Enter: Kubernetes. Kubernetes is an open-source container orchestrator. It was developed at Google (deploying billions of containers per week is no small task) as a “minimum viable product” version of their original cluster orchestrator, ominously named Borg. Today, it is managed by the Cloud Native Computing Foundation, and it helps automate management of containers including provisioning, load balancing, basic health checks, and scheduling.

Kubernetes allows developers to describe the desired state of a container deployment using YAML files (YAML stands for Yet Another Markup Language, which is yet another winning tech acronym.). The YAML file uses declarative language to tell Kubernetes “this is what this container deployment should look like” and Kubernetes does all the grunt work of creating and maintaining that state.

Containers + storage: What you need to know

Containers are inherently ephemeral or stateless. They get spun up, and they do their thing. When they get spun down, any data that was created while they were running is destroyed with them. But most applications are stateful, and need data to live on even after a given container goes away.

Object storage is inherently scalable. It enables the storage of massive amounts of unstructured data while still maintaining easy data accessibility. For containerized applications that depend on data scalability and accessibility, it’s an ideal solution for keeping stateful data stateful.

There are three essential use cases where object storage works hand in hand with containerized applications:

  1. Backup and disaster recovery: Tools like Docker and Kubernetes enable easy replication of containers, but replication doesn’t replace traditional backup and disaster recovery just as sync services aren’t a good replacement for backing up the data on your laptop, for example. With object storage, you can replicate your entire environment and back it up to the cloud. There’s just one catch: some object storage providers have retention minimums, sometimes up to 90 days. If you’re experimenting and iterating on your container architecture, or if you use CI/CD methods, your environment is constantly changing. With retention minimums, that means you might be paying for previous iterations much longer than you want to. (Shameless plug: Backblaze B2 Cloud Storage is calculated hourly, with no minimum retention requirement.)
  2. Primary storage: You can use a cloud object storage repository to store your container images, then when you want to deploy them, you can pull them into the compute service of your choice.
  3. Origin storage: If you’re serving out high volumes of media, or even if you’re just hosting a simple website, object storage can serve as your origin store coupled with a CDN for serving out content globally. For example, CloudSpot, a SaaS platform that serves professional photographers, moved to a Kubernetes cluster environment and connected it to their origin store in Backblaze B2, where they now keep 120+ million files readily accessible for their customers.

Need object storage for your containerized application?

Now that you have a handle on what containers are and what they can do, you can make decisions about how to build your applications or structure your internal systems. Whether you’re contemplating moving your application to the cloud, adopting a hybrid or multi-cloud approach, or going completely cloud native, containers can help you get there. And with object storage, you have a data repository that can keep up with your containerized workloads.

Ready to connect your application to scalable, S3-compatible object storage? You can get started today for free.

The post What Are Software Containers? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

What’s the Diff: SSD vs. NVMe vs. M.2 Drives

Post Syndicated from Lora Maslenitsyna original https://backblaze.com/blog/nvme-vs-m-2-drives/

What's the Diff? M.2 vs. NVMe vs. SSD

Storage technology has certainly changed over time, with new storage mediums taking aim at what’s become the industry standard: the hard disk drive (HDD). Of all the challengers, solid state drives (SSDs) are the most widely-adopted and likely to take the top spot—they’re fast, quiet, and retain information when they’re powered off (so you can just pop open your laptop and get back in the groove). 

Yes, they’re more expensive than traditional hard disk drives, but there’s value in their much faster read/write speeds, and they’re arguably more reliable depending on your use case in the long term. Even with those benefits, that’s not to say HDDs are obsolete, just that there is good reason to upgrade to SSDs in many cases. The decision from the judges: These two contenders get to share the spotlight, and you get to decide which drive works best for your needs. 

Once you’ve made the decision to invest in an SSD, you’ll have to choose the form factor  and possibly the interface. In this post we’ll cover some basics about SSDs that should help you choose which type of SSD is best for you, including:

  • What is an SSD?
  • What is a SATA SSD?
  • What is an M.2 SSD?
  • What is an MSATA SSD?
  • What is an U.2 SSD?
  • What is an NVMe SSD?
  • Which SSD is right for you?

A Brief Introduction to SSDs

SSDs are storage devices that use NAND-based flash memory to store data. They are now standard issue for most computers, as is the case across Apple’s line of Macs. Unlike traditional hard disk drives (HDDs), which store data on spinning disks, SSDs have no moving parts, which makes them faster, more reliable, and less prone to mechanical failures.

SSDs have become so common mainly because they are faster in terms of read/write speeds versus hard drives. This means that they can access and transfer data much more quickly. This makes them an ideal choice for use in high-performance computers, servers, and other devices that require fast data access and transfer speeds. They also use less power and are available at different form factors, such as 2.5” and M.2, so they can be used in a range of devices. You can read more about the difference between SSDs and HDDs in this post.

One downside of SSDs is that they tend to be more expensive than HDDs, especially when it comes to larger storage capacities. However, as the cost of flash memory continues to decrease, SSDs are becoming more affordable and accessible for everyday consumers. Comparatively speaking, they also have a more limited write cycle. While you usually don’t run into this limitation as a typical computer user, if you’re constantly saving large files like you would if you’re working with media files, for instance, it’s something to consider. 

Additionally, it can be harder to tell when an SSD is failing (no spinning disks to whir furiously at you), which means you’ll want to make sure you’re effectively monitoring your drives and backing up.  

What Is a SATA SSD?

A Serial Advanced Technology Attachment (SATA) is the standard storage interface used in many computers. A SATA SSD is an SSD equipped with a SATA interface to connect the storage device to a computer’s motherboard. The SATA SSD comes in the standard 2.5 inch form factor and has both power and data (SATA) connectors. If you buy an SSD external drive to connect to your PC, there will most likely be a SATA SSD inside. 

Generally, the SATA SSD is the least expensive type of SSD, all other factors being equal. This makes a great choice to get an instant performance boost out of your HDD-based computer, or add an external drive that can read and write data more quickly.

One thing to know about external SSD drives is that they should not be disconnected from your computer and stored away for long periods of time. Anything over a year is too long, and as the drive gets older it needs to be plugged in even more often. If you do want to use your SSD for long-term archival storage and have it disconnected from your main devices (air gap, anyone?), then it’s a good idea to either store them with about 50% charge, or power on the SSD every few months to refresh the charge. 

An image of a Western Digital 1TB SATA SSD.

What Is an NVMe?

Non-Volatile Memory Express (NVMe) is a storage protocol that offers high-speed and efficient communication between a computer’s CPU and SSDs. Drives that use NVMe were introduced in 2013 to attach to the Peripheral Component Interconnect Express (PCIe) slot directly on a motherboard instead of using the traditional SATA interface typically used by HDDs and older SSDs. Unlike SATA, which was originally designed for slower HDDs, NVMe takes advantage of the low-latency and high-speed capabilities of SSDs. NVMe drives can usually deliver a sustained read-write speed of 4.0 GB/s in contrast with SATA SSDs that limit at 600 MB/s. Since NVMe SSDs can reach higher speeds and handle multiple data transfers simultaneously, it makes them ideal for gaming, high-resolution video editing, and applications that require high-performance storage, such as enterprise databases, virtualization, and data analytics.

Their high speeds come at a high cost, however: NVMe drives are some of the more expensive drives on the market.

An image of a Kioxia NVMe SSD.

To sum up our analysis on storage protocols in SSDs, here’s a handy chart:

Feature SATA NVMe
Interface Serial Advancement Technology Attachment Non-Volatile Memory Express
Bus SATA bus PCIe bus
Data Transfer Speeds Slower (up to 600 MB/s) Faster (typically 3000 MB/s and up to 7,500 MB/s)
Latency Higher Lower
Scalability Limited Scalable

What Are M.2 Drives?

M.2 drives, also known as Next Generation Form Factor (NGFF) drives, are a type of SSD that uses the M.2 interface to connect directly into a computer’s motherboard without the need for cables. M.2 is a form factor, which refers to the standard size of a component. So, while we’re comparing NVMe to SATA in storage protocol above, a direct comparison for the M.2 SSD is a traditional, 2.5 inch SSD. 

M.2 SSDs are significantly smaller than traditional, 2.5 inch SSDs. Because they can use either SATA or NVMe to connect to the motherboard, they have the potential to be much faster when they’re utilizing the latter. They’re also more power-efficient than other types of SSDs, which improves battery life in portable devices. Because they take up less space while delivering all of the above, M.2 drives have become popular in gaming setups. (Game on, friends.)

Even at this smaller size, M.2 SSDs are able to hold as much data as other SSDs, ranging up to 8TB in storage size. But, while they can hold just as much data and are generally faster than other SSDs, they also come at a higher cost. As the old adage goes, you can only have two of the following things: cheap, fast, or good.

M.2 drives are easy to install, and they can be added to most modern motherboards that have an M.2 slot. If your motherboard does not have an M.2 slot, you may be able to use an M.2 drive by using an adapter card that fits into a PCIe slot. So, before you run out and buy an M.2 SSD, you’ll need to know which interface your computer will accept, M.2 SATA or M.2 PCIe.

A Samsung SATA M.2 SSD.

What Are mSATA Drives?

mSATA, or mini-SATA, is basically a smaller version of the full-size SATA SSD that is designed for smaller form factor systems where space is limited. The mSATA form factor is compact like M.2, but they are not interchangeable. Currently, mSATA drives only support the SATA interface.

A Rogob mSATA SSD.

What Are U.2 Drives?

U.2 drives look like a 2.5” drive but are a bit thicker, and they use a different connector that sends data through a PCIe interface. The U.2 SSD was developed to enable the newer and faster NVMe-based PCIe SSDs to plug into the same drive backplane as older SAS and SATA drives. This allows the drives to be hot-swappable, an especially useful feature in enterprise server and storage systems.

A Mircon U.2 SSD.

Which SSD Is Best to Use?

There are a few factors to consider in choosing which drive is best for you. As you compare the different components of your build, consider your technical constraints, budget, capacity needs, and speed priority.

Technical Constraints

Check the capability of your system before choosing a drive, as some older devices don’t have the components needed for NVMe connections. Also, check that you have enough PCIe connections to support multiple PCIe devices. Not enough lanes, or only specific lanes, means you may have to choose a different drive or that only one of your lanes will be able to connect to the NVMe drive at full speed.

The same advice holds if you’re adding an SSD to your NAS array. Most newer models will be compatible with adding an SSD, but you should always double check your NAS specs before you commit to a purchase. And remember: even though SSDs offer many benefits, especially when you’re looking to add caching or using your NAS for media workflows, there are several use cases where you’d see minimal advantage compared with an HDD, such as if you want a NAS to store large amounts of infrequently accessed files (i.e. in backup/archive scenarios).

Budget

If you plan to be making a lot of large file transfers or want to have the highest speeds for gaming, then an NVMe SSD is what you want. Until recently SATA SSDs were much more affordable options compared with NVMe drives, but that is changing rapidly. For example, at the time of publication, a Samsung 1TB SATA SSD (870 EVO) retails for $990 on Amazon, while a Samsung 1TB NVMe drive (970 EVO) is listed for $150only $98 on sale on Amazon. That said, those 8TB NVMe drives we talked about are running about $850+, depending on the manufacturer. 

Drive Capacity

SATA drives usually range from 500GB to 16TB in storage capacity. Most M.2 drives top out at 2TB, although some may be available at 4TB and 8TB models at much higher prices.

Drive Speed

When choosing the right drive for your setup, remember that SATA M.2 drives and 2.5 inch SSDs provide the same level of speed, so to gain a performance increase, you will have to opt for the NVMe-connected drives. While NVMe SSDs are going to be much faster than SATA drives, you may also need to upgrade your processor to keep up or you may experience worse performance. Finally, remember to check read and write speeds on a drive as some earlier generations of NVMe drives can have different speeds.

Whatever You Choose, Back It Up

Before choosing a new drive, remember to back up all of your data. Backing up is essential as every drive will eventually fail and need to be replaced. The basis of a solid backup plan requires three copies of your data: one on your device, one backup saved locally, and one stored off-site. Storing a copy of your data in the cloud ensures that you’re able to retrieve it if any data loss occurs on your device.

Interested in learning more about other drive types or best ways to optimize your setup? Let us know in the comments below.

FAQs about NVMe vs. M.2 drives

What is the difference between NVMe and M.2 drives?

NVMe and M.2 are often used interchangeably, but they refer to different aspects of storage technology. Non-Volatile Memory Express (NVMe) drives attach to the Peripheral Component Interconnect Express (PCIe) slot directly on a motherboard instead of using the traditional SATA interface, resulting in higher data transfer speeds. M.2, on the other hand, is a physical form factor or connector used for SSDs. M.2 drives can support various storage interfaces, including NVMe, SATA, and others, providing flexibility in terms of compatibility and speed.

Which is faster, NVMe or M.2 drives?

NVMe and M.2 drives are not directly comparable in terms of speed because they refer to different aspects of storage technology. NVMe (Non-Volatile Memory Express) is a storage protocol that provides high-speed communication between the computer’s CPU and SSDs. It is designed to take full advantage of the capabilities of SSDs and can offer significantly faster data transfer speeds compared to traditional interfaces like SATA.

M.2, on the other hand, refers to a physical form factor or connector used for storage devices, including SSDs. M.2 drives can support various interfaces, including NVMe, SATA, and others. The speed of an M.2 drive depends on the specific interface it uses. NVMe M.2 drives, which utilize the NVMe protocol, can provide faster speeds compared to M.2 drives that use the SATA interface.

In summary, NVMe is a storage protocol that can be implemented in various form factors, including M.2, and NVMe drives tend to offer faster speeds compared to M.2 drives that utilize the SATA interface.

Can NVMe be used in any M.2 slot?

NVMe drives can generally be used in M.2 slots, but it is important to ensure compatibility with the specific M.2 slot on your motherboard. M.2 slots can support different types of interfaces, including SATA and NVMe.

What are the advantages of NVMe in M.2 drives?

NVMe (Non-Volatile Memory Express) is a storage protocol that can be implemented through various form factors, one of which is M.2.

The main advantage of NVMe technology is its high-speed data transfer capabilities. Compared to traditional storage interfaces like SATA, NVMe provides significantly faster performance. It leverages the Peripheral Component Interconnect Express (PCIe) interface, allowing for direct communication between the CPU and the SSD. This results in reduced latency and improved overall system responsiveness.

M.2, on the other hand, is a physical form factor or connector that can support various interfaces, including SATA and NVMe. M.2 drives can accommodate NVMe SSDs, allowing them to take advantage of the faster speeds provided by the NVMe protocol. In addition to the ability to utilize a faster NVMe protocol, the advantages of an M.2 size are its small size and easy direct-to-motherboard installation.

Are NVMe drives more expensive than SATA drives?

Until recently SATA SSDs were much more affordable options compared with NVMe drives, but that is changing rapidly. For example, as of June 2023, a Samsung 1TB SATA SSD (870 EVO) retails for $90 on Amazon, while a Samsung 1TB NVMe drive (970 EVO) is listed for only $98 on sale on Amazon. Prices are now comparable.

The post What’s the Diff: SSD vs. NVMe vs. M.2 Drives appeared first on Backblaze Blog | Cloud Storage & Cloud Backup