Tag Archives: Featured

Storage Tech of the Future: Ceramics, DNA, and More

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/storage-tech-of-the-future-ceramics-dna-and-more/

A decorative image showing a data drive with a health monitor indicator running through and behind it.

Two announcements had the Backblaze #social Slack channel blowing up this week, both related to “Storage Technologies of the Future.” The first reported “Video of Ceramic Storage System Surfaces Online” like some kind of UFO sighting. The second, somewhat more restrained announcement heralded the release of DNA storage cards available to the general public. Yep, you heard that right—coming to a Best Buy near you. (Not really. You absolutely have to special order these babies, but they ARE going to be for sale.)

We talked about DNA storage way back in 2015. It’s been nine years, so we thought it was high time to revisit the tech and dig into ceramics as well. (Pun intended.) 

What Is DNA Storage?

The idea is elegant, really. What is DNA if not an organic, naturally occurring form of code?

DNA consists of four nucleotide bases: adenine (A), thymine (T), cytosine (C), and guanine (G). 

In DNA storage, information is encoded into sequences of these nucleotide bases. For example, A and C might represent 0, while T and G represent 1. This encoding allows digital data, such as text, images, or other types of information, to be translated into DNA sequences. Cool!
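To make the encoding concrete, here's a toy sketch of the bit-to-base scheme described above (real DNA storage codecs are far more sophisticated, adding error correction and avoiding problematic base runs; this just illustrates the mapping):

```python
def bits_to_dna(bits: str) -> str:
    """Encode a bit string as a DNA sequence (0 -> A, 1 -> T)."""
    return "".join("A" if b == "0" else "T" for b in bits)

def dna_to_bits(seq: str) -> str:
    """Decode a DNA sequence back to bits (A/C -> 0, T/G -> 1)."""
    return "".join("0" if base in "AC" else "1" for base in seq)

# The letter "B" as 8 bits, written as a strand and read back.
message = format(ord("B"), "08b")   # "01000010"
strand = bits_to_dna(message)
assert dna_to_bits(strand) == message
print(strand)  # ATAAAATA
```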

The appeal of DNA as a storage medium lies in its density and stability, as well as its ability to store vast amounts of information in a very compact space. It also boasts remarkable durability, with the ability to preserve information for thousands of years under suitable conditions. I mean, leave it to Mother Nature to put our silly little hard drives to shame.

Back in 2015, we shared that the storage density of DNA was about 2.2 petabytes per gram. In 2017, a study out of Columbia University and the New York Genome Center put it at an incredible 215 petabytes per gram. For comparison’s sake, a WDC 22TB drive (WDC WUH722222ALE6L4) that we currently use in our data centers is 1.5 pounds or 680 grams, which nets out at 0.032TB/gram or 0.000032PB/gram.
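The arithmetic behind that comparison is quick to check (the figures are the ones quoted above):

```python
# Back-of-the-envelope check of the density comparison.
dna_pb_per_gram = 215        # 2017 Columbia/NYGC estimate, PB/gram

drive_tb = 22                # WDC WUH722222ALE6L4 capacity
drive_grams = 680            # ~1.5 pounds

hdd_tb_per_gram = drive_tb / drive_grams
print(f"HDD: {hdd_tb_per_gram:.3f} TB/gram "
      f"({hdd_tb_per_gram / 1000:.6f} PB/gram)")

# DNA at 215 PB/gram works out to roughly 6.6 million times denser.
ratio = dna_pb_per_gram / (hdd_tb_per_gram / 1000)
print(f"DNA is ~{ratio:,.0f}x denser per gram")
```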

Another major advantage is its sustainability. Estimated global data center electricity consumption in 2022 was 240–340 TWh, or around 1–1.3% of global final electricity demand. Current data storage technology uses rare earth metals, which are environmentally damaging to mine. Drives take up space, and they also create e-waste at the end of their lifecycle. It’s a challenge anyone who works in the data storage industry thinks about a lot.

DNA storage, on the other hand, requires less energy. A 2023 study found that data writing can be achieved in the DNA movable-type storage system under normal operating temperatures ranging from about 60–113°F and can be stored at room temperature. DNA molecules are also biodegradable and can be broken down naturally. 

The DNA data-writing process is chemical-based, and actually not the most environmentally friendly, but the DNA storage cards developed by Biomemory use a proprietary biosourced writing process, which they call “a significant advancement over existing chemical or enzymatic synthesis technologies.” So, there might be some trade-offs, but we’ll know more as the technology evolves. 

What’s the Catch?

Density? Check. Durability? Wow, yeah. Sustainability? You got it. But DNA storage is still a long way from sitting on your desk, storing your duplicate selfies. First, and we said this back in 2015 too, DNA takes a long time to read and write—DNA synthesis writes at a few hundred bytes per second. An average iPhone photo would take several hours to write to DNA. And to read it, you have to sequence the DNA—a time-intensive process. Both of those processes require specialized scientific equipment.

It’s also still too expensive. In 2015, we found a study that put 83 kilobytes of DNA storage at £1000 (about $1,500 U.S. dollars). In 2021, MIT estimated it would cost about $1 trillion to store one petabyte of data on DNA. For comparison, it costs $6,000 per month to store one petabyte in Backblaze B2 Cloud Storage ($6/TB/month). You could store that petabyte for a little over 13 million years before you’d hit $1 trillion.
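That 13 million year figure falls straight out of the numbers above:

```python
# How long could $1 trillion keep a petabyte in Backblaze B2
# at the $6/TB/month rate quoted above?
b2_per_pb_month = 6 * 1000          # $6/TB/month x 1,000 TB = $6,000/PB/month
budget = 1_000_000_000_000          # MIT's ~$1T estimate for 1PB on DNA

months = budget / b2_per_pb_month
years = months / 12
print(f"{years:,.0f} years")        # ~13.9 million years
```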

Today, Biomemory’s DNA storage cards ring in at a cool €1000 (about $1,080 U.S. dollars). And they can hold a whopping one kilobyte of data, or the equivalent of a short email. So, yeah… it’s, ahh, gotten even more expensive for the commercial product.

The discrepancy between the MIT theoretical estimate and the cost of the Biomemory cards really speaks to the expense of bringing a technology like this to market. The theoretical cost per byte is a lot different than the operational cost, and the Biomemory cards are really meant to serve as proof of concept. All that said, as the technology improves, one can only hope that it becomes more cost-effective. Folks are experimenting with different encoding schemes to make writing and reading more efficient, which is one example of an advance that could start to tip the balance.

Finally, there’s just something a bit spooky about using synthetic DNA to store data. There’s a Black Mirror episode in there somewhere. Maybe one day we can upload kung fu skills directly into our brain domes and that would be cool, but for now, it’s still somewhat unsettling.

What Is Ceramic Storage?

Ceramic storage makes an old school approach new again, if you consider that the first stone tablets were kind of the precursor to today’s hard drives. Who’s up for storing some cuneiform?

Cerabyte, the company behind the “video that surfaced online,” is working on storage technology that uses ceramic and glass substrates in devices the size of a typical HDD that can store 10 petabytes of data. They use a glass base similar to Gorilla Glass by Corning topped with a layer of ceramic 300 micrometers thick that’s essentially etched with lasers. (Glass is used in many larger hard drives today, for what it’s worth. Hoya makes them, for example.) The startup debuted a fully operational prototype system using only commercial off-the-shelf equipment—pretty impressive. 

The prototype consists of a single read-write rack and several library racks. When you want to write data, it moves one of the cartridges from the library to the read-write rack where it is opened to expose and stage the ceramic substrate. Two million laser beamlets then punch nanoscale ones and zeros into the surface. Once the data is written, the read-write arm verifies it on the return motion to its original position. 

Cerabyte isn’t the only player in the game. Others like MDisc use similar technology. Currently, MDisc stores data on DVD-sized disks using a “rock-like” substrate. Several DVD player manufacturers have included the technology in players. 

Similar to DNA storage, ceramic storage boasts much higher density than current data storage tech—terabytes per square centimeter versus an HDD’s 0.02TB per square centimeter. Also like DNA storage, it’s more environmentally friendly. Ceramic and glass can be stored within a wide temperature range, from -460°F to 570°F, and it’s a natural material that will last millennia and eventually decompose. It’s also incredibly durable: Cerabyte claims it will last 5,000+ years, and with tons of clay pots still lying around from ancient times, that makes sense.

One advantage it has over DNA storage, though, is speed. One laser pulse writes up to 2,000,000 bits, so data can be written at GBps speeds.

What’s the Catch?

Ceramic storage checks the boxes for density, sustainability, and speed, but our biggest question is: who’s going to need that speed? Only a handful of applications, like AI, require it today. AI is certainly having a big moment, and it can only get bigger. So, presumably there’s a market, but only a small one that can justify the cost.

One other biggie, at least for a cloud storage provider like us, though not necessarily for consumers or other enterprise users: it’s a write-once model. Once it’s on there, it’s on there. 

Finally, much like DNA tech, it’s probably (?) still too expensive to make it feasible for most data center applications. Cerabyte hasn’t released pricing yet. According to Blocks & Files, “The cost roadmap is expected to offer cost structures below projections of current commercial storage technologies.” But it’s still a big question mark.

Our Hot Take

Both of these technologies are really cool. They definitely got our storage geek brains fired up. But until they become scalable, operationally feasible, and cost-effective, you won’t see them in production—they’re still far enough out that they sit on the fiction end of the science fiction to science fact spectrum. And there are a couple of roadblocks we see before they reach the ubiquity of your trusty hard drive.

The first is making both technologies operational, not just theoretical in a lab. We’ll know more about both Biomemory’s and Cerabyte’s technologies as they roll out these initial proof of concept cards and prototype machines. And both have plans, naturally, for scaling the technologies to the data center. Whether they can or not remains to be seen. Lots of technologies have come and gone, falling victim to the challenges of production, scaling, and cost. 

The second is the attendant infrastructure needs. Getting 100x speed is great, if the device is right next to you. But we’ll need similar leaps in physical networking infrastructure to transfer the data anywhere else. Until that catches up, the tech remains lab-bound. 

All that said, I still remember using floppy disks that held mere megabytes of data, and now you can put 20TB on a hard disk. So, I guess the question is, how long will it be before I can plug into the Matrix?

The post Storage Tech of the Future: Ceramics, DNA, and More appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Download Your Google Drive and Back Up Your Files

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/download-backup-google-drive/

A decorative image showing a Google Drive logo and a storage bar filling up with different types of files.

Editor’s Note

What better time for a reminder to back up your data than after a serious data loss event? If you are concerned about the safety of your Google Drive data after the reports of unexplained data loss by Google Drive users last week, then read on to learn how to download and back up your Google Drive.

More than one billion businesses and individuals use Google Drive according to, well, a quick search on Google. If most of those one billion people are like me, they save pretty much everything there. 

Whether the data is professional or personal, the end result is a lot of important files that aren’t necessarily backed up anywhere. Maybe your school is closing your account and you need to move all of your data somewhere else. Maybe your account gets attacked by cybercriminals. Or maybe Google goes down or loses your data. In order to protect your important Google Drive files, you need to understand how to go about downloading and backing up your account. 

In this post, you’ll learn some simple steps to achieve that, including how to download your Google Drive, how to back up your computer, and how to back up your Google Drive.

We’ve gathered a handful of guides to help you protect your content across many different platforms. We’re working on developing this list—please comment below if you’d like to see another platform covered.

How to Download Your Google Drive

Most people have multiple email accounts, so first it is important to make sure you are logged in to the correct Google Account before you start this process. 

Once you’re signed in, you will want to go to Google Drive: drive.google.com. From there, you can download individual files if you don’t have that many or do a bulk download.

To download individual files:

  1. Hold shift while you select all of your files.
  2. Right-click and select Download.

To do a bulk download:

  1. Go to your account at myaccount.google.com.
  2. Go to Data & privacy.
  3. Scroll down to the section of the page titled “Download or delete your data” and click “Download your data.” This allows you to download all of the data in your Google account (not just Google Drive) via Google Takeout.
A screenshot of Google Drive settings showing where to download your data.
  4. Select Google Drive (and whatever other services you might want to download data from).
A screenshot of Google Drive settings showing how to select which Google suite data you want to download.
  5. You then have a few options to select:
    1. Multiple formats: Here you can tell Google the formats of the files you want to download. For example, if you want to download documents as .docx files or as PDFs.
    2. Advanced settings: Here you can tell Google to download additional data, including previous versions and the names of your folders.
    3. All Drive data included: Here you can select all data, or deselect specific folders if you want to.
  6. Scroll down to the bottom and click on Next Step.
  7. You’ll be prompted to specify your delivery method. Select Send download link via email.
  8. You can then specify your frequency. You can select a single export or an export every two months for a year. For our purposes, you can select a single export. (We’ll talk about options for backing up your data more frequently later.)
  9. Specify the file type and the file size you want to export.
    1. You can choose to have these files sent as a .zip file or a .tgz (tar) file. The main difference between the two options is that a .zip file compresses every file independently in the archive, but a .tgz file compresses the archive as a whole.
    2. The file size tells Google when to split your data into a separate file. Depending on the size of your data, Google may send you multiple emails with different sizes of files.
A screenshot of Google Drive settings showing where to set the frequency and file types of data downloads.
  10. Click Create export.
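If you’re curious about the .zip vs. .tgz distinction in the export step, here’s a small sketch using Python’s standard library (the filenames are made up for illustration): zipfile compresses each member independently, while tarfile bundles everything into one stream that gzip then compresses as a whole.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    tmp = Path(tmpdir)
    # Two small "exported" files with lots of shared content.
    for name in ("doc1.txt", "doc2.txt"):
        (tmp / name).write_text("the same boilerplate text\n" * 500)

    # .zip: each file is deflated on its own.
    with zipfile.ZipFile(tmp / "takeout.zip", "w", zipfile.ZIP_DEFLATED) as z:
        for name in ("doc1.txt", "doc2.txt"):
            z.write(tmp / name, arcname=name)

    # .tgz: the tar archive is built first, then gzipped as one stream.
    with tarfile.open(tmp / "takeout.tgz", "w:gz") as t:
        for name in ("doc1.txt", "doc2.txt"):
            t.add(tmp / name, arcname=name)

    zip_size = (tmp / "takeout.zip").stat().st_size
    tgz_size = (tmp / "takeout.tgz").stat().st_size
    print(zip_size, tgz_size)  # sizes differ because of where compression is applied
```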

When most people think about downloading the data they store in Google Drive, they’re thinking about the documents, photos, and other larger files they work with, but (as Google Takeout makes clear) you have a lot more data stored with Google outside of Drive.

Here’s why you might choose to export everything: 

  • To have a copy of bookmarked websites. 
  • To have a copy of emails that may contain files you’ve lost over time. 
  • To have a copy of important voicemails from loved ones in Google’s Voice product that you want to keep forever. 

Also, when you download all of your data it is a good reminder of what information Google has of yours.

After you click Create export, you’ll get an email in a few minutes, hours, or a couple of days, depending on the size of your data, informing you that your Google data is ready to download.

How to Back Up Your Computer

You now have your Google Drive data out of the Google Cloud and on your computer. Next, you’ll want to make sure it’s backed up. Your computer can fail just like Google, so simply downloading it isn’t enough. Protecting your newly downloaded Google data with a good cloud backup strategy should be the next thing you do.

Make sure to have at least three copies of your data: two local (one on your computer and one on a different storage medium, like a hard drive) and one off-site, which these days means in the cloud.

Note that when we’re using the word “cloud” here, we specifically mean that you’re backing up to the cloud. Often using a “cloud drive” means that you’re syncing, and, as the current data loss snafu at Google shows, there’s a big difference between sync and backup.

How to Back Up Google Drive

Downloading your data once and backing it all up is a good step. But, you’re adding documents to Google Drive all the time, and downloading your data manually can get tedious if you want to make sure your work is consistently and reliably backed up. 

Of course, as we noted above, you can set your Google Drive bulk download to a regular cadence. You’d still have to manually download your data and add it to your computer’s local storage, then back it up using the same method you would for your computer data. If you’re using Backblaze Computer Backup, which automatically runs in the background on your computer, those files would be backed up once they entered your local storage.

Still, that means that you have the possibility of losing files if your cadence isn’t frequent enough, and if you forget to manually download and replace those files sent to you in email, then you might run into trouble. 

Alternatively, there are a few services that will back up your Google Drive data for you. With something like Movebot, you can set up your Google Drive to sync and back up to a cloud storage service like Backblaze B2. If you’re a little more tech savvy, you can also use rclone to do the same thing. 

These tools are a bit more complex than using your Backblaze Computer Backup account, but you can configure these tools to back up your Google Drive at a frequency that makes sense for you to make sure new data is getting backed up as you add it.

Do you have any techniques on how you download your data from Google Drive or other Google products? Share them in the comments section below!


How do I download individual files from Google?

You can simply select the files you want to download, right click, and select Download.

How do I download my entire Google Drive?

You can use Google Takeout to download your entire Google Drive as well as any data you have in other Google services. Go to your account, click on Data & privacy, and click on Download your data to get started.

How do I back up my Google data once I download it?

You can back up your Google data once you’ve downloaded it to your computer by using a trusted computer backup service. Make sure to follow a 3-2-1 backup strategy by keeping at least two backups in addition to your data in Google Drive: one local, on your desktop or on a hard drive, and one in the cloud.

How do I back up my Google Drive?

There are many backup software services available to help you back up your Google Drive data. With something like Movebot, you can set up your Google Drive to sync and back up to a cloud storage service like Backblaze B2. If you’re a little more tech savvy, you can also use rclone to do the same thing.

The post How to Download Your Google Drive and Back Up Your Files appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloud 101: Data Egress Fees Explained

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/cloud-101-data-egress-fees-explained/

A decorative article showing a server, a cloud, and arrows pointing up and down with a dollar sign.

You can imagine data egress fees like tolls on a highway—your data is cruising along trying to get to its destination, but it has to pay a fee for the privilege of continuing its journey. If you have a lot of data to move, or a lot of toll booths (different cloud services) to move it through, those fees can add up quickly. 

Data egress fees are charges you incur for moving data out of a cloud service. They can be a big part of your cloud bill depending on how you use the cloud. And, they’re frequently a reason behind surprise AWS bills. So, let’s take a closer look at egress, egress fees, and ways you can reduce or eliminate them, so that your data can travel the cloud superhighways at will. 

What Is Data Egress?

In computing generally, data egress refers to the transfer or movement of data out of a given location, typically from within a network or system to an external destination. When it comes to cloud computing, egress generally means whenever data leaves the boundaries of a cloud provider’s network. 

In the simplest terms, data egress is the outbound flow of data.

A photo of a stair case with a sign that says "out" and an arrow pointing up.
The fees, like these stairs, climb higher. Source.

Egress vs. Ingress?

While egress pertains to data exiting a system, ingress refers to data entering a system. When you download something, you’re egressing data from a service. When you upload something, you’re ingressing data to that service. 

Unsurprisingly, most cloud storage providers do not charge you to ingress data—they want you to store your data on their platform, so why would they? 

Egress vs. Download

You might hear egress referred to as download, and that’s not wrong, but there are some nuances. Egress applies not only to downloads, but also when you migrate data between cloud services, for example. So, egress includes downloads, but it’s not limited to them. 

In the context of cloud service providers, the distinction between egress and download may not always be explicitly stated, and the terminology used can vary between providers. It’s essential to refer to the specific terms and pricing details provided by the service or platform you are using to understand how they classify and charge for data transfers.

How Do Egress Fees Work?

Data egress fees are charges incurred when data is transferred out of a cloud provider’s environment. These fees are often associated with cloud computing services, where users pay not only for the resources they consume within the cloud (such as storage and compute) but also for the data that is transferred from the cloud to external destinations.

There are a number of scenarios where a cloud provider typically charges egress: 

  • When you’re migrating data from one cloud to another.
  • When you’re downloading data from a cloud to a local repository.
  • When you move data between regions or zones with certain cloud providers. 
  • When an application, end user, or content delivery network (CDN) requests data from your cloud storage bucket. 

The fees can vary depending on the amount of data transferred and the destination of the data. For example, transferring data between regions within the same cloud provider’s network might incur lower fees than transferring data to the internet or to a different cloud provider.

Data egress fees are an important consideration for organizations using cloud services, and they can impact the overall cost of hosting and managing data in the cloud. It’s important to be aware of the pricing details related to data egress in the cloud provider’s pricing documentation, as these fees can contribute significantly to the total cost of using cloud services.

Why Do Cloud Providers Charge Egress Fees?

Both ingressing and egressing data costs cloud providers money. They have to build the physical infrastructure to allow users to do that, including switches, routers, fiber cables, etc. They also have to have enough of that infrastructure on hand to meet customer demand, not to mention staff to deploy and maintain it. 

However, it’s telling that most cloud providers don’t charge ingress fees, only egress fees. It would be hard to entice people to use your service if you charged them extra for uploading their data. But, once cloud providers have your data, they want you to keep it there. Charging you to remove it is one way cloud providers like AWS, Google Cloud, and Microsoft Azure do that. 

What Are AWS’s Egress Fees?

AWS S3 gives customers 100GB of data transfer out to the internet free each month, with some caveats—that 100GB excludes data stored in China and GovCloud. After that, the published rates for U.S. regions for data transferred over the public internet are as follows as of the date of publication:

  • The first 10TB per month is $0.09 per GB.
  • The next 40TB per month is $0.085 per GB.
  • The next 100TB per month is $0.07 per GB.
  • Anything greater than 150TB per month is $0.05 per GB. 
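Those tiers stack, so a bill isn’t a single flat rate. Here’s a sketch of the tiered math using the published rates above (assuming 1TB = 1,000GB for simplicity; AWS bills per GB, and the 100GB free allowance comes off the top):

```python
# (tier size in GB, $ per GB) for U.S.-region internet egress.
TIERS = [
    (10_000, 0.09),        # first 10TB
    (40_000, 0.085),       # next 40TB
    (100_000, 0.07),       # next 100TB
    (float("inf"), 0.05),  # beyond 150TB
]

def egress_cost(gb: float, free_gb: float = 100) -> float:
    """Monthly egress cost in dollars for `gb` transferred out."""
    remaining = max(gb - free_gb, 0)
    cost = 0.0
    for tier_gb, rate in TIERS:
        chunk = min(remaining, tier_gb)
        cost += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return round(cost, 2)

print(egress_cost(1_000))    # 1TB out:  81.0
print(egress_cost(50_000))   # 50TB out: 4291.5
```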

But AWS also charges customers egress between certain services and regions, and it can get complicated quickly as the following diagram shows…


How Can I Reduce Egress Fees?

If you’re using cloud services, minimizing your egress fees is probably a high priority. Companies like the Duckbill Group (the creators of the diagram above) exist to help businesses manage their AWS bills. In fact, there’s a whole industry of consultants that focuses solely on reducing your AWS bills. 

Aside from hiring a consultant to help you spend less, there are a few simple ways to lower your egress fees:

  1. Use a content delivery network (CDN): If you’re hosting an application, using a CDN can lower your egress fees since a CDN will cache data on edge servers. That way, when a user sends a request for your data, it can pull it from the CDN server rather than your cloud storage provider where you would be charged egress. 
  2. Optimize data transfer protocols: Choose efficient data transfer protocols that minimize the amount of data transmitted. For example, consider using compression or delta encoding techniques to reduce the size of transferred files. Compressing data before transfer can reduce the volume of data sent over the network, leading to lower egress costs. However, the effectiveness of compression depends on the nature of the data.
  3. Utilize integrated cloud providers: Some cloud providers offer free data transfer with a range of other cloud partners. (Hint: that’s what we do here at Backblaze!)
  4. Be aware of tiering: It may sound enticing to opt for a cold(er) storage tier to save on storage, but some of those tiers come with much higher egress fees. 
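Tip #2 is easy to see in action. Here’s a quick sketch with Python’s gzip module using made-up log lines; repetitive text like this compresses dramatically, while already-compressed media barely shrinks at all:

```python
import gzip

# Highly repetitive data (access logs) compresses very well.
log_data = ("2024-01-15 GET /api/v1/files 200\n" * 10_000).encode()
compressed = gzip.compress(log_data)

ratio = len(compressed) / len(log_data)
print(f"{len(log_data)} -> {len(compressed)} bytes ({ratio:.1%} of original)")
```

Every byte you don’t send is a byte you aren’t charged egress for.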

How Does Backblaze Reduce Egress Fees?

There’s one more way you can drastically reduce egress, and we’ll just come right out and say it: Backblaze gives you free egress up to 3x the average monthly storage and unlimited free egress through a number of CDN and compute partners, including Fastly, Cloudflare, Bunny.net, and Vultr.

Why do we offer free egress? Supporting an open cloud environment is central to our mission, so we expanded free egress to all customers so they can move data when and where they prefer. Cloud providers like AWS and others charge high egress fees that make it expensive for customers to use multi-cloud infrastructures and therefore lock in customers to their services. These walled gardens hamper innovation and long-term growth.

Free Egress = A Better, Multi-Cloud World

The bottom line: the high egress fees charged by hyperscalers like AWS, Google, and Microsoft are a direct impediment to a multi-cloud future driven by customer choice and industry need. And, a multi-cloud future is something we believe in. So go forth and build the multi-cloud future of your dreams, and leave worries about high egress fees in the past. 

The post Cloud 101: Data Egress Fees Explained appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Digging Deeper Into Object Lock

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/digging-deeper-into-object-lock/

A decorative image showing data inside of a vault.

Using Object Lock for your data is a smart choice—you can protect your data from ransomware, meet compliance requirements, beef up your security policy, or preserve data for legal reasons. But, it’s not a simple on/off switch, and accidentally locking your data for 100 years is a mistake you definitely don’t want to make.

Today we’re taking a deeper dive into Object Lock and the related legal hold feature, examining the different levels of control that are available, explaining why developers might want to build Object Lock into their own applications, and showing exactly how to do that. While the code samples are aimed at our developer audience, anyone looking for a deeper understanding of Object Lock should be able to follow along.

I presented a webinar on this topic earlier this year that covers much the same ground as this blog post, so feel free to watch it instead of, or in addition to, reading this article. 

Check Out the Docs

For even more information on Object Lock, check out our Object Lock overview in our Technical Documentation Portal as well as these how-tos about how to enable Object Lock using the Backblaze web UI, Backblaze B2 Native API, and the Backblaze S3 Compatible API:

What Is Object Lock?

In the simplest explanation, Object Lock is a way to lock objects (aka files) stored in Backblaze B2 so that they are immutable—that is, they cannot be deleted or modified, for a given period of time, even by the user account that set the Object Lock rule. Backblaze B2’s implementation of Object Lock was originally known as File Lock, and you may encounter the older terminology in some documentation and articles. For consistency, I’ll use the term “object” in this blog post, but in this context it has exactly the same meaning as “file.”

Object Lock is a widely offered feature included with backup applications such as Veeam and MSP360, allowing organizations to ensure that their backups are not vulnerable to deliberate or accidental deletion or modification for some configurable retention period.

Ransomware mitigation is a common motivation for protecting data with Object Lock. Even if an attacker were to compromise an organization’s systems to the extent of accessing the application keys used to manage data in Backblaze B2, they would not be able to delete or change any locked data. Similarly, Object Lock guards against insider threats, where the attacker may try to abuse legitimate access to application credentials.

Object Lock is also used in industries that store sensitive or personally identifiable information (PII), such as banking, education, and healthcare. Because they work with such sensitive data, regulatory requirements dictate that data be retained for a given period of time, but data must also be deleted in particular circumstances.

For example, the General Data Protection Regulation (GDPR), an important component of the EU’s privacy laws and an international regulatory standard that drives best practices, may dictate that some data must be deleted when a customer closes their account. A related use case is where data must be preserved due to litigation, where the period for which data must be locked is not fixed and depends on the type of lawsuit at hand. 

To handle these requirements, Backblaze B2 offers two Object Lock modes—compliance and governance—as well as the legal hold feature. Let’s take a look at the differences between them.

Compliance Mode: Near-Absolute Immutability

When objects are locked in compliance mode, not only can they not be deleted or modified while the lock is in place, but the lock also cannot be removed during the specified retention period. It is not possible to remove or override the compliance lock to delete locked data until the lock expires, whether you’re attempting to do so via the Backblaze web UI or either of the S3 Compatible or B2 Native APIs. Similarly, Backblaze Support is unable to unlock or delete data locked under compliance mode in response to a support request, which is a safeguard designed to address social engineering attacks where an attacker impersonates a legitimate user.

What if you inadvertently lock many terabytes of data for several years? Are you on the hook for thousands of dollars of storage costs? Thankfully, no—you have one escape route, which is to close your Backblaze account. Closing the account is a multi-step process that requires access to both the account login credentials and two-factor verification (if it is configured) and results in the deletion of all data in that account, locked or unlocked. This is a drastic step, so we recommend that developers create one or more “burner” Backblaze accounts for use in developing and testing applications that use Object Lock, that can be closed if necessary without disrupting production systems.

There is one lock-related operation you can perform on compliance-locked objects: extending the retention period. In fact, you can keep extending the retention period on locked data any number of times, protecting that data from deletion until you let the compliance lock expire.
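Via the S3 Compatible API, extending the lock is just another retention call with a later date. Here is a sketch of the request parameters (the bucket and file names are placeholders, not values from this post):

```python
from datetime import datetime, timezone

# Hypothetical request parameters for extending a compliance lock.
# The new RetainUntilDate must be later than the current one, since
# compliance retention can be extended but never shortened.
extend_args = {
    'Bucket': 'my-bucket-name',
    'Key': 'my-file',
    'Retention': {
        'Mode': 'COMPLIANCE',
        'RetainUntilDate': datetime(2026, 1, 1, tzinfo=timezone.utc),
    },
}

# With a configured boto3 client you would then call:
# s3_client.put_object_retention(**extend_args)
```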

Governance Mode: Override Permitted

In our other Object Lock option, objects can be locked in governance mode for a given retention period. But, in contrast to compliance mode, the governance lock can be removed or overridden via an API call, if you have an application key with appropriate capabilities. Governance mode handles use cases that require retention of data for some fixed period of time, with exceptions for particular circumstances.

When I’m trying to remember the difference between compliance and governance mode, I think of the phrase, “Twenty seconds to comply!”, uttered by the ED-209 armed robot in the movie “RoboCop.” It turned out that there was no way to override ED-209’s programming, with dramatic, and fatal, consequences.

ED-209: as implacable as compliance mode.

Legal Hold: Flexible Preservation

While the compliance and governance retention modes lock objects for a given retention period, legal hold is more like a toggle switch: you can turn it on and off at any time, again with an application key with sufficient capabilities. As its name suggests, legal hold is ideal for situations where data must be preserved for an unpredictable period of time, such as while litigation is proceeding.
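In the S3 Compatible API, the toggle is the `put_object_legal_hold` operation, which takes a `Status` of `ON` or `OFF`. A minimal sketch of the request parameters (names are placeholders):

```python
# Hypothetical request parameters for toggling legal hold on and off.
hold_on_args = {
    'Bucket': 'my-bucket-name',
    'Key': 'my-file',
    'LegalHold': {'Status': 'ON'},
}
# The same request with the hold released.
hold_off_args = {**hold_on_args, 'LegalHold': {'Status': 'OFF'}}

# With a configured boto3 client:
# s3_client.put_object_legal_hold(**hold_on_args)   # enable the hold
# s3_client.put_object_legal_hold(**hold_off_args)  # release the hold
```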

The compliance and governance modes are mutually exclusive, which is to say that only one may be in operation at any time. Objects locked in governance mode can be switched to compliance mode, but, as you might expect from the above explanation, objects locked in compliance mode cannot be switched to governance mode until the compliance lock expires.

Legal hold, on the other hand, operates independently, and can be enabled and disabled regardless of whether an object is locked in compliance or governance mode.

How does this work? Consider an object that is locked in compliance or governance mode and has legal hold enabled:

  • If the legal hold is removed, the object remains locked until the retention period expires.
  • If the retention period expires, the object remains locked until the legal hold is removed.
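The two rules above boil down to a single predicate: an object version is deletable only when its retention period has expired and no legal hold is set. A toy model of that logic (illustrative only, not a Backblaze API):

```python
from datetime import datetime, timezone

def is_deletable(retain_until, legal_hold, now=None):
    """An object version can be deleted only when its retention period
    has expired AND no legal hold is in place (illustrative model)."""
    now = now or datetime.now(timezone.utc)
    retention_expired = retain_until is None or now >= retain_until
    return retention_expired and not legal_hold

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
expired = datetime(2024, 1, 1, tzinfo=timezone.utc)
future = datetime(2025, 1, 1, tzinfo=timezone.utc)

print(is_deletable(expired, legal_hold=False, now=now))  # True
print(is_deletable(expired, legal_hold=True, now=now))   # False: hold still on
print(is_deletable(future, legal_hold=False, now=now))   # False: lock not expired
```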

Object Lock and Versioning

By default, Backblaze B2 Buckets have versioning enabled, so as you upload successive objects with the same name, previous versions are preserved automatically. None of the Object Lock modes prevent you from uploading a new version of a locked object; the lock is specific to the object version to which it was applied.

You can also hide a locked object so it doesn’t appear in object listings. The hidden version is retained and can be revealed using the Backblaze web UI or an API call.

As you might expect, locked object versions are not subject to deletion by lifecycle rules—any attempt to delete a locked object version via a lifecycle rule will fail.

How to Use Object Lock in Applications

Now that you understand the two modes of Object Lock, plus legal hold, and how they all work with object versions, let’s look at how you can take advantage of this functionality in your applications. I’ll include code samples for Backblaze B2’s S3 Compatible API written in Python, using the AWS SDK, aka Boto3, in this blog post. You can find details on working with Backblaze B2’s Native API in the documentation.

Application Key Capabilities for Object Lock

Every application key you create for Backblaze B2 has an associated set of capabilities; each capability allows access to specific functionality in Backblaze B2. There are seven capabilities relevant to Object Lock and legal hold.

Two capabilities relate to bucket settings:

  1. readBucketRetentions 
  2. writeBucketRetentions

Three capabilities relate to object settings for retention: 

  1. readFileRetentions 
  2. writeFileRetentions 
  3. bypassGovernance

And, two are specific to legal hold: 

  1. readFileLegalHolds 
  2. writeFileLegalHolds 

The Backblaze B2 documentation contains full details of each capability and the API calls it relates to for both the S3 Compatible API and the B2 Native API.

When you create an application key via the web UI, it is assigned capabilities according to whether you allow it access to all buckets or just a single bucket, and whether you assign it read-write, read-only, or write-only access.

An application key created in the web UI with read-write access to all buckets will receive all of the above capabilities. A key with read-only access to all buckets will receive readBucketRetentions, readFileRetentions, and readFileLegalHolds. Finally, a key with write-only access to all buckets will receive bypassGovernance, writeBucketRetentions, writeFileRetentions, and writeFileLegalHolds.

In contrast, an application key created in the web UI restricted to a single bucket is not assigned any of the above capabilities. When an application using such a key uploads objects to its associated bucket, those objects receive the bucket's default retention mode and period, if they have been set. The application is not able to select a different retention mode or period when uploading an object, change the retention settings on an existing object, or bypass governance when deleting an object.

You may want to create application keys with more granular permissions when working with Object Lock and/or legal hold. For example, you may need an application restricted to a single bucket to be able to toggle legal hold for objects in that bucket. You can use the Backblaze B2 CLI to create an application key with this, or any other set of capabilities. This command, for example, creates a key with the default set of capabilities for read-write access to a single bucket, plus the ability to read and write the legal hold setting:

% b2 create-key --bucket my-bucket-name my-key-name listBuckets,readBuckets,listFiles,readFiles,shareFiles,writeFiles,deleteFiles,readBucketEncryption,writeBucketEncryption,readBucketReplications,writeBucketReplications,readFileLegalHolds,writeFileLegalHolds

Enabling Object Lock

You must enable Object Lock on a bucket before you can lock any objects therein; you can do this when you create the bucket, or at any time later, but you cannot disable Object Lock on a bucket once it has been enabled. Here’s how you create a bucket with Object Lock enabled:
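Using boto3 against the S3 Compatible API, this is a minimal sketch of the request; the bucket name is a placeholder, and `s3_client` is assumed to be a client configured with your Backblaze B2 endpoint and application key:

```python
# Request parameters for creating a bucket with Object Lock enabled;
# the flag cannot be turned off again once the bucket exists.
create_bucket_args = {
    'Bucket': 'my-bucket-name',
    'ObjectLockEnabledForBucket': True,
}

# With a configured boto3 client:
# s3_client.create_bucket(**create_bucket_args)
```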


Once a bucket has Object Lock enabled, you can configure a default retention mode and period for objects created in that bucket. Only compliance mode is configurable from the web UI, but you can set governance mode as the default via an API call, like this:

s3_client.put_object_lock_configuration(
    Bucket='my-bucket-name',
    ObjectLockConfiguration={
        'ObjectLockEnabled': 'Enabled',
        'Rule': {
            'DefaultRetention': {
                'Mode': 'GOVERNANCE',
                'Days': 7
            }
        }
    }
)
You cannot set legal hold as a default configuration for the bucket.

Locking Objects

Regardless of whether you set a default retention mode for the bucket, you can explicitly set a retention mode and period when you upload objects, or apply the same settings to existing objects, provided you use an application key with the appropriate writeFileRetentions or writeFileLegalHolds capability.

Both the S3 PutObject operation and Backblaze B2’s b2_upload_file include optional parameters for specifying retention mode and period, and/or legal hold. For example:

from datetime import datetime

s3_client.put_object(
    Bucket='my-bucket-name',
    Key='my-file',
    Body=open('/path/to/local/file', mode='rb'),
    ObjectLockMode='GOVERNANCE',  # or 'COMPLIANCE'
    ObjectLockRetainUntilDate=datetime(
        2023, 9, 7, hour=10, minute=30, second=0
    )
)

Both APIs implement additional operations to get and set retention settings and legal hold for existing objects. Here’s an example of how you apply a governance mode lock:

s3_client.put_object_retention(
    Bucket='my-bucket-name',
    Key='my-file',
    VersionId=version_id,  # optional
    Retention={
        'Mode': 'GOVERNANCE',  # Required, even if mode is not changed
        'RetainUntilDate': datetime(
            2023, 9, 5, hour=10, minute=30, second=0
        )
    }
)

The VersionId parameter is optional: the operation applies to the current object version if it is omitted.

You can also use the web UI to view, but not change, an object’s retention settings, and to toggle legal hold for an object:

A screenshot highlighting where to enable Object Lock via the Backblaze web UI.

Deleting Objects in Governance Mode

As mentioned above, a key difference between the compliance and governance modes is that it is possible to override governance mode to delete an object, given an application key with the bypassGovernance capability. To do so, you must identify the specific object version, and pass a flag to indicate that you are bypassing the governance retention restriction:

# Get object details, including version id of current version
object_info = s3_client.head_object(
    Bucket='my-bucket-name',
    Key='my-file'
)

# Delete the most recent object version, bypassing governance
s3_client.delete_object(
    Bucket='my-bucket-name',
    Key='my-file',
    VersionId=object_info['VersionId'],
    BypassGovernanceRetention=True
)

There is no way to delete an object in legal hold; the legal hold must be removed before the object can be deleted.

Protect Your Data With Object Lock and Legal Hold

Object Lock is a powerful feature, and with great power… you know the rest. Here are some of the questions you should ask when deciding whether to implement Object Lock in your applications:

  • What would be the impact of malicious or accidental deletion of your application’s data?
  • Should you lock all data according to a central policy, or allow users to decide whether to lock their data, and for how long?
  • If you are storing data on behalf of users, are there special circumstances where a lock must be overridden?
  • Which users should be permitted to set and remove a legal hold? Does it make sense to build this into the application rather than have an administrator use a tool such as the Backblaze B2 CLI to manage legal holds?

If you already have a Backblaze B2 account, you can start working with Object Lock today; otherwise, create an account to get started.

The post Digging Deeper Into Object Lock appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

What Is Hybrid Cloud?

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/confused-about-the-hybrid-cloud-youre-not-alone/

An illustration of clouds computers and servers.
Editor’s note: This post has been updated since it was originally published in 2017.

The term hybrid cloud has been around for a while—we originally published this explainer in 2017. But time hasn’t necessarily made things clearer. Maybe you hear folks talk about your company’s hybrid cloud approach, but what does that really mean? If you’re confused about the hybrid cloud, you’re not alone. 

Hybrid cloud is a computing approach that uses both private and public cloud resources with some kind of orchestration between them. The term has been applied to a wide variety of IT solutions, so it’s no wonder the concept breeds confusion. 

In this post, we’ll explain what a hybrid cloud is, how it can benefit your business, and how to choose a cloud storage provider for your hybrid cloud strategy.

What Is the Hybrid Cloud?

A hybrid cloud is an infrastructure approach that uses both private and public resources. Let’s first break down those key terms:

  • Public cloud: When you use a public cloud, you are storing your data in another company’s internet-accessible data center. A public cloud service allows anybody to sign up for an account, and share data center resources with other customers or tenants. Instead of worrying about the costs and complexity of operating an on-premises data center, a cloud storage user only needs to pay for the cloud storage they need.
  • Private cloud: In contrast, a private cloud is specifically designed for a single tenant. Think of a private cloud as a permanently reserved private dining room at a restaurant—no other customer can use that space. As a result, private cloud services can be more expensive than public clouds. Traditionally, private clouds lived on on-premises infrastructure, meaning they were built and maintained on company property. Now, private clouds can be maintained and managed on-premises by an organization or by a third party in a data center. The key defining factor is that the cloud is dedicated to a single tenant or organization.

Those terms are important to know to understand the hybrid cloud architecture approach. Hybrid clouds are defined by a combined management approach, which means there is some type of orchestration between the private and public environments that allows workloads and data to move between them in a flexible way as demands, needs, and costs change. This gives you flexibility when it comes to data deployment and usage.  

In other words, if you have some IT resources on-premises that you are replicating or sharing with an external vendor—congratulations, you have a hybrid cloud!

Hybrid cloud refers to a computing architecture that is made up of both private cloud resources and public cloud resources with some kind of orchestration between them.

Hybrid Cloud Examples

Here are a few examples of how a hybrid cloud can be used:

  1. As an active archive: You might establish a protocol that says all accounting files that have not been changed in the last year, for example, are automatically moved off-premises to a cloud storage archive to save costs and reduce the amount of storage needed on-site. You can still access the files; they are just no longer stored on your local systems. 
  2. To meet compliance requirements: Let’s say some of your data is subject to strict data privacy requirements, but other data you manage isn’t as closely protected. You could keep highly regulated data on premises in a private cloud and the rest of your data in a public cloud. 
  3. To scale capacity: If you’re in an industry that experiences seasonal or frequent spikes like retail or ecommerce, these spikes can be handled by a public cloud which provides the elasticity to deal with times when your data needs exceed your on-premises capacity.
  4. For digital transformation: A hybrid cloud lets you adopt cloud resources in a phased approach as you expand your cloud presence.

Hybrid Cloud vs. Multi-cloud: What’s the Diff?

You wouldn’t be the first person to think that the terms multi-cloud and hybrid cloud sound similar. Both approaches involve using multiple clouds. However, multi-cloud combines two or more clouds of the same type (e.g., two or more public clouds), while hybrid cloud combines a private cloud with a public cloud. One cloud approach is not necessarily better than the other—they simply serve different use cases. 

For example, let’s say you’ve already invested in significant on-premises IT infrastructure, but you want to take advantage of the scalability of the cloud. A hybrid cloud solution may be a good fit for you. 

Alternatively, a multi-cloud approach may work best for you if you are already in the cloud and want to mitigate the risk of a single cloud provider having outages or issues. 

Hybrid Cloud Benefits

A hybrid cloud approach allows you to take advantage of the best elements of both private and public clouds. The primary benefits are flexibility, scalability, and cost savings.

Benefit 1: Flexibility and Scalability

One of the top benefits of the hybrid cloud is its flexibility. Managing IT infrastructure on-premises can be time-consuming and expensive, and adding capacity requires advance planning, procurement, and upfront investment.

The public cloud is readily accessible and able to provide IT resources whenever needed on short notice. For example, the term “cloud bursting” refers to the on-demand and temporary use of the public cloud when demand exceeds resources available in the private cloud. A private cloud, on the other hand, provides the absolute fastest access speeds since it is generally located on-premises. (But cloud providers are catching up fast, for what it’s worth.) For data that is needed with the absolute lowest levels of latency, it may make sense for the organization to use a private cloud for current projects and store an active archive in a less expensive, public cloud.

Benefit 2: Cost Savings

Within the hybrid cloud framework, the public cloud segment offers cost-effective IT resources, eliminating the need for upfront capital expenses and associated labor costs. IT professionals gain the flexibility to optimize configurations, choose the most suitable service provider, and determine the optimal location for each workload. This strategic approach reduces costs by aligning resources with specific tasks. Furthermore, the ability to easily scale, redeploy, or downsize services enhances efficiency, curbing unnecessary expenses and contributing to overall cost savings.

Comparing Private vs. Hybrid Cloud Storage Costs

To understand the difference in storage costs between a purely on-premises solution and a hybrid cloud solution, we’ll present two scenarios. For each scenario, we’ll use data storage amounts of 100TB, 1PB, and 2PB. Each table uses the same format; the only thing we’ve changed is how the data is distributed: private (on-premises) or public (off-premises). We are using the costs for our own Backblaze B2 Cloud Storage in this example. The math can be adapted for any set of numbers you wish to use.

Scenario 1: 100% of data on-premises storage

                                            100TB     1,000TB    2,000TB
  Data stored on-premises (100%):           100TB     1,000TB    2,000TB
  On-premises cost, low ($12/TB/month):     $1,200    $12,000    $24,000
  On-premises cost, high ($20/TB/month):    $2,000    $20,000    $40,000

Scenario 2: 20% of data on-premises with 80% public cloud storage (Backblaze B2)

                                            100TB     1,000TB    2,000TB
  Data stored on-premises (20%):            20TB      200TB      400TB
  Data stored in the cloud (80%):           80TB      800TB      1,600TB
  On-premises cost, low ($12/TB/month):     $240      $2,400     $4,800
  On-premises cost, high ($20/TB/month):    $400      $4,000     $8,000
  Cloud cost, low ($6/TB/month):            $480      $4,800     $9,600
  Cloud cost, high ($20/TB/month):          $1,600    $16,000    $32,000
  Total monthly cost, low:                  $720      $7,200     $14,400
  Total monthly cost, high:                 $2,000    $20,000    $40,000

As you can see, using a hybrid cloud solution and storing 80% of the data in the cloud with a provider like Backblaze B2 can result in significant savings over storing only on-premises.
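The arithmetic behind these tables is simple enough to sketch (rates are in $/TB/month, and the sample prices are the ones used above):

```python
def hybrid_monthly_cost(total_tb, on_prem_fraction, on_prem_rate, cloud_rate):
    """Monthly storage cost for a given on-premises/cloud split."""
    on_prem_tb = total_tb * on_prem_fraction
    cloud_tb = total_tb - on_prem_tb
    return on_prem_tb * on_prem_rate + cloud_tb * cloud_rate

# Scenario 2, low end: 100TB split 20/80 at $12/TB on-premises, $6/TB cloud
print(hybrid_monthly_cost(100, 0.20, 12, 6))   # 720.0
# Scenario 1, low end: everything on-premises at $12/TB
print(hybrid_monthly_cost(100, 1.00, 12, 6))   # 1200.0
```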

Choosing a Cloud Storage Provider for Your Hybrid Cloud

Okay, so you understand the benefits of using a hybrid cloud approach, what next? Determining the right mix of cloud services may be intimidating because there are so many public cloud options available. Fortunately, there are a few decision factors you can use to simplify setting up your hybrid cloud solution. Here’s what to think about when choosing a public cloud storage provider:

  • Ease of use: Avoiding a steep learning curve can save you hours of work effort in managing your cloud deployments. By contrast, overly complicated pricing tiers or bells and whistles you don’t need can slow you down.
  • Data security controls: Compare how each cloud provider facilitates proper data controls. For example, take a look at features like authentication, Object Lock, and encryption.
  • Data egress fees: Some cloud providers charge additional fees for data egress (i.e., removing data from the cloud). These fees can make it more expensive to switch between providers. In addition to fees, check the data speeds offered by the provider.
  • Interoperability: Flexibility and interoperability are key reasons to use cloud services. Before signing up for a service, understand the provider’s integration ecosystem. A lack of needed integrations may place a greater burden on your team to keep the service running effectively.
  • Storage tiers: Some providers offer different storage tiers where you sacrifice access for lower costs. While the promise of inexpensive cold storage can be attractive, evaluate whether you can afford to wait hours or days to retrieve your data.
  • Pricing transparency: Pay careful attention to the cloud provider’s pricing model and tier options. Consider building a spreadsheet to compare a shortlist of cloud providers’ pricing models.

When Hybrid Cloud Might Not Always Be the Right Fit

The hybrid cloud may not always be the optimal solution, particularly for smaller organizations with limited IT budgets that might find a purely public cloud approach more cost-effective. The substantial setup and operational costs of private servers could be prohibitive.

A thorough understanding of workloads is crucial to effectively tailor the hybrid cloud, ensuring the right blend of private, public, and traditional IT resources for each application and maximizing the benefits of the hybrid cloud architecture.

So, Should You Go Hybrid?

Big picture, anything that helps you respond to IT demands quickly, easily, and affordably is a win. With a hybrid cloud, you can avoid some big up-front capital expenses for in-house IT infrastructure, making your CFO happy. Being able to quickly spin up IT resources as they’re needed will appeal to the CTO and VP of operations.

So, given all that, we’ve arrived at the bottom line: should you or your organization embrace hybrid cloud infrastructure? According to Flexera’s 2023 State of the Cloud report, 72% of enterprises utilize a hybrid cloud strategy. That indicates that the benefits of the hybrid cloud appeal to a broad range of companies.

If an organization approaches implementing a hybrid cloud solution with thoughtful planning and a structured approach, a hybrid cloud can deliver on-demand flexibility, empower legacy systems and applications with new capabilities, and become a catalyst for digital transformation. The result can be an elastic and responsive infrastructure that quickly adapts to the changing demands of the business.

As data management professionals increasingly recognize the advantages of the hybrid cloud, we can expect more and more of them to embrace it as an essential part of their IT strategy.

Tell Us What You’re Doing With the Hybrid Cloud

Are you currently embracing the hybrid cloud, or are you still uncertain or hanging back because you’re satisfied with how things are currently? We’d love to hear your comments below on how you’re approaching your cloud architecture decisions.

FAQs About Hybrid Cloud

What exactly is a hybrid cloud?

Hybrid cloud is a computing approach that uses both private and public cloud resources with some kind of orchestration between them.

What is the difference between hybrid and multi-cloud?

Multi-cloud combines two or more clouds of the same type (e.g., two or more public clouds), while hybrid cloud combines a private cloud with a public cloud. One cloud approach is not necessarily better than the other—they simply serve different use cases.

What is a hybrid cloud architecture?

Hybrid cloud architecture is any kind of IT architecture that combines both the public and private clouds. Many organizations use this term to describe specific software products that provide solutions which combine the two types of clouds.

What are hybrid clouds used for?

Organizations will often use hybrid clouds to create redundancy and scalability for their computing workload. A hybrid cloud is a great way for a company to have extra fallback options to continue offering services even when they have higher than usual levels of traffic, and it can also help companies scale up their services over time as they need to offer more options.

The post What Is Hybrid Cloud? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Holiday Gift Guide 2023

Post Syndicated from Yev original https://www.backblaze.com/blog/holiday-gift-guide-2023/

A decorative image showing exciting images falling out of a present.  The title reads Holiday Gift Guide.

The holidays are fast approaching and with them the many cyber sales that provide both inspiration and opportunity for gift giving on any budget. To help narrow the field, every year I ask my fellow Backblazers to submit the gifts that they are looking forward to both gifting and receiving. (Hopefully some of their loved ones read the blog?) And of course, I’ve sprinkled in a few of my favorites as well. Without further ado, here’s what we suggest looking into for your 2023 gift giving!

Health and Wellness

Oura Ring

A decorative image showing several of the Oura ring models.

This little thing is pretty neat. It helps you keep tabs on your health, tracking everything from sleep to stress levels. It lasts for a week on a single charge and is super easy on the eyes, so you’ll want to wear it all over the place.

Garmin InReach Mini

An image of a Garmin InReach Mini.

We have a lot of hikers, joggers, and runners at Backblaze and, as firm believers in thinking about your backup options before a disaster, the Garmin is an awesome-to-have trail buddy.

Drinks On Me (You?)

Yeti Cocktail Shaker

A product image of a Yeti cocktail shaker shown in red.

While a cocktail shaker is a pretty common household item, this one is sure to impress. Ask questions like, “Could my drink possibly get any colder and stay that way?” and “Can I customize my shaker with a sticker of my cat’s face?” And the Yeti’s answer is yes. Also, you know we love when a product comes in red. 

The Durand

A decorative image of a Durand removing corks from an old bottle of wine.

Wine anyone? If you or someone in your life is a big wino, older wines are a delicious treat, with a potentially fatal stumbling block: old, crumbly corks. The Durand corkscrew helps take them out with no breakage.

Coravin Timeless Three

An image showing a Coravin attached to a wine bottle pouring wine into a glass.

Another one for winos, the Coravin is an incredible wine system that uses tiny needles and argon gas to pour wine into your glass without having to actually open it. I can personally vouch for this one as a single human who has nice wine bottles and often wants a single glass once or twice a week.

Japanese Matcha Tea Set

A decorative images showing someone making matcha tea.

Tea time is a dreamy time and this matcha set allows you to make yourself a traditional cup. And if you need some matcha powder for it, this one comes highly recommended: Organic Ceremonial Grade Matcha Powder.

Jet Boil Camping Stove

A decorative image showing a JetBoil camper heater setup.

Tea and coffee at a campsite are a must-have, and if you’ve never tried a Jet Boil, this model is easy to use. Also helpful for those times where you lose power and need to make some hot water in a hurry.

Food’s Good

Sous Vide

A product image of a sous vide kitchen appliance.

Foodies know and love the sous vide method, a.k.a. low temperature, long time (LTLT). If you’re into cooking your food in a hot tub, you’ll be happy to know that this accessory has come down in price dramatically over time. We like this version of a kitchen appliance, but there is certainly a wide world of sous vide gadgets out there if you’re interested. 

Ooni Pizza Oven

A product image of an Ooni pizza oven.

Pizza night gets fancier with this pizza oven that can make you a Neapolitan style pizza in less than five minutes. You gotta love that efficiency. 

Goldbelly Iconic Meal Kits

An image of the Goldbelly website showing iconic meal kits.

Love fancy foods but can’t travel to get them? Goldbelly has become the go-to for nationwide delivery of local favorites, and they now do meal kits as well. We’re not going to say you should give up on your standard, probably nutritionally balanced Hello Freshes of the world, but we will say that these are a whole lot more, well, iconic.


A product image of a hydroponic garden.

Have your own mini-garden whether you’re in a house or an apartment. With just a little bit of counter space, a semi-green thumb, some patience, and water, you’ll never have herbs go bad in your fridge again. 

Games and Gaming

Steam Deck OLED, Lenovo Legion Go, & ROG Ally

Not since the times of the Game Boy Advance or maybe the Nintendo 3DS have handheld gaming systems seen such a rise in popularity. Along with the Nintendo Switch, these three handhelds bring the power of a computer to your fingertips on the go. While it’s not quite a gaming rig, it’s good enough for most airline flights, and hey…they’ll all play Baldur’s Gate 3. 

D&D Starter Set

It’s a great time to be a nerd. Critical Role, Dimension 20, The Adventure Zone, and many more role-playing games (RPGs) are super popular nowadays, and it’s high time you take part. Get the D&D Starter Set, some dice, and your soon-to-be best friends; create your character and get rolling.


Ororo Heated Vest

A product image of an Ororo heated vest.

Backblaze is based in California, but that doesn’t mean that we don’t know about weather. (What’s this wet stuff falling from the sky again?) That said, as a Midwesterner by heritage (dontcha know), I know something about staying warm. Heated clothes take the benefits of your favorite heated blanket and give them to you on the go. 


A product image of a selk'bag.

Camping? Walking? Freezing? How about a sleeping bag that you can walk in, eh?

Hats, Fanny Packs, & Bomber Jackets From Lower Park

A screenshot of the Lower Park website showing a lovely bomber jacket.

We’re all about being good community members, and this local (to us) company makes hats, fanny packs, and bomber jackets using environmentally friendly materials. They’re good products, in more ways than one.


Breathing Buddy

A product image showing how to meditate.

Studies have shown that meditation has measurable benefits for your mind and body. There are a plethora of tools out there to help you build good habits (see below), but this one is stinkin’ cute. Let this little guy help visually take you through a guided meditation. Bonus: it’s a great gift for kids, too.


The Calm app helps people stay mindful with everything from guided meditation to celebrity-read stories. We’re big fans of their social posts that just encourage you to take a 15 second break—it’s a positive interruption to the doomscroll effect, and a great way to preview some of the app’s content.

Watch and Listen


A product image showing several Skylight frames.

A twist on photo frames: you can send pictures to it and have all of your favorite memories staring back at you when you look over. Or, send photos to anyone, anywhere. Definitely some potential prank opportunities to be had; but it’s also a great way to keep in touch with far-flung family members. 

Sonos Surround Set With Beam

A product image of a Sonos surround kit.

Sonos surround systems are a great addition to homes. Multiple speakers can sync up to make sure that you’re never far away from rocking out to Weird Al, no matter where you are in the house.

Ikea FREKVENS (Sound Activated Lightbox)

An Ikea sound-activated lightbox.

Music’s always better with light shows and this lightbox from Ikea matches beats and keeps things groovy. Yet another reason to love Ikea!

Apple AirPods Max

An image of Apple AirPods Max.

For the audiophiles in your life, the AirPods Max are the over-the-ear variant of the traditional AirPod. They’re much harder to lose, giving you that impressive combo of sound and noise cancellation you’ve come to expect.

Pixel Buds Pro

A product image of Pixel buds.

To balance the scales for our Android lovers, here are Google’s in-ear buds. They have a lot of bells and whistles, including noise cancellation and built-in Google Assistant. Now when you talk to yourself, someone will answer. (That’s a good thing, right?)


A product image of a Lego typewriter kit.

LEGO is having a bit of a moment (at least in my family) and we have spent a lot of time building complicated models. For the adults in your life that love to tinker, we recommend some of these cool sets! 

LEGO Ideas Typewriter


LEGO Sanderson Sisters’ Cottage

Give the Gift of Backblaze

And, of course, we’d be remiss if we didn’t remind you that Backblaze Computer Backup makes a great gift. Help your family and friends experience the sweet, sweet peace of mind that comes from a good backup strategy and make sure they never lose a file again. Bonus: you don’t even have to go to the store to get it.

A decorative image showing a gift box with the words "Give Backblaze Backup" overlayed.

Go Forth and Gift!

We hope this guide sparked some ideas and simplified some choices. We’ll also be publishing our second-annual book guide in December if you’re struggling with something for the literary folks in your life. (There’s some good stuff in the first one too.) We love hearing about what folks are excited about, so feel free to give us some more good options in the comments below.

The post Holiday Gift Guide 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q3 2023

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2023/

A decorative image showing the title Q3 2023 Drive Stats.

At the end of Q3 2023, Backblaze was monitoring 263,992 hard disk drives (HDDs) and solid state drives (SSDs) in our data centers around the world. Of that number, 4,459 are boot drives, with 3,242 being SSDs and 1,217 being HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2023 Drive Stats review.

That leaves us with 259,533 HDDs that we’ll focus on in this report. We’ll review the quarterly and lifetime failure rates of the data drives as of the end of Q3 2023. Along the way, we’ll share our observations and insights on the data presented, and, for the first time ever, we’ll reveal the drive failure rates broken down by data center.

Q3 2023 Hard Drive Failure Rates

At the end of Q3 2023, we were managing 259,533 hard drives used to store data. For our review, we removed 449 drives from consideration because they were used for testing purposes or belonged to drive models with fewer than 60 drives in service. This leaves us with 259,084 hard drives grouped into 32 different models. 

The table below reviews the annualized failure rate (AFR) for those drive models for the Q3 2023 time period.

A table showing the quarterly annualized failure rates of Backblaze hard drives.

Notes and Observations on the Q3 2023 Drive Stats

  • The 22TB drives are here: At the bottom of the list you’ll see the WDC 22TB drives (model: WUH722222ALE6L4). A Backblaze Vault of 1,200 drives (plus four) is now operational. The 1,200 drives were installed on September 29, so they only have one day of service each in this report, but zero failures so far.
  • The old get bolder: At the other end of the time-in-service spectrum are the 6TB Seagate drives (model: ST6000DX000) with an average of 101 months in operation. This cohort had zero failures in Q3 2023 with 883 drives and a lifetime AFR of 0.88%.
  • Zero failures: In Q3, six different drive models managed to have zero drive failures during the quarter. But only the 6TB Seagate, noted above, had over 50,000 drive days, our minimum standard for ensuring we have enough data to make the AFR plausible.
  • One failure: There were four drive models with one failure during Q3. After applying the 50,000 drive day minimum, two drive models stood out:
    1. WDC 16TB (model: WUH721816ALE6L0) with a 0.15% AFR.
    2. Toshiba 14TB (model: MG07ACA14TEY) with a 0.63% AFR.

The Quarterly AFR Drops

In Q3 2023, the quarterly AFR for all drives was 1.47%. That was down from 2.2% in Q2 and also down from 1.65% a year ago. The quarterly AFR is based on just the data in that quarter, so it can often fluctuate from quarter to quarter. 
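For reference, the quarterly figures above can be reproduced from drive days and failure counts. Here’s a minimal sketch in Python, assuming Backblaze’s usual definition of AFR as failures per drive-year of service (the numbers below are made up for illustration):

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR as a percentage: failures divided by drive-years of service."""
    drive_years = drive_days / 365
    return failures / drive_years * 100

# Hypothetical quarter: 55 failures across 1.5 million drive days.
print(round(annualized_failure_rate(failures=55, drive_days=1_500_000), 2))  # → 1.34
```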

In our Q2 2023 report, we suspected the 2.2% for the quarter was due to the overall aging of the drive fleet and in particular we pointed a finger at specific 8TB, 10TB, and 12TB drive models as potential culprits driving the increase. That prediction fell flat in Q3 as nearly two-thirds of drive models experienced a decreased AFR quarter over quarter from Q2 and any increases were minimal. This included our suspect 8TB, 10TB, and 12TB drive models. 

It seems Q2 was an anomaly, but there was one big difference in Q3: we retired 4,585 aging 4TB drives. The average age of the retired drives was just over eight years, and while that was a good start, there are another 28,963 4TB drives to go. To facilitate the continuous retirement of aging drives and make the data migration process easy and safe, we use CVT, our awesome in-house data migration software, which we’ll cover at another time.

A Hot Summer and the Drive Stats Data

As anyone in our business should, Backblaze continuously monitors our systems and drives. So, it was of little surprise to us when the folks at NASA confirmed the summer of 2023 as Earth’s hottest on record. The effects of this record-breaking summer showed up in our monitoring systems in the form of drive temperature alerts. A given drive in a storage server can heat up for many reasons: it is failing; a fan in the storage server has failed; other components are producing additional heat; the air flow is somehow restricted; and so on. Add in the fact that the ambient temperature within a data center often increases during the summer months, and you can get more temperature alerts.

In reviewing the temperature data for our drives in Q3, we noticed that a small number of drives exceeded the maximum manufacturer’s temperature for at least one day. The maximum temperature for most drives is 60°C, except for the 12TB, 14TB, and 16TB Toshiba drives which have a maximum temperature of 55°C. Of the 259,533 data drives in operation in Q3, there were 354 individual drives (0.14% of the fleet) that exceeded their maximum manufacturer temperature. Of those, only two drives failed, leaving 352 drives which were still operational as of the end of Q3.
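As a rough illustration of how such a check might work against the published Drive Stats data (where SMART attribute 194 typically reports drive temperature in degrees Celsius), here’s a hedged Python sketch. The serial numbers and readings are invented, and a real check would need the full per-model temperature limits:

```python
# Per-model maximums from the post: most drives are rated to 60°C, while the
# 12TB, 14TB, and 16TB Toshiba models are rated to 55°C.
MAX_TEMP_C = {"MG07ACA14TEY": 55}
DEFAULT_MAX_C = 60

def hot_drives(rows):
    """Yield serial numbers whose SMART 194 raw value (temperature, °C)
    exceeds the manufacturer's maximum for that model."""
    for row in rows:
        temp = row.get("smart_194_raw")
        if not temp:
            continue  # some drives don't report a temperature
        limit = MAX_TEMP_C.get(row["model"], DEFAULT_MAX_C)
        if float(temp) > limit:
            yield row["serial_number"]

rows = [  # invented readings, purely for illustration
    {"serial_number": "A1", "model": "ST4000DM000", "smart_194_raw": "61"},
    {"serial_number": "B2", "model": "MG07ACA14TEY", "smart_194_raw": "56"},
    {"serial_number": "C3", "model": "ST4000DM000", "smart_194_raw": "40"},
]
print(list(hot_drives(rows)))  # → ['A1', 'B2']
```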

While temperature fluctuation is part of running data centers and temp alerts like these aren’t unheard of, our data center teams are looking into the root causes to ensure we’re prepared for the inevitability of increasingly hot summers to come.

Will the Temperature Alerts Affect Drive Stats?

The two drives which exceeded their maximum temperature and failed in Q3 have been removed from the Q3 AFR calculations. Both drives were 4TB Seagate drives (model: ST4000DM000). Given that the remaining 352 drives which exceeded their temperature maximum did not fail in Q3, we have left them in the Drive Stats calculations for Q3 as they did not increase the computed failure rates.

Beginning in Q4, we will remove the 352 drives from the regular Drive Stats AFR calculations and create a separate cohort of drives to track that we’ll name Hot Drives. This will allow us to track the drives which exceeded their maximum temperature and compare their failure rates to those drives which operated within the manufacturer’s specifications. While there are a limited number of drives in the Hot Drives cohort, it could give us some insight into whether exposure to high temperatures causes drives to fail more often. This heightened level of monitoring will also help us detect and deal with any increase in drive failures expeditiously.

New Drive Stats Data Fields in Q3

In Q2 2023, we introduced three new data fields that we started populating in the Drive Stats data we publish: vault_id, pod_id, and is_legacy_format. In Q3, we are adding three more fields to each drive record as follows:

  • datacenter: The Backblaze data center where the drive is installed, currently one of these values: ams5, iad1, phx1, sac0, and sac2.
  • cluster_id: The name of a given collection of storage servers logically grouped together to optimize system performance. Note: At this time the cluster_id is not always correct; we are working on fixing that. 
  • pod_slot_num: The physical location of a drive within a storage server. The specific slot differs based on the storage server type and capacity: Backblaze (45 drives), Backblaze (60 drives), Dell (26 drives), or Supermicro (60 drives). We’ll dig into these differences in another post.

With these additions, the new schema beginning in Q3 2023 is:

  • date
  • serial_number
  • model
  • capacity_bytes
  • failure
  • datacenter (Q3)
  • cluster_id (Q3)
  • vault_id (Q2)
  • pod_id (Q2)
  • pod_slot_num (Q3)
  • is_legacy_format (Q2)
  • smart_1_normalized
  • smart_1_raw
  • The remaining SMART value pairs (as reported by each drive model)

Beginning in Q3, these data fields have been added to the publicly available Drive Stats files that we publish each quarter. 
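To show how the new fields slot into the existing files, here’s a small sketch that tallies failures by data center from a CSV using the Q3 schema. The rows below are invented and the field list is abridged, but the column names match the schema above:

```python
import csv
import io

# Two invented sample rows using an abridged version of the Q3 2023 schema.
sample = io.StringIO(
    "date,serial_number,model,capacity_bytes,failure,datacenter,vault_id\n"
    "2023-09-30,S1,ST6000DX000,6001175126016,0,phx1,1200\n"
    "2023-09-30,S2,WUH721816ALE6L0,16000900661248,1,sac0,1042\n"
)

failures_by_dc = {}
for row in csv.DictReader(sample):
    dc = row["datacenter"] or "unknown"  # the field can occasionally be blank
    failures_by_dc[dc] = failures_by_dc.get(dc, 0) + int(row["failure"])

print(failures_by_dc)  # → {'phx1': 0, 'sac0': 1}
```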

Failure Rates by Data Center

Now that we have the data center for each drive we can compute the AFRs for the drives in each data center. Below you’ll find the AFR for each of five data centers for Q3 2023.

A chart showing Backblaze annualized failure rates by data center.

Notes and Observations

  • Null?: The drives which reported a null or blank value for their data center are grouped in four Backblaze vaults. David, the Senior Infrastructure Software Engineer for Drive Stats, described the process of how we gather all the parts of the Drive Stats data each day. The TL;DR is that vaults can be too busy to respond at the moment we ask, and since the data center field is nice-to-have data, we get a blank field. We can go back a day or two to find the data center value, which we will do in the future when we report this data.
  • sac0?: sac0 has the highest AFR of all of the data centers, but it also has the oldest drives—nearly twice as old, on average, as the next oldest data center, sac2. As discussed previously, drive failures do seem to follow the “bathtub curve”, although recently we’ve seen the curve start out flatter. Regardless, as drive models age, they do generally fail more often. Another factor could be that sac0, and to a lesser extent sac2, has some of the oldest Storage Pods, including a handful of 45-drive units. We are in the process of using CVT to replace these older servers while migrating from 4TB to 16TB and larger drives.
  • iad1: The iad data center is the foundation of our eastern region and has been growing rapidly since coming online about a year ago. The growth is a combination of new data and customers using our cloud replication capability to automatically make a copy of their data in another region.
  • Q3 Data: This chart is for Q3 data only and includes all the data drives, including those with less than 60 drives per model. As we track this data over the coming quarters, we hope to get some insight into whether different data centers really have different drive failure rates, and, if so, why.

Lifetime Hard Drive Failure Rates

As of September 30, 2023, we were tracking 259,084 hard drives used to store customer data. For our lifetime analysis, we collect the number of drive days and the number of drive failures for each drive beginning from the time a drive was placed into production in one of our data centers. We group these drives by model, then sum up the drive days and failures for each model over their lifetime. That chart is below. 

A chart showing Backblaze lifetime hard drive failure rates.

One of the most important columns on this chart is the confidence interval, which is the difference between the low and high AFR confidence levels calculated at 95%. The lower the value, the more certain we are of the AFR stated. We like a confidence interval to be 0.5% or less. When the confidence interval is higher, that is not necessarily bad; it just means we either need more data or the data is somewhat inconsistent. 
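As an illustration of how such an interval can be approximated, here’s a short Python sketch. This uses a common Poisson-based approximation, not necessarily the exact method Backblaze uses, and the drive counts are invented:

```python
import math

def afr_confidence_interval(failures: int, drive_days: int, z: float = 1.96):
    """Approximate 95% confidence interval for AFR (%), treating the
    failure count as Poisson-distributed with std dev sqrt(failures)."""
    drive_years = drive_days / 365
    afr = failures / drive_years * 100
    half_width = z * math.sqrt(failures) / drive_years * 100
    return max(afr - half_width, 0.0), afr + half_width

# Hypothetical drive model: 100 failures over 4 million drive days.
low, high = afr_confidence_interval(failures=100, drive_days=4_000_000)
print(round(high - low, 2))  # → 0.36, i.e., an interval comfortably under 0.5%
```

More drive days (and thus more observed failures) shrink the interval relative to the AFR, which is why the high-drive-day models give the most trustworthy numbers.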

The table below contains just those drive models which have a confidence interval of less than 0.5%. We have sorted the list by drive size and then by AFR.

A chart showing Backblaze hard drive annualized failure rates with a confidence interval of less than 0.5%.

The 4TB, 6TB, 8TB, and some of the 12TB drive models are no longer in production. The HGST 12TB models in particular can still be found, but they have been relabeled as Western Digital and given alternate model numbers. Whether they have materially changed internally is not known, at least to us.

One final note about the lifetime AFR data: you might have noticed the AFR for all of the drives hasn’t changed much from quarter to quarter. It has vacillated between 1.39% and 1.45% for the last two years. Basically, we have lots of drives with lots of time-in-service, so it is hard to move the needle up or down. While the lifetime stats for individual drive models can be very useful, the lifetime AFR for all drives will probably get less and less interesting as we add more and more drives. Of course, a few hundred thousand drives that never fail could arrive, so we will continue to calculate and present the lifetime AFR.

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Stats Data webpage. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free. 

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q3 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AI 101: Training vs. Inference

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-training-vs-inference/

A decorative image depicting a neural network identifying a cat.

What do Sherlock Holmes and ChatGPT have in common? Inference, my dear Watson!

“We approached the case, you remember, with an absolutely blank mind, which is always an advantage. We had formed no theories. We were simply there to observe and to draw inferences from our observations.”
—Sir Arthur Conan Doyle, The Adventures of the Cardboard Box

As we all continue to refine our thinking around artificial intelligence (AI), it’s useful to define terminology that describes the various stages of building and using AI algorithms—namely, the AI training stage and the AI inference stage. As we see in the quote above, these are not new concepts: they’re based on ideas and methodologies that have been around since before Sherlock Holmes’ time. 

If you’re using AI, building AI, or just curious about AI, it’s important to understand the difference between these two stages so you understand how data moves through an AI workflow. That’s what I’ll explain today.


The difference between these two terms can be summed up fairly simply: first you train an AI algorithm, then your algorithm uses that training to make inferences from data. To create a whimsical analogy, when an algorithm is training, you can think of it like Watson—still learning how to observe and draw conclusions through inference. Once it’s trained, it’s an inferring machine, a.k.a. Sherlock Holmes. 

Whimsy aside, let’s dig a little deeper into the tech behind AI training and AI inference, the differences between them, and why the distinction is important. 

Obligatory Neural Network Recap

Neural networks have emerged as the brainpower behind AI, and a basic understanding of how they work is foundational when it comes to understanding AI.  

Complex decisions, in theory, can be broken down into a series of yeses and nos, which means that they can be encoded in binary. Neural networks have the ability to combine enough of those smaller decisions, weigh how they affect each other, and then use that information to solve complex problems. And, because more complex decisions require more points of information to come to a final decision, they require more processing power. Neural networks are one of the most widely used approaches to AI and machine learning (ML). 

A diagram showing the inputs, hidden layers, and outputs of a neural network.
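To make the recap above concrete, here’s a toy single neuron in Python: weighted yes/no-style inputs are combined and squashed into a score between 0 and 1. The weights and inputs here are arbitrary, purely for illustration:

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: a weighted sum of inputs passed
    through a sigmoid, which squashes the result into (0, 1)."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))

# Combine two small yes/no signals into one decision score.
print(round(neuron([1.0, 0.0], weights=[2.0, -1.0], bias=-0.5), 3))  # → 0.818
```

A neural network is, loosely, many of these stacked in layers, with the output of one layer feeding the inputs of the next.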

What Is AI Training?: Understanding Hyperparameters and Parameters

In simple terms, training an AI algorithm is the process through which you take a base algorithm and then teach it how to make the correct decision. This process requires large amounts of data, and can include various degrees of human oversight. How much data you need has a relationship to the number of parameters you set for your algorithm as well as the complexity of a problem. 

We made this handy dandy diagram to show you how data moves through the training process:

A diagram showing how data moves through an AI training algorithm.
As you can see in this diagram, the end result is model data, which then gets saved in your data store for later use.

And hey—we’re leaving out a lot of nuance in that conversation because dataset size, parameter choice, etc. is a graduate-level topic on its own, and usually is considered proprietary information by the companies who are training an AI algorithm. It suffices to say that dataset size and number of parameters are both significant and have a relationship to each other, though it’s not a direct cause/effect relationship. And, both the number of parameters and the size of the dataset affect things like processing resources—but that conversation is outside of scope for this article (not to mention a hot topic in research). 

As with everything, your use case determines your execution. Some types of tasks actually see excellent results with smaller datasets and more parameters, whereas others require more data and fewer parameters. Bringing it back to the real world, here’s a very cool graph showing how many parameters different AI systems have. Note that they very helpfully identified what type of task each system is designed to solve:

So, let’s talk about what parameters are with an example. Back in our very first AI 101 post, we talked about ways to frame an algorithm in simple terms: 

Machine learning does not specify how much knowledge the bot you’re training starts with—any task can have more or fewer instructions. You could ask your friend to order dinner, or you could ask your friend to order you pasta from your favorite Italian place to be delivered at 7:30 p.m. 

Both of those tasks you just asked your friend to complete are algorithms. The first algorithm requires your friend to make more decisions to execute the task at hand to your satisfaction, and they’ll do that by relying on their past experience of ordering dinner with you—remembering your preferences about restaurants, dishes, cost, and so on. 

The factors that help your friend make a decision about dinner are called hyperparameters and parameters. Hyperparameters are those that frame the algorithm—they are set outside the training process, but can influence the training of the algorithm. In the example above, a hyperparameter would be how you structure your dinner feedback. Do you thumbs up or down each dish? Do you write a short review? You get the idea. 

Parameters are factors that the algorithm derives through training. In the example above, that’s what time you prefer to eat dinner, which restaurants you enjoy, and so on. 
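Here’s a tiny, hand-rolled training loop that makes the distinction concrete (a toy sketch, not any production training method): the learning rate is a hyperparameter we choose before training, while the weight w is a parameter the algorithm learns from data.

```python
# Fit y ≈ w * x with gradient descent on a tiny dataset where y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

learning_rate = 0.05  # hyperparameter: set by us, outside training
w = 0.0               # parameter: adjusted by the algorithm during training

for _ in range(200):  # the training loop
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad

print(round(w, 2))  # → 2.0, the relationship hidden in the data
```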

When you’ve trained a neural network, there will be heavier weights between various nodes. That’s a shorthand way of saying that an algorithm will prefer a path it knows is significant. If you want to really get nerdy with it, this article is well-researched, has a ton of math explainers for various training methods, and includes some fantastic visuals. For our purposes, here’s one way people visualize a “trained” algorithm: 

An image showing a neural network that has prioritized certain pathways after training.

The “dropout method” randomly deactivates a portion of the network’s nodes during training so the algorithm can’t over-rely on any single pathway. The relationships that consistently prove significant for the dataset it’s working on end up with heavier weights, while the others are de-prioritized (or sometimes even eliminated). 
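Here’s a minimal sketch of “inverted” dropout, the variant most frameworks implement (a toy illustration, not tied to any particular library): during training, each activation is randomly zeroed with some probability and the survivors are scaled up so the expected total stays the same; at inference, dropout is simply switched off.

```python
import random

def dropout(activations, rate=0.5, training=True, seed=None):
    """Inverted dropout: zero each activation with probability `rate` during
    training, scaling survivors by 1 / (1 - rate); pass through at inference."""
    if not training:
        return list(activations)  # inference: dropout is disabled
    rng = random.Random(seed)
    return [0.0 if rng.random() < rate else a / (1 - rate) for a in activations]

acts = [0.2, 0.9, 0.4, 0.7]
print(dropout(acts, training=False))  # → [0.2, 0.9, 0.4, 0.7]
```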

Once you have a trained algorithm, then you can use it with a reasonable degree of certainty that it will give you good results, and that leads us to inference. 

What Is AI Inference?

Once you’ve trained your algorithm, you can send it out in the world to do its job (and make yours easier). When you present a trained AI algorithm with a problem and it gives you an answer, that’s called inference. It’s using the way it was trained to draw conclusions or make predictions, depending on how it was built, and once an algorithm is in the “inference stage”, it’s no longer learning (usually). 

Here’s our diagram for how data might move through an inference process: 

A diagram showing how data moves through an inference workflow.
As you can see, if we want to add more training data to the algorithm, we’d do so as a separate input/output flow. The goal of this process is to output data.

As we’ve seen in the real world, this does not mean that the AI model will give you a perfect answer every time—depending on, for example, the AI’s access to new or current information, the complexity of training, etc., the algorithm might give you a better or worse (more or less accurate) answer. But, it will be able to look at a new piece of information or dataset, and return a response. 

When I asked our friend ChatGPT about this, it broke down inference and defined it in relation to the logical underpinnings of different types of algorithms. (As a mere human, how could I resist using inference while writing about inference?) Sure, ChatGPT may be merely mimicking human-like dialogue based on its ingestion of the whole internet, then giving me the most likely answer based on its specialized neural network that predicts the likelihood of the next word in response to my conversational query and without access to data after January 2022, but here’s what our AI friend had to say for itself: 

  • Machine Learning Inference: In machine learning, after a model has been trained on a dataset, it is deployed to make predictions or classifications on new, unseen data. During inference, the model takes the input data, processes it, and produces an output or a prediction based on the patterns and relationships it learned during training. For example, a trained machine learning model might be used to classify images, predict stock prices, or identify spam emails.
  • Deep Learning Inference: Deep learning, a subfield of machine learning, involves neural networks with multiple layers. In deep learning, inference means feeding an input through a trained deep neural network and obtaining an output. This is commonly used in applications like image recognition, natural language processing, and speech recognition. Inference in deep learning can be computationally intensive, and specialized hardware, such as GPUs and TPUs, are often used to accelerate the process.
  • Rule-Based Systems Inference: In rule-based AI systems, inference is the process of applying logical rules to make decisions or draw conclusions. These systems use a set of predefined rules and facts to reason and infer new information. For example, in an expert system for medical diagnosis, inference involves applying a set of rules to patient symptoms and medical knowledge to arrive at a diagnosis.
  • Probabilistic Inference: In probabilistic AI, inference is used to estimate probabilities and uncertainty. Bayesian networks and Markov decision processes are examples of probabilistic models that perform inference to compute probabilities of events or make decisions based on probabilities.

You’ll notice that each of these are saying basically the same thing: the AI algorithm applies its decision-making paradigm to a problem. 

Why Stop Learning During the Inference Stage?

In general, it’s important to keep these two stages—training and inference—of an AI algorithm separate for a few reasons: 

  • Efficiency: Training is typically a computationally intensive process, whereas inference is usually faster and less resource-intensive. Separating them allows for efficient use of computational resources.
  • Generalization: The model’s ability to generalize from training data to unseen data is a key feature. It should not learn from every new piece of data it encounters during inference to maintain this generalization ability.
  • Reproducibility: When using trained models in production or applications, it’s important to have consistency and reproducibility in the results. If models were allowed to learn during inference, it would introduce variability and unpredictability in their behavior.

There are some specialized AI algorithms that are designed to continue learning during the inference stage—your Netflix recommendation algorithm is a good example, as are self-driving cars and dynamic pricing models used to set airfares. On the other hand, the majority of problems we’re trying to solve with AI algorithms deliver better decisions by separating these two phases—think of things like image recognition, language translation, or medical diagnosis, for example.

Training vs. Inference (But, Really: Training Then Inference)

To recap: the AI training stage is when you feed data into your learning algorithm to produce a model, and the AI inference stage is when your algorithm uses that training to make inferences from data. Here’s a chart for quick reference: 

| Training | Inference |
|---|---|
| Feed training data into a learning algorithm. | Apply the model to inference data. |
| Produces a model comprising code and data. | Produces output data. |
| One time(ish). Retraining is sometimes necessary. | Often continuous. |

The difference may seem inconsequential at first glance, but defining these two stages helps to show the implications for AI adoption, particularly for businesses. That is, because inference is much less resource-intensive (and therefore less expensive) than training, it’s likely to be much easier for businesses to integrate already-trained AI algorithms with their existing systems. 

And, as always, we’re big believers in demystifying terminology for discussion purposes. Let us know what you think in the comments, and feel free to let us know what you’re interested in learning about next.

The post AI 101: Training vs. Inference appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Anything as a Service: All the “as a Service” Acronyms You Didn’t Know You Needed

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/anything-as-a-service-all-the-as-a-service-acronyms-you-didnt-know-you-needed/

A decorative image showing acronyms for different "as as service" acronyms.

Have you ever felt like you need a dictionary just to understand what tech-savvy folks are talking about? Well, you’re in luck, because we’re about to decode some of the most common jargon of the digital age, one acronym at a time. Welcome to the world of “as a Service” acronyms, where we take the humble alphabet and turn it into a digital buffet. 

So, whether you’re SaaS-savvy or PaaS-puzzled, or just someone desperately searching for a little HaaS (Humor as a Service …yeah, we made that one up), you’ve come to the right place. Let’s take a big slurp from this alphabet soup of tech terms.

The One That Started It All: SaaS

SaaS stands for software as a service, and it’s the founding member of the “as a service” nomenclature. (Though, very confusingly, there’s also Sales as a Service—it’s just not shortened to SaaS. Usually.)

Imagine your software as a pizza delivery service. You don’t need to buy all the ingredients, knead the dough, and bake it yourself. Instead, you simply order a slice, and it magically appears on your table (a.k.a. screen). SaaS products are like that, but instead of pizza they serve up everything from messaging to video conferencing to email marketing to …well, really you name it. Which brings us to…

The Kind of Ironic One: XaaS

XaaS stands for, variously, “everything” or “anything” as a service. No one is really sure about the term’s provenance, but it’s a fair guess to say it came into existence when, well, everything started to become a service, probably sometime around the mid-2010s. The thinking is: if it exists in the digital realm, you can probably get it “as a service.” 

The Hardware Related Ones: HaaS, IaaS, and PaaS

HaaS (Hardware as a Service): Instead of purchasing hardware yourself, like computers, servers, networking equipment, and other physical infrastructure components, with HaaS, you can lease or rent the equipment for a given period. It would be like renting a pizza kitchen to make your specialty pies specifically for your sister’s wedding or your grandma’s birthday.

IaaS (Infrastructure as a Service): Infrastructure as a service is kind of like hardware as a service, but it comes with some additional goodies thrown in. Instead of renting just the kitchen, you rent the whole restaurant—chairs, tables, and servers (no pun intended) included. IaaS delivers virtualized computing resources, like virtual machines, storage (that’s us!), and networking, over the internet.

PaaS (Platform as a Service): Think of PaaS as a step even further than IaaS—you’re not just renting a pizza restaurant, you’re renting a test kitchen where you can develop your award-winning pie. PaaS provides developers the ability to build, manage, and deploy applications with services like development frameworks, databases, and infrastructure management. It’s the ultimate DIY platform for tech enthusiasts.

The Bad One: RaaS

RaaS stands for Ransomware as a Service, and this is one “as a service” variant you don’t want to mess with. Basically, cybercriminals can purchase ransomware just as easily as you would purchase any app on the app store (it’s probably more complicated than that, but you get the general gist). This makes it easy for even the least savvy cybercriminal to get into the ransomware game. Not great. 

The Ones That Help With the Last One: BaaS and DRaaS

BaaS (Backup as a Service): Backup as a Service is a cloud-based data protection solution that allows individuals and organizations to back up their data to a remote cloud. (Hey! That’s us too!) Instead of managing on-premises backup infrastructure, users can securely store their data off-site, often on highly redundant and geographically distributed servers.

DRaaS (Disaster Recovery as a Service): DRaaS stands for disaster recovery as a service, and it’s the antidote to RaaS. Of course, you need good backups to begin with, but adding DRaaS allows businesses to ensure specific recovery time objectives (RTOs, FYI) so they can get back up and running in the event they’re attacked by ransomware or there’s a natural disaster at their primary storage location. DRaaS solutions used to be made almost exclusively with the large enterprise in mind, but today, it’s possible to architect a DRaaS solution for your business affordably and easily.

The Analytical One: DaaS

DaaS stands for data as a service, and it’s your data’s personal chauffeur. It fetches the information you need and serves it up on a silver platter. DaaS offers data on-demand, making structured data accessible to users over the internet. It simplifies data sharing and access, often in real-time, without the need for complex data management.

The Development-Focused Ones: CaaS, BaaS (again), and FaaS

CaaS (Containers as a Service): CaaS simplifies the deployment, scaling, and orchestration of containerized applications. It’s the tech version of a literal container ship. The individual containers “ship” individual pieces of software, and a CaaS tool helps carry all of those individual containers. Check out container management software Docker’s logo for a visualization:

It looks more like a whale carrying containers, which is far more adorable, in our opinion.

BaaS (Backend as a Service): It wouldn’t be the first time an acronym has two meanings. BaaS, in this context, provides a backend infrastructure for mobile and web app developers, offering services like databases, user authentication, and APIs. Imagine your own team of digital butlers tending to the back end of your apps. They handle all the behind-the-scenes stuff, so you can focus on making your app shine. 

FaaS (Function as a Service): FaaS is a serverless computing model where developers focus on writing and deploying individual functions or code snippets. These functions run in response to specific events, promoting scalability and efficiency in application development. It’s like having a team of tiny, code-savvy robots doing your bidding.

Go Forth and Abbreviate

Now that you’ve sampled all of the flavors the vast “as a service” world has to offer, we hope you’ve gained a clearer understanding of these sometimes confounding terms. So whether you’re a business professional navigating the cloud or just curious about the tech world, you can wield these acronyms with confidence. 

Did we miss any? I’m sure. Let us know in the comments.

The post Anything as a Service: All the “as a Service” Acronyms You Didn’t Know You Needed appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How We Achieved Upload Speeds Faster Than AWS S3

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/2023-performance-improvements/

An image of a city skyline with lines going up to a cloud.

You don’t always need the absolute fastest cloud storage—your performance requirements depend on your use case, business objectives, and security needs. But still, faster is usually better. And Backblaze just announced innovation on B2 Cloud Storage that delivers a lot more speed: most file uploads will now be up to 30% faster than AWS S3. 

Today, I’m diving into all of the details of this performance improvement, how we did it, and what it means for you.


The Results: Customers who rely on small file uploads (1MB or less) can expect to see 10–30% faster uploads on average based on our tests, all without any change to durability, availability, or pricing. 

What Does This Mean for You? 

All B2 Cloud Storage customers will benefit from these performance enhancements, especially those who use Backblaze B2 as a storage destination for data protection software. Small uploads of 1MB or less make up about 70% of all uploads to B2 Cloud Storage and are common for backup and archive workflows. Specific benefits of the performance upgrades include:

  • Secures data in offsite backups faster.
  • Frees up time for IT administrators to work on other projects.
  • Decreases congestion on network bandwidth.
  • Deduplicates data more efficiently.

Veeam® is dedicated to working alongside our partners to innovate and create a united front against cyber threats and attacks. The new performance improvements released by Backblaze for B2 Cloud Storage furthers our mission to provide radical resilience to our joint customers.

—Andreas Neufert, Vice President, Product Management, Alliances, Veeam

When Can I Expect Faster Uploads?

Today. The performance upgrades have been fully rolled out across Backblaze’s global data regions.

How We Did It

Prior to this work, when a customer uploaded a file to Backblaze B2, the data was written to multiple hard disk drives (HDDs). Those operations had to be completed before returning a response to the client. Now, we write the incoming data to the same HDDs and also, simultaneously, to a pool of solid state drives (SSDs) we call a “shard stash,” waiting only for the HDD writes to make it to the filesystems’ in-memory caches and the SSD writes to complete before returning a response. Once the writes to HDD are complete, we free up the space from the SSDs so it can be reused.

Since writing data to an SSD is much faster than writing to HDDs, the net result is faster uploads. 

That’s just a brief summary; if you’re interested in the technical details (as well as the results of some rigorous testing), read on!

The Path to Performance Upgrades

As you might recall from many Drive Stats blog posts and webinars, Backblaze stores all customer data on HDDs, affectionately termed ‘spinning rust’ by some. We’ve historically reserved SSDs for Storage Pod (storage server) boot drives. 

Until now. 

That’s right—SSDs have entered the data storage chat. To achieve these performance improvements, we combined the performance of SSDs with the cost efficiency of HDDs. First, I’ll dig into a bit of history to add some context to how we went about the upgrades.


IBM shipped the first hard drive way back in 1957, so it’s fair to say that the HDD is a mature technology. Drive capacity and data rates have steadily increased over the decades while cost per byte has fallen dramatically. That first hard drive, the IBM RAMAC 350, had a total capacity of 3.75MB, and cost $34,500. Adjusting for inflation, that’s about $375,000, equating to $100,000 per MB, or $100 billion per TB, in 2023 dollars.

A photograph of people pushing one of the first hard disk drives into a truck.
An early hard drive shipped by IBM. Source.

Today, the 16TB version of the Seagate Exos X16—an HDD widely deployed in the Backblaze B2 Storage Cloud—retails for around $260, or $16.25 per TB. If it had the same cost per byte as the IBM RAMAC 350, it would sell for $1.6 trillion—around the current GDP of Spain!

SSDs, by contrast, have only been around since 1991, when SanDisk’s 20MB drive shipped in IBM ThinkPad laptops for an OEM price of about $1,000. Let’s consider a modern SSD: the 3.2TB Micron 7450 MAX. Retailing at around $360, the Micron SSD is priced at $112.50 per TB, nearly seven times as much as the Seagate HDD.
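Those per-terabyte figures are simple division; here's a quick sanity check in Python using the street prices quoted above:

```python
def cost_per_tb(drive_cost_usd: float, capacity_tb: float) -> float:
    """Street price divided by capacity."""
    return drive_cost_usd / capacity_tb

hdd = cost_per_tb(260, 16)    # Seagate Exos X16
ssd = cost_per_tb(360, 3.2)   # Micron 7450 MAX

print(f"HDD: ${hdd:.2f}/TB, SSD: ${ssd:.2f}/TB, ratio: {ssd / hdd:.1f}x")
```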

So, HDDs easily beat SSDs in terms of storage cost, but what about performance? Here are the numbers from the manufacturers’ data sheets:

                                      Seagate Exos X16   Micron 7450 MAX
Model number                          ST16000NM001G      MTFDKCB3T2TFS
Capacity                              16TB               3.2TB
Drive cost                            $260               $360
Cost per TB                           $16.25             $112.50
Max sustained read rate (MB/s)        261                6,800
Max sustained write rate (MB/s)       261                5,300
Random read rate, 4kB blocks, IOPS    170/440*           1,000,000
Random write rate, 4kB blocks, IOPS   170/440*           390,000

Since HDD platters rotate at a constant rate, 7,200 RPM in this case, they can transfer more blocks per revolution at the outer edge of the disk than close to the middle—hence the two figures for the X16’s transfer rate.

The SSD is over 20 times as fast at sustained data transfer as the HDD, but look at the difference in random transfer rates! Even when the HDD is at its fastest, transferring blocks from the outer edge of the disk, the SSD is over 2,200 times faster at reading data and nearly 900 times faster at writing.

This massive difference is due to the fact that, when reading data from random locations on the disk, the platters have to complete an average of 0.5 revolutions between blocks. At 7,200 rotations per minute (RPM), that means that the HDD spends about 4.2ms just spinning to the next block before it can even transfer data. In contrast, the SSD’s data sheet quotes its latency as just 80µs (that’s 0.08ms) for reads and 15µs (0.015ms) for writes, between 52 and 280 times faster than the spinning disk.

Let’s consider a real-world operation, say, writing 64kB of data. Assuming the HDD can write that data to sequential disk sectors, it will spin for an average of 4.2ms, then spend 0.25ms writing the data to the disk, for a total of about 4.5ms. The SSD, in contrast, can write the data to any location with no mechanical delay at all, taking just 27µs (0.027ms) to do so. This (somewhat theoretical) 167x speed advantage is the basis for the performance improvement.
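That speed advantage is easy to reproduce from the data sheet figures above. Here's a sketch of the arithmetic (the exact ratio comes out slightly lower than the rounded 167x in the text):

```python
# Figures from the manufacturers' data sheets quoted above.
HDD_RPM = 7200
HDD_WRITE_MBPS = 261          # max sustained write rate
SSD_WRITE_MBPS = 5300
SSD_WRITE_LATENCY_S = 15e-6   # 15 µs

def hdd_write_time(size_bytes: int) -> float:
    # Average rotational delay: half a revolution before the
    # target sector passes under the head.
    seek = 0.5 * 60 / HDD_RPM
    transfer = size_bytes / (HDD_WRITE_MBPS * 1e6)
    return seek + transfer

def ssd_write_time(size_bytes: int) -> float:
    # No moving parts: just the quoted write latency plus transfer time.
    return SSD_WRITE_LATENCY_S + size_bytes / (SSD_WRITE_MBPS * 1e6)

size = 64 * 1024   # one 64kB shard
hdd_t, ssd_t = hdd_write_time(size), ssd_write_time(size)
print(f"HDD: {hdd_t * 1e3:.2f} ms, SSD: {ssd_t * 1e6:.0f} µs, "
      f"advantage: {hdd_t / ssd_t:.0f}x")
```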

Why did I choose a 64kB block? As we mentioned in a recent blog post focusing on cloud storage performance, in general, bigger files are better when it comes to the aggregate time required to upload a dataset. However, there may be other requirements that push for smaller files. Many backup applications split data into fixed size blocks for upload as files to cloud object storage. There is a trade-off in choosing the block size: larger blocks improve backup speed, but smaller blocks reduce the amount of storage required. In practice, backup blocks may be as small as 1MB or even 256kB. The 64kB blocks we used in the calculation above represent the shards that comprise a 1MB file.

The challenge facing our engineers was to take advantage of the speed of solid state storage to accelerate small file uploads without breaking the bank.

Improving Write Performance for Small Files

When a client application uploads a file to the Backblaze B2 Storage Cloud, a coordinator pod splits the file into 16 data shards, creates four additional parity shards, and writes the resulting 20 shards to 20 different HDDs, each in a different Pod.

Note: As HDD capacity increases, so does the time required to recover after a drive failure, so we periodically adjust the ratio between data shards and parity shards to maintain our eleven nines durability target. In the past, you’ve heard us talk about 17 + 3 as the ratio, but we also run 16 + 4, and our very newest vaults use a 15 + 5 scheme.
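The shard math is easy to sketch. The snippet below splits a 1MB upload into 16 data shards of 64kB each; the XOR parity here is a deliberately simplified stand-in, since the production system uses Reed-Solomon coding, which can rebuild the file from any 16 of the 20 shards:

```python
import functools

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_into_shards(payload: bytes, data_shards: int = 16,
                      parity_shards: int = 4):
    """Split a file into equal-sized data shards and add parity shards."""
    shard_size = -(-len(payload) // data_shards)          # ceiling division
    padded = payload.ljust(shard_size * data_shards, b"\0")
    shards = [padded[i * shard_size:(i + 1) * shard_size]
              for i in range(data_shards)]
    # Simplified parity: XOR every parity_shards-th data shard together.
    # (The real system computes Reed-Solomon parity instead.)
    parity = [functools.reduce(xor_bytes, shards[p::parity_shards])
              for p in range(parity_shards)]
    return shards, parity

upload = b"x" * (1024 * 1024)                 # a 1MB file
data, parity = split_into_shards(upload)
print(len(data), len(parity), len(data[0]))   # 16 4 65536
```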

Each Pod writes the incoming shard to its local filesystem; in practice, this means that the data is written to an in-memory cache and will be written to the physical disk at some point in the near future. Any requests for the file can be satisfied from the cache, but the data hasn’t actually been persistently stored yet.

We need to be absolutely certain that the shards have been written to disk before we return a “success” response to the client, so each Pod executes an fsync system call to transfer (“flush”) the shard data from system memory through the HDD’s write cache to the disk itself before returning its status to the coordinator. When the coordinator has received at least 19 successful responses, it returns a success response to the client. This ensures that, even if the entire data center was to lose power immediately after the upload, the data would be preserved.
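In Python terms (a sketch only; the real Pods are not written in Python), the gap between "in the filesystem cache" and "safely on disk" is exactly the fsync call:

```python
import os
import tempfile

def durable_write(path: str, shard: bytes) -> None:
    with open(path, "wb") as f:
        f.write(shard)        # lands in the filesystem's in-memory cache
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # block until the drive reports the data
                              # has reached persistent media
    # Only now is it safe to report success to the coordinator.

path = os.path.join(tempfile.gettempdir(), "shard_0")
durable_write(path, b"\0" * 65536)
```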

As we explained above, for small blocks of data, the vast majority of the time spent writing the data to disk is spent waiting for the drive platter to spin to the correct location. Writing shards to SSD could result in a significant performance gain for small files, but what about that 7x cost difference?

Our engineers came up with a way to have our cake and eat it too by harnessing the speed of SSDs without a massive increase in cost. Now, upon receiving a file of 1MB or less, the coordinator splits it into shards as before, then simultaneously sends the shards to a set of 20 Pods and a separate pool of servers, each populated with 10 of the Micron SSDs described above—a “shard stash.” The shard stash servers easily win the “flush the data to disk” race and return their status to the coordinator in just a few milliseconds. Meanwhile, each HDD Pod writes its shard to the filesystem, queues up a task to flush the shard data to the disk, and returns an acknowledgement to the coordinator.

Once the coordinator has received replies establishing that at least 19 of the 20 Pods have written their shards to the filesystem, and at least 19 of the 20 shards have been flushed to the SSDs, it returns its response to the client. Again, if power was to fail at this point, the data has already been safely written to solid state storage.

We don’t want to leave the data on the SSDs any longer than we have to, so, each Pod, once it’s finished flushing its shard to disk, signals to the shard stash that it can purge its copy of the shard.
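Here's a toy simulation of that dual-path write, with sleeps standing in for greatly exaggerated device latencies; for brevity, the filesystem-cache acknowledgements from the HDD Pods are elided and the coordinator waits only on the SSD quorum:

```python
import concurrent.futures
import time

# Exaggerated stand-in latencies so the effect is obvious in a demo.
def hdd_write(shard_id: int) -> int:
    time.sleep(0.25)    # wait for the platter to spin, then fsync
    return shard_id

def ssd_write(shard_id: int) -> int:
    time.sleep(0.001)   # no mechanical delay
    return shard_id

def upload(shards: int = 20, quorum: int = 19) -> float:
    """Return how long the client waits for its success response."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=2 * shards) as pool:
        hdd_futures = [pool.submit(hdd_write, i) for i in range(shards)]
        ssd_futures = [pool.submit(ssd_write, i) for i in range(shards)]
        # Ack the client once a quorum of SSD flushes has landed,
        # without waiting on the slow HDD writes.
        for done, _ in enumerate(concurrent.futures.as_completed(ssd_futures), 1):
            if done >= quorum:
                break
        acked_after = time.perf_counter() - start
        # In the background, each completed HDD write lets the shard
        # stash purge its now-redundant SSD copy.
        for future in concurrent.futures.as_completed(hdd_futures):
            future.result()   # in the real system: purge from the stash
    return acked_after

latency = upload()
print(f"client saw success after {latency * 1e3:.1f} ms")
```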

Real-World Performance Gains

As I mentioned above, that calculated 167x performance advantage of SSDs over HDDs is somewhat theoretical. In the real world, the time required to upload a file also depends on a number of other factors—proximity to the data center, network speed, and all of the software and hardware between the client application and the storage device, to name a few.

The first Backblaze region to receive the performance upgrade was U.S. East, located in Reston, Virginia. Over a 12-day period following the shard stash deployment there, the average time to upload a 256kB file was 118ms, while a 1MB file clocked in at 137ms. To replicate a typical customer environment, we ran the test application at our partner Vultr’s New Jersey data center, uploading data to Backblaze B2 across the public internet.

For comparison, we ran the same test against Amazon S3’s U.S. East (Northern Virginia) region, a.k.a. us-east-1, from the same machine in New Jersey. On average, uploading a 256kB file to S3 took 157ms, with a 1MB file taking 153ms.

So, comparing the Backblaze B2 U.S. East region to the Amazon S3 equivalent, we benchmarked the new, improved Backblaze B2 as 30% faster than S3 for 256kB files and 10% faster than S3 for 1MB files.

These low-level tests were confirmed when we timed Veeam Backup & Replication software backing up 1TB of virtual machines with 256kB block sizes. Backing the server up to Amazon S3 took three hours and 12 minutes; we measured the same backup to Backblaze B2 at just two hours and 15 minutes, 40% faster than S3.

Test Methodology

We wrote a simple Python test app using the AWS SDK for Python (Boto3). Each test run involved timing 100 file uploads using the S3 PutObject API, with a 10ms delay between each upload. (FYI, the delay is not included in the measured time.) The test app used a single HTTPS connection across the test run, following best practice for API usage. We’ve been running the test on a VM in Vultr’s New Jersey region every six hours for the past few weeks against both our U.S. East region and its AWS neighbor. Latency to the Backblaze B2 API endpoint averaged 5.7ms, to the Amazon S3 API endpoint 7.8ms, as measured across 100 ping requests.
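A stripped-down sketch of such a harness might look like the following; the `put_object` callable is injected so the same loop can drive any S3 compatible endpoint (here, a stub that just measures loop overhead, since real credentials and bucket names would be needed otherwise):

```python
import time

def time_uploads(put_object, payload: bytes, runs: int = 100,
                 delay: float = 0.01) -> float:
    """Return the mean upload time in seconds over `runs` PutObject calls,
    pausing `delay` seconds between calls (the pause is not timed)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        put_object(payload)
        timings.append(time.perf_counter() - start)
        time.sleep(delay)
    return sum(timings) / len(timings)

# Real use would pass something like:
#   s3 = boto3.client("s3", endpoint_url="https://<your B2 S3 endpoint>")
#   put_object = lambda body: s3.put_object(Bucket="test", Key="obj", Body=body)
mean = time_uploads(lambda body: len(body), b"\0" * 256 * 1024,
                    runs=5, delay=0.001)
print(f"mean upload time: {mean * 1e3:.3f} ms")
```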

What’s Next?

At the time of writing, shard stash servers have been deployed to all of our data centers, across all of our regions. In fact, you might even have noticed small files uploading faster already. It’s important to note that this particular optimization is just one of a series of performance improvements that we’ve implemented, with more to come. It’s safe to say that all of our Backblaze B2 customers will enjoy faster uploads and downloads, no matter their storage workload.

The post How We Achieved Upload Speeds Faster Than AWS S3 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Things That Used to Be Science Fiction (and Aren’t Anymore)

Post Syndicated from Yev original https://www.backblaze.com/blog/things-that-used-to-be-science-fiction-and-arent-anymore/

A decorative image showing a spaceship beaming up the Backblaze logo.

The year is 2023, and the human race has spread across the globe. Nuclear powered flying cars are everywhere, and the first colonies have landed on Mars! [Radio crackles.] 

Okay, so that isn’t exactly how it’s gone down, but in honor of Halloween, a day that celebrates the whimsy of all things being possible, let’s talk about things that used to be science fiction and aren’t anymore.

Artificial Intelligence (AI)

Have we gotten reader fatigue from this topic yet? (As technology people by nature, we’re deep in it.) The rise of generative AI over the past year or so has certainly brought this subject into the spotlight, so in some ways it seems “early” to judge all the ways that AI will change things. On the other hand, there are lots of tools and functions we’ve been using for a while that have been powered by AI algorithms, including AI assistants. 

Shout out to this content creator for a hilarious video.

At the risk of not doing this topic justice in this list, I’ll say that there’s plenty of reporting on—and plenty of potential for—AI now and in the future. 

UFOs

This year, the U.S. House Oversight Committee was conducting an investigation on unidentified flying objects (UFOs). While many UFOs turn out to be things like weather balloons and drones designed for home use, well, some apparently aren’t. Three military veterans, including a former intelligence officer, went on record saying that the government has a secret facility where it’s been reverse engineering highly advanced vehicles, and that the U.S. has recovered “non-human biologics” from these crash sites. (Whatever that means—but we all know what that means, right…) 

Here’s the video, if you want to see for yourself. 

Weirdly, the public response was… not much of one. (The last couple of years have been “a year”.) But, chalk this one up as confirmed. 

Space Stations

The list of sci-fi shows and books set on space stations is definitely too long to list item by item. Depending on your age (and we won’t ask you to tell us), you may think of Isaac Asimov’s Foundation series (the books), Star Wars, Zenon: Girl of the 21st Century (or maybe the Zequel?), Babylon 5, or The Expanse. 

Back in the real world, the International Space Station (ISS) has been in orbit since 1998 and runs all manner of scientific experiments. Notably, these experiments include the Alpha Magnetic Spectrometer (AMS-02), which is designed to detect things like dark matter and antimatter and solve the fundamental mysteries of the universe. (No big deal.) 

For those of us stuck on Earth (for now), you can keep up with the ISS in lots of ways. Check out this map that lets you track whether you can see it from your current location. (Wave the next time it floats over!) And, of course, there are some fun YouTube channels streaming the ISS. Here’s just one:  

Universal Translators

Okay, “universal translator” is the cool sci-fi name, but if you want the actual machine learning (ML) name, folks call it interlingual machine translation. Translation may seem straightforward at first glance, but, as this legendary Star Trek episode demonstrates, things are not always so simple. 

And sure, it’s easy to say that this is an unreasonable standard given that most human languages are known—but are they? Native language reclamation projects like those from the Cherokee and Oneida tribes demonstrate how easy it is to lose the nuance of a language without those who natively speak it. Advanced degrees in poetry translation, like this Master of Fine Arts from the University of Iowa (go Hawks!), help specialists grapple with and capture the nuance between smell, scent, odor, and stench across different languages. And, add to those challenges that translators also have to contend with the wide array of accents in each language. 

With that in mind, it’s pretty amazing that we now have translation devices that can be as small as earbuds. Most still require an internet connection, and some are more effective than others, but it’s safe to say we live in the future, folks. Case in point: I had a wine tasting in Tuscany a few months ago where we used Google Translate exclusively to speak with the winemaker and proprietor. 

iPads

“What?” you say. “iPads are so normal!” Sure, now you’re used to touch screens. But, let me present you with this image from a show that is definitely considered science fiction:

Shockingly, not an iPad.

Yes, folks, that’s Captain Jean Luc Picard from Star Trek: The Next Generation. And here’s a later one, from Star Trek: Deep Space Nine. 

These are plans for the arboretum, so Keiko is probably dropping some knowledge.

Star Trek wikis describe the details of a Personal Access Display Device, or PADD, including a breakdown of how they changed over time in the series. Uhura even had a “digital clipboard” in the original Star Trek series: 

We’d have to revisit the episode to see what this masterful side-eye is about.

And, just for the record, we’ll always have a soft spot in our heart for Chief O’Brien’s love of backups.

Robot Domestic Laborer

If you were ever a fan of this lovely lady—

Rosie the Robot, of course, longtime employee and friend of The Jetsons.

—then you’ll be happy to know that your robot caretaker(s) have arrived. Just as Rosie was often seen using a separate vacuum cleaner, they’re not all integrated into one charming package—yet. If you’re looking for the full suite of domestic help, you’ll have to get a few different products. 

First up, the increasingly popular (and, as time goes on, increasingly affordable) robot vacuum. There are tons of models, from the simple vacuum to the vacuum/mop. While they’re reportedly prone to some pretty hilarious (horrific?) accidents, having one or several of these disk-shaped appliances saves lots of folks lots of time. Bonus: just add cat, and you have adorable internet content in the comfort of your own home. 

Next up, the Snoo, marketed as a smart bassinet, will track everything baby, then use that data to help said baby sleep. Parents who can afford to buy or rent this item sing its praises, noting that you can swaddle the baby for safety and review the data collected to better care for your child. 

And, don’t forget to round out your household with this charming toilet cleaning robot.

Robot Bartenders

In this iconic scene from The Fifth Element, Luc Besson’s 1997 masterpiece, a drunken priest waxes poetic about a perfect being (spoiler: she’s a woman) to a robot bartender. “Do you know what I mean?” the priest asks. The robot shakes its head. “Do you want some more?”

Start at about 2:00 minutes.

These days, you can actually visit robot bartenders in Las Vegas or on Royal Caribbean cruise ships. Or, if you’re looking for a robot bartender that does more than serve up a great Sazerac, you can turn to Brillo, a robot bartender powered by AI who can also engage in complex dialogue. 

Please politely ignore that his face is the stuff of nightmares…it’s what’s on the inside (and in the glass) that counts.

And, if leaving your house sounds terrible, don’t worry: you can also get a specialized appliance for your home. 

It’s a Good Time to Be Cloud Storage

One thing that all these current (and future) tech developments have in common: you never see them trailing wires. That means (you guessed it!) that they’re definitely using a central data source delivered via wireless network, a.k.a. the cloud.

After you’ve done all the work to, say, study an alien life form or design and program the perfect cocktail, you definitely don’t want to do that work twice. And, do you see folks slowing down to schedule a backup? Definitely not. Easy, online, always updating backups are the way to go.

So, we’re not going to say Backblaze Computer Backup makes the list as a sci-fi idea that we’ve made real; we’re just saying that it’s probably one of those things that people leave off-stage, like characters brushing their teeth on a regular basis. And, past or future, we’re here to remind you that you should always back up your data.

Things We Still Want (Get On It, Scientists!) 

Everything we just listed is really cool and all, but let’s not forget that we are still waiting for some very important things. Help us out, scientists; we really need these: 

  • Flying cars
  • Faster than light space travel
  • Teleportation 
  • Matter replicators (3D printing isn’t quite there)

We feel compelled to add that, despite our jocular tone, the line between science and science fiction has always been something of a thin one. Studies have shown, and inventors like Motorola’s Martin Cooper have gone on record confirming, that real inventions often find their inspiration in the imaginative works of science fiction. 

So, that leaves us standing by for new developments! Let’s see what 2024 brings. Let us know in the comments section what cool tech in your life fits this brief.

The post Things That Used to Be Science Fiction (and Aren’t Anymore) appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

HYCU + Backblaze: Protecting Against Data Loss

Post Syndicated from Jennifer Newman original https://www.backblaze.com/blog/hycu-backblaze-protecting-against-data-loss/

A decorative image showing the Backblaze and HYCU logos.

Backblaze and HYCU, the fastest-growing leader in multi-cloud data protection as a service (DPaaS), are teaming up to provide businesses a complete backup solution for modern workloads with low-cost, scalable infrastructure—a must-have for modern cyber resilience needs.

Read on to learn more about the partnership, how you can benefit from affordable, agile data protection, and a little bit about a relevant ancient poetic art form.

HYCU + Backblaze: The Power of Collaboration

Within HYCU’s DPaaS platform, shared customers can now select Backblaze B2 Cloud Storage—an S3 compatible object storage platform that provides highly durable, instantly available hot storage—as a destination for their HYCU backups. 

With more applications in use across the modern data center, visibility and the ability to protect that mission-critical data has never been at more of a premium. Our collaboration with Backblaze now offers joint customers a cost-effective and scalable data protection solution combining the best in backup and recovery with Backblaze’s streamlined and secure cloud storage.

—Subbiah Sundaram, SVP Product, HYCU, Inc.

The Data Sprawl Problem

On average, businesses and organizations have upwards of 200 different sets of data or “data silos” spread across a growing number of applications, databases, and physical locations. This data sprawl isn’t just hard to manage, it opens up more opportunities for cybercriminals to inject ransomware and gain access to systems. 

HYCU gives customers the power to protect every byte while also managing all their business critical data in one place. Powered by the world’s first development platform for data protection, HYCU is the only DPaaS platform that can scale to protect all of your data—wherever it resides. Most importantly, it gives customers the ability to recover from disaster almost instantly, keeping them online and in business, with an average recovery time of 10 minutes. 

Backblaze and HYCU:

Keeping data safe for all

at one-fifth the cost.

By combining HYCU data protection with Backblaze B2 Storage Cloud, customers can see up to 80% lower costs in comparison to using providers like AWS for their storage, which means that combining the two can be a force multiplier for a businesses’ ability to fully protect their data and scale efficiently and reliably.

Data protection:

Once challenging, now easy—

HYCU and B2.

The partnership offers the following benefits:

  • Performance: With a 99.9% uptime service level agreement (SLA) and no cold delays or speed premiums, storing data in Backblaze B2 Cloud Storage means joint customers have instant access to their data whenever and wherever they need it. 
  • Affordability: Existing customers can reduce their total cost of ownership by switching backup tiers with interoperable S3 compatible storage, and institutions and businesses who may not have been able to afford hyperscaler-based solutions can now protect their data.
  • Compliance and Security: With Backblaze B2’s Object Lock feature, the partnership also offers an additional layer of security through data immutability, which protects backups from ransomware and satisfies evolving cyber insurance requirements.

These benefits can prove particularly useful for higher education institutions, schools, state and local governments, nonprofits, and others where maximizing tight budgets is always a priority.

What’s in a Name?

For the poetically minded among our readership (there must be a few of you, right?), you may have noticed a haiku or two above. And that’s not a coincidence.

The humble haiku inspired the name for HYCU. In true poetic fashion, the name serves more than one purpose—it’s also an acronym for “hyperconverged uptime,” making the fewest letters do the most, as they should.

Making Data Protection Easier

This partnership adds a powerful new data protection option for joint customers looking to affordably back up their data and establish a disaster recovery strategy. And, this is just the beginning. Stay tuned for more from this partnership, including integrations with HYCU’s other data protection offerings in the future. 

Interested in getting started? Learn more in our docs.

The post HYCU + Backblaze: Protecting Against Data Loss appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloud Storage Performance: The Metrics that Matter

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/cloud-storage-performance-the-metrics-that-matter/

A decorative image showing a cloud in the foreground and various mocked up graphs in the background.

Availability, time to first byte, throughput, durability—there are plenty of ways to measure “performance” when it comes to cloud storage. But, which measure is best and how should performance factor in when you’re choosing a cloud storage provider? Other than security and cost, performance is arguably the most important decision criterion, but it’s also the hardest dimension to pin down. It can be highly variable and depends on your own infrastructure, your workload, and all the network connections between your infrastructure and the cloud provider as well.

Today, I’m walking through how to think strategically about cloud storage performance, including which metrics matter and which may not be as important for you.

First, What’s Your Use Case?

The first thing to keep in mind is how you’re going to be using cloud storage. After all, performance requirements will vary from one use case to another. For instance, you may need greater performance in terms of latency if you’re using cloud storage to serve up software as a service (SaaS) content; however, if you’re using cloud storage to back up and archive data, throughput is probably more important for your purposes.

For something like application storage, you should also have other tools in your toolbox even when you are using hot, fast, public cloud storage, like the ability to cache content on edge servers, closer to end users, with a content delivery network (CDN).

Ultimately, you need to decide which cloud storage metrics are the most important to your organization. Performance is important, certainly, but security or cost may be weighted more heavily in your decision matrix.

A decorative image showing several icons representing different types of files on a grid over a cloud.

What Is Performant Cloud Storage?

Performance can be described using a number of different criteria, including:

  • Latency
  • Throughput
  • Availability
  • Durability

I’ll define each of these and talk a bit about what each means when you’re evaluating a given cloud storage provider and how they may affect upload and download speeds.

Latency

  • Latency is defined as the time between a client request and a server response. It quantifies the time it takes data to transfer across a network.  
  • Latency is primarily influenced by physical distance—the farther away the client is from the server, the longer it takes to complete the request. 
  • If you’re serving content to many geographically dispersed clients, you can use a CDN to reduce the latency they experience. 

Latency can be influenced by network congestion, security protocols on a network, or network infrastructure, but the primary cause is generally distance, as we noted above. 

Downstream latency is typically measured using time to first byte (TTFB). In the context of surfing the web, TTFB is the time between a page request and when the browser receives the first byte of information from the server. In other words, TTFB is measured by how long it takes between the start of the request and the start of the response, including DNS lookup and establishing the connection using a TCP handshake and TLS handshake if you’ve made the request over HTTPS.
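As a rough illustration, here's one way to approximate TTFB in Python with only the standard library. The timer starts before the request is sent (http.client connects lazily, so the TCP connect is included) and stops once the response status line arrives; the demo spins up a throwaway local server in place of a real endpoint:

```python
import http.client
import http.server
import socketserver
import threading
import time

def measure_ttfb(host: str, port: int, path: str = "/") -> float:
    """Time from starting the request until the response status line
    has been read: a close proxy for TTFB."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    start = time.perf_counter()
    conn.request("GET", path)
    response = conn.getresponse()   # returns once the first bytes arrive
    ttfb = time.perf_counter() - start
    response.read()
    conn.close()
    return ttfb

# Demo against a throwaway local server standing in for a real endpoint.
class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, *args):   # keep the demo output clean
        pass

server = socketserver.TCPServer(("127.0.0.1", 0), QuietHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
ttfb = measure_ttfb("127.0.0.1", server.server_address[1])
server.shutdown()
server.server_close()
print(f"TTFB: {ttfb * 1000:.2f} ms")
```

Against a loopback server the number is tiny; against a cloud endpoint across the public internet, physical distance dominates it, which is the point of the paragraph above.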

Let’s say you’re uploading data from California to a cloud storage data center in Sacramento. In that case, you’ll experience lower latency than if your business data is stored in, say, Ohio and has to make the cross-country trip. However, making the “right” decision about where to store your data isn’t quite as simple as that, and the complexity goes back to your use case. If you’re using cloud storage for off-site backup, you may want your data to be stored farther away from your organization to protect against natural disasters. In this case, performance is likely secondary to location—you only need fast enough performance to meet your backup schedule. 

Using a CDN to Improve Latency

If you’re using cloud storage to store active data, you can speed up performance by using a CDN. A CDN helps speed content delivery by caching content at the edge, meaning faster load times and reduced latency. 

Edge networks create “satellite servers” that are separate from your central data server, and CDNs leverage these to chart the fastest data delivery path to end users. 


  • Throughput is a measure of the amount of data passing through a system at a given time.
  • If you have spare bandwidth, you can use multi-threading to improve throughput. 
  • Cloud storage providers’ architecture influences throughput, as do their policies around slowdowns (i.e. throttling).

Throughput is often confused with bandwidth. The two concepts are closely related, but different. 

To explain them, it’s helpful to use a metaphor: Imagine a swimming pool. The amount of water in it is your file size. When you want to drain the pool, you need a pipe. Bandwidth is the size of the pipe, and throughput is the rate at which water moves through the pipe successfully. So, bandwidth affects your ultimate throughput. Throughput is also influenced by processing power, packet loss, and network topology, but bandwidth is the main factor. 

Using Multi-Threading to Improve Throughput

Assuming you have some bandwidth to spare, one of the best ways to improve throughput is to enable multi-threading. Threads are units of execution within processes. When you transmit files using a program across a network, they are being communicated by threads. Using more than one thread (multi-threading) to transmit files is, not surprisingly, better and faster than using just one (although a greater number of threads will require more processing power and memory). To return to our water pipe analogy, multi-threading is like having multiple water pumps (threads) running to that same pipe. Maybe with one pump, you can only fill 10% of your pipe. But you can keep adding pumps until you reach pipe capacity.
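As an illustration of the idea (not any particular backup tool's implementation), here's what multi-threaded part transfers might look like in Python; `transfer_one` is a hypothetical stand-in for whatever single-part upload or download call your tool or SDK provides:

```python
from concurrent.futures import ThreadPoolExecutor

def transfer_parts(parts, transfer_one, threads=4):
    """Run the single-part transfer function over all parts using a
    pool of worker threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(transfer_one, parts))
```

Each in-flight part is held in memory, so threads multiplied by part size needs to fit comfortably in RAM.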

When you’re using cloud storage with an integration like backup software or a network attached storage (NAS) device, the multi-threading setting is typically found in the integration’s settings. Many backup tools, like Veeam, are already set to multi-thread by default. Veeam automatically makes adjustments based on details like the number of individual backup jobs, or you can configure the number of threads manually. Other integrations, like Synology’s Cloud Sync, also give you granular control over threading so you can dial in your performance.  

A diagram showing single vs. multi-threaded processes.
Still confused about threads? Learn more in our deep dive, including what’s going on in this diagram.

That said, the gains from increasing the number of threads are limited by the available bandwidth, processing power, and memory. Finding the right setting can involve some trial and error, but the improvements can be substantial (as we discovered when we compared download speeds on different Python versions using single vs. multi-threading).

What About Throttling?

One question you’ll absolutely want to ask when you’re choosing a cloud storage provider is whether they throttle traffic. That means they deliberately slow down your connection for various reasons. Shameless plug here: Backblaze does not throttle, so customers are able to take advantage of all their bandwidth while uploading to B2 Cloud Storage. Amazon and many other public cloud services do throttle.

Upload Speed and Download Speed

Your ultimate upload and download speeds will be affected by throughput and latency. Again, it’s important to consider your use case when determining which performance measure is most important for you. Latency is important to application storage use cases where things like how fast a website loads can make or break a potential SaaS customer. With latency being primarily influenced by distance, it can be further optimized with the help of a CDN. Throughput is often the measurement that’s more important to backup and archive customers because it is indicative of the upload and download speeds an end user will experience, and it can be influenced by cloud storage provider practices, like throttling.   


  • Availability is the percentage of time a cloud service or a resource is functioning correctly.
  • Make sure the availability listed in the cloud provider’s service level agreement (SLA) matches your needs. 
  • Keep in mind the difference between hot and cold storage—cold storage services like Amazon Glacier offer slower retrieval and response times.

Also called uptime, this metric measures the percentage of time that a cloud service or resource is available and functioning correctly. It’s usually expressed as a percentage, with 99.9% (three nines) or 99.99% (four nines) availability being common targets for critical services. Availability is often backed by SLAs that define the uptime customers can expect and what happens if availability falls below that metric. 
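To make those targets concrete, you can convert an availability percentage into allowable downtime per year; a quick back-of-envelope calculation:

```python
def downtime_minutes_per_year(availability_pct):
    """Annual downtime permitted by an availability target, in minutes."""
    return (1 - availability_pct / 100) * 365 * 24 * 60

# 99.9% allows roughly 526 minutes (about 8.8 hours) of downtime a year;
# 99.99% allows roughly 53 minutes.
```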

You’ll also want to consider availability if you’re considering whether you want to store in cold storage versus hot storage. Cold storage is lower performing by design. It prioritizes durability and cost-effectiveness over availability. Services like Amazon Glacier and Google Coldline take this approach, offering slower retrieval and response times than their hot storage counterparts. While cost savings is typically a big factor when it comes to considering cold storage, keep in mind that if you do need to retrieve your data, it will take much longer (potentially days instead of seconds), and speeding that up at all is still going to cost you. You may end up paying more to get your data back faster, and you should also be aware of the exorbitant egress fees and minimum storage duration requirements for cold storage—unexpected costs that can easily add up. 

|                  | Cold                        | Hot                                 |
|------------------|-----------------------------|-------------------------------------|
| Access Speed     | Slow                        | Fast                                |
| Access Frequency | Seldom or Never             | Frequent                            |
| Data Volume      | Low                         | High                                |
| Storage Media    | Slower drives, LTO, offline | Faster drives, durable drives, SSDs |
| Cost             | Lower                       | Higher                              |


  • Durability is the ability of a storage system to consistently preserve data.
  • Durability is measured in “nines” or the probability that your data is retrievable after one year of storage. 
  • We designed the Backblaze B2 Storage Cloud for 11 nines of durability using erasure coding.

Data durability refers to the ability of a data storage system to reliably and consistently preserve data over time, even in the face of hardware failures, errors, or unforeseen issues. It is a measure of data’s long-term resilience and permanence. Highly durable data storage systems ensure that data remains intact and accessible, meeting reliability and availability expectations, making it a fundamental consideration for critical applications and data management.

We usually measure durability (more precisely, annual durability) in "nines": the number of nines in the probability, expressed as a percentage, that your data is retrievable after one year of storage. We know from our work on Drive Stats that an annual failure rate of 1% is typical for a hard drive. So, if you were to store your data on a single drive, its durability, the probability that it would not fail, would be 99%, or two nines.

The very simplest way of improving durability is to simply replicate data across multiple drives. If a file is lost, you still have the remaining copies. It’s also simple to calculate the durability with this approach. If you write each file to two drives, you lose data only if both drives fail. We calculate the probability of both drives failing by multiplying the probabilities of either drive failing, 0.01 x 0.01 = 0.0001, giving a durability of 99.99%, or four nines. While simple, this approach is costly—it incurs a 100% overhead in the amount of storage required to deliver four nines of durability.
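The arithmetic in that paragraph is simple enough to write down directly; a small sketch, assuming independent drive failures as above:

```python
def durability(annual_failure_rate, copies):
    """Probability your data survives the year when each of `copies`
    identical, independent drives fails at the given annual rate."""
    return 1 - annual_failure_rate ** copies

# One drive at a 1% annual failure rate: 0.99, or two nines.
# Two mirrored drives: 1 - 0.01 * 0.01 = 0.9999, or four nines.
```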

Erasure coding is a more sophisticated technique, improving durability with much less overhead than simple replication. An erasure code takes a “message,” such as a data file, and makes a longer message in a way that the original can be reconstructed from the longer message even if parts of the longer message have been lost. 

A decorative image showing the matrices that get multiplied to allow Reed-Solomon code to re-create files.
A representation of Reed-Solomon erasure coding, with some very cool Storage Pods in the background.

The durability calculation for this approach is much more complex than for replication, as it involves the time required to replace and rebuild failed drives as well as the probability that a drive will fail, but we calculated that we could take advantage of erasure coding in designing the Backblaze B2 Storage Cloud for 11 nines of durability with just 25% overhead in the amount of storage required. 

How does this work? Briefly, when we store a file, we split it into 16 equal-sized pieces, or shards. We then calculate four more shards, called parity shards, in such a way that the original file can be reconstructed from any 16 of the 20 shards. We then store the resulting 20 shards on 20 different drives, each in a separate Storage Pod (storage server).
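To get a feel for why 16-of-20 beats simple replication, here's a toy binomial model. It assumes independent shard losses and ignores repair entirely, so it badly understates real-world durability (the actual calculation accounts for rebuild times, as noted above), but the comparison is instructive:

```python
from math import comb

def unrecoverable_probability(n, k, p):
    """P(more than n - k of the n shards are lost), i.e., the file cannot
    be reconstructed. Assumes independent losses with probability p each
    and no repair, so this is a pessimistic toy model."""
    return sum(comb(n, f) * p**f * (1 - p)**(n - f)
               for f in range(n - k + 1, n + 1))

# Two-way replication is the n=2, k=1 case: 0.0001, matching four nines.
# A 20-shard, 16-data scheme at the same 1% loss rate is orders of
# magnitude safer, with 25% storage overhead instead of 100%.
```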

Note: As hard disk drive capacity increases, so does the time required to recover after a drive failure, so we periodically adjust the ratio between data shards and parity shards to maintain our eleven nines durability target. Consequently, our very newest vaults use a 15+5 scheme.

If a drive does fail, it can be replaced with a new drive, and its data rebuilt from the remaining good drives. We open sourced our implementation of Reed-Solomon erasure coding, so you can dive into the source code for more details.

Additional Factors Impacting Cloud Storage Performance

In addition to bandwidth and latency, there are a few additional factors that impact cloud storage performance, including:

  • The size of your files.
  • The number of files you upload or download.
  • Block (part) size.
  • The amount of available memory on your machine. 

Small files—that is, those less than 5GB—can be uploaded in a single API call. Larger files, from 5MB to 10TB, can be uploaded as “parts”, in multiple API calls. You’ll notice that there is quite an overlap here! For uploading files between 5MB and 5GB, is it better to upload them in a single API call, or split them into parts? What is the optimum part size? For backup applications, which typically split all data into equal-sized blocks, storing each block as a file, what is the optimum block size? As with many questions, the answer is that it depends.

Remember latency? Each API call incurs a more-or-less fixed overhead due to latency. For a 1GB file, assuming a single thread of execution, uploading all 1GB in a single API call will be faster than ten API calls each uploading a 100MB part, since those additional nine API calls each incur some latency overhead. So, bigger is better, right?

Not necessarily. Multi-threading, as mentioned above, affords us the opportunity to upload multiple parts simultaneously, which improves performance—but there are trade-offs. Typically, each part must be stored in memory as it is uploaded, so more threads means more memory consumption. If the number of threads multiplied by the part size exceeds available memory, then either the application will fail with an out of memory error, or data will be swapped to disk, reducing performance.
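A toy model makes these trade-offs concrete. This is a sketch under stated assumptions (fixed per-call latency, fully utilized bandwidth, no retries), not a description of any real client:

```python
from math import ceil

def upload_time(file_size, part_size, threads, latency, bandwidth, memory):
    """Estimated seconds to upload a file split into parts, where each
    round of concurrent API calls pays one latency overhead and every
    in-flight part occupies memory."""
    if threads * part_size > memory:
        raise MemoryError("threads x part size exceeds available memory")
    rounds = ceil(ceil(file_size / part_size) / threads)
    return rounds * latency + file_size / bandwidth

# 1GB file, 100MB parts, 10 threads, 50ms latency, 100MB/s bandwidth:
# one round of calls, about 10.05s. The same upload single-threaded pays
# the latency ten times (about 10.5s), and 30 threads at 100MB parts
# would need 3GB of memory.
```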

Downloading data offers even more flexibility, since applications can specify any portion of the file to download in each API call. Whether uploading or downloading, there is a maximum number of threads that will drive throughput to consume all of the available bandwidth. Exceeding this maximum will consume more memory, but provide no performance benefit. If you go back to our pipe analogy, you’ll have reached the maximum capacity of the pipe, so adding more pumps won’t make things move faster. 

So, what to do to get the best performance possible for your use case? Simple: customize your settings. 

Most backup and file transfer tools allow you to configure the number of threads and the amount of data to be transferred per API call, whether that’s block size or part size. If you are writing your own application, you should allow for these parameters to be configured. When it comes to deployment, some experimentation may be required to achieve maximum throughput given available memory.

How to Evaluate Cloud Performance

To sum up, the cloud is increasingly becoming a cornerstone of every company’s tech stack. Gartner predicts that by 2026, 75% of organizations will adopt a digital transformation model predicated on cloud as the fundamental underlying platform. So, cloud storage performance will likely be a consideration for your company in the next few years if it isn’t already.

It’s important to consider that cloud storage performance can be highly subjective and heavily influenced by things like use case considerations (i.e. backup and archive versus application storage, media workflow, or another), end user bandwidth and throughput, file size, block size, etc. Any evaluation of cloud performance should take these factors into account rather than simply relying on metrics in isolation. And, a holistic cloud strategy will likely have multiple operational schemas to optimize resources for different use cases.

Wait, Aren’t You, Backblaze, a Cloud Storage Company?

Why, yes. Thank you for noticing. We ARE a cloud storage company, and we OFTEN get questions about all of the topics above. In fact, that’s why we put this guide together—our customers and prospects are the best sources of content ideas we can think of. Circling back to the beginning, it bears repeating that performance is one factor to consider in addition to security and cost. (And, hey, we would be remiss not to mention that we’re also one-fifth the cost of AWS S3.) Ultimately, whether you choose Backblaze B2 Cloud Storage or not though, we hope the information is useful to you. Let us know if there’s anything we missed.

The post Cloud Storage Performance: The Metrics that Matter appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Load Balancing 2.0: What’s Changed After 7 Years?

Post Syndicated from nathaniel wagner original https://www.backblaze.com/blog/load-balancing-2-0-whats-changed-after-7-years/

A decorative image showing a file, a drive, and some servers.

What do two billion transactions a day look like? Well, the data may be invisible to the naked eye, but the math breaks down to just over 23,000 transactions every second. (Shout out to Kris Allen for burning into my memory that there are 86,400 seconds in a day and my handy calculator!) 

Part of my job as a Site Reliability Engineer (SRE) at Backblaze is making sure that greater than 99.9% of those transactions are what we consider “OK” (via status code), and part of the fun is digging for the needle in the haystack of over 7,000 production servers and over 250,000 spinning hard drives to try and understand how all of the different transactions interact with the pieces of infrastructure. 

In this blog post, I’m going to pick up where Principal SRE Elliott Sims left off in his 2016 article on load balancing. You’ll notice that the design principles we’ve employed are largely the same (cool!). So, I’ll review our stance on those principles, then talk about how the evolution of the B2 Cloud Storage platform—including the introduction of the S3 Compatible API—has changed our load balancing architecture. Read on for specifics. 

Editor’s Note

We know there are a ton of specialized networking terms flying around this article, and one of our primary goals is to make technical content accessible to all readers, regardless of their technical background. To that end, we’ve used footnotes to add some definitions and minimize the disruption to your reading experience.

What Is Load Balancing?

Load balancing is the process of distributing traffic across a network. It helps with resource utilization, prevents overloading any one server, and makes your system more reliable. Load balancers also monitor server health and redirect requests to the most suitable server.

With two billion requests per day to our servers, you can be assured that we use load balancers at Backblaze. Whenever anyone—a Backblaze Computer Backup or a B2 Cloud Storage customer—wants to upload or download data or modify or interact with their files, a load balancer will be there to direct traffic to the right server. Think of them as your trusty mail service, delivering your letters and packages to the correct destination—and using things like zip codes and addresses to interpret your request and get things to the right place.  

How Do We Do It?

We build our own load balancers using open-source tools. We use layer 4 load balancing with direct server response (DSR). Here are some of the resources that we call on to make that happen:  

  • Border Gateway Protocol (BGP), which we run in concert with the Linux kernel1. It's a standardized gateway protocol that exchanges routing and reachability information on the internet.   
  • keepalived, an open-source routing software. keepalived keeps track of all of our VIPs2 and IPs3 for each backend server. 
  • Hard disk drives (HDDs). We use the same drives that we use for our other API servers, which is definitely overkill for a load balancer, but it saves us the work of sourcing another type of device.  
  • A lot of hard work by a lot of really smart folks.

What We Mean by Layers

When we’re talking about layers in load balancing, it’s shorthand for how deep into the architecture your program needs to see. Here’s a great diagram that defines those layers: 

An image describing application layers.

DSR takes place at layer 4 and solves a problem inherent to the full proxy4 method: with a full proxy, the backend server never sees the original client's IP address, while DSR preserves it.

Why Do We Do It the Way We Do It?

Building our own load balancers, instead of buying an off-the-shelf solution, means that we have more control and insight, more cost-effective hardware, and more scalable architecture. In general, DSR is more complicated to set up and maintain, but this method also lets us handle lots of traffic with minimal hardware and supports our goal of keeping data encrypted, even within our own data center. 

What Hasn’t Changed

We’re still using a layer 4 DSR approach to load balancing, which we’ll explain below. For reference, other common methods of load balancing are layer 7, full proxy and layer 4, full proxy load balancing. 

First, I’ll explain how DSR works. DSR load balancing requires two things:

  1. A load balancer with the VIP address attached to an external NIC5 and ARPing6, so that the rest of the network knows it “owns” the IP.
  2. Two or more servers on the same layer 2 network that also have the VIP address attached to a NIC, either internal or external, but are not replying to ARP requests about that address. This means that no other servers on the network know that the VIP exists anywhere but on the load balancer.

A request packet will enter the network, and be routed to the load balancer. Once it arrives there, the load balancer leaves the source and destination IP addresses intact and instead modifies the destination MAC7 address to that of a server, then puts the packet back on the network. The network switch only understands MAC addresses, so it forwards the packet on to the correct server.

A diagram of how a packet moves through the network router and load balancer to reach the server, then respond to the original client request.

When the packet arrives at the server’s network interface, it checks to make sure the destination MAC address matches its own. The address matches, so the server accepts the packet. The server network interface then, separately, checks to see whether the destination IP address is one attached to it somehow. That’s a yes, even though the rest of the network doesn’t know it, so the server accepts the packet and passes it on to the application. The application then sends a response with the VIP as the source IP address and the client as the destination IP, so the request (and subsequent response) is routed directly to the client without passing back through the load balancer.
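The rewrite step described above can be sketched in a few lines; the packet fields here are illustrative, not a real network stack:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src_ip: str    # the client
    dst_ip: str    # the VIP
    src_mac: str   # last-hop router
    dst_mac: str   # the load balancer, on arrival

def dsr_forward(pkt, backend_mac):
    """Layer 4 DSR: both IP addresses are left intact; only the
    destination MAC is rewritten before the packet re-enters the network."""
    return replace(pkt, dst_mac=backend_mac)
```

Because the source IP still belongs to the client, the backend server can reply directly, which is the whole point of DSR.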

So, What’s Changed?

Lots of things. But, since we first wrote this article, we’ve expanded our offerings and platform. The biggest of these changes (as far as load balancing is concerned) is that we added the S3 Compatible API. 

We also serve a much more diverse set of clients, both in the size of files they have and their access patterns. File sizes affect how long it takes us to serve requests (larger files = more time to upload or download, which means an individual server is tied up for longer). Access patterns can vastly increase the amount of requests a server has to process on a regular, but not consistent basis (which means you might have times that your network is more or less idle, and you have to optimize appropriately). 

A definitely Photoshopped images showing a giraffe riding an elephant on a rope in the sky. The rope's anchor points disappear into clouds.
So, if we were to update this amazing image from the first article, we might have a tightrope walker with a balancing pole on top of the giraffe, plus some birds flying on a collision course with the elephant.

Where You Can See the Changes: ECMP, Volume of Data Per Month, and API Processing

DSR is how we send data to the customer—the server responds (sends data) directly to the request (client). This is the equivalent of going to the post office to mail something, but adding your home address as the return address (so that you don’t have to go back to the post office to get your reply).   

Given how our platform has evolved over the years, things might happen slightly differently. Let’s dig in to some of the details that affect how the load balancers make their decisions—what rules govern how they route traffic, and how different types of requests cause them to behave differently. We’ll look at:

  • Equal cost multipath routing (ECMP). 
  • Volume of data in petabytes (PBs) per month.
  • APIs and processing costs.


One thing that’s not explicitly outlined above is how the load balancer determines which server should respond to a request. At Backblaze, we use stateless load balancing, which means that the load balancer doesn’t take into account most information about the servers it routes to. We use a round robin approach—i.e. the load balancers choose between one of a few hosts, in order, each time they’re assigning a request. 

We also use Maglev, so the load balancers use consistent hashing and connection tracking. This means that we're minimizing the negative impact of unexpected failures on connection-oriented protocols. If a load balancer goes down, its server pool can be transferred to another, and it will make decisions in the same way, seamlessly picking up the load. When the initial load balancer comes back online, it already has a connection to its load balancer friend and can pick up where it left off. 
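Maglev's actual lookup-table construction is beyond a blog aside, but rendezvous (highest-random-weight) hashing, sketched below as a simpler stand-in, shows the same key property: any load balancer with the same server list makes the same choice for a given request, and removing a server only remaps the requests that were on it:

```python
import hashlib

def pick_server(key, servers):
    """Deterministically pick a server for a key: hash every (key, server)
    pair and take the highest score. No shared state is needed, so any
    load balancer holding the same server list agrees on the answer."""
    return max(servers,
               key=lambda s: hashlib.sha256(f"{key}@{s}".encode()).digest())
```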

The upside is that it’s super rare to see a disruption, and it essentially only happens when the load balancer and the neighbor host go down in a short period of time. The downside is that the load balancer decision is static. If you have “better” servers for one reason or another—they’re newer, for instance—they don’t take that information into account. On the other hand, we do have the ability to push more traffic to specific servers through ECMP weights if we need to, which means that we have good control over a diverse fleet of hardware. 

Volume of Data

Backblaze now has over three exabytes of storage under management. Based on the scalability of the network design, that doesn’t really make a huge amount of difference when you’re scaling your infrastructure properly. What can make a difference is how people store and access their data. 

Most of the things that make managing large datasets difficult from an architecture perspective can also be silly for a client. For example: querying files individually (creating lots of requests) instead of batching or creating a range request. (There may be a business reason to do that, but usually, it makes more sense to batch requests.)

On the other hand, some things that make sense for how clients need to store data require architecture realignment. One of those is just a sheer fluctuation of data by volume—if you’re adding and deleting large amounts of data (we’re talking hundreds of terabytes or more) on a “shorter” cycle (monthly or less), then there will be a measurable impact. And, with more data stored, you have the potential for more transactions.

Similarly, if you need to retrieve data often, but not regularly, there are potential performance impacts. Most of them are related to caching, and ironically, they can actually improve performance. The more you query the same “set” of servers for the same file, the more likely that each server in the group will have cached your data locally (which means they can serve it more quickly). 

And, as with most data centers, we store our long term data on hard disk drives (HDDs), whereas our API servers are on solid state drives (SSDs). There are positives and negatives to each type of drive, but the performance impact is that data at rest takes longer to retrieve, and data in the cache is on a faster SSD drive on the API server. 

On the other hand, the more servers the data center has, the lower the chance that any given server will have your data cached. And, of course, if you're replacing large volumes of old data with new on a shorter timeline, then you won't see the benefits. It sounds like an edge case, but security camera data is a great example. While those customers don't retrieve their data very frequently, they are constantly uploading and overwriting it, often to meet industry requirements about retention periods, which makes it challenging to allocate a finite amount of input/output operations per second (IOPS) across uploads, downloads, and deletes.  

That said, the built-in benefit of our system is that adding another load balancer is (relatively) cheap. If we’re experiencing a processing chokepoint for whatever reason—typically either a CPU bottleneck or from throughput on the NIC—we can add another load balancer, and just like that, we can start to see bits flying through the new load balancer and traffic being routed amongst more hosts, alleviating the choke points. 

APIs and Processing Costs

We mentioned above that one of the biggest changes to our platform was the addition of the S3 Compatible API. When all requests were made through the B2 Native API, the Backblaze CLI tool, or the web UI, the processing cost was relatively cheap. 

That’s because of the way our upload requests to the Backblaze Vaults are structured. When you make an upload request via the Native API, there are actually two transactions, one to get an upload URL, and the second to send a request to the Vault. And, all other types of requests (besides upload) have always had to be processed through our load balancers. Since the S3 Compatible API is a single request, we knew we would have to add more processing power and load balancers. (If you want to go back to 2018 and see some of the reasons why, here’s Brian Wilson on the subject—read with the caveat that our current Tech Doc on the subject outlines how we solve the complications he points out.) 

We’re still leveraging DSR to respond directly to the client, but we’ve significantly increased the amount of transactions that hit our load balancers, both because it has to take on more of the processing during transit and because, well, lots of folks like to use the S3 Compatible API and our customer base has grown by a quite a bit since 2018. 

And, just like above, we’ve set ourselves up for a relatively painless fix: we can add another load balancer to solve most problems. 

Do We Have More Complexity?

This is the million dollar question, solved for a dollar: how could we not? Since our first load balancing article, we've added features, complexity, and lots of customers. Load balancing algorithms are inherently complex, but we (mostly Elliott and other smart people) have taken a lot of time and consideration to not just build a system that will scale up to and past two billion transactions a day, but one that can be fairly "easily" explained and doesn't require a graduate degree to understand what is happening.  

But, we knew it was important early on, so we prioritized building a system where we could “just” add another load balancer. The thinking is more complicated at the outset, but the tradeoff is that it’s simple once you’ve designed the system. It would take a lot for us to outgrow the usefulness of this strategy—but hey, we might get there someday. When we do, we’ll write you another article. 


  1. Kernel: A kernel is the computer program at the core of a computer’s operating system. It has control of the system and does things like run processes, manage hardware devices like the hard drive, and handle interrupts, as well as memory and input/output (I/O) requests from software, translating them to instructions for the central processing unit (CPU). ↩
  2. Virtual IP address (VIP): An IP address that is not tied to one particular physical machine; it can be claimed by, and moved between, servers. ↩
  3. Internet protocol (IP) address: The global logical address of a device, used to identify devices on the internet. Can be changed.  ↩
  4. Proxy: In networking, a proxy is a server application that validates outside requests to a network. Think of them as a gatekeeper. There are several common types of proxies you interact with all the time—HTTPS requests on the internet, for example. ↩
  5. Network interface controller, or network interface card (NIC): This connects the computer to the network. ↩
  6. Address resolution protocol (ARP): The process by which a device or network identifies another device. There are four types. ↩
  7. Media access control (MAC): The local physical address of a device, used to identify devices on the same network. Hardcoded into the device. ↩

The post Load Balancing 2.0: What’s Changed After 7 Years? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

NAS Performance Guide: How To Optimize Your NAS

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/nas-performance-guide-how-to-optimize-your-nas/

A decorative images showing a 12 bay NAS device connecting to the cloud.

Upgrading to a network attached storage (NAS) device puts your data in the digital fast lane. If you’re using one, it’s likely because you want to keep your data close to you, ensuring quick access whenever it’s needed. NAS devices, acting as centralized storage systems connected to local networks, offer a convenient way to access data in just a few clicks. 

However, as the volume of data on the NAS increases, its performance can tank. You need to know how to keep your NAS operating at its best, especially with growing data demand. 

In this blog, you’ll learn about various factors that can affect NAS performance, as well as practical steps you can take to address these issues, ensuring optimal speed, reliability, and longevity for your NAS device.

Why NAS Performance Matters

NAS devices can function as extended hard disks, virtual file cabinets, or centralized local storage solutions, depending on individual needs. 

While NAS offers a convenient way to store data locally, storing the data alone isn’t enough. How quickly and reliably you can access your data can make all the difference if you want an efficient workflow. For example, imagine working on a critical project with your team and facing slow file transfers, or streaming a video on a Zoom call only for it to stutter or buffer continuously.

All these can be a direct result of NAS performance issues, and an increase in stored data can directly undermine the device’s performance. Therefore, ensuring optimal performance isn’t just a technical concern, it’s also a concern that directly affects user experience, productivity, and collaboration. 

So, let’s talk about what could potentially cause performance issues and how to enhance your NAS. 

Common NAS Performance Issues

NAS performance can be influenced by a variety of factors. Here are some of the most common factors that can impact the performance of a NAS device.

Hardware Limitations:

  • Insufficient RAM: Especially in tasks like media streaming or handling large files, having inadequate memory can slow down operations. 
  • Slow CPU: An underpowered processor can become a bottleneck when multiple users access the NAS at once or during collaboration with team members. 
  • Drive Speed and Type: Hard disk drives (HDDs) are generally slower than solid state drives (SSDs), and your NAS can have either type. If your NAS mainly serves as a hub for storing and sharing files, a conventional HDD should meet your requirements. However, if you need more speed and responsiveness, SSDs deliver the performance you’re after. 
  • Outdated Hardware: Older NAS models might not be equipped to handle modern data demands or the latest software.

Software Limitations:

  • Outdated Firmware/Software: Not updating to the latest firmware or software can lead to performance issues, or to missing out on optimization and security features.
  • Misconfigured Settings: Incorrect settings can impact performance. This includes improper RAID configuration or network settings. 
  • Background Processes: Certain background tasks, like indexing or backups, can also slow down the system when running.

Network Challenges: 

  • Bandwidth Limitations: A slow network connection, especially on a Wi-Fi network, can limit data transfer rates. 
  • Network Traffic: High traffic on the network can cause congestion, reducing the speed at which data can be accessed or transferred.

Disk Health and Configuration:

  • Disk Failures: A failing disk in the NAS can slow down performance and also poses a data loss risk.
  • Suboptimal RAID Configuration: Some RAID configurations prioritize redundancy over performance, which can slow data storage and access speeds. 

External Factors:

  • Simultaneous User Access: If multiple users are accessing, reading, or writing to the NAS simultaneously, it can strain the system, especially if the hardware isn’t optimized for that level of traffic. 
  • Inadequate Power Supply: Fluctuating or inadequate power can cause the NAS to malfunction or reduce its performance.
  • Operating Temperature: If the NAS is in a hot environment, it might overheat and impact the performance of the device.

Practical Solutions for Optimizing NAS Performance

Understanding the common performance issues with NAS devices is the first critical step. However, simply identifying these issues alone isn’t enough. It’s vital to understand practical ways to optimize your existing NAS setup so you can enhance its speed, efficiency, and reliability. Let’s explore how to optimize your NAS. 

Performance Enhancement 1: Upgrading Hardware

There are a few different things you can do on a hardware level to enhance NAS performance. First, adding more RAM can significantly improve performance, especially if multiple tasks or users are accessing the NAS simultaneously. 

You can also consider switching to SSDs. While they can be more expensive, SSDs offer faster read/write speeds than traditional HDDs, and they store data in flash memory, which means that they retain information even without power. 

Finally, you could upgrade the CPU. For NAS devices that support it, a more powerful CPU can better handle multiple simultaneous requests and complex tasks. 

Performance Enhancement 2: Optimizing Software Configuration

Remember to always keep your NAS operating system and software up-to-date to benefit from the latest performance optimizations and security patches. Schedule tasks like indexing, backups, or antivirus scans during off-peak hours to ensure they don’t impact user access during high-traffic times. You also need to make sure you’re using the right RAID configuration for your needs. RAID 5 or RAID 6, for example, can offer a good balance between redundancy and performance.
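To make that redundancy/capacity trade-off concrete, here’s a rough sketch of usable capacity for RAID 5 versus RAID 6 (a simplification for illustration; real arrays and vendor implementations vary):

```python
def raid_usable_capacity(num_drives: int, drive_tb: float, level: str) -> float:
    """Approximate usable capacity in TB for common parity RAID levels.

    RAID 5 spends one drive's worth of space on parity and survives one
    drive failure; RAID 6 spends two and survives two failures.
    """
    if level == "raid5":
        if num_drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (num_drives - 1) * drive_tb
    if level == "raid6":
        if num_drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (num_drives - 2) * drive_tb
    raise ValueError(f"unsupported level: {level}")

# A 4-bay NAS filled with 8TB drives:
print(raid_usable_capacity(4, 8, "raid5"))  # 24TB usable, tolerates 1 failure
print(raid_usable_capacity(4, 8, "raid6"))  # 16TB usable, tolerates 2 failures
```

RAID 6 trades usable space (and some write performance, because of the second parity calculation) for the ability to survive a second drive failure.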

Performance Enhancement 3: Network Enhancements

Consider moving to faster network protocols, like 10Gb Ethernet, or ensuring that your router and switches can handle high traffic. Wherever possible, use wired connections instead of Wi-Fi to connect to the NAS for more stable and faster data access and transfer. And, regularly review and adjust network settings for optimal performance. It also helps to limit simultaneous access where you can, managing peak loads by setting up access priorities.
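As a back-of-the-envelope illustration of why link speed matters, here’s a quick sketch (the 80% efficiency figure is an assumption to account for protocol overhead, not a measured value):

```python
def transfer_seconds(size_gb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the time to move size_gb gigabytes over a link rated at
    link_gbps gigabits per second, at an assumed real-world efficiency."""
    return size_gb * 8 / (link_gbps * efficiency)  # 8 bits per byte

# Moving a 50GB video project to the NAS:
print(f"1GbE:  ~{transfer_seconds(50, 1):.0f}s")   # roughly 8 minutes
print(f"10GbE: ~{transfer_seconds(50, 10):.0f}s")  # under a minute
```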

Performance Enhancement 4: Regular Maintenance

Use your NAS device’s built-in tools or third-party software to monitor the health of your disks and replace any that show signs of failure. And, keep the physical environment around your NAS device clean, cool, and well ventilated to prevent overheating. 

Leveraging the Cloud for NAS Optimization

After taking the necessary steps to optimize your NAS for improved performance and reliability, it’s worth considering leveraging the cloud to further enhance the performance. While NAS offers convenient local storage, it can sometimes fall short when it comes to scalability, accessibility from different locations, and seamless collaboration. Here’s where cloud storage comes into play. 

At its core, cloud storage is a service model in which data is maintained, managed, and backed up remotely, and made available to users over the internet. Instead of relying solely on local storage solutions such as NAS or a server, you utilize the vast infrastructure of data centers across the globe to store your data not just in one physical location, but across multiple secure and redundant environments. 

As an off-site storage solution for NAS, the cloud not only completes your 3-2-1 backup plan, but can also amplify its performance. Let’s take a look at how integrating cloud storage can help optimize your NAS.

  • Off-Loading and Archiving: One of the most straightforward approaches is to move infrequently accessed or archival data from the NAS to the cloud. This frees up space on the NAS, ensuring it runs smoothly, while optimizing the NAS by only keeping data that’s frequently accessed or essential. 
  • Caching: Some advanced NAS systems support cloud tiering, keeping frequently accessed data cached locally while the bulk of the dataset lives in the cloud. This means that the most commonly used data can be quickly retrieved, enhancing user experience and reducing the storage load on the NAS device. 
  • Redundancy and Disaster Recovery: Instead of duplicating data on multiple NAS devices for redundancy, which can be costly and still vulnerable to local disasters, the data can be backed up to the cloud. In case of NAS failure or catastrophic event, the data can be quickly restored from the cloud, ensuring minimal downtime. 
  • Remote Access and Collaboration: While NAS devices can offer remote access, integrating them with cloud storage can streamline this process, often offering a more user-friendly interface and better speeds. This is especially useful for collaborative environments where multiple users work together on files and projects. 
  • Scaling Without Hardware Constraints: As your data volume grows, expanding a NAS can involve purchasing additional drives or even new devices. With cloud integration, you can expand your storage capacity without these immediate hardware investments, eliminating or delaying the need for physical upgrades and extending the lifespan of your NAS. 

In essence, integrating cloud storage solutions with your NAS can create a comprehensive system that addresses the shortcomings of NAS devices, helping you create a hybrid setup that offers the best of both worlds: the speed and accessibility of local storage, and the flexibility and scalability of the cloud. 

Getting the Best From Your NAS

At its core, NAS offers an unparalleled convenience of localized storage. However, it’s not without challenges, especially when performance issues come into play. Addressing these challenges requires a blend of hardware optimization, software updates, and smart data management settings. 

But, it doesn’t have to stop at your local network. Cloud storage can be leveraged effectively to optimize your NAS. It doesn’t just act as a safety net by storing your NAS data off-site, it also makes collaboration easier with dispersed teams and further optimizes NAS performance. 

Now, it’s time to hear from you. Have you encountered any NAS performance issues? What measures have you taken to optimize your NAS? Share your experiences and insights in the comments below. 

The post NAS Performance Guide: How To Optimize Your NAS appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Manage Your Family’s Backups

Post Syndicated from Yev original https://www.backblaze.com/blog/groups-speeds-family-backup/

A decorative image showing faces on laptop screens.

When it comes to navigating the treacherous landscape of a household’s digital ecosystem, from smartphones and laptops to smart homes and millions of subscriptions, there often emerges a silent hero—the ever-humble, quietly toiling, underappreciated Family IT Manager. This unsung role, typically filled by a tech-savvy-est member of the family, takes on the responsibility of keeping everyone’s digital lives running smoothly. Maybe you know one of these valiant souls. Or maybe, just maybe, it’s you. 

As the Family IT Manager, having one more arrow in your quiver with which to slay the dreaded data loss dragon is always helpful. And that’s what Backblaze Groups is all about—making it easier for you to keep track of everyone’s data in one place. 

Today, we’re sharing some practical tips and tricks for using Groups to better manage your family’s backups.

Have You Checked Out v9.0?

Backblaze recently rolled out v9.0 to all Backblaze Computer Backup users. If you haven’t had a chance, you can read all about the latest version, including the new Restore App.

What Are Backblaze Groups?

Groups helps you manage the backups your family creates without having to log in and out of individual accounts. This makes it simple to keep track of everyone in one place. All the backup accounts are linked to the same credit card (they can Venmo you later), and you can even help someone else in your family create a backup or restore files easily with Groups. Need to help a family member with a computer emergency? Log in, access their most recent backup, and restore everything. Is your sibling unsure that you really added Backblaze to their computer? Log in, view their account, and get the screenshots to prove it to them (and everyone else). 

By the way, this would be a great time to give the new Restore App, included with Backblaze Computer Backup v9.0, a spin.

One point of clarification: You might see Backblaze Groups referred to as “Business Groups,” but you don’t have to be a business to use Groups. They work equally well for businesses and personal users, including Family IT Managers (and, truly, running family IT is kind of like running a business, isn’t it?).

Why Use Groups?

You can already manage multiple computers on a single Backblaze account. So why use Groups instead? Well, with Groups, each user has individual access to, and control of, their account. You—as Group administrator—manage billing and, as needed, data recovery. This is a safer, more secure method than sharing the same account credentials among several computers used by different people.

Have multiple households or groupings of folks in your life that you need to manage? You can have as many Groups as you like to help you keep track of everyone and everything, and each of those Groups can have separate billing. 

What Do I Need to Know About Setting Up Backblaze Groups for My Family?

The Groups feature streamlines the management of the accounts you need to monitor. As the Group administrator, you have total control over who’s included as part of your Group. You can send out email invitations, or alternatively, you can use a unique Group invitation link that allows anyone you share it with to easily join. 

A screenshot of a Backblaze account showing how to create a Group.
Here’s the visual of where you’d find everything in your account.

Being in a Group is entirely voluntary. Any member of a Group can leave any time they want, and Group administrators can also remove individuals from a Group at any time. 

If you dissolve your Group for some reason or if someone chooses to leave, the removed person can decide whether they want to keep using Backblaze by establishing their own payment method. Perfect for when it’s time to wean the kiddos off of your shared accounts—whether they like it or not.

One last note: while you can set up and administer more than one Group with separate billing, you can only be a member of one Group. 

Those are all the caveats, really. If you want to read more about the step-by-step instructions, check out our Help article about creating a Group.

Invite Members: The More the Merrier

Once you create a Group, you can invite members to join it. Copy the Group invite link Backblaze generates automatically for you. Give it to friends and family via email, chat, or any other means you’d like. 

A screenshot of a Backblaze account showing how to invite Group members.
We promise to send the emails. You may have to remind them to check their email.

When the person you’ve invited clicks on the link, they will be prompted to either create a Backblaze account (if they don’t have one) or log in to their existing account. After completing this step, they will be prompted to download Backblaze. If they are already using Backblaze, there is no need for a reinstallation; they will seamlessly become a part of your Group.

Once an existing user successfully joins your Group, they’ll be under your billing account. Their existing credit card will automatically receive a prorated refund for the remaining portion of their previous Backblaze license. There is no need to worry about re-uploading data—their backup remains securely stored in Backblaze.

Newcomers to Backblaze can download and install the client to initiate their initial backup process. As the Group administrator, you will have the capability to monitor their backup progress. Remember that the first backup of data may take some time, but after that, everything will run smoothly in the background. 

Go Forth and Conquer, Mighty IT Manager

We understand that being the go-to “tech person” for your family and friends can be challenging. We hope that Groups simplifies the process, making it easier for you to help keep your family’s data safe.

The post How to Manage Your Family’s Backups appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Overload to Overhaul: How We Upgraded Drive Stats Data

Post Syndicated from David Winings original https://www.backblaze.com/blog/overload-to-overhaul-how-we-upgraded-drive-stats-data/

A decorative image showing the words "overload to overhaul: how we upgraded Drive Stats data."

This year, we’re celebrating 10 years of Drive Stats. Coincidentally, we also made some upgrades to how we run our Drive Stats reports. We reported on how an attempt to migrate triggered a weeks-long recalculation of the dataset, leading us to map the architecture of the Drive Stats data. 

This follow-up article focuses on the improvements we made after we fixed the existing bug (because hey, we were already in there), and then presents some of our ideas for future improvements. Remember that those are just ideas so far—they may not be live in a month (or ever?), but consider them good food for thought, and know that we’re paying attention so that we can pass this info along to the right people.

Now, onto the fun stuff. 

Quick Refresh: Drive Stats Data Architecture

The podstats generator runs on every Storage Pod, what we call any host that holds customer data, every few minutes. It’s a C++ program that collects SMART stats and a few other attributes, then converts them into an .xml file (“podstats”). Those are then pushed to a central host in each datacenter and bundled. Once the data leaves these central hosts, it has entered the domain of what we will call Drive Stats.  

Now let’s go into a little more detail: when you’re gathering stats about drives, you’re running a set of modules with dependencies on other modules, forming a data-dependency tree. Each time a module “runs”, it takes information, modifies it, and writes it to disk. As you run each module, the data will be transformed sequentially. And, once a quarter, we run a special module that collects all the attributes for our Drive Stats reports, collecting data all the way down the tree. 

Here’s a truncated diagram of the whole system, to give you an idea of what the logic looks like:

A diagram of the mapped logic of the Drive Stats modules.
An abbreviated logic map of Drive Stats modules.

As you move down through the module layers, the logic gets more and more specialized. When you run a module, the first thing the module does is check in with the previous module to make sure the data exists and is current. It caches the data to disk at every step, and fills out the logic tree step by step. So for example, drive_stats, being a “per-day” module, will write out a file such as /data/drive_stats/2023-01-01.json.gz when it finishes processing. This lets future modules read that file to avoid repeating work.
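The cache-and-check pattern described above can be sketched in a few lines of Python (a hypothetical simplification, not Backblaze’s actual code; the file layout mirrors the /data/drive_stats example):

```python
import gzip
import json
import os
import tempfile

def run_module(name, day, compute, data_dir):
    """Run a per-day module, reusing its cached output when one exists."""
    path = os.path.join(data_dir, name, f"{day}.json.gz")
    if os.path.exists(path):  # cache hit: read the stored result
        with gzip.open(path, "rt") as f:
            return json.load(f)
    result = compute(day)  # cache miss: do the work...
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, "wt") as f:  # ...and persist it for future runs
        json.dump(result, f)
    return result

cache_dir = tempfile.mkdtemp()
calls = []
def count_drives(day):
    calls.append(day)  # track how often the expensive step actually runs
    return {"day": day, "drive_count": 241297}

first = run_module("drive_stats", "2023-01-01", count_drives, cache_dir)
second = run_module("drive_stats", "2023-01-01", count_drives, cache_dir)
print(len(calls))  # 1: the second call was served from the cache
```

The upside is exactly the deduplication described above; the downside, as the next section shows, is what happens when the cache key covers too large a range.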

This work deduplication process saves us a lot of time overall—but it also turned out to be the root cause of our weeks-long process when we were migrating Drive Stats to our new host. We fixed that by adding a version number to each module.  

While You’re There… Why Not Upgrade?

Once the dust from the bug fix had settled, we moved forward to try to modernize Drive Stats in general. Our daily report still ran quite slowly, on the order of several hours, and there was some low-hanging fruit to chase.

Waiting On You, failures_with_stats

First things first, we saved a log of a run of our daily reports in Jenkins. Then we wrote an analyzer to see which modules were taking a lot of time. failures_with_stats was our biggest offender, running for about two hours, while every other module took about 15 minutes.

An image showing runtimes for each module when running a Drive Stats report.
Not quite two hours.

Upon investigation, the time cost had to do with how the date_range module works. This takes us back to caching: our module checks if the file has been written already, and if it has, it uses the cached file. However, a date range is written to a single file. That is, Drive Stats will recognize “Monday to Wednesday” as distinct from “Monday to Thursday” and re-calculate the entire range. This is a problem for a workload that is essentially doing work for all of time, every day.  

On top of this, the raw Drive Stats data, which is a dependency for failures_with_stats, would be gzipped onto a disk. When each new query triggered a request to recalculate all-time data, each dependency would pick up the podstats file from disk, decompress it, read it into memory, and do that for every day of all time. We were picking up and processing our biggest files every day, and time continued to make that cost larger.

Our solution was what I called the “Date Range Accumulator.” It works as follows:

  • If we have a date range like “all of time as of yesterday” (or any partial range with the same start), consider it as a starting point.
  • Make sure that the version numbers don’t consider our starting point to be too old.
  • Do the processing of today’s data on top of our starting point to create “all of time as of today.”

To do this, we read the directory of the date range accumulator, find the “latest” valid one, and use that to determine the delta (change) to our current date. Basically, the module says: “The last time I ran this was on data from the beginning of time to Thursday. It’s now Friday. I need to run the process for Friday, and then add that to the compiled all-time.” And, before it does that, it double checks the version number to avoid errors. (As we noted in our previous article, if it doesn’t see the correct version number, instead of inefficiently running all data, it just tells you there is a version number discrepancy.) 
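In simplified Python, the accumulator logic might look something like this (hypothetical names and structures; the real module handles exceptions and versioning in more detail):

```python
from datetime import date, timedelta

def days_between(start, end):
    """Yield each day after `start` up to and including `end`."""
    d = start + timedelta(days=1)
    while d <= end:
        yield d
        d += timedelta(days=1)

def accumulate(prior_ranges, today, version, process_day):
    """Extend the newest valid "all of time" accumulation instead of
    recomputing every day from scratch.

    prior_ranges: {end_date: (version, accumulated_data)}
    process_day:  computes a single day's results
    """
    # Find the latest prior accumulation with a matching version number.
    starts = [d for d, (v, _) in prior_ranges.items() if v == version and d < today]
    if not starts:
        raise RuntimeError("no valid starting point; version discrepancy?")
    start = max(starts)
    accumulated = dict(prior_ranges[start][1])
    # Only the delta between the starting point and today gets processed.
    for day in days_between(start, today):
        accumulated[day] = process_day(day)
    return accumulated

# "The last time I ran this was on data through Thursday. It's now Friday."
thursday, friday = date(2023, 6, 1), date(2023, 6, 2)
prior = {thursday: (3, {thursday: "all-time data through Thursday"})}
out = accumulate(prior, friday, version=3, process_day=lambda d: f"stats for {d}")
print(sorted(out))  # only Friday was freshly processed
```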

The code is also a bit finicky—there are lots of snags when it comes to things like defining exceptions, such as if we took a drive out of the fleet, but it wasn’t a true failure. The module also needed to be processable day by day to be usable with this technique.

Still, even with all the tweaks, it’s massively better from a runtime perspective for eligible candidates. Here’s our new failures_with_stats runtime: 

An output of module runtime after the Drive Stats improvements were made.
Ahh, sweet victory.

Note that in this example, we’re running that 60-day report. The daily report is quite a bit quicker. But, at least the 60-day report is a fixed amount of time (as compared with the all-time dataset, which is continually growing). 

Code Upgrade to Python 3

Next, we converted our code to Python 3. (Shout out to our intern, Anath, who did amazing work on this part of the project!) We didn’t make this improvement just to make it; no, we did this because I wanted faster JSON processors, and a lot of the more advanced ones did not work with Python 2. When we looked at the time each module took to process, most of that was spent serializing and deserializing JSON.

What Is JSON Parsing?

JSON is an open standard file format that uses human readable text to store and transmit data objects. Many modern programming languages include code to generate and parse JSON-format data. Here’s how you might describe a person named John, aged 30, from New York using JSON: 

{
  "firstName": "John",
  "age": 30,
  "state": "New York"
}

You can express those attributes into a single line of code and define them as a native object:

x = {'firstName': 'John', 'age': 30, 'state': 'New York'}

“Parsing” is the process by which you take the JSON data and make it into an object that you can plug into another programming language. You’d write your script (program) in Python, it would parse (interpret) the JSON data, and then give you an answer. This is what that would look like: 

import json

# some JSON:
x = '''
{
  "firstName": "John",
  "age": 30,
  "state": "New York"
}
'''

# parse x:
y = json.loads(x)

# the result is a Python dictionary:
print(y["firstName"])
If you run this script, you’ll get the output “John.” If you change print(y["firstName"]) to print(y["age"]), you’ll get the output “30.” Check out this website if you want to interact with the code for yourself. In practice, the JSON would be read from a database, or a web API, or a file on disk rather than defined as a “string” (or text) in the Python code. If you are converting a lot of this JSON, small improvements in efficiency can make a big difference in how a program performs.

And Implementing UltraJSON

Upgrading to Python 3 meant we could use UltraJSON. This was approximately 50% faster than the built-in Python JSON library we used previously. 
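UltraJSON is a drop-in replacement for the standard library’s loads and dumps, which is what makes the swap cheap. A minimal sketch, with a fallback so it runs even where ujson isn’t installed:

```python
import json

try:
    import ujson as fast_json  # pip install ujson; C-accelerated drop-in
except ImportError:
    fast_json = json  # fall back to the standard library

record = {"serial": "ZA1234", "smart_5_raw": 0, "failure": 0}
decoded = fast_json.loads(fast_json.dumps(record))
print(decoded == record)  # True: same API, (potentially) much faster
```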

We also looked at the XML parsing for the podstats files, since XML parsing is often a slow process. In this case, we actually found our existing tool is pretty fast (and since we wrote it 10 years ago, that’s pretty cool). Off-the-shelf XML parsers take quite a bit longer because they care about a lot of things we don’t have to: our tool is customized for our Drive Stats needs. It’s a well-known adage that you should not parse XML with regular expressions, but if your files are, well, very regular, it can save a lot of time.
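To illustrate the idea (and only the idea: the real podstats schema is internal, so this record format is invented), pattern-matching very regular XML can be a one-liner:

```python
import re

# An invented, perfectly regular record format standing in for podstats:
podstats = """
<drive><serial>ZA1001</serial><smart_9_raw>41575</smart_9_raw></drive>
<drive><serial>ZA1002</serial><smart_9_raw>12003</smart_9_raw></drive>
"""

# Because every record is structured identically, one regex suffices;
# a general-purpose parser would do far more work for the same answer.
pattern = re.compile(r"<serial>(\w+)</serial><smart_9_raw>(\d+)</smart_9_raw>")
power_on_hours = {serial: int(raw) for serial, raw in pattern.findall(podstats)}
print(power_on_hours)  # {'ZA1001': 41575, 'ZA1002': 12003}
```

This only works because the generator and the parser are maintained together; the moment the format can vary, you want a real XML parser again.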

What Does the Future Hold?

Now that we’re working with a significantly faster processing time for our Drive Stats dataset, we’ve got some ideas about upgrades in the future. Some of these are easier to achieve than others. Here’s a sneak peek of some potential additions and changes in the future.

Data on Data

In keeping with our data-nerd ways, I got curious about how much the Drive Stats dataset is growing and if the trend is linear. We made this graph, which shows the baseline rolling average and includes a trend line that attempts a linear prediction.

A graph showing the rate at which the Drive Stats dataset has grown over time.

I envision this graph living somewhere on the Drive Stats page and being fully interactive. It’s just one graph, but this and similar tools available on our website would 1) be fun and 2) lead to some interesting insights for those who don’t dig in line by line. 

What About Changing the Data Module?

The way our current module system works, everything gets processed in a tree approach, and the outputs are flat files. If we used something like SQLite or Parquet, we’d be able to process data in a more depth-first way, and that would mean that we could open a file for one module or data range, process everything, and not have to read the file again. 

And, since one of the first things that our Drive Stats expert, Andy Klein, does with our .xml data is to convert it to SQLite, outputting it in a queryable form would save a lot of time. 
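As a sketch of what a queryable output could look like, using Python’s built-in sqlite3 (the column set and values here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in practice, a file like drive_stats.db
conn.execute(
    "CREATE TABLE drive_stats (date TEXT, serial TEXT, model TEXT, failure INTEGER)"
)
conn.executemany(
    "INSERT INTO drive_stats VALUES (?, ?, ?, ?)",
    [
        ("2023-01-01", "ZA1001", "WDC WUH722222ALE6L4", 0),
        ("2023-01-01", "ZA1002", "WDC WUH722222ALE6L4", 1),
    ],
)
# Downstream consumers run queries instead of re-reading flat files:
(failures,) = conn.execute(
    "SELECT COUNT(*) FROM drive_stats WHERE failure = 1"
).fetchone()
print(failures)  # 1
```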

We could also explore keeping the data as a less-smart filetype, but using something more compact than JSON, such as MessagePack.

Can We Improve Failure Tracking and Attribution?

One of the odd things about our Drive Stats datasets is that they don’t always and automatically agree with our internal data lake. Our Drive Stats outputs have some wonkiness that’s hard to replicate, and it’s mostly because of exceptions we build into the dataset. These exceptions aren’t when a drive fails, but rather when we’ve removed it from the fleet for some other reason, like if we were testing a drive or something along those lines. (You can see specific callouts in Drive Stats reports, if you’re interested.) It’s also where a lot of Andy’s manual work on Drive Stats data comes in each month: he’s often comparing the module’s output with data in our datacenter ticket tracker.

These tickets come from the awesome data techs working in our data centers. Each time a drive fails and they have to replace it, our techs add a reason for why it was removed from the fleet. While not all drive replacements are “failures”, adding a root cause to our Drive Stats dataset would give us more confidence in our failure reporting (and would save Andy comparing the two lists). 

The Result: Faster Drive Stats and Future Fun

These two improvements (the date range accumulator and upgrading to Python 3) resulted in hours, and maybe even days, of work saved. Even from a troubleshooting point of view, we often wouldn’t know if the process was stuck, or if this was the normal amount of time the module should take to run. Now, if it takes more than about 15 minutes to run a report, you’re sure there’s a problem. 

While the Drive Stats dataset can’t really be called “big data”, it provides a good, concrete example of scaling with your data. We’ve been collecting Drive Stats for just over 10 years now, and even though most of the code written way back when is inherently sound, small improvements that seem marginal become amplified as datasets grow. 

Now that we’ve got better documentation of how everything works, it’s going to be easier to keep Drive Stats up-to-date with the best tools and run with future improvements. Let us know in the comments what you’d be interested in seeing.

The post Overload to Overhaul: How We Upgraded Drive Stats Data appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AI 101: Do the Dollars Make Sense?

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-do-the-dollars-make-sense/

A decorative image showing a cloud reaching out with digital tentacles to stacks of dollar signs.

Welcome back to AI 101, a series dedicated to breaking down the realities of artificial intelligence (AI). Previously we’ve defined artificial intelligence, deep learning (DL), and machine learning (ML) and dove into the types of processors that make AI possible. Today we’ll talk about one of the biggest limitations of AI adoption—how much it costs. Experts have already flagged that the significant investment necessary for AI can cause antitrust concerns and that AI is driving up costs in data centers.

To that end, we’ll talk about: 

  • Factors that impact the cost of AI.
  • Some real numbers about the cost of AI components. 
  • The AI tech stack and some of the industry solutions that have been built to serve it.
  • And, uncertainty.

Defining AI: Complexity and Cost Implications

While ChatGPT, DALL-E, and the like may be the most buzz-worthy of recent advancements, AI has already been a part of our daily lives for several years now. In addition to generative AI models, examples include virtual assistants like Siri and Google Home, fraud detection algorithms in banks, facial recognition software, URL threat analysis services, and so on. 

That brings us to the first challenge when it comes to understanding the cost of AI: The type of AI you’re training—and how complex a problem you want it to solve—has a huge impact on the computing resources needed and the cost, both in the training and in the implementation phases. AI tasks are hungry in all ways: they need a lot of processing power, storage capacity, and specialized hardware. As you scale up or down in the complexity of the task you’re doing, there’s a huge range in the types of tools you need and their costs.   

To understand the cost of AI, several other factors come into play as well, including: 

  • Latency requirements: How fast does the AI need to make decisions? (e.g. that split second before a self-driving car slams on the brakes.)
  • Scope: Is the AI solving broad-based or limited questions? (e.g. the best way to organize this library vs. how many times the word “cat” appears in this article.)
  • Actual human labor: How much oversight does it need? (e.g. does a human identify the cat in cat photos, or does the AI algorithm identify them?)
  • Adding data: When, how, and in what quantity will new data need to be ingested to update information over time? 

This is by no means an exhaustive list, but it gives you an idea of the considerations that can affect the kind of AI you’re building and, thus, what it might cost.

The Big Three AI Cost Drivers: Hardware, Storage, and Processing Power

In simple terms, you can break down the cost of running an AI into a few main components: hardware, storage, and processing power. That’s a little bit simplistic, and you’ll see some of these lines blur and expand as we get into the details of each category. But, for our purposes today, this is a good place to start to understand how much it costs to ask a bot to create a squirrel holding a cool guitar.

An AI generated image of a squirrel holding a guitar. Both the squirrel and the guitar are warped in strange, but not immediately noticeable, ways.
Still not quite there on the guitar. Or the squirrel. How much could this really cost?

First Things First: Hardware Costs

Running an AI takes specialized processors that can handle complex processing queries. We’re early in the game when it comes to picking a “winner” for specialized processors, but these days, the most common processor is a graphical processing unit (GPU), with Nvidia’s hardware and platform as an industry favorite and front-runner. 

The most common “workhorse chip” of AI processing tasks, the Nvidia A100, starts at about $10,000 per chip, and a set of eight of the most advanced processing chips can cost about $300,000. When Elon Musk wanted to invest in his generative AI project, he reportedly bought 10,000 GPUs, which equates to an estimated value in the tens of millions of dollars. He’s gone on record as saying that AI chips can be harder to get than drugs.

Google offers folks the ability to rent their TPUs through the cloud starting at $1.20 per chip hour for on-demand service (less if you commit to a contract). Meanwhile, Intel released a sub-$100 USB stick with a full NPU that can plug into your personal laptop, and folks have created their own models at home with the help of open sourced developer toolkits. Here’s a guide to using them if you want to get in the game yourself. 
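To put the rental rate in perspective, a quick back-of-the-envelope calculation (the workload size is hypothetical; only the $1.20 per chip-hour on-demand rate comes from the text above):

```python
def rental_cost(chips: int, hours: float, rate_per_chip_hour: float = 1.20) -> float:
    """Estimate on-demand accelerator rental cost in dollars."""
    return chips * hours * rate_per_chip_hour

# A hypothetical week-long training run on 64 rented chips:
print(f"${rental_cost(64, 24 * 7):,.2f}")  # $12,902.40
```

Committed-use contracts lower the per-chip-hour rate, which is one reason the scale and duration of the workload drives the buy-versus-rent decision.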

Clearly, the spectrum for chips is vast—from under $100 to millions—and the landscape for chip producers is changing often, as is the strategy for monetizing those chips—which leads us to our next section. 

Using Third Parties: Specialized Problems = Specialized Service Providers

Building AI is a challenge with so many moving parts that, in a business use case, you eventually confront the question of whether it’s more efficient to outsource it. It’s true of storage, and it’s definitely true of AI processing. You can already see one way Google answered that question above: create a network populated by their TPUs, then sell access.   

Other companies specialize in broader or narrower parts of the AI creation and processing chain. To name a few diverse examples: Hugging Face, Inflection AI, CoreWeave, and Vultr. Their product offerings span a wide range, from open source communities like Hugging Face, which provides a menu of models, datasets, no-code tools, and (frankly) rad developer experiments, to bare metal servers like Vultr that expand your compute resources. How resources are offered also exists on a spectrum, from proprietary company resources (i.e., Nvidia's platform), to open source communities (looking at you, Hugging Face), to a mix of the two.

An AI generated comic showing various iterations of data storage superheroes.
A comic generated on Hugging Face’s AI Comic Factory.

This means that, whichever piece of the AI tech stack you’re considering, you have a high degree of flexibility when you’re deciding where and how much you want to customize and where and how to implement an out-of-the box solution. 

Ballparking an estimate of what any of that costs would be so dependent on the particular model you want to build and the third-party solutions you choose that it doesn't make sense to do so here. But suffice it to say that there's a pretty narrow field of folks who have the infrastructure capacity, the datasets, and the business need to create their own network. Usually it comes back to some combination of the following: whether you have existing infrastructure to leverage or are building from scratch, whether you're going to sell the solution to others, what control over research or datasets you have or want, how important privacy is and how you're incorporating it into your products, how fast you need the model to make decisions, and so on.

Welcome to the Spotlight, Storage

And, hey, with all that, let's not forget storage. At the most basic level of consideration, AI uses a ton of data. How much? Conventional wisdom says you need at least an order of magnitude more training examples than the model has parameters. That means you want 10 times more examples than parameters.

Parameters and Hyperparameters

The easiest way to think of parameters is to think of them as factors that control how an AI makes a decision. More parameters = more accuracy. And, just like our other AI terms, this one can be somewhat inconsistently applied. Here's what ChatGPT has to say for itself:

A screenshot of a conversation with ChatGPT where it tells us it has 175 billion parameters.

That 10x number is just the amount of data you store for the initial training model—clearly the thing learns and grows, because we’re talking about AI. 
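As a sketch of that rule of thumb (the 10x multiplier is just the heuristic above, and the parameter count is illustrative):

```python
def min_training_examples(num_parameters: int, multiplier: int = 10) -> int:
    """Rule of thumb: at least ~10x more training examples than model parameters."""
    return num_parameters * multiplier

# A GPT-3-scale model with 175 billion parameters, per the screenshot above
print(min_training_examples(175_000_000_000))  # 1750000000000 (1.75 trillion examples)
```

The heuristic is rough, but it explains why dataset size, and therefore storage, scales with model size.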

Preserving both your initial training algorithm and your datasets can be incredibly useful, too. As we talked about before, the more complex an AI, the higher the likelihood that your model will surprise you. And, as many folks have pointed out, deciding whether to leverage an already-trained model or to build your own doesn’t have to be an either/or—oftentimes the best option is to fine-tune an existing model to your narrower purpose. In both cases, having your original training model stored can help you roll back and identify the changes over time. 

The size of the dataset absolutely affects costs and processing times. The best example is that ChatGPT, everyone’s favorite model, has been rocking GPT-3 (or 3.5) instead of GPT-4 on the general public release because GPT-4, which works from a much larger, updated dataset than GPT-3, is too expensive to release to the wider public. It also returns results much more slowly than GPT-3.5, which means that our current love of instantaneous search results and image generation would need an adjustment. 

And all of that is true because GPT-4 was updated with more information (by volume), more up-to-date information, and the model was given more parameters to take into account for responses. So, it has to both access more data per query and use more complex reasoning to make decisions. That said, it also reportedly has much better results.

Storage and Cost

What are the real numbers to store, say, a primary copy of an AI dataset? Well, it's hard to estimate, but we can ballpark that, if you're training a large AI model, you're going to have at a minimum tens of gigabytes of data and, at a maximum, petabytes. OpenAI considers the size of its training database proprietary information, and we've found sources that cite that number as anywhere from 17GB to 570GB to 45TB of text data.

That’s not actually a ton of data, and, even taking the highest number, it would only cost $225 per month to store that data in Backblaze B2 (45TB * $5/TB/mo), for argument’s sake. But let’s say you’re training an AI on video to, say, make a robot vacuum that can navigate your room or recognize and identify human movement. Your training dataset could easily reach into petabyte scale (for reference, one petabyte would cost $5,000 per month in Backblaze B2). Some research shows that dataset size is trending up over time, though other folks point out that bigger is not always better.
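The arithmetic here is simple enough to sketch, using the Backblaze B2 rate quoted above (your dataset size will vary):

```python
def monthly_storage_cost(terabytes: float, price_per_tb_month: float = 5.0) -> float:
    """Flat-rate monthly storage cost, at the $5/TB/month figure from the text."""
    return terabytes * price_per_tb_month

print(monthly_storage_cost(45))     # 225.0  -> the 45TB text corpus
print(monthly_storage_cost(1_000))  # 5000.0 -> a one-petabyte video dataset
```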

On the other hand, if you’re the guy with the Intel Neural Compute stick we mentioned above and a Raspberry Pi, you’re talking the cost of the ~$100 AI processor, ~$50 for the Raspberry Pi, and any incidentals. You can choose to add external hard drives, network attached storage (NAS) devices, or even servers as you scale up.

Storage and Speed

Keep in mind that, in the above example, we're only considering the cost of storing the primary dataset, and that's not very accurate when thinking about how you'd be using your dataset. You'd also have to consider temporary storage for when you're actually training the AI, as your primary dataset is transformed by your AI algorithm. And you're nearly always splitting your primary dataset into discrete parts and feeding those to your AI algorithm in stages, so each of those subsets would also be stored separately. In addition to needing a lot of storage, where you physically locate that storage makes a huge difference to how quickly tasks can be accomplished. In many cases, the difference is a matter of seconds, but some tasks just can't handle that delay—think of self-driving cars. 

For huge data ingest periods such as training, you’re often talking about a compute process that’s assisted by powerful, and often specialized, supercomputers, with repeated passes over the same dataset. Having your data physically close to those supercomputers saves you huge amounts of time, which is pretty incredible when you consider that it breaks down to as little as milliseconds per task.

One way this problem is being solved is via caching, or creating temporary storage on the same chips (or motherboards) as the processor completing the task. Another solution is to keep the whole processing and storage cluster on-premises (at least while training), as you can see in the Microsoft-OpenAI setup or as you’ll often see in universities. And, unsurprisingly, you’ll also see edge computing solutions which endeavor to locate data physically close to the end user. 

While there can be benefits to on-premises or co-located storage, having a way to quickly add more storage (and release it if no longer needed), means cloud storage is a powerful tool for a holistic AI storage architecture—and can help control costs. 

And, as always, effective backup strategies require at least one off-site storage copy, and the easiest way to achieve that is via cloud storage. So, any way you slice it, you’re likely going to have cloud storage touch some part of your AI tech stack. 

What Hardware, Processing, and Storage Have in Common: You Have to Power Them

Here’s the short version: any time you add complex compute + large amounts of data, you’re talking about a ton of money and a ton of power to keep everything running. 

A disorganized set of power cords and switches plugged into what is decidedly too small of an outlet space.
Just flip the switch, and you have AI. Source.

Fortunately for us, other folks have done the work of figuring out how much this all costs. This excellent article from SemiAnalysis goes deep on the total cost of powering searches and running generative AI models. The Washington Post cites Dylan Patel (also of SemiAnalysis) as estimating that a single chat with ChatGPT could cost up to 1,000 times as much as a simple Google search. Those costs include everything we’ve talked about above—the capital expenditures, data storage, and processing. 

Consider this: Google spent several years putting off publicizing a frank accounting of their power usage. When they released numbers in 2011, they said that they used enough electricity to power 200,000 homes. And that was in 2011. There are widely varying claims for how much a single search costs, but even the most conservative say 0.03 Wh of energy. There are approximately 8.5 billion Google searches per day. (That's just an incremental cost, by the way—as in, how much a single search costs in extra resources on top of how much the system that powers it costs.) 
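Using those two figures, here's what the incremental search energy works out to per day (both inputs are the estimates cited above, not measured values):

```python
WH_PER_SEARCH = 0.03       # conservative incremental energy per search, from the text
SEARCHES_PER_DAY = 8.5e9   # approximate daily Google searches, from the text

daily_wh = WH_PER_SEARCH * SEARCHES_PER_DAY
daily_mwh = daily_wh / 1_000_000  # 1 MWh = 1,000,000 Wh
print(f"{daily_mwh:,.0f} MWh per day")  # 255 MWh per day, incremental cost alone
```

That's a substantial amount of power for the cheapest kind of query; generative AI queries multiply it considerably.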

Power is a huge cost in operating data centers, even when you’re only talking about pure storage. One of the biggest single expenses that affects power usage is cooling systems. With high-compute workloads, and particularly with GPUs, the amount of work the processor is doing generates a ton more heat—which means more money in cooling costs, and more power consumed. 

So, to Sum Up

When we’re talking about how much an AI costs, it’s not just about any single line item cost. If you decide to build and run your own models on-premises, you’re talking about huge capital expenditure and ongoing costs in data centers with high compute loads. If you want to build and train a model on your own USB stick and personal computer, that’s a different set of cost concerns. 

And, if you’re talking about querying a generative AI from the comfort of your own computer, you’re still using a comparatively high amount of power somewhere down the line. We may spread that power cost across our national and international infrastructures, but it’s important to remember that it’s coming from somewhere—and that the bill comes due, somewhere along the way. 

The post AI 101: Do the Dollars Make Sense? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The SSD Edition: 2023 Drive Stats Mid-Year Review

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/ssd-edition-2023-mid-year-drive-stats-review/

A decorative image displaying the title 2023 Mid-Year Report Drive Stats SSD Edition.

Welcome to the 2023 Mid-Year SSD Edition of the Backblaze Drive Stats review. This report is based on data from the solid state drives (SSDs) we use as storage server boot drives on our Backblaze Cloud Storage platform. In this environment, the drives do much more than boot the storage servers. They also store log files and temporary files produced by the storage server. Each day a boot drive will read, write, and delete files depending on the activity of the storage server itself.

We will review the quarterly and lifetime failure rates for these drives, and along the way we’ll offer observations and insights to the data presented. In addition, we’ll take a first look at the average age at which our SSDs fail, and examine how well SSD failure rates fit the ubiquitous bathtub curve.

Mid-Year SSD Results by Quarter

As of June 30, 2023, there were 3,144 SSDs in our storage servers. This compares to the 2,558 SSDs we reported in our 2022 SSD annual report. We'll start by presenting and discussing the quarterly data from each of the last two quarters (Q1 2023 and Q2 2023).

Notes and Observations

Data is by quarter: The data used in each table is specific to that quarter. That is, the number of drive failures and drive days are inclusive of the specified quarter, Q1 or Q2. The drive counts are as of the last day of each quarter.

Drives added: Since our last SSD report, ending in Q4 2022, we added 238 SSD drives to our collection. Of that total, the Crucial (model: CT250MX500SSD1) led the way with 110 new drives added, followed by 62 new WDC drives (model: WD Blue SA510 2.5) and 44 Seagate drives (model: ZA250NM1000).

Really high annualized failure rates (AFR): Some of the failure rates, that is, the AFRs, seem crazy high. How could the Seagate model SSDSCKKB240GZR have an annualized failure rate over 800%? In that case, in Q1, we started with two drives and one failed shortly after being installed. Hence, the high AFR. In Q2, the remaining drive did not fail and the AFR was 0%. Which AFR is useful? In this case, neither; we just don't have enough data to get decent results. For any given drive model, we like to see at least 100 drives and 10,000 drive days in a given quarter as a minimum before we begin to consider the calculated AFR to be "reasonable." We include all of the drive models for completeness, so keep an eye on drive count and drive days before you look at the AFR with a critical eye.
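For context, the AFR figures in these tables follow the formula Backblaze has described in past Drive Stats posts: failures per drive-year of operation, expressed as a percentage. A quick sketch shows how a tiny sample produces an 800%+ AFR (the two-drive numbers below are hypothetical):

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """AFR as used in Drive Stats: failures per drive-year of service, as a percent."""
    drive_years = drive_days / 365
    return failures / drive_years * 100

# Hypothetical: two drives in service for roughly three weeks each, one failure
print(round(annualized_failure_rate(1, 45), 1))  # 811.1 -- tiny samples give wild AFRs
```

With so few drive days in the denominator, a single early failure annualizes to an enormous rate, which is exactly why the 100-drive / 10,000-drive-day threshold matters.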

Quarterly Annualized Failure Rates Over Time

The data in any given quarter can be volatile with factors like drive age and the randomness of failures factoring in to skew the AFR up or down. For Q1, the AFR was 0.96% and, for Q2, the AFR was 1.05%. The chart below shows how these quarterly failure rates relate to previous quarters over the last three years.

As you can see, the AFR fluctuates between 0.36% and 1.72%, so what's the value of quarterly rates? Well, they are useful as the proverbial canary in a coal mine. For example, the AFR in Q1 2021 (0.58%) jumped to 1.51% in Q2 2021, then to 1.72% in Q3 2021. A subsequent investigation showed one drive model was the primary cause of the rise, and that model was removed from service. 

It happens from time to time that a given drive model is not compatible with our environment, and we will moderate or even remove that drive’s effect on the system as a whole. While not as critical as data drives in managing our system’s durability, we still need to keep boot drives in operation to collect the drive/server/vault data they capture each day. 

How Backblaze Uses the Data Internally

As you’ve seen in our SSD and HDD Drive Stats reports, we produce quarterly, annual, and lifetime charts and tables based on the data we collect. What you don’t see is that every day we produce similar charts and tables for internal consumption. While typically we produce one chart for each drive model, in the example below we’ve combined several SSD models into one chart. 

The “Recent” period we use internally is 60 days. This differs from our public facing reports which are quarterly. In either case, charts like the one above allow us to quickly see trends requiring further investigation. For example, in our chart above, the recent results of the Micron SSDs indicate a deeper dive into the data behind the charts might be necessary.

By collecting, storing, and constantly analyzing the Drive Stats data we can be proactive in maintaining our durability and availability goals. Without our Drive Stats data, we would be inclined to over-provision our systems as we would be blind to the randomness of drive failures which would directly impact those goals.

A First Look at More SSD Stats

Over the years in our quarterly Hard Drive Stats reports, we’ve examined additional metrics beyond quarterly and lifetime failure rates. Many of these metrics can be applied to SSDs as well. Below we’ll take a first look at two of these: the average age of failure for SSDs and how well SSD failures correspond to the bathtub curve. In both cases, the datasets are small, but are a good starting point as the number of SSDs we monitor continues to increase.

The Average Age of Failure for SSDs

Previously, we calculated the average age at which a hard drive in our system fails. In our initial calculations that turned out to be about two years and seven months. That was a good baseline, but further analysis was required as many of the drive models used in the calculations were still in service and hence some number of them could fail, potentially affecting the average.

We are going to apply the same calculations to our collection of failed SSDs and establish a baseline we can work from going forward. Our first step was to determine the SMART_9_RAW value (power-on-hours or POH) for the 63 failed SSD drives we have to date. That’s not a great dataset size, but it gave us a starting point. Once we collected that information, we computed that the average age of failure for our collection of failed SSDs is 14 months. Given that the average age of the entire fleet of our SSDs is just 25 months, what should we expect to happen as the average age of the SSDs still in operation increases? The table below looks at three drive models which have a reasonable amount of data.

MFG     Model            Good Drives: Count / Avg Age    Failed Drives: Count / Avg Age
Crucial CT250MX500SSD1   598 / 11 months                 9 / 7 months
Seagate ZA250CM10003     1,114 / 28 months               14 / 11 months
Seagate ZA250CM10002     547 / 40 months                 17 / 25 months

As we can see in the table, the average age of the failed drives increases as the average age of drives in operation (good drives) increases. In other words, it is reasonable to expect that the average age of SSD failures will increase as the entire fleet gets older.
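The underlying conversion is straightforward: SMART attribute 9 reports power-on hours, which can be translated into an approximate age in months. A small sketch (the POH values below are made up for illustration):

```python
HOURS_PER_MONTH = 730  # ~365.25 days * 24 hours / 12 months

def poh_to_months(power_on_hours: float) -> float:
    """Convert SMART attribute 9 (power-on hours) to an approximate age in months."""
    return power_on_hours / HOURS_PER_MONTH

# Hypothetical failed-drive POH readings; the mean is the "average age of failure"
failed_poh = [9_500, 10_200, 11_000]
avg_months = sum(poh_to_months(h) for h in failed_poh) / len(failed_poh)
print(round(avg_months, 1))  # 14.0
```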

Is There a Bathtub Curve for SSD Failures?

Previously we’ve graphed our hard drive failures over time to determine their fit to the classic bathtub curve used in reliability engineering. Below, we used our SSD data to determine how well our SSD failures fit the bathtub curve.

While the actual curve (blue line) produced by the SSD failures over each quarter is a bit “lumpy”, the trend line (second order polynomial) does have a definite bathtub curve look to it. The trend line is about a 70% match to the data, so we can’t be too confident of the curve at this point, but for the limited amount of data we have, it is surprising to see how the occurrences of SSD failures are on a path to conform to the tried-and-true bathtub curve.

SSD Lifetime Annualized Failure Rates

As of June 30, 2023, there were 3,144 SSDs in our storage servers. The table below is based on the lifetime data for the drive models which were active as of the end of Q2 2023.

Notes and Observations

Lifetime AFR: The lifetime data is cumulative from Q4 2018 through Q2 2023. For this period, the lifetime AFR for all of our SSDs was 0.90%. That was up slightly from 0.89% at the end of Q4 2022, but down from a year ago, Q2 2022, at 1.08%.

High failure rates?: As we noted with the quarterly stats, we like to have at least 100 drives and over 10,000 drive days to give us some level of confidence in the AFR numbers. If we apply that metric to our lifetime data, we get the following table.

Applying our modest criteria to the list eliminated those drive models with crazy high failure rates. This is not a statistics trick; we just removed those models which did not have enough data to make the calculated AFR reliable. It is possible the drive models we removed will continue to have high failure rates. It is also just as likely their failure rates will fall into a more normal range. If this technique seems a bit blunt to you, then confidence intervals may be what you are looking for.

Confidence intervals: In general, the more data you have and the more consistent that data is, the more confident you are in the predictions based on that data. We calculate confidence intervals at 95% certainty. 

For SSDs, we like to see a confidence interval of 1.0% or less between the low and the high values before we are comfortable with the calculated AFR. If we apply this metric to our lifetime SSD data we get the following table.

This doesn’t mean the failure rates for the drive models with a confidence interval greater than 1.0% are wrong; it just means we’d like to get more data to be sure. 
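To make the interval idea concrete, here's a back-of-the-envelope sketch using a normal approximation for failure counts. This illustrates the concept only; it is not necessarily the exact method used to produce the tables in this report:

```python
import math

def afr_confidence_interval(failures: int, drive_days: float, z: float = 1.96):
    """Approximate 95% interval for AFR via a normal approximation to the failure count."""
    drive_years = drive_days / 365
    afr = failures / drive_years * 100
    margin = z * math.sqrt(failures) / drive_years * 100
    return max(afr - margin, 0.0), afr + margin

# 10 failures over 1,000 drive-years: the AFR is 1.0%, but the interval is wide
low, high = afr_confidence_interval(10, 365_000)
print(round(low, 2), round(high, 2))  # 0.38 1.62
```

With the low and high values spanning more than 1.0%, a model like this hypothetical one would land in the "we'd like more data" bucket.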

Regardless of the technique you use, both are meant to help clarify the data presented in the tables throughout this report.

The SSD Stats Data

The data collected and analyzed for this review is available on our Drive Stats Data page. You’ll find SSD and HDD data in the same files and you’ll have to use the model number to locate the drives you want, as there is no field to designate a drive as SSD or HDD. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone—it is free.

Good luck and let us know if you find anything interesting.

The post The SSD Edition: 2023 Drive Stats Mid-Year Review appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Big Performance Improvements in Rclone 1.64.0, but Should You Upgrade?

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/big-performance-improvements-in-rclone-1-64-0-but-should-you-upgrade/

A decorative image showing a diagram about multithreading, as well as the Rclone and Backblaze logos.

Rclone is an open source, command line tool for file management, and it’s widely used to copy data between local storage and an array of cloud storage services, including Backblaze B2 Cloud Storage. Rclone has had a long association with Backblaze—support for Backblaze B2 was added back in January 2016, just two months before we opened Backblaze B2’s public beta, and five months before the official launch—and it’s become an indispensable tool for many Backblaze B2 customers. 

Rclone v1.64.0, released last week, includes a new implementation of multithreaded data transfers, promising much faster data transfer of large files between cloud storage services. 

Does it deliver? Should you upgrade? Read on to find out!

Multithreading to Boost File Transfer Performance

Something of a Swiss Army Knife for cloud storage, rclone can copy files, synchronize directories, and even mount remote storage as a local filesystem. Previous versions of rclone were able to take advantage of multithreading to accelerate the transfer of “large” files (by default at least 256MB), but the benefits were limited. 

When transferring files from a storage system to Backblaze B2, rclone would read chunks of the file into memory in a single reader thread, starting a set of multiple writer threads to simultaneously write those chunks to Backblaze B2. When the source storage was a local disk (the common case) as opposed to remote storage such as Backblaze B2, this worked really well—the operation of moving files from local disk to Backblaze B2 was quite fast. However, when the source was another remote storage—say, transferring from Amazon S3 to Backblaze B2, or even Backblaze B2 to Backblaze B2—data chunks were read into memory by that single reader thread at about the same rate as they could be written to the destination, meaning that all but one of the writer threads were idle.

What’s the Big Deal About Rclone v1.64.0?

Rclone v1.64.0 completely refactors multithreaded transfers. Now rclone starts a single set of threads, each of which both reads a chunk of data from the source service into memory, and then writes that chunk to the destination service, iterating through a subset of chunks until the transfer is complete. The threads transfer their chunks of data in parallel, and each transfer is independent of the others. This architecture is both simpler and much, much faster.
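In pseudocode terms, the new architecture looks something like the following Python sketch. Rclone itself is written in Go, and real transfers move chunks over the network; this just illustrates the idea of independent threads that each read and write their own chunks:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes per chunk; tiny on purpose, purely for illustration

def transfer_chunk(source: bytes, dest: bytearray, index: int) -> None:
    """One worker: read a chunk from the 'source service', write it to the 'destination'."""
    start = index * CHUNK_SIZE
    chunk = source[start:start + CHUNK_SIZE]  # read step
    dest[start:start + len(chunk)] = chunk    # write step

source = b"0123456789abcdef"
dest = bytearray(len(source))
num_chunks = (len(source) + CHUNK_SIZE - 1) // CHUNK_SIZE

# Each thread both reads and writes its own chunks, independently of the others
with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(num_chunks):
        pool.submit(transfer_chunk, source, dest, i)

print(bytes(dest) == source)  # True
```

Because no thread waits on a shared reader, every worker stays busy, which is where the speedup comes from.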

Show Me the Numbers!

How much faster? I spun up a virtual machine (VM) via our compute partner, Vultr, and downloaded both rclone v1.64.0 and the preceding version, v1.63.1. As a quick test, I used Rclone’s copyto command to copy 1GB and 10GB files from Amazon S3 to Backblaze B2, like this:

rclone --no-check-dest copyto s3remote:my-s3-bucket/1gigabyte-test-file b2remote:my-b2-bucket/1gigabyte-test-file

Note that I made no attempt to "tune" rclone for my environment by setting the chunk size or number of threads. I was interested in the out-of-the-box performance. I used the --no-check-dest flag so that rclone would overwrite the destination file each time, rather than detecting that the files were the same and skipping the copy.

I ran each copyto operation three times, then calculated the average time. Here are the results; all times are in seconds:

Rclone version 1GB 10GB
1.63.1 52.87 725.04
1.64.0 18.64 240.45

As you can see, the difference is significant! The new rclone transferred both files around three times faster than the previous version.
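Computing the speedup from the averages above:

```python
# Average copy times in seconds, from the table above
times = {
    "1.63.1": {"1GB": 52.87, "10GB": 725.04},
    "1.64.0": {"1GB": 18.64, "10GB": 240.45},
}

for size in ("1GB", "10GB"):
    speedup = times["1.63.1"][size] / times["1.64.0"][size]
    print(f"{size}: {speedup:.1f}x faster in v1.64.0")
```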

So, copying individual large files is much faster with the latest version of rclone. How about migrating a whole bucket containing a variety of file sizes from Amazon S3 to Backblaze B2, which is a more typical operation for a new Backblaze customer? I used rclone’s copy command to transfer the contents of an Amazon S3 bucket—2.8GB of data, comprising 35 files ranging in size from 990 bytes to 412MB—to a Backblaze B2 Bucket:

rclone --fast-list --no-check-dest copy s3remote:my-s3-bucket b2remote:my-b2-bucket

Much to my dismay, this command failed, returning errors related to the files being corrupted in transfer, for example:

2023/09/18 16:00:37 ERROR : tpcds-benchmark/catalog_sales/20221122_161347_00795_djagr_3a042953-d0a2-4b8d-8c4e-6a88df245253: corrupted on transfer: sizes differ 244695498 vs 0

Rclone was reporting that the transferred files in the destination bucket contained zero bytes, and deleting them to avoid the use of corrupt data.

After some investigation, I discovered that the files were actually being transferred successfully, but a bug in rclone 1.64.0 caused the app to incorrectly interpret some successful transfers as corrupted, and thus delete the transferred file from the destination. 

I was able to use the --ignore-size flag to work around the bug by disabling the file size check so I could continue with my testing:

rclone --fast-list --no-check-dest --ignore-size copy s3remote:my-s3-bucket b2remote:my-b2-bucket

A Word of Caution to Control Your Transaction Fees

Note the use of the --fast-list flag. By default, rclone’s method of reading the contents of cloud storage buckets minimizes memory usage at the expense of making a “list files” call for every subdirectory being processed. Backblaze B2’s list files API, b2_list_file_names, is a class C transaction, priced at $0.004 per 1,000 with 2,500 free per day. This doesn’t sound like a lot of money, but using rclone with large file hierarchies can generate a huge number of transactions. Backblaze B2 customers have either hit their configured caps or incurred significant transaction charges on their account when using rclone without the --fast-list flag.
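Here's what that pricing looks like in practice. The one-million-call figure below is a hypothetical deep directory hierarchy, not a measured workload:

```python
def class_c_cost(transactions: int,
                 free_per_day: int = 2_500,
                 price_per_thousand: float = 0.004) -> float:
    """Daily cost of Backblaze B2 class C calls (e.g., list files) at the rates in the text."""
    billable = max(transactions - free_per_day, 0)
    return billable / 1_000 * price_per_thousand

# One "list files" call per subdirectory across a 1,000,000-directory tree
print(round(class_c_cost(1_000_000), 2))  # 3.99 -> ~$3.99/day, every day the sync runs
```

A few dollars a day is easy to overlook until a scheduled sync runs it up over weeks or months, which is exactly the scenario --fast-list avoids.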

We recommend you always use --fast-list with rclone if at all possible. You can set an environment variable so you don't have to include the flag in every command:

export RCLONE_FAST_LIST=true
Again, I performed the copy operation three times, and averaged the results:

Rclone version 2.8GB tree
1.63.1 56.92
1.64.0 42.47

Since the bucket contains both large and small files, we see a lesser, but still significant, improvement in performance with rclone v1.64.0—it’s about 33% faster than the previous version with this set of files.

So, Should I Upgrade to the Latest Rclone?

As outlined above, rclone v1.64.0 contains a bug that can cause copy (and presumably also sync) operations to fail. If you want to upgrade to v1.64.0 now, you’ll have to use the --ignore-size workaround. If you don’t want to use the workaround, it’s probably best to hold off until rclone releases v1.64.1, when the bug fix will likely be deployed—I’ll come back and update this blog entry when I’ve tested it!

The post Big Performance Improvements in Rclone 1.64.0, but Should You Upgrade? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.