Tag Archives: B2Cloud

Kubernetes Data Protection: How to Safeguard Your Containerized Applications

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/kubernetes-data-protection-how-to-safeguard-your-containerized-applications/

A decorative image showing the Kubernetes and Backblaze logos.

Kubernetes, originally embraced by DevOps teams for its seamless application deployment, has become the go-to for businesses looking to deploy and manage applications at scale. This powerful container orchestration platform brings many benefits, but it’s not without risks: data loss, misconfigurations, and system failures can happen when you least expect them.

That’s why implementing a comprehensive backup strategy is essential for protecting against potential failures that could result in significant downtime, end-user dissatisfaction, and financial losses. However, backing up Kubernetes can be challenging. The environment’s dynamic nature, with containers constantly being created and destroyed, presents a unique set of challenges. 

Traditional backup solutions might falter in the face of Kubernetes’s complexities, highlighting the need for specialized approaches to back up the data and the state of your containerized applications. In this guide, let’s explore how to effectively back up and protect your Kubernetes environments against a wide range of threats, from misconfigurations to ransomware.

Understanding Kubernetes Architecture

Kubernetes has a fairly straightforward architecture that is designed to automate the deployment, scaling, and management of application containers across a cluster of hosts. Understanding this architecture is essential not only for deploying and managing applications, but also for implementing effective security measures. Here’s a breakdown of Kubernetes’s hierarchical components and concepts.

A chart describing cluster and node organizations within Kubernetes.

Containers: The Foundation of Kubernetes

Containers are lightweight, virtualized environments designed to run application code. They encapsulate an application’s code, libraries, and dependencies into a single object. This makes containerized applications easy to deploy, scale, and manage across different environments.

Pods: The Smallest Deployable Units

Pods are often described as logical hosts that can contain one or multiple containers that share storage, network, and specifications on how to run the containers. They are ephemeral by nature: a container’s temporary storage is wiped out and lost when the container is stopped or restarted.

Nodes: The Workhorses of Kubernetes

Nodes represent the physical or virtual machines that run the containerized applications. Each node is managed by the control plane and contains the services necessary to run pods.

Cluster: The Heart of Kubernetes

A cluster is a collection of nodes that run containerized applications. Clusters provide the high-level structure within which Kubernetes manages the containerized applications. They enable Kubernetes to orchestrate containers’ deployment, scaling, and management across multiple nodes seamlessly.

Control Plane: The Brain Behind the Operation

The control plane is responsible for managing the worker nodes and the pods in the cluster. It includes several components, such as Kubernetes API server, scheduler, controller manager, and etcd (a key-value store for cluster data). The control plane makes global decisions about the cluster, and therefore its security is paramount as it’s the central point of management for the cluster. 

What Needs to be Protected in Kubernetes?

In Kubernetes, securing your environment is not just about safeguarding the data; it’s about protecting the entire ecosystem that interacts with and manages the data. Here’s an overview of the key components that require protection.

Workloads and Applications

  • Containers and Pods: Protecting containers involves securing the container images from vulnerabilities and ensuring runtime security. For pods, it’s crucial to manage security contexts and network policies effectively to prevent unauthorized access and ensure that sensitive data isn’t exposed to other pods or services unintentionally.
  • Deployments and StatefulSets: These are higher-level constructs that manage the deployment and scaling of pods. Protecting these components involves ensuring that only authorized users can create, update, or delete deployments.

Data and Storage

  • Persistent Volumes (PVs) and Persistent Volume Claims (PVCs): Persistent storage in Kubernetes is managed through PVs and PVCs, and protecting them is essential to ensure data integrity and confidentiality. This includes securing access to the data they contain, encrypting data at rest and in transit, and properly managing storage access permissions.
  • ConfigMaps and Secrets: While ConfigMaps might contain general configuration settings, secrets are used to store sensitive data such as passwords, OAuth tokens, and SSH keys. 

Network Configuration

  • Services and Ingress: Services in Kubernetes provide a way to expose an application on a set of pods as a network service. Ingress, on the other hand, manages external access to the services within a cluster, typically HTTP. Protecting these components involves securing the communication channels, implementing network policies to restrict access to and from the services, and ensuring that only authorized services are exposed to the outside world.
  • Network Policies: Network policies define how groups of pods are allowed to communicate with each other and other network endpoints. Securing them is essential for creating a controlled, secure networking environment within your Kubernetes cluster.
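
To make the network policy idea a little more concrete, here is a minimal sketch that builds a default-deny NetworkPolicy manifest as a Python dictionary and prints it as JSON (kubectl accepts JSON manifests as well as YAML). The namespace `demo` and the policy name `default-deny-all` are assumptions made up for this illustration; a real cluster would layer more specific allow rules on top.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod in the namespace and
    allows no ingress or egress until other policies permit it."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # The policy name is a hypothetical choice for this sketch.
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Listing both types with no rules denies traffic in both directions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # "demo" is a hypothetical namespace.
    print(json.dumps(default_deny_policy("demo"), indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, assuming the cluster’s network plugin enforces network policies.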

Access Controls and User Management

  • Role-Based Access Control (RBAC): RBAC in Kubernetes helps define who can access what within a cluster. It allows administrators to regulate access to Kubernetes resources and namespaces based on the roles assigned to users. Protecting your cluster with RBAC ensures that users and applications have only the access they need, while minimizing the potential impact of compromised credentials or insider threats.
  • Service Accounts: Service accounts provide an identity for processes that run in a pod, allowing them to interact with the Kubernetes API. Managing and securing these accounts is crucial to prevent unauthorized API access, which could lead to data leakage or unauthorized modifications of the cluster state.
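
As a rough sketch of least-privilege access, the snippet below builds a read-only Role for pods and a RoleBinding that grants it to a service account, again as plain Python dictionaries printed as JSON. The names `pod-reader`, `demo`, and `backup-agent` are hypothetical and chosen only for illustration.

```python
import json


def read_only_pod_role(namespace: str) -> dict:
    """Role that can only read pods in one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "pod-reader", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                "resources": ["pods"],
                "verbs": ["get", "list", "watch"],  # no create/update/delete
            }
        ],
    }


def bind_role_to_service_account(role: dict, service_account: str) -> dict:
    """RoleBinding that grants the given Role to one service account."""
    namespace = role["metadata"]["namespace"]
    role_name = role["metadata"]["name"]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }


if __name__ == "__main__":
    # "demo" and "backup-agent" are hypothetical names.
    role = read_only_pod_role("demo")
    binding = bind_role_to_service_account(role, "backup-agent")
    print(json.dumps([role, binding], indent=2))
```

Scoping the role to a single namespace and to read-only verbs limits what a compromised service account token could do.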

Cluster Infrastructure

  • Nodes and the Control Plane: The nodes run the containerized applications and are controlled by the control plane, which includes the API server, scheduler, controller manager, and etcd database. Securing the nodes involves hardening the underlying operating system (OS), ensuring secure communication between the nodes and the control plane, and protecting control plane components from unauthorized access and tampering.
  • Kubernetes Secrets Management: Managing secrets securely in Kubernetes is critical for protecting sensitive data. This includes implementing best practices for secrets encryption, both at rest and in transit, and limiting secrets exposure to only those pods that require access.

Protecting these components is crucial for maintaining both the security and operational integrity of your Kubernetes environment. A breach in any of these areas can compromise your entire cluster, leading to data loss and causing service disruption and financial damage. Implementing a layered security approach that addresses the vulnerabilities of the Kubernetes architecture is essential for building a resilient, secure deployment.

Challenges in Kubernetes Data Protection

Securing the Kubernetes components we discussed above poses unique challenges due to the platform’s dynamic nature and the diverse types of workloads it supports. Understanding these challenges is the first step toward developing effective strategies for safeguarding your applications and data. Here are some of the key challenges:

Dynamic Nature of Container Environments

Kubernetes’s fluid landscape, with containers constantly being created and destroyed, makes traditional data protection methods less effective. The rapid pace of change demands backup solutions that can adapt just as quickly to avoid data loss. 

Statelessness vs. Statefulness

  • Stateless Applications: These don’t retain data, pushing the need to safeguard the external persistent storage they rely on. 
  • Stateful Applications: Managing data across sessions involves intricate handling of PVs and PVCs, which can be challenging in a system where pods and nodes are frequently changing.

Data Consistency

Maintaining data consistency across distributed replicas in Kubernetes is complex, especially for StatefulSets with persistent data needs. Strategies for consistent snapshots or application-specific replication are vital to ensure integrity.

Scalability Concerns

The scalability of Kubernetes, while a strength, introduces data protection complexities. As clusters grow, ensuring efficient and scalable backup solutions becomes critical to prevent performance degradation and data loss.

Security and Regulatory Compliance

Ensuring compliance with the appropriate standards (GDPR, HIPAA, or SOC 2, for instance) always requires keeping track of how sensitive data is stored and managed. In a dynamic environment like Kubernetes, which allows for frequent creation and destruction of containers, enforcing persistent security measures can be a challenge. Also, the sensitive data that needs to be encrypted and protected may be hosted in portions across multiple containers. Therefore, it’s important not only to track what currently exists but also to anticipate possible iterations of the environment by ensuring continuous monitoring and the implementation of robust data management practices.

As you can see, Kubernetes data protection requires navigating its dynamic nature and the dichotomy of stateless and stateful applications while addressing the consistency and scalability challenges. A strategic approach to leveraging Kubernetes-native solutions and best practices is essential for effective data protection.

Choosing the Right Kubernetes Backup Solution: Strategies and Considerations

When it comes to protecting your Kubernetes environments, selecting the right backup solution is important. Kasten by Veeam, Rubrik, and Commvault are among the backup solutions that offer robust support for Kubernetes.

Here are some essential strategies and considerations for choosing a solution that supports your needs. 

  • Assess Your Workload Types: Different applications demand different backup strategies. Stateful applications, in particular, require backup solutions that can handle persistent storage effectively. 
  • Evaluate Data Consistency Needs: Opt for backup solutions that offer consistent backup capabilities, especially for databases and applications requiring strict data consistency. Look for features that support application-consistent backups, ensuring that data is in a usable state when restored. 
  • Scalability and Performance: The backup solution should seamlessly scale with your Kubernetes deployment without impacting performance. Consider solutions that offer efficient data deduplication, compression, and incremental backup capabilities to handle growing data volumes.
  • Recovery Objectives: Define clear recovery objectives. Look for solutions that offer granular recovery options, minimizing downtime by allowing for precise restoration of applications or data, aligning with recovery time objectives (RTOs) and recovery point objectives (RPOs). 
  • Integration and Automation: Choose a backup solution that integrates well or natively with Kubernetes, offering automation capabilities for backup schedules, policy management, and recovery processes. This integration simplifies operations and enhances reliability. 
  • Vendor Support and Community: Consider the vendor’s reputation, the level of support provided, and the solution’s community engagement. A strong support system and active community can be invaluable for troubleshooting and best practices.
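
The recovery objectives above can be turned into simple checks. The sketch below is not tied to any particular backup product; the function names, timestamps, and thresholds are assumptions for illustration. It tests whether the newest backup is recent enough to satisfy an RPO and whether a measured restore time fits within an RTO.

```python
from datetime import datetime, timedelta, timezone


def meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failure at `now` would lose no more data than the RPO allows."""
    return now - last_backup <= rpo


def meets_rto(restore_duration: timedelta, rto: timedelta) -> bool:
    """True if a measured (or rehearsed) restore finishes within the RTO."""
    return restore_duration <= rto


if __name__ == "__main__":
    # Hypothetical timestamps and objectives.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_backup = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)

    print(meets_rpo(last_backup, now, rpo=timedelta(hours=4)))  # backup is 2.5 hours old
    print(meets_rpo(last_backup, now, rpo=timedelta(hours=1)))
    print(meets_rto(timedelta(minutes=45), rto=timedelta(hours=1)))
```

Running checks like these on a schedule is one way to notice when backup frequency has drifted away from the objectives you defined.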

By considering the above strategies and the unique features offered by backup solutions, you can ensure your Kubernetes environment is not only protected against data loss but also aligned with your operational dynamics and business objectives. 

Leveraging Cloud Storage for Comprehensive Kubernetes Data Protection

After choosing a Kubernetes backup application, integrating cloud storage such as Backblaze B2 with your application offers a flexible, secure, scalable approach to data protection. By leveraging cloud storage solutions, organizations can enhance their Kubernetes data protection strategy, ensuring data durability and availability across a distributed environment. This integration facilitates off-site backups, which are essential for disaster recovery and compliance with data protection policies, providing a robust layer of security against data loss, configuration errors, and breaches. 

Protect Your Kubernetes Data

In summary, understanding the intricacies of Kubernetes components, acknowledging the challenges in Kubernetes backup, selecting the appropriate backup solution, and effectively integrating cloud storage are pivotal steps in crafting a comprehensive Kubernetes backup strategy. These measures ensure data protection, operational continuity, and compliance. The right backup solution, tailored to Kubernetes’s distinctive needs, coupled with the scalability and resiliency of cloud storage, provides a robust framework for safeguarding against data loss or breaches. This multi-faceted approach not only safeguards critical data but also supports the agility and scalability that modern IT environments demand. 

The post Kubernetes Data Protection: How to Safeguard Your Containerized Applications appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for 2023

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-2023/

A decorative image displaying the words 2023 Year End Drive Stats

As of December 31, 2023, we had 274,622 drives under management. Of that number, there were 4,400 boot drives and 270,222 data drives. This report will focus on our data drives. We will review the hard drive failure rates for 2023, compare those rates to previous years, and present the lifetime failure statistics for all the hard drive models active in our data center as of the end of 2023. Along the way we share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

2023 Hard Drive Failure Rates

As of the end of 2023, Backblaze was monitoring 270,222 hard drives used to store data. For our evaluation, we removed 466 drives from consideration which we’ll discuss later on. This leaves us with 269,756 hard drives covering 35 drive models to analyze for this report. The table below shows the Annualized Failure Rates (AFRs) for 2023 for this collection of drives.

A chart displaying the failure rates of Backblaze hard drives.
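
For readers who want to reproduce an AFR from raw counts, here is a minimal sketch. It assumes the commonly stated Drive Stats formula, AFR = failures / (drive days / 365) × 100; the function name is ours. The first example uses the 8TB Seagate figures from this report, and the second uses made-up numbers.

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR as a percentage: failures per drive-year of operation, times 100.

    Assumes AFR = failures / (drive_days / 365) * 100.
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    return 100.0 * failures * 365 / drive_days


if __name__ == "__main__":
    # 8TB Seagate ST8000NM000A: zero failures over 52,876 drive days.
    print(f"{annualized_failure_rate(0, 52_876):.2f}%")

    # Hypothetical model: 10 failures over 365,000 drive days (1,000 drive-years).
    print(f"{annualized_failure_rate(10, 365_000):.2f}%")
```

Using drive days rather than drive counts accounts for drives that were added, removed, or failed partway through the period.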

Notes and Observations

One zero for the year: In 2023, only one drive model had zero failures, the 8TB Seagate (model: ST8000NM000A). In fact, that drive model has had zero failures in our environment since we started deploying it in Q3 2022. That “zero” does come with some caveats: We have only 204 drives in service and the drive has a limited number of drive days (52,876), but zero failures over 18 months is a nice start.

Failures for the year: There were 4,189 drives which failed in 2023. Doing a little math, over the last year on average, we replaced a failed drive every two hours and five minutes. If we limit hours worked to 40 per week, then we replaced a failed drive every 30 minutes.
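
The “little math” above can be checked with a short sketch. The only assumptions are a 365-day year and 52 working weeks of 40 hours each; the function name is ours.

```python
FAILED_DRIVES_2023 = 4_189


def minutes_between_replacements(total_hours: float, failures: int) -> float:
    """Average minutes between drive replacements over a span of hours."""
    return total_hours * 60 / failures


if __name__ == "__main__":
    around_the_clock = minutes_between_replacements(365 * 24, FAILED_DRIVES_2023)
    working_hours = minutes_between_replacements(52 * 40, FAILED_DRIVES_2023)

    # About 125 minutes, i.e. roughly two hours and five minutes.
    print(f"Around the clock: every {around_the_clock:.1f} minutes")
    # Just under 30 minutes.
    print(f"40-hour weeks: every {working_hours:.1f} minutes")
```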

More drive models: In 2023, we added six drive models to the list while retiring zero, giving us a total of 35 different models we are tracking. 

Two of the models have been in our environment for a while but finally reached 60 drives in production by the end of 2023.

  1. Toshiba 8TB, model HDWF180: 60 drives.
  2. Seagate 18TB, model ST18000NM000J: 60 drives.

Four of the models were new to our production environment and have 60 or more drives in production by the end of 2023.

  1. Seagate 12TB, model ST12000NM000J: 195 drives.
  2. Seagate 14TB, model ST14000NM000J: 77 drives.
  3. Seagate 14TB, model ST14000NM0018: 66 drives.
  4. WDC 22TB, model WUH722222ALE6L4: 2,442 drives.

The drives for the three Seagate models are used to replace failed 12TB and 14TB drives. The 22TB WDC drives are a new model added primarily as two new Backblaze Vaults of 1,200 drives each.

Mixing and Matching Drive Models

There was a time when we purchased extra drives of a given model to have on hand so we could replace a failed drive with the same drive model. For example, if we needed 1,200 drives for a Backblaze Vault, we’d buy 1,300 to get 100 spares. Over time, we tested combinations of different drive models to ensure there was no impact on throughput and performance. This allowed us to purchase drives as needed, like the Seagate drives noted previously. This saved us the cost of buying drives just to have them hanging around for months or years waiting for the same drive model to fail.

Drives Not Included in This Review

We noted earlier there were 466 drives we removed from consideration in this review. These drives fall into three categories.

  • Testing: These are drives of a given model that we monitor and collect Drive Stats data on, but are in the process of being qualified as production drives. For example, in Q4 there were four 20TB Toshiba drives being evaluated.
  • Hot Drives: These are drives that were exposed to high temperatures while in operation. We have removed them from this review, but are following them separately to learn more about how well drives take the heat. We covered this topic in depth in our Q3 2023 Drive Stats Report.
  • Fewer than 60 drives: This is a holdover from when we used a single storage server of 60 drives to store a blob of data sent to us. Today we divide that same blob across 20 servers, i.e. a Backblaze Vault, dramatically improving the durability of the data. For 2024 we are going to review the 60-drive criterion and most likely replace this standard with a minimum number of drive days in a given period of time to be part of the review.

Regardless, in the Q4 2023 Drive Stats data you will find these 466 drives along with the data for the 269,756 drives used in the review.

Comparing Drive Stats for 2021, 2022, and 2023

The table below compares the AFR for each of the last three years. The table includes just those drive models which had over 200,000 drive days during 2023. The data for each year is inclusive of that year only for the operational drive models present at the end of each year. The table is sorted by drive size and then AFR.

A chart showing the failure rates of hard drives from 2021, 2022, and 2023.

Notes and Observations

What’s missing?: As noted, a drive model required 200,000 drive days or more in 2023 to make the list. Drives like the 22TB WDC model with 126,956 drive days and the 8TB Seagate with zero failures but only 52,876 drive days didn’t qualify. Why 200,000? Each quarter we use 50,000 drive days as the minimum number to qualify as statistically relevant, and four quarters at that minimum add up to 200,000 for the year. It’s not a perfect metric, but it minimizes the volatility sometimes associated with drive models with a lower number of drive days.
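
The qualification rule is easy to express in code. This sketch treats the annual cutoff as four quarters at the 50,000 drive-day quarterly minimum and uses “200,000 or more” as the comparison; the two drive-day figures are the ones quoted above, and the function name is ours.

```python
QUARTERLY_MIN_DRIVE_DAYS = 50_000
ANNUAL_MIN_DRIVE_DAYS = 4 * QUARTERLY_MIN_DRIVE_DAYS  # 200,000


def qualifies(drive_days: int, minimum: int = ANNUAL_MIN_DRIVE_DAYS) -> bool:
    """True if a drive model has enough drive days to be included."""
    return drive_days >= minimum


if __name__ == "__main__":
    drive_days_2023 = {
        "WDC 22TB (WUH722222ALE6L4)": 126_956,
        "Seagate 8TB (ST8000NM000A)": 52_876,
    }
    for model, days in drive_days_2023.items():
        status = "included" if qualifies(days) else "excluded"
        print(f"{model}: {days:,} drive days -> {status}")
```

Both models are excluded for 2023, even though the 8TB Seagate would clear the 50,000 drive-day bar if all of its drive days fell within a single quarter.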

The 2023 AFR was up: The AFR for all drive models listed was 1.70% in 2023. This compares to 1.37% in 2022 and 1.01% in 2021. Throughout 2023 we have seen the AFR rise as the average age of the drive fleet has increased. There are currently nine drive models with an average age of six years or more. The nine models make up nearly 20% of the drives in production. Since Q2, we have accelerated the migration from older drive models, typically 4TB in size, to new drive models, typically 16TB in size. This program will continue throughout 2024 and beyond.

Annualized Failure Rates vs. Drive Size

Now, let’s dig into the numbers to see what else we can learn. We’ll start by looking at the quarterly AFRs by drive size over the last three years.

A chart showing hard drive failure rates by drive size from 2021 to 2023.

To start, the AFR for 10TB drives (gold line) is obviously increasing, as are the AFRs for the 8TB drives (gray line) and the 12TB drives (purple line). Each of these groups finished at an AFR of 2% or higher in Q4 2023 while starting from an AFR of about 1% in Q2 2021. On the other hand, the AFR for the 4TB drives (blue line) rose initially, peaked in 2022, and has decreased since. The remaining three drive sizes (6TB, 14TB, and 16TB) have oscillated around 1% AFR for the entire period.

Zooming out, we can look at the change in AFR by drive size on an annual basis. If we compare the annual AFR results for 2022 to 2023, we get the table below. The results for each year are based only on the data from that year.

At first glance it may seem odd that the AFR for 4TB drives is going down, especially given that the average age of each of the 4TB drive models is over six years and getting older. The reason is likely related to our focus in 2023 on migrating from 4TB drives to 16TB drives. In general we migrate the oldest drives first, that is, those more likely to fail in the near future. This process of culling out the oldest drives appears to mitigate the expected rise in failure rates as a drive ages.

But not all drive models play along. The 6TB Seagate drives are over 8.6 years old on average and, for 2023, have the lowest AFR for any drive size group, potentially making a mockery of the age-is-related-to-failure theory, at least over the last year. Let’s see if that holds true for the lifetime failure rate of our drives.

Lifetime Hard Drive Stats

We evaluated 269,756 drives across 35 drive models for our lifetime AFR review. The table below summarizes the lifetime drive stats data from April 2013 through the end of Q4 2023. 

A chart showing lifetime annualized failure rates for 2023.

The current lifetime AFR for all of the drives is 1.46%. This is up from the end of last year (Q4 2022) which was 1.39%. This makes sense given the quarterly rise in AFR over 2023 as documented earlier. This is also the highest the lifetime AFR has been since Q1 2021 (1.49%). 

The table above contains all of the drive models active as of 12/31/2023. To declutter the list, we can remove those models which don’t have enough data to be statistically relevant. This does not mean the AFR shown above is incorrect; it just means we’d like to have more data to be confident about the failure rates we are listing. To that end, the table below only includes those drive models which have two million drive days or more over their lifetime, which gives us a manageable list of 23 drive models to review.

A chart showing the 2023 annualized failure rates for drives with more than 2 million drive days in their lifetimes.

Using the table above we can compare the lifetime drive failure rates of different drive models. In the charts below, we group the drive models by manufacturer, and then plot the drive model AFR versus average age in months of each drive model. The relative size of each circle represents the number of drives in each cohort. The horizontal and vertical scales for each manufacturer chart are the same.

A chart showing annualized failure rates by average age and drive manufacturer.

Notes and Observations

Drive migration: When selecting drive models to migrate we could just replace the oldest drive models first. In this case, the 6TB Seagate drives. Given there are only 882 drives—that’s less than one Backblaze Vault—the impact on failure rates would be minimal. That aside, the chart makes it clear that we should continue to migrate our 4TB drives as we discussed in our recent post on which drives reside in which storage servers. As that post notes, there are other factors, such as server age, server size (45 vs. 60 drives), and server failure rates which help guide our decisions. 

HGST: The chart on the left below shows the AFR trendline (second order polynomial) for all of our HGST models. It does not appear that drive failure consistently increases with age. The chart on the right shows the same data with the HGST 4TB drive models removed. The results are more in line with what we’d expect: drive failure increases over time. While the 4TB drives perform great, they don’t appear to be the AFR benchmark for newer/larger drives.

One other potential factor not explored here is that beginning with the 8TB drive models, helium was used inside the drives and the drives were sealed. Prior to that they were air-cooled and not sealed. So did switching to helium inside a drive affect the failure profile of the HGST drives? Interesting question, but with the data we have on hand, I’m not sure we can answer it, or that it matters much anymore as helium is here to stay.

Seagate: The chart on the left below shows the AFR trendline (second order polynomial) for our Seagate models. As with the HGST models, it does not appear that drive failure continues to increase with age. For the chart on the right, we removed the drive models that were greater than seven years old (average age).

Interestingly, the trendline for the two charts is basically the same up to the six year point. If we attempt to project past that for the 8TB and 12TB drives there is no clear direction. Muddying things up even more is the fact that the three models we removed because they are older than seven years are all consumer drive models, while the remaining drive models are all enterprise drive models. Will that make a difference in the failure rates of the enterprise drive model when they get to seven or eight or even nine years of service? Stay tuned.

Toshiba and WDC: As for the Toshiba and WDC drive models, there is a little over three years’ worth of data and no discernible patterns have emerged. All of the drives from each of these manufacturers are performing well to date.

Drive Failure and Drive Migration

One thing we’ve seen above is that drive failure projections are typically drive model dependent. But we don’t migrate drive models as a group; instead, we migrate all of the drives in a storage server or Backblaze Vault. The drives in a given server or Vault may not be the same model. How we choose which servers and Vaults to migrate will be covered in a future post, but for now we’ll just say that drive failure isn’t everything.

The Hard Drive Stats Data

The complete data set used to create the tables and charts in this report is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data itself to anyone; it is free.

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Commits to Routing Security With MANRS Participation

Post Syndicated from Brent Nowak original https://www.backblaze.com/blog/backblaze-commits-to-routing-security-with-manrs-participation/

A decorative image displaying the MANRS logo.

They say good manners are better than good looks. When it comes to being a good internet citizen, we have to agree. And when someone else tells you that you have good manners (or MANRS in this case), even better. 

If you hold your cloud partners to a higher standard, and if you think it’s not asking too much that they make the internet a better, safer place for everyone, then you’ll be happy to know that Backblaze is now recognized as a Mutually Agreed Norms for Routing Security (MANRS) participant (aka MANRS Compliant). 

What Is MANRS?

MANRS is a global initiative with over 1,095 participants that are enacting network policies and controls to help reduce the most common routing threats. At a high level, we’re setting up filters to check that network routing information we receive from peers is valid, ensuring that the networks we advertise to the greater internet are marked as owned by Backblaze, and making sure that data that gets out of our network is legitimate and can’t be spoofed.

You can view a full list of MANRS participants here.

What Our (Good) MANRS Mean For You

The biggest benefit for customers is that network traffic to and from Backblaze’s connection points where we exchange traffic with our peering partners is more secure and more trustworthy. All of the changes that we’ve implemented (which we get into below) are on our side—so, no action is necessary from Backblaze partners or users—and will be transparent for our customers. Our Network Engineering team has done the heavy lifting. 

MANRS Actions

Backblaze falls under the MANRS category of CDN and Cloud Providers, and as such, we’ve implemented solutions or processes for each of the five actions stipulated by MANRS:

  1. Prevent propagation of incorrect routing information: Ensure that traffic we receive is coming from known networks.
  2. Prevent traffic of illegitimate source IP addresses: Prevent malicious traffic coming out of our network.
  3. Facilitate global operational communication and coordination: Keep our records with 3rd party sites like Peeringdb.com up to date as other operators use this to validate our connectivity details.
  4. Facilitate validation of routing information on a global scale: Digitally sign our network objects using the Resource Public Key Infrastructure (RPKI) standard.
  5. Encourage MANRS adoption: By telling the world, just like in this post!

Digging Deeper Into Filtering and RPKI

Let’s go over the filtering and RPKI details, since they are very valuable to ensuring the security and validity of our network traffic.

Filtering: Sorting Out the Good Networks From the Bad

One major action for MANRS compliance is to validate the networks we receive from peers. When we connect to other networks, we tell each other about our networks in order to build a routing table that lets us know the optimal path to send traffic.

We can blindly trust what the other party is telling us, or we can reach out to an external source to validate. We’ve implemented automated internal processes to help us apply these filters to our edge routers (the devices that connect us externally to other networks).

If you’re a more visual learner, like me, here’s a quick conversational bubble diagram of what we have in place.

Externally verifying routing information we receive.

Every edge device that connects to an external peer now has validation steps to ensure that the networks we receive and use to send out traffic are valid. We have automated processes that periodically check for and deploy updates to any lists.
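
Conceptually, the filter check looks something like the sketch below. This is a simplified illustration, not our router configuration: the function name is made up, the prefixes come from the ranges reserved for documentation, and real-world filters usually apply additional limits (on prefix length, for example).

```python
import ipaddress


def accept_announcement(announced_prefix: str, allowed_prefixes: list[str]) -> bool:
    """Accept a route only if it falls inside a prefix the peer is known to hold."""
    announced = ipaddress.ip_network(announced_prefix)
    for allowed in allowed_prefixes:
        allowed_net = ipaddress.ip_network(allowed)
        # subnet_of() requires both networks to be the same IP version.
        if announced.version == allowed_net.version and announced.subnet_of(allowed_net):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical filter list for one peer, as validated against an external source.
    peer_filter = ["192.0.2.0/24", "2001:db8::/32"]

    print(accept_announcement("192.0.2.128/25", peer_filter))    # inside the list
    print(accept_announcement("198.51.100.0/24", peer_filter))   # not in the list
    print(accept_announcement("2001:db8:1::/48", peer_filter))   # inside the IPv6 entry
```

Anything a peer announces that is not covered by its filter list is simply dropped, which keeps a mistaken or malicious announcement from entering the routing table.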

What Is RPKI?

RPKI is a public key infrastructure framework designed to secure the internet’s routing infrastructure, specifically the Border Gateway Protocol (BGP). RPKI provides a way to connect internet number resource information (such as IP addresses) to a trust anchor. In layman’s terms, RPKI allows us, as a network operator, to securely identify whether other networks that interact with ours are legitimate or malicious.

RPKI: Signing Our Paperwork

Much like going to a notary and validating a form, we can perform the same action digitally with the list of networks that we advertise to the greater internet. The RPKI framework allows us to stamp our networks as owned by us.

It also allows us to digitally sign records of our networks that we own, allowing external parties to confirm that the networks that they see from us are valid. If another party comes along and tries to claim to be us, by using RPKI our peering partner will deny using that network to send data to a false Backblaze network.

You can check the status of our RPKI signed route objects on the MANRS statistics website.

What does the process of peering and advertising networks look like without RPKI validation?

A diagram that imagines IP address requests for ownership without RPKI standards. Bad actors would be able to claim traffic directed towards IP addresses that they don't own.
Bad actor claiming to be a Backblaze network without RPKI validation.

Now, with RPKI, we’ve dotted our I’s and crossed our T’s. A third party certificate holder serves as a validator for the digital certificates that we used to sign our network objects. If anyone else claims to be us, they will be marked as invalid and the peer will not accept the routing information, as you can see in the diagram below.

A diagram that imagines networking requests for ownership with RPKI standards properly applied. Bad actors would attempt to claim traffic towards an owned or valid IP address, but be prevented because they don't have the correct credentials.
With RPKI validation, the bad actor is denied the ability to claim to be a Backblaze network.
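The check a peer performs can be sketched in a few lines. This is a simplified, IPv4-only model of route origin validation, assuming each signed record (known in RPKI as a route origin authorization, or ROA) lists a prefix, a maximum prefix length, and the network allowed to originate it. The function names, prefixes, and AS numbers are hypothetical and use documentation ranges.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split(".").reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

function parsePrefix(cidr) {
  const [ip, len] = cidr.split("/");
  return { addr: ipToInt(ip), len: Number(len) };
}

// True if the signed prefix covers the announced prefix.
function covers(signed, announced) {
  if (announced.len < signed.len) return false;
  const mask = signed.len === 0 ? 0 : (0xffffffff << (32 - signed.len)) >>> 0;
  return ((announced.addr & mask) >>> 0) === ((signed.addr & mask) >>> 0);
}

// Returns "valid", "invalid", or "not-found" for an announcement.
function validateOrigin(roas, announcement) {
  const announced = parsePrefix(announcement.prefix);
  let covered = false;
  for (const roa of roas) {
    if (!covers(parsePrefix(roa.prefix), announced)) continue;
    covered = true;
    if (roa.asn === announcement.originAsn && announced.len <= roa.maxLength) {
      return "valid";
    }
  }
  // Covered by a signed record, but no record matched: reject as invalid.
  return covered ? "invalid" : "not-found";
}

const roas = [{ prefix: "192.0.2.0/24", maxLength: 24, asn: 64500 }];

console.log(validateOrigin(roas, { prefix: "192.0.2.0/24", originAsn: 64500 }));    // valid
console.log(validateOrigin(roas, { prefix: "192.0.2.0/24", originAsn: 64511 }));    // invalid
console.log(validateOrigin(roas, { prefix: "198.51.100.0/24", originAsn: 64511 })); // not-found
```

The second call is the bad actor from the diagram: the prefix is covered by a signed record, but the origin network does not match, so the announcement is marked invalid.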

Mind Your MANRS

Our first value as a company is to be fair and good. It reads: “Be good. Trust is paramount. Build a good product. Charge fairly. Be open, honest, and accepting with customers and each other.” Almost sounds like Emily Post wrote it—that’s why our MANRS participation fits right in with the way we do business. We believe in an open internet, and participating in MANRS is just one way that we can contribute to a community that is working towards good for all.

The post Backblaze Commits to Routing Security With MANRS Participation appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Object Storage Simplified: Introducing Powered by Backblaze

Post Syndicated from Elton Carneiro original https://www.backblaze.com/blog/powered-by-announcement-2024/

A decorative image showing the Backblaze logo on a cloud hovering over a power button.

Today, we announced our new Powered by Backblaze program to give platform providers the ability to offer cloud storage without the burden of building scalable storage infrastructure (something we know a little bit about). 

If you’re an independent software vendor (ISV), technology partner, or any company that wants to incorporate easy, affordable data storage within your branded user experience, Powered by Backblaze will give you the tools to do so without complex code, capital outlay, or massive expense.

Read on to learn more about Powered by Backblaze and how it can help you enhance your platforms and services. Or, if you’d like to get started ASAP, contact our Sales Team for access.

Benefits of Powered by Backblaze

  • Business Growth: Adding cloud services to your product portfolios can generate new revenue streams and/or grow your existing margin.
  • Improved Customer Experience: Take the complexity out of object storage and deliver the best solutions by incorporating a proven object cloud storage solution.
  • Simplified Billing: Reduce complex billing by providing customers with a single bill from a single provider. 
  • Build Your Brand: Improve customer expectations by providing cloud storage with your company name for consistency and brand identity.

What Is Powered by Backblaze?

Powered by Backblaze offers companies the ability to incorporate B2 Cloud Storage into their products so they can sell more services or enhance their user experience with no capital investment. Today, this program offers two solutions that support the provisioning of B2 Cloud Storage: Custom Domains and the Backblaze Partner API.

How Can I Leverage Custom Domains?

Custom Domains, launched today, lets you serve content to your end users from the web domain or URL of your choosing, with no need for complex code or proxy servers. Backblaze manages the heavy lifting of cloud storage on the back end.

Custom Domains functionality combines CNAME and Backblaze B2 Object Storage, enabling the use of your preferred domain name in your files’ web domain or URLs instead of using the domain name that Backblaze automatically assigns.

We’ve chosen Backblaze so we can have a reliable partner behind our new Edge Storage solution. With their Custom Domain feature, we can implement the security needed to serve data from Backblaze to end users from Azion’s Edge Platform, improving user experience.

—Rafael Umann, CEO, Azion, a full stack platform for developers

How Can I Leverage the Backblaze Partner API?

The Backblaze Partner API automates the provisioning and management of Backblaze B2 Cloud Storage accounts within a platform. It allows for managing accounts, running reports, and creating a bundled solution or managed service for a unified user experience.

We wrote more about the Backblaze Partner API here, but briefly: We created this solution by exposing existing API functionality in a manner that allows partners to automate tasks essential to provisioning users with seamless access to storage.

The Backblaze Partner API calls allow you to:

  • Create accounts (add Group members)
  • Organize accounts in Groups
  • List Groups
  • List Group members
  • Eject Group members

If you’d like to get into the details, you can dig deeper in our technical documentation.

Our customers produce thousands of hours of content daily and, with the shift to leveraging cloud services like ours, they need a place to store both their original and transcoded files. The Backblaze Partner API allows us to expand our cloud services and eliminate complexity for our customers—giving them time to focus on their business needs, while we focus on innovations that drive more value.

—Murad Mordukhay, CEO, Qencode

How to Get Started With Powered by Backblaze

To get started with Powered by Backblaze, contact our Sales Team. They will work with you to understand your use case and how you can best utilize Powered by Backblaze. 

What’s Next?

We’re looking forward to adding more to the Powered by Backblaze program as we continue investing in the tools you need to bring performant cloud storage to your users in an easy, seamless fashion.

The post Object Storage Simplified: Introducing Powered by Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Exploring aws-lite, a Community-Driven JavaScript SDK for AWS

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/exploring-aws-lite-a-community-driven-javascript-sdk-for-aws/

A decorative image showing the Backblaze and aws-lite logos.

One of the benefits of the Backblaze B2 Storage Cloud having an S3 compatible API is that developers can take advantage of the wide range of Amazon Web Services SDKs when building their apps. The AWS team has released over a dozen SDKs covering a broad range of programming languages, including Java, Python, and JavaScript, and the latter supports both frontend (browser) and backend (Node.js) applications.

With all of this tooling available, you might be surprised to discover aws-lite. In the words of its creators, it is “a simple, extremely fast, extensible Node.js client for interacting with AWS services.” After meeting Brian LeRoux, cofounder and chief technology officer (CTO) of Begin, the company that created the aws-lite project, at the AWS re:Invent conference last year, I decided to give aws-lite a try and share the experience. Read on for the learnings I discovered along the way.

A photo showing an aws-lite promotional sticker that says, I've got p99 problems but an SDK ain't one, as well as a Backblaze promotional sticker that says Blaze/On.
Brian bribed me to try out aws-lite with a shiny laptop sticker!

Why Not Just Use the AWS SDK for JavaScript?

The AWS SDK has been through a few iterations. The initial release, way back in May 2013, focused on Node.js, while version 2, released in June 2014, added support for JavaScript running on a web page. We had to wait until December 2020 for the next major revision of the SDK, with version 3 adding TypeScript support and switching to an all-new modular architecture.

However, not all developers saw version 3 as an improvement. Let’s look at a simple example of the evolution of the SDK. The simplest operation you can perform against an S3 compatible cloud object store, such as Backblaze B2, is to list the buckets in an account. Here’s how you would do that in the AWS SDK for JavaScript v2:

var AWS = require('aws-sdk');

var client = new AWS.S3({
  region: 'us-west-004', 
  endpoint: 's3.us-west-004.backblazeb2.com'
});

client.listBuckets(function (err, data) {
  if (err) {
    console.log("Error", err);
  } else {
    console.log("Success", data.Buckets);
  }
});

Looking back from 2023, passing a callback function to the listBuckets() method looks quite archaic! Version 2.3.0 of the SDK, released in 2016, added support for JavaScript promises, and, since async/await arrived in JavaScript in 2017, today we can write the above example a little more clearly and concisely:

const AWS = require('aws-sdk');

const client = new AWS.S3({
  region: 'us-west-004', 
  endpoint: 's3.us-west-004.backblazeb2.com'
});

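// Wrapped in an async function because CommonJS modules, which use
// require(), do not allow await at the top level
async function main() {
  try {
    const data = await client.listBuckets().promise();
    console.log("Success", data.Buckets);
  } catch (err) {
    console.log("Error", err);
  }
}

main();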

One major drawback with version 2 of the AWS SDK for JavaScript is that it is a single, monolithic, JavaScript module. The most recent version, 2.1539.0, weighs in at 92.9MB of code and resources. Even the most minimal app using the SDK has to include all that, plus another couple of MB of dependencies, causing performance issues in resource-constrained environments such as internet of things (IoT) devices, or browsers on low-end mobile devices.

Version 3 of the AWS SDK for JavaScript aimed to fix this, taking a modular approach. Rather than a single JavaScript module there are now over 300 packages published under the @aws-sdk/ scope on NPM. Now, rather than the entire SDK, an app using S3 need only install @aws-sdk/client-s3, which, with its dependencies, adds up to just 20MB.

So, What’s the Problem With AWS SDK for JavaScript v3?

One issue is that, to fully take advantage of modularization, you must adopt an unfamiliar coding style, creating a command object and passing it to the client’s send() method. Here is the “new way” of listing buckets:

const { S3Client, ListBucketsCommand } = require("@aws-sdk/client-s3");

// Since v3.378, S3Client can read region and endpoint, as well as
// credentials, from configuration, so no need to pass any arguments
const client = new S3Client();

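// Wrapped in an async function because CommonJS modules, which use
// require(), do not allow await at the top level
async function main() {
  try {
    // Inexplicably, you must pass an empty object to
    // ListBucketsCommand() to avoid the SDK throwing an error
    const data = await client.send(new ListBucketsCommand({}));
    console.log("Success", data.Buckets);
  } catch (err) {
    console.log("Error", err);
  }
}

main();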

The second issue is that, to help manage the complexity of keeping the SDK packages in sync with the 200+ services and their APIs, AWS now generates the SDK code from the API specifications. The problem with generated code is that, as the aws-lite home page says, it can result in “large dependencies, poor performance, awkward semantics, difficult to understand documentation, and errors without usable stack traces.”

A couple of these effects are evident even in the short code sample above. The underlying ListBuckets API call does not accept any parameters, so you might expect to be able to call the ListBucketsCommand constructor without any arguments. In fact, you have to supply an empty object, otherwise the SDK throws an error. Digging into the error reveals that a module named middleware-sdk-s3 is validating that, if the object passed to the constructor has a Bucket property, it is a valid bucket name. This is a bit odd since, as I mentioned above, ListBuckets doesn’t take any parameters, let alone a bucket name. The documentation for ListBucketsCommand contains two code samples, one with the empty object, one without. (I filed an issue for the AWS team to fix this.)

“Okay,” you might be thinking, “I’ll just carry on using v2.” After all, the AWS team is still releasing regular updates, right? Not so fast! When you run the v2 code above, you’ll see the following warning before the list of buckets:

(node:35814) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023.
Please migrate your code to use AWS SDK for JavaScript (v3).
For more information, check the migration guide at https://a.co/7PzMCcy

At some (as yet unspecified) time in the future, v2 of the SDK will enter maintenance mode, during which, according to the AWS SDKs and Tools maintenance policy, “AWS limits SDK releases to address critical bug fixes and security issues only.” Sometime after that, v2 will reach the end of support, and it will no longer receive any updates or releases.

Getting Started With aws-lite

Faced with a forced migration to what they judged to be an inferior SDK, Brian’s team got to work on aws-lite, posting the initial code to the aws-lite GitHub repository in September last year, under the Apache 2.0 open source license. At present the project comprises a core client and 13 plugins covering a range of AWS services including S3, Lambda, and DynamoDB.

Following the instructions on the aws-lite site, I installed the client module and the S3 plugin, and implemented the ListBuckets sample:

import awsLite from '@aws-lite/client';

const aws = await awsLite();

try {
  const data = await aws.S3.ListBuckets();
  console.log("Success", data.Buckets);
} catch (err) {
  console.log("Error", err);
}

For me, this combines the best of both worlds—concise code, like AWS SDK v2, and full support for modern JavaScript features, like v3. Best of all, the aws-lite client, S3 plugin, and their dependencies occupy just 284KB of disk space, which is less than 2% of the modular AWS SDK’s 20MB, and less than 0.5% of the monolith’s 92.9MB!
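If you want to check the arithmetic behind those percentages, here is a quick calculation. It assumes decimal units (1MB = 1,000KB); the variable names are just for illustration.

```javascript
// Sizes quoted in the text, converted to kilobytes (1MB = 1,000KB here).
const awsLiteKB = 284;
const sdkV3KB = 20 * 1000;
const sdkV2KB = 92.9 * 1000;

const percentOf = (part, whole) => (part / whole) * 100;

console.log(`aws-lite vs. v3: ${percentOf(awsLiteKB, sdkV3KB).toFixed(2)}%`); // 1.42%
console.log(`aws-lite vs. v2: ${percentOf(awsLiteKB, sdkV2KB).toFixed(2)}%`); // 0.31%
```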

Caveat Developer!

(Not to kill the punchline here, but for those of you who might not have studied Latin or law, this is a play on the phrase, “caveat emptor”, meaning “buyer beware”.)

I have to mention, at this point, that aws-lite is still very much under construction. Only a small fraction of AWS services are covered by plugins, although it is possible (with a little extra code) to use the client to call services without a plugin. Also, not all operations are covered by the plugins that do exist. For example, at present, the S3 plugin supports 10 of the most frequently used S3 operations, such as PutObject, GetObject, and ListObjectsV2, leaving the remaining 89 operations TBD.

That said, it’s straightforward to add more operations and services, and the aws-lite team welcomes pull requests. We’re big believers in being active participants in the open source community, and I’ve already contributed the ListBuckets operation, a fix for HeadObject, and I’m working on adding tests for the S3 plugin using a mock S3 server. If you’re a JavaScript developer working with cloud services, this is a great opportunity to contribute to an open source project that promises to make your coding life better!

The post Exploring aws-lite, a Community-Driven JavaScript SDK for AWS appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Welcoming Chief Product Officer David Ngo to Backblaze

Post Syndicated from Backblaze original https://www.backblaze.com/blog/welcoming-chief-product-officer-david-ngo-to-backblaze/

A decorative image with David Ngo's photo as well as the headline, "David Ngo, Chief Product Officer."

Backblaze is happy to announce that David Ngo has joined our team as Chief Product Officer, a role responsible for spearheading the company’s global product management function, shaping the strategy, crafting the technology roadmap and overseeing execution. 

What David Brings to the Role

David is a software as a service (SaaS) data protection industry veteran with more than 25 years of global leadership experience. He previously served as the global chief technology officer (CTO) for Metallic, a division of Commvault, which provides data protection and cyber resilience as a service. He will play a pivotal role in guiding overall product direction for our existing customers as well as emerging needs as the company continues to succeed in moving upmarket.

I am pleased to welcome David as our new Chief Product Officer. David brings impressive engineering, design, and product leadership to Backblaze. He joins us at an exciting time as we help more customers break free from traditional cloud walled gardens and move to an open cloud ecosystem and empower them to do more with their data.

Gleb Budman, Backblaze CEO and Chairperson of the Board

Ngo joins a team with an impressive track record of building and scaling products and solutions that excite customers, drive growth, and deliver impact. With over 500,000 customers and three billion gigabytes of data storage under management, Backblaze has built data storage products at industry leading pricing over the past 15 years. Ngo further expands the company’s leadership by bringing his vast cloud, infrastructure, and data management knowledge developed during his time leading global teams at Commvault.

David says of his new role: 

I am thrilled to lead the amazing product organization at Backblaze and to help accelerate growth for our company. I am committed to continuing the company’s impressive track record of building powerful products that support customers’ data needs and leading the industry towards an open cloud ecosystem.

—David Ngo, Backblaze Chief Product Officer

The post Welcoming Chief Product Officer David Ngo to Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Leveraging Backblaze Drive Stats to Boost Backblaze B2 Cloud Storage Sales: A Guide for Reseller Partners

Post Syndicated from Mary Ellen Cavanagh original https://www.backblaze.com/blog/leveraging-backblaze-drive-stats-to-boost-backblaze-b2-cloud-storage-sales-a-guide-for-reseller-partners/

A decorative image showing a variety of images related to Backblaze and cloud storage.

If you’re a reseller partner, we know it’s hard to cut through the noise and get potential clients interested in the services you sell. It helps when you’re able to share relevant, useful, truly valuable information with them to build your brand and engage potential clients in prospective services. 

The Backblaze Drive Stats reports can be a powerful tool in your arsenal. They not only provide insights into drive reliability but also empower you to better position and sell Backblaze B2 Cloud Storage. So, let’s dig into what Drive Stats are and how you can use them to serve your clients.

What Are Drive Stats?

The Backblaze Drive Stats reports include a comprehensive set of data that Backblaze openly shares about the performance and reliability of the hard drives that we use in our data centers. The data we publish is excellent for building trust with customers—it’s unique in the industry, regularly covered in industry media, and used by everyone from IT admins to research institutions to inform their strategies. Use it to level up your understanding of hard drives in general—including how they affect cloud storage infrastructure—and to build trust with end users around Backblaze in particular.

How Can I Use Drive Stats as a Reseller?

Identifying and Addressing Customer Concerns

You probably encounter customer concerns regarding the potential risks associated with data storage—both on premises and in the cloud—all the time. With Drive Stats, you can speak to those concerns with hard data on drive failure rates. This data-driven approach empowers customers to make optimal operational decisions and positions you as a knowledgeable, trusted advisor in their cloud storage journey.

Tailoring Solutions to Customer Needs

Every business has unique data storage and backup requirements, often a combination of on premises and cloud based data storage. In crafting the proper storage solution for your clients, you are often confronted with cost versus reliability trade-offs. The Backblaze Drive Stats reports provide a dependable source of unbiased drive reliability statistics when local data storage is required. With the Drive Stats data at hand, you can apply your knowledge and experience to confidently propose and deliver a comprehensive, cost-effective data storage and backup solution at a fair price that meets your customers’ unique needs.

Educating Customers on Data Management Best Practices

Beyond selling a product, you play a vital role in educating your customers on best practices for data management. Backblaze Drive Stats provide you with valuable insights that can be shared with your clients to help them make informed decisions about their storage strategy. By educating customers on the factors that contribute to reliable and efficient data storage, you position yourself as a trusted advisor in the rapidly evolving world of cloud technology.

Drive Stats as Your Competitive Advantage

In the competitive landscape of cloud storage solutions, reseller partners can gain a strategic advantage by harnessing the power of Backblaze Drive Stats as an effective, valuable, and powerful piece of content. The stats not only enhance transparency and build trust with customers but also empower resellers to effectively address concerns, tailor solutions, and educate clients on data management best practices. By leveraging this valuable resource, resellers can position themselves as leaders in the market and drive the success of Backblaze B2 Cloud Storage.

The post Leveraging Backblaze Drive Stats to Boost Backblaze B2 Cloud Storage Sales: A Guide for Reseller Partners appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

New Open Source Tool for Consistency in Cassandra Migrations

Post Syndicated from Elliott Sims original https://www.backblaze.com/blog/new-open-source-tool-for-consistency-in-cassandra-migrations/

A decorative image showing the Cassandra logo with a function represented by two servers on either side of the logo.

Sometimes you find a problem because something breaks, and other times you find a problem the good way—by thinking it through before things break. This is a story about one of those bright, shining, lightbulb moments when you find a problem the good way.

On the Backblaze Site Reliability Engineering (SRE) team, we were thinking through an upcoming datacenter migration in Cassandra. We were running through all of the various types of queries we would have to do when we had the proverbial “aha” moment. We discovered an inconsistency in the way Cassandra handles lightweight transactions (LWTs).

If you’ve ever tried to do a datacenter migration in Cassandra and something got corrupted in the process but you couldn’t figure out why or how—this might be why. I’m going to walk through a short intro on Cassandra, how we use it, and the issue we ran into. Then, I’ll explain the workaround, which we open sourced. 

Get the Open Source Code

You can download the open source code from our Git repository. We’d love to know how you’re using it and how it’s working for you—let us know in the comments.

How We Use Cassandra

First, if you’re not a Cassandra dev, I should mention that when we say “datacenter migration” it means something slightly different in Cassandra than what it sounds like. It doesn’t mean a data center migration in the physical sense (although you can use datacenter migrations in Cassandra when you’re moving data from one physical data center to another). In the simplest terms, it involves moving data between two Cassandra or Cassandra-compatible database replica sets within a cluster.

And, if you’re not familiar with Cassandra at all, it’s an open-source, NoSQL, distributed database management system. It was created to handle large amounts of data across many commodity servers, so it fits our use case—lots of data, lots of servers. 

At Backblaze, we use Cassandra to index filename to location for data stored in Backblaze B2, for example. Because it’s customer data and not just analytics, we care more about durability and consistency than some other applications of Cassandra. We run with three replicas in a single datacenter and “batch” mode to require writes to be committed to disk before acknowledgement rather than the default “periodic.”

Datacenter migrations are an important aspect of running Cassandra, especially on bare metal. We do a few datacenter migrations per year either for physical data moves, hardware refresh, or to change certain cluster layout parameters like tokens per host that are otherwise static. 

What Are LWTs and Why Do They Matter for Datacenter Migrations in Cassandra?

First of all, LWTs are neither lightweight nor transactions, but that’s neither here nor there. They are an important feature in Cassandra. Here’s why. 

Cassandra is great at scaling. In something like a replicated SQL cluster, you can add additional replicas for read throughput, but not writes. Cassandra scales writes (as well as reads) nearly linearly with the number of hosts—into the hundreds. Adding nodes is a fairly straightforward and “automagic” process as well, with no need to do something like manual token range splits. It also handles individual down nodes with little to no impact on queries. Unfortunately, these properties come with a trade-off: a complex and often nonintuitive consistency model that engineers and operators need to understand well.

In a distributed database like Cassandra, data is replicated across multiple nodes for durability and availability. 

Although databases generally allow multiple reads and writes to be submitted at once, they make it look to the outside world like all the operations are happening in order, one at a time. This property is known as serializability, and Cassandra is not serializable. Although it does have a “last write wins” system, there’s no transaction isolation and timestamps can be identical. 

It’s possible, for example, to have a row that has some columns from one write and other columns from another write. It’s safe if you’re only appending additional rows, but mutating existing rows safely requires careful design. Put another way, you can have two transactions with different data that, to the system, appear to have equal priority. 
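To make the mixed-row case concrete, here is a toy model of per-column “last write wins” reconciliation. It is a simplified sketch rather than Cassandra’s implementation: the mergeRow function, the column names, and the rule that a timestamp tie goes to the larger value are assumptions chosen for illustration.

```javascript
// Each write carries one timestamp and a set of column values.
// Reconcile column by column: the newer timestamp wins, and on a tie
// this toy model keeps the larger value (an assumption for illustration).
function mergeRow(writes) {
  const row = {};
  for (const { timestamp, columns } of writes) {
    for (const [name, value] of Object.entries(columns)) {
      const current = row[name];
      const wins =
        current === undefined ||
        timestamp > current.timestamp ||
        (timestamp === current.timestamp && value > current.value);
      if (wins) row[name] = { value, timestamp };
    }
  }
  // Strip the timestamps and return plain column values.
  return Object.fromEntries(
    Object.entries(row).map(([name, cell]) => [name, cell.value])
  );
}

// Two concurrent writes to the same row with identical timestamps.
const merged = mergeRow([
  { timestamp: 100, columns: { owner: "alice", location: "vault-9" } },
  { timestamp: 100, columns: { owner: "bob", location: "vault-2" } },
]);

console.log(merged); // owner comes from the second write, location from the first
```

Neither writer intended the resulting row, which is why mutating existing rows needs more than timestamps.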

How Do LWTs Solve This Problem?

As a solution for cases where stronger consistency is needed, Cassandra has a feature called “Lightweight Transactions” or LWTs. These are not really identical to traditional database transactions, but provide a sort of “compare and set” operation that also guarantees pending writes are completed before answering a read. This means if you’re trying to change a row’s value from “A” to “B”, a simultaneous attempt to change that row from “A” to “C” will return a failure. This is accomplished by doing a full—not at all lightweight—Paxos round complete with multiple round trips and slow expensive retries in the event of a conflict.
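The “compare and set” behavior can be pictured with a small in-memory model. This sketch only illustrates the semantics described above; the Row class is hypothetical and performs none of the Paxos work that Cassandra does to make the operation safe across replicas.

```javascript
// A toy, single-process stand-in for a row that supports compare-and-set.
// It illustrates the semantics only; it performs no Paxos rounds.
class Row {
  constructor(value) {
    this.value = value;
  }

  // Set the value to `next` only if it currently equals `expected`.
  compareAndSet(expected, next) {
    if (this.value !== expected) {
      return { applied: false, current: this.value };
    }
    this.value = next;
    return { applied: true, current: this.value };
  }
}

const row = new Row("A");
const first = row.compareAndSet("A", "B");  // succeeds
const second = row.compareAndSet("A", "C"); // fails: the value is already "B"

console.log(first.applied, second.applied, row.value); // true false B
```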

In Cassandra, the minimum consistency level for read and write operations is ONE, meaning that only a single replica needs to acknowledge the operation for it to be considered successful. This is fast, but in a situation where you have one down host, it could mean data loss, and later reads may or may not show the newest write depending on which replicas are involved and whether they’ve received the previous write. For better durability and consistency, Cassandra also provides various quorum levels that require a response from multiple replicas, as well as an ALL consistency that requires responses from every replica.
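One way to reason about whether a later read is guaranteed to see an earlier write is to count replicas: if the number of replicas that acknowledged the write plus the number that answer the read is greater than the replication factor, at least one replica is in both groups. The helper names below are hypothetical, and the sketch covers only ONE, QUORUM, and ALL in a single datacenter.

```javascript
// Number of replica responses required at each consistency level for a
// given replication factor (RF), within a single datacenter.
function requiredResponses(level, rf) {
  switch (level) {
    case "ONE":
      return 1;
    case "QUORUM":
      return Math.floor(rf / 2) + 1;
    case "ALL":
      return rf;
    default:
      throw new Error(`Unsupported level: ${level}`);
  }
}

// A read is guaranteed to touch at least one replica that acknowledged
// the write when the two counts add up to more than RF.
function readSeesWrite(writeLevel, readLevel, rf) {
  return requiredResponses(writeLevel, rf) + requiredResponses(readLevel, rf) > rf;
}

const rf = 3;
console.log(readSeesWrite("ONE", "ONE", rf));       // false
console.log(readSeesWrite("QUORUM", "QUORUM", rf)); // true
console.log(readSeesWrite("ONE", "ALL", rf));       // true
```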

Cassandra Is My Type of Database

Curious to know more about consistency limitations and LWTs in Cassandra? Christopher Batey’s presentation at the 2016 Cassandra Summit does a good job of explaining the details.

The Problem We Found With LWTs During Datacenter Migrations

Usually we use one datacenter in Cassandra, but there are circumstances where we stand up a second datacenter in the cluster and migrate to it, then tear down the original. We typically do this to change num_tokens, to move data when we’re refreshing hardware, or to physically move to another nearby data center.

The TL;DR

We reasoned through the interaction between LWTs/serial and datacenter migrations and found a hole—there’s no guarantee of LWT correctness during a topology change (that is, a change to the number of replicas) large enough to change the number of replicas needed to satisfy quorum. It turns out that combining LWTs and datacenter migrations can violate consistency guarantees in subtle ways without some specific steps and tools to work around it.

The Long Version

Let’s say you are standing up a new datacenter, and you need to copy an existing datacenter to it. So, you have two datacenters: datacenter A, the existing datacenter, and datacenter B, the new datacenter. For simplicity’s sake, let’s say datacenter A has three replicas you need to copy, and you’re using quorum writes to ensure consistency.

Refresher: What is Quorum-Based Consistency in Cassandra?

Quorum consistency in Cassandra is based on the concept that a specific number of replicas must participate in a read or write operation to ensure consistency and availability: a majority (n/2 + 1, with n/2 rounded down) of the replicas must respond before the operation is considered successful. This ensures that the data is durably stored and available even if a minority of replicas are unavailable.

You can choose from different types of quorum. Here’s how each one makes a decision in this example:

  • Local quorum: Two out of the three replicas in the datacenter I’m talking to must respond in order to return success. I don’t care about the other datacenter.
  • Global quorum: Four out of the six total replicas must respond in order to return success, and it doesn’t matter which datacenter they come from.
  • Each quorum: Two out of the three replicas in each datacenter must respond in order to return success.
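The numbers in the list above follow directly from the majority rule. Here is a minimal sketch that computes them for the two-datacenter example; the function names and the replicasByDatacenter structure are hypothetical.

```javascript
// Majority of a replica count: half (rounded down) plus one.
const quorum = (replicas) => Math.floor(replicas / 2) + 1;

// Replicas per datacenter in the example: A is the existing one, B is new.
const replicasByDatacenter = { A: 3, B: 3 };

// Local quorum: a majority of the replicas in one datacenter only.
const localQuorum = (dc) => quorum(replicasByDatacenter[dc]);

// Global quorum: a majority of all replicas, from any datacenter.
const globalQuorum = () =>
  quorum(Object.values(replicasByDatacenter).reduce((sum, n) => sum + n, 0));

// Each quorum: a majority in every datacenter.
const eachQuorum = () =>
  Object.fromEntries(
    Object.entries(replicasByDatacenter).map(([dc, n]) => [dc, quorum(n)])
  );

console.log(localQuorum("A")); // 2
console.log(globalQuorum());   // 4
console.log(eachQuorum());     // { A: 2, B: 2 }
```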

Most of these quorum types also have a serial equivalent for LWTs.

Type of Quorum    Serial          Regular
Local             LOCAL_SERIAL    LOCAL_QUORUM
Each              unsupported     EACH_QUORUM
Global            SERIAL          QUORUM

The problem you might run into, however, is that LWTs do not have an each_serial mode. They only have local and global. There’s no way to tell the LWT you want quorum in each datacenter. 

local_serial is good for performance, but transactions on different datacenters could overlap and be inconsistent. serial is more expensive, but normally guarantees correctness as long as all queries agree on cluster size. But what if a query straddles a topology change that changes quorum size? 

Let’s use global quorum to show how this plays out. If a LWT starts when RF=3, at least two hosts must process it. 

While it’s running, the topology changes to two datacenters (A and B) each with RF=3 (so six replicas total) with a quorum of four. There’s a chance that a query affecting the same partition could then run without overlapping nodes, which means consistency guarantees are not maintained for those queries. For that query, quorum is four out of six where those four could be the three replicas in datacenter B and the remaining replica in datacenter A. 

Those two queries are on the same partition, but they’re not overlapping any hosts, so they don’t know about each other. It violates the LWT guarantees.
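Here is that scenario written out with named replicas. The replica names are hypothetical, and the two responder sets are just one possible outcome, chosen to show that both queries can satisfy quorum without sharing a single host.

```javascript
// Replica names are hypothetical: three in datacenter A, three in B.
const beforeChange = ["A1", "A2", "A3"];
const afterChange = ["A1", "A2", "A3", "B1", "B2", "B3"];

const quorum = (replicas) => Math.floor(replicas.length / 2) + 1;

// One possible set of responders for each query.
const firstQueryReplicas = ["A1", "A2"];              // 2 of 3, started at RF=3
const secondQueryReplicas = ["A3", "B1", "B2", "B3"]; // 4 of 6, after the change

// Replicas that took part in both queries.
const shared = firstQueryReplicas.filter((r) => secondQueryReplicas.includes(r));

console.log(firstQueryReplicas.length >= quorum(beforeChange)); // true
console.log(secondQueryReplicas.length >= quorum(afterChange)); // true
console.log(shared); // []
```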

The Solution to LWT Inconsistency

What we needed was a way to make sure that the definition of “quorum” didn’t change too much in the middle of a LWT running. Some change is okay, as long as old and new are guaranteed to overlap.

To account for this, you need to change the replication factor one level at a time and make sure there are no transactions still running that started before the previous topology change before you make the next. Three replicas with a quorum of two can only change to four replicas with a quorum of three. That way, at least one replica must overlap. The same thing happens when you go from four to five replicas or five to six replicas. This also applies when reducing the replication factor, such as when tearing down the old datacenter after everything has moved to the new one.
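
The overlap argument is a pigeonhole check, sketched below in Python. It assumes the smaller replica set is contained in the larger one (replicas are only added or only removed in a given step). The function names are hypothetical.

```python
def quorum(replicas):
    """Smallest majority of a replica count."""
    return replicas // 2 + 1


def overlap_guaranteed(old_rf, new_rf):
    """True if any old quorum must share a host with any new quorum.

    Both quorums are drawn from the larger replica set, so they are forced
    to intersect only when their sizes add up to more than that set.
    """
    return quorum(old_rf) + quorum(new_rf) > max(old_rf, new_rf)


def safe_path(steps):
    """Check every consecutive pair of replication factors in a migration."""
    return all(overlap_guaranteed(a, b) for a, b in zip(steps, steps[1:]))


print(overlap_guaranteed(3, 6))  # False: 2 + 4 is not more than 6
print(safe_path([3, 4, 5, 6]))   # True: each single step overlaps
print(safe_path([6, 5, 4, 3]))   # True: the same holds when shrinking
```

This only checks the arithmetic for each step. It says nothing about whether older transactions have finished, which still has to be verified between steps.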

Then, you just need to make sure no LWT overlaps multiple changes. You could just wait long enough that they’ve timed out, but it’s better to be sure. This requires querying the internal-only system.paxos table on each host in the cluster between topology changes.

We built a tool that checks to see whether there are still transactions running from before we made a topology change. It reads system.paxos on each host, ignoring any rows with proposal_ballot=null, and records them. Then after a short delay, it re-reads system.paxos, ignoring any rows that weren’t present in the previous run, or any with proposal_ballot=null in either read, or any where in_progress_ballot has changed. Any remaining rows are potentially active transactions. 
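
The two-pass comparison described above can be sketched as follows. This is not the tool's actual code: it assumes each read of system.paxos has already been fetched into a list of dictionaries, and it assumes that the pair (row_key, cf_id) identifies a row. Only the proposal_ballot and in_progress_ballot columns named above are used.

```python
def potentially_active(first_read, second_read):
    """Return rows from the second read that may still be in-flight LWTs."""

    def key(row):
        # Assumption: (row_key, cf_id) uniquely identifies a system.paxos row.
        return (row["row_key"], row["cf_id"])

    # First pass: record only rows that have a proposal ballot.
    before = {key(r): r for r in first_read if r["proposal_ballot"] is not None}

    active = []
    for row in second_read:
        previous = before.get(key(row))
        if previous is None:
            continue  # not in the first read, or had a null ballot there
        if row["proposal_ballot"] is None:
            continue  # proposal has been cleared since the first read
        if row["in_progress_ballot"] != previous["in_progress_ballot"]:
            continue  # ballot changed, so this is newer activity
        active.append(row)
    return active


# Hypothetical snapshots taken a short delay apart.
first = [
    {"row_key": b"k1", "cf_id": "t1", "proposal_ballot": "b1", "in_progress_ballot": "p1"},
    {"row_key": b"k2", "cf_id": "t1", "proposal_ballot": None, "in_progress_ballot": "p2"},
    {"row_key": b"k3", "cf_id": "t1", "proposal_ballot": "b3", "in_progress_ballot": "p3"},
]
second = [
    {"row_key": b"k1", "cf_id": "t1", "proposal_ballot": "b1", "in_progress_ballot": "p1"},
    {"row_key": b"k2", "cf_id": "t1", "proposal_ballot": "b2", "in_progress_ballot": "p2"},
    {"row_key": b"k3", "cf_id": "t1", "proposal_ballot": "b3", "in_progress_ballot": "p9"},
    {"row_key": b"k4", "cf_id": "t1", "proposal_ballot": "b4", "in_progress_ballot": "p4"},
]

print([row["row_key"] for row in potentially_active(first, second)])  # [b'k1']
```

Only the row that is unchanged across both reads survives the filters, which is the signature of a transaction that has not made progress.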

This worked well the first few times that we used it, on 3.11.6. To our surprise, when we tried to migrate a cluster running 3.11.10, the tool reported hundreds of thousands of long-running LWTs. After a lot of digging, we found a small (but fortunately well-commented) performance optimization added as part of a correctness fix (CASSANDRA-12126), which means proposal_ballot does not get set to null if the proposal is empty/noop. To work around this, we had to actually parse the proposal field. Fortunately all we need is the is_empty flag in the third field, so there was no need to reimplement the full parsing code. That was a big impact on us from a seemingly small and innocuous change piggybacked onto a correctness fix, but that’s the risk of directly reading internal-only tables. 

We’ve used the tool several times now for migrations with good results, but it’s still relatively basic and limited. It requires running repeatedly until all transactions are complete, and sometimes manual intervention to deal with incomplete transactions. In some cases we’ve been able to force-commit a long-pending LWT by doing a SERIAL read of the partition affected, but in a couple of cases we actually ended up running across LWTs that still didn’t seem to complete. Fortunately in every case so far it was in a temporary table and a little work allowed us to confirm that we no longer needed the partition at all and could just delete it.

Most people who use Cassandra may never run across this problem, and most of those who do will likely never track down what caused the small mystery inconsistency around the time they did a datacenter migration. If you rely on LWTs and are doing a datacenter migration, we definitely recommend going through the extra steps to guarantee consistency until and unless Cassandra implements an EACH_SERIAL consistency level.

Using the Tool

If you want to use the tool for yourself to help maintain consistency through datacenter migrations, you can find it here. Drop a note in the comments to let us know how it’s working for you and if you think of any other ways around this problem—we’re all ears!

If You’ve Made It This Far

You might be interested in signing up for our Developer Newsletter where our resident Chief Technical Evangelist, Pat Patterson, shares the latest and greatest ways you can use B2 Cloud Storage in your applications.

The post New Open Source Tool for Consistency in Cassandra Migrations appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The Drive Stats of Backblaze Storage Pods

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/the-drive-stats-of-backblaze-storage-pods/

A decorative image showing the Backblaze logo on a cloud over a pattern representing a network.

Since 2009, Backblaze has written extensively about the data storage servers we created and deployed which we call Backblaze Storage Pods. We not only wrote about our Storage Pods, we open sourced the design, published a parts list, and even provided instructions on how to build one. Many people did. Of the six storage pod versions we produced, four of them are still in operation in our data centers today. Over the last few years, we began using storage servers from Dell and, more recently, Supermicro, as they have proven to be economically and operationally viable in our environment. 

Since 2013, we have also written extensively about our Drive Stats, sharing reports on the failure rates of the HDDs and SSDs in our legion of storage servers. We have examined the drive failure rates by manufacturer, size, age, and so on, but we have never analyzed the drive failure rates of the storage servers—until now. Let’s take a look at the Drive Stats for our fleet of storage servers and see what we can learn.

Storage Pods, Storage Servers, and Backblaze Vaults

Let’s start with a few definitions:

  • Storage Server: A storage server is our generic name for a server from any manufacturer which we use to store customer data. We use storage servers from Backblaze, Dell, and Supermicro.
  • Storage Pod: A Storage Pod is the name we gave to the storage servers Backblaze designed and had built for our data centers. The first Backblaze Storage Pod version was announced in September 2009. Subsequent versions are 2.0, 3.0, 4.0, 4.5, 5.0, 6.0, and 6.1. All but 6.1 were announced publicly. 
  • Backblaze Vault: A Backblaze Vault is 20 storage servers grouped together for the purpose of data storage. Uploaded data arrives at a given storage server within a Backblaze Vault and is encoded into 20 parts with a given part being either a data blob or parity. Each of the 20 parts (shards) is then stored on one of the 20 storage servers. 

As you review the charts and tables, here are a few things to know about Backblaze Vaults.

  • There are currently six cohorts of storage servers in operation today: Supermicro, Dell, Backblaze 3.0, Backblaze 5.0, Backblaze 6.0, and Backblaze 6.1.
  • A given Vault will always be made up from one of the six cohorts of storage servers noted above. For example, Vault 1016 is made up of 20 Backblaze 5.0 Storage Pods and Vault 1176 is made of the 20 Supermicro servers. 
  • A given Vault is made up of storage servers that contain the same number of drives as follows:
    • Dell servers: 26 drives.
    • Backblaze 3.0 and Backblaze 5.0 servers: 45 drives.
    • Backblaze 6.0, Backblaze 6.1, and Supermicro servers: 60 drives.
  • All of the hard drives in a Backblaze Vault will be logically the same size; so, 16TB drives for example.
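
As a quick worked example, the drive count of a Vault follows directly from the numbers above: 20 storage servers multiplied by the drives per server. The Python sketch below uses only those figures. The raw capacity helper is a simplification that ignores parity overhead and formatting.

```python
SERVERS_PER_VAULT = 20

# Drives per storage server for each cohort, as listed above.
DRIVES_PER_SERVER = {
    "Dell": 26,
    "Backblaze 3.0": 45,
    "Backblaze 5.0": 45,
    "Backblaze 6.0": 60,
    "Backblaze 6.1": 60,
    "Supermicro": 60,
}


def drives_per_vault(cohort):
    """Total number of drives in one Vault of the given cohort."""
    return SERVERS_PER_VAULT * DRIVES_PER_SERVER[cohort]


def raw_capacity_tb(cohort, drive_tb):
    """Raw (pre-parity) capacity of one Vault, in TB."""
    return drives_per_vault(cohort) * drive_tb


print(drives_per_vault("Dell"))           # 520
print(drives_per_vault("Backblaze 5.0"))  # 900
print(drives_per_vault("Supermicro"))     # 1200
print(raw_capacity_tb("Supermicro", 16))  # 19200
```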

Drive Stats by Backblaze Vault Cohort

With the background out of the way, let’s get started. As of the end of Q3 2023, there were a total of 241 Backblaze Vaults divided into the six cohorts, as shown in the chart below. The chart includes the server cohort, the number of Vaults in the cohort, and the percentage that cohort is of the total number of Vaults.

A pie chart showing the types of Backblaze Vaults by percentage.

Vaults consisting of Backblaze servers still comprise 68% of the vaults in use today (shaded from orange to red), although that number is dropping as older Vaults are being replaced with newer server models, typically the Supermicro systems.

The table below shows the Drive Stats for the different Vault cohorts identified above for Q3 2023.

A chart showing the Drive Stats for Backblaze Vaults.

The Avg Age (months) column is the average age of the drives, not the average age of the Vaults. The two may seem to be related, but that’s not entirely the case. It is true the Backblaze 3.0 Vaults were deployed first followed in order by the 5.0 and 6.0 Vaults, but that’s where things get messy. There was some overlap between the Dell and Backblaze 6.1 deployments as the Dell systems were deployed in our central Europe data center, while the 6.1 Vaults continued to be deployed in the U.S. In addition, some migrations from the Backblaze 3.0 Vaults were initially done to 6.1 Vaults while we were also deploying new drives in the Supermicro Vaults. 

The AFR for each of the server versions does not seem to follow any pattern or correlation to the average age of the drives. This was unexpected because, in general, as drives pass about four years in age, they start to fail more often. This should mean that Vaults with older drives, especially those with drives whose average age is over four years (48 months), should have a higher failure rate. But, as we can see, the Backblaze 5.0 Vaults defy that expectation. 
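
For readers new to Drive Stats, the annualized failure rate is computed from drive days and failures: AFR = failures / (drive days / 365) × 100. The Python sketch below applies that formula. The input numbers are made up for illustration and are not taken from the tables in this post.

```python
def annualized_failure_rate(failures, drive_days):
    """AFR as a percentage: failures per drive-year of operation, times 100."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


# Hypothetical cohort: 10 failures over 365,000 drive days (1,000 drive-years).
print(annualized_failure_rate(10, 365_000))  # 1.0
```

Because the denominator is drive days rather than drive count, cohorts with very different sizes and deployment dates can be compared on the same scale.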

To see if we can determine what’s going on, let’s expand on the previous table and dig into the different drive sizes that are in each Vault cohort, as shown in the table below.

A table showing Drive Stats by server version and drive size.

Observations for Each Vault Cohort

  • Backblaze 3.0: Obviously these Vaults have the oldest drives and, given their AFR is nearly twice the average for all of the drives (1.53%), it would make sense to migrate off of these servers. Of course the 6TB drives seem to be the exception, but at some point they will most likely “hit the wall” and start failing.
  • Backblaze 5.0: There are two Backblaze 5.0 drive sizes (4TB and 8TB) and the AFR for each is well below the average AFR for all of the drives (1.53%). The average age of the two drive sizes is nearly seven years or more. When compared to the Backblaze 6.0 Vaults, it would seem that migrating the 5.0 Vaults could wait, but there is an operational consideration here. The Backblaze 5.0 Vaults each contain 45 drives, and from the perspective of data density per system, they should be migrated to 60 drive servers sooner rather than later to optimize data center rack space.
  • Backblaze 6.0: These Vaults as a group don’t seem to make any of the five different drive sizes happy. Only the AFR of the 4TB drives (1.42%) is just barely below the average AFR for all of the drives. The rest of the drive groups are well above the average.
  • Backblaze 6.1: The 6.1 servers are similar to the 6.0 servers, but with an upgraded CPU and faster NICs. Is that why their annualized failure rates are much lower than the 6.0 systems? Maybe, but the drives in the 6.1 systems are also much younger, about half the age of those in the 6.0 systems, so we don’t have the full picture yet.
  • Dell: The 14TB drives in the Dell Vaults seem to be a problem at a 5.46% AFR. Much of that is driven by two particular Dell vaults which have a high AFR, over 8% for Q3. This appears to be related to their location in the data center. All 40 of the Dell servers which make up these two Vaults were relocated to the top of 52U racks, and it appears that initially they did not like their new location. Recent data indicates they are doing much better, and we’ll publish that data soon. We’ll need to see what happens over the next few quarters. That said, if you remove these two Vaults from the Dell tally, the AFR is a respectable 0.99% for the remaining Vaults.
  • Supermicro: This server cohort is mostly 16TB drives which are doing very well with an AFR of 0.62%. The one 14TB Vault is worth our attention with an AFR of 1.95%, and the 22TB Vault is too new to do any analysis.

Drive Stats by Drive Size and Vault Cohort

Another way to look at the data is to take the previous table and re-sort it by drive size. Before we do that let’s establish the AFR for the different drive sizes aggregated over all Vaults.

A bar chart showing annualized failure rates for Backblaze Vaults by drive size.

As we can see in Q3 the 6TB and 22TB Vaults had zero failures (AFR = 0%). Also, the 10TB Vault is indeed only one Vault, so there are no other 10TB Vaults to compare it to. Given this, for readability, we will remove the 6TB, 10TB, and 22TB Vaults from the next table which compares how each drive size has fared in each of the six different Vault cohorts.

A table showing the annualized failure rates of servers by drive size and server version, not displaying the 6TB, 10TB, and 22TB Vaults.

Currently we are migrating the 4TB drive Vaults to larger Vaults, replacing them with drives of 16TB and above. The migrations are done using an in-house system which we’ll expand upon in a future post. The specific order of migrations is based on failure rates and durability of the existing 4TB Vaults with an eye towards removing the Backblaze 3.0 systems first as they are nearly 10 years old in some cases, and many of the non-drive replacement parts are no longer available. Whether we give away, destroy, or recycle the retired Backblaze 3.0 Storage Pods (sans drives) is still being debated.

For the 8TB drive Vaults, the Backblaze 5.0 Vaults are up first for migration when the time comes. Yes, their AFR is lower than that of the Backblaze 6.0 Vaults, but remember: the 5.0 Vaults are 45 drive units, which are not as efficient storage density-wise as the 60 drive systems. 

Speaking of systems with fewer than 60 drives, the Dell servers have 26 drives. Those 26 drives are in a 2U chassis versus a 4U chassis for all of the other servers. The Dell servers are not quite as dense as the 60 drive units, but their 2U form factor gives us some flexibility in filling racks, especially when you add utility servers (1U or 2U) and networking gear to the mix. That’s one of the reasons the two Dell Vaults we noted earlier were moved to the top of the 52U racks. FYI, those two Vaults hold 14TB drives and are two of the four 14TB Dell Vaults making up the 5.46% AFR. The AFR for the Dell Vaults with 12TB and 16TB drives is 0.76% and 0.92% respectively. As noted earlier, we expect the AFR for 14TB Dell Vaults to drop over the coming months.

What Have We Learned?

Our goal today was to see what we can learn about the drive failure rates of the storage servers we use in our data centers. All of our storage servers are grouped in operational systems we call Backblaze Vaults. There are six different cohorts of storage servers with each vault being composed of the same type of storage server, hence there are six types of vaults. 

As we dug into the data, we found that the different cohorts of Vaults had different annualized failure rates. What we didn’t find was a correlation between the age of the drives used in the servers and the annualized failure rates of the different Vault cohorts. For example, the Backblaze 5.0 Vaults have a much lower AFR of 0.99% versus the Backblaze 6.0 Vault AFR at 2.14%, even though the drives in the 5.0 Vaults are nearly twice as old on average as the drives in the 6.0 Vaults.

This suggests that while our initial foray into the annualized failure rates of the different Vault cohorts is a good first step, there is more to do here.

Where Do We Go From Here?

In general, all of the Vaults in a given cohort were manufactured to the same specifications, used the same parts, and were assembled using the same processes. One obvious difference is that different drive models are used in each Vault cohort. For example, the 16TB vaults are composed of seven different drive models. Do some drive models work better in one Vault cohort versus another? Over the next couple of quarters we’ll dig into the data and let you know what we find. Hopefully it will add to our understanding of the annualized failure rates of the different Vault cohorts. Stay tuned.

The post The Drive Stats of Backblaze Storage Pods appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

NAS Life Expectancy: How to Extend NAS Longevity

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/nas-life-expectancy-how-to-extend-nas-longevity/

A decorative image showing a NAS device and a cloud in the background.

There’s only one truth you need to know about tech: at some point, it will fail. Hard drives die. You get the blue screen of death the day of the big presentation. You lose cell service right when your mom calls. (Or maybe you pretend to lose service right when your mom calls. We won’t judge.) Whatever it is, we’ve all been there. 

If you use network attached storage (NAS) for your business or at home, you’re probably well aware of this fact. The redundancy you get from a solid RAID configuration might be one of the reasons you invested in a NAS—to prepare for inevitable drive failures. NAS devices are a great investment for a business for their durability and reliability, but there are things you can do to extend the lifespan of your NAS and get even more out of your investment.

Extending the lifespan of a NAS system isn’t just about preventing hardware failures; it’s crucial for ensuring uninterrupted access to critical data while maximizing ROI on your IT investments. In this blog, you’ll learn about the key factors that influence NAS longevity and get real-world strategies to strengthen and optimize your NAS infrastructure. 

Understanding NAS Lifespan

Today’s NAS devices offer faster processing, enhanced performance, and significantly larger storage capabilities than ever before. These technological advancements have paved the way for efficient local data management using NAS in both professional and personal settings. 

A decorative image showing a NAS device, an office building, and the Backblaze logo on a cloud.

Despite these advancements, it’s important to acknowledge that NAS devices tend to have a finite lifespan. This limitation is due to several factors, including the physical wear and tear of hardware components, ever-evolving software requirements, and the constant advancement of technology which can render older systems less efficient or incompatible with new standards. 

Also, it’s crucial to differentiate between the life expectancy of a NAS device and that of the hard drives within the device. While the NAS itself may continue to function, the hard drives that are subjected to intense read/write operations often have a shorter lifespan. If you want to learn more about hard drive reliability and failure rates, refer to Backblaze’s Hard Drive Stats.

Key Factors Affecting NAS Longevity

From the quality of the hardware to the environment it operates in, multiple elements contribute to the lifespan of NAS. Let’s explore these key factors in detail:

1. Hardware Components: Quality and Durability

One of the key factors affecting the lifespan of a NAS device is the quality of the hardware itself. Components such as the processor, memory, and power supply also play crucial roles. High-quality hardware and components from reputable manufacturers tend to last longer and offer better performance and reliability, which contributes to the overall lifespan of the NAS.

2. Workload and Usage Intensity

The workload handled by the NAS is a significant determinant of its longevity. Devices that are constantly under heavy load, managing large data transfers, or running intensive applications will likely experience wear and tear more rapidly than those used for lighter tasks. 

3. Environmental Factors: Temperature, Humidity, and Corrosion

Operating a NAS in environments with high temperatures or humidity levels can lead to overheating and moisture-related damage. Additionally, locations with high levels of dust or corrosive elements can lead to physical deterioration of components.

4. Quality of Network Environment

The quality and stability of the network environment in which the NAS operates can also affect its lifespan. Frequent network issues or unstable connections can strain the NAS’s hardware and software, potentially leading to earlier failures. 

5. Support, Technological Advancements, and Compatibility

Ongoing support and compatibility with new technologies are also vital for the longevity of NAS systems. As technology evolves, older NAS devices may struggle with compatibility issues, rendering them less efficient or even obsolete.

Maintenance and Care for Enhanced Lifespan

Now that we understand the factors that affect NAS longevity, let’s explore how proactive maintenance and care are crucial to extending its lifespan. There are a number of things you can do to keep your NAS functioning properly for the long haul:

  • Regular Maintenance: Routine cleaning is vital for maintaining NAS efficiency. Dust accumulation can lead to overheating. Regularly cleaning the external vents and fans, and ensuring the device is in a well-ventilated area can prevent thermal issues and prolong the device’s life. 
  • Proactive Drive Replacements: Hard drives are among the most failure-prone components in a NAS. Implementing a regular schedule to check drive health and replacing unhealthy or borderline drives proactively can prevent data loss and reduce the workload on the NAS’s remaining drives, thus preserving its overall integrity. 
  • Updating Software and Patches: Keeping the NAS software and firmware up to date is essential for security and performance. Regular updates often include patches for vulnerabilities, performance enhancements, and new features that can improve the efficiency and longevity of the NAS. 
  • Monitoring NAS Health: Utilizing the tools and built-in functionalities to monitor the health and performance of your NAS also helps extend its lifespan. Many NAS systems come with software that can alert you to issues such as failing drives, high temperatures, or network problems. Keeping an eye on these metrics can help you address potential problems before they escalate. 
  • Environmental Considerations: The operating environment of a NAS plays a significant role in NAS longevity. Keeping your NAS in a stable environment with controlled temperature and humidity levels should be considered to extend its lifespan. 
  • Power Protection: Protect your NAS from power surges and outages using an uninterruptible power supply (UPS). This not only prevents data loss but also helps avoid potential hardware damage caused by electrical issues. 
A product image of a CyberPower 900 AVR uninterruptible power supply tower.
The CyberPower 900 AVR is just one example of a UPS. Source.
  • Recognizing When to Replace the NAS: Look out for indicators that suggest it’s time to replace your NAS. These include the expiration of the manufacturer’s warranty, noticeable performance declines, increased frequency of repairs, or when the device no longer meets your evolving storage needs. Waiting until a complete failure can be more costly and disruptive. 

By adhering to these maintenance and care guidelines, you can significantly enhance the lifespan and reliability of your NAS, ensuring that it continues to serve as a robust and efficient data storage solution for your business or personal needs. 

Implementing Fault Tolerance and Off-Loading Data to Cloud

In addition to proactive maintenance and care, there are a few other strategies you can use to extend NAS lifespan such as implementing fault tolerance and adding cloud storage to your backup strategy to offload data from your NAS. Let’s explore them below. 

The Importance of Fault Tolerance and RAID Configurations in NAS

Fault tolerance refers to the ability of a NAS to continue operating correctly even if one or more of its hard drives fail. It’s critical for NAS systems to have fault tolerance in place, especially in business environments where data availability and integrity cannot be compromised. 

RAID (Redundant Array of Independent Disks) plays a pivotal role in achieving fault tolerance. It involves combining multiple hard drives into a single system to improve data redundancy and performance. By doing so, RAID protects against data loss due to single or multiple disk failures, depending on the RAID level implemented. 

An image showing a RAID 5 configuration.
RAID 5: Striping and parity distributed across disks.
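
To show why parity lets an array survive a lost disk, here is a minimal Python sketch of XOR parity, the idea behind RAID 5. It is a toy model with three equal-length data blocks and one parity block. Real RAID 5 rotates parity across disks and works on fixed-size stripes, which this sketch does not attempt to model.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    result = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# Three data blocks, each imagined to live on a different disk.
d0, d1, d2 = b"NAS!", b"data", b"safe"

# The parity block is the XOR of all data blocks.
parity = xor_blocks(d0, d1, d2)

# If the disk holding d1 fails, XOR the survivors with parity to rebuild it.
rebuilt_d1 = xor_blocks(d0, d2, parity)
print(rebuilt_d1 == d1)  # True
```

The same trick works for any single missing block, which is why RAID 5 tolerates exactly one failed drive.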

RAID Configurations

Various RAID configurations offer different balances of data protection, storage efficiency, and performance. Common configurations include RAID 0 (striping), RAID 1 (mirroring), RAID 5 and RAID 6, each with its specific advantages and use cases. For example, RAID 1 is simple and offers data redundancy, while RAID 5 and 6 provide a good balance between storage capacity and data protection. To fully understand which RAID configuration suits your needs, explore our NAS RAID Levels blog which explains how to choose the right RAID level for your NAS data. 
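
The trade-off between capacity and protection can be put into numbers. The Python sketch below estimates usable capacity and the number of drive failures each level tolerates, assuming all drives are the same size and, for RAID 1, that every drive holds a full mirror. The function names are hypothetical, and real arrays lose a little more space to filesystem overhead.

```python
MIN_DRIVES = {0: 2, 1: 2, 5: 3, 6: 4}


def usable_capacity_tb(level, drives, drive_tb):
    """Usable capacity in TB for RAID 0, 1, 5, or 6 with identical drives."""
    if level not in MIN_DRIVES:
        raise ValueError("unsupported RAID level")
    if drives < MIN_DRIVES[level]:
        raise ValueError("not enough drives for this RAID level")
    if level == 0:
        return drives * drive_tb        # striping, no redundancy
    if level == 1:
        return drive_tb                 # every drive is a mirror
    if level == 5:
        return (drives - 1) * drive_tb  # one drive's worth of parity
    return (drives - 2) * drive_tb      # RAID 6: two drives' worth of parity


def failures_tolerated(level, drives):
    """How many drives can fail without data loss."""
    return {0: 0, 1: drives - 1, 5: 1, 6: 2}[level]


# Hypothetical four-bay NAS with 16TB drives.
for level in (0, 5, 6):
    print(level, usable_capacity_tb(level, 4, 16), failures_tolerated(level, 4))
```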

Off-Loading and Backing Up NAS Data to the Cloud

Off-loading data from your NAS to a cloud storage service can significantly reduce the workload on your NAS hardware, thus potentially extending its life. You can consider creating a cloud archive of old data and a system for off-loading data on a regular cadence or as projects close.  

The cloud also helps you establish a robust 3-2-1 backup strategy (three total copies of your data, two of which are local but on different mediums, and one copy off-site). This ensures data redundancy and offers enhanced data protection against local disasters, theft, or hardware failure. Many NAS devices, like Synology and QNAP, have built-in backup utilities that back up directly to cloud storage like Backblaze B2.
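
The 3-2-1 rule is simple enough to check mechanically. Here is a small Python sketch that tests a list of copies against the definition above. The BackupCopy class and the example inventory are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str    # for example "nas", "external-hdd", "cloud"
    offsite: bool


def satisfies_3_2_1(copies):
    """Three copies, two local on different mediums, and one off-site."""
    local_mediums = {c.medium for c in copies if not c.offsite}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(local_mediums) >= 2 and has_offsite


# Hypothetical inventory: a workstation, a NAS, and a cloud bucket.
inventory = [
    BackupCopy("workstation", "internal-ssd", offsite=False),
    BackupCopy("office NAS", "nas", offsite=False),
    BackupCopy("cloud bucket", "cloud", offsite=True),
]

print(satisfies_3_2_1(inventory))      # True
print(satisfies_3_2_1(inventory[:2]))  # False: no off-site copy
```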

By integrating fault tolerance through RAID configurations and backing up data to the cloud, you can significantly enhance the data protection capabilities of your NAS. This approach not only ensures data safety but also contributes to the overall health and longevity of your NAS system.

Extending Your NAS Lifespan

Navigating the complexities of NAS systems can be a journey full of learning and adaptation. While NAS offers unparalleled convenience of local storage, it’s essential to recognize that its longevity relies on more than just its initial setup. It requires a proactive approach with diligent maintenance while embracing technological advancements such as fault tolerance and RAID levels. When properly cared for, however, many users find NAS to be a long-lived piece of their tech stack.

But, it doesn’t have to stop there. The integration of cloud storage with NAS systems can significantly help reduce the strain on your local system, and safeguard your NAS data off-site while ensuring you have comprehensive 3-2-1 data protection in place. 

It’s time to hear from you. What strategies have you employed to extend the lifespan of your NAS device? Share your stories and tips in the comments below to help others in the NAS community.

The post NAS Life Expectancy: How to Extend NAS Longevity appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Network Stats

Post Syndicated from Brent Nowak original https://www.backblaze.com/blog/backblaze-network-stats/

A decorative image displaying the headline Welcome to Network Stats.

At the end of Q3 2023, Backblaze was monitoring 263,992 hard disk drives (HDDs) and solid state drives (SSDs) in our data centers around the world. We’ve reported on those drives for many years. But, all the data on those drives needs to somehow get from your devices to our storage servers. You might be wondering, “How does that happen?” and “Where does that happen?” Or, more likely, “How fast does that happen?”

These are all questions we want to start to answer.

Welcome to our new Backblaze Network Stats series, where we’ll explore the world of network connectivity and how we better serve our customers, partners, and the internet community at large. We hope to share our challenges, initiatives, and engineering perspective as we build the open cloud with our Partners.

In this first post, we will explore two issues: how we connect our network with the internet and the distribution of our peering traffic. As we expand this series, we hope to capture and share more metrics and insights.

Nice to Meet You; I’m Brent

Since this is the first time you’re hearing from me, I thought I should introduce myself. I’m a Senior Network Engineer here at Backblaze. The Network Engineering group is responsible for ensuring the reliability, capacity, and security of network traffic. 

My interest in computer networking began in my childhood, when I first persuaded my father to upgrade our analog modem to an ISDN line by presenting a financial comparison of the time lost to the large downloads I was running (nothing like using all the family dial-up time to download multi-megabyte SimCity 2000 and Doom customizations). Needless to say, I’m still interested in those same types of networking metrics, and that’s why I’m here sharing them with you at Backblaze.

First, Some Networking Basics

If you’ve ever heard folks joke about the internet being a series of tubes, well, it may be widely mocked, but it’s not entirely wrong. The internet as we know it is fundamentally a complex network of all the computers on the planet. Whenever we’re typing in an internet address into a web browser, we’re basically giving our computer the address of a computer (or server, etc.) to locate, and that computer will hopefully display data to you that it’s storing. 

Of course, it’s not a free-for-all. Internet protocols like TLS/SSL are the boundaries that set the rules for how computers communicate, and networks allow different levels of access to outsiders. Internet service providers (ISPs) are defined and regulated, and we’ll outline some of those roles and how Backblaze interacts with them below. But, all that communication between computers has to be powered by hardware, which is why, at one point, we actually had to solve the problem of sharks attacking the internet. Good news: since 2006, sharks have accounted for less than one percent of fiber optic cable attacks. 

Wireless internet has largely made this connectivity invisible to consumers, but the job of Wi-Fi is to broadcast a short-range network that connects you to this series of cables and “tubes.” That’s why when you’re transmitting or storing larger amounts of data, you typically get better speeds when you use a wired connection. (A good consumer example: setting up NAS devices works better with an ethernet cable.)

When you’re talking about storing and serving petabytes of data for a myriad of use cases, then you have to create and use different networks to connect to the internet effectively. Think of it like water: both a fire hose and your faucet are connected to utility lines, but they have to move different amounts of water, so they have different kinds of connections to the main utility.   

And, that brings us to peering, the different levels of internet service providers, and many, many more things that Backblaze Network Engineers deal with from both a hardware and a software perspective on a regular basis. 

What Is Peering?

Peering on the internet is akin to building direct express lanes between neighborhoods. Instead of all data (residents) relying on crowded highways (public internet), networks (neighborhoods) establish peering connections—dedicated pathways connecting them directly. This allows data to travel faster and more efficiently, reducing congestion and delays. Peering is like having exclusive lanes, streamlining communication between networks and enhancing the overall performance of the internet “transportation” system. 

We connect to various types of networks to help move your data. I’ll explain the different types below.

The Bit Exchange

Every day we move multiple petabytes of traffic between our internet connectivity points and our scalable data center fabric layer to be delivered to either our SSD caching layer (what we call a “shard stash”) or spinning hard drives for storage.

Our data centers are connected to the world in three different ways.

1. Direct Internet Access (DIA)

The most common way we reach everyone is via a DIA connection with a Tier 1 internet service provider. These connections give us access to long-haul, high-capacity fiber infrastructure that spans continents and oceans. Connecting to a Tier 1 ISP has the advantage of scale and reach, but this scale comes at a cost—we may not have the best path to our customers. 

If we draw out the hierarchy of networks that we have to traverse to reach you, it would look like a series of geographic levels (Global, Regional, and Local). The Tier 1 ISPs would be positioned at the top, leasing bandwidth on their networks to smaller Tier 2 and Tier 3 networks, which are closer to our customers’ home and office networks.

A chart showing an example of network and ISP reroutes between Backblaze and a customer.
How we get from B to C (Backblaze to customer).

Since our connections to the Tier 1 ISPs are based on leased bandwidth, we pay based on how much data we transfer. The bill grows the more we transfer. There are commitments and overage charges, and the relationship is more formal since a Tier 1 ISP is a for-profit company. Sometimes you just want unlimited bandwidth, and that’s where the role of the internet exchange (IX) helps us.

2. Internet Exchange (IX)

We always want to be as close to the client as possible and our next connectivity option allows us to join a community of peers that exchange traffic more locally. Peering with an IX means that network traffic doesn’t have to bubble up to a Tier 1 National ISP to eventually reach a regional network. If we are on an advantageous IX, we transfer data locally inside a data center or within the same data center campus, thus reducing latency and improving the overall experience.

Benefits of an IX, aka the “Unlimited Plan,” include:

  • Paying a flat rate per month to get a fiber connection to the IX equipment versus paying based on how much data we transfer.
  • No price negotiation based on bandwidth transfer rates.
  • No overage charges.
  • Connectivity to lower tiered networks that are closer to consumer and business networks.
  • Participation helps build a more egalitarian internet.

In short, we pay a small fee to help the IX remain financially stable, and then we can exchange as much or as little traffic as we want.
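To make the “Unlimited Plan” comparison concrete, here’s a minimal sketch of the two billing models. Every price and volume below is a made-up placeholder, not a real Backblaze, ISP, or IX rate, and the function names are hypothetical too.

```python
def metered_cost(tb_transferred, commit_tb, commit_fee, overage_per_tb):
    """Leased-bandwidth style bill: a committed fee plus overage charges."""
    overage_tb = max(0, tb_transferred - commit_tb)
    return commit_fee + overage_tb * overage_per_tb


def flat_cost(tb_transferred, port_fee):
    """IX style bill: one flat port fee, regardless of traffic volume."""
    return port_fee


# Hypothetical numbers, chosen only to show the shape of each bill.
for tb in (100, 1_000, 5_000):
    metered = metered_cost(tb, commit_tb=500, commit_fee=2_000, overage_per_tb=3)
    flat = flat_cost(tb, port_fee=1_500)
    print(f"{tb:>5} TB  metered=${metered:,}  flat=${flat:,}")
```

The metered bill keeps growing once traffic passes the commitment, while the flat bill stays put no matter how much traffic moves.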

Our network connectivity standard is to connect to multiple Tier 1 ISPs and a localized IX at every location to give us the best of both solutions. Every time we have to traverse a network, we’re adding latency and increasing the total amount of time for a file to upload or download. Internet routing prefers the shortest path, so if we have a shorter (faster) way to reach you, we will talk over the IX versus the Tier 1 network.

A decorative image showing two possible paths to serve data from Backblaze to the customer.
Less is more—the fewer networks between us and you, the better.
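The “shortest path wins” idea can be sketched in a few lines. This is a deliberately simplified model: real BGP route selection weighs several other attributes before it ever compares path length, and the network names here are placeholders.

```python
def pick_route(candidate_paths):
    """Return the candidate that traverses the fewest networks."""
    return min(candidate_paths, key=len)


# Each path lists the networks traffic would cross, in order (placeholder names).
paths_to_customer = [
    ["Backblaze", "Tier 1 ISP", "Tier 2 ISP", "Customer ISP"],
    ["Backblaze", "IX", "Customer ISP"],
]

best = pick_route(paths_to_customer)
print(" -> ".join(best))
```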

3. Private Network Interconnect (PNI)

The most direct and lowest latency way for us to exchange traffic is with a PNI. This option is used for direct fiber connections within the same data center or metro region to some of our specific partners like Fastly and Cloudflare. Our edge routing equipment—that is, the appliances that allow us to connect our internal network to external networks—is connected directly to our partner’s edge routing equipment. To go back to our neighborhood analogy, this would be if you and your friend put a gate in the fences that connect your backyards. With a PNI, the logical routing distance between us and our partners is the best it can be. 

IX Participation

Personally, I find the internet exchange path the most exciting as a network engineer. It harkens back to the days of the early internet (IX points began as Network Access Points and were a key component of Al Gore’s National Information Infrastructure (NII) plan way back in 1991), and the growth of an IX feels communal, as people are joining to help the greater whole. When we add our traffic to an IX as a new peer, it increases participation, further strengthening the advantage of contributing to the local IX and encouraging more organizations to join.

Backblaze Joins the Equinix Silicon Valley (SV1) Internet Exchange

Our San Jose data center is a major point of presence (PoP) (that is, a point where a network connects to the internet) for Backblaze, with the site connecting us in the Silicon Valley region to many major peering networks.

In November, we brought up connectivity to the Equinix IX peering exchange in San Jose, bringing us closer to 278 peering networks at the time of publishing. Many of the networks that participate on this IX are logically very close to our customers. The participants are some of the well known ISPs that serve homes, offices, and businesses in the region, including Comcast, Google Fiber, Sprint, and Verizon.

Now, for the Stats

As soon as we turned up the connection, 26% of our inbound traffic, previously sent over our DIA connections, shifted to the local Equinix IX, as shown in the pie charts below.

Two side by side pie charts comparing traffic on the different types of network connections.
Before: 98% direct internet access (DIA); 2% private network interconnect (PNI). After: 72% DIA; 2% PNI; 26% internet exchange (IX).

The graph below shows our peering traffic load over the edge router and how immediately the traffic pattern changed as soon as we brought up the peer. Green indicates inbound traffic, while yellow shows outbound traffic. It’s always exciting to see a project go live with such an immediate reaction!

A graph showing networking uploads and downloads increasing as Backblaze brought networks up to peer.

To give you an idea of what we mean by better network proximity, let’s take a look at our improved connectivity to Google Fiber. Here’s a diagram of the three pathways that our edge routers see that show how to get to Google Fiber. With the new IX connection, we see a more advantageous path and pick that as our method to exchange traffic. We no longer have to send traffic to the Tier 1 providers and can use them as backup paths.

A graph showing possible network paths now that peering is enabled.
Taking faster local roads.

What Does This Mean for You?

We here at Backblaze are always trying to improve the performance and reliability of our storage platform while scaling up. We monitor our systems for inefficiencies, and improving the network components is one way that we can deliver a better experience. 

By joining the Equinix SV1 peering exchange, we cut the number of network hops that we have to transit to communicate with you. And that reduces latency, speeding up your backup job upload, allowing for faster image download, or supporting Partners

Cheers from the NetEng team! We’re excited to start this series and bring you more content as our solutions evolve and grow. Some of the coverage we hope to share in the future includes analyzing our proximity to our peers and Partners, how we can improve those connections further, and stats to show the number of bits per second that we process in our data centers to ensure that we not only have a file, but all the redundancy shard components related to it. So, stay tuned!

The post Backblaze Network Stats appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Run AI/ML Workloads on CoreWeave + Backblaze

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-run-ai-ml-workloads-on-coreweave-backblaze/

A decorative image showing the Backblaze and CoreWeave logos superimposed on clouds.

Backblaze compute partner CoreWeave is a specialized GPU cloud provider designed to power use cases such as AI/ML, graphics, and rendering up to 35x faster and for 80% less than generalized public clouds. Brandon Jacobs, an infrastructure architect at CoreWeave, joined us earlier this year for Backblaze Tech Day ’23. Brandon and I co-presented a session explaining both how to back up CoreWeave Cloud storage volumes to Backblaze B2 Cloud Storage and how to load a model from Backblaze B2 into the CoreWeave Cloud inference stack.

Since we recently published an article covering the backup process, in this blog post I’ll focus on loading a large language model (LLM) directly from Backblaze B2 into CoreWeave Cloud.

Below is the session recording from Tech Day; feel free to watch it instead of, or in addition to, reading this article.

More About CoreWeave

In the Tech Day session, Brandon covered the two sides of CoreWeave Cloud: 

  1. Model training and fine tuning. 
  2. The inference service. 

To maximize performance, CoreWeave provides a fully-managed Kubernetes environment running on bare metal, with no hypervisors between your containers and the hardware.

CoreWeave provides a range of storage options: storage volumes that can be directly mounted into Kubernetes pods as block storage or a shared file system, running on solid state drives (SSDs) or hard disk drives (HDDs), as well as their own native S3 compatible object storage. Knowing that, you’re probably wondering, “Why bother with Backblaze B2, when CoreWeave has their own object storage?”

The answer echoes the first few words of this blog post—CoreWeave’s object storage is a specialized implementation, co-located with their GPU compute infrastructure, with high-bandwidth networking and caching. Backblaze B2, in contrast, is general purpose cloud object storage, and includes features such as Object Lock and lifecycle rules, that are not as relevant to CoreWeave’s object storage. There is also a price differential. Currently, at $6/TB/month, Backblaze B2 is one-fifth of the cost of CoreWeave’s object storage.

So, as Brandon and I explained in the session, CoreWeave’s native storage is a great choice for both the training and inference use cases, where you need the fastest possible access to data, while Backblaze B2 shines as longer term storage for training, model, and inference data as well as the destination for data output from the inference process. In addition, since Backblaze and CoreWeave are bandwidth partners, you can transfer data between our two clouds with no egress fees, freeing you from unpredictable data transfer costs.

Loading an LLM From Backblaze B2

To demonstrate how to load an archived model from Backblaze B2, I used CoreWeave’s GPT-2 sample. GPT-2 is an earlier version of the GPT-3.5 and GPT-4 LLMs used in ChatGPT. As such, it’s an accessible way to get started with LLMs, but, as you’ll see, it certainly doesn’t pass the Turing test!

This sample comprises two applications: a transformer and a predictor. The transformer implements a REST API, handling incoming prompt requests from client apps, encoding each prompt into a tensor, which the transformer passes to the predictor. The predictor applies the GPT-2 model to the input tensor, returning an output tensor to the transformer for decoding into text that is returned to the client app. The two applications have different hardware requirements—the predictor needs a GPU, while the transformer is satisfied with just a CPU, so they are configured as separate Kubernetes pods, and can be scaled up and down independently.

Since the GPT-2 sample includes instructions for loading data from Amazon S3, and Backblaze B2 features an S3 compatible API, it was a snap to modify the sample to load data from a Backblaze B2 Bucket. In fact, there was just a single line to change, in the s3-secret.yaml configuration file. The file is only 10 lines long, so here it is in its entirety:

apiVersion: v1
kind: Secret
metadata:
  name: s3-secret
  annotations:
    serving.kubeflow.org/s3-endpoint: s3.us-west-004.backblazeb2.com
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <my-backblaze-b2-application-key-id>
  AWS_SECRET_ACCESS_KEY: <my-backblaze-b2-application-key>

As you can see, all I had to do was set the serving.kubeflow.org/s3-endpoint metadata annotation to my Backblaze B2 Bucket’s endpoint and paste in an application key and its ID.
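One detail to keep in mind when pasting in credentials: Kubernetes expects the values under a Secret’s data field to be base64-encoded (plain text goes under stringData instead). Here’s a minimal sketch of encoding the two values first. The helper name is my own, and the key ID and key are dummy placeholders, not real credentials.

```python
import base64


def to_secret_value(plaintext):
    """Base64-encode a string for use under a Kubernetes Secret's `data:` field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


# Dummy placeholder credentials for illustration only.
key_id = "example-key-id"
application_key = "example-application-key"

print("AWS_ACCESS_KEY_ID:", to_secret_value(key_id))
print("AWS_SECRET_ACCESS_KEY:", to_secret_value(application_key))
```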

While that was the only Backblaze B2-specific edit, I did have to configure the bucket and path where my model was stored. Here’s an excerpt from gpt-s3-inferenceservice.yaml, which configures the inference service itself:

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: gpt-s3
  annotations:
    # Target concurrency of 4 active requests to each container
    autoscaling.knative.dev/target: "4"
    serving.kubeflow.org/gke-accelerator: Tesla_V100
spec:
  default:
    predictor:
      minReplicas: 0 # Allow scale to zero
      maxReplicas: 2 
      serviceAccountName: s3-sa # The B2 credentials are retrieved from the service account
      tensorflow:
        # B2 bucket and path where the model is stored
        storageUri: s3://<my-bucket>/model-storage/124M/
        runtimeVersion: "1.14.0-gpu"
        ...

Aside from the storageUri configuration, you can see how the predictor application’s pod is configured to scale between zero and two instances (“replicas” in Kubernetes terminology). The remainder of the file contains the transformer pod configuration, allowing it to scale from zero to a single instance.
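If the URI format is new to you: storageUri packs the bucket name and the object path into a single s3:// string. Here’s a small sketch of how such a URI splits apart. The helper name and the bucket name my-bucket are placeholders, and this is not a claim about how the inference service parses the value internally.

```python
from urllib.parse import urlparse


def split_storage_uri(storage_uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(storage_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {storage_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_storage_uri("s3://my-bucket/model-storage/124M/")
print(bucket, prefix)
```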

Running an LLM on CoreWeave Cloud

Spinning up the inference service involved a kubectl apply command for each configuration file and a short wait for the CoreWeave GPU cloud to bring up the compute and networking infrastructure. Once the predictor and transformer services were ready, I used curl to submit my first prompt to the transformer endpoint:

% curl -d '{"instances": ["That was easy"]}' http://gpt-s3-transformer-default.tenant-dead0a.knative.chi.coreweave.com/v1/models/gpt-s3:predict
{"predictions": ["That was easy for some people, it's just impossible for me,\" Davis said. \"I'm still trying to" ]}
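If you’d rather script this than type curl commands, the same request can be sketched in Python using only the standard library. The function names are mine, the Content-Type header is an assumption (curl -d sends a form content type by default), and the URL is whatever your own transformer endpoint turns out to be.

```python
import json
import urllib.request


def build_payload(prompt):
    """Encode a prompt in the {"instances": [...]} shape used by the curl call."""
    return json.dumps({"instances": [prompt]}).encode("utf-8")


def first_prediction(response_body):
    """Pull the first generated string out of a {"predictions": [...]} response."""
    return json.loads(response_body)["predictions"][0]


def predict(url, prompt, timeout=30):
    """POST a prompt to the transformer endpoint and return the generated text."""
    request = urllib.request.Request(
        url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},  # assumed header
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return first_prediction(response.read())


# Parsing a response shaped like the one above, without touching the network:
sample = '{"predictions": ["That was easy for some people"]}'
print(first_prediction(sample))
```

Calling predict() in a loop, passing each result back in as the next prompt, is one way to chain responses together.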

In the video, I repeated the exercise, feeding GPT-2’s response back into it as a prompt a few times to generate a few paragraphs of text. Here’s what it came up with:

“That was easy: If I had a friend who could take care of my dad for the rest of his life, I would’ve known. If I had a friend who could take care of my kid. He would’ve been better for him than if I had to rely on him for everything.

The problem is, no one is perfect. There are always more people to be around than we think. No one cares what anyone in those parts of Britain believes,

The other problem is that every decision the people we’re trying to help aren’t really theirs. If you have to choose what to do”

If you’ve used ChatGPT, you’ll recognize how far LLMs have come since GPT-2’s release in 2019!

Run Your Own Large Language Model

While CoreWeave’s GPT-2 sample is an excellent introduction to the world of LLMs, it’s a bit limited. If you’re looking to get deeper into generative AI, another sample, Fine-tune Large Language Models with CoreWeave Cloud, shows how to fine-tune a model from the more recent EleutherAI Pythia suite.

Since CoreWeave is a specialized GPU cloud designed to deliver best-in-class performance up to 35x faster and 80% less expensive than generalized public clouds, it’s a great choice for workloads such as AI, ML, rendering, and more, and, as you’ve seen in this blog post, easy to integrate with Backblaze B2 Cloud Storage, with no data transfer costs. For more information, contact the CoreWeave team.

The post How to Run AI/ML Workloads on CoreWeave + Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Storage Tech of the Future: Ceramics, DNA, and More

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/storage-tech-of-the-future-ceramics-dna-and-more/

A decorative image showing a data drive with a health monitor indicator running through and behind it.

Two announcements had the Backblaze #social Slack channel blowing up this week, both related to “Storage Technologies of the Future.” The first reported “Video of Ceramic Storage System Surfaces Online” like some kind of UFO sighting. The second, somewhat more restrained announcement heralded the release of DNA storage cards available to the general public. Yep, you heard that right—coming to a Best Buy near you. (Not really. You absolutely have to special order these babies, but they ARE going to be for sale.)

We talked about DNA storage way back in 2015. It’s been nine years, so we thought it was high time to revisit the tech and dig into ceramics as well. (Pun intended.) 

What Is DNA Storage?

The idea is elegant, really. What is DNA if not an organic, naturally occurring form of code?

DNA consists of four nucleotide bases: adenine (A), thymine (T), cytosine (C), and guanine (G). 

In DNA storage, information is encoded into sequences of these nucleotide bases. For example, A and C might represent 0, while T and G represent 1. This encoding allows digital data, such as text, images, or other types of information, to be translated into DNA sequences. Cool!
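Here’s a toy sketch of that example mapping (A or C for 0, T or G for 1), storing one bit per base. Alternating between the two allowed bases so the same base never appears twice in a row is my own illustrative choice; real encoding schemes are more sophisticated and add error correction.

```python
ZERO_BASES = "AC"  # either base stands for bit 0
ONE_BASES = "TG"   # either base stands for bit 1


def bytes_to_dna(data):
    """Encode bytes as a base sequence, one base per bit."""
    bases = []
    for byte in data:
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            options = ONE_BASES if bit else ZERO_BASES
            # Alternate between the two allowed bases by position so the
            # same base never appears twice in a row.
            bases.append(options[len(bases) % 2])
    return "".join(bases)


def dna_to_bytes(sequence):
    """Decode a base sequence produced by bytes_to_dna back into bytes."""
    bits = [1 if base in ONE_BASES else 0 for base in sequence]
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


encoded = bytes_to_dna(b"Hi")
print(encoded)
assert dna_to_bytes(encoded) == b"Hi"
```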

The appeal of DNA as a storage medium lies in its density and stability, as well as its ability to store vast amounts of information in a very compact space. It also boasts remarkable durability, with the ability to preserve information for thousands of years under suitable conditions. I mean, leave it to Mother Nature to put our silly little hard drives to shame.

Back in 2015, we shared that the storage density of DNA was about 2.2 petabytes per gram. In 2017, a study out of Columbia University and the New York Genome Center put it at an incredible 215 petabytes per gram. For comparison’s sake, a WDC 22TB drive (WDC WUH722222ALE6L4) that we currently use in our data centers is 1.5 pounds or 680 grams, which nets out at 0.032TB/gram or 0.000032PB/gram.
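The hard drive side of that comparison is simple division. Here’s a quick sketch using the figures above, assuming decimal units (1PB = 1,000TB):

```python
drive_capacity_tb = 22
drive_mass_g = 680

hdd_tb_per_gram = drive_capacity_tb / drive_mass_g
hdd_pb_per_gram = hdd_tb_per_gram / 1000  # 1 PB = 1,000 TB

dna_pb_per_gram = 215  # the 2017 figure cited above

print(f"HDD: {hdd_tb_per_gram:.3f} TB/gram ({hdd_pb_per_gram:.6f} PB/gram)")
print(f"DNA is roughly {dna_pb_per_gram / hdd_pb_per_gram:,.0f}x denser by mass")
```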

Another major advantage is its sustainability. Estimated global data center electricity consumption in 2022 was 240–340 TWh, or around 1–1.3% of global final electricity demand. Current data storage technology uses rare earth metals which are environmentally damaging to mine. Drives take up space, and they also create e-waste at the end of their lifecycle. It’s a challenge anyone who works in the data storage industry thinks about a lot.

DNA storage, on the other hand, requires less energy. A 2023 study found that, in a DNA movable-type storage system, data can be written at normal operating temperatures of about 60–113°F and then stored at room temperature. DNA molecules are also biodegradable and can be broken down naturally.

The DNA data-writing process is chemical-based, and actually not the most environmentally friendly, but the DNA storage cards developed by Biomemory use a proprietary biosourced writing process, which they call “a significant advancement over existing chemical or enzymatic synthesis technologies.” So, there might be some trade-offs, but we’ll know more as the technology evolves. 

What’s the Catch?

Density? Check. Durability? Wow, yeah. Sustainability? You got it. But DNA storage is still a long way from sitting on your desk, storing your duplicate selfies. First, and we said this back in 2015 too, DNA takes a long time to read and write—DNA synthesis writes at a few hundred bytes per second. An average iPhone photo would take several hours to write to DNA. And to read it, you have to sequence the DNA—a time-intensive process. Both of those processes require specialized scientific equipment.
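To put a rough number on that write time, here’s a back-of-the-envelope sketch. Both inputs are assumptions on my part: a photo of about 3MB, and 300 bytes per second standing in for “a few hundred.”

```python
photo_bytes = 3_000_000      # assumed photo size (about 3 MB)
write_bytes_per_sec = 300    # assumed value for "a few hundred bytes per second"

seconds = photo_bytes / write_bytes_per_sec
print(f"{seconds / 3600:.1f} hours to write one photo")
```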

It’s also still too expensive. In 2015, we found a study that put 83 kilobytes of DNA storage at £1000 (about $1,500 U.S. dollars). In 2021, MIT estimated it would cost about $1 trillion to store one petabyte of data on DNA. For comparison, it costs $6,000 per month to store one petabyte in Backblaze B2 Cloud Storage ($6/TB/month). You could store that petabyte for nearly 14 million years before you’d hit $1 trillion.
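If you want to check the comparison yourself, it’s just division. Here’s a sketch using the figures above and assuming 1PB = 1,000TB:

```python
dna_cost_per_pb = 1_000_000_000_000   # MIT's 2021 estimate, in dollars
b2_cost_per_tb_month = 6              # Backblaze B2 price cited above
tb_per_pb = 1_000

b2_cost_per_pb_month = b2_cost_per_tb_month * tb_per_pb
months = dna_cost_per_pb / b2_cost_per_pb_month
years = months / 12
print(f"${b2_cost_per_pb_month:,}/month -> {years:,.0f} years to reach $1 trillion")
```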

Today, Biomemory’s DNA storage cards ring in at a cool €1000 (about $1,080 U.S. dollars). And they can hold a whopping one kilobyte of data or the equivalent of a short email. So, yeah …it’s ahh, gotten even more expensive for the commercial product. 

The discrepancy between the MIT theoretical estimate and the cost of the Biomemory cards really speaks to the expense of bringing a technology like this to market. The theoretical cost per byte is a lot different than the operational cost, and the Biomemory cards are really meant to serve as proof of concept.  All that said, as the technology improves, one can only hope that it becomes more cost-effective in the future. Folks are experimenting with different encoding schemes to make writing and reading more efficient, as one example of an advance that could start to tip the balance.  

Finally, there’s just something a bit spooky about using synthetic DNA to store data. There’s a Black Mirror episode in there somewhere. Maybe one day we can upload kung fu skills directly into our brain domes and that would be cool, but for now, it’s still somewhat unsettling.

What Is Ceramic Storage?

Ceramic storage makes an old school approach new again, if you consider that the first stone tablets were kind of the precursor to today’s hard drives. Who’s up for storing some cuneiform?

Cerabyte, the company behind the “video that surfaced online,” is working on storage technology that uses ceramic and glass substrates in devices the size of a typical HDD that can store 10 petabytes of data. They use a glass base similar to Gorilla Glass by Corning topped with a layer of ceramic 300 micrometers thick that’s essentially etched with lasers. (Glass is used in many larger hard drives today, for what it’s worth. Hoya makes them, for example.) The startup debuted a fully operational prototype system using only commercial off-the-shelf equipment—pretty impressive. 

The prototype consists of a single read-write rack and several library racks. When you want to write data, it moves one of the cartridges from the library to the read-write rack where it is opened to expose and stage the ceramic substrate. Two million laser beamlets then punch nanoscale ones and zeros into the surface. Once the data is written, the read-write arm verifies it on the return motion to its original position. 

Cerabyte isn’t the only player in the game. Others like MDisc use similar technology. Currently, MDisc stores data on DVD-sized disks using a “rock-like” substrate. Several DVD player manufacturers have included the technology in players. 

Similar to DNA storage, ceramic storage boasts much higher density than current data storage tech: terabytes per square centimeter versus an HDD’s 0.02TB per square centimeter. Also like DNA storage, it’s more environmentally friendly. Ceramic and glass can be stored within a wide temperature range, from about -460°F to 570°F, and they’re natural materials that will last millennia and eventually decompose. It’s also incredibly durable: Cerabyte claims it will last 5000+ years, and with tons of clay pots still lying around from ancient times, that makes sense.

One advantage it has over DNA storage, though, is speed. One laser pulse writes up to 2,000,000 bits, so data can be written at GBps speeds.
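A quick sketch of what that implies about pulse rate. The 1GB/s target is my assumption, taking “GBps speeds” at its low end, and the bits-per-pulse figure is the upper bound quoted above.

```python
bits_per_pulse = 2_000_000            # upper bound cited above
target_bytes_per_sec = 1_000_000_000  # assumed target: 1 GB/s

bytes_per_pulse = bits_per_pulse / 8
pulses_per_sec = target_bytes_per_sec / bytes_per_pulse
print(f"{bytes_per_pulse:,.0f} bytes per pulse")
print(f"{pulses_per_sec:,.0f} pulses per second for 1 GB/s")
```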

What’s the Catch?

Ceramic also has density, sustainability, and speed to boot, but our biggest question is: who’s going to need that speed? There are only a handful of applications, like AI, that require that speed now. AI is certainly having a big moment, and it can only get bigger. So, presumably there’s a market, but only a small one that can justify the cost. 

One other biggie, at least for a cloud storage provider like us, though not necessarily for consumers or other enterprise users: it’s a write-once model. Once it’s on there, it’s on there. 

Finally, much like DNA tech, it’s probably (?) still too expensive to make it feasible for most data center applications. Cerabyte hasn’t released pricing yet. According to Blocks & Files, “The cost roadmap is expected to offer cost structures below projections of current commercial storage technologies.” But it’s still a big question mark.

Our Hot Take

Both of these technologies are really cool. They definitely got our storage geek brains fired up. But until they become scalable, operationally feasible, and cost-effective, you won’t see them in production—they’re still far enough out that they’re on the fiction end of the science fiction to science fact spectrum. And there are a couple roadblocks we see before they reach the ubiquity of your trusty hard drive. 

The first is making both technologies operational, not just theoretical in a lab. We’ll know more about both Biomemory’s and Cerabyte’s technologies as they roll out these initial proof of concept cards and prototype machines. And both have plans, naturally, for scaling the technologies to the data center. Whether they can or not remains to be seen. Lots of technologies have come and gone, falling victim to the challenges of production, scaling, and cost. 

The second is the attendant infrastructure needs. Getting 100x speed is great, if the device is right next to you. But we’ll need similar leaps in physical networking infrastructure to transfer the data anywhere else. Until that catches up, the tech remains lab-bound. 

All that said, I still remember using floppy disks that held mere megabytes of data, and now you can put 20TB on a hard disk. So, I guess the question is, how long will it be before I can plug into the Matrix?

The post Storage Tech of the Future: Ceramics, DNA, and More appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloud 101: Data Egress Fees Explained

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/cloud-101-data-egress-fees-explained/

A decorative image showing a server, a cloud, and arrows pointing up and down with a dollar sign.

You can imagine data egress fees like tolls on a highway—your data is cruising along trying to get to its destination, but it has to pay a fee for the privilege of continuing its journey. If you have a lot of data to move, or a lot of toll booths (different cloud services) to move it through, those fees can add up quickly. 

Data egress fees are charges you incur for moving data out of a cloud service. They can be a big part of your cloud bill depending on how you use the cloud. And, they’re frequently a reason behind surprise AWS bills. So, let’s take a closer look at egress, egress fees, and ways you can reduce or eliminate them, so that your data can travel the cloud superhighways at will. 

What Is Data Egress?

In computing generally, data egress refers to the transfer or movement of data out of a given location, typically from within a network or system to an external destination. When it comes to cloud computing, egress generally means whenever data leaves the boundaries of a cloud provider’s network. 

In the simplest terms, data egress is the outbound flow of data.

A photo of a stair case with a sign that says "out" and an arrow pointing up.
The fees, like these stairs, climb higher. Source.

Egress vs. Ingress?

While egress pertains to data exiting a system, ingress refers to data entering a system. When you download something, you’re egressing data from a service. When you upload something, you’re ingressing data to that service. 

Unsurprisingly, most cloud storage providers do not charge you to ingress data—they want you to store your data on their platform, so why would they? 

Egress vs. Download

You might hear egress referred to as download, and that’s not wrong, but there are some nuances. Egress applies not only to downloads, but also when you migrate data between cloud services, for example. So, egress includes downloads, but it’s not limited to them. 

In the context of cloud service providers, the distinction between egress and download may not always be explicitly stated, and the terminology used can vary between providers. It’s essential to refer to the specific terms and pricing details provided by the service or platform you are using to understand how they classify and charge for data transfers.

How Do Egress Fees Work?

Data egress fees are charges incurred when data is transferred out of a cloud provider’s environment. These fees are often associated with cloud computing services, where users pay not only for the resources they consume within the cloud (such as storage and compute) but also for the data that is transferred from the cloud to external destinations.

There are a number of scenarios where a cloud provider typically charges egress: 

  • When you’re migrating data from one cloud to another.
  • When you’re downloading data from a cloud to a local repository.
  • When you move data between regions or zones with certain cloud providers. 
  • When an application, end user, or content delivery network (CDN) requests data from your cloud storage bucket. 

The fees can vary depending on the amount of data transferred and the destination of the data. For example, transferring data between regions within the same cloud provider’s network might incur lower fees than transferring data to the internet or to a different cloud provider.

Data egress fees are an important consideration for organizations using cloud services, and they can impact the overall cost of hosting and managing data in the cloud. It’s important to be aware of the pricing details related to data egress in the cloud provider’s pricing documentation, as these fees can contribute significantly to the total cost of using cloud services.

Why Do Cloud Providers Charge Egress Fees?

Both ingressing and egressing data costs cloud providers money. They have to build the physical infrastructure to allow users to do that, including switches, routers, fiber cables, etc. They also have to have enough of that infrastructure on hand to meet customer demand, not to mention staff to deploy and maintain it. 

However, it’s telling that most cloud providers don’t charge ingress fees, only egress fees. It would be hard to entice people to use your service if you charged them extra for uploading their data. But, once cloud providers have your data, they want you to keep it there. Charging you to remove it is one way cloud providers like AWS, Google Cloud, and Microsoft Azure do that. 

What Are AWS’s Egress Fees?

AWS S3 gives customers 100GB of data transfer out to the internet free each month, with some caveats—that 100GB excludes data stored in China and GovCloud. After that, the published rates for U.S. regions for data transferred over the public internet are as follows as of the date of publication:

  • The first 10TB per month is $0.09 per GB.
  • The next 40TB per month is $0.085 per GB.
  • The next 100TB per month is $0.07 per GB.
  • Anything greater than 150TB per month is $0.05 per GB. 
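To see how those tiers stack up, here’s a minimal sketch of a monthly estimate. It makes two assumptions: 1TB is treated as 1,000GB, and the free 100GB is subtracted before the tiers apply. It only covers the internet egress rates listed above, so treat it as a rough illustration rather than a bill predictor.

```python
GB_PER_TB = 1_000  # assumption: decimal terabytes

FREE_GB = 100
# (tier size in GB, price per GB); None means "everything above".
TIERS = [
    (10 * GB_PER_TB, 0.09),
    (40 * GB_PER_TB, 0.085),
    (100 * GB_PER_TB, 0.07),
    (None, 0.05),
]


def monthly_egress_cost(egress_gb):
    """Estimate one month's internet egress charge for the tiers listed above."""
    remaining = max(0, egress_gb - FREE_GB)
    cost = 0.0
    for size, price in TIERS:
        in_tier = remaining if size is None else min(remaining, size)
        cost += in_tier * price
        remaining -= in_tier
        if remaining <= 0:
            break
    return round(cost, 2)


for gb in (50, 1_100, 60_100):
    print(f"{gb:>7,} GB -> ${monthly_egress_cost(gb):,.2f}")
```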

But AWS also charges customers egress between certain services and regions, and it can get complicated quickly as the following diagram shows…

Source.

How Can I Reduce Egress Fees?

If you’re using cloud services, minimizing your egress fees is probably a high priority. Companies like the Duckbill Group (the creators of the diagram above) exist to help businesses manage their AWS bills. In fact, there’s a whole industry of consultants that focuses solely on reducing your AWS bills. 

Aside from hiring a consultant to help you spend less, there are a few simple ways to lower your egress fees:

  1. Use a content delivery network (CDN): If you’re hosting an application, using a CDN can lower your egress fees since a CDN will cache data on edge servers. That way, when a user sends a request for your data, it can pull it from the CDN server rather than your cloud storage provider where you would be charged egress. 
  2. Optimize data transfer protocols: Choose efficient data transfer protocols that minimize the amount of data transmitted. For example, consider using compression or delta encoding techniques to reduce the size of transferred files. Compressing data before transfer can reduce the volume of data sent over the network, leading to lower egress costs. However, the effectiveness of compression depends on the nature of the data.
  3. Utilize integrated cloud providers: Some cloud providers offer free data transfer with a range of other cloud partners. (Hint: that’s what we do here at Backblaze!)
  4. Be aware of tiering: It may sound enticing to opt for a cold(er) storage tier to save on storage, but some of those tiers come with much higher egress fees. 
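As a small illustration of the compression point in item 2, here’s a sketch using Python’s built-in gzip module. The sample data is a fake, highly repetitive log, which compresses very well; already-compressed files like photos or video would shrink far less.

```python
import gzip

# Repetitive sample data (a fake log) compresses well; photos or video,
# which are already compressed, would shrink far less.
original = b"2024-01-01 12:00:00 INFO request served in 12ms\n" * 10_000
compressed = gzip.compress(original)

print(f"original:   {len(original):,} bytes")
print(f"compressed: {len(compressed):,} bytes")
print(f"egress volume reduced by {1 - len(compressed) / len(original):.1%}")
```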

How Does Backblaze Reduce Egress Fees?

There’s one more way you can drastically reduce egress, and we’ll just come right out and say it: Backblaze gives you free egress up to 3x the average monthly storage and unlimited free egress through a number of CDN and compute partners, including Fastly, Cloudflare, Bunny.net, and Vultr.

Why do we offer free egress? Supporting an open cloud environment is central to our mission, so we expanded free egress to all customers so they can move data when and where they prefer. Cloud providers like AWS and others charge high egress fees that make it expensive for customers to use multi-cloud infrastructures and therefore lock in customers to their services. These walled gardens hamper innovation and long-term growth.

Free Egress = A Better, Multi-Cloud World

The bottom line: the high egress fees charged by hyperscalers like AWS, Google, and Microsoft are a direct impediment to a multi-cloud future driven by customer choice and industry need. And, a multi-cloud future is something we believe in. So go forth and build the multi-cloud future of your dreams, and leave worries about high egress fees in the past. 

The post Cloud 101: Data Egress Fees Explained appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Digging Deeper Into Object Lock

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/digging-deeper-into-object-lock/

A decorative image showing data inside of a vault.

Using Object Lock for your data is a smart choice—you can protect your data from ransomware, meet compliance requirements, beef up your security policy, or preserve data for legal reasons. But, it’s not a simple on/off switch, and accidentally locking your data for 100 years is a mistake you definitely don’t want to make.

Today we’re taking a deeper dive into Object Lock and the related legal hold feature, examining the different levels of control that are available, explaining why developers might want to build Object Lock into their own applications, and showing exactly how to do that. While the code samples are aimed at our developer audience, anyone looking for a deeper understanding of Object Lock should be able to follow along.

I presented a webinar on this topic earlier this year that covers much the same ground as this blog post, so feel free to watch it instead of, or in addition to, reading this article. 

Check Out the Docs

For even more information on Object Lock, check out our Object Lock overview in our Technical Documentation Portal as well as the how-tos on enabling Object Lock using the Backblaze web UI, Backblaze B2 Native API, and the Backblaze S3 Compatible API.

What Is Object Lock?

In the simplest explanation, Object Lock is a way to lock objects (aka files) stored in Backblaze B2 so that they are immutable—that is, they cannot be deleted or modified, for a given period of time, even by the user account that set the Object Lock rule. Backblaze B2’s implementation of Object Lock was originally known as File Lock, and you may encounter the older terminology in some documentation and articles. For consistency, I’ll use the term “object” in this blog post, but in this context it has exactly the same meaning as “file.”

Object Lock is a widely offered feature included with backup applications such as Veeam and MSP360, allowing organizations to ensure that their backups are not vulnerable to deliberate or accidental deletion or modification for some configurable retention period.

Ransomware mitigation is a common motivation for protecting data with Object Lock. Even if an attacker were to compromise an organization’s systems to the extent of accessing the application keys used to manage data in Backblaze B2, they would not be able to delete or change any locked data. Similarly, Object Lock guards against insider threats, where the attacker may try to abuse legitimate access to application credentials.

Object Lock is also used in industries that store sensitive or personally identifiable information (PII) such as banking, education, and healthcare. Because they work with such sensitive data, regulatory requirements dictate that data be retained for a given period of time, but data must also be deleted in particular circumstances. 

For example, the General Data Protection Regulation (GDPR), an important component of the EU’s privacy laws and an international regulatory standard that drives best practices, may dictate that some data must be deleted when a customer closes their account. A related use case is where data must be preserved due to litigation, where the period for which data must be locked is not fixed and depends on the type of lawsuit at hand. 

To handle these requirements, Backblaze B2 offers two Object Lock modes—compliance and governance—as well as the legal hold feature. Let’s take a look at the differences between them.

Compliance Mode: Near-Absolute Immutability

When objects are locked in compliance mode, not only can they not be deleted or modified while the lock is in place, but the lock also cannot be removed during the specified retention period. It is not possible to remove or override the compliance lock to delete locked data until the lock expires, whether you’re attempting to do so via the Backblaze web UI or either of the S3 Compatible or B2 Native APIs. Similarly, Backblaze Support is unable to unlock or delete data locked under compliance mode in response to a support request, which is a safeguard designed to address social engineering attacks where an attacker impersonates a legitimate user.

What if you inadvertently lock many terabytes of data for several years? Are you on the hook for thousands of dollars of storage costs? Thankfully, no—you have one escape route, which is to close your Backblaze account. Closing the account is a multi-step process that requires access to both the account login credentials and two-factor verification (if it is configured) and results in the deletion of all data in that account, locked or unlocked. This is a drastic step, so we recommend that developers create one or more “burner” Backblaze accounts for use in developing and testing applications that use Object Lock, that can be closed if necessary without disrupting production systems.

There is one lock-related operation you can perform on compliance-locked objects: extending the retention period. In fact, you can keep extending the retention period on locked data any number of times, protecting that data from deletion until you let the compliance lock expire.
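
Here is a hedged sketch of what extending a compliance lock could look like with a Boto3 S3 client pointed at Backblaze B2’s S3 Compatible API. The helper name and its parameters are my own; the client is passed in as an argument, so the function makes no assumptions about how it was created.

```python
from datetime import timedelta


def extend_compliance_lock(s3_client, bucket, key, extra_days, version_id=None):
    """Push a compliance-locked object's retain-until date further out."""
    if extra_days <= 0:
        raise ValueError("A compliance lock can only be extended, not shortened")

    lookup = {"Bucket": bucket, "Key": key}
    if version_id:
        lookup["VersionId"] = version_id

    # Read the current retention settings for this object version.
    current = s3_client.get_object_retention(**lookup)["Retention"]
    new_date = current["RetainUntilDate"] + timedelta(days=extra_days)

    # Write the later date back; the mode must be supplied again.
    s3_client.put_object_retention(
        **lookup,
        Retention={"Mode": "COMPLIANCE", "RetainUntilDate": new_date},
    )
    return new_date
```

Because a compliance lock cannot be shortened afterwards, it is worth testing a helper like this against a “burner” account first.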

Governance Mode: Override Permitted

In our other Object Lock option, objects can be locked in governance mode for a given retention period. But, in contrast to compliance mode, the governance lock can be removed or overridden via an API call, if you have an application key with appropriate capabilities. Governance mode handles use cases that require retention of data for some fixed period of time, with exceptions for particular circumstances.

When I’m trying to remember the difference between compliance and governance mode, I think of the phrase, “Twenty seconds to comply!”, uttered by the ED-209 armed robot in the movie “RoboCop.” It turned out that there was no way to override ED-209’s programming, with dramatic, and fatal, consequences.

ED-209: as implacable as compliance mode.

Legal Hold: Flexible Preservation

While the compliance and governance retention modes lock objects for a given retention period, legal hold is more like a toggle switch: you can turn it on and off at any time, again with an application key with sufficient capabilities. As its name suggests, legal hold is ideal for situations where data must be preserved for an unpredictable period of time, such as while litigation is proceeding.

The compliance and governance modes are mutually exclusive, which is to say that only one may be in operation at any time. Objects locked in governance mode can be switched to compliance mode, but, as you might expect from the above explanation, objects locked in compliance mode cannot be switched to governance mode until the compliance lock expires.

Legal hold, on the other hand, operates independently, and can be enabled and disabled regardless of whether an object is locked in compliance or governance mode.

How does this work? Consider an object that is locked in compliance or governance mode and has legal hold enabled:

  • If the legal hold is removed, the object remains locked until the retention period expires.
  • If the retention period expires, the object remains locked until the legal hold is removed.
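
Put another way, an object version stays protected while either condition holds. The following is a small model of that rule, not Backblaze’s implementation; the function name is invented for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def is_protected(
    retain_until: Optional[datetime],
    legal_hold: bool,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if an object version cannot currently be deleted."""
    now = now or datetime.now(timezone.utc)
    # The retention lock is active until its retain-until date passes.
    retention_active = retain_until is not None and now < retain_until
    # Legal hold is independent: either condition keeps the object locked.
    return retention_active or legal_hold
```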

Object Lock and Versioning

By default, Backblaze B2 Buckets have versioning enabled, so as you upload successive objects with the same name, previous versions are preserved automatically. None of the Object Lock modes prevent you from uploading a new version of a locked object; the lock is specific to the object version to which it was applied.

You can also hide a locked object so it doesn’t appear in object listings. The hidden version is retained and can be revealed using the Backblaze web UI or an API call.

As you might expect, locked object versions are not subject to deletion by lifecycle rules—any attempt to delete a locked object version via a lifecycle rule will fail.

How to Use Object Lock in Applications

Now that you understand the two modes of Object Lock, plus legal hold, and how they all work with object versions, let’s look at how you can take advantage of this functionality in your applications. I’ll include code samples for Backblaze B2’s S3 Compatible API written in Python, using the AWS SDK, aka Boto3, in this blog post. You can find details on working with Backblaze B2’s Native API in the documentation.
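
The samples that follow assume an existing `s3_client` object. One way to create it is sketched below, reading the endpoint and application key from environment variables. The variable names are placeholders I chose for this example; use the S3 endpoint shown in your own bucket’s details.

```python
import os


def b2_client_kwargs(env=None):
    """Collect Boto3 client settings for Backblaze B2 from environment variables."""
    env = os.environ if env is None else env
    return {
        "endpoint_url": env["B2_ENDPOINT_URL"],  # hypothetical variable name
        "aws_access_key_id": env["B2_APPLICATION_KEY_ID"],  # hypothetical variable name
        "aws_secret_access_key": env["B2_APPLICATION_KEY"],  # hypothetical variable name
    }


def make_s3_client(env=None):
    """Create the s3_client object that the samples in this post assume."""
    import boto3  # imported lazily so the helper above works without Boto3 installed

    return boto3.client("s3", **b2_client_kwargs(env))
```

With those variables set, `s3_client = make_s3_client()` gives you a client you can use with the rest of the samples.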

Application Key Capabilities for Object Lock

Every application key you create for Backblaze B2 has an associated set of capabilities; each capability allows access to a specific functionality in Backblaze B2. There are seven capabilities relevant to Object Lock and legal hold. 

Two capabilities relate to bucket settings:

  1. readBucketRetentions 
  2. writeBucketRetentions

Three capabilities relate to object settings for retention: 

  1. readFileRetentions 
  2. writeFileRetentions 
  3. bypassGovernance

And two are specific to legal hold: 

  1. readFileLegalHolds 
  2. writeFileLegalHolds 

The Backblaze B2 documentation contains full details of each capability and the API calls it relates to for both the S3 Compatible API and the B2 Native API.

When you create an application key via the web UI, it is assigned capabilities according to whether you allow it access to all buckets or just a single bucket, and whether you assign it read-write, read-only, or write-only access.

An application key created in the web UI with read-write access to all buckets will receive all of the above capabilities. A key with read-only access to all buckets will receive readBucketRetentions, readFileRetentions, and readFileLegalHolds. Finally, a key with write-only access to all buckets will receive bypassGovernance, writeBucketRetentions, writeFileRetentions, and writeFileLegalHolds.

In contrast, an application key created in the web UI restricted to a single bucket is not assigned any of the above permissions. When an application using such a key uploads objects to its associated bucket, they receive the default retention mode and period for the bucket, if they have been set. The application is not able to select a different retention mode or period when uploading an object, change the retention settings on an existing object, or bypass governance when deleting an object.
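
The web UI rules described above can be summarized as a small lookup. This is only a restatement of the preceding paragraphs as data, with a function name made up for the example; it does not call any Backblaze API.

```python
READ_CAPABILITIES = frozenset(
    {"readBucketRetentions", "readFileRetentions", "readFileLegalHolds"}
)
WRITE_CAPABILITIES = frozenset(
    {
        "bypassGovernance",
        "writeBucketRetentions",
        "writeFileRetentions",
        "writeFileLegalHolds",
    }
)


def web_ui_lock_capabilities(access: str, all_buckets: bool) -> frozenset:
    """Return the Object Lock and legal hold capabilities a web UI key receives."""
    if not all_buckets:
        # Keys restricted to a single bucket get none of these capabilities.
        return frozenset()
    by_access = {
        "read-write": READ_CAPABILITIES | WRITE_CAPABILITIES,
        "read-only": READ_CAPABILITIES,
        "write-only": WRITE_CAPABILITIES,
    }
    return by_access[access]
```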

You may want to create application keys with more granular permissions when working with Object Lock and/or legal hold. For example, you may need an application restricted to a single bucket to be able to toggle legal hold for objects in that bucket. You can use the Backblaze B2 CLI to create an application key with this, or any other set of capabilities. This command, for example, creates a key with the default set of capabilities for read-write access to a single bucket, plus the ability to read and write the legal hold setting:

% b2 create-key --bucket my-bucket-name my-key-name listBuckets,readBuckets,listFiles,readFiles,shareFiles,writeFiles,deleteFiles,readBucketEncryption,writeBucketEncryption,readBucketReplications,writeBucketReplications,readFileLegalHolds,writeFileLegalHolds

Enabling Object Lock

You must enable Object Lock on a bucket before you can lock any objects therein; you can do this when you create the bucket, or at any time later, but you cannot disable Object Lock on a bucket once it has been enabled. Here’s how you create a bucket with Object Lock enabled:

s3_client.create_bucket(
    Bucket='my-bucket-name',
    ObjectLockEnabledForBucket=True
)

Once a bucket’s settings have Object Lock enabled, you can configure a default retention mode and period for objects that are created in that bucket. Only compliance mode is configurable from the web UI, but you can set governance mode as the default via an API call, like this:

s3_client.put_object_lock_configuration(
    Bucket='my-bucket-name',
    ObjectLockConfiguration={
        'ObjectLockEnabled': 'Enabled',
        'Rule': {
            'DefaultRetention': {
                'Mode': 'GOVERNANCE',
                'Days': 7
            }
        }
    }
)

You cannot set legal hold as a default configuration for the bucket.

Locking Objects

Regardless of whether you set a default retention mode for the bucket, you can explicitly set a retention mode and period when you upload objects, or apply the same settings to existing objects, provided you use an application key with the appropriate writeFileRetentions or writeFileLegalHolds capability.

Both the S3 PutObject operation and Backblaze B2’s b2_upload_file include optional parameters for specifying retention mode and period, and/or legal hold. For example:

from datetime import datetime

s3_client.put_object(
    Body=open('/path/to/local/file', mode='rb'),
    Bucket='my-bucket-name',
    Key='my-object-name',
    ObjectLockMode='GOVERNANCE',
    ObjectLockRetainUntilDate=datetime(
        2023, 9, 7, hour=10, minute=30, second=0
    )
)
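
Hard-coded dates are fine for a demo, but an application will usually compute the retain-until date. Here is a minimal sketch that also guards against the “locked for 100 years” mistake mentioned at the top of this post. The helper name and the one-year ceiling are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_RETENTION_DAYS = 365  # hypothetical policy ceiling; pick your own


def retain_until_date(days, now=None, max_days=MAX_RETENTION_DAYS):
    """Return a timezone-aware retain-until date, refusing suspicious periods."""
    if not 0 < days <= max_days:
        raise ValueError(f"Retention of {days} days is outside 1..{max_days}")
    now = now or datetime.now(timezone.utc)
    return now + timedelta(days=days)
```

You could then pass `ObjectLockRetainUntilDate=retain_until_date(7)` when uploading an object.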

Both APIs implement additional operations to get and set retention settings and legal hold for existing objects. Here’s an example of how you apply a governance mode lock:

s3_client.put_object_retention(
    Bucket='my-bucket-name',
    Key='my-object-name',
    VersionId='some-version-id',
    Retention={
        'Mode': 'GOVERNANCE',  # Required, even if mode is not changed
        'RetainUntilDate': datetime(
            2023, 9, 5, hour=10, minute=30, second=0
        )
    }
)

The VersionId parameter is optional: the operation applies to the current object version if it is omitted.
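
Legal hold is set and read through its own pair of S3 operations. The sketch below wraps them in two small helpers; the helper names are mine, and the application key behind the client would need the writeFileLegalHolds and readFileLegalHolds capabilities respectively.

```python
def set_legal_hold(s3_client, bucket, key, enabled, version_id=None):
    """Turn legal hold on or off for an object version."""
    params = {
        "Bucket": bucket,
        "Key": key,
        "LegalHold": {"Status": "ON" if enabled else "OFF"},
    }
    if version_id:
        params["VersionId"] = version_id
    s3_client.put_object_legal_hold(**params)


def has_legal_hold(s3_client, bucket, key, version_id=None):
    """Return True if legal hold is currently enabled for an object version."""
    params = {"Bucket": bucket, "Key": key}
    if version_id:
        params["VersionId"] = version_id
    response = s3_client.get_object_legal_hold(**params)
    return response.get("LegalHold", {}).get("Status") == "ON"
```

As with retention, both helpers act on the current object version when no version ID is given.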

You can also use the web UI to view, but not change, an object’s retention settings, and to toggle legal hold for an object:

A screenshot highlighting where to enable Object Lock via the Backblaze web UI.

Deleting Objects in Governance Mode

As mentioned above, a key difference between the compliance and governance modes is that it is possible to override governance mode to delete an object, given an application key with the bypassGovernance capability. To do so, you must identify the specific object version, and pass a flag to indicate that you are bypassing the governance retention restriction:

# Get object details, including version id of current version
object_info = s3_client.head_object(
    Bucket='my-bucket-name',
    Key='my-object-name'
)

# Delete the most recent object version, bypassing governance
s3_client.delete_object(
    Bucket='my-bucket-name',
    Key='my-object-name',
    VersionId=object_info['VersionId'],
    BypassGovernanceRetention=True
)

There is no way to delete an object in legal hold; the legal hold must be removed before the object can be deleted.

Protect Your Data With Object Lock and Legal Hold

Object Lock is a powerful feature, and with great power… you know the rest. Here are some of the questions you should ask when deciding whether to implement Object Lock in your applications:

  • What would be the impact of malicious or accidental deletion of your application’s data?
  • Should you lock all data according to a central policy, or allow users to decide whether to lock their data, and for how long?
  • If you are storing data on behalf of users, are there special circumstances where a lock must be overridden?
  • Which users should be permitted to set and remove a legal hold? Does it make sense to build this into the application rather than have an administrator use a tool such as the Backblaze B2 CLI to manage legal holds?

If you already have a Backblaze B2 account, you can start working with Object Lock today; otherwise, create an account to get started.

The post Digging Deeper Into Object Lock appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

What Is Hybrid Cloud?

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/confused-about-the-hybrid-cloud-youre-not-alone/

An illustration of clouds, computers, and servers.
Editor’s note: This post has been updated since it was originally published in 2017.

The term hybrid cloud has been around for a while—we originally published this explainer in 2017. But time hasn’t necessarily made things clearer. Maybe you hear folks talk about your company’s hybrid cloud approach, but what does that really mean? If you’re confused about the hybrid cloud, you’re not alone. 

Hybrid cloud is a computing approach that uses both private and public cloud resources with some kind of orchestration between them. The term has been applied to a wide variety of IT solutions, so it’s no wonder the concept breeds confusion. 

In this post, we’ll explain what a hybrid cloud is, how it can benefit your business, and how to choose a cloud storage provider for your hybrid cloud strategy.

What Is the Hybrid Cloud?

A hybrid cloud is an infrastructure approach that uses both private and public resources. Let’s first break down those key terms:

  • Public cloud: When you use a public cloud, you are storing your data in another company’s internet-accessible data center. A public cloud service allows anybody to sign up for an account, and share data center resources with other customers or tenants. Instead of worrying about the costs and complexity of operating an on-premises data center, a cloud storage user only needs to pay for the cloud storage they need.
  • Private cloud: In contrast, a private cloud is specifically designed for a single tenant. Think of a private cloud as a permanently reserved private dining room at a restaurant: no other customer can use that space. As a result, private cloud services can be more expensive than public clouds. Traditionally, private clouds lived on on-premises infrastructure, meaning they were built and maintained on company property. Now, private clouds can be maintained and managed on-premises by an organization or by a third party in a data center. The key defining factor is that the cloud is dedicated to a single tenant or organization.

Those terms are important to know to understand the hybrid cloud architecture approach. Hybrid clouds are defined by a combined management approach, which means there is some type of orchestration between the private and public environments that allows workloads and data to move between them in a flexible way as demands, needs, and costs change. This gives you flexibility when it comes to data deployment and usage.  

In other words, if you have some IT resources on-premises that you are replicating or sharing with an external vendor—congratulations, you have a hybrid cloud!

Hybrid cloud refers to a computing architecture that is made up of both private cloud resources and public cloud resources with some kind of orchestration between them.

Hybrid Cloud Examples

Here are a few examples of how a hybrid cloud can be used:

  1. As an active archive: You might establish a protocol that says all accounting files that have not been changed in the last year, for example, are automatically moved off-premises to cloud storage archive to save cost and reduce the amount of storage needed on-site. You can still access the files; they are just no longer stored on your local systems. 
  2. To meet compliance requirements: Let’s say some of your data is subject to strict data privacy requirements, but other data you manage isn’t as closely protected. You could keep highly regulated data on premises in a private cloud and the rest of your data in a public cloud. 
  3. To scale capacity: If you’re in an industry that experiences seasonal or frequent spikes like retail or ecommerce, these spikes can be handled by a public cloud which provides the elasticity to deal with times when your data needs exceed your on-premises capacity.
  4. For digital transformation: A hybrid cloud lets you adopt cloud resources in a phased approach as you expand your cloud presence.
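
To make the active archive example (item 1) a bit more concrete, here is a rough sketch of the selection step only: finding files that have not been modified within the last year. The function name is invented, and moving the files to cloud storage is deliberately left out.

```python
import time
from pathlib import Path


def archive_candidates(root, max_age_days=365, now=None):
    """List files under root whose last modification is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86_400  # seconds in a day
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

A real policy would likely also look at access patterns and file types, not just modification time.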

Hybrid Cloud vs. Multi-cloud: What’s the Diff?

You wouldn’t be the first person to think that the terms multi-cloud and hybrid cloud appear similar. Both of these approaches involve using multiple clouds. However, multi-cloud combines two or more clouds of the same type (e.g., two or more public clouds), while a hybrid cloud approach combines a private cloud with a public cloud. One cloud approach is not necessarily better than the other; they simply serve different use cases. 

For example, let’s say you’ve already invested in significant on-premises IT infrastructure, but you want to take advantage of the scalability of the cloud. A hybrid cloud solution may be a good fit for you. 

Alternatively, a multi-cloud approach may work best for you if you are already in the cloud and want to mitigate the risk of a single cloud provider having outages or issues. 

Hybrid Cloud Benefits

A hybrid cloud approach allows you to take advantage of the best elements of both private and public clouds. The primary benefits are flexibility, scalability, and cost savings.

Benefit 1: Flexibility and Scalability

One of the top benefits of the hybrid cloud is its flexibility. Managing IT infrastructure on-premises can be time consuming and expensive, and adding capacity requires advance planning, procurement, and upfront investment.

The public cloud is readily accessible and able to provide IT resources whenever needed on short notice. For example, the term “cloud bursting” refers to the on-demand and temporary use of the public cloud when demand exceeds resources available in the private cloud. A private cloud, on the other hand, provides the absolute fastest access speeds since it is generally located on-premises. (But cloud providers are catching up fast, for what it’s worth.) For data that is needed with the absolute lowest levels of latency, it may make sense for the organization to use a private cloud for current projects and store an active archive in a less expensive, public cloud.

Benefit 2: Cost Savings

Within the hybrid cloud framework, the public cloud segment offers cost-effective IT resources, eliminating the need for upfront capital expenses and associated labor costs. IT professionals gain the flexibility to optimize configurations, choose the most suitable service provider, and determine the optimal location for each workload. This strategic approach reduces costs by aligning resources with specific tasks. Furthermore, the ability to easily scale, redeploy, or downsize services enhances efficiency, curbing unnecessary expenses and contributing to overall cost savings.

Comparing Private vs. Hybrid Cloud Storage Costs

To understand the difference in storage costs between a purely on-premises solution and a hybrid cloud solution, we’ll present two scenarios. For each scenario, we’ll use data storage amounts of 100TB, 1PB, and 2PB. Each table uses the same format; all we’ve done is change how the data is distributed: private (on-premises) or public (off-premises). We are using the costs for our own Backblaze B2 Cloud Storage in this example. The math can be adapted for any set of numbers you wish to use.

Scenario 1: 100% of data in on-premises storage

  Data stored
    On-premises (100%)               100TB      1,000TB    2,000TB
  On-premises monthly cost
    Low ($12/TB/month)               $1,200     $12,000    $24,000
    High ($20/TB/month)              $2,000     $20,000    $40,000

Scenario 2: 20% of data on-premises with 80% in public cloud storage (Backblaze B2)

  Data stored
    On-premises (20%)                20TB       200TB      400TB
    In the cloud (80%)               80TB       800TB      1,600TB
  On-premises monthly cost
    Low ($12/TB/month)               $240       $2,400     $4,800
    High ($20/TB/month)              $400       $4,000     $8,000
  Public cloud monthly cost
    Low ($6/TB/month, Backblaze B2)  $480       $4,800     $9,600
    High ($20/TB/month)              $1,600     $16,000    $32,000
  On-premises + public cloud monthly cost
    Low                              $720       $7,200     $14,400
    High                             $2,000     $20,000    $40,000

As you can see, using a hybrid cloud solution and storing 80% of the data in the cloud with a provider like Backblaze B2 can result in significant savings over storing only on-premises.
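
If you want to rerun the comparison with your own numbers, the arithmetic is simple enough to sketch in a few lines. The function and parameter names are illustrative, and the rates are the low-end figures from the scenarios above.

```python
def monthly_cost(total_tb, on_prem_pct, on_prem_rate, cloud_rate):
    """Return the combined monthly storage cost in USD.

    on_prem_pct is the percentage (0-100) of data kept on-premises;
    both rates are in USD per TB per month.
    """
    on_prem_tb = total_tb * on_prem_pct / 100
    cloud_tb = total_tb - on_prem_tb
    return on_prem_tb * on_prem_rate + cloud_tb * cloud_rate


for total_tb in (100, 1_000, 2_000):
    all_on_prem = monthly_cost(total_tb, 100, on_prem_rate=12, cloud_rate=6)
    hybrid = monthly_cost(total_tb, 20, on_prem_rate=12, cloud_rate=6)
    print(f"{total_tb}TB: on-premises ${all_on_prem:,.0f}, hybrid ${hybrid:,.0f}")
```

At the low-end rates, the hybrid total for 100TB ($720) is 40% below the on-premises total ($1,200). At the high-end rates, where on-premises and public cloud both cost $20/TB/month, the two scenarios cost the same.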

Choosing a Cloud Storage Provider for Your Hybrid Cloud

Okay, so you understand the benefits of using a hybrid cloud approach, what next? Determining the right mix of cloud services may be intimidating because there are so many public cloud options available. Fortunately, there are a few decision factors you can use to simplify setting up your hybrid cloud solution. Here’s what to think about when choosing a public cloud storage provider:

  • Ease of use: Avoiding a steep learning curve can save you hours of work effort in managing your cloud deployments. By contrast, overly complicated pricing tiers or bells and whistles you don’t need can slow you down.
  • Data security controls: Compare how each cloud provider facilitates proper data controls. For example, take a look at features like authentication, Object Lock, and encryption.
  • Data egress fees: Some cloud providers charge additional fees for data egress (i.e., removing data from the cloud). These fees can make it more expensive to switch between providers. In addition to fees, check the data speeds offered by the provider.
  • Interoperability: Flexibility and interoperability are key reasons to use cloud services. Before signing up for a service, understand the provider’s integration ecosystem. A lack of needed integrations may place a greater burden on your team to keep the service running effectively.
  • Storage tiers: Some providers offer different storage tiers where you sacrifice access for lower costs. While the promise of inexpensive cold storage can be attractive, evaluate whether you can afford to wait hours or days to retrieve your data.
  • Pricing transparency: Pay careful attention to the cloud provider’s pricing model and tier options. Consider building a spreadsheet to compare a shortlist of cloud providers’ pricing models.

When Hybrid Cloud Might Not Always Be the Right Fit

The hybrid cloud may not always be the optimal solution, particularly for smaller organizations with limited IT budgets that might find a purely public cloud approach more cost-effective. The substantial setup and operational costs of private servers could be prohibitive.

A thorough understanding of workloads is crucial to effectively tailor the hybrid cloud, ensuring the right blend of private, public, and traditional IT resources for each application and maximizing the benefits of the hybrid cloud architecture.

So, Should You Go Hybrid?

Big picture, anything that helps you respond to IT demands quickly, easily, and affordably is a win. With a hybrid cloud, you can avoid some big up-front capital expenses for in-house IT infrastructure, making your CFO happy. Being able to quickly spin up IT resources as they’re needed will appeal to the CTO and VP of operations.

So, given all that, we’ve arrived at the bottom line and the question is, should you or your organization embrace hybrid cloud infrastructure? According to Flexera’s 2023 State of the Cloud report, 72% of enterprises utilize a hybrid cloud strategy. That indicates that the benefits of the hybrid cloud appeal to a broad range of companies.

If an organization approaches implementing a hybrid cloud solution with thoughtful planning and a structured approach, a hybrid cloud can deliver on-demand flexibility, empower legacy systems and applications with new capabilities, and become a catalyst for digital transformation. The result can be an elastic and responsive infrastructure that has the ability to quickly adapt to changing demands of the business.

As data management professionals increasingly recognize the advantages of the hybrid cloud, we can expect more and more of them to embrace it as an essential part of their IT strategy.

Tell Us What You’re Doing With the Hybrid Cloud

Are you currently embracing the hybrid cloud, or are you still uncertain or hanging back because you’re satisfied with how things are currently? We’d love to hear your comments below on how you’re approaching your cloud architecture decisions.

FAQs About Hybrid Cloud

What exactly is a hybrid cloud?

Hybrid cloud is a computing approach that uses both private and public cloud resources with some kind of orchestration between them.

What is the difference between hybrid and multi-cloud?

Multi-cloud combines two or more clouds of the same type (e.g., two or more public clouds), while a hybrid cloud approach combines a private cloud with a public cloud. One cloud approach is not necessarily better than the other; they simply serve different use cases.

What is a hybrid cloud architecture?

Hybrid cloud architecture is any kind of IT architecture that combines both the public and private clouds. Many organizations use this term to describe specific software products that provide solutions which combine the two types of clouds.

What are hybrid clouds used for?

Organizations will often use hybrid clouds to create redundancy and scalability for their computing workload. A hybrid cloud is a great way for a company to have extra fallback options to continue offering services even when they have higher than usual levels of traffic, and it can also help companies scale up their services over time as they need to offer more options.

The post What Is Hybrid Cloud? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Backblaze Drive Stats for Q3 2023

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2023/

A decorative image showing the title Q3 2023 Drive Stats.

At the end of Q3 2023, Backblaze was monitoring 263,992 hard disk drives (HDDs) and solid state drives (SSDs) in our data centers around the world. Of that number, 4,459 are boot drives, with 3,242 being SSDs and 1,217 being HDDs. The failure rates for the SSDs are analyzed in the SSD Edition: 2023 Drive Stats review.

That leaves us with 259,533 HDDs that we’ll focus on in this report. We’ll review the quarterly and lifetime failure rates of the data drives as of the end of Q3 2023. Along the way, we’ll share our observations and insights on the data presented, and, for the first time ever, we’ll reveal the drive failure rates broken down by data center.

Q3 2023 Hard Drive Failure Rates

At the end of Q3 2023, we were managing 259,533 hard drives used to store data. For our review, we removed 449 drives from consideration as they were used for testing purposes, or were drive models which did not have at least 60 drives. This leaves us with 259,084 hard drives grouped into 32 different models. 

The table below reviews the annualized failure rate (AFR) for those drive models for the Q3 2023 time period.

A table showing the quarterly annualized failure rates of Backblaze hard drives.

Notes and Observations on the Q3 2023 Drive Stats

  • The 22TB drives are here: At the bottom of the list you’ll see the WDC 22TB drives (model: WUH722222ALE6L4). A Backblaze Vault of 1,200 drives (plus four) is now operational. The 1,200 drives were installed on September 29, so they only have one day of service each in this report, but zero failures so far.
  • The old get bolder: At the other end of the time-in-service spectrum are the 6TB Seagate drives (model: ST6000DX000) with an average of 101 months in operation. This cohort had zero failures in Q3 2023 with 883 drives and a lifetime AFR of 0.88%.
  • Zero failures: In Q3, six different drive models managed to have zero drive failures during the quarter. But only the 6TB Seagate, noted above, had over 50,000 drive days, our minimum standard for ensuring we have enough data to make the AFR plausible.
  • One failure: There were four drive models with one failure during Q3. After applying the 50,000 drive day metric, two drive models stood out:
    1. WDC 16TB (model: WUH721816ALE6L0) with a 0.15% AFR.
    2. Toshiba 14TB (model: MG07ACA14TEY) with a 0.63% AFR.

The Quarterly AFR Drops

In Q3 2023, quarterly AFR for all drives was 1.47%. That was down from 2.2% in Q2 and also down from 1.65% a year ago. The quarterly AFR is based on just the data in that quarter, so it can often fluctuate from quarter to quarter. 

In our Q2 2023 report, we suspected the 2.2% for the quarter was due to the overall aging of the drive fleet and in particular we pointed a finger at specific 8TB, 10TB, and 12TB drive models as potential culprits driving the increase. That prediction fell flat in Q3 as nearly two-thirds of drive models experienced a decreased AFR quarter over quarter from Q2 and any increases were minimal. This included our suspect 8TB, 10TB, and 12TB drive models. 

It seems Q2 was an anomaly, but there was one big difference in Q3: we retired 4,585 aging 4TB drives. The average age of the retired drives was just over eight years, and while that was a good start, there’s another 28,963 4TB drives to go. To facilitate the continuous retirement of aging drives and make the data migration process easy and safe, we use CVT, our awesome in-house data migration software, which we’ll cover at another time.

A Hot Summer and the Drive Stats Data

As anyone should in our business, Backblaze continuously monitors our systems and drives. So, it was of little surprise to us when the folks at NASA confirmed the summer of 2023 as Earth’s hottest on record. The effects of this record-breaking summer showed up in our monitoring systems in the form of drive temperature alerts. A given drive in a storage server can heat up for many reasons: it is failing; a fan in the storage server has failed; other components are producing additional heat; the air flow is somehow restricted; and so on. Add in the fact that the ambient temperature within a data center often increases during the summer months, and you can get more temperature alerts.

In reviewing the temperature data for our drives in Q3, we noticed that a small number of drives exceeded the manufacturer’s maximum temperature for at least one day. The maximum temperature for most drives is 60°C, except for the 12TB, 14TB, and 16TB Toshiba drives which have a maximum temperature of 55°C. Of the 259,533 data drives in operation in Q3, there were 354 individual drives (about 0.14%) that exceeded their manufacturer’s maximum temperature. Of those, only two drives failed, leaving 352 drives which were still operational as of the end of Q3.
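The check described above can be sketched in a few lines. This is only an illustration: it assumes 60°C applies to every drive that is not a 12TB, 14TB, or 16TB Toshiba (the text says "most"), and the tuple layout, serial numbers, and function names are hypothetical rather than how our monitoring is actually built.

```python
# Manufacturer maximums quoted in the text, in degrees Celsius.
DEFAULT_MAX_C = 60  # assumption: treated here as the limit for all other drives
TOSHIBA_MAX_C = 55
TOSHIBA_55C_CAPACITIES_TB = {12, 14, 16}


def max_temp_c(manufacturer: str, capacity_tb: int) -> int:
    """Return the maximum rated temperature for a drive."""
    if manufacturer.lower() == "toshiba" and capacity_tb in TOSHIBA_55C_CAPACITIES_TB:
        return TOSHIBA_MAX_C
    return DEFAULT_MAX_C


def hot_drives(readings):
    """Return serials of drives that exceeded their maximum on at least one day.

    readings: iterable of (serial, manufacturer, capacity_tb, temp_c) daily rows.
    """
    return {
        serial
        for serial, manufacturer, capacity_tb, temp_c in readings
        if temp_c > max_temp_c(manufacturer, capacity_tb)
    }


# Made-up daily readings for illustration.
sample = [
    ("SER-A", "Seagate", 4, 61),   # over 60 on one day
    ("SER-A", "Seagate", 4, 58),
    ("SER-B", "Toshiba", 14, 57),  # over the 55 limit for 14TB Toshiba
    ("SER-C", "Toshiba", 8, 57),   # 8TB Toshiba falls under the 60 limit
    ("SER-D", "WDC", 16, 60),      # at the limit, not over it
]
flagged = hot_drives(sample)
print(sorted(flagged))
```

A drive counts once no matter how many days it ran hot, which matches counting "individual drives" rather than alerts.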

While temperature fluctuation is part of running data centers and temp alerts like these aren’t unheard of, our data center teams are looking into the root causes to ensure we’re prepared for the inevitability of increasingly hot summers to come.

Will the Temperature Alerts Affect Drive Stats?

The two drives which exceeded their maximum temperature and failed in Q3 have been removed from the Q3 AFR calculations. Both drives were 4TB Seagate drives (model: ST4000DM000). Given that the remaining 352 drives which exceeded their temperature maximum did not fail in Q3, we have left them in the Drive Stats calculations for Q3 as they did not increase the computed failure rates.

Beginning in Q4, we will remove the 352 drives from the regular Drive Stats AFR calculations and create a separate cohort of drives to track that we’ll name Hot Drives. This will allow us to track the drives which exceeded their maximum temperature and compare their failure rates to those drives which operated within the manufacturer’s specifications. While there are a limited number of drives in the Hot Drives cohort, it could give us some insight into whether drives being exposed to high temperatures could cause a drive to fail more often. This heightened level of monitoring will identify any increase in drive failures so that they can be detected and dealt with expeditiously.

New Drive Stats Data Fields in Q3

In Q2 2023, we introduced three new data fields that we started populating in the Drive Stats data we publish: vault_id, pod_id, and is_legacy_format. In Q3, we are adding three more fields to each drive record as follows:

  • datacenter: The Backblaze data center where the drive is installed, currently one of these values: ams5, iad1, phx1, sac0, and sac2.
  • cluster_id: The name of a given collection of storage servers logically grouped together to optimize system performance. Note: At this time the cluster_id is not always correct; we are working on fixing that.
  • pod_slot_num: The physical location of a drive within a storage server. The specific slot differs based on the storage server type and capacity: Backblaze (45 drives), Backblaze (60 drives), Dell (26 drives), or Supermicro (60 drives). We’ll dig into these differences in another post.

With these additions, the new schema beginning in Q3 2023 is:

  • date
  • serial_number
  • model
  • capacity_bytes
  • failure
  • datacenter (Q3)
  • cluster_id (Q3)
  • vault_id (Q2)
  • pod_id (Q2)
  • pod_slot_num (Q3)
  • is_legacy_format (Q2)
  • smart_1_normalized
  • smart_1_raw
  • The remaining SMART value pairs (as reported by each drive model)

Beginning in Q3, these data fields have been added to the publicly available Drive Stats files that we publish each quarter.

Failure Rates by Data Center

Now that we have the data center for each drive we can compute the AFRs for the drives in each data center. Below you’ll find the AFR for each of five data centers for Q3 2023.

A chart showing Backblaze annualized failure rates by data center.
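If you want to try this grouping on the published files, a rough sketch follows. It assumes each CSV row represents one drive on one day (so counting rows gives drive days), uses only a subset of the columns, and runs on a tiny made-up sample rather than real data. The function name and the "null" label for blank values are illustrative choices.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using a subset of the Drive Stats columns.
SAMPLE_CSV = """date,serial_number,model,capacity_bytes,failure,datacenter
2023-07-01,S1,MODEL-A,4000787030016,0,sac0
2023-07-02,S1,MODEL-A,4000787030016,1,sac0
2023-07-01,S2,MODEL-B,16000900608000,0,iad1
2023-07-02,S2,MODEL-B,16000900608000,0,iad1
2023-07-01,S3,MODEL-B,16000900608000,0,
"""


def afr_by_datacenter(csv_text: str) -> dict:
    """Sum drive days and failures per data center, then compute AFR (%)."""
    drive_days = defaultdict(int)
    failures = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        dc = row["datacenter"] or "null"  # blank values get their own bucket
        drive_days[dc] += 1               # one row is one drive day
        failures[dc] += int(row["failure"])
    return {dc: failures[dc] / (drive_days[dc] / 365) * 100 for dc in drive_days}


rates = afr_by_datacenter(SAMPLE_CSV)
for dc, afr in sorted(rates.items()):
    print(dc, f"{afr:.2f}%")
```

With only a handful of drive days the percentages are meaningless, of course; the point is the grouping, not the numbers.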

Notes and Observations

  • Null?: The drives which reported a null or blank value for their data center are grouped in four Backblaze vaults. David, the Senior Infrastructure Software Engineer for Drive Stats, described the process of how we gather all the parts of the Drive Stats data each day. The TL;DR is that vaults can be too busy to respond at the moment we ask, and since the data center field is nice-to-have data, we get a blank field. We can go back a day or two to find the data center value, which we will do in the future when we report this data.
  • sac0?: sac0 has the highest AFR of all of the data centers, but it also has the oldest drives, nearly twice as old, on average, as those in the next closest data center, sac2. As discussed previously, drive failures do seem to follow the “bathtub curve”, although recently we’ve seen the curve start out flatter. Regardless, as drive models age, they do generally fail more often. Another factor could be that sac0, and to a lesser extent sac2, has some of the oldest Storage Pods, including a handful of 45-drive units. We are in the process of using CVT to replace these older servers while migrating from 4TB to 16TB and larger drives.
  • iad1: The iad1 data center is the foundation of our eastern region and has been growing rapidly since coming online about a year ago. The growth is a combination of new data and customers using our cloud replication capability to automatically make a copy of their data in another region.
  • Q3 Data: This chart is for Q3 data only and includes all the data drives, including those models with fewer than 60 drives. As we track this data over the coming quarters, we hope to get some insight into whether different data centers really have different drive failure rates, and, if so, why.

Lifetime Hard Drive Failure Rates

As of September 30, 2023, we were tracking 259,084 hard drives used to store customer data. For our lifetime analysis, we collect the number of drive days and the number of drive failures for each drive beginning from the time a drive was placed into production in one of our data centers. We group these drives by model, then sum up the drive days and failures for each model over their lifetime. That chart is below. 

A chart showing Backblaze lifetime hard drive failure rates.

One of the most important columns on this chart is the confidence interval, which is the difference between the low and high AFR confidence levels calculated at 95%. The lower the value, the more certain we are of the AFR stated. We like a confidence interval to be 0.5% or less. When the confidence interval is higher, that is not necessarily bad, it just means we either need more data or the data is somewhat inconsistent. 
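To make the idea concrete, here is one way a 95% interval could be estimated. It is a sketch only: it treats failures as a Poisson count and uses a normal approximation, which is an assumption for illustration and not necessarily the method behind the chart. The function name and inputs are hypothetical.

```python
import math


def afr_confidence_interval(failures: int, drive_days: int, z: float = 1.96):
    """Return (low, high, width) of an approximate 95% interval for AFR (%).

    Assumes failures follow a Poisson distribution and applies a normal
    approximation, so it is least reliable when the failure count is small.
    """
    drive_years = drive_days / 365
    afr = failures / drive_years * 100
    half_width = z * math.sqrt(failures) / drive_years * 100
    low = max(0.0, afr - half_width)
    high = afr + half_width
    return low, high, high - low


# Same underlying rate, very different amounts of data (made-up figures).
big_low, big_high, big_width = afr_confidence_interval(400, 10_000_000)
small_low, small_high, small_width = afr_confidence_interval(4, 100_000)
print(f"large cohort width: {big_width:.2f}, small cohort width: {small_width:.2f}")
```

The two cohorts have the same AFR, but only the larger one produces an interval narrower than 0.5%, which is why more drive days make us more certain of the stated rate.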

The table below contains just those drive models which have a confidence interval of less than 0.5%. We have sorted the list by drive size and then by AFR.

A chart showing Backblaze hard drive annualized failure rates with a confidence interval of less than 0.5%.

The 4TB, 6TB, 8TB, and some of the 12TB drive models are no longer in production. The HGST 12TB models in particular can still be found, but they have been relabeled as Western Digital and given alternate model numbers. Whether they have materially changed internally is not known, at least to us.

One final note about the lifetime AFR data: you might have noticed the AFR for all of the drives hasn’t changed much from quarter to quarter. It has vacillated between 1.39% and 1.45% for the last two years. Basically, we have lots of drives with lots of time-in-service so it is hard to move the needle up or down. While the lifetime stats for individual drive models can be very useful, the lifetime AFR for all drives will probably get less and less interesting as we add more and more drives. Of course, a few hundred thousand drives that never fail could arrive, so we will continue to calculate and present the lifetime AFR.

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Stats Data webpage. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free. 

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q3 2023 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AI 101: Training vs. Inference

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-training-vs-inference/

A decorative image depicting a neural network identifying a cat.

What do Sherlock Holmes and ChatGPT have in common? Inference, my dear Watson!

“We approached the case, you remember, with an absolutely blank mind, which is always an advantage. We had formed no theories. We were simply there to observe and to draw inferences from our observations.”
—Sir Arthur Conan Doyle, The Adventure of the Cardboard Box

As we all continue to refine our thinking around artificial intelligence (AI), it’s useful to define terminology that describes the various stages of building and using AI algorithms—namely, the AI training stage and the AI inference stage. As we see in the quote above, these are not new concepts: they’re based on ideas and methodologies that have been around since before Sherlock Holmes’ time. 

If you’re using AI, building AI, or just curious about AI, it’s important to understand the difference between these two stages so you understand how data moves through an AI workflow. That’s what I’ll explain today.

The TL;DR

The difference between these two terms can be summed up fairly simply: first you train an AI algorithm, then your algorithm uses that training to make inferences from data. To create a whimsical analogy, when an algorithm is training, you can think of it like Watson—still learning how to observe and draw conclusions through inference. Once it’s trained, it’s an inferring machine, a.k.a. Sherlock Holmes. 

Whimsy aside, let’s dig a little deeper into the tech behind AI training and AI inference, the differences between them, and why the distinction is important. 

Obligatory Neural Network Recap

Neural networks have emerged as the brainpower behind AI, and a basic understanding of how they work is foundational when it comes to understanding AI.  

Complex decisions, in theory, can be broken down into a series of yeses and nos, which means that they can be encoded in binary. Neural networks have the ability to combine enough of those smaller decisions, weigh how they affect each other, and then use that information to solve complex problems. And, because more complex decisions require more points of information to come to a final decision, they require more processing power. Neural networks are one of the most widely used approaches to AI and machine learning (ML). 

A diagram showing the inputs, hidden layers, and outputs of a neural network.
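As a toy illustration of "combining smaller decisions and weighing them," here is a single artificial neuron in plain Python. The feature names, weights, and bias are invented for the example; a real network has many such units, and its weights come from training rather than being typed in by hand.

```python
def neuron(inputs, weights, bias):
    """Weigh each yes/no input, add them up, and answer yes (1) or no (0)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


# Hypothetical "is it a cat?" unit with inputs:
# [has_whiskers, has_pointy_ears, barks]
weights = [0.6, 0.6, -1.0]  # barking counts strongly against "cat"
bias = -1.0                 # one positive clue alone is not enough

print(neuron([1, 1, 0], weights, bias))  # whiskers and pointy ears, no bark
print(neuron([1, 1, 1], weights, bias))  # same, but it barks
print(neuron([1, 0, 0], weights, bias))  # whiskers only
```

Each input is a small yes or no, and the weights decide how much each one matters to the final answer.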

What Is AI Training?: Understanding Hyperparameters and Parameters

In simple terms, training an AI algorithm is the process through which you take a base algorithm and then teach it how to make the correct decision. This process requires large amounts of data, and can include various degrees of human oversight. How much data you need has a relationship to the number of parameters you set for your algorithm as well as the complexity of a problem. 

We made this handy dandy diagram to show you how data moves through the training process:

A diagram showing how data moves through an AI training algorithm.
As you can see in this diagram, the end result is model data, which then gets saved in your data store for later use.

And hey—we’re leaving out a lot of nuance in that conversation because dataset size, parameter choice, etc. is a graduate-level topic on its own, and usually is considered proprietary information by the companies who are training an AI algorithm. It suffices to say that dataset size and number of parameters are both significant and have a relationship to each other, though it’s not a direct cause/effect relationship. And, both the number of parameters and the size of the dataset affect things like processing resources—but that conversation is outside of scope for this article (not to mention a hot topic in research). 

As with everything, your use case determines your execution. Some types of tasks actually see excellent results with smaller datasets and more parameters, whereas others require more data and fewer parameters. Bringing it back to the real world, here’s a very cool graph showing how many parameters different AI systems have. Note that they very helpfully identified what type of task each system is designed to solve:

So, let’s talk about what parameters are with an example. Back in our very first AI 101 post, we talked about ways to frame an algorithm in simple terms: 

Machine learning does not specify how much knowledge the bot you’re training starts with—any task can have more or fewer instructions. You could ask your friend to order dinner, or you could ask your friend to order you pasta from your favorite Italian place to be delivered at 7:30 p.m. 

Both of those tasks you just asked your friend to complete are algorithms. The first algorithm requires your friend to make more decisions to execute the task at hand to your satisfaction, and they’ll do that by relying on their past experience of ordering dinner with you—remembering your preferences about restaurants, dishes, cost, and so on. 

The factors that help your friend make a decision about dinner are called hyperparameters and parameters. Hyperparameters are those that frame the algorithm: they are set outside the training process, but can influence the training of the algorithm. In the example above, a hyperparameter would be how you structure your dinner feedback. Do you thumbs up or down each dish? Do you write a short review? You get the idea.

Parameters are factors that the algorithm derives through training. In the example above, that’s what time you prefer to eat dinner, which restaurants you enjoy after eating, and so on. 
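In code, the split looks something like the sketch below. It fits a straight line with basic gradient descent, which is just one simple training method chosen for illustration. The learning rate and number of epochs are hyperparameters we pick before training; `w` and `b` are parameters the algorithm works out from the (made-up) data.

```python
def train(data, learning_rate, epochs):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: start blank, get learned from the data
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up training data that follows y = 2x + 1.
training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# Hyperparameters: chosen by us, outside the training loop.
w, b = train(training_data, learning_rate=0.05, epochs=2000)
print(round(w, 3), round(b, 3))
```

Change the hyperparameters and training behaves differently (too large a learning rate and it never settles), but the values of `w` and `b` are never set by hand.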

When you’ve trained a neural network, there will be heavier weights between various nodes. That’s a shorthand way of saying that an algorithm will prefer a path it knows is significant, and if you want to really get nerdy with it, this article is well-researched, has a ton of math explainers for various training methods, and includes some fantastic visuals. For our purposes, here’s one way people visualize a “trained” algorithm:

An image showing a neural network that has prioritized certain pathways after training.
Source.

The “dropout method” is essentially adding weight to the relationships an AI algorithm has found to be significant for the dataset it’s working on. It can then de-prioritize (or sometimes even eliminate) the other relationships. 

Once you have a trained algorithm, then you can use it with a reasonable degree of certainty that it will give you good results, and that leads us to inference. 

What Is AI Inference?

Once you’ve trained your algorithm, you can send it out in the world to do its job (and make yours easier). When you present a trained AI algorithm with a problem and it gives you an answer, that’s called inference. It’s using the way it was trained to draw conclusions or make predictions, depending on how it was built, and once an algorithm is in the “inference stage”, it’s no longer learning (usually). 

Here’s our diagram for how data might move through an inference process: 

A diagram showing how data moves through an inference workflow.
As you can see, if we want to add more training data to the algorithm, we’d do so as a separate input/output flow. The goal of this process is to output data.

As we’ve seen in the real world, this does not mean that the AI model will give you a perfect answer every time—depending on, for example, the AI’s access to new or current information, the complexity of training, etc., the algorithm might give you a better or worse (more or less accurate) answer. But, it will be able to look at a new piece of information or dataset, and return a response. 
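Here is a small sketch of what the inference side can look like. It assumes the model data is nothing more than two numbers saved as JSON, which is a deliberate oversimplification; the names `saved_model`, `load_model`, and `infer` are hypothetical.

```python
import json

# Hypothetical model data, as a training run might have saved it to a data store.
saved_model = json.dumps({"w": 2.0, "b": 1.0})


def load_model(blob: str) -> dict:
    """Read the trained parameters back from storage."""
    return json.loads(blob)


def infer(model: dict, x: float) -> float:
    """Apply the trained parameters to new input. Nothing is learned here."""
    return model["w"] * x + model["b"]


model = load_model(saved_model)
before = dict(model)

outputs = [infer(model, x) for x in (10.0, 20.5)]
print(outputs)

# The parameters are exactly what training left behind.
print(model == before)
```

The inputs are new to the model, yet the parameters stay untouched; adding more training data would be a separate flow that produces a new set of model data.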

When I asked our friend ChatGPT about this, it broke down inference and defined it in relation to the logical underpinnings of different types of algorithms. (As a mere human, how could I resist using inference while writing about inference?) Sure, ChatGPT may be merely mimicking human-like dialogue based on its ingestion of the whole internet, then giving me the most likely answer based on its specialized neural network that predicts the likelihood of the next word in response to my conversational query and without access to data after January 2022, but here’s what our AI friend had to say for itself: 

  • Machine Learning Inference: In machine learning, after a model has been trained on a dataset, it is deployed to make predictions or classifications on new, unseen data. During inference, the model takes the input data, processes it, and produces an output or a prediction based on the patterns and relationships it learned during training. For example, a trained machine learning model might be used to classify images, predict stock prices, or identify spam emails.
  • Deep Learning Inference: Deep learning, a subfield of machine learning, involves neural networks with multiple layers. In deep learning, inference means feeding an input through a trained deep neural network and obtaining an output. This is commonly used in applications like image recognition, natural language processing, and speech recognition. Inference in deep learning can be computationally intensive, and specialized hardware, such as GPUs and TPUs, are often used to accelerate the process.
  • Rule-Based Systems Inference: In rule-based AI systems, inference is the process of applying logical rules to make decisions or draw conclusions. These systems use a set of predefined rules and facts to reason and infer new information. For example, in an expert system for medical diagnosis, inference involves applying a set of rules to patient symptoms and medical knowledge to arrive at a diagnosis.
  • Probabilistic Inference: In probabilistic AI, inference is used to estimate probabilities and uncertainty. Bayesian networks and Markov decision processes are examples of probabilistic models that perform inference to compute probabilities of events or make decisions based on probabilities.

You’ll notice that each of these is saying basically the same thing: the AI algorithm applies its decision-making paradigm to a problem.

Why Stop Learning During the Inference Stage?

In general, it’s important to keep these two stages—training and inference—of an AI algorithm separate for a few reasons: 

  • Efficiency: Training is typically a computationally intensive process, whereas inference is usually faster and less resource-intensive. Separating them allows for efficient use of computational resources.
  • Generalization: The model’s ability to generalize from training data to unseen data is a key feature. It should not learn from every new piece of data it encounters during inference to maintain this generalization ability.
  • Reproducibility: When using trained models in production or applications, it’s important to have consistency and reproducibility in the results. If models were allowed to learn during inference, it would introduce variability and unpredictability in their behavior.

There are some specialized AI algorithms that want to continue learning during the inference stage—your Netflix algorithm is a good example, as are self-driving cars, or dynamic pricing models used to set airfare pricing. On the other hand, the majority of problems we’re trying to solve with AI algorithms deliver better decisions by separating these two phases—think of things like image recognition, language translation, or medical diagnosis, for example.

Training vs. Inference (But, Really: Training Then Inference)

To recap: the AI training stage is when you feed data into your learning algorithm to produce a model, and the AI inference stage is when your algorithm uses that training to make inferences from data. Here’s a chart for quick reference:

Training | Inference
Feed training data into a learning algorithm. | Apply the model to inference data.
Produces a model comprising code and data. | Produces output data.
One time(ish). Retraining is sometimes necessary. | Often continuous.

The difference may seem inconsequential at first glance, but defining these two stages helps to show implications for AI adoption, particularly for businesses. That is, given that inference is much less resource intensive (and therefore, less expensive) than training, it’s likely to be much easier for businesses to integrate already-trained AI algorithms with their existing systems.

And, as always, we’re big believers in demystifying terminology for discussion purposes. Let us know what you think in the comments, and feel free to let us know what you’re interested in learning about next.

The post AI 101: Training vs. Inference appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Anything as a Service: All the “as a Service” Acronyms You Didn’t Know You Needed

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/anything-as-a-service-all-the-as-a-service-acronyms-you-didnt-know-you-needed/

A decorative image showing different "as a service" acronyms.

Have you ever felt like you need a dictionary just to understand what tech-savvy folks are talking about? Well, you’re in luck, because we’re about to decode some of the most common jargon of the digital age, one acronym at a time. Welcome to the world of “as a Service” acronyms, where we take the humble alphabet and turn it into a digital buffet. 

So, whether you’re SaaS-savvy or PaaS-puzzled, or just someone desperately searching for a little HaaS (Humor as a Service …yeah, we made that one up), you’ve come to the right place. Let’s take a big slurp from this alphabet soup of tech terms.

The One That Started It All: SaaS

SaaS stands for software as a service, and it’s the founding member of the “as a service” nomenclature. (Though, very confusingly, there’s also Sales as a Service—it’s just not shortened to SaaS. Usually.)

Imagine your software as a pizza delivery service. You don’t need to buy all the ingredients, knead the dough, and bake it yourself. Instead, you simply order a slice, and it magically appears on your table (a.k.a. screen). SaaS products are like that, but instead of pizza they serve up everything from messaging to video conferencing to email marketing to …well, really you name it. Which brings us to…

The Kind of Ironic One: XaaS

XaaS stands for, variously, “everything” or “anything” as a service. No one is really sure about the term’s provenance, but it’s a fair guess to say it came into existence when, well, everything started to become a service, probably sometime around the mid-2010s. The thinking is: if it exists in the digital realm, you can probably get it “as a service.” 

The Hardware Related Ones: HaaS, IaaS, and PaaS

HaaS (Hardware as a Service): Instead of purchasing hardware yourself, like computers, servers, networking equipment, and other physical infrastructure components, with HaaS, you can lease or rent the equipment for a given period. It would be like renting a pizza kitchen to make your specialty pies specifically for your sister’s wedding or your grandma’s birthday.

IaaS (Infrastructure as a Service): Infrastructure as a service is kind of like hardware as a service, but it comes with some additional goodies thrown in. Instead of renting just the kitchen, you rent the whole restaurant, chairs, tables, and servers (no pun intended) included. IaaS delivers virtualized computing resources, like virtual machines, storage (that’s us!), and networking, over the internet.

PaaS (Platform as a Service): Think of PaaS as a step even further than IaaS—you’re not just renting a pizza restaurant, you’re renting a test kitchen where you can develop your award-winning pie. PaaS provides developers the ability to build, manage, and deploy applications with services like development frameworks, databases, and infrastructure management. It’s the ultimate DIY platform for tech enthusiasts.

The Bad One: RaaS

RaaS stands for Ransomware as a Service, and this is one “as a service” variant you don’t want to mess with. Basically, cybercriminals can purchase ransomware just as easily as you would purchase any app on the app store (it’s probably more complicated than that, but you get the general gist). This makes it easy for even the least savvy cybercriminal to get into the ransomware game. Not great. 

The Ones That Help With the Last One: BaaS and DRaaS

BaaS (Backup as a Service): Backup as a Service is a cloud-based data protection solution that allows individuals and organizations to back up their data to a remote cloud. (Hey! That’s us too!) Instead of managing on-premises backup infrastructure, users can securely store their data off-site, often on highly redundant and geographically distributed servers.

DRaaS (Disaster Recovery as a Service): DRaaS stands for disaster recovery as a service, and it’s the antidote to RaaS. Of course, you need good backups to begin with, but adding DRaaS allows businesses to ensure specific recovery time objectives (RTOs, FYI) so they can get back up and running in the event they’re attacked by ransomware or there’s a natural disaster at their primary storage location. DRaaS solutions used to be made almost exclusively with the large enterprise in mind, but today, it’s possible to architect a DRaaS solution for your business affordably and easily.

The Analytical One: DaaS

DaaS stands for data as a service, and it’s your data’s personal chauffeur. It fetches the information you need and serves it up on a silver platter. DaaS offers data on-demand, making structured data accessible to users over the internet. It simplifies data sharing and access, often in real-time, without the need for complex data management.

The Development-Focused Ones: CaaS, BaaS (again), and FaaS

CaaS (Containers as a Service): CaaS simplifies the deployment, scaling, and orchestration of containerized applications. It’s the tech version of a literal container ship. The individual containers “ship” individual pieces of software, and a CaaS tool helps carry all of those individual containers. Check out container management software Docker’s logo for a visualization:

It looks more like a whale carrying containers, which is far more adorable, in our opinion.

BaaS (Backend as a Service): It wouldn’t be the first time an acronym has two meanings. BaaS, in this context, provides a backend infrastructure for mobile and web app developers, offering services like databases, user authentication, and APIs. Imagine your own team of digital butlers tending to the back end of your apps. They handle all the behind-the-scenes stuff, so you can focus on making your app shine. 

FaaS (Function as a Service): FaaS is a serverless computing model where developers focus on writing and deploying individual functions or code snippets. These functions run in response to specific events, promoting scalability and efficiency in application development. It’s like having a team of tiny, code-savvy robots doing your bidding.
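To picture the "functions run in response to events" idea, here is a tiny stand-in written in plain Python. It is not any provider's API: the `on` decorator, the `dispatch` function, and the event shape are all invented for illustration, and a real FaaS platform would handle the routing and scaling for you.

```python
HANDLERS = {}


def on(event_type):
    """Register a function to run whenever an event of this type arrives."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register


@on("file.uploaded")
def make_thumbnail(event):
    # The developer writes only this small function.
    return f"thumbnail queued for {event['name']}"


def dispatch(event):
    """Stand-in for the platform: find the matching function and call it."""
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return None
    return handler(event)


print(dispatch({"type": "file.uploaded", "name": "pizza.jpg"}))
print(dispatch({"type": "file.deleted", "name": "pizza.jpg"}))
```

The function does nothing until an event shows up, which is the whole appeal: no event, no work, no server to babysit.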

Go Forth and Abbreviate

Now that you’ve sampled all of the flavors the vast “as a service” world has to offer, we hope you’ve gained a clearer understanding of these sometimes confounding terms. So whether you’re a business professional navigating the cloud or just curious about the tech world, you can wield these acronyms with confidence. 

Did we miss any? I’m sure. Let us know in the comments.

The post Anything as a Service: All the “as a Service” Acronyms You Didn’t Know You Needed appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How We Achieved Upload Speeds Faster Than AWS S3

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/2023-performance-improvements/

An image of a city skyline with lines going up to a cloud.

You don’t always need the absolute fastest cloud storage—your performance requirements depend on your use case, business objectives, and security needs. But still, faster is usually better. And Backblaze just announced innovation on B2 Cloud Storage that delivers a lot more speed: most file uploads will now be up to 30% faster than AWS S3. 

Today, I’m diving into all of the details of this performance improvement, how we did it, and what it means for you.

The TL;DR

The Results: Customers who rely on small file uploads (1MB or less) can expect to see 10–30% faster uploads on average based on our tests, all without any change to durability, availability, or pricing. 

What Does This Mean for You? 

All B2 Cloud Storage customers will benefit from these performance enhancements, especially those who use Backblaze B2 as a storage destination for data protection software. Small uploads of 1MB or less make up about 70% of all uploads to B2 Cloud Storage and are common for backup and archive workflows. Specific benefits of the performance upgrades include:

  • Secures data in offsite backup faster.
  • Frees up time for IT administrators to work on other projects.
  • Decreases congestion on network bandwidth.
  • Deduplicates data more efficiently.

Veeam® is dedicated to working alongside our partners to innovate and create a united front against cyber threats and attacks. The new performance improvements released by Backblaze for B2 Cloud Storage furthers our mission to provide radical resilience to our joint customers.

—Andreas Neufert, Vice President, Product Management, Alliances, Veeam

When Can I Expect Faster Uploads?

Today. The performance upgrades have been fully rolled out across Backblaze’s global data regions.

How We Did It

Prior to this work, when a customer uploaded a file to Backblaze B2, the data was written to multiple hard disk drives (HDDs). Those operations had to be completed before returning a response to the client. Now, we write the incoming data to the same HDDs and also, simultaneously, to a pool of solid state drives (SSDs) we call a “shard stash,” waiting only for the HDD writes to make it to the filesystems’ in-memory caches and the SSD writes to complete before returning a response. Once the writes to HDD are complete, we free up the space from the SSDs so it can be reused.

Since writing data to an SSD is much faster than writing to HDDs, the net result is faster uploads. 

That’s just a brief summary; if you’re interested in the technical details (as well as the results of some rigorous testing), read on!

The Path to Performance Upgrades

As you might recall from many Drive Stats blog posts and webinars, Backblaze stores all customer data on HDDs, affectionately termed ‘spinning rust’ by some. We’ve historically reserved SSDs for Storage Pod (storage server) boot drives. 

Until now. 

That’s right—SSDs have entered the data storage chat. To achieve these performance improvements, we combined the performance of SSDs with the cost efficiency of HDDs. First, I’ll dig into a bit of history to add some context to how we went about the upgrades.

HDD vs. SSD

IBM shipped the first hard drive way back in 1957, so it’s fair to say that the HDD is a mature technology. Drive capacity and data rates have steadily increased over the decades while cost per byte has fallen dramatically. That first hard drive, the IBM RAMAC 350, had a total capacity of 3.75MB, and cost $34,500. Adjusting for inflation, that’s about $375,000, equating to $100,000 per MB, or $100 billion per TB, in 2023 dollars.

A photograph of people pushing one of the first hard disk drives into a truck.
An early hard drive shipped by IBM. Source.

Today, the 16TB version of the Seagate Exos X16, an HDD widely deployed in the Backblaze B2 Storage Cloud, retails for around $260, or $16.25 per TB. If it had the same cost per byte as the IBM RAMAC 350, it would sell for $1.6 trillion, more than the annual GDP of all but a handful of countries!

SSDs, by contrast, have only been around since 1991, when SanDisk’s 20MB drive shipped in IBM ThinkPad laptops for an OEM price of about $1,000. Let’s consider a modern SSD: the 3.2TB Micron 7450 MAX. Retailing at around $360, the Micron SSD is priced at $112.50 per TB, nearly seven times as much as the Seagate HDD.

So, HDDs easily beat SSDs in terms of storage cost, but what about performance? Here are the numbers from the manufacturers’ data sheets:

Specification | Seagate Exos X16 | Micron 7450 MAX
Model number | ST16000NM001G | MTFDKCB3T2TFS
Capacity | 16TB | 3.2TB
Drive cost | $260 | $360
Cost per TB | $16.25 | $112.50
Max sustained read rate (MB/s) | 261 | 6,800
Max sustained write rate (MB/s) | 261 | 5,300
Random read rate, 4kB blocks (IOPS) | 170/440* | 1,000,000
Random write rate, 4kB blocks (IOPS) | 170/440* | 390,000

* Since HDD platters rotate at a constant rate, 7,200 RPM in this case, they can transfer more blocks per revolution at the outer edge of the disk than close to the middle, hence the two figures for the X16’s random transfer rates.

The SSD is over 20 times as fast as the HDD at sustained data transfer, but look at the difference in random transfer rates! Even when the HDD is at its fastest, transferring blocks from the outer edge of the disk, the SSD is over 2,200 times faster reading data and nearly 900 times faster for writes.

This massive difference is due to the fact that, when reading data from random locations on the disk, the platters have to complete an average of 0.5 revolutions between blocks. At 7,200 rotations per minute (RPM), that means that the HDD spends about 4.2ms just spinning to the next block before it can even transfer data. In contrast, the SSD’s data sheet quotes its latency as just 80µs (that’s 0.08ms) for reads and 15µs (0.015ms) for writes, between about 52 and 280 times faster than the spinning disk.

Let’s consider a real-world operation, say, writing 64kB of data. Assuming the HDD can write that data to sequential disk sectors, it will spin for an average of 4.2ms, then spend 0.25ms writing the data to the disk, for a total of 4.5ms. The SSD, in contrast, can write the data to any location instantaneously, taking just 27µs (0.027ms) to do so. This (somewhat theoretical) 167x speed advantage is the basis for the performance improvement.
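The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration using the data sheet figures quoted above. It assumes that 1kB is 1,000 bytes and that the SSD write time is its quoted write latency plus the transfer time at its maximum sustained write rate, an assumption that happens to reproduce the 27µs figure. The function names are made up for this example.

```python
# Figures from the data sheet table and text above.
HDD_RPM = 7_200
HDD_WRITE_MB_PER_S = 261
SSD_WRITE_MB_PER_S = 5_300
SSD_WRITE_LATENCY_S = 15e-6
BLOCK_BYTES = 64_000  # 64kB, assuming 1kB = 1,000 bytes


def hdd_write_time(block_bytes: int) -> float:
    """Average rotational delay (half a revolution) plus sequential transfer time."""
    half_revolution_s = 0.5 * 60 / HDD_RPM
    transfer_s = block_bytes / (HDD_WRITE_MB_PER_S * 1_000_000)
    return half_revolution_s + transfer_s


def ssd_write_time(block_bytes: int) -> float:
    """Assumed model: quoted write latency plus transfer time at the sustained rate."""
    transfer_s = block_bytes / (SSD_WRITE_MB_PER_S * 1_000_000)
    return SSD_WRITE_LATENCY_S + transfer_s


hdd = hdd_write_time(BLOCK_BYTES)
ssd = ssd_write_time(BLOCK_BYTES)
print(f"HDD: {hdd * 1e3:.2f} ms")
print(f"SSD: {ssd * 1e6:.1f} microseconds")
print(f"Ratio: {hdd / ssd:.0f}x")
```

With unrounded inputs the ratio comes out slightly lower than 167x, because the figures in the text round the intermediate values (4.2ms plus 0.25ms up to 4.5ms) before dividing.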

Why did I choose a 64kB block? As we mentioned in a recent blog post focusing on cloud storage performance, in general, bigger files are better when it comes to the aggregate time required to upload a dataset. However, there may be other requirements that push for smaller files. Many backup applications split data into fixed-size blocks for upload as files to cloud object storage. There is a trade-off in choosing the block size: larger blocks improve backup speed, but smaller blocks reduce the amount of storage required. In practice, backup blocks may be as small as 1MB or even 256kB. The 64kB blocks we used in the calculation above represent the shards that make up a 1MB file.

The challenge facing our engineers was to take advantage of the speed of solid state storage to accelerate small file uploads without breaking the bank.

Improving Write Performance for Small Files

When a client application uploads a file to the Backblaze B2 Storage Cloud, a coordinator pod splits the file into 16 data shards, creates four additional parity shards, and writes the resulting 20 shards to 20 different HDDs, each in a different Pod.

Note: As HDD capacity increases, so does the time required to recover after a drive failure, so we periodically adjust the ratio between data shards and parity shards to maintain our eleven nines durability target. In the past, you’ve heard us talk about 17 + 3 as the ratio, but we also run 16 + 4, and our very newest vaults use a 15 + 5 scheme.
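To make the trade-off between those schemes concrete, here is a small sketch that works out the shard size and raw storage overhead for a 1MB file under each ratio. It does not implement the parity calculation itself. The `shard_layout` helper is a hypothetical name for this example, and it assumes 1MB means 1,048,576 bytes and that shard size is simply the file size divided by the number of data shards, rounded up.

```python
from typing import Tuple


def shard_layout(file_bytes: int, data_shards: int, parity_shards: int) -> Tuple[int, int, float]:
    """Return (bytes per shard, total shards, raw storage overhead factor)."""
    shard_bytes = -(-file_bytes // data_shards)  # ceiling division
    total_shards = data_shards + parity_shards
    overhead = total_shards / data_shards
    return shard_bytes, total_shards, overhead


ONE_MB = 1024 * 1024  # assumption: binary megabyte

for data, parity in [(17, 3), (16, 4), (15, 5)]:
    shard_bytes, total, overhead = shard_layout(ONE_MB, data, parity)
    print(
        f"{data} + {parity}: {shard_bytes:,} bytes per shard, "
        f"{total} shards, {overhead:.2f}x raw storage, "
        f"tolerates loss of {parity} shards"
    )
```

Moving from 17 + 3 to 15 + 5 stores more raw bytes per file, but the file can be rebuilt after losing more shards, which is the lever the note above describes.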

Each Pod writes the incoming shard to its local filesystem; in practice, this means that the data is written to an in-memory cache and will be written to the physical disk at some point in the near future. Any requests for the file can be satisfied from the cache, but the data hasn’t actually been persistently stored yet.

We need to be absolutely certain that the shards have been written to disk before we return a “success” response to the client, so each Pod executes an fsync system call to transfer (“flush”) the shard data from system memory through the HDD’s write cache to the disk itself before returning its status to the coordinator. When the coordinator has received at least 19 successful responses, it returns a success response to the client. This ensures that, even if the entire data center were to lose power immediately after the upload, the data would be preserved.
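For readers who haven’t used fsync directly, the write-then-flush pattern looks roughly like this in Python. This is a generic sketch, not Backblaze’s Pod software; `write_shard_durably` and the file name are hypothetical. Whether the data also passes through the drive’s own write cache depends on the operating system, filesystem, and drive configuration.

```python
import os
import tempfile


def write_shard_durably(path: str, shard: bytes) -> None:
    """Write a shard and do not return until the OS reports it flushed to the device."""
    with open(path, "wb") as f:
        f.write(shard)        # data lands in Python's userspace buffer
        f.flush()             # push it to the operating system's in-memory cache
        os.fsync(f.fileno())  # ask the OS to commit the cached data to the storage device


# Example: write one 64kB shard to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "shard-00.bin")
    write_shard_durably(shard_path, b"\x00" * 65_536)
    print(os.path.getsize(shard_path), "bytes written and flushed")
```

Without the `os.fsync` call, the function could return while the data is still only in memory, which is exactly the window the paragraph above is concerned with.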

As we explained above, for small blocks of data, the vast majority of the time spent writing the data to disk is spent waiting for the drive platter to spin to the correct location. Writing shards to SSD could result in a significant performance gain for small files, but what about that 7x cost difference?

Our engineers came up with a way to have our cake and eat it too by harnessing the speed of SSDs without a massive increase in cost. Now, upon receiving a file of 1MB or less, the coordinator splits it into shards as before, then simultaneously sends the shards to a set of 20 Pods and a separate pool of servers, each populated with 10 of the Micron SSDs described above—a “shard stash.” The shard stash servers easily win the “flush the data to disk” race and return their status to the coordinator in just a few milliseconds. Meanwhile, each HDD Pod writes its shard to the filesystem, queues up a task to flush the shard data to the disk, and returns an acknowledgement to the coordinator.

Once the coordinator has received replies establishing that at least 19 of the 20 Pods have written their shards to the filesystem, and at least 19 of the 20 shards have been flushed to the SSDs, it returns its response to the client. Again, if power were to fail at this point, the data would already be safely written to solid state storage.

We don’t want to leave the data on the SSDs any longer than we have to, so once each Pod has finished flushing its shard to disk, it signals to the shard stash that it can purge its copy of the shard.
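The coordinator’s decision logic described above can be modeled as a small piece of bookkeeping. The sketch below is an illustration of the rules as stated in the text, not the actual coordinator code; the class and method names are invented for this example, and real networking, timeouts, and error handling are left out.

```python
from dataclasses import dataclass, field
from typing import Set

TOTAL_SHARDS = 20   # 16 data + 4 parity in the example above
REQUIRED_ACKS = 19  # the "at least 19 of the 20" threshold from the text


@dataclass
class SmallFileUpload:
    """Tracks acknowledgements for one uploaded file (hypothetical model)."""
    pod_written: Set[int] = field(default_factory=set)    # shards written to a Pod's filesystem
    stash_flushed: Set[int] = field(default_factory=set)  # shards flushed to SSD at some point
    stash_holding: Set[int] = field(default_factory=set)  # shards still held on SSD

    def on_pod_write_ack(self, shard: int) -> None:
        self.pod_written.add(shard)

    def on_stash_flush_ack(self, shard: int) -> None:
        self.stash_flushed.add(shard)
        self.stash_holding.add(shard)

    def can_respond_to_client(self) -> bool:
        # Both conditions must hold before "success" is returned.
        return (len(self.pod_written) >= REQUIRED_ACKS
                and len(self.stash_flushed) >= REQUIRED_ACKS)

    def on_pod_flush_complete(self, shard: int) -> None:
        # The HDD copy is now durable, so the SSD copy can be purged.
        self.stash_holding.discard(shard)


upload = SmallFileUpload()
for shard in range(TOTAL_SHARDS - 1):  # 19 shards acknowledged on both paths
    upload.on_pod_write_ack(shard)
    upload.on_stash_flush_ack(shard)
print(upload.can_respond_to_client())  # True
```

Keeping `stash_flushed` separate from `stash_holding` means that purging a shard from the SSDs later does not retroactively change whether the upload met its acknowledgement threshold.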

Real-World Performance Gains

As I mentioned above, that calculated 167x performance advantage of SSDs over HDDs is somewhat theoretical. In the real world, the time required to upload a file also depends on a number of other factors—proximity to the data center, network speed, and all of the software and hardware between the client application and the storage device, to name a few.

The first Backblaze region to receive the performance upgrade was U.S. East, located in Reston, Virginia. Over a 12-day period following the shard stash deployment there, the average time to upload a 256kB file was 118ms, while a 1MB file clocked in at 137ms. To replicate a typical customer environment, we ran the test application at our partner Vultr’s New Jersey data center, uploading data to Backblaze B2 across the public internet.

For comparison, we ran the same test against Amazon S3’s U.S. East (Northern Virginia) region, a.k.a. us-east-1, from the same machine in New Jersey. On average, uploading a 256kB file to S3 took 157ms, with a 1MB file taking 153ms.

So, comparing the Backblaze B2 U.S. East region to the Amazon S3 equivalent, we benchmarked the new, improved Backblaze B2 as 30% faster than S3 for 256kB files and 10% faster than S3 for 1MB files.

These low-level tests were confirmed when we timed Veeam Backup & Replication software backing up 1TB of virtual machines with 256kB block sizes. Backing the server up to Amazon S3 took three hours and 12 minutes; we measured the same backup to Backblaze B2 at just two hours and 15 minutes, 40% faster than S3.
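For anyone who wants to check the percentages, here is the calculation as a short sketch. It assumes that “X% faster” means the ratio of the two elapsed times minus one, a definition the text does not spell out; `percent_faster` is a name made up for this example.

```python
def percent_faster(baseline: float, improved: float) -> float:
    """How much faster `improved` is than `baseline`, as a percentage of speed."""
    return (baseline / improved - 1) * 100


# (Amazon S3 time, Backblaze B2 time) from the measurements above.
results = {
    "256kB upload (ms)": (157, 118),
    "1MB upload (ms)": (153, 137),
    "Veeam 1TB backup (minutes)": (3 * 60 + 12, 2 * 60 + 15),
}

for name, (s3_time, b2_time) in results.items():
    print(f"{name}: B2 is {percent_faster(s3_time, b2_time):.0f}% faster")
```

Under that definition the three results work out to about 33%, 12%, and 42%, in line with the rounded 30%, 10%, and 40% quoted in the text.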

Test Methodology

We wrote a simple Python test app using the AWS SDK for Python (Boto3). Each test run involved timing 100 file uploads using the S3 PutObject API, with a 10ms delay between each upload. (FYI, the delay is not included in the measured time.) The test app used a single HTTPS connection across the test run, following best practice for API usage. We’ve been running the test on a VM in Vultr’s New Jersey region every six hours for the past few weeks against both our U.S. East region and its AWS neighbor. Latency to the Backblaze B2 API endpoint averaged 5.7ms, to the Amazon S3 API endpoint 7.8ms, as measured across 100 ping requests.
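A test along those lines might look something like the sketch below. This is not the actual test app, just a minimal reconstruction from the description: the helper names, object keys, bucket, and endpoint are placeholders, and credentials are assumed to come from the environment in the usual Boto3 way.

```python
import time
from typing import Callable, List


def time_uploads(upload: Callable[[str, bytes], None], payload: bytes,
                 runs: int = 100, delay_s: float = 0.010) -> List[float]:
    """Time `runs` uploads, pausing `delay_s` between them (pause not measured)."""
    durations = []
    for i in range(runs):
        start = time.perf_counter()
        upload(f"test-object-{i}", payload)
        durations.append(time.perf_counter() - start)
        time.sleep(delay_s)  # outside the timed region
    return durations


def make_s3_uploader(bucket: str, endpoint_url: str) -> Callable[[str, bytes], None]:
    """Build an uploader backed by the S3 PutObject API via Boto3."""
    import boto3  # imported here so the timing helper works without the SDK installed

    # One client for the whole run, so connections can be reused between uploads.
    client = boto3.client("s3", endpoint_url=endpoint_url)

    def upload(key: str, body: bytes) -> None:
        client.put_object(Bucket=bucket, Key=key, Body=body)

    return upload


# Example usage (placeholder bucket and endpoint, requires credentials):
#   uploader = make_s3_uploader("my-test-bucket", "https://s3.example.com")
#   times = time_uploads(uploader, b"\x00" * 256 * 1024)
#   print(f"mean: {sum(times) / len(times) * 1000:.0f} ms")
```

Separating the timing loop from the uploader makes it easy to point the same test at two different S3-compatible endpoints, which is what the comparison above requires.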

What’s Next?

At the time of writing, shard stash servers have been deployed to all of our data centers, across all of our regions. In fact, you might even have noticed small files uploading faster already. It’s important to note that this particular optimization is just one of a series of performance improvements that we’ve implemented, with more to come. It’s safe to say that all of our Backblaze B2 customers will enjoy faster uploads and downloads, no matter their storage workload.

The post How We Achieved Upload Speeds Faster Than AWS S3 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.