
Why We Use Native Code in Backblaze Computer Backup

Post Syndicated from Natasha Rabinov original https://www.backblaze.com/blog/why-we-use-native-code-in-backblaze-computer-backup/


There’s a lot that goes into building a user-friendly, robust backup utility. When Backblaze set out to create one back in 2007, our goal was to make sure that users of all skill levels would have automatic, nearly continuous backups that could be restored on command. There were plenty of design decisions to be made, and one of the biggest was whether to implement our client in native code. 

You might have seen us talk about this on our website and elsewhere, and we felt it was high time to dive into what that decision meant for our development, how it affected the way the Backblaze client works, and why we think it was an important decision and inflection point for Backblaze Computer Backup and our customers. 

What is native code?

Each kind of computer central processing unit (CPU), such as Intel/AMD or Apple Silicon, has its own “machine language,” which is the set of instructions the CPU can understand and follow. These instructions are encoded in binary, and aren’t something people can read or write without great effort. When folks talk about using native code, they’re typically talking about a computer program that’s written in machine language, so a computer’s CPU can “natively” understand what the program needs the CPU to do.  

Compiled languages

To use a compiled language, developers write instructions into source code that’s easy for humans to read and edit. Then, they use a program aptly called a compiler to convert the source code into machine language for a particular kind of CPU. Examples of compiled languages are assembly (ASM), C, C++, Rust, Go, Swift, and Haskell.

Interpreted languages

Like with compiled languages, developers write programs in interpreted languages by writing instructions into source code files. But instead of converting those instructions into machine language, another program called an interpreter reads the source code and follows the instructions it contains without converting them to machine language. Common interpreted languages are things like Python, Ruby, BASIC, and PHP. 

There’s a bit of a gray area between compiled and interpreted languages. For example, some modern Java implementations mix an interpreter and a compiler. But, when it comes to programming, the difference is about picking a language that suits a task’s requirements.
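
If you want to see that gray area for yourself, here’s a tiny sketch in Python (one of the interpreted languages mentioned above). CPython quietly compiles your source code into bytecode first, and only then does its interpreter execute those instructions. This is purely an illustration; it’s not how the Backblaze client is built.

import dis

def greet(name):
    return "Hello, " + name

# Show the bytecode that CPython compiled this function into.
# Even an "interpreted" language does a compilation step under the hood;
# the interpreter then executes these instructions one by one.
dis.dis(greet)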

When and how do you use which type of code language(s)?

Well, pretty much anything anyone does on computers these days takes a combination of code languages. In some ways, the whole challenge of working with computers is bridging the gap between how humans communicate and how computers process information.

To put it in metaphor: a compiled language is like someone who was raised speaking two languages natively, and can fluently curse in both.

By contrast, an interpreted language works like this: You’ve moved to a country where you’re not fluent in the language, but someone needs a thorough dressing-down. You write your insult in your native language, and the interpreter, armed with a dictionary, takes your words, works out the idiom you intended, and delivers an effective, similarly meaningful insult on your behalf. Without your interpreter, your attempt at offense (in this metaphor, a program!) would likely fail, because no one can understand you.

To wit: While they mean similar things, “when pigs fly,” and “quand les poules auront des dents,” do not literally translate. 

What are the benefits of using native code in a backup application?

Using native code in a backup application is, in our opinion, better for several reasons.

Permissions

When you’re writing in native code, you’re plugging your program in at a lower level than most applications, which gives you access to the kinds of APIs the native operating system (OS) uses. That level of integration with the OS means users have to update permissions less frequently, it gives us more robust options for building the client, and the backup client can run seamlessly in the background.

Efficiency: Build once, run everywhere

Building our backup client lower in the chain of command, so to speak, lets us reuse the same work across different situations. Some runtimes, like the Java virtual machine (JVM), were built for exactly this purpose. Using those solutions, however, would sacrifice some of the other benefits we’re outlining in this article.

Because we’re fully in control of our common code, we can do this without an interpreted language and still keep the other advantages listed here. We use the same base code for both our Mac and Windows clients, then add platform-specific modifications on top to refine each client. There are slight differences between the OS environments, but coding in a compiled language like C++ means we can adjust for those differences effectively.

Performance

Running native code typically results in better performance. That’s because there are fewer steps (for your computer) between understanding a program and running a program.

Backup programs run all the time in the background, and have to keep track of a lot of information. Backblaze’s native code does that using half to a tenth of the computing resources that a backup program written in an interpreted language would use. So, Backblaze won’t slow down or interrupt the other activities you’re doing with your computer.

Reducing software bloat and size

Since you don’t have to install an interpreter (you know, your insult dictionary), native code applications are usually leaner and lighter on your system.

Eliminating risky third-party dependencies

Since they’re software, computer language interpreters have bugs and get new features, so they’re frequently updated. Sometimes an updated interpreter won’t run programs written for an older version of the language, or will cause a program to behave differently in an unexpected (read: “buggy”) way. Vendors have even changed licensing terms and started charging money for interpreters that had been free. Backblaze’s native code doesn’t have those problems.

Platform-standard user interface

Operating system vendors like Microsoft and Apple strongly encourage developers to write programs that use a platform-standard user interface “look-and-feel.” Programs that do that help users feel comfortable, minimize surprises, and support accessibility features like text-to-speech.

The most effective way to ensure a program’s user interface matches a platform’s standard look-and-feel is to use features built into the operating system, and those are typically only available to native code like Backblaze’s client.

What are the challenges of using native code in a backup application?

Nothing is perfect. What are the downsides to this approach?

Industry preference moving towards interpreted language/web apps

Has anyone else noticed that the world of development has changed recently? (No need to qualify that statement—it will be true tomorrow, tomorrow, and tomorrow again.) 

As with any industry, tech’s (and developers’) favorite strategies for creating things and solving problems have changed over time. 

There are various players in this space, including platforms (Mac, Windows, Linux), software (Adobe, Office), applications (Slack, the latest mobile game, your headphone utility client), and frankly, many things that skirt the boundaries of the above buckets. Executing any program, particularly a third-party application, is a negotiation between the operating system’s publisher and the application’s developers.

Over time, those who sell computers and manage OSes have grown to prefer the lightweight development of application ecosystems. It lets them have more control over their platforms, and it gives developers a shorter time to deployment—as long as they play within the sandbox the OS has made available. OS publishers are attempting to anticipate the needs of program and app developers, but there are some types of utilities—and backup is one of them—that justifiably break standard rules. Giving access to all your files by default, for example, isn’t something you’d do for a social media application. However, in order to get a full and complete backup, a program does justifiably require that level of access. 

Limited dev libraries

Given developers’ preference for web applications and interpreted languages (for good reason in some cases), many OS vendors are releasing less detailed support and technical documentation for some of their deeper-level tools. If you’re implementing in native code in today’s environment, you need both historical knowledge and ingenuity in-house. Which leads us to our next point…

Expertise

We’re on board with the evolution of development—innovation is at the heart of our company—but for aspects of our backup client, we need developers with a deep understanding of compiled code languages and our supported ecosystems. And luckily, in any sufficiently large tech company, you’ll find folks specializing in different code languages and parts of the tech stack. That means we can spend more time nurturing and developing our internal talent rather than seeking it externally. 

Hybrid approaches?

Hey, we’ve spent a whole article telling you why native code matters. But, many folks agree that the future requires a hybrid approach, largely because of that gray area between compiled and interpreted languages we mentioned above. You can certainly see that in our style as well—our Mac client uses a combination of Objective-C, SwiftUI, and C++, for example.

The now and future Backblaze

The core functionality of our client depends on native code for very good design reasons, and they’re ultimately all about making things easier for our end users. 

Overall, our design ideas are all centered on what it means to use Backblaze every day, regardless of an end user’s skill level. We want things to be simpler, and sometimes the questions we need to answer (how do I make sure the Backblaze client backs everything up?) are actually a tad more complicated upfront (the Backblaze client needs system permissions—and that means implementing it in native code), in that they require forethought and an investment of time and resources. But, we also prioritize the kind of thinking we can use over and over—so, even if we spend a little more time building native code, it’s an investment that has longevity. Put another way: Build once, run everywhere. 


Container Orchestration: Managing Applications at Scale

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/container-orchestration-managing-applications-at-scale/


The use of containers for software deployment has emerged as a powerful method for packaging applications and their dependencies into single, portable units. Containers enable developers to create, deploy, and run applications consistently across various environments. However, as containerized applications grow in scale and complexity, efficiently deploying, managing, and terminating containers can become a challenging task.

The growing need for streamlined container management has led to the rise of container orchestration—an automated approach to deploying, scaling, and managing containerized applications. Because it simplifies the management of large-scale, dynamic container environments, container orchestration has become a crucial component in modern application development and deployment. 

In this blog post, we’ll explore what container orchestration is, how it works, its benefits, and the leading tools that make it possible. Whether you are new to using containers or looking to optimize your existing strategy, this guide will provide insights that you can leverage for more efficient and scalable application deployment. 

What are containers?

Before containers, developers often faced the “it works on my machine” problem, where an application would run perfectly on a developer’s computer but fail in other environments due to differences in operating systems (OS), dependencies, or configuration. 

Containers solve this problem by packaging applications with all their dependencies into single, portable units, improving consistency across different environments. This greatly reduces the compatibility issues and simplifies the deployment process. 

As a lightweight software package, containers include everything needed to run an application, such as code, runtime environment, system tools, libraries, binaries, and settings. They run on top of the host OS, sharing the same OS kernel, and can run anywhere—on a laptop, a server, in the cloud, etc. On top of that, containers remain isolated from each other, making them more lightweight and efficient than virtual machines (VMs), which require a full OS for each instance. Check out our article to learn more about the difference between containers and VMs.

Containers provide consistent environments, higher resource efficiency, faster startup times, and portability. They differ from VMs in that they share the host OS kernel. While VMs virtualize hardware for strong isolation, containers isolate at the process level. By solving the longstanding issues of environment consistency and resource efficiency, containers have become an essential tool in modern application development. 

What is container orchestration?

As container adoption has grown, developers have encountered new challenges that highlight the need for container orchestration. While containers simplify application deployment by ensuring consistency across environments, managing containers at scale introduces complexities that manual processes can’t handle efficiently, such as:

  1. Scalability: In a production environment, applications often require hundreds or thousands of containers running simultaneously. Manually managing such a large number of containers becomes impractical and error-prone. 
  2. Resource management: Efficiently utilizing resources across multiple containers is critical. Manual resource allocation leads to underutilization or overloading of hardware, negatively impacting performance and cost-effectiveness. 
  3. Container failure management: In dynamic environments, containers can fail or become unresponsive. Developers need a way to create a self-healing environment in which failed containers are automatically detected and replaced without manual intervention, ensuring high availability and reliability. 
  4. Rolling updates: Deploying updates to applications without downtime and the ability to quickly roll back in case of issues are crucial for maintaining service continuity. Manual updates can be risky and cumbersome. 

Container orchestration automates the deployment, scaling, and management of containers, addressing the complexities that arise in large-scale, dynamic application environments. It ensures that applications run smoothly and efficiently, enabling developers to focus on building features rather than managing infrastructure. Container orchestration tools provide various features such as automated scheduling, self-healing, load balancing, and resource optimization to deploy and manage applications more effectively to ensure reliability, performance, and scalability. 

What are the benefits of container orchestration?

Container orchestration offers many different advantages that streamline the deployment and management of containerized applications. We’ve touched on a few of them, but here’s a concise list: 

  • Improved resource utilization: Orchestration tools can efficiently pack containers onto hosts, maximizing hardware usage. 
  • Enhanced scalability: Easily scale applications up or down to meet changing demands. 
  • Increased reliability: Automatic health checks and container replacement ensure high availability. 
  • Simplified management: Centralized control and automation reduce the complexity of managing large-scale containerized applications. 
  • Faster deployments: Orchestrators enable rapid and consistent deployments across different environments. 
  • Cost efficiency: Better resource utilization and automation lead to cost savings. 

How does container orchestration work?

Now that we understand what container orchestration is, let’s take a look at how container orchestration works using the example of Kubernetes, one of the most popular container orchestration platforms. 

A container orchestration system is divided into two main sections: the control plane and the worker nodes. 

Control plane

The control plane is the brain of the container orchestration system. It manages the entire system, ensuring that the desired state of the applications is maintained. Key components of the control plane include:

  • Configuration store (etcd): A distributed key-value store that holds all the cluster data, such as the configuration and state information. Think of it as a central database for the cluster. 
  • API server: The front-end of the control plane, exposing the orchestration API. It handles all the communication within the cluster and with external clients. 
  • Scheduler: Assigns workloads to nodes based on resource availability and scheduling policies, ensuring efficient resource utilization. 
  • Controller manager: Runs various controllers that handle routine tasks to maintain the cluster’s desired state. 
  • Cloud controller manager: Interacts with cloud provider APIs to manage cloud-specific resources, integrating the cluster with the underlying cloud infrastructure. 

Worker nodes

Worker nodes are the machines, whether virtual machines or bare metal servers, where application workloads actually run. Each worker node has the following components: 

  • Node agent (kubelet): An agent that ensures the containers are running as expected. It communicates with the control plane to receive instructions and report back on the status of the nodes. 
  • Network proxy (kube-proxy): Maintains network rules on each node, facilitating communication between containers and services within the cluster. 

Within the worker nodes, pods are the smallest deployable units. Each pod can contain one or more containers that run the application and its dependencies, and the pod is the level at which applications are deployed and managed. 

Through the cloud provider API, the orchestration system dynamically interacts with cloud infrastructure to provision resources as needed, making it a flexible and powerful tool for managing containerized applications across various environments. 
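
To make the moving pieces a little more concrete, here’s a minimal sketch of talking to the control plane from code. It assumes the official Kubernetes Python client is installed and that a kubeconfig for an existing cluster is available; the deployment name and namespace are placeholders.

from kubernetes import client, config

# Load the API server address and credentials from the local kubeconfig.
# (Code running inside the cluster would call config.load_incluster_config() instead.)
config.load_kube_config()

core = client.CoreV1Api()
apps = client.AppsV1Api()

# Ask the API server which pods are currently running on the worker nodes.
for pod in core.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)

# Declare a new desired state: scale a (placeholder) deployment to five replicas.
# The scheduler and controllers then work to make the cluster match that state.
apps.patch_namespaced_deployment_scale(
    name="web-frontend",   # placeholder deployment name
    namespace="default",
    body={"spec": {"replicas": 5}},
)

Whether you use a CLI, YAML manifests, or a client library like this, the idea is the same: you declare the desired state, and the control plane converges the cluster to match it.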

Popular container orchestration tools

Several container orchestration tools have emerged as the leaders in the industry, each offering unique features and capabilities. Here are some of the most popular tools:

Kubernetes

Kubernetes, often referred to as K8s, is an open-source container orchestration platform initially developed by Google. It has become the industry standard for managing containerized applications at scale. K8s is ideal for handling complex, multi-container applications, making it suitable for large-scale microservices architectures and multi-cloud deployments. Its strong community support and flexibility with various container runtimes contribute to its widespread adoption.

Docker Swarm

Docker Swarm is Docker’s native container orchestration tool, providing a simpler alternative to Kubernetes. It integrates seamlessly with Docker containers, making it a natural choice for teams already using Docker. Known for its ease of setup and use, Docker Swarm allows quick scaling of services with straightforward commands, making it ideal for small to medium-sized applications and rapid development cycles. 

Amazon Elastic Container Service (ECS)

Amazon ECS (Elastic Container Service) is a fully managed container orchestration service provided by AWS, designed to simplify running containerized applications. ECS integrates deeply with AWS services for networking, security, and monitoring. ECS leverages the extensive range of AWS services, making it a straightforward orchestration solution for enterprises using AWS infrastructure.

Red Hat OpenShift

Red Hat OpenShift is an enterprise-grade Kubernetes container orchestration platform that extends Kubernetes with additional tools for developers and operations, integrated security, and lifecycle management. OpenShift supports multiple cloud and on-premise environments, providing a consistent foundation for building and scaling containerized applications.

Google Kubernetes Engine (GKE)

Google Kubernetes Engine (GKE) is a managed Kubernetes service offered by Google Cloud Platform (GCP). It provides a scalable environment for deploying, managing, and scaling containerized applications using Kubernetes. GKE simplifies cluster management with automated upgrades, monitoring, and scalability features. Its deep integration with GCP services and Google’s expertise in running Kubernetes at scale make GKE an attractive option for complex application architectures.

Embracing the future of application deployment

Container orchestration has undoubtedly revolutionized the way we deploy, manage, and scale applications in today’s complex and dynamic software environments. By automating critical tasks such as scheduling, scaling, load balancing, and health monitoring, container orchestration enables organizations to achieve greater efficiency, reliability, and scalability in their application deployments. 

The choice of orchestration platform should be carefully considered based on your specific needs, team expertise, and long-term goals. It’s not just a technical solution but a strategic enabler, providing you with significant advantages in your development and operational workflows.


How to Back Up Your QNAP NAS to the Cloud

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/qnap-nas-backup-to-cloud/


Your QNAP network attached storage (NAS) device helps your business centralize storage capacity, support collaboration, and access files 24/7 from anywhere. If you were relying on individual hard drives or another ad hoc storage solution before, it definitely helps you uplevel your data management practices.

One of the great features of a QNAP NAS device is Hybrid Backup Sync (HBS), its onboard backup utility that allows you to easily store a copy of your data on your NAS and other destinations. You can set regular, automated backups to protect against data loss due to hardware failures or accidental deletion. But, keeping a copy of your data on your NAS alone doesn’t constitute a true backup strategy. For that, you need to follow the 3-2-1 backup rule with at least one copy stored off-site.

This post explains how to set up a 3-2-1 backup strategy with your QNAP NAS. We’ll share the benefits of storing your backups in the cloud, discuss different options for backing up your QNAP NAS, and provide some practical examples of what you can do by combining cloud storage and your NAS.


QNAP NAS and a 3-2-1 backup strategy

Following the 3-2-1 strategy means having three copies of your data, two of which are stored locally but on different media (aka devices), and one stored off-site. 

Your QNAP NAS is your first step towards completing the 3-2-1 strategy. By using it to store data locally, you have two copies on-site. Backing up your QNAP NAS to the cloud completes the 3-2-1 strategy by serving as your off-site storage. 

A diagram showing the 3-2-1 backup strategy, which has three copies of data, on two different types of media, with one stored in an off-site location.

You could maintain an off-site copy on another physical device like another NAS, an external drive, or a file server, but keep in mind, backing up to an external destination other than the cloud will require you to physically separate the backup copy—that is, send your drive via mail or drive it elsewhere in order to ensure geographic separation. Backing up your QNAP NAS to the cloud means you achieve a 3-2-1 strategy without going out of your way to physically separate the copies, and it allows you to easily store data in different regions for greater data resilience and disaster recovery.

The additional benefits of backing up your QNAP NAS to the cloud

Backing up your QNAP NAS to the cloud gives you a number of additional benefits, including:

  • Disaster recovery: Without an off-site backup, your on-site data, including data on your individual workstations and your NAS, is susceptible to data loss. Natural disasters could wipe out your machines, your NAS, and any other backups you might store locally. Cloud backups safeguard your data from physical disasters that could destroy both your NAS and local copies.
  • Ransomware protection: While QNAP has on-board utilities that allow you to revert to a previous backup, your NAS is still connected to your network and susceptible to ransomware. Cloud backups, especially those configured with Object Lock, provide a layer of security against ransomware attacks that can encrypt or delete data stored on your network-connected NAS. 
  • Protection against hardware failure: Because your NAS is likely set up in a RAID configuration, one drive failure might not affect your data. But, while one drive is down, your data is at a higher risk. If another drive were to fail, you could lose data. Keeping an off-site backup in cloud storage helps you avoid this fate.
  • Accessibility: With your data in the cloud, your backups are accessible from anywhere. If you’re away from your desk or office and you need to retrieve a file, you can simply log in to your cloud account and copy that file down.
  • Security: Cloud vendors typically protect customer data by encrypting it as it travels to its final destination and/or when it is at rest on the vendors’ storage servers. Encryption protocols differ between cloud vendors, so make sure to understand them as you’re evaluating cloud providers, especially if you have specific security requirements.
  • Automation: Your QNAP NAS comes with a built-in backup utility so you can set your cloud backup schedule in advance and avoid human error (like forgetting to back up) in the future.
  • Scalability: As your data grows, your cloud backups grow with it. With cloud storage, there’s no need to invest in or maintain additional hardware to ensure your data is properly backed up.

How to protect your business data with QNAP

QNAP offers a number of different tools and functionality to help you back up business devices and systems to your NAS, including:

  1. Qsync: Qsync is an on-board backup utility on QNAP devices that allows you to sync computer files to your QNAP NAS. This allows you to back up workstations to your NAS, creating a second, local copy of that data. QNAP NAS also supports Time Machine for Macs. 
  2. NetBack PC Agent: A utility specifically for backing up Windows PCs and servers.
  3. Hyper Data Protector: Use Hyper Data Protector to back up multiple VMware and Hyper-V virtual machines (VMs).
  4. File server backup: QNAP devices support multiple protocols, including rsync, FTP, and CIFS for backing up different file servers.
  5. Boxafe: Use Boxafe to back up Google Workspace and Microsoft 365 business account data to your NAS.
  6. Snapshot feature: Takes point-in-time copies of data for protection and recovery.
  7. MARS: Use QNAP’s MARS service to back up Google Photos and WordPress databases and files to your NAS. 

How to back up your QNAP to the cloud

Once you’ve created a copy of your business data to your QNAP NAS, you can then use QNAP Hybrid Backup Sync to back it up to the cloud. Hybrid Backup Sync supports multi-version backups and allows you to customize retention settings for version management. QNAP’s QuDedup feature deduplicates data, helping you manage your storage footprint. The utility also allows you to manage Time Machine backups for Mac devices.

A product photo of a QNAP NAS.

What can you do with cloud storage and QNAP Hybrid Backup Sync?

The QNAP Hybrid Backup Sync app provides you with a lot of options. You can synchronize in the cloud as little or as much as you want. Here are some practical examples of what you can do with Hybrid Backup Sync and cloud storage working together.

1. Sync the entire contents of your QNAP to the cloud

The QNAP NAS has excellent fault tolerance—it can continue operating even when individual drive units fail—but nothing in life is foolproof. It pays to be prepared in the event of a catastrophe. Now that you know about the 3-2-1 backup strategy, you know how important it is to make sure that you have a copy of your files in the cloud.

2. Sync your most important media files

Using your QNAP to store marketing assets like video and photos? You’ve invested untold amounts of time, money, and effort into producing those media files, so make sure they’re safely and securely synced to the cloud with Hybrid Backup Sync.

3. Back up Time Machine and other local backups

Apple’s Time Machine software provides Mac users with reliable local backup, and many Backblaze customers rely on it to provide that crucial first step in making sure their data is secure. QNAP enables the NAS to act as a network-based Time Machine backup destination. Those Time Machine files can then be synced to the cloud, so you’ll have Time Machine files to restore from in the event of a critical failure.

If you use Windows or Linux, you can configure the QNAP NAS as the destination for your Windows or Linux local data backup. That, in turn, can be synced to the cloud from the NAS.

Ready to give it a try?

Hybrid Backup Sync allows you to choose from any number of cloud storage providers as a backup destination, and Backblaze B2 Cloud Storage is one of them. Check out our videos on how to use Hybrid Backup Sync to back up or sync your data to B2 in under 15 minutes.
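
And because Backblaze B2 also offers an S3-compatible API, you can script against your backup destination directly, say, to spot-check that last night’s job actually landed. Here’s a rough sketch using the AWS SDK for Python (boto3); the endpoint, bucket name, and keys are placeholders you’d swap for your own.

import boto3

# Placeholders: use your bucket's region-specific S3 endpoint and an
# application key ID/secret created in the Backblaze B2 console.
b2 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",
    aws_access_key_id="YOUR_KEY_ID",
    aws_secret_access_key="YOUR_APPLICATION_KEY",
)

# List a few recent objects in the bucket Hybrid Backup Sync writes to.
response = b2.list_objects_v2(Bucket="my-nas-backups", MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"], obj["LastModified"])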

If you haven’t given cloud storage a try yet, you can get started now and make sure your NAS is synced or backed up securely to the cloud.


AI 101: Why RAG Is All the RAGe

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-why-rag-is-all-the-rage/


At the risk of being called the stick in the mud of the tech world, we here at Backblaze have often bemoaned our industry’s love of making up new acronyms. The most recent culprit, hailing from the fast-moving artificial intelligence/machine learning (AI/ML) space, is truly memorable: RAG, aka retrieval-augmented generation. For the record, its creator has apologized for inflicting it upon the world.

Given how useful it is, we’re willing to forgive. (I’m sure he was holding his breath for that news.) Today, our AI 101 series is back to talk about what RAG is—and the big problem it solves. 


This article is part of a series that attempts to understand the evolving world of AI/ML. Check out our previous articles for more context.

Let’s start with large language models (LLMs)

LLMs are the most recognizable expression of AI in our current zeitgeist. (Arguably, you could append that with “that we’re all paying attention to,” given that ML algorithms have been behind many tools for decades now.) LLMs underpin tools like ChatGPT, Google Gemini, and Claude, as well as things like service-oriented chatbots, natural language processing tasks, and so on. They’re trained on vast amounts of data, with algorithmic guardrails known as parameters and hyperparameters guiding their training. Once trained, we query them through a process known as inference.

Fabulous! The possibilities are endless. However, one of the biggest challenges we’ve experienced (and laughed about on the internet) is that LLMs can return inaccurate results, while sounding very, very reasonable. Additionally, LLMs don’t know what they don’t know. Their answers can only be as good as the data they draw from—so, if their training dataset is outdated or contains a systematic bias, it will impact your results. As AI tools have become more widely adopted, we’ve seen LLM inaccuracies range from “funny and widely mocked” to “oh, that’s actually serious.”

Enter retrieval-augmented generation (Fine! RAG)

RAG is a solution to these problems. Instead of relying only on an LLM’s training dataset, RAG queries external sources before returning a response. It’s more complicated than “let me google that for you,” though: the process takes that external data, converts it into a vector database, and then balances the external data against the LLM’s “general knowledge” and its skill at responding to conversational queries. 

This has several advantages. Users now have sources they can cite, and recent information is taken into account. From a development perspective, it means that you don’t have to re-train a model as frequently. And, it can be implemented in as few as five lines of code. 
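
Here’s a deliberately tiny sketch of that retrieve-then-generate loop. Word overlap stands in for a real embedding model and vector database, and the final LLM call is left out; the point is just the shape of the flow.

# Toy retrieval: score documents by word overlap with the question.
# A real system would use an embedding model and a vector database instead.
def score(question: str, document: str) -> int:
    return len(set(question.lower().split()) & set(document.lower().split()))

documents = [
    "Help article: to reset your password, open Settings and choose Security.",
    "Release notes: version 2.3 ships the latest security patches.",
]

def build_prompt(question: str, top_k: int = 1) -> str:
    # 1. Retrieve: pick the most relevant trusted sources for this question.
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    context = "\n".join(ranked[:top_k])
    # 2. Augment: put the retrieved sources ahead of the model's "general knowledge."
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

# 3. Generate: hand the augmented prompt to whatever LLM you're using; that call,
# plus citing which sources were retrieved, is what RAG adds on top.
print(build_prompt("How do I reset my password?"))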

One important nuance is that when you’re building RAG into your product, you can set its sources. For industries like medicine and law, that means you can point them towards industry journals and trusted sources, outweighing the often misquoted or mis-cited examples you might see in a general database. 

Another example: For a technical documentation portal, you can take an LLM, trained on general information and the nuts and bolts of conversational querying, and direct it to rely on your organization’s help articles as its most important sources. Your organization controls the authoritative data, and how often/when changes are made. Users can trust that they’re getting the most recent security patches and correct code. And, you can do so quickly, easily, and—most importantly—cost-effectively. 

RAG doesn’t mean foolproof AI

RAG is a great, straightforward method for keeping LLM tools updated with current, high-quality information and giving users more transparency around where their answers are coming from. However, as we mentioned above, AI is only ever as good as the data it uses. Keep in mind, that’s a deceptively simple thing to say. It’s an entire, specialized job to validate datasets, and that expertise is built into the research and monitoring that happens while training an LLM. 

RAG gives a new source of data a privileged position—you’re saying “this data is more authoritative than that data” and, since the LLM doesn’t have anything in its general database, it may not have a counterargument. If you’re not paying attention to your RAG data source standards, and doing so on an ongoing basis, it’s possible, and even likely, that data bias, low-quality data, etc. could creep into your model. 

Think of it this way: If you’re pointing to a new feature in your tech docs and there’s an error, that impact is magnified because an LLM will give more weight to the RAG data. At least in that case, you’re the one who controls the source data. In our other examples of legal or medical AI tools pointing to journal updates, things can get, well, more complicated. If (when) you’re setting up an AI that uses RAG, it’s imperative to make sure you’re also setting yourself up with reliable sources that are regularly updated. 

But, given its impact, and how low of a lift it is to integrate into existing products, we can see why RAG is all the RAGe—and, as always, we look forward to more to come in the AI landscape. For now, we can already see the impact it’s having on the market, with SaaS companies and startups alike exploring the possibilities.


AWS CodeArtifact adds support for Rust packages with Cargo

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-codeartifact-adds-support-for-rust-packages-with-cargo/

Starting today, Rust developers can store and access their libraries (known as crates in Rust’s world) on AWS CodeArtifact.

Modern software development relies heavily on pre-written code packages to accelerate development. These packages, which can number in the hundreds for a single application, tackle common programming tasks and can be created internally or obtained from external sources. While these packages significantly help to speed up development, their use introduces two main challenges for organizations: legal and security concerns.

On the legal side, organizations need to ensure they have compatible licenses for these third-party packages and that they don’t infringe on intellectual property rights. Security is another risk, as vulnerabilities in these packages could be exploited to compromise an application. A known tactic, the supply chain attack, involves injecting vulnerabilities into popular open source projects.

To address these challenges, organizations can set up private package repositories. These repositories store pre-approved packages vetted by security and legal teams, limiting the risk of legal or security exposure. This is where CodeArtifact comes in.

AWS CodeArtifact is a fully managed artifact repository service designed to securely store, publish, and share software packages used in application development. It supports popular package managers and formats such as npm, PyPI, Maven, NuGet, SwiftPM, and Rubygem, enabling easy integration into existing development workflows. It helps enhance security through controlled access and facilitates collaboration across teams. CodeArtifact helps maintain a consistent, secure, and efficient software development lifecycle by integrating with AWS Identity and Access Management (IAM) and continuous integration and continuous deployment (CI/CD) tools.

For the eighth year in a row, Rust has topped the chart as “the most desired programming language” in Stack Overflow’s annual developer survey, with more than 80 percent of developers reporting that they’d like to use the language again next year. Rust’s growing popularity stems from its ability to combine the performance and memory safety of systems languages such as C++ with features that make writing reliable, concurrent code easier. This, along with a rich ecosystem and a strong focus on community collaboration, makes Rust an attractive option for developers working on high-performance systems and applications.

Rust developers rely on Cargo, the official package manager, to manage package dependencies. Cargo simplifies the process of finding, downloading, and integrating pre-written crates (libraries) into their projects. This not only saves time by eliminating manual dependency management, but also ensures compatibility and security. Cargo’s robust dependency resolution system tackles potential conflicts between different crate versions, and because many crates come from a curated registry, developers can be more confident about the code’s quality and safety. This focus on efficiency and reliability makes Cargo an essential tool for building Rust applications.

Let’s create a CodeArtifact repository for my crates
In this demo, I use the AWS Command Line Interface (AWS CLI) and AWS Management Console to create two repositories. I configure the first repository to download public packages from the official crates.io repository. I configure the second repository to download packages from the first one only. This dual repository configuration is the recommended way to manage repositories and external connections; see the CodeArtifact documentation on managing external connections. To quote the documentation:

“It is recommended to have one repository per domain with an external connection to a given public repository. To connect other repositories to the public repository, add the repository with the external connection as an upstream to them.”

I sketched this diagram to illustrate the setup.

Code Artifact repositories for cargo

Domains and repositories can be created either from the command line or the console. I choose the command line. In a shell terminal, I type:

CODEARTIFACT_DOMAIN=stormacq-test

# Create an internal-facing repository: crates-io-store
aws codeartifact create-repository \
   --domain $CODEARTIFACT_DOMAIN   \
   --repository crates-io-store

# Associate the internal-facing repository crates-io-store to the public crates-io
aws codeartifact associate-external-connection \
--domain $CODEARTIFACT_DOMAIN \
--repository crates-io-store  \
--external-connection public:crates-io

# Create a second internal-facing repository: cargo-repo 
# and connect it to upstream crates-io-store just created
aws codeartifact create-repository \
   --domain $CODEARTIFACT_DOMAIN   \
   --repository cargo-repo         \
   --upstreams '{"repositoryName":"crates-io-store"}'	 

Next, as a developer, I want my local machine to fetch crates from the internal repository (cargo-repo) I just created.

I configure cargo to fetch libraries from the internal repository instead of the public crates.io. To do so, I create a config.toml file to point to CodeArtifact internal repository.

# First, I retrieve the URI of the repo
REPO_ENDPOINT=$(aws codeartifact get-repository-endpoint \
                           --domain $CODEARTIFACT_DOMAIN \ 
                           --repository cargo-repo       \
                           --format cargo                \
                           --output text)

# at this stage, REPO_ENDPOINT is https://stormacq-test-012345678912.d.codeartifact.us-west-2.amazonaws.com/cargo/cargo-repo/

# Next, I create the cargo config file
cat << EOF > ~/.cargo/config.toml
[registries.cargo-repo]
index = "sparse+$REPO_ENDPOINT"
credential-provider = "cargo:token-from-stdout aws codeartifact get-authorization-token --domain $CODEARTIFACT_DOMAIN --query authorizationToken --output text"

[registry]
default = "cargo-repo"

[source.crates-io]
replace-with = "cargo-repo"
EOF

Note that the two environment variables are expanded when I create the config file; cargo doesn’t support environment variables in its configuration.

From now on, on this machine, every time I invoke cargo to add a crate, cargo will obtain an authorization token from CodeArtifact to communicate with the internal cargo-repo repository. I must have IAM privileges to call the get-authorization-token CodeArtifact API, in addition to permissions to read or publish packages, depending on the command I use. If you’re running this setup from a build machine for your continuous integration (CI) pipeline, the build machine must have the proper permissions to do so.
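
For the curious, here’s roughly what that credential provider does, expressed with the AWS SDK for Python (Boto3) instead of the AWS CLI. The domain matches the one created earlier; the region and token lifetime are assumptions to adjust for your own setup.

import boto3

codeartifact = boto3.client("codeartifact", region_name="us-west-2")

# Exchange IAM credentials for a short-lived CodeArtifact token, the same thing
# the cargo credential provider does by shelling out to the AWS CLI.
token = codeartifact.get_authorization_token(
    domain="stormacq-test",   # the CODEARTIFACT_DOMAIN used earlier
    durationSeconds=3600,     # assumed lifetime; pick one that fits your builds
)

print(token["authorizationToken"][:16] + "...")  # avoid logging the full secret
print("expires at:", token["expiration"])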

I can now test this setup and add a crate to my local project.

$ cargo add regex
    Updating `codeartifact` index
      Adding regex v1.10.4 to dependencies
             Features:
             + perf
             + perf-backtrack
             + perf-cache
             + perf-dfa
             + perf-inline
             + perf-literal
             + perf-onepass
             + std
             + unicode
             + unicode-age
             + unicode-bool
             + unicode-case
             + unicode-gencat
             + unicode-perl
             + unicode-script
             + unicode-segment
             - logging
             - pattern
             - perf-dfa-full
             - unstable
             - use_std
    Updating `cargo-repo` index

# Build the project to trigger the download of the crate
$ cargo build
  Downloaded memchr v2.7.2 (registry `cargo-repo`)
  Downloaded regex-syntax v0.8.3 (registry `cargo-repo`)
  Downloaded regex v1.10.4 (registry `cargo-repo`)
  Downloaded aho-corasick v1.1.3 (registry `cargo-repo`)
  Downloaded regex-automata v0.4.6 (registry `cargo-repo`)
  Downloaded 5 crates (1.5 MB) in 1.99s
   Compiling memchr v2.7.2 (registry `cargo-repo`)
   Compiling regex-syntax v0.8.3 (registry `cargo-repo`)
   Compiling aho-corasick v1.1.3 (registry `cargo-repo`)
   Compiling regex-automata v0.4.6 (registry `cargo-repo`)
   Compiling regex v1.10.4 (registry `cargo-repo`)
   Compiling hello_world v0.1.0 (/home/ec2-user/hello_world)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 16.60s

I can verify CodeArtifact downloaded the crate and its dependencies from the upstream public repository. I connect to the CodeArtifact console and check the list of packages available in either repository I created. At this stage, the package list should be identical in the two repositories.

CodeArtifact cargo packages list

Publish a private package to the repository
Now that I know the upstream link works as intended, let’s publish a private package to my cargo-repo repository to make it available to other teams in my organization.

To do so, I use the standard Rust tool cargo, just like usual. Before doing so, I add and commit the project files to the git repository.

$  git add . && git commit -m "initial commit"
 5 files changed, 1855 insertions(+)
create mode 100644 .gitignore
create mode 100644 Cargo.lock
create mode 100644 Cargo.toml
create mode 100644 commands.sh
create mode 100644 src/main.rs

$  cargo publish 
    Updating `codeartifact` index
   Packaging hello_world v0.1.0 (/home/ec2-user/hello_world)
    Updating crates.io index
    Updating `codeartifact` index
   Verifying hello_world v0.1.0 (/home/ec2-user/hello_world)
   Compiling libc v0.2.155
... (redacted for brevity) ....
   Compiling hello_world v0.1.0 (/home/ec2-user/hello_world/target/package/hello_world-0.1.0)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1m 03s
    Packaged 5 files, 44.1KiB (11.5KiB compressed)
   Uploading hello_world v0.1.0 (/home/ec2-user/hello_world)
    Uploaded hello_world v0.1.0 to registry `cargo-repo`
note: waiting for `hello_world v0.1.0` to be available at registry `cargo-repo`.
You may press ctrl-c to skip waiting; the crate should be available shortly.
   Published hello_world v0.1.0 at registry `cargo-repo`

Lastly, I use the console to verify the hello_world crate is now available in the cargo-repo.

CodeArtifact cargo package hello world

Pricing and availability
You can now store your Rust libraries in the 13 AWS Regions where CodeArtifact is available. There is no additional cost for Rust packages. The three billing dimensions are the storage (measured in GB per month), the number of requests, and the data transfer out to the internet or to other AWS Regions. Data transfer to AWS services in the same Region is not charged, meaning you can run your continuous integration and delivery (CI/CD) jobs on Amazon Elastic Compute Cloud (Amazon EC2) or AWS CodeBuild, for example, without incurring a charge for the CodeArtifact data transfer. As usual, the pricing page has the details.

Now go build your Rust applications and upload your private crates to CodeArtifact!

— seb

Anthropic’s Claude 3.5 Sonnet model now available in Amazon Bedrock: Even more intelligence than Claude 3 Opus at one-fifth the cost

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/anthropics-claude-3-5-sonnet-model-now-available-in-amazon-bedrock-the-most-intelligent-claude-model-yet/

It’s been just 3 months since Anthropic launched Claude 3, a family of state-of-the-art artificial intelligence (AI) models that allows you to choose the right combination of intelligence, speed, and cost that suits your needs.

Today, Anthropic introduced Claude 3.5 Sonnet, its first release in the forthcoming Claude 3.5 model family. We are happy to announce that Claude 3.5 Sonnet is now available in Amazon Bedrock.

Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming other generative AI models on a wide range of evaluations, including Anthropic’s previously most intelligent model, Claude 3 Opus. Claude 3.5 Sonnet is available with the speed and cost of the original Claude 3 Sonnet model. In fact, you can now get intelligence and speed better than Claude 3 Opus at one-fifth of the price because Claude 3.5 Sonnet is 80 percent cheaper than Opus.

Anthropic Claude 3.5 Sonnet Family

The frontier intelligence displayed by Claude 3.5 Sonnet, combined with cost-effective pricing, makes the model ideal for complex tasks such as context-sensitive customer support, orchestrating multi-step workflows, and streamlining code translations.

Claude 3.5 Sonnet sets new industry benchmarks for undergraduate-level expert knowledge (MMLU), graduate-level expert reasoning (GPQA), code (HumanEval), and more. As you can see in the following table, according to Anthropic, Claude 3.5 Sonnet outperforms OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro in nearly every benchmark.

Anthropic Claude 3.5 Sonnet Benchmarks

Claude 3.5 Sonnet is also Anthropic’s strongest vision model yet, performing an average of 10 percent better than Claude 3 Opus across the majority of vision benchmarks. According to Anthropic, Claude 3.5 Sonnet also outperforms other generative AI models in nearly every category.

Anthropic Claude 3.5 Sonnet Vision Benchmarks

Anthropic’s Claude 3.5 Sonnet key improvements
The release of Claude 3.5 Sonnet brings significant improvements across multiple domains, empowering software developers and businesses with new generative AI-powered capabilities. Here are some of the key strengths of this new model:

Visual processing and understanding – Claude 3.5 Sonnet demonstrates remarkable capabilities in processing images, particularly in interpreting charts and graphs. It accurately transcribes text from imperfect images, a core capability for industries such as retail, logistics, and financial services that need to gather more insights from graphics or illustrations than from text alone. Use Claude 3.5 Sonnet to automate visual data processing tasks, extract valuable information, and enhance data analysis pipelines.

Writing and content generation – Claude 3.5 Sonnet represents a significant leap in its ability to understand nuance and humor. The model produces high-quality written content with a more natural, human tone that feels more authentic and relatable. Use the model to generate engaging and compelling content, streamline your writing workflows, and enhance your storytelling capabilities.

Customer support and natural language processing – With its improved understanding of context and multistep workflow orchestration, Claude 3.5 Sonnet excels at handling intricate customer inquiries. This capability enables round-the-clock support, faster response times, and more natural-sounding interactions, ultimately leading to improved customer satisfaction. Use this model to automate and enhance customer support processes and provide a seamless experience for end users. For an example of a similar implementation, see how DoorDash built a generative AI self-service contact center solution using Anthropic’s Claude 3 models in Amazon Bedrock.

Analytics and insights – Claude 3.5 Sonnet augments human expertise in data science by effortlessly navigating unstructured data and using multiple tools to generate insights. It produces high-quality statistical visualizations and actionable predictions, ranging from business strategies to real-time product trends. Use Claude 3.5 Sonnet to simplify data analysis workflows, uncover valuable insights, and drive data-driven decision-making.

Coding and software development – Claude 3.5 Sonnet can independently write, edit, and execute code with sophisticated reasoning and troubleshooting capabilities when provided with the relevant tools. Use Claude 3.5 Sonnet to streamline developer workflows, accelerate coding tasks, reduce manual effort, and enhance overall productivity.

Using Anthropic’s Claude 3.5 Sonnet in the Amazon Bedrock console
Because this is a new model, I go to the Amazon Bedrock console and choose Model access from the navigation pane to enable access to Claude 3.5 Sonnet.

Console screenshot.

Now that I have access to the model, I’d like to use it to extract information from the following image from the Our World in Data website:

Primary energy consumption from hydropower, 2022

I choose Chat in the Playgrounds section of the navigation pane and select Anthropic and Claude 3.5 Sonnet as the model. I then choose the Image upload button to upload the previous image, type the following prompt, and choose Run:

Which countries consume more than 1000 TWh from hydropower? Think step by step and look at all regions. Output in JSON.

Here’s the result and an animated image recapping what I did.

To determine which countries consume more than 1,000 TWh from hydropower in 2022, I need to carefully examine the map and color-coding. The scale at the bottom shows that dark blue represents consumption between 1,000 TWh and 2,000 TWh.

Let's look at each region systematically:

1. North America: Canada is dark blue, indicating it consumes between 1,000-2,000 TWh from hydropower.
2. South America: Brazil is dark blue, also consuming between 1,000-2,000 TWh.
3. Europe: No country is dark blue; all are light green or have no data.
4. Africa: No data available for any country.
5. Asia: China is dark blue, indicating consumption between 1,000-2,000 TWh.
6. Oceania: No country is dark blue; Australia is light green.

In summary, there are three countries that consume more than 1,000 TWh from hydropower in 2022:
1. Canada (North America)
2. Brazil (South America)
3. China (Asia)

Here's the response in JSON format:

```json
{
  "countries_consuming_over_1000TWh_hydropower": [
    {
      "name": "Canada",
      "region": "North America",
      "consumption_range": "1,000-2,000 TWh"
    },
    {
      "name": "Brazil",
      "region": "South America",
      "consumption_range": "1,000-2,000 TWh"
    },
    {
      "name": "China",
      "region": "Asia",
      "consumption_range": "1,000-2,000 TWh"
    }
  ]
}
```

These three countries stand out as the highest consumers of hydropower energy, each using between 1,000 and 2,000 terawatt-hours in 2022.

Anthropic's Claude 3.5 Sonnet demo in the Amazon Bedrock console.

The model’s ability to reliably extract information from unstructured data, like images, opens up a world of new possibilities.

I choose the three small dots in the corner of the playground window and then View API request to see code examples using the model in the AWS Command Line Interface (AWS CLI) and AWS SDKs. Let’s have a better look at the code syntax.

Using Claude 3.5 Sonnet with AWS SDKs
You can use Claude 3.5 Sonnet with any AWS SDK using the new Amazon Bedrock Converse API or Anthropic Claude Messages API.

To update code already using a Claude 3 model, I just need to replace the model ID with:

anthropic.claude-3-5-sonnet-20240620-v1:0

Here’s a sample implementation with the AWS SDK for Python (Boto3) using the same image as before to show how to use images and text with the Converse API.

import boto3
from botocore.exceptions import ClientError

MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

IMAGE_NAME = "primary-energy-hydro.png"

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

with open(IMAGE_NAME, "rb") as f:
    image = f.read()

user_message = "Which countries consume more than 1000 TWh from hydropower? Think step by step and look at all regions. Output in JSON."

messages = [
    {
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image}}},
            {"text": user_message},
        ],
    }
]

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=messages,
)
response_text = response["output"]["message"]["content"][0]["text"]
print(response_text)

When I run it, I get a similar output as in the console:

Let's approach this step-by-step:

1. First, I'll examine the color scale at the bottom of the map. The darkest blue color represents consumption of 2,000 TWh or more.

2. Now, I'll scan the map region by region:

   North America: Canada is dark blue, indicating over 1,000 TWh.
   South America: Brazil is also dark blue, over 1,000 TWh.
   Europe: No country appears to be dark blue.
   Africa: No country appears to be dark blue.
   Asia: China stands out as dark blue, indicating over 1,000 TWh.
   Oceania: No country appears to be dark blue.

3. To be thorough, I'll double-check for any medium blue countries that might be close to or over 1,000 TWh, but I don't see any that appear to reach that threshold.

4. Based on this analysis, there are three countries that clearly consume more than 1,000 TWh from hydropower.

Now, I'll format the answer in JSON:

```json
{
  "countries_consuming_over_1000TWh_hydropower": [
    "Canada",
    "Brazil",
    "China"
  ]
}
```

This JSON output lists the three countries that visually appear to consume more than 1,000 TWh of primary energy from hydropower according to the 2022 data presented in the map.

Because I didn’t specify a JSON schema, the two answers use different formats. In your applications, you can describe in the prompt the JSON properties you want, or provide a sample, to get a consistent output format.
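For example, a prompt along these lines (a hypothetical variation on the user_message from the earlier script, not taken from the original demo) pins the output to a fixed shape:

user_message = (
    "Which countries consume more than 1000 TWh from hydropower? "
    "Think step by step and look at all regions. "
    "Answer only with JSON that matches this example: "
    '{"countries": [{"name": "...", "region": "...", "consumption_range_twh": "..."}]}'
)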

For more examples, see the code samples in the Amazon Bedrock User Guide. For a more advanced use case, here’s a fully functional tool use demo illustrating how to connect a generative AI model with a custom tool or API.
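If you just want a feel for how tool use looks with the Converse API, here is a minimal sketch (my own illustration, not code from the linked demo; the tool name, schema, and weather scenario are hypothetical):

import boto3

MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Declare a tool the model is allowed to request.
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "get_city_weather",  # hypothetical tool
                "description": "Returns the current weather for a city.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    }
                },
            }
        }
    ]
}

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": "What's the weather in Rome?"}]}],
    toolConfig=tool_config,
)

# When the model wants the tool, the stop reason is "tool_use" and the request
# details appear as a toolUse block in the message content.
if response["stopReason"] == "tool_use":
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            print(block["toolUse"]["name"], block["toolUse"]["input"])

Your application would then run the tool and send its result back to the model in a follow-up converse() call.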

Using Claude 3.5 Sonnet with the AWS CLI
There are times when nothing beats the speed of the command line. This is how you can use the AWS CLI with the new model:

aws bedrock-runtime converse \
    --model-id anthropic.claude-3-5-sonnet-20240620-v1:0 \
    --messages '{"role": "user", "content": [{"text": "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?"}]}' \
    --region us-east-1 \
    --query output.message.content

In the output, I use the query option to only get the content of the output message:

[
    {
        "text": "Let's approach this step-by-step:\n\n1. First, we need to understand the relationships:\n   - Alice has N brothers\n   - Alice has M sisters\n\n2. Now, let's consider Alice's brother:\n   - He is one of Alice's N brothers\n   - He has the same parents as Alice\n\n3. This means that Alice's brother has:\n   - The same sisters as Alice\n   - One sister more than Alice (because Alice herself is his sister)\n\n4. Therefore, the number of sisters Alice's brother has is:\n   M + 1\n\n   Where M is the number of sisters Alice has.\n\nSo, the answer is: Alice's brother has M + 1 sisters."
    }
]

I copy the text into a small Python program to see it printed on multiple lines:

print("Let's approach this step-by-step:\n\n1. First, we need to understand the relationships:\n   - Alice has N brothers\n   - Alice has M sisters\n\n2. Now, let's consider Alice's brother:\n   - He is one of Alice's N brothers\n   - He has the same parents as Alice\n\n3. This means that Alice's brother has:\n   - The same sisters as Alice\n   - One sister more than Alice (because Alice herself is his sister)\n\n4. Therefore, the number of sisters Alice's brother has is:\n   M + 1\n\n   Where M is the number of sisters Alice has.\n\nSo, the answer is: Alice's brother has M + 1 sisters.")
Let's approach this step-by-step:

1. First, we need to understand the relationships:
   - Alice has N brothers
   - Alice has M sisters

2. Now, let's consider Alice's brother:
   - He is one of Alice's N brothers
   - He has the same parents as Alice

3. This means that Alice's brother has:
   - The same sisters as Alice
   - One sister more than Alice (because Alice herself is his sister)

4. Therefore, the number of sisters Alice's brother has is:
   M + 1

   Where M is the number of sisters Alice has.

So, the answer is: Alice's brother has M + 1 sisters.

Even though this was a fairly nuanced question, Claude 3.5 Sonnet got it right and described its reasoning step by step.

Things to know
Anthropic’s Claude 3.5 Sonnet is available in Amazon Bedrock today in the US East (N. Virginia) AWS Region. More information on Amazon Bedrock model support by Region is available in the documentation. View the Amazon Bedrock pricing page to determine the costs for your specific use case.

By providing access to a faster and more powerful model at a lower cost, Claude 3.5 Sonnet makes generative AI easier and more effective to use for many industries, such as:

Healthcare and life sciences – In the medical field, Claude 3.5 Sonnet shows promise in enhancing imaging analysis, acting as a diagnostic assistant for patient triage, and summarizing the latest research findings in an easy-to-digest format.

Financial services – The model can provide valuable assistance in identifying financial trends and creating personalized debt repayment plans tailored to clients’ unique situations.

Legal – Law firms can use the model to accelerate legal research by quickly surfacing relevant precedents and statutes. Additionally, the model can increase paralegal efficiency through contract analysis and assist with drafting standard legal documents.

Media and entertainment – The model can expedite research for journalists, support the creative process of scriptwriting and character development, and provide valuable audience sentiment analysis.

Technology – For software developers, Claude 3.5 Sonnet offers opportunities in rapid application prototyping, legacy code migration, innovative feature ideation, user experience optimization, and identification of friction points.

Education – Educators can use the model to streamline grant proposal writing, develop comprehensive curricula incorporating emerging trends, and receive research assistance through database queries and insight generation.

It’s an exciting time for generative AI. To start using this new model, see the Anthropic Claude models section of the Amazon Bedrock User Guide. You can also visit our community.aws site to find deep-dive technical content and to discover how our Builder communities are using Amazon Bedrock in their solutions. Let me know what you do with these enhanced capabilities!

Danilo

NAS vs. Cloud Storage: Which Solution Fits Your Business Needs?

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/nas-vs-cloud-storage-which-solution-fits-your-business-needs/

A decorative image showing a cloud and a NAS device.

If you’re searching for the best storage solution for your business and finding yourself stuck between network attached storage (NAS) and cloud storage, you’re not alone. Many business owners, IT professionals, and system administrators are stuck in the same dilemma. With plenty of discussions and debates outlining the pros and cons of one or the other, it can be difficult to determine the best storage solution for your specific needs. 

This blog aims to cut through the noise and provide clear, actionable insights into NAS and cloud storage, addressing your most pressing questions and helping you make an informed decision. Whether the focus is cost, scalability, security, or accessibility, this guide will help identify the ideal storage solution for your business.  

Understanding NAS: What is NAS? 

NAS is a file-level storage system designed specifically to provide centralized and shared disk storage for users on a local area network (LAN). Essentially, NAS is a purpose-built computer that operates its own dedicated operating system (OS). It contains one or more storage devices that are configured to create a single shared volume. These storage devices are arranged in a RAID configuration to provide data redundancy and performance. 

How does NAS work?

NAS provides access to files using standard network file sharing protocols such as Network File System (NFS) and Server Message Block (SMB). By connecting to the local network, NAS allows users to easily store, access, and collaborate on files without overburdening other servers within the network. This separation of file serving responsibilities helps optimize overall network performance. Management and control of the NAS system are typically performed through a web-based utility accessible over the network, offering a user-friendly interface for administration tasks. 

Advantages of NAS

NAS offers several advantages, including faster data access, easier administration, and simplified management:

  • Cost effective: NAS devices typically involve an upfront purchase cost that includes access to applications from the NAS provider, like Synology Hyper Backup or QNAP Hybrid Backup Sync. This greatly reduces ongoing subscription fees, though you may incur costs if you want to expand your storage capacity with high-capacity storage drives or increase its performance with more powerful processors, etc. 
  • Data control and security: NAS systems offer extensive control over data storage and security protocols. NAS systems are only accessible on the local office network and to user accounts that can be controlled and managed.
  • Performance: NAS provides high-speed access to data over a local network, ensuring quick file retrieval and sharing. A NAS generally performs as fast as the local office network allows.
  • Scalable storage: Many NAS systems allow additional drives to be added, providing flexible storage expansion, albeit with the cost of additional drives or device upgrades. NAS devices today offer large storage capacities and advanced features for virtualization and application hosting.
  • Data redundancy: When equipped with RAID configurations, NAS provides redundancy, ensuring data remains accessible even if one or more hard drives fail.
  • Better data management tools: Features such as fully automated backups, deduplication, compression, and encryption enhance data storage efficiency and security. NAS systems also support sync workflows for team collaboration, directory services for user and group management, and services like photo or media management.

Limitations of NAS

While NAS offers numerous advantages for centralized file storage, there are some notable limitations to consider:

  • Initial setup and maintenance: Setting up and maintaining an enterprise NAS requires some IT expertise. The configuration process can be complex, and ongoing maintenance may demand technical skills that not all businesses possess in-house.
  • Remote access vulnerabilities: NAS systems can be accessed remotely over the internet, creating a private cloud or hybrid cloud solution. While remote access is a significant advantage, anything connected to the internet also poses security risks, as bad actors can exploit vulnerabilities to gain remote access to the device. To mitigate this, it is crucial to ensure proper configuration, regular updates, and secure access settings.
  • Scalability constraints: Although NAS systems allow for storage expansion, they are still limited by the physical capacity of the hardware. Scaling beyond certain limits can require significant investment in new devices or infrastructure.
  • Data vulnerability: Data stored on a NAS is susceptible to various threats, including hardware failures, natural disasters, theft, and cyber attacks such as ransomware. While RAID configurations offer some level of data redundancy, they do not protect against all forms of data loss. Regular backups and additional security measures are essential to mitigate these risks.
  • Performance overheads: As more users and devices access the NAS, network bandwidth and device performance can become bottlenecks. High demand can lead to slower access speeds and reduced efficiency, especially in larger organizations with extensive data needs.

Understanding cloud storage: What is cloud storage?

Cloud storage is a model of data storage where data is stored on servers located in off-site locations and accessed via the internet. There are two main types of cloud: public and private. Public cloud storage providers maintain servers and are responsible for hosting, managing, and securing the data. Private cloud storage is typically managed in-house and dedicated to a single organization; for example, when a university maintains data centers for access by their wider community. In both cases, providers ensure that the data is always accessible through both public and private internet connections. 

What’s the diff: Public vs. private cloud

Public cloud storage services are provided by third-party vendors over the public internet, making them accessible to anyone who wants to purchase or lease storage capacity. These services are designed to offer scalability and reliability, often on a pay-as-you-go basis.

Private cloud storage is dedicated to a single organization where an organization utilizes its own servers and data centers to store data within their own network. It can be hosted on-premises or by a third-party provider, but it’s always behind the organization’s firewall. This model is ideal for businesses that require more control over their data and have stringent security and compliance requirements.

One of the key benefits of public cloud storage is that it eliminates the need for businesses to buy, manage, and operate their own data center infrastructure. This shift allows companies to move from a capital expenditure (CapEx) model to an operational expenditure (OpEx) model. Additionally, cloud storage is elastic, enabling businesses to scale their storage capacity up or down more efficiently and strategically than through tactical hardware investments.

Types of cloud storage architecture 

In addition to the benefits of elasticity and scalability gained by adopting cloud storage, you can also combine on-premises storage and different types of public or private cloud storage to uniquely support your business needs. The primary models of cloud storage are:

  • Hybrid cloud storage: A hybrid model combines both public and private cloud storage. This allows an organization to decide which data it wants to store in which cloud. Sensitive data and data that must meet strict compliance requirements may be stored in a private cloud while less sensitive data is stored in the public cloud. You could also use it to leverage on-premises storage for performance-sensitive tasks, such as using NAS to edit large media files locally, which are later synced to the cloud. 
  • Multi-cloud storage: A multi-cloud model involves using two or more public cloud storage services from different service providers. This model helps businesses leverage the best features of each cloud service while enhancing data availability and redundancy. For example, some companies use multiple cloud providers to host mirrored copies of their active production data. If one of their public clouds suffers an outage, they have mechanisms in place to direct their applications or websites to failover to a second public cloud.

How does cloud storage work?

Cloud storage works by allowing users to upload data, such as files, documents, videos, or images to remote servers via the internet. Public cloud storage providers like Amazon, Google, Microsoft, and Backblaze maintain servers in large data centers. The uploaded data can be accessed and managed through web interfaces or APIs, making it highly accessible and flexible. 

Cloud storage offers numerous benefits that can greatly enhance business operations. However, there are also a few considerations to keep in mind. Next we’ll look at the advantages and some of the key limitations of cloud storage. 

Advantages of cloud storage

  • Off-site protection: Cloud storage provides convenient off-site protection for data, ensuring that in the event of a physical disaster (such as fire or flood), data remains safe and accessible from any location.  
  • Enhanced security: Leading cloud providers invest heavily in advanced security measures—including encryption, multi-factor authentication, Object Lock for immutability, and regular security audits—to protect stored data from unauthorized access and breaches.
  • Scalability: Cloud storage services offer virtually unlimited storage capacity. Businesses can easily scale their storage needs up or down based on demand. 
  • Accessibility: Data stored in the cloud can be accessed from anywhere with an internet connection, facilitating remote work and collaboration. 
  • Lower maintenance: Cloud providers handle all hardware maintenance, software updates, and security patches, reducing the IT burden on businesses. 
  • Cost efficiency: Many cloud storage solutions operate on a pay-as-you-go model, allowing businesses to pay only for the storage they use, which can be more cost-effective than investing in on-premises hardware. 

Limitations of cloud storage

  • Ongoing costs: Unlike on-premises storage solutions such as NAS, cloud storage operates on a subscription-based pricing model. When evaluating cloud storage, businesses should consider the total cost of ownership, including ongoing fees, and weigh these against the benefits of cloud storage. 
  • Dependence on the internet: Cloud storage relies on a stable internet connection for access and data transfer. Any disruptions in internet connectivity can hinder access to critical files and services, potentially impacting business operations. Ensuring reliable internet service and having contingency plans are crucial for minimizing downtime. 

NAS vs cloud storage: A side-by-side comparison

The following table provides a side-by-side comparison of NAS and cloud storage, highlighting key aspects such as cost, scalability, security, and performance. This comparison will help you determine which storage solution best aligns with your business requirements and operational workflows.

| Aspect | NAS | Cloud Storage |
|---|---|---|
| Storage model | File-level storage within a local network | Data stored on remote servers accessed via the internet |
| Performance | High-speed access over a local network | Dependent on internet speed and latency |
| Scalability | Limited by physical hardware capacity | Virtually unlimited scalability |
| Cost | Upfront hardware purchase, ongoing investment to expand capacity | Subscription-based, pay-as-you-go model |
| Maintenance | Requires in-house IT maintenance | Maintenance handled by cloud provider |
| Security | Controlled in-house, local network security | Enhanced by provider with encryption and security |
| Data redundancy | RAID configurations for local redundancy | Built-in data redundancy and disaster recovery options |

Hybrid cloud: The best of both worlds

A hybrid cloud solution combines the strengths of both NAS and cloud storage. While NAS offers a centralized location to store and access files, the data stored on the NAS is still vulnerable to data disasters such as floods, fires, or hardware failures. Integrating cloud storage with NAS ensures that there is an off-site backup of your NAS data that securely protects your critical data from virtually any data threat. This approach not only mitigates the risk associated with physical damage to your on-premises NAS equipment but also offers the scalability and remote accessibility benefits of cloud storage. Additionally, this helps you implement 3-2-1 backup protection where three copies of your data are stored in two different storage media (NAS and cloud) with one copy stored off-site in the cloud, protecting against ransomware, hardware failures, natural disasters, and other data threats. 

NAS vs. cloud: Which is best for your business?

Choosing between NAS and cloud storage for your business largely depends on your specific use cases and operational needs. NAS provides fast local access, control, and cost efficiency for businesses with stable storage needs and on-premises operations. In contrast, cloud storage offers unparalleled scalability, remote access, and maintenance-free operation, making it ideal for organizations with dynamic storage needs and remote workforces. However, many businesses find that a combination of both, known as a hybrid cloud solution, offers the best of both worlds.

Ultimately, the right choice will depend on a thorough evaluation of your business needs and operational workflows. By understanding the strengths and limitations of both NAS and cloud storage, you can make an informed decision that ensures your data is secure, accessible, and available when you need it. 

FAQs about NAS and cloud storage

Is cloud storage better than NAS?

The answer depends on your specific needs. Cloud storage offers scalability, remote access, and low maintenance. NAS, on the other hand, provides fast local access and higher data control. Each solution has its strengths, and the best choice will depend on your priorities regarding data security, access, and cost.

Can I use a NAS as a cloud?

Yes, many modern NAS devices come with built-in features that allow them to function similarly to cloud storage, or to connect to a cloud storage provider of your choice. These NAS systems can be accessed remotely over the internet, creating a private cloud or hybrid cloud solution. However, it requires proper configuration and a reliable internet connection to ensure seamless remote access.

Why use NAS instead of a server?

NAS devices are purpose-built for storage, offering simplicity, ease of management, and lower costs compared to traditional servers. While servers are multifunctional and can handle a variety of tasks, they are more complex to set up and maintain. NAS provides a straightforward solution for file sharing, backups, and media streaming without the need for extensive IT infrastructure. This makes NAS an excellent choice for small to medium-sized businesses that primarily need a dedicated storage solution. 

Can NAS work without the internet?

Yes, NAS devices are designed to operate within a local area network (LAN) and do not require an internet connection for local access and file sharing. Users can store, access, and collaborate on files within local networks without internet access. However, for remote access or to leverage additional features such as cloud backups, an internet connection is necessary.

The post NAS vs. Cloud Storage: Which Solution Fits Your Business Needs? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

2024 State of the Backup: Survey Says Security Incidents and Data Loss on the Rise

Post Syndicated from Yev original https://backblaze.com/blog/2024-state-of-the-backup-security-incidents-and-data-loss-on-the-rise/

A decorative image showing several icons that represent graphs, charts, and stats.

June is Backup Awareness Month, and every year, we work with the Harris Poll to survey the state of computer backups in the U.S. It’s our 16th year running, and this year, we expanded our lens and created a new survey focused on analyzing the state of backups among businesses, providing critical insights into organizational backup strategies and challenges. 

And as in previous years, our consumer backup survey provides a comprehensive summary that reflects trends and changes over nearly two decades. The combination of these two audience surveys provides a more complete picture of the state of backups in the U.S. Let’s start with our new survey data.

Spotlight: Business Backup Is Coming Up Short

Our inaugural Business Backup Survey included 300 IT decision makers across the U.S. Part of what we wanted to learn was: 

With all of the different ways IT professionals have to protect their users’ data, how are they choosing to back up, and are backup solutions even working?

We can infer the answers to those questions by looking at the tools organizations use to back up their data, the frequency of data recoveries, and how successful or not those recoveries were.

What Tools Do Organizations Use to Back Up Their Data?

One of the most striking findings from the poll is that a significant majority (84%) of IT decision makers say their organizations utilize cloud drive services, which rely on syncing data to the cloud, for off-site data backup. You may have heard us say this before—sync is not backup.

What’s Wrong with Cloud Drives and Sync Services?

Cloud drives allow for file storage and sharing but may not protect against file corruption or accidental deletion. Sync services automatically update files across multiple devices, meaning that any changes or deletions are replicated everywhere, which can lead to unintended data loss. While some cloud drives have added minimal backup capabilities (i.e., 30 days of version history or similar), they are often lacking in key areas that are necessary for business continuity or compliance standards. 

Cloud backup solutions, on the other hand, are designed to systematically and securely back up data, offering robust protection against loss, corruption, and security breaches. This makes cloud backup a better choice, particularly for addressing security concerns and ensuring the integrity and availability of critical data.

How Often Do Organizations Need to Restore Data?

39% of IT decision-makers report that their organizations need to restore data from backups at least once a month, with special requests for archived or deleted data (62%), backup software failure (54%), hard drive failure (52%), and cyber attacks (49%) reported as some of the top reasons. This frequent need for data recovery underscores the persistent vulnerabilities IT professionals face.

Are Data Recoveries Successful?

Not only do many organizations need to restore on a regular basis, but the survey also shows that among those that experienced data loss, only 42% were able to recover all of their data when they performed a restore. That leaves 58% with some amount of unrecovered data.

Are Backups Working?

The data shows a sizable gap between the use of backup services and the effectiveness of data restoration. Although a significant percentage of organizations indicate they’re using what they would consider a cloud backup solution (the shortcomings of cloud drives and cloud sync services aside), only 42% of those that experienced data loss were able to restore all their data. This discrepancy highlights the risks associated with inadequate backup measures and the potential for data loss, which can have serious repercussions for businesses.

Only 42% of organizations that experienced data loss were able to restore all their data.

There are all sorts of ways businesses need to slice their data management strategy in order to make sure all data is backed up. This includes data type (e.g., files vs. system information), frequency with which the data is updated or changed, retention requirements for compliance, and more. There are often reasons that businesses will employ different backup frequency or strategies for different file types—file-based versus block-level incremental backups, for example. However, incomplete backups can lead to situations where only parts of the data can be restored, disrupting business operations and resulting in downtime as efforts are made to recover or reconstruct lost data. 

The importance of creating an end-to-end data backup plan, as well as choosing the right tools that provide comprehensive coverage, may be highlighted only at the moment of failure. As it stands, the Harris Poll data suggests that the limitations of cloud sync and cloud drive tools are leaving gaps in data protection and disaster recovery strategies.

This is further validated by the features IT decision makers report as being absolutely essential/very important in selecting backup tools, including security (97%), bandwidth and memory capacity (87%), a variety of features (79%), and ease of operations and customizable elements (83% each). These rigorous requirements suggest that many existing solutions may fall short of meeting the comprehensive needs of modern businesses, and/or that the complex mix of tools may be contributing to blind spots in an overall data management strategy, only exposed at the point of recovery.

How To Close the Gap?

These insights underscore the need for innovative and robust backup solutions that address evolving business requirements. As the volume of data continues to grow and cyber threats become increasingly sophisticated, the demand for reliable, secure, and user-friendly backup systems will only heighten. Given the challenges many businesses face in fully recovering their data, there’s an opportunity to promote education and awareness regarding the importance of refreshing backup strategies and utilizing suitable backup tools.

Consumer Backup Practices: Less Than 1 in 5 Are Certain of Their Backups

The consumer portion of the 2024 Backup Awareness Survey seeks to understand a simple question we’ve asked year after year: How often do you back up all of the data on your computer? We also look at who backs up the most and the reasons people cite for needing to restore data, and we compare those trends over time. Let’s dig into the results.

How Often Do People Back Up?

This year’s survey reveals that fewer than 1 in 5 Americans (15%) feel absolutely certain that their most important files are securely backed up. This is despite 84% of Americans who own a computer stating that they’ve backed up all their data and 45% performing backups at least once a month.

The survey also highlights the predominance of cloud solutions among backup methods: 63% of individuals who back up their data use a cloud-based system as their primary method. However, only 11% utilize dedicated cloud backup services, indicating a preference for cloud drives (39%) and sync services (13%). As we noted above, cloud drives and sync services are fundamentally different from cloud backup solutions and can create gaps in a robust 3-2-1 backup strategy.

Who Is Best at Backing Up?

Every year, we highlight which demographic is the best at backing up their data, and in 2024, men (73% vs. 66% of women) and younger adults ages 18–54 (76% vs. just 61% of those ages 55+) take the lead backing up at least once a year. 

The Reasons for Restores

The survey also found that 74% of Americans who own a computer have accidentally deleted important data (a 5.7% increase from 2023), and 57% have experienced a security incident on their computer.

Trends Over Time

For those interested in the data over time, let’s travel back and see how this year’s data compares to previous years. The first graph is one of our favorites. Since 2023, daily backups have dipped by 1%, while weekly and monthly backups have remained steady, which is encouraging. Additionally, there is a slight, but not statistically significant, increase of 1% in yearly and more-than-yearly backups. Notably, the percentage of people who have never backed up their data has decreased by 2%. 

For all the table enthusiasts, you’ll appreciate this detailed view showcasing how 2024 compares with previous years. We love to see “Never” down to an all-time low, although “Daily” took a slight dip.

If you’re a visual person who appreciates vibrant pie charts for easier data digestion, here are pie charts comparing data from 2008 to 2024:

Within each population (business and consumer), the most striking data points are around the differences between backup and sync. Both consumers and businesses are leveraging cloud drive and sync services for ease of use, but that has not translated to successful data recoveries. With ransomware attacks on the rise, now more than ever, it’s essential to have a strong backup strategy. 

Still, we’ve come a long way since 2008, and the consumer data shows positive change over time around backup awareness and tool adoption. Going forward, we’ll be interested to see how the business audience data changes over time. See below for our full testing methodology, and, as always, drop us a line in the comment section if you have any questions or insights.

Consumer Survey Method:

This survey was conducted online within the United States by The Harris Poll on behalf of Backblaze from April 25-29, 2024, among 2,058 adults ages 18 and older, among whom 1,877 own a computer. The sampling precision of Harris online polls is measured by using a Bayesian credible interval.  For this study, the sample data is accurate to within +/- 2.5 percentage points using a 95% confidence level.

Prior years’ surveys were conducted online by The Harris Poll on behalf of Backblaze among U.S. adults ages 18+ who own a computer on April 25–27, 2023 (n=1,857); May 19–23, 2022 (n=1,861); May 12–14, 2021 (n=1,870); June 1–3, 2020 (n=1,913); June 6–10, 2019 (n=1,858); June 5–7, 2018 (n=1,871); May 19–23, 2017 (n=1,954); May 13–17, 2016 (n=1,920); May 15–19, 2015 (n=2,009); June 2–4, 2014 (n=1,991); June 13–17, 2013 (n=1,952); May 31–June 4, 2012 (n=2,176); June 28–30, 2011 (n=2,209); June 3–7, 2010 (n=2,051); May 13–14, 2009 (n=2,154); and May 27–29, 2008 (n=2,723).

Business Backup Survey Method:

This survey was conducted online within the United States by The Harris Poll on behalf of Backblaze from April 30 – May 8, 2024, among 300 IT Decision Makers. The sampling precision of Harris online polls is measured by using a Bayesian credible interval.  For this study, the sample data is accurate to within +/- 5.7 percentage points using a 95% confidence level.

For complete survey methodologies, including weighting variables and subgroup sample sizes, please contact Backblaze.

The post 2024 State of the Backup: Survey Says Security Incidents and Data Loss on the Rise appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Live Read: The Game Changer for Live Media Cloud Workflows

Post Syndicated from Elton Carneiro original https://backblaze.com/blog/announcing-b2-live-read/

A decorative image with the title Live Read.

Every sports fan knows that when something incredible happens on the field/ice/court, we want to see the replay right now. But many of us don’t know the impressive efforts that live media teams undertake to deliver clips in real time to all of us on whatever viewing platform we might prefer. Today, Backblaze is excited to make the work of live media production (and the end results) a lot easier with our latest innovation.

Announcing Backblaze B2 Live Read

Backblaze B2 Live Read is a patent-pending service that gives media production teams working on live events the ability to access, edit, and transform media content while it is being uploaded into Backblaze B2 Cloud Storage. This means that teams can start working on content far faster than they could before, without having to drastically change their workflows and tools, massively speeding up their time to engagement and revenue. 

This is a game changer for live media teams, who are passionate about bringing content to their audience as soon as possible. It means they don’t need to worry as screen resolutions continue to expand from 4K to 8K and beyond—larger files no longer mean waiting for a full upload before work can begin. It also reduces the need for having production teams on-site to minimize latency, which could be extremely costly depending on the venue.

Previously, producers had to wait hours or days before they could access uploaded data, or they had to rely on cost-prohibitive and complicated options that often required on-premises storage. That’s no longer necessary. This innovation will make it faster and less expensive to:

  • Create near real-time highlight clips for news segments, in-app replays, and much more.
  • Tap into talent where they are versus trying to find local talent to produce events.
  • Promote content for on-demand sales within minutes of presentations at live events.
  • Distribute teasers for buzz on social media before talent has even left the venue.

“For our customers, turnaround time is essential, and Live Read promises to speed up workflows and operations for producers across the industry. We’re incredibly excited to offer this innovative feature to boost performance and accelerate our customers’ business engagements.”

Richard Andes, VP, Product Management, Telestream

Coming soon inside your favorite tools

We designed Live Read to be easily accessible directly via the Backblaze S3 Compatible API and/or seamlessly within the user interface of launch partners including Telestream, Glookast, and Mimir. These partners, along with CineDeck, Alteon, Hedge, Hiscale, MoovIT, and many others to come, will be enabling Live Read within their platforms soon.

If you want to use Live Read, you can join our private preview.  

How does it work?

Previously, media teams were forced to either wait for uploads to complete or use on-premises storage. Now, Live Read uniquely supports accessing parts of each growing file or growing object as it is uploaded so there’s no need to wait for the full file upload to complete. And, when the full upload is complete, it’s accessible like any other file in a Backblaze B2 Cloud Storage Bucket, with no middleware or proprietary software needed. 

Here’s a short video showing both how Live Read works on a conceptual level, as well as a live demo showing how one app can upload video data to Backblaze B2 using Live Read while a second app reads the uploaded video data:

For those of you who want to dig deeper into the code samples you saw in the video, here is some example code that uses the Amazon SDK for Python, Boto3, to start uploading data with Live Read. If you’re familiar with Amazon S3, you’ll recognize that this is a standard multipart upload apart from the add_custom_header handler function and the call to register it with Boto3’s event system:

def add_custom_header(params, **_kwargs):
    """
    Add the Live Read custom headers to the outgoing request.
    See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
    """
    params['headers']['x-backblaze-live-read-enabled'] = 'true'

client = boto3.client('s3')
client.meta.events.register('before-call.s3.CreateMultipartUpload', add_custom_header)

response = client.create_multipart_upload(Bucket='my-video-files', Key='liveread.mp4')

upload_id = response['UploadId']

# Now upload data as usual with repeated calls to client.upload_part()

As it processes the call to create_multipart_upload(), Boto3 calls the add_custom_header() handler function, which adds a custom HTTP header, x-backblaze-live-read-enabled, with the value true, to the S3 API request. The custom HTTP header signals to Backblaze B2 that this is a Live Read upload. As with standard multipart uploads, the data is uploaded in parts between 5MB and 5GB in size. To facilitate reading data efficiently, all parts except the last one must have the same size.
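For completeness, here’s a minimal sketch of what the part-upload loop and the final completion call could look like, continuing with the client and upload_id from the example above (my own illustration, not code from the demo; reading a local file stands in for a live source, and the part size is arbitrary):

parts = []
part_number = 1
PART_SIZE = 100 * 1024 * 1024  # illustrative; all parts except the last must be the same size

with open('liveread.mp4', 'rb') as f:
    while True:
        data = f.read(PART_SIZE)
        if not data:
            break
        part = client.upload_part(
            Bucket='my-video-files',
            Key='liveread.mp4',
            UploadId=upload_id,
            PartNumber=part_number,
            Body=data,
        )
        parts.append({'ETag': part['ETag'], 'PartNumber': part_number})
        part_number += 1

# When the live source finishes, close out the upload so the object behaves
# like any other file in the bucket.
client.complete_multipart_upload(
    Bucket='my-video-files',
    Key='liveread.mp4',
    UploadId=upload_id,
    MultipartUpload={'Parts': parts},
)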

Since this is a Live Read upload, as soon as a part is uploaded, it is accessible for downloading.

An app that downloads the file needs to send the same custom HTTP header when it retrieves data. For example:

def add_custom_header(params, **_kwargs):
    """
    Add the Live Read custom headers to the outgoing request.
    See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
    """
    params['headers']['x-backblaze-live-read-enabled'] = 'true'

client = boto3.client('s3')
client.meta.events.register('before-call.s3.GetObject', add_custom_header)

# Read the first 1 KiB of the file
response = client.get_object(
    Bucket='my-video-files',
    Key='liveread.mp4',
    Range='bytes=0-1023'
)

Note that you must supply either Range or PartNumber to specify a portion of the file when you download data using Live Read. If you request a range or part that does not exist, then Backblaze B2 responds with a 416 Range Not Satisfiable error, just as you might expect. On receiving this error, an app reading the file might repeatedly retry the request, waiting for a short interval after each unsuccessful request.
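As a rough sketch of that polling pattern, continuing with the client from the download example above (which already registers the Live Read header), the reader below requests parts in order and waits whenever a part isn’t available yet; the wait interval, part handling, and stop condition are all placeholders:

import time
from botocore.exceptions import ClientError

part_number = 1
while True:
    try:
        response = client.get_object(
            Bucket='my-video-files',
            Key='liveread.mp4',
            PartNumber=part_number,
        )
        handle_part(response['Body'].read())  # hypothetical function that processes the part
        part_number += 1
    except ClientError as e:
        if e.response['ResponseMetadata']['HTTPStatusCode'] == 416:
            time.sleep(5)  # part not uploaded yet; wait and retry
        else:
            raise

A real application would also need a way to know when the upload has completed so it can stop polling instead of retrying forever.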

The source code for the applications is available as open source at https://github.com/backblaze-b2-samples/live-read-demo/.

How much does it cost?

Live Read upload capacity is offered in $15/TB increments—and the capacity is only consumed when an upload is marked for Live Read. Standard uploads are free, as usual. After uploading is complete, the data stored in Backblaze B2 is billed as normal. From a cost perspective, this represents significant savings versus the workflows that production teams must currently follow to achieve anything close to the functionality delivered by Live Read.
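As a rough example of how that math could work out: marking 2TB of uploads for Live Read in a month would consume two $15 increments ($30), and once the uploads complete, that footage is simply billed at standard Backblaze B2 storage rates like any other data in the bucket.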

And it’s not just for live media

Beyond media, the Live Read API can support breakthroughs across development and IT workloads. For example, organizations maintaining large data logs or surveillance footage backups have often had to parse them into hundreds or thousands of small files each day in order to have quick access when needed—but with Live Read, they can now move to far more manageable single files per day or hour while preserving the ability to access parts immediately after they are written.

What’s next

For those interested in Live Read, you can sign up for the private preview here. We’ll continue to report as we add more integrations and we’ll share stories as customers succeed with the new feature. Until then, feel free to ask any question you have in the comments below. 

Want to see more?

Join Pat Patterson, Chief Technical Evangelist, and Elton Carneiro, Senior Director of Partnerships, on January 26, 2024 at 10:00 a.m. PT to learn more in real time. Can’t make it live? Sign up anyway and we’ll send a recording straight to your inbox.

Join the Webinar 

The post Backblaze Live Read: The Game Changer for Live Media Cloud Workflows appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Do I Need a Continuous Backup Solution?

Post Syndicated from Stephanie Doyle original https://backblaze.com/blog/do-i-need-a-continuous-backup-solution/

A decorative image showing a calendar displaying a cloud with arrows in a circular pattern, plus several devices around the calendar.

Most IT administrators and businesses know that you need to employ a 3-2-1 backup solution to meet minimum backup durability requirements, though many folks are of the opinion that that methodology is just table stakes these days. Once you get into the variety of ways you can ensure reliable, robust backups, you learn about strategies like 3-2-1-1-0, bare metal recovery, cyber resilience, and more. 

Which method your business ultimately subscribes to depends on your risk tolerance—but no business wants to experience the costs associated with extended downtime. In a world where the threat of ransomware is not an “if” but a “when,” and disaster recovery is front-of-mind for businesses of all sizes, getting granular with your backup and recovery options is key. That brings us to today’s topic: continuous backup solutions, aka continuous data protection (CDP).

What Is Continuous Data Protection (CDP)?

A continuous data protection solution is an automated data backup method that keeps track of changes to your files and backs them up constantly. This is different from traditional backups that copy your data at set intervals. While you can set the interval at which you want your data backed up (e.g., daily, weekly, monthly), you’d still be relying on a scheduled approach, and your data loss exposure would correlate with the interval you set. What if you had a day with a particularly high volume of new data? What if you’re dealing with sensitive customer information that your business needs to show an accurate audit trail for?

Quick Refresher: The 3-2-1 Backup Strategy

The 3-2-1 backup solution calls for three copies of your data, to be stored on two different media types, with one copy off-site (and preferably geographically separate from your primary data storage region).

Continuous backup solutions work by tracking every change to your data in real time, then they asynchronously create a second copy of those changes. Compare this with traditional backups, where data is written to a destination, then a copy (often, a snapshot) of the data is made to be distributed to other backup locations. With a continuous backup solution, you can reduce your recovery point objective (RPO)—that is, the point in time from which you can recover your data—to near zero. And, because your continuous backups typically have to be stored in more accessible storage tiers, you can reduce your recovery time objective (RTO) as well.

Here are some of the key qualities of a continuous backup solution: 

  • Automatic backups. With continuous backup, you don’t need to remember to manually back up your data. The system continuously monitors your files for changes and backs them up automatically.
  • Real-time or frequent backups. Continuous backup solutions can back up your data in real-time, or at least very frequently. This means you’ll always have a very recent copy of your data available in case of data loss.
  • Restore to any point in time. Because continuous backups keep track of all the changes made to your files, you can restore your data to any point in time, not just the last time a full backup was run.
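To make the “continuously monitors your files for changes and backs them up automatically” idea concrete, here is a toy sketch of the underlying pattern (not how any particular product is implemented; it assumes the third-party watchdog package and copies changed files to a second local folder rather than to the cloud):

import shutil
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCHED = Path("documents")    # hypothetical folder to protect
BACKUP = Path("backup_copy")   # hypothetical backup destination

class CopyOnChange(FileSystemEventHandler):
    """Copy a file to the backup folder every time it is created or modified."""

    def on_modified(self, event):
        if not event.is_directory:
            src = Path(event.src_path)
            dest = BACKUP / src.relative_to(WATCHED)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)

    on_created = on_modified  # treat new files the same way as changed files

observer = Observer()
observer.schedule(CopyOnChange(), str(WATCHED), recursive=True)
observer.start()  # watches in a background thread until observer.stop() is called

A real continuous backup product layers versioning, deduplication, encryption, and off-site transfer on top of this basic “watch and copy” loop.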

Benefits of Continuous Backup Solutions

There are several advantages to using a continuous backup solution:

  • Reduced risk of data loss. You’re less likely to lose data due to hardware failure, software corruption, ransomware attacks, or accidental deletion.
  • Faster recovery times. If you do lose data, you can restore it from a recent backup much faster than you could with a traditional backup system, especially those that rely on hardware (air gaps) or cold storage for archival storage. 
  • Improved business continuity. Continuous backup can help businesses keep running smoothly even if they experience a data loss event.
  • User-friendly for diverse levels of tech proficiency. Within a business, there are always going to be folks who are more or less proficient with technology. When you have an automatic backup utility that runs in the background—particularly one with advanced central administrative controls like Backblaze Business Backup with Enterprise Control—you aren’t relying on employees’ tech proficiency (or memory) to create backups.

Continuous Backup vs. Near Continuous Backup Solutions

True CDP solutions run at the level where you’re writing changes to your file—according to the patent, at the basic input/output system (BIOS) of the computer such that normal operations are unaffected. In a practical sense, that means you’re nearly always using virtual machines. 

There are several near-continuous backup solutions, however, that back up in intervals of one hour or less and leverage a high-availability system. For transparency’s sake, it’s worth noting that Backblaze Computer Backup is a near-continuous backup solution that uses native code rather than Java to reduce the impact on your computer’s operations—see our documentation for more.

Downsides of Continuous Backup Solutions

There are some challenges to consider when implementing a continuous backup solution:

  • Cost: Continuous backup solutions can be more expensive than traditional backup solutions. As time has gone on, there are more solutions available—which has led to more competitive pricing—but it is still a factor to consider. 
  • Storage requirements: Continuous backup can generate a lot of data. If you’re provisioning storage yourself, you’ll need to make sure you have enough to accommodate the backups. If you’re considering using a cloud backup utility, make sure you look into unlimited backups. Many common backup utilities create pricing tiers that take into account the volume of data a user generates and stores.
  • System resources and compatibility: Solutions can use up system resources, which could slow down your computer or server. For example, many utilities use Java instead of native code, which can noticeably slow down your devices. 

What’s the Diff: Continuous Backup vs. Synced Cloud Drive

Because a continuous backup solution updates backups in near real-time, it can be confused with cloud sync services. You may have heard us say it before—sync is not backup. The main difference between a continuous backup service and a synced cloud drive boils down to their purpose:

Continuous Backup Service: This prioritizes data protection and recovery. It constantly monitors your files for changes and backs them up frequently, often in real-time. You can restore your data to any specific point in time, making it ideal for disaster recovery or retrieving accidentally deleted files.

Synced Cloud Drive: This focuses on accessibility and collaboration. It keeps a mirrored copy of your designated folders across multiple devices. Any edits you make on one device are reflected in the cloud storage and on all other connected devices. This is great for working on the same files from different locations and keeping everyone, well, in sync.

While some cloud drives have introduced a backup or version history function, they are often limited in scope and subject to the shared responsibility model. Both undermine a true backup strategy, especially if your business has cyber insurance or compliance standards to meet.  

Here’s a table summarizing the key differences:

| Feature | Continuous Backup Service | Synced Cloud Drive |
|---|---|---|
| Main Purpose | Data protection and recovery | Accessibility and collaboration |
| Backup Frequency | Continuous or very frequent | User-defined, automatic at intervals, often limited |
| Version Control | Supports restoring to any point in time | May or may not have version history |
| Focus | Protecting against data loss | Sharing and working on files across devices |
| Ideal Use Case | Disaster recovery; accidental deletion recovery; ransomware protection | Remote work; team collaboration |

In short:

  • Use a continuous backup service if you’re worried about losing important data due to hardware failure, software corruption, or accidental deletion.
  • Use a synced cloud drive if you want to access and edit your files from anywhere and keep them updated across all your devices.

Some services might offer both features, but it’s important to understand which functionality takes priority. And, if in doubt—use both. 

Conclusion

Backups continue to be one of the cornerstones of any business’ ransomware protection strategy. As you’re considering what kind of backup you need, consider how much your business needs a granular, point-in-time recovery option to maintain business continuity. As always, you should balance functionality with costs, and the needs of your particular business. But, given the relative affordability of backup tools—and the amount they can save you in the event of a data disaster—solutions like continuous backup are worth considering for businesses of all sizes.

The post Do I Need a Continuous Backup Solution? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Video Surveillance Data Storage: Cloud vs. On-Prem vs. Hybrid

Post Syndicated from Tonya Comer original https://backblaze.com/blog/video-surveillance-data-storage-cloud-vs-on-prem-vs-hybrid/

A decorative image showing several video surveillance cameras connected to a cloud with the Backblaze logo on it.

Depending on your industry, you may need to install and run video surveillance. And once you have footage, you might be required to store it for a set period of days, months, or even years. This leads to the question: Where are you supposed to keep it all?

Not all storage systems are created equal, so it’s important to weigh the benefits and drawbacks of each option before making a decision. In some cases, government and industry regulations will require you to use a certain type of storage system. Ultimately, you will benefit from knowing how the system functions, what risks are involved, and how to select a technology provider.

This article will help you consider the pros and cons of on-premises, cloud, and hybrid storage systems. As you read, keep in mind that the amount of storage you need for your enterprise will depend on the number of cameras you have, the quality of the video footage, the length of time you are required to retain the footage, and various other factors. 

First Things First: Your Backup Strategy

No matter how or where you store your video surveillance footage, the most important thing you should do is establish a backup strategy that follows the 3-2-1 backup approach. That means you should have three copies of your data on two different media with one stored off-site. In this post, we’ll weigh the pros and cons of whether you keep that off-site copy stored at an off-site location like, say, an Iron Mountain storage facility, a remote office, or data center, or whether you keep that off-site copy in the cloud. 

You might think we’re biased as a cloud provider. Of course, we’d love it if you choose to keep your backups with Backblaze! But the main thing we want to emphasize is that you should have a backup plan for your video surveillance footage (or any data, really!) whether it includes Backblaze or not. And, because you have to store one of those copies off-site, it’s miles easier (pun intended) to store in the cloud than to physically drive or mail hard drives to a secondary location.  

What Is On-Premises Storage?

Storing video footage on-premises means your data is stored on physical media—that is, servers, network attached storage (NAS), storage area network (SAN), linear tape-open (LTO) tape, etc.—in a physical location on your premises. We’ll talk about two forms of on-premises storage as they pertain to video footage: NAS and SAN.

Are NAS Devices Good for Storing Video Footage?

NAS devices have a large data storage capacity that provides file-based data storage services to other devices on a network. Usually, they also have a client or web portal interface, as well as services like QNAP’s Hybrid Backup Sync or Synology’s Hyper Backup to help manage your files.

A photo of a Synology NAS.

One of the benefits of NAS is that it’s easy to set up and use, and you can upgrade internal drives over time. The main drawback when it comes to storing video surveillance footage is that its storage capacity is limited. Even if you buy a bigger device than you need right now, eventually you’ll run out of space and need to buy more, especially if you’re storing large amounts of video surveillance footage.

Is a SAN Good for Storing Video Footage?

On the other end of the spectrum, SANs are engineered for high-performance and mission-critical applications. They function by connecting multiple storage devices, such as disk arrays or tape libraries, to a dedicated network that is separate from the main local area network (LAN).

SANs offer high-speed data access, which is critical for handling large video streams from multiple cameras, and they allow for seamless scalability. As video surveillance systems grow, SANs can accommodate additional cameras and storage without disrupting ongoing operations. They also provide enhanced data security by isolating block-level storage traffic on its own dedicated network, which helps protect against failures and unauthorized access. However, managing SANs can be complex, necessitating skilled administrators familiar with SAN architecture. Additionally, implementing SANs incurs upfront expenses for hardware, software, and expertise, while their reliance on centralized controllers poses a risk of impacting multiple cameras in the case of a failure.

What Is Cloud Storage?

Cloud storage enables you to securely store data and files in an off-site location. You can access this data through the public internet.

When you transfer data off-site for storage, the cloud storage provider (CSP) hosts, secures, manages, and maintains the servers and associated infrastructure, ensuring that you have seamless access to your data whenever you need it.

What Are the Benefits of Cloud Storage for Video Surveillance Footage?

  1. Scalability: Cloud storage services allow you to dynamically adjust capacity as your video surveillance data volumes fluctuate. 
  2. Avoid capital expenses (CapEx): By leveraging cloud storage for video surveillance, your organization benefits from paying for storage technology and capacity as a service, rather than incurring the capital expenses associated with constructing and maintaining in-house storage networks. As data volumes grow over time, your costs may increase, but there’s no need to overprovision storage networks in anticipation of future data expansion. 
  3. Security: Cloud surveillance systems enhance data security; unique user accounts and data encryption ensure that only authorized personnel can access the footage. This controlled access minimizes the risk of unauthorized viewing or tampering.
  4. Accessibility: Cloud storage relies on an internet or network connection, so authorized users can access surveillance footage remotely from anywhere using smart devices or web browsers. Whether you’re at the office, traveling, or even at home, you can review camera feeds without being physically present on-site. Keep in mind that if the connection is lost or disrupted, access to video footage becomes challenging. This dependency can impact real-time monitoring and retrieval of critical data.

Just like our other storage strategies, there are drawbacks to cloud storage. For example, it relies on a stable internet connection. Video surveillance files are large, even when you apply compression techniques, which means that they take time and proper network connections to upload. So, if your internet connection goes down, it takes longer to get data properly stored or backed up than it would with other file types. That means you may not have real-time access to your data, or (in the worst cases) that you potentially risk file corruption if you don’t have a robust enough local storage infrastructure. 

Similarly, businesses should evaluate the privacy and data ownership concerns. Storing video footage in the cloud means entrusting sensitive data to a third-party service provider. Make sure that your CSP meets or exceeds all regulatory or compliance requirements, like SOC 2 or ISO 27001, before you store data on their platforms. 

All things considered, cloud storage offers scalability, ease of access, fine-tuned file control, and minimal maintenance, which are essential when dealing with the complexities of storing video surveillance footage.

Direct-to-Cloud Video Surveillance

Some companies choose to transfer video surveillance off-site to the cloud for backup purposes, while others push video footage directly to the cloud as a primary storage location, especially as there are several camera models and video surveillance solutions designed to easily push footage directly to cloud storage. When you’re choosing video surveillance hardware, it’s worth looking into whether it has this functionality, and if so, how much control you have over setting your storage destination to optimize costs.

And, if you’re using cloud storage as the primary storage for video footage, a multi-cloud setup can be used to ensure the primary copy in the cloud is backed up. A multi-cloud setup involves using multiple cloud service providers simultaneously—so, if your video surveillance platform stores footage in their own cloud, you can still set up a workflow that backs up to a different CSP. For backup and archive purposes, organizations can distribute their data across different clouds to enhance reliability, reduce risk, create geographic diversity in storage locations for disaster recovery purposes, and comply with data retention policies. This approach ensures data availability even if one cloud provider experiences issues.

What Is Hybrid Cloud Storage?

Hybrid cloud storage combines elements from both public clouds and private clouds (typically on-premises systems). It’s essentially a unified management approach where an integrated infrastructure enables seamless movement of workloads and data between the private and public clouds.

Using a hybrid cloud for video surveillance makes sense for lots of use cases, including backup and archive. Let’s talk about how. 

Backup: To deploy a hybrid approach for a video surveillance backup use case, you’d store all of your video surveillance footage in your on-premises systems, then store your backups in the cloud. Many NAS devices, for example, come with on-board backup utilities that allow you to store backups of your video surveillance footage directly in the cloud. You could also use third-party backup software to automatically back up your systems to the cloud. This hybrid approach gives you fast access to your footage via your on-premises storage, while protecting it with cloud backups.

Archive: To deploy a hybrid approach for a video surveillance archive use case, you’d store recent live recordings of your video surveillance footage on-premises. After a recurring cutoff date—whether in days or months—you then move old footage to a public cloud. This hybrid system allows you to access recent footage quickly while archiving older footage, particularly if you have retention requirements for compliance or cyber insurance purposes. If done right, this system can help your company comply with both short- and long-term industry requirements.
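If you want a feel for what that cutoff logic looks like in practice, here’s a minimal sketch in Python using boto3 against an S3-compatible endpoint such as Backblaze B2. The endpoint, bucket name, local path, environment variable names, and 90-day cutoff are all illustrative assumptions, not a recommendation for any particular NAS or video management system.

# Hypothetical archive job: move local footage older than a cutoff to an S3-compatible bucket.
import os
import time
import boto3

CUTOFF_DAYS = 90                                  # assumed "live footage" retention window
LOCAL_DIR = "/surveillance/recordings"            # hypothetical local recording path
BUCKET = "my-surveillance-archive"                # hypothetical bucket name

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",   # your bucket's region endpoint
    aws_access_key_id=os.environ["B2_KEY_ID"],
    aws_secret_access_key=os.environ["B2_APP_KEY"],
)

cutoff = time.time() - CUTOFF_DAYS * 86400
for root, _, files in os.walk(LOCAL_DIR):
    for name in files:
        path = os.path.join(root, name)
        if os.path.getmtime(path) < cutoff:                    # only files older than the cutoff
            key = os.path.relpath(path, LOCAL_DIR).replace(os.sep, "/")
            s3.upload_file(path, BUCKET, key)                  # archive to the cloud
            os.remove(path)                                    # or keep it, depending on your retention policy

In a real deployment you’d schedule a job like this (or use your NAS vendor’s built-in tiering tools) rather than running it by hand.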

For a more in-depth look at hybrid cloud storage, check out our blog post on hybrid cloud.

Is Hybrid Cloud Good for Video Surveillance Footage?

Leveraging hybrid cloud storage provides a dual advantage for video surveillance: swift local access to your video surveillance footage while simultaneously safeguarding it through off-site backups or off-loading it through a cloud archive. This strategic approach allows you to harness the strengths of both public and private clouds. Moreover, it offers enhanced scalability and flexibility compared to traditional on-premises solutions.

However, it’s essential to note that implementing a private cloud system can be cost-intensive. It necessitates budgeting for hardware acquisitions and replacements over time. Additionally, you’ll likely need to allocate resources for dedicated staff to maintain servers and backup strategies.

The Verdict: Which Type of Storage Is Best for Video Surveillance?

Choosing the right video surveillance storage solution is a critical decision for any organization. On-premises, cloud, and hybrid cloud each have their merits and drawbacks. While on-premises solutions offer large data storage capacity that is easy to set up and use, they require significant infrastructure investment. Cloud storage provides data accessibility and scales seamlessly while optimizing cost-effectiveness. Hybrid cloud provides both rapid local access to your video surveillance footage and secure off-site backups. 

Ultimately, the choice depends on your specific needs, budget, and long-term strategy. Consider the trade-offs carefully to ensure seamless and reliable video storage for your surveillance system.

The post Video Surveillance Data Storage: Cloud vs. On-Prem vs. Hybrid appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How Backblaze Scales Our Storage Cloud

Post Syndicated from Andy Klein original https://backblaze.com/blog/how-backblaze-scales-our-storage-cloud/

A decorative image showing a larger cube being compressed into a smaller cube.

Increasing storage density is a fancy way of saying we are replacing one drive with another drive of a larger capacity; for example, replacing a 4TB drive with a 16TB drive—same space, four times the storage. You’ve probably copied or cloned a drive or two over the years, so you understand the general process. Now imagine having 270,000 drives that over the next several years will need to be replaced, or migrated, as the process is often called. That’s a lot of work. And when you finish—well, actually, you’ll never finish, as the process is continuous for as long as you are in the cloud storage business. So, how does Backblaze manage this ABC (Always Be Copying) process? Let me introduce you to CVT Copy, or CVT for short.

CVT Copy is our in-house purpose-built application used to perform drive migrations at scale. CVT stands for Cluster, Vault, Tome, which is engineering vernacular mercifully shortened to CVT. 

Before we jump in, let’s take a minute to define a few terms in the context of how we organize storage.

  • Drive: The basic unit of storage ranging in our case from 4TB to 22TB in size.
  • Storage Server: A collection of drives in a single server. We have servers of 26, 45, and 60 drives. All drives in a storage server are the same logical size.
  • Backblaze Vault: A logical collection of 20 Storage Pods or servers. Each storage server in a Vault will have the same number of drives.
  • Tome: A tome is a logical collection of 20 drives, with each drive being in one of the 20 storage servers in a given Vault. If the storage servers in a Vault have 60 drives each, then there will be 60 unique tomes in that Vault.
  • Cluster: A logical collection of Vaults, grouped together to share other resources such as networking equipment and utility servers.

Based on this, a Vault consisting of twenty 60-drive storage servers will have 1,200 drives, a Vault with 45-drive storage servers will have 900 drives, and a Vault with 26-drive servers will have 520 drives. A cluster can have any combination of Vault sizes.

A Quick Review on How Backblaze Stores Data

Data is uploaded to one of the 20 drives within a tome. The data is then divided into parts, called data shards. At this point, we use our own Reed-Solomon erasure coding algorithm to compute the parity shards for that data. The number of data shards plus the number of parity shards will equal 20, i.e., the number of drives in a tome. The data and parity shards are written to their assigned drives, one shard per drive. The ratios of data shards to parity shards we currently use are 17/3, 16/4, and 15/5, depending primarily on the size of the drives being used to store the data—the larger the drive, the higher the parity.

Using parity allows us to restore (i.e. read) a file using less than 20 drives. For example, when a tome is 17/3 (data/parity), we only need data from any 17 of the 20 drives in that tome to restore a file. This dramatically increases the durability of the files stored.
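To make the shard math concrete, here’s a small sketch that works through the numbers from this post. It’s purely illustrative arithmetic, not Backblaze’s production code.

# Shard arithmetic for the data/parity schemes mentioned above.
SCHEMES = [(17, 3), (16, 4), (15, 5)]   # data shards / parity shards

for data_shards, parity_shards in SCHEMES:
    drives = data_shards + parity_shards            # always 20 drives per tome
    parity_fraction = parity_shards / drives        # share of raw capacity spent on parity
    # Any `data_shards` of the 20 drives are enough to rebuild a file,
    # so a tome can tolerate the loss of up to `parity_shards` drives.
    print(f"{data_shards}/{parity_shards}: {drives} drives per tome, "
          f"tolerates {parity_shards} drive failures, "
          f"{parity_fraction:.0%} of raw capacity is parity")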

CVT Overview

For CVT, the basic unit of migration is a tome, with all of the tomes in a source Vault being copied simultaneously to a new destination Vault which is typically new hardware. For each tome, the data, in the form of files, is copied file-by-file from the source tome to the destination tome.

The CVT Process

An overview of the CVT process is below, followed by an explanation of each task noted.

Selection

Selecting a Vault to migrate involves considering several factors. We start by reviewing current drive failure rates and predicted drive failure rates over time. We also calculate and consider overall Vault durability; that is, our ability to safeguard data from loss. In addition, we need to consider operational needs. For example, we still have Vaults using 45-drive Storage Pods. Upgrading these to 60-drive storage servers increases drive density in the same rack space. These factors taken together determine the next Vault to migrate.

Currently, we are migrating systems with 4TB drives, which means we are migrating up to 3.6 petabytes (PB) of data for a 900-drive Vault or 4.8PB of data for a 1,200-drive Vault. That said, there is no limitation on the size of the source system’s drives, so Vaults with 6TB, 8TB, and larger drives can be migrated using CVT with minimal setup and configuration changes.

Once we’ve identified a source Vault to migrate, we need to identify the target, or destination, system. Currently, we are using destination Vaults containing 16TB drives. There is no limitation on the size of the destination Vault’s drives, so long as they are at least as large as those in the source Vault; you can migrate the data from any sized source Vault to any sized destination Vault as long as there is adequate room on the destination.

Setup

Once the source Vault and destination Vault are selected, the various Technical Operations and Data Center folks get to work on setting things up. If we are not using an existing destination Vault, then a new destination Vault is provisioned. This brings up one of the features of CVT: The migration can be to a new, clean Vault or to an existing Vault; that is, one with data from a previous migration on it. In the latter case, the new data is just added and does not replace any of the existing data. The chart below shows examples of the different ways a destination Vault can be filled from one or more source Vaults.

In any of these scenarios, the free space can be used for another migration destination or for a Vault where new customer data can be written.

Kickoff

With the source and destination Vaults identified and set up, we are now ready to kick off the CVT process. The first step is to put the source Vault in a read-only state and to disable file deletions on both the source and destination Vaults. It is possible that some older source Vaults may have already been placed in a read-only state to reduce their workload. A Vault in a read-only state continues to perform other operations such as running shard integrity checks, reporting Drive Stats data, and so on.

CVT and Drive Stats

We record Drive Stats data from the drives in the source Vault until the migration is complete and verified. At that point we begin recording Drive Stats data from the drives in the destination Vault and stop recording Drive Stats data from the drives in the source Vault. The drives in the source Vault are not marked as failed.

Build the File List

This step and the next three steps (read files, write files, and validate) are done as a consecutive group of steps for each tome in a source Vault. For our purpose here, we’ll call this group of steps the tome migration process, although they really don’t have such a name in the internal documentation. The tome migration process is for a single tome, but, in general, all tomes in a Vault are migrated at the same time, although due to their unique contents, they most likely will complete at different times.

For each tome, the source file list is copied to a file transfer database and each entry is mapped to its new location in the destination tome. This process allows us to copy the data over the same upload path the customer used to initially upload it. This ensures that, from the customer’s point of view, nothing changes in how they work with their files even though we have migrated them from one Vault to another.

Read Files

For each tome, we use the file location database to read the files, one file at a time. We use the same code in this process that we use when a user requests their data from the Backblaze B2 Storage Cloud. As noted earlier, the data is sharded across multiple drives using the preset data/parity scheme, for example 17/3. That means, in this case, we only need data from 17 of the 20 drives to read the file.

When we read a file, one advantage we get by using our standard read process is a pristine copy of the file to migrate. While we regularly run shard integrity checks on the stored data to ensure a given shard of data is good, media degradation, cosmic rays and so on can affect data sitting on a hard drive. By using the standard read process, we get a completely clean version of each file to migrate.

Write Files

The restored file is sent to the destination Vault; there is no intermediate location where the file resides. The transfer is done over an encrypted network connection, typically within the same data center, preferably on the same network segment. If the transfer is done between data centers, it is done over an encrypted dark fiber connection.

The file is then written to the destination tome. The write process is the same one used by our customers when they upload a file and given that process has successfully written hundreds of billions of files we didn’t need to invent anything new.

At this point, you might be thinking that’s a lot of work to copy each file one by one. Why not copy and transfer block by block, for example? The answer lies in the flexibility we get by using the standard file-based read and write processes.

  • We can change the number of tomes. Let’s say we have 45 tomes in the source Vault and 60 tomes in the destination Vault. If we had copied blocks of data the destination Vault would have 15 empty tomes. This creates load balancing and other assorted performance problems when that destination Vault is opened up for new data writes at a later date. By using standard read and write calls for each file, all 60 of the destination Vault’s tomes fill up evenly, just like they do when we receive customer data.
  • We can change parity of the data. The source 4TB drive Vaults have a data/parity ratio of 17/3. By using our standard process to write the files, the data/parity ratio can be set to whatever ratio we want for the destination Vault. Currently, the data/parity ratio for the 16TB destination Vaults is set to 15/5. This ratio ensures that the durability of the destination Vault and therefore the recoverability of the files therein is maintained as a result of migrating the data to larger drives.
  • We can maximize parity economics. Increasing the number of parity drives in a tome from three to five decreases the number of data drives in that tome. That would seem to increase the cost of storage, but the opposite is true in this case. Here’s how (a short sketch after this list runs the same numbers):
    • Using 4TB drives for 16TB of data stored
      • Our average cost for a 4TB drive was $120 or $0.03 per GB.
      • Our cost of 16TB of storage, using 4TB drives, was $480 (4 x $120).
      • Using a 17/3 data/parity scheme means:
        • Data storage: We have 13.6TB of data storage at $0.03/GB ($30/TB) which costs us $408. 
        • Parity storage: We have 2.4TB of parity storage at $0.03/GB ($30/TB) which costs us $72.
    • Using 16TB drives for 16TB of data stored
      • Our average cost for a 16TB drive is $232 or $0.0145 per GB.
      • Our cost of 16TB of storage is $232.
      • Using a 15/5 data/parity scheme means:
        • Data storage: We have 12.0TB of data storage at $0.0145/GB ($14.5/TB) which costs us $174.
        • Data parity: We have 4.0TB of parity storage at $0.0145/GB ($14.5/TB) which costs us $58.
    • In summary, increasing the data/parity ratio to 15/5 for the 16TB drives is less expensive ($58) than the cost of parity when using our 4TB drives ($72) to provide the same 16TB of storage. The lower cost per TB of the 16TB drives allows us to increase the number of parity drives in a tome. Therefore, increasing the parity of the destination tome not only enhances data durability, it is also economically sound.
    • Obviously, a 16TB drive actually holds a bit less data than its stated capacity due to formatting and overhead, and four 4TB drives hold even less. In other words, even with formatting and so on, the math still works out in favor of using the 16TB drives.
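As mentioned above, here’s a quick sketch that reruns that cost arithmetic. The drive prices are the averages quoted in this post; plug in your own numbers and the comparison may look different.

# Reproduces the parity cost comparison above for 16TB of raw storage.
def parity_cost(drive_tb, drive_price, data_shards, parity_shards, raw_tb=16):
    cost_per_tb = drive_price / drive_tb
    parity_tb = raw_tb * parity_shards / (data_shards + parity_shards)
    data_tb = raw_tb - parity_tb
    return data_tb * cost_per_tb, parity_tb * cost_per_tb

data_4, parity_4 = parity_cost(4, 120, 17, 3)        # 4TB drives at ~$120, 17/3 scheme
data_16, parity_16 = parity_cost(16, 232, 15, 5)     # 16TB drives at ~$232, 15/5 scheme
print(f"4TB drives  (17/3): data ${data_4:.0f}, parity ${parity_4:.0f}")     # ~$408 / ~$72
print(f"16TB drives (15/5): data ${data_16:.0f}, parity ${parity_16:.0f}")   # ~$174 / ~$58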

Validate Tome

The last step in migrating a tome is to validate that the destination tome is the same as the source tome. This is done for each tome as it completes its copy process. If the source and destination tomes are not consistent, shard integrity check data can be reviewed to determine any errors, and the system can retransfer individual files, up to and including the entire tome.

Redirect Reads

Once all of the tomes within the Vault have completed their individual migrations and have passed their validation checks, we are ready to redirect customer reads (download requests) to the destination Vault. This process is completely invisible to the customer as they will use the same file handle as before. This redirection or swap process can be done tome by tome, but is usually done once the entire destination Vault is ready.

Monitor

At this point all download requests are handled by the destination Vault. We monitor the operational status of the Vault, as well as any failed download requests. We also review inputs from customer support and sales support to see if there are any customer related issues.

Once we are satisfied that the destination Vault is handling customer requests, we will logically decommission the source Vault. Basically, that means while the source Vault continues to run, it is no longer externally reachable. If a significant problem were to arise with the new destination Vault, we can swap in the source Vault. At this point, both Vaults are read-only, so the swap would be straightforward. We have not had to do this in our production environment. 

Full Capability

Once we are satisfied there are no issues with the destination Vault, we can proceed one or two ways.

  • Another migration: We can prepare for the migration of another source Vault to this destination Vault. If this is the case, we return to the Selection step of the CVT process with the Vault once again being assigned as a destination Vault. 
  • Allow new data: We allow the destination Vault to accept new data from customers. Typically, the contents of multiple source Vaults have been migrated to the destination Vault before this is done. Once new customer writes have been allowed on a destination Vault, we won’t use it as a destination Vault again.

Decommission

After three months, the source Vault is eligible to be physically decommissioned. That is, we turn it off, disconnect it from power and networking, and schedule it to be disassembled. This includes wiping the drives and recycling the remaining parts either internally or externally. In practice, we wait until we can decommission at least two Vaults at once, as it is more economical when dealing with our recycling partners.

Automation

You’re probably wondering how much of this process is automated or uses some type of orchestration to align and accomplish tasks. We currently have monitoring tools, dashboards, scripts, and such, but humans (real ones, not AI generated) are in control. That said, we are working on orchestration of the setup and provisioning processes, as well as upleveling the automation in the tome migration process. Over time, we expect the entire migration process to be automated, but only when we are sure it works—the “move fast and break things” approach is not appropriate when dealing with customer data.

Not for the Faint of Heart

The basic idea of copying the contents of a drive to another larger drive is straightforward and well understood. As you scale this process, complexity creeps in as you have to consider how the data is organized and stored while keeping it secure and available to the end user. 

If your organization manages your data in-house, the never-ending task of simultaneously migrating hundreds or perhaps thousands of drives falls to you or perhaps the contractor you hired to perform the task if you lack the experience or staffing. And this is just one of the tasks you are faced with in operating, maintaining, and upgrading your own storage infrastructure.

In addition to managing a storage infrastructure, there are the growing environmental concerns of data storage. The amount of data generated and stored each year continues to skyrocket, and tools such as CVT allow us to scale and optimize our resources in a cost-efficient, yet environmentally sensitive way.

To do this, we start with data durability. Using our Drive Stats data and other information, we optimize the length of time a Vault should be in operation before the drives need to be replaced; that is, before the drive failure rate impacts durability. We then consider data density: how much data we can pack into a given space. Migrating data from 4TB to 16TB drives, for example, not only increases data density, it uses less electricity per stored terabyte of data and reduces the waste we would have generated had we continued to buy and use 4TB drives instead of upgrading to 16TB drives.

In summary, CVT is more than just a fancy data migration tool: It is part of our overall infrastructure management program addressing scalability, durability, and environmental challenges faced by the ever-increasing amounts of data we are asked to store and protect each day.

Kudos

The CVT program is run by Bryan with the wonderfully descriptive title of Senior Manager, Operations Analysis and Capacity Management. He is assisted by folks from across the organization. They are, in no particular order, Bach, Lorelei (Lo), Madhu, Mitch, Ben, Rodney, Vicky, Ryan, David M., Sudhi, David W., Zoe, Mike, and unnamed others who pitch in as the process rolls along. Each person brings their own expertise to the process which Bryan coordinates. To date, the CVT Team has migrated 24 Vaults containing over 60PB of data—and that’s just the beginning.

The post How Backblaze Scales Our Storage Cloud appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Plugs Into Internet2

Post Syndicated from Brent Nowak original https://backblaze.com/blog/backblaze-plugs-into-internet2/

A decorative images showing the Internet2 logo.

Who doesn’t love a sequel? From Star Wars to The Godfather, some of the best moments in storytelling have been part twos. (Let’s not talk about some of those part threes, though.) And, if you were to write a sequel to The Internet, you couldn’t look for a better second chapter than a mission to support the technical and networking needs of leading academic and research organizations.

Well, Internet2 is not actually a sequel, and it’s not a new version of the internet we all use every day. It’s an organization dedicated to delivering technical solutions and dedicated, high speed connectivity to institutions—ranging from the Smithsonian to Harvard and 330 other colleges, universities, regional research and education networks, nonprofit and government organizations, and more—who are working to solve today’s most pressing issues.

And today, Backblaze joined the Internet2 community to help further their mission. Here’s what that means:

  • First and foremost, the Backblaze Storage Cloud now connects to Internet2’s network as part of the Internet2 Peer Exchange (I2PX) program. This means that members of Internet2 can now move data into and out of Backblaze’s US-West region at incredibly high speeds.
  • Second, Backblaze also completed the Internet2 Cloud Scorecard to offer research and educational institutions relevant details about Backblaze’s security, compliance, and technology specifications, making it easier to assess and procure our solutions.

Hundreds of institutions in the higher education and research space already use Backblaze for storing and using their data and protecting their endpoints. However, many others require data transmission via Internet2 for new cloud solutions. For these folks, Backblaze’s participation in Internet2’s community and I2PX program provides secure data storage with less latency and a lower cost for their data needs.

What type of data are we talking about? Think genetic sequencing records, billions of vector data points to help model and forecast weather events, or images of particle collisions at the subatomic level! 

The Backblaze team is incredibly excited to take this step forward in serving the different use cases that Internet2 supports. And of course, in addition to being a part of the Internet2 community, we’re always excited to add more high-quality peering relationships to our wider network (and to share some stats about it, too).

How big is the Internet2 network? Take a look below.

Now, let’s dig into how Internet2 creates high speed data transfer pathways, and how it will impact traffic here at Backblaze.

Our Connection

The diagram below gives you an idea of what the data path looks like for someone on the left with direct connectivity to Internet2 or access via a regional provider reaching the Backblaze US-West region.

The entities on the left could exist locally in California or as far away as the U.S. East Coast. From any source location, the traffic will traverse the Internet2 network and then enter our network at our common peering point in San Jose, CA.

Turning Up The Peering Session

Below is a chart of ingress traffic that was once reaching us over the public internet and is now taking the preferred path over Internet2. As soon as we established peering we started to receive a few gigabits per second of traffic, with large spikes occurring overnight.

Whenever we add a new service or peer, the flow of information in our network changes. This latest addition creates more interesting traffic patterns for our Network Engineering team to profile, monitor, and capacity plan for.

An Example of How that Speed Is Used: Moving Scientific Data

If you’re a scientist in Texas and want to send your 50TB research set quickly and reliably to a partner in California, you might only have a commercial connection to the internet. This could be a 1Gbps or smaller connection, and even that could have monthly data transfer limits—oh no. Our 50TB example dataset would take over 4.6 days to transfer and would use 100% of the available bandwidth if we were limited to 1Gbps (assuming perfect conditions and no latency).

The Internet2 network is built with capacity in mind. With backbone links up to 400Gbps, our example dataset would transfer in 16.7 minutes. Now, there are other limitations that will impede you from being able to reach that rate (hard drive read speed, local Internet2 connection speed, and distance/latency factors), but this example gives you an idea of how much faster the Internet2 network can be over vanilla commercial connections that might be available to a local university, college, or other research institution.
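If you want to run that back-of-the-envelope math yourself, here’s a tiny sketch. It assumes the same ideal conditions as above: decimal terabytes, a fully saturated link, and no latency or protocol overhead.

# Ideal-case transfer times for the 50TB example above.
def transfer_hours(size_tb: float, rate_gbps: float) -> float:
    bits = size_tb * 8e12                  # decimal terabytes to bits
    return bits / (rate_gbps * 1e9) / 3600

print(f"50TB at 1Gbps:   {transfer_hours(50, 1):.0f} hours (~{transfer_hours(50, 1) / 24:.1f} days)")
print(f"50TB at 400Gbps: {transfer_hours(50, 400) * 60:.1f} minutes")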

Conclusion

We’re very excited to be joining the Internet2 community and network, supporting industry best practices and enabling better connectivity to our storage platform. Hopefully, the next scientific breakthrough is sitting encrypted on our hard drives, and we can be part of the many, many people, tools, and organizations who helped it on its way from research to reality.  

For more information about Backblaze and Internet2, you can read our press release or check out the Internet2 member directory.  

The post Backblaze Plugs Into Internet2 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

AI Video Understanding in Your Apps with Twelve Labs and Backblaze

Post Syndicated from Pat Patterson original https://backblaze.com/blog/ai-video-understanding-in-your-apps-with-twelve-labs-and-backblaze/

A decorative header depicting several screens with video editing tasks and a cloud with the Backblaze logo on it.

Over the past few years, since long before the recent large language model (LLM) revolution, we’ve benefited not only from the ability of AI models to transcribe audio to text, but also to automatically tag video files according to their content. Media asset management (MAM) software—such as Backlight iconik and Axle.ai (both Backblaze Partners, by the way)—allows media professionals to quickly locate footage by searching for combinations of tags. For example, searching for “red car” will return not only a list of video files containing red cars, but also the timecodes pinpointing the appearance of a red car in each clip.

San Francisco startup Twelve Labs has created a video understanding platform that allows any developer to build this kind of functionality, and more, into their app via a straightforward RESTful API. 

In preparation for our webinar with Twelve Labs last month, I created a web app to show how to integrate Twelve Labs with Backblaze B2 for storing video. The complete sample app is available as open source on GitHub; in this blog post, I’ll provide a brief description of the Twelve Labs platform, explain how presigned URLs allow temporary access to files in a private bucket, and then share the key elements of the sample app. If you just want a high-level understanding of the integration, read on, and feel free to skip the technical details!

The Twelve Labs Video Understanding Platform

The core of the Twelve Labs platform is a foundation model that operates across the visual, audio, and text modes of video content, allowing multimodal video understanding. When you submit a video using the Twelve Labs Task API, the platform generates a compact numerical representation of the video content, termed an embedding, that identifies entities, actions, patterns, movements, objects, scenes, other elements of the video, and their interrelationships. The embedding contains everything the Twelve Labs platform needs to do its work—after the initial scan, the platform no longer needs access to the original video content. As each video is scanned into the platform, its embedding is added to an index, so this scanning process is often referred to as indexing.

As part of the indexing process, the platform extracts a standard set of data from each video: a thumbnail image, a transcript of any spoken content, any text that appears on screen, and a list of brand logos, all annotated with timecodes locating them on the video’s timeline, and all accessible via the Twelve Labs Index API.

You can have the platform create a title and summary, and even prompt the model to describe the video, via Twelve Labs’ Generate API. For example, I indexed an eight-minute video that explains how to back up a Synology NAS to Backblaze B2, then prompted the Generate API, “What are the two Synology applications mentioned in the video?” This was the first sentence of the resulting text:

The two Synology applications mentioned throughout the video are “Synology Hyper Backup” and “Synology Cloud Sync.”

The remainder of the response is a brief summary of the two applications and how they differ; here’s the full text. Although it does have that “AI flavor” as you read it, it’s clear and accurate. I must admit, I was quite impressed!

You can define a taxonomy for your videos via the Classify API. Submit a one- or two-level classification schema and a set of video IDs, and the platform will assign each video to a category.

Rounding up this quick tour of the Twelve Labs platform, the Search API, as its name suggests, allows you to search the indexed videos. As well as a search query, you must specify a set of content sources: any combination of visual, conversation, text in video, or logos. Each search result includes timecodes for its start and end.

Now that you understand the basic capabilities of the Twelve Labs platform, let’s look at how you can integrate it with Backblaze B2.

Allowing Temporary Access to Files in a Private Backblaze B2 Bucket

A key feature of the sample app is that it uploads videos to a private Backblaze B2 Bucket, where they are only accessible to authorized users. Twelve Labs’ API allows you to submit a video for indexing by POSTing a JSON payload including the video’s URL to its Task API. This is straightforward for video files in a public bucket, but how do we allow the Twelve Labs platform to read files from a private bucket?

One way would be to create an application key with capabilities to read files from the private bucket and share it with the Twelve Labs platform. The main drawback to this approach is that the platform currently lacks the ability to sign requests for files from a private bucket.

Since Twelve Labs only needs to read the video file when we submit it for indexing, we can send it a presigned URL for the video file. As well as the usual Backblaze B2 endpoint, bucket name, and object key (path and filename), a presigned URL includes query parameters containing data such as the time when the URL was created, its validity period in seconds, an application key ID (or access key ID, in S3 terminology), and a signature created with the corresponding application key (secret access key). Here’s an example, with line breaks added for clarity:

https://s3.us-west-004.backblazeb2.com/mybucket/image.jpeg
?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=00415f935c00000000aa%2F20240423%2Fus-west-004%2Fs3%2Faws4_request
&X-Amz-Date=20240423T222652Z
&X-Amz-Expires=3600
&X-Amz-SignedHeaders=host
&X-Amz-Signature=23ade1...3ca1eb

This URL was created at 22:26:52 UTC on 04/23/2024, and was valid for one hour (3600 seconds). The signature is 64 hex characters. Changing any part of the URL, for example, the X-Amz-Date parameter, invalidates the signature, resulting in an HTTP 403 Forbidden error when you try to use it, with a corresponding message in the response payload:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Error>
    <Code>SignatureDoesNotMatch</Code>
    <Message>Signature validation failed</Message>
</Error>

Attempting to use the presigned URL after it expires yields HTTP 401 Unauthorized with a message such as:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Error>
    <Code>UnauthorizedAccess</Code>
    <Message>Request has expired given timestamp: '20240423T222652Z' and expiration: 3600</Message>
</Error>

You can create presigned URLs with any of the AWS SDKs or the AWS CLI. For example, with the CLI:

% aws s3 presign s3://mybucket/image.jpeg --expires-in 600 
https://s3.us-west-004.backblazeb2.com/mybucket/image.jpeg?X-Amz...
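If you’re working in Python rather than the CLI, the equivalent boto3 call looks something like the sketch below; the endpoint, bucket name, key, and credentials are placeholders you’d replace with your own.

# Generate a presigned GET URL for an object in a private bucket via the S3 Compatible API.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",   # your bucket's region endpoint
    aws_access_key_id="<application key ID>",
    aws_secret_access_key="<application key>",
)

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "mybucket", "Key": "image.jpeg"},
    ExpiresIn=600,                                            # valid for 10 minutes
)
print(url)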

Presigned URLs are useful whenever you want to provide temporary access to a file in a private bucket without having to share an application key for a client app to sign the request itself. The sample app also uses them when rendering HTML web pages. For example, all of the thumbnail images are retrieved by the user’s browser via presigned URLs.

Note that presigned URLs are a feature of Backblaze B2’s S3 Compatible API. Creating a presigned URL is an offline operation and does not consume any API calls. We recommend you use presigned URLs rather than the b2_get_download_authorization B2 Native API operation, since the latter is a class C API call.

Inside the Backblaze B2 + Twelve Labs Media Asset Management Example

The sample app is written in Python, using JavaScript for its front end, the Django web framework for its backend, the Huey task queue for managing long-running tasks, and the Twelve Labs Python SDK to interact with the Twelve Labs platform. A simple web UI allows the user to upload videos to the private bucket, browse uploaded videos, submit them for indexing, view the resulting transcription, logos, etc., and search the indexed videos.

Most of the application code is concerned with rendering the web UI; very little code is required to interact with Twelve Labs.

Configuration

The Django settings.py file defines a constant for the Twelve Labs index ID and creates an SDK client object using the Twelve Labs API key. Note that the app reads the index ID and API key from environment variables, rather than including the values in the source code. Externalizing the index ID as an environment variable allows more flexibility in deployment, and, of course, you should never include secrets such as passwords or API keys in source code!

TWELVE_LABS_INDEX_ID = os.environ['TWELVE_LABS_INDEX_ID']
TWELVE_LABS_CLIENT = TwelveLabs(api_key=os.environ['TWELVE_LABS_API_KEY'])

Startup

When the web application starts, it validates the index ID and API key by retrieving details of the index. This is the relevant code, in apps.py:

index = TWELVE_LABS_CLIENT.index.retrieve(TWELVE_LABS_INDEX_ID)

If this API call fails, then the app prints a suitable diagnostic message identifying the issue.

Indexing

When a web application needs to perform an action that takes more than a few seconds to complete—for example, indexing a set of videos—it typically starts a background task to do the work, and returns an appropriate response to the user. The sample app follows this pattern: when the user selects one or more videos and hits the Index button, the web app starts a Huey task, do_video_indexing(), passing the IDs of the selected videos, and returns the IDs to the JavaScript front end. The front end can then show that the indexing tasks have started, and poll for their current status.

Here’s the code, in tasks.py, for submitting the videos for indexing.

# Create a task for each video we want to index
for video_task in video_tasks:
    task = TWELVE_LABS_CLIENT.task.create(
        TWELVE_LABS_INDEX_ID,
        url=default_storage.url(video_task['video']),
        disable_video_stream=True
    )
    print(f'Created task: {task}')
    video_task['task_id'] = task.id

Notice the call to default_storage.url(). This function, implemented by the django-storages library, takes as its argument the path to the video file, returning the presigned URL. The default expiry period is one hour.
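For context, the django-storages configuration behind that call can look something like the following in settings.py. This is a sketch using the library’s S3 backend settings with placeholder values; it isn’t necessarily the sample app’s exact configuration.

# Possible django-storages settings for presigned URLs against Backblaze B2.
import os

DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_S3_ENDPOINT_URL = "https://s3.us-west-004.backblazeb2.com"   # your bucket's region endpoint
AWS_ACCESS_KEY_ID = os.environ["B2_APPLICATION_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["B2_APPLICATION_KEY"]
AWS_STORAGE_BUCKET_NAME = "my-private-video-bucket"              # placeholder bucket name
AWS_QUERYSTRING_AUTH = True        # default_storage.url() returns presigned URLs
AWS_QUERYSTRING_EXPIRE = 3600      # URLs are valid for one hour, the library default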

Once the videos have been submitted, do_video_indexing() polls for the status of each indexing task until all are complete. Most of the code is concerned with minimizing the number of calls to the API, and saving status to the app’s database; getting the status of a task is simple:

task = TWELVE_LABS_CLIENT.task.retrieve(video_task['task_id'])

The task object’s status attribute is a string with a value such as validating, indexing, or ready. When the task reaches the ready status, the task object also includes a video_id attribute, uniquely identifying the video within the Twelve Labs platform. At this point, do_video_indexing() calls a helper function that retrieves the thumbnail, transcript, text, and logos and stores them in Backblaze B2.

Retrieving Video Data

Here’s the call to retrieve the thumbnail:

thumbnail_url = TWELVE_LABS_CLIENT.index.video.thumbnail(TWELVE_LABS_INDEX_ID, video.video_id)

The helper function creates a path for the thumbnail file from the video ID and the file extension in the returned URL, and saves the thumbnail to Backblaze B2:

default_storage.save(thumbnail_path, urlopen(thumbnail_url))

Again, django-storages is doing the heavy lifting. We use urlopen(), from the urllib.request module, to open the thumbnail URL, providing default_storage.save() with a file-like object from which it can read the thumbnail data.

The calls to retrieve transcript, text, and logo data have a slightly different form, for example:

video_data = TWELVE_LABS_CLIENT.index.video.transcription(TWELVE_LABS_INDEX_ID, video.video_id)

Each call returns a list of VideoValue objects, each VideoValue object comprising a start and end timecode (in seconds) and a value specific to the type of data; for example, a fragment of the transcription. We serialize each list to JSON and save it as a file in Backblaze B2.

When the user navigates to the detail page for a video, JavaScript reads each dataset from Backblaze B2 and renders it into the page, allowing the user to easily navigate to any of the data items.

Searching the Index

When the user enters a query and hits the search button, the backend calls the Twelve Labs Search API, passing the query text, and requesting results for all four sources of information. We set group_by to video since we want to show the results by video, and set the confidence threshold to medium to improve the relevance of the results. From VideoSearchView in views.py:

results = TWELVE_LABS_CLIENT.search.query(
    TWELVE_LABS_INDEX_ID,
    query,
    ["visual", "conversation", "text_in_video", "logo"],
    group_by="video",
    threshold="medium"
)

By default, the query() call returns a page of 10 results in result.data, so we loop through the pages using next(result) to fetch pages of search results as necessary. Each individual search result includes start and end timecodes, confidence, and the type of match (visual, conversation, text, or logo).

In the web UI, the user can click through to the results for a given video, then click an individual search result to view the matching video clip.

Getting Started with Backblaze B2 and Twelve Labs

Backblaze B2 Cloud Storage is a great choice for storing video to index with Twelve Labs; free egress each month for up to three times the amount of data you’re storing means that you can submit your entire video library to the Twelve Labs platform without worrying about data transfer charges, and unlimited free egress to our CDN partners reduces the costs of distributing video content to end users.

Click here to create a Backblaze B2 account, if you don’t already have one. Your first 10GB of storage is free, no credit card required. If you’re an enterprise that wants to run a larger proof of concept, you can always reach out to our Sales Team. You don’t need to write any code to upload video files or create presigned URLs, and you can use the Backblaze web UI to upload files up to 500MB, or any of a wide variety of tools to upload files up to 10TB, including the AWS CLI, rclone and Cyberduck. Select S3 as the protocol to be able to create presigned URLs.

Similarly, click here to sign up for Twelve Labs’ Free plan. With it, you can index up to 600 minutes of video, again, no credit card required. Python and Node.js developers can use one of the Twelve Labs SDKs, while the Twelve Labs API documentation includes code examples for a wide range of other programming languages.

The post AI Video Understanding in Your Apps with Twelve Labs and Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

In Honor of May the Fourth, Let’s Talk About the Internet in Space

Post Syndicated from Stephanie Doyle original https://backblaze.com/blog/in-honor-of-may-the-fourth-lets-talk-about-the-internet-in-space/

A decorative image showing a satellite and the Backblaze logo on a cloud in space.

It is time, once again, to celebrate the things that bring us together as tech and sci-fi lovers of the world. Today, to mark the upcoming high holiday, May the Fourth, we’re bridging our current reality to that time long ago in a galaxy far, far away by discussing the important issues: How in the world are we expected to survive in space without good internet? 

Maybe it’s just me, but it seems absurd that the Death Star blueprints had to be literally carried off a spaceship on what’s essentially an external hard drive when the Jedi Council (RIP) could make perfect holographic representations of themselves from across the galaxy. Sure, you can argue that making an off-site copy and sneaking it out was the most covert way to go about it, but didn’t some of those characters in Rogue One die next to a giant antenna? One powerful enough that it controlled traffic into and out of the planet? Why did they have to transmit the plans to the closest battleship when, in theory, they could have sent them anywhere? 

Never fear folks, we are here with what we think, based on a fair amount of research and our own humble opinions, are the answers. The truth is that current and future space internet still requires a good bit of hardware and networking. Let’s talk about where we’re at today, where we could be in the near future, and why the Rebel Scum may have, in fact, needed to run faster than Darth Vader, sacrificing all those lives, to get the Death Star schematics out of the sector.

How Do We Currently Move Data Through Space?

The internet, as we know and love it, is largely a function of hardware. To simplify things to their most basic definition, the internet is a network of all the networks on the planet. Key word there, folks: planet. We use fiber optic cables to connect things on our terrestrial plane. What happens when we want to take things to space? 

We have a variety of telecommunications operations that allow us to move data through space, but they’re nowhere near as fast as our fiber-optic cables, especially with recent advancements in fiber transmission. To make our space communications that fast, we’d need analogous hardware and/or scientific advancements in some very cool research areas. 

For today’s conversation, here are the basics: when you transmit data (via any medium, not just through space), you convert it to a format computers can read, namely 0s and 1s. Typically, we represent those values by modulating different types of electromagnetic waves. Currently, the most prevalent form of data transmission in space is radio, and lasers are a developing, but usable, technology.

An image from the European Southern Observatory showing lasers guiding a high-powered telescope.
Frickin’ lasers. Source. 

Our Earth-based organizations move data through space, both near and far, using different networks of satellites and listening technology. Both use a satellite system called the Tracking and Data Relay Satellite (TDRS) system, which orbits Earth at a far enough range that relay points are nearly always visible to spacecraft like the International Space Station (ISS).

As you get further out into deep space, you can beam your signal directly to Earth—you just have a smaller window of time where orbits are aligned to make that possible. In that case, rovers stationed on other planets might co-opt other orbiters to relay signals back to Earth. The only problem there is that those orbiters typically have a scientific mission of their own, which means that the relay orbiter has to make a choice about what traffic is prioritized. These things also signal what space internet could be in the future: a network of relay satellites that transfer data planet to planet.  

And, while networking on Earth is designed for and assumes real-time responses, scientists are working on Delay-Tolerant Networking (DTN), which is designed to handle significant delays and optimize routing based on that information. It’s not yet mainstream, but DTN has been successfully demonstrated on several missions, including NASA’s Curiosity mission and the European Space Agency (ESA) Rosetta comet mission. 

Yeah, But What Does Star Wars Use?

We see a couple of types of communications networks in the Star Wars films, and more in the non-canonical expanded universe: 

  • Holonet: This is a galaxy-wide communication network mentioned in the films. It’s likely a complex system of satellites, relays, and subspace transceivers that facilitate rapid data transfer. This is similar to what we’re using and building today. 
  • Subspace: While primarily used for faster-than-light travel, subspace might also be used for transmitting information. Subspace is a fictional realm that allows hyperspace travel, and it’s possible that communication signals could piggyback on this network for faster travel times. 
  • Hyperspace Communication Droids: Legends lore (non-canon Star Wars material) mentions these specialized droids that could transmit messages via hyperspace, achieving near-instantaneous communication.

Since the last two depend on the fictional subspace zone, we’re really just considering the Holonet today. And, that works largely like our current technology, though they obviously have more satellites and relays to work with. That’s good news for our little thought experiment—we can look at file transmission times on our current Mars missions to get some analogous numbers.

Mars Transmission Times & File Sizes

Okay folks, now that the science is out of the way, let’s get down to brass tacks. Why was it possibly faster to move the Death Star plans via external storage than just transmitting them out once the planetary shields had been lifted? That answer depends on transmission times and file size. I’ll talk about transmission times first. 

The current technology we use to communicate with Mars has a few different transmission times we can work with: 

  • Radio, low-gain antenna: Up to a few kilobits per second (kbps)
  • Radio, high-gain antenna: Up to several megabits per second (Mbps)
  • Laser, standard communications systems: Up to 10 gigabits per second (Gbps)
  • Laser, advanced systems under development: Still in development, but promising tens of Gbps

For our purposes, let’s go ahead and choose two and use a 10GB file as an example. The basic transmission time formula is: 

Transmission time = file size / data rate

Assuming radio waves and a high-gain antenna:

Transmission time = (10GB * 8 bits) / (1Mbps) = 80,000 seconds, or about 22 hours

Assuming laser communications with a standard system:

Transmission time = (10GB * 8 bits) / (10Gbps) = 8 seconds
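Here’s the same arithmetic as a quick sketch, in case you want to swap in your own file sizes and data rates. It ignores latency and overhead, just like the figures above.

# The transmission time formula above, applied to the 10GB example.
def transmission_seconds(size_gb: float, rate_bps: float) -> float:
    return size_gb * 8e9 / rate_bps        # gigabytes to bits, divided by the data rate

radio = transmission_seconds(10, 1e6)      # high-gain radio at ~1Mbps
laser = transmission_seconds(10, 10e9)     # standard laser link at ~10Gbps
print(f"Radio: {radio:,.0f} seconds (~{radio / 3600:.0f} hours)")   # ~80,000 seconds, about 22 hours
print(f"Laser: {laser:.0f} seconds")                                # ~8 seconds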

So, How Big Were the Death Star Files?

We have two main canonical sources of truth we can use to infer the file size of the Death Star schematics: A New Hope and Rogue One: A Star Wars Story. (The plans were discussed in the Clone Wars, but not in detail.) Full disclosure: I used AI tools to assist with our file size estimations. 

A New Hope

In the OG, we get a glimpse of the plans the rebels have smuggled out as they prepare to attack the Death Star, and we can use these to make some assumptions about file size. Interestingly, these plans were actually created for the movie by a few scientists at NASA’s Jet Propulsion Labs (JPL), and they were originally credited in the film.

As easy as shooting womp rats.

Factors to consider about file size:

  • Visual Complexity: The schematics we see on the holographic projectors show detailed technical diagrams with various sections, labels, and annotations.
  • Color Depth: While the movie doesn’t definitively show color, for the sake of estimation, let’s assume the plans are grayscale (requiring 1 byte per pixel).
  • Resolution: Estimating the exact resolution from the movie is difficult. However, considering the detail visible on screen and the technology of the time (1977), a conservative guess might be a resolution similar to standard definition video (around 480p).

Calculating File Size—A Conservative Estimate

The formula for calculating file size per image is:

File size per image = Width x Height x Color Depth

Let’s assume the Death Star plans are displayed on a holographic projector with a resolution of 640 x 480 pixels (a common standard definition resolution). If they are grayscale images, they would require 1 byte per pixel for color depth, so:

640 pixels * 480 pixels * 1 byte/pixel = 307,200 bytes per image

However, the plans likely consist of multiple schematics and blueprints. In the movie, we see various sections and scrolling text, suggesting a considerable amount of information.

The formula for calculating total file size is:

Total file size = File size per image * Number of images

Let’s assume the Death Star plans consist of a total of 100 grayscale images (a very rough estimate), so:

Total file size = 307,200 bytes/image * 100 images = 30,720,000 bytes

1MB is equal to 1,048,576 bytes, so that’s 29.3MB (30,720,000 bytes / 1,048,576 bytes/MB).
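For the curious, here’s the whole estimate in a few lines of Python; every input is one of the rough guesses above.

# Rough file size estimate for the A New Hope plans, using the assumptions above.
width, height = 640, 480        # assumed standard-definition resolution
bytes_per_pixel = 1             # grayscale
num_images = 100                # very rough guess at the number of schematics

per_image = width * height * bytes_per_pixel          # 307,200 bytes
total_bytes = per_image * num_images                  # 30,720,000 bytes
print(f"{per_image:,} bytes per image")
print(f"{total_bytes:,} bytes total (~{total_bytes / 1_048_576:.1f}MB)")   # ~29.3MB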

Remember, this is a very rough estimate.

The actual file size could be much larger or smaller depending on factors like:

  • Compression: The Death Star technology might utilize advanced data compression techniques, significantly reducing the file size.
  • Vector Graphics: If the plans are stored as vector graphics (scalable images), the file size would be smaller compared to bitmaps (storing pixel information).
  • Additional Data: The data card might contain additional information beyond visual schematics, like text descriptions, material specifications, etc., which could increase the file size.

Taking everything into account, a reasonable guess for the Death Star plans file size in Star Wars: A New Hope could be in the ballpark of 20 to 50 megabytes. This is enough to hold a significant amount of technical data but still small enough to fit on a reasonably sized data card, at least by the standards of the era in which the movie was made (1977).

Rogue One

In Rogue One, we don’t actually see the plans in detail like we do in A New Hope, but we do have a short clip showing digital blueprints. Based on what we can glean from that and other newer, canonical sources, which employ 3D holograms, here’s a revised estimate for the Death Star schematics file size:

Factors to consider about file size:

  • Data Complexity: Rogue One reveals plans that include detailed schematics, technical readouts, and potentially 3D models. These elements significantly increase the file size compared to our previous estimate based on static images.
  • 3D Model Complexity: The size of 3D models depends on the level of detail. High-resolution models with intricate textures would require more data than simpler ones.
  • Data Hierarchy: The plans likely involve a layered structure, with overviews and deep dives into specific sections. This adds to the overall file size.
  • Compression: The presence of data compression is unknown. Compression algorithms can significantly reduce file size, but the effectiveness depends on the data type.
Gotta love a data center.

Estimated Range:

Given these factors, here’s a possible range for the Death Star schematics:

  • Low-End Estimate (100s of GB):
    • Moderately complex 3D models.
    • Some level of data compression.
    • Focus on essential schematics and technical data.
  • High-End Estimate (Low Single-Digit TB):
    • Highly detailed 3D models encompassing the entire Death Star.
    • Limited or no data compression.
    • Extensive data beyond core schematics, including maintenance procedures, weapon system details, etc.

Final Call?

Sure, we don’t know if data storage techniques are different in the Star Wars universe, and sure, the difference between technology in 1977 vs. 2016 gives sci-fi writers a lot more to work with, but considering the complexity of the Death Star and the variety of data hinted at in Rogue One, the schematics file size likely falls somewhere between hundreds of gigabytes and a low single-digit terabyte. Frankly, despite the New Hope plans being our original introduction to the universe, this range is more realistic for a project of such immense scale. 

Of course, with a file size in the 100s of GBs or low TBs, it makes a lot more sense why the Rebels didn’t attempt to transmit the files much, much further away. We know from the movie that the Death Star plans were on a relatively isolated planet in an Imperial-controlled quadrant, and who knows how large quadrants are. 

For the sake of argument, let’s say the Death Star schematics were 1TB and there’s a safe planet at the equivalent distance of Mars. Transmitting the files via radio with a high-gain antenna would take about 2330 hours, and transmitting via laser would take 217 hours. 
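
Here’s a quick sketch of the math behind those figures. The sustained data rates (roughly 1 Mbps for deep-space radio and about 10 Mbps for a laser link) are assumptions chosen to line up with the estimates above, and the calculation ignores latency, overhead, and retransmissions.

def transfer_hours(file_bytes: float, bits_per_second: float) -> float:
    """Hours to transmit a file at a given sustained rate (no overhead or retries)."""
    return file_bytes * 8 / bits_per_second / 3600

ONE_TB = 1e12  # decimal terabyte, in bytes

print(f"Radio (~1 Mbps):  {transfer_hours(ONE_TB, 1e6):,.0f} hours")  # roughly 2,200 hours
print(f"Laser (~10 Mbps): {transfer_hours(ONE_TB, 1e7):,.0f} hours")  # roughly 220 hours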

With that in mind, even though it’s pretty old school, it was probably faster to put the files on a drive on a spaceship, and then have that spaceship get those files where they needed to go (you know, not accounting for misadventures). 

Always Have a Backup: Is a Droid the Safest Way to Transmit Files?

The most confusing part of this whole discussion is why, once they were past the “Darth Vader is attempting to murder us” part, they didn’t make several copies of the data and distribute them to various, separate entities. The urgency of Luke’s mad rush to reach the Rebels is compelling and all, but it’s also an excellent argument for always keeping a geographically separated backup. R2-D2’s badassery notwithstanding, the fate of the galaxy should have some redundancy.

If It Works, It Works

Hey, in the end, we really can’t complain. The plans made it from Leia to the Rebel Alliance, Leia went on to be instrumental in the Rebel victories against not one but two Death Stars, and we all just had to endure the dark times of the prequels before we got the compelling story of Rogue One. Cheers, Star Wars fans, and May the Fourth be with you.

The post In Honor of May the Fourth, Let’s Talk About the Internet in Space appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q1 2024

Post Syndicated from Andy Klein original https://backblaze.com/blog/backblaze-drive-stats-for-q1-2024/

A decorative image displaying the title Q1 2024 Drive Stats.

As of the end of Q1 2024, Backblaze was monitoring 283,851 hard drives and SSDs in our cloud storage servers located in our data centers around the world. We removed from this analysis 4,279 boot drives, consisting of 3,307 SSDs and 972 hard drives. This leaves us with 279,572 hard drives under management to examine for this report. We’ll review their annualized failure rates (AFRs) as of Q1 2024, and we’ll dig into the average age of drive failure by model, drive size, and more. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard Drive Failure Rates for Q1 2024

We analyzed the drive stats data of 279,572 hard drives. In this group we identified 275 individual drives which exceeded their manufacturer’s temperature specification at some point in their operational life. As such, these drives were removed from our AFR calculations.

The remaining 279,297 drives were divided into two groups. The primary group consists of the drive models which had at least 100 drives in operation as of the end of the quarter and accumulated over 10,000 drive days during the same quarter. This group consists of 278,656 drives grouped into 29 drive models. The secondary group contains the remaining 641 drives which did not meet the criteria noted. We will review the secondary group later in this post, but for the moment let’s focus on the primary group.

For Q1 2024, we analyzed 278,656 hard drives grouped into 29 drive models. The table below lists the AFRs of these drive models, grouped by drive size and sorted by AFR within each group.

Notes and Observations on the Q1 2024 Drive Stats

  • Downward AFR: The AFR for Q1 2024 was 1.41%. That’s down from Q4 2023 at 1.53%, and also down from one year ago (Q1 2023) at 1.54%. The continuing process of replacing older 4TB drives is a primary driver of this decrease as the Q1 2024 AFR (1.36%) for the 4TB drive cohort is down from a high of 2.33% in Q2 2023.
  • A Few Good Zeroes: In Q1 2024, three drive models had zero failures:
    • 16TB Seagate (model: ST16000NM002J)
      • Q1 2024 drive days: 42,133
      • Lifetime drive days: 216,019
      • Lifetime AFR: 0.68%
      • Lifetime confidence interval: 1.4%
    • 8TB Seagate (model: ST8000NM000A)
      • Q1 2024 drive days: 19,684
      • Lifetime drive days: 106,759
      • Lifetime AFR: 0.00%
      • Lifetime confidence interval: 1.9%
    • 6TB Seagate (model: ST6000DX000)
      • Q1 2024 drive days: 80,262 
      • Lifetime drive days: 4,268,373
      • Lifetime AFR: 0.86%
      • Lifetime confidence interval: 0.3%

All three drive models have a lifetime AFR of less than 1%, but in the case of the 8TB and 16TB drive models, the confidence interval (95%) is still too high. While it is possible the two drive models will continue to perform well, we’d like to see the confidence interval below 1%, and preferably below 0.5%, before we can trust the lifetime AFR.
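
For readers who want to reproduce these numbers from the open dataset, the AFR formula Backblaze uses is drive failures divided by drive days, annualized: (failures / drive days) × 365 × 100. The confidence interval in the sketch below is a textbook exact Poisson (Garwood) interval; treat it as a reasonable approximation rather than the exact method behind the figures in our tables.

from scipy.stats import chi2

def afr(failures: int, drive_days: float) -> float:
    """Annualized failure rate, in percent: failures per drive-year."""
    return failures / drive_days * 365 * 100

def afr_confidence_interval(failures: int, drive_days: float, conf: float = 0.95):
    """Exact Poisson (Garwood) confidence interval for the AFR, in percent."""
    alpha = 1 - conf
    lower = 0.0 if failures == 0 else chi2.ppf(alpha / 2, 2 * failures) / 2
    upper = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    return lower / drive_days * 365 * 100, upper / drive_days * 365 * 100

# Zero failures over 42,133 drive days (the 16TB Seagate's Q1 2024 exposure):
print(afr(0, 42_133))                      # 0.0
print(afr_confidence_interval(0, 42_133))  # (0.0, ~3.2) over a single quarter's exposure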

With a confidence interval of 0.3%, the 6TB Seagate drives delivered another quarter of zero failures. At an average age of nine years, these drives continue to defy their age. They were purchased and installed at the same time back in 2015, and they are members of the only 6TB Backblaze Vault still in operation.

  • The End of the Line: The 4TB Toshiba drives (model: MD04ABA400V) are not in the Q1 2024 Drive Stats tables. This was not an oversight. The last of these drives became a migration target early in Q1, and their data was securely transferred to pristine 16TB Toshiba drives. They rivaled the 6TB Seagate drives in age and AFR, but their number was up and it was time to go.

The Secondary Group

As noted previously, we divided the drive models into two groups, primary and secondary, with drive count (>100) and drive days (>10,000) being the metrics used to divide the groups. The secondary group has a total of 641 drives spread across 27 drive models. Below is a table of those drive models. 

The secondary group is mostly made up of drive models which are replacement drives or migration candidates. Regardless, the number of observations (drive days) over the observation period is too low to have any certainty about the calculated AFR.

From time to time, a secondary drive model will move into the primary group. For example, the 14TB Seagate (model: ST14000NM000J) will most likely have over 100 drives and 10,000 drive days in Q2. The reverse is also possible, especially as we continue to migrate our 4TB drive models.

Why Have a Secondary Group?

In practice we’ve always had two groups; we just didn’t name them. Previously, we would exclude from the quarterly, annual, and lifetime AFR charts any drive model that did not have at least 45 drives; later we raised that threshold to 60 drives. This worked, but we realized we also needed to set a minimum number of drive days over the analysis period to improve our confidence in the AFRs we calculated. To that end, we have set the following thresholds for drive models to be in the primary group.

Review Period | Drive Count per Model | Drive Days per Model
Quarterly | >100 drives | >10,000 drive days
Annual | >250 drives | >50,000 drive days
Lifetime | >500 drives | >100,000 drive days

We will evaluate these metrics as we go along and change them if needed. The goal is to continue to provide AFRs that we are confident are an accurate reflection of the drives in our environment.
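
As a rough illustration of how those thresholds split a quarter’s fleet into the two groups, here’s a minimal sketch; the per-model numbers are invented for the example, and this is not our actual reporting pipeline.

import pandas as pd

# Hypothetical per-model summary for one quarter (counts and drive days are made up).
models = pd.DataFrame({
    "model":       ["model_a", "model_b", "model_c"],
    "drive_count": [12_500,    180,       60],
    "drive_days":  [1_100_000, 16_200,    4_900],
})

QUARTERLY_MIN_DRIVES = 100
QUARTERLY_MIN_DRIVE_DAYS = 10_000

in_primary = (models["drive_count"] > QUARTERLY_MIN_DRIVES) & (
    models["drive_days"] > QUARTERLY_MIN_DRIVE_DAYS
)
primary_group = models[in_primary]      # models reported in the main AFR tables
secondary_group = models[~in_primary]   # reviewed separately; AFR not yet trusted
print(primary_group["model"].tolist(), secondary_group["model"].tolist())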

The Average Age of Drive Failure Redux

In the Q1 2023 Drive Stats report, we took a look at the average age at which a drive fails. This review was inspired by the folks at Secure Data Recovery, who calculated that, based on their analysis of 2,007 failed drives, the average age at which they failed was 1,051 days, or roughly 2 years and 10 months. 

We applied the same approach to our 17,155 failed drives and were surprised when our average age of failure was only 2 years and 6 months. Then we realized that many of the drive models that were still in use were older (much older) than the average, and surely when some number of them failed, it would affect the average age of failure for a given drive model. 

To account for this realization, we considered only those drive models that are no longer active in our production environment. We call this collection retired drive models as these are drives that can no longer age or fail. When we reviewed the average age of this retired group of drives, the average age of failure was 2 years and 7 months. Unexpected, yes, but we decided we needed more data before reaching any conclusions.

So, here we are a year later to see if the average age of drive failure we computed in Q1 2023 has changed. Let’s dig in. 

As before, we recorded the date, serial_number, model, drive_capacity, failure, and SMART 9 raw value for all of the failed drives we have in the Drive Stats dataset going back to April 2013. The SMART 9 raw value gives us the number of hours the drive was in operation. Then we removed boot drives and drives with incomplete data, that is, drives where some of the values were missing or wildly inaccurate. This left us with 17,155 failed drives as of Q1 2023.

Over this past year, Q2 2023 through Q1 2024, we recorded an additional 4,406 failed drives. Of those, 173 were either boot drives or had incomplete data, leaving us with 4,233 drives to add to the previous 17,155 failed drives, for a total of 21,388 failed drives to evaluate.
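
If you’d like to reproduce this from the public Drive Stats dataset, the general shape of the analysis looks something like the sketch below. Column names follow the published CSVs, but the file path, the boot-drive handling, and the sanity-check thresholds are placeholders you’d adjust for your own copy.

import pandas as pd

# One frame containing the daily Drive Stats records you want to analyze.
df = pd.read_csv("drive_stats.csv", usecols=[
    "date", "serial_number", "model", "capacity_bytes", "failure", "smart_9_raw",
])

# Keep only the rows that record a failure event.
failed = df[df["failure"] == 1].copy()

# Drop incomplete or wildly inaccurate records (thresholds here are illustrative).
failed = failed.dropna(subset=["smart_9_raw"])
failed = failed[(failed["smart_9_raw"] > 0) & (failed["smart_9_raw"] < 24 * 365 * 20)]

# SMART 9 raw is power-on hours; convert to years and average.
failed["age_years"] = failed["smart_9_raw"] / (24 * 365)
print(failed["age_years"].mean())                   # overall average age at failure
print(failed.groupby("model")["age_years"].mean())  # average age at failure by model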

When we compare Q1 2023 to Q1 2024 we get the table below.

The average age of failure for all of the Backblaze drive models (2 years and 10 months) matches the Secure Data Recovery baseline. The question is, does that validate their number? We say, not yet. Why? Two primary reasons. 

First, we only have two data points, so we don’t have much of a trend; that is, we don’t know if the alignment is real or just temporary. Second, the average age of failure of the active drive models (that is, those in production) is now already higher (2 years and 11 months) than the Secure Data baseline. If that trend were to continue, then when the active drive models retire, they will likely increase the average age of failure of the drive models that are not in production.

That said, we can compare the numbers by drive size and drive model from Q1 2023 to Q1 2024 to see if we can gain any additional insights. Let’s start with the average age by drive size in the table below.

The most salient observation is that for every drive size that had active drive models (green), the average age of failure increased from Q1 2023 to Q1 2024. Given that the overall average age of failure increased over the last year, it is reasonable to expect that some of the active drive size cohorts would increase. With that in mind, let’s take a look at the changes by drive model over the same period. 

Starting with the retired drive models, there were three drive models, totaling 196 drives, which moved from active to retired between Q1 2023 and Q1 2024. Still, the average age of failure for the retired drive cohort remained at 2 years and 7 months, so we’ll spare you a chart with 39 drive models where over 90% of the data didn’t change from Q1 2023 to Q1 2024.

On the other hand, the active drive models are a little more interesting, as we can see below.

In all but two drive models (highlighted), the average age of failure for each drive model went up. In other words, active drive models are, on average, older when they fail than they were one year ago. Remember, we are measuring the average age of the drive failures, not the average age of the drives. 

At this point, let’s review. The Secure Data Recovery folks checked 2,007 failed drives and determined their average age of failure was 2 years and 10 months. We are testing that assertion. At the moment, the average age of failure for the retired drive models (those no longer in operation in our environment) is 2 years and 7 months. This is still less than the Secure Data number. But, the drive models still in operation are now hitting an average of 2 years and 10 months, suggesting that once these drive models are removed from service, the average age of failure for the retired drive models will increase. 

Based on all of this, we think the average age of failure for our retired drive models will eventually exceed 2 years and 10 months. Further, we predict that the average age of failure will reach closer to 4 years for the retired drive models once our 4TB drive models are removed from service. 

Annualized Failure Rates for Manufacturers

As we noted at the beginning of this report, the quarterly AFR for Q1 2024 was 1.41%. Each of the four manufacturers we track contributed to the overall AFR as shown in the chart below.

As you can see, the overall AFR for all drives peaked in Q3 2023 and is dropping. This is mostly due to the retirement of older 4TB drives that are further along the bathtub curve of drive failure. Interestingly, all of the remaining 4TB drives in use today are either Seagate or HGST models. Therefore, we expect the quarterly AFR for those two manufacturers to continue to decrease as their 4TB drive models are replaced over the next year.

Lifetime Hard Drive Failure Rates

As of the end of Q1 2024, we were tracking 279,572 operational hard drives. As noted earlier, we defined the minimum eligibility criteria for a drive model to be included in our analysis for the quarterly, annual, and lifetime reviews. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q1 2024 and to have accumulated over 100,000 drive days during its lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 277,910 drives grouped into 26 models remaining for analysis, as shown in the table below.

With three exceptions, the confidence interval for each drive model is 0.5% or less at 95% certainty. For the three exceptions, the 10TB Seagate, the 14TB Seagate, and the 14TB Toshiba models, the occurrence of drive failure from quarter to quarter was too variable over their lifetime. This volatility has a negative effect on the confidence interval.

The combination of a low lifetime AFR and a small confidence interval is helpful in identifying the drive models which work well in our environment. These days we are interested mostly in the larger drives as replacements, migration targets, or new installations. Using the table above, let’s see if we can identify our best 12, 14, and 16TB performers. We’ll skip reviewing the 22TB drives as we only have one model.

The drive models are grouped by drive size, then sorted by their Lifetime AFR. Let’s take a look at each of those groups.

  • 12TB drive models: The three 12TB HGST models are great performers, but they are hard to find new. Also, Western Digital, which purchased the HGST drive business a while back, has started using its own model numbers for these drives, so it can be confusing. If you do find an original HGST, make sure it is new; from our perspective, buying a refurbished drive is not the same as buying a new one.
  • 14TB drive models: The first three models look to be solid—the WDC (WUH721414ALE6L4), Toshiba (MG07ACA14TA), and Seagate (ST14000NM001G). The remaining two drive models have mediocre lifetime AFRs and undesirable confidence intervals. 
  • 16TB drive models: Lots of choice here, with all six drive models performing well to this point, although the WDC models are the best of the best to date.

The Hard Drive Stats Data

It has now been eleven years since we began recording, storing and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored. 

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.

You can download and use the Drive Stats dataset for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q1 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Amazon Titan Text V2 now available in Amazon Bedrock, optimized for improving RAG

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-titan-text-v2-now-available-in-amazon-bedrock-optimized-for-improving-rag/

The Amazon Titan family of models, available exclusively in Amazon Bedrock, is built on top of 25 years of Amazon expertise in artificial intelligence (AI) and machine learning (ML) advancements. Amazon Titan foundation models (FMs) offer a comprehensive suite of pre-trained image, multimodal, and text models accessible through a fully managed API. Trained on extensive datasets, Amazon Titan models are powerful and versatile, designed for a range of applications while adhering to responsible AI practices.

The latest addition to the Amazon Titan family is Amazon Titan Text Embeddings V2, the second-generation text embeddings model from Amazon now available within Amazon Bedrock. This new text embeddings model is optimized for Retrieval-Augmented Generation (RAG). It is pre-trained on 100+ languages and on code.

Amazon Titan Text Embeddings V2 now lets you choose the size of the output vector (either 256, 512, or 1024). Larger vector sizes create more detailed responses, but will also increase the computational time. Shorter vector lengths are less detailed but will improve the response time. Using smaller vectors helps to reduce your storage costs and the latency to search and retrieve document extracts from a vector database. We measured the accuracy of the vectors generated by Amazon Titan Text Embeddings V2 and we observed that vectors with 512 dimensions keep approximately 99 percent of the accuracy provided by vectors with 1024 dimensions. Vectors with 256 dimensions keep 97 percent of the accuracy. This means that you can save 75 percent in vector storage (from 1024 down to 256 dimensions) and keep approximately 97 percent of the accuracy provided by larger vectors.

Amazon Titan Text Embeddings V2 also proposes an improved unit vector normalization that helps improve the accuracy when measuring vector similarity. You can choose between normalized or unnormalized versions of the embeddings based on your use case (normalized is more accurate for RAG use cases). Normalization of a vector is the process of scaling it to have a unit length or magnitude of 1. It is useful to ensure that all vectors have the same scale and contribute equally during vector operations, preventing some vectors from dominating others due to their larger magnitudes.
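
As a quick illustration of what normalization means in practice, here’s a small NumPy sketch, separate from the Bedrock API itself: scaling a vector to unit length preserves its direction, and with unit vectors a plain dot product is the cosine similarity.

import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (magnitude of 1)."""
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0])   # magnitude 5
b = np.array([4.0, 3.0])   # magnitude 5, different direction

unit_a, unit_b = normalize(a), normalize(b)
print(np.linalg.norm(unit_a))          # 1.0 -- unit length
print(float(np.dot(unit_a, unit_b)))   # 0.96 -- cosine similarity between a and b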

This new text embeddings model is well-suited for a variety of use cases. It can help you perform semantic searches on documents, for example, to detect plagiarism. It can classify content into labels based on learned representations, for example, to categorize movies into genres. It can also improve the quality and relevance of retrieved or generated search results, for example, recommending content based on interest using RAG.

How embeddings help to improve accuracy of RAG
Imagine you’re a superpowered research assistant for a large language model (LLM). LLMs are like those brainiacs who can write different creative text formats, but their knowledge comes from the massive datasets they were trained on. This training data might be a bit outdated or lack specific details for your needs.

This is where RAG comes in. RAG acts like your assistant, fetching relevant information from a custom source, like a company knowledge base. When the LLM needs to answer a question, RAG provides the most up-to-date information to help it generate the best possible response.

To find the most up-to-date information, RAG uses embeddings. Imagine these embeddings (or vectors) as super-condensed summaries that capture the key idea of a piece of text. A high-quality embeddings model, such as Amazon Titan Text Embeddings V2, can create these summaries accurately, like a great assistant who can quickly grasp the important points of each document. This ensures RAG retrieves the most relevant information for the LLM, leading to more accurate and on-point answers.

Think of it like searching a library. Each page of the book is indexed and represented by a vector. With a bad search system, you might end up with a pile of books that aren’t quite what you need. But with a great search system that understands the content (like a high-quality embeddings model), you’ll get exactly what you’re looking for, making the LLM’s job of generating the answer much easier.

Amazon Titan Text Embeddings V2 overview
Amazon Titan Text Embeddings V2 is optimized for high accuracy and retrieval performance at smaller dimensions for reduced storage and latency. We measured that vectors with 512 dimensions maintain approximately 99 percent of the accuracy provided by vectors with 1024 dimensions. Those with 256 dimensions offer 97 percent of the accuracy.

Max tokens: 8,192
Languages: 100+ in pre-training
Fine-tuning supported: No
Normalization supported: Yes
Vector size: 256, 512, 1,024 (default)

How to use Amazon Titan Text Embeddings V2
It’s very likely you will interact with Amazon Titan Text Embeddings V2 indirectly through Knowledge Bases for Amazon Bedrock. Knowledge Bases takes care of the heavy lifting to create a RAG-based application. However, you can also use the Amazon Bedrock Runtime API to directly invoke the model from your code. Here is a simple example in the Swift programming language (just to show that you can use any programming language, not just Python):

import Foundation
import AWSBedrockRuntime 

let text = "This is the text to transform in a vector"

// create an API client
let client = try BedrockRuntimeClient(region: "us-east-1")

// create the request 
let request = InvokeModelInput(
   accept: "application/json",
   body: """
   {
      "inputText": "\(text)",
      "dimensions": 256,
      "normalize": true
   }
   """.data(using: .utf8), 
   contentType: "application/json",
   modelId: "amazon.titan-embed-text-v2:0")

// send the request 
let response = try await client.invokeModel(input: request)

// decode the response (use a new name to avoid shadowing the response constant above)
let output = String(data: response.body!, encoding: .utf8)

print(output ?? "")

The model takes three parameters in its payload:

  • inputText – The text to convert to embeddings.
  • normalize – A flag indicating whether or not to normalize the output embeddings. It defaults to true, which is optimal for RAG use cases.
  • dimensions – The number of dimensions the output embeddings should have. Three values are accepted: 256, 512, and 1024 (the default value).

I added the dependency on the AWS SDK for Swift in my Package.swift. I type swift run to build and run this code. It prints the following output (truncated to keep it brief):

{"embedding":[-0.26757812,0.15332031,-0.015991211...-0.8203125,0.94921875],
"inputTextTokenCount":9}

As usual, do not forget to enable access to the new model in the Amazon Bedrock console before using the API.

Amazon Titan Text Embeddings V2 will soon be the default embeddings model used by Knowledge Bases for Amazon Bedrock. Your existing knowledge bases created with the original Amazon Titan Text Embeddings model will continue to work without changes.

To learn more about the Amazon Titan family of models, view the following video:

The new Amazon Titan Text Embeddings V2 model is available today in Amazon Bedrock in the US East (N. Virginia) and US West (Oregon) AWS Regions. Check the full Region list for future updates.

To learn more, check out the Amazon Titan in Amazon Bedrock product page and pricing page. Also, do not miss this blog post to learn how to use Amazon Titan Text Embeddings models. You can also visit our community.aws site to find deep-dive technical content and to discover how our Builder communities are using Amazon Bedrock in their solutions.

Give Amazon Titan Text Embeddings V2 a try in the Amazon Bedrock console today, and send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

— seb

Backblaze and Parablu Team Up to Elevate Security For Microsoft 365 Users

Post Syndicated from Anna Hobbs-Maddox original https://backblaze.com/blog/backblaze-and-parablu-team-up-to-elevate-security-for-microsoft-365-users/

A decorative image showing the Backblaze and Parablu logos.

Microsoft 365 (M365) is used by more than one million companies worldwide. If you’re one of them, you know how important it is to your business. And, like anything that’s important to your business, it’s important to back it up. 

Today, backing up M365 to off-site storage just got easier and more affordable thanks to a new Backblaze partnership with Parablu. Now, you can back up your Microsoft 365 data to Backblaze, ensuring it’s protected both inside and outside of the Azure ecosystem and adding another layer of protection to your backup and recovery playbook.

What Parablu Does

Parablu specializes in data security and resiliency solutions tailored to digital enterprises. Their advanced solutions ensure comprehensive protection for enterprise data while offering complete visibility into all data movement through user-friendly, centrally managed dashboards. Their product BluVault for M365 elevates data security across Exchange, SharePoint, OneDrive, and Teams.

With Parablu, you can seamlessly control every aspect of your Microsoft 365 data. You gain immediate protection against threats with advanced anomaly detection and swift recovery mechanisms for ransomware attacks, streamline administration with intuitive and efficient controls, reduce network congestion, and ensure secure data transmission with robust encryption protocols.

Why Back Up Microsoft 365 to Backblaze?

By integrating Backblaze as a storage tier outside of Azure for tools like M365, OneDrive, or SharePoint, Parablu is providing its customers with cloud storage that’s easy to use, highly affordable at one-fifth the cost of legacy providers, secured with immutable backups, and high-performing with industry-leading small file uploads.

Key benefits for Backblaze + Parablu customers include:

  • Avoiding a Single Point of Failure: Many businesses that use M365 also back up their instance with the same service. However, backup best practices include keeping a backup copy of your data geographically and virtually separate from your production copy. While backing up your M365 data with Microsoft Azure is a great thing to do, it’s wise to keep a backup copy outside of that ecosystem as well. If Microsoft were to experience a failure, you’d still be able to recover your critical business data. 
  • Protecting Data With Immutability: When you protect your M365 data with immutability via Object Lock, you ensure no one can alter or delete that data until a given date. When you set the lock, you specify the length of time an object should be locked; any attempts to manipulate, copy, encrypt, change, or delete the file will fail during that time. (See the sketch after this list for what that looks like in practice.)
  • Faster Small File Uploads: Small file uploads are common for backup and archive workflows, especially when it comes to backing up the kind of data in M365—email, Word documents, simple Excel spreadsheets, etc. With Backblaze, users can expect to see significantly faster upload speeds for smaller files without any change to durability, availability, or pricing. The faster data upload bolsters security and enhances data protection by securing data with off-site backups faster, limiting the time that the data is vulnerable.
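
To make the immutability point concrete, here is a minimal sketch of setting a retention period on an uploaded object through Backblaze B2’s S3-compatible API. The bucket, key, endpoint, and 90-day retention window are placeholders, and the bucket must already have Object Lock enabled; in practice, a backup application handles this for you behind the scenes.

from datetime import datetime, timedelta, timezone

import boto3

# Placeholders: endpoint, bucket, key, and file name are examples only.
s3 = boto3.client("s3", endpoint_url="https://s3.us-west-004.backblazeb2.com")

with open("mailbox-export.bak", "rb") as backup_file:
    s3.put_object(
        Bucket="m365-backups",             # bucket created with Object Lock enabled
        Key="exchange/mailbox-export.bak",
        Body=backup_file,
        ObjectLockMode="COMPLIANCE",       # object cannot be altered or deleted...
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=90),
    )                                      # ...until the retain-until date passes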

Partnering with Backblaze offers our customers a secure, cost-efficient storage alternative. We’ve witnessed a growing demand for secure, fast, and affordable storage that complements public cloud storage and we look forward to continued innovation with Backblaze.

—Randy De Meno, Chief Strategy Officer/Chief Technology Officer, Parablu

How Backblaze Integrates With Parablu

The Backblaze + Parablu partnership integrates the M365 backup power of Parablu with affordable cloud storage from Backblaze, helping you protect your M365 environment with enhanced security, compliance, and performance. The joint solution is available for customers today.

Interested in getting started? Learn more in our docs or contact Sales.

The post Backblaze and Parablu Team Up to Elevate Security For Microsoft 365 Users appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Wins NAB Product of the Year

Post Syndicated from Jeremy Milk original https://backblaze.com/blog/backblaze-wins-nab-product-of-the-year/

A decorative image showing a film strip flowing into a cloud with the Backblaze and NAB Show logos displayed.

Last week, the Backblaze team connected with the best and brightest of the media and entertainment industry at the National Association of Broadcasters (NAB) conference in Las Vegas. For those who missed out, here is a recap of last week’s events:

Event Notifications Launches at NAB 2024

Event Notifications gives you the freedom to build automated workflows across the different best-of-breed cloud platforms you use or want to use, saving time and money and improving your end users’ experiences. After we demoed this new service for folks working across media workflows at the NAB Show and won NAB Product of the Year in the Cloud Computing and Storage category, the Event Notifications private preview list started filling up, but you can still join it today.

A photograph of the 2024 Product of the Year trophy. Backblaze won this award for their new feature, Event Notifications.

Backblaze B2 Cloud Storage Event Notifications Wins NAB Best of Show Product of the Year


Backblaze Achieves TPN Blue Shield Status

More news for media and entertainment leaders: Backblaze joined the Trusted Partner Network (TPN), receiving Blue Shield status. TPN is the Motion Picture Association’s initiative aimed at advancing trust and security best practices across the media and entertainment industry. This important initiative provides greater transparency and confidence in the procurement process as you advance your workflow with new technology. 

Axle AI Cloud is Powered by Backblaze

A testament to the power of partnerships across the industry, Axle AI announced Axle AI Cloud Powered by Backblaze, a new, highly affordable, AI-powered media storage service that leverages Backblaze’s new Powered by Backblaze program for seamlessly integrated AI tagging, transcription, and search.

Experts on Stage

For the first time, Backblaze hosted a series of events together with Partners and customers, delivering actionable insights to help folks on the show floor improve their own media workflows. Our Backblaze Tech Talks featured Quantum, CuttingRoom, iconik, ELEMENTS, Axle AI, CloudSoda, Hedge, bunny.net, Suite Studios, and Qencode. Huge thanks to the presenters and everyone who came by.

See You at the Next One

Thank you to all of our customers, partners, and friends who helped make last week a hit. We are already looking forward to NAB–NY, IBC, NAB 2025—and, most of all, to moving the industry forward together. 

The post Backblaze Wins NAB Product of the Year appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Moving 670 Network Connections

Post Syndicated from Jack Fults original https://backblazeprod.wpenginepowered.com/blog/moving-670-network-connections/

An illustration of server racks and networking cables.

Editor’s Note

We’re constantly upgrading our storage cloud, but we don’t always have ways to tangibly show what multi-exabyte infrastructure looks like. When data center manager, Jack Fults, shared photos from a recent network switch migration, though, it felt like exactly the kind of thing that makes The Cloud™ real in a physical, visual sense. We figured it was a good opportunity to dig into some of our more recent upgrades.

If your parents ever tried to enforce restrictions on internet time, and in response, you hardwired a secret 120ft Ethernet cable from the router in your basement through the rafters and up into your room so you could game whenever you wanted, this story is for you. 

Moving 670 network connections in a data center is kind of like that, times 1,000. And that’s exactly what we did in our Sacramento data center recently. 

Hi, I’m Jack

I’m a data center manager here at Backblaze, and I’m in charge of making sure our hardware can meet our production needs, interfacing with the data center ownership, and generally keeping the building running, all in service of delivering easy cloud storage and backup services to our customers. I lead an intrepid team of data center technicians who deserve a ton of kudos for making this project happen as well as our entire Cloud Operations team.

An image of a data center manager with decommissioned cables.
Here I am taking a swim in a bunch of decommissioned cable from an older migration of cat 5e out of our racks. Do not be alarmed by the Spaghetti Monster—these cables aren’t connected to anything, and they promptly made their way to a recycling facility.

Why Did We Need to Move 670 Network Connections?

We’re constantly looking for ways to make our infrastructure better, faster, and smarter, and in that effort, we wanted to upgrade to new network switches. The new switches would allow us to consolidate connections and mitigate any potential future failures. We have plenty of redundancy and protocols in place in the event that happens, but it was a risk we knew we’d be wise to get ahead of as we continued to grow our data under management.

An image of network cables in a data center rack.
Example of the old cabling connected to the Dell switches. Pretty much everything in this cabinet has been replaced, except for the aggregate switch providing uplinks to our access switches.

Switch Migration Challenges

In order to make the move, we faced a few challenges:

  • Minimizing network loss: How do we rip out all those switches without our Vaults being down for hours and hours?
  • Space for new cabling: In order to minimize network loss, we needed the new cabling in place and connected to the new switches before a cutover, but our original network cabinets were on the smaller side and full of existing cabling.
  • Space for new switches: We wanted to reuse the same rack units for the new Arista switches, so we had to figure out a method that allowed us to slide the old switches straight forward, out of the cabinet, and slide the new switches straight in.
  • Time: Every day we didn’t have the new switches in place was a day we risked a lock up that would take time away from our ability to roll out standard deployments and prepare for production demands.

Here’s How We Did It

Racking new switches in cabinets that are already fully populated isn’t ideal, but it is totally doable with a little planning (okay, a lot of planning). It’s a good thing I love nothing more than a good Google sheet, and believe me we tracked everything down to the length of the cables (3,272ft to be exact, but more on that later). Here’s a breakdown of our process:

  1. Put up a temporary, transfer switch in the cabinet and move the connections there. Ports didn’t matter, since it was just temporary, so that sped things up a bit.
  2. Decommission the old switch, pulling the power cabling and unbolting it from the rack.
  3. Ratchet our cables up using a makeshift pulley system in order to pull the switches straight out from the rack and set them aside.
An image of cables connected to network switches in a data center.
Carefully hoisting up the cabling with our makeshift Velcro pulley systems to allow the old switches to come out, and the new ones go in. Although this might look a little jury-rigged, it greatly helped us support the weight of the production management cables and hold them out of the way.
  4. Rack the new Arista switch and connect it to our aggregate switch which breaks out connections to all of the access switches.
  5. Configure the new switch – many thanks go to our Network Engineering team for their work on this part.
  6. Finally, move the connections from the temporary switch to the new Arista switch.
An image of network switches in a data center rack.
One of the first 2U switches to start receiving new cabling.

Each 1U Dell had 48 connections, which handled two Backblaze Vaults. We were able to upgrade to 2U switches with the new Aristas, which each had 96 connections, fitting four Backblaze Vaults plus 16 core servers. So, every time we moved to the next four vaults, we’d go through this process until we were through the network switch migration for 27 Vaults plus core servers, comprising the 670 network connections.

An image of a data center technician plugging network connections into servers.
Justin Whisenant, our senior DC tech, realizing that this is the last cable to cutover before all connections have been swapped.

Using the transfer switch allowed us to decommission the old switch, then rack and configure the new switch, so that we only lost a second or two of network connectivity as one of the DC techs moved the connection. That was one of the things we had to plan very deliberately: making sure the Vault would remain available, with the exception of one server that would be down for a split second during the swap. Then, our DC techs would confirm that connectivity was back up before moving on to the next server in the Vault.

Oh, And We Also Ran New Cables

We ran into a wrinkle early on in the project. We had two cabinets side by side where the switches are located, so sometimes we’d rack the temporary switch in one and the new Arista switch in the other. Some of the old cables weren’t long enough to reach the new switches. There’s not much else you can do at that point but run new cables, so we decided to replace all of the cables wholesale—3,272ft of new cable went in. 

We had to fine-tune our plans even more to balance decommissioning with racking the new switches in order to make room for the new cables, but it also ended up solving another issue we hadn’t even set out to address. It allowed us to eliminate a lot of slack from cables that were too long. Over time, with the amount of cables we had, the slack made it difficult to work in the racks, so we were happy to see that go away.

An image of cable dressing in a data center.
There’s still decom and cable dressing to do, but it looks so much better.

While we still have some cable management and decommissioning to be done, migrating to the Arista switches was the mission critical piece to mitigate our risk and plan for ongoing improvements. 

As a data center manager, I get to work on the side of tech that takes the abstract internet and makes it tangible, and that’s pretty cool. It can be hard for people to visualize The Cloud, but it’s made up of cables and racks and network switches just like these. Even though my mom loves to bring up that secret Ethernet cable story at family events, I think she’s pretty happy that it led that mischievous kid to a project like this.

One Project Among Many

While not every project has great pictures to go along with it, we’re always upgrading our systems for performance, security, and reliability. Some other projects we’ve completed in the last few months include reconfiguring much of our space to make it more efficient and ready for enterprise-level hardware, moving our physical media operations, and decommissioning 4TB Vaults as we migrate them to larger Vaults with larger drives. Stay tuned for a longer post about that from our very own Andy Klein.

The post Moving 670 Network Connections appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.