All posts by Pat Patterson

Building a Conversational AI Chatbot Website with Backblaze B2 + LangChain

2025-05-28 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/building-a-conversational-ai-chatbot-website-with-backblaze-b2-langchain/

In an earlier blog post, I explained how to build your own LLM with Backblaze B2 + Jupyter Notebook, implementing a simple conversational AI chatbot using the LangChain AI framework to implement retrieval-augmented generation (RAG). The notebook walks you through the process of loading PDF files from a Backblaze B2 Bucket into a vector store, running a local instance of a large language model (LLM) and combining those to form a chatbot that can answer questions on its specialist subject.

That article generated a lot of interest, and a few questions:

“Could you make this into a web app, like ChatGPT?”
“Could you use this with OpenAI? DeepSeek?”
“Could I load multiple collections of documents into this?”
“Could I run multiple LLMs and compare them?”
“Can I add new documents to the vector store as they are uploaded to the bucket?”

The answer to all of these questions is “Yes!”

Today, I’ll present a simple conversational AI chatbot web app with a ChatGPT-style UI that you can easily configure to work with OpenAI, DeepSeek, or any of a range of other LLMs. In future blog posts, I’ll extend this to allow you to configure multiple LLMs and document collections, and integrate with Backblaze B2’s Event Notifications feature to load documents into the vector store within seconds of them being uploaded.

And, here’s a very short video of the chatbot in action:

Editorial note: A version of this article was previously published on the New Stack.

RAG basics

Retrieval-augmented generation, or RAG for short, is a technique that applies the generative features of an LLM to a collection of documents, resulting in a chatbot that can effectively answer questions based on the content of those documents.

A typical RAG implementation splits each document in the collection into a number of roughly equal-sized, overlapping chunks, and generates an embedding for each chunk. Embeddings are vectors (lists) of floating point numbers with hundreds or thousands of dimensions. The distance between two vectors indicates their similarity. Small distances indicate high similarity and large distances indicate low similarity.

The RAG app then loads each chunk, along with its embedding, into a vector store. The vector store is a special-purpose database that can perform a similarity search–given a piece of text, the vector store can retrieve chunks ranked by their similarity to the query text by comparing the embeddings.

Let’s put the pieces together:

Given a question from the user (1), the RAG app can query the vector store for chunks of text that are similar to the question (2). This will be the context that helps the LLM answer the user’s question. Here’s a concrete example using the Backblaze documentation collection: Given the question, “Tell me about object lock,” the vector store returns four document chunks, each of about 170 words, to the app (3). Here is a link to the text of, and a short extract from, each chunk:

Object Lock: With Object Lock Legal Hold, files are prevented from being changed or deleted, but the immutability does not have a defined retention period.
Object Lock: You can enable Object Lock on a bucket when you create a new bucket or on an existing bucket.
Object Lock: Object Lock in Backblaze B2 was designed with the Immutability feature in Veeam Backup & Replication in mind.
Enable Object Lock with the S3 Compatible API: To enable Object Lock on an existing S3 compatible bucket, add the x-amz-bucket-object-lock-token header with a value of 1 to the S3 Put Object Lock Configuration API call.

Unsurprisingly, three of the four results are extracts from the main documentation article on Object Lock.

In order to provide a conversational experience, the RAG app maintains a chat history in a message store. The RAG app retrieves the message history (4) to form part of the query it will send to the LLM.

A system prompt sets the guidelines for the LLM to generate responses. In a RAG, the system prompt typically specifies that the response should be based on retrieved document data rather than model memorization, and attempts to prevent hallucinations by ensuring that the model does not fabricate information if no relevant documents are found. Here is the system prompt from the RAG app:

Use the following pieces of context and the message history to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The RAG submits the system prompt, the context, the last few messages in the conversation, and the user’s question to the LLM (5), which generates a, hopefully useful, response (6). Here’s the answer it provided to the above question:

Object Lock is a feature that allows users to enforce immutability on files stored in cloud storage, specifically in services like Backblaze B2 and AWS S3. Here are the key points about Object Lock:

Immutability: Once a file is locked, it cannot be modified or deleted until a specified retention period expires or a legal hold is lifted.
Retention Periods: Users can set a retention period when locking a file. During this time, the file remains immutable, meaning it cannot be changed or deleted.
Legal Hold: Users can place files under a legal hold, which prevents modifications or deletions indefinitely until the hold is removed.
Enabling Object Lock: Object Lock can be enabled for new buckets or existing ones, but it must be turned on at the bucket level before files are uploaded.
Integrations: Some applications, like Veeam Backup & Replication, offer built-in support for Object Lock, making the process easier for users when they want to ensure file immutability during backups.
No Extra Cost: There is no additional fee for using Object Lock; however, standard storage charges for the locked files still apply.
Mistakes in Locking: If a user locks a file for longer than intended, they may need to contact support to resolve the situation, which could involve closing their storage account in some cases.

Overall, Object Lock is a crucial feature for organizations that need to ensure data integrity and compliance with regulations concerning data retention.

The RAG app adds the user’s question and the LLM’s response to the message store (7), returns the answer to the user (8), and awaits the next question.

A quick tour of the sample app

The sample app is on GitHub at https://github.com/backblaze-b2-samples/ai-rag-app. The app is open source, under the MIT license, so you can use it as a basis for your own experimentation without any restrictions. The app was originally written to demonstrate RAG with Backblaze B2 Cloud Storage, but it works with any S3 compatible object store.

The README file covers configuration and deployment in some detail; in this blog post, I’ll just give you a high-level overview. The sample app is written in Python using the Django web framework. API credentials and related settings are configured via environment variables, while the LLM and vector store are configured via Django’s settings.py file:

CHAT_MODEL: ModelSpec = {
    'name': 'OpenAI',
    'llm': {
        'cls': ChatOpenAI,
        'init_args': {
            'model': "gpt-4o-mini",
        }
    },
}

# Change source_data_location and vector_store_location to match your environment
# search_k is the number of results to return when searching the vector store
DOCUMENT_COLLECTION: CollectionSpec = {
    'name': 'Docs',
    'source_data_location': 's3://blze-ev-ai-rag-app/pdfs',
    'vector_store_location': 's3://blze-ev-ai-rag-app/vectordb/docs/openai',
    'search_k': 4,
    'embeddings': {
        'cls': OpenAIEmbeddings,
        'init_args': {
            'model': "text-embedding-3-large",
        },
    },
}

The sample app is configured to use OpenAI GPT-4o mini, but the README explains how to use different online LLMs such as DeepSeek V3 or Google Gemini 2.0 Flash, or even a local LLM such as Meta Llama 3.1 via the Ollama framework. If you do run a local LLM, be sure to pick a model that fits your hardware. I tried running Meta’s Llama 3.3, which has 70 billion parameters (70B), on my MacBook Pro with the M1 Pro CPU. It took nearly three hours to answer a single question! Llama 3.1 8B was a much better fit, answering questions in less than 30 seconds.

Notice that the document collection is configured with the location of a vector store containing the Backblaze documentation as a sample dataset. The README file contains an application key with read-only access to the PDFs and vector store so you can try the application without having to load your own set of documents.

If you want to use your own document collection, a pair of custom commands allow you to load them from a Backblaze B2 Bucket into the vector store and then query the vector store to test that it all worked.

First, you need to load your data:

% python manage.py load_vector_store
Deleting existing LanceDB vector store at s3://blze-ev-ai-rag-app/vectordb/docs
Creating LanceDB vector store at s3://blze-ev-ai-rag-app/vectordb/docs
Loading data from s3://blze-ev-ai-rag-app/pdfs in pages of 1000 results
Successfully retrieved page 1 containing 618 result(s) from s3://blze-ev-ai-rag-app/pdfs
Skipping pdfs/.bzEmpty
Skipping pdfs/cloud_storage/.bzEmpty
Loading pdfs/cloud_storage/cloud-storage-about-backblaze-b2-cloud-storage.pdf
Loading pdfs/cloud_storage/cloud-storage-add-file-information-with-the-native-api.pdf
Loading pdfs/cloud_storage/cloud-storage-additional-resources.pdf
...
Loading pdfs/v1_api/s3-put-object.pdf
Loading pdfs/v1_api/s3-upload-part-copy.pdf
Loading pdfs/v1_api/s3-upload-part.pdf
Loaded batch of 614 document(s) from page
Split batch into 2758 chunks
[2025-02-28T01:26:11Z WARN  lance_table::io::commit] Using unsafe commit handler. Concurrent writes may result in data loss. Consider providing a commit handler that prevents conflicting writes.
Added chunks to vector store
Added 614 document(s) containing 2758 chunks to vector store; skipped 4 result(s).
Created LanceDB vector store at s3://blze-ev-ai-rag-app/vectordb/docs. "vectorstore" table contains 2758 rows

Now you can verify that the data is stored by querying the vector store. Notice how the raw results from the vector store include an S3 URI identifying the source document:

% python manage.py search_vector_store 'Which B2 native APIs would I use to upload large files?' 
2025-03-01 02:38:07,740 ai_rag_app.management.commands.search INFO     Opening vector store at s3://blze-ev-ai-rag-app/vectordb/docs/openai
2025-03-01 02:38:07,740 ai_rag_app.utils.vectorstore DEBUG    Populating AWS environment variables from the b2 profile
Found 4 docs in 2.30 seconds
2025-03-01 02:38:11,074 ai_rag_app.management.commands.search INFO     
page_content='Parts of a large file can be uploaded and copied in parallel, which can significantly reduce the time it takes to upload terabytes of data. Each part can be  anywhere from 5 MB to 5 GB, and you can pick the size that is most convenient for your application. For best upload performance, Backblaze recommends that you use the recommendedPartSize parameter that is returned by the b2_authorize_account operation.  To upload larger files and data sets, you can use the command-line interface (CLI), the Native API, or an integration, such as Cyberduck.  Usage for Large Files  Generally, large files are treated the same as small files. The costs for the API calls are the same.  You are charged for storage for the parts that you uploaded or copied. Usage is counted from the time the part is stored. When you call the b2_finish_large_file' metadata={'source': 's3://blze-ev-ai-rag-app/pdfs/cloud_storage/cloud-storage-large-files.pdf'}
...

The core of the sample application is the RAG class. There are several methods that create the basic components of the RAG, but here we’ll look at how the _create_chain() method brings together the system prompt, vector store, message history, and LLM.

First, we define the system prompt, which includes a placeholder for the context—those chunks of text that the RAG will retrieve from the vector store:

# These are the basic instructions for the LLM
system_prompt = (
    "Use the following pieces of context and the message history to "
    "answer the question at the end. If you don't know the answer, "
    "just say that you don't know, don't try to make up an answer. "
    "\n\n"
    "Context: {context}"
)

Then we create a prompt template that brings together the system prompt, message history, and the user’s question:

# The prompt template brings together the system prompt, context, message history and the user's question
prompt_template = ChatPromptTemplate(
    [
        ("system", system_prompt),
        MessagesPlaceholder(variable_name="history", optional=True, n_messages=10),
        ("human", "{question}"),
    ]
)

Now we use LangChain Expression Language (LCEL) to bring the various components together to form a chain. LCEL allows us to define a chain of components declaratively; that is, we provide a high-level representation of the chain we want, rather than specifying how the components should fit together.

Notice the log_data() helper method—it simply logs its input and passes it on to the next component in the chain.

# Create the basic chain
# When loglevel is set to DEBUG, log_input will log the results from the vector store
chain = (
    {
        "context": (
            itemgetter("question")
            | retriever
            | log_data('Documents from vector store', pretty=True)
        ),
        "question": itemgetter("question"),
        "history": itemgetter("history"),
    }
    | prompt_template
    | model
    | log_data('Output from model', pretty=True)
)

Assigning a name to the chain allows us to add instrumentation when we invoke it:

# Give the chain a name so the handler can see it
named_chain: Runnable[Input, Output] = chain.with_config(run_name="my_chain")

Now, we use LangChain’s RunnableWithMessageHistory class to manage adding and retrieving messages from the message store:

# Add message history management
return RunnableWithMessageHistory(
    named_chain,
    lambda session_id: RAG._get_session_history(store, session_id),
    input_messages_key="question",
    history_messages_key="history",
)

Finally, the log_chain() function prints an ASCII representation of the chain to the debug log:

log_chain(history_chain, logging.DEBUG, {"configurable": {'session_id': 'dummy'}})

This is the output:

The RAG class’ invoke() function, in contrast, is very simple. Here is the key section of code:

response = self._chain.invoke(
    {"question": question},
    config={
        "configurable": {
            "session_id": session_key
        },
        "callbacks": [
            ChainElapsedTime("my_chain")
        ]
    },
)

The input to the chain is a Python dictionary containing the question, while the config argument configures the chain with the Django session key and a callback that annotates the chain output with its execution time. Since the chain output contains Markdown formatting, the API endpoint that handles requests from the front end uses the open source markdown-it library to render the output to HTML for display.

The remainder of the code is mostly concerned with rendering the web UI. One interesting facet is that the Django view, responsible for rendering the UI as the page loads, uses the RAG’s message store to render the conversation, so if you reload the page, you don’t lose your context.

Take this code and run it!

The sample AI RAG application is open source under the MIT license, and I encourage you to use it as the basis for your own RAG exploration. The README file suggests a few ways you could extend it, and I also draw your attention to conclusion of the README if you are thinking of running the app in production:

[…] in order to get you started quickly, we streamlined the application in several ways. There are a few areas to attend to if you wish to run this app in a production setting:

The app does not use a database for user accounts, or any other data, so there is no authentication. All access is anonymous. If you wished to have users log in, you would need to restore Django’s AuthenticationMiddleware class to the MIDDLEWARE configuration and configure a database.
Sessions are stored in memory. As explained above, you can use Gunicorn to scale the application to multiple threads, but you would need to configure a Django session backend to run the app in multiple processes or on multiple hosts.
Similarly, conversation history is stored in memory, so you would need to use a persistent message history implementation, such as RedisChatMessageHistory to run the app in multiple processes or on multiple hosts.

Above all, have fun! AI is a rapidly evolving technology, with vendors and open source projects releasing new capabilities every day. I hope you find this app a useful way of jumping in.

The post Building a Conversational AI Chatbot Website with Backblaze B2 + LangChain appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Iceberg on Backblaze B2

2025-05-01 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/iceberg-on-backblaze-b2/

A decorative image showing icons of different file types on a grid superimposed over a cloud.

If you work with cloud storage and data lakes, you’re likely hearing the word “Iceberg” with increasing frequency, occasionally prefixed by “Apache”. What is Apache Iceberg, and how can you leverage it to efficiently store data in object stores such as Backblaze B2 Cloud Storage? I’ll answer both of those questions in this blog post.

But, first, join me on a brief trip back in time to the beginning of the twenty-first century, a long-ago time before the emergence of big data and cloud computing.

A timely shoutout to the Data Council conference

We recently attended the 2025 Data Council conference and caught Ryan Blue, co-creator of Apache Iceberg’s excellent presentation (featuring some very entertaining slides).

If you want to hear more about topics like this one, feel free to join us at Backblaze Weekly, an ongoing webinar series where we discuss all things Backblaze.

An image of Ryan Blue speaking at the 2025 Data Council conference. — Ryan Blue speaking at the 2025 Data Council conference. Note: His shirt says “the future is open”. We agree!

CSV: The lingua franca of tabular data

In the early 2000s, if you were working with tabular data, you were likely using either a relational database management system (RDBMS), such as Oracle Database, or a spreadsheet, likely Microsoft Excel.

Data stored in an RDBMS is highly structured, meaning that it MUST conform to a predefined schema. For example, you might create an employee table with columns such as first name, last name, date of birth, hire date, and so on. The database schema holds metadata such as the name and data type of each column, whether that column must have a value, relationships between tables, and so on.

A spreadsheet, on the other hand, has some structure—data is arranged in rows and columns, similarly to an RDBMS–but each cell can contain anything: text, a number, a formula referencing other cells, even an image in today’s spreadsheets. We say that a spreadsheet is semi-structured data.

At the turn of the century, each database and spreadsheet had its own proprietary file format, optimized for its own requirements, and often not at all publicly documented, but the need to be able to exchange data between applications led to broad adoption of a file format to allow just that: comma-separated values, or CSV.

Here’s a simple example of some tabular data represented as CSV:

employee_id,first_name,last_name,reports_to,job_title,is_manager
1,Gleb,Budman,,CEO,1
123,Patrick,Thomas,1,"VP of Marketing",1
45,Yev,Pusin,123,"Head of Communications and Community",1
678,Pat,Patterson,45,"Chief Technical Evangelist",0

CSV is simple and flexible enough that it was easy for me to type that example up manually and import it into Microsoft Excel with no problems at all. Note that, as well as the commas, the double quotes in the CSV data are part of the file format, and do not appear in the imported data:

CSV has a lot of advantages: It’s simple; flexible; widely understood; the optional header line means that data can be somewhat self-describing; and it’s not controlled by any single vendor.

CSV does, however, also have a few disadvantages, including:

There’s no schema; nothing in that file expresses that the values in the first column, apart from the header, must be integers.
It’s difficult to represent complex or hierarchical datasets.
Data is stored as text, which is inefficient for numerical and repetitive data. Text representations of numbers occupy more storage than binary, and applications must convert them to binary when loading the file and convert them back to text when saving it.

Avro, Parquet and ORC: File formats for big data

The emergence of open-source distributed computing frameworks such as Apache Hadoop and, later, Apache Spark, in the first two decades of this century drove the creation and adoption of more efficient ways of storing tabular data. Avro, Parquet and ORC, all Apache projects, are binary file formats that address shortcomings of CSV, such as encapsulating schema alongside the data.

Avro, like CSV, is designed for row-oriented data, which makes it well-suited to use cases that involve appending new data to files. Parquet and ORC, in contrast, are column-oriented file formats, perfect for online analytical processing (OLAP) use cases where, for example, an application might read an entire column from a table to calculate the sum of its values. As well as storing numbers in a binary representation, Parquet and ORC can also reduce file size through compression strategies such as run-length encoding.

Here’s a concrete example: The Drive Stats data set for December 2024 occupies 3.7GB of storage in CSV format. As Parquet, the same data consumes just 242MB, a data compression ratio of more than 15:1.

Why does it matter if your dataset is smaller? Well, beyond just cost savings, which are amplified when dealing with huge datasets, smaller files mean that running queries against full datasets takes less time, which reduces server load, compute costs, and so on.

From file formats to table formats and data lakes

Apache Hadoop’s original use case was as an implementation of MapReduce, a programming model for manipulating large datasets. Engineers at Facebook, tasked with allowing SQL queries over datasets generated by Hadoop, created Apache Hive, and, with it, the Hive table format, which specified how to view a collection of files as a single logical table. The Hive table format in turn allowed organizations to create data lakes, repositories that store structured and semi-structured data in their original format for analysis by a wide range of tools, and, later, data lakehouses, which aim to combine the benefits of data lakes and traditional data warehouses by storing structured data using data lake tools and technologies.

A key concept of the Hive table format is partitioning, a way of organizing files to reduce the amount of data that must be read to process a query. Taking the Drive Stats dataset as an example, we can partition the files by year and month, so that each file has a prefix of the form:

/drivestats/year={year}/month={month}/

For example:

/drivestats/year=2024/month=12/

With this partitioning scheme, a system processing a query for hard drive statistics for, say, December 12, 2024, need only retrieve files with the above prefix. You might be wondering, “Why not partition the data on day, also, to further reduce the number of files that must be retrieved?” The answer depends on the data volume and access patterns. It’s much more efficient to partition data into fewer large files than many small files, so overly granular partitioning can actually impair performance.

It’s worth mentioning that file formats and table formats are largely independent of each other. You can use Avro, Parquet, ORC, or even CSV files with the Hive table format.

For more detail on the Parquet file format, Hive table format, and partitioning, see the blog post, Storing and Querying Analytical Data in Backblaze B2.

“Iceberg, captain, dead ahead!”

While the Hive table format served the big data community well for several years, it had a number of shortcomings:

Every query incurs a file list (“list objects”, in S3 API terms) operation, which is particularly expensive with cloud object storage, both in terms of time and API transaction charges.
Deleting or modifying data typically implies rewriting an entire data file, even if only a single row was affected.
Hive can only partition datasets on columns that are in the table schema. For example, the Drive Stats data set includes a date column, so to use it with Hive, we had to create additional, redundant, year and month columns.
Any changes to the data schema or partitioning strategy require affected files to be rewritten, making schema evolution problematic, if not infeasible, for large datasets.
There is limited support for the kind of ACID (Atomic, Consistent, Isolated, Durable) transactions that are familiar from the RDBMS world. Attempts to add transaction support to Hive were not widely or consistently supported.

As a result, vendors and the broader big data community formed a number of projects to define new table formats to succeed Hive, including Apache Iceberg, Apache Hudi, and Delta Lake, a Linux Foundation project.

The three are broadly comparable in terms of features, but, over the past couple of years, Iceberg has emerged as the leader in terms of vendor adoption, with Snowflake announcing general availability of Iceberg tables in June 2024, and Amazon announcing S3 Tables, its managed Iceberg offering, in December 2024. Significantly, Databricks, the prime mover behind Delta Lake, acquired Tabular, a company founded by the original creators of Apache Iceberg, in June 2024, establishing its own beachhead in the Iceberg community.

Iceberg‘s features allow it to be used to organize huge data sets, efficiently and flexibly:

Table metadata including the list of files that comprise a table is stored as JSON data alongside the data files, eliminating the need to run an expensive list object operation for every query.
Schema evolution allows you to add, drop, update, or rename columns.
Hidden partitioning decouples partitioning from the table schema. For example, you can partition data like the Drive Stats dataset by year and month based on the existing date values, without creating additional columns.
Partition layout evolution allows you to modify your partitioning strategy as data volume or access patterns change.
Time travel allows you to query table snapshots.
Serializable isolation provides atomic table changes, ensuring readers never see inconsistent data.
Multiple concurrent writers use optimistic concurrency, retrying to ensure that compatible updates succeed while detecting conflicting writes.

Iceberg is widely supported across the big data ecosystem, with many applications and tools allowing you to store Iceberg tables in S3 compatible cloud object storage such as Backblaze B2. In this article, I’ll look at the simplest use case, running queries against the Drive Stats dataset, with three representative examples: Snowflake, Trino, and DuckDB.

Writing Iceberg data to Backblaze B2

I wrote a simple Python application, drivestats2iceberg, using the PyIceberg library, that converts the Drive Stats dataset from the zipped CSV files we publish to Parquet files in an Iceberg table stored in a Backblaze B2 Bucket. There are some useful techniques in drivestats2iceberg, and it is published on GitHub as open source, under the MIT license, so feel free to use it as a starting point for your own data conversion apps.

Querying Iceberg tables in Backblaze B2 from Snowflake

Snowflake is a data-as-a-service platform addressing a wide variety of use cases, including artificial intelligence (AI), machine learning (ML), collaboration across organizations, and data lakes.

A decorative image showing the Backblaze and Snowflake logos superimposed over a cloud that dissolves into binary 0s and 1s. — We’re big fans of the Backblaze + Snowflake integration. Our customers are too.

As I mentioned above, Snowflake announced general availability of its Iceberg tables offering in June 2024, allowing you to manipulate Iceberg tables located on external volumes, outside your Snowflake warehouse, and query them alongside data in Snowflake-managed tables.

Snowflake’s Iceberg implementation is quite complicated, with different capabilities according to your choice of cloud object storage provider and whether you want Snowflake to manage your Iceberg catalog or use a catalog integration.

For our simple use case, where the Iceberg metadata and data files already exist in a Backblaze B2 Bucket, the first step is to create a Snowflake external volume, configuring it with suitable credentials and the location of the Drive Stats data.

Note: the application key shown in this Snowflake statement has read-only access to the drivestats-iceberg bucket. You can use it to query the Drive Stats data set from your own Snowflake instance or from other environments.

CREATE EXTERNAL VOLUME drivestats_b2
  STORAGE_LOCATIONS = (
    (
      NAME = 'b2_storage_location'
      STORAGE_PROVIDER = 'S3COMPAT'
      STORAGE_BASE_URL = 's3compat://drivestats-iceberg/'
      CREDENTIALS = (
        AWS_KEY_ID = '0045f0571db506a0000000017'
        AWS_SECRET_KEY = 'K004Fs/bgmTk5dgo6GAVm2Waj3Ka+TE'
      )
      STORAGE_ENDPOINT = 's3.us-west-004.backblazeb2.com'
    )
  )
  ALLOW_WRITES = FALSE;

Next, you must create a catalog integration. The object store catalog integration simply reads Iceberg metadata from an external (to Snowflake) cloud storage location:

CREATE CATALOG INTEGRATION my_iceberg_catalog_integration
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = ICEBERG
  ENABLED = TRUE;

Now you can create an Iceberg table object that references the existing dataset. Note that Snowflake requires you to explicitly specify the metadata file to use for column definitions; this is typically the most recently created JSON file under the metadata prefix.

CREATE ICEBERG TABLE drivestats
  EXTERNAL_VOLUME = 'drivestats_b2'
  CATALOG = 'my_iceberg_catalog_integration'
  METADATA_FILE_PATH = 'drivestats/metadata/00225-317608b1-35a6-4135-8393-7543583623db.metadata.json';

That done, you can start querying the data:

How many records are in the current Drive Stats dataset?

SELECT COUNT(*) 
FROM drivestats;

Result:

564566016

How many hard drives was Backblaze spinning on a given date?

SELECT COUNT(*) 
FROM drivestats 
WHERE date = DATE '2024-12-31';

Result:

How many exabytes of raw storage was Backblaze managing on a given date?

SELECT ROUND(SUM(CAST(capacity_bytes AS BIGINT))/1e+18, 2) 
FROM drivestats 
WHERE date = DATE '2024-12-31';

Result:

4.42

What are the top 10 most common drive models in the dataset?

SELECT model, COUNT(DISTINCT serial_number) AS count 
FROM drivestats 
GROUP BY model
ORDER BY count DESC
LIMIT 10;

Results (in drive days):

TOSHIBA MG08ACA16TA   40859
TOSHIBA MG07ACA14TA   39387
ST12000NM0007         38843
ST4000DM000           37040
ST16000NM001G         34501
WDC WUH722222ALE6L4   30148
WDC WUH721816ALE6L4   26547
ST12000NM0008         21028
HGST HMS5C4040BLE640  16349
ST8000NM0055          15680

My x-small Snowflake warehouse executed the first three queries in a fraction of a second. As you might expect from its additional complexity, the last query took longer: 16 seconds.

Querying Iceberg tables in Backblaze B2 from Trino

Trino is an open-source distributed query engine, formerly known as PrestoSQL. Trino can natively query data in Backblaze B2, Cassandra, MySQL, and many other data sources without copying that data into its own dedicated store. Trino has become the Backblaze Evangelism Team’s go-to date lake tool over the past few years; we’ve used it in several past blog posts, and we maintain a GitHub repository with quick start guides for running Trino with BackblazeB2.

To access the Drive Stats data set from Trino, you must configure its Iceberg connector with a catalog properties file. For example, to configure a catalog named drivestats_b2, create a file etc/catalog/drivestats_b2.properties:

connector.name=iceberg

hive.metastore.uri=thrift://hive-metastore:9083

iceberg.register-table-procedure.enabled=true

fs.native-s3.enabled=true

s3.endpoint=https://s3.us-west-004.backblazeb2.com
s3.region=us-west-004
s3.aws-access-key=0045f0571db506a0000000017
s3.aws-secret-key=K004Fs/bgmTk5dgo6GAVm2Waj3Ka+TE
s3.exclusive-create=false

Note that the above configuration file uses the same read-only credentials as the Snowflake example. You can use this configuration file as-is to explore the Drive Stats dataset using Trino.

Start the Trino server and CLI, then create a Trino schema with the location of the data, and set it as the default schema for subsequent queries:

CREATE SCHEMA drivestats_b2.ds_schema
    WITH (location = 's3://drivestats-iceberg/');
USE drivestats_b2.ds_schema;

The Trino Iceberg connector provides the register_table procedure for registering existing Iceberg tables into the metastore. Optionally, you can provide an additional metadata_file_name parameter if you wish to register the table with some specific table state, or if the connector cannot automatically figure out the metadata version to use.

CALL drivestats_b2.system.register_table(
    schema_name => 'ds_schema',
    table_name => 'drivestats',
    table_location => 's3://drivestats-iceberg/drivestats'
);

Since you can query the table using the exact same SQL queries as in the Snowflake example, producing the exact same results, I won’t reproduce them here. Running Trino in a Docker container on my MacBook Pro, the first three queries executed in less than three seconds, the fourth took just over a minute.

Querying Iceberg tables in Backblaze B2 from DuckDB

DuckDB is an open-source column-oriented RDBMS, intended for in-process use: embedded in applications. There are DuckDB client APIs (also known as drivers) for many programming languages, including Python, Java, JavaScript (Node.js) and Go.

DuckDB is focused on the same kinds of use cases as Snowflake and Trino; it is effectively the OLAP equivalent to SQLite, which targets online transaction processing (OLTP) workloads.

To work with Iceberg tables in cloud object storage, you must install and load the httpfs and iceberg DuckDB extensions:

INSTALL httpfs;
LOAD httpfs;

INSTALL iceberg;
LOAD iceberg;

Now, you need to create a secret with your Backblaze B2 credentials.

Again, the application key shown here has read-only access to the Drive Stats dataset; you can use it to explore the data yourself if you like.

CREATE SECRET secret (
    TYPE s3,
    KEY_ID '0045f0571db506a0000000017',
    SECRET 'K004Fs/bgmTk5dgo6GAVm2Waj3Ka+TE',
    REGION 'us-west-004',
    ENDPOINT 's3.us-west-004.backblazeb2.com'
);

By default, queries against Iceberg tables in DuckDB use a SELECT ... FROM iceberg_scan(...) syntax, but you can define a schema and a view so that you can use the same SQL queries as with Snowflake and Trino:

First, a schema:

CREATE SCHEMA ds_schema;
USE ds_schema;

Then, a view:

CREATE VIEW drivestats AS 
    SELECT *
    FROM iceberg_scan(
        's3://drivestats-iceberg/drivestats', 
        version = '?',
        allow_moved_paths = true
    );

Note: the version = '?' parameter tells DuckDB to examine the table’s metadata files and “guess” which one corresponds to the latest version. This behavior is not enabled by default, so you must set unsafe_enable_version_guessing to true before you query the data, like this:

SET unsafe_enable_version_guessing = true;

That done, you can query the table using the exact same SQL queries as with Snowflake and Trino, with the exact same results. With DuckDB on my MacBook Pro, the first three queries took about 15–25 seconds; the fourth about 90 seconds.

Note that Snowflake, Trino and DuckDB are very different systems, with different trade-offs between cost, performance, and flexibility. I’ve included the execution times I saw to set your expectations when working with these tools, rather than as a point of comparison between them.

What’s next for Apache Iceberg?

Apache Iceberg is much more than a table format specification; it’s a broad, thriving ecosystem that is constantly innovating new features, tracking progress via its own GitHub repository. Here are a few technologies that are currently in active development:

Variant Data Type Support will offer a more efficient, versatile approach to managing hierarchical, JSON-like data, aligning with Apache Spark’s variant format.
Materialized Views will allow you to define a view as you usually would, in terms of a query against one or more existing views or tables, that is able to store data, like a table. On creation, the materialized view is populated with data and functions as a cache, serving its data in response to queries. The materialized view can be periodically refreshed to keep it in sync with its sources.
Geospatial Support will add Iceberg-native data types and operations storage and analysis of geospatial data, allowing you to define columns as points, lines and polygons, and use conditions such as “intersects” in queries.

I’ve only scratched the surface of Apache Iceberg in this blog post. Stay tuned for deeper dives into using Snowflake, Trino, DuckDB and more platforms and tools with the Iceberg table format and Backblaze B2 Cloud Storage.

The post Iceberg on Backblaze B2 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Experimenting with DeepSeek, Backblaze B2, and Drive Stats

2025-03-04 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/experimenting-with-deepseek-backblaze-b2-and-drive-stats/

A decorative image showing buildings of many sizes.

As we explained in our recent blog post, AI Reasoning Models: OpenAI o3-mini, o1-mini, and DeepSeek R1, Chinese startup DeepSeek caused a stir when it released its R1 reasoning model in January of this year. Interestingly, DeepSeek R1 has an OpenAI-compatible API, so applications written for OpenAI should work with DeepSeek R1 with just a configuration change. Since I had a suitable sample app all ready to go, I decided to put their claim to the test.

Why, and why not, use DeepSeek?

A major difference between DeepSeek and OpenAI is cost. At the time of writing, DeepSeek charges $0.55 per million input tokens and $2.19 per million output tokens for its R1 model. That’s about 3.6% of OpenAI’s $15.00 per million input tokens and $60.00 per million output tokens for its flagship o1 reasoning model, and about half of o3-mini’s $1.10 per million input tokens and $4.40 per million output tokens.

Set against this is the fact that, in using the DeepSeek platform’s API, you are sending your data to a startup located in China that has been accused by OpenAI of “inappropriately” basing its work on the output of OpenAI’s models. It’s up to you, and your organizations’ data governance policy, whether the trade-off is worthwhile.

Another consideration is the ability to run DeepSeek’s models locally, on your own infrastructure, or, more likely, your chosen provider’s infrastructure, rather than sending requests to the DeepSeek platform. Spinning up my own DeepSeek instance was out of scope for this blog post, but I’ll likely return to it in a future blog post.

Swapping OpenAI for DeepSeek

Last month, I explained how you can build an AI agent with Backblaze B2, LangChain, and Drive Stats, walking you through a simple chatbot that can answer questions based on our Drive Stats data set—11 years of metrics gathered from the Backblaze B2 Cloud Storage platform’s fleet of hard drives. In that example, the chatbot accepted a natural language question, used OpenAI’s GPT‑4o mini large language model (LLM) to generate a SQL query that might help provide an answer, executed the query against the Drive Stats data set via the Trino SQL engine, and then used OpenAI again to interpret the result set and either repeat the query-interpret cycle, or generate a natural language answer.

I copied the Jupyter notebook from that example and used it as the basis for investigating the feasibility of swapping out OpenAI for DeepSeek. The DeepSeek version of the notebook contains the full source code of my experiments; I’ll include relevant extracts here, edited for clarity.

Since I used the LangChain AI framework, which provides a layer above a range of AI models, the only place that OpenAI surfaced in my code was in creating an instance of LangChain’s ChatOpenAI wrapper:

# OPENAI_API_KEY must be defined in the .env file
load_dotenv()
llm = ChatOpenAI(model="gpt-4o-mini")

The ChatOpenAI class contains all the code required to communicate with OpenAI via its API.

According to the DeepSeek documentation, all you should need to do is:

Provide your DeepSeek API key in the same OPENAI_API_KEY environment variable.
Set the API base URL to https://api.deepseek.com.
Provide a DeepSeek model name in place of the OpenAI one.

If this reminds you of the steps for using Backblaze B2’s S3-compatible API, you’re not alone. The OpenAI API has become a de facto standard for integrating with LLMs in much the same way as Amazon’s S3 API allows an ecosystem of apps and tools to interoperate with object storage systems from a variety of vendors.

Looking at the DeepSeek documentation, you can use one of two models, deepseek-reasoner (aka DeepSeek R1) or deepseek-chat. Let’s see what the much-talked-about DeepSeek R1 came up with.

Using DeepSeek R1 in the AI agent

To make it easy to use both the OpenAI and DeepSeek notebooks, I created a second entry in the .env file for the DeepSeek API key, and copied it to the OpenAI environment variable in the notebook code:

# The .env file needs at least DEEPSEEK_API_KEY, and may also contain
# OPENAI_API_KEY. Move the DeepSeek API key to the OpenAI environment
# variable
load_dotenv()

os.environ["OPENAI_API_KEY"] = os.environ.pop("DEEPSEEK_API_KEY")

llm = ChatOpenAI(model="deepseek-reasoner", base_url='https://api.deepseek.com')

As I set about repeating the steps from the Jupyter notebook that supported my previous blog post, I was disappointed to see DeepSeek fall at the very first hurdle: generating a SQL query for a simple natural language question. Here is the code:

question = {"question": "How many drives are there?"}

write_query(question)

Looking back at the original notebook, OpenAI’s response was valid SQL, although it didn’t have enough information to construct the correct query:

{'query': 'SELECT COUNT(*) AS drive_count FROM drivestats'}

DeepSeek, on the other hand, responded with a Python stack trace and this error:

openai.UnprocessableEntityError: Failed to deserialize the JSON body into the target type: response_format: response_format.type `json_schema` is unavailable now at line 1 column 13827

What went wrong? Searching for the error turns up a comment from a LangChain engineer explaining that we should use BaseChatOpenAI rather than ChatOpenAI since it “[…] accommodates many APIs that are similar to OpenAI. It uses tool calling for structured output by default.”

So, we can redefine llm accordingly, and try generating a query again:

llm = BaseChatOpenAI(model="deepseek-reasoner", base_url='https://api.deepseek.com')

write_query(question)

Unfortunately, DeepSeek returns another error:

BadRequestError: Error code: 400 - {'error': {'message': 'The last message of deepseek-reasoner must be a user message, or an assistant message with prefix mode on (refer to https://api-docs.deepseek.com/guides/chat_prefix_completion).', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_request_error'}}

Looking back at the AI agent code, we can see that we used an off-the-shelf prompt from the LangChain Prompt Hub that provides the model with a single, system, message:

================================ System Message ================================

Given an input question, create a syntactically correct {dialect} query to run to help find the answer. Unless the user specifies in his question a specific number of examples they wish to obtain, always limit your query to at most {top_k} results. You can order the results by a relevant column to return the most interesting examples in the database.

Never query for all the columns from a specific table, only ask for a few relevant columns given the question.

Pay attention to use only the column names that you can see in the schema description. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.

Only use the following tables:
{table_info}

Question: {input}

Does this mean that DeepSeek is not, in fact, API-compatible with OpenAI? I would argue that it does not. DeepSeek implements the same API request/response syntax as OpenAI, but it is a different platform. Some variation in semantics is to be expected. We see similar variations between Backblaze B2 and Amazon S3; for example, the S3 PutObjectAcl operation sets the access control list (ACL) for an object in a bucket. Amazon S3’s access management model allows you to manipulate an object’s ACL independently of its bucket—for example, you can put a private object in a public bucket, and vice versa.

This flexibility comes with a cost: It becomes difficult to reason about the visibility of data. In fact, AWS now recommends “that you keep ACLs disabled, except in unusual circumstances where you need to control access for each object individually.”

Backblaze B2’s model is much simpler: You control access at the bucket level, and all objects have the same ACL as their bucket. Backblaze B2 implements the PutObjectAcl operation, but, if you try to set an object’s ACL to any other value than its bucket’s ACL, the service responds with an error.

Returning to the AI agent code, we can replace the single-system-message prompt with one that combines a system message with a user message:

import textwrap
from langchain_core.prompts import ChatPromptTemplate

query_prompt_template = ChatPromptTemplate([
    ("system", textwrap.dedent("""Given an input question, create a
    syntactically correct {dialect} query to run to help find the answer.
    Unless the user specifies in his question a specific number of examples
    they wish to obtain, always limit your query to at most {top_k} results.
    You can order the results by a relevant column to return the most
    interesting examples in the database.

    Never query for all the columns from a specific table, only ask for a the
    few relevant columns given the question.

    Pay attention to use only the column names that you can see in the schema
    description. Be careful to not query for columns that do not exist. Also,
    pay attention to which column is in which table.

    Only use the following tables:
    {table_info}""")),
    ("human", "Question: {input}"),
])

Trying the write_query() call for a third time, this is the response:

BadRequestError: Error code: 400 - {'error': {'message': 'deepseek-reasoner does not support Function Calling', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_request_error'}}

A third error! What is this “function calling” that deepseek-reasoner does not support? A helpful article on the topic at the Hugging Face AI community explains:

Function calling is a powerful capability that enables Large Language Models (LLMs) to interact with your code and external systems in a structured way. Instead of just generating text responses, LLMs can understand when to call specific functions and provide the necessary parameters to execute real-world actions.

Unfortunately, that is exactly our use case. It’s becoming clear that DeepSeek R1 is not the correct tool for implementing an AI agent—we’ve been trying to use a chisel as a screwdriver!

DeepSeek-V3: A better fit

As its name suggests, the deepseek-chat model is more appropriate for this application. The DeepSeek documentation tells us that it is based on DeepSeek-V3, released in December 2024. DeepSeek-V3 is priced at $0.27 per million input tokens and $1.10 per million output tokens; this is actually more expensive than the GPT-4o mini model I used for the OpenAI agent example ($0.15 per million input tokens, $0.600 per million output tokens), but how does it compare? Let’s take a look.

First, we need to edit the LLM creation code again to set the model name:

llm = BaseChatOpenAI(model="deepseek-chat", base_url='https://api.deepseek.com')

Now we can run write_query() again. It’s immediately clear that it’s a better fit than its “big brother:”

{'query': 'SELECT COUNT(*) AS total_drives FROM drivestats LIMIT 10'}

As with the OpenAI agent, this query is well-formed SQL, but it’s not answering the question we set—it’s giving us the total number of rows in the dataset, rather than the number of drives. Also, it’s a little odd to have a LIMIT clause in a SELECT COUNT(*) query, but it’s legal SQL, and the agent is following its instructions very literally: always limit your query to at most {top_k} results, where we set top_k to 10.

question = {"question": "Each drive has its own serial number. How many drives are there?"}

query = write_query(question)

{'query': 'SELECT COUNT(DISTINCT serial_number) AS total_drives FROM drivestats'}

So far, so good!

I’ll skip some intermediate steps here—they are all in the Jupyter notebook if you want to review them, or run them for yourself—and look at how a simple LangChain graph, built on the DeepSeek LLM, answered the question: “Each drive has its own serial number. How many drives did each data center have on 9/1/2024?”

The OpenAI version generated an invalid query, comparing the date column with the string ’2024-09-01’ without using the required DATE type identifier, but DeepSeek generates a correct SQL query and provides a useful natural language response:

/SELECT datacenter, COUNT(DISTINCT serial_number) AS drive_count FROM drivestats WHERE date = DATE ‘2024-09-01’ GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10

[(‘phx1’, 89477), (‘sac0’, 78444), (‘sac2’, 60775), (”, 24080), (‘iad1’, 22800), (‘ams5’, 16139)]

On September 1, 2024, the data centers had the following number of drives:

phx1: 89,477 drives
sac0: 78,444 drives
sac2: 60,775 drives
(empty datacenter): 24,080 drives
iad1: 22,800 drives
ams5: 16,139 drives

These are the top data centers with the highest drive counts on that date.

DeepSeek scores a point!

Moving on to the ReAct AI Agent, which allows the LLM to perform multiple SQL queries in generating an answer to a question, DeepSeek performs similarly to OpenAI. Given the question, “Each drive has its own serial number. What is the annualized failure rate of the ST4000DM000 drive model?”, the DeepSeek agent provides the overall failure rate rather than the annualized failure rate (AFR).

When we provide explicit instructions for calculating AFR in its prompt, the DeepSeek agent provides the correct result, identical, in fact, to the OpenAI agent’s response:

The annual failure rate (AFR) for the ST4000DM000 drive model is approximately 2.63%.

However, when given the question, “What was the annual failure rate of the ST8000NM000A drive model in Q3 2024?”, the DeepSeek agent gives us:

[(1.6100573445081607,)]

While OpenAI responds:

The annual failure rate (AFR) of the ST8000NM000A drive model in Q3 2024 is approximately 1.61%.

Wrapping up the investigation, the final question from the OpenAI notebook is more complex:

Considering only drive models which had at least 100 drives in service at the end of the quarter and which accumulated 10,000 or more drive days during the quarter, which drive had the most failures in Q3 2024, and what was its failure rate?

Impressively, the OpenAI agent constructed a well-formed SQL query and provided the correct response:

The drive model with the most failures in Q3 2024 is the TOSHIBA MG08ACA16TA, which had 181 failures. Its failure rate during this period was approximately 1.84%.

BadRequestError: Error code: 400 - {'error': {'message': "An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. (insufficient tool messages following tool_calls message)", 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_request_error'}}
During task with name 'agent' and id '0aa26ba6-a3ee-ced1-de4d-b60ed7fbca99'

The phrase “insufficient tool messages” suggested that the DeepSeek LLM might need to be reconfigured to allow more tokens. According to the documentation on models and pricing, the deepseek-chat model supports a maximum of 8K output tokens, but defaults to 4K if max_tokens is not specified.

Recreating the DeepSeek wrapper object and agent accordingly, I gave it the last question again:

llm = BaseChatOpenAI(model="deepseek-chat", base_url='https://api.deepseek.com', max_tokens=8192, **extra_kwargs)

agent_executor = create_react_agent(llm, tools, state_modifier=system_message)

response = agent_executor.invoke(
    {"messages": [{"role": "user", "content": "Considering only drive models which had at least 100 drives in service at the end of the quarter and which accumulated 10,000 or more drive days during the quarter, which drive had the most failures in Q3 2024, and what was its failure rate?"}]}
)

# Show the SQL query sent to the database
print(response['messages'][-3].tool_calls[0]['args']['query'])

# Show the final response message
display_markdown(response['messages'][-1].content, raw=True)

This time, DeepSeek was able to generate a similar SQL query to OpenAI:

WITH drive_counts AS (
    SELECT model, COUNT(DISTINCT serial_number) AS drive_count
    FROM drivestats
    WHERE date >= DATE '2024-07-01' AND date <= DATE '2024-09-30'
    GROUP BY model
    HAVING COUNT(DISTINCT serial_number) >= 100
), drive_days AS (
    SELECT model, COUNT(*) AS total_drive_days
    FROM drivestats
    WHERE date >= DATE '2024-07-01' AND date <= DATE '2024-09-30'
    GROUP BY model
    HAVING COUNT(*) >= 10000
), failures AS (
    SELECT model, COUNT(*) AS failure_count
    FROM drivestats
    WHERE date >= DATE '2024-07-01' AND date <= DATE '2024-09-30' AND failure = 1
    GROUP BY model
)
SELECT d.model,
       f.failure_count,
       100 * (CAST(f.failure_count AS DOUBLE) / (CAST(d.total_drive_days AS DOUBLE) / 365)) AS annual_failure_rate
FROM drive_days d
JOIN failures f ON d.model = f.model
JOIN drive_counts dc ON d.model = dc.model
ORDER BY f.failure_count DESC
LIMIT 1

With a correct response:

To answer the question:

The drive model with the most failures in Q3 2024 is TOSHIBA MG08ACA16TA, which had 181 failures. The annualized failure rate (AFR) for this model during that quarter was 1.84%.

Success! But, unfortunately, this isn’t the whole story.

DeepSeek Reliability

A screenshot of a DeepSeek error message.

I originally set out to write this blog post at the end of January, but the DeepSeek platform website had gone offline by January 30, so I couldn’t even start until I was able to sign up for an API key on February 5.

A screenshot of DeepSeek availability from December 2024 to Feburary 2025.

Given my shiny new API key, and DeepSeek’s claims of OpenAI API compatibility, I naïvely expected to be able to work through my earlier OpenAI notebook and write up the results in a couple of days. The reality was more like two weeks.

In this blog post I’ve detailed some of the error messages I encountered along the way, but I saw many more that pointed to the DeepSeek API simply being overwhelmed with traffic. For example, for over a day, when the status page reported no issues, most API requests to DeepSeek terminated after a minute with the error message:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

A time-consuming investigation revealed that this was caused by the DeepSeek API returning the 200 status code and headers as if the request was successful, then hanging for a minute before terminating the connection without returning any actual data. The calling code saw the 200 as success and tried to decode the non-existent API response body, resulting in the error.

I saw several more instances of intermittent errors that all seemed to point in the same direction: DeepSeek needs to add capacity to its API platform. Notably, the platform seemed faster and more stable on a Saturday morning, U.S. Pacific time, the early hours of Sunday morning in China.

Final thoughts

At present, I would have to classify the DeepSeek-V3 API as “promising, but somewhat flaky.” An agent invocation that succeeds one minute could fail the next with any of a range of error messages. That’s a shame, since when it does work, for instance, in creating the SQL query for the final question above, it tends to work very well.

One final caveat: This is a dynamic field; frameworks and services are literally being updated on a daily basis. For example, since yesterday, as I write this, four of the notebook’s module dependencies have been updated. I encourage you to experiment for yourself as your mileage will almost certainly vary, hopefully in a positive direction.

The post Experimenting with DeepSeek, Backblaze B2, and Drive Stats appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Building an AI Agent with Backblaze B2, LangChain, and Drive Stats

2025-01-22 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/building-an-ai-agent-with-backblaze-b2-langchain-and-drive-stats/

A decorative image showing multiple computer windows folding into the cloud.

Last August, I explained how you can use a Jupyter Notebook to explore AI development; specifically, building a chatbot that answers questions based on custom context downloaded from a private bucket in Backblaze B2 Cloud Storage.

In this post, I’ll look at another AI technology, agents, and show you how I built an AI agent that answers questions about hard drive reliability based on over 11 years of raw data from our Drive Stats franchise.

The Drive Stats dataset is ideal for this kind of work. It’s a real-world dataset, but, it only weighs in at around 500 million records consuming about 20GB of storage in Parquet format (“only” being a relative term), so you can use it with big data and AI tools on a laptop in a reasonable amount of time rather than spinning up an expensive virtual machine (VM) and/or spending days waiting for an operation to complete. As an example, converting the entire Drive Stats data set from CSV to Parquet using a Python app on my MacBook Pro takes a couple of hours. On the same hardware, converting a terabyte-scale data set would take about four days.

Speaking of Drive Stats

The Drive Stats 2024 report comes out February 11, and we’re hosting a LinkedIn Live event where Andy Klein, resident Drive Stats guru, will share highlights. Register today to save your spot.

You can use these same techniques with any large dataset, from healthcare to ecommerce to financial services. In this example, we’re working with a single table, but you could adapt the sample code to a data lake comprising any number of tables.

What is an AI agent?

In the spirit of the times, I posed this question to ChatGPT. Its answer:

An AI agent is a software system designed to autonomously perform tasks or make decisions based on its environment and goals. It leverages artificial intelligence techniques—such as machine learning, reasoning, and natural language processing—to process information, make decisions, and take actions to achieve specific objectives.

Key components of an AI agent include:

Perception: The ability to sense and understand its environment. This could be through sensors, input data, or other means of gathering information.
Reasoning/decision-making: The core processing mechanism that helps the agent interpret its environment, make decisions, and plan actions. It could use various algorithms, such as decision trees, reinforcement learning, or neural networks.
Action: Once the agent has analyzed the environment and made a decision, it takes action to achieve its goal, whether it’s performing an operation, giving a recommendation, or interacting with another system.
Learning: Some AI agents can adapt over time, improving their decision-making and actions based on experience (via reinforcement learning, supervised learning, etc.).

AI agents can range from simple systems, like chatbots or virtual assistants, to more complex systems like autonomous vehicles, robots, or financial trading algorithms.

In general, the term “agent” emphasizes the idea of autonomy—the agent operates independently, often with the ability to learn, adapt, and make decisions based on changing conditions without direct human intervention.

In this example, the agent’s environment is a database containing the Drive Stats data (more on that below), and I want it to perform the following tasks:

Based on a natural language question, such as “Which drive has the lowest annual failure rate?”, generate a SQL query that retrieves data that will help answer the question.
Execute that query against the Drive Stats dataset.
Based on the query results, either create a new query that better answers the question, or generate a natural language answer.

As in my previous post, I’m using the open source LangChain framework. This tutorial on building a question/answering system over SQL data was my starting point. I’ll explain key points of the integration in this blog post; the full source code is available as a Jupyter notebook in the ai-agent-demo repository.

Querying the Drive Stats dataset

Now I’ve established that my agent will be writing a SQL query, the next question is, “What will it be querying?” I’ve written about querying the Drive Stats dataset before; in that blog post I explained how I wrote a Python script to convert the Drive Stats data from the CSV format in which we publish it to Apache Parquet, a column-oriented file format particularly well-suited for storing tabular data for use in analytical queries, and upload it to a Backblaze B2 Bucket using the Apache Hive table format. There’s a broad ecosystem of tools and platforms that can manipulate Parquet data in object storage (for example, Apache Spark and Snowflake) and I chose Trino, the open source distributed SQL engine that forms the basis for Amazon Athena, to execute queries against the data.

I could have used the same technologies for this exercise, but I decided to add Apache Iceberg to the mix. While Parquet is a file format that specifies how tabular data is stored in files, Iceberg is a table format that governs how those files can be combined and interpreted as a database table. Iceberg provides a number of advantages over Hive as a table format, including better performance and much more flexible data partitioning.

What is partitioning?

Partitioning splits a dataset on one or more column values, easing data management and improving performance when a query includes a partition column.

Partitioning by year and month makes sense for the Drive Stats dataset—the resulting Parquet files are in the hundreds of megabytes, the sweet spot for Parquet data. To apply this partitioning to the Drive Stats data using the Hive table format, I had to create otherwise redundant month and year columns from the existing date column, complicating the schema.

Iceberg, by contrast, supports hidden partitioning, allowing you to apply a transformation to a column value to produce a partition value without adding any new columns. With the Drive Stats data, that meant I could simply define the partitioning as month(date) (the resulting value being the number of months since 1/1/1970, rather than an integer between 1 and 12), with no need to create any additional columns.

LangChain’s SQLDatabase class provides access to databases via the SQLAlchemy open-source Python library. The demo code obtains a SQLDatabase instance by providing a URI containing the trino scheme, a username and the location of the database node:

db = SQLDatabase.from_uri('trino://admin@localhost:8080/iceberg/drivestats')

Note: In this and other code excerpts in this blog post, I’ve omitted extraneous “boilerplate” code. As mentioned above, the full source code is available in the ai-agent-demo repository.

As you can infer from the localhost domain name, I’m running Trino on my laptop. I’m actually running it in Docker, using the Iceberg/Hive Docker Compose script from the trino-getting-started-b2 repository. I’ll dive into that example in a future blog post.

A simple query confirms that we have a successful database connection:

db.run("SELECT COUNT(*) FROM drivestats")

'[(537220724,)]'

As the result conveys, there are over 537 million rows in the Drive Stats dataset.

Each row contains the metrics collected from a single drive in the Backblaze fleet on a specific day. The schema has evolved over time, but, currently, the following columns are included:

date: The date of collection.
serial_number: The unique serial number of the drive.
model: The manufacturer’s model number of the drive.
capacity_bytes: The drive’s capacity in bytes.
failure: 1 if this was the last day that the drive was operational before failing, 0 if all is well.
pod_slot_num: The physical location of a drive within a storage server, as an integer from 0 to 59. The specific slot differs based on the storage server type and capacity: Backblaze (45 or 60 drives), Dell (26 drives), or Supermicro (60 drives).
pod_id: There are 20 storage servers in each Backblaze Vault. The pod_id is a numeric field with values from 0 to 19 assigned to each of the 20 storage servers.
vault_id: All data drives are members of a Backblaze Vault. Each Vault consists of either 900 or 1,200 hard drives divided evenly across 20 storage servers. The Vault is a numeric value starting at 1,000.
cluster_id: The name of a given collection of storage servers logically grouped together to optimize system performance, formatted as a numeric field with up to two digits. Note: At this time the cluster_id is not always correct; we are working on fixing that.
datacenter: The Backblaze data center where the drive is installed, currently one of ams5 (Amsterdam, Netherlands), iad1 (Reston, Virginia), phx1 (Phoenix, Arizona), sac0 (Sacramento, California), sac2 (Stockton, California) or, now live, yyz1, our new Toronto, Ontario, data center.
is_legacy_format: Currently 0, but may change in future as more fields are added.
A collection of SMART attributes. The number of attributes collected has risen over time; currently we store 93 SMART attributes in each record, each one in both raw and normalized form, with field names of the form smart_n_normalized and smart_n_raw, where n is between 1 and 255.

Using OpenAI to generate a SQL query

For this project, I decided to use the OpenAI API, rather than running a large language model (LLM) directly on my laptop. LangChain has a chat model integration for OpenAI, as well as many other providers, so you could use, for example, a local Llama model (via ChatOllama) or one of the Claude models (via ChatAnthropic) if you prefer.

To use the OpenAI API, you must sign up for an OpenAI account and create an OpenAI API key. This code loads the API key from a .env file and creates a chat model instance using OpenAI’s GPT-4o mini model:

# OPENAI_API_KEY must be defined in the .env file
load_dotenv()
llm = ChatOpenAI(model="gpt-4o-mini")

Now we need a system prompt template. We’ll combine this with the database schema and a natural language question to form the prompt that we send to OpenAI. As in the LangChain tutorial, I’m using a prompt from the LangChain Prompt Hub:

query_prompt_template = hub.pull("langchain-ai/sql-query-system-prompt")
query_prompt_template.messages[0].pretty_print()

This is the prompt template text, with the placeholders shown in {braces}:

================================ System Message ================================

Given an input question, create a syntactically correct {dialect} query to run to help find the answer. Unless the user specifies in his question a specific number of examples they wish to obtain, always limit your query to at most {top_k} results. You can order the results by a relevant column to return the most interesting examples in the database.

Never query for all the columns from a specific table, only ask for a few relevant columns given the question.

Pay attention to use only the column names that you can see in the schema description. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.

Only use the following tables:
{table_info}

Question: {input}

Notice how the template requires you to specify the correct SQL dialect, constrains the number of results returned, and encourages the model to not hallucinate column names that do not exist in the schema.

A helper function populates the prompt template, sends it to the model, and returns the generated SQL query:

def write_query(state: State):
    prompt = query_prompt_template.invoke(
        {
            "dialect": db.dialect,
            "top_k": 10,
            "table_info": db.get_table_info(),
            "input": state["question"],
        }
    )
    structured_llm = llm.with_structured_output(QueryOutput)
    result = structured_llm.invoke(prompt)
    return {"query": result["query"].rstrip(';')}

We can test the helper function by calling it directly with a Python dictionary containing a simple question:

question = {"question": "How many drives are there?"}
query = write_query(question)

The resulting query dictionary does indeed contain a valid SQL query, but it won’t give us the answer we are looking for.

{'query': 'SELECT COUNT(*) AS drive_count FROM drivestats'}

That query will tell us how many rows there are in the dataset, rather than how many drives. We supplied the database schema to the model, but we haven’t given it any information on the semantics of the columns in the drivestats table. We can provide a bit more detail to obtain the correct query:

question = {"question": "Each drive has its own serial number. How many drives are there?"}
query = write_query(question)

This time, the generated SQL query is correct:

{'query': 'SELECT COUNT(DISTINCT serial_number) AS total_drives FROM drivestats'}

As you can see, it’s important to check the output of AI models—they can and do generate unexpected results.

A second helper function executes the query against the database:

def execute_query(state: State):
    execute_query_tool = QuerySQLDatabaseTool(db=db)
    return {"result": execute_query_tool.invoke(state["query"])}

We can test it using the (correct) generated query:

result = execute_query(query)

{'result': '[(430464,)]'}

We need one more helper function, to pass the result set to the model and have it generate a natural language response. This time, we define our own prompt:

def generate_answer(state: State):
    prompt = (
        "Given the following user question, corresponding SQL query, "
        "and SQL result, answer the user question.\n\n"
        f'Question: {state["question"]}\n'
        f'SQL Query: {state["query"]}\n'
        f'SQL Result: {state["result"]}'
    )
    response = llm.invoke(prompt)
    return {"answer": response.content}

Again, we can test it in isolation. Notice that we have to provide the question and query, as well as the result so that the model has the context it needs:

answer = generate_answer(question | query | result)
answer['answer']

'There are 430,464 drives.'

Success! At the present time, there are indeed 430,464 drives in the Drive Stats dataset.

LangChain’s LangGraph orchestration framework allows us to compile our three helper functions into a single graph object:

graph_builder = StateGraph(State).add_sequence(
    [write_query, execute_query, generate_answer]
)
graph_builder.add_edge(START, "write_query")
graph = graph_builder.compile()

We can visualize the flow in the notebook:

display(Image(graph.get_graph().draw_mermaid_png()))

A diagram showing a query workflow. The workflow is defined as start, write_query, execute_query, generate_answer.

We’ve combined the write_query and execute_query steps into a graph object that can run agent-generated queries. I’ll quote the security note from the LangChain tutorial on the inherent risks in doing so:

Building Q&A systems of SQL databases requires executing model-generated SQL queries. There are inherent risks in doing this. Make sure that your database connection permissions are always scoped as narrowly as possible for your chain/agent’s needs. This will mitigate though not eliminate the risks of building a model-driven system. For more on general security best practices, see here.

In this example, we are querying a public dataset, and I followed best practice by configuring Trino’s Iceberg connector with a read-only application key scoped to the bucket containing the Drive Stats Iceberg tables.

Now let’s stream a new question through the flow. This mode of operation displays the output of each step as it is executed, essential for understanding the flow’s behavior, particularly when it is behaving unexpectedly. The model returns structured text in Markdown format. With a couple of lines of code to extract the message from the step variable, we can use the display_markdown function to render each step’s output:

for step in graph.stream(
    {"question": "Each drive has its own serial number. How many drives did each data center have on 9/1/2024"}, stream_mode="updates"
):
    # unwrap the step value to get the markdown message
    state = one(step.values())
    message = one(state.values())
    display_markdown(message, raw=True)

This is the model’s output, and it gives us three different messages. I’ve separated them with a horizontal line for clarity:

SELECT datacenter, COUNT(DISTINCT serial_number) AS drive_count FROM drivestats WHERE date = ‘2024-09-01’ GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10

Error: (trino.exceptions.TrinoUserError) TrinoUserError(type=USER_ERROR, name=TYPE_MISMATCH, message=”line 3:12: Cannot apply operator: date = varchar(10)”, query_id=20250113_221649_00214_bsut5) [SQL: SELECT datacenter, COUNT(DISTINCT serial_number) AS drive_count FROM drivestats WHERE date = ‘2024-09-01’ GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10] (Background on this error at: https://sqlalche.me/e/20/f405)

The SQL query encountered an error because it attempted to compare a date column with a string in the format ‘YYYY-MM-DD’. Specifically, the error message indicates that the date column is of a different type (likely not a string), which is causing the type mismatch.

To answer the user question about how many drives each data center had on 9/1/2024, the SQL query needs to be corrected. Here’s the revised query:

SELECT datacenter, COUNT(DISTINCT serial_number) AS drive_count 
FROM drivestats 
WHERE date = DATE '2024-09-01' 
GROUP BY datacenter 
ORDER BY drive_count DESC 
LIMIT 10

This corrected query uses the DATE keyword to ensure that the date string is properly interpreted as a date type, which should resolve the type mismatch error. If executed successfully, this query will provide the count of distinct drives in each data center for the specified date. However, without executing the corrected query against the database, we cannot provide the exact counts.

As you can see in the output from the first step, the model generated an invalid query, comparing a date to a string, despite the database schema being included in the prompt. The output of the second step contains the resulting error message from the database, while the third step contains the model’s diagnosis of the error.

This exchange highlights a limitation of a flow that is simply a linear series of steps, such as write_query, execute_query, and generate_answer. We cannot rely on the model to generate a valid SQL query, although it is able to point the way towards resolving its error.

Creating a ReAct AI agent with LangGraph

The LangGraph framework gives you the capability to create AI agents based on arbitrarily complex logic. In this article, I’ve used its prebuilt ReAct (Reason+Act) agent, since it neatly demonstrates the agent concept, rewriting the SQL query repeatedly in response to database errors.

There are three steps to creating the agent. The first is to create an instance of LangChain’s SQLDatabaseToolkit, passing it the database and model, and obtain its list of tools:

toolkit = SQLDatabaseToolkit(db=db, llm=llm)
tools = toolkit.get_tools()

The tools list contains tools that execute queries, retrieve the names, schemas and content of database tables, and check SQL query syntax.

The next step is to retrieve a suitable prompt template from the Prompt Hub and populate the template placeholders:

prompt_template = hub.pull("langchain-ai/sql-agent-system-prompt")
system_message = prompt_template.format(dialect=db.dialect, top_k=10)

Here is the prompt template’s text:

================================ System Message ================================

You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the below tools. Only use the information returned by the below tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

To start you should ALWAYS look at the tables in the database to see what you can query.
Do NOT skip this step.
Then you should query the schema of the most relevant tables.

Now we can create an instance of the prebuilt agent:

agent_executor = create_react_agent(llm, tools, 
state_modifier=system_message)

Note how the agent must select the next step, and how the flow can cycle between the agent and tools steps:

display(Image(agent_executor.get_graph().draw_mermaid_png()))

A diagram showing the workflow between tools and agent. The workflow is as follows: start, agent, then a split option to access tools (a recursive step), or to end. The diagram shows that after agent, you can optionally select tools or end, indicating that you can end without choosing tools.

Again, we can stream the agent’s execution to show us each step of its operation.

for step in agent_executor.stream(
    {"messages": [{"role": "user", "content": "Each drive has its own serial number. How many drives did each data center have on 9/1/2024?"}]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()

The output from this flow is over 300 lines long; I posted it in its entirety as a Gist, but I’ll summarize the steps here:

Question: Each drive has its own serial number. How many drives did each data center have on 9/1/2024?
The model calls the “list tables” tool.
The list tables tool responds with a single table name, drivestats.
The model calls the “get schema” tool, passing it the table name.
The get schema tool responds with the schema and three sample rows from the drivestats table.
The model submits a query to the “query checker” tool:
SELECT datacenter, COUNT(serial_number) AS drive_count FROM drivestats WHERE date = '2024-09-01' GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10;
The query checker responds with the checked query, which is the same as its input. Note that the query checker only checks the SQL query’s syntax. The query contains the same data type mismatch as the query we generated earlier, as well as another error, as we’re about to discover.
The model submits the query to the “query executor” tool.
The query executor responds with a syntax error—Trino does not allow a trailing semi-colon on the query.
The model submits a modified query to the query checker tool:
SELECT datacenter, COUNT(serial_number) AS drive_count FROM drivestats WHERE date = '2024-09-01' GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10
The query checker responds with the checked query, which is the same as its input.
The model submits the query to the “query executor” tool.
The query executor responds with a type mismatch error since the query tries to compare a string value with a date column.
The model submits a query with the necessary DATE type identifier to the query checker tool:
SELECT datacenter, COUNT(serial_number) AS drive_count FROM drivestats WHERE date = DATE '2024-09-01' GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10
The query checker responds with the checked query, which is the same as its input.
The model submits the query to the “query executor” tool.
The query executor responds with a result set:
[ ('phx1', 89477), ('sac0', 78444), ('sac2', 60775), ('', 24080), ('iad1', 22800), ('ams5', 16139) ]
The model returns a message containing the answer:

On September 1, 2024, the following datacenters had the specified number of drives:

1. phx1: 89,477 drives
2. sac0: 78,444 drives
3. sac2: 60,775 drives
4. (unknown datacenter): 24,080 drives
5. iad1: 22,800 drives
6. ams5: 16,139 drives

These results show the datacenters with their respective drive counts.

Now let’s see if the model can calculate the annualized failure rate of a drive model. We’ll use the Seagate ST4000DM000, just because that is the drive model with the most days of operation in the dataset.

for step in agent_executor.stream(
        {"messages": [{"role": "user", "content": "Each drive has its own serial number. What is the annualized failure rate of the ST4000DM000 drive model?"}]},
        stream_mode="values",
):
    step["messages"][-1].pretty_print()

The agent’s response mixes Markdown and LaTex notation. I used QuickLaTeX to render the LaTex to images:

The annualized failure rate (AFR) for the ST4000DM000 drive model can be calculated using the following information:

– Total failures: 5,791

– Total drives: 37,040

– Time period: from May 10, 2013, to September 30, 2024, which is approximately 11.35 years.

The formula for calculating the annualized failure rate is:

The calculation for the annualized failure rate. It's total failures divided by total drives, multiplied by one over the total years, multiplied by 100.

Plugging in the numbers:

Real number for the annualize failure rate calculations. In this instance, the text reads 5791 divided by 37040, multiplied by one over 11.35, multiplied by 100, which equals approximately 13.77 percent.

Therefore, the annualized failure rate (AFR) of the ST4000DM000 drive model is approximately 13.77%.

It’s impressive that the agent shows its working so comprehensively, but, unfortunately, it arrives at the wrong answer. Those drives were not all running for the entire span of the Drive Stats dataset. The correct calculation involves determining the number of days with data for those drives and dividing it by 365 to get the correct number of years’ operation.

It’s clear that the model is not able to answer questions on drive reliability given the data available to it so far. The solution lies in prompt engineering—providing more context on the semantics of the data in the system prompt.

We can extend the default AI agent system prompt template to include specific instructions on working with the Drive Stats dataset:

prompt_template.messages[0].prompt.template += """
Each row of the drivestats table records one day of a drive’s operation, and contains the serial number of a drive, its model name, capacity in bytes, whether it failed on that day, SMART attributes and identifiers for the slot, pod, vault, cluster and data center in which it is located.

Use this calculation for the annualized failure rate (AFR) for a drive model over a given time period:

1. **drive_days** is the number of rows for that model during the time period.
2. **failures** is the number of rows for that model during the time period where **failure** is equal to 1.
3. **annual failure rate** is 100 * (**failures** / (**drive_days** / 365)).

Use double precision arithmetic in the calculation to avoid truncation errors. To convert an integer **i** to a double, use CAST(**i** AS DOUBLE)

Note that the date column is a DATE type, not a string. Use the DATE type identifier when comparing the date column to a string.

Do not add a semi-colon suffix to SQL queries."""

Now, when we ask the same question on the annual failure rate of the ST4000DM000 drive model, the AI agent generates a correct SQL query and a more concise, and correct, final response (you can inspect the full output here).

SELECT 100 * (CAST(COUNT(CASE WHEN failure = 1 THEN 1 END) AS DOUBLE) / (COUNT(*) / 365)) AS annual_failure_rate
FROM drivestats
WHERE model = 'ST4000DM000'

The annual failure rate (AFR) for the ST4000DM000 drive model is approximately 2.63%.

Let’s ask the AI agent for a statistic that we can corroborate from the Backblaze Drive Stats for Q3 2024 blog post.

response = agent_executor.invoke(
    {"messages": [{"role": "user", "content": "What was the annual failure rate of the ST8000NM000A drive model in Q3 2024?"}]}
)
response['messages'][-3].pretty_print()
display_markdown(response['messages'][-1].content, raw=True)

The query makes sense, and the response agrees with the table in the blog post:

SELECT 100 * (CAST(SUM(failure) AS DOUBLE) / (COUNT(*) / 365)) AS annual_failure_rate
FROM drivestats
WHERE model = 'ST8000NM000A' AND date >= DATE '2024-07-01' AND date < DATE '2024-10-01'

The annual failure rate (AFR) of the ST8000NM000A drive model in Q3 2024 is approximately 1.61%.

Interestingly, this time the SQL query used SUM(failure) to count the number of failures, rather than the equivalent, but rather long-winded COUNT(CASE WHEN failure = 1 THEN 1 END) it used in the previous query. Also, looking at the full response, we can see that, as directed by the custom prompt, the agent generated the correct syntax for comparing dates, so it didn’t need to correct and retry any queries.

Finally, let’s ask a more convoluted question, including the constraints given in the blog post:

response = agent_executor.invoke(
    {"messages": [{"role": "user", "content": "Considering only drive models which had at least 100 drives in service at the end of the quarter and which accumulated 10,000 or more drive days during the quarter, which drive had the most failures in Q3 2024, and what was its failure rate?"}]}
)
response['messages'][-3].pretty_print()
display_markdown(response['messages'][-1].content, raw=True)

Again, the AI agent is able to generate a valid SQL query, this time including a subquery, and its response matches the data from the blog post exactly:

WITH drive_stats AS (
    SELECT model,
           COUNT(DISTINCT serial_number) AS drive_count,
           COUNT(*) AS drive_days,
           COUNT(CASE WHEN failure = 1 THEN 1 END) AS failures
    FROM drivestats
    WHERE date >= DATE '2024-07-01' AND date < DATE '2024-10-01'
    GROUP BY model
    HAVING COUNT(DISTINCT serial_number) >= 100 AND COUNT(*) >= 10000
)
SELECT model,
       failures,
       100 * (CAST(failures AS DOUBLE) / (CAST(drive_days AS DOUBLE) / 365)) AS failure_rate
FROM drive_stats
ORDER BY failures DESC
LIMIT 10

The drive model with the most failures in Q3 2024 is the TOSHIBA MG08ACA16TA, which had 181 failures. Its failure rate during this period was approximately 1.84%.

Closing thoughts

My experience building an AI agent was astonishment at its ability to correctly generate quite complex SQL queries based on natural language instructions, tempered with frustration at its limitations, particularly the way that it would confidently generate an incorrect response, rather than saying “I’m sorry, but I don’t know how to do that.” Your AI agent development process should include generous testing time, as well as ongoing monitoring to ensure that it is coming up with the right answers.

The post Building an AI Agent with Backblaze B2, LangChain, and Drive Stats appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How to Zip Files with the Python S3fs Library + Backblaze B2 Cloud Storage

2024-09-17 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-zip-files-with-the-python-s3fs-library-backblaze-b2-cloud-storage/

Whenever you want to send more than two or three files to someone, chances are you’ll zip the files to do so. The .zip file format, originally created by computer programmer Phil Katz in 1986, has become ubiquitous; indeed, the dictionary definition of the word zip includes this usage of zip as a verb.

If your web application allows end users to download files, it’s natural that you’d want to provide the ability to select multiple files and download them as a single .zip file. Aside from the fact that downloading a single file is straightforward and familiar, the files are compressed, saving download time and bandwidth.

There are a few ways you can provide this functionality in your application, and some are more efficient than others. Today, inspired by a question from a Backblaze customer, I’m talking through a web application I created that allows you to implement .zip downloads in your application with data stored in Backblaze B2 Cloud Storage.

First: Avoid this mistake

When writing a web application that stores files in a cloud object store such as Backblaze B2 Cloud Storage, a simple approach to implementing .zip downloads would be to:

Download the selected files from cloud object storage to temporary local storage.
Compress them into a .zip file.
Delete the local source files.
Upload the .zip file to cloud object storage.
Delete the local .zip file.
Supply the user with a link to download the .zip file.

A diagram showing how to download zip files from Backblaze B2 to local storage

There’s a problem here, though—there has to be enough temporary local storage available to hold the selected files and the resulting .zip file. Not only that, but you have to account for the fact that multiple users may be downloading files concurrently. Finally, no matter how much local storage you provision, you also have to handle the possibility that a spike in usage might consume all the available local storage, at best making downloads temporarily unavailable, at worst destabilizing your whole web application.

Troubleshooting a better way

If you’re familiar with piping data through applications on the command line, the solution might already have occurred to you: Rather than downloading the selected files, compressing them, then uploading the .zip file, stream the selected files directly from the cloud object store, piping them through the compression algorithm, and stream the compressed data back to a new file in the cloud object store.

A diagram showing how to create ZIP files from Backblaze B2 by streaming them to a compression engine.

The web application I created allows you to do just that. I learned a lot in the process, and I was surprised by just how compact the solution was, just a couple dozen lines of code, once I’d picked the appropriate tools for the job.

I was familiar with Python’s zipfile module, so it was a logical place to start. The zipfile module provides tools for compressing and decompressing data, and follows the Python convention in working with file-like objects. A file-like object provides standard methods, such as read() and/or write(), even though it doesn’t necessarily represent an actual file stored on a local drive. Python’s file-like objects make it straightforward to assemble pipelines that read from a source, operate on the data, and write to a destination—exactly the problem at hand.

My next thought was to reach for the AWS SDK for Python, also known as Boto3. Here’s what I had in mind:

b2_client = boto3.client('s3')

# BytesIO is a binary stream using an in-memory bytes buffer
with BytesIO() as buffer:
    # Open a ZipFile object for writing to the buffer
    with ZipFile(buffer, 'w') as zipfile:
        for filename in selected_filenames:
            # ZipInfo represents a file within the ZIP
            zipinfo = ZipInfo(filename)
            # You need to set the compress_type on each ZipInfo
            # object - it is not inherited from the ZipFile!
            zipinfo.compress_type = ZIP_DEFLATED
            # Open the ZipInfo object for output
            with (zipfile.open(zipinfo, 'w') as dst):
                # Get the selected file from B2
                response = b2_client.get_object(
                    Bucket=input_bucket_name,
                    Key=filename,
                )
                # Copy the file data to the archive member
                copyfileobj(response['Body'], dst, COPY_BUFFER_SIZE)

    # Rewind to the start of the buffer
    buffer.seek(0)
    # Upload the buffer to B2
    b2_client.put_object(
        Body=buffer,
        Bucket=output_bucket_name,
        Key=zip_filename,
    )

While the above code appears to work just fine, there are two issues. First, the maximum size of a file uploaded with a single put_object call is 5GB, and, second, the BytesIO object, buffer, holds the entire .zip file in memory. It may well be that your users will never select enough files to produce a .zip file greater than 5GB, but there is still a similar problem to the approach we started with: There needs to be enough memory available to hold all of the .zip files being concurrently created by users. We’re no further forward; in fact we’ve gone backwards–we traded a limited, but relatively cheap resource, disk space, for a more limited, more expensive resource: RAM!

It’s straightforward to upload files greater than 5GB using multipart uploads, splitting the file into multiple parts between 5MB and 5GB. I could rewrite my code to split the compressed data into chunks of 5MB, but that would add significant complexity to what seemed like it should be a simple task. I decided to try a different approach.

S3Fs is a “Pythonic” file interface to S3-compatible cloud object stores, such as Backblaze B2, that builds on Filesystem Spec (fsspec), a project to provide a unified Pythonic interface to all sorts of file systems, and aiobotocore, an asynchronous client for AWS. As well as handling details such as multipart uploads, allowing you to to write much more concise code, S3Fs allows you to write data to a file-like object, like this:

# S3FileSystem reads its configuration from the usual config files,
# environment variables. Alternatively, you can pass configuration
# to the constructor.
b2fs = S3FileSystem()

# Create and write to a file in cloud object storage exactly as you
# would a local file.
with b2fs.open(output_path, 'wb') as f:
    for element in some_collection:
        data = some_serialization_function(element)
        f.write(data)

Using S3Fs, my solution for arbitrarily large .zip files was about the same number of lines of code as my previous attempt. In fact, I realized that the app should get each selected file’s last modified time to set the timestamps in the .zip file correctly, so this version actually does more:

zip_file_path = f'{output_bucket_name}/{zip_filename}'

# Open the ZIP file for output, open a ZipFile object 
# for writing to the ZIP file
with b2fs.open(zip_file_path, 'wb') as f, ZipFile(f, 'w') as zipfile:
    for filename in selected_filenames:
        input_path = f'{input_bucket_name}/{filename}'

        # Get file info, so we have a timestamp and 
        # file size for the ZIP entry
        file_info = b2fs.info(input_path)

        last_modified = file_info['LastModified']
        date_time = (last_modified.year, last_modified.month, last_modified.day,
                     last_modified.hour, last_modified.minute, last_modified.second)

        # ZipInfo represents a file within the ZIP
        zipinfo = ZipInfo(filename=filename, date_time=date_time)
        # You need to set the compress_type on each ZipInfo 
        # object - it is not inherited from the ZipFile!
        zipinfo.compress_type = ZIP_DEFLATED
        # Since we know the file size, set it in the ZipInfo 
        # object so that large files work correctly
        zipinfo.file_size = input_file_info['size']

        # Open the selected file for input, 
        # open the ZipInfo object for output
        with (b2fs.open(input_path, 'rb') as src,
              zipfile.open(zipinfo, 'w') as dst):
        	# Copy the data across
            copyfileobj(src, dst, COPY_BUFFER_SIZE)

You might be wondering, how much memory does this actually use? The copyfileobj() call, right at the very end, reads data from the selected files and writes it to the .zip file. copyfileobj() takes an optional length argument that specifies the buffer size for the copy, so you can control the tradeoff between speed and memory use. I set the default in the b2-zip-files app to 1MiB.

This solves the problems we initially ran into, allowing you to offer .zip downloads without maxing out disk storage or RAM.

My last piece of advice… Other than an easy .zip file downloader, I took one big lesson away from this experiment: Look beyond the AWS SDKs next time you write an application that accesses cloud object storage. You may just find that you can save yourself a lot of time!

The post How to Zip Files with the Python S3fs Library + Backblaze B2 Cloud Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Open Sources Boardwalk Workflow Engine for Ansible

2024-09-12 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/backblaze-open-sources-boardwalk-workflow-engine-for-ansible/

An illustration of six server racks connected to a gear icon.

If you maintain cloud infrastructure as part of your job, as our Cloud Operations team here at Backblaze does, you’ll recognize the wisdom in the mantra, “Automate early, automate often”. When you’re working with tens, hundreds, or even thousands of production servers, manually applying changes gets old very quickly!

Today, Backblaze is releasing a new open source project: Boardwalk, hosted on GitHub at https://github.com/Backblaze/boardwalk, to help automate rolling maintenance jobs like kernel and operating system (OS) upgrades. Boardwalk is a linear Ansible workflow engine, written in Python, that our infrastructure systems engineers built to help automate complex operations tasks for large numbers of production hosts.

Why did Backblaze create Boardwalk?

Back in 2021, the Backblaze Storage Cloud platform comprised about 1,800 servers, the majority of which were Storage Pods. Upgrading those machines to a new OS version was an arduous task. The job took over a year and required well over 1,000 hours of hands-on toil by our data center staff. It was clear that we would need to automate the next OS upgrade, especially since it would involve even more machines.

While there are a range of tools available for this kind of work, we couldn’t just feed a list of server addresses into one of them and set it loose. Each Storage Pod is a server fitted with between 26 and 60 hard drives containing customer data, plus a boot drive holding the server’s OS. Twenty pods make up a Backblaze Vault.

Normal storage operations are as follows: Incoming customer data is assigned to a Vault for storage, then split into 20 shards, each of which is stored in a separate Pod. (I’m skipping some of the details here; for the full story, see How Backblaze Scales Our Storage Cloud). If you’ve followed our Drive Stats blog posts over the years, you’ll know that, at our scale, drives fail every day, so any one of those Pods can be taken temporarily offline for a drive replacement at any time.

This architecture means that we have to be quite intentional when we take Pods offline for upgrade.

Remotely upgrading the OS on a Storage Pod takes about 40 minutes. When the pod goes offline for upgrade, we put its Vault into read-only mode so that the upgraded server doesn’t have to catch up with writes that occurred when it was offline; the remaining 19 Pods in the Vault can still serve read requests. While one Storage Pod is being upgraded, we absolutely do not want a second Storage Pod in the same Vault upgrading.

Doing so would reduce read performance for the Vault, since fewer Storage Pods would be available to handle incoming requests, as well as increasing the risk that random drive failures in the other Pods might take the entire Vault offline. Once the upgrade is complete and the Pod comes back online, the Vault is returned to read-write mode.

The challenge of automation at scale

Backblaze has a long history of using Ansible to configure and deploy changes to its fleets of servers. However, while Ansible is a very capable agentless, modular, remote execution and configuration management engine, it isn’t well suited to complex, multi-stage operations tasks at Backblaze’s scale. Ansible playbooks have always helped us automate most of the process of managing so many servers, but eventually we hit challenges trying to reduce human toil even further.

Ansible is connection-oriented and most operations are performed on remote hosts, rather than on the administrative machine. From the administrative machine, Ansible connects to a remote host, copies code over, and executes it. There’s no practical way to run pre-checks about a host before connecting to it. This makes long-running background jobs difficult to work with using Ansible alone.

For example, if a playbook is running for days or weeks and fails, Ansible doesn’t retain any knowledge of where it left off, and can’t make any offline decisions about which hosts it needs to finish up with. When the playbook is re-run, Ansible will attempt to connect to all of the hosts it had previously connected to, potentially resulting in a long recovery time for a failed job. Considering Backblaze runs thousands of Storage Pods, this takes a long time!

The reality was that we needed something more, but also wanted to leverage all of our history with Ansible, including the playbooks that we had built, and the skills we already had. So we decided to build a workflow engine around Ansible, and we called it Boardwalk.

What does Boardwalk do?

We created Boardwalk to manage these kinds of long-running Ansible workflows, codifying our vast experience operating storage systems at scale. Boardwalk makes it easy to define workflows composed of a series of jobs to perform tasks on hosts using Ansible. It connects to hosts one-at-a-time, running jobs in a defined order, and maintaining local state as it goes; this makes stopping and resuming long-running Ansible workflows easy and efficient. It’s designed and built to be easy for DevOps and systems engineers to introduce, and frontline operators to use, while leveraging existing playbooks.

One of Boardwalk’s features is its ability to connect to a host and determine whether it should run a job on that host now, or leave it until later. When we use Boardwalk to perform rolling OS upgrades, it connects to a Pod and requests that the Pod temporarily remove itself from its Vault. The Pod checks that the other 19 Pods in the Vault are online and healthy; if so, then that Pod proceeds. Then Boardwalk can run the Ansible playbook to upgrade it. If, on the other hand, one or more of the other Pods are offline for some reason, that Pod sends a failure response to Boardwalk, causing the upgrade to be postponed until the Vault is in its correct state.

When Boardwalk is working on a host, it acquires a virtual “lock,” and saves its progress as it walks through the steps. The lock prevents multiple instances of Boardwalk from conflicting with each other, and the progress state allows Boardwalk to pick up where it left off in case of failure. If something does go wrong, an alert brings a human into the loop. Once a Pod has been successfully upgraded, Boardwalk updates its local state accordingly.

In practice, for OS upgrades, we run a single Boardwalk workflow per data center, which keeps things simple. It has a list of all of the servers it needs to upgrade, and quietly works down the list, with little or no manual intervention.

In this way, in our most recent OS upgrade, we were able to upgrade 6,000 servers over the course of nine months, with zero impact on availability and minimal intervention from data center staff. Customers were able to read files regardless of whether a Pod was being upgraded in one of the Vaults holding their data; file uploads were automatically sent to Pods in read-write mode.

What can I do with Boardwalk?

Today, we are releasing Boardwalk under the MIT License, a permissive open source license with very few restrictions on reuse. You are free to download Boardwalk, run it yourself, modify it, build it into a product, even sell it, as long as you observe the terms of the license.

We anticipate that most Boardwalk users will be able to use it as-is to automate long-running jobs across large numbers of hosts, but we welcome contributions from the community, whether they be documentation, examples, fixes, or enhancements.

We do not require contributors to sign a Contributor License Agreement (CLA) or Developer’s Certificate of Origin (DCO); instead, we simply accept contributions subject to the GitHub Terms of Service, specifically section D.6, which states, helpfully, in both legalese and plain English:

Whenever you add Content to a repository containing notice of a license, you license that Content under the same terms, and you agree that you have the right to license that Content under those terms. If you have a separate agreement to license that Content under different terms, such as a contributor license agreement, that agreement will supersede.

Isn’t this just how it works already? Yep. This is widely accepted as the norm in the open-source community; it’s commonly referred to by the shorthand “inbound=outbound”. We’re just making it explicit.

The CONTRIBUTING file explains how to build and test Boardwalk, and how to submit your contribution via a pull request. After you submit your pull request, a project maintainer will review it and respond within two weeks, likely much less unless we are flooded with contributions!

How Do I Get Started?

The README file at https://github.com/Backblaze/boardwalk is the best place to start—it contains much more detail on Boardwalk’s architecture, design, installation, and usage. Feel free to ask questions at the Boardwalk project discussions page, or file an issue if you encounter a bug or see an opportunity to enhance Boardwalk. We hope you find Boardwalk useful, and look forward to hearing how you’re using it!

We’d like to express our gratitude to Mat Hornbeek for not only writing the initial version of Boardwalk, but also kindly contributing to this article some time after he moved on from Backblaze to a new opportunity. Thanks, Mat!

The post Backblaze Open Sources Boardwalk Workflow Engine for Ansible appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Do More with Backblaze B2: A Tour of the Backblaze GitHub Repositories

2024-09-05 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/do-more-with-backblaze-b2-a-tour-of-the-backblaze-github-repositories/

If you work with Backblaze B2, you’re probably already aware of resources such as the Backblaze B2 Python SDK and the Backblaze B2 Command Line Tool, but did you know that there is also a Terraform Provider for Backblaze B2, an SDK for Java, and a whole slew of open source samples showing how to integrate with Backblaze B2 from web browsers, serverless platforms, and more? Today, I’ll take you on a quick tour of our open source SDKs, tools, and sample code, pointing out some interesting sights along the way.

Why open source?

We’ve long been believers in open source code here at Backblaze, open sourcing our implementation of Reed-Solomon erasure coding back in 2015, and, even before then, sharing our Storage Pod designs and, of course, Drive Stats, the statistics and insights based on our observations of the hard drives we operate in our data centers, including the raw metrics we collect from many thousands of hard drives, every day.

While the Storage Pod designs and Drive Stats live here on the Backblaze website, we make our open source code available via two GitHub organizations:

https://github.com/backblaze contains official, supported, Backblaze SDKs and other tools.
https://github.com/backblaze-b2-samples contains unsupported samples and demos showing how to integrate Backblaze B2 with a wide variety of services, frameworks, and applications.

Let’s take a closer look.

Official Backblaze SDKs and tools

You can use any of AWS’ range of SDKs, plus the AWS Command Line Interface (CLI), to access Backblaze B2 via its S3 Compatible API; just remember to configure the endpoint URL as well as the access key ID and secret access key.

Not every Backblaze B2 operation is accessible via the S3 Compatible API—for example, application key management—so we also support a range of open source SDKs for accessing Backblaze B2’s Native API from a variety of programming languages:

The Backblaze B2 Python SDK: This SDK provides access to the basic operations of the Native API, such as list_buckets() and download_file_by_id(), as well as a powerful Synchronizer class that implements high performance, multi-threaded file copying between Backblaze B2 and local file storage.
The Backblaze B2 Java SDK: Although it doesn’t include anything quite as sophisticated as the Python Synchronizer, the Java SDK does implement high-level functionality such as uploadLargeFile(), which encapsulates all of the mechanics of a multi-threaded file upload in a single method call. We also use it internally at Backblaze in our production environment.
blazer, an open source Backblaze B2 SDK for Go (aka golang): We adopted blazer from its original author, Toby Burress, when he was no longer able to maintain it. We’ve made a few improvements since taking it on, and we’re looking at doing more with it.

The Backblaze GitHub organization also contains a pair of tools built on the Python SDK:

The Backblaze B2 Command Line Tool (also known as the B2 CLI): The B2 CLI gives easy access to all of the capabilities of Backblaze B2, from uploading and downloading files to managing buckets and application keys. Command Like a Pro with New Backblaze B2 CLI Enhancements provides a summary of recent upgrades to the B2 CLI.
The Terraform Provider for Backblaze B2: This tool allows you to manage Backblaze B2 resources, such as buckets, files, and application keys, from Terraform configurations. Although the provider is written in Go, it embeds the B2 Python SDK for accessing the Backblaze B2 API.

The remaining repositories contain utilities and other code that we have published over the years, including our open source Reed-Solomon erasure coding implementation and a utility we wrote to support migrating a live Cassandra cluster from one data center to another.

Backblaze sample and demo code

Our https://github.com/backblaze-b2-samples organization contains, at the time of writing, 34 repositories, demonstrating how to use Backblaze B2 in a wide variety of situations. We’ve covered a few of them in past blog posts:

Build a conversational chatbot: Last month, we walked through the example code in retrieval-augmented generation (RAG) with Backblaze B2, showing how you can build a conversational chatbot that answers questions based on content downloaded from a private Backblaze B2 Bucket.
Create an AI-powered media asset management (MAM) app: Earlier this year, we showed how you can integrate Backblaze B2 with the Twelve Labs video understanding platform, publishing the code for a simple AI-powered MAM.
Set up a tiered media storage architecture: In 2022, we explained how iconik, LucidLink, and Backblaze B2 complement each other in a tiered media storage architecture. Several customers have deployed the open source Backblaze B2 Storage Plugin for iconik; one of those customers joined me on stage at NAB Show 2023 to tell their story.

As you explore the https://github.com/backblaze-b2-samples organization, you’ll also find repositories that have not yet been covered here on the Backblaze blog:

B2listen allows you to forward Backblaze B2 Event Notifications to a service listening on a local URL. B2listen uses Cloudflare’s free Quick Tunnels feature to proxy traffic from an internet-accessible URL to a local endpoint.
B2 Browser Upload shows you how to upload files directly to Backblaze B2 from JavaScript code running in the browser, with sample code for both the Backblaze B2 Native and S3-compatible APIs.
The Backblaze B2 Zip Files Example implements a simple Python web app, using the Flask web application framework and the flask-executor task queue, that can compress a set of files located in Backblaze B2 into an archive, also stored in Backblaze B2, without using any local storage.

We’ll write more about these, and other, as yet unreleased, open source projects, over the coming weeks and months, but, if you’d like us to prioritize any of the above three repositories, or any of our other projects, let us know in the comments!

The post Do More with Backblaze B2: A Tour of the Backblaze GitHub Repositories appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How to Build Your Own LLM with Backblaze B2 + Jupyter Notebook

2024-08-13 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-build-your-own-llm-with-backblaze-b2-jupyter-notebook/

A decorative image showing a computer, a cloud, and a building.

Last month, Backblaze Principal Storage Cloud Storyteller, Andy Klein, and I presented a webinar, Leveraging your Cloud Storage Data in AI/ML Apps and Services, in which we explored the various ways AI/ML applications use and store data. In addition to sharing insights from our customers who leverage Backblaze B2 Cloud Object Storage for their AI/ML needs, we also demonstrated a simple AI use case: a retrieval-augmented generation (RAG) chatbot answering questions based on custom context downloaded from a private Backblaze B2 Bucket. After the webinar, I posted the demo source code to a GitHub repository: https://github.com/backblaze-b2-samples/ai-rag-examples.

In this blog post, I’ll recap a little of the webinar, and explain how you can use the demo source code as a basis for your own experiments with RAG and large language models (LLMs).

But why, Pat?

You might be asking yourself, why build a chatbot when multiple online services, such as ChatGPT, are just a click away? It’s a good question. The main motivation is privacy. Suppose you want your chatbot to use confidential data in generating its responses. By running an LLM on your own machine, whether on-premises or on a virtual machine (VM) under your control in a public cloud provider’s infrastructure, you eliminate the risks of that data surfacing in response to a question from a user outside your organization.

In the webinar, I showed two Python apps running on a GPU-equipped VM at Vultr, one of Backblaze’s compute partners. The first app used the GPT4All Python SDK to create a very simple conversational chatbot running a local instance of a large language model (LLM), which it used in answering general questions. Here’s an example from the webinar:

Ask me a question: What were the causes of the First World War?
The causes of the First World War are complex and varied, but some key factors include nationalism, imperialism, militarism, alliances between nations, and the assassination of Archduke Franz Ferdinand. Nationalism led to tensions between ethnic groups and their respective governments, while imperialism saw countries competing for territories and resources around the world. Militarism also played a role as many European powers built up large armies and navies in preparation for potential conflicts. The complex web of alliances between nations meant that any conflict could quickly escalate into a latIer war, and the assassination of Archduke Franz Ferdinand by a Serbian nationalist was the spark that ignited the powder keg of tensions already present in Europe at the time.

Now, I’m not a history scholar, but that looks like a pretty good answer to me! (History scholars, you are welcome to correct me.)

The second app used the Langchain framework to implement a more elaborate chatbot, again running on my own machine at Vultr, that used PDF data downloaded from a private bucket in Backblaze B2 as context for answering questions. As much as I love our webinar attendees, I didn’t want to share genuinely confidential data with them, so I used our Backblaze B2 Cloud Storage documentation as context. The chatbot was configured to use that context, and only that context, in answering questions. From the webinar:

Ask me a question about Backblaze 82: What's the difference between the master application key and a standard application key?

The master application key provides complete access to your account with all capabilities, access to all buckets, and has no file prefix restrictions or expiration. On the other hand, a standard application key is limited to the level of access that a user needs and can be specific to a bucket.

Ask me a question about Backblaze B2: What were the causes of the First World War?

The exact cause of the First World War is not mentioned in these documents.

The chatbot provides a comprehensive, accurate answer to the question on Backblaze application keys, but doesn’t answer the question on the causes of the First World War, since it was configured to use only the supplied context in generating its response.

During the webinar’s question-and-answer session, an attendee posed an excellent question: “Can you ask [the chatbot] follow-up questions where it can use previous discussions to build a proper answer based on content?” I responded, “Yes, absolutely; I’ll extend the demo to do exactly that before I post it to GitHub.” What follows are instructions for building a simple RAG chatbot, and then extending it to include message history.

Building a simple RAG chatbot

After the webinar, I rewrote both demo apps as Jupyter notebooks, which allowed me to add commentary to the code. I’ll provide you with edited highlights here, but you can find all of the details in the RAG demo notebook.

The first section of the notebook focuses on downloading PDF data from the private Backblaze B2 Bucket into a vector database, a storage mechanism particularly well suited for use with RAG. This process involves retrieving each PDF, splitting it into uniformly sized segments, and loading the segments into the database. The database stores each segment as a vector with many dimensions—we’re talking hundreds, or even thousands. The vector database can then vectorize a new piece of text—say a question from a user—and very quickly retrieve a list of matching segments.

Since this process can take significant time—about four minutes on my MacBook Pro M1 for the 225 PDF files I used, totaling 58MB of data—the notebook also shows you how to archive the resulting vector data to Backblaze B2 for safekeeping and retrieve it when running the chatbot later.

The vector database provides a “retriever” interface that takes a string as input, performs a similarity search on the vectors in the database, and outputs a list of matching documents. Given the vector database, it’s easy to obtain its retriever:

retriever = vectorstore.as_retriever()

The prompt template I used in the webinar provides the basic instructions for the LLM: use this context to answer the user’s question, and don’t go making things up!

prompt_template = """Use the following pieces of context to answer the question at the end. 
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
    {context}
    
    Question: {question}
    Helpful Answer:"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

The RAG demo app creates a local instance of an LLM, using GPT4All with Nous Hermes 2 Mistral DPO, a fast chat-based model. Here’s an abbreviated version of the code:

model = GPT4All(
    model='Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf',
    max_tokens=4096,
    device='gpu'
)

LangChain, as its name suggests, allows you to combine these components into a chain that can accept the user’s question and generate a response.

chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
)

As mentioned above, the retriever takes the user’s question as input and returns a list of matching documents. The user’s question is also passed through the first step, and, in the second step, the prompt template combines the context with the user’s question to form the input to the LLM. If we were to peek inside the chain as it was processing the question about application keys, the prompt’s output would look something like this:

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

<Text of first matching document>

<Text of second matching document>

Question: What's the difference between the master application key and a standard application key?

Helpful Answer:

This is the basis of RAG: building an LLM prompt that contains the information required to generate an answer, then using the LLM to distill that prompt into an answer. The final step of the chain transforms the data structure emitted by the LLM into a simple string for display.

Now that we have a chain, we can ask it a question. Again, abbreviated from the sample code:

question = 'What is the difference between the master application key and a standard application key?'
answer = chain.invoke(question)

Adding message history to the simple RAG chatbot

The first step of extending the chatbot is to give the LLM new instructions, similar to its previous prompt template, but including the message history:

prompt_template = """Use the following pieces of context and the message history to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
Context: {context}
    
History: {history}
    
Question: {question}

Helpful Answer:"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question", "history"]
)

The chain must be modified slightly to accommodate the message history:

chain = (
    {
        "context": (
                itemgetter("question")
                | retriever
        ),
        "question": itemgetter("question"),
        "history": itemgetter("history")
    }
    | prompt
    | model
    | StrOutputParser()
)

Now, we define a very simple in-memory message store that uses a session_id parameter to manage multiple simultaneous conversations:

store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

LangChain provides a wrapper, RunnableWithMessageHistory, that combines the message store with the above chain to create a new chain with message history capability:

with_message_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
)

Now we can feed a series of related questions into the new chain:

questions = [
    'What is the difference between the master application key and a standard application key?',
    'Which one would I use to work with a single bucket?',
    'Can you tell me anything more about this topic?'
]

for question in questions:
    print(f'\n{question}\n')
    answer = with_message_history.invoke(
        {"question": question},
        config={"configurable": {"session_id": "abc123"}},
    )
    print(f'{answer}\n')

I have to admit, I was pleasantly surprised by the results:

What is the difference between the master application key and a standard application key?

A master application key grants broad access privileges, while a standard application key is limited to the level of access that a user needs.

Which one would I use to work with a single bucket?

You would use a standard application key to work with a single bucket as it has limited access and only grants permissions needed for specific tasks, unlike the master application key which provides broad access privileges.

Can you tell me anything more about this topic?

Sure! The master application key is typically used by developers during development or testing phases to grant full access to all resources in a Backblaze B2 account, while the standard application key provides limited permissions and should be used for production environments where security is paramount.

Processing this series of questions on my MacBook Pro M1 with no GPU-acceleration took three minutes and 25 seconds, and just 52 seconds with its 16-core GPU. For comparison, I spun up a VM at Ori, another Backblaze partner offering GPU VM instances, with an Nvidia L4 Tensor Core GPU and 24GB of VRAM. The only code change required was to set the LLM device to ‘cuda’ to select the Nvidia GPU. The Ori VM answered those same questions in just 18 seconds.

An image of an Nvidia L4 Tensor Core GPU — The Nvidia L4 Tensor Core GPU: not much to look at, but crazy-fast AI inference!

Go forth and experiment

One of the reasons I refactored the demo apps was that notebooks allow an interactive, experimental approach. You can run the code in a cell, make a change, then re-run it to see the outcome. The RAG demo repository includes instructions for running the notebooks, and both the GPT4All and LangChain SDKs can run LLMs on machines with or without a GPU. Use the code as a starting point for your own exploration of AI, and let us know how you get on in the comments!

The post How to Build Your Own LLM with Backblaze B2 + Jupyter Notebook appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

AI Video Understanding in Your Apps with Twelve Labs and Backblaze

2024-05-09 Pat Patterson

Post Syndicated from Pat Patterson original https://backblaze.com/blog/ai-video-understanding-in-your-apps-with-twelve-labs-and-backblaze/

Over the past few years, since long before the recent large language model (LLM) revolution, we’ve benefited not only from the ability of AI models to transcribe audio to text, but also to automatically tag video files according to their content. Media asset management (MAM) software—such as Backlight iconik and Axle.ai (both Backblaze Partners, by the way)—allows media professionals to quickly locate footage by searching for combinations of tags. For example, “red car”, will return not only a list of video files containing red cars, but also the timecodes pinpointing the appearance of the red car in each clip.

San Francisco startup Twelve Labs has created a video understanding platform that allows any developer to build this kind of functionality, and more, into their app via a straightforward RESTful API.

In preparation for our webinar with Twelve Labs last month, I created a web app to show how to integrate Twelve Labs with Backblaze B2 for storing video. The complete sample app is available as open source at GitHub; in this blog post, I’ll provide a brief description of the Twelve Labs platform, explain how presigned URLs allow temporary access to files in a private bucket, and then share the key elements of the sample app. If you just want a high level understanding of the integration, read on, and feel free to skip the technical details!

The Twelve Labs Video Understanding Platform

The core of the Twelve Labs platform is a foundation model that operates across the visual, audio, and text modes of video content, allowing multimodal video understanding. When you submit a video using the Twelve Labs Task API, the platform generates a compact numerical representation of the video content, termed an embedding, that identifies entities, actions, patterns, movements, objects, scenes, other elements of the video, and their interrelationships. The embedding contains everything the Twelve Labs platform needs to do its work—after the initial scan, the platform no longer needs access to the original video content. As each video is scanned into the platform, its embedding is added to an index, so this scanning process is often referred to as indexing.

As part of the indexing process, the platform extracts a standard set of data from each video: a thumbnail image, a transcript of any spoken content, any text that appears on screen, and a list of brand logos, all annotated with timecodes locating them on the video’s timeline, and all accessible via the Twelve Labs Index API.

You can have the platform create a title and summary, and even prompt the model to describe the video, via Twelve Labs’ Generate API. For example, I indexed an eight-minute video that explains how to back up a Synology NAS to Backblaze B2, then prompted the Generate API, “What are the two Synology applications mentioned in the video?” This was the first sentence of the resulting text:

The two Synology applications mentioned throughout the video are “Synology Hyper Backup” and “Synology Cloud Sync.”

The remainder of the response is a brief summary of the two applications and how they differ; here’s the full text. Although it does have that “AI flavor” as you read it, it’s clear and accurate. I must admit, I was quite impressed!

You can define a taxonomy for your videos via the Classify API. Submit a one- or two-level classification schema and a set of video IDs, and the platform will assign each video to a category.

Rounding up this quick tour of the Twelve Labs platform, the Search API, as its name suggests, allows you to search the indexed videos. As well as a search query, you must specify a set of content sources: any combination of visual, conversation, text in video, or logos. Each search result includes timecodes for its start and end.

Now you understand the basic capabilities of the Twelve Labs platform, let’s look at how you can integrate it with Backblaze B2.

Allowing Temporary Access to Files in a Private Backblaze B2 Bucket

A key feature of the sample app is that it uploads videos to a private Backblaze B2 Bucket, where they are only accessible to authorized users. Twelve Labs’ API allows you to submit a video for indexing by POSTing a JSON payload including the video’s URL to its Task API. This is straightforward for video files in a public bucket, but how do we allow the Twelve Labs platform to read files from a private bucket?

One way would be to create an application key with capabilities to read files from the private bucket and share it with the Twelve Labs platform. The main drawback to this approach is that the platform currently lacks the ability to sign requests for files from a private bucket.

Since Twelve Labs only needs to read the video file when we submit it for indexing, we can send it a presigned URL for the video file. As well as the usual Backblaze B2 endpoint, bucket name, and object key (path and filename), a presigned URL includes query parameters containing data such as the time when the URL was created, its validity period in seconds, an application key ID (or access key ID, in S3 terminology), and a signature created with the corresponding application key (secret access key). Here’s an example, with line breaks added for clarity:

https://s3.us-west-004.backblazeb2.com/mybucket/image.jpeg
?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=00415f935c00000000aa%2F20240423%2Fus-west-004%2Fs3%2Faws4_request
&X-Amz-Date=20240423T222652Z
&X-Amz-Expires=3600
&X-Amz-SignedHeaders=host
&X-Amz-Signature=23ade1...3ca1eb

This URL was created at 22:26:52 UTC on 04/23/2024, and was valid for one hour (3600 seconds). The signature is 64 hex characters. Changing any part of the URL, for example, the X-Amz-Date parameter, invalidates the signature, resulting in an HTTP 403 Forbidden error when you try to use it, with a corresponding message in the response payload:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Error>
    <Code>SignatureDoesNotMatch</Code>
    <Message>Signature validation failed</Message>
</Error>

Attempting to use the presigned URL after it expires yields HTTP 401 Unauthorized with a message such as:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Error>
    <Code>UnauthorizedAccess</Code>
    <Message>Request has expired given timestamp: '20240423T222652Z' and expiration: 3600</Message>
</Error>

You can create presigned URLs with any of the AWS SDKs or the AWS CLI. For example, with the CLI:

% aws s3 presign s3://mybucket/image.jpeg --expires-in 600 
https://s3.us-west-004.backblazeb2.com/mybucket/image.jpeg?X-Amz...

Presigned URLs are useful whenever you want to provide temporary access to a file in a private bucket without having to share an application key for a client app to sign the request itself. The sample app also uses them when rendering HTML web pages. For example, all of the thumbnail images are retrieved by the user’s browser via presigned URLs.

Note that presigned URLs are a feature of Backblaze B2’s S3 Compatible API. Creating a presigned URL is an offline operation and does not consume any API calls. We recommend you use presigned URLs rather than the b2_get_download_authorization B2 Native API operation, since the latter is a class C API call.

Inside the Backblaze B2 + Twelve Labs Media Asset Management Example

The sample app is written in Python, using JavaScript for its front end, the Django web framework for its backend, the Huey task queue for managing long-running tasks, and the Twelve Labs Python SDK to interact with the Twelve Labs platform. A simple web UI allows the user to upload videos to the private bucket, browse uploaded videos, submit them for indexing, view the resulting transcription, logos, etc., and search the indexed videos.

Most of the application code is concerned with rendering the web UI; very little code is required to interact with Twelve Labs.

Configuration

The Django settings.py file defines a constant for the Twelve Labs index ID and creates an SDK client object using the Twelve Labs API key. Note that the app reads the index ID and API key from environment variables, rather than including the values in the source code. Externalizing the index ID as an environment variable allows more flexibility in deployment while, of course, you should never include secrets such as passwords or API keys in source code!

TWELVE_LABS_INDEX_ID = os.environ['TWELVE_LABS_INDEX_ID']
TWELVE_LABS_CLIENT = TwelveLabs(api_key=os.environ['TWELVE_LABS_API_KEY'])

Startup

When the web application starts, it validates the index ID and API key by retrieving details of the index. This is the relevant code, in apps.py:

index = TWELVE_LABS_CLIENT.index.retrieve(TWELVE_LABS_INDEX_ID)

If this API call fails, then the app prints a suitable diagnostic message identifying the issue.

Indexing

When a web application needs to perform an action that takes more than a few seconds to complete—for example—indexing a set of videos, it typically starts a background task to do the work, and returns an appropriate response to the user. The sample app follows this pattern: when the user selects one or more videos and hits the Index button, the web app starts a Huey task, do_video_indexing(), passing the IDs of the selected videos, and returns the IDs to the JavaScript front end. The front end can then show that the indexing tasks have started, and poll for their current status.

Here’s the code, in tasks.py, for submitting the videos for indexing.

# Create a task for each video we want to index
for video_task in video_tasks:
    task = TWELVE_LABS_CLIENT.task.create(
        TWELVE_LABS_INDEX_ID,
        url=default_storage.url(video_task['video']),
        disable_video_stream=True
    )
    print(f'Created task: {task}')
    video_task['task_id'] = task.id

Notice the call to default_storage.url(). This function, implemented by the django-storages library, takes as its argument the path to the video file, returning the presigned URL. The default expiry period is one hour.

Once the videos have been submitted, do_video_indexing() polls for the status of each indexing task until all are complete. Most of the code is concerned with minimizing the number of calls to the API, and saving status to the app’s database; getting the status of a task is simple:

task = TWELVE_LABS_CLIENT.task.retrieve(video_task['task_id'])

The task object’s status attribute is a string with a value such as validating, indexing, or ready. When the task reaches the ready status, the task object also includes a video_id attribute, uniquely identifying the video within the Twelve Labs platform. At this point, do_video_indexing() calls a helper function that retrieves the thumbnail, transcript, text, and logos and stores them in Backblaze B2.

Retrieving Video Data

Here’s the call to retrieve the thumbnail:

thumbnail_url = TWELVE_LABS_CLIENT.index.video.thumbnail(TWELVE_LABS_INDEX_ID, video.video_id)

The helper function creates a path for the thumbnail file from the video ID and the file extension in the returned URL, and saves the thumbnail to Backblaze B2:

default_storage.save(thumbnail_path, urlopen(thumbnail_url))

Again, django-storages is doing the heavy lifting. We use urlopen(), from the urllib.request module, to open the thumbnail URL, providing default_storage.save() with a file-like object from which it can read the thumbnail data.

The calls to retrieve transcript, text, and logo data have a slightly different form, for example:

video_data = TWELVE_LABS_CLIENT.index.video.transcription(TWELVE_LABS_INDEX_ID, video.video_id)

Each call returns a list of VideoValue objects, each VideoValue object comprising a start and end timecode (in seconds) and a value specific to the type of data; for example, a fragment of the transcription. We serialize each list to JSON and save it as a file in Backblaze B2.

When the user navigates to the detail page for a video, JavaScript reads each dataset from Backblaze B2 and renders it into the page, allowing the user to easily navigate to any of the data items.

Searching the Index

When the user enters a query and hits the search button, the backend calls the Twelve Labs Search API, passing the query text, and requesting results for all four sources of information. We set group_by to video since we want to show the results by video, and set the confidence threshold to medium to improve the relevance of the results. From VideoSearchView in views.py:

results = TWELVE_LABS_CLIENT.search.query(
    TWELVE_LABS_INDEX_ID,
    query,
    ["visual", "conversation", "text_in_video", "logo"],
    group_by="video",
    threshold="medium"
)

By default, the query() call returns a page of 10 results in result.data, so we loop through the pages using next(result) to fetch pages of search results as necessary. Each individual search result includes start and end timecodes, confidence, and the type of match (visual, conversation, text, or logo).

In the web UI, the user can click through to the results for a given video, then click an individual search result to view the matching video clip.

Getting Started with Backblaze B2 and Twelve Labs

Backblaze B2 Cloud Storage is a great choice for storing video to index with Twelve Labs; free egress each month for up to three times the amount of data you’re storing means that you can submit your entire video library to the Twelve Labs platform without worrying about data transfer charges, and unlimited free egress to our CDN partners reduces the costs of distributing video content to end users.

Click here to create a Backblaze B2 account, if you don’t already have one. Your first 10GB of storage is free, no credit card required. If you’re an enterprise that wants to run a larger proof of concept, you can always reach out to our Sales Team. You don’t need to write any code to upload video files or create presigned URLs, and you can use the Backblaze web UI to upload files up to 500MB, or any of a wide variety of tools to upload files up to 10TB, including the AWS CLI, rclone and Cyberduck. Select S3 as the protocol to be able to create presigned URLs.

Similarly, click here to sign up for Twelve Labs’ Free plan. With it, you can index up to 600 minutes of video, again, no credit card required. Python and Node.js developers can use one of the Twelve Labs SDKs, while the Twelve Labs API documentation includes code examples for a wide range of other programming languages.

The post AI Video Understanding in Your Apps with Twelve Labs and Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Exploring aws-lite, a Community-Driven JavaScript SDK for AWS

2024-01-25 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/exploring-aws-lite-a-community-driven-javascript-sdk-for-aws/

A decorative image showing the Backblaze and aws-lite logos.

One of the benefits of the Backblaze B2 Storage Cloud having an S3 compatible API is that developers can take advantage of the wide range of Amazon Web Services SDKs when building their apps. The AWS team has released over a dozen SDKs covering a broad range of programming languages, including Java, Python, and JavaScript, and the latter supports both frontend (browser) and backend (Node.js) applications.

With all of this tooling available, you might be surprised to discover aws-lite. In the words of its creators, it is “a simple, extremely fast, extensible Node.js client for interacting with AWS services.” After meeting Brian LeRoux, cofounder and chief technology officer (CTO) of Begin, the company that created the aws-lite project, at the AWS re:Invent conference last year, I decided to give aws-lite a try and share the experience. Read on for the learnings I discovered along the way.

A photo showing an aws-lite promotional sticker that says, I've got p99 problems but an SDK ain't one, as well as a Backblaze promotional sticker that says Blaze/On. — Brian bribed me to try out aws-lite with a shiny laptop sticker!

Why Not Just Use the AWS SDK for JavaScript?

The AWS SDK has been through a few iterations. The initial release, way back in May 2013, focused on Node.js, while version 2, released in June 2014, added support for JavaScript running on a web page. We had to wait until December 2020 for the next major revision of the SDK, with version 3 adding TypeScript support and switching to an all-new modular architecture.

However, not all developers saw version 3 as an improvement. Let’s look at a simple example of the evolution of the SDK. The simplest operation you can perform against an S3 compatible cloud object store, such as Backblaze B2, is to list the buckets in an account. Here’s how you would do that in the AWS SDK for JavaScript v2:

var AWS = require('aws-sdk');

var client = new AWS.S3({
  region: 'us-west-004', 
  endpoint: 's3.us-west-004.backblazeb2.com'
});

client.listBuckets(function (err, data) {
  if (err) {
    console.log("Error", err);
  } else {
    console.log("Success", data.Buckets);
  }
});

Looking back from 2023, passing a callback function to the listBuckets() method looks quite archaic! Version 2.3.0 of the SDK, released in 2016, added support for JavaScript promises, and, since async/await arrived in JavaScript in 2017, today we can write the above example a little more clearly and concisely:

const AWS = require('aws-sdk');

const client = new AWS.S3({
  region: 'us-west-004', 
  endpoint: 's3.us-west-004.backblazeb2.com'
});

try {
  const data = await client.listBuckets().promise();
  console.log("Success", data.Buckets);  
} catch (err) {
  console.log("Error", err);
}

One major drawback with version 2 of the AWS SDK for JavaScript is that it is a single, monolithic, JavaScript module. The most recent version, 2.1539.0, weighs in at 92.9MB of code and resources. Even the most minimal app using the SDK has to include all that, plus another couple of MB of dependencies, causing performance issues in resource-constrained environments such as internet of things (IoT) devices, or browsers on low-end mobile devices.

Version 3 of the AWS SDK for JavaScript aimed to fix this, taking a modular approach. Rather than a single JavaScript module there are now over 300 packages published under the @aws-sdk/ scope on NPM. Now, rather than the entire SDK, an app using S3 need only install @aws-sdk/client-s3, which, with its dependencies, adds up to just 20MB.

So, What’s the Problem With AWS SDK for JavaScript v3?

One issue is that, to fully take advantage of modularization, you must adopt an unfamiliar coding style, creating a command object and passing it to the client’s send() method. Here is the “new way” of listing buckets:

const { S3Client, ListBucketsCommand } = require("@aws-sdk/client-s3");

// Since v3.378, S3Client can read region and endpoint, as well as
// credentials, from configuration, so no need to pass any arguments
const client = new S3Client();

try {
  // Inexplicably, you must pass an empty object to 
  // ListBucketsCommand() to avoid the SDK throwing an error
  const data = await client.send(new ListBucketsCommand({}));
  console.log("Success", data.Buckets);  
} catch (err) {
  console.log("Error", err);
}

The second issue is that, to help manage the complexity of keeping the SDK packages in sync with the 200+ services and their APIs, AWS now generates the SDK code from the API specifications. The problem with generated code is that, as the aws-lite home page says, it can result in “large dependencies, poor performance, awkward semantics, difficult to understand documentation, and errors without usable stack traces.”

A couple of these effects are evident even in the short code sample above. The underlying ListBuckets API call does not accept any parameters, so you might expect to be able to call the ListBucketsCommand constructor without any arguments. In fact, you have to supply an empty object, otherwise the SDK throws an error. Digging into the error reveals that a module named middleware-sdk-s3 is validating that, if the object passed to the constructor has a Bucket property, it is a valid bucket name. This is a bit odd since, as I mentioned above, ListBuckets doesn’t take any parameters, let alone a bucket name. The documentation for ListBucketsCommand contains two code samples, one with the empty object, one without. (I filed an issue for the AWS team to fix this.)

“Okay,” you might be thinking, “I’ll just carry on using v2.” After all, the AWS team is still releasing regular updates, right? Not so fast! When you run the v2 code above, you’ll see the following warning before the list of buckets:

(node:35814) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023.
Please migrate your code to use AWS SDK for JavaScript (v3).
For more information, check the migration guide at https://a.co/7PzMCcy

At some (as yet unspecified) time in the future, v2 of the SDK will enter maintenance mode, during which, according to the AWS SDKs and Tools maintenance policy, “AWS limits SDK releases to address critical bug fixes and security issues only.” Sometime after that, v2 will reach the end of support, and it will no longer receive any updates or releases.

Getting Started With aws-lite

Faced with a forced migration to what they judged to be an inferior SDK, Brian’s team got to work on aws-lite, posting the initial code to the aws-lite GitHub repository in September last year, under the Apache 2.0 open source license. At present the project comprises a core client and 13 plugins covering a range of AWS services including S3, Lambda, and DynamoDB.

Following the instructions on the aws-lite site, I installed the client module and the S3 plugin, and implemented the ListBuckets sample:

import awsLite from '@aws-lite/client';

const aws = await awsLite();

try {
  const data = await aws.S3.ListBuckets();
  console.log("Success", data.Buckets);
} catch (err) {
  console.log("Error", err);
}

For me, this combines the best of both worlds—concise code, like AWS SDK v2, and full support for modern JavaScript features, like v3. Best of all, the aws-lite client, S3 plugin, and their dependencies occupy just 284KB of disk space, which is less than 2% of the modular AWS SDK’s 20MB, and less than 0.5% of the monolith’s 92.9MB!

Caveat Developer!

(Not to kill the punchline here, but for those of you who might not have studied Latin or law, this is a play on the phrase, “caveat emptor”, meaning “buyer beware”.)

I have to mention, at this point, that aws-lite is still very much under construction. Only a small fraction of AWS services are covered by plugins, although it is possible (with a little extra code) to use the client to call services without a plugin. Also, not all operations are covered by the plugins that do exist. For example, at present, the S3 plugin supports 10 of the most frequently used S3 operations, such as PutObject, GetObject, and ListObjectsV2, leaving the remaining 89 operations TBD.

That said, it’s straightforward to add more operations and services, and the aws-lite team welcomes pull requests. We’re big believers in being active participants in the open source community, and I’ve already contributed the ListBuckets operation, a fix for HeadObject, and I’m working on adding tests for the S3 plugin using a mock S3 server. If you’re a JavaScript developer working with cloud services, this is a great opportunity to contribute to an open source project that promises to make your coding life better!

The post Exploring aws-lite, a Community-Driven JavaScript SDK for AWS appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Data-Driven Decisions With Snowflake and Backblaze B2

2024-01-09 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/data-driven-decisions-wwith-snowflake-and-backblaze-b2/

A decorative image showing the Backblaze and Snowflake images superimposed over a cloud.

Since its launch in 2014 as a cloud-based data warehouse, Snowflake has evolved into a broad data-as-a-service platform addressing a wide variety of use cases, including artificial intelligence (AI), machine learning (ML), collaboration across organizations, and data lakes. Last year, Snowflake introduced support for S3 compatible cloud object stores, such as Backblaze B2 Cloud Storage. Now, Snowflake customers can access unstructured data such as images and videos, as well as structured and semi-structured data such as CSV, JSON, Parquet, and XML files, directly in the Snowflake Platform, served up from Backblaze B2.

Why access external data from Snowflake, when Snowflake is itself a data as a service (DaaS) platform with a cloud-based relational database at its core? To put it simply, not all data belongs in Snowflake. Organizations use cloud object storage solutions such as Backblaze B2 as a cost-effective way to maintain both master and archive data, with multiple applications reading and writing that data. In this situation, Snowflake is just another consumer of the data. Besides, data storage in Snowflake is much more expensive than in Backblaze B2, raising the possibility of significant cost savings as a result of optimizing your data’s storage location.

Snowflake Basics

At Snowflake’s core is a cloud-based relational database. You can create tables, load data into them, and run SQL queries just as you can with a traditional on-premises database. Given Snowflake’s origin as a data warehouse, it is currently better suited to running analytical queries against large datasets than as an operational database serving a high volume of transactions, but Snowflake Unistore’s hybrid tables feature (currently in private preview) aims to bridge the gap between transactional and analytical workloads.

As a DaaS platform, Snowflake runs on your choice of public cloud—currently Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform—but insulates you from the details of managing storage, compute, and networking infrastructure. Having said that, sometimes you need to step outside the Snowflake box to access data that you are managing in your own cloud object storage account. I’ll explain exactly how that works in this blog post, but, first, let’s take a quick look at how we classify data according to its degree of structure, as this can have a big impact on your decision of where to store it.

Structured and Semi-Structured Data

Structured data conforms to a rigid data model. Relational database tables are the most familiar example—a table’s schema describes required and optional fields and their data types, and it is not possible to insert rows into the table that contain additional fields not listed in the schema. Aside from relational databases, file formats such as Apache Parquet, Optimized Row Columnar (ORC), and Avro can all store structured data; each file format specifies a schema that fully describes the data stored within a file. Here’s an example of a schema for a Parquet file:

% parquet meta customer.parquet

File path:  /data/customer.parquet
...
Schema:
message hive_schema {
  required int64 custkey;
  required binary name (STRING);
  required binary address (STRING);
  required int64 nationkey;
  required binary phone (STRING);
  required int64 acctbal;
  optional binary mktsegment (STRING);
  optional binary comment (STRING);
}

Semi-structured data, as its name suggests, is more flexible. File formats such as CSV, XML and JSON need not use a formal schema, since they can be self-describing. That is, an application can infer the structure of the data as it reads the file, a mechanism often termed “schema-on-read.”

This simple JSON example illustrates the principle. You can see how it’s possible for an application to build the schema of a product record as it reads the file:

{
  "products" : [
    {
      "name" : "Paper Shredder",
      "description" : "Crosscut shredder with auto-feed"
    },
    {
      "name" : "Stapler",
      "color" : "Red"
    },
    {
      "name" : "Sneakers",
      "size" : "11"
    }
  ]
}

Accessing Structured and Semi-Structured Data Stored in Backblaze B2 from Snowflake

You can access data located in cloud object storage external to Snowflake, such as Backblaze B2, by creating an external stage. The external stage is a Snowflake database object that holds a URL for the external location, as well as configuration (e.g., credentials) required to access the data. For example:

CREATE STAGE b2_stage
  URL = 's3compat://your-b2-bucket-name/'
  ENDPOINT = 's3.your-region.backblazeb2.com'
  REGION = 'your-region'
  CREDENTIALS = (
    AWS_KEY_ID = 'your-application-key-id'
    AWS_SECRET_KEY = 'your-application-key'
  );

You can create an external table to query data stored in an external stage as if the data were inside a table in Snowflake, specifying the table’s columns as well as filenames, file formats, and data partitioning. Just like the external stage, the external table is a database object, located in a Snowflake schema, that stores the metadata required to access data stored externally to Snowflake, rather than the data itself.

Every external table automatically contains a single VARIANT type column, named value, that can hold arbitrary collections of fields. An external table definition for semi-structured data needs no further column definitions, only metadata such as the location of the data. For example:

CREATE EXTERNAL TABLE product
  LOCATION = @b2_stage/data/
  FILE_FORMAT = (TYPE = JSON)
  AUTO_REFRESH = false;

When you query the external table, you can reference elements within the value column, like this:

SELECT value:name
  FROM product
  WHERE value:color = ‘Red’;
+------------+
| VALUE:NAME |
|------------|
| "Stapler"  |
+------------+

Since structured data has a more rigid layout, you must define table columns (technically, in Snowflake, these are referred to as “pseudocolumns”), corresponding to the fields in the data files, in terms of the value column. For example:

CREATE EXTERNAL TABLE customer (
    custkey number AS (value:custkey::number),
    name varchar AS (value:name::varchar),
    address varchar AS (value:address::varchar),
    nationkey number AS (value:nationkey::number),
    phone varchar AS (value:phone::varchar),
    acctbal number AS (value:acctbal::number),
    mktsegment varchar AS (value:mktsegment::varchar),
    comment varchar AS (value:comment::varchar)
  )
  LOCATION = @b2_stage/data/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = false;

Once you’ve created the external table, you can write SQL statements to query the data stored externally, just as if it were inside a table in Snowflake:

SELECT phone
  FROM customer
  WHERE name = ‘Acme, Inc.’;
+----------------+
| PHONE          |
|----------------|
| "111-222-3333" |
+----------------+

The Backblaze B2 documentation includes a pair of technical articles that go further into the details, describing how to export data from Snowflake to an external table stored in Backblaze B2, and how to create an external table definition for existing structured data stored in Backblaze B2.

Accessing Unstructured Data Stored in Backblaze B2 from Snowflake

The term “unstructured”, in this context, refers to data such as images, audio, and video, that cannot be defined in terms of a data model. You still need to create an external stage to access unstructured data located outside of Snowflake, but, rather than creating external tables and writing SQL queries, you typically access unstructured data from custom code running in Snowflake’s Snowpark environment.

Here’s an excerpt from a Snowflake user-defined function, written in Python, that loads an image file from an external stage:

from snowflake.snowpark.files import SnowflakeFile

# The file_path argument is a scoped Snowflake file URL to a file in the 
# external stage, created with the BUILD_SCOPED_FILE_URL function. 
# It has the form
# https://abc12345.snowflakecomputing.com/api/files/01b1690e-0001-f66c-...
def generate_image_label(file_path):

  # Read the image file 
  with SnowflakeFile.open(file_path, 'rb') as f:
    image_bytes = f.readall()

  ...

In this example, the user-defined function reads an image file from an external stage, then runs an ML model on the image data to generate a label for the image according to its content. A Snowflake task using this user-defined function can insert rows into a table of image names and labels as image files are uploaded into a Backblaze B2 Bucket. You can learn more about this use case in particular, and loading unstructured data from Backblaze B2 into Snowflake in general, from the Backblaze Tech Day ‘23 session that I co-presented with Snowflake Product Manager Saurin Shah:

Choices, Choices: Where Should I Store My Data?

Given that, currently, Snowflake charges at least $23/TB/month for data storage on its platform compared to Backblaze B2 at $6/TB/month, it might seem tempting to move your data wholesale from Snowflake to Backblaze B2 and create external tables to replace tables currently residing in Snowflake. There are, however, a couple of caveats to mention: performance and egress costs.

The same query on the same dataset will run much more quickly against tables inside Snowflake than the corresponding external tables. A comprehensive analysis of performance and best practices for Snowflake external tables is a whole other blog post, but, as an example, one of my queries that completes in 30 seconds against a table in Snowflake takes three minutes to run against the same data in an external table.

Similarly, when you query an external table located in Backblaze B2, Snowflake must download data across the internet. Data formats such as Parquet can make this very efficient, organizing data column-wise and compressing it to minimize the amount of data that must be transferred. But, some amount of data still has to be moved from Backblaze B2 to Snowflake. Downloading data from Backblaze B2 is free of charge for up to 3x your average monthly data footprint, then $0.01/GB for additional egress, so there is a trade-off between data storage cost and data transfer costs for frequently-accessed data.

Some data naturally lives on one platform or the other. Frequently-accessed tables should probably be located in Snowflake. Media files, that might only ever need to be downloaded once to be processed by code running in Snowpark, belong in Backblaze B2. The gray area is large datasets that will only be accessed a few times a month, where the performance disparity is not an issue, and the amount of data transferred might fit into Backblaze B2’s free egress allowance. By understanding how you access your data, and doing some math, you’re better able to choose the right cloud storage tool for your specific tasks.

The post Data-Driven Decisions With Snowflake and Backblaze B2 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Run AI/ML Workloads on CoreWeave + Backblaze

2023-12-13 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-run-ai-ml-workloads-on-coreweave-backblaze/

A decorative image showing the Backblaze and CoreWeave logos superimposed on clouds.

Backblaze compute partner CoreWeave is a specialized GPU cloud provider designed to power use cases such as AI/ML, graphics, and rendering up to 35x faster and for 80% less than generalized public clouds. Brandon Jacobs, an infrastructure architect at CoreWeave, joined us earlier this year for Backblaze Tech Day ‘23. Brandon and I co-presented a session explaining both how to backup CoreWeave Cloud storage volumes to Backblaze B2 Cloud Storage and how to load a model from Backblaze B2 into the CoreWeave Cloud inference stack.

Since we recently published an article covering the backup process, in this blog post I’ll focus on loading a large language model (LLM) directly from Backblaze B2 into CoreWeave Cloud.

Below is the session recording from Tech Day; feel free to watch it instead of, or in addition to, reading this article.

More About CoreWeave

In the Tech Day session, Brandon covered the two sides of CoreWeave Cloud:

Model training and fine tuning.
The inference service.

To maximize performance, CoreWeave provides a fully-managed Kubernetes environment running on bare metal, with no hypervisors between your containers and the hardware.

CoreWeave provides a range of storage options: storage volumes that can be directly mounted into Kubernetes pods as block storage or a shared file system, running on solid state drives (SSDs) or hard disk drives (HDDs), as well as their own native S3 compatible object storage. Knowing that, you’re probably wondering, “Why bother with Backblaze B2, when CoreWeave has their own object storage?”

The answer echoes the first few words of this blog post—CoreWeave’s object storage is a specialized implementation, co-located with their GPU compute infrastructure, with high-bandwidth networking and caching. Backblaze B2, in contrast, is general purpose cloud object storage, and includes features such as Object Lock and lifecycle rules, that are not as relevant to CoreWeave’s object storage. There is also a price differential. Currently, at $6/TB/month, Backblaze B2 is one-fifth of the cost of CoreWeave’s object storage.

So, as Brandon and I explained in the session, CoreWeave’s native storage is a great choice for both the training and inference use cases, where you need the fastest possible access to data, while Backblaze B2 shines as longer term storage for training, model, and inference data as well as the destination for data output from the inference process. In addition, since Backblaze and CoreWeave are bandwidth partners, you can transfer data between our two clouds with no egress fees, freeing you from unpredictable data transfer costs.

Loading an LLM From Backblaze B2

To demonstrate how to load an archived model from Backblaze B2, I used CoreWeave’s GPT-2 sample. GPT-2 is an earlier version of the GPT-3.5 and GPT-4 LLMs used in ChatGPT. As such, it’s an accessible way to get started with LLMs, but, as you’ll see, it certainly doesn’t pass the Turing test!

This sample comprises two applications: a transformer and a predictor. The transformer implements a REST API, handling incoming prompt requests from client apps, encoding each prompt into a tensor, which the transformer passes to the predictor. The predictor applies the GPT-2 model to the input tensor, returning an output tensor to the transformer for decoding into text that is returned to the client app. The two applications have different hardware requirements—the predictor needs a GPU, while the transformer is satisfied with just a CPU, so they are configured as separate Kubernetes pods, and can be scaled up and down independently.

Since the GPT-2 sample includes instructions for loading data from Amazon S3, and Backblaze B2 features an S3 compatible API, it was a snap to modify the sample to load data from a Backblaze B2 Bucket. In fact, there was just a single line to change, in the s3-secret.yaml configuration file. The file is only 10 lines long, so here it is in its entirety:

apiVersion: v1
kind: Secret
metadata:
  name: s3-secret
  annotations:
     serving.kubeflow.org/s3-endpoint: s3.us-west-004.backblazeb2.com
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <my-backblaze-b2-application-key-id>
  AWS_SECRET_ACCESS_KEY: <my-backblaze-b2-application-key>

As you can see, all I had to do was set the serving.kubeflow.org/s3-endpoint metadata annotation to my Backblaze B2 Bucket’s endpoint and paste in an application key and its ID.

While that was the only Backblaze B2-specific edit, I did have to configure the bucket and path where my model was stored. Here’s an excerpt from gpt-s3-inferenceservice.yaml, which configures the inference service itself:

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: gpt-s3
  annotations:
    # Target concurrency of 4 active requests to each container
    autoscaling.knative.dev/target: "4"
    serving.kubeflow.org/gke-accelerator: Tesla_V100
spec:
  default:
    predictor:
      minReplicas: 0 # Allow scale to zero
      maxReplicas: 2 
      serviceAccountName: s3-sa # The B2 credentials are retrieved from the service account
      tensorflow:
        # B2 bucket and path where the model is stored
        storageUri: s3://<my-bucket>/model-storage/124M/
        runtimeVersion: "1.14.0-gpu"
        ...

Aside from storageUri configuration, you can see how the predictor application’s pod is configured to scale from between zero and two instances (“replicas” in Kubernetes terminology). The remainder of the file contains the transformer pod configuration, allowing it to scale from zero to a single instance.

Running an LLM on CoreWeave Cloud

Spinning up the inference service involved a kubectl apply command for each configuration file and a short wait for the CoreWeave GPU cloud to bring up the compute and networking infrastructure. Once the predictor and transformer services were ready, I used curl to submit my first prompt to the transformer endpoint:

% curl -d '{"instances": ["That was easy"]}' http://gpt-s3-transformer-default.tenant-dead0a.knative.chi.coreweave.com/v1/models/gpt-s3:predict
{"predictions": ["That was easy for some people, it's just impossible for me,\" Davis said. \"I'm still trying to" ]}

In the video, I repeated the exercise, feeding GPT-2’s response back into it as a prompt a few times to generate a few paragraphs of text. Here’s what it came up with:

“That was easy: If I had a friend who could take care of my dad for the rest of his life, I would’ve known. If I had a friend who could take care of my kid. He would’ve been better for him than if I had to rely on him for everything.

The problem is, no one is perfect. There are always more people to be around than we think. No one cares what anyone in those parts of Britain believes,

The other problem is that every decision the people we’re trying to help aren’t really theirs. If you have to choose what to do”

If you’ve used ChatGPT, you’ll recognize how far LLMs have come since GPT-2’s release in 2019!

Run Your Own Large Language Model

While CoreWeave’s GPT-2 sample is an excellent introduction to the world of LLMs, it’s a bit limited. If you’re looking to get deeper into generative AI, another sample, Fine-tune Large Language Models with CoreWeave Cloud, shows how to fine-tune a model from the more recent EleutherAI Pythia suite.

Since CoreWeave is a specialized GPU cloud designed to deliver best-in-class performance up to 35x faster and 80% less expensive than generalized public clouds, it’s a great choice for workloads such as AI, ML, rendering, and more, and, as you’ve seen in this blog post, easy to integrate with Backblaze B2 Cloud Storage, with no data transfer costs. For more information, contact the CoreWeave team.

The post How to Run AI/ML Workloads on CoreWeave + Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Digging Deeper Into Object Lock

2023-11-28 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/digging-deeper-into-object-lock/

A decorative image showing data inside of a vault.

Using Object Lock for your data is a smart choice—you can protect your data from ransomware, meet compliance requirements, beef up your security policy, or preserve data for legal reasons. But, it’s not a simple on/off switch, and accidentally locking your data for 100 years is a mistake you definitely don’t want to make.

Today we’re taking a deeper dive into Object Lock and the related legal hold feature, examining the different levels of control that are available, explaining why developers might want to build Object Lock into their own applications, and showing exactly how to do that. While the code samples are aimed at our developer audience, anyone looking for a deeper understanding of Object Lock should be able to follow along.

I presented a webinar on this topic earlier this year that covers much the same ground as this blog post, so feel free to watch it instead of, or in addition to, reading this article.

Check Out the Docs

For even more information on Object Lock, check out our Object Lock overview in our Technical Documentation Portal as well as these how-tos about how to enable Object Lock using the Backblaze web UI, Backblaze B2 Native API, and the Backblaze S3 Compatible API:

What Is Object Lock?

In the simplest explanation, Object Lock is a way to lock objects (aka files) stored in Backblaze B2 so that they are immutable—that is, they cannot be deleted or modified, for a given period of time, even by the user account that set the Object Lock rule. Backblaze B2’s implementation of Object Lock was originally known as File Lock, and you may encounter the older terminology in some documentation and articles. For consistency, I’ll use the term “object” in this blog post, but in this context it has exactly the same meaning as “file.”

Object Lock is a widely offered feature included with backup applications such as Veeam and MSP360, allowing organizations to ensure that their backups are not vulnerable to deliberate or accidental deletion or modification for some configurable retention period.

Ransomware mitigation is a common motivation for protecting data with Object Lock. Even if an attacker were to compromise an organization’s systems to the extent of accessing the application keys used to manage data in Backblaze B2, they would not be able to delete or change any locked data. Similarly, Object Lock guards against insider threats, where the attacker may try to abuse legitimate access to application credentials.

Object Lock is also used in industries that store sensitive or personal identifiable information (PII) such as banking, education, and healthcare. Because they work with such sensitive data, regulatory requirements dictate that data be retained for a given period of time, but data must also be deleted in particular circumstances.

For example, the General Data Protection Regulation (GDPR), an important component of the EU’s privacy laws and an international regulatory standard that drives best practices, may dictate that some data must be deleted when a customer closes their account. A related use case is where data must be preserved due to litigation, where the period for which data must be locked is not fixed and depends on the type of lawsuit at hand.

To handle these requirements, Backblaze B2 offers two Object Lock modes—compliance and governance—as well as the legal hold feature. Let’s take a look at the differences between them.

Compliance Mode: Near-Absolute Immutability

When objects are locked in compliance mode, not only can they not be deleted or modified while the lock is in place, but the lock also cannot be removed during the specified retention period. It is not possible to remove or override the compliance lock to delete locked data until the lock expires, whether you’re attempting to do so via the Backblaze web UI or either of the S3 Compatible or B2 Native APIs. Similarly, Backblaze Support is unable to unlock or delete data locked under compliance mode in response to a support request, which is a safeguard designed to address social engineering attacks where an attacker impersonates a legitimate user.

What if you inadvertently lock many terabytes of data for several years? Are you on the hook for thousands of dollars of storage costs? Thankfully, no—you have one escape route, which is to close your Backblaze account. Closing the account is a multi-step process that requires access to both the account login credentials and two-factor verification (if it is configured) and results in the deletion of all data in that account, locked or unlocked. This is a drastic step, so we recommend that developers create one or more “burner” Backblaze accounts for use in developing and testing applications that use Object Lock, that can be closed if necessary without disrupting production systems.

There is one lock-related operation you can perform on compliance-locked objects: extending the retention period. In fact, you can keep extending the retention period on locked data any number of times, protecting that data from deletion until you let the compliance lock expire.

Governance Mode: Override Permitted

In our other Object Lock option, objects can be locked in governance mode for a given retention period. But, in contrast to compliance mode, the governance lock can be removed or overridden via an API call, if you have an application key with appropriate capabilities. Governance mode handles use cases that require retention of data for some fixed period of time, with exceptions for particular circumstances.

When I’m trying to remember the difference between compliance and governance mode, I think of the phrase, “Twenty seconds to comply!”, uttered by the ED-209 armed robot in the movie “RoboCop.” It turned out that there was no way to override ED-209’s programming, with dramatic, and fatal, consequences.

ED-209: as implacable as compliance mode.

Legal Hold: Flexible Preservation

While the compliance and governance retention modes lock objects for a given retention period, legal hold is more like a toggle switch: you can turn it on and off at any time, again with an application key with sufficient capabilities. As its name suggests, legal hold is ideal for situations where data must be preserved for an unpredictable period of time, such as while litigation is proceeding.

The compliance and governance modes are mutually exclusive, which is to say that only one may be in operation at any time. Objects locked in governance mode can be switched to compliance mode, but, as you might expect from the above explanation, objects locked in compliance mode cannot be switched to governance mode until the compliance lock expires.

Legal hold, on the other hand, operates independently, and can be enabled and disabled regardless of whether an object is locked in compliance or governance mode.

How does this work? Consider an object that is locked in compliance or governance mode and has legal hold enabled:

If the legal hold is removed, the object remains locked until the retention period expires.
If the retention period expires, the object remains locked until the legal hold is removed.

Object Lock and Versioning

By default, Backblaze B2 Buckets have versioning enabled, so as you upload successive objects with the same name, previous versions are preserved automatically. None of the Object Lock modes prevent you from uploading a new version of a locked object; the lock is specific to the object version to which it was applied.

You can also hide a locked object so it doesn’t appear in object listings. The hidden version is retained and can be revealed using the Backblaze web UI or an API call.

As you might expect, locked object versions are not subject to deletion by lifecycle rules—any attempt to delete a locked object version via a lifecycle rule will fail.

How to Use Object Lock in Applications

Now that you understand the two modes of Object Lock, plus legal hold, and how they all work with object versions, let’s look at how you can take advantage of this functionality in your applications. I’ll include code samples for Backblaze B2’s S3 Compatible API written in Python, using the AWS SDK, aka Boto3, in this blog post. You can find details on working with Backblaze B2’s Native API in the documentation.

Application Key Capabilities for Object Lock

Every application key you create for Backblaze B2 has an associated set of capabilities; each capability allows access to a specific functionality in Backblaze B2. There are seven capabilities relevant to object lock and legal hold.

Two capabilities relate to bucket settings:

readBucketRetentions
writeBucketRetentions

Three capabilities relate to object settings for retention:

readFileRetentions
writeFileRetentions
bypassGovernance

And, two are specific to Object Lock:

readFileLegalHolds
writeFileLegalHolds

The Backblaze B2 documentation contains full details of each capability and the API calls it relates to for both the S3 Compatible API and the B2 Native API.

When you create an application key via the web UI, it is assigned capabilities according to whether you allow it access to all buckets or just a single bucket, and whether you assign it read-write, read-only, or write-only access.

An application key created in the web UI with read-write access to all buckets will receive all of the above capabilities. A key with read-only access to all buckets will receive readBucketRetentions, readFileRetentions, and readFileLegalHolds. Finally, a key with write-only access to all buckets will receive bypassGovernance, writeBucketRetentions, writeFileRetentions, and writeFileLegalHolds.

In contrast, an application key created in the web UI restricted to a single bucket is not assigned any of the above permissions. When an application using such a key uploads objects to its associated bucket, they receive the default retention mode and period for the bucket, if they have been set. The application is not able to select a different retention mode or period when uploading an object, change the retention settings on an existing object, or bypass governance when deleting an object.

You may want to create application keys with more granular permissions when working with Object Lock and/or legal hold. For example, you may need an application restricted to a single bucket to be able to toggle legal hold for objects in that bucket. You can use the Backblaze B2 CLI to create an application key with this, or any other set of capabilities. This command, for example, creates a key with the default set of capabilities for read-write access to a single bucket, plus the ability to read and write the legal hold setting:

% b2 create-key --bucket my-bucket-name my-key-name listBuckets,readBuckets,listFiles,readFiles,shareFiles,writeFiles,deleteFiles,readBucketEncryption,writeBucketEncryption,readBucketReplications,writeBucketReplications,readFileLegalHolds,writeFileLegalHolds

Enabling Object Lock

You must enable Object Lock on a bucket before you can lock any objects therein; you can do this when you create the bucket, or at any time later, but you cannot disable Object Lock on a bucket once it has been enabled. Here’s how you create a bucket with Object Lock enabled:

s3_client.create_bucket(
    Bucket='my-bucket-name',
    ObjectLockEnabledForBucket=True
)

Once a bucket’s settings have Object Lock enabled, you can configure a default retention mode and period for objects that are created in that bucket. Only compliance mode is configurable from the web UI, but you can set governance mode as the default via an API call, like this:

s3_client.put_object_lock_configuration(
    Bucket='my-bucket-name',
    ObjectLockConfiguration={
        'ObjectLockEnabled': 'Enabled',
        'Rule': {
            'DefaultRetention': {
                'Mode': 'GOVERNANCE',
                'Days': 7
            }
        }
    }
)

You cannot set legal hold as a default configuration for the bucket.

Locking Objects

Regardless of whether you set a default retention mode for the bucket, you can explicitly set a retention mode and period when you upload objects, or apply the same settings to existing objects, provided you use an application key with the appropriate writeFileRetentions or writeFileLegalHolds capability.

Both the S3 PutObject operation and Backblaze B2’s b2_upload_file include optional parameters for specifying retention mode and period, and/or legal hold. For example:

s3_client.put_object(
    Body=open('/path/to/local/file', mode='rb'),
    Bucket='my-bucket-name',
    Key='my-object-name',
    ObjectLockMode='GOVERNANCE',
    ObjectLockRetainUntilDate=datetime(
        2023, 9, 7, hour=10, minute=30, second=0
    )
)

Both APIs implement additional operations to get and set retention settings and legal hold for existing objects. Here’s an example of how you apply a governance mode lock:

s3_client.put_object_retention(
    Bucket='my-bucket-name',
    Key='my-object-name',
    VersionId='some-version-id',
    Retention={
        'Mode': 'GOVERNANCE',  # Required, even if mode is not changed
        'RetainUntilDate': datetime(
            2023, 9, 5, hour=10, minute=30, second=0
        )
    }
)

The VersionId parameter is optional: the operation applies to the current object version if it is omitted.

You can also use the web UI to view, but not change, an object’s retention settings, and to toggle legal hold for an object:

A screenshot highlighting where to enable Object Lock via the Backblaze web UI.

Deleting Objects in Governance Mode

As mentioned above, a key difference between the compliance and governance modes is that it is possible to override governance mode to delete an object, given an application key with the bypassGovernance capability. To do so, you must identify the specific object version, and pass a flag to indicate that you are bypassing the governance retention restriction:

# Get object details, including version id of current version
object_info = s3_client.head_object(
    Bucket='my-bucket-name',
    Key='my-object-name'
)

# Delete the most recent object version, bypassing governance
s3_client.delete_object(
    Bucket='my-bucket-name',
    Key='my-object-name',
    VersionId=object_info['VersionId'],
    BypassGovernanceRetention=True
)

There is no way to delete an object in legal hold; the legal hold must be removed before the object can be deleted.

Protect Your Data With Object Lock and Legal Hold

Object Lock is a powerful feature, and with great power… you know the rest. Here are some of the questions you should ask when deciding whether to implement Object Lock in your applications:

What would be the impact of malicious or accidental deletion of your application’s data?
Should you lock all data according to a central policy, or allow users to decide whether to lock their data, and for how long?
If you are storing data on behalf of users, are there special circumstances where a lock must be overridden?
Which users should be permitted to set and remove a legal hold? Does it make sense to build this into the application rather than have an administrator use a tool such as the Backblaze B2 CLI to manage legal holds?

If you already have a Backblaze B2 account, you can start working with Object Lock today; otherwise, create an account to get started.

The post Digging Deeper Into Object Lock appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How We Achieved Upload Speeds Faster Than AWS S3

2023-11-02 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/2023-performance-improvements/

An image of a city skyline with lines going up to a cloud.

You don’t always need the absolute fastest cloud storage—your performance requirements depend on your use case, business objectives, and security needs. But still, faster is usually better. And Backblaze just announced innovation on B2 Cloud Storage that delivers a lot more speed: most file uploads will now be up to 30% faster than AWS S3.

Today, I’m diving into all of the details of this performance improvement, how we did it, and what it means for you.

The TL:DR

The Results: Customers who rely on small file uploads (1MB or less) can expect to see 10–30% faster uploads on average based on our tests, all without any change to durability, availability, or pricing.

What Does This Mean for You?

All B2 Cloud Storage customers will benefit from these performance enhancements, especially those who use Backblaze B2 as a storage destination for data protection software. Small uploads of 1MB or less make up about 70% of all uploads to B2 Cloud Storage and are common for backup and archive workflows. Specific benefits of the performance upgrades include:

Secures data in offsite backup faster.
Frees up time for IT administrators to work on other projects.
Decreases congestion on network bandwidth.
Deduplicates data more efficiently.

Veeam® is dedicated to working alongside our partners to innovate and create a united front against cyber threats and attacks. The new performance improvements released by Backblaze for B2 Cloud Storage furthers our mission to provide radical resilience to our joint customers.

—Andreas Neufert, Vice President, Product Management, Alliances, Veeam

When Can I Expect Faster Uploads?

Today. The performance upgrades have been fully rolled out across Backblaze’s global data regions.

How We Did It

Prior to this work, when a customer uploaded a file to Backblaze B2, the data was written to multiple hard disk drives (HDDs). Those operations had to be completed before returning a response to the client. Now, we write the incoming data to the same HDDs and also, simultaneously, to a pool of solid state drives (SSDs) we call a “shard stash,” waiting only for the HDD writes to make it to the filesystems’ in-memory caches and the SSD writes to complete before returning a response. Once the writes to HDD are complete, we free up the space from the SSDs so it can be reused.

Since writing data to an SSD is much faster than writing to HDDs, the net result is faster uploads.

That’s just a brief summary; if you’re interested in the technical details (as well as the results of some rigorous testing), read on!

The Path to Performance Upgrades

As you might recall from many Drive Stats blog posts and webinars, Backblaze stores all customer data on HDDs, affectionately termed ‘spinning rust’ by some. We’ve historically reserved SSDs for Storage Pod (storage server) boot drives.

Until now.

That’s right—SSDs have entered the data storage chat. To achieve these performance improvements, we combined the performance of SSDs with the cost efficiency of HDDs. First, I’ll dig into a bit of history to add some context to how we went about the upgrades.

HDD vs. SSD

IBM shipped the first hard drive way back in 1957, so it’s fair to say that the HDD is a mature technology. Drive capacity and data rates have steadily increased over the decades while cost per byte has fallen dramatically. That first hard drive, the IBM RAMAC 350, had a total capacity of 3.75MB, and cost $34,500. Adjusting for inflation, that’s about $375,000, equating to $100,000 per MB, or $100 billion per TB, in 2023 dollars.

A photograph of people pushing one of the first hard disk drives into a truck. — An early hard drive shipped by IBM. Source.

Today, the 16TB version of the Seagate Exos X16—an HDD widely deployed in the Backblaze B2 Storage Cloud—retails for around $260, $16.25 per TB. If it had the same cost per byte as the IBM RAMAC 250, it would sell for $1.6 trillion—around the current GDP of China!

SSDs, by contrast, have only been around since 1991, when SanDisk’s 20MB drive shipped in IBM ThinkPad laptops for an OEM price of about $1,000. Let’s consider a modern SSD: the 3.2TB Micron 7450 MAX. Retailing at around $360, the Micron SSD is priced at $112.50 per TB, nearly seven times as much as the Seagate HDD.

So, HDDs easily beat SSDs in terms of storage cost, but what about performance? Here are the numbers from the manufacturers’ data sheets:

	Seagate Exos X16	Micron 7450 MAX
Model number	ST16000NM001G	MTFDKCB3T2TFS
Capacity	16TB	3.2TB
Drive cost	$260	$360
Cost per TB	$16.25	$112.50
Max sustained read rate (MB/s)	261	6,800
Max sustained write rate (MB/s)	261	5,300
Random read rate, 4kB blocks, IOPS	170/440*	1,000,000
Random write rate, 4kB blocks, IOPS	170/440*	390,000

Since HDD platters rotate at a constant rate, 7,200 RPM in this case, they can transfer more blocks per revolution at the outer edge of the disk than close to the middle—hence the two figures for the X16’s transfer rate.

The SSD is over 20 times as fast at sustained data transfer than the HDD, but look at the difference in random transfer rates! Even when the HDD is at its fastest, transferring blocks from the outer edge of the disk, the SSD is over 2,200 times faster reading data and nearly 900 times faster for writes.

This massive difference is due to the fact that, when reading data from random locations on the disk, the platters have to complete an average of 0.5 revolutions between blocks. At 7,200 rotations per minute (RPM), that means that the HDD spends about 4.2ms just spinning to the next block before it can even transfer data. In contrast, the SSD’s data sheet quotes its latency as just 80µs (that’s 0.08ms) for reads and 15µs (0.015ms) for writes, between 84 and 280 times faster than the spinning disk.

Let’s consider a real-world operation, say, writing 64kB of data. Assuming the HDD can write that data to sequential disk sectors, it will spin for an average of 4.2ms, then spend 0.25ms writing the data to the disk, for a total of 4.5ms. The SSD, in contrast, can write the data to any location instantaneously, taking just 27µs (0.027ms) to do so. This (somewhat theoretical) 167x speed advantage is the basis for the performance improvement.

Why did I choose a 64kB block? As we mentioned in a recent blog post focusing on cloud storage performance, in general, bigger files are better when it comes to the aggregate time required to upload a dataset. However, there may be other requirements that push for smaller files. Many backup applications split data into fixed size blocks for upload as files to cloud object storage. There is a trade-off in choosing the block size: larger blocks improve backup speed, but smaller blocks reduce the amount of storage required. In practice, backup blocks may be as small as 1MB or even 256kB. The 64kB blocks we used in the calculation above represent the shards that comprise a 1MB file.

The challenge facing our engineers was to take advantage of the speed of solid state storage to accelerate small file uploads without breaking the bank.

Improving Write Performance for Small Files

When a client application uploads a file to the Backblaze B2 Storage Cloud, a coordinator pod splits the file into 16 data shards, creates four additional parity shards, and writes the resulting 20 shards to 20 different HDDs, each in a different Pod.

Note: As HDD capacity increases, so does the time required to recover after a drive failure, so we periodically adjust the ratio between data shards and parity shards to maintain our eleven nines durability target. In the past, you’ve heard us talk about 17 + 3 as the ratio but we also run 16 + 4 and our very newest vaults use a 15 + 5 scheme.

Each Pod writes the incoming shard to its local filesystem; in practice, this means that the data is written to an in-memory cache and will be written to the physical disk at some point in the near future. Any requests for the file can be satisfied from the cache, but the data hasn’t actually been persistently stored yet.

We need to be absolutely certain that the shards have been written to disk before we return a “success” response to the client, so each Pod executes an fsync system call to transfer (“flush”) the shard data from system memory through the HDD’s write cache to the disk itself before returning its status to the coordinator. When the coordinator has received at least 19 successful responses, it returns a success response to the client. This ensures that, even if the entire data center was to lose power immediately after the upload, the data would be preserved.

As we explained above, for small blocks of data, the vast majority of the time spent writing the data to disk is spent waiting for the drive platter to spin to the correct location. Writing shards to SSD could result in a significant performance gain for small files, but what about that 7x cost difference?

Our engineers came up with a way to have our cake and eat it too by harnessing the speed of SSDs without a massive increase in cost. Now, upon receiving a file of 1MB or less, the coordinator splits it into shards as before, then simultaneously sends the shards to a set of 20 Pods and a separate pool of servers, each populated with 10 of the Micron SSDs described above—a “shard stash.” The shard stash servers easily win the “flush the data to disk” race and return their status to the coordinator in just a few milliseconds. Meanwhile, each HDD Pod writes its shard to the filesystem, queues up a task to flush the shard data to the disk, and returns an acknowledgement to the coordinator.

Once the coordinator has received replies establishing that at least 19 of the 20 Pods have written their shards to the filesystem, and at least 19 of the 20 shards have been flushed to the SSDs, it returns its response to the client. Again, if power was to fail at this point, the data has already been safely written to solid state storage.

We don’t want to leave the data on the SSDs any longer than we have to, so, each Pod, once it’s finished flushing its shard to disk, signals to the shard stash that it can purge its copy of the shard.

Real-World Performance Gains

As I mentioned above, that calculated 167x performance advantage of SSDs over HDDs is somewhat theoretical. In the real world, the time required to upload a file also depends on a number of other factors—proximity to the data center, network speed, and all of the software and hardware between the client application and the storage device, to name a few.

The first Backblaze region to receive the performance upgrade was U.S. East, located in Reston, Virginia. Over a 12-day period following the shard stash deployment there, the average time to upload a 256kB file was 118ms, while a 1MB file clocked in at 137ms. To replicate a typical customer environment, we ran the test application at our partner Vultr’s New Jersey data center, uploading data to Backblaze B2 across the public internet.

For comparison, we ran the same test against Amazon S3’s U.S. East (Northern Virginia) region, a.k.a. us-east-1, from the same machine in New Jersey. On average, uploading a 256kB file to S3 took 157ms, with a 1MB file taking 153ms.

So, comparing the Backblaze B2 U.S. East region to the Amazon S3 equivalent, we benchmarked the new, improved Backblaze B2 as 30% faster than S3 for 256kB files and 10% faster than S3 for 1MB files.

These low-level tests were confirmed when we timed Veeam Backup & Replication software backing up 1TB of virtual machines with 256k block sizes. Backing the server up to Amazon S3 took three hours and 12 minutes; we measured the same backup to Backblaze B2 at just two hours and 15 minutes, 40% faster than S3.

Test Methodology

We wrote a simple Python test app using the AWS SDK for Python (Boto3). Each test run involved timing 100 file uploads using the S3 PutObject API, with a 10ms delay between each upload. (FYI, the delay is not included in the measured time.) The test app used a single HTTPS connection across the test run, following best practice for API usage. We’ve been running the test on a VM in Vultr’s New Jersey region every six hours for the past few weeks against both our U.S. East region and its AWS neighbor. Latency to the Backblaze B2 API endpoint averaged 5.7ms, to the Amazon S3 API endpoint 7.8ms, as measured across 100 ping requests.

What’s Next?

At the time of writing, shard stash servers have been deployed to all of our data centers, across all of our regions. In fact, you might even have noticed small files uploading faster already. It’s important to note that this particular optimization is just one of a series of performance improvements that we’ve implemented, with more to come. It’s safe to say that all of our Backblaze B2 customers will enjoy faster uploads and downloads, no matter their storage workload.

The post How We Achieved Upload Speeds Faster Than AWS S3 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Big Performance Improvements in Rclone 1.64.0, but Should You Upgrade?

2023-09-21 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/big-performance-improvements-in-rclone-1-64-0-but-should-you-upgrade/

A decorative image showing a diagram about multithreading, as well as the Rclone and Backblaze logos.

Rclone is an open source, command line tool for file management, and it’s widely used to copy data between local storage and an array of cloud storage services, including Backblaze B2 Cloud Storage. Rclone has had a long association with Backblaze—support for Backblaze B2 was added back in January 2016, just two months before we opened Backblaze B2’s public beta, and five months before the official launch—and it’s become an indispensable tool for many Backblaze B2 customers.

Rclone v1.64.0, released last week, includes a new implementation of multithreaded data transfers, promising much faster data transfer of large files between cloud storage services.

Does it deliver? Should you upgrade? Read on to find out!

Multithreading to Boost File Transfer Performance

Something of a Swiss Army Knife for cloud storage, rclone can copy files, synchronize directories, and even mount remote storage as a local filesystem. Previous versions of rclone were able to take advantage of multithreading to accelerate the transfer of “large” files (by default at least 256MB), but the benefits were limited.

When transferring files from a storage system to Backblaze B2, rclone would read chunks of the file into memory in a single reader thread, starting a set of multiple writer threads to simultaneously write those chunks to Backblaze B2. When the source storage was a local disk (the common case) as opposed to remote storage such as Backblaze B2, this worked really well—the operation of moving files from local disk to Backblaze B2 was quite fast. However, when the source was another remote storage—say, transferring from Amazon S3 to Backblaze B2, or even Backblaze B2 to Backblaze B2—data chunks were read into memory by that single reader thread at about the same rate as they could be written to the destination, meaning that all but one of the writer threads were idle.

What’s the Big Deal About Rclone v1.64.0?

Rclone v1.64.0 completely refactors multithreaded transfers. Now rclone starts a single set of threads, each of which both reads a chunk of data from the source service into memory, and then writes that chunk to the destination service, iterating through a subset of chunks until the transfer is complete. The threads transfer their chunks of data in parallel, and each transfer is independent of the others. This architecture is both simpler and much, much faster.

Show Me the Numbers!

How much faster? I spun up a virtual machine (VM) via our compute partner, Vultr, and downloaded both rclone v1.64.0 and the preceding version, v1.63.1. As a quick test, I used Rclone’s copyto command to copy 1GB and 10GB files from Amazon S3 to Backblaze B2, like this:

rclone --no-check-dest copyto s3remote:my-s3-bucket/1gigabyte-test-file b2remote:my-b2-bucket/1gigabyte-test-file

Note that I made no attempt to “tune” rclone for my environment by setting the chunk size or number of threads. I was interested in the out of the box performance. I used the --no-check-dest flag so that rclone would overwrite the destination file each time, rather than detecting that the files were the same and skipping the copy.

I ran each copyto operation three times, then calculated the average time. Here are the results; all times are in seconds:

Rclone version	1GB	10GB
1.63.1	52.87	725.04
1.64.0	18.64	240.45

As you can see, the difference is significant! The new rclone transferred both files around three times faster than the previous version.

So, copying individual large files is much faster with the latest version of rclone. How about migrating a whole bucket containing a variety of file sizes from Amazon S3 to Backblaze B2, which is a more typical operation for a new Backblaze customer? I used rclone’s copy command to transfer the contents of an Amazon S3 bucket—2.8GB of data, comprising 35 files ranging in size from 990 bytes to 412MB—to a Backblaze B2 Bucket:

rclone --fast-list --no-check-dest copyto s3remote:my-s3-bucket b2remote:my-b2-bucket

Much to my dismay, this command failed, returning errors related to the files being corrupted in transfer, for example:

2023/09/18 16:00:37 ERROR : tpcds-benchmark/catalog_sales/20221122_161347_00795_djagr_3a042953-d0a2-4b8d-8c4e-6a88df245253: corrupted on transfer: sizes differ 244695498 vs 0

Rclone was reporting that the transferred files in the destination bucket contained zero bytes, and deleting them to avoid the use of corrupt data.

After some investigation, I discovered that the files were actually being transferred successfully, but a bug in rclone 1.64.0 caused the app to incorrectly interpret some successful transfers as corrupted, and thus delete the transferred file from the destination.

I was able to use the --ignore-size flag to workaround the bug by disabling the file size check so I could continue with my testing:

rclone --fast-list --no-check-dest --ignore-size copyto s3remote:my-s3-bucket b2remote:my-b2-bucket

A Word of Caution to Control Your Transaction Fees

Note the use of the --fast-list flag. By default, rclone’s method of reading the contents of cloud storage buckets minimizes memory usage at the expense of making a “list files” call for every subdirectory being processed. Backblaze B2’s list files API, b2_list_file_names, is a class C transaction, priced at $0.004 per 1,000 with 2,500 free per day. This doesn’t sound like a lot of money, but using rclone with large file hierarchies can generate a huge number of transactions. Backblaze B2 customers have either hit their configured caps or incurred significant transaction charges on their account when using rclone without the --fast-list flag.

We recommend you always use --fast-list with rclone if at all possible. You can set an environment variable so you don’t have to include the flag in every command:

export RCLONE_FAST_LIST=1

Again, I performed the copy operation three times, and averaged the results:

Rclone version	2.8GB tree
1.63.1	56.92
1.64.0	42.47

Since the bucket contains both large and small files, we see a lesser, but still significant, improvement in performance with rclone v1.64.0—it’s about 33% faster than the previous version with this set of files.

So, Should I Upgrade to the Latest Rclone?

As outlined above, rclone v1.64.0 contains a bug that can cause copy (and presumably also sync) operations to fail. If you want to upgrade to v1.64.0 now, you’ll have to use the --ignore-size workaround. If you don’t want to use the workaround, it’s probably best to hold off until rclone releases v1.64.1, when the bug fix will likely be deployed—I’ll come back and update this blog entry when I’ve tested it!

The post Big Performance Improvements in Rclone 1.64.0, but Should You Upgrade? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Use Cloud Replication to Automate Environments

2023-07-13 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-use-cloud-replication-to-automate-environments/

A decorative image showing a workflow from a computer, to a checklist, to a server stack.

A little over a year ago, we announced general availability of Backblaze Cloud Replication, the ability to automatically copy data across buckets, accounts, or regions. There are several ways to use this service, but today we’re focusing on how to use Cloud Replication to replicate data between environments like testing, staging, and production when developing applications.

First we’ll talk about why you might want to replicate environments and how to go about it. Then, we’ll get into the details: there are some nuances that might not be obvious when you set out to use Cloud Replication in this way, and we’ll talk about those so that you can replicate successfully.

Other Ways to Use Cloud Replication

In addition to replicating between environments, there are two main reasons you might want to use Cloud Replication:

Data Redundancy: Replicating data for security, compliance, and continuity purposes.
Data Proximity: Bringing data closer to distant teams or customers for faster access.

Maintaining a redundant copy of your data sounds, well, redundant, but it is the most common use case for cloud replication. It supports disaster recovery as part of a broad cyber resilience framework, reduces the risk of downtime, and helps you comply with regulations.

The second reason (replicating data to bring it geographically closer to end users) has the goal of improving performance and user experience. We looked at this use case in detail in the webinar Low Latency Multi-Region Content Delivery with Fastly and Backblaze.

Four Levels of Testing: Unit, Integration, System, and Acceptance

An image of the character, "The Most Interesting Man in the World", with the title "I don't always test my code, but when I do, I do it in production." — Friendly reminder to both drink and code responsibly (and probably not at the same time).

The Most Interesting Man in the World may test his code in production, but most of us prefer to lead a somewhat less “interesting” life. If you work in software development, you are likely well aware of the various types of testing, but it’s useful to review them to see how different tests might interact with data in cloud object storage.

Let’s consider a photo storage service that stores images in a Backblaze B2 Bucket. There are several real-world Backblaze customers that do exactly this, including Can Stock Photo and CloudSpot, but we’ll just imagine some of the features that any photo storage service might provide that its developers would need to write tests for.

Unit Tests

Unit tests test the smallest components of a system. For example, our photo storage service will contain code to manipulate images in a B2 Bucket, so its developers will write unit tests to verify that each low-level operation completes successfully. A test for thumbnail creation, for example, might do the following:

Directly upload a test image to the bucket.
Run the “‘Create Thumbnail” function against the test image.
Verify that the resulting thumbnail image has indeed been created in the expected location in the bucket with the expected dimensions.
Delete both the test and thumbnail images.

A large application might have hundreds, or even thousands, of unit tests, and it’s not unusual for development teams to set up automation to run the entire test suite against every change to the system to help guard against bugs being introduced during the development process.

Typically, unit tests require a blank slate to work against, with test code creating and deleting files as illustrated above. In this scenario, the test automation might create a bucket, run the test suite, then delete the bucket, ensuring a consistent environment for each test run.

Integration Tests

Integration tests bring together multiple components to test that they interact correctly. In our photo storage example, an integration test might combine image upload, thumbnail creation, and artificial intelligence (AI) object detection—all of the functions executed when a user adds an image to the photo storage service. In this case, the test code would do the following:

Run the “Add Image” procedure against a test image of a specific subject, such as a cat.
Verify that the test and thumbnail images are present in the expected location in the bucket, the thumbnail image has the expected dimensions, and an entry has been created in the image index with the “cat” tag.
Delete the test and thumbnail images, and remove the image’s entry from the index.

Again, integration tests operate against an empty bucket, since they test particular groups of functions in isolation, and require a consistent, known environment.

System Tests

The next level of testing, system testing, verifies that the system as a whole operates as expected. System testing can be performed manually by a QA engineer following a test script, but is more likely to be automated, with test software taking the place of the user. For example, the Selenium suite of open source test tools can simulate a user interacting with a web browser. A system test for our photo storage service might operate as follows:

Open the photo storage service web page.
Click the upload button.
In the resulting file selection dialog, provide a name for the image, navigate to the location of the test image, select it, and click the submit button.
Wait as the image is uploaded and processed.
When the page is updated, verify that it shows that the image was uploaded with the provided name.
Click the image to go to its details.
Verify that the image metadata is as expected. For example, the file size and object tag match the test image and its subject.

When we test the system at this level, we usually want to verify that it operates correctly against real-world data, rather than a synthetic test environment. Although we can generate “dummy data” to simulate the scale of a real-world system, real-world data is where we find the wrinkles and edge cases that tend to result in unexpected system behavior. For example, a German-speaking user might name an image “Schloss Schönburg.” Does the system behave correctly with non-ASCII characters such as ö in image names? Would the developers think to add such names to their dummy data?

A picture of Schönburg Castle in the Rhine Valley at sunset. — Non-ASCII characters: our excuse to give you your daily dose of seratonin. Source.

Acceptance Tests

The final testing level, acceptance testing, again involves the system as a whole. But, where system testing verifies that the software produces correct results without crashing, acceptance testing focuses on whether the software works for the user. Beta testing, where end-users attempt to work with the system, is a form of acceptance testing. Here, real-world data is essential to verify that the system is ready for release.

How Does Cloud Replication Fit Into Testing Environments?

Of course, we can’t just use the actual production environment for system and acceptance testing, since there may be bugs that destroy data. This is where Cloud Replication comes in: we can create a replica of the production environment, complete with its quirks and edge cases, against which we can run tests with no risk of destroying real production data. The term staging environment is often used in connection with acceptance testing, with test(ing) environments used with unit, integration, and system testing.

Caution: Be Aware of PII!

Before we move on to look at how you can put replication into practice, it’s worth mentioning that it’s essential to determine whether you should be replicating the data at all, and what safeguards you should place on replicated data—and to do that, you’ll need to consider whether or not it is or contains personally identifiable information (PII).

The National Institute of Science and Technology (NIST) document SP 800-122 provides guidelines for identifying and protecting PII. In our example photo storage site, if the images include photographs of people that may be used to identify them, then that data may be considered PII.

In most cases, you can still replicate the data to a test or staging environment as necessary for business purposes, but you must protect it at the same level that it is protected in the production environment. Keep in mind that there are different requirements for data protection in different industries and different countries or regions, so make sure to check in with your legal or compliance team to ensure everything is up to standard.

In some circumstances, it may be preferable to use dummy data, rather than replicating real-world data. For example, if the photo storage site was used to store classified images related to national security, we would likely assemble a dummy set of images rather than replicating production data.

How Does Backblaze Cloud Replication Work?

To replicate data in Backblaze B2, you must create a replication rule via either the web console or the B2 Native API. The replication rule specifies the source and destination buckets for replication and, optionally, advanced replication configuration. The source and destination buckets can be located in the same account, different accounts in the same region, or even different accounts in different regions; replication works just the same in all cases. While standard Backblaze B2 Cloud Storage rates apply to replicated data storage, note that Backblaze does not charge service or egress fees for replication, even between regions.

It’s easier to create replication rules in the web console, but the API allows access to two advanced features not currently accessible from the web console:

Setting a prefix to constrain the set of files to be replicated.
Excluding existing files from the replication rule.

Don’t worry: this blog post provides a detailed explanation of how to create replication rules via both methods.

Once you’ve created the replication rule, files will begin to replicate at midnight UTC, and it can take several hours for the initial replication if you have a large quantity of data. Files uploaded after the initial replication rule is active are automatically replicated within a few seconds, depending on file size. You can check whether a given file has been replicated either in the web console or via the b2-get-file-info API call. Here’s an example using curl at the command line:

 % curl -s -H "Authorization: ${authorizationToken}" \
    -d "{\"fileId\":  \"${fileId}\"}" \
    "${apiUrl}/b2api/v2/b2_get_file_info" | jq .
{
  "accountId": "15f935cf4dcb",
  "action": "upload",
  "bucketId": "11d5cf096385dc5f841d0c1b",
  ...
  "replicationStatus": "pending",
  ...
}

In the example response, replicationStatus returns the response pending; once the file has been replicated, it will change to completed.

Here’s a short Python script that uses the B2 Python SDK to retrieve replication status for all files in a bucket, printing the names of any files with pending status:

import argparse
import os

from dotenv import load_dotenv

from b2sdk.v2 import B2Api, InMemoryAccountInfo
from b2sdk.replication.types import ReplicationStatus

# Load credentials from .env file into environment
load_dotenv()

# Read bucket name from the command line
parser = argparse.ArgumentParser(description='Show files with "pending" replication status')
parser.add_argument('bucket', type=str, help='a bucket name')
args = parser.parse_args()

# Create B2 API client and authenticate with key and ID from environment
b2_api = B2Api(InMemoryAccountInfo())
b2_api.authorize_account("production", os.environ["B2_APPLICATION_KEY_ID"], os.environ["B2_APPLICATION_KEY"])

# Get the bucket object
bucket = b2_api.get_bucket_by_name(args.bucket)

# List all files in the bucket, printing names of files that are pending replication
for file_version, folder_name in bucket.ls(recursive=True):
    if file_version.replication_status == ReplicationStatus.PENDING:
        print(file_version.file_name)

Note: Backblaze B2’s S3-compatible API (just like Amazon S3 itself) does not include replication status when listing bucket contents—so for this purpose, it’s much more efficient to use the B2 Native API, as used by the B2 Python SDK.

You can pause and resume replication rules, again via the web console or the API. No files are replicated while a rule is paused. After you resume replication, newly uploaded files are replicated as before. Assuming that the replication rule does not exclude existing files, any files that were uploaded while the rule was paused will be replicated in the next midnight-UTC replication job.

How to Replicate Production Data for Testing

The first question is: does your system and acceptance testing strategy require read-write access to the replicated data, or is read-only access sufficient?

Read-Only Access Testing

If read-only access suffices, it might be tempting to create a read-only application key to test against the production environment, but be aware that testing and production make different demands on data. When we run a set of tests against a dataset, we usually don’t want the data to change during the test. That is: the production environment is a moving target, and we don’t want the changes that are normal in production to interfere with our tests. Creating a replica gives you a snapshot of real-world data against which you can run a series of tests and get consistent results.

It’s straightforward to create a read-only replica of a bucket: you just create a replication rule to replicate the data to a destination bucket, allow replication to complete, then pause replication. Now you can run system or acceptance tests against a static replica of your production data.

To later bring the replica up to date, simply resume replication and wait for the nightly replication job to complete. You can run the script shown in the previous section to verify that all files in the source bucket have been replicated.

Read-Write Access Testing

Alternatively, if, as is usually the case, your tests will create, update, and/or delete files in the replica bucket, there is a bit more work to do. Since testing intends to change the dataset you’ve replicated, there is no easy way to bring the source and destination buckets back into sync—changes may have happened in both buckets while your replication rule was paused.

In this case, you must delete the replication rule, replicated files, and the replica bucket, then create a new destination bucket and rule. You can reuse the destination bucket name if you wish since, internally, replication status is tracked via the bucket ID.

Always Test Your Code in an Environment Other Than Production

In short, we all want to lead interesting lives—but let’s introduce risk in a controlled way, by testing code in the proper environments. Cloud Replication lets you achieve that end while remaining nimble, which means you get to spend more time creating interesting tests to improve your product and less time trying to figure out why your data transformed in unexpected ways.

Now you have everything you need to create test and staging environments for applications that use Backblaze B2 Cloud Object Storage. If you don’t already have a Backblaze B2 account, sign up here to receive 10GB of storage, free, to try it out.

The post How to Use Cloud Replication to Automate Environments appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Discover the Secret to Lightning-Fast Big Data Analytics: Backblaze + Vultr Beats Amazon S3/EC2 by 39%

2023-06-27 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/discover-the-secret-to-lightning-fast-big-data-analytics-backblaze-vultr-beats-amazon-s3-ec2-by-39/

A decorative image showing the Vultr and Backblaze logos on a trophy.

Over the past few months, we’ve explained how to store and query analytical data in Backblaze B2, and how to query the Drive Stats dataset using the Trino SQL query engine. Prompted by the recent expansion of Backblaze’s strategic partnership with Vultr, we took a closer look at how the Backblaze B2 + Vultr Cloud Compute combination performs for big data analytical workloads in comparison to similar services on Amazon Web Services (AWS).

Running an industry-standard benchmark, and because AWS is almost five times more expensive, we were expecting to see a trade-off between better performance on the single cloud AWS deployment and lower cost on the multi-cloud Backblaze/Vultr equivalent, but we were very pleasantly surprised by the results we saw.

Spoiler alert: not only was the Backblaze B2 + Vultr combination significantly cheaper than Amazon S3/EC2, it also outperformed the Amazon services by a wide margin. Read on for the details—we cover a lot of background on this experiment, but you can skip straight ahead to the results of our tests if you’d rather get to the good stuff.

First, Some History: The Evolution of Big Data Storage Architecture

Back in 2004, Google’s MapReduce paper lit a fire under the data processing industry, proposing a new “programming model and an associated implementation for processing and generating large datasets.” MapReduce was applicable to many real-world data processing tasks, and, as its name implies, presented a straightforward programming model comprising two functions (map and reduce), each operating on sets of key/value pairs. This model allowed programs to be automatically parallelized and executed on large clusters of commodity machines, making it well suited for tackling “big data” problems involving datasets ranging into the petabytes.

The Apache Hadoop project, founded in 2005, produced an open source implementation of MapReduce, as well as the Hadoop Distributed File System (HDFS), which handled data storage. A Hadoop cluster could comprise hundreds, or even thousands, of nodes, each one responsible for both storing data to disk and running MapReduce tasks. In today’s terms, we would say that each Hadoop node combined storage and compute.

With the advent of cloud computing, more flexible big data frameworks, such as Apache Spark, decoupled storage from compute. Now organizations could store petabyte-scale datasets in cloud object storage, rather than on-premises clusters, with applications running on cloud compute platforms. Fast intra-cloud network connections and the flexibility and elasticity of the cloud computing environment more than compensated for the fact that big data applications were now accessing data via the network, rather than local storage.

Today we are moving into the next phase of cloud computing. With specialist providers such as Backblaze and Vultr each focusing on a core capability, can we move storage and compute even further apart, into different data centers? Our hypothesis was that increased latency and decreased bandwidth would severely impact performance, perhaps by a factor of two or three, but cost savings might still make for an attractive alternative to colocating storage and compute at a hyperscaler such as AWS. The tools we chose to test this hypothesis were the Trino open source SQL Query Engine and the TPC-DS benchmark.

Benchmarking Deployment Options With TPC-DS

The TPC-DS benchmark is widely used to measure the performance of systems operating on online analytical processing (OLAP) workloads, so it’s well suited for comparing deployment options for big data analytics.

A formal TPC-DS benchmark result measures query response time in single-user mode, query throughput in multiuser mode and data maintenance performance, giving a price/performance metric that can be used to compare systems from different vendors. Since we were focused on query performance rather than data loading, we simply measured the time taken for each configuration to execute TPC-DS’s set of 99 queries.

Helpfully, Trino includes a tpcds catalog with a range of schemas each containing the tables and data to run the benchmark at a given scale. After some experimentation, we chose scale factor 10, corresponding to approximately 10GB of raw test data, as it was a good fit for our test hardware configuration. Although this test dataset was relatively small, the TPC-DS query set simulates a real-world analytical workload of complex queries, and took several minutes to complete on the test systems. It would be straightforward, though expensive and time consuming, to repeat the test for larger scale factors.

We generated raw test data from the Trino tpcds catalog with its sf10 (scale factor 10) schema, resulting in 3GB of compressed Parquet files. We then used Greg Rahn’s version of the TPC-DS benchmark tools, tpcds-kit, to generate a standard TPC-DS 99-query script, modifying the script syntax slightly to match Trino’s SQL dialect and data types. We ran the set of 99 queries in single user mode three times on each of three combinations of compute/storage platforms: EC2/S3, EC2/B2 and Vultr/B2. The EC2/B2 combination allowed us to isolate the effect of moving storage duties to Backblaze B2 while keeping compute on Amazon EC2.

A note on data transfer costs: AWS does not charge for data transferred between an Amazon S3 bucket and an Amazon EC2 instance in the same region. In contrast, the Backblaze + Vultr partnership allows customers free data transfer between Backblaze B2 and Vultr Cloud Compute across any combination of regions.

Deployment Options for Cloud Compute and Storage

AWS

The EC2 configuration guide for Starburst Enterprise, the commercial version of Trino, recommends a r4.4xlarge EC2 instance, a memory-optimized instance offering 16 virtual CPUs and 122 GiB RAM, running Amazon Linux 2.

Following this lead, we configured an r4.4xlarge instance with 32GB of gp2 SSD local disk storage in the us-west-1 (Northern California) region. The combined hourly cost for the EC2 instance and SSD storage was $1.19.

We created an S3 bucket in the same us-west-1 region. After careful examination of the Amazon S3 Pricing Guide, we determined that the storage cost for the data on S3 was $0.026 per GB per month.

Vultr

We selected Vultr’s closest equivalent to the EC2 r4.4xlarge instance: a Memory Optimized Cloud Compute instance with 16 vCPUs, 128GB RAM plus 800GB of NVMe local storage, running Debian 11, at a cost of $0.95/hour in Vultr’s Silicon Valley region. Note the slight difference in the amount of available RAM–Vultr’s virtual machine (VM) includes an extra 6GB, despite its lower cost.

Backblaze B2

We created a Backblaze B2 Bucket located in the Sacramento, California data center of our U.S. West region, priced at $0.005/GB/month, about one-fifth the cost of Amazon S3.

Trino Configuration

We used the official Trino Docker image configured identically on the two compute platforms. Although a production Trino deployment would typically span several nodes, for simplicity, time savings, and cost-efficiency we brought up a single-node test deployment. We dedicated 78% of the VM’s RAM to Trino, and configured its Hive connector to access the Parquet files via the S3 compatible API. We followed the Trino/Backblaze B2 getting started tutorial to ensure consistency between the environments.

Benchmark Results

The table shows the time taken to complete the TPC-DS benchmark’s 99 queries. We calculated the mean of three runs for each combination of compute and storage. All times are in minutes and seconds, and a lower time is better.

A graph showing TPC/DS benchmark query times.

We used Trino on Amazon EC2 accessing data on Amazon S3 as our starting point; this configuration ran the benchmark in 20:43.

Next, we kept Trino on Amazon EC2 and moved the data to Backblaze B2. We saw a surprisingly small difference in performance, considering that the data was no longer located in the same AWS region as the application. The EC2/B2 Storage Cloud combination ran the benchmark just 38 seconds slower (that’s about 3%), clocking in at 21:21.

When we looked at Trino running on Vultr accessing data on Amazon S3, we saw a significant increase in performance. On Vultr/S3, the benchmark ran in 15:07, 27% faster than the EC2/S3 combination. We suspect that this is due to Vultr providing faster vCPUs, more available memory, faster networking, or a combination of the three. Determining the exact reason for the performance delta would be an interesting investigation, but was out of scope for this exercise.

Finally, looking at Trino on Vultr accessing data on Backblaze B2, we were astonished to see that not only did this combination post the fastest benchmark time of all, Trino on Vultr/Backblaze B2’s time of 12:39 was 16% faster than Vultr/S3 and 39% faster than Trino on EC2/S3!

Note: this is not a formal TPC-DS result, and the query times generated cannot be compared outside this benchmarking exercise.

The Bottom Line: Higher Performance at Lower Cost

For the scale factor 10 TPC-DS data set and queries, with comparably specified instances, Trino running on Vultr retrieving data from B2 is 39% faster than Trino on EC2 pulling data from S3, with 20% lower compute cost and 76% lower storage cost.

You can get started with both Backblaze B2 and Vultr free of charge—click here to sign up for Backblaze B2, with 10GB free storage forever, and click here for $250 of free credit at Vultr.

The post Discover the Secret to Lightning-Fast Big Data Analytics: Backblaze + Vultr Beats Amazon S3/EC2 by 39% appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Go Wild with Wildcards in the Backblaze B2 Command Line Tool 3.7.1

2023-02-22 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/go-wild-with-wildcards-in-backblaze-b2-command-line-tool-3-7-1/

File transfer tools such as Cyberduck, FileZilla Pro, and Transmit implement a graphical user interface (GUI), which allows users to manage and transfer files across local storage and any number of services, including cloud object stores such as Backblaze B2 Cloud Storage. Some tasks, however, require a little more power and flexibility than a GUI can provide. This is where a command line interface (CLI) shines. A CLI typically provides finer control over operations than a GUI tool, and makes it straightforward to automate repetitive tasks. We recently released version 3.7.0 (and then, shortly thereafter, version 3.7.1) of the Backblaze B2 Command Line Tool, alongside version 1.19.0 of the underlying Backblaze B2 Python SDK. Let’s take a look at the highlights in the new releases, and why you might want to use the Backblaze B2 CLI rather than the AWS equivalent.

Battle of the CLI’s: Backblaze B2 vs. AWS

As you almost certainly already know, Backblaze B2 has an S3-compatible API in addition to its original API, now known as the B2 Native API. In most cases, we recommend using the S3-compatible API, since a rich ecosystem of S3 tools and knowledge has evolved over the years.

While the AWS CLI works perfectly well with Backblaze B2, and we explain how to use it in our B2 Developer Quick-Start Guide, it’s slightly clunky. The AWS CLI allows you to set your access key id and secret access key via either environment variables or a configuration file, but you must override the default endpoint on the command line with every command, like this:

% aws --endpoint-url https://s3.us-west-004.backblazeb2.com s3api list-buckets

This is very tiresome if you’re working interactively at the command line! In contrast, the B2 CLI retrieves the correct endpoint from Backblaze B2 when it authenticates, so the command line is much more concise:

% b2 list-buckets

Additionally, the CLI provides fine-grain access to Backblaze B2-specific functionality, such as application key management and replication.

Automating Common Tasks with the B2 Command Line Tool

If you’re already familiar with CLI tools, feel free to skip to the next section.

Imagine you’ve uploaded a large number of WAV files to a Backblaze B2 Bucket for transcoding into .mp3 format. Once the transcoding is complete, and you’ve reviewed a sample of the .mp3 files, you decide that you can delete the .wav files. You can do this in a GUI tool, opening the bucket, navigating to the correct location, sorting the files by extension, selecting all of the .wav files, and deleting them. However, the CLI can do this in a single command:

% b2 rm --withWildcard --recursive my-bucket 'audio/*.wav'

If you want to be sure you’re deleting the correct files, you can add the --dryRun option to show the files that would be deleted, rather than actually deleting them:

% b2 rm --dryRun --withWildcard --recursive my-bucket 'audio/*.wav' audio/aardvark.wav audio/barracuda.wav ... audio/yak.wav audio/zebra.wav

You can find a complete list of the CLI’s commands and their options in the documentation.

Let’s take a look at what’s new in the latest release of the Backblaze B2 CLI.

Major Changes in B2 Command Line Tool Version 3.7.0

New `rm` command

The most significant addition in 3.7.0 is a whole new command: rm. As you might expect, rm removes files. The CLI has always included the low-level delete-file-version command (to delete a single file version) but you had to call that multiple times and combine it with other commands to remove all versions of a file, or to remove all files with a given prefix.

The new rm command is significantly more powerful, allowing you to delete all versions of a file in a single command:

% b2 rm --versions --withWildcard --recursive my-bucket images/san-mateo.png

Let’s unpack that command:

%: represents the command shell’s prompt. (You don’t type this.)
b2: the B2 CLI executable.
rm: the command we’re running.
--versions: apply the command to all versions. Omitting this option applies the command to just the most recent version.
--withWildcard: treat the folderName argument as a pattern to match the file name.
--recursive: descend into all folders. (This is required with –withWildcard.)
my-bucket: the bucket name.
images/san-mateo.png: the file to be deleted. There are no wildcard characters in the pattern, so the file name must match exactly. Note: there is no leading ‘/’ in Backblaze B2 file names.

As mentioned above, the --dryRun argument allows you to see what files would be deleted, without actually deleting them. Here it is with the ‘*’ wildcard to apply the command to all versions of the .png files in /images. Note the use of quotes to avoid the command shell expanding the wildcard:

% b2 rm --dryRun --versions --withWildcard --recursive my-bucket 'images/*.png' images/amsterdam.png images/sacramento.png

DANGER ZONE: by omitting --withWildcard and the folderName argument, you can delete all of the files in a bucket. We strongly recommend you use --dryRun first, to check that you will be deleting the correct files.

% b2 rm --dryRun --versions –recursive my-bucket index.html images/amsterdam.png images/phoenix.jpeg images/sacramento.png stylesheets/style.css

New `--withWildcard` option for the `ls` command

The ls command gains the --withWildcard option. It operates identically as described above. In fact, b2 rm --dryRun --withWildcard --recursive executes the exact same code as b2 ls --withWildcard --recursive. For example:

% b2 ls --withWildcard --recursive my-bucket 'images/*.png' images/amsterdam.png images/sacramento.png

You can combine --withWildcard with any of the existing options for ls, for example --long:

% b2 ls --long --withWildcard --recursive my-bucket 'images/*.png' 4_z71d55dummyid381234ed0c1b_f108f1dummyid163b_d2dummyid_m165048_c004 _v0402014_t0016_u01dummyid48198 upload 2023-02-09 16:50:48 714686 images/amsterdam.png 4_z71d55dummyid381234ed0c1b_f1149bdummyid1141_d2dummyid_m165048_c004 _v0402010_t0048_u01dummyid48908 upload 2023-02-09 16:50:48 549261 images/sacramento.png

New `--incrementalMode` option for `upload-file` and `sync`

The new --incrementalMode option saves time and bandwidth when working with files that grow over time, such as log files, by only uploading the changes since the last upload. When you use the --incrementalMode option with upload-file or sync, the B2 CLI looks for an existing file in the bucket with the b2FileName that you supplied, and notes both its length and SHA-1 digest. Let’s call that length l. The CLI then calculates the SHA-1 digest of the first l bytes of the local file. If the digests match, then the CLI can instruct Backblaze B2 to create a new file comprising the existing file and the remaining bytes of the local file.

That was a bit complicated, so let’s look at a concrete example. My web server appends log data to a file, access.log. I’ll see how big it is, get its SHA-1 digest, and upload it to a B2 Bucket:

% ls -l access.log -rw-r--r-- 1 ppatterson staff 5525849 Feb 9 15:55 access.log

% sha1sum access.log ff46904e56c7f9083a4074ea3d92f9be2186bc2b access.log

The upload-file command outputs all of the file’s metadata, but we’ll focus on the SHA-1 digest, file info, and size.

% b2 upload-file my-bucket access.log access.log ... { ... "contentSha1": "ff46904e56c7f9083a4074ea3d92f9be2186bc2b", ... "fileInfo": { "src_last_modified_millis": "1675986940381" }, ... "size": 5525849, ... }

As you might expect, the digest and size match those of the local file.

Time passes, and our log file grows. I’ll first upload it as a different file, so that we can see the default behavior when the B2 Cloud Storage file is simply replaced:

% ls -l access.log -rw-r--r-- 1 ppatterson staff 11047145 Feb 9 15:57 access.log

% sha1sum access.log
7c97866ff59330b67aa96d7a481578d62e030788 access.log

% b2 upload-file my-bucket access.log new-access.log { ... "contentSha1": "7c97866ff59330b67aa96d7a481578d62e030788", ... "fileInfo": { "src_last_modified_millis": "1675987069538" }, ... "size": 11047145, ... }

Everything is as we might expect—the CLI uploaded 11,047,145 bytes to create a new file, which is 5,521,296 bytes bigger than the initial upload.

Now I’ll use the --incrementalMode option to replace the first Backblaze B2 file:

% b2 upload-file --quiet my-bucket access.log access.log ... { ... "contentSha1": "none", ... "fileInfo": { "large_file_sha1": "7c97866ff59330b67aa96d7a481578d62e030788", "plan_id": "ea6b099b48e7eb7fce01aba18dbfdd72b56eb0c2", "src_last_modified_millis": "1675987069538" }, ... "size": 11047145, ... }

The digest is exactly the same, but it has moved from contentSha1 to fileInfo.large_file_sha1, indicating that the file was uploaded as separate parts, resulting in a large file. The CLI didn’t need to upload the initial 5,525,849 bytes of the local file; it instead instructed Backblaze B2 to combine the existing file with the final 5,521,296 bytes of the local file to create a new version of the file.

There are several more new features and fixes to existing functionality in version 3.7.0—make sure to check out the B2 CLI changelog for a complete list.

Major Changes in B2 Python SDK 1.19.0

Most of the changes in the B2 Python SDK support the new features in the B2 CLI, such as adding wildcard matching to the Bucket.ls operation and adding support for incremental upload and sync. Again, you can inspect the B2 Python SDK changelog for a comprehensive list.

Get to Grips with B2 Command Line Tool Version 3.7.0 3.7.1

Whether you’re working on Windows, Mac or Linux, it’s straightforward to install or update the B2 CLI; full instructions are provided in the Backblaze B2 documentation.

Note that the latest version is now 3.7.1. The only changes from 3.7.0 are a handful of corrections to help text and that the Mac binary is no longer provided, due to shortcomings in the Mac version of PyInstaller. Instead, we provide the Mac version of the CLI via the Homebrew package manager.

The post Go Wild with Wildcards in the Backblaze B2 Command Line Tool 3.7.1 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Build a Cloud Storage App in 30 Minutes

2023-01-24 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/build-a-cloud-storage-app-in-30-minutes/

The working title for this developer tutorial was originally the “Polyglot Quickstart.” It made complete sense to me—it’s a “multilingual” guide that shows developers how to get started with Backblaze B2 using different programming languages—Java, Python, and the command line interface (CLI). But the folks on our publishing and technical documentation teams wisely advised against such an arcane moniker.

Editor’s Note

Full disclosure, I had to look up the word polyglot. Thanks, Merriam-Webster, for the assist.

Polyglot, adjective.: 1a: speaking or writing several languages: multilingual; 1b: composed of numerous linguistic groups; a polyglot population; 2: containing matter in several languages; a polyglot sign; 3: composed of elements from different languages; 4: widely diverse (as in ethnic or cultural origins); a polyglot cuisine

Fortunately for you, readers, and you, Google algorithms, we landed on the much easier to understand Backblaze B2 Developer Quick-Start Guide, and we’re launching it today. Read on to learn all about it.

Start Building Applications on Backblaze B2 in 30 Minutes or Less

Yes, you heard that correctly. Whether or not you already have experience working with cloud object storage, this tutorial will get you started building applications that use Backblaze B2 Cloud Storage in 30 minutes or less. You’ll learn how scripts and applications can interact with Backblaze B2 via the AWS SDKs and CLI and the Backblaze S3-compatible API.

The tutorial covers how to:

Sign up for a Backblaze B2 account.
Create a public bucket, upload and view files, and create an application key using the Backblaze B2 web console.
Interact with the Backblaze B2 Storage Cloud using Java, Python, and the CLI: listing the contents of buckets, creating new buckets, and uploading files to buckets.

This first release of the tutorial covers Java, Python, and the CLI. We’ll add more programming languages in the future. Right now we’re looking at JavaScript, C#, and Go. Let us know in the comments if there’s another language we should cover!

What Else Can You Do?

If you already have experience with Amazon S3, the Quick-Start Guide shows how to use the tools and techniques you already know with Backblaze B2. You’ll be able to quickly build new applications and modify existing ones to interact with the Backblaze Storage Cloud. If you’re new to cloud object storage, on the other hand, this is the ideal way to get started.

Watch this space for future tutorials on topics such as:

Downloading files from a private bucket programmatically.
Uploading large files by splitting them into chunks.
Creating pre-signed URLs so that users can access private files securely.
Deleting versions, files and buckets.

Want More?

Have questions about any of the above? Curious about how to use Backblaze B2 with your specific application? Already a wiz at this and ready to do more? Here’s how you can get in touch and get involved:

Sign up for Backblaze’s virtual user group.
Find us at Developer Week.
Let us know in the comments which programming languages we should add to the Quick-Start Guide.

The post Build a Cloud Storage App in 30 Minutes appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

How to Serve Data From a Private Bucket with a Cloudflare Worker

2023-01-18 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-serve-data-from-a-private-bucket-with-a-cloudflare-worker/

Customers storing data in Backblaze B2 Cloud Storage enjoy zero-cost downloads via our Content Delivery Network (CDN) partners: Cloudflare, Fastly, and Bunny.net. Configuring a CDN to proxy access to a Backblaze B2 Bucket is straightforward and improves the user experience, since the CDN caches data close to end-users. Ensuring that end-users can only access content via the CDN, and not directly from the bucket, requires a little more effort. A new technical article, Cloudflare Workers for Backblaze B2, provides the steps to serve content from Backblaze B2 via your own Cloudflare Worker.

In this blog post, I’ll explain why you might want to prevent direct downloads from your Backblaze B2 Bucket, and how you can use a Cloudflare Worker to do so.

Why Prevent Direct Downloads?

As mentioned above, Backblaze’s partnerships with CDN providers allow our customers to deliver content to end users with zero costs for data egress from Backblaze to the CDN. To illustrate why you might want to serve data to your end users exclusively through the CDN, let’s imagine you’re creating a website, storing your website’s images in a Backblaze B2 Bucket with public-read access, acme-images.

For the initial version, you build web pages with direct links to the images of the form https://acme-images.s3.us-west-001.backblazeb2.com/logos/acme.png. As users browse your site, their browsers will download images directly from Backblaze B2. Everything works just fine for users near the Backblaze data center hosting your bucket, but the further a user is from that data center, the longer it will take each image to appear on screen. No matter how fast the network connection, there’s no getting around the speed of light!

Aside from the degraded user experience, there are costs associated with end users downloading data directly from Backblaze. The first GB of data downloaded each day is free, then we charge $0.01 for each subsequent GB. Depending on your provider’s pricing plan, adding a CDN to your architecture can both reduce download costs and improve the user experience, as the CDN will transfer data through its own network and cache content close to end users. Another detail to note when comparing costs is that Backblaze and Cloudflare’s Bandwidth Alliance means that data flows from Backblaze to Cloudflare free of download charges, unlike data flowing from, for example, Amazon S3 to Cloudflare.

Typically, you need to set up a custom domain, say images.acme.com, that resolves to an IP address at the CDN. You then configure one or more origin servers or backends at the CDN with your Backblaze B2 Buckets’ S3 endpoints. In this example, we’ll use a single bucket, with endpoint acme-images.s3.us-west-001.backblazeb2.com, but you might use Cloud Replication to replicate content between buckets in multiple regions for greater resilience.

Now, after you update the image links in your web pages to the form https://images.acme.com/logos/acme.png, your users will enjoy an improved experience, and your operating costs will be reduced.

As you might have guessed, however, there is one chink in the armor. Clients can still download images directly from the Backblaze B2 Bucket, incurring charges on your Backblaze account. For example, users might have bookmarked or shared links to images in the bucket, or browsers or web crawlers might have cached those links.

The solution is to make the bucket private and create an edge function: a small piece of code running on the CDN infrastructure at the images.acme.com endpoint, with the ability to securely access the bucket.

Both Cloudflare and Fastly offer edge computing platforms; in this blog post, I’ll focus on Cloudflare Workers and cover Fastly Compute@Edge at a later date.

Proxying Backblaze B2 Downloads With a Cloudflare Worker

The blog post Use a Cloudflare Worker to Send Notifications on Backblaze B2 Events provides a brief introduction to Cloudflare Workers; here I’ll focus on how the Worker accesses the Backblaze B2 Bucket.

API clients, such as Workers, downloading data from a private Backblaze B2 Bucket via the Backblaze S3 Compatible API must digitally sign each request with a Backblaze Application Key ID (access key ID in AWS parlance) and Application Key (secret access key). On receiving a signed request, the Backblaze B2 service verifies the identity of the sender (authentication) and that the request was not changed in transit (integrity) before returning the requested data.

So when the Worker receives an unsigned HTTP request from an end user’s browser, it must sign it, forward it to Backblaze B2, and return the response to the browser. Here are the steps in more detail:

A user views a web page in their browser.
The user’s browser requests an image from the Cloudflare Worker.
The Worker makes a copy of the incoming request, changing the target host in the copy to the bucket endpoint, and signs the copy with its application key and key ID.
The Worker sends the signed request to Backblaze B2.
Backblaze B2 validates the signature, and processes the request.
Backblaze B2 returns the image to the Worker.
The Worker forwards the image to the user’s browser.

These steps are illustrated in the diagram below.

The signing process imposes minimal overhead, since GET requests have no payload. The Worker need not even read the incoming response payload into memory, instead returning the response from Backblaze B2 to the Cloudflare Workers framework to be streamed directly to the user’s browser.

Now you understand the use case, head over to our newly published technical article, Cloudflare Workers for Backblaze B2, and follow the steps to serve content from Backblaze B2 via your own Cloudflare Worker.

Put the Proxy to Work!

The Cloudflare Worker for Backblaze B2 can be used as-is to ensure that clients download files from one or more Backblaze B2 Buckets via Cloudflare, rather than directly from Backblaze B2. At the same time, it can be readily adapted for different requirements. For example, the Worker could verify that clients pass a shared secret in an HTTP header, or route requests to buckets in different data centers depending on the location of the edge server. The possibilities are endless.

How will you put the Cloudflare Worker for Backblaze B2 to work? Sign up for a Backblaze B2 account and get started!

The post How to Serve Data From a Private Bucket with a Cloudflare Worker appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Noise

All posts by Pat Patterson

Adding message history to the simple RAG chatbot

The collective thoughts of the interwebz