Tag Archives: Featured

Three Surprising Factors that Affect Cloud Performance

2024-09-03 Kari Rivas

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/three-surprising-factors-that-affect-cloud-performance/

A decorative image showing a cloud and data graphs.

When you think about cloud performance, metrics like latency and throughput are probably the first things that come to mind. We covered those metrics pretty extensively in earlier posts. So, today, I’m walking through some factors that affect cloud performance that may not get talked about as often, including:

The size of your files.
The number of parts you upload or download.
Block (part) size.

These factors may not be “surprising” per se, especially if you remember the pain of trying to download The Matrix over dial-up. But they are all things that you should consider (and that you have more control over) when thinking about cloud performance overall.

Let’s dig in.

1. The size of your files

This one is pretty obvious. Larger files take longer because they require more data to be transferred. If you have a 10Mbps upload connection, a 1GB file will take approximately 800 seconds (13 minutes and 20 seconds) to upload, whereas a 100MB file will take about 80 seconds (a minute and 20 seconds). Most enterprise-grade internet connections offer higher upload speeds, but 10Mbps makes the math approachable for the sake of argument.
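The arithmetic above is easy to check in a few lines of Python. This is a minimal sketch that assumes decimal units (1GB = 1,000MB, eight bits per byte) and ignores latency and protocol overhead; the function name is made up for illustration.

```python
def transfer_seconds(size_megabytes: float, link_megabits_per_second: float) -> float:
    """Idealized transfer time: payload size in megabits divided by link speed."""
    size_megabits = size_megabytes * 8  # eight bits per byte
    return size_megabits / link_megabits_per_second


print(transfer_seconds(1000, 10))  # 1GB over a 10Mbps link -> 800.0
print(transfer_seconds(100, 10))   # 100MB over a 10Mbps link -> 80.0
```

Real transfers will take somewhat longer, since this leaves out everything except raw payload and link speed.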

Small files—that is, those less than 5GB—can be uploaded in a single API call. (Note: this can vary based on cloud storage provider and configuration.) Larger files up to 10TB can be uploaded as “parts” in multiple API calls. Each part has to be a minimum of 5MB and a maximum of 5GB.

You’ll notice that there is quite an overlap here! For uploading files between 5MB and 5GB, is it better to upload them in a single API call, or split them into parts? What is the optimum part size? For backup applications, which typically split all data into equally sized blocks, storing each block as a file, what is the optimum block size? As with many questions, the answer is: it depends.

2. The number of parts you upload or download

Each API call incurs a more-or-less fixed overhead due to latency. For a 1GB file, assuming a single thread of execution, uploading all 1GB in a single API call will be faster than 10 API calls each uploading a 100MB part, since those additional nine API calls each incur some latency overhead. So, bigger is better, right?

3. Block (part) size

Not necessarily, and that brings us to part size. The comparison above assumed a single thread of execution; multi-threading affords us the opportunity to upload multiple parts simultaneously, which improves performance, but there are trade-offs. Typically, each part must be stored in memory as it is uploaded, so more threads means more memory consumption. If the number of threads multiplied by the part size exceeds available memory, then either the application will fail with an out-of-memory error, or data will be swapped to disk, reducing performance.
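Both trade-offs can be put into rough numbers: the per-call overhead from the previous section, and the memory cost of uploading parts in parallel. The sketch below is a simplified model, not a benchmark. The 0.1 second per-call overhead and the 512MB memory budget are assumed values, and the function names are illustrative.

```python
def single_thread_upload_seconds(size_mb: float, parts: int,
                                 link_mbps: float = 10.0,
                                 per_call_overhead_s: float = 0.1) -> float:
    """Transfer time plus a fixed overhead for every API call (single thread)."""
    transfer_s = size_mb * 8 / link_mbps
    return transfer_s + parts * per_call_overhead_s


def peak_buffer_mb(threads: int, part_size_mb: int) -> int:
    """Memory needed if every thread holds one full part in memory."""
    return threads * part_size_mb


def max_threads_for_budget(part_size_mb: int, memory_budget_mb: int) -> int:
    """Largest thread count whose part buffers fit in the memory budget."""
    return memory_budget_mb // part_size_mb


one_call = single_thread_upload_seconds(1000, parts=1)
ten_calls = single_thread_upload_seconds(1000, parts=10)
print(f"extra time for ten calls: {ten_calls - one_call:.1f}s")  # 0.9s

print(peak_buffer_mb(threads=8, part_size_mb=100))  # 800
print(max_threads_for_budget(part_size_mb=100, memory_budget_mb=512))  # 5
```

With these assumed numbers, eight threads holding 100MB parts would need 800MB of buffers, so a 512MB budget caps you at five threads or pushes you toward smaller parts.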

Downloading data offers even more flexibility, since applications can specify any portion of the file to download in each API call. Whether uploading or downloading, there is a maximum number of threads that will drive throughput to consume all of the available bandwidth. Exceeding this maximum will consume more memory, but provide no performance benefit.
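As a rough illustration of downloading arbitrary portions of a file, the sketch below splits a file into byte ranges that separate threads could each request. It assumes the storage API accepts standard HTTP Range headers, whose end offsets are inclusive; the function name and sizes are illustrative.

```python
def byte_ranges(file_size: int, part_size: int) -> list[str]:
    """Return HTTP Range header values covering the file in part_size chunks."""
    if file_size < 0 or part_size <= 0:
        raise ValueError("file_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, file_size, part_size):
        end = min(start + part_size, file_size) - 1  # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges


print(byte_ranges(250, 100))
# ['bytes=0-99', 'bytes=100-199', 'bytes=200-249']
```

Because the ranges are computed on the client, the part size for downloads can be tuned freely without re-uploading anything.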

So, what to do to get the best performance possible for your use case?

Simple: Customize your settings

Most backup and file transfer tools allow you to configure the number of threads and the amount of data to be transferred per API call, whether that’s block size or part size. If you are writing your own application, you should allow for these parameters to be configured. When it comes to deployment, some experimentation may be required to achieve maximum throughput given available memory.

The big takeaway: When it comes to cloud performance, the metrics you need to care about and the performance you actually need are highly dependent on your use case, your own infrastructure, your workload, and all the network connections between your infrastructure and the cloud provider as well. So, when you’re deciding how to store and use your data, it’s worth taking some extra time to consider the above factors for optimum performance.

The post Three Surprising Factors that Affect Cloud Performance appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Five Tips for Creating a Predictable Cloud Storage Budget

2024-08-29 David Johnson

Post Syndicated from David Johnson original https://www.backblaze.com/blog/calculate-cost-cloud-storage/

A decorative image showing buildings, data, and icons indicating cost.

Editor’s Note

This post has been updated since it was originally published.

With spending on public cloud services expected to double by 2028, many businesses are looking for ways to cut cloud costs—or at least gain predictability in their spend. Forecasting cloud storage costs should be straightforward once you know what to look for.

Here are five tips you can use when doing your due diligence on the cloud storage vendors you are considering. The goal is to create a cloud storage forecast that you can rely on each and every month.

Tip 1: Navigate tiered pricing structures carefully

Many cloud providers still use tiered pricing structures, which can be misleading if not carefully understood. For example:

AWS S3 Storage Pricing Example

For this post, we’re comparing with hypothetical data stored in AWS S3’s U.S. East Region (N. Virginia) using pricing available at the time of publishing. Note that many factors may affect your final price, including selecting a different region, choosing a different storage tier, etc.

First 50 TB/month = $0.023 per GB
Next 450 TB/month = $0.022 per GB
Over 500 TB/month = $0.021 per GB

In order to receive lower pricing, you have to reach a specific amount of data stored. But, the lower rate only applies to data above the threshold for that tier. In other words, you don’t get a discount on the cumulative amount—each pricing tier is reflected in the data you’re storing.

The mistake sometimes made is estimating your entire storage cost based on the level for the total data stored. For example, if you had 600TB of storage, you could wrongly calculate as follows:

600,000GB x $0.021 = $12,600/month

When, in fact, you should do the following:

(50,000GB x $0.023) + (450,000GB x $0.022) + (100,000GB x $0.021) = $13,150/month
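If you want to run the same calculation for other data volumes, a minimal Python sketch might look like this. It hard-codes the three example tiers listed above and assumes 1TB = 1,000GB, as the worked example does; the names are illustrative.

```python
# (tier size in GB, price per GB); None means "everything above the earlier tiers"
S3_TIERS = [
    (50_000, 0.023),
    (450_000, 0.022),
    (None, 0.021),
]


def tiered_monthly_cost(stored_gb: float, tiers=S3_TIERS) -> float:
    """Charge each slice of data at the rate of the tier it falls into."""
    remaining = stored_gb
    total = 0.0
    for tier_size_gb, rate in tiers:
        if remaining <= 0:
            break
        billed = remaining if tier_size_gb is None else min(remaining, tier_size_gb)
        total += billed * rate
        remaining -= billed
    return total


print(f"${600_000 * 0.021:,.2f}")                # the mistaken flat-rate estimate: $12,600.00
print(f"${tiered_monthly_cost(600_000):,.2f}")   # the tiered calculation: $13,150.00
```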

That was just for storage. Make sure you consider the tiered pricing tables for data retrieval and API transactions as well.

Tip 2: Don’t choose the wrong storage class

Many cloud providers, especially hyperscalers, now offer a wider array of storage classes than ever before. The idea is that you can trade service capabilities for lower costs. If you don’t need immediate access to your files or don’t want data replication or 11 nines of durability, you can choose to downgrade your service and gain cost savings. The biggest problem with this method is that you have to know what you are going to do with your data to pick the right service—as well as correctly anticipate future business needs—because mistakes can get very expensive. For example:

You choose a low cost, cold storage tier that takes hours or days to restore your data. What can go wrong? You need some files back immediately (if, for example, your backups are corrupted by ransomware) and you end up paying 10-20 times the cost to expedite your restore.
You choose one storage class and decide you want to upload some data to a compute-based application or to another region—features not part of your current service. The good news? You can usually move the data. The bad news? Even if you’re transferring within the same cloud storage company’s infrastructure, you’re often charged a transfer fee to move the data because you didn’t choose the right storage class when you started. These fees often eradicate any “savings” you had gotten from the lower priced tier.

Basically, if your needs change as they pertain to the data you have stored, you will pay more than you expect to get your data where you need it to be.

Tip 3: Don’t pay for deleted (or modified) files

Some cloud storage companies have a minimum amount of time you are charged for storage for each file uploaded. Typically this minimum period is between 30 and 90 days. You are charged even if you delete the file before the minimum period. For example (assuming a 90 day minimum period), if you upload a file today and delete the file tomorrow, you still have to pay for storing that deleted file for the next 88 days.

This “feature” often extends to files deleted due to versioning. If you set your system to keep three versions of each file, with older versions automatically deleted, you end up paying for those deleted versions for the full minimum duration.

In a typical backup workflow, let’s say you are using a cloud storage service to store your files and your backup program is set to a 30 day retention. That means you will be perpetually paying for an additional 60 days worth of storage (for files that were pruned at 30 days). In other words, you would be paying for a 90 day retention period even though you only have 30 days worth of backups.
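The effect of a minimum storage duration can be sketched as a one-line rule: you are billed for whichever is longer, the time the file actually existed or the minimum period. The function name below is illustrative, and real providers may prorate or bill differently.

```python
def billed_days(days_stored: int, minimum_days: int) -> int:
    """Days you pay for: the actual retention or the minimum, whichever is longer."""
    return max(days_stored, minimum_days)


retention = 30
minimum = 90
paid = billed_days(retention, minimum)
print(paid)              # 90
print(paid - retention)  # 60 extra days billed for files pruned at 30 days
```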

Tip 4: Beware of hidden minimums

As the cloud storage market has matured, pricing models have become more complicated. To create an accurate budget, it’s crucial to understand all potential cost components, including some that might not be immediately obvious. Here are two key areas to examine:

Minimum monthly charges: Some providers charge a set fee regardless of how little you store. For instance, you might pay for 1TB even if you only use 100GB.
Minimum file sizes: Some services round up small files to a minimum billable size, often 128KB. While this might seem insignificant, it can add up quickly if you have millions of small files.
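To see how a minimum billable file size adds up, here is a small sketch. It assumes the 128KB minimum mentioned above means 128 x 1,024 bytes, and the file sizes are made-up examples.

```python
MIN_BILLABLE_BYTES = 128 * 1024  # assumed interpretation of "128KB"


def billable_bytes(file_sizes, minimum: int = MIN_BILLABLE_BYTES) -> int:
    """Sum of file sizes after rounding each small file up to the minimum."""
    return sum(max(size, minimum) for size in file_sizes)


# One million hypothetical 4KB files
sizes = [4096] * 1_000_000
actual = sum(sizes)
billed = billable_bytes(sizes)
print(billed // actual)  # 32, so you would be billed for 32 times the data you store
```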

Tip 5: Be suspicious of the fine print

Misdirection is the art of getting you to focus on one thing so you don’t focus on other things going on. Practiced by magicians and some cloud storage companies, the idea is to get you to focus on certain features and capabilities without delving below the surface into the fine print. (And, sometimes the prices this technique generates feel like someone has pulled a rabbit out of a hat, to your company’s detriment.)

Read the fine print, and as you scroll through the multi-page pricing tables and linked pages of all of the rules that shape how you can use a given cloud storage service, stop and ask, “What are they trying to hide?” If you find phrases like: “We reserve the right to limit your egress traffic,” or “New users get free usage tier for 12 months,” or “Provisioned requests should be used when you need a guarantee that your retrieval capacity will be available when you need it,” take heed.

And, even if it seems like you can turn the tables and use things like free credits in the short term, remember that you’ll want to have a plan for your long-term infrastructure when those credits run out as well.

How to build a predictable cloud storage budget

As organizations increasingly rely on cloud storage for everything from day-to-day operations to long-term data archiving, the ability to accurately forecast and control these costs can significantly impact overall IT budgets and business planning.

The first place to start is data storage as it’s generally the easiest for a company to calculate. For a given month, you can calculate your data volume as follows:

Data stored = current data + new data – deleted data

Take that total and multiply by the monthly storage rate and you’ll get your monthly storage costs.
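Here is that formula as a minimal Python sketch. The volumes and the $6 per TB rate are placeholder inputs for illustration, so substitute your own numbers and your provider's actual rate.

```python
def data_stored_tb(current_tb: float, new_tb: float, deleted_tb: float) -> float:
    """Data stored = current data + new data - deleted data."""
    return current_tb + new_tb - deleted_tb


def monthly_storage_cost(stored_tb: float, rate_per_tb: float) -> float:
    """Stored volume multiplied by the monthly storage rate."""
    return stored_tb * rate_per_tb


stored = data_stored_tb(current_tb=100, new_tb=10, deleted_tb=5)
print(stored)                                          # 105
print(monthly_storage_cost(stored, rate_per_tb=6.0))   # 630.0 at the assumed rate
```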

Things can get more complicated if your business regularly uploads and downloads data. The data stored at the end of the month should get you at least in the ballpark. But, creating a predictable cloud storage budget requires a holistic understanding of your data needs, usage patterns, and the pricing structures of your chosen provider. It’s not just about estimating how much data you’ll store, but also how you’ll interact with that data over time. Will you be frequently accessing and modifying files, or primarily using the storage for long-term archiving? Are there seasonal fluctuations in your data storage or retrieval patterns? These factors can all influence your overall costs, and we’ll walk through a scenario to show that next.

Let’s do the math

To illustrate how to calculate your cloud storage costs, let’s work through an example using current Backblaze B2 pricing. We’ll focus on a single month for a growing business that is backing up business data to the cloud and verifying their backups have zero errors during recovery:

Initial storage at the beginning of the month: 100TB
New data added during the month: 10TB
Data deleted during the month: 5TB
Downloads during the month (egress): 75TB

Backblaze has built a cloud storage calculator that computes costs for all of the major cloud storage providers. Using this calculator, we find that Amazon S3 would cost $2,675 to store this data for a month, while Backblaze B2 would charge just $630.

Using those numbers for storage and assuming you download 75TB a month for backup validation testing, you get a total monthly cost of $8,725 for Amazon S3; Backblaze B2 would be $630 a month.

The additional cost you see from AWS S3 is from download costs, also known as egress fees, and they can certainly take a toll on your budget. Backblaze offers free egress up to three times the amount you have stored so you can move data when and where you prefer.

The chart below provides the breakdown of the expected cost.

Cost	Backblaze B2	Amazon S3
Storage	$630	$2,675
Egress	Free*	$6,050
Total	$630	$8,725

*Up to 3x of average monthly data stored, then $0.01/GB for additional egress.
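The egress footnote can be turned into a quick check. This sketch assumes 1TB = 1,000GB and uses the month's stored volume as a stand-in for "average monthly data stored"; the function name is illustrative.

```python
def b2_egress_cost(stored_tb: float, egress_tb: float,
                   free_multiple: float = 3.0,
                   overage_per_gb: float = 0.01) -> float:
    """Egress is free up to free_multiple x stored data, then billed per GB."""
    free_tb = stored_tb * free_multiple
    overage_tb = max(0.0, egress_tb - free_tb)
    return overage_tb * 1000 * overage_per_gb  # convert TB to GB


print(b2_egress_cost(stored_tb=105, egress_tb=75))   # 0.0, well within the 315TB allowance
print(round(b2_egress_cost(stored_tb=105, egress_tb=400), 2))  # 850.0 for 85TB of overage
```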

Of course each month you will add and delete storage, so you’ll have to account for that in your forecast. And, as we mentioned above, there may also be other fees like minimum storage duration fees or API transaction fees. Using the cloud storage calculator noted above, you can get a reasonable estimate of your total cost over the budget forecasting period.

Finally, you can use the Backblaze B2 storage calculator to address potential use cases that are outside of your normal operations, such as if you delete a large project from your storage or you need to download a large amount of data. Running the calculator for these types of actions lets you obtain a solid estimate for their effect on your budget before they happen and lets you plan accordingly.

Understanding cloud storage pricing gives you options

Creating a predictable cloud storage forecast is key to taking full advantage of all of the value in cloud storage. Organizations like Austin City Limits, Amplify, and Runbiz were able to move to the cloud because they could reliably predict their cloud storage cost with Backblaze B2. You don’t have to let pricing tiers, hidden costs, and fine print stop you. Backblaze makes predicting your cloud storage costs easy.

The post Five Tips for Creating a Predictable Cloud Storage Budget appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Network Stats: Ingress Trends and What They Tell Us About Backup Behaviors

2024-08-22 Brent Nowak

Post Syndicated from Brent Nowak original https://www.backblaze.com/blog/backblaze-network-stats-ingress-trends-and-what-they-tell-us-about-backup-behaviors/

An image with a background pattern of trend lines and the words "Network Stats Ingress Trends and what they tell us"

Every day, thousands of Backblaze customers create and update files. These changes make their way into our system to be securely stored. Sometimes they are sent to us immediately, while other times the differentials are batched up into a job that runs at a scheduled time.

In this post, I’m sampling three points in our network where we take in a lot of ingress traffic off of the internet, and we’re going to explore some of the trends that we see.

Reading the ingress tea leaves

So, why do we care about ingress trends? In short, it helps us with capacity planning, and it also tells us a lot about how people use cloud storage. We often think of planning in longer terms—weeks, months, or years. Here I wanted to focus on some of the patterns that we see during a shorter period; for example, a single day or a significant date, like the end of the calendar month. There are some interesting patterns we see in our client behavior that keep us on our toes when we are performing capacity planning.

We currently have two product offerings that have different usage and traffic patterns:

Backblaze B2 Cloud Storage: Ingress and egress, with high variance in traffic levels throughout the day and hour, and at the start of the month.
Backblaze Computer Backup: Heavy ingress, with a small variance in traffic levels during the business day or weekday vs. weekend.

Since humans are using our system, we see very human quirks in our traffic profiles. For example, we humans like round numbers! We notice that a lot of backup jobs kick off at midnight local or UTC, or fire off at the top of the hour, or trigger on the first of the month. This means we see spikes of network traffic during these periods. Additionally, a lot of new content gets created during the day and then queued up to be uploaded to us in an overnight backup job.

Scope and terms

Today we’re going to look at ingress traffic, which means we’re monitoring uploads from both Backblaze Computer Backup and Backblaze B2 into our environment. We’ll save downloads, traffic coming out of Backblaze, for analysis in future posts.

One common term that you’ll see on our graphs is the 95th percentile. The 95th percentile number is the point that 95% of all measurements fall under and only 5% exceed. This is a very typical method to use for monitoring, billing, and trend analysis in the telecom industry. It maps to a standard bell curve, and tells you that you’re capturing the vast majority of usage for planning purposes.

A chart displaying a bell curve and percentiles — A standard bell curve. Source.

In one of our monitoring systems, we are sampling and recording the utilization on our network links and computing a 95th percentile over a five minute period.
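For readers who want to see the calculation, here is a minimal sketch of a 95th percentile using the nearest-rank method. This is one common definition, not necessarily the one our monitoring system uses, and the sample values are made up.

```python
import math


def percentile_nearest_rank(samples, percentile: float = 95.0):
    """Smallest sample that at least `percentile` percent of samples do not exceed."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Twenty hypothetical link utilization readings: steady traffic plus one spike
readings = [10] * 19 + [100]
print(percentile_nearest_rank(readings))  # 10, the single spike falls in the top 5%
print(max(readings))                      # 100
```

This is why the 95th percentile is popular for planning: a brief spike does not dominate the number the way a maximum would.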

With these items defined, let’s get into the data with some charts!

Sample 1: One-month trend

In this first sample, we see that the majority of our daily traffic falls within a nice range. What stands out here is the clock tick over from February to March, where we see a spike of ingress traffic that is outside the expected daily range.

A chart displaying a sample of ingress trends over one month.

Taking that same dataset, let’s take a closer look at the end of the month and zoom in on the calendar change into March.

Adding a vertical red line on 00:00 UTC where the month changes over, we see that there must be a lot of automated jobs that kick in right at the clock changeover into the new month.

A chart showing ingress trends over 7 days.

Sample 2: Top of the hour

Taking a look at another traffic sample from another point in our network, we see very distinct traffic patterns on the top of almost every hour.

A chart showing ingress trends over 24 hours

Sample 3: Pacific Time Zone working hours

Here’s a sample of traffic in our US-West region. During the business day on the West Coast, we see a lull in traffic, with a pickup after the business day is done. This makes sense to us as there are jobs that back up daily content that start to send traffic to us overnight.

A chart showing ingress trends over three days.

What does this mean for you?

It’s very interesting to see the impact of humans in our network traffic and the patterns that emerge. Generally we humans create and modify things during the day, and we like to back them up overnight for safekeeping. And we also like round numbers: people tend to send data at the top of the hour, midnight, or end of the month.

All of these elements are very important in how we, at Backblaze, capacity plan and balance traffic over transit links. We do a lot of work to make sure that no matter what time of day or day of the month, you can reliably get your data into Backblaze.

But, you might also look at this data and take away a meaningful conclusion: Much like choosing to go to the grocery store at 10:30 a.m. on a Tuesday versus fighting the after-work rush at 6:00 p.m., scheduling jobs on the 15, 30, or 45 minute mark or mid-month instead of at the end of the month would mean you’re up against less traffic, which is never a bad thing (and it also smooths out our ingress, which we wouldn’t be mad about either).
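If you manage many jobs, one way to spread them across those quieter slots is to derive each job's start minute from its name, so the choice is stable but not everyone lands on the same mark. This is only a sketch: the job names are hypothetical, and the output is a standard five-field cron expression for a daily job.

```python
import hashlib

OFF_PEAK_MINUTES = (15, 30, 45)  # the marks suggested above, avoiding the top of the hour


def off_peak_minute(job_name: str) -> int:
    """Pick a stable off-peak minute for a job based on a hash of its name."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return OFF_PEAK_MINUTES[digest[0] % len(OFF_PEAK_MINUTES)]


def daily_cron_expression(job_name: str, hour: int = 2) -> str:
    """Cron fields: minute, hour, day of month, month, day of week."""
    return f"{off_peak_minute(job_name)} {hour} * * *"


for name in ("nightly-backup", "photo-archive"):
    print(name, "->", daily_cron_expression(name))
```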

At the end of the day, however you choose to schedule your jobs works for us. We’re just glad we’re able to store and protect our customers’ data reliably and affordably, and we’re happy to pass along any tips and tricks for a better, less congested, backup experience as well.

Thanks for reading, and stay tuned for more graphs and commentary on how we strive to build a reliable, scalable, and forward-looking network to serve our customers’ needs.

The post Backblaze Network Stats: Ingress Trends and What They Tell Us About Backup Behaviors appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How to Build Your Own LLM with Backblaze B2 + Jupyter Notebook

2024-08-13 Pat Patterson

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/how-to-build-your-own-llm-with-backblaze-b2-jupyter-notebook/

A decorative image showing a computer, a cloud, and a building.

Last month, Backblaze Principal Storage Cloud Storyteller, Andy Klein, and I presented a webinar, Leveraging your Cloud Storage Data in AI/ML Apps and Services, in which we explored the various ways AI/ML applications use and store data. In addition to sharing insights from our customers who leverage Backblaze B2 Cloud Object Storage for their AI/ML needs, we also demonstrated a simple AI use case: a retrieval-augmented generation (RAG) chatbot answering questions based on custom context downloaded from a private Backblaze B2 Bucket. After the webinar, I posted the demo source code to a GitHub repository: https://github.com/backblaze-b2-samples/ai-rag-examples.

In this blog post, I’ll recap a little of the webinar, and explain how you can use the demo source code as a basis for your own experiments with RAG and large language models (LLMs).

But why, Pat?

You might be asking yourself, why build a chatbot when multiple online services, such as ChatGPT, are just a click away? It’s a good question. The main motivation is privacy. Suppose you want your chatbot to use confidential data in generating its responses. By running an LLM on your own machine, whether on-premises or on a virtual machine (VM) under your control in a public cloud provider’s infrastructure, you eliminate the risks of that data surfacing in response to a question from a user outside your organization.

In the webinar, I showed two Python apps running on a GPU-equipped VM at Vultr, one of Backblaze’s compute partners. The first app used the GPT4All Python SDK to create a very simple conversational chatbot running a local instance of a large language model (LLM), which it used in answering general questions. Here’s an example from the webinar:

Ask me a question: What were the causes of the First World War?
The causes of the First World War are complex and varied, but some key factors include nationalism, imperialism, militarism, alliances between nations, and the assassination of Archduke Franz Ferdinand. Nationalism led to tensions between ethnic groups and their respective governments, while imperialism saw countries competing for territories and resources around the world. Militarism also played a role as many European powers built up large armies and navies in preparation for potential conflicts. The complex web of alliances between nations meant that any conflict could quickly escalate into a larger war, and the assassination of Archduke Franz Ferdinand by a Serbian nationalist was the spark that ignited the powder keg of tensions already present in Europe at the time.

Now, I’m not a history scholar, but that looks like a pretty good answer to me! (History scholars, you are welcome to correct me.)

The second app used the LangChain framework to implement a more elaborate chatbot, again running on my own machine at Vultr, that used PDF data downloaded from a private bucket in Backblaze B2 as context for answering questions. As much as I love our webinar attendees, I didn’t want to share genuinely confidential data with them, so I used our Backblaze B2 Cloud Storage documentation as context. The chatbot was configured to use that context, and only that context, in answering questions. From the webinar:

Ask me a question about Backblaze B2: What's the difference between the master application key and a standard application key?

The master application key provides complete access to your account with all capabilities, access to all buckets, and has no file prefix restrictions or expiration. On the other hand, a standard application key is limited to the level of access that a user needs and can be specific to a bucket.

Ask me a question about Backblaze B2: What were the causes of the First World War?

The exact cause of the First World War is not mentioned in these documents.

The chatbot provides a comprehensive, accurate answer to the question on Backblaze application keys, but doesn’t answer the question on the causes of the First World War, since it was configured to use only the supplied context in generating its response.

During the webinar’s question-and-answer session, an attendee posed an excellent question: “Can you ask [the chatbot] follow-up questions where it can use previous discussions to build a proper answer based on content?” I responded, “Yes, absolutely; I’ll extend the demo to do exactly that before I post it to GitHub.” What follows are instructions for building a simple RAG chatbot, and then extending it to include message history.

Building a simple RAG chatbot

After the webinar, I rewrote both demo apps as Jupyter notebooks, which allowed me to add commentary to the code. I’ll provide you with edited highlights here, but you can find all of the details in the RAG demo notebook.

The first section of the notebook focuses on downloading PDF data from the private Backblaze B2 Bucket into a vector database, a storage mechanism particularly well suited for use with RAG. This process involves retrieving each PDF, splitting it into uniformly sized segments, and loading the segments into the database. The database stores each segment as a vector with many dimensions—we’re talking hundreds, or even thousands. The vector database can then vectorize a new piece of text—say a question from a user—and very quickly retrieve a list of matching segments.
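To make the idea of "matching segments" concrete, here is a toy similarity search in plain Python. It is only an illustration: the three-dimensional vectors and segment texts are invented by hand, whereas a real vector database gets its vectors from an embedding model, and cosine similarity is assumed here as the distance measure.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vector, segments, k: int = 2):
    """Return the text of the k segments whose vectors are closest to the query."""
    ranked = sorted(
        segments,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical segments with hand-written vectors
segments = [
    ("Application keys control access to buckets.", [0.9, 0.1, 0.0]),
    ("Lifecycle rules hide or delete old file versions.", [0.1, 0.9, 0.1]),
    ("The master application key has all capabilities.", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]  # stands in for a vectorized question about application keys

print(top_matches(query, segments))
```

With these made-up vectors, the two application key segments rank ahead of the lifecycle rules segment.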

Since this process can take significant time—about four minutes on my MacBook Pro M1 for the 225 PDF files I used, totaling 58MB of data—the notebook also shows you how to archive the resulting vector data to Backblaze B2 for safekeeping and retrieve it when running the chatbot later.

The vector database provides a “retriever” interface that takes a string as input, performs a similarity search on the vectors in the database, and outputs a list of matching documents. Given the vector database, it’s easy to obtain its retriever:

retriever = vectorstore.as_retriever()

The prompt template I used in the webinar provides the basic instructions for the LLM: use this context to answer the user’s question, and don’t go making things up!

from langchain_core.prompts import PromptTemplate

# Instructions for the LLM; {context} and {question} are filled in by the chain.
prompt_template = """Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    {context}

    Question: {question}
    Helpful Answer:"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

The RAG demo app creates a local instance of an LLM, using GPT4All with Nous Hermes 2 Mistral DPO, a fast chat-based model. Here’s an abbreviated version of the code:

from langchain_community.llms import GPT4All

# LangChain's GPT4All wrapper runs the model locally.
model = GPT4All(
    model='Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf',
    max_tokens=4096,
    device='gpu'
)

LangChain, as its name suggests, allows you to combine these components into a chain that can accept the user’s question and generate a response.

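from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Retrieve context for the question, fill in the prompt, run the LLM, return plain text.
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)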

As mentioned above, the retriever takes the user’s question as input and returns a list of matching documents. The user’s question is also passed through the first step, and, in the second step, the prompt template combines the context with the user’s question to form the input to the LLM. If we were to peek inside the chain as it was processing the question about application keys, the prompt’s output would look something like this:

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

<Text of first matching document>

<Text of second matching document>

Question: What's the difference between the master application key and a standard application key?

Helpful Answer:

This is the basis of RAG: building an LLM prompt that contains the information required to generate an answer, then using the LLM to distill that prompt into an answer. The final step of the chain transforms the data structure emitted by the LLM into a simple string for display.
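The prompt-building step can be imitated in plain Python to show what the LLM actually receives. This is an illustration only: LangChain does its own formatting of the retrieved documents, and the `build_prompt` helper and sample documents below are hypothetical.

```python
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""


def build_prompt(documents, question: str) -> str:
    """Join the retrieved document texts and substitute them into the template."""
    context = "\n\n".join(documents)
    return PROMPT_TEMPLATE.format(context=context, question=question)


print(build_prompt(
    ["<Text of first matching document>", "<Text of second matching document>"],
    "What's the difference between the master application key and a standard application key?",
))
```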

Now that we have a chain, we can ask it a question. Again, abbreviated from the sample code:

question = 'What is the difference between the master application key and a standard application key?'
answer = chain.invoke(question)

Adding message history to the simple RAG chatbot

The first step of extending the chatbot is to give the LLM new instructions, similar to its previous prompt template, but including the message history:

prompt_template = """Use the following pieces of context and the message history to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
Context: {context}
    
History: {history}
    
Question: {question}

Helpful Answer:"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question", "history"]
)

The chain must be modified slightly to accommodate the message history:

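from operator import itemgetter

# The input is now a dict, so pick out the question (for retrieval) and the history.
chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
        "history": itemgetter("history"),
    }
    | prompt
    | model
    | StrOutputParser()
)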

Now, we define a very simple in-memory message store that uses a session_id parameter to manage multiple simultaneous conversations:

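from langchain_core.chat_history import BaseChatMessageHistory, InMemoryChatMessageHistory

# Maps each session_id to its own message history.
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    # Create an empty history the first time a session is seen.
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]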

LangChain provides a wrapper, RunnableWithMessageHistory, that combines the message store with the above chain to create a new chain with message history capability:

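from langchain_core.runnables.history import RunnableWithMessageHistory

# The key names must match the "question" and "history" keys the chain reads.
with_message_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
)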

Now we can feed a series of related questions into the new chain:

questions = [
    'What is the difference between the master application key and a standard application key?',
    'Which one would I use to work with a single bucket?',
    'Can you tell me anything more about this topic?'
]

for question in questions:
    print(f'\n{question}\n')
    answer = with_message_history.invoke(
        {"question": question},
        config={"configurable": {"session_id": "abc123"}},
    )
    print(f'{answer}\n')

I have to admit, I was pleasantly surprised by the results:

What is the difference between the master application key and a standard application key?

A master application key grants broad access privileges, while a standard application key is limited to the level of access that a user needs.

Which one would I use to work with a single bucket?

You would use a standard application key to work with a single bucket as it has limited access and only grants permissions needed for specific tasks, unlike the master application key which provides broad access privileges.

Can you tell me anything more about this topic?

Sure! The master application key is typically used by developers during development or testing phases to grant full access to all resources in a Backblaze B2 account, while the standard application key provides limited permissions and should be used for production environments where security is paramount.

Processing this series of questions on my MacBook Pro M1 with no GPU acceleration took three minutes and 25 seconds, and just 52 seconds with its 16-core GPU. For comparison, I spun up a VM at Ori, another Backblaze partner offering GPU VM instances, with an Nvidia L4 Tensor Core GPU and 24GB of VRAM. The only code change required was to set the LLM device to ‘cuda’ to select the Nvidia GPU. The Ori VM answered those same questions in just 18 seconds.

An image of an Nvidia L4 Tensor Core GPU — The Nvidia L4 Tensor Core GPU: not much to look at, but crazy-fast AI inference!

Go forth and experiment

One of the reasons I refactored the demo apps was that notebooks allow an interactive, experimental approach. You can run the code in a cell, make a change, then re-run it to see the outcome. The RAG demo repository includes instructions for running the notebooks, and both the GPT4All and LangChain SDKs can run LLMs on machines with or without a GPU. Use the code as a starting point for your own exploration of AI, and let us know how you get on in the comments!

The post How to Build Your Own LLM with Backblaze B2 + Jupyter Notebook appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Drive Stats for Q2 2024

2024-08-06 Andy Klein

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2024/

A decorative image with the headline Q2 2024 Drive Stats.

As of the end of Q2 2024, Backblaze was monitoring 288,665 hard drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world. We removed from this analysis 3,789 boot drives, consisting of 2,923 SSDs and 866 hard drives. This leaves us with 284,876 hard drives under management to review for this report. We’ll review the annualized failure rates (AFRs) for Q2 2024 and the lifetime AFRs of the qualifying drive models, and we’ll also check out drive age versus failure rates over time. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard drive failure rates for Q2 2024

For our Q2 2024 quarterly analysis, we remove from consideration: drive models which did not have at least 100 drives in service at the end of the quarter, drive models which did not accumulate 10,000 or more drive days during the quarter, and individual drives which exceeded their manufacturer’s temperature specification during their lifetime. The removed pool totalled 490 drives, leaving us with 284,386 drives grouped into 29 drive models for our Q2 2024 analysis.

The table below lists the AFRs and related data for these drive models. The table is sorted large to small by drive size then by AFR within drive size.
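For readers who want to reproduce the arithmetic behind the table, here is a minimal sketch. It assumes the AFR formula Backblaze has published for Drive Stats (failures divided by drive-years, expressed as a percentage) and applies the quarterly thresholds stated above. The function names and the sample numbers are hypothetical, not taken from the report.

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Return AFR as a percentage: failures / (drive_days / 365) * 100."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    return failures / (drive_days / 365) * 100


def qualifies_for_quarter(drive_count: int, drive_days: int) -> bool:
    """Quarterly criteria from the text: at least 100 drives in service
    and at least 10,000 drive days accumulated during the quarter."""
    return drive_count >= 100 and drive_days >= 10_000


# Hypothetical drive model: 1,200 drives, 109,500 drive days, 4 failures.
if qualifies_for_quarter(1_200, 109_500):
    print(f"AFR: {annualized_failure_rate(4, 109_500):.2f}%")
```

Because the denominator is drive days rather than drive count, a model with few drive days can swing widely on a single failure, which is why the thresholds exist.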

Notes and observations on the Q2 2024 Drive Stats

Upward AFR: The AFR for Q2 2024 was 1.71%. That’s up from Q1 2024 at 1.41%, but down from one year ago (Q2 2023) at 2.28%. While the quarter over quarter increase was a bit surprising, quarterly fluctuations in AFR are expected. Sixteen drive models had an AFR of 1.71% or below while 13 drive models had an AFR above.
Two good zeroes: In Q2 2024, two drive models had zero failures, a 14TB Seagate (model: ST14000NM000J) and a 16TB Seagate (model: ST16000NM002J). Both have a relatively small number of drives and drive days for the quarter, so their success is somewhat muted, but the 16TB Seagate drive model has a very respectable 0.57% lifetime failure rate.
Another GOAT is gone: In Q1, we migrated the last of our 4TB Toshiba drives. In Q2, we migrated the last of our 6TB drives, including all of the Seagate 6TB drives which had reached an average age of nine years (108 months). This Seagate drive model closed out its career at Backblaze with an impressive 0.86% lifetime AFR.
Currently the 4TB Seagate (model: ST4000DM000) is our oldest data drive model in production at an average age of 99.5 months. The data on these drives is scheduled to be migrated over the next quarter or two using CVT, our in-house drive migration system. They’ll never reach nine years of service.
The 10-Year Club: With the 6TB Seagate drives being migrated as they hit 10 years of service, we wondered: What is the oldest data drive in service? The answer: a 4TB HGST drive (model: HMS5C4040ALE640) with 9 years, 11 months, and 23 days of service as of the end of Q2. Alas, the Backblaze Vault in which this drive resides is now being migrated, as are many other drives with over nine years of service. We’ll check next quarter to see if any of them made it to the 10-Year Club before they are retired.
While there are no data drives with 10 years of service, there are 11 HDD boot drives that exceed the mark. In fact, one of them, a 500GB WD drive (model: WD5000BPKT), has over 11 years of service. (Psst, don’t tell the CVT team.)
An HGST surprise: Over the years, the HGST drive models we have used performed very well. So, when the 12TB HGST (model: HUH721212ALN604) drive showed up with a 7.17% AFR for Q2, it’s news. Such uncharacteristic quarterly failure rates for this model actually go back about a year, although the 7.17% AFR is the largest quarterly value to date. As a result, the lifetime AFR has risen from 0.99% to 1.57% over the last year. While the lifetime AFR is not alarming, we are paying attention to this trend.

Lifetime hard drive failure rates

As of the end of Q2 2024, we were tracking 284,876 operational hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q2 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 283,065 drives grouped into 25 models remaining for analysis as shown in the table below.

Age, AFR, and snakes

One of the truisms in our business is that different drive models fail at different rates. Our goal is to develop a failure profile for a given drive model over time. Such a profile can help optimize our drive replacement and migration strategies, and ultimately maintain the durability of our cloud storage service.

For our cohort of data drives, we’ll look at the changes in the lifetime AFR over time for drive models with at least one million drive days as of the end of Q2 2024. This gives us 23 drive models to review. We’ll divide the drive models into two groups: those whose average age is five years (60 months) or less, and those whose average age is above 60 months. Why that cutoff? That’s the typical warranty period for enterprise class hard drives.

Let’s start by plotting the current lifetime AFR for the 14 drive models that have an average age of 60 months or less, as shown in the chart below.

Let’s review the drive models by characterizing the four quadrants as follows:

Quadrant I: Drive models in this quadrant are performing well, and have a respectable AFR of less than 1.5%. Drive models to the right in this quadrant might require a little more attention over the coming months than those to the left.
Quadrant II: These drive models have failure rates above 1.5%, but are still reasonable at around 2% lifetime AFR. What is important is that AFR does not increase significantly over time.
Quadrant III: There are no drives currently in this quadrant, but if there were it would not be a cause for alarm. Why? Some drive models experience higher rates of failure early on, and then following the bathtub curve, their AFR drops as they get older.
Quadrant IV: These drive models are just starting out and are just beginning to establish their failure profile, which at the moment is good.

At a glance, the chart tells us that everything seems fine. The drives in Quadrant I are performing well, the two drives in Quadrant II could be better, but are still acceptable, and there are no surprises in the newer drive models to this point. Let’s see how things fare for the drive models which have an average age of over 60 months, as shown in the chart below.

There are nine drive models which fit the average age criteria, including the Seagate 6TB drive (in yellow) whose drives were removed from service in Q2. As you can see the drive models are spread out across all four quadrants. As before, Quadrant I contains good drives, Quadrants II and III are drives we need to worry about, and Quadrant IV models look good so far.

If we were to stop here we could decide for example that the 4TB Seagate drives are first in line for the CVT migration process, but not so fast. All of these drive models have been around for at least five years and we have their failure rates over time. So, rather than rely on just a point in time, let’s look at their change in failure rates over time in the chart below.

The snake chart, as we’re calling it, shows the lifetime failure rate of each drive model over time. We started at 24 months to make the chart less messy. Regardless, the drive models sort themselves out into either Quadrant I or II once their average age passes 60 months. Let’s take a look at the drives in each of those quadrants.

Quadrant I: Five of the nine drive models are in Quadrant I as of Q2 2024. The two 4TB HGST drives (brown and purple lines) as well as the 6TB Seagate (red line) have nearly vertical lines indicating their failure rates have been consistent over time, especially after 60 months of service. Such demonstrated consistency over time is a failure profile we like to see.
The failure profiles of the 8TB Seagate (blue line) and the 8TB HGST (gray line) are less consistent, with each drive model’s failure rate increasing as it has aged. In the case of the HGST drive, the lifetime AFR rose from about 0.5% to 1.0% over an 18-month period starting at 48 months before leveling out. The Seagate drive took about two years starting at 60 months to go from 1.0% to nearly 1.5% before leveling out.
Quadrant II: The remaining 4 drive models ended in this quadrant. Three of the models, the 8TB Seagate (yellow line), the 10TB Seagate (green line), and the 12TB HGST (teal line) have similar failure profiles. All three got to some point in their lifetime and their curve began bending to the right. In other words, their failure rates over time accelerated. While the 8TB Seagate (yellow) shows some signs of leveling off, all three models will be closely watched and replaced if this trend continues.
Also in Quadrant II is the 4TB Seagate drive (black line). This drive model is aggressively being migrated and is being replaced by 16TB and larger drives via the CVT process. As such, it is hard to tell if the nearly vertical failure profile is a function of the replacement process or the drive model failure rate leveling out over time. Either way, the migration of this drive model is expected to be complete in the next quarter or two.

A normal failure profile

If we had to pick one of the drive models to represent a normal failure profile, it would be the 8TB Seagate (blue line, model: ST8000DM002). Why? The failure rate for the first 60 months was consistently around 1.0%, Seagate’s predicted AFR. After 60 months, the AFR increased as the drive aged, as one would expect. You might have thought we’d choose the failure profile of one of the two 4TB HGST drive models (brown and purple lines). The “trouble” is their failure rates are well below any published AFR by any drive manufacturer. While that’s great for us, their annualized failure rates over time are sadly not normal.

Can AI help?

The idea of using AI/ML techniques to predict drive failure has been around for several years, but as a first step let’s see if predicting drive failure is even an AI-worthy problem. We recently conducted a webinar, “Leveraging Your Cloud Storage Data in AI/ML Apps and Services,” in which we outlined general criteria to be used in evaluating if AI/ML is needed to solve a given problem, in this case predicting drive failure. The most salient criterion that applies here is that AI is best used for a problem that you cannot solve by consistently applying a set of rules.

A model is trained by taking the source data and applying an algorithm to iteratively combine and weigh multiple factors. The output is a model which can be used to answer questions about the model’s subject matter, in this case drive failure. For example, we train a model using the Drive Stats data for a given drive model for the last year. Then, we ask the model a question using drive Z’s daily SMART stats and related information. We use this data as input to the model, and while there is no exact match, the model will use inference to develop a response of the probability of drive failure for drive Z over time. As such, it would seem that drive failure prediction would be a good candidate for using AI.

What’s not clear is whether what is learned about one drive model can be applied to another drive model. One look at the snake chart above visualizes the issue as the failure profile for each drive model is different, sometimes radically different. For example, do you think you could train a model on the 4TB Seagate drives (black line) and use it to predict drive failures for either of the 4TB HGST drive models (purple and brown lines)? The answer may be yes, but it certainly doesn’t seem likely.

All that said, several research papers and studies have been published over the years attempting to determine whether or not AI/ML can be used to make drive failure predictions. We’ll be doing a review of these publications in the next couple of months and hopefully shed some light on the ability to use AI to accurately make drive failure predictions in a timely manner.

The Hard Drive Stats data

It has now been over 11 years since we began recording, storing, and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored.

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.

The post Backblaze Drive Stats for Q2 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Proper Address: IPv4 vs. IPv6

2024-08-01 Stephanie Doyle

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/proper-address-ipv4-vs-ipv6/

A decorative image showing a cloud over performance graphs and charts.

Ah, the 1980s. It brought us such classics as Ghostbusters, The Princess Bride, Tina Turner’s triumphant comeback, Pac-Man, and the original Apple Macintosh. Also, it gave us the birth of the internet, in which we figured out how to make all our computers one giant, powerful network held together initially by internet protocols (IPs) and, eventually, by a mutual love of cat videos.

Now, each of our devices that connects to the internet requires a way to find and send information back and forth, which means it needs an IP address. Most folks don’t type IP addresses into their search bar, though; we use domain names (for example, www.backblaze.com). The mapping of IP addresses to domain names is stored in a hierarchical and distributed database system known as the domain name system (DNS), which is also an internet protocol.

Today, let’s talk about IP addresses: What are IPv4 and IPv6, why is IPv6 necessary, and what impact will it have on networking?

Let’s set the scene

Any time you’re sending and receiving data, be it sending a letter in the mail, dialing a phone number, or loading a website, you’ve got to have an identifiable address to reach the proper person and/or device. What all of these types of addresses have in common is that as our population has exploded, we’ve had to re-work how addresses work in order to include more possible data locations. U.S. zip codes were established in 1963. Area codes were established in 1947, and a great expansion was necessary only three(ish) decades later, and that plan was implemented starting in the late 1980s and ending in the mid ’90s.

IP addresses, meanwhile, have been operating on the first and only protocol we introduced in the 1980s, called IPv4. Not only has the world population almost doubled since then, but there has also been a nonlinear explosion in internet-connected devices per person. When IP addresses were first invented, it was unfathomable that most folks would be walking around with a computer in their pocket, remotely checking who’s ringing their doorbells while adjusting their thermostat in anticipation of returning home. All of those internet-connected devices use an IP address, in one way or another.

So, it’s no surprise that we’re now seeing an adoption of a new IP address standard. In keeping with tradition, the versions aren’t sequential: Right now we’re jumping from IPv4 to IPv6. (What happened to IPv5? It was skipped, sort of.)

What is IPv4?

IPv4 is an internet protocol that assigns addresses to devices. It uses a 32-bit address, represented by four numbers (octets), each between 0 and 255, separated by dots (e.g., 192.168.1.100), and uses decimal notation.

Remember that each bit represents one of two possible values, a 0 or a 1. So, for a 32-bit value, there are 2^32 possible addresses, or 4,294,967,296 IP addresses total. Several IPv4 address blocks were also reserved for private networks and multicast addresses, about 286 million total. Between the two reserved blocks of addresses, that’s about 7% of the total addresses in existence.
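The arithmetic above can be checked with Python's standard `ipaddress` module. This is a sketch under one assumption: that the "reserved" figure counts the three RFC 1918 private ranges plus the multicast block, which is the combination that adds up to roughly 286 million.

```python
import ipaddress

# A dotted-decimal IPv4 address is just a 32-bit integer written as four octets.
as_int = int(ipaddress.IPv4Address("192.168.1.100"))
print(f"192.168.1.100 as a 32-bit integer: {as_int}")

total = 2 ** 32  # every possible 32-bit value

# Assumed breakdown of the reserved blocks mentioned in the text.
private = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]
multicast = ipaddress.ip_network("224.0.0.0/4")

reserved = sum(net.num_addresses for net in private) + multicast.num_addresses
share = reserved / total * 100

print(f"Total IPv4 addresses: {total:,}")
print(f"Private + multicast:  {reserved:,} ({share:.1f}% of the total)")
```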

What is IPv6?

IPv6 uses a 128-bit address, represented by a longer string of numbers and letters (e.g., 2001:0db8:85a3:0000:0000:8a2e:0370:7334) in hexadecimal code, aka hex code. If you’ve ever designed a MySpace page (hi, Tom!) or a webpage, you’re likely familiar with the hex codes used to identify precise colors.

Doing the math as we did above, there are 2^128 possible IPv6 addresses, which is about 340 undecillion. (Undecillion is the 11th name in the sequence if you’re counting million, billion, trillion, and so on.) And, just like IPv4, there are some reserved addresses, but they represent such a comparatively small share of the total available addresses that it’s not even worth calculating a percentage.
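To make the scale concrete, here is a short sketch using Python's standard `ipaddress` module. It computes the size of the IPv6 address space and shows that the example address above can be written in a shorter, equivalent form by dropping leading zeros and collapsing one run of all-zero groups.

```python
import ipaddress

total_v6 = 2 ** 128
print(f"Total IPv6 addresses: {total_v6:,}")
print(f"IPv6 addresses per IPv4 address: {total_v6 // 2 ** 32:,}")

addr = ipaddress.IPv6Address("2001:0db8:85a3:0000:0000:8a2e:0370:7334")
print(addr.compressed)  # shortened form of the same address
print(addr.exploded)    # full form, eight groups of four hex digits
```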

Woah, how have we been surviving in the meantime?

We mentioned above that we’ve known we’re running out of IP addresses for a while. But, important detail: There was evidence of the problem as early as 1981, and mitigation efforts were enacted by 1992. Before we get into what mitigation strategies have been used over the years, a bit of a refinement of the above information: IP addresses consist of two main parts, one that identifies the network (or, sometimes, the subnet) and one that identifies the host, or the destination on that network. (That’s true of both IPv4 and IPv6.)

Classful networking

In the original iteration of IPv4, the bits that identified the subnet were fixed, and that meant a lot of wasted space. In 1981, we implemented classful networking. Instead of keeping a fixed number of bits to identify a network, the three most significant bits identified the size of the network prefix, and that sent you to different classes. That meant that existing addresses didn’t have to change. Here’s a handy table:

Class	Most significant bits	Network prefix size (bits)	Host identifier size (bits)	Address range	Maximum number of networks	Maximum number of hosts per network
A	0	8	24	0.0.0.0–127.255.255.255	128 networks	16,777,216 hosts per network
B	10	16	16	128.0.0.0–191.255.255.255	16,384 networks	65,536 hosts per network
C	110	24	8	192.0.0.0–223.255.255.255	2,097,152 networks	256 hosts per network
D (multicast) E (reserved)	1110 1111	—	—	224.0.0.0–255.255.255.255	—	—
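The table can be read as a lookup on the leading bits of the first octet. The sketch below follows that logic; the function name is hypothetical, and the split between Class D and Class E at 240.0.0.0 comes from the 1110 and 1111 bit patterns rather than from the merged row above.

```python
import ipaddress


def address_class(addr: str):
    """Return (class letter, network prefix bits) for an IPv4 address.

    Prefix bits is None for classes D and E, which have no network/host split.
    """
    first_octet = int(ipaddress.IPv4Address(addr)) >> 24
    if first_octet & 0b1000_0000 == 0b0000_0000:   # leading bit 0
        return ("A", 8)
    if first_octet & 0b1100_0000 == 0b1000_0000:   # leading bits 10
        return ("B", 16)
    if first_octet & 0b1110_0000 == 0b1100_0000:   # leading bits 110
        return ("C", 24)
    if first_octet & 0b1111_0000 == 0b1110_0000:   # leading bits 1110
        return ("D", None)
    return ("E", None)                              # leading bits 1111


for example in ("10.1.2.3", "172.16.0.1", "192.168.1.100", "224.0.0.1"):
    letter, prefix = address_class(example)
    hosts = 2 ** (32 - prefix) if prefix else None
    print(example, letter, prefix, hosts)
```

The host count is 2 raised to the number of bits not used by the network prefix, which is where the figures in the last column of the table come from.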

All that sounds a bit like gobbledygook. An analogy: You live in a city that wants to improve mail delivery, so it’s introduced the option to choose from a small, medium, or large mailbox. The sizes are actually pretty disproportionate: the small is about the size of a toaster, whereas the medium is the size of a kitchen trash can. (And large is the size of your car. Who gets that much mail?) No matter which size mailbox you (or your neighbor) choose, your physical address didn’t change when this system was implemented. You usually get more mail than the toaster would accommodate, but never even come close to filling your trash can-sized mailbox. So, that extra space just sits empty and unused, never fulfilling its mail volume potential.

Note that classful networking is now largely defunct, replaced by…

Classless inter-domain routing (CIDR)

The biggest issue of the above system was its inflexibility. Adding classes gave us more flexibility than the original design, but you were still restricted to 8, 16, or 24 bits to identify the network. That means you can end up with a lot of unused IP addresses, as indicated by our above analogy. Here’s the math behind why:

The number of addresses available on a network is determined by how many bits are left over after you define the network: the more bits in the network prefix, the fewer hosts it can hold. So, in a 32-bit address, if you use 24 bits to define the network, you have 8 bits left over to define the host. That’s our Class C network, which contained 2^8 (256) IP addresses, not enough for most use cases. And, the next size up, Class B, represented 2^16 IP addresses (65,536 total), which most organizations could not use efficiently. After DNS became the norm, it became clear that classful networking wasn’t scalable, and thus CIDR rose to prominence.

CIDR is based on variable-length subnet masking (VLSM), which lets each network be divided into subnetworks of various power-of-two sizes. This method optimizes the allocation of IPv4 addresses by allowing for more flexible address blocks.
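A small sketch of what variable-length prefixes buy you, using Python's standard `ipaddress` module. The helper name and the 1,000-host organization are hypothetical, and the calculation ignores the network and broadcast addresses that real subnets set aside.

```python
import ipaddress


def smallest_prefix(hosts_needed: int) -> int:
    """Longest prefix (smallest block) whose size is at least hosts_needed."""
    if hosts_needed < 1:
        raise ValueError("hosts_needed must be at least 1")
    host_bits = (hosts_needed - 1).bit_length()
    return 32 - host_bits


# An organization needing about 1,000 addresses: too big for a Class C (256),
# far too small for a Class B (65,536). CIDR can hand out a /22 instead.
prefix = smallest_prefix(1_000)
block = ipaddress.ip_network(f"10.0.0.0/{prefix}")
print(f"/{prefix} holds {block.num_addresses} addresses")

# VLSM: the same block can be carved into smaller power-of-two subnets.
for subnet in ipaddress.ip_network("10.0.0.0/24").subnets(new_prefix=26):
    print(subnet, subnet.num_addresses)
```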

Using our analogy, instead of assigning mailbox size based on household size, you might just have a system in which folks walk up to the post office and find their name on a list associated with a mailbox. If someone has more or less mail that month, then they can be assigned the properly sized mailbox.

Network address translation (NAT)

NAT allows multiple devices to share a single public IPv4 address by modifying the IP header of each packet while it’s in transit. This is super useful when you’re talking about private networks, because multiple devices can share a single public IP address. For example, if you have several internet of things (IoT) devices in your home, they can all appear to the public network as one IP address, and your local network can figure out what traffic goes where. It also makes it so that if a network moves, the host doesn’t necessarily have to be assigned a new IP address, such as if an internet provider like Cox decides to stop doing business in your region and Spectrum takes over their IP address allocation (though likely they’d just change your public IP address in that specific scenario).

In our mail analogy, NAT is like those group mailboxes you see in rural areas, apartment buildings, or in neighborhoods. Everyone in the same location gets their mail delivered to the same physical address, and your box number is used to further identify your house within the group mailbox.
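The "box number" bookkeeping can be sketched as a toy translation table. This is not how any particular router implements NAT. It assumes the common port-based variant, in which the router rewrites each outgoing connection to use its own public address plus a unique port, then uses that port to route replies back. The class name, the addresses, and the starting port are all hypothetical.

```python
class ToyNat:
    """Toy port-based NAT table: many private hosts, one public address."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._next_port = first_port
        self._outbound = {}  # (private_ip, private_port) -> public_port
        self._inbound = {}   # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip: str, private_port: int):
        """Return the (public_ip, public_port) the outside world will see."""
        key = (private_ip, private_port)
        if key not in self._outbound:
            self._outbound[key] = self._next_port
            self._inbound[self._next_port] = key
            self._next_port += 1
        return (self.public_ip, self._outbound[key])

    def translate_inbound(self, public_port: int):
        """Return the private (ip, port) for a reply, or None if unknown."""
        return self._inbound.get(public_port)


nat = ToyNat("203.0.113.5")
thermostat = nat.translate_outbound("192.168.1.10", 5000)
doorbell = nat.translate_outbound("192.168.1.11", 5000)
print(thermostat, doorbell)
print(nat.translate_inbound(doorbell[1]))
```

Both devices appear to the internet as 203.0.113.5, and only the port distinguishes them, much like the box number in a group mailbox.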

The secondary market of IP addresses

If we can learn anything from the above workarounds, it’s that flexibility and possibility are key. So, it’s unsurprising that a secondary market has cropped up, introducing things like address recycling, address trading, and address leasing. IPv6 will solve the scarcity issue, but what else can it do?

What are the benefits of IPv6?

So far we’ve talked about the primary benefit of IPv6: more IP addresses, which we clearly need. But, there are other benefits as well. Here’s a summary:

Improved Efficiency

Simpler header: The IPv6 header is simpler than IPv4’s, leading to faster packet processing and reduced overhead.
Efficient routing: IPv6’s design allows for more efficient routing, potentially reducing latency and improving network performance. Arguably, most folks won’t see a huge performance improvement unless they reconfigure their own network architecture, but the possibility is there.
Autoconfiguration: IPv6 supports automatic configuration of network interfaces, simplifying setup and reducing administrative overhead.

Enhanced Security

Built-in security features: IPv6 offers built-in security mechanisms like IPsec, potentially providing better protection against attacks. In practice, IPsec often isn’t deployed, as most encryption is handled higher up the stack by Transport Layer Security (TLS).

Quality of Service (QoS)

Improved QoS: IPv6 provides better support for QoS, allowing for prioritization of different types of traffic, ensuring a better user experience for applications like video conferencing and online gaming.

Other Benefits

Reduced reliance on NAT: IPv6 reduces the need for NAT, simplifying network configurations and improving end-to-end connectivity.
Support for new services: IPv6 is better suited for emerging technologies and applications that require a large number of addresses and advanced features.

What’s next? Will we run out again?

Given the number of addresses for IPv4 vs. IPv6 (4.3 billion vs. 340 undecillion, respectively), you can understand how we might have needed to shore up our IPv4 addresses. Honestly, if you assume one device per person, we already outnumber IPv4 addresses. In fact, we outnumbered IP addresses in the 1970s, before IPv4 was even invented! You shouldn’t assume one device per person, by the way. Many countries with widespread broadband access have several devices per person: In the U.S., Consumer Affairs was reporting 21 per household in 2023, and the average U.S. household for that same year was 2.51 people. Globally, that same source reports 3.6 internet-connected devices per person.

Changes like this can certainly be disruptive, but the good news on that front is that most devices will be dual-stacked for quite a while. That means that you’ll have both versions of an IP address, and this change can roll out organically (so to speak). In the end, we’ll have a better-performing internet, ready to grow with us for the foreseeable future.

The post Proper Address: IPv4 vs. IPv6 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How to Future-Proof Your Archives: File Formats That Stand the Test of Time

2024-07-30 Stephanie Doyle

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/how-to-future-proof-your-archives-file-formats-that-stand-the-test-of-time/

A decorative image showing a vault door with a series of 0s and 1s behind it.

You may have heard us talk about backup a time or two, and hopefully our love has been clear. So, when Wired gave us a shoutout in their recent, astute article about archiving, allow us to say we were flattered.

As both a tribute and a collaboration, we’re happy to build on their article’s premise about choosing the correct file type for archiving purposes, and we’ll highlight a few tools to help you protect your files in the long term.

Wired reports: Archived files are especially vulnerable to changing file types

Archives are distinct from backups and have their own demands. Backups are intended to give you the ability to restore files or your whole environment, so they need to be both in step with your current environment and flexible enough to handle either a point-in-time restore of your whole system or the recovery of a single file, depending on what you need. Archiving, on the other hand, is about preservation when you can’t depend on the continuity of devices or digital tools, since those tools might look quite different (or not exist at all!) down the line. That said, backups are also an essential building block of digital archives.

Making something last long-term takes more work than you’d think. And, interestingly, digital archiving suffers from the opposite problem of “traditional” archiving. Whereas with books, magazines, and other paper-based media, you want to touch them as little as possible, with digital archives, you actually need to do some active maintenance to make sure you’re converting files to accessible formats that you can open well into the future.

Here’s an expert from the Wired report telling us about just one part of the practical concerns of digital archiving:

“Twenty years, in the digital realm, is ancient,” says Lance Stuchell, director of digital preservation services at the University of Michigan. His team is frequently tasked with recovering digital files from old computers and storage mediums. “We have a lab that can deal with old media—floppy drives, CDs, older computers. We can get that off of those types of media and move it into our preservation system while ensuring we don’t mess it up while we’re doing it.”

Wired goes on to report that the problem isn’t just having the correct device, but actually having the correct file type. Their biggest takeaways for making sure your files hold up over time?

Use open source file types.
If you’re storing media, store files uncompressed.
Back up absolutely everything.

Check out the rest of the article for details—it’s worth a read. And, thanks for the shoutout as a good option for folks looking to back up, Wired.

Some tools for converting files

So, now that you’re all geared up to get your archive in order, here are some free, open source tools that will help you convert your files. One note when you’re using open source (and we’re big fans) is to make sure you’re using a tool you trust. And, some tools, especially web-based tools, may collect user data or can expose sensitive information.

With that in mind, here are a few to get you started:

LibreOffice: The successor to OpenOffice, LibreOffice is a well-respected open source alternative to Microsoft Office and supports several open source and older document file formats.
Pandoc: Pandoc calls itself the Swiss army knife of file converters for markup formats, which includes documents, HTML formats, spreadsheets, and more. It’s got a very helpful list of file formats and indicates whether they can convert from/to each of them.
ImageMagick: ImageMagick can certainly convert your files, and it’s also beloved because it can edit files, including support for scripting and automation. But, for our purposes, it converts image formats and has continuous support from the open source community.
FFmpeg: FFmpeg is a community supported audio and video tool.
VLC media player: Another audio and video tool that supports conversion.

Keep in mind that while we’re recommending many of these for conversion purposes, many are actually fully-fledged programs with some very cool features—and, some can even replace traditional paid tool options, if you’re the budget-conscious type.
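If you have a folder of files to convert, a small script can pick the right tool by file extension. The sketch below is only a starting point: the extension lists and function names are hypothetical, it assumes `ffmpeg`, `pandoc`, and ImageMagick 7 (whose command is `magick`) are installed and on your PATH, and it uses each tool's most basic "input to output" invocation, letting the tool infer the target format from the output file name.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical extension lists; adjust them to the formats in your archive.
AUDIO_VIDEO = {".mov", ".avi", ".wmv", ".wma"}
IMAGES = {".bmp", ".tga", ".psd"}
DOCUMENTS = {".docx", ".html", ".md"}


def conversion_command(src: str, dst: str) -> list[str]:
    """Build the command line for converting src to dst, based on src's extension."""
    suffix = Path(src).suffix.lower()
    if suffix in AUDIO_VIDEO:
        return ["ffmpeg", "-i", src, dst]
    if suffix in IMAGES:
        return ["magick", src, dst]
    if suffix in DOCUMENTS:
        return ["pandoc", src, "-o", dst]
    raise ValueError(f"No converter configured for {suffix!r}")


def convert(src: str, dst: str) -> None:
    """Run the conversion, failing early if the required tool is not installed."""
    command = conversion_command(src, dst)
    if shutil.which(command[0]) is None:
        raise RuntimeError(f"{command[0]} is not installed or not on PATH")
    subprocess.run(command, check=True)


print(conversion_command("interview.mov", "interview.mkv"))
```

Keep the original file until you have opened the converted copy and confirmed that it looks right.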

Archiving hardware of the future

Standard tech right now for storing archived files is hard disk drives (HDDs) or solid state drives (SSDs), and you’ll even find DVDs that can keep your data stored for 1,000 years or more. All storage media types are known to degrade over time—when you’re storing long-term, you can’t just leave your drive disconnected from power forever, for instance.

That said, developing storage media types, while not as common, offer some interesting (if not yet widely practical) options. If you’re willing to drop some cash, DNA (yes, the biological kind) or ceramic might be for you. And, if you want to get super sci-fi with it, PhysicsWorld has reported on the “Superman memory crystal” that could keep data intact for millions of years.

Build your archive for alien circumstances

When we ambitiously sent out messages to the (potential) sentient life in the universe almost 50 years ago, on golden records no less, we apparently thought it was enough to also include a phonograph needle and some symbolic instructions on how to play the record. In practice, we sent a message with no guarantees that someone could decode and play it.

That may be fine for our space-age time capsule, but for our everyday archives, we do want to do our best to make sure we’re able to open them in the future. While we can’t anticipate where technology will be in 20, 40, or 100 years, we can follow digital archiving best practices to give future generations the best chance of opening files. At least they’ll likely share a language with us, as opposed to our alien friends.

The post How to Future-Proof Your Archives: File Formats That Stand the Test of Time appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

What Do Teachers, Parents, and Students Have in Common? The Need to Back Up

2024-07-25 Yev

Post Syndicated from Yev original https://www.backblaze.com/blog/what-do-teachers-parents-and-students-have-in-common-the-need-to-back-up/

Our world has moved more online, and blackboards have taken a backseat to laptops in classrooms and lecture halls alike. From homework to lesson planning and grading, to communicating with students and parents, educators and students rely heavily on their computers and sync drives like Google Drive, Dropbox, or OneDrive. And, when it comes to digital resources, there’s always a risk of data loss, which, when it strikes, can wipe out hours or days of work.

This is where backup solutions come into play. In this blog post, we will explore the benefits of computer backup for both students and educators, highlight the importance of choosing an affordable and reliable backup solution, and give you some talking points to help others in your educational community understand the importance of backing up.

Risks of data loss

Data loss can happen for a variety of reasons, including hardware failure, accidental deletion, theft, or cyber attacks.

Imagine this scenario: You’ve spent hours building a detailed lesson plan, preparing engaging multimedia presentations, and grading student assignments. Suddenly, your computer crashes, and you can’t get it to turn back on. Or, you lose your USB drive that has years of work, including lesson plans. Neither situation is great, and the result is the same: all of your hard work is gone in an instant.

Data loss is an issue for anyone, but for educators, the consequences can affect not only your work but also your students’ learning experience. And the same scenario is true for students—working on a research project last minute only to have a blue screen of death five minutes before the deadline can be a frustrating turn of events—and one that affects your grade long-term.

The 3-2-1 backup rule

The good news about avoiding data loss is that there are some established best practices that can give you a great place to start. The most fundamental of these, the 3-2-1 backup rule, says you should have three copies of your data on two types of storage media with one copy stored off-site.
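To make the rule concrete, here is a minimal Python sketch that checks a list of copies against 3-2-1. The Copy class, its fields, and the media labels are hypothetical names chosen for illustration, not part of any backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    label: str     # hypothetical name for this copy, e.g. "laptop"
    media: str     # storage media type, e.g. "internal-ssd", "external-hdd", "cloud"
    offsite: bool  # True if the copy lives away from the primary location


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """Three copies, on at least two media types, with at least one off-site."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


copies = [
    Copy("laptop", "internal-ssd", offsite=False),
    Copy("desk drive", "external-hdd", offsite=False),
    Copy("cloud backup", "cloud", offsite=True),
]
print(satisfies_3_2_1(copies))
```

Dropping the cloud copy from the list makes the check fail twice over: only two copies remain, and none is off-site.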

Sync is not backup

Sync services like Google Drive, OneDrive, and Dropbox are great, but they are not the same thing as a true backup. Sync drives are designed to keep all versions consistent with each other, which makes them vulnerable to things like accidental deletion and ransomware attacks. While some may have limited version history or “backup,” those features are typically for a limited amount of time (i.e., 30 days), or are lacking in some of the key areas that schools need to maintain compliance with data protection standards.

Cloud backup services are using a different tool for a different job—you want your synced files to change, whereas you want your backup to be a fixed point in time you can restore if you need to. That’s not to say that you won’t have your backup files constantly up-to-date, like you do with an automatic backup solution, just that you’ll be able to restore a file, or all of your files, to whatever time you choose.

And, if you think the difference is just splitting hairs, studies show that 58% of organizations that experienced data loss last year had some amount of unrecovered data. And, in that same pool of survey takers, 84% of organizations were relying on cloud sync services.

Benefits of backing up

Protection against data loss: The primary benefit of a backup solution is the protection it offers against data loss. By regularly backing up your files, you can ensure that your important documents are safe and can be restored quickly in case of any mishap, or even if you forget your laptop at home.
Enhanced productivity: With a reliable backup system in place, you can focus on what you’re working on without worrying about it getting lost. This peace of mind allows you to work more efficiently and creatively, knowing that your files are secure.
Compliance and accountability: While not at the top of many students’ minds, many educators know that educational institutions have policies and regulations regarding data storage and protection. Having a robust backup solution helps teachers, professors, and the organizations they work for stay compliant with these regulations.
Cost savings: Investing in a backup solution can save you money in the long run. Data recovery services can be expensive after the fact, and the time lost in trying to recreate lost work can be even more costly. An affordable backup solution provides a safety net that prevents these potential expenses.

Students: Back up your data regularly

Often, students are given space on cloud drives or required to submit assignments through learning management systems like Blackboard. But, even with cloud drives, many tools don’t account for adequate backups. When it comes down to it, students are responsible for turning in their work on time.

Getting a backup in place protects you and all the effort you’re putting into your coursework, and you can try it for free to see if it’s right for you.

Educators and faculty members often drive change

Collaboration is one of the hallmarks of an educational environment, and educators and students are often just as responsible for driving change as administrators are. Whatever role you take in your educational community, there are many ways you can help others understand why backup is so crucial, and how to choose the tool that’s right for you.

Choosing an affordable and reliable backup solution

When selecting a backup solution, affordability and reliability are key factors to consider. Here are some decision criteria you can share with others to help in choosing a backup solution:

Assess your needs. Determine the amount of data you need to back up and how frequently it changes. This will help you choose a solution that meets your specific requirements without overpaying for unnecessary features.
Cloud vs. local backup. Cloud-based backups offer the advantage of remote access and easy storage to a geographically separate location, while local backups (such as external hard drives) can provide faster recovery times. Both methods have a place in a solid 3-2-1 backup strategy.
Ease of use. Look for a backup solution that is user-friendly and doesn’t require extensive technical knowledge. The easier it is to use, the more likely you are to maintain regular backups.
Security features. Ensure that the backup solution you choose has robust security features, such as encryption, to protect your data from unauthorized access and cyber threats.
Cost-effective plans. Many backup service providers offer tiered pricing plans based on storage needs and features, or on the number of devices you need protected. Backblaze Computer Backup, for example, starts at $9 per computer per month for unlimited backup, with discounts for one-year or two-year plans.
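For the "assess your needs" step, one rough way to find out how much data you have is to total the file sizes under a folder. The sketch below is a minimal Python example using only the standard library. The function name total_bytes is made up for illustration, and skipping symbolic links and unreadable files is an assumption about what you would want to count.

```python
import os


def total_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, skipping symlinks and unreadable files."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # avoid counting a link's target twice
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # file vanished or is not readable; leave it out of the estimate
    return total


if __name__ == "__main__":
    size = total_bytes(os.path.expanduser("~"))
    print(f"{size / 1024**3:.1f} GiB under your home folder")
```

Running it again a week later gives you a sense of how quickly your data changes.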

Share resources to facilitate discussion

If you don’t have a robust backup strategy in place through your IT department or district, send them this article or this Texas A&M case study and recommend that they get started with a backup strategy.

Save your data (and yourself!): Think about backups ahead of time

Remember, regardless of how you’re creating data, the question isn’t if you will experience data loss, but when. Be prepared and make backup a priority—if not in your organization, then definitely in your personal tech choices.

The post What Do Teachers, Parents, and Students Have in Common? The Need to Back Up appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Announcing Llama 3.1 405B, 70B, and 8B models from Meta in Amazon Bedrock

2024-07-23 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/announcing-llama-3-1-405b-70b-and-8b-models-from-meta-in-amazon-bedrock/

Today, we are announcing the availability of Llama 3.1 models in Amazon Bedrock. The Llama 3.1 models are Meta’s most advanced and capable models to date. The Llama 3.1 models are a collection of 8B, 70B, and 405B parameter size models that demonstrate state-of-the-art performance on a wide range of industry benchmarks and offer new capabilities for your generative artificial intelligence (generative AI) applications.

All Llama 3.1 models support a 128K context length, an increase of 120K tokens and 16 times the capacity of Llama 3 models. They also offer improved reasoning for multilingual dialogue use cases in eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
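The two context-length figures are consistent with each other, and a quick check in Python shows why. It uses only the numbers stated above, in units of K tokens; the variable names are illustrative.

```python
LLAMA_3_1_CONTEXT_K = 128  # stated context length, in K tokens
INCREASE_K = 120           # stated increase over Llama 3, in K tokens

# Implied Llama 3 context length, in K tokens.
llama_3_context_k = LLAMA_3_1_CONTEXT_K - INCREASE_K

# Ratio between the two context lengths.
capacity_ratio = LLAMA_3_1_CONTEXT_K / llama_3_context_k

print(llama_3_context_k, capacity_ratio)  # 8 16.0
```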

You can now use three new Llama 3.1 models from Meta in Amazon Bedrock to build, experiment, and responsibly scale your generative AI ideas:

Llama 3.1 405B (preview) is the world’s largest publicly available large language model (LLM) according to Meta. The model sets a new standard for AI and is ideal for enterprise-level applications and research and development (R&D). It is ideal for tasks like synthetic data generation where the outputs of the model can be used to improve smaller Llama models and model distillations to transfer knowledge to smaller models from the 405B model. This model excels at general knowledge, long-form text generation, multilingual translation, machine translation, coding, math, tool use, enhanced contextual understanding, and advanced reasoning and decision-making. To learn more, visit the AWS Machine Learning Blog about using Llama 3.1 405B to generate synthetic data for model distillation.
Llama 3.1 70B is ideal for content creation, conversational AI, language understanding, R&D, and enterprise applications. The model excels at text summarization and accuracy, text classification, sentiment analysis and nuance reasoning, language modeling, dialogue systems, code generation, and following instructions.
Llama 3.1 8B is best suited for limited computational power and resources. The model excels at text summarization, text classification, sentiment analysis, and language translation requiring low-latency inferencing.

Meta measured the performance of Llama 3.1 on over 150 benchmark datasets that span a wide range of languages and extensive human evaluations. As you can see in the following chart, Llama 3.1 outperforms Llama 3 in every major benchmarking category.

To learn more about Llama 3.1 features and capabilities, visit the Llama 3.1 Model Card from Meta and Llama models in the AWS documentation.

You can take advantage of Llama 3.1’s responsible AI capabilities, combined with the data governance and model evaluation features of Amazon Bedrock to build secure and reliable generative AI applications with confidence.

Guardrails for Amazon Bedrock – By creating multiple guardrails with different configurations tailored to specific use cases, you can use Guardrails to promote safe interactions between users and your generative AI applications by implementing safeguards customized to your use cases and responsible AI policies. With Guardrails for Amazon Bedrock, you can continually monitor and analyze user inputs and model responses that might violate customer-defined policies, detect hallucination in model responses that are not grounded in enterprise data or are irrelevant to the user’s query, and evaluate across different models including custom and third-party models. To get started, visit Create a guardrail in the AWS documentation.
Model evaluation on Amazon Bedrock – You can evaluate, compare, and select the best Llama models for your use case in just a few steps using either automatic evaluation or human evaluation. With model evaluation on Amazon Bedrock, you can choose automatic evaluation with predefined metrics such as accuracy, robustness, and toxicity. Alternatively, you can choose human evaluation workflows for subjective or custom metrics such as relevance, style, and alignment to brand voice. Model evaluation provides built-in curated datasets or you can bring in your own datasets. To get started, visit Get started with model evaluation in the AWS documentation.

To learn more about how to keep your data and applications secure and private in AWS, visit the Amazon Bedrock Security and Privacy page.

Getting started with Llama 3.1 models in Amazon Bedrock
If you are new to using Llama models from Meta, go to the Amazon Bedrock console and choose Model access on the bottom left pane. To access the latest Llama 3.1 models from Meta, request access separately for Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, or Llama 3.1 405B Instruct.

To request to be considered for access to the preview of Llama 3.1 405B in Amazon Bedrock, contact your AWS account team or submit a support ticket via the AWS Management Console. When creating the support ticket, select Amazon Bedrock as the Service and Models as the Category.

To test the Llama 3.1 models in the Amazon Bedrock console, choose Text or Chat under Playgrounds in the left menu pane. Then choose Select model and select Meta as the category and Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, or Llama 3.1 405B Instruct as the model.

In the following example I selected the Llama 3.1 405B Instruct model.

By choosing View API request, you can also access the model using code examples in the AWS Command Line Interface (AWS CLI) and AWS SDKs. You can use model IDs such as meta.llama3-1-8b-instruct-v1, meta.llama3-1-70b-instruct-v1, or meta.llama3-1-405b-instruct-v1.

Here is a sample of the AWS CLI command:

aws bedrock-runtime invoke-model \
  --model-id meta.llama3-1-405b-instruct-v1:0 \
  --body "{\"prompt\":\" [INST]You are a very intelligent bot with exceptional critical thinking[/INST] I went to the market and bought 10 apples. I gave 2 apples to your friend and 2 to the helper. I then went and bought 5 more apples and ate 1. How many apples did I remain with? Let's think step by step.\",\"max_gen_len\":512,\"temperature\":0.5,\"top_p\":0.9}" \
  --cli-binary-format raw-in-base64-out \
  --region us-west-2 \
  invoke-model-output.txt
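Escaping JSON by hand inside a shell string is easy to get wrong. As a minimal sketch, the request body from the command above can be built in Python with json.dumps instead. The function name build_llama_body is hypothetical; the field names (prompt, max_gen_len, temperature, top_p) and default values are the ones shown in the CLI example.

```python
import json


def build_llama_body(
    prompt: str,
    max_gen_len: int = 512,
    temperature: float = 0.5,
    top_p: float = 0.9,
) -> str:
    """Serialize the request fields from the CLI example into a JSON string."""
    return json.dumps(
        {
            "prompt": prompt,
            "max_gen_len": max_gen_len,
            "temperature": temperature,
            "top_p": top_p,
        }
    )


body = build_llama_body('Explain what "context length" means. Let\'s think step by step.')
print(body)
```

The resulting string can be passed as the --body value, and quotes inside the prompt are escaped for you.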

You can use code examples for Llama models in Amazon Bedrock using AWS SDKs to build your applications using various programming languages. The following Python code example shows how to send a text message to Llama using the Amazon Bedrock Converse API for text generation.

import boto3
from botocore.exceptions import ClientError

# Create a Bedrock Runtime client in the AWS Region you want to use.
client = boto3.client("bedrock-runtime", region_name="us-west-2")

# Set the model ID, e.g., Llama 3.1 405B Instruct.
model_id = "meta.llama3-1-405b-instruct-v1:0"

# Start a conversation with the user message.
user_message = "Describe the purpose of a 'hello world' program in one line."
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

try:
    # Send the message to the model, using a basic inference configuration.
    response = client.converse(
        modelId=model_id,
        messages=conversation,
        inferenceConfig={"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
    )

    # Extract and print the response text.
    response_text = response["output"]["message"]["content"][0]["text"]
    print(response_text)

except (ClientError, Exception) as e:
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    raise SystemExit(1)
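To continue the conversation past one message, the model's reply has to be appended to the message list before the next user turn is sent. The helpers below are a minimal sketch, assuming the message and response shapes shown in the example above. The names response_text and add_turn are hypothetical, and no network call is made here.

```python
def response_text(response: dict) -> str:
    """Join every text block in a Converse-style response message."""
    blocks = response["output"]["message"]["content"]
    # Assumption: blocks without a "text" key are non-text content and are skipped.
    return "".join(block["text"] for block in blocks if "text" in block)


def add_turn(conversation: list, role: str, text: str) -> list:
    """Append one message in the same shape the example uses for the user message."""
    conversation.append({"role": role, "content": [{"text": text}]})
    return conversation


# A hand-written stand-in for a real response, used only to show the flow.
fake_response = {"output": {"message": {"role": "assistant", "content": [{"text": "It prints a greeting."}]}}}

conversation = add_turn([], "user", "Describe the purpose of a 'hello world' program in one line.")
add_turn(conversation, "assistant", response_text(fake_response))
add_turn(conversation, "user", "Now show one in Python.")
print(len(conversation))  # 3
```

The updated conversation list can then be passed as the messages argument on the next call.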

You can also use all Llama 3.1 models (8B, 70B, and 405B) in Amazon SageMaker JumpStart. You can discover and deploy Llama 3.1 models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK. You can operate your models with SageMaker features such as SageMaker Pipelines, SageMaker Debugger, or container logs under your virtual private cloud (VPC) controls, which help provide data security.

The fine-tuning for Llama 3.1 models in Amazon Bedrock and Amazon SageMaker JumpStart will be coming soon. When you build fine-tuned models in SageMaker JumpStart, you will also be able to import your custom models into Amazon Bedrock. To learn more, visit Meta Llama 3.1 models are now available in Amazon SageMaker JumpStart on the AWS Machine Learning Blog.

For customers who want to deploy Llama 3.1 models on AWS through self-managed machine learning workflows for greater flexibility and control of underlying resources, AWS Trainium and AWS Inferentia-powered Amazon Elastic Compute Cloud (Amazon EC2) instances enable high performance, cost-effective deployment of Llama 3.1 models on AWS. To learn more, visit AWS AI chips deliver high performance and low cost for Meta Llama 3.1 models on AWS in the AWS Machine Learning Blog.

To celebrate this launch, Parkin Kent, Business Development Manager at Meta, talks about the power of the Meta and Amazon collaboration, highlighting how Meta and Amazon are working together to push the boundaries of what’s possible with generative AI.

Discover how businesses are leveraging Llama models in Amazon Bedrock to harness the power of generative AI. Nomura, a global financial services group spanning 30 countries and regions, is democratizing generative AI across its organization using Llama models in Amazon Bedrock.

Now available
Llama 3.1 8B and 70B models from Meta are generally available and the Llama 3.1 405B model is in preview today in Amazon Bedrock in the US West (Oregon) Region. To request to be considered for access to the preview of Llama 3.1 405B in Amazon Bedrock, contact your AWS account team or submit a support ticket. Check the full Region list for future updates. To learn more, check out the Llama in Amazon Bedrock product page and the Amazon Bedrock pricing page.

Give Llama 3.1 a try in the Amazon Bedrock console today, and send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

Visit our community.aws site to find deep-dive technical content and to discover how our Builder communities are using Amazon Bedrock in their solutions. Let me know what you build with Llama 3.1 in Amazon Bedrock!

— Channy

Welcoming Chief Revenue Officer Jason Wakeam to Backblaze

2024-07-23 Backblaze

Post Syndicated from Backblaze original https://www.backblaze.com/blog/welcoming-chief-revenue-officer-jason-wakeam-to-backblaze/

A decorative image with title "Jason Wakeam, Chief Revenue Officer" and a photograph of Jason.

Backblaze is happy to announce that Jason Wakeam has joined our team as Backblaze’s first Chief Revenue Officer (CRO). Jason will take on spearheading our overall sales strategy, with a focus on expanding market share and driving new revenue opportunities.

What Jason Brings to the Role

An industry veteran with nearly three decades of global leadership experience, Jason brings a proven track record of driving growth and innovation at technology companies. Jason has previously served as a vice president of global sales at SnapLogic, and held leadership roles in a range of public and private companies including Cloudera, Microsoft, and Hewlett-Packard.

I am pleased to welcome Jason as our chief revenue officer. He has an impressive track record that showcases his ability to drive businesses to the next level. His expertise will be crucial as we help more, larger customers break free from traditional cloud walled gardens, move to an open cloud ecosystem, and empower them to do more with their data.

—Gleb Budman, CEO and Chairperson of the Board, Backblaze

Jason takes over from long-time Backblazer Nilay Patel, who previously served as vice president of sales, and has transitioned to oversee our recently established New Markets team with a special focus on AI.

The addition of Jason to our leadership is a sign of our commitment to attracting, retaining, and growing with larger mid-market customers. Jason says of his new role:

Backblaze’s mission deeply resonates with me, and I am excited to help accelerate growth for our company. I’m looking forward to working with this amazing team as we continue to scale with our customers and further innovation.

—Jason Wakeam, Chief Revenue Officer, Backblaze

Welcome, Jason!

The post Welcoming Chief Revenue Officer Jason Wakeam to Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Why We Use Native Code in Backblaze Computer Backup

2024-07-18 Natasha Rabinov

Post Syndicated from Natasha Rabinov original https://www.backblaze.com/blog/why-we-use-native-code-in-backblaze-computer-backup/

A decorative image showing icons of file types superimposed over a cloud.

There’s a lot that goes into building a user-friendly, robust backup utility. When Backblaze set out to create one back in 2007, our goal was to make sure that users of all skill levels would have automatic, nearly continuous backups that could be restored on command. There were plenty of design decisions to be made, and one of the biggest was whether to implement our client in native code.

You might have seen us talk about this on our website and elsewhere, and we felt it was high time to dive into what that decision meant for our development, how it affected the way the Backblaze client works, and why we think it was an important decision and inflection point for Backblaze Computer Backup and our customers.

What is native code?

Each kind of computer central processing unit (CPU), such as Intel/AMD or Apple Silicon, has its own “machine language,” which is the set of instructions the CPU can understand and follow. These instructions are encoded in binary, and aren’t something people can read or write without great effort. When folks talk about using native code, they’re typically talking about a computer program that’s written in machine language, so a computer’s CPU can “natively” understand what the program needs the CPU to do.

Compiled languages

To use a compiled language, developers write instructions into source code that’s easy for humans to read and edit. Then, they use a program aptly called a compiler to convert the source code into machine language for a particular kind of CPU. Examples of compiled languages are assembly (ASM), C, C++, Rust, Go, Swift, and Haskell.

Interpreted languages

Like with compiled languages, developers write programs in interpreted languages by writing instructions into source code files. But instead of converting those instructions into machine language, another program called an interpreter reads the source code and follows the instructions it contains without converting them to machine language. Common interpreted languages are things like Python, Ruby, BASIC, and PHP.

There is a bit of a gray area between compiled and interpreted languages. For example, some modern Java implementations mix an interpreter and a compiler. But, when it comes to programming, the difference is about picking a language that’s suitable to a task’s requirements.
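Python itself shows the mix of compiling and interpreting. The standard CPython implementation first compiles source code into bytecode, then an interpreter executes that bytecode. The short sketch below makes the two steps visible using the built-in compile and eval functions and the standard dis module. Note that bytecode is not machine language, so this is still not native code in the sense used in this article.

```python
import dis

source = "x * 2 + 1"

# Step 1: compile the source text into a bytecode object (not machine code).
code = compile(source, "<example>", "eval")

# Step 2: the interpreter executes the bytecode with x bound to 20.
result = eval(code, {"x": 20})
print(result)  # 41

# List the bytecode instruction names; the exact names vary by Python version.
instructions = [ins.opname for ins in dis.get_instructions(code)]
print(instructions)
```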

When and how do you use which type of code language(s)?

Well, pretty much anything anyone does on computers these days will take a combination of code languages. In some ways, the whole challenge of working with computers is bridging how humans communicate vs. how computers can process things.

If you were using a metaphor for the above, a compiled code language would represent someone who was raised to natively speak two languages, and could fluently curse in both languages.

By contrast, interpreted language is like this: You’ve moved to a country where you’re not fluent in the language, but someone needs a thorough dressing-down. An interpreted language would let you write in your native language, take your words and literally translate the idiom you were intending to use—then the computer would take your literal translation, and, executing the program, would be supplied with a dictionary to then give you an effective, similarly meaningful, insult. If you didn’t have your translator, your attempt at offense (in this metaphor, a program!), would likely fail because no one can understand you.

To wit: While they mean similar things, “when pigs fly,” and “quand les poules auront des dents,” do not literally translate.

What are the benefits of using native code in a backup application?

Using native code in a backup application is, in our opinion, better for several reasons.

Permissions

When you’re writing in native code, you’re plugging in your program at a lower level than most applications. That gives you access to the kinds of APIs the native operating system (OS) uses. Because you’re at that level of integration with the operating system, users have to update permissions less frequently, you have access to more robust build possibilities for your client, and the backup client can seamlessly run in the background.

Efficiency: Build once, run everywhere

Building our backup client lower in the chain of command, so to speak, allows us to use the same work for different situations. There are some interpreted-language platforms that have been built for this purpose, like the Java Virtual Machine (JVM). Using those solutions, however, would sacrifice some of the other benefits we’re outlining in this article.

Being fully in control of our common code, we can do this without interpreted language and still have the other advantages listed here. So, we can use the same base code for both our Mac and Windows clients, but then add modifications to the code on top of each to refine the clients. There may be slight differences between the operating system (OS) environments, but coding at the level of a compiled language like C++ means that we can adjust for those differences effectively.
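The shape of that design, a shared core with a thin platform-specific layer on top, can be sketched in a few lines. The example below is written in Python purely for brevity and is not Backblaze's code, which is native. Every function name and exclusion rule in it is hypothetical; only the sys.platform values ("win32", "darwin") are real.

```python
import sys
from typing import Callable, Iterable


def plan_backup(files: Iterable[str], is_excluded: Callable[[str], bool]) -> list[str]:
    """Shared core: the same selection logic runs on every platform."""
    return [path for path in files if not is_excluded(path)]


def excluded_on_windows(path: str) -> bool:
    # Hypothetical rule: skip operating system files.
    return path.lower().startswith("c:\\windows\\")


def excluded_on_mac(path: str) -> bool:
    # Hypothetical rule: skip operating system files.
    return path.startswith("/System/")


def platform_exclusion(platform: str = sys.platform) -> Callable[[str], bool]:
    """Platform layer: pick the OS-specific rule that plugs into the shared core."""
    if platform.startswith("win"):
        return excluded_on_windows
    if platform == "darwin":
        return excluded_on_mac
    raise NotImplementedError(f"no exclusion rules for {platform!r}")


mac_files = ["/System/Library/x.kext", "/Users/sam/thesis.docx"]
print(plan_backup(mac_files, platform_exclusion("darwin")))
```

Only the small platform layer changes per operating system, while plan_backup stays the same.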

Performance

Running native code typically results in better performance. That’s because there are fewer steps (for your computer) between understanding a program and running a program.

Backup programs run all the time in the background, and have to keep track of a lot of information. Backblaze’s native code does that using half to a tenth of the computing resources that a backup program written in an interpreted language would use. So, Backblaze won’t slow down or interrupt the other activities you’re doing with your computer.

Reducing software bloat and size of software

Also, since you don’t have to install interpreters (you know, your insult dictionary), native code applications are usually leaner and more performant on the system.

Eliminating risky third-party dependencies

Since they’re software, computer language interpreters have bugs and get new features, so they’re frequently updated. Sometimes an updated interpreter won’t run programs written for an older version of the language, or will cause a program to behave differently in an unexpected (read: “buggy”) way. Also, vendors have even changed licensing terms and started charging money for interpreters that had been free. Backblaze’s native code doesn’t have those problems.

Platform-standard user interface

Operating system vendors like Microsoft and Apple strongly encourage developers to write programs that use a platform-standard user interface “look-and-feel.” Programs that do that help users feel comfortable, minimize surprises, and support accessibility features like text-to-speech.

The most effective way to ensure a program’s user interface matches a platform’s standard look-and-feel is to use features built into the operating system, and those are typically only available to native code like Backblaze’s client.

What are the challenges of using native code in a backup application?

Nothing is perfect. What are the downsides to this approach?

Industry preference moving towards interpreted language/web apps

Has anyone else noticed that the world of development has changed recently? (No need to qualify that statement—it will be true tomorrow, tomorrow, and tomorrow again.)

As with any industry, tech’s (and developers’) favorite strategies for creating things and solving problems have changed over time.

There are various players in this space, including platforms (Mac, Windows, Linux), software (Adobe, Office), applications (Slack, the latest mobile game, your headphone utility client), and frankly, many things that skirt the boundaries of the above buckets. Executing any program, and particularly third-party applications, is a negotiation between operating systems’ publishers and the program/application’s developers.

Over time, those who sell computers and manage OSes have grown to prefer the lightweight development of application ecosystems. It lets them have more control over their platforms, and it gives developers a shorter time to deployment—as long as they play within the sandbox the OS has made available. OS publishers are attempting to anticipate the needs of program and app developers, but there are some types of utilities—and backup is one of them—that justifiably break standard rules. Giving access to all your files by default, for example, isn’t something you’d do for a social media application. However, in order to get a full and complete backup, a program does justifiably require that level of access.

Limited dev libraries

Given the preference of developers to move to web applications and interpreted languages (for good reason in some cases), many OSes are releasing less detailed support and/or technical documentation for some of their deeper-level tools. If you’re implementing in native code in today’s environment, you need both historical knowledge and ingenuity in house. Which leads us to our next point…

Expertise

We’re on board with the evolution of development—innovation is at the heart of our company—but for aspects of our backup client, we need developers with a deep understanding of compiled code languages and our supported ecosystems. And luckily, in any sufficiently large tech company, you’ll find folks specializing in different code languages and parts of the tech stack. That means we can spend more time nurturing and developing our internal talent rather than seeking it externally.

Hybrid approaches?

Hey, we’ve spent a whole article telling you why native code matters. But, many folks agree that the future requires a hybrid approach, largely because of that gray area between compiled and interpreted languages we mentioned above. You can certainly see that in our style as well—our Mac client uses a combination of Objective-C, SwiftUI, and C++, for example.

The now and future Backblaze

The core functionality of our client depends on native code for very good design reasons, and they’re ultimately all about making things easier for our end users.

Overall, our design ideas are all centered on what it means to use Backblaze every day, regardless of an end user’s skill level. We want things to be simpler, and sometimes the questions we need to answer (how do I make sure the Backblaze client backs everything up?) are actually a tad more complicated upfront (the Backblaze client needs system permissions—and that means implementing it in native code), in that they require forethought and an investment of time and resources. But, we also prioritize the kind of thinking we can use over and over—so, even if we spend a little more time building native code, it’s an investment that has longevity. Put another way: Build once, run everywhere.

The post Why We Use Native Code in Backblaze Computer Backup appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Container Orchestration: Managing Applications at Scale

2024-07-11 Vinodh Subramanian

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/container-orchestration-managing-applications-at-scale/

A decorative image showing containers stacked in a pattern.

The use of containers for software deployment has emerged as a powerful method for packaging applications and their dependencies into single, portable units. Containers enable developers to create, deploy, and run applications consistently across various environments. However, as containerized applications grow in scale and complexity, efficiently deploying, managing, and terminating containers can become a challenging task.

The growing need for streamlined container management has led to the rise of container orchestration—an automated approach to deploying, scaling, and managing containerized applications. Because it simplifies the management of large-scale, dynamic container environments, container orchestration has become a crucial component in modern application development and deployment.

In this blog post, we’ll explore what container orchestration is, how it works, its benefits, and the leading tools that make it possible. Whether you are new to using containers or looking to optimize your existing strategy, this guide will provide insights that you can leverage for more efficient and scalable application deployment.

What are containers?

Before containers, developers often faced the “it works on my machine” problem, where an application would run perfectly on a developer’s computer but fail in other environments due to differences in operating systems (OS), dependencies, or configuration.

Containers solve this problem by packaging applications with all their dependencies into single, portable units, improving consistency across different environments. This greatly reduces compatibility issues and simplifies the deployment process.

As lightweight software packages, containers include everything needed to run an application, such as code, runtime environment, system tools, libraries, binaries, and settings. They run on top of the host OS, sharing its kernel, and can run anywhere: on a laptop, on a server, or in the cloud. Because they share the host kernel instead of carrying a full OS for each instance, as virtual machines (VMs) do, containers are more lightweight and efficient while still remaining isolated from each other. Check out our article to learn more about the difference between containers and VMs here.

Containers provide consistent environments, higher resource efficiency, faster startup times, and portability. They differ from VMs in that they share the host OS kernel. While VMs virtualize hardware for strong isolation, containers isolate at the process level. By solving the longstanding issues of environment consistency and resource efficiency, containers have become an essential tool in modern application development.

What is container orchestration?

As container adoption has grown, developers have encountered new challenges that highlight the need for container orchestration. While containers simplify application deployment by ensuring consistency across environments, managing containers at scale introduces complexities that manual processes can’t handle efficiently, such as:

Scalability: In a production environment, applications often require hundreds or thousands of containers running simultaneously. Manually managing such a large number of containers becomes impractical and error-prone.
Resource management: Efficiently utilizing resources across multiple containers is critical. Manual resource allocation leads to underutilization or overloading of hardware, negatively impacting performance and cost-effectiveness.
Container failure management: In dynamic environments, containers can fail or become unresponsive. Developers need a way to create a self-healing environment, in which failed containers are automatically detected and recovered without manual intervention, to ensure high availability and reliability.
Rolling updates: Deploying updates without downtime and rolling back quickly when issues arise are both crucial for maintaining service continuity. Manual updates can be risky and cumbersome.

Container orchestration automates the deployment, scaling, and management of containers, addressing the complexities that arise in large-scale, dynamic application environments. It ensures that applications run smoothly and efficiently, enabling developers to focus on building features rather than managing infrastructure. Container orchestration tools provide features such as automated scheduling, self-healing, load balancing, and resource optimization, helping teams deploy and manage applications more effectively while maintaining reliability, performance, and scalability.
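
To make the self-healing idea a little more concrete, here is a minimal sketch of a single reconciliation pass: compare the number of replicas you want with the containers that are still healthy, and work out what has to change. The function name and container names are hypothetical, and no real orchestrator API is called; an actual orchestrator runs this kind of comparison continuously against live cluster state.

```shell
# Toy reconciliation pass (hypothetical names, no real orchestrator involved).
# reconcile DESIRED HEALTHY_CONTAINER... -> prints the action needed to converge
reconcile() {
  desired=$1
  shift
  healthy=$#   # the remaining arguments are the containers still healthy

  if [ "$healthy" -lt "$desired" ]; then
    echo "start $(( desired - healthy ))"
  elif [ "$healthy" -gt "$desired" ]; then
    echo "stop $(( healthy - desired ))"
  else
    echo "in sync"
  fi
}

# Three replicas wanted, but web-2 has failed and only two are healthy.
reconcile 3 web-1 web-3   # prints "start 1"
```

The point of the sketch is that nobody tells the system "restart web-2." You declare the desired state once, and the loop keeps closing the gap between that state and reality.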

What are the benefits of container orchestration?

Container orchestration offers many different advantages that streamline the deployment and management of containerized applications. We’ve touched on a few of them, but here’s a concise list:

Improved resource utilization: Orchestration tools can efficiently pack containers onto hosts, maximizing hardware usage.
Enhanced scalability: Easily scale applications up or down to meet changing demands.
Increased reliability: Automatic health checks and container replacement ensure high availability.
Simplified management: Centralized control and automation reduce the complexity of managing large-scale containerized applications.
Faster deployments: Orchestrators enable rapid and consistent deployments across different environments.
Cost efficiency: Better resource utilization and automation lead to cost savings.
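
As a rough illustration of how scaling decisions can be automated, the sketch below applies the ratio that the Kubernetes documentation gives for its Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric ÷ target metric). The function name is an assumption for this example, and a real autoscaler also applies tolerances and minimum/maximum bounds that are left out here.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Integer ceiling of (current replicas * current metric / target metric),
# computed as (a + b - 1) / b because shell arithmetic truncates.
desired_replicas() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Four replicas averaging 90% CPU against a 60% target.
desired_replicas 4 90 60   # prints 6
```

When load drops, the same ratio falls below one and the replica count shrinks, which is where the cost savings come from.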

How does container orchestration work?

Now that we understand what container orchestration is, let’s take a look at how container orchestration works using the example of Kubernetes, one of the most popular container orchestration platforms.

In the above diagram, we see an example of container orchestration in action. The system is divided into two main sections: the control plane and the worker nodes.

Control plane

The control plane is the brain of the container orchestration system. It manages the entire system, ensuring that the desired state of the applications is maintained. Key components of the control plane include:

Configuration store (etcd): A distributed key-value store that holds all the cluster data, such as the configuration and state information. Think of it as a central database for the cluster.
API server: The front-end of the control plane, exposing the orchestration API. It handles all the communication within the cluster and with external clients.
Scheduler: Assigns workloads to nodes based on resource availability and scheduling policies, ensuring efficient resource utilization.
Controller manager: Runs various controllers that handle routine tasks to maintain the cluster’s desired state.
Cloud controller manager: Interacts with cloud provider APIs to manage cloud-specific resources, integrating the cluster with cloud infrastructure.
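
The scheduler's job can be sketched in a few lines. The example below is a simplified stand-in, not the Kubernetes scheduling algorithm: it filters out nodes that cannot fit the workload's CPU request and then picks the node with the most free capacity. The function name, node names, and the "name:free millicpu" input format are all assumptions made for illustration.

```shell
# pick_node REQUEST_MILLICPU NODE:FREE_MILLICPU...
# Prints the node with the most free CPU that can still fit the request.
# Returns 1 if no node has enough room.
pick_node() {
  request=$1
  shift
  best=""
  best_free=-1
  for entry in "$@"; do
    name=${entry%%:*}   # text before the colon
    free=${entry##*:}   # text after the colon
    if [ "$free" -ge "$request" ] && [ "$free" -gt "$best_free" ]; then
      best=$name
      best_free=$free
    fi
  done
  [ -n "$best" ] || return 1
  echo "$best"
}

# A workload asking for 500 millicpu: node-a is too full, node-b has the most room.
pick_node 500 node-a:300 node-b:1200 node-c:800   # prints node-b
```

A production scheduler weighs many more signals (memory, affinity rules, taints, and so on), but the filter-then-score shape is the same.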

Worker nodes

Worker nodes are the machines, either virtual machines or bare metal servers, that run your application workloads. Each worker node has the following components:

Node agent (kubelet): An agent that ensures the containers are running as expected. It communicates with the control plane to receive instructions and report back on the status of the nodes.
Network proxy (kube-proxy): Maintains network rules on each node, facilitating communication between containers and services within the cluster.

Within the worker nodes, pods are the smallest deployable units. Each pod can contain one or more containers that run the application and its dependencies. The diagram shows multiple pods within the worker nodes, indicating how applications are deployed and managed.

Through the cloud provider API, the orchestration system interacts dynamically with cloud infrastructure to provision resources as needed, making it a flexible and powerful tool for managing containerized applications across various environments.

Popular container orchestration tools

Several container orchestration tools have emerged as the leaders in the industry, each offering unique features and capabilities. Here are some of the most popular tools:

Kubernetes

Kubernetes, often referred to as K8s, is an open-source container orchestration platform initially developed by Google. It has become the industry standard for managing containerized applications at scale. K8s is ideal for handling complex, multi-container applications, making it suitable for large-scale microservices architectures and multi-cloud deployments. Its strong community support and flexibility with various container runtimes contribute to its widespread adoption.

Docker Swarm

Docker Swarm is Docker’s native container orchestration tool, providing a simpler alternative to Kubernetes. It integrates seamlessly with Docker containers, making it a natural choice for teams already using Docker. Known for its ease of setup and use, Docker Swarm allows quick scaling of services with straightforward commands, making it ideal for small to medium-sized applications and rapid development cycles.

Amazon Elastic Container Service (ECS)

Amazon ECS is a fully managed container orchestration service from AWS, designed to simplify running containerized applications. It integrates deeply with AWS services for networking, security, and monitoring, which makes it a straightforward orchestration choice for enterprises already running on AWS infrastructure.

Red Hat OpenShift

Red Hat OpenShift is an enterprise-grade Kubernetes container orchestration platform that extends Kubernetes with additional tools for developers and operations, integrated security, and lifecycle management. OpenShift supports multiple cloud and on-premise environments, providing a consistent foundation for building and scaling containerized applications.

Google Kubernetes Engine (GKE)

Google Kubernetes Engine (GKE) is a managed Kubernetes service offered by Google Cloud Platform (GCP). It provides a scalable environment for deploying, managing, and scaling containerized applications using Kubernetes. GKE simplifies cluster management with automated upgrades, monitoring, and scalability features. Its deep integration with GCP services and Google’s expertise in running Kubernetes at scale make GKE an attractive option for complex application architectures.

Embracing the future of application deployment

Container orchestration has undoubtedly revolutionized the way we deploy, manage, and scale applications in today’s complex and dynamic software environments. By automating critical tasks such as scheduling, scaling, load balancing, and health monitoring, container orchestration enables organizations to achieve greater efficiency, reliability, and scalability in their application deployments.

The choice of orchestration platform should be carefully considered based on your specific needs, team expertise, and long-term goals. Container orchestration is not just a technical solution but a strategic enabler, providing you with significant advantages in your development and operational workflows.

The post Container Orchestration: Managing Applications at Scale appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

How to Back Up Your QNAP NAS to the Cloud

2024-07-09 Vinodh Subramanian

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/qnap-nas-backup-to-cloud/

A decorative image with the title sync with QNAP.

Your QNAP network attached storage (NAS) device helps your business centralize storage capacity, support collaboration, and access files 24/7 from anywhere. If you were relying on individual hard drives or another ad hoc storage solution before, it definitely helps you uplevel your data management practices.

One of the great features of a QNAP NAS device is Hybrid Backup Sync (HBS), its onboard backup utility that allows you to easily store a copy of your data to your NAS and other destinations. You can set regular, automated backups to protect against data loss due to hardware failures or accidental deletion. But, keeping a copy of your data on your NAS alone doesn’t constitute a true backup strategy. For that, you need to follow the 3-2-1 backup rule with at least one copy stored off-site.

This post explains how to set up a 3-2-1 backup strategy with your QNAP NAS. We’ll share the benefits of storing your backups in the cloud, discuss different options for backing up your QNAP NAS, and provide some practical examples of what you can do by combining cloud storage and your NAS.

QNAP NAS and a 3-2-1 backup strategy

Following the 3-2-1 strategy means having three copies of your data, two of which are stored locally but on different media (aka devices), and one stored off-site.

Your QNAP NAS is your first step towards completing the 3-2-1 strategy. By using it to store data locally, you have two copies on-site. Backing up your QNAP NAS to the cloud completes the 3-2-1 strategy by serving as your off-site storage.

A diagram showing the 3-2-1 backup strategy, which has three copies of data, on two different types of media, with one stored in an off-site location.
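
If it helps to see the rule as a checklist, here is a tiny sketch that tests a setup against the three numbers. The function name and the example counts are assumptions for illustration; they simply encode the definition given above.

```shell
# meets_321 TOTAL_COPIES DISTINCT_MEDIA OFFSITE_COPIES
# Succeeds only when there are at least 3 copies, on at least 2 kinds of media,
# with at least 1 copy off-site.
meets_321() {
  [ "$1" -ge 3 ] && [ "$2" -ge 2 ] && [ "$3" -ge 1 ]
}

# Workstation plus NAS: two copies on two devices, nothing off-site.
meets_321 2 2 0 || echo "workstation + NAS: not 3-2-1 yet"

# Workstation plus NAS plus a cloud copy.
meets_321 3 3 1 && echo "workstation + NAS + cloud: 3-2-1 satisfied"
```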

You could maintain an off-site copy on another physical device like another NAS, an external drive, or a file server, but keep in mind, backing up to an external destination other than the cloud will require you to physically separate the backup copy—that is, send your drive via mail or drive it elsewhere in order to ensure geographic separation. Backing up your QNAP NAS to the cloud means you achieve a 3-2-1 strategy without going out of your way to physically separate the copies, and it allows you to easily store data in different regions for greater data resilience and disaster recovery.

The additional benefits of backing up your QNAP NAS to the cloud

Backing up your QNAP NAS to the cloud gives you a number of additional benefits, including:

Disaster recovery: Without an off-site backup, your on-site data, including data on your individual workstations and your NAS, is susceptible to data loss. Natural disasters could wipe out your machines, your NAS, and any other backups you might store locally. Cloud backups safeguard your data from physical disasters that could destroy both your NAS and local copies.
Ransomware protection: While QNAP has on-board utilities that allow you to revert to a previous backup, your NAS is still connected to your network and susceptible to ransomware. Cloud backups, especially those configured with Object Lock, provide a layer of security against ransomware attacks that can encrypt or delete data stored on your network-connected NAS.
Protection against hardware failure: Because your NAS is likely set up in a RAID configuration, one drive failure might not affect your data. But, while one drive is down, your data is at a higher risk. If another drive were to fail, you could lose data. Keeping an off-site backup in cloud storage helps you avoid this fate.
Accessibility: With your data in the cloud, your backups are accessible from anywhere. If you’re away from your desk or office and you need to retrieve a file, you can simply log in to your cloud account and copy that file down.
Security: Cloud vendors typically protect customer data by encrypting it as it travels to its final destination and/or when it is at rest on the vendors’ storage servers. Encryption protocols differ between cloud vendors, so make sure to understand them as you’re evaluating cloud providers, especially if you have specific security requirements.
Automation: Your QNAP NAS comes with a built-in backup utility so you can set your cloud backup schedule in advance and avoid human error (like forgetting to back up) in the future.
Scalability: As your data grows, your cloud backups grow with it. With cloud storage, there’s no need to invest in or maintain additional hardware to ensure your data is properly backed up.

How to protect your business data with QNAP

QNAP offers a number of different tools and functionality to help you back up business devices and systems to your NAS, including:

Qsync: Qsync is an on-board backup utility on QNAP devices that allows you to sync computer files to your QNAP NAS. This allows you to back up workstations to your NAS, creating a second, local copy of that data. QNAP NAS also supports Time Machine for Macs.
NetBack PC Agent: A utility specifically for backing up Windows PCs and servers.
Hyper Data Protector: Use Hyper Data Protector to back up multiple VMware and Hyper-V virtual machines (VMs).
File server backup: QNAP devices support multiple protocols, including rsync, FTP, and CIFS for backing up different file servers.
Boxafe: Use Boxafe to back up Google Workspace and Microsoft 365 business account data to your NAS.
Snapshot feature: Takes point-in-time copies of data for protection and recovery.
MARS: Use QNAP’s MARS service to back up Google Photos and WordPress databases and files to your NAS.

How to back up your QNAP to the cloud

Once you’ve created a copy of your business data on your QNAP NAS, you can then use QNAP Hybrid Backup Sync to back it up to the cloud. Hybrid Backup Sync supports multi-version backups and allows you to customize retention settings for version management. QNAP’s QuDedup feature deduplicates data, helping you manage your storage footprint. The utility also allows you to manage Time Machine backups for Mac devices.

What can you do with cloud storage and QNAP Hybrid Backup Sync?

The QNAP Hybrid Backup Sync app provides you with a lot of options. You can synchronize in the cloud as little or as much as you want. Here are some practical examples of what you can do with Hybrid Backup Sync and cloud storage working together.

1. Sync the entire contents of your QNAP to the cloud

The QNAP NAS has excellent fault tolerance—it can continue operating even when individual drive units fail—but nothing in life is foolproof. It pays to be prepared in the event of a catastrophe. Now that you know about the 3-2-1 backup strategy, you know how important it is to make sure that you have a copy of your files in the cloud.

2. Sync your most important media files

Using your QNAP to store marketing assets like video and photos? You’ve invested untold amounts of time, money, and effort into producing those media files, so make sure they’re safely and securely synced to the cloud with Hybrid Backup Sync.

3. Back up Time Machine and other local backups

Apple’s Time Machine software provides Mac users with reliable local backup, and many Backblaze customers rely on it to provide that crucial first step in making sure their data is secure. QNAP enables the NAS to act as a network-based Time Machine backup. Those Time Machine files can be synced to the cloud, so you can make sure to have Time Machine files to restore from in the event of a critical failure.

If you use Windows or Linux, you can configure the QNAP NAS as the destination for your Windows or Linux local data backup. That, in turn, can be synced to the cloud from the NAS.

Ready to give it a try?

Hybrid Backup Sync allows you to choose from any number of cloud storage providers as a backup destination, and Backblaze B2 Cloud Storage is one of them. Check out our videos on how to use Hybrid Backup Sync to back up or sync your data to B2 in under 15 minutes.

If you haven’t given cloud storage a try yet, you can get started now and make sure your NAS is synced or backed up securely to the cloud.

The post How to Back Up Your QNAP NAS to the Cloud appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

AI 101: Why RAG Is All the RAGe

2024-06-27 Stephanie Doyle

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-why-rag-is-all-the-rage/

A decorative image showing an AI chip connecting icons representing different files.

At the risk of being called the stick in the mud of the tech world, we here at Backblaze have often bemoaned our industry’s love of making up new acronyms. The most recent culprit, hailing from the fast-moving artificial intelligence/machine learning (AI/ML) space, is truly memorable: RAG, aka retrieval-augmented generation. For the record, its creator has apologized for inflicting it upon the world.

Given how useful it is, we’re willing to forgive. (I’m sure he was holding his breath for that news.) Today, our AI 101 series is back to talk about what RAG is—and the big problem it solves.

Let’s start with large language models (LLMs)

LLMs are the most recognizable expression of AI in our current zeitgeist. (Arguably, you could append that with “that we’re all paying attention to,” given that ML algorithms have been behind many tools for decades now.) LLMs underpin tools like ChatGPT, Google Gemini, and Claude, as well as things like service-oriented chatbots, natural language processing tasks, and so on. They’re trained on vast amounts of data with algorithmic guardrails known as parameters and hyperparameters guiding their training. Once trained, we query them through a process known as inference.

Fabulous! The possibilities are endless. However, one of the biggest challenges we’ve experienced (and laughed about on the internet) is that LLMs can return inaccurate results, while sounding very, very reasonable. Additionally, LLMs don’t know what they don’t know. Their answers can only be as good as the data they draw from—so, if their training dataset is outdated or contains a systematic bias, it will impact your results. As AI tools have become more widely adopted, we’ve seen LLM inaccuracies range from “funny and widely mocked” to “oh, that’s actually serious.”

Enter retrieval-augmented generation (Fine! RAG)

RAG is a solution to these problems. Instead of relying only on an LLM’s training dataset, RAG queries external sources before returning a response. It’s more complicated than “let me google that for you,” as the process takes that external data, turns it into a vector database, and then balances the retrieved data with the LLM’s “general knowledge” and its skill at responding to conversational queries.

This has several advantages. Users now have sources they can cite, and recent information is taken into account. From a development perspective, it means that you don’t have to re-train a model as frequently. And, it can be implemented in as few as five lines of code.
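
To make the retrieve-then-generate flow concrete, here is a deliberately tiny sketch. It is not how a production RAG pipeline is built: plain word overlap stands in for the vector similarity search, the three "documents" are made-up snippets, and the function names are assumptions. The final step, sending the augmented prompt to an LLM, is left out entirely.

```shell
# Toy corpus: one made-up "document" per line, for illustration only.
DOCS='Rotate application keys every 90 days from the account settings page.
Object Lock makes backups immutable for a retention period you choose.
Large files are uploaded in parts of at least 5MB each.'

# retrieve QUERY -> prints the document sharing the most words with the query.
# Word overlap is a stand-in for the vector similarity a real system would use.
retrieve() {
  query=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
  best=""
  best_score=0
  while IFS= read -r doc; do
    lower=$(printf '%s' "$doc" | tr 'A-Z' 'a-z')
    score=0
    for word in $query; do
      case " $lower " in
        *" $word "*) score=$(( score + 1 )) ;;
      esac
    done
    if [ "$score" -gt "$best_score" ]; then
      best=$doc
      best_score=$score
    fi
  done << EOF
$DOCS
EOF
  printf '%s\n' "$best"
}

# build_prompt QUESTION -> the augmented prompt that would be sent to the LLM.
build_prompt() {
  printf 'Answer using only this source:\n%s\n\nQuestion: %s\n' "$(retrieve "$1")" "$1"
}

build_prompt "how long does object lock keep backups immutable"
```

Because the retrieved text travels with the question, the model can ground its answer in a source you chose, and you can show that source to the user.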

One important nuance is that when you’re building RAG into your product, you can set its sources. For industries like medicine and law, that means you can point it towards industry journals and trusted sources, which then outweigh the often misquoted or mis-cited examples you might see in a general database.

Another example: For a technical documentation portal, you can take an LLM, trained on general information and the nuts and bolts of conversational querying, and direct it to rely on your organization’s help articles as its most important sources. Your organization controls the authoritative data, and how often/when changes are made. Users can trust that they’re getting the most recent security patches and correct code. And, you can do so quickly, easily, and—most importantly—cost-effectively.

RAG doesn’t mean foolproof AI

RAG is a great, straightforward method for keeping LLM tools updated with current, high-quality information and giving users more transparency around where their answers are coming from. However, as we mentioned above, AI is only ever as good as the data it uses. Keep in mind, that’s a deceptively simple thing to say. It’s an entire, specialized job to validate datasets, and that expertise is built into the research and monitoring that happens while training an LLM.

RAG gives a new source of data a privileged position—you’re saying “this data is more authoritative than that data” and, since the LLM doesn’t have anything in its general database, it may not have a counter argument. If you’re not paying attention to your RAG data source standards, and doing so on an ongoing basis, it’s possible, and even likely, that data bias, low quality data, etc. could creep into your model.

Think of it this way: If you’re pointing to a new feature in your tech docs and there’s an error, that impact is magnified because an LLM will give more weight to the RAG data. At least in that case, you’re the one who controls the source data. In our other examples of legal or medical AI tools pointing to journal updates, things can get, well, more complicated. If (when) you’re setting up an AI that uses RAG, it’s imperative to make sure you’re also setting yourself up with reliable sources that are regularly updated.

But, given its impact, and how low of a lift it is to integrate into existing products, we can see why RAG is all the RAGe—and, as always, we look forward to more to come in the AI landscape. For now, we can already see the impact it’s having on the market, with SaaS companies and startups alike exploring the possibilities.

The post AI 101: Why RAG Is All the RAGe appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

AWS CodeArtifact adds support for Rust packages with Cargo

2024-06-21 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-codeartifact-adds-support-for-rust-packages-with-cargo/

Starting today, Rust developers can store and access their libraries (known as crates in Rust’s world) on AWS CodeArtifact.

Modern software development relies heavily on pre-written code packages to accelerate development. These packages, which can number in the hundreds for a single application, tackle common programming tasks and can be created internally or obtained from external sources. While these packages significantly help to speed up development, their use introduces two main challenges for organizations: legal and security concerns.

On the legal side, organizations need to ensure they have compatible licenses for these third-party packages and that they don’t infringe on intellectual property rights. Security is another risk, as vulnerabilities in these packages could be exploited to compromise an application. A known tactic, the supply chain attack, involves injecting vulnerabilities into popular open source projects.

To address these challenges, organizations can set up private package repositories. These repositories store pre-approved packages vetted by security and legal teams, limiting the risk of legal or security exposure. This is where CodeArtifact enters.

AWS CodeArtifact is a fully managed artifact repository service designed to securely store, publish, and share software packages used in application development. It supports popular package managers and formats such as npm, PyPI, Maven, NuGet, SwiftPM, and RubyGems, enabling easy integration into existing development workflows. It helps enhance security through controlled access and facilitates collaboration across teams. CodeArtifact helps maintain a consistent, secure, and efficient software development lifecycle by integrating with AWS Identity and Access Management (IAM) and continuous integration and continuous deployment (CI/CD) tools.

For the eighth year in a row, Rust has topped the chart as “the most desired programming language” in Stack Overflow’s annual developer survey, with more than 80 percent of developers reporting that they’d like to use the language again next year. Rust’s growing popularity stems from its ability to combine the performance of systems languages such as C++ with memory safety and features that make writing reliable, concurrent code easier. This, along with a rich ecosystem and a strong focus on community collaboration, makes Rust an attractive option for developers working on high-performance systems and applications.

Rust developers rely on Cargo, the official package manager, to manage package dependencies. Cargo simplifies the process of finding, downloading, and integrating pre-written crates (libraries) into their projects. This not only saves time by eliminating manual dependency management, but also ensures compatibility and security. Cargo’s robust dependency resolution system tackles potential conflicts between different crate versions, and because many crates come from a curated registry, developers can be more confident about the code’s quality and safety. This focus on efficiency and reliability makes Cargo an essential tool for building Rust applications.

Let’s create a CodeArtifact repository for my crates

In this demo, I use the AWS Command Line Interface (AWS CLI) and AWS Management Console to create two repositories. I configure the first repository to download public packages from the official crates.io repository. I configure the second repository to download packages from the first one only. This dual repository configuration is the recommended way to manage repositories and external connections; see the CodeArtifact documentation for managing external connections. To quote the documentation:

“It is recommended to have one repository per domain with an external connection to a given public repository. To connect other repositories to the public repository, add the repository with the external connection as an upstream to them.”

I sketched this diagram to illustrate the setup.

Domains and repositories can be created either from the command line or the console. I choose the command line. In a shell terminal, I type:

CODEARTIFACT_DOMAIN=stormacq-test

# Create an internal-facing repository: crates-io-store
aws codeartifact create-repository \
   --domain $CODEARTIFACT_DOMAIN   \
   --repository crates-io-store

# Associate the internal-facing repository crates-io-store to the public crates-io
aws codeartifact associate-external-connection \
   --domain $CODEARTIFACT_DOMAIN               \
   --repository crates-io-store                \
   --external-connection public:crates-io

# Create a second internal-facing repository: cargo-repo 
# and connect it to upstream crates-io-store just created
aws codeartifact create-repository \
   --domain $CODEARTIFACT_DOMAIN   \
   --repository cargo-repo         \
   --upstreams '{"repositoryName":"crates-io-store"}'
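
Before moving on, it can be reassuring to confirm the wiring. The sketch below wraps the describe-repository command in a small helper that prints a repository's upstreams and external connections as JSON. The helper name show_wiring is my own invention; the --query expression assumes the response exposes upstreams and externalConnections under the repository key.

```shell
# show_wiring DOMAIN REPOSITORY
# Prints the upstream repositories and external connections of a repository.
# show_wiring is a hypothetical helper, not part of the AWS CLI.
show_wiring() {
  aws codeartifact describe-repository \
     --domain "$1" \
     --repository "$2" \
     --query 'repository.{upstreams: upstreams[].repositoryName, external: externalConnections[].externalConnectionName}' \
     --output json
}
```

Running show_wiring $CODEARTIFACT_DOMAIN cargo-repo should list crates-io-store as an upstream, and running it against crates-io-store should list public:crates-io as the external connection, if the commands above succeeded.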

Next, as a developer, I want my local machine to fetch crates from the internal repository (cargo-repo) I just created.

I configure cargo to fetch libraries from the internal repository instead of the public crates.io. To do so, I create a config.toml file to point to the CodeArtifact internal repository.

# First, I retrieve the URI of the repo
REPO_ENDPOINT=$(aws codeartifact get-repository-endpoint \
                           --domain $CODEARTIFACT_DOMAIN \
                           --repository cargo-repo       \
                           --format cargo                \
                           --output text)

# at this stage, REPO_ENDPOINT is https://stormacq-test-012345678912.d.codeartifact.us-west-2.amazonaws.com/cargo/cargo-repo/

# Next, I create the cargo config file
cat << EOF > ~/.cargo/config.toml
[registries.cargo-repo]
index = "sparse+$REPO_ENDPOINT"
credential-provider = "cargo:token-from-stdout aws codeartifact get-authorization-token --domain $CODEARTIFACT_DOMAIN --query authorizationToken --output text"

[registry]
default = "cargo-repo"

[source.crates-io]
replace-with = "cargo-repo"
EOF

Note that the two environment variables are replaced when I create the config file. cargo doesn’t support environment variables in its configuration.
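Because the values have to be baked into the file, the substitution can also be done from a script. The following is a rough sketch of that step in Python; the function name `render_cargo_config` and the example endpoint are hypothetical, and the generated text simply mirrors the heredoc above.

```python
def render_cargo_config(repo_endpoint: str, domain: str, registry: str = "cargo-repo") -> str:
    """Return config.toml text with the endpoint and domain already substituted."""
    token_cmd = (
        "aws codeartifact get-authorization-token "
        f"--domain {domain} --query authorizationToken --output text"
    )
    lines = [
        f"[registries.{registry}]",
        f'index = "sparse+{repo_endpoint}"',
        f'credential-provider = "cargo:token-from-stdout {token_cmd}"',
        "",
        "[registry]",
        f'default = "{registry}"',
        "",
        "[source.crates-io]",
        f'replace-with = "{registry}"',
        "",  # trailing newline at end of file
    ]
    return "\n".join(lines)


# Placeholder values; substitute the real endpoint and domain.
print(render_cargo_config("https://example.invalid/cargo/cargo-repo/", "my-domain"))
```

Writing the returned string to `~/.cargo/config.toml` gives the same result as the heredoc, with no shell variables left in the file.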

From now on, on this machine, every time I invoke cargo to add a crate, cargo will obtain an authorization token from CodeArtifact to communicate with the internal cargo-repo repository. I must have IAM privileges to call the get-authorization-token CodeArtifact API, in addition to permissions to read or publish packages, depending on the command I use. If you’re running this setup from a build machine for your continuous integration (CI) pipeline, your build machine must have the proper permissions to do so.

I can now test this setup and add a crate to my local project.

$ cargo add regex
    Updating `codeartifact` index
      Adding regex v1.10.4 to dependencies
             Features:
             + perf
             + perf-backtrack
             + perf-cache
             + perf-dfa
             + perf-inline
             + perf-literal
             + perf-onepass
             + std
             + unicode
             + unicode-age
             + unicode-bool
             + unicode-case
             + unicode-gencat
             + unicode-perl
             + unicode-script
             + unicode-segment
             - logging
             - pattern
             - perf-dfa-full
             - unstable
             - use_std
    Updating `cargo-repo` index

# Build the project to trigger the download of the crate
$ cargo build
  Downloaded memchr v2.7.2 (registry `cargo-repo`)
  Downloaded regex-syntax v0.8.3 (registry `cargo-repo`)
  Downloaded regex v1.10.4 (registry `cargo-repo`)
  Downloaded aho-corasick v1.1.3 (registry `cargo-repo`)
  Downloaded regex-automata v0.4.6 (registry `cargo-repo`)
  Downloaded 5 crates (1.5 MB) in 1.99s
   Compiling memchr v2.7.2 (registry `cargo-repo`)
   Compiling regex-syntax v0.8.3 (registry `cargo-repo`)
   Compiling aho-corasick v1.1.3 (registry `cargo-repo`)
   Compiling regex-automata v0.4.6 (registry `cargo-repo`)
   Compiling regex v1.10.4 (registry `cargo-repo`)
   Compiling hello_world v0.1.0 (/home/ec2-user/hello_world)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 16.60s

I can verify CodeArtifact downloaded the crate and its dependencies from the upstream public repository. I connect to the CodeArtifact console and check the list of packages available in either repository I created. At this stage, the package list should be identical in the two repositories.

Publish a private package to the repository
Now that I know the upstream link works as intended, let’s publish a private package to my cargo-repo repository to make it available to other teams in my organization.

To do so, I use the standard Rust tool cargo, just like usual. Before doing so, I add and commit the project files to the Git repository.

$  git add . && git commit -m "initial commit"
 5 files changed, 1855 insertions(+)
create mode 100644 .gitignore
create mode 100644 Cargo.lock
create mode 100644 Cargo.toml
create mode 100644 commands.sh
create mode 100644 src/main.rs

$  cargo publish 
    Updating `codeartifact` index
   Packaging hello_world v0.1.0 (/home/ec2-user/hello_world)
    Updating crates.io index
    Updating `codeartifact` index
   Verifying hello_world v0.1.0 (/home/ec2-user/hello_world)
   Compiling libc v0.2.155
... (redacted for brevity) ....
   Compiling hello_world v0.1.0 (/home/ec2-user/hello_world/target/package/hello_world-0.1.0)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1m 03s
    Packaged 5 files, 44.1KiB (11.5KiB compressed)
   Uploading hello_world v0.1.0 (/home/ec2-user/hello_world)
    Uploaded hello_world v0.1.0 to registry `cargo-repo`
note: waiting for `hello_world v0.1.0` to be available at registry `cargo-repo`.
You may press ctrl-c to skip waiting; the crate should be available shortly.
   Published hello_world v0.1.0 at registry `cargo-repo`

Lastly, I use the console to verify the hello_world crate is now available in the cargo-repo.

Pricing and availability
You can now store your Rust libraries in the 13 AWS Regions where CodeArtifact is available. There is no additional cost for Rust packages. The three billing dimensions are the storage (measured in GB per month), the number of requests, and the data transfer out to the internet or to other AWS Regions. Data transfer to AWS services in the same Region is not charged, meaning you can run your continuous integration and delivery (CI/CD) jobs on Amazon Elastic Compute Cloud (Amazon EC2) or AWS CodeBuild, for example, without incurring a charge for the CodeArtifact data transfer. As usual, the pricing page has the details.

Now go build your Rust applications and upload your private crates to CodeArtifact!

— seb

Anthropic’s Claude 3.5 Sonnet model now available in Amazon Bedrock: Even more intelligence than Claude 3 Opus at one-fifth the cost

2024-06-20 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/anthropics-claude-3-5-sonnet-model-now-available-in-amazon-bedrock-the-most-intelligent-claude-model-yet/

It’s been just 3 months since Anthropic launched Claude 3, a family of state-of-the-art artificial intelligence (AI) models that allows you to choose the right combination of intelligence, speed, and cost that suits your needs.

Today, Anthropic introduced Claude 3.5 Sonnet, its first release in the forthcoming Claude 3.5 model family. We are happy to announce that Claude 3.5 Sonnet is now available in Amazon Bedrock.

Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming other generative AI models on a wide range of evaluations, including Anthropic’s previously most intelligent model, Claude 3 Opus. Claude 3.5 Sonnet is available with the speed and cost of the original Claude 3 Sonnet model. In fact, you can now get intelligence and speed better than Claude 3 Opus at one-fifth of the price because Claude 3.5 Sonnet is 80 percent cheaper than Opus.

The frontier intelligence displayed by Claude 3.5 Sonnet, combined with cost-effective pricing, makes the model ideal for complex tasks such as context-sensitive customer support, orchestrating multi-step workflows, and streamlining code translations.

Claude 3.5 Sonnet sets new industry benchmarks for undergraduate-level expert knowledge (MMLU), graduate-level expert reasoning (GPQA), code (HumanEval), and more. As you can see in the following table, according to Anthropic, Claude 3.5 Sonnet outperforms OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro in nearly every benchmark.

Claude 3.5 Sonnet is also Anthropic’s strongest vision model yet, performing an average of 10 percent better than Claude 3 Opus across the majority of vision benchmarks. According to Anthropic, Claude 3.5 Sonnet also outperforms other generative AI models in nearly every category.

Anthropic’s Claude 3.5 Sonnet key improvements
The release of Claude 3.5 Sonnet brings significant improvements across multiple domains, empowering software developers and businesses with new generative AI-powered capabilities. Here are some of the key strengths of this new model:

Visual processing and understanding – Claude 3.5 Sonnet demonstrates remarkable capabilities in processing images, particularly in interpreting charts and graphs. It accurately transcribes text from imperfect images, a core capability for industries such as retail, logistics, and financial services, to gather more insights from graphics or illustrations than from text alone. Use Claude 3.5 Sonnet to automate visual data processing tasks, extract valuable information, and enhance data analysis pipelines.

Writing and content generation – Claude 3.5 Sonnet represents a significant leap in its ability to understand nuance and humor. The model produces high-quality written content with a more natural, human tone that feels more authentic and relatable. Use the model to generate engaging and compelling content, streamline your writing workflows, and enhance your storytelling capabilities.

Customer support and natural language processing – With its improved understanding of context and multistep workflow orchestration, Claude 3.5 Sonnet excels at handling intricate customer inquiries. This capability enables round-the-clock support, faster response times, and more natural-sounding interactions, ultimately leading to improved customer satisfaction. Use this model to automate and enhance customer support processes and provide a seamless experience for end users. For an example of a similar implementation, see how DoorDash built a generative AI self-service contact center solution using Anthropic’s Claude 3 models in Amazon Bedrock.

Analytics and insights – Claude 3.5 Sonnet augments human expertise in data science by effortlessly navigating unstructured data and using multiple tools to generate insights. It produces high-quality statistical visualizations and actionable predictions, ranging from business strategies to real-time product trends. Use Claude 3.5 Sonnet to simplify data analysis workflows, uncover valuable insights, and drive data-driven decision-making.

Coding and software development – Claude 3.5 Sonnet can independently write, edit, and execute code with sophisticated reasoning and troubleshooting capabilities when provided with the relevant tools. Use Claude 3.5 Sonnet to streamline developer workflows, accelerate coding tasks, reduce manual effort, and enhance overall productivity.

Using Anthropic’s Claude 3.5 Sonnet in the Amazon Bedrock console
Because this is a new model, I go to the Amazon Bedrock console and choose Model access from the navigation pane to enable access to Claude 3.5 Sonnet.

Now that I have access to the model, I’d like to use it to extract information from the following image from the Our World in Data website:

I choose Chat in the Playgrounds section of the navigation pane and select Anthropic and Claude 3.5 Sonnet as the model. I then choose the Image upload button to upload the previous image, type the following prompt, and choose Run:

Which countries consume more than 1000 TWh from hydropower? Think step by step and look at all regions. Output in JSON.

Here’s the result and an animated image recapping what I did.

To determine which countries consume more than 1,000 TWh from hydropower in 2022, I need to carefully examine the map and color-coding. The scale at the bottom shows that dark blue represents consumption between 1,000 TWh and 2,000 TWh.

Let's look at each region systematically:

1. North America: Canada is dark blue, indicating it consumes between 1,000-2,000 TWh from hydropower.
2. South America: Brazil is dark blue, also consuming between 1,000-2,000 TWh.
3. Europe: No country is dark blue; all are light green or have no data.
4. Africa: No data available for any country.
5. Asia: China is dark blue, indicating consumption between 1,000-2,000 TWh.
6. Oceania: No country is dark blue; Australia is light green.

In summary, there are three countries that consume more than 1,000 TWh from hydropower in 2022:
1. Canada (North America)
2. Brazil (South America)
3. China (Asia)

Here's the response in JSON format:

```json
{
  "countries_consuming_over_1000TWh_hydropower": [
    {
      "name": "Canada",
      "region": "North America",
      "consumption_range": "1,000-2,000 TWh"
    },
    {
      "name": "Brazil",
      "region": "South America",
      "consumption_range": "1,000-2,000 TWh"
    },
    {
      "name": "China",
      "region": "Asia",
      "consumption_range": "1,000-2,000 TWh"
    }
  ]
}
```

These three countries stand out as the highest consumers of hydropower energy, each using between 1,000 and 2,000 terawatt-hours in 2022.

The model’s ability to reliably extract information from unstructured data, like images, opens up a world of new possibilities.

I choose the three small dots in the corner of the playground window and then View API request to see code examples using the model in the AWS Command Line Interface (AWS CLI) and AWS SDKs. Let’s take a closer look at the code syntax.

Using Claude 3.5 Sonnet with AWS SDKs
You can use Claude 3.5 Sonnet with any AWS SDK using the new Amazon Bedrock Converse API or Anthropic Claude Messages API.

To update code already using a Claude 3 model, I just need to replace the model ID with:

anthropic.claude-3-5-sonnet-20240620-v1:0

Here’s a sample implementation with the AWS SDK for Python (Boto3) using the same image as before to show how to use images and text with the Converse API.

import boto3
from botocore.exceptions import ClientError

MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

IMAGE_NAME = "primary-energy-hydro.png"

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

with open(IMAGE_NAME, "rb") as f:
    image = f.read()

user_message = "Which countries consume more than 1000 TWh from hydropower? Think step by step and look at all regions. Output in JSON."

messages = [
    {
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image}}},
            {"text": user_message},
        ],
    }
]

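try:
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=messages,
    )
except ClientError as err:
    # Surface service-side failures (for example, missing model access) and stop
    raise SystemExit(f"Converse request failed: {err}")

# The first content block of the reply holds the generated text
response_text = response["output"]["message"]["content"][0]["text"]
print(response_text)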

When I run it, I get output similar to what I saw in the console:

Let's approach this step-by-step:

1. First, I'll examine the color scale at the bottom of the map. The darkest blue color represents consumption of 2,000 TWh or more.

2. Now, I'll scan the map region by region:

   North America: Canada is dark blue, indicating over 1,000 TWh.
   South America: Brazil is also dark blue, over 1,000 TWh.
   Europe: No country appears to be dark blue.
   Africa: No country appears to be dark blue.
   Asia: China stands out as dark blue, indicating over 1,000 TWh.
   Oceania: No country appears to be dark blue.

3. To be thorough, I'll double-check for any medium blue countries that might be close to or over 1,000 TWh, but I don't see any that appear to reach that threshold.

4. Based on this analysis, there are three countries that clearly consume more than 1,000 TWh from hydropower.

Now, I'll format the answer in JSON:

```json
{
  "countries_consuming_over_1000TWh_hydropower": [
    "Canada",
    "Brazil",
    "China"
  ]
}
```

This JSON output lists the three countries that visually appear to consume more than 1,000 TWh of primary energy from hydropower according to the 2022 data presented in the map.

Because I didn’t specify a JSON syntax, the two answers use a different format. In your applications, you can describe in the prompt the JSON properties you want or provide a sample to get a standard format in output.
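As a rough illustration of that advice, the sketch below appends a JSON sample to the prompt and then pulls the fenced JSON block out of a reply. The helper names (`build_prompt`, `extract_json`) and the `countries` property are assumptions made for this example, not part of any AWS SDK.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to avoid nesting fences here

# Hypothetical sample of the structure we want back
JSON_SAMPLE = {"countries": ["<country name>"]}


def build_prompt(question: str) -> str:
    """Append an explicit JSON sample so replies share one structure."""
    return (
        f"{question}\n"
        "Reply with a single JSON object that follows this sample exactly:\n"
        f"{json.dumps(JSON_SAMPLE)}"
    )


def extract_json(response_text: str):
    """Parse the first fenced json block, or the whole text if there is no fence."""
    pattern = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response_text)
    payload = match.group(1) if match else response_text
    return json.loads(payload)


reply = (
    "Now, I'll format the answer in JSON:\n\n"
    + FENCE + 'json\n{"countries": ["Canada", "Brazil", "China"]}\n' + FENCE
)
print(extract_json(reply)["countries"])
```

The string returned by `build_prompt` would take the place of `user_message` in the Converse call. Parsing can still fail if the model returns something unexpected, so production code should handle `json.JSONDecodeError`.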

For more examples, see the code samples in the Amazon Bedrock User Guide. For a more advanced use case, here’s a fully functional tool use demo illustrating how to connect a generative AI model with a custom tool or API.

Using Claude 3.5 Sonnet with the AWS CLI
There are times when nothing beats the speed of the command line. This is how you can use the AWS CLI with the new model:

aws bedrock-runtime converse \
    --model-id anthropic.claude-3-5-sonnet-20240620-v1:0 \
    --messages '{"role": "user", "content": [{"text": "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?"}]}' \
    --region us-east-1 \
    --query output.message.content

I use the query option to get only the content of the output message:

[
    {
        "text": "Let's approach this step-by-step:\n\n1. First, we need to understand the relationships:\n   - Alice has N brothers\n   - Alice has M sisters\n\n2. Now, let's consider Alice's brother:\n   - He is one of Alice's N brothers\n   - He has the same parents as Alice\n\n3. This means that Alice's brother has:\n   - The same sisters as Alice\n   - One sister more than Alice (because Alice herself is his sister)\n\n4. Therefore, the number of sisters Alice's brother has is:\n   M + 1\n\n   Where M is the number of sisters Alice has.\n\nSo, the answer is: Alice's brother has M + 1 sisters."
    }
]

I copy the text into a small Python program to see it printed on multiple lines:

print("Let's approach this step-by-step:\n\n1. First, we need to understand the relationships:\n   - Alice has N brothers\n   - Alice has M sisters\n\n2. Now, let's consider Alice's brother:\n   - He is one of Alice's N brothers\n   - He has the same parents as Alice\n\n3. This means that Alice's brother has:\n   - The same sisters as Alice\n   - One sister more than Alice (because Alice herself is his sister)\n\n4. Therefore, the number of sisters Alice's brother has is:\n   M + 1\n\n   Where M is the number of sisters Alice has.\n\nSo, the answer is: Alice's brother has M + 1 sisters.")

Let's approach this step-by-step:

1. First, we need to understand the relationships:
   - Alice has N brothers
   - Alice has M sisters

2. Now, let's consider Alice's brother:
   - He is one of Alice's N brothers
   - He has the same parents as Alice

3. This means that Alice's brother has:
   - The same sisters as Alice
   - One sister more than Alice (because Alice herself is his sister)

4. Therefore, the number of sisters Alice's brother has is:
   M + 1

   Where M is the number of sisters Alice has.

So, the answer is: Alice's brother has M + 1 sisters.

Even though this was quite a nuanced question, Claude 3.5 Sonnet got it right and described its reasoning step by step.
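Instead of pasting the text into a print statement by hand, the JSON array returned by the CLI can be decoded directly. This is a small sketch under the assumption that the CLI output (the array of content blocks shown earlier) is available as a string; the function name `print_converse_text` is made up for this example.

```python
import json


def print_converse_text(cli_output: str) -> str:
    """Decode the content array printed by the CLI and print its text blocks."""
    blocks = json.loads(cli_output)
    # Each content block is an object; only those with a "text" key are printed
    text = "\n".join(block["text"] for block in blocks if "text" in block)
    print(text)
    return text


# Shortened stand-in for the CLI output shown earlier
sample_output = '[{"text": "Let\'s approach this step-by-step:\\n\\n1. First step"}]'
print_converse_text(sample_output)
```

Decoding the JSON turns the escaped `\n` sequences into real line breaks, which is what the manual copy into `print` achieved.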

Things to know
Anthropic’s Claude 3.5 Sonnet is available in Amazon Bedrock today in the US East (N. Virginia) AWS Region. More information on Amazon Bedrock model support by Region is available in the documentation. View the Amazon Bedrock pricing page to determine the costs for your specific use case.

By providing access to a faster and more powerful model at a lower cost, Claude 3.5 Sonnet makes generative AI easier and more effective to use for many industries, such as:

Healthcare and life sciences – In the medical field, Claude 3.5 Sonnet shows promise in enhancing imaging analysis, acting as a diagnostic assistant for patient triage, and summarizing the latest research findings in an easy-to-digest format.

Financial services – The model can provide valuable assistance in identifying financial trends and creating personalized debt repayment plans tailored to clients’ unique situations.

Legal – Law firms can use the model to accelerate legal research by quickly surfacing relevant precedents and statutes. Additionally, the model can increase paralegal efficiency through contract analysis and assist with drafting standard legal documents.

Media and entertainment – The model can expedite research for journalists, support the creative process of scriptwriting and character development, and provide valuable audience sentiment analysis.

Technology – For software developers, Claude 3.5 Sonnet offers opportunities in rapid application prototyping, legacy code migration, innovative feature ideation, user experience optimization, and identification of friction points.

Education – Educators can use the model to streamline grant proposal writing, develop comprehensive curricula incorporating emerging trends, and receive research assistance through database queries and insight generation.

It’s an exciting time for generative AI. To start using this new model, see the Anthropic Claude models section of the Amazon Bedrock User Guide. You can also visit our community.aws site to find deep-dive technical content and to discover how our Builder communities are using Amazon Bedrock in their solutions. Let me know what you do with these enhanced capabilities!

— Danilo

NAS vs. Cloud Storage: Which Remote Storage Option Is Best?

2024-06-18 Vinodh Subramanian

Post Syndicated from Vinodh Subramanian original https://www.backblaze.com/blog/nas-vs-cloud-storage-which-solution-fits-your-business-needs/

A decorative image showing a cloud and a NAS device.

If you’re leading IT strategy for a growing enterprise and still weighing network attached storage (NAS) and cloud storage, you’re not alone. And you’re not behind. Even the most seasoned infrastructure pros find themselves re-evaluating their stack as data volumes explode and budgets tighten. Both offer unique benefits, but with overlapping features, it’s easy to see why the choice can be confusing.

Are you looking for greater control with physical access, as in a local NAS setup? Or are off-site backup, flexibility, and scalability through a cloud service provider more aligned with your needs? With plenty of discussions and debates outlining the pros and cons of one or the other, it can be difficult to determine the best storage solution for your specific needs.

This guide walks through clear, actionable insights into NAS and cloud storage, addressing your most pressing questions about storage costs, dedicated machines, data sharing, and performance. Whether the focus is cost, scalability, security, or accessibility, this guide will help identify the ideal storage solution for your business.

What is NAS?

NAS, or network attached storage, is a file-level storage system designed specifically to provide centralized and shared disk storage for users on a local area network (LAN).

Essentially, NAS is a purpose-built computer that operates its own dedicated operating system (OS). It contains one or more storage devices that are configured to create a single shared volume. These storage devices are arranged in a RAID configuration to ensure data redundancy and performance.

These configurations make NAS ideal for file sharing, data backups, and accessing large files within an organization, making it a cost-effective solution for enterprises that need local storage with physical access.

Many NAS devices, such as Synology NAS or QNAP NAS, come with built-in software for additional functionalities like file syncing, data backups, and offsite backup options to integrate with cloud services.

How does NAS work?

NAS provides access to files using standard network file sharing protocols such as Network File System (NFS) and Server Message Block (SMB). By connecting directly to the local network, NAS allows users to easily store, access, and collaborate on files without overburdening other servers within the network. This separation of file-serving responsibilities helps optimize overall network performance, particularly for high-traffic environments.

NAS systems are generally managed through a web-based utility accessible over the network, offering an intuitive interface for configuration and maintenance. This interface allows administrators to handle tasks such as user permissions, storage allocation, and data redundancy settings—making it simpler to secure and organize shared files across the network.

Advantages of NAS

NAS offers several advantages including faster data access, easier administration, simplified management, and many others. Here’s a breakdown:

Cost effective: NAS devices typically involve an upfront purchase cost that includes access to applications from the NAS provider, like Synology Hyper Backup or QNAP Hybrid Backup Sync. This greatly reduces ongoing subscription fees, though you may incur costs if you want to expand your storage capacity with high-capacity drives or increase performance with upgrades such as more powerful processors.
Data control and security: NAS systems offer extensive control over data storage and security protocols. By default, NAS systems are accessible only on the local network and only to user accounts that administrators control and manage.
Performance: NAS provides high-speed access to data over a local network, ensuring quick file retrieval and sharing. NAS devices generally work as fast as the local network allows.
Scalable storage: Many NAS systems allow additional drives to be added, providing flexible storage expansion, albeit with the cost of additional drives or device upgrades. Modern NAS devices offer large storage capacities and advanced features for virtualization and application hosting.
Data redundancy: When equipped with RAID configurations, NAS provides redundancy, ensuring data remains accessible even if one or more hard drives fail.
Better data management tools: Features such as fully automated backups, deduplication, compression, and encryption enhance data storage efficiency and security. NAS systems also support sync workflows for team collaboration, directory services for user and group management, and services like photo or media management.
Compatibility: NAS systems are designed to support different OS environments and are compatible with Windows, Mac, and Linux operating systems. They offer seamless cross-platform access.
Remote access options: While primarily local, most NAS devices offer secure remote access through VPN or encrypted connections, allowing authorized users to access files from outside the office network when needed.
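
How much redundancy a NAS provides, and how much raw capacity it costs, depends on the RAID level. As a rough illustration, this sketch applies the standard capacity rules for common RAID levels, assuming equally sized drives. The function name `raid_usable_capacity` is made up for this example, and vendor-specific schemes may behave differently.

```python
from typing import Tuple


def raid_usable_capacity(level: str, drives: int, drive_tb: float) -> Tuple[float, int]:
    """Return (usable capacity in TB, drive failures that are always survivable)."""
    rules = {
        # level: (minimum drives, drives' worth of usable space, guaranteed tolerance)
        "0": (2, drives, 0),          # striping only, no redundancy
        "1": (2, 1, drives - 1),      # every drive holds a full mirror
        "5": (3, drives - 1, 1),      # one drive's worth of parity
        "6": (4, drives - 2, 2),      # two drives' worth of parity
        "10": (4, drives // 2, 1),    # striped mirrors; one failure always survivable
    }
    if level not in rules:
        raise ValueError(f"unsupported RAID level: {level}")
    minimum, usable_drives, tolerance = rules[level]
    if drives < minimum or (level == "10" and drives % 2):
        raise ValueError(f"RAID {level} needs at least {minimum} drives (an even count for RAID 10)")
    return usable_drives * drive_tb, tolerance


# Four 8 TB drives in RAID 5: 24 TB usable, survives one drive failure
print(raid_usable_capacity("5", 4, 8.0))
```

The trade-off is visible in the numbers: levels that tolerate more failures give up more raw capacity to mirroring or parity.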

Limitations of NAS

While NAS offers numerous advantages for centralized file storage, there are some notable limitations to consider:

Initial setup and maintenance: The configuration process can be complex at enterprise scale, and ongoing maintenance may demand external IT support, adding to operational costs.
Remote access vulnerabilities: NAS systems can be accessed remotely over the internet, creating a private cloud or hybrid cloud solution. While this offers a significant advantage in using your device, just like anything connected to the internet, it also poses security risks. Bad actors can exploit vulnerabilities and gain remote access to the device. To minimize risk, businesses must ensure proper security configurations, use encrypted connections, regularly update firmware, and restrict access to trusted IPs.
Scalability constraints: Although NAS systems allow for storage expansion, they are still limited by the physical capacity of the hardware. Adding storage often involves purchasing high-capacity drives, which can be costly, and for larger expansions, migrating to more powerful NAS devices might be necessary.
Data vulnerability: Data stored on a NAS is susceptible to various threats, including hardware failures, natural disasters, theft, and cyber attacks such as ransomware. While RAID configurations offer some level of data redundancy, they do not protect against all forms of data loss. Regular backups and additional security measures are essential to mitigate these risks.
Performance overheads: As more users and devices access the NAS, network bandwidth and device performance can become bottlenecks. High demand may reduce access speeds, impact data throughput, and reduce efficiency, especially in larger organizations with extensive data needs.
Data recovery challenges: If a NAS drive fails or becomes corrupted, data recovery processes may be complex and require specialized services, which can be costly and time-intensive.

What is cloud storage?

Cloud storage is a model of data storage where data is stored on servers located in off-site locations and accessed via the internet. This setup enables users to store, retrieve, and manage data without requiring local storage infrastructure. There are two main types of cloud: public and private.

Public cloud storage: Hyperscale providers like AWS, Google Cloud, and Azure and specialized cloud providers like Backblaze maintain servers and are responsible for hosting, managing, and securing data. The public cloud is cost-effective and offers scalable storage for multiple users and businesses.
Private cloud storage: Typically managed in-house or by a dedicated third-party provider, private cloud storage is reserved for a single organization. For example, a university may maintain data centers for its community. Private clouds offer enhanced control and security, though they often require more complex management.

What’s the diff: Public vs. private cloud

Public cloud storage services are provided by third-party vendors over the public internet, making them accessible to anyone who wants to purchase or lease storage capacity. These services are designed to offer scalability and reliability, often on a pay-as-you-go basis.

Private cloud storage is dedicated to a single organization, which uses its own servers and data centers to store data within its own network. It can be hosted on-premises or by a third-party provider, but it’s always behind the organization’s firewall. This model is ideal for businesses that require more control over their data and have stringent security and compliance requirements.

Advantages of public cloud

One of the key benefits of public cloud storage is that it eliminates the need for businesses to buy, manage, and operate their own data center infrastructure. This shift allows companies to move from a capital expenditure (CapEx) model to an operational expenditure (OpEx) model, paying only for the storage they need when they need it.

Additionally, cloud storage is elastic, enabling businesses to scale their storage capacity up or down more efficiently and strategically than through tactical hardware investments.

Advantages of private cloud

Private cloud storage allows for customized control and security measures, as organizations have full authority over their data environment. This setup can be highly beneficial for industries with strict data regulations, like finance and healthcare, as it enables better compliance with data privacy laws.

Additionally, private clouds provide reliable performance since resources are not shared with external users, reducing latency issues and enabling faster data access for internal teams.

Types of cloud storage architecture

In addition to the elasticity and scalability benefits of cloud storage, you can also combine on-premises storage and different types of public or private cloud storage to uniquely support your business needs. The primary models of cloud storage are:

Hybrid cloud storage: A hybrid model combines both public and private cloud storage. This allows an organization to decide which data it wants to store in which cloud. Sensitive data and data that must meet strict compliance requirements may be stored in a private cloud or on-premises while less sensitive data is stored in the public cloud. You could also use hybrid cloud to leverage on-premises storage for performance-sensitive tasks, such as using NAS to edit large media files locally, which are later synced to the cloud.
Multi-cloud storage: A multi-cloud model involves using two or more public cloud storage services from different service providers. This model helps businesses leverage the best features of each cloud service while enhancing data availability and redundancy. For example, some companies use multiple cloud providers to host mirrored copies of their active production data. If one of their public clouds suffers an outage, they have mechanisms in place to direct their applications or websites to failover to a second public cloud.

This flexibility in cloud storage architecture allows businesses to balance performance, cost, and security—ensuring critical data is stored securely while remaining accessible and resilient across multiple environments.
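
The multi-cloud failover idea described above can be sketched in a few lines. This is an illustrative outline only: the provider names and fetch functions are hypothetical stand-ins for real storage clients, and a production setup would typically add health checks and narrower error handling.

```python
from typing import Callable, List, Sequence, Tuple


def fetch_with_failover(
    key: str,
    providers: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each provider in order and return (provider name, data) from the first success."""
    errors: List[str] = []
    for name, fetch in providers:
        try:
            return name, fetch(key)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for two clouds holding mirrored copies of the same data
def primary_cloud(key: str) -> bytes:
    raise ConnectionError("simulated outage")


def secondary_cloud(key: str) -> bytes:
    return b"contents of " + key.encode()


print(fetch_with_failover("report.csv", [("primary", primary_cloud), ("secondary", secondary_cloud)]))
```

Keeping the providers as an ordered list makes the preference explicit: the first entry serves traffic until it fails, and later entries act as fallbacks.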

How does cloud storage work?

Cloud storage works by allowing users to upload data, such as files, documents, videos, or images, to remote servers via the internet.

Public cloud storage providers like Amazon, Google, Microsoft, and Backblaze maintain servers in large data centers. The uploaded data can be accessed and managed through web interfaces or APIs, making it highly accessible and flexible.

Cloud storage offers numerous benefits that can greatly enhance business operations, such as storage space scalability, flexible data sharing options, and built-in data protection through regular backups and client-side encryption. However, there are also a few considerations like data security and storage costs to keep in mind. Next, we’ll look at the advantages and some of the key limitations of cloud-based storage solutions.

Advantages of cloud storage

Cloud storage enables businesses to scale with ease, reduce IT burdens, and access data remotely—offering a reliable, cost-efficient way to manage critical information. Here are some of the advantages of cloud storage:

Off-site protection: Cloud storage provides convenient off-site protection for data, ensuring that in the event of a physical disaster (such as fire or flood), data remains safe and accessible from any location. This supports data redundancy and business continuity.

Enhanced security: Leading cloud providers invest heavily in advanced security measures—including encryption, multi-factor authentication, Object Lock for immutability, and regular security audits—to protect stored data from unauthorized access and breaches.
Scalability: Cloud storage services offer virtually unlimited storage capacity. Businesses can easily scale their storage needs up or down based on demand without needing to invest in physical hardware.
Accessibility: Data stored in the cloud can be accessed from anywhere with an internet connection, facilitating remote work and data sharing across teams and locations.
Lower maintenance: Cloud providers handle all hardware maintenance, software updates, and security patches, reducing the burden on businesses of managing storage systems.
Cost efficiency: Many cloud storage solutions operate on a pay-as-you-go model, allowing businesses to pay only for the storage they use, which can be more cost-effective than local NAS or investing in on-premises hardware.

Limitations of cloud storage

While cloud storage offers flexibility and scalability, it also has limitations, such as ongoing costs and internet dependence, that businesses should evaluate carefully.

Ongoing costs: Unlike on-premises storage solutions such as NAS, cloud storage operates on a subscription-based pricing model. When evaluating cloud storage, businesses should consider the total cost of ownership, including ongoing fees, and weigh these against the benefits of cloud storage.
Dependence on the internet: Cloud storage relies on a stable internet connection for access and data transfer. Any disruptions in internet connectivity can hinder access to critical files and services, potentially impacting business operations. Ensuring reliable internet service and having contingency plans are crucial for minimizing downtime.

NAS vs cloud storage: A side-by-side comparison

The following table provides a side-by-side comparison of NAS and cloud storage, highlighting key aspects such as cost, scalability, security, and performance. This comparison will help you determine which storage solution best aligns with your business requirements and operational workflows.

Aspect	NAS	Cloud Storage
Storage model	File-level storage within a local network	Data stored on remote servers accessed via the internet
Performance	High speed access over a local network; optimal for on-premises work	Dependent on internet speed and latency; suitable for global access and remote teams
Scalability	Limited by physical hardware capacity; requires purchasing new devices for expansion	Virtually unlimited scalability, allowing storage to expand without additional hardware
Cost	Upfront hardware purchase, ongoing investment to expand capacity	Subscription-based, pay-as-you-go model, often with no upfront hardware investment
Maintenance	Requires in-house IT maintenance, updates and troubleshooting	Maintenance handled by cloud provider, reducing IT burden
Security	Controlled in-house with local network security; ideal for highly sensitive data	Enhanced by provider with encryption, multi-factor authentication, and security audits
Data redundancy	RAID configurations for local redundancy	Built-in data redundancy and disaster recovery options
Accessibility	Limited to local network access or VPN for remote connections	Accessible from anywhere with an internet connection, supporting remote work and collaboration
Compliance	Greater control for compliance in regulated industries; depends on in-house protocols	Many providers offer compliance with standards like GDPR, HIPAA, and SOC 2, ideal for regulated industries

Hybrid cloud: The best of both worlds

A hybrid cloud solution combines the strengths of both NAS and cloud storage. While NAS offers a centralized location to store and access files, the data stored on the NAS is still vulnerable to data disasters such as floods, fires, or hardware failures.

By integrating cloud storage with NAS, you create an off-site backup of your NAS data that securely protects your critical data from virtually any data threat. This approach not only mitigates the risk associated with physical damage to your on-premises NAS equipment but also offers the scalability, flexibility and remote accessibility benefits of cloud storage.

Additionally, this helps you implement 3-2-1 backup protection, where three copies of your data are stored on two different storage media (NAS and cloud) with one copy stored off-site in the cloud, protecting against ransomware, hardware failures, natural disasters, and other data threats.
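
To make the 3-2-1 rule concrete, here is a small sketch that checks an inventory of copies against it. The `BackupCopy` type and the sample inventory are hypothetical, introduced only for illustration; the rule itself (at least three copies, at least two media types, at least one copy off-site) is the one described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    media: str      # e.g. "workstation", "nas", "cloud"
    off_site: bool  # True if stored away from the primary location


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Check: at least 3 copies, on at least 2 media types, at least 1 off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_off_site = any(c.off_site for c in copies)
    return enough_copies and enough_media and has_off_site


# Hypothetical inventory: working copy, NAS copy, and a cloud backup.
inventory = [
    BackupCopy(media="workstation", off_site=False),
    BackupCopy(media="nas", off_site=False),
    BackupCopy(media="cloud", off_site=True),
]
print(satisfies_3_2_1(inventory))      # all three conditions hold
print(satisfies_3_2_1(inventory[:2]))  # only two copies, none off-site
```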

NAS vs. cloud: Which is best for your business?

Choosing between NAS and cloud storage for your business largely depends on your specific use cases and operational needs. NAS provides fast local access, control, and cost efficiency for businesses with stable storage needs and on-premises operations. In contrast, cloud storage offers unparalleled scalability, remote access, and maintenance-free operation, making it ideal for organizations with dynamic storage needs and remote workforces.

However, many businesses find that a combination of both, known as a hybrid cloud solution, offers the best of both worlds by combining the control of NAS with the scalability of cloud storage.

Ultimately, the right choice will depend on a thorough evaluation of your business needs and operational workflows. By understanding the strengths and limitations of both NAS and cloud storage, you can make an informed decision that ensures your data is secure, accessible, and available when you need it.

FAQs about NAS and cloud storage

Is cloud storage better than NAS?

The answer depends on your specific business needs. Cloud storage offers scalability, remote access, and minimal maintenance requirements. NAS, on the other hand, provides fast local access and higher control over data management and security settings. Each solution has its strengths, and the best choice will depend on your priorities regarding data security, access, and cost.

Can I use a NAS as a cloud?

Yes, many modern NAS devices come with built-in features that allow them to function similarly to cloud storage, or to connect to a cloud storage provider of your choice. These NAS systems can be accessed remotely over the internet, creating a private cloud or hybrid cloud solution. However, it requires proper configuration, secure settings and a reliable internet connection to ensure seamless remote access.

Why use NAS instead of a server?

NAS devices are purpose-built for storage, offering simplicity, ease of management, and lower costs compared to traditional servers. While servers are multifunctional and can handle a variety of tasks, they are more complex to set up and maintain. NAS provides a straightforward solution for file sharing, backups, and media streaming without the need for extensive IT infrastructure. This makes NAS an excellent choice for small to medium-sized businesses that primarily need a dedicated storage solution.

Can NAS work without the internet?

Yes, NAS devices are designed to operate within a local area network (LAN) and do not require an internet connection for local access and file sharing. Users can store, access, and collaborate on files within local networks without internet access. However, for remote access or to leverage additional features such as cloud backups, an internet connection is necessary.

The post NAS vs. Cloud Storage: Which Remote Storage Option Is Best? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

2024 State of the Backup: Survey Says Security Incidents and Data Loss on the Rise

2024-06-11 Yev

Post Syndicated from Yev original https://backblaze.com/blog/2024-state-of-the-backup-security-incidents-and-data-loss-on-the-rise/

A decorative image showing several icons that represent graphs, charts, and stats.

June is Backup Awareness Month, and every year, we work with the Harris Poll to survey the state of computer backups in the U.S. It’s our 16th year running, and this year, we expanded our lens and created a new survey focused on analyzing the state of backups among businesses, providing critical insights into organizational backup strategies and challenges.

And as in previous years, our consumer backup survey provides a comprehensive summary which reflects trends and changes over nearly two decades. The combination of these two audience surveys provides a more complete picture for the state of backups in the U.S. Let’s start with our new survey data.

Spotlight: Business Backup Is Coming Up Short

Our inaugural Business Backup Survey included 300 IT decision makers across the U.S. Part of what we wanted to learn was:

With all of the different ways IT professionals have to protect their users’ data, how are they choosing to back up and are backup solutions even working?

We can infer the answers to those questions by looking at the tools organizations use to back up their data, the frequency of data recoveries, and how successful or not those recoveries were.

What Tools Do Organizations Use to Back Up Their Data?

One of the most striking findings from the poll is that a significant majority (84%) of IT decision makers say their organizations utilize cloud drive services, which rely on syncing data to the cloud, for off-site data backup. You may have heard us say this before—sync is not backup.

What’s Wrong with Cloud Drives and Sync Services?

Cloud drives allow for file storage and sharing but may not protect against file corruption or accidental deletion. Sync services automatically update files across multiple devices, meaning that any changes or deletions are replicated everywhere, which can lead to unintended data loss. While some cloud drives have added minimal backup capabilities (i.e., 30 days of version history or similar), they are often lacking in key areas that are necessary for business continuity or compliance standards.

Cloud backup solutions, on the other hand, are designed to systematically and securely back up data, offering robust protection against loss, corruption, and security breaches. This makes cloud backup a better choice, particularly for addressing security concerns and ensuring the integrity and availability of critical data.
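
The difference between sync and backup can be shown with a toy model. The sketch below uses plain in-memory dictionaries as hypothetical stand-ins for a laptop, a synced copy, and a backup history; it does not model any particular product, only the behavior described above, where a sync mirrors deletions while a backup keeps point-in-time copies.

```python
import copy

# Hypothetical in-memory "file systems": file name -> contents.
laptop = {"budget.xlsx": "v1", "notes.txt": "hello"}

synced_copy = {}     # a sync service mirrors the current state
backup_history = []  # a backup service keeps point-in-time snapshots


def sync(source, target):
    """Mirror source into target, including deletions."""
    target.clear()
    target.update(source)


def back_up(source, history):
    """Append an independent snapshot of source to the history."""
    history.append(copy.deepcopy(source))


sync(laptop, synced_copy)
back_up(laptop, backup_history)

del laptop["budget.xlsx"]  # accidental deletion
sync(laptop, synced_copy)  # the deletion is replicated to the synced copy

print("budget.xlsx" in synced_copy)        # the synced copy lost the file too
print("budget.xlsx" in backup_history[0])  # the earlier snapshot still has it
```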

How Often Do Organizations Need to Restore Data?

39% of IT decision-makers report that their organizations need to restore data from backups at least once a month, with special requests for archived or deleted data (62%), backup software failure (54%), hard drive failure (52%), and cyber attacks (49%) reported as some of the top reasons. This frequent need for data recovery underscores the persistent vulnerabilities IT professionals face.

Are Data Recoveries Successful?

Not only do many organizations need to restore on a regular basis, but the survey also shows that among those that experienced data loss, only 42% were able to recover all of their data when they performed a restore. That leaves 58% with some amount of unrecovered data.

Are Backups Working?

The data shows a sizable gap between the use of backup services and the effectiveness of data restoration. Although a significant percentage of organizations indicate they’re using what they would consider a cloud backup solution (the shortcomings of cloud drives and cloud sync services aside), only 42% of those that experienced data loss were able to restore all their data. This discrepancy highlights the risks associated with inadequate backup measures and the potential for data loss, which can have serious repercussions for businesses.

There are all sorts of ways businesses need to slice their data management strategy in order to make sure all data is backed up. This includes data type (e.g., files vs. system information), frequency with which the data is updated or changed, retention requirements for compliance, and more. There are often reasons that businesses will employ different backup frequency or strategies for different file types—file-based versus block-level incremental backups, for example. However, incomplete backups can lead to situations where only parts of the data can be restored, disrupting business operations and resulting in downtime as efforts are made to recover or reconstruct lost data.

The importance of creating an end-to-end data backup plan, as well as choosing the right tools that provide comprehensive coverage, may be highlighted only at the moment of failure. As it stands, the Harris Poll data suggests that the limitations of cloud sync and cloud drive tools are leaving gaps in data protection and disaster recovery strategies.

This is further validated by the features IT decision makers report as being absolutely essential/very important in selecting backup tools, including security (97%), bandwidth and memory capacity (87%), a variety of features (79%), ease of operations and customizable elements (83% each). These rigorous requirements suggest that many existing solutions may fall short of meeting the comprehensive needs of modern businesses, and/or that the complex mix of tools may be contributing to blind spots in an overall data management strategy, only exposed at the point of recovery.

How To Close the Gap?

These insights underscore the need for innovative and robust backup solutions that address evolving business requirements. As the volume of data continues to grow and cyber threats become increasingly sophisticated, the demand for reliable, secure, and user-friendly backup systems will only heighten. Given the challenges many businesses face in fully recovering their data, there’s an opportunity to promote education and awareness regarding the importance of refreshing backup strategies and utilizing suitable backup tools.

Consumer Backup Practices: Less Than 1 in 5 Are Certain of Their Backups

The consumer portion of the 2024 Backup Awareness Survey seeks to understand a simple question we’ve asked year after year: How often do you back up all of the data on your computer? We also look at who backs up the most and the reasons people cite for needing to restore data, and we compare those trends over time. Let’s dig in to the results.

How Often Do People Back Up?

This year’s survey reveals that fewer than 1 in 5 Americans (15%) feel absolutely certain that their most important files are securely backed up. This is despite 84% of Americans who own a computer stating that they’ve backed up all their data and 45% performing backups at least once a month.

The survey also highlights the predominance of cloud solutions among backup methods. 63% of individuals who back up their data use a cloud-based system as their primary method. However, only 11% utilize dedicated cloud backup services, indicating a preference towards cloud drives (39%) and sync services (13%). As we noted above, cloud drives and sync services are fundamentally different from cloud backup solutions and can create gaps in a robust 3-2-1 backup strategy.

Who Is Best at Backing Up?

Every year, we highlight which demographic is the best at backing up their data, and in 2024, men (73% vs. 66% of women) and younger adults ages 18–54 (76% vs. just 61% of those ages 55+) take the lead backing up at least once a year.

The Reasons for Restores

The survey also found that 74% of Americans who own a computer have accidentally deleted important data (a 5.7% increase from 2023), and 57% have experienced a security incident on their computer.

Trends Over Time

For those interested in the data over time, let’s travel back and see how this year’s data compares to previous years. The first graph is one of our favorites. Since 2023, daily backups have dipped by 1%, while weekly and monthly backups have remained steady, which is encouraging. Additionally, there is a slight, but not statistically significant, increase of 1% in yearly and more-than-yearly backups. Notably, the percentage of people who have never backed up their data has decreased by 2%.

For all the table enthusiasts, you’ll appreciate this detailed view showcasing how 2024 compares with previous years. We love to see Never down to an all-time low, although Daily took a slight dip.

If you’re a visual person who appreciates vibrant pie charts for easier data digestion, here are pie charts comparing data from 2008 to 2024:

Within each population (business and consumer), the most striking data points are around the differences between backup and sync. Both consumers and businesses are leveraging cloud drive and sync services for ease of use, but that has not translated to successful data recoveries. With ransomware attacks on the rise, now more than ever, it’s essential to have a strong backup strategy.

Still, we’ve come a long way since 2008, and the consumer data shows positive change over time around backup awareness and tool adoption. Going forward, we’ll be interested to see how the business audience data changes over time. See below for our full testing methodology, and, as always, drop us a line in the comment section if you have any questions or insights.

Consumer Survey Method:

This survey was conducted online within the United States by The Harris Poll on behalf of Backblaze from April 25-29, 2024, among 2,058 adults ages 18 and older, among whom 1,877 own a computer. The sampling precision of Harris online polls is measured by using a Bayesian credible interval. For this study, the sample data is accurate to within +/- 2.5 percentage points using a 95% confidence level.

Prior years’ surveys were conducted online by The Harris Poll on behalf of Backblaze among U.S. adults ages 18+ who own a computer on April 25–27, 2023 (n=1,857); May 19–23, 2022 (n=1,861); May 12–14, 2021 (n=1,870); June 1–3, 2020 (n=1,913); June 6–10, 2019 (n=1,858); June 5–7, 2018 (n=1,871); May 19–23, 2017 (n=1,954); May 13–17, 2016 (n=1,920); May 15–19, 2015 (n=2,009); June 2–4, 2014 (n=1,991); June 13–17, 2013 (n=1,952); May 31–June 4, 2012 (n=2,176); June 28–30, 2011 (n=2,209); June 3–7, 2010 (n=2,051); May 13–14, 2009 (n=2,154); and May 27–29, 2008 (n=2,723).

Business Backup Survey Method:

This survey was conducted online within the United States by The Harris Poll on behalf of Backblaze from April 30 – May 8, 2024, among 300 IT Decision Makers. The sampling precision of Harris online polls is measured by using a Bayesian credible interval. For this study, the sample data is accurate to within +/- 5.7 percentage points using a 95% confidence level.
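
For readers who want a rough feel for how sample size relates to the stated precision, the classical worst-case margin of error can be computed from the sample size alone. This is only an approximation under assumptions introduced here (simple random sampling, a proportion of 0.5, and z = 1.96 for 95%); it is not the Bayesian credible interval method that The Harris Poll describes, so the published figures may differ from it.

```python
import math


def worst_case_margin_of_error(n: int, z: float = 1.96) -> float:
    """Classical margin of error (as a proportion) at p = 0.5 for sample size n."""
    return z * math.sqrt(0.25 / n)


for label, n in [("business survey", 300), ("consumer survey", 2058)]:
    moe_points = worst_case_margin_of_error(n) * 100
    print(f"{label}: n={n}, about +/- {moe_points:.1f} percentage points")
```

For n = 300 this approximation gives roughly 5.7 points, in line with the figure stated for the business survey. For n = 2,058 it gives roughly 2.2 points, somewhat narrower than the stated 2.5, which is not surprising given that the published interval is computed by a different method.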

For complete survey methodologies, including weighting variables and subgroup sample sizes, please contact Backblaze.

The post 2024 State of the Backup: Survey Says Security Incidents and Data Loss on the Rise appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Secure file sharing solutions in AWS: A security and cost analysis guide, Part 1

2024-06-06 Sumit Bhati

Post Syndicated from Sumit Bhati original https://aws.amazon.com/blogs/security/how-to-securely-transfer-files-with-presigned-urls/

July 28, 2025: This post has been updated and expanded into a comprehensive two-part series covering multiple AWS file sharing solutions. This new series provides in-depth analysis of security and cost considerations to help you make informed decisions based on your requirements.

Note: This is Part 1 of a two-part post. You can read Part 2 here.

Sharing files with an outside entity—to share data between business partners or facilitate customer access to files—is a common use case for Amazon Web Services (AWS) customers. Organizations must balance security, cost, and usability. In a business-to-business data sharing scenario, these challenges become even more complex because human interaction is often minimal or absent, requiring robust automated solutions. Many AWS services offer multiple options for granting access. The one that’s best for your use case depends on multiple factors.

This post helps you decide which AWS services to use to implement a file sharing approach that suits your business needs. We focus on security controls and cost implications, describe some of the trade-offs, and highlight key differences to help you make an informed decision based on your specific requirements. We go through each option, highlighting their strengths and limitations, and provide guidance on choosing the right solution for your use case.

Understand your needs first

The first step in designing an AWS file sharing solution is to develop a clear understanding of your requirements and constraints. Because there are several possible design patterns and a number of different AWS services to consider, you need to start by identifying and prioritizing the features that you need. Gather the following information to guide your approach:

Access patterns and scale

When planning for access patterns and scale, there are a few key factors to keep in mind. First, consider how files are shared—machine-to-machine, human-to-machine, or human-to-human—because that impacts security and performance. Then, think about transfer frequency—are files exchanged only once a day, or are thousands moving every hour? If download control matters, setting limits on how often a file can be accessed might be necessary. File sizes also play a role, from typical everyday transfers to the largest files you need to support. Finally, total data volume shapes how much information you’ll be transferring on a regular basis.

Technical requirements

Your choice of solution will be influenced by technical constraints and capabilities. Protocol requirements often drive initial decisions, such as whether you need SFTP, FTPS, or HTTPS access. Consider existing systems that must interface with your solution and how they’ll connect. Performance considerations span several dimensions: acceptable latency for file transfers, geographic distribution of your users, bandwidth requirements, and whether you need built-in retry mechanisms for failed transfers. Additionally, think about how many simultaneous transfers your solution needs to support.

Security and compliance

Security and compliance requirements will definitely influence your file sharing strategy. Consider who controls encryption keys—whether managed by AWS or your organization—and what key rotation policies are needed. Authentication needs often vary—you might be authenticating individual users, specific systems, or entire business entities, using methods ranging from passwords to API keys, multi-factor authentication, or certificates. Your audit requirements will influence your choices in logging and monitoring capabilities. You might have geographic considerations like data sovereignty requirements, storage location restrictions, and access controls that consider the recipient’s location. If your data is subject to a law, like GDPR in Europe or HIPAA in the United States, or if your data is regulated by a standard like the Payment Card Industry’s Data Security Standard (PCI-DSS), you will need to consult with your own legal and compliance advisors to see what is required. When assessing risk tolerance, consider the security triad of confidentiality, integrity, and availability—some use cases might tolerate brief periods of unavailability but cannot risk data exposure, while others prioritize continuous availability.

Operational requirements

Day-to-day operations bring their own set of considerations. File retention policies determine how long data needs to be kept, while auto-deletion capabilities might be necessary for managing storage and compliance. Consider what kind of reporting and monitoring of file transfer activities you need. Do you need monthly reports, daily reports, or perhaps detailed real-time tracking of transfer activities? By adding error handling and notification systems, you can help make sure that problems are caught and addressed promptly. Disaster recovery requirements, expressed through recovery point objectives (RPO) and recovery time objectives (RTO), help determine the resilience needed in your solution.

Business constraints

Your solution must operate within your business constraints, such as budget limitations, technical limitations, timelines, available expertise, and service level agreements (SLAs). Budget limitations include initial implementation costs and ongoing operational expenses. Consider other parties’ technical limitations—they might use specific protocols such as SFTP, require mobile device compatibility, or operate older systems that have limited cryptographic capabilities. Implementation timelines influence choices between managed services that can be deployed quickly and custom solutions that require more time and expertise. The expertise available for solution maintenance is also a consideration. SLAs for file transfers might specify availability and performance requirements that you’re obligated to meet. To meet these constraints, you must estimate how much your file sharing needs will grow over time and determine if you need a regional or a global solution.

By carefully considering these aspects, you’ll be better prepared to evaluate different AWS file sharing solutions and select the one that best fits your use case. Understanding your requirements for uploads and downloads will help determine if your use case can be supported through a single AWS service or needs a combination of services.

Solutions

Let’s start by looking at the various file sharing mechanisms that AWS supports. The following table shows the solutions described in this post. For each solution, it identifies the key AWS services needed and summarizes the security features, cost implications, and Region control.

Solution	AWS services	Security features	Cost*	Region control
AWS Transfer Family	Transfer Family, Amazon S3, API Gateway, and Lambda	Managed security, encryption in transit and at rest, IAM integration, and custom authentication	$0.30 per hour per protocol, data transfer fees, and storage costs	Can deploy to specific AWS Regions, can only transfer files to and from S3 buckets in the same Region
Transfer Family web apps	Transfer Family, S3, and CloudFront	Browser-based access, IAM Identity Center integration, and S3 Access Grants	Pay-per-file operation, CloudFront costs, and storage costs	Uses CloudFront (global) for web access, but backend components can be Region-specific
Amazon S3 presigned URLs	S3	Time-limited URLs, IAM controls for URL generation, and HTTPS	S3 request and data transfer fees	Can be restricted to specific Regions
Serverless application with Amazon S3 presigned URLs	S3, AWS Lambda, and API Gateway	Time-limited URLs, HTTPS, IAM controls, customizable authentication	Pay per request and minimal infrastructure cost	Components can be Region-specific

The following table shows the solutions described in Part 2.

Solution	AWS services	Security features	Cost*	Region control
CloudFront signed URLs	CloudFront, Amazon S3, and Lambda	Optional edge security using AWS Lambda@Edge, AWS WAF integration, SSL/TLS, geo restrictions, and AWS Shield Standard (included automatically)	Content delivery network (CDN) costs, request pricing, and data transfer fees	Global service by design; origin can be AWS Region-specific
Amazon VPC endpoint service	PrivateLink, VPC, and NLB	Complete network isolation, private connectivity, and multi-layer security	Endpoint hourly charges, NLB costs, and data processing fees	Service endpoints are strictly Region-specific; must create endpoints in each Region where access is needed
S3 Access Points	S3, IAM, VPC (for VPC-specific access points)	Dedicated IAM policies per access point; VPC-only access restrictions available; works with bucket policies for layered security; supports AWS PrivateLink for private network access; compatible with S3 Block Public Access settings	No additional charge for S3 Access Points; standard S3 request pricing applies; data transfer fees apply based on standard S3 rates; VPC endpoint charges apply when using VPC endpoints with access points	Access points are Region-specific; each access point is created in the same Region as its S3 bucket; cross-Region access requires separate access points in each Region; VPC-specific access points are limited to the VPC’s Region

* Pricing information provided is based on AWS service rates at the time of publication and is intended as an estimation only. Additional costs may be incurred depending on your specific implementation and usage patterns. For the most current and accurate pricing details, please consult the official AWS pricing pages for each service mentioned.

Let’s examine the solutions in detail.

AWS Transfer Family

AWS Transfer Family is a managed file transfer service for SFTP, FTPS, and AS2 protocols. It integrates directly with Amazon Simple Storage Service (Amazon S3) for storage and supports custom identity providers for authentication through Amazon API Gateway and AWS Lambda.

As shown in Figure 1, when a user initiates a file transfer, Transfer Family authenticates them through the configured identity provider using API Gateway and Lambda. After authentication succeeds, the service maps the user to an AWS Identity and Access Management (IAM) role that defines their S3 bucket access permissions. The service encrypts data in transit using TLS 1.2 and data at rest using S3 server-side encryption.

Figure 1: AWS Transfer Family architecture
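
To illustrate the authentication step in this flow, here is a minimal sketch of a Lambda function acting as a custom identity provider. Everything in it is an assumption for illustration: the `USERS` directory, the role ARN, and the bucket path are hypothetical, and the event keys (`username`, `password`) and response keys (`Role`, `HomeDirectory`) reflect our understanding of the contract between Transfer Family and a Lambda-backed identity provider, which you should confirm against the current AWS Transfer Family documentation before relying on it.

```python
import hmac

# Hypothetical user directory; a real deployment would query an identity
# provider or a secrets store instead of hard-coding credentials.
USERS = {
    "partner-a": {
        "password": "example-password",
        "role_arn": "arn:aws:iam::123456789012:role/example-transfer-role",
        "home_directory": "/example-bucket/partner-a",
    }
}


def lambda_handler(event, context):
    """Return access details for a valid user, or an empty dict to deny."""
    user = USERS.get(event.get("username", ""))
    supplied = event.get("password", "")
    # Compare as bytes in constant time to avoid leaking timing information.
    if user is None or not hmac.compare_digest(
        supplied.encode(), user["password"].encode()
    ):
        return {}  # assumption: an empty response means authentication failed
    return {
        "Role": user["role_arn"],  # IAM role that scopes S3 bucket access
        "HomeDirectory": user["home_directory"],
    }
```

Returning only a role and a home directory keeps the example small; the role is what ultimately limits which S3 paths the authenticated user can reach.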

Transfer Family automatically handles scaling from zero to thousands of concurrent users, manages high availability across Availability Zones, and minimizes infrastructure management. It records detailed metrics and logs in Amazon CloudWatch for monitoring and auditing, supporting compliance requirements with activity tracking.

It’s important to note that Transfer Family also offers service-managed authentication. This simpler setup stores user credentials (passwords or SSH keys) directly in Transfer Family, minimizing the need for external identity providers. Service-managed authentication is best suited if you have a small number of users or no existing identity management system, or when you want to have a disconnected identity system and don’t want to give external partners an account in your identity provider system.

Pros

One of the biggest advantages of Transfer Family is how it provides the reliability and scalability of Amazon S3 for storing your data, while keeping that data available to existing client applications and workflows. The service integrates with existing authentication systems through custom identity providers, while maintaining security through IAM policies. Its auto-scaling capabilities handle variable workloads, from occasional transfers to high-volume scenarios.

Transfer Family also offers detailed CloudWatch logging and audit trails for file transfer activities, which should be sufficient for most logging and audit needs. It encrypts data in transit using TLS 1.2 and at rest using Amazon S3 server-side encryption. You can implement fine-grained access controls through IAM roles and integrate with AWS Organizations for multi-account management. The service supports VPC endpoints for secure internal access and custom domain names for branded endpoints.

Because data is stored in S3, some of your requirements will be fulfilled by configuring S3 rather than Transfer Family. Data retention, for example, is handled through S3 Object Lock (to prevent deletion) and S3 Lifecycle rules (to schedule deletion).

Cons

The pricing structure of Transfer Family includes $0.30 per hour for each protocol you enable and data transfer fees based on data volume. There can be additional charges for custom domain names. If you use VPC endpoints for secure internal access to Amazon S3, there will also be VPC data charges. If you have high-volume transfers or multiple endpoints across AWS Regions, you will face increased costs. Because the data ultimately lives in S3, S3 storage and request pricing applies as well.

Custom identity provider implementations (such as SAML or OAuth) require additional configuration and add extra steps to authentication, so transfer initiation takes longer than it does with service-managed authentication.

The Regional nature of Transfer Family means you must choose between deploying in a single Region (simpler management but potential latency for global users) or multiple Regions (better performance but higher costs at $0.30 per protocol per hour per Region). A multi-Region deployment can also serve as a disaster recovery strategy or satisfy Regional data isolation requirements.
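To make the endpoint pricing concrete, here is a rough estimate of the fixed monthly charge. It uses the $0.30 hourly rate quoted above and assumes a 720-hour (30-day) month. The function name is hypothetical, and the estimate excludes data transfer, VPC, and S3 charges.

```python
HOURLY_RATE_PER_PROTOCOL_USD = 0.30  # rate quoted in this post
HOURS_PER_MONTH = 720                # assumption: 30-day month


def monthly_endpoint_cost(protocols, regions, hours=HOURS_PER_MONTH):
    """Estimate the fixed endpoint charge for enabled protocols across Regions."""
    return HOURLY_RATE_PER_PROTOCOL_USD * protocols * regions * hours


# One protocol in one Region comes to roughly $216 per month; enabling two
# protocols in each of two Regions quadruples that to roughly $864.
single = monthly_endpoint_cost(protocols=1, regions=1)
multi = monthly_endpoint_cost(protocols=2, regions=2)
```

Because the charge accrues per hour whether or not files are moving, this fixed cost is worth comparing against your expected transfer volume.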

Transfer Family web apps

Transfer Family web apps provide browser-based access to Amazon S3, enabling users to upload and download files through a web interface. With the web apps, you can create a branded, secure, and highly available portal for your users to browse, upload, and download data in S3. Web apps are built using Storage Browser for S3 and offer the same user functionalities in a fully managed offering without having to write code or host your own application.

When a user accesses the web application, authentication occurs through AWS IAM Identity Center, and S3 Access Grants determine their permissions to specific S3 buckets or prefixes. The access grant permissions can be either read-only or read and write. After authentication succeeds, users can upload or download files directly through the web interface. The service uses Amazon CloudFront for content delivery and implements SSL/TLS encryption for data transfers, while S3 provides server-side encryption for data at rest. Figure 2 shows a simplified Transfer Family web app architecture.

Figure 2: Simplified Transfer Family web app architecture

The web application automatically scales to accommodate varying numbers of users and provides high availability through the CloudFront global edge network. It minimizes the need for custom web application development and provides logging through AWS CloudTrail and CloudWatch. You can customize the user experience by implementing custom domains through CloudFront distributions.

Transfer Family web apps support multiple authentication methods, with IAM Identity Center being one of the primary options. Identity Center provides simplified user management and integration with existing identity providers, along with useful mechanisms such as multi-factor authentication (MFA), strong password policies, and resetting lost passwords. It’s not the only authentication method available; you can also use custom identity providers for authentication, providing flexibility in how you manage user access to the web application.

Pros

Transfer Family web apps minimize the need to build and maintain custom web interfaces for Amazon S3 file sharing. They provide seamless integration with IAM Identity Center for user management and authentication, enabling you to use existing identity providers. The service offers fine-grained access control through S3 Access Grants, allowing precise permission management at the bucket and prefix level. Its integration with CloudFront provides global availability and enhanced performance, while CloudTrail logging offers audit capabilities.

The service provides robust security features including SSL/TLS encryption, CORS policy management, and optional integration with AWS WAF for protection against bots, web scrapers, DDoS events, and more. You can implement custom domains for branded experiences and use CloudFront security features including DDoS protection using AWS Shield. The web interface offers intuitive file management capabilities without requiring client software or that users have technical expertise.

Cons

Transfer Family web apps require using IAM Identity Center, which might require additional setup and configuration if you’re not currently using this service. The web interface currently requires the Identity Center identities to live in the same AWS account as the S3 buckets. That might create design challenges if you want to keep identities in one AWS account and data storage in another. Implementation requires careful cross-origin resource sharing (CORS) configuration for each S3 bucket.

The service incurs costs for both Transfer Family and associated services, including CloudFront distribution and data transfer fees. Custom domain implementation requires additional configuration and SSL certificate management through AWS Certificate Manager (ACM). The web interface is well suited for humans to upload or download, but it’s not as good for automated workflows that transfer files from machine to machine. You must carefully manage user assignments and access grants to maintain security, adding administrative overhead.

S3 pre-signed URLs

Amazon S3 pre-signed URLs enable secure, time-limited access to objects in S3 without requiring the file recipient to have an identity in your identity systems. The URLs are generated using the AWS SDK or AWS Command Line Interface (AWS CLI), granting specific permissions (GET, PUT) that are valid for up to seven days. When accessing files, S3 validates the cryptographically signed parameters in these URLs before permitting access to objects. This provides a direct method for secure file sharing through HTTPS endpoints.

The solution requires only an S3 bucket and appropriate IAM permissions for URL generation. S3 handles the authentication of the pre-signed URL parameters and manages access to objects. File transfers occur directly between users and S3 through HTTPS endpoints, with the pre-signed URL controlling the access patterns.

Amazon S3 provides security features including server-side encryption, access logging, and CloudTrail integration. The security of pre-signed URLs is primarily managed through expiration times and specific operation permissions defined during URL generation.
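URL generation with the AWS SDK for Python (Boto3) can be sketched as follows. The helper name, bucket, and key are hypothetical. `generate_presigned_url` is the Boto3 client method for this purpose, and the validation simply mirrors the seven-day ceiling described above.

```python
MAX_EXPIRY_SECONDS = 7 * 24 * 60 * 60  # seven-day ceiling mentioned above


def make_download_url(s3_client, bucket, key, expires_in=900):
    """Return a time-limited GET URL for a single object."""
    if not 1 <= expires_in <= MAX_EXPIRY_SECONDS:
        raise ValueError("expires_in must be between 1 second and 7 days")
    return s3_client.generate_presigned_url(
        ClientMethod="get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


def main():
    # Imported here so the helper above can be exercised without Boto3 installed.
    import boto3

    client = boto3.client("s3")
    # Hypothetical bucket and key; the caller needs s3:GetObject on the object.
    print(make_download_url(client, "example-bucket", "reports/q1.pdf"))
```

Calling `main()` with AWS credentials configured prints a URL that anyone can use until it expires, which is why the expiration time deserves deliberate thought.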

Pros

Amazon S3 pre-signed URLs follow a straightforward pay-per-use pricing model, charging only for S3 storage, requests, and data transfers. For example, if you create pre-signed URLs but the object isn’t actually downloaded, you pay storage costs as usual, but you don’t pay transfer costs. The solution uses the native scalability of S3 to handle varying numbers of concurrent users without additional infrastructure. You can implement granular access controls through URL expiration times and specific operation permissions (GET, PUT, DELETE).

Access is controlled through URL expiration enforcement. Amazon S3 server access logging and CloudTrail integration enable audit capabilities. The solution’s simplicity makes it ideal for basic file sharing needs while maintaining security and scalability.

Cons

A pre-signed URL can be used by anyone who has access to the URL. That’s the goal of this design: You don’t need to have an identity for the user. Pre-signed URLs can be reused an unlimited number of times until they expire. To improve security, short expiration times can limit the potential for URL re-use. Shorter expiration times, however, require the recipient to download the file soon after the URL is created.

When implementing this solution, you should establish processes for secure URL generation and distribution. Set your URL expiration times based on realistic expectations about how quickly your recipients will download the files. A web or mobile app where the user selects a link to download something (such as a document, an image, a data file) and they expect the download to start immediately is a good candidate for this design.

The solution works with files up to 5 GB for single operations. To share a file larger than 5 GB, you must split the file into multiple parts and issue multiple pre-signed URLs, and the recipient must then download all the parts and join them together correctly. This isn’t a good solution for sharing large files. Also, distributing large files as a single download can be difficult if the recipient doesn’t have good connectivity. Amazon S3 can start an object download from the middle of the object, but a recipient who simply selects a pre-signed URL cannot ask for that. So, if the recipient transfers 1 GB out of a 2 GB download, and then their connection is disrupted, they cannot pick up where they left off. They will restart from the beginning, which is undesirable. Overall, this design is unsuitable for transmitting large files over unreliable internet connections.
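For contrast, here is a sketch of what resuming from the middle of an object looks like when the recipient does have an SDK client and credentials, which is exactly what the pre-signed URL design tries to avoid requiring. The function names are hypothetical; `get_object` with a `Range` argument and `iter_chunks` on the response body are standard Boto3 calls.

```python
def resume_range(bytes_already_received):
    """Build an HTTP Range value that asks for everything after what we have."""
    if bytes_already_received < 0:
        raise ValueError("byte count cannot be negative")
    return f"bytes={bytes_already_received}-"


def resume_download(s3_client, bucket, key, local_path, bytes_already_received):
    """Append the remainder of an object to a partially downloaded file."""
    response = s3_client.get_object(
        Bucket=bucket, Key=key, Range=resume_range(bytes_already_received)
    )
    # Open in append mode so the bytes already on disk are kept.
    with open(local_path, "ab") as out:
        for chunk in response["Body"].iter_chunks(chunk_size=1024 * 1024):
            out.write(chunk)
```

A recipient who had 1 GB of a 2 GB object would pass `bytes_already_received=1024**3` and fetch only the second half.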

You should enable appropriate monitoring through Amazon S3 access logs and CloudTrail to track usage patterns and meet security compliance.

This solution is particularly effective if you’re seeking straightforward, secure file sharing capabilities where the files are small enough to download in one request, and where you have a secure mechanism to share the download URLs.

Serverless web application with S3 presigned URLs

Amazon S3 presigned URLs combined with a custom web application enable secure, time-limited access to S3 objects. The application generates URLs that grant specific S3 permissions (GET, PUT) for between one minute and seven days. When requesting file access, the application authenticates users and generates presigned URLs using the AWS SDK with defined permissions and expiration times.

The web application uses API Gateway and Lambda functions for authentication and URL generation. Amazon S3 validates the cryptographically signed parameters in these URLs before permitting access to objects. File transfers occur directly between users and S3 through HTTPS endpoints, with the application controlling the access patterns. The architecture is shown in Figure 3.

Figure 3: Amazon S3 pre-signed URLs architecture

The web application can implement security controls including request logging, rate limiting (requests per second), and authentication workflows. CloudWatch logs record API access patterns and Lambda execution metrics, while Amazon S3 access logging records object-level operations.

Pros

Amazon S3 presigned URLs follow a pay-per-use pricing model. This solution charges only for API Gateway requests, Lambda executions, and S3 operations performed. The serverless architecture scales automatically from zero to thousands of concurrent users without infrastructure management. You can implement custom security controls and business logic for specific access requirements through API Gateway authorizers (using custom identity solutions or Amazon Cognito) and Lambda functions.

The solution enforces security through URL expiration (maximum seven days), IAM policies restricting URL generation permissions, and HTTPS encryption for data transfers. Custom authentication workflows integrate with existing identity providers (SAML, OIDC). Additional security features include IP-based restrictions, required request headers, and request validation through AWS WAF. This solution would be good, for example, if you have a variety of files or a variety of buckets and you’re trying to build a unified front-end where people can download various files without knowing which bucket the files are stored in or what URL expiration time is appropriate. You can configure the frontend to look at tags on objects, tags on buckets, object names, or another attribute that fits your use case, and then choose a URL expiration time based on that attribute. For example, objects from buckets tagged Data Classification: Restricted might expire after 1 minute, whereas objects from buckets tagged Data Classification: Public might be valid for 7 days.
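The tag-based expiration idea can be sketched as a small lookup. The two classifications and their lifetimes come from the example above; the fallback value and the function name are hypothetical choices for illustration.

```python
EXPIRY_BY_CLASSIFICATION = {
    "Restricted": 60,            # 1 minute
    "Public": 7 * 24 * 60 * 60,  # 7 days
}
DEFAULT_EXPIRY_SECONDS = 15 * 60  # hypothetical fallback for untagged data


def expiry_for_tags(tags):
    """Pick a URL lifetime from a bucket's or object's tag dictionary."""
    classification = tags.get("Data Classification")
    return EXPIRY_BY_CLASSIFICATION.get(classification, DEFAULT_EXPIRY_SECONDS)
```

A Lambda function in the front end could call `expiry_for_tags` with the tags it reads from the bucket or object and pass the result as the expiration time when it generates the URL.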

Cons

Building a custom web application requires developing and maintaining the code for URL generation, authentication, and error handling logic. The application must track URL expiration times and implement mechanisms that permit retries for failed transfers. Monitoring systems must track URL usage, detect abuse patterns, and send alerts for security violations through CloudWatch metrics and logs.

One limitation of this solution is the 10 MB size limit imposed by API Gateway. This affects how your application handles file uploads and downloads. For uploads, files under 10 MB can be uploaded directly through API Gateway. Larger files require implementing multipart uploads, where the client splits the file into chunks and sends each chunk separately. For downloads, files under 10 MB can be downloaded directly through API Gateway but for larger files, your application should generate a pre-signed URL for direct Amazon S3 access, bypassing API Gateway.
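The download routing rule described above reduces to a size check. This sketch treats the 10 MB limit as 10 MiB, which is an assumption, and the route labels and function name are hypothetical.

```python
API_GATEWAY_PAYLOAD_LIMIT = 10 * 1024 * 1024  # 10 MB limit cited above


def download_route(object_size_bytes):
    """Decide whether a download can pass through API Gateway."""
    if object_size_bytes < API_GATEWAY_PAYLOAD_LIMIT:
        return "api-gateway"
    # Larger objects bypass API Gateway through a pre-signed URL to S3.
    return "presigned-url"
```

In practice the application would look up the object size first (for example, from object metadata) and then either stream the object or return a pre-signed URL.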

URL generation errors or misconfigured IAM permissions can expose objects to unauthorized access. The HTTPS-only protocol limits integration with SFTP and FTPS clients. Files larger than 5 GB require multipart upload implementation, and network interruptions need custom resume logic. This design will incur some extra charges if the number of file transfers is in the millions. Lambda functions cost $0.20 per million requests, and API Gateway costs $1.00 per million requests. Analyze your expected access patterns to determine whether these extra costs will be significant and if they’re worth the additional flexibility of custom transfer logic.
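A quick way to size these per-request charges is shown below. It uses the two rates quoted above and assumes each transfer triggers exactly one API Gateway request and one Lambda invocation. It ignores Lambda duration charges and S3 costs, and the function name is hypothetical.

```python
LAMBDA_USD_PER_MILLION = 0.20       # rate quoted in this post
API_GATEWAY_USD_PER_MILLION = 1.00  # rate quoted in this post


def request_overhead_usd(transfers):
    """Estimate request charges, assuming one API call and one invocation each."""
    millions = transfers / 1_000_000
    return millions * (LAMBDA_USD_PER_MILLION + API_GATEWAY_USD_PER_MILLION)


# Five million transfers a month would add roughly $6 in request charges.
estimate = request_overhead_usd(5_000_000)
```

At this scale the overhead is modest; it becomes more noticeable only when transfers reach hundreds of millions per month.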

Decision matrix: When to use each solution

The following table summarizes the characteristics of the solutions presented in the two parts of this post. See Part 2 for full descriptions of the solutions not covered in Part 1.

Characteristics	Transfer Family	Transfer Family web app	S3 pre-signed URLs (Direct)	Serverless web application with S3 pre-signed URL	CloudFront signed URLs (Part 2)	VPC endpoint service (Part 2)	S3 Object Lambda (Part 2)
Protocol support	SFTP, FTPS, and AS2	HTTPS (web-based)	HTTPS	HTTPS	HTTPS with CDN	A TCP-based protocol	HTTPS
Global distribution	Global endpoint support	CloudFront integration	Global S3 access	Global S3 access	Global edge network acceleration	Direct AWS backbone access	Global S3 access with Regional endpoints
Pricing model	Hourly service rate and usage	Pay per file operation	Pay-per-request	Pay-per-request and application costs	Pay-per-request with caching savings	Hourly endpoint rate and usage	No additional charge for access points; standard S3 request pricing applies
Content processing	Direct S3 integration	Built-in web interface	Direct S3 access	Custom app processing	Edge-based file processing	Access files through private network	Direct S3 access with customized permissions per access point
Authentication options	Custom IdP and service-managed	IAM Identity Center	IAM	Custom authentication possible	IAM, custom authentication, and edge validation	VPC security controls and custom authentication	IAM policies, VPC endpoint policies, resource-based policies
Upload capabilities	Unlimited file size	Web interface upload	Up to 5 GB direct and multipart for larger	Up to 10 MB using API Gateway	Optimized for global ingestion	Unlimited file size over private connection	Same as standard S3
Download capabilities	Unlimited file size	Browser-based downloads	Up to 5 GB using a single URL	Up to 10 MB using API Gateway	Accelerated downloads using global edge locations	Unlimited file size over private connection	Same as standard S3 with customized access controls
Example use cases	Enterprise file transfer systems; B2B data exchange; Compliance-focused transfers	Browser-based file sharing; Internal document management; Client portals	Simple direct S3 access; Temporary file sharing; Mobile app backend	Custom file sharing systems; Integrated web applications; Enhanced S3 access control	Global content delivery; Media distribution; Web application assets	Private network transfers; Custom protocol support; Secure enterprise data exchange	Simplified data access management at scale; Multi-application access to shared datasets; VPC-restricted data access

The following list gives you a quick overview of the strengths of each solution presented in the two parts of this post.

Transfer Family is the optimal choice for organizations that require legacy file transfer protocols such as SFTP, FTPS, or AS2 and must integrate with existing authentication systems. It’s ideal for scenarios with strict compliance and audit requirements, where operational overhead needs to be minimized. While the solution comes with higher costs because of its managed service nature, it’s often the lowest-friction option to support existing enterprise use cases that depend on these protocols.
Transfer Family web apps suit organizations that need browser-based file sharing without custom development. They integrate with IAM Identity Center for user authentication and use Amazon S3 Access Grants for permission management. The solution works well for internal document sharing, client portals, and scenarios requiring a branded web interface. While limited to web browser access, they provide built-in features like MFA and password management without infrastructure maintenance.
Amazon S3 pre-signed URLs excel in scenarios where simplicity, cost-effectiveness, and temporary access are key requirements. This solution is ideal if you’re seeking a straightforward file sharing mechanism without the need for custom application development or additional infrastructure. This approach shines in environments that require a quick implementation of secure file sharing and cost-effective solutions with minimal overhead.
Serverless web application with S3 presigned URLs best serves scenarios where cost optimization is paramount and the HTTPS protocol meets your requirements. This solution shines in environments that need simple, direct file sharing capabilities with quick implementation timelines. It’s particularly effective for moderate usage patterns where serverless architecture can provide cost benefits. The solution’s simplicity makes it ideal for web applications and scenarios where complex file transfer protocols aren’t necessary, though careful consideration must be given to its 10 MB file size limitation for single operations using API Gateway.

In Part 2:

CloudFront signed URLs excel in situations that demand global content distribution with high performance requirements. This solution is the clear choice when your architecture needs built-in DDoS protection and performance optimization through caching. It’s particularly valuable when content delivery speed is crucial and you require security at edge locations. The solution’s global reach and caching capabilities make it cost-effective for large-scale content distribution, though it’s primarily optimized for download scenarios rather than uploads.
Amazon VPC endpoint service is the preferred choice if you require complete network isolation and maximum security. This solution is ideal when you need support for custom protocols while maintaining private network connectivity. It’s particularly suitable for scenarios with extremely high security requirements and when you have the necessary resources to manage networking configurations. While this solution requires significant expertise and investment, it provides the highest level of security and control for sensitive data transfers.
S3 Access Points are best suited for scenarios that require simplified data access management at scale. This solution excels when you need to provide different access patterns to the same underlying data for multiple applications or user groups. It’s ideal if you prefer a structured approach to permissions and need network-level access controls. While primarily focused on simplifying complex access scenarios without modifying bucket policies, it offers unique capabilities for VPC-restricted access and granular permissions management, though subject to certain service limits and configuration requirements.

Conclusion

In this first part of a two-part post, you’ve learned about multiple solutions for secure file sharing using AWS services and the pros and cons of each. You can find additional options in Part 2. The optimal solution depends on your specific organizational requirements, technical capabilities, and budget constraints. You don’t have to choose just one option; you can implement multiple solutions to address different use cases, creating a file sharing strategy that balances security, cost, and operational efficiency.


Backblaze Live Read: The Game Changer for Live Media Cloud Workflows

2024-06-06 Elton Carneiro

Post Syndicated from Elton Carneiro original https://backblaze.com/blog/announcing-b2-live-read/

A decorative image with the title Live Read.

Every sports fan knows that when something incredible happens on the field/ice/court, we want to see the replay right now. But many of us don’t know the impressive efforts that live media teams undertake to deliver clips in real time to all of us on whatever viewing platform we might prefer. Today, Backblaze is excited to make the work of live media production (and the end results) a lot easier with our latest innovation.

Announcing Backblaze B2 Live Read

Backblaze B2 Live Read is a patent-pending service that gives media production teams working on live events the ability to access, edit, and transform media content while it is being uploaded into Backblaze B2 Cloud Storage. This means that teams can start working on content far faster than they could before, without having to drastically change their workflows and tools, massively speeding up their time to engagement and revenue.

This is a game changer for live media teams, who are passionate about bringing content to their audience as soon as possible. It means they don’t need to worry as screen resolutions continue to expand, ranging from 4K to 8K and beyond. It also reduces the need for having production teams on-site to minimize latency, which could be extremely costly depending on the venue.

Previously, producers had to wait hours or days before they could access uploaded data, or they had to rely on cost-prohibitive and complicated options that often required on-premises storage. That’s no longer necessary. This innovation will make it faster and less expensive to:

Create near real-time highlight clips for news segments, in-app replays, and much more.
Tap into talent where they are versus trying to find local talent to produce events.
Promote content for on-demand sales within minutes of presentations at live events.
Distribute teasers for buzz on social media before talent has even left the venue.

“For our customers, turnaround time is essential, and Live Read promises to speed up workflows and operations for producers across the industry. We’re incredibly excited to offer this innovative feature to boost performance and accelerate our customers’ business engagements.”

Richard Andes, VP, Product Management, Telestream

Coming soon inside your favorite tools

We designed Live Read to be easily accessible directly via the Backblaze S3 Compatible API and/or seamlessly within the user interface of launch partners including Telestream, Glookast, and Mimir. These platforms, along with CineDeck, Alteon, Hedge, Hiscale, MoovIT, and many others to come, are enabling Live Read within their platforms soon.

If you want to use Live Read, you can join our private preview.

How does it work?

Previously, media teams were forced to either wait for uploads to complete or use on-premises storage. Now, Live Read uniquely supports accessing parts of each growing file or growing object as it is uploaded so there’s no need to wait for the full file upload to complete. And, when the full upload is complete, it’s accessible like any other file in a Backblaze B2 Cloud Storage Bucket, with no middleware or proprietary software needed.

Here’s a short video showing both how Live Read works on a conceptual level, as well as a live demo showing how one app can upload video data to Backblaze B2 using Live Read while a second app reads the uploaded video data:

For those of you who want to dig deeper into the code samples you saw in the video, here is some example code that uses the Amazon SDK for Python, Boto3, to start uploading data with Live Read. If you’re familiar with Amazon S3, you’ll recognize that this is a standard multipart upload apart from the add_custom_header handler function and the call to register it with Boto3’s event system:

import boto3


def add_custom_header(params, **_kwargs):
    """
    Add the Live Read custom headers to the outgoing request.
    See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
    """
    params['headers']['x-backblaze-live-read-enabled'] = 'true'

client = boto3.client('s3')
client.meta.events.register('before-call.s3.CreateMultipartUpload', add_custom_header)

response = client.create_multipart_upload(Bucket='my-video-files', Key='liveread.mp4')

upload_id = response['UploadId']

# Now upload data as usual with repeated calls to client.upload_part()

As it processes the call to create_multipart_upload(), Boto3 calls the add_custom_header() handler function, which adds a custom HTTP header, x-backblaze-live-read-enabled, with the value true, to the S3 API request. The custom HTTP header signals to Backblaze B2 that this is a Live Read upload. As with standard multipart uploads, the data is uploaded in parts between 5MB and 5GB in size. To facilitate reading data efficiently, all parts except the last one must have the same size.
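The part upload step can be sketched as follows. This is a simplified illustration that assumes the whole payload is already in memory, whereas a live workflow would read from a growing stream. The helper names are hypothetical; `upload_part` and `complete_multipart_upload` are standard Boto3 client methods, and `client` is expected to be configured as in the example above.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5MB minimum part size noted above


def split_into_parts(data, part_size):
    """Yield (part_number, chunk); every chunk but the last has part_size bytes."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5MB minimum")
    for number, start in enumerate(range(0, len(data), part_size), start=1):
        yield number, data[start:start + part_size]


def upload_parts(client, bucket, key, upload_id, data, part_size=MIN_PART_SIZE):
    """Upload data part by part, then complete the multipart upload."""
    parts = []
    for part_number, chunk in split_into_parts(data, part_size):
        response = client.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id,
            PartNumber=part_number, Body=chunk,
        )
        # S3-compatible APIs need each part's ETag to complete the upload.
        parts.append({"PartNumber": part_number, "ETag": response["ETag"]})
    client.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
    return parts
```

Slicing at a fixed `part_size` is what keeps all parts except the last the same size, as Live Read requires.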

Since this is a Live Read upload, as soon as a part is uploaded, it is accessible for downloading.

An app that downloads the file needs to send the same custom HTTP header when it retrieves data. For example:

import boto3


def add_custom_header(params, **_kwargs):
    """
    Add the Live Read custom headers to the outgoing request.
    See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
    """
    params['headers']['x-backblaze-live-read-enabled'] = 'true'

client = boto3.client('s3')
client.meta.events.register('before-call.s3.GetObject', add_custom_header)

# Read the first 1 KiB of the file
response = client.get_object(
    Bucket='my-video-files',
    Key='liveread.mp4',
    Range='bytes=0-1023'
)

Note that you must supply either Range or PartNumber to specify a portion of the file when you download data using Live Read. If you request a range or part that does not exist, then Backblaze B2 responds with a 416 Range Not Satisfiable error, just as you might expect. On receiving this error, an app reading the file might repeatedly retry the request, waiting for a short interval after each unsuccessful request.
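The retry behavior described above might look like the following sketch. The function names, retry count, and delay are hypothetical. It assumes `client` already has the Live Read header handler registered for `GetObject`, and that SDK errors expose the HTTP status in `response['ResponseMetadata']['HTTPStatusCode']`, as botocore's `ClientError` does.

```python
import time


def is_range_not_satisfiable(error):
    """True if an SDK error carries HTTP status 416 in its response metadata."""
    response = getattr(error, "response", None) or {}
    return response.get("ResponseMetadata", {}).get("HTTPStatusCode") == 416


def read_part_when_ready(client, bucket, key, part_number, retries=30, delay=1.0):
    """Poll for a part of a growing object until it exists or retries run out."""
    for _ in range(retries):
        try:
            response = client.get_object(
                Bucket=bucket, Key=key, PartNumber=part_number
            )
            return response["Body"].read()
        except Exception as error:
            # Only a 416 means "not uploaded yet"; anything else is re-raised.
            if not is_range_not_satisfiable(error):
                raise
        time.sleep(delay)
    raise TimeoutError(f"part {part_number} of {key} was not available in time")
```

A reader app could call `read_part_when_ready` with increasing part numbers to follow the upload as it progresses, tuning `delay` to roughly the time the uploader needs to send one part.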

The source code for the applications is available as open source at https://github.com/backblaze-b2-samples/live-read-demo/.

How much does it cost?

Live Read upload capacity is offered in $15/TB increments—and the capacity is only consumed when an upload is marked for Live Read. Standard uploads are free, as usual. After uploading is complete, the data stored in Backblaze B2 is billed as normal. From a cost perspective, this represents significant savings versus the workflows that production teams must currently follow to achieve anything close to the functionality delivered by Live Read.

And it’s not just for live media

Beyond media, the Live Read API can support breakthroughs across development and IT workloads. For example, organizations maintaining large data logs or surveillance footage backups have often had to parse them into hundreds or thousands of small files each day in order to have quick access when needed. With Live Read, they can now move to far more manageable single files per day or hour while preserving the ability to access parts immediately after they are written.

What’s next

For those interested in Live Read, you can sign up for the private preview here. We’ll continue to report as we add more integrations and we’ll share stories as customers succeed with the new feature. Until then, feel free to ask any question you have in the comments below.

Want to see more?

Join Pat Patterson, Chief Technical Evangelist, and Elton Carneiro, Senior Director of Partnerships, on January 26, 2024 at 10:00 a.m. PT to learn more in real time. Can’t make it live? Sign up anyway and we’ll send a recording straight to your inbox.

The post Backblaze Live Read: The Game Changer for Live Media Cloud Workflows appeared first on Backblaze Blog | Cloud Storage & Cloud Backup