Tag Archives: Featured

Disaster Recovery 101: Hot vs. Warm vs. Cold DR Sites

2025-01-14 Kari Rivas

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/disaster-recovery-101-hot-vs-warm-vs-cold-dr-sites/

It goes without saying (but I will say it anyway) that having a disaster recovery (DR) site is essential to protecting business continuity (BC) in the face of disasters both big and small. However, even for large enterprises, building and maintaining a separate physical facility to store data copies can be cost prohibitive, and it may not make sense operationally.

DR sites differ according to the availability of data for retrieval and by type of ownership (e.g., fully owned or colocated). In recent years, public cloud has also emerged as a viable DR “site”—meaning that backups, production data, and/or virtualized infrastructure can be effectively housed in the cloud.

In this blog, I’ll examine the primary differences and pros and cons between various types of DR sites, and I’ll outline the most important criteria for deciding on the right DR setup for your business.

Proprietary ownership vs. colocation

If your business is able to fully invest in owning a DR site, the obvious upsides are greater control over security and infrastructure. But owning and operating your own site may still not be the best option, given the staffing and expertise required. For many businesses, it doesn’t make sense to invest in owning and operating a data center when that isn’t their area of expertise.

That’s why many businesses opt for colocation. It can be a great option for adhering to your DR strategy and your expense limits. However, you must be careful to thoroughly vet the location and provider. Here are a few important points to consider:

Performance: You should understand what kind of equipment is used at the DR site, as well as what kind of durability and availability you can expect. Ensure that the available infrastructure can meet your required recovery time objectives (RTO) and recovery point objectives (RPO)—that is, the maximum amount of downtime your business can withstand and the maximum amount of data your organization can tolerate losing, respectively.
Security: A trustworthy provider should be staffed 24/7/365. Learn how the data center is protected. Are there cameras? Biometric security? How does the data center protect against things like fire and power loss?
Proximity: A data center that’s down the street from your primary location will offer no protection in the case of a regional disaster like wildfire or tornado—events that are unfortunately becoming more and more common. Ideally, you should choose a location that is far from your production facility. This is where the public cloud naturally fits in—but more on that in a bit.
Scalability: Gauge how much data you currently need to store as well as how much you expect to grow in the near future. Find out how much capacity the DR site can support and choose a site that can accommodate your planned growth.
Costs: Get a complete view of your total cost of ownership. This not only includes one-time costs to get started and ongoing monthly or yearly expenses, but also potential costs for things like additional support or any capacity you may need to add in the middle of a contract period.
Compliance: Consider what compliance requirements your business must support. Some data centers are SOC 2 compliant; some are not. It’s also important to check your cyber insurance policy requirements. Many policies may require that you keep data backups in a facility that is far from your own. This is exactly the requirement that brought telco AcenTek to Backblaze.

Meeting cyber insurance requirements with the cloud

In order to satisfy cyber insurance policy requirements, AcenTek’s backups needed to be off-site and geographically distant from their own data centers. Backblaze offered critical features, namely immutability and certification as a Veeam Ready Object partner, as well as geographic distance from AcenTek’s own data centers, to meet the requirements and protect AcenTek’s business.
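A geographic distance requirement like this one can be sanity-checked with a quick great-circle calculation. The sketch below is illustrative only: the function names and the `min_km` threshold are my own assumptions, and the real minimum distance is whatever your insurance policy or DR plan specifies.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Return the great-circle (haversine) distance in km between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_far_enough(primary, dr_site, min_km):
    """True if the DR site is at least min_km away from the primary site.

    primary and dr_site are (latitude, longitude) tuples in degrees.
    min_km is a hypothetical threshold taken from your own policy.
    """
    return great_circle_km(*primary, *dr_site) >= min_km
```

Calling `is_far_enough(primary, dr_site, min_km)` with the coordinates of your production facility and a candidate data center gives a quick yes or no before you dig into the provider’s other qualifications.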

Hot, warm, and cold DR sites: Choosing the right strategy

Recovery sites are often referred to by temperature (hot, warm, cold) to describe how quickly they can be brought online and how critical the applications and data they protect are. The ideal DR site temperature depends on your organization’s budget, risk tolerance, and RTOs. Businesses with critical systems requiring near-instantaneous recovery might opt for a hot site. Others might find a warm site or even a cold site a more cost-effective option for less time-sensitive systems.

Hot, warm, and cold: Choosing the right DR site temperature

	Hot site	Warm site	Cold site
Description	A fully functional replica of your primary production resources, constantly maintained and ready for immediate failover in the cloud or to a secondary on-premises site.	A pre-configured cloud recovery site or hybrid recovery with hardware and software infrastructure. Requires some manual intervention (e.g., software installation) before becoming operational.	A basic physical facility with essential infrastructure (power, cooling, and network connectivity) requiring significant configuration and installation before use. May also include cold cloud storage.
Pros	Fastest recovery times due to the site’s constant readiness.	A balance between cost and recovery time. Faster than cold sites, but slower than hot sites.	Most cost-effective option, requiring minimal ongoing maintenance.
Cons	This is the most expensive option due to the need for complete infrastructure replication.	Still requires some manual setup, potentially delaying recovery time.	Longest recovery times due to the extensive configuration and installation needed. Or, in the case of cold cloud storage—the time required to retrieve your data.
Example RTO goal times	RTO <15 minutes	RTO <24 hours	RTO >24 hours
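As a rough illustration of how the example RTO goals in the table map to a site temperature, here is a minimal sketch. The function name is hypothetical, and the 15 minute and 24 hour cutoffs are simply the example values from the table, not fixed industry thresholds.

```python
from datetime import timedelta

# Example cutoffs taken from the table above; adjust them to your own plan.
HOT_RTO_LIMIT = timedelta(minutes=15)
WARM_RTO_LIMIT = timedelta(hours=24)


def suggest_site_tier(rto: timedelta) -> str:
    """Suggest the least expensive site temperature that fits a required RTO."""
    if rto < HOT_RTO_LIMIT:
        return "hot"
    if rto < WARM_RTO_LIMIT:
        return "warm"
    return "cold"
```

For example, a system that must be back within four hours would land on a warm site, while an archive that can wait several days could sit in a cold site.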

Public cloud as virtual DR site

Traditionally, DR for large enterprises would involve building a physical site to support RTOs. It’s important to note that building or buying a dedicated DR site might not be the most cost-effective option for all backups. Instead, cloud storage offers a compelling solution specifically for backups, even if you have your own physical DR site.

Why Backblaze works for DR

Cloud storage from a specialized provider like Backblaze is generally more affordable and scalable than on-premises storage solutions or off-site DR facilities, making it a great fit for this purpose. Backblaze offers always-hot storage with free egress up to three times your average monthly data stored, meaning data can be immediately recovered when needed without surprise egress bills. In this way, Backblaze B2 Cloud Storage constitutes a virtualized hot DR site.

Cold cloud storage considerations

While some consider cold cloud storage to be the most cost-effective solution, the cost savings of cold storage are often entirely negated by its long retrieval time and egress charges, so much so that it is no longer a viable disaster recovery option.
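To see how retrieval delay eats into an RTO, a back-of-the-envelope estimate helps. The sketch below is an assumption-laden model, not a benchmark: the function names, the 80% link efficiency default, and the idea of a fixed retrieval delay before the download starts are all simplifications I am introducing for illustration.

```python
def restore_hours(data_tb, link_gbps, retrieval_delay_hours=0.0, efficiency=0.8):
    """Estimate hours needed to restore data_tb terabytes over a network link.

    retrieval_delay_hours models the wait before cold data becomes readable.
    efficiency is the assumed fraction of link speed you actually achieve.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    data_bits = data_tb * 1e12 * 8            # decimal terabytes to bits
    effective_bps = link_gbps * 1e9 * efficiency
    transfer_hours = data_bits / effective_bps / 3600
    return retrieval_delay_hours + transfer_hours


def meets_rto(data_tb, link_gbps, rto_hours, retrieval_delay_hours=0.0):
    """True if the estimated restore time fits inside the RTO."""
    return restore_hours(data_tb, link_gbps, retrieval_delay_hours) <= rto_hours
```

Under these assumptions, 1 TB over a 1 Gbps link takes a little under three hours from hot storage, so a four hour RTO is achievable. Add a twelve hour retrieval delay and the same restore misses that RTO by a wide margin.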

Evaluating cloud storage providers

In a way, you can consider the public cloud very similarly to a colocated DR site. All the same questions apply when choosing between cloud storage providers (CSPs):

Performance: What durability, reliability, and availability does the CSP offer? What kind of throughput do you get on a proof of concept?
Security: Does the CSP staff their data centers 24/7/365? What security processes and procedures are in place?
Proximity: Where are the CSP’s data centers located? Choose one that offers good geographic separation from your production facility while ensuring you can still meet your RTO with latency considered.
Scalability: Cloud storage naturally offers effectively unlimited scalability, but it’s vitally important to ask your CSP how they handle things like capacity overages or the need to purchase additional capacity. Some CSPs will charge you excessive fees when you go over capacity, or they may require you to switch to a different pricing model if you need additional storage space in the middle of a contract period.
Costs: Again, you need a complete view of your TCO. Watch out for things like minimum retention periods, egress charges, and other hidden fees.
Compliance: Be careful of CSPs that claim they’re SOC 2 compliant. Sometimes the CSP operates in SOC 2 compliant data centers but the company is not SOC 2 compliant itself. That difference may be meaningful to your company or your own compliance requirements.
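The cost questions above (egress charges, minimum retention periods) are easier to compare when you put them into a simple model. The following is a minimal sketch; the `StoragePlan` fields and both functions are hypothetical names of my own, and any prices you plug in should come from the provider’s actual price list.

```python
from dataclasses import dataclass


@dataclass
class StoragePlan:
    name: str
    price_per_tb_month: float        # storage price per TB per month
    egress_per_tb: float             # charge per TB downloaded
    free_egress_ratio: float = 0.0   # free egress as a multiple of stored TB
    min_retention_months: int = 0    # minimum billing period per object


def monthly_cost(plan, stored_tb, egress_tb):
    """Storage plus billable egress for one month."""
    storage = stored_tb * plan.price_per_tb_month
    free_egress_tb = plan.free_egress_ratio * stored_tb
    billable_egress_tb = max(0.0, egress_tb - free_egress_tb)
    return storage + billable_egress_tb * plan.egress_per_tb


def early_delete_fee(plan, deleted_tb, months_kept):
    """Fee for deleting data before the minimum retention period ends."""
    months_owed = max(0, plan.min_retention_months - months_kept)
    return deleted_tb * plan.price_per_tb_month * months_owed
```

Running both functions for each CSP on your shortlist, using your expected stored volume and a realistic full-restore egress figure, shows quickly whether a lower headline storage price survives a real recovery scenario.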

Ultimately, you must carefully balance business requirements for RTO and RPO with DR investment costs. Businesses located in likely disaster areas like tornado alley, earthquake-prone zones, or coastal areas are well served by the additional investment in DR infrastructure. But even if your company has its own DR site, public cloud can be a beneficial supplement to your own DR infrastructure.

The post Disaster Recovery 101: Hot vs. Warm vs. Cold DR Sites appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

AI for Enterprise: Getting Started

2025-01-09 Stephanie Doyle

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-for-enterprise-getting-started/

AI is here to stay, and the question on everyone’s mind is how to implement it successfully. If you’re ready to implement AI in your business, consider this article a good jumping off point. I’ll talk about different options for integrating it into your operations and how to make it truly custom, based on your own data, and useful for your business.

How many companies use AI today?

How many businesses are using AI, you ask? Well, let’s ask Google. According to their AI overview (yes, we appreciate the irony), anywhere between 55% and 83% of companies are using or exploring AI in some way.

A screenshot of the Google AI overview that results from the query "how many businesses use AI"?

It’s not lost on me that the above results illustrate some of the big limitations of AI—namely that it’s only as good as the data it’s trained on, it’s far from infallible, and it can’t replace humans wholesale especially when someone needs to fact check those results. Google’s AI overviews have been criticized for providing inaccurate information, hallucinating (with sometimes hilarious results), providing a neat answer to complicated questions, providing information from unreliable sources, potential for bias, and so on. Nevertheless, the feature has had several updates since it was first released (which at least means it’s no longer telling us to put glue on pizza).

But, setting all that aside, this is actually a great example to consider before we dig into options for incorporating AI into your business. AI Overviews have improved enough—for example, by adding things like source transparency—that we can easily add enough human oversight to consider the above directionally accurate. The landscape of technology is changing, and, ready or not, businesses are being forced to figure out how AI should fit into their strategies.

What we’ll talk about today

Today we’ll talk about some foundational topics you need to understand when deciding how to incorporate AI into your business. We’ll define the following:

Software as a service (SaaS) AI add-ons
AI as a service (AIaaS)
Foundation models
Retrieval augmented generation (RAG)

Those definitions will lead us quickly to some practical examples that illustrate how businesses are using AI.

Software as a service (SaaS) applications, aka, AI as a feature

You may have noticed that many of the web-based applications you are using are suddenly AI-powered or have AI capabilities. While some of that is marketing hype, this could be a way to get started with AI in your organization—by simply turning on a feature in a SaaS product you’re already using. There are lots of ways to do this—Slack, for example, offers AI tools for summarizing and answering questions to help teams work faster.

Example AI use case: AI in customer support

Generative AI capabilities such as chatbots are often added to customer-facing applications like your customer support service. The chatbot is trained using your product support materials or actual questions your staff previously answered.

By providing a cache of human-based questions and answers, the chatbot can be trained to respond in your unique company voice.

A screenshot of the Backblaze chatbot live on www.backblaze.com. — Oh hey, there’s ours!

Before you activate and use a built-in AI feature of an existing service, you’ll want to determine how you can measure any changes in overall productivity and user satisfaction. In the customer service example above, that could be capturing metrics such as a customer satisfaction rating, time to first contact, time-to-resolution, escalation ratio, and so on. Then establish a baseline for the existing system before engaging the AI assistant and set specific points where you will compare that baseline to the AI-powered system.
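A baseline comparison like this can be as simple as a percent change per metric. Here is a minimal sketch; the function names and the metric keys are hypothetical, and you would substitute whatever your support platform actually reports.

```python
def percent_change(baseline, current):
    """Percent change from baseline to current (positive means an increase)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100


def compare_to_baseline(baseline, current):
    """Return the percent change for every metric present in both snapshots."""
    return {
        name: round(percent_change(baseline[name], current[name]), 1)
        for name in baseline
        if name in current
    }
```

Keep in mind that direction matters per metric: a rising satisfaction score is good news, while a rising escalation ratio is not, so review each change in context rather than summing them up.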

Using an AI-powered service has many benefits, but there are a number of considerations:

You are limited in functionality by what the vendor provides.
What is the expertise of the software vendor in developing, training, and implementing an AI model?
What happens when the model data changes? For example, you’ve employed AI to respond to customer queries. What happens when you add a new product to your lineup or a new feature to an existing product? Is the model retrained? What are the costs? Does it still make economic sense given any new cost?
During the model creation and operational phases, ancillary files such as checkpoints, prompts, responses, and so on are created. Do you have visibility into these files and what analysis can you perform?
Given these ancillary files are derived in part from your original data, can you download these files to your central repository or is the data locked in the vendor’s application?

Artificial intelligence as a service (AIaaS)

AIaaS is one of the many areas of AI where definitions and capabilities are a moving target. That said, we’ll offer that AIaaS is an outsourced service in which a cloud-based company gives other organizations access to different AI models, algorithms, and other resources directly through the vendor’s cloud computing platform via a user interface (UI), API, or SDK connection. The aim is a user-friendly interface that makes training and deploying AI models accessible to non-AI experts.

AIaaS is worth considering if you’re interested in working with artificial intelligence but you don’t have the in-house resources or expertise to build and manage your own AI technology. There is a broad range of solutions offered in this space, which vary by the services provided. Let’s categorize the services as follows.

Walled gardens:
- What they offer: In my experience, AIaaS providers in this group usually host most or all of the model training data, checkpoints, inferences, and prompts.
- Pros and cons: This is the most straightforward option, but in practice, this method can be cost prohibitive and lacks transparency. There are few if any options to reduce the cost or economically transfer the model, its work products, or its data elsewhere.
- Who are they: The obvious ones that come to mind for me are companies like AWS, Google, and IBM Watson.
Mix-and-match:
- What they offer: Solutions in this group vary by the services they provide as well as add-on options and support services. They typically provide hosting services which are used to train, deploy, and use the model. They can also provide data analysis and cleansing for the model input, model testing, engineering support, and general support services as you might require.
- Pros and cons: As with the walled garden approach, once data is ingested or ancillary data is created within the system, it may be difficult to access and, if available, expensive to retrieve. Often, these providers are also companies that offer specialized services: for instance, companies that solve a type of problem, like a computer vision specialist vs. a natural language processing specialist, or, alternatively, a company that focuses on AI in IT operations, call center operations, cybersecurity, etc.
- Who are they: This group includes companies like Twelve Labs, Proofpoint, or Amplify. Note that there’s a bit of a porous line between some of the providers in this category and the following—think of it like a gradient.
Open cloud:
- What they offer: Providers in this group offer a variety of tools and services that, when combined, allow an organization to construct, test, operate, and maintain an AI-based solution.
- Pros and cons: The open cloud approach allows you to select the best-of-breed providers for the various stages of your AI project. It also allows you to have control over the model and its byproducts, such as checkpoint data, inferences, and prompts, which are key to ensuring the model is performing as expected. In summary, while your level of effort for this approach will be higher, you will have more control over your model and, more importantly, over your data.
- Who are they: This includes platforms like Hugging Face and vendors like OpenAI of ChatGPT fame. Hugging Face is intentionally open source, whereas OpenAI is under pressure to monetize models—one of the bigger evolving conversations in the AI landscape. Today, anyone can purchase an API access subscription from OpenAI to access the GPT-4 Chat from their application. Such subscriptions offer quick access to organizations that want a mature model but aren’t able to or interested in building one themselves.

The AIaaS approach is a good choice for organizations that lack expertise in building and operating AI systems. The approach you take, walled garden, mix-and-match, or open cloud, will affect how much access and flexibility you have with the data used and produced by the system. This may not be of interest today, but as your organization becomes more AI savvy, being able to access and share the data within the system could become important.

Foundation models

The term “foundation model” originated with the Stanford Institute for Human-Centered Artificial Intelligence’s (HAI) Center for Research on Foundation Models (CRFM) which defines it as “any model that is trained on broad data that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks.” Most, but not all, foundation models are generative AI in form and perform tasks such as language processing, visual comprehension, code generation, and human-centered engagement.

Although foundation models are pre-trained, their behavior can still be shaped at inference time by the prompts they receive. An organization can develop tailored outputs using techniques such as prompt engineering, fine-tuning, and pipeline engineering. For example, prompt engineering means giving the model a series of carefully curated prompts, refined over time, so that it returns more precise answers related to the subject matter of the prompts. This makes the model’s output less generic and more specific to your organization.

When using a foundation model, you will need to capture and store all data used to fine-tune the model, for example the prompts and responses used for the prompt engineering process. This will allow you to analyze how the inference process is shifting over time.
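One lightweight way to capture prompts and responses for later analysis is an append-only JSON Lines file. The sketch below is only one possible approach; the function names, the record fields, and the file layout are my own assumptions rather than a requirement of any particular model or vendor.

```python
import json
from datetime import datetime, timezone


def log_interaction(path, prompt, response, model_name):
    """Append one prompt/response pair to a JSON Lines file and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_interactions(path):
    """Read every logged interaction back as a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is a self-contained record, the file can be synced to a central storage bucket as-is and loaded later to compare how responses to the same prompts shift over time.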

Utilizing a foundation model as a starting point is a good choice, but techniques such as prompt engineering are far from being an exact science. Often such training can exacerbate a subtle bias in the existing model or introduce a new bias. This is especially true if the model is public facing.

Retrieval augmented generation (RAG)

Retrieval augmented generation (RAG) is a relatively new technique that allows AI models to link to external sources. In most cases, the model is a generative AI model, such as a large language model (LLM). By using RAG techniques, external resources, often rich in technical content, can be retrieved during inference and used to inform the response to the user. One commonly cited example is having medical journals indexed via this technique so their content is consulted when the model is generating a response. The same could be done with financial data, legal case law, and so on.

RAG works by adding a retrieval layer around the original generative AI model. That layer continuously reviews defined external resources and converts them into machine-readable indices (vector databases) so they are available for inference. This means the core generative model does not have to be retrained; instead, it can use new or updated sources on the fly. This allows you to use your data to make the model your own and lets you update the data sources to keep the model current.
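To make the index-then-retrieve idea concrete, here is a toy sketch in plain Python. It is deliberately simplified and makes several assumptions: the "embedding" is just a bag-of-words count rather than a learned vector, the index is an in-memory list rather than a vector database, and all function names are hypothetical. The call to the generative model itself is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(documents):
    """Pair each document with its vector. Stands in for a vector database."""
    return [(doc, embed(doc)) for doc in documents]


def retrieve(index, question, k=2):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(index, question, k=2):
    """Combine retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {doc}" for doc in retrieve(index, question, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Updating the knowledge the model can draw on is then a matter of rebuilding the index from new documents; the prompt returned by `build_prompt` would be passed to whichever generative model you use, which stays unchanged.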

This technique is extremely powerful, but it does require you to store the original model, the testing or validation data used, the external resources you are using to augment the model, their vector databases, and any prompts and inferred responses. Given the tools and utilities you will use to monitor and analyze how your RAG-infused AI model is performing, a central cloud storage repository is a good choice for storing this data.

It’s all about the data—Your data

AI, at least in its current form, is not deus ex machina. Yes, ChatGPT and its ilk can create wonderful stories of fact or fiction and amazing, never before seen imagery, but without your data, they are marvelously generic. In other words, you and more precisely your data are the key to the value your organization will achieve in using AI.

As we have seen, there are a multitude of options. On one hand, we can hand off our data to a company, pay them handsomely, and let them build and run our AI models (the walled garden approach). While this is enticing, the reality is that AI is still a moving target with few rules and regulations in place. Your visibility into what is happening to your data is limited, as is your ability to do something if there is a problem.

At the other end is the open cloud approach. This allows you to choose the best-of-breed cloud-based applications and cloud compute services to create and run your model. These applications and services can interact freely with your cloud storage platform to leverage your organization’s data while providing you complete visibility and control. Yes, it will require more investment on your part, but given the maturity of AI in general, it makes sense for you to keep a watchful eye on how AI is used in your organization and, more importantly, how well it is performing.

In short, AI requires your data to be truly useful to your organization. AI in its current form is still a young science, one that requires watching to ensure it does what is expected. That’s not paranoia, that’s just good business. To do this you will need unfettered, affordable access to your data, the AI model, and its work products.

The post AI for Enterprise: Getting Started appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Backblaze Adds Canadian Region, Expanding Location Choices and Data Sovereignty Options

2025-01-07 Chris Opat

Post Syndicated from Chris Opat original https://www.backblaze.com/blog/backblaze-adds-canadian-region-expanding-location-choices-and-data-sovereignty-options/

Customers that have data governance, compliance, and performance top of mind have more options for achieving all three with the opening of our new data region, known as Canada East (or CA East). The region is now available for current and future Backblaze customers.

This new region builds on our mission to deliver high-performance, compliant, and cost-effective cloud storage solutions to businesses around the world and further expands our footprint in the North American market.

Meeting the needs of Canadian businesses

Our new CA East region is located in Toronto, Ontario, and has been designed to address the specific needs of Canadian businesses and organizations, many of which are subject to laws and regulations requiring data to be stored within the country. With this expansion, customers are able to ensure compliance with local regulations while taking advantage of a robust cloud solution that prioritizes data sovereignty.

A local region also delivers performance benefits for Canadian customers. By reducing the distance that data needs to travel, Backblaze can offer them lower latency and improved speeds, making it ideal for real-time applications and large-scale data transfers.

Strengthening our partnership with Opti9

In collaboration with Opti9, an international leader in hybrid cloud solutions and a Veeam Cloud Storage Provider (VCSP), this expansion marks a significant opportunity for us to deliver robust managed services to Canadian businesses. Opti9, as the exclusive Canadian channel partner for Backblaze B2 Reserve and the Powered by Backblaze program, is uniquely positioned to bring this enhanced offering to market.

Opti9 and Backblaze share a unified vision of providing Canadian businesses and organizations with cutting-edge cloud solutions that are both affordable and high performing. Cloud data storage is evolving rapidly to meet changing customer needs. We are excited to launch this Canadian storage region in collaboration with Backblaze, which expands our overall cloud storage footprint in Canada. This partnership equips our Canadian partners and end-user organizations with the tools they need to thrive in today’s fast-evolving digital landscape.

—Cory Mac Donell, Vice President of Sales & Business Development, Opti9

Protecting data within borders

Canada’s cloud services market is expanding rapidly, driven by increased demand from industries such as healthcare, finance, and government—all of which often require data to remain within national borders. The new data region gives Canadian and international businesses more choice for storing their data while maintaining data sovereignty.

Competitive edge through open cloud solutions

Multi-cloud and hybrid cloud strategies are becoming ever more common. Businesses increasingly seek open, interoperable solutions that avoid vendor lock-in and allow them to integrate the best services from multiple providers. Our offerings provide the flexibility and control businesses need, while still delivering the security, compliance, data governance, and performance of a local data center. The new region enables companies doing business in Canada to tap into multi-cloud and hybrid cloud strategies as they look to strengthen their cloud infrastructure.

Security and compliance details for the Canadian region

The Toronto data center has been assessed and maintains a security program that addresses the requirements of SOC 1 Type 2, SOC 2 Type 2, ISO 27001, PCI DSS, and HIPAA. These certifications ensure the highest levels of security and compliance for businesses in regulated industries.

Ready to store data in CA East?

The new data region is available to customers now, and you can create an account there by selecting CA East in the region drop-down when creating a Backblaze account. Already storing data with Backblaze and want to keep a Canadian copy? Leverage our Cloud Replication feature and diversify your storage.

We’ll have more stories to tell about bringing up the data center and some of the interesting networking there, so stay tuned to the blog!

The post Backblaze Adds Canadian Region, Expanding Location Choices and Data Sovereignty Options appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Bookblaze: The Third Annual Backblaze Book Guide

2024-12-23 Stephanie Doyle

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/bookblaze-the-third-annual-backblaze-book-guide/

It’s time once again for our annual book guide, where Backblaze authors give you the inside scoop on what they’ve been reading. So, whether the weather outside is frightful, or, like at our home office in San Mateo, weird and drizzly, we hope you enjoy!

Pat Patterson, Chief Technical Evangelist

Never Understood: The Jesus and Mary Chain, by William Reid and Jim Reid

I love a good book about music, and when I saw autographed copies of “Never Understood” on sale at the merchandise stand at the Jesus and Mary Chain’s San Francisco gig earlier this year, I could not walk away without buying one. The book is co-authored by William and Jim Reid, the Scottish brothers who have been the only consistent band members since they started making music in the early ’80s, and alternates between their accounts from early life in a Glasgow tenement through growing up listening to the Velvet Underground, Iggy Pop, and Bowie in the nearby post-war new town of East Kilbride, to realizing that the band each of them wanted to form on their own was actually the same band, and the subsequent rollercoaster ride of recording, touring, breaking up, and getting back together.

There’s a lot of humor amongst the rock and roll excess—one of my favorite moments was the contrasting explanations of how they assigned roles as they were getting started. From William: “It wasn’t like it was Jim’s dream to be the singer—we basically had a big fight about who was gonna sing and he lost.” Jim writes: “We actually tossed a coin for it, but the outcome was the same: William won. I was the singer.” Comedy soon turns to tragedy, however, as Jim explains how he turned to heavy drinking to overcome his shyness of singing on stage, setting the scene for a lifelong battle with alcohol.

Lee Brackstone, the book’s editor, deserves credit for the excellent job he’s done stitching this all together. Even though the viewpoint bounces between the two brothers, it reads as a single narrative. William’s passages are set in a serif font, while Jim’s are sans, so you quickly develop a feel for who you’re reading. It’s a riveting tale, whether you love or hate the band’s music—I envy you listening to their debut album Psychocandy for the first time if you don’t fall into either of those camps—and the brothers’ love/hate relationship brings a poignant dimension to what is already a classic story of early success, record label indifference and shenanigans, figuring out how to play the music you hear in your head, and being shocked that other people actually want to hear it too.

Yev Pusin, Sr. Director, Marketing

Impact Winter, by Travis Beacham

A comet strikes the earth and blocks out the sun. Bad news for people, good news for vampires. If you like the concept of 30 Days of Night and enjoy great world building and storytelling with a bloody twist, this is a fantastic addition to your schedule. Bonus: It’s an audio drama, so perfect for your commute.

Jeremy Milk, Sr. Director, Product Marketing

How Big Things Get Done, by Dan Gardner and Bent Flyvbjerg

I stumbled upon this book right around the time one big thing in my life was proceeding nicely and another was not. Why? This book didn’t give me all the answers—sorry, there are no silver bullets—yet it provided a digestible, pragmatic framework for successfully managing big projects and initiatives, with situational awareness for the psychology of the many stakeholders who will be key to the success. As an impatient person who also likes to plan, I took away new nuance from the authors’ Think Slow, Act Fast model. And, as a student of Eric Ries’ The Lean Startup model, I appreciate the authors of this book adding their own flavor of MVP with the Maximum Virtual Product concept when you simply cannot lean-test something as big as you envision and yet you can develop virtual proxies to test underlying assumptions and elements. Now I’m ready to tackle far more big things.

Nicole Gale, Marketing Operations Manager

The Women, by Kristin Hannah

I love historical fiction and The Women is the first book I’ve read about the Vietnam War. As a big Kristin Hannah fan, I love how she weaves different stories about the historical event into her own. We were immersed into the world of how women were treated in the Vietnam War and I’ll never forget their stories. This one is a must read!

David Johnson, Product Marketing Manager

The cover image for the book The Coming Wave by Mustafa Suleyman.

The Coming Wave: Technology, Power, and the Twenty-First Century’s Greatest Dilemma, by Mustafa Suleyman

I’d suggest “The Coming Wave” by Mustafa Suleyman. It offers an insightful perspective on the evolving world of artificial intelligence and its impact on society. It’s about a year old now, but still great in my opinion.

Bala Krishna Gangisetty, Sr. Product Manager

The cover image for Mindset by Carol Dweck.

Mindset: The New Psychology of Success, by Carol Dweck

This book changed how I see things and perceive challenges or setbacks fundamentally. Growing up, I was wired to strive for perfection in everything I did, and this book shifted my focus from being perfect to continuous improvement. It helped me see opportunities for learning and growth when things don’t go as planned. The best part is that the ideas in this book work for all parts of life, not just work.

Teresa Dodson, Sr. Director, Partner Marketing and Alliances

The cover image for Dare to Lead by Brene Brown.

Dare to Lead: Brave Work. Tough Conversations. Whole Hearts., by Brené Brown

From the official summary: Leadership is not about titles, status, and wielding power. A leader is anyone who takes responsibility for recognizing the potential in people and ideas, and has the courage to develop that potential. Check it out!

Stephanie Doyle, Writer and Content Operations Strategist

The cover image for Skyward by Brandon Sanderson.

The Skyward Trilogy, by Brandon Sanderson

I suppose it’s cheating a bit to recommend a whole series, but the story arc in this series by fantasy heavyweight Brandon Sanderson is great! Full disclosure: I’m hit or miss on Brandon Sanderson’s wider works. (I hate Mistborn and love The Way of Kings. Feel free to get mad at me in the comments.) That said, this series starts with a plucky young heroine on a dystopian planet (don’t worry folks: no love triangle in this one—if you know, you know) and extends into a fascinating view of space travel, AI, and what it means to have a soul.

Happy Reading from Backblaze

We hope this list piques your interest—we may be a tech company, but nothing beats a good, old fashioned book (or audiobook) to help you unwind, disconnect, and lose yourself in someone else’s story for a while.

Any reading recommendations to give us? Let us know in the comments.

The post Bookblaze: The Third Annual Backblaze Book Guide appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

5 Ways Event Notifications Strengthens Your Backup Strategy Automatically

2024-12-19 David Johnson

Post Syndicated from David Johnson original https://www.backblaze.com/blog/5-ways-event-notifications-strengthens-your-backup-strategy-automatically/

A decorative image showing a cloud with diagrammed icons around it.

“Our backups are good, right?”

If you’re responsible for backup operations, you’ve probably heard this question more times than you can count. While the answer should be a simple “yes,” staying on top of backup activities often involves checking multiple systems, reviewing logs, and maintaining manual tracking processes.

Today, I’m sharing five ways you can implement Backblaze Event Notifications into your data protection strategy to keep you and your team informed. If you’re interested in Event Notifications for other use cases, check out our posts for media production and application workflows.

Event Notifications for IT backup: Simplified automation

Event Notifications monitors your B2 Cloud Storage buckets for data changes that you designate—like completed backups, file deletions, or policy violations—and delivers real-time alerts where you want them. These alerts can trigger automated actions in any system that accepts webhooks, from PagerDuty to Zendesk to Slack channels and more.

Think of it as your storage system’s notification service: instead of discovering changes during routine recovery verification checks, you get instant awareness when something happens to the data in your buckets.

What are webhooks?

Webhooks, if you’re not familiar, are a way for applications to communicate with each other by sending data automatically when specific events occur, typically as HTTP POST requests with a JSON payload. What sets Backblaze Event Notifications apart is that it works with any service that accepts webhooks. This means you can integrate backup monitoring into your existing tools and processes, rather than being locked into specific vendors’ ecosystems.
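To make the idea concrete, here is a minimal sketch of the receiving side in Python, using only the standard library. The payload shape is an assumption for illustration: an `events` list whose items carry `eventType`, `bucketName`, `objectName`, and `objectSize`. Check the Event Notifications documentation for the actual schema before relying on any of these names.

```python
import json


def summarize_events(body: bytes) -> list[str]:
    """Turn a JSON webhook body into one human-readable line per event."""
    payload = json.loads(body)
    lines = []
    # "events" and the keys below are assumed field names, not a confirmed schema.
    for event in payload.get("events", []):
        lines.append(
            f'{event.get("eventType", "unknown")}: '
            f'{event.get("bucketName", "?")}/{event.get("objectName", "?")} '
            f'({event.get("objectSize", 0)} bytes)'
        )
    return lines


# A made-up request body, shaped like the assumed schema above.
sample_body = json.dumps(
    {
        "events": [
            {
                "eventType": "b2:ObjectCreated:Upload",
                "bucketName": "backups",
                "objectName": "nightly/db.tar.gz",
                "objectSize": 1048576,
            }
        ]
    }
).encode()

for line in summarize_events(sample_body):
    print(line)
```

In practice a function like this would sit behind whatever HTTP endpoint your chat, ticketing, or paging tool exposes; the parsing step stays the same.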

5 ways to stay in the know with your backup strategy

Here are specific, practical ways you can take advantage of Event Notifications for immediate benefits to your backup and archive workflows.

1. Backup verification and reporting

When your backup software writes files to B2 Cloud Storage, Event Notifications helps verify successful completion of backup jobs. Each time a backup file lands in a bucket, you’ll receive a notification with key details like file size, timestamp, and backup job name. By feeding this data directly into communication tools like Slack, you can maintain comprehensive logs of backup activity without manual checks.

Backup monitoring workflow

Gone are the days of discovering backup issues hours or days later during routine reviews; you’ll know exactly when backups are uploaded. Teams can configure custom alerts for backup size thresholds, receive immediate confirmation of successful backups, and, with the help of Zapier, raise an alert when Event Notifications does not trigger within a specified window, indicating that an expected backup was not uploaded.
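The "no notification arrived" case is the least obvious of these, so here is a rough sketch of the underlying check in Python. It is not a Zapier configuration; the function name and the 26-hour window are hypothetical, and it assumes you record the timestamp of the most recent upload notification somewhere.

```python
from datetime import datetime, timedelta, timezone


def backup_is_overdue(last_upload: datetime, now: datetime, window: timedelta) -> bool:
    """Return True when no upload notification has arrived within the window."""
    return now - last_upload > window


# Hypothetical values: a nightly job with a little slack before alerting.
last_seen = datetime(2024, 12, 18, 2, 0, tzinfo=timezone.utc)
check_time = datetime(2024, 12, 19, 6, 0, tzinfo=timezone.utc)
if backup_is_overdue(last_seen, check_time, timedelta(hours=26)):
    print("No backup upload notification in the last 26 hours")
```

A scheduler would run this check periodically, since by definition no event arrives to trigger it.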

2. Security and compliance monitoring

Event Notifications can help protect your backup data from unauthorized changes. Security teams can establish automated alerts for suspicious activities like mass deletions or modifications. These alerts integrate with your existing security information and event management (SIEM) systems to provide unified threat monitoring.

Security alert workflow

Beyond threat detection, Event Notifications enables preemptive policy enforcement. Teams can configure automatic notifications that guide employees when their actions might conflict with backup policies, such as renaming, moving, or deleting files. For persistent policy conflicts, managers can receive automated escalation alerts to address potential training needs or process gaps. This systematic approach helps maintain backup integrity through education and awareness before issues occur, rather than just detecting violations after the fact.
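As an illustration of what a "mass deletion" alert might look like on the receiving end, the sketch below counts delete events in a sliding time window. The class name and thresholds are hypothetical, and it assumes timestamps arrive in non-decreasing order; a SIEM would normally provide this kind of rule for you.

```python
from collections import deque


class DeletionBurstDetector:
    """Flags when more than max_deletes delete events fall within window_seconds."""

    def __init__(self, max_deletes: int, window_seconds: float):
        self.max_deletes = max_deletes
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one delete event; return True if the burst threshold is exceeded."""
        self._timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_deletes


detector = DeletionBurstDetector(max_deletes=3, window_seconds=60)
alerts = [detector.record(t) for t in (0, 10, 20, 30, 200)]
print(alerts)
```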

3. Storage management automation

Storage management becomes more efficient when Event Notifications feeds activity data directly to your management tools. As files are uploaded to and removed from your buckets over time, Event Notifications provides valuable data that helps you analyze storage utilization trends and backup data growth patterns.

Data usage monitoring workflow

This constant flow of information empowers teams to anticipate capacity needs and optimize resource allocation. Moving from reactive to proactive storage management helps control costs by notifying you when backups become larger on average.
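One simple way to notice that backups are growing is to compare the average size of recent uploads against an earlier baseline. The sketch below assumes you have collected object sizes from notifications into a list; the function name and the three-sample window are illustrative choices, and the sizes are invented.

```python
from statistics import mean


def size_growth_ratio(sizes: list[int], recent: int = 3) -> float:
    """Ratio of the mean of the last `recent` sizes to the mean of the earlier ones."""
    if len(sizes) <= recent:
        raise ValueError("need more samples than the recent window")
    baseline = mean(sizes[:-recent])
    latest = mean(sizes[-recent:])
    return latest / baseline


# Invented sizes (in GB) for seven consecutive backups.
ratio = size_growth_ratio([100, 100, 100, 100, 150, 150, 150])
if ratio > 1.25:
    print(f"Backups are {ratio:.2f}x their earlier average size")
```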

4. Cross-bucket backup monitoring

Organizations using Cloud Replication or managing backups across multiple buckets gain valuable oversight through Event Notifications. This capability tracks file replication between regions and monitors backup activity across your entire footprint, giving you a comprehensive view of your distributed backup strategy. Teams can spot replication delays or issues immediately, rather than waiting for scheduled status checks.

Cloud Replication notification workflow

Understanding how data moves and grows across different locations ensures your distributed backup strategy performs as designed. Event Notifications makes it possible to track successful replications, monitor consistency between primary and replica buckets, and receive immediate alerts about any issues. This visibility is especially valuable for organizations maintaining geographic redundancy or managing complex multi-site backup strategies.

5. Integration with IT workflows

Event Notifications connects seamlessly with existing IT tools and processes through standard webhooks. Backup events can automatically flow into ticketing systems like Jira Service Management, monitoring dashboards like Grafana, or team communication channels like Microsoft Teams and Mattermost. This integration means teams can manage backup operations through familiar tools and processes, without needing to constantly switch between different interfaces or learn new systems.

Data integration workflow

The result is streamlined operations without the need for separate backup monitoring systems, ensuring backup activities receive proper attention within normal IT procedures. Teams can create ServiceNow tickets for failed backups, update Jira boards with backup status, or send notifications to Teams channels—all automatically and in real-time.

Why Event Notifications makes sense for backup teams

Managing backup operations has traditionally meant juggling multiple monitoring tools and hoping you catch issues before they impact recovery capabilities. Event Notifications transforms this approach by providing:

Automated awareness: Replace manual checks with instant visibility into bucket changes.
Enhanced security: Track backup data access and modifications as they happen.
Simplified monitoring: Feed backup activity data directly to your management tools.
Better operations: Free up time to focus on improving backup strategies rather than monitoring them.
Flexible integration: Adapt backup monitoring to fit your existing processes, not the other way around.

How it works with your environment

Unlike traditional backup monitoring solutions that often require specific software for notification handling, Event Notifications works with any service that accepts webhooks. This fundamental difference means you aren’t locked into specific vendors’ ecosystems or forced to use particular monitoring tools.

Event Notifications is designed for reliability with at-least-once delivery, ensuring critical backup events are never missed. This reliability is especially important for teams building automated workflows that require consistency and transparency in their backup monitoring.
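A practical consequence of at-least-once delivery is that the same event can occasionally arrive more than once, so automated workflows are safer when the handler is idempotent. The sketch below is one way to do that; it assumes each event carries a unique identifier under a key named `eventId`, which is an assumption to verify against the documentation, and it keeps seen IDs in memory only for the sake of the example.

```python
class IdempotentHandler:
    """Runs an action once per event ID, ignoring redelivered duplicates."""

    def __init__(self, action):
        self._action = action
        self._seen = set()  # a real system would use durable storage

    def handle(self, event: dict) -> bool:
        event_id = event["eventId"]  # assumed unique-per-event field
        if event_id in self._seen:
            return False  # duplicate delivery, nothing to do
        self._action(event)
        # Mark as seen only after the action succeeds, so failures can be retried.
        self._seen.add(event_id)
        return True


processed = []
handler = IdempotentHandler(processed.append)
first = handler.handle({"eventId": "evt-1", "objectName": "nightly/db.tar.gz"})
second = handler.handle({"eventId": "evt-1", "objectName": "nightly/db.tar.gz"})
```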

The pricing model is straightforward and predictable: Backblaze B2 Reserve customers receive unlimited notifications at no additional cost, while pay-as-you-go customers get 2,500 notifications free each day and pay just $0.004 per 10,000 additional calls. This transparent pricing applies regardless of which services you’re connecting to, enabling teams to build comprehensive backup monitoring without worrying about unpredictable costs.
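For a rough sense of scale, the pay-as-you-go numbers above can be turned into a small calculation. This sketch assumes the overage is prorated linearly rather than rounded up to whole blocks of 10,000, which the post does not specify.

```python
FREE_PER_DAY = 2_500        # free notifications per day (from the post)
PRICE_PER_10K = 0.004       # dollars per 10,000 additional calls (from the post)


def daily_notification_cost(notifications: int) -> float:
    """Estimated daily cost in dollars, assuming linear proration of the overage."""
    billable = max(0, notifications - FREE_PER_DAY)
    return billable / 10_000 * PRICE_PER_10K


for count in (2_500, 102_500):
    print(f"{count} notifications/day: ${daily_notification_cost(count):.4f}")
```

Under that assumption, a bucket generating 102,500 notifications in a day has 100,000 billable calls, or ten blocks of 10,000, which works out to about four cents.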

Ready to automate your backup monitoring?

If you’re working with a Backblaze account manager, Event Notifications are already enabled—just ask them for setup guidance. Other existing customers can contact our Support team to request access.

New to Backblaze? Contact our Sales team to learn how Event Notifications can strengthen your backup operations.

Once enabled, visit the Event Notifications section in your B2 Cloud Storage buckets to configure your alerts. For detailed setup instructions and best practices, check out our Event Notifications documentation.

The post 5 Ways Event Notifications Strengthens Your Backup Strategy Automatically appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Disaster Recovery 101: Navigating Backup and Archive Infrastructure

2024-12-17 Kari Rivas

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/disaster-recovery-101-navigating-backup-and-archive-infrastructure/

An illustration of a city scape with lines travelling up to a cloud representing digital transmission.

Aging infrastructure, strained budgets, and exponential data growth create unique challenges for disaster recovery (DR) planning. When assessing your backup and archive infrastructure, you’re probably balancing data governance, data sovereignty requirements, compliance requirements, and the needs of your end users. Many legacy data storage systems can create gaps in an otherwise airtight DR plan.

Today, I’m talking through how to approach infrastructure decisions for your cyber resilience posture. You have a lot of options. On-premises? Cloud services? Hot? Warm? Cold? What combination works best for your needs? Understanding the nuances can help you sharpen your strategy.

Disaster recovery challenges

1. Relying on on-premises backup and archive infrastructure

Traditionally, businesses have relied heavily on on-premises backup solutions. Robust storage systems hold critical data, often backed up to secondary storage within the same physical location. While this approach offers a sense of control, it also presents vulnerabilities.

On-premises backups are at risk of localized events like loss of power, fire, flooding, or other natural disasters. A geographically separate DR site or other far off-site backup is essential for complete protection and compliance. Without this, the organization risks losing critical data in cases of a regional outage or loss of access.

The shift to public cloud and SaaS options opened the door to more secure and reliable data backup and disaster recovery solutions. By utilizing cloud-based storage and backup services, organizations can ensure that their data is protected in multiple locations, reducing the risk of data loss due to localized disasters. Additionally, cloud-based solutions offer scalability and flexibility, allowing organizations to easily expand their storage capacity as needed.

2. Falling into the replication trap

Many businesses have established alternate data centers as a secondary backup layer. However, these sites frequently only use replication technology. This situation can result in a scenario known as the “replication trap.” There is a risk that data compromised by malware is replicated to the DR site, leading to potential data loss.

Off-site, immutable backups, independent of the primary site’s data, are a key component of a robust DR strategy. In cases of malware attacks or accidental data deletion by users, off-site immutable backups allow for data retrieval from a backup saved prior to the incident and reduce possible interruptions.

3. Underestimating LTO limitations

Despite being viewed as a legacy technology, tape backups continue to be used in many organizations due to their reliability and cost-effectiveness. It is common to store tapes in a separate location to diversify data storage geographically, which helps reduce the impact of local disasters on data access and enhances overall data resilience.

Off-site tape backups may increase recoverability but create challenges with recovery time objectives (RTO) because of the increased time it takes to retrieve data from a separate location and restore it using tape technology. Hardware issues can happen often and unexpectedly. Cloud-based data storage and archiving has gained popularity because of higher availability and cost savings over traditional tape backups.

The cost and time required to operate multiple data centers and meet recovery times should also be considered in the requirements for your production and DR infrastructure. Never underestimate the risk to a successful recovery when facing time-consuming tasks like physical site recovery and data restoration from tape.

4. Leaving cloud-based productivity tools vulnerable

Cloud-based collaboration and communication tools like Google Drive and Microsoft 365 are commonly used by businesses and yet are often left vulnerable to data loss. On their own, these services often do not provide the protection and recovery options that organizations need.

Businesses often find that the responsibility for backing up this data falls on their own IT, as these services typically operate under a shared responsibility model that doesn’t offer comprehensive backup solutions.

To ensure a reliable DR plan that includes cloud services, you should:

Evaluate granular recovery requirements for productivity platforms like Google Workspace and Microsoft 365.
Evaluate adherence to your long-term backup retention policy, keeping in mind the regulations that your business might be subject to.
Determine if data stored in cloud platforms needs to be backed up with immutability due to cyber insurance requirements or other security policies.
Examine best practices for comprehensive, secure data protection for shared cloud drive services and SaaS productivity tools to address the lack of built-in recovery features.
Plan to store true backups of your SaaS data just as you would for any other data. It may seem redundant to back up cloud platforms to the public cloud, but doing so ensures that you have the right point-in-time backups you need and you can recover on your timeline—not Google or Microsoft’s.

Cloud costs will need to factor into decisions for where to store your data. Cloud storage costs should be included as a non-functional requirement to make sure you can achieve your secure recovery goals without sacrificing affordability.

Best practices for cloud-based disaster recovery

Many enterprises rely on cloud-based DR solutions to ensure uninterrupted operations, protect critical data, and maintain customer trust. Unlike traditional DR methods, cloud-based solutions offer scalability, cost-effectiveness, and rapid recovery capabilities. To truly leverage the potential of these systems, it’s important to be aware of some key strategies and considerations to optimize your cloud-based disaster recovery plan, ensuring resilience in the face of unexpected disruptions.

Consider diversifying your cloud portfolio: Using the same cloud service provider for your backups as for your production data may not be necessary, as you don’t need the same level of performance for backup data. You could consider a tiered recovery approach based on the criticality of your applications and data.
Investigate existing tools for cloud compatibility: Many on-premises data protection tools like Synology or QNAP NAS devices also support cloud targets for backup storage. It’s important to match the capabilities of your current backup vendors to your recovery requirements and cloud storage budgets.
Avoid paying for storage you’re not using: Carefully read the fine print when considering cloud storage costs. Hidden fees, minimum retention requirements, and complicated pricing tiers make accurate forecasting difficult and could leave you paying for unused storage just to reach certain discount tiers.
Balance your budget with RTO and RPO targets: Using cloud data storage for production, backups, and archive can lead to some price shock as your environment scales. And moving data to lower cost storage tiers or cold storage may achieve attractive price reductions, but it often comes at the cost of recovery speed and added complexity. Look for a cloud storage provider with transparent pricing that makes it easier to plan your costs.
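When weighing storage tiers against RTO targets, a back-of-the-envelope restore time estimate can help. The sketch below assumes sustained throughput is the only bottleneck and ignores retrieval delays, API overhead, and rehydration time for cold tiers, so treat the result as a lower bound. The function names and example figures are hypothetical.

```python
def restore_hours(data_tb: float, throughput_gbps: float) -> float:
    """Hours to transfer data_tb terabytes at a sustained throughput in gigabits/s."""
    bits = data_tb * 8e12                      # 1 TB = 10^12 bytes = 8 * 10^12 bits
    seconds = bits / (throughput_gbps * 1e9)
    return seconds / 3600


def meets_rto(data_tb: float, throughput_gbps: float, rto_hours: float) -> bool:
    """True if the transfer alone fits inside the recovery time objective."""
    return restore_hours(data_tb, throughput_gbps) <= rto_hours


# Example: 10 TB over a sustained 1 Gbps link against two different RTO targets.
print(f"{restore_hours(10, 1):.1f} hours")
print(meets_rto(10, 1, rto_hours=24), meets_rto(10, 1, rto_hours=8))
```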

Finally, you should weigh your cloud-based options to evaluate platform compatibility, ongoing costs, and whether your CSP locks you in or out of specific ecosystems due to high storage costs, data transfer costs, and proprietary features.

Leveraging cloud-based backup and archive infrastructure

Adopting cloud-based disaster recovery best practices is a key consideration for building a resilient and reliable business infrastructure. By developing a well-structured disaster recovery plan, determining the right mix of storage solutions, and optimizing costs with tiered recovery, businesses can minimize downtime and data loss during unexpected events. A proactive approach not only safeguards your organization’s operations but also strengthens customer trust and competitive advantage. In a world where disruptions are inevitable, being prepared is the key to bouncing back stronger and faster.

The post Disaster Recovery 101: Navigating Backup and Archive Infrastructure appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Effortlessly Managing Unfinished Large File Uploads with B2 Cloud Storage

2024-12-12 Bala Krishna Gangisetty

Post Syndicated from Bala Krishna Gangisetty original https://www.backblaze.com/blog/effortlessly-managing-unfinished-large-file-uploads-with-b2-cloud-storage/

An illustration of a cloud with boxes representing data uploading to the cloud.

Digital clutter isn’t just inefficient, it can be costly. And if cleaning up digital clutter in your business operations is one of your New Year’s resolutions for 2025, this post is for you. We’re talking about managing unfinished large file uploads.

One big culprit of digital clutter when it comes to cloud storage is unfinished large files. Managing unfinished large file uploads can be a complex task. If they are not managed well, they can consume space and incur costs without any benefit.

To address this, we’ve introduced a feature in Backblaze B2 Cloud Storage that automatically cancels unfinished large file uploads, saving you both time and money.

The challenge: Unfinished large file uploads

To upload a large file, you break it into smaller parts. You initiate the start notification. Each part is uploaded in parallel, and once all parts are received, a finish notification is sent. Only after the final step does the file become consumable. Sometimes, things don’t go as planned—network hiccups, API timeouts, or user interruptions can leave large file uploads unfinished. The process then likely restarts and completes successfully, but this leaves you with both a complete file and a partially completed file in your cloud storage instance. These unfinished uploads still take up storage space, leading to unnecessary costs.
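The client-side half of that flow can be sketched in a few lines of Python. This example only splits data into numbered parts and computes a checksum for each; it makes no network calls. Numbering parts from 1 and using SHA-1 per part reflects my understanding of the B2 Native API, and the tiny part size is purely for demonstration, since real services enforce a much larger minimum.

```python
import hashlib


def split_into_parts(data: bytes, part_size: int) -> list[tuple[int, bytes, str]]:
    """Split data into (part_number, chunk, sha1_hex) tuples, numbered from 1."""
    parts = []
    for offset in range(0, len(data), part_size):
        chunk = data[offset:offset + part_size]
        part_number = offset // part_size + 1
        parts.append((part_number, chunk, hashlib.sha1(chunk).hexdigest()))
    return parts


# Start the large file (not shown), then upload each part, possibly in parallel.
parts = split_into_parts(b"abcdefghij", part_size=4)
for number, chunk, checksum in parts:
    print(number, len(chunk), checksum)
# Only after every part is uploaded would the finish call be sent.
# If the process dies before that point, the parts already sent remain stored.
```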

Previously, users had to manually track down and delete these unfinished uploads. That process is error-prone and time-consuming, especially with a large volume of files.

The solution: Canceling unfinished uploads through lifecycle rules

To streamline the process, we’ve added a feature that allows users to automatically cancel these incomplete uploads after a set number of days. By setting lifecycle rules through the B2 Native API, users can now specify how many days an unfinished large file can remain before it’s automatically deleted.

For detailed guidance on configuring this rule, check out our Lifecycle Rules Documentation.
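As a rough idea of what such a rule might look like, the sketch below builds a lifecycle rule as a Python dictionary and prints it as JSON. The field names `fileNamePrefix` and `daysFromStartingToCancelingUnfinishedLargeFiles` are given as I understand them from the B2 documentation; treat them as assumptions and confirm against the Lifecycle Rules documentation before sending anything to the API. The helper function itself is hypothetical.

```python
import json


def unfinished_upload_rule(days: int, prefix: str = "") -> dict:
    """Build a lifecycle rule that cancels unfinished large files after `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        # An empty prefix is intended to match every file in the bucket.
        "fileNamePrefix": prefix,
        # Assumed field name; verify against the official documentation.
        "daysFromStartingToCancelingUnfinishedLargeFiles": days,
    }


rule = unfinished_upload_rule(7)
print(json.dumps({"lifecycleRules": [rule]}, indent=2))
```

The resulting structure would be supplied when creating or updating a bucket; the exact call and any other required fields are left to the documentation.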

Why it matters

This feature is useful in a variety of scenarios:

Network failures: If a network interruption prevents the final completion step, the unfinished upload will no longer remain indefinitely. Instead, it will be automatically cleared after the defined period, ensuring you aren’t paying for useless storage.
User interruptions: If an upload is manually paused or forgotten before completion, lifecycle rules will take care of these fragments, preventing forgotten uploads from lingering in storage.
Script failures: If your script fails or times out during the upload process, any incomplete files won’t go unnoticed. They’ll be cleared as per your rules, ensuring efficient storage management.

Cost-saving benefits

Unfinished uploads can quickly add up, both in storage usage and costs. By automatically canceling incomplete uploads, users can significantly reduce unnecessary expenses, keeping storage budgets under control. This is especially important for businesses with large-scale data transfers, where managing storage efficiency can have a direct impact on the bottom line.

What’s next?

Most users configure lifecycle rules through the console or Backblaze B2 command line tool (CLI), so we introduced this feature for the B2 Native API to address immediate customer needs while also laying the groundwork for integrating it into the B2 Cloud Storage web console. You can now use this feature via the CLI or B2 Native API. We’re working on adding UI support to make configuration even more accessible. Let us know in the comments if you’re looking for access to this feature via a different user interface.

In the meantime, here are a few steps you can take:

Implement lifecycle rules: Set rules that fit your upload behavior. Choose a reasonable timeframe to cancel unfinished large file uploads that balances with your cost-management goals.
Test the feature: Try configuring the lifecycle rule for a few test uploads to make sure it behaves as expected. Monitor how it handles interruptions or failures to ensure it aligns with your needs.
Monitor storage costs: Check your storage usage and billing before and after setting these rules to understand the impact on costs. Use the feedback to fine-tune your settings.
Stay tuned for UI updates: Keep an eye out for announcements regarding UI support for this feature. We’re committed to making it as intuitive and accessible as possible.

By leveraging lifecycle rules for unfinished large file uploads, you can maintain a cleaner, more efficient storage environment while saving money. For more details on configuring lifecycle rules, visit our API documentation.

The post Effortlessly Managing Unfinished Large File Uploads with B2 Cloud Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

AI 101: Building and Deploying an AI Model

2024-12-11 Stephanie Doyle

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/ai-101-building-and-deploying-an-ai-model/

A decorative image showing a computer, a cloud, and a building.

Should you build your own AI model? Or use other services to help you accelerate the process?

Once you’ve defined the problem you’re trying to solve and the AI model type that best fits your needs, these are the questions you’re faced with next—where to deploy an AI model and how to go about doing it. In most cases, there is very little reason for you to build, train, and deploy your AI model from scratch, particularly as more and more vendors are stepping in to help companies with all or some of the process. It’s fundamentally complex, takes tons of resources and requires specialized knowledge to do correctly.

Still, you should have a basic understanding of the AI model training and deployment processes, as these learnings will be useful later on as you explore various predefined tools, applications, and services you can use to expedite or enhance your ability to use AI within your organization. That’s what I’m digging into today.

How AI model training works

There are several steps in training an AI model: identifying and gathering the required data, cleansing and assembling that data, training the model, checkpointing, and, finally, model serving, where the model is deployed into the production environment. Here’s an overview of the process.

A diagram that explains the AI model training process.

Let’s take a minute to explore each of the steps in a little more detail.

Step 1: Review

The organizational data needed to help educate your model will either be structured or unstructured. Structured data is found in databases, tables, and so on. Unstructured data is basically everything else. Some unstructured data is easy to process, such as text files, while other data is harder to extract, such as PDFs and images.

In general, the more data you can provide, the better your trained model can be. But remember to include data that is not what you want as well, since this helps the model home in on the right answer when things look similar. Take this example scenario, for instance:

You are monitoring hundreds of thousands of wooded acres to determine if there is a fire on the land. As part of training the model, you need to provide images of the legitimate flora and fauna along with images of fire. But you should also provide images of what is not fire, for example reflections of the sun or moon on a lake, a group of lightning bugs at night, car headlights, and so on.

Step 2: Clean

As the data is collected, it will need to be pre-processed, which involves several techniques such as cleaning the data to handle missing values, removing outliers, scaling features, encoding categorical variables, and splitting the data into training and testing sets. The data needs to be arranged in a manner acceptable to the model itself. This sounds relatively simple, but some studies show that this can take up to 80% of the total model development process time.
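To show what a few of these techniques look like in practice, here is a small sketch in plain Python. It imputes missing values with the mean, scales a numeric feature to the 0 to 1 range, one-hot encodes a category, and splits the result into training and testing sets. Outlier removal is omitted for brevity, the sample rows are invented, and real projects would typically use a data library rather than hand-rolled code like this.

```python
import random
from statistics import mean


def preprocess(rows, test_fraction=0.25, seed=0):
    """rows: list of (numeric_value_or_None, category) tuples."""
    # 1. Handle missing values by imputing the mean of the known values.
    fill = mean(v for v, _ in rows if v is not None)
    values = [fill if v is None else v for v, _ in rows]
    # 2. Scale the numeric feature to the range [0, 1] (min-max scaling).
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
    # 3. One-hot encode the categorical variable.
    categories = sorted({c for _, c in rows})
    encoded = [[1 if c == cat else 0 for cat in categories] for _, c in rows]
    features = [[s] + e for s, e in zip(scaled, encoded)]
    # 4. Shuffle reproducibly, then split into training and testing sets.
    random.Random(seed).shuffle(features)
    cut = int(len(features) * (1 - test_fraction))
    return features[:cut], features[cut:]


sample_rows = [(1.0, "a"), (None, "b"), (3.0, "a"), (5.0, "b")]
train_set, test_set = preprocess(sample_rows)
print(len(train_set), "training rows,", len(test_set), "testing rows")
```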

Step 3: Stage

This is a collection point for all of the clean, ready-to-be-processed data. This data will arrive as it is processed (cleaned), which can occur over several days or even weeks. Having this data on hand will be useful if the model is not generated correctly, or in the future as a starting point to retrain the model.

Typically, large amounts of your data will be cleaned and staged as it is readied to train the AI model. But there are no special storage requirements for this data. It just needs to be readily available to be uploaded to the AI training environment when the time comes.

Step 4: Train

Model training is a resource-intensive process where data is copied from staging to high-performance storage located in close proximity to whatever high-powered processor you’re rocking, usually a graphics processing unit (GPU). The GPUs then run the algorithms developed specifically for training the model, and the data is iteratively read and processed an indeterminate number of times until training is complete. Minimizing the time spent utilizing these expensive, high-powered storage and processing resources is critical in managing the overall cost of building the model. In other words: get in, process, and get out.

Step 5: Checkpoint

During the building of the model, the programming will often create snapshots of the status of the training process. This will include various variables, state changes, and so on. These snapshots are referred to as checkpoints. They initially will be written to local storage within the model training system, and are used to restart the training process from a known good state if something goes wrong.

Once the model training process is complete, checkpoints should be written to the same centralized data storage location as your staged data. The checkpoint data will become part of the documentation of the model and may be used for forensic purposes should the model not behave appropriately once it is deployed.
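The save-and-resume pattern behind checkpointing can be sketched with a toy training loop. Everything here is a stand-in: the "model" is a single number, the state is stored as JSON, and the function names are hypothetical. Real frameworks have their own checkpoint formats, but the control flow is similar.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weights": [0.0]}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=10):
    state = load_checkpoint(path)  # resume from the last known good state
    for step in range(state["step"], total_steps):
        state["weights"][0] += 0.1  # stand-in for a real training update
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "model.ckpt.json")
    train(ckpt, total_steps=25)                 # checkpoints land at steps 10 and 20
    resumed_step = load_checkpoint(ckpt)["step"]
    final_state = train(ckpt, total_steps=30)   # picks up from the saved step
```

Copying the final checkpoint files to central storage, as described above, would happen after the loop finishes.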

Step 6: Serve

Once the training process is complete, the model can be exported to your central storage location. This will once again help document the system, and from there the model can then be uploaded to the local or cloud compute environment where it will be used.

At this point you have a clean version of the source data, the checkpoints of the model created, and a copy of the model itself, all stored in your centralized location under your control and readily available should they be needed in the future.

AI model inference

The term inference is derived from the AI model’s perspective. At a high level, when given a prompt, the model infers its response from the trained model and its data. In simple terms, you’ve trained your model to recognize cats, and then you bring it new data (a picture of a family reunion) and ask your model if it sees any cats in the photo (I’m hoping the answer is yes).

In AI, the prompt is viewed as new data which is compared to the model’s existing data to determine a response typically in the form of a decision, prediction, or new content as is the case with generative AI models.

An overview of the inference process is below:

In some AI systems, the inference process flow includes some additional code to help improve your model. These types of filters can have a range of uses and can happen on either the input or the output stage. For example, if you want to filter inappropriate queries or information, you could include something like keyword filtering when data (the prompt) is input. Or, you could introduce a toxicity detection filter on the output side, which reviews responses and prevents harmful or offensive content from being presented to the user.

A perhaps better understood problem that filters like this can address is how to get accurate and up-to-date information out of your queried response. On the input flow side of things, retrieval-augmented generation (RAG) directs a trained model to incorporate and weight more heavily information from trusted sources that the user designates. On the output side, you might add a hallucination prevention filter, which would stop the model from presenting false or misleading information.
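
As a rough illustration of where input and output filters sit in the flow, here is a small Python sketch. Everything in it is hypothetical: the keyword lists, the `stub_model` function, and the placeholder messages are invented for the example, and a simple word match stands in for a real toxicity detector.

```python
import re

BLOCKED_KEYWORDS = {"password", "ssn"}      # hypothetical input blocklist
FLAGGED_OUTPUT_TERMS = {"idiot", "stupid"}  # crude stand-in for a toxicity model


def _words(text: str) -> set:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def input_filter(prompt: str) -> bool:
    """Return True if the prompt may be passed to the model."""
    return not (_words(prompt) & BLOCKED_KEYWORDS)


def output_filter(response: str) -> bool:
    """Return True if the response may be shown to the user."""
    return not (_words(response) & FLAGGED_OUTPUT_TERMS)


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call."""
    return f"Echo: {prompt}"


def infer(prompt: str, model=stub_model, log=None) -> str:
    """Run one prompt through input filter, model, and output filter."""
    if not input_filter(prompt):
        response = "[blocked by input filter]"
    else:
        response = model(prompt)
        if not output_filter(response):
            response = "[withheld by output filter]"
    if log is not None:
        # Keep both sides of the exchange for later review.
        log.append({"prompt": prompt, "response": response})
    return response


history = []
ok = infer("Is there a cat in this photo?", log=history)
blocked = infer("What is the admin password?", log=history)
withheld = infer("Say something", model=lambda p: "You are an idiot", log=history)
```

The model itself never changes here; the filters are ordinary code wrapped around it, which is why they can be added, tuned, or removed without retraining.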

More broadly, you’ll notice that both the prompt and response are saved. It is important to review this information on a periodic basis. This is especially true if the model is public facing, if you are using a model which can change over time such as a foundation model, or if you are using a model which utilizes RAG techniques to include new or external content.

In all of those examples, your model can drift as new information is introduced, and, as we noted above, getting the right information and cleaning it properly is likely the most time-intensive and important stage of this process. Not for nothing is the phrase “knowledge is power” a truism—in the age of AI, knowledge is power and good data is king.

The post AI 101: Building and Deploying an AI Model appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Seamless Data Migration with Custom Upload Timestamps

2024-12-05 Bala Krishna Gangisetty

Post Syndicated from Bala Krishna Gangisetty original https://www.backblaze.com/blog/seamless-data-migration-with-custom-upload-timestamps/

A decorative image showing two cubes, representing data, moving from cloud to cloud. There are clocks above each cube.

Migrating data to the cloud? Ensuring that original timestamps remain intact through a cloud migration can be a critical factor for successful data management at scale. Losing these timestamps can lead to operational challenges that hinder your ability to track data effectively, set proper lifecycle rules, create custom events, and more.

Backblaze B2 Cloud Storage now offers the Custom Upload Timestamps feature to help you manage your data. Today, I’m sharing details on the new feature, benefits, and how to enable it.

What are Custom Upload Timestamps?

The Custom Upload Timestamps feature is designed specifically to retain the original timestamps of your files during a migration. It is especially beneficial for users who rely on lifecycle rules to dictate file deletion or archiving based on age for compliance, who track file age manually, or who need to maintain the historical context of their files.

Imagine this scenario: You have a critical file on another cloud storage provider, governed by a lifecycle rule that deletes it after 1,000 days. If you move the file to Backblaze B2 on day 999, the timestamp would be overwritten and you’d have to restart that lifecycle from day one. However, with this new feature, the original timestamp remains intact, and the file will still get deleted on day 1,000, just as planned. This capability not only simplifies the migration process, but also ensures continuity in your data retention policies, keeping your storage costs in line with expectations.
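
The arithmetic behind that scenario can be sketched in a few lines of Python. The function name and the sample date are assumptions for illustration only; the 1,000-day rule and the day-999 migration come from the example above.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 1000  # the lifecycle rule from the scenario


def days_until_deletion(upload_time: datetime, now: datetime,
                        retention_days: int = RETENTION_DAYS) -> int:
    """Days left before an age-based lifecycle rule deletes the file."""
    age_days = (now - upload_time).days
    return max(retention_days - age_days, 0)


migration_day = datetime(2024, 12, 5, tzinfo=timezone.utc)  # arbitrary sample date
original_upload = migration_day - timedelta(days=999)

# Timestamp preserved: the file keeps its real age of 999 days.
with_preserved_timestamp = days_until_deletion(original_upload, migration_day)

# Timestamp reset: the file looks brand new on the day it is migrated.
with_reset_timestamp = days_until_deletion(migration_day, migration_day)
```

With the original timestamp the file has one day left; with a reset timestamp it would sit in storage for another 1,000 days.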

Benefits of Custom Upload Timestamps

Lifecycle rules play a crucial role in managing data retention, particularly when migrating large datasets. Losing the original timestamps means you’d have to manually reconfigure your rules or wait much longer for lifecycle events to take effect. The benefits of retaining original timestamps extend beyond just lifecycle rules.

Here is why this feature is essential:

Operational efficiency: Knowing the original timestamp of files allows for better organization and tracking. This is vital for businesses that rely on historical data to inform decisions or manage projects. When timestamps reset, it can lead to confusion and disarray in managing files. You may find yourself dealing with files that should have been deleted or archived but aren’t because of the reset timeline.
Compliance: For organizations that must adhere to regulatory standards for data retention, preserving timestamps can help meet legal requirements. It provides a clear audit trail and evidence of when files were created or modified.
Decreased workload: Manually tracking and reconfiguring lifecycle rules consumes valuable time and resources. By retaining the original timestamps, you eliminate unnecessary workloads.
File age tracking: Whether you’re managing backups, archival processes, or simple organizational tasks, knowing the age of a file can inform your decisions regarding when to review or delete files.
Historical context: For projects that span long periods, retaining timestamps helps maintain the context of data. This can be critical for collaborative efforts or projects that require consistent documentation.

Ultimately, the custom upload timestamps feature supports greater data portability, making it easier to move and manage large datasets. It ensures that migration to B2 Cloud Storage is as seamless as possible—without the need to reset or alter your data management policies.

Ready to get started?

The Custom Upload Timestamps feature is enabled by default for all B2 Cloud Storage customers. To utilize this feature, you need to include the X-Bz-Custom-Upload-Timestamp parameter when calling the b2_upload_file API. This simple addition allows you to retain the original timestamp of your file, thereby preserving its lifecycle state without interruptions and ensuring that your data remains organized and easy to track.
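
Here is a hedged sketch of what the request headers for such an upload might look like, written in Python. It only builds the header dictionary and makes no network call. It assumes the timestamp value is a base-10 count of milliseconds since the Unix epoch, so please confirm the exact format in the API documentation. The token and file name are placeholders, and `build_upload_headers` is a hypothetical helper, not part of any Backblaze SDK.

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import quote


def build_upload_headers(upload_auth_token: str, file_name: str, data: bytes,
                         original_time: datetime,
                         content_type: str = "b2/x-auto") -> dict:
    """Assemble headers for a b2_upload_file call that keeps the original timestamp."""
    if original_time.tzinfo is None:
        raise ValueError("original_time must be timezone-aware")
    # Assumption: the header expects milliseconds since 1970-01-01 UTC.
    timestamp_ms = int(original_time.timestamp() * 1000)
    return {
        "Authorization": upload_auth_token,
        "X-Bz-File-Name": quote(file_name),  # file names are percent-encoded
        "Content-Type": content_type,
        "Content-Length": str(len(data)),
        "X-Bz-Content-Sha1": hashlib.sha1(data).hexdigest(),
        "X-Bz-Custom-Upload-Timestamp": str(timestamp_ms),
    }


headers = build_upload_headers(
    "<upload_authorization_token>",   # placeholder, obtained before uploading
    "reports/2022 summary.csv",       # hypothetical file name
    b"hello",
    datetime(2022, 3, 11, tzinfo=timezone.utc),
)
```

In a migration script, `original_time` would come from the metadata reported by the storage provider you are moving away from.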

By retaining the original timestamps, Backblaze B2 helps increase the ease and granularity with which you can manage your data, especially for organizations migrating large volumes of data. You can transition your data while maintaining control over important metadata like the original timestamp, streamlining your operations, improving overall efficiency, and avoiding the stress of potential compliance issues.

What next?

To make the most of the Custom Upload Timestamp feature, consider the following actionable steps:

Review your migration workflow. Before starting the migration, ensure that your processes include the X-Bz-Custom-Upload-Timestamp parameter in your upload scripts or APIs. This will help prevent any disruption in tracking important metadata.
Test the feature. Conduct a pilot migration with a small number of files. This will allow you to confirm that the timestamps are retained correctly. Monitor the behavior of your data tracking after this test migration to ensure everything operates as expected.
Verify lifecycle rules. Once you complete the migration, take the time to check that your lifecycle policies continue to function as intended on B2 Cloud Storage. This verification step is crucial to avoid unexpected data retention issues.
Engage with Support. If you have any questions or encounter challenges, don’t hesitate to reach out to our Support team. We’re here to help you make the most of B2 Cloud Storage.
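
For the pilot migration step, one simple check is to compare the timestamps recorded at the source with the timestamps reported after migration. The Python sketch below assumes you have already collected both sets into dictionaries keyed by file name; the helper name and the sample values are hypothetical.

```python
def find_timestamp_mismatches(source_ms: dict, migrated_ms: dict,
                              tolerance_ms: int = 0) -> dict:
    """Return a description of every file whose timestamp did not survive."""
    problems = {}
    for name, expected in source_ms.items():
        actual = migrated_ms.get(name)
        if actual is None:
            problems[name] = "missing after migration"
        elif abs(actual - expected) > tolerance_ms:
            problems[name] = f"expected {expected}, found {actual}"
    return problems


# Hypothetical pilot data: millisecond timestamps keyed by file name.
source = {"a.txt": 1646956800000, "b.txt": 1646956900000, "c.txt": 1646957000000}
migrated = {"a.txt": 1646956800000, "b.txt": 1733356800000}

mismatches = find_timestamp_mismatches(source, migrated)
```

An empty result means every file in the pilot kept its original timestamp; anything else is worth investigating before you migrate the full dataset.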

For more details, visit our API documentation to ensure you’re ready for a smooth migration. By leveraging the Custom Upload Timestamps feature, you can simplify your data management processes.

The post Seamless Data Migration with Custom Upload Timestamps appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Introducing Buy with AWS: an accelerated procurement experience on AWS Partner sites, powered by AWS Marketplace

2024-12-05 Prasad Rao

Post Syndicated from Prasad Rao original https://aws.amazon.com/blogs/aws/introducing-buy-with-aws-an-accelerated-procurement-experience-on-aws-partner-sites-powered-by-aws-marketplace/

Today, we are announcing Buy with AWS, a new way to discover and purchase solutions available in AWS Marketplace from AWS Partner sites. You can use Buy with AWS to accelerate and streamline your product procurement process on websites outside of Amazon Web Services (AWS). This feature provides you the ability to find, try, and buy solutions from Partner websites using your AWS account.

AWS Marketplace is a curated digital store for you to find, buy, deploy, and manage cloud solutions from Partners. Buy with AWS is another step towards AWS Marketplace making it easy for you to find and procure the right Partner solutions, when and where you need them. You can conveniently find and procure solutions in AWS Marketplace, through integrated AWS service consoles, and now on Partner websites.

Accelerate cloud solution discovery and evaluation

You can now discover solutions from Partners available for purchase through AWS Marketplace as you explore solutions on the web beyond AWS.

Look for products that are “Available in AWS Marketplace” when browsing on Partner sites, then accelerate your evaluation process with fast access to free trials, demo requests, and inquiries for custom pricing.

For example, I want to evaluate Wiz to see how it can help with my cloud security requirements. While browsing the Wiz website, I come across a page where I see “Connect Wiz with Amazon Web Services (AWS)”.

I choose Try with AWS. It asks me to sign in to my AWS account if I’m not signed in already. I’m then presented with a Wiz and AWS co-branded page for me to sign up for the free trial.

The discovery experience that you see will vary depending on type of the Partner website you’re shopping from. Wiz is an example of how Buy with AWS can be implemented by an independent software vendor (ISV). Now, let’s look at an example of an AWS Marketplace Channel Partner, or reseller, who operates a storefront of their own.

I browse to the Bytes storefront with product listings from AWS Marketplace. I have the option to filter and search from the curated product listings, which are available in AWS Marketplace, on the Bytes site.

I choose View Details for Fortinet and see an option to Request Private Offer from AWS.

As you can tell, on a Channel Partner site, you can browse curated product listings available in AWS Marketplace, filter products, and request custom pricing using your AWS account directly from their website.

Streamline product procurement on AWS Partner sites
I had a seamless experience using Buy with AWS to access a free trial for Wiz and browse through the Bytes storefront to request a private offer.

Now I want to try Databricks for one of the applications I’m building. I sign up for a Databricks trial through their website.

I choose Upgrade and see Databricks is available in AWS Marketplace, which gives me the option to Buy with AWS.

I choose Buy with AWS, and after I sign in to my AWS account, I land on a Databricks and AWS Marketplace co-branded procurement page.

I complete the purchase on the co-branded procurement page and continue to set up my Databricks account.

As you can tell, I didn’t have to navigate the challenge of managing procurement processes for multiple vendors. I also didn’t have to speak with a sales representative or onboard a new vendor in my billing system, which would have required multiple approvals and delayed the overall process.

Access centralized billing and benefits through AWS Marketplace
Because Buy with AWS purchases are transacted through and managed in AWS Marketplace, you also benefit from the post-purchase experience of AWS Marketplace, including consolidated AWS billing, centralized subscription management, and access to cost optimization tools.

For example, through the AWS Billing and Cost Management console, I can centrally manage all my AWS purchases, including Buy with AWS purchases, from one dashboard. I can easily access and process invoices for all of my organization’s AWS purchases. I also need to have valid AWS Identity and Access Management (IAM) permissions to manage subscriptions and make a purchase through AWS Marketplace.

AWS Marketplace not only simplifies my billing but also helps in maintaining governance over spending by helping me manage purchasing authority and subscription access for my organization with centralized visibility and controls. I can manage my budget with pricing flexibility, cost transparency, and AWS cost management tools.

Buy with AWS for Partners
Buy with AWS enables Partners who sell or resell products in AWS Marketplace to create new solution discovery and buying experiences for customers on their own websites. By adding call to action (CTA) buttons to their websites such as “Buy with AWS”, “Try free with AWS”, “Request private offer”, and “Request demo”, Partners can help accelerate product evaluation and the path-to-purchase for customers.

By integrating AWS Marketplace APIs, Partners can display products from the AWS Marketplace catalog, allow customers to sort and filter products, and streamline private offers. Partners implementing Buy with AWS can access AWS Marketplace creative and messaging resources for guidance on building their own web experiences. Partners who implement Buy with AWS can access metrics for insights into engagement and conversion performance.

The Buy with AWS onboarding guide in the AWS Marketplace Management Portal details how Partners can get started.

Learn more
Visit the Buy with AWS page to learn more and explore Partner sites that offer Buy with AWS.

To learn more about selling or reselling products using Buy with AWS on your website, visit:

– Prasad

Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes

2024-12-04 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/accelerate-foundation-model-training-and-fine-tuning-with-new-amazon-sagemaker-hyperpod-recipes/

Today, we’re announcing the general availability of Amazon SageMaker HyperPod recipes to help data scientists and developers of all skill sets to get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. They can now access optimized recipes for training and fine-tuning popular publicly available FMs such as Llama 3.1 405B, Llama 3.2 90B, or Mixtral 8x22B.

At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce time to train FMs by up to 40 percent and scale across more than a thousand compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the required accelerated compute resources for training, create the most optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.

SageMaker HyperPod recipes include a training stack tested by AWS, removing tedious work experimenting with different model configurations, eliminating weeks of iterative evaluation and testing. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

With a simple recipe change, you can seamlessly switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can easily run workloads in production on SageMaker HyperPod or SageMaker training jobs.

SageMaker HyperPod recipes in action
To get started, visit the SageMaker HyperPod recipes GitHub repository to browse training recipes for popular publicly available FMs.

You only need to edit straightforward recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.

You need to edit the recipe config.yaml file to specify the model and cluster type after cloning the repository.

$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collection
$ vim config.yaml

The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set up a cluster type (Slurm orchestrator), a model name (Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, and logs.

defaults:
- cluster: slurm # support: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instances
base_results_dir: # Location(s) to store the results, checkpoints, logs etc.

You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging to monitor experiments through TensorBoard.

run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld 
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True 
distributed_backend: smddp # optimized collectives
# Start training from pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # other model-specific params
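
To get a feel for what these settings add up to, the following Python sketch mirrors a few of the values above in a plain dictionary and derives two numbers from them. It rests on two assumptions that are not stated in the recipe itself: that `devices` counts accelerators per node, and that `time_limit` uses the Slurm-style days-hours:minutes:seconds format. The helper names are made up for the example.

```python
from datetime import timedelta

# A hand-copied subset of the recipe values shown above.
recipe = {
    "run": {"name": "llama-405b", "time_limit": "6-00:00:00"},
    "trainer": {"devices": 8, "num_nodes": 2, "accelerator": "gpu"},
}


def total_accelerators(trainer: dict) -> int:
    """Accelerators in the job, assuming `devices` is a per-node count."""
    return trainer["devices"] * trainer["num_nodes"]


def parse_time_limit(value: str) -> timedelta:
    """Parse 'D-HH:MM:SS' (or plain 'HH:MM:SS') into a timedelta."""
    days, _, clock = value.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days or 0), hours=hours,
                     minutes=minutes, seconds=seconds)


accelerators = total_accelerators(recipe["trainer"])
time_limit = parse_time_limit(recipe["run"]["time_limit"])
```

Under those assumptions, this recipe asks for 16 GPUs across two nodes and allows the job to run for up to six days.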

To run this recipe in SageMaker HyperPod with Slurm, you must prepare the SageMaker HyperPod cluster following the cluster setup instruction.

Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Next, you run a helper file to generate a Slurm submission script for the job that you can use for a dry run to inspect the content before starting the training job.

$ python3 main.py --config-path recipes_collection --config-name=config

After training completion, the trained model is automatically saved to your assigned data location.

To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the requirements, and edit the recipe (cluster: k8s) on your laptop. Then, create a link between your laptop and the running EKS cluster and subsequently use the HyperPod Command Line Interface (CLI) to run the recipe.

$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
  "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
  "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
  "cluster": "k8s",
  "cluster_type": "k8s",
  "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
  "recipes.model.data.train_dir": "<your_train_data_dir>",
  "recipes.model.data.val_dir": "<your_val_data_dir>"
}'
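
Hand-written JSON inside a shell command is easy to get subtly wrong (a stray trailing comma is enough to make it invalid). One way to avoid that is to generate the override string from a dictionary. The Python sketch below does this using only the flags and keys shown in the command above; the helper `build_start_job_command` is hypothetical, and the placeholders in angle brackets still need real values.

```python
import json
import shlex

RECIPE = "fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora"

overrides = {
    "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
    "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
    "cluster": "k8s",
    "cluster_type": "k8s",
    "container": "658645717510.dkr.ecr.<region>.amazonaws.com/"
                 "smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
    "recipes.model.data.train_dir": "<your_train_data_dir>",
    "recipes.model.data.val_dir": "<your_val_data_dir>",
}


def build_start_job_command(recipe: str, overrides: dict,
                            pvc: str = "fsx-claim:data") -> list:
    """Build the argument list for the start-job command shown above."""
    return [
        "hyperpod", "start-job",
        "--recipe", recipe,
        "--persistent-volume-claims", pvc,
        # json.dumps always emits valid JSON, so no stray commas.
        "--override-parameters", json.dumps(overrides),
    ]


command = build_start_job_command(RECIPE, overrides)
printable = shlex.join(command)  # safely quoted, ready to paste into a shell
```

You could pass `command` to `subprocess.run` or print `printable` and paste it into a terminal; either way the JSON is quoted consistently.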

You can also run a recipe on SageMaker training jobs using the SageMaker Python SDK. The following example runs a PyTorch training script on SageMaker training jobs and overrides parts of the training recipe.

...
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },   
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}
pytorch_estimator = PyTorch(
    output_path=<output_path>,
    base_job_name="llama-recipe",
    role=<role>,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
...

As training progresses, the model checkpoints are stored on Amazon Simple Storage Service (Amazon S3) with the fully automated checkpointing capability, enabling faster recovery from training faults and instance restarts.

Now available
Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and the Amazon SageMaker AI Developer Guide.

Give SageMaker HyperPod recipes a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

— Channy

AWS Education Equity Initiative: Applying generative AI to educate the next wave of innovators

2024-12-04 Jeff Barr

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-education-equity-initiative-applying-generative-ai-to-educate-the-next-wave-of-innovators/

Building on the work that we and our partners have been doing for many years, Amazon is committing up to $100 million in cloud technology and technical resources to help existing, dedicated learning organizations reach more learners by creating new and innovative digital learning solutions, all as part of the AWS Education Equity Initiative.

The Work So Far
AWS and Amazon have a long-standing commitment to learning and education. Here’s a sampling of what we have already done:

AWS AI & ML Scholarship Program – This program has awarded $28 million in scholarships to approximately 6000 students.

Machine Learning University – MLU offers a free program helping community colleges and Historically Black Colleges and Universities (HBCUs) teach data management, artificial intelligence, and machine learning concepts. The program is designed to address opportunity gaps by supporting students who are historically underserved and underrepresented in technology disciplines.

Amazon Future Engineer – Since 2021, up to $46 million in scholarships has been awarded to 1150 students through this program. In the past year, more than 2.1 million students received over 17 million hours of STEM education, literacy, and career exploration courses through this and other Amazon philanthropic education programs in the United States. I was able to speak at one such session last year and it was an amazing experience.

Free Cloud Training – In late 2020 we set a goal of helping 29 million people grow their tech skills with free cloud computing training by 2025. We worked hard and met that target a year ahead of time!

There’s More To Do
Despite all of this work and progress, there’s still more to be done. The future is definitely not evenly distributed: over half a billion students cannot be reached by digital learning today.

We believe that Generative AI can amplify the good work that socially-minded edtech organizations, non-profits, and governments are already doing. Our goal is to empower them to build new and innovative digital learning systems that can amplify their work and allow them to reach a bigger audience.

With the launch of the AWS Education Equity Initiative, we want to help pave the way for the next generation of technology pioneers as they build powerful tools, train foundation models at scale, and create AI-powered teaching assistants.

We are committing up to $100 million in cloud technology and comprehensive technical advising over the next five years. The awardees will have access to the portfolio of AWS services and technical expertise so that they can build and scale learning management systems, mobile apps, chatbots, and other digital learning tools. As part of the application process, applicants will be asked to demonstrate how their proposed solution will benefit students from underserved and underrepresented communities.

As I mentioned earlier, our partners are already doing a lot of great work in this area. For example:

Code.org has already used AWS to scale their free computer science curriculum to millions of students in more than 100 countries. With this initiative, they will expand their use of Amazon Bedrock to provide an automated assessment of student projects, freeing up educator time that can be used for individual instruction and tailored learning.

Rocket Learning focuses on early childhood education in India. They will use Amazon Q in QuickSight to enhance learning outcomes for more than three million children.

I’m super excited about this initiative and look forward to seeing how it will help to create and educate the next generation of technology pioneers!

— Jeff;

Solve complex problems with new scenario analysis capability in Amazon Q in QuickSight

2024-12-04 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/solve-complex-problems-with-new-scenario-analysis-capability-in-amazon-q-in-quicksight/

Today, we announced a new capability of Amazon Q in QuickSight that helps users perform scenario analyses to find answers to complex problems quickly. This AI-assisted data analysis experience helps business users find answers to complex problems by guiding them step-by-step through in-depth data analysis—suggesting analytical approaches, automatically analyzing data, and summarizing findings with suggested actions—using natural language prompts. This new capability eliminates hours of tedious and error-prone manual work traditionally required to perform analyses using spreadsheets or other alternatives. In fact, Amazon Q in QuickSight enables business users to perform complex scenario analysis up to 10x faster than spreadsheets. This capability expands upon existing data Q&A capabilities of Amazon QuickSight so business professionals can start their analysis by simply asking a question.

How it works
Business users are often faced with complex questions that have traditionally required specialized training and days or weeks of time analyzing data in spreadsheets or other tools to address. For example, let’s say you’re a franchisee with multiple locations to manage. You might use this new capability in Amazon Q in QuickSight to ask, “How can I help our new Chicago store perform as well as the flagship store in New York?” Using an agentic approach, Amazon Q would then suggest analytical approaches needed to address the underlying business goal, automatically analyze data, and present results complete with visualizations and suggested actions. You can conduct this multistep analysis in an expansive analysis canvas, giving you the flexibility to make changes, explore multiple analysis paths simultaneously, and adapt to situations over time.

This new analysis experience is part of Amazon QuickSight, meaning it can read from QuickSight dashboards, which connect to sources such as Amazon Athena, Amazon Aurora, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon OpenSearch Service. Specifically, this new experience is part of Amazon Q in QuickSight, which allows it to seamlessly integrate with other generative business intelligence (BI) capabilities such as data Q&A. You can also upload either a .csv or a single-table, single-sheet .xlsx file to incorporate into your analysis.

Here’s a visual walkthrough of this new analysis experience in Amazon Q in QuickSight.

I’m planning a customer event, and I’ve received an Excel spreadsheet of all who’ve registered to attend the event. I want to learn more about the attendees, so I analyze the spreadsheet and ask a few questions. I start by describing what I want to explore.

I upload the spreadsheet to start my analysis. Firstly, I want to understand how many people have registered for the event.

To design an agenda that’s suitable for the audience, I want to understand the various roles that will be attending. I select the + icon to add a new block for asking a question that follows the thread from the previous block.

I can continue to ask more questions. However, there are suggested questions for analyzing my data even further, and I now select one of these suggested questions. I want to increase marketing efforts at companies that don’t currently have a lot of attendees; in this case, companies with fewer than two attendees.

Amazon Q executes the required analysis and keeps me updated on the progress. Step 1 of the process identifies companies that have fewer than two attendees and lists them.

Step 2 gives an estimate of how many more attendees I might get from each company if marketing efforts are increased.

In Step 3 I can see the potential increase in total attendees (including the percentage increase) in line with the increase in marketing efforts.

Lastly, Step 4 goes even further to highlight companies I should prioritize for these increased marketing efforts.

To increase the potential number of attendees even more, I want to change the analysis to identify companies with fewer than three attendees instead of two attendees. I choose the AI sparkle icon in the upper right to launch a modal that I then use to provide more context and make specific changes to the previous result.

This change resulted in new projections, and I can choose to consider them for my marketing efforts or keep to the previous projections.
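
For readers curious about what the first step of this analysis amounts to, here is a plain Python sketch of the same idea: count attendees per company and list the companies below a threshold. It is not how Amazon Q performs the analysis; the registration data and the helper name are invented for illustration.

```python
from collections import Counter

# Hypothetical registrations: (attendee, company).
registrations = [
    ("Ana", "Acme"), ("Ben", "Acme"), ("Cy", "Acme"),
    ("Dee", "Globex"), ("Eli", "Globex"),
    ("Fay", "Initech"),
]


def companies_below(registrations: list, threshold: int) -> list:
    """Companies with fewer than `threshold` registered attendees."""
    counts = Counter(company for _, company in registrations)
    return sorted(name for name, n in counts.items() if n < threshold)


fewer_than_two = companies_below(registrations, 2)
fewer_than_three = companies_below(registrations, 3)
```

Raising the threshold from two to three widens the list of target companies, which is why the projections change in the walkthrough.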

Now available
Amazon Q in QuickSight Pro users can use this new capability in preview in the following AWS Regions at launch: US East (N. Virginia) and US West (Oregon). Get started with a free 30-day trial of QuickSight today. To learn more, visit the Amazon QuickSight User Guide. You can submit your questions to AWS re:Post for Amazon QuickSight, or through your usual AWS Support contacts.

– Veliswa.

Use Amazon Q Developer to build ML models in Amazon SageMaker Canvas

2024-12-04 Elizabeth Fuentes

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/use-amazon-q-developer-to-build-ml-models-in-amazon-sagemaker-canvas/

As a data scientist, I’ve experienced firsthand the challenges of making machine learning (ML) accessible to business analysts, marketing analysts, data analysts, and data engineers who are experts in their domains without ML experience. That’s why I’m particularly excited about today’s Amazon Web Services (AWS) announcement that Amazon Q Developer is now available in Amazon SageMaker Canvas. What catches my attention is how Amazon Q Developer helps connect ML expertise with business needs, making ML more accessible across organizations.

Amazon Q Developer helps domain experts build accurate, production-quality ML models through natural language interactions, even if they don’t have ML expertise. Amazon Q Developer guides these users by breaking down their business problems and analyzing their data to recommend step-by-step guidance for building custom ML models. It transforms users’ data to remove anomalies, and builds and evaluates custom ML models to recommend the best one, while providing users control and visibility into every step of the guided ML workflow. This empowers organizations to innovate faster with reduced time to market. It also reduces their reliance on ML experts so their specialists can focus on more complex technical challenges.

For example, a marketing analyst can state, “I want to predict home sales prices using home characteristics and past sales data”, and Amazon Q Developer will translate this into a set of ML steps, analyzing relevant customer data, building multiple models, and recommending the best approach.

Let’s see it in action
To start using Amazon Q Developer, I follow the Getting started with using Amazon SageMaker Canvas guide to launch the Canvas application. In this demo, I use natural language instructions to create a model to predict house prices for marketing and finance teams. From the SageMaker Canvas page, I select Amazon Q and then choose Start a new conversation.

In the new conversation I write:

I am an analyst and need to predict house prices for my marketing and finance teams.

Next, Amazon Q Developer explains the problem and recommends the appropriate ML model type. It also outlines the solution requirements, including the necessary dataset characteristics. Amazon Q Developer then asks whether I want to upload my dataset or choose a target column. I choose to upload my dataset.

In the next step, Amazon Q Developer lists the dataset requirements, which include relevant information about houses, current house prices, and the target variable for the regression model. It then recommends next steps, including: I want to upload my dataset, Select an existing dataset, Create a new dataset, or I want to choose a target column. For this demo, I’ll use the canvas-sample-housing.csv sample dataset as my existing dataset.

After selecting and loading the dataset, Amazon Q Developer analyzes it and suggests median_house_value as the target column for the regression model. I accept by selecting I would like to predict the “median_house_value” column. Moving on to the next step, Amazon Q Developer details which dataset features (such as “location”, “housing_median_age”, and “total_rooms”) it will use to predict the median_house_value.

Before moving forward with model training, I ask about the data quality, because without good data we can’t build a reliable model. Amazon Q Developer responds with quality insights for my entire dataset.

I can ask specific questions about individual features and their distributions to better understand the data quality.

To my surprise, through the previous question, I discovered that the “households” column has a wide variation between extreme values, which could affect the model’s prediction accuracy. Therefore, I ask Amazon Q Developer to fix this outlier problem.
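To make the outlier problem concrete, here is a minimal sketch of one common remedy: clipping values that fall outside the interquartile-range (IQR) fences. This is an assumption for illustration only. The post does not say which transformation Amazon Q Developer applies, and the function name clip_outliers_iqr and the sample values are hypothetical.

```python
from statistics import quantiles


def clip_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical "households" values with one extreme entry.
households = [310, 420, 380, 450, 400, 395, 6082]
clipped = clip_outliers_iqr(households)
print("max before:", max(households), "max after:", max(clipped))
```

Clipping keeps every row in the dataset and only limits the influence of extreme values. Dropping the rows or applying a log transform are alternatives with different trade-offs.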

After the transformation is done, I can ask what steps Amazon Q Developer followed to make this change. Behind the scenes, Amazon Q Developer applies advanced data preparation steps using SageMaker Canvas data preparation capabilities. I can review these steps to visualize and replicate the process that produces the final, prepared dataset for training the model.

After reviewing the data preparation steps, I select Launch my training job.

After the training job is launched, I can see its progress in the conversation, and the datasets created.

As a data scientist, I particularly appreciate that, with Amazon Q Developer, I can see detailed metrics such as the confusion matrix and precision-recall scores for classification models and root mean square error (RMSE) for regression models. These are crucial elements I always look for when evaluating model performance and making data-driven decisions. It’s refreshing to see them presented in a way that is accessible to nontechnical users, which builds trust and enables proper governance, while maintaining the depth that technical teams need.

You can access these metrics by selecting the new model from My Models or from the Amazon Q conversation menu:

Overview – This tab shows the Column impact analysis. In this case, median_income emerges as the primary factor influencing my model.
Scoring – This tab provides model accuracy insights, including RMSE metrics.
Advanced metrics – This tab displays the detailed Metrics table, Residuals and Error density for in-depth model evaluation.
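For readers less familiar with the regression metrics shown on these tabs, the following sketch computes RMSE and residuals by hand. The prices are made-up illustration values, not output from the demo model, and the function name rmse is my own.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical house prices and model predictions.
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

# Residuals are the per-row differences the Residuals view is built from.
residuals = [a - p for a, p in zip(actual, predicted)]
print("residuals:", residuals)
print("RMSE:", round(rmse(actual, predicted), 2))
```

Because RMSE squares each error before averaging, a few large misses raise it more than many small ones, which is one reason outliers in the training data matter.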

After reviewing these metrics and validating the model’s performance, I can move to the final stages of the ML workflow:

Predictions – I can test my model using the Predictions tab to validate its real-world performance.
Deployment – I can create an endpoint deployment to make my model available for production use.

This simplifies the deployment process, a step that traditionally requires significant DevOps knowledge, into a straightforward operation that business analysts can handle confidently.

Things to know
Amazon Q Developer democratizes ML across organizations:

Empowering all skill levels with ML – Amazon Q Developer is now available in SageMaker Canvas, helping business analysts, marketing analysts, and data professionals who don’t have ML experience create solutions for business problems through a guided ML workflow. From data analysis and model selection to deployment, users can solve business problems using natural language, reducing dependence on ML experts such as data scientists and enabling organizations to innovate faster with reduced time to market.

Streamlining the ML workflow – With Amazon Q Developer available in SageMaker Canvas, users can prepare data, and build, analyze, and deploy ML models through a guided, transparent workflow. Amazon Q Developer provides advanced data preparation and AutoML capabilities that democratize ML and allow non-ML experts to produce highly accurate ML models.

Providing full visibility into the ML workflow – Amazon Q Developer provides full transparency by generating the underlying code and technical artifacts such as data transformation steps, model explainability, and accuracy measures. This allows cross-functional teams, including ML experts, to review, validate, and update the models as needed, facilitating collaboration in a secure environment.

Availability – Amazon Q Developer is now in preview release in Amazon SageMaker Canvas.

Pricing – Amazon Q Developer is now available in SageMaker Canvas at no additional cost to both Amazon Q Developer Pro tier and Amazon Q Developer Free tier users. However, standard charges apply for resources such as SageMaker Canvas workspace instances and any resources used for building or deploying models. For detailed pricing information, visit the Amazon SageMaker Canvas Pricing page.

To learn more about getting started, visit the Amazon Q Developer product web page.

— Eli

Amazon SageMaker Lakehouse integrated access controls now available in Amazon Athena federated queries

2024-12-03 Esra Kayabali

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/amazon-sagemaker-lakehouse-integrated-access-controls-now-available-in-amazon-athena-federated-queries/

Today, we announced the next generation of Amazon SageMaker, which is a unified platform for data, analytics, and AI, bringing together widely-adopted AWS machine learning and analytics capabilities. At its core is SageMaker Unified Studio (preview), a single data and AI development environment for data exploration, preparation and integration, big data processing, fast SQL analytics, model development and training, and generative AI application development. This announcement includes Amazon SageMaker Lakehouse, a capability that unifies data across data lakes and data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data.

In addition to these launches, I’m happy to announce data catalog and permissions capabilities in Amazon SageMaker Lakehouse, helping you connect, discover, and manage permissions to data sources centrally.

Organizations today store data across various systems to optimize for specific use cases and scale requirements. This often results in data siloed across data lakes, data warehouses, databases, and streaming services. Analysts and data scientists face challenges when trying to connect to and analyze data from these diverse sources. They must set up specialized connectors for each data source, manage multiple access policies, and often resort to copying data, leading to increased costs and potential data inconsistencies.

The new capability addresses these challenges by simplifying the process of connecting to popular data sources, cataloging them, applying permissions, and making the data available for analysis through SageMaker Lakehouse and Amazon Athena. You can use the AWS Glue Data Catalog as a single metadata store for all data sources, regardless of location. This provides a centralized view of all available data.

Data source connections are created once and can be reused, so you don’t need to set up connections repeatedly. As you connect to the data sources, databases and tables are automatically cataloged and registered with AWS Lake Formation. Once cataloged, you grant access to those databases and tables to data analysts, so they don’t have to go through separate steps of connecting to each data source and don’t have to know built-in data source secrets. Lake Formation permissions can be used to define fine-grained access control (FGAC) policies across data lakes, data warehouses, and online transaction processing (OLTP) data sources, providing consistent enforcement when querying with Athena. Data remains in its original location, eliminating the need for costly and time-consuming data transfers or duplications. You can create or reuse existing data source connections in Data Catalog and configure built-in connectors to multiple data sources, including Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Aurora, Amazon DynamoDB (preview), Google BigQuery, and more.

Getting started with the integration between Athena and Lake Formation
To showcase this capability, I use a preconfigured environment that incorporates Amazon DynamoDB as a data source. The environment is set up with appropriate tables and data to effectively demonstrate the capability. I use the SageMaker Unified Studio (preview) interface for this demonstration.

To begin, I go to SageMaker Unified Studio (preview) through the Amazon SageMaker domain. This is where you can create and manage projects, which serve as shared workspaces. These projects allow team members to collaborate, work with data, and develop ML models together. Creating a project automatically sets up AWS Glue Data Catalog databases, establishes a catalog for Redshift Managed Storage (RMS) data, and provisions necessary permissions.

To manage projects, you can either view a comprehensive list of existing projects by selecting Browse all projects, or you can create a new project by choosing Create project. I use two existing projects: sales-group, where administrators have full access privileges to all data, and marketing-project, where analysts operate under restricted data access permissions. This setup effectively illustrates the contrast between administrative and limited user access levels.

In this step, I set up a federated catalog for the target data source, which is Amazon DynamoDB. I go to Data in the left navigation pane and choose the + (plus) sign to Add data. I choose Add connection and then I choose Next.

I choose Amazon DynamoDB and choose Next.

I enter the details and choose Add data. Now, I have the Amazon DynamoDB federated catalog created in SageMaker Lakehouse. This is where your administrator gives you access using resource policies. I’ve already configured the resource policies in this environment. Now, I’ll show you how fine-grained access controls work in SageMaker Unified Studio (preview).

I begin by selecting the sales-group project, which is where administrators maintain and have full access to customer data. This dataset contains fields such as zip codes, customer IDs, and phone numbers. To analyze this data, I can execute queries using Query with Athena.

Upon selecting Query with Athena, the Query Editor launches automatically, providing a workspace where I can compose and execute SQL queries against the lakehouse. This integrated query environment offers a seamless experience for data exploration and analysis.

In the second part, I switch to the marketing-project environment to show what an analyst experiences when running queries. This helps verify that the fine-grained access control permissions are in place and restricting data access as intended.

Using the Query with Athena option, I execute a SELECT statement on the table to verify the access controls. The results confirm that, as expected, I can only view the zipcode and cust_id columns, while the phone column remains restricted based on the configured permissions.
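The effect of column-level permissions can be pictured with a small conceptual sketch. This is not how Lake Formation or Athena enforce access. It only models the outcome described above, where the same table returns different columns depending on the project. The grant table, function name, and sample rows are hypothetical.

```python
# Hypothetical column grants per project, mirroring the demo's outcome.
PROJECT_COLUMN_GRANTS = {
    "sales-group": {"zipcode", "cust_id", "phone"},
    "marketing-project": {"zipcode", "cust_id"},
}


def select_all(project, rows):
    """Return each row with only the columns the project may read."""
    allowed = PROJECT_COLUMN_GRANTS[project]
    return [{col: val for col, val in row.items() if col in allowed}
            for row in rows]


# Made-up customer rows for illustration.
customers = [
    {"zipcode": "98101", "cust_id": 1, "phone": "555-0100"},
    {"zipcode": "10001", "cust_id": 2, "phone": "555-0101"},
]

print(select_all("sales-group", customers))
print(select_all("marketing-project", customers))
```

The point of the sketch is that the restriction is defined once, centrally, and applied at query time, so analysts do not need a separate filtered copy of the data.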

With these new data catalog and permissions capabilities in Amazon SageMaker Lakehouse, you can now streamline your data operations, enhance security governance, and accelerate AI/ML development while maintaining data integrity and compliance across your entire data ecosystem.

Now available
The data catalog and permissions capabilities in Amazon SageMaker Lakehouse simplify interactive analytics through federated queries. By connecting multiple data sources to a unified catalog and permissions model in Data Catalog, you get a single place to define and enforce fine-grained security policies across data lakes, data warehouses, and OLTP data sources, with a high-performing query experience.

You can use this capability in US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), and Asia Pacific (Tokyo) AWS Regions.

To get started with this new capability, visit the Amazon SageMaker Lakehouse documentation.

— Esra

Amazon SageMaker Lakehouse and Amazon Redshift support zero-ETL integrations from applications

2024-12-03 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/introducing-amazon-sagemaker-lakehouse-support-for-zero-etl-integrations-from-applications/

Today, we announced the general availability of Amazon SageMaker Lakehouse and Amazon Redshift support for zero-ETL integrations from applications. Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines. Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines for common ingestion and replication use cases. With zero-ETL integrations from applications such as Salesforce, SAP, and Zendesk, you can reduce time spent building data pipelines and focus on running unified analytics on all your data in Amazon SageMaker Lakehouse and Amazon Redshift.

As organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge. Valuable information is often scattered across multiple repositories, including databases, applications, and other platforms. To harness the full potential of their data, businesses must enable access and consolidation from these varied sources. In response to this challenge, users build data pipelines to extract and load (EL) data from multiple applications into centralized data lakes and data warehouses. Using zero-ETL, you can efficiently replicate valuable data from your customer support, relationship management, and enterprise resource planning (ERP) applications to data lakes and data warehouses for analytics and AI/ML, saving you weeks of engineering effort needed to design, build, and test data pipelines.

Prerequisites

An Amazon SageMaker Lakehouse catalog configured through AWS Glue Data Catalog and AWS Lake Formation.
An AWS Glue database that is configured for Amazon S3 where the data will be stored.
A secret in AWS Secrets Manager to use for the connection to the data source. The credentials must contain the username and password that you use to sign in to your application.
An AWS Identity and Access Management (IAM) role for the Amazon SageMaker Lakehouse or Amazon Redshift job to use. The role must grant access to all resources used by the job, including Amazon S3 and AWS Secrets Manager.
A valid AWS Glue connection to the desired application.

How it works – creating a Glue connection prerequisite
I start by creating a connection using the AWS Glue console. I opt for a Salesforce integration as the data source.

Next, I provide the location of the Salesforce instance to be used for the connection, together with the rest of the required information. Be sure to use the .salesforce.com domain instead of .force.com. Users can choose between two authentication methods: JSON Web Token (JWT), which is obtained through Salesforce access tokens, or OAuth login through the browser.

I review all the information and then choose Create connection.

After I sign into the Salesforce instance through a popup (not shown here), the connection is successfully created.

How it works – creating a zero-ETL integration
Now that I have a connection, I choose zero-ETL integrations from the left navigation panel, then choose Create zero-ETL integration.

First I choose the source type for my integration – in this case Salesforce so I can use my recently created connection.

Next, I select objects from the data source that I want to replicate to the target database in AWS Glue.

While in the process of adding objects, I can quickly preview both data and metadata to confirm that I am selecting the correct object.

By default, zero-ETL integration will synchronize data from the source to the target every 60 minutes. However, you can change this interval to reduce the cost of replication for cases that do not require frequent updates.
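As a rough illustration of why a longer interval can lower replication cost, the sketch below counts how many synchronization runs happen per day at different intervals. It assumes cost scales with the number of runs, which is a simplification. The actual pricing model is described on the AWS Glue pricing page, not here.

```python
MINUTES_PER_DAY = 24 * 60


def syncs_per_day(interval_minutes):
    """Number of synchronization runs per day for a given interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return MINUTES_PER_DAY / interval_minutes


# 60 minutes is the default named in the post; the others are examples.
for interval in (60, 360, 1440):
    print(f"every {interval} min: {syncs_per_day(interval):g} syncs per day")
```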

I review and then choose Create and launch integration.

The data in the source (Salesforce instance) has now been replicated to the target database salesforcezeroETL in my AWS account. This integration has two phases. Phase 1: initial load will ingest all the data for the selected objects and may take from 15 minutes to a few hours, depending on the size of the data in these objects. Phase 2: incremental load will detect any changes (such as new records, updated records, or deleted records) and apply these to the target.
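The two phases can be sketched as a full copy followed by a stream of change records applied to the target. This is a conceptual model only. The record shape, the "op" field, and the function names are hypothetical, and the post does not describe how the service detects or encodes changes.

```python
def initial_load(source_records):
    """Phase 1: copy every selected record into the target, keyed by Id."""
    return {record["Id"]: dict(record) for record in source_records}


def apply_changes(target, changes):
    """Phase 2: apply inserts, updates, and deletes to the target in order."""
    for change in changes:
        op, record = change["op"], change["record"]
        if op in ("insert", "update"):
            target[record["Id"]] = dict(record)
        elif op == "delete":
            target.pop(record["Id"], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Made-up source records and changes for illustration.
source = [{"Id": "001", "Name": "Acme"}, {"Id": "002", "Name": "Globex"}]
target = initial_load(source)

changes = [
    {"op": "update", "record": {"Id": "001", "Name": "Acme Corp"}},
    {"op": "insert", "record": {"Id": "003", "Name": "Initech"}},
    {"op": "delete", "record": {"Id": "002"}},
]
apply_changes(target, changes)
print(target)
```

After the initial load, only the changed records move, which is why the incremental phase is typically much cheaper than re-copying each object.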

Each of the objects that I selected earlier has been stored in its respective table within the database. From here I can view the Table data for each of the objects that have been replicated from the data source.

Lastly, here’s a view of the data in Salesforce. As new entities are created, or existing entities are updated or changed in Salesforce, the data changes will synchronize to the target in AWS Glue automatically.

Now available
Amazon SageMaker Lakehouse and Amazon Redshift support for zero-ETL integrations from applications is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) AWS Regions. For pricing information, visit the AWS Glue pricing page.

To learn more, visit our AWS Glue User Guide. Send feedback to AWS re:Post for AWS Glue or through your usual AWS Support contacts. Get started by creating a new zero-ETL integration today.

– Veliswa

Simplify analytics and AI/ML with new Amazon SageMaker Lakehouse

2024-12-03 Esra Kayabali

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/simplify-analytics-and-aiml-with-new-amazon-sagemaker-lakehouse/

Today, I’m very excited to announce the general availability of Amazon SageMaker Lakehouse, a capability that unifies data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse is a part of the next generation of Amazon SageMaker, which is a unified platform for data, analytics and AI, that brings together widely-adopted AWS machine learning and analytics capabilities and delivers an integrated experience for analytics and AI.

Customers want to do more with data. To move faster with their analytics journey, they are picking the right storage and databases to store their data. The data is spread across data lakes, data warehouses, and different applications, creating data silos that make it difficult to access and utilize. This fragmentation leads to duplicate data copies and complex data pipelines, which in turn increases costs for the organization. Furthermore, customers are constrained to specific query engines and tools, because how and where the data is stored limits their options. This restriction hinders their ability to work with the data as they would prefer. Lastly, inconsistent data access makes it challenging for customers to make informed business decisions.

SageMaker Lakehouse addresses these challenges by helping you to unify data across Amazon S3 data lakes and Amazon Redshift data warehouses. It offers you the flexibility to access and query data in-place with all engines and tools compatible with Apache Iceberg. With SageMaker Lakehouse, you can define fine-grained permissions centrally and enforce them across multiple AWS services, simplifying data sharing and collaboration. Bringing data into your SageMaker Lakehouse is easy. In addition to seamlessly accessing data from your existing data lakes and data warehouses, you can use zero-ETL from operational databases such as Amazon Aurora, Amazon RDS for MySQL, Amazon DynamoDB, as well as applications such as Salesforce and SAP. SageMaker Lakehouse fits into your existing environments.

Get started with SageMaker Lakehouse
For this demonstration, I use a preconfigured environment that has multiple AWS data sources. I go to the Amazon SageMaker Unified Studio (preview) console, which provides an integrated development experience for all your data and AI. Using Unified Studio, you can seamlessly access and query data from various sources through SageMaker Lakehouse, while using familiar AWS tools for analytics and AI/ML.

This is where you can create and manage projects, which serve as shared workspaces. These projects allow team members to collaborate, work with data, and develop AI models together. Creating a project automatically sets up AWS Glue Data Catalog databases, establishes a catalog for Redshift Managed Storage (RMS) data, and provisions necessary permissions. You can get started by creating a new project or continue with an existing project.

To create a new project, I choose Create project.

I have two project profile options to build a lakehouse and interact with it. The first is Data analytics and AI-ML model development, where you can analyze data and build ML and generative AI models powered by Amazon EMR, AWS Glue, Amazon Athena, Amazon SageMaker AI, and SageMaker Lakehouse. The second is SQL analytics, where you can analyze your data in SageMaker Lakehouse using SQL. For this demo, I proceed with SQL analytics.

I enter a project name in the Project name field and choose SQL analytics under Project profile. I choose Continue.

I enter the values for all the parameters under Tooling. I enter the values to create my Lakehouse databases. I enter the values to create my Redshift Serverless resources. Finally, I enter a name for my catalog under Lakehouse Catalog.

On the next step, I review the resources and choose Create project.

After the project is created, I observe the project details.

I go to Data in the navigation pane and choose the + (plus) sign to Add data. I choose Create catalog to create a new catalog and choose Add data.

After the RMS catalog is created, I choose Build from the navigation pane and then choose Query Editor under Data Analysis & Integration to create a schema under the RMS catalog, create a table, and then load the table with sample sales data.

After entering the SQL queries into the designated cells, I choose Select data source from the right dropdown menu to establish a database connection to Amazon Redshift data warehouse. This connection allows me to execute the queries and retrieve the desired data from the database.

Once the database connection is successfully established, I choose Run all to execute all queries and monitor the execution progress until all results are displayed.

For this demonstration, I use two additional pre-configured catalogs. A catalog is a container that organizes your lakehouse object definitions such as schema and tables. The first is an Amazon S3 data lake catalog (test-s3-catalog) that stores customer records, containing detailed transactional and demographic information. The second is a lakehouse catalog (churn_lakehouse) dedicated to storing and managing customer churn data. This integration creates a unified environment where I can analyze customer behavior alongside churn predictions.

From the navigation pane, I choose Data and locate my catalogs under the Lakehouse section. SageMaker Lakehouse offers multiple analysis options, including Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.

Note that you need to choose the Data analytics and AI-ML model development profile when you create a project if you want to use the Open in Jupyter Lab notebook option. If you choose Open in Jupyter Lab notebook, you can interact with SageMaker Lakehouse using Apache Spark via EMR 7.5.0 or AWS Glue 5.0 by configuring the Iceberg REST catalog, enabling you to process data across your data lakes and data warehouses in a unified manner.
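As a hedged sketch of what configuring the Iceberg REST catalog can involve, the helper below builds a set of Spark catalog properties as a plain dictionary. Everything in it is an assumption to verify against the Amazon SageMaker Lakehouse and Apache Iceberg documentation: the property names reflect my understanding of Iceberg's Spark and REST catalog settings, the endpoint URL pattern is assumed, and the catalog name, Region, and warehouse value are placeholders. The post itself does not list any of these values.

```python
def iceberg_rest_catalog_conf(catalog_name, region, warehouse):
    """Build Spark properties for an Iceberg REST catalog (assumed names)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        # Assumed endpoint pattern; confirm in the AWS documentation.
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        f"{prefix}.warehouse": warehouse,
        # SigV4 request signing settings (assumed property names).
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
    }


# Placeholder values for illustration only.
conf = iceberg_rest_catalog_conf("lakehouse", "us-east-1", "111122223333")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Keeping the settings in a dictionary makes it easy to pass them to a Spark session builder or print them as command-line flags, whichever the notebook environment expects.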

Here’s what querying using a Jupyter Lab notebook looks like:

I continue by choosing Query with Athena. With this option, I can use the serverless query capability of Amazon Athena to analyze the sales data directly within SageMaker Lakehouse. Upon selecting Query with Athena, the Query Editor launches automatically, providing a workspace where I can compose and execute SQL queries against the lakehouse. This integrated query environment offers a seamless experience for data exploration and analysis, complete with syntax highlighting and auto-completion features to enhance productivity.

I can also use the Query with Redshift option to run SQL queries against the lakehouse.

SageMaker Lakehouse offers a comprehensive solution for modern data management and analytics. By unifying access to data across multiple sources, supporting a wide range of analytics and ML engines, and providing fine-grained access controls, SageMaker Lakehouse helps you make the most of your data assets. Whether you’re working with data lakes in Amazon S3, data warehouses in Amazon Redshift, or operational databases and applications, SageMaker Lakehouse provides the flexibility and security you need to drive innovation and make data-driven decisions. You can use hundreds of connectors to integrate data from various sources. Additionally, you can access and query data in-place with federated query capabilities across third-party data sources.

Now available
You can access SageMaker Lakehouse through the AWS Management Console, APIs, AWS Command Line Interface (AWS CLI), or AWS SDKs. You can also access through AWS Glue Data Catalog and AWS Lake Formation. SageMaker Lakehouse is available in US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), Europe (Frankfurt), Europe (Stockholm), Asia Pacific (Sydney), Asia Pacific (Hong Kong), Asia Pacific (Tokyo), and Asia Pacific (Singapore) AWS Regions.

For pricing information, visit the Amazon SageMaker Lakehouse pricing page.

For more information on Amazon SageMaker Lakehouse and how it can simplify your data analytics and AI/ML workflows, visit the Amazon SageMaker Lakehouse documentation.

— Esra

New Amazon DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse

2024-12-03 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/new-amazon-dynamodb-zero-etl-integration-with-amazon-sagemaker-lakehouse/

Amazon DynamoDB, a serverless NoSQL database, has been a go-to solution for over one million customers to build low-latency and high-scale applications. As data grows, organizations are constantly seeking ways to extract valuable insights from operational data, which is often stored in DynamoDB. However, to make the most of this data in Amazon DynamoDB for analytics and machine learning (ML) use cases, customers often build custom data pipelines—a time-consuming infrastructure task that adds little unique value to their core business.

Starting today, you can use Amazon DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse to run analytics and ML workloads in just a few clicks without consuming your DynamoDB table capacity. Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data.

Zero-ETL is a set of integrations that eliminates or minimizes the need to build ETL data pipelines. This zero-ETL integration reduces the complexity of engineering efforts required to build and maintain data pipelines, benefiting users running analytics and ML workloads on operational data in Amazon DynamoDB without impacting production workflows.

Let’s get started
For the following demo, I need to set up zero-ETL integration for my data in Amazon DynamoDB with an Amazon Simple Storage Service (Amazon S3) data lake managed by Amazon SageMaker Lakehouse. Before setting up the zero-ETL integration, there are prerequisites to complete. To learn more about how to set these up, refer to this Amazon DynamoDB documentation page.

With all the prerequisites completed, I can get started with this integration. I navigate to the AWS Glue console and select Zero-ETL integrations under Data Integration and ETL. Then, I choose Create zero-ETL integration.

Here, I have options to select my data source. I choose Amazon DynamoDB and choose Next.

Next, I need to configure the source and target details. In the Source details section, I select my Amazon DynamoDB table. In the Target details section, I specify the S3 bucket that I’ve set up in the AWS Glue Data Catalog.

To set up this integration, I need an IAM role that grants AWS Glue the necessary permissions. For guidance on configuring IAM permissions, visit the Amazon DynamoDB documentation page. Also, if I haven’t configured a resource policy for my AWS Glue Data Catalog, I can select Fix it for me to automatically add the required resource policies.

Here, I have options to configure the output. Under Data partitioning, I can either use DynamoDB table keys for partitioning or specify custom partition keys. After completing the configuration, I choose Next.
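To show what partitioning by a table key means in practice, here is a minimal sketch that groups items by the value of a chosen key. It is a conceptual illustration under my own assumptions. The item shape, the key name customer_id, and the function name are hypothetical, and the post does not describe how the integration lays out partitions in Amazon S3.

```python
from collections import defaultdict


def partition_items(items, partition_key):
    """Group items by the value of the chosen partition key."""
    partitions = defaultdict(list)
    for item in items:
        partitions[item[partition_key]].append(item)
    return dict(partitions)


# Made-up items for illustration.
orders = [
    {"customer_id": "c1", "order_id": "o1", "total": 40},
    {"customer_id": "c2", "order_id": "o2", "total": 15},
    {"customer_id": "c1", "order_id": "o3", "total": 22},
]

partitions = partition_items(orders, "customer_id")
for key, rows in partitions.items():
    print(key, len(rows))
```

A query that filters on the partition key only needs to read the matching group, which is the usual reason to choose a partition key that matches common query filters.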

Because I selected the Fix it for me checkbox, I need to review the required changes and choose Continue before I can proceed to the next step.

On the next page, I have the flexibility to configure data encryption. I can use AWS Key Management Service (AWS KMS) or a custom encryption key. Then, I assign a name to the integration and choose Next.

On the last step, I need to review the configurations. When I’m happy, I choose Next to create the zero-ETL integration.

After the initial data ingestion completes, my zero-ETL integration will be ready for use. The completion time varies depending on the size of my source DynamoDB table.

If I navigate to Tables under Data Catalog in the left navigation panel, I can observe more details, including the Schema. Under the hood, this zero-ETL integration uses Apache Iceberg to handle the data format and structure transformations needed to bring my DynamoDB data into Amazon S3.

Lastly, I can tell that all my data is available in my S3 bucket.

This zero-ETL integration significantly reduces the complexity and operational burden of data movement, and I can therefore focus on extracting insights rather than managing pipelines.

Available now
This new zero-ETL capability is available in the following AWS Regions: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Hong Kong, Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, Stockholm).

Explore how to streamline your data analytics workflows using Amazon DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse. Learn more about how to get started on the Amazon DynamoDB documentation page.

Happy building!
— Donnie

Discover, govern, and collaborate on data and AI securely with Amazon SageMaker Data and AI Governance

2024-12-03 Esra Kayabali

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/discover-govern-and-collaborate-on-data-and-ai-securely-with-amazon-sagemaker-data-and-ai-governance/

Today, we announced the next generation of Amazon SageMaker, which is a unified platform for data, analytics, and AI, bringing together widely adopted AWS machine learning and analytics capabilities. This announcement includes Amazon SageMaker Data and AI Governance, a set of capabilities that streamline the management of data and AI assets.

Data teams often face challenges when trying to locate, access, and collaborate on data and AI models across their organizations. The process of discovering relevant assets, understanding their context, and obtaining proper access can be time-consuming and complex, potentially hindering productivity and innovation.

SageMaker Data and AI Governance offers a comprehensive set of features by providing a unified experience for cataloging, discovering, and governing data and AI assets. It’s centered around SageMaker Catalog built on Amazon DataZone, providing a centralized repository that is accessible through Amazon SageMaker Unified Studio (preview). The catalog is built directly into the SageMaker platform, offering seamless integration with existing SageMaker workflows and tools, helping engineers, data scientists, and analysts to safely find and use authorized data and models through advanced search features. With the SageMaker platform, users can safeguard and protect their AI models using guardrails and implementing responsible AI policies.

Here are some of the key Data and AI governance features of SageMaker:

Enterprise-ready business catalog – To add business context and make data and AI assets discoverable by everyone in the organization, you can customize the catalog with automated metadata generation, which uses machine learning (ML) to automatically generate business names for data assets and the columns within those assets. We improved metadata curation functionality, helping you attach multiple business glossary terms to assets and glossary terms to individual columns in the asset.
Self-service for data and AI workers – To provide data autonomy for users to publish and consume data, you can customize and bring any type of asset to the catalog using APIs. Data publishers can automate metadata discovery through data source runs or manually publish files from the supported data sources, and they can enrich metadata with generative AI–generated data descriptions automatically as datasets are brought into the catalog. Data consumers can then use faceted search to quickly find, understand, and request access to data.
Simplified access to data and tools – To govern data and AI assets based on business purpose, projects serve as business use case–based logical containers. You can create a project and collaborate on specific business use case–based groupings of people, data, and analytics tools. Within the project, you can create an environment that provides the necessary infrastructure to project members such as analytics and AI tools and storage so that project members can easily produce new data or consume data they have access to. This helps you add multiple capabilities and analytics tools to the same project, depending on your needs.
Governed data and model sharing – Data producers own and manage access to data with a subscription approval workflow that allows consumers to request access and data owners to approve. You can now set up subscription terms to be attached to assets when published and automate subscription grant fulfillment for AWS managed data lakes and Amazon Redshift with customizations using Amazon EventBridge events for other sources.
Bring a consistent level of AI safety across all your applications – Amazon Bedrock Guardrails helps evaluate user inputs and foundation model (FM) responses based on use case-specific policies, and provides an additional layer of safeguards regardless of the underlying FM. The AWS AI portfolio provides hundreds of built-in algorithms with pre-trained models from model hubs, including TensorFlow Hub, PyTorch Hub, Hugging Face, and MXNet GluonCV. You can also access built-in algorithms using the SageMaker Python SDK. Built-in algorithms cover common ML tasks, such as data classification (image, text, tabular) and sentiment analysis.

For seamless integration with existing processes, SageMaker Data and AI Governance provides API support, enabling programmatic access for setup and configuration.

How to use Amazon SageMaker Data and AI Governance
For this demonstration, I use a preconfigured environment. I go to the Amazon SageMaker Unified Studio (preview) console, which provides an integrated development experience for all your data and AI use cases. This is where you can create and manage projects, which serve as shared workspaces. These projects allow team members to collaborate, work with data, and develop ML models together.

Let me start with the Govern menu in the navigation bar.

New data governance capabilities called domain units and authorization policies help you create business unit- and team-level organization and manage policies according to your business needs. With the addition of domain units, you can organize, create, search, and find data assets and projects associated with business units or teams. With authorization policies, you can set access policies for creating projects and glossaries.

Domain units also help you with self-service governance over critical actions such as publishing data assets and utilizing compute resources within Amazon SageMaker. I choose a project and navigate to the Data sources tab in the left navigation pane. You can use this section to add new or manage existing data sources for publishing data assets to the business data catalog, making them discoverable for all users.

I return to the homepage and continue exploring by choosing Data Catalog, which serves as a centralized hub where users can explore and discover all available data assets across multiple data sources within the organization. This catalog connects to various data sources, including Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and AWS Glue.

The semantic search feature helps you find relevant data assets quickly and efficiently using natural language queries, which makes data discovery more intuitive. I enter events in the Search data area.

You can apply filters based on asset type, such as AWS Glue table and Amazon Redshift.
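To make the combination of a search term and an asset-type filter concrete, here is a small in-memory sketch. It is an illustration only: it uses plain keyword matching as a stand-in for the semantic search described above, and the CatalogAsset class, the search_assets function, and the sample assets are all hypothetical names, not part of any SageMaker API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class CatalogAsset:
    """Hypothetical catalog entry used only for this illustration."""
    name: str
    asset_type: str
    description: str


def search_assets(
    assets: Iterable[CatalogAsset],
    text: str,
    asset_types: Optional[Iterable[str]] = None,
) -> List[CatalogAsset]:
    """Keyword match on name and description, then filter by asset type."""
    needle = text.lower()
    wanted = set(asset_types) if asset_types else None
    results = []
    for asset in assets:
        # Apply the asset-type facet first, if one was selected.
        if wanted is not None and asset.asset_type not in wanted:
            continue
        if needle in asset.name.lower() or needle in asset.description.lower():
            results.append(asset)
    return results


# Sample data, invented for the example.
catalog = [
    CatalogAsset("events", "AWS Glue table", "Ticketed events by venue"),
    CatalogAsset("event_sales", "Amazon Redshift", "Sales per event"),
    CatalogAsset("revenue", "Amazon Redshift", "Monthly revenue totals"),
]

all_matches = search_assets(catalog, "events")
glue_only = search_assets(catalog, "event", asset_types=["AWS Glue table"])
```

Here `all_matches` contains only the assets whose name or description mentions "events", and `glue_only` narrows a broader term to a single asset type.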

Amazon Q Developer integration helps you interact with data using conversational language, making it easier for users to find and understand data assets. You can use example commands such as “Show me datasets that relate to events” and “Show me datasets that relate to revenue.” The detailed view provides comprehensive information about each dataset, including AI-generated descriptions, data quality metrics, and data lineage, helping you understand the content and origin of the data.

The subscription process implements a controlled access mechanism where users must justify their need for data access, providing proper data governance and security. I choose Subscribe to request access.

In the pop-up window, I select a Project, provide a Reason for request such as “need access”, and choose Request. The request is sent to the data owner.

This final step makes sure that data access is properly governed through a structured approval workflow, maintaining data security and compliance requirements. During the owner approval process, the data owner receives a notification and can review the request details before choosing to approve or deny access, after which the requester can access the data table if approved.
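The request-then-approve flow described above behaves like a small state machine. The sketch below models it in memory to show the rule that access follows only from an explicit approval. It is a simplified illustration, not the service's implementation; the SubscriptionRequest class, its fields, and the project and asset names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"


@dataclass
class SubscriptionRequest:
    """Hypothetical model of a consumer's request for access to an asset."""
    project: str
    asset: str
    reason: str
    status: Status = Status.PENDING

    def __post_init__(self) -> None:
        # The requester must justify the need for access.
        if not self.reason.strip():
            raise ValueError("a reason for the request is required")

    def decide(self, approve: bool) -> None:
        """Record the data owner's decision; a request is decided only once."""
        if self.status is not Status.PENDING:
            raise RuntimeError("request has already been decided")
        self.status = Status.APPROVED if approve else Status.REJECTED

    def can_access(self) -> bool:
        """Access is granted only after an explicit approval."""
        return self.status is Status.APPROVED


request = SubscriptionRequest("marketing-analytics", "events", "need access")
access_before = request.can_access()   # still pending at this point
request.decide(approve=True)           # the data owner approves
access_after = request.can_access()
```

Modeling the decision as a one-way transition out of PENDING mirrors the idea that every grant can be traced back to a single owner decision.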

Now available
Amazon SageMaker Data and AI Governance offers significant benefits for organizations looking to improve their data and AI asset management. The solution helps data scientists, engineers, and analysts overcome challenges in discovering and accessing resources by offering comprehensive features for cataloging, discovering, and governing data and AI assets, while providing security and compliance through structured approval workflows.

For pricing information, visit Amazon SageMaker pricing.

To get started with Amazon SageMaker Data and AI Governance, visit Amazon SageMaker Documentation.

— Esra

Announcing the general availability of data lineage in the next generation of Amazon SageMaker and Amazon DataZone

2024-12-03 Esra Kayabali

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/announcing-the-general-availability-of-data-lineage-in-the-next-generation-of-amazon-sagemaker-and-amazon-datazone/

Today, I’m happy to announce the general availability of data lineage in Amazon DataZone, following its preview release in June 2024. This feature is also extended as part of the catalog capabilities in the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI.

Traditionally, business analysts have relied on manual documentation or personal connections to validate data origins, leading to inconsistent and time-consuming processes. Data engineers have struggled to evaluate the impact of changes to data assets, especially as self-service analytics adoption increases. Additionally, data governance teams have faced difficulties in enforcing practices and responding to auditor queries about data movement.

Data lineage in Amazon DataZone addresses the challenges faced by organizations striving to remain competitive by using their data for strategic analysis. It enhances data trust and validation by providing a visual, traceable history of data assets, enabling business analysts to quickly understand data origins without manual research. For data engineers, it facilitates impact analysis and troubleshooting by clearly showing relationships between assets and allowing easy tracing of data flows.

The feature supports data governance and compliance efforts by offering a comprehensive view of data movement, helping governance teams to quickly respond to compliance queries and enforce data policies. It improves data discovery and understanding, helping consumers grasp the context and relevance of data assets more efficiently. Additionally, data lineage contributes to better change management, increased data literacy, reduced data duplication, and enhanced cross-team collaboration. By tackling these challenges, data lineage in Amazon DataZone helps organizations build a more trustworthy, efficient, and compliant data ecosystem, ultimately enabling more effective data-driven decision-making.

Automated lineage capture is a key feature of the data lineage in Amazon DataZone, which focuses on automatically collecting and mapping lineage information from AWS Glue and Amazon Redshift. This automation significantly reduces the manual effort required to maintain accurate and up-to-date lineage information.

Get started with data lineage in Amazon DataZone
Data producers and domain administrators get started by setting up data source run jobs from the AWS Glue Data Catalog and Amazon Redshift sources to Amazon DataZone, which periodically collect metadata from the source catalog. Additionally, data producers can hydrate the lineage information programmatically by creating custom lineage nodes using APIs that accept OpenLineage-compatible events from existing pipeline components (such as schedulers, warehouses, analysis tools, and SQL engines) to send data about datasets, jobs, and runs directly to the Amazon DataZone API endpoint. As this information arrives, Amazon DataZone starts populating the lineage model and mapping it to the assets already cataloged. As new lineage events are captured, Amazon DataZone maintains versions of events that were already captured, so users can navigate to previous versions if needed.
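For readers who have not seen an OpenLineage event, the sketch below builds a minimal run event describing one job that reads one dataset and writes another. It is an assumption-laden illustration: the producer URI, job names, and dataset names are placeholders, the spec version in the schema URL should be checked against the OpenLineage release you target, and the commented-out boto3 call reflects my understanding of the Amazon DataZone PostLineageEvent API rather than a verified signature.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed values: the producer URI is a placeholder, and the spec version in
# SCHEMA_URL should be checked against the OpenLineage release in use.
PRODUCER = "https://example.com/my-pipeline"
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def dataset(namespace: str, name: str) -> dict:
    """Describe a dataset by the namespace and name OpenLineage expects."""
    return {"namespace": namespace, "name": name}


def build_run_event(job_namespace, job_name, inputs, outputs,
                    event_type="COMPLETE", run_id=None, event_time=None):
    """Assemble a minimal OpenLineage run event as a plain dictionary."""
    return {
        "eventType": event_type,
        "eventTime": event_time or datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": list(inputs),
        "outputs": list(outputs),
    }


# Placeholder job and dataset names, for illustration only.
event = build_run_event(
    "nightly-etl",
    "load_customer_summary",
    inputs=[dataset("redshift://example-cluster", "public.customers")],
    outputs=[dataset("redshift://example-cluster", "public.customer_summary")],
)
payload = json.dumps(event).encode("utf-8")

# Assumed call, not executed here:
# boto3.client("datazone").post_lineage_event(
#     domainIdentifier="<domain-id>", event=payload)
```

The inputs and outputs lists are what let a lineage service connect this job run to upstream and downstream assets.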

From the consumer’s perspective, lineage can help with three scenarios. First, a business analyst browsing an asset can go to the Amazon DataZone portal, search for an asset by name, and select an asset that interests them to dive into the details. Initially, they’ll be presented with details in the Business Metadata tab and can move right to neighboring tabs. To view lineage, the analyst can go to the Lineage tab for details of upstream nodes to find the source. The analyst is presented with a view of that asset’s lineage, one level upstream and downstream. To get to the source of the asset, the analyst can choose upstream. When the analyst is sure that this is the correct asset, they can subscribe to the asset and continue with their work.

Second, if a data issue is reported—for instance, when a dashboard unexpectedly shows a significant increase in customer count—a data engineer can use the Amazon DataZone portal to locate and examine the relevant asset details. In the asset details page, the data engineer navigates to the Lineage tab to view the details of upstream nodes of the asset in question. The engineer can dive into the details of each node, its snapshots, column mapping between each table node, the jobs that ran in between, and view the query that was executed in the job run. Using this information, the data engineer can spot that a new input table was added to the pipeline, which has introduced an uptick in customer count, because they notice that this new table wasn’t part of the previous snapshots of the job runs. This helps them clarify that a new source was added and hence the data shown in the dashboard is accurate.
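The comparison the data engineer makes in this scenario amounts to a set difference between the inputs of the latest job run and the inputs seen in earlier snapshots. The sketch below shows that reasoning with invented table names; the new_inputs function is hypothetical and is not part of the Amazon DataZone portal or API.

```python
from typing import Iterable, List


def new_inputs(previous_runs: Iterable[Iterable[str]],
               latest_run: Iterable[str]) -> List[str]:
    """Return input tables read by the latest run that no earlier run read."""
    seen = set()
    for run_inputs in previous_runs:
        seen.update(run_inputs)
    return sorted(set(latest_run) - seen)


# Invented snapshots of a job's input tables, oldest first.
earlier_snapshots = [
    ["crm.customers"],
    ["crm.customers"],
]
latest_snapshot = ["crm.customers", "partner.customers"]

added = new_inputs(earlier_snapshots, latest_snapshot)
```

In this example `added` holds the one table that appears only in the latest snapshot, which is the kind of change that would explain a sudden increase in customer count.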

Lastly, a steward looking to respond to questions from an auditor can go to the asset in question and navigate to the Lineage tab of that asset. The steward traverses the graph upstream to see where the data is coming from and notices that the data is from two different teams (for instance, from two different on-premises databases), each with its own pipeline until the pipelines merge. While navigating through the lineage graph, the steward can expand the columns to make sure sensitive columns are dropped during the transformation processes and respond to the auditors with details in a timely manner.
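Tracing an asset back to its origins, as in the scenarios above, is a breadth-first walk over the lineage graph in the upstream direction. The following sketch represents the graph as a plain dictionary from each node to its direct upstream nodes. The node names and the upstream_sources function are hypothetical and only illustrate the traversal, not how the service stores lineage.

```python
from collections import deque
from typing import Dict, List


def upstream_sources(upstream: Dict[str, List[str]], asset: str) -> List[str]:
    """Walk upstream from an asset and return the nodes with no parents."""
    seen = {asset}
    queue = deque([asset])
    sources = set()
    while queue:
        node = queue.popleft()
        parents = upstream.get(node, [])
        # A node other than the starting asset with no parents is an origin.
        if not parents and node != asset:
            sources.add(node)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(sources)


# Invented lineage: two team pipelines that merge before the final asset.
lineage = {
    "sales_report": ["merged_sales"],
    "merged_sales": ["team_a_clean", "team_b_clean"],
    "team_a_clean": ["onprem_db_a.orders"],
    "team_b_clean": ["onprem_db_b.orders"],
}

origins = upstream_sources(lineage, "sales_report")
```

The `seen` set keeps the walk from revisiting a node when two pipelines share an upstream table.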

How Amazon DataZone automates lineage collection
Amazon DataZone now enables automatic capture of lineage events, helping data producers and administrators to streamline the tracking of data relationships and transformations across their AWS Glue and Amazon Redshift resources. To allow automatic capture of lineage events from AWS Glue and Amazon Redshift, you have to opt in because some of your jobs or connections might be for testing and you might not need any lineage to be captured. With the integrated experience available, the services will provide you with an option in your configuration settings to opt in to collect and emit lineage events directly to Amazon DataZone.

These events should capture the various data transformation operations you perform on tables and other objects, such as table creation with column definitions, schema changes, and transformation queries, including aggregations and filtering. By obtaining these lineage events directly from your processing engines, Amazon DataZone can build a foundation of accurate and consistent data lineage information. This will then help you, as a data producer, to further curate the lineage data as part of the broader business data catalog capabilities.

Administrators can enable lineage when setting up the built-in DefaultDataLake or the DefaultDataWarehouse blueprints.

Data producers can view the status of automated lineage while setting up the data source runs.

With the recent launch of the next generation of Amazon SageMaker, data lineage is available as one of the catalog capabilities in the Amazon SageMaker Unified Studio (preview). Data users can set up lineage using connections, and that configuration will automate the capture of lineage in the platform for all users to browse and understand the data. Here’s how data lineage in next generation Amazon SageMaker will look.

Now available
You can begin using this capability to gain deeper insights into your data ecosystem and drive more informed, data-driven decision-making.

Data lineage is generally available in all AWS Regions where Amazon DataZone is available. For a list of Regions where Amazon DataZone domains can be provisioned, visit AWS Services by Region.

Data lineage costs are dependent on storage usage and API requests, which are already included in the Amazon DataZone pricing model. For more details, visit Amazon DataZone pricing.

To get started with data lineage in Amazon DataZone, visit the Amazon DataZone User Guide.

— Esra

Bookblaze: The Third Annual Backblaze Book Guide

Pat Patterson, Chief Technical Evangelist

Never Understood: The Jesus and Mary Chain, by William Reid and Jim Reid

Yev Pusin, Sr. Director, Marketing

Impact Winter, by Travis Beacham

Jeremy Milk, Sr. Director, Product Marketing

How Big Things Get Done, by Dan Gardner and Bent Flyvbjerg

Nicole Gale, Marketing Operations Manager

The Women, by Kristin Hannah

David Johnson, Product Marketing Manager

The Coming Wave: Technology, Power, and the Twenty-First Century’s Greatest Dilemma, by Mustafa Suleyman

Bala Krishna Gangisetty, Sr. Product Manager

Mindset: The New Psychology of Success, by Carol Dweck

Teresa Dodson, Sr. Director, Partner Marketing and Alliances

Dare to Lead: Brave Work. Tough Conversations. Whole Hearts., by Brené Brown

Stephanie Doyle, Writer and Content Operations Strategist

The Skyward Trilogy, by Brandon Sanderson

Happy Reading from Backblaze
