Unlock data across organizational boundaries using Amazon DataZone – now generally available 

Post Syndicated from Shikha Verma original https://aws.amazon.com/blogs/big-data/unlock-data-across-organizational-boundaries-using-amazon-datazone-now-generally-available/

We are excited to announce the general availability of Amazon DataZone. Amazon DataZone enables customers to discover, access, share, and govern data at scale across organizational boundaries, reducing the undifferentiated heavy lifting of making data and analytics tools accessible to everyone in the organization. With Amazon DataZone, data users like data engineers, data scientists, and data analysts can share and access data across AWS accounts using a unified data portal, allowing them to discover, use, and collaborate on this data across their teams and organizations. Additionally, data owners and data stewards can make data discovery simpler by adding business context to data while balancing access governance to the data via pre-defined approval workflows in the user interface.

In this blog post, we share what we heard from our customers that led us to create Amazon DataZone and discuss specific customer use cases and quotes from customers who tried Amazon DataZone during our public preview. Then we explain the benefits of Amazon DataZone and walk you through key features.

Common pain points of data management and governance:

  1. Discovery of data, especially data distributed across accounts and regions – Finding the data to use for analysis is challenging because organizations often have petabytes of data spread across tens or even thousands of data sources.
  2. Access to data – Data access control is hard, is managed differently across organizations, and often requires manual approvals, which can be a time-consuming process that is hard to keep up to date. As a result, analysts often lack access to the data they need.
  3. Access to tools – Data users want to use different tools of choice with the same governed data. This is challenging because access to data is managed differently by each of the tools.
  4. Collaboration – Analysts, data scientists, and data engineers often own different steps within the end-to-end analytics journey but do not have a simple way to collaborate on the same governed data, using the tools of their choice.
  5. Data governance – Constructs to govern data are hidden within individual tools and managed differently by different teams, preventing organizations from having traceability on who’s accessing what and why.

Three core benefits of Amazon DataZone

Amazon DataZone enables customers to discover, share, and govern data at scale across organizational boundaries.

  • Govern data access across organizational boundaries. Help ensure that the right data is accessed by the right user for the right purpose—in accordance with your organization’s security regulations—without relying on individual credentials. Provide transparency on data asset usage and approve data subscriptions with a governed workflow. Monitor data assets across projects through usage auditing capabilities.
  • Connect data people through shared data and tools to drive business insights. Increase your business team’s efficiency by collaborating seamlessly across teams and providing self-service access to data and analytics tools. Use business terms to search, share, and access cataloged data, so that all configured users can find data and use the business glossary to learn more about the data they want to use.
  • Automate data discovery and cataloging with machine learning (ML). Reduce the time needed to manually enter data attributes into the business data catalog and minimize the introduction of errors. More and richer data in the data catalog improves the search experience, too. Reduce your time searching for and using data from weeks to days.

Here are the core benefits Amazon DataZone provides to its customers.

Figure 1: Benefits of Amazon DataZone

Let’s see what capabilities are built into the service to provide these benefits.

Figure 2: Capabilities of Amazon DataZone

Amazon DataZone provides the following detailed capabilities.

  1. Business-driven domains – A DataZone domain represents the distinct boundary of a line of business (LOB) or a business area within an organization that can manage its own data, including its own data assets and its own definition of data or business terminology, and that may have its own governing standards. A domain is the starting point of a customer’s journey with Amazon DataZone. When you first start using DataZone, you create a domain, and all core components, such as the business data catalog, projects, and environments, then exist within that domain.
    1. An Amazon DataZone domain contains an associated business data catalog for search and discovery, a set of metadata definitions to decorate the data assets that are used for discovery purposes, and data projects with integrated analytics and ML tools for users and groups to consume and publish data assets.
    2. An Amazon DataZone domain can span multiple AWS accounts, either by connecting to and pulling in data lake or data warehouse data in these accounts (for example, AWS Glue Data Catalog) to form a data mesh, or by creating and running projects and environments in these accounts across the supported AWS Regions.
    3. Amazon DataZone domains bring along the capabilities of AWS Resource Access Manager (AWS RAM) to securely share resources across accounts.
    4. After an Amazon DataZone domain is created, the domain provides a browser-based web application where the organization’s configured users can go to catalog, discover, govern, share, and analyze data in a self-service fashion. The data portal supports identity providers through the AWS IAM Identity Center (successor to AWS Single Sign-On) and AWS Identity and Access Management (IAM) principals for authentication.
    5. For example, a marketing team can create a domain with name “Marketing” and have full ownership over it. Similarly, a sales team can create a domain with name “Sales” and have full ownership over it. When sales wants to share data with marketing, the marketing team can give access to a sales account by associating that account with the marketing domain, and the sales user can use the marketing domain’s Amazon DataZone portal link to share their data with the marketing team.
  2. Organization-wide business data catalog – You can make data visible with business context for your users to find and understand data quickly and efficiently. The core of the catalog is focused on cataloging data from different sources and augmenting that metadata with additional business context to build trust, and facilitate better decision-making for consumers looking for data.
    1. Standardize on terminology – You can standardize your business terminology to communicate among data publishers and consumers by creating glossaries and including detailed descriptions for terms along with the term relationships. These terms can be mapped to assets and columns to standardize the description of these assets and to assist in discovering and understanding the details of the underlying data.
    2. Building blocks to customize business metadata – To make it simple to build your catalog with extensibility, Amazon DataZone introduces some foundational building blocks that can be expanded to your needs. Metadata form types and asset types can be used as templates for defining your assets. These types can be customized to add context and details that suit the requirements of a domain. In this release, Amazon DataZone provides some out-of-the-box metadata form types, such as the AWS Glue table form, the Amazon Redshift table form, and the Amazon Simple Storage Service (Amazon S3) object form, to support the out-of-the-box asset types such as AWS Glue tables and views, Amazon Redshift tables and views, and S3 objects.
    3. Catalog structured, unstructured, and custom assets – You can now catalog not only AWS Glue data catalogs or Amazon Redshift tables but also catalog custom assets using Amazon DataZone APIs. Cataloged assets can represent a consumable unit of asset that may include a table, a dashboard, an ML model, or a SQL code block that shows the query behind the dashboard. With custom assets, Amazon DataZone provides the ability to attach metadata form types to an asset type and then augment it with business context, including standardized business glossary terms for better consumption of those assets. In addition, for AWS Glue data catalogs and Amazon Redshift tables, you can use the Amazon DataZone data sources to bring the technical metadata of the datasets into the business data catalog in a managed fashion on a schedule. Assets also now support revisions, allowing users to identify changes to business and technical metadata.
    4. Automated business name generation – Enriching the ingested technical catalog with business context can be time-consuming, cumbersome, and error-prone. To make it simpler, we are introducing the first feature that brings generative artificial intelligence (AI) capabilities to Amazon DataZone to automate the generation of the business name and column names of an asset. Amazon DataZone recommends names to be added to the asset, and then delegates control to the producer to accept or reject those recommendations.
  3. Federated governance using data projects – Amazon DataZone data projects simplify access to AWS analytics by creating business use case-based groupings of users, data assets, and analytics tools. Data projects provide a space where project members can collaborate, exchange data, and share artifacts. Projects are secure so that only users who are added to the project can collaborate. With projects, Amazon DataZone decentralizes data ownership among teams depending on who owns the data and also federates access management to those owners when consumers request access to data. Core capabilities made available in projects include:
    1. Ownership and user management – In an organization, the roles and responsibilities made available to different personas vary. To customize defining what a user or group can do when working with Amazon DataZone entities, projects now also serve as a user management or roles mechanism. Every entity in Amazon DataZone, such as glossaries, metadata forms, and assets, is owned by projects.
    2. Projects and environments – Projects are now decoupled from infrastructure: project creation handles the setup of users as either project owners or contributors, and infrastructure resources are then set up separately as environments. Environments handle the infrastructure (for example, an AWS Glue database) needed for users to work with the data. This split enables the project to be the use case container, whereas environments give the flexibility to branch off into different infrastructure (for example, data lakes or data warehouses using Amazon Redshift). Administrators can determine what kind of infrastructure should be available for what kind of projects.
    3. Bring your own IAM role for subscription – You can now bring an existing IAM principal by registering it as a subscription target and get data access approval for that IAM user or role. With this mechanism, projects extend support for working with data in other AWS services because you can allow users to discover data, get the necessary approval, and access the data in a service the user has prior authorization to.
    4. Subscribe workflow with access management – The subscription workflow secures data between producers and consumers to verify that only the right data is accessed by the right users for the right purpose, enabling self-service data analytics. This capability also allows you to quickly audit who has access to your datasets for what business use case as well as monitor usage and costs across projects and lines of business. Access management for assets published in the catalog is managed using AWS Lake Formation or Amazon Redshift, and you will get notified (in the portal or in Amazon CloudWatch) if your subscription request was approved and granted. For data that is not managed by AWS Lake Formation or Amazon Redshift, you can manage the subscription approval in Amazon DataZone, complete the access grant workflow with custom logic using Amazon EventBridge events, and then report back to Amazon DataZone using the API once the grant is completed. This ensures that the consumer only interfaces with one service to discover, understand, and subscribe to the data needed for their analysis.
    5. Analytics tools – Out of the box, the Amazon DataZone portal provides integration with Amazon Athena query editor and Amazon Redshift query editor as tools to process the data. This integration provides seamless access to the query tools and enables the users to use data assets that were subscribed to within the project. This is accomplished using Amazon DataZone environments that can be deployed according to the resource configuration definitions in built-in blueprints.
  4. APIs – Amazon DataZone now has external APIs to work with the system programmatically, so you can add Amazon DataZone to your existing architecture. For example, you can use your data pipelines to catalog data in Amazon DataZone and enable consumers to search, find, subscribe to, and access that data seamlessly. In this release, Amazon DataZone introduces a new data model for the catalog. The catalog APIs support a type system–based model that allows you to define and manage the types of entities in the catalog. Using this type system model, users will have a flexible and scalable catalog that can represent different types of objects and associate metadata with the object (asset or column). Similarly, actions in the UI now have APIs that you can use if you want to work with Amazon DataZone programmatically.
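
To make the type system idea more concrete, here is a minimal, illustrative sketch in plain Python. It is not the Amazon DataZone API: the class and function names (`FormType`, `AssetType`, `create_asset`, `revise_asset`) and the dashboard example are hypothetical, chosen only to show how form types can act as templates that a custom asset must satisfy, and how a change to metadata can produce a new revision.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FormType:
    """Hypothetical stand-in for a metadata form type: a named field schema."""
    name: str
    required: frozenset
    optional: frozenset = frozenset()

    def validate(self, content: dict) -> None:
        keys = set(content)
        missing = set(self.required) - keys
        unknown = keys - set(self.required) - set(self.optional)
        if missing or unknown:
            raise ValueError(
                f"{self.name}: missing={sorted(missing)}, unknown={sorted(unknown)}"
            )


@dataclass(frozen=True)
class AssetType:
    """Hypothetical stand-in for an asset type: the forms its assets must carry."""
    name: str
    form_types: tuple


@dataclass(frozen=True)
class Asset:
    name: str
    asset_type: AssetType
    forms: dict  # form type name -> field values
    glossary_terms: tuple = ()
    revision: int = 1


def create_asset(name, asset_type, forms, glossary_terms=()):
    """Validate every form the asset type requires, then build revision 1."""
    for form_type in asset_type.form_types:
        if form_type.name not in forms:
            raise ValueError(f"{name}: form {form_type.name!r} is required")
        form_type.validate(forms[form_type.name])
    return Asset(name, asset_type, dict(forms), tuple(glossary_terms))


def revise_asset(asset, form_name, changes):
    """Return a new revision with one form updated; the old one is untouched."""
    by_name = {ft.name: ft for ft in asset.asset_type.form_types}
    updated = {**asset.forms[form_name], **changes}
    by_name[form_name].validate(updated)
    return replace(
        asset,
        forms={**asset.forms, form_name: updated},
        revision=asset.revision + 1,
    )


# Example: a custom "Dashboard" asset type described by one form type.
dashboard_form = FormType(
    "DashboardForm",
    required=frozenset({"url", "owner_team"}),
    optional=frozenset({"refresh_schedule"}),
)
dashboard_type = AssetType("Dashboard", (dashboard_form,))

v1 = create_asset(
    "Campaign performance",
    dashboard_type,
    {"DashboardForm": {"url": "https://example.com/d/1", "owner_team": "marketing"}},
    glossary_terms=("Campaign",),
)
v2 = revise_asset(v1, "DashboardForm", {"refresh_schedule": "daily"})
```

In this sketch, `v1` and `v2` coexist, which mirrors the idea that revisions let users see what changed in business or technical metadata. The real catalog APIs define their own request shapes, so treat this only as a mental model.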

Common customer use cases for Amazon DataZone

Let’s look at some use cases that our preview customers enabled with Amazon DataZone.

Use case 1: Data discoverability 

“Bristol Myers Squibb is actively pursuing an initiative to reduce the time it takes to discover and develop drugs by more than 30%. A key component of this strategy is addressing data sharing challenges and optimizing data availability. Engaging with AWS, we found that Amazon DataZone helped us create our data products, catalog them, and govern them, making our data more findable, accessible, interoperable, and reusable (FAIR). We’re currently assessing the broader applicability of Amazon DataZone within our enterprise framework to determine if it aligns with our operational goals.”

—David Y. Liu, Director, Research IT Solution Architecture, Bristol Myers Squibb

Use case 2: Share governed data for generative AI initiatives

“By harmonizing data across multiple business domains, we can foster a culture of data sharing. To this end, we have been using Amazon DataZone to free up our developers from building and maintaining a platform, allowing them to focus on tailored solutions. Utilizing an AWS managed service was important to us for several reasons—combining capabilities within the AWS ecosystem, quicker time to obtain business insights from data analysis, standardized data definitions, and leveraging the potential of generative AI. We look forward to our continued partnership with AWS to generate better outcomes for Guardant Health and the patients we serve. This is more than mere data; it’s our dynamic journey.”

—Rajesh Kucharlapati, Senior Director of Data, CRM and Analytics, Guardant Health

Use case 3: Federated data governance

“Being data-driven is one of our main corporate objectives, always guided by best practices in data governance, data privacy, and security. At Itaú, data is treated as one of our main assets; good data management and definition are core parts of our solutions, in every use of AWS analytics services. Together with the AWS team, we were able to experiment with Amazon DataZone in preview, proposing features aligned with our technological and business needs. One example is data by domain, a simplification of data governance processes and distribution of responsibilities among business units. With Amazon DataZone generally available to our contributors, we expect to be able to quickly and easily set up rules across domains for teams composed of data analysts, engineers, and scientists, fostering experimentation with data hypothesis across multiple business use cases, with simplified governance.”

—Priscila Cardoso Ferreira, Data Governance and Privacy Superintendent, Itaú Unibanco

Use case 4: Decentralized ownership

“At Holaluz, unifying data across our businesses while having distributed ownership with individual teams to share and govern their data are our key priorities. Our data is owned by different teams, and sharing has typically meant the central team has to grant access, which created a bottleneck in our processes. We needed a faster way to analyze data with decentralized ownership, where data access can be approved by the owning team. We have validated the use cases in Amazon DataZone preview and are looking forward to getting started when it is generally available to build a robust business data catalog. Our consumers will be able to find, subscribe, and publish back their newly created assets for others to discover and use, enabling a data flywheel.”

—Danny Obando, Lead Data Architect, Holaluz

Use case 5: Managed service versus do-it-yourself (DIY) platform

“At BTG Pactual, unifying data across our businesses and allowing for data sharing at scale while enforcing oversight is one of our key priorities. While we are building custom solutions to do this ourselves, we prefer having an AWS native service to enable these capabilities so we can focus our development efforts and resources on solving BTG Pactual’s specific governance challenges—rather than building and maintaining the platform. We have validated the use cases in Amazon DataZone preview and will use it to build a robust business data catalog and data sharing workflow. It will provide complete visibility into who is using what data for what purposes without adding additional workload or inhibiting the decentralized ownership we’ve established to make data discoverable and accessible to all our data users across the organization.”

—João Mota, Head of Data Platform, BTG Pactual

Solution walkthrough

Let’s take an example of how an organization can get started with Amazon DataZone. In this example, we build a unified environment for data producers and data consumers to access, share, and consume data in a governed manner.

Take a product marketing team that wants to drive a campaign on product adoption. To be successful in that campaign, they want to tap into the customer data in a data warehouse, click-stream data in the data lake, and performance data of other campaigns in applications like Salesforce. Roberto is a data engineer who knows this data very well. So, let’s see how Roberto will make this data discoverable to others in the organization.

The administrator for the company has already set up a domain called “Marketing” for the team to use. The administrator has also set up some resource templates called “Blueprints” to allow data people to set up environments to work with data, and has set up users who can sign in with their corporate credentials to the Amazon DataZone portal, a web application outside of the AWS Management Console. The administrator sets up all the AWS resources so the data people do not have to struggle with technical barriers.

So, let’s now get into the details of how Roberto is able to publish the data in the catalog.

  1. Roberto signs in to the Amazon DataZone portal using his corporate credentials.
  2. He creates a project and environment that he can use to publish data. He knows the data sources he wants to catalog, so he creates a connection to the AWS Glue Data Catalog that has all the click-stream data.
  3. He provides a name and description for the data source run and then selects the databases and the specific tables he wants to bring in.
  4. He chooses the automated metadata generation option to get ML-generated business names for the technical table and column names. He then schedules the run to keep the asset in sync with the source.
  5. Within a few minutes, metadata for the click-stream data and for the customer information from Amazon Redshift, such as table names, schema, and other source metadata, will be available in Amazon DataZone’s inventory, ready for curation.
  6. Roberto can now enrich the metadata to provide additional business context using glossary and metadata forms to make it simple for Veronica, a data analyst, and other data people to understand the data. Roberto can accept or reject the automatically generated recommendations to autocomplete the business-friendly names. He can also add descriptions, classification terms, and any other useful information to that particular asset.
  7. Once done, Roberto can publish the asset and make it available to data consumers in Amazon DataZone.
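
For readers who prefer to script the publishing steps above rather than click through the portal, the following is a hedged sketch of how the data source request might be assembled in Python. The field names follow our understanding of the `create_data_source` operation on the boto3 `datazone` client and should be checked against the current API reference before use. The helper name `build_glue_data_source_request`, all identifiers, the database name, and the cron expression are hypothetical placeholders.

```python
def build_glue_data_source_request(
    domain_id,
    project_id,
    environment_id,
    name,
    database,
    table_pattern="*",
    cron="cron(0 7 * * ? *)",  # assumed schedule format; verify in the docs
    description=None,
):
    """Assemble keyword arguments for a scheduled AWS Glue data source.

    Field names are assumptions based on the boto3 `datazone` client's
    `create_data_source` operation; nothing here calls AWS.
    """
    request = {
        "domainIdentifier": domain_id,
        "projectIdentifier": project_id,
        "environmentIdentifier": environment_id,
        "name": name,
        "type": "GLUE",
        "configuration": {
            "glueRunConfiguration": {
                "relationalFilterConfigurations": [
                    {
                        "databaseName": database,
                        "filterExpressions": [
                            {"type": "INCLUDE", "expression": table_pattern}
                        ],
                    }
                ]
            }
        },
        # Step 4: ask for generated business names for tables and columns.
        "recommendation": {"enableBusinessNameGeneration": True},
        # Step 4: keep the asset in sync with the source on a schedule.
        "schedule": {"schedule": cron, "timezone": "UTC"},
        # Steps 5-7: land assets in the inventory first so they can be
        # curated before being published.
        "publishOnImport": False,
        "enableSetting": "ENABLED",
    }
    if description:
        request["description"] = description
    return request


request = build_glue_data_source_request(
    domain_id="dzd_example",        # placeholder
    project_id="prj_example",       # placeholder
    environment_id="env_example",   # placeholder
    name="clickstream-glue",
    database="clickstream_db",      # placeholder database name
    description="Click-stream tables from the data lake",
)

# With AWS credentials configured, the request could then be sent with:
#   import boto3
#   boto3.client("datazone").create_data_source(**request)
```

Keeping `publishOnImport` off matches the walkthrough, where Roberto curates the imported metadata before he publishes the asset.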

Now, let’s take a look at how Veronica, the marketing analyst, can start discovering and working with the data.

  1. Now that the data is published and available in the catalog, Veronica can sign in to the Amazon DataZone portal using her corporate credentials and start searching for data. She types “click campaign” in the search, and all relevant assets are returned.
  2. She notices that the assets come from various sources and contexts. She uses filters to curate the search list using facets such as glossary terms and data sources and sorts results based on relevance and time.
  3. To start working with data, she will have to create a new project and an environment that provides the tools she needs. Creating the project provides a quick way for her to collaborate with her teammates and automatically provide them with the correct level of permissions to work with data and tools.
  4. Veronica finds the data she needs access to. She now requests access by clicking on Subscribe to inform the data publisher or owner that she needs access to the data. While subscribing, she also provides a reason why she needs access to that data.
  5. This sends a notification to Roberto and his project members that someone is looking for access, and they can review the request to accept or reject it. Roberto is signed in to the portal, sees the notification, and approves the request because the reason was very clear.
  6. With the approved subscription, Veronica also gets access to the data, because Amazon DataZone automatically grants it on Roberto’s behalf. Now Veronica and her team can start working on their analysis to find the right campaign to increase adoption.
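
The search, subscribe, and approve steps above can also be driven programmatically. The sketch below assumes the boto3 `datazone` client operations `search_listings`, `create_subscription_request`, and `accept_subscription_request`, with parameter and response field names as we understand them. Verify them against the current API reference. The helper function names and all identifiers are hypothetical, and the client is passed in as an argument so the functions can be exercised without AWS access.

```python
def find_listing_ids(client, domain_id, text):
    """Search the catalog and return the listing IDs of matching assets."""
    response = client.search_listings(domainIdentifier=domain_id, searchText=text)
    return [
        item["assetListing"]["listingId"]
        for item in response.get("items", [])
        if "assetListing" in item
    ]


def request_subscription(client, domain_id, listing_id, consumer_project_id, reason):
    """Ask the owning project for access on behalf of the consumer project."""
    response = client.create_subscription_request(
        domainIdentifier=domain_id,
        requestReason=reason,
        subscribedListings=[{"identifier": listing_id}],
        subscribedPrincipals=[{"project": {"identifier": consumer_project_id}}],
    )
    return response["id"]


def approve_subscription(client, domain_id, request_id, comment):
    """Approve a pending request as a member of the owning project."""
    response = client.accept_subscription_request(
        domainIdentifier=domain_id,
        identifier=request_id,
        decisionComment=comment,
    )
    return response["status"]


# Possible usage with real credentials (all IDs are placeholders):
#   import boto3
#   dz = boto3.client("datazone")
#   listing_ids = find_listing_ids(dz, "dzd_example", "click campaign")
#   request_id = request_subscription(
#       dz, "dzd_example", listing_ids[0], "prj_veronica",
#       "Campaign analysis to increase product adoption",
#   )
#   approve_subscription(dz, "dzd_example", request_id, "Reason is clear")
```

In practice, the request and the approval would be made by different users (Veronica and Roberto in this example), each with their own credentials.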

As a result, the entire lifecycle of data discovery, access, and usage happens through Amazon DataZone. You get complete visibility and control over how the data is being shared, who is using it, and who authorized it. Essentially, Amazon DataZone allows you to give members of your organization the freedom they always wanted, with the confidence of the right governance around it.

Here is a screenshot of the Amazon DataZone portal, where users log in to catalog, publish, discover, understand, and subscribe to the data needed for their analysis.

Conclusion

In this post, we discussed the challenges, core capabilities, and a few common use cases. With a sample scenario, we demonstrated how you can get started. Amazon DataZone is now generally available. For more information, see What’s New in Amazon DataZone or Amazon DataZone.

Check out the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available.


About the authors

Shikha Verma is Head of Product for Amazon DataZone at AWS.

Steve McPherson is a General Manager with Amazon DataZone at AWS.

Priya Tiruthani is a Senior Product Manager with Amazon DataZone at AWS.