Embracing event driven architecture to enhance resilience of data solutions built on Amazon SageMaker

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/embracing-event-driven-architecture-to-enhance-resilience-of-data-solutions-built-on-amazon-sagemaker/

Amazon Web Services (AWS) customers value business continuity while building modern data governance solutions. A resilient data solution helps maximize business continuity by minimizing solution downtime and making sure that critical information remains accessible to users. This post provides guidance on how you can use event-driven architecture to enhance the resiliency of data solutions built on the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI. SageMaker is a managed service with high availability and durability. If you want to build a backup and recovery system on your end, this post shows you how. It provides three design principles to improve the data solution resiliency of your organization and guidance to formulate a robust disaster recovery strategy based on event-driven architecture. It also contains code samples to back up the system metadata of your data solution built on SageMaker, enabling disaster recovery.

The AWS Well-Architected Framework defines resilience as the ability of a system to recover from infrastructure or service disruptions. You can enhance the resiliency of your data solution by adopting three design principles that are highlighted in this post and by establishing a robust disaster recovery strategy. Recovery point objective (RPO) and recovery time objective (RTO) are industry standard metrics to measure the resilience of a system. RPO indicates how much data loss your organization can accept in case of solution failure. RTO refers to the time for the solution to recover after failure. You can measure these metrics in seconds, minutes, hours, or days. The next section discusses how you can align your data solution resiliency strategy to meet the needs of your organization.
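To make the RPO definition concrete, the following is a minimal sketch that checks whether a given backup interval can satisfy an RPO target. The class and function names are hypothetical and are not part of the code sample in this post. The sketch assumes that a failure just before the next backup loses one full interval of changes, plus any replication lag.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for the two metrics discussed above."""
    rpo: timedelta  # maximum acceptable data loss
    rto: timedelta  # maximum acceptable time to recover


def worst_case_data_loss(backup_interval: timedelta,
                         replication_lag: timedelta = timedelta(0)) -> timedelta:
    # Assumption: a failure right before the next backup loses one full
    # interval of changes, plus whatever has not yet been replicated.
    return backup_interval + replication_lag


def meets_rpo(objectives: RecoveryObjectives,
              backup_interval: timedelta,
              replication_lag: timedelta = timedelta(0)) -> bool:
    return worst_case_data_loss(backup_interval, replication_lag) <= objectives.rpo
```

For example, with an RPO of one hour, a 30-minute backup interval passes the check and a two-hour interval does not. RTO isn't evaluated here because it depends on how long the restore procedure takes in practice.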

Formulating a strategy to enhance data solution resilience

To develop a robust resiliency strategy for your data solution built on SageMaker, start with how users interact with the data solution. The user interaction influences the data solution architecture and the degree of automation, and it determines your resiliency strategy. Here are a few aspects you might consider while designing the resiliency of your data solution.

  • Data solution architecture – The data solution of your organization might follow a centralized, decentralized, or hybrid architecture. This architecture pattern reflects the distribution of responsibilities of the data solution based on the data strategy of your organization. This shift in responsibilities is reflected in the structure of the teams that perform activities in the Amazon DataZone data portal, SageMaker Unified Studio portal, AWS Management Console, and underlying infrastructure. Examples of such activities include configuring and running the data sources, publishing data assets in the data catalog, subscribing to data assets, and assigning members to projects.
  • User persona – The user persona, along with the users' data and cloud maturity, influences their preferences for interacting with the data solution. The users of a data governance solution fall into two categories: business users and technical users. Business users of your organization might include data owners, data stewards, and data analysts. They might find the Amazon DataZone data portal and SageMaker Unified Studio portal more convenient for tasks such as approving or rejecting subscription requests and performing one-time queries. Technical users such as data solution administrators, data engineers, and data scientists might opt for automation when making system changes. Examples of such activities include publishing data assets and managing glossaries and metadata forms in the Amazon DataZone data portal or in the SageMaker Unified Studio portal. A robust resiliency strategy accounts for tasks performed by both user groups.
  • Empowerment of self-service – The data strategy of your organization determines autonomy granted to the users. Increased user autonomy demands a high level of abstraction of the cloud infrastructure powering the data solution. SageMaker empowers self-service by enabling users to perform regular data management activities in the Amazon DataZone data portal and in the SageMaker Unified Studio portal. The level of self-service maturity of the data solution depends on the data strategy and user maturity of your organization. At an early stage, you might limit the self-service features to the use cases for onboarding the data solution. As the data solution scales, consider increasing the self-service capabilities. See Data Mesh Strategy Framework to learn about the different phases of a data mesh-based data solution.

Adopt the following design principles to enhance the resiliency of your data solution:

  • Choose serverless services – Use serverless AWS services to build your data solution. Serverless services scale automatically with increasing system load, provide fault isolation, and have built-in high availability. Serverless services minimize the need for infrastructure management, reducing the need to design resiliency into the infrastructure. SageMaker seamlessly integrates with several serverless services such as Amazon Simple Storage Service (Amazon S3), AWS Glue, AWS Lake Formation, and Amazon Athena.
  • Document system metadata – Document the system metadata of your data solution using infrastructure-as-code (IaC) and automation. Consider how users interact with the data solution. If the users prefer to perform certain activities through the Amazon DataZone data portal and SageMaker Unified Studio portal, implement automation to capture and store the metadata that’s relevant for disaster recovery. Use Amazon Relational Database Service (Amazon RDS) and Amazon DynamoDB to store the system metadata of your data solution.
  • Monitor system health – Implement a monitoring and alerting solution for your data solution so that you can respond to service interruptions and initiate the recovery process. Make sure that system activities are logged so that you can troubleshoot the system interruption. Amazon CloudWatch helps you monitor AWS resources and the applications you run on AWS in real time.
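As an illustration of the monitoring principle, here is a minimal sketch that builds the parameters for a CloudWatch alarm on the Errors metric of a Lambda function. The function names, the alarm name, and the Amazon SNS topic are assumptions for illustration and do not come from the code sample in this post. The client is passed in, so the sketch runs without AWS access.

```python
def lambda_error_alarm(function_name: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,               # evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not a failure
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch_client, function_name: str, sns_topic_arn: str) -> None:
    # cloudwatch_client is expected to behave like boto3.client("cloudwatch").
    cloudwatch_client.put_metric_alarm(**lambda_error_alarm(function_name, sns_topic_arn))
```

In a real deployment, you might pass boto3.client("cloudwatch") as the client and the name of the backup Lambda function. Defining the alarm in IaC alongside the function is usually preferable to creating it imperatively.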

The next section presents disaster recovery strategies to recover your data solution built on SageMaker.

Disaster recovery strategies

Disaster recovery focuses on one-time recovery objectives in response to natural disasters, large-scale technical failures, or human threats such as attack or error. Disaster recovery is a crucial part of your business continuity plan. As shown in the following figure, AWS offers the following options for disaster recovery: Backup and restore, pilot light, warm standby, and multi-site active/active.

The business continuity requirements and cost of recovery should guide your organization’s disaster recovery strategy. As a general guideline, the recovery cost of your data solution increases with reduced RPO and RTO requirements. The next section provides architecture patterns to implement a robust backup and recovery solution for a data solution built on SageMaker.

Solution overview

This section provides event-driven architecture patterns following the backup and restore approach to enhance resiliency of your data solution. This active/passive strategy-based solution stores the system metadata in a DynamoDB table. You can use the system metadata to restore your data solution. The following architecture patterns provide regional resilience. You can simplify the architecture of this solution to restore data in a single AWS Region.

Pattern 1: Point-in-time backup

The point-in-time backup captures and stores system metadata of a data solution built on SageMaker when a user or an automation performs an action. In this pattern, a user activity or an automation initiates an event that captures the system metadata. This pattern is suited for low RPO requirements, ranging from seconds to minutes. The following architecture diagram shows the solution for the point-in-time backup process.

Architecture point-in-time-backup

The steps comprise the following.

  1. A user or an automation performs an activity on an Amazon DataZone domain or a SageMaker Unified Studio domain.
  2. This activity creates a new event in AWS CloudTrail.
  3. The CloudTrail event is sent to Amazon EventBridge. Alternatively, you can use Amazon DataZone as the event source for the EventBridge rule.
  4. AWS Lambda transforms and stores this event in a DynamoDB global table in the primary Region where the Amazon DataZone domain is hosted.
  5. The information is replicated into the replica DynamoDB table in a secondary Region. The replica DynamoDB table can be used to restore the data solution based on SageMaker in the secondary Region.
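The steps above can be sketched as a small Lambda handler. This is an illustrative sketch, not the code from the sample repository: the item attribute names, the make_handler helper, and the EventBridge rule pattern are assumptions. The sketch assumes that EventBridge delivers the CloudTrail record in the detail field of the event.

```python
import json

# Assumed EventBridge rule pattern for Amazon DataZone API calls recorded by CloudTrail.
EVENT_PATTERN = {
    "source": ["aws.datazone"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventSource": ["datazone.amazonaws.com"]},
}


def to_backup_item(event: dict) -> dict:
    """Transform an EventBridge-delivered CloudTrail event into a flat item."""
    detail = event["detail"]
    return {
        "eventId": detail["eventID"],        # hypothetical partition key
        "eventTime": detail["eventTime"],
        "eventName": detail["eventName"],
        "awsRegion": detail["awsRegion"],
        # Store the request as a JSON string so that it can be replayed later.
        "requestParameters": json.dumps(detail.get("requestParameters") or {}, sort_keys=True),
    }


def make_handler(table):
    """Return a Lambda-style handler that writes to a DynamoDB Table-like object."""
    def handler(event, context):
        item = to_backup_item(event)
        table.put_item(Item=item)
        return {"stored": item["eventId"]}
    return handler
```

In a Lambda function, table could be boto3.resource("dynamodb").Table(name) for the global table in the primary Region. Replication to the secondary Region is handled by DynamoDB, so the handler writes only once.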

Pattern 2: Scheduled backup

The scheduled backup captures and stores system metadata of a data solution built on SageMaker at regular intervals. In this pattern, an event is initiated based on a defined time schedule. This pattern is suited for RPO requirements in the order of hours. The following architecture diagram displays the solution for the scheduled backup process.

The steps comprise the following.

  1. EventBridge triggers an event at a regular interval and sends this event to AWS Step Functions.
  2. The Step Functions state machine contains multiple Lambda functions. These Lambda functions get the system metadata from either a SageMaker Unified Studio domain or an Amazon DataZone domain.
  3. The system metadata is stored in a DynamoDB global table in the primary Region where the Amazon DataZone domain is hosted.
  4. The information is replicated into the replica DynamoDB table in a secondary Region. The data solution can be restored in the secondary Region using the replica DynamoDB table.
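As a rough sketch of what one of the Lambda functions in step 2 might do, the following code pages through the Amazon DataZone Search API to collect asset metadata and writes one item per asset. The function names and item attribute names are hypothetical, and the actual sample might structure this differently. The clients are passed in, so the sketch can run against stand-ins.

```python
def list_assets(datazone_client, domain_id: str, project_id: str, page_size: int = 50) -> list:
    """Collect all assets owned by a project, following nextToken pagination."""
    assets = []
    next_token = None
    while True:
        kwargs = {
            "domainIdentifier": domain_id,
            "owningProjectIdentifier": project_id,
            "searchScope": "ASSET",
            "maxResults": page_size,
        }
        if next_token:
            kwargs["nextToken"] = next_token
        page = datazone_client.search(**kwargs)
        for entry in page.get("items", []):
            asset = entry.get("assetItem")
            if asset:
                assets.append(asset)
        next_token = page.get("nextToken")
        if not next_token:
            return assets


def backup_assets(datazone_client, table, domain_id: str, project_id: str) -> int:
    """Write one item per asset to a DynamoDB Table-like object; return the count."""
    assets = list_assets(datazone_client, domain_id, project_id)
    for asset in assets:
        table.put_item(Item={
            "assetId": asset["identifier"],  # hypothetical partition key
            "name": asset.get("name", ""),
            "typeIdentifier": asset.get("typeIdentifier", ""),
            "projectId": project_id,
        })
    return len(assets)
```

Here, datazone_client is expected to behave like boto3.client("datazone"), which needs the datazone:Search permission.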

The next section provides step-by-step instructions to deploy a code sample that implements the scheduled backup pattern. This code sample stores asset information of a data solution built on a SageMaker Unified Studio domain and an Amazon DataZone domain in a DynamoDB global table. The data in the DynamoDB table is encrypted at rest using a customer managed key stored in AWS Key Management Service (AWS KMS). A multi-Region replica key encrypts the data in the secondary Region. The asset uses the data lake blueprint that contains the definition for launching and configuring a set of services (AWS Glue, Lake Formation, and Athena) to publish and use data lake assets in the business data catalog. The code sample uses the AWS Cloud Development Kit (AWS CDK) to deploy the cloud infrastructure.

Prerequisites

  • An active AWS account.
  • AWS administrator credentials for the central governance account in your development environment
  • AWS Command Line Interface (AWS CLI) installed to manage your AWS services from the command line (recommended)
  • Node.js and Node Package Manager (npm) installed to manage AWS CDK applications
  • AWS CDK Toolkit installed globally in your development environment by using npm, to synthesize and deploy AWS CDK applications
npm install -g aws-cdk
  • TypeScript compiler installed in your development environment or installed globally by using npm:
npm install -g typescript
  • Docker installed in your development environment (recommended)
  • An integrated development environment (IDE) or text editor with support for Python and TypeScript (recommended)

Walkthrough for data solutions built on a SageMaker Unified Studio domain

This section provides step-by-step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on a SageMaker Unified Studio domain.

Set up SageMaker Unified Studio

  1. Sign into the IAM console. Create an IAM role that trusts Lambda with the following policy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "datazone:Search",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "dynamodb:PutItem"
            ],
            "Resource": "arn:aws:dynamodb:<AWS_REGION>:<AWS_ACCOUNT>:table/*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:GenerateDataKey",
                "kms:ReEncrypt*",
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:<AWS_REGION>:<AWS_ACCOUNT>:key/<KMS_KEY_ID>"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*:log-stream:*",
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*"
            ]
        }
    ]
}
  1. Note down the Amazon Resource Name (ARN) of the Lambda role. Navigate to SageMaker and choose Create a Unified Studio domain.
  2. Select Quick setup and expand the Quick setup settings section. Enter a domain name, for example, CORP-DEV-SMUS. Select the Virtual private cloud (VPC) and Subnets. Choose Continue.
  3. Enter the email address of the SageMaker Unified Studio user in the Create IAM Identity Center user section. Choose Create domain.
  4. After the domain is created, choose Open unified studio in the top right corner. Screenshot open-smus
  5. Sign in to SageMaker Unified Studio using the single sign-on (SSO) credentials of your user. Choose Create project at the top right corner. Enter a project name and description, choose Continue twice, and choose Create project. Wait until the project creation is complete. Screenshot create-smus-project
  6. After the project is created, go into the project by selecting the project name. Select Query Editor from the Build drop-down menu on the top left. Paste the following create table as select (CTAS) query script in the query editor window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.
CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561
Screenshot create-smus-asset
  1. Navigate to Data sources from the Project. Choose Run in the Actions section next to the project.default_lakehouse connection. Wait until the run is complete. Screenshot run-smus-data-source
  2. Navigate to Assets in the left side bar. Select the mkt_sls_table in the Inventory section and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata. Screenshot smus-assets
  3. Choose Publish Asset to publish the mkt_sls_table table to the business data catalog, making it discoverable and understandable across your organization.
  4. Choose Members in the navigation pane. Choose Add members and select the IAM role you created in Step 1. Add the role as a Contributor in the project.
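The role from step 1 needs a trust policy that allows Lambda to assume it. The following sketch builds that trust policy document. The helper names and the role name parameter are hypothetical, and the console steps above remain the documented path. The permissions policy shown in step 1 is attached separately.

```python
import json


def lambda_trust_policy() -> dict:
    """Trust policy that lets the Lambda service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_lambda_role(iam_client, role_name: str) -> None:
    # iam_client is expected to behave like boto3.client("iam").
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(lambda_trust_policy()),
    )


if __name__ == "__main__":
    # Print the document so that it can be pasted into the IAM console.
    print(json.dumps(lambda_trust_policy(), indent=4))
```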

Deployment steps

After setting up SageMaker Unified Studio, use the AWS CDK stack provided on GitHub to deploy the solution to back up the asset metadata that is created in the previous section.

  1. Clone the repository from GitHub to your preferred integrated development environment (IDE) using the following commands.
git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd sample-event-driven-resilience-data-solutions-sagemaker
  1. Export the AWS credentials for the IAM role with administrative permissions and the primary Region to your development environment, using the following format.
export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use AWS Secrets Manager or AWS Systems Manager Parameter Store to manage credentials. Automate the deployment process using a continuous integration and delivery (CI/CD) pipeline.

  1. Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command.
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd unified-studio
  1. Modify the following parameters in the config/Config.ts file.
SMUS_APPLICATION_NAME – Name of the application.
SMUS_SECONDARY_REGION – Secondary AWS region for backup.
SMUS_BACKUP_INTERVAL_MINUTES – Interval between backups, in minutes.
SMUS_STAGE_NAME – Name of the stage. 
SMUS_DOMAIN_ID – Domain identifier of the Amazon SageMaker Unified Studio. 
SMUS_PROJECT_ID – Project identifier of the Amazon SageMaker Unified Studio. 
SMUS_ASSETS_REGISTRAR_ROLE_ARN – ARN of the AWS Lambda role created in step 1 of the preceding section. 
  1. Install the dependencies by running the following command:

npm install

  1. Synthesize the CloudFormation template by running the following command.

cdk synth

  1. Deploy the solution by running the following command.

cdk deploy --all

  1. After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.

When deployment is complete, wait for the duration of SMUS_BACKUP_INTERVAL_MINUTES. Navigate to the <SMUS_APPLICATION_NAME>AssetsInfo DynamoDB table. Retrieve the data from the DynamoDB table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region. Screenshot smus-dynamodb
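The verification in both Regions can also be scripted. The following sketch compares item counts between the primary table and its replica. The function names are hypothetical, and the clients are passed in so that the sketch runs without AWS access. A count comparison is a coarse check, and because global table replication is asynchronous, the counts can differ briefly after a backup run.

```python
def count_items(dynamodb_client, table_name: str) -> int:
    """Count items with paginated Scan calls that return only counts."""
    total = 0
    kwargs = {"TableName": table_name, "Select": "COUNT"}
    while True:
        page = dynamodb_client.scan(**kwargs)
        total += page["Count"]
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return total
        # Continue the scan from where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def replicas_in_sync(primary_client, secondary_client, table_name: str) -> bool:
    return count_items(primary_client, table_name) == count_items(secondary_client, table_name)
```

The two clients could be created with boto3.client("dynamodb", region_name=...), one for each Region. A scan reads the whole table, so this check is best suited to small metadata tables such as the one in this walkthrough.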

Clean up

Use the following steps to clean up the resources deployed.

  1. Empty the S3 buckets that were created as part of this deployment.
  2. In your local development environment (Linux or macOS):
  • Navigate to the unified-studio directory of your repository.
  • Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
  • To destroy the cloud resources, run the following command:

cdk destroy --all

  1. Go to the SageMaker Unified Studio and delete the published data assets that were created in the project.
  2. Use the console to delete the SageMaker Unified Studio domain.

Walkthrough for data solutions built on an Amazon DataZone domain

This section provides step-by-step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on an Amazon DataZone domain.

Deployment steps

After completing the prerequisites, use the AWS CDK stack provided on GitHub to deploy the solution to back up the system metadata of the data solution built on an Amazon DataZone domain.

  1. Clone the repository from GitHub to your preferred IDE using the following commands.
git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd sample-event-driven-resilience-data-solutions-sagemaker
  1. Export the AWS credentials for the AWS Identity and Access Management (IAM) role with administrative permissions and the primary Region information to your development environment, using the following format:
export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use Secrets Manager or Systems Manager Parameter Store to manage credentials. Automate the deployment process using a CI/CD pipeline.

  1. Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command:
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd datazone
  1. From the console for IAM, note the Amazon Resource Name (ARN) of the CDK execution role. Update the trust relationship of the IAM role so that Lambda can assume the role.
  1. Modify the following parameters in the config/Config.ts file.
DZ_APPLICATION_NAME – Name of the application.
DZ_SECONDARY_REGION – Secondary Region for backup.
DZ_BACKUP_INTERVAL_MINUTES – Interval between backups, in minutes.
DZ_STAGE_NAME – Name of the stage (dev, qa, or prod).
DZ_DOMAIN_NAME – Name of the Amazon DataZone domain
DZ_DOMAIN_DESCRIPTION – Description of the Amazon DataZone domain
DZ_DOMAIN_TAG – Tag of the Amazon DataZone domain
DZ_PROJECT_NAME – Name of the Amazon DataZone project
DZ_PROJECT_DESCRIPTION – Description of the Amazon DataZone project
CDK_EXEC_ROLE_ARN – ARN of the CDK execution role
DZ_ADMIN_ROLE_ARN – ARN of the administrator role
  1. Install the dependencies by running the following command:

npm install

  1. Synthesize the AWS CloudFormation template by running the following command:

cdk synth

  1. Deploy the solution by running the following command:

cdk deploy --all

  1. After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.
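The DZ_BACKUP_INTERVAL_MINUTES parameter controls how often the scheduled backup runs. One plausible way to turn such a value into an EventBridge schedule is a rate expression, sketched below. Whether the sample builds its schedule this way is an assumption, and the function name is hypothetical. EventBridge rate expressions use the singular unit for a value of 1 and the plural unit otherwise.

```python
def rate_expression(interval_minutes: int) -> str:
    """Build an EventBridge rate expression from an interval in minutes."""
    if interval_minutes < 1:
        raise ValueError("The backup interval must be at least 1 minute.")
    # rate(1 minute) is valid; rate(1 minutes) is not.
    unit = "minute" if interval_minutes == 1 else "minutes"
    return f"rate({interval_minutes} {unit})"
```

For example, an interval of 30 yields rate(30 minutes). Choose the interval so that it fits within the RPO of your organization.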

Document system metadata

This section provides instructions to create an asset and demonstrates how you can retrieve the metadata of the asset. Perform the following steps to retrieve the system metadata.

  1. Sign in to the Amazon DataZone data portal from the console. Select the project and choose Query data at the upper right.

Screenshot datazone-open-query

  1. Choose Open Athena and make sure that <DZ_PROJECT_NAME>_DataLakeEnvironment is selected in the Amazon DataZone environment dropdown at the upper right and that, on the left, <DZ_PROJECT_NAME>_datalakeenvironment_pub_db is selected as the Database.
  2. Create a new AWS Glue table for publishing to Amazon DataZone. Paste the following create table as select (CTAS) query script in the Query window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.
CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561
Screenshot datazone-run-query
  1. Go to the Tables and Views section and verify that the mkt_sls_table table was successfully created.
  2. In the Amazon DataZone data portal, go to Data sources, select the <DZ_PROJECT_NAME>-DataLakeEnvironment-default-datasource, and choose Run. The mkt_sls_table will be listed in the inventory and available to publish. Screenshot run-data-source
  3. Select the mkt_sls_table table and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata. Screenshot publish-data-asset
  4. Choose Publish Asset and the mkt_sls_table table will be published to the business data catalog, making it discoverable and understandable across your organization.
  5. After the table is published, wait for the duration of DZ_BACKUP_INTERVAL_MINUTES. Navigate to the <DZ_APPLICATION_NAME>AssetsInfo DynamoDB table and retrieve the data from the table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region. Screenshot datazone-dynamodb

Clean up

Use the following steps to clean up the resources deployed.

  1. Empty the Amazon Simple Storage Service (Amazon S3) buckets that were created as part of this deployment.
  2. Go to the Amazon DataZone domain portal and delete the published data assets that were created in the Amazon DataZone project.
  3. In your local development environment (Linux or macOS):
  • Navigate to the datazone directory of your repository.
  • Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
  • To destroy the cloud resources, run the following command:

cdk destroy --all

Conclusion

This post explores how to build a resilient data governance solution on Amazon SageMaker. Resilient design principles and a robust disaster recovery strategy are central to the business continuity of AWS customers. The code samples included in this post implement a backup process of the data solution at regular time intervals. They store the Amazon SageMaker asset information in Amazon DynamoDB global tables. You can extend the backup solution by identifying the system metadata that is relevant for the data solution of your organization and by using Amazon SageMaker APIs to capture and store the metadata. The DynamoDB global table asynchronously replicates the changes in the DynamoDB table in the primary Region to the secondary Region. Consider implementing an additional layer of resiliency by using AWS Backup to back up the DynamoDB table at regular intervals. In the next post, we show how you can use the system metadata to restore your data solution in the secondary Region.

Adopt the resiliency features offered by Amazon DataZone and Amazon SageMaker Unified Studio. Use AWS Resilience Hub to assess the resilience of your data solution. AWS Resilience Hub helps you to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the AWS Well-Architected Framework.

To build a data mesh-based data solution using an Amazon DataZone domain, see our GitHub repository. This open source project provides a step-by-step blueprint for constructing a data mesh architecture using the powerful capabilities of Amazon SageMaker, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.


About the authors

Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data governance, and artificial intelligence at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure cloud solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

How Volkswagen Autoeuropa built a data solution with a robust governance framework, simplifying access to quality data using Amazon DataZone

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/how-volkswagen-autoeuropa-built-a-data-solution-with-a-robust-governance-framework-simplifying-access-to-quality-data-using-amazon-datazone/

This is a joint post co-authored with Martin Mikoleizig from Volkswagen Autoeuropa.

This is the second post of a two-part series that details how Volkswagen Autoeuropa, a Volkswagen Group plant, together with AWS, built a data solution with a robust governance framework using Amazon DataZone to become a data-driven factory. Part 1 of this series focused on the customer challenges, overall solution architecture and solution features, and how they helped Volkswagen Autoeuropa overcome their challenges. This post dives into the technical details, highlighting the robust data governance framework that enables ease of access to quality data using Amazon DataZone.

At Amazon, we work backward, a systematic way to vet ideas and create new products. The key tenet of this approach is to start by defining the customer experience, then iteratively work backward from that point until the team achieves clarity of thought around what to build. The first section of this post discusses how we aligned the technical design of the data solution with the data strategy of Volkswagen Autoeuropa. Next, we detail the governance guardrails of the Volkswagen Autoeuropa data solution. Finally, we highlight the key business outcomes.

Aligning the solution with the data strategy

At an early stage of the project, the Volkswagen Autoeuropa and AWS team identified that a data mesh architecture for the data solution aligns with Volkswagen Autoeuropa’s vision of becoming a data-driven factory. With this in mind, the team implemented the following steps:

  • Define data domains – In a workshop, the team identified the data landscape and its distribution in Volkswagen Autoeuropa. Next, the team grouped the data assets of the organization along the lines of business and defined the data domains. Because Volkswagen Autoeuropa is at an early stage of their data mesh journey, defining data domains along the lines of business is the recommended approach. As the data solution evolves, Volkswagen Autoeuropa might consider other criteria such as business subdomains to define data domains. The team defined more than five data domains, such as production, quality, logistics, planning, and finance.
  • Identify pioneer cases – The team identified the pioneer use cases that onboard the data solution first, to validate its business value. The team identified two use cases. The first use case helps predict test results during the car assembly process. The second use case enables the creation of reports containing shop floor key metrics for different management levels. The following criteria were considered to identify these use cases:
    • Use cases that deliver measurable business value for Volkswagen Autoeuropa.
    • Use cases with high AWS maturity.
    • Use cases whose requirements can be met with the first release version of the data solution.
  • Onboard key data products – The team identified the key data products that enabled these two use cases and aligned to onboard them into the data solution. These data products belonged to data domains such as production, finance, and logistics. In addition, the team aligned on business metadata attributes that would help with data discovery. The data products are classified as either source-based data or consumer-based data. Source-based data is the unaltered, raw data that is generated from source systems (for example, quality data, safety data) and is useful for other business use cases. Consumer-based data is the aggregated and transformed data from source systems. Reuse of consumer-based data saves cost in extract, transform, and load (ETL) implementation and system maintenance.

In addition to the preceding steps, the team established a data quality framework to improve the quality of the data products registered in the data solution. The following table shows the mapping of the data mesh-based solution components to Amazon DataZone and AWS Glue features. The table also provides generic examples of the components in the automotive industry.

Data solution components | AWS service features | Generic examples
Data domains | Amazon DataZone projects and Amazon DataZone domain units | Production, logistics
Use cases | Amazon DataZone projects | Smart manufacturing, predictive maintenance
Data products | Amazon DataZone assets | Sales data, sensor data
Business metadata | Amazon DataZone glossaries and metadata forms | Data product owner information, data refresh frequency
Data quality framework | AWS Glue Data Quality | A quality score of 92%

Empowering teams with a governance framework

This section discusses the governance framework that was put in place to empower the teams at Volkswagen Autoeuropa by enhancing their analytics journey. It highlights the guardrails that enable ease of access to quality data.

Business metadata

Business metadata helps users understand the context of the data, which can lead to increased trust in the data. Moreover, establishing a common set of attributes of the data products promotes a consistent experience for the users. In addition to the business context, at Volkswagen Autoeuropa, the metadata includes information related to data classification and if the data contains personally identifiable information (PII). The data solution uses Amazon DataZone glossaries and metadata forms to provide business context to their data. Apart from the previous benefits, using the appropriate keywords in Amazon DataZone glossary terms and metadata forms can help with the search and filtering capability of data products in the Amazon DataZone data portal.

Data quality framework

The data quality framework is a comprehensive solution designed to streamline the process of data quality checks and publishing a quality score. It uses AWS Glue Data Quality to generate recommendation rulesets, run orchestrated jobs, store results, and send notifications. This framework can be seamlessly integrated into an AWS Glue job, providing a quality score for data pipeline jobs. The quality score of a data product is published in the Amazon DataZone data portal for consumers to evaluate. The key components of the solution are as follows:

  • Recommendation ruleset generation – The framework generates tailored rulesets based on metadata from the AWS Glue Data Catalog table, providing relevant and comprehensive quality checks.
  • Orchestrated job execution – Jobs are run in AWS Step Functions to perform data quality checks using the generated rulesets against data sources, evaluating data quality based on defined rules and criteria.
  • Result storage and notification – Results, including quality scores, quality status, and rulesets checked, are stored in an Amazon Simple Storage Service (Amazon S3) bucket, maintaining a historical record. End-users receive notifications with relevant details.
  • Data quality score publishing – The quality scores are published in the Amazon DataZone data portal, enabling consumers to access and evaluate data quality.
  • Subscription and quality score requirements – Consumers can subscribe to data sources or targets based on their desired quality score thresholds, making sure they receive data that meets their specific needs and standards.
  • Integration and extensibility – The framework is designed for seamless integration into existing AWS Glue jobs or data pipelines and provides a flexible and extensible architecture for customization and enhancement.
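
To make the first and last parts of this list more concrete, the following is a minimal sketch, not the framework used at Volkswagen Autoeuropa. It builds a small ruleset in Data Quality Definition Language (DQDL) from column metadata and then derives a quality score from rule outcomes. The Column class, the column names, and the assumption that the score is the percentage of passed rules are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """Hypothetical column description, as it might be read from a catalog table."""
    name: str
    nullable: bool = True
    unique: bool = False


def build_ruleset(columns):
    """Build a DQDL ruleset string from simple column metadata."""
    rules = ["RowCount > 0"]
    for col in columns:
        if not col.nullable:
            rules.append(f'IsComplete "{col.name}"')
        if col.unique:
            rules.append(f'IsUnique "{col.name}"')
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


def quality_score(rule_outcomes):
    """Return the share of passed rules as a percentage (assumed scoring model)."""
    if not rule_outcomes:
        return 0.0
    passed = sum(1 for ok in rule_outcomes.values() if ok)
    return round(100.0 * passed / len(rule_outcomes), 1)


def meets_threshold(score, minimum):
    """Check a score against a consumer's desired minimum quality score."""
    return score >= minimum


# Illustrative columns; the names are made up for this example.
columns = [
    Column("vin", nullable=False, unique=True),
    Column("station_id", nullable=False),
    Column("comment"),
]
print(build_ruleset(columns))

# Pretend evaluation results: three of the four generated rules passed.
outcomes = {
    "RowCount > 0": True,
    'IsComplete "vin"': True,
    'IsUnique "vin"': True,
    'IsComplete "station_id"': False,
}
score = quality_score(outcomes)
print(score, meets_threshold(score, 70))
```

In a real pipeline, the ruleset would be evaluated by AWS Glue Data Quality rather than by hand, and the resulting score would be published to the Amazon DataZone data portal.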

Federated governance

Federated governance empowers producer and consumer teams to operate independently while adhering to a central governance model. For the data solution at Volkswagen Autoeuropa, this meant a centralized team defined the governance guardrails and decentralized data teams employed those guardrails. The following are a few examples of how the team established federated governance in Volkswagen Autoeuropa:

  • Management of Amazon DataZone glossaries and metadata forms – In this mechanism, the Volkswagen Autoeuropa IT team defined the Amazon DataZone glossaries and metadata forms centrally. The data teams used them to publish the data assets in Amazon DataZone. This provides consistency of business metadata across the organization. The following figure explains the process.
    The workflow in the Amazon DataZone data portal consists of the following steps:
    1. The data solution administrator belonging to the Volkswagen Autoeuropa IT team aligns with stakeholders such as data producers, data consumers, and source system owners, and maintains the business metadata using the Amazon DataZone glossaries and metadata forms.
    2. The producer project teams use the Amazon DataZone glossary terms and fill the Amazon DataZone metadata forms to enrich the inventory assets.
    3. After the business metadata is populated, the team publishes the assets in the Amazon DataZone data portal.
  • Management of Amazon DataZone project membership – In this scenario, the management of Amazon DataZone project membership is delegated to a designated administrator of the project. The following figure explains the process.
    The workflow consists of the following steps:
    1. The data solution administrator belonging to the Volkswagen Autoeuropa IT team provisions the Amazon DataZone project and environment using automation. The data solution administrator is the owner of the project.
    2. The data solution administrator delegates the management of the Amazon DataZone project membership to a designated administrator by assigning the owner role.
    3. The Amazon DataZone project administrator assigns the contributor role to eligible users.
    4. The users access the Amazon DataZone project and its assets from the Amazon DataZone data portal.
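
The membership delegation in steps 2 and 3 could be scripted along the following lines. This is a sketch under assumptions: it expects a boto3 Amazon DataZone client to be passed in and uses its create_project_membership call with the PROJECT_OWNER and PROJECT_CONTRIBUTOR designations. The domain, project, and user identifiers are placeholders, and DryRunClient is a stand-in so the example runs without AWS credentials.

```python
def add_project_member(datazone, domain_id, project_id, user_id, designation):
    """Add one user to an Amazon DataZone project with the given designation."""
    if designation not in ("PROJECT_OWNER", "PROJECT_CONTRIBUTOR"):
        raise ValueError(f"unsupported designation: {designation}")
    return datazone.create_project_membership(
        domainIdentifier=domain_id,
        projectIdentifier=project_id,
        member={"userIdentifier": user_id},
        designation=designation,
    )


class DryRunClient:
    """Stand-in that records calls instead of contacting AWS (illustration only)."""

    def __init__(self):
        self.calls = []

    def create_project_membership(self, **kwargs):
        self.calls.append(kwargs)
        return {}


# In a real account, replace DryRunClient() with boto3.client("datazone").
client = DryRunClient()

# Step 2: the data solution administrator delegates ownership (placeholder IDs).
add_project_member(client, "dzd_example", "prj_example", "project-admin-id", "PROJECT_OWNER")

# Step 3: the project administrator adds an eligible user as a contributor.
add_project_member(client, "dzd_example", "prj_example", "analyst-id", "PROJECT_CONTRIBUTOR")

print(len(client.calls))
```

Passing the client as a parameter keeps the helper easy to test and lets the same function serve both the central administrator and the delegated project administrator.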

Authentication and authorization

The Amazon DataZone portal supports two types of authorizations: AWS Identity and Access Management (IAM) roles and AWS IAM Identity Center users. The data solution supports both of these authorization methods. The choice of authentication mechanism is a function of the type of authorization used for Amazon DataZone.

For IAM role authorization, an IAM role is created for each user, incorporating a prefix. Each data solution user role has a permission to list the Amazon DataZone domains (datazone:ListDomains) and to get the data portal login URL (datazone:GetIamPortalLoginUrl) in the Amazon DataZone AWS account. For reasons that are out of scope for this post, there could only be three SAML federated roles in an AWS account in the customer environment. As such, the team didn’t have a dedicated SAML federated role for each Amazon DataZone user. The data solution user role implemented a trust policy allowing the user’s AWS Security Token Service (AWS STS) federated user session principal Amazon Resource Name (ARN). If you don’t have limitations on the number of SAML federated roles per AWS account, you can make all data solution user roles SAML federated roles and update the trust policy accordingly.
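
As an illustration of the permissions described above, the following sketch builds an identity-based policy document containing the two actions named in this post. The statement ID and the wildcard resource are assumptions made for the example and might need to be narrowed in your environment. The trust policy is intentionally left out because it depends on your federation setup.

```python
import json


def portal_access_policy():
    """Return an IAM policy document with the two portal-related actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DataZonePortalAccess",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [
                    "datazone:ListDomains",
                    "datazone:GetIamPortalLoginUrl",
                ],
                "Resource": "*",  # assumption: scope this down where possible
            }
        ],
    }


print(json.dumps(portal_access_policy(), indent=2))
```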

For IAM Identity Center authorization, the configuration is done either at the AWS Organizations level or AWS account level in IAM Identity Center. Because there are currently no APIs available for identity source configuration in IAM Identity Center, the team followed the appropriate instructions to configure the identity source on the AWS Management Console.

After the chosen authorization option is activated, Amazon DataZone administrators grant the IAM principals (IAM role or IAM Identity Center user) access to the Amazon DataZone portal. For more details, refer to Manage users in the Amazon DataZone console.

Business outcomes

Volkswagen Autoeuropa and AWS established an iterative mechanism to enable the continuous growth of the data solution. This iterative improvement is expressed as a flywheel as shown in the following figure.

The outcome of each component of the flywheel powers the next component, creating a virtuous cycle. The data solution flywheel consists of five components:

  1. Data solution growth – The primary focus of the flywheel is to accelerate the growth of the data solution. This growth is measured by metrics such as number of data products, number of use cases onboarded into the solution, and number of users.
  2. Enhancing user experience – This component focuses on enhancing the user experience of the data solution. One way to measure the user experience is through user feedback surveys.
  3. Data solution use cases – Improved, positive user experience with the data solution contributes to the increased number of use cases that want to onboard the data solution.
  4. Data producers and consumers – As the number of use cases increases, so does the number of data producers and consumers. Data producers make data available to power the use cases. Data consumers use the data to drive the use cases.
  5. Selection of data products – After data producers onboard the data solution, they publish the assets in the Amazon DataZone data portal. This leads to a larger selection of data products. This, in turn, creates a positive experience for the data solution users.

In addition to the previous components, the positive user experience is reinforced by improving governance guardrails, increasing number of reusable assets, and maximizing operational excellence.

As of writing this post, Volkswagen Autoeuropa reduced the time to discover data from days to minutes using the data solution. This is an approximately 384-fold improvement in data discovery time. Data access took several weeks before the Volkswagen Autoeuropa and AWS collaboration. With the help of the data solution powered by Amazon DataZone, the data access time was reduced to minutes. Overall, the data solution resulted in regaining between 48 hours and weeks of customer productivity over the course of a month.

The data solution powered by Amazon DataZone is driving measurable business impact for Volkswagen Autoeuropa. It enables Volkswagen Autoeuropa to deliver digital use cases faster, with less effort, and a higher overall quality. Volkswagen Autoeuropa believes that Amazon DataZone will be key in their journey to become a data-driven factory and to leverage AI.

Conclusion

This post explored how Volkswagen Autoeuropa built a robust and scalable data solution using Amazon DataZone. The first step was to align the solution with Volkswagen Autoeuropa’s overarching data strategy to drive business value.

The establishment of a comprehensive governance framework was central to this effort. This framework encompasses key components, such as business metadata, data quality, federated governance, access controls, and security, which maintain the trustworthiness and reliability of Volkswagen Autoeuropa’s data assets. The post highlighted the Volkswagen Autoeuropa data solution flywheel, showcasing how the solution enabled improved decision-making, increased operational efficiency, and accelerated digital transformation initiatives across the organization.

The data solution built at Volkswagen Autoeuropa is one of the first implementations within the Volkswagen Group and is a blueprint for other Volkswagen production plants.

“This project is a blueprint for other Volkswagen production plants. By involving the AWS team and using Amazon DataZone, we are able to govern our data centrally and make it accessible in an automated and secure way.”

– Daniel Madrid, Head of IT, Volkswagen Autoeuropa.

If you’re looking to harness the power of data mesh to drive innovation and business value within your organization, we’ve got you covered. In Strategies for building a data mesh-based enterprise solution on AWS, we dive deep into the key considerations and current recommendations to establish a robust, scalable, and well-governed data mesh on AWS. This documentation covers everything from aligning your data mesh with overall business strategy to implementing the data mesh strategy framework.

To get hands-on experience with real-world code examples, see our GitHub repository. This open source project provides a step-by-step blueprint for constructing a data mesh architecture using the powerful capabilities of Amazon DataZone, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.


About the Authors

Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data analytics, and data governance at AWS. He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure AWS solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. An active contributor to the AWS community, Dhrubajyoti authors AWS Prescriptive Guidance publications, blog posts, and open source artifacts, sharing his insights and best practices with the broader community. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Ravi Kumar is a Data Architect and Analytics expert at AWS, where he finds immense fulfilment in working with data. His days are dedicated to designing and analyzing complex data systems, uncovering valuable insights that drive business decisions. Outside of work, he unwinds by listening to music and watching movies, activities that allow him to recharge after a long day of data wrangling.

Martin Mikoleizig studied mechanical engineering and production technology at the RWTH Aachen University before starting to work at Dr. Ing. h.c. F. Porsche AG in 2015 as a production planner for the engine assembly. Over several years as a Project Manager on Testing Technology for new engine models, he also introduced several innovations like human-machine collaborations and intelligent assistance systems. Starting in 2017, he was responsible for the shop floor IT team of the module lines in Zuffenhausen before he became responsible for the planning of the E-Drive assembly at Porsche. Additionally, he was responsible for the Digitalisation Strategy of the Production Ressort at Porsche. In October 2022, he was assigned to Volkswagen Autoeuropa in Portugal in the role of a Digital Transformation Manager for the plant, driving the digital transformation towards a data-driven factory.

Weizhou Sun is a Lead Architect at AWS, specializing in digital manufacturing solutions and IoT. With extensive experience in Europe, she has enhanced operational efficiencies, reducing latency and increasing throughput. Weizhou’s expertise includes industrial computer vision, predictive maintenance, and predictive quality, consistently delivering top performance and client satisfaction. A recognized thought leader in IoT and remote driving, she has contributed to business growth through innovations and open source work. Committed to knowledge sharing, Weizhou mentors colleagues and contributes to practice development. Known for her problem-solving skills and customer focus, she delivers solutions that exceed expectations. In her free time, Weizhou explores new technologies and fosters a collaborative culture.

Ajinkya Patil is a Senior Security Architect with AWS Professional Services, specializing in security consulting for customers in the automotive industry. Since joining AWS in 2019, he has played a key role in helping automotive companies design and implement robust security solutions on AWS. Ajinkya is an active contributor to the AWS community, having presented at AWS re:Inforce and authored articles for the AWS Security Blog and AWS Prescriptive Guidance. Outside of his professional pursuits, Ajinkya is passionate about travel and photography, often capturing the diverse landscapes he encounters on his journeys.

Adjoa Taylor has over 20 years of experience in industrial manufacturing, providing industry and technology consulting services, digital transformation, and solution delivery. Currently, Adjoa leads Product Centric Digital Transformation, enabling customers in solving complex manufacturing problems using smart factory and industry-leading transformation mechanisms. Most recently, she drives value with AI/ML and generative AI use cases for the plant floor. Adjoa is an experienced leader, having spent over 20 years of her career delivering projects in countries throughout North America, Latin America, Europe, and Asia. Adjoa brings deep experience across multiple business segments with a focus on business outcome-driven solutions. Adjoa is passionate about helping customers solve problems while realizing the art of the possible through implementing value-based solutions.

How Volkswagen Autoeuropa built a data mesh to accelerate digital transformation using Amazon DataZone

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/how-volkswagen-autoeuropa-built-a-data-mesh-to-accelerate-digital-transformation-using-amazon-datazone/

This is a joint blog post co-authored with Martin Mikoleizig from Volkswagen Autoeuropa.

Volkswagen Autoeuropa is a Volkswagen Group plant that produces the T-Roc. The plant is located near Lisbon, Portugal and produces about 934 cars per day. In 2023, Volkswagen Autoeuropa accounted for 1.3% of the national GDP of Portugal and 4% of the national export of goods, with a sales volume of 3.3511 billion euros. Volkswagen Autoeuropa aims to become a data-driven factory and has been using cutting-edge technologies to enhance digitalization efforts.

In this post, we discuss how Volkswagen Autoeuropa used Amazon DataZone to build a data marketplace based on data mesh architecture to accelerate their digital transformation. The data mesh, built on Amazon DataZone, simplified data access, improved data quality, and established governance at scale to power analytics, reporting, AI, and machine learning (ML) use cases. As a result, the data solution offers benefits such as faster access to data, expeditious decision making, accelerated time to value for use cases, and enhanced data governance.

Understanding Volkswagen Autoeuropa’s challenges

At the time of writing this post, Volkswagen Autoeuropa has already implemented more than 15 successful digital use cases in the context of real-time visualization, business intelligence, industrial computer vision, and AI.

Before the AWS partnership, Volkswagen Autoeuropa faced the following challenges.

  • Long lead time to access data – The digital use cases launched by Volkswagen Autoeuropa spent most of their project time getting access to the data that was relevant to their use cases. After the right data for the use case was found, the IT team provided access to the data through manual configuration. The lead time to access data was often from several days to weeks.
  • Insufficient data governance and auditing – Data was shared directly to use cases by copying it. Therefore, the IT team connected the data manually from their sources to the desired destinations multiple times. This process wasn’t centrally tracked, so no information about the data sharing process could be discovered. For example, the team couldn’t determine whether the data had been copied in the past, how many use cases had access to the data, when access was granted, or who granted the access.
  • Redundant effort to process the same information – Because the IT team copied the data sources based on the exact use case requirements, they shared specific columns of the tables from the data. As additional use cases requested access to the same data with different column requirements, even more copies of the data were created.
  • Repeated process to establish security and governance guardrails – Each time the IT and the security team provided a connection to a new data source, they had to set up the security and governance guardrails. This required repeated manual effort.
  • Data quality issues – Because the data was processed redundantly and shared multiple times, there was no guarantee of or control over the quality of the data. This led to reduced trust in the data.
  • Absence of data catalog and metadata management – Data didn’t have any metadata associated with it, and so use cases couldn’t consume the data without further explanation from the data source owners and specialists. Furthermore, no process to discover new data existed. Similar to the consumption process, use cases would consult specialists to understand the context of the data and if it could provide value.

Envisioning a data solution for Volkswagen Autoeuropa

To address these challenges, Volkswagen Autoeuropa embarked on a bold vision. They envisioned a seamless data consumption process, similar to an online shopping experience. They envisioned a data marketplace where data users could browse and access high-quality, secure data with clear specifications, business context, and relevant attributes. This vision materialized into a project aimed at transforming data accessibility and governance as the foundation for the digital ecosystem. The vision to be realized: Data as seamless as online shopping.

In collaboration with Amazon Web Services (AWS), Volkswagen Autoeuropa joined the Enhanced Plant Onboarding Program of the Global Volkswagen Group’s Digital Production Platform (DPP EPO) strategy. Through this partnership, AWS and Volkswagen Autoeuropa created a data marketplace that significantly improved data availability.

In the discovery phase of the project, Volkswagen Autoeuropa and AWS evaluated several options to build the data solution. In the end, Volkswagen Autoeuropa chose a solution based on data mesh architecture using Amazon DataZone. Being a managed service, Amazon DataZone provided the necessary speed and agility to build the solution. At the same time, it led to higher operational efficiencies and lower operational overhead. The team adopted a data mesh architecture because the principles of the data mesh aligned with Volkswagen Autoeuropa’s vision of being a data driven factory.

Solution overview

This section describes the key features and architecture of the Volkswagen Autoeuropa data solution. The solution is based on a data mesh architecture.

Data solution features

The following figure shows the key capabilities of the Volkswagen Autoeuropa data solution.

The key capabilities of the solution are:

  • Data quality – In the solution, we’ve built a data quality framework to streamline the process of data quality checks and publishing quality scores. It uses AWS Glue Data Quality to generate recommendation rulesets, run orchestrated jobs, store results, and send notifications to users. This framework can be seamlessly integrated into AWS Glue jobs, providing a quality score for data pipeline jobs. In addition, the quality score is published in the Amazon DataZone data portal, allowing consumers to subscribe to the data based on its quality score. Assigning a quality score to the data helps build trust in the data, and shifts the responsibility of maintaining data quality to the data owner. As a result, the quality of the results delivered by these use cases improves.
  • Data registration – The producers sign in to the Amazon DataZone data portal using their AWS Identity and Access Management (IAM) credentials or single sign-on with integration through AWS IAM Identity Center. They register their data assets, which are stored in Amazon Simple Storage Service (Amazon S3), in the Amazon DataZone data catalog. The metadata of the data assets is stored in an AWS Glue catalog and made available in the business data catalog of Amazon DataZone and in the Amazon DataZone data source. The producers add business context such as business unit name, data owner contact information, and data refresh frequency using Amazon DataZone glossaries and metadata forms. In addition, they use generative AI capabilities to generate business metadata. After the business metadata is generated, they review the changes and modify the metadata if needed. Because all data products in Volkswagen Autoeuropa are now registered in the same location, the likelihood of data duplication is significantly reduced. Moreover, the data producers are improving the quality of the data by adding business context to it.
  • Data discovery – The consumers sign in to the Amazon DataZone data portal using their IAM credentials or single sign-on with integration through IAM Identity Center and search the data using keywords in the search bar. After the results are returned, they can further filter the results using glossary terms and project names. Finally, they review the business metadata of the data assets to evaluate if the data is relevant to their business use cases. They can check the quality score of the data assets and the refresh schedule for their use cases. With a data discovery capability in place, consumers can gain information about the data without the need to consult the source system owners or specialists.
  • Data access management – When the consumers find a data asset that’s relevant to their use case, they request access to it using the subscription feature of Amazon DataZone. Data is classified as public, internal, and confidential. For public and internal data assets, the access request is automatically approved. For confidential data assets, the data producer team reviews the access request and either accepts or rejects the subscription request. With a central place to manage data access, data owners can view which use cases have access to their data and when the access request was granted. The fine-grained access control feature of Amazon DataZone gives data owners granular control of their data at the row and column levels.
  • Data consumption – Upon approval of the subscription request, Amazon DataZone provisions the backend infrastructure to make the data accessible to the corresponding consumers. After this process is complete, the consumers can access the data through Amazon Athena using the deep link feature of Amazon DataZone. The data consumption pattern in Volkswagen Autoeuropa supports two use cases:
    • Cloud-to-cloud consumption – Both data assets and consumer teams or applications are hosted in the cloud.
    • Cloud-to-on-premises consumption – Data assets are hosted in the cloud and consumer use cases or applications are hosted on-premises.

Each use case requires access only to the data assets that are relevant to its requirements, and sharing data with use cases through Amazon DataZone doesn’t require creating multiple copies. As a result, duplication and redundant processing of data are reduced. Furthermore, by reducing the number of copies of the data, the overall quality of the data products improves. In addition, the backend automation of Amazon DataZone to make data available to use cases reduces the manual effort and improves the lead time to access data.

  • Single collaborative environment – The Amazon DataZone data portal provides a single collaborative environment to the users in Volkswagen Autoeuropa. Data consumers such as use case owners, data engineers, data scientists, and ML engineers can browse and request access to data assets. At the same time, data producers, such as use case owners and source system owners, can publish and curate their data in the Amazon DataZone data portal. This collaborative experience promotes teamwork and accelerates the realization of business value. Furthermore, the security and governance guardrails scale across the organization as the number of use cases increases.
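
The approval rule described under data access management can be summarized in a few lines of code. The following is a simplified sketch of that decision logic only, not of how Amazon DataZone implements subscriptions. The Classification enum and the status strings are names chosen for this example.

```python
from enum import Enum


class Classification(Enum):
    """Data classification levels named in this post."""
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"


def initial_request_status(classification):
    """Decide how a new subscription request is handled.

    Public and internal assets are approved automatically.
    Confidential assets wait for a review by the data producer team.
    """
    if classification in (Classification.PUBLIC, Classification.INTERNAL):
        return "APPROVED"
    return "PENDING_PRODUCER_REVIEW"


for level in Classification:
    print(level.value, initial_request_status(level))
```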

Data solution architecture

The following figure displays the reference architecture of the data solution at Volkswagen Autoeuropa. In the next part of the post, we discuss how we arrived at the solution.

The architecture includes:

  1. The data from SAP applications, manufacturing execution systems (MES), and supervisory control and data acquisition (SCADA) systems is ingested into the producer accounts of Volkswagen Autoeuropa.
  2. In the producer account, raw data is transformed using AWS Glue. The technical metadata of the data is stored in AWS Glue catalog. The data quality is measured using the data quality framework. The data stored in Amazon Simple Storage Service (Amazon S3) is registered as an asset in the Amazon DataZone data catalog hosted in the central governance account.
  3. The central governance account hosts the Amazon DataZone domain and the related Amazon DataZone data portal. The AWS accounts of the data producers and consumers are associated with the Amazon DataZone domain. Amazon DataZone projects belonging to the data producers and consumers are created under the related Amazon DataZone domain units.
  4. Consumers of the data products sign in to the Amazon DataZone data portal hosted in the central governance account using their IAM credentials or single sign-on with integration through IAM Identity Center. They search, filter, and view asset information (for example, data quality, business, and technical metadata).
  5. After the consumer finds the asset they need, they request access to the asset using the subscription feature of Amazon DataZone. Based on the validity of the request, the asset owner approves or rejects the request.
  6. After the subscription request is granted and fulfilled, the asset is accessed in the consumer account for a one-time query using Athena and Microsoft Power BI applications hosted on premises. This consumption pattern can be extended for AI and machine learning (AI/ML) model development using Amazon SageMaker and reporting purposes using Amazon QuickSight.
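
The one-time query in step 6 could look like the following sketch, which assumes a boto3 Athena client is passed in. It starts a query, polls until the query finishes, and returns the first page of results. The function name, polling defaults, and all parameter values are illustrative and not part of the Volkswagen Autoeuropa solution.

```python
import time


def run_one_time_query(athena, sql, database, output_location,
                       poll_seconds=1.0, max_polls=60):
    """Run a single Athena query and return the first page of rows.

    `athena` is expected to behave like boto3.client("athena").
    For SELECT statements, the first returned row holds the column names.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state or we give up.
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"query {query_id} did not finish in time")

    result = athena.get_query_results(QueryExecutionId=query_id)
    rows = result["ResultSet"]["Rows"]
    # Each cell is a dict; NULL values have no "VarCharValue" key.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

A caller in the consumer account would create the client with boto3.client("athena") and pass the database and Amazon S3 output location that apply to its Amazon DataZone environment. Larger result sets would additionally need pagination.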

User journey

After discussing the desired system with the use case teams and stakeholders and analyzing the current workflow, Volkswagen Autoeuropa grouped the user personas of the data solution into three main categories: data producer, data consumer, and data solution administrator. This sets the foundation for the desired user experience and what’s needed to achieve the solution goals.

Data producer

Data producers create the data products in the data solution. There are two types of data producers.

  • Data source owners – Data source owners publish the raw data in the Amazon DataZone data portal. These data products are attributed as source-based data.
  • Use case owners – Use case owners publish data that’s fit for consumption by other use cases. These data products are called consumer-based data.

The following figure shows the user journey of a data producer:

 

A data producer’s journey includes:

  1. Identify data of interest
    1. Identify data (Volkswagen Autoeuropa network).
    2. Perform data quality checks (Volkswagen Autoeuropa network).
  2. Connect data to the data solution
    1. Ingest data into the data solution (Amazon DataZone portal).
    2. Start process to connect data using AWS Glue.
  3. Locate the data source in the data solution
    1. Register data (Amazon DataZone portal).
    2. Add data to the inventory in Amazon DataZone.
  4. Add or edit metadata
    1. Add or edit metadata (Amazon DataZone portal).
    2. Publish data assets (Amazon DataZone portal).
  5. Approve or reject subscription request
    1. Review subscription requests.
  6. Maintain data assets
    1. Manage data assets (Amazon DataZone portal).

Data consumer

Data consumers use data for business analytics, machine learning, AI, and business reporting. Data consumers are data engineers, data scientists, ML engineers, and business users. The following diagram shows the journey of a data consumer.

A data consumer’s journey includes:

  1. Access Amazon DataZone portal
    1. Amazon DataZone portal – Access is granted based on the user’s assigned domain and projects.
  2. Search for data assets
    1. Data assets in Amazon DataZone portal – Search for data and browse the results by glossary terms or the project name. Use additional filters to refine the results.
  3. View business metadata
    1. Select a data asset to see additional information – Review the description, data quality score, and metadata.
  4. Request access to data (subscribe)
    1. Subscribe to request access.
    2. After the subscription request is approved, review the data products that you have access to.
    3. Query the data to view and consume the data.
  5. Retrieve additional data
    1. Repeat the steps as needed to access and retrieve additional data.
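
Steps 2 and 4 of the consumer journey (search for data assets, then subscribe) are performed in the portal in this post. For readers who prefer to script them, the following is a minimal sketch under stated assumptions: `client` is a boto3 DataZone client (for example, `boto3.client("datazone")`), and the function names and identifiers are hypothetical. It is not taken from the Volkswagen Autoeuropa implementation.

```python
from typing import Any, List, Tuple


def find_listings(
    client: Any, domain_id: str, search_text: str, max_results: int = 10
) -> List[Tuple[str, str]]:
    """Return (listing ID, name) pairs for published assets that match search_text."""
    response = client.search_listings(
        domainIdentifier=domain_id,
        searchText=search_text,
        maxResults=max_results,
    )
    results: List[Tuple[str, str]] = []
    for item in response.get("items", []):
        listing = item.get("assetListing")
        if listing:  # skip result types that are not asset listings
            results.append((listing["listingId"], listing.get("name", "")))
    return results


def request_access(
    client: Any,
    domain_id: str,
    listing_id: str,
    consumer_project_id: str,
    reason: str,
) -> Tuple[str, str]:
    """Create a subscription request for one listing and return (request ID, status)."""
    response = client.create_subscription_request(
        domainIdentifier=domain_id,
        requestReason=reason,
        subscribedListings=[{"identifier": listing_id}],
        subscribedPrincipals=[{"project": {"identifier": consumer_project_id}}],
    )
    return response["id"], response["status"]
```

The request stays pending until a data producer approves or rejects it. After approval, the consumer project can query the subscribed data.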

Data solution administrator

Data solution administrators are responsible for performing administrative tasks on the data solution. The following figure shows the common tasks performed by the data solution administrator.

A data solution administrator’s journey includes:

  1. Manage projects
    1. Manage Amazon DataZone domain.
    2. Manage Amazon DataZone projects within the domain.
  2. Manage environment
    1. Set up the environment to manage the infrastructure.
  3. Manage business metadata glossary
    1. Manage and enable Amazon DataZone glossaries and metadata forms.
  4. Manage data assets
    1. Manage assets.
    2. Query the data to view and consume the data.
  5. Manage access to data solution
    1. Monitor and revoke access when appropriate.

Conclusion

In this post, you learned how Volkswagen Autoeuropa embarked on a bold vision to become a data-driven factory and how it put that vision into action by building a data solution based on a data mesh architecture using Amazon DataZone. The post highlighted the key features and architecture of the data solution and presented the user journeys. As of writing this post, Volkswagen Autoeuropa has reduced data discovery time from days to minutes using the data solution. Before the collaboration between Volkswagen Autoeuropa and AWS, getting access to data took several weeks. Now, with the help of the data solution, data access time has been reduced to several minutes.

In May 2024, the team achieved a major milestone by successfully offering data on the data solution and transporting it instantly to Power BI, a process that previously took several weeks.

“After one year of work, we did the full roundtrip from offering data on our new data marketplace built using Amazon DataZone to transporting it instantly to third-party tools, a process that previously took several weeks. This was a big achievement for our team.”

– Jorge Paulino, Product Owner of the data solution, Volkswagen Autoeuropa

The next post of the two-part series discusses how we built the solution, its technical details, and the business value created.

If you want to harness the agility and scalability of a data mesh architecture and Amazon DataZone to accelerate innovation and drive business value for your organization, we have the resources to get you started. Be sure to check out the AWS Prescriptive Guidance: Strategies for building a data mesh-based enterprise solution on AWS. This comprehensive guide covers the key considerations and best practices for establishing a robust, well-governed data mesh on AWS. From aligning your data mesh with overall business strategy to scaling the data mesh across your organization, this Prescriptive Guidance provides a clear roadmap to help you succeed.

If you’re curious to get hands-on, see the GitHub repository: Building an enterprise Data Mesh with Amazon DataZone, AWS CDK, and AWS CloudFormation. This open source project delivers a step-by-step guide to build a data mesh architecture using Amazon DataZone, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.


About the Authors

Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data analytics, and data governance at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure AWS solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. An active contributor to the AWS community, Dhrubajyoti authors AWS Prescriptive Guidance publications, blog posts, and open-source artifacts, sharing his insights and best practices with the broader community. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Ravi Kumar is a Data Architect and Analytics expert at Amazon Web Services; he finds immense fulfillment in working with data. His days are dedicated to designing and analyzing complex data systems, uncovering valuable insights that drive business decisions. Outside of work, he unwinds by listening to music and watching movies, activities that allow him to recharge after a long day of data wrangling.

Martin Mikoleizig studied mechanical engineering and production technology at RWTH Aachen University before joining Dr. Ing. h.c. F. Porsche AG in 2015 as a production planner for engine assembly. During several years as a Project Manager for Testing Technology for new engine models, he introduced several innovations, such as human-machine collaboration and intelligent assistance systems. From 2017, he was responsible for the Shopfloor IT team of the module lines in Zuffenhausen before he became responsible for the Planning of the E-Drive assembly at Porsche. In addition, he was responsible for the Digitalisation Strategy of the Production Ressort at Porsche. Since October 2022, he has been assigned to Volkswagen Autoeuropa in Portugal as Digital Transformation Manager for the plant, driving the digital transformation towards a Data Driven Factory.

Weizhou Sun is a Lead Architect at Amazon Web Services, specializing in digital manufacturing solutions and IoT. With extensive experience in Europe, she has enhanced operational efficiencies, reducing latency and increasing throughput. Weizhou’s expertise includes Industrial Computer Vision, predictive maintenance, and predictive quality, consistently delivering top performance and client satisfaction. A recognized thought leader in IoT and remote driving, she has contributed to business growth through innovations and open-source work. Committed to knowledge sharing, Weizhou mentors colleagues and contributes to practice development. Known for her problem-solving skills and customer focus, she delivers solutions that exceed expectations. In her free time, Weizhou explores new technologies and fosters a collaborative culture.

Shameka Almond is an Advisory Consultant at Amazon Web Services. She works closely with enterprise customers to help them better understand the business impact and value of implementing data solutions, including data governance best practices. Shameka has over a decade of wide-ranging IT experience in the manufacturing and aerospace industries, and the nonprofit sector. She has supported several data governance initiatives, helping both public and private organizations identify opportunities for improvement and increased efficiency. Outside of the office she enjoys hosting large family gatherings, and supporting community outreach events dedicated to introducing students in K-12 to STEM.

Adjoa Taylor has over 20 years of experience in industrial manufacturing, providing industry and technology consulting services, digital transformation, and solution delivery. Currently, Adjoa leads Product Centric Digital Transformation, enabling customers to solve complex manufacturing problems by leveraging Smart Factory and industry-leading transformation mechanisms. Most recently, she has been driving value with AI/ML and generative AI use cases for the plant floor. Adjoa is an experienced leader who has spent over 20 years of her career delivering projects in countries throughout North America, Latin America, Europe, and Asia. Through prior roles, Adjoa brings deep experience across multiple business segments with a focus on business outcome driven solutions. Adjoa is passionate about helping customers solve problems while realizing the art of the possible through the right value-based solution.