All posts by Jan Michael Go Tan

Use the AWS CDK with the Data Solutions Framework to provision and manage Amazon Redshift Serverless

Post Syndicated from Jan Michael Go Tan original https://aws.amazon.com/blogs/big-data/use-the-aws-cdk-with-the-data-solutions-framework-to-provision-and-manage-amazon-redshift-serverless/

In February 2024, we announced the release of the Data Solutions Framework (DSF), an opinionated open source framework for building data solutions on AWS. DSF is built using the AWS Cloud Development Kit (AWS CDK) to package infrastructure components into L3 AWS CDK constructs on top of AWS services. L3 constructs are implementations of common technical patterns and create multiple resources that are configured to work with each other.

In this post, we demonstrate how to use the AWS CDK and DSF to create a multi-data warehouse platform based on Amazon Redshift Serverless. DSF simplifies the provisioning of Redshift Serverless, initialization and cataloging of data, and data sharing between different data warehouse deployments. Using a programmatic approach with the AWS CDK and DSF allows you to apply GitOps principles to your analytics workloads and realize the following benefits:

  • You can deploy using continuous integration and delivery (CI/CD) pipelines, including the definitions of Redshift objects (databases, tables, shares, and so on)
  • You can roll out changes consistently across multiple environments
  • You can bootstrap data warehouses (table creation, ingestion of data, and so on) using code and use version control to simplify the setup of testing environments
  • You can test changes before deployment using AWS CDK built-in testing capabilities (a test sketch follows this list)
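
For example, the following is a minimal unit test sketch using the AWS CDK assertions module. It assumes the module and stack class names that cdk init generates for the project created later in this post (dsf_redshift_blog.dsf_redshift_blog_stack and DsfRedshiftBlogStack); adjust the assertion to the resources your stack actually creates.

import aws_cdk as cdk
from aws_cdk.assertions import Template

# Assumed import path; cdk init derives these names from the project folder created later in this post.
from dsf_redshift_blog.dsf_redshift_blog_stack import DsfRedshiftBlogStack


def test_stack_creates_data_lake_buckets():
    app = cdk.App()
    stack = DsfRedshiftBlogStack(app, "TestStack")
    template = Template.from_stack(stack)

    # The DataLakeStorage construct creates S3 buckets; assert that at least one is synthesized.
    assert len(template.find_resources("AWS::S3::Bucket")) > 0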

In addition, DSF’s Redshift Serverless L3 constructs provide a number of built-in capabilities that can accelerate development while helping you follow best practices. For example:

  • Running extract, transform, and load (ETL) jobs to and from Amazon Redshift is more straightforward because an AWS Glue connection resource is automatically created and configured. This means data engineers don’t have to configure this resource and can use it right away with their AWS Glue ETL jobs.
  • Similarly, with discovery of data inside Amazon Redshift, DSF provides a convenient method to configure an AWS Glue crawler to populate the AWS Glue Data Catalog for ease of discovery as well as ease of referencing tables when creating ETL jobs. The configured AWS Glue crawler uses an AWS Identity and Access Management (IAM) role that follows least privilege.
  • Sharing data between Redshift data warehouses is a common approach to improve collaboration between lines of business without duplicating data. DSF provides convenient methods for the end-to-end flow for both data producer and consumer.

Solution overview

The solution demonstrates a common pattern where a data warehouse is used as a serving layer for business intelligence (BI) workloads on top of data lake data. The source data is stored in Amazon Simple Storage Service (Amazon S3) buckets, then ingested into a Redshift producer data warehouse to create materialized views and aggregate data, and finally shared with a Redshift consumer running BI queries from the end-users. The following diagram illustrates the high-level architecture.

Solution Overview

In this post, we use Python for the example code. DSF also supports TypeScript.

Prerequisites

Because we’re using the AWS CDK, complete the steps in Getting Started with the AWS CDK before you implement the solution.

Initialize the project and provision a Redshift Serverless namespace and workgroup

Let’s start by initializing the project and including DSF as a dependency. You can run this code in your local terminal, or you can use AWS Cloud9:

mkdir dsf-redshift-blog && cd dsf-redshift-blog
cdk init --language python

Open the project folder in your IDE and complete the following steps:

  1. Open the app.py file.
  2. In this file, make sure to uncomment the first env line. This configures the AWS CDK environment depending on the AWS profile used during the deployment (a sketch of the resulting app.py appears below).
  3. Add a configuration flag in the cdk.context.json file at the root of the project (if it doesn’t exist, create the file):
    {  
        "@data-solutions-framework-on-aws/removeDataOnDestroy": true 
    }

Setting the @data-solutions-framework-on-aws/removeDataOnDestroy configuration flag to true makes sure resources that have the removal_policy parameter set to RemovalPolicy.DESTROY are destroyed when the AWS CDK stack is deleted. This is a guardrail DSF uses to prevent accidentally deleting data.
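
For reference, the app.py generated by cdk init looks similar to the following sketch once the first env line from step 2 is uncommented (the module and class names are the ones cdk init derives from the project name):

#!/usr/bin/env python3
import os

import aws_cdk as cdk

from dsf_redshift_blog.dsf_redshift_blog_stack import DsfRedshiftBlogStack

app = cdk.App()
DsfRedshiftBlogStack(app, "DsfRedshiftBlogStack",
    # With env uncommented, the stack is environment-specific and uses the account
    # and Region of the AWS profile used at deployment time.
    env=cdk.Environment(
        account=os.getenv('CDK_DEFAULT_ACCOUNT'),
        region=os.getenv('CDK_DEFAULT_REGION')),
    )

app.synth()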

Now that the project is configured, you can start adding resources to the stack.

  4. Navigate to the dsf_redshift_blog folder and open the dsf_redshift_blog_stack.py file.

This is where we configure the resources to be deployed.

  5. To get started building the end-to-end demo, add the following import statements at the top of the file, which allow you to define resources from both the AWS CDK core library and DSF:
    from aws_cdk import (
        RemovalPolicy,
        Stack
    )
    
    from aws_cdk.aws_s3 import Bucket
    from aws_cdk.aws_iam import Role, ServicePrincipal
    from constructs import Construct
    from cdklabs import aws_data_solutions_framework as dsf

We use several DSF-specific constructs to build the demo:

  • DataLakeStorage – This creates three S3 buckets, named Bronze, Silver, and Gold, to represent the different data layers.
  • S3DataCopy – This manages the copying of data from one bucket to another bucket.
  • RedshiftServerlessNamespace – This creates a Redshift Serverless namespace where database objects and users are stored.
  • RedshiftServerlessWorkgroup – This creates a Redshift Serverless workgroup that contains compute- and network-related configurations for the data warehouse. This is also the entry point for several convenient functionalities that DSF provides, such as cataloging of Redshift tables, running SQL statements as part of the AWS CDK (such as creating tables, data ingestion, merging of tables, and more), and sharing datasets across different Redshift clusters without moving data.
  6. Now that you have imported the libraries, create a set of S3 buckets following the medallion architecture best practices with bronze, silver, and gold data layers.

The high-level definitions of each layer are as follows:

  • Bronze represents raw data; this is where data from various source systems lands. No schema is needed.
  • Silver is cleaned and potentially augmented data. The schema is enforced in this layer.
  • Gold is data that’s further refined and aggregated to serve a specific business need.

Using the DataLakeStorage construct, you can create these three S3 buckets with the following best practices:

  • Encryption at rest through AWS Key Management Service (AWS KMS) is turned on
  • SSL is enforced
  • The use of S3 bucket keys is turned on
  • There’s a default S3 lifecycle rule defined to delete incomplete multipart uploads after 1 day
    data_lake = dsf.storage.DataLakeStorage(self,
        'DataLake',
        removal_policy=RemovalPolicy.DESTROY)

  7. After you create the S3 buckets, copy over the data using the S3DataCopy construct. For this demo, we land the data in the Silver bucket because it’s already cleaned:
    source_bucket = Bucket.from_bucket_name(self, 
        'SourceBucket', 
        bucket_name='redshift-immersionday-labs')
    
    data_copy = dsf.utils.S3DataCopy(self,
        'SourceData', 
        source_bucket=source_bucket, 
        source_bucket_prefix='data/amazon-reviews/', 
        source_bucket_region='us-west-2', 
        target_bucket=data_lake.silver_bucket, 
        target_bucket_prefix='silver/amazon-reviews/')

  8. In order for Amazon Redshift to ingest the data from Amazon S3, it needs an IAM role with the right permissions. This role will be associated with the Redshift Serverless namespace that you create next.
    lake_role = Role(self, 
        'LakeRole', 
        assumed_by=ServicePrincipal('redshift.amazonaws.com'))
    
    data_lake.silver_bucket.grant_read(lake_role)

  9. To provision Redshift Serverless, configure two resources: a namespace and a workgroup. DSF provides L3 constructs for both:
    1. RedshiftServerlessNamespace
    2. RedshiftServerlessWorkgroup

    Both constructs follow security best practices, including:

    • The default virtual private cloud (VPC) uses private subnets (with public access disabled).
    • Data is encrypted at rest through AWS KMS with automatic key rotation.
    • Admin credentials are stored in AWS Secrets Manager with automatic rotation managed by Amazon Redshift.
    • A default AWS Glue connection is automatically created using private connectivity. This can be used by AWS Glue crawlers as well as AWS Glue ETL jobs to connect to Amazon Redshift.

    The RedshiftServerlessWorkgroup construct is the main entry point for other capabilities, such as integration with the AWS Glue Data Catalog, Redshift Data API, and Data Sharing API.

  10. In the following example, use the defaults provided by the construct and associate the IAM role that you created earlier to give Amazon Redshift access to the data lake for data ingestion:
      namespace = dsf.consumption.RedshiftServerlessNamespace(self, 
          'Namespace', 
          db_name='defaultdb', 
          name='producer', 
          removal_policy=RemovalPolicy.DESTROY, 
          default_iam_role=lake_role)
      
      workgroup = dsf.consumption.RedshiftServerlessWorkgroup(self, 
          'Workgroup', 
          name='producer', 
          namespace=namespace, 
          removal_policy=RemovalPolicy.DESTROY)

Create tables and ingest data

To create a table, you can use the run_custom_sql method of the RedshiftServerlessWorkgroup construct. This method allows you to run arbitrary SQL statements when the resource is being created (such as create table or create materialized view) and when it’s being deleted (such as drop table or drop materialized view).

Add the following code after the RedshiftServerlessWorkgroup instantiation:

create_amazon_reviews_table = workgroup.run_custom_sql('CreateAmazonReviewsTable', 
    database_name='defaultdb', 
    sql='CREATE TABLE amazon_reviews (marketplace character varying(16383) ENCODE lzo, customer_id character varying(16383) ENCODE lzo, review_id character varying(16383) ENCODE lzo, product_id character varying(16383) ENCODE lzo, product_parent character varying(16383) ENCODE lzo, product_title character varying(16383) ENCODE lzo, star_rating integer ENCODE az64, helpful_votes integer ENCODE az64, total_votes integer ENCODE az64, vine character varying(16383) ENCODE lzo, verified_purchase character varying(16383) ENCODE lzo, review_headline character varying(max) ENCODE lzo, review_body character varying(max) ENCODE lzo, review_date date ENCODE az64, year integer ENCODE az64) DISTSTYLE AUTO;', 
    delete_sql='drop table amazon_reviews')

load_amazon_reviews_data = workgroup.ingest_data('amazon_reviews_ingest_data', 
    'defaultdb', 
    'amazon_reviews', 
    data_lake.silver_bucket, 
    'silver/amazon-reviews/', 
    'FORMAT parquet')

load_amazon_reviews_data.node.add_dependency(create_amazon_reviews_table)
load_amazon_reviews_data.node.add_dependency(data_copy)

Given the asynchronous nature of some of the resource creation, we also enforce dependencies between some resources; otherwise, the AWS CDK would try to create them in parallel to accelerate the deployment. The preceding dependency statements establish the following:

  • The data load starts only after the S3 data copy is complete, so the data exists in the source location of the ingestion
  • The data load starts only after the target table has been created in the Redshift namespace

Bootstrapping example (materialized views)

The workgroup.run_custom_sql() method provides flexibility in how you can bootstrap your Redshift data warehouse using the AWS CDK. For example, you can create a materialized view to improve query performance by pre-aggregating data from the Amazon reviews dataset:

materialized_view = workgroup.run_custom_sql('MvProductAnalysis',
    database_name='defaultdb',
    sql=f'''CREATE MATERIALIZED VIEW mv_product_analysis AS SELECT review_date, product_title, COUNT(1) AS review_total, SUM(star_rating) AS rating FROM amazon_reviews WHERE marketplace = 'US' GROUP BY 1,2;''',
    delete_sql='drop materialized view mv_product_analysis')

materialized_view.node.add_dependency(load_amazon_reviews_data)

Catalog tables in Amazon Redshift

The deployment of RedshiftServerlessWorkgroup automatically creates an AWS Glue connection resource that can be used by AWS Glue crawlers and AWS Glue ETL jobs. This is directly exposed from the workgroup construct through the glue_connection property. Using this connection, the workgroup construct exposes a convenient method to catalog the tables inside the associated Redshift Serverless namespace. The following is an example:

workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis')

This single line of code creates a database in the Data Catalog named mv_product_analysis and the associated crawler, with the IAM role and network configuration already set up. By default, it crawls all the tables inside the public schema of the default database indicated when the Redshift Serverless namespace was created. To override this, the third parameter in the catalog_tables method lets you define a pattern for what to crawl (following the JDBC include path format used by AWS Glue crawlers).
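
For example, the following sketch assumes the third parameter accepts an AWS Glue JDBC-style include path (database/schema/table, where % acts as a wildcard); check the DSF documentation for the exact format:

# Restrict the crawler to a single table instead of the whole public schema.
workgroup.catalog_tables(
    'DefaultDBCatalog',
    'mv_product_analysis',
    'defaultdb/public/mv_product_analysis')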

You can run the crawler from the AWS Glue console, or invoke it using the SDK, the AWS Command Line Interface (AWS CLI), or the AWS CDK (for example, with AwsCustomResource).
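
For example, the following is a minimal boto3 sketch that starts the crawler outside of the AWS CDK (the crawler name is a placeholder; look up the actual name on the AWS Glue console or in the stack outputs):

import boto3

glue = boto3.client('glue')

# Start the crawler created by catalog_tables; replace the placeholder name.
glue.start_crawler(Name='<CRAWLER_NAME>')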

Data sharing

DSF supports Redshift data sharing for both sides (producers and consumers) as well as same account and cross-account scenarios. Let’s create another Redshift Serverless namespace and workgroup to demonstrate the interaction:

namespace2 = dsf.consumption.RedshiftServerlessNamespace(self, 
    "Namespace2", 
    db_name="defaultdb", 
    name="consumer", 
    default_iam_role=lake_role, 
    removal_policy=RemovalPolicy.DESTROY)

workgroup2 = dsf.consumption.RedshiftServerlessWorkgroup(self, 
    "Workgroup2", 
    name="consumer", 
    namespace=namespace2, 
    removal_policy=RemovalPolicy.DESTROY)

For producers

For producers, complete the following steps:

  1. Create the new share and populate the share with the schema or tables:
    data_share = workgroup.create_share('DataSharing', 
        'defaultdb', 
        'defaultdbshare', 
        'public', ['mv_product_analysis'])
    
    data_share.new_share_custom_resource.node.add_dependency(materialized_view)
  2. Create access grants:
    • To grant to a cluster in the same account:
      share_grant = workgroup.grant_access_to_share("GrantToSameAccount", 
          data_share, 
          namespace2.namespace_id)
      
      share_grant.resource.node.add_dependency(data_share.new_share_custom_resource)
      share_grant.resource.node.add_dependency(namespace2)
    • To grant to a different account:
      workgroup.grant_access_to_share('GrantToDifferentAccount', 
          tpcdsShare, 
          undefined, 
          '<ACCOUNT_ID_OF_CONSUMER>', 
          true)

The last parameter in the grant_access_to_share method allows you to automatically authorize cross-account access on the data share. Omitting this parameter defaults to no authorization, which means a Redshift administrator needs to authorize the cross-account share using the AWS CLI, an SDK, or the Amazon Redshift console.
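
For reference, the following is a hedged boto3 sketch of that manual authorization step (the data share ARN and consumer account ID are placeholders); the equivalent AWS CLI command is aws redshift authorize-data-share:

import boto3

redshift = boto3.client('redshift')

# Authorize the consumer account to access the data share (producer-side action).
redshift.authorize_data_share(
    DataShareArn='<DATA_SHARE_ARN>',
    ConsumerIdentifier='<ACCOUNT_ID_OF_CONSUMER>')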

For consumers

For the same account share, to create the database from the share, use the following code:

create_db_from_share = workgroup2.create_database_from_share("CreateDatabaseFromShare", 
    "marketing", 
    data_share.data_share_name, 
    data_share.producer_namespace)

create_db_from_share.resource.node.add_dependency(share_grant.resource)
create_db_from_share.resource.node.add_dependency(workgroup2)

For cross-account grants, the syntax is similar, but you need to indicate the producer account ID:

consumerWorkgroup.create_database_from_share('CreateCrossAccountDatabaseFromShare', 
    'tpcds', 
    <PRODUCER_SHARE_NAME>, 
    <PRODUCER_NAMESPACE_ID>, 
    <PRODUCER_ACCOUNT_ID>)

To see the full working example, follow the instructions in the accompanying GitHub repository.

Deploy the resources using the AWS CDK

To deploy the resources, run the following code:

cdk deploy

You can review the resources created, as shown in the following screenshot.

Confirm the changes for the deployment to start. Wait a few minutes for the project to be deployed; you can keep track of the deployment using the AWS CLI or the AWS CloudFormation console.

When the deployment is complete, you should see two Redshift workgroups (one producer and one consumer).

Using Amazon Redshift Query Editor v2, you can log in to the producer Redshift workgroup using Secrets Manager, as shown in the following screenshot.

Producer QEV2 Login

After you log in, you can see the tables and views that you created using DSF in the defaultdb database.

QEv2 Tables

Log in to the consumer Redshift workgroup to see the shared dataset from the producer Redshift workgroup under the marketing database.

Clean up

You can run cdk destroy in your local terminal to delete the stack. Because you marked the constructs with a RemovalPolicy.DESTROY and configured DSF to remove data on destroy, running cdk destroy or deleting the stack from the AWS CloudFormation console will clean up the provisioned resources.

Conclusion

In this post, we demonstrated how to use the AWS CDK along with the DSF to manage Redshift Serverless as code. Codifying the deployment of resources helps provide consistency across multiple environments. Aside from infrastructure, DSF also provides capabilities to bootstrap (table creation, ingestion of data, and more) Amazon Redshift and manage objects, all from the AWS CDK. This means that changes can be version controlled, reviewed, and even unit tested.

In addition to Redshift Serverless, DSF supports other AWS services, such as Amazon Athena, Amazon EMR, and many more. Our roadmap is publicly available, and we look forward to your feature requests, contributions, and feedback.

You can get started using DSF by following our quick start guide.


About the authors


Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.
Vincent Gromakowski is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers’ analytics, NoSQL, and streaming challenges. He has strong expertise in distributed data processing engines and resource orchestration platforms.

Use an event-driven architecture to build a data mesh on AWS

Post Syndicated from Jan Michael Go Tan original https://aws.amazon.com/blogs/big-data/use-an-event-driven-architecture-to-build-a-data-mesh-on-aws/

In this post, we take the data mesh design discussed in Design a data mesh architecture using AWS Lake Formation and AWS Glue, and demonstrate how to initialize data domain accounts to enable managed sharing; we also go through how we can use an event-driven approach to automate processes between the central governance account and data domain accounts (producers and consumers). We build a data mesh pattern from scratch as Infrastructure as Code (IaC) using AWS CDK and use an open-source self-service data platform UI to share and discover data between business units.

The key advantage of this approach is the ability to add actions in response to data mesh events, such as permission management, tag propagation, and search index management, and to automate different processes.

Before we dive into it, let’s look at AWS Analytics Reference Architecture, an open-source library that we use to build our solution.

AWS Analytics Reference Architecture

AWS Analytics Reference Architecture (ARA) is a set of analytics solutions put together as end-to-end examples. It brings together AWS best practices for designing, implementing, and operating analytics platforms through different purpose-built patterns, handling common requirements, and solving customers’ challenges.

ARA exposes reusable core components in an AWS CDK library, currently available in TypeScript and Python. This library contains L3 AWS CDK constructs that you can use to quickly provision analytics solutions in demos, prototypes, proofs of concept, and end-to-end reference architectures.

The AWS Analytics Reference Architecture library includes the following data mesh-specific constructs:

  • CentralGovernance – Creates an Amazon EventBridge event bus for the central governance account that is used to communicate with data domain accounts (producer and consumer). It also creates workflows to automate data product registration and sharing.
  • DataDomain – Creates an Amazon EventBridge event bus for a data domain account (producer or consumer) to communicate with the central governance account. It creates data lake storage (Amazon S3) and a workflow to automate data product registration. It also creates a workflow to populate AWS Glue Data Catalog metadata for newly registered data products.

You can find AWS CDK constructs for the AWS Analytics Reference Architecture on Construct Hub.

In addition to ARA constructs, we also use an open-source self-service data platform (user interface). It is built using AWS Amplify, Amazon DynamoDB, AWS Step Functions, AWS Lambda, Amazon API Gateway, Amazon EventBridge, Amazon Cognito, and Amazon OpenSearch Service. The frontend is built with React. Through the self-service data platform, you can: 1) manage data domains and data products, and 2) discover and request access to data products.

Central Governance and data sharing

For the governance of our data mesh, we use AWS Lake Formation. AWS Lake Formation is a fully managed service that simplifies data lake setup, supports centralized security management, and provides transactional access on top of your data lake. Moreover, it enables data sharing across accounts and organizations. This centralized approach has a number of key benefits, such as centralized auditing, centralized permission management, and centralized data discovery. More importantly, it allows organizations to gain the benefits of centralized governance while taking advantage of the inherent scaling characteristics of decentralized data product management.

There are two ways to share data resources in Lake Formation: 1) named resource access control (NRAC), and 2) tag-based access control (LF-TBAC). NRAC uses AWS Resource Access Manager (AWS RAM) to share data resources across accounts; these are consumed via resource links that are based on the created resource shares. LF-TBAC defines permissions based on attributes called LF-tags. You can read this blog to learn about LF-TBAC in the context of data mesh.
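
To make LF-TBAC more concrete, the following is a minimal boto3 sketch (the tag key, tag value, account ID, and granted permissions are placeholders; adjust them to your use case) that grants a consumer account access to all tables carrying a given LF-tag:

import boto3

lakeformation = boto3.client('lakeformation')

# Grant SELECT/DESCRIBE on every table tagged with LOB=N to the consumer account.
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': '<CONSUMER_ACCOUNT_ID>'},
    Resource={
        'LFTagPolicy': {
            'ResourceType': 'TABLE',
            'Expression': [{'TagKey': 'LOB', 'TagValues': ['N']}]
        }
    },
    Permissions=['SELECT', 'DESCRIBE'])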

The following diagram shows how NRAC and LF-TBAC data sharing works. In this example, a data domain is registered as a node on the mesh, and therefore we create two databases in the central governance account. The NRAC database is shared with the data domain via AWS RAM, and access to the data products that we register in this database is handled through NRAC. The LF-TBAC database is tagged with the data domain N line of business (LOB) LF-tag: <LOB:N>. The LOB tag is automatically shared with the data domain N account, which makes the database available in that account. Access to data products in this database is handled through LF-TBAC.

In our solution, we demonstrate both the NRAC and LF-TBAC approaches. With the NRAC approach, we build an event-based workflow that automatically accepts the RAM share in the data domain accounts and automates the creation of the necessary metadata objects (for example, the local database and resource links). With the LF-TBAC approach, we rely on the permissions associated with the shared LF-tags to allow producer data domains to manage their data products and to give consumer data domains read access to the data products associated with the LF-tags they requested access to.

We use the CentralGovernance construct from the ARA library to build the central governance account. It creates an EventBridge event bus to enable communication with the data domain accounts that register as nodes on the mesh. For each registered data domain, specific event bus rules are created that route events toward that account. The central governance account has a central metadata catalog that allows data to be stored in different data domains, as opposed to a single central lake. For each registered data domain, we create two separate databases in the central governance catalog to demonstrate both NRAC and LF-TBAC data sharing. The CentralGovernance construct also creates workflows for data product registration and data product sharing. Finally, we deploy a self-service data platform UI to provide a good user experience for managing data domains and data products, and to simplify data discovery and sharing.

A data domain: producer and consumer

We use the DataDomain construct from the ARA library to build a data domain account that can be a producer, a consumer, or both. Producers manage the lifecycle of their respective data products in their own AWS accounts. Typically, this data is stored in Amazon Simple Storage Service (Amazon S3). The DataDomain construct creates data lake storage with a cross-account bucket policy that enables the central governance account to access the data. Data is encrypted using AWS KMS, and the central governance account has permission to use the key. A config secret in AWS Secrets Manager contains all the necessary information to register the data domain as a node on the mesh in central governance. It includes: 1) the data domain name, 2) the S3 location that holds the data products, and 3) the encryption key ARN. The DataDomain construct also creates the Data Domain and Crawler workflows to automate data product registration.

Creating an event-driven data mesh

Data mesh architectures typically require some level of communication and trust policy management to maintain least privilege for the relevant principals between the different accounts (for example, central governance to producer, central governance to consumer). We use an event-driven approach via EventBridge to securely forward events from an event bus in one account to an event bus in another account while maintaining least privilege access. When we register a data domain with the central governance account through the self-service data platform UI, we establish bi-directional communication between the accounts via EventBridge. The domain registration process also creates a database in the central governance catalog to hold data products for that particular domain. The registered data domain is now a node on the mesh, and we can register new data products.
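
The following AWS CDK sketch illustrates the general pattern of forwarding events from a central bus to a bus in another account. It is an illustration of the mechanism, not the exact rules that the ARA constructs create; the bus names, ARN, event source, and detail type are placeholders.

from aws_cdk import aws_events as events, aws_events_targets as targets

# Inside a Stack's __init__ (self is the stack scope).
# Central governance event bus (in the central account).
central_bus = events.EventBus(self, 'CentralMeshBus',
    event_bus_name='central-mesh-bus')

# Reference to the event bus in a registered data domain account (placeholder ARN).
domain_bus = events.EventBus.from_event_bus_arn(self, 'DataDomainBus',
    'arn:aws:events:<REGION>:<DATA_DOMAIN_ACCOUNT_ID>:event-bus/<DATA_DOMAIN_BUS_NAME>')

# Rule created per registered data domain: route matching events to that account.
events.Rule(self, 'ForwardToDataDomain',
    event_bus=central_bus,
    event_pattern=events.EventPattern(
        source=['com.central.governance'],
        detail_type=['registerDataProduct']),
    targets=[targets.EventBus(domain_bus)])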

The following diagram shows the data product registration process:

  1. Registering a data product starts the Register Data Product workflow, which creates an empty table (the schema is managed by the producers in their respective producer account). This workflow also grants a cross-account permission to the producer account that allows the producer to manage the schema of the table.
  2. When complete, this workflow emits an event into the central event bus.
  3. The central event bus contains a rule that forwards the event to the producer’s event bus. This rule was created during the data domain registration process.
  4. When the producer’s event bus receives the event, it triggers the Data Domain workflow, which creates resource links and grants permissions.
  5. Still in the producer account, the Crawler workflow is triggered when the Data Domain workflow state changes to Successful. This creates the crawler, runs it, waits and checks if the crawler is done, and deletes the crawler when it’s complete. This workflow is responsible for populating the tables’ schemas.

Now other data domains can find newly registered data products using the self-service data platform UI and request access. The sharing process works the same way as product registration: events are sent from the central governance account to the consumer data domain, triggering specific workflows.

Solution Overview

The following high-level solution diagram shows how everything fits together and how event-driven architecture enables multiple accounts to form a data mesh. You can follow the workshop that we released to deploy the solution that we covered in this blog post. You can deploy multiple data domains and test both data registration and data sharing. You can also use self-service data platform UI to search through data products and request access using both LF-TBAC and NRAC approaches.

Conclusion

Implementing a data mesh on top of an event-driven architecture provides both flexibility and extensibility. A data mesh by itself has several moving parts to support various functionalities, such as onboarding, search, access management and sharing, and more. With an event-driven architecture, we can implement these functionalities in smaller components to make them easier to test, operate, and maintain. Future requirements and applications can use the event stream to provide their own functionality, making the entire mesh much more valuable to your organization.

To learn more how to design and build applications based on event-driven architecture, see the AWS Event-Driven Architecture page. To dive deeper into data mesh concepts, see the Design a Data Mesh Architecture using AWS Lake Formation and AWS Glue blog.

If you’d like our team to run a data mesh workshop with you, please reach out to your AWS team.


About the authors


Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.

Dzenan Softic is a Senior Solutions Architect at AWS. He works with startups to help them define and execute their ideas. His main focus is in data engineering and infrastructure.

David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.
Vincent Gromakowski is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers’ analytics, NoSQL, and streaming challenges. He has strong expertise in distributed data processing engines and resource orchestration platforms.

Build a data sharing workflow with AWS Lake Formation for your data mesh

Post Syndicated from Jan Michael Go Tan original https://aws.amazon.com/blogs/big-data/build-a-data-sharing-workflow-with-aws-lake-formation-for-your-data-mesh/

A key benefit of a data mesh architecture is allowing different lines of business (LOBs) and organizational units to operate independently and offer their data as a product. This model not only allows organizations to scale, but also gives the end-to-end ownership of maintaining the product to data producers that are the domain experts of the data. This ownership entails maintaining the data pipelines, debugging ETL scripts, fixing data quality issues, and keeping the catalog entries up to date as the dataset evolves over time.

On the consumer side, teams can search the central catalog for relevant data products and request access. Access to the data is granted via the data sharing feature in AWS Lake Formation. As the number of data products grows and potentially more sensitive information is stored in an organization’s data lake, it’s important that the process and mechanism to request and grant access to specific data products are handled in a scalable and secure manner.

This post describes how to build a workflow engine that automates the data sharing process while including a separate approval mechanism for data products that are tagged as sensitive (for example, containing PII data). Both the workflow and approval mechanism are customizable and should be adapted to adhere to your company’s internal processes. In addition, we include an optional workflow UI to demonstrate how to integrate with the workflow engine. The UI is just one example of how the interaction works. In a typical large enterprise, you can also use ticketing systems to automatically trigger both the workflow and the approval process.

Solution overview

A typical data mesh architecture for analytics on AWS contains one central account that collates all the different data products from multiple producer accounts. Consumers can search the available data products in a single location. Sharing data products with consumers doesn’t actually make a separate copy; instead, it just creates a pointer to the catalog item. This means any updates that producers make to their products are automatically reflected in the central account as well as in all the consumer accounts.

Building on top of this foundation, the solution contains several components, as depicted in the following diagram:

The central account includes the following components:

  • AWS Glue – Used for Data Catalog purposes.
  • AWS Lake Formation – Used to secure access to the data as well as provide the data sharing capabilities that enable the data mesh architecture.
  • AWS Step Functions – The actual workflow is defined as a state machine. You can customize this to adhere to your organization’s approval requirements.
  • AWS Amplify – The workflow UI uses the Amplify framework to secure access. It also uses Amplify to host the React-based application. On the backend, the Amplify framework creates two Amazon Cognito components to support the security requirements:
    • User pools – Provide a user directory functionality.
    • Identity pools – Provide federated sign-in capabilities using Amazon Cognito user pools as the location of the user details. The identity pools vend temporary credentials so the workflow UI can access AWS Glue and Step Functions APIs.
  • AWS Lambda – Contains the application logic orchestrated by the Step Functions state machine. It also provides the necessary application logic when a producer approves or denies a request for access.
  • Amazon API Gateway – Provides the API for producers to accept and deny requests.

The producer account contains the following components (these mirror the resources created by the producer backend deployment later in this post):

  • Amazon SNS – An SNS topic where approval requests are published so the data product owner can approve or deny access
  • AWS IAM – The ProducerWorkflowRole role with a trust relationship to the central account, which allows publishing to the SNS topic

The consumer account contains the following components:

  • AWS Glue – Used for Data Catalog purposes.
  • AWS Lake Formation – After the data has been made available, consumers can grant access to its own users via Lake Formation.
  • AWS Resource Access Manager (AWS RAM) – If the grantee account is in the same organization as the grantor account, the shared resource is available immediately to the grantee. If the grantee account is not in the same organization, AWS RAM sends an invitation to the grantee account to accept or reject the resource grant. For more details about Lake Formation cross-account access, see Cross-Account Access: How It Works.

The solution is split into multiple steps:

  1. Deploy the central account backend, including the workflow engine and its associated components.
  2. Deploy the backend for the producer accounts. You can repeat this step multiple times depending on the number of producer accounts that you’re onboarding into the workflow engine.
  3. Deploy the optional workflow UI in the central account to interact with the central account backend.

Workflow overview

The following diagram illustrates the workflow. In this particular example, the state machine checks if the table or database (depending on what is being shared) has the pii_flag parameter and if it’s set to true. If both conditions are met, it sends an approval request to the producer’s SNS topic. Otherwise, it automatically shares the product with the requesting consumer.

This workflow is the core of the solution, and can be customized to fit your organization’s approval process. In addition, you can add custom parameters to databases, tables, or even columns to attach extra metadata to support the workflow logic.
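
As an illustration of how such a branch can be expressed with the AWS CDK (the state names and the JSONPath below are hypothetical, not the exact definition in the repository):

from aws_cdk import aws_stepfunctions as sfn

# Inside a Stack's __init__ (self is the stack). Pass states stand in for the real
# SNS publish (approval request) and Lake Formation sharing tasks.
send_approval_request = sfn.Pass(self, 'SendApprovalRequest')
share_data_product = sfn.Pass(self, 'ShareDataProduct')

# Branch on the pii_flag catalog parameter carried in the execution input.
definition = sfn.Choice(self, 'IsDataProductSensitive') \
    .when(sfn.Condition.string_equals('$.pii_flag', 'true'), send_approval_request) \
    .otherwise(share_data_product)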

Prerequisites

The following are the deployment requirements:

You can clone the workflow UI and AWS CDK scripts from the GitHub repository.

Deploy the central account backend

To deploy the backend for the central account, go to the root of the project after cloning the GitHub repository and enter the following code:

yarn deploy-central --profile <PROFILE_OF_CENTRAL_ACCOUNT>

This deploys the following:

  • IAM roles used by the Lambda functions and Step Functions state machine
  • Lambda functions
  • The Step Functions state machine (the workflow itself)
  • An API Gateway

When the deployment is complete, it generates a JSON file in the src/cfn-output.json location. This file is used by the UI deployment script to generate a scoped-down IAM policy and workflow UI application to locate the state machine that was created by the AWS CDK script.

The actual AWS CDK scripts for the central account deployment are in infra/central/. This also includes the Lambda functions (in the infra/central/functions/ folder) that are used by both the state machine and the API Gateway.

Lake Formation permissions

The following are the minimum required permissions that the central account data lake administrator needs to grant to the respective IAM roles for the backend to have access to the AWS Glue Data Catalog:

  • WorkflowLambdaTableDetails – Permissions: DESCRIBE on databases and tables. Grantable: N/A.
  • WorkflowLambdaShareCatalog – Permissions: SELECT and DESCRIBE on tables. Grantable: SELECT and DESCRIBE on tables.

Workflow catalog parameters

The workflow uses the following catalog parameters to provide its functionality.

  • Database / data_owner – (Required) The account ID of the producer account that owns the data products.
  • Database / data_owner_name – A readable friendly name that identifies the producer in the UI.
  • Database / pii_flag – A flag (true/false) that determines whether the data product requires approval (based on the example workflow).
  • Column / pii_flag – A flag (true/false) that determines whether the data product requires approval (based on the example workflow). This is only applicable if requesting table-level access.

You can use the UpdateDatabase and UpdateTable API operations to add parameters at the database and column level, respectively. Alternatively, you can use the AWS Glue commands in the AWS CLI to add the relevant parameters.

Use the AWS CLI to run the following command to check the current parameters in your database:

aws glue get-database --name <DATABASE_NAME> --profile <PROFILE_OF_CENTRAL_ACCOUNT>

You get the following response:

{
  "Database": {
    "Name": "<DATABASE_NAME>",
    "CreateTime": "<CREATION_TIME>",
    "CreateTableDefaultPermissions": [],
    "CatalogId": "<CATALOG_ID>"
  }
}

To update the database with the parameters indicated in the preceding table, we first create the input JSON file, which contains the parameters that we want to update the database with. For example, see the following code:

{
  "Name": "<DATABASE_NAME>",
  "Parameters": {
    "data_owner": "<AWS_ACCOUNT_ID_OF_OWNER>",
    "data_owner_name": "<AWS_ACCOUNT_NAME_OF_OWNER>",
    "pii_flag": "true"
  }
}

Run the following command to update the Data Catalog:

aws glue update-database --name <DATABASE_NAME> --database-input file://<FILE_NAME>.json --profile <PROFILE_OF_CENTRAL_ACCOUNT>

Deploy the producer account backend

To deploy the backend for your producer accounts, go to the root of the project and run the following command:

yarn deploy-producer --profile <PROFILE_OF_PRODUCER_ACCOUNT> --parameters centralMeshAccountId=<central_account_account_id>

This deploys the following:

  • An SNS topic where approval requests get published.
  • The ProducerWorkflowRole IAM role with a trust relationship to the central account. This role allows publishing to the previously created SNS topic.

You can run this deployment script multiple times, each time pointing to a different producer account that you want to participate in the workflow.

To receive notification emails, subscribe your email in the SNS topic that the deployment script created. For example, our topic is called DataLakeSharingApproval. To get the full ARN, you can either go to the Amazon Simple Notification Service console or run the following command to list all the topics and get the ARN for DataLakeSharingApproval:

aws sns list-topics --profile <PROFILE_OF_PRODUCER_ACCOUNT>

After you have the ARN, you can subscribe your email by running the following command:

aws sns subscribe --topic-arn <TOPIC_ARN> --protocol email --notification-endpoint <EMAIL_ADDRESS> --profile <PROFILE_OF_PRODUCER_ACCOUNT>

You then receive a confirmation email at the address that you subscribed. Choose Confirm subscription to receive notifications from this SNS topic.

Deploy the workflow UI

The workflow UI is designed to be deployed in the central account where the central data catalog is located.

To start the deployment, enter the following command:

yarn deploy-ui

This deploys the following:

  • Amazon Cognito user pool and identity pool
  • React-based application to interact with the catalog and request data access

The deployment command prompts you for the following information:

  • Project information – Use the default values.
  • AWS authentication – Use your profile for the central account. Amplify uses this profile to deploy the backend resources.
  • UI authentication – Use the default configuration and your username. Choose No, I am done when asked to configure advanced settings.
  • UI hosting – Use hosting with the Amplify console and choose manual deployment.

The script gives a summary of what is deployed. Entering Y triggers the resources to be deployed in the backend. The prompt looks similar to the following screenshot:

When the deployment is complete, the remaining prompt is for the initial user information such as user name and email. A temporary password is automatically generated and sent to the email provided. The user is required to change the password after the first login.

The deployment script grants IAM permissions to the user via an inline policy attached to the Amazon Cognito authenticated IAM role:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "glue:GetDatabase",
            "glue:GetTables",
            "glue:GetDatabases",
            "glue:GetTable"
         ],
         "Resource":"*"
      },
      {
         "Effect":"Allow",
         "Action":[
            "states:ListExecutions",
            "states:StartExecution"
         ],
         "Resource":[
            "arn:aws:states:<REGION>:<AWS_ACCOUNT_ID>:stateMachine:<STATE_MACHINE_NAME>"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "states:DescribeExecution"
         ],
         "Resource":[
            "arn:aws:states:<REGION>:<AWS_ACCOUNT_ID>:execution:<STATE_MACHINE_NAME>:*"
         ]
      }
   ]
}

The last remaining step is to grant Lake Formation permissions (DESCRIBE for both databases and tables) to the authenticated IAM role associated with the Amazon Cognito identity pool. You can find the IAM role by running the following command:

cat amplify/team-provider-info.json

The IAM role name is in the AuthRoleName property under the awscloudformation key. After you grant the required permissions, you can use the URL provided in your browser to open the workflow UI.
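
As an alternative to the Lake Formation console, the following boto3 sketch grants those DESCRIBE permissions (the role ARN and database name are placeholders; repeat or adjust for the databases and tables you want the UI to list):

import boto3

lakeformation = boto3.client('lakeformation')

# Placeholder ARN; the role name comes from the AuthRoleName property mentioned above.
auth_role_arn = 'arn:aws:iam::<AWS_ACCOUNT_ID>:role/<AUTH_ROLE_NAME>'

# DESCRIBE on a database the UI should be able to list.
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': auth_role_arn},
    Resource={'Database': {'Name': '<DATABASE_NAME>'}},
    Permissions=['DESCRIBE'])

# DESCRIBE on all tables in that database.
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': auth_role_arn},
    Resource={'Table': {'DatabaseName': '<DATABASE_NAME>', 'TableWildcard': {}}},
    Permissions=['DESCRIBE'])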

Your temporary password is emailed to you so you can complete the initial login, after which you’re asked to change your password.

The first page after logging in is the list of databases that consumers can access.

Choose Request Access to see the database details and the list of tables.

Choose Request Per Table Access and see more details at the table level.

Going back to the previous page, we request database-level access by entering the consumer account ID that receives the share request.

Because this database has been tagged with a pii_flag, the workflow needs to send an approval request to the product owner. To receive this approval request email, the product owner’s email needs to be subscribed to the DataLakeSharingApproval SNS topic in the producer account. The details should look similar to the following screenshot:

The email looks similar to the following screenshot:

The product owner chooses the Approve link to trigger the Step Functions state machine to continue running and share the catalog item to the consumer account.

For this example, the consumer account is not part of an organization, so the admin of the consumer account has to go to AWS RAM and accept the invitation.
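
As an alternative to the console, the following boto3 sketch accepts the pending invitation from the consumer account:

import boto3

ram = boto3.client('ram')

# List invitations shared with this (consumer) account and accept any pending ones.
invitations = ram.get_resource_share_invitations()['resourceShareInvitations']
for invitation in invitations:
    if invitation['status'] == 'PENDING':
        ram.accept_resource_share_invitation(
            resourceShareInvitationArn=invitation['resourceShareInvitationArn'])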

After the resource share is accepted, the shared database appears in the consumer account’s catalog.

Clean up

If you no longer need to use this solution, use the provided cleanup scripts to remove the deployed resources.

Producer account

To remove the deployed resources in producer accounts, run the following command for each producer account that you deployed in:

yarn clean-producer --profile <PROFILE_OF_PRODUCER_ACCOUNT>

Central account

Run the following command to remove the workflow backend in the central account:

yarn clean-central --profile <PROFILE_OF_CENTRAL_ACCOUNT>

Workflow UI

The cleanup script for the workflow UI relies on an Amplify CLI command to initiate the teardown of the deployed resources. Additionally, you can use a custom script to remove the inline policy in the authenticated IAM role used by Amazon Cognito so that Amplify can fully clean up all the deployed resources. Run the following command to trigger the cleanup:

yarn clean-ui

This command doesn’t require the profile parameter because it uses the existing Amplify configuration to infer where the resources are deployed and which profile was used.

Conclusion

This post demonstrated how to build a workflow engine to automate an organization’s approval process to gain access to data products with varying degrees of sensitivity. Using a workflow engine enables data sharing in a self-service manner while codifying your organization’s internal processes to be able to safely scale as more data products and teams get onboarded.

The provided workflow UI demonstrated one possible integration scenario. Other possible integration scenarios include integration with your organization’s ticketing system to trigger the workflow as well as receive and respond to approval requests, or integration with business chat applications to further shorten the approval cycle.

Lastly, a high degree of customization is possible with the demonstrated approach. Organizations have complete control over the workflow, how data product sensitivity levels are defined, what gets auto-approved and what needs further approvals, the hierarchy of approvals (such as a single approver or multiple approvers), and how the approvals get delivered and acted upon. You can take advantage of this flexibility to automate your company’s processes and help your organization safely accelerate toward becoming data-driven.


About the Author

Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.